This document describes how to use the ``polymon`` command-line interface (CLI).

The ``polymon`` CLI has three main modes:

- `Train`_: Train a machine learning or deep learning model.
- `Merge`_: Merge two datasets.
- `Predict`_: Predict labels for a given dataset.

Train
============

The ``train`` command trains a machine learning or deep learning model.

**Usage:**

.. code-block:: bash

   polymon train [OPTIONS]

**Arguments:**

.. list-table::
   :widths: 30 10 20 50
   :header-rows: 1

   * - Supported Arguments in ``train``
     - Type
     - Default
     - Description
   * - ``--raw-csv``
     - str
     - ``database/database.csv``
     - Path to the raw CSV file.
   * - ``--sources``
     - str (multiple)
     - ``['Kaggle']``
     - Sources to use for training.
   * - ``--tag``
     - str
     - ``debug``
     - Tag to use for training.
   * - ``--labels``
     - str (multiple)
     - **Required**
     - Labels to use for training.
   * - ``--feature-names``
     - str (multiple)
     - ``['rdkit2d']``
     - Feature names to use for training.
   * - ``--n-trials``
     - int
     - ``None``
     - Number of trials to run for hyperparameter optimization.
   * - ``--out-dir``
     - str
     - ``./results``
     - Path to the output directory.
   * - ``--hparams-from``
     - str
     - ``None``
     - Path to the hyperparameter file. Allowed formats: ``.json``, ``.pt``, ``.pkl``.
   * - ``--n-fold``
     - int
     - ``1``
     - Number of folds to use for cross-validation.
   * - ``--split-mode``
     - str
     - ``random``
     - Mode to split the data into training, validation, and test sets.
   * - ``--seed``
     - int
     - ``42``
     - Seed to use for training.
   * - ``--remove-hydrogens``
     - bool
     - ``False``
     - Whether to remove hydrogens from the molecules.
   * - ``--descriptors``
     - str (multiple)
     - ``None``
     - Descriptors to use for training. Required for ML models.
   * - ``--model``
     - str
     - ``rf``
     - Model to use for training.
   * - ``--hidden-dim``
     - int
     - ``32``
     - Hidden dimension of the model.
   * - ``--num-layers``
     - int
     - ``3``
     - Number of layers of the model.
   * - ``--batch-size``
     - int
     - ``128``
     - Batch size to use for training.
   * - ``--lr``
     - float
     - ``1e-3``
     - Learning rate to use for training.
   * - ``--num-epochs``
     - int
     - ``2500``
     - Number of epochs to use for training.
   * - ``--early-stopping-patience``
     - int
     - ``250``
     - Number of epochs without improvement to wait before stopping early.
   * - ``--device``
     - str
     - ``cuda``
     - Device to use for training.
   * - ``--run-production``
     - bool
     - ``False``
     - Whether to run training in production mode, which forces the train:val:test split to 0.95:0.05:0.0.
   * - ``--finetune``
     - bool
     - ``False``
     - Whether to fine-tune the model.
   * - ``--finetune-csv-path``
     - str
     - ``None``
     - Path to the CSV file to fine-tune the model on.
   * - ``--pretrained-model``
     - str
     - ``None``
     - Path to the pretrained model.
   * - ``--n-estimator``
     - int
     - ``1``
     - Number of estimators to use for training.
   * - ``--additional-features``
     - str (multiple)
     - ``None``
     - Additional features to use for training.
   * - ``--skip-train``
     - bool
     - ``False``
     - Whether to skip the training step.
   * - ``--low-fidelity-model``
     - str
     - ``None``
     - Path to the low-fidelity model.
   * - ``--estimator-name``
     - str
     - ``None``
     - Name of the estimator to give base predictions.
   * - ``--emb-model``
     - str
     - ``None``
     - Name of the embedding model for base graph embeddings.
   * - ``--ensemble-type``
     - str
     - ``voting``
     - Type of ensemble to use for training.
   * - ``--train-residual``
     - bool
     - ``False``
     - Whether to train the residual of the model.
   * - ``--normalizer-type``
     - str
     - ``normalizer``
     - Type of normalizer to use for training. Choices: ``normalizer``, ``log_normalizer``, ``none``.
   * - ``--augmentation``
     - bool
     - ``False``
     - Whether to use data augmentation.
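
As an illustration, the following sketch assembles a ``train`` invocation for the default ``rf`` model with 5-fold cross-validation, using only options from the table above. The label ``Tg`` and the descriptor ``rdkit2d`` are placeholder values assumed for this example, not values confirmed by ``polymon``. The sketch also assumes that each option takes its value as the next space-separated token. The command is printed rather than executed so that it can be checked first.

.. code-block:: bash

   # Placeholder values (assumptions for illustration only).
   LABEL="Tg"             # hypothetical label name; use a label from your data
   DESCRIPTOR="rdkit2d"   # hypothetical descriptor name

   # Store the command in the positional parameters so it can be
   # inspected before it is run.
   set -- polymon train \
       --raw-csv database/database.csv \
       --labels "$LABEL" \
       --model rf \
       --descriptors "$DESCRIPTOR" \
       --n-fold 5 \
       --seed 42 \
       --out-dir ./results

   # Print the assembled command.
   echo "$*"

   # Uncomment to execute it.
   # "$@"

Once the placeholders match your data, uncomment the last line to start training.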

Merge
============

The ``merge`` command merges two datasets.

**Usage:**

.. code-block:: bash

   polymon merge [OPTIONS]

**Arguments:**

.. list-table::
   :widths: 30 10 20 50
   :header-rows: 1

   * - Supported Arguments in ``merge``
     - Type
     - Default
     - Description
   * - ``--sources``
     - str (multiple)
     - **Required**
     - Sources to merge.
   * - ``--label``
     - str
     - **Required**
     - Label to merge.
   * - ``--hparams-from``
     - str
     - **Required**
     - Path to the hyperparameter file.
   * - ``--acquisition``
     - str
     - **Required**
     - Acquisition function to use for merging. Choices: ``epig``, ``uncertainty``, ``difference``.
   * - ``--sample-size``
     - int
     - ``20``
     - Sample size to use for merging.
   * - ``--uncertainty-threshold``
     - float
     - ``0.1``
     - Uncertainty threshold to use for merging.
   * - ``--difference-threshold``
     - float
     - ``0.1``
     - Difference threshold to use for merging.
   * - ``--target-size``
     - int
     - ``1000``
     - Target size to use for merging.
   * - ``--base-csv``
     - str
     - ``None``
     - Path to the base CSV file.
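
As an illustration, the following sketch assembles a ``merge`` invocation that supplies the four required options and spells out the documented defaults for ``--sample-size`` and ``--target-size``. The second source name ``internal``, the label ``Tg``, and the hyperparameter file path are placeholders assumed for this example. Passing several sources as space-separated values after ``--sources`` is also an assumption about how the CLI parses multiple values. The command is printed rather than executed.

.. code-block:: bash

   # Placeholder values (assumptions for illustration only).
   LABEL="Tg"                         # hypothetical label name
   HPARAMS="./results/hparams.json"   # hypothetical hyperparameter file

   # Store the command in the positional parameters for inspection.
   # "internal" is a hypothetical second source name.
   set -- polymon merge \
       --sources Kaggle internal \
       --label "$LABEL" \
       --hparams-from "$HPARAMS" \
       --acquisition uncertainty \
       --sample-size 20 \
       --target-size 1000

   # Print the assembled command.
   echo "$*"

   # Uncomment to execute it.
   # "$@"

The ``--acquisition`` value must be one of the documented choices: ``epig``, ``uncertainty``, or ``difference``.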

Predict
============

The ``predict`` command predicts labels for a given dataset.

**Usage:**

.. code-block:: bash

   polymon predict [OPTIONS]

**Arguments:**

.. list-table::
   :widths: 30 10 20 50
   :header-rows: 1

   * - Supported Arguments in ``predict``
     - Type
     - Default
     - Description
   * - ``--model-path``
     - str
     - **Required**
     - Path to the model.
   * - ``--csv-path``
     - str
     - **Required**
     - Path to the CSV file.
   * - ``--smiles-column``
     - str
     - **Required**
     - Name of the column containing SMILES strings.
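
As an illustration, the following sketch assembles a ``predict`` invocation with the three required options. The model path, the CSV path, and the column name ``SMILES`` are placeholders assumed for this example. The command is printed rather than executed so that it can be checked first.

.. code-block:: bash

   # Placeholder values (assumptions for illustration only).
   MODEL_PATH="./results/model.pt"   # hypothetical path to a trained model
   CSV_PATH="./data/test.csv"        # hypothetical input file
   SMILES_COLUMN="SMILES"            # hypothetical column name

   # Store the command in the positional parameters for inspection.
   set -- polymon predict \
       --model-path "$MODEL_PATH" \
       --csv-path "$CSV_PATH" \
       --smiles-column "$SMILES_COLUMN"

   # Print the assembled command.
   echo "$*"

   # Uncomment to execute it.
   # "$@"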