This document describes how to use the polymon command-line interface (CLI).

The polymon CLI has three main modes:

  • Train: Train a machine learning or deep learning model.

  • Merge: Merge two datasets.

  • Predict: Predict labels for a given dataset.

Train

This command is used to train a model.

Usage:

polymon train [OPTIONS]

Arguments:

Supported arguments in train:

| Argument | Type | Default | Description |
|---|---|---|---|
| --raw-csv | str | database/database.csv | Path to the raw csv file. |
| --sources | str (multiple) | ['Kaggle'] | Sources to use for training. |
| --tag | str | debug | Tag to use for training. |
| --labels | str (multiple) | Required | Labels to use for training. |
| --feature-names | str (multiple) | ['rdkit2d'] | Feature names to use for training. |
| --n-trials | int | None | Number of trials to run for hyperparameter optimization. |
| --out-dir | str | ./results | Path to the output directory. |
| --hparams-from | str | None | Path to the hparams file. Allowed formats: .json, .pt, .pkl. |
| --n-fold | int | 1 | Number of folds to use for cross-validation. |
| --split-mode | str | random | Mode to split the data into training, validation, and test sets. |
| --seed | int | 42 | Seed to use for training. |
| --remove-hydrogens | bool | False | Whether to remove hydrogens from the molecules. |
| --descriptors | str (multiple) | None | Descriptors to use for training. For ML models, this must be specified. |
| --model | str | rf | Model to use for training. |
| --hidden-dim | int | 32 | Hidden dimension of the model. |
| --num-layers | int | 3 | Number of layers of the model. |
| --batch-size | int | 128 | Batch size to use for training. |
| --lr | float | 1e-3 | Learning rate to use for training. |
| --num-epochs | int | 2500 | Number of epochs to use for training. |
| --early-stopping-patience | int | 250 | Number of epochs to wait before early stopping. |
| --device | str | cuda | Device to use for training. |
| --run-production | bool | False | Whether to run the training in production mode, which means train:val:test splits will be forced to 0.95:0.05:0.0. |
| --finetune | bool | False | Whether to finetune the model. |
| --finetune-csv-path | str | None | Path to the csv file to finetune the model on. |
| --pretrained-model | str | None | Path to the pretrained model. |
| --n-estimator | int | 1 | Number of estimators to use for training. |
| --additional-features | str (multiple) | None | Additional features to use for training. |
| --skip-train | bool | False | Whether to skip the training step. |
| --low-fidelity-model | str | None | Path to the low fidelity model. |
| --estimator-name | str | None | Name of the estimator to give base predictions. |
| --emb-model | str | None | Name of the embedding model for base graph embeddings. |
| --ensemble-type | str | voting | Type of ensemble to use for training. |
| --train-residual | bool | False | Whether to train the residual of the model. |
| --normalizer-type | str | normalizer | Type of normalizer to use for training. Choices: normalizer, log_normalizer, none. |
| --augmentation | bool | False | Whether to use data augmentation. |
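As an illustration of how the train options combine, the sketch below assembles one possible invocation and prints it without running it. The flag names and the defaults (database/database.csv, rf, 42, ./results) come from the option list above. The label name Tg is hypothetical, and rdkit2d is borrowed from the --feature-names default and only assumed to be a valid value for --descriptors. Boolean flags are left out because this document does not state how they are passed.

```shell
# Dry-run sketch: build the command piece by piece and print it for review.
TRAIN_CMD="polymon train"
TRAIN_CMD="$TRAIN_CMD --raw-csv database/database.csv"  # documented default path
TRAIN_CMD="$TRAIN_CMD --labels Tg"                       # hypothetical label name
TRAIN_CMD="$TRAIN_CMD --model rf"                        # documented default model
TRAIN_CMD="$TRAIN_CMD --descriptors rdkit2d"             # assumed descriptor name
TRAIN_CMD="$TRAIN_CMD --n-fold 5"                        # 5-fold cross-validation
TRAIN_CMD="$TRAIN_CMD --seed 42"                         # documented default seed
TRAIN_CMD="$TRAIN_CMD --out-dir ./results"               # documented default output dir

# Print the assembled command so it can be checked before use.
echo "$TRAIN_CMD"

# To execute once polymon is installed, run the variable unquoted so the
# shell splits it into arguments (none of the values contain spaces):
# $TRAIN_CMD
```

Building the command in a variable keeps each option on its own commented line and makes it easy to log the exact invocation next to the results.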

Merge

This command is used to merge two datasets.

Usage:

polymon merge [OPTIONS]

Arguments:

Supported arguments in merge:

| Argument | Type | Default | Description |
|---|---|---|---|
| --sources | str (multiple) | Required | Sources to merge. |
| --label | str | Required | Label to merge. |
| --hparams-from | str | Required | Path to the hparams file. |
| --acquisition | str | Required | Acquisition function to use for merging. Choices: epig, uncertainty, difference. |
| --sample-size | int | 20 | Sample size to use for merging. |
| --uncertainty-threshold | float | 0.1 | Uncertainty threshold to use for merging. |
| --difference-threshold | float | 0.1 | Difference threshold to use for merging. |
| --target-size | int | 1000 | Target size to use for merging. |
| --base-csv | str | None | Path to the base csv file. |
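A possible merge invocation, assembled and printed as a dry run, might look like the sketch below. It supplies the four required options and sets the documented defaults explicitly. Several values are assumptions: InHouse is a hypothetical second source name, Tg is a hypothetical label, ./results/hparams.json is a hypothetical path, and passing multiple sources as space-separated values is assumed rather than confirmed by this document.

```shell
# Dry-run sketch: build the merge command and print it for review.
MERGE_CMD="polymon merge"
MERGE_CMD="$MERGE_CMD --sources Kaggle InHouse"               # InHouse is hypothetical; space-separated form is assumed
MERGE_CMD="$MERGE_CMD --label Tg"                             # hypothetical label name
MERGE_CMD="$MERGE_CMD --hparams-from ./results/hparams.json"  # hypothetical path
MERGE_CMD="$MERGE_CMD --acquisition uncertainty"              # one of: epig, uncertainty, difference
MERGE_CMD="$MERGE_CMD --uncertainty-threshold 0.1"            # documented default
MERGE_CMD="$MERGE_CMD --sample-size 20"                       # documented default
MERGE_CMD="$MERGE_CMD --target-size 1000"                     # documented default

# Print the assembled command so it can be checked before use.
echo "$MERGE_CMD"

# To execute once polymon is installed (values contain no spaces):
# $MERGE_CMD
```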

Predict

This command is used to predict labels for a given dataset.

Usage:

polymon predict [OPTIONS]

Arguments:

| Argument | Type | Default | Description |
|---|---|---|---|
| --model-path | str | Required | Path to the model. |
| --csv-path | str | Required | Path to the csv file. |
| --smiles-column | str | Required | Name of the smiles column. |
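All three predict options are required, so a minimal invocation has to supply each of them. The sketch below builds such a command and prints it as a dry run. The model path, the CSV path, and the column name SMILES are hypothetical placeholders, not values defined by polymon.

```shell
# Dry-run sketch: build the predict command and print it for review.
PREDICT_CMD="polymon predict"
PREDICT_CMD="$PREDICT_CMD --model-path ./results/model.pt"    # hypothetical model file
PREDICT_CMD="$PREDICT_CMD --csv-path ./data/candidates.csv"   # hypothetical input file
PREDICT_CMD="$PREDICT_CMD --smiles-column SMILES"             # hypothetical column name

# Print the assembled command so it can be checked before use.
echo "$PREDICT_CMD"

# To execute once polymon is installed (values contain no spaces):
# $PREDICT_CMD
```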