This document provides instructions on how to use the polymon command-line interface (CLI).

The polymon CLI has three main modes:

Train: Train a machine learning or deep learning model.
Merge: Merge two datasets.
Predict: Predict labels for a given dataset.

Train

This command is used to train a model.

Usage:

polymon train [OPTIONS]

Arguments:

Supported Arguments in `train`	Type	Default	Description
`--raw-csv`	str	`database/database.csv`	Path to the raw csv file.
`--sources`	str (multiple)	`['Kaggle']`	Sources to use for training.
`--tag`	str	`debug`	Tag to use for training.
`--labels`	str (multiple)	Required	Labels to use for training.
`--feature-names`	str (multiple)	`['rdkit2d']`	Feature names to use for training.
`--n-trials`	int	`None`	Number of trials to run for hyperparameter optimization.
`--out-dir`	str	`./results`	Path to the output directory.
`--hparams-from`	str	`None`	Path to the hparams file. Allowed formats: .json, .pt, .pkl.
`--n-fold`	int	`1`	Number of folds to use for cross-validation.
`--split-mode`	str	`random`	Mode to split the data into training, validation, and test sets.
`--seed`	int	`42`	Seed to use for training.
`--remove-hydrogens`	bool	`False`	Whether to remove hydrogens from the molecules.
`--descriptors`	str (multiple)	`None`	Descriptors to use for training. For ML models, this must be specified.
`--model`	str	`rf`	Model to use for training.
`--hidden-dim`	int	`32`	Hidden dimension of the model.
`--num-layers`	int	`3`	Number of layers of the model.
`--batch-size`	int	`128`	Batch size to use for training.
`--lr`	float	`1e-3`	Learning rate to use for training.
`--num-epochs`	int	`2500`	Number of epochs to use for training.
`--early-stopping-patience`	int	`250`	Number of epochs to wait before early stopping.
`--device`	str	`cuda`	Device to use for training.
`--run-production`	bool	`False`	Whether to run the training in production mode, which means train:val:test splits will be forced to 0.95:0.05:0.0.
`--finetune`	bool	`False`	Whether to finetune the model.
`--finetune-csv-path`	str	`None`	Path to the csv file to finetune the model on.
`--pretrained-model`	str	`None`	Path to the pretrained model.
`--n-estimator`	int	`1`	Number of estimators to use for training.
`--additional-features`	str (multiple)	`None`	Additional features to use for training.
`--skip-train`	bool	`False`	Whether to skip the training step.
`--low-fidelity-model`	str	`None`	Path to the low fidelity model.
`--estimator-name`	str	`None`	Name of the estimator to give base predictions.
`--emb-model`	str	`None`	Name of the embedding model for base graph embeddings.
`--ensemble-type`	str	`voting`	Type of ensemble to use for training.
`--train-residual`	bool	`False`	Whether to train the residual of the model.
`--normalizer-type`	str	`normalizer`	Type of normalizer to use for training. Choices: `normalizer`, `log_normalizer`, `none`.
`--augmentation`	bool	`False`	Whether to use data augmentation.

Merge

This command is used to merge two datasets.

Usage:

polymon merge [OPTIONS]

Arguments:

Supported Arguments in `merge`	Type	Default	Description
`--sources`	str (multiple)	Required	Sources to merge.
`--label`	str	Required	Label to merge.
`--hparams-from`	str	Required	Path to the hparams file.
`--acquisition`	str	Required	Acquisition function to use for merging. Choices: `epig`, `uncertainty`, `difference`.
`--sample-size`	int	`20`	Sample size to use for merging.
`--uncertainty-threshold`	float	`0.1`	Uncertainty threshold to use for merging.
`--difference-threshold`	float	`0.1`	Difference threshold to use for merging.
`--target-size`	int	`1000`	Target size to use for merging.
`--base-csv`	str	`None`	Path to the base csv file.

Predict

This command is used to predict labels for a given dataset.

Usage:

polymon predict [OPTIONS]

Arguments:

Argument	Type	Default	Description
`--model-path`	str	Required	Path to the model.
`--csv-path`	str	Required	Path to the csv file.
`--smiles-column`	str	Required	Name of the smiles column.