Current predictor: Aerobic biodegradation

Last update: October 22, 2021

About


Dataset:
The classification model was built on more than 3,000 data points, with SMILES strings as the inputs and a binary class (0 or 1) as the output. Only ready biodegradation data with a test duration of 28 days and a test principle of closed bottle test, closed respirometer, or CO2 evolution were considered.

ML algorithms:
A total of 14 ML algorithms were examined to identify the best-performing one: K nearest neighbors, Linear support vector machine (SVM), Radial basis function SVM (RBF SVM), Gaussian process, Neural net multi-layer perceptron classifier, Decision tree, Random forest, Bagging, Adaptive boosting, Gradient boosting, XGBoost, Extra tree, Gaussian Naive Bayes, and Quadratic discriminant analysis.

XGBoost was found to perform best among them.
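
As an illustration of how such a comparison can be set up, here is a minimal sketch using scikit-learn. It is not the code behind this predictor. The synthetic data, the 5-fold cross-validation, the accuracy metric, and the function name compare_classifiers are all assumptions made for the example, and only four of the 14 algorithms are included (XGBoost is left out because it lives in a separate package).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, cv=5):
    """Return {name: mean cross-validated accuracy} for a few candidate models."""
    candidates = {
        "K nearest neighbors": KNeighborsClassifier(),
        "Decision tree": DecisionTreeClassifier(random_state=0),
        "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
        "Gradient boosting": GradientBoostingClassifier(random_state=0),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in candidates.items()
    }


# Synthetic stand-in for fingerprint data (assumption: 200 samples x 167 bits).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 167))
# Hypothetical labels that depend on the first five bits only.
y = (X[:, :5].sum(axis=1) >= 3).astype(int)

scores = compare_classifiers(X, y)
best = max(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
print("Best on this synthetic data:", best)
```

The ranking on synthetic data says nothing about the ranking on the real biodegradation dataset; the point is only the pattern of scoring every candidate under the same cross-validation split scheme.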

Chemical representation:
MACCS fingerprints
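
For readers who want to reproduce this representation, the sketch below shows one common way to turn a SMILES string into a MACCS fingerprint with RDKit. Whether this predictor uses RDKit is not stated here, so treat the library choice and the helper name maccs_bits as assumptions.

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys


def maccs_bits(smiles):
    """Convert a SMILES string into a list of 0/1 MACCS key values."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None when the SMILES string cannot be parsed.
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    fingerprint = MACCSkeys.GenMACCSKeys(mol)
    # RDKit's MACCS bit vector has 167 positions; position 0 is unused.
    return [int(bit) for bit in fingerprint.ToBitString()]


bits = maccs_bits("CCO")  # ethanol
print(len(bits), "bits,", sum(bits), "set")
```

Each molecule then becomes a fixed-length binary vector that can be passed directly to a classifier.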

Other notes:
Data balancing was performed because the two classes were not well balanced. Bayesian optimization was used to tune the model hyperparameters. To determine the model's applicability domain, chemical similarity was calculated from the fingerprints using the Tanimoto index.
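
The applicability-domain check can be sketched as follows. For two fingerprints with a and b bits set and c bits set in both, the Tanimoto index is c / (a + b - c). The code represents each fingerprint as the set of its on-bit positions. The rule "in domain if the maximum similarity to any training compound reaches a threshold", the threshold value of 0.5, and the function names are assumptions for illustration; the criterion actually used by this predictor is not specified here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto index of two fingerprints given as collections of on-bit positions."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / union


def max_similarity(query_fp, training_fps):
    """Highest Tanimoto index between the query and any training fingerprint."""
    return max(tanimoto(query_fp, fp) for fp in training_fps)


def in_applicability_domain(query_fp, training_fps, threshold=0.5):
    """Assumed rule: the query is in domain if it resembles some training compound."""
    return max_similarity(query_fp, training_fps) >= threshold


# Toy fingerprints (hypothetical bit positions, not real MACCS keys).
training = [{1, 2, 3, 4}, {2, 3, 5}]
similar_query = {1, 2, 3}
distant_query = {9, 10}

print(max_similarity(similar_query, training))
print(in_applicability_domain(similar_query, training))
print(in_applicability_domain(distant_query, training))
```

A prediction for a query that falls outside the domain would typically be flagged as less reliable rather than suppressed.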