Current predictor: Aerobic biodegradation

Last update: October 22, 2021

About


Dataset:
The regression model was built on more than 4,000 data points. The inputs are SMILES strings, guideline (e.g., OECD 301F), principle (e.g., closed respirometer), and reliability (e.g., 1 or 2). The output is the biodegradation percentage.
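
As a rough illustration of how structural and test-condition inputs might be combined into a single feature row, the sketch below concatenates fingerprint bits with one-hot encoded metadata. The one-hot encoding, the category lists, and the names `one_hot` and `build_features` are assumptions made for this example; the source does not state how the categorical inputs were encoded.

```python
# Illustrative category vocabularies (assumed; the dataset's actual values
# are not listed in the source beyond the examples it gives).
GUIDELINES = ["OECD 301B", "OECD 301C", "OECD 301D", "OECD 301F"]
PRINCIPLES = ["closed respirometer", "CO2 evolution", "DOC die away"]
RELIABILITIES = [1, 2]


def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in vocabulary:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == item else 0 for item in vocabulary]


def build_features(fingerprint_bits, guideline, principle, reliability):
    """Concatenate fingerprint bits with one-hot encoded test metadata."""
    return (
        list(fingerprint_bits)
        + one_hot(guideline, GUIDELINES)
        + one_hot(principle, PRINCIPLES)
        + one_hot(reliability, RELIABILITIES)
    )


# A toy 3-bit "fingerprint" stands in for the real one to keep the example short.
row = build_features([0, 1, 1], "OECD 301F", "closed respirometer", 1)
```

The resulting row would be paired with the measured biodegradation percentage as the regression target.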

ML algorithms:
Twelve ML algorithms were compared to find the best performer, including ridge, lasso, k-nearest neighbors, support vector regression, decision tree, random forest, extra trees, bagging, adaptive boosting, gradient boosting, and XGBoost.

XGBoost performed best.

Chemical representation:
MACCS fingerprints

Other notes:
Bayesian optimization was used to tune the model hyperparameters. The model applicability domain was determined from chemical similarity, calculated as the Tanimoto index between fingerprints.
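
The Tanimoto index of two fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below shows one way a similarity-based applicability-domain check could work. The nearest-neighbor rule, the 0.5 threshold, and the function names are assumptions for illustration, not the tool's documented procedure.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto index of two fingerprints given as collections of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def in_applicability_domain(query_bits, training_fingerprints, threshold=0.5):
    """Return (in_domain, best_similarity) for a query fingerprint.

    Assumed rule: the query is in-domain when its most similar training
    compound reaches the (hypothetical) similarity threshold.
    """
    best = max(
        (tanimoto(query_bits, fp) for fp in training_fingerprints),
        default=0.0,
    )
    return best >= threshold, best


# Toy fingerprints written as sets of on-bit positions.
training = [{1, 2, 3, 4}, {10, 11, 12}]
inside, similarity = in_applicability_domain({1, 2, 3}, training)
```

A prediction for a query that falls below the threshold would be flagged as less reliable, since no sufficiently similar compound was seen during training.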