Advancing Biodegradability Assessment Through AI

Models

These are the machine learning models built at Aropha so far. They mainly cover the biodegradation of organic contaminants in aquatic environments, and new models are added continuously.

Aerobic biodegradation in water

Aerobic biodegradation -- regression

Predicts the continuous biodegradation percentage after 28 days of incubation, based on more than 4,000 data points for ready biodegradation in water. The data cover multiple standard guidelines, such as OECD 301 and EU method C.

Aerobic biodegradation in water

Aerobic biodegradation -- classification

Predicts whether an organic contaminant is readily biodegradable (that is, whether it passes the 60% degradation level within 28 days), based on more than 3,000 data points for ready biodegradation in water. The data cover multiple standard guidelines and test principles.
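The pass criterion described above can be sketched as a simple thresholding rule. This is a minimal illustration, not the code used to build the model: the function name is hypothetical, and it assumes that the pass level is inclusive (60% or more) and that class 1 means "readily biodegradable".

```python
import math

# Assumption: a compound passes when its 28-day degradation reaches 60 % (inclusive),
# and class 1 stands for "readily biodegradable".
PASS_LEVEL_PERCENT = 60.0


def ready_biodegradability_class(degradation_percent: float,
                                 pass_level: float = PASS_LEVEL_PERCENT) -> int:
    """Return 1 if the 28-day degradation reaches the pass level, otherwise 0."""
    if math.isnan(degradation_percent):
        raise ValueError("degradation_percent must be a number, got NaN")
    return 1 if degradation_percent >= pass_level else 0


# Placeholder 28-day degradation percentages, not measured data.
examples = [12.5, 59.9, 60.0, 87.0]
labels = [ready_biodegradability_class(p) for p in examples]
print(labels)  # [0, 0, 1, 1]
```

The same rule could be applied to the output of the regression model to turn a predicted percentage into a pass/fail call.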

Datasets

The datasets used to develop the models above.

Aerobic biodegradation regression

Contains over 4,000 data points. The inputs are the SMILES string, guideline (e.g., OECD 301F), principle (e.g., closed respirometer), and reliability (e.g., 1 or 2); the output is the biodegradation percentage.
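To make the input/output layout concrete, the sketch below models one dataset row and one-hot encodes the categorical inputs. It is only an illustration under assumptions: the field names, the `BiodegRecord` class, and the three rows are placeholders invented for this example, and the dataset's real column names and encoding may differ. Converting the SMILES string itself into numeric features is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiodegRecord:
    """One dataset row (hypothetical field names)."""
    smiles: str                 # input: chemical structure
    guideline: str              # input: e.g. "OECD 301F"
    principle: str              # input: e.g. "closed respirometer"
    reliability: int            # input: e.g. 1 or 2
    degradation_percent: float  # output: biodegradation percentage


# Placeholder rows for illustration only; the values are not taken from the dataset.
records = [
    BiodegRecord("CCO", "OECD 301F", "closed respirometer", 1, 80.0),
    BiodegRecord("c1ccccc1", "OECD 301D", "closed bottle test", 2, 30.0),
    BiodegRecord("CC(=O)O", "OECD 301B", "CO2 evolution", 1, 90.0),
]


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over the known categories."""
    return [1 if value == category else 0 for category in categories]


# Fix the category order once so every row is encoded the same way.
guidelines = sorted({r.guideline for r in records})
principles = sorted({r.principle for r in records})


def categorical_features(record):
    """Concatenate the encoded guideline, encoded principle, and reliability score."""
    return (one_hot(record.guideline, guidelines)
            + one_hot(record.principle, principles)
            + [record.reliability])


X_categorical = [categorical_features(r) for r in records]
y = [r.degradation_percent for r in records]
print(len(X_categorical[0]))  # 7 = 3 guidelines + 3 principles + reliability
```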

Aerobic biodegradation classification

Contains over 3,000 data points, with SMILES strings as the inputs and the class (0 or 1) as the output. Only ready biodegradation data with a test time of 28 days and one of three test principles (closed bottle test, closed respirometer, or CO2 evolution) were considered.
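The selection rule above (28-day tests with one of three principles) could be expressed as a row filter like the one below. This is a sketch under assumptions: the column names (`principle`, `time_days`) and the sample rows are made up for illustration and do not come from the dataset.

```python
import csv
import io

# Selection criteria as described in the text; principle names are compared in lower case.
ALLOWED_PRINCIPLES = {"closed bottle test", "closed respirometer", "co2 evolution"}
TEST_DURATION_DAYS = 28

# Placeholder CSV content with hypothetical column names and made-up values.
raw_csv = io.StringIO(
    "smiles,principle,time_days,degradation_percent\n"
    "CCO,Closed respirometer,28,80\n"
    "c1ccccc1,Closed bottle test,14,20\n"
    "CC(=O)O,CO2 evolution,28,90\n"
    "CCCCCCCC,DOC die away,28,50\n"
)


def keep_row(row):
    """Keep only 28-day tests that use one of the allowed principles."""
    return (int(row["time_days"]) == TEST_DURATION_DAYS
            and row["principle"].strip().lower() in ALLOWED_PRINCIPLES)


selected = [row for row in csv.DictReader(raw_csv) if keep_row(row)]
print([row["smiles"] for row in selected])  # ['CCO', 'CC(=O)O']
```

The second row is dropped because of its 14-day duration, and the fourth because its principle is not in the allowed set.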

Sample Python Code

Example Python code (in Jupyter Notebook) for using the model files.

Aerobic biodegradation regression

A downloadable Jupyter notebook that guides you step by step through making your own predictions with the provided model file, including data preparation, prediction, accuracy evaluation, and results export.
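The accuracy evaluation and results export steps for a regression model might look like the sketch below. It does not reproduce the notebook: the function names, the output column names, and the numbers are all placeholders, and the predictions are hard-coded instead of coming from the model file. It uses only the standard library so it runs anywhere; in practice the same metrics are available in scikit-learn.

```python
import csv
import io
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R^2 for paired observed/predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
        # R^2 is undefined when all observed values are identical.
        "r2": 1 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


def export_predictions(smiles, y_pred, handle):
    """Write one row per chemical to any file-like object in CSV format."""
    writer = csv.writer(handle)
    writer.writerow(["smiles", "predicted_degradation_percent"])
    for s, p in zip(smiles, y_pred):
        writer.writerow([s, f"{p:.1f}"])


# Placeholder values standing in for observed data and model output.
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
y_true = [80.0, 30.0, 90.0]
y_pred = [70.0, 40.0, 85.0]

metrics = regression_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})

# An in-memory buffer is used here; pass open("results.csv", "w", newline="") to write a file.
buffer = io.StringIO()
export_predictions(smiles, y_pred, buffer)
print(buffer.getvalue())
```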

Aerobic biodegradation classification

A downloadable Jupyter notebook that guides you step by step through making your own predictions with the provided model file, including data preparation, prediction, accuracy evaluation, and results export.
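For a classifier, the accuracy evaluation step could be sketched as follows. The function names and the labels are placeholders rather than the notebook's own code, and class 1 is assumed to mean "readily biodegradable". Sensitivity and specificity are reported alongside accuracy because a single accuracy figure can hide poor performance on one of the two classes.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


def classification_metrics(counts):
    """Derive accuracy, sensitivity, and specificity from confusion counts."""
    total = sum(counts.values())
    positives = counts["tp"] + counts["fn"]
    negatives = counts["tn"] + counts["fp"]
    return {
        "accuracy": (counts["tp"] + counts["tn"]) / total,
        "sensitivity": counts["tp"] / positives if positives else float("nan"),
        "specificity": counts["tn"] / negatives if negatives else float("nan"),
    }


# Placeholder labels standing in for observed classes and model output.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
class_metrics = classification_metrics(counts)
print(counts)  # {'tp': 2, 'tn': 2, 'fp': 1, 'fn': 1}
```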

Tools

Frameworks and libraries used to develop these models.

Python3

The most widely used programming language for machine learning.

Jupyter Notebook

One of the most widely used web applications for machine learning. It allows users to create and share documents that contain live code, equations, visualizations, and narrative text.

Scikit-learn

One of the most useful tools providing dozens of ML models for classification, regression, clustering, and so on. It is a simple and efficient tool for predictive data analysis.

Pandas

One of the most popular tools for working with Excel files, CSV files, and DataFrames. It is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool.

RDKit

One of the most widely used tools for working with organic chemistry. It allows users to draw chemical structures, calculate molecular fingerprints, perform similarity calculations, and more.

Matplotlib

One of the most widely used libraries for creating static, animated, and interactive visualizations in Python.

TensorFlow

An end-to-end open source platform for machine learning, widely used for developing deep neural network models.

PyTorch

An open source machine learning framework that accelerates the path from research prototyping to production deployment.