Basic workflow to build classification QSAR models
Tutorial for building QSAR Models in KNIME: Part 1
Once everything is set up, we can dive into preparing your bioactivity dataset and normalizing structures. Let’s get started!
Preparing Your Bioactivity Dataset for Machine Learning
In machine learning, high-quality data is crucial to achieving good model performance. The phrase "garbage in, garbage out" is particularly true here, so it’s essential to take the time to clean and normalize your dataset. For this tutorial, we are going to use the bioactivity dataset for carbonic anhydrase 9 (ChEMBL ID: CHEMBL3594). Here’s how you can prepare your bioactivity data in KNIME. (Please click the image to see the enlarged version of the workflow.)
We will prepare the dataset in two steps. The first is to extract the information relevant for QSAR modeling, and the second is to process the raw SMILES notations. This processing makes the structural notations fit for efficient molecular descriptor calculations.
Step 1: Processing the raw text dataset retrieved from the ChEMBL database
1. Select Relevant Columns
Start by narrowing down your dataset to include only the key columns needed for modeling:
- ChEMBL ID: The unique identifier for each compound.
- SMILES: String representations of the molecules.
- Standard Value: The measured bioactivity.
- Standard Relationship: Defines the relationship of the standard value (e.g., “=”, “>”, “<”).
- Standard Unit: The bioactivity measurement unit.
- Assay ID (optional): Useful for focusing on a particular assay type.
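Outside KNIME, the column selection above can be sketched in a few lines of plain Python. This is only an illustration: the header names in KEEP are assumptions about how a ChEMBL export is labeled, "Standard Type" is included because it is needed when choosing the bioactivity type later, and the one-row table is made up.

```python
import csv
import io

# Assumed column headers; check them against your own ChEMBL export.
KEEP = [
    "Molecule ChEMBL ID",
    "Smiles",
    "Standard Type",
    "Standard Relation",
    "Standard Value",
    "Standard Units",
]


def select_columns(rows, keep=KEEP):
    """Return new row dicts containing only the columns listed in `keep`."""
    return [{col: row.get(col, "") for col in keep} for row in rows]


# Tiny made-up table standing in for the downloaded file.
raw = io.StringIO(
    "Molecule ChEMBL ID,Smiles,Standard Type,Standard Relation,"
    "Standard Value,Standard Units,Comment\n"
    "CHEMBL_A,CCO,Ki,=,12.5,nM,illustrative row\n"
)
rows = list(csv.DictReader(raw))
selected = select_columns(rows)
```

For a real file, replace the io.StringIO object with an opened file handle and set the delimiter your export uses.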
2. Remove Missing and Duplicate Values
Remove rows that contain missing values, and remove duplicate rows based on ChEMBL IDs. This step ensures that your dataset is clean, reducing potential errors or noise in the model.
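As a rough sketch of this cleaning step, assuming each record is a dictionary of strings and using hypothetical column names, rows with an empty required field are dropped and only the first row per ChEMBL ID is kept:

```python
def drop_missing_and_duplicates(
    rows,
    id_col="Molecule ChEMBL ID",
    required=("Molecule ChEMBL ID", "Smiles", "Standard Value"),
):
    """Drop rows with an empty required field, then keep the first row per ID."""
    seen = set()
    cleaned = []
    for row in rows:
        # Treat absent, None, and whitespace-only values as missing.
        if any(not (row.get(col) or "").strip() for col in required):
            continue
        key = row[id_col]
        if key in seen:
            continue  # a row with this ID was already kept
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Made-up records for illustration.
demo_rows = [
    {"Molecule ChEMBL ID": "CHEMBL_A", "Smiles": "CCO", "Standard Value": "12.5"},
    {"Molecule ChEMBL ID": "CHEMBL_A", "Smiles": "CCO", "Standard Value": "15.0"},
    {"Molecule ChEMBL ID": "CHEMBL_B", "Smiles": "", "Standard Value": "3.0"},
    {"Molecule ChEMBL ID": "CHEMBL_C", "Smiles": "CCN", "Standard Value": "800"},
]
cleaned_rows = drop_missing_and_duplicates(demo_rows)
```

Keeping the first measurement is just one possible policy. Averaging repeated measurements for the same compound is another, and it is not shown here.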
3. Retain Exact Measurements
Filter your data to include only records where the Standard Relationship equals "=". This ensures you're working with precise bioactivity measurements.
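A minimal sketch of this filter follows. The column name is an assumption, and the quote stripping is a precaution in case an export wraps the sign in quotes:

```python
def keep_exact(rows, relation_col="Standard Relation"):
    """Keep only rows whose relation is '=', ignoring spaces and wrapping quotes."""
    kept = []
    for row in rows:
        relation = (row.get(relation_col) or "").strip().strip("'\"")
        if relation == "=":
            kept.append(row)
    return kept


# Made-up records for illustration.
relation_demo = [
    {"id": "a", "Standard Relation": "="},
    {"id": "b", "Standard Relation": ">"},
    {"id": "c", "Standard Relation": "'='"},
    {"id": "d", "Standard Relation": "<"},
]
exact_rows = keep_exact(relation_demo)
```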
4. Choose Your Bioactivity of Interest
Identify the bioactivity type (e.g., IC50, Ki) with the highest number of molecules in your dataset. In my case, for the CA9 dataset, Ki (inhibition constant) values make up the majority of the records, so we will move ahead with this type. More data generally makes for better models, and this choice gives a more robust dataset for modeling.
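Counting the records per activity type is a one-liner with collections.Counter. The sketch below uses a hypothetical "Standard Type" column and made-up counts, not the real CA9 numbers:

```python
from collections import Counter


def most_common_type(rows, type_col="Standard Type"):
    """Return (activity_type, count) for the most frequent type, or None if empty."""
    counts = Counter(row[type_col] for row in rows if row.get(type_col))
    return counts.most_common(1)[0] if counts else None


# Made-up records for illustration.
type_demo = [
    {"Standard Type": "Ki"},
    {"Standard Type": "IC50"},
    {"Standard Type": "Ki"},
    {"Standard Type": "Ki"},
]
best_type = most_common_type(type_demo)
ki_rows = [row for row in type_demo if row["Standard Type"] == best_type[0]]
```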
5. Ensure Consistent Units
Standardize the bioactivity units by retaining only entries measured in nanomolar (nM). Consistent units make the data easier to interpret and model.
6. Normalize Bioactivity Data and Labeling
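Normalize bioactivity values by taking the negative base-10 logarithm of the activity expressed in molar units (giving, e.g., pIC50 or pKi). For values recorded in nM, this is equivalent to 9 − log10(value in nM), so a Ki of 10 nM corresponds to a pKi of 8. This puts the values on a compact, interpretable scale, which can then be binned into activity classes: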
- p(activity) > 7: Active
- p(activity) 5–7: Intermediate
- p(activity) < 5: Inactive
To simplify modeling, you can remove intermediate values and create a binary classification model for active vs. inactive compounds.
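The transformation and labeling can be sketched as follows, assuming the activity values are already in nM. The function names are illustrative, and the thresholds are the ones listed above, with the boundary values 5 and 7 treated as intermediate:

```python
import math


def p_activity_from_nm(value_nm):
    """Convert an activity in nM to its negative log10 molar value (e.g., pKi)."""
    if value_nm <= 0:
        raise ValueError("activity must be a positive number")
    # 1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm)
    return 9.0 - math.log10(value_nm)


def label_activity(p_value, active_above=7.0, inactive_below=5.0):
    """Assign the three classes; the boundary values count as Intermediate."""
    if p_value > active_above:
        return "Active"
    if p_value < inactive_below:
        return "Inactive"
    return "Intermediate"


def binary_labels(values_nm):
    """Return (value_nm, p_value, label) tuples, dropping intermediate compounds."""
    labeled = []
    for value in values_nm:
        p_value = p_activity_from_nm(value)
        label = label_activity(p_value)
        if label != "Intermediate":
            labeled.append((value, p_value, label))
    return labeled


# Made-up Ki values in nM.
binary = binary_labels([10.0, 1000.0, 100000.0])
```

Here 10 nM gives a pKi of 8 (active), 1000 nM gives 6 (intermediate, so dropped), and 100000 nM gives 4 (inactive).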
Note: If you intend to develop a model to predict exact numerical activity (regression tasks), you may skip the labeling part and proceed with the actual numerical activity values as target variables.
Once your dataset is prepared and labeling is done, it’s time to process and normalize the molecular structures.
Step 2: Processing and normalizing SMILES strings
Structure normalization is another critical step to ensure your machine-learning models are built on molecular descriptors derived from clean and consistent structural data. Here are six generic steps to normalize molecular structures, especially SMILES strings:
1. Remove Salts and Mixtures
Salts and inorganic mixtures can introduce noise into your dataset. Focus on small organic molecules by removing multicomponent SMILES strings, leading to cleaner data and more accurate predictions. In SMILES, the components of a salt or mixture are separated by a dot, as in CC(=O)[O-].[Na+] for sodium acetate.
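Because a dot in a SMILES string separates disconnected components, multicomponent records can be detected with a simple string check. A minimal sketch with made-up examples:

```python
def is_multicomponent(smiles):
    """True if the SMILES string holds more than one disconnected component."""
    return "." in smiles


smiles_examples = [
    "CCO",               # ethanol, single component
    "CC(=O)[O-].[Na+]",  # sodium acetate, written as two ions
    "c1ccccc1N.Cl",      # aniline with HCl as a separate component
]
single_component = [s for s in smiles_examples if not is_multicomponent(s)]
```

This drops the whole record, as described above. Keeping only the largest fragment is a different policy and would need a cheminformatics toolkit.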
2. Remove Inorganic and Metal-Organic Molecules
Most QSAR models and descriptor-generation tools are designed for small organic compounds. Removing inorganic or metal-organic molecules ensures compatibility with cheminformatics tools and better predictions.
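One rough way to flag inorganic and metal-containing entries without a toolkit is to read the element symbols out of the SMILES string and compare them with an allowed set. The sketch below is a heuristic under stated assumptions: the regular expression only covers ordinary SMILES atom syntax, the ALLOWED set is my own choice of typical organic elements, and "organic" is taken to mean "contains at least one carbon". A toolkit such as RDKit would be the more reliable route.

```python
import re

# Assumed set of elements accepted in a "small organic" molecule.
ALLOWED = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}

# Bracket atoms such as [Na+] or [nH], or bare organic-subset atoms such as Cl or c.
ATOM = re.compile(
    r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]"  # group 1: symbol inside brackets
    r"|(Cl|Br|[BCNOPSFI]|[bcnops])"           # group 2: symbol outside brackets
)


def elements_in(smiles):
    """Return the set of element symbols found in a SMILES string."""
    found = set()
    for match in ATOM.finditer(smiles):
        symbol = match.group(1) or match.group(2)
        found.add(symbol.capitalize())  # aromatic 'c' becomes 'C', 'se' becomes 'Se'
    return found


def is_small_organic(smiles, allowed=ALLOWED):
    """True if the molecule contains carbon and only allowed elements."""
    elements = elements_in(smiles)
    return "C" in elements and elements <= allowed


organic_demo = ["CCO", "c1ccccc1Cl", "C[Hg]C", "[Na+].[Cl-]", "N[Pt](N)(Cl)Cl"]
organic_only = [s for s in organic_demo if is_small_organic(s)]
```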
3. Strip Stereochemistry
If your dataset isn’t dependent on stereochemistry, strip this information from the molecules. This makes it easier for the model to focus on essential structural patterns. However, if stereochemistry is relevant (e.g., for bioactivity data), you can use tools like the CDK fingerprint node to account for chirality.
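At the string level, stereochemistry in SMILES is carried by the @ symbols (tetrahedral centers) and the / and \ bond markers (double-bond geometry), so a crude way to strip it is to delete those characters. This is a sketch, not a toolkit-grade method: it assumes only the common @ and @@ forms are present (rarer chirality classes such as @TH1 would leave residue), and it does not canonicalize the result.

```python
import re

# Characters that encode stereochemistry in common SMILES strings.
_STEREO_MARKS = re.compile(r"[@/\\]")


def strip_stereo(smiles):
    """Remove tetrahedral (@, @@) and double-bond (/, \\) stereo markers."""
    return _STEREO_MARKS.sub("", smiles)


l_alanine = "C[C@H](N)C(=O)O"
d_alanine = "C[C@@H](N)C(=O)O"
flat_l = strip_stereo(l_alanine)
flat_d = strip_stereo(d_alanine)
flat_alkene = strip_stereo("F/C=C/F")
```

After stripping, both alanine enantiomers give the same string, which is what lets a later deduplication step merge them.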
4. Normalize Structures
Normalize aromatic rings and optimize overall molecular structures using tools like RDKit. This ensures consistency in how molecular rings and functional groups are represented.
5. Deduplicate Using SMARTS and InChI Keys
Avoid redundant structures by deduplicating them using SMARTS patterns or InChI keys. This step catches any duplicates that might have slipped through earlier, ensuring a clean dataset for modeling.
6. Save Normalized Structures
Finally, save the normalized dataset in a format that’s easy to work with later (e.g., SDF). Include molecule identifiers, normalized structures, and activity labels.
Following these steps, you’ll have a high-quality, normalized molecular dataset ready for descriptor calculation and machine-learning modeling. In the next post, I’ll cover how to generate molecular fingerprints and refine them for your classification QSAR models.
I am also working on a Colab notebook to perform this task.
Stay tuned!
Bonus content:
I didn't know that you could reduce the size of many pictures in one go. Here is the simple code, using the ImageMagick package:
sudo apt install imagemagick-6.q16
for img in *.PNG; do mogrify -resize 50% "$img"; done
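Just navigate into the desired directory and execute these lines, replacing PNG with your own file extension. Note that mogrify overwrites the original files, so work on a copy if you want to keep the full-size images.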