Basic workflow to build classification QSAR models

Tutorial for building QSAR Models in KNIME: Part 1





In this blog post, I’ll guide you through the process of building QSAR (Quantitative Structure-Activity Relationship) models using the KNIME platform. Before starting, make sure you’ve installed the required cheminformatics nodes. If you need help with that, you can follow this video or this video for instructions. We will be using a combination of native KNIME nodes and cheminformatics nodes.

Once everything is set up, we can dive into preparing your bioactivity dataset and normalizing structures. Let’s get started!


Preparing Your Bioactivity Dataset for Machine Learning

In machine learning, high-quality data is crucial to achieving good model performance. The phrase "garbage in, garbage out" is particularly true here, so it’s essential to take the time to clean and normalize your dataset. For this tutorial, we are going to use the bioactivity dataset for carbonic anhydrase 9 (ChEMBL ID: CHEMBL3594). Here’s how you can prepare your bioactivity data in KNIME (click the image to see an enlarged version of the workflow).

We will prepare the dataset in two steps: the first extracts the information relevant for QSAR modeling, and the second processes the raw SMILES notations. This processing makes the structural notations fit for efficient molecular descriptor calculations.

Step 1: Processing the raw text dataset retrieved from ChEMBL

1. Select Relevant Columns

Start by narrowing down your dataset to include only the key columns needed for modeling:

  • ChEMBL ID: The unique identifier for each compound.
  • SMILES: String representations of the molecules.
  • Standard Value: The measured bioactivity.
  • Standard Relationship: Defines the relationship of the standard value (e.g., “=”, “>”, “<”).
  • Standard Unit: The bioactivity measurement unit.
  • Assay ID (optional): Useful for focusing on a particular assay type.


2. Remove Missing and Duplicate Values

Remove rows that contain missing values, and remove duplicate rows based on ChEMBL IDs. This step ensures that your dataset is clean, reducing potential errors or noise in the model.
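For readers who prefer to see the logic of steps 1 and 2 outside KNIME, here is a minimal Python sketch using only the standard library. The column names are assumptions about how a ChEMBL export might be labeled, the rows are made-up toy data, and keeping the first row per ChEMBL ID is just one simple way to handle duplicates.

```python
# Column names below are assumptions; adjust them to match your own export.
KEEP = [
    "Molecule ChEMBL ID",
    "Smiles",
    "Standard Value",
    "Standard Relation",
    "Standard Units",
]


def select_and_clean(rows):
    """Keep only the KEEP columns, drop incomplete rows, keep one row per ID."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        # Column selection: copy only the columns we care about.
        record = {col: (row.get(col) or "").strip() for col in KEEP}
        # Missing values: skip the row if any selected column is empty.
        if any(value == "" for value in record.values()):
            continue
        # Duplicates: keep only the first row seen for each ChEMBL ID.
        chembl_id = record["Molecule ChEMBL ID"]
        if chembl_id in seen_ids:
            continue
        seen_ids.add(chembl_id)
        cleaned.append(record)
    return cleaned


# Toy rows with made-up IDs and values, for illustration only.
rows = [
    {"Molecule ChEMBL ID": "CHEMBL_A", "Smiles": "CCO", "Standard Value": "12.5",
     "Standard Relation": "=", "Standard Units": "nM", "Comment": "extra column"},
    {"Molecule ChEMBL ID": "CHEMBL_A", "Smiles": "CCO", "Standard Value": "14.0",
     "Standard Relation": "=", "Standard Units": "nM", "Comment": "duplicate ID"},
    {"Molecule ChEMBL ID": "CHEMBL_B", "Smiles": "", "Standard Value": "300",
     "Standard Relation": "=", "Standard Units": "nM", "Comment": "missing SMILES"},
    {"Molecule ChEMBL ID": "CHEMBL_C", "Smiles": "c1ccccc1", "Standard Value": "900",
     "Standard Relation": "=", "Standard Units": "nM", "Comment": ""},
]

cleaned = select_and_clean(rows)
for record in cleaned:
    print(record["Molecule ChEMBL ID"], record["Smiles"])
```

In a real dataset the same compound often has several measurements, so you may prefer to aggregate them (for example, by taking a median) instead of keeping the first one.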

3. Retain Exact Measurements

Filter your data to include only records where the Standard Relationship equals "=". This ensures you're working with precise bioactivity measurements.

4. Choose Your Bioactivity of Interest

Identify the bioactivity type (e.g., IC50, Ki) with the highest number of molecules in your dataset. In my case, for the CA9 dataset, Ki (inhibition constant) values make up the majority of the dataset, so we will move ahead with this type. More data generally makes for better models, and this gives a more robust dataset for modeling.

5. Ensure Consistent Units

Standardize the bioactivity units by retaining only entries measured in nanomolar (nM). Consistent units make the data easier to interpret and model.
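Steps 3 to 5 boil down to three filters applied in sequence. The sketch below is one possible way to write them in plain Python; the column names (including "Standard Type") and the toy rows are assumptions for illustration, not the actual CA9 data. Some exports wrap the relation in quotes, so the helper strips those before comparing.

```python
from collections import Counter


def clean_relation(value):
    """Normalize a relation such as "'='" or " = " to a bare symbol."""
    return value.strip().strip("'\"")


def filter_activities(rows, unit="nM"):
    """Keep exact measurements of the most frequent activity type in one unit."""
    # Step 3: retain exact measurements only.
    exact = [r for r in rows if clean_relation(r["Standard Relation"]) == "="]
    if not exact:
        return None, []
    # Step 4: pick the activity type with the most records.
    counts = Counter(r["Standard Type"] for r in exact)
    top_type = counts.most_common(1)[0][0]
    # Step 5: keep only that type, measured in the chosen unit.
    selected = [
        r for r in exact
        if r["Standard Type"] == top_type and r["Standard Units"] == unit
    ]
    return top_type, selected


# Made-up toy rows, for illustration only.
rows = [
    {"Standard Type": "Ki", "Standard Relation": "'='", "Standard Units": "nM", "Standard Value": "8.2"},
    {"Standard Type": "Ki", "Standard Relation": "=", "Standard Units": "nM", "Standard Value": "450"},
    {"Standard Type": "Ki", "Standard Relation": ">", "Standard Units": "nM", "Standard Value": "10000"},
    {"Standard Type": "IC50", "Standard Relation": "=", "Standard Units": "nM", "Standard Value": "35"},
    {"Standard Type": "Ki", "Standard Relation": "=", "Standard Units": "ug.mL-1", "Standard Value": "1.2"},
]

top_type, selected = filter_activities(rows)
print(top_type, [r["Standard Value"] for r in selected])
```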

6. Normalize Bioactivity Data and Labeling

Normalize bioactivity values by taking the negative base-10 logarithm of the activity expressed in molar units (e.g., pIC50, pKi); for values in nM this is 9 - log10(value). This transforms the values onto a manageable and interpretable scale:

  • p(activity) > 7: Active
  • p(activity) 5–7: Intermediate
  • p(activity) < 5: Inactive

To simplify modeling, you can remove intermediate values and create a binary classification model for active vs. inactive compounds.
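The conversion and labeling can be sketched in a few lines of Python. Since the values are in nM, p(activity) = -log10(value x 1e-9) = 9 - log10(value in nM). The function names and toy values are my own for illustration, and I am assuming that values exactly at 5 or 7 fall into the intermediate class.

```python
import math


def p_activity(value_nm):
    """Convert an activity in nM to its negative log10 molar value (pKi, pIC50)."""
    if value_nm <= 0:
        raise ValueError("activity must be a positive number")
    return 9.0 - math.log10(value_nm)


def label_activity(p_value, active_above=7.0, inactive_below=5.0):
    """Assign a class label using the thresholds described above."""
    if p_value > active_above:
        return "active"
    if p_value < inactive_below:
        return "inactive"
    return "intermediate"


# Made-up Ki values in nM, for illustration only.
ki_values_nm = [10.0, 1000.0, 50000.0]

labeled = [(ki, p_activity(ki), label_activity(p_activity(ki))) for ki in ki_values_nm]

# Binary dataset: drop the intermediate compounds.
binary = [(ki, lab) for ki, _, lab in labeled if lab != "intermediate"]

for ki, p_value, lab in labeled:
    print(f"{ki:>8.1f} nM  p = {p_value:.2f}  {lab}")
```

For a regression model you would keep the `p_activity` values as the target and skip `label_activity` entirely.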

😌 Note: If you intend to develop a model to predict exact numerical activity (regression tasks), you may skip the labeling part and proceed with the actual numerical activity values as target variables.

Once your dataset is prepared and labeling is done, it’s time to process and normalize the molecular structures.


Step 2: Processing and normalizing SMILES strings

Structure normalization is another critical step to ensure your machine-learning models are built on molecular descriptors derived from clean and consistent structural data. Here are six generic steps to normalize molecular structures, especially SMILES strings:

1. Remove Salts and Mixtures

Salts and inorganic mixtures can introduce noise into your dataset. Focus on small organic molecules by removing multicomponent SMILES strings, leading to cleaner data and more accurate predictions. In a multicomponent SMILES the components are separated by a dot; for example, sodium acetate is written as CC(=O)[O-].[Na+].
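Because the dot is the SMILES symbol for disconnected components, spotting salts and mixtures can be as simple as a string check. This is a minimal sketch with a hypothetical helper name and a few example molecules; it removes multicomponent records outright, as described above, and does not try to keep the largest fragment.

```python
def is_multicomponent(smiles):
    """Return True if the SMILES contains more than one disconnected component."""
    return "." in smiles


# Example SMILES: sodium acetate (salt), aspirin, ethanol, a hydrochloride salt.
smiles_list = [
    "CC(=O)[O-].[Na+]",
    "CC(=O)Oc1ccccc1C(=O)O",
    "CCO",
    "CCN.Cl",
]

single_component = [s for s in smiles_list if not is_multicomponent(s)]
print(single_component)
```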


2. Remove Inorganic and Metal-Organic Molecules

Most QSAR models and descriptor-generation tools are designed for small organic compounds. Removing inorganic or metal-organic molecules ensures compatibility with cheminformatics tools and better predictions.
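To show the idea behind this filter, here is a rough standard-library sketch that pulls element symbols out of a SMILES string with a regular expression and keeps only molecules that contain carbon and nothing outside a small set of typical organic elements. The allowed element set and the helper names are my own assumptions, and the check is crude (carbon dioxide would pass, for instance); a cheminformatics toolkit reading real atom objects is the more reliable choice.

```python
import re

# Bracket atoms such as [Na+] or [nH] first, then unbracketed organic-subset atoms.
ATOM_PATTERN = re.compile(
    r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]|(Cl|Br|[BCNOPSFI]|[bcnops])"
)

# Assumed set of elements considered acceptable for small organic molecules.
ORGANIC_ELEMENTS = {"H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Se", "Br", "I"}


def elements(smiles):
    """Return the set of element symbols found in a SMILES string."""
    found = set()
    for match in ATOM_PATTERN.finditer(smiles):
        symbol = match.group(1) or match.group(2)
        # Aromatic atoms are lowercase in SMILES; normalize "c" to "C", "se" to "Se".
        found.add(symbol.capitalize())
    return found


def is_small_organic(smiles):
    """True if the molecule contains carbon and only allowed elements."""
    found = elements(smiles)
    return "C" in found and found <= ORGANIC_ELEMENTS


candidates = [
    "NS(=O)(=O)c1ccc(F)cc1",  # an aromatic sulfonamide
    "[Na+].[Cl-]",            # sodium chloride, no carbon
    "N[Pt](N)(Cl)Cl",         # cisplatin, contains a metal
    "O=S(=O)(O)O",            # sulfuric acid, no carbon
    "C[Hg]C",                 # dimethylmercury, metal-organic
]

organic = [s for s in candidates if is_small_organic(s)]
print(organic)
```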

3. Strip Stereochemistry

If your dataset isn’t dependent on stereochemistry, strip this information from the molecules. This makes it easier for the model to focus on essential structural patterns. However, if stereochemistry is relevant (e.g., for bioactivity data), you can use tools like the CDK fingerprint node to account for chirality.
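At the string level, stereochemistry in SMILES is carried by the @ chirality markers inside bracket atoms and by the / and \ bond symbols around double bonds. The sketch below removes them with a regular expression. It is an illustration of what "stripping" means, not a replacement for a toolkit: the result is still valid SMILES but is not canonicalized, so it should be re-canonicalized before comparing structures.

```python
import re

# Chirality classes like @TH1 first, then plain @ / @@, then directional bonds.
STEREO_PATTERN = re.compile(r"@(?:TH|AL|SP|TB|OH)\d+|@+|[/\\]")


def strip_stereo(smiles):
    """Remove chirality and double-bond geometry markers from a SMILES string."""
    return STEREO_PATTERN.sub("", smiles)


examples = [
    "N[C@@H](C)C(=O)O",  # one enantiomer of alanine
    "N[C@H](C)C(=O)O",   # the other enantiomer
    "F/C=C\\F",          # a difluoroethene with defined double-bond geometry
]

stripped = [strip_stereo(s) for s in examples]
print(stripped)
```

After stripping, both alanine enantiomers collapse to the same string, which is exactly why the later deduplication step is needed.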

4. Normalize Structures

Normalize aromatic rings and optimize overall molecular structures using tools like RDKit. This ensures consistency in how molecular rings and functional groups are represented.

5. Deduplicate Using SMARTS and InChI Keys

Avoid redundant structures by deduplicating them using SMARTS patterns or InChI keys. This step catches any duplicates that might have slipped through earlier, ensuring a clean dataset for modeling.
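Once every structure has a key, deduplication is a single pass over the table. The sketch below assumes the InChIKeys were already computed by a cheminformatics tool and stored in a hypothetical "inchikey" field; the record IDs and labels are made up, and keeping the first occurrence is just one simple policy.

```python
def deduplicate_by_key(records, key_field="inchikey"):
    """Keep the first record for each structure key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = record[key_field]
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique


# Toy records: aspirin written two different ways shares one InChIKey.
records = [
    {"id": "mol_1", "smiles": "CC(=O)Oc1ccccc1C(=O)O",
     "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "label": "active"},
    {"id": "mol_2", "smiles": "O=C(O)c1ccccc1OC(C)=O",
     "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "label": "active"},
    {"id": "mol_3", "smiles": "CCO",
     "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "label": "inactive"},
]

unique_records = deduplicate_by_key(records)
print([r["id"] for r in unique_records])
```

The two aspirin rows look different as text, which is why comparing raw SMILES strings alone would miss this duplicate.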

6. Save Normalized Structures

Finally, save the normalized dataset in a format that’s easy to work with later (e.g., SDF). Include molecule identifiers, normalized structures, and activity labels.


Following these steps, you’ll have a high-quality, normalized molecular dataset ready for descriptor calculation and machine-learning modeling. In the next post, I’ll cover how to generate molecular fingerprints and refine them for your classification QSAR models.

I am also working on a Colab notebook to perform this task.

Stay tuned!

Bonus content:

I didn't know that you could reduce the size of many pictures in one go. Here is the simple code using the ImageMagick package:

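# Install ImageMagick (Debian/Ubuntu)
sudo apt install imagemagick-6.q16

# Resize every PNG in the current directory to 50%; mogrify overwrites the files in place
for img in *.PNG; do mogrify -resize 50% "$img"; done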

Just navigate into the desired directory and execute these lines, replacing PNG with your own file type. Since the files are overwritten, work on a copy if you want to keep the originals.








Sagar Singh Shyamal

Venturing into the domain of computational chemistry, with focus on guiding research with data-driven approaches.
