  • Created 08 Sep 2021

Do you want to learn how to use machine learning (ML) to accelerate drug discovery? Join us on September 14, 2:00 p.m. – 3:30 p.m. ET, for the second workshop in this series. The workshop focuses on the ATOM Modeling PipeLine (AMPL), an open-source, conda-based software package that automates key drug discovery steps. AMPL is designed to take molecular binding data (e.g., IC50, Ki) and carry out the ML steps with minimal user intervention (see the figure shown above). The first workshop, held in June, highlighted AMPL’s capabilities for creating ML-ready datasets.

Date: Tuesday, Sep 14, 2021

Time: 2:00 p.m. – 3:30 p.m. ET

Location: Webex

Registration: Not required

Presenter: Sarangan Ravichandran, PhD, PMP, Senior Data Scientist, ATOM Consortium/Frederick National Laboratory for Cancer Research (FNLCR), and Adjunct Professor in Bioinformatics, Hood College

Supporting materials: Tutorial and AMPL: A Data-Driven Modeling Pipeline for Drug Discovery

The second workshop on September 14 will demonstrate three preliminary in-silico drug discovery topics:

  • data ingestion
  • cleaning/tidying
  • curation on AMPL

Note: This session will run for 90 minutes and will use Google Colab notebooks (compatible with Jupyter notebooks) for the demonstration. Please see the outline below:

Notebook-1: Ingestion, Cleaning, and Exploratory Data Analysis (EDA) of Binding Assay Data (30 minutes)

  • Issues associated with data ingestion and curation (data sources: Drug Target Commons (DTC), ChEMBL, and ExCAPE-DB)
  • Exploratory data analysis of the ingested datasets
  • Standardization of units for outcome measures such as IC50 (e.g., µM to nM)
  • Data visualization and comparison
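
As a rough illustration of the unit standardization step listed above, the sketch below converts IC50 values reported in mixed units to nM and derives pIC50. It is not taken from the workshop notebooks or from AMPL; the function names, the set of unit spellings, and the sample values are all assumptions made for this example.

```python
import math

# Multipliers that convert a concentration to nanomolar (nM).
# The unit spellings handled here are an assumption for illustration.
TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "µM": 1e3, "mM": 1e6, "M": 1e9}


def to_nanomolar(value, unit):
    """Convert a concentration to nM; raises KeyError for an unknown unit."""
    return value * TO_NM[unit]


def pic50(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L), which equals 9 - log10(IC50 in nM)."""
    return 9.0 - math.log10(ic50_nm)


# Made-up records: (compound label, reported IC50, reported unit)
records = [("cmpd_a", 0.5, "uM"), ("cmpd_b", 20.0, "nM")]

standardized = [(name, to_nanomolar(value, unit)) for name, value, unit in records]
for name, ic50_nm in standardized:
    print(f"{name}: {ic50_nm:.1f} nM, pIC50 = {pic50(ic50_nm):.2f}")
```

Putting every measurement on one unit before any comparison or plotting keeps values from different sources on the same scale.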

Notebook-2: Standardization of SMILES, Featurization and Compound Overlap/Diversity using a Python Jupyter Notebook (30 minutes)

  • Compound overlap
  • SMILES standardization
  • Explore compound diversity using featurization and Tanimoto distance
  • Create plots/heatmaps for analysis
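
To make the overlap and diversity topics above more concrete, here is a minimal sketch of compound overlap as a set intersection and of Tanimoto distance between binary fingerprints. It uses plain Python sets rather than a cheminformatics toolkit, and the fingerprint bit indices are invented for illustration; in practice the fingerprints would come from a featurization step applied to standardized SMILES.

```python
def tanimoto_distance(bits_a, bits_b):
    """Tanimoto (Jaccard) distance between two sets of on-bit indices."""
    union = bits_a | bits_b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 0.0
    return 1.0 - len(bits_a & bits_b) / len(union)


# Hypothetical fingerprints keyed by SMILES; the bit indices are invented.
fingerprints = {
    "CCO": {1, 4, 7, 9},
    "CCN": {1, 4, 8, 9},
    "c1ccccc1": {2, 3, 5},
}

# Pairwise distance matrix, the kind of table a heatmap would display.
names = sorted(fingerprints)
matrix = [
    [tanimoto_distance(fingerprints[a], fingerprints[b]) for b in names]
    for a in names
]

# Compound overlap between two hypothetical datasets of standardized SMILES.
dataset_x = {"CCO", "CCN"}
dataset_y = {"CCN", "c1ccccc1"}
shared = dataset_x & dataset_y

for name, row in zip(names, matrix):
    print(name, [round(d, 2) for d in row])
print("shared compounds:", sorted(shared))
```

A distance of 0 means identical fingerprints and 1 means no bits in common, so larger values in the matrix indicate a more diverse compound set.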

Notebook-3: Curate, Merge Datasets to Create the Final ML-ready Dataset (30 minutes)

  • Removal of duplicates
  • Filtering of extreme values
  • Merge the DTC, ChEMBL, and ExCAPE-DB datasets to create a curated dataset
  • Data curation on the merged data
  • Creation of ML-ready dataset
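
The curation steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not AMPL's curation code: the `curate` function, the range thresholds, the choice to average replicate measurements, and the sample records are all hypothetical.

```python
from statistics import mean


def curate(sources, min_nm=1e-3, max_nm=1e6):
    """Merge records from several sources into one value per compound.

    sources: dict mapping a source name to a list of (smiles, ic50_nm) pairs.
    Values outside [min_nm, max_nm] are dropped as extreme (thresholds are
    hypothetical), and duplicate compounds are collapsed to their mean.
    """
    pooled = {}
    for rows in sources.values():
        for smiles, ic50_nm in rows:
            if not (min_nm <= ic50_nm <= max_nm):
                continue  # filter extreme values
            pooled.setdefault(smiles, []).append(ic50_nm)
    # One row per compound: duplicates are replaced by their average.
    return {smiles: mean(values) for smiles, values in sorted(pooled.items())}


# Made-up records, already standardized to nM and to a common SMILES form.
sources = {
    "DTC": [("CCO", 100.0), ("CCN", 5e7)],
    "ChEMBL": [("CCO", 300.0)],
    "ExCAPE-DB": [("c1ccccc1", 50.0)],
}

curated = curate(sources)
for smiles, ic50_nm in curated.items():
    print(smiles, ic50_nm)
```

Other aggregation rules, such as taking the median or averaging in log space, are equally plausible; the point is that each compound ends up with a single response value before modeling.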

To learn more about the software, visit the AMPL GitHub repository.

Questions? Contact the NCI Data Science Learning Exchange

Created by Clint Malone Last Modified Tue October 26, 2021 9:55 pm by Clint Malone