P45 Development of a multitask deep learning QSAR model using data from individual Cytochrome P450 isozymes.

Pranav Shah, NIH, Rockville
Alexey Zakharov, NIH, Rockville
R. Scott Obach, Pharmacokinetics, Dynamics and Metabolism, Pfizer Inc., Groton, CT
Anton Simeonov, NIH, Rockville
Cornelis Hop, Drug Metabolism & Pharmacokinetics, Genentech, Inc., South San Francisco, CA
Dac-Trung Nguyen, NIH, Rockville
Eric Gonzalez, NIH, Rockville
Hongmao Sun, NIH, Rockville
Xin Xu, NIH, Rockville
Cytochrome P450 (CYP) enzymes are well known to play an important role in drug metabolism. Understanding the relationship between chemical structure and CYP isoform activity can guide medicinal chemists in structure optimization. NCATS and the IQ Consortium have initiated a collaborative project to measure intrinsic clearance (CLint) for a set of compounds with the major human CYP isozymes, from which in silico prediction tools can be developed. We anticipate that an extensive database, and the in silico tools developed from it, will support structure optimization and lead selection in drug discovery and ultimately accelerate drug development for unmet medical needs. The goal is to make the datasets and models publicly available, to the benefit of scientists in academic institutes, non-profit organizations, and pharmaceutical companies. For this project, we initially tested 4000 compounds with 3 major human CYP isozymes: CYP3A4, CYP2C9, and CYP2D6. Experimentally, we developed a robust, semi-automated, high-throughput assay that allows rapid screening of this large compound set. To develop the QSAR models we used Morgan fingerprints with radius 3, which belong to the family of circular fingerprints implemented in the RDKit package. Multitask, or joint, learning solves several tasks at the same time; in QSAR, the tasks are the biological activities of the compounds. Thus, the three biological activities embedded in a joint deep learning model share the same feature representations, as well as the weights and biases of the hidden layers, but each has its own weights and biases in the output layer. In this study, we used multi-layer feedforward neural networks implemented in Keras with the Theano backend. The loss function was minimized with the Adam algorithm.
To avoid hyperparameter optimization of the network, we used a pyramidal architecture: three hidden layers of 1000, 700, and 200 neurons with the ReLU activation function. The model was evaluated by 5-fold cross-validation. Balanced accuracy (BA) was calculated for each of the three CYP isozymes. The model predicted all three activities well, with BA ranging from 0.75 to 0.85. The model will be incorporated into the publicly available NCATS web services (http://tripod.nih.gov/adme) once it is tested.
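
For readers unfamiliar with the metric, the following is a minimal sketch of how balanced accuracy can be computed for a single isozyme, using the standard definition as the mean of sensitivity and specificity. The abstract does not state how CLint values were converted into classes, so the binary labels (1 = active, 0 = inactive), the function name, and the toy data below are illustrative assumptions and not the study's data or code.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # actives predicted active
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # actives predicted inactive
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # inactives predicted inactive
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # inactives predicted active
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical labels and predictions for one isozyme (not data from the study).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(f"BA = {balanced_accuracy(y_true, y_pred):.3f}")
```

Unlike plain accuracy, this metric gives a score of 0.5 to a model that assigns every compound to the same class, which matters when actives and inactives are unevenly represented. In a multitask setting it would be computed separately for each output, here one per isozyme.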