
Seminar: Joseph S. Verducci and Jeffrey Bjoraker

Statistics Seminar
May 3, 2001
All Day
Dreese Lab 260

Title

Using Random Trees to Identify Key Chemical Assemblies Associated with Biological Activity

Speakers

Professor Joseph S. Verducci and Dr. Jeffrey Bjoraker, Department of Statistics, The Ohio State University and Leadscope, Inc. 

Abstract

Complex molecules are completely described by a decomposition into (1) binary fingerprints indicating the incidence of two-dimensional features that comprise gross structure, and (2) a supplementary library that completes the structure and identifies each atom in the molecule. A goal is to identify those key chemical assemblies whose presence makes a compound biologically active.

Pharmaceutical databases used in this problem may consist of a million molecular compounds, with thousands of binary structural features recorded for each compound. For any such data matrix, we use simulated annealing to find a set of up to five features (the maximum that might be physically required for chemical binding) whose simultaneous presence or absence best separates the largest group of most active compounds. The search is incorporated into a recursive partitioning (RP) design to produce a regression tree for biological activity on the space of structural fingerprints. Each node is characterized by some specific combination of structural features, and the terminal nodes with high average activities correspond, roughly, to different classes of compounds that achieve their biological activity through different forms of cell absorption and/or binding sites.
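The feature search described above can be illustrated with a minimal Python sketch. It is not the speakers' implementation: the objective function (excess mean activity of the matched compounds, scaled by the square root of the group size), the move set, the geometric cooling schedule, the toy data, and all function names are assumptions made purely for illustration. Only the cap of five features and the use of simulated annealing over presence/absence conditions come from the abstract.

```python
import math
import random

MAX_FEATURES = 5  # upper bound on rule size stated in the abstract


def matches(row, rule):
    """True if a fingerprint satisfies every (feature index -> required bit) condition."""
    return all(row[j] == v for j, v in rule.items())


def score(rule, X, y, overall_mean):
    """Hypothetical objective: excess mean activity of the matched group times sqrt(group size)."""
    matched = [yi for row, yi in zip(X, y) if matches(row, rule)]
    if not matched:
        return -math.inf  # a rule that matches no compound is never preferred
    return (sum(matched) / len(matched) - overall_mean) * math.sqrt(len(matched))


def propose(rule, n_features, rng):
    """Random neighbour: add, drop, or flip one condition, keeping 1..MAX_FEATURES conditions."""
    new = dict(rule)
    move = rng.choice(["add", "drop", "flip"])
    if move == "add" and len(new) < min(MAX_FEATURES, n_features):
        unused = [j for j in range(n_features) if j not in new]
        new[rng.choice(unused)] = rng.choice([0, 1])
    elif move == "drop" and len(new) > 1:
        del new[rng.choice(sorted(new))]
    else:
        # Fallback when the chosen move is not possible: flip presence <-> absence.
        j = rng.choice(sorted(new))
        new[j] = 1 - new[j]
    return new


def anneal(X, y, seed=0, steps=2000, t_start=1.0, t_end=0.01):
    """Simulated annealing over feature rules; returns (best_rule, best_score)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    overall_mean = sum(y) / len(y)
    rule = {rng.randrange(n_features): 1}  # start from one "feature present" condition
    cur_s = score(rule, X, y, overall_mean)
    best, best_s = dict(rule), cur_s
    for k in range(steps):
        t = t_start * (t_end / t_start) ** (k / (steps - 1))  # geometric cooling (assumed)
        cand = propose(rule, n_features, rng)
        cand_s = score(cand, X, y, overall_mean)
        # Metropolis rule: always accept improvements, sometimes accept worse rules.
        if cand_s > cur_s or (
            cand_s > -math.inf and rng.random() < math.exp((cand_s - cur_s) / t)
        ):
            rule, cur_s = cand, cand_s
            if cur_s > best_s:
                best, best_s = dict(rule), cur_s
    return best, best_s


def make_toy_data(n=300, p=12, seed=1):
    """Synthetic fingerprints: activity is raised when feature 2 is present and feature 7 absent."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(p)] for _ in range(n)]
    y = [(5.0 if row[2] == 1 and row[7] == 0 else 0.0) + rng.gauss(0, 1) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    for seed in (0, 1, 2):
        print(seed, anneal(X, y, seed=seed))
```

In a recursive partitioning setting, the best rule found at a node would split the compounds into those that match it and those that do not, and the search would be repeated on each part. Running the sketch with different seeds also shows why the resulting trees are random: each seed follows a different search path and may settle on a different rule.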

Since the feature-searching part of the algorithm is stochastic, different trees are produced as the initial seed and/or parameters of the annealing process are allowed to vary. An interesting problem is how to resolve the resulting random trees into some sensible inference about underlying chemical structures. We discuss one possible resolution called Common Feature Assemble, based on information in the supplemental libraries.

These procedures are now programmed into an alpha-release of the software package LeadscopeEnterprise, which is being refined for use at several large pharmaceutical companies. We will demonstrate this software at the talk.