
Large Databases of Small Molecules - Drug Development Tool and Public Resource

Marc Nicklaus

5 Collaborator(s)

Funding source

National Cancer Institute (NIH)
The principal objective of this project is to make large collections of small molecules available for drug development, both in-house and publicly; to advance the fields of chemical structure identification and processing and of unique compound identifier generation; and to provide free chemoinformatics tools for working with such databases. The project started with posting the information in the Open NCI Database on the CADD Group's public web server. Many databases are now available to the user, including large vendor catalogs of compounds that can be acquired for screening. Advanced processing is applied to the data, and powerful searching and display capabilities have been implemented.

The nature of the resources currently being developed is exemplified by a brief description of one such service, the Enhanced NCI Web Browser. Its data comprise data from NCI's Developmental Therapeutics Program (DTP) and additional information with which we have augmented the DTP data sets. We have subjected the Open NCI Database of about 260,000 compounds to various analyses that help to better understand its characteristics and to put it in perspective relative to other large databases used in computer-aided drug design and the chemical information sciences. Various clustering methods have been applied to it to elucidate its diversity, and the results have been compared with those for other databases. The Open NCI Database has been converted into various formats suitable for further processing, including 3D pharmacophore searching.

We have also implemented a powerful public search tool for the Open NCI Database with a web interface based on the chemical information toolkit CACTVS. Using just a web browser, the user can search about 250,000 structures by more than 600 criteria. We have greatly augmented the original DTP files with numerous additional data fields, whether calculated, predicted, or hyperlinked information.
These data have also been made available in directly downloadable format. Links to several additional services for further processing have been implemented. An online 3D pharmacophore search capability has been built which, as far as we are aware, is currently unique on the web. Searchable predictions of more than 550 different biological activities, calculated by the program PASS for most of the quarter-million compounds, have been included in the web service.

A more recent service is our Chemical Structure Lookup Service (CSLS), available at http://cactus.nci.nih.gov/lookup. CSLS is essentially a "phone book" for small molecules: it allows users to quickly find out in which, if any, of over 100 different databases (both public and commercial), comprising more than 74 million entries, their compounds occur. Updates of both the user interface and the structure and data holdings are underway at the time of this writing and will push the number of entries in CSLS beyond the 100 million mark. These projects also include downloading, reformatting, and evaluating, for cancer-related purposes, the massive set of structure and assay data deposited in PubChem.

One of our public tools is OSRA, the Optical Structure Recognition service for molecules, developed mostly by Dr. Igor Filippov. OSRA is a utility designed to convert graphical representations of chemical structures, as they appear in journal articles, patent documents, textbooks, trade magazines, etc., into SMILES (Simplified Molecular Input Line Entry System; see http://en.wikipedia.org/wiki/SMILES), a computer-readable molecular structure format. OSRA can read documents in over 90 graphical formats, including GIF, JPEG, PNG, TIFF, PDF, and PS, and generate the SMILES representation of the molecular structure images encountered within the document. The latest version of OSRA has added automatic recognition of reactions in patents and literature.
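OSRA's conversion of a structure image into SMILES could be driven from a script roughly as in the Python sketch below. The sketch rests on assumptions that are not stated in this description: that an executable named `osra` is installed and on the PATH, that it accepts the image path as its only argument, and that it prints one SMILES string per line, possibly followed by extra whitespace-separated fields. The helper names `build_osra_command`, `parse_osra_output`, and `image_to_smiles` are hypothetical.

```python
import subprocess


def build_osra_command(image_path: str, executable: str = "osra") -> list[str]:
    """Build the argument list for one OSRA run.

    Assumption: the utility is invoked as `osra <image file>`.
    """
    return [executable, image_path]


def parse_osra_output(stdout: str) -> list[str]:
    """Extract SMILES strings from the text printed by the utility.

    Assumption: one structure per non-empty line, with the SMILES as the
    first whitespace-separated token on that line.
    """
    smiles = []
    for line in stdout.splitlines():
        tokens = line.split()
        if tokens:
            smiles.append(tokens[0])
    return smiles


def image_to_smiles(image_path: str, executable: str = "osra") -> list[str]:
    """Run the (assumed) OSRA executable on an image and return the SMILES found."""
    result = subprocess.run(
        build_osra_command(image_path, executable),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits with an error
    )
    return parse_osra_output(result.stdout)
```

Keeping the output parsing separate from the subprocess call makes it possible to check the parsing logic without OSRA installed.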
OSRA now supports multi-step reaction extraction from the surrounding text and graphics, OCR of the reaction agents and comments, and conversion of the results to either reaction SMILES or RXN format. OSRA can be used as a stand-alone command-line utility, as a library (C++, and Java through JNI), and through a GUI provided as an Accelrys Draw plugin, which has been updated to handle the new reaction recognition capabilities.

The Chemical Identifier Resolver (CIR), developed by Dr. Markus Sitzmann, is our most heavily used service, typically receiving several hundred thousand to more than a million user requests per month. CIR works as a resolver for different chemical structure identifiers and allows one to convert a given structure identifier into another representation or structure identifier. Among others, our NCI/CADD Structure Identifiers developed in-house as well as the new Standard InChI and InChIKey identifiers are handled by this service. One of CIR's key features is that it provides a programmatic interface to the Chemical Structure Database (CSDB). CSDB has been updated to over 360 million original database records representing approximately 128 million unique small-molecule structures, making it one of the largest chemical databases in the world. Many additional capabilities are planned for this service, which is increasingly being integrated with other web services and chemoinformatics tools worldwide. CIR will also become increasingly important in the area of publications involving chemical structures, as efforts increase to make the inclusion of computer-readable representations of all compounds presented in a paper mandatory.

We are working on the next-generation web platform, which will be the basis for a series of new web services and updates of existing services, including the CADD Group's Chemical Structure Lookup Service (CSLS II). The URL of our public web server is http://cactus.nci.nih.gov.
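To illustrate the kind of identifier conversion CIR performs, here is a minimal Python sketch of a client. The URL pattern `/chemical/structure/<identifier>/<representation>` on the public server is an assumption not stated in this description, as are the representation names `smiles` and `stdinchikey`; the function names `build_cir_url` and `resolve` are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Host taken from the project description; the path is an assumed URL pattern.
CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def build_cir_url(identifier: str, representation: str) -> str:
    """Return the request URL for converting `identifier` into `representation`.

    Both parts are percent-encoded so that SMILES characters such as
    '(', ')', '=' or '#' survive as a single URL path segment.
    """
    return f"{CIR_BASE}/{quote(identifier, safe='')}/{quote(representation, safe='')}"


def resolve(identifier: str, representation: str, timeout: float = 10.0) -> list[str]:
    """Query the service and return the non-empty lines of its plain-text reply.

    Assumption: the service answers with one result per line. An unknown
    identifier is expected to surface as an HTTP error raised by urlopen.
    """
    with urlopen(build_cir_url(identifier, representation), timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Requires network access; asks for the Standard InChIKey of a name.
    print(resolve("aspirin", "stdinchikey"))
```

Building the URL in its own function keeps the encoding logic checkable offline, while the network call stays in `resolve`.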
We have analyzed a set of 43 million chemical structure records extracted from patent data (EP, US PTO, WO) by the IBM-led consortium of large pharmaceutical companies in the context of the SIIP (Strategic IP Insight Platform) project. OSRA was used in this project. Part of these data was made available for public use to both PubChem and the CADD Group (see, e.g., http://www-935.ibm.com/services/us/gbs/bao/siip/nih/?sid=0015AFBF08D8F183C1F8E32A430CFFEB).

Efforts to implement a new resource that makes affordable chemical synthesis of screening samples available to all NIH researchers were successfully concluded. This was realized as an extension of the contract with the company ChemNavigator, now part of Sigma-Aldrich, which has implemented the Semi-Custom Synthesis Online Request System (SCSORS). This resource is increasingly being used in our (and other groups') in silico screening, synthetic chemistry, and sample acquisition projects.

Very recently, we released a new tool on our web server: the Chemical Activity Predictor (CAP), which allows the user to calculate physicochemical properties and activities for compounds. Ongoing activities in this context include efforts to design a synthetically accessible virtual inventory of screening samples based on robust chemical reactions and existing commercially available building blocks. Likewise, our database and chemoinformatics tools will benefit from the work on tautomerism, in particular the redesign of tautomerism handling for version 2 of the IUPAC InChI identifier.

Related projects