Data Sets, Tools & Workflows


C-CAS is committed to making all data sets, tools and workflows freely available to the scientific community. This page provides a one-stop list of C-CAS contributions to the field of data chemistry, together with the publications and, where available, tutorial material developed within C-CAS. If you use the tools, please be sure to cite the appropriate publications and DOIs.

Open Reaction Database

The Open Reaction Database (ORD) is an open-access schema and infrastructure for structuring and sharing organic reaction data, including a centralized data repository. The ORD schema supports conventional and emerging technologies, from benchtop reactions to automated high-throughput experiments and flow chemistry.

The Open Reaction Database can be accessed at: https://open-reaction-database.org/

Publications:

Kearnes SM, Maser MR, Wleklinski M, Kast A, Doyle AG, Dreher SD, Hawkins JM, Jensen KF, Coley CW. The Open Reaction Database. J. Am. Chem. Soc. 2021, 143, 18820-18826. doi.org/10.1021/jacs.1c09820

Mercado R, Kearnes SM, Coley C. Data Sharing in Chemistry: Lessons Learned and a Case for Mandating Structured Reaction Data. J. Chem. Inf. Model. 2023, 63, 4253-4265. doi.org/10.1021/acs.jcim.3c00607

Tutorial material:

https://www.youtube.com/watch?v=eMbgkvNek0c

Kraken

Kraken is a discovery platform for monodentate organophosphorus(III) ligands that provides comprehensive physicochemical descriptors based on representative conformer ensembles. Using quantum-mechanical methods, we calculated descriptors for 1558 ligands, including commercially available examples, and trained machine learning models to predict properties of over 300 000 new ligands.

Kraken can be accessed at https://descriptor-libraries.molssi.org/kraken/

The molecular descriptors in Kraken form the basis of the Phosphine Predictor, a web tool by Sigma-Aldrich for the selection of phosphine ligands for cross-coupling reactions.

Publications:

Gensch, T.; dos Passos Gomes, G.; Friederich, P.; Peters, E.; Gaudin, T.; Pollice, R.; Jorner, K.; Nigam, A.; Lindner-D'Addario, M.; Sigman, M. S.; Aspuru-Guzik, A. A Comprehensive Discovery Platform for Organophosphorus Ligands for Catalysis. J. Am. Chem. Soc. 2022, 144, 3, 1205–1217. doi.org/10.1021/jacs.1c09718

Tutorial Material:

https://www.youtube.com/watch?v=ApWO7OSvUTk

Auto-QChem

Auto-QChem is an automatic, high-throughput and end-to-end DFT calculation workflow that computes chemical descriptors for organic molecules. Tailored toward users without extensive programming experience, Auto-QChem has facilitated more than 38 000 DFT calculations for 17 000 molecules as of January 2022. Starting from string representations of molecules, Auto-QChem automatically (a) generates conformational ensembles, (b) submits and manages DFT calculations on a high-performance computing (HPC) cluster, (c) extracts production-ready features that are suitable for statistical analysis and machine learning model development, and (d) stores the resulting calculations in a cloud-hosted and web-accessible database.

Auto-QChem is available at https://github.com/doyle-lab-ucla/auto-qchem

A web interface to the Auto-QChem database is available at https://autoqchem.org/

Publication:

Żurański, A.M.; Wang, J.Y.; Shields, B.J.; Doyle, A.G. Auto-QChem: an automated workflow for the generation and storage of DFT calculations for organic molecules. React. Chem. Eng. 2022, 7, 1276-1284. doi.org/10.1039/D2RE00030J

DBStep

DBStep is a Python package for obtaining DFT-Based Steric Parameters from 3-dimensional chemical structures. It can parse the outputs from most computational chemistry programs and other common molecular structure file formats. Steric properties can be obtained either exactly or by using a Cartesian grid. The grid approach allows a molecular isodensity surface to be featurized (DBStep can process wavefunction files) instead of relying on classical atomic radii. Currently, traditional Sterimol parameters (L, Bmin, Bmax) and percent buried volume parameters are implemented, as well as our novel steric parameter vectors Sterimol2vec and vol2vec. The package can be used on the command line or called from a Python script as part of a computational workflow that collects steric parameters.
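The grid-based approach can be illustrated with percent buried volume: grid points inside a sphere around a chosen center are counted, and the buried fraction is the share of those points that also lie inside an atomic sphere. The following is a minimal sketch under stated assumptions, not DBStep's implementation or API. The function name, the 3.5 Å default sphere radius, the grid spacing and the caller-supplied atomic radii are all illustrative choices.

```python
import math


def percent_buried_volume(center, atoms, sphere_radius=3.5, spacing=0.25):
    """Estimate percent buried volume on a regular Cartesian grid.

    center        -- (x, y, z) of the sphere center, e.g. a metal atom (Angstrom)
    atoms         -- list of ((x, y, z), radius) pairs for the occupying atoms
    sphere_radius -- radius of the probe sphere (illustrative default)
    spacing       -- grid spacing; smaller values are slower but more accurate
    """
    cx, cy, cz = center
    steps = int(math.ceil(sphere_radius / spacing))
    r2 = sphere_radius * sphere_radius
    in_sphere = 0
    buried = 0
    for i in range(-steps, steps + 1):
        for j in range(-steps, steps + 1):
            for k in range(-steps, steps + 1):
                dx, dy, dz = i * spacing, j * spacing, k * spacing
                # Skip grid points outside the probe sphere.
                if dx * dx + dy * dy + dz * dz > r2:
                    continue
                in_sphere += 1
                px, py, pz = cx + dx, cy + dy, cz + dz
                # A point is buried if it falls inside any atomic sphere.
                for (ax, ay, az), radius in atoms:
                    d2 = (px - ax) ** 2 + (py - ay) ** 2 + (pz - az) ** 2
                    if d2 <= radius * radius:
                        buried += 1
                        break
    return 100.0 * buried / in_sphere


# Hypothetical input: one atom of radius 1.75 Angstrom at the sphere center.
example = percent_buried_volume((0.0, 0.0, 0.0), [((0.0, 0.0, 0.0), 1.75)])
```

For this example the exact answer is 12.5 % (the volume ratio of the two spheres, (1/2)^3), and the grid estimate approaches that value as the spacing shrinks.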

DBStep can be downloaded from doi.org/10.5281/zenodo.4702097

Publication:

Luchini, G.; Patterson, T.; Paton, R. S. DBSTEP: DFT Based Steric Parameters. 2022, DOI: 10.5281/zenodo.4702097

Tutorial Material:

https://www.youtube.com/watch?v=oBofadojwUI

A general overview of modern steric parameters is available at:

https://www.youtube.com/watch?v=TPZ-aX0zbuY

Cascade

CASCADE stands for ChemicAl Shift CAlculation with DEep learning. It is a stereochemistry-aware online calculator for NMR chemical shifts using a graph network approach developed at Colorado State University. Molecular input can be specified as SMILES or through the graphical interface. An automated workflow executes 3D structure embedding and MMFF conformer searching. The full ensemble of optimized conformations is passed to a trained graph neural network to predict the NMR chemical shift (in ppm) for each carbon.
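The description above does not say how per-conformer predictions are combined into a single shift per atom. One common choice in computational NMR is Boltzmann weighting by relative conformer energy. The sketch below assumes that choice for illustration only; it is not taken from the CASCADE code, and the function names, energies and shift values are hypothetical.

```python
import math

GAS_CONSTANT = 0.0019872041  # kcal / (mol K)


def boltzmann_weights(energies_kcal, temperature=298.15):
    """Return normalized Boltzmann weights for conformer energies in kcal/mol."""
    e_min = min(energies_kcal)  # shift by the minimum for numerical stability
    factors = [
        math.exp(-(e - e_min) / (GAS_CONSTANT * temperature))
        for e in energies_kcal
    ]
    total = sum(factors)
    return [f / total for f in factors]


def ensemble_shifts(conformer_shifts, energies_kcal, temperature=298.15):
    """Average per-atom shifts (ppm) over conformers using Boltzmann weights.

    conformer_shifts -- one list of per-atom shifts for each conformer
    energies_kcal    -- one energy per conformer, in the same order
    """
    weights = boltzmann_weights(energies_kcal, temperature)
    n_atoms = len(conformer_shifts[0])
    return [
        sum(w * shifts[atom] for w, shifts in zip(weights, conformer_shifts))
        for atom in range(n_atoms)
    ]


# Hypothetical data: two conformers, two carbons each.
shifts = ensemble_shifts([[20.0, 130.0], [22.0, 128.0]], [0.0, 1.0])
```

With these inputs the lower-energy conformer carries most of the weight, so the averaged shifts lie closer to its values.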

Cascade can be downloaded from: github.com/patonlab/CASCADE

A web version of Cascade is available at https://nova.chem.colostate.edu/cascade/predict

Publication:

Guan, Y.; Sowndarya, S. V. S.; Gallegos, L. C.; St. John, P. C.; Paton, R. S. Real-time prediction of 1H and 13C chemical shifts with DFT accuracy using a 3D graph neural network. Chem. Sci. 2021, 12, 12012-12026. doi.org/10.1039/D1SC03343C

AQME

AQME is an ensemble of automated QM workflows, including: 1) RDKit- and CREST-based conformer generator and ready-to-submit QM input files starting from individual files or databases, 2) post-processing of QM output files to fix extra imaginary frequencies, unfinished jobs and error terminations.

AQME is available at https://github.com/patonlab/aqme

Publication:

Alegre-Requena, J. V.; Sowndarya, S.; Pérez-Soto, R.; Alturaifi, T.; Paton, R. AQME: Automated Quantum Mechanical Environments for Researchers and Educators. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2023, 13, e1663 doi.org/10.1002/wcms.1663

Tutorial Material:

https://www.youtube.com/watch?v=8aj6zDxhUMA

EDBO+

EDBO+ is a multi-objective Bayesian reaction optimization platform that builds on the previously published Bayesian optimizer EDBO. The web-based application incorporates features such as condition modification on the fly and data visualization.
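In a multi-objective setting there is usually no single best experiment, only a set of trade-offs in which no result is beaten on every objective at once (the Pareto front). The sketch below shows that idea for objectives that are all maximized. It is a generic illustration and not code from EDBO+; the function names and the yield and selectivity values are hypothetical.

```python
def dominates(a, b):
    """Return True if result a is at least as good as b in every objective
    and strictly better in at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(results):
    """Return the results that are not dominated by any other result."""
    return [p for p in results if not any(dominates(q, p) for q in results)]


# Hypothetical (yield %, selectivity %) pairs for four experiments.
experiments = [(90, 60), (80, 80), (70, 70), (60, 95)]
front = pareto_front(experiments)
```

Here (70, 70) is dropped because (80, 80) is better in both objectives, while the other three experiments each represent a different trade-off.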

The web interface for the EDBO+ optimizer is at https://edboplus.org and the GitHub version can be found at https://github.com/doyle-lab-ucla/edboplus

Publications:

EDBO+: Torres, J. A. G.; Lau, S. H.; Anchuri, P.; Stevens, J. M.; Tabora, J. E.; Li, J.; Borovika, A.; Adams, R. P.; Doyle, A. G. A Multi-Objective Active Learning Platform and Web App for Reaction Optimization. J. Am. Chem. Soc. 2022, 144, 19999-20007. doi.org/10.1021/jacs.2c08592

EDBO: Shields, B. J.; Stevens, J.; Li, J.; Parasram, M.; Damani, F.; Martinez Alvarado, J.; Janey, J.; Adams, R. P.; Doyle, A. Bayesian Reaction Optimization as a Tool for Chemical Synthesis. Nature 2021, 590, 89-96. doi.org/10.1038/s41586-021-03213-y

Tutorial material:

EDBO+ https://www.youtube.com/watch?v=Fo_ZplPyLZo

General introduction to Bayesian Optimization and “Over the Arrow” optimization:

https://www.youtube.com/watch?v=iKg30HoFO0c

https://www.youtube.com/watch?v=rMkweLya3T8

Bandit-Optimization

This code uses reinforcement learning, specifically the multi-armed bandit approach, for reaction optimization. It demonstrates data-efficient learning at high accuracies and has unique functionalities.
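In a multi-armed bandit, each candidate reaction condition is treated as an "arm" and the experimental budget is allocated step by step, balancing arms that look good so far against arms that have been tried too rarely to judge. The sketch below uses the textbook UCB1 rule on simulated success/failure outcomes. It is an illustration of the general approach, not the algorithm or API of the published code base; the condition names, success probabilities and function names are hypothetical.

```python
import math
import random


def ucb1(arms, pull, n_rounds):
    """Run UCB1 for n_rounds and return how often each arm was chosen.

    arms  -- list of arm labels (e.g. reaction conditions)
    pull  -- function mapping an arm label to a reward in [0, 1]
    """
    counts = {arm: 0 for arm in arms}
    totals = {arm: 0.0 for arm in arms}
    for t in range(1, n_rounds + 1):
        untried = [arm for arm in arms if counts[arm] == 0]
        if untried:
            # Try every arm once before using the confidence bound.
            arm = untried[0]
        else:
            # Pick the arm with the highest mean reward plus exploration bonus.
            arm = max(
                arms,
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        counts[arm] += 1
        totals[arm] += pull(arm)
    return counts


# Simulated experiment: each condition succeeds with a hidden probability.
rng = random.Random(0)
hidden_success = {"condition_A": 0.1, "condition_B": 0.3, "condition_C": 0.9}


def run_reaction(condition):
    return 1.0 if rng.random() < hidden_success[condition] else 0.0


pulls = ucb1(list(hidden_success), run_reaction, n_rounds=1000)
```

In this simulation most of the budget should end up on the condition with the highest hidden success rate, while the weaker conditions are still sampled occasionally.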

The Bandit optimizer code base is available at doi.org/10.5281/zenodo.8181283

Publication:

Wang JY, Stevens JM, Kariofillis SK, Tom MJ, Golden DL, Li J, Tabora JE, Parasram M, Shields BJ, Primer DN, Hao B. Identifying general reaction conditions by bandit optimization. Nature. 2024, 626, 1025-1033. doi.org/10.1038/s41586-024-07021-y

Maxbridge

Maxbridge is a web-based deterministic graphing program that identifies the maximally bridged ring (or rings) of any molecule using the Chemistry Development Kit (CDK) software library.

The web interface of Maxbridge can be found at Maxbridge.org

Publication:

Marth, C.J., Gallego, G.M., Lee, J.C., Lebold, T.P., Kulyk, S., Kou, K.G.M., Qin, J., Lilien, R. and Sarpong, R., 2015. Network-analysis-guided synthesis of weisaconitine D and liljestrandinine. Nature, 528(7583), pp.493-498. doi.org/10.1038/nature16440

AIMNet2

This package integrates the powerful AIMNet2 neural network potential into your simulation workflows. AIMNet2 provides fast and reliable energy, force, and property calculations for molecules containing a diverse range of elements.

AIMNet2 is available at https://github.com/isayevlab/AIMNet2

Publication:

Anstine D, Zubatyuk R, Isayev O. AIMNet2: A Neural Network Potential to Meet your Neutral, Charged, Organic, and Elemental-Organic Needs. ChemRxiv. 2024; doi:10.26434/chemrxiv-2023-296ch-v2

Molcomplex

This package is developed as a collaboration between the Paton and Sarpong groups. It implements a variety of complementary metrics for molecular complexity and synthetic accessibility.

It is accessible at https://github.com/patonlab/molcomplex

The web interface is available at http://molcomplex.org