The goal of the Center for Computer Assisted Synthesis (C-CAS) is to provide academic and nonacademic researchers with tools that significantly enhance the effectiveness of synthesis planning and optimization, allowing them to focus on "what should be made and why" and less on "how to make it". We will build on recent methodological advances at the interface of machine learning and chemistry, as well as on our collaboration with industrial partners, to create datasets that bring predictability to every individual step in a synthetic route.

The C-CAS is organized into three research thrusts that will integrate datasets from a variety of sources, exploit this information to solve the “over-the-arrow” problem, and apply the lessons learned to the synthesis of complex molecules.

Thrust 1: Data Mining and Integration

The central hypothesis of Thrust 1 of C-CAS is that better data lead to better performance by the machine learning algorithms. However, current datasets used for reaction predictions are mostly based on published literature, which has a substantial bias towards successful reactions, is often incomplete and sparse, and does not exploit the wealth of information encoded in molecular structure. Therefore, it is critical to create “better data” to guide understanding and predictability in chemical synthesis using machine learning.

A goal of C-CAS is to generate heterogeneous knowledge graphs in which each node is a data source and each link is an observed or inferred relationship among these data sources about chemical reactions. These knowledge graphs integrate information from many different sources, including high-throughput experimentation, atomistic calculations, and unpublished datasets from the electronic notebooks of our industrial and academic collaborators, in addition to the scientific literature and patent databases.
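
As a rough illustration of the idea, a heterogeneous knowledge graph can be sketched as typed nodes joined by labeled links that record whether a relationship was observed or inferred. The class names, node kinds, and relation labels below are hypothetical and chosen only for this sketch; they are not the schema used by C-CAS.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Node:
    """A data source in the graph (hypothetical schema)."""
    name: str
    kind: str  # e.g. "hte", "literature", "calculation", "notebook", "patent"


@dataclass
class KnowledgeGraph:
    """Typed nodes joined by links flagged as observed or inferred."""
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def link(self, src: Node, dst: Node, relation: str, inferred: bool = False) -> None:
        # Store the link in both directions so either source can be queried.
        self.edges[src].append((dst, relation, inferred))
        self.edges[dst].append((src, relation, inferred))

    def neighbors(self, node: Node, kind: Optional[str] = None) -> list:
        # Filtering by node kind is what makes the graph "heterogeneous" here.
        return [n for n, _, _ in self.edges[node] if kind is None or n.kind == kind]


# Illustrative data only: three made-up sources describing one reaction.
hte = Node("plate_042", "hte")
paper = Node("journal_article_A", "literature")
dft = Node("dft_run_7", "calculation")

kg = KnowledgeGraph()
kg.link(hte, paper, "reports_same_reaction")                  # observed link
kg.link(hte, dft, "barrier_consistent_with_yield", inferred=True)  # inferred link

calc_neighbors = kg.neighbors(hte, kind="calculation")
```

A production system would more likely use a graph database or a dedicated graph library; the dictionary of adjacency lists is used here only to keep the sketch self-contained.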

These knowledge graphs can then be used to determine the features relevant for a chemical reaction. This work will also explore best practices for generating and acquiring clean datasets from high-throughput experimentation (HTE), which will help answer the question of which experiments are most effective to perform in order to obtain predictive models that extrapolate beyond the dataset. Finally, this Thrust will focus on developing new strategies to describe molecular structures. An important precursor to the application of any machine learning algorithm is an effective feature representation of molecules. These features and descriptors of molecular structures will be incorporated into the knowledge graph representation.

Thrust 2: Optimization and Machine Learning

While there have been major recent advances in using ML for retrosynthetic planning and forward synthesis prediction, these advances represent necessary but not sufficient tools for the realization of computer-assisted synthesis planning. Reaction condition recommendation and optimization are essential elements of this long-term goal, since the selection of “above the arrow” conditions governs the success or failure of reactions conducted in the laboratory. Further, reaction success and the definition of “optimal” have many attributes depending on the context, from high yield and selectivity to environmental impact and cost of components.

A goal of C-CAS is to develop active and transfer learning methods that disrupt the current practices in “over the arrow” prediction and optimization by integrating software engineering, computational chemistry, and ML tools with experimental methods in chemical synthesis. We will ask:

  1. Does ML-driven optimization outperform traditional optimization campaigns in synthetic chemistry?
  2. How can a workflow be developed that utilizes the fewest number of experiments possible to obtain predictive models?
  3. How can ML inform new reaction development through the interconnectedness of previously developed reactions? 

These questions will be answered for selected examples of critical importance in pharmaceutical research and complex molecule synthesis, with an emphasis on identifying how each type of ML method should be applied. Research in this area will be highly collaborative, requiring the expertise of all C-CAS members and the extensive resources of our industrial affiliates.
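
To make the second question concrete, the following is a toy sketch of an experiment-selection loop that chooses each new reaction condition from the results gathered so far. It is not Bayesian optimization and not a C-CAS workflow: the surrogate is a simple nearest-neighbor heuristic, and `experiment`, `explore`, and the condition grid are all hypothetical stand-ins.

```python
import math


def nearest(x, measured):
    """Return (distance, yield) for the measured condition closest to x."""
    return min((math.dist(x, m), y) for m, y in measured.items())


def propose(candidates, measured, explore=1.0):
    """Pick the untested condition with the best 'predicted yield + novelty' score.

    Predicted yield is the yield of the nearest measured condition; novelty is
    the distance to it. `explore` trades exploitation against exploration.
    """
    untested = [c for c in candidates if c not in measured]

    def score(c):
        distance, predicted_yield = nearest(c, measured)
        return predicted_yield + explore * distance

    return max(untested, key=score)


def run_campaign(candidates, experiment, seed, budget):
    """Run `budget` experiments in total, starting from `seed`."""
    measured = {seed: experiment(seed)}
    for _ in range(budget - 1):
        x = propose(candidates, measured)
        measured[x] = experiment(x)  # stand-in for a laboratory measurement
    return measured


# Illustrative 3x3 grid of scaled (temperature, equivalents) conditions.
grid = [(t, e) for t in (0.0, 0.5, 1.0) for e in (0.0, 0.5, 1.0)]


def toy_yield(x):
    # Made-up response surface peaking at (0.5, 1.0); values lie in (0, 1].
    return math.exp(-((x[0] - 0.5) ** 2 + (x[1] - 1.0) ** 2))


results = run_campaign(grid, toy_yield, seed=(0.0, 0.0), budget=4)
best_condition = max(results, key=results.get)
```

The budget must not exceed the number of candidate conditions. Comparing the best result found at a small budget against an exhaustive screen is one simple way to frame the first question above.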

Thrust 3: Complex Molecule Synthesis Planning and Scoring

Synthesis pathways for the preparation of complex molecules are not unlike mazes, replete with unexpected twists, turns, and dead ends. Even when numerous viable routes lead from beginning to end, one of these pathways may be superior depending on the factors that one prioritizes (e.g., length or cost). Wouldn’t it be wonderful to have a way to predict a priori whether a path would lead successfully to the end, i.e., a “go/no-go” decision, and to determine which of many possible pathways one should pursue?

One goal of the C-CAS is to apply the workflows developed in the earlier thrusts by ranking the feasibility of many possible synthetic pathways in order to provide “go/no-go” decisions and recommend an ideal route. The quantitative scoring and experimental validation of the generated synthetic pathways will be a significant contribution to pushing the boundaries of computer-assisted synthesis. Ultimately, we will use these computer-generated insights to navigate the maze of natural product synthesis.
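
One very simple way to picture route ranking with a “go/no-go” decision is to multiply per-step success estimates and compare the product to a cutoff. The sketch below assumes the steps are independent and uses an arbitrary threshold; both are illustrative assumptions, the function names and numbers are hypothetical, and this is not the scoring method developed by C-CAS. Factors such as route length and cost are not modeled.

```python
from math import prod


def route_score(step_probabilities):
    """Probability that every step succeeds, assuming independent steps."""
    return prod(step_probabilities)


def rank_routes(routes, go_threshold=0.05):
    """Return (name, score, decision) tuples, most feasible route first."""
    scored = [(name, route_score(p)) for name, p in routes.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [
        (name, score, "go" if score >= go_threshold else "no-go")
        for name, score in scored
    ]


# Made-up per-step success estimates for three candidate routes.
candidate_routes = {
    "route_A": [0.9, 0.8, 0.7],
    "route_B": [0.95, 0.2, 0.2],
    "route_C": [0.9, 0.9],
}

ranking = rank_routes(candidate_routes)
recommended = ranking[0][0]
```

In this toy data, two low-probability steps are enough to push a route below the cutoff even though its first step is the most reliable of all.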


Publications of the Center for Computer Assisted Synthesis (C-CAS)

  1. Guo, Z., Yu, W., Zhang, C., Jiang, M. and Chawla, N.V. GraSeq: Graph and Sequence Fusion Learning for Molecular Property Prediction. Proc. 29th ACM Intl. Conf. Inf. Knowl. Manag. 2020, 435-443.

  2. Tang P, Jiang M, Xia BN, Pitera JW, Welser J, Chawla NV. Multi-label patent categorization with non-local attention-based graph convolutional network. Proc. AAAI Conf. Artif. Intell. 2020, 34, 9024-9031.

  3. Luchini, G., Alegre-Requena, J. V., Funes-Ardoiz, I., Paton, R.S. GoodVibes: Automated thermochemistry for heterogeneous computational chemistry data. F1000Research, 2020, 9, 291.

  4. Shields, B.J.; Stevens, J.; Li, J.; Parasram, M.; Damani, F.; Martinez Alvarado, J.; Janey, J.; Adams, R.P.; Doyle, A. Bayesian Reaction Optimization as a Tool for Chemical Synthesis. Nature 2021, 590, 89-96.

  5. Guo, Z., Zhang, C., Yu, W., Herr, J., Wiest, O., and Chawla, N.V. Few-Shot Graph Learning for Molecular Property Prediction. Proc. TheWebConf2021 2021, 2559-2567.

  6. Shen, Y., Borowski, J., Hardy, M., Sarpong, R., Doyle, A., Cernak, T. Automation and computer-assisted planning for chemical synthesis. Nat. Rev. Methods Primers 2021, 1, 23.

  7. Gallegos, L.C.; Luchini, G.; St John, P.C.; Kim, S.; Paton, R.S. Importance of Engineered and Learned Molecular Representations in Predicting Organic Reactivity, Selectivity, and Chemical Properties. Acc. Chem. Res. 2021, 54, 827-836.

  8. Żurański, A.M., Martinez Alvarado, J.I., Shields, B.J. and Doyle, A.G. Predicting Reaction Yields via Supervised Learning. Acc. Chem. Res. 2021, 54, 1856-1865.

  9. Kariofillis S, Jiang S, Żurański A, Gandhi S, Martinez Alvarado J, Doyle A. Using Data Science to Guide Aryl Bromide Substrate Scope Analysis in a Ni/Photoredox-Catalyzed Cross-Coupling with Acetals as Alcohol-Derived Radical Sources. J. Am. Chem. Soc. 2022, 144, ASAP.

  10. Christensen M, Yunker L, Adedeji F, Häse F, Roch L, Gensch T, dos Passos Gomes G, Zepel T, Sigman M, Aspuru-Guzik A, Hein J. Data-science driven autonomous process optimization. Commun. Chem. 2021, 4, 112.

  11. Gensch T, dos Passos Gomes G, Friederich P, Peters E, Gaudin T, Pollice R, et al. A Comprehensive Discovery Platform for Organophosphorus Ligands for Catalysis. J. Am. Chem. Soc. 2022, 144, ASAP.

  12. Newman-Stonebraker, S. H.; Smith, S. R.; Borowski, J. E.; Peters, E.; Gensch, T.; Johnson, H. C.; Sigman, M. S.; Doyle, A. G. Univariate classification of phosphine ligation state and reactivity in cross-coupling catalysis. Science 2021, 374, 301-308.

  13. Silva, J. D. J.; Bartalucci, N.; Jelier, B.; Grosslight, S.; Gensch, T.; Schünemann, C.; Müller, B.; Kamer, P. C.; Copéret, C.; Sigman, M. S. Development and Molecular Understanding of a Pd-catalyzed Cyanation of Aryl Boronic Acids Enabled by High-Throughput Experimentation and Data Analysis. Helv. Chim. Acta 2021, e2100200.

  14. Guan, Y.; Sowndarya, S.S.; Gallegos, L.C.; St. John, P.C.; Paton, R.S. Real-Time Prediction of 1H and 13C Chemical Shifts with DFT Accuracy Using a 3D Graph Neural Network. Chem. Sci. 2021, 12, 12012-12026.

  15. Saebi, M.; Nan, B.; Herr, J.; Wahlers, J.; Zuranski, A. M.; Kogej, T.; Norrby, P.-O.; Doyle, A. G.; Wiest, O.; Chawla, N. On the Use of Real-World Data Sets for Reaction Yield Prediction. ChemRxiv 2021, 10.33774/chemrxiv-2021-2x06r-v3.

  16. Zell D; Kingston C; Jermaks J; Smith S.R.; Seeger N; Wassmer J; Sirois, L.E.; Han, C.; Zhang, H.; Sigman, M.S.; Gosselin, F. Stereoconvergent and -divergent Synthesis of Tetrasubstituted Alkenes by Nickel-Catalyzed Cross-Couplings. J. Am. Chem. Soc. 2021, 143 (45), 19078-19090.

  17. Gensch, T.; Smith, S.R.; Colacot, T.J.; Timsina, Y.; Xu, G.; Glasspoole, B.W.; Sigman, M.S. Design and Application of a Screening Set for Monophosphine Ligands in Metal Catalysis. ChemRxiv 2021, 10.33774/chemrxiv-2021-fgm7v.

  18. Williams, W.L.; Zeng, L.; Gensch, T.; Sigman, M.S.; Doyle, A.G.; Anslyn, E. V. The Evolution of Data-Driven Modeling in Organic Chemistry. ACS Cent. Sci. 2021, 7, 1622-1637.

  19. Jones, K.E.; Park, B.; Doering, N.A.; Baik, M.H.; Sarpong, R. Rearrangements of the Chrysanthenol Core: Application to a Formal Synthesis of Xishacorene B. J. Am. Chem. Soc. 2021, 143, 20482-20490.

  20. Hardy, M.A.; Nan, B.; Wiest, O.; Sarpong, R. Strategic elements in computer-aided retrosynthesis: A case study of the pupukeanane natural products. Tetrahedron 2022, 103, 132584.