The goal of the Center for Computer Aided Synthesis (C-CAS) is to provide academic and nonacademic researchers with tools that significantly enhance the effectiveness of synthesis planning and optimization, allowing them to focus more on "what should be made and why" and less on "how to make it". We will build on recent methodological advances at the interface of machine learning and chemistry, as well as our collaboration with industrial partners, to create datasets that bring predictability to every individual step in a synthetic route.
C-CAS is organized into three research thrusts that will integrate datasets from a variety of sources, exploit this information to solve the “over-the-arrow” problem, and apply the lessons learned to the synthesis of complex molecules.
The central hypothesis of Thrust 1 of C-CAS is that better data lead to better performance by machine learning algorithms. However, the datasets currently used for reaction prediction are drawn mostly from the published literature. They are substantially biased towards successful reactions, are often incomplete and sparse, and do not exploit the wealth of information encoded in molecular structure. Therefore, it is critical to create “better data” to guide understanding and predictability in chemical synthesis using machine learning.
A goal of C-CAS is to generate heterogeneous knowledge graphs in which each node is a data source and each link is an observed or inferred relationship among these data sources about chemical reactions. These knowledge graphs integrate information from many different sources, including high-throughput experimentation, atomistic calculations, and unpublished datasets from the electronic notebooks of our industrial and academic collaborators, in addition to the scientific literature and patent databases.
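As a rough illustration of what such a heterogeneous knowledge graph could look like as a data structure, here is a minimal Python sketch. It is not the C-CAS implementation: the `Node` and `KnowledgeGraph` classes, the node kinds, the relation names, and the example records are all hypothetical and chosen only to show nodes as data sources and links as observed or inferred relationships.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A data source in the graph (hypothetical schema)."""
    name: str
    kind: str  # e.g. "literature", "hte", "calculation", "notebook"


@dataclass
class KnowledgeGraph:
    """Undirected graph whose links carry a relation label and an 'inferred' flag."""
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_link(self, source, target, relation, inferred=False):
        # Store the link in both directions so either node can be queried.
        self.edges[source].append((target, relation, inferred))
        self.edges[target].append((source, relation, inferred))

    def neighbors(self, node, include_inferred=True):
        # Return linked nodes, optionally keeping only observed relationships.
        return [
            target
            for target, _relation, inferred in self.edges.get(node, [])
            if include_inferred or not inferred
        ]


# Hypothetical example records, for illustration only.
lit = Node("literature_record_1", "literature")
hte = Node("hte_plate_1", "hte")
dft = Node("dft_calculation_1", "calculation")

kg = KnowledgeGraph()
kg.add_link(lit, hte, "same_reaction_class")             # observed link
kg.add_link(hte, dft, "shares_substrate", inferred=True)  # inferred link

print([n.name for n in kg.neighbors(hte)])
print([n.name for n in kg.neighbors(hte, include_inferred=False)])
```

Keeping the observed/inferred distinction on each link, as assumed here, lets a downstream model choose whether to trust only measured relationships or to include inferred ones as well.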
These knowledge graphs can then be used to determine the features relevant for a chemical reaction. This work will also explore best practices for the generation and acquisition of clean datasets from High Throughput Experimentation (HTE) data, which will help determine which experiments are most effective for obtaining predictive models that extrapolate beyond the dataset. Finally, this Thrust will also focus on developing new strategies to describe molecular structures. An important precursor to the application of any machine learning algorithm is an effective feature representation of molecules. These features and descriptors of molecular structures will be incorporated into the knowledge graph representation.
While there have been major recent advances in using ML for retrosynthetic planning and forward synthesis prediction, these advances represent necessary but not sufficient tools for the realization of computer-assisted synthesis planning. Reaction condition recommendation and optimization are essential elements of this long-term goal, since the selection of “above the arrow” conditions governs the success or failure of reactions conducted in the laboratory. Further, what counts as reaction success, and as “optimal”, depends on the context and spans many attributes, from high yield and selectivity to environmental impact and cost of components.
A goal of C-CAS is to develop active and transfer learning methods that disrupt the current practices in “over the arrow” prediction and optimization by integrating software engineering, computational chemistry, and ML tools with experimental methods in chemical synthesis. We will ask:
- Does ML-driven optimization outperform traditional optimization campaigns in synthetic chemistry?
- How can a workflow be developed that uses the fewest experiments possible to obtain predictive models?
- How can ML inform new reaction development through the interconnectedness of previously developed reactions?
These questions will be answered for selected examples of critical importance in pharmaceutical research and complex molecule synthesis, with an emphasis on identifying how each type of ML method should be applied. Research in this area will be highly collaborative, requiring the expertise of all C-CAS members and the extensive resources of our industrial affiliates.
Synthesis pathways for the preparation of complex molecules are not unlike mazes, which are replete with unexpected twists, turns, and dead ends. Even in cases where numerous viable routes lead from beginning to end, one of these pathways may be superior depending on the factors that one prioritizes (e.g., length or cost). Wouldn’t it be wonderful to have a way to predict a priori whether a path would lead successfully to the end, i.e., to make a “go/no-go” decision, and to determine which of many possible pathways one should pursue?
One goal of the C-CAS is to apply the workflows developed in the earlier thrusts by ranking the feasibility of many possible synthetic pathways in order to provide “go/no-go” decisions and recommend an ideal route. The quantitative scoring and experimental validation of the generated synthetic pathways will be a significant contribution to pushing the boundaries of computer-assisted synthesis. Ultimately, we will use these computer-generated insights to navigate the maze of natural product synthesis.
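To make the idea of ranking pathways and issuing “go/no-go” decisions concrete, here is a minimal Python sketch. It is an illustration under strong assumptions, not the C-CAS scoring method: it assumes that each step of a route already has a predicted success probability, that steps are independent so a route's score is the product of its step probabilities, and that a fixed threshold separates “go” from “no-go”. The function names, the route names, the probabilities, and the threshold of 0.2 are all hypothetical.

```python
from math import prod


def route_score(step_probabilities):
    """Score a route as the product of its per-step success probabilities.

    Assumes steps succeed or fail independently (a simplification).
    """
    return prod(step_probabilities)


def rank_routes(routes, go_threshold=0.2):
    """Rank routes by score, best first, and attach a go/no-go label.

    `routes` maps a route name to a list of per-step probabilities.
    `go_threshold` is an arbitrary illustrative cutoff.
    """
    scored = [(name, route_score(probs)) for name, probs in routes.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [
        (name, score, "go" if score >= go_threshold else "no-go")
        for name, score in scored
    ]


# Hypothetical routes with made-up step probabilities.
routes = {
    "route_a": [0.9, 0.8, 0.5],
    "route_b": [0.9, 0.9, 0.9, 0.9, 0.2],
    "route_c": [0.5, 0.5],
}

ranking = rank_routes(routes)
for name, score, decision in ranking:
    print(f"{name}: score={score:.3f} -> {decision}")
```

In this toy example a long route with one unreliable step scores below a short route with moderately reliable steps, which is the kind of trade-off a quantitative score is meant to expose. A realistic score would also weigh factors such as length and cost.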