Development of a machine learning framework for automatically extracting detailed synthesis procedures from organic chemistry articles

Many scientific papers in organic chemistry propose new chemical reactions, which are then registered in chemical reaction databases such as Reaxys. Although these databases record information such as reactants, catalysts and products, details of the synthetic procedure must be looked up in the original paper. In this study, the corpus OSPAR (Organic Synthesis Procedures with Argument Roles), which annotates operational procedures and their targets, was created as a framework for machine-based automatic extraction of detailed reaction procedures from scientific articles. The procedures were extracted from the procedure descriptions in articles published in Organic Syntheses, a trusted journal in the field of organic chemistry.

This corpus organizes the results of semantic analysis of the verbs and related nouns used in descriptions of experimental procedures, making it possible to distinguish between operations such as “add A to B” and “add B to A”. Additionally, deep learning analysis using this corpus demonstrated that such procedural details can be extracted from scientific papers that were not included in the training data.
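To illustrate why argument roles matter, the following minimal Python sketch represents an operation as a verb plus role-labelled arguments. The role names (`added_material`, `destination`), the `Operation` class and the toy `parse_add` function are hypothetical and chosen only for illustration; they are not the label set or the deep learning model used in OSPAR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """One procedural step with role-labelled arguments (hypothetical roles)."""
    verb: str
    added_material: str  # assumed role: what is being added
    destination: str     # assumed role: what it is added to


def parse_add(sentence: str) -> Operation:
    """Toy rule for sentences of the exact form 'Add X to Y'.

    This stands in for a trained model; it handles nothing else.
    """
    words = sentence.rstrip(".").split()
    if len(words) != 4 or words[0].lower() != "add" or words[2].lower() != "to":
        raise ValueError(f"unsupported sentence: {sentence!r}")
    return Operation(verb="add", added_material=words[1], destination=words[3])


def flat_view(op: Operation) -> frozenset:
    """Role-free view, comparable to an 'A+B' database entry."""
    return frozenset({op.added_material, op.destination})


op1 = parse_add("Add A to B")
op2 = parse_add("Add B to A")

# With roles, the two operations differ; without roles, they collapse.
roles_distinguish = op1 != op2
flat_is_ambiguous = flat_view(op1) == flat_view(op2)
```

In this sketch both `roles_distinguish` and `flat_is_ambiguous` evaluate to `True`: the role-labelled records keep the order of addition, while the role-free view cannot tell the two procedures apart.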

This foundational technology will enable chemical reaction databases to provide additional details of the synthetic procedure, rather than ambiguous “A+B” instructions. This information is expected to be useful to scientists planning follow-up experiments and detailed experimental procedures.

Details can be found in the original article by K. Machi, S. Akiyama, Y. Nagata and M. Yoshioka (DOI: 10.1021/acs.jcim.3c01449). This article is open access and can be viewed by anyone.