Publication: Decoding Molecular Mysteries: A Chemical Language Model for Structure Elucidation from NMR Data
Abstract
Elucidating molecular structures from nuclear magnetic resonance (NMR) spectra is a central yet time-intensive task in chemistry, traditionally performed manually by chemists with extensive academic lab training. This thesis aims to advance an existing chemical language model approach for automatically inferring molecular structures from routine ¹H and ¹³C NMR spectral data. Building on the language model methodology of Hu et al., we replace their SMILES string output with the fragment-centric, sequential SAFE string format, in an attempt to improve the model’s generative capabilities and to test its ability to assemble more complex molecules than those explored by Hu et al. To this end, we train the model on multiple SMILES-based and SAFE-based datasets that we curate, including SpectraBase (the dataset used by Hu et al.), 5mer (sequences of five amino acids), and NPAtlas (natural products). While the SAFE-based models achieve competitive and, in some cases, superior performance relative to the SMILES-based models, significant challenges remain, particularly with structurally complex natural products. These findings underscore the need for improved data representations and model architectures to perform spectrum-to-structure elucidation reliably.