Publication:

Decoding Molecular Mysteries: A Chemical Language Model for Structure Elucidation from NMR Data

datacite.rights: restricted
dc.contributor.advisor: Zhong, Ellen
dc.contributor.author: Alauddin, Foyez
dc.date.accessioned: 2026-01-05T16:59:11Z
dc.date.available: 2026-01-05T16:59:11Z
dc.date.issued: 2025
dc.description.abstract: Elucidating molecular structures from nuclear magnetic resonance (NMR) spectra is a central yet time-intensive task in chemistry that is traditionally performed manually by chemists with extensive academic lab training. This thesis aims to advance an existing chemical language model approach to automatically infer molecular structures from routine 1H and 13C NMR spectra data. Building upon the language model methodology of Hu et al., we substitute their output of a SMILES string representation of a molecule with a fragment-centric, sequential SAFE string format in an attempt to improve the model’s generative capabilities and test its ability to assemble more complex molecules than explored by Hu et al. Towards this end, we train the model on multiple SMILES-based and SAFE-based datasets we curate, including SpectraBase (i.e., the dataset used by Hu et al.), 5mer (i.e., sequences of 5 amino acids), and NPAtlas (i.e., natural products). While the SAFE-based models achieve competitive and, in some cases, superior performance to the SMILES-based models, significant challenges remain, particularly with structurally complex natural products. These findings underscore the need for improved data representation and model architectures to perform the spectrum-to-structure elucidation task reliably.
dc.identifier.uri: https://theses-dissertations.princeton.edu/handle/88435/dsp01gx41mn32x
dc.language.iso: en_US
dc.title: Decoding Molecular Mysteries: A Chemical Language Model for Structure Elucidation from NMR Data
dc.type: Princeton University Senior Theses
dspace.entity.type: Publication
dspace.workflow.startDateTime: 2025-12-16T15:24:56.064Z
pu.contributor.authorid: 920279476
pu.date.classyear: 2025
pu.department: Computer Science

Files

Original bundle

Name: fa1073_written_final_report-1.pdf
Size: 8.23 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 100 B
Format: Item-specific license agreed to upon submission