Survey on Hunter Heidenreich | ML Research Scientist

Materials Representations for ML Review

Mon, 06 Apr 2026 00:00:00 +0000

A Systematization of Material Representations

This paper is a Systematization that organizes and categorizes the strategies researchers use to convert solid-state materials into numerical representations suitable for machine learning models. Rather than proposing a new method, the review provides a structured taxonomy of existing approaches, connecting each to the practical constraints of data availability, computational cost, and prediction targets. It covers structural descriptors, graph-based learned representations, compositional features, transfer learning, and generative models for inverse design.

Why Material Representations Matter

Machine learning has enabled rapid property prediction for materials, but every ML pipeline depends on how the material is encoded as a numerical input. The authors identify three guiding principles for effective representations:

Similarity preservation: Similar materials should have similar representations, and dissimilar materials should diverge in representation space.
Domain coverage: The representation should be constructable for every material in the target domain.
Cost efficiency: Computing the representation should be cheaper than computing the target property directly (e.g., via DFT).

In practice, materials scientists face several barriers. Atomistic structures span diverse space groups, supercell sizes, and disorder parameters. Real material performance depends on defects, microstructure, and interfaces. Structural information often requires expensive experimental or computational effort to obtain. Datasets in materials science tend to be small, sparse, and biased toward well-studied systems.

Structural Descriptors: Local, Global, and Topological

The review covers three families of hand-crafted structural descriptors that encode atomic positions and types.

Local Descriptors

Local descriptors characterize the environment around each atom. Atom-centered symmetry functions (ACSF), introduced by Behler and Parrinello, define radial and angular functions:

$$ G_{i}^{1} = \sum_{j \neq i}^{\text{neighbors}} e^{-\eta(R_{ij} - R_{s})^{2}} f_{c}(R_{ij}) $$

$$ G_{i}^{2} = 2^{1-\zeta} \sum_{j,k \neq i}^{\text{neighbors}} (1 + \lambda \cos \theta_{ijk})^{\zeta} e^{-\eta(R_{ij}^{2} + R_{ik}^{2} + R_{jk}^{2})} f_{c}(R_{ij}) f_{c}(R_{ik}) f_{c}(R_{jk}) $$
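As a minimal sketch of the radial function $G_{i}^{1}$ above, the snippet below evaluates it for a toy cluster. The cutoff function $f_{c}$ is not defined in this summary, so the cosine form used here is an assumption, as are the function names and parameter values.

```python
import math

def cosine_cutoff(r, r_c):
    """Assumed form of f_c: decays smoothly from 1 at r = 0 to 0 at r = r_c."""
    if r >= r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0)

def radial_symmetry_function(positions, i, eta, r_s, r_c):
    """G_i^1: Gaussian-weighted, cutoff-damped count of neighbors of atom i."""
    total = 0.0
    for j, pos_j in enumerate(positions):
        if j == i:
            continue  # the sum runs over j != i
        r_ij = math.dist(positions[i], pos_j)
        total += math.exp(-eta * (r_ij - r_s) ** 2) * cosine_cutoff(r_ij, r_c)
    return total

# Toy cluster: two neighbors at distance 1.0 and one atom beyond the cutoff.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 10.0)]
g1 = radial_symmetry_function(atoms, i=0, eta=1.0, r_s=1.0, r_c=4.0)
```

Because the function depends only on interatomic distances, rotating or translating the cluster leaves `g1` unchanged, which is the invariance these descriptors are designed to provide.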

The Smooth Overlap of Atomic Positions (SOAP), proposed by Bartók et al., defines atomic neighborhood density as a sum of Gaussians and computes a rotationally invariant kernel through expansion in radial functions and spherical harmonics:

$$ \rho_{i}(\mathbf{r}) = \sum_{j} \exp\left(-\frac{|\mathbf{r} - \mathbf{r}_{ij}|^{2}}{2\sigma^{2}}\right) = \sum_{nlm} c_{nlm} g_{n}(\mathbf{r}) Y_{lm}(\hat{\mathbf{r}}) $$

The power spectrum $\mathbf{p}(\mathbf{r}) \equiv \sum_{m} c_{nlm}(c_{n'lm})^{*}$ serves as a vector descriptor of the local environment. SOAP has seen wide adoption both as a similarity metric and as input to ML models.

Voronoi tessellation provides another local approach, segmenting space into cells and extracting features like effective coordination numbers, cell volumes, and neighbor properties.

Global Descriptors

Global descriptors encode the full structure. The Coulomb matrix models electrostatic interactions between atoms:

$$ M_{i,j} = \begin{cases} 0.5\, Z_{i}^{2.4} & \text{for } i = j \\ \frac{Z_{i}Z_{j}}{|r_{i} - r_{j}|} & \text{for } i \neq j \end{cases} $$
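A short sketch of how such a matrix could be assembled from nuclear charges and positions follows. The function name and the `diag_prefactor` argument are illustrative; the default of 0.5 follows the commonly cited Coulomb matrix definition, and the H2 geometry (bond length 1.4 in atomic units) is only a toy input.

```python
import math

def coulomb_matrix(charges, positions, diag_prefactor=0.5):
    """Build the Coulomb matrix from nuclear charges Z_i and Cartesian positions r_i."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal: self-interaction term, prefactor * Z_i^2.4
                matrix[i][j] = diag_prefactor * charges[i] ** 2.4
            else:
                # Off-diagonal: pairwise nuclear repulsion Z_i Z_j / |r_i - r_j|
                distance = math.dist(positions[i], positions[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix

# Toy input: H2 with an illustrative bond length of 1.4 (atomic units).
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
```

The matrix is symmetric but depends on atom ordering, so practical pipelines usually sort rows or use the eigenvalue spectrum before feeding it to a model.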

Other global methods include partial radial distribution functions (PRDF), the many-body tensor representation (MBTR), and cluster expansions. The Atomic Cluster Expansion (ACE) framework generalizes cluster expansions to continuous environments and has become a foundation for modern deep learning potentials.

Topological Descriptors

Persistent homology from topological data analysis (TDA) identifies geometric features at multiple length scales. Topological descriptors capture pore geometries in porous materials and have outperformed traditional structural descriptors for predicting CO$_{2}$ adsorption in metal-organic frameworks and methane storage in zeolites. A caveat is the $O(N^{3})$ worst-case computational cost per filtration.

Crystal Graph Neural Networks

Graph neural networks bypass manual feature engineering by learning representations directly from structural data. Materials are converted to graphs $G(V, E)$ where nodes represent atoms and edges connect neighbors within a cutoff radius, with periodic boundary conditions.
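The graph construction described above can be sketched as follows. This is an illustrative neighbor search, not the procedure of any specific model: it scans only the nearest periodic images (`max_image=1`), which is an assumption that holds when the cutoff is no larger than the cell dimensions.

```python
import itertools
import math

def periodic_edges(lattice, frac_coords, cutoff, max_image=1):
    """Return (i, j, image, distance) for each neighbor pair within `cutoff`,
    counting periodic images of atom j. Rows of `lattice` are lattice vectors."""
    def to_cartesian(frac):
        return tuple(sum(frac[k] * lattice[k][d] for k in range(3)) for d in range(3))

    edges = []
    shifts = range(-max_image, max_image + 1)
    for i, frac_i in enumerate(frac_coords):
        r_i = to_cartesian(frac_i)
        for j, frac_j in enumerate(frac_coords):
            for image in itertools.product(shifts, repeat=3):
                if i == j and image == (0, 0, 0):
                    continue  # an atom is not its own neighbor
                shifted = tuple(frac_j[k] + image[k] for k in range(3))
                distance = math.dist(r_i, to_cartesian(shifted))
                if distance <= cutoff:
                    edges.append((i, j, image, distance))
    return edges

# Simple cubic cell with one atom and a = 1.0: all six neighbors are periodic images.
cubic = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
edges = periodic_edges(cubic, [(0.0, 0.0, 0.0)], cutoff=1.1)
```

Storing the image index on each edge keeps periodic copies distinguishable, so a single-atom cell still yields a graph with six edges rather than none.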

Key architectures discussed include:

Model	Key Innovation
CGCNN	Crystal graph convolutions for broad property prediction
MEGNet	Materials graph networks with global state attributes
ALIGNN	Line graph neural networks incorporating three-body angular features
Equivariant GNNs	E(3)-equivariant message passing for tensorial properties

The review identifies several limitations. Graph convolutions based on local neighborhoods can fail to capture long-range interactions or periodicity-dependent properties (e.g., lattice parameters, phonon spectra). Strategies to address this include concatenation with hand-tuned descriptors, plane-wave periodic basis modulation, and reciprocal-space features.

A major practical restriction is the requirement for relaxed atomic positions. Graphs built from unrelaxed crystal prototypes lose information about geometric distortions, degrading accuracy. Approaches to mitigate this include data augmentation with perturbed structures, Bayesian optimization of prototypes, and surrogate force-field relaxation.

Equivariant models that introduce higher-order tensors to node and edge features, constrained to transform correctly under E(3) operations, achieve state-of-the-art accuracy and can match structural descriptor performance even in low-data (~100 datapoints) regimes.

Compositional Descriptors Without Structure

When crystal structures are unavailable, representations can be built purely from stoichiometry and tabulated atomic properties (radii, electronegativity, valence electrons). Despite their simplicity, these methods have distinct advantages: zero computational overhead, accessibility to non-experts, and robustness for high-throughput screening.

Key methods include:

MagPie: 145 input features derived from elemental properties
SISSO: Compressive sensing over algebraic combinations of atomic properties, capable of discovering interpretable descriptors (e.g., a new tolerance factor $\tau$ for perovskite stability)
ElemNet: Deep neural network using only fractional stoichiometry as input, outperforming MagPie with >3,000 training points
ROOST: Fully-connected compositional graph with attention-based message passing, achieving strong performance with only hundreds of examples
CrabNet: Self-attention on element embeddings with fractional encoding, handling dopant-level concentrations via log-scale inputs

Compositional models cannot distinguish polymorphs and generally underperform structural approaches. They are most valuable when atomistic resolution is unavailable.
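A hedged sketch of composition-only featurization in the spirit of elemental-property statistics: the feature set (mean, min, max, range, average deviation) and the small Pauling electronegativity table are illustrative choices, not the actual MagPie feature list.

```python
# Pauling electronegativities for a few elements (illustrative lookup table).
ELECTRONEGATIVITY = {"H": 2.20, "O": 3.44, "Na": 0.93, "Cl": 3.16, "Ti": 1.54}

def composition_features(composition, table=ELECTRONEGATIVITY):
    """Fraction-weighted statistics of one elemental property for a composition
    given as {element: amount}."""
    total = sum(composition.values())
    fractions = {el: amount / total for el, amount in composition.items()}
    values = [table[el] for el in fractions]
    mean = sum(frac * table[el] for el, frac in fractions.items())
    avg_dev = sum(frac * abs(table[el] - mean) for el, frac in fractions.items())
    return {
        "mean": mean,
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "avg_dev": avg_dev,
    }

nacl = composition_features({"Na": 1, "Cl": 1})
tio2_formula_unit = composition_features({"Ti": 1, "O": 2})
tio2_larger_cell = composition_features({"Ti": 4, "O": 8})  # same stoichiometry
```

Both TiO2 inputs produce the same feature vector, which illustrates why such models cannot separate polymorphs: any two structures sharing a stoichiometry are indistinguishable to them.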

Defects, Surfaces, and Grain Boundaries

The review extends beyond idealized unit cells to practical materials challenges:

Point defects: Representations of the pristine bulk can predict vacancy formation energies through linear relationships with band structure descriptors. Frey et al. proposed using relative differences between defect and parent structure properties, requiring no DFT on the defect itself.

Surfaces and catalysis: Binding energy prediction for catalysis requires representations beyond the bulk unit cell. The d-band center for metals and oxygen 2p-band center for metal oxides serve as simple electronic descriptors, following the Sabatier principle that optimal catalytic activity requires intermediate binding strength. Graph neural networks trained on the Open Catalyst 2020 dataset (>1 million DFT energies) have enabled broader screening, though errors remain high for certain adsorbates and non-metallic surfaces.

Grain boundaries: SOAP descriptors computed for atoms near grain boundaries and clustered into local environment classes can predict grain boundary energy, mobility, and shear coupling. This approach provides interpretable structure-property relationships.

Transfer Learning Across Representations

When target datasets are small, transfer learning leverages representations learned from large, related datasets. The standard procedure involves: (1) pretraining on a large dataset (e.g., all Materials Project formation energies), (2) freezing parameters up to a chosen depth, and (3) either fine-tuning remaining layers or extracting features for a separate model.
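Steps (2) and (3) can be illustrated with a deliberately tiny example. Everything here is hypothetical: the "pretrained" weights, the four-point target dataset, and the two-layer model stand in for a real network pretrained on a large source dataset.

```python
import math

# Hypothetical pretrained parameters; in practice these come from step (1).
pretrained = {
    "hidden": [[0.8, -0.3], [0.1, 0.9]],  # frozen feature extractor (2 inputs -> 2 units)
    "head": [0.0, 0.0],                   # output layer, re-trained on the target task
}

def features(x, hidden):
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in hidden]

def predict(x, params):
    return sum(w * h for w, h in zip(params["head"], features(x, params["hidden"])))

def mse(data, params):
    return sum((predict(x, params) - y) ** 2 for x, y in data) / len(data)

def fine_tune_head(data, params, lr=0.1, steps=200):
    """Gradient descent on the head only; the hidden layer is never updated."""
    head = list(params["head"])
    for _ in range(steps):
        grad = [0.0] * len(head)
        for x, y in data:
            h = features(x, params["hidden"])
            err = sum(w * hi for w, hi in zip(head, h)) - y
            for k, h_k in enumerate(h):
                grad[k] += 2.0 * err * h_k / len(data)
        head = [w - lr * g for w, g in zip(head, grad)]
    return {"hidden": params["hidden"], "head": head}

# Small illustrative target dataset of (input, label) pairs.
target_data = [((1.0, 0.0), 0.5), ((0.0, 1.0), -0.2), ((1.0, 1.0), 0.4), ((0.5, -0.5), 0.3)]
loss_before = mse(target_data, pretrained)
tuned = fine_tune_head(target_data, pretrained)
loss_after = mse(target_data, tuned)
```

The feature-extraction variant of step (3) corresponds to calling `features` on the frozen layer and passing the result to any separate model instead of the linear head.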

Key findings from the review:

Transfer learning is most effective when the source dataset is orders of magnitude larger than the target
Physically related tasks transfer better (e.g., Open Catalyst adsorption energies transfer well to new adsorbates, less so to unrelated small molecules)
Earlier neural network layers learn more general representations and transfer better across properties
Multi-depth feature extraction, combining activations from multiple layers, can improve transfer
Predictions from surrogate models can serve as additional descriptors, expanding screening domains by orders of magnitude

Generative Models for Crystal Inverse Design

Generative models for solid-state materials face challenges beyond molecular generation: more diverse atomic species, the need to specify both positions and lattice parameters, non-unique definitions (rotations, translations, supercell scaling), and large unit cells (>100 atoms for zeolites and MOFs).

The review traces the progression of approaches:

Voxel representations: Discretize unit cells into volume elements. Early work (iMatGen, Court et al.) demonstrated feasibility but was restricted to specific chemistries or cubic systems.
Continuous coordinate models: Point cloud and invertible representations allowed broader chemical spaces but lacked symmetry invariances.
Symmetry-aware models: Crystal Diffusion VAE (CDVAE) uses periodic graphs and SE(3)-equivariant message passing for translationally and rotationally invariant generation, establishing benchmark tasks for the field.
Constrained models for porous materials: Approaches like SmVAE represent MOFs through their topological building blocks (RFcodes), ensuring all generated structures are physically valid.

Open Problems and Future Directions

The review highlights four high-impact open questions:

Local vs. global descriptor trade-offs: Local descriptors (SOAP) excel for short-range interactions but struggle with long-range physics. Global descriptors model periodicity but lack generality across space groups. Combining local and long-range features could provide more universal models.
Prediction from unrelaxed prototypes: ML force fields can relax structures at a fraction of DFT cost, potentially expanding screening domains. Key questions remain about required training data scale and generalizability.
Applicability of compositional descriptors: The performance gap between compositional and structural models may be property-dependent, being smaller for properties like band gap that depend on global features rather than local site energies.
Extensions of generative models: Diffusion-based architectures have improved on voxel approaches for small unit cells, but extending to microstructure, dimensionality, and surface generation remains open.

Reproducibility Details

This paper is a review and does not present new experimental results or release any novel code, data, or models. The paper is open-access (hybrid OA at Annual Reviews) and the arXiv preprint is freely available. The following artifacts table covers key publicly available resources discussed in the review.

Artifacts

Artifact	Type	License	Notes
arXiv preprint (2301.08813)	Other	arXiv (open access)	Free preprint version
Materials Project	Dataset	CC-BY-4.0	DFT energies, band gaps, structures for >100,000 compounds
OQMD	Dataset	CC-BY-4.0	Open Quantum Materials Database, >600,000 DFT entries
Open Catalyst 2020 (OC20)	Dataset	CC-BY-4.0	>1,000,000 DFT surface adsorption energies
AFLOW	Dataset	Public	High-throughput ab initio library, >3,000,000 entries
Matminer	Code	BSD	Open-source toolkit for materials data mining and featurization

Algorithms

The review covers: ACSF, SOAP, Voronoi tessellation, Coulomb matrices, PRDF, MBTR, cluster expansions, ACE, persistent homology, CGCNN, MEGNet, ALIGNN, E(3)-equivariant GNNs, MagPie, SISSO, ElemNet, ROOST, CrabNet, VAE, GAN, and diffusion-based crystal generators.

Hardware

No new experiments are conducted. Hardware requirements vary by the referenced methods (DFT calculations require HPC; GNN training typically requires 1-8 GPUs).

Reproducibility Status

Partially Reproducible: The review paper itself is open-access. All major datasets discussed (Materials Project, OQMD, OC20, AFLOW) are publicly available under permissive licenses. Most referenced model implementations (CGCNN, MEGNet, ALIGNN, ROOST, CDVAE) have open-source code. No novel artifacts are released by the authors.

Paper Information

Citation: Damewood, J., Karaguesian, J., Lunger, J. R., Tan, A. R., Xie, M., Peng, J., & Gómez-Bombarelli, R. (2023). Representations of Materials for Machine Learning. Annual Review of Materials Research, 53. https://doi.org/10.1146/annurev-matsci-080921-085947

Publication: Annual Review of Materials Research, 2023

@article{damewood2023representations,
  title={Representations of Materials for Machine Learning},
  author={Damewood, James and Karaguesian, Jessica and Lunger, Jaclyn R. and Tan, Aik Rui and Xie, Mingrou and Peng, Jiayu and G{\'o}mez-Bombarelli, Rafael},
  journal={Annual Review of Materials Research},
  volume={53},
  year={2023},
  doi={10.1146/annurev-matsci-080921-085947}
}

Transformers and LLMs for Chemistry and Drug Discovery

Sat, 28 Mar 2026 00:00:00 +0000

A Systematization of Transformers in Chemistry

This book chapter by Bran and Schwaller is a Systematization paper that organizes the growing body of work applying transformer architectures to chemistry and drug discovery. Rather than proposing a new method, the authors trace a three-stage evolution: (1) task-specific single-modality models operating on SMILES and reaction strings, (2) multimodal models bridging molecular representations with spectra, synthesis actions, and natural language, and (3) large language models and LLM-powered agents capable of general chemical reasoning.

Why Transformers for Chemistry?

The authors motivate the review by drawing analogies between natural language and chemical language. Just as text can be decomposed into subwords and tokens, molecules can be linearized into SMILES or SELFIES strings, and chemical reactions can be encoded as reaction SMILES. This structural parallel enabled direct transfer of transformer architectures, originally designed for machine translation, to chemical prediction tasks.

Several factors accelerated this adoption:

The publication of open chemical databases and benchmarks (e.g., MoleculeNet, Open Reaction Database, Therapeutics Data Commons)
Improvements in compute infrastructure and training algorithms
The success of attention mechanisms at capturing context-dependent relationships, which proved effective for learning chemical grammar and atom-level correspondences

The review positions the transformer revolution in chemistry as a natural extension of NLP advances, noting that the gap between chemical and natural language is progressively closing.

Molecular Representations as Language

A key section of the review covers text-based molecular representations that make transformer applications possible:

SMILES (Simplified Molecular Input Line Entry System): The dominant linearization scheme since the 1980s, encoding molecular graphs as character sequences with special symbols for bonds, branches, and rings.
SELFIES (Self-Referencing Embedded Strings): A newer representation that guarantees every string maps to a valid molecule, addressing the robustness issues of SMILES in generative settings.
Reaction SMILES: Extends molecular representations to encode full chemical reactions in the format “A.B > catalyst.reagent > C.D”, enabling reaction prediction as a sequence-to-sequence task.

The authors note that while IUPAC names, InChI, and DeepSMILES exist as alternatives, SMILES and SELFIES dominate practical applications.
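To make the reaction SMILES layout concrete, the sketch below splits a string into its three fields. The function name is illustrative, and the parser is intentionally naive: splitting on "." also separates the ions of a salt, which a real pipeline would need to handle.

```python
def parse_reaction_smiles(reaction):
    """Split 'reactants>agents>products' into lists of SMILES strings."""
    parts = reaction.split(">")
    if len(parts) != 3:
        raise ValueError("expected the form 'reactants>agents>products'")
    reactants, agents, products = (
        [s.strip() for s in part.split(".") if s.strip()] for part in parts
    )
    return {"reactants": reactants, "agents": agents, "products": products}

# Esterification of acetic acid with ethanol; the agents field is left empty.
esterification = parse_reaction_smiles("CC(=O)O.OCC>>CC(=O)OCC.O")
```

Framing forward prediction as translation then amounts to using the reactant and agent fields as the source sequence and the product field as the target sequence.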

Stage 1: Task-Specific Transformer Models

The first stage of transformer adoption focused on clearly defined chemical tasks, with models trained on a single data modality (molecular strings).

Chemical Translation Tasks

The encoder-decoder architecture was directly applied to tasks framed as translation:

Molecular Transformer (Schwaller et al.): Treated reaction prediction as translation from reactant SMILES to product SMILES, becoming a leading method for forward synthesis prediction.
Retrosynthetic planning: The reverse task, predicting reactants from products, with iterative application to construct full retrosynthetic trees mapping to commercially available building blocks.
Chemformer (Irwin et al.): A pre-trained model across multiple chemical tasks, offering transferability to new applications with improved performance.
Graph-to-sequence models (Tu and Coley): Used a custom graph encoder with a transformer decoder, achieving improvements through permutation-invariant molecular graph encoding.

Representation Learning and Feature Extraction

Encoder-only transformers proved valuable for generating molecular and reaction embeddings:

Reaction representations (Wang et al., SMILES-BERT): Trained models to generate reaction vectors that outperformed hand-engineered features on downstream regression tasks.
Reaction classification (Schwaller et al.): Replaced the decoder with a classification layer to map chemical reactions by class, revealing clustering patterns by reaction type, data source, and molecular properties.
Yield prediction: Regression heads attached to encoders achieved strong results on high-throughput experimentation datasets.
Protein language models (Rives et al., ESM): Trained on 250 million protein sequences using unsupervised learning, achieving strong performance on protein property prediction and structure forecasting.
RXNMapper (Schwaller et al.): A notable application where attention weight analysis revealed that transformers internally learn atom-to-atom mappings in chemical reactions, leading to an open-source atom mapping algorithm that outperformed existing approaches.

Stage 2: Multimodal Chemical Models

The second stage extended transformers beyond molecular strings to incorporate additional data types:

Molecular captioning: Describing molecules in natural language, covering scaffolds, sources, drug interactions, and other features (Edwards et al.).
Bidirectional molecule-text conversion: Models capable of generating molecules from text queries and performing molecule-to-molecule tasks (Christofidellis et al.).
Experimental procedure prediction: Generating actionable synthesis steps from reaction SMILES (Vaucher et al.), bridging the gap between retrosynthetic planning and laboratory execution.
Structural elucidation from IR spectra: Encoding IR spectra as text sequences alongside chemical formulas, then predicting SMILES from these inputs (Alberts et al.), achieving 45% accuracy in structure prediction and surpassing prior approaches for functional group identification.

Stage 3: Large Language Models and Chemistry Agents

The most recent stage builds on foundation models pre-trained on vast text corpora, adapted for chemistry through fine-tuning and in-context learning.

Scaling Laws and Emergent Capabilities

The authors discuss how model scaling leads to emergent capabilities relevant to chemistry:

Below certain compute thresholds, model performance on chemistry tasks appears random.
Above critical sizes, sudden improvements emerge, along with capabilities like chain-of-thought (CoT) reasoning and instruction following.
These emergent abilities enable chemistry tasks that require multi-step reasoning without explicit training on chemical data.

LLMs as Chemistry Tools

Key applications of LLMs in chemistry include:

Fine-tuning for low-data chemistry (Jablonka et al.): GPT-3 fine-tuned on limited chemistry datasets performed comparably to, and sometimes exceeded, specialized models with engineered features for tasks like predicting transition wavelengths and phase classification.
In-context learning: Providing LLMs with a few examples enables prediction on chemistry tasks without any parameter updates, particularly valuable when data is scarce.
Bayesian optimization with LLMs (Ramos et al.): Using GPT models for uncertainty-calibrated regression, enabling catalyst and molecular optimization directly from synthesis procedures without feature engineering.
3D structure generation (Flam-Shepherd and Aspuru-Guzik): Using language models to generate molecular structures with three-dimensional atomic positions in XYZ, CIF, and PDB formats, matching graph-based algorithms while overcoming representation limitations.

LLM-Powered Chemistry Agents

The review highlights the agent paradigm as the most impactful recent development:

14 LLM use-cases (Jablonka et al.): A large-scale collaborative effort demonstrating applications from computational tool wrappers to reaction optimization assistants and scientific question answering.
ChemCrow (Bran, Cox et al.): An LLM-powered agent equipped with curated computational chemistry tools, capable of planning and executing tasks across drug design, materials design, and synthesis. ChemCrow demonstrated that tool integration overcomes LLM hallucination issues by grounding responses in reliable data sources.
Autonomous scientific research (Boiko et al.): Agent systems designed with a focus on operating cloud laboratories.

The agent paradigm offers tool composability through natural language interfaces, allowing users to chain multiple computational tools into custom pipelines.

Outlook and Limitations

The authors identify several themes for the future:

The three stages represent increasing generality, from task-specific single-modality models to open-ended agents.
Natural language interfaces are progressively closing the gap between chemical and human language.
Tool integration through agents provides grounding that mitigates hallucination, a known limitation of direct LLM application to chemistry.
The review acknowledges that LLMs have a “high propensity to generate false and inaccurate content” on chemical tasks, making tool-augmented approaches preferable to direct application.

The chapter does not provide quantitative benchmarks or systematic comparisons across the methods discussed, as its goal is to organize the landscape rather than evaluate individual methods.

Reproducibility Details

This is a review/survey chapter and does not introduce new models, datasets, or experiments. The reproducibility assessment applies to the referenced works rather than the review itself.

Key Referenced Resources

Several open-source tools and datasets discussed in the review are publicly available:

Artifact	Type	License	Notes
RXNMapper	Code	MIT	Attention-based atom mapping
ChemCrow	Code	MIT	LLM-powered chemistry agent
MoleculeNet	Dataset	Various	Molecular ML benchmarks
Open Reaction Database	Dataset	CC-BY-SA-4.0	Curated reaction data
Therapeutics Data Commons	Dataset	MIT	Drug discovery ML datasets

Reproducibility Classification

Not applicable (review paper). Individual referenced works range from Highly Reproducible (open-source models like RXNMapper, ChemCrow) to Partially Reproducible (some models without released code) to Closed (proprietary LLMs like GPT-3/GPT-4 used in fine-tuning studies).

Paper Information

Citation: Bran, A. M., & Schwaller, P. (2024). Transformers and Large Language Models for Chemistry and Drug Discovery. In Drug Development Supported by Informatics (pp. 143-163). Springer Nature Singapore. https://doi.org/10.1007/978-981-97-4828-0_8

@incollection{bran2024transformers,
  title={Transformers and Large Language Models for Chemistry and Drug Discovery},
  author={Bran, Andres M. and Schwaller, Philippe},
  booktitle={Drug Development Supported by Informatics},
  pages={143--163},
  year={2024},
  publisher={Springer Nature Singapore},
  doi={10.1007/978-981-97-4828-0_8}
}

Transformers for Molecular Property Prediction Review

Thu, 26 Mar 2026 00:00:00 +0000

A Systematization of Transformers for Molecular Property Prediction

This is a Systematization paper. Sultan et al. provide the first comprehensive, structured review of sequence-based transformer models applied to molecular property prediction (MPP). The review catalogs 16 models published between 2019 and 2023, organizes them by architecture type (encoder-decoder, encoder-only, decoder-only), and systematically examines seven key design decisions that arise when building a transformer for MPP. The paper’s primary contribution is identifying gaps in current evaluation practices and articulating what standardization the field needs for meaningful progress.

The Problem: Inconsistent Evaluation Hinders Progress

Molecular property prediction is essential for drug discovery, crop protection, and environmental science. Deep learning approaches, including transformers, have been increasingly applied to this task by learning molecular representations from string notations like SMILES and SELFIES. However, the field faces several challenges:

Small labeled datasets: Labeled molecular property datasets typically contain only hundreds or thousands of molecules, making supervised learning alone insufficient.
No standardized evaluation protocol: Different papers use different data splits (scaffold vs. random), different splitting implementations, different numbers of repetitions (3 to 50), and sometimes do not share their test sets. This makes direct comparison across models infeasible.
Unclear design choices: With many possible configurations for pre-training data, chemical language, tokenization, positional embeddings, model size, pre-training objectives, and fine-tuning approaches, the field lacks systematic analyses to guide practitioners.

The authors note that standard machine learning methods with fixed-size molecular fingerprints remain strong baselines for real-world datasets, illustrating that the promise of transformers for MPP has not yet been fully realized.

Seven Design Questions for Molecular Transformers

The central organizing framework of this review addresses seven questions practitioners must answer when building a transformer for MPP. For each, the authors synthesize findings across the 16 reviewed models.

Reviewed Models

The paper catalogs 16 models organized by architecture:

Architecture	Base Model	Models
Encoder-Decoder	Transformer, BART	ST, Transformer-CNN, X-Mol, ChemFormer
Encoder-Only	BERT	SMILES-BERT, MAT, MolBERT, Mol-BERT, Chen et al., K-BERT, FP-BERT, MolFormer
Encoder-Only	RoBERTa	ChemBERTa, ChemBERTa-2, SELFormer
Decoder-Only	XLNet	Regression Transformer (RT)

The core attention mechanism shared by all these models is the scaled dot-product attention:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V $$

where $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_{k}$ is the dimension of the key vectors.
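The formula can be written out directly in a few lines. This is a single-head, unbatched sketch in plain Python for readability; the helper names and toy matrices are illustrative, and real models use tensor libraries with learned projections for $Q$, $K$, and $V$.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the attention weights."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Two queries attending over three key/value pairs (toy numbers).
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, weights = attention(q, k, v)
```

Each row of `weights` sums to one, so every output row is a convex combination of the value vectors, weighted by how well the query matches each key.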

Question 1: Which Database and How Many Molecules?

Pre-training data sources vary considerably. The three main databases are ZINC (37 billion molecules in ZINC22), ChEMBL (2.4 million unique molecules with 20 million bioactivity measurements), and PubChem (111 million unique molecules). Pre-training set sizes ranged from 900K (ST on ChEMBL) to 1.1B molecules (MolFormer on ZINC + PubChem).

Model	Database	Size	Language
ST	ChEMBL	900K	SMILES
MolBERT	ChEMBL (GuacaMol)	1.6M	SMILES
ChemBERTa	PubChem	100K-10M	SMILES, SELFIES
ChemBERTa-2	PubChem	5M-77M	SMILES
MAT	ZINC	2M	List of atoms
MolFormer	ZINC + PubChem	1.1B	SMILES
Chen et al.	C, CP, CPZ	2M-775M	SMILES

A key finding is that larger pre-training datasets do not consistently improve downstream performance. MolFormer showed minimal difference between models trained on 100M vs. 1.1B molecules. ChemBERTa-2 found that the model trained on 5M molecules using MLM performed comparably to 77M molecules for BBBP (both around 0.70 ROC-AUC). Chen et al. reported comparable $R^{2}$ values of $0.925 \pm 0.01$, $0.917 \pm 0.012$, and $0.915 \pm 0.01$ for ESOL across datasets of 2M, 103M, and 775M molecules, respectively. The data composition and covered chemical space appear to matter more than raw size.

Question 2: Which Chemical Language?

Most models use SMILES. ChemBERTa, RT, and SELFormer also explored SELFIES. MAT uses a simple list of atoms with structural features, while Mol-BERT and FP-BERT use circular fingerprints.

Direct comparisons between SMILES and SELFIES (by ChemBERTa on Tox21 SR-p53 and RT for drug-likeness prediction) found no significant performance difference. The RT authors reported that SELFIES models performed approximately $0.004 \pm 0.01$ better on RMSE, while SMILES models performed approximately $0.004 \pm 0.01$ better on Pearson correlation. The choice of chemical language does not appear to be a major factor in prediction performance, and even non-string representations (atom lists in MAT, fingerprints in Mol-BERT) perform competitively.

Question 3: How to Tokenize?

Tokenization methods span atom-level (42-66 vocabulary tokens), regex-based (47-2,362 tokens), BPE (509-52K tokens), and substructure-based (3,357-13,325 tokens) approaches. No systematic comparison of tokenization strategies exists in the literature. The vocabulary size varied dramatically, from 42 tokens for MolBERT to over 52K for ChemBERTa. The authors argue that chemically meaningful tokenization (e.g., functional group-based fragmentation) could improve both performance and explainability.
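As an illustration of the regex-based end of this spectrum, the sketch below tokenizes SMILES with a simplified pattern. The pattern is an assumption written for this example, not the one used by any reviewed model: it keeps bracket atoms, two-letter halogens, and two-digit ring closures as single tokens.

```python
import re

# Simplified, illustrative SMILES token pattern (order of alternatives matters).
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#\-+\\/:.~@*$>]")

def tokenize(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens

def build_vocab(corpus):
    """Collect the sorted set of distinct tokens seen in a corpus."""
    return sorted({token for smiles in corpus for token in tokenize(smiles)})

aspirin_tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")
vocab = build_vocab(["CC(=O)Oc1ccccc1C(=O)O", "C[NH3+]", "ClCBr"])
```

Vocabularies built this way stay small because most tokens are single atoms or symbols, whereas BPE or substructure schemes merge frequent fragments into many more entries.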

Question 4: How to Add Positional Embeddings?

Most models inherited the absolute positional embedding from their NLP base models. MolBERT and RT adopted relative positional embeddings. MolFormer combined absolute and Rotary Positional Embedding (RoPE). MAT incorporated spatial information (inter-atomic 3D distances and adjacency) alongside self-attention.

MolFormer’s comparison showed that RoPE became superior to absolute embeddings only when the pre-training dataset was very large. The performance difference (MAE on QM9, with positive values favoring RoPE) between absolute and RoPE embeddings for models trained on 111K, 111M, and 1.1B molecules was approximately $-0.20 \pm 0.18$, $-0.44 \pm 0.22$, and $0.27 \pm 0.12$, respectively.

The authors highlight that SMILES and SELFIES are linearizations of a 2D molecular graph, so consecutive tokens in a sequence are not necessarily spatially close. Positional embeddings that reflect 2D or 3D molecular structure remain underexplored.
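For readers unfamiliar with rotary embeddings, the following sketch shows the general RoPE construction: each pair of vector components is rotated by an angle proportional to the token position. It illustrates the generic technique only, not MolFormer's specific implementation, and the vectors and positions are arbitrary.

```python
import math

def rope(vector, position, base=10000.0):
    """Rotate consecutive component pairs by position-dependent angles."""
    dim = len(vector)  # assumed to be even
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * c - y * s, x * s + y * c])
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = [0.3, -1.2, 0.7, 0.5]
key = [1.1, 0.4, -0.6, 0.9]
score_a = dot(rope(query, 5), rope(key, 2))    # positions 5 and 2: offset 3
score_b = dot(rope(query, 13), rope(key, 10))  # positions 13 and 10: same offset
```

The two scores agree because the rotated dot product depends only on the offset between positions. That offset is still measured along the linearized string, which is why it does not by itself encode the 2D or 3D proximity discussed above.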

Question 5: How Many Parameters?

Model sizes range from approximately 7M (ST, Mol-BERT) to over 100M parameters (MAT). Most chemical language models operate with 100M parameters or fewer, much smaller than NLP models like BERT (110M-330M) or GPT-3 (175B).

Model	Dimensions	Heads	Layers	Parameters
ST	256	4	4	7M
MolBERT	768	12	12	85M
MolFormer	768	12	6, 12	43M, 85M
SELFormer	768	12, 4	8, 12	57M, 85M
MAT	1024	16	8	101M
ChemBERTa	768	12	6	43M

SELFormer and MolFormer both tested different model sizes. SELFormer’s larger model (approximately 86M parameters) showed approximately 0.034 better ROC-AUC for BBBP compared to the smaller model. MolFormer’s larger model (approximately 87M parameters) performed approximately 0.04 better ROC-AUC on average for BBBP, HIV, BACE, and SIDER. The field lacks the systematic scaling analyses (analogous to Kaplan et al. and Hoffmann et al. in NLP) needed to establish proper scaling laws for chemical language models.

Question 6: Which Pre-training Objectives?

Pre-training objectives fall into domain-agnostic and domain-specific categories:

Model	Pre-training Objective	Fine-tuning
MolFormer	MLM	Frozen, Update
SMILES-BERT	MLM	Update
MolBERT	MLM, PhysChemPred, SMILES-EQ	Frozen, Update
K-BERT	Atom feature, MACCS prediction, CL	Update last layer
ChemBERTa-2	MLM, MTR	Update
MAT	MLM, 2D Adjacency, 3D Distance	Update
ChemFormer	Denoising Span MLM, Augmentation	Update
RT	PLM (Permutation Language Modeling)	-

Domain-specific objectives (predicting physico-chemical properties, atom features, or MACCS keys) showed promising but inconsistent results. MolBERT’s PhysChemPred objective performed close to the full three-objective model (approximately $0.72 \pm 0.06$ vs. $0.71 \pm 0.06$ ROC-AUC in virtual screening). The SMILES-EQ objective (identifying equivalent SMILES) was found to lower performance when combined with other objectives. K-BERT’s contrastive learning objective did not significantly change performance (average ROC-AUC of 0.806 vs. 0.807 with and without CL).

ChemBERTa-2’s Multi-Task Regression (MTR) objective performed noticeably better than MLM alone on nearly all of the four classification tasks, across pre-training dataset sizes.

Question 7: How to Fine-tune?

Fine-tuning through weight updates generally outperforms frozen representations. SELFormer showed this most dramatically, with a difference of 2.187 RMSE between frozen and updated models on FreeSolv. MolBERT showed a much smaller difference (0.575 RMSE on FreeSolv), likely because its domain-specific pre-training objectives already produced representations closer to the downstream tasks.

Benchmarking Challenges and Performance Comparison

Downstream Datasets

The review focuses on nine benchmark datasets across three categories from MoleculeNet:

Dataset	Molecules	Tasks	Type	Application
ESOL	1,128	1 regression	Physical chemistry	Aqueous solubility
FreeSolv	642	1 regression	Physical chemistry	Hydration free energy
Lipophilicity	4,200	1 regression	Physical chemistry	LogD at pH 7.4
BBBP	2,050	1 classification	Physiology	Blood-brain barrier
ClinTox	1,484	2 classification	Physiology	Clinical trial toxicity
SIDER	1,427	27 classification	Physiology	Drug side effects
Tox21	7,831	12 classification	Physiology	Nuclear receptor/stress pathways
BACE	1,513	1 classification	Biophysics	Beta-secretase 1 binding
HIV	41,127	1 classification	Biophysics	Anti-HIV activity

Inconsistencies in Evaluation

The authors document substantial inconsistencies that prevent fair model comparison:

Data splitting: Models used different splitting methods (scaffold vs. random) and different implementations even when using the same method. Not all models adhered to scaffold splitting for classification tasks as recommended.
Different test sets: Even models using the same split type may not evaluate on identical test molecules due to different random seeds.
Varying repetitions: Repetitions ranged from 3 (RT) to 50 (Chen et al.), making some analyses more statistically robust than others.
Metric inconsistency: Most use ROC-AUC for classification and RMSE for regression, but some models report only averages without standard deviations, while others report standard errors.
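
The last point matters because a standard deviation and a standard error are not interchangeable: the standard error shrinks with the number of repetitions, so a model evaluated over 50 runs can look far more stable than one evaluated over 3 even with identical run-to-run spread. A minimal sketch using Python's standard library follows; the scores are made-up placeholders, not results from any reviewed paper.

```python
import math
import statistics


def summarize(scores):
    """Return mean, sample standard deviation, and standard error of the mean."""
    n = len(scores)
    if n < 2:
        raise ValueError("Need at least two repetitions to estimate spread")
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)   # sample standard deviation (n - 1 denominator)
    se = sd / math.sqrt(n)          # standard error of the mean
    return {"n": n, "mean": mean, "sd": sd, "se": se}


if __name__ == "__main__":
    # Hypothetical ROC-AUC values from three repeated runs.
    print(summarize([0.80, 0.82, 0.84]))
```

Reporting `n`, the mean, and the standard deviation together lets readers derive the standard error themselves, whereas the reverse requires knowing `n`.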

Performance Findings

When comparing only models evaluated on the same test sets (Figure 2 in the paper), the authors observe that transformer models show comparable, but not consistently superior, performance to existing ML and DL models. The performance varies considerably across models and datasets.

For BBBP, the Mol-BERT model reported lower ROC-AUC than its corresponding MPNN (approximately 0.88 vs. 0.91), while MolBERT outperformed its corresponding CDDD model (approximately 0.86 vs. 0.76 ROC-AUC) and its SVM baseline (approximately 0.86 vs. 0.70 ROC-AUC). A similar mixed pattern appeared for HIV: ChemBERTa performed worse than its corresponding ML models, while MolBERT performed better than its ML (approximately 0.08 higher ROC-AUC) and DL (approximately 0.03 higher ROC-AUC) baselines. For SIDER, Mol-BERT achieved a ROC-AUC approximately 0.1 higher than its corresponding MPNN. For regression, MAT and MolBERT showed improved performance over their ML and DL baselines on ESOL, FreeSolv, and Lipophilicity. For example, on ESOL, MAT achieved an RMSE approximately 0.2 lower than an SVM model and approximately 0.03 lower than the Weave model.

Key Takeaways and Future Directions

The review concludes with six main takeaways:

Performance: Transformers using SMILES show comparable but not consistently superior performance to existing ML and DL models for MPP.
Scaling: No systematic analysis of model parameter scaling relative to data size exists for chemical language models. Such analysis is essential.
Pre-training data: Dataset size alone is not the sole determinant of downstream performance. Composition and chemical space coverage matter.
Chemical language: SMILES and SELFIES perform similarly. Alternative representations (atom lists, fingerprints) also work when the architecture is adjusted.
Domain knowledge: Domain-specific pre-training objectives show promise, but tokenization and positional encoding remain underexplored.
Benchmarking: The community needs standardized data splitting, fixed test sets, statistical analysis, and consistent reporting to enable meaningful comparison.

The authors also highlight the need for attention visualization and explainability analysis, investigation of NLP-originated techniques (pre-training regimes, fine-tuning strategies like LoRA, explainability methods), and adaptation of these techniques to the specific characteristics of chemical data (smaller vocabularies, shorter sequences).

Reproducibility Details

Data

This is a review paper. No new data or models are introduced. All analyses use previously reported results from the 16 reviewed papers, with additional visualization and comparison. The authors provide a GitHub repository with the code and data used to generate their comparative figures.

Algorithms

Not applicable (review paper). The paper describes training strategies at a conceptual level, referencing the original publications for implementation details.

Models

Not applicable (review paper). The paper catalogs 16 models with their architecture details, parameter counts, and training configurations across Tables 1, 4, 5, 6, and 7.

Evaluation

The paper compiles performance across nine MoleculeNet datasets. Key comparison figures (Figures 2 and 7) restrict to models evaluated on the same test sets for fair comparison, using ROC-AUC for classification and RMSE for regression.

Hardware

Not applicable (review paper).

Artifact	Type	License	Notes
Transformers4MPP_review	Code	MIT	Figure generation code and compiled data

Paper Information

Citation: Sultan, A., Sieg, J., Mathea, M., & Volkamer, A. (2024). Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years. Journal of Chemical Information and Modeling, 64(16), 6259-6280. https://doi.org/10.1021/acs.jcim.4c00747

@article{sultan2024transformers,
  title={Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years},
  author={Sultan, Afnan and Sieg, Jochen and Mathea, Miriam and Volkamer, Andrea},
  journal={Journal of Chemical Information and Modeling},
  volume={64},
  number={16},
  pages={6259--6280},
  year={2024},
  publisher={American Chemical Society},
  doi={10.1021/acs.jcim.4c00747}
}

Transformer CLMs for SMILES: Literature Review 2024

Thu, 26 Mar 2026 00:00:00 +0000

A Systematization of Transformer-Based Chemical Language Models

This paper is a Systematization (literature review) that surveys the landscape of transformer-based chemical language models (CLMs) operating on SMILES representations. It organizes the field into three architectural categories (encoder-only, decoder-only, encoder-decoder), discusses tokenization strategies, pre-training and fine-tuning methodologies, and identifies open challenges and future research directions. The review covers approximately 30 distinct CLMs published through early 2024.

Why Review Transformer CLMs for SMILES?

The chemical space is vast, with databases like ZINC20 exceeding 5.5 billion compounds, and the amount of unlabeled molecular data far outstrips available labeled data for specific tasks like toxicity prediction or binding affinity estimation. Traditional molecular representations (fingerprints, descriptors, graph-based methods) require expert-engineered features and extensive domain knowledge.

Transformer-based language models, originally developed for NLP, have emerged as a compelling alternative. By treating SMILES strings as a “chemical language,” these models can leverage large-scale unsupervised pre-training on abundant unlabeled molecules, then fine-tune on small labeled datasets for specific downstream tasks. Earlier approaches like Seq2Seq and Seq3Seq fingerprint methods used RNN-based encoder-decoders, but these suffered from vanishing gradients and sequential processing bottlenecks when handling long SMILES sequences.

The authors motivate this review by noting that no prior survey has comprehensively organized transformer-based CLMs by architecture type while simultaneously covering tokenization, embedding strategies, and downstream application domains.

Architectural Taxonomy: Encoder, Decoder, and Encoder-Decoder Models

The core organizational contribution is a three-way taxonomy of transformer CLMs based on their architectural backbone.

Encoder-Only Models (BERT Family)

These models capture bidirectional context, making them well suited for extracting molecular representations for property prediction tasks. The review covers:

BERT (Lee and Nam, 2022): Adapted for SMILES processing with linguistic knowledge infusion, using BPE tokenization
MOLBERT (Fabian et al., 2020): Chemistry-specific BERT for physicochemical property and bioactivity prediction
SMILES-BERT (Wang et al., 2019): BERT variant designed to learn molecular representations directly from SMILES without feature engineering
ChemBERTa / ChemBERTa-2 (Chithrananda et al., 2020; Ahmad et al., 2022): RoBERTa-based models optimized for chemical property prediction, with ChemBERTa-2 exploring multi-task pre-training
GPT-MolBERTa (Balaji et al., 2023): Combines GPT molecular features with a RoBERTa backbone
MoLFormer (Ross et al., 2022): Large-scale model trained on 1.1 billion molecules, published in Nature Machine Intelligence
SELFormer (Yuksel et al., 2023): Operates on SELFIES representations rather than SMILES
Mol-BERT / MolRoPE-BERT (Li and Jiang, 2021; Liu et al., 2023): Differ in positional embedding strategy, with MolRoPE-BERT using rotary position embedding to handle longer sequences
BET (Chen et al., 2021): Extracts predictive representations from hundreds of millions of molecules

Decoder-Only Models (GPT Family)

These models excel at generative tasks, including de novo molecular design:

GPT-2-based model (Adilov, 2021): Generative pre-training from molecules
MolXPT (Liu et al., 2023): Wraps molecules with text for generative pre-training, connecting chemical and natural language
BioGPT (Luo et al., 2022): Focuses on biomedical text generation and mining
MolGPT (Haroon et al., 2023): Uses relative attention to capture token distances and relationships for de novo drug design
Mol-Instructions (Fang et al., 2023): Large-scale biomolecular instruction dataset for LLMs

Encoder-Decoder Models

These combine encoding and generation capabilities for sequence-to-sequence tasks:

Chemformer (Irwin et al., 2022): BART-based model for reaction prediction and molecular property prediction
MolT5 (adapted T5): Unified text-to-text framework for molecular tasks
SMILES Transformer (Honda et al., 2019): Pre-trained molecular fingerprints for low-data drug discovery
X-MOL (Xue et al., 2020): Large-scale pre-training for molecular understanding
Regression Transformer (Born and Manica, 2023): Operates on SELFIES, enabling concurrent regression and generation
TransAntivirus (Mao et al., 2023): Specialized for antiviral drug design using IUPAC nomenclature

Tokenization, Embedding, and Pre-Training Strategies

SMILES Tokenization

The review identifies tokenization as a critical preprocessing step that affects downstream performance. SMILES tokenization differs from standard NLP tokenization because SMILES strings lack whitespace and use parentheses for branching rather than sentence separation. The key approaches include:

Strategy	Source	Description
Atom-in-SMILES (AIS)	Ucak et al. (2023)	Atom-level tokens preserving chemical identity
SMILES Pair Encoding (SPE)	Li and Fourches (2021)	BPE-inspired substructure tokenization
Byte-Pair Encoding (BPE)	Chithrananda et al. (2020); Lee and Nam (2022)	Standard subword tokenization adapted for SMILES
SMILESTokenizer	Chithrananda et al. (2020)	Character-level tokenization with chemical adjustments

Positional Embeddings

The models use various positional encoding strategies: absolute, relative key, relative key-query, rotary (RoPE), and sinusoidal. Notably, SMILES-based models omit segmentation embeddings since SMILES data consists of single sequences rather than sentence pairs.

Pre-Training and Fine-Tuning Pipeline

The standard workflow follows two phases:

Pre-training: Unsupervised training on large unlabeled SMILES databases (ZINC, PubChem, ChEMBL) using masked language modeling (MLM), where the model learns to predict masked tokens within SMILES strings
Fine-tuning: Supervised adaptation on smaller labeled datasets for specific tasks (classification or regression)
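
The masking step of the pre-training phase can be sketched as follows. This is a simplified illustration that replaces a fixed fraction of tokens with a `[MASK]` symbol and records the originals as prediction targets; the 15% rate, the function name, and the character-level tokens are assumptions, and BERT-style refinements (such as occasionally keeping or randomizing the selected token) are omitted.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels); labels hold the original token at masked slots."""
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one token so every sequence contributes a training signal.
    n_mask = max(1, round(mask_prob * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, labels


if __name__ == "__main__":
    # Character-level split is enough here: this SMILES has no multi-character tokens.
    tokens = list("CC(=O)Oc1ccccc1C(=O)O")
    masked, labels = mask_tokens(tokens, seed=0)
    print("".join(masked))
```

During pre-training, the loss would be computed only at positions where `labels` is not `None`.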

The self-attention mechanism, central to all transformer CLMs, is formulated as:

$$ Z = \text{Softmax}\left(\frac{(XW^Q)(XW^K)^T}{\sqrt{d_k}}\right) XW^V $$

where $X \in \mathbb{R}^{N \times M}$ is the input feature matrix, $W^Q$, $W^K$, $W^V \in \mathbb{R}^{M \times d_k}$ are learnable weight matrices, and $\sqrt{d_k}$ is the scaling factor.
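
The formula can be traced step by step with a small dependency-free sketch. It implements a single attention head on nested Python lists; the helper names are illustrative, and a practical implementation would use a tensor library, multiple heads, and masking.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def softmax(row):
    peak = max(row)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention; returns (Z, attention weights)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(w_k[0])
    scores = matmul(q, transpose(k))     # N x N matrix of query-key similarities
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # N = 3 tokens, M = 2 features
    identity = [[1.0, 0.0], [0.0, 1.0]]
    z, weights = self_attention(x, identity, identity, identity)
    print(z)
```

Each row of the attention weights sums to one, so every output row of $Z$ is a weighted average of the value vectors.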

Benchmark Datasets and Evaluation Landscape

The review catalogs the standard evaluation ecosystem for CLMs. Pre-training databases include ZINC, PubChem, and ChEMBL. Fine-tuning and evaluation rely heavily on MoleculeNet benchmarks:

Category	Datasets	Task Type	Example Size
Physical Chemistry	ESOL, FreeSolv, Lipophilicity	Regression	642 to 4,200
Biophysics	PCBA, MUV, HIV, PDBbind, BACE	Classification/Regression	11,908 to 437,929
Physiology	BBBP, Tox21, ToxCast, SIDER, ClinTox	Classification	1,427 to 8,575

The authors also propose four new fine-tuning datasets targeting diseases: COVID-19 drug compounds, cocrystal formation, antimalarial drugs (Plasmodium falciparum targets), and cancer gene expression/drug response data.

Challenges, Limitations, and Future Directions

Current Challenges

The review identifies several persistent limitations:

Data efficiency: Despite transfer learning, transformer CLMs still require substantial pre-training data, and labeled datasets for specific tasks remain scarce
Interpretability: The complexity of transformer architectures makes it difficult to understand how specific molecular features contribute to predictions
Computational cost: Training large-scale models demands significant GPU resources, limiting accessibility
Handling rare molecules: Models struggle with molecular structures that deviate significantly from training data distributions
SMILES limitations: Non-unique representations, invalid strings, exceeded atom valency, and inadequate spatial information capture

SMILES Representation Issues

The authors highlight five specific problems with SMILES as an input representation:

Non-canonical representations reduce string uniqueness for the same molecule
Many symbol combinations produce chemically invalid outputs
Valid SMILES strings can encode chemically impossible molecules (e.g., exceeded valency)
Spatial information is inadequately captured
Syntactic and semantic robustness is limited

Future Research Directions

The review proposes several directions:

Alternative molecular representations: Exploring SELFIES, DeepSMILES, IUPAC, and InChI beyond SMILES
Role of SMILES token types: Strategic masking of metals, non-metals, bonds, and branches during MLM pre-training to identify which components are most critical
Few-shot learning: Combining few-shot approaches with large-scale pre-trained CLMs for data-scarce scenarios
Drug repurposing: Training CLMs to distinguish identical compounds with different biological activity profiles across therapeutic domains
Improved benchmarks: Incorporating disease-specific datasets (malaria, cancer, COVID-19) for more realistic evaluation
Ethical considerations: Addressing dual-use risks, data biases, and responsible open-source release of CLMs

Reproducibility Details

This is a literature review paper. It does not introduce new models, code, or experimental results. The reproducibility assessment focuses on the accessibility of the reviewed works and proposed datasets.

Data

Purpose	Dataset	Size	Notes
Pre-training	ZINC20	5.5B+ compounds	Publicly available
Pre-training	PubChem	100M+ compounds	Publicly available
Pre-training	ChEMBL	2M+ compounds	Publicly available
Fine-tuning	MoleculeNet (8 datasets)	642 to 437,929	Standard benchmark suite
Proposed	COVID-19 drug compounds	740	From Harigua-Souiai et al. (2021)
Proposed	Cocrystal formation	3,282	From Mswahili et al. (2021)
Proposed	Antimalarial drugs	4,794	From Mswahili et al. (2024)
Proposed	Cancer gene/drug response	201 drugs, 734 cell lines	From Kim et al. (2021)

Artifacts

Artifact	Type	License	Notes
DAI Lab website	Other	N/A	Authors’ research lab

No code, models, or evaluation scripts are released with this review. The paper does not include a supplementary materials section or GitHub repository.

Hardware

Not applicable (literature review).

Paper Information

Citation: Mswahili, M. E., & Jeong, Y.-S. (2024). Transformer-based models for chemical SMILES representation: A comprehensive literature review. Heliyon, 10(20), e39038. https://doi.org/10.1016/j.heliyon.2024.e39038

@article{mswahili2024transformer,
  title={Transformer-based models for chemical {SMILES} representation: A comprehensive literature review},
  author={Mswahili, Medard Edmund and Jeong, Young-Seob},
  journal={Heliyon},
  volume={10},
  number={20},
  pages={e39038},
  year={2024},
  publisher={Elsevier},
  doi={10.1016/j.heliyon.2024.e39038}
}

Systematic Review of Deep Learning CLMs (2020-2024)

Thu, 26 Mar 2026 00:00:00 +0000

A Systematization of Chemical Language Models for Molecular Generation

This paper is a Systematization that provides a comprehensive, PRISMA-guided systematic review of deep learning chemical language models (CLMs) used for de novo molecular generation. The primary contribution is a structured statistical analysis of 72 retrieved articles from 2020 to June 2024, comparing architectures (RNNs, transformers, VAEs, GANs, S4 models), molecular representations, biased generation strategies, and quality metrics from the MOSES and GuacaMol benchmarking platforms. The review addresses five research questions about architecture configuration effects, best-performing architectures, impactful hyperparameters, common molecular representations, and effective biased generation methods.

Motivation: Evaluating Four Years of Generative CLM Progress

Deep learning molecular generation has expanded rapidly since 2018, when Gomez-Bombarelli et al. and Segler et al. demonstrated that deep generative models could learn to produce novel molecules from SMILES representations. By 2020, multiple architectures (RNNs, transformers, VAEs, GANs) were being applied to chemical language modeling, and benchmarking platforms like MOSES and GuacaMol had been introduced to enable standardized evaluation.

Despite this growth, existing reviews largely focused on theoretical background or drug development applications rather than systematic statistical comparison of model performance. Few studies had examined how architecture choice, training dataset size, molecular representation format, and biased learning strategies interact to affect generation quality metrics like validity, uniqueness, and novelty. This review fills that gap by restricting the analysis to papers reporting MOSES or GuacaMol metrics, enabling quantitative cross-study comparison.

PRISMA-Based Systematic Review Methodology

The review follows the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA) guidelines. Articles were retrieved from Scopus, Web of Science, and Google Scholar using six Boolean search queries combining terms like “Molecule Generation,” “Chemical Language Models,” “Deep Learning,” and specific architecture names. The search window covered January 2020 to June 2024.

Eligibility Criteria

Papers were included if they:

Were written in English
Explicitly reported at least two of the three metrics (uniqueness, validity, novelty)
Defined these metrics consistently with the MOSES or GuacaMol definitions
Used deep learning generative models for de novo molecule design
Used conventional (non-quantum) deep learning methods
Were published between January 2020 and June 2024

This yielded 48 articles from query-based search and 25 from citation search, totaling 72 articles. Of these, 62 used CLM approaches (string-based molecular representations) and 10 used graph-based representations.

Data Collection

For each article, the authors extracted: journal details, database name, training dataset size, molecular representation type (SMILES, SELFIES, InChI, DeepSMILES), architecture details (embedding length, layers, hidden units, trainable parameters, dropout, temperature, batch size, epochs, learning rate, optimizer), biased method usage (TL, RL, conditional learning), and generation metrics (validity, uniqueness, novelty, scaffold diversity, SNN, FCD).

Evaluation Metrics

The review focuses on three core MOSES metrics:

$$ \text{Validity} = \frac{|V_m|}{\text{Molecules produced}} $$

$$ \text{Uniqueness} = \frac{|\text{set}(V_m)|}{|V_m|} $$

$$ \text{Novelty} = 1 - \frac{|V_m \cap T_d|}{|V_m|} $$

where $V_m$ denotes the valid generated molecules, $T_d$ the training dataset, and $|\cdot|$ the number of elements.
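
A minimal sketch of how these three ratios could be computed from a list of generated strings is shown below. The validity check is passed in as a function; `toy_is_valid` is a hypothetical stand-in that only checks parenthesis balance, whereas a real pipeline would parse each string with a cheminformatics toolkit. The sketch counts molecules exactly as the formulas are written; benchmark implementations may additionally canonicalize or deduplicate strings first.

```python
def toy_is_valid(smiles):
    """Hypothetical stand-in for a real parser: non-empty with balanced parentheses."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smiles) and depth == 0


def generation_metrics(generated, training_set, is_valid):
    """Compute validity, uniqueness, and novelty for a list of generated strings."""
    if not generated:
        raise ValueError("No molecules were generated")
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated)
    if not valid:
        return {"validity": validity, "uniqueness": None, "novelty": None}
    uniqueness = len(set(valid)) / len(valid)
    seen_in_training = sum(1 for s in valid if s in training_set)
    novelty = 1 - seen_in_training / len(valid)
    return {"validity": validity, "uniqueness": uniqueness, "novelty": novelty}


if __name__ == "__main__":
    generated = ["CCO", "CCO", "CC(", "c1ccccc1", "CCN"]
    print(generation_metrics(generated, {"CCO"}, toy_is_valid))
```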

Architecture Distribution and Performance Comparison

Architecture Trends (2020-2024)

The review found that RNNs and transformers dominate CLM usage, with a growing trend toward transformers over time. The breakdown across 62 CLM articles: 24 RNN-based, 23 transformer-based, 16 VAE-based, 8 GAN-based, and 1 S4-based model. Among RNN variants, LSTM was the most common, followed by GRU, despite GRU having fewer trainable parameters.

The increase in transformer adoption is attributed to self-attention mechanisms enabling parallel computation and effective long-range dependency capture. Meanwhile, GANs and VAEs saw lower adoption rates, partly due to higher memory and time complexity and reduced ability to generate large molecules.

Molecular Representations and Databases

SMILES was used exclusively in 77.27% of CLM articles, reflecting its wide database availability and compact format. SELFIES, DeepSMILES, and InChI each appeared in smaller fractions. The dominant databases were ChEMBL and ZINC (27 articles each), followed by PubChem (4 articles). Approximately 71% of reviewed articles focused on drug discovery applications.

Database	Molecules (millions)	Representation	Articles
ChEMBL	2.4	SMILES, InChI	27
ZINC	750	SMILES	27
PubChem	115.3	SMILES, InChI	4
COCONUT	0.695	SMILES, InChI	1
DNA-Encoded Library	1,040	SMILES	1

Unbiased Model Performance

Validity: No statistically significant differences were observed across architecture families. Transformers generally achieved high validity through self-attention mechanisms that retain uncompressed sequence information. However, one transformer model (TransMol) achieved only 6.9% validity when using stochastic sampling with Gaussian noise to explore unseen chemical space. GANs showed high dispersion, with validity as low as 8.5% when learning from gene expression signatures rather than molecular structures directly.

Uniqueness: No significant differences in median uniqueness were observed across architectures. Transformer-based models using masked self-attention achieved near-perfect uniqueness scores. Scaffold decoration and fragment-linking approaches sometimes compromised uniqueness because overfitting led to redundant outputs.

Validity-Novelty Trade-off: The authors propose a “Valid/Sample” metric (Validity × Novelty) and report an inverse trend between validity and novelty (Spearman $\rho = -0.3575$, p-value = 0.0618) that does not reach significance at the 0.05 level. Only 17.9% of models achieved above-median values for both validity (95.6%) and novelty (96.5%) simultaneously. SELFIES-based models achieve 100% validity by construction, which can help address this trade-off.
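
For reference, the two quantities used in this analysis can be computed as in the sketch below: the Valid/Sample product and a Spearman rank correlation, implemented here as the Pearson correlation of tie-averaged ranks. The function names are illustrative and the inputs in the usage example are placeholders, not values extracted from the reviewed articles.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation (undefined if either input is constant)."""
    return pearson(average_ranks(x), average_ranks(y))


def valid_per_sample(validity, novelty):
    """Valid/Sample metric: product of validity and novelty, both as fractions."""
    return validity * novelty


if __name__ == "__main__":
    validity = [0.99, 0.97, 0.90, 0.85]   # placeholder values
    novelty = [0.80, 0.88, 0.95, 0.99]    # placeholder values
    print(spearman(validity, novelty))
    print([valid_per_sample(v, n) for v, n in zip(validity, novelty)])
```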

Biased Model Performance

The review examines three biased generation strategies:

Transfer Learning (TL): The most prevalent biased method, used across all architecture types. Fine-tuning transfers pre-trained parameters to a target model, requiring significantly fewer training molecules (median ~2,507 vs. ~1.1M for unbiased). TL does not significantly affect validity (p = 0.16) or novelty (p = 0.84), but uniqueness decreases significantly (median 90.2% vs. 97.9%, p = 0.014), likely due to overfitting on small target datasets.

Metric	Unbiased (median)	TL Target (median)	p-value
Training size	1,128,920	2,507	<0.0001
Validity	98.05%	95.5%	0.1602
Uniqueness	97.9%	90.2%	0.0144
Novelty	91.6%	96.0%	0.8438

Reinforcement Learning (RL): Applied only to RNNs and transformers in the reviewed set. 90.1% of RL implementations used policy gradient methods with scoring functions for properties like synthesizability, binding affinity, and membrane permeability. No significant effects on generation metrics were observed.

Metric	Unbiased (median)	RL Target (median)	p-value
Validity	91.1%	96.5%	0.1289
Uniqueness	99.9%	89.7%	0.0935
Novelty	91.5%	93.5%	0.2500

Conditional Learning (CL): Integrates domain-specific data (properties, bioactivities, functional groups) directly into training via constraint tokens or property embeddings. Used primarily with encoder-decoder architectures (ARAEs, VAEs, transformers). CL does not significantly degrade generation metrics relative to unbiased models.

Metric	Unbiased (median)	CL Target (median)	p-value
Validity	98.5%	96.8%	0.4648
Uniqueness	99.9%	97.5%	0.0753
Novelty	89.3%	99.6%	0.2945

Key Findings and Directions for Chemical Language Models

Main Conclusions

Transformers are overtaking RNNs as the dominant CLM architecture, driven by self-attention mechanisms that capture long-range dependencies without the vanishing-gradient issues of recurrent models.
SMILES remains dominant (77% of models) despite known limitations (non-uniqueness, syntax errors). SELFIES shows promise for improving the validity-novelty trade-off.
No architecture achieves both high validity and high novelty easily. Only 17.9% of unbiased models exceeded medians for both metrics simultaneously, highlighting a fundamental tension in generative chemistry.
Transfer learning requires only ~2,500 molecules to generate targeted compounds, compared to ~1.1M for unbiased training, but at the cost of reduced uniqueness.
Combining biased methods (e.g., TL + RL, CL + TL) shows promise for multi-objective optimization and exploring distant regions of chemical space.
S4 models were first applied to CLMs in 2023 and showed competitive performance, benefiting from their dual nature: convolutional during training and recurrent during generation.

Limitations

The review is restricted to papers reporting MOSES or GuacaMol metrics, which excludes many molecular generation studies that use alternative evaluation frameworks. The statistical comparisons rely on median values reported across different experimental settings, making direct architecture comparisons approximate. Graph-based approaches are included only for coarse comparison (10 of 72 articles) and are not the focus of the analysis.

Reproducibility Details

Data

This is a systematic review, so no new models were trained. The authors collected metadata from 72 published articles. No datasets were generated or analyzed beyond the literature corpus.

Algorithms

Statistical comparisons used Mann-Whitney U tests for paired samples. Spearman correlation was used to assess the validity-novelty relationship. Outlier identification used the Valid/Sample (Validity x Novelty) metric with box plot analysis.

Evaluation

The review evaluates models using MOSES metrics: validity, uniqueness, novelty, scaffold diversity, scaffold novelty, fragment similarity, SNN, internal diversity, and FCD. Statistical tests were applied to compare medians across architecture families and between biased and unbiased models.

Hardware

Not applicable (systematic review, no model training performed).

Paper Information

Citation: Flores-Hernandez, H., & Martínez-Ledesma, E. (2024). A systematic review of deep learning chemical language models in recent era. Journal of Cheminformatics, 16(1), 129. https://doi.org/10.1186/s13321-024-00916-y

@article{floreshernandez2024systematic,
  title={A systematic review of deep learning chemical language models in recent era},
  author={Flores-Hernandez, Hector and Mart{\'i}nez-Ledesma, Emmanuel},
  journal={Journal of Cheminformatics},
  volume={16},
  number={1},
  pages={129},
  year={2024},
  publisher={BioMed Central},
  doi={10.1186/s13321-024-00916-y}
}

Survey of Transformer Architectures in Molecular Science

Thu, 26 Mar 2026 00:00:00 +0000

A Systematization of Transformer Architectures for Molecular Science

This paper is a Systematization review. It organizes and taxonomizes 12 families of transformer architectures that have been applied across molecular science, including chemistry, biology, and drug discovery. The primary contribution is not a new method or dataset, but a structured technical overview of the algorithmic internals of each transformer variant and their specific applications to molecular problems. The review covers 201 references and provides a unified treatment of how these architectures capture molecular patterns from sequential, graphical, and image-based data.

Bridging the Gap Between Transformer Variants and Molecular Applications

Transformer-based models have become widespread in molecular science, yet the authors identify a gap: there is no organized taxonomy linking these diverse techniques in the existing literature. Individual papers introduce specific architectures or applications, but practitioners lack a unified reference that explains the technical differences between GPT, BERT, BART, graph transformers, and other variants in the context of molecular data. The review aims to fill this gap by providing an in-depth investigation of the algorithmic components of each model family, explaining how their architectural innovations contribute to processing complex molecular data.

The authors note that the success of transformers in molecular science stems from several factors: the sequential nature of chemical and biological molecules (DNA, RNA, proteins, SMILES strings), the attention mechanism’s ability to capture long-range dependencies within molecular structures, and the capacity for transfer learning through pre-training on large chemical and biological datasets.

Twelve Transformer Families and Their Molecular Mechanisms

The review covers transformer preliminaries before diving into 12 specific architecture families. The core self-attention mechanism computes:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

where $d_k$ is the dimension of the key vectors. The position-wise feed-forward network is:

$$ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 $$
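
The feed-forward equation applies the same two-layer network to each token vector independently. A minimal sketch for a single token vector, using plain Python lists with hypothetical weight values, is:

```python
def relu_ffn(x, w1, b1, w2, b2):
    """Position-wise feed-forward network: max(0, x W1 + b1) W2 + b2.

    x is one token vector; w1 and w2 are matrices given as lists of rows.
    """
    # First projection followed by ReLU (zip(*w1) iterates over columns of W1).
    hidden = [
        max(0.0, sum(xi * wi for xi, wi in zip(x, col)) + bias)
        for col, bias in zip(zip(*w1), b1)
    ]
    # Second projection back to the output dimension.
    return [
        sum(hi * wi for hi, wi in zip(hidden, col)) + bias
        for col, bias in zip(zip(*w2), b2)
    ]


if __name__ == "__main__":
    x = [1.0, -2.0]
    w1 = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical weights: identity projection
    b1 = [0.0, 0.0]
    w2 = [[2.0], [3.0]]
    b2 = [0.5]
    print(relu_ffn(x, w1, b1, w2, b2))
```

In this example the negative component is zeroed by the ReLU, so only the first hidden unit contributes to the output.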

The 12 architecture families covered are:

GPT (Generative Pre-trained Transformer): Uses the decoder part of the transformer for autoregressive generation. Applications include MolGPT for molecular generation, DrugGPT for protein-ligand binding, and cMolGPT for target-specific de novo molecular generation.
BERT (Bidirectional Encoder Representations from Transformers): Uses transformer encoders with masked language modeling and next-sentence prediction for pre-training. Molecular applications include FP-BERT for molecular property prediction using composite fingerprint representations, Graph-BERT for protein-protein interaction identification, SMILES-BERT, and Mol-BERT.
BART (Bidirectional and Auto-Regressive Transformers): Functions as a denoising autoencoder with both encoder and decoder. Molecular applications include Chemformer for sequence-to-sequence chemistry tasks, MS2Mol for mass spectrometry analysis, and MolBART for molecular feature learning.
Graph Transformer: Leverages self-attention on graph-structured data to capture global context. Applications include GraphSite for protein-DNA binding site prediction (using AlphaFold2 structure predictions), KPGT for knowledge-guided molecular graph pre-training, and PAGTN for establishing long-range dependencies in molecular graphs.
Transformer-XL: Incorporates relative positional encoding for modeling long sequences. Used for small molecule retention time prediction, drug design with ChEMBL data (1.27 million molecules), and Heck reaction generation.
T5 (Text-to-Text Transfer Transformer): Unifies NLP tasks into text-to-text mapping. T5Chem was pre-trained on 97 million molecules from PubChem and achieved 99.5% accuracy on reaction classification (USPTO 500 MT). C5T5 uses IUPAC naming for molecular optimization in drug discovery.
Vision Transformer (ViT): Applies transformer architecture to image patches. Used for organic molecule classification (97% accuracy with WGAN-generated data), bacterial identification via SERS, and molecular property prediction from mass spectrometry data (TransG-Net).
DETR (Detection Transformer): End-to-end object detection using transformers. Applied to cryo-EM particle picking (TransPicker), molecular structure image recognition (IMG2SMI), and cell segmentation (Cell-DETR).
Conformer: Integrates convolutional modules into the transformer structure. Used for DNA storage error correction (RRCC-DNN) and drug-target affinity prediction (NG-DTA with Davis and Kiba datasets).
CLIP (Contrastive Language-Image Pre-training): Multimodal learning linking text and images. Applied to peptide design (Cut&CLIP for protein degradation), gene identification (pathCLIP), and drug discovery (CLOOME for zero-shot transfer learning).
Sparse Transformers: Use sparse attention matrices to reduce complexity to $O(n\sqrt{n})$. Applied to drug-target interaction prediction with gated cross-attention mechanisms.
Mobile and Efficient Transformers: Compressed variants (TinyBERT, MobileBERT) for resource-constrained environments. Molormer uses ProbSparse self-attention for drug-drug interaction prediction. LOGO is a lightweight pre-trained language model for non-coding genome interpretation.

Survey Organization and Coverage of Molecular Domains

As a survey paper, this work does not present new experiments. Instead, it catalogues existing applications across multiple molecular domains:

Drug Discovery and Design: GPT-based ligand design (DrugGPT), BART-based molecular generation (Chemformer, MolBART), graph transformer pre-training for molecular property prediction (KPGT), T5-based chemical reaction prediction (T5Chem), and sparse transformer methods for drug-target interactions.

Protein Science: BERT-based protein-protein interaction prediction (Graph-BERT), graph transformer methods for protein-DNA binding (GraphSite with AlphaFold2 integration), conformer-based drug-target affinity prediction (NG-DTA), and CLIP-based peptide design (Cut&CLIP).

Molecular Property Prediction: FP-BERT for fingerprint-based prediction, SMILES-BERT and Mol-BERT for end-to-end prediction from SMILES, KPGT for knowledge-guided graph pre-training, and Transformer-XL for property modeling with relative positional encoding.

Structural Biology: DETR-based cryo-EM particle picking (TransPicker), vision transformer applications in cell imaging, and Cell-DETR for instance segmentation in microscopy.

Genomics: Conformer-based DNA storage error correction (RRCC-DNN), LOGO for non-coding genome interpretation, and MetaTransformer for metagenomic sequencing analysis.

Future Directions and Limitations of the Survey

The review concludes with four future directions:

ChatGPT integration into molecular science: Using LLMs for data analysis, literature review, and hypothesis generation in chemistry and biology.
Multifunction transformers: Models that extract features across diverse molecular structures and sequences simultaneously.
Molecular-aware transformers: Architectures that handle multiple data types (text, sequence, structure, image, energy, molecular dynamics, function) in a unified framework.
Self-assessment transformers and superintelligence: Speculative discussion of models that learn from seemingly unrelated data sources.

The review has several limitations worth noting. The coverage is broad but shallow: each architecture family receives only 1-2 pages of discussion, and the paper largely describes existing work rather than critically evaluating it. The review does not systematically compare the architectures against each other on common benchmarks. The future directions section (particularly the superintelligence discussion) is speculative and lacks concrete proposals. The paper also focuses primarily on technical architecture descriptions rather than analyzing failure modes, scalability challenges, or reproducibility concerns across the surveyed methods. As a review article, no new data were created or analyzed.

Reproducibility Details

Data

This is a survey paper. No new datasets were created or used. The paper reviews applications involving datasets such as PubChem (97 million molecules for T5Chem), ChEMBL (1.27 million molecules for Transformer-XL drug design), USPTO 500 MT (reaction classification), ESOL (5,328 molecules for property prediction), and Davis/Kiba (drug-target affinity).

Algorithms

No new algorithms are introduced. The paper provides mathematical descriptions of the core transformer components (self-attention, positional encoding, feed-forward networks, layer normalization) and describes how 12 architecture families modify these components.

Models

No new models are presented. The paper surveys existing models including MolGPT, DrugGPT, FP-BERT, SMILES-BERT, Chemformer, MolBART, GraphSite, KPGT, T5Chem, TransPicker, Cell-DETR, CLOOME, and Molormer, among others.

Evaluation

No new evaluation is performed. Performance numbers cited from the literature include: T5Chem reaction classification accuracy of 99.5%, ViT organic molecule classification at 97%, Transformer-XL property prediction RMSE of 0.6 on ESOL, and Heck reaction generation feasibility rate of 47.76%.

Hardware

No hardware requirements are specified, as this is a survey paper.

Artifact	Type	License	Notes
Paper (open access)	Paper	CC-BY-NC-ND	Open access via Wiley

Paper Information

Citation: Jiang, J., Ke, L., Chen, L., Dou, B., Zhu, Y., Liu, J., Zhang, B., Zhou, T., & Wei, G.-W. (2024). Transformer technology in molecular science. WIREs Computational Molecular Science, 14(4), e1725. https://doi.org/10.1002/wcms.1725

@article{jiang2024transformer,
  title={Transformer technology in molecular science},
  author={Jiang, Jian and Ke, Lu and Chen, Long and Dou, Bozheng and Zhu, Yueying and Liu, Jie and Zhang, Bengong and Zhou, Tianshou and Wei, Guo-Wei},
  journal={WIREs Computational Molecular Science},
  volume={14},
  number={4},
  pages={e1725},
  year={2024},
  publisher={Wiley},
  doi={10.1002/wcms.1725}
}

Review: Deep Learning for Molecular Design (2019)

Thu, 26 Mar 2026 00:00:00 +0000

A Systematization of Deep Generative Models for Molecular Design

This is a Systematization paper that organizes and compares the rapidly growing literature on deep generative modeling for molecules. Published in 2019, it catalogs 45 papers from the preceding two years, classifying them by architecture (RNNs, VAEs, GANs, reinforcement learning) and molecular representation (SMILES strings, context-free grammars, graph tensors, 3D voxels). The review provides mathematical foundations for each technique, identifies cross-cutting themes, and proposes a framework for reward function design that addresses diversity, novelty, stability, and synthesizability.

The Challenge of Navigating Vast Chemical Space

The space of potential drug-like molecules has been estimated to contain between $10^{23}$ and $10^{60}$ compounds, while only about $10^{8}$ have ever been synthesized. Traditional approaches to molecular design rely on combinatorial methods, mixing known scaffolds and functional groups, but these generate many unstable or unsynthesizable candidates. High-throughput screening (HTS) and high-throughput virtual screening (HTVS) help but remain computationally expensive. The average cost to bring a new drug to market exceeds one billion USD, with a 13-year average timeline from discovery to market.

By 2016, deep generative models had shown strong results in producing original images, music, and text. The “molecular autoencoder” of Gómez-Bombarelli et al. (2016/2018) first applied these techniques to molecular generation, triggering an explosion of follow-up work. By the time of this review, the landscape had grown complex enough, with many architectures, representation schemes, and no agreed-upon benchmarking standards, to warrant systematic organization.

Molecular Representations and Architecture Taxonomy

The review’s core organizational contribution is a two-axis taxonomy: molecular representations on one axis and deep learning architectures on the other.

Molecular Representations

The review categorizes representations into 3D and 2D graph-based schemes:

3D representations include raw voxels (placing nuclear charges on a grid), smoothed voxels (Gaussian blurring around nuclei), and tensor field networks. These capture full geometric information but suffer from high dimensionality, sparsity, and difficulty encoding rotation/translation invariance.

2D graph representations include:

SMILES strings: The dominant representation, encoding molecular graphs as ASCII character sequences via depth-first traversal. Non-unique (each molecule with $N$ heavy atoms has at least $N$ SMILES representations), but invertible and widely supported.
Canonical SMILES: Unique for each molecule, but models trained on them may learn the canonicalization grammar rather than chemical structure.
Context-free grammars (CFGs): Decompose SMILES into grammar rules to improve validity rates, though not to 100%.
Tensor representations: Store atom types in a vertex feature matrix $X \in \mathbb{R}^{N \times |\mathcal{A}|}$ and bond types in an adjacency tensor $A \in \mathbb{R}^{N \times N \times Y}$.
Graph operations: Directly build molecular graphs by adding atoms and bonds, guaranteeing 100% chemical validity.
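The tensor representation in the list above can be made concrete with a small sketch. The atom vocabulary, bond vocabulary, and the `to_tensors` helper are hypothetical choices for illustration only; the example encodes the three heavy atoms of ethanol (C, C, O) as a one-hot feature matrix $X$ and a one-hot adjacency tensor $A$.

```python
ATOM_TYPES = ["C", "N", "O"]                  # hypothetical atom vocabulary (|A| = 3)
BOND_TYPES = ["single", "double", "triple"]   # hypothetical bond vocabulary (Y = 3)

def to_tensors(atoms, bonds):
    """Build X (N x |A|) and A (N x N x Y) as nested lists of floats."""
    n = len(atoms)
    # One-hot row per atom: which atom type sits at each vertex.
    x = [[1.0 if a == t else 0.0 for t in ATOM_TYPES] for a in atoms]
    # One-hot bond-type channel for every ordered pair of atoms.
    adj = [[[0.0] * len(BOND_TYPES) for _ in range(n)] for _ in range(n)]
    for i, j, kind in bonds:
        y = BOND_TYPES.index(kind)
        adj[i][j][y] = 1.0
        adj[j][i][y] = 1.0  # molecular graphs are undirected, so A is symmetric
    return x, adj

# Heavy atoms of ethanol (SMILES "CCO"): C0-C1 and C1-O2 single bonds.
atoms = ["C", "C", "O"]
bonds = [(0, 1, "single"), (1, 2, "single")]
X, A = to_tensors(atoms, bonds)
```

A generative model that emits such tensors directly still has to respect valence rules, which is one reason the graph-operation approach in the last list item can guarantee validity while tensor outputs cannot.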

Deep Learning Architectures

Recurrent Neural Networks (RNNs) generate SMILES strings character by character, typically using LSTM or GRU units. Training uses maximum likelihood estimation (MLE) with teacher forcing:

$$ L^{\text{MLE}} = -\sum_{s \in \mathcal{X}} \sum_{t=2}^{T} \log \pi_{\theta}(s_{t} \mid S_{1:t-1}) $$

Thermal rescaling of the output distribution controls the diversity-validity tradeoff via a temperature parameter $T$. RNNs achieved SMILES validity rates of 94-98%.
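Thermal rescaling can be illustrated with a short sketch. The function names, the three-character vocabulary, and the probabilities below are assumptions for illustration; the only technique shown is the standard one of raising each probability to the power $1/T$ and renormalizing (equivalent to dividing logits by $T$ before the softmax).

```python
import math
import random

def rescale(probs, temperature):
    # p_i^(1/T) renormalized; assumes every probability is strictly positive.
    logits = [math.log(p) / temperature for p in probs]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, probs, temperature, rng):
    # Draw one token from the temperature-rescaled distribution.
    return rng.choices(vocab, weights=rescale(probs, temperature), k=1)[0]

vocab = ["C", "N", "O"]          # hypothetical next-character vocabulary
probs = [0.7, 0.2, 0.1]          # hypothetical model output at one step
sharp = rescale(probs, 0.5)      # low T: concentrates mass on the likeliest token
flat = rescale(probs, 2.0)       # high T: spreads mass toward rarer tokens
token = sample_token(vocab, probs, 1.0, random.Random(0))
```

Lower temperatures favor the most likely (and typically valid) continuations, while higher temperatures increase diversity at the cost of more invalid strings.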

Variational Autoencoders (VAEs) learn a continuous latent space by maximizing the evidence lower bound (ELBO):

$$ \mathcal{L}_{\theta,\phi}(x) = \mathbb{E}_{z \sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{\text{KL}}[q_{\phi}(z|x), p(z)] $$

The first term encourages accurate reconstruction while the KL divergence term regularizes the latent distribution toward a standard Gaussian prior $p(z) = \mathcal{N}(z; 0, I)$. Variants include grammar VAEs (GVAEs), syntax-directed VAEs, junction tree VAEs, and adversarial autoencoders (AAEs) that replace the KL term with adversarial training.

Generative Adversarial Networks (GANs) train a generator against a discriminator using the minimax objective:

$$ \min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{d}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))] $$

The review shows that with an optimal discriminator, the generator objective reduces to minimizing the Jensen-Shannon divergence, which captures both forward and reverse KL divergence terms. This provides a more “balanced” training signal than MLE alone. The Wasserstein GAN (WGAN) uses the Earth mover’s distance for more stable training:

$$ W(p, q) = \inf_{\gamma \in \Pi(p,q)} \mathbb{E}_{(x,y) \sim \gamma} \lVert x - y \rVert $$
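The differences between these divergences are easy to see on small discrete distributions. The sketch below is an illustration under stated assumptions rather than anything from the review: distributions are histograms over unit-spaced bins, and in one dimension the Earth mover's distance reduces to the summed absolute difference of the cumulative distributions.

```python
import math

def kl(p, q):
    # KL(p || q) for discrete distributions; infinite if q misses mass that p has.
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def js(p, q):
    # Jensen-Shannon divergence: average KL of p and q to their midpoint mixture.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q):
    # Earth mover's distance on unit-spaced bins: sum of |CDF_p - CDF_q|.
    cp = cq = total = 0.0
    for pi, qi in zip(p, q):
        cp += pi
        cq += qi
        total += abs(cp - cq)
    return total

data = [0.5, 0.5, 0.0]       # toy data distribution
model = [0.25, 0.25, 0.5]    # toy model that puts mass where the data has none
forward_kl = kl(data, model)   # finite: MLE does not see the stray mass as fatal
reverse_kl = kl(model, data)   # infinite: reverse KL penalizes it without bound
jsd = js(data, model)          # finite and symmetric

# Two point masses with no overlap: JS saturates at log 2, W1 still tracks distance.
spike = [1.0, 0.0, 0.0]
near, far = [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
```

In the non-overlapping case `js(spike, near)` and `js(spike, far)` are identical, whereas `wasserstein_1d` doubles when the spike moves twice as far, which is the usual intuition for why the Wasserstein objective gives a more informative training signal.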

Reinforcement Learning recasts molecular generation as a sequential decision problem. The policy gradient (REINFORCE) update is:

$$ \nabla J(\theta) = \mathbb{E}\left[G_{t} \frac{\nabla_{\theta} \pi_{\theta}(a_{t} \mid y_{1:t-1})}{\pi_{\theta}(a_{t} \mid y_{1:t-1})}\right] $$
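For a softmax policy over a small token vocabulary, the ratio $\nabla_\theta \pi_\theta / \pi_\theta$ equals $\nabla_\theta \log \pi_\theta$, which has the closed form "one-hot of the chosen action minus the policy probabilities". The sketch below uses that identity for a single sampled action; the function names, the three-token policy, and the learning rate are illustrative assumptions, and the expectation in the formula is approximated by one sample.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_gradient(logits, action, ret):
    # Single-sample estimate: G_t * grad log pi(action), with logits as parameters.
    probs = softmax(logits)
    return [ret * ((1.0 if i == action else 0.0) - p) for i, p in enumerate(probs)]

def reinforce_step(logits, action, ret, lr=0.1):
    # Gradient ascent on expected return.
    grad = reinforce_gradient(logits, action, ret)
    return [l + lr * g for l, g in zip(logits, grad)]

logits = [0.0, 0.0, 0.0]                           # uniform policy over 3 tokens
rewarded = reinforce_step(logits, action=2, ret=1.0)    # positive return
penalized = reinforce_step(logits, action=2, ret=-1.0)  # negative return
```

A positive return raises the probability of the sampled action and a negative return lowers it, which is the whole mechanism by which a reward on the finished molecule shapes the token-level policy.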

To prevent RL fine-tuning from causing the generator to “drift” away from viable chemical structures, an augmented reward function incorporates the prior likelihood:

$$ R'(S) = [\sigma R(S) + \log P_{\text{prior}}(S) - \log P_{\text{current}}(S)]^{2} $$

Cataloging 45 Models and Their Design Choices

Rather than running new experiments, the review’s methodology involves systematically cataloging and comparing 45 published models. Table 2 in the paper lists each model’s architecture, representation, training dataset, and dataset size. Key patterns include:

RNN-based models (16 entries): Almost exclusively use SMILES, trained on ZINC or ChEMBL datasets with 0.1M-1.7M molecules.
VAE variants (20 entries): The most diverse category, spanning SMILES VAEs, grammar VAEs, junction tree VAEs, graph-based VAEs, and 3D VAEs. Training sets range from 10K to 72M molecules.
GAN models (7 entries): Include ORGAN, RANC, ATNC, MolGAN, and CycleGAN approaches. Notably, GANs appear to work with fewer training samples.
Other approaches (2 entries): Pure RL methods from Zhou et al. and Stahl et al. that do not require pretraining on a dataset.

The review also catalogs 13 publicly available datasets (Table 3), ranging from QM9 (133K molecules with quantum chemical properties) to GDB-13 (977M combinatorially generated molecules) and ZINC15 (750M+ commercially available compounds).

Metrics and Reward Function Design

A significant contribution is the systematic treatment of reward functions. The review argues that generated molecules should satisfy six desiderata: diversity, novelty, stability, synthesizability, non-triviality, and good properties. Key metrics formalized include:

Diversity using Tanimoto similarity over fingerprints:

$$ r_{\text{diversity}} = 1 - \frac{1}{|\mathcal{G}|} \sum_{(x_{1}, x_{2}) \in \mathcal{G} \times \mathcal{G}} D(x_{1}, x_{2}) $$

Novelty measured as the fraction of generated molecules not appearing in a hold-out test set:

$$ r_{\text{novel}} = 1 - \frac{|\mathcal{G} \cap \mathcal{T}|}{|\mathcal{T}|} $$

Synthesizability primarily assessed via the SA score, sometimes augmented with ring penalties and medicinal chemistry filters.

The review also discusses the Fréchet ChemNet Distance as an analog of FID for molecular generation, and notes the emergence of standardized benchmarking platforms including MOSES, GuacaMol, and DiversityNet.
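The diversity and novelty metrics can be sketched with fingerprints represented as sets of on-bit indices. Everything named here is an assumption for illustration: the toy fingerprints, the SMILES strings, and the helper names. Two choices deserve flagging: the diversity function averages Tanimoto similarity over ordered pairs of distinct molecules (so it normalizes by the number of pairs rather than by the $1/|\mathcal{G}|$ prefactor written above), and the novelty function follows the equation as written, with $|\mathcal{T}|$ in the denominator.

```python
def tanimoto(a, b):
    # Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits.
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def diversity(fingerprints):
    # 1 - mean pairwise similarity over distinct pairs; needs at least 2 molecules.
    n = len(fingerprints)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    mean_sim = sum(tanimoto(fingerprints[i], fingerprints[j]) for i, j in pairs) / len(pairs)
    return 1 - mean_sim

def novelty(generated, reference):
    # 1 - |G intersect T| / |T|, treating molecules as canonical strings.
    g, t = set(generated), set(reference)
    return 1 - len(g & t) / len(t)

# Toy inputs (hypothetical fingerprints and SMILES strings).
identical = [{1, 2, 3}, {1, 2, 3}]
disjoint = [{1}, {2}]
generated = ["CCO", "CCN"]
held_out = ["CCO", "CCC"]
```

With these inputs, `diversity(identical)` is 0, `diversity(disjoint)` is 1, and `novelty(generated, held_out)` is 0.5 because one of the two held-out molecules was reproduced.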

Key Findings and Future Directions

The review identifies several major trends and conclusions:

Shift from SMILES to graph-based representations. SMILES-based methods struggle with validity (the molecular autoencoder VAE achieved only 0.7-75% valid SMILES depending on sampling strategy). Methods that work directly on molecular graphs with chemistry-preserving operations achieve 100% validity, and the review predicts this trend will continue.

Advantages of adversarial and RL training over MLE. The mathematical analysis shows that MLE only optimizes forward KL divergence, which can lead to models that place probability mass where the data distribution is zero. GAN training optimizes the Jensen-Shannon divergence, which balances forward and reverse KL terms. RL approaches, particularly pure RL without pretraining, showed competitive performance with much less training data.

Genetic algorithms remain competitive. The review notes that the latest genetic algorithm approaches (Grammatical Evolution) could match deep learning methods for molecular optimization under some metrics, and at 100x lower computational cost in some comparisons. This serves as an important baseline calibration.

Reward function design is underappreciated. Early models generated unstable molecules with labile groups (enamines, hemiaminals, enol ethers). Better reward functions that incorporate synthesizability, diversity, and stability constraints significantly improved practical utility.

Need for standardized benchmarks. The review identifies a lack of agreement on evaluation methodology as a major barrier to progress, noting that published comparisons are often subtly biased toward novel methods.

Limitations

As a review paper from early 2019, the work predates several important developments: transformer-based architectures (which would soon dominate), SELFIES representations, diffusion models for molecules, and large-scale pretrained chemical language models. The review focuses primarily on drug-like small molecules and does not deeply cover protein design or materials optimization.

Reproducibility Details

Data

This is a review paper that does not present new experimental results. The paper catalogs 13 publicly available datasets used across the reviewed works; six of them are listed here:

Purpose	Dataset	Size	Notes
Training/Eval	GDB-13	977M	Combinatorially generated library
Training/Eval	ZINC15	750M+	Commercially available compounds
Training/Eval	GDB-17	50M	Combinatorially generated library
Training/Eval	ChEMBL	2M	Curated bioactive molecules
Training/Eval	QM9	133,885	Small organic molecules with DFT properties
Training/Eval	PubChemQC	3.98M	PubChem compounds with DFT data

Algorithms

The review provides mathematical derivations for MLE training (Eq. 1), VAE ELBO (Eqs. 9-13), AAE objectives (Eqs. 15-16), GAN objectives (Eqs. 19-22), WGAN (Eq. 24), REINFORCE gradient (Eq. 7), and numerous reward function formulations (Eqs. 26-36).

Evaluation

Key evaluation frameworks discussed:

Fréchet ChemNet Distance (molecular analog of FID)
MOSES benchmarking platform
GuacaMol benchmarking suite
Validity rate, uniqueness, novelty, and internal diversity metrics

Paper Information

Citation: Elton, D. C., Boukouvalas, Z., Fuge, M. D., & Chung, P. W. (2019). Deep Learning for Molecular Design: A Review of the State of the Art. Molecular Systems Design & Engineering, 4(4), 828-849. https://doi.org/10.1039/C9ME00039A

@article{elton2019deep,
  title={Deep Learning for Molecular Design -- A Review of the State of the Art},
  author={Elton, Daniel C. and Boukouvalas, Zois and Fuge, Mark D. and Chung, Peter W.},
  journal={Molecular Systems Design \& Engineering},
  volume={4},
  number={4},
  pages={828--849},
  year={2019},
  publisher={Royal Society of Chemistry},
  doi={10.1039/C9ME00039A}
}

NLP Models That Automate Programming for Chemistry

Thu, 26 Mar 2026 00:00:00 +0000

A Perspective on Code-Generating LLMs for Chemistry

This is a Position paper that argues large language models (LLMs) capable of generating code from natural language prompts, specifically OpenAI’s Codex and GPT-3, are poised to transform both chemistry research and chemistry education. Published in the inaugural volume of Digital Discovery (RSC), the paper combines a brief history of NLP developments with concrete demonstrations of code generation for computational chemistry tasks, then offers a forward-looking perspective on challenges and opportunities.

Bridging the Gap Between Natural Language and Scientific Software

The authors identify a core friction in modern computational chemistry: while the number of available software packages has grown dramatically, researchers spend a large fraction of their time learning interfaces to these packages rather than doing science. Tasks like searching documentation, following tutorials, and trial-and-error experimentation with APIs consume effort that could be directed at research itself.

At the same time, programming assignments in chemistry courses serve dual pedagogical purposes (reinforcing physical intuition and teaching marketable skills), but are constrained by students’ median programming experience. The emergence of code-generating NLP models opens the possibility of reducing both barriers simultaneously.

Code Generation as a Chemistry Interface

The paper’s core thesis is that NLP models trained on code can serve as a natural language interface to the entire ecosystem of scientific computing tools. The authors demonstrate this with several concrete examples using OpenAI Codex:

Quantum chemistry: Prompting Codex to “compute the dissociation curve of H2 using pyscf” produced correct, runnable code that selected Hartree-Fock with STO-3G. A follow-up prompt requesting “the most accurate method” caused it to switch to CCSD in a large basis set.
Chemical entity recognition: Using GPT-3 with only three training examples, the authors demonstrated extraction of chemical entity names from published text, a task that previously required thousands of labeled examples.
Molecular visualization: Drawing caffeine from its SMILES string, generating Gaussian input files from SMILES, implementing random walks, and downloading and analyzing PDB structures with MDTraj.
Voice-controlled molecular dynamics: The authors previously built MARVIS, a voice-controlled molecular dynamics analysis tool that uses GPT-3 to convert natural language into VMD commands. Only about a dozen examples were needed to teach GPT-3 to render proteins, change representations, and select atoms.

An important caveat: the authors emphasize that all chemistry “knowledge” (including the SMILES string for caffeine) is entirely contained in the model’s learned floating-point weights. The model has no access to databases or curated lists of chemical concepts.
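The few-shot setup described for chemical entity recognition amounts to placing a handful of worked examples ahead of the new text. The sketch below shows one plausible way to assemble such a prompt; the `build_prompt` helper, the "Text:/Chemicals:" layout, and the three example sentences are hypothetical and are not taken from the paper or its supplementary material. Sending the prompt to a model is deliberately left out.

```python
def build_prompt(examples, query):
    # Each example is (sentence, list of chemical names found in it).
    parts = []
    for text, entities in examples:
        parts.append(f"Text: {text}\nChemicals: {', '.join(entities)}\n")
    # The final block is left open so the model completes the entity list.
    parts.append(f"Text: {query}\nChemicals:")
    return "\n".join(parts)

# Hypothetical labeled examples (three, matching the count mentioned above).
examples = [
    ("The reaction was quenched with saturated sodium bicarbonate.", ["sodium bicarbonate"]),
    ("Toluene and ethanol were used as solvents.", ["toluene", "ethanol"]),
    ("The product was recrystallized from acetone.", ["acetone"]),
]
prompt = build_prompt(examples, "The mixture was extracted with ethyl acetate.")
```

The model's completion after the final "Chemicals:" is then read back as the extracted entity list, so the quality of the result depends entirely on what the model has already learned, consistent with the caveat above.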

Demonstrations and Practical Evaluation

Rather than a formal experimental evaluation with benchmarks and metrics, this perspective paper relies on qualitative demonstrations. The key examples, with full details provided in the Electronic Supplementary Information (ESI), include:

Task	Input	Result
H2 dissociation curve	Natural language prompt	Correct PySCF code (HF/STO-3G)
Upgrade method accuracy	Follow-up prompt	Switched to CCSD with large basis
Chemical NER	3 examples + new text	Extracted compound names (with some gaps)
Molecule drawing	“Load caffeine from SMILES, draw it”	Correct RDKit rendering
Gaussian input file	Function with docstring	Complete file writer with B3LYP/6-31G(d)
PDB analysis	Natural language description	Downloaded structure and computed radius of gyration

The authors note that Codex generates correct code at about a 30% rate on a single attempt for standard problems, improving to above 50% when multiple solutions are tried. Mistakes tend to occur when complex algorithms are requested with little specificity, and the code rarely has syntax errors but may fail in obvious ways (missing imports, wrong data types).
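The jump from about 30% to above 50% is roughly what simple arithmetic predicts. The sketch below shows two calculations: the chance of at least one success in $k$ attempts under the assumption that attempts are independent (an idealization, since samples from the same prompt are correlated), and the unbiased pass@k estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ used in the Codex paper for $n$ samples of which $c$ are correct. The function names and the sample counts are illustrative.

```python
from math import comb

def at_least_one_success(p, k):
    # Probability that at least one of k independent attempts succeeds.
    return 1 - (1 - p) ** k

def pass_at_k(n, c, k):
    # Unbiased pass@k estimate from n generated samples, c of them correct.
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1 - comb(n - c, k) / comb(n, k)

two_tries = at_least_one_success(0.30, 2)   # about 0.51
estimate_1 = pass_at_k(n=10, c=3, k=1)      # 0.3
estimate_2 = pass_at_k(n=10, c=3, k=2)      # 1 - 21/45, about 0.53
```

Even two attempts at a 30% single-shot rate are enough to cross 50% under these assumptions, which is why generating several candidates and testing them is a practical workflow.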

Challenges: Access, Correctness, and Bias

The paper identifies three ongoing challenges:

Access and price. Advanced models from OpenAI were, at the time of writing, limited to early testers. Per-query costs (1-3 cents for GPT-3) would become prohibitive at the scale needed for parsing academic literature or supporting medium-sized courses. The authors advocate for open-source models and equitable deployment by researchers with computational resources.

Correctness. Code generation does not guarantee correctness. The authors raise a subtle point: Codex may produce code that executes successfully but does not follow best scientific practice for a particular computational task. Over-reliance on AI-generated code without verification could erode trust in scientific software. However, they argue that strategies for assessing code correctness apply equally to human-written and AI-generated code.

Fairness and bias. The authors flag several concerns: models trained on AI-generated code could narrow the range of packages, methods, or programming languages used in chemistry. They observed Codex’s preference for Python and for specific popular libraries (e.g., defaulting to Psi4 for single-point energy calculations). GPT-3 has also been shown to reflect racism, sexism, and other biases present in its training data.

Implications for Research and Education

The authors conclude with an optimistic but measured outlook:

For research: NLP code generation will increase accessibility of software tools and expand what a single research group can accomplish. Better tools have historically not reduced the need for scientists but expanded the complexity of problems that can be tackled.
For programming skills: Using Codex will make chemists better programmers, not worse. The process of crafting prompts, mentally checking outputs, testing on sample inputs, and iterating develops algorithmic thinking. The authors report discovering chemistry software libraries they would not have found otherwise through iterative prompt creation.
For education: Instructors should rethink programming assignments. The authors suggest moving toward more difficult compound assignments, treating code exercises as laboratory explorations of scientific concepts rather than syntax drills, and aligning coursework with the tools students will have access to in their careers.
For accessibility: NLP models can reduce barriers for non-native English speakers (though accuracy with non-English prompts was not fully explored) and for users who have difficulty with keyboard-and-mouse interfaces (via voice control).

The paper acknowledges that these capabilities were, in early 2022, just beginning, with Codex being the first capable code-generation model. Already at the time of writing, models surpassing GPT-3 in language tasks had appeared, and models matching GPT-3 with 1/20th the parameters had been demonstrated.

Reproducibility Details

This is a perspective paper with qualitative demonstrations rather than a reproducible experimental study. The authors provide all prompts and multiple responses in the ESI.

Data

All prompts and code outputs are provided in the Electronic Supplementary Information (ESI) available from the RSC.

Algorithms

The paper does not introduce new algorithms. It evaluates existing models (GPT-3, Codex) on chemistry-related code generation tasks.

Models

Model	Provider	Access
GPT-3	OpenAI	API access (commercial)
Codex	OpenAI	Early tester program (2021)
GPT-Neo	EleutherAI	Open source

Evaluation

No formal metrics are reported for the chemistry demonstrations. The authors cite the Codex paper’s reported ~30% pass rate on single attempts and >50% with multiple attempts on standard programming problems.

Hardware

No hardware requirements are specified for the demonstrations (API-based inference).

Artifacts

Artifact	Type	License	Notes
MARVIS	Code	MIT	Voice-controlled MD analysis using GPT-3

Paper Information

Citation: Hocky, G. M., & White, A. D. (2022). Natural language processing models that automate programming will transform chemistry research and teaching. Digital Discovery, 1(2), 79-83. https://doi.org/10.1039/d1dd00009h

@article{hocky2022natural,
  title={Natural language processing models that automate programming will transform chemistry research and teaching},
  author={Hocky, Glen M. and White, Andrew D.},
  journal={Digital Discovery},
  volume={1},
  number={2},
  pages={79--83},
  year={2022},
  publisher={Royal Society of Chemistry},
  doi={10.1039/d1dd00009h}
}

Generative AI Survey for De Novo Molecule and Protein Design

Thu, 26 Mar 2026 00:00:00 +0000

A Systematization of Generative AI for Drug Design

This is a Systematization paper that provides a broad survey of generative AI methods applied to de novo drug design. The survey organizes the field into two overarching themes: small molecule generation and protein generation. Within each theme, the authors identify subtasks, catalog datasets and benchmarks, describe model architectures, and compare the performance of leading methods using standardized metrics. The paper covers over 200 references and provides 12 comparative benchmark tables.

The primary contribution is a unified organizational framework that allows both micro-level comparisons within each subtask and macro-level observations across the two application domains. The authors highlight parallel developments in both fields, particularly the shift from sequence-based to structure-based approaches and the growing dominance of diffusion models.

The Challenge of Navigating De Novo Drug Design

The drug design process requires creating ligands that interact with specific biological targets. These range from small molecules (tens of atoms) to large proteins (monoclonal antibodies). Traditional discovery methods are computationally expensive, with preclinical trials costing hundreds of millions of dollars and taking 3-6 years. The chemical space of potential drug-like compounds is estimated at $10^{23}$ to $10^{60}$, making brute-force exploration infeasible.

AI-driven generative methods have gained traction in recent years, with over 150 AI-focused biotech companies pursuing small-molecule drugs in the discovery phase and 15 in clinical trials. AI-driven drug design activity has grown by almost 40% each year.

The rapid development of the field, combined with its inherent complexity, creates barriers for new researchers. Several prior surveys exist, but they focus on specific aspects: molecule generation, protein generation, antibody generation, or specific model architectures like diffusion models. This survey takes a broader approach, covering both molecule and protein generation under a single organizational framework.

Unified Taxonomy: Two Themes, Seven Subtasks

The survey’s core organizational insight is structuring de novo drug design into two themes with distinct subtasks, while identifying common architectural patterns across them.

Generative Model Architectures

The survey covers four main generative model families used across both molecule and protein generation:

Variational Autoencoders (VAEs) encode inputs into a latent distribution and decode from sampled points. The encoder maps input $x$ to a distribution parameterized by mean $\mu_\phi(x)$ and variance $\sigma^2_\phi(x)$. Training minimizes reconstruction loss plus KL divergence:

$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \mathcal{L}_{\text{KL}}$$

where the KL loss is:

$$\mathcal{L}_{\text{KL}} = -\frac{1}{2} \sum_{k} \left(1 + \log(\sigma_k^{(i)2}) - \mu_k^{(i)2} - \sigma_k^{(i)2}\right)$$
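The KL term has a closed form because both the encoder output and the prior are Gaussian. A minimal sketch, assuming a diagonal-covariance encoder and a standard normal prior (the function names and the default `beta` are illustrative):

```python
import math

def gaussian_kl(mu, sigma2):
    # KL( N(mu, diag(sigma2)) || N(0, I) ), summed over latent dimensions k.
    return -0.5 * sum(1 + math.log(s) - m * m - s for m, s in zip(mu, sigma2))

def vae_loss(recon_loss, mu, sigma2, beta=1.0):
    # Total loss: reconstruction term plus beta-weighted KL regularizer.
    return recon_loss + beta * gaussian_kl(mu, sigma2)

matched = gaussian_kl([0.0, 0.0], [1.0, 1.0])   # encoder equals the prior
shifted = gaussian_kl([1.0], [1.0])             # mean shifted by one unit
```

The KL term is zero exactly when the encoder output matches the prior and grows as the mean or variance drifts away, which is how `beta` trades reconstruction fidelity against a smooth latent space.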

Generative Adversarial Networks (GANs) use a generator-discriminator game. The generator $G$ creates instances from random noise $z$ sampled from a prior $p_z(z)$, while the discriminator $D$ distinguishes real from synthetic data:

$$\min_{G} \max_{D} \mathbb{E}_x[\log D(x; \theta_d)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z; \theta_g); \theta_d))]$$

Flow-Based Models generate data by applying an invertible function $f: z_0 \mapsto x$ to transform a simple latent distribution (Gaussian) to the target distribution. The log-likelihood is computed using the change-of-variable formula:

$$\log p(x) = \log p_0(z_0) - \log \left| \det \frac{\partial f}{\partial z_0} \right|$$
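A one-dimensional affine flow makes the change of variables easy to check by hand. In this sketch (an illustration, with hypothetical function names and parameter values) the flow maps a standard normal latent $z_0$ to $x = a z_0 + b$. Because the map runs from latent to data, the log-determinant of its Jacobian, here $\log |a|$, enters the log-likelihood with a minus sign; equivalently, the log-determinant of the inverse map $x \mapsto z_0$ enters with a plus sign.

```python
import math

def std_normal_logpdf(z):
    # Log-density of the latent prior N(0, 1).
    return -0.5 * (z * z + math.log(2 * math.pi))

def affine_flow_logpdf(x, a, b):
    # Invert the flow x = a * z0 + b, then apply the change-of-variables formula.
    z0 = (x - b) / a
    return std_normal_logpdf(z0) - math.log(abs(a))

def normal_logpdf(x, mean, std):
    # Reference log-density of N(mean, std^2) for comparison.
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)

flow_value = affine_flow_logpdf(1.3, a=2.0, b=0.5)
reference = normal_logpdf(1.3, mean=0.5, std=2.0)
```

The two values agree, since pushing $\mathcal{N}(0, 1)$ through $x = a z_0 + b$ yields $\mathcal{N}(b, a^2)$. Practical flows stack many invertible layers and sum their log-determinants in the same way.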

Diffusion Models gradually add Gaussian noise over $T$ steps in a forward process and learn to reverse the noising via a denoising neural network. The forward step is:

$$x_{t+1} = \sqrt{1 - \beta_t} x_t + \sqrt{\beta_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

The training loss minimizes the difference between the true noise and the predicted noise:

$$L_t = \mathbb{E}_{t \sim [1,T], x_0, \epsilon_t} \left[ \lVert \epsilon_t - \epsilon_\theta(x_t, t) \rVert^2 \right]$$
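The forward process and the noise-prediction loss can be sketched without any learned network. The code below is illustrative only: the linear noise schedule and its endpoints are assumptions (they follow common DDPM defaults, not anything stated in the survey), and the helper names are hypothetical. It also uses the standard closed form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\bar{\alpha}_t = \prod_{s \le t} (1 - \beta_s)$, which follows from composing the single-step update.

```python
import math
import random

def linear_betas(num_steps, start=1e-4, end=0.02):
    # Assumed linear noise schedule; requires num_steps >= 2.
    return [start + (end - start) * t / (num_steps - 1) for t in range(num_steps)]

def forward_step(x, beta, rng):
    # One noising step: sqrt(1 - beta) * x + sqrt(beta) * eps, eps ~ N(0, I).
    return [math.sqrt(1 - beta) * xi + math.sqrt(beta) * rng.gauss(0, 1) for xi in x]

def alpha_bars(betas):
    # Cumulative products of (1 - beta): the signal fraction left after each step.
    out, prod = [], 1.0
    for b in betas:
        prod *= 1 - b
        out.append(prod)
    return out

def noise_to_step(x0, alpha_bar, rng):
    # Jump directly to step t using the closed form; also return the noise used.
    eps = [rng.gauss(0, 1) for _ in x0]
    xt = [math.sqrt(alpha_bar) * xi + math.sqrt(1 - alpha_bar) * e for xi, e in zip(x0, eps)]
    return xt, eps

def noise_prediction_loss(eps, eps_pred):
    # Squared L2 distance between true and predicted noise for one sample.
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

rng = random.Random(0)
betas = linear_betas(10)
bars = alpha_bars(betas)
x_t, eps = noise_to_step([1.0, -2.0], bars[-1], rng)
```

During training a denoising network would supply `eps_pred` from `x_t` and the step index; here the loss function is shown on its own so the objective is explicit.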

Graph neural networks (GNNs), particularly equivariant GNNs (EGNNs), are commonly paired with these generative methods to handle 2D/3D molecular and protein inputs. The pairing is most common for diffusion and flow-based models, while VAEs and GANs are typically used for 1D input.

Small Molecule Generation: Tasks, Datasets, and Models

Target-Agnostic Molecule Design

The goal is to generate a set of novel, valid, and stable molecules without conditioning on any specific biological target. Models are evaluated on atom stability, molecule stability, validity, uniqueness, novelty, and QED (Quantitative Estimate of Drug-Likeness).

Datasets: QM9 (small stable molecules from GDB-17) and GEOM-Drugs (more complex, drug-like molecules).

The field has shifted from SMILES-based VAEs (CVAE, GVAE, SD-VAE) to 2D graph methods (JTVAE) and then to 3D diffusion-based models. Current leading methods on QM9:

Model	Type	At Stb. (%)	Mol Stb. (%)	Valid (%)	Val/Uniq. (%)
MiDi	EGNN, Diffusion	99.8	97.5	97.9	97.6
MDM	EGNN, VAE, Diffusion	99.2	89.6	98.6	94.6
JODO	EGNN, Diffusion	99.2	93.4	99.0	96.0
GeoLDM	VAE, Diffusion	98.9	89.4	93.8	92.7
EDM	EGNN, Diffusion	98.7	82.0	91.9	90.7

EDM provided an initial baseline using diffusion with an equivariant GNN. GCDM introduced attention-based geometric message-passing. MDM separately handles covalent bond edges and Van der Waals forces, and also addresses diversity through an additional distribution-controlling noise variable. GeoLDM maps molecules to a lower-dimensional latent space for more efficient diffusion. MiDi uses a “relaxed” EGNN and jointly models 2D and 3D information through a graph representation capturing both spatial and connectivity data.

On the larger GEOM-Drugs dataset, performance drops for most models:

Model	At Stb. (%)	Mol Stb. (%)	Valid (%)	Val/Uniq. (%)
MiDi	99.8	91.6	77.8	77.8
MDM	–	62.2	99.5	99.0
GeoLDM	84.4	–	99.3	–
EDM	81.3	–	–	–

MiDi distinguishes itself for generating more stable complex molecules, though at the expense of validity. Models generally perform well on QM9 but show room for improvement on more complex GEOM-Drugs molecules.

Target-Aware Molecule Design

Target-aware generation produces molecules for specific protein targets, using either ligand-based (LBDD) or structure-based (SBDD) approaches. SBDD methods have become more prevalent as protein structure information becomes increasingly available.

Datasets: CrossDocked2020 (22.5M ligand-protein pairs), ZINC20, Binding MOAD.

Metrics: Vina Score (docking energy), High Affinity Percentage, QED, SA Score (synthetic accessibility), Diversity (Tanimoto similarity).

Model	Type	Vina	Affinity (%)	QED	SA	Diversity
DiffSBDD	EGNN, Diffusion	-7.333	–	0.467	0.554	0.758
Luo et al.	SchNet	-6.344	29.09	0.525	0.657	0.720
TargetDiff	EGNN, Diffusion	-6.3	58.1	0.48	0.58	0.72
LiGAN	CNN, VAE	-6.144	21.1	0.39	0.59	0.66
Pocket2Mol	EGNN, MLP	-5.14	48.4	0.56	0.74	0.69

DrugGPT is an LBDD autoregressive model using transformers on tokenized protein-ligand pairs. Among the SBDD models, LiGAN introduces a 3D CNN-VAE framework, Pocket2Mol emphasizes binding pocket geometry using an EGNN with geometric vector MLP layers, and Luo et al. model atomic probabilities in the binding site using SchNet. TargetDiff performs diffusion on an EGNN and optimizes binding affinity by reflecting low atom type entropy. DiffSBDD applies an inpainting approach by masking and replacing segments of ligand-protein complexes. DiffSBDD leads in Vina score and diversity, while TargetDiff leads in high affinity. Interestingly, diffusion-based methods are outperformed by Pocket2Mol on drug-likeness metrics (QED and SA).
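The Diversity metric listed above is based on Tanimoto similarity between molecular fingerprints. A minimal sketch follows, assuming fingerprints are given as sets of on-bit indices and that diversity is defined as one minus the mean pairwise similarity. That definition is one common convention, and individual papers may differ.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def diversity(fingerprints):
    """1 - mean pairwise Tanimoto similarity (an assumed convention)."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Hypothetical fingerprints for three generated molecules.
fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
print(diversity(fps))
```

In practice the bit sets would come from a cheminformatics toolkit. They are hard-coded here so the example stays self-contained.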

Molecular Conformation Generation

Conformation generation involves producing 3D structures from 2D connectivity graphs. Models are evaluated on Coverage (COV, percentage of ground-truth conformations “covered” within an RMSD threshold) and Matching (MAT, average RMSD to closest ground-truth conformation).
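Both metrics reduce to simple operations on a matrix of pairwise RMSD values. The sketch below assumes the recall-style variant: each reference conformer is matched to its closest generated conformer, and a reference counts as covered when that distance is at or below the threshold. The RMSD matrix itself is assumed to be precomputed after alignment.

```python
def cov_mat(rmsd, threshold):
    """rmsd[i][j] = RMSD between reference conformer i and generated conformer j.

    Returns (COV in percent, MAT in the same units as the RMSD values).
    """
    best = [min(row) for row in rmsd]  # closest generated conformer per reference
    cov = 100.0 * sum(d <= threshold for d in best) / len(best)
    mat = sum(best) / len(best)
    return cov, mat

# Hypothetical RMSD values: 3 reference conformers, 2 generated conformers.
rmsd = [[0.2, 1.0],
        [0.9, 0.6],
        [1.5, 2.0]]
print(cov_mat(rmsd, threshold=0.75))
```

Because COV depends on the threshold while MAT does not, scores reported at different thresholds are not directly comparable.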

Datasets: GEOM-QM9, GEOM-Drugs, ISO17.

Model	Type	GEOM-QM9 COV (%)	GEOM-QM9 MAT	GEOM-Drugs COV (%)	GEOM-Drugs MAT
Torsional Diff.	Diffusion	92.8	0.178	72.7*	0.582
DGSM	MPNN, Diffusion	91.49	0.2139	78.73	1.0154
GeoDiff	GFN, Diffusion	90.07	0.209	89.13	0.8629
ConfGF	GIN, Diffusion	88.49	0.2673	62.15	1.1629
GeoMol	MPNN	71.26	0.3731	67.16	1.0875

*Torsional Diffusion uses a 0.75 Å threshold instead of the standard 1.25 Å for GEOM-Drugs coverage, leading to a deflated score. It outperforms GeoDiff and GeoMol when evaluated at the same threshold.

Torsional Diffusion operates in the space of torsion angles rather than Cartesian coordinates, allowing for improved representation and fewer denoising steps. GeoDiff uses Euclidean-space diffusion, treating each atom as a particle and incorporating Markov kernels that preserve E(3) equivariance through a graph field network (GFN) layer.
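To make the torsion-angle representation concrete, the sketch below computes the dihedral angle defined by four atom positions. It illustrates only the coordinate-to-angle conversion, not the diffusion model itself. The sign convention is an assumption and varies between toolkits.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion_angle(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two bond planes
    y = _dot(b2, _cross(n1, n2)) / math.sqrt(_dot(b2, b2))
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Cis arrangement: both end atoms on the same side of the central bond.
print(torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))
# Trans arrangement: end atoms on opposite sides (magnitude 180 degrees).
print(torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)))
```

A molecule with a handful of rotatable bonds is described by a handful of such angles, rather than three Cartesian coordinates per atom.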

Protein Generation: From Sequence to Structure

Protein Representation Learning

Representation learning creates embeddings for protein inputs to support downstream tasks. Models are evaluated on contact prediction, fold classification (at family, superfamily, and fold levels), and stability prediction (Spearman’s $\rho$).

Key models include: UniRep (mLSTM RNN), ProtBERT (BERT applied to amino acid sequences), ESM-1B (33-layer, 650M parameter transformer), MSA Transformer (pre-trained on MSA input), and GearNET (Geo-EGNN using 3D structure with directed edges). OntoProtein and KeAP incorporate knowledge graphs for direct knowledge injection.

Protein Structure Prediction

Given an amino acid sequence, models predict 3D point coordinates for each residue. Predictions are evaluated using RMSD, GDT-TS, TM-score, and lDDT on CASP14 and CAMEO benchmarks.

AlphaFold2 is the landmark model, integrating MSA and pair representations through transformers with invariant point attention (IPA). ESMFold uses ESM-2 language model representations instead of MSAs, achieving faster processing. RoseTTAFold uses a three-track neural network learning from 1D sequence, 2D distance map, and 3D backbone coordinate information simultaneously. EigenFold uses diffusion, representing the protein as a system of harmonic oscillators.

Model	Type	CAMEO RMSD	CAMEO TMScore	CAMEO GDT-TS	CAMEO lDDT	CASP14 TMScore
AlphaFold2	Transformer	3.30	0.87	0.86	0.90	0.38
ESMFold	Transformer	3.99	0.85	0.83	0.87	0.68
RoseTTAFold	Transformer	5.72	0.77	0.71	0.79	0.37
EigenFold	Diffusion	7.37	0.75	0.71	0.78	–

Sequence Generation (Inverse Folding)

Given a fixed protein backbone structure, models generate amino acid sequences that will fold into that structure. The space of valid sequences is between $10^{65}$ and $10^{130}$.

Evaluated using Amino Acid Recovery (AAR), diversity, RMSD, nonpolar loss, and perplexity (PPL):

$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i | x_1, x_2, \ldots x_{i-1})\right)$$
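AAR and PPL are both simple to compute once a model's outputs are available. The sketch below assumes the common convention that perplexity is the exponential of the average negative log-probability assigned to the true residues. The sequences and probabilities are made-up inputs for illustration.

```python
import math

def amino_acid_recovery(native, designed):
    """Percentage of positions where the designed residue matches the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have equal length")
    matches = sum(a == b for a, b in zip(native, designed))
    return 100.0 * matches / len(native)

def perplexity(token_probs):
    """exp of the mean negative log-probability of each true residue."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

print(amino_acid_recovery("ACDE", "ACDF"))  # 3 of 4 positions recovered
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # equivalent to a uniform 4-way guess
```

A perplexity of 20 would correspond to guessing uniformly among the 20 standard amino acids, so lower values indicate a more confident and accurate model.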

ProteinMPNN is the current top performer, generating the most accurate sequences and leading in AAR, RMSD, and nonpolar loss. It uses a message-passing neural network with a flexible, order-agnostic autoregressive approach.

Model	Type	AAR (%)	Div.	RMSD	Non.	Time (s)
ProteinMPNN	MPNN	48.7	0.168	1.019	1.061	112
ESM-IF1	Transformer	47.7	0.184	1.265	1.201	1980
GPD	Transformer	46.2	0.219	1.758	1.333	35
ABACUS-R	Transformer	45.7	0.124	1.482	0.968	233280
3D CNN	CNN	44.5	0.272	1.62	1.027	536544
PiFold	GNN	42.8	0.141	1.592	1.464	221
ProteinSolver	GNN	24.6	0.186	5.354	1.389	180

Results are from the independent benchmark by Yu et al. GPD remains the fastest method, generating sequences around three times faster than ProteinMPNN. Current SOTA models recover fewer than half of target amino acid residues, indicating room for improvement.

Backbone Design

Backbone design creates protein structures from scratch, representing the core of de novo protein design. Models generate coordinates for backbone atoms (nitrogen, alpha-carbon, carbonyl carbon, and oxygen) and use external tools like Rosetta for side-chain packing.

Two evaluation paradigms exist: context-free generation (evaluated by self-consistency TM, or scTM) and context-given generation (inpainting, evaluated by AAR, PPL, RMSD).

ProtDiff represents residues as 3D Cartesian coordinates and uses particle-filtering diffusion. FoldingDiff instead uses an angular representation (six angles per residue) with a BERT-based DDPM. LatentDiff embeds proteins into a latent space using an equivariant autoencoder, then applies equivariant diffusion, analogous to GeoLDM for molecules. These early models work well for short proteins (up to 128 residues) but struggle with longer structures.

Frame-based methods address this scaling limitation. Genie uses Frenet-Serret frames with paired residue representations and IPA for noise prediction. FrameDiff parameterizes backbone structures on the $SE(3)^N$ manifold of frames using a score-based generative model. RFDiffusion is the current leading model, combining RoseTTAFold structure prediction with diffusion. It fine-tunes RoseTTAFold weights on a masked input sequence and random noise coordinates, using “self-conditioning” on predicted structures. Protpardelle co-designs sequence and structure by creating a “superposition” over possible sidechain states and collapsing them during each iterative diffusion step.

Model	Type	scTM (%)	Design. (%)	PPL	AAR (%)	RMSD
RFDiffusion	Diffusion	–	95.1	–	–	–
Protpardelle	Diffusion	85	–	–	–	–
FrameDiff	Diffusion	84	48.3	–	–	–
Genie	Diffusion	81.5	79.0	–	–	–
LatentDiff	EGNN, Diffusion	31.6	–	–	–	–
FoldingDiff	Diffusion	14.2	–	–	–	–
ProtDiff	EGNN, Diffusion	11.8	–	–	12.47*	8.01*

*ProtDiff context-given results are tested only on beta-lactamase metalloproteins from PDB.

Antibody Design

The survey covers antibody structure prediction, representation learning, and CDR-H3 generation. Antibodies are Y-shaped proteins with complementarity-determining regions (CDRs), where CDR-H3 is the most variable and functionally important region.

For CDR-H3 generation, models have progressed from sequence-based (LSTM) to structure-based (RefineGNN) and sequence-structure co-design approaches (MEAN, AntiDesigner, DiffAb). dyMEAN is the current leading model, providing an end-to-end method incorporating structure prediction, docking, and CDR generation into a single framework. MSAs cannot be used for antibody input, which makes general models like AlphaFold2 inefficient for antibody prediction. Specialized models like IgFold use sequence embeddings from AntiBERTy with invariant point attention to achieve faster antibody structure prediction.

Peptide Design

The survey briefly covers peptide generation, including models for therapeutic peptide generation (MMCD), peptide-protein interaction prediction (PepGB), peptide representation learning (PepHarmony), peptide sequencing (AdaNovo), and signal peptide prediction (PEFT-SP).

Current Trends, Challenges, and Future Directions

Current Trends

The survey identifies several parallel trends across molecule and protein generation:

Shift from sequence to structure: In molecule generation, graph-based diffusion models (GeoLDM, MiDi, TargetDiff) now dominate. In protein generation, structure-based representation learning (GearNET) and diffusion-based backbone design (RFDiffusion) have overtaken sequence-only methods.
Dominance of E(3) equivariant architectures: EGNNs appear across nearly all subtasks, reflecting the physical requirement that molecular and protein properties should be invariant to rotation and translation.
Structure-based over ligand-based approaches: In target-aware molecule design, SBDD methods that use 3D protein structures demonstrate clear advantages over LBDD approaches that operate on amino acid sequences alone.

Challenges

For small molecule generation:

Complexity: Models perform well on simple QM9 but struggle with complex GEOM-Drugs molecules.
Applicability: Generating molecules with high binding affinity to targets remains difficult.
Explainability: Methods are black-box, offering no insight into why generated molecules have desired properties.

For protein generation:

Benchmarking: Protein generative tasks lack a standard evaluative procedure, with variance between each model’s metrics and testing conditions.
Performance: SOTA models still struggle with fold classification, gene ontology, and antibody CDR-H3 generation.

The authors also note that many generative tasks are evaluated using predictive models (e.g., classifier networks for binding affinity or molecular properties). Improvements to these classification methods would lead to more precise alignment with real-world biological applications.

Future Directions

The authors identify three key future directions: improving performance on existing tasks, defining more applicable tasks (especially in molecule-protein binding and antibody generation), and exploring entirely new areas of research.

Reproducibility Details

As a survey paper, this work does not produce new models, datasets, or experimental results. All benchmark numbers reported are from the original papers cited.

Data

The survey catalogs the following key datasets across subtasks:

Subtask	Datasets	Notes
Target-agnostic molecule	QM9, GEOM-Drugs	QM9 from GDB-17; GEOM-Drugs for complex molecules
Target-aware molecule	CrossDocked2020, ZINC20, Binding MOAD	CrossDocked2020 most used (22.5M pairs)
Conformation generation	GEOM-QM9, GEOM-Drugs, ISO17	Conformer sets for molecules
Protein structure prediction	PDB, CASP14, CAMEO	CASP biennial blind evaluation
Protein sequence generation	PDB, UniRef, UniParc, CATH, TS500	CATH for domain classification
Backbone design	PDB, AlphaFoldDB, SCOP, CATH	AlphaFoldDB for expanded structural coverage
Antibody structure	SAbDab, RAB	SAbDab: all antibody structures from PDB
Antibody CDR generation	SAbDab, RAB, SKEMPI	SKEMPI for affinity optimization

Artifacts

Artifact	Type	License	Notes
GenAI4Drug	Code	Not specified	Organized repository of all covered sources

Paper Information

Citation: Tang, X., Dai, H., Knight, E., Wu, F., Li, Y., Li, T., & Gerstein, M. (2024). A survey of generative AI for de novo drug design: New frontiers in molecule and protein generation. Briefings in Bioinformatics, 25(4), bbae338. https://doi.org/10.1093/bib/bbae338

Publication: Briefings in Bioinformatics, Volume 25, Issue 4, 2024.

@article{tang2024survey,
  title={A survey of generative AI for de novo drug design: new frontiers in molecule and protein generation},
  author={Tang, Xiangru and Dai, Howard and Knight, Elizabeth and Wu, Fang and Li, Yunyang and Li, Tianxiao and Gerstein, Mark},
  journal={Briefings in Bioinformatics},
  volume={25},
  number={4},
  pages={bbae338},
  year={2024},
  doi={10.1093/bib/bbae338}
}

Foundation Models in Chemistry: A 2025 Perspective

Thu, 26 Mar 2026 00:00:00 +0000

A Systematization of Foundation Models for Chemistry

This is a Systematization paper. It organizes the rapidly growing landscape of foundation models in chemistry into a coherent taxonomy. The paper distinguishes between “small” foundation models (pretrained for a single application domain) and “big” foundation models (adaptable across multiple domains such as property prediction and inverse design). It covers models based on graph neural networks (GNNs) and language models, reviews pretraining strategies (self-supervised, multimodal, supervised), and maps approximately 40 models across four application domains.

Why a Foundation Model Perspective for Chemistry?

Foundation models have transformed NLP and computer vision through large-scale pretraining and transfer learning. In chemistry, however, several persistent challenges motivate the adoption of this paradigm:

Data scarcity: Chemical datasets are often small and expensive to generate (requiring experiments or quantum mechanical calculations), unlike the large annotated datasets available in NLP/CV.
Poor generalization: ML models in chemistry frequently need to extrapolate to out-of-domain compounds (e.g., novel drug candidates, unseen crystal structures), where conventional models struggle.
Limited transferability: Traditional ML interatomic potentials (MLIPs) are trained on system-specific datasets and cannot be easily transferred across different chemical systems.

Foundation models address these by learning general representations from large unlabeled datasets, which can then be adapted to specific downstream tasks via finetuning. The paper argues that summarizing this fast-moving field is timely, given the diversity of approaches emerging across molecular property prediction, MLIPs, inverse design, and multi-domain applications.

Small vs. Big Foundation Models: A Two-Tier Taxonomy

The paper’s central organizing framework distinguishes two scopes of foundation model:

Small foundation models are pretrained models adapted to various tasks within a single application domain. Examples include:

A model pretrained on large molecular databases that predicts multiple molecular properties (band gap, formation energy, etc.)
A universal MLIP that can simulate diverse chemical systems
A pretrained generative model adapted for inverse design of different target properties

Big foundation models span multiple application domains, handling both property prediction and inverse design within a single framework. These typically use multimodal learning (combining SMILES/graphs with text) or build on large language models.

Architectures

The paper reviews two primary architecture families:

Graph Neural Networks (GNNs) represent molecules and crystals as graphs $G = (V, E)$ with nodes (atoms) and edges (bonds). Node features are updated through message passing:

$$ m_{i}^{t+1} = \sum_{j \in N(i)} M_{t}(v_{i}^{t}, v_{j}^{t}, e_{ij}^{t}) $$

$$ v_{i}^{t+1} = U_{t}(v_{i}^{t}, m_{i}^{t+1}) $$

After $T$ message-passing steps, a readout function produces a graph-level feature:

$$ g = R(\{v_{i}^{T} \mid i \in G\}) $$
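The message, update, and readout functions can be illustrated on a toy graph with scalar node features. The specific choices below (message $M_t = v_j \cdot e_{ij}$, update $U_t = v_i + m_i$, sum readout) are assumptions picked for simplicity. Real models use learned neural networks for each function.

```python
def message_passing_step(node_feats, edges, edge_feats):
    """One step with message M_t = v_j * e_ij and update U_t = v_i + m_i."""
    messages = {i: 0.0 for i in node_feats}
    for (i, j) in edges:                  # undirected bond: send messages both ways
        e = edge_feats[(i, j)]
        messages[i] += node_feats[j] * e
        messages[j] += node_feats[i] * e
    return {i: node_feats[i] + messages[i] for i in node_feats}

def readout(node_feats):
    """Permutation-invariant sum readout R over all nodes."""
    return sum(node_feats.values())

# Three-atom chain 0-1-2 with hypothetical scalar features and unit edge weights.
feats = {0: 1.0, 1: 2.0, 2: 3.0}
edges = [(0, 1), (1, 2)]
edge_feats = {(0, 1): 1.0, (1, 2): 1.0}

feats = message_passing_step(feats, edges, edge_feats)
print(feats, readout(feats))
```

After one step each node has seen only its direct neighbors. Repeating the step $T$ times lets information travel $T$ bonds away before the readout.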

Recent equivariant GNNs (e.g., NequIP, MACE, EquformerV2) use vectorial features that respect geometric symmetries, improving expressivity for tasks sensitive to 3D structure.

Language Models operate on string representations of molecules (SMILES, SELFIES) or crystal structures. Autoregressive models like GPT maximize:

$$ \prod_{t=1}^{T} P(x_{t} \mid x_{1}, x_{2}, \ldots, x_{t-1}) $$

Transformers use self-attention:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V $$
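The attention formula can be written out in plain Python for small matrices. This is a sketch for intuition: it handles a single head, applies no masking, and represents matrices as lists of rows.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists-of-rows matrices."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

# A zero query scores every key equally, so the output is the mean of the values.
print(attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]))
```

For a SMILES string, each token would contribute one row to `Q`, `K`, and `V`, which lets every token attend to every other token regardless of distance in the string.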

Pretraining Strategies

The paper categorizes pretraining methods into three self-supervised learning (SSL) approaches plus supervised and multimodal strategies:

Strategy	Mechanism	Example Models
Contrastive learning	Maximize similarity between positive pairs, minimize for negatives	GraphCL, MolCLR, GraphMVP, CrysGNN
Predictive learning	Predict self-generated labels (node context, functional groups, space group)	GROVER, Hu et al., CrysGNN
Generative learning	Reconstruct masked nodes/edges or entire molecules/SMILES	SMILES-BERT, ChemBERTa-2, MoLFormer
Supervised pretraining	Train on energy, forces, stress from DFT databases	M3GNet, CHGNet, MACE-MP-0, MatterSim
Multimodal learning	Learn joint representations across SMILES/graph + text modalities	KV-PLM, MoMu, MoleculeSTM, SPMM

A common finding across studies is that combining local and global information (e.g., via contrastive learning between node-level and graph-level views, or supervised learning on both forces and total energy) produces more transferable representations.
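The contrastive row of the table can be made concrete with an InfoNCE-style loss, a widely used contrastive objective. Whether a given model in the table uses exactly this form is not stated here, so treat the loss, the cosine similarity, and the temperature value as assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log( exp(s_pos / tau) / (exp(s_pos / tau) + sum_k exp(s_neg_k / tau)) )."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Hypothetical embeddings: two views of one molecule versus a different molecule.
anchor, positive, negative = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
print(info_nce(anchor, positive, [negative]))
```

The loss falls as the anchor moves closer to its positive view and farther from the negatives, which is the behavior the table describes.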

Survey of Models Across Four Domains

Property Prediction

The paper reviews 13 models for molecular and materials property prediction. Key findings:

Contrastive learning approaches (GraphCL, MolCLR, GraphMVP) achieve strong results by defining positive pairs through augmentation, 2D/3D structure views, or crystal system membership.
Language model approaches (SMILES-BERT, ChemBERTa-2, MoLFormer) show that transformers trained on SMILES via masked language modeling can compete with GNN-based approaches.
MoLFormer, pretrained on 1.1 billion SMILES from PubChem and ZINC, outperformed many baselines including GNNs on MoleculeNet and QM9 benchmarks. Its attention maps captured molecular structural features directly from SMILES strings.
For crystalline materials, CrysGNN combined contrastive, predictive, and generative learning, demonstrating improvements even on small experimental datasets.

Machine Learning Interatomic Potentials (MLIPs)

The paper surveys 10 universal MLIPs, all using supervised learning on DFT-calculated energies, forces, and stresses:

Model	Architecture	Training Data Size	Key Capability
M3GNet	GNN	187K (MP)	First universal MLIP
CHGNet	GNN	1.58M (MPtrj)	Predicts magnetic moments
MACE-MP-0	MACE	1.58M (MPtrj)	35 diverse applications
GNoME potential	NequIP	89M	Zero-shot comparable to trained MLIPs
MatterSim	M3GNet/Graphormer	17M	SOTA on Matbench Discovery
eqV2	EquformerV2	118M (OMat24)	Structural relaxation

The GNoME potential, trained on approximately 89 million data points, achieved zero-shot performance comparable to state-of-the-art MLIPs trained from scratch. MatterSim, trained on over 17 million entries across wide temperature (0-5000 K) and pressure (0-1000 GPa) ranges, achieved state-of-the-art results on Matbench Discovery and accurately computed thermodynamic and lattice dynamic properties.

Inverse Design

Few pretrained generative models for inverse design exist. The paper highlights three:

MatterGen (Microsoft): Diffusion model pretrained on Alexandria/MP databases (607K structures), finetuned for conditional generation on band gap, elastic modulus, space group, and composition. Generated S.U.N. (stable, unique, novel) materials at rates more than 2x the previous state of the art.
GP-MoLFormer (IBM): MoLFormer pretrained on 1.1B SMILES, finetuned via pair-tuning for property-guided molecular optimization.
CrystalLLM: Finetuned LLaMA-2 70B for crystal generation with target space group and composition using string representations and prompting.

Multi-Domain Models

The paper covers two multi-domain categories:

Property prediction + MLIP: Denoising pretraining learns virtual forces that guide noisy configurations back to equilibrium, connecting to force prediction. Joint multi-domain pretraining (JMP) from Meta FAIR achieved state-of-the-art on 34 of 40 tasks spanning molecules, crystals, and MOFs by training simultaneously on diverse energy/force databases.

Property prediction + inverse design: Multimodal models (KV-PLM, MoMu, MoleculeSTM, MolFM, SPMM) learn joint representations from molecular structures and text, enabling text-based inverse design and property prediction in a single framework. LLM-based models (ChemDFM, nach0, finetuned GPT-3) can interact with humans and handle diverse chemistry tasks through instruction tuning.

Trends and Future Directions

Scope Expansion

The authors identify three axes for expanding foundation model scope:

Material types: Most models target molecules or a single material class. Foundation models that span molecules, crystals, surfaces, and MOFs could exploit shared chemistry across materials.
Modalities: Beyond SMILES, graphs, and text, additional modalities (images, spectral data like XRD patterns) remain underexplored.
Downstream tasks: Extending to new chemistry and tasks through emergent capabilities, analogous to the capabilities observed in LLMs at scale.

Performance and Scaling

Key scaling challenges include:

Data quality vs. quantity: Noisy DFT labels (e.g., HOMO-LUMO gaps with high uncertainty from different functionals/basis sets) can limit scalability and out-of-distribution performance.
GNN scalability: While transformers scale to hundreds of billions of parameters, GNNs have rarely been explored above one million parameters due to oversmoothing and the curse of dimensionality. Recent work by Sypetkowski et al. demonstrated scaling GNNs to 3 billion parameters with consistent improvements.
Database integration: Combining datasets from different DFT codes requires proper alignment (e.g., total energy alignment methods).

Efficiency

For MLIPs, efficiency is critical since MD simulations require millions of inference steps. Approaches include:

Knowledge distillation from expensive teacher models to lighter student models
Model compression techniques (quantization, pruning) adapted for GNNs
Investigating whether strict equivariance is always necessary

Interpretability

Foundation models can generate hallucinations or mode-collapsed outputs. The authors highlight recent interpretability advances (feature extraction from Claude 3, knowledge localization and editing in transformers) as promising directions for more reliable chemical applications.

Key Findings and Limitations

Key findings:

Combining local and global information in pretraining consistently improves downstream performance across all domains reviewed.
Self-supervised pretraining enables effective transfer learning even in low-data regimes, a critical advantage for chemistry.
Universal MLIPs have reached the point where zero-shot performance can be comparable to system-specific trained models.
Multimodal learning is the most promising approach for big foundation models capable of spanning property prediction and inverse design.

Limitations acknowledged by the authors:

The precise definition of “foundation model” in chemistry is not established and varies by scope.
Most surveyed models focus on molecules, with crystalline materials less explored.
Benchmarks for low-data regimes and out-of-distribution performance are insufficient.
The paper focuses on three core domains (property prediction, MLIPs, inverse design) and their multi-domain combinations, and does not cover retrosynthesis, reaction prediction, or other chemical tasks in depth.

Reproducibility Details

Data

This is a perspective/review paper. No new data or models are introduced. The paper surveys existing models and their training datasets, summarized in Table 1 of the paper.

Algorithms

Not applicable (review paper). The paper describes pretraining strategies (contrastive, predictive, generative, supervised, multimodal) at a conceptual level with references to the original works.

Models

Not applicable (review paper). The paper catalogs approximately 40 foundation models across four domains. See Table 1 in the paper for the complete listing.

Evaluation

Not applicable (review paper). The paper references benchmark results from the original studies (MoleculeNet, QM9, Matbench, Matbench Discovery, JARVIS-DFT) but does not perform independent evaluation.

Hardware

Not applicable (review paper).

Paper Information

Citation: Choi, J., Nam, G., Choi, J., & Jung, Y. (2025). A Perspective on Foundation Models in Chemistry. JACS Au, 5(4), 1499-1518. https://doi.org/10.1021/jacsau.4c01160

@article{choi2025perspective,
  title={A Perspective on Foundation Models in Chemistry},
  author={Choi, Junyoung and Nam, Gunwook and Choi, Jaesik and Jung, Yousung},
  journal={JACS Au},
  volume={5},
  number={4},
  pages={1499--1518},
  year={2025},
  publisher={American Chemical Society},
  doi={10.1021/jacsau.4c01160}
}

Review of Molecular Representation Learning Models

Wed, 25 Mar 2026 00:00:00 +0000

A Systematization of Molecular Representation Foundation Models

This paper is a Systematization that provides the first comprehensive review of foundation models for molecular representation learning (MRL). The authors classify existing models by their input modality (unimodal vs. multimodal), analyze four mainstream pretraining strategies, survey five downstream application domains, and propose practical guidelines for model selection. The review covers over 35 representative models published between 2020 and 2024, with parameter counts ranging from 2 million to over 1 trillion.

Why a Systematic Review of MRL Foundation Models Is Needed

Molecular representation learning transforms molecular structures and properties into numerical vectors that serve as inputs for machine learning models. The field has evolved rapidly from molecular fingerprints through SMILES-based sequence models to graph neural networks and 3D geometry-aware architectures. Foundation models, characterized by large-scale pretraining on unlabeled molecular data followed by fine-tuning on downstream tasks, have introduced new opportunities for generalizability and transfer learning in drug discovery.

Despite this rapid progress, the authors identify a gap: no prior work has systematically reviewed MRL foundation models across all input modalities and pretraining paradigms. Existing surveys tend to focus on specific representations (e.g., graph-based methods) or specific applications (e.g., property prediction) without providing the cross-cutting perspective needed to guide model selection. This review fills that gap by offering a unified taxonomy and practical guidelines.

Taxonomy of Molecular Descriptors and Model Architectures

The core organizational framework classifies models along two axes: the molecular descriptor used as input and the backbone architecture.

Molecular Descriptors

The review identifies five primary descriptor types:

Molecular fingerprints: Binary vectors encoding structural features (e.g., Morgan fingerprints). Rarely used in foundation models due to information loss and dimensional complexity.
1D sequences: SMILES and SELFIES string representations. SMILES is compact and widely used but can produce invalid molecules. SELFIES guarantees valid molecular strings by construction.
2D topological graphs: Atoms as nodes, bonds as edges. Can be derived from SMILES via RDKit, making graph datasets effectively interchangeable with SMILES datasets.
3D geometry: Spatial coordinates capturing conformational information, energy states, and stereochemistry. Experimentally expensive to obtain, limiting dataset availability.
Multimodal: Combinations of the above with text, IUPAC names, knowledge graphs, and molecular images.

The paper also discusses mathematically abstract molecular representations. For example, the Wiener index quantifies structural complexity:

$$ W = \frac{1}{2} \sum_{i \neq j} d_{ij} = \sum_{i < j} d_{ij} $$

where $d_{ij}$ is the topological distance (shortest bonding path length) between atoms $i$ and $j$.

Degree centrality captures local connectivity:

$$ C_{D}(v_{i}) = \sum_{j=1}^{n} A_{ij} $$

where $A \in \mathbb{R}^{n \times n}$ is the molecular graph adjacency matrix.
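Both descriptors can be computed from an adjacency matrix alone. The sketch below assumes the standard definition of the Wiener index as the sum of shortest-path distances over all unordered atom pairs, and it assumes a connected, hydrogen-suppressed graph. The n-butane example is an illustrative choice.

```python
from collections import deque

def shortest_path_lengths(adj, source):
    """Breadth-first search distances (in bonds) from one atom to all others."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, bonded in enumerate(adj[u]):
            if bonded and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def wiener_index(adj):
    """Sum of topological distances over unordered atom pairs (connected graph assumed)."""
    n = len(adj)
    total = 0
    for i in range(n):
        dist = shortest_path_lengths(adj, i)
        total += sum(dist[j] for j in range(i + 1, n))
    return total

def degree_centrality(adj):
    """Row sums of the adjacency matrix: number of bonded neighbors per atom."""
    return [sum(row) for row in adj]

# Carbon skeleton of n-butane: C0-C1-C2-C3.
butane = [[0, 1, 0, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]
print(wiener_index(butane), degree_centrality(butane))
```

Branched isomers have shorter average path lengths, so isobutane's carbon skeleton gives a smaller Wiener index than n-butane's.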

Model Architectures

Models are classified into two primary categories:

Unimodal-based models:

Sequence-based: Transformer models operating on SMILES/SELFIES (e.g., ChemBERTa-2, MoLFormer, MolGEN, LlaSMol). These capture syntactic patterns but miss spatial and topological features.
Topological graph-based: GNN variants (GIN, GCN, GAT) and Transformer-based graph models (Graphormer). GNNs capture local topology through message passing; Transformers overcome locality limitations through global self-attention.
3D geometry-based: Models like Uni-Mol and 3D PGT that incorporate spatial coordinates. Uni-Mol uses distance-aware self-attention with an SE(3)-equivariant coordinate head for rotation/translation invariance.
Image-based: CNN-based models (ImageMol) that process 2D molecular images using visual representation learning.

Multimodal-based models:

Sequence + Graph: DVMP, PanGu Drug Model. Combines the strengths of string and topological representations.
Graph + 3D Geometry: GraphMVP, Transformer-M. Enriches topological features with spatial information.
Text + Molecular Structure: KV-PLM, MolT5, MoleculeSTM, MolReGPT, Y-mol. Aligns molecular structural information with biomedical text through cross-modal learning.

Four Pretraining Paradigms for MRL

The review systematically categorizes pretraining strategies into four paradigms:

Masked Language Modeling (MLM)

The cornerstone strategy for sequence-based models. Randomly masks tokens in molecular sequences and trains the model to predict them. ChemBERTa pretrained on 77 million SMILES sequences from PubChem achieves 5-10% improvement in AUC-ROC on property prediction tasks compared to task-specific models. MLM captures local dependencies and global sequence patterns but cannot model spatial or topological features, making it best suited for unimodal sequence inputs.
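The masking step itself is straightforward to sketch. The example below uses character-level tokenization of a SMILES string and replaces each selected token with a `[MASK]` symbol. Both are simplifications: production tokenizers treat multi-character atoms such as `Cl` as single tokens, and the mask probability and seed here are arbitrary.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly mask tokens; labels hold the original token at masked positions."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)    # the model must predict the original token here
        else:
            masked.append(tok)
            labels.append(None)   # position ignored by the loss
    return masked, labels

tokens = list("CC(=O)O")  # acetic acid, one character per token (a simplification)
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

The training loss is then a cross-entropy over the masked positions only, which forces the model to infer each hidden token from its surrounding context.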

Contrastive Learning (CL)

The dominant strategy for multimodal models. Constructs positive-negative sample pairs to align features across modalities or views. In unimodal settings, CL generates negative samples by perturbing molecular graphs. In multimodal settings, it aligns features from different modalities. GraphMVP, which contrasts 2D topological features with 3D spatial features, reduces RMSE by 15% on QM9 energy prediction compared to unimodal models. Performance depends heavily on the quality of positive sample construction.
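
To make the positive/negative pairing concrete, here is a minimal sketch of an InfoNCE-style loss, one common contrastive objective. It is not claimed to be the exact loss used by GraphMVP or any other surveyed model; the temperature value and the toy two-dimensional embeddings are assumptions for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    Row i of view_a and row i of view_b form the positive pair; every
    other row of view_b serves as a negative for anchor i.
    """
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, other) / temperature for other in b]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Toy embeddings of two molecules under two views (e.g. 2D and 3D encoders).
view_2d = [[1.0, 0.0], [0.0, 1.0]]
aligned_3d = [[1.0, 0.0], [0.0, 1.0]]
swapped_3d = [[0.0, 1.0], [1.0, 0.0]]

print(info_nce(view_2d, aligned_3d))  # small: positives agree
print(info_nce(view_2d, swapped_3d))  # large: positives disagree
```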

Reconstruction-Based Pretraining (RBP)

Learns global molecular features by reconstructing original data from corrupted inputs. Tasks include node feature reconstruction, graph structure reconstruction, and coordinate/energy reconstruction. MGMAE masks more than 50% of nodes and edges in molecular graphs and trains the model to reconstruct them, achieving 94.2% AUC-ROC on BBBP. RBP captures global structural patterns but requires high model complexity and training cost.

Multimodal Alignment Pretraining (MAP)

Designed for multimodal inputs, aligning and fusing features from different modalities through cross-modal tasks. KV-PLM uses SMILES-to-text matching to align molecular structure and functional information. MAP fuses structural information (SMILES, graphs) with semantic information (text) but requires large-scale cross-modal labeled data, posing significant data acquisition challenges.

Downstream Applications and Performance Benchmarks

The review evaluates MRL foundation models across five application domains.

Molecular Property Prediction

The most common benchmark for MRL models. The review compares ROC-AUC across eight MoleculeNet classification datasets; six are shown here:

Model	Type	BBBP	BACE	ClinTox	Tox21	SIDER	HIV
MGMAE	Graph	94.2	92.7	96.7	86.0	66.4	-
MPG	Graph	92.2	92.0	96.3	83.7	66.1	-
GROVER	Graph+Trans.	94.0	89.4	94.4	83.1	65.8	-
MoLFormer	Sequence	93.7	88.2	94.8	84.7	69.0	82.2
MM-Deacon	Seq.+IUPAC	78.5	-	99.5	-	69.3	80.1
Uni-Mol	3D	72.9	85.7	91.9	79.6	65.9	80.8
DVMP	Seq.+Graph	77.8	89.4	95.6	79.1	69.8	81.4
TxD-T-LLM	Seq.+Text	-	-	86.3	88.2	-	73.2

The table shows that no single architecture dominates across all datasets. Transformer- and GIN-based architectures with graph inputs generally perform well. The review notes that model effectiveness depends heavily on the dataset, with Mole-BERT encountering negative transfer due to a small and unbalanced atomic vocabulary.

Molecular Generation

MolGEN (SELFIES-based, 8B parameters) achieves 100% validity on synthetic molecules. MolT5 excels at text-to-molecule generation. Uni-Mol generates 3D conformations with 97.95% coverage on QM9.

Drug-Drug Interaction Prediction

MPG achieves 96.6% AUC-ROC on BIOSNAP by combining unsupervised pretraining with supervised fine-tuning and multi-task learning.

Retrosynthesis Prediction

DVMP achieves 66.5% top-1 accuracy on USPTO-50K when reaction types are provided as priors (54.2% without).

Drug Synergy Prediction

SynerGPT (GPT-based) achieves 77.7% AUC-ROC in few-shot settings for novel drug combinations, outperforming baselines through contextual learning.

Guidelines, Limitations, and Future Directions

Model Selection Guidelines

The authors provide structured guidelines for choosing MRL foundation models based on:

Task objective: Property prediction favors GNNs or large pretrained frameworks (ChemBERTa-2, Uni-Mol). Generation tasks favor GPT-style autoregressive models (MolGEN). Retrosynthesis benefits from multimodal architectures.
Data characteristics: SMILES/graph representations suit generation tasks. Knowledge graph-enhanced models benefit interaction and synergy prediction. Transfer learning helps data-limited scenarios.
Interpretability needs: Transformer architectures are preferred when interpretability is required, as attention matrices enable visualization of learned molecular features.
Computational budget: GIN-based models have $\mathcal{O}(|V| + |E|)$ complexity, while Transformer-based models scale as $\mathcal{O}(n^2 \cdot d)$.

Limitations and Future Directions

The review identifies five key challenges:

Multimodal data integration: Each representation paradigm has distinct limitations (1D neglects spatial configuration, 2D omits conformational details, 3D faces rotational invariance challenges). The authors propose incorporating molecular dynamics trajectories as a dynamic modality and using cross-modal data augmentation.
Data scarcity: Semi-supervised learning can achieve more than 90% of fully supervised performance using only 10% labeled data on QM9. Cross-modal augmentation (e.g., 3D InfoMax) can generate plausible 3D conformers from 2D graphs.
Interpretability: Current methods rely primarily on attention-based visualization, which is insufficient for multimodal models. The authors suggest assessing decision consistency across modalities and incorporating chemical knowledge graphs.
Training efficiency: Large parameter counts demand distributed parallel training techniques, with data parallelism being the most common approach.
Robustness and generalization: Strategies include data augmentation (multiple SMILES representations, 3D conformer generation), meta-learning for rapid adaptation, and sparse attention mechanisms to reduce sensitivity to irrelevant long-range interactions.

Reproducibility Details

This is a review paper, so standard reproducibility criteria for experimental papers do not directly apply. The review compiles results from the original publications of each surveyed model.

Data

The review catalogs 28 representative molecular datasets used by the surveyed foundation models. A selection:

Dataset	Size	Descriptor	Primary Use
PubChem	~118M	SMILES, 3D, Image, IUPAC	Pretraining
ZINC15	~980M	SMILES	Pretraining
ChEMBL	~2.4M	SMILES	Pretraining
QM9	133,884	SMILES	Property prediction
GEOM	450,000	3D coordinates	Property prediction
USPTO-full	950,000	SMILES	Reaction prediction
Molecule3D	4M	3D coordinates	Property prediction

Artifacts

Artifact	Type	License	Notes
Review Materials (GitHub)	Code/Data	Not specified	Code and data tables for figures
Paper (PMC)	Paper	CC-BY	Open access via PubMed Central

Evaluation

All performance metrics reported in the review are directly cited from the original studies. The evaluation protocols follow each model’s original setup. The review covers:

ROC-AUC for classification tasks (property prediction, DDI, synergy)
RMSE/MAE for regression tasks
Validity and novelty for molecular generation
Top-k accuracy for retrosynthesis
COV and MAT for conformation generation

Paper Information

Citation: Song, B., Zhang, J., Liu, Y., Liu, Y., Jiang, J., Yuan, S., Zhen, X., & Liu, Y. (2025). A systematic review of molecular representation learning foundation models. Briefings in Bioinformatics, 27(1), bbaf703. https://doi.org/10.1093/bib/bbaf703

@article{song2025systematic,
  title={A systematic review of molecular representation learning foundation models},
  author={Song, Bosheng and Zhang, Jiayi and Liu, Ying and Liu, Yuansheng and Jiang, Jing and Yuan, Sisi and Zhen, Xia and Liu, Yiping},
  journal={Briefings in Bioinformatics},
  volume={27},
  number={1},
  pages={bbaf703},
  year={2025},
  publisher={Oxford University Press},
  doi={10.1093/bib/bbaf703}
}

MolGenSurvey: Systematic Survey of ML for Molecule Design

Mon, 23 Mar 2026 00:00:00 +0000

A Taxonomy for ML-Driven Molecule Design

This is a Systematization paper that reviews machine learning approaches for molecule design across all three major molecular representations (1D string, 2D graph, 3D geometry) and both deep generative and combinatorial optimization paradigms. Prior surveys (including Sánchez-Lengeling & Aspuru-Guzik, 2018; Elton et al., 2019; Xue et al., 2019; Vanhaelen et al., 2020; Alshehri et al., 2020; Jiménez-Luna et al., 2020; and Axelrod et al., 2022) each covered subsets of the literature (e.g., only generative methods, or only specific task types). MolGenSurvey extends these by unifying the field into a single taxonomy based on input type, output type, and generation goal, identifying eight distinct molecule generation tasks. It catalogs over 100 methods across these categories and provides a structured comparison of evaluation metrics, datasets, and experimental setups.

The chemical space of drug-like molecules is estimated to contain $10^{23}$ to $10^{60}$ compounds, making exhaustive enumeration computationally infeasible. Traditional high-throughput screening searches existing databases but is slow and expensive. ML-based generative approaches offer a way to intelligently explore this space, either by learning continuous latent representations (deep generative models) or by directly searching the discrete chemical space (combinatorial optimization methods).

Molecular Representations

The survey identifies three mainstream featurization approaches for molecules, each carrying different tradeoffs for generation tasks.

1D String Descriptions

SMILES and SELFIES are the two dominant string representations. SMILES encodes molecules as character strings following grammar rules for bonds, branches, and ring closures. Its main limitation is that arbitrary strings are often chemically invalid. SELFIES augments the encoding rules for branches and rings to achieve 100% validity by construction.

Other string representations exist (InChI, SMARTS) but are less commonly used for generation. Representation learning over strings has adopted CNNs, RNNs, and Transformers from NLP.

2D Molecular Graphs

Molecules naturally map to graphs where atoms are nodes and bonds are edges. Graph neural networks (GNNs), particularly those following the message-passing neural network (MPNN) framework, have become the standard representation method. Each message-passing layer updates a node’s representation by aggregating information from its bonded neighbors, so after $K$ layers a node’s representation reflects its $K$-hop neighborhood. Notable architectures include D-MPNN (directional message passing), PNA (diverse aggregation methods), AttentiveFP (attention-based), and Graphormer (transformer-based).
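
The relationship between stacked message-passing layers and the size of the neighborhood a node can see is easy to demonstrate. The sketch below is a weight-free, sum-aggregation toy, not any of the named architectures; the function name and the four-atom chain are assumptions chosen for illustration.

```python
def message_passing(adjacency, features, num_layers):
    """Sum-aggregation message passing without learned weights.

    In each layer, every node adds the current features of its bonded
    neighbors to its own feature.
    """
    h = list(features)
    for _ in range(num_layers):
        h = [
            h[i] + sum(h[j] for j, bonded in enumerate(adjacency[i]) if bonded)
            for i in range(len(h))
        ]
    return h


# Path graph 0-1-2-3, e.g. the carbon chain of butane.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
signal = [0, 0, 0, 1]  # a marker placed on atom 3 only

for k in range(1, 4):
    print(k, message_passing(chain, signal, k))
```

Atom 0 sits three bonds away from atom 3, so its feature stays at zero for one and two layers and only becomes nonzero once three layers are stacked.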

3D Molecular Geometry

Molecules are inherently 3D objects with conformations (3D structures at local energy minima) that determine function. Representing 3D geometry requires models that respect E(3) or SE(3) symmetry: scalar predictions should be invariant, and vector-valued outputs should transform equivariantly, under rotation and translation (E(3) additionally includes reflections). The survey catalogs architectures along this line including SchNet, DimeNet, EGNN, SphereNet, and PaiNN.

Additional featurization methods (molecular fingerprints/descriptors, 3D density maps, 3D surface meshes, and chemical images) are noted but have seen limited use in generation tasks.

Deep Generative Models

The survey covers six families of deep generative models applied to molecule design.

Autoregressive Models (ARs)

ARs factorize the joint distribution of a molecule as a product of conditional distributions over its subcomponents:

$$p(\boldsymbol{x}) = \prod_{i=1}^{d} p(\bar{x}_i \mid \bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{i-1})$$

For molecular graphs, this means sequentially predicting the next atom or bond conditioned on the partial structure built so far. RNNs, Transformers, and BERT-style models all implement this paradigm.
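
The chain-rule factorization can be made concrete with a toy model. In the sketch below, the conditional distribution is a hypothetical first-order lookup table over three tokens (it only looks at the previous token); real autoregressive models condition on the full prefix with a neural network.

```python
import math


def sequence_log_prob(tokens, conditional):
    """log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1})."""
    return sum(
        math.log(conditional(tuple(tokens[:i]), tokens[i]))
        for i in range(len(tokens))
    )


def toy_conditional(prefix, token):
    # Hypothetical probabilities; each row sums to 1.
    table = {
        None: {"C": 0.7, "O": 0.3},
        "C": {"C": 0.5, "O": 0.3, "<eos>": 0.2},
        "O": {"C": 0.4, "<eos>": 0.6},
    }
    previous = prefix[-1] if prefix else None
    return table[previous][token]


log_p = sequence_log_prob(["C", "C", "O", "<eos>"], toy_conditional)
print(math.exp(log_p))  # 0.7 * 0.5 * 0.3 * 0.6
```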

Variational Autoencoders (VAEs)

VAEs learn a continuous latent space by maximizing the evidence lower bound (ELBO):

$$\log p(\boldsymbol{x}) \geq \mathbb{E}_{q(\boldsymbol{z}|\boldsymbol{x})}[\log p(\boldsymbol{x}|\boldsymbol{z})] - D_{KL}(q(\boldsymbol{z}|\boldsymbol{x}) \,\|\, p(\boldsymbol{z}))$$

The first term is the reconstruction objective, and the second is a KL-divergence regularizer encouraging diverse, disentangled latent codes. Key molecular VAEs include ChemVAE (SMILES-based), JT-VAE (junction tree graphs), and GrammarVAE (grammar-constrained SMILES).
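
For the common case of a diagonal Gaussian encoder and a standard normal prior, the KL term of the ELBO has a closed form. That modeling choice is an assumption here (the survey text above does not fix the distributions), and the function names are illustrative.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def negative_elbo(reconstruction_log_likelihood, mu, log_var):
    """Training loss: minus (reconstruction term minus KL regularizer)."""
    return -(reconstruction_log_likelihood - kl_to_standard_normal(mu, log_var))


print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # posterior equals prior
print(kl_to_standard_normal([1.0], [0.0]))            # shifted mean is penalized
```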

Normalizing Flows (NFs)

NFs model $p(\boldsymbol{x})$ via an invertible, deterministic mapping between data and latent space, using the change-of-variable formula with Jacobian determinants. Molecular applications include GraphNVP, MoFlow (one-shot graph generation), GraphAF (autoregressive flow), and GraphDF (discrete flow).
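
For reference, the change-of-variable formula mentioned above can be written out explicitly. The symbols $f_\theta$ (the invertible map from data to latent space) and $p_Z$ (the base density on the latent space) are notation introduced here for illustration, not taken from the survey:

$$\log p(\boldsymbol{x}) = \log p_Z\big(f_\theta(\boldsymbol{x})\big) + \log \left| \det \frac{\partial f_\theta(\boldsymbol{x})}{\partial \boldsymbol{x}} \right|$$

Because $f_\theta$ is invertible, sampling runs the map in reverse: draw $\boldsymbol{z} \sim p_Z$ and return $f_\theta^{-1}(\boldsymbol{z})$. The likelihood is exact, which is what distinguishes flows from the lower bound optimized by VAEs.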

Generative Adversarial Networks (GANs)

GANs use a generator-discriminator game where the generator produces molecules and the discriminator distinguishes real from generated samples. Molecular GANs include MolGAN (graph-based with RL reward), ORGAN (SMILES-based with RL), and Mol-CycleGAN (molecule-to-molecule translation).

Diffusion Models

Diffusion models learn to reverse a gradual noising process. The forward process adds Gaussian noise over $T$ steps; a neural network learns to denoise at each step. The training objective reduces to predicting the noise added at each step:

$$\mathcal{L}_t = \mathbb{E}_{\boldsymbol{x}_0, \boldsymbol{\epsilon}_t}\left[\left\| \boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\,\boldsymbol{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}_t,\ t\right) \right\|^2\right]$$

Diffusion has been particularly successful for 3D conformation generation (ConfGF, GeoDiff, DGSM).
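
The forward noising step and the noise-prediction loss in the objective above can be sketched without a neural network. The linear noise schedule, its endpoints, and all function names below are assumptions for illustration; the surveyed conformation models operate on distances or coordinates with their own schedules.

```python
import math
import random


def alpha_bar(betas, t):
    """Cumulative product of (1 - beta_s) over the first t steps."""
    product = 1.0
    for beta in betas[:t]:
        product *= 1.0 - beta
    return product


def noise_sample(x0, betas, t, rng):
    """Draw x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.

    Also returns eps, which is the regression target for the denoiser.
    """
    abar = alpha_bar(betas, t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e for x, e in zip(x0, eps)]
    return xt, eps


def noise_prediction_loss(eps, eps_pred):
    """Squared L2 distance between true and predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


T = 1000
# Linear schedule from 1e-4 to 0.02 (an assumed, commonly used choice).
betas = [1e-4 + (0.02 - 1e-4) * s / (T - 1) for s in range(T)]

rng = random.Random(0)
xt, eps = noise_sample([1.0, -0.5, 0.25], betas, t=500, rng=rng)
print(alpha_bar(betas, 500), alpha_bar(betas, T))
print(noise_prediction_loss(eps, [0.0, 0.0, 0.0]))  # loss of a trivial predictor
```

With this schedule almost none of the original signal remains at step $T$, which is why sampling can start from pure Gaussian noise.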

Energy-Based Models (EBMs)

EBMs define $p(\boldsymbol{x}) = \frac{\exp(-E_\theta(\boldsymbol{x}))}{A}$ where $E_\theta$ is a learned energy function. The challenge is computing the intractable partition function $A$, addressed via contrastive divergence, noise-contrastive estimation, or score matching.

Combinatorial Optimization Methods

Unlike DGMs that learn from data distributions, combinatorial optimization methods (COMs) search directly over discrete chemical space using oracle calls to evaluate candidate molecules.

Reinforcement Learning (RL)

RL formulates molecule generation as a Markov Decision Process: states are partial molecules, actions are adding/removing atoms or bonds, and rewards come from property oracles. Methods include GCPN (graph convolutional policy network), MolDQN (deep Q-network), RationaleRL (property-aware substructure assembly), and REINVENT (SMILES-based policy gradient).

Genetic Algorithms (GA)

GAs maintain a population of molecules and evolve them through mutation and crossover operations. GB-GA operates on molecular graphs, GA+D uses SELFIES with adversarial discriminator enhancement, and JANUS uses SELFIES with parallel exploration strategies.

Bayesian Optimization (BO)

BO builds a Gaussian process surrogate of the objective function and uses an acquisition function to decide which molecules to evaluate next. It is often combined with VAE latent spaces (Constrained-BO-VAE, MSO) to enable continuous optimization.

Monte Carlo Tree Search (MCTS)

MCTS explores the molecular construction tree by branching and evaluating promising intermediates. ChemTS and MP-MCTS combine MCTS with autoregressive SMILES generators.

MCMC Sampling

MCMC methods (MIMOSA, MARS) formulate molecule optimization as sampling from a target distribution defined by multiple property objectives, using graph neural networks as proposal distributions.

Other Approaches

The survey also identifies two additional paradigms that do not fit neatly into either DGM or COM categories. Optimal Transport (OT) is used when matching between groups of molecules, particularly for conformation generation where each molecule has multiple associated 3D structures (e.g., GeoMol, EquiBind). Differentiable Learning formulates discrete molecules as differentiable objects, enabling gradient-based continuous optimization directly on molecular graphs (e.g., DST).

Task Taxonomy: Eight Molecule Generation Tasks

The survey’s central organizational contribution is a unified taxonomy of eight distinct molecule design tasks, defined by three axes: (1) whether generation is de novo (from scratch, no reference molecule) or conditioned on an input molecule, (2) whether the goal is generation (distribution learning, producing valid and diverse molecules) or optimization (goal-directed search for molecules with specific properties), and (3) the input/output data representation (1D string, 2D graph, 3D geometry). The paper’s Table 2 maps all combinations of these axes, showing that many are not meaningful (e.g., 1D string input to 2D graph output with no goal). Only eight combinations correspond to active research areas.

1D/2D Tasks

De novo 1D/2D molecule generation: Generate new molecules from scratch to match a training distribution. Methods span VAEs (ChemVAE, JT-VAE), flows (GraphNVP, MoFlow, GraphAF), GANs (MolGAN, ORGAN), ARs (MolecularRNN), and EBMs (GraphEBM).
De novo 1D/2D molecule optimization: Generate molecules with optimal properties from scratch, using oracle feedback. Methods include RL (GCPN, MolDQN), GA (GB-GA, JANUS), MCTS (ChemTS), and MCMC (MIMOSA, MARS).
1D/2D molecule optimization: Optimize properties of a given input molecule via local search. Methods include graph-to-graph translation (VJTNN, CORE, MOLER), VAE+BO (MSO, Constrained-BO-VAE), GANs (Mol-CycleGAN, LatentGAN), and differentiable approaches (DST).

3D Tasks

De novo 3D molecule generation: Generate novel 3D molecular structures from scratch, respecting geometric validity. Methods include ARs (G-SchNet, G-SphereNet), VAEs (3DMolNet), flows (E-NFs), and RL (MolGym).
De novo 3D conformation generation: Generate 3D conformations from given 2D molecular graphs. Methods include VAEs (CVGAE, ConfVAE), diffusion models (ConfGF, GeoDiff, DGSM), and optimal transport (GeoMol).
De novo binding-based 3D molecule generation: Design 3D molecules for specific protein binding pockets. Methods include density-based VAEs (liGAN), RL (DeepLigBuilder), and ARs (3DSBDD).
De novo binding-pose conformation generation: Find the appropriate 3D conformation of a given molecule for a given protein pocket. Methods include EBMs (DeepDock) and optimal transport (EquiBind).
3D molecule optimization: Optimize 3D molecular properties (scaffold replacement, conformation refinement). Methods include BO (BOA), ARs (3D-Scaffold, cG-SchNet), and VAEs (Coarse-GrainingVAE).

Evaluation Metrics

The survey organizes evaluation metrics into four categories.

Generation Evaluation

Basic metrics assess the quality of generated molecules:

Validity: fraction of chemically valid molecules among all generated molecules
Novelty: fraction of generated molecules absent from the training set
Uniqueness: fraction of distinct molecules among generated samples
Quality: fraction passing a predefined chemical rule filter
Diversity (internal/external): measured via pairwise similarity (Tanimoto, scaffold, or fragment) within generated set and between generated and training sets
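
The first three metrics above reduce to simple set arithmetic once a validity check is available. The sketch below makes two assumptions that should be stated loudly: the validity predicate is a hypothetical stand-in (a parenthesis-balance check, where a real pipeline would use a cheminformatics parser), and the denominators follow one common convention (uniqueness among valid molecules, novelty among unique valid molecules), which differs between benchmarks. Molecules are assumed to be given as canonical strings so that string equality means molecular identity.

```python
def balanced_parentheses(smiles):
    """Stand-in validity check: branches must open and close properly."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def generation_metrics(generated, training_set, is_valid):
    valid = [m for m in generated if is_valid(m)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


generated = ["CCO", "CCO", "c1ccccc1", "C(C", "CCN"]
training = ["CCO"]
metrics = generation_metrics(generated, training, balanced_parentheses)
print(metrics)
```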

Distribution Evaluation

Metrics measuring how well generated molecules capture the training distribution: KL divergence over physicochemical descriptors, Fréchet ChemNet Distance (FCD), and Maximum Mean Discrepancy (MMD).

Optimization Evaluation

Property oracles used as optimization targets: Synthetic Accessibility (SA), Quantitative Estimate of Drug-likeness (QED), LogP, kinase inhibition scores (GSK3-beta, JNK3), DRD2 activity, GuacaMol benchmark oracles, and Vina docking scores. Constrained optimization additionally considers structural similarity to reference molecules via Tanimoto, scaffold, or fragment similarity.

3D Evaluation

3D-specific metrics include stability (matching valence rules in 3D), RMSD and Kabsch-RMSD (conformation alignment), and Coverage/Matching scores for conformation ensembles.
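
Coverage and Matching scores for a conformer ensemble are usually computed from a matrix of pairwise RMSD values between reference and generated conformations. The sketch below assumes that matrix has already been computed (after alignment) and uses the reference-centered definitions: coverage is the fraction of reference conformers with at least one generated conformer closer than a threshold, and matching is the mean of those closest distances. The threshold of 0.5 and the toy matrix are illustrative values only.

```python
def coverage_and_matching(rmsd, threshold):
    """rmsd[i][j] is the RMSD between reference i and generated conformer j."""
    best = [min(row) for row in rmsd]  # closest generated conformer per reference
    cov = sum(d < threshold for d in best) / len(best)
    mat = sum(best) / len(best)
    return cov, mat


# Three reference conformers (rows) against three generated ones (columns).
rmsd = [
    [0.2, 1.1, 0.9],
    [0.8, 0.4, 1.3],
    [1.5, 0.9, 0.7],
]
cov, mat = coverage_and_matching(rmsd, threshold=0.5)
print(cov, mat)
```

Higher coverage and lower matching both indicate that the generated ensemble reproduces the reference conformers more faithfully.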

Datasets

The survey catalogs 12 major datasets spanning 1D/2D and 3D molecule generation:

Dataset	Scale	Dimensionality	Purpose
ZINC	250K	1D/2D	Virtual screening compounds
ChEMBL	2.1M	1D/2D	Bioactive molecules
MOSES	1.9M	1D/2D	Benchmarking generation
CEPDB	4.3M	1D/2D	Organic photovoltaics
GDB-13	970M	1D/2D	Enumerated small molecules
QM9	134K	1D/2D/3D	Quantum chemistry properties
GEOM	450K/37M	1D/2D/3D	Conformer ensembles
ISO17	200/431K	1D/2D/3D	Molecule-conformation pairs
Molecule3D	3.9M	1D/2D/3D	DFT ground-state geometries
CrossDock2020	22.5M	1D/2D/3D	Docked ligand poses
scPDB	16K	1D/2D/3D	Binding sites
DUD-E	23K	1D/2D/3D	Active compounds with decoys

Challenges and Opportunities

Challenges

Out-of-distribution generation: Most deep generative models imitate known molecule distributions and struggle to explore truly novel chemical space.
Unrealistic problem formulation: Many task setups do not respect real-world chemistry constraints.
Expensive oracle calls: Methods typically assume unlimited access to property evaluators, which is unrealistic in drug discovery.
Lack of interpretability: Few methods explain why generated molecules have desired properties. Quantitative interpretability evaluation remains an open problem.
No unified evaluation protocols: The field lacks consensus on what defines a “good” drug candidate and how to fairly compare methods.
Insufficient benchmarking: Despite the enormous chemical space ($10^{23}$ to $10^{60}$ drug-like molecules), available benchmarks use only small fractions of large databases.
Low-data regime: Many real-world applications have limited training data, and generating molecules under data scarcity remains difficult.

Opportunities

Extension to complex structured data: Techniques from small molecule generation may transfer to proteins, antibodies, genes, crystal structures, and polysaccharides.
Connection to later drug development phases: Bridging the gap between molecule design and preclinical/clinical trial outcomes could improve real-world impact.
Knowledge discovery: Generative models over molecular latent spaces could reveal chemical rules governing molecular properties, and graph structure learning could uncover implicit non-bonded interactions.

Limitations

The survey was published in March 2022, so it does not cover subsequent advances in diffusion models for molecules (e.g., EDM, DiffSBDD), large language models applied to chemistry, or flow matching approaches.
Coverage focuses on small molecules. Macromolecule design (proteins, nucleic acids) is noted as a future direction rather than surveyed.
The survey catalogs methods but does not provide head-to-head experimental comparisons across all 100+ methods. Empirical discussion relies on individual papers’ reported results.
1D string-based methods receive less detailed coverage than graph and geometry-based approaches, reflecting the field’s shift toward structured representations at the time of writing.
As a survey, this paper produces no code, models, or datasets. The surveyed methods’ individual repositories are referenced in their original publications but are not aggregated here.

Paper Information

Citation: Du, Y., Fu, T., Sun, J., & Liu, S. (2022). MolGenSurvey: A Systematic Survey in Machine Learning Models for Molecule Design. arXiv preprint arXiv:2203.14500.

Publication: arXiv preprint, March 2022. Note: This survey covers literature through early 2022 and does not include subsequent advances in diffusion models, LLMs for chemistry, or flow matching.

Additional Resources:

arXiv: 2203.14500

@article{du2022molgensurvey,
  title={MolGenSurvey: A Systematic Survey in Machine Learning Models for Molecule Design},
  author={Du, Yuanqi and Fu, Tianfan and Sun, Jimeng and Liu, Shengchao},
  journal={arXiv preprint arXiv:2203.14500},
  year={2022}
}

Review of OCSR Techniques and Models (Musazade 2022)

Thu, 18 Dec 2025 00:00:00 +0000

Systematization of OCSR Evolution

This is a Systematization paper ($\Psi_{\text{Systematization}}$). It organizes existing literature into two distinct evolutionary phases: Rule-based systems (1990s-2010s) and Machine Learning-based systems (2015-present). It synthesizes performance metrics across these paradigms to highlight the shift from simple classification to “image captioning” (sequence generation).

Justification: The paper focuses on “organizing and synthesizing existing literature” and answers the core question: “What do we know?” The dominant contribution is systematization based on several key indicators:

Survey Structure: The paper explicitly structures content by categorizing the field into two distinct historical and methodological groups: “Rule-based systems” and “ML-based systems”. It traces the “evolution of approaches from rule-based structure analyses to complex statistical models”, moving chronologically from early tools like OROCS and OSRA (1990s-2000s) to modern Deep Learning approaches like DECIMER and Vision Transformers.
Synthesis of Knowledge: The paper aggregates performance metrics from various distinct studies into unified comparison tables (Table 1 for rule-based and Table 2 for ML-based). It synthesizes technical details of different models, explaining how specific architectures (CNNs, LSTMs, Attention mechanisms) are applied to the specific problem of Optical Chemical Structure Recognition (OCSR).
Identification of Gaps: The authors dedicate specific sections to “Gaps of rule-based systems” and “Gaps of ML-based systems”. The paper concludes with recommendations for future development, such as the need for “standardized datasets” and specific improvements in image augmentation and evaluation metrics.

Motivation for Digitization in Cheminformatics

The primary motivation is the need to digitize vast amounts of chemical knowledge locked in non-digital formats (e.g., scanned PDFs, older textbooks). This is challenging because:

Representational Variety: A single chemical formula can be drawn in many visually distinct ways (e.g., different orientations, bond styles, fonts).
Legacy Data: Older documents contain noise, low resolution, and disconnected strokes that confuse standard computer vision models.
Lack of Standardization: There is no centralized database or standardized benchmark for evaluating OCSR performance, making comparison difficult.

Key Insights and the Paradigm Shift

The paper provides a structured comparison of the “evolution” of OCSR, specifically identifying the pivot point where the field moved from object detection to NLP-inspired sequence generation.

Key insights include:

The Paradigm Shift: Identifying that OCSR has effectively become an “image captioning” problem where the “caption” is a SMILES or InChI string.
Metric Critique: It critically analyzes the flaws in current evaluation metrics, noting that Levenshtein Distance (LD) is better than simple accuracy but still fails to capture semantic chemical severity (e.g., mistaking “F” for “S” is worse than a wrong digit).
Hybrid Potential: Despite the dominance of ML, the authors argue that rule-based heuristics are still valuable for post-processing validation (e.g., checking element order, sequence structure, and formula correspondence).

Comparative Analysis of Rule-Based vs. ML Systems

As a review paper, it aggregates experimental results from primary sources. It compares:

Rule-based systems: OSRA, chemoCR, Imago, Markov Logic OCSR, and various heuristic approaches.
ML-based systems: DECIMER (multiple versions), MSE-DUDL, ICMDT (Image Captioning Model based on Deep Transformer-in-Transformer), and other BMS Kaggle competition solutions.

It contrasts these systems using:

Datasets: BMS (synthetic, 4M images), PubChem (synthetic), U.S. Patents (real-world scanned).
Metrics: Tanimoto similarity (structural overlap) and Levenshtein distance (string edit distance).

Outcomes, Critical Gaps, and Recommendations

Transformers are SOTA: Attention-based encoder-decoder models outperform CNN-RNN hybrids. DECIMER 1.0 achieved 96.47% Tanimoto $= 1.0$ on its test set using an EfficientNet-B3 encoder and Transformer decoder.
Data Hungry: Modern approaches require massive datasets (millions of images) and significant compute. DECIMER 1.0 trained on 39M images for 14 days on TPU, while the original DECIMER took 27 days on a single GPU. Rule-based systems required neither large data nor heavy compute but hit a performance ceiling.
Critical Gaps:
- Super-atoms: Current models struggle with abbreviated super-atoms (e.g., “Ph”, “COOH”).
- Stereochemistry: 3D information (wedges/dashes) is often lost or misinterpreted.
- Resolution: Models are brittle to resolution changes; some require high-res, others fail if images aren’t downscaled.
Recommendation: Future systems should integrate “smart” pre-processing (denoising without cropping) and use domain-specific distance metrics. The authors also note that post-processing formula validation (checking element order, sequence structure, and formula correspondence) increases accuracy by around 5-6% on average. They suggest exploring Capsule Networks as an alternative to CNNs, since capsules add position invariance through routing-by-agreement rather than max-pooling.

Reproducibility

As a review paper, this work does not introduce original code, models, or datasets. The paper itself is open access via the Journal of Cheminformatics. This section summarizes the technical details of the systems reviewed.

Data

The review identifies the following key datasets used for training OCSR models:

Dataset	Type	Size	Notes
BMS (Bristol-Myers Squibb)	Synthetic	~4M images	2.4M train / 1.6M test. Used for Kaggle competition. Test images contain noise (salt & pepper, blur) and rotations absent from training images.
PubChem	Synthetic	~39M	Generated via CDK (Chemistry Development Kit). Used by DECIMER 1.0 (90/10 train/test split).
U.S. Patents (USPTO)	Scanned	Variable	Real-world noise, often low resolution. One of several training sources for MSE-DUDL (alongside PubChem and Indigo, totaling 50M+ samples).
ChemInfty	Scanned	869 images	Older benchmark used to evaluate rule-based systems (e.g., Markov Logic OCSR).

Algorithms

The review highlights the progression of algorithms:

Rule-Based: Hough transforms for bond detection, vectorization/skeletonization, and OCR for atom labels.
Sequence Modeling:
- Image Captioning: Encoder (CNN/ViT) → Decoder (RNN/Transformer).
- Tokenization: Parsing InChI/SMILES into discrete tokens (e.g., splitting C13 into C, 13).
- Beam Search: Used in inference (typical $k$ of 15 to 20) to find the most likely chemical string.
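
The tokenization step can be sketched with a single regular expression. This is a simplified illustration for the formula layer of an InChI string, matching the `C13` example above; the pattern and function name are assumptions, and production tokenizers for full InChI or SMILES strings handle more cases (layers, brackets, charges).

```python
import re

# An element symbol (capital letter plus optional lowercase letter),
# or a run of digits, or any other single character.
TOKEN_PATTERN = re.compile(r"[A-Z][a-z]?|\d+|.")


def tokenize_formula(text):
    return TOKEN_PATTERN.findall(text)


print(tokenize_formula("C13H20BrNO2"))
# ['C', '13', 'H', '20', 'Br', 'N', 'O', '2']
```

Keeping multi-digit counts and two-letter elements as single tokens shortens the target sequence the decoder has to generate.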

Models

Key architectures reviewed:

DECIMER 1.0: Uses EfficientNet-B3 (Encoder) and Transformer (Decoder). Predicts SELFIES strings (more robust than SMILES).
Swin Transformer: Often used in Kaggle ensembles as the visual encoder due to better handling of variable image sizes.
Grid LSTM: Used in older deep learning approaches (MSE-DUDL) to capture spatial dependencies.

Evaluation

Metrics standard in the field:

Levenshtein Distance (LD): Edit distance between predicted and ground truth strings. Lower is better. Formally, for two sequences $a$ and $b$ (e.g. SMILES strings) of lengths $|a|$ and $|b|$, the recursive distance $LD(a, b)$ is bounded from $0$ to $\max(|a|, |b|)$.
Tanimoto Similarity: Measures overlap of molecular fingerprints (from $0.0$ to $1.0$). Higher is better. DECIMER 1.0 achieved a Tanimoto of 0.99 on PubChem data (Table 2). Calculated as: $$ T(A, B) = \frac{N_c}{N_a + N_b - N_c} $$ where $N_a$ and $N_b$ are the number of bits set to 1 in fingerprints $A$ and $B$, and $N_c$ is the number of common bits set to 1.
1-1 Match Rate: Exact string matching (accuracy). For DECIMER 1.0, 96.47% of results achieved Tanimoto $= 1.0$.
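
Both string-level and fingerprint-level metrics above are short enough to write out. The sketch below is a plain-Python illustration: fingerprints are represented as sets of on-bit indices (an assumed encoding), and returning 1.0 for two empty fingerprints is a convention chosen here, not something the review specifies.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def tanimoto(bits_a, bits_b):
    """T(A, B) = N_c / (N_a + N_b - N_c) for sets of on-bit indices."""
    n_c = len(bits_a & bits_b)
    denominator = len(bits_a) + len(bits_b) - n_c
    return n_c / denominator if denominator else 1.0


print(levenshtein("CCO", "CCN"))           # one substitution
print(tanimoto({1, 2, 3, 4}, {3, 4, 5}))   # 2 / (4 + 3 - 2)
```

Note that the Levenshtein result respects the bound stated above: it never exceeds the length of the longer string.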

Hardware

Training Cost: High for SOTA. DECIMER 1.0 required ~14 days on TPU. The original DECIMER took ~27 days on a single NVIDIA GPU.
Inference: Transformer models are heavy; rule-based systems run on standard CPUs but with lower accuracy.

Paper Information

Citation: Musazade, F., Jamalova, N., & Hasanov, J. (2022). Review of techniques and models used in optical chemical structure recognition in images and scanned documents. Journal of Cheminformatics, 14(1), 61. https://doi.org/10.1186/s13321-022-00642-3

Publication: Journal of Cheminformatics 2022

@article{musazadeReviewTechniquesModels2022,
  title = {Review of Techniques and Models Used in Optical Chemical Structure Recognition in Images and Scanned Documents},
  author = {Musazade, Fidan and Jamalova, Narmin and Hasanov, Jamaladdin},
  year = 2022,
  month = sep,
  journal = {Journal of Cheminformatics},
  volume = {14},
  number = {1},
  pages = {61},
  doi = {10.1186/s13321-022-00642-3}
}

A Review of Optical Chemical Structure Recognition Tools

Wed, 17 Dec 2025 00:00:00 +0000

Systematization and Benchmarking of OCSR

This is primarily a Systematization paper ($0.7 \Psi_{\text{Systematization}}$) with a significant Resource component ($0.3 \Psi_{\text{Resource}}$).

It serves as a Systematization because it organizes nearly three decades of research in Optical Chemical Structure Recognition (OCSR), categorizing methods into rule-based systems (e.g., Kekulé, CLiDE, OSRA) and emerging machine-learning approaches (e.g., MSE-DUDL, Chemgrapher). It synthesizes information on 16 distinct tools, many of which are commercial or no longer available.

It acts as a Resource by defining a benchmark for the field. The authors evaluate the three available open-source tools (Imago, MolVec, OSRA) against four distinct datasets to establish baseline performance metrics for accuracy and speed.

Motivation: Digitizing Legacy Chemical Literature

A vast amount of chemical knowledge remains “hidden” in the primary scientific literature (printed or PDF), conveyed as 2D images. Because these depictions are not machine-readable, there is a “backlog of decades of chemical literature” that cannot be easily indexed or searched in open-access databases.

While Chemical Named Entity Recognition (NER) exists for text, translating graphical depictions into formats like SMILES or SDfiles requires specialized OCSR tools. The motivation is to enable the automated curation of this legacy data to feed public databases.

Core Innovations: Historical Taxonomy and Open Standards

The primary novelty is the comprehensive aggregation of the history of the field, which had not been thoroughly reviewed recently. It details the algorithmic evolution from the first work in 1990 to deep learning methods in 2019.

Specific contributions include:

Historical Taxonomy: Classification of tools into rule-based vs. machine-learning, and open-source vs. commercial/unavailable.
Open Source Benchmark: A comparative performance analysis of the only three open-source tools available at the time (Imago, MolVec, OSRA) on standardized datasets.
Algorithmic Breakdown: Detailed summaries of the workflows for closed-source or lost tools (e.g., Kekulé, OROCS, ChemReader) based on their original publications.

Benchmarking Methodology and Open-Source Evaluation

The authors performed a benchmark study to evaluate the accuracy and speed of three open-source OCSR tools: MolVec (0.9.7), Imago (2.0), and OSRA (2.1.0).

They tested these tools on four datasets of varying quality and origin:

USPTO: 5,719 images from US patents (high quality).
UOB: 5,740 images from the University of Birmingham, published alongside MolRec.
CLEF 2012: 961 images from the CLEF-IP evaluation (well-segmented, clean).
JPO: 450 images from Japanese patents (low quality, noise, Japanese characters).

Evaluation metrics were:

Accuracy: Percentage of perfectly recognized structures. Each tool's output was converted to a Standard InChI string and compared against the reference InChI by exact string matching: $\text{Accuracy} = \frac{\text{Correct InChI Matches}}{\text{Total Images}}$.
Speed: Total processing time for the dataset.
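The accuracy metric reduces to counting exact string matches. The sketch below assumes the conversion to Standard InChI has already happened upstream and that a failed recognition is represented as `None`. Both the function name and that convention are assumptions made for illustration and are not taken from the authors' scripts.

```python
from typing import Optional, Sequence


def inchi_accuracy(predicted: Sequence[Optional[str]],
                   reference: Sequence[str]) -> float:
    """Correct InChI matches divided by total images.

    A prediction of None (tool produced no usable output) counts as a miss.
    """
    if not reference or len(predicted) != len(reference):
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p is not None and p == r
                  for p, r in zip(predicted, reference))
    return correct / len(reference)


reference = [
    "InChI=1S/CH4/h1H4",                   # methane
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",   # ethanol
    "InChI=1S/H2O/h1H2",                   # water
    "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",  # benzene
]
predicted = [
    "InChI=1S/CH4/h1H4",                   # correct
    "InChI=1S/C2H6O/c1-3-2/h1-2H3",        # dimethyl ether: wrong structure
    None,                                  # recognition failed
    "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",  # correct
]
print(f"{inchi_accuracy(predicted, reference):.2%}")
```

Because the comparison is all-or-nothing, a single misread atom or bond counts the same as a complete failure, which is why the reported accuracies are a strict lower bound on how much of each structure was recovered.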

Results and General Conclusions

Benchmark Results (Table 2):

Dataset	Metric	MolVec 0.9.7	Imago 2.0	OSRA 2.1.0
USPTO (5,719 images)	Time (min)	28.65	72.83	145.04
	Accuracy	88.41%	87.20%	87.69%
UOB (5,740 images)	Time (min)	28.42	152.52	125.78
	Accuracy	88.39%	63.54%	86.50%
CLEF 2012 (961 images)	Time (min)	4.41	16.03	21.33
	Accuracy	80.96%	65.45%	94.90%
JPO (450 images)	Time (min)	7.50	22.55	16.68
	Accuracy	66.67%	40.00%	57.78%

Key Observations:

MolVec was the fastest tool on all four datasets (e.g., 28.65 min for USPTO vs. 72.83 min for Imago and 145.04 min for OSRA).
OSRA performed exceptionally well on clean, well-segmented data (94.90% on CLEF 2012) but was consistently slower than MolVec.
Imago had the lowest accuracy on all four datasets, with the largest gap on UOB (63.54% vs. 88.39% for MolVec and 86.50% for OSRA).
JPO Difficulty: All tools struggled with the noisy Japanese Patent Office dataset (accuracies ranged from 40.00% to 66.67%), highlighting issues with noise and non-standard labels.
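The speed observations can be made more concrete by converting the table's total times into throughput and relative speedup. The numbers below are copied from the benchmark table; the helper names are hypothetical.

```python
# dataset -> (number of images, minutes per tool), from the benchmark table
BENCHMARK = {
    "USPTO": (5719, {"MolVec": 28.65, "Imago": 72.83, "OSRA": 145.04}),
    "UOB": (5740, {"MolVec": 28.42, "Imago": 152.52, "OSRA": 125.78}),
    "CLEF 2012": (961, {"MolVec": 4.41, "Imago": 16.03, "OSRA": 21.33}),
    "JPO": (450, {"MolVec": 7.50, "Imago": 22.55, "OSRA": 16.68}),
}


def throughput(dataset: str, tool: str) -> float:
    """Images processed per minute."""
    images, minutes = BENCHMARK[dataset]
    return images / minutes[tool]


def speedup(dataset: str, fast: str, slow: str) -> float:
    """How many times faster `fast` finished than `slow`."""
    _, minutes = BENCHMARK[dataset]
    return minutes[slow] / minutes[fast]


for name in BENCHMARK:
    rates = ", ".join(
        f"{tool} {throughput(name, tool):.1f} img/min"
        for tool in ("MolVec", "Imago", "OSRA")
    )
    print(f"{name}: {rates}")
print(f"MolVec vs. OSRA on USPTO: {speedup('USPTO', 'MolVec', 'OSRA'):.2f}x")
```

These figures reflect wall-clock time on the authors' workstation, so they mix algorithmic cost with implementation details such as built-in parallelization.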

General Conclusions:

No “gold standard” tool existed (as of 2020) that solved all problems (page segmentation, R-groups, NLP integration).
Rule-based approaches dominate the history of the field, but deep learning methods (MSE-DUDL, Chemgrapher) were emerging, though they were closed-source at the time of writing.
There was a critical need for tools that could handle full-page recognition (combining segmentation and recognition).

Reproducibility Details

The authors provided sufficient detail to replicate the benchmarking study.

Artifacts

Artifact	Type	License	Notes
OCSR_Review (GitHub)	Code / Data	MIT	Benchmark images (PNG, 72 dpi) and evaluation scripts
OSRA	Code	Open Source	Version 2.1.0 tested; precompiled binaries are commercial
Imago	Code	Open Source	Version 2.0 tested; no longer actively developed
MolVec	Code	LGPL-2.1	Version 0.9.7 tested; Java-based standalone tool

Data

The study used four public datasets. Images were converted to PNG (72 dpi) to ensure compatibility across all tools.

Dataset	Size	Source	Characteristics
USPTO	5,719	OSRA Validation Set	US Patent images, generally clean.
UOB	5,740	Univ. of Birmingham	Published alongside MolRec.
CLEF 2012	961	CLEF-IP 2012	Well-segmented, high quality.
JPO	450	Japanese Patent Office	Low quality, noisy, contains Japanese text.

Algorithms

The paper does not propose a new algorithm but benchmarks existing ones. The execution commands for reproducibility were:

Imago: Executed via command line without installation. ./imago_console -dir /image/directory/path
MolVec: Executed as a JAR file. java -cp [dependencies] gov.nih.ncats.molvec.Main -dir [input_dir] -outDir [output_dir]
OSRA: Installed via Conda (PyOSRA) due to compilation complexity. Required dictionaries for superatoms and spelling. osra -f sdf -a [superatom_dict] -l [spelling_dict] -w [output_file] [input_file]

Models

The specific versions of the open-source software tested were:

Tool	Version	Technology	License
MolVec	0.9.7	Java-based, rule-based	LGPL-2.1
Imago	2.0	C++, rule-based	Open Source
OSRA	2.1.0	C++, rule-based	Open Source

Evaluation

Metric: Perfect structural match. The output SDfile/SMILES was converted to a Standard InChI string and compared to the ground truth InChI. Any deviation counted as a failure.
Environment: Linux workstation (Ubuntu 20.04 LTS).

Hardware

The benchmark was performed on a high-end workstation to measure processing time.

CPUs: 2x Intel Xeon Silver 4114 (40 threads total).
RAM: 64 GB.
Parallelization: MolVec had pre-implemented parallelization features that contributed to its speed.

Paper Information

Citation: Rajan, K., Brinkhaus, H. O., Zielesny, A., & Steinbeck, C. (2020). A review of optical chemical structure recognition tools. Journal of Cheminformatics, 12(1), 60. https://doi.org/10.1186/s13321-020-00465-0

Publication: Journal of Cheminformatics 2020

@article{rajanReviewOpticalChemical2020,
  title = {A Review of Optical Chemical Structure Recognition Tools},
  author = {Rajan, Kohulan and Brinkhaus, Henning Otto and Zielesny, Achim and Steinbeck, Christoph},
  year = 2020,
  month = oct,
  journal = {Journal of Cheminformatics},
  volume = {12},
  number = {1},
  pages = {60},
  issn = {1758-2946},
  doi = {10.1186/s13321-020-00465-0}
}

Embedded-Atom Method: Theory and Applications Review

Sun, 14 Dec 2025 00:00:00 +0000

Systematizing the Embedded-Atom Method

This is a Systematization (Review) paper. It consolidates the theoretical development, semi-empirical parameterization, and broad applications of the Embedded-Atom Method (EAM) into a unified framework. The paper systematizes the field by connecting the EAM to related theories (Effective Medium Theory, Finnis-Sinclair, “glue” models) and organizing phenomenological results across diverse physical regimes (bulk, surfaces, interfaces).

The authors explicitly frame the work as a survey, stating “We review here the history, development, and application of the EAM” and “This review emphasizes the physical insight that motivated the EAM.” The paper follows a classic survey structure, organizing the literature by application domains.

The Failure of Pair Potentials in Metallic Systems

The primary motivation is the failure of pair-potential models to accurately describe metallic bonding, particularly at defects and interfaces.

Physics Gap: Pair potentials assume bond strength is independent of environment, implying cohesive energy scales linearly with coordination ($Z$), whereas in reality it scales roughly as $\sqrt{Z}$.

Empirical Failures: Pair potentials enforce the “Cauchy relation” ($C_{12} = C_{44}$), which real metals generally violate, and predict a vacancy formation energy equal to the cohesive energy, contradicting experimental data for fcc metals.

Practical Need: First-principles calculations (like DFT) were computationally too expensive for low-symmetry systems like grain boundaries and fracture tips, creating a need for an efficient, semi-empirical many-body potential.
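The coordination argument can be checked with a toy calculation. The sketch below is not the EAM and is not taken from the review. It assumes nearest-neighbor interactions only, no lattice relaxation, and a per-atom energy that depends solely on coordination $Z$: linear in $Z$ for a pair model, proportional to $\sqrt{Z}$ for a many-body model. Creating a vacancy lowers the coordination of $Z$ neighbors from $Z$ to $Z-1$ while the removed atom is returned to a bulk-like site.

```python
import math


def vacancy_to_cohesive_ratio(site_energy, z: int) -> float:
    """Unrelaxed E_vf / E_coh for a per-atom energy that depends only on coordination.

    site_energy(n) is the (negative) energy of an atom with n nearest neighbors.
    """
    e_coh = -site_energy(z)
    # z neighbors of the vacant site each lose one bond partner.
    e_vf = z * (site_energy(z - 1) - site_energy(z))
    return e_vf / e_coh


def pair_model(n: int) -> float:
    return -0.5 * n          # each bond contributes a fixed energy, shared by two atoms


def sqrt_model(n: int) -> float:
    return -math.sqrt(n)     # bond strength weakens as coordination grows


Z_FCC = 12  # nearest-neighbor coordination in an fcc lattice
print(f"pair model: {vacancy_to_cohesive_ratio(pair_model, Z_FCC):.3f}")
print(f"sqrt model: {vacancy_to_cohesive_ratio(sqrt_model, Z_FCC):.3f}")
```

Under these assumptions the pair model gives a ratio of exactly 1, the failure described above, while the square-root model gives roughly one half. The toy model only illustrates the direction of the correction; quantitative agreement with experiment requires a fitted potential.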

Theoretical Unification & Core Innovations

The paper’s core contribution is the synthesis of the EAM as a practical computational tool that captures “coordination-dependent bond strength” without the cost of ab initio methods.

Theoretical Unification: It demonstrates that the EAM ansatz can be derived from Density Functional Theory (DFT) by assuming the total electron density is a superposition of atomic densities.

Environmental Dependence: It explicitly formulates how the “effective” pair interaction stiffens and shortens as coordination decreases (e.g., at surfaces), a feature naturally arising from the non-linearity of the embedding function.

Broad Validation: It provides a centralized evaluation of the method across a vast array of metallic properties, establishing it as the standard for atomistic simulations of face-centered cubic (fcc) metals.

Validating EAM Across Application Domains

The authors review computational experiments using Energy Minimization, Molecular Dynamics (MD), and Monte Carlo (MC) simulations across several domains:

Bulk Properties: Calculation of phonon spectra, liquid structure factors, thermal expansion coefficients, and melting points for fcc metals (Ni, Pd, Pt, Cu, Ag, Au).

Defects: Computation of vacancy formation/migration energies and self-interstitial geometries.

Grain Boundaries: Calculation of grain boundary structures, energies, and elastic properties for twist and tilt boundaries in Au and Al. Computed structures show good agreement with X-ray diffraction and HRTEM experiments. The many-body interactions in the EAM produce somewhat better agreement than pair potentials, which tend to overestimate boundary expansion.

Surfaces: Analysis of surface energies, relaxations, reconstructions (e.g., Au(110) missing row), and surface phonons.

Alloys: Investigation of heat of solution, surface segregation profiles (e.g., Ni-Cu), and order-disorder transitions.

Mechanical Properties: Simulation of dislocation mobility, pinning by defects (He bubbles), and crack tip plasticity (ductile vs. brittle fracture modes).

Key Outcomes and the Limits of EAM

Many-Body Success: The EAM successfully reproduces the breakdown of the Cauchy relation and the correct ratio of vacancy formation energy to cohesive energy (~0.35) for fcc metals.

Surface Accuracy: It correctly predicts that surface bonds are shorter and stiffer than bulk bonds due to lower coordination. It accurately predicts surface reconstructions (e.g., Au(110) $(1 \times 2)$).

Alloy Behavior: The method naturally captures segregation phenomena, including oscillating concentration profiles in Ni-Cu, driven by the embedding energy.

Limitations: The method is less accurate for systems with strong directional bonding (covalent materials) or significant Fermi-surface effects, as it assumes spherically averaged electron densities.

Reproducibility Details

Data

Fitting Data: The semi-empirical functions are fitted to basic bulk properties: lattice constants, cohesive energy, elastic constants ($C_{11}$, $C_{12}$, $C_{44}$), and vacancy formation energy.

Universal Binding Curve: The cohesive energy as a function of lattice constant is constrained to follow the “universal binding curve” of Rose et al. to ensure accurate anharmonic behavior.

Alloy Data: For binary alloys, dilute heats of alloying are used for fitting cross-interactions.

Algorithms

Core Ansatz: The total energy is defined as:

$$E_{coh} = \sum_{i} G_i\left( \sum_{j \neq i} \rho_j^a(R_{ij}) \right) + \frac{1}{2} \sum_{i, j (j \neq i)} U_{ij}(R_{ij})$$

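where $G_i$ is the embedding energy of atom $i$ (a function of the local electron density at its site), $\rho_j^a$ is the atomic electron density contributed by neighbor $j$, $U_{ij}$ is a pair interaction, and $R_{ij}$ is the distance between atoms $i$ and $j$.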
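A direct, unoptimized evaluation of this expression for a single-species cluster might look like the sketch below. The three functional forms (`toy_density`, `toy_embed`, `toy_pair`) are placeholders chosen only to make the example runnable. They are not any of the fitted parameterizations from the review, and the sketch ignores cutoffs and periodic boundaries.

```python
import math


def eam_energy(positions, embed, density, pair) -> float:
    """Sum of embedding energies plus half the double sum of pair terms."""
    total = 0.0
    for i, r_i in enumerate(positions):
        rho_i = 0.0  # host electron density at site i from all other atoms
        for j, r_j in enumerate(positions):
            if j == i:
                continue
            r_ij = math.dist(r_i, r_j)
            rho_i += density(r_ij)
            total += 0.5 * pair(r_ij)  # each pair is visited twice
        total += embed(rho_i)
    return total


# Hypothetical functional forms, for illustration only.
def toy_density(r: float) -> float:
    return math.exp(-r)


def toy_embed(rho: float) -> float:
    return -math.sqrt(rho)  # nonlinear in rho, so the energy is not pairwise additive


def toy_pair(r: float) -> float:
    return math.exp(-2.0 * r)  # short-range repulsion


dimer = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
trimer = dimer + [(0.5, math.sqrt(3) / 2, 0.0)]  # equilateral, side length 1
print(eam_energy(dimer, toy_embed, toy_density, toy_pair))
print(eam_energy(trimer, toy_embed, toy_density, toy_pair))
```

Because the embedding function is nonlinear, the trimer energy is not three times the dimer energy. This is the coordination dependence that a pure pair potential cannot express.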

Simulation Techniques:

Molecular Dynamics (MD): Used for liquids, phonons, and fracture simulations.
Monte Carlo (MC): Used for phase diagrams and segregation profiles (e.g., approximately $10^5$ iterations per atom).
Phonons: Calculated via the dynamical matrix derived from the force-constant tensor $K_{ij}$.
Normal-Mode Analysis: Vibrational normal modes obtained by diagonalizing the dynamical matrix, feasible for unit cells of up to about 260 atoms.

Models

Parameterizations: The review lists several specific function sets developed by the authors (Table 2), including:

Daw and Baskes: For Ni, Pd, H (elemental metals and H in solution/on surfaces)
Foiles: For Cu, Ag, Au, Ni, Pd, Pt (elemental metals)
Foiles: For Cu, Ni (tailored for the Ni-Cu alloy system)
Foiles, Baskes and Daw: For Cu, Ag, Au, Ni, Pd, Pt (dilute alloys)
Daw, Baskes, Bisson and Wolfer: For Ni, H (fracture, dislocations, H embrittlement)
Foiles and Daw: For Ni, Al (Ni-rich end of the Ni-Al alloy system)
Daw: For Ni (calculated from first principles, not semi-empirical)
Hoagland, Daw, Foiles and Baskes: For Al (elemental Al)

Many of these historical parameterizations are directly downloadable in machine-readable formats from the NIST Interatomic Potentials Repository (linked in the resources below).

Transferability: EAM functions are generally not transferable between different parameterization sets; mixing functions from different sets (e.g., Daw-Baskes Ni with Foiles Pd) is invalid.

Evaluation

Bulk Validation: Phonon dispersion curves for Cu show excellent agreement with experiment across the full Brillouin zone.

Thermal Properties: Linear thermal expansion coefficients match experiment well (e.g., Cu calculated: $16.4 \times 10^{-6}\ \text{K}^{-1}$ vs. experimental: $16.7 \times 10^{-6}\ \text{K}^{-1}$).

Defect Energetics: Vacancy migration energies and divacancy binding energies (~0.1-0.2 eV) align with experimental data.

Surface Segregation: Correctly predicts segregation species for 18 distinct dilute alloy cases (e.g., Cu segregating in Ni).

Hardware

Compute Scale: At the time of publication (1993), Molecular Dynamics simulations of up to 35,000 atoms were possible.

Platforms: Calculations were performed on supercomputers like the CRAY-XMP, though smaller calculations were noted as feasible on high-performance workstations.

Paper Information

Citation: Daw, M. S., Foiles, S. M., & Baskes, M. I. (1993). The embedded-atom method: a review of theory and applications. Materials Science Reports, 9(7-8), 251-310. https://doi.org/10.1016/0920-2307(93)90001-U

Publication: Materials Science Reports 1993

@article{dawEmbeddedatomMethodReview1993,
  title = {The embedded-atom method: a review of theory and applications},
  shorttitle = {The Embedded-Atom Method},
  author = {Daw, Murray S. and Foiles, Stephen M. and Baskes, Michael I.},
  year = 1993,
  month = mar,
  journal = {Materials Science Reports},
  volume = {9},
  number = {7-8},
  pages = {251--310},
  issn = {0920-2307},
  doi = {10.1016/0920-2307(93)90001-U}
}

Additional Resources: