Optical Chemical Structure Recognition

A substantial fraction of chemical knowledge is recorded as 2D diagrams in journals, patents, and textbooks. Optical Chemical Structure Recognition (OCSR) is the task of extracting machine-readable molecular representations from those images: strings like SMILES (a compact text encoding of molecular structure) and InChI (a standardized identifier for chemical substances), or molecular graphs that encode atoms as nodes and bonds as edges. For a longer introduction to the field and its motivations, see the "What is OCSR?" post.
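
To make the two output styles concrete, here is a minimal sketch of the same molecule (ethanol, SMILES `CCO`) written as a graph of atom nodes and bond edges. The `Bond` class and `neighbors` helper are hypothetical names chosen for illustration; they are not taken from any OCSR tool or cheminformatics library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bond:
    """Hypothetical edge type: two atom indices plus a bond order."""
    begin: int  # index of the first atom
    end: int    # index of the second atom
    order: int  # 1 = single, 2 = double, 3 = triple


# Ethanol as a string and as a graph (heavy atoms only, hydrogens implicit).
smiles = "CCO"
atoms = ["C", "C", "O"]                 # nodes
bonds = [Bond(0, 1, 1), Bond(1, 2, 1)]  # edges


def neighbors(atom_index, bonds):
    """Return the indices of atoms bonded to atom_index."""
    result = []
    for bond in bonds:
        if bond.begin == atom_index:
            result.append(bond.end)
        elif bond.end == atom_index:
            result.append(bond.begin)
    return result
```

The string form is convenient as a generation target, while the graph form keeps each atom and bond addressable. That difference is roughly what separates the image-to-sequence and image-to-graph groups described below.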

The notes are organized into eight sub-groups:

Rule-Based Systems cover the original OCSR pipeline (1990s to mid-2010s): vectorize an image, classify atoms and bonds with hand-coded rules, and compile a connection table. Tools like Kekulé, CLiDE, OSRA, and ChemReader defined this era.
Image-to-Sequence Models reframe recognition as image captioning, using encoder-decoder architectures to generate SMILES, InChI, or SELFIES strings. DECIMER, Img2Mol, and SwinOCSR are representative examples.
Image-to-Graph Models predict molecular graphs directly, detecting atoms and bonds as nodes and edges. Examples include MolGrapher, MolScribe, and ABC-Net.
Vision-Language Models represent the latest generation, building on large pretrained vision-language backbones for improved generalization. MolParser, GTR-CoT, MolNexTR, and SubGrapher fall here.
Hand-Drawn Structure Recognition addresses the distinct challenge of interpreting molecules sketched by hand, from early structural analysis to modern deep learning augmentation strategies.
Online Recognition processes real-time pen strokes on tablets and touch devices, using stroke order and timing for chemical symbol and expression recognition.
Benchmarks and Reviews collects survey papers, the TREC-Chem 2011 and CLEF-IP 2012 competition reports and system descriptions, and comparative analyses of OCSR tools.
Markush Structures covers detection and parsing of the generic chemical representations used in patents to claim compound families.

For orientation, the two survey papers in the benchmarks group are the best starting points: Rajan et al. 2020 covers the rule-based era and benchmarks the transition period, while Musazade et al. 2022 picks up the thread with deep learning methods.

Benchmarks and Reviews

Survey papers, competition reports, comparative analyses, and TREC/CLEF system descriptions for OCSR evaluation.

Hand-Drawn Structure Recognition

OCSR methods for recognizing chemical structures from static images of hand-drawn or sketched molecular diagrams, from early heuristics to deep learning.

Image-to-Graph Models

Deep learning OCSR methods that predict molecular graph structure directly from images, detecting atoms and bonds as nodes and edges.

Image-to-Sequence Models

Deep learning OCSR methods that treat molecular recognition as image captioning, producing SMILES, InChI, or SELFIES strings.

Markush Structures

Methods for detecting and parsing Markush structures, the generic chemical representations used in patents to claim families of related compounds.

Online Recognition

Real-time recognition of chemical structures and symbols from pen strokes on tablets and touch devices, using stroke order and timing.

Rule-Based Systems

OCSR methods using vectorization, heuristics, and hand-coded rules to extract molecular structures from images (1990s to mid-2010s).
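
As a rough illustration of the last step in that pipeline, the sketch below compiles a connection table from line segments that a vectorizer is assumed to have already produced. The function name `build_connection_table`, the segment format, and the `tol` merge distance are all assumptions made for this example. It is not the algorithm used by Kekulé, CLiDE, OSRA, or ChemReader, and it ignores atom labels, bond orders, and stereo bonds.

```python
import math


def build_connection_table(segments, tol=3.0):
    """Merge nearby segment endpoints into atoms and list the bonds.

    segments: list of ((x1, y1), (x2, y2)) line segments in pixel coordinates.
    tol: hypothetical distance below which two endpoints count as one atom.
    Returns (atoms, bonds): atom positions and pairs of atom indices.
    """
    atoms = []  # (x, y) position of each inferred atom
    bonds = []  # (atom_index_a, atom_index_b) for each segment

    def atom_index(point):
        # Reuse an existing atom if the endpoint falls within tol of it.
        for i, existing in enumerate(atoms):
            if math.dist(point, existing) <= tol:
                return i
        atoms.append(point)
        return len(atoms) - 1

    for start, end in segments:
        a = atom_index(start)
        b = atom_index(end)
        if a != b:  # skip degenerate segments that collapse to one atom
            bonds.append((a, b))
    return atoms, bonds


# Two strokes that nearly touch at (10, 0) become a three-atom chain.
segments = [((0, 0), (10, 0)), ((10.5, 0.5), (20, 0))]
atoms, bonds = build_connection_table(segments)
```

Merging endpoints by distance is the kind of hand-tuned heuristic this group of notes is about: the threshold has to be chosen per image resolution and drawing style, which is one reason later approaches moved to learned models.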

Vision-Language Models

Latest-generation OCSR approaches built on pretrained vision-language models for improved generalization across diagram styles.