Computational Chemistry
The transformation from a 2D chemical structure image to a SMILES representation

What is Optical Chemical Structure Recognition (OCSR)?

A micro-review of Optical Chemical Structure Recognition (OCSR), from rule-based systems to modern AI models.

Document Processing
A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.

MolParser-7M and WildMol Datasets for Robust Chemical Structure Recognition

MolParser-7M is a 7.7M-pair dataset for molecule-to-text conversion, featuring real-world images and complex structures …

Computational Chemistry
ZINC-22 Tranche Browser showing molecular count distribution

ZINC-22: Multi-Billion Molecule Database

ZINC-22 dataset card covering 37+ billion make-on-demand molecules for virtual screening and drug discovery.

Computational Chemistry
MARCEL dataset Kraken ligand example in 3D conformation

MARCEL: Molecular Representation and Conformer Ensemble Learning

MARCEL dataset provides 722K+ conformers across 76K+ molecules for drug discovery, catalysis, and molecular …

Computational Chemistry
GEOM dataset example molecule: N-(4-pyrimidin-2-yloxyphenyl)acetamide

GEOM: Energy-Annotated Molecular Conformations

A dataset card for the GEOM dataset, a collection of energy-annotated molecular conformations for property prediction …

Computational Chemistry
GDB-11 molecule structure showing FC1C2OC1c3c(F)coc23

GDB-11: Chemical Universe Database (26.4M Molecules)

A dataset card for the Generated Database 11 (GDB-11), a database of 26.4 million small organic molecules for virtual …

Computational Chemistry
GDB-13 molecule structure showing CCCC(O)(CO)CC1CC1CN

GDB-13: Chemical Universe Database (970M Molecules)

A dataset card for the Generated Database 13 (GDB-13), a database of nearly 1 billion small organic molecules for …

Computational Chemistry
GDB-17 molecule structure showing complex polycyclic architecture

GDB-17: Chemical Universe Database (166B Molecules)

Dataset card for GDB-17, containing 166 billion small organic molecules representing the largest enumerated chemical …

Computational Chemistry
Comparison of 2D molecular graph versus 3D conformer ensemble showing latanoprost molecule in multiple conformations

GEOM Dataset: 3D Molecular Conformer Generation

Learn how GEOM transforms 2D molecular graphs into dynamic 3D conformer ensembles for molecular machine learning …

Computational Chemistry
3D ball-and-stick model of butane molecule representing the structural isomer generation process

Synthetic Isomer Data Generation Pipeline

An end-to-end cheminformatics pipeline transforming 1D chemical formulas into 3D conformer datasets using graph …...

Natural Language Processing
Word vector illustration showing text classification and NLP concepts

Sarcasm Detection with Transformers: A Cautionary Tale

Learn how dataset bias can lead to misleading results in NLP: a sarcasm detection model that actually learned to …

Computational Social Science
Top features for Economics and Public Finance policy classification across Congresses

How Does Congress Actually Work? Data from 15K Bills

What happens to bills in Congress? Analyzing 15K+ bills from the 117th Congress to understand legislative patterns, …