Molecular Databases & Datasets on Hunter Heidenreich | ML Research Scientist

VQM24: 836k Molecules at DFT and Diffusion QMC

Sat, 11 Apr 2026 00:00:00 +0000

Key Contribution

VQM24 (Vector-QM24) is the first exhaustive quantum mechanical dataset covering all possible neutral closed-shell small molecules with up to five heavy atoms from nine p-block elements (C, N, O, F, Si, P, S, Cl, Br). It provides DFT-level properties for all 836k structures and diffusion quantum Monte Carlo (DMC) energies for a 10,793-molecule subset, constituting the largest QMC dataset in chemical space to date. ML benchmarking reveals that VQM24 is significantly more challenging than QM9 despite containing smaller molecules.

Overview

Most existing QM datasets (QM7, QM9, ANI-1x) are derived from string-based molecular lists and are restricted to a few elements (typically CHONF), introducing selection bias and limiting ML model generalizability. VQM24 addresses this by exhaustively enumerating all valid stoichiometries, Lewis-rule-consistent graphs, and stable conformers for molecules composed of 9 elements with their most common valencies:

Element	Valencies
C	4
N	3, 5
O	2
F	1
Si	4
P	3, 5
S	2, 4, 6
Cl	1
Br	1

Dataset Subsets

Heavy Atoms	Stoichiometries	Graphs	Geometries
1	9	9	9
2	69	69	81
3	367	766	1,287
4	1,321	10,992	29,581
5	3,793	246,406	753,917
Total	5,559	258,242	784,875 (minima)

Including saddle points, the full dataset contains 835,947 converged structures. Extrapolation suggests ~33 million geometries at 6 heavy atoms.
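
The totals above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted in this post (the saddle-point count of 51,072 comes from the file listing further down); the variable names are illustrative.

```python
# Per-size counts copied from the table: (stoichiometries, graphs, geometries).
counts = {
    1: (9, 9, 9),
    2: (69, 69, 81),
    3: (367, 766, 1_287),
    4: (1_321, 10_992, 29_581),
    5: (3_793, 246_406, 753_917),
}

stoich_total = sum(c[0] for c in counts.values())
graph_total = sum(c[1] for c in counts.values())
minima_total = sum(c[2] for c in counts.values())

saddles = 51_072  # structures that converged to saddle points
converged_total = minima_total + saddles

# Average number of conformational minima per constitutional isomer.
conformers_per_graph = minima_total / graph_total

print(stoich_total, graph_total, minima_total, converged_total)
print(round(conformers_per_graph, 2))
```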

Generation Pipeline

Stoichiometry enumeration: All combinations of up to 5 heavy atoms from the 13 element/valency types, with hydrogen counts determined by integer partitioning of total valency
Graph generation: Constitutional isomers enumerated using Surge for each stoichiometry
Geometry initialization: RDKit MMFF94 force field generates initial 3D coordinates
Semi-empirical optimization: GFN2-xTB geometry optimization
Conformer search: CREST identifies conformational isomers (~1.1M initial geometries)
DFT optimization: Three-pass $\omega$B97X-D3/cc-pVDZ optimization in PSI4 v1.7, all using Gaussian Tight convergence criteria with density fitting (cc-pVDZ-JKFIT auxiliary basis):
- Pass 1: Default PSI4 settings (DIIS for SCF, RFO optimizer in redundant internal coordinates), max 100 steps
- Pass 2: SOSCF with full Newton step, ultrafine Lebedev-Treutler grid (590 spherical, 99 radial points), max 100 steps
- Pass 3: Full Hessian evaluation at initial geometry and every 20th step, Cartesian coordinates, max 50 steps
DMC calculations: For 10,793 lowest-energy conformers with up to 4 heavy atoms, using QMCPACK with PBE0/ccECP/cc-pVQZ trial wavefunctions. Slater-Jastrow trial wavefunctions with Jastrow terms for 1-body (16 params/atom type, 8 Bohr cutoff), 2-body (20 params/spin-channel, 10 Bohr cutoff), and 3-body (26 params, 5 Bohr cutoff) interactions. DMC used a timestep of 0.001 a.u., 16,000 walkers, and 1,500 blocks of 40 imaginary time steps. ccECP pseudopotentials with the determinant-localization approximation and t-moves (DLTM) handled core electrons.
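
The first pipeline step can be sketched as a multiset enumeration over the 13 element/valency types. This is a minimal illustration, not the authors' code: it stops before hydrogen partitioning and any validity checks, so the counts it prints are numbers of heavy-atom type selections, not the stoichiometry counts in the table above.

```python
from itertools import combinations_with_replacement

# The 13 element/valency types from the valency table.
TYPES = [
    ("C", 4), ("N", 3), ("N", 5), ("O", 2), ("F", 1), ("Si", 4),
    ("P", 3), ("P", 5), ("S", 2), ("S", 4), ("S", 6), ("Cl", 1), ("Br", 1),
]


def heavy_atom_multisets(max_heavy=5):
    """Yield every unordered selection of 1..max_heavy element/valency types."""
    for n in range(1, max_heavy + 1):
        yield from combinations_with_replacement(TYPES, n)


def total_valency(selection):
    """Valency budget that bonds and hydrogens have to be distributed over."""
    return sum(valency for _, valency in selection)


# Number of selections for each heavy-atom count.
per_size = {
    n: sum(1 for _ in combinations_with_replacement(TYPES, n))
    for n in range(1, 6)
}
print(per_size)
print(total_valency((("C", 4), ("O", 2))))
```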

The $\omega$B97X-D3 functional was chosen for its strong GMTKN55 benchmark performance and for compatibility with ANI-1, ANI-1x, OrbNet Denali, QMugs, SPICE, and MultiXC-QM9, all of which use $\omega$B97X variants with double-zeta basis sets. This enables transfer learning across datasets.

Data Files and Access

The Zenodo dataset contains separate .npz files, loadable via NumPy:

File	Contents	Molecules
`DFT_all.npz`	DFT properties for all conformational minima	784,875
`DFT_uniques.npz`	DFT properties for constitutional isomers (most stable conformer)	258,242
`DFT_saddles.npz`	DFT properties for saddle point structures	51,072
`DMC.npz`	DMC total energies and error bars	10,793
`wavefunctions.tar.gz`	Wavefunction .molden files (includes MO energies)	~106.7 GB

All molecules are ordered consistently across every array within a file. Properties are accessed by key:

import numpy as np
data = np.load('DFT_all.npz', allow_pickle=True)
print(data.files)  # list all available properties
freqs = data['freqs']  # vibrational frequencies

Computed Properties

DFT ($\omega$B97X-D3/cc-pVDZ) properties and their NPZ access keys:

Property	Unit	Key
Total energies	Ha	`Etot`
Internal energies	Ha	`U0`
Atomization energies	Ha	`Eatomization`
Electron-electron energies	Ha	`Eee`
Exchange-correlation energies	Ha	`Exc`
Dispersion energy	Ha	`Edisp`
HOMO-LUMO gap	Ha	`gap`
Dipole moments	a.u.	`dipole`
Quadrupole moments	a.u.	`quadrupole`
Octupole moments	a.u.	`octupole`
Hexadecapole moments	a.u.	`hexadecapole`
Rotational constants	MHz	`rots`
Vibrational modes	Å	`vibmodes`
Vibrational frequencies	cm$^{-1}$	`freqs`
Gibbs free energy (H)	Ha	`G`
Internal (thermal) energy (H)	Ha	`U298`
Enthalpy (H)	Ha	`H`
ZPVE (H)	Ha	`zpves`
Entropy (H)	cal/(mol·K)	`S`
Heat capacities (H)	cal/(mol·K)	`Cv`, `Cp`
Electrostatic potentials at nuclei	a.u.	`Vesp`
Mulliken charges	a.u.	`Qmulliken`
SMILES		`graphs`
InChI strings		`inchi`

(H) indicates thermodynamic properties computed via the harmonic approximation. Molecular orbital energies are available in the wavefunction .molden files.
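
Because most energies are stored in Hartree while ML errors are usually quoted in kcal/mol, a small conversion helper is handy. The sketch below uses standard conversion factors; the function names are illustrative and not part of the dataset tooling.

```python
# Standard conversion factors from Hartree.
HARTREE_TO_KCAL_PER_MOL = 627.509474
HARTREE_TO_EV = 27.211386


def hartree_to_kcal(e_ha):
    """Convert an energy from Hartree to kcal/mol."""
    return e_ha * HARTREE_TO_KCAL_PER_MOL


def hartree_to_ev(e_ha):
    """Convert an energy from Hartree to electronvolts."""
    return e_ha * HARTREE_TO_EV


# Chemical accuracy (1 kcal/mol) expressed in milli-Hartree.
chemical_accuracy_mha = 1000.0 / HARTREE_TO_KCAL_PER_MOL
print(round(chemical_accuracy_mha, 3))
```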

DMC properties (DMC.npz) include total energy (Etot) and statistical error bar (std) for each molecule.

DMC energies (PBE0/ccECP/cc-pVQZ nodal surfaces, Slater-Jastrow trial wavefunctions) achieve an average statistical uncertainty of 0.4 mHa across ~2.3 billion samples per molecule.
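
When two DMC energies are compared, their error bars combine in quadrature, provided the runs are statistically independent (an assumption of this sketch). The numbers below are hypothetical and are not taken from DMC.npz.

```python
import math

HARTREE_TO_KCAL_PER_MOL = 627.509474


def energy_difference(e_a, std_a, e_b, std_b):
    """Return (E_a - E_b, combined standard error), assuming independent runs."""
    return e_a - e_b, math.hypot(std_a, std_b)


# Hypothetical energies and error bars in Hartree, for illustration only.
diff, err = energy_difference(-40.5000, 0.0004, -40.4950, 0.0004)
print(f"dE = {diff * 1000:.2f} +/- {err * 1000:.2f} mHa")

# The quoted average uncertainty of 0.4 mHa, expressed in kcal/mol.
avg_std_kcal = 0.0004 * HARTREE_TO_KCAL_PER_MOL
print(f"{avg_std_kcal:.2f} kcal/mol")
```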

ML Benchmarking: Harder Than QM9

Learning curves for atomization energy prediction show that VQM24 is substantially more challenging than QM9 for all tested models:

KRR models (CM, ACSF, LMBTR, FCHL19, cMBDF) and GNNs (SchNet, PaiNN) all show up to ~8x larger mean errors on VQM24 than QM9 at the same training set size
None of the tested models achieve chemical accuracy (1 kcal/mol) on VQM24, even with 128k training molecules
The atomization energy range in VQM24 (1,545 kcal/mol) is smaller than QM9 (2,427 kcal/mol), so the higher errors reflect greater chemical diversity rather than a wider property range
For a fair comparison with QM9 (which has no conformational isomers), learning curves use only the 258k unique constitutional isomers from VQM24

Benchmark methodology: KRR models use an atomic Gaussian kernel with hyperparameters (length-scale $l$, regularizer $\lambda$) optimized via grid search and 5-fold cross-validation. Both GNNs (SchNet, PaiNN) use 128 atomic basis functions (589k total parameters), trained for 1,000 epochs with Adam (lr = $10^{-4}$). Test set size is 10,000 randomly selected molecules, with results averaged over 5 runs. Training and evaluation scripts are available in the GitHub repository.
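
The KRR protocol (Gaussian kernel, grid search over $l$ and $\lambda$, 5-fold cross-validation) can be sketched in pure Python on toy data. This is a simplified illustration, not the benchmark code: it uses a global Gaussian kernel on one-dimensional inputs in place of an atomic kernel on molecular representations, and the grid values and toy target are assumptions.

```python
import math


def gaussian_kernel(x, y, length_scale):
    """k(x, y) = exp(-||x - y||^2 / (2 l^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * length_scale ** 2))


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit(xs, ys, length_scale, lam):
    """Regression weights alpha = (K + lam * I)^-1 y."""
    k = [[gaussian_kernel(a, b, length_scale) + (lam if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    return solve(k, ys)


def predict(xs_train, alphas, x, length_scale):
    return sum(a * gaussian_kernel(xt, x, length_scale)
               for a, xt in zip(alphas, xs_train))


def cv_mae(xs, ys, length_scale, lam, folds=5):
    """Mean absolute error over interleaved k-fold cross-validation."""
    errors = []
    for f in range(folds):
        train = [i for i in range(len(xs)) if i % folds != f]
        test = [i for i in range(len(xs)) if i % folds == f]
        x_tr, y_tr = [xs[i] for i in train], [ys[i] for i in train]
        alphas = fit(x_tr, y_tr, length_scale, lam)
        errors += [abs(predict(x_tr, alphas, xs[i], length_scale) - ys[i])
                   for i in test]
    return sum(errors) / len(errors)


# Toy data standing in for (representation, atomization energy) pairs.
xs = [(i / 10.0,) for i in range(30)]
ys = [math.sin(x[0]) for x in xs]

grid = [(l, lam) for l in (0.2, 0.5, 1.0) for lam in (1e-6, 1e-3)]
best = min(grid, key=lambda p: cv_mae(xs, ys, *p))
best_mae = cv_mae(xs, ys, *best)
print(best, best_mae)
```

In practice a linear-algebra library would replace the hand-written solver; it is kept here only so the sketch has no dependencies.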

Prediction error analysis with the best KRR model (cMBDF, trained on 200k across 4 disjoint training sets on all 784,875 equilibrium geometries) yields an overall MAE of 0.75 kcal/mol (standard deviation 1.55 kcal/mol). The largest individual error reaches 167.3 kcal/mol, and the 25 largest outliers have a mean absolute error of 85.9 kcal/mol.

Strengths & Limitations

Strengths:

Exhaustive coverage of 1-5 heavy atom chemical space across 9 elements
Both DFT and DMC-level data (largest QMC dataset in chemical space)
Includes conformational isomers (average 3 per constitutional isomer)
Extensive property set including wavefunctions and multipole moments up to hexadecapole
More challenging ML benchmark than QM9, exposing model limitations

Limitations:

Limited to 5 heavy atoms (very small molecules)
262,542 structures (~24%) failed DFT convergence, with a strong silicon bias in failures
51,072 structures converged to saddle points rather than minima
DMC subset limited to 4 heavy atoms (10,793 molecules)
Does not include metals, rare gases, or heavier halogens (I)

Reproducibility Details

Status: Highly Reproducible

The paper, dataset, and code are all publicly available.

Artifact	Type	License	Notes
VQM24 Dataset (Zenodo)	Dataset	CC-BY-4.0	DFT .npz files + DMC .npz + wavefunction tarball (~108 GB total)
dkhan42/VQM24 (GitHub)	Code	MIT	Generation tools, PSI4 templates, KRR and GNN training scripts
arXiv preprint	Paper	arXiv license	Open-access preprint of the Scientific Data article

Software stack: Surge (graph enumeration), RDKit/MMFF94 (initial geometries), GFN2-xTB (semi-empirical optimization), CREST (conformer search), PSI4 v1.7 (DFT), PySCF (trial wavefunctions), QMCPACK (DMC), QMLcode (KRR models), SchNetPack (GNN models).

Hardware requirements:

DFT: Three-pass $\omega$B97X-D3/cc-pVDZ optimization in PSI4 (per-molecule compute cost is not reported)
DMC trial wavefunctions: Argonne LCRC Improv, single node (2x AMD EPYC 7713, 64 cores, 2 GHz), ~45 seconds per molecule, ~134 node-hours total
DMC calculations: Argonne Polaris HPC (AMD EPYC 7543P, 64 cores, 2.8 GHz), 20 nodes per molecule, ~15 minutes each, ~54,000 node-hours total
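
The quoted node-hour totals follow directly from the per-molecule figures. A quick check, using only numbers stated above:

```python
N_MOLECULES = 10_793  # size of the DMC subset

# Trial wavefunctions: one node, about 45 seconds per molecule.
trial_node_hours = N_MOLECULES * 45 / 3600

# DMC: 20 nodes for about 15 minutes per molecule.
dmc_node_hours = N_MOLECULES * 20 * 15 / 60

print(f"trial wavefunctions: {trial_node_hours:.1f} node-hours")
print(f"DMC: {dmc_node_hours:.0f} node-hours")
```

Both results agree with the rounded totals in the list (~134 and ~54,000 node-hours).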

Citation

@article{khan2025quantum,
  title={Quantum mechanical dataset of 836k neutral closed-shell molecules
         with up to 5 heavy atoms from C, N, O, F, Si, P, S, Cl, Br},
  author={Khan, Danish and Benali, Anouar and Kim, Scott Y. H.
          and von Rudorff, Guido Falk and von Lilienfeld, O. Anatole},
  journal={Scientific Data},
  volume={12},
  number={1},
  pages={1551},
  year={2025},
  publisher={Nature Portfolio},
  doi={10.1038/s41597-025-05428-4}
}

QM9: Quantum Chemistry Properties of 134k Molecules

Sat, 11 Apr 2026 00:00:00 +0000

Key Contribution

QM9 provides a consistent, comprehensive set of quantum chemical properties for 133,885 small organic molecules (up to 9 heavy atoms of C, N, O, F) from the GDB-17 chemical universe. It is among the most widely used benchmark datasets in molecular machine learning, enabling systematic development and evaluation of structure-property prediction methods.

Overview

The dataset corresponds to the GDB-9 subset of the GDB-17 chemical universe: all neutral molecules with up to nine heavy atoms (C, O, N, F), not counting hydrogen. Cations, anions, and molecules containing S, Br, Cl, or I were excluded, though 1,705 zwitterions (relevant for small biomolecules like amino acids) were retained. The dataset spans 621 stoichiometries. It includes small amino acids (glycine, alanine), nucleobases (cytosine, uracil, thymine), and pharmaceutically relevant building blocks (pyruvic acid, piperazine, hydroxy urea).

Computed Properties

All properties were calculated at the B3LYP/6-31G(2df,p) level of DFT. The 15 scalar properties per molecule are:

Property	Unit	Description
A, B, C	GHz	Rotational constants
$\mu$	D	Dipole moment
$\alpha$	$a_0^3$	Isotropic polarizability
$\varepsilon_{\text{HOMO}}$	Ha	HOMO energy
$\varepsilon_{\text{LUMO}}$	Ha	LUMO energy
$\varepsilon_{\text{gap}}$	Ha	HOMO-LUMO gap
$\langle R^2 \rangle$	$a_0^2$	Electronic spatial extent
ZPVE	Ha	Zero-point vibrational energy
$U_0$	Ha	Internal energy at 0 K
$U$	Ha	Internal energy at 298.15 K
$H$	Ha	Enthalpy at 298.15 K
$G$	Ha	Free energy at 298.15 K
$C_v$	cal/(mol·K)	Heat capacity at 298.15 K

Each molecule is stored in an extended XYZ file. The first line gives the atom count, and the second (comment) line packs all 15 scalar properties. Lines 3 through $n_a + 2$ contain element type, Cartesian coordinates (x, y, z in Angstroms), and Mulliken partial charges as a fifth column. Three trailing lines append harmonic vibrational frequencies ($3n_a - 5$ or $3n_a - 6$ modes, in cm$^{-1}$), SMILES strings (from GDB-17 and from the B3LYP-relaxed geometry), and InChI strings (from Corina and B3LYP geometries).
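
The layout described above can be read with a short parser. This is a sketch under stated assumptions: the comment line is assumed to start with a tag and an index before the 15 scalars, and some files are assumed to write exponents Mathematica-style (`*^`). The sample record is synthetic and not copied from the dataset.

```python
PROPERTY_NAMES = ["A", "B", "C", "mu", "alpha", "homo", "lumo", "gap",
                  "r2", "zpve", "U0", "U", "H", "G", "Cv"]


def _to_float(token):
    # Assumption: exponents may appear as e.g. 1.5*^-6 instead of 1.5e-6.
    return float(token.replace("*^", "e"))


def parse_qm9_xyz(text):
    """Parse one extended-XYZ record into a dictionary."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    # Assumption: any leading tag/index tokens precede the 15 scalar values.
    header = lines[1].split()
    values = [_to_float(tok) for tok in header[-len(PROPERTY_NAMES):]]
    atoms = []
    for line in lines[2:2 + n_atoms]:
        element, x, y, z, charge = line.split()
        atoms.append((element, _to_float(x), _to_float(y), _to_float(z),
                      _to_float(charge)))
    return {
        "n_atoms": n_atoms,
        "properties": dict(zip(PROPERTY_NAMES, values)),
        "atoms": atoms,  # (element, x, y, z, Mulliken charge)
        "frequencies": [_to_float(t) for t in lines[2 + n_atoms].split()],
        "smiles": lines[3 + n_atoms].split(),
        "inchi": lines[4 + n_atoms].split(),
    }


# Synthetic record with made-up values, for illustration only.
SAMPLE = """3
gdb 0 1.0 2.0 3.0 1.85 6.3 -0.29 0.07 0.36 19.0 0.021 -76.40 -76.40 -76.40 -76.42 6.0
O 0.0 0.0 0.119 -0.6
H 0.0 0.763 -0.477 0.3
H 1.0*^-6 -0.763 -0.477 0.3
1600.0 3700.0 3800.0
O O
InChI=1S/H2O/h1H2 InChI=1S/H2O/h1H2
"""

rec = parse_qm9_xyz(SAMPLE)
print(rec["n_atoms"], rec["properties"]["gap"], len(rec["frequencies"]))
```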

Dataset Subsets

Subset	Size	Description
GDB-9 (Full)	133,885	All molecules, B3LYP properties
C7H10O2 isomers	6,095	Predominant stoichiometry, with additional G4MP2 energetics
Validation set	100	Random subset with G4MP2, G4, and CBS-QB3 reference values

Geometry Generation Pipeline

Starting from GDB-17 SMILES strings, initial 3D coordinates were generated with Corina, then relaxed at the PM7 semi-empirical level (MOPAC), followed by B3LYP/6-31G(2df,p) geometry optimization (Gaussian 09). A five-stage iterative convergence procedure handled difficult cases: default thresholds, then ultrafine grids, tighter SCF criteria, Hessian-guided optimization (calcfc), and full Hessian optimization (calcall). After all stages, 11 molecules still failed to converge to true minima (6 converged with loose thresholds, 2 near-linear molecules converged to saddle points with very low imaginary frequencies below $i10 \text{ cm}^{-1}$).

Validation

Geometry consistency: B3LYP-relaxed geometries were converted back to InChI strings and compared against the original GDB-17 InChI. 3,054 molecules failed this round-trip test, primarily due to implementation-specific artifacts in SMILES/InChI conversion rather than actual geometry problems. Coulomb-matrix distances between Corina and B3LYP geometries quantified the magnitude of geometric changes.

Energy accuracy: For 100 randomly selected molecules, B3LYP atomization enthalpies were compared against higher-level composite methods. These reference methods are themselves near experimental accuracy: G4MP2 achieves MAE 1.0 and RMSE 1.5 kcal/mol against the G3/05 test set of 454 experimental energies, while G4 achieves MAE 0.8 and RMSE 1.2 kcal/mol on the same set. G4MP2 also deviates by only 1.4 kcal/mol from the highly accurate W1w composite procedure on 261 bond dissociation enthalpies (BDE261 dataset). Against these references, B3LYP shows:

Reference	MAE (kcal/mol)	RMSE (kcal/mol)	Max AE (kcal/mol)
G4MP2	5.0	6.1	16.0
G4	4.9	5.9	14.4
CBS-QB3	4.5	5.5	13.4
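
For reference, the three error measures in the table are computed as follows. The input values are hypothetical and only illustrate the arithmetic; they are not taken from validation.txt.

```python
import math


def error_metrics(predicted, reference):
    """Return (MAE, RMSE, maximum absolute error) for paired values."""
    errors = [abs(p - r) for p, r in zip(predicted, reference)]
    mae = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse, max(errors)


# Hypothetical atomization enthalpies in kcal/mol, for illustration only.
b3lyp = [-100.0, -250.0, -400.0, -620.0]
reference = [-103.0, -254.0, -405.0, -612.0]

mae, rmse, max_ae = error_metrics(b3lyp, reference)
print(mae, rmse, max_ae)
```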

All 6,095 C7H10O2 isomers passed the geometry consistency check, and their G4MP2-level energetics provide a higher-accuracy benchmark within a fixed stoichiometry.

Strengths & Limitations

Strengths:

Comprehensive and consistent: same level of theory across all 134k molecules
Derived from a systematically enumerated chemical space (GDB-17), reducing selection bias
Rich property set covering geometric, electronic, energetic, and thermodynamic quantities
Widely adopted benchmark enabling reproducible comparisons across ML methods

Limitations:

Restricted to very small molecules (up to 9 heavy atoms), limiting relevance to drug-sized compounds
Only CHONF elements, excluding sulfur, halogens (Cl, Br, I), and metals
B3LYP/6-31G(2df,p) has known systematic errors (~5 kcal/mol MAE for atomization enthalpies)
3,054 molecules have geometry consistency issues in SMILES/InChI round-tripping
Single conformer per molecule (energy-minimized geometry only)

Reproducibility Details

Artifact	Type	License	Notes
Figshare collection	Dataset	CC BY-NC-SA 4.0	Full dataset: 134k molecules, C7H10O2 isomers, validation set, atomic references

The Figshare deposit contains four files:

dsgdb9nsd.xyz.tar.bz2: All 133,885 GDB-1 through GDB-9 molecules with B3LYP properties
dsC7O2H10nsd.xyz.tar.bz2: 6,095 C7H10O2 constitutional isomers with G4MP2 energetics
validation.txt: Atomization enthalpies at B3LYP, G4MP2, G4, and CBS-QB3 for 100 random molecules
atomref.txt: Atomic reference energies for computing atomization energies from total energies
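
The role of the atomic reference energies can be sketched as below: the atomization energy is the molecular total energy minus the sum of isolated-atom energies. The reference values here are placeholders, not the contents of atomref.txt, and the function name is illustrative.

```python
from collections import Counter


def atomization_energy(total_energy, elements, atom_refs):
    """E_atomization = E_molecule - sum of isolated-atom reference energies."""
    counts = Counter(elements)
    return total_energy - sum(n * atom_refs[el] for el, n in counts.items())


# Placeholder reference energies in Hartree; real values come from atomref.txt.
atom_refs = {"H": -0.5, "C": -37.8}

# Hypothetical total energy for a CH4-like molecule.
e_at = atomization_energy(-40.5, ["C", "H", "H", "H", "H"], atom_refs)
print(round(e_at, 6))
```

With this sign convention a bound molecule has a negative atomization energy.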

All data is in extended XYZ plain-text format. The paper and its metadata are open access (CC BY-NC-SA 4.0 for the article, CC0 for metadata).

No source code is provided. The computational pipeline relies on commercial and semi-commercial software: Corina (3D coordinate generation), MOPAC (PM7 semi-empirical relaxation), and Gaussian 09 (B3LYP DFT calculations). Specific convergence keywords and iteration procedures are documented in the paper. Hardware requirements are not reported.

Reproducibility status: Partially Reproducible. The dataset itself is fully available, but regenerating it requires commercial licenses for Corina and Gaussian 09.

Citation

@article{ramakrishnan2014quantum,
  title={Quantum chemistry structures and properties of 134 kilo molecules},
  author={Ramakrishnan, Raghunathan and Dral, Pavlo O. and Rupp, Matthias and von Lilienfeld, O. Anatole},
  journal={Scientific Data},
  volume={1},
  number={1},
  pages={140022},
  year={2014},
  publisher={Nature Portfolio},
  doi={10.1038/sdata.2014.22}
}

GDBMedChem: Drug-Like Subset of GDB-17 (10M Molecules)

Sat, 11 Apr 2026 00:00:00 +0000

Key Contribution

GDBMedChem is a 10 million molecule subset of GDB-17 selected using medicinal chemistry criteria rather than the fragment-likeness rules used for FDB-17. The resulting database has reduced complexity and better synthetic accessibility than the full GDB-17, while retaining higher Fsp3 carbon fraction and natural product likeness compared to known drugs. Critically, 97% of its MHFP6 substructure shingles are absent from DrugBank, ChEMBL, and ZINC, making it an unprecedented source of structural diversity for drug design.

Overview

GDB-17 enumerates 166.4 billion molecules following chemical stability and synthetic feasibility rules, but does not consider medicinal chemistry criteria such as acceptable functional group types, overall structural complexity, or drug-likeness. GDBMedChem addresses this gap with a different filtering philosophy than FDB-17: instead of enforcing fragment-likeness (rotatable bond limits, small size), it applies medicinal chemistry-inspired rules that allow larger, more flexible molecules while excluding problematic functional groups and overly complex scaffolds.

Assembly Pipeline

Stage 1: Medicinal chemistry filters (166.4B to 17.8B, ~9.4x reduction)

Three categories of filters, each benchmarked against ChEMBL, DrugBank, and UNPD (natural products) to ensure low elimination of known bioactives:

Category	Key Filters	GDB-17 Eliminated
Functional groups	No amidines, imidates, aldehydes, aziridines, epoxides; no Br/I; no Cl/F on heterocycles; max 1 nitrile/alkyne/sulfone; max 2 ethers/amides/esters	53%
Structural complexity	Max 18 avalon fingerprint density; max 1 cyclic tetravalent node; max 4 stereocenters; max 3 bonds in fused ring systems; max 3 rings	62%
Polarity	Heteroatom-to-carbon ratio max 0.7	6%
Combined	All filters together	86%

These filters eliminate 86% of GDB-17 but only 36% of ChEMBL molecules and 50% of DrugBank drugs (the higher DrugBank rate is driven mainly by the heteroatom-to-carbon ratio filter removing highly polar drugs with negative clogP values).

Of the 21 filters, 16 are implemented as SMARTS queries and 5 (stereocenters, ring count, avalon density, heteroatom-to-carbon ratio, largest aromatic ring size) use other RDKit functions. Filters were applied progressively (simplest first), not in the order listed above. The benchmarking percentages for ChEMBL and DrugBank refer to ChEMBL 22 and DrugBank 5.011 molecules with HAC ≤ 17.
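
Progressive filtering can be sketched as an ordered list of predicates that stops at the first failure. Only three of the non-SMARTS limits from the table are shown, the order is illustrative, and the descriptor dictionary (with hypothetical keys) is assumed to be precomputed elsewhere, for example with RDKit.

```python
def hetero_carbon_ratio(desc):
    """Heteroatom-to-carbon ratio; carbon-free molecules are treated as infinite."""
    if desc["carbons"] == 0:
        return float("inf")
    return desc["heteroatoms"] / desc["carbons"]


# (name, predicate) pairs, applied in order until one fails.
FILTERS = [
    ("ring count", lambda d: d["rings"] <= 3),
    ("stereocenters", lambda d: d["stereocenters"] <= 4),
    ("heteroatom/carbon ratio", lambda d: hetero_carbon_ratio(d) <= 0.7),
]


def first_failed_filter(desc):
    """Return the name of the first filter a molecule fails, or None if it passes."""
    for name, passes in FILTERS:
        if not passes(desc):
            return name
    return None


# Hypothetical descriptor sets, for illustration only.
ok = {"rings": 1, "stereocenters": 0, "carbons": 9, "heteroatoms": 3}
polar = {"rings": 0, "stereocenters": 0, "carbons": 2, "heteroatoms": 3}
print(first_failed_filter(ok), first_failed_filter(polar))
```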

Stage 2: Even sampling (17.8B to 10M)

The 17,804,900,000 molecules in the filtered set are binned into 425 possible triplet combinations of HAC (1-17), heteroatoms (≤1, 2, 3, 4, ≥5), and stereocenters (0, 1, 2, 3, 4). Of these, 181 bins are unoccupied, leaving 244 bins. PySpark’s sampleBy function performs stratified sampling without replacement, using a round-robin allocation that increments each bin’s quota by one until the total reaches 10M. The resulting distribution is uniform except in low-HAC bins (HAC ≤ 10) where all available molecules are taken.
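
The round-robin quota allocation described above can be sketched in plain Python. This is an interpretation of the description, not the authors' implementation: each occupied bin's quota grows by one per round, full bins are skipped, and the resulting per-bin fractions are the kind of input a stratified sampler such as PySpark's sampleBy accepts. Bin sizes below are toy values.

```python
def allocate_quotas(bin_sizes, target):
    """Give each bin one more slot per round until `target` slots are assigned."""
    quotas = {key: 0 for key in bin_sizes}
    remaining = min(target, sum(bin_sizes.values()))
    while remaining > 0:
        for key, size in bin_sizes.items():
            if remaining == 0:
                break
            if quotas[key] < size:  # skip bins that are already exhausted
                quotas[key] += 1
                remaining -= 1
    return quotas


def sampling_fractions(bin_sizes, quotas):
    """Per-bin sampling fraction (quota divided by bin size)."""
    return {k: quotas[k] / size for k, size in bin_sizes.items() if size > 0}


# Toy bins keyed by (HAC, heteroatoms, stereocenters).
bin_sizes = {(5, 1, 0): 3, (12, 2, 1): 50, (17, 4, 2): 1000}
quotas = allocate_quotas(bin_sizes, target=30)
fractions = sampling_fractions(bin_sizes, quotas)
print(quotas)
print(fractions)
```

The small bin is taken in full while the larger bins share the rest evenly, which mirrors the behaviour described for low-HAC bins.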

Comparison with FDB-17

GDBMedChem and FDB-17 are both 10M-molecule subsets of GDB-17 but take fundamentally different approaches:

Property	GDBMedChem	FDB-17
Parent set	17.8B (medchem filters)	4.6B (fragment filters)
Overlap	480M molecules shared between parent sets
Rotatable bonds	Similar to known drugs	Restricted to max 3 (fragment-like)
Key difference	Drug-like flexibility, medchem FG rules	Fragment-like rigidity, strict FG removal

Both databases retain GDB-17’s characteristic high Fsp3 fraction and 3D molecular shape diversity compared to predominantly planar known molecules.

Substructure Novelty

MHFP6 (MinHash fingerprint with diameter 6) shingle analysis reveals striking structural novelty:

Database	Molecules	Unique Shingles	Unique to Database
GDBMedChem	10M	17.3M	97%
ChEMBL	1.4M	1.6M	57%
ZINC	15M	1.5M	53%
DrugBank	8.3k	82k	12%

GDBMedChem contains 17.3 million unique shingles, roughly 10x more than the 15 million-molecule ZINC database, with 97% appearing in no other database. The cumulative unique shingle count grows faster and more steadily with database size for GDBMedChem than for known molecule databases, reflecting greater internal diversity. Among the most frequent shingles, oxygen-containing saturated or singly unsaturated substructures dominate GDBMedChem, in contrast to aromatic and nitrogen heterocycles in ZINC.
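
The "Unique to Database" column reduces to a set difference. A minimal sketch with toy shingle sets (the strings are placeholders, not real MHFP6 shingles):

```python
def unique_fraction(target, others):
    """Fraction of `target` shingles that appear in none of the `others` sets."""
    seen_elsewhere = set().union(*others) if others else set()
    return len(target - seen_elsewhere) / len(target)


# Toy shingle sets, for illustration only.
db_a = {"s1", "s2", "s3", "s4"}
db_b = {"s1", "s9"}
db_c = {"s7"}

frac = unique_fraction(db_a, [db_b, db_c])
print(frac)
```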

Property Profiles

Compared to known drugs (DrugBank17 and ChEMBL17, the HAC ≤ 17 subsets of DrugBank and ChEMBL):

Synthetic accessibility: Slightly better than GDB-17 due to complexity filters, but still worse than for known molecules
Natural product likeness: Significantly higher than drugs, approaching natural products (UNPD17)
Fsp3 fraction: Higher than drugs, reflecting more 3D-shaped molecules
Compound categories: Much higher fraction of heterocyclic molecules, much lower fraction of aromatic molecules (a consequence of combinatorial enumeration favoring heteroatom-in-ring combinations)

Strengths & Limitations

Strengths:

97% structurally novel substructures provide unprecedented diversity for drug design
Medicinal chemistry filters retain drug-relevant functional group patterns
Even sampling corrects GDB-17’s combinatorial bias toward large, complex molecules
Higher Fsp3 and natural product likeness compared to known drugs
Available with interactive 3D visualization, MQN/MHFP6 similarity search, and download

Limitations:

Synthetic accessibility scores remain lower than for known molecules
Excludes Br, I, and Cl/F on heterocycles, which are common in medicinal chemistry
Random sampling means specific molecules of interest from the 17.8B parent set may be absent
Overlap with FDB-17 is limited (different filtering philosophies), so both databases complement rather than replace each other

Technical Notes

Molecule Preprocessing

Before filtering, each molecule undergoes: counter-ion removal, largest-fragment retention, conversion to non-chiral SMILES, valence-error checking, and protonation at pH 7.4 (using ChemAxon JChem). Duplicates are removed by canonical SMILES comparison within each database.

Reference Databases

The comparison databases used specific versions: ChEMBL 22 (1.4M compounds with HAC ≤ 50; 105,423 with HAC ≤ 17), DrugBank 5.011 (8,299 approved/experimental drugs with HAC ≤ 50; 2,284 with HAC ≤ 17), UNPD (20,302 natural products with HAC ≤ 17), and ZINC 12 (15M commercially available compounds).

MHFP6 Shingle Computation

Shingles were computed using the mhfp Python package (also on PyPI), specifically the shingling_from_smiles function from the MHFPEncoder class. Each shingle represents an extended-connectivity substructure around an atom with a diameter of up to 6 bonds, plus all ring structures, encoded as rooted SMILES strings.

Avalon Fingerprint Density

The avalon fingerprint density, used as the overall structural complexity filter (max 18), is defined as the number of on-bits in the avalon fingerprint scaled to the heavy atom count.

Reproducibility Details

Artifact	Type	License	Notes
GDBMedChem download	Dataset	Non-commercial (no patents, no redistribution)	10M molecules in SMILES format
GDB web tools	Other	Unknown	3D visualization, MQN/MHFP6 similarity search
`mhfp` Python package	Code	MIT	MHFP6 fingerprint and shingle computation
PCA visualization tools	Code	Unknown	MQN-to-3D PCA projection preprocessing

Status: Partially Reproducible. The dataset itself is publicly available for download, and the paper describes the filtering and sampling pipeline in detail (RDKit 2017_09_03, PySpark 2.3.2, 98-node cluster with 252 GB RAM). The mhfp package for shingle analysis is open-source. However, no standalone filtering/sampling code is released: reproducing the pipeline from scratch requires reimplementing the 16 SMARTS filters and 5 RDKit-based filters, plus the PySpark stratified sampling procedure. The molecule preprocessing step also depends on ChemAxon JChem (commercial) for pH 7.4 protonation and MQN calculation.

The paper is published in the closed-access journal Molecular Informatics. An open-access preprint is available on ChemRxiv.

Citation

@article{awale2019medicinal,
  title={Medicinal Chemistry Aware Database GDBMedChem},
  author={Awale, Mahendra and Sirockin, Finton and Stiefl, Nikolaus and Reymond, Jean-Louis},
  journal={Molecular Informatics},
  volume={38},
  number={8-9},
  pages={e1900031},
  year={2019},
  publisher={Wiley},
  doi={10.1002/minf.201900031}
}

FDB-17: Fragment Database (10M Molecules)

Sat, 11 Apr 2026 00:00:00 +0000

Key Contribution

FDB-17 is a curated subset of 10 million fragment-like molecules extracted from the 166.4 billion molecules in GDB-17. It corrects the combinatorial bias of exhaustive enumeration (which overwhelmingly produces large, complex molecules) by evenly sampling across molecular size, polarity, and stereochemical complexity. The result is a database sized for practical virtual screening tools while retaining GDB-17’s distinctive 3D molecular shape diversity.

Overview

GDB-17 exhaustively enumerates molecules up to 17 heavy atoms, but the combinatorial explosion means the database is dominated by the largest, most functionalized, and stereochemically most complex entries. This makes it impractical for most virtual screening workflows and poorly suited for identifying simple, synthetically accessible fragments. FDB-17 addresses both problems through a two-stage reduction.

Assembly Pipeline

Stage 1: Fragment-likeness filters (166.4B to 4.6B, 36x reduction)

Criteria limiting structural and functional group complexity:

Category	Constraints
Scaffolds	Max 3 rings, max 2 small (3/4-membered) rings, max 2 quaternary centers, max 4 stereocenters, max 3 rotatable bonds
FG density	Max 5 N+O atoms, max 1 positive/negative charge at neutral pH, max 3 HBA, max 2 HBD
Excluded groups	Aldehydes, epoxides, aziridines, carbonates, imidates, nitro groups, aromatic rings >6 atoms; at most 1 cyano group allowed
Removed elements	Non-aromatic C=C, C triple bonds, halogens (approximated by saturated C-C and methyl)
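
The numeric limits in the table translate into a simple threshold check. The sketch below assumes descriptors have already been computed into a dictionary with hypothetical key names; it covers only the count-based limits, not the excluded functional groups or the charge rule.

```python
# Upper limits taken from the table; key names are illustrative.
FRAGMENT_LIMITS = {
    "rings": 3,
    "small_rings": 2,
    "quaternary_centers": 2,
    "stereocenters": 4,
    "rotatable_bonds": 3,
    "n_plus_o": 5,
    "hba": 3,
    "hbd": 2,
    "cyano_groups": 1,
}


def violations(desc):
    """Return the names of all limits a molecule exceeds (missing keys count as 0)."""
    return [k for k, limit in FRAGMENT_LIMITS.items() if desc.get(k, 0) > limit]


def is_fragment_like(desc):
    return not violations(desc)


# Hypothetical descriptor sets, for illustration only.
rigid = {"rings": 2, "rotatable_bonds": 1, "n_plus_o": 2, "hba": 2, "hbd": 1}
floppy = {"rings": 1, "rotatable_bonds": 6, "n_plus_o": 2, "hba": 4, "hbd": 1}
print(is_fragment_like(rigid), violations(floppy))
```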

Stage 2: Even sampling (4.6B to 10M, 460x reduction)

The 4.6B fragment subset is binned into 175 cells defined by value triplets of (HAC, heteroatoms, stereocenters):

Dimension	Bin values
HAC	≤11, 12, 13, 14, 15, 16, 17 (7 bins)
Heteroatoms (N+O+S)	≤1, 2, 3, 4, ≥5 (5 bins)
Stereocenters	0, 1, 2, 3, 4 (5 bins)

Individual bins ranged from 3,359 to 446,322,188 molecules, reflecting the extreme combinatorial skew toward large, complex structures. Bins with ≤70,000 molecules are taken entirely; larger bins are randomly sampled to approximately 60,000 molecules each. The filtering was implemented in Java using ChemAxon’s JChem libraries and executed on a 500-node cluster in 10,000 CPU hours. The resulting even distribution across molecular size, polarity, and complexity replaces the exponentially skewed distribution of the parent database.
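
The binning and per-bin sampling rule can be sketched as follows. The clamping in `bin_key` is one way to realize the 7 x 5 x 5 grid from the table, and the thresholds are exposed as parameters so the toy example can use small numbers; none of this is the original Java code.

```python
import random


def bin_key(hac, heteroatoms, stereocenters):
    """Map a molecule to its (HAC, heteroatom, stereocenter) cell."""
    return (
        max(hac, 11),                  # HAC: <=11, 12, ..., 17
        min(max(heteroatoms, 1), 5),   # heteroatoms: <=1, 2, 3, 4, >=5
        min(stereocenters, 4),         # stereocenters: 0..4
    )


def sample_bin(members, keep_all_up_to=70_000, target=60_000, rng=random):
    """Keep small bins whole; randomly subsample larger ones to `target`."""
    if len(members) <= keep_all_up_to:
        return list(members)
    return rng.sample(members, target)


# All reachable cells for HAC 1..17, 0..8 heteroatoms and 0..4 stereocenters.
cells = {bin_key(h, x, s)
         for h in range(1, 18) for x in range(0, 9) for s in range(0, 5)}
print(len(cells))

# Toy thresholds so the behaviour is visible on short lists.
small = sample_bin(list(range(5)), keep_all_up_to=5, target=4)
large = sample_bin(list(range(10)), keep_all_up_to=5, target=4)
print(len(small), len(large))
```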

Property Profiles vs. Commercial Fragments

FDB-17 was compared against 40,986 commercial fragments collected from 8 vendors (AnalytiCon, ChemBridge, Enamine, FRAGMENTA, BIONET, LifeChemical, Maybridge, Vitas) and filtered by Congreve’s rule of three (mass ≤300, HBA ≤3, HBD ≤3, logP ≤3, RBC ≤3, PSA ≤60). Only 31% (12,847) of these commercial fragments appeared in the 4.6B fragment subset at all, due to functional groups absent from GDB-17 (halogens, thiols, azides, thioethers). After the random sampling step, only 2,740 remained in FDB-17, which is 6.7% of the commercial set (about 21% of the 12,847 found in the fragment subset).
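
The rule-of-three filter and the overlap percentages can be reproduced with a few lines. Descriptor keys and the example molecule are hypothetical; the thresholds and counts are the ones quoted above.

```python
# Rule-of-three upper limits as quoted in the text.
RULE_OF_THREE = {
    "mass": 300,
    "hba": 3,
    "hbd": 3,
    "logp": 3,
    "rotatable_bonds": 3,
    "psa": 60,
}


def passes_rule_of_three(desc):
    """True if every descriptor is at or below its limit."""
    return all(desc[key] <= limit for key, limit in RULE_OF_THREE.items())


# Hypothetical descriptor values, for illustration only.
fragment = {"mass": 171.2, "hba": 3, "hbd": 2, "logp": 1.2,
            "rotatable_bonds": 3, "psa": 59.0}
print(passes_rule_of_three(fragment))

# Overlap arithmetic from the counts in the text.
commercial, in_subset, in_fdb17 = 40_986, 12_847, 2_740
pct_in_subset = 100 * in_subset / commercial
pct_in_fdb17 = 100 * in_fdb17 / commercial
pct_of_subset_kept = 100 * in_fdb17 / in_subset
print(round(pct_in_subset), round(pct_in_fdb17, 1), round(pct_of_subset_kept))
```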

Key differences:

Size and polarity: FDB-17’s even sampling produces distributions comparable to commercial fragments, unlike the parent GDB-17 which peaks sharply at HAC = 17
Compound categories: Half are heteroaromatic in both sets, but FDB-17’s second half is predominantly heterocyclic vs. aromatic for commercial fragments
3D character: FDB-17 retains GDB-17’s coverage of the full PMI (principal moments of inertia) shape triangle, with a frequency peak at center-left (PMI computed from single low-energy CORINA conformers). Commercial fragments are predominantly planar. FDB-17 has significantly higher Fsp3 values
Ring count: Fragment subsets of GDB-17 are enriched in 2- and 3-ring molecules (a consequence of the rotatable bond limit, which constrains monocyclic molecules more than polycyclic ones)

Virtual Screening Validation

Nearest-neighbor searches were performed using two fingerprint spaces: MQN (42-dimensional molecular quantum numbers counting atoms, bonds, polarity, and topology) and Xfp (55-dimensional extended pharmacophore fingerprint capturing shape and pharmacophore features). Four fragment-like drugs were used as queries: fencamfamine, gabapentin, rimantadine, and levetiracetam. For each drug, 10,000 nearest neighbors were retrieved and scored by 3D-shape similarity using ROCS (Rapid Overlay of Chemical Structures). 3D conformers were generated with OMEGA (all possible stereoisomers, keeping the highest-scoring one). Molecules with ROCS Tanimoto Combo > 1.4 were considered virtual hits.
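
The two-step screen (fingerprint nearest neighbors, then a shape-score cutoff) can be sketched as below. City-block distance is an assumption of this sketch, since the text does not name the metric, and `shape_score` is a hypothetical stand-in for an external 3D scoring tool; the vectors and scores are toy values.

```python
import heapq


def city_block(a, b):
    """Manhattan distance between two fingerprint vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_neighbors(query, library, k):
    """Return the k (name, fingerprint) entries closest to the query."""
    return heapq.nsmallest(k, library, key=lambda item: city_block(query, item[1]))


def virtual_hits(candidates, shape_score, threshold=1.4):
    """Keep candidates whose shape score exceeds the threshold."""
    return [name for name, _ in candidates if shape_score(name) > threshold]


# Toy 3-dimensional fingerprints; real MQN vectors have 42 dimensions.
library = [("a", (1, 2, 3)), ("b", (2, 2, 3)), ("c", (9, 9, 9)), ("d", (1, 3, 4))]
neighbors = nearest_neighbors((1, 2, 3), library, k=3)

# Hypothetical shape scores standing in for an external scoring step.
scores = {"a": 1.9, "b": 1.2, "d": 1.5}
hits = virtual_hits(neighbors, scores.get)
print([name for name, _ in neighbors], hits)
```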

FDB-17 delivered comparable numbers of virtual hits to the full 4.6B fragment subset and the entire GDB-17, despite being 460x and 16,640x smaller respectively. Both close analogs (high substructure similarity, Tsfp > 0.7) and scaffold-hopping compounds (low substructure similarity but high shape similarity) were identified. Random sampling from FDB-17 and searches in the 41k commercial fragment set returned far fewer hits.

Strengths & Limitations

Strengths:

Manageable size (10M) compatible with docking and 3D-shape virtual screening tools
Even coverage of molecular size, polarity, and complexity avoids combinatorial bias
High 3D shape diversity compared to predominantly flat commercial fragment libraries
Available with interactive visualization (MQN/SMIfp-mapplet) and web-based nearest neighbor search

Limitations:

Only the 10M FDB-17 is released, not the 4.6B fragment-filtered intermediate. Practitioners who want a different sampling strategy or the full fragment subset cannot access it
Random sampling means specific molecules of interest from the 4.6B subset may be absent
Excludes halogens, non-aromatic unsaturations, and several functional group classes present in commercial fragments
Only 6.7% overlap with commercial fragments limits direct comparison
Still derived from GDB-17’s enumeration rules, so molecules outside those rules (e.g., containing metals or larger rings) are excluded

Reproducibility Details

FDB-17 is publicly available for download from the GDB project page as a single SMILES file (62.2 MB), hosted on Zenodo. Interactive visualization via the MQN/SMIfp-mapplet and web-based nearest neighbor search tools are also accessible through the same site. The multi-fingerprint browser supports nearest-neighbor search across six fingerprints: MQN (42D), SMIfp (34D), APfp (21D), Xfp (55D), Sfp (1024-bit Daylight-type), and ECfp4 (1024-bit circular). The filtering code was written in Java using JChem libraries (ChemAxon) and executed on a 500-node cluster in 10,000 CPU hours. The filtering code itself is not publicly released. Virtual screening additionally requires OMEGA (conformer generation) and ROCS (3D-shape scoring), both commercial tools from OpenEye.

Artifact	Type	License	Notes
FDB-17 SMILES	Dataset	Custom (no patents, no redistribution)	10M fragment-like molecules from GDB-17
MQN/SMIfp-mapplet	Other	Web tool	Interactive PCA visualization on 1000x1000 grids
Multi-fingerprint browser	Other	Web tool	Nearest neighbor search across 6 fingerprints (MQN, SMIfp, APfp, Xfp, Sfp, ECfp4)

Reproducibility status: Partially Reproducible. The 10M FDB-17 is freely downloadable, but the 4.6B fragment-filtered intermediate is not released. The filtering criteria are fully documented, but the Java filtering code is not released and depends on proprietary ChemAxon libraries. Reproducing the virtual screening experiments requires commercial tools (OMEGA, ROCS from OpenEye; CORINA for PMI analysis).

Citation

@article{visini2017fragment,
  title={Fragment Database FDB-17},
  author={Visini, Ricardo and Awale, Mahendra and Reymond, Jean-Louis},
  journal={Journal of Chemical Information and Modeling},
  volume={57},
  number={4},
  pages={700--709},
  year={2017},
  publisher={American Chemical Society},
  doi={10.1021/acs.jcim.7b00020}
}

ZINC-22: A Multi-Billion Scale Database for Ligand Discovery

Sat, 27 Sep 2025 00:00:00 +0000

Key Contribution: Scaling Make-on-Demand Libraries

ZINC-22 addresses the critical infrastructure challenges of managing multi-billion-scale libraries of make-on-demand chemical compounds through a federated database architecture, the CartBlanche web interface, and cloud distribution systems that enable modern virtual screening.

Overview

ZINC-22 is a multi-billion scale public database of commercially available chemical compounds designed for virtual screening. It contains over 37 billion make-on-demand molecules and uses a distributed infrastructure to work around the indexing limits of a single database. For structural biology pipelines, it provides 4.5 billion ready-to-dock 3D conformations alongside pre-calculated pH-specific protonation states, tautomers, and AMSOL partial charges.

Dataset Examples

ZINC-22’s 2D Tranche Browser showing the organization of 37.2 billion molecules by physicochemical properties

Dataset Subsets

Subset	Count	Description
2D Database	37B+	Complete 2D chemical structures from make-on-demand catalogs (Enamine REAL, Enamine REAL Space, WuXi GalaXi, Mcule Ultimate)
3D Database	4.5B+	Ready-to-dock 3D conformations with pre-calculated charges and solvation energies
Custom Tranches	Variable	User-selected molecular subsets via Tranche Browser (e.g., lead-like, fragment-like)

Use Cases

ZINC-22 is designed for ultra-large virtual screening (ULVS), analog searching, and molecular docking campaigns. The Tranche Browser enables targeted subset selection (e.g., lead-like, fragment-like) for screening, and the CartBlanche interface supports both interactive and programmatic access to the database. The authors note that as the database grows, docking can identify better-fitting molecules.

Dataset	Relationship	Link
ZINC-20	Predecessor
Enamine REAL	Source catalog
WuXi GalaXi	Source catalog

Strengths

Massive scale: 37+ billion purchasable compounds from major vendors (Enamine, WuXi, Mcule)
Federated architecture: Supports asynchronous building and horizontal scaling to trillion-molecule growth
Platform access: CartBlanche GUI provides a shopping cart metaphor for compound acquisition
Privacy protection: Dual public/private server clusters protect patentability of undisclosed catalogs
Chemical diversity: Scaffold count keeps growing with library size, at roughly one log unit of Bemis-Murcko scaffolds per two log units of molecules, with 96.3M+ unique scaffolds
Ready-to-dock: 3D models include pre-calculated charges, protonation states, and solvation energies
Cloud distribution: Available via AWS Open Data, Oracle OCI, and UCSF servers
Scale-aware search: SmallWorld (similarity) and Arthor (substructure) tools partitioned to address specific constraints of billion-scale queries
Organized access: Tranche system enables targeted selection of chemical space
Open access: Entire database freely available to academic and commercial users

Limitations

Data Transfer Bottlenecks: Distributing the 4.5 billion 3D models in a docking-ready format (such as db2 flexibase) requires roughly 1 petabyte of storage. Transferring this takes months over a standard gigabit connection, which effectively mandates cloud-side computation and makes local copies impractical.
Search Result Caps: Interactive Arthor searches are capped at 20,000 molecules to maintain a reliable public service. Users needing more results can use the asynchronous Arthor search tool via TLDR, which sends results by email.
Enumeration Ceiling: Scaling relies entirely on PostgreSQL sharding. To continue using rigid docking tools, the database must fully enumerate structural states. The authors acknowledge that hardware limitations will likely cap full database enumeration well before the 10-trillion molecule mark, forcing future pipelines to accommodate unenumerated combinatorial fragment spaces.
Download Workflow: Individual 3D molecule downloads are unavailable directly; researchers must rebuild them via the TLDR tool.
Vendor Updates: The federated structure makes it difficult to remove discontinued vendor molecules.
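As a rough check on the transfer-time figure in the first limitation, the arithmetic can be sketched as follows. This assumes a decimal petabyte (10^15 bytes) and a link running at its full nominal rate with no protocol overhead; both are simplifications, and the function name is illustrative rather than part of any ZINC-22 tooling.

```python
def transfer_days(size_bytes: float, link_bits_per_s: float) -> float:
    """Days needed to move size_bytes over a link at its full nominal rate."""
    seconds = size_bytes * 8 / link_bits_per_s
    return seconds / 86_400  # seconds per day

PETABYTE = 1e15  # assumed decimal petabyte
GIGABIT = 1e9    # 1 Gbit/s, idealized (no overhead, no contention)

days = transfer_days(PETABYTE, GIGABIT)
print(f"{days:.0f} days")  # about 93 days, i.e. roughly three months
```

Real-world throughput is usually well below the nominal rate, so this is better read as a lower bound than as an estimate.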

Technical Notes

Hardware & Software

Compute infrastructure:

1,700 cores across 14 computers for parallel processing
174 independent PostgreSQL 12.0 databases (110 ‘Sn’ for ZINC-ID, 64 ‘Sb’ for Supplier Codes)
Distributed across Amazon AWS, Oracle OCI, and UCSF servers

Software stack:

PostgreSQL 12.2
Python 3.6.8
RDKit 2020.03
Celery task queue with Redis for background processing
All code available on GitHub: docking-org/zinc22-2d, zinc22-3d

Data Organization & Access

Tranche system: Molecules organized into “Tranches” based on 4 dimensions:

Heavy Atom Count
Lipophilicity (LogP)
Charge
File Format

This enables downloading specific chemical neighborhoods (e.g., neutral lead-like molecules) without accessing the entire database.

Search infrastructure: At the billion-molecule scale, search indexes can exceed available RAM. ZINC-22 splits retrieval between two distinct search tools, SmallWorld and Arthor, both exposed through the CartBlanche web interface:

SmallWorld: Handles whole-molecule similarity using Graph Edit Distance (GED). GED defines the minimum cost of operations (node/edge insertions, deletions, or substitutions) required to transform graph $G_1$ into graph $G_2$:

$$ \text{GED}(G_1, G_2) = \min_{(e_1, \dots, e_k) \in \mathcal{P}(G_1, G_2)} \sum_{i=1}^k c(e_i) $$

Because SmallWorld searches pre-calculated anonymous graphs, it evaluates close neighbors in near $\mathcal{O}(1)$ time and scales sub-linearly, though it struggles with highly distant structural matches.
Arthor: Provides exact substructure and pattern matching. It scales linearly $\mathcal{O}(N)$ with database size and successfully finds distant hits (e.g., PAINS filters), but performance heavily degrades if the index exceeds available RAM.
CartBlanche: Web interface wrapping these search tools with shopping cart functionality.
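To make the GED definition concrete, here is a minimal brute-force sketch for tiny heavy-atom graphs. It assumes unit costs for every edit operation and tries every node assignment, so it is only viable for a handful of atoms. It is not how SmallWorld works internally (SmallWorld relies on a pre-calculated index); the function name and the toy molecules are illustrative.

```python
from itertools import permutations

def graph_edit_distance(labels1, edges1, labels2, edges2):
    """Brute-force GED with unit costs; only practical for very small graphs."""
    n = max(len(labels1), len(labels2))
    # Pad the smaller graph with None "dummy" nodes so both have n nodes.
    a = list(labels1) + [None] * (n - len(labels1))
    b = list(labels2) + [None] * (n - len(labels2))
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    best = None
    for perm in permutations(range(n)):
        # perm[i] is the node of graph 2 matched to node i of graph 1.
        # A label mismatch is a substitution; a match to None is an
        # insertion or deletion. Each costs 1.
        cost = sum(1 for i in range(n) if a[i] != b[perm[i]])
        # Edges present in only one graph (after mapping) cost 1 each.
        mapped = {frozenset(perm[i] for i in e) for e in e1}
        cost += len(mapped ^ e2)
        if best is None or cost < best:
            best = cost
    return best

# Heavy-atom graphs with implicit hydrogens.
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propane = (["C", "C", "C"], [(0, 1), (1, 2)])
methanol = (["C", "O"], [(0, 1)])

print(graph_edit_distance(*ethanol, *propane))   # 1: substitute O with C
print(graph_edit_distance(*ethanol, *methanol))  # 2: delete one C and one bond
```

The factorial cost of this enumeration is one way to see why a billion-scale service has to precompute its search structure instead of computing distances on demand.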

3D Generation Pipeline

The 3D database construction pipeline involves multiple specialized tools:

ChemAxon JChem: Protonation state and tautomer generation at physiological pH
Corina: Initial 3D structure generation
Omega: Conformation sampling
AMSOL 7.1: Calculation of atomic partial charges and desolvation energies
Strain calculation: Relative energies of conformations

At sustained throughput, the pipeline builds approximately 11 million molecules per day, each with hundreds of pre-calculated conformations.

Chemical Diversity Analysis

A core debate in billion-scale library generation is whether continued enumeration merely yields repetitive derivatives. Analysis of Bemis-Murcko (BM) scaffolds shows that chemical diversity in ZINC-22 continues to grow, but sub-linearly, following a power law. Specifically, the authors observe a one-log-unit increase in BM scaffolds for every two-log-unit increase in database size, up to a constant offset $c$:

$$ \log(\text{Scaffolds}_{BM}) \approx 0.5 \log(\text{Molecules}) + c $$

This suggests that while diversity does not saturate, it grows proportionally to the square root of the library size ($\mathcal{O}(\sqrt{N})$). The majority of this scaffold novelty stems from compounds with the highest heavy atom counts (HAC 24-25), which contribute roughly twice as many unique core structures as the combined HAC 06-23 subset.
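A short sketch of what the square-root relationship implies in practice. The exponent of 0.5 comes from the text above; the function name is illustrative, and the relationship is treated as exact even though it is an empirical fit.

```python
def scaffold_growth_factor(molecule_growth_factor: float, exponent: float = 0.5) -> float:
    """Factor by which scaffold count grows when the library grows by the given factor."""
    return molecule_growth_factor ** exponent

# Growing the library 100-fold (two log units) yields about 10-fold more
# scaffolds (one log unit).
print(scaffold_growth_factor(100))  # 10.0
```

Put differently, under this fit each doubling of scaffold diversity requires roughly a four-fold larger library.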

Vendor Integration

ZINC-22 is built from five source catalogs with the following approximate sizes:

Enamine REAL Database: 5 billion compounds
Enamine REAL Space: 29 billion compounds
WuXi GalaXi: 2.5 billion compounds
Mcule Ultimate: 128 million compounds
ZINC20 in-stock: 4 million compounds (incorporated as layer “g”)

This focus on purchasable, make-on-demand molecules distinguishes ZINC-22 from theoretical chemical space databases. ZINC20 continues to be maintained separately for smaller catalogs and in-stock compounds.

Reproducibility Details

Artifact	Type	License	Notes
CartBlanche web interface	Dataset	Free access	Web GUI for searching and downloading ZINC-22
docking-org/zinc22-2d	Code	BSD-3-Clause	2D curation and loading pipeline
docking-org/zinc22-3d	Code	Unknown	3D building pipeline
docking-org/cartblanche22	Code	Unknown	CartBlanche22 web application
AWS Open Data / Oracle OCI	Dataset	Free access	Cloud-hosted 3D database mirrors

Data Availability: The compiled database is openly accessible and searchable through the CartBlanche web interface. Subsets can be downloaded, and programmatic access is provided via curl, wget, and Globus.
Code & Algorithms: The source code for database construction, parallel processing, and querying is open-source.
- 2D Pipeline: docking-org/zinc22-2d
- 3D Pipeline: docking-org/zinc22-3d
- CartBlanche: docking-org/cartblanche22
- TLDR modules: docking-org/TLDR and docking-org/tldr-modules (repositories no longer available)
Software Dependencies: While the orchestration code is public, the 3D structure generation relies on commercial software that requires separate licenses (CORINA, OpenEye OMEGA, ChemAxon JChem). This limits end-to-end reproducibility for researchers without access to these tools.
Hardware Limitations: Recreating the entire 37+ billion molecule database from raw vendor catalogs requires approximately 1,700 CPU cores and petabytes of data transfer, restricting full recreation to large institutional clusters or substantial cloud compute budgets.

Paper Information

Tingle, B. I., Tang, K. G., Castanon, M., Gutierrez, J. J., Khurelbaatar, M., Dandarchuluun, C., Moroz, Y. S., and Irwin, J. J. (2023). ZINC-22: A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery. Journal of Chemical Information and Modeling, 63(4), 1166–1176. https://doi.org/10.1021/acs.jcim.2c01253

@article{Tingle_2023,
    title={ZINC-22: A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery},
    volume={63},
    ISSN={1549-960X},
    url={http://dx.doi.org/10.1021/acs.jcim.2c01253},
    DOI={10.1021/acs.jcim.2c01253},
    number={4},
    journal={Journal of Chemical Information and Modeling},
    publisher={American Chemical Society (ACS)},
    author={Tingle, Benjamin I. and Tang, Khanh G. and Castanon, Mar and Gutierrez, John J. and Khurelbaatar, Munkhzul and Dandarchuluun, Chinzorig and Moroz, Yurii S. and Irwin, John J.},
    year={2023},
    month={Feb},
    pages={1166--1176}
}

MARCEL: Molecular Conformer Ensemble Learning Benchmark

Mon, 08 Sep 2025 00:00:00 +0000

Key Contribution

MARCEL provides a benchmark for conformer ensemble learning. It demonstrates that explicitly modeling full conformer distributions improves property prediction across drug-like molecules and organometallic catalysts.

Overview

The Molecular Representation and Conformer Ensemble Learning (MARCEL) dataset provides 722K+ conformations across 76K+ molecules spanning four diverse chemical domains: drug-like molecules (Drugs-75K), organophosphorus ligands (Kraken), chiral catalysts (EE), and organometallic complexes (BDE). MARCEL evaluates conformer ensemble methods across both pharmaceutical and catalysis applications.

Dataset Examples

Example conformer from Drugs-75K (SMILES: COC(=O)[C@@]1(Cc2ccc(OC)cc2)[C@H]2c3cc(C(=O)N(C)C)n(Cc4ccc(OC(F)(F)F)cc4)c3C[C@H]2CN1C(=O)c1ccccc1; IUPAC: methyl (2R,3R,6R)-4-benzoyl-10-(dimethylcarbamoyl)-3-[(4-methoxyphenyl)methyl]-9-[[4-(trifluoromethoxy)phenyl]methyl]-4,9-diazatricyclo[6.3.0.02,6]undeca-1(8),10-diene-3-carboxylate)

2D structure of Drugs-75K conformer above

Example conformer from Kraken (ligand 10, conformer 0) in 2D

Example conformer from Kraken (ligand 10, conformer 0) in 3D

Example substrate from BDE in 3D (Pt_9.63)

2D structure of BDE substrate above

Dataset Subsets

Subset	Count	Description
Drugs-75K	75,099 molecules	Drug-like molecules with at least 5 rotatable bonds
Kraken	1,552 molecules	Monodentate organophosphorus (III) ligands
EE	872 reactions	Rhodium (Rh)-bound atropisomeric catalyst-substrate pairs derived from chiral bisphosphine
BDE	5,915 reactions	Organometallic catalysts ML$_1$L$_2$ with electronic binding energies

Benchmarks

Ionization Potential (Drugs-75K)

Predict ionization potential from molecular structure

Subset: Drugs-75K

Rank	Model	MAE (eV)
🥇 1	Ensemble - GemNet GemNet on full conformer ensemble	0.4066
🥈 2	3D - GemNet Geometry-enhanced message passing (single conformer)	0.4069
🥉 3	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	0.4126
4	Ensemble - LEFTNet LEFTNet on full conformer ensemble	0.4149
5	3D - LEFTNet Local Environment Feature Transformer (single conformer)	0.4174
6	Ensemble - ClofNet ClofNet on full conformer ensemble	0.428
7	2D - GraphGPS Graph Transformer with positional encodings	0.4351
8	2D - GIN Graph Isomorphism Network	0.4354
9	2D - GIN+VN GIN with Virtual Nodes	0.4361
10	3D - ClofNet SE(3)-equivariant network with complete local frames (single conformer)	0.4393
11	3D - SchNet Continuous-filter convolutional network (single conformer)	0.4394
12	3D - DimeNet++ Directional message passing network (single conformer)	0.4441
13	Ensemble - SchNet SchNet on full conformer ensemble	0.4452
14	Ensemble - PaiNN PaiNN on full conformer ensemble	0.4466
15	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	0.4505
16	2D - ChemProp Message Passing Neural Network	0.4595
17	1D - LSTM LSTM on SMILES sequences	0.4788
18	1D - Random forest Random Forest on Morgan fingerprints	0.4987
19	1D - Transformer Transformer on SMILES sequences	0.6617

Electron Affinity (Drugs-75K)

Predict electron affinity from molecular structure

Subset: Drugs-75K

Rank	Model	MAE (eV)
🥇 1	Ensemble - GemNet GemNet on full conformer ensemble	0.391
🥈 2	3D - GemNet Geometry-enhanced message passing (single conformer)	0.3922
🥉 3	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	0.3944
4	Ensemble - LEFTNet LEFTNet on full conformer ensemble	0.3953
5	3D - LEFTNet Local Environment Feature Transformer (single conformer)	0.3964
6	Ensemble - ClofNet ClofNet on full conformer ensemble	0.4033
7	2D - GraphGPS Graph Transformer with positional encodings	0.4085
8	2D - GIN Graph Isomorphism Network	0.4169
9	2D - GIN+VN GIN with Virtual Nodes	0.4169
10	3D - SchNet Continuous-filter convolutional network (single conformer)	0.4207
11	Ensemble - SchNet SchNet on full conformer ensemble	0.4232
12	3D - DimeNet++ Directional message passing network (single conformer)	0.4233
13	3D - ClofNet SE(3)-equivariant network with complete local frames (single conformer)	0.4251
14	Ensemble - PaiNN PaiNN on full conformer ensemble	0.4269
15	2D - ChemProp Message Passing Neural Network	0.4417
16	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	0.4495
17	1D - LSTM LSTM on SMILES sequences	0.4648
18	1D - Random forest Random Forest on Morgan fingerprints	0.4747
19	1D - Transformer Transformer on SMILES sequences	0.585

Electronegativity (Drugs-75K)

Predict electronegativity (χ) from molecular structure

Subset: Drugs-75K

Rank	Model	MAE (eV)
🥇 1	3D - GemNet Geometry-enhanced message passing (single conformer)	0.197
🥈 2	Ensemble - GemNet GemNet on full conformer ensemble	0.2027
🥉 3	Ensemble - LEFTNet LEFTNet on full conformer ensemble	0.2069
4	3D - LEFTNet Local Environment Feature Transformer (single conformer)	0.2083
5	Ensemble - ClofNet ClofNet on full conformer ensemble	0.2199
6	2D - GraphGPS Graph Transformer with positional encodings	0.2212
7	3D - SchNet Continuous-filter convolutional network (single conformer)	0.2243
8	Ensemble - SchNet SchNet on full conformer ensemble	0.2243
9	2D - GIN Graph Isomorphism Network	0.226
10	2D - GIN+VN GIN with Virtual Nodes	0.2267
11	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	0.2267
12	Ensemble - PaiNN PaiNN on full conformer ensemble	0.2294
13	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	0.2324
14	3D - ClofNet SE(3)-equivariant network with complete local frames (single conformer)	0.2378
15	3D - DimeNet++ Directional message passing network (single conformer)	0.2436
16	2D - ChemProp Message Passing Neural Network	0.2441
17	1D - LSTM LSTM on SMILES sequences	0.2505
18	1D - Random forest Random Forest on Morgan fingerprints	0.2732
19	1D - Transformer Transformer on SMILES sequences	0.4073

B₅ Sterimol Parameter (Kraken)

Predict B₅ sterimol descriptor for organophosphorus ligands

Subset: Kraken

Rank	Model	MAE
🥇 1	Ensemble - PaiNN PaiNN on full conformer ensemble	0.2225
🥈 2	Ensemble - GemNet GemNet on full conformer ensemble	0.2313
🥉 3	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	0.263
4	Ensemble - LEFTNet LEFTNet on full conformer ensemble	0.2644
5	Ensemble - SchNet SchNet on full conformer ensemble	0.2704
6	3D - GemNet Geometry-enhanced message passing (single conformer)	0.2789
7	3D - LEFTNet Local Environment Feature Transformer (single conformer)	0.3072
8	2D - GIN Graph Isomorphism Network	0.3128
9	Ensemble - ClofNet ClofNet on full conformer ensemble	0.3228
10	3D - SchNet Continuous-filter convolutional network (single conformer)	0.3293
11	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	0.3443
12	2D - GraphGPS Graph Transformer with positional encodings	0.345
13	3D - DimeNet++ Directional message passing network (single conformer)	0.351
14	2D - GIN+VN GIN with Virtual Nodes	0.3567
15	1D - Random forest Random Forest on Morgan fingerprints	0.476
16	2D - ChemProp Message Passing Neural Network	0.485
17	3D - ClofNet SE(3)-equivariant network with complete local frames (single conformer)	0.4873
18	1D - LSTM LSTM on SMILES sequences	0.4879
19	1D - Transformer Transformer on SMILES sequences	0.9611

L Sterimol Parameter (Kraken)

Predict L sterimol descriptor for organophosphorus ligands

Subset: Kraken

Rank	Model	MAE
🥇 1	Ensemble - GemNet GemNet on full conformer ensemble	0.3386
🥈 2	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	0.3468
🥉 3	Ensemble - PaiNN PaiNN on full conformer ensemble	0.3619
4	Ensemble - LEFTNet LEFTNet on full conformer ensemble	0.3643
5	3D - GemNet Geometry-enhanced message passing (single conformer)	0.3754
6	2D - GIN Graph Isomorphism Network	0.4003
7	3D - DimeNet++ Directional message passing network (single conformer)	0.4174
8	1D - Random forest Random Forest on Morgan fingerprints	0.4303
9	Ensemble - SchNet SchNet on full conformer ensemble	0.4322
10	2D - GIN+VN GIN with Virtual Nodes	0.4344
11	2D - GraphGPS Graph Transformer with positional encodings	0.4363
12	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	0.4471
13	Ensemble - ClofNet ClofNet on full conformer ensemble	0.4485
14	3D - LEFTNet Local Environment Feature Transformer (single conformer)	0.4493
15	1D - LSTM LSTM on SMILES sequences	0.5142
16	2D - ChemProp Message Passing Neural Network	0.5452
17	3D - SchNet Continuous-filter convolutional network (single conformer)	0.5458
18	3D - ClofNet SE(3)-equivariant network with complete local frames (single conformer)	0.6417
19	1D - Transformer Transformer on SMILES sequences	0.8389

Buried B₅ Parameter (Kraken)

Predict buried B₅ sterimol descriptor for organophosphorus ligands

Subset: Kraken

Rank	Model	MAE
🥇 1	Ensemble - GemNet GemNet on full conformer ensemble	0.1589
🥈 2	Ensemble - PaiNN PaiNN on full conformer ensemble	0.1693
🥉 3	2D - GIN Graph Isomorphism Network	0.1719
4	3D - GemNet Geometry-enhanced message passing (single conformer)	0.1782
5	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	0.1783
6	Ensemble - LEFTNet LEFTNet on full conformer ensemble	0.2017
7	Ensemble - SchNet SchNet on full conformer ensemble	0.2024
8	2D - GraphGPS Graph Transformer with positional encodings	0.2066
9	3D - DimeNet++ Directional message passing network (single conformer)	0.2097
10	3D - LEFTNet Local Environment Feature Transformer (single conformer)	0.2176
11	Ensemble - ClofNet ClofNet on full conformer ensemble	0.2178
12	3D - SchNet Continuous-filter convolutional network (single conformer)	0.2295
13	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	0.2395
14	2D - GIN+VN GIN with Virtual Nodes	0.2422
15	1D - Random forest Random Forest on Morgan fingerprints	0.2758
16	1D - LSTM LSTM on SMILES sequences	0.2813
17	3D - ClofNet SE(3)-equivariant network with complete local frames (single conformer)	0.2884
18	2D - ChemProp Message Passing Neural Network	0.3002
19	1D - Transformer Transformer on SMILES sequences	0.4929

Buried L Parameter (Kraken)

Predict buried L sterimol descriptor for organophosphorus ligands

Subset: Kraken

Rank	Model	MAE
🥇 1	Ensemble - GemNet GemNet on full conformer ensemble	0.0947
🥈 2	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	0.1185
🥉 3	2D - GIN Graph Isomorphism Network	0.12
4	Ensemble - PaiNN PaiNN on full conformer ensemble	0.1324
5	Ensemble - LEFTNet LEFTNet on full conformer ensemble	0.1386
6	Ensemble - SchNet SchNet on full conformer ensemble	0.1443
7	3D - LEFTNet Local Environment Feature Transformer (single conformer)	0.1486
8	2D - GraphGPS Graph Transformer with positional encodings	0.15
9	1D - Random forest Random Forest on Morgan fingerprints	0.1521
10	3D - DimeNet++ Directional message passing network (single conformer)	0.1526
11	Ensemble - ClofNet ClofNet on full conformer ensemble	0.1548
12	3D - GemNet Geometry-enhanced message passing (single conformer)	0.1635
13	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	0.1673
14	2D - GIN+VN GIN with Virtual Nodes	0.1741
15	3D - SchNet Continuous-filter convolutional network (single conformer)	0.1861
16	1D - LSTM LSTM on SMILES sequences	0.1924
17	2D - ChemProp Message Passing Neural Network	0.1948
18	3D - ClofNet SE(3)-equivariant network with complete local frames (single conformer)	0.2529
19	1D - Transformer Transformer on SMILES sequences	0.2781

Enantioselectivity (EE)

Predict enantiomeric excess for Rh-catalyzed asymmetric reactions

Subset: EE

Rank	Model	MAE (%)
🥇 1	Ensemble - GemNet GemNet on full conformer ensemble	11.61
🥈 2	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	12.03
🥉 3	Ensemble - PaiNN PaiNN on full conformer ensemble	13.56
4	Ensemble - ClofNet ClofNet on full conformer ensemble	13.96
5	Ensemble - SchNet SchNet on full conformer ensemble	14.22
6	3D - DimeNet++ Directional message passing network (single conformer)	14.64
7	3D - SchNet Continuous-filter convolutional network (single conformer)	17.74
8	3D - GemNet Geometry-enhanced message passing (single conformer)	18.03
9	Ensemble - LEFTNet LEFTNet on full conformer ensemble	18.42
10	3D - LEFTNet Local Environment Feature Transformer (single conformer)	19.8
11	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	20.24
12	3D - ClofNet SE(3)-equivariant network with complete local frames (single conformer)	33.95
13	2D - ChemProp Message Passing Neural Network	61.03
14	1D - Random forest Random Forest on Morgan fingerprints	61.3
15	2D - GraphGPS Graph Transformer with positional encodings	61.63
16	1D - Transformer Transformer on SMILES sequences	62.08
17	2D - GIN Graph Isomorphism Network	62.31
18	2D - GIN+VN GIN with Virtual Nodes	62.38
19	1D - LSTM LSTM on SMILES sequences	64.01

Bond Dissociation Energy (BDE)

Predict metal-ligand bond dissociation energy for organometallic catalysts

Subset: BDE

Rank	Model	MAE (kcal/mol)
🥇 1	3D - DimeNet++ Directional message passing network (single conformer)	1.45
🥈 2	Ensemble - DimeNet++ DimeNet++ on full conformer ensemble	1.47
🥉 3	3D - LEFTNet Local Environment Feature Transformer (single conformer)	1.53
4	Ensemble - LEFTNet LEFTNet on full conformer ensemble	1.53
5	Ensemble - GemNet GemNet on full conformer ensemble	1.61
6	3D - GemNet Geometry-enhanced message passing (single conformer)	1.65
7	Ensemble - PaiNN PaiNN on full conformer ensemble	1.87
8	Ensemble - SchNet SchNet on full conformer ensemble	1.97
9	Ensemble - ClofNet ClofNet on full conformer ensemble	2.01
10	3D - PaiNN Polarizable Atom Interaction Network (single conformer)	2.13
11	2D - GraphGPS Graph Transformer with positional encodings	2.48
12	3D - SchNet Continuous-filter convolutional network (single conformer)	2.55
13	3D - ClofNet SE(3)-equivariant network with complete local frames (single conformer)	2.61
14	2D - GIN Graph Isomorphism Network	2.64
15	2D - ChemProp Message Passing Neural Network	2.66
16	2D - GIN+VN GIN with Virtual Nodes	2.74
17	1D - LSTM LSTM on SMILES sequences	2.83
18	1D - Random forest Random Forest on Morgan fingerprints	3.03
19	1D - Transformer Transformer on SMILES sequences	10.08

Dataset	Relationship	Link
GEOM	Source	Notes

Strengths

Domain diversity: Beyond drug-like molecules, includes organometallics and catalysts rarely covered in existing benchmarks
Ensemble-based: Provides full conformer ensembles with statistical weights
DFT-quality energies: Drugs-75K features DFT-level conformers and energies (higher accuracy than GEOM-Drugs)
Realistic scenarios: BDE subset models the practical constraint of lacking DFT-computed conformers for large catalyst systems
Comprehensive baselines: Benchmarks 18 models across 1D (SMILES), 2D (graph), 3D (single conformer), and ensemble methods
Property diversity: Covers ionization potential, electron affinity, electronegativity, ligand descriptors, and catalytic properties

Limitations

Regression only: All tasks evaluate regression metrics exclusively
Chemical space coverage: The 76K molecules cover only a small fraction of the drug-like and catalyst chemical spaces
Compute requirements: Working with large conformer ensembles demands significant computational resources
Proprietary data: EE subset is proprietary (as of December 2025)
DFT bottleneck: BDE demonstrates a practical limitation: single DFT optimization can take 2-3 days, making conformer-level DFT infeasible for large organometallics
Uniform sampling baseline: The data augmentation strategy initially tested for handling ensembles samples conformers uniformly rather than by Boltzmann weight. This physically unmotivated choice likely explains why the strategy occasionally introduces noise and fails to help complex 3D architectures.
Drugs-75K properties: The large-scale benchmark (Drugs-75K) specifically targets electronic properties (Ionization Potential, Electron Affinity, Electronegativity). As the authors explicitly highlight in Section 5.2, these properties are generally less sensitive to conformational rotations than steric or spatial interactions are. This makes it hard to judge whether explicit conformer ensembles actually benefit large-scale regression tasks.
Unrealistic single-conformer baselines: The 3D single-conformer models are exclusively evaluated on the lowest-energy conformer. This setup is inherently flawed for real-world application, as knowing the global minimum a priori requires exhaustively searching and computing energies for the entire conformer space.

Technical Notes

Data Generation Pipeline

Drugs-75K

Source: GEOM-Drugs subset

Filtering:

Minimum 5 rotatable bonds (focus on flexible molecules)
Allowed elements: H, C, N, O, F, Si, P, S, Cl

Conformer generation:

DFT-level calculations for both conformers and energies
Higher accuracy than original GEOM-Drugs (semi-empirical GFN2-xTB)

Properties: Ionization Potential (IP), Electron Affinity (EA), Electronegativity (χ)

Kraken

Source: Original Kraken dataset (1,552 monodentate organophosphorus(III) ligands)

Properties: 4 of 78 available properties (selected for high variance across conformer ensembles)

$B_5$: Sterimol B5, maximum width of substituent (steric descriptor)
$L$: Sterimol L, length of substituent (steric descriptor)
$\text{Bur}B_5$: Buried Sterimol B5, steric effects within the first coordination sphere
$\text{Bur}L$: Buried Sterimol L, steric effects within the first coordination sphere

EE (Enantiomeric Excess)

Generation method: Q2MM (Quantum-guided Molecular Mechanics)

Reactions: 872 catalyst-substrate pairs involving 253 Rhodium (Rh)-bound atropisomeric catalysts from chiral bisphosphine with 10 enamide substrates

Property: Enantiomeric excess (EE) for asymmetric catalysis

Availability: Proprietary-only (closed-source as of December 2025)

BDE (Bond Dissociation Energy)

Molecules: 5,915 organometallic catalysts (ML₁L₂ structure)

Initial conformers: OpenBabel with geometric optimization

Energies: DFT calculations

Property: Electronic binding energy (difference in minimum energies of bound-catalyst complex and unbound catalyst)

Key constraint: DFT optimization for full conformer ensembles computationally infeasible (2-3 days per molecule)

Benchmark Setup

Task: Predict molecular properties from structure using different representation strategies (1D/2D/3D/Ensemble). The ground-truth regression targets are calculated as the Boltzmann-averaged value of the property across the conformer ensemble:

$$ \langle y \rangle_{k_B} = \sum_{\mathbf{C}_i \in \mathcal{C}} p_i y_i $$

Where $p_i$ is the conformer probability (Boltzmann weight) under experimental conditions derived from the conformer energy $e_i$:

$$ p_i = \frac{\exp(-e_i / k_B T)}{\sum_j \exp(-e_j / k_B T)} $$
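The two equations above can be sketched in a few lines. This assumes conformer energies in kcal/mol and a temperature of 298.15 K; the source only says "experimental conditions", so both the units and the default temperature are assumptions, as are the function names.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K); assumes energies in kcal/mol

def boltzmann_weights(energies, temperature=298.15):
    """Conformer probabilities p_i from conformer energies e_i."""
    kt = K_B * temperature
    e_min = min(energies)  # shifting by the minimum leaves p_i unchanged but avoids overflow
    factors = [math.exp(-(e - e_min) / kt) for e in energies]
    z = sum(factors)
    return [f / z for f in factors]

def boltzmann_average(values, energies, temperature=298.15):
    """Ensemble target <y> = sum_i p_i * y_i."""
    weights = boltzmann_weights(energies, temperature)
    return sum(w * y for w, y in zip(weights, values))

# Two conformers 1 kcal/mol apart: the lower-energy one dominates the average.
print(boltzmann_weights([0.0, 1.0]))
print(boltzmann_average([1.0, 3.0], [0.0, 1.0]))
```

Because only energy differences enter the weights, subtracting the minimum energy is a purely numerical convenience.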

Data splits: Datasets are partitioned 70% train, 10% validation, and 20% test.
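A minimal sketch of a 70/10/20 partition. It assumes a seeded random split over molecule indices; the source gives only the proportions, so the shuffling, the seed, and the function name are illustrative rather than a description of the benchmark's own split files.

```python
import random

def split_indices(n_items, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle indices and cut them into train/validation/test partitions."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = int(fractions[0] * n_items)
    n_valid = int(fractions[1] * n_items)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # remainder, so nothing is dropped by rounding
    return train, valid, test

# Example with the Kraken subset size (1,552 molecules).
train, valid, test = split_indices(1552)
print(len(train), len(valid), len(test))  # 1086 155 311
```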

Model categories:

1D Models: SMILES-based (Random Forest on concatenated MACCS/ECFP/RDKit fingerprints, LSTM, Transformer).
2D Models: Graph-based (GIN, GIN+VN, ChemProp, GraphGPS).
3D Models: Single conformer (SchNet, DimeNet++, GemNet, PaiNN, ClofNet, LEFTNet). For evaluation, single-conformer 3D models take only the lowest-energy conformer as input. This baseline often performs strongly but is unrealistic in practice, since identifying the global minimum requires exhaustively searching the conformer space.
Ensemble Models: Full conformer ensemble processing via explicit set encoders. For each conformer embedding $\mathbf{z}_i$, three aggregation strategies are evaluated:

Mean Pooling: $$ \mathbf{s}_{\text{MEAN}} = \frac{1}{|\mathcal{C}|} \sum_{i=1}^{|\mathcal{C}|} \mathbf{z}_i $$

DeepSets: $$ \mathbf{s}_{\text{DS}} = g\left(\sum_{i=1}^{|\mathcal{C}|} h(\mathbf{z}_i)\right) $$

Self-Attention: $$ \begin{aligned} \mathbf{s}_{\text{ATT}} &= \sum_{i=1}^{|\mathcal{C}|} \mathbf{c}_i, \quad \text{where} \quad \mathbf{c}_i = g\left( \sum_{j=1}^{|\mathcal{C}|} \alpha_{ij} h(\mathbf{z}_j) \right) \\ \alpha_{ij} &= \frac{\exp\left((\mathbf{W} h(\mathbf{z}_i))^\top (\mathbf{W} h(\mathbf{z}_j))\right)}{\sum_{k=1}^{|\mathcal{C}|} \exp\left((\mathbf{W} h(\mathbf{z}_i))^\top (\mathbf{W} h(\mathbf{z}_k))\right)} \end{aligned} $$
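The three aggregation formulas can be sketched in plain Python on lists of floats. In the benchmark, $h$ and $g$ are learned networks and $\mathbf{W}$ is a learned projection; here they are passed in as ordinary callables and a nested list, which is an assumption made to keep the example dependency-free.

```python
import math

def mean_pool(zs):
    """s_MEAN: element-wise mean of the conformer embeddings."""
    return [sum(col) / len(zs) for col in zip(*zs)]

def deep_sets(zs, h, g):
    """s_DS = g(sum_i h(z_i))."""
    summed = [sum(col) for col in zip(*(h(z) for z in zs))]
    return g(summed)

def self_attention_pool(zs, h, g, W):
    """s_ATT = sum_i g(sum_j alpha_ij h(z_j)), alpha_ij = softmax_j((W h_i)^T (W h_j))."""
    hs = [h(z) for z in zs]
    proj = [[sum(w * x for w, x in zip(row, hv)) for row in W] for hv in hs]
    pooled = None
    for p_i in proj:
        scores = [sum(a * b for a, b in zip(p_i, p_j)) for p_j in proj]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        context = [sum(a * hv[d] for a, hv in zip(alphas, hs)) for d in range(len(hs[0]))]
        c_i = g(context)
        pooled = c_i if pooled is None else [p + c for p, c in zip(pooled, c_i)]
    return pooled

identity = lambda v: list(v)  # stand-in for the learned h and g
zs = [[1.0, 2.0], [3.0, 4.0]]
print(mean_pool(zs))                      # [2.0, 3.0]
print(deep_sets(zs, identity, identity))  # [4.0, 6.0]
```

All three are invariant to the order in which conformers are listed, which is the property a set encoder needs.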

Evaluation metric: Mean Absolute Error (MAE) for all tasks.

Key Findings

Ensemble superiority (task-dependent): Across benchmarks, explicitly modeling the full conformer set using DeepSets often achieved top performance. However, these improvements are not uniform:

Small-Scale Success: Ensemble methods show large improvements on tasks like Kraken (Ensemble PaiNN achieves 0.2225 on $B_5$ vs 0.3443 single) and EE (Ensemble GemNet achieves 11.61% vs 18.03% single).
Large-Scale Plateau: The performance improvements did not strongly transfer to large subsets like Drugs-75K (best ensemble strategy for GemNet achieves 0.4066 eV on IP vs 0.4069 eV single). The authors conjecture that the computational burden of encoding all conformers in each ensemble alters learning dynamics and increases training difficulty.
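The size of these gaps is easier to compare as relative MAE reductions. The sketch below uses only the numbers quoted above; the function name and dictionary labels are illustrative.

```python
def relative_improvement(single_mae, ensemble_mae):
    """Fractional MAE reduction of the ensemble model over its single-conformer counterpart."""
    return (single_mae - ensemble_mae) / single_mae

# (single-conformer MAE, ensemble MAE) as reported in the leaderboards.
results = {
    "Kraken B5 (PaiNN)": (0.3443, 0.2225),
    "EE (GemNet)": (18.03, 11.61),
    "Drugs-75K IP (GemNet)": (0.4069, 0.4066),
}

for task, (single, ensemble) in results.items():
    print(f"{task}: {100 * relative_improvement(single, ensemble):.1f}% lower MAE")
```

The two small-scale tasks show reductions of roughly 35%, while the Drugs-75K gap is under 0.1%.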

Conformer Sampling for Noise: Data augmentation (randomly sampling one conformer from an ensemble during training) improves performance and robustness when underlying conformers are imprecise (e.g., the forcefield-generated conformers in the BDE subset).
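A minimal sketch of the sampling step, assuming each training example carries a list of conformers. The uniform variant mirrors the augmentation described above; the Boltzmann-weighted variant is a hypothetical alternative shown only for contrast. Function names are illustrative.

```python
import random

def sample_uniform(conformers, rng):
    """Pick one conformer uniformly at random, as in the augmentation described above."""
    return rng.choice(conformers)

def sample_boltzmann(conformers, weights, rng):
    """Hypothetical alternative: pick one conformer in proportion to its Boltzmann weight."""
    return rng.choices(conformers, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so repeated runs are comparable
ensemble = ["conf_0", "conf_1", "conf_2"]

# Each training epoch sees one randomly chosen conformer per molecule.
for epoch in range(3):
    print(epoch, sample_uniform(ensemble, rng))
```

Resampling every epoch exposes the model to the whole ensemble over time while keeping the per-step cost equal to that of a single-conformer model.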

3D vs 2D: 3D models generally outperform 2D graph models, especially for conformationally-sensitive properties, though 1D and 2D methods remain highly competitive on low-resource datasets or less rotation-sensitive properties.

Model architecture: No single model dominates all tasks. GemNet and LEFTNet excel on large-scale Drugs-75K, while DimeNet++ shows strong performance on smaller Kraken and reaction datasets. Model selection depends on dataset size and task characteristics.

Reproducibility Details

Artifact	Type	License	Notes
SXKDZ/MARCEL	Code + Dataset	Apache-2.0	Benchmark suite, dataset loaders, and hyperparameter configs
Drugs-75K	Dataset	Apache-2.0	DFT-level conformers and energies derived from GEOM-Drugs
Kraken	Dataset	Copyright retained by original authors	Conformer ensembles and four steric descriptors
BDE	Dataset	Apache-2.0	OpenBabel-generated conformers with DFT binding energies
EE	Dataset	Proprietary	Closed-source as of 2026

Data: The Drugs-75K, Kraken, and BDE subsets are openly available via the project’s GitHub repository. The EE dataset remains closed-source/proprietary (as of 2026), making the EE portion of the benchmark currently irreproducible.
Code: The benchmark suite and PyTorch-Geometric dataset loaders are open-sourced at GitHub (SXKDZ/MARCEL) under the Apache-2.0 license.
Hardware: The authors trained models using Nvidia A100 (40GB) GPUs. Memory-intensive models (e.g., GemNet, LEFTNet) required Nvidia H100 (80GB) GPUs. Total computation across all benchmark experiments was approximately 6,000 GPU hours.
Algorithms/Models: Hyperparameters for all 18 evaluated models are provided in the repository configuration files (benchmarks/params). All baseline models use publicly available frameworks (e.g., PyTorch Geometric, OGB, RDKit).
Evaluation: Evaluation scripts are provided in the repository with consistent tracking of Mean Absolute Error (MAE) and proper configuration of benchmark splits.

Paper Information

Zhu, Y., Hwang, J., Adams, K., Liu, Z., Nan, B., Stenfors, B., Du, Y., Chauhan, J., Wiest, O., Isayev, O., Coley, C. W., Sun, Y., and Wang, W. (2024). Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks. In The Twelfth International Conference on Learning Representations (ICLR 2024). https://openreview.net/forum?id=NSDszJ2uIV

@inproceedings{zhu2024learning,
title={Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks},
author={Yanqiao Zhu and Jeehyun Hwang and Keir Adams and Zhen Liu and Bozhao Nan and Brock Stenfors and Yuanqi Du and Jatin Chauhan and Olaf Wiest and Olexandr Isayev and Connor W. Coley and Yizhou Sun and Wei Wang},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=NSDszJ2uIV}
}

GEOM: Energy-Annotated Molecular Conformations Dataset

Thu, 04 Sep 2025 00:00:00 +0000

Key Contribution

GEOM addresses the gap between 2D molecular graphs and flexible 3D properties by providing 450k+ molecules with 37M+ conformations. This extensive sampling connects conformer ensembles to experimental properties, providing the necessary infrastructure to benchmark conformer generation methods and train 3D-aware property predictors.

Overview

The Geometric Ensemble Of Molecules (GEOM) dataset provides energy-annotated molecular conformations generated through systematic computational methods. The dataset includes molecules from drug discovery campaigns (AICures), quantum chemistry benchmarks (QM9), and molecular property prediction benchmarks (MoleculeNet), with conformations sampled using CREST/GFN2-xTB and a subset refined with high-quality DFT calculations.

Dataset Examples

Example SARS-CoV-2 3CL protease active molecule: CC(=O)Nc1ccc(Oc2ncccn2)cc1 (N-(4-pyrimidin-2-yloxyphenyl)acetamide)

Dataset Subsets

Subset	Count	Description
Drug-like (AICures)	304,466 molecules	Drug-like molecules from AICures COVID-19 challenge (avg 44 atoms)
QM9	133,258 molecules	Small molecules from QM9 (up to 9 heavy atoms)
MoleculeNet	16,865 molecules	Molecules from MoleculeNet benchmarks for physical chemistry, biophysics, and physiology (includes BACE)
BACE (High-quality DFT)	1,511 molecules	BACE subset of MoleculeNet with high-quality DFT energies (r2scan-3c) and experimental inhibition data

Benchmarks

Gibbs Free Energy Prediction

Predict ensemble Gibbs free energy (G) from molecular structure

Subset: 100k AICures · Split: 60/20/20 train/val/test

Rank	Model	MAE (kcal/mol)
🥇 1	SchNetFeatures 3D SchNet + graph features (trained on highest-prob conformer)	0.203
🥈 2	ChemProp Message Passing Neural Network (graph model)	0.225
🥉 3	FFNN Feed-forward network on Morgan fingerprints	0.274
4	KRR Kernel Ridge Regression on Morgan fingerprints	0.289
5	Random Forest Random Forest on Morgan fingerprints	0.406

Average Energy Prediction

Predict ensemble average energy (E) from molecular structure

Subset: 100k AICures · Split: 60/20/20 train/val/test

Rank	Model	MAE (kcal/mol)
🥇 1	ChemProp Message Passing Neural Network (graph model)	0.11
🥈 2	SchNetFeatures 3D SchNet + graph features (trained on highest-prob conformer)	0.113
🥉 3	FFNN Feed-forward network on Morgan fingerprints	0.119
4	KRR Kernel Ridge Regression on Morgan fingerprints	0.131
5	Random Forest Random Forest on Morgan fingerprints	0.166

Conformer Count Prediction

Predict ln(number of unique conformers) from molecular structure

Subset: 100k AICures · Split: 60/20/20 train/val/test

Rank	Model	MAE
🥇 1	SchNetFeatures 3D SchNet + graph features (trained on highest-prob conformer)	0.363
🥈 2	ChemProp Message Passing Neural Network (graph model)	0.38
🥉 3	FFNN Feed-forward network on Morgan fingerprints	0.455
4	KRR Kernel Ridge Regression on Morgan fingerprints	0.484
5	Random Forest Random Forest on Morgan fingerprints	0.763

Dataset	Description
QM9	134k small molecules with up to 9 heavy atoms and DFT properties
PCQM4Mv2	Millions of computationally generated molecules for HOMO-LUMO gap prediction
PubChemQC	DFT structures and energy properties for millions of PubChem molecules

Strengths

Scale: 37M+ conformations across 450k+ molecules, providing massive coverage of drug-like and small molecule chemical space.
Energy Annotations: All conformations include semi-empirical energies (GFN2-xTB); the BACE subset includes high-quality DFT energies.
Quality Tiers: Three levels of computational rigor allow researchers to trade off dataset size for simulation accuracy.
Benchmark Ready: Includes validated splits and architectural baselines (e.g., ChemProp, SchNet) for property prediction tasks.
Task Diversity: Combines molecules sourced from drug discovery (AICures), quantum chemistry (QM9), and biophysiology domains (MoleculeNet).

Limitations

Computational Constraints: The highest-accuracy DFT subset (BACE) is limited to 1,511 molecules due to the extreme computational cost of exact free energy sampling and Hessian estimation.
Semi-Empirical Accuracy Gap: The $p^{\text{CREST}}$ statistical weights rely on GFN2-xTB energies, which exhibit a $\sim$2 kcal/mol MAE against true DFT. At room temperature ($k_BT \approx 0.59$ kcal/mol), this error heavily skews the Boltzmann distribution, meaning standard subset weights are imprecise.
Solvation Assumptions: Most subsets rely on vacuum calculations. Only the BACE subset uses an implicit solvent (ALPB/C-PCM for water).
Coverage Lapses: Extremely flexible molecules (e.g., within the SIDER dataset) frequently failed the conformer generation pipeline due to runaway topologies.
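
To make the semi-empirical accuracy gap concrete, the sketch below estimates how much a given error in a relative conformer energy distorts the Boltzmann population ratio between two conformers. The function name is hypothetical; the only inputs are the Boltzmann constant and the figures quoted above.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def population_ratio_shift(energy_error, temperature=298.15):
    """Factor by which p_i / p_j changes when E_i - E_j is off by energy_error (kcal/mol)."""
    return math.exp(energy_error / (K_B * temperature))

# A 2 kcal/mol error at room temperature changes the ratio by a factor on the order of 30.
shift = population_ratio_shift(2.0)
```

Since the error enters through an exponential, even sub-kcal/mol improvements in the energies change the weights noticeably.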

Technical Notes

Data Generation Pipeline

Initial conformer sampling (RDKit):

EmbedMultipleConfs with numConfs=50, pruneRmsThresh=0.01 Å
MMFF force field optimization
GFN2-xTB optimization of seed conformer

Conformational exploration (CREST):

Metadynamics in NVT ensemble driven by a pushing bias potential: $$ V_{\text{bias}} = \sum_i k_i \exp(-\alpha_i \Delta_i^2) $$ where $\Delta_i$ is the root-mean-square displacement (RMSD) against the $i$-th reference structure.
12 independent MTD runs per molecule with different settings for $k_i$ and $\alpha_i$.
6.0 kcal/mol safety window for conformer retention.
Solvent: ALPB for water (BACE); vacuum for others.
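
The pushing bias potential above is a sum of Gaussians centered on previously visited structures. A minimal sketch of evaluating it, assuming the RMSD values to each reference structure have already been computed elsewhere (the function name and argument layout are hypothetical):

```python
import math

def bias_potential(rmsds, k, alpha):
    """Evaluate V_bias = sum_i k_i * exp(-alpha_i * Delta_i^2).

    rmsds[i] is the RMSD to the i-th reference structure; k[i] and alpha[i]
    are the corresponding pushing strength and Gaussian width parameter.
    """
    return sum(k_i * math.exp(-a_i * d * d) for k_i, a_i, d in zip(k, alpha, rmsds))
```

The bias is largest when the current geometry coincides with a reference structure and decays as the RMSD grows, which pushes the simulation toward unexplored conformations.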

Energy calculation & Weighting:

Standard (GFN2-xTB): Semi-empirical tight-binding DFT ($\approx$ 2 kcal/mol MAE vs DFT). Conformers are assigned a statistical probability based on energy $E_i$ and rotamer degeneracy $d_i$: $$ p^{\text{CREST}}_i = \frac{d_i \exp(-E_i / k_B T)}{\sum_j d_j \exp(-E_j / k_B T)} $$
High-Quality DFT (CENSO): Refines structures using the r2scan-3c functional, computing exact conformation-dependent free energies ($G_i$) that remove the need for explicit rotamer degeneracy approximations:

$$ \begin{aligned} p^{\text{CENSO}}_i &= \frac{\exp(-G_i / k_B T)}{\sum_j \exp(-G_j / k_B T)} \\ G_i &= E_{\text{gas}}^{(i)} + \delta G_{\text{solv}}^{(i)}(T) + G_{\text{trv}}^{(i)}(T) \end{aligned} $$
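
The two weighting schemes share the same normalized Boltzmann form. The sketch below implements the $p^{\text{CREST}}$ expression; the function name is hypothetical, and energies are shifted by their minimum before exponentiation, which leaves the normalized weights unchanged but avoids overflow.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def crest_weights(energies, degeneracies, temperature=298.15):
    """Return p_i = d_i exp(-E_i / kT) / sum_j d_j exp(-E_j / kT), energies in kcal/mol."""
    kt = K_B * temperature
    e_min = min(energies)  # shifting by a constant cancels in the ratio
    unnorm = [d * math.exp(-(e - e_min) / kt) for e, d in zip(energies, degeneracies)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

Passing free energies $G_i$ in place of $E_i$ with all degeneracies set to 1 reproduces the $p^{\text{CENSO}}$ form.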

Quality Levels

Level	Method	Subset	Accuracy
Standard	CREST/GFN2-xTB	All subsets	~2 kcal/mol MAE vs DFT
DFT Single-Point	r2scan-3c/mTZVPP on CREST geometries	BACE (1,511 molecules)	Sub-kcal/mol
DFT Optimized	CENSO full optimization + free energies	BACE (534 molecules)	~0.3 kcal/mol vs CCSD(T)

Benchmark Setup

Task: Predict ensemble summary statistics directly from the 2D molecular structure. The target properties include:

Conformational Free Energy ($G$): $G = -TS$, where $S = -R \sum_i p_i \log p_i$.
Average Energy ($\langle E \rangle$): $\langle E \rangle = \sum_i p_i E_i$.
Unique Conformers: Natural log of the conformer count retained within the energy window.
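
The three targets listed above follow directly from the conformer weights and energies. A minimal sketch, assuming the weights are already normalized and energies are in kcal/mol (the function name and dictionary keys are hypothetical):

```python
import math

R = 0.0019872041  # gas constant in kcal/(mol*K)

def ensemble_summary(weights, energies, temperature=298.15):
    """Compute conformational free energy, average energy, and ln(conformer count)."""
    # S = -R * sum_i p_i ln p_i; terms with p_i = 0 contribute nothing
    entropy = -R * sum(p * math.log(p) for p in weights if p > 0)
    return {
        "G": -temperature * entropy,                             # G = -T S
        "avg_E": sum(p * e for p, e in zip(weights, energies)),  # <E> = sum_i p_i E_i
        "ln_n_conformers": math.log(len(weights)),
    }
```

For two equally weighted conformers this gives $G = -RT \ln 2 \approx -0.41$ kcal/mol at 298.15 K, while a single-conformer ensemble has zero conformational entropy.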

Data: 100,000 species randomly sampled from AICures subset, split 60/20/20 (train/validation/test).

Hyperparameters: Optimized using Hyperopt package for each model/task combination.

Models:

SchNetFeatures: 3D SchNet architecture + graph features, trained on highest-probability conformer
ChemProp: Message Passing Neural Network on molecular graphs
FFNN: Feed-forward network on Morgan fingerprints
KRR: Kernel Ridge Regression on Morgan fingerprints
Random Forest: Random Forest on Morgan fingerprints

Hardware & Computational Cost

CREST/GFN2-xTB Generation

Total compute: ~15.7 million core hours

AICures subset:

13M core hours on Knights Landing (32-core nodes)
1.2M core hours on Cascade Lake/Sky Lake (13-core nodes)
Average wall time: 2.8 hours/molecule (KNL) or 0.63 hours/molecule (Sky Lake)

MoleculeNet subset: 1.5M core hours

DFT Calculations (BACE only)

Software: CENSO 1.1.2 + ORCA 5.0.1 (r2scan-3c/mTZVPP functional)

Solvent: C-PCM implicit solvation (water)

Hardware: ~54 cores per job

Compute cost:

781,000 CPU hours for CENSO optimizations
1.1M CPU hours for single-point energy calculations

Reproducibility Details

Data Availability: All generated conformations, energies, and thermodynamic properties are publicly hosted on Harvard Dataverse. The data is provided in language-agnostic MessagePack format and Python-specific RDKit .pkl formats.
Code & Analysis: The primary GitHub repository (learningmatter-mit/geom) provides tutorials for data extraction, RDKit processing, and conformational visualization.
Model Training & Baselines: The machine learning benchmarks (SchNet, ChemProp) and corresponding training scripts used to evaluate the dataset can be reproduced using the authors’ NeuralForceField repository.
Hardware & Compute: Generation was compute-intensive (15.7M core hours for CREST sampling alone), relying heavily on Knights Landing (KNL) and Cascade Lake architectures. See the Hardware & Computational Cost section above for full details.
Software Versions: Precise reproduction of conformational properties requires specific versions to mitigate numerical variances: CREST v2.9, xTB v6.2.3/v6.4.1, CENSO v1.1.2, ORCA v5.0.1/v5.0.2, and RDKit v2020.09.1.
Open-Access Paper: The full methodology is accessible via the arXiv preprint.

Paper Information

Axelrod, S. and Gómez-Bombarelli, R. (2022). GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Scientific Data, 9(1), 185. https://doi.org/10.1038/s41597-022-01288-4

@article{Axelrod_2022,
    title={GEOM, energy-annotated molecular conformations for property prediction and molecular generation},
    volume={9},
    ISSN={2052-4463},
    url={http://dx.doi.org/10.1038/s41597-022-01288-4},
    DOI={10.1038/s41597-022-01288-4},
    number={1},
    journal={Scientific Data},
    publisher={Springer Science and Business Media LLC},
    author={Axelrod, Simon and Gómez-Bombarelli, Rafael},
    year={2022},
    month={apr},
    pages={185}
}

GDB-11: Chemical Universe Database (26.4M Molecules)

Fri, 29 Aug 2025 00:00:00 +0000

Dataset Examples

GDB-11 molecule (SMILES: FC1C2OC1c3c(F)coc23)

Dataset	Relationship	Link
GDB-13	Successor	Notes
GDB-17	Successor	Notes

Key Contribution

The generation and analysis of the Generated Database (GDB), an exhaustive collection of all possible small molecules that meet specific criteria for stability and synthetic feasibility.

Overview

GDB-11 represents the first systematic enumeration of the small molecule chemical universe up to 11 atoms of C, N, O, and F. The database contains 26.4 million unique molecules corresponding to 110.9 million stereoisomers. It was created to support virtual screening and drug discovery by providing a comprehensive collection of diverse, drug-like small molecules that obey standard chemical stability rules.

Strengths

Systematic Enumeration: Exhaustive coverage of mathematically and chemically possible structures up to 11 atoms.
Drug-Likeness: 100% of compounds follow Lipinski’s “Rule of 5” for bioavailability, and 50% (13.2 million) follow Congreve’s more restrictive “Rule of 3” for lead-likeness.
Structural Novelty: Features 538 newly identified ring systems that were previously unknown in existing chemical databases (like the CAS Registry or Beilstein).
High Chirality: Over 70% of GDB molecules are chiral, providing rich 3D structural diversity, particularly in fused carbocycles and heterocycles.

Limitations

Size Restriction: Strictly limited to small molecules with a maximum of 11 heavy atoms.
Element Restriction: Only contains C, N, O, and F. Important biological and pharmaceutical elements like Phosphorus (P), Sulfur (S), and Silicon (Si) are excluded to prevent combinatorial explosion.
Excluded Topologies: Excludes highly strained molecules (e.g., specific bridged systems), allenes, and bridgehead double bonds.
Unstable Functional Groups: Excludes chemical classes deemed unstable or highly reactive (e.g., gem-diols, hemiacetals, aminals, enols, orthoacids).
Computational Nature: Consists entirely of computer-generated, theoretical structures without experimental synthesis or biological validation.

Technical Notes

Construction

Graph Selection

The program GENG was used to generate an initial set of 843,335 connected graphs with up to 11 nodes and a maximum node connectivity of 4. These were filtered to 15,726 stable saturated hydrocarbon graphs using:

Topological Criteria: Removed graphs with a node in multiple small (3- or 4-membered) rings, tetravalent bridgeheads in small rings, and nonplanar graphs (e.g., Claus-benzol).
Steric Criteria: Graphs containing highly distorted centers were removed using an adapted MM2 force field energy-minimization with a cutoff of +17 kcal/mol.

Structure Generation

Graph symmetry algorithms identified valid locations for unsaturations and heteroatoms (C, N, O, F). Specific valence constraints were continuously enforced. Combinatorial distribution of elements and multiple bonds (excluding bridgehead double bonds, triple bonds in rings smaller than nine, and allenes) yielded a theoretical “dark matter universe” (DMU) of over 1.7 billion unique structures.

Filters

The 1.7 billion structural candidates contained unstable environments which were aggressively filtered, reducing the set to 27.7 million possible stable molecules. Rejected unstable/reactive features included:

High-Energy Bonds: Gem-diols, non-stabilized aminals, hemiaminals, enols, orthoesters, unstable imines, acyl fluorides, and geminal di-heteroatoms.
Heteroatom-Heteroatom Bonds: Peroxides (O-O), N-O, N-N, N-F, and triazanes, unless stabilized (e.g., hydrazones, oximes).
Strained Topologies: 3/4-membered rings containing N-N or N-O bonds, and bridgehead heteroatom bonds causing instabilities (like Bredt’s rule violations).

Removal of redundant tautomeric forms collapsed the set to the foundational 26.4 million structures.

Stereoisomer Generation

Stereoisomers were cleanly enumerated by identifying all asymmetric centers and functional double bonds, blocking Z/E isomerism in rings smaller than 10 nodes. From the 26.4 million unique constitutional isomers, 110.9 million stereoisomers were generated (averaging 4.2 stereoisomers per molecule).

Analysis Methodology

Kohonen Maps (Self-Organizing Maps)

The chemical space visualization and compound class analysis used a Kohonen Map (Self-Organizing Map/SOM):

Input Features: 48-dimensional autocorrelation vectors encoding topological relationships and atomic properties. The autocorrelation vector $\text{AC}_d$ for a topological distance $d$ is defined as:

$$ \text{AC}_d = \sum_{i=1}^{N} \sum_{j=1}^{N} \delta (p_i, p_j)_d $$

(where $N$ is the number of atoms, $p$ is the atomic property, and $\delta (p_i, p_j)_d = p_i p_j$ if the topological distance between atoms $i$ and $j$ is $d$, and 0 otherwise).

Training Data: Random subset of 1,000,000 GDB molecules
Architecture: 200x200 neuron grid
Training Protocol: 250,000 epochs with 100 molecules presented per epoch
Algorithm: Standard Kohonen algorithm
Key Insight: Reveals that “lead-like” compounds cluster in chiral regions of fused carbocycles/heterocycles
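
The autocorrelation descriptor used as SOM input can be sketched for a single topological distance. This is an illustrative reading of the formula, not the original software: the molecule is given as an adjacency dictionary, the property values are supplied by the caller, and the double sum runs over ordered pairs, so each atom pair at distance $d > 0$ is counted twice.

```python
from collections import deque

def topological_distances(adjacency, start):
    """Breadth-first search returning bond-count distances from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in adjacency[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def autocorrelation(adjacency, prop, d):
    """AC_d = sum over ordered pairs (i, j) at topological distance d of p_i * p_j."""
    total = 0.0
    for i in adjacency:
        for j, d_ij in topological_distances(adjacency, i).items():
            if d_ij == d:
                total += prop[i] * prop[j]
    return total
```

Stacking `autocorrelation` over several distances and several atomic properties would give a fixed-length vector; how the 48 dimensions are split between distances and properties is not specified here.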

Comparison

The full database was compared comprehensively to a Reference Database (RDB) of 63,857 known compounds (up to 11 atoms) extracted from PubChem, ChemACX, ChemSCX, NCI Open Database, and the Merck Index. Of the 63,857 RDB compounds, 37,393 (58.6%) were found in GDB. The remaining 26,464 compounds were absent due to structural rule violations, exclusion of elements beyond C/N/O/F, and filtered unstable chemistries.

New Rings

All 309 entirely acyclic graphs in GDB mapped cleanly to published structures. External databases contained only 670 of the 1,208 purely cyclic theoretical ring systems (55.5%). Furthermore, 367 of the 538 newly identified ring systems (68.2%) express inherently chiral topologies.

Stereochemistry

Small molecules under 5 heavy atoms skew strongly towards simple achiral structures. As the atom count increases, a dominant stereochemical shift emerges: over two-thirds of structures containing exactly 10 or 11 atoms occupy chiral configuration spaces. Approximately 86% of the molecules in GDB contain exactly 11 atoms (22.8 million of 26.4 million).

Physicochemical Properties

Because all GDB molecules contain at most 11 heavy atoms, 100% of them satisfy Lipinski’s “Rule of 5” for bioavailability. Under the more restrictive Congreve “Rule of 3” for lead-likeness (MW < 300, RBC < 3, logP < 3, HBDC < 3, HBAC < 3, TPSA < 60 $\text{\AA}^2$), exactly 50% (13.2 million structures) qualify. Virtual screening using the Molinspiration miscreen toolkit (Bayesian statistics-based) identified 42,804 virtual hits across three drug target classes: 3,043 kinase inhibitor candidates, 24,489 GPCR ligand candidates, and 19,696 ion-channel modulator candidates. Of these virtual hits, 59.8% occupied Kohonen map neurons not populated by any known RDB compound.
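
The "Rule of 3" thresholds quoted above translate into a simple predicate. The sketch assumes the descriptors (molecular weight, rotatable bond count, logP, hydrogen-bond donor and acceptor counts, and topological polar surface area) have been computed by some cheminformatics toolkit; the function name is hypothetical.

```python
def passes_rule_of_three(mw, rbc, logp, hbdc, hbac, tpsa):
    """Return True if all lead-likeness thresholds listed in the text are met.

    mw in Da, tpsa in square angstroms; rbc, hbdc, hbac are integer counts.
    """
    return (
        mw < 300
        and rbc < 3
        and logp < 3
        and hbdc < 3
        and hbac < 3
        and tpsa < 60
    )
```

All comparisons are strict, matching the inequalities as written, so a molecule with exactly 3 rotatable bonds does not qualify.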

Reproducibility Details

While the generated GDB-11 database is openly available, reproducing the exact generation from graph to stereoisomer relies on in-house and proprietary software which is not publicly available.

Artifact	Type	License	Notes
GDB Downloads (University of Berne)	Dataset	Unknown	Official host for GDB databases
Zenodo Record (10.5281/zenodo.5172017)	Dataset	Unknown	Version-agnostic Zenodo archive of GDB-11

Paper Accessibility: Closed-access (Published in JCIM 2007; no preprint available).
Data Availability: The complete dataset is hosted on an open-access Zenodo repository (version-agnostic DOI): 10.5281/zenodo.5172017.
Software Dependencies (Closed/Commercial):
- Generation code is a closed-source Java (J2SE v5.0) application.
- Relies on proprietary ChemAxon libraries (JChem v3.1, Marvin v4.0 API).
- Virtual screening evaluation utilized the commercial Molinspiration miscreen toolkit.
Hardware Profile:
- CPUs: Two AMD Opteron 252 2.6 GHz processors
- Parallelization: 80-fold parallelization
- Compute Time: Approximately 20 hours for full generation

Force Field

A custom implementation of the MM2 force field was used for steric energy minimization during structure validation. It used the parameter set from Allinger, specifically adding a quartic term for bond stretching to prevent bond lengthening far from equilibrium:

$$ \begin{aligned} E_{\text{Steric}} &= \sum_{\text{bonds}} k_b(l_i - l_{0,i})^2 \left[1 + k'_b(l_i - l_{0,i}) + k''_b(l_i - l_{0,i})^2\right] \\ &\quad + \sum_{\text{angles}} k_\theta(\theta_i - \theta_{0,i})^2 \left[1 + k'_\theta(\theta_i - \theta_{0,i})^4\right] \\ &\quad + \sum_{\text{angles}} k_{b,\theta}(\theta_i - \theta_{0,i})^2 \left[(l_a - l_{0,a}) + (l_b - l_{0,b})\right] \\ &\quad + \sum_{\text{torsions}} \left[ \frac{V_1}{2} (1 + \cos \omega) + \frac{V_2}{2} (1 - \cos 2\omega) + \frac{V_3}{2} (1 + \cos 3\omega) \right] \\ &\quad + \sum_{i=1}^N \sum_{j=i+1}^N \epsilon_{ij} \left[ A \exp \left( \frac{-B r_{ij}}{\sum r^{\ast}_{ij}} \right) - C \left( \frac{r_{ij}}{\sum r^{\ast}_{ij}} \right)^6 \right] \end{aligned} $$
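
Two of the terms above are easy to evaluate in isolation, which helps show why the quartic stretch correction matters. The sketch below is illustrative only: the function names are hypothetical, and the parameter values in the usage note are made up for demonstration, not taken from the Allinger parameter set.

```python
import math

def stretch_energy(l, l0, kb, kb1, kb2):
    """Bond stretch term: kb * dl^2 * (1 + kb1 * dl + kb2 * dl^2), with dl = l - l0."""
    dl = l - l0
    return kb * dl ** 2 * (1.0 + kb1 * dl + kb2 * dl ** 2)

def torsion_energy(omega, v1, v2, v3):
    """Three-term torsional potential; omega is the dihedral angle in radians."""
    return (
        v1 / 2.0 * (1.0 + math.cos(omega))
        + v2 / 2.0 * (1.0 - math.cos(2.0 * omega))
        + v3 / 2.0 * (1.0 + math.cos(3.0 * omega))
    )
```

With a negative cubic coefficient and no quartic term, for example `stretch_energy(2.5, 1.5, 1.0, -2.0, 0.0)`, the energy becomes negative at large extension, so a minimizer could stretch the bond indefinitely. Adding a positive quartic coefficient, as in `stretch_energy(2.5, 1.5, 1.0, -2.0, 2.0)`, restores a positive restoring energy.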

Paper Information

Fink, T. and Reymond, J.-L. (2007). Virtual Exploration of the Chemical Universe up to 11 Atoms of C, N, O, F: Assembly of 26.4 Million Structures (110.9 Million Stereoisomers) and Analysis for New Ring Systems, Stereochemistry, Physicochemical Properties, Compound Classes, and Drug Discovery. Journal of Chemical Information and Modeling, 47(2), 342–353. https://doi.org/10.1021/ci600423u

@article{fink2007virtual,
  title={Virtual exploration of the chemical universe up to 11 atoms of C, N, O, and F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes, and drug discovery},
  author={Fink, Tobias and Reymond, Jean-Louis},
  journal={Journal of Chemical Information and Modeling},
  volume={47},
  number={2},
  pages={342--353},
  year={2007},
  publisher={ACS Publications}
}

GDB-17: Chemical Universe Database (166.4B Molecules)

Sat, 16 Aug 2025 00:00:00 +0000

Key Contribution

The systematic enumeration of 166.4 billion organic molecules (GDB-17) up to 17 atoms, extending the known chemical universe into the drug-relevant size range. This reveals a highly dense novel chemical space that is measurably richer in complex stereochemical and three-dimensional structures compared to historically biased chemical databases.

Overview

GDB-17 represents the largest enumerated database of drug-like small molecules, reaching the size range typical of lead compounds and approved drugs ($100 < \text{MW} < 350$ Da). It contains 166.4 billion structures consisting of up to 17 atoms of C, N, O, S, and halogens (F, Cl, Br, I). Because the number of combinatorial possibilities grows exponentially with heavy atom count (HAC), the MW distribution of the database peaks sharply in the $240$-$250 \text{ Da}$ range. Compared to known molecules in databases like PubChem, GDB-17 molecules are significantly richer in non-aromatic heterocycles, quaternary centers, and stereoisomers, avoiding “flatland” by deeply populating the third dimension in shape space.

Dataset Examples

Example GDB-17 molecule (SMILES: C1CC2C3CCCC3C3(C4CCC3CC4)C2C1) demonstrating the complex polycyclic structures and 3D diversity characteristic of the database

Dataset Subsets

Subset	Size	Description
GDB-17 (Full)	166.4B	Complete enumeration of the database
GDBLL-17	29B	Lead-like subset ($1 < \text{clogP} < 3$ and $100 < \text{MW} < 350$ Da)
GDBLLnoSR-17	22B	Lead-like subset excluding compounds with small rings (3- or 4-membered)
Random Sample	50M	Random 50M subset available for download, including pre-filtered lead-like and no-small-ring fractions

Benchmarks

Note: As an enumerated database of theoretical structures, GDB-17 lacks standard supervised ML benchmarks. It functions primarily as a generative compass and foundational exploration library for unsupervised learning and molecular generation.

Dataset	Relationship	Link
GDB-11	Predecessor	Notes
GDB-13	Predecessor	Notes

Strengths & Limitations

Strengths:

3D Shape Space (“Escape out of Flatland”): Populates the third dimension (spherical, non-planar shapes) significantly better than known structures in PubChem or ChEMBL, which are primarily flat and rod-like due to aromatic dominance
Stereochemical Complexity: Averages 6.4 possible stereoisomers per molecule (compared to 2.0 in PubChem-17), driven by an abundance of non-planar features and small rings
Massive Scaffold Diversity: Features 35-fold more Murcko scaffolds and 61-fold more ring systems than molecules of matching size in PubChem
Rich in Known Drug Isomers: Contains millions of exact geometric and formula isomers of approved drugs, offering direct variations and “methyl walk” analogs

Limitations:

Experimental Gap: These are virtual, combinatorially enumerated molecules. Despite strict computational stability filtering, they remain unsynthesized and lack experimental validation.
Small Ring Dominance: Up to 16 atoms, roughly 83% of the database consists of compounds with challenging small (3- or 4-membered) rings, though this drops for the 17-atom set, resulting in an overall 28% fraction of small ring compounds
Elemental Scope Restrictions: Elements like P, Si, and B, which occasionally appear in drugs, are completely excluded
Strict Stability Filters: Excludes some potentially viable functional groups strictly to manage the combinatorial explosion and avoid unstable structures (e.g., hemiacetals, aminals, acyclic acetals)
Polarity Skew: The full database contains disproportionately more polar molecules ($\text{clogP} < 0$) than reference sets, and its sheer size makes it computationally demanding to query using advanced docking or 3D shape tools

Technical Notes

Generation Pipeline

GDB-17 was generated from first principles through a highly filtered, multi-stage pipeline:

Graphs $\rightarrow$ Hydrocarbons: Started with 114.3 billion topologies (generated using GENG), filtered down to 5.4 million stable hydrocarbons by applying geometrical strain rules (H-filters).
Hydrocarbons $\rightarrow$ Skeletons: Substituted single bonds with double and triple bonds to yield 1.3 billion skeletons, simultaneously removing reactive unsaturations like allenes (S-filters).
Skeletons $\rightarrow$ CNO Molecules: Diversified into 110.4 billion molecules by combinatorially substituting C with N and O, explicitly avoiding heteroatom-heteroatom bonds and enforcing stability filters (F-filters).
Post-processing: Added diversity by transforming groups to generate aromatics, oximes, $\text{CF}_3$, halogens, and sulfones (P-filters), yielding the final 166.4 billion count.

Hardware & Software

Compute: Over 40,000 jobs distributed across a 360-CPU cluster, consuming 100,000 CPU hours (~11 CPU years)
Software: Powered by GENG (Nauty package) for graph generation, CORINA for 3D stereoisomer generation, and ChemAxon JChem libraries running inside custom Java 1.6 applications

Shape Analysis (PMI)

To quantitatively define the “escape from flatland,” the origin paper classifies molecular shape using the normalized Principal Moments of Inertia (PMI) of the generated 3D conformers. The principal moments ($I_1 \le I_2 \le I_3$) are derived by diagonalizing the standard moment of inertia tensor. Molecules are plotted within a normalized 2D triangular space mapped by the ratios:

$$ P_1 = \frac{I_1}{I_3}, \quad P_2 = \frac{I_2}{I_3} $$

The vertices of this plot define the three geometrical boundaries of chemical space:

Rod-like (1D): $(0, 1)$ typical of stretched alkanes
Disc-like (2D): $(0.5, 0.5)$ typical of flat aromatics like benzene
Sphere-like (3D): $(1, 1)$ typical of globular structures like cubane

GDB-17’s core structural finding is that mathematically enumerated chemical space thickly populates the interior and $(1,1)$ spherical regions of this plot, demonstrating significant 3D structure. Empirical libraries traditionally cluster densely along the rod-to-disc axis.
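
The normalized PMI coordinates can be computed from a set of 3D atom positions as sketched below. This is a stdlib-only illustration rather than the original pipeline: function names are hypothetical, unit masses are assumed when none are given, and the eigenvalues of the symmetric 3x3 inertia tensor are obtained with the standard closed-form trigonometric method.

```python
import math

def inertia_tensor(coords, masses=None):
    """Moment of inertia tensor about the center of mass."""
    masses = masses or [1.0] * len(coords)  # unit masses: an illustrative assumption
    total = sum(masses)
    com = [sum(m * c[k] for m, c in zip(masses, coords)) / total for k in range(3)]
    tensor = [[0.0] * 3 for _ in range(3)]
    for m, c in zip(masses, coords):
        r = [c[k] - com[k] for k in range(3)]
        r2 = sum(x * x for x in r)
        for a in range(3):
            for b in range(3):
                tensor[a][b] += m * ((r2 if a == b else 0.0) - r[a] * r[b])
    return tensor

def det3(m):
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def sym3_eigenvalues(a):
    """Eigenvalues of a symmetric 3x3 matrix in ascending order."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0:  # already diagonal
        return sorted([a[0][0], a[1][1], a[2][2]])
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    r = max(-1.0, min(1.0, det3(b) / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    largest = q + 2.0 * p * math.cos(phi)
    smallest = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    middle = 3.0 * q - largest - smallest
    return sorted([smallest, middle, largest])

def normalized_pmi(coords, masses=None):
    """Return (I1/I3, I2/I3) with I1 <= I2 <= I3."""
    i1, i2, i3 = sym3_eigenvalues(inertia_tensor(coords, masses))
    if i3 <= 0:
        raise ValueError("need at least two distinct atom positions")
    return i1 / i3, i2 / i3
```

Three collinear points map to the rod vertex $(0, 1)$, a planar square to the disc vertex $(0.5, 0.5)$, and the six vertices of a regular octahedron to the sphere vertex $(1, 1)$.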

Differences from GDB-13

The algorithm was completely rewritten optimizing memory efficiency, boosting computing speed roughly 400-fold and allowing enumeration beyond the previous 13-atom limit
Scope aggressively expanded to include all functional halogens (F, Cl, Br, I) within the base framework
Introduced intensive, size-dependent graph selection filters (prohibiting complex bridgeheads and completely eliminating small rings in 17-atom graphs) to manage combinatorial explosion
Post-processing cycles were deliberately decoupled to add features like cyclic oximes, aromatic halogens, and sulfones that would otherwise be rejected or would break the underlying generation constraints

Reproducibility Details

Paper Accessibility: The original paper is published in the Journal of Chemical Information and Modeling and is available as an Open Access publication under a CC-BY license.
Data Availability: The full 166.4 billion molecule dataset is not publicly available for download (estimated >400 GB compressed). However, a 50 million random subset and pre-filtered lead-like fractions are openly available on the GDB website and archived on Zenodo.
Code & Algorithms: The enumeration rules and logic are well-described in the paper, but the actual Java 1.6 source code has not been released.
Dependencies: The pipeline is a mix of open-source and proprietary software tools. Graph generation uses open-source GENG (Nauty), while chemical logic and stereoisomer generation rely on proprietary ChemAxon JChem libraries and CORINA.
Hardware Specifications: The original database generation was explicitly parallelized across a 360-CPU cluster, consuming 100,000 CPU hours (approximately 11 CPU years) with over 40,000 calculation runs.

Paper Information

Ruddigkeit, L., van Deursen, R., Blum, L. C., and Reymond, J.-L. (2012). Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. Journal of Chemical Information and Modeling, 52(11), 2864–2875. https://doi.org/10.1021/ci300415d

@article{Ruddigkeit_2012,
  title={Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17},
  volume={52},
  ISSN={1549-960X},
  url={http://dx.doi.org/10.1021/ci300415d},
  DOI={10.1021/ci300415d},
  number={11},
  journal={Journal of Chemical Information and Modeling},
  publisher={American Chemical Society (ACS)},
  author={Ruddigkeit, Lars and van Deursen, Ruud and Blum, Lorenz C. and Reymond, Jean-Louis},
  year={2012},
  month=nov,
  pages={2864--2875}
}

GDB-13: Chemical Universe Database (970M Molecules)

Sat, 16 Aug 2025 00:00:00 +0000

Dataset Examples

Example GDB-13 molecule (SMILES: CCCC(O)(CO)CC1CC1CN)

Dataset Subsets

Subset	Size	Description
C/N/O Set	~910.1M	Molecules containing up to 13 atoms of Carbon, Nitrogen, and Oxygen.
Cl/S Set	~67.3M	Molecules containing up to 13 atoms, adding Sulfur (aromatic heterocycles, sulfones, sulfonamides, thioureas) and Chlorine (aromatic substituents).

Dataset	Relationship	Link
GDB-11	Predecessor	Notes
GDB-17	Successor	Notes

Key Contribution

The creation and release of the 977.5 million-compound GDB-13, a significant expansion in molecular size (up to 13 atoms) and elemental diversity (including S and Cl) made possible by key algorithmic optimizations that significantly accelerated the enumeration process.

Overview

GDB-13 extends the systematic enumeration of drug-like chemical space to molecules containing up to 13 atoms of Carbon, Nitrogen, Oxygen, Sulfur, and Chlorine. Building on the methodology established in GDB-11, this database represents a 37-fold increase in size while maintaining 100% Lipinski compliance for virtual screening applications. The enumeration results in a vast array of cyclic topologies, where 54% of the database comprises molecules with at least one three- or four-membered ring.

Strengths

Systematic coverage of structures with up to 13 atoms
High drug-likeness: 100% Lipinski compliance and 99.5% Vieth compliance
High proportion of leadlike (98.9%) and fragmentlike (45.1%) molecules
Structural novelty providing fragments absent from established databases like ZINC, ACX, and PubChem

Limitations

Limited to small molecules with up to 13 atoms of C, N, O, S, and Cl
Omits 66.2% of known chemical space up to 13 atoms found in external databases
Excludes specific nonenumerated elements (F, Br, I, P, Si, metals) and functional groups (chlorine on nonaromatic carbons, mercaptans, sulfoxides, enamines, allenes)
Excludes highly strained molecules and highly polar combinations
Consists entirely of computer-generated structures pending experimental validation

Technical Notes

Algorithmic Approach

Type: Rule-Based Combinatorial Graph Enumeration

The dataset is built by combinatorial enumeration: a rule-based graph generation algorithm (GENG) produces candidate molecular graphs, and chemical stability filters prune them.
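To make the graph-enumeration starting point concrete, here is a toy sketch that brute-forces connected simple graphs with at most four bonds per node (the valence of carbon) and deduplicates them up to isomorphism. It is not GENG, which uses far more efficient generation, and it does not test planarity; the function names are hypothetical and the approach is only practical for very small node counts.

```python
from itertools import combinations, permutations


def canonical_form(n, edges):
    """Return the smallest sorted edge tuple over all relabelings of n nodes."""
    best = None
    for perm in permutations(range(n)):
        relabeled = tuple(
            sorted(tuple(sorted((perm[a], perm[b]))) for a, b in edges)
        )
        if best is None or relabeled < best:
            best = relabeled
    return best


def is_connected(n, edges):
    """Depth-first search from node 0; connected if every node is reached."""
    adjacency = {i: set() for i in range(n)}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == n


def enumerate_skeletons(n, max_degree=4):
    """Enumerate connected graphs on n nodes with bounded degree, up to isomorphism."""
    all_pairs = list(combinations(range(n), 2))
    unique = set()
    # A connected graph on n nodes needs at least n - 1 edges.
    for k in range(n - 1, len(all_pairs) + 1):
        for edges in combinations(all_pairs, k):
            degree = [0] * n
            for a, b in edges:
                degree[a] += 1
                degree[b] += 1
            if max(degree) > max_degree:
                continue
            if is_connected(n, edges):
                unique.add(canonical_form(n, edges))
    return unique


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, len(enumerate_skeletons(n)))
```

For four nodes this yields six graphs, which read as the carbon skeletons of butane, isobutane, cyclobutane, methylcyclopropane, bicyclo[1.1.0]butane, and tetrahedrane. The last two are the kind of highly strained ring systems that the topological and volume filters are meant to remove.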

Process:

Start with mathematical graphs representing saturated hydrocarbons up to 13 nodes using GENG (non-planar graphs discarded)
Apply topological filters to remove highly strained small ring systems (e.g., fused cyclopropanes and bridgehead 3/4-membered rings)
Generate 3D structures via CORINA or ChemAxon and apply a 3D volume-based strain filter. The local strain of a tetravalent carbon is estimated from the volume $V$ of the tetrahedron whose vertices lie $1 \text{ \AA}$ from the carbon along each of its four single bonds. Hydrocarbons containing near-planar or pyramidal carbon centers, identified by the following criterion, are discarded: $$ V < 0.345 \text{ \AA}^3 $$
Introduce unsaturations and heteroatoms through systematic substitution
Apply chemical rule filters and element-ratio heuristics to ensure stability and drug-likeness
Apply post-processing algorithms to introduce nitro groups, nitriles, aromatic chlorines, thiophenes, sulfonamides, and thioureas
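The volume-based strain check in the process above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, coordinates are plain `(x, y, z)` tuples in Å, and the four tetrahedron vertices are taken to be the points 1 Å from the carbon along each bond direction.

```python
import math

STRAIN_VOLUME_THRESHOLD = 0.345  # Å^3, the cutoff quoted in the text


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def tetrahedron_volume(center, neighbors):
    """Volume of the tetrahedron spanned by points 1 Å along each of four bonds."""
    if len(neighbors) != 4:
        raise ValueError("expected exactly four bonded neighbors")
    # Unit vectors from the carbon toward each neighbor (vertices relative to center).
    p = [_unit(_sub(nb, center)) for nb in neighbors]
    a, b, c = _sub(p[1], p[0]), _sub(p[2], p[0]), _sub(p[3], p[0])
    # V = |det[a, b, c]| / 6
    det = (
        a[0] * (b[1] * c[2] - b[2] * c[1])
        - a[1] * (b[0] * c[2] - b[2] * c[0])
        + a[2] * (b[0] * c[1] - b[1] * c[0])
    )
    return abs(det) / 6.0


def is_strained(center, neighbors, threshold=STRAIN_VOLUME_THRESHOLD):
    return tetrahedron_volume(center, neighbors) < threshold


if __name__ == "__main__":
    origin = (0.0, 0.0, 0.0)
    ideal = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
    planar = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
    print(tetrahedron_volume(origin, ideal), is_strained(origin, ideal))
    print(tetrahedron_volume(origin, planar), is_strained(origin, planar))
```

An ideal tetrahedral carbon gives $V = 8 / (9\sqrt{3}) \approx 0.513 \text{ \AA}^3$, comfortably above the cutoff, while a fully planar center gives $V = 0$ and is rejected.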

Key Optimization: The computationally expensive MM2 minimization used in GDB-11 was replaced with a fast geometry-based estimation of strained polycyclic ring systems, combined with fast “element-ratio” filters. Applied early in the pipeline, this change made structure validation 6.4 times faster.

Differences from GDB-11

Element Selection: Fluorine removed from allowed elements; sulfur and chlorine added for higher drug relevance (e.g., thiophenes, sulfonamides).
Optimization Method: MM2-based structure optimization replaced with a much faster, custom geometry-based estimation of local strain (measuring the tetrahedron volume of carbon centers).
Heuristic Filters: Fast elemental ratio filters added to quickly reject highly polar, unstable combinations early in the pipeline.

Reproducibility Details

Paper & Data Availability

Paper Access: The original paper is published in the Journal of the American Chemical Society (JACS) and is closed-access/paywalled. No open-access preprint exists on arXiv or ChemRxiv.
Data Access: The full GDB-13 database and its subsets are freely available via the Reymond Group Downloads Page and are persistently hosted on Zenodo.

Artifacts

Artifact	Type	License	Notes
GDB-13 Database (Reymond Group)	Dataset	Free download	Official download page hosted by the Reymond Group
GDB-13 on Zenodo	Dataset	Unknown	Persistent archival copy

Source Code & Algorithms

The custom source code (e.g., GENG orchestration, local strain filters) is not publicly available. Researchers must re-implement the rules from their descriptions in the paper and supplementary materials.

Heuristic Filters

Element-ratio filters, derived from an analysis of known compound databases, reject chemically unstable or highly polar molecules early in the generation pipeline:

$$ \begin{aligned} \frac{N + O}{C} &< 1.0 \\ \frac{N}{C} &< 0.571 \\ \frac{O}{C} &< 0.666 \end{aligned} $$
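A minimal sketch of these three inequalities as a predicate on atom counts follows. The function name is hypothetical, and rejecting carbon-free compositions is an assumption made here only to avoid division by zero; the source does not say how such cases are handled.

```python
def passes_element_ratio_filters(n_carbon, n_nitrogen, n_oxygen):
    """Apply the (N+O)/C, N/C, and O/C limits quoted in the text."""
    if n_carbon == 0:
        # Assumption: compositions without carbon are rejected outright.
        return False
    return (
        (n_nitrogen + n_oxygen) / n_carbon < 1.0
        and n_nitrogen / n_carbon < 0.571
        and n_oxygen / n_carbon < 0.666
    )


if __name__ == "__main__":
    # Example GDB-13 molecule CCCC(O)(CO)CC1CC1CN: 10 C, 1 N, 2 O
    print(passes_element_ratio_filters(10, 1, 2))
    # Mannitol, C6H14O6: 6 C, 0 N, 6 O
    print(passes_element_ratio_filters(6, 0, 6))
```

The example molecule passes with ratios of 0.3, 0.1, and 0.2, while mannitol fails because both its O/C and (N+O)/C ratios equal 1.0.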

Excluded Functional Groups

O-O bonds (peroxides)
Hemiacetals, aminals, acyclic imines, non-aromatic enols
Compounds containing both primary/secondary amines and aldehydes/ketones
Nonenumerated elements (F, Br, I, P, Si, metals)
High-heteroatom ratio structures (e.g., mannitol)

Hardware & Compute

Compute Cost: ~40,000 CPU hours for the 910 million C/N/O structures.
Infrastructure: Executed in parallel on a 500-node cluster.
Assembly Optimization: Replacing MM2 minimization with geometry-based estimation of strained polycyclic ring systems, together with the element-ratio filters, cut assembly time 6.4-fold on the GDB-11 workload (from 1600 to 250 CPU hours).
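The compute figures above can be cross-checked with simple arithmetic. The wall-clock estimate is a rough lower bound that assumes one CPU per node and perfect parallel scaling, neither of which is stated in the source.

```python
# Figures quoted in the notes above
gdb11_mm2_cpu_hours = 1600
gdb11_fast_cpu_hours = 250
cno_cpu_hours = 40_000
cno_structures = 910.1e6
cluster_nodes = 500

# Speedup from replacing MM2 minimization on the GDB-11 workload
speedup = gdb11_mm2_cpu_hours / gdb11_fast_cpu_hours

# Average throughput for the C/N/O set
structures_per_cpu_hour = cno_structures / cno_cpu_hours

# Assumption: one CPU per node and perfect parallel scaling
ideal_wall_clock_hours = cno_cpu_hours / cluster_nodes

print(f"speedup: {speedup:.1f}x")
print(f"throughput: {structures_per_cpu_hour:,.0f} structures per CPU hour")
print(f"ideal wall-clock time: {ideal_wall_clock_hours:.0f} hours")
```

The 1600 and 250 CPU-hour figures reproduce the stated 6.4-fold speedup, and the C/N/O run works out to roughly 22,750 structures per CPU hour.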

Paper Information

Blum, L. C. and Reymond, J.-L. (2009). 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. Journal of the American Chemical Society, 131(25), 8732–8733. https://doi.org/10.1021/ja902302h

@article{blum2009gdb13,
  title={970 million druglike small molecules for virtual screening in the chemical universe database GDB-13},
  author={Blum, Lorenz C and Reymond, Jean-Louis},
  journal={Journal of the American Chemical Society},
  volume={131},
  number={25},
  pages={8732--8733},
  year={2009},
  publisher={ACS Publications},
  doi={10.1021/ja902302h}
}
