<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Molecular Databases &amp; Datasets on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/</link><description>Recent content in Molecular Databases &amp; Datasets on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 12 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/datasets/index.xml" rel="self" type="application/rss+xml"/><item><title>VQM24: 836k Molecules at DFT and Diffusion QMC</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/vqm24/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/vqm24/</guid><description>Dataset card for VQM24, providing DFT and diffusion QMC properties for 836k exhaustively enumerated small molecules across 9 elements.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>VQM24 (Vector-QM24) is the first exhaustive quantum mechanical dataset covering all possible neutral closed-shell small molecules with up to five heavy atoms from nine p-block elements (C, N, O, F, Si, P, S, Cl, Br). It provides DFT-level properties for all 836k structures and <a href="https://en.wikipedia.org/wiki/Diffusion_Monte_Carlo">diffusion quantum Monte Carlo</a> (DMC) energies for a 10,793-molecule subset, constituting the largest QMC dataset in chemical space to date. ML benchmarking reveals that VQM24 is significantly more challenging than <a href="/notes/chemistry/datasets/qm9/">QM9</a> despite containing smaller molecules.</p>
<h2 id="overview">Overview</h2>
<p>Most existing QM datasets (QM7, QM9, ANI-1x) are derived from string-based molecular lists and are restricted to a few elements (typically CHONF), introducing selection bias and limiting ML model generalizability. VQM24 addresses this by exhaustively enumerating all valid stoichiometries, <a href="https://en.wikipedia.org/wiki/Lewis_structure">Lewis-rule-consistent</a> graphs, and stable conformers for molecules composed of 9 elements with their most common valencies:</p>
<table>
  <thead>
      <tr>
          <th>Element</th>
          <th>Valencies</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>C</td>
          <td>4</td>
      </tr>
      <tr>
          <td>N</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>O</td>
          <td>2</td>
      </tr>
      <tr>
          <td>F</td>
          <td>1</td>
      </tr>
      <tr>
          <td>Si</td>
          <td>4</td>
      </tr>
      <tr>
          <td>P</td>
          <td>3, 5</td>
      </tr>
      <tr>
          <td>S</td>
          <td>2, 4, 6</td>
      </tr>
      <tr>
          <td>Cl</td>
          <td>1</td>
      </tr>
      <tr>
          <td>Br</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Heavy Atoms</th>
          <th>Stoichiometries</th>
          <th>Graphs</th>
          <th>Geometries</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>9</td>
          <td>9</td>
          <td>9</td>
      </tr>
      <tr>
          <td>2</td>
          <td>69</td>
          <td>69</td>
          <td>81</td>
      </tr>
      <tr>
          <td>3</td>
          <td>367</td>
          <td>766</td>
          <td>1,287</td>
      </tr>
      <tr>
          <td>4</td>
          <td>1,321</td>
          <td>10,992</td>
          <td>29,581</td>
      </tr>
      <tr>
          <td>5</td>
          <td>3,793</td>
          <td>246,406</td>
          <td>753,917</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>5,559</strong></td>
          <td><strong>258,242</strong></td>
          <td><strong>784,875</strong> (minima)</td>
      </tr>
  </tbody>
</table>
<p>Including saddle points, the full dataset contains 835,947 converged structures. Extrapolation suggests ~33 million geometries at 6 heavy atoms.</p>
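<p>The totals above can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the numbers quoted in the table and the preceding sentence; the names <code>SUBSETS</code> and <code>column_total</code> are illustrative and do not come from the dataset or its code.</p>

```python
# Counts copied from the "Dataset Subsets" table, keyed by number of heavy atoms.
SUBSETS = {
    1: {"stoichiometries": 9, "graphs": 9, "geometries": 9},
    2: {"stoichiometries": 69, "graphs": 69, "geometries": 81},
    3: {"stoichiometries": 367, "graphs": 766, "geometries": 1_287},
    4: {"stoichiometries": 1_321, "graphs": 10_992, "geometries": 29_581},
    5: {"stoichiometries": 3_793, "graphs": 246_406, "geometries": 753_917},
}
CONVERGED_TOTAL = 835_947  # minima plus saddle points, as quoted in the text


def column_total(column):
    """Sum one table column over all heavy-atom counts."""
    return sum(row[column] for row in SUBSETS.values())


totals = {name: column_total(name) for name in ("stoichiometries", "graphs", "geometries")}
saddle_points = CONVERGED_TOTAL - totals["geometries"]
conformers_per_graph = totals["geometries"] / totals["graphs"]

print(totals)
print("saddle points:", saddle_points)
print("mean conformers per graph:", round(conformers_per_graph, 2))
```

<p>The column sums reproduce the table's total row, the difference gives the number of saddle-point structures, and the ratio comes out at roughly three conformers per constitutional isomer.</p>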
<h2 id="generation-pipeline">Generation Pipeline</h2>
<ol>
<li><strong>Stoichiometry enumeration</strong>: All combinations of up to 5 heavy atoms from the 13 element/valency types, with hydrogen counts determined by integer partitioning of total valency</li>
<li><strong>Graph generation</strong>: <a href="https://en.wikipedia.org/wiki/Structural_isomer">Constitutional isomers</a> enumerated using <a href="/notes/chemistry/molecular-design/chemical-space/surge-chemical-graph-generator/">Surge</a> for each stoichiometry</li>
<li><strong>Geometry initialization</strong>: RDKit <a href="https://en.wikipedia.org/wiki/Merck_molecular_force_field">MMFF94</a> force field generates initial 3D coordinates</li>
<li><strong>Semi-empirical optimization</strong>: GFN2-xTB geometry optimization</li>
<li><strong>Conformer search</strong>: CREST identifies conformational isomers (~1.1M initial geometries)</li>
<li><strong>DFT optimization</strong>: Three-pass $\omega$B97X-D3/cc-pVDZ optimization in PSI4 v1.7, all using Gaussian Tight convergence criteria with density fitting (cc-pVDZ-JKFIT auxiliary basis):
<ul>
<li><strong>Pass 1</strong>: Default PSI4 settings (DIIS for SCF, RFO optimizer in redundant internal coordinates), max 100 steps</li>
<li><strong>Pass 2</strong>: SOSCF with full Newton step, ultrafine Lebedev-Treutler grid (590 spherical, 99 radial points), max 100 steps</li>
<li><strong>Pass 3</strong>: Full Hessian evaluation at initial geometry and every 20th step, Cartesian coordinates, max 50 steps</li>
</ul>
</li>
<li><strong>DMC calculations</strong>: Run for 10,793 lowest-energy conformers with up to 4 heavy atoms, using QMCPACK with PBE0/ccECP/cc-pVQZ trial wavefunctions. The Slater-Jastrow trial wavefunctions include Jastrow terms for 1-body (16 params/atom type, 8 Bohr cutoff), 2-body (20 params/spin-channel, 10 Bohr cutoff), and 3-body (26 params, 5 Bohr cutoff) interactions. DMC used a timestep of 0.001 a.u., 16,000 walkers, and 1,500 blocks of 40 imaginary time steps. Core electrons were handled with ccECP pseudopotentials under the determinant-localization approximation with t-moves (DLTM).</li>
</ol>
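<p>As a rough illustration of the first step, the sketch below enumerates unordered selections of heavy atoms from the 13 element/valency types. It is not the authors' enumeration code: the function names are hypothetical, and the hydrogen bound assumes an acyclic, singly bonded skeleton. The counts it prints are selections of heavy-atom types, not the stoichiometry counts in the table above, which also depend on the hydrogen partitioning.</p>

```python
from itertools import combinations_with_replacement

# Element/valency pairs copied from the valency table (13 types in total).
VALENCIES = {
    "C": (4,), "N": (3, 5), "O": (2,), "F": (1,), "Si": (4,),
    "P": (3, 5), "S": (2, 4, 6), "Cl": (1,), "Br": (1,),
}
ATOM_TYPES = [(element, v) for element, vals in VALENCIES.items() for v in vals]


def heavy_atom_multisets(n_heavy):
    """Yield every unordered selection of n_heavy element/valency types."""
    return combinations_with_replacement(ATOM_TYPES, n_heavy)


def max_hydrogens(multiset):
    """Upper bound on the H count for an acyclic, singly bonded skeleton.

    Each of the n - 1 skeleton bonds uses two valence slots; the remaining
    slots can hold hydrogens. A negative result means the selection cannot
    form a connected skeleton at all.
    """
    total_valency = sum(v for _, v in multiset)
    return total_valency - 2 * (len(multiset) - 1)


counts = {n: sum(1 for _ in heavy_atom_multisets(n)) for n in range(1, 6)}
print(len(ATOM_TYPES), counts)
print(max_hydrogens((("C", 4), ("O", 2))))  # C-O skeleton, as in methanol
```

<p>Hydrogen counts below the bound correspond to multiple bonds or rings, which is where the integer partitioning of the total valency comes in.</p>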
<p>The $\omega$B97X-D3 functional was chosen for its strong GMTKN55 benchmark performance and for compatibility with ANI-1, ANI-1x, OrbNet Denali, QMugs, SPICE, and MultiXC-QM9, all of which use $\omega$B97X variants with double-zeta basis sets. This enables transfer learning across datasets.</p>
<h2 id="data-files-and-access">Data Files and Access</h2>
<p>The Zenodo dataset contains separate .npz files, loadable via NumPy:</p>
<table>
  <thead>
      <tr>
          <th>File</th>
          <th>Contents</th>
          <th>Molecules</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>DFT_all.npz</code></td>
          <td>DFT properties for all conformational minima</td>
          <td>784,875</td>
      </tr>
      <tr>
          <td><code>DFT_uniques.npz</code></td>
          <td>DFT properties for constitutional isomers (most stable conformer)</td>
          <td>258,242</td>
      </tr>
      <tr>
          <td><code>DFT_saddles.npz</code></td>
          <td>DFT properties for saddle point structures</td>
          <td>51,072</td>
      </tr>
      <tr>
          <td><code>DMC.npz</code></td>
          <td>DMC total energies and error bars</td>
          <td>10,793</td>
      </tr>
      <tr>
          <td><code>wavefunctions.tar.gz</code></td>
          <td>Wavefunction .molden files (includes MO energies)</td>
          <td>~106.7 GB</td>
      </tr>
  </tbody>
</table>
<p>All molecules are ordered consistently across every array within a file. Properties are accessed by key:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#66d9ef">as</span> np
</span></span><span style="display:flex;"><span>data <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>load(<span style="color:#e6db74">&#39;DFT_all.npz&#39;</span>, allow_pickle<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>print(data<span style="color:#f92672">.</span>files)  <span style="color:#75715e"># list all available properties</span>
</span></span><span style="display:flex;"><span>freqs <span style="color:#f92672">=</span> data[<span style="color:#e6db74">&#39;freqs&#39;</span>]  <span style="color:#75715e"># vibrational frequencies</span>
</span></span></code></pre></div><h2 id="computed-properties">Computed Properties</h2>
<p>DFT ($\omega$B97X-D3/cc-pVDZ) properties and their NPZ access keys:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Unit</th>
          <th>Key</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Total energies</td>
          <td>Ha</td>
          <td><code>Etot</code></td>
      </tr>
      <tr>
          <td>Internal energies</td>
          <td>Ha</td>
          <td><code>U0</code></td>
      </tr>
      <tr>
          <td>Atomization energies</td>
          <td>Ha</td>
          <td><code>Eatomization</code></td>
      </tr>
      <tr>
          <td>Electron-electron energies</td>
          <td>Ha</td>
          <td><code>Eee</code></td>
      </tr>
      <tr>
          <td>Exchange-correlation energies</td>
          <td>Ha</td>
          <td><code>Exc</code></td>
      </tr>
      <tr>
          <td>Dispersion energy</td>
          <td>Ha</td>
          <td><code>Edisp</code></td>
      </tr>
      <tr>
          <td>HOMO-LUMO gap</td>
          <td>Ha</td>
          <td><code>gap</code></td>
      </tr>
      <tr>
          <td>Dipole moments</td>
          <td>a.u.</td>
          <td><code>dipole</code></td>
      </tr>
      <tr>
          <td>Quadrupole moments</td>
          <td>a.u.</td>
          <td><code>quadrupole</code></td>
      </tr>
      <tr>
          <td>Octupole moments</td>
          <td>a.u.</td>
          <td><code>octupole</code></td>
      </tr>
      <tr>
          <td>Hexadecapole moments</td>
          <td>a.u.</td>
          <td><code>hexadecapole</code></td>
      </tr>
      <tr>
          <td>Rotational constants</td>
          <td>MHz</td>
          <td><code>rots</code></td>
      </tr>
      <tr>
          <td>Vibrational modes</td>
          <td>Å</td>
          <td><code>vibmodes</code></td>
      </tr>
      <tr>
          <td>Vibrational frequencies</td>
          <td>cm$^{-1}$</td>
          <td><code>freqs</code></td>
      </tr>
      <tr>
          <td>Gibbs free energy (H)</td>
          <td>Ha</td>
          <td><code>G</code></td>
      </tr>
      <tr>
          <td>Internal (thermal) energy (H)</td>
          <td>Ha</td>
          <td><code>U298</code></td>
      </tr>
      <tr>
          <td>Enthalpy (H)</td>
          <td>Ha</td>
          <td><code>H</code></td>
      </tr>
      <tr>
          <td>ZPVE (H)</td>
          <td>Ha</td>
          <td><code>zpves</code></td>
      </tr>
      <tr>
          <td>Entropy (H)</td>
          <td>cal/(mol K)</td>
          <td><code>S</code></td>
      </tr>
      <tr>
          <td>Heat capacities (H)</td>
          <td>cal/(mol K)</td>
          <td><code>Cv</code>, <code>Cp</code></td>
      </tr>
      <tr>
          <td>Electrostatic potentials at nuclei</td>
          <td>a.u.</td>
          <td><code>Vesp</code></td>
      </tr>
      <tr>
          <td>Mulliken charges</td>
          <td>a.u.</td>
          <td><code>Qmulliken</code></td>
      </tr>
      <tr>
          <td>SMILES</td>
          <td></td>
          <td><code>graphs</code></td>
      </tr>
      <tr>
          <td>InChI strings</td>
          <td></td>
          <td><code>inchi</code></td>
      </tr>
  </tbody>
</table>
<p>(H) indicates thermodynamic properties computed via the harmonic approximation. Molecular orbital energies are available in the wavefunction .molden files.</p>
<p>DMC properties (<code>DMC.npz</code>) include total energy (<code>Etot</code>) and statistical error bar (<code>std</code>) for each molecule.</p>
<p>DMC energies (PBE0/ccECP/cc-pVQZ nodal surfaces, Slater-Jastrow trial wavefunctions) achieve an average statistical uncertainty of 0.4 mHa across ~2.3 billion samples per molecule.</p>
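<p>Because the DFT and DMC energies are stored in Hartree while ML errors are usually quoted in kcal/mol, a unit conversion helps put the 0.4 mHa figure in context. The sketch below uses the standard conversion factor, rounded to four decimals; the helper name is illustrative.</p>

```python
HARTREE_TO_KCAL_PER_MOL = 627.5095  # standard conversion factor, rounded


def mha_to_kcal_per_mol(value_mha):
    """Convert an energy from millihartree to kcal/mol."""
    return value_mha * 1e-3 * HARTREE_TO_KCAL_PER_MOL


# Average DMC statistical uncertainty quoted above.
dmc_uncertainty = mha_to_kcal_per_mol(0.4)

# The conventional 1 kcal/mol "chemical accuracy" threshold, expressed in mHa.
chemical_accuracy_mha = 1.0 / HARTREE_TO_KCAL_PER_MOL * 1e3

print(round(dmc_uncertainty, 3), "kcal/mol")
print(round(chemical_accuracy_mha, 3), "mHa")
```

<p>On this scale the average DMC error bar is about a quarter of the chemical-accuracy threshold.</p>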
<h2 id="ml-benchmarking-harder-than-qm9">ML Benchmarking: Harder Than QM9</h2>
<p>Learning curves for atomization energy prediction show that VQM24 is substantially more challenging than QM9 for all tested models:</p>
<ul>
<li>KRR models (CM, ACSF, LMBTR, FCHL19, cMBDF) and GNNs (SchNet, PaiNN) all show up to ~8x larger mean errors on VQM24 than on QM9 at the same training set size</li>
<li>None of the tested models achieve chemical accuracy (1 kcal/mol) on VQM24, even with 128k training molecules</li>
<li>The atomization energy range in VQM24 (1,545 kcal/mol) is smaller than that in QM9 (2,427 kcal/mol), so the higher errors reflect greater chemical diversity rather than a wider property range</li>
<li>For a fair comparison with QM9 (which has no conformational isomers), learning curves use only the 258k unique constitutional isomers from VQM24</li>
</ul>
<p><strong>Benchmark methodology</strong>: KRR models use an atomic Gaussian kernel with hyperparameters (length-scale $l$, regularizer $\lambda$) optimized via grid search and 5-fold cross-validation. Both GNNs (SchNet, PaiNN) use 128 atomic basis functions (589k total parameters), trained for 1,000 epochs with Adam (lr = $10^{-4}$). Test set size is 10,000 randomly selected molecules, with results averaged over 5 runs. Training and evaluation scripts are available in the <a href="https://github.com/dkhan42/VQM24">GitHub repository</a>.</p>
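<p>For orientation, KRR with a local (atomic) Gaussian kernel is commonly written as below. The exact kernel form is an assumption here and is not taken from the paper; in particular, some implementations restrict the double sum to atom pairs of the same element. Here $\mathbf{x}_a$ is the representation vector of atom $a$ in molecule $i$, $\mathbf{y}$ holds the training labels, and $\boldsymbol{\alpha}$ the regression weights.</p>
<p>$$K_{ij} = \sum_{a \in i} \sum_{b \in j} \exp\left(-\frac{\lVert \mathbf{x}_a - \mathbf{x}_b \rVert^2}{2 l^2}\right), \qquad \boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}, \qquad \hat{y}_q = \sum_i \alpha_i K_{qi}$$</p>
<p>In this form the grid search mentioned above scans the length-scale $l$, which sets how quickly atomic similarity decays, and the regularizer $\lambda$, which controls how closely the training labels are fitted.</p>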
<p>Prediction error analysis with the best KRR model (cMBDF, trained on 200k molecules using 4 disjoint training sets drawn from all 784,875 equilibrium geometries) yields an overall MAE of 0.75 kcal/mol (standard deviation 1.55 kcal/mol). The largest individual error reaches 167.3 kcal/mol, and the 25 largest outliers have a mean absolute error of 85.9 kcal/mol.</p>
<h2 id="strengths--limitations">Strengths &amp; Limitations</h2>
<p><strong>Strengths</strong>:</p>
<ul>
<li>Exhaustive coverage of 1-5 heavy atom chemical space across 9 elements</li>
<li>Both DFT and DMC-level data (largest QMC dataset in chemical space)</li>
<li>Includes conformational isomers (average 3 per constitutional isomer)</li>
<li>Extensive property set including wavefunctions and multipole moments up to hexadecapole</li>
<li>More challenging ML benchmark than QM9, exposing model limitations</li>
</ul>
<p><strong>Limitations</strong>:</p>
<ul>
<li>Limited to 5 heavy atoms (very small molecules)</li>
<li>262,542 structures (~24%) failed DFT convergence, with a strong silicon bias in failures</li>
<li>51,072 structures converged to saddle points rather than minima</li>
<li>DMC subset limited to 4 heavy atoms (10,793 molecules)</li>
<li>Does not include metals, rare gases, or heavier halogens (I)</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible</strong></p>
<p>The paper, dataset, and code are all publicly available.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://zenodo.org/records/15442257">VQM24 Dataset (Zenodo)</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>DFT .npz files + DMC .npz + wavefunction tarball (~108 GB total)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/dkhan42/VQM24">dkhan42/VQM24 (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Generation tools, PSI4 templates, KRR and GNN training scripts</td>
      </tr>
      <tr>
          <td><a href="https://arxiv.org/abs/2405.05961">arXiv preprint</a></td>
          <td>Paper</td>
          <td>arXiv license</td>
          <td>Open-access preprint of the Scientific Data article</td>
      </tr>
  </tbody>
</table>
<p><strong>Software stack</strong>: Surge (graph enumeration), RDKit/MMFF94 (initial geometries), GFN2-xTB (semi-empirical optimization), CREST (conformer search), PSI4 v1.7 (DFT), PySCF (trial wavefunctions), QMCPACK (DMC), QMLcode (KRR models), SchNetPack (GNN models).</p>
<p><strong>Hardware requirements</strong>:</p>
<ul>
<li>DFT: Three-pass $\omega$B97X-D3/cc-pVDZ optimization in PSI4 (per-molecule compute details are not specified)</li>
<li>DMC trial wavefunctions: Argonne LCRC Improv, single node (2x AMD EPYC 7713, 64 cores, 2 GHz), ~45 seconds per molecule, ~134 node-hours total</li>
<li>DMC calculations: Argonne Polaris HPC (AMD EPYC 7543P, 64 cores, 2.8 GHz), 20 nodes per molecule, ~15 minutes each, ~54,000 node-hours total</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{khan2025quantum,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Quantum mechanical dataset of 836k neutral closed-shell molecules
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">         with up to 5 heavy atoms from C, N, O, F, Si, P, S, Cl, Br}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Khan, Danish and Benali, Anouar and Kim, Scott Y. H.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">          and von Rudorff, Guido Falk and von Lilienfeld, O. Anatole}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Data}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1551}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Portfolio}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41597-025-05428-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>QM9: Quantum Chemistry Properties of 134k Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/qm9/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/qm9/</guid><description>Dataset card for QM9, providing DFT-computed geometric, electronic, and thermodynamic properties for 134k small organic molecules from GDB-9.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>QM9 provides a consistent, comprehensive set of quantum chemical properties for 133,885 small organic molecules (up to 9 heavy atoms of C, N, O, F) from the <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a> chemical universe. It is among the most widely used benchmark datasets in molecular machine learning, enabling systematic development and evaluation of structure-property prediction methods.</p>
<h2 id="overview">Overview</h2>
<p>The dataset corresponds to the GDB-9 subset of the GDB-17 chemical universe: all neutral molecules with up to nine heavy atoms (C, O, N, F), not counting hydrogen. Cations, anions, and molecules containing S, Br, Cl, or I were excluded, though 1,705 <a href="https://en.wikipedia.org/wiki/Zwitterion">zwitterions</a> (relevant for small biomolecules like amino acids) were retained. The dataset spans 621 stoichiometries. It includes small amino acids (glycine, alanine), nucleobases (cytosine, uracil, thymine), and pharmaceutically relevant building blocks (pyruvic acid, piperazine, hydroxy urea).</p>
<h2 id="computed-properties">Computed Properties</h2>
<p>All properties were calculated at the <a href="https://en.wikipedia.org/wiki/Hybrid_functionals">B3LYP</a>/6-31G(2df,p) level of DFT. The 15 scalar properties per molecule are:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Unit</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>A, B, C</td>
          <td>GHz</td>
          <td>Rotational constants</td>
      </tr>
      <tr>
          <td>$\mu$</td>
          <td>D</td>
          <td>Dipole moment</td>
      </tr>
      <tr>
          <td>$\alpha$</td>
          <td>$a_0^3$</td>
          <td>Isotropic polarizability</td>
      </tr>
      <tr>
          <td>$\varepsilon_{\text{HOMO}}$</td>
          <td>Ha</td>
          <td>HOMO energy</td>
      </tr>
      <tr>
          <td>$\varepsilon_{\text{LUMO}}$</td>
          <td>Ha</td>
          <td>LUMO energy</td>
      </tr>
      <tr>
          <td>$\varepsilon_{\text{gap}}$</td>
          <td>Ha</td>
          <td>HOMO-LUMO gap</td>
      </tr>
      <tr>
          <td>$\langle R^2 \rangle$</td>
          <td>$a_0^2$</td>
          <td>Electronic spatial extent</td>
      </tr>
      <tr>
          <td>ZPVE</td>
          <td>Ha</td>
          <td>Zero-point vibrational energy</td>
      </tr>
      <tr>
          <td>$U_0$</td>
          <td>Ha</td>
          <td>Internal energy at 0 K</td>
      </tr>
      <tr>
          <td>$U$</td>
          <td>Ha</td>
          <td>Internal energy at 298.15 K</td>
      </tr>
      <tr>
          <td>$H$</td>
          <td>Ha</td>
          <td>Enthalpy at 298.15 K</td>
      </tr>
      <tr>
          <td>$G$</td>
          <td>Ha</td>
          <td>Free energy at 298.15 K</td>
      </tr>
      <tr>
          <td>$C_v$</td>
          <td>cal/(mol K)</td>
          <td>Heat capacity at 298.15 K</td>
      </tr>
  </tbody>
</table>
<p>Each molecule is stored in an extended XYZ file. The first line gives the atom count, and the second (comment) line packs all 15 scalar properties. Lines 3 through $n_a + 2$ contain element type, Cartesian coordinates (x, y, z in Angstroms), and <a href="https://en.wikipedia.org/wiki/Mulliken_population_analysis">Mulliken partial charges</a> as a fifth column. Three trailing lines append harmonic vibrational frequencies ($3n_a - 5$ or $3n_a - 6$ modes, in cm$^{-1}$), <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings (from GDB-17 and from the B3LYP-relaxed geometry), and <a href="/notes/chemistry/molecular-representations/notations/inchi/">InChI</a> strings (from Corina and B3LYP geometries).</p>
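<p>The layout described above can be read with a short parser. This is a minimal sketch rather than an official reader: the sample record is synthetic (its numbers are invented), the property names are illustrative labels for the 15 scalars in table order, and two details are assumptions about the files, namely that the comment line may start with a tag and index before the 15 values and that some floats use a <code>*^</code> exponent marker.</p>

```python
PROPERTY_NAMES = [
    "A", "B", "C", "mu", "alpha", "homo", "lumo", "gap",
    "r2", "zpve", "U0", "U", "H", "G", "Cv",
]

# Synthetic three-atom record with invented values, for illustration only.
SAMPLE = """3
gdb 0 1.0 2.0 3.0 1.5 6.0 -0.3 0.1 0.4 19.0 0.02 -76.4 -76.39 -76.38 -76.41 6.0
O 0.0 0.0 0.1 -0.6
H 0.0 0.7 -0.4 0.3
H 0.0 -0.7 -0.4 3.0*^-1
1600.0 3700.0 3800.0
O O
InChI=1S/H2O/h1H2 InChI=1S/H2O/h1H2
"""


def to_float(token):
    """Parse a float, accepting the '*^' exponent marker (assumed variant)."""
    return float(token.replace("*^", "e"))


def parse_qm9_record(text):
    """Split one extended XYZ record into properties, atoms, and trailers."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    # Keep the last 15 tokens so a leading tag or index (assumed) is skipped.
    values = [to_float(t) for t in lines[1].split()[-len(PROPERTY_NAMES):]]
    atoms = []
    for line in lines[2:2 + n_atoms]:
        element, x, y, z, charge = line.split()
        atoms.append((element, to_float(x), to_float(y), to_float(z), to_float(charge)))
    return {
        "n_atoms": n_atoms,
        "properties": dict(zip(PROPERTY_NAMES, values)),
        "atoms": atoms,  # (element, x, y, z, Mulliken charge)
        "frequencies": [to_float(t) for t in lines[2 + n_atoms].split()],
        "smiles": lines[3 + n_atoms].split(),
        "inchi": lines[4 + n_atoms].split(),
    }


record = parse_qm9_record(SAMPLE)
print(record["n_atoms"], record["properties"]["gap"], record["atoms"][2])
```

<p>For a nonlinear three-atom molecule the frequency line holds $3n_a - 6 = 3$ values, which the sample record mirrors.</p>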
<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GDB-9 (Full)</strong></td>
          <td>133,885</td>
          <td>All molecules, B3LYP properties</td>
      </tr>
      <tr>
          <td><strong>C7H10O2 isomers</strong></td>
          <td>6,095</td>
          <td>Predominant stoichiometry, with additional G4MP2 energetics</td>
      </tr>
      <tr>
          <td><strong>Validation set</strong></td>
          <td>100</td>
          <td>Random subset with G4MP2, G4, and CBS-QB3 reference values</td>
      </tr>
  </tbody>
</table>
<h2 id="geometry-generation-pipeline">Geometry Generation Pipeline</h2>
<p>Starting from GDB-17 SMILES strings, initial 3D coordinates were generated with Corina, then relaxed at the PM7 semi-empirical level (<a href="https://en.wikipedia.org/wiki/MOPAC">MOPAC</a>), followed by B3LYP/6-31G(2df,p) geometry optimization (<a href="https://en.wikipedia.org/wiki/Gaussian_(software)">Gaussian 09</a>). A five-stage iterative convergence procedure handled difficult cases: default thresholds, then ultrafine grids, tighter SCF criteria, Hessian-guided optimization (calcfc), and full Hessian optimization (calcall). After all stages, 11 molecules still failed to converge to true minima (6 converged with loose thresholds, 2 near-linear molecules converged to saddle points with very low imaginary frequencies below $i10 \text{ cm}^{-1}$).</p>
<h2 id="validation">Validation</h2>
<p><strong>Geometry consistency</strong>: B3LYP-relaxed geometries were converted back to InChI strings and compared against the original GDB-17 InChI. 3,054 molecules failed this round-trip test, primarily due to implementation-specific artifacts in SMILES/InChI conversion rather than actual geometry problems. Coulomb-matrix distances between Corina and B3LYP geometries quantified the magnitude of geometric changes.</p>
<p><strong>Energy accuracy</strong>: For 100 randomly selected molecules, B3LYP atomization enthalpies were compared against higher-level composite methods. These reference methods are themselves near experimental accuracy: G4MP2 achieves MAE 1.0 and RMSE 1.5 kcal/mol against the G3/05 test set of 454 experimental energies, while G4 achieves MAE 0.8 and RMSE 1.2 kcal/mol on the same set. G4MP2 also deviates by only 1.4 kcal/mol from the highly accurate W1w composite procedure on 261 bond dissociation enthalpies (BDE261 dataset). Against these references, B3LYP shows:</p>
<table>
  <thead>
      <tr>
          <th>Reference</th>
          <th>MAE (kcal/mol)</th>
          <th>RMSE (kcal/mol)</th>
          <th>Max AE (kcal/mol)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>G4MP2</td>
          <td>5.0</td>
          <td>6.1</td>
          <td>16.0</td>
      </tr>
      <tr>
          <td>G4</td>
          <td>4.9</td>
          <td>5.9</td>
          <td>14.4</td>
      </tr>
      <tr>
          <td>CBS-QB3</td>
          <td>4.5</td>
          <td>5.5</td>
          <td>13.4</td>
      </tr>
  </tbody>
</table>
<p>All 6,095 C7H10O2 isomers passed the geometry consistency check, and their G4MP2-level energetics provide a higher-accuracy benchmark within a fixed stoichiometry.</p>
<h2 id="strengths--limitations">Strengths &amp; Limitations</h2>
<p><strong>Strengths</strong>:</p>
<ul>
<li>Comprehensive and consistent: same level of theory across all 134k molecules</li>
<li>Derived from a systematically enumerated chemical space (GDB-17), reducing selection bias</li>
<li>Rich property set covering geometric, electronic, energetic, and thermodynamic quantities</li>
<li>Widely adopted benchmark enabling reproducible comparisons across ML methods</li>
</ul>
<p><strong>Limitations</strong>:</p>
<ul>
<li>Restricted to very small molecules (up to 9 heavy atoms), limiting relevance to drug-sized compounds</li>
<li>Only CHONF elements, excluding sulfur, halogens (Cl, Br, I), and metals</li>
<li>B3LYP/6-31G(2df,p) has known systematic errors (~5 kcal/mol MAE for atomization enthalpies)</li>
<li>3,054 molecules have geometry consistency issues in SMILES/InChI round-tripping</li>
<li>Single conformer per molecule (energy-minimized geometry only)</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://figshare.com/collections/Quantum_chemistry_structures_and_properties_of_134_kilo_molecules/978904">Figshare collection</a></td>
          <td>Dataset</td>
          <td>CC BY-NC-SA 4.0</td>
          <td>Full dataset: 134k molecules, C7H10O2 isomers, validation set, atomic references</td>
      </tr>
  </tbody>
</table>
<p>The Figshare deposit contains four files:</p>
<ul>
<li><code>dsgdb9nsd.xyz.tar.bz2</code>: All 133,885 GDB-1 through GDB-9 molecules with B3LYP properties</li>
<li><code>dsC7O2H10nsd.xyz.tar.bz2</code>: 6,095 C7H10O2 constitutional isomers with G4MP2 energetics</li>
<li><code>validation.txt</code>: Atomization enthalpies at B3LYP, G4MP2, G4, and CBS-QB3 for 100 random molecules</li>
<li><code>atomref.txt</code>: Atomic reference energies for computing atomization energies from total energies</li>
</ul>
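<p>The role of <code>atomref.txt</code> can be sketched as follows. The reference energies and the molecular energy below are invented placeholders, not values from the file, and the sign convention (molecular energy minus the sum of atomic references, so bound molecules come out negative) is one common choice rather than something the deposit prescribes.</p>

```python
from collections import Counter

# Placeholder per-atom reference energies in Ha. These numbers are invented
# for illustration; real values must be read from atomref.txt.
PLACEHOLDER_ATOM_REFS = {"H": -0.5, "C": -37.8, "O": -75.0}


def atomization_energy(total_energy, elements, atom_refs):
    """Return the molecular energy minus the summed atomic references (Ha)."""
    counts = Counter(elements)
    reference_sum = sum(atom_refs[element] * n for element, n in counts.items())
    return total_energy - reference_sum


# Hypothetical water-like molecule with an invented total energy.
energy = atomization_energy(-76.4, ["O", "H", "H"], PLACEHOLDER_ATOM_REFS)
print(round(energy, 3))
```

<p>The same subtraction applies to each energy-like property ($U_0$, $U$, $H$, $G$), provided the reference values are taken at the matching level of theory and temperature.</p>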
<p>All data is in extended XYZ plain-text format. The paper and its metadata are open access (CC BY-NC-SA 4.0 for the article, CC0 for metadata).</p>
<p>No source code is provided. The computational pipeline relies on commercial and semi-commercial software: Corina (3D coordinate generation), MOPAC (PM7 semi-empirical relaxation), and Gaussian 09 (B3LYP DFT calculations). Specific convergence keywords and iteration procedures are documented in the paper. Hardware requirements are not reported.</p>
<p><strong>Reproducibility status</strong>: Partially Reproducible. The dataset itself is fully available, but regenerating it requires commercial licenses for Corina and Gaussian 09.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ramakrishnan2014quantum,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Quantum chemistry structures and properties of 134 kilo molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ramakrishnan, Raghunathan and Dral, Pavlo O. and Rupp, Matthias and von Lilienfeld, O. Anatole}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Data}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{140022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2014}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Portfolio}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/sdata.2014.22}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GDBMedChem: Drug-Like Subset of GDB-17 (10M Molecules)</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-medchem/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-medchem/</guid><description>Dataset card for GDBMedChem, 10 million drug-like molecules from GDB-17 filtered by medicinal chemistry criteria and evenly sampled.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>GDBMedChem is a 10 million molecule subset of <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a> selected using medicinal chemistry criteria rather than the fragment-likeness rules used for <a href="/notes/chemistry/datasets/fdb-17/">FDB-17</a>. The resulting database has reduced complexity and better synthetic accessibility than the full GDB-17, while retaining higher Fsp3 carbon fraction and natural product likeness compared to known drugs. Critically, 97% of its MHFP6 substructure shingles are absent from <a href="https://en.wikipedia.org/wiki/DrugBank">DrugBank</a>, <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, and ZINC, making it an unprecedented source of structural diversity for drug design.</p>
<h2 id="overview">Overview</h2>
<p>GDB-17 enumerates 166.4 billion molecules following chemical stability and synthetic feasibility rules, but does not consider medicinal chemistry criteria such as acceptable functional group types, overall structural complexity, or drug-likeness. GDBMedChem addresses this gap with a different filtering philosophy than FDB-17: instead of enforcing fragment-likeness (rotatable bond limits, small size), it applies medicinal chemistry-inspired rules that allow larger, more flexible molecules while excluding problematic functional groups and overly complex scaffolds.</p>
<h2 id="assembly-pipeline">Assembly Pipeline</h2>
<p><strong>Stage 1: Medicinal chemistry filters (166.4B to 17.8B, ~9.4x reduction)</strong></p>
<p>Three categories of filters, each benchmarked against ChEMBL, DrugBank, and UNPD (natural products) to ensure low elimination of known bioactives:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Key Filters</th>
          <th>GDB-17 Eliminated</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Functional groups</strong></td>
          <td>No amidines, imidates, aldehydes, aziridines, epoxides; no Br/I; no Cl/F on heterocycles; max 1 nitrile/alkyne/sulfone; max 2 ethers/amides/esters</td>
          <td>53%</td>
      </tr>
      <tr>
          <td><strong>Structural complexity</strong></td>
          <td>Max 18 avalon fingerprint density; max 1 cyclic tetravalent node; max 4 stereocenters; max 3 bonds in fused ring systems; max 3 rings</td>
          <td>62%</td>
      </tr>
      <tr>
          <td><strong>Polarity</strong></td>
          <td>Heteroatom-to-carbon ratio max 0.7</td>
          <td>6%</td>
      </tr>
      <tr>
          <td><strong>Combined</strong></td>
          <td>All filters together</td>
          <td>86%</td>
      </tr>
  </tbody>
</table>
<p>These filters eliminate 86% of GDB-17 but only 36% of ChEMBL molecules and 50% of DrugBank drugs (the higher DrugBank rate is driven mainly by the heteroatom-to-carbon ratio filter removing highly polar drugs with negative clogP values).</p>
<p>Of the 21 filters, 16 are implemented as SMARTS queries and 5 (stereocenters, ring count, avalon density, heteroatom-to-carbon ratio, largest aromatic ring size) use other <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> functions. Filters were applied progressively (simplest first), not in the order listed above. The benchmarking percentages for ChEMBL and DrugBank refer to ChEMBL 22 and DrugBank 5.011 molecules with HAC ≤ 17.</p>
<p><strong>Stage 2: Even sampling (17.8B to 10M)</strong></p>
<p>The 17,804,900,000 molecules in the filtered set are binned into 425 possible triplet combinations of HAC (1-17), heteroatoms (≤1, 2, 3, 4, ≥5), and stereocenters (0, 1, 2, 3, 4). Of these, 181 bins are unoccupied, leaving 244 bins. PySpark&rsquo;s <code>sampleBy</code> function performs stratified sampling without replacement, using a round-robin allocation that increments each bin&rsquo;s quota by one until the total reaches 10M. The resulting distribution is uniform except in low-HAC bins (HAC ≤ 10) where all available molecules are taken.</p>
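<p>The round-robin quota allocation described above can be sketched in plain Python. This is a minimal illustration, not the authors&rsquo; released code: the function names, the toy bin counts, and the step that converts quotas into per-bin fractions for <code>sampleBy</code> are all assumptions made for the example.</p>

```python
from typing import Dict, Hashable


def round_robin_quotas(bin_sizes: Dict[Hashable, int], target: int) -> Dict[Hashable, int]:
    """Raise every open bin's quota by one per pass until `target` is met."""
    quotas = {key: 0 for key in bin_sizes}
    remaining = min(target, sum(bin_sizes.values()))  # cannot draw more than exists
    open_bins = [key for key, size in bin_sizes.items() if size > 0]
    while remaining > 0 and open_bins:
        for key in open_bins:
            if remaining == 0:
                break
            quotas[key] += 1
            remaining -= 1
        # A bin closes once its quota equals the number of molecules it holds.
        open_bins = [key for key in open_bins if quotas[key] < bin_sizes[key]]
    return quotas


def quotas_to_fractions(bin_sizes, quotas):
    """Per-bin sampling fractions (hypothetical helper for a stratified sampler)."""
    return {key: quotas[key] / bin_sizes[key] for key in quotas if bin_sizes[key] > 0}


# Toy bins keyed by (HAC, heteroatoms, stereocenters); the counts are invented.
toy_bins = {(4, 1, 0): 2, (12, 3, 1): 10, (17, 5, 4): 100}
toy_quotas = round_robin_quotas(toy_bins, target=12)
toy_fractions = quotas_to_fractions(toy_bins, toy_quotas)
```

<p>In the toy run the smallest bin saturates and is taken entirely, while the two larger bins end with equal quotas, which mirrors the behavior reported for low-HAC bins. The one-step-at-a-time loop is written for clarity; at a 10M target a closed-form fill level would likely be preferable.</p>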
<h2 id="comparison-with-fdb-17">Comparison with FDB-17</h2>
<p>GDBMedChem and FDB-17 are both 10M-molecule subsets of GDB-17 but take fundamentally different approaches:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>GDBMedChem</th>
          <th>FDB-17</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Parent set</strong></td>
          <td>17.8B (medchem filters)</td>
          <td>4.6B (fragment filters)</td>
      </tr>
      <tr>
          <td><strong>Overlap</strong></td>
          <td>480M molecules shared between parent sets</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>Rotatable bonds</strong></td>
          <td>Similar to known drugs</td>
          <td>Restricted to max 3 (fragment-like)</td>
      </tr>
      <tr>
          <td><strong>Key difference</strong></td>
          <td>Drug-like flexibility, medchem FG rules</td>
          <td>Fragment-like rigidity, strict FG removal</td>
      </tr>
  </tbody>
</table>
<p>Both databases retain GDB-17&rsquo;s characteristic high Fsp3 fraction and 3D molecular shape diversity compared to predominantly planar known molecules.</p>
<h2 id="substructure-novelty">Substructure Novelty</h2>
<p>MHFP6 (<a href="https://en.wikipedia.org/wiki/MinHash">MinHash fingerprint</a> with diameter 6) shingle analysis reveals striking structural novelty:</p>
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>Molecules</th>
          <th>Unique Shingles</th>
          <th>Unique to Database</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GDBMedChem</strong></td>
          <td>10M</td>
          <td>17.3M</td>
          <td>97%</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>1.4M</td>
          <td>1.6M</td>
          <td>57%</td>
      </tr>
      <tr>
          <td>ZINC</td>
          <td>15M</td>
          <td>1.5M</td>
          <td>53%</td>
      </tr>
      <tr>
          <td>DrugBank</td>
          <td>8.3k</td>
          <td>82k</td>
          <td>12%</td>
      </tr>
  </tbody>
</table>
<p>GDBMedChem contains 17.3 million unique shingles, more than 10x the 1.5 million found in the 15-million-molecule <a href="/notes/chemistry/datasets/zinc-22/">ZINC database</a>, and 97% of them appear in no other database. The cumulative unique shingle count grows faster and more steadily with database size for GDBMedChem than for known molecule databases, reflecting greater internal diversity. Among the most frequent shingles, oxygen-containing saturated or singly unsaturated substructures dominate GDBMedChem, in contrast to aromatic and nitrogen heterocycles in ZINC.</p>
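<p>The &ldquo;Unique to Database&rdquo; column is a set difference: the share of a database&rsquo;s shingles that occur in none of the other databases. A minimal sketch follows, assuming each database has already been reduced to a set of shingle strings; the short strings below are invented stand-ins for rooted SMILES shingles, and the function name is hypothetical.</p>

```python
from typing import Dict, Set


def unique_fraction(shingles: Dict[str, Set[str]]) -> Dict[str, float]:
    """Fraction of each database's shingles that appear in no other database."""
    result = {}
    for name, own in shingles.items():
        # Pool the shingles of every other database into one set.
        others = set().union(*(s for other, s in shingles.items() if other != name))
        result[name] = len(own - others) / len(own) if own else 0.0
    return result


# Invented toy shingle sets, not real data.
toy = {
    "A": {"a", "b", "c", "d"},
    "B": {"c", "d", "e"},
    "C": {"d"},
}
toy_unique = unique_fraction(toy)
```

<p>At the scale of millions of shingles the same comparison would more likely be run on hashed shingles or in a distributed job, but the logic stays the same.</p>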
<h2 id="property-profiles">Property Profiles</h2>
<p>Compared to known drugs (DrugBank17, ChEMBL17):</p>
<ul>
<li><strong>Synthetic accessibility</strong>: Slightly better than GDB-17 due to complexity filters, but still lower than known molecules</li>
<li><strong>Natural product likeness</strong>: Significantly higher than drugs, approaching natural products (UNPD17)</li>
<li><strong>Fsp3 fraction</strong>: Higher than drugs, reflecting more 3D-shaped molecules</li>
<li><strong>Compound categories</strong>: Much higher fraction of heterocyclic molecules, much lower fraction of aromatic molecules (a consequence of combinatorial enumeration favoring heteroatom-in-ring combinations)</li>
</ul>
<h2 id="strengths--limitations">Strengths &amp; Limitations</h2>
<p><strong>Strengths</strong>:</p>
<ul>
<li>97% structurally novel substructures provide unprecedented diversity for drug design</li>
<li>Medicinal chemistry filters retain drug-relevant functional group patterns</li>
<li>Even sampling corrects GDB-17&rsquo;s combinatorial bias toward large, complex molecules</li>
<li>Higher Fsp3 and natural product likeness compared to known drugs</li>
<li>Available with interactive 3D visualization, MQN/MHFP6 similarity search, and download</li>
</ul>
<p><strong>Limitations</strong>:</p>
<ul>
<li>Synthetic accessibility scores remain lower than for known molecules</li>
<li>Excludes Br, I, and Cl/F on heterocycles, which are common in medicinal chemistry</li>
<li>Random sampling means specific molecules of interest from the 17.8B parent set may be absent</li>
<li>Overlap with FDB-17 is limited (different filtering philosophies), so both databases complement rather than replace each other</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="molecule-preprocessing">Molecule Preprocessing</h3>
<p>Before filtering, each molecule undergoes: counter-ion removal, largest-fragment retention, conversion to non-chiral SMILES, valence-error checking, and protonation at pH 7.4 (using ChemAxon JChem). Duplicates are removed by <a href="/notes/chemistry/molecular-representations/notations/smiles/">canonical SMILES</a> comparison within each database.</p>
<h3 id="reference-databases">Reference Databases</h3>
<p>The comparison databases used specific versions: ChEMBL 22 (1.4M compounds with HAC ≤ 50; 105,423 with HAC ≤ 17), DrugBank 5.011 (8,299 approved/experimental drugs with HAC ≤ 50; 2,284 with HAC ≤ 17), UNPD (20,302 natural products with HAC ≤ 17), and ZINC 12 (15M commercially available compounds).</p>
<h3 id="mhfp6-shingle-computation">MHFP6 Shingle Computation</h3>
<p>Shingles were computed using the <a href="https://github.com/reymond-group/mhfp"><code>mhfp</code> Python package</a> (also on <a href="https://pypi.org/project/mhfp/">PyPI</a>), specifically the <code>shingling_from_smiles</code> function from the <code>MHFPEncoder</code> class. Each shingle represents an extended-connectivity substructure around an atom with a diameter of up to 6 bonds, plus all ring structures, encoded as rooted SMILES strings.</p>
<h3 id="avalon-fingerprint-density">Avalon Fingerprint Density</h3>
<p>The avalon fingerprint density, used as the overall structural complexity filter (max 18), is defined as the number of on-bits in the avalon fingerprint scaled to the heavy atom count.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://gdb.unibe.ch/downloads/">GDBMedChem download</a></td>
          <td>Dataset</td>
          <td>Non-commercial (no patents, no redistribution)</td>
          <td>10M molecules in SMILES format</td>
      </tr>
      <tr>
          <td><a href="https://gdb.unibe.ch">GDB web tools</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>3D visualization, MQN/MHFP6 similarity search</td>
      </tr>
      <tr>
          <td><a href="https://github.com/reymond-group/mhfp"><code>mhfp</code> Python package</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>MHFP6 fingerprint and shingle computation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/reymond-group/pca">PCA visualization tools</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>MQN-to-3D PCA projection preprocessing</td>
      </tr>
  </tbody>
</table>
<p><strong>Status: Partially Reproducible.</strong> The dataset itself is publicly available for download, and the paper describes the filtering and sampling pipeline in detail (RDKit 2017_09_03, PySpark 2.3.2, 98-node cluster with 252 GB RAM). The <code>mhfp</code> package for shingle analysis is open-source. However, no standalone filtering/sampling code is released: reproducing the pipeline from scratch requires reimplementing the 16 SMARTS filters and 5 RDKit-based filters, plus the PySpark stratified sampling procedure. The molecule preprocessing step also depends on ChemAxon JChem (commercial) for pH 7.4 protonation and MQN calculation.</p>
<p>The paper is published in the closed-access journal <em>Molecular Informatics</em>. An open-access preprint is available on <a href="https://doi.org/10.26434/chemrxiv.7770809.v1">ChemRxiv</a>.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{awale2019medicinal,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Medicinal Chemistry Aware Database GDBMedChem}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Awale, Mahendra and Sirockin, Finton and Stiefl, Nikolaus and Reymond, Jean-Louis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Molecular Informatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{8-9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{e1900031}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Wiley}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1002/minf.201900031}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>FDB-17: Fragment Database (10M Molecules)</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/fdb-17/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/fdb-17/</guid><description>Dataset card for FDB-17, a 10 million fragment-like molecule subset of GDB-17 evenly sampled across size, polarity, and stereochemical complexity.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>FDB-17 is a curated subset of 10 million <a href="https://en.wikipedia.org/wiki/Fragment-based_lead_discovery">fragment-like</a> molecules extracted from the 166.4 billion molecules in <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a>. It corrects the combinatorial bias of exhaustive enumeration (which overwhelmingly produces large, complex molecules) by evenly sampling across molecular size, polarity, and stereochemical complexity. The result is a database sized for practical virtual screening tools while retaining GDB-17&rsquo;s distinctive 3D molecular shape diversity.</p>
<h2 id="overview">Overview</h2>
<p>GDB-17 exhaustively enumerates molecules up to 17 heavy atoms, but the combinatorial explosion means the database is dominated by the largest, most functionalized, and stereochemically most complex entries. This makes it impractical for most <a href="https://en.wikipedia.org/wiki/Virtual_screening">virtual screening</a> workflows and poorly suited for identifying simple, synthetically accessible fragments. FDB-17 addresses both problems through a two-stage reduction.</p>
<h2 id="assembly-pipeline">Assembly Pipeline</h2>
<p><strong>Stage 1: Fragment-likeness filters (166.4B to 4.6B, 36x reduction)</strong></p>
<p>Criteria limiting structural and functional group complexity:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Constraints</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Scaffolds</strong></td>
          <td>Max 3 rings, max 2 small (3/4-membered) rings, max 2 quaternary centers, max 4 stereocenters, max 3 rotatable bonds</td>
      </tr>
      <tr>
          <td><strong>FG density</strong></td>
          <td>Max 5 N+O atoms, max 1 positive/negative charge at neutral pH, max 3 HBA, max 2 HBD</td>
      </tr>
      <tr>
          <td><strong>Excluded groups</strong></td>
          <td>Aldehydes, epoxides, aziridines, carbonates, imidates, nitro groups, aromatic rings &gt;6 atoms; in addition, at most 1 cyano group is allowed</td>
      </tr>
      <tr>
          <td><strong>Removed elements</strong></td>
          <td>Non-aromatic C=C, C triple bonds, halogens (approximated by saturated C-C and methyl)</td>
      </tr>
  </tbody>
</table>
<p><strong>Stage 2: Even sampling (4.6B to 10M, 460x reduction)</strong></p>
<p>The 4.6B fragment subset is binned into 175 cells defined by value triplets of (HAC, heteroatoms, stereocenters):</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Bin values</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>HAC</strong></td>
          <td>≤11, 12, 13, 14, 15, 16, 17 (7 bins)</td>
      </tr>
      <tr>
          <td><strong>Heteroatoms (N+O+S)</strong></td>
          <td>≤1, 2, 3, 4, ≥5 (5 bins)</td>
      </tr>
      <tr>
          <td><strong>Stereocenters</strong></td>
          <td>0, 1, 2, 3, 4 (5 bins)</td>
      </tr>
  </tbody>
</table>
<p>Individual bins ranged from 3,359 to 446,322,188 molecules, reflecting the extreme combinatorial skew toward large, complex structures. Bins with ≤70,000 molecules are taken entirely; larger bins are randomly sampled to approximately 60,000 molecules each. The filtering was implemented in Java using ChemAxon&rsquo;s JChem libraries and executed on a 500-node cluster in 10,000 CPU hours. The resulting even distribution across molecular size, polarity, and complexity replaces the exponentially skewed distribution of the parent database.</p>
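<p>The binning and per-bin sampling rule can be sketched as follows. This is an in-memory toy, not the Java/JChem pipeline the authors ran on a cluster: the function names are hypothetical, and each molecule is assumed to arrive as a tuple of (SMILES, HAC, heteroatom count, stereocenter count).</p>

```python
import random
from collections import defaultdict

N_CELLS = 7 * 5 * 5  # HAC bins x heteroatom bins x stereocenter bins = 175


def bin_key(hac, heteroatoms, stereocenters):
    """Map raw counts onto the (HAC, heteroatoms, stereocenters) cell."""
    hac_bin = max(hac, 11)                     # everything with HAC <= 11 shares one bin
    het_bin = min(max(heteroatoms, 1), 5)      # <=1, 2, 3, 4, >=5
    stereo_bin = min(stereocenters, 4)         # 0..4 (filters already cap this at 4)
    return (hac_bin, het_bin, stereo_bin)


def even_sample(molecules, take_all_up_to=70_000, sample_size=60_000, seed=0):
    """Keep small bins whole; draw `sample_size` molecules at random from larger ones."""
    bins = defaultdict(list)
    for smiles, hac, heteroatoms, stereocenters in molecules:
        bins[bin_key(hac, heteroatoms, stereocenters)].append(smiles)
    rng = random.Random(seed)
    sampled = []
    for key in sorted(bins):
        members = bins[key]
        if len(members) <= take_all_up_to:
            sampled.extend(members)
        else:
            sampled.extend(rng.sample(members, sample_size))
    return sampled
```

<p>The default thresholds follow the numbers quoted above; passing smaller values makes the behavior easy to check on a handful of molecules.</p>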
<h2 id="property-profiles-vs-commercial-fragments">Property Profiles vs. Commercial Fragments</h2>
<p>FDB-17 was compared against 40,986 commercial fragments collected from 8 vendors (AnalytiCon, ChemBridge, Enamine, FRAGMENTA, BIONET, LifeChemical, Maybridge, Vitas) and filtered by Congreve&rsquo;s <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">rule of three</a> (mass ≤300, HBA ≤3, HBD ≤3, logP ≤3, RBC ≤3, PSA ≤60). Only 31% (12,847) of these commercial fragments appeared in the 4.6B fragment subset at all, due to functional groups absent from GDB-17 (halogens, thiols, azides, thioethers). After the random sampling step, only 2,740 remained in FDB-17: 6.7% of the full commercial set, or about 21% of the 12,847 that passed the fragment filters.</p>
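<p>The rule-of-three thresholds listed above translate directly into a predicate over precomputed descriptors. The sketch below assumes the descriptors are already available; the <code>Descriptors</code> container, its field names, and the example values are hypothetical, and no particular descriptor toolkit is implied.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    mass: float  # molecular mass
    hba: int     # hydrogen-bond acceptors
    hbd: int     # hydrogen-bond donors
    logp: float  # calculated logP
    rbc: int     # rotatable bond count
    psa: float   # polar surface area


def passes_rule_of_three(d: Descriptors) -> bool:
    """Thresholds as quoted in the text: mass<=300, HBA/HBD/logP/RBC<=3, PSA<=60."""
    return (
        d.mass <= 300
        and d.hba <= 3
        and d.hbd <= 3
        and d.logp <= 3
        and d.rbc <= 3
        and d.psa <= 60
    )


# Invented descriptor values, for illustration only.
fragment_ok = Descriptors(mass=180.2, hba=2, hbd=1, logp=1.4, rbc=2, psa=46.5)
fragment_too_polar = Descriptors(mass=180.2, hba=2, hbd=1, logp=1.4, rbc=2, psa=75.0)
```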
<p>Key differences:</p>
<ul>
<li><strong>Size and polarity</strong>: FDB-17&rsquo;s even sampling produces distributions comparable to commercial fragments, unlike the parent GDB-17 which peaks sharply at HAC = 17</li>
<li><strong>Compound categories</strong>: Half are heteroaromatic in both sets, but FDB-17&rsquo;s second half is predominantly heterocyclic vs. aromatic for commercial fragments</li>
<li><strong>3D character</strong>: FDB-17 retains GDB-17&rsquo;s coverage of the full PMI (principal moments of inertia) shape triangle, with a frequency peak at center-left (PMI computed from single low-energy CORINA conformers). Commercial fragments are predominantly planar. FDB-17 has significantly higher Fsp3 values</li>
<li><strong>Ring count</strong>: Fragment subsets of GDB-17 are enriched in 2- and 3-ring molecules (a consequence of the rotatable bond limit, which constrains monocyclic molecules more than polycyclic ones)</li>
</ul>
<h2 id="virtual-screening-validation">Virtual Screening Validation</h2>
<p>Nearest-neighbor searches were performed using two fingerprint spaces: MQN (42-dimensional molecular quantum numbers counting atoms, bonds, polarity, and topology) and Xfp (55-dimensional extended <a href="https://en.wikipedia.org/wiki/Pharmacophore">pharmacophore</a> fingerprint capturing shape and pharmacophore features). Four fragment-like drugs were used as queries: fencamfamine, gabapentin, rimantadine, and levetiracetam. For each drug, 10,000 nearest neighbors were retrieved and scored by 3D-shape similarity using ROCS (Rapid Overlay of Chemical Structures). 3D conformers were generated with OMEGA (all possible stereoisomers, keeping the highest-scoring one). Molecules with ROCS Tanimoto Combo &gt; 1.4 were considered virtual hits.</p>
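<p>The two-step protocol (fingerprint nearest neighbors, then a 3D-shape score cutoff) can be sketched as below. Two things here are assumptions rather than details from the text: the use of city-block (Manhattan) distance between count-based fingerprints, and the shape scores, which in practice would come from ROCS and are simply supplied as invented numbers. All function names are hypothetical.</p>

```python
import heapq


def city_block(a, b):
    """City-block (Manhattan) distance between two equal-length count vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_neighbors(query, library, k):
    """Return the k (name, fingerprint) entries closest to `query`."""
    return heapq.nsmallest(k, library, key=lambda item: city_block(query, item[1]))


def virtual_hits(scored, threshold=1.4):
    """Keep molecules whose shape score strictly exceeds the threshold."""
    return [name for name, score in scored if score > threshold]


# Toy 3-dimensional fingerprints; real MQN vectors have 42 dimensions.
query_fp = (1, 0, 2)
toy_library = [
    ("m1", (1, 0, 2)),
    ("m2", (3, 0, 2)),
    ("m3", (1, 1, 2)),
    ("m4", (9, 9, 9)),
]
neighbors = [name for name, _ in nearest_neighbors(query_fp, toy_library, k=2)]

# Invented shape scores standing in for externally computed values.
hits = virtual_hits([("m1", 1.62), ("m3", 1.40), ("m2", 1.41)])
```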
<p>FDB-17 delivered comparable numbers of virtual hits to the full 4.6B fragment subset and the entire GDB-17, despite being 460x and 16,640x smaller respectively. Both close analogs (high substructure similarity, Tsfp &gt; 0.7) and scaffold-hopping compounds (low substructure similarity but high shape similarity) were identified. Random sampling from FDB-17 and searches in the 41k commercial fragment set returned far fewer hits.</p>
<h2 id="strengths--limitations">Strengths &amp; Limitations</h2>
<p><strong>Strengths</strong>:</p>
<ul>
<li>Manageable size (10M) compatible with docking and 3D-shape virtual screening tools</li>
<li>Even coverage of molecular size, polarity, and complexity avoids combinatorial bias</li>
<li>High 3D shape diversity compared to predominantly flat commercial fragment libraries</li>
<li>Available with interactive visualization (MQN/SMIfp-mapplet) and web-based nearest neighbor search</li>
</ul>
<p><strong>Limitations</strong>:</p>
<ul>
<li>Only the 10M FDB-17 is released, not the 4.6B fragment-filtered intermediate. Practitioners who want a different sampling strategy or the full fragment subset cannot access it</li>
<li>Random sampling means specific molecules of interest from the 4.6B subset may be absent</li>
<li>Excludes halogens, non-aromatic unsaturations, and several functional group classes present in commercial fragments</li>
<li>Only 6.7% overlap with commercial fragments limits direct comparison</li>
<li>Still derived from GDB-17&rsquo;s enumeration rules, so molecules outside those rules (e.g., containing metals or larger rings) are excluded</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>FDB-17 is publicly available for download from the <a href="https://gdb.unibe.ch/downloads/">GDB project page</a> as a single SMILES file (62.2 MB), hosted on Zenodo. Interactive visualization via the MQN/SMIfp-mapplet and web-based nearest neighbor search tools are also accessible through the same site. The multi-fingerprint browser supports nearest-neighbor search across six fingerprints: MQN (42D), SMIfp (34D), APfp (21D), Xfp (55D), Sfp (1024-bit Daylight-type), and ECfp4 (1024-bit circular). The filtering code was written in Java using JChem libraries (ChemAxon) and executed on a 500-node cluster in 10,000 CPU hours. The filtering code itself is not publicly released. Virtual screening additionally requires OMEGA (conformer generation) and ROCS (3D-shape scoring), both commercial tools from OpenEye.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://gdb.unibe.ch/downloads/">FDB-17 SMILES</a></td>
          <td>Dataset</td>
          <td>Custom (no patents, no redistribution)</td>
          <td>10M fragment-like molecules from GDB-17</td>
      </tr>
      <tr>
          <td><a href="https://gdb.unibe.ch/downloads/">MQN/SMIfp-mapplet</a></td>
          <td>Other</td>
          <td>Web tool</td>
          <td>Interactive PCA visualization on 1000x1000 grids</td>
      </tr>
      <tr>
          <td><a href="https://gdb.unibe.ch/downloads/">Multi-fingerprint browser</a></td>
          <td>Other</td>
          <td>Web tool</td>
          <td>Nearest neighbor search across 6 fingerprints (MQN, SMIfp, APfp, Xfp, Sfp, ECfp4)</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. The 10M FDB-17 is freely downloadable, but the 4.6B fragment-filtered intermediate is not released. The filtering criteria are fully documented, but the Java filtering code is not released and depends on proprietary ChemAxon libraries. Reproducing the virtual screening experiments requires commercial tools (OMEGA, ROCS from OpenEye; CORINA for PMI analysis).</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{visini2017fragment,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Fragment Database FDB-17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Visini, Ricardo and Awale, Mahendra and Reymond, Jean-Louis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{57}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{700--709}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.7b00020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ZINC-22: A Multi-Billion Scale Database for Ligand Discovery</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/zinc-22/</link><pubDate>Sat, 27 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/zinc-22/</guid><description>The ZINC-22 dataset provides over 37 billion make-on-demand molecules enabling virtual screening and modern drug discovery.</description><content:encoded><![CDATA[<h2 id="key-contribution-scaling-make-on-demand-libraries">Key Contribution: Scaling Make-on-Demand Libraries</h2>
<p>ZINC-22 addresses the critical infrastructure challenges of managing multi-billion-scale libraries of make-on-demand chemical compounds through a federated database architecture, the CartBlanche web interface, and cloud distribution systems that enable modern virtual screening.</p>
<h2 id="overview">Overview</h2>
<p>ZINC-22 is a multi-billion scale public database of commercially available chemical compounds designed for virtual screening. It contains over 37 billion make-on-demand molecules and uses a federated, distributed infrastructure to stay within the indexing limits of any single database. For structural biology pipelines, it provides 4.5 billion ready-to-dock 3D conformations alongside pre-calculated pH-specific protonation states, tautomers, and AMSOL partial charges.</p>
<h2 id="dataset-examples">Dataset Examples</h2>
<figure class="post-figure center ">
    <img src="/img/zinc-22-sample.webp"
         alt="ZINC-22&#39;s 2D Tranche Browser"
         title="ZINC-22&#39;s 2D Tranche Browser"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">ZINC-22&rsquo;s 2D Tranche Browser showing the organization of 37.2 billion molecules by physicochemical properties</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2D Database</strong></td>
          <td>37B+</td>
          <td>Complete 2D chemical structures from make-on-demand catalogs (Enamine REAL, Enamine REAL Space, WuXi GalaXi, Mcule Ultimate)</td>
      </tr>
      <tr>
          <td><strong>3D Database</strong></td>
          <td>4.5B+</td>
          <td>Ready-to-dock 3D conformations with pre-calculated charges and solvation energies</td>
      </tr>
      <tr>
          <td><strong>Custom Tranches</strong></td>
          <td>Variable</td>
          <td>User-selected molecular subsets via Tranche Browser (e.g., lead-like, fragment-like)</td>
      </tr>
  </tbody>
</table>
<h2 id="use-cases">Use Cases</h2>
<p>ZINC-22 is designed for ultra-large virtual screening (ULVS), analog searching, and molecular docking campaigns. The Tranche Browser enables targeted subset selection (e.g., lead-like, fragment-like) for screening, and the CartBlanche interface supports both interactive and programmatic access to the database. The authors note that as the database grows, docking can identify better-fitting molecules.</p>
<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ZINC-20</strong></td>
          <td>Predecessor</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>Enamine REAL</strong></td>
          <td>Source catalog</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>WuXi GalaXi</strong></td>
          <td>Source catalog</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>Massive scale</strong>: 37+ billion purchasable compounds from major vendors (Enamine, WuXi, Mcule)</li>
<li><strong>Federated architecture</strong>: Supports asynchronous building and horizontal scaling to trillion-molecule growth</li>
<li><strong>Platform access</strong>: CartBlanche GUI provides a shopping cart metaphor for compound acquisition</li>
<li><strong>Privacy protection</strong>: Dual public/private server clusters protect patentability of undisclosed catalogs</li>
<li><strong>Chemical diversity</strong>: Linear growth (1 new scaffold per 10 molecules added), with 96.3M+ unique Bemis-Murcko scaffolds</li>
<li><strong>Ready-to-dock</strong>: 3D models include pre-calculated charges, protonation states, and solvation energies</li>
<li><strong>Cloud distribution</strong>: Available via AWS Open Data, Oracle OCI, and UCSF servers</li>
<li><strong>Scale-aware search</strong>: SmallWorld (similarity) and Arthor (substructure) tools partitioned to address specific constraints of billion-scale queries</li>
<li><strong>Organized access</strong>: Tranche system enables targeted selection of chemical space</li>
<li><strong>Open access</strong>: Entire database freely available to academic and commercial users</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Data Transfer Bottlenecks</strong>: Distributing 4.5 billion 3D alignments in standard rigid format (like db2 flexibase) requires roughly 1 Petabyte of storage. Transferring this takes months over standard gigabit connections, effectively mandating cloud-based compilation and rendering local copies impractical.</li>
<li><strong>Search Result Caps</strong>: Interactive Arthor searches are capped at 20,000 molecules to maintain a reliable public service. Users needing more results can use the asynchronous Arthor search tool via TLDR, which sends results by email.</li>
<li><strong>Enumeration Ceiling</strong>: Scaling relies entirely on PostgreSQL sharding. To continue using rigid docking tools, the database must fully enumerate structural states. The authors acknowledge that hardware limitations will likely cap full database enumeration well before the 10-trillion molecule mark, forcing future pipelines to accommodate unenumerated combinatorial fragment spaces.</li>
<li><strong>Download Workflow</strong>: Individual 3D molecule downloads are unavailable directly; researchers must rebuild them via the TLDR tool.</li>
<li><strong>Vendor Updates</strong>: Discontinued vendor molecules are difficult to remove because of the federated database structure.</li>
</ul>
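<p>The "months over gigabit" estimate in the first limitation can be checked with simple arithmetic. The sketch below assumes a decimal petabyte and a nominal 1 Gbit/s line rate. The 70% utilization figure is an illustrative assumption, not a number from the paper.</p>

```python
def transfer_days(size_bytes: float, link_bits_per_second: float, efficiency: float = 1.0) -> float:
    """Days needed to move size_bytes over a link used at the given efficiency."""
    seconds = (size_bytes * 8) / (link_bits_per_second * efficiency)
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15            # decimal petabyte, in bytes
GIGABIT_PER_SECOND = 10**9   # nominal line rate, in bits per second

ideal = transfer_days(PETABYTE, GIGABIT_PER_SECOND)
# Assumed 70% sustained utilization (illustrative only).
realistic = transfer_days(PETABYTE, GIGABIT_PER_SECOND, efficiency=0.7)

print(f"ideal: {ideal:.1f} days, at 70% utilization: {realistic:.1f} days")
```

<p>Even at full line rate the transfer takes about three months (roughly 93 days), and a little over four months at the assumed 70% utilization. This is consistent with the claim that local copies of the full 3D set are impractical.</p>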
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="hardware--software">Hardware &amp; Software</h3>
<p><strong>Compute infrastructure</strong>:</p>
<ul>
<li>1,700 cores across 14 computers for parallel processing</li>
<li>174 independent PostgreSQL 12.2 databases (110 &lsquo;Sn&rsquo; for ZINC-ID, 64 &lsquo;Sb&rsquo; for Supplier Codes)</li>
<li>Distributed across Amazon AWS, Oracle OCI, and UCSF servers</li>
</ul>
<p><strong>Software stack</strong>:</p>
<ul>
<li>PostgreSQL 12.2</li>
<li>Python 3.6.8</li>
<li>RDKit 2020.03</li>
<li>Celery task queue with Redis for background processing</li>
<li>All code available on GitHub: docking-org/zinc22-2d, zinc22-3d</li>
</ul>
<h3 id="data-organization--access">Data Organization &amp; Access</h3>
<p><strong>Tranche system</strong>: Molecules organized into &ldquo;Tranches&rdquo; based on 4 dimensions:</p>
<ol>
<li>Heavy Atom Count</li>
<li>Lipophilicity (LogP)</li>
<li>Charge</li>
<li>File Format</li>
</ol>
<p>This enables downloading specific chemical neighborhoods (e.g., neutral lead-like molecules) without accessing the entire database.</p>
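<p>As a rough illustration of how tranche-based selection works, the sketch below models a tranche as a record with the four dimensions listed above and filters a toy grid. The <code>Tranche</code> class, the <code>select_tranches</code> helper, the bin edges, the format labels, and the "lead-like" bounds are all hypothetical choices for this example. They do not reflect the actual ZINC-22 tranche naming or download API.</p>

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Tranche:
    heavy_atoms: int   # heavy atom count of the bin
    logp: float        # lower edge of the LogP bin
    charge: int        # net formal charge
    file_format: str   # label for the file type (illustrative)


def select_tranches(tranches, hac_range, logp_range, charge, file_format):
    """Return the tranches whose four dimensions fall inside the requested window."""
    hac_lo, hac_hi = hac_range
    logp_lo, logp_hi = logp_range
    return [
        t for t in tranches
        if hac_lo <= t.heavy_atoms <= hac_hi
        and logp_lo <= t.logp <= logp_hi
        and t.charge == charge
        and t.file_format == file_format
    ]


# A toy grid of tranches; bin edges and labels are made up for illustration.
grid = [
    Tranche(h, p, c, f)
    for h, p, c, f in product(
        range(10, 26), (-1.0, 0.0, 1.0, 2.0, 3.0, 4.0), (-1, 0, 1), ("smi", "db2")
    )
]

# Hypothetical "neutral lead-like" window.
lead_like = select_tranches(
    grid, hac_range=(17, 25), logp_range=(-1.0, 3.0), charge=0, file_format="smi"
)
print(len(lead_like), "of", len(grid), "tranches selected")
```

<p>Because the selection happens on tranche metadata, only the matching files need to be fetched, which is the point of organizing the database this way.</p>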
<p><strong>Search infrastructure</strong>:
At the billion-molecule scale, search indexes exceed the capacity of fast-access memory (RAM). ZINC-22 therefore splits retrieval between two complementary search tools, both exposed through a web interface:</p>
<ul>
<li>
<p><strong>SmallWorld</strong>: Handles whole-molecule similarity using Graph Edit Distance (GED). GED defines the minimum cost of operations (node/edge insertions, deletions, or substitutions) required to transform graph $G_1$ into graph $G_2$:</p>
<p>$$
\text{GED}(G_1, G_2) = \min_{(e_1, \ldots, e_k) \in \mathcal{P}(G_1, G_2)} \sum_{i=1}^k c(e_i)
$$</p>
<p>Because SmallWorld searches pre-calculated anonymous graphs, it evaluates close neighbors in near $\mathcal{O}(1)$ time and scales sub-linearly, though it struggles with highly distant structural matches.</p>
</li>
<li>
<p><strong>Arthor</strong>: Provides exact substructure and pattern matching. It scales linearly $\mathcal{O}(N)$ with database size and successfully finds distant hits (e.g., PAINS filters), but performance heavily degrades if the index exceeds available RAM.</p>
</li>
<li>
<p><strong>CartBlanche</strong>: Web interface wrapping these search tools with shopping cart functionality.</p>
</li>
</ul>
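<p>To make the GED definition above concrete, here is a minimal brute-force sketch for very small node-labeled graphs with unit edit costs. It is not how SmallWorld works (SmallWorld relies on pre-calculated anonymous graphs). The function name and graph encoding are assumptions made for this example, and the exhaustive search is only feasible for a handful of atoms.</p>

```python
from itertools import permutations


def graph_edit_distance(labels1, edges1, labels2, edges2):
    """Exact unit-cost GED between two small, simple, node-labeled graphs.

    labels*: dict mapping node id to a label (e.g. an element symbol).
    edges*: iterable of 2-tuples of node ids (undirected, no self-loops).
    """
    nodes1, nodes2 = list(labels1), list(labels2)
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    # Each G1 node maps to a distinct G2 node or to None (meaning "deleted").
    targets = nodes2 + [None] * len(nodes1)
    best = float("inf")
    for assignment in set(permutations(targets, len(nodes1))):
        mapping = dict(zip(nodes1, assignment))
        cost = 0
        for u, v in mapping.items():
            if v is None:
                cost += 1                      # node deletion
            elif labels1[u] != labels2[v]:
                cost += 1                      # node substitution
        used = {v for v in assignment if v is not None}
        cost += len(nodes2) - len(used)        # node insertions
        preserved = set()
        for edge in e1:
            a, b = tuple(edge)
            image = frozenset((mapping[a], mapping[b]))
            if None not in image and image in e2:
                preserved.add(image)           # edge kept as-is
            else:
                cost += 1                      # edge deletion
        cost += len(e2) - len(preserved)       # edge insertions
        best = min(best, cost)
    return best


# Heavy-atom graphs: ethane (C-C), ethanol (C-C-O), ethylamine (C-C-N).
ethane = ({0: "C", 1: "C"}, [(0, 1)])
ethanol = ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
ethylamine = ({0: "C", 1: "C", 2: "N"}, [(0, 1), (1, 2)])

print(graph_edit_distance(*ethanol, *ethylamine))  # one atom substitution
print(graph_edit_distance(*ethane, *ethanol))      # one atom plus one bond inserted
```

<p>The factorial cost of this exhaustive search shows why a production system cannot compute GED on the fly against billions of molecules and must rely on precomputation instead.</p>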
<h3 id="3d-generation-pipeline">3D Generation Pipeline</h3>
<p>The 3D database construction pipeline involves multiple specialized tools:</p>
<ol>
<li><strong>ChemAxon JChem</strong>: Protonation state and tautomer generation at physiological pH</li>
<li><strong>Corina</strong>: Initial 3D structure generation</li>
<li><strong>Omega</strong>: Conformation sampling</li>
<li><strong>AMSOL 7.1</strong>: Calculation of atomic partial charges and desolvation energies</li>
<li><strong>Strain calculation</strong>: Relative energies of conformations</li>
</ol>
<p>At sustained throughput, the pipeline builds approximately 11 million molecules per day, each with hundreds of pre-calculated conformations.</p>
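<p>A back-of-the-envelope calculation puts this throughput in context. The sketch below uses the 11 million molecules per day quoted above, the 4.5 billion 3D molecules mentioned under Limitations, and the 1,700 cores listed under compute infrastructure. Treating all 1,700 cores as dedicated to 3D building is a simplifying assumption.</p>

```python
MOLECULES_PER_DAY = 11_000_000        # sustained throughput quoted in the text
TARGET_3D_MOLECULES = 4_500_000_000   # 3D library size quoted under Limitations
CORES = 1_700                         # assumed to be fully dedicated to 3D building


def build_days(n_molecules, per_day=MOLECULES_PER_DAY):
    """Days of sustained building needed for n_molecules."""
    return n_molecules / per_day


def core_seconds_per_molecule(per_day=MOLECULES_PER_DAY, cores=CORES):
    """Average core-seconds spent per molecule at the given throughput."""
    return cores * 86_400 / per_day


print(f"{build_days(TARGET_3D_MOLECULES):.0f} days to build 4.5 billion molecules")
print(f"{core_seconds_per_molecule():.1f} core-seconds per molecule")
```

<p>Under these assumptions, building 4.5 billion molecules takes a little over a year (about 409 days) of sustained operation, at roughly 13 core-seconds per molecule.</p>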
<h3 id="chemical-diversity-analysis">Chemical Diversity Analysis</h3>
<p>A core debate in billion-scale library generation is whether continued enumeration merely yields repetitive derivatives. Analysis of Bemis-Murcko (BM) scaffolds shows that chemical diversity in ZINC-22 continues to grow, but sub-linearly, following a power law. Specifically, the authors observe a one-log-unit increase in BM scaffolds for every two-log-unit increase in database size, up to a constant offset $c$:</p>
<p>$$
\log(\text{Scaffolds}_{BM}) \approx 0.5 \log(\text{Molecules}) + c
$$</p>
<p>This suggests that while diversity does not saturate, it grows proportionally to the square root of the library size ($\mathcal{O}(\sqrt{N})$). The majority of this scaffold novelty stems from compounds with the highest heavy atom counts (HAC 24-25), which contribute roughly twice as many unique core structures as the combined HAC 06-23 subset.</p>
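<p>The practical consequence of a square-root law is easiest to see numerically. The following sketch assumes the exponent of 0.5 holds exactly across the whole size range, which is an idealization. The function name is illustrative.</p>

```python
def scaffold_growth(molecule_growth: float, exponent: float = 0.5) -> float:
    """Factor by which scaffold count grows when the library grows by molecule_growth."""
    return molecule_growth ** exponent


for factor in (10, 100, 10_000):
    print(f"{factor:6d}x molecules gives about {scaffold_growth(factor):.1f}x scaffolds")
```

<p>Under this idealized law, a 100-fold larger library yields about 10 times as many scaffolds, so each added molecule contributes less scaffold novelty on average even though the total keeps rising.</p>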
<h3 id="vendor-integration">Vendor Integration</h3>
<p>ZINC-22 is built from five source catalogs with the following approximate sizes:</p>
<ul>
<li><strong>Enamine REAL Database</strong>: 5 billion compounds</li>
<li><strong>Enamine REAL Space</strong>: 29 billion compounds</li>
<li><strong>WuXi GalaXi</strong>: 2.5 billion compounds</li>
<li><strong>Mcule Ultimate</strong>: 128 million compounds</li>
<li><strong>ZINC20 in-stock</strong>: 4 million compounds (incorporated as layer &ldquo;g&rdquo;)</li>
</ul>
<p>This focus on purchasable, make-on-demand molecules distinguishes ZINC-22 from theoretical chemical space databases. ZINC20 continues to be maintained separately for smaller catalogs and in-stock compounds.</p>
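<p>Summing the approximate catalog sizes listed above gives a quick sense of where the bulk of the database comes from. The figures are the rounded values from the list, and the sum ignores any overlap or filtering between catalogs, so it is only a rough check.</p>

```python
CATALOG_SIZES = {
    "Enamine REAL Database": 5_000_000_000,
    "Enamine REAL Space": 29_000_000_000,
    "WuXi GalaXi": 2_500_000_000,
    "Mcule Ultimate": 128_000_000,
    "ZINC20 in-stock": 4_000_000,
}

total = sum(CATALOG_SIZES.values())
shares = {name: size / total for name, size in CATALOG_SIZES.items()}

# Print catalogs from largest to smallest with their share of the total.
for name, size in sorted(CATALOG_SIZES.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} {size / 1e9:8.3f} B  {100 * shares[name]:5.1f}%")
print(f"{'Total':24s} {total / 1e9:8.3f} B")
```

<p>The rounded sizes add up to about 36.6 billion, close to the roughly 37 billion quoted for the full database, with Enamine REAL Space alone accounting for about 79% of the total.</p>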
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://cartblanche22.docking.org/">CartBlanche web interface</a></td>
          <td>Dataset</td>
          <td>Free access</td>
          <td>Web GUI for searching and downloading ZINC-22</td>
      </tr>
      <tr>
          <td><a href="https://github.com/docking-org/zinc22-2d">docking-org/zinc22-2d</a></td>
          <td>Code</td>
          <td>BSD-3-Clause</td>
          <td>2D curation and loading pipeline</td>
      </tr>
      <tr>
          <td><a href="https://github.com/docking-org/zinc22-3d">docking-org/zinc22-3d</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>3D building pipeline</td>
      </tr>
      <tr>
          <td><a href="https://github.com/docking-org/cartblanche22">docking-org/cartblanche22</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>CartBlanche22 web application</td>
      </tr>
      <tr>
          <td>AWS Open Data / Oracle OCI</td>
          <td>Dataset</td>
          <td>Free access</td>
          <td>Cloud-hosted 3D database mirrors</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Data Availability</strong>: The compiled database is openly accessible and searchable through the <a href="https://cartblanche22.docking.org/">CartBlanche web interface</a>. Subsets can be downloaded, and programmatic access is provided via curl, wget, and Globus.</li>
<li><strong>Code &amp; Algorithms</strong>: The source code for database construction, parallel processing, and querying is open-source.
<ul>
<li>2D Pipeline: <a href="https://github.com/docking-org/zinc22-2d">docking-org/zinc22-2d</a></li>
<li>3D Pipeline: <a href="https://github.com/docking-org/zinc22-3d">docking-org/zinc22-3d</a></li>
<li>CartBlanche: <a href="https://github.com/docking-org/cartblanche22">docking-org/cartblanche22</a></li>
<li>TLDR modules: docking-org/TLDR and docking-org/tldr-modules (repositories no longer available)</li>
</ul>
</li>
<li><strong>Software Dependencies</strong>: While the orchestration code is public, the 3D structure generation relies on commercial software that requires separate licenses (CORINA, OpenEye OMEGA, ChemAxon JChem). This limits end-to-end reproducibility for researchers without access to these tools.</li>
<li><strong>Hardware Limitations</strong>: Recreating the entire 37+ billion molecule database from raw vendor catalogs requires approximately 1,700 CPU cores and petabytes of data transfer, restricting full recreation to large institutional clusters or substantial cloud compute budgets.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p>Tingle, B. I., Tang, K. G., Castanon, M., Gutierrez, J. J., Khurelbaatar, M., Dandarchuluun, C., Moroz, Y. S., and Irwin, J. J. (2023). ZINC-22: A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery. <em>Journal of Chemical Information and Modeling</em>, 63(4), 1166&ndash;1176. <a href="https://doi.org/10.1021/acs.jcim.2c01253">https://doi.org/10.1021/acs.jcim.2c01253</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Tingle_2023,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ZINC-22: A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{63}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">ISSN</span>=<span style="color:#e6db74">{1549-960X}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{http://dx.doi.org/10.1021/acs.jcim.2c01253}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">DOI</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c01253}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society (ACS)}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Tingle, Benjamin I. and Tang, Khanh G. and Castanon, Mar and Gutierrez, John J. and Khurelbaatar, Munkhzul and Dandarchuluun, Chinzorig and Moroz, Yurii S. and Irwin, John J.}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">month</span>=<span style="color:#e6db74">{Feb}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1166--1176}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MARCEL: Molecular Conformer Ensemble Learning Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/marcel/</link><pubDate>Mon, 08 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/marcel/</guid><description>MARCEL dataset provides 722K+ conformers across 76K+ molecules for drug discovery, catalysis, and molecular representation learning research.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>MARCEL provides a benchmark for conformer ensemble learning. It demonstrates that explicitly modeling full conformer distributions improves property prediction across drug-like molecules and organometallic catalysts.</p>
<h2 id="overview">Overview</h2>
<p>The Molecular Representation and Conformer Ensemble Learning (MARCEL) dataset provides 722K+ conformations across 76K+ molecules spanning four diverse chemical domains: drug-like molecules (Drugs-75K), organophosphorus ligands (Kraken), chiral catalysts (EE), and organometallic complexes (BDE). MARCEL evaluates conformer ensemble methods across both pharmaceutical and catalysis applications.</p>
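<p>One common way to reduce per-conformer values to a single molecule-level quantity is a Boltzmann-weighted average over the ensemble. The sketch below is a generic illustration of that idea, assuming relative energies in kcal/mol and a temperature of 298.15 K. The function names are hypothetical, and whether a given MARCEL task aggregates this way should be checked against the benchmark's own documentation.</p>

```python
import math

# Gas constant in kcal/(mol*K) times an assumed temperature of 298.15 K.
KT_KCAL_PER_MOL = 0.0019872041 * 298.15


def boltzmann_weights(energies_kcal):
    """Normalized Boltzmann weights for a list of conformer energies."""
    e_min = min(energies_kcal)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / KT_KCAL_PER_MOL) for e in energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]


def ensemble_average(values, energies_kcal):
    """Boltzmann-weighted average of a per-conformer property."""
    weights = boltzmann_weights(energies_kcal)
    return sum(w * v for w, v in zip(weights, values))


# Toy ensemble: three conformers with made-up energies and property values.
energies = [0.0, 0.5, 2.0]
values = [7.1, 7.4, 6.8]
print(boltzmann_weights(energies))
print(ensemble_average(values, energies))
```

<p>Low-energy conformers dominate the average, which is why a single lowest-energy conformer can already be a strong baseline for properties that vary little across the ensemble.</p>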
<h2 id="dataset-examples">Dataset Examples</h2>















<figure class="post-figure center ">
    <img src="/img/marcel-drugs-75k-example-conformer.webp"
         alt="Example conformer from Drugs-75K"
         title="Example conformer from Drugs-75K"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example conformer from Drugs-75K (SMILES: <code>COC(=O)[C@@]1(Cc2ccc(OC)cc2)[C@H]2c3cc(C(=O)N(C)C)n(Cc4ccc(OC(F)(F)F)cc4)c3C[C@H]2CN1C(=O)c1ccccc1</code>; IUPAC: methyl (2R,3R,6R)-4-benzoyl-10-(dimethylcarbamoyl)-3-[(4-methoxyphenyl)methyl]-9-[[4-(trifluoromethoxy)phenyl]methyl]-4,9-diazatricyclo[6.3.0.0<sup>2,6</sup>]undeca-1(8),10-diene-3-carboxylate)</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-drugs-75k-example-conformer-2d.webp"
         alt="2D structure of Drugs-75K conformer"
         title="2D structure of Drugs-75K conformer"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">2D structure of Drugs-75K conformer above</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-kraken-ligand10-conf0-2d.webp"
         alt="Example conformer from Kraken in 2D"
         title="Example conformer from Kraken in 2D"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example conformer from Kraken (ligand 10, conformer 0) in 2D</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-kraken-ligand10-conf0-3d.webp"
         alt="Example conformer from Kraken in 3D"
         title="Example conformer from Kraken in 3D"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example conformer from Kraken (ligand 10, conformer 0) in 3D</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-bde-Pt_9.63-3d.webp"
         alt="Example substrate from BDE in 3D"
         title="Example substrate from BDE in 3D"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example substrate from BDE in 3D (Pt_9.63)</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/marcel-bde-Pt_9.63-2d.webp"
         alt="2D structure of BDE substrate"
         title="2D structure of BDE substrate"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">2D structure of BDE substrate above</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Drugs-75K</strong></td>
          <td>75,099 molecules</td>
          <td>Drug-like molecules with at least 5 rotatable bonds</td>
      </tr>
      <tr>
          <td><strong>Kraken</strong></td>
          <td>1,552 molecules</td>
          <td>Monodentate organophosphorus(III) ligands</td>
      </tr>
      <tr>
          <td><strong>EE</strong></td>
          <td>872 reactions</td>
          <td>Rhodium (Rh)-bound atropisomeric catalyst-substrate pairs derived from chiral bisphosphine ligands</td>
      </tr>
      <tr>
          <td><strong>BDE</strong></td>
          <td>5,915 reactions</td>
          <td>Organometallic catalysts ML$_1$L$_2$ with electronic binding energies</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarks">Benchmarks</h2>

<div class="benchmarks-content">
  <div class="benchmark-section">
    <h3 id="drugs-75k-ionization-potential">Ionization Potential (Drugs-75K)<a hidden class="anchor" aria-hidden="true" href="#drugs-75k-ionization-potential">#</a></h3>
    <p class="benchmark-description">Predict ionization potential from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Drugs-75K
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (eV)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.4066</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.4069</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.4126</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.4149</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.4174</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.428</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.4351</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.4354</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.4361</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete Local Frames network (single conformer)</small>
          </td>
          <td>0.4393</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.4394</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.4441</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.4452</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.4466</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.4505</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.4595</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.4788</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.4987</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.6617</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="drugs-75k-electron-affinity">Electron Affinity (Drugs-75K)<a hidden class="anchor" aria-hidden="true" href="#drugs-75k-electron-affinity">#</a></h3>
    <p class="benchmark-description">Predict electron affinity from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Drugs-75K
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (eV)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.391</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.3922</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.3944</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.3953</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.3964</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.4033</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.4085</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.4169</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.4169</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.4207</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.4232</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.4233</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete Local Frames network (single conformer)</small>
          </td>
          <td>0.4251</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.4269</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.4417</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.4495</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.4648</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.4747</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.585</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="drugs-75k-electronegativity">Electronegativity (Drugs-75K)<a hidden class="anchor" aria-hidden="true" href="#drugs-75k-electronegativity">#</a></h3>
    <p class="benchmark-description">Predict electronegativity (χ) from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Drugs-75K
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (eV)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.197</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.2027</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.2069</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.2083</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.2199</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.2212</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.2243</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.2243</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.226</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.2267</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.2267</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.2294</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.2324</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete Local Frames network (single conformer)</small>
          </td>
          <td>0.2378</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.2436</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.2441</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.2505</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.2732</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.4073</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="kraken-b5">B₅ Sterimol Parameter (Kraken)<a hidden class="anchor" aria-hidden="true" href="#kraken-b5">#</a></h3>
    <p class="benchmark-description">Predict B₅ sterimol descriptor for organophosphorus ligands</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Kraken
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.2225</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.2313</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.263</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.2644</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.2704</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.2789</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.3072</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.3128</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.3228</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.3293</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.3443</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.345</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.351</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.3567</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.476</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.485</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete Local Frames network (single conformer)</small>
          </td>
          <td>0.4873</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.4879</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.9611</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="kraken-l">L Sterimol Parameter (Kraken)<a hidden class="anchor" aria-hidden="true" href="#kraken-l">#</a></h3>
    <p class="benchmark-description">Predict the Sterimol L descriptor (substituent length) for organophosphorus ligands</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Kraken
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.3386</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.3468</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.3619</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.3643</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.3754</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.4003</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.4174</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.4303</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.4322</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.4344</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.4363</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.4471</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.4485</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.4493</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.5142</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.5452</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.5458</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete Local Frames network (single conformer)</small>
          </td>
          <td>0.6417</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.8389</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="kraken-burb5">Buried B₅ Parameter (Kraken)<a hidden class="anchor" aria-hidden="true" href="#kraken-burb5">#</a></h3>
    <p class="benchmark-description">Predict the buried Sterimol B₅ descriptor for organophosphorus ligands</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Kraken
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.1589</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.1693</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.1719</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.1782</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.1783</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.2017</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.2024</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.2066</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.2097</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.2176</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.2178</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.2295</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.2395</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.2422</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.2758</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.2813</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete Local Frames network (single conformer)</small>
          </td>
          <td>0.2884</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.3002</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.4929</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="kraken-burl">Buried L Parameter (Kraken)<a hidden class="anchor" aria-hidden="true" href="#kraken-burl">#</a></h3>
    <p class="benchmark-description">Predict the buried Sterimol L descriptor for organophosphorus ligands</p>
    <p class="benchmark-meta"><strong>Subset:</strong> Kraken
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>0.0947</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>0.1185</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>0.12</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>0.1324</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>0.1386</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>0.1443</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>0.1486</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>0.15</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.1521</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>0.1526</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>0.1548</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>0.1635</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>0.1673</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>0.1741</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>0.1861</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>0.1924</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>0.1948</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete Local Frames network (single conformer)</small>
          </td>
          <td>0.2529</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>0.2781</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="ee-enantioselectivity">Enantioselectivity (EE)<a hidden class="anchor" aria-hidden="true" href="#ee-enantioselectivity">#</a></h3>
    <p class="benchmark-description">Predict enantiomeric excess for Rh-catalyzed asymmetric reactions</p>
    <p class="benchmark-meta"><strong>Subset:</strong> EE
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (%)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>11.61</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>12.03</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>13.56</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>13.96</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>14.22</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>14.64</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>17.74</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>18.03</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>18.42</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>19.8</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>20.24</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete Local Frames network (single conformer)</small>
          </td>
          <td>33.95</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>61.03</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>61.3</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>61.63</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>62.08</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>62.31</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>62.38</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>64.01</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="bde-bond-dissociation">Bond Dissociation Energy (BDE)<a hidden class="anchor" aria-hidden="true" href="#bde-bond-dissociation">#</a></h3>
    <p class="benchmark-description">Predict metal-ligand bond dissociation energy for organometallic catalysts</p>
    <p class="benchmark-meta"><strong>Subset:</strong> BDE
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (kcal/mol)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>3D - DimeNet&#43;&#43;</strong><br><small>Directional message passing network (single conformer)</small>
          </td>
          <td>1.45</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>Ensemble - DimeNet&#43;&#43;</strong><br><small>DimeNet&#43;&#43; on full conformer ensemble</small>
          </td>
          <td>1.47</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>3D - LEFTNet</strong><br><small>Local Environment Feature Transformer (single conformer)</small>
          </td>
          <td>1.53</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>Ensemble - LEFTNet</strong><br><small>LEFTNet on full conformer ensemble</small>
          </td>
          <td>1.53</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Ensemble - GemNet</strong><br><small>GemNet on full conformer ensemble</small>
          </td>
          <td>1.61</td>
        </tr>
        <tr>
          <td>6</td>
          <td>
            <strong>3D - GemNet</strong><br><small>Geometry-enhanced message passing (single conformer)</small>
          </td>
          <td>1.65</td>
        </tr>
        <tr>
          <td>7</td>
          <td>
            <strong>Ensemble - PaiNN</strong><br><small>PaiNN on full conformer ensemble</small>
          </td>
          <td>1.87</td>
        </tr>
        <tr>
          <td>8</td>
          <td>
            <strong>Ensemble - SchNet</strong><br><small>SchNet on full conformer ensemble</small>
          </td>
          <td>1.97</td>
        </tr>
        <tr>
          <td>9</td>
          <td>
            <strong>Ensemble - ClofNet</strong><br><small>ClofNet on full conformer ensemble</small>
          </td>
          <td>2.01</td>
        </tr>
        <tr>
          <td>10</td>
          <td>
            <strong>3D - PaiNN</strong><br><small>Polarizable Atom Interaction Network (single conformer)</small>
          </td>
          <td>2.13</td>
        </tr>
        <tr>
          <td>11</td>
          <td>
            <strong>2D - GraphGPS</strong><br><small>Graph Transformer with positional encodings</small>
          </td>
          <td>2.48</td>
        </tr>
        <tr>
          <td>12</td>
          <td>
            <strong>3D - SchNet</strong><br><small>Continuous-filter convolutional network (single conformer)</small>
          </td>
          <td>2.55</td>
        </tr>
        <tr>
          <td>13</td>
          <td>
            <strong>3D - ClofNet</strong><br><small>Complete Local Frames network (single conformer)</small>
          </td>
          <td>2.61</td>
        </tr>
        <tr>
          <td>14</td>
          <td>
            <strong>2D - GIN</strong><br><small>Graph Isomorphism Network</small>
          </td>
          <td>2.64</td>
        </tr>
        <tr>
          <td>15</td>
          <td>
            <strong>2D - ChemProp</strong><br><small>Message Passing Neural Network</small>
          </td>
          <td>2.66</td>
        </tr>
        <tr>
          <td>16</td>
          <td>
            <strong>2D - GIN&#43;VN</strong><br><small>GIN with Virtual Nodes</small>
          </td>
          <td>2.74</td>
        </tr>
        <tr>
          <td>17</td>
          <td>
            <strong>1D - LSTM</strong><br><small>LSTM on SMILES sequences</small>
          </td>
          <td>2.83</td>
        </tr>
        <tr>
          <td>18</td>
          <td>
            <strong>1D - Random forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>3.03</td>
        </tr>
        <tr>
          <td>19</td>
          <td>
            <strong>1D - Transformer</strong><br><small>Transformer on SMILES sequences</small>
          </td>
          <td>10.08</td>
        </tr>
      </tbody>
    </table>
  </div>
</div>

<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GEOM</strong></td>
          <td>Source</td>
          <td><a href="/notes/chemistry/datasets/geom/">Notes</a></td>
      </tr>
  </tbody>
</table>
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>Domain diversity</strong>: Beyond drug-like molecules, includes organometallics and catalysts rarely covered in existing benchmarks</li>
<li><strong>Ensemble-based</strong>: Provides full conformer ensembles with statistical weights</li>
<li><strong>DFT-quality energies</strong>: Drugs-75K features DFT-level conformers and energies (higher accuracy than GEOM-Drugs)</li>
<li><strong>Realistic scenarios</strong>: BDE subset models the practical constraint of lacking DFT-computed conformers for large catalyst systems</li>
<li><strong>Comprehensive baselines</strong>: Benchmarks 19 model configurations across 1D (SMILES), 2D (graph), 3D (single conformer), and ensemble methods</li>
<li><strong>Property diversity</strong>: Covers ionization potential, electron affinity, electronegativity, ligand descriptors, and catalytic properties</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Regression only</strong>: Every task is a regression task scored by MAE; no classification tasks are included</li>
<li><strong>Chemical space coverage</strong>: The 76K molecules cover only a small fraction of the drug-like and catalyst chemical spaces</li>
<li><strong>Compute requirements</strong>: Working with large conformer ensembles demands significant computational resources</li>
<li><strong>Proprietary data</strong>: EE subset is proprietary (as of December 2025)</li>
<li><strong>DFT bottleneck</strong>: BDE demonstrates a practical limitation: a single DFT optimization can take 2-3 days, making conformer-level DFT infeasible for large organometallics</li>
<li><strong>Uniform sampling baseline</strong>: The initial data augmentation strategy tested for handling ensembles samples conformers uniformly rather than by Boltzmann weight. Because uniform sampling ignores the relative populations of the conformers, it likely explains why the strategy occasionally introduces noise and fails to help complex 3D architectures.</li>
<li><strong>Drugs-75K properties</strong>: The large-scale benchmark (Drugs-75K) specifically targets electronic properties (Ionization Potential, Electron Affinity, Electronegativity). As the authors explicitly highlight in Section 5.2, these properties are generally less sensitive to conformational rotations than steric or spatial interactions are. This makes Drugs-75K a weak test of whether explicit conformer ensembles actually benefit large-scale regression tasks.</li>
<li><strong>Unrealistic single-conformer baselines</strong>: The 3D single-conformer models are exclusively evaluated on the lowest-energy conformer. This setup is inherently flawed for real-world application, as knowing the global minimum <em>a priori</em> requires exhaustively searching and computing energies for the entire conformer space.</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="data-generation-pipeline">Data Generation Pipeline</h3>
<h4 id="drugs-75k">Drugs-75K</h4>
<p><strong>Source</strong>: GEOM-Drugs subset</p>
<p><strong>Filtering</strong>:</p>
<ul>
<li>Minimum 5 rotatable bonds (focus on flexible molecules)</li>
<li>Allowed elements: H, C, N, O, F, Si, P, S, Cl</li>
</ul>
<p><strong>Conformer generation</strong>:</p>
<ul>
<li>DFT-level calculations for both conformers and energies</li>
<li>Higher accuracy than original GEOM-Drugs (semi-empirical GFN2-xTB)</li>
</ul>
<p><strong>Properties</strong>: Ionization Potential (IP), Electron Affinity (EA), Electronegativity (χ)</p>
<h4 id="kraken">Kraken</h4>
<p><strong>Source</strong>: Original Kraken dataset (1,552 monodentate organophosphorus(III) ligands)</p>
<p><strong>Properties</strong>: 4 of 78 available properties (selected for high variance across conformer ensembles)</p>
<ul>
<li>$B_5$: Sterimol B5, maximum width of substituent (steric descriptor)</li>
<li>$L$: Sterimol L, length of substituent (steric descriptor)</li>
<li>$\text{Bur}B_5$: Buried Sterimol B5, steric effects within the first coordination sphere</li>
<li>$\text{Bur}L$: Buried Sterimol L, steric effects within the first coordination sphere</li>
</ul>
<h4 id="ee-enantiomeric-excess">EE (Enantiomeric Excess)</h4>
<p><strong>Generation method</strong>: Q2MM (Quantum-guided Molecular Mechanics)</p>
<p><strong>Reactions</strong>: 872 catalyst-substrate pairs formed from 253 rhodium (Rh)-bound atropisomeric catalysts, derived from chiral bisphosphine ligands, and 10 enamide substrates</p>
<p><strong>Property</strong>: Enantiomeric excess (EE) for asymmetric catalysis</p>
<p><strong>Availability</strong>: Proprietary (not publicly released as of December 2025)</p>
<h4 id="bde-bond-dissociation-energy">BDE (Bond Dissociation Energy)</h4>
<p><strong>Molecules</strong>: 5,915 organometallic catalysts (ML₁L₂ structure)</p>
<p><strong>Initial conformers</strong>: Generated with Open Babel, followed by geometry optimization</p>
<p><strong>Energies</strong>: DFT calculations</p>
<p><strong>Property</strong>: Electronic binding energy (the difference between the minimum energies of the bound-catalyst complex and the unbound catalyst)</p>
<p><strong>Key constraint</strong>: DFT optimization of full conformer ensembles is computationally infeasible (2-3 days per molecule)</p>
<h3 id="benchmark-setup">Benchmark Setup</h3>
<p><strong>Task</strong>: Predict molecular properties from structure using different representation strategies (1D/2D/3D/Ensemble).
The ground-truth regression targets are calculated as the Boltzmann-averaged value of the property across the conformer ensemble:</p>
<p>$$
\langle y \rangle_{k_B} = \sum_{\mathbf{C}_i \in \mathcal{C}} p_i y_i
$$</p>
<p>Here $p_i$ is the conformer probability (Boltzmann weight) under experimental conditions, derived from the conformer energy $e_i$, with $k_B$ the Boltzmann constant and $T$ the temperature:</p>
<p>$$
p_i = \frac{\exp(-e_i / k_B T)}{\sum_j \exp(-e_j / k_B T)}
$$</p>
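<p>As a rough illustration of the two formulas above, the following Python sketch computes Boltzmann weights and the resulting ensemble average. The function names, the kcal/mol energy unit, the 298.15 K temperature, and the toy numbers are all assumptions made for this example; the units and temperature used by the benchmark are not stated here.</p>

```python
import math

# Boltzmann constant in kcal/(mol*K); assumes energies are given in kcal/mol.
K_B_KCAL = 0.0019872041


def boltzmann_weights(energies, temperature=298.15):
    """Return p_i = exp(-e_i / k_B T) / sum_j exp(-e_j / k_B T)."""
    kt = K_B_KCAL * temperature
    # Subtracting the minimum energy leaves p_i unchanged but avoids overflow.
    e_min = min(energies)
    factors = [math.exp(-(e - e_min) / kt) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


def boltzmann_average(values, energies, temperature=298.15):
    """Return the ensemble target: sum_i p_i * y_i."""
    weights = boltzmann_weights(energies, temperature)
    return sum(p * y for p, y in zip(weights, values))


# Toy ensemble of three conformers (made-up energies and property values).
energies = [0.0, 0.5, 2.0]
values = [7.1, 6.8, 7.9]
weights = boltzmann_weights(energies)
target = boltzmann_average(values, energies)
```

<p>Low-energy conformers dominate the average, so in this toy case <code>target</code> sits much closer to the first two property values than to the third.</p>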
<p><strong>Data splits</strong>: Datasets are partitioned 70% train, 10% validation, and 20% test.</p>
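<p>A 70/10/20 partition can be sketched as below. The random shuffle, the fixed seed, and the <code>split_indices</code> helper are assumptions for illustration only; the text above gives the proportions but not how molecules are assigned to each split.</p>

```python
import random


def split_indices(n_items, seed=0):
    """Shuffle indices and cut them 70/10/20 (the random split is an assumption)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n_items * 70 // 100
    n_valid = n_items * 10 // 100
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # the remainder, roughly 20%
    return train, valid, test


# 1,552 is the Kraken ligand count quoted earlier in this note.
train, valid, test = split_indices(1552)
```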
<p><strong>Model categories</strong>:</p>
<ol>
<li><strong>1D Models</strong>: SMILES-based (Random Forest on concatenated MACCS/ECFP/RDKit fingerprints, LSTM, Transformer).</li>
<li><strong>2D Models</strong>: Graph-based (GIN, GIN+VN, ChemProp, GraphGPS).</li>
<li><strong>3D Models</strong>: Single conformer (SchNet, DimeNet++, GemNet, PaiNN, ClofNet, LEFTNet). For evaluation, single-conformer 3D models take only the lowest-energy conformer as input. This baseline setting often yields strong performance but is unrealistic in practice, as identifying the global minimum requires exhaustively searching the entire conformer space.</li>
<li><strong>Ensemble Models</strong>: Full conformer ensemble processing via explicit set encoders. For each conformer embedding $\mathbf{z}_i$, three aggregation strategies are evaluated:</li>
</ol>
<p><strong>Mean Pooling:</strong>
$$
\mathbf{s}_{\text{MEAN}} = \frac{1}{|\mathcal{C}|} \sum_{i=1}^{|\mathcal{C}|} \mathbf{z}_i
$$</p>
<p><strong>DeepSets:</strong>
$$
\mathbf{s}_{\text{DS}} = g\left(\sum_{i=1}^{|\mathcal{C}|} h(\mathbf{z}_i)\right)
$$</p>
<p><strong>Self-Attention:</strong>
$$
\begin{aligned}
\mathbf{s}_{\text{ATT}} &amp;= \sum_{i=1}^{|\mathcal{C}|} \mathbf{c}_i, \quad \text{where} \quad \mathbf{c}_i = g\left( \sum_{j=1}^{|\mathcal{C}|} \alpha_{ij} h(\mathbf{z}_j) \right) \\
\alpha_{ij} &amp;= \frac{\exp\left((\mathbf{W} h(\mathbf{z}_i))^\top (\mathbf{W} h(\mathbf{z}_j))\right)}{\sum_{k=1}^{|\mathcal{C}|} \exp\left((\mathbf{W} h(\mathbf{z}_i))^\top (\mathbf{W} h(\mathbf{z}_k))\right)}
\end{aligned}
$$</p>
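<p>The three aggregation strategies above can be sketched in dependency-free Python. This is an illustrative reading of the equations, not the MARCEL implementation: <code>h</code> and <code>g</code> stand in for learned networks, <code>w</code> is a plain matrix for $\mathbf{W}$, and all helper names and toy inputs are hypothetical.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math


def vec_sum(vectors):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def matvec(matrix, v):
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(z):
    """s_MEAN: average of the conformer embeddings."""
    return [x / len(z) for x in vec_sum(z)]


def deepsets_pool(z, h, g):
    """s_DS = g(sum_i h(z_i))."""
    return g(vec_sum([h(zi) for zi in z]))


def attention_pool(z, h, g, w):
    """s_ATT = sum_i g(sum_j alpha_ij h(z_j)), alpha_ij a softmax over j."""
    hz = [h(zi) for zi in z]
    projected = [matvec(w, v) for v in hz]
    contexts = []
    for i in range(len(z)):
        scores = [dot(projected[i], pj) for pj in projected]
        top = max(scores)  # stabilise the softmax
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        alphas = [e / norm for e in exps]
        mixed = vec_sum([[a * x for x in v] for a, v in zip(alphas, hz)])
        contexts.append(g(mixed))
    return vec_sum(contexts)


# Toy stand-ins (assumptions): in a real model, h and g would be learned networks.
h = lambda v: [math.tanh(x) for x in v]
g = lambda v: [2.0 * x for x in v]
identity = [[1.0, 0.0], [0.0, 1.0]]
z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three conformer embeddings

s_mean = mean_pool(z)
s_ds = deepsets_pool(z, h, g)
s_att = attention_pool(z, h, g, identity)
</code></pre></div>
<p>All three pooled vectors are invariant to the order of the conformers, which is the property that makes them usable as set encoders.</p>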
<p><strong>Evaluation metric</strong>: Mean Absolute Error (MAE) for all tasks.</p>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Ensemble superiority (task-dependent)</strong>: Across benchmarks, explicitly modeling the full conformer set using DeepSets often achieved top performance. However, these improvements are not uniform:</p>
<ul>
<li><strong>Small-Scale Success</strong>: Ensemble methods show large improvements on tasks like Kraken (Ensemble PaiNN achieves 0.2225 on $B_5$ vs 0.3443 single) and EE (Ensemble GemNet achieves 11.61% vs 18.03% single).</li>
<li><strong>Large-Scale Plateau</strong>: The performance improvements did not strongly transfer to large subsets like Drugs-75K (best ensemble strategy for GemNet achieves 0.4066 eV on IP vs 0.4069 eV single). The authors conjecture that the computational burden of encoding all conformers in each ensemble alters learning dynamics and increases training difficulty.</li>
</ul>
<p><strong>Conformer Sampling for Noise</strong>: Data augmentation (randomly sampling one conformer from an ensemble during training) improves performance and robustness when underlying conformers are imprecise (e.g., the forcefield-generated conformers in the BDE subset).</p>
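<p>A minimal sketch of this augmentation, assuming each molecule is stored as a list of conformers plus a regression target. The data layout, the identifiers, and the choice of uniform sampling are assumptions; the source only states that one conformer is sampled at random during training.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import random


def sample_training_batch(ensembles, rng):
    """Pick one conformer uniformly at random from each molecule's ensemble."""
    return [(rng.choice(conformers), target) for conformers, target in ensembles]


# Hypothetical data: (list of conformer identifiers, regression target) per molecule.
ensembles = [
    (["mol0_conf0", "mol0_conf1", "mol0_conf2"], 1.2),
    (["mol1_conf0"], 0.7),
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
batches = []
for epoch in range(3):
    # A real training loop would encode each sampled conformer with a 3D model here.
    batches.append(sample_training_batch(ensembles, rng))
</code></pre></div>
<p>Because a fresh conformer is drawn every epoch, the model sees different geometries for the same target over the course of training.</p>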
<p><strong>3D vs 2D</strong>: 3D models generally outperform 2D graph models, especially for conformationally sensitive properties, though 1D and 2D methods remain highly competitive on low-resource datasets or less rotation-sensitive properties.</p>
<p><strong>Model architecture</strong>: No single model dominates all tasks. GemNet and LEFTNet excel on large-scale Drugs-75K, while DimeNet++ shows strong performance on smaller Kraken and reaction datasets. Model selection depends on dataset size and task characteristics.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/SXKDZ/MARCEL">SXKDZ/MARCEL</a></td>
          <td>Code + Dataset</td>
          <td>Apache-2.0</td>
          <td>Benchmark suite, dataset loaders, and hyperparameter configs</td>
      </tr>
      <tr>
          <td><a href="https://github.com/SXKDZ/MARCEL/tree/main/datasets/Drugs">Drugs-75K</a></td>
          <td>Dataset</td>
          <td>Apache-2.0</td>
          <td>DFT-level conformers and energies derived from GEOM-Drugs</td>
      </tr>
      <tr>
          <td><a href="https://github.com/SXKDZ/MARCEL/tree/main/datasets/Kraken">Kraken</a></td>
          <td>Dataset</td>
          <td>Copyright retained by original authors</td>
          <td>Conformer ensembles and four steric descriptors</td>
      </tr>
      <tr>
          <td><a href="https://github.com/SXKDZ/MARCEL/tree/main/datasets/BDE">BDE</a></td>
          <td>Dataset</td>
          <td>Apache-2.0</td>
          <td>OpenBabel-generated conformers with DFT binding energies</td>
      </tr>
      <tr>
          <td>EE</td>
          <td>Dataset</td>
          <td>Proprietary</td>
          <td>Closed-source as of 2026</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Data</strong>: The Drugs-75K, Kraken, and BDE subsets are openly available via the project&rsquo;s GitHub repository. The EE dataset remains closed-source/proprietary (as of 2026), making the EE suite of the benchmark currently irreproducible.</li>
<li><strong>Code</strong>: The benchmark suite and PyTorch-Geometric dataset loaders are open-sourced at <a href="https://github.com/SXKDZ/MARCEL">GitHub (SXKDZ/MARCEL)</a> under the Apache-2.0 license.</li>
<li><strong>Hardware</strong>: The authors trained models using Nvidia A100 (40GB) GPUs. Memory-intensive models (e.g., GemNet, LEFTNet) required Nvidia H100 (80GB) GPUs. Total computation across all benchmark experiments was approximately 6,000 GPU hours.</li>
<li><strong>Algorithms/Models</strong>: Hyperparameters for all 18 evaluated models are provided in the repository configuration files (<code>benchmarks/params</code>). All baseline models use publicly available frameworks (e.g., PyTorch Geometric, OGB, RDKit).</li>
<li><strong>Evaluation</strong>: Evaluation scripts are provided in the repository with consistent tracking of Mean Absolute Error (MAE) and proper configuration of benchmark splits.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p>Zhu, Y., Hwang, J., Adams, K., Liu, Z., Nan, B., Stenfors, B., Du, Y., Chauhan, J., Wiest, O., Isayev, O., Coley, C. W., Sun, Y., and Wang, W. (2024). Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks. In <em>The Twelfth International Conference on Learning Representations (ICLR 2024)</em>. <a href="https://openreview.net/forum?id=NSDszJ2uIV">https://openreview.net/forum?id=NSDszJ2uIV</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhu2024learning,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks}</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yanqiao Zhu and Jeehyun Hwang and Keir Adams and Zhen Liu and Bozhao Nan and Brock Stenfors and Yuanqi Du and Jatin Chauhan and Olaf Wiest and Olexandr Isayev and Connor W. Coley and Yizhou Sun and Wei Wang}</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Twelfth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=NSDszJ2uIV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GEOM: Energy-Annotated Molecular Conformations Dataset</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/geom/</link><pubDate>Thu, 04 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/geom/</guid><description>Dataset card for GEOM, providing energy-annotated molecular conformations generated via CREST/xTB and refined with DFT for property prediction benchmarks.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>GEOM addresses the gap between 2D molecular graphs and flexible 3D properties by providing 450k+ molecules with 37M+ conformations. This extensive sampling connects conformer ensembles to experimental properties, providing the necessary infrastructure to benchmark conformer generation methods and train 3D-aware property predictors.</p>
<h2 id="overview">Overview</h2>
<p>The Geometric Ensemble Of Molecules (GEOM) dataset provides energy-annotated molecular conformations generated through systematic computational methods. The dataset includes molecules from drug discovery campaigns (AICures), quantum chemistry benchmarks (<a href="/notes/chemistry/datasets/qm9/">QM9</a>), and molecular property prediction benchmarks (MoleculeNet), with conformations sampled using CREST/GFN2-xTB and a subset refined with high-quality DFT calculations.</p>
<h2 id="dataset-examples">Dataset Examples</h2>
<figure class="post-figure center ">
    <img src="/img/GEOM-sample-_4-pyrimidin-2-yloxyphenyl_acetamide.webp"
         alt="Example SARS-CoV-2 3CL protease active molecule"
         title="Example SARS-CoV-2 3CL protease active molecule"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example SARS-CoV-2 3CL protease active molecule: CC(=O)Nc1ccc(Oc2ncccn2)cc1 (N-(4-pyrimidin-2-yloxyphenyl)acetamide)</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Drug-like (AICures)</strong></td>
          <td>304,466 molecules</td>
          <td>Drug-like molecules from AICures COVID-19 challenge (avg 44 atoms)</td>
      </tr>
      <tr>
          <td><strong>QM9</strong></td>
          <td>133,258 molecules</td>
          <td>Small molecules from QM9 (up to 9 heavy atoms)</td>
      </tr>
      <tr>
          <td><strong>MoleculeNet</strong></td>
          <td>16,865 molecules</td>
          <td>Molecules from MoleculeNet benchmarks for physical chemistry, biophysics, and physiology (includes BACE)</td>
      </tr>
      <tr>
          <td><strong>BACE (High-quality DFT)</strong></td>
          <td>1,511 molecules</td>
          <td>BACE subset of MoleculeNet with high-quality DFT energies (r2scan-3c) and experimental inhibition data</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarks">Benchmarks</h2>

<div class="benchmarks-content">
  <div class="benchmark-section">
    <h3 id="gibbs-free-energy-prediction">Gibbs Free Energy Prediction<a hidden class="anchor" aria-hidden="true" href="#gibbs-free-energy-prediction">#</a></h3>
    <p class="benchmark-description">Predict ensemble Gibbs free energy (G) from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> 100k AICures · <strong>Split:</strong> 60/20/20 train/val/test
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (kcal/mol)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>SchNetFeatures</strong><br><small>3D SchNet &#43; graph features (trained on highest-prob conformer)</small>
          </td>
          <td>0.203</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>ChemProp</strong><br><small>Message Passing Neural Network (graph model)</small>
          </td>
          <td>0.225</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>FFNN</strong><br><small>Feed-forward network on Morgan fingerprints</small>
          </td>
          <td>0.274</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>KRR</strong><br><small>Kernel Ridge Regression on Morgan fingerprints</small>
          </td>
          <td>0.289</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Random Forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.406</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="average-energy-prediction">Average Energy Prediction<a hidden class="anchor" aria-hidden="true" href="#average-energy-prediction">#</a></h3>
    <p class="benchmark-description">Predict ensemble average energy (E) from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> 100k AICures · <strong>Split:</strong> 60/20/20 train/val/test
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE (kcal/mol)</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>ChemProp</strong><br><small>Message Passing Neural Network (graph model)</small>
          </td>
          <td>0.11</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>SchNetFeatures</strong><br><small>3D SchNet &#43; graph features (trained on highest-prob conformer)</small>
          </td>
          <td>0.113</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>FFNN</strong><br><small>Feed-forward network on Morgan fingerprints</small>
          </td>
          <td>0.119</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>KRR</strong><br><small>Kernel Ridge Regression on Morgan fingerprints</small>
          </td>
          <td>0.131</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Random Forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.166</td>
        </tr>
      </tbody>
    </table>
  </div>
  <div class="benchmark-section">
    <h3 id="conformer-count-prediction">Conformer Count Prediction<a hidden class="anchor" aria-hidden="true" href="#conformer-count-prediction">#</a></h3>
    <p class="benchmark-description">Predict ln(number of unique conformers) from molecular structure</p>
    <p class="benchmark-meta"><strong>Subset:</strong> 100k AICures · <strong>Split:</strong> 60/20/20 train/val/test
    </p>
    <table class="benchmark-table">
      <thead>
        <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>MAE</th>
        </tr>
      </thead>
      <tbody>
        <tr class="top-result">
          <td>🥇 1</td>
          <td>
            <strong>SchNetFeatures</strong><br><small>3D SchNet &#43; graph features (trained on highest-prob conformer)</small>
          </td>
          <td>0.363</td>
        </tr>
        <tr class="top-result">
          <td>🥈 2</td>
          <td>
            <strong>ChemProp</strong><br><small>Message Passing Neural Network (graph model)</small>
          </td>
          <td>0.38</td>
        </tr>
        <tr class="top-result">
          <td>🥉 3</td>
          <td>
            <strong>FFNN</strong><br><small>Feed-forward network on Morgan fingerprints</small>
          </td>
          <td>0.455</td>
        </tr>
        <tr>
          <td>4</td>
          <td>
            <strong>KRR</strong><br><small>Kernel Ridge Regression on Morgan fingerprints</small>
          </td>
          <td>0.484</td>
        </tr>
        <tr>
          <td>5</td>
          <td>
            <strong>Random Forest</strong><br><small>Random Forest on Morgan fingerprints</small>
          </td>
          <td>0.763</td>
        </tr>
      </tbody>
    </table>
  </div>
</div>

<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>QM9</strong></td>
          <td>134k small molecules with up to 9 heavy atoms and DFT properties</td>
      </tr>
      <tr>
          <td><strong>PCQM4Mv2</strong></td>
          <td>Millions of computationally generated molecules for HOMO-LUMO gap prediction</td>
      </tr>
      <tr>
          <td><strong>PubChemQC</strong></td>
          <td>DFT structures and energy properties for millions of PubChem molecules</td>
      </tr>
  </tbody>
</table>
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>Scale</strong>: 37M+ conformations across 450k+ molecules, providing massive coverage of drug-like and small molecule chemical space.</li>
<li><strong>Energy Annotations</strong>: All conformations include semi-empirical energies (GFN2-xTB); the BACE subset includes high-quality DFT energies.</li>
<li><strong>Quality Tiers</strong>: Three levels of computational rigor allow researchers to trade off dataset size for simulation accuracy.</li>
<li><strong>Benchmark Ready</strong>: Includes validated splits and architectural baselines (e.g., ChemProp, SchNet) for property prediction tasks.</li>
<li><strong>Task Diversity</strong>: Combines molecules sourced from drug discovery (AICures), quantum chemistry (QM9), and physical chemistry, biophysics, and physiology benchmarks (MoleculeNet).</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Computational Constraints</strong>: The highest-accuracy DFT subset (BACE) is limited to 1,511 molecules due to the extreme computational cost of exact free energy sampling and Hessian estimation.</li>
<li><strong>Semi-Empirical Accuracy Gap</strong>: The $p^{\text{CREST}}$ statistical weights rely on GFN2-xTB energies, which exhibit a $\sim$2 kcal/mol MAE against true DFT. At room temperature ($k_BT \approx 0.59$ kcal/mol), this error heavily skews the Boltzmann distribution, meaning standard subset weights are imprecise.</li>
<li><strong>Solvation Assumptions</strong>: Most subsets rely on vacuum calculations. Only the BACE subset uses an implicit solvent (ALPB/C-PCM for water).</li>
<li><strong>Coverage Lapses</strong>: Extremely flexible molecules (e.g., within the SIDER dataset) frequently failed the conformer generation pipeline due to runaway topologies.</li>
</ul>
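<p>To make the semi-empirical accuracy gap concrete, the following back-of-the-envelope sketch uses only the two numbers quoted above (a 2 kcal/mol energy error and $k_BT \approx 0.59$ kcal/mol). The two-conformer setup and the function name are illustrative assumptions, not part of the GEOM analysis.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math

KT_ROOM = 0.59      # kcal/mol, room-temperature value quoted in the text
ENERGY_ERROR = 2.0  # kcal/mol, approximate GFN2-xTB error quoted in the text

# An error dE in a relative energy multiplies a population ratio by exp(dE / kT).
ratio_distortion = math.exp(ENERGY_ERROR / KT_ROOM)


def two_state_population(delta_e, kt=KT_ROOM):
    """Population of the higher-energy conformer in a two-conformer ensemble."""
    return 1.0 / (1.0 + math.exp(delta_e / kt))


true_population = two_state_population(0.0)             # truly degenerate pair
skewed_population = two_state_population(ENERGY_ERROR)  # same pair with a 2 kcal/mol error
</code></pre></div>
<p>Under these assumptions the population ratio is distorted by a factor of about 30, so a truly degenerate pair (50% each) would be reported as roughly 97% to 3%.</p>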
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="data-generation-pipeline">Data Generation Pipeline</h3>
<p><strong>Initial conformer sampling</strong> (RDKit):</p>
<ul>
<li><code>EmbedMultipleConfs</code> with <code>numConfs=50</code>, <code>pruneRmsThresh=0.01</code> Å</li>
<li>MMFF force field optimization</li>
<li>GFN2-xTB optimization of seed conformer</li>
</ul>
<p><strong>Conformational exploration</strong> (CREST):</p>
<ul>
<li>Metadynamics in NVT ensemble driven by a pushing bias potential:
$$ V_{\text{bias}} = \sum_i k_i \exp(-\alpha_i \Delta_i^2) $$
where $\Delta_i$ is the root-mean-square displacement (RMSD) against the $i$-th reference structure.</li>
<li>12 independent MTD runs per molecule with different settings for $k_i$ and $\alpha_i$.</li>
<li>6.0 kcal/mol safety window for conformer retention.</li>
<li>Solvent: ALPB for water (BACE); vacuum for others.</li>
</ul>
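<p>The bias potential above can be sketched as follows. This is a simplified illustration rather than the CREST implementation: the RMSD here is computed without structural alignment, and the coordinates and the values of $k_i$ and $\alpha_i$ are hypothetical.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math


def rmsd(coords_a, coords_b):
    """Root-mean-square displacement between two coordinate lists (no alignment)."""
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def bias_potential(current, references, k, alpha):
    """V_bias = sum_i k_i * exp(-alpha_i * Delta_i ** 2)."""
    return sum(
        k_i * math.exp(-alpha_i * rmsd(current, ref) ** 2)
        for ref, k_i, alpha_i in zip(references, k, alpha)
    )


# Hypothetical two-atom structures; units and parameter values are placeholders.
reference = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
displaced = [[10.0, 0.0, 0.0], [11.0, 0.0, 0.0]]

v_at_reference = bias_potential(reference, [reference], k=[0.5], alpha=[1.0])
v_far_away = bias_potential(displaced, [reference], k=[0.5], alpha=[1.0])
</code></pre></div>
<p>The potential is largest when the current structure coincides with a previously visited reference and decays as the RMSD grows, which is what pushes the simulation toward unexplored conformations.</p>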
<p><strong>Energy calculation &amp; Weighting</strong>:</p>
<ul>
<li>
<p><strong>Standard (GFN2-xTB)</strong>: Semi-empirical tight-binding DFT ($\approx$ 2 kcal/mol MAE vs DFT). Conformers are assigned a statistical probability based on energy $E_i$ and rotamer degeneracy $d_i$:
$$ p^{\text{CREST}}_i = \frac{d_i \exp(-E_i / k_B T)}{\sum_j d_j \exp(-E_j / k_B T)} $$</p>
</li>
<li>
<p><strong>High-Quality DFT (CENSO)</strong>: Refines structures using the <code>r2scan-3c</code> functional, computing exact conformation-dependent free energies ($G_i$) that remove the need for explicit rotamer degeneracy approximations:</p>
<p>$$
\begin{aligned}
p^{\text{CENSO}}_i &amp;= \frac{\exp(-G_i / k_B T)}{\sum_j \exp(-G_j / k_B T)} \\
G_i &amp;= E_{\text{gas}}^{(i)} + \delta G_{\text{solv}}^{(i)}(T) + G_{\text{trv}}^{(i)}(T)
\end{aligned}
$$</p>
</li>
</ul>
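<p>The two weighting schemes above differ only in whether a rotamer degeneracy multiplies the Boltzmann factor. A minimal sketch, assuming energies in kcal/mol and a temperature of 298.15 K (both assumptions), with hypothetical function names:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math

K_B_KCAL = 0.0019872041  # Boltzmann constant in kcal/(mol*K); unit choice is an assumption


def crest_weights(energies, degeneracies, temperature=298.15):
    """p_i = d_i exp(-E_i / k_B T) / sum_j d_j exp(-E_j / k_B T)."""
    kt = K_B_KCAL * temperature
    e_min = min(energies)  # constant shift; it cancels in the ratio
    terms = [d * math.exp(-(e - e_min) / kt) for e, d in zip(energies, degeneracies)]
    total = sum(terms)
    return [t / total for t in terms]


def censo_weights(free_energies, temperature=298.15):
    """p_i = exp(-G_i / k_B T) / sum_j exp(-G_j / k_B T), with no degeneracy factor."""
    return crest_weights(free_energies, [1] * len(free_energies), temperature)


# Hypothetical pair of isoenergetic conformers with degeneracies 1 and 3.
p_crest = crest_weights([0.0, 0.0], [1, 3])
p_censo = censo_weights([0.0, 0.0])
</code></pre></div>
<p>With equal energies, the degeneracy alone sets the CREST weights (1:3 here), whereas the free-energy form weights the two conformers equally.</p>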
<h3 id="quality-levels">Quality Levels</h3>
<table>
  <thead>
      <tr>
          <th>Level</th>
          <th>Method</th>
          <th>Subset</th>
          <th>Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Standard</strong></td>
          <td>CREST/GFN2-xTB</td>
          <td>All subsets</td>
          <td>~2 kcal/mol MAE vs DFT</td>
      </tr>
      <tr>
          <td><strong>DFT Single-Point</strong></td>
          <td>r2scan-3c/mTZVPP on CREST geometries</td>
          <td>BACE (1,511 molecules)</td>
          <td>Sub-kcal/mol</td>
      </tr>
      <tr>
          <td><strong>DFT Optimized</strong></td>
          <td>CENSO full optimization + free energies</td>
          <td>BACE (534 molecules)</td>
          <td>~0.3 kcal/mol vs CCSD(T)</td>
      </tr>
  </tbody>
</table>
<h3 id="benchmark-setup">Benchmark Setup</h3>
<p><strong>Task</strong>: Predict ensemble summary statistics directly from the 2D molecular structure. The target properties include:</p>
<ul>
<li><strong>Conformational Free Energy ($G$)</strong>: $G = -TS$, where $S = -R \sum_i p_i \log p_i$.</li>
<li><strong>Average Energy ($\langle E \rangle$)</strong>: $\langle E \rangle = \sum_i p_i E_i$.</li>
<li><strong>Unique Conformers</strong>: Natural log of the conformer count retained within the energy window.</li>
</ul>
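<p>Given conformer weights and energies, the three targets listed above can be computed as in the sketch below. It assumes that $\log$ denotes the natural logarithm, that energies are in kcal/mol at 298.15 K, and that every conformer passed in was retained within the energy window; the function name is hypothetical.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K); unit choice is an assumption


def ensemble_targets(weights, energies, temperature=298.15):
    """Return (G, average energy, ln(conformer count)) for one molecule."""
    # S = -R * sum_i p_i ln p_i; zero-weight conformers contribute nothing.
    entropy = -R_KCAL * sum(p * math.log(p) for p in weights if p != 0.0)
    free_energy = -temperature * entropy  # G = -T S
    average_energy = sum(p * e for p, e in zip(weights, energies))
    log_count = math.log(len(weights))
    return free_energy, average_energy, log_count


# Hypothetical ensemble of four equally populated conformers.
free_energy, average_energy, log_count = ensemble_targets(
    [0.25, 0.25, 0.25, 0.25], [0.0, 1.0, 2.0, 3.0]
)
</code></pre></div>
<p>For equally populated conformers the entropy reduces to $R \ln n$, so $G$ becomes more negative as the number of accessible conformers grows.</p>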
<p><strong>Data</strong>: 100,000 species randomly sampled from AICures subset, split 60/20/20 (train/validation/test).</p>
<p><strong>Hyperparameters</strong>: Optimized using Hyperopt package for each model/task combination.</p>
<p><strong>Models</strong>:</p>
<ul>
<li><strong>SchNetFeatures</strong>: 3D SchNet architecture + graph features, trained on highest-probability conformer</li>
<li><strong>ChemProp</strong>: Message Passing Neural Network on molecular graphs</li>
<li><strong>FFNN</strong>: Feed-forward network on Morgan fingerprints</li>
<li><strong>KRR</strong>: Kernel Ridge Regression on Morgan fingerprints</li>
<li><strong>Random Forest</strong>: Random Forest on Morgan fingerprints</li>
</ul>
<h3 id="hardware--computational-cost">Hardware &amp; Computational Cost</h3>
<h4 id="crestgfn2-xtb-generation">CREST/GFN2-xTB Generation</h4>
<p><strong>Total compute</strong>: ~15.7 million core hours</p>
<p><strong>AICures subset</strong>:</p>
<ul>
<li>13M core hours on Knights Landing (32-core nodes)</li>
<li>1.2M core hours on Cascade Lake/Skylake (13-core nodes)</li>
<li>Average wall time: 2.8 hours/molecule (KNL) or 0.63 hours/molecule (Skylake)</li>
</ul>
<p><strong>MoleculeNet subset</strong>: 1.5M core hours</p>
<h4 id="dft-calculations-bace-only">DFT Calculations (BACE only)</h4>
<p><strong>Software</strong>: CENSO 1.1.2 + ORCA 5.0.1 (r2scan-3c/mTZVPP functional)</p>
<p><strong>Solvent</strong>: C-PCM implicit solvation (water)</p>
<p><strong>Hardware</strong>: ~54 cores per job</p>
<p><strong>Compute cost</strong>:</p>
<ul>
<li>781,000 CPU hours for CENSO optimizations</li>
<li>1.1M CPU hours for single-point energy calculations</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li><strong>Data Availability</strong>: All generated conformations, energies, and thermodynamic properties are publicly hosted on <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/JNGTDF">Harvard Dataverse</a>. The data is provided in language-agnostic MessagePack format and Python-specific RDKit <code>.pkl</code> formats.</li>
<li><strong>Code &amp; Analysis</strong>: The primary GitHub repository (<a href="https://github.com/learningmatter-mit/geom">learningmatter-mit/geom</a>) provides tutorials for data extraction, RDKit processing, and conformational visualization.</li>
<li><strong>Model Training &amp; Baselines</strong>: The machine learning benchmarks (SchNet, ChemProp) and corresponding training scripts used to evaluate the dataset can be reproduced using the authors&rsquo; <a href="https://github.com/learningmatter-mit/NeuralForceField">NeuralForceField repository</a>.</li>
<li><strong>Hardware &amp; Compute</strong>: Conformer generation was compute-intensive (15.7M core hours for CREST sampling alone), heavily utilizing Knights Landing (KNL) and Cascade Lake architectures. See the <em>Hardware &amp; Computational Cost</em> section above for full details.</li>
<li><strong>Software Versions</strong>: Precise reproduction of conformational properties requires specific versions to mitigate numerical variances: CREST v2.9, xTB v6.2.3/v6.4.1, CENSO v1.1.2, ORCA v5.0.1/v5.0.2, and RDKit v2020.09.1.</li>
<li><strong>Open-Access Paper</strong>: The full methodology is accessible via the <a href="https://arxiv.org/abs/2006.05531">arXiv preprint</a>.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p>Axelrod, S. and Gómez-Bombarelli, R. (2022). GEOM, energy-annotated molecular conformations for property prediction and molecular generation. <em>Scientific Data</em>, 9(1), 185. <a href="https://doi.org/10.1038/s41597-022-01288-4">https://doi.org/10.1038/s41597-022-01288-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Axelrod_2022,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GEOM, energy-annotated molecular conformations for property prediction and molecular generation}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">ISSN</span>=<span style="color:#e6db74">{2052-4463}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{http://dx.doi.org/10.1038/s41597-022-01288-4}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">DOI</span>=<span style="color:#e6db74">{10.1038/s41597-022-01288-4}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Data}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Science and Business Media LLC}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Axelrod, Simon and Gómez-Bombarelli, Rafael}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">month</span>=<span style="color:#e6db74">{apr}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{185}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GDB-11: Chemical Universe Database (26.4M Molecules)</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-11/</link><pubDate>Fri, 29 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-11/</guid><description>GDB-11 systematically enumerates 26.4M small organic molecules (up to 11 atoms of C, N, O, F) for virtual screening and drug discovery.</description><content:encoded><![CDATA[<h2 id="dataset-examples">Dataset Examples</h2>
<figure class="post-figure center ">
    <img src="/img/gdb_11_sample.webp"
         alt="GDB-11 molecule"
         title="GDB-11 molecule"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">GDB-11 molecule (SMILES: <code>FC1C2OC1c3c(F)coc23</code>)</figcaption>
    
</figure>

<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GDB-13</strong></td>
          <td>Successor</td>
          <td><a href="/notes/chemistry/datasets/gdb-13/">Notes</a></td>
      </tr>
      <tr>
          <td><strong>GDB-17</strong></td>
          <td>Successor</td>
          <td><a href="/notes/chemistry/datasets/gdb-17/">Notes</a></td>
      </tr>
  </tbody>
</table>
<h2 id="key-contribution">Key Contribution</h2>
<p>The generation and analysis of the Generated Database (GDB), an exhaustive collection of all possible small molecules that meet specific criteria for stability and synthetic feasibility.</p>
<h2 id="overview">Overview</h2>
<p>GDB-11 represents the first systematic enumeration of the small molecule chemical universe up to 11 atoms of C, N, O, and F. The database contains 26.4 million unique molecules corresponding to 110.9 million stereoisomers. It was created to support virtual screening and drug discovery by providing a comprehensive collection of diverse, drug-like small molecules that obey standard chemical stability rules.</p>
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>Systematic Enumeration</strong>: Exhaustive coverage of mathematically and chemically possible structures up to 11 atoms.</li>
<li><strong>Drug-Likeness</strong>: 100% of compounds follow Lipinski&rsquo;s &ldquo;Rule of 5&rdquo; for bioavailability, and 50% (13.2 million) follow Congreve&rsquo;s more restrictive &ldquo;Rule of 3&rdquo; for lead-likeness.</li>
<li><strong>Structural Novelty</strong>: Features 538 newly identified ring systems that were previously unknown in existing chemical databases (like the CAS Registry or Beilstein).</li>
<li><strong>High Chirality</strong>: Over 70% of GDB molecules are chiral, providing rich 3D structural diversity, particularly in fused carbocycles and heterocycles.</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Size Restriction</strong>: Strictly limited to small molecules with a maximum of 11 heavy atoms.</li>
<li><strong>Element Restriction</strong>: Only contains C, N, O, and F. Important biological and pharmaceutical elements like Phosphorus (P), Sulfur (S), and Silicon (Si) are excluded to prevent combinatorial explosion.</li>
<li><strong>Excluded Topologies</strong>: Excludes highly strained molecules (e.g., specific bridged systems), allenes, and bridgehead double bonds.</li>
<li><strong>Unstable Functional Groups</strong>: Excludes chemical classes deemed unstable or highly reactive (e.g., gem-diols, hemiacetals, aminals, enols, orthoacids).</li>
<li><strong>Computational Nature</strong>: Consists entirely of computer-generated, theoretical structures without experimental synthesis or biological validation.</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="construction">Construction</h3>
<h4 id="graph-selection">Graph Selection</h4>
<p>The program GENG was used to generate an initial set of 843,335 connected graphs with up to 11 nodes and a maximum node degree (connectivity) of 4. These were filtered to 15,726 stable saturated hydrocarbon graphs using:</p>
<ul>
<li><strong>Topological Criteria</strong>: Removed graphs with a node in multiple small (3- or 4-membered) rings, tetravalent bridgeheads in small rings, and nonplanar graphs (e.g., Claus-benzol).</li>
<li><strong>Steric Criteria</strong>: Graphs containing highly distorted centers were removed using an adapted MM2 force field energy-minimization with a cutoff of +17 kcal/mol.</li>
</ul>
<h4 id="structure-generation">Structure Generation</h4>
<p>Graph symmetry algorithms identified valid locations for unsaturations and heteroatoms (C, N, O, F). Specific valence constraints were continuously enforced. Combinatorial distribution of elements and multiple bonds (excluding bridgehead double bonds, triple bonds in rings smaller than nine, and allenes) yielded a theoretical &ldquo;dark matter universe&rdquo; (DMU) of over 1.7 billion unique structures.</p>
<h4 id="filters">Filters</h4>
<p>The 1.7 billion candidate structures included unstable chemical environments, which were filtered out, reducing the set to 27.7 million possible stable molecules. Rejected unstable/reactive features included:</p>
<ul>
<li><strong>High-Energy Bonds</strong>: Gem-diols, non-stabilized aminals, hemiaminals, enols, orthoesters, unstable imines, acyl fluorides, and geminal di-heteroatoms.</li>
<li><strong>Heteroatom-Heteroatom Bonds</strong>: Peroxides (O-O), N-O, N-N, N-F, and triazanes, unless stabilized (e.g., hydrazones, oximes).</li>
<li><strong>Strained Topologies</strong>: 3/4-membered rings containing N-N or N-O bonds, and bridgehead heteroatom bonds causing instabilities (like Bredt&rsquo;s rule violations).</li>
</ul>
<p>Removal of redundant tautomeric forms collapsed the set to the foundational 26.4 million structures.</p>
<h4 id="stereoisomer-generation">Stereoisomer Generation</h4>
<p>Stereoisomers were enumerated by identifying all asymmetric centers and stereogenic double bonds, with Z/E isomerism disallowed in rings smaller than 10 nodes. From the 26.4 million unique constitutional isomers, 110.9 million stereoisomers were generated (averaging 4.2 stereoisomers per molecule).</p>
<h3 id="analysis-methodology">Analysis Methodology</h3>
<h4 id="kohonen-maps-self-organizing-maps">Kohonen Maps (Self-Organizing Maps)</h4>
<p>The chemical space visualization and compound class analysis used a Kohonen Map (Self-Organizing Map/SOM):</p>
<ul>
<li><strong>Input Features</strong>: 48-dimensional autocorrelation vectors encoding topological relationships and atomic properties. The autocorrelation vector $\text{AC}_d$ for a topological distance $d$ is defined as:</li>
</ul>
<p>$$
\text{AC}_d = \sum_{i=1}^{N} \sum_{j=1}^{N} \delta (p_i, p_j)_d
$$</p>
<p><em>(where $N$ is the number of atoms, $p$ is the atomic property, and $\delta (p_i, p_j)_d = p_i p_j$ if the topological distance between atoms $i$ and $j$ is $d$, and 0 otherwise).</em></p>
<ul>
<li><strong>Training Data</strong>: Random subset of 1,000,000 GDB molecules</li>
<li><strong>Architecture</strong>: 200x200 neuron grid</li>
<li><strong>Training Protocol</strong>: 250,000 epochs with 100 molecules presented per epoch</li>
<li><strong>Algorithm</strong>: Standard Kohonen algorithm</li>
<li><strong>Key Insight</strong>: Reveals that &ldquo;lead-like&rdquo; compounds cluster in chiral regions of fused carbocycles/heterocycles</li>
</ul>
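<p>The autocorrelation definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors&rsquo; implementation: the adjacency-list representation, the function names, and the unit-valued property are all assumptions, and the source does not specify which properties and distances make up the 48 dimensions. Topological distance is taken as the shortest bond-count path, found by breadth-first search.</p>

```python
from collections import deque


def topological_distances(adjacency, source):
    """Breadth-first search: bond-count distance from `source` to every reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def autocorrelation(adjacency, prop, max_distance):
    """Return [AC_0, ..., AC_max_distance], summing p_i * p_j over ordered atom pairs."""
    ac = [0.0] * (max_distance + 1)
    for i in adjacency:
        for j, d in topological_distances(adjacency, i).items():
            if max_distance >= d:
                ac[d] += prop[i] * prop[j]
    return ac


# Hypothetical example: the heavy-atom graph of ethanol (C-C-O), with p = 1 for every atom.
ethanol = {0: [1], 1: [0, 2], 2: [1]}
unit_property = {0: 1.0, 1: 1.0, 2: 1.0}
ac_ethanol = autocorrelation(ethanol, unit_property, max_distance=2)
```

<p>Because the double sum runs over all ordered pairs, each unordered pair is counted twice and the $d = 0$ entry collects the $p_i^2$ terms. Concatenating such vectors for several atomic properties would give a fixed-length descriptor of the kind fed to the map.</p>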
<h4 id="comparison">Comparison</h4>
<p>The full database was compared to a Reference Database (RDB) of 63,857 known compounds (up to 11 atoms) extracted from PubChem, ChemACX, ChemSCX, NCI Open Database, and the Merck Index. Of the 63,857 RDB compounds, 37,393 (58.6%) were found in GDB. The remaining 26,464 compounds were absent because they violated the structural rules, contained elements beyond C/N/O/F, or belonged to the filtered unstable chemistries.</p>
<h4 id="new-rings">New Rings</h4>
<p>All 309 acyclic graphs in GDB correspond to published structures. External databases contained only 670 of the 1,208 ring systems (55.5%). Of the 538 newly identified ring systems, 367 (68.2%) are inherently chiral.</p>
<h4 id="stereochemistry">Stereochemistry</h4>
<p>Molecules with fewer than 5 heavy atoms are mostly simple and achiral. As the atom count increases, chirality becomes dominant: over two-thirds of structures containing 10 or 11 atoms are chiral. Approximately 86% of the molecules in GDB contain exactly 11 atoms (22.8 million of 26.4 million).</p>
<h4 id="physicochemical-properties">Physicochemical Properties</h4>
<p>Because all GDB molecules contain at most 11 heavy atoms, 100% of them satisfy Lipinski&rsquo;s &ldquo;Rule of 5&rdquo; for bioavailability. Under the more restrictive Congreve &ldquo;Rule of 3&rdquo; for lead-likeness (molecular weight MW &lt; 300, rotatable bond count RBC &lt; 3, logP &lt; 3, hydrogen-bond donor count HBDC &lt; 3, hydrogen-bond acceptor count HBAC &lt; 3, topological polar surface area TPSA &lt; 60 $\text{\AA}^2$), about 50% (13.2 million structures) qualify. Virtual screening using the Molinspiration <code>miscreen</code> toolkit (Bayesian statistics-based) identified 42,804 virtual hits across three drug target classes: 3,043 kinase inhibitor candidates, 24,489 GPCR ligand candidates, and 19,696 ion-channel modulator candidates. Of these virtual hits, 59.8% occupied Kohonen map neurons not populated by any known RDB compound.</p>
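<p>The &ldquo;Rule of 3&rdquo; thresholds quoted above translate directly into a filter. The sketch below is illustrative only: the descriptor names, the dictionary layout, and the example values are assumptions, and in practice the descriptors would come from a cheminformatics toolkit.</p>

```python
# Upper bounds quoted in the text; every descriptor must fall strictly below its limit.
RULE_OF_THREE_LIMITS = {
    "mw": 300.0,    # molecular weight, Da
    "rbc": 3,       # rotatable bond count
    "logp": 3.0,    # octanol-water partition coefficient
    "hbdc": 3,      # hydrogen-bond donor count
    "hbac": 3,      # hydrogen-bond acceptor count
    "tpsa": 60.0,   # topological polar surface area, square angstroms
}


def passes_rule_of_three(descriptors):
    """True when every descriptor is strictly below the corresponding limit."""
    return all(limit > descriptors[name] for name, limit in RULE_OF_THREE_LIMITS.items())


# Hypothetical descriptor values, not taken from GDB-11.
fragment = {"mw": 142.2, "rbc": 2, "logp": 1.1, "hbdc": 1, "hbac": 2, "tpsa": 46.3}
too_flexible = dict(fragment, rbc=5)
```

<p>Here <code>passes_rule_of_three(fragment)</code> returns <code>True</code>, while <code>too_flexible</code> fails on rotatable bonds alone.</p>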
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>While the generated GDB-11 database is openly available, reproducing the exact generation from graph to stereoisomer relies on in-house and proprietary software which is not publicly available.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://gdb.unibe.ch/downloads/">GDB Downloads (University of Berne)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Official host for GDB databases</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5172017">Zenodo Record (10.5281/zenodo.5172017)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Version-agnostic Zenodo archive of GDB-11</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Paper Accessibility</strong>: Closed-access (Published in JCIM 2007; no preprint available).</li>
<li><strong>Data Availability</strong>: The complete dataset is hosted on an open-access Zenodo repository (version-agnostic DOI): <a href="https://doi.org/10.5281/zenodo.5172017">10.5281/zenodo.5172017</a>.</li>
<li><strong>Software Dependencies (Closed/Commercial)</strong>:
<ul>
<li>Generation code is a closed-source Java (J2SE v5.0) application.</li>
<li>Relies on proprietary ChemAxon libraries (JChem v3.1, Marvin v4.0 API).</li>
<li>Virtual screening evaluation utilized the commercial Molinspiration <code>miscreen</code> toolkit.</li>
</ul>
</li>
<li><strong>Hardware Profile</strong>:
<ul>
<li><strong>CPUs</strong>: Two AMD Opteron 252 2.6 GHz processors</li>
<li><strong>Parallelization</strong>: 80-fold parallelization</li>
<li><strong>Compute Time</strong>: Approximately 20 hours for full generation</li>
</ul>
</li>
</ul>
<h3 id="force-field">Force Field</h3>
<p>A custom implementation of the MM2 force field was used for steric energy minimization during structure validation. It used the parameter set from Allinger, specifically adding a quartic term for bond stretching to prevent bond lengthening far from equilibrium:</p>
<p>$$
\begin{aligned}
E_{\text{Steric}} &amp;= \sum_{\text{bonds}} k_b(l_i - l_{0,i})^2 \left[1 + k'_b(l_i - l_{0,i}) + k''_b(l_i - l_{0,i})^2\right] \\
&amp;\quad + \sum_{\text{angles}} k_\theta(\theta_i - \theta_{0,i})^2 \left[1 + k'_\theta(\theta_i - \theta_{0,i})^4\right] \\
&amp;\quad + \sum_{\text{angles}} k_{b,\theta}(\theta_i - \theta_{0,i})^2 \left[(l_a - l_{0,a}) + (l_b - l_{0,b})\right] \\
&amp;\quad + \sum_{\text{torsions}} \left[ \frac{V_1}{2} (1 + \cos \omega) + \frac{V_2}{2} (1 - \cos 2\omega) + \frac{V_3}{2} (1 + \cos 3\omega) \right] \\
&amp;\quad + \sum_{i=1}^N \sum_{j=i+1}^N \epsilon_{ij} \left[ A \exp \left( \frac{-B r_{ij}}{\sum r^{\ast}_{ij}} \right) - C \left( \frac{\sum r^{\ast}_{ij}}{r_{ij}} \right)^6 \right]
\end{aligned}
$$</p>
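<p>To see why the quartic stretching term matters, consider the bond-stretch bracket in isolation. The sketch below uses illustrative parameters only (they are not the values used for GDB-11), and the quartic coefficient is taken from the Taylor expansion of a Morse potential, which is an assumption made here for demonstration. With a negative cubic constant and no quartic term, the energy of a strongly stretched bond turns negative, so a minimizer could pull the bond apart.</p>

```python
def stretch_energy(length, l0, k_b, k_cubic, k_quartic=0.0):
    """Bond-stretch term: k_b * dl^2 * (1 + k_cubic * dl + k_quartic * dl^2)."""
    dl = length - l0
    return k_b * dl ** 2 * (1.0 + k_cubic * dl + k_quartic * dl ** 2)


# Illustrative parameters only (not the values used for GDB-11).
K_B, L0, K_CUBIC = 1.0, 1.5, -2.0
# Assumed quartic coefficient, from the Taylor expansion of a Morse potential.
K_QUARTIC = 7.0 / 12.0 * K_CUBIC ** 2

stretched = L0 + 1.0  # a bond pulled 1 angstrom past equilibrium
cubic_only = stretch_energy(stretched, L0, K_B, K_CUBIC)
with_quartic = stretch_energy(stretched, L0, K_B, K_CUBIC, K_QUARTIC)
```

<p>With these numbers <code>cubic_only</code> is negative while <code>with_quartic</code> stays positive, which is the behavior the added term is meant to guarantee.</p>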
<h2 id="paper-information">Paper Information</h2>
<p>Fink, T. and Reymond, J.-L. (2007). Virtual Exploration of the Chemical Universe up to 11 Atoms of C, N, O, F: Assembly of 26.4 Million Structures (110.9 Million Stereoisomers) and Analysis for New Ring Systems, Stereochemistry, Physicochemical Properties, Compound Classes, and Drug Discovery. <em>Journal of Chemical Information and Modeling</em>, 47(2), 342&ndash;353. <a href="https://doi.org/10.1021/ci600423u">https://doi.org/10.1021/ci600423u</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fink2007virtual,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Virtual exploration of the chemical universe up to 11 atoms of C, N, O, and F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes, and drug discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fink, Tobias and Reymond, Jean-Louis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{47}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{342--353}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2007}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GDB-17: Chemical Universe Database (166.4B Molecules)</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-17/</link><pubDate>Sat, 16 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-17/</guid><description>Dataset card for GDB-17, containing 166.4 billion small organic molecules representing the largest enumerated chemical space to date.</description><content:encoded><![CDATA[<h2 id="key-contribution">Key Contribution</h2>
<p>GDB-17 systematically enumerates 166.4 billion organic molecules of up to 17 atoms, extending the enumerated chemical universe into the drug-relevant size range. The resulting chemical space is dense, largely novel, and measurably richer in stereochemical and three-dimensional complexity than historically biased chemical databases.</p>
<h2 id="overview">Overview</h2>
<p>GDB-17 is the largest enumerated database of drug-like small molecules, reaching the size range typical of lead compounds and approved drugs ($100 &lt; \text{MW} &lt; 350$ Da). It contains 166.4 billion structures of up to 17 atoms of C, N, O, S, and halogens (F, Cl, Br, I). Because the number of combinatorial possibilities grows exponentially with heavy atom count (HAC), the MW distribution of the database peaks sharply in the $240$-$250 \text{ Da}$ range. Compared to known molecules in databases like PubChem, GDB-17 molecules are significantly richer in non-aromatic heterocycles, quaternary centers, and stereoisomers, escaping &ldquo;flatland&rdquo; by populating the third dimension of shape space.</p>
<h2 id="dataset-examples">Dataset Examples</h2>
<figure class="post-figure center ">
    <img src="/img/gdb_17_sample.webp"
         alt="Example GDB-17 molecule"
         title="Example GDB-17 molecule"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example GDB-17 molecule (SMILES: <code>C1CC2C3CCCC3C3(C4CCC3CC4)C2C1</code>) demonstrating the complex polycyclic structures and 3D diversity characteristic of the database</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GDB-17 (Full)</strong></td>
          <td>166.4B</td>
          <td>Complete enumeration of the database</td>
      </tr>
      <tr>
          <td><strong>GDBLL-17</strong></td>
          <td>29B</td>
          <td>Lead-like subset ($1 &lt; \text{clogP} &lt; 3$ and $100 &lt; \text{MW} &lt; 350$ Da)</td>
      </tr>
      <tr>
          <td><strong>GDBLLnoSR-17</strong></td>
          <td>22B</td>
          <td>Lead-like subset excluding compounds with small rings (3- or 4-membered)</td>
      </tr>
      <tr>
          <td><strong>Random Sample</strong></td>
          <td>50M</td>
          <td>Random 50M subset available for download, including pre-filtered lead-like and no-small-ring fractions</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarks">Benchmarks</h2>
<p><em>Note: As an enumerated database of theoretical structures, GDB-17 has no standard supervised ML benchmarks. It serves primarily as a reference library for chemical space exploration, unsupervised learning, and molecular generation.</em></p>
<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GDB-11</strong></td>
          <td>Predecessor</td>
          <td><a href="/notes/chemistry/datasets/gdb-11/">Notes</a></td>
      </tr>
      <tr>
          <td><strong>GDB-13</strong></td>
          <td>Predecessor</td>
          <td><a href="/notes/chemistry/datasets/gdb-13/">Notes</a></td>
      </tr>
  </tbody>
</table>
<h2 id="strengths--limitations">Strengths &amp; Limitations</h2>
<p><strong>Strengths:</strong></p>
<ul>
<li><strong>3D Shape Space (&ldquo;Escape out of Flatland&rdquo;)</strong>: Populates the third dimension (spherical, non-planar shapes) significantly better than known structures in PubChem or ChEMBL, which are primarily flat and rod-like due to aromatic dominance</li>
<li><strong>Stereochemical Complexity</strong>: Averages 6.4 possible stereoisomers per molecule (compared to 2.0 in PubChem-17), driven by an abundance of non-planar features and small rings</li>
<li><strong>Massive Scaffold Diversity</strong>: Features 35-fold more Murcko scaffolds and 61-fold more ring systems than molecules of matching size in PubChem</li>
<li><strong>Rich in Known Drug Isomers</strong>: Contains millions of exact geometric and formula isomers of approved drugs, offering direct variations and &ldquo;methyl walk&rdquo; analogs</li>
</ul>
<p><strong>Limitations:</strong></p>
<ul>
<li><strong>Experimental Gap</strong>: These are virtual, combinatorially enumerated molecules. Despite strict computational stability filtering, they remain unsynthesized and lack experimental validation.</li>
<li><strong>Small Ring Dominance</strong>: Up to 16 atoms, roughly 83% of the database consists of compounds with challenging small (3- or 4-membered) rings, though this drops for the 17-atom set, resulting in an overall 28% fraction of small ring compounds</li>
<li><strong>Elemental Scope Restrictions</strong>: Elements like P, Si, and B, which occasionally appear in drugs, are completely excluded</li>
<li><strong>Strict Stability Filters</strong>: Excludes some potentially viable functional groups strictly to manage the combinatorial explosion and avoid unstable structures (e.g., hemiacetals, aminals, acyclic acetals)</li>
<li><strong>Polarity Skew</strong>: The full database contains disproportionately more polar molecules ($\text{clogP} &lt; 0$) than reference sets, and its sheer size makes it computationally demanding to query using advanced docking or 3D shape tools</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="generation-pipeline">Generation Pipeline</h3>
<p>GDB-17 was generated from first principles through a highly filtered, multi-stage pipeline:</p>
<ol>
<li><strong>Graphs $\rightarrow$ Hydrocarbons</strong>: Started with 114.3 billion topologies (generated using GENG), filtered down to 5.4 million stable hydrocarbons by applying geometrical strain rules (H-filters).</li>
<li><strong>Hydrocarbons $\rightarrow$ Skeletons</strong>: Substituted single bonds with double and triple bonds to yield 1.3 billion skeletons, simultaneously removing reactive unsaturations like allenes (S-filters).</li>
<li><strong>Skeletons $\rightarrow$ CNO Molecules</strong>: Diversified into 110.4 billion molecules by combinatorially substituting C with N and O, explicitly avoiding heteroatom-heteroatom bonds and enforcing stability filters (F-filters).</li>
<li><strong>Post-processing</strong>: Added diversity by transforming groups to generate aromatics, oximes, $\text{CF}_3$, halogens, and sulfones (P-filters), yielding the final 166.4 billion count.</li>
</ol>
<h3 id="hardware--software">Hardware &amp; Software</h3>
<ul>
<li><strong>Compute</strong>: Ran over 40,000 jobs across a 360-CPU cluster, consuming 100,000 CPU hours (~11 CPU years)</li>
<li><strong>Software</strong>: Powered by <strong>GENG</strong> (Nauty package) for graph generation, <strong>CORINA</strong> for 3D stereoisomer generation, and ChemAxon JChem libraries running inside custom Java 1.6 applications</li>
</ul>
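<p>The quoted compute figures can be sanity-checked with a few lines of arithmetic. The wall-clock estimate below assumes perfect scaling across all 360 CPUs, which the source does not claim, and the per-job figure treats &ldquo;over 40,000&rdquo; as exactly 40,000.</p>

```python
HOURS_PER_YEAR = 24 * 365

cpu_hours = 100_000
cpus = 360
jobs = 40_000

cpu_years = cpu_hours / HOURS_PER_YEAR          # roughly 11.4 CPU years
ideal_wall_clock_days = cpu_hours / cpus / 24   # roughly 11.6 days under perfect scaling
mean_cpu_hours_per_job = cpu_hours / jobs       # 2.5 CPU hours per job at exactly 40,000 jobs
```

<p>Under these assumptions the whole enumeration corresponds to under two weeks of wall-clock time on the full cluster.</p>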
<h3 id="shape-analysis-pmi">Shape Analysis (PMI)</h3>
<p>To quantify the &ldquo;escape from flatland,&rdquo; the original paper classifies molecular shape using the normalized Principal Moments of Inertia (PMI) of the generated 3D conformers. The principal moments ($I_1 \le I_2 \le I_3$) are obtained by diagonalizing the moment of inertia tensor. Molecules are plotted within a normalized 2D triangular space defined by the ratios:</p>
<p>$$ P_1 = \frac{I_1}{I_3}, \quad P_2 = \frac{I_2}{I_3} $$</p>
<p>The vertices of this plot define the three geometrical boundaries of chemical space:</p>
<ul>
<li><strong>Rod-like (1D)</strong>: $(0, 1)$ typical of stretched alkanes</li>
<li><strong>Disc-like (2D)</strong>: $(0.5, 0.5)$ typical of flat aromatics like benzene</li>
<li><strong>Sphere-like (3D)</strong>: $(1, 1)$ typical of globular structures like cubane</li>
</ul>
<p>GDB-17&rsquo;s core structural finding is that exhaustively enumerated chemical space densely populates the interior and the $(1,1)$ spherical region of this plot, demonstrating significant 3D character. By contrast, empirical libraries cluster along the rod-to-disc axis.</p>
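<p>The PMI coordinates can be computed from atomic positions with a short, dependency-free sketch. Everything here is an illustrative assumption rather than the paper&rsquo;s code: the function names, the default of unit masses, and the idealized point sets standing in for rod, disc, and sphere. The eigenvalues of the symmetric 3x3 inertia tensor are obtained with the standard closed-form trigonometric method.</p>

```python
import math


def principal_moments(coords, masses=None):
    """Sorted eigenvalues (I1, I2, I3) of the inertia tensor about the center of mass."""
    masses = masses or [1.0] * len(coords)
    total = sum(masses)
    center = [sum(m * c[k] for m, c in zip(masses, coords)) / total for k in range(3)]
    ixx = iyy = izz = ixy = ixz = iyz = 0.0
    for m, c in zip(masses, coords):
        x, y, z = (c[k] - center[k] for k in range(3))
        ixx += m * (y * y + z * z)
        iyy += m * (x * x + z * z)
        izz += m * (x * x + y * y)
        ixy -= m * x * y
        ixz -= m * x * z
        iyz -= m * y * z
    off = ixy ** 2 + ixz ** 2 + iyz ** 2
    if 1e-20 > off:  # tensor is already diagonal (absolute tolerance, fine for this sketch)
        return tuple(sorted((ixx, iyy, izz)))
    # Closed-form eigenvalues of a symmetric 3x3 matrix (trigonometric method).
    q = (ixx + iyy + izz) / 3.0
    p = math.sqrt(((ixx - q) ** 2 + (iyy - q) ** 2 + (izz - q) ** 2 + 2.0 * off) / 6.0)
    a, b, c = (ixx - q) / p, (iyy - q) / p, (izz - q) / p
    d, e, f = ixy / p, ixz / p, iyz / p
    det = a * (b * c - f * f) - d * (d * c - f * e) + e * (d * f - b * e)
    phi = math.acos(max(-1.0, min(1.0, det / 2.0))) / 3.0
    largest = q + 2.0 * p * math.cos(phi)
    smallest = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return (smallest, 3.0 * q - smallest - largest, largest)


def normalized_pmi(coords, masses=None):
    """Return (I1/I3, I2/I3), the coordinates used in the triangular shape plot."""
    i1, i2, i3 = principal_moments(coords, masses)
    return (i1 / i3, i2 / i3)


# Idealized unit-mass point sets standing in for the three corner shapes.
rod = [(-1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
disc = [(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3), 0.0) for k in range(6)]
cube = [(x, y, z) for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (-1.0, 1.0)]
```

<p>Calling <code>normalized_pmi</code> on the three point sets lands on the rod, disc, and sphere vertices respectively. Real molecules would use atomic masses and 3D conformer coordinates, and fall somewhere inside the triangle.</p>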
<h3 id="differences-from-gdb-13">Differences from GDB-13</h3>
<ul>
<li>The algorithm was completely rewritten to optimize memory efficiency, which increased computing speed roughly 400-fold and allowed enumeration beyond the previous 13-atom limit</li>
<li>Scope was expanded to include all four halogens (F, Cl, Br, I) within the base framework</li>
<li>Size-dependent graph selection filters were introduced (prohibiting complex bridgeheads and eliminating small rings in 17-atom graphs) to manage combinatorial explosion</li>
<li>Functional post-processing steps were decoupled from the main enumeration to add features like cyclic oximes, aromatic halogens, and sulfones that would otherwise be rejected or violate the underlying generation constraints</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<ul>
<li><strong>Paper Accessibility</strong>: The original paper is published in the <em>Journal of Chemical Information and Modeling</em> and is available as an Open Access publication under a CC-BY license.</li>
<li><strong>Data Availability</strong>: The full 166.4 billion molecule dataset is not publicly available for download (estimated &gt;400 GB compressed). However, a 50 million random subset and pre-filtered lead-like fractions are openly available on the <a href="https://gdb.unibe.ch/downloads/">GDB website</a> and archived on <a href="https://zenodo.org/records/5172018">Zenodo</a>.</li>
<li><strong>Code &amp; Algorithms</strong>: The enumeration rules and logic are well-described in the paper, but the actual Java 1.6 source code has not been released.</li>
<li><strong>Dependencies</strong>: The pipeline is a mix of open-source and proprietary software tools. Graph generation uses open-source GENG (Nauty), while chemical logic and stereoisomer generation rely on proprietary ChemAxon JChem libraries and CORINA.</li>
<li><strong>Hardware Specifications</strong>: The original database generation was explicitly parallelized across a 360-CPU cluster, consuming 100,000 CPU hours (approximately 11 CPU years) with over 40,000 calculation runs.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p>Ruddigkeit, L., van Deursen, R., Blum, L. C., and Reymond, J.-L. (2012). Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. <em>Journal of Chemical Information and Modeling</em>, 52(11), 2864&ndash;2875. <a href="https://doi.org/10.1021/ci300415d">https://doi.org/10.1021/ci300415d</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Ruddigkeit_2012,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{52}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">ISSN</span>=<span style="color:#e6db74">{1549-960X}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{http://dx.doi.org/10.1021/ci300415d}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">DOI</span>=<span style="color:#e6db74">{10.1021/ci300415d}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society (ACS)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ruddigkeit, Lars and van Deursen, Ruud and Blum, Lorenz C. and Reymond, Jean-Louis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2012}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span>=nov,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2864--2875}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GDB-13: Chemical Universe Database (970M Molecules)</title><link>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-13/</link><pubDate>Sat, 16 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/datasets/gdb-13/</guid><description>A dataset card for the Generated Database 13 (GDB-13), a database of nearly 1 billion small organic molecules for virtual screening and drug discovery.</description><content:encoded><![CDATA[<h2 id="dataset-examples">Dataset Examples</h2>
<figure class="post-figure center ">
    <img src="/img/gdb_13_sample.webp"
         alt="Example GDB-13 molecule"
         title="Example GDB-13 molecule"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example GDB-13 molecule (SMILES: <code>CCCC(O)(CO)CC1CC1CN</code>)</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>C/N/O Set</strong></td>
          <td>~910.1M</td>
          <td>Molecules containing up to 13 atoms of Carbon, Nitrogen, and Oxygen.</td>
      </tr>
      <tr>
          <td><strong>Cl/S Set</strong></td>
          <td>~67.3M</td>
          <td>Molecules containing up to 13 atoms, adding Sulfur (aromatic heterocycles, sulfones, sulfonamides, thioureas) and Chlorine (aromatic substituents).</td>
      </tr>
  </tbody>
</table>
<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>GDB-11</strong></td>
          <td>Predecessor</td>
          <td><a href="/notes/chemistry/datasets/gdb-11/">Notes</a></td>
      </tr>
      <tr>
          <td><strong>GDB-17</strong></td>
          <td>Successor</td>
          <td><a href="/notes/chemistry/datasets/gdb-17/">Notes</a></td>
      </tr>
  </tbody>
</table>
<h2 id="key-contribution">Key Contribution</h2>
<p>The creation and release of GDB-13, a database of 977.5 million compounds that expands both molecular size (up to 13 atoms) and elemental diversity (including S and Cl), made possible by algorithmic optimizations that substantially accelerated the enumeration process.</p>
<h2 id="overview">Overview</h2>
<p>GDB-13 extends the systematic enumeration of drug-like chemical space to molecules containing up to 13 atoms of Carbon, Nitrogen, Oxygen, Sulfur, and Chlorine. Building on the methodology established in GDB-11, this database represents a 37-fold increase in size while maintaining 100% Lipinski compliance for virtual screening applications. The enumeration results in a vast array of cyclic topologies, where 54% of the database comprises molecules with at least one three- or four-membered ring.</p>
<h2 id="strengths">Strengths</h2>
<ul>
<li>Systematic coverage of structures with up to 13 atoms</li>
<li>High drug-likeness: 100% Lipinski compliance and 99.5% Vieth compliance</li>
<li>High proportion of leadlike (98.9%) and fragmentlike (45.1%) molecules</li>
<li>Structural novelty providing fragments absent from established databases like ZINC, ACX, and PubChem</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li>Limited to small molecules with up to 13 atoms of C, N, O, S, and Cl</li>
<li>Omits 66.2% of known chemical space up to 13 atoms found in external databases</li>
<li>Excludes specific nonenumerated elements (F, Br, I, P, Si, metals) and functional groups (chlorine on nonaromatic carbons, mercaptans, sulfoxides, enamines, allenes)</li>
<li>Excludes highly strained molecules and highly polar combinations</li>
<li>Consists entirely of computer-generated structures pending experimental validation</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="algorithmic-approach">Algorithmic Approach</h3>
<p><strong>Type</strong>: Rule-Based Combinatorial Graph Enumeration</p>
<p>The dataset is built by <strong>combinatorial enumeration</strong>: the graph generator GENG supplies candidate topologies, and rule-based chemical stability filters prune them.</p>
<p><strong>Process</strong>:</p>
<ol>
<li>Start with mathematical graphs representing saturated hydrocarbons up to 13 nodes using GENG (non-planar graphs discarded)</li>
<li>Apply topological filters to remove highly strained small ring systems (e.g., fused cyclopropanes and bridgehead 3/4-membered rings)</li>
<li>Generate 3D structures via CORINA or ChemAxon to apply a 3D volume-based strain filter. The local strain of a tetravalent carbon is estimated by the volume $V$ of the tetrahedron formed by extending a $1 \text{ \AA}$ line along its four single bonds. Hydrocarbons with planar or pyramidal carbon centers are discarded if:
$$ V &lt; 0.345 \text{ \AA}^3 $$</li>
<li>Introduce unsaturations and heteroatoms through systematic substitution</li>
<li>Apply chemical rule filters and element-ratio heuristics to ensure stability and drug-likeness</li>
<li>Apply post-processing algorithms to introduce nitro groups, nitriles, aromatic chlorines, thiophenes, sulfonamides, and thioureas</li>
</ol>
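<p>The volume-based strain filter from step 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, bond directions are supplied directly as vectors rather than read from a CORINA or ChemAxon structure, and only the $0.345 \text{ \AA}^3$ cutoff comes from the text.</p>

```python
import math

STRAIN_VOLUME_CUTOFF = 0.345  # cubic angstroms, threshold quoted in the text


def unit(vector):
    """Scale a 3-vector to length 1, i.e. a point 1 angstrom along a bond."""
    norm = math.sqrt(sum(component ** 2 for component in vector))
    return tuple(component / norm for component in vector)


def tetrahedron_volume(bond_directions):
    """Volume of the tetrahedron whose corners sit 1 angstrom along each of four bonds."""
    a, b, c, d = (unit(v) for v in bond_directions)
    u = [b[k] - a[k] for k in range(3)]
    v = [c[k] - a[k] for k in range(3)]
    w = [d[k] - a[k] for k in range(3)]
    # Scalar triple product u . (v x w); its magnitude is six times the volume.
    triple = (u[0] * (v[1] * w[2] - v[2] * w[1])
              - u[1] * (v[0] * w[2] - v[2] * w[0])
              + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return abs(triple) / 6.0


def is_too_strained(bond_directions):
    """True when the carbon center falls below the volume cutoff."""
    return STRAIN_VOLUME_CUTOFF > tetrahedron_volume(bond_directions)


# An ideal tetrahedral carbon and a fully planar four-coordinate center.
ideal_sp3 = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
planar = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
```

<p>An ideal tetrahedral center gives a volume of about $0.513 \text{ \AA}^3$ and passes, while a planar center gives zero volume and is rejected. The cutoff therefore sits at roughly two-thirds of the ideal value.</p>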
<p><strong>Key Optimization</strong>: Replaced computationally expensive MM2 minimization (used in GDB-11) with a fast geometry-based estimation of strained polycyclic ring systems, combined with fast &ldquo;element-ratio&rdquo; filters. This achieved a <strong>6.4-fold speedup</strong> in structure validation early in the pipeline.</p>
<h3 id="differences-from-gdb-11">Differences from GDB-11</h3>
<ul>
<li><strong>Element Selection</strong>: Fluorine removed from allowed elements; sulfur and chlorine added for higher drug relevance (e.g., thiophenes, sulfonamides).</li>
<li><strong>Optimization Method</strong>: MM2-based structure optimization replaced with a much faster, custom geometry-based estimation of local strain (measuring the tetrahedron volume of carbon centers).</li>
<li><strong>Heuristic Filters</strong>: Fast elemental ratio filters added to quickly reject highly polar, unstable combinations early in the pipeline.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="paper--data-availability">Paper &amp; Data Availability</h3>
<ul>
<li><strong>Paper Access</strong>: The original paper is published in the Journal of the American Chemical Society (JACS) and is closed-access/paywalled. No open-access preprint exists on arXiv or ChemRxiv.</li>
<li><strong>Data Access</strong>: The full GDB-13 database and its subsets are freely available via the <a href="https://gdb.unibe.ch/downloads/">Reymond Group Downloads Page</a> and are persistently hosted on <a href="https://doi.org/10.5281/zenodo.5172018">Zenodo</a>.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://gdb.unibe.ch/downloads/">GDB-13 Database (Reymond Group)</a></td>
          <td>Dataset</td>
          <td>Free download</td>
          <td>Official download page hosted by the Reymond Group</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5172018">GDB-13 on Zenodo</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Persistent archival copy</td>
      </tr>
  </tbody>
</table>
<h3 id="source-code--algorithms">Source Code &amp; Algorithms</h3>
<p>The custom source code (e.g., GENG orchestration, local strain filters) is not publicly available. Researchers must re-implement the rules as described in the paper and supplementary materials.</p>
<h3 id="heuristic-filters">Heuristic Filters</h3>
<p>Implemented element-ratio filters derived from analyzing known compound databases to reject chemically unstable or highly polar molecules early in the generation pipeline:</p>
<p>$$
\begin{aligned}
\frac{N + O}{C} &amp;&lt; 1.0 \\
\frac{N}{C} &amp;&lt; 0.571 \\
\frac{O}{C} &amp;&lt; 0.666
\end{aligned}
$$</p>
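<p>These three inequalities amount to a cheap check on heavy-atom counts. The sketch below is illustrative: the function name and the handling of carbon-free formulas are assumptions, and only the three thresholds come from the text.</p>

```python
def passes_element_ratio_filters(n_carbon, n_nitrogen, n_oxygen):
    """Apply the three heteroatom-to-carbon ratio limits quoted above."""
    if n_carbon == 0:
        return False  # assumption: carbon-free formulas are rejected outright
    return (
        1.0 > (n_nitrogen + n_oxygen) / n_carbon
        and 0.571 > n_nitrogen / n_carbon
        and 0.666 > n_oxygen / n_carbon
    )


# Heavy-atom counts: the GDB-13 example CCCC(O)(CO)CC1CC1CN has 10 C, 1 N, 2 O;
# mannitol (C6H14O6) has 6 C, 0 N, 6 O.
example_ok = passes_element_ratio_filters(10, 1, 2)
mannitol_ok = passes_element_ratio_filters(6, 0, 6)
```

<p>The example molecule passes all three limits, while mannitol fails because its O/C ratio is 1.0. Since the check needs only element counts, it can run before any structure is generated.</p>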
<h3 id="excluded-functional-groups">Excluded Functional Groups</h3>
<ul>
<li>O-O bonds (peroxides)</li>
<li>Hemiacetals, aminals, acyclic imines, non-aromatic enols</li>
<li>Compounds containing both primary/secondary amines and aldehydes/ketones</li>
<li>Nonenumerated elements (F, Br, I, P, Si, metals)</li>
<li>High-heteroatom ratio structures (e.g., mannitol)</li>
</ul>
<h3 id="hardware--compute">Hardware &amp; Compute</h3>
<ul>
<li><strong>Compute Cost</strong>: ~40,000 CPU hours for the 910 million C/N/O structures.</li>
<li><strong>Infrastructure</strong>: Executed in parallel on a <strong>500-node cluster</strong></li>
<li><strong>Assembly Optimization</strong>: Switching from MM2 minimization to geometry-based estimation of strained polycyclic ring systems, together with the element-ratio filters, reduced assembly time 6.4-fold as measured on GDB-11 workloads (from 1,600 to 250 CPU hours).</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p>Blum, L. C. and Reymond, J.-L. (2009). 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. <em>Journal of the American Chemical Society</em>, 131(25), 8732&ndash;8733. <a href="https://doi.org/10.1021/ja902302h">https://doi.org/10.1021/ja902302h</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{blum2009gdb13,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{970 million druglike small molecules for virtual screening in the chemical universe database GDB-13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Blum, Lorenz C and Reymond, Jean-Louis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of the American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{131}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{25}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{8732--8733}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2009}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACS Publications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/ja902302h}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>