Overview
This software update paper documents major improvements to the SELFIES Python library (version 2.1.1), covering its history, underlying algorithms, design, and performance.
Limitations in the Original SELFIES Implementation
While the original SELFIES concept was promising, the initial 2019 implementation had critical limitations that prevented widespread adoption:
- Performance: Too slow for production ML workflows
- Limited chemistry: Couldn’t represent aromatic molecules, stereochemistry, or many other important chemical features
- Poor usability: Lacked user-friendly APIs for common tasks
These barriers meant that despite SELFIES’ theoretical advantages (100% validity guarantee), researchers couldn’t practically use it for real-world applications like drug discovery or materials science.
Architectural Refactoring and New ML Integrations
The 2023 update refactors the underlying SELFIES engine with improvements to design, efficiency, and supported features. The key updates include:
Streamlined Grammar: The underlying context-free grammar has been generalized and streamlined, improving execution speed and extensibility while maintaining the 100% validity guarantee.
Expanded Chemical Support: Adds support for aromatic systems (via internal kekulization), stereochemistry (chirality, cis/trans), charged species, and isotopic data, covering nearly all features supported by SMILES while preserving the validity guarantee.
Semantic Constraint API: Introduces the set_semantic_constraints() function, allowing specification of custom valence definitions useful for theoretical studies or hypervalent states.
ML Utility Functions: Provides tokenization (split_selfies), length estimation (len_selfies), label/one-hot encoding (selfies_to_encoding), vocabulary extraction, and attribution tracking for integration with neural network pipelines.
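To make the tokenization and encoding utilities concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: the helper names (split_symbols, build_vocab, to_labels, to_one_hot) are hypothetical, and the use of [nop] as the padding symbol is an assumption made for this example.

```python
import re

def split_symbols(selfies_string):
    """Split a SELFIES string into bracketed symbols (and '.' separators)."""
    return re.findall(r"\[[^\]]*\]|\.", selfies_string)

def build_vocab(strings, pad="[nop]"):
    """Map every symbol seen in `strings`, plus a padding symbol, to an integer label."""
    symbols = {s for text in strings for s in split_symbols(text)}
    symbols.add(pad)
    return {sym: i for i, sym in enumerate(sorted(symbols))}

def to_labels(selfies_string, vocab, pad_to, pad="[nop]"):
    """Label-encode a string, right-padding it to `pad_to` symbols."""
    symbols = split_symbols(selfies_string)
    symbols += [pad] * (pad_to - len(symbols))
    return [vocab[s] for s in symbols]

def to_one_hot(labels, vocab_size):
    """Turn integer labels into one-hot rows."""
    return [[1 if j == label else 0 for j in range(vocab_size)] for label in labels]

dataset = ["[C][O][C]", "[F][C][F]", "[O][=O]"]
vocab = build_vocab(dataset)
labels = to_labels("[C][O][C]", vocab, pad_to=4)
one_hot = to_one_hot(labels, len(vocab))
print(vocab)
print(labels)
```

Because every symbol is bracketed, tokenization reduces to a single regular expression, which is part of what makes SELFIES convenient as a fixed-vocabulary input for sequence models.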
Performance Benchmarks & Validity Testing
The authors validated the library through several benchmarks:
Performance testing: Roundtrip conversion (SMILES to SELFIES to SMILES) on the DTP open compound collection (slightly over 300K molecules) completed in 252 seconds total (136s encoding, 116s decoding), using pure Python with no external dependencies.
Random SELFIES generation: Demonstrated that random SELFIES strings of varying lengths always decode to valid molecules, with the size distribution of generated molecules controllable by filtering the sampling alphabet (e.g., removing multi-bond and low-valence atom symbols shifts the distribution toward larger molecules).
Validity guarantee: By construction, every SELFIES string decodes to a valid molecule. The grammar’s bond demotion and deferred ring closure mechanisms make it impossible to generate chemically invalid structures.
Attribution system: Showed both encoder and decoder can track which input symbols produce which output symbols, useful for property alignment.
Future Trajectories for General Chemical Representations
The 2023 update successfully addresses the main adoption barriers:
- Fast enough for large-scale ML applications (300K molecules in ~4 minutes)
- Chemically comprehensive enough for drug discovery and materials science
- User-friendly enough for straightforward integration into existing workflows
The validity guarantee, SELFIES’ core advantage, is now practically accessible for real-world research. The roadmap includes future extensions for polymers, crystals, chemical reactions, and non-covalent interactions, which would expand SELFIES’ applicability beyond small-molecule chemistry.
Limitations acknowledged: The paper focuses on implementation improvements. Some advanced chemical systems (polymers, crystals) still need future work.
Reproducibility Details
Artifacts
| Artifact | Type | License | Notes |
|---|---|---|---|
| selfies | Code | Apache 2.0 | Official Python library, installable via pip install selfies |
Code
The selfies library is completely open-source and written in pure Python. It requires no extra dependencies and is available on GitHub, installable via pip install selfies. The repository includes testing suites (tox) and example benchmarking scripts to reproduce the translation speeds reported in the paper.
Hardware
Performance benchmarks (e.g., the 252-second roundtrip conversion on 300K molecules) were executed on Google Colaboratory using two 2.20GHz Intel Xeon CPUs.
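As a quick sanity check on the reported timings, the arithmetic below converts them into throughput. The molecule count is rounded down to 300,000 (the paper says "slightly over 300K"), so the derived figures are approximate.

```python
# Reported timings for the DTP roundtrip benchmark (seconds).
encode_s, decode_s = 136, 116
n_molecules = 300_000  # assumption: rounded down from "slightly over 300K"

roundtrip_s = encode_s + decode_s
throughput = n_molecules / roundtrip_s            # molecules per second
per_molecule_ms = 1000 * roundtrip_s / n_molecules  # milliseconds per roundtrip

print(f"roundtrip: {roundtrip_s} s ({roundtrip_s / 60:.1f} min)")
print(f"throughput: {throughput:.0f} molecules/s")
print(f"cost: {per_molecule_ms:.2f} ms per molecule")
```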
Algorithms
Technical Specification: The Grammar
The core innovation of SELFIES is a Context-Free Grammar (CFG) augmented with state-machine logic to ensure that every derived string represents a valid molecule. While the software features are important, understanding the underlying derivation rules is essential for replication or extension of the system.
1. Derivation Rules: The Atom State Machine
The fundamental mechanism that guarantees validity is a state machine that tracks the remaining valence of the most recently added atom:
- State Tracking: The derivation maintains a non-terminal state $X_l$, where $l$ represents the current atom’s remaining valence (number of bonds it can still form)
- Standard Derivation: An atom symbol $[\beta \alpha]$ (bond order + atom type) transitions the state to $X_l$, where $l$ is the atom’s standard valence minus the order of the bond actually formed. From the start state $S$ there is no previous atom, so no bond is formed and $l$ is the atom’s full valence
- Bond Demotion (The Key Rule): When deriving atom symbol $[\beta \alpha]$ in state $X_i$, the actual bond order used is $d_0 = \min(\ell, i, d(\beta))$, where $\ell$ is the new atom’s valence, $i$ is the previous atom’s remaining capacity, and $d(\beta)$ is the requested bond order. This automatic downward adjustment is the mathematical core of the validity guarantee.
This state machine ensures that no atom ever exceeds its allowed valence, making it impossible to generate chemically invalid structures.
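The bond demotion rule can be sketched for the simplest case, an unbranched chain. This is an illustration under stated assumptions rather than the library's decoder: the VALENCE table is a small illustrative subset, derive_chain is a hypothetical helper, and the sketch simply stops once the previous atom has no remaining valence.

```python
# Illustrative valences only; the library's real constraint table is richer.
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}
BOND_ORDER = {"": 1, "=": 2, "#": 3}

def derive_chain(symbols):
    """Derive an unbranched chain from (bond_prefix, element) pairs.

    Returns (element, order_of_bond_to_previous_atom) pairs; the first
    atom has order 0 because it has no predecessor.
    """
    atoms = []
    remaining = None  # None plays the role of the start state S
    for prefix, element in symbols:
        valence = VALENCE[element]
        if remaining is None:
            order = 0  # first atom: the bond prefix has nothing to attach to
        elif remaining == 0:
            break  # previous atom is saturated, so the derivation stops
        else:
            # Bond demotion: d0 = min(new atom valence, remaining capacity, requested order)
            order = min(valence, remaining, BOND_ORDER[prefix])
        atoms.append((element, order))
        remaining = valence - order
    return atoms

# A requested triple bond to oxygen is demoted to a double bond (C=O),
# after which oxygen is saturated and the trailing carbon is ignored.
print(derive_chain([("", "C"), ("#", "O"), ("", "C")]))
# Oxygen can only offer two bonds, then nitrogen has one left for carbon.
print(derive_chain([("", "O"), ("#", "N"), ("=", "C")]))
```

Because the order is always the minimum of three capacities, the remaining valence can never go negative, which is the invariant the validity guarantee rests on.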
2. Control Symbols: Branches and Rings
Branch length calculation: SELFIES uses a hexadecimal encoding to determine branch lengths. A branch symbol [Branch $\ell$] consumes the next $\ell$ symbols from the queue and converts them to integer indices $c_1, \dots, c_\ell$ via a fixed mapping (Table III in the paper). The number of symbols $N$ to include in the branch is then:
$$ N = 1 + \sum_{k=1}^{\ell} 16^{\ell - k} \, c_k $$
This formula interprets the indices as hexadecimal digits, allowing compact specification of branches up to hundreds of symbols long.
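A short worked example of the formula follows. The function takes the integer indices $c_k$ directly; the symbol-to-index mapping from Table III is deliberately not reproduced here, and branch_length is a hypothetical helper name.

```python
def branch_length(indices):
    """Compute N = 1 + sum_k 16**(l - k) * c_k for index digits c_1..c_l."""
    if not all(0 <= c < 16 for c in indices):
        raise ValueError("each index must be a hexadecimal digit (0..15)")
    length = len(indices)
    return 1 + sum(16 ** (length - k) * c for k, c in enumerate(indices, start=1))

print(branch_length([0]))        # one index symbol, smallest branch
print(branch_length([3]))        # 1 + 3
print(branch_length([1, 0]))     # 1 + 16*1 + 0
print(branch_length([15, 15]))   # 1 + 255, the largest two-digit branch
```

With one index symbol a branch spans 1 to 16 symbols, with two it spans up to 256, and with three up to 4096, so a handful of index symbols is enough for any practical branch.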
Ring closure queue system: Ring formation uses a deferred evaluation strategy to maintain validity. Ring symbols don’t create bonds immediately; instead, they push closure candidates into a queue $R$. These candidates are resolved after the main derivation completes. A ring closure candidate is rejected if either ring atom has no remaining valence ($m_1 = 0$ or $m_2 = 0$), or if the left and right ring atoms are not distinct (to avoid self-loops). If a prior bond already exists between the two atoms, the bond order is incremented rather than duplicated. This deferred validation prevents invalid ring structures while keeping the grammar context-free during the main derivation.
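The deferred resolution step can be sketched as a single pass over the queue once the main derivation has finished. This is a simplified illustration, not the library's code: resolve_ring_closures and its data layout are hypothetical, and capping an incremented bond at order 3 is an assumption of this sketch.

```python
def resolve_ring_closures(candidates, remaining, bonds, max_order=3):
    """Apply queued ring closures after the main derivation.

    candidates: list of (left_atom, right_atom, requested_order)
    remaining:  dict atom index -> remaining valence (mutated in place)
    bonds:      dict frozenset({a, b}) -> bond order (mutated in place)
    """
    for left, right, requested in candidates:
        if left == right:
            continue  # reject self-loops
        m1, m2 = remaining[left], remaining[right]
        if m1 == 0 or m2 == 0:
            continue  # reject: one of the ring atoms is saturated
        order = min(requested, m1, m2)
        key = frozenset((left, right))
        if key in bonds:
            # Existing bond: raise its order instead of adding a duplicate.
            order = min(order, max_order - bonds[key])  # assumed cap at triple bonds
            if order == 0:
                continue
            bonds[key] += order
        else:
            bonds[key] = order
        remaining[left] -= order
        remaining[right] -= order
    return bonds

# Chain C0-C1-C2-C3-F4 with single bonds; fluorine (atom 4) is saturated.
remaining = {0: 3, 1: 2, 2: 2, 3: 2, 4: 0}
bonds = {frozenset((i, i + 1)): 1 for i in range(4)}
queue = [(0, 3, 1), (1, 1, 1), (2, 4, 1), (0, 1, 1)]
resolve_ring_closures(queue, remaining, bonds)
print(sorted((tuple(sorted(k)), v) for k, v in bonds.items()))
```

In this example the first candidate closes a four-membered ring, the self-loop and the closure onto the saturated fluorine are silently dropped, and the last candidate promotes the existing C0-C1 bond to a double bond.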
3. Symbol Structure and Standardization
SELFIES enforces a strict, standardized format for atom symbols to eliminate ambiguity:
- Canonical Format: Atom symbols follow the structure [Bond, Isotope, Element, Chirality, H-count, Charge]
- No Variation: There is only one way to write each symbol (e.g., [Fe++] and [Fe+2] are standardized to a single form)
- Order Matters: The components must appear in the specified order
4. Default Semantic Constraints
By default, the library enforces standard organic chemistry valence rules:
- Charge-Dependent Valences: Default constraints specify maximum bonds per charge state (e.g., C: 4/5/3 for neutral/+1/-1; S: 6/7/5). Unlisted atom types default to 8 maximum bonds as a catch-all.
- Preset Options: Three preset constraint sets are available: default, octet_rule, and hypervalent.
- Customizable: Constraints can be modified via set_semantic_constraints() for specialized applications (hypervalent compounds, theoretical studies, etc.)
The combination of these grammar rules with the state machine ensures that every valid SELFIES string decodes to a chemically valid molecule, regardless of how the string was generated (random, ML model output, manual construction, etc.).
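The charge-dependent lookup with a catch-all can be sketched as a plain dictionary. Only the carbon and sulfur values quoted above are included; the key format ("C+1", "?") and the max_bonds helper are assumptions of this sketch, and the custom sulfur value is purely hypothetical.

```python
# Subset of the default constraints quoted above; "?" is the catch-all entry.
CONSTRAINTS = {
    "C": 4, "C+1": 5, "C-1": 3,
    "S": 6, "S+1": 7, "S-1": 5,
    "?": 8,
}

def max_bonds(element, charge=0, constraints=CONSTRAINTS):
    """Return the maximum number of bonds for an element in a given charge state."""
    key = element if charge == 0 else f"{element}{charge:+d}"
    return constraints.get(key, constraints["?"])

print(max_bonds("C"))        # neutral carbon
print(max_bonds("C", +1))    # cationic carbon
print(max_bonds("S", -1))    # anionic sulfur
print(max_bonds("Xe"))       # unlisted atom type falls back to the catch-all

# A custom table overrides individual entries (hypothetical stricter sulfur).
custom = {**CONSTRAINTS, "S": 2}
print(max_bonds("S", constraints=custom))
```

Keeping the constraints as data rather than code is what lets a single call swap in a different chemistry without touching the grammar itself.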
Data
Benchmark dataset: DTP (Developmental Therapeutics Program) open compound collection with slightly over 300K SMILES strings, a set of molecules tested experimentally for potential treatment against cancer and AIDS.
Random generation testing: Random SELFIES strings of varying lengths (10, 100, 250 symbols) generated from both basic and filtered alphabets to test decoding validity and molecule size distributions.
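The sampling side of this experiment can be sketched without the library. The alphabet below is a small hypothetical stand-in for the paper's basic alphabet, and the filter rule (drop multi-bond symbols and monovalent halogens) is one reading of "removing multi-bond and low-valence atom symbols". Decoding the sampled strings to molecules would be done with the library's decoder, which is not called here.

```python
import random

# Hypothetical stand-in for a basic sampling alphabet.
BASIC_ALPHABET = [
    "[C]", "[=C]", "[#C]", "[N]", "[=N]", "[#N]",
    "[O]", "[=O]", "[F]", "[Cl]",
]

def is_chain_friendly(symbol):
    """Keep single-bonded symbols whose atoms can continue a chain."""
    is_multi_bond = symbol.startswith(("[=", "[#"))
    is_monovalent = symbol in {"[F]", "[Cl]"}
    return not is_multi_bond and not is_monovalent

def random_selfies(alphabet, length, rng):
    """Concatenate `length` symbols drawn uniformly from `alphabet`."""
    return "".join(rng.choice(alphabet) for _ in range(length))

rng = random.Random(0)  # fixed seed so the sample is repeatable
filtered = [s for s in BASIC_ALPHABET if is_chain_friendly(s)]
sample = random_selfies(filtered, 10, rng)
print(filtered)
print(sample)
```

Multi-bond symbols and monovalent atoms use up the remaining valence quickly and end the derivation early, so removing them lets more of each random string contribute atoms to the decoded molecule.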
Evaluation
Performance metric: Roundtrip conversion time (SMILES to SELFIES to SMILES) is 252 seconds for 300K+ molecules (136s encoding, 116s decoding). Times averaged over 3 replicate trials on Google Colaboratory.
Validity testing: Random SELFIES strings of lengths 10, 100, and 250 all decode to valid molecules. Decoding 1000 random strings of length 250 from the basic alphabet takes 0.341s; from the filtered alphabet, 1.633s.
Attribution system: Both encoder() and decoder() support an attribute flag that returns AttributionMap objects, tracing which input symbols produce which output symbols for property alignment.
Paper Information
Citation: Lo, A., Pollice, R., Nigam, A., White, A. D., Krenn, M., & Aspuru-Guzik, A. (2023). Recent advances in the self-referencing embedded strings (SELFIES) library. Digital Discovery, 2(4), 897-908. https://doi.org/10.1039/D3DD00044C
Publication: Digital Discovery 2023
@article{lo2023recent,
title={Recent advances in the self-referencing embedded strings (SELFIES) library},
author={Lo, Alston and Pollice, Robert and Nigam, AkshatKumar and White, Andrew D and Krenn, Mario and Aspuru-Guzik, Al{\'a}n},
journal={Digital Discovery},
volume={2},
number={4},
pages={897--908},
year={2023},
publisher={Royal Society of Chemistry},
doi={10.1039/D3DD00044C}
}
