<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Method on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/tags/method/</link><description>Recent content in Method on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Tue, 07 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/tags/method/index.xml" rel="self" type="application/rss+xml"/><item><title>Block-Recurrent Transformers for Long Sequences</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/block-recurrent-transformers/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/block-recurrent-transformers/</guid><description>Block-Recurrent Transformers combine attention and recurrence for linear-complexity language modeling on long documents like books and code.</description><content:encoded><![CDATA[<h2 id="a-method-for-combining-attention-with-block-level-recurrence">A Method for Combining Attention with Block-Level Recurrence</h2>
<p>This is a <strong>Method</strong> paper that introduces the Block-Recurrent Transformer, a model architecture that integrates recurrence into the transformer framework at the block level. Rather than processing tokens one at a time (as in traditional RNNs) or attending over entire sequences (as in standard transformers), this approach applies a transformer layer recurrently across blocks of tokens. The result is a model with linear complexity in sequence length that maintains the parallelism benefits of transformers during training. A related approach, <a href="/notes/machine-learning/model-architectures/rwkv-rnn-transformer-architecture/">RWKV</a>, later explored similar ideas using linear attention with channel-wise decay.</p>
<h2 id="why-transformers-struggle-with-long-documents">Why Transformers Struggle with Long Documents</h2>
<p>Transformers have largely replaced RNNs for sequence modeling tasks, but their quadratic self-attention cost limits the length of sequences they can process. A transformer with a window size of 512 tokens cannot see information beyond that window, making it blind to long-range dependencies in books, technical papers, or source code repositories.</p>
<p>Prior approaches to this problem fall into several categories: sparse attention patterns (BigBird, Routing Transformers, Reformer), sequence compression (Linformer, Funnel Transformers), and linearized attention approximations. These methods either sacrifice the expressiveness of full softmax attention or introduce implementation complexity.</p>
<p>Traditional RNNs like LSTMs offer linear complexity but suffer from three key limitations: sequential processing prevents parallelism on modern hardware, a single state vector bottlenecks information capacity, and vanishing gradients limit effective memory to a few hundred tokens.</p>
<h2 id="block-level-recurrence-with-lstm-style-gates">Block-Level Recurrence with LSTM-Style Gates</h2>
<p>The core innovation is applying a standard transformer layer in a recurrent fashion along the sequence, operating on blocks of $W$ tokens rather than individual tokens. The recurrent cell maintains $S$ state vectors (typically $S = W = 512$) that are updated at each block boundary.</p>
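<p>The outer recurrence can be sketched as follows. The <code>cell</code> interface, names, and shapes here are illustrative stand-ins, not the paper&rsquo;s Meliad implementation:</p>

```python
import numpy as np

def block_recurrent_forward(tokens, cell, W=512, S=512):
    """Process a long sequence block by block, carrying S state vectors."""
    d = tokens.shape[-1]
    state = np.zeros((S, d))  # a learned initial state in the real model
    outputs = []
    for start in range(0, len(tokens), W):
        block = tokens[start:start + W]        # up to W token embeddings
        block_out, state = cell(block, state)  # vertical + horizontal paths
        outputs.append(block_out)
    return np.concatenate(outputs), state
```

<p>Within each block, attention runs in parallel as usual; recurrence happens only at the $W$-token block boundaries, which is what recovers linear complexity in sequence length.</p>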
<h3 id="the-recurrent-cell">The Recurrent Cell</h3>
<p>The cell has two processing directions:</p>
<ul>
<li><strong>Vertical direction</strong>: An ordinary transformer layer with self-attention over input tokens and cross-attention to recurrent states, producing output embeddings.</li>
<li><strong>Horizontal direction</strong>: Self-attention over current state vectors and cross-attention to input tokens, producing updated state vectors. Residual connections are replaced with gates.</li>
</ul>
<p>Self-attention and cross-attention are computed in parallel (not sequentially), with results concatenated and fed into a linear projection. Keys and values are shared between directions, while queries are separate, yielding four query sets: $Q_e^v$, $Q_s^v$ (vertical) and $Q_s^h$, $Q_e^h$ (horizontal).</p>
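<p>A single-head sketch of this shared-key/value, four-query layout (random matrices stand in for learned projections; masking and multi-head details are omitted):</p>

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def dual_attention(E, S, rng):
    """Shared keys/values for tokens (E) and states (S); four query sets."""
    d = E.shape[-1]
    p = lambda m=1: rng.standard_normal((m * d, d)) / np.sqrt(m * d)
    Wk, Wv = p(), p()
    Ke, Ve = E @ Wk, E @ Wv  # token keys/values, shared by both directions
    Ks, Vs = S @ Wk, S @ Wv  # state keys/values, shared by both directions
    # vertical: tokens self-attend and cross-attend to states, in parallel
    vert = np.concatenate([attend(E @ p(), Ke, Ve),
                           attend(E @ p(), Ks, Vs)], axis=-1) @ p(2)
    # horizontal: states self-attend and cross-attend to tokens, in parallel
    horiz = np.concatenate([attend(S @ p(), Ks, Vs),
                            attend(S @ p(), Ke, Ve)], axis=-1) @ p(2)
    return vert, horiz
```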
<h3 id="gating-mechanisms">Gating Mechanisms</h3>
<p>Two gate types are explored. The <strong>fixed gate</strong> uses a learned convex combination:</p>
<p>$$
g = \sigma(b_g)
$$</p>
<p>$$
c_{t+1} = c_t \odot g + z_t \odot (1 - g)
$$</p>
<p>where $g$ is constant after training, implementing an <a href="https://en.wikipedia.org/wiki/Moving_average">exponential moving average</a>.</p>
<p>The <strong>LSTM gate</strong> uses input and forget gates:</p>
<p>$$
i_t = \sigma(W_i h_t + b_i - 1)
$$</p>
<p>$$
f_t = \sigma(W_f h_t + b_f + 1)
$$</p>
<p>$$
c_{t+1} = c_t \odot f_t + z_t \odot i_t
$$</p>
<p>The bias offsets ($-1$ for input, $+1$ for forget) initialize the model to &ldquo;remember&rdquo; by default, which is critical for training stability. Without careful initialization, the model can fall into a local optimum where it ignores the recurrent state entirely. This echoes the <a href="/notes/machine-learning/model-architectures/can-recurrent-neural-networks-warp-time/">gate initialization challenges studied by Tallec and Ollivier</a>, who derived chrono initialization for LSTMs from time-warping invariance.</p>
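<p>Both gates can be written in a few lines. This NumPy sketch assumes per-channel parameters; the names are illustrative:</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fixed_gate(c, z, b_g):
    """Learned convex combination; b_g is a trained per-channel bias."""
    g = sigmoid(b_g)                   # constant after training
    return c * g + z * (1.0 - g)

def lstm_gate(c, z, h, W_i, b_i, W_f, b_f):
    """LSTM-style input/forget gates with the paper's bias offsets."""
    i = sigmoid(h @ W_i + b_i - 1.0)   # -1: start with a small input gate
    f = sigmoid(h @ W_f + b_f + 1.0)   # +1: start with a large forget gate
    return c * f + z * i
```

<p>With zero-initialized weights and biases, the forget gate starts near $\sigma(1) \approx 0.73$ and the input gate near $\sigma(-1) \approx 0.27$, biasing the cell toward retaining state early in training.</p>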
<h3 id="gate-configurations">Gate Configurations</h3>
<p>Three configurations are tested: <strong>dual</strong> (gates on both attention and MLP outputs), <strong>single</strong> (gate only on MLP output), and <strong>skip</strong> (gate only on attention output, no MLP). The skip configuration removes the large MLP from the recurrent layer entirely.</p>
<h3 id="learned-state-ids">Learned State IDs</h3>
<p>Since the same weights are applied to all state vectors, learned &ldquo;state IDs&rdquo; (analogous to position embeddings) are added so each state vector can issue distinct queries. T5-style relative position bias is used for token self-attention, with no position bias for state-token cross-attention.</p>
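<p>A minimal illustration of why state IDs are needed (shapes and names are hypothetical): identical state slots pushed through shared weights produce identical queries unless each slot gets a distinct learned offset.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
S, d = 4, 8
state = np.zeros((S, d))                         # identical state slots
state_ids = rng.standard_normal((S, d)) * 0.02   # learned in the real model

W_q = rng.standard_normal((d, d))
queries_plain = state @ W_q                    # every row is identical
queries_with_ids = (state + state_ids) @ W_q   # rows now differ per slot
```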
<h2 id="language-modeling-on-pg19-arxiv-and-github">Language Modeling on PG19, arXiv, and GitHub</h2>
<h3 id="experimental-setup">Experimental Setup</h3>
<p>The base model is a 12-layer transformer with 150M parameters (8 heads of size 128, embedding dimension 1024, MLP hidden size 4096). The recurrent layer is placed at layer 10 with segment length $N = 4096$ and window size $W = 512$. The architecture is evaluated on three long-document datasets:</p>
<ul>
<li><strong>PG19</strong>: Full-length books from <a href="https://en.wikipedia.org/wiki/Project_Gutenberg">Project Gutenberg</a> (pre-1919)</li>
<li><strong>arXiv</strong>: Mathematics papers in LaTeX</li>
<li><strong>GitHub</strong>: Concatenated source code from open-source repositories</li>
</ul>
<p>Results are reported in bits-per-token ($\log_2$ of per-token perplexity; lower is better).</p>
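<p>For reference, converting between bits-per-token and per-token perplexity:</p>

```python
import math

def bits_per_token(ppl):
    """Bits-per-token is the base-2 log of per-token perplexity."""
    return math.log2(ppl)

def ppl_from_bits(bpt):
    """Invert: per-token perplexity from bits-per-token."""
    return 2.0 ** bpt
```

<p>A drop from 3.58 to 3.53 bits corresponds to roughly 3.5% lower token-level perplexity ($2^{0.05} \approx 1.035$).</p>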
<h3 id="baselines">Baselines</h3>
<p>Five baselines are compared: Transformer-XL with window sizes of 512, 1024, and 2048, plus 12-layer and 13-layer sliding window models. The 13-layer sliding window (Slide:13L) is the primary comparison, having equivalent computation cost and parameter count to the recurrent models.</p>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Step Time</th>
          <th>PG19 (bytes)</th>
          <th>PG19 (tokens)</th>
          <th>arXiv</th>
          <th>GitHub</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>XL:512</td>
          <td>0.88</td>
          <td>1.01</td>
          <td>3.62</td>
          <td>1.45</td>
          <td>1.21</td>
      </tr>
      <tr>
          <td>XL:2048</td>
          <td>2.11</td>
          <td>0.990</td>
          <td>3.58</td>
          <td>1.31</td>
          <td>1.01</td>
      </tr>
      <tr>
          <td>Slide:13L</td>
          <td>1.00</td>
          <td>0.989</td>
          <td>3.58</td>
          <td>1.42</td>
          <td>1.17</td>
      </tr>
      <tr>
          <td>Rec:fixed:skip</td>
          <td>0.99</td>
          <td>0.952</td>
          <td>3.53</td>
          <td>1.24</td>
          <td>0.976</td>
      </tr>
      <tr>
          <td>Rec:fixed:dual</td>
          <td>1.01</td>
          <td>0.957</td>
          <td>3.52</td>
          <td>1.27</td>
          <td>0.991</td>
      </tr>
      <tr>
          <td>Feedback:fixed:skip</td>
          <td>1.35</td>
          <td>0.935</td>
          <td>3.49</td>
          <td>1.24</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Memorizing Trans. 64k</td>
          <td>1.94</td>
          <td>0.950</td>
          <td>3.53</td>
          <td>1.22</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The Rec:fixed:skip configuration achieves the best overall results while being slightly faster than the 13-layer baseline. It outperforms XL:2048, which runs over 2x slower. The block feedback variant (allowing all layers to cross-attend to recurrent states) improves perplexity further at ~35-40% higher step time.</p>
<h3 id="scaling-behavior">Scaling Behavior</h3>
<p>Models from 40M to 1.3B parameters show that the benefit of recurrence is <a href="/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/">consistent across scales</a> and increases with model size. At larger sizes, adding recurrence provides a benefit greater than doubling the number of parameters. The 1.3B parameter model achieves 26.50 word-level perplexity on PG19, setting a new state of the art at the time of publication.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>PG19 Perplexity</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compressive Transformer</td>
          <td>36</td>
          <td>33.6</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Routing Transformer</td>
          <td>22</td>
          <td>33.2</td>
          <td>490M</td>
      </tr>
      <tr>
          <td>Perceiver AR</td>
          <td>60</td>
          <td>28.9</td>
          <td>974.6M</td>
      </tr>
      <tr>
          <td>Block-Recurrent Transformer</td>
          <td>24</td>
          <td>26.50</td>
          <td>1.3B</td>
      </tr>
  </tbody>
</table>
<h3 id="ablations">Ablations</h3>
<ul>
<li><strong>Multiple recurrent layers</strong>: Two adjacent layers (9, 10) provide no benefit. Two separated layers (4, 10) help but no more than adding another non-recurrent layer.</li>
<li><strong>Number of states</strong>: Improvement up to 1024 states, degradation at 2048.</li>
<li><strong>Window size reduction</strong>: Reducing the sliding window hurts Transformer-XL dramatically but has a smaller impact on the recurrent model, which compensates via recurrence.</li>
<li><strong>Gate type</strong>: The fixed gate consistently outperforms the LSTM gate despite being theoretically less expressive.</li>
</ul>
<h3 id="qualitative-analysis">Qualitative Analysis</h3>
<p>Comparing per-token predictions against Transformer-XL on PG19 books, the recurrent model&rsquo;s advantage comes overwhelmingly from predicting proper names (17/20 top-improvement tokens). In 19/20 cases, the predicted word was outside the attention window, confirming it was stored in recurrent state. The model can remember book titles and authors across 60,000+ tokens.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>The Block-Recurrent Transformer demonstrates that recurrence at the block level is a cost-effective way to improve language modeling on long sequences. The fixed:skip configuration (the simplest variant) performs best, suggesting the model primarily uses recurrence for long-range name lookup rather than complex reasoning. The fact that removing the MLP from the recurrent layer has minimal impact further supports this interpretation.</p>
<p>Key limitations include: the model was only evaluated on language modeling perplexity (no downstream tasks), the LSTM gate underperforms the simpler fixed gate (suggesting untapped potential for more expressive recurrence), and the authors acknowledge that training the recurrent layer to fully exploit its capacity for knowledge extraction will require further advances.</p>
<p>The authors note that evaluating on downstream tasks requiring long-range context (book summarization, long-document QA, code completion) is an important direction for future work.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>PG19</td>
          <td>~29k books</td>
          <td>Public domain, freely available</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>arXiv</td>
          <td>Mathematics papers</td>
          <td>Obtained via private channels, not redistributable</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>GitHub</td>
          <td>Open-source repos</td>
          <td>Obtained via private channels, not redistributable</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adafactor</li>
<li>Learning rate: 1.0 with inverse square root decay (initial experiments), cosine decay with max 0.01 (scaling experiments)</li>
<li>Warmup: 1000 steps</li>
<li>Dropout: 0.05</li>
<li>Vocabulary: 32k SentencePiece (T5 pretrained for initial, custom for scaling)</li>
<li>Gate initialization: bias of $+1$ for forget gate, $-1$ for input gate to ensure initial &ldquo;remember&rdquo; behavior</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Layers</th>
          <th>Parameters</th>
          <th>Recurrent Layers</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>12 (+1 recurrent)</td>
          <td>~151-164M</td>
          <td>Layer 10</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>24 (+2 recurrent)</td>
          <td>650M</td>
          <td>Layers 10, 20</td>
      </tr>
      <tr>
          <td>XL</td>
          <td>24 (+2 recurrent)</td>
          <td>1.3B</td>
          <td>Layers 10, 20</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Model</th>
          <th>PG19 (tokens)</th>
          <th>arXiv</th>
          <th>GitHub</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bits-per-token</td>
          <td>Rec:fixed:skip</td>
          <td>3.53</td>
          <td>1.24</td>
          <td>0.976</td>
      </tr>
      <tr>
          <td>Word-level PPL</td>
          <td>1.3B model</td>
          <td>26.50</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Error bars on PG19 are between 0.002 and 0.007 (3 runs with different seeds).</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: 32 TPU v4 replicas</li>
<li>Training time: ~48 hours for 500k steps on PG19</li>
<li>Batch size: 32 (segment length 4096) or 256 (segment length 512), adjusted so each model sees the same tokens per step</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Available</th>
          <th>License</th>
          <th>URL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Code (Meliad)</td>
          <td>Yes</td>
          <td>Apache 2.0</td>
          <td><a href="https://github.com/google-research/meliad">github.com/google-research/meliad</a></td>
      </tr>
      <tr>
          <td>PG19 Dataset</td>
          <td>Yes</td>
          <td>Public Domain</td>
          <td>Public</td>
      </tr>
      <tr>
          <td>arXiv Dataset</td>
          <td>No</td>
          <td>Not redistributable</td>
          <td>Private</td>
      </tr>
      <tr>
          <td>GitHub Dataset</td>
          <td>No</td>
          <td>Not redistributable</td>
          <td>Private</td>
      </tr>
      <tr>
          <td>Pretrained Models</td>
          <td>No</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Assessment</strong>: Partially Reproducible. Source code is available under Apache 2.0 and the PG19 dataset is public. However, two of three evaluation datasets (arXiv, GitHub) were obtained via private channels and are not redistributable. No pretrained model checkpoints are released.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hutchins, D., Schlag, I., Wu, Y., Dyer, E., &amp; Neyshabur, B. (2022). Block-Recurrent Transformers. <em>Advances in Neural Information Processing Systems 35 (NeurIPS 2022)</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{hutchins2022block,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Block-Recurrent Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hutchins, DeLesley and Schlag, Imanol and Wu, Yuhuai and Dyer, Ethan and Neyshabur, Behnam}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2203.07852}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>NaViT: Native Resolution Vision Transformer</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/navit-native-resolution-vit/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/navit-native-resolution-vit/</guid><description>NaViT uses sequence packing to train Vision Transformers on images at native resolution and aspect ratio, improving efficiency and flexibility.</description><content:encoded><![CDATA[<h2 id="a-method-for-flexible-resolution-vision-transformers">A Method for Flexible-Resolution Vision Transformers</h2>
<p>This is a <strong>Method</strong> paper that introduces NaViT (Native Resolution ViT), a Vision Transformer trained using sequence packing to handle images of arbitrary resolution and aspect ratio. The core idea, called &ldquo;Patch n&rsquo; Pack,&rdquo; borrows example packing from NLP and applies it to vision: patches from multiple images of different sizes are concatenated into a single sequence, enabling native-resolution processing without resizing or padding.</p>
<h2 id="why-fixed-resolution-pipelines-are-suboptimal">Why Fixed-Resolution Pipelines Are Suboptimal</h2>
<p>Standard computer vision pipelines resize all images to a fixed square resolution before processing. This practice originates from convolutional neural network constraints, where fixed spatial dimensions were architecturally required. Even with Vision Transformers, which operate on sequences of patches and could in principle handle variable lengths, the convention of fixed-resolution input persists.</p>
<p>This approach has clear drawbacks. Most images are not square: in ImageNet, LVIS, and WebLI, the majority of images deviate by more than 20% from a 1:1 aspect ratio. Resizing distorts content and discards information, while padding wastes computation. Prior work such as FlexiViT addressed variable patch sizes, and Pix2Struct introduced aspect-ratio-preserving patching, but neither fully solved the problem of training efficiently on images at their original resolution.</p>
<h2 id="patch-n-pack-sequence-packing-for-vision">Patch n&rsquo; Pack: Sequence Packing for Vision</h2>
<p>The key insight is that ViT already processes images as sequences of patch tokens, and NLP has long used example packing to handle variable-length sequences efficiently. NaViT applies this directly: patches from multiple images (each at its native resolution and aspect ratio) are packed into a single fixed-length sequence.</p>
<h3 id="architectural-modifications">Architectural Modifications</h3>
<p>Three changes enable Patch n&rsquo; Pack:</p>
<ol>
<li>
<p><strong>Masked self-attention and masked pooling</strong>: Attention masks prevent patches from different images from attending to each other. Masked pooling extracts a single representation per image from the packed sequence.</p>
</li>
<li>
<p><strong>Factorized positional embeddings</strong>: Standard 1D positional embeddings cannot handle arbitrary resolutions. NaViT decomposes position into separate $x$ and $y$ embeddings $\phi_{x}$ and $\phi_{y}$, which are summed together. Two schemes are considered:</p>
<ul>
<li>Absolute embeddings: $\phi(p): [0, \text{maxLen}] \to \mathbb{R}^{D}$, a function of the absolute patch index</li>
<li>Fractional embeddings: $\phi(r): [0, 1] \to \mathbb{R}^{D}$, where $r = p / \text{side-length}$ is the relative position along the image</li>
</ul>
</li>
<li>
<p><strong>Chunked contrastive loss</strong>: For contrastive pretraining, the $\mathcal{O}(n^{2})$ loss computation is handled via chunked computation across device subsets to support the high number of examples per sequence.</p>
</li>
</ol>
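<p>The absolute variant of the factorized embedding can be sketched as follows (table sizes and names are assumptions; the fractional variant would index by $r = p/\text{side-length}$ instead of the raw patch index):</p>

```python
import numpy as np

def factorized_pos_embed(h, w, emb_x, emb_y):
    """Per-axis embedding tables indexed by patch coordinates, then summed."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return emb_x[xs] + emb_y[ys]   # (h, w, D), one vector per patch
```

<p>Because each axis is embedded independently, any $(x, y)$ pair within the table range is covered, even if that exact combination never appeared during training.</p>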
<h3 id="training-innovations">Training Innovations</h3>
<p>Packing enables two techniques that were previously impractical:</p>
<ul>
<li>
<p><strong>Continuous token dropping</strong>: Instead of dropping the same proportion of tokens from every image, the drop rate varies per image. Some images keep all tokens while others have aggressive dropping, reducing the train/inference discrepancy. The drop rate can follow a schedule that decreases over training.</p>
</li>
<li>
<p><strong>Resolution sampling</strong>: Each image&rsquo;s resolution is sampled from a distribution (e.g., $R \sim \mathcal{U}(64, R_{\text{max}})$) while preserving aspect ratio. This mixes the throughput benefits of small images with the detail of large ones.</p>
</li>
</ul>
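<p>Both techniques reduce to a few lines of per-image sampling. The Beta parameters and resolution range below are illustrative, not the paper&rsquo;s tuned values:</p>

```python
import numpy as np

def sample_resolution(rng, r_min=64, r_max=512):
    """Uniform side-length sampling; aspect ratio is preserved separately."""
    return int(rng.uniform(r_min, r_max))

def drop_tokens(patches, rng, a=1.0, b=3.0):
    """Variable dropping: a fresh drop rate per image, drawn from Beta(a, b)."""
    rate = rng.beta(a, b)
    keep = rng.random(len(patches)) >= rate
    return patches[keep]
```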
<h3 id="computational-overhead">Computational Overhead</h3>
<p>A natural concern is the $\mathcal{O}(n^{2})$ attention cost for longer packed sequences. In practice, as the transformer hidden dimension scales, attention becomes an increasingly small fraction of total compute (the MLP dominates). With a simple greedy bin-packing algorithm, padding tokens typically account for less than 2% of each packed sequence.</p>
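<p>The paper describes the packing step only as a &ldquo;simple greedy&rdquo; algorithm; a first-fit variant is one plausible reading of that, sketched here with token counts standing in for images:</p>

```python
def greedy_pack(lengths, capacity):
    """First-fit packing of per-image token counts into fixed-length sequences."""
    bins = []                      # each entry: [remaining_capacity, items]
    for n in lengths:
        for b in bins:
            if b[0] >= n:          # first sequence with room wins
                b[0] -= n
                b[1].append(n)
                break
        else:
            bins.append([capacity - n, [n]])
    return [items for _, items in bins]
```

<p>Whatever capacity remains in each sequence is filled with padding tokens, which is the source of the small reported overhead.</p>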
<h2 id="pretraining-and-downstream-evaluation">Pretraining and Downstream Evaluation</h2>
<p>NaViT is evaluated in two pretraining setups:</p>
<ul>
<li><strong>Classification pretraining</strong> on JFT-4B with sigmoid cross-entropy loss, evaluated via linear probing (10 examples per class)</li>
<li><strong>Contrastive pretraining</strong> on WebLI using image-text contrastive loss, evaluated on zero-shot ImageNet classification and COCO retrieval</li>
</ul>
<h3 id="training-efficiency">Training Efficiency</h3>
<p>At a fixed compute budget, NaViT consistently outperforms ViT across model scales; NaViT matches the top-performing ViT with 4x less compute. The primary driver is throughput: packing with variable resolution and token dropping enables NaViT-L/16 to process approximately 5x more images during training.</p>
<h3 id="variable-resolution-results">Variable Resolution Results</h3>
<p>Models trained with variable resolution ($R \sim \mathcal{U}(64, R_{\text{max}})$) outperform fixed-resolution models even when evaluated at the fixed resolution&rsquo;s own training resolution. Sampling side lengths from a truncated normal biased toward lower values gives the best cost-performance trade-off.</p>
<p>For fine-tuning on ImageNet-1k, a single NaViT fine-tuned with variable resolutions (64 to 512) matches the performance of models fine-tuned at each specific resolution individually.</p>
<h3 id="positional-embedding-comparison">Positional Embedding Comparison</h3>
<p>Factorized embeddings outperform both standard ViT 1D embeddings (with interpolation) and Pix2Struct&rsquo;s learned 2D embeddings. The factorized approach generalizes to resolutions outside the training range, while 2D embeddings fail because they require seeing all $(x, y)$ coordinate pairs during training. Additive combination of $\phi_{x}$ and $\phi_{y}$ works best.</p>
<h3 id="token-dropping-strategies">Token Dropping Strategies</h3>
<p>Variable token dropping with Beta-distributed rates consistently outperforms constant rates. Resolution-dependent dropping (higher rates for higher-resolution images) further improves performance. Scheduling the drop rate to decrease over training provides additional gains.</p>
<h3 id="downstream-tasks">Downstream Tasks</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Setup</th>
          <th>Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Semantic segmentation</td>
          <td>ADE20k, L/16, linear decoder</td>
          <td>NaViT at $R_{384}$ beats ViT at $R_{512}$ while being 2x faster</td>
      </tr>
      <tr>
          <td>Object detection</td>
          <td>OWL-ViT-L/14 backbone</td>
          <td>NaViT: 28.3% LVIS AP vs. ViT: 23.3%</td>
      </tr>
      <tr>
          <td>Video classification</td>
          <td>Kinetics-400, tubelet extraction</td>
          <td>NaViT-L matches ViViT-L (80.4%) in ~6x fewer epochs</td>
      </tr>
      <tr>
          <td>Fairness annotation</td>
          <td>FairFace, CelebA linear probes</td>
          <td>Statistically significant accuracy improvements ($p = 3 \times 10^{-4}$)</td>
      </tr>
  </tbody>
</table>
<h3 id="out-of-distribution-robustness">Out-of-Distribution Robustness</h3>
<p>NaViT shows strong gains on ImageNet-A (which contains many extreme aspect ratios) when evaluated without center cropping. Performance on ObjectNet is also competitive. The model maintains stable calibration (ECE between 0.045 and 0.047) across a wide range of token counts per image (128 to 1024).</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>NaViT demonstrates that sequence packing, when applied to Vision Transformers, yields substantial improvements in training efficiency, inference flexibility, and downstream performance. The approach processes images at their native resolution without the information loss from resizing or the waste from padding.</p>
<p>Key takeaways:</p>
<ul>
<li>4x compute reduction to match top ViT performance</li>
<li>A single model works across a continuous range of resolutions at inference time</li>
<li>Variable-resolution training and token dropping provide complementary efficiency gains</li>
<li>Factorized positional embeddings generalize to unseen resolutions</li>
<li>Benefits transfer to detection, segmentation, video, and fairness tasks</li>
</ul>
<p>Limitations: The paper does not release model weights or code. All experiments use Google-internal datasets (JFT-4B, WebLI) and infrastructure (TPUs, JAX/Scenic), making direct reproduction difficult. The attention masking approach for packing assumes that cross-image attention is undesirable, which may not hold for all tasks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification pretraining</td>
          <td>JFT-4B</td>
          <td>~4B labeled images</td>
          <td>Google-internal, not publicly available</td>
      </tr>
      <tr>
          <td>Contrastive pretraining</td>
          <td>WebLI</td>
          <td>Large-scale web data</td>
          <td>Google-internal, not publicly available</td>
      </tr>
      <tr>
          <td>Classification fine-tuning</td>
          <td>ImageNet-1k</td>
          <td>1.28M images</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Segmentation</td>
          <td>ADE20k</td>
          <td>20K images</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Detection</td>
          <td>LVIS</td>
          <td>164K images</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Video</td>
          <td>Kinetics-400</td>
          <td>~240K videos</td>
          <td>Publicly available (partial)</td>
      </tr>
      <tr>
          <td>Fairness</td>
          <td>FairFace, CelebA</td>
          <td>108K / 200K images</td>
          <td>Publicly available</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Greedy bin-packing for sequence construction (less than 2% padding tokens)</li>
<li>Resolution sampling: side length from truncated normal $\mathcal{N}_{t}(-0.5, 1)$ mapped to $[64, R_{\text{max}}]$</li>
<li>Token dropping: Beta-distributed per-image rates, optionally resolution-dependent</li>
<li>Factorized positional embeddings with additive combination</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>NaViT variants: B/16, L/16, L/14</li>
<li>Based on vanilla ViT with query-key normalization, no biases, attention pooling</li>
<li>Implemented in JAX/FLAX within the Scenic framework</li>
<li>No public model checkpoints available</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>NaViT</th>
          <th>ViT Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JFT linear probe (L/16)</td>
          <td>Matches top ViT</td>
          <td>4x more compute</td>
          <td>Compute-matched comparison</td>
      </tr>
      <tr>
          <td>ImageNet zero-shot (L/14)</td>
          <td>72.9%</td>
          <td>68.3%</td>
          <td>Contrastive pretraining</td>
      </tr>
      <tr>
          <td>LVIS AP (L/14)</td>
          <td>28.3%</td>
          <td>23.3%</td>
          <td>OWL-ViT detection</td>
      </tr>
      <tr>
          <td>LVIS AP rare (L/14)</td>
          <td>24.3%</td>
          <td>17.2%</td>
          <td>OWL-ViT detection</td>
      </tr>
      <tr>
          <td>ADE20k mIoU (L/16, 384)</td>
          <td>Beats ViT@512</td>
          <td>At 2x cost</td>
          <td>Segmenter linear decoder</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training on Cloud TPUs (specific configuration not detailed)</li>
<li>Inference latency measured on Cloud TPUv3</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Dehghani, M., Mustafa, B., Djolonga, J., Heek, J., Minderer, M., Caron, M., Steiner, A., Puigcerver, J., Geirhos, R., Alabdulmohsin, I., Oliver, A., Padlewski, P., Gritsenko, A., Lučić, M., &amp; Houlsby, N. (2023). Patch n&rsquo; Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. <em>Advances in Neural Information Processing Systems 36 (NeurIPS 2023)</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{dehghani2023patch,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Patch n&#39; Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Dehghani, Mostafa and Mustafa, Basil and Djolonga, Josip and Heek, Jonathan and Minderer, Matthias and Caron, Mathilde and Steiner, Andreas and Puigcerver, Joan and Geirhos, Robert and Alabdulmohsin, Ibrahim and Oliver, Avital and Padlewski, Piotr and Gritsenko, Alexey and Lučić, Mario and Houlsby, Neil}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2307.06304}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.CV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Beyond Atoms: 3D Space Modeling for Molecular Pretraining</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-modeling/beyond-atoms/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-modeling/beyond-atoms/</guid><description>Lu et al. introduce SpaceFormer, a Transformer that models entire 3D molecular space including atoms for superior representations.</description><content:encoded><![CDATA[<h2 id="paper-typology-and-contribution">Paper Typology and Contribution</h2>
<p>This is a <strong>Method</strong> paper. It challenges the atom-centric paradigm of molecular representation learning by proposing a novel framework that models the continuous 3D space surrounding atoms. The core contribution is <strong>SpaceFormer</strong>, a Transformer-based architecture that discretizes molecular space into grids to capture physical phenomena (electron density, electromagnetic fields) often missed by traditional point-cloud models.</p>
<h2 id="the-physical-intuition-modeling-empty-space">The Physical Intuition: Modeling &ldquo;Empty&rdquo; Space</h2>
<p><strong>The Gap</strong>: Prior 3D molecular representation models, such as Uni-Mol, treat molecules as discrete sets of atoms, essentially point clouds in 3D space. However, from a quantum physics perspective, the &ldquo;empty&rdquo; space between atoms is far from empty. It is permeated by electron density distributions and electromagnetic fields that determine molecular properties.</p>
<p><strong>The Hypothesis</strong>: Explicitly modeling this continuous 3D space alongside discrete atom positions yields superior representations for downstream tasks, particularly for computational properties that depend on electronic structure, such as HOMO/LUMO energies and energy gaps.</p>
<h2 id="a-surprising-observation-virtual-points-improve-representations">A Surprising Observation: Virtual Points Improve Representations</h2>
<p>Before proposing SpaceFormer, the authors present a simple yet revealing experiment. They augment Uni-Mol by adding randomly sampled virtual points (VPs) from the 3D space within the circumscribed cuboid of each molecule. These VPs carry no chemical information whatsoever: they are purely random noise points.</p>
<p>The result is surprising: adding just 10 random VPs already yields a noticeable improvement in validation loss. The improvement remains consistent and gradually increases as the number of VPs grows, eventually reaching a plateau. This observation holds across downstream tasks as well, with Uni-Mol + VPs improving on several quantum property predictions (LUMO, E1-CC2, E2-CC2) compared to vanilla Uni-Mol.</p>
<p>The implication is that even uninformative spatial context helps the model learn better representations, motivating a principled framework for modeling the full 3D molecular space.</p>
<h2 id="spaceformer-voxelization-and-3d-positional-encodings">SpaceFormer: Voxelization and 3D Positional Encodings</h2>
<p>The key innovation is treating the molecular representation problem as <strong>3D space modeling</strong>. SpaceFormer follows these core steps:</p>
<ol>
<li><strong>Voxelizes the entire 3D space</strong> into a grid with cells of $0.49\text{\AA}$ (based on O-H bond length to ensure at most one atom per cell).</li>
<li><strong>Uses adaptive multi-resolution grids</strong> to efficiently handle empty space, keeping it fine-grained near atoms and coarse-grained far away.</li>
<li><strong>Applies Transformers to 3D spatial tokens</strong> with custom positional encodings that achieve linear complexity.</li>
</ol>
<p>Specifically, the model utilizes two forms of 3D Positional Encoding:</p>
<p><strong>3D Directional PE (RoPE Extension)</strong>
They extend Rotary Position Embedding (RoPE) to 3D continuous space by splitting the query and key vectors into three blocks, one per spatial axis. The directional attention score takes the form:</p>
<p>$$
\begin{aligned}
\mathbf{q}_{i}^{\top} \mathbf{k}_{j} = \sum_{s=1}^{3} \mathbf{q}_{i,s}^{\top} \mathbf{R}(c_{j,s} - c_{i,s}) \mathbf{k}_{j,s}
\end{aligned}
$$</p>
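<p>A minimal sketch of this directional attention score, under my own illustrative choices of block size, frequencies, and function names. The key property, visible in the equation above, is that rotating each axis block of the query and key by its own coordinate makes the score depend only on the offsets $c_{j,s} - c_{i,s}$:</p>

```python
import numpy as np

def rope_rotate(vec, coord, freqs):
    """Rotate consecutive 2-d pairs of `vec` by angles coord * freqs."""
    v = vec.reshape(-1, 2)
    ang = coord * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    return np.stack([v[:, 0] * cos - v[:, 1] * sin,
                     v[:, 0] * sin + v[:, 1] * cos], axis=1).reshape(-1)

def directional_score(q, k, ci, cj, freqs):
    """Split q/k into x, y, z blocks; rotate each block by its own coordinate,
    so the dot product sees only the relative offsets c_j - c_i."""
    d = q.shape[0] // 3
    return sum(rope_rotate(q[s*d:(s+1)*d], ci[s], freqs)
               @ rope_rotate(k[s*d:(s+1)*d], cj[s], freqs)
               for s in range(3))

# The score is invariant under a common translation of both coordinates.
rng = np.random.default_rng(1)
q, k = rng.normal(size=12), rng.normal(size=12)
freqs = np.array([1.0, 0.1])
ci, cj = np.array([0.3, -1.0, 2.0]), np.array([1.1, 0.5, -0.7])
shift = np.array([5.0, -3.0, 2.5])
s1 = directional_score(q, k, ci, cj, freqs)
s2 = directional_score(q, k, ci + shift, cj + shift, freqs)
```

<p>This is the same identity that makes 1D RoPE relative, $\mathbf{R}(a)^\top \mathbf{R}(b) = \mathbf{R}(b - a)$, applied independently per axis.</p>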
<p><strong>3D Distance PE (RFF Approximation)</strong>
To compute invariant geometric distance without incurring quadratic memory overhead, they use Random Fourier Features (RFF) to approximate a Gaussian kernel of pairwise distances:</p>
<p>$$
\begin{aligned}
\exp \left( - \frac{\lVert \mathbf{c}_i - \mathbf{c}_j \rVert_2^2}{2\sigma^2} \right) &amp;\approx z(\mathbf{c}_i)^\top z(\mathbf{c}_j) \\
z(\mathbf{c}_i) &amp;= \sqrt{\frac{2}{d}} \cos(\sigma^{-1} \mathbf{c}_i^\top \boldsymbol{\omega} + \mathbf{b})
\end{aligned}
$$</p>
<p>This approach enables the model to natively encode complex field-like phenomena without computing exhaustive $O(N^2)$ distance matrices.</p>
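<p>A small numerical check of the RFF approximation above; the feature count, $\sigma$, and test coordinates are arbitrary illustrative choices:</p>

```python
import numpy as np

def rff_features(coords, omega, b, sigma):
    """z(c) = sqrt(2/d) * cos(c^T omega / sigma + b), chosen so that
    z(ci) . z(cj) ~= exp(-||ci - cj||^2 / (2 sigma^2))."""
    d = omega.shape[1]
    return np.sqrt(2.0 / d) * np.cos(coords @ omega / sigma + b)

rng = np.random.default_rng(0)
n_feat, sigma = 4096, 1.5
omega = rng.normal(size=(3, n_feat))        # omega ~ N(0, I); 1/sigma scaling is in z
b = rng.uniform(0.0, 2.0 * np.pi, size=n_feat)
ci = np.array([0.2, -0.5, 1.0])
cj = np.array([0.6, 0.1, 0.4])
approx = rff_features(ci[None, :], omega, b, sigma) @ rff_features(cj[None, :], omega, b, sigma).T
exact = np.exp(-np.sum((ci - cj) ** 2) / (2.0 * sigma ** 2))
```

<p>Because each token carries only its own $d$-dimensional feature vector, pairwise distance information enters attention at linear rather than quadratic memory cost.</p>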
<h2 id="experimental-setup-and-downstream-tasks">Experimental Setup and Downstream Tasks</h2>
<p><strong>Pretraining Data</strong>: 19 million unlabeled molecules from the same dataset used by Uni-Mol.</p>
<p><strong>Downstream Benchmarks</strong>: The authors propose a new benchmark of 15 tasks, motivated by known limitations of MoleculeNet: invalid structures, inconsistent chemical representations, data curation errors, and an inability to adequately distinguish model performance. The tasks split into two categories:</p>
<ol>
<li>
<p><strong>Computational Properties (Quantum Mechanics)</strong></p>
<ul>
<li>Subsets of <a href="/notes/computational-chemistry/datasets/gdb-17/">GDB-17</a> (HOMO, LUMO, GAP energy prediction, 20K samples; E1-CC2, E2-CC2, f1-CC2, f2-CC2, 21.7K samples)</li>
<li>Cata-condensed polybenzenoid hydrocarbons (Dipole moment, adiabatic ionization potential, D3 dispersion correction, 8,678 samples)</li>
<li>Metric: Mean Absolute Error (MAE)</li>
</ul>
</li>
<li>
<p><strong>Experimental Properties (Pharma/Bio)</strong></p>
<ul>
<li>MoleculeNet tasks (BBBP, BACE for drug discovery)</li>
<li>Biogen ADME tasks (HLM, MME, Solubility)</li>
<li>Metrics: AUC for classification, MAE for regression</li>
</ul>
</li>
</ol>
<p><strong>Splitting Strategy</strong>: All datasets use an 8:1:1 train/validation/test split with <strong>scaffold splitting</strong> to test out-of-distribution generalization.</p>
<p><strong>Training Setup</strong>:</p>
<ul>
<li><strong>Objective</strong>: Masked Auto-Encoder (MAE) with 30% random masking. The model predicts whether a masked cell contains an atom and, if so, classifies the atom type and regresses its precise offset position within the cell.</li>
<li><strong>Hardware</strong>: ~50 hours on 8 NVIDIA A100 GPUs</li>
<li><strong>Optimizer</strong>: Adam ($\beta_1=0.9, \beta_2=0.99$)</li>
<li><strong>Learning Rate</strong>: Peak 1e-4 with linear decay and 0.01 warmup ratio</li>
<li><strong>Batch Size</strong>: 128</li>
<li><strong>Total Updates</strong>: 1 million</li>
</ul>
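<p>The masking objective can be sketched as target construction over grid cells; the field names and the NULL-as-&minus;1 convention are my assumptions for illustration:</p>

```python
import numpy as np

def build_mae_targets(atom_types, offsets, mask_ratio=0.3, rng=None):
    """Mask a fraction of cells. For each masked cell the model must predict
    occupancy; for occupied masked cells it must also predict the atom type
    and regress the atom's precise offset inside the cell.

    atom_types : (n_cells,) int, -1 marks an empty (NULL) cell
    offsets    : (n_cells, 3) float, in-cell atom positions
    """
    rng = rng or np.random.default_rng()
    mask = rng.random(atom_types.shape[0]) < mask_ratio
    occ = atom_types >= 0
    return {
        "mask": mask,                          # which cells were hidden
        "occupancy": occ[mask],                # classify: atom present?
        "type": atom_types[mask][occ[mask]],   # classify: which element
        "offset": offsets[mask][occ[mask]],    # regress: in-cell position
    }

rng = np.random.default_rng(0)
targets = build_mae_targets(np.array([6, -1, 1, -1, 8]), np.zeros((5, 3)), 0.5, rng)
```

<p>Masking empty cells as well as atom cells is what forces the occupancy prediction: the model cannot know in advance whether a hidden cell holds an atom.</p>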
<p><strong>Baseline Comparisons</strong>: GROVER (2D graph-based MPR), GEM (2D graph enhanced with 3D information), 3D Infomax (GNN with 3D information), Uni-Mol (3D MPR, primary baseline using the same pretraining dataset), and Mol-AE (extends Uni-Mol with atom-based MAE pretraining).</p>
<h2 id="results-and-analysis">Results and Analysis</h2>
<p><strong>Strong Overall Performance</strong>: SpaceFormer ranked 1st on 10 of 15 tasks and placed in the top 2 on 14 of 15. It surpassed the runner-up models by approximately 20% on quantum property tasks (HOMO, LUMO, GAP, E1-CC2, Dipmom), validating that modeling the non-atom space captures electronic structure better than atom-only models.</p>
<h3 id="key-results-on-quantum-properties">Key Results on Quantum Properties</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>GROVER</th>
          <th>GEM</th>
          <th>3D Infomax</th>
          <th>Uni-Mol</th>
          <th>Mol-AE</th>
          <th><strong>SpaceFormer</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HOMO (Ha)</td>
          <td>0.0075</td>
          <td>0.0068</td>
          <td>0.0065</td>
          <td>0.0052</td>
          <td>0.0050</td>
          <td><strong>0.0042</strong></td>
      </tr>
      <tr>
          <td>LUMO (Ha)</td>
          <td>0.0086</td>
          <td>0.0080</td>
          <td>0.0070</td>
          <td>0.0060</td>
          <td>0.0057</td>
          <td><strong>0.0040</strong></td>
      </tr>
      <tr>
          <td>GAP (Ha)</td>
          <td>0.0109</td>
          <td>0.0107</td>
          <td>0.0095</td>
          <td>0.0081</td>
          <td>0.0080</td>
          <td><strong>0.0064</strong></td>
      </tr>
      <tr>
          <td>E1-CC2 (eV)</td>
          <td>0.0101</td>
          <td>0.0090</td>
          <td>0.0089</td>
          <td>0.0067</td>
          <td>0.0070</td>
          <td><strong>0.0058</strong></td>
      </tr>
      <tr>
          <td>Dipmom (Debye)</td>
          <td>0.0752</td>
          <td>0.0289</td>
          <td>0.0291</td>
          <td>0.0106</td>
          <td>0.0113</td>
          <td><strong>0.0083</strong></td>
      </tr>
  </tbody>
</table>
<p>SpaceFormer&rsquo;s advantage is most pronounced on computational properties that depend on electronic structure. On experimental biological tasks (e.g., BBBP), where measurements are noisy, the advantage narrows or reverses: Uni-Mol achieves 0.9066 AUC on BBBP compared to SpaceFormer&rsquo;s 0.8605.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>The authors present several ablations that isolate the source of SpaceFormer&rsquo;s improvements:</p>
<p><strong>MAE vs. Denoising</strong>: SpaceFormer with MAE pretraining outperforms SpaceFormer with denoising on all four ablation tasks. The MAE objective requires predicting <em>whether</em> an atom exists in a masked voxel, which forces the model to learn global structural dependencies. In the denoising variant, only atom cells are masked, so the model never needs to predict atom existence, reducing the task to coordinate regression.</p>
<p><strong>FLOPs Control</strong>: A SpaceFormer-Large model (4x width, atom-only) trained with comparable FLOPs still falls short of SpaceFormer with 1000 non-atom cells on most downstream tasks. This confirms the improvement comes from modeling 3D space, not from additional compute.</p>
<p><strong>Virtual Points vs. SpaceFormer</strong>: Adding up to 200 random virtual points to Uni-Mol improves some tasks but leaves a significant gap compared to SpaceFormer, demonstrating that principled space discretization outperforms naive point augmentation.</p>
<p><strong>Efficiency Validation</strong>: The Adaptive Grid Merging method reduces the number of cells by roughly 10x with virtually no performance degradation. The 3D positional encodings scale linearly with the number of cells, while Uni-Mol&rsquo;s pretraining cost scales quadratically.</p>
<h3 id="scope-and-future-directions">Scope and Future Directions</h3>
<p>SpaceFormer does not incorporate built-in SE(3) equivariance, relying instead on data augmentation (random rotations and random boundary padding) during training. The authors identify extending SpaceFormer to force field tasks and larger systems such as proteins and complexes as promising future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="code-and-data-availability">Code and Data Availability</h3>
<ul>
<li><strong>Source Code</strong>: As of this writing, the authors have not released official source code or pre-trained weights.</li>
<li><strong>Datasets</strong>: Pretraining utilized the same 19M unlabeled molecule dataset as Uni-Mol. Downstream tasks use a newly curated internal benchmark built from subsets of GDB-17, MoleculeNet, and Biogen ADME. The exact customized scaffold splits for these evaluations are pending the official code release.</li>
<li><strong>Compute</strong>: Pretraining the base SpaceFormer encoder (~67.8M parameters, adaptive grid merging set to level 3) required approximately 50 hours on 8 NVIDIA A100 GPUs.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Source code</td>
          <td>Code</td>
          <td>N/A</td>
          <td>Not publicly released as of March 2026</td>
      </tr>
      <tr>
          <td>Pre-trained weights</td>
          <td>Model</td>
          <td>N/A</td>
          <td>Not publicly released</td>
      </tr>
      <tr>
          <td>Pretraining data (19M molecules)</td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Same dataset as Uni-Mol; not independently released</td>
      </tr>
      <tr>
          <td>Downstream benchmark splits</td>
          <td>Dataset</td>
          <td>N/A</td>
          <td>Custom scaffold splits pending code release</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>The model treats a molecule as a 3D &ldquo;image&rdquo; via voxelization, processed by a Transformer.</p>
<p><strong>Input Representation</strong>:</p>
<ul>
<li><strong>Discretization</strong>: 3D space divided into grid cells with length <strong>$0.49\text{\AA}$</strong> (based on O-H bond length to ensure at most one atom per cell)</li>
<li><strong>Tokenization</strong>: Tokens are pairs $(t_i, c_i)$ where $t_i$ is atom type (or NULL) and $c_i$ is the coordinate</li>
<li><strong>Embeddings</strong>: Continuous embeddings with dimension 512. Inner-cell positions discretized with $0.01\text{\AA}$ precision</li>
</ul>
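<p>The voxelization step above can be sketched directly; the toy coordinates and the shift-to-non-negative-origin convention are illustrative assumptions:</p>

```python
import numpy as np

CELL = 0.49  # grid cell edge length in angstroms

def voxelize(coords, cell=CELL):
    """Map atom coordinates to integer cell indices plus in-cell offsets,
    so that coords == idx * cell + offset exactly.

    coords : (n_atoms, 3) positions in angstroms, shifted to be non-negative
    """
    idx = np.floor(coords / cell).astype(int)
    return idx, coords - idx * cell

# Toy three-atom geometry (illustrative values, not a real molecule).
coords = np.array([[1.00, 1.00, 1.00],
                   [1.80, 1.40, 1.00],
                   [0.30, 1.55, 1.00]])
idx, off = voxelize(coords)
```

<p>The $0.49\text{\AA}$ cell size guarantees at most one atom per cell, so each occupied cell becomes a single $(t_i, c_i)$ token whose in-cell offset can then be discretized at $0.01\text{\AA}$ precision.</p>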
<p><strong>Transformer Specifications</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Layers</th>
          <th>Attention Heads</th>
          <th>Embedding Dim</th>
          <th>FFN Dim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Encoder</strong></td>
          <td>16</td>
          <td>8</td>
          <td>512</td>
          <td>2048</td>
      </tr>
      <tr>
          <td><strong>Decoder</strong> (MAE)</td>
          <td>4</td>
          <td>4</td>
          <td>256</td>
          <td>1024</td>
      </tr>
  </tbody>
</table>
<p><strong>Attention Mechanism</strong>: FlashAttention for efficient handling of large sequence lengths.</p>
<p><strong>Positional Encodings</strong>:</p>
<ol>
<li><strong>3D Directional PE</strong>: Extension of Rotary Position Embedding (RoPE) to 3D continuous space, capturing relative directionality</li>
<li><strong>3D Distance PE</strong>: Random Fourier Features (RFF) to approximate Gaussian kernel of pairwise distances with linear complexity</li>
</ol>
<h4 id="visualizing-rff-and-rope">Visualizing RFF and RoPE</h4>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-rff-rope-visualization.webp"
         alt="Four-panel visualization showing RFF distance encoding and RoPE directional encoding mechanisms"
         title="Four-panel visualization showing RFF distance encoding and RoPE directional encoding mechanisms"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Visual intuition for SpaceFormer&rsquo;s positional encodings: Top row shows RFF distance encoding (Gaussian-like attention decay and high-frequency feature fingerprints). Bottom row shows RoPE directional encoding (vector rotation fields and resulting attention patterns).</figcaption>
    
</figure>

<p><strong>Top Row (Distance / RFF):</strong> Shows how the model learns &ldquo;closeness.&rdquo; Distance is represented by a complex &ldquo;fingerprint&rdquo; of waves that creates a Gaussian-like force field.</p>
<ul>
<li><strong>Top Left (The Force Field):</strong> The attention score (dot product) naturally forms a Gaussian curve. It is high when atoms are close and decays to zero as they move apart. This mimics physical forces without the model needing to learn that math from scratch.</li>
<li><strong>Top Right (The Fingerprint):</strong> Each dimension oscillates at a different frequency. A specific distance (e.g., $d=2$) has a unique combination of high and low values across these dimensions, creating a unique &ldquo;fingerprint&rdquo; for that exact distance.</li>
</ul>
<p><strong>Bottom Row (Direction / RoPE):</strong> Shows how the model learns &ldquo;relative position.&rdquo; It visualizes the vector rotation and how that creates a grid-like attention pattern.</p>
<ul>
<li><strong>Bottom Left (The Rotation):</strong> This visualizes the &ldquo;X-axis chunk&rdquo; of the vector. As you move from left ($x=-3$) to right ($x=3$), the arrows rotate. The model compares angles between atoms to determine relative positions.</li>
<li><strong>Bottom Right (The Grid):</strong> The resulting attention pattern when combining X-rotations and Y-rotations. The red/blue regions show where the model pays attention relative to the center, forming a grid-like interference pattern that distinguishes relative positions (e.g., &ldquo;top-right&rdquo; vs &ldquo;bottom-left&rdquo;).</li>
</ul>
<h4 id="adaptive-grid-merging">Adaptive Grid Merging</h4>
<p>To make the 3D grid approach computationally tractable, two key strategies are employed:</p>
<ol>
<li><strong>Grid Sampling</strong>: Randomly selecting 10-20% of empty cells during training</li>
<li><strong>Adaptive Grid Merging</strong>: Recursively merging $2 \times 2 \times 2$ blocks of empty cells into larger &ldquo;coarse&rdquo; cells, creating a multi-resolution view that is fine-grained near atoms and coarse-grained in empty space (merging set to Level 3)</li>
</ol>
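<p>The merging rule can be sketched as an octree-style pass over a boolean occupancy grid. The greedy all-eight-children-free criterion is my reading of the description, and the token representation is illustrative:</p>

```python
import numpy as np

def merge_empty_cells(occupied, max_level=3):
    """Merge 2x2x2 blocks of empty cells into coarser cells, level by level.

    occupied : (n, n, n) bool grid, n a power of two, True where a cell holds an atom
    Returns (level, index) tokens: atom cells stay at level 0; an empty block
    merges upward only if all eight of its children merged at the level below.
    """
    tokens = [(0, tuple(map(int, i))) for i in np.argwhere(occupied)]
    free = ~occupied                       # cells still eligible for merging
    for level in range(1, max_level + 1):
        n = free.shape[0]
        blocks = free.reshape(n // 2, 2, n // 2, 2, n // 2, 2)
        merged = blocks.all(axis=(1, 3, 5))            # fully-free 2x2x2 blocks
        up = np.repeat(np.repeat(np.repeat(merged, 2, 0), 2, 1), 2, 2)
        # Free cells in partially occupied blocks are emitted at the finer level.
        tokens += [(level - 1, tuple(map(int, i))) for i in np.argwhere(free & ~up)]
        free = merged                                   # recurse on the coarse grid
    tokens += [(max_level, tuple(map(int, i))) for i in np.argwhere(free)]
    return tokens

# One atom in an 8x8x8 grid: 512 dense cells collapse to 22 tokens.
occ = np.zeros((8, 8, 8), dtype=bool)
occ[0, 0, 0] = True
tokens = merge_empty_cells(occ)
```

<p>In this toy case the 512 fine cells collapse to 22 tokens that still tile the full volume (a level-$\ell$ token covers $8^\ell$ fine cells), the same order-of-magnitude reduction the paper reports for real molecules.</p>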
<p><strong>Visualizing Adaptive Grid Merging</strong>:</p>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-adaptive-grid-merging.webp"
         alt="2D simulation of adaptive grid merging for an H2O molecule showing multi-resolution cells"
         title="2D simulation of adaptive grid merging for an H2O molecule showing multi-resolution cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Adaptive grid merging demonstrated on H₂O. Red cells (Level 0) contain atoms and remain at full resolution. Progressively darker blue cells represent merged empty regions at higher levels, covering the same volume with fewer tokens.</figcaption>
    
</figure>

<p>The adaptive grid process compresses empty space around molecules while maintaining high resolution near atoms:</p>
<ul>
<li><strong>Red Cells (Level 0):</strong> The smallest squares ($0.49\text{\AA}$) containing atoms. These are kept at highest resolution because electron density changes rapidly here.</li>
<li><strong>Light Blue Cells (Level 0/1):</strong> Small empty regions close to atoms.</li>
<li><strong>Darker Blue Cells (Level 2/3):</strong> Large blocks of empty space further away.</li>
</ul>
<p>If we used a naive uniform grid, we would have to process thousands of empty &ldquo;Level 0&rdquo; cells containing almost zero information. By merging them into larger blocks (the dark blue squares), the model covers the same volume with significantly fewer input tokens, reducing the number of tokens by roughly <strong>10x</strong> compared to a dense grid.</p>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-adaptive-grid-benzene.webp"
         alt="Adaptive grid merging visualization for benzene molecule showing hexagonal ring with multi-resolution grid cells"
         title="Adaptive grid merging visualization for benzene molecule showing hexagonal ring with multi-resolution grid cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Adaptive grid merging for benzene (C₆H₆). The model maintains maximum resolution (red Level 0 cells) only where atoms exist, while merging vast empty regions into large blocks (dark blue L3/L4 cells). This allows the model to focus computational power on chemically active zones.</figcaption>
    
</figure>

<p>The benzene example above demonstrates how this scales to larger molecules. The characteristic hexagonal ring of 6 carbon atoms (black) and 6 hydrogen atoms (white) occupies a small fraction of the total grid. The dark blue corners (L3, L4) represent massive merged blocks of empty space, letting the model concentrate the bulk of its computation on the red &ldquo;active&rdquo; zones where chemistry actually happens.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lu, S., Ji, X., Zhang, B., Yao, L., Liu, S., Gao, Z., Zhang, L., &amp; Ke, G. (2025). Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling. <em>Proceedings of the 42nd International Conference on Machine Learning (ICML)</em>, 267, 40491-40504. <a href="https://proceedings.mlr.press/v267/lu25e.html">https://proceedings.mlr.press/v267/lu25e.html</a></p>
<p><strong>Publication</strong>: ICML 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{lu2025beyond,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lu, Shuqi and Ji, Xiaohong and Zhang, Bohang and Yao, Lin and Liu, Siyuan and Gao, Zhifeng and Zhang, Linfeng and Ke, Guolin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 42nd International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{40491--40504}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{267}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{Proceedings of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=Wd9KPQCKwq">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=Wd9KPQCKwq">PDF on OpenReview</a></li>
<li><a href="https://icml.cc/virtual/2025/poster/45004">ICML 2025 poster page</a></li>
</ul>
]]></content:encoded></item></channel></rss>