<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Method on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/tags/method/</link><description>Recent content in Method on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Tue, 07 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/tags/method/index.xml" rel="self" type="application/rss+xml"/><item><title>Block-Recurrent Transformers for Long Sequences</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/block-recurrent-transformers/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/block-recurrent-transformers/</guid><description>Block-Recurrent Transformers combine attention and recurrence for linear-complexity language modeling on long documents like books and code.</description><content:encoded><![CDATA[<h2 id="a-method-for-combining-attention-with-block-level-recurrence">A Method for Combining Attention with Block-Level Recurrence</h2>
<p>This is a <strong>Method</strong> paper that introduces the Block-Recurrent Transformer, a model architecture that integrates recurrence into the transformer framework at the block level. Rather than processing tokens one at a time (as in traditional RNNs) or attending over entire sequences (as in standard transformers), this approach applies a transformer layer recurrently across blocks of tokens. The result is a model with linear complexity in sequence length that maintains the parallelism benefits of transformers during training. A related approach, <a href="/notes/machine-learning/model-architectures/rwkv-rnn-transformer-architecture/">RWKV</a>, later explored similar ideas using linear attention with channel-wise decay.</p>
<h2 id="why-transformers-struggle-with-long-documents">Why Transformers Struggle with Long Documents</h2>
<p>Transformers have largely replaced RNNs for sequence modeling tasks, but their quadratic self-attention cost limits the length of sequences they can process. A transformer with a window size of 512 tokens cannot see information beyond that window, making it blind to long-range dependencies in books, technical papers, or source code repositories.</p>
<p>Prior approaches to this problem fall into several categories: sparse attention patterns (BigBird, Routing Transformers, Reformer), sequence compression (Linformer, Funnel Transformers), and linearized attention approximations. These methods either sacrifice the expressiveness of full softmax attention or introduce implementation complexity.</p>
<p>Traditional RNNs like LSTMs offer linear complexity but suffer from three key limitations: sequential processing prevents parallelism on modern hardware, a single state vector bottlenecks information capacity, and vanishing gradients limit effective memory to a few hundred tokens.</p>
<h2 id="block-level-recurrence-with-lstm-style-gates">Block-Level Recurrence with LSTM-Style Gates</h2>
<p>The core innovation is applying a standard transformer layer in a recurrent fashion along the sequence, operating on blocks of $W$ tokens rather than individual tokens. The recurrent cell maintains $S$ state vectors (typically $S = W = 512$) that are updated at each block boundary.</p>
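<p>The outer recurrence can be sketched as follows. The <code>cell</code> interface, names, and shapes here are illustrative stand-ins, not the paper&rsquo;s Meliad implementation:</p>

```python
import numpy as np

def block_recurrent_forward(tokens, cell, W=512, S=512):
    """Process a long sequence block by block, carrying S state vectors."""
    d = tokens.shape[-1]
    state = np.zeros((S, d))  # a learned initial state in the real model
    outputs = []
    for start in range(0, len(tokens), W):
        block = tokens[start:start + W]        # up to W token embeddings
        block_out, state = cell(block, state)  # vertical + horizontal paths
        outputs.append(block_out)
    return np.concatenate(outputs), state
```

<p>Within each block, attention runs in parallel as usual; recurrence happens only at the $W$-token block boundaries, which is what recovers linear complexity in sequence length.</p>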
<h3 id="the-recurrent-cell">The Recurrent Cell</h3>
<p>The cell has two processing directions:</p>
<ul>
<li><strong>Vertical direction</strong>: An ordinary transformer layer with self-attention over input tokens and cross-attention to recurrent states, producing output embeddings.</li>
<li><strong>Horizontal direction</strong>: Self-attention over current state vectors and cross-attention to input tokens, producing updated state vectors. Residual connections are replaced with gates.</li>
</ul>
<p>Self-attention and cross-attention are computed in parallel (not sequentially), with results concatenated and fed into a linear projection. Keys and values are shared between directions, while queries are separate, yielding four query sets: $Q_e^v$, $Q_s^v$ (vertical) and $Q_s^h$, $Q_e^h$ (horizontal).</p>
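<p>A single-head sketch of this shared-key/value, four-query layout (random matrices stand in for learned projections; masking and multi-head details are omitted):</p>

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def dual_attention(E, S, rng):
    """Shared keys/values for tokens (E) and states (S); four query sets."""
    d = E.shape[-1]
    p = lambda m=1: rng.standard_normal((m * d, d)) / np.sqrt(m * d)
    Wk, Wv = p(), p()
    Ke, Ve = E @ Wk, E @ Wv  # token keys/values, shared by both directions
    Ks, Vs = S @ Wk, S @ Wv  # state keys/values, shared by both directions
    # vertical: tokens self-attend and cross-attend to states, in parallel
    vert = np.concatenate([attend(E @ p(), Ke, Ve),
                           attend(E @ p(), Ks, Vs)], axis=-1) @ p(2)
    # horizontal: states self-attend and cross-attend to tokens, in parallel
    horiz = np.concatenate([attend(S @ p(), Ks, Vs),
                            attend(S @ p(), Ke, Ve)], axis=-1) @ p(2)
    return vert, horiz
```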
<h3 id="gating-mechanisms">Gating Mechanisms</h3>
<p>Two gate types are explored. The <strong>fixed gate</strong> uses a learned convex combination:</p>
<p>$$
g = \sigma(b_g)
$$</p>
<p>$$
c_{t+1} = c_t \odot g + z_t \odot (1 - g)
$$</p>
<p>where $g$ is constant after training, implementing an <a href="https://en.wikipedia.org/wiki/Moving_average">exponential moving average</a>.</p>
<p>The <strong>LSTM gate</strong> uses input and forget gates:</p>
<p>$$
i_t = \sigma(W_i h_t + b_i - 1)
$$</p>
<p>$$
f_t = \sigma(W_f h_t + b_f + 1)
$$</p>
<p>$$
c_{t+1} = c_t \odot f_t + z_t \odot i_t
$$</p>
<p>The bias offsets ($-1$ for input, $+1$ for forget) initialize the model to &ldquo;remember&rdquo; by default, which is critical for training stability. Without careful initialization, the model can fall into a local optimum where it ignores the recurrent state entirely. This echoes the <a href="/notes/machine-learning/model-architectures/can-recurrent-neural-networks-warp-time/">gate initialization challenges studied by Tallec and Ollivier</a>, who derived chrono initialization for LSTMs from time-warping invariance.</p>
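<p>Both gates can be written in a few lines. This NumPy sketch assumes per-channel parameters; the names are illustrative:</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fixed_gate(c, z, b_g):
    """Learned convex combination; b_g is a trained per-channel bias."""
    g = sigmoid(b_g)                   # constant after training
    return c * g + z * (1.0 - g)

def lstm_gate(c, z, h, W_i, b_i, W_f, b_f):
    """LSTM-style input/forget gates with the paper's bias offsets."""
    i = sigmoid(h @ W_i + b_i - 1.0)   # -1: start with a small input gate
    f = sigmoid(h @ W_f + b_f + 1.0)   # +1: start with a large forget gate
    return c * f + z * i
```

<p>With zero-initialized weights and biases, the forget gate starts near $\sigma(1) \approx 0.73$ and the input gate near $\sigma(-1) \approx 0.27$, biasing the cell toward retaining state early in training.</p>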
<h3 id="gate-configurations">Gate Configurations</h3>
<p>Three configurations are tested: <strong>dual</strong> (gates on both attention and MLP outputs), <strong>single</strong> (gate only on MLP output), and <strong>skip</strong> (gate only on attention output, no MLP). The skip configuration removes the large MLP from the recurrent layer entirely.</p>
<h3 id="learned-state-ids">Learned State IDs</h3>
<p>Since the same weights are applied to all state vectors, learned &ldquo;state IDs&rdquo; (analogous to position embeddings) are added so each state vector can issue distinct queries. T5-style relative position bias is used for token self-attention, with no position bias for state-token cross-attention.</p>
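<p>A minimal illustration of why state IDs are needed (shapes and names are hypothetical): identical state slots pushed through shared weights produce identical queries unless each slot gets a distinct learned offset.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
S, d = 4, 8
state = np.zeros((S, d))                         # identical state slots
state_ids = rng.standard_normal((S, d)) * 0.02   # learned in the real model

W_q = rng.standard_normal((d, d))
queries_plain = state @ W_q                    # every row is identical
queries_with_ids = (state + state_ids) @ W_q   # rows now differ per slot
```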
<h2 id="language-modeling-on-pg19-arxiv-and-github">Language Modeling on PG19, arXiv, and GitHub</h2>
<h3 id="experimental-setup">Experimental Setup</h3>
<p>The base model is a 12-layer transformer with 150M parameters (8 heads of size 128, embedding dimension 1024, MLP hidden size 4096). The recurrent layer is placed at layer 10 with segment length $N = 4096$ and window size $W = 512$. The architecture is evaluated on three long-document datasets:</p>
<ul>
<li><strong>PG19</strong>: Full-length books from <a href="https://en.wikipedia.org/wiki/Project_Gutenberg">Project Gutenberg</a> (pre-1919)</li>
<li><strong>arXiv</strong>: Mathematics papers in LaTeX</li>
<li><strong>GitHub</strong>: Concatenated source code from open-source repositories</li>
</ul>
<p>Results are reported in bits-per-token ($\log_2$ of per-token perplexity; lower is better).</p>
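<p>For reference, converting between bits-per-token and per-token perplexity:</p>

```python
import math

def bits_per_token(ppl):
    """Bits-per-token is the base-2 log of per-token perplexity."""
    return math.log2(ppl)

def ppl_from_bits(bpt):
    """Invert: per-token perplexity from bits-per-token."""
    return 2.0 ** bpt
```

<p>A drop from 3.58 to 3.53 bits corresponds to roughly 3.5% lower token-level perplexity ($2^{0.05} \approx 1.035$).</p>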
<h3 id="baselines">Baselines</h3>
<p>Five baselines are compared: Transformer-XL with window sizes of 512, 1024, and 2048, plus 12-layer and 13-layer sliding window models. The 13-layer sliding window (Slide:13L) is the primary comparison, having equivalent computation cost and parameter count to the recurrent models.</p>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Step Time</th>
          <th>PG19 (bytes)</th>
          <th>PG19 (tokens)</th>
          <th>arXiv</th>
          <th>GitHub</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>XL:512</td>
          <td>0.88</td>
          <td>1.01</td>
          <td>3.62</td>
          <td>1.45</td>
          <td>1.21</td>
      </tr>
      <tr>
          <td>XL:2048</td>
          <td>2.11</td>
          <td>0.990</td>
          <td>3.58</td>
          <td>1.31</td>
          <td>1.01</td>
      </tr>
      <tr>
          <td>Slide:13L</td>
          <td>1.00</td>
          <td>0.989</td>
          <td>3.58</td>
          <td>1.42</td>
          <td>1.17</td>
      </tr>
      <tr>
          <td>Rec:fixed:skip</td>
          <td>0.99</td>
          <td>0.952</td>
          <td>3.53</td>
          <td>1.24</td>
          <td>0.976</td>
      </tr>
      <tr>
          <td>Rec:fixed:dual</td>
          <td>1.01</td>
          <td>0.957</td>
          <td>3.52</td>
          <td>1.27</td>
          <td>0.991</td>
      </tr>
      <tr>
          <td>Feedback:fixed:skip</td>
          <td>1.35</td>
          <td>0.935</td>
          <td>3.49</td>
          <td>1.24</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Memorizing Trans. 64k</td>
          <td>1.94</td>
          <td>0.950</td>
          <td>3.53</td>
          <td>1.22</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The Rec:fixed:skip configuration achieves the best overall results while being slightly faster than the 13-layer baseline. It outperforms XL:2048, which runs over 2x slower. The block feedback variant (allowing all layers to cross-attend to recurrent states) improves perplexity further at ~35-40% higher step time.</p>
<h3 id="scaling-behavior">Scaling Behavior</h3>
<p>Models from 40M to 1.3B parameters show that the benefit of recurrence is <a href="/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/">consistent across scales</a> and increases with model size. At larger sizes, adding recurrence provides a benefit greater than doubling the number of parameters. The 1.3B parameter model achieves 26.50 word-level perplexity on PG19, setting a new state of the art at the time of publication.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>PG19 Perplexity</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compressive Transformer</td>
          <td>36</td>
          <td>33.6</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Routing Transformer</td>
          <td>22</td>
          <td>33.2</td>
          <td>490M</td>
      </tr>
      <tr>
          <td>Perceiver AR</td>
          <td>60</td>
          <td>28.9</td>
          <td>974.6M</td>
      </tr>
      <tr>
          <td>Block-Recurrent Transformer</td>
          <td>24</td>
          <td>26.50</td>
          <td>1.3B</td>
      </tr>
  </tbody>
</table>
<h3 id="ablations">Ablations</h3>
<ul>
<li><strong>Multiple recurrent layers</strong>: Two adjacent layers (9, 10) provide no benefit. Two separated layers (4, 10) help but no more than adding another non-recurrent layer.</li>
<li><strong>Number of states</strong>: Improvement up to 1024 states, degradation at 2048.</li>
<li><strong>Window size reduction</strong>: Reducing the sliding window hurts Transformer-XL dramatically but has a smaller impact on the recurrent model, which compensates via recurrence.</li>
<li><strong>Gate type</strong>: The fixed gate consistently outperforms the LSTM gate despite being theoretically less expressive.</li>
</ul>
<h3 id="qualitative-analysis">Qualitative Analysis</h3>
<p>Comparing per-token predictions against Transformer-XL on PG19 books, the recurrent model&rsquo;s advantage comes overwhelmingly from predicting proper names (17/20 top-improvement tokens). In 19/20 cases, the predicted word was outside the attention window, confirming it was stored in recurrent state. The model can remember book titles and authors across 60,000+ tokens.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>The Block-Recurrent Transformer demonstrates that recurrence at the block level is a cost-effective way to improve language modeling on long sequences. The fixed:skip configuration (the simplest variant) performs best, suggesting the model primarily uses recurrence for long-range name lookup rather than complex reasoning. The fact that removing the MLP from the recurrent layer has minimal impact further supports this interpretation.</p>
<p>Key limitations include: the model was only evaluated on language modeling perplexity (no downstream tasks), the LSTM gate underperforms the simpler fixed gate (suggesting untapped potential for more expressive recurrence), and the authors acknowledge that training the recurrent layer to fully exploit its capacity for knowledge extraction will require further advances.</p>
<p>The authors note that evaluating on downstream tasks requiring long-range context (book summarization, long-document QA, code completion) is an important direction for future work.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>PG19</td>
          <td>~29k books</td>
          <td>Public domain, freely available</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>arXiv</td>
          <td>Mathematics papers</td>
          <td>Obtained via private channels, not redistributable</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>GitHub</td>
          <td>Open-source repos</td>
          <td>Obtained via private channels, not redistributable</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adafactor</li>
<li>Learning rate: 1.0 with inverse square root decay (initial experiments), cosine decay with max 0.01 (scaling experiments)</li>
<li>Warmup: 1000 steps</li>
<li>Dropout: 0.05</li>
<li>Vocabulary: 32k SentencePiece (T5 pretrained for initial, custom for scaling)</li>
<li>Gate initialization: bias of $+1$ for forget gate, $-1$ for input gate to ensure initial &ldquo;remember&rdquo; behavior</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Layers</th>
          <th>Parameters</th>
          <th>Recurrent Layers</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>12 (+1 recurrent)</td>
          <td>~151-164M</td>
          <td>Layer 10</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>24 (+2 recurrent)</td>
          <td>650M</td>
          <td>Layers 10, 20</td>
      </tr>
      <tr>
          <td>XL</td>
          <td>24 (+2 recurrent)</td>
          <td>1.3B</td>
          <td>Layers 10, 20</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Model</th>
          <th>PG19 (tokens)</th>
          <th>arXiv</th>
          <th>GitHub</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bits-per-token</td>
          <td>Rec:fixed:skip</td>
          <td>3.53</td>
          <td>1.24</td>
          <td>0.976</td>
      </tr>
      <tr>
          <td>Word-level PPL</td>
          <td>1.3B model</td>
          <td>26.50</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Error bars on PG19 are between 0.002 and 0.007 (3 runs with different seeds).</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: 32 TPU v4 replicas</li>
<li>Training time: ~48 hours for 500k steps on PG19</li>
<li>Batch size: 32 (segment length 4096) or 256 (segment length 512), adjusted so each model sees the same tokens per step</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Available</th>
          <th>License</th>
          <th>URL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Code (Meliad)</td>
          <td>Yes</td>
          <td>Apache 2.0</td>
          <td><a href="https://github.com/google-research/meliad">github.com/google-research/meliad</a></td>
      </tr>
      <tr>
          <td>PG19 Dataset</td>
          <td>Yes</td>
          <td>Public Domain</td>
          <td>Public</td>
      </tr>
      <tr>
          <td>arXiv Dataset</td>
          <td>No</td>
          <td>Not redistributable</td>
          <td>Private</td>
      </tr>
      <tr>
          <td>GitHub Dataset</td>
          <td>No</td>
          <td>Not redistributable</td>
          <td>Private</td>
      </tr>
      <tr>
          <td>Pretrained Models</td>
          <td>No</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Assessment</strong>: Partially Reproducible. Source code is available under Apache 2.0 and the PG19 dataset is public. However, two of three evaluation datasets (arXiv, GitHub) were obtained via private channels and are not redistributable. No pretrained model checkpoints are released.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hutchins, D., Schlag, I., Wu, Y., Dyer, E., &amp; Neyshabur, B. (2022). Block-Recurrent Transformers. <em>Advances in Neural Information Processing Systems 35 (NeurIPS 2022)</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{hutchins2022block,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Block-Recurrent Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hutchins, DeLesley and Schlag, Imanol and Wu, Yuhuai and Dyer, Ethan and Neyshabur, Behnam}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2203.07852}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>NaViT: Native Resolution Vision Transformer</title><link>https://hunterheidenreich.com/notes/machine-learning/model-architectures/navit-native-resolution-vit/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/machine-learning/model-architectures/navit-native-resolution-vit/</guid><description>NaViT uses sequence packing to train Vision Transformers on images at native resolution and aspect ratio, improving efficiency and flexibility.</description><content:encoded><![CDATA[<h2 id="a-method-for-flexible-resolution-vision-transformers">A Method for Flexible-Resolution Vision Transformers</h2>
<p>This is a <strong>Method</strong> paper that introduces NaViT (Native Resolution ViT), a Vision Transformer trained using sequence packing to handle images of arbitrary resolution and aspect ratio. The core idea, called &ldquo;Patch n&rsquo; Pack,&rdquo; borrows example packing from NLP and applies it to vision: patches from multiple images of different sizes are concatenated into a single sequence, enabling native-resolution processing without resizing or padding.</p>
<h2 id="why-fixed-resolution-pipelines-are-suboptimal">Why Fixed-Resolution Pipelines Are Suboptimal</h2>
<p>Standard computer vision pipelines resize all images to a fixed square resolution before processing. This practice originates from convolutional neural network constraints, where fixed spatial dimensions were architecturally required. Even with Vision Transformers, which operate on sequences of patches and could in principle handle variable lengths, the convention of fixed-resolution input persists.</p>
<p>This approach has clear drawbacks. Most images are not square: in ImageNet, LVIS, and WebLI, the majority of images deviate by more than 20% from a 1:1 aspect ratio. Resizing distorts content and discards information, while padding wastes computation. Prior work such as FlexiViT addressed variable patch sizes, and Pix2Struct introduced aspect-ratio-preserving patching, but neither fully solved the problem of training efficiently on images at their original resolution.</p>
<h2 id="patch-n-pack-sequence-packing-for-vision">Patch n&rsquo; Pack: Sequence Packing for Vision</h2>
<p>The key insight is that ViT already processes images as sequences of patch tokens, and NLP has long used example packing to handle variable-length sequences efficiently. NaViT applies this directly: patches from multiple images (each at its native resolution and aspect ratio) are packed into a single fixed-length sequence.</p>
<h3 id="architectural-modifications">Architectural Modifications</h3>
<p>Three changes enable Patch n&rsquo; Pack:</p>
<ol>
<li>
<p><strong>Masked self-attention and masked pooling</strong>: Attention masks prevent patches from different images from attending to each other. Masked pooling extracts a single representation per image from the packed sequence.</p>
</li>
<li>
<p><strong>Factorized positional embeddings</strong>: Standard 1D positional embeddings cannot handle arbitrary resolutions. NaViT decomposes position into separate $x$ and $y$ embeddings $\phi_{x}$ and $\phi_{y}$, which are summed together. Two schemes are considered:</p>
<ul>
<li>Absolute embeddings: $\phi(p): [0, \text{maxLen}] \to \mathbb{R}^{D}$, a function of the absolute patch index</li>
<li>Fractional embeddings: $\phi(r): [0, 1] \to \mathbb{R}^{D}$, where $r = p / \text{side-length}$ is the relative position along the image</li>
</ul>
</li>
<li>
<p><strong>Chunked contrastive loss</strong>: For contrastive pretraining, the $\mathcal{O}(n^{2})$ loss computation is handled via chunked computation across device subsets to support the high number of examples per sequence.</p>
</li>
</ol>
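<p>The absolute variant of the factorized embedding can be sketched as follows (table sizes and names are assumptions; the fractional variant would index by $r = p/\text{side-length}$ instead of the raw patch index):</p>

```python
import numpy as np

def factorized_pos_embed(h, w, emb_x, emb_y):
    """Per-axis embedding tables indexed by patch coordinates, then summed."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return emb_x[xs] + emb_y[ys]   # (h, w, D), one vector per patch
```

<p>Because each axis is embedded independently, any $(x, y)$ pair within the table range is covered, even if that exact combination never appeared during training.</p>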
<h3 id="training-innovations">Training Innovations</h3>
<p>Packing enables two techniques that were previously impractical:</p>
<ul>
<li>
<p><strong>Continuous token dropping</strong>: Instead of dropping the same proportion of tokens from every image, the drop rate varies per image. Some images keep all tokens while others have aggressive dropping, reducing the train/inference discrepancy. The drop rate can follow a schedule that decreases over training.</p>
</li>
<li>
<p><strong>Resolution sampling</strong>: Each image&rsquo;s resolution is sampled from a distribution (e.g., $R \sim \mathcal{U}(64, R_{\text{max}})$) while preserving aspect ratio. This mixes the throughput benefits of small images with the detail of large ones.</p>
</li>
</ul>
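<p>Both techniques reduce to a few lines of per-image sampling. The Beta parameters and resolution range below are illustrative, not the paper&rsquo;s tuned values:</p>

```python
import numpy as np

def sample_resolution(rng, r_min=64, r_max=512):
    """Uniform side-length sampling; aspect ratio is preserved separately."""
    return int(rng.uniform(r_min, r_max))

def drop_tokens(patches, rng, a=1.0, b=3.0):
    """Variable dropping: a fresh drop rate per image, drawn from Beta(a, b)."""
    rate = rng.beta(a, b)
    keep = rng.random(len(patches)) >= rate
    return patches[keep]
```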
<h3 id="computational-overhead">Computational Overhead</h3>
<p>A natural concern is the $\mathcal{O}(n^{2})$ attention cost for longer packed sequences. In practice, as the transformer hidden dimension scales, attention becomes an increasingly small fraction of total compute (the MLP dominates). With a simple greedy bin-packing algorithm, padding tokens typically account for less than 2% of each packed sequence.</p>
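<p>The paper describes the packing step only as a &ldquo;simple greedy&rdquo; algorithm; a first-fit variant is one plausible reading of that, sketched here with token counts standing in for images:</p>

```python
def greedy_pack(lengths, capacity):
    """First-fit packing of per-image token counts into fixed-length sequences."""
    bins = []                      # each entry: [remaining_capacity, items]
    for n in lengths:
        for b in bins:
            if b[0] >= n:          # first sequence with room wins
                b[0] -= n
                b[1].append(n)
                break
        else:
            bins.append([capacity - n, [n]])
    return [items for _, items in bins]
```

<p>Whatever capacity remains in each sequence is filled with padding tokens, which is the source of the small reported overhead.</p>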
<h2 id="pretraining-and-downstream-evaluation">Pretraining and Downstream Evaluation</h2>
<p>NaViT is evaluated in two pretraining setups:</p>
<ul>
<li><strong>Classification pretraining</strong> on JFT-4B with sigmoid cross-entropy loss, evaluated via linear probing (10 examples per class)</li>
<li><strong>Contrastive pretraining</strong> on WebLI using image-text contrastive loss, evaluated on zero-shot ImageNet classification and COCO retrieval</li>
</ul>
<h3 id="training-efficiency">Training Efficiency</h3>
<p>At a fixed compute budget, NaViT consistently outperforms ViT across model scales; NaViT matches the top-performing ViT with 4x less compute. The primary driver is throughput: packing with variable resolution and token dropping enables NaViT-L/16 to process approximately 5x more images during training.</p>
<h3 id="variable-resolution-results">Variable Resolution Results</h3>
<p>Models trained with variable resolution ($R \sim \mathcal{U}(64, R_{\text{max}})$) outperform fixed-resolution models even when evaluated at the fixed resolution&rsquo;s own training resolution. Sampling side lengths from a truncated normal biased toward lower values gives the best cost-performance trade-off.</p>
<p>For fine-tuning on ImageNet-1k, a single NaViT fine-tuned with variable resolutions (64 to 512) matches the performance of models fine-tuned at each specific resolution individually.</p>
<h3 id="positional-embedding-comparison">Positional Embedding Comparison</h3>
<p>Factorized embeddings outperform both standard ViT 1D embeddings (with interpolation) and Pix2Struct&rsquo;s learned 2D embeddings. The factorized approach generalizes to resolutions outside the training range, while 2D embeddings fail because they require seeing all $(x, y)$ coordinate pairs during training. Additive combination of $\phi_{x}$ and $\phi_{y}$ works best.</p>
<h3 id="token-dropping-strategies">Token Dropping Strategies</h3>
<p>Variable token dropping with Beta-distributed rates consistently outperforms constant rates. Resolution-dependent dropping (higher rates for higher-resolution images) further improves performance. Scheduling the drop rate to decrease over training provides additional gains.</p>
<h3 id="downstream-tasks">Downstream Tasks</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Setup</th>
          <th>Result</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Semantic segmentation</td>
          <td>ADE20k, L/16, linear decoder</td>
          <td>NaViT at $R_{384}$ beats ViT at $R_{512}$ while being 2x faster</td>
      </tr>
      <tr>
          <td>Object detection</td>
          <td>OWL-ViT-L/14 backbone</td>
          <td>NaViT: 28.3% LVIS AP vs. ViT: 23.3%</td>
      </tr>
      <tr>
          <td>Video classification</td>
          <td>Kinetics-400, tubelet extraction</td>
          <td>NaViT-L matches ViViT-L (80.4%) in ~6x fewer epochs</td>
      </tr>
      <tr>
          <td>Fairness annotation</td>
          <td>FairFace, CelebA linear probes</td>
          <td>Statistically significant accuracy improvements ($p = 3 \times 10^{-4}$)</td>
      </tr>
  </tbody>
</table>
<h3 id="out-of-distribution-robustness">Out-of-Distribution Robustness</h3>
<p>NaViT shows strong gains on ImageNet-A (which contains many extreme aspect ratios) when evaluated without center cropping. Performance on ObjectNet is also competitive. The model maintains stable calibration (ECE between 0.045 and 0.047) across a wide range of token counts per image (128 to 1024).</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>NaViT demonstrates that sequence packing, when applied to Vision Transformers, yields substantial improvements in training efficiency, inference flexibility, and downstream performance. The approach processes images at their native resolution without the information loss from resizing or the waste from padding.</p>
<p>Key takeaways:</p>
<ul>
<li>4x compute reduction to match top ViT performance</li>
<li>A single model works across a continuous range of resolutions at inference time</li>
<li>Variable-resolution training and token dropping provide complementary efficiency gains</li>
<li>Factorized positional embeddings generalize to unseen resolutions</li>
<li>Benefits transfer to detection, segmentation, video, and fairness tasks</li>
</ul>
<p>Limitations: The paper does not release model weights or code. All experiments use Google-internal datasets (JFT-4B, WebLI) and infrastructure (TPUs, JAX/Scenic), making direct reproduction difficult. The attention masking approach for packing assumes that cross-image attention is undesirable, which may not hold for all tasks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification pretraining</td>
          <td>JFT-4B</td>
          <td>~4B labeled images</td>
          <td>Google-internal, not publicly available</td>
      </tr>
      <tr>
          <td>Contrastive pretraining</td>
          <td>WebLI</td>
          <td>Large-scale web data</td>
          <td>Google-internal, not publicly available</td>
      </tr>
      <tr>
          <td>Classification fine-tuning</td>
          <td>ImageNet-1k</td>
          <td>1.28M images</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Segmentation</td>
          <td>ADE20k</td>
          <td>20K images</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Detection</td>
          <td>LVIS</td>
          <td>164K images</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Video</td>
          <td>Kinetics-400</td>
          <td>~240K videos</td>
          <td>Publicly available (partial)</td>
      </tr>
      <tr>
          <td>Fairness</td>
          <td>FairFace, CelebA</td>
          <td>108K / 200K images</td>
          <td>Publicly available</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Greedy bin-packing for sequence construction (less than 2% padding tokens)</li>
<li>Resolution sampling: side length from truncated normal $\mathcal{N}_{t}(-0.5, 1)$ mapped to $[64, R_{\text{max}}]$</li>
<li>Token dropping: Beta-distributed per-image rates, optionally resolution-dependent</li>
<li>Factorized positional embeddings with additive combination</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>NaViT variants: B/16, L/16, L/14</li>
<li>Based on vanilla ViT with query-key normalization, no biases, attention pooling</li>
<li>Implemented in JAX/FLAX within the Scenic framework</li>
<li>No public model checkpoints available</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>NaViT</th>
          <th>ViT Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JFT linear probe (L/16)</td>
          <td>Matches top ViT</td>
          <td>4x more compute</td>
          <td>Compute-matched comparison</td>
      </tr>
      <tr>
          <td>ImageNet zero-shot (L/14)</td>
          <td>72.9%</td>
          <td>68.3%</td>
          <td>Contrastive pretraining</td>
      </tr>
      <tr>
          <td>LVIS AP (L/14)</td>
          <td>28.3%</td>
          <td>23.3%</td>
          <td>OWL-ViT detection</td>
      </tr>
      <tr>
          <td>LVIS AP rare (L/14)</td>
          <td>24.3%</td>
          <td>17.2%</td>
          <td>OWL-ViT detection</td>
      </tr>
      <tr>
          <td>ADE20k mIoU (L/16, 384)</td>
          <td>Beats ViT@512</td>
          <td>At 2x cost</td>
          <td>Segmenter linear decoder</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training on Cloud TPUs (specific configuration not detailed)</li>
<li>Inference latency measured on Cloud TPUv3</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Dehghani, M., Mustafa, B., Djolonga, J., Heek, J., Minderer, M., Caron, M., Steiner, A., Puigcerver, J., Geirhos, R., Alabdulmohsin, I., Oliver, A., Padlewski, P., Gritsenko, A., Lučić, M., &amp; Houlsby, N. (2023). Patch n&rsquo; Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. <em>Advances in Neural Information Processing Systems 36 (NeurIPS 2023)</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{dehghani2023patch,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Patch n&#39; Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Dehghani, Mostafa and Mustafa, Basil and Djolonga, Josip and Heek, Jonathan and Minderer, Matthias and Caron, Mathilde and Steiner, Andreas and Puigcerver, Joan and Geirhos, Robert and Alabdulmohsin, Ibrahim and Oliver, Avital and Padlewski, Piotr and Gritsenko, Alexey and Lučić, Mario and Houlsby, Neil}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2307.06304}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.CV}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Beyond Atoms: 3D Space Modeling for Molecular Pretraining</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-modeling/beyond-atoms/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-modeling/beyond-atoms/</guid><description>Lu et al. introduce SpaceFormer, a Transformer that models entire 3D molecular space including atoms for superior representations.</description><content:encoded><![CDATA[<h2 id="paper-typology-and-contribution">Paper Typology and Contribution</h2>
<p>This is a <strong>Method</strong> paper. It challenges the atom-centric paradigm of molecular representation learning by proposing a novel framework that models the continuous 3D space surrounding atoms. The core contribution is <strong>SpaceFormer</strong>, a Transformer-based architecture that discretizes molecular space into grids to capture physical phenomena (electron density, electromagnetic fields) often missed by traditional point-cloud models.</p>
<h2 id="the-physical-intuition-modeling-empty-space">The Physical Intuition: Modeling &ldquo;Empty&rdquo; Space</h2>
<p><strong>The Gap</strong>: Prior 3D molecular representation models, such as Uni-Mol, treat molecules as discrete sets of atoms, essentially point clouds in 3D space. However, from a quantum physics perspective, the &ldquo;empty&rdquo; space between atoms is far from empty. It is permeated by electron density distributions and electromagnetic fields that determine molecular properties.</p>
<p><strong>The Hypothesis</strong>: Explicitly modeling this continuous 3D space alongside discrete atom positions yields superior representations for downstream tasks, particularly for computational properties that depend on electronic structure, such as HOMO/LUMO energies and energy gaps.</p>
<h2 id="a-surprising-observation-virtual-points-improve-representations">A Surprising Observation: Virtual Points Improve Representations</h2>
<p>Before proposing SpaceFormer, the authors present a simple yet revealing experiment. They augment Uni-Mol by adding randomly sampled virtual points (VPs) from the 3D space within the circumscribed cuboid of each molecule. These VPs carry no chemical information whatsoever: they are purely random noise points.</p>
<p>The result is surprising: adding just 10 random VPs already yields a noticeable improvement in validation loss. The improvement remains consistent and gradually increases as the number of VPs grows, eventually reaching a plateau. This observation holds across downstream tasks as well, with Uni-Mol + VPs improving on several quantum property predictions (LUMO, E1-CC2, E2-CC2) compared to vanilla Uni-Mol.</p>
<p>The implication is that even uninformative spatial context helps the model learn better representations, motivating a principled framework for modeling the full 3D molecular space.</p>
<h2 id="spaceformer-voxelization-and-3d-positional-encodings">SpaceFormer: Voxelization and 3D Positional Encodings</h2>
<p>The key innovation is treating the molecular representation problem as <strong>3D space modeling</strong>. SpaceFormer follows these core steps:</p>
<ol>
<li><strong>Voxelizes the entire 3D space</strong> into a grid with cells of $0.49\text{\AA}$ (based on O-H bond length to ensure at most one atom per cell).</li>
<li><strong>Uses adaptive multi-resolution grids</strong> to efficiently handle empty space, keeping it fine-grained near atoms and coarse-grained far away.</li>
<li><strong>Applies Transformers to 3D spatial tokens</strong> with custom positional encodings that achieve linear complexity.</li>
</ol>
<p>Specifically, the model utilizes two forms of 3D Positional Encoding:</p>
<p><strong>3D Directional PE (RoPE Extension)</strong>
They extend Rotary Position Embedding (RoPE) to 3D continuous space by splitting the query and key vectors into three blocks, one per spatial axis. The directional attention score takes the form:</p>
<p>$$
\begin{aligned}
\mathbf{q}_{i}^{\top} \mathbf{k}_{j} = \sum_{s=1}^{3} \mathbf{q}_{i,s}^{\top} \mathbf{R}(c_{j,s} - c_{i,s}) \mathbf{k}_{j,s}
\end{aligned}
$$</p>
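<p>A minimal sketch of this directional attention score, under my own illustrative choices of block size, frequencies, and function names. The key property, visible in the equation above, is that rotating each axis block of the query and key by its own coordinate makes the score depend only on the offsets $c_{j,s} - c_{i,s}$:</p>

```python
import numpy as np

def rope_rotate(vec, coord, freqs):
    """Rotate consecutive 2-d pairs of `vec` by angles coord * freqs."""
    v = vec.reshape(-1, 2)
    ang = coord * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    return np.stack([v[:, 0] * cos - v[:, 1] * sin,
                     v[:, 0] * sin + v[:, 1] * cos], axis=1).reshape(-1)

def directional_score(q, k, ci, cj, freqs):
    """Split q/k into x, y, z blocks; rotate each block by its own coordinate,
    so the dot product sees only the relative offsets c_j - c_i."""
    d = q.shape[0] // 3
    return sum(rope_rotate(q[s*d:(s+1)*d], ci[s], freqs)
               @ rope_rotate(k[s*d:(s+1)*d], cj[s], freqs)
               for s in range(3))

# The score is invariant under a common translation of both coordinates.
rng = np.random.default_rng(1)
q, k = rng.normal(size=12), rng.normal(size=12)
freqs = np.array([1.0, 0.1])
ci, cj = np.array([0.3, -1.0, 2.0]), np.array([1.1, 0.5, -0.7])
shift = np.array([5.0, -3.0, 2.5])
s1 = directional_score(q, k, ci, cj, freqs)
s2 = directional_score(q, k, ci + shift, cj + shift, freqs)
```

<p>This is the same identity that makes 1D RoPE relative, $\mathbf{R}(a)^\top \mathbf{R}(b) = \mathbf{R}(b - a)$, applied independently per axis.</p>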
<p><strong>3D Distance PE (RFF Approximation)</strong>
To compute invariant geometric distance without incurring quadratic memory overhead, they use Random Fourier Features (RFF) to approximate a Gaussian kernel of pairwise distances:</p>
<p>$$
\begin{aligned}
\exp \left( - \frac{\lVert \mathbf{c}_i - \mathbf{c}_j \rVert_2^2}{2\sigma^2} \right) &amp;\approx z(\mathbf{c}_i)^\top z(\mathbf{c}_j) \\
z(\mathbf{c}_i) &amp;= \sqrt{\frac{2}{d}} \cos(\sigma^{-1} \mathbf{c}_i^\top \boldsymbol{\omega} + \mathbf{b})
\end{aligned}
$$</p>
<p>This approach enables the model to natively encode complex field-like phenomena without computing exhaustive $O(N^2)$ distance matrices.</p>
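<p>A small numerical check of the RFF approximation above; the feature count, $\sigma$, and test coordinates are arbitrary illustrative choices:</p>

```python
import numpy as np

def rff_features(coords, omega, b, sigma):
    """z(c) = sqrt(2/d) * cos(c^T omega / sigma + b), chosen so that
    z(ci) . z(cj) ~= exp(-||ci - cj||^2 / (2 sigma^2))."""
    d = omega.shape[1]
    return np.sqrt(2.0 / d) * np.cos(coords @ omega / sigma + b)

rng = np.random.default_rng(0)
n_feat, sigma = 4096, 1.5
omega = rng.normal(size=(3, n_feat))        # omega ~ N(0, I); 1/sigma scaling is in z
b = rng.uniform(0.0, 2.0 * np.pi, size=n_feat)
ci = np.array([0.2, -0.5, 1.0])
cj = np.array([0.6, 0.1, 0.4])
approx = rff_features(ci[None, :], omega, b, sigma) @ rff_features(cj[None, :], omega, b, sigma).T
exact = np.exp(-np.sum((ci - cj) ** 2) / (2.0 * sigma ** 2))
```

<p>Because each token carries only its own $d$-dimensional feature vector, pairwise distance information enters attention at linear rather than quadratic memory cost.</p>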
<h2 id="experimental-setup-and-downstream-tasks">Experimental Setup and Downstream Tasks</h2>
<p><strong>Pretraining Data</strong>: 19 million unlabeled molecules from the same dataset used by Uni-Mol.</p>
<p><strong>Downstream Benchmarks</strong>: The authors propose a new benchmark of 15 tasks, motivated by known limitations of MoleculeNet: invalid structures, inconsistent chemical representations, data curation errors, and an inability to adequately distinguish model performance. The tasks split into two categories:</p>
<ol>
<li>
<p><strong>Computational Properties (Quantum Mechanics)</strong></p>
<ul>
<li>Subsets of <a href="/notes/computational-chemistry/datasets/gdb-17/">GDB-17</a> (HOMO, LUMO, GAP energy prediction, 20K samples; E1-CC2, E2-CC2, f1-CC2, f2-CC2, 21.7K samples)</li>
<li>Cata-condensed polybenzenoid hydrocarbons (Dipole moment, adiabatic ionization potential, D3 dispersion correction, 8,678 samples)</li>
<li>Metric: Mean Absolute Error (MAE)</li>
</ul>
</li>
<li>
<p><strong>Experimental Properties (Pharma/Bio)</strong></p>
<ul>
<li>MoleculeNet tasks (BBBP, BACE for drug discovery)</li>
<li>Biogen ADME tasks (HLM, MME, Solubility)</li>
<li>Metrics: AUC for classification, MAE for regression</li>
</ul>
</li>
</ol>
<p><strong>Splitting Strategy</strong>: All datasets use an 8:1:1 train/validation/test split with <strong>scaffold splitting</strong> to test out-of-distribution generalization.</p>
<p><strong>Training Setup</strong>:</p>
<ul>
<li><strong>Objective</strong>: Masked Auto-Encoder (MAE) with 30% random masking. The model predicts whether a masked cell contains an atom and, if so, classifies the atom type and regresses its precise offset position within the cell.</li>
<li><strong>Hardware</strong>: ~50 hours on 8 NVIDIA A100 GPUs</li>
<li><strong>Optimizer</strong>: Adam ($\beta_1=0.9, \beta_2=0.99$)</li>
<li><strong>Learning Rate</strong>: Peak 1e-4 with linear decay and 0.01 warmup ratio</li>
<li><strong>Batch Size</strong>: 128</li>
<li><strong>Total Updates</strong>: 1 million</li>
</ul>
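<p>The masking objective can be sketched as target construction over grid cells; the field names and the NULL-as-&minus;1 convention are my assumptions for illustration:</p>

```python
import numpy as np

def build_mae_targets(atom_types, offsets, mask_ratio=0.3, rng=None):
    """Mask a fraction of cells. For each masked cell the model must predict
    occupancy; for occupied masked cells it must also predict the atom type
    and regress the atom's precise offset inside the cell.

    atom_types : (n_cells,) int, -1 marks an empty (NULL) cell
    offsets    : (n_cells, 3) float, in-cell atom positions
    """
    rng = rng or np.random.default_rng()
    mask = rng.random(atom_types.shape[0]) < mask_ratio
    occ = atom_types >= 0
    return {
        "mask": mask,                          # which cells were hidden
        "occupancy": occ[mask],                # classify: atom present?
        "type": atom_types[mask][occ[mask]],   # classify: which element
        "offset": offsets[mask][occ[mask]],    # regress: in-cell position
    }

rng = np.random.default_rng(0)
targets = build_mae_targets(np.array([6, -1, 1, -1, 8]), np.zeros((5, 3)), 0.5, rng)
```

<p>Masking empty cells as well as atom cells is what forces the occupancy prediction: the model cannot know in advance whether a hidden cell holds an atom.</p>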
<p><strong>Baseline Comparisons</strong>: GROVER (2D graph-based MPR), GEM (2D graph enhanced with 3D information), 3D Infomax (GNN with 3D information), Uni-Mol (3D MPR, primary baseline using the same pretraining dataset), and Mol-AE (extends Uni-Mol with atom-based MAE pretraining).</p>
<h2 id="results-and-analysis">Results and Analysis</h2>
<p><strong>Strong Overall Performance</strong>: SpaceFormer ranked 1st on 10 of 15 tasks and placed in the top 2 on 14 of 15. It surpassed the runner-up models by approximately 20% on quantum property tasks (HOMO, LUMO, GAP, E1-CC2, Dipmom), validating that modeling the non-atom space captures electronic structure better than atom-only models.</p>
<h3 id="key-results-on-quantum-properties">Key Results on Quantum Properties</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>GROVER</th>
          <th>GEM</th>
          <th>3D Infomax</th>
          <th>Uni-Mol</th>
          <th>Mol-AE</th>
          <th><strong>SpaceFormer</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HOMO (Ha)</td>
          <td>0.0075</td>
          <td>0.0068</td>
          <td>0.0065</td>
          <td>0.0052</td>
          <td>0.0050</td>
          <td><strong>0.0042</strong></td>
      </tr>
      <tr>
          <td>LUMO (Ha)</td>
          <td>0.0086</td>
          <td>0.0080</td>
          <td>0.0070</td>
          <td>0.0060</td>
          <td>0.0057</td>
          <td><strong>0.0040</strong></td>
      </tr>
      <tr>
          <td>GAP (Ha)</td>
          <td>0.0109</td>
          <td>0.0107</td>
          <td>0.0095</td>
          <td>0.0081</td>
          <td>0.0080</td>
          <td><strong>0.0064</strong></td>
      </tr>
      <tr>
          <td>E1-CC2 (eV)</td>
          <td>0.0101</td>
          <td>0.0090</td>
          <td>0.0089</td>
          <td>0.0067</td>
          <td>0.0070</td>
          <td><strong>0.0058</strong></td>
      </tr>
      <tr>
          <td>Dipmom (Debye)</td>
          <td>0.0752</td>
          <td>0.0289</td>
          <td>0.0291</td>
          <td>0.0106</td>
          <td>0.0113</td>
          <td><strong>0.0083</strong></td>
      </tr>
  </tbody>
</table>
<p>SpaceFormer&rsquo;s advantage is most pronounced on computational properties that depend on electronic structure. On experimental biological tasks (e.g., BBBP), where measurements are noisy, the advantage narrows or reverses: Uni-Mol achieves 0.9066 AUC on BBBP compared to SpaceFormer&rsquo;s 0.8605.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>The authors present several ablations that isolate the source of SpaceFormer&rsquo;s improvements:</p>
<p><strong>MAE vs. Denoising</strong>: SpaceFormer with MAE pretraining outperforms SpaceFormer with denoising on all four ablation tasks. The MAE objective requires predicting <em>whether</em> an atom exists in a masked voxel, which forces the model to learn global structural dependencies. In the denoising variant, only atom cells are masked, so the model never needs to predict atom existence, reducing the task to coordinate regression.</p>
<p><strong>FLOPs Control</strong>: A SpaceFormer-Large model (4x width, atom-only) trained with comparable FLOPs still falls short of SpaceFormer with 1000 non-atom cells on most downstream tasks. This confirms the improvement comes from modeling 3D space, not from additional compute.</p>
<p><strong>Virtual Points vs. SpaceFormer</strong>: Adding up to 200 random virtual points to Uni-Mol improves some tasks but leaves a significant gap compared to SpaceFormer, demonstrating that principled space discretization outperforms naive point augmentation.</p>
<p><strong>Efficiency Validation</strong>: The Adaptive Grid Merging method reduces the number of cells by roughly 10x with virtually no performance degradation. The 3D positional encodings scale linearly with the number of cells, while Uni-Mol&rsquo;s pretraining cost scales quadratically.</p>
<h3 id="scope-and-future-directions">Scope and Future Directions</h3>
<p>SpaceFormer does not incorporate built-in SE(3) equivariance, relying instead on data augmentation (random rotations and random boundary padding) during training. The authors identify extending SpaceFormer to force field tasks and larger systems such as proteins and complexes as promising future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="code-and-data-availability">Code and Data Availability</h3>
<ul>
<li><strong>Source Code</strong>: As of this writing, the authors have not released official source code or pre-trained weights.</li>
<li><strong>Datasets</strong>: Pretraining utilized the same 19M unlabeled molecule dataset as Uni-Mol. Downstream tasks use a newly curated internal benchmark built from subsets of GDB-17, MoleculeNet, and Biogen ADME. The exact customized scaffold splits for these evaluations are pending the official code release.</li>
<li><strong>Compute</strong>: Pretraining the base SpaceFormer encoder (~67.8M parameters, adaptive grid merging set to level 3) required approximately 50 hours on 8 NVIDIA A100 GPUs.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Source code</td>
          <td>Code</td>
          <td>N/A</td>
          <td>Not publicly released as of March 2026</td>
      </tr>
      <tr>
          <td>Pre-trained weights</td>
          <td>Model</td>
          <td>N/A</td>
          <td>Not publicly released</td>
      </tr>
      <tr>
          <td>Pretraining data (19M molecules)</td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Same dataset as Uni-Mol; not independently released</td>
      </tr>
      <tr>
          <td>Downstream benchmark splits</td>
          <td>Dataset</td>
          <td>N/A</td>
          <td>Custom scaffold splits pending code release</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>The model treats a molecule as a 3D &ldquo;image&rdquo; via voxelization, processed by a Transformer.</p>
<p><strong>Input Representation</strong>:</p>
<ul>
<li><strong>Discretization</strong>: 3D space divided into grid cells with length <strong>$0.49\text{\AA}$</strong> (based on O-H bond length to ensure at most one atom per cell)</li>
<li><strong>Tokenization</strong>: Tokens are pairs $(t_i, c_i)$ where $t_i$ is atom type (or NULL) and $c_i$ is the coordinate</li>
<li><strong>Embeddings</strong>: Continuous embeddings with dimension 512. Inner-cell positions discretized with $0.01\text{\AA}$ precision</li>
</ul>
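<p>The voxelization step above can be sketched directly; the toy coordinates and the shift-to-non-negative-origin convention are illustrative assumptions:</p>

```python
import numpy as np

CELL = 0.49  # grid cell edge length in angstroms

def voxelize(coords, cell=CELL):
    """Map atom coordinates to integer cell indices plus in-cell offsets,
    so that coords == idx * cell + offset exactly.

    coords : (n_atoms, 3) positions in angstroms, shifted to be non-negative
    """
    idx = np.floor(coords / cell).astype(int)
    return idx, coords - idx * cell

# Toy three-atom geometry (illustrative values, not a real molecule).
coords = np.array([[1.00, 1.00, 1.00],
                   [1.80, 1.40, 1.00],
                   [0.30, 1.55, 1.00]])
idx, off = voxelize(coords)
```

<p>The $0.49\text{\AA}$ cell size guarantees at most one atom per cell, so each occupied cell becomes a single $(t_i, c_i)$ token whose in-cell offset can then be discretized at $0.01\text{\AA}$ precision.</p>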
<p><strong>Transformer Specifications</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Layers</th>
          <th>Attention Heads</th>
          <th>Embedding Dim</th>
          <th>FFN Dim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Encoder</strong></td>
          <td>16</td>
          <td>8</td>
          <td>512</td>
          <td>2048</td>
      </tr>
      <tr>
          <td><strong>Decoder</strong> (MAE)</td>
          <td>4</td>
          <td>4</td>
          <td>256</td>
          <td>1024</td>
      </tr>
  </tbody>
</table>
<p><strong>Attention Mechanism</strong>: FlashAttention for efficient handling of large sequence lengths.</p>
<p><strong>Positional Encodings</strong>:</p>
<ol>
<li><strong>3D Directional PE</strong>: Extension of Rotary Position Embedding (RoPE) to 3D continuous space, capturing relative directionality</li>
<li><strong>3D Distance PE</strong>: Random Fourier Features (RFF) to approximate Gaussian kernel of pairwise distances with linear complexity</li>
</ol>
<h4 id="visualizing-rff-and-rope">Visualizing RFF and RoPE</h4>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-rff-rope-visualization.webp"
         alt="Four-panel visualization showing RFF distance encoding and RoPE directional encoding mechanisms"
         title="Four-panel visualization showing RFF distance encoding and RoPE directional encoding mechanisms"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Visual intuition for SpaceFormer&rsquo;s positional encodings: Top row shows RFF distance encoding (Gaussian-like attention decay and high-frequency feature fingerprints). Bottom row shows RoPE directional encoding (vector rotation fields and resulting attention patterns).</figcaption>
    
</figure>

<p><strong>Top Row (Distance / RFF):</strong> Shows how the model learns &ldquo;closeness.&rdquo; Distance is represented by a complex &ldquo;fingerprint&rdquo; of waves that creates a Gaussian-like force field.</p>
<ul>
<li><strong>Top Left (The Force Field):</strong> The attention score (dot product) naturally forms a Gaussian curve. It is high when atoms are close and decays to zero as they move apart. This mimics physical forces without the model needing to learn that math from scratch.</li>
<li><strong>Top Right (The Fingerprint):</strong> Each dimension oscillates at a different frequency. A specific distance (e.g., $d=2$) has a unique combination of high and low values across these dimensions, creating a unique &ldquo;fingerprint&rdquo; for that exact distance.</li>
</ul>
<p><strong>Bottom Row (Direction / RoPE):</strong> Shows how the model learns &ldquo;relative position.&rdquo; It visualizes the vector rotation and how that creates a grid-like attention pattern.</p>
<ul>
<li><strong>Bottom Left (The Rotation):</strong> This visualizes the &ldquo;X-axis chunk&rdquo; of the vector. As you move from left ($x=-3$) to right ($x=3$), the arrows rotate. The model compares angles between atoms to determine relative positions.</li>
<li><strong>Bottom Right (The Grid):</strong> The resulting attention pattern when combining X-rotations and Y-rotations. The red/blue regions show where the model pays attention relative to the center, forming a grid-like interference pattern that distinguishes relative positions (e.g., &ldquo;top-right&rdquo; vs &ldquo;bottom-left&rdquo;).</li>
</ul>
<h4 id="adaptive-grid-merging">Adaptive Grid Merging</h4>
<p>To make the 3D grid approach computationally tractable, two key strategies are employed:</p>
<ol>
<li><strong>Grid Sampling</strong>: Randomly selecting 10-20% of empty cells during training</li>
<li><strong>Adaptive Grid Merging</strong>: Recursively merging $2 \times 2 \times 2$ blocks of empty cells into larger &ldquo;coarse&rdquo; cells, creating a multi-resolution view that is fine-grained near atoms and coarse-grained in empty space (merging set to Level 3)</li>
</ol>
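<p>The merging rule can be sketched as an octree-style pass over a boolean occupancy grid. The greedy all-eight-children-free criterion is my reading of the description, and the token representation is illustrative:</p>

```python
import numpy as np

def merge_empty_cells(occupied, max_level=3):
    """Merge 2x2x2 blocks of empty cells into coarser cells, level by level.

    occupied : (n, n, n) bool grid, n a power of two, True where a cell holds an atom
    Returns (level, index) tokens: atom cells stay at level 0; an empty block
    merges upward only if all eight of its children merged at the level below.
    """
    tokens = [(0, tuple(map(int, i))) for i in np.argwhere(occupied)]
    free = ~occupied                       # cells still eligible for merging
    for level in range(1, max_level + 1):
        n = free.shape[0]
        blocks = free.reshape(n // 2, 2, n // 2, 2, n // 2, 2)
        merged = blocks.all(axis=(1, 3, 5))            # fully-free 2x2x2 blocks
        up = np.repeat(np.repeat(np.repeat(merged, 2, 0), 2, 1), 2, 2)
        # Free cells in partially occupied blocks are emitted at the finer level.
        tokens += [(level - 1, tuple(map(int, i))) for i in np.argwhere(free & ~up)]
        free = merged                                   # recurse on the coarse grid
    tokens += [(max_level, tuple(map(int, i))) for i in np.argwhere(free)]
    return tokens

# One atom in an 8x8x8 grid: 512 dense cells collapse to 22 tokens.
occ = np.zeros((8, 8, 8), dtype=bool)
occ[0, 0, 0] = True
tokens = merge_empty_cells(occ)
```

<p>In this toy case the 512 fine cells collapse to 22 tokens that still tile the full volume (a level-$\ell$ token covers $8^\ell$ fine cells), the same order-of-magnitude reduction the paper reports for real molecules.</p>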
<p><strong>Visualizing Adaptive Grid Merging</strong>:</p>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-adaptive-grid-merging.webp"
         alt="2D simulation of adaptive grid merging for an H2O molecule showing multi-resolution cells"
         title="2D simulation of adaptive grid merging for an H2O molecule showing multi-resolution cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Adaptive grid merging demonstrated on H₂O. Red cells (Level 0) contain atoms and remain at full resolution. Progressively darker blue cells represent merged empty regions at higher levels, covering the same volume with fewer tokens.</figcaption>
    
</figure>

<p>The adaptive grid process compresses empty space around molecules while maintaining high resolution near atoms:</p>
<ul>
<li><strong>Red Cells (Level 0):</strong> The smallest squares ($0.49\text{\AA}$) containing atoms. These are kept at highest resolution because electron density changes rapidly here.</li>
<li><strong>Light Blue Cells (Level 0/1):</strong> Small empty regions close to atoms.</li>
<li><strong>Darker Blue Cells (Level 2/3):</strong> Large blocks of empty space further away.</li>
</ul>
<p>If we used a naive uniform grid, we would have to process thousands of empty &ldquo;Level 0&rdquo; cells containing almost zero information. By merging them into larger blocks (the dark blue squares), the model covers the same volume with significantly fewer input tokens, reducing the number of tokens by roughly <strong>10x</strong> compared to a dense grid.</p>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-adaptive-grid-benzene.webp"
         alt="Adaptive grid merging visualization for benzene molecule showing hexagonal ring with multi-resolution grid cells"
         title="Adaptive grid merging visualization for benzene molecule showing hexagonal ring with multi-resolution grid cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Adaptive grid merging for benzene (C₆H₆). The model maintains maximum resolution (red Level 0 cells) only where atoms exist, while merging vast empty regions into large blocks (dark blue L3/L4 cells). This allows the model to focus computational power on chemically active zones.</figcaption>
    
</figure>

<p>The benzene example above demonstrates how this scales to larger molecules. The characteristic hexagonal ring of 6 carbon atoms (black) and 6 hydrogen atoms (white) occupies a small fraction of the total grid. The dark blue corners (L3, L4) represent massive merged blocks of empty space, letting the model concentrate the bulk of its computation on the red &ldquo;active&rdquo; zones where chemistry actually happens.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lu, S., Ji, X., Zhang, B., Yao, L., Liu, S., Gao, Z., Zhang, L., &amp; Ke, G. (2025). Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling. <em>Proceedings of the 42nd International Conference on Machine Learning (ICML)</em>, 267, 40491-40504. <a href="https://proceedings.mlr.press/v267/lu25e.html">https://proceedings.mlr.press/v267/lu25e.html</a></p>
<p><strong>Publication</strong>: ICML 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{lu2025beyond,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lu, Shuqi and Ji, Xiaohong and Zhang, Bohang and Yao, Lin and Liu, Siyuan and Gao, Zhifeng and Zhang, Linfeng and Ke, Guolin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 42nd International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{40491--40504}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{267}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{Proceedings of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=Wd9KPQCKwq">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=Wd9KPQCKwq">PDF on OpenReview</a></li>
<li><a href="https://icml.cc/virtual/2025/poster/45004">ICML 2025 poster page</a></li>
</ul>
]]></content:encoded></item></channel></rss>