<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Natural Language Processing on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/categories/natural-language-processing/</link><description>Recent content in Natural Language Processing on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sat, 11 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/categories/natural-language-processing/index.xml" rel="self" type="application/rss+xml"/><item><title>SpeechT5: Unified Speech-Text Pre-Training Framework</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/speecht5-unified-speech-text-pretraining/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/speecht5-unified-speech-text-pretraining/</guid><description>SpeechT5 introduces a shared encoder-decoder framework with cross-modal vector quantization for joint speech and text pre-training across six tasks.</description><content:encoded><![CDATA[<h2 id="a-unified-encoder-decoder-for-spoken-language-processing">A Unified Encoder-Decoder for Spoken Language Processing</h2>
<p>SpeechT5 is a <strong>Method</strong> paper that introduces a shared encoder-decoder pre-training framework for spoken language processing. Inspired by <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5&rsquo;s</a> text-to-text paradigm, SpeechT5 reformulates all spoken language tasks as &ldquo;speech/text to speech/text&rdquo; problems. The framework uses modal-specific pre-nets and post-nets to interface between raw speech or text and a shared Transformer encoder-decoder, enabling a single pre-trained model to handle six downstream tasks: automatic speech recognition (ASR), text-to-speech synthesis (TTS), speech translation (ST), voice conversion (VC), speech enhancement (SE), and speaker identification (SID).</p>
<h2 id="bridging-the-gap-between-speech-and-text-pre-training">Bridging the Gap Between Speech and Text Pre-Training</h2>
<p>Prior speech pre-training work (wav2vec 2.0, HuBERT) suffered from two key limitations. First, these models learned speech representations from unlabeled audio alone, ignoring the complementary information in text data that is critical for cross-modal tasks like ASR and TTS. Second, they relied on encoder-only architectures with task-specific prediction heads, leaving the decoder un-pretrained for sequence-to-sequence generation tasks.</p>
<p>SpeechT5 addresses both gaps by (1) jointly pre-training on unlabeled speech and text data, and (2) using a full encoder-decoder architecture that benefits generation tasks directly. The approach builds on the observation that speech and text, despite their surface differences, share underlying semantic structure that a unified representation can capture.</p>
<h2 id="cross-modal-vector-quantization-for-alignment">Cross-Modal Vector Quantization for Alignment</h2>
<p>The core innovation in SpeechT5 is a cross-modal <a href="https://en.wikipedia.org/wiki/Vector_quantization">vector quantization</a> (VQ) mechanism that aligns speech and text representations into a shared semantic space. The architecture consists of three components:</p>
<p><strong>Shared encoder-decoder backbone.</strong> A Transformer with 12 encoder blocks and 6 decoder blocks (768-dim, 12 heads), using relative position embeddings.</p>
<p><strong>Modal-specific pre/post-nets.</strong> Six specialized networks handle the conversion between raw modalities and the shared representation space:</p>
<ul>
<li>Speech-encoder pre-net: a convolutional feature extractor (from wav2vec 2.0) downsampling raw waveforms</li>
<li>Speech-decoder pre-net: three FC layers with ReLU, processing 80-dimensional log Mel-filterbank features</li>
<li>Speech-decoder post-net: a linear layer predicting Mel features plus five 1D conv layers (256 channels) for residual refinement, with an x-vector speaker embedding concatenated for multi-speaker support</li>
<li>Text pre/post-nets: shared embedding layers mapping between character-level token indices and hidden states (768-dim)</li>
</ul>
<p><strong>Cross-modal vector quantization.</strong> A shared codebook $\mathbf{C}^{K}$ with $K$ learnable embeddings bridges the two modalities. Encoder outputs $\mathbf{u}_i$ are quantized via nearest-neighbor lookup:</p>
<p>$$
\mathbf{c}_i = \arg\min_{j \in [K]} \| \mathbf{u}_i - \mathbf{c}_j \|_2
$$</p>
<p>A random 10% of the contextual representations are replaced with their quantized latent units before being fed to the decoder&rsquo;s cross-attention. This mixing forces the quantizer to capture cross-modal features. A diversity loss encourages full codebook utilization:</p>
<p>$$
\mathcal{L}_d = \frac{1}{K} \sum_{k=1}^{K} p_k \log p_k
$$</p>
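<p>A minimal numpy sketch can make the mechanism concrete: the nearest-neighbor lookup, the random 10% mixing, and the diversity loss over codebook usage. The codebook here is random rather than learned (the paper trains it end-to-end, which this sketch omits, e.g. via a straight-through gradient estimator), and the batch size is illustrative:</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
K, d = 100, 768                        # codebook entries, hidden size (768-dim states)
codebook = rng.normal(size=(K, d))     # learnable embeddings c_1..c_K (random here)
u = rng.normal(size=(50, d))           # a batch of encoder outputs u_i

# Nearest-neighbor lookup: c_i = argmin_j ||u_i - c_j||_2
dists = np.linalg.norm(u[:, None, :] - codebook[None, :, :], axis=-1)  # (50, K)
codes = dists.argmin(axis=-1)
quantized = codebook[codes]

# Randomly replace 10% of contextual representations with quantized units
replace = rng.random(len(u)) &lt; 0.10
mixed = np.where(replace[:, None], quantized, u)

# Diversity loss: (1/K) sum_k p_k log p_k, minimized by uniform codebook usage
p = np.bincount(codes, minlength=K) / len(codes)
L_d = np.sum(p[p > 0] * np.log(p[p > 0])) / K
</code></pre>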
<h3 id="pre-training-objectives">Pre-Training Objectives</h3>
<p>SpeechT5 combines three pre-training objectives:</p>
<p><strong>Speech pre-training</strong> uses two tasks. A bidirectional masked prediction loss $\mathcal{L}_{mlm}^{s}$ follows HuBERT&rsquo;s approach, masking 8% of timesteps in 10-step spans and predicting frame-level targets from an acoustic unit discovery model:</p>
<p>$$
\mathcal{L}_{mlm}^{s} = \sum_{n \in \mathcal{M}} \log p(\mathbf{z}_n \mid \hat{\mathbf{H}}, n)
$$</p>
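<p>The span-masking scheme itself is easy to sketch. A minimal version, assuming the 8% refers to timesteps sampled as span starts with spans of 10 steps that may overlap (the wav2vec 2.0/HuBERT convention; the paper&rsquo;s phrasing leaves this ambiguous):</p>
<pre><code class="language-python">import numpy as np

def sample_mask(T, start_prob=0.08, span=10, rng=None):
    """Choose each timestep as a span start with probability start_prob,
    then mask `span` consecutive steps from every start (spans may overlap)."""
    rng = rng or np.random.default_rng()
    starts = rng.random(T) &lt; start_prob
    mask = np.zeros(T, dtype=bool)
    for s in np.flatnonzero(starts):
        mask[s:s + span] = True
    return mask

mask = sample_mask(1000, rng=np.random.default_rng(0))
print(mask.mean())  # total masked fraction is well above 0.08 due to the span length
</code></pre>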
<p>A reconstruction loss $\mathcal{L}_{1}^{s}$ minimizes the $L_1$ distance between predicted and original Mel-filterbank features, plus a binary cross-entropy stop-token loss $\mathcal{L}_{bce}^{s}$.</p>
<p><strong>Text pre-training</strong> uses BART-style denoising, masking 30% of text spans (Poisson $\lambda = 3.5$) and training with maximum likelihood estimation:</p>
<p>$$
\mathcal{L}_{mle}^{t} = \sum_{n=1}^{N^t} \log p(\mathbf{y}_n^t \mid \mathbf{y}_{&lt; n}^t, \hat{\mathbf{X}}^t)
$$</p>
<p>The full pre-training loss combines all components:</p>
<p>$$
\mathcal{L} = \mathcal{L}_{mlm}^{s} + \mathcal{L}_{1}^{s} + \mathcal{L}_{bce}^{s} + \mathcal{L}_{mle}^{t} + \gamma \mathcal{L}_d
$$</p>
<p>where $\gamma = 0.1$.</p>
<h2 id="evaluation-across-six-spoken-language-tasks">Evaluation Across Six Spoken Language Tasks</h2>
<p>SpeechT5 was evaluated on six downstream tasks, each using a different combination of the shared encoder-decoder and task-appropriate pre/post-nets:</p>
<h3 id="automatic-speech-recognition-asr">Automatic Speech Recognition (ASR)</h3>
<p>Fine-tuned on LibriSpeech 100h with joint <a href="https://en.wikipedia.org/wiki/Connectionist_temporal_classification">CTC</a>/attention decoding. The decoding objective maximizes a combination of decoder, CTC, and language model log-probabilities:</p>
<p>$$
\alpha \log P_{Dec} + (1 - \alpha) \log P_{CTC} + \beta \log P_{LM}
$$</p>
<p>where $\alpha = 0.5$ and $\beta = 1.0$ for the 100h setting (beam size 30). Results on the test sets:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>LM</th>
          <th>test-clean</th>
          <th>test-other</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>wav2vec 2.0 BASE</td>
          <td>-</td>
          <td>6.1</td>
          <td>13.3</td>
      </tr>
      <tr>
          <td>HuBERT BASE</td>
          <td>-</td>
          <td>5.8</td>
          <td>13.3</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>-</strong></td>
          <td><strong>4.4</strong></td>
          <td><strong>10.4</strong></td>
      </tr>
      <tr>
          <td>wav2vec 2.0 BASE</td>
          <td>Transf.</td>
          <td>2.6</td>
          <td>6.3</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>Transf.</strong></td>
          <td><strong>2.4</strong></td>
          <td><strong>5.8</strong></td>
      </tr>
  </tbody>
</table>
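<p>The decoding objective above amounts to re-scoring each beam hypothesis with a weighted sum of three log-probabilities. A toy sketch, with purely illustrative hypotheses and scores:</p>
<pre><code class="language-python">def joint_score(log_p_dec, log_p_ctc, log_p_lm, alpha=0.5, beta=1.0):
    """Combined decoder/CTC/LM score for one beam hypothesis."""
    return alpha * log_p_dec + (1 - alpha) * log_p_ctc + beta * log_p_lm

# Two hypothetical beam candidates, ranked by the combined score
hyps = {
    "the cat sat": joint_score(-2.1, -2.4, -1.0),
    "the cats at": joint_score(-2.0, -3.1, -2.2),
}
best = max(hyps, key=hyps.get)  # -> "the cat sat"
</code></pre>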
<h3 id="text-to-speech-synthesis-tts">Text-to-Speech Synthesis (TTS)</h3>
<p>Fine-tuned on LibriTTS 460h clean sets with HiFi-GAN vocoder:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Naturalness</th>
          <th>MOS</th>
          <th>CMOS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ground Truth</td>
          <td>-</td>
          <td>3.87 ± 0.04</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Baseline</td>
          <td>2.76</td>
          <td>3.56 ± 0.05</td>
          <td>0</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>2.91</strong></td>
          <td><strong>3.65 ± 0.04</strong></td>
          <td><strong>+0.290</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="speech-translation-st">Speech Translation (ST)</h3>
<p>Evaluated on MUST-C English-to-German and English-to-French:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>EN-DE</th>
          <th>EN-FR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fairseq ST</td>
          <td>22.70</td>
          <td>32.90</td>
      </tr>
      <tr>
          <td>Adapter Tuning</td>
          <td>24.63</td>
          <td>34.98</td>
      </tr>
      <tr>
          <td>Baseline (HuBERT init)</td>
          <td>23.43</td>
          <td>33.76</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>25.18</strong></td>
          <td><strong>35.30</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="voice-conversion-vc">Voice Conversion (VC)</h3>
<p>Evaluated on CMU Arctic:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>WER (bdl→slt)</th>
          <th>MCD (bdl→slt)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VTN w/ TTS</td>
          <td>7.6%</td>
          <td>6.33</td>
      </tr>
      <tr>
          <td>Many-to-many VTN</td>
          <td>-</td>
          <td>6.13</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>7.8%</strong></td>
          <td><strong>5.93</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="speech-enhancement-se">Speech Enhancement (SE)</h3>
<p>On the WHAM! dataset, SpeechT5 reduced WER from 76.1% (noisy input) to 8.9%, compared to the baseline&rsquo;s 10.9%, an ~18% relative improvement.</p>
<h3 id="speaker-identification-sid">Speaker Identification (SID)</h3>
<p>On VoxCeleb1, SpeechT5 achieved 96.49% accuracy, outperforming HuBERT LARGE at 90.33% (from SUPERB) and SpeechNet multi-task at 87.90%.</p>
<h2 id="ablation-study-and-key-findings">Ablation Study and Key Findings</h2>
<p>The ablation study reveals the contribution of each pre-training component:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ASR (clean)</th>
          <th>ASR (other)</th>
          <th>VC (MCD)</th>
          <th>SID (ACC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SpeechT5</td>
          <td>4.4</td>
          <td>10.7</td>
          <td>5.93</td>
          <td>96.49%</td>
      </tr>
      <tr>
          <td>w/o Speech PT</td>
          <td>-</td>
          <td>-</td>
          <td>6.49</td>
          <td>38.61%</td>
      </tr>
      <tr>
          <td>w/o Text PT</td>
          <td>5.4</td>
          <td>12.8</td>
          <td>6.03</td>
          <td>95.60%</td>
      </tr>
      <tr>
          <td>w/o Joint PT</td>
          <td>4.6</td>
          <td>11.3</td>
          <td>6.18</td>
          <td>95.54%</td>
      </tr>
      <tr>
          <td>w/o $\mathcal{L}_{mlm}^{s}$</td>
          <td>7.6</td>
          <td>22.4</td>
          <td>6.29</td>
          <td>90.91%</td>
      </tr>
  </tbody>
</table>
<p>Key findings:</p>
<ol>
<li><strong>Speech pre-training is critical</strong>: without it, ASR fails to converge entirely, and SID accuracy drops to 38.61%.</li>
<li><strong>Text pre-training complements speech</strong>: removing it degrades ASR by ~20% relative, confirming that textual knowledge transfers to speech tasks.</li>
<li><strong>Joint pre-training enables cross-modal transfer</strong>: the vector quantization approach is essential for modality-bridging tasks like ASR.</li>
<li><strong>The masked prediction loss $\mathcal{L}_{mlm}^{s}$ is the most important single component</strong>, responsible for learning strong acoustic features.</li>
</ol>
<p>The authors note limitations in the current scope (English-only, BASE model size) and propose scaling to larger models and multilingual settings as future work.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Speech pre-training</td>
          <td>LibriSpeech</td>
          <td>960 hours</td>
          <td>Full training set</td>
      </tr>
      <tr>
          <td>Text pre-training</td>
          <td>LibriSpeech LM text</td>
          <td>400M sentences</td>
          <td>Normalized language model text</td>
      </tr>
      <tr>
          <td>ASR fine-tuning</td>
          <td>LibriSpeech</td>
          <td>100h / 960h subsets</td>
          <td></td>
      </tr>
      <tr>
          <td>TTS fine-tuning</td>
          <td>LibriTTS</td>
          <td>460h clean sets</td>
          <td></td>
      </tr>
      <tr>
          <td>ST fine-tuning</td>
          <td>MUST-C</td>
          <td>EN-DE, EN-FR</td>
          <td></td>
      </tr>
      <tr>
          <td>VC fine-tuning</td>
          <td>CMU Arctic</td>
          <td>4 speakers</td>
          <td>bdl, clb, slt, rms</td>
      </tr>
      <tr>
          <td>SE fine-tuning</td>
          <td>WHAM!</td>
          <td>16 kHz max</td>
          <td>enhance-single task</td>
      </tr>
      <tr>
          <td>SID fine-tuning</td>
          <td>VoxCeleb1</td>
          <td>100k+ utterances</td>
          <td>1,251 speakers</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam with warmup (8% of steps) to peak LR $2 \times 10^{-4}$, then linear decay (sketched after this list)</li>
<li>Speech masking: 8% of timesteps, 10-step spans</li>
<li>Text masking: 30% of spans, Poisson $\lambda = 3.5$</li>
<li>Vector quantization: 2 codebooks with 100 entries each ($100^2 = 10^4$ theoretical maximum code combinations)</li>
<li>CTC/attention joint decoding for ASR (beam size 30)</li>
<li>HiFi-GAN vocoder for TTS and SE waveform generation</li>
<li>Parallel WaveGAN vocoder for VC</li>
</ul>
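<p>A concrete reading of the optimizer bullet above, as a warmup-then-linear-decay schedule (the paper specifies warmup over the first 8% of steps to a peak of 2e-4; the exact decay endpoint here is an assumption):</p>
<pre><code class="language-python">def lr_at(step, total_steps=500_000, peak=2e-4, warmup_frac=0.08):
    """Linear warmup over the first warmup_frac of steps, then linear decay."""
    warmup = int(total_steps * warmup_frac)
    if step &lt; warmup:
        return peak * step / warmup
    return peak * (total_steps - step) / (total_steps - warmup)

assert abs(lr_at(40_000) - 2e-4) &lt; 1e-12  # peak is reached at the end of warmup
</code></pre>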
<h3 id="fine-tuning-hyperparameters">Fine-Tuning Hyperparameters</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>GPUs</th>
          <th>Steps</th>
          <th>Peak LR</th>
          <th>Batch (per GPU)</th>
          <th>Schedule</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ASR (100h)</td>
          <td>8×V100</td>
          <td>80k</td>
          <td>6e-5</td>
          <td>256k audio samples</td>
          <td>Warmup 10%, hold 40%, linear decay</td>
      </tr>
      <tr>
          <td>ASR (960h)</td>
          <td>8×V100</td>
          <td>320k</td>
          <td>1.3e-4</td>
          <td>256k audio samples</td>
          <td>Warmup 10%, hold 40%, linear decay</td>
      </tr>
      <tr>
          <td>TTS</td>
          <td>8×V100</td>
          <td>120k</td>
          <td>4e-4</td>
          <td>45k tokens</td>
          <td>Warmup 10k steps, inv. sqrt decay</td>
      </tr>
      <tr>
          <td>ST</td>
          <td>8×V100</td>
          <td>80k</td>
          <td>-</td>
          <td>-</td>
          <td>Warmup 10k steps</td>
      </tr>
      <tr>
          <td>VC</td>
          <td>8×V100</td>
          <td>60k</td>
          <td>1e-4</td>
          <td>20k tokens</td>
          <td>6k warmup, inv. sqrt decay</td>
      </tr>
      <tr>
          <td>SE</td>
          <td>8×V100</td>
          <td>100k</td>
          <td>1e-4</td>
          <td>16k tokens</td>
          <td>10k warmup, inv. sqrt decay</td>
      </tr>
      <tr>
          <td>SID</td>
          <td>8×V100</td>
          <td>60k</td>
          <td>5e-4</td>
          <td>64 segments (3s each)</td>
          <td>Triangular cyclical (1e-8 to 5e-4)</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<ul>
<li>Encoder: 12 Transformer blocks (768-dim, 3072 FFN, 12 heads)</li>
<li>Decoder: 6 Transformer blocks (same dimensions)</li>
<li>Speech-encoder pre-net: 7 conv blocks (512 channels, strides [5,2,2,2,2,2,2], kernels [10,3,3,3,3,2,2])</li>
<li>Code and pre-trained models available at <a href="https://github.com/microsoft/SpeechT5">github.com/microsoft/SpeechT5</a> (MIT license)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/microsoft/SpeechT5">microsoft/SpeechT5</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official Fairseq-based implementation</td>
      </tr>
      <tr>
          <td>Pre-trained models (via repo)</td>
          <td>Model</td>
          <td>MIT</td>
          <td>SpeechT5 BASE encoder-decoder checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://www.openslr.org/12">LibriSpeech</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>960h speech pre-training and ASR fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://www.openslr.org/60">LibriTTS</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>460h TTS fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://ict.fbk.eu/must-c/">MUST-C</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-ND-4.0</td>
          <td>Speech translation fine-tuning</td>
      </tr>
      <tr>
          <td><a href="http://www.festvox.org/cmu_arctic/">CMU Arctic</a></td>
          <td>Dataset</td>
          <td>Free</td>
          <td>Voice conversion fine-tuning</td>
      </tr>
      <tr>
          <td><a href="http://wham.whisper.ai/">WHAM!</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-4.0</td>
          <td>Speech enhancement fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html">VoxCeleb1</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA-4.0</td>
          <td>Speaker identification fine-tuning</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 32 NVIDIA V100 GPUs</li>
<li>Batch: ~90s speech per GPU + 12k text tokens per GPU, gradient accumulation 2</li>
<li>Pre-training steps: 500k</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ao, J., Wang, R., Zhou, L., Wang, C., Ren, S., Wu, Y., Liu, S., Ko, T., Li, Q., Zhang, Y., Wei, Z., Qian, Y., Li, J., &amp; Wei, F. (2022). SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. <em>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</em>, 5723-5738.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ao2022speecht,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5723--5738}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2022.acl-long.393}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>T5: Exploring Transfer Learning Limits</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/</guid><description>Raffel et al. systematically study transfer learning for NLP with a text-to-text framework, ablating architectures, objectives, data, and multi-task mixing.</description><content:encoded><![CDATA[<h2 id="a-systematic-study-of-nlp-transfer-learning">A systematic study of NLP transfer learning</h2>
<p>This is a <strong>systematization paper</strong> that provides a comprehensive empirical survey of transfer learning techniques for NLP. Rather than proposing a single new method, T5 introduces a unified text-to-text framework and uses it as a testbed to systematically compare pre-training objectives, architectures, unlabeled data sources, transfer approaches, and multi-task mixing strategies. The scale of the ablation study (covering dozens of configurations) and the release of C4, pre-trained models, and code make it both a reference guide and a resource.</p>
<h2 id="unifying-nlp-tasks-as-text-to-text">Unifying NLP tasks as text-to-text</h2>
<p>The core design decision is to cast every NLP task as a text-to-text problem: both the input and output are text strings, with a task-specific prefix. Classification, regression, summarization, translation, and question answering all use the same model, loss function (cross-entropy on output tokens), and decoding procedure. This simplicity enables fair comparison across tasks and training strategies.</p>
<p>The model architecture is a standard encoder-decoder Transformer. The paper finds that this form outperforms decoder-only (language model) and encoder-only (BERT-style) variants in the text-to-text setting, while having similar computational cost to decoder-only models despite twice the parameters (the encoder processes the input only once, then the decoder attends to it).</p>
<h2 id="multi-task-mixing-strategies-and-findings">Multi-task mixing: strategies and findings</h2>
<p>The most thesis-relevant contribution is the systematic ablation of multi-task mixing strategies (Section 3.5.2). When training on multiple tasks simultaneously (which in the text-to-text framework simply means mixing data from different sources), the central question is how to set the proportion of data from each task.</p>
<h3 id="three-mixing-strategies">Three mixing strategies</h3>
<p><strong>Examples-proportional mixing.</strong> Sample in proportion to each dataset&rsquo;s size, with an artificial cap $K$ on the maximum dataset size. Without the cap, the unsupervised pre-training data (orders of magnitude larger) would dominate all batches. The mixing rate for task $m$ is:</p>
<p>$$
r_{m} = \frac{\min(e_{m}, K)}{\sum_{n} \min(e_{n}, K)}
$$</p>
<p>where $e_{m}$ is the number of examples in task $m$&rsquo;s dataset.</p>
<p><strong>Temperature-scaled mixing.</strong> Raise each mixing rate $r_{m}$ to the power $1/T$ and renormalize. At $T=1$ this equals examples-proportional mixing; as $T$ increases, proportions approach equal mixing. Uses a large cap $K = 2^{21}$.</p>
<p><strong>Equal mixing.</strong> Sample uniformly from all tasks. Included as a negative reference: the model overfits on low-resource tasks and underfits on high-resource tasks.</p>
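<p>All three strategies reduce to a few lines of numpy. The task sizes below are hypothetical; letting $T \to \infty$ recovers equal mixing as a limiting case:</p>
<pre><code class="language-python">import numpy as np

def mixing_rates(sizes, K=None, T=1.0):
    """Examples-proportional mixing with cap K, optionally temperature-scaled.

    sizes: examples per task (e_m).  K: cap on each dataset's effective size.
    T=1 is examples-proportional; larger T moves proportions toward equal.
    """
    e = np.asarray(sizes, dtype=float)
    capped = np.minimum(e, K) if K is not None else e
    r = capped / capped.sum()
    r = r ** (1.0 / T)            # temperature scaling
    return r / r.sum()

sizes = [2**25, 2**20, 2**15]                 # hypothetical task sizes
print(mixing_rates(sizes, K=2**19))           # examples-proportional with cap
print(mixing_rates(sizes, K=2**21, T=2.0))    # temperature-scaled, T=2
print(mixing_rates(sizes, T=1e9))             # approaches equal mixing
</code></pre>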
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Mixing strategy</th>
          <th>GLUE</th>
          <th>CNN/DM</th>
          <th>SQuAD</th>
          <th>SuperGLUE</th>
          <th>EnDe</th>
          <th>EnFr</th>
          <th>EnRo</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Baseline (pre-train/fine-tune)</td>
          <td>83.28</td>
          <td>19.24</td>
          <td>80.88</td>
          <td>71.36</td>
          <td>26.98</td>
          <td>39.82</td>
          <td>27.65</td>
      </tr>
      <tr>
          <td>Equal</td>
          <td>76.13</td>
          <td>19.02</td>
          <td>76.51</td>
          <td>63.37</td>
          <td>23.89</td>
          <td>34.31</td>
          <td>26.78</td>
      </tr>
      <tr>
          <td>Examples-proportional, $K=2^{18}$</td>
          <td>81.67</td>
          <td>19.07</td>
          <td>78.17</td>
          <td>67.94</td>
          <td>24.57</td>
          <td>35.19</td>
          <td>27.39</td>
      </tr>
      <tr>
          <td>Examples-proportional, $K=2^{19}$</td>
          <td>81.42</td>
          <td>19.24</td>
          <td>79.78</td>
          <td>67.30</td>
          <td>25.21</td>
          <td>36.30</td>
          <td>27.76</td>
      </tr>
      <tr>
          <td>Temperature-scaled, $T=2$</td>
          <td>81.90</td>
          <td>19.28</td>
          <td>79.42</td>
          <td>69.92</td>
          <td>25.42</td>
          <td>36.72</td>
          <td>27.20</td>
      </tr>
  </tbody>
</table>
<p><strong>Key findings on mixing:</strong></p>
<ol>
<li>
<p><strong>Multi-task training underperforms pre-train-then-fine-tune on most tasks.</strong> No mixing strategy matches the baseline of unsupervised pre-training followed by task-specific fine-tuning.</p>
</li>
<li>
<p><strong>Equal mixing is worst.</strong> It dramatically degrades performance, confirming that proportions matter.</p>
</li>
<li>
<p><strong>There exists a task-specific sweet spot for the cap $K$.</strong> Most tasks have an optimal $K$ value; larger or smaller values hurt. The exception is very high-resource tasks (WMT English-French) that always benefit from higher mixing proportions.</p>
</li>
<li>
<p><strong>Temperature scaling at $T=2$ provides the best single compromise.</strong> It achieves reasonable performance across all tasks without requiring per-task tuning of $K$.</p>
</li>
<li>
<p><strong>Multi-task pre-training followed by fine-tuning closes the gap.</strong> When multi-task training is used as pre-training (not as the final training stage), followed by task-specific fine-tuning, performance becomes comparable to unsupervised pre-training alone. This suggests that multi-task exposure during pre-training provides useful early signal without the negative effects of forcing a single model to perform all tasks simultaneously.</p>
</li>
<li>
<p><strong>&ldquo;Leave-one-out&rdquo; training works.</strong> Pre-training on a multi-task mixture that excludes a target task, then fine-tuning on it, produces only slightly worse results. This indicates that multi-task pre-training builds general capabilities that transfer to unseen tasks without dramatic task interference.</p>
</li>
</ol>
<h2 id="data-repetition-degrades-performance">Data repetition degrades performance</h2>
<p>The paper also systematically tests the effect of pre-training data set size by truncating C4 and training over repeated data:</p>
<table>
  <thead>
      <tr>
          <th>Unique tokens</th>
          <th>Repeats</th>
          <th>GLUE</th>
          <th>SQuAD</th>
          <th>SuperGLUE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full dataset</td>
          <td>0</td>
          <td>83.28</td>
          <td>80.88</td>
          <td>71.36</td>
      </tr>
      <tr>
          <td>$2^{29}$</td>
          <td>64</td>
          <td>82.87</td>
          <td>80.97</td>
          <td>72.03</td>
      </tr>
      <tr>
          <td>$2^{27}$</td>
          <td>256</td>
          <td>82.62</td>
          <td>79.78</td>
          <td>69.97</td>
      </tr>
      <tr>
          <td>$2^{25}$</td>
          <td>1,024</td>
          <td>79.55</td>
          <td>76.27</td>
          <td>64.76</td>
      </tr>
      <tr>
          <td>$2^{23}$</td>
          <td>4,096</td>
          <td>76.34</td>
          <td>70.92</td>
          <td>59.29</td>
      </tr>
  </tbody>
</table>
<p>Performance degrades as data shrinks, with 64 repeats showing limited effects but 1,024+ repeats causing significant degradation. Training loss curves confirm memorization at high repetition counts. The paper recommends using large, diverse pre-training datasets whenever possible.</p>
<h2 id="scaling-and-final-configuration">Scaling and final configuration</h2>
<p>The paper compares scaling strategies: more data, larger models, and ensembles. Training a larger model for fewer steps generally outperforms training a smaller model on more data. Ensembles of independently pre-trained and fine-tuned models provide orthogonal gains.</p>
<p>The final T5-11B model combines the best choices from all ablations: encoder-decoder architecture, span corruption objective, C4 pre-training data, multi-task pre-training followed by fine-tuning, and scaling to 11B parameters trained on over 1 trillion tokens. It achieves state-of-the-art results on GLUE (90.3 average), SuperGLUE (88.9, near human performance of 89.8), SQuAD, and CNN/Daily Mail. It does not achieve state-of-the-art on WMT translation tasks, where methods using backtranslation and cross-lingual pre-training retain the lead.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The T5 paper&rsquo;s multi-task mixing findings are its most enduring contribution beyond the model itself. The core lessons: proportions matter enormously (equal mixing fails), examples-proportional mixing with a cap is a reasonable default, temperature scaling provides a single-knob alternative, and multi-task pre-training followed by fine-tuning can match pure unsupervised pre-training.</p>
<p><strong>Limitations:</strong></p>
<ul>
<li>All ablations use the same encoder-decoder architecture. Findings may not transfer to decoder-only models that dominate current practice.</li>
<li>The multi-task mixing experiments treat each task as a separate &ldquo;domain.&rdquo; Interactions between similar tasks (e.g., multiple classification tasks) are not isolated.</li>
<li>The paper does not provide a principled method for choosing $K$ or $T$; both require empirical search.</li>
<li>C4 has known quality issues (templated text, noisy content) that have been addressed in later datasets.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> Code, pre-trained models, and the C4 dataset are all publicly released.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>C4 (Colossal Clean Crawled Corpus)</td>
          <td>~750 GB</td>
          <td>Heuristically cleaned Common Crawl</td>
      </tr>
      <tr>
          <td>Downstream</td>
          <td>GLUE, SuperGLUE, SQuAD, CNN/DM, WMT (EnDe, EnFr, EnRo)</td>
          <td>Standard splits</td>
          <td>Text-to-text format</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>Encoder-decoder Transformer. Sizes: Small (60M), Base (220M), Large (770M), 3B, 11B. Baseline uses Base size. SentencePiece vocabulary with 32K tokens. Pre-trained for $2^{19}$ steps, fine-tuned for $2^{18}$ steps on individual tasks.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Multi-task mixing: examples-proportional with cap $K \in \{2^{16}, \ldots, 2^{21}\}$, temperature-scaled with $T \in \{2, 4, 8\}$, and equal mixing. Unsupervised objective: span corruption (mean span length 3, 15% corruption rate). Training with Adafactor optimizer, inverse square root learning rate schedule.</p>
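<p>A toy version of the span-corruption objective may help; the sentinel format follows T5&rsquo;s <code>&lt;extra_id_N&gt;</code> convention, but the fixed span length and greedy masking here are simplifications (the real implementation samples span lengths around the mean):</p>
<pre><code class="language-python">import numpy as np

def span_corrupt(tokens, corruption_rate=0.15, mean_span=3, rng=None):
    """Mask ~corruption_rate of tokens in spans, replacing each span with a
    sentinel in the input and emitting span contents as the target."""
    rng = rng or np.random.default_rng()
    n = len(tokens)
    n_mask = max(1, round(n * corruption_rate))
    n_spans = max(1, round(n_mask / mean_span))
    starts = rng.choice(n, size=n_spans, replace=False)
    masked = np.zeros(n, dtype=bool)
    for s in starts:
        masked[s:s + mean_span] = True      # nearby spans may merge
    inputs, targets, sid, i = [], [], 0, 0
    while i &lt; n:
        if masked[i]:
            sentinel = f"&lt;extra_id_{sid}>"
            inputs.append(sentinel)
            targets.append(sentinel)
            while i &lt; n and masked[i]:
                targets.append(tokens[i])
                i += 1
            sid += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return " ".join(inputs), " ".join(targets)

src = "thank you for inviting me to your party last week".split()
print(span_corrupt(src, rng=np.random.default_rng(0)))
</code></pre>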
<h3 id="hardware">Hardware</h3>
<p>All models trained using Mesh TensorFlow on TPU slices. T5-11B pre-trained for 1M steps with batch size $2^{11}$ sequences of length 512 (~1 trillion tokens total). Exact TPU pod configurations per experiment not detailed.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/google-research/text-to-text-transfer-transformer">T5 Code</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official TensorFlow implementation (JAX successor: T5X)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints">T5 Models</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Pre-trained checkpoints (Small through 11B)</td>
      </tr>
      <tr>
          <td><a href="https://www.tensorflow.org/datasets/catalog/c4">C4 Dataset</a></td>
          <td>Dataset</td>
          <td>-</td>
          <td>~750 GB cleaned Common Crawl, via TensorFlow Datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{raffel2020exploring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{21}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{140}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1--67}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SlimPajama-DC: Data Combinations for LLM Training</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/slimpajama-dc-data-combinations/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/slimpajama-dc-data-combinations/</guid><description>Shen et al. study how global deduplication and domain combinations in SlimPajama affect LLM training, finding diversity after dedup is key.</description><content:encoded><![CDATA[<h2 id="an-empirical-study-of-data-domain-combinations">An empirical study of data domain combinations</h2>
<p>This is a <strong>discovery paper</strong> that empirically investigates how different combinations and proportions of data domains affect language model pretraining. Using the SlimPajama dataset (a globally deduplicated, 627B token refinement of RedPajama), the study trains seven 1.3B model configurations with varying domain mixtures to identify which combinations and deduplication strategies produce the best downstream performance.</p>
<h2 id="why-data-combination-strategy-matters">Why data combination strategy matters</h2>
<p>Multi-source pretraining datasets combine data from web crawls, code repositories, books, academic papers, and other sources. Two underexplored questions drive this work: (1) Does deduplication within each source (local) versus across all sources (global) meaningfully affect model quality? (2) When sources are thoroughly deduplicated, how does the combination and proportion of domains affect downstream performance? Most open-source LLM training datasets (RedPajama, The Pile) perform only local deduplication, leaving cross-source redundancy unaddressed.</p>
<h2 id="global-deduplication-and-the-slimpajama-dataset">Global deduplication and the SlimPajama dataset</h2>
<p>SlimPajama applies global MinHashLSH deduplication (Jaccard similarity threshold 0.8, 13-gram signatures) across all seven data sources simultaneously. This reduces RedPajama&rsquo;s 1.2T tokens to 627B tokens, a roughly 48% reduction. The heaviest deduplication hits CommonCrawl and GitHub, which had the most cross-source overlap.</p>
<p>The key processing steps:</p>
<ol>
<li><strong>Low-length document filtering</strong>: Remove documents below a minimum length threshold.</li>
<li><strong>Global deduplication</strong>: MinHashLSH across all sources simultaneously (sketched after this list), requiring 64 CPU cores and 1.4TB peak memory. This removes both within-source and between-source duplicates.</li>
</ol>
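<p>For illustration, global deduplication with MinHashLSH can be sketched using the <code>datasketch</code> library. The Jaccard threshold (0.8) and 13-gram signatures come from the paper; the tiny pooled corpus, 128 permutations, and word-level shingling with naive normalization are assumptions of this sketch:</p>
<pre><code class="language-python">from datasketch import MinHash, MinHashLSH

corpus = [  # stand-in for documents pooled across all seven sources
    "the quick brown fox jumps over the lazy dog and runs far away into the hills",
    "the quick brown fox jumps over the lazy dog and runs far away into the hills",
    "an entirely different document about language model training data mixtures",
]

def minhash_13gram(text, num_perm=128):
    tokens = text.lower().split()
    grams = {" ".join(tokens[i:i + 13]) for i in range(max(1, len(tokens) - 12))}
    mh = MinHash(num_perm=num_perm)
    for g in grams:
        mh.update(g.encode("utf-8"))
    return mh

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard threshold from the paper
kept = []
for i, doc in enumerate(corpus):
    mh = minhash_13gram(doc)
    if not lsh.query(mh):          # no near-duplicate indexed so far: keep it
        lsh.insert(f"doc-{i}", mh)
        kept.append(doc)
print(len(kept))  # 2: the duplicated second document is dropped
</code></pre>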
<p>The resulting dataset composition:</p>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>SlimPajama</th>
          <th>RedPajama</th>
          <th>LLaMA 1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CommonCrawl</td>
          <td>52.2% (333B)</td>
          <td>72.6% (878B)</td>
          <td>67.0%</td>
      </tr>
      <tr>
          <td>C4</td>
          <td>26.7% (170B)</td>
          <td>14.4% (175B)</td>
          <td>15.0%</td>
      </tr>
      <tr>
          <td>GitHub</td>
          <td>5.2% (33B)</td>
          <td>4.9% (59B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>Books</td>
          <td>4.2% (27B)</td>
          <td>2.1% (26B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>ArXiv</td>
          <td>4.6% (29B)</td>
          <td>2.3% (28B)</td>
          <td>2.5%</td>
      </tr>
      <tr>
          <td>Wikipedia</td>
          <td>3.8% (24B)</td>
          <td>2.0% (24B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>StackExchange</td>
          <td>3.3% (21B)</td>
          <td>1.7% (20B)</td>
          <td>2.0%</td>
      </tr>
  </tbody>
</table>
<h2 id="seven-domain-combination-configurations">Seven domain combination configurations</h2>
<p>All configurations train 1.3B parameter models on 330B tokens with identical architecture and hyperparameters. The configurations systematically vary domain diversity:</p>
<ul>
<li><strong>DC-1</strong>: CommonCrawl only (single source)</li>
<li><strong>DC-2</strong>: CommonCrawl + C4 (two web sources)</li>
<li><strong>DC-3</strong>: CommonCrawl + C4 with adjusted proportions</li>
<li><strong>DC-4</strong>: Wikipedia + Books + GitHub + ArXiv + StackExchange (no web crawl)</li>
<li><strong>DC-5</strong>: CommonCrawl + C4 + Wikipedia + Books (four sources, no code/academic)</li>
<li><strong>DC-6</strong>: All seven SlimPajama sources (maximum diversity)</li>
<li><strong>DC-7</strong>: RefinedWeb CommonCrawl (external single-source baseline)</li>
</ul>
<p>The experimental design probes: incremental diversity (DC-1 to DC-2 to DC-5 to DC-6), proportion sensitivity (DC-2 vs DC-3), source importance (DC-3 vs DC-4), and specialization vs generalization (individual vs combined).</p>
<h2 id="diversity-after-global-deduplication-drives-performance">Diversity after global deduplication drives performance</h2>
<h3 id="hugging-face-leaderboard-results">Hugging Face leaderboard results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Average</th>
          <th>ARC</th>
          <th>HellaSwag</th>
          <th>MMLU</th>
          <th>TruthfulQA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RedPajama-1.3B</td>
          <td>38.0</td>
          <td>37.2</td>
          <td>55.8</td>
          <td>24.9</td>
          <td>34.3</td>
      </tr>
      <tr>
          <td>DC-1 (CC only)</td>
          <td>38.5</td>
          <td>36.3</td>
          <td>56.0</td>
          <td>27.0</td>
          <td>34.8</td>
      </tr>
      <tr>
          <td>DC-4 (no web)</td>
          <td>37.6</td>
          <td>33.4</td>
          <td>53.3</td>
          <td>26.0</td>
          <td>37.6</td>
      </tr>
      <tr>
          <td>DC-6 (all sources)</td>
          <td>40.0</td>
          <td>33.7</td>
          <td>61.0</td>
          <td>26.9</td>
          <td>38.4</td>
      </tr>
      <tr>
          <td>DC-7 (RefinedWeb)</td>
          <td>41.0</td>
          <td>35.1</td>
          <td>64.7</td>
          <td>26.2</td>
          <td>37.9</td>
      </tr>
  </tbody>
</table>
<p><strong>Key patterns:</strong></p>
<ol>
<li>
<p><strong>More domain diversity improves average performance.</strong> The progression DC-1 (38.5) to DC-2 (38.4) to DC-5 (38.6) to DC-6 (40.0) shows that, aside from the marginal dip at DC-2, adding domains lifts average accuracy once global deduplication has removed cross-source redundancy.</p>
</li>
<li>
<p><strong>Global deduplication enables clean combination.</strong> All SlimPajama configurations except DC-4 outperform RedPajama-1.3B (38.0), which uses local deduplication only. The elimination of cross-source overlap means adding sources contributes genuinely new information.</p>
</li>
<li>
<p><strong>Removing web crawl data hurts.</strong> DC-4 (no CommonCrawl/C4) scores lowest (37.6), demonstrating that web text provides essential breadth even when specialized sources are included.</p>
</li>
<li>
<p><strong>Individual domains excel at specific tasks.</strong> DC-1 (CC only) achieves the highest ARC and MMLU scores. DC-4 leads on Winogrande. DC-5 leads on WSC273. No single combination dominates all tasks, reinforcing that diversity trades specialization for generalization.</p>
</li>
<li>
<p><strong>Findings transfer to 7B scale.</strong> The best 1.3B configuration insights were applied to a 7B model trained with large batch sizes, achieving 63.4 average accuracy across the extended benchmark suite.</p>
</li>
</ol>
<h3 id="training-loss-patterns">Training loss patterns</h3>
<p>DC-6 (all sources) achieves the lowest training loss among SlimPajama configurations, consistent with the downstream results. DC-4 (no web crawl) shows the highest training loss, confirming that the large, diverse web crawl data is the most important single component.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The central finding is that <strong>diversity matters most after deduplication</strong>. When cross-source redundancy is removed, each additional source contributes genuinely new signal. Without global deduplication, adding sources may just increase redundancy without proportional benefit.</p>
<p><strong>Limitations:</strong></p>
<ul>
<li>Only seven fixed configurations are tested. No systematic search over continuous mixture proportions (contrast with <a href="/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/">DoReMi</a> or <a href="/notes/natural-language-processing/language-models/data-mixing-laws-pretraining/">Data Mixing Laws</a>).</li>
<li>The configurations are not independent: DC-6 includes all sources from DC-1 through DC-5, making it difficult to isolate the contribution of any single addition.</li>
<li>Only 1.3B and 7B scales tested. Whether the diversity benefit continues scaling is unverified.</li>
<li>English-only. Cross-lingual diversity effects are not studied.</li>
<li>The paper is a technical report without formal peer review.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> All 1.3B models and datasets are publicly released under MIT license on HuggingFace.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>SlimPajama</td>
          <td>627B tokens</td>
          <td>Globally deduplicated from 1.2T RedPajama</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>RefinedWeb</td>
          <td>600B tokens</td>
          <td>External CC-only baseline</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HF Leaderboard (ARC, HellaSwag, MMLU, TruthfulQA)</td>
          <td>Standard</td>
          <td>4 benchmarks</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Extended suite</td>
          <td>12 additional benchmarks</td>
          <td>Zero and few-shot</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>1.3B parameter Cerebras-GPT architecture with ALiBi positional encoding and SwiGLU activation. All configurations trained on 330B tokens. 7B model trained with large batch-size (LBS) strategy on Cerebras 16x CS-2 cluster (80 PFLOP/s in bf16).</p>
<h3 id="hardware">Hardware</h3>
<p>Cerebras 16x CS-2 cluster, 80 PFLOP/s in bf16 mixed precision.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/MBZUAI-LLM/SlimPajama-DC">SlimPajama-DC Models</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>All 1.3B DC configurations (select via revision)</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/MBZUAI-LLM/SlimPajama-627B-DC">SlimPajama-627B-DC Dataset</a></td>
          <td>Dataset</td>
          <td>-</td>
          <td>Source-split version of SlimPajama-627B</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{shen2023slimpajamadc,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SlimPajama-DC: Understanding Data Combinations for LLM Training}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Shen, Zhiqiang and Tao, Tianhua and Ma, Liqun and Neiswanger, Willie and Liu, Zhengzhong and Wang, Hongyi and Tan, Bowen and Hestness, Joel and Vassilieva, Natalia and Soboleva, Daria and Xing, Eric}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2309.10818}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Scaling Data-Constrained Language Models</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/scaling-data-constrained-language-models/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/scaling-data-constrained-language-models/</guid><description>Muennighoff et al. extend Chinchilla scaling laws to repeated data, finding up to 4 epochs cause negligible loss and 16 epochs mark diminishing returns.</description><content:encoded><![CDATA[<h2 id="an-empirical-study-of-scaling-under-data-constraints">An empirical study of scaling under data constraints</h2>
<p>This is a <strong>discovery paper</strong> that systematically investigates what happens when language models are trained for multiple epochs on repeated data. It extends the Chinchilla scaling laws to the data-constrained regime by proposing a new scaling formula that accounts for the diminishing value of repeated tokens, validated across 400+ training runs ranging from 10M to 9B parameters and up to 1500 epochs.</p>
<h2 id="running-out-of-unique-training-data">Running out of unique training data</h2>
<p>The Chinchilla scaling laws assume unlimited unique data: for a given compute budget, there exists an optimal balance of model parameters and training tokens. But extrapolating these laws to larger models implies data requirements that exceed what is available. Villalobos et al. estimated that high-quality English text would be exhausted by 2024 under Chinchilla-optimal scaling. Most prior large language models trained for a single epoch, and some work explicitly warned against data reuse. The Galactica models (trained for 4.25 epochs) showed that multi-epoch training could work, but no systematic study had quantified the tradeoff between repeated data and fresh data, or how to allocate compute optimally when data is finite.</p>
<h2 id="effective-data-with-exponential-decay-for-repetition">Effective data with exponential decay for repetition</h2>
<p>The paper generalizes the Chinchilla scaling law by replacing raw token count $D$ with an effective data term $D'$ that accounts for the diminishing value of repeated tokens:</p>
<p>$$
L(N, D) = \frac{A}{N'^{\alpha}} + \frac{B}{D'^{\beta}} + E
$$</p>
<p>where the effective data is:</p>
<p>$$
D' = U_{D} + U_{D} R_{D}^{*} \left(1 - e^{-R_{D}/R_{D}^{*}}\right)
$$</p>
<p>Here $U_{D}$ is the number of unique tokens, $R_{D}$ is the number of repetitions (epochs minus 1), and $R_{D}^{*}$ is a learned constant representing the &ldquo;half-life&rdquo; of data repetition. When $R_{D} = 0$ (single epoch), $D' = U_{D} = D$ and the formula reduces to standard Chinchilla. When $R_{D} \ll R_{D}^{*}$, repeated data is worth almost the same as fresh data. As $R_{D}$ grows large, the value of repeated tokens decays to zero, and $D'$ saturates at $U_{D}(1 + R_{D}^{*})$, meaning no amount of repetition can substitute for more than $R_{D}^{*}$ epochs&rsquo; worth of fresh data.</p>
<p>A symmetric formula handles excess parameters:</p>
<p>$$
N' = U_{N} + U_{N} R_{N}^{*} \left(1 - e^{-R_{N}/R_{N}^{*}}\right)
$$</p>
<p>where $U_{N}$ is the compute-optimal parameter count for $U_{D}$ unique tokens and $R_{N}$ measures how much the model exceeds that count. The fitted values are $R_{D}^{*} \approx 15.0$ (data repetition half-life at ~16 epochs) and $R_{N}^{*} \approx 5.3$ (excess parameters decay faster than repeated data).</p>
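<p>The effective-data formula is simple enough to evaluate directly. A small sketch using the fitted constant (the unique-token budget is illustrative):</p>
<pre><code class="language-python">import math

R_D_STAR = 15.0   # fitted repetition constant (repetitions = epochs - 1)

def effective_data(U_D, epochs):
    """Effective tokens D' when U_D unique tokens are seen for `epochs` passes."""
    R_D = epochs - 1
    return U_D + U_D * R_D_STAR * (1 - math.exp(-R_D / R_D_STAR))

U = 25e9  # 25B unique tokens
for epochs in (1, 4, 16, 100):
    d_eff = effective_data(U, epochs)
    print(f"{epochs:3d} epochs: D' = {d_eff / 1e9:6.1f}B "
          f"({d_eff / (U * epochs):.0%} of the raw token count)")
</code></pre>
<p>Evaluated this way, 4 epochs keep repeated tokens at roughly 93% of the value of fresh ones, while at 100 epochs the raw count overstates the effective data by more than a factor of six, matching the 4-epoch safe zone and 16-epoch half-life discussed below.</p>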
<h2 id="experiments-across-400-models">Experiments across 400+ models</h2>
<p><strong>Scale.</strong> Models from 10M to 9B parameters, trained for up to 1500 epochs. Three experimental protocols: fixed unique data (100M, 400M, 1.5B tokens), fixed FLOPs, and parametric fitting across all runs. Training on C4 (English web text) with GPT-2 architecture decoder-only transformers.</p>
<h3 id="resource-allocation-epochs-scale-faster-than-parameters">Resource allocation: epochs scale faster than parameters</h3>
<p>With fixed unique data, results show that more than 50% loss reduction is possible by training beyond one epoch and increasing model size beyond the single-epoch optimum. The data-constrained efficient frontier recommends allocating most additional compute to more epochs rather than more parameters, because excess parameters decay faster ($R_{N}^{*} &lt; R_{D}^{*}$). This contrasts with Chinchilla, which recommends scaling both equally.</p>
<p>A concrete validation: training the data-constrained compute-optimal model for $9.3 \times 10^{21}$ FLOPs with 25B unique tokens, the recommended allocation (27% fewer parameters, more epochs) achieves better loss and downstream performance than the Chinchilla-optimal allocation.</p>
<h3 id="resource-return-the-4-epoch-safe-zone-and-16-epoch-half-life">Resource return: the 4-epoch safe zone and 16-epoch half-life</h3>
<table>
  <thead>
      <tr>
          <th>Epochs</th>
          <th>Loss impact</th>
          <th>Downstream impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 (baseline)</td>
          <td>Optimal</td>
          <td>Optimal</td>
      </tr>
      <tr>
          <td>Up to 4</td>
          <td>Negligible (+0.5% loss)</td>
          <td>No significant difference</td>
      </tr>
      <tr>
          <td>~16 ($R_{D}^{*}$)</td>
          <td>Diminishing returns begin sharply</td>
          <td>Measurable degradation</td>
      </tr>
      <tr>
          <td>Beyond 16</td>
          <td>Returns decay to near zero</td>
          <td>Significant degradation</td>
      </tr>
      <tr>
          <td>Extreme (44+)</td>
          <td>Training can diverge</td>
          <td>Failure</td>
      </tr>
  </tbody>
</table>
<p>The 8.7B parameter model trained for 4 epochs ($D_{C} = 44$B unique tokens) finishes with only 0.5% higher validation loss than the single-epoch model ($D_{C} = 178$B unique tokens). At the half-life of $R_{D}^{*} \approx 15$ repetitions (~16 epochs), repeated tokens have retained on average $1 - 1/e \approx 63\%$ of the value of fresh tokens, and the marginal value of one further repetition has decayed to $1/e \approx 37\%$.</p>
<h3 id="complementary-strategies-code-augmentation-and-filtering">Complementary strategies: code augmentation and filtering</h3>
<p>When data is limited, two strategies can extend the effective dataset:</p>
<p><strong>Code augmentation.</strong> Mixing Python code from The Stack with natural language data. Up to 50% code (42B tokens) shows no degradation on natural language benchmarks, effectively providing a 2x increase in useful training data. Some tasks (WebNLG generation, bAbI reasoning) actually improve with code, possibly because code trains long-range state-tracking capabilities.</p>
<p><strong>Filtering relaxation.</strong> Perplexity filtering (keeping the 25% lowest-perplexity samples) is effective on noisy datasets, but deduplication filtering does not improve downstream performance (though it may reduce memorization). The recommendation: reserve aggressive filtering for noisy data sources; for clean datasets, more data through reduced filtering is better than less data through strict filtering.</p>
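<p>The perplexity-filtering step reduces to a quantile cut over precomputed scores. A minimal sketch (documents and scores below are hypothetical; in practice the perplexities come from a reference language model):</p>
<pre><code class="language-python">import numpy as np

def perplexity_filter(docs, ppl, keep_frac=0.25):
    """Keep the keep_frac lowest-perplexity documents."""
    ppl = np.asarray(ppl, dtype=float)
    cutoff = np.quantile(ppl, keep_frac)
    return [d for d, p in zip(docs, ppl) if p &lt;= cutoff]

docs = ["doc a", "doc b", "doc c", "doc d"]
ppl = [31.2, 210.5, 48.9, 95.0]        # hypothetical reference-LM perplexities
print(perplexity_filter(docs, ppl))    # -> ['doc a']
</code></pre>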
<p><strong>Combined strategy</strong>: doubling available data with code and then repeating for 4 epochs yields 8x more training tokens with performance expected to match 8x more unique data.</p>
<h2 id="key-findings-and-limitations">Key findings and limitations</h2>
<p><strong>Key findings:</strong></p>
<ul>
<li>Multi-epoch training is beneficial, not harmful, up to moderate repetition counts.</li>
<li>The data-constrained scaling law accurately predicts loss under repetition using an exponential decay formulation.</li>
<li>Compute should be allocated to epochs faster than parameters when data is constrained.</li>
<li>Code augmentation and selective filtering extend effective data without quality degradation.</li>
</ul>
<p><strong>Limitations:</strong></p>
<ul>
<li>All experiments use the GPT-2 transformer architecture; applicability to other architectures or modalities is untested.</li>
<li>Only the entire dataset is repeated uniformly. Selectively repeating subsets (e.g., high-value data for more epochs) is not modeled.</li>
<li>Hyperparameter sensitivity (learning rate, dropout) to epoch count is unexplored. Higher learning rates may cause earlier onset of diminishing returns.</li>
<li>Focused on English text. Cross-lingual augmentation effects are not studied.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> Code, models, datasets, and hyperparameters are all publicly released under Apache 2.0.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>C4 (English)</td>
          <td>Varies by experiment</td>
          <td>Fixed unique data: 100M, 400M, 1.5B tokens</td>
      </tr>
      <tr>
          <td>Code augmentation</td>
          <td>The Stack (Python)</td>
          <td>Up to 42B tokens</td>
          <td>Mixed with natural language</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>19 NL tasks</td>
          <td>Standard splits</td>
          <td>Zero to five-shot, 114 scores per model</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Data-constrained scaling law: $D' = U_{D} + U_{D} R_{D}^{*}(1 - e^{-R_{D}/R_{D}^{*}})$ with $R_{D}^{*} \approx 15.0$, $R_{N}^{*} \approx 5.3$. Fitted using the methodology of Hoffmann et al. (2022) adapted for the repetition terms, with 400+ training runs used for the fit.</p>
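<p>A minimal Python sketch of the fitted law (the function name is illustrative, not from the released codebase; the constants are the paper's fitted values):</p>
<pre><code class="language-python">import numpy as np

R_D_STAR = 15.0   # fitted decay constant for repeated data

def effective_data(U_D, R_D, r_star=R_D_STAR):
    """Effective unique data D' for U_D unique tokens repeated R_D times."""
    return U_D + U_D * r_star * (1.0 - np.exp(-R_D / r_star))

# Saturation: D' can never exceed U_D * (1 + R_D*), i.e. about 16x the
# unique data, which matches the ~16-epoch knee in the table above.
print(effective_data(44e9, 3))     # 4 epochs  -&gt; ~1.64e11 effective tokens
print(effective_data(44e9, 100))   # extreme   -&gt; ~7.03e11, near the ceiling
</code></pre>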
<h3 id="models">Models</h3>
<p>GPT-2 architecture decoder-only transformers with GPT-2 tokenizer. Sizes: 10M to 8.7B parameters. Cosine learning rate schedule (max 2e-4, decay to 2e-5), Adam optimizer ($\beta_2 = 0.999$), dropout 0.1, weight decay 0.1, gradient clipping at 1.0. bfloat16 precision. Trained using Megatron-DeepSpeed.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Data-Constrained Optimal</th>
          <th>Chinchilla Optimal</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validation loss (9.3e21 FLOPs, 25B unique)</td>
          <td>Lower</td>
          <td>Higher</td>
          <td>27% fewer parameters</td>
      </tr>
      <tr>
          <td>Downstream (4 epochs vs 1)</td>
          <td>No significant difference</td>
          <td>Baseline</td>
          <td>8.7B params, 44B unique tokens</td>
      </tr>
      <tr>
          <td>Code augmentation (50% code)</td>
          <td>No NL degradation</td>
          <td>Baseline</td>
          <td>Some tasks improve</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Trained on the LUMI supercomputer (Finland) using AMD Instinct MI250X GPUs with data, tensor, and pipeline parallelism. Up to 256 GPUs (64 nodes) per run, with up to 2,200 nodes (~8,800 GPUs) used in parallel across all concurrent runs. Total compute: approximately 3 million GPU hours. The cluster runs on 100% renewable hydroelectric energy.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/huggingface/datablations">datablations</a></td>
          <td>Code + Models + Data</td>
          <td>Apache 2.0</td>
          <td>All 400+ models, datasets, and training code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/TurkuNLP/Megatron-DeepSpeed">Megatron-DeepSpeed fork</a></td>
          <td>Code</td>
          <td>-</td>
          <td>Training framework adapted for AMD ROCm</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{muennighoff2023scaling,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Scaling Data-Constrained Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Muennighoff, Niklas and Rush, Alexander M. and Barak, Boaz and Le Scao, Teven and Piktus, Aleksandra and Tazi, Nouamane and Pyysalo, Sampo and Wolf, Thomas and Raffel, Colin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{36}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DoReMi: Optimizing Data Mixtures for LM Pretraining</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/</guid><description>DoReMi uses a small proxy model with distributionally robust optimization to learn domain weights that speed up large-scale language model pretraining by 2.6x.</description><content:encoded><![CDATA[<h2 id="a-method-for-automatic-domain-reweighting">A method for automatic domain reweighting</h2>
<p>This is a <strong>method paper</strong> that introduces Domain Reweighting with Minimax Optimization (DoReMi), an algorithm for automatically tuning the mixture proportions of pretraining data domains. Rather than relying on heuristics or expensive downstream-task-based tuning, DoReMi uses a small proxy model trained with <a href="https://en.wikipedia.org/wiki/Robust_optimization">group distributionally robust optimization (Group DRO)</a> to produce domain weights that transfer to much larger models.</p>
<h2 id="why-data-mixture-proportions-matter">Why data mixture proportions matter</h2>
<p>Language model pretraining datasets combine text from many domains: web crawls, Wikipedia, books, code, academic papers, and others. The mixture proportions (how much of each domain to include) significantly affect downstream performance, but existing approaches either set them by hand (<a href="https://en.wikipedia.org/wiki/The_Pile_(dataset)">The Pile</a> uses heuristic weights) or tune them against downstream tasks (GLaM/PaLM), which is expensive and risks overfitting to a specific evaluation set. No principled, task-agnostic method existed for determining mixture proportions.</p>
<h2 id="minimax-optimization-over-domain-excess-loss">Minimax optimization over domain excess loss</h2>
<p>DoReMi&rsquo;s core insight is to frame data mixture optimization as a minimax problem: find domain weights that minimize the worst-case excess loss across all domains. The algorithm has three steps.</p>
<p><strong>Step 1</strong>: Train a small reference model (280M parameters) on some default domain weights $\alpha_{\text{ref}}$ (e.g., proportional to raw token count).</p>
<p><strong>Step 2</strong>: Train a small proxy model $p_{\theta}$ using Group DRO, which solves the minimax objective:</p>
<p>$$
\min_{\theta} \max_{\alpha \in \Delta^{k}} \sum_{i=1}^{k} \alpha_{i} \cdot \frac{1}{\sum_{x \in D_{i}} |x|} \sum_{x \in D_{i}} \left[ \ell_{\theta}(x) - \ell_{\text{ref}}(x) \right]
$$</p>
<p>where $\ell_{\theta}(x) = -\log p_{\theta}(x)$ and $\ell_{\text{ref}}(x) = -\log p_{\text{ref}}(x)$. The excess loss $\ell_{\theta}(x) - \ell_{\text{ref}}(x)$ measures how much headroom the proxy has to improve on each example relative to the reference. The inner maximization upweights domains with high excess loss via exponentiated gradient ascent, while the outer minimization trains the proxy on those upweighted domains.</p>
<p>At each training step, the domain weights update as:</p>
<p>$$
\alpha_{t}' \leftarrow \alpha_{t-1} \exp(\eta \lambda_{t})
$$</p>
<p>where $\lambda_{t}[i]$ is the per-domain excess loss (clipped at zero), followed by renormalization and smoothing with a uniform component: $\alpha_{t} \leftarrow (1-c)\frac{\alpha_{t}'}{\sum_{i} \alpha_{t}'[i]} + cu$, with $c = 10^{-3}$.</p>
<p>The final domain weights are the average over all training steps: $\bar{\alpha} = \frac{1}{T}\sum_{t=1}^{T} \alpha_{t}$.</p>
<p><strong>Step 3</strong>: Resample data according to $\bar{\alpha}$ and train the full-scale model using standard procedures.</p>
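<p>A minimal numpy sketch of the Step 2 weight-update loop (assuming the clipped per-domain excess losses $\lambda_{t}$ have already been computed at every proxy-training step; the function and variable names are illustrative, not the authors' code):</p>
<pre><code class="language-python">import numpy as np

def doremi_weights(excess_losses, eta=1.0, c=1e-3):
    """Exponentiated-gradient loop over per-step domain excess losses.

    excess_losses: array of shape (T, k) holding the clipped per-domain
    excess loss lambda_t[i] at each proxy-training step t.
    Returns alpha_bar, the average of the per-step domain weights.
    """
    T, k = excess_losses.shape
    alpha = np.full(k, 1.0 / k)                       # uniform start
    history = np.empty((T, k))
    for t in range(T):
        alpha = alpha * np.exp(eta * excess_losses[t])    # upweight hard domains
        alpha = (1.0 - c) * alpha / alpha.sum() + c / k   # renormalize + smooth
        history[t] = alpha
    return history.mean(axis=0)                       # average over all steps
</code></pre>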
<p><strong>Iterated DoReMi</strong> extends this by running multiple rounds, using the previous round&rsquo;s optimized weights as the next round&rsquo;s reference weights. This converges within 3 rounds on the GLaM dataset.</p>
<h2 id="experiments-across-the-pile-and-glam-datasets">Experiments across The Pile and GLaM datasets</h2>
<p><strong>Datasets.</strong> The Pile (22 domains, 800GB) and the GLaM dataset (8 domains, also used for PaLM). On The Pile, baseline weights come from the dataset defaults. On GLaM, baseline weights are uniform, with downstream-tuned oracle weights available for comparison.</p>
<p><strong>Setup.</strong> Transformer decoder-only LMs trained with next-token prediction. All models use batch size 512 and sequence length 1024. Proxy and reference models are 280M parameters. Main models are 8B parameters (30x larger). Training runs: 200K steps (Pile) or 300K steps (GLaM). The domain weight optimization cost (training two 280M models) is 8% of the compute for the 8B main model.</p>
<p><strong>Evaluation.</strong> Per-domain held-out perplexity and one-shot generative accuracy on five tasks: TriviaQA, NaturalQuestions, WebQuestions, SQuADv2, and LAMBADA.</p>
<h3 id="key-domain-weight-shifts">Key domain weight shifts</h3>
<p>On The Pile, DoReMi (280M) dramatically upweights diverse web text (Pile-CC: 0.112 to 0.606) while downweighting specialized domains like ArXiv (0.105 to 0.004), PubMed Central (0.107 to 0.005), and StackExchange (0.093 to 0.015). Smaller, underrepresented domains like YouTubeSubtitles and PhilPapers receive proportionally large increases.</p>
<h3 id="scaling-behavior">Scaling behavior</h3>
<p>DoReMi was tested with matched proxy/main model sizes (280M through 1B) and with varying proxy sizes (70M through 1B) feeding into an 8B main model.</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Speedup to baseline accuracy</th>
          <th>Downstream improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DoReMi (280M to 280M)</td>
          <td>4x</td>
          <td>+2% avg accuracy</td>
      </tr>
      <tr>
          <td>DoReMi (280M to 8B)</td>
          <td>2.6x</td>
          <td>+6.5% avg accuracy</td>
      </tr>
      <tr>
          <td>DoReMi (150M to 8B)</td>
          <td>~2x</td>
          <td>Significant</td>
      </tr>
      <tr>
          <td>DoReMi (1B to 8B)</td>
          <td>~2x</td>
          <td>Significant</td>
      </tr>
  </tbody>
</table>
<p>Improvements are consistent across all tested model scales (280M to 1B matched), with no sign of diminishing returns at larger sizes.</p>
<h2 id="perplexity-improves-everywhere-even-on-downweighted-domains">Perplexity improves everywhere, even on downweighted domains</h2>
<p>The most striking finding is that DoReMi improves perplexity on all 22 domains in The Pile, including domains it downweights. The proposed explanation: the lowest-entropy domains need few samples to learn (they&rsquo;re statistically simple), while the highest-entropy domains have token distributions close to the uniform initialization and also need fewer samples. Reallocating weight to medium-entropy domains generates positive transfer that lifts all domains.</p>
<p>On The Pile, DoReMi reaches the baseline&rsquo;s downstream accuracy in 75K steps versus 200K for the baseline (2.6x speedup) and achieves a 6.5% absolute improvement in average one-shot accuracy at 200K steps.</p>
<p>On the GLaM dataset, iterated DoReMi (round 2) matches the performance of domain weights that were tuned directly on downstream task performance, despite having no knowledge of downstream tasks. Domain weights converge within 3 iterations.</p>
<h3 id="ablations">Ablations</h3>
<p>Using only the proxy model&rsquo;s loss (prefer hardest domains) or only the negative reference loss (prefer easiest domains) both underperform the full excess loss formulation. Both components are necessary: the excess loss identifies domains where the proxy has room to improve relative to what is learnable.</p>
<p>The proxy model itself typically underperforms the main model trained on its weights, and this gap grows at larger proxy scales. A 1B proxy model underperforms the 1B baseline, yet its domain weights still improve 1B main model training by over 2x. This suggests the domain weight signal is robust even when the proxy model itself is not well-trained.</p>
<h3 id="limitations">Limitations</h3>
<p>The domain weight landscape may have multiple local optima: a 280M proxy puts most weight on Pile-CC, while a 1B proxy favors OpenWebText2 instead. Both configurations improve over baseline, but the optimal weights are not unique.</p>
<p>The granularity of &ldquo;domains&rdquo; matters. DoReMi works better with more domains (22 on The Pile versus 8 on GLaM). Domains are defined by data provenance, which is coarse-grained. Fine-grained domain definitions (e.g., via clustering) could improve results but also risk DRO putting all weight on a small set of worst-case examples.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>The Pile</td>
          <td>800 GB, 22 domains</td>
          <td>Default heuristic weights as baseline</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>GLaM dataset</td>
          <td>8 domains</td>
          <td>Uniform weights as baseline; downstream-tuned oracle available</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TriviaQA, NaturalQuestions, WebQuestions, SQuADv2, LAMBADA</td>
          <td>Standard splits</td>
          <td>One-shot generative evaluation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Group DRO with exponentiated gradient ascent for domain weight updates. Step size $\eta = 1$, smoothing $c = 10^{-3}$. Per-token excess loss clipped at zero. Domain weights averaged over all training steps. Iterated DoReMi converges when $|\bar{\alpha} - \alpha_{\text{ref}}|_{\infty} &lt; 10^{-3}$.</p>
<h3 id="models">Models</h3>
<p>Vanilla Transformer decoder-only models with 256K vocabulary. Sizes: 70M (3 layers), 150M (6 layers), 280M (12 layers), 510M (12 layers), 760M (12 layers), 1B (16 layers), 8B (32 layers). All use 64-dim attention heads except 8B (128-dim).</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>DoReMi (280M to 8B)</th>
          <th>Baseline (8B)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Avg one-shot accuracy</td>
          <td>+6.5% over baseline</td>
          <td>Reference</td>
          <td>5 generative tasks</td>
      </tr>
      <tr>
          <td>Worst-case log-perplexity</td>
          <td>1.46</td>
          <td>1.71</td>
          <td>Across 22 Pile domains</td>
      </tr>
      <tr>
          <td>Avg log-perplexity</td>
          <td>1.40</td>
          <td>1.64</td>
          <td>Across 22 Pile domains</td>
      </tr>
      <tr>
          <td>Domains beating baseline</td>
          <td>22/22</td>
          <td>0/22</td>
          <td>Per-domain perplexity</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Proxy and reference models (under 1B) trained on TPUv3. Models at 1B and 8B trained on TPUv4. Domain weight optimization (two 280M runs) costs 8% of 8B training FLOPs.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{xie2023doremi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xie, Sang Michael and Pham, Hieu and Dong, Xuanyi and Du, Nan and Liu, Hanxiao and Lu, Yifeng and Liang, Percy and Le, Quoc V. and Ma, Tengyu and Yu, Adams Wei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{36}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Data Mixing Laws for LM Pretraining Optimization</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/data-mixing-laws-pretraining/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/data-mixing-laws-pretraining/</guid><description>Ye et al. discover that LM loss follows an exponential law over domain mixture proportions, enabling cheap prediction and optimization of data mixtures.</description><content:encoded><![CDATA[<h2 id="an-empirical-discovery-of-predictable-mixture-loss-relationships">An empirical discovery of predictable mixture-loss relationships</h2>
<p>This is a <strong>discovery paper</strong> that identifies a quantitative, functional relationship between pretraining data mixture proportions and language model loss. The key finding is that domain-specific validation loss follows an exponential law over the linear combination of training domain proportions, and this law composes with standard scaling laws to enable cheap prediction of large-model performance under arbitrary mixtures.</p>
<h2 id="the-missing-quantitative-link-between-data-mixtures-and-performance">The missing quantitative link between data mixtures and performance</h2>
<p>Pretraining data for large language models combines text from many domains (web, code, academic, books, etc.), and mixture proportions significantly affect model quality. Existing approaches either set proportions by hand without disclosed criteria (LLaMA, Baichuan) or use algorithmic methods like <a href="/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/">DoReMi</a> that optimize qualitatively but cannot predict the quantitative effect of a specific mixture before training. Scaling laws exist for model size and data quantity, but no equivalent existed for mixture proportions. This paper fills that gap.</p>
<h2 id="the-exponential-data-mixing-law">The exponential data mixing law</h2>
<p>The core finding: for a model of fixed size trained for a fixed number of steps, the validation loss on domain $i$ as a function of the training mixture proportions $r_{1 \dots M}$ follows:</p>
<p>$$
L_{i}(r_{1 \dots M}) = c_{i} + k_{i} \exp\left(\sum_{j=1}^{M} t_{ij} r_{j}\right)
$$</p>
<p>where $c_{i}$, $k_{i}$, and $t_{ij}$ are fitted parameters. The constant $c_{i}$ represents the irreducible loss (not affected by mixture changes). The interaction coefficients $t_{ij}$ capture how training domain $j$ affects validation loss on domain $i$: negative $t_{ij}$ means domain $j$ helps domain $i$, positive means it hurts.</p>
<p>This was discovered progressively:</p>
<ol>
<li><strong>Two domains</strong>: Log-reducible-loss is linear in domain proportion (univariate exponential).</li>
<li><strong>Three domains</strong>: The exponential generalizes to a linear combination over all domain proportions (Eq. above), outperforming alternatives with comparable parameter count.</li>
<li><strong>General validation</strong>: For a validation set composed of $K$ domains with proportions $s_{1 \dots K}$, the overall loss is:</li>
</ol>
<p>$$
L(r_{1 \dots M}) = \sum_{i=1}^{K} s_{i} \left[ c_{i} + k_{i} \exp\left(\sum_{j=1}^{M} t_{ij} r_{j}\right) \right]
$$</p>
<p>When the validation set composition is unknown, implicit domain aggregation treats $s_{i}$ as learnable parameters. Setting the number of implicit domains larger than the true number works well and is robust to overestimation.</p>
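<p>The paper reports fitting these laws with an AdaBoost regressor elsewhere in its pipeline; as a generic stand-in, a scipy least-squares fit of the per-domain law might look like this (a sketch, with illustrative names):</p>
<pre><code class="language-python">import numpy as np
from scipy.optimize import curve_fit

def mixing_law(r, c, k, *t):
    """L_i(r) = c_i + k_i * exp(t_i . r); r has shape (n_mixtures, M)."""
    return c + k * np.exp(r @ np.asarray(t))

def fit_domain_law(r_samples, losses):
    """Fit (c_i, k_i, t_i1..t_iM) for one validation domain."""
    M = r_samples.shape[1]
    p0 = [losses.min(), 0.5] + [0.0] * M   # start near the best observed loss
    params, _ = curve_fit(mixing_law, r_samples, losses, p0=p0, maxfev=20000)
    return params
</code></pre>
<p>A negative fitted $t_{ij}$ then reads directly as "training domain $j$ helps validation domain $i$".</p>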
<h3 id="domain-interaction-patterns">Domain interaction patterns</h3>
<p>Visualizing the fitted $t_{ij}$ coefficients across 5 coarse Pile domains reveals three relationship types: most domain pairs are <strong>unrelated</strong> (sparse interaction matrix where each domain&rsquo;s loss is dominated by its own training proportion), some show <strong>facilitation</strong> (e.g., dialogue data helps internet text), and some show <strong>conflict</strong> (e.g., symbolic data hurts prose). This sparsity explains why the law can be fitted with fewer samples than the quadratic parameter count would suggest.</p>
<h2 id="nested-scaling-pipeline-for-cheap-prediction">Nested scaling pipeline for cheap prediction</h2>
<p>Fitting data mixing laws directly at target scale is too expensive (requires many full training runs at different mixtures). The paper proposes nesting three scaling laws:</p>
<p><strong>Step 1</strong>: For each mixture $r_{i}$ and each small model size $N_{j}$, train for $S_{0}$ steps. Fit a <a href="https://en.wikipedia.org/wiki/Power_law">power law</a> $L(S) = E_{1} + B/S^{\beta}$ over steps to extrapolate to the target step count $S_{\text{target}}$.</p>
<p><strong>Step 2</strong>: With the step-extrapolated losses for each mixture, fit a power law $L(N) = E_{2} + A/N^{\alpha}$ over model sizes to extrapolate to the target model size $N_{\text{target}}$.</p>
<p><strong>Step 3</strong>: With the predicted losses at $(N_{\text{target}}, S_{\text{target}})$ for all sampled mixtures, fit the data mixing law and search for the optimal mixture.</p>
<p>This pipeline requires only training small models (70M to 410M) for short runs (30B tokens) to predict performance of a 1B model trained for 100B tokens.</p>
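<p>The two power-law extrapolations in Steps 1 and 2 are ordinary curve fits; a sketch with made-up illustrative step counts and losses:</p>
<pre><code class="language-python">import numpy as np
from scipy.optimize import curve_fit

def power_law(x, E, A, alpha):
    """Saturating power law used for both step and model-size scaling."""
    return E + A / x ** alpha

# Step 1: extrapolate one (mixture, model size) run over training steps.
steps = np.array([2e3, 5e3, 1e4, 2e4, 3e4])   # illustrative values only
losses = np.array([4.3, 3.9, 3.6, 3.4, 3.3])
(E1, B, beta), _ = curve_fit(power_law, steps, losses,
                             p0=[3.0, 100.0, 0.5], maxfev=20000)
loss_at_target = power_law(1e5, E1, B, beta)  # predicted loss at S_target

# Step 2 repeats the same fit across model sizes N; Step 3 feeds the
# predicted (mixture, loss) pairs into the mixing-law fit above.
</code></pre>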
<h3 id="mixture-sampling-strategy">Mixture sampling strategy</h3>
<p>To get informative samples efficiently, the paper uses double-diminishing proportions: for each domain, enumerate proportions by halving from the maximum available. This distributes losses evenly across the exponential law&rsquo;s range. From 40 candidate mixtures trained at the smallest scale (70M), 20 are selected based on which subset minimizes data mixing law fitting error.</p>
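<p>One plausible reading of the halving rule, with hypothetical numbers (each full candidate mixture combines one value per domain and renormalizes to sum to 1):</p>
<pre><code class="language-python">def halving_proportions(max_prop, levels=4):
    """Double-diminishing candidates for one domain: p, p/2, p/4, ..."""
    return [max_prop / 2 ** i for i in range(levels)]

halving_proportions(0.6)   # [0.6, 0.3, 0.15, 0.075]
</code></pre>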
<h2 id="experiments-on-redpajama-and-continual-pretraining">Experiments on RedPajama and continual pretraining</h2>
<p><strong>Main experiment.</strong> Models trained on RedPajama, validated on the Pile (mimicking the common scenario where validation data comes from a different distribution than training). Small models: 70M, 160M, 305M, 410M trained for 30B tokens. Target: 1B model for 100B tokens.</p>
<p>The optimized mixture dramatically redistributes weight compared to RedPajama defaults:</p>
<table>
  <thead>
      <tr>
          <th>Domain</th>
          <th>Default</th>
          <th>Optimized</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CommonCrawl</td>
          <td>0.670</td>
          <td>0.125</td>
      </tr>
      <tr>
          <td>C4</td>
          <td>0.150</td>
          <td>0.250</td>
      </tr>
      <tr>
          <td>GitHub</td>
          <td>0.045</td>
          <td>0.141</td>
      </tr>
      <tr>
          <td>ArXiv</td>
          <td>0.045</td>
          <td>0.250</td>
      </tr>
      <tr>
          <td>Books</td>
          <td>0.045</td>
          <td>0.094</td>
      </tr>
      <tr>
          <td>StackExchange</td>
          <td>0.025</td>
          <td>0.125</td>
      </tr>
      <tr>
          <td>Wikipedia</td>
          <td>0.020</td>
          <td>0.016</td>
      </tr>
  </tbody>
</table>
<p>The optimized mixture reaches the default mixture&rsquo;s final performance in 73% of the training steps and eventually achieves performance equivalent to 48% more training on the default mixture.</p>
<p><strong>Comparison to DoReMi and DoGE.</strong> Data mixing laws outperform both: the predicted-optimal mixture achieves lower validation loss than DoReMi and DoGE (both universal and OOD settings) for 1B models trained for 100B tokens on RedPajama.</p>
<p><strong>Continual pretraining.</strong> The law extends to continual pretraining (Pythia-70M on Pile + Python code). It accurately predicts the critical mixture proportion that avoids <a href="https://en.wikipedia.org/wiki/Catastrophic_interference">catastrophic forgetting</a> on the original domain while improving the target domain. This suggests data mixing laws could guide dynamic data schedules across multi-stage pretraining.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The data mixing law provides a predictive framework rather than just an optimization algorithm. Key implications:</p>
<ul>
<li>The interaction coefficients $t_{ij}$ make domain relationships quantitatively observable before full-scale training, identifying facilitation and conflict pairs.</li>
<li>The nested pipeline&rsquo;s cost is dominated by the small-model training runs (40 mixtures at 70M scale), which is orders of magnitude cheaper than even a single target-scale run.</li>
<li>The continual pretraining application opens the door to optimizing dynamic data schedules, where mixture proportions change across training stages.</li>
</ul>
<p><strong>Limitations</strong>: The &ldquo;domain&rdquo; concept remains loosely defined (provenance-based). The nested scaling laws introduce compounding errors at each step, and predictions tend to slightly underestimate actual loss. The number of required fitting samples, while subquadratic in practice due to sparsity, still scales with the number of domains. No theoretical justification for the exponential form is provided; it is a purely empirical finding.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (pilot)</td>
          <td>The Pile (GitHub, Pile-CC, Books3)</td>
          <td>30B tokens</td>
          <td>2-domain and 3-domain experiments</td>
      </tr>
      <tr>
          <td>Training (main)</td>
          <td>RedPajama</td>
          <td>100B tokens</td>
          <td>7 domains</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>The Pile validation set</td>
          <td>Standard split</td>
          <td>Out-of-distribution relative to RedPajama</td>
      </tr>
      <tr>
          <td>Continual pretraining</td>
          <td>Pile + Python code</td>
          <td>10B tokens</td>
          <td>Pythia-70M base model</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Data mixing law: $L_{i}(r_{1 \dots M}) = c_{i} + k_{i} \exp(\sum_{j} t_{ij} r_{j})$. Fitted via AdaBoost Regressor on sampled mixtures. Step scaling law: $L(S) = E_{1} + B/S^{\beta}$. Model size scaling law: $L(N) = E_{2} + A/N^{\alpha}$. Both fitted via Huber loss minimization with LBFGS, in decomposed Chinchilla style (separate fits rather than a joint law, for stability). 40 candidate mixtures sampled via double-diminishing proportions, 20 selected for the final pipeline.</p>
<h3 id="models">Models</h3>
<p>Transformer decoder-only LMs. Pilot: 70M, 160M. Main pipeline: 70M, 160M, 305M, 410M (for fitting), 1B (target). Batch size: 1M tokens. Cosine learning rate decay with 2K step warmup, decaying to 0.1x at 100K steps.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Optimized Mixture</th>
          <th>Default Mixture</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Steps to match default final loss</td>
          <td>73K (73%)</td>
          <td>100K (100%)</td>
          <td>27% training reduction</td>
      </tr>
      <tr>
          <td>Equivalent extra training</td>
          <td>+48%</td>
          <td>Baseline</td>
          <td>Estimated via step scaling law</td>
      </tr>
      <tr>
          <td>Validation loss (1B, 100B)</td>
          <td>Lowest</td>
          <td>Higher than optimized</td>
          <td>Also beats DoReMi and DoGE</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>8 A100 GPUs. Training times per 30B-token run: 3.5 hours (70M), 8 hours (160M), 16 hours (305M), 21 hours (410M).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://pile.eleuther.ai/">The Pile</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Pilot and validation data</td>
      </tr>
      <tr>
          <td><a href="https://github.com/togethercomputer/RedPajama-Data">RedPajama</a></td>
          <td>Dataset</td>
          <td>Apache 2.0</td>
          <td>Main training data</td>
      </tr>
      <tr>
          <td><a href="https://github.com/EleutherAI/pythia">Pythia Suite</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Model architecture configs; Pythia-70M checkpoint for continual pretraining</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status: Partially Reproducible.</strong> Datasets and base model checkpoints are public. No official code release for the data mixing law fitting pipeline, mixture sampling, or the nested scaling law prediction workflow.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ye2025datamixinglaws,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ye, Jiasheng and Liu, Peiju and Sun, Tianxiang and Zhan, Jun and Zhou, Yunhua and Qiu, Xipeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RWKV: Linear-Cost RNN with Transformer Training</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/</guid><description>RWKV combines parallelizable transformer training with constant-cost RNN inference using linear attention and channel-wise decay.</description><content:encoded><![CDATA[<h2 id="a-new-architecture-bridging-rnns-and-transformers">A New Architecture Bridging RNNs and Transformers</h2>
<p>This is a <strong>Method</strong> paper that introduces RWKV (Receptance Weighted Key Value), a novel sequence model architecture that combines the parallelizable training of Transformers with the efficient $O(Td)$ inference of RNNs. RWKV can be formulated equivalently as either a Transformer (for parallel training) or an RNN (for sequential inference), achieving the lowest computational and memory complexity among comparable architectures while matching Transformer-level performance. The authors scale RWKV to 14 billion parameters, making it the largest dense RNN ever trained at the time of publication.</p>
<h2 id="the-quadratic-cost-of-self-attention">The Quadratic Cost of Self-Attention</h2>
<p>Transformers have become the dominant architecture for NLP, powering models like GPT-3, LLaMA, and Chinchilla. Their self-attention mechanism captures both local and long-range dependencies while supporting parallelized training. However, self-attention scales quadratically with sequence length in both time ($O(T^2d)$) and space ($O(T^2 + Td)$), making it computationally and memory intensive for long sequences and resource-constrained deployment.</p>
<p>RNNs, by contrast, offer linear scaling in memory and computation, but suffer from the vanishing gradient problem and cannot parallelize across the time dimension during training. This limits their scalability and makes them unable to match Transformer performance in practice.</p>
<p>Prior work on efficient Transformers (Reformer, Performer, Linformer, AFT, MEGA) has attempted to reduce this quadratic cost, often at the expense of model expressivity. RWKV aims to combine the best of both worlds: Transformer-grade training efficiency with RNN-grade inference cost, without any approximation to the attention mechanism.</p>
<h2 id="linear-attention-via-channel-wise-decay">Linear Attention via Channel-Wise Decay</h2>
<p>RWKV is built on four core vectors that interact multiplicatively at each timestep:</p>
<ul>
<li><strong>R</strong> (Receptance): receives past information, acting as a gating signal</li>
<li><strong>W</strong> (Weight): a trainable positional weight decay vector</li>
<li><strong>K</strong> (Key): analogous to keys in standard attention</li>
<li><strong>V</strong> (Value): analogous to values in standard attention</li>
</ul>
<p>The architecture consists of stacked residual blocks, each containing a <strong>time-mixing</strong> sub-block and a <strong>channel-mixing</strong> sub-block.</p>
<h3 id="token-shift">Token Shift</h3>
<p>All linear projection vectors are produced by interpolating between the current input $x_t$ and the previous input $x_{t-1}$, creating a token shift mechanism:</p>
<p>$$
r_t = W_r \cdot (\mu_r \odot x_t + (1 - \mu_r) \odot x_{t-1})
$$</p>
<p>$$
k_t = W_k \cdot (\mu_k \odot x_t + (1 - \mu_k) \odot x_{t-1})
$$</p>
<p>$$
v_t = W_v \cdot (\mu_v \odot x_t + (1 - \mu_v) \odot x_{t-1})
$$</p>
<p>where $\mu_r$, $\mu_k$, $\mu_v$ are learnable interpolation parameters. This is implemented efficiently as a simple offset in the temporal dimension.</p>
<h3 id="the-wkv-operator">The WKV Operator</h3>
<p>The core attention-like computation replaces standard dot-product attention with a channel-wise weighted sum using exponential decay:</p>
<p>$$
wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} \odot v_i + e^{u + k_t} \odot v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}}
$$</p>
<p>Here $w$ is the channel-wise time decay vector and $u$ is a separate bonus vector that attends specifically to the current token. Unlike AFT where $W$ is a pairwise matrix, RWKV treats $W$ as a channel-wise vector modified by relative position, enabling the recurrent formulation.</p>
<h3 id="output-gating">Output Gating</h3>
<p>The receptance vector gates the WKV output through a sigmoid:</p>
<p>$$
o_t = W_o \cdot (\sigma(r_t) \odot wkv_t)
$$</p>
<p>The channel-mixing block uses a similar gating mechanism with squared ReLU activation:</p>
<p>$$
o'_t = \sigma(r'_t) \odot (W'_v \cdot \max(k'_t, 0)^2)
$$</p>
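<p>Putting the token shift, WKV recurrence, and gating equations together, here is a minimal numpy sketch of one inference-mode step through both sub-blocks. The parameter dict <code>p</code> and function names are illustrative; residual connections, LayerNorms, and the exponent-shift trick the official CUDA kernel uses for numerical stability are omitted:</p>
<pre><code class="language-python">import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def token_shift(x_t, x_prev, mu):
    """Interpolate between the current and previous input vectors."""
    return mu * x_t + (1.0 - mu) * x_prev

def time_mixing_step(x_t, x_prev, num, den, p):
    """One recurrent time-mixing step over d channels.

    (num, den) carry the decayed WKV numerator and denominator:
    num = sum over past i of exp(-(t-1-i)w + k_i) * v_i, likewise den.
    """
    r = p["Wr"] @ token_shift(x_t, x_prev, p["mu_r"])
    k = p["Wk"] @ token_shift(x_t, x_prev, p["mu_k"])
    v = p["Wv"] @ token_shift(x_t, x_prev, p["mu_v"])
    # WKV: decayed history plus the current-token bonus term exp(u + k_t)
    wkv = (num + np.exp(p["u"] + k) * v) / (den + np.exp(p["u"] + k))
    out = p["Wo"] @ (sigmoid(r) * wkv)        # receptance gates the output
    # fold token t into the state with one step of channel-wise decay
    num = np.exp(-p["w"]) * num + np.exp(k) * v
    den = np.exp(-p["w"]) * den + np.exp(k)
    return out, num, den

def channel_mixing_step(x_t, x_prev, p):
    """Channel-mixing sub-block: sigmoid gate times squared-ReLU MLP."""
    r = p["Wr2"] @ token_shift(x_t, x_prev, p["mu_r2"])
    k = p["Wk2"] @ token_shift(x_t, x_prev, p["mu_k2"])
    return sigmoid(r) * (p["Wv2"] @ np.maximum(k, 0.0) ** 2)
</code></pre>
<p>Because the state is just <code>(num, den)</code> plus the previous input, memory stays constant in sequence length, which is the source of the $O(d)$ inference cost discussed next.</p>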
<h3 id="dual-mode-operation">Dual-Mode Operation</h3>
<p>During <strong>training</strong>, RWKV operates in time-parallel mode. The matrix multiplications ($W_\lambda$ for $\lambda \in \{r, k, v, o\}$) dominate at $O(BTd^2)$ and parallelize identically to standard Transformers. The element-wise WKV computation is $O(BTd)$ and parallelizes along batch and channel dimensions.</p>
<p>During <strong>inference</strong>, RWKV switches to time-sequential mode. Each timestep updates a fixed-size state vector, giving constant $O(d)$ memory and $O(Td)$ total time for generating $T$ tokens, compared to $O(T^2d)$ for standard Transformers.</p>
<h3 id="optimizations">Optimizations</h3>
<p>Three additional design choices improve training:</p>
<ol>
<li><strong>Custom CUDA kernels</strong> for the sequential WKV computation, fusing it into a single kernel on training accelerators</li>
<li><strong>Small init embedding</strong>: initializing the embedding matrix with small values plus an additional LayerNorm, accelerating convergence</li>
<li><strong>Custom initialization</strong>: most weights initialized to zero with no biases, following identity-mapping principles from residual network design</li>
</ol>
<h2 id="scaling-to-14b-parameters-and-benchmark-evaluation">Scaling to 14B Parameters and Benchmark Evaluation</h2>
<h3 id="model-scaling">Model Scaling</h3>
<p>The authors train six RWKV models from 169M to 14B parameters, all for one epoch (330B tokens) on the Pile:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>Dimension</th>
          <th>Parameters</th>
          <th>FLOP/Token</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>169M</td>
          <td>12</td>
          <td>768</td>
          <td>$1.69 \times 10^8$</td>
          <td>$2.61 \times 10^8$</td>
      </tr>
      <tr>
          <td>430M</td>
          <td>24</td>
          <td>1024</td>
          <td>$4.30 \times 10^8$</td>
          <td>$7.57 \times 10^8$</td>
      </tr>
      <tr>
          <td>1.5B</td>
          <td>24</td>
          <td>2048</td>
          <td>$1.52 \times 10^9$</td>
          <td>$2.82 \times 10^9$</td>
      </tr>
      <tr>
          <td>3B</td>
          <td>32</td>
          <td>2560</td>
          <td>$2.99 \times 10^9$</td>
          <td>$5.71 \times 10^9$</td>
      </tr>
      <tr>
          <td>7B</td>
          <td>32</td>
          <td>4096</td>
          <td>$7.39 \times 10^9$</td>
          <td>$1.44 \times 10^{10}$</td>
      </tr>
      <tr>
          <td>14B</td>
          <td>40</td>
          <td>5120</td>
          <td>$1.42 \times 10^{10}$</td>
          <td>$2.78 \times 10^{10}$</td>
      </tr>
  </tbody>
</table>
<p>The parameter count follows: $\text{params} = 2VD + 13D^2L + D(11L + 4)$, where $V = 50277$ is vocabulary size, $D$ is model dimension, and $L$ is layers. FLOPs match the standard transformer formula: $\text{FLOP} = 6 \cdot [\text{tokens}] \cdot [\text{params}]$.</p>
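<p>A quick sanity check of the parameter formula against the 169M row of the table:</p>
<pre><code class="language-python">V, D, L = 50277, 768, 12                         # the 169M configuration
params = 2 * V * D + 13 * D ** 2 * L + D * (11 * L + 4)
print(f"{params:,}")                             # 169,342,464 ~= 1.69e8
</code></pre>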
<h3 id="scaling-laws">Scaling Laws</h3>
<p>Training 45 RWKV models across varied (dataset, parameters) pairs, the authors find that RWKV follows the same log-log linear scaling law established for Transformers. The linear fit to Pareto-optimal points achieves $r^2 = 0.994$, and extrapolating an additional order of magnitude beyond the fitted range still yields $r^2 = 0.875$. This contrasts with prior claims that LSTMs do not follow transformer-like scaling.</p>
<h3 id="nlp-benchmarks">NLP Benchmarks</h3>
<p>RWKV is compared against similarly-sized models trained on comparable token budgets: Pythia, OPT, and BLOOM (all FLOP-matched). Results span twelve benchmarks: ARC (Easy/Challenge), BoolQ, COPA, HeadQA, HellaSwag, LAMBADA, OpenBookQA, PIQA, ReCoRD, SciQ, and Winogrande.</p>
<p>RWKV performs competitively with Transformers across all model sizes. On average across benchmarks, RWKV tracks closely with Pythia and outperforms OPT and BLOOM at comparable scales.</p>
<h3 id="long-context-and-extended-finetuning">Long Context and Extended Finetuning</h3>
<p>RWKV can extend its context length after pretraining through progressive finetuning: doubling from 1024 to 2048 (10B tokens), then to 4096 (100B tokens), and finally to 8192 (100B tokens). Each doubling reduces test loss on the Pile, indicating effective use of longer context.</p>
<p>On the Long Range Arena (LRA) benchmark, which tests sequences from 1,000 to 16,000 tokens, RWKV performs second only to S4 across the five datasets.</p>
<h3 id="inference-efficiency">Inference Efficiency</h3>
<p>Benchmarking text generation on CPU (x86) and GPU (NVIDIA A100 80GB) at float32 precision shows that RWKV exhibits linear scaling in generation time, while Transformers scale quadratically. This advantage grows with sequence length: for long outputs, RWKV completes generation substantially faster at equivalent model sizes.</p>
<h2 id="competitive-performance-with-key-caveats">Competitive Performance with Key Caveats</h2>
<p>RWKV demonstrates that RNN-class models can match Transformer performance at scale, while maintaining $O(Td)$ time and $O(d)$ memory during inference. The key findings are:</p>
<ol>
<li><strong>Scaling laws hold</strong>: RWKV follows the same compute-optimal scaling as Transformers ($r^2 = 0.994$), contradicting earlier claims about RNN scaling behavior</li>
<li><strong>Competitive NLP performance</strong>: Across twelve benchmarks, RWKV matches similarly-sized Transformers trained on comparable data</li>
<li><strong>Linear inference cost</strong>: Generation time scales linearly rather than quadratically, with constant memory regardless of sequence length</li>
<li><strong>Context extension</strong>: Progressive finetuning effectively extends the context window post-training</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors identify two primary limitations:</p>
<p><strong>Information compression</strong>: Linear attention funnels all past information through a single fixed-size state vector. For tasks requiring recall of specific details over very long contexts, this is mechanistically more constrained than full self-attention, which maintains direct access to all previous tokens.</p>
<p><strong>Prompt sensitivity</strong>: RWKV is more sensitive to prompt engineering than standard Transformers. The linear attention mechanism limits how much prompt information carries forward, making the order of information in the prompt particularly important. Reordering prompts improved F1 from 44.2% to 74.8% on one task.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest several avenues: applying parallel scan to reduce WKV cost to $O(B \log(T) d)$, extending RWKV to encoder-decoder and multimodal architectures, leveraging hidden states for interpretability and safety, and increasing internal state size to improve long-range recall.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BlinkDL/RWKV-LM">BlinkDL/RWKV-LM</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official PyTorch training and inference implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/BlinkDL/rwkv-4-pile-14b">Pre-trained weights (169M to 14B)</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>All six Pile-trained sizes on HuggingFace (<code>BlinkDL/rwkv-4-pile-*</code>)</td>
      </tr>
      <tr>
          <td><a href="https://pile.eleuther.ai/">The Pile</a></td>
          <td>Dataset</td>
          <td>Mixed</td>
          <td>825 GiB pretraining corpus; component licenses vary by source</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Highly Reproducible. Training code (Apache-2.0), pre-trained weights for all six model sizes, the full training corpus, and complete hyperparameters (Appendix G) are all publicly available. The only missing detail is the specific GPU cluster configuration used for pretraining.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>The Pile</td>
          <td>330B tokens</td>
          <td>One full epoch for all model sizes</td>
      </tr>
      <tr>
          <td>Context extension</td>
          <td>The Pile</td>
          <td>210B additional tokens</td>
          <td>Progressive doubling: 1024 to 8192</td>
      </tr>
      <tr>
          <td>NLP evaluation</td>
          <td>ARC, BoolQ, COPA, HeadQA, HellaSwag, LAMBADA, OpenBookQA, PIQA, ReCoRD, SciQ, Winogrande</td>
          <td>Various</td>
          <td>Zero-shot evaluation</td>
      </tr>
      <tr>
          <td>Long-range evaluation</td>
          <td>Long Range Arena (LRA)</td>
          <td>1K-16K tokens</td>
          <td>Five sub-tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam ($\beta = (0.9, 0.99)$), no weight decay</li>
<li>Precision: bfloat16</li>
<li>Training context length: 1024 tokens</li>
<li>Learning rate: constant warmup, then exponential decay</li>
<li>Auxiliary loss from PaLM (softmax normalizer regularization)</li>
<li>Batch size: 128 or 256 sequences (dynamically switched)</li>
<li>Training organized into mini-epochs of 40,320 samples each (8,043 mini-epochs per Pile epoch)</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Init LR</th>
          <th>Warmup Mini-Epochs</th>
          <th>End LR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>169M</td>
          <td>6e-4</td>
          <td>361</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>430M</td>
          <td>4e-4</td>
          <td>411</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>1.5B</td>
          <td>3e-4</td>
          <td>443</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>3B</td>
          <td>1.5e-4</td>
          <td>451</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>7B</td>
          <td>1.5e-4</td>
          <td>465</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>14B</td>
          <td>1e-4</td>
          <td>544</td>
          <td>7e-6</td>
      </tr>
  </tbody>
</table>
<p>All pretrained models (169M to 14B) are publicly released on HuggingFace (<code>BlinkDL/rwkv-4-pile-*</code>) under Apache-2.0. Training code is at <a href="https://github.com/BlinkDL/RWKV-LM">BlinkDL/RWKV-LM</a> (Apache-2.0).</p>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>All NLP benchmarks evaluated in zero-shot setting</li>
<li>FLOP-matched comparison against Pythia, OPT, BLOOM</li>
<li>Inference benchmarked on CPU (x86) and GPU (NVIDIA A100 80GB) at float32</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Inference experiments: NVIDIA A100 80GB GPU</li>
<li>Training hardware details not fully specified; FLOP budgets reported per model</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., &hellip; &amp; Zhu, R.-J. (2023). RWKV: Reinventing RNNs for the Transformer Era. In <em>Findings of the Association for Computational Linguistics: EMNLP 2023</em>, pp. 14048-14077.</p>
<p><strong>Publication</strong>: Findings of EMNLP 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/BlinkDL/RWKV-LM">GitHub Repository (Apache-2.0)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{peng2023rwkv,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{RWKV: Reinventing RNNs for the Transformer Era}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Peng, Bo and Alcaide, Eric and Anthony, Quentin and Albalak, Alon and Arcadinho, Samuel and Biderman, Stella and Cao, Huanqi and Cheng, Xin and Chung, Michael and Derczynski, Leon and Du, Xingjian and Grella, Matteo and GV, Kranthi Kiran and He, Xuzheng and Hou, Haowen and Kazienko, Przemys{\l}aw and Koco{\&#39;n}, Jan and Kong, Jiaming and Koptyra, Bart{\l}omiej and Lau, Hayden and Lin, Jiaju and Mantri, Krishna Sri Ipsit and Mom, Ferdinand and Saito, Atsushi and Song, Guangyu and Tang, Xiangru and Wind, Johan S. and Wo{\&#39;z}niak, Stanis{\l}aw and Zhang, Zhenyuan and Zhou, Qinghua and Zhu, Jian and Zhu, Rui-Jie}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Findings of the Association for Computational Linguistics: EMNLP 2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{14048--14077}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2023.findings-emnlp.936}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Block-Recurrent Transformers for Long Sequences</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/block-recurrent-transformers/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/block-recurrent-transformers/</guid><description>Block-Recurrent Transformers combine attention and recurrence for linear-complexity language modeling on long documents like books and code.</description><content:encoded><![CDATA[<h2 id="a-method-for-combining-attention-with-block-level-recurrence">A Method for Combining Attention with Block-Level Recurrence</h2>
<p>This is a <strong>Method</strong> paper that introduces the Block-Recurrent Transformer, a model architecture that integrates recurrence into the transformer framework at the block level. Rather than processing tokens one at a time (as in traditional RNNs) or attending over entire sequences (as in standard transformers), this approach applies a transformer layer recurrently across blocks of tokens. The result is a model with linear complexity in sequence length that maintains the parallelism benefits of transformers during training. A related approach, <a href="/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/">RWKV</a>, later explored similar ideas using linear attention with channel-wise decay.</p>
<h2 id="why-transformers-struggle-with-long-documents">Why Transformers Struggle with Long Documents</h2>
<p>Transformers have largely replaced RNNs for sequence modeling tasks, but their quadratic self-attention cost limits the length of sequences they can process. A transformer with a window size of 512 tokens cannot see information beyond that window, making it blind to long-range dependencies in books, technical papers, or source code repositories.</p>
<p>Prior approaches to this problem fall into several categories: sparse attention patterns (BigBird, Routing Transformers, Reformer), sequence compression (Linformer, Funnel Transformers), and linearized attention approximations. These methods either sacrifice the expressiveness of full softmax attention or introduce implementation complexity.</p>
<p>Traditional RNNs like LSTMs offer linear complexity but suffer from three key limitations: sequential processing prevents parallelism on modern hardware, a single state vector bottlenecks information capacity, and vanishing gradients limit effective memory to a few hundred tokens.</p>
<h2 id="block-level-recurrence-with-lstm-style-gates">Block-Level Recurrence with LSTM-Style Gates</h2>
<p>The core innovation is applying a standard transformer layer in a recurrent fashion along the sequence, operating on blocks of $W$ tokens rather than individual tokens. The recurrent cell maintains $S$ state vectors (typically $S = W = 512$) that are updated at each block boundary.</p>
<h3 id="the-recurrent-cell">The Recurrent Cell</h3>
<p>The cell has two processing directions:</p>
<ul>
<li><strong>Vertical direction</strong>: An ordinary transformer layer with self-attention over input tokens and cross-attention to recurrent states, producing output embeddings.</li>
<li><strong>Horizontal direction</strong>: Self-attention over current state vectors and cross-attention to input tokens, producing updated state vectors. Residual connections are replaced with gates.</li>
</ul>
<p>Self-attention and cross-attention are computed in parallel (not sequentially), with results concatenated and fed into a linear projection. Keys and values are shared between directions, while queries are separate, yielding four query sets: $Q_e^v$, $Q_s^v$ (vertical) and $Q_s^h$, $Q_e^h$ (horizontal).</p>
<h3 id="gating-mechanisms">Gating Mechanisms</h3>
<p>Two gate types are explored. The <strong>fixed gate</strong> uses a learned convex combination:</p>
<p>$$
g = \sigma(b_g)
$$</p>
<p>$$
c_{t+1} = c_t \odot g + z_t \odot (1 - g)
$$</p>
<p>where $g$ is constant after training, implementing an <a href="https://en.wikipedia.org/wiki/Moving_average">exponential moving average</a>.</p>
<p>The <strong>LSTM gate</strong> uses input and forget gates:</p>
<p>$$
i_t = \sigma(W_i h_t + b_i - 1)
$$</p>
<p>$$
f_t = \sigma(W_f h_t + b_f + 1)
$$</p>
<p>$$
c_{t+1} = c_t \odot f_t + z_t \odot i_t
$$</p>
<p>The bias offsets ($-1$ for input, $+1$ for forget) initialize the model to &ldquo;remember&rdquo; by default, which is critical for training stability. Without careful initialization, the model can fall into a local optimum where it ignores the recurrent state entirely. This echoes the <a href="/notes/machine-learning/model-architectures/can-recurrent-neural-networks-warp-time/">gate initialization challenges studied by Tallec and Ollivier</a>, who derived chrono initialization for LSTMs from time-warping invariance.</p>
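<p>A minimal numpy sketch of the two gate variants (assuming $h_t$ is the cell's pre-gate hidden input and $z_t$ the proposed state update; names are illustrative):</p>
<pre><code class="language-python">import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fixed_gate(c, z, b_g):
    """Learned EMA between old state and update; g is constant after training."""
    g = sigmoid(b_g)
    return c * g + z * (1.0 - g)

def lstm_gate(c, z, h, W_i, b_i, W_f, b_f):
    """Input/forget gates with bias offsets that default to 'remember'."""
    i = sigmoid(h @ W_i + b_i - 1.0)   # input gate biased toward closed
    f = sigmoid(h @ W_f + b_f + 1.0)   # forget gate biased toward open
    return c * f + z * i
</code></pre>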
<h3 id="gate-configurations">Gate Configurations</h3>
<p>Three configurations are tested: <strong>dual</strong> (gates on both attention and MLP outputs), <strong>single</strong> (gate only on MLP output), and <strong>skip</strong> (gate only on attention output, no MLP). The skip configuration removes the large MLP from the recurrent layer entirely.</p>
<h3 id="learned-state-ids">Learned State IDs</h3>
<p>Since the same weights are applied to all state vectors, learned &ldquo;state IDs&rdquo; (analogous to position embeddings) are added so each state vector can issue distinct queries. <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-style relative position bias is used for token self-attention, with no position bias for state-token cross-attention.</p>
<h2 id="language-modeling-on-pg19-arxiv-and-github">Language Modeling on PG19, arXiv, and GitHub</h2>
<h3 id="experimental-setup">Experimental Setup</h3>
<p>The base model is a 12-layer transformer with 150M parameters (8 heads of size 128, embedding dimension 1024, MLP hidden size 4096). The recurrent layer is placed at layer 10 with segment length $N = 4096$ and window size $W = 512$. The architecture is evaluated on three long-document datasets:</p>
<ul>
<li><strong>PG19</strong>: Full-length books from <a href="https://en.wikipedia.org/wiki/Project_Gutenberg">Project Gutenberg</a> (pre-1919)</li>
<li><strong>arXiv</strong>: Mathematics papers in LaTeX</li>
<li><strong>GitHub</strong>: Concatenated source code from open-source repositories</li>
</ul>
<p>All models report bits-per-token ($\log_2$ perplexity, lower is better).</p>
<h3 id="baselines">Baselines</h3>
<p>Five baselines are compared: Transformer-XL with window sizes of 512, 1024, and 2048, plus 12-layer and 13-layer sliding window models. The 13-layer sliding window (Slide:13L) is the primary comparison, having equivalent computation cost and parameter count to the recurrent models.</p>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Step Time</th>
          <th>PG19 (bytes)</th>
          <th>PG19 (tokens)</th>
          <th>arXiv</th>
          <th>GitHub</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>XL:512</td>
          <td>0.88</td>
          <td>1.01</td>
          <td>3.62</td>
          <td>1.45</td>
          <td>1.21</td>
      </tr>
      <tr>
          <td>XL:2048</td>
          <td>2.11</td>
          <td>0.990</td>
          <td>3.58</td>
          <td>1.31</td>
          <td>1.01</td>
      </tr>
      <tr>
          <td>Slide:13L</td>
          <td>1.00</td>
          <td>0.989</td>
          <td>3.58</td>
          <td>1.42</td>
          <td>1.17</td>
      </tr>
      <tr>
          <td>Rec:fixed:skip</td>
          <td>0.99</td>
          <td>0.952</td>
          <td>3.53</td>
          <td>1.24</td>
          <td>0.976</td>
      </tr>
      <tr>
          <td>Rec:fixed:dual</td>
          <td>1.01</td>
          <td>0.957</td>
          <td>3.52</td>
          <td>1.27</td>
          <td>0.991</td>
      </tr>
      <tr>
          <td>Feedback:fixed:skip</td>
          <td>1.35</td>
          <td>0.935</td>
          <td>3.49</td>
          <td>1.24</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Memorizing Trans. 64k</td>
          <td>1.94</td>
          <td>0.950</td>
          <td>3.53</td>
          <td>1.22</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The Rec:fixed:skip configuration achieves the best overall results while being slightly faster than the 13-layer baseline. It outperforms XL:2048, which runs over 2x slower. The block feedback variant (allowing all layers to cross-attend to recurrent states) improves perplexity further at ~35-40% higher step time.</p>
<h3 id="scaling-behavior">Scaling Behavior</h3>
<p>Models from 40M to 1.3B parameters show that the benefit of recurrence is <a href="/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/">consistent across scales</a> and increases with model size. At larger sizes, adding recurrence provides a benefit greater than doubling the number of parameters. The 1.3B parameter model achieves 26.50 word-level perplexity on PG19, setting a new state of the art at the time of publication.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>PG19 Perplexity</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compressive Transformer</td>
          <td>36</td>
          <td>33.6</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Routing Transformer</td>
          <td>22</td>
          <td>33.2</td>
          <td>490M</td>
      </tr>
      <tr>
          <td>Perceiver AR</td>
          <td>60</td>
          <td>28.9</td>
          <td>974.6M</td>
      </tr>
      <tr>
          <td>Block-Recurrent Transformer</td>
          <td>24</td>
          <td>26.50</td>
          <td>1.3B</td>
      </tr>
  </tbody>
</table>
<h3 id="ablations">Ablations</h3>
<ul>
<li><strong>Multiple recurrent layers</strong>: Two adjacent layers (9, 10) provide no benefit. Two separated layers (4, 10) help but no more than adding another non-recurrent layer.</li>
<li><strong>Number of states</strong>: Improvement up to 1024 states, degradation at 2048.</li>
<li><strong>Window size reduction</strong>: Reducing the sliding window hurts Transformer-XL dramatically but has smaller impact on the recurrent model, which compensates via recurrence.</li>
<li><strong>Gate type</strong>: The fixed gate consistently outperforms the LSTM gate despite being theoretically less expressive.</li>
</ul>
<h3 id="qualitative-analysis">Qualitative Analysis</h3>
<p>Comparing per-token predictions against Transformer-XL on PG19 books, the recurrent model&rsquo;s advantage comes overwhelmingly from predicting proper names (17/20 top-improvement tokens). In 19/20 cases, the predicted word was outside the attention window, confirming it was stored in recurrent state. The model can remember book titles and authors across 60,000+ tokens.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>The Block-Recurrent Transformer demonstrates that recurrence at the block level is a cost-effective way to improve language modeling on long sequences. The fixed:skip configuration (the simplest variant) performs best, suggesting the model primarily uses recurrence for long-range name lookup rather than complex reasoning. The fact that removing the MLP from the recurrent layer has minimal impact further supports this interpretation.</p>
<p>Key limitations include: the model was only evaluated on language modeling perplexity (no downstream tasks), the LSTM gate underperforms the simpler fixed gate (suggesting untapped potential for more expressive recurrence), and the authors acknowledge that training the recurrent layer to fully exploit its capacity for knowledge extraction will require further advances.</p>
<p>The authors note that evaluating on downstream tasks requiring long-range context (book summarization, long-document QA, code completion) is an important direction for future work.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>PG19</td>
          <td>~29k books</td>
          <td>Public domain, freely available</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>arXiv</td>
          <td>Mathematics papers</td>
          <td>Obtained via private channels, not redistributable</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>GitHub</td>
          <td>Open-source repos</td>
          <td>Obtained via private channels, not redistributable</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adafactor</li>
<li>Learning rate: 1.0 with inverse square root decay (initial experiments), cosine decay with max 0.01 (scaling experiments)</li>
<li>Warmup: 1000 steps</li>
<li>Dropout: 0.05</li>
<li>Vocabulary: 32k SentencePiece (T5 pretrained for initial, custom for scaling)</li>
<li>Gate initialization: bias of $+1$ for forget gate, $-1$ for input gate to ensure initial &ldquo;remember&rdquo; behavior</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Layers</th>
          <th>Parameters</th>
          <th>Recurrent Layers</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>12 (+1 recurrent)</td>
          <td>~151-164M</td>
          <td>Layer 10</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>24 (+2 recurrent)</td>
          <td>650M</td>
          <td>Layers 10, 20</td>
      </tr>
      <tr>
          <td>XL</td>
          <td>24 (+2 recurrent)</td>
          <td>1.3B</td>
          <td>Layers 10, 20</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Model</th>
          <th>PG19 (tokens)</th>
          <th>arXiv</th>
          <th>GitHub</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bits-per-token</td>
          <td>Rec:fixed:skip</td>
          <td>3.53</td>
          <td>1.24</td>
          <td>0.976</td>
      </tr>
      <tr>
          <td>Word-level PPL</td>
          <td>1.3B model</td>
          <td>26.50</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Error bars on PG19 are between 0.002 and 0.007 (3 runs with different seeds).</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: 32 Google TPU v4 replicas</li>
<li>Training time: ~48 hours for 500k steps on PG19</li>
<li>Batch size: 32 (segment length 4096) or 256 (segment length 512), adjusted so each model sees the same tokens per step</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Available</th>
          <th>License</th>
          <th>URL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Code (Meliad)</td>
          <td>Yes</td>
          <td>Apache 2.0</td>
          <td><a href="https://github.com/google-research/meliad">github.com/google-research/meliad</a></td>
      </tr>
      <tr>
          <td>PG19 Dataset</td>
          <td>Yes</td>
          <td>Public Domain</td>
          <td>Public</td>
      </tr>
      <tr>
          <td>arXiv Dataset</td>
          <td>No</td>
          <td>Not redistributable</td>
          <td>Private</td>
      </tr>
      <tr>
          <td>GitHub Dataset</td>
          <td>No</td>
          <td>Not redistributable</td>
          <td>Private</td>
      </tr>
      <tr>
          <td>Pretrained Models</td>
          <td>No</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Assessment</strong>: Partially Reproducible. Source code is available under Apache 2.0 and the PG19 dataset is public. However, two of three evaluation datasets (arXiv, GitHub) were obtained via private channels and are not redistributable. No pretrained model checkpoints are released.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hutchins, D., Schlag, I., Wu, Y., Dyer, E., &amp; Neyshabur, B. (2022). Block-Recurrent Transformers. <em>Advances in Neural Information Processing Systems 35 (NeurIPS 2022)</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{hutchins2022block,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Block-Recurrent Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hutchins, DeLesley and Schlag, Imanol and Wu, Yuhuai and Dyer, Ethan and Neyshabur, Behnam}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2203.07852}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>High-Performance Word2Vec in Pure PyTorch</title><link>https://hunterheidenreich.com/projects/modern-word2vec/</link><pubDate>Sat, 16 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/modern-word2vec/</guid><description>Production-grade Word2Vec in PyTorch with vectorized Hierarchical Softmax, Negative Sampling, and torch.compile support.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>Word2Vec is often treated as a &ldquo;solved problem&rdquo; or a black box inside libraries like Gensim. This project deconstructs the algorithm to treat it as a <strong>systems engineering challenge</strong>.</p>
<p>I built a ground-up, typed, and compiled PyTorch implementation that bridges the gap between the original C code&rsquo;s efficiency and modern GPU acceleration. The core innovation lies in <strong>&ldquo;tensorizing the tree&rdquo;</strong>, converting the pointer-chasing logic of Hierarchical Softmax into dense, vectorized operations compatible with <code>torch.compile</code>.</p>
<h2 id="features">Features</h2>
<h3 id="1-vectorized-hierarchical-softmax">1. Vectorized Hierarchical Softmax</h3>
<p>Classically, Hierarchical Softmax involves traversing a binary Huffman tree. While efficient on a CPU, this approach creates divergent execution paths on GPUs.</p>
<ul>
<li><strong>The Solution:</strong> I implemented a &ldquo;pre-computed path&rdquo; strategy. The tree traversal for every vocabulary word is flattened into fixed-size tensors (<code>word_path_indices</code>, <code>word_codes_tensor</code>) padded to the maximum depth.</li>
<li><strong>The Result:</strong> The forward pass becomes a massive, masked batch dot-product against internal node embeddings, allowing the GPU to crunch the probability tree without branching logic.</li>
</ul>
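<p>A minimal sketch of the masked batch dot-product (the tensor names <code>word_path_indices</code> and <code>word_codes_tensor</code> come from the project; everything else here is illustrative):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import torch
import torch.nn.functional as F

def hs_loss(center_vecs, targets, node_emb,
            word_path_indices, word_codes_tensor, path_mask):
    """Vectorized Hierarchical Softmax loss (sketch).

    center_vecs:       (B, D) input embeddings for the center words
    targets:           (B,) target word ids
    node_emb:          (num_internal_nodes, D) inner-node embeddings
    word_path_indices: (V, max_depth) inner nodes on each word's path
    word_codes_tensor: (V, max_depth) left/right codes (0/1) on the path
    path_mask:         (V, max_depth) 1 for real path steps, 0 for padding
    """
    nodes = node_emb[word_path_indices[targets]]   # (B, max_depth, D)
    codes = word_codes_tensor[targets].float()     # (B, max_depth)
    mask = path_mask[targets].float()              # (B, max_depth)

    # One batched dot-product replaces the per-word tree traversal.
    logits = torch.einsum("bd,bld->bl", center_vecs, nodes)
    step_loss = F.binary_cross_entropy_with_logits(
        logits, codes, reduction="none")
    return (step_loss * mask).sum(dim=1).mean()
</code></pre></div>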
<h3 id="2-infinite-streaming--sliding-windows">2. Infinite Streaming &amp; Sliding Windows</h3>
<p>To handle datasets larger than RAM (e.g., Wikipedia/CommonCrawl), I built a custom <code>IterableDataset</code> that performs a true single-pass read.</p>
<ul>
<li><strong>Efficient Windowing:</strong> It uses a <code>collections.deque</code> buffer to slide over the token stream, generating training pairs only when a new token enters the center context.</li>
<li><strong>Zipfian Subsampling:</strong> Implemented a probabilistic rejection sampling layer that downsamples frequent words (like &ldquo;the&rdquo; or &ldquo;of&rdquo;) on-the-fly, strictly adhering to the original Mikolov et al. paper&rsquo;s distribution.</li>
</ul>
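<p>The windowing logic itself is compact; here is a simplified sketch of the pair generator (illustrative, not the project&rsquo;s exact code):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import random
from collections import deque
from typing import Iterable, Iterator, Tuple

def skipgram_pairs(token_ids: Iterable[int], window: int,
                   keep_prob: dict) -> Iterator[Tuple[int, int]]:
    """Single-pass (center, context) pair generator over a token stream.

    keep_prob maps a token id to its subsampling keep-probability,
    derived from the Mikolov et al. formula for frequent words.
    """
    buf = deque(maxlen=2 * window + 1)
    for tok in token_ids:
        # Zipfian subsampling: reject frequent tokens on the fly.
        if random.random() > keep_prob.get(tok, 1.0):
            continue
        buf.append(tok)
        if len(buf) == buf.maxlen:
            center = buf[window]  # token in the middle of the window
            for i, ctx in enumerate(buf):
                if i != window:
                    yield center, ctx
</code></pre></div>
<p>Edge handling and dynamic window shrinking are omitted here; the point is that a <code>deque</code> with a fixed <code>maxlen</code> gives O(1) sliding without re-reading or materializing the corpus.</p>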
<h3 id="3-modern-production-tooling">3. Modern Production Tooling</h3>
<p>This project uses a strict &ldquo;software 2.0&rdquo; stack:</p>
<ul>
<li><strong>Dependency Management</strong>: Built with <code>uv</code> for deterministic, lightning-fast environment resolution.</li>
<li><strong>Compilation</strong>: Fully compatible with <code>torch.compile</code> (PyTorch 2.0+), allowing for graph fusion of the custom loss functions.</li>
</ul>
<h2 id="usage">Usage</h2>
<p>The library can be installed via <code>pip</code> and used as a drop-in replacement for Gensim&rsquo;s Word2Vec, with the added benefit of GPU acceleration.</p>
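<p>A hypothetical usage sketch, mirroring Gensim&rsquo;s interface (the package and argument names below are illustrative, not the project&rsquo;s actual API):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Hypothetical API sketch -- names are illustrative only.
from modern_word2vec import Word2Vec  # assumed package/module name

model = Word2Vec(
    corpus_path="enwiki.txt",  # streamed; never fully loaded into RAM
    vector_size=300,
    window=5,
    hs=True,                   # vectorized Hierarchical Softmax
    device="cuda",
    use_compile=True,          # enable torch.compile graph fusion
)
model.train(epochs=1)
print(model.most_similar("king", topn=5))
</code></pre></div>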
<h2 id="results">Results</h2>
<ul>
<li><strong>Correct Embeddings</strong>: The produced vectors pass qualitative semantic similarity checks (e.g., analogical reasoning tasks), confirming the tensorized tree produces the same geometry as sequential traversal.</li>
<li><strong>GPU-Scalable</strong>: The batched Huffman tree approach eliminates divergent GPU execution, enabling meaningful throughput gains on large vocabularies (100k+ tokens).</li>
<li><strong>OOM-Free on Large Corpora</strong>: The streaming <code>IterableDataset</code> with Zipfian subsampling runs on Wikipedia/CommonCrawl-scale text without loading data into RAM.</li>
<li><strong><code>torch.compile</code> Compatible</strong>: The custom loss functions fuse correctly under <code>torch.compile</code>, achieving kernel fusion unavailable in eager mode.</li>
</ul>
<h2 id="related-work">Related Work</h2>
<p>This project connects to related NLP work on this site:</p>
<ul>
<li><a href="/posts/intro-to-word-embeddings/">An Introduction to Word Embeddings</a>: conceptual background on the representations this library produces</li>
<li><a href="/research/word-company-vicinity/">Word Company Vicinity</a>: research applying word vector semantics to company names</li>
<li><a href="/research/semantic-network-induction/">Semantic Network Induction</a>: research on inducing semantic graphs from embedding spaces</li>
</ul>
]]></content:encoded></item><item><title>Sarcasm Detection with Transformers: A Cautionary Tale</title><link>https://hunterheidenreich.com/posts/sarcasm-detection-with-transformers/</link><pubDate>Sun, 25 Feb 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/sarcasm-detection-with-transformers/</guid><description>Learn how dataset bias can lead to misleading results in NLP: a sarcasm detection model that learned to classify news sources.</description><content:encoded><![CDATA[<h2 id="why-sarcasm-detection-is-hard">Why Sarcasm Detection Is Hard</h2>
<p>Sarcasm detection represents one of the most challenging problems in NLP. The difficulties include:</p>
<p><strong>Context dependence</strong>: Sarcasm relies on situational knowledge and shared understanding that extends beyond the text itself.</p>
<p><strong>Subtlety</strong>: Even humans struggle with sarcastic interpretation, especially in written text without vocal cues.</p>
<p><strong>Cultural variability</strong>: Sarcastic expressions vary significantly across cultures and regions.</p>
<p><strong>Annotation disagreement</strong>: Human annotators often disagree on what constitutes sarcasm.</p>
<p>These challenges raise a fundamental question: can sarcasm detection be well-defined as a computational problem? This case study explores what happens when we try (and reveals a common pitfall in dataset construction).</p>
<h2 id="the-dataset-a-hidden-flaw">The Dataset: A Hidden Flaw</h2>
<p>I used the <a href="https://huggingface.co/datasets/raquiba/Sarcasm_News_Headline">Sarcasm News Headlines dataset</a>, which combines headlines from <a href="https://theonion.com/">The Onion</a> (satirical) and <a href="https://www.huffpost.com/">The Huffington Post</a> (traditional news). The dataset contains ~50,000 examples.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> datasets <span style="color:#f92672">import</span> load_dataset
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>dataset <span style="color:#f92672">=</span> load_dataset(<span style="color:#e6db74">&#34;raquiba/Sarcasm_News_Headline&#34;</span>)
</span></span><span style="display:flex;"><span>print(dataset[<span style="color:#e6db74">&#34;train&#34;</span>][<span style="color:#ae81ff">0</span>])
</span></span><span style="display:flex;"><span>print(dataset[<span style="color:#e6db74">&#34;train&#34;</span>][<span style="color:#ae81ff">1</span>])
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>{&#39;headline&#39;: &#39;thirtysomething scientists unveil doomsday clock of hair loss&#39;,
</span></span><span style="display:flex;"><span> &#39;is_sarcastic&#39;: 1}
</span></span><span style="display:flex;"><span>{&#39;headline&#39;: &#39;dem rep. totally nails why congress is falling short on gender, racial equality&#39;,
</span></span><span style="display:flex;"><span> &#39;is_sarcastic&#39;: 0}
</span></span></code></pre></div><p><strong>The critical flaw</strong>: This dataset uses binary classification based on source domain. The Onion headlines are labeled sarcastic, HuffPost headlines are not. This creates a dangerous shortcut where models learn to detect the publication source.</p>
<p>After preprocessing to standardize column names:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>dataset <span style="color:#f92672">=</span> dataset<span style="color:#f92672">.</span>map(
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">lambda</span> example: {<span style="color:#e6db74">&#34;text&#34;</span>: example[<span style="color:#e6db74">&#34;headline&#34;</span>], <span style="color:#e6db74">&#34;label&#34;</span>: example[<span style="color:#e6db74">&#34;is_sarcastic&#34;</span>]},
</span></span><span style="display:flex;"><span>    remove_columns<span style="color:#f92672">=</span>[<span style="color:#e6db74">&#34;headline&#34;</span>, <span style="color:#e6db74">&#34;article_link&#34;</span>, <span style="color:#e6db74">&#34;is_sarcastic&#34;</span>]
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h2 id="fine-tuning-roberta">Fine-Tuning RoBERTa</h2>
<p>I fine-tuned a pre-trained RoBERTa model using standard practices:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> transformers <span style="color:#f92672">import</span> AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>model_name <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;FacebookAI/roberta-base&#34;</span>
</span></span><span style="display:flex;"><span>tokenizer <span style="color:#f92672">=</span> AutoTokenizer<span style="color:#f92672">.</span>from_pretrained(model_name)
</span></span><span style="display:flex;"><span>model <span style="color:#f92672">=</span> AutoModelForSequenceClassification<span style="color:#f92672">.</span>from_pretrained(model_name, num_labels<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Tokenize the data</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">tokenize_function</span>(examples):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> tokenizer(examples[<span style="color:#e6db74">&#34;text&#34;</span>], truncation<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, max_length<span style="color:#f92672">=</span><span style="color:#ae81ff">512</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>tokenized_datasets <span style="color:#f92672">=</span> dataset<span style="color:#f92672">.</span>map(tokenize_function, batched<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Training configuration</span>
</span></span><span style="display:flex;"><span>training_args <span style="color:#f92672">=</span> TrainingArguments(
</span></span><span style="display:flex;"><span>    output_dir<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;./results&#34;</span>,
</span></span><span style="display:flex;"><span>    num_train_epochs<span style="color:#f92672">=</span><span style="color:#ae81ff">5</span>,
</span></span><span style="display:flex;"><span>    per_device_train_batch_size<span style="color:#f92672">=</span><span style="color:#ae81ff">32</span>,
</span></span><span style="display:flex;"><span>    evaluation_strategy<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;epoch&#34;</span>,
</span></span><span style="display:flex;"><span>    save_strategy<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;epoch&#34;</span>,
</span></span><span style="display:flex;"><span>    load_best_model_at_end<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>trainer <span style="color:#f92672">=</span> Trainer(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span>model,
</span></span><span style="display:flex;"><span>    args<span style="color:#f92672">=</span>training_args,
</span></span><span style="display:flex;"><span>    train_dataset<span style="color:#f92672">=</span>tokenized_datasets[<span style="color:#e6db74">&#34;train&#34;</span>],
</span></span><span style="display:flex;"><span>    eval_dataset<span style="color:#f92672">=</span>tokenized_datasets[<span style="color:#e6db74">&#34;test&#34;</span>],
</span></span><span style="display:flex;"><span>    tokenizer<span style="color:#f92672">=</span>tokenizer,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>trainer<span style="color:#f92672">.</span>train()
</span></span></code></pre></div><h2 id="results-too-good-to-be-true">Results: Too Good to Be True</h2>
<p>The model achieved high accuracy:</p>
<table>
  <thead>
      <tr>
          <th>Epoch</th>
          <th>Test Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>96.3%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>97.8%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>99.4%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>99.8%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>99.8%</td>
      </tr>
  </tbody>
</table>
<p>This should immediately raise red flags. Sarcasm detection is notoriously difficult, even for humans. Such high accuracy suggests the model learned a proxy task.</p>
<p>My hypothesis: <strong>The model bypassed sarcasm detection entirely, learning only to distinguish between The Onion and HuffPost writing styles.</strong></p>
<h2 id="interacting-with-the-model">Interacting with the Model</h2>
<p>Let&rsquo;s test our hypothesis by interacting with the model.</p>
<p>First, let&rsquo;s load the model and tokenizer:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> transformers <span style="color:#f92672">import</span> pipeline
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>model <span style="color:#f92672">=</span> AutoModelForSequenceClassification<span style="color:#f92672">.</span>from_pretrained(<span style="color:#e6db74">&#39;results/2024-02-25_20-24-51/checkpoint-4475&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>clf <span style="color:#f92672">=</span> pipeline(<span style="color:#e6db74">&#39;text-classification&#39;</span>, model<span style="color:#f92672">=</span>model, tokenizer<span style="color:#f92672">=</span>tokenizer)
</span></span></code></pre></div><p>Now, let&rsquo;s test the model with some examples.</p>
<p>First, let&rsquo;s try an Onion article from this week, something I know to be sarcastic and not in the training data.
Let&rsquo;s use <a href="https://theonion.com/alabama-supreme-court-justice-invokes-veggietales-in-1851282252/">&ldquo;Alabama Supreme Court Justice Invokes &lsquo;VeggieTales&rsquo; In Ruling&rdquo;</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf(<span style="color:#e6db74">&#34;Alabama Supreme Court Justice Invokes ‘VeggieTales&#39; In Ruling&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.99916672706604}]
</span></span></code></pre></div><p>The model is extremely confident that this is not sarcastic.</p>
<p>Let&rsquo;s try a different Onion article, possibly even more difficult: <a href="https://theonion.com/trump-booed-frozen-burritos-and-more-this-week-in-br-1851282066/">Breaking News Trump Booed, Frozen Burritos, And More: This Week In Breaking News February 24, 2024</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf(<span style="color:#e6db74">&#34;Breaking News Trump Booed, Frozen Burritos, And More: This Week In Breaking News February 24, 2024&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993497729301453}]
</span></span></code></pre></div><p>Again, the model is very confident that this is not sarcastic. Hmm. Perhaps the problem is temporal: a model trained on older headlines may simply fail to capture how The Onion writes in 2024.</p>
<p>Let&rsquo;s try one more Onion article, this one that is still recent but a bit more of a low-hanging fruit: <a href="https://theonion.com/mom-only-likes-the-other-outback-steakhouse-1851265335/">Mom Only Likes The Other Outback Steakhouse</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf(<span style="color:#e6db74">&#34;Mom Only Likes The Other Outback Steakhouse&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_1&#39;, &#39;score&#39;: 0.9997231364250183}]
</span></span></code></pre></div><p>Finally, a correct prediction! The model is confident that this is sarcastic.
Our model detects only very specific types of sarcasm. It fails to generalize to new, unseen data within the same domain.</p>
<p>Let&rsquo;s also try some headlines from the Huffington Post, which the model should predict as not sarcastic.
Let&rsquo;s try the five most recent headlines from the Huffington Post:</p>
<ul>
<li><a href="https://www.huffpost.com/entry/donald-trump-south-carolina-nikki-haley_n_65db61f5e4b0e4346d52bed8">Donald Trump Won South Carolina - But There&rsquo;s 1 Big Caveat</a></li>
<li><a href="https://www.huffpost.com/entry/israeli-embassy-washington-man-set-fire_n_65db9364e4b0e4346d52ce3d">Man Sets Himself On Fire In Front Of Israeli Embassy In Washington</a></li>
<li><a href="https://www.huffpost.com/entry/bc-ml-israel-palestinians-temporary-truce-cease-fire_n_65db2e9ae4b0189a6a7e32ea">Israeli Media Report Progress On Reaching A Temporary Truce In Gaza And A Hostage-Prisoner Exchange</a></li>
<li><a href="https://www.huffpost.com/entry/george-latimer-race-comments-democratic-primary_n_65d8fac3e4b0cc1f2f7bafd8">A White Liberal Is Trying To Oust A Progressive Black Congressman. His Comments Could Make That Job Harder.</a></li>
<li><a href="https://www.huffpost.com/entry/mongolia-climate-change-extreme-weather_n_65d90294e4b0cc1f2f7bb527">Climate Change-Fueled Winter Extremes Put 90% Of This Country At &lsquo;High Risk&rsquo;</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf([
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Donald Trump Won South Carolina - But There&#39;s 1 Big Caveat&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Man Sets Himself On Fire In Front Of Israeli Embassy In Washington&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Israeli Media Report Progress On Reaching A Temporary Truce In Gaza And A Hostage-Prisoner Exchange&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;A White Liberal Is Trying To Oust A Progressive Black Congressman. His Comments Could Make That Job Harder.&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Climate Change-Fueled Winter Extremes Put 90% Of This Country At &#39;High Risk&#39;&#34;</span>
</span></span><span style="display:flex;"><span>])
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993808269500732},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993786811828613},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9985186457633972},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993883371353149},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993487000465393}]
</span></span></code></pre></div><p>The model is extremely confident that these are not sarcastic.</p>
<p>The model detects sarcasm only in limited cases and fails to generalize to new, unseen data, even within the same domain. This is a common problem in machine learning: training a model that performs well on a specific dataset is straightforward, while training one that generalizes remains a significant challenge.
Furthermore, what our sarcasm detection project actually produced is a domain classifier. For fuzzy concepts like sarcasm, it&rsquo;s important to be clear about what we&rsquo;re actually detecting, and to collect data at the scale and diversity needed to capture the full range of the concept.</p>
<h2 id="key-takeaways">Key Takeaways</h2>
<p>This case study reveals a fundamental problem in ML: <strong>high accuracy guarantees only performance on the training distribution</strong>. Here&rsquo;s what actually happened:</p>
<ol>
<li><strong>Dataset bias</strong>: Using publication source as a proxy for sarcasm created a shortcut for the model</li>
<li><strong>Domain classification</strong>: The model exclusively learned to distinguish writing styles</li>
<li><strong>Poor generalization</strong>: the model frequently misclassified new headlines, even ones drawn from the same two sources</li>
</ol>
<p>This is a common pitfall when building datasets for subjective concepts. The lesson: high accuracy must be accompanied by validation of the model&rsquo;s actual learned behavior.</p>
<p>For better sarcasm detection, we&rsquo;d need:</p>
<ul>
<li>Diverse sources beyond two publications</li>
<li>Human annotation across multiple contexts</li>
<li>Careful evaluation on out-of-domain examples</li>
</ul>
<p>Instructive failures in ML projects provide valuable lessons about our assumptions and the limitations of our approaches.</p>
]]></content:encoded></item><item><title>EigenNoise: Data-Free Word Vector Initialization</title><link>https://hunterheidenreich.com/research/eigennoise-contrastive-prior/</link><pubDate>Sun, 01 May 2022 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/eigennoise-contrastive-prior/</guid><description>Investigation into EigenNoise, a data-free initialization scheme for word vectors that approaches pre-trained model performance after fine-tuning.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>We developed EigenNoise, a method to initialize word vectors using <strong>zero pre-training data</strong>. By deriving a co-occurrence matrix solely from the theoretical harmonic structure of language (Zipf&rsquo;s Law), this project demonstrates that we can mathematically synthesize a &ldquo;warm-start&rdquo; for NLP models. This approach challenges the reliance on massive corpora for initialization and offers a competitive alternative for low-resource environments.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Algorithmic Innovation</strong>: Created a data-free initialization scheme by modeling independent co-occurrence statistics and applying eigen-decomposition</li>
<li><strong>Theoretical Grounding</strong>: Leveraged the <strong>harmonic statistical structure</strong> of language to derive representations from first principles</li>
<li><strong>Information-Theoretic Evaluation</strong>: Utilized <strong>Minimum Description Length (MDL)</strong> probing to rigorously measure the information content and regularity of the learned representations</li>
<li><strong>Efficiency</strong>: Demonstrated that EigenNoise vectors, once fine-tuned, match the performance of GloVe vectors (trained on Gigaword) despite seeing <strong>no pre-training text</strong></li>
</ul>
<h2 id="technical-implementation">Technical Implementation</h2>
<p>The core insight is that &ldquo;noise&rdquo; in language follows a predictable distribution.</p>
<ol>
<li><strong>Modeling</strong>: We model the &ldquo;null hypothesis&rdquo; of text, how words would co-occur if they were statistically independent but followed Zipfian rank-frequency. This yields a theoretical co-occurrence matrix $\hat{X}$:</li>
</ol>
<p>$$\hat{X}_{ij} = \frac{2mN}{r_i r_j H_N}$$</p>
<p>Where $r_i$ is the rank of word $i$, $N$ is vocabulary size, $m$ is the context window size, and $H_N$ is the $N$-th harmonic number.</p>
<ol start="2">
<li>
<p><strong>Factorization</strong>: We then solve for the word vectors by performing an <strong>eigen-decomposition</strong> on this matrix, extracting the top $d$ components to form the representation space.</p>
</li>
<li>
<p><strong>Probing</strong>: Validated performance using MDL probing on CoNLL-2003 and TweetEval benchmarks.</p>
</li>
</ol>
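<p>A literal mechanization of steps 1&ndash;2 in NumPy (a sketch only; note that $\hat{X}$ as written is rank-one, so the published method presumably applies further normalization before the decomposition is informative):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import numpy as np

def eigennoise_vectors(N: int, m: int, d: int) -> np.ndarray:
    """Data-free vectors from the Zipfian co-occurrence prior (sketch).

    N: vocabulary size, words assumed sorted by Zipfian rank
    m: context window size
    d: embedding dimension
    """
    ranks = np.arange(1, N + 1, dtype=np.float64)
    H_N = (1.0 / ranks).sum()  # N-th harmonic number

    # Theoretical co-occurrence: X_hat[i, j] = 2 m N / (r_i r_j H_N)
    X_hat = (2.0 * m * N / H_N) / np.outer(ranks, ranks)

    # Eigen-decomposition of the symmetric matrix; keep top-d components.
    eigvals, eigvecs = np.linalg.eigh(X_hat)
    order = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, order] * np.sqrt(np.abs(eigvals[order]))
</code></pre></div>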
<h2 id="why-this-matters">Why This Matters</h2>
<p>This research explores how much structure can emerge from frequency statistics alone, with no text exposure at all. The central finding is that EigenNoise vectors, derived purely from Zipf&rsquo;s Law, reach competitive performance with GloVe after fine-tuning. This is evidence that a significant portion of what we call &ldquo;learned linguistic knowledge&rdquo; is a consequence of word frequency distributions, not semantic exposure to real text.</p>
<p>In 2026, small pretrained models are freely available and handle most low-resource initialization needs, so the practical case for data-free initialization is narrower than it was in 2022. The theoretical contribution remains relevant: EigenNoise establishes a clean null hypothesis for what word vectors look like when only frequency information is present. For interpretability researchers trying to disentangle frequency artifacts from genuine semantic content, this baseline has value independent of the initialization use case.</p>
<p>The <strong>MDL probing</strong> methodology applied here also contributes beyond the main result. Unlike task accuracy, MDL measures how much information a representation encodes and how compactly, providing a more principled lens for evaluating representational quality. EigenNoise&rsquo;s co-occurrence prior is grounded directly in the <strong>Independent Frequencies Model (IFM)</strong> introduced in the companion <a href="/research/word-company-vicinity/">Word2Vec factorization paper</a>. Together, the two works form a coherent theoretical line: the IFM characterizes the frequency-driven baseline of embedding space, and EigenNoise operationalizes it as a practical, data-free initialization scheme.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{heidenreich2022eigennoisecontrastivepriorwarmstart,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{EigenNoise: A Contrastive Prior to Warm-Start Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter Scott Heidenreich and Jake Ryland Williams}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2205.04376}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CL}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2205.04376}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="related-work">Related Work</h2>
<p>For the theoretical foundation underlying EigenNoise&rsquo;s null hypothesis, including the first analytical solution to Word2Vec&rsquo;s softmax objective, see <a href="/research/word-company-vicinity/">Analytical Solution to Word2Vec Softmax &amp; Bias Probing</a>.</p>
]]></content:encoded></item><item><title>Analytical Solution to Word2Vec Softmax &amp; Bias Probing</title><link>https://hunterheidenreich.com/research/word-company-vicinity/</link><pubDate>Sun, 01 May 2022 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/word-company-vicinity/</guid><description>Analytical derivation of Word2Vec's softmax objective factorization and a new framework for detecting semantic bias in raw corpora.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>While the Skip-Gram with Negative Sampling (SGNS) objective for Word2Vec has famously been shown to factorize a shifted PMI matrix, the implicit matrix factorization of the original <strong>Softmax</strong> objective has remained an open question. In this work, we provide the first known analytical solution to Word2Vec&rsquo;s softmax-optimized skip-gram algorithm.</p>
<p>We use this derivation to introduce the <strong>Independent Frequencies Model (IFM)</strong>, identifying a &ldquo;frequency-ratios property&rdquo; that unifies classical word vector models. This theoretical insight allows us to derive a low-cost, training-free method for measuring semantic bias directly from corpus statistics.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Analytical Solution</strong>: Provided the first known analytical solution to Word2Vec&rsquo;s softmax-optimized skip-gram algorithm, proving it factorizes the log-conditional probability matrix.</li>
<li><strong>Independent Frequencies Model (IFM)</strong>: Introduced a dense co-occurrence model computable purely from unigram frequencies to act as a null hypothesis for embedding structures.</li>
<li><strong>Bias Dissonance Metric</strong>: Derived a low-cost, training-free method for measuring semantic bias directly from corpus statistics using the frequency-ratios property.</li>
<li><strong>Data Transparency</strong>: Demonstrated how specific corpora exhibit distinct bias profiles, offering a tool for auditing datasets before training large models.</li>
</ul>
<h2 id="key-theoretical-results">Key Theoretical Results</h2>
<h3 id="1-the-softmax-factorization-theorem">1. The Softmax Factorization Theorem</h3>
<p>We prove that under the log-softmax objective, Word2Vec implicitly converges towards a factorization of the <strong>log-conditional probability matrix</strong> of the co-occurrence model.</p>
<p><strong>Theorem:</strong> For the objective
$\mathcal{L}_{\text{soft}} = - \sum_{t,s} F_{t,s}^m \log \varphi(\vec{u}_t \vec{v}_s^T)$,
where $\varphi$ denotes the softmax over context words, the algorithm converges to:</p>
<p>$$
\vec{u}_{t}\vec{v}_{s}^{T} = \log\frac{F_{t,s}^{m}}{f_{t}^{m}}
$$</p>
<p>where $F_{t,s}^m$ is the co-occurrence count and $f_t^m$ is the marginal frequency. This effectively makes the dot product of the embedding vectors equal to the log-conditional probability of the context word given the target word.</p>
<h3 id="2-the-independent-frequencies-model-ifm">2. The Independent Frequencies Model (IFM)</h3>
<p>To understand the baseline behavior of these models, we introduce the IFM, which models a dense co-occurrence matrix computable purely from unigram frequencies:</p>
<p>$$
\hat{F}_{t,s}^{m} = \frac{2m f_t f_s}{M}
$$</p>
<p>This model acts as a &ldquo;null hypothesis&rdquo; for embedding structures, allowing us to isolate true semantic signals from statistical noise.</p>
<h2 id="methodological-innovation-bias-dissonance">Methodological Innovation: Bias Dissonance</h2>
<p>Leveraging the frequency-ratios property derived from our factorization, we propose a metric called <strong>Dissonance ($\Delta$)</strong> to probe semantic bias in data without training a model.</p>
<p>For an analogy $A:B :: C:D$ (e.g., <em>man:king :: woman:queen</em>), we measure the alignment of their corpus frequency ratios. High dissonance indicates that the corpus statistics do not support the analogy, potentially revealing bias or under-representation.</p>
<p><strong>Intuitive Example:</strong> If a corpus contains the phrase <em>&ldquo;man is king&rdquo;</em> 100 times more often than <em>&ldquo;woman is queen,&rdquo;</em> the frequency ratios are misaligned. A perfect, unbiased analogy would have matching ratios (i.e., <em>man</em> relates to <em>king</em> at the same rate <em>woman</em> relates to <em>queen</em>). Any deviation from this symmetry is captured by our dissonance metric, revealing where the data itself encodes asymmetric associations.</p>
<p>$$
\Delta(t,s,\bar{t},\bar{s} \mid \mathcal{D}) = \left| \log\frac{f_{t}f_{\bar{s}}}{f_{s}f_{\bar{t}}} \right| \Big/ \max_{l \in \mathcal{V}} \log f_l
$$</p>
<p>By applying this to the <strong>Bigger Analogy Test Set (BATS)</strong>, we demonstrated how specific corpora (like Wikipedia vs. Google Books) exhibit distinct bias profiles regarding geographic and encyclopedic knowledge.</p>
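<p>Because the metric needs only unigram counts, it is a few lines of code (a sketch; the example counts below are made up):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math

def dissonance(f_t: int, f_s: int, f_t_bar: int, f_s_bar: int,
               max_log_f: float) -> float:
    """Bias dissonance for an analogy t:s :: t_bar:s_bar.

    f_* are corpus unigram frequencies; max_log_f is the log frequency
    of the most frequent word in the vocabulary (the normalizer).
    """
    return abs(math.log((f_t * f_s_bar) / (f_s * f_t_bar))) / max_log_f

# man:king :: woman:queen with illustrative, made-up counts:
print(dissonance(f_t=120_000, f_s=9_000, f_t_bar=95_000, f_s_bar=4_000,
                 max_log_f=math.log(2_500_000)))  # ~0.04
</code></pre></div>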
<h2 id="visualizing-statistical-independence">Visualizing Statistical Independence</h2>















<figure class="post-figure center ">
    <img src="/img/word-bias-iqr.webp"
         alt="Plot showing the portion of statistically dependent information decreasing as window size increases, with curves for different corpus sizes and an inset showing power-law decay"
         title="Plot showing the portion of statistically dependent information decreasing as window size increases, with curves for different corpus sizes and an inset showing power-law decay"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The Information Quality Ratio measuring the portion of co-occurrence information that is statistically dependent, plotted against window size. Colors indicate corpus size from the GUM corpus. The dashed lines show the IFM prediction. The inset reveals the power-law decay rate, demonstrating how linguistic dependencies diminish predictably with context distance.</figcaption>
    
</figure>

<h2 id="impact">Impact</h2>
<p>This work bridges the gap between empirical success and theoretical foundations in NLP by:</p>
<ol>
<li><strong>Solving a fundamental mechanism:</strong> Providing the missing factorization proof for Softmax Word2Vec.</li>
<li><strong>Efficient Pre-training:</strong> Suggesting that embedding layers can be &ldquo;warm-started&rdquo; using unigram statistics derived from the IFM.</li>
<li><strong>Data Transparency:</strong> Offering a computationally inexpensive tool for auditing datasets for bias before investing resources in training large models.</li>
</ol>
<h2 id="my-contribution">My Contribution</h2>
<p>Jake Williams is the first author and primary driver of this work. He developed the core theory, derived the factorization proofs, designed the dissonance metric, and ran the experiments. My role was supporting: I contributed through critique and refinement during the writing process, but the intellectual heavy lifting belongs to Jake.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{williams2022knowcompanywordslies,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{To Know by the Company Words Keep and What Else Lies in the Vicinity}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jake Ryland Williams and Hunter Scott Heidenreich}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2205.00148}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CL}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2205.00148}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="related-work">Related Work</h2>
<p>For a complementary analytical approach to word representations, deriving data-free word vector initializations from the same frequency-ratio insights, see <a href="/research/eigennoise-contrastive-prior/">EigenNoise: Data-Free Word Vector Initialization</a>.</p>
]]></content:encoded></item><item><title>GPT-2 Susceptibility to Universal Adversarial Triggers</title><link>https://hunterheidenreich.com/research/gpt2-adversarial-triggers/</link><pubDate>Sat, 01 May 2021 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/gpt2-adversarial-triggers/</guid><description>Investigation into whether universal adversarial triggers can control both topic and stance of GPT-2's generated text and security implications.</description><content:encoded><![CDATA[<blockquote>
<p><strong>Historical context:</strong> This paper was published in 2021, predating the modern red-teaming practices and adversarial robustness benchmarks that emerged with instruction-tuned and RLHF-trained models. GPT-2 is now a historical baseline, but the core methodology and findings remain a relevant foundation for current adversarial robustness work.</p></blockquote>
<h2 id="abstract">Abstract</h2>
<p>This work investigates universal adversarial triggers (UATs), a method for disrupting language models using input-agnostic token sequences. We asked whether it is possible to use these triggers to control the <strong>topic</strong> and the <strong>stance</strong> of text generated by GPT-2. Across four controversial topics, we demonstrated success in identifying triggers that guide the model to produce text on a targeted subject and influence the position it takes. Our goal is to raise awareness that even deployed models are susceptible to this influence and to advocate for immediate safeguards.</p>
<h2 id="key-findings--contributions">Key Findings &amp; Contributions</h2>
<ul>
<li><strong>Topic and Stance Control</strong>: We were the first to systematically explore using UATs to control both the topic and the stance of a language model&rsquo;s output. We found that controlling the topic is highly feasible, and controlling the stance is also possible.</li>
<li><strong>The &ldquo;Filter Bubble&rdquo; Hypothesis</strong>: We observed that triggers for fringe topics (e.g., Flat Earth) were harder to find but offered a higher degree of stance control than broader topics. We posit this may reflect &ldquo;filter bubbles&rdquo; in the training data, where fringe viewpoints use distinct linguistic patterns.</li>
<li><strong>Ethical &amp; Security Analysis</strong>: We highlighted the security risks of deployed models being manipulated by external adversaries without internal model access. To be responsible, we withheld the most sensitive triggers we discovered.</li>
<li><strong>Constructive Applications</strong>: Beyond a security flaw, we proposed that UATs could be used constructively as a <strong>diagnostic tool</strong> to audit models for bias or as a method for <strong>bot detection</strong> on social media.</li>
</ul>
<h2 id="significance--why-this-matters">Significance &amp; Why This Matters</h2>
<p>This work extended early research on UATs by moving beyond single-issue attacks (like generating toxic content) to a nuanced analysis of topic and stance control. It demonstrated that a <strong>gradient-based search process (adapting HotFlip)</strong> is effective at manipulating model outputs, emphasizing a critical vulnerability for any organization deploying large language models.</p>
<p>For ML practitioners and security researchers, this highlights the importance of robust safeguards against input-agnostic attacks. It also opens the door to using these same adversarial techniques constructively: as diagnostic tools to audit models for hidden biases or to detect automated bot activity on social media platforms.</p>
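<p>The search itself follows the HotFlip first-order approximation: score every vocabulary token as a replacement at each trigger position using the gradient of the loss with respect to the trigger embeddings. A simplified sketch (not the paper&rsquo;s exact procedure):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import torch

def hotflip_candidates(trigger_ids, grad_at_trigger, embedding_matrix, k=10):
    """First-order HotFlip replacement scores for a trigger (sketch).

    Approximates the loss change of swapping token w_i for w' at each
    position by (e_w' - e_w_i) . dL/de_i and keeps the top-k candidates
    that most decrease the target loss.

    trigger_ids:      (T,) current trigger token ids
    grad_at_trigger:  (T, D) gradient of the loss w.r.t. trigger embeddings
    embedding_matrix: (V, D) model input embeddings
    """
    cur_emb = embedding_matrix[trigger_ids]              # (T, D)
    # Score every vocabulary token at every trigger position.
    delta = embedding_matrix @ grad_at_trigger.T         # (V, T)
    delta = delta - (cur_emb * grad_at_trigger).sum(-1)  # subtract current
    return torch.topk(-delta.T, k=k, dim=-1).indices     # (T, k) candidates
</code></pre></div>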
<h2 id="related-work">Related Work</h2>
<p>The constructive bot-detection application proposed here connects directly to empirical work on coordinated inauthentic behavior. <a href="/research/coordinated-social-targeting/">Coordinated Social Targeting on Twitter</a> documents real-world follower-manipulation patterns on high-profile accounts, illustrating the kind of automated adversarial activity that UAT-based detection methods could help identify.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{10.1145/3461702.3462578,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Heidenreich, Hunter Scott and Williams, Jake Ryland}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{The Earth Is Flat and the Sun Is Not a Star: The Susceptibility of GPT-2 to Universal Adversarial Triggers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{9781450384735}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Association for Computing Machinery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{New York, NY, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1145/3461702.3462578}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1145/3461702.3462578}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{566--573}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">numpages</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">keywords</span> = <span style="color:#e6db74">{adversarial attacks, bias, language modeling, natural language processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">location</span> = <span style="color:#e6db74">{Virtual Event, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span> = <span style="color:#e6db74">{AIES &#39;21}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Data-Driven WordNet Construction from Wiktionary</title><link>https://hunterheidenreich.com/research/semantic-network-induction/</link><pubDate>Fri, 01 Nov 2019 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/semantic-network-induction/</guid><description>We introduce an unsupervised algorithm for inducing semantic networks from noisy, crowd-sourced data, producing a resource with over 344,000 linked examples.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>We introduce a novel <strong>unsupervised algorithm</strong> for inducing semantic networks from noisy, crowd-sourced data. By framing network construction as a &ldquo;relationship disambiguation&rdquo; task, we process the entirety of Wiktionary to build a massive, WordNet-like semantic resource. The resulting network is an order of magnitude larger than Princeton WordNet and features over <strong>344,000 linked example sentences</strong> (vs. WordNet&rsquo;s 68k). Evaluation on standard word similarity benchmarks demonstrates that our fully data-driven approach yields semantic structures competitive with expert-annotated resources.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Unsupervised Hierarchy Induction</strong>: We propose a deterministic algorithm to construct a Directed Acyclic Graph (DAG) of senses from pairwise relationships, effectively inducing a semantic hierarchy without human supervision.</li>
<li><strong>A Massive Semantic Resource</strong>: We release a dataset enriched with hundreds of thousands of semantically linked usage examples, serving as a critical resource for tasks like Word Sense Disambiguation (WSD).</li>
<li><strong>Novel Disambiguation Framework</strong>: We model &ldquo;relationship disambiguation&rdquo; using a Laplacian kernel and FastText embeddings to filter noisy user annotations.</li>
<li><strong>Open-Source Infrastructure</strong>: We provide a full pipeline for downloading, parsing, and constructing networks from Wiktionary data.</li>
</ul>
<h2 id="technical-approach">Technical Approach</h2>
<p>The core of our method addresses the noise inherent in crowd-sourced dictionaries. We frame the problem as <strong>Latent Semantic Network Induction</strong>:</p>
<ol>
<li><strong>Relationship Disambiguation</strong>: For every linked pair of words (e.g., <em>go</em> ~ <em>proceed</em>), we define a semantic subspace using their definitions. We utilize <strong>FastText embeddings</strong> and a <strong>Laplacian kernel</strong> to identify which specific definitions participate in the relationship.</li>
<li><strong>Hierarchy Construction</strong>: We apply a custom intersection algorithm that treats more general senses as the &ldquo;overlap&rdquo; between specific definition sets. We formalize this as a set-theoretic &ldquo;hole punching&rdquo; operation, where a general sense $t$ is defined by the intersection of definition sets $\mathbb{D}'$, excluding any broader intersections:</li>
</ol>
<p>$$f^{-1}(t) = \left(\bigcap_{\mathbb{D}'} D_{u\sim v}\right) \setminus \left(\bigcup_{\mathbb{D} \supset \mathbb{D}'} \bigcap_{\mathbb{D}} D_{u\sim v}\right)$$</p>
<h2 id="evaluation--validation">Evaluation &amp; Validation</h2>
<p>The primary achievement is scale: our induced network contains over <strong>344,000 linked example sentences</strong>, compared to Princeton WordNet&rsquo;s 68,000 (more than 5x the coverage), built entirely from crowd-sourced data without expert annotation.</p>
<p>Beyond scale, the network holds up semantically. On standard noun-similarity benchmarks (RG-65), the unsupervised network achieves a Spearman rank correlation of $\rho = 0.83$, matching the performance of Explicit Semantic Analysis (ESA) models built on expert-annotated WordNet ($\rho = 0.82$). The point is not that we beat WordNet by 0.01. It is that a fully automated approach over noisy Wiktionary data produces a resource of comparable quality at 5x the scale.</p>
<h2 id="why-this-matters">Why This Matters</h2>
<p>Building high-quality linguistic resources typically requires expensive expert annotation. Princeton WordNet took decades of lexicographer effort. This work demonstrates that an unsupervised algorithm over crowd-sourced data can produce a resource of comparable semantic quality at more than 5x the scale. For ML practitioners, that matters: larger coverage means more training signal for downstream tasks like Word Sense Disambiguation. For this portfolio, it shows early experience building structured NLP datasets from scratch, a theme that continues in later work on large-scale document corpora.</p>
<h2 id="related-work">Related Work</h2>
<p>For a theoretical treatment of word semantics from the same collaboration, including the first analytical solution to Word2Vec&rsquo;s softmax objective, see <a href="/research/word-company-vicinity/">Analytical Solution to Word2Vec Softmax &amp; Bias Probing</a>.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{heidenreich2019latent,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Latent semantic network induction in the context of linked example senses}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Heidenreich, Hunter and Williams, Jake}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{170--180}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>QuAC: Question Answering in Context Dataset</title><link>https://hunterheidenreich.com/posts/quac-question-answering-in-context/</link><pubDate>Wed, 31 Oct 2018 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/quac-question-answering-in-context/</guid><description>Analysis of QuAC's conversational QA through student-teacher interactions, featuring 100K+ context-dependent questions and coreference challenges.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>The <a href="https://aclanthology.org/D18-1241/">QuAC dataset</a> (Question Answering in Context) presents a conversational question answering approach that models student-teacher interactions. Published at EMNLP 2018, this work by Choi et al. addresses how systems can understand dialogue context, resolve references across conversation turns, and handle natural conversation ambiguity. Previous datasets had treated each question independently, sidestepping these challenges.</p>
<p>The dataset addresses limitations in question answering research by incorporating real-world information-seeking dialogue complexities, where questions build upon previous exchanges and context drives understanding.</p>
<p>For comparison with related work, see my analysis of <a href="/posts/coqa-conversation-question-answering/">CoQA</a>.</p>
<h2 id="the-student-teacher-framework">The Student-Teacher Framework</h2>
<p>QuAC models information-seeking dialogue through a student-teacher setup:</p>
<ul>
<li><strong>Teacher</strong>: Has complete access to information (Wikipedia passage)</li>
<li><strong>Student</strong>: Seeks knowledge through questioning with limited initial context</li>
<li><strong>Interaction</strong>: Handles context-dependent questions, abstract inquiries, and unanswerable requests</li>
</ul>
<p>This framework mirrors real-world scenarios where one party has expertise while another seeks to learn through dialogue. AI systems must act as effective teachers, using available information to provide helpful responses despite ambiguous or incomplete questions.</p>
<p>The dataset contains over 100,000 questions across 14,000+ dialogues, providing substantial scale for training and evaluation.</p>
<figure class="post-figure center ">
    <img src="/img/quac_stats.webp"
         alt="QuAC dataset statistics and scale"
         title="QuAC dataset statistics and scale"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">QuAC dataset statistics and scale</figcaption>
    
</figure>

<h2 id="dataset-construction">Dataset Construction</h2>
<p>QuAC was built using Amazon Mechanical Turk with a two-person dialogue setup:</p>
<p><strong>Teacher role</strong>: Has access to the complete Wikipedia passage and provides answers extracted directly from the text</p>
<p><strong>Student role</strong>: Sees only the article title, introduction paragraph, and section heading, then asks questions to learn about the content</p>
<p>This asymmetric information design ensures student questions naturally differ from the passage content, creating realistic information-seeking scenarios. The extractive answer requirement maintains objective evaluation while simplifying scoring.</p>
<p><strong>Dialogues terminate when any of the following occurs</strong>:</p>
<ul>
<li>12 questions have been answered</li>
<li>Either participant manually ends the dialogue</li>
<li>Two consecutive questions are unanswerable</li>
</ul>
<figure class="post-figure center ">
    <img src="/img/quac_convo.webp"
         alt="Example QuAC conversation showing student-teacher interaction"
         title="Example QuAC conversation showing student-teacher interaction"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example QuAC conversation showing student-teacher interaction</figcaption>
    
</figure>

<h3 id="content-selection">Content Selection</h3>
<p>QuAC focuses on Wikipedia biographical articles for several practical reasons:</p>
<ul>
<li><strong>Reduced complexity</strong>: People-focused content requires less specialized domain knowledge</li>
<li><strong>Natural question flow</strong>: Biographical information lends itself to sequential questioning</li>
<li><strong>Quality control</strong>: Articles filtered to include only subjects with 100+ incoming links, ensuring content depth</li>
</ul>
<p>This focused scope enables consistent evaluation while maintaining broad coverage through diverse biographical subjects across fields and time periods.</p>
<h2 id="key-dataset-characteristics">Key Dataset Characteristics</h2>
<p>QuAC introduces several features that distinguish it from existing question answering benchmarks:</p>
<figure class="post-figure center ">
    <img src="/img/quac_comparison.webp"
         alt="Comparative analysis of QuAC against other QA datasets"
         title="Comparative analysis of QuAC against other QA datasets"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Comparative analysis of QuAC against other QA datasets</figcaption>
    
</figure>

<p><strong>Notable features</strong>:</p>
<ul>
<li><strong>High contextual dependency</strong>: 86% of questions require coreference resolution</li>
<li><strong>Non-factoid focus</strong>: 54% of questions go beyond simple fact retrieval</li>
<li><strong>Extended answers</strong>: Responses are longer and more detailed</li>
<li><strong>Unanswerable questions</strong>: Realistic scenarios where information isn&rsquo;t available</li>
</ul>
<figure class="post-figure center ">
    <img src="/img/quac_dist.webp"
         alt="Distribution of question types in QuAC"
         title="Distribution of question types in QuAC"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Distribution of question types in QuAC</figcaption>
    
</figure>

<h3 id="the-coreference-resolution-challenge">The Coreference Resolution Challenge</h3>
<p>QuAC&rsquo;s complexity stems from its heavy reliance on coreference resolution across multiple contexts:</p>
<p><strong>Reference types</strong>:</p>
<ul>
<li><strong>Passage references</strong>: Pronouns and references to entities in the source text</li>
<li><strong>Dialogue references</strong>: References to previously discussed topics</li>
<li><strong>Abstract references</strong>: Challenging cases like &ldquo;what else?&rdquo; that require inferring the inquiry scope</li>
</ul>
<figure class="post-figure center ">
    <img src="/img/quac_coref.webp"
         alt="Types and distribution of coreferences in QuAC"
         title="Types and distribution of coreferences in QuAC"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Types and distribution of coreferences in QuAC</figcaption>
    
</figure>

<p>The prevalence of coreference resolution makes QuAC particularly challenging, as this remains an active research problem in NLP. Models must understand passage content, track dialogue history, and resolve complex referential expressions simultaneously.</p>
<h2 id="performance-results">Performance Results</h2>
<p>Models face substantial challenges on QuAC, with significant gaps between human and machine performance:</p>
<figure class="post-figure center ">
    <img src="/img/quac_performance.webp"
         alt="Baseline model performance comparison on QuAC"
         title="Baseline model performance comparison on QuAC"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Baseline model performance comparison on QuAC</figcaption>
    
</figure>

<p><strong>Performance summary</strong>:</p>
<ul>
<li><strong>Human performance</strong>: 81.1% F1 score</li>
<li><strong>Best baseline</strong>: BiDAF++ with context achieves 60.2% F1</li>
<li><strong>Performance gap</strong>: 20+ point difference shows room for improvement</li>
</ul>
<h3 id="human-equivalence-metrics">Human Equivalence Metrics</h3>
<p>QuAC introduces evaluation metrics beyond traditional F1 scores:</p>
<p><strong>HEQ-Q (Human Equivalence Question-level)</strong>: Percentage of questions where the model achieves human-level or better performance</p>
<p><strong>HEQ-D (Human Equivalence Dialogue-level)</strong>: Percentage of complete dialogues where the model matches human performance across all questions</p>
<p><strong>Current results</strong>:</p>
<ul>
<li>Human baseline: 100% HEQ-Q, 100% HEQ-D (by definition)</li>
<li>Best model: 55.1% HEQ-Q, 5.2% HEQ-D</li>
</ul>
<p>These metrics show both average performance and consistency across questions and conversations, important for practical dialogue systems.</p>
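<p>Both metrics are easy to compute from per-question F1 scores. Below is a minimal sketch, assuming aligned lists of model F1, human F1, and dialogue IDs; the variable names are mine, not those of the official evaluation script.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def heq_scores(model_f1, human_f1, dialogue_ids):
    # model_f1 / human_f1: per-question F1 values, aligned by index.
    # dialogue_ids: which dialogue each question belongs to.
    passed = [m >= h for m, h in zip(model_f1, human_f1)]
    heq_q = sum(passed) / len(passed)

    by_dialogue = {}
    for ok, d in zip(passed, dialogue_ids):
        by_dialogue.setdefault(d, []).append(ok)
    heq_d = sum(all(v) for v in by_dialogue.values()) / len(by_dialogue)
    return heq_q, heq_d

# Toy usage: three questions across two dialogues.
print(heq_scores([0.9, 0.4, 1.0], [0.8, 0.8, 0.9], ["d1", "d1", "d2"]))
# -> (0.666..., 0.5): 2/3 questions pass, 1/2 dialogues pass fully
</code></pre></div>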
<h2 id="research-impact">Research Impact</h2>
<p>QuAC represents an important step in question answering research by introducing realistic conversational dynamics that existing datasets lack. The student-teacher framework captures natural information-seeking behavior while maintaining extractive evaluation for objective assessment.</p>
<p><strong>Key contributions</strong>:</p>
<ul>
<li><strong>Conversational realism</strong>: Context-dependent questions that mirror dialogue patterns</li>
<li><strong>Coreference complexity</strong>: Integration of challenging NLP problems into QA evaluation</li>
<li><strong>Evaluation metrics</strong>: HEQ scores that measure consistency alongside average performance</li>
<li><strong>Large-scale framework</strong>: Substantial dataset enabling robust model training and evaluation</li>
</ul>
<p>The dataset&rsquo;s <a href="https://quac.ai/">leaderboard</a> provides researchers with a challenging benchmark for developing conversational AI systems. As models improve on QuAC, we can expect progress in dialogue agents, virtual assistants, and educational AI systems that engage in more natural, context-aware conversations.</p>
<p>QuAC&rsquo;s focus on dialogue context and reference resolution pushes the field toward AI systems that can engage in genuine conversation and understand complex dialogue flows.</p>
<h2 id="a-builders-perspective-quac-and-modern-instruction-tuning">A Builder&rsquo;s Perspective: QuAC and Modern Instruction Tuning</h2>
<p>Looking at QuAC through the lens of modern production ML, the student-teacher framework is incredibly relevant. Today, we train foundation models using Reinforcement Learning from Human Feedback (RLHF) and instruction tuning, which rely heavily on multi-turn, context-aware interactions.</p>
<p>When building systems like GutenOCR or enterprise document processing pipelines, users rarely ask perfectly formulated, context-free questions. They ask follow-ups, use pronouns, and expect the system to act as a knowledgeable &ldquo;teacher&rdquo; guiding them through the document. QuAC was one of the first datasets to formalize this asymmetric information dynamic. It highlighted the necessity of handling unanswerable questions gracefully, a critical feature for preventing hallucinations in today&rsquo;s production LLMs.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{choi-etal-2018-quac,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">&#34;{Q}u{AC}: Question Answering in Context&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">&#34;Choi, Eunsol and He, He and Iyyer, Mohit and Yatskar, Mark and Yih, Wen-tau and Choi, Yejin and Liang, Percy and Zettlemoyer, Luke&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">&#34;Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">month</span> = oct # <span style="color:#e6db74">&#34;-&#34;</span> # nov,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">&#34;2018&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">&#34;Brussels, Belgium&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">&#34;Association for Computational Linguistics&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">&#34;https://aclanthology.org/D18-1241/&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">&#34;10.18653/v1/D18-1241&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">&#34;2174--2184&#34;</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CoQA Dataset: Advancing Conversational Question Answering</title><link>https://hunterheidenreich.com/posts/coqa-conversation-question-answering/</link><pubDate>Thu, 23 Aug 2018 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/coqa-conversation-question-answering/</guid><description>Analysis of CoQA, a conversational QA dataset with multi-turn dialogue, coreference resolution, and natural answers for QA research.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>The <a href="https://doi.org/10.1162/tacl_a_00266">CoQA dataset</a> (Reddy et al., 2019) introduces conversational dynamics to question answering research. CoQA requires models to maintain context across multi-turn conversations while reading and reasoning about text passages, whereas previous datasets focused on isolated question-answer pairs.</p>
<p>This dataset addresses a gap in conversational AI research by providing a benchmark for systems that must understand dialogue flow and implicit references. These are key components of natural human conversation.</p>
<p>For related work on conversational question answering, see my analysis of <a href="/posts/quac-question-answering-in-context/">QuAC</a>.</p>
<h2 id="what-makes-conversational-qa-different">What Makes Conversational QA Different</h2>
<p>Conversational question answering introduces challenges beyond traditional reading comprehension:</p>
<ol>
<li><strong>Context dependency</strong>: Questions rely on previous dialogue turns for meaning</li>
<li><strong>Coreference resolution</strong>: Understanding pronouns and implicit references</li>
<li><strong>Abstractive answering</strong>: Rephrasing information to generate natural responses</li>
<li><strong>Multi-turn reasoning</strong>: Maintaining coherent dialogue across multiple exchanges</li>
</ol>
<p>These requirements differentiate CoQA from existing question answering datasets that treat each question independently.</p>
<h2 id="why-coqa-matters">Why CoQA Matters</h2>
<p>Question answering systems typically excel at finding specific information in text. However, they often struggle with natural conversation. Human communication involves building on previous exchanges, using pronouns and implicit references, and expressing ideas in varied ways.</p>
<p>CoQA addresses this by creating a large-scale dataset for conversational question answering with three primary characteristics:</p>
<ol>
<li>
<p><strong>Conversation-dependent questions</strong>: After the first question, every subsequent question depends on the dialogue history; in total, the dataset spans 127,000 questions across 8,000 conversations</p>
</li>
<li>
<p><strong>Natural, abstractive answers</strong>: CoQA requires rephrased responses that sound natural in conversation. The answerer first highlighted the relevant text span, then rephrased the information.</p>
</li>
<li>
<p><strong>Domain diversity</strong>: Training covers 5 domains with testing on 7 domains, including 2 unseen during training</p>
</li>
</ol>
<p>The performance gap is notable: humans achieve 88.8% F1 score while the best models at the time reached 65.1% F1, indicating substantial room for improvement.</p>
<h2 id="dataset-construction">Dataset Construction</h2>
<p>CoQA was constructed using Amazon Mechanical Turk, pairing workers in a question-answer dialogue setup. One worker asked questions about a given passage while another provided answers. The answerer first highlighted the relevant text span, then rephrased the information using different words to create natural, abstractive responses.</p>
<p>This methodology produces answers that sound conversational. This makes the dataset highly realistic for dialogue applications.</p>
<h3 id="domain-coverage">Domain Coverage</h3>
<p>CoQA spans diverse text types to ensure evaluation across different writing styles and topics:</p>
<p><strong>Training domains (5):</strong></p>
<ul>
<li>Children&rsquo;s stories from <a href="https://web.archive.org/web/20180829214346/https://uclmr.github.io/ai4exams/data.html#mctest">MCTest</a></li>
<li>Literature from <a href="https://www.gutenberg.org/">Project Gutenberg</a></li>
<li>Educational content from <a href="https://www.cs.cmu.edu/~glai1/data/race/">RACE</a> (middle/high school English)</li>
<li>CNN news articles</li>
<li>Wikipedia articles</li>
</ul>
<p><strong>Test-only domains (2):</strong></p>
<ul>
<li>Science articles from <a href="http://data.allenai.org/ai2-science-questions/">AI2 Science Questions</a></li>
<li>Creative writing from <a href="https://www.reddit.com/r/WritingPrompts/">Reddit WritingPrompts</a></li>
</ul>
<figure class="post-figure center ">
    <img src="/img/coqa_domains.webp"
         alt="Domain distribution in the CoQA dataset"
         title="Domain distribution in the CoQA dataset"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Domain distribution in the CoQA dataset</figcaption>
    
</figure>

<p>The inclusion of test-only domains provides a rigorous evaluation of model generalization to unseen text types.</p>
<h2 id="comparison-with-existing-datasets">Comparison with Existing Datasets</h2>
<p>Prior to CoQA, the dominant question answering benchmark was <a href="https://rajpurkar.github.io/SQuAD-explorer/">SQuAD (Stanford Question Answering Dataset)</a>. SQuAD established foundations for reading comprehension and presented specific constraints:</p>
<ul>
<li><strong>SQuAD 1.0</strong>: 100,000+ questions requiring exact text extraction from Wikipedia passages</li>
<li><strong>SQuAD 2.0</strong>: Added 50,000+ unanswerable questions to test when no answer exists</li>
</ul>
<figure class="post-figure center ">
    <img src="/img/squad_coqa_size.webp"
         alt="Scale comparison between SQuAD and CoQA datasets"
         title="Scale comparison between SQuAD and CoQA datasets"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Scale comparison between SQuAD and CoQA datasets</figcaption>
    
</figure>

<p>SQuAD treats each question independently and requires only extractive answers. CoQA addresses these constraints through conversational context and abstractive responses.</p>
<h3 id="question-and-answer-analysis">Question and Answer Analysis</h3>
<p>The differences between SQuAD and CoQA extend beyond conversational context:</p>
<p><strong>Question diversity</strong>: SQuAD heavily favors &ldquo;what&rdquo; questions (~50%). CoQA shows a more balanced distribution across question types, reflecting natural conversation patterns.</p>
<figure class="post-figure center ">
    <img src="/img/squad_v_coqa.webp"
         alt="Question type distribution comparison between SQuAD and CoQA"
         title="Question type distribution comparison between SQuAD and CoQA"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Question type distribution comparison between SQuAD and CoQA</figcaption>
    
</figure>

<p><strong>Context dependence</strong>: CoQA includes challenging single-word questions like &ldquo;who?&rdquo;, &ldquo;where?&rdquo;, or &ldquo;why?&rdquo; that depend entirely on dialogue history.</p>
<p><strong>Answer characteristics</strong>: CoQA answers vary significantly in length and style, whereas SQuAD primarily features extractive spans.</p>
<figure class="post-figure center ">
    <img src="/img/squad_coqa_answers.webp"
         alt="Answer length distribution in SQuAD vs CoQA"
         title="Answer length distribution in SQuAD vs CoQA"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Answer length distribution in SQuAD vs CoQA</figcaption>
    
</figure>

<h2 id="the-coreference-challenge">The Coreference Challenge</h2>
<p>CoQA&rsquo;s difficulty stems largely from its reliance on coreference resolution (determining when different expressions refer to the same entity). This remains a challenging research problem in NLP.</p>
<p><strong>Coreference types in CoQA</strong>:</p>
<ul>
<li><strong>Explicit coreferences</strong> (~50% of questions): Clear indicators like pronouns (&ldquo;him,&rdquo; &ldquo;it,&rdquo; &ldquo;her,&rdquo; &ldquo;that&rdquo;)</li>
<li><strong>Implicit coreferences</strong> (~20% of questions): Context-dependent references requiring inference (e.g., asking &ldquo;where?&rdquo; without specifying what)</li>
</ul>
<figure class="post-figure center ">
    <img src="/img/coqa_coreferences.webp"
         alt="Distribution of coreference types in CoQA questions"
         title="Distribution of coreference types in CoQA questions"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Distribution of coreference types in CoQA questions</figcaption>
    
</figure>

<p>These linguistic phenomena make CoQA more difficult than traditional reading comprehension, as models must resolve references across dialogue turns while maintaining conversational coherence.</p>
<h2 id="performance-benchmarks">Performance Benchmarks</h2>
<p>Models faced significant challenges on CoQA, with substantial room for improvement:</p>
<figure class="post-figure center ">
    <img src="/img/coqa_scores.webp"
         alt="Performance comparison on CoQA across different model types"
         title="Performance comparison on CoQA across different model types"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Performance comparison on CoQA across different model types</figcaption>
    
</figure>

<p>The performance gap between human and machine capabilities highlighted conversational question answering as a challenging frontier in NLP research.</p>
<h2 id="research-impact-and-future-directions">Research Impact and Future Directions</h2>
<p>CoQA represents a step toward more natural conversational AI systems. By requiring models to handle dialogue context, coreference resolution, and abstractive reasoning simultaneously, it challenges current NLP system capabilities.</p>
<p>The dataset&rsquo;s <a href="https://stanfordnlp.github.io/coqa/">leaderboard</a> provides a benchmark for measuring progress on this task. As models improve on CoQA, we can expect advances in conversational AI applications, from chatbots to virtual assistants that engage in more natural, context-aware dialogue.</p>
<p>The authors position CoQA as a potential parallel to ImageNet&rsquo;s impact on computer vision: a challenging, well-constructed benchmark that drives research toward more capable AI systems.</p>
<h2 id="a-builders-perspective-coqa-in-the-era-of-llms">A Builder&rsquo;s Perspective: CoQA in the Era of LLMs</h2>
<p>Looking back at CoQA from the perspective of modern production systems, this dataset was highly prescient. The challenges it introduced, such as multi-turn reasoning, coreference resolution, and abstractive answering, are the exact capabilities we now expect from instruction-tuned Large Language Models (LLMs).</p>
<p>When building document processing pipelines at scale, we rarely extract isolated facts. Users want to chat with their documents, asking follow-up questions like, &ldquo;What does that mean for the Q3 budget?&rdquo; Resolving &ldquo;that&rdquo; to a previous turn&rsquo;s context is exactly what CoQA formalized. Datasets like CoQA laid the groundwork for the conversational interfaces we build today, shifting the field&rsquo;s focus from simple extraction to genuine dialogue comprehension.</p>
<h2 id="references">References</h2>
<p>Reddy, S., Chen, D., &amp; Manning, C. D. (2019). CoQA: A conversational question answering challenge. <em>Transactions of the Association for Computational Linguistics</em>, 7, 249-266.</p>
]]></content:encoded></item><item><title>Word Embeddings in NLP: An Introduction</title><link>https://hunterheidenreich.com/posts/intro-to-word-embeddings/</link><pubDate>Sun, 05 Aug 2018 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/intro-to-word-embeddings/</guid><description>Learn about word embeddings in NLP: from basic one-hot encoding to contextual models like ELMo. Guide with examples.</description><content:encoded><![CDATA[<h2 id="understanding-word-embeddings">Understanding Word Embeddings</h2>
<p>A word embedding maps words to real-valued vectors:</p>
<p>$$
\text{word} \rightarrow \mathbb{R}^n
$$</p>
<p>where $n$ represents the dimensionality of the embedding space.</p>
<p>The goal is simple: position semantically similar words close together in vector space. This dense representation typically uses hundreds of dimensions, a massive reduction from the vocabulary-sized vectors (potentially millions of dimensions) required by one-hot encoding.</p>
<p>Word embeddings are grounded in <a href="https://en.wikipedia.org/wiki/Distributional_semantics">Zellig Harris&rsquo; distributional hypothesis</a>: words appearing in similar contexts tend to have similar meanings. This forms the foundation of distributional semantics.</p>
<figure class="post-figure center ">
    <img src="/img/distributional_semantics-50.webp"
         alt="Distributional semantics visualization"
         title="Distributional semantics visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Words embedded in three-dimensional space, organized by semantic similarity</figcaption>
    
</figure>

<p>Different embedding algorithms capture various aspects of this distributional principle. This post explores the main methods for creating word embeddings and their applications in natural language processing.</p>
<p>While modern foundation models and terabyte-scale Vision-Language Models (VLMs) rely on advanced subword tokenizers (like BPE) and massive Transformer embedding layers, the fundamental goal remains exactly the same: mapping discrete text to a continuous vector space where math can capture meaning. Understanding these foundational techniques provides the necessary intuition for debugging and scaling today&rsquo;s production ML systems.</p>
<h2 id="why-word-embeddings-matter-in-nlp">Why Word Embeddings Matter in NLP</h2>
<p>Computers require numerical representations to apply machine learning algorithms to text. Word embeddings bridge this gap by converting text into dense vectors that preserve semantic and syntactic relationships.</p>
<p><strong>Key advantages:</strong></p>
<ol>
<li><strong>Dense representation</strong>: Hundreds of dimensions provide a compact alternative to vocabulary-sized sparse vectors.</li>
<li><strong>Semantic preservation</strong>: Similar words cluster together in vector space.</li>
<li><strong>Mathematical operations</strong>: Enable analogical reasoning ($\text{king} - \text{man} + \text{woman} \approx \text{queen}$).</li>
<li><strong>Transfer learning</strong>: Pre-trained embeddings work across multiple tasks and domains.</li>
</ol>
<p>Modern deep learning architectures leverage these properties extensively. The development of universal, pre-trained embeddings was a significant step forward. We can use versatile embeddings that generalize across applications, eliminating the need to train task-specific representations from scratch.</p>
<h2 id="word-embedding-approaches">Word Embedding Approaches</h2>
<h3 id="one-hot-encoding-and-count-vectorization">One-Hot Encoding and Count Vectorization</h3>
<p>One-hot encoding represents the simplest approach to word vectorization. Each word gets a unique dimension in a vocabulary-sized vector, marked with 1 for presence and 0 elsewhere. Count vectorization extends this by counting the occurrences of each word in a document.</p>
<figure class="post-figure center ">
    <img src="/img/word_vector_onehot-50.webp"
         alt="One-hot encoding visualization"
         title="One-hot encoding visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">One-hot encoding creates sparse vectors with single active dimensions</figcaption>
    
</figure>

<p><strong>Characteristics:</strong></p>
<ul>
<li><strong>High dimensionality</strong>: Vector length equals vocabulary size.</li>
<li><strong>Extreme sparsity</strong>: Most dimensions contain zeros.</li>
<li><strong>No relationships</strong>: Treats all words as equally distant.</li>
<li><strong>Computational efficiency</strong>: Simple to implement and understand.</li>
</ul>
<p>While lacking semantic information, count vectorization serves as a foundation for more complex methods. Let&rsquo;s look at a practical implementation using scikit-learn&rsquo;s <code>CountVectorizer</code>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.feature_extraction.text <span style="color:#f92672">import</span> CountVectorizer
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Initialize the vectorizer</span>
</span></span><span style="display:flex;"><span>vectorizer <span style="color:#f92672">=</span> CountVectorizer()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Sample text for demonstration</span>
</span></span><span style="display:flex;"><span>sample_text <span style="color:#f92672">=</span> [<span style="color:#e6db74">&#34;One of the most basic ways we can numerically represent words &#34;</span>
</span></span><span style="display:flex;"><span>               <span style="color:#e6db74">&#34;is through the one-hot encoding method (also sometimes called &#34;</span>
</span></span><span style="display:flex;"><span>               <span style="color:#e6db74">&#34;count vectorizing).&#34;</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Fit the vectorizer to our text data</span>
</span></span><span style="display:flex;"><span>vectorizer<span style="color:#f92672">.</span>fit(sample_text)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Examine the vocabulary and word indices</span>
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#39;Vocabulary:&#39;</span>)
</span></span><span style="display:flex;"><span>print(vectorizer<span style="color:#f92672">.</span>vocabulary_)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Transform text to vectors</span>
</span></span><span style="display:flex;"><span>vector <span style="color:#f92672">=</span> vectorizer<span style="color:#f92672">.</span>transform(sample_text)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#39;Full vector:&#39;</span>)
</span></span><span style="display:flex;"><span>print(vector<span style="color:#f92672">.</span>toarray())
</span></span></code></pre></div><p>In a production environment, count vectorization introduces significant engineering challenges. When processing millions of documents, the vocabulary size explodes. Storing and computing on these massive sparse matrices quickly leads to memory exhaustion. In these scaling scenarios, practitioners often turn to the <strong>Hashing Trick</strong> (via <code>HashingVectorizer</code>) to bound the dimensionality, or they move entirely to the dense embeddings discussed later in this post.</p>
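<p>As a quick sketch of that escape hatch: <code>HashingVectorizer</code> hashes each token into a fixed number of buckets, so the representation size is bounded no matter how large the vocabulary grows (at the cost of occasional collisions and losing the inverse vocabulary lookup). The feature count below is an arbitrary choice for illustration.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from sklearn.feature_extraction.text import HashingVectorizer

# Dimensionality is fixed up front: every token hashes into one of
# 2**18 buckets, so memory no longer grows with the vocabulary.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)

# HashingVectorizer is stateless, so no fit() pass is required.
X = vectorizer.transform(["a stream of documents", "at production scale"])
print(X.shape)  # (2, 262144), stored as a sparse matrix
</code></pre></div>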
<p>We can see count vectorization in action with a real dataset, building a simple text classifier for the <a href="https://www.kaggle.com/datasets/crawford/20-newsgroups">20 Newsgroups dataset</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.datasets <span style="color:#f92672">import</span> fetch_20newsgroups
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.feature_extraction.text <span style="color:#f92672">import</span> CountVectorizer
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.naive_bayes <span style="color:#f92672">import</span> MultinomialNB
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn <span style="color:#f92672">import</span> metrics
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Load train and test splits, removing metadata for a cleaner signal</span>
</span></span><span style="display:flex;"><span>newsgroups_train <span style="color:#f92672">=</span> fetch_20newsgroups(subset<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;train&#39;</span>,
</span></span><span style="display:flex;"><span>                                      remove<span style="color:#f92672">=</span>(<span style="color:#e6db74">&#39;headers&#39;</span>, <span style="color:#e6db74">&#39;footers&#39;</span>, <span style="color:#e6db74">&#39;quotes&#39;</span>))
</span></span><span style="display:flex;"><span>newsgroups_test <span style="color:#f92672">=</span> fetch_20newsgroups(subset<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;test&#39;</span>,
</span></span><span style="display:flex;"><span>                                     remove<span style="color:#f92672">=</span>(<span style="color:#e6db74">&#39;headers&#39;</span>, <span style="color:#e6db74">&#39;footers&#39;</span>, <span style="color:#e6db74">&#39;quotes&#39;</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Initialize and fit vectorizer on training data</span>
</span></span><span style="display:flex;"><span>vectorizer <span style="color:#f92672">=</span> CountVectorizer()
</span></span><span style="display:flex;"><span>X_train <span style="color:#f92672">=</span> vectorizer<span style="color:#f92672">.</span>fit_transform(newsgroups_train<span style="color:#f92672">.</span>data)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Build and train classifier</span>
</span></span><span style="display:flex;"><span>classifier <span style="color:#f92672">=</span> MultinomialNB(alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.01</span>)
</span></span><span style="display:flex;"><span>classifier<span style="color:#f92672">.</span>fit(X_train, newsgroups_train<span style="color:#f92672">.</span>target)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Transform test data and make predictions</span>
</span></span><span style="display:flex;"><span>X_test <span style="color:#f92672">=</span> vectorizer<span style="color:#f92672">.</span>transform(newsgroups_test<span style="color:#f92672">.</span>data)
</span></span><span style="display:flex;"><span>y_pred <span style="color:#f92672">=</span> classifier<span style="color:#f92672">.</span>predict(X_test)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Evaluate performance</span>
</span></span><span style="display:flex;"><span>accuracy <span style="color:#f92672">=</span> metrics<span style="color:#f92672">.</span>accuracy_score(newsgroups_test<span style="color:#f92672">.</span>target, y_pred)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;Accuracy: </span><span style="color:#e6db74">{</span>accuracy<span style="color:#e6db74">:</span><span style="color:#e6db74">.3f</span><span style="color:#e6db74">}</span><span style="color:#e6db74">&#39;</span>)
</span></span></code></pre></div><p>This provides a solid baseline. To capture actual semantic meaning and reduce dimensionality, we must move beyond simple counting.</p>
<h3 id="tf-idf-term-frequency-inverse-document-frequency">TF-IDF (Term Frequency-Inverse Document Frequency)</h3>
<p><a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html">TF-IDF</a> extends one-hot encoding by weighting terms based on their importance across a document collection. TF-IDF combines:</p>
<ul>
<li><strong>Term Frequency (TF)</strong>: How often a word appears in a document</li>
<li><strong>Inverse Document Frequency (IDF)</strong>: How rare a word is across all documents</li>
</ul>
<p>This weighting scheme reduces the impact of common words (like &ldquo;the&rdquo; or &ldquo;and&rdquo;) while emphasizing distinctive terms that appear frequently in specific documents but rarely elsewhere.</p>
<p><strong>Advantages:</strong></p>
<ul>
<li>Captures document-level importance</li>
<li>Reduces impact of stop words</li>
<li>Effective for information retrieval tasks</li>
</ul>
<p><strong>Limitations:</strong></p>
<ul>
<li>Still high-dimensional and sparse</li>
<li>No semantic relationships between terms</li>
<li>Context-independent representation</li>
</ul>
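<p>A small example makes the weighting visible. Using scikit-learn&rsquo;s <code>TfidfVectorizer</code> on a toy corpus (the documents are mine, purely for illustration), ubiquitous words receive low IDF values while rare, distinctive terms are boosted:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "quantum entanglement of photons"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# "the" appears in most documents, so its IDF (and final weight) is low;
# "quantum" appears in only one, so it is weighted heavily.
for term in ("the", "quantum"):
    idx = vectorizer.vocabulary_[term]
    print(term, round(vectorizer.idf_[idx], 3))
</code></pre></div>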
<h3 id="co-occurrence-matrices">Co-Occurrence Matrices</h3>
<p>Co-occurrence matrices capture word relationships by recording which terms appear together within defined contexts (sentences, paragraphs, or fixed windows). The resulting matrix has dimensions equal to vocabulary size squared, with entries showing co-occurrence frequency.</p>
<figure class="post-figure center ">
    <img src="/img/Word_co-occurrence_network_%28range_3_words%29_-_ENG-50.webp"
         alt="Co-occurrence network visualization"
         title="Co-occurrence network visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Co-occurrence relationships within a three-word window</figcaption>
    
</figure>

<p><strong>Key properties:</strong></p>
<ul>
<li><strong>Global statistics</strong>: Captures corpus-wide word relationships</li>
<li><strong>Symmetric relationships</strong>: Mutual co-occurrence patterns</li>
<li><strong>Extreme dimensionality</strong>: Vocabulary size squared creates storage challenges</li>
<li><strong>Sparse representation</strong>: Most word pairs never co-occur</li>
</ul>
<p>While computationally expensive to store and process, co-occurrence matrices form the foundation for advanced methods like GloVe that compress this information into dense representations.</p>
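<p>Conceptually, building such a matrix is just windowed counting. Here is a minimal sketch over pre-tokenized sentences; a real pipeline would use sparse storage and a far larger corpus:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from collections import Counter

def cooccurrence_counts(sentences, window=3):
    # Symmetric co-occurrence: count each pair of tokens that appear
    # within `window` positions of each other.
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for c in tokens[i + 1:i + 1 + window]:
                counts[tuple(sorted((w, c)))] += 1
    return counts

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
print(cooccurrence_counts(corpus, window=2).most_common(3))
</code></pre></div>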
<h2 id="neural-network-based-embeddings">Neural Network-Based Embeddings</h2>
<h3 id="neural-probabilistic-language-models">Neural Probabilistic Language Models</h3>
<p><a href="https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf">Neural probabilistic models</a> pioneered the use of neural networks for learning word embeddings. These models learn dense representations as a byproduct of language modeling, predicting the next word in a sequence.</p>















<figure class="post-figure center ">
    <img src="/img/bengio-npm-50.webp"
         alt="Neural probabilistic model diagram"
         title="Neural probabilistic model diagram"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Architecture of neural probabilistic language models</figcaption>
    
</figure>

<p><strong>Training process:</strong></p>
<ol>
<li>Initialize random dense embeddings for each vocabulary word</li>
<li>Use embeddings as inputs to predict language modeling objectives</li>
<li>Update embeddings through backpropagation based on prediction errors</li>
<li>Resulting embeddings capture patterns useful for the training task</li>
</ol>
<p>This approach demonstrated that task-specific embeddings could be learned jointly with model objectives, establishing the foundation for modern embedding methods.</p>
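<p>To make the architecture concrete, here is a minimal Bengio-style sketch in PyTorch (my own simplification: a fixed context window, one hidden layer, no training loop). The <code>nn.Embedding</code> table holds exactly the word vectors that emerge as a byproduct of the next-word objective:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import torch
import torch.nn as nn

class NeuralLM(nn.Module):
    # Embed a fixed window of previous words, concatenate the
    # embeddings, and predict a distribution over the next word.
    def __init__(self, vocab_size, embed_dim=64, context=3, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(context * embed_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, context_ids):   # (batch, context)
        e = self.embed(context_ids)   # (batch, context, embed_dim)
        return self.ff(e.flatten(1))  # (batch, vocab_size) logits

model = NeuralLM(vocab_size=100)
ctx = torch.randint(0, 100, (4, 3))  # batch of 4 three-word contexts
print(model(ctx).shape)              # torch.Size([4, 100])
</code></pre></div>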
<h3 id="word2vec">Word2Vec</h3>
<p><a href="https://code.google.com/archive/p/word2vec/">Word2Vec</a> revolutionized word embeddings by introducing efficient training algorithms for massive corpora. It became the first method to demonstrate compelling vector arithmetic properties, enabling analogical reasoning like the famous &ldquo;$\text{king} - \text{man} + \text{woman} \approx \text{queen}$&rdquo; example.</p>















<figure class="post-figure center ">
    <img src="/img/Word_vector_illustration.webp"
         alt="Word2Vec vector arithmetic visualization"
         title="Word2Vec vector arithmetic visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Word2Vec demonstrates analogical relationships through vector arithmetic</figcaption>
    
</figure>

<p><strong>Two training architectures:</strong></p>
<h4 id="continuous-bag-of-words-cbow">Continuous Bag-of-Words (CBOW)</h4>
<p>Predicts target words from surrounding context words. Given a window of context words, the model learns to predict the central word.</p>
<h4 id="skip-gram">Skip-Gram</h4>
<p>Predicts context words from target words. Given a central word, the model learns to predict surrounding words within a defined window.</p>
<p><strong>Key advantages:</strong></p>
<ul>
<li><strong>Computational efficiency</strong>: Much faster than neural probabilistic models</li>
<li><strong>Scalable training</strong>: Can process billion-word corpora effectively</li>
<li><strong>Quality embeddings</strong>: Captures semantic and syntactic relationships</li>
<li><strong>Flexible context</strong>: Window size controls topical vs. functional similarity</li>
</ul>
<p>The choice of window size significantly impacts learned relationships. Larger windows capture topical associations, while smaller windows focus on syntactic and functional similarities.</p>
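<p>In practice, both architectures are a few lines with Gensim. A minimal sketch, assuming the Gensim 4.x API; the toy corpus is far too small to learn meaningful vectors and is only there to show the knobs:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# sg=1 selects skip-gram; sg=0 (the default) selects CBOW.
# window controls the context size discussed above.
model = Word2Vec(sentences, vector_size=50, window=2, sg=1,
                 min_count=1, epochs=50, seed=0)
print(model.wv.most_similar("cat", topn=3))
</code></pre></div>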
<h3 id="glove-global-vectors">GloVe (Global Vectors)</h3>
<p><a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> combines the best aspects of matrix factorization methods (which capture global corpus statistics) and local context window approaches like Word2Vec. Matrix factorization methods excel at global patterns but struggle with analogical reasoning, while Word2Vec captures local relationships but may miss global structure.</p>
<p><strong>Key innovation:</strong>
GloVe trains on a global word-context co-occurrence matrix, incorporating corpus-wide statistical information while maintaining the analogical reasoning capabilities that made Word2Vec successful.</p>
<p><strong>Advantages over Word2Vec:</strong></p>
<ul>
<li><strong>Global optimization</strong>: Leverages entire corpus statistics</li>
<li><strong>Better performance</strong>: Often outperforms Word2Vec on word similarity and analogy tasks</li>
<li><strong>Stable training</strong>: More consistent convergence due to global objective function</li>
</ul>
<p>The result is embeddings that capture both local syntactic patterns and global semantic relationships more effectively.</p>
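<p>Pre-trained GloVe vectors ship as plain text files, which makes the analogy arithmetic easy to reproduce with NumPy alone. A minimal sketch, assuming a downloaded Stanford release such as <code>glove.6B.100d.txt</code> (the file name and path are assumptions):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import numpy as np

def load_glove(path):
    # Each line of a GloVe text file: word v1 v2 ... vn
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vecs[word] = np.asarray(values, dtype=np.float32)
    return vecs

vecs = load_glove("glove.6B.100d.txt")  # path is an assumption
query = vecs["king"] - vecs["man"] + vecs["woman"]

# Nearest neighbor by cosine similarity, excluding the query words.
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: np.dot(vecs[w], query) /
                         (np.linalg.norm(vecs[w]) * np.linalg.norm(query)))
print(best)  # typically "queen"
</code></pre></div>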
<h2 id="contextual-embedding-methods">Contextual Embedding Methods</h2>
<h3 id="fasttext">FastText</h3>
<p><a href="https://github.com/facebookresearch/fastText">FastText</a> addresses a critical limitation of previous methods: handling out-of-vocabulary (OOV) words. By incorporating subword information, FastText can generate meaningful representations for previously unseen words.</p>
<p><strong>Subword approach:</strong></p>
<ul>
<li>Decomposes words into character n-grams (typically 3-6 characters)</li>
<li>Represents words as sums of their component n-grams</li>
<li>Trains using skip-gram objective with negative sampling</li>
</ul>
<p><strong>Key advantages:</strong></p>
<ul>
<li><strong>OOV handling</strong>: Can embed unseen words using known subword components</li>
<li><strong>Morphological awareness</strong>: Captures relationships between related word forms</li>
<li><strong>Multilingual support</strong>: Facebook released pre-trained embeddings for 294 languages</li>
<li><strong>Robust performance</strong>: Particularly effective for morphologically rich languages</li>
</ul>
<p>For example, if the model knows &ldquo;navigate,&rdquo; it can provide meaningful representation for &ldquo;circumnavigate&rdquo; by leveraging shared subword components, even if &ldquo;circumnavigate&rdquo; wasn&rsquo;t in the training data.</p>
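<p>Gensim&rsquo;s FastText implementation makes this OOV behavior easy to verify: a word absent from training still receives a vector composed from its character n-grams. A minimal sketch, again assuming the Gensim 4.x API and a toy corpus:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from gensim.models import FastText

sentences = [["we", "navigate", "rivers"],
             ["they", "navigated", "the", "coast"]]

# min_n / max_n set the character n-gram range (3-6 by default).
model = FastText(sentences, vector_size=50, window=2, min_count=1,
                 min_n=3, max_n=6, epochs=50)

# "circumnavigate" never appears in the training data, but FastText
# composes a vector for it from shared character n-grams.
print(model.wv["circumnavigate"][:5])
print(model.wv.similarity("navigate", "circumnavigate"))
</code></pre></div>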
<h3 id="poincaré-embeddings">Poincaré Embeddings</h3>
<p><a href="https://radimrehurek.com/gensim/models/poincare.html">Poincaré embeddings</a> introduce a novel approach by learning representations in hyperbolic space. This geometric innovation specifically targets hierarchical relationships in data.</p>
<p><strong>Hyperbolic geometry advantages:</strong></p>
<ul>
<li><strong>Natural hierarchy encoding</strong>: Distance represents similarity, while norm encodes hierarchical level</li>
<li><strong>Efficient representation</strong>: Requires fewer dimensions for hierarchical data</li>
<li><strong>Mathematical elegance</strong>: Leverages properties of hyperbolic space for embedding optimization</li>
</ul>
<p><strong>Applications:</strong>
Particularly effective for data with inherent hierarchical structure, such as:</p>
<ul>
<li>WordNet taxonomies</li>
<li>Organizational charts</li>
<li>Computer network topologies</li>
<li>Knowledge graphs</li>
</ul>
<p>The <a href="https://arxiv.org/abs/1705.08039">original paper</a> demonstrates good efficiency in reproducing WordNet relationships with significantly lower dimensionality compared to traditional embedding methods.</p>
<h2 id="contextual-embeddings">Contextual Embeddings</h2>
<h3 id="elmo-embeddings-from-language-models">ELMo (Embeddings from Language Models)</h3>
<p><a href="https://github.com/allenai/allennlp-models">ELMo</a> represents a paradigm shift toward contextual word representations. ELMo generates dynamic representations based on sentence context, adapting to word usage patterns.</p>
<p><strong>Architecture:</strong></p>
<ul>
<li><strong>Bidirectional LSTM</strong>: Processes text in both forward and backward directions</li>
<li><strong>Character-level input</strong>: Handles OOV words and captures morphological patterns</li>
<li><strong>Multi-layer representations</strong>: Combines different abstraction levels</li>
</ul>
<p><strong>Layer specialization:</strong></p>
<ul>
<li><strong>Lower layers</strong>: Excel at syntactic tasks (POS tagging, parsing)</li>
<li><strong>Higher layers</strong>: Capture semantic relationships (word sense disambiguation)</li>
<li><strong>Combined layers</strong>: A task-specific weighted combination of all layers typically performs best</li>
</ul>
<p><strong>Key innovation:</strong>
ELMo embeddings vary by context. The word &ldquo;bank&rdquo; receives different representations in &ldquo;river bank&rdquo; versus &ldquo;financial bank,&rdquo; addressing polysemy directly through contextual awareness.</p>
<p>This approach achieved strong performance across numerous NLP tasks by providing context-sensitive representations that adapt to word usage patterns.</p>
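<p>With AllenNLP&rsquo;s reference implementation, the context sensitivity is directly observable: the same surface word receives different vectors in different sentences. A minimal sketch, assuming a downloaded pre-trained options/weights pair (the file paths below are placeholders):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths: substitute a downloaded pre-trained
# options/weights pair from an AllenNLP ELMo release.
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"
elmo = Elmo(options_file, weight_file, num_output_representations=1)

sentences = [["The", "river", "bank", "flooded"],
             ["The", "bank", "raised", "rates"]]
character_ids = batch_to_ids(sentences)
reps = elmo(character_ids)["elmo_representations"][0]

# "bank" is token 2 in the first sentence and token 1 in the second;
# the two vectors differ because their contexts differ.
print((reps[0, 2] - reps[1, 1]).norm())
</code></pre></div>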
<h3 id="probabilistic-fasttext">Probabilistic FastText</h3>
<p><a href="https://github.com/benathi/multisense-prob-fasttext">Probabilistic FastText</a> addresses polysemy (words with multiple meanings) through probabilistic modeling. Traditional embeddings conflate different word senses into single representations, limiting their precision.</p>
<p><strong>The polysemy problem:</strong>
Consider &ldquo;rock&rdquo; which can mean:</p>
<ul>
<li>Rock music (genre)</li>
<li>A stone (geological object)</li>
<li>Rocking motion (verb)</li>
</ul>
<p>Standard embeddings average these meanings, producing representations that may not capture any sense precisely.</p>
<p><strong>Probabilistic approach:</strong>
Probabilistic FastText represents words as Gaussian mixture models: probability distributions that can capture multiple distinct meanings as separate components.</p>
<p><strong>Advantages:</strong></p>
<ul>
<li><strong>Multi-sense representation</strong>: Each word sense gets its own distribution</li>
<li><strong>Context sensitivity</strong>: Can select appropriate sense based on usage context</li>
<li><strong>Uncertainty quantification</strong>: Probabilistic framework captures embedding confidence</li>
</ul>
<p>This approach provides a more nuanced treatment of lexical ambiguity, particularly valuable for words with distinct, context-dependent meanings.</p>
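<p>The mixture idea can be sketched in a few lines of NumPy: represent each word as weighted spherical Gaussian senses and measure similarity with the expected likelihood kernel, the closed-form overlap integral of two Gaussians. This is a conceptual sketch of the representation, not the paper&rsquo;s trained model:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import numpy as np

def gaussian_overlap(m1, s1, m2, s2):
    # Closed-form inner product of two spherical Gaussians:
    #   integral of N(x; m1, s1^2 I) * N(x; m2, s2^2 I) dx
    d = m1.size
    var = s1**2 + s2**2
    diff = m1 - m2
    return np.exp(-(diff @ diff) / (2 * var)) / (2 * np.pi * var) ** (d / 2)

def mixture_similarity(senses_a, senses_b):
    # Each word: [(weight, mean, stddev), ...], one entry per sense.
    return sum(wa * wb * gaussian_overlap(ma, sa, mb, sb)
               for wa, ma, sa in senses_a
               for wb, mb, sb in senses_b)

# Toy usage: two-sense "rock" vs. single-sense "stone" and "jazz".
rng = np.random.default_rng(0)
music, stone_like = rng.normal(size=10), rng.normal(size=10)
rock = [(0.5, music, 1.0), (0.5, stone_like, 1.0)]
stone = [(1.0, stone_like + 0.1, 1.0)]
jazz = [(1.0, music + 0.1, 1.0)]
print(mixture_similarity(rock, stone), mixture_similarity(rock, jazz))
</code></pre></div>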
<h2 id="summary-and-future-directions">Summary and Future Directions</h2>
<p>Word embeddings have evolved from simple one-hot encodings to contextual representations that capture nuanced linguistic relationships. Each approach offers distinct advantages:</p>
<p><strong>Static embeddings</strong> (Word2Vec, GloVe, FastText) provide:</p>
<ul>
<li>Computational efficiency for large-scale applications</li>
<li>Pre-trained models available for numerous languages</li>
<li>Clear analogical reasoning capabilities</li>
<li>Good performance on many downstream tasks</li>
</ul>
<p><strong>Contextual embeddings</strong> (ELMo, BERT, GPT) offer:</p>
<ul>
<li>Dynamic representations based on sentence context</li>
<li>Better handling of polysemy and word sense disambiguation</li>
<li>Strong performance on complex NLP tasks</li>
<li>Ability to capture subtle contextual nuances</li>
</ul>
<p><strong>Choosing the right approach</strong> depends on:</p>
<ul>
<li><strong>Task requirements</strong>: Static embeddings for efficiency, contextual for accuracy</li>
<li><strong>Data availability</strong>: Pre-trained models vs. domain-specific training</li>
<li><strong>Computational constraints</strong>: Static embeddings require less processing power</li>
<li><strong>Language coverage</strong>: Consider availability of pre-trained models for target languages</li>
</ul>
<p>The field continues advancing toward more efficient contextual models, better multilingual representations, and embeddings that capture increasingly complex linguistic phenomena.</p>
<p>For a production-grade Word2Vec implementation in PyTorch that takes these concepts further, see the <a href="/projects/modern-word2vec/">High-Performance Word2Vec project</a>.</p>
]]></content:encoded></item></channel></rss>