<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Natural Language Processing on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/categories/natural-language-processing/</link><description>Recent content in Natural Language Processing on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sat, 11 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/categories/natural-language-processing/index.xml" rel="self" type="application/rss+xml"/><item><title>SpeechT5: Unified Speech-Text Pre-Training Framework</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/speecht5-unified-speech-text-pretraining/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/speecht5-unified-speech-text-pretraining/</guid><description>SpeechT5 introduces a shared encoder-decoder framework with cross-modal vector quantization for joint speech and text pre-training across six tasks.</description><content:encoded><![CDATA[<h2 id="a-unified-encoder-decoder-for-spoken-language-processing">A Unified Encoder-Decoder for Spoken Language Processing</h2>
<p>SpeechT5 is a <strong>Method</strong> paper that introduces a shared encoder-decoder pre-training framework for spoken language processing. Inspired by <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5&rsquo;s</a> text-to-text paradigm, SpeechT5 reformulates all spoken language tasks as &ldquo;speech/text to speech/text&rdquo; problems. The framework uses modal-specific pre-nets and post-nets to interface between raw speech or text and a shared Transformer encoder-decoder, enabling a single pre-trained model to handle six downstream tasks: automatic speech recognition (ASR), text-to-speech synthesis (TTS), speech translation (ST), voice conversion (VC), speech enhancement (SE), and speaker identification (SID).</p>
<h2 id="bridging-the-gap-between-speech-and-text-pre-training">Bridging the Gap Between Speech and Text Pre-Training</h2>
<p>Prior speech pre-training work (wav2vec 2.0, HuBERT) suffered from two key limitations. First, these models learned speech representations from unlabeled audio alone, ignoring the complementary information in text data that is critical for cross-modal tasks like ASR and TTS. Second, they relied on encoder-only architectures with task-specific prediction heads, leaving the decoder un-pretrained for sequence-to-sequence generation tasks.</p>
<p>SpeechT5 addresses both gaps by (1) jointly pre-training on unlabeled speech and text data, and (2) using a full encoder-decoder architecture that benefits generation tasks directly. The approach builds on the observation that speech and text, despite their surface differences, share underlying semantic structure that a unified representation can capture.</p>
<h2 id="cross-modal-vector-quantization-for-alignment">Cross-Modal Vector Quantization for Alignment</h2>
<p>The core innovation in SpeechT5 is a cross-modal <a href="https://en.wikipedia.org/wiki/Vector_quantization">vector quantization</a> (VQ) mechanism that aligns speech and text representations into a shared semantic space. The architecture consists of three components:</p>
<p><strong>Shared encoder-decoder backbone.</strong> A Transformer with 12 encoder blocks and 6 decoder blocks (768-dim, 12 heads), using relative position embeddings.</p>
<p><strong>Modal-specific pre/post-nets.</strong> Six specialized networks handle the conversion between raw modalities and the shared representation space:</p>
<ul>
<li>Speech-encoder pre-net: a convolutional feature extractor (from wav2vec 2.0) downsampling raw waveforms</li>
<li>Speech-decoder pre-net: three FC layers with ReLU, processing 80-dimensional log Mel-filterbank features</li>
<li>Speech-decoder post-net: a linear layer predicting Mel features plus five 1D conv layers (256 channels) for residual refinement, with an x-vector speaker embedding concatenated for multi-speaker support</li>
<li>Text pre/post-nets: shared embedding layers mapping between character-level token indices and hidden states (768-dim)</li>
</ul>
<p><strong>Cross-modal vector quantization.</strong> A shared codebook $\mathbf{C}^{K}$ with $K$ learnable embeddings bridges the two modalities. Encoder outputs $\mathbf{u}_i$ are quantized via nearest-neighbor lookup:</p>
<p>$$
\mathbf{c}_i = \arg\min_{j \in [K]} \| \mathbf{u}_i - \mathbf{c}_j \|_2
$$</p>
<p>A random 10% of the contextual representations are replaced with their quantized latent units before being fed to the decoder&rsquo;s cross-attention. This mixing forces the quantizer to capture cross-modal features. A diversity loss encourages full codebook utilization:</p>
<p>$$
\mathcal{L}_d = \frac{1}{K} \sum_{k=1}^{K} p_k \log p_k
$$</p>
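<p>A minimal numpy sketch can make the mechanism concrete: the nearest-neighbor lookup, the random 10% mixing, and the diversity loss over codebook usage. The codebook here is random rather than learned (the paper trains it end-to-end, which this sketch omits, e.g. via a straight-through gradient estimator), and the batch size is illustrative:</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
K, d = 100, 768                        # codebook entries, hidden size (768-dim states)
codebook = rng.normal(size=(K, d))     # learnable embeddings c_1..c_K (random here)
u = rng.normal(size=(50, d))           # a batch of encoder outputs u_i

# Nearest-neighbor lookup: c_i = argmin_j ||u_i - c_j||_2
dists = np.linalg.norm(u[:, None, :] - codebook[None, :, :], axis=-1)  # (50, K)
codes = dists.argmin(axis=-1)
quantized = codebook[codes]

# Randomly replace 10% of contextual representations with quantized units
replace = rng.random(len(u)) &lt; 0.10
mixed = np.where(replace[:, None], quantized, u)

# Diversity loss: (1/K) sum_k p_k log p_k, minimized by uniform codebook usage
p = np.bincount(codes, minlength=K) / len(codes)
L_d = np.sum(p[p > 0] * np.log(p[p > 0])) / K
</code></pre>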
<h3 id="pre-training-objectives">Pre-Training Objectives</h3>
<p>SpeechT5 combines three pre-training objectives:</p>
<p><strong>Speech pre-training</strong> uses two tasks. A bidirectional masked prediction loss $\mathcal{L}_{mlm}^{s}$ follows HuBERT&rsquo;s approach, masking 8% of timesteps in 10-step spans and predicting frame-level targets from an acoustic unit discovery model:</p>
<p>$$
\mathcal{L}_{mlm}^{s} = \sum_{n \in \mathcal{M}} \log p(\mathbf{z}_n \mid \hat{\mathbf{H}}, n)
$$</p>
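<p>The span-masking scheme itself is easy to sketch. A minimal version, assuming the 8% refers to timesteps sampled as span starts with spans of 10 steps that may overlap (the wav2vec 2.0/HuBERT convention; the paper&rsquo;s phrasing leaves this ambiguous):</p>
<pre><code class="language-python">import numpy as np

def sample_mask(T, start_prob=0.08, span=10, rng=None):
    """Choose each timestep as a span start with probability start_prob,
    then mask `span` consecutive steps from every start (spans may overlap)."""
    rng = rng or np.random.default_rng()
    starts = rng.random(T) &lt; start_prob
    mask = np.zeros(T, dtype=bool)
    for s in np.flatnonzero(starts):
        mask[s:s + span] = True
    return mask

mask = sample_mask(1000, rng=np.random.default_rng(0))
print(mask.mean())  # total masked fraction is well above 0.08 due to the span length
</code></pre>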
<p>A reconstruction loss $\mathcal{L}_{1}^{s}$ minimizes the $L_1$ distance between predicted and original Mel-filterbank features, plus a binary cross-entropy stop-token loss $\mathcal{L}_{bce}^{s}$.</p>
<p><strong>Text pre-training</strong> uses BART-style denoising, masking 30% of text spans (Poisson $\lambda = 3.5$) and training with maximum likelihood estimation:</p>
<p>$$
\mathcal{L}_{mle}^{t} = \sum_{n=1}^{N^t} \log p(\mathbf{y}_n^t \mid \mathbf{y}_{&lt; n}^t, \hat{\mathbf{X}}^t)
$$</p>
<p>The full pre-training loss combines all components:</p>
<p>$$
\mathcal{L} = \mathcal{L}_{mlm}^{s} + \mathcal{L}_{1}^{s} + \mathcal{L}_{bce}^{s} + \mathcal{L}_{mle}^{t} + \gamma \mathcal{L}_d
$$</p>
<p>where $\gamma = 0.1$.</p>
<h2 id="evaluation-across-six-spoken-language-tasks">Evaluation Across Six Spoken Language Tasks</h2>
<p>SpeechT5 was evaluated on six downstream tasks, each using a different combination of the shared encoder-decoder and task-appropriate pre/post-nets:</p>
<h3 id="automatic-speech-recognition-asr">Automatic Speech Recognition (ASR)</h3>
<p>Fine-tuned on LibriSpeech 100h with joint <a href="https://en.wikipedia.org/wiki/Connectionist_temporal_classification">CTC</a>/attention decoding. The decoding objective maximizes a combination of decoder, CTC, and language model log-probabilities:</p>
<p>$$
\alpha \log P_{Dec} + (1 - \alpha) \log P_{CTC} + \beta \log P_{LM}
$$</p>
<p>where $\alpha = 0.5$ and $\beta = 1.0$ for the 100h setting (beam size 30). Results on the test sets:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>LM</th>
          <th>test-clean</th>
          <th>test-other</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>wav2vec 2.0 BASE</td>
          <td>-</td>
          <td>6.1</td>
          <td>13.3</td>
      </tr>
      <tr>
          <td>HuBERT BASE</td>
          <td>-</td>
          <td>5.8</td>
          <td>13.3</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>-</strong></td>
          <td><strong>4.4</strong></td>
          <td><strong>10.4</strong></td>
      </tr>
      <tr>
          <td>wav2vec 2.0 BASE</td>
          <td>Transf.</td>
          <td>2.6</td>
          <td>6.3</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>Transf.</strong></td>
          <td><strong>2.4</strong></td>
          <td><strong>5.8</strong></td>
      </tr>
  </tbody>
</table>
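<p>The decoding objective above amounts to re-scoring each beam hypothesis with a weighted sum of three log-probabilities. A toy sketch, with purely illustrative hypotheses and scores:</p>
<pre><code class="language-python">def joint_score(log_p_dec, log_p_ctc, log_p_lm, alpha=0.5, beta=1.0):
    """Combined decoder/CTC/LM score for one beam hypothesis."""
    return alpha * log_p_dec + (1 - alpha) * log_p_ctc + beta * log_p_lm

# Two hypothetical beam candidates, ranked by the combined score
hyps = {
    "the cat sat": joint_score(-2.1, -2.4, -1.0),
    "the cats at": joint_score(-2.0, -3.1, -2.2),
}
best = max(hyps, key=hyps.get)  # -> "the cat sat"
</code></pre>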
<h3 id="text-to-speech-synthesis-tts">Text-to-Speech Synthesis (TTS)</h3>
<p>Fine-tuned on LibriTTS 460h clean sets with HiFi-GAN vocoder:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Naturalness</th>
          <th>MOS</th>
          <th>CMOS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ground Truth</td>
          <td>-</td>
          <td>3.87 ± 0.04</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Baseline</td>
          <td>2.76</td>
          <td>3.56 ± 0.05</td>
          <td>0</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>2.91</strong></td>
          <td><strong>3.65 ± 0.04</strong></td>
          <td><strong>+0.290</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="speech-translation-st">Speech Translation (ST)</h3>
<p>Evaluated on MUST-C English-to-German and English-to-French:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>EN-DE</th>
          <th>EN-FR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fairseq ST</td>
          <td>22.70</td>
          <td>32.90</td>
      </tr>
      <tr>
          <td>Adapter Tuning</td>
          <td>24.63</td>
          <td>34.98</td>
      </tr>
      <tr>
          <td>Baseline (HuBERT init)</td>
          <td>23.43</td>
          <td>33.76</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>25.18</strong></td>
          <td><strong>35.30</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="voice-conversion-vc">Voice Conversion (VC)</h3>
<p>Evaluated on CMU Arctic:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>WER (bdl→slt)</th>
          <th>MCD (bdl→slt)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VTN w/ TTS</td>
          <td>7.6%</td>
          <td>6.33</td>
      </tr>
      <tr>
          <td>Many-to-many VTN</td>
          <td>-</td>
          <td>6.13</td>
      </tr>
      <tr>
          <td><strong>SpeechT5</strong></td>
          <td><strong>7.8%</strong></td>
          <td><strong>5.93</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="speech-enhancement-se">Speech Enhancement (SE)</h3>
<p>On the WHAM! dataset, SpeechT5 reduced WER from 76.1% (noisy input) to 8.9%, compared to the baseline&rsquo;s 10.9%, an ~18% relative improvement.</p>
<h3 id="speaker-identification-sid">Speaker Identification (SID)</h3>
<p>On VoxCeleb1, SpeechT5 achieved 96.49% accuracy, outperforming HuBERT LARGE at 90.33% (from SUPERB) and SpeechNet multi-task at 87.90%.</p>
<h2 id="ablation-study-and-key-findings">Ablation Study and Key Findings</h2>
<p>The ablation study reveals the contribution of each pre-training component:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ASR (clean)</th>
          <th>ASR (other)</th>
          <th>VC (MCD)</th>
          <th>SID (ACC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SpeechT5</td>
          <td>4.4</td>
          <td>10.7</td>
          <td>5.93</td>
          <td>96.49%</td>
      </tr>
      <tr>
          <td>w/o Speech PT</td>
          <td>-</td>
          <td>-</td>
          <td>6.49</td>
          <td>38.61%</td>
      </tr>
      <tr>
          <td>w/o Text PT</td>
          <td>5.4</td>
          <td>12.8</td>
          <td>6.03</td>
          <td>95.60%</td>
      </tr>
      <tr>
          <td>w/o Joint PT</td>
          <td>4.6</td>
          <td>11.3</td>
          <td>6.18</td>
          <td>95.54%</td>
      </tr>
      <tr>
          <td>w/o $\mathcal{L}_{mlm}^{s}$</td>
          <td>7.6</td>
          <td>22.4</td>
          <td>6.29</td>
          <td>90.91%</td>
      </tr>
  </tbody>
</table>
<p>Key findings:</p>
<ol>
<li><strong>Speech pre-training is critical</strong>: without it, ASR fails to converge entirely, and SID accuracy drops to 38.61%.</li>
<li><strong>Text pre-training complements speech</strong>: removing it degrades ASR by ~20% relative, confirming that textual knowledge transfers to speech tasks.</li>
<li><strong>Joint pre-training enables cross-modal transfer</strong>: the vector quantization approach is essential for modality-bridging tasks like ASR.</li>
<li><strong>The masked prediction loss $\mathcal{L}_{mlm}^{s}$ is the most important single component</strong>, responsible for learning strong acoustic features.</li>
</ol>
<p>The authors note limitations in the current scope (English-only, BASE model size) and propose scaling to larger models and multilingual settings as future work.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Speech pre-training</td>
          <td>LibriSpeech</td>
          <td>960 hours</td>
          <td>Full training set</td>
      </tr>
      <tr>
          <td>Text pre-training</td>
          <td>LibriSpeech LM text</td>
          <td>400M sentences</td>
          <td>Normalized language model text</td>
      </tr>
      <tr>
          <td>ASR fine-tuning</td>
          <td>LibriSpeech</td>
          <td>100h / 960h subsets</td>
          <td></td>
      </tr>
      <tr>
          <td>TTS fine-tuning</td>
          <td>LibriTTS</td>
          <td>460h clean sets</td>
          <td></td>
      </tr>
      <tr>
          <td>ST fine-tuning</td>
          <td>MUST-C</td>
          <td>EN-DE, EN-FR</td>
          <td></td>
      </tr>
      <tr>
          <td>VC fine-tuning</td>
          <td>CMU Arctic</td>
          <td>4 speakers</td>
          <td>bdl, clb, slt, rms</td>
      </tr>
      <tr>
          <td>SE fine-tuning</td>
          <td>WHAM!</td>
          <td>16 kHz max</td>
          <td>enhance-single task</td>
      </tr>
      <tr>
          <td>SID fine-tuning</td>
          <td>VoxCeleb1</td>
          <td>100k+ utterances</td>
          <td>1,251 speakers</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam with warmup (8% of steps) to peak LR $2 \times 10^{-4}$, then linear decay (sketched after this list)</li>
<li>Speech masking: 8% of timesteps, 10-step spans</li>
<li>Text masking: 30% of spans, Poisson $\lambda = 3.5$</li>
<li>Vector quantization: 2 codebooks with 100 entries each ($100^2 = 10^4$ theoretical maximum code combinations)</li>
<li>CTC/attention joint decoding for ASR (beam size 30)</li>
<li>HiFi-GAN vocoder for TTS and SE waveform generation</li>
<li>Parallel WaveGAN vocoder for VC</li>
</ul>
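<p>A concrete reading of the optimizer bullet above, as a warmup-then-linear-decay schedule (the paper specifies warmup over the first 8% of steps to a peak of 2e-4; the exact decay endpoint here is an assumption):</p>
<pre><code class="language-python">def lr_at(step, total_steps=500_000, peak=2e-4, warmup_frac=0.08):
    """Linear warmup over the first warmup_frac of steps, then linear decay."""
    warmup = int(total_steps * warmup_frac)
    if step &lt; warmup:
        return peak * step / warmup
    return peak * (total_steps - step) / (total_steps - warmup)

assert abs(lr_at(40_000) - 2e-4) &lt; 1e-12  # peak is reached at the end of warmup
</code></pre>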
<h3 id="fine-tuning-hyperparameters">Fine-Tuning Hyperparameters</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>GPUs</th>
          <th>Steps</th>
          <th>Peak LR</th>
          <th>Batch (per GPU)</th>
          <th>Schedule</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ASR (100h)</td>
          <td>8×V100</td>
          <td>80k</td>
          <td>6e-5</td>
          <td>256k audio samples</td>
          <td>Warmup 10%, hold 40%, linear decay</td>
      </tr>
      <tr>
          <td>ASR (960h)</td>
          <td>8×V100</td>
          <td>320k</td>
          <td>1.3e-4</td>
          <td>256k audio samples</td>
          <td>Warmup 10%, hold 40%, linear decay</td>
      </tr>
      <tr>
          <td>TTS</td>
          <td>8×V100</td>
          <td>120k</td>
          <td>4e-4</td>
          <td>45k tokens</td>
          <td>Warmup 10k steps, inv. sqrt decay</td>
      </tr>
      <tr>
          <td>ST</td>
          <td>8×V100</td>
          <td>80k</td>
          <td>-</td>
          <td>-</td>
          <td>Warmup 10k steps</td>
      </tr>
      <tr>
          <td>VC</td>
          <td>8×V100</td>
          <td>60k</td>
          <td>1e-4</td>
          <td>20k tokens</td>
          <td>6k warmup, inv. sqrt decay</td>
      </tr>
      <tr>
          <td>SE</td>
          <td>8×V100</td>
          <td>100k</td>
          <td>1e-4</td>
          <td>16k tokens</td>
          <td>10k warmup, inv. sqrt decay</td>
      </tr>
      <tr>
          <td>SID</td>
          <td>8×V100</td>
          <td>60k</td>
          <td>5e-4</td>
          <td>64 segments (3s each)</td>
          <td>Triangular cyclical (1e-8 to 5e-4)</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<ul>
<li>Encoder: 12 Transformer blocks (768-dim, 3072 FFN, 12 heads)</li>
<li>Decoder: 6 Transformer blocks (same dimensions)</li>
<li>Speech-encoder pre-net: 7 conv blocks (512 channels, strides [5,2,2,2,2,2,2], kernels [10,3,3,3,3,2,2])</li>
<li>Code and pre-trained models available at <a href="https://github.com/microsoft/SpeechT5">github.com/microsoft/SpeechT5</a> (MIT license)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/microsoft/SpeechT5">microsoft/SpeechT5</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official Fairseq-based implementation</td>
      </tr>
      <tr>
          <td>Pre-trained models (via repo)</td>
          <td>Model</td>
          <td>MIT</td>
          <td>SpeechT5 BASE encoder-decoder checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://www.openslr.org/12">LibriSpeech</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>960h speech pre-training and ASR fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://www.openslr.org/60">LibriTTS</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>460h TTS fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://ict.fbk.eu/must-c/">MUST-C</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-ND-4.0</td>
          <td>Speech translation fine-tuning</td>
      </tr>
      <tr>
          <td><a href="http://www.festvox.org/cmu_arctic/">CMU Arctic</a></td>
          <td>Dataset</td>
          <td>Free</td>
          <td>Voice conversion fine-tuning</td>
      </tr>
      <tr>
          <td><a href="http://wham.whisper.ai/">WHAM!</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-4.0</td>
          <td>Speech enhancement fine-tuning</td>
      </tr>
      <tr>
          <td><a href="https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html">VoxCeleb1</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA-4.0</td>
          <td>Speaker identification fine-tuning</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 32 NVIDIA V100 GPUs</li>
<li>Batch: ~90s speech per GPU + 12k text tokens per GPU, gradient accumulation 2</li>
<li>Pre-training steps: 500k</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ao, J., Wang, R., Zhou, L., Wang, C., Ren, S., Wu, Y., Liu, S., Ko, T., Li, Q., Zhang, Y., Wei, Z., Qian, Y., Li, J., &amp; Wei, F. (2022). SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. <em>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</em>, 5723-5738.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ao2022speecht,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5723--5738}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2022.acl-long.393}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>T5: Exploring Transfer Learning Limits</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/</guid><description>Raffel et al. systematically study transfer learning for NLP with a text-to-text framework, ablating architectures, objectives, data, and multi-task mixing.</description><content:encoded><![CDATA[<h2 id="a-systematic-study-of-nlp-transfer-learning">A systematic study of NLP transfer learning</h2>
<p>This is a <strong>systematization paper</strong> that provides a comprehensive empirical survey of transfer learning techniques for NLP. Rather than proposing a single new method, T5 introduces a unified text-to-text framework and uses it as a testbed to systematically compare pre-training objectives, architectures, unlabeled data sources, transfer approaches, and multi-task mixing strategies. The scale of the ablation study (covering dozens of configurations) and the release of C4, pre-trained models, and code make it both a reference guide and a resource.</p>
<h2 id="unifying-nlp-tasks-as-text-to-text">Unifying NLP tasks as text-to-text</h2>
<p>The core design decision is to cast every NLP task as a text-to-text problem: both the input and output are text strings, with a task-specific prefix. Classification, regression, summarization, translation, and question answering all use the same model, loss function (cross-entropy on output tokens), and decoding procedure. This simplicity enables fair comparison across tasks and training strategies.</p>
<p>The model architecture is a standard encoder-decoder Transformer. The paper finds that this form outperforms decoder-only (language model) and encoder-only (BERT-style) variants in the text-to-text setting, while having similar computational cost to decoder-only models despite twice the parameters (the encoder processes the input only once, then the decoder attends to it).</p>
<h2 id="multi-task-mixing-strategies-and-findings">Multi-task mixing: strategies and findings</h2>
<p>The most thesis-relevant contribution is the systematic ablation of multi-task mixing strategies (Section 3.5.2). When training on multiple tasks simultaneously (which in the text-to-text framework simply means mixing data from different sources), the central question is how to set the proportion of data from each task.</p>
<h3 id="three-mixing-strategies">Three mixing strategies</h3>
<p><strong>Examples-proportional mixing.</strong> Sample in proportion to each dataset&rsquo;s size, with an artificial cap $K$ on the maximum dataset size. Without the cap, the unsupervised pre-training data (orders of magnitude larger) would dominate all batches. The mixing rate for task $m$ is:</p>
<p>$$
r_{m} = \frac{\min(e_{m}, K)}{\sum_{n} \min(e_{n}, K)}
$$</p>
<p>where $e_{m}$ is the number of examples in task $m$&rsquo;s dataset.</p>
<p><strong>Temperature-scaled mixing.</strong> Raise each mixing rate $r_{m}$ to the power $1/T$ and renormalize. At $T=1$ this equals examples-proportional mixing; as $T$ increases, proportions approach equal mixing. Uses a large cap $K = 2^{21}$.</p>
<p><strong>Equal mixing.</strong> Sample uniformly from all tasks. Included as a negative reference: the model overfits on low-resource tasks and underfits on high-resource tasks.</p>
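<p>All three strategies reduce to a few lines of numpy. The task sizes below are hypothetical; letting $T \to \infty$ recovers equal mixing as a limiting case:</p>
<pre><code class="language-python">import numpy as np

def mixing_rates(sizes, K=None, T=1.0):
    """Examples-proportional mixing with cap K, optionally temperature-scaled.

    sizes: examples per task (e_m).  K: cap on each dataset's effective size.
    T=1 is examples-proportional; larger T moves proportions toward equal.
    """
    e = np.asarray(sizes, dtype=float)
    capped = np.minimum(e, K) if K is not None else e
    r = capped / capped.sum()
    r = r ** (1.0 / T)            # temperature scaling
    return r / r.sum()

sizes = [2**25, 2**20, 2**15]                 # hypothetical task sizes
print(mixing_rates(sizes, K=2**19))           # examples-proportional with cap
print(mixing_rates(sizes, K=2**21, T=2.0))    # temperature-scaled, T=2
print(mixing_rates(sizes, T=1e9))             # approaches equal mixing
</code></pre>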
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Mixing strategy</th>
          <th>GLUE</th>
          <th>CNN/DM</th>
          <th>SQuAD</th>
          <th>SuperGLUE</th>
          <th>EnDe</th>
          <th>EnFr</th>
          <th>EnRo</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Baseline (pre-train/fine-tune)</td>
          <td>83.28</td>
          <td>19.24</td>
          <td>80.88</td>
          <td>71.36</td>
          <td>26.98</td>
          <td>39.82</td>
          <td>27.65</td>
      </tr>
      <tr>
          <td>Equal</td>
          <td>76.13</td>
          <td>19.02</td>
          <td>76.51</td>
          <td>63.37</td>
          <td>23.89</td>
          <td>34.31</td>
          <td>26.78</td>
      </tr>
      <tr>
          <td>Examples-proportional, $K=2^{18}$</td>
          <td>81.67</td>
          <td>19.07</td>
          <td>78.17</td>
          <td>67.94</td>
          <td>24.57</td>
          <td>35.19</td>
          <td>27.39</td>
      </tr>
      <tr>
          <td>Examples-proportional, $K=2^{19}$</td>
          <td>81.42</td>
          <td>19.24</td>
          <td>79.78</td>
          <td>67.30</td>
          <td>25.21</td>
          <td>36.30</td>
          <td>27.76</td>
      </tr>
      <tr>
          <td>Temperature-scaled, $T=2$</td>
          <td>81.90</td>
          <td>19.28</td>
          <td>79.42</td>
          <td>69.92</td>
          <td>25.42</td>
          <td>36.72</td>
          <td>27.20</td>
      </tr>
  </tbody>
</table>
<p><strong>Key findings on mixing:</strong></p>
<ol>
<li>
<p><strong>Multi-task training underperforms pre-train-then-fine-tune on most tasks.</strong> No mixing strategy matches the baseline of unsupervised pre-training followed by task-specific fine-tuning.</p>
</li>
<li>
<p><strong>Equal mixing is worst.</strong> It dramatically degrades performance, confirming that proportions matter.</p>
</li>
<li>
<p><strong>There exists a task-specific sweet spot for the cap $K$.</strong> Most tasks have an optimal $K$ value; larger or smaller values hurt. The exception is very high-resource tasks (WMT English-French) that always benefit from higher mixing proportions.</p>
</li>
<li>
<p><strong>Temperature scaling at $T=2$ provides the best single compromise.</strong> It achieves reasonable performance across all tasks without requiring per-task tuning of $K$.</p>
</li>
<li>
<p><strong>Multi-task pre-training followed by fine-tuning closes the gap.</strong> When multi-task training is used as pre-training (not as the final training stage), followed by task-specific fine-tuning, performance becomes comparable to unsupervised pre-training alone. This suggests that multi-task exposure during pre-training provides useful early signal without the negative effects of forcing a single model to perform all tasks simultaneously.</p>
</li>
<li>
<p><strong>&ldquo;Leave-one-out&rdquo; training works.</strong> Pre-training on a multi-task mixture that excludes a target task, then fine-tuning on it, produces only slightly worse results. This indicates that multi-task pre-training builds general capabilities that transfer to unseen tasks without dramatic task interference.</p>
</li>
</ol>
<h2 id="data-repetition-degrades-performance">Data repetition degrades performance</h2>
<p>The paper also systematically tests the effect of pre-training data set size by truncating C4 and training over repeated data:</p>
<table>
  <thead>
      <tr>
          <th>Unique tokens</th>
          <th>Repeats</th>
          <th>GLUE</th>
          <th>SQuAD</th>
          <th>SuperGLUE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full dataset</td>
          <td>0</td>
          <td>83.28</td>
          <td>80.88</td>
          <td>71.36</td>
      </tr>
      <tr>
          <td>$2^{29}$</td>
          <td>64</td>
          <td>82.87</td>
          <td>80.97</td>
          <td>72.03</td>
      </tr>
      <tr>
          <td>$2^{27}$</td>
          <td>256</td>
          <td>82.62</td>
          <td>79.78</td>
          <td>69.97</td>
      </tr>
      <tr>
          <td>$2^{25}$</td>
          <td>1,024</td>
          <td>79.55</td>
          <td>76.27</td>
          <td>64.76</td>
      </tr>
      <tr>
          <td>$2^{23}$</td>
          <td>4,096</td>
          <td>76.34</td>
          <td>70.92</td>
          <td>59.29</td>
      </tr>
  </tbody>
</table>
<p>Performance degrades as data shrinks, with 64 repeats showing limited effects but 1,024+ repeats causing significant degradation. Training loss curves confirm memorization at high repetition counts. The paper recommends using large, diverse pre-training datasets whenever possible.</p>
<h2 id="scaling-and-final-configuration">Scaling and final configuration</h2>
<p>The paper compares scaling strategies: more data, larger models, and ensembles. Training a larger model for fewer steps generally outperforms training a smaller model on more data. Ensembles of independently pre-trained and fine-tuned models provide orthogonal gains.</p>
<p>The final T5-11B model combines the best choices from all ablations: encoder-decoder architecture, span corruption objective, C4 pre-training data, multi-task pre-training followed by fine-tuning, and scaling to 11B parameters trained on over 1 trillion tokens. It achieves state-of-the-art results on GLUE (90.3 average), SuperGLUE (88.9, near human performance of 89.8), SQuAD, and CNN/Daily Mail. It does not achieve state-of-the-art on WMT translation tasks, where methods using backtranslation and cross-lingual pre-training retain the lead.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The T5 paper&rsquo;s multi-task mixing findings are its most enduring contribution beyond the model itself. The core lessons: proportions matter enormously (equal mixing fails), examples-proportional mixing with a cap is a reasonable default, temperature scaling provides a single-knob alternative, and multi-task pre-training followed by fine-tuning can match pure unsupervised pre-training.</p>
<p><strong>Limitations:</strong></p>
<ul>
<li>All ablations use the same encoder-decoder architecture. Findings may not transfer to decoder-only models that dominate current practice.</li>
<li>The multi-task mixing experiments treat each task as a separate &ldquo;domain.&rdquo; Interactions between similar tasks (e.g., multiple classification tasks) are not isolated.</li>
<li>The paper does not provide a principled method for choosing $K$ or $T$; both require empirical search.</li>
<li>C4 has known quality issues (templated text, noisy content) that have been addressed in later datasets.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> Code, pre-trained models, and the C4 dataset are all publicly released.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>C4 (Colossal Clean Crawled Corpus)</td>
          <td>~750 GB</td>
          <td>Heuristically cleaned Common Crawl</td>
      </tr>
      <tr>
          <td>Downstream</td>
          <td>GLUE, SuperGLUE, SQuAD, CNN/DM, WMT (EnDe, EnFr, EnRo)</td>
          <td>Standard splits</td>
          <td>Text-to-text format</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>Encoder-decoder Transformer. Sizes: Small (60M), Base (220M), Large (770M), 3B, 11B. Baseline uses Base size. SentencePiece vocabulary with 32K tokens. Pre-trained for $2^{19}$ steps, fine-tuned for $2^{18}$ steps on individual tasks.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Multi-task mixing: examples-proportional with cap $K \in \{2^{16}, \ldots, 2^{21}\}$, temperature-scaled with $T \in \{2, 4, 8\}$, and equal mixing. Unsupervised objective: span corruption (mean span length 3, 15% corruption rate). Training with Adafactor optimizer, inverse square root learning rate schedule.</p>
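<p>A toy version of the span-corruption objective may help; the sentinel format follows T5&rsquo;s <code>&lt;extra_id_N&gt;</code> convention, but the fixed span length and greedy masking here are simplifications (the real implementation samples span lengths around the mean):</p>
<pre><code class="language-python">import numpy as np

def span_corrupt(tokens, corruption_rate=0.15, mean_span=3, rng=None):
    """Mask ~corruption_rate of tokens in spans, replacing each span with a
    sentinel in the input and emitting span contents as the target."""
    rng = rng or np.random.default_rng()
    n = len(tokens)
    n_mask = max(1, round(n * corruption_rate))
    n_spans = max(1, round(n_mask / mean_span))
    starts = rng.choice(n, size=n_spans, replace=False)
    masked = np.zeros(n, dtype=bool)
    for s in starts:
        masked[s:s + mean_span] = True      # nearby spans may merge
    inputs, targets, sid, i = [], [], 0, 0
    while i &lt; n:
        if masked[i]:
            sentinel = f"&lt;extra_id_{sid}>"
            inputs.append(sentinel)
            targets.append(sentinel)
            while i &lt; n and masked[i]:
                targets.append(tokens[i])
                i += 1
            sid += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return " ".join(inputs), " ".join(targets)

src = "thank you for inviting me to your party last week".split()
print(span_corrupt(src, rng=np.random.default_rng(0)))
</code></pre>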
<h3 id="hardware">Hardware</h3>
<p>All models trained using Mesh TensorFlow on TPU slices. T5-11B pre-trained for 1M steps with batch size $2^{11}$ sequences of length 512 (~1 trillion tokens total). Exact TPU pod configurations per experiment not detailed.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/google-research/text-to-text-transfer-transformer">T5 Code</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official TensorFlow implementation (JAX successor: T5X)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints">T5 Models</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Pre-trained checkpoints (Small through 11B)</td>
      </tr>
      <tr>
          <td><a href="https://www.tensorflow.org/datasets/catalog/c4">C4 Dataset</a></td>
          <td>Dataset</td>
          <td>-</td>
          <td>~750 GB cleaned Common Crawl, via TensorFlow Datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{raffel2020exploring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{21}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{140}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1--67}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SlimPajama-DC: Data Combinations for LLM Training</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/slimpajama-dc-data-combinations/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/slimpajama-dc-data-combinations/</guid><description>Shen et al. study how global deduplication and domain combinations in SlimPajama affect LLM training, finding diversity after dedup is key.</description><content:encoded><![CDATA[<h2 id="an-empirical-study-of-data-domain-combinations">An empirical study of data domain combinations</h2>
<p>This is a <strong>discovery paper</strong> that empirically investigates how different combinations and proportions of data domains affect language model pretraining. Using the SlimPajama dataset (a globally deduplicated, 627B token refinement of RedPajama), the study trains seven 1.3B model configurations with varying domain mixtures to identify which combinations and deduplication strategies produce the best downstream performance.</p>
<h2 id="why-data-combination-strategy-matters">Why data combination strategy matters</h2>
<p>Multi-source pretraining datasets combine data from web crawls, code repositories, books, academic papers, and other sources. Two underexplored questions drive this work: (1) Does deduplication within each source (local) versus across all sources (global) meaningfully affect model quality? (2) When sources are thoroughly deduplicated, how does the combination and proportion of domains affect downstream performance? Most open-source LLM training datasets (RedPajama, The Pile) perform only local deduplication, leaving cross-source redundancy unaddressed.</p>
<h2 id="global-deduplication-and-the-slimpajama-dataset">Global deduplication and the SlimPajama dataset</h2>
<p>SlimPajama applies global MinHashLSH deduplication (Jaccard similarity threshold 0.8, 13-gram signatures) across all seven data sources simultaneously. This reduces RedPajama&rsquo;s 1.2T tokens to 627B tokens, a roughly 48% reduction. The heaviest deduplication hits CommonCrawl and GitHub, which had the most cross-source overlap.</p>
<p>The key processing steps:</p>
<ol>
<li><strong>Low-length document filtering</strong>: Remove documents below a minimum length threshold.</li>
<li><strong>Global deduplication</strong>: MinHashLSH across all sources simultaneously (sketched after this list), requiring 64 CPU cores and 1.4TB peak memory. This removes both within-source and between-source duplicates.</li>
</ol>
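<p>For illustration, global deduplication with MinHashLSH can be sketched using the <code>datasketch</code> library. The Jaccard threshold (0.8) and 13-gram signatures come from the paper; the tiny pooled corpus, 128 permutations, and word-level shingling with naive normalization are assumptions of this sketch:</p>
<pre><code class="language-python">from datasketch import MinHash, MinHashLSH

corpus = [  # stand-in for documents pooled across all seven sources
    "the quick brown fox jumps over the lazy dog and runs far away into the hills",
    "the quick brown fox jumps over the lazy dog and runs far away into the hills",
    "an entirely different document about language model training data mixtures",
]

def minhash_13gram(text, num_perm=128):
    tokens = text.lower().split()
    grams = {" ".join(tokens[i:i + 13]) for i in range(max(1, len(tokens) - 12))}
    mh = MinHash(num_perm=num_perm)
    for g in grams:
        mh.update(g.encode("utf-8"))
    return mh

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard threshold from the paper
kept = []
for i, doc in enumerate(corpus):
    mh = minhash_13gram(doc)
    if not lsh.query(mh):          # no near-duplicate indexed so far: keep it
        lsh.insert(f"doc-{i}", mh)
        kept.append(doc)
print(len(kept))  # 2: the duplicated second document is dropped
</code></pre>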
<p>The resulting dataset composition:</p>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>SlimPajama</th>
          <th>RedPajama</th>
          <th>LLaMA 1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CommonCrawl</td>
          <td>52.2% (333B)</td>
          <td>72.6% (878B)</td>
          <td>67.0%</td>
      </tr>
      <tr>
          <td>C4</td>
          <td>26.7% (170B)</td>
          <td>14.4% (175B)</td>
          <td>15.0%</td>
      </tr>
      <tr>
          <td>GitHub</td>
          <td>5.2% (33B)</td>
          <td>4.9% (59B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>Books</td>
          <td>4.2% (27B)</td>
          <td>2.1% (26B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>ArXiv</td>
          <td>4.6% (29B)</td>
          <td>2.3% (28B)</td>
          <td>2.5%</td>
      </tr>
      <tr>
          <td>Wikipedia</td>
          <td>3.8% (24B)</td>
          <td>2.0% (24B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>StackExchange</td>
          <td>3.3% (21B)</td>
          <td>1.7% (20B)</td>
          <td>2.0%</td>
      </tr>
  </tbody>
</table>
<h2 id="seven-domain-combination-configurations">Seven domain combination configurations</h2>
<p>All configurations train 1.3B parameter models on 330B tokens with identical architecture and hyperparameters. The configurations systematically vary domain diversity:</p>
<ul>
<li><strong>DC-1</strong>: CommonCrawl only (single source)</li>
<li><strong>DC-2</strong>: CommonCrawl + C4 (two web sources)</li>
<li><strong>DC-3</strong>: CommonCrawl + C4 with adjusted proportions</li>
<li><strong>DC-4</strong>: Wikipedia + Books + GitHub + ArXiv + StackExchange (no web crawl)</li>
<li><strong>DC-5</strong>: CommonCrawl + C4 + Wikipedia + Books (four sources, no code/academic)</li>
<li><strong>DC-6</strong>: All seven SlimPajama sources (maximum diversity)</li>
<li><strong>DC-7</strong>: RefinedWeb CommonCrawl (external single-source baseline)</li>
</ul>
<p>The experimental design probes: incremental diversity (DC-1 to DC-2 to DC-5 to DC-6), proportion sensitivity (DC-2 vs DC-3), source importance (DC-3 vs DC-4), and specialization vs generalization (individual vs combined).</p>
<h2 id="diversity-after-global-deduplication-drives-performance">Diversity after global deduplication drives performance</h2>
<h3 id="hugging-face-leaderboard-results">Hugging Face leaderboard results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Average</th>
          <th>ARC</th>
          <th>HellaSwag</th>
          <th>MMLU</th>
          <th>TruthfulQA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RedPajama-1.3B</td>
          <td>38.0</td>
          <td>37.2</td>
          <td>55.8</td>
          <td>24.9</td>
          <td>34.3</td>
      </tr>
      <tr>
          <td>DC-1 (CC only)</td>
          <td>38.5</td>
          <td>36.3</td>
          <td>56.0</td>
          <td>27.0</td>
          <td>34.8</td>
      </tr>
      <tr>
          <td>DC-4 (no web)</td>
          <td>37.6</td>
          <td>33.4</td>
          <td>53.3</td>
          <td>26.0</td>
          <td>37.6</td>
      </tr>
      <tr>
          <td>DC-6 (all sources)</td>
          <td>40.0</td>
          <td>33.7</td>
          <td>61.0</td>
          <td>26.9</td>
          <td>38.4</td>
      </tr>
      <tr>
          <td>DC-7 (RefinedWeb)</td>
          <td>41.0</td>
          <td>35.1</td>
          <td>64.7</td>
          <td>26.2</td>
          <td>37.9</td>
      </tr>
  </tbody>
</table>
<p><strong>Key patterns:</strong></p>
<ol>
<li>
<p><strong>More domain diversity improves average performance.</strong> The progression DC-1 (38.5) to DC-2 (38.4) to DC-5 (38.6) to DC-6 (40.0) shows that, aside from the marginal dip at DC-2, adding domains lifts average accuracy once global deduplication has removed cross-source redundancy.</p>
</li>
<li>
<p><strong>Global deduplication enables clean combination.</strong> All SlimPajama configurations except DC-4 outperform RedPajama-1.3B (38.0), which uses local deduplication only. The elimination of cross-source overlap means adding sources contributes genuinely new information.</p>
</li>
<li>
<p><strong>Removing web crawl data hurts.</strong> DC-4 (no CommonCrawl/C4) scores lowest (37.6), demonstrating that web text provides essential breadth even when specialized sources are included.</p>
</li>
<li>
<p><strong>Individual domains excel at specific tasks.</strong> DC-1 (CC only) achieves the highest ARC and MMLU scores. DC-4 leads on Winogrande. DC-5 leads on WSC273. No single combination dominates all tasks, reinforcing that diversity trades specialization for generalization.</p>
</li>
<li>
<p><strong>Findings transfer to 7B scale.</strong> The best 1.3B configuration insights were applied to a 7B model trained with large batch sizes, achieving 63.4 average accuracy across the extended benchmark suite.</p>
</li>
</ol>
<h3 id="training-loss-patterns">Training loss patterns</h3>
<p>DC-6 (all sources) achieves the lowest training loss among SlimPajama configurations, consistent with the downstream results. DC-4 (no web crawl) shows the highest training loss, confirming that the large, diverse web crawl data is the most important single component.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The central finding is that <strong>diversity matters most after deduplication</strong>. When cross-source redundancy is removed, each additional source contributes genuinely new signal. Without global deduplication, adding sources may just increase redundancy without proportional benefit.</p>
<p><strong>Limitations:</strong></p>
<ul>
<li>Only seven fixed configurations are tested. No systematic search over continuous mixture proportions (contrast with <a href="/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/">DoReMi</a> or <a href="/notes/natural-language-processing/language-models/data-mixing-laws-pretraining/">Data Mixing Laws</a>).</li>
<li>The configurations are not independent: DC-6 includes all sources from DC-1 through DC-5, making it difficult to isolate the contribution of any single addition.</li>
<li>Only 1.3B and 7B scales tested. Whether the diversity benefit continues scaling is unverified.</li>
<li>English-only. Cross-lingual diversity effects are not studied.</li>
<li>The paper is a technical report without formal peer review.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> All 1.3B models and datasets are publicly released under MIT license on HuggingFace.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>SlimPajama</td>
          <td>627B tokens</td>
          <td>Globally deduplicated from 1.2T RedPajama</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>RefinedWeb</td>
          <td>600B tokens</td>
          <td>External CC-only baseline</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HF Leaderboard (ARC, HellaSwag, MMLU, TruthfulQA)</td>
          <td>Standard</td>
          <td>4 benchmarks</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Extended suite</td>
          <td>12 additional benchmarks</td>
          <td>Zero and few-shot</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>1.3B parameter Cerebras-GPT architecture with ALiBi positional encoding and SwiGLU activation. All configurations trained on 330B tokens. 7B model trained with large batch-size (LBS) strategy on Cerebras 16x CS-2 cluster (80 PFLOP/s in bf16).</p>
<h3 id="hardware">Hardware</h3>
<p>Cerebras 16x CS-2 cluster, 80 PFLOP/s in bf16 mixed precision.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/MBZUAI-LLM/SlimPajama-DC">SlimPajama-DC Models</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>All 1.3B DC configurations (select via revision)</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/MBZUAI-LLM/SlimPajama-627B-DC">SlimPajama-627B-DC Dataset</a></td>
          <td>Dataset</td>
          <td>-</td>
          <td>Source-split version of SlimPajama-627B</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{shen2023slimpajamadc,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SlimPajama-DC: Understanding Data Combinations for LLM Training}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Shen, Zhiqiang and Tao, Tianhua and Ma, Liqun and Neiswanger, Willie and Liu, Zhengzhong and Wang, Hongyi and Tan, Bowen and Hestness, Joel and Vassilieva, Natalia and Soboleva, Daria and Xing, Eric}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2309.10818}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Scaling Data-Constrained Language Models</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/scaling-data-constrained-language-models/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/scaling-data-constrained-language-models/</guid><description>Muennighoff et al. extend Chinchilla scaling laws to repeated data, finding up to 4 epochs cause negligible loss and 16 epochs mark diminishing returns.</description><content:encoded><![CDATA[<h2 id="an-empirical-study-of-scaling-under-data-constraints">An empirical study of scaling under data constraints</h2>
<p>This is a <strong>discovery paper</strong> that systematically investigates what happens when language models are trained for multiple epochs on repeated data. It extends the Chinchilla scaling laws to the data-constrained regime by proposing a new scaling formula that accounts for the diminishing value of repeated tokens, validated across 400+ training runs ranging from 10M to 9B parameters and up to 1500 epochs.</p>
<h2 id="running-out-of-unique-training-data">Running out of unique training data</h2>
<p>The Chinchilla scaling laws assume unlimited unique data: for a given compute budget, there exists an optimal balance of model parameters and training tokens. But extrapolating these laws to larger models implies data requirements that exceed what is available. Villalobos et al. estimated that high-quality English text would be exhausted by 2024 under Chinchilla-optimal scaling. Most prior large language models trained for a single epoch, and some work explicitly warned against data reuse. The Galactica models (trained for 4.25 epochs) showed that multi-epoch training could work, but no systematic study had quantified the tradeoff between repeated data and fresh data, or how to allocate compute optimally when data is finite.</p>
<h2 id="effective-data-with-exponential-decay-for-repetition">Effective data with exponential decay for repetition</h2>
<p>The paper generalizes the Chinchilla scaling law by replacing raw token count $D$ with an effective data term $D'$ that accounts for the diminishing value of repeated tokens:</p>
<p>$$
L(N, D) = \frac{A}{N'^{\alpha}} + \frac{B}{D'^{\beta}} + E
$$</p>
<p>where the effective data is:</p>
<p>$$
D' = U_{D} + U_{D} R_{D}^{*} \left(1 - e^{-R_{D}/R_{D}^{*}}\right)
$$</p>
<p>Here $U_{D}$ is the number of unique tokens, $R_{D}$ is the number of repetitions (epochs minus 1), and $R_{D}^{*}$ is a learned constant representing the &ldquo;half-life&rdquo; of data repetition. When $R_{D} = 0$ (single epoch), $D' = U_{D} = D$ and the formula reduces to standard Chinchilla. When $R_{D} \ll R_{D}^{*}$, repeated data is worth almost the same as fresh data. As $R_{D}$ grows large, the value of repeated tokens decays to zero, and $D'$ saturates at $U_{D}(1 + R_{D}^{*})$, meaning no amount of repetition can substitute for more than $R_{D}^{*}$ epochs&rsquo; worth of fresh data.</p>
<p>A symmetric formula handles excess parameters:</p>
<p>$$
N' = U_{N} + U_{N} R_{N}^{*} \left(1 - e^{-R_{N}/R_{N}^{*}}\right)
$$</p>
<p>where $U_{N}$ is the compute-optimal parameter count for $U_{D}$ unique tokens and $R_{N}$ measures how much the model exceeds that count. The fitted values are $R_{D}^{*} \approx 15.0$ (data repetition half-life at ~16 epochs) and $R_{N}^{*} \approx 5.3$ (excess parameters decay faster than repeated data).</p>
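<p>The effective-data formula is simple enough to evaluate directly. A small sketch using the fitted constant (the unique-token budget is illustrative):</p>
<pre><code class="language-python">import math

R_D_STAR = 15.0   # fitted repetition constant (repetitions = epochs - 1)

def effective_data(U_D, epochs):
    """Effective tokens D' when U_D unique tokens are seen for `epochs` passes."""
    R_D = epochs - 1
    return U_D + U_D * R_D_STAR * (1 - math.exp(-R_D / R_D_STAR))

U = 25e9  # 25B unique tokens
for epochs in (1, 4, 16, 100):
    d_eff = effective_data(U, epochs)
    print(f"{epochs:3d} epochs: D' = {d_eff / 1e9:6.1f}B "
          f"({d_eff / (U * epochs):.0%} of the raw token count)")
</code></pre>
<p>Evaluated this way, 4 epochs keep repeated tokens at roughly 93% of the value of fresh ones, while at 100 epochs the raw count overstates the effective data by more than a factor of six, matching the 4-epoch safe zone and 16-epoch half-life discussed below.</p>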
<h2 id="experiments-across-400-models">Experiments across 400+ models</h2>
<p><strong>Scale.</strong> Models from 10M to 9B parameters, trained for up to 1500 epochs. Three experimental protocols: fixed unique data (100M, 400M, 1.5B tokens), fixed FLOPs, and parametric fitting across all runs. Training on C4 (English web text) with GPT-2 architecture decoder-only transformers.</p>
<h3 id="resource-allocation-epochs-scale-faster-than-parameters">Resource allocation: epochs scale faster than parameters</h3>
<p>With fixed unique data, results show that more than 50% loss reduction is possible by training beyond one epoch and increasing model size beyond the single-epoch optimum. The data-constrained efficient frontier recommends allocating most additional compute to more epochs rather than more parameters, because excess parameters decay faster ($R_{N}^{*} &lt; R_{D}^{*}$). This contrasts with Chinchilla, which recommends scaling both equally.</p>
<p>A concrete validation: training the data-constrained compute-optimal model for $9.3 \times 10^{21}$ FLOPs with 25B unique tokens, the recommended allocation (27% fewer parameters, more epochs) achieves better loss and downstream performance than the Chinchilla-optimal allocation.</p>
<h3 id="resource-return-the-4-epoch-safe-zone-and-16-epoch-half-life">Resource return: the 4-epoch safe zone and 16-epoch half-life</h3>
<table>
  <thead>
      <tr>
          <th>Epochs</th>
          <th>Loss impact</th>
          <th>Downstream impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 (baseline)</td>
          <td>Optimal</td>
          <td>Optimal</td>
      </tr>
      <tr>
          <td>Up to 4</td>
          <td>Negligible (+0.5% loss)</td>
          <td>No significant difference</td>
      </tr>
      <tr>
          <td>~16 ($R_{D}^{*}$)</td>
          <td>Diminishing returns begin sharply</td>
          <td>Measurable degradation</td>
      </tr>
      <tr>
          <td>Beyond 16</td>
          <td>Returns decay to near zero</td>
          <td>Significant degradation</td>
      </tr>
      <tr>
          <td>Extreme (44+)</td>
          <td>Training can diverge</td>
          <td>Failure</td>
      </tr>
  </tbody>
</table>
<p>The 8.7B parameter model trained for 4 epochs ($D_{C} = 44$B unique tokens) finishes with only 0.5% higher validation loss than the single-epoch model ($D_{C} = 178$B unique tokens). At the half-life of $R_{D}^{*} \approx 15$ repetitions (~16 epochs), repeated tokens have retained on average $1 - 1/e \approx 63\%$ of the value of fresh tokens, and the marginal value of one further repetition has decayed to $1/e \approx 37\%$.</p>
<h3 id="complementary-strategies-code-augmentation-and-filtering">Complementary strategies: code augmentation and filtering</h3>
<p>When data is limited, two strategies can extend the effective dataset:</p>
<p><strong>Code augmentation.</strong> Mixing Python code from The Stack with natural language data. Up to 50% code (42B tokens) shows no degradation on natural language benchmarks, effectively providing a 2x increase in useful training data. Some tasks (WebNLG generation, bAbI reasoning) actually improve with code, possibly because code trains long-range state-tracking capabilities.</p>
<p><strong>Filtering relaxation.</strong> Perplexity filtering (keeping the 25% lowest-perplexity samples) is effective on noisy datasets, but deduplication filtering does not improve downstream performance (though it may reduce memorization). The recommendation: reserve aggressive filtering for noisy data sources; for clean datasets, more data through reduced filtering is better than less data through strict filtering.</p>
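<p>The perplexity-filtering step reduces to a quantile cut over precomputed scores. A minimal sketch (documents and scores below are hypothetical; in practice the perplexities come from a reference language model):</p>
<pre><code class="language-python">import numpy as np

def perplexity_filter(docs, ppl, keep_frac=0.25):
    """Keep the keep_frac lowest-perplexity documents."""
    ppl = np.asarray(ppl, dtype=float)
    cutoff = np.quantile(ppl, keep_frac)
    return [d for d, p in zip(docs, ppl) if p &lt;= cutoff]

docs = ["doc a", "doc b", "doc c", "doc d"]
ppl = [31.2, 210.5, 48.9, 95.0]        # hypothetical reference-LM perplexities
print(perplexity_filter(docs, ppl))    # -> ['doc a']
</code></pre>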
<p><strong>Combined strategy</strong>: doubling available data with code and then repeating for 4 epochs yields 8x more training tokens with performance expected to match 8x more unique data.</p>
<h2 id="key-findings-and-limitations">Key findings and limitations</h2>
<p><strong>Key findings:</strong></p>
<ul>
<li>Multi-epoch training is beneficial, not harmful, up to moderate repetition counts.</li>
<li>The data-constrained scaling law accurately predicts loss under repetition using an exponential decay formulation.</li>
<li>Compute should be allocated to epochs faster than parameters when data is constrained.</li>
<li>Code augmentation and selective filtering extend effective data without quality degradation.</li>
</ul>
<p><strong>Limitations:</strong></p>
<ul>
<li>All experiments use the GPT-2 transformer architecture; applicability to other architectures or modalities is untested.</li>
<li>Only the entire dataset is repeated uniformly. Selectively repeating subsets (e.g., high-value data for more epochs) is not modeled.</li>
<li>Hyperparameter sensitivity (learning rate, dropout) to epoch count is unexplored. Higher learning rates may cause earlier onset of diminishing returns.</li>
<li>Focused on English text. Cross-lingual augmentation effects are not studied.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> Code, models, datasets, and hyperparameters are all publicly released under Apache 2.0.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>C4 (English)</td>
          <td>Varies by experiment</td>
          <td>Fixed unique data: 100M, 400M, 1.5B tokens</td>
      </tr>
      <tr>
          <td>Code augmentation</td>
          <td>The Stack (Python)</td>
          <td>Up to 42B tokens</td>
          <td>Mixed with natural language</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>19 NL tasks</td>
          <td>Standard splits</td>
          <td>Zero to five-shot, 114 scores per model</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Data-constrained scaling law: $D' = U_{D} + U_{D} R_{D}^{*}(1 - e^{-R_{D}/R_{D}^{*}})$ with $R_{D}^{*} \approx 15.0$, $R_{N}^{*} \approx 5.3$. Fitted using the methodology of Hoffmann et al. (2022) adapted for the repetition terms, with 400+ training runs used for the fit.</p>
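<p>A minimal Python sketch of the fitted law (the function name is illustrative, not from the released codebase; the constants are the paper's fitted values):</p>
<pre><code class="language-python">import numpy as np

R_D_STAR = 15.0   # fitted decay constant for repeated data

def effective_data(U_D, R_D, r_star=R_D_STAR):
    """Effective unique data D' for U_D unique tokens repeated R_D times."""
    return U_D + U_D * r_star * (1.0 - np.exp(-R_D / r_star))

# Saturation: D' can never exceed U_D * (1 + R_D*), i.e. about 16x the
# unique data, which matches the ~16-epoch knee in the table above.
print(effective_data(44e9, 3))     # 4 epochs  -&gt; ~1.64e11 effective tokens
print(effective_data(44e9, 100))   # extreme   -&gt; ~7.03e11, near the ceiling
</code></pre>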
<h3 id="models">Models</h3>
<p>GPT-2 architecture decoder-only transformers with GPT-2 tokenizer. Sizes: 10M to 8.7B parameters. Cosine learning rate schedule (max 2e-4, decay to 2e-5), Adam optimizer ($\beta_2 = 0.999$), dropout 0.1, weight decay 0.1, gradient clipping at 1.0. bfloat16 precision. Trained using Megatron-DeepSpeed.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Data-Constrained Optimal</th>
          <th>Chinchilla Optimal</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validation loss (9.3e21 FLOPs, 25B unique)</td>
          <td>Lower</td>
          <td>Higher</td>
          <td>27% fewer parameters</td>
      </tr>
      <tr>
          <td>Downstream (4 epochs vs 1)</td>
          <td>No significant difference</td>
          <td>Baseline</td>
          <td>8.7B params, 44B unique tokens</td>
      </tr>
      <tr>
          <td>Code augmentation (50% code)</td>
          <td>No NL degradation</td>
          <td>Baseline</td>
          <td>Some tasks improve</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Trained on the LUMI supercomputer (Finland) using AMD Instinct MI250X GPUs with data, tensor, and pipeline parallelism. Up to 256 GPUs (64 nodes) per run, with up to 2,200 nodes (~8,800 GPUs) used in parallel across all concurrent runs. Total compute: approximately 3 million GPU hours. The cluster runs on 100% renewable hydroelectric energy.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/huggingface/datablations">datablations</a></td>
          <td>Code + Models + Data</td>
          <td>Apache 2.0</td>
          <td>All 400+ models, datasets, and training code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/TurkuNLP/Megatron-DeepSpeed">Megatron-DeepSpeed fork</a></td>
          <td>Code</td>
          <td>-</td>
          <td>Training framework adapted for AMD ROCm</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{muennighoff2023scaling,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Scaling Data-Constrained Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Muennighoff, Niklas and Rush, Alexander M. and Barak, Boaz and Le Scao, Teven and Piktus, Aleksandra and Tazi, Nouamane and Pyysalo, Sampo and Wolf, Thomas and Raffel, Colin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{36}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DoReMi: Optimizing Data Mixtures for LM Pretraining</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/</guid><description>DoReMi uses a small proxy model with distributionally robust optimization to learn domain weights that speed up large-scale language model pretraining by 2.6x.</description><content:encoded><![CDATA[<h2 id="a-method-for-automatic-domain-reweighting">A method for automatic domain reweighting</h2>
<p>This is a <strong>method paper</strong> that introduces Domain Reweighting with Minimax Optimization (DoReMi), an algorithm for automatically tuning the mixture proportions of pretraining data domains. Rather than relying on heuristics or expensive downstream-task-based tuning, DoReMi uses a small proxy model trained with <a href="https://en.wikipedia.org/wiki/Robust_optimization">group distributionally robust optimization (Group DRO)</a> to produce domain weights that transfer to much larger models.</p>
<h2 id="why-data-mixture-proportions-matter">Why data mixture proportions matter</h2>
<p>Language model pretraining datasets combine text from many domains: web crawls, Wikipedia, books, code, academic papers, and others. The mixture proportions (how much of each domain to include) significantly affect downstream performance, but existing approaches either set them by hand (<a href="https://en.wikipedia.org/wiki/The_Pile_(dataset)">The Pile</a> uses heuristic weights) or tune them against downstream tasks (GLaM/PaLM), which is expensive and risks overfitting to a specific evaluation set. No principled, task-agnostic method existed for determining mixture proportions.</p>
<h2 id="minimax-optimization-over-domain-excess-loss">Minimax optimization over domain excess loss</h2>
<p>DoReMi&rsquo;s core insight is to frame data mixture optimization as a minimax problem: find domain weights that minimize the worst-case excess loss across all domains. The algorithm has three steps.</p>
<p><strong>Step 1</strong>: Train a small reference model (280M parameters) on some default domain weights $\alpha_{\text{ref}}$ (e.g., proportional to raw token count).</p>
<p><strong>Step 2</strong>: Train a small proxy model $p_{\theta}$ using Group DRO, which solves the minimax objective:</p>
<p>$$
\min_{\theta} \max_{\alpha \in \Delta^{k}} \sum_{i=1}^{k} \alpha_{i} \cdot \frac{1}{\sum_{x \in D_{i}} |x|} \sum_{x \in D_{i}} \left[ \ell_{\theta}(x) - \ell_{\text{ref}}(x) \right]
$$</p>
<p>where $\ell_{\theta}(x) = -\log p_{\theta}(x)$ and $\ell_{\text{ref}}(x) = -\log p_{\text{ref}}(x)$. The excess loss $\ell_{\theta}(x) - \ell_{\text{ref}}(x)$ measures how much headroom the proxy has to improve on each example relative to the reference. The inner maximization upweights domains with high excess loss via exponentiated gradient ascent, while the outer minimization trains the proxy on those upweighted domains.</p>
<p>At each training step, the domain weights update as:</p>
<p>$$
\alpha_{t}' \leftarrow \alpha_{t-1} \exp(\eta \lambda_{t})
$$</p>
<p>where $\lambda_{t}[i]$ is the per-domain excess loss (clipped at zero), followed by renormalization and smoothing with a uniform component: $\alpha_{t} \leftarrow (1-c)\frac{\alpha_{t}'}{\sum_{i} \alpha_{t}'[i]} + cu$, with $c = 10^{-3}$.</p>
<p>The final domain weights are the average over all training steps: $\bar{\alpha} = \frac{1}{T}\sum_{t=1}^{T} \alpha_{t}$.</p>
<p><strong>Step 3</strong>: Resample data according to $\bar{\alpha}$ and train the full-scale model using standard procedures.</p>
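<p>A minimal numpy sketch of the Step 2 weight-update loop (assuming the clipped per-domain excess losses $\lambda_{t}$ have already been computed at every proxy-training step; the function and variable names are illustrative, not the authors' code):</p>
<pre><code class="language-python">import numpy as np

def doremi_weights(excess_losses, eta=1.0, c=1e-3):
    """Exponentiated-gradient loop over per-step domain excess losses.

    excess_losses: array of shape (T, k) holding the clipped per-domain
    excess loss lambda_t[i] at each proxy-training step t.
    Returns alpha_bar, the average of the per-step domain weights.
    """
    T, k = excess_losses.shape
    alpha = np.full(k, 1.0 / k)                       # uniform start
    history = np.empty((T, k))
    for t in range(T):
        alpha = alpha * np.exp(eta * excess_losses[t])    # upweight hard domains
        alpha = (1.0 - c) * alpha / alpha.sum() + c / k   # renormalize + smooth
        history[t] = alpha
    return history.mean(axis=0)                       # average over all steps
</code></pre>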
<p><strong>Iterated DoReMi</strong> extends this by running multiple rounds, using the previous round&rsquo;s optimized weights as the next round&rsquo;s reference weights. This converges within 3 rounds on the GLaM dataset.</p>
<h2 id="experiments-across-the-pile-and-glam-datasets">Experiments across The Pile and GLaM datasets</h2>
<p><strong>Datasets.</strong> The Pile (22 domains, 800GB) and the GLaM dataset (8 domains, also used for PaLM). On The Pile, baseline weights come from the dataset defaults. On GLaM, baseline weights are uniform, with downstream-tuned oracle weights available for comparison.</p>
<p><strong>Setup.</strong> Transformer decoder-only LMs trained with next-token prediction. All models use batch size 512 and sequence length 1024. Proxy and reference models are 280M parameters. Main models are 8B parameters (30x larger). Training runs: 200K steps (Pile) or 300K steps (GLaM). The domain weight optimization cost (training two 280M models) is 8% of the compute for the 8B main model.</p>
<p><strong>Evaluation.</strong> Per-domain held-out perplexity and one-shot generative accuracy on five tasks: TriviaQA, NaturalQuestions, WebQuestions, SQuADv2, and LAMBADA.</p>
<h3 id="key-domain-weight-shifts">Key domain weight shifts</h3>
<p>On The Pile, DoReMi (280M) dramatically upweights diverse web text (Pile-CC: 0.112 to 0.606) while downweighting specialized domains like ArXiv (0.105 to 0.004), PubMed Central (0.107 to 0.005), and StackExchange (0.093 to 0.015). Smaller, underrepresented domains like YouTubeSubtitles and PhilPapers receive proportionally large increases.</p>
<h3 id="scaling-behavior">Scaling behavior</h3>
<p>DoReMi was tested with matched proxy/main model sizes (280M through 1B) and with varying proxy sizes (70M through 1B) feeding into an 8B main model.</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Speedup to baseline accuracy</th>
          <th>Downstream improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DoReMi (280M to 280M)</td>
          <td>4x</td>
          <td>+2% avg accuracy</td>
      </tr>
      <tr>
          <td>DoReMi (280M to 8B)</td>
          <td>2.6x</td>
          <td>+6.5% avg accuracy</td>
      </tr>
      <tr>
          <td>DoReMi (150M to 8B)</td>
          <td>~2x</td>
          <td>Significant</td>
      </tr>
      <tr>
          <td>DoReMi (1B to 8B)</td>
          <td>~2x</td>
          <td>Significant</td>
      </tr>
  </tbody>
</table>
<p>Improvements are consistent across all tested model scales (280M to 1B matched), with no sign of diminishing returns at larger sizes.</p>
<h2 id="perplexity-improves-everywhere-even-on-downweighted-domains">Perplexity improves everywhere, even on downweighted domains</h2>
<p>The most striking finding is that DoReMi improves perplexity on all 22 domains in The Pile, including domains it downweights. The proposed explanation: the lowest-entropy domains need few samples to learn (they&rsquo;re statistically simple), while the highest-entropy domains have token distributions close to the uniform initialization and also need fewer samples. Reallocating weight to medium-entropy domains generates positive transfer that lifts all domains.</p>
<p>On The Pile, DoReMi reaches the baseline&rsquo;s downstream accuracy in 75K steps versus 200K for the baseline (2.6x speedup) and achieves a 6.5% absolute improvement in average one-shot accuracy at 200K steps.</p>
<p>On the GLaM dataset, iterated DoReMi (round 2) matches the performance of domain weights that were tuned directly on downstream task performance, despite having no knowledge of downstream tasks. Domain weights converge within 3 iterations.</p>
<h3 id="ablations">Ablations</h3>
<p>Using only the proxy model&rsquo;s loss (prefer hardest domains) or only the negative reference loss (prefer easiest domains) both underperform the full excess loss formulation. Both components are necessary: the excess loss identifies domains where the proxy has room to improve relative to what is learnable.</p>
<p>The proxy model itself typically underperforms the main model trained on its weights, and this gap grows at larger proxy scales. A 1B proxy model underperforms the 1B baseline, yet its domain weights still improve 1B main model training by over 2x. This suggests the domain weight signal is robust even when the proxy model itself is not well-trained.</p>
<h3 id="limitations">Limitations</h3>
<p>The domain weight landscape may have multiple local optima: a 280M proxy puts most weight on Pile-CC, while a 1B proxy favors OpenWebText2 instead. Both configurations improve over baseline, but the optimal weights are not unique.</p>
<p>The granularity of &ldquo;domains&rdquo; matters. DoReMi works better with more domains (22 on The Pile versus 8 on GLaM). Domains are defined by data provenance, which is coarse-grained. Fine-grained domain definitions (e.g., via clustering) could improve results but also risk DRO putting all weight on a small set of worst-case examples.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>The Pile</td>
          <td>800 GB, 22 domains</td>
          <td>Default heuristic weights as baseline</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>GLaM dataset</td>
          <td>8 domains</td>
          <td>Uniform weights as baseline; downstream-tuned oracle available</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TriviaQA, NaturalQuestions, WebQuestions, SQuADv2, LAMBADA</td>
          <td>Standard splits</td>
          <td>One-shot generative evaluation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Group DRO with exponentiated gradient ascent for domain weight updates. Step size $\eta = 1$, smoothing $c = 10^{-3}$. Per-token excess loss clipped at zero. Domain weights averaged over all training steps. Iterated DoReMi converges when $|\bar{\alpha} - \alpha_{\text{ref}}|_{\infty} &lt; 10^{-3}$.</p>
<h3 id="models">Models</h3>
<p>Vanilla Transformer decoder-only models with 256K vocabulary. Sizes: 70M (3 layers), 150M (6 layers), 280M (12 layers), 510M (12 layers), 760M (12 layers), 1B (16 layers), 8B (32 layers). All use 64-dim attention heads except 8B (128-dim).</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>DoReMi (280M to 8B)</th>
          <th>Baseline (8B)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Avg one-shot accuracy</td>
          <td>+6.5% over baseline</td>
          <td>Reference</td>
          <td>5 generative tasks</td>
      </tr>
      <tr>
          <td>Worst-case log-perplexity</td>
          <td>1.46</td>
          <td>1.71</td>
          <td>Across 22 Pile domains</td>
      </tr>
      <tr>
          <td>Avg log-perplexity</td>
          <td>1.40</td>
          <td>1.64</td>
          <td>Across 22 Pile domains</td>
      </tr>
      <tr>
          <td>Domains beating baseline</td>
          <td>22/22</td>
          <td>0/22</td>
          <td>Per-domain perplexity</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Proxy and reference models (under 1B) trained on TPUv3. Models at 1B and 8B trained on TPUv4. Domain weight optimization (two 280M runs) costs 8% of 8B training FLOPs.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{xie2023doremi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xie, Sang Michael and Pham, Hieu and Dong, Xuanyi and Du, Nan and Liu, Hanxiao and Lu, Yifeng and Liang, Percy and Le, Quoc V. and Ma, Tengyu and Yu, Adams Wei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{36}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Data Mixing Laws for LM Pretraining Optimization</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/data-mixing-laws-pretraining/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/data-mixing-laws-pretraining/</guid><description>Ye et al. discover that LM loss follows an exponential law over domain mixture proportions, enabling cheap prediction and optimization of data mixtures.</description><content:encoded><![CDATA[<h2 id="an-empirical-discovery-of-predictable-mixture-loss-relationships">An empirical discovery of predictable mixture-loss relationships</h2>
<p>This is a <strong>discovery paper</strong> that identifies a quantitative, functional relationship between pretraining data mixture proportions and language model loss. The key finding is that domain-specific validation loss follows an exponential law over the linear combination of training domain proportions, and this law composes with standard scaling laws to enable cheap prediction of large-model performance under arbitrary mixtures.</p>
<h2 id="the-missing-quantitative-link-between-data-mixtures-and-performance">The missing quantitative link between data mixtures and performance</h2>
<p>Pretraining data for large language models combines text from many domains (web, code, academic, books, etc.), and mixture proportions significantly affect model quality. Existing approaches either set proportions by hand without disclosed criteria (LLaMA, Baichuan) or use algorithmic methods like <a href="/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/">DoReMi</a> that optimize qualitatively but cannot predict the quantitative effect of a specific mixture before training. Scaling laws exist for model size and data quantity, but no equivalent existed for mixture proportions. This paper fills that gap.</p>
<h2 id="the-exponential-data-mixing-law">The exponential data mixing law</h2>
<p>The core finding: for a model of fixed size trained for a fixed number of steps, the validation loss on domain $i$ as a function of the training mixture proportions $r_{1 \dots M}$ follows:</p>
<p>$$
L_{i}(r_{1 \dots M}) = c_{i} + k_{i} \exp\left(\sum_{j=1}^{M} t_{ij} r_{j}\right)
$$</p>
<p>where $c_{i}$, $k_{i}$, and $t_{ij}$ are fitted parameters. The constant $c_{i}$ represents the irreducible loss (not affected by mixture changes). The interaction coefficients $t_{ij}$ capture how training domain $j$ affects validation loss on domain $i$: negative $t_{ij}$ means domain $j$ helps domain $i$, positive means it hurts.</p>
<p>This was discovered progressively:</p>
<ol>
<li><strong>Two domains</strong>: Log-reducible-loss is linear in domain proportion (univariate exponential).</li>
<li><strong>Three domains</strong>: The exponential generalizes to a linear combination over all domain proportions (Eq. above), outperforming alternatives with comparable parameter count.</li>
<li><strong>General validation</strong>: For a validation set composed of $K$ domains with proportions $s_{1 \dots K}$, the overall loss is:</li>
</ol>
<p>$$
L(r_{1 \dots M}) = \sum_{i=1}^{K} s_{i} \left[ c_{i} + k_{i} \exp\left(\sum_{j=1}^{M} t_{ij} r_{j}\right) \right]
$$</p>
<p>When the validation set composition is unknown, implicit domain aggregation treats $s_{i}$ as learnable parameters. Setting the number of implicit domains larger than the true number works well and is robust to overestimation.</p>
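<p>The paper reports fitting these laws with an AdaBoost regressor elsewhere in its pipeline; as a generic stand-in, a scipy least-squares fit of the per-domain law might look like this (a sketch, with illustrative names):</p>
<pre><code class="language-python">import numpy as np
from scipy.optimize import curve_fit

def mixing_law(r, c, k, *t):
    """L_i(r) = c_i + k_i * exp(t_i . r); r has shape (n_mixtures, M)."""
    return c + k * np.exp(r @ np.asarray(t))

def fit_domain_law(r_samples, losses):
    """Fit (c_i, k_i, t_i1..t_iM) for one validation domain."""
    M = r_samples.shape[1]
    p0 = [losses.min(), 0.5] + [0.0] * M   # start near the best observed loss
    params, _ = curve_fit(mixing_law, r_samples, losses, p0=p0, maxfev=20000)
    return params
</code></pre>
<p>A negative fitted $t_{ij}$ then reads directly as "training domain $j$ helps validation domain $i$".</p>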
<h3 id="domain-interaction-patterns">Domain interaction patterns</h3>
<p>Visualizing the fitted $t_{ij}$ coefficients across 5 coarse Pile domains reveals three relationship types: most domain pairs are <strong>unrelated</strong> (sparse interaction matrix where each domain&rsquo;s loss is dominated by its own training proportion), some show <strong>facilitation</strong> (e.g., dialogue data helps internet text), and some show <strong>conflict</strong> (e.g., symbolic data hurts prose). This sparsity explains why the law can be fitted with fewer samples than the quadratic parameter count would suggest.</p>
<h2 id="nested-scaling-pipeline-for-cheap-prediction">Nested scaling pipeline for cheap prediction</h2>
<p>Fitting data mixing laws directly at target scale is too expensive (requires many full training runs at different mixtures). The paper proposes nesting three scaling laws:</p>
<p><strong>Step 1</strong>: For each mixture $r_{i}$ and each small model size $N_{j}$, train for $S_{0}$ steps. Fit a <a href="https://en.wikipedia.org/wiki/Power_law">power law</a> $L(S) = E_{1} + B/S^{\beta}$ over steps to extrapolate to the target step count $S_{\text{target}}$.</p>
<p><strong>Step 2</strong>: With the step-extrapolated losses for each mixture, fit a power law $L(N) = E_{2} + A/N^{\alpha}$ over model sizes to extrapolate to the target model size $N_{\text{target}}$.</p>
<p><strong>Step 3</strong>: With the predicted losses at $(N_{\text{target}}, S_{\text{target}})$ for all sampled mixtures, fit the data mixing law and search for the optimal mixture.</p>
<p>This pipeline requires only training small models (70M to 410M) for short runs (30B tokens) to predict performance of a 1B model trained for 100B tokens.</p>
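<p>The two power-law extrapolations in Steps 1 and 2 are ordinary curve fits; a sketch with made-up illustrative step counts and losses:</p>
<pre><code class="language-python">import numpy as np
from scipy.optimize import curve_fit

def power_law(x, E, A, alpha):
    """Saturating power law used for both step and model-size scaling."""
    return E + A / x ** alpha

# Step 1: extrapolate one (mixture, model size) run over training steps.
steps = np.array([2e3, 5e3, 1e4, 2e4, 3e4])   # illustrative values only
losses = np.array([4.3, 3.9, 3.6, 3.4, 3.3])
(E1, B, beta), _ = curve_fit(power_law, steps, losses,
                             p0=[3.0, 100.0, 0.5], maxfev=20000)
loss_at_target = power_law(1e5, E1, B, beta)  # predicted loss at S_target

# Step 2 repeats the same fit across model sizes N; Step 3 feeds the
# predicted (mixture, loss) pairs into the mixing-law fit above.
</code></pre>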
<h3 id="mixture-sampling-strategy">Mixture sampling strategy</h3>
<p>To get informative samples efficiently, the paper uses double-diminishing proportions: for each domain, enumerate proportions by halving from the maximum available. This distributes losses evenly across the exponential law&rsquo;s range. From 40 candidate mixtures trained at the smallest scale (70M), 20 are selected based on which subset minimizes data mixing law fitting error.</p>
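<p>One plausible reading of the halving rule, with hypothetical numbers (each full candidate mixture combines one value per domain and renormalizes to sum to 1):</p>
<pre><code class="language-python">def halving_proportions(max_prop, levels=4):
    """Double-diminishing candidates for one domain: p, p/2, p/4, ..."""
    return [max_prop / 2 ** i for i in range(levels)]

halving_proportions(0.6)   # [0.6, 0.3, 0.15, 0.075]
</code></pre>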
<h2 id="experiments-on-redpajama-and-continual-pretraining">Experiments on RedPajama and continual pretraining</h2>
<p><strong>Main experiment.</strong> Models trained on RedPajama, validated on the Pile (mimicking the common scenario where validation data comes from a different distribution than training). Small models: 70M, 160M, 305M, 410M trained for 30B tokens. Target: 1B model for 100B tokens.</p>
<p>The optimized mixture dramatically redistributes weight compared to RedPajama defaults:</p>
<table>
  <thead>
      <tr>
          <th>Domain</th>
          <th>Default</th>
          <th>Optimized</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CommonCrawl</td>
          <td>0.670</td>
          <td>0.125</td>
      </tr>
      <tr>
          <td>C4</td>
          <td>0.150</td>
          <td>0.250</td>
      </tr>
      <tr>
          <td>GitHub</td>
          <td>0.045</td>
          <td>0.141</td>
      </tr>
      <tr>
          <td>ArXiv</td>
          <td>0.045</td>
          <td>0.250</td>
      </tr>
      <tr>
          <td>Books</td>
          <td>0.045</td>
          <td>0.094</td>
      </tr>
      <tr>
          <td>StackExchange</td>
          <td>0.025</td>
          <td>0.125</td>
      </tr>
      <tr>
          <td>Wikipedia</td>
          <td>0.020</td>
          <td>0.016</td>
      </tr>
  </tbody>
</table>
<p>The optimized mixture reaches the default mixture&rsquo;s final performance in 73% of the training steps and eventually achieves performance equivalent to 48% more training on the default mixture.</p>
<p><strong>Comparison to DoReMi and DoGE.</strong> Data mixing laws outperform both: the predicted-optimal mixture achieves lower validation loss than DoReMi and DoGE (both universal and OOD settings) for 1B models trained for 100B tokens on RedPajama.</p>
<p><strong>Continual pretraining.</strong> The law extends to continual pretraining (Pythia-70M on Pile + Python code). It accurately predicts the critical mixture proportion that avoids <a href="https://en.wikipedia.org/wiki/Catastrophic_interference">catastrophic forgetting</a> on the original domain while improving the target domain. This suggests data mixing laws could guide dynamic data schedules across multi-stage pretraining.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The data mixing law provides a predictive framework rather than just an optimization algorithm. Key implications:</p>
<ul>
<li>The interaction coefficients $t_{ij}$ make domain relationships quantitatively observable before full-scale training, identifying facilitation and conflict pairs.</li>
<li>The nested pipeline&rsquo;s cost is dominated by the small-model training runs (40 mixtures at 70M scale), which is orders of magnitude cheaper than even a single target-scale run.</li>
<li>The continual pretraining application opens the door to optimizing dynamic data schedules, where mixture proportions change across training stages.</li>
</ul>
<p><strong>Limitations</strong>: The &ldquo;domain&rdquo; concept remains loosely defined (provenance-based). The nested scaling laws introduce compounding errors at each step, and predictions tend to slightly underestimate actual loss. The number of required fitting samples, while subquadratic in practice due to sparsity, still scales with the number of domains. No theoretical justification for the exponential form is provided; it is a purely empirical finding.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (pilot)</td>
          <td>The Pile (GitHub, Pile-CC, Books3)</td>
          <td>30B tokens</td>
          <td>2-domain and 3-domain experiments</td>
      </tr>
      <tr>
          <td>Training (main)</td>
          <td>RedPajama</td>
          <td>100B tokens</td>
          <td>7 domains</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>The Pile validation set</td>
          <td>Standard split</td>
          <td>Out-of-distribution relative to RedPajama</td>
      </tr>
      <tr>
          <td>Continual pretraining</td>
          <td>Pile + Python code</td>
          <td>10B tokens</td>
          <td>Pythia-70M base model</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Data mixing law: $L_{i}(r_{1 \dots M}) = c_{i} + k_{i} \exp(\sum_{j} t_{ij} r_{j})$. Fitted via AdaBoost Regressor on sampled mixtures. Step scaling law: $L(S) = E_{1} + B/S^{\beta}$. Model size scaling law: $L(N) = E_{2} + A/N^{\alpha}$. Both fitted via Huber loss minimization with LBFGS, in decomposed Chinchilla style (separate fits rather than a joint law, for stability). 40 candidate mixtures sampled via double-diminishing proportions, 20 selected for the final pipeline.</p>
<h3 id="models">Models</h3>
<p>Transformer decoder-only LMs. Pilot: 70M, 160M. Main pipeline: 70M, 160M, 305M, 410M (for fitting), 1B (target). Batch size: 1M tokens. Cosine learning rate decay with 2K step warmup, decaying to 0.1x at 100K steps.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Optimized Mixture</th>
          <th>Default Mixture</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Steps to match default final loss</td>
          <td>73K (73%)</td>
          <td>100K (100%)</td>
          <td>27% training reduction</td>
      </tr>
      <tr>
          <td>Equivalent extra training</td>
          <td>+48%</td>
          <td>Baseline</td>
          <td>Estimated via step scaling law</td>
      </tr>
      <tr>
          <td>Validation loss (1B, 100B)</td>
          <td>Lowest</td>
          <td>Higher than optimized</td>
          <td>Also beats DoReMi and DoGE</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>8 A100 GPUs. Training times per 30B-token run: 3.5 hours (70M), 8 hours (160M), 16 hours (305M), 21 hours (410M).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://pile.eleuther.ai/">The Pile</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Pilot and validation data</td>
      </tr>
      <tr>
          <td><a href="https://github.com/togethercomputer/RedPajama-Data">RedPajama</a></td>
          <td>Dataset</td>
          <td>Apache 2.0</td>
          <td>Main training data</td>
      </tr>
      <tr>
          <td><a href="https://github.com/EleutherAI/pythia">Pythia Suite</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Model architecture configs; Pythia-70M checkpoint for continual pretraining</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status: Partially Reproducible.</strong> Datasets and base model checkpoints are public. No official code release for the data mixing law fitting pipeline, mixture sampling, or the nested scaling law prediction workflow.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ye2025datamixinglaws,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ye, Jiasheng and Liu, Peiju and Sun, Tianxiang and Zhan, Jun and Zhou, Yunhua and Qiu, Xipeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RWKV: Linear-Cost RNN with Transformer Training</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/</guid><description>RWKV combines parallelizable transformer training with constant-cost RNN inference using linear attention and channel-wise decay.</description><content:encoded><![CDATA[<h2 id="a-new-architecture-bridging-rnns-and-transformers">A New Architecture Bridging RNNs and Transformers</h2>
<p>This is a <strong>Method</strong> paper that introduces RWKV (Receptance Weighted Key Value), a novel sequence model architecture that combines the parallelizable training of Transformers with the efficient $O(Td)$ inference of RNNs. RWKV can be formulated equivalently as either a Transformer (for parallel training) or an RNN (for sequential inference), achieving the lowest computational and memory complexity among comparable architectures while matching Transformer-level performance. The authors scale RWKV to 14 billion parameters, making it the largest dense RNN ever trained at the time of publication.</p>
<h2 id="the-quadratic-cost-of-self-attention">The Quadratic Cost of Self-Attention</h2>
<p>Transformers have become the dominant architecture for NLP, powering models like GPT-3, LLaMA, and Chinchilla. Their self-attention mechanism captures both local and long-range dependencies while supporting parallelized training. However, self-attention scales quadratically with sequence length in both time ($O(T^2d)$) and space ($O(T^2 + Td)$), making it computationally and memory intensive for long sequences and resource-constrained deployment.</p>
<p>RNNs, by contrast, offer linear scaling in memory and computation, but suffer from the vanishing gradient problem and cannot parallelize across the time dimension during training. This limits their scalability and makes them unable to match Transformer performance in practice.</p>
<p>Prior work on efficient Transformers (Reformer, Performer, Linformer, AFT, MEGA) has attempted to reduce this quadratic cost, often at the expense of model expressivity. RWKV aims to combine the best of both worlds: Transformer-grade training efficiency with RNN-grade inference cost, without any approximation to the attention mechanism.</p>
<h2 id="linear-attention-via-channel-wise-decay">Linear Attention via Channel-Wise Decay</h2>
<p>RWKV is built on four core vectors that interact multiplicatively at each timestep:</p>
<ul>
<li><strong>R</strong> (Receptance): receives past information, acting as a gating signal</li>
<li><strong>W</strong> (Weight): a trainable positional weight decay vector</li>
<li><strong>K</strong> (Key): analogous to keys in standard attention</li>
<li><strong>V</strong> (Value): analogous to values in standard attention</li>
</ul>
<p>The architecture consists of stacked residual blocks, each containing a <strong>time-mixing</strong> sub-block and a <strong>channel-mixing</strong> sub-block.</p>
<h3 id="token-shift">Token Shift</h3>
<p>All linear projection vectors are produced by interpolating between the current input $x_t$ and the previous input $x_{t-1}$, creating a token shift mechanism:</p>
<p>$$
r_t = W_r \cdot (\mu_r \odot x_t + (1 - \mu_r) \odot x_{t-1})
$$</p>
<p>$$
k_t = W_k \cdot (\mu_k \odot x_t + (1 - \mu_k) \odot x_{t-1})
$$</p>
<p>$$
v_t = W_v \cdot (\mu_v \odot x_t + (1 - \mu_v) \odot x_{t-1})
$$</p>
<p>where $\mu_r$, $\mu_k$, $\mu_v$ are learnable interpolation parameters. This is implemented efficiently as a simple offset in the temporal dimension.</p>
<h3 id="the-wkv-operator">The WKV Operator</h3>
<p>The core attention-like computation replaces standard dot-product attention with a channel-wise weighted sum using exponential decay:</p>
<p>$$
wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} \odot v_i + e^{u + k_t} \odot v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}}
$$</p>
<p>Here $w$ is the channel-wise time decay vector and $u$ is a separate bonus vector that attends specifically to the current token. Unlike AFT where $W$ is a pairwise matrix, RWKV treats $W$ as a channel-wise vector modified by relative position, enabling the recurrent formulation.</p>
<h3 id="output-gating">Output Gating</h3>
<p>The receptance vector gates the WKV output through a sigmoid:</p>
<p>$$
o_t = W_o \cdot (\sigma(r_t) \odot wkv_t)
$$</p>
<p>The channel-mixing block uses a similar gating mechanism with squared ReLU activation:</p>
<p>$$
o'_t = \sigma(r'_t) \odot (W'_v \cdot \max(k'_t, 0)^2)
$$</p>
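<p>Putting the token shift, WKV recurrence, and gating equations together, here is a minimal numpy sketch of one inference-mode step through both sub-blocks. The parameter dict <code>p</code> and function names are illustrative; residual connections, LayerNorms, and the exponent-shift trick the official CUDA kernel uses for numerical stability are omitted:</p>
<pre><code class="language-python">import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def token_shift(x_t, x_prev, mu):
    """Interpolate between the current and previous input vectors."""
    return mu * x_t + (1.0 - mu) * x_prev

def time_mixing_step(x_t, x_prev, num, den, p):
    """One recurrent time-mixing step over d channels.

    (num, den) carry the decayed WKV numerator and denominator:
    num = sum over past i of exp(-(t-1-i)w + k_i) * v_i, likewise den.
    """
    r = p["Wr"] @ token_shift(x_t, x_prev, p["mu_r"])
    k = p["Wk"] @ token_shift(x_t, x_prev, p["mu_k"])
    v = p["Wv"] @ token_shift(x_t, x_prev, p["mu_v"])
    # WKV: decayed history plus the current-token bonus term exp(u + k_t)
    wkv = (num + np.exp(p["u"] + k) * v) / (den + np.exp(p["u"] + k))
    out = p["Wo"] @ (sigmoid(r) * wkv)        # receptance gates the output
    # fold token t into the state with one step of channel-wise decay
    num = np.exp(-p["w"]) * num + np.exp(k) * v
    den = np.exp(-p["w"]) * den + np.exp(k)
    return out, num, den

def channel_mixing_step(x_t, x_prev, p):
    """Channel-mixing sub-block: sigmoid gate times squared-ReLU MLP."""
    r = p["Wr2"] @ token_shift(x_t, x_prev, p["mu_r2"])
    k = p["Wk2"] @ token_shift(x_t, x_prev, p["mu_k2"])
    return sigmoid(r) * (p["Wv2"] @ np.maximum(k, 0.0) ** 2)
</code></pre>
<p>Because the state is just <code>(num, den)</code> plus the previous input, memory stays constant in sequence length, which is the source of the $O(d)$ inference cost discussed next.</p>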
<h3 id="dual-mode-operation">Dual-Mode Operation</h3>
<p>During <strong>training</strong>, RWKV operates in time-parallel mode. The matrix multiplications ($W_\lambda$ for $\lambda \in \{r, k, v, o\}$) dominate at $O(BTd^2)$ and parallelize identically to standard Transformers. The element-wise WKV computation is $O(BTd)$ and parallelizes along batch and channel dimensions.</p>
<p>During <strong>inference</strong>, RWKV switches to time-sequential mode. Each timestep updates a fixed-size state vector, giving constant $O(d)$ memory and $O(Td)$ total time for generating $T$ tokens, compared to $O(T^2d)$ for standard Transformers.</p>
<h3 id="optimizations">Optimizations</h3>
<p>Three additional design choices improve training:</p>
<ol>
<li><strong>Custom CUDA kernels</strong> for the sequential WKV computation, fusing it into a single kernel on training accelerators</li>
<li><strong>Small init embedding</strong>: initializing the embedding matrix with small values plus an additional LayerNorm, accelerating convergence</li>
<li><strong>Custom initialization</strong>: most weights initialized to zero with no biases, following identity-mapping principles from residual network design</li>
</ol>
<h2 id="scaling-to-14b-parameters-and-benchmark-evaluation">Scaling to 14B Parameters and Benchmark Evaluation</h2>
<h3 id="model-scaling">Model Scaling</h3>
<p>The authors train six RWKV models from 169M to 14B parameters, all for one epoch (330B tokens) on the Pile:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>Dimension</th>
          <th>Parameters</th>
          <th>FLOP/Token</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>169M</td>
          <td>12</td>
          <td>768</td>
          <td>$1.69 \times 10^8$</td>
          <td>$2.61 \times 10^8$</td>
      </tr>
      <tr>
          <td>430M</td>
          <td>24</td>
          <td>1024</td>
          <td>$4.30 \times 10^8$</td>
          <td>$7.57 \times 10^8$</td>
      </tr>
      <tr>
          <td>1.5B</td>
          <td>24</td>
          <td>2048</td>
          <td>$1.52 \times 10^9$</td>
          <td>$2.82 \times 10^9$</td>
      </tr>
      <tr>
          <td>3B</td>
          <td>32</td>
          <td>2560</td>
          <td>$2.99 \times 10^9$</td>
          <td>$5.71 \times 10^9$</td>
      </tr>
      <tr>
          <td>7B</td>
          <td>32</td>
          <td>4096</td>
          <td>$7.39 \times 10^9$</td>
          <td>$1.44 \times 10^{10}$</td>
      </tr>
      <tr>
          <td>14B</td>
          <td>40</td>
          <td>5120</td>
          <td>$1.42 \times 10^{10}$</td>
          <td>$2.78 \times 10^{10}$</td>
      </tr>
  </tbody>
</table>
<p>The parameter count follows: $\text{params} = 2VD + 13D^2L + D(11L + 4)$, where $V = 50277$ is vocabulary size, $D$ is model dimension, and $L$ is layers. FLOPs match the standard transformer formula: $\text{FLOP} = 6 \cdot [\text{tokens}] \cdot [\text{params}]$.</p>
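<p>A quick sanity check of the parameter formula against the 169M row of the table:</p>
<pre><code class="language-python">V, D, L = 50277, 768, 12                         # the 169M configuration
params = 2 * V * D + 13 * D ** 2 * L + D * (11 * L + 4)
print(f"{params:,}")                             # 169,342,464 ~= 1.69e8
</code></pre>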
<h3 id="scaling-laws">Scaling Laws</h3>
<p>Training 45 RWKV models across varied (dataset, parameters) pairs, the authors find that RWKV follows the same log-log linear scaling law established for Transformers. The linear fit to Pareto-optimal points achieves $r^2 = 0.994$, and extrapolating an additional order of magnitude beyond the fitted range still yields $r^2 = 0.875$. This contrasts with prior claims that LSTMs do not follow transformer-like scaling.</p>
<h3 id="nlp-benchmarks">NLP Benchmarks</h3>
<p>RWKV is compared against similarly-sized models trained on comparable token budgets: Pythia, OPT, and BLOOM (all FLOP-matched). Results span twelve benchmarks: ARC (Easy/Challenge), BoolQ, COPA, HeadQA, HellaSwag, LAMBADA, OpenBookQA, PIQA, ReCoRD, SciQ, and Winogrande.</p>
<p>RWKV performs competitively with Transformers across all model sizes. On average across benchmarks, RWKV tracks closely with Pythia and outperforms OPT and BLOOM at comparable scales.</p>
<h3 id="long-context-and-extended-finetuning">Long Context and Extended Finetuning</h3>
<p>RWKV can extend its context length after pretraining through progressive finetuning: doubling from 1024 to 2048 (10B tokens), then to 4096 (100B tokens), and finally to 8192 (100B tokens). Each doubling reduces test loss on the Pile, indicating effective use of longer context.</p>
<p>On the Long Range Arena (LRA) benchmark, which tests sequences from 1,000 to 16,000 tokens, RWKV performs second only to S4 across the five datasets.</p>
<h3 id="inference-efficiency">Inference Efficiency</h3>
<p>Benchmarking text generation on CPU (x86) and GPU (NVIDIA A100 80GB) at float32 precision shows that RWKV exhibits linear scaling in generation time, while Transformers scale quadratically. This advantage grows with sequence length: for long outputs, RWKV completes generation substantially faster at equivalent model sizes.</p>
<h2 id="competitive-performance-with-key-caveats">Competitive Performance with Key Caveats</h2>
<p>RWKV demonstrates that RNN-class models can match Transformer performance at scale, while maintaining $O(Td)$ time and $O(d)$ memory during inference. The key findings are:</p>
<ol>
<li><strong>Scaling laws hold</strong>: RWKV follows the same compute-optimal scaling as Transformers ($r^2 = 0.994$), contradicting earlier claims about RNN scaling behavior</li>
<li><strong>Competitive NLP performance</strong>: Across twelve benchmarks, RWKV matches similarly-sized Transformers trained on comparable data</li>
<li><strong>Linear inference cost</strong>: Generation time scales linearly rather than quadratically, with constant memory regardless of sequence length</li>
<li><strong>Context extension</strong>: Progressive finetuning effectively extends the context window post-training</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors identify two primary limitations:</p>
<p><strong>Information compression</strong>: Linear attention funnels all past information through a single fixed-size state vector. For tasks requiring recall of specific details over very long contexts, this is mechanistically more constrained than full self-attention, which maintains direct access to all previous tokens.</p>
<p><strong>Prompt sensitivity</strong>: RWKV is more sensitive to prompt engineering than standard Transformers. The linear attention mechanism limits how much prompt information carries forward, making the order of information in the prompt particularly important. Reordering prompts improved F1 from 44.2% to 74.8% on one task.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest several avenues: applying parallel scan to reduce WKV cost to $O(B \log(T) d)$, extending RWKV to encoder-decoder and multimodal architectures, leveraging hidden states for interpretability and safety, and increasing internal state size to improve long-range recall.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BlinkDL/RWKV-LM">BlinkDL/RWKV-LM</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official PyTorch training and inference implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/BlinkDL/rwkv-4-pile-14b">Pre-trained weights (169M to 14B)</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>All six Pile-trained sizes on HuggingFace (<code>BlinkDL/rwkv-4-pile-*</code>)</td>
      </tr>
      <tr>
          <td><a href="https://pile.eleuther.ai/">The Pile</a></td>
          <td>Dataset</td>
          <td>Mixed</td>
          <td>825 GiB pretraining corpus; component licenses vary by source</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Highly Reproducible. Training code (Apache-2.0), pre-trained weights for all six model sizes, the full training corpus, and complete hyperparameters (Appendix G) are all publicly available. The only missing detail is the specific GPU cluster configuration used for pretraining.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>The Pile</td>
          <td>330B tokens</td>
          <td>One full epoch for all model sizes</td>
      </tr>
      <tr>
          <td>Context extension</td>
          <td>The Pile</td>
          <td>210B additional tokens</td>
          <td>Progressive doubling: 1024 to 8192</td>
      </tr>
      <tr>
          <td>NLP evaluation</td>
          <td>ARC, BoolQ, COPA, HeadQA, HellaSwag, LAMBADA, OpenBookQA, PIQA, ReCoRD, SciQ, Winogrande</td>
          <td>Various</td>
          <td>Zero-shot evaluation</td>
      </tr>
      <tr>
          <td>Long-range evaluation</td>
          <td>Long Range Arena (LRA)</td>
          <td>1K-16K tokens</td>
          <td>Five sub-tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam ($\beta = (0.9, 0.99)$), no weight decay</li>
<li>Precision: bfloat16</li>
<li>Training context length: 1024 tokens</li>
<li>Learning rate: constant warmup, then exponential decay</li>
<li>Auxiliary loss from PaLM (softmax normalizer regularization)</li>
<li>Batch size: 128 or 256 sequences (dynamically switched)</li>
<li>Training organized into mini-epochs of 40,320 samples each (8,043 mini-epochs per Pile epoch)</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Init LR</th>
          <th>Warmup Mini-Epochs</th>
          <th>End LR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>169M</td>
          <td>6e-4</td>
          <td>361</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>430M</td>
          <td>4e-4</td>
          <td>411</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>1.5B</td>
          <td>3e-4</td>
          <td>443</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>3B</td>
          <td>1.5e-4</td>
          <td>451</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>7B</td>
          <td>1.5e-4</td>
          <td>465</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>14B</td>
          <td>1e-4</td>
          <td>544</td>
          <td>7e-6</td>
      </tr>
  </tbody>
</table>
<p>All pretrained models (169M to 14B) are publicly released on HuggingFace (<code>BlinkDL/rwkv-4-pile-*</code>) under Apache-2.0. Training code is at <a href="https://github.com/BlinkDL/RWKV-LM">BlinkDL/RWKV-LM</a> (Apache-2.0).</p>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>All NLP benchmarks evaluated in zero-shot setting</li>
<li>FLOP-matched comparison against Pythia, OPT, BLOOM</li>
<li>Inference benchmarked on CPU (x86) and GPU (NVIDIA A100 80GB) at float32</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Inference experiments: NVIDIA A100 80GB GPU</li>
<li>Training hardware details not fully specified; FLOP budgets reported per model</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., &hellip; &amp; Zhu, R.-J. (2023). RWKV: Reinventing RNNs for the Transformer Era. In <em>Findings of the Association for Computational Linguistics: EMNLP 2023</em>, pp. 14048-14077.</p>
<p><strong>Publication</strong>: Findings of EMNLP 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/BlinkDL/RWKV-LM">GitHub Repository (Apache-2.0)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{peng2023rwkv,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{RWKV: Reinventing RNNs for the Transformer Era}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Peng, Bo and Alcaide, Eric and Anthony, Quentin and Albalak, Alon and Arcadinho, Samuel and Biderman, Stella and Cao, Huanqi and Cheng, Xin and Chung, Michael and Derczynski, Leon and Du, Xingjian and Grella, Matteo and GV, Kranthi Kiran and He, Xuzheng and Hou, Haowen and Kazienko, Przemys{\l}aw and Koco{\&#39;n}, Jan and Kong, Jiaming and Koptyra, Bart{\l}omiej and Lau, Hayden and Lin, Jiaju and Mantri, Krishna Sri Ipsit and Mom, Ferdinand and Saito, Atsushi and Song, Guangyu and Tang, Xiangru and Wind, Johan S. and Wo{\&#39;z}niak, Stanis{\l}aw and Zhang, Zhenyuan and Zhou, Qinghua and Zhu, Jian and Zhu, Rui-Jie}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Findings of the Association for Computational Linguistics: EMNLP 2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{14048--14077}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2023.findings-emnlp.936}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Block-Recurrent Transformers for Long Sequences</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/block-recurrent-transformers/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/block-recurrent-transformers/</guid><description>Block-Recurrent Transformers combine attention and recurrence for linear-complexity language modeling on long documents like books and code.</description><content:encoded><![CDATA[<h2 id="a-method-for-combining-attention-with-block-level-recurrence">A Method for Combining Attention with Block-Level Recurrence</h2>
<p>This is a <strong>Method</strong> paper that introduces the Block-Recurrent Transformer, a model architecture that integrates recurrence into the transformer framework at the block level. Rather than processing tokens one at a time (as in traditional RNNs) or attending over entire sequences (as in standard transformers), this approach applies a transformer layer recurrently across blocks of tokens. The result is a model with linear complexity in sequence length that maintains the parallelism benefits of transformers during training. A related approach, <a href="/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/">RWKV</a>, later explored similar ideas using linear attention with channel-wise decay.</p>
<h2 id="why-transformers-struggle-with-long-documents">Why Transformers Struggle with Long Documents</h2>
<p>Transformers have largely replaced RNNs for sequence modeling tasks, but their quadratic self-attention cost limits the length of sequences they can process. A transformer with a window size of 512 tokens cannot see information beyond that window, making it blind to long-range dependencies in books, technical papers, or source code repositories.</p>
<p>Prior approaches to this problem fall into several categories: sparse attention patterns (BigBird, Routing Transformers, Reformer), sequence compression (Linformer, Funnel Transformers), and linearized attention approximations. These methods either sacrifice the expressiveness of full softmax attention or introduce implementation complexity.</p>
<p>Traditional RNNs like LSTMs offer linear complexity but suffer from three key limitations: sequential processing prevents parallelism on modern hardware, a single state vector bottlenecks information capacity, and vanishing gradients limit effective memory to a few hundred tokens.</p>
<h2 id="block-level-recurrence-with-lstm-style-gates">Block-Level Recurrence with LSTM-Style Gates</h2>
<p>The core innovation is applying a standard transformer layer in a recurrent fashion along the sequence, operating on blocks of $W$ tokens rather than individual tokens. The recurrent cell maintains $S$ state vectors (typically $S = W = 512$) that are updated at each block boundary.</p>
<h3 id="the-recurrent-cell">The Recurrent Cell</h3>
<p>The cell has two processing directions:</p>
<ul>
<li><strong>Vertical direction</strong>: An ordinary transformer layer with self-attention over input tokens and cross-attention to recurrent states, producing output embeddings.</li>
<li><strong>Horizontal direction</strong>: Self-attention over current state vectors and cross-attention to input tokens, producing updated state vectors. Residual connections are replaced with gates.</li>
</ul>
<p>Self-attention and cross-attention are computed in parallel (not sequentially), with results concatenated and fed into a linear projection. Keys and values are shared between directions, while queries are separate, yielding four query sets: $Q_e^v$, $Q_s^v$ (vertical) and $Q_s^h$, $Q_e^h$ (horizontal).</p>
<h3 id="gating-mechanisms">Gating Mechanisms</h3>
<p>Two gate types are explored. The <strong>fixed gate</strong> uses a learned convex combination:</p>
<p>$$
g = \sigma(b_g)
$$</p>
<p>$$
c_{t+1} = c_t \odot g + z_t \odot (1 - g)
$$</p>
<p>where $g$ is constant after training, implementing an <a href="https://en.wikipedia.org/wiki/Moving_average">exponential moving average</a>.</p>
<p>The <strong>LSTM gate</strong> uses input and forget gates:</p>
<p>$$
i_t = \sigma(W_i h_t + b_i - 1)
$$</p>
<p>$$
f_t = \sigma(W_f h_t + b_f + 1)
$$</p>
<p>$$
c_{t+1} = c_t \odot f_t + z_t \odot i_t
$$</p>
<p>The bias offsets ($-1$ for input, $+1$ for forget) initialize the model to &ldquo;remember&rdquo; by default, which is critical for training stability. Without careful initialization, the model can fall into a local optimum where it ignores the recurrent state entirely. This echoes the <a href="/notes/machine-learning/model-architectures/can-recurrent-neural-networks-warp-time/">gate initialization challenges studied by Tallec and Ollivier</a>, who derived chrono initialization for LSTMs from time-warping invariance.</p>
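<p>A minimal numpy sketch of the two gate variants (assuming $h_t$ is the cell's pre-gate hidden input and $z_t$ the proposed state update; names are illustrative):</p>
<pre><code class="language-python">import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fixed_gate(c, z, b_g):
    """Learned EMA between old state and update; g is constant after training."""
    g = sigmoid(b_g)
    return c * g + z * (1.0 - g)

def lstm_gate(c, z, h, W_i, b_i, W_f, b_f):
    """Input/forget gates with bias offsets that default to 'remember'."""
    i = sigmoid(h @ W_i + b_i - 1.0)   # input gate biased toward closed
    f = sigmoid(h @ W_f + b_f + 1.0)   # forget gate biased toward open
    return c * f + z * i
</code></pre>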
<h3 id="gate-configurations">Gate Configurations</h3>
<p>Three configurations are tested: <strong>dual</strong> (gates on both attention and MLP outputs), <strong>single</strong> (gate only on MLP output), and <strong>skip</strong> (gate only on attention output, no MLP). The skip configuration removes the large MLP from the recurrent layer entirely.</p>
<h3 id="learned-state-ids">Learned State IDs</h3>
<p>Since the same weights are applied to all state vectors, learned &ldquo;state IDs&rdquo; (analogous to position embeddings) are added so each state vector can issue distinct queries. <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-style relative position bias is used for token self-attention, with no position bias for state-token cross-attention.</p>
<h2 id="language-modeling-on-pg19-arxiv-and-github">Language Modeling on PG19, arXiv, and GitHub</h2>
<h3 id="experimental-setup">Experimental Setup</h3>
<p>The base model is a 12-layer transformer with 150M parameters (8 heads of size 128, embedding dimension 1024, MLP hidden size 4096). The recurrent layer is placed at layer 10 with segment length $N = 4096$ and window size $W = 512$. The architecture is evaluated on three long-document datasets:</p>
<ul>
<li><strong>PG19</strong>: Full-length books from <a href="https://en.wikipedia.org/wiki/Project_Gutenberg">Project Gutenberg</a> (pre-1919)</li>
<li><strong>arXiv</strong>: Mathematics papers in LaTeX</li>
<li><strong>GitHub</strong>: Concatenated source code from open-source repositories</li>
</ul>
<p>All models report bits-per-token ($\log_2$ perplexity, lower is better).</p>
<h3 id="baselines">Baselines</h3>
<p>Five baselines are compared: Transformer-XL with window sizes of 512, 1024, and 2048, plus 12-layer and 13-layer sliding window models. The 13-layer sliding window (Slide:13L) is the primary comparison, having equivalent computation cost and parameter count to the recurrent models.</p>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Step Time</th>
          <th>PG19 (bytes)</th>
          <th>PG19 (tokens)</th>
          <th>arXiv</th>
          <th>GitHub</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>XL:512</td>
          <td>0.88</td>
          <td>1.01</td>
          <td>3.62</td>
          <td>1.45</td>
          <td>1.21</td>
      </tr>
      <tr>
          <td>XL:2048</td>
          <td>2.11</td>
          <td>0.990</td>
          <td>3.58</td>
          <td>1.31</td>
          <td>1.01</td>
      </tr>
      <tr>
          <td>Slide:13L</td>
          <td>1.00</td>
          <td>0.989</td>
          <td>3.58</td>
          <td>1.42</td>
          <td>1.17</td>
      </tr>
      <tr>
          <td>Rec:fixed:skip</td>
          <td>0.99</td>
          <td>0.952</td>
          <td>3.53</td>
          <td>1.24</td>
          <td>0.976</td>
      </tr>
      <tr>
          <td>Rec:fixed:dual</td>
          <td>1.01</td>
          <td>0.957</td>
          <td>3.52</td>
          <td>1.27</td>
          <td>0.991</td>
      </tr>
      <tr>
          <td>Feedback:fixed:skip</td>
          <td>1.35</td>
          <td>0.935</td>
          <td>3.49</td>
          <td>1.24</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Memorizing Trans. 64k</td>
          <td>1.94</td>
          <td>0.950</td>
          <td>3.53</td>
          <td>1.22</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The Rec:fixed:skip configuration achieves the best overall results while being slightly faster than the 13-layer baseline. It outperforms XL:2048, which runs over 2x slower. The block feedback variant (allowing all layers to cross-attend to recurrent states) improves perplexity further at ~35-40% higher step time.</p>
<h3 id="scaling-behavior">Scaling Behavior</h3>
<p>Models from 40M to 1.3B parameters show that the benefit of recurrence is <a href="/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/">consistent across scales</a> and increases with model size. At larger sizes, adding recurrence provides a benefit greater than doubling the number of parameters. The 1.3B parameter model achieves 26.50 word-level perplexity on PG19, setting a new state of the art at the time of publication.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>PG19 Perplexity</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compressive Transformer</td>
          <td>36</td>
          <td>33.6</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Routing Transformer</td>
          <td>22</td>
          <td>33.2</td>
          <td>490M</td>
      </tr>
      <tr>
          <td>Perceiver AR</td>
          <td>60</td>
          <td>28.9</td>
          <td>974.6M</td>
      </tr>
      <tr>
          <td>Block-Recurrent Transformer</td>
          <td>24</td>
          <td>26.50</td>
          <td>1.3B</td>
      </tr>
  </tbody>
</table>
<h3 id="ablations">Ablations</h3>
<ul>
<li><strong>Multiple recurrent layers</strong>: Two adjacent layers (9, 10) provide no benefit. Two separated layers (4, 10) help but no more than adding another non-recurrent layer.</li>
<li><strong>Number of states</strong>: Improvement up to 1024 states, degradation at 2048.</li>
<li><strong>Window size reduction</strong>: Reducing the sliding window hurts Transformer-XL dramatically but has smaller impact on the recurrent model, which compensates via recurrence.</li>
<li><strong>Gate type</strong>: The fixed gate consistently outperforms the LSTM gate despite being theoretically less expressive.</li>
</ul>
<h3 id="qualitative-analysis">Qualitative Analysis</h3>
<p>Comparing per-token predictions against Transformer-XL on PG19 books, the recurrent model&rsquo;s advantage comes overwhelmingly from predicting proper names (17/20 top-improvement tokens). In 19/20 cases, the predicted word was outside the attention window, confirming it was stored in recurrent state. The model can remember book titles and authors across 60,000+ tokens.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>The Block-Recurrent Transformer demonstrates that recurrence at the block level is a cost-effective way to improve language modeling on long sequences. The fixed:skip configuration (the simplest variant) performs best, suggesting the model primarily uses recurrence for long-range name lookup rather than complex reasoning. The fact that removing the MLP from the recurrent layer has minimal impact further supports this interpretation.</p>
<p>Key limitations include: the model was only evaluated on language modeling perplexity (no downstream tasks), the LSTM gate underperforms the simpler fixed gate (suggesting untapped potential for more expressive recurrence), and the authors acknowledge that training the recurrent layer to fully exploit its capacity for knowledge extraction will require further advances.</p>
<p>The authors note that evaluating on downstream tasks requiring long-range context (book summarization, long-document QA, code completion) is an important direction for future work.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>PG19</td>
          <td>~29k books</td>
          <td>Public domain, freely available</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>arXiv</td>
          <td>Mathematics papers</td>
          <td>Obtained via private channels, not redistributable</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>GitHub</td>
          <td>Open-source repos</td>
          <td>Obtained via private channels, not redistributable</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adafactor</li>
<li>Learning rate: 1.0 with inverse square root decay (initial experiments), cosine decay with max 0.01 (scaling experiments)</li>
<li>Warmup: 1000 steps</li>
<li>Dropout: 0.05</li>
<li>Vocabulary: 32k SentencePiece (T5 pretrained for initial, custom for scaling)</li>
<li>Gate initialization: bias of $+1$ for forget gate, $-1$ for input gate to ensure initial &ldquo;remember&rdquo; behavior</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Layers</th>
          <th>Parameters</th>
          <th>Recurrent Layers</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>12 (+1 recurrent)</td>
          <td>~151-164M</td>
          <td>Layer 10</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>24 (+2 recurrent)</td>
          <td>650M</td>
          <td>Layers 10, 20</td>
      </tr>
      <tr>
          <td>XL</td>
          <td>24 (+2 recurrent)</td>
          <td>1.3B</td>
          <td>Layers 10, 20</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Model</th>
          <th>PG19 (tokens)</th>
          <th>arXiv</th>
          <th>GitHub</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bits-per-token</td>
          <td>Rec:fixed:skip</td>
          <td>3.53</td>
          <td>1.24</td>
          <td>0.976</td>
      </tr>
      <tr>
          <td>Word-level PPL</td>
          <td>1.3B model</td>
          <td>26.50</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Error bars on PG19 are between 0.002 and 0.007 (3 runs with different seeds).</p>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: 32 Google TPU v4 replicas</li>
<li>Training time: ~48 hours for 500k steps on PG19</li>
<li>Batch size: 32 (segment length 4096) or 256 (segment length 512), adjusted so each model sees the same tokens per step</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Available</th>
          <th>License</th>
          <th>URL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Code (Meliad)</td>
          <td>Yes</td>
          <td>Apache 2.0</td>
          <td><a href="https://github.com/google-research/meliad">github.com/google-research/meliad</a></td>
      </tr>
      <tr>
          <td>PG19 Dataset</td>
          <td>Yes</td>
          <td>Public Domain</td>
          <td>Public</td>
      </tr>
      <tr>
          <td>arXiv Dataset</td>
          <td>No</td>
          <td>Not redistributable</td>
          <td>Private</td>
      </tr>
      <tr>
          <td>GitHub Dataset</td>
          <td>No</td>
          <td>Not redistributable</td>
          <td>Private</td>
      </tr>
      <tr>
          <td>Pretrained Models</td>
          <td>No</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Assessment</strong>: Partially Reproducible. Source code is available under Apache 2.0 and the PG19 dataset is public. However, two of three evaluation datasets (arXiv, GitHub) were obtained via private channels and are not redistributable. No pretrained model checkpoints are released.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hutchins, D., Schlag, I., Wu, Y., Dyer, E., &amp; Neyshabur, B. (2022). Block-Recurrent Transformers. <em>Advances in Neural Information Processing Systems 35 (NeurIPS 2022)</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{hutchins2022block,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Block-Recurrent Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hutchins, DeLesley and Schlag, Imanol and Wu, Yuhuai and Dyer, Ethan and Neyshabur, Behnam}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2203.07852}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>High-Performance Word2Vec in Pure PyTorch</title><link>https://hunterheidenreich.com/projects/modern-word2vec/</link><pubDate>Sat, 16 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/projects/modern-word2vec/</guid><description>Production-grade Word2Vec in PyTorch with vectorized Hierarchical Softmax, Negative Sampling, and torch.compile support.</description><content:encoded><![CDATA[<h2 id="overview">Overview</h2>
<p>Word2Vec is often treated as a &ldquo;solved problem&rdquo; or a black box inside libraries like Gensim. This project deconstructs the algorithm to treat it as a <strong>systems engineering challenge</strong>.</p>
<p>I built a ground-up, typed, and compiled PyTorch implementation that bridges the gap between the original C code&rsquo;s efficiency and modern GPU acceleration. The core innovation lies in <strong>&ldquo;tensorizing the tree&rdquo;</strong>, converting the pointer-chasing logic of Hierarchical Softmax into dense, vectorized operations compatible with <code>torch.compile</code>.</p>
<h2 id="features">Features</h2>
<h3 id="1-vectorized-hierarchical-softmax">1. Vectorized Hierarchical Softmax</h3>
<p>Classically, Hierarchical Softmax involves traversing a binary Huffman tree. While efficient on a CPU, this approach creates divergent execution paths on GPUs.</p>
<ul>
<li><strong>The Solution:</strong> I implemented a &ldquo;pre-computed path&rdquo; strategy. The tree traversal for every vocabulary word is flattened into fixed-size tensors (<code>word_path_indices</code>, <code>word_codes_tensor</code>) padded to the maximum depth.</li>
<li><strong>The Result:</strong> The forward pass becomes a massive, masked batch dot-product against internal node embeddings, allowing the GPU to crunch the probability tree without branching logic.</li>
</ul>
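<p>A minimal sketch of the masked batch dot-product (the tensor names <code>word_path_indices</code> and <code>word_codes_tensor</code> come from the project; everything else here is illustrative):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import torch
import torch.nn.functional as F

def hs_loss(center_vecs, targets, node_emb,
            word_path_indices, word_codes_tensor, path_mask):
    """Vectorized Hierarchical Softmax loss (sketch).

    center_vecs:       (B, D) input embeddings for the center words
    targets:           (B,) target word ids
    node_emb:          (num_internal_nodes, D) inner-node embeddings
    word_path_indices: (V, max_depth) inner nodes on each word's path
    word_codes_tensor: (V, max_depth) left/right codes (0/1) on the path
    path_mask:         (V, max_depth) 1 for real path steps, 0 for padding
    """
    nodes = node_emb[word_path_indices[targets]]   # (B, max_depth, D)
    codes = word_codes_tensor[targets].float()     # (B, max_depth)
    mask = path_mask[targets].float()              # (B, max_depth)

    # One batched dot-product replaces the per-word tree traversal.
    logits = torch.einsum("bd,bld->bl", center_vecs, nodes)
    step_loss = F.binary_cross_entropy_with_logits(
        logits, codes, reduction="none")
    return (step_loss * mask).sum(dim=1).mean()
</code></pre></div>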
<h3 id="2-infinite-streaming--sliding-windows">2. Infinite Streaming &amp; Sliding Windows</h3>
<p>To handle datasets larger than RAM (e.g., Wikipedia/CommonCrawl), I built a custom <code>IterableDataset</code> that performs a true single-pass read.</p>
<ul>
<li><strong>Efficient Windowing:</strong> It uses a <code>collections.deque</code> buffer to slide over the token stream, generating training pairs only when a new token enters the center context.</li>
<li><strong>Zipfian Subsampling:</strong> Implemented a probabilistic rejection sampling layer that downsamples frequent words (like &ldquo;the&rdquo; or &ldquo;of&rdquo;) on-the-fly, strictly adhering to the original Mikolov et al. paper&rsquo;s distribution.</li>
</ul>
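<p>The windowing logic itself is compact; here is a simplified sketch of the pair generator (illustrative, not the project&rsquo;s exact code):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import random
from collections import deque
from typing import Iterable, Iterator, Tuple

def skipgram_pairs(token_ids: Iterable[int], window: int,
                   keep_prob: dict) -> Iterator[Tuple[int, int]]:
    """Single-pass (center, context) pair generator over a token stream.

    keep_prob maps a token id to its subsampling keep-probability,
    derived from the Mikolov et al. formula for frequent words.
    """
    buf = deque(maxlen=2 * window + 1)
    for tok in token_ids:
        # Zipfian subsampling: reject frequent tokens on the fly.
        if random.random() > keep_prob.get(tok, 1.0):
            continue
        buf.append(tok)
        if len(buf) == buf.maxlen:
            center = buf[window]  # token in the middle of the window
            for i, ctx in enumerate(buf):
                if i != window:
                    yield center, ctx
</code></pre></div>
<p>Edge handling and dynamic window shrinking are omitted here; the point is that a <code>deque</code> with a fixed <code>maxlen</code> gives O(1) sliding without re-reading or materializing the corpus.</p>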
<h3 id="3-modern-production-tooling">3. Modern Production Tooling</h3>
<p>This project uses a strict &ldquo;software 2.0&rdquo; stack:</p>
<ul>
<li><strong>Dependency Management</strong>: Built with <code>uv</code> for deterministic, lightning-fast environment resolution.</li>
<li><strong>Compilation</strong>: Fully compatible with <code>torch.compile</code> (PyTorch 2.0+), allowing for graph fusion of the custom loss functions.</li>
</ul>
<h2 id="usage">Usage</h2>
<p>The library can be installed via <code>pip</code> and used as a drop-in replacement for Gensim&rsquo;s Word2Vec, with the added benefit of GPU acceleration.</p>
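<p>A hypothetical usage sketch, mirroring Gensim&rsquo;s interface (the package and argument names below are illustrative, not the project&rsquo;s actual API):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Hypothetical API sketch -- names are illustrative only.
from modern_word2vec import Word2Vec  # assumed package/module name

model = Word2Vec(
    corpus_path="enwiki.txt",  # streamed; never fully loaded into RAM
    vector_size=300,
    window=5,
    hs=True,                   # vectorized Hierarchical Softmax
    device="cuda",
    use_compile=True,          # enable torch.compile graph fusion
)
model.train(epochs=1)
print(model.most_similar("king", topn=5))
</code></pre></div>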
<h2 id="results">Results</h2>
<ul>
<li><strong>Correct Embeddings</strong>: The produced vectors pass qualitative semantic similarity checks (e.g., analogical reasoning tasks), confirming the tensorized tree produces the same geometry as sequential traversal.</li>
<li><strong>GPU-Scalable</strong>: The batched Huffman tree approach eliminates divergent GPU execution, enabling meaningful throughput gains on large vocabularies (100k+ tokens).</li>
<li><strong>OOM-Free on Large Corpora</strong>: The streaming <code>IterableDataset</code> with Zipfian subsampling runs on Wikipedia/CommonCrawl-scale text without loading data into RAM.</li>
<li><strong><code>torch.compile</code> Compatible</strong>: The custom loss functions fuse correctly under <code>torch.compile</code>, achieving kernel fusion unavailable in eager mode.</li>
</ul>
<h2 id="related-work">Related Work</h2>
<p>This project connects to related NLP work on this site:</p>
<ul>
<li><a href="/posts/intro-to-word-embeddings/">An Introduction to Word Embeddings</a>: conceptual background on the representations this library produces</li>
<li><a href="/research/word-company-vicinity/">Word Company Vicinity</a>: research applying word vector semantics to company names</li>
<li><a href="/research/semantic-network-induction/">Semantic Network Induction</a>: research on inducing semantic graphs from embedding spaces</li>
</ul>
]]></content:encoded></item><item><title>Sarcasm Detection with Transformers: A Cautionary Tale</title><link>https://hunterheidenreich.com/posts/sarcasm-detection-with-transformers/</link><pubDate>Sun, 25 Feb 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/sarcasm-detection-with-transformers/</guid><description>Learn how dataset bias can lead to misleading results in NLP: a sarcasm detection model that learned to classify news sources.</description><content:encoded><![CDATA[<h2 id="why-sarcasm-detection-is-hard">Why Sarcasm Detection Is Hard</h2>
<p>Sarcasm detection represents one of the most challenging problems in NLP. The difficulties include:</p>
<p><strong>Context dependence</strong>: Sarcasm relies on situational knowledge and shared understanding that extends beyond the text itself.</p>
<p><strong>Subtlety</strong>: Even humans struggle with sarcastic interpretation, especially in written text without vocal cues.</p>
<p><strong>Cultural variability</strong>: Sarcastic expressions vary significantly across cultures and regions.</p>
<p><strong>Annotation disagreement</strong>: Human annotators often disagree on what constitutes sarcasm.</p>
<p>These challenges raise a fundamental question: can sarcasm detection be well-defined as a computational problem? This case study explores what happens when we try (and reveals a common pitfall in dataset construction).</p>
<h2 id="the-dataset-a-hidden-flaw">The Dataset: A Hidden Flaw</h2>
<p>I used the <a href="https://huggingface.co/datasets/raquiba/Sarcasm_News_Headline">Sarcasm News Headlines dataset</a>, which combines headlines from <a href="https://theonion.com/">The Onion</a> (satirical) and <a href="https://www.huffpost.com/">The Huffington Post</a> (traditional news). The dataset contains ~50,000 examples.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> datasets <span style="color:#f92672">import</span> load_dataset
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>dataset <span style="color:#f92672">=</span> load_dataset(<span style="color:#e6db74">&#34;raquiba/Sarcasm_News_Headline&#34;</span>)
</span></span><span style="display:flex;"><span>print(dataset[<span style="color:#e6db74">&#34;train&#34;</span>][<span style="color:#ae81ff">0</span>])
</span></span><span style="display:flex;"><span>print(dataset[<span style="color:#e6db74">&#34;train&#34;</span>][<span style="color:#ae81ff">1</span>])
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>{&#39;headline&#39;: &#39;thirtysomething scientists unveil doomsday clock of hair loss&#39;,
</span></span><span style="display:flex;"><span> &#39;is_sarcastic&#39;: 1}
</span></span><span style="display:flex;"><span>{&#39;headline&#39;: &#39;dem rep. totally nails why congress is falling short on gender, racial equality&#39;,
</span></span><span style="display:flex;"><span> &#39;is_sarcastic&#39;: 0}
</span></span></code></pre></div><p><strong>The critical flaw</strong>: This dataset uses binary classification based on source domain. The Onion headlines are labeled sarcastic, HuffPost headlines are not. This creates a dangerous shortcut where models learn to detect the publication source.</p>
<p>After preprocessing to standardize column names:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>dataset <span style="color:#f92672">=</span> dataset<span style="color:#f92672">.</span>map(
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">lambda</span> example: {<span style="color:#e6db74">&#34;text&#34;</span>: example[<span style="color:#e6db74">&#34;headline&#34;</span>], <span style="color:#e6db74">&#34;label&#34;</span>: example[<span style="color:#e6db74">&#34;is_sarcastic&#34;</span>]},
</span></span><span style="display:flex;"><span>    remove_columns<span style="color:#f92672">=</span>[<span style="color:#e6db74">&#34;headline&#34;</span>, <span style="color:#e6db74">&#34;article_link&#34;</span>, <span style="color:#e6db74">&#34;is_sarcastic&#34;</span>]
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h2 id="fine-tuning-roberta">Fine-Tuning RoBERTa</h2>
<p>I fine-tuned a pre-trained RoBERTa model using standard practices:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> transformers <span style="color:#f92672">import</span> AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>model_name <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;FacebookAI/roberta-base&#34;</span>
</span></span><span style="display:flex;"><span>tokenizer <span style="color:#f92672">=</span> AutoTokenizer<span style="color:#f92672">.</span>from_pretrained(model_name)
</span></span><span style="display:flex;"><span>model <span style="color:#f92672">=</span> AutoModelForSequenceClassification<span style="color:#f92672">.</span>from_pretrained(model_name, num_labels<span style="color:#f92672">=</span><span style="color:#ae81ff">2</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Tokenize the data</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">tokenize_function</span>(examples):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> tokenizer(examples[<span style="color:#e6db74">&#34;text&#34;</span>], truncation<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>, max_length<span style="color:#f92672">=</span><span style="color:#ae81ff">512</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>tokenized_datasets <span style="color:#f92672">=</span> dataset<span style="color:#f92672">.</span>map(tokenize_function, batched<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Training configuration</span>
</span></span><span style="display:flex;"><span>training_args <span style="color:#f92672">=</span> TrainingArguments(
</span></span><span style="display:flex;"><span>    output_dir<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;./results&#34;</span>,
</span></span><span style="display:flex;"><span>    num_train_epochs<span style="color:#f92672">=</span><span style="color:#ae81ff">5</span>,
</span></span><span style="display:flex;"><span>    per_device_train_batch_size<span style="color:#f92672">=</span><span style="color:#ae81ff">32</span>,
</span></span><span style="display:flex;"><span>    evaluation_strategy<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;epoch&#34;</span>,
</span></span><span style="display:flex;"><span>    save_strategy<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;epoch&#34;</span>,
</span></span><span style="display:flex;"><span>    load_best_model_at_end<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>trainer <span style="color:#f92672">=</span> Trainer(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span>model,
</span></span><span style="display:flex;"><span>    args<span style="color:#f92672">=</span>training_args,
</span></span><span style="display:flex;"><span>    train_dataset<span style="color:#f92672">=</span>tokenized_datasets[<span style="color:#e6db74">&#34;train&#34;</span>],
</span></span><span style="display:flex;"><span>    eval_dataset<span style="color:#f92672">=</span>tokenized_datasets[<span style="color:#e6db74">&#34;test&#34;</span>],
</span></span><span style="display:flex;"><span>    tokenizer<span style="color:#f92672">=</span>tokenizer,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>trainer<span style="color:#f92672">.</span>train()
</span></span></code></pre></div><h2 id="results-too-good-to-be-true">Results: Too Good to Be True</h2>
<p>The model achieved high accuracy:</p>
<table>
  <thead>
      <tr>
          <th>Epoch</th>
          <th>Test Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>96.3%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>97.8%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>99.4%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>99.8%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>99.8%</td>
      </tr>
  </tbody>
</table>
<p>This should immediately raise red flags. Sarcasm detection is notoriously difficult, even for humans. Such high accuracy suggests the model learned a proxy task.</p>
<p>My hypothesis: <strong>The model bypassed sarcasm detection entirely, learning only to distinguish between The Onion and HuffPost writing styles.</strong></p>
<h2 id="interacting-with-the-model">Interacting with the Model</h2>
<p>Let&rsquo;s test our hypothesis by interacting with the model.</p>
<p>First, let&rsquo;s load the model and tokenizer:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> transformers <span style="color:#f92672">import</span> pipeline
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>model <span style="color:#f92672">=</span> AutoModelForSequenceClassification<span style="color:#f92672">.</span>from_pretrained(<span style="color:#e6db74">&#39;results/2024-02-25_20-24-51/checkpoint-4475&#39;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>clf <span style="color:#f92672">=</span> pipeline(<span style="color:#e6db74">&#39;text-classification&#39;</span>, model<span style="color:#f92672">=</span>model, tokenizer<span style="color:#f92672">=</span>tokenizer)
</span></span></code></pre></div><p>Now, let&rsquo;s test the model with some examples.</p>
<p>First, let&rsquo;s try an Onion article from this week, something I know to be sarcastic and not in the training data.
Let&rsquo;s use <a href="https://theonion.com/alabama-supreme-court-justice-invokes-veggietales-in-1851282252/">&ldquo;Alabama Supreme Court Justice Invokes &lsquo;VeggieTales&rsquo; In Ruling&rdquo;</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf(<span style="color:#e6db74">&#34;Alabama Supreme Court Justice Invokes ‘VeggieTales&#39; In Ruling&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.99916672706604}]
</span></span></code></pre></div><p>The model is extremely confident that this is not sarcastic.</p>
<p>Let&rsquo;s try a different Onion article, possibly even more difficult: <a href="https://theonion.com/trump-booed-frozen-burritos-and-more-this-week-in-br-1851282066/">Breaking News Trump Booed, Frozen Burritos, And More: This Week In Breaking News February 24, 2024</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf(<span style="color:#e6db74">&#34;Breaking News Trump Booed, Frozen Burritos, And More: This Week In Breaking News February 24, 2024&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993497729301453}]
</span></span></code></pre></div><p>Again, the model is very confident that this is not sarcastic. Hmm. Perhaps the problem is temporal: a model trained on older headlines may simply fail to capture how The Onion writes in 2024.</p>
<p>Let&rsquo;s try one more Onion article, this one that is still recent but a bit more of a low-hanging fruit: <a href="https://theonion.com/mom-only-likes-the-other-outback-steakhouse-1851265335/">Mom Only Likes The Other Outback Steakhouse</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf(<span style="color:#e6db74">&#34;Mom Only Likes The Other Outback Steakhouse&#34;</span>)
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_1&#39;, &#39;score&#39;: 0.9997231364250183}]
</span></span></code></pre></div><p>Finally, a correct prediction! The model is confident that this is sarcastic.
Our model detects only very specific types of sarcasm. It fails to generalize to new, unseen data within the same domain.</p>
<p>Let&rsquo;s also try some headlines from the Huffington Post, which the model should predict as not sarcastic.
Let&rsquo;s try the five most recent headlines from the Huffington Post:</p>
<ul>
<li><a href="https://www.huffpost.com/entry/donald-trump-south-carolina-nikki-haley_n_65db61f5e4b0e4346d52bed8">Donald Trump Won South Carolina - But There&rsquo;s 1 Big Caveat</a></li>
<li><a href="https://www.huffpost.com/entry/israeli-embassy-washington-man-set-fire_n_65db9364e4b0e4346d52ce3d">Man Sets Himself On Fire In Front Of Israeli Embassy In Washington</a></li>
<li><a href="https://www.huffpost.com/entry/bc-ml-israel-palestinians-temporary-truce-cease-fire_n_65db2e9ae4b0189a6a7e32ea">Israeli Media Report Progress On Reaching A Temporary Truce In Gaza And A Hostage-Prisoner Exchange</a></li>
<li><a href="https://www.huffpost.com/entry/george-latimer-race-comments-democratic-primary_n_65d8fac3e4b0cc1f2f7bafd8">A White Liberal Is Trying To Oust A Progressive Black Congressman. His Comments Could Make That Job Harder.</a></li>
<li><a href="https://www.huffpost.com/entry/mongolia-climate-change-extreme-weather_n_65d90294e4b0cc1f2f7bb527">Climate Change-Fueled Winter Extremes Put 90% Of This Country At &lsquo;High Risk&rsquo;</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>clf([
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Donald Trump Won South Carolina - But There&#39;s 1 Big Caveat&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Man Sets Himself On Fire In Front Of Israeli Embassy In Washington&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Israeli Media Report Progress On Reaching A Temporary Truce In Gaza And A Hostage-Prisoner Exchange&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;A White Liberal Is Trying To Oust A Progressive Black Congressman. His Comments Could Make That Job Harder.&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Climate Change-Fueled Winter Extremes Put 90% Of This Country At &#39;High Risk&#39;&#34;</span>
</span></span><span style="display:flex;"><span>])
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-plaintext" data-lang="plaintext"><span style="display:flex;"><span>[{&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993808269500732},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993786811828613},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9985186457633972},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993883371353149},
</span></span><span style="display:flex;"><span> {&#39;label&#39;: &#39;LABEL_0&#39;, &#39;score&#39;: 0.9993487000465393}]
</span></span></code></pre></div><p>The model is extremely confident that these are not sarcastic.</p>
<p>The model detects sarcasm only in limited cases and fails to generalize to new, unseen data, even within the same domain. This is a common problem in machine learning: training a model that performs well on a specific dataset is straightforward, while training one that generalizes remains a significant challenge.
Furthermore, what our sarcasm detection project actually produced is a domain classifier. For fuzzy concepts like sarcasm, it&rsquo;s important to be clear about what we&rsquo;re actually detecting, and to collect data at the scale and diversity needed to capture the full range of the concept.</p>
<h2 id="key-takeaways">Key Takeaways</h2>
<p>This case study reveals a fundamental problem in ML: <strong>high accuracy guarantees only performance on the training distribution</strong>. Here&rsquo;s what actually happened:</p>
<ol>
<li><strong>Dataset bias</strong>: Using publication source as a proxy for sarcasm created a shortcut for the model</li>
<li><strong>Domain classification</strong>: The model exclusively learned to distinguish writing styles</li>
<li><strong>Poor generalization</strong>: the model frequently misclassified new headlines, even ones drawn from the same two sources</li>
</ol>
<p>This is a common pitfall when building datasets for subjective concepts. The lesson: high accuracy must be accompanied by validation of the model&rsquo;s actual learned behavior.</p>
<p>For better sarcasm detection, we&rsquo;d need:</p>
<ul>
<li>Diverse sources beyond two publications</li>
<li>Human annotation across multiple contexts</li>
<li>Careful evaluation on out-of-domain examples</li>
</ul>
<p>Instructive failures in ML projects provide valuable lessons about our assumptions and the limitations of our approaches.</p>
]]></content:encoded></item><item><title>EigenNoise: Data-Free Word Vector Initialization</title><link>https://hunterheidenreich.com/research/eigennoise-contrastive-prior/</link><pubDate>Sun, 01 May 2022 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/eigennoise-contrastive-prior/</guid><description>Investigation into EigenNoise, a data-free initialization scheme for word vectors that approaches pre-trained model performance after fine-tuning.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>We developed EigenNoise, a method to initialize word vectors using <strong>zero pre-training data</strong>. By deriving a co-occurrence matrix solely from the theoretical harmonic structure of language (Zipf&rsquo;s Law), this project demonstrates that we can mathematically synthesize a &ldquo;warm-start&rdquo; for NLP models. This approach challenges the reliance on massive corpora for initialization and offers a competitive alternative for low-resource environments.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Algorithmic Innovation</strong>: Created a data-free initialization scheme by modeling independent co-occurrence statistics and applying eigen-decomposition</li>
<li><strong>Theoretical Grounding</strong>: Leveraged the <strong>harmonic statistical structure</strong> of language to derive representations from first principles</li>
<li><strong>Information-Theoretic Evaluation</strong>: Utilized <strong>Minimum Description Length (MDL)</strong> probing to rigorously measure the information content and regularity of the learned representations</li>
<li><strong>Efficiency</strong>: Demonstrated that EigenNoise vectors, once fine-tuned, match the performance of GloVe vectors (trained on Gigaword) despite seeing <strong>no pre-training text</strong></li>
</ul>
<h2 id="technical-implementation">Technical Implementation</h2>
<p>The core insight is that &ldquo;noise&rdquo; in language follows a predictable distribution.</p>
<ol>
<li><strong>Modeling</strong>: We model the &ldquo;null hypothesis&rdquo; of text, how words would co-occur if they were statistically independent but followed Zipfian rank-frequency. This yields a theoretical co-occurrence matrix $\hat{X}$:</li>
</ol>
<p>$$\hat{X}_{ij} = \frac{2mN}{r_i r_j H_N}$$</p>
<p>Where $r_i$ is the rank of word $i$, $N$ is vocabulary size, $m$ is the context window size, and $H_N$ is the $N$-th harmonic number.</p>
<ol start="2">
<li>
<p><strong>Factorization</strong>: We then solve for the word vectors by performing an <strong>eigen-decomposition</strong> on this matrix, extracting the top $d$ components to form the representation space.</p>
</li>
<li>
<p><strong>Probing</strong>: Validated performance using MDL probing on CoNLL-2003 and TweetEval benchmarks.</p>
</li>
</ol>
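<p>A literal mechanization of steps 1&ndash;2 in NumPy (a sketch only; note that $\hat{X}$ as written is rank-one, so the published method presumably applies further normalization before the decomposition is informative):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import numpy as np

def eigennoise_vectors(N: int, m: int, d: int) -> np.ndarray:
    """Data-free vectors from the Zipfian co-occurrence prior (sketch).

    N: vocabulary size, words assumed sorted by Zipfian rank
    m: context window size
    d: embedding dimension
    """
    ranks = np.arange(1, N + 1, dtype=np.float64)
    H_N = (1.0 / ranks).sum()  # N-th harmonic number

    # Theoretical co-occurrence: X_hat[i, j] = 2 m N / (r_i r_j H_N)
    X_hat = (2.0 * m * N / H_N) / np.outer(ranks, ranks)

    # Eigen-decomposition of the symmetric matrix; keep top-d components.
    eigvals, eigvecs = np.linalg.eigh(X_hat)
    order = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, order] * np.sqrt(np.abs(eigvals[order]))
</code></pre></div>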
<h2 id="why-this-matters">Why This Matters</h2>
<p>This research explores how much structure can emerge from frequency statistics alone, with no text exposure at all. The central finding is that EigenNoise vectors, derived purely from Zipf&rsquo;s Law, reach competitive performance with GloVe after fine-tuning. This is evidence that a significant portion of what we call &ldquo;learned linguistic knowledge&rdquo; is a consequence of word frequency distributions, not semantic exposure to real text.</p>
<p>In 2026, small pretrained models are freely available and handle most low-resource initialization needs, so the practical case for data-free initialization is narrower than it was in 2022. The theoretical contribution remains relevant: EigenNoise establishes a clean null hypothesis for what word vectors look like when only frequency information is present. For interpretability researchers trying to disentangle frequency artifacts from genuine semantic content, this baseline has value independent of the initialization use case.</p>
<p>The <strong>MDL probing</strong> methodology applied here also contributes beyond the main result. Unlike task accuracy, MDL measures how much information a representation encodes and how compactly, providing a more principled lens for evaluating representational quality. EigenNoise&rsquo;s co-occurrence prior is grounded directly in the <strong>Independent Frequencies Model (IFM)</strong> introduced in the companion <a href="/research/word-company-vicinity/">Word2Vec factorization paper</a>. Together, the two works form a coherent theoretical line: the IFM characterizes the frequency-driven baseline of embedding space, and EigenNoise operationalizes it as a practical, data-free initialization scheme.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{heidenreich2022eigennoisecontrastivepriorwarmstart,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{EigenNoise: A Contrastive Prior to Warm-Start Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter Scott Heidenreich and Jake Ryland Williams}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2205.04376}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CL}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2205.04376}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="related-work">Related Work</h2>
<p>For the theoretical foundation underlying EigenNoise&rsquo;s null hypothesis, including the first analytical solution to Word2Vec&rsquo;s softmax objective, see <a href="/research/word-company-vicinity/">Analytical Solution to Word2Vec Softmax &amp; Bias Probing</a>.</p>
]]></content:encoded></item><item><title>Analytical Solution to Word2Vec Softmax &amp; Bias Probing</title><link>https://hunterheidenreich.com/research/word-company-vicinity/</link><pubDate>Sun, 01 May 2022 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/word-company-vicinity/</guid><description>Analytical derivation of Word2Vec's softmax objective factorization and a new framework for detecting semantic bias in raw corpora.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>While the Skip-Gram with Negative Sampling (SGNS) objective for Word2Vec has famously been shown to factorize a shifted PMI matrix, the implicit matrix factorization of the original <strong>Softmax</strong> objective has remained an open question. In this work, we provide the first known analytical solution to Word2Vec&rsquo;s softmax-optimized skip-gram algorithm.</p>
<p>We use this derivation to introduce the <strong>Independent Frequencies Model (IFM)</strong>, identifying a &ldquo;frequency-ratios property&rdquo; that unifies classical word vector models. This theoretical insight allows us to derive a low-cost, training-free method for measuring semantic bias directly from corpus statistics.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Analytical Solution</strong>: Provided the first known analytical solution to Word2Vec&rsquo;s softmax-optimized skip-gram algorithm, proving it factorizes the log-conditional probability matrix.</li>
<li><strong>Independent Frequencies Model (IFM)</strong>: Introduced a dense co-occurrence model computable purely from unigram frequencies to act as a null hypothesis for embedding structures.</li>
<li><strong>Bias Dissonance Metric</strong>: Derived a low-cost, training-free method for measuring semantic bias directly from corpus statistics using the frequency-ratios property.</li>
<li><strong>Data Transparency</strong>: Demonstrated how specific corpora exhibit distinct bias profiles, offering a tool for auditing datasets before training large models.</li>
</ul>
<h2 id="key-theoretical-results">Key Theoretical Results</h2>
<h3 id="1-the-softmax-factorization-theorem">1. The Softmax Factorization Theorem</h3>
<p>We prove that under the log-softmax objective, Word2Vec implicitly converges towards a factorization of the <strong>log-conditional probability matrix</strong> of the co-occurrence model.</p>
<p><strong>Theorem:</strong> For the objective
$\mathcal{L}_{\text{soft}} = - \sum_{t,s} F_{t,s}^m \log \varphi(\vec{u}_t \vec{v}_s^T)$,
where $\varphi$ denotes the softmax over context words, the algorithm converges to:</p>
<p>$$
\vec{u}_{t}\vec{v}_{s}^{T} = \log\frac{F_{t,s}^{m}}{f_{t}^{m}}
$$</p>
<p>where $F_{t,s}^m$ is the co-occurrence count and $f_t^m$ is the marginal frequency. This effectively makes the dot product of the embedding vectors equal to the log-conditional probability of the context word given the target word.</p>
<h3 id="2-the-independent-frequencies-model-ifm">2. The Independent Frequencies Model (IFM)</h3>
<p>To understand the baseline behavior of these models, we introduce the IFM, which models a dense co-occurrence matrix computable purely from unigram frequencies:</p>
<p>$$
\hat{F}_{t,s}^{m} = \frac{2m f_t f_s}{M}
$$</p>
<p>This model acts as a &ldquo;null hypothesis&rdquo; for embedding structures, allowing us to isolate true semantic signals from statistical noise.</p>
<h2 id="methodological-innovation-bias-dissonance">Methodological Innovation: Bias Dissonance</h2>
<p>Leveraging the frequency-ratios property derived from our factorization, we propose a metric called <strong>Dissonance ($\Delta$)</strong> to probe semantic bias in data without training a model.</p>
<p>For an analogy $A:B :: C:D$ (e.g., <em>man:king :: woman:queen</em>), we measure the alignment of their corpus frequency ratios. High dissonance indicates that the corpus statistics do not support the analogy, potentially revealing bias or under-representation.</p>
<p><strong>Intuitive Example:</strong> If a corpus contains the phrase <em>&ldquo;man is king&rdquo;</em> 100 times more often than <em>&ldquo;woman is queen,&rdquo;</em> the frequency ratios are misaligned. A perfect, unbiased analogy would have matching ratios (i.e., <em>man</em> relates to <em>king</em> at the same rate <em>woman</em> relates to <em>queen</em>). Any deviation from this symmetry is captured by our dissonance metric, revealing where the data itself encodes asymmetric associations.</p>
<p>$$
\Delta(t,s,\bar{t},\bar{s} \mid \mathcal{D}) = \left| \log\frac{f_{t}f_{\bar{s}}}{f_{s}f_{\bar{t}}} \right| \Big/ \max_{l \in \mathcal{V}} \log f_l
$$</p>
<p>By applying this to the <strong>Bigger Analogy Test Set (BATS)</strong>, we demonstrated how specific corpora (like Wikipedia vs. Google Books) exhibit distinct bias profiles regarding geographic and encyclopedic knowledge.</p>
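<p>Because the metric needs only unigram counts, it is a few lines of code (a sketch; the example counts below are made up):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math

def dissonance(f_t: int, f_s: int, f_t_bar: int, f_s_bar: int,
               max_log_f: float) -> float:
    """Bias dissonance for an analogy t:s :: t_bar:s_bar.

    f_* are corpus unigram frequencies; max_log_f is the log frequency
    of the most frequent word in the vocabulary (the normalizer).
    """
    return abs(math.log((f_t * f_s_bar) / (f_s * f_t_bar))) / max_log_f

# man:king :: woman:queen with illustrative, made-up counts:
print(dissonance(f_t=120_000, f_s=9_000, f_t_bar=95_000, f_s_bar=4_000,
                 max_log_f=math.log(2_500_000)))  # ~0.04
</code></pre></div>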
<h2 id="visualizing-statistical-independence">Visualizing Statistical Independence</h2>















<figure class="post-figure center ">
    <img src="/img/word-bias-iqr.webp"
         alt="Plot showing the portion of statistically dependent information decreasing as window size increases, with curves for different corpus sizes and an inset showing power-law decay"
         title="Plot showing the portion of statistically dependent information decreasing as window size increases, with curves for different corpus sizes and an inset showing power-law decay"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The Information Quality Ratio measuring the portion of co-occurrence information that is statistically dependent, plotted against window size. Colors indicate corpus size from the GUM corpus. The dashed lines show the IFM prediction. The inset reveals the power-law decay rate, demonstrating how linguistic dependencies diminish predictably with context distance.</figcaption>
    
</figure>

<h2 id="impact">Impact</h2>
<p>This work bridges the gap between empirical success and theoretical foundations in NLP by:</p>
<ol>
<li><strong>Solving a fundamental mechanism:</strong> Providing the missing factorization proof for Softmax Word2Vec.</li>
<li><strong>Efficient Pre-training:</strong> Suggesting that embedding layers can be &ldquo;warm-started&rdquo; using unigram statistics derived from the IFM.</li>
<li><strong>Data Transparency:</strong> Offering a computationally inexpensive tool for auditing datasets for bias before investing resources in training large models.</li>
</ol>
<h2 id="my-contribution">My Contribution</h2>
<p>Jake Williams is the first author and primary driver of this work. He developed the core theory, derived the factorization proofs, designed the dissonance metric, and ran the experiments. My role was supporting: I contributed through critique and refinement during the writing process, but the intellectual heavy lifting belongs to Jake.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{williams2022knowcompanywordslies,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{To Know by the Company Words Keep and What Else Lies in the Vicinity}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jake Ryland Williams and Hunter Scott Heidenreich}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2205.00148}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.CL}</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2205.00148}</span>,
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="related-work">Related Work</h2>
<p>For a complementary analytical approach to word representations, deriving data-free word vector initializations from the same frequency-ratio insights, see <a href="/research/eigennoise-contrastive-prior/">EigenNoise: Data-Free Word Vector Initialization</a>.</p>
]]></content:encoded></item><item><title>GPT-2 Susceptibility to Universal Adversarial Triggers</title><link>https://hunterheidenreich.com/research/gpt2-adversarial-triggers/</link><pubDate>Sat, 01 May 2021 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/gpt2-adversarial-triggers/</guid><description>Investigation into whether universal adversarial triggers can control both topic and stance of GPT-2's generated text and security implications.</description><content:encoded><![CDATA[<blockquote>
<p><strong>Historical context:</strong> This paper was published in 2021, predating the modern red-teaming practices and adversarial robustness benchmarks that emerged with instruction-tuned and RLHF-trained models. GPT-2 is now a historical baseline, but the core methodology and findings remain a relevant foundation for current adversarial robustness work.</p></blockquote>
<h2 id="abstract">Abstract</h2>
<p>This work investigates universal adversarial triggers (UATs), a method for disrupting language models using input-agnostic token sequences. We asked whether it is possible to use these triggers to control the <strong>topic</strong> and the <strong>stance</strong> of text generated by GPT-2. Across four controversial topics, we demonstrated success in identifying triggers that guide the model to produce text on a targeted subject and influence the position it takes. Our goal is to raise awareness that even deployed models are susceptible to this influence and to advocate for immediate safeguards.</p>
<h2 id="key-findings--contributions">Key Findings &amp; Contributions</h2>
<ul>
<li><strong>Topic and Stance Control</strong>: We were the first to systematically explore using UATs to control both the topic and the stance of a language model&rsquo;s output. We found that controlling the topic is highly feasible, and controlling the stance is also possible.</li>
<li><strong>The &ldquo;Filter Bubble&rdquo; Hypothesis</strong>: We observed that triggers for fringe topics (e.g., Flat Earth) were harder to find but offered a higher degree of stance control than broader topics. We posit this may reflect &ldquo;filter bubbles&rdquo; in the training data, where fringe viewpoints use distinct linguistic patterns.</li>
<li><strong>Ethical &amp; Security Analysis</strong>: We highlighted the security risks of deployed models being manipulated by external adversaries without internal model access. To be responsible, we withheld the most sensitive triggers we discovered.</li>
<li><strong>Constructive Applications</strong>: Beyond a security flaw, we proposed that UATs could be used constructively as a <strong>diagnostic tool</strong> to audit models for bias or as a method for <strong>bot detection</strong> on social media.</li>
</ul>
<h2 id="significance--why-this-matters">Significance &amp; Why This Matters</h2>
<p>This work extended early research on UATs by moving beyond single-issue attacks (like generating toxic content) to a nuanced analysis of topic and stance control. It demonstrated that a <strong>gradient-based search process (adapting HotFlip)</strong> is effective at manipulating model outputs, emphasizing a critical vulnerability for any organization deploying large language models.</p>
<p>For ML practitioners and security researchers, this highlights the importance of robust safeguards against input-agnostic attacks. It also opens the door to using these same adversarial techniques constructively: as diagnostic tools to audit models for hidden biases or to detect automated bot activity on social media platforms.</p>
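<p>The search itself follows the HotFlip first-order approximation: score every vocabulary token as a replacement at each trigger position using the gradient of the loss with respect to the trigger embeddings. A simplified sketch (not the paper&rsquo;s exact procedure):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import torch

def hotflip_candidates(trigger_ids, grad_at_trigger, embedding_matrix, k=10):
    """First-order HotFlip replacement scores for a trigger (sketch).

    Approximates the loss change of swapping token w_i for w' at each
    position by (e_w' - e_w_i) . dL/de_i and keeps the top-k candidates
    that most decrease the target loss.

    trigger_ids:      (T,) current trigger token ids
    grad_at_trigger:  (T, D) gradient of the loss w.r.t. trigger embeddings
    embedding_matrix: (V, D) model input embeddings
    """
    cur_emb = embedding_matrix[trigger_ids]              # (T, D)
    # Score every vocabulary token at every trigger position.
    delta = embedding_matrix @ grad_at_trigger.T         # (V, T)
    delta = delta - (cur_emb * grad_at_trigger).sum(-1)  # subtract current
    return torch.topk(-delta.T, k=k, dim=-1).indices     # (T, k) candidates
</code></pre></div>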
<h2 id="related-work">Related Work</h2>
<p>The constructive bot-detection application proposed here connects directly to empirical work on coordinated inauthentic behavior. <a href="/research/coordinated-social-targeting/">Coordinated Social Targeting on Twitter</a> documents real-world follower-manipulation patterns on high-profile accounts, illustrating the kind of automated adversarial activity that UAT-based detection methods could help identify.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{10.1145/3461702.3462578,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Heidenreich, Hunter Scott and Williams, Jake Ryland}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{The Earth Is Flat and the Sun Is Not a Star: The Susceptibility of GPT-2 to Universal Adversarial Triggers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">isbn</span> = <span style="color:#e6db74">{9781450384735}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Association for Computing Machinery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">{New York, NY, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1145/3461702.3462578}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1145/3461702.3462578}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">{Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{566--573}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">numpages</span> = <span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">keywords</span> = <span style="color:#e6db74">{adversarial attacks, bias, language modeling, natural language processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">location</span> = <span style="color:#e6db74">{Virtual Event, USA}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span> = <span style="color:#e6db74">{AIES &#39;21}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Data-Driven WordNet Construction from Wiktionary</title><link>https://hunterheidenreich.com/research/semantic-network-induction/</link><pubDate>Fri, 01 Nov 2019 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/semantic-network-induction/</guid><description>We introduce an unsupervised algorithm for inducing semantic networks from noisy, crowd-sourced data, producing a resource with over 344,000 linked examples.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>We introduce a novel <strong>unsupervised algorithm</strong> for inducing semantic networks from noisy, crowd-sourced data. By framing network construction as a &ldquo;relationship disambiguation&rdquo; task, we process the entirety of Wiktionary to build a massive, WordNet-like semantic resource. The resulting network is an order of magnitude larger than Princeton WordNet and features over <strong>344,000 linked example sentences</strong> (vs. WordNet&rsquo;s 68k). Evaluation on standard word similarity benchmarks demonstrates that our fully data-driven approach yields semantic structures competitive with expert-annotated resources.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Unsupervised Hierarchy Induction</strong>: We propose a deterministic algorithm to construct a Directed Acyclic Graph (DAG) of senses from pairwise relationships, effectively inducing a semantic hierarchy without human supervision.</li>
<li><strong>A Massive Semantic Resource</strong>: We release a dataset enriched with hundreds of thousands of semantically linked usage examples, serving as a critical resource for tasks like Word Sense Disambiguation (WSD).</li>
<li><strong>Novel Disambiguation Framework</strong>: We model &ldquo;relationship disambiguation&rdquo; using a Laplacian kernel and FastText embeddings to filter noisy user annotations.</li>
<li><strong>Open-Source Infrastructure</strong>: We provide a full pipeline for downloading, parsing, and constructing networks from Wiktionary data.</li>
</ul>
<h2 id="technical-approach">Technical Approach</h2>
<p>The core of our method addresses the noise inherent in crowd-sourced dictionaries. We frame the problem as <strong>Latent Semantic Network Induction</strong>:</p>
<ol>
<li><strong>Relationship Disambiguation</strong>: For every linked pair of words (e.g., <em>go</em> ~ <em>proceed</em>), we define a semantic subspace using their definitions. We utilize <strong>FastText embeddings</strong> and a <strong>Laplacian kernel</strong> to identify which specific definitions participate in the relationship.</li>
<li><strong>Hierarchy Construction</strong>: We apply a custom intersection algorithm that treats more general senses as the &ldquo;overlap&rdquo; between specific definition sets. We formalize this as a set-theoretic &ldquo;hole punching&rdquo; operation, where a general sense $t$ is defined by the intersection of definition sets $\mathbb{D}'$, excluding any broader intersections:</li>
</ol>
<p>$$f^{-1}(t) = \left(\bigcap_{\mathbb{D}'} D_{u\sim v}\right) \setminus \left(\bigcup_{\mathbb{D} \supset \mathbb{D}'} \bigcap_{\mathbb{D}} D_{u\sim v}\right)$$</p>
<h2 id="evaluation--validation">Evaluation &amp; Validation</h2>
<p>The primary achievement is scale: our induced network contains over <strong>344,000 linked example sentences</strong>, compared to Princeton WordNet&rsquo;s 68,000 (more than 5x the coverage), built entirely from crowd-sourced data without expert annotation.</p>
<p>Beyond scale, the network holds up semantically. On standard noun-similarity benchmarks (RG-65), the unsupervised network achieves a Spearman rank correlation of $\rho = 0.83$, matching the performance of Explicit Semantic Analysis (ESA) models built on expert-annotated WordNet ($\rho = 0.82$). The point is not that we beat WordNet by 0.01. It is that a fully automated approach over noisy Wiktionary data produces a resource of comparable quality at 5x the scale.</p>
<h2 id="why-this-matters">Why This Matters</h2>
<p>Building high-quality linguistic resources typically requires expensive expert annotation. Princeton WordNet took decades of lexicographer effort. This work demonstrates that an unsupervised algorithm over crowd-sourced data can produce a resource of comparable semantic quality at more than 5x the scale. For ML practitioners, that matters: larger coverage means more training signal for downstream tasks like Word Sense Disambiguation. For this portfolio, it shows early experience building structured NLP datasets from scratch, a theme that continues in later work on large-scale document corpora.</p>
<h2 id="related-work">Related Work</h2>
<p>For a theoretical treatment of word semantics from the same collaboration, including the first analytical solution to Word2Vec&rsquo;s softmax objective, see <a href="/research/word-company-vicinity/">Analytical Solution to Word2Vec Softmax &amp; Bias Probing</a>.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{heidenreich2019latent,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Latent semantic network induction in the context of linked example senses}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Heidenreich, Hunter and Williams, Jake}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{170--180}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>QuAC: Question Answering in Context Dataset</title><link>https://hunterheidenreich.com/posts/quac-question-answering-in-context/</link><pubDate>Wed, 31 Oct 2018 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/quac-question-answering-in-context/</guid><description>Analysis of QuAC's conversational QA through student-teacher interactions, featuring 100K+ context-dependent questions and coreference challenges.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>The <a href="https://aclanthology.org/D18-1241/">QuAC dataset</a> (Question Answering in Context) presents a conversational question answering approach that models student-teacher interactions. Published at EMNLP 2018, this work by Choi et al. addresses how systems can understand dialogue context, resolve references across conversation turns, and handle natural conversation ambiguity. Previous datasets had treated each question independently, sidestepping these challenges.</p>
<p>The dataset addresses limitations in question answering research by incorporating real-world information-seeking dialogue complexities, where questions build upon previous exchanges and context drives understanding.</p>
<p>For comparison with related work, see my analysis of <a href="/posts/coqa-conversation-question-answering/">CoQA</a>.</p>
<h2 id="the-student-teacher-framework">The Student-Teacher Framework</h2>
<p>QuAC models information-seeking dialogue through a student-teacher setup:</p>
<ul>
<li><strong>Teacher</strong>: Has complete access to information (Wikipedia passage)</li>
<li><strong>Student</strong>: Seeks knowledge through questioning with limited initial context</li>
<li><strong>Interaction</strong>: Handles context-dependent questions, abstract inquiries, and unanswerable requests</li>
</ul>
<p>This framework mirrors real-world scenarios where one party has expertise while another seeks to learn through dialogue. AI systems must act as effective teachers, using available information to provide helpful responses despite ambiguous or incomplete questions.</p>
<p>The dataset contains over 100,000 questions across 14,000+ dialogues, providing substantial scale for training and evaluation.</p>
<figure class="post-figure center ">
    <img src="/img/quac_stats.webp"
         alt="QuAC dataset statistics and scale"
         title="QuAC dataset statistics and scale"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">QuAC dataset statistics and scale</figcaption>
    
</figure>

<h2 id="dataset-construction">Dataset Construction</h2>
<p>QuAC was built using Amazon Mechanical Turk with a two-person dialogue setup:</p>
<p><strong>Teacher role</strong>: Has access to the complete Wikipedia passage and provides answers extracted directly from the text</p>
<p><strong>Student role</strong>: Sees only the article title, introduction paragraph, and section heading, then asks questions to learn about the content</p>
<p>This asymmetric information design ensures student questions naturally differ from the passage content, creating realistic information-seeking scenarios. The extractive answer requirement maintains objective evaluation while simplifying scoring.</p>
<p><strong>Dialogues terminate when any of the following occurs</strong>:</p>
<ul>
<li>12 questions have been answered</li>
<li>Either participant manually ends the dialogue</li>
<li>Two consecutive questions are unanswerable</li>
</ul>
<figure class="post-figure center ">
    <img src="/img/quac_convo.webp"
         alt="Example QuAC conversation showing student-teacher interaction"
         title="Example QuAC conversation showing student-teacher interaction"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Example QuAC conversation showing student-teacher interaction</figcaption>
    
</figure>

<h3 id="content-selection">Content Selection</h3>
<p>QuAC focuses on Wikipedia biographical articles for several practical reasons:</p>
<ul>
<li><strong>Reduced complexity</strong>: People-focused content requires less specialized domain knowledge</li>
<li><strong>Natural question flow</strong>: Biographical information lends itself to sequential questioning</li>
<li><strong>Quality control</strong>: Articles filtered to include only subjects with 100+ incoming links, ensuring content depth</li>
</ul>
<p>This focused scope enables consistent evaluation while maintaining broad coverage through diverse biographical subjects across fields and time periods.</p>
<h2 id="key-dataset-characteristics">Key Dataset Characteristics</h2>
<p>QuAC introduces several features that distinguish it from existing question answering benchmarks:</p>
<figure class="post-figure center ">
    <img src="/img/quac_comparison.webp"
         alt="Comparative analysis of QuAC against other QA datasets"
         title="Comparative analysis of QuAC against other QA datasets"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Comparative analysis of QuAC against other QA datasets</figcaption>
    
</figure>

<p><strong>Notable features</strong>:</p>
<ul>
<li><strong>High contextual dependency</strong>: 86% of questions require coreference resolution</li>
<li><strong>Non-factoid focus</strong>: 54% of questions go beyond simple fact retrieval</li>
<li><strong>Extended answers</strong>: Responses are longer and more detailed</li>
<li><strong>Unanswerable questions</strong>: Realistic scenarios where information isn&rsquo;t available</li>
</ul>
<figure class="post-figure center ">
    <img src="/img/quac_dist.webp"
         alt="Distribution of question types in QuAC"
         title="Distribution of question types in QuAC"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Distribution of question types in QuAC</figcaption>
    
</figure>

<h3 id="the-coreference-resolution-challenge">The Coreference Resolution Challenge</h3>
<p>QuAC&rsquo;s complexity stems from its heavy reliance on coreference resolution across multiple contexts:</p>
<p><strong>Reference types</strong>:</p>
<ul>
<li><strong>Passage references</strong>: Pronouns and references to entities in the source text</li>
<li><strong>Dialogue references</strong>: References to previously discussed topics</li>
<li><strong>Abstract references</strong>: Challenging cases like &ldquo;what else?&rdquo; that require inferring the inquiry scope</li>
</ul>
<figure class="post-figure center ">
    <img src="/img/quac_coref.webp"
         alt="Types and distribution of coreferences in QuAC"
         title="Types and distribution of coreferences in QuAC"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Types and distribution of coreferences in QuAC</figcaption>
    
</figure>

<p>The prevalence of coreference resolution makes QuAC particularly challenging, as this remains an active research problem in NLP. Models must understand passage content, track dialogue history, and resolve complex referential expressions simultaneously.</p>
<h2 id="performance-results">Performance Results</h2>
<p>Models face substantial challenges on QuAC, with significant gaps between human and machine performance:</p>
<figure class="post-figure center ">
    <img src="/img/quac_performance.webp"
         alt="Baseline model performance comparison on QuAC"
         title="Baseline model performance comparison on QuAC"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Baseline model performance comparison on QuAC</figcaption>
    
</figure>

<p><strong>Performance summary</strong>:</p>
<ul>
<li><strong>Human performance</strong>: 81.1% F1 score</li>
<li><strong>Best baseline</strong>: BiDAF++ with context achieves 60.2% F1</li>
<li><strong>Performance gap</strong>: 20+ point difference shows room for improvement</li>
</ul>
<h3 id="human-equivalence-metrics">Human Equivalence Metrics</h3>
<p>QuAC introduces evaluation metrics beyond traditional F1 scores:</p>
<p><strong>HEQ-Q (Human Equivalence Question-level)</strong>: Percentage of questions where the model achieves human-level or better performance</p>
<p><strong>HEQ-D (Human Equivalence Dialogue-level)</strong>: Percentage of complete dialogues where the model matches human performance across all questions</p>
<p><strong>Current results</strong>:</p>
<ul>
<li>Human baseline: 100% HEQ-Q, 100% HEQ-D (by definition)</li>
<li>Best model: 55.1% HEQ-Q, 5.2% HEQ-D</li>
</ul>
<p>These metrics show both average performance and consistency across questions and conversations, important for practical dialogue systems.</p>
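<p>Both metrics are easy to compute from per-question F1 scores. Below is a minimal sketch, assuming aligned lists of model F1, human F1, and dialogue IDs; the variable names are mine, not those of the official evaluation script.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def heq_scores(model_f1, human_f1, dialogue_ids):
    # model_f1 / human_f1: per-question F1 values, aligned by index.
    # dialogue_ids: which dialogue each question belongs to.
    passed = [m >= h for m, h in zip(model_f1, human_f1)]
    heq_q = sum(passed) / len(passed)

    by_dialogue = {}
    for ok, d in zip(passed, dialogue_ids):
        by_dialogue.setdefault(d, []).append(ok)
    heq_d = sum(all(v) for v in by_dialogue.values()) / len(by_dialogue)
    return heq_q, heq_d

# Toy usage: three questions across two dialogues.
print(heq_scores([0.9, 0.4, 1.0], [0.8, 0.8, 0.9], ["d1", "d1", "d2"]))
# -> (0.666..., 0.5): 2/3 questions pass, 1/2 dialogues pass fully
</code></pre></div>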
<h2 id="research-impact">Research Impact</h2>
<p>QuAC represents an important step in question answering research by introducing realistic conversational dynamics that existing datasets lack. The student-teacher framework captures natural information-seeking behavior while maintaining extractive evaluation for objective assessment.</p>
<p><strong>Key contributions</strong>:</p>
<ul>
<li><strong>Conversational realism</strong>: Context-dependent questions that mirror dialogue patterns</li>
<li><strong>Coreference complexity</strong>: Integration of challenging NLP problems into QA evaluation</li>
<li><strong>Evaluation metrics</strong>: HEQ scores that measure consistency alongside average performance</li>
<li><strong>Large-scale framework</strong>: Substantial dataset enabling robust model training and evaluation</li>
</ul>
<p>The dataset&rsquo;s <a href="https://quac.ai/">leaderboard</a> provides researchers with a challenging benchmark for developing conversational AI systems. As models improve on QuAC, we can expect progress in dialogue agents, virtual assistants, and educational AI systems that engage in more natural, context-aware conversations.</p>
<p>QuAC&rsquo;s focus on dialogue context and reference resolution pushes the field toward AI systems that can engage in genuine conversation and understand complex dialogue flows.</p>
<h2 id="a-builders-perspective-quac-and-modern-instruction-tuning">A Builder&rsquo;s Perspective: QuAC and Modern Instruction Tuning</h2>
<p>Looking at QuAC through the lens of modern production ML, the student-teacher framework is incredibly relevant. Today, we train foundation models using Reinforcement Learning from Human Feedback (RLHF) and instruction tuning, which rely heavily on multi-turn, context-aware interactions.</p>
<p>When building systems like GutenOCR or enterprise document processing pipelines, users rarely ask perfectly formulated, context-free questions. They ask follow-ups, use pronouns, and expect the system to act as a knowledgeable &ldquo;teacher&rdquo; guiding them through the document. QuAC was one of the first datasets to formalize this asymmetric information dynamic. It highlighted the necessity of handling unanswerable questions gracefully, a critical feature for preventing hallucinations in today&rsquo;s production LLMs.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{choi-etal-2018-quac,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">&#34;{Q}u{AC}: Question Answering in Context&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">&#34;Choi, Eunsol and He, He and Iyyer, Mohit and Yatskar, Mark and Yih, Wen-tau and Choi, Yejin and Liang, Percy and Zettlemoyer, Luke&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">booktitle</span> = <span style="color:#e6db74">&#34;Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">month</span> = oct # <span style="color:#e6db74">&#34;-&#34;</span> # nov,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">&#34;2018&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">address</span> = <span style="color:#e6db74">&#34;Brussels, Belgium&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">&#34;Association for Computational Linguistics&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">&#34;https://aclanthology.org/D18-1241/&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">&#34;10.18653/v1/D18-1241&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">&#34;2174--2184&#34;</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CoQA Dataset: Advancing Conversational Question Answering</title><link>https://hunterheidenreich.com/posts/coqa-conversation-question-answering/</link><pubDate>Thu, 23 Aug 2018 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/coqa-conversation-question-answering/</guid><description>Analysis of CoQA, a conversational QA dataset with multi-turn dialogue, coreference resolution, and natural answers for QA research.</description><content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>The <a href="https://doi.org/10.1162/tacl_a_00266">CoQA dataset</a> (Reddy et al., 2019) introduces conversational dynamics to question answering research. CoQA requires models to maintain context across multi-turn conversations while reading and reasoning about text passages, whereas previous datasets focused on isolated question-answer pairs.</p>
<p>This dataset addresses a gap in conversational AI research by providing a benchmark for systems that must understand dialogue flow and implicit references. These are key components of natural human conversation.</p>
<p>For related work on conversational question answering, see my analysis of <a href="/posts/quac-question-answering-in-context/">QuAC</a>.</p>
<h2 id="what-makes-conversational-qa-different">What Makes Conversational QA Different</h2>
<p>Conversational question answering introduces challenges beyond traditional reading comprehension:</p>
<ol>
<li><strong>Context dependency</strong>: Questions rely on previous dialogue turns for meaning</li>
<li><strong>Coreference resolution</strong>: Understanding pronouns and implicit references</li>
<li><strong>Abstractive answering</strong>: Rephrasing information to generate natural responses</li>
<li><strong>Multi-turn reasoning</strong>: Maintaining coherent dialogue across multiple exchanges</li>
</ol>
<p>These requirements differentiate CoQA from existing question answering datasets that treat each question independently.</p>
<h2 id="why-coqa-matters">Why CoQA Matters</h2>
<p>Question answering systems typically excel at finding specific information in text. However, they often struggle with natural conversation. Human communication involves building on previous exchanges, using pronouns and implicit references, and expressing ideas in varied ways.</p>
<p>CoQA addresses this by creating a large-scale dataset for conversational question answering with three primary characteristics:</p>
<ol>
<li>
<p><strong>Conversation-dependent questions</strong>: After the first question, every subsequent question depends on the dialogue history; in total, the dataset spans 127,000 questions across 8,000 conversations</p>
</li>
<li>
<p><strong>Natural, abstractive answers</strong>: CoQA requires rephrased responses that sound natural in conversation. The answerer first highlighted the relevant text span, then rephrased the information.</p>
</li>
<li>
<p><strong>Domain diversity</strong>: Training covers 5 domains with testing on 7 domains, including 2 unseen during training</p>
</li>
</ol>
<p>The performance gap is notable: humans achieve 88.8% F1 score while the best models at the time reached 65.1% F1, indicating substantial room for improvement.</p>
<h2 id="dataset-construction">Dataset Construction</h2>
<p>CoQA was constructed using Amazon Mechanical Turk, pairing workers in a question-answer dialogue setup. One worker asked questions about a given passage while another provided answers. The answerer first highlighted the relevant text span, then rephrased the information using different words to create natural, abstractive responses.</p>
<p>This methodology produces answers that sound conversational. This makes the dataset highly realistic for dialogue applications.</p>
<h3 id="domain-coverage">Domain Coverage</h3>
<p>CoQA spans diverse text types to ensure evaluation across different writing styles and topics:</p>
<p><strong>Training domains (5):</strong></p>
<ul>
<li>Children&rsquo;s stories from <a href="https://web.archive.org/web/20180829214346/https://uclmr.github.io/ai4exams/data.html#mctest">MCTest</a></li>
<li>Literature from <a href="https://www.gutenberg.org/">Project Gutenberg</a></li>
<li>Educational content from <a href="https://www.cs.cmu.edu/~glai1/data/race/">RACE</a> (middle/high school English)</li>
<li>CNN news articles</li>
<li>Wikipedia articles</li>
</ul>
<p><strong>Test-only domains (2):</strong></p>
<ul>
<li>Science articles from <a href="http://data.allenai.org/ai2-science-questions/">AI2 Science Questions</a></li>
<li>Creative writing from <a href="https://www.reddit.com/r/WritingPrompts/">Reddit WritingPrompts</a></li>
</ul>
<figure class="post-figure center ">
    <img src="/img/coqa_domains.webp"
         alt="Domain distribution in the CoQA dataset"
         title="Domain distribution in the CoQA dataset"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Domain distribution in the CoQA dataset</figcaption>
    
</figure>

<p>The inclusion of test-only domains provides a rigorous evaluation of model generalization to unseen text types.</p>
<h2 id="comparison-with-existing-datasets">Comparison with Existing Datasets</h2>
<p>Prior to CoQA, the dominant question answering benchmark was <a href="https://rajpurkar.github.io/SQuAD-explorer/">SQuAD (Stanford Question Answering Dataset)</a>. SQuAD established foundations for reading comprehension and presented specific constraints:</p>
<ul>
<li><strong>SQuAD 1.0</strong>: 100,000+ questions requiring exact text extraction from Wikipedia passages</li>
<li><strong>SQuAD 2.0</strong>: Added 50,000+ unanswerable questions to test when no answer exists</li>
</ul>
<figure class="post-figure center ">
    <img src="/img/squad_coqa_size.webp"
         alt="Scale comparison between SQuAD and CoQA datasets"
         title="Scale comparison between SQuAD and CoQA datasets"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Scale comparison between SQuAD and CoQA datasets</figcaption>
    
</figure>

<p>SQuAD treats each question independently and requires only extractive answers. CoQA addresses these constraints through conversational context and abstractive responses.</p>
<h3 id="question-and-answer-analysis">Question and Answer Analysis</h3>
<p>The differences between SQuAD and CoQA extend beyond conversational context:</p>
<p><strong>Question diversity</strong>: SQuAD heavily favors &ldquo;what&rdquo; questions (~50%). CoQA shows a more balanced distribution across question types, reflecting natural conversation patterns.</p>
<figure class="post-figure center ">
    <img src="/img/squad_v_coqa.webp"
         alt="Question type distribution comparison between SQuAD and CoQA"
         title="Question type distribution comparison between SQuAD and CoQA"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Question type distribution comparison between SQuAD and CoQA</figcaption>
    
</figure>

<p><strong>Context dependence</strong>: CoQA includes challenging single-word questions like &ldquo;who?&rdquo;, &ldquo;where?&rdquo;, or &ldquo;why?&rdquo; that depend entirely on dialogue history.</p>
<p><strong>Answer characteristics</strong>: CoQA answers vary significantly in length and style, whereas SQuAD primarily features extractive spans.</p>
<figure class="post-figure center ">
    <img src="/img/squad_coqa_answers.webp"
         alt="Answer length distribution in SQuAD vs CoQA"
         title="Answer length distribution in SQuAD vs CoQA"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Answer length distribution in SQuAD vs CoQA</figcaption>
    
</figure>

<h2 id="the-coreference-challenge">The Coreference Challenge</h2>
<p>CoQA&rsquo;s difficulty stems largely from its reliance on coreference resolution (determining when different expressions refer to the same entity). This remains a challenging research problem in NLP.</p>
<p><strong>Coreference types in CoQA</strong>:</p>
<ul>
<li><strong>Explicit coreferences</strong> (~50% of questions): Clear indicators like pronouns (&ldquo;him,&rdquo; &ldquo;it,&rdquo; &ldquo;her,&rdquo; &ldquo;that&rdquo;)</li>
<li><strong>Implicit coreferences</strong> (~20% of questions): Context-dependent references requiring inference (e.g., asking &ldquo;where?&rdquo; without specifying what)</li>
</ul>
<figure class="post-figure center ">
    <img src="/img/coqa_coreferences.webp"
         alt="Distribution of coreference types in CoQA questions"
         title="Distribution of coreference types in CoQA questions"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Distribution of coreference types in CoQA questions</figcaption>
    
</figure>

<p>These linguistic phenomena make CoQA more difficult than traditional reading comprehension, as models must resolve references across dialogue turns while maintaining conversational coherence.</p>
<h2 id="performance-benchmarks">Performance Benchmarks</h2>
<p>Models faced significant challenges on CoQA, with substantial room for improvement:</p>
<figure class="post-figure center ">
    <img src="/img/coqa_scores.webp"
         alt="Performance comparison on CoQA across different model types"
         title="Performance comparison on CoQA across different model types"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Performance comparison on CoQA across different model types</figcaption>
    
</figure>

<p>The performance gap between human and machine capabilities highlighted conversational question answering as a challenging frontier in NLP research.</p>
<h2 id="research-impact-and-future-directions">Research Impact and Future Directions</h2>
<p>CoQA represents a step toward more natural conversational AI systems. By requiring models to handle dialogue context, coreference resolution, and abstractive reasoning simultaneously, it challenges current NLP system capabilities.</p>
<p>The dataset&rsquo;s <a href="https://stanfordnlp.github.io/coqa/">leaderboard</a> provides a benchmark for measuring progress on this task. As models improve on CoQA, we can expect advances in conversational AI applications, from chatbots to virtual assistants that engage in more natural, context-aware dialogue.</p>
<p>The authors position CoQA as a potential parallel to ImageNet&rsquo;s impact on computer vision: a challenging, well-constructed benchmark that drives research toward more capable AI systems.</p>
<h2 id="a-builders-perspective-coqa-in-the-era-of-llms">A Builder&rsquo;s Perspective: CoQA in the Era of LLMs</h2>
<p>Looking back at CoQA from the perspective of modern production systems, this dataset was highly prescient. The challenges it introduced, such as multi-turn reasoning, coreference resolution, and abstractive answering, are the exact capabilities we now expect from instruction-tuned Large Language Models (LLMs).</p>
<p>When building document processing pipelines at scale, we rarely extract isolated facts. Users want to chat with their documents, asking follow-up questions like, &ldquo;What does that mean for the Q3 budget?&rdquo; Resolving &ldquo;that&rdquo; to a previous turn&rsquo;s context is exactly what CoQA formalized. Datasets like CoQA laid the groundwork for the conversational interfaces we build today, shifting the field&rsquo;s focus from simple extraction to genuine dialogue comprehension.</p>
<h2 id="references">References</h2>
<p>Reddy, S., Chen, D., &amp; Manning, C. D. (2019). CoQA: A conversational question answering challenge. <em>Transactions of the Association for Computational Linguistics</em>, 7, 249-266.</p>
]]></content:encoded></item><item><title>Word Embeddings in NLP: An Introduction</title><link>https://hunterheidenreich.com/posts/intro-to-word-embeddings/</link><pubDate>Sun, 05 Aug 2018 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/posts/intro-to-word-embeddings/</guid><description>Learn about word embeddings in NLP: from basic one-hot encoding to contextual models like ELMo. Guide with examples.</description><content:encoded><![CDATA[<h2 id="understanding-word-embeddings">Understanding Word Embeddings</h2>
<p>A word embedding maps words to real-valued vectors:</p>
<p>$$
\text{word} \rightarrow \mathbb{R}^n
$$</p>
<p>where $n$ represents the dimensionality of the embedding space.</p>
<p>The goal is simple: position semantically similar words close together in vector space. This dense representation typically uses hundreds of dimensions, a massive reduction from the vocabulary-sized vectors (potentially millions of dimensions) required by one-hot encoding.</p>
<p>Word embeddings are grounded in <a href="https://en.wikipedia.org/wiki/Distributional_semantics">Zellig Harris&rsquo; distributional hypothesis</a>: words appearing in similar contexts tend to have similar meanings. This forms the foundation of distributional semantics.</p>
<figure class="post-figure center ">
    <img src="/img/distributional_semantics-50.webp"
         alt="Distributional semantics visualization"
         title="Distributional semantics visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Words embedded in three-dimensional space, organized by semantic similarity</figcaption>
    
</figure>

<p>Different embedding algorithms capture various aspects of this distributional principle. This post explores the main methods for creating word embeddings and their applications in natural language processing.</p>
<p>While modern foundation models and terabyte-scale Vision-Language Models (VLMs) rely on advanced subword tokenizers (like BPE) and massive Transformer embedding layers, the fundamental goal remains exactly the same: mapping discrete text to a continuous vector space where math can capture meaning. Understanding these foundational techniques provides the necessary intuition for debugging and scaling today&rsquo;s production ML systems.</p>
<h2 id="why-word-embeddings-matter-in-nlp">Why Word Embeddings Matter in NLP</h2>
<p>Computers require numerical representations to apply machine learning algorithms to text. Word embeddings bridge this gap by converting text into dense vectors that preserve semantic and syntactic relationships.</p>
<p><strong>Key advantages:</strong></p>
<ol>
<li><strong>Dense representation</strong>: Hundreds of dimensions provide a compact alternative to vocabulary-sized sparse vectors.</li>
<li><strong>Semantic preservation</strong>: Similar words cluster together in vector space.</li>
<li><strong>Mathematical operations</strong>: Enable analogical reasoning ($\text{king} - \text{man} + \text{woman} \approx \text{queen}$).</li>
<li><strong>Transfer learning</strong>: Pre-trained embeddings work across multiple tasks and domains.</li>
</ol>
<p>Modern deep learning architectures leverage these properties extensively. The development of universal, pre-trained embeddings was a significant step forward. We can use versatile embeddings that generalize across applications, eliminating the need to train task-specific representations from scratch.</p>
<h2 id="word-embedding-approaches">Word Embedding Approaches</h2>
<h3 id="one-hot-encoding-and-count-vectorization">One-Hot Encoding and Count Vectorization</h3>
<p>One-hot encoding represents the simplest approach to word vectorization. Each word gets a unique dimension in a vocabulary-sized vector, marked with 1 for presence and 0 elsewhere. Count vectorization extends this by counting the occurrences of each word in a document.</p>
<figure class="post-figure center ">
    <img src="/img/word_vector_onehot-50.webp"
         alt="One-hot encoding visualization"
         title="One-hot encoding visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">One-hot encoding creates sparse vectors with single active dimensions</figcaption>
    
</figure>

<p><strong>Characteristics:</strong></p>
<ul>
<li><strong>High dimensionality</strong>: Vector length equals vocabulary size.</li>
<li><strong>Extreme sparsity</strong>: Most dimensions contain zeros.</li>
<li><strong>No relationships</strong>: Treats all words as equally distant.</li>
<li><strong>Computational efficiency</strong>: Simple to implement and understand.</li>
</ul>
<p>While lacking semantic information, count vectorization serves as a foundation for more complex methods. Let&rsquo;s look at a practical implementation using scikit-learn&rsquo;s <code>CountVectorizer</code>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.feature_extraction.text <span style="color:#f92672">import</span> CountVectorizer
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Initialize the vectorizer</span>
</span></span><span style="display:flex;"><span>vectorizer <span style="color:#f92672">=</span> CountVectorizer()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Sample text for demonstration</span>
</span></span><span style="display:flex;"><span>sample_text <span style="color:#f92672">=</span> [<span style="color:#e6db74">&#34;One of the most basic ways we can numerically represent words &#34;</span>
</span></span><span style="display:flex;"><span>               <span style="color:#e6db74">&#34;is through the one-hot encoding method (also sometimes called &#34;</span>
</span></span><span style="display:flex;"><span>               <span style="color:#e6db74">&#34;count vectorizing).&#34;</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Fit the vectorizer to our text data</span>
</span></span><span style="display:flex;"><span>vectorizer<span style="color:#f92672">.</span>fit(sample_text)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Examine the vocabulary and word indices</span>
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#39;Vocabulary:&#39;</span>)
</span></span><span style="display:flex;"><span>print(vectorizer<span style="color:#f92672">.</span>vocabulary_)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Transform text to vectors</span>
</span></span><span style="display:flex;"><span>vector <span style="color:#f92672">=</span> vectorizer<span style="color:#f92672">.</span>transform(sample_text)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#39;Full vector:&#39;</span>)
</span></span><span style="display:flex;"><span>print(vector<span style="color:#f92672">.</span>toarray())
</span></span></code></pre></div><p>In a production environment, count vectorization introduces significant engineering challenges. When processing millions of documents, the vocabulary size explodes. Storing and computing on these massive sparse matrices quickly leads to memory exhaustion. In these scaling scenarios, practitioners often turn to the <strong>Hashing Trick</strong> (via <code>HashingVectorizer</code>) to bound the dimensionality, or they move entirely to the dense embeddings discussed later in this post.</p>
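<p>As a quick sketch of that escape hatch: <code>HashingVectorizer</code> hashes each token into a fixed number of buckets, so the representation size is bounded no matter how large the vocabulary grows (at the cost of occasional collisions and losing the inverse vocabulary lookup). The feature count below is an arbitrary choice for illustration.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from sklearn.feature_extraction.text import HashingVectorizer

# Dimensionality is fixed up front: every token hashes into one of
# 2**18 buckets, so memory no longer grows with the vocabulary.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)

# HashingVectorizer is stateless, so no fit() pass is required.
X = vectorizer.transform(["a stream of documents", "at production scale"])
print(X.shape)  # (2, 262144), stored as a sparse matrix
</code></pre></div>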
<p>We can see count vectorization in action with a real dataset, building a simple text classifier for the <a href="https://www.kaggle.com/datasets/crawford/20-newsgroups">20 Newsgroups dataset</a>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.datasets <span style="color:#f92672">import</span> fetch_20newsgroups
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.feature_extraction.text <span style="color:#f92672">import</span> CountVectorizer
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn.naive_bayes <span style="color:#f92672">import</span> MultinomialNB
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> sklearn <span style="color:#f92672">import</span> metrics
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Load train and test splits, removing metadata for a cleaner signal</span>
</span></span><span style="display:flex;"><span>newsgroups_train <span style="color:#f92672">=</span> fetch_20newsgroups(subset<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;train&#39;</span>,
</span></span><span style="display:flex;"><span>                                      remove<span style="color:#f92672">=</span>(<span style="color:#e6db74">&#39;headers&#39;</span>, <span style="color:#e6db74">&#39;footers&#39;</span>, <span style="color:#e6db74">&#39;quotes&#39;</span>))
</span></span><span style="display:flex;"><span>newsgroups_test <span style="color:#f92672">=</span> fetch_20newsgroups(subset<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;test&#39;</span>,
</span></span><span style="display:flex;"><span>                                     remove<span style="color:#f92672">=</span>(<span style="color:#e6db74">&#39;headers&#39;</span>, <span style="color:#e6db74">&#39;footers&#39;</span>, <span style="color:#e6db74">&#39;quotes&#39;</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Initialize and fit vectorizer on training data</span>
</span></span><span style="display:flex;"><span>vectorizer <span style="color:#f92672">=</span> CountVectorizer()
</span></span><span style="display:flex;"><span>X_train <span style="color:#f92672">=</span> vectorizer<span style="color:#f92672">.</span>fit_transform(newsgroups_train<span style="color:#f92672">.</span>data)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Build and train classifier</span>
</span></span><span style="display:flex;"><span>classifier <span style="color:#f92672">=</span> MultinomialNB(alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.01</span>)
</span></span><span style="display:flex;"><span>classifier<span style="color:#f92672">.</span>fit(X_train, newsgroups_train<span style="color:#f92672">.</span>target)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Transform test data and make predictions</span>
</span></span><span style="display:flex;"><span>X_test <span style="color:#f92672">=</span> vectorizer<span style="color:#f92672">.</span>transform(newsgroups_test<span style="color:#f92672">.</span>data)
</span></span><span style="display:flex;"><span>y_pred <span style="color:#f92672">=</span> classifier<span style="color:#f92672">.</span>predict(X_test)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Evaluate performance</span>
</span></span><span style="display:flex;"><span>accuracy <span style="color:#f92672">=</span> metrics<span style="color:#f92672">.</span>accuracy_score(newsgroups_test<span style="color:#f92672">.</span>target, y_pred)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#39;Accuracy: </span><span style="color:#e6db74">{</span>accuracy<span style="color:#e6db74">:</span><span style="color:#e6db74">.3f</span><span style="color:#e6db74">}</span><span style="color:#e6db74">&#39;</span>)
</span></span></code></pre></div><p>This provides a solid baseline. To capture actual semantic meaning and reduce dimensionality, we must move beyond simple counting.</p>
<h3 id="tf-idf-term-frequency-inverse-document-frequency">TF-IDF (Term Frequency-Inverse Document Frequency)</h3>
<p><a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html">TF-IDF</a> extends one-hot encoding by weighting terms based on their importance across a document collection. TF-IDF combines:</p>
<ul>
<li><strong>Term Frequency (TF)</strong>: How often a word appears in a document</li>
<li><strong>Inverse Document Frequency (IDF)</strong>: How rare a word is across all documents</li>
</ul>
<p>This weighting scheme reduces the impact of common words (like &ldquo;the&rdquo; or &ldquo;and&rdquo;) while emphasizing distinctive terms that appear frequently in specific documents but rarely elsewhere.</p>
<p><strong>Advantages:</strong></p>
<ul>
<li>Captures document-level importance</li>
<li>Reduces impact of stop words</li>
<li>Effective for information retrieval tasks</li>
</ul>
<p><strong>Limitations:</strong></p>
<ul>
<li>Still high-dimensional and sparse</li>
<li>No semantic relationships between terms</li>
<li>Context-independent representation</li>
</ul>
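<p>A small example makes the weighting visible. Using scikit-learn&rsquo;s <code>TfidfVectorizer</code> on a toy corpus (the documents are mine, purely for illustration), ubiquitous words receive low IDF values while rare, distinctive terms are boosted:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "quantum entanglement of photons"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# "the" appears in most documents, so its IDF (and final weight) is low;
# "quantum" appears in only one, so it is weighted heavily.
for term in ("the", "quantum"):
    idx = vectorizer.vocabulary_[term]
    print(term, round(vectorizer.idf_[idx], 3))
</code></pre></div>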
<h3 id="co-occurrence-matrices">Co-Occurrence Matrices</h3>
<p>Co-occurrence matrices capture word relationships by recording which terms appear together within defined contexts (sentences, paragraphs, or fixed windows). The resulting matrix has dimensions equal to vocabulary size squared, with entries showing co-occurrence frequency.</p>
<figure class="post-figure center ">
    <img src="/img/Word_co-occurrence_network_%28range_3_words%29_-_ENG-50.webp"
         alt="Co-occurrence network visualization"
         title="Co-occurrence network visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Co-occurrence relationships within a three-word window</figcaption>
    
</figure>

<p><strong>Key properties:</strong></p>
<ul>
<li><strong>Global statistics</strong>: Captures corpus-wide word relationships</li>
<li><strong>Symmetric relationships</strong>: Mutual co-occurrence patterns</li>
<li><strong>Extreme dimensionality</strong>: Vocabulary size squared creates storage challenges</li>
<li><strong>Sparse representation</strong>: Most word pairs never co-occur</li>
</ul>
<p>While computationally expensive to store and process, co-occurrence matrices form the foundation for advanced methods like GloVe that compress this information into dense representations.</p>
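<p>Conceptually, building such a matrix is just windowed counting. Here is a minimal sketch over pre-tokenized sentences; a real pipeline would use sparse storage and a far larger corpus:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from collections import Counter

def cooccurrence_counts(sentences, window=3):
    # Symmetric co-occurrence: count each pair of tokens that appear
    # within `window` positions of each other.
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for c in tokens[i + 1:i + 1 + window]:
                counts[tuple(sorted((w, c)))] += 1
    return counts

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
print(cooccurrence_counts(corpus, window=2).most_common(3))
</code></pre></div>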
<h2 id="neural-network-based-embeddings">Neural Network-Based Embeddings</h2>
<h3 id="neural-probabilistic-language-models">Neural Probabilistic Language Models</h3>
<p><a href="https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf">Neural probabilistic models</a> pioneered the use of neural networks for learning word embeddings. These models learn dense representations as a byproduct of language modeling, predicting the next word in a sequence.</p>















<figure class="post-figure center ">
    <img src="/img/bengio-npm-50.webp"
         alt="Neural probabilistic model diagram"
         title="Neural probabilistic model diagram"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Architecture of neural probabilistic language models</figcaption>
    
</figure>

<p><strong>Training process:</strong></p>
<ol>
<li>Initialize random dense embeddings for each vocabulary word</li>
<li>Use embeddings as inputs to predict language modeling objectives</li>
<li>Update embeddings through backpropagation based on prediction errors</li>
<li>Resulting embeddings capture patterns useful for the training task</li>
</ol>
<p>This approach demonstrated that task-specific embeddings could be learned jointly with model objectives, establishing the foundation for modern embedding methods.</p>
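<p>To make the architecture concrete, here is a minimal Bengio-style sketch in PyTorch (my own simplification: a fixed context window, one hidden layer, no training loop). The <code>nn.Embedding</code> table holds exactly the word vectors that emerge as a byproduct of the next-word objective:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import torch
import torch.nn as nn

class NeuralLM(nn.Module):
    # Embed a fixed window of previous words, concatenate the
    # embeddings, and predict a distribution over the next word.
    def __init__(self, vocab_size, embed_dim=64, context=3, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(context * embed_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, context_ids):   # (batch, context)
        e = self.embed(context_ids)   # (batch, context, embed_dim)
        return self.ff(e.flatten(1))  # (batch, vocab_size) logits

model = NeuralLM(vocab_size=100)
ctx = torch.randint(0, 100, (4, 3))  # batch of 4 three-word contexts
print(model(ctx).shape)              # torch.Size([4, 100])
</code></pre></div>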
<h3 id="word2vec">Word2Vec</h3>
<p><a href="https://code.google.com/archive/p/word2vec/">Word2Vec</a> revolutionized word embeddings by introducing efficient training algorithms for massive corpora. It became the first method to demonstrate compelling vector arithmetic properties, enabling analogical reasoning like the famous &ldquo;$\text{king} - \text{man} + \text{woman} \approx \text{queen}$&rdquo; example.</p>















<figure class="post-figure center ">
    <img src="/img/Word_vector_illustration.webp"
         alt="Word2Vec vector arithmetic visualization"
         title="Word2Vec vector arithmetic visualization"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Word2Vec demonstrates analogical relationships through vector arithmetic</figcaption>
    
</figure>

<p><strong>Two training architectures:</strong></p>
<h4 id="continuous-bag-of-words-cbow">Continuous Bag-of-Words (CBOW)</h4>
<p>Predicts target words from surrounding context words. Given a window of context words, the model learns to predict the central word.</p>
<h4 id="skip-gram">Skip-Gram</h4>
<p>Predicts context words from target words. Given a central word, the model learns to predict surrounding words within a defined window.</p>
<p><strong>Key advantages:</strong></p>
<ul>
<li><strong>Computational efficiency</strong>: Much faster than neural probabilistic models</li>
<li><strong>Scalable training</strong>: Can process billion-word corpora effectively</li>
<li><strong>Quality embeddings</strong>: Captures semantic and syntactic relationships</li>
<li><strong>Flexible context</strong>: Window size controls topical vs. functional similarity</li>
</ul>
<p>The choice of window size significantly impacts learned relationships. Larger windows capture topical associations, while smaller windows focus on syntactic and functional similarities.</p>
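<p>In practice, both architectures are a few lines with Gensim. A minimal sketch, assuming the Gensim 4.x API; the toy corpus is far too small to learn meaningful vectors and is only there to show the knobs:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# sg=1 selects skip-gram; sg=0 (the default) selects CBOW.
# window controls the context size discussed above.
model = Word2Vec(sentences, vector_size=50, window=2, sg=1,
                 min_count=1, epochs=50, seed=0)
print(model.wv.most_similar("cat", topn=3))
</code></pre></div>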
<h3 id="glove-global-vectors">GloVe (Global Vectors)</h3>
<p><a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> combines the best aspects of matrix factorization methods (which capture global corpus statistics) and local context window approaches like Word2Vec. Matrix factorization methods excel at global patterns but struggle with analogical reasoning, while Word2Vec captures local relationships but may miss global structure.</p>
<p><strong>Key innovation:</strong>
GloVe trains on a global word-context co-occurrence matrix, incorporating corpus-wide statistical information while maintaining the analogical reasoning capabilities that made Word2Vec successful.</p>
<p><strong>Advantages over Word2Vec:</strong></p>
<ul>
<li><strong>Global optimization</strong>: Leverages entire corpus statistics</li>
<li><strong>Better performance</strong>: Often outperforms Word2Vec on word similarity and analogy tasks</li>
<li><strong>Stable training</strong>: More consistent convergence due to global objective function</li>
</ul>
<p>The result is embeddings that capture both local syntactic patterns and global semantic relationships more effectively.</p>
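<p>Pre-trained GloVe vectors ship as plain text files, which makes the analogy arithmetic easy to reproduce with NumPy alone. A minimal sketch, assuming a downloaded Stanford release such as <code>glove.6B.100d.txt</code> (the file name and path are assumptions):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import numpy as np

def load_glove(path):
    # Each line of a GloVe text file: word v1 v2 ... vn
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vecs[word] = np.asarray(values, dtype=np.float32)
    return vecs

vecs = load_glove("glove.6B.100d.txt")  # path is an assumption
query = vecs["king"] - vecs["man"] + vecs["woman"]

# Nearest neighbor by cosine similarity, excluding the query words.
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: np.dot(vecs[w], query) /
                         (np.linalg.norm(vecs[w]) * np.linalg.norm(query)))
print(best)  # typically "queen"
</code></pre></div>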
<h2 id="contextual-embedding-methods">Contextual Embedding Methods</h2>
<h3 id="fasttext">FastText</h3>
<p><a href="https://github.com/facebookresearch/fastText">FastText</a> addresses a critical limitation of previous methods: handling out-of-vocabulary (OOV) words. By incorporating subword information, FastText can generate meaningful representations for previously unseen words.</p>
<p><strong>Subword approach:</strong></p>
<ul>
<li>Decomposes words into character n-grams (typically 3-6 characters)</li>
<li>Represents words as sums of their component n-grams</li>
<li>Trains using skip-gram objective with negative sampling</li>
</ul>
<p><strong>Key advantages:</strong></p>
<ul>
<li><strong>OOV handling</strong>: Can embed unseen words using known subword components</li>
<li><strong>Morphological awareness</strong>: Captures relationships between related word forms</li>
<li><strong>Multilingual support</strong>: Facebook released pre-trained embeddings for 294 languages</li>
<li><strong>Robust performance</strong>: Particularly effective for morphologically rich languages</li>
</ul>
<p>For example, if the model knows &ldquo;navigate,&rdquo; it can provide meaningful representation for &ldquo;circumnavigate&rdquo; by leveraging shared subword components, even if &ldquo;circumnavigate&rdquo; wasn&rsquo;t in the training data.</p>
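<p>Gensim&rsquo;s FastText implementation makes this OOV behavior easy to verify: a word absent from training still receives a vector composed from its character n-grams. A minimal sketch, again assuming the Gensim 4.x API and a toy corpus:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from gensim.models import FastText

sentences = [["we", "navigate", "rivers"],
             ["they", "navigated", "the", "coast"]]

# min_n / max_n set the character n-gram range (3-6 by default).
model = FastText(sentences, vector_size=50, window=2, min_count=1,
                 min_n=3, max_n=6, epochs=50)

# "circumnavigate" never appears in the training data, but FastText
# composes a vector for it from shared character n-grams.
print(model.wv["circumnavigate"][:5])
print(model.wv.similarity("navigate", "circumnavigate"))
</code></pre></div>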
<h3 id="poincaré-embeddings">Poincaré Embeddings</h3>
<p><a href="https://radimrehurek.com/gensim/models/poincare.html">Poincaré embeddings</a> introduce a novel approach by learning representations in hyperbolic space. This geometric innovation specifically targets hierarchical relationships in data.</p>
<p><strong>Hyperbolic geometry advantages:</strong></p>
<ul>
<li><strong>Natural hierarchy encoding</strong>: Distance represents similarity, while norm encodes hierarchical level</li>
<li><strong>Efficient representation</strong>: Requires fewer dimensions for hierarchical data</li>
<li><strong>Mathematical elegance</strong>: Leverages properties of hyperbolic space for embedding optimization</li>
</ul>
<p><strong>Applications:</strong>
Particularly effective for data with inherent hierarchical structure, such as:</p>
<ul>
<li>WordNet taxonomies</li>
<li>Organizational charts</li>
<li>Computer network topologies</li>
<li>Knowledge graphs</li>
</ul>
<p>The <a href="https://arxiv.org/abs/1705.08039">original paper</a> demonstrates good efficiency in reproducing WordNet relationships with significantly lower dimensionality compared to traditional embedding methods.</p>
<h2 id="contextual-embeddings">Contextual Embeddings</h2>
<h3 id="elmo-embeddings-from-language-models">ELMo (Embeddings from Language Models)</h3>
<p><a href="https://github.com/allenai/allennlp-models">ELMo</a> represents a paradigm shift toward contextual word representations. ELMo generates dynamic representations based on sentence context, adapting to word usage patterns.</p>
<p><strong>Architecture:</strong></p>
<ul>
<li><strong>Bidirectional LSTM</strong>: Processes text in both forward and backward directions</li>
<li><strong>Character-level input</strong>: Handles OOV words and captures morphological patterns</li>
<li><strong>Multi-layer representations</strong>: Combines different abstraction levels</li>
</ul>
<p><strong>Layer specialization:</strong></p>
<ul>
<li><strong>Lower layers</strong>: Excel at syntactic tasks (POS tagging, parsing)</li>
<li><strong>Higher layers</strong>: Capture semantic relationships (word sense disambiguation)</li>
<li><strong>Combined layers</strong>: A task-specific weighted combination of all layers typically performs best</li>
</ul>
<p><strong>Key innovation:</strong>
ELMo embeddings vary by context. The word &ldquo;bank&rdquo; receives different representations in &ldquo;river bank&rdquo; versus &ldquo;financial bank,&rdquo; addressing polysemy directly through contextual awareness.</p>
<p>This approach achieved strong performance across numerous NLP tasks by providing context-sensitive representations that adapt to word usage patterns.</p>
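<p>With AllenNLP&rsquo;s reference implementation, the context sensitivity is directly observable: the same surface word receives different vectors in different sentences. A minimal sketch, assuming a downloaded pre-trained options/weights pair (the file paths below are placeholders):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths: substitute a downloaded pre-trained
# options/weights pair from an AllenNLP ELMo release.
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"
elmo = Elmo(options_file, weight_file, num_output_representations=1)

sentences = [["The", "river", "bank", "flooded"],
             ["The", "bank", "raised", "rates"]]
character_ids = batch_to_ids(sentences)
reps = elmo(character_ids)["elmo_representations"][0]

# "bank" is token 2 in the first sentence and token 1 in the second;
# the two vectors differ because their contexts differ.
print((reps[0, 2] - reps[1, 1]).norm())
</code></pre></div>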
<h3 id="probabilistic-fasttext">Probabilistic FastText</h3>
<p><a href="https://github.com/benathi/multisense-prob-fasttext">Probabilistic FastText</a> addresses polysemy (words with multiple meanings) through probabilistic modeling. Traditional embeddings conflate different word senses into single representations, limiting their precision.</p>
<p><strong>The polysemy problem:</strong>
Consider &ldquo;rock&rdquo; which can mean:</p>
<ul>
<li>Rock music (genre)</li>
<li>A stone (geological object)</li>
<li>Rocking motion (verb)</li>
</ul>
<p>Standard embeddings average these meanings, producing representations that may not capture any sense precisely.</p>
<p><strong>Probabilistic approach:</strong>
Probabilistic FastText represents words as Gaussian mixture models: probability distributions that can capture multiple distinct meanings as separate components.</p>
<p><strong>Advantages:</strong></p>
<ul>
<li><strong>Multi-sense representation</strong>: Each word sense gets its own distribution</li>
<li><strong>Context sensitivity</strong>: Can select appropriate sense based on usage context</li>
<li><strong>Uncertainty quantification</strong>: Probabilistic framework captures embedding confidence</li>
</ul>
<p>This approach provides a more nuanced treatment of lexical ambiguity, particularly valuable for words with distinct, context-dependent meanings.</p>
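<p>The mixture idea can be sketched in a few lines of NumPy: represent each word as weighted spherical Gaussian senses and measure similarity with the expected likelihood kernel, the closed-form overlap integral of two Gaussians. This is a conceptual sketch of the representation, not the paper&rsquo;s trained model:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import numpy as np

def gaussian_overlap(m1, s1, m2, s2):
    # Closed-form inner product of two spherical Gaussians:
    #   integral of N(x; m1, s1^2 I) * N(x; m2, s2^2 I) dx
    d = m1.size
    var = s1**2 + s2**2
    diff = m1 - m2
    return np.exp(-(diff @ diff) / (2 * var)) / (2 * np.pi * var) ** (d / 2)

def mixture_similarity(senses_a, senses_b):
    # Each word: [(weight, mean, stddev), ...], one entry per sense.
    return sum(wa * wb * gaussian_overlap(ma, sa, mb, sb)
               for wa, ma, sa in senses_a
               for wb, mb, sb in senses_b)

# Toy usage: two-sense "rock" vs. single-sense "stone" and "jazz".
rng = np.random.default_rng(0)
music, stone_like = rng.normal(size=10), rng.normal(size=10)
rock = [(0.5, music, 1.0), (0.5, stone_like, 1.0)]
stone = [(1.0, stone_like + 0.1, 1.0)]
jazz = [(1.0, music + 0.1, 1.0)]
print(mixture_similarity(rock, stone), mixture_similarity(rock, jazz))
</code></pre></div>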
<h2 id="summary-and-future-directions">Summary and Future Directions</h2>
<p>Word embeddings have evolved from simple one-hot encodings to contextual representations that capture nuanced linguistic relationships. Each approach offers distinct advantages:</p>
<p><strong>Static embeddings</strong> (Word2Vec, GloVe, FastText) provide:</p>
<ul>
<li>Computational efficiency for large-scale applications</li>
<li>Pre-trained models available for numerous languages</li>
<li>Clear analogical reasoning capabilities</li>
<li>Good performance on many downstream tasks</li>
</ul>
<p><strong>Contextual embeddings</strong> (ELMo, BERT, GPT) offer:</p>
<ul>
<li>Dynamic representations based on sentence context</li>
<li>Better handling of polysemy and word sense disambiguation</li>
<li>Strong performance on complex NLP tasks</li>
<li>Ability to capture subtle contextual nuances</li>
</ul>
<p><strong>Choosing the right approach</strong> depends on:</p>
<ul>
<li><strong>Task requirements</strong>: Static embeddings for efficiency, contextual for accuracy</li>
<li><strong>Data availability</strong>: Pre-trained models vs. domain-specific training</li>
<li><strong>Computational constraints</strong>: Static embeddings require less processing power</li>
<li><strong>Language coverage</strong>: Consider availability of pre-trained models for target languages</li>
</ul>
<p>The field continues advancing toward more efficient contextual models, better multilingual representations, and embeddings that capture increasingly complex linguistic phenomena.</p>
<p>For a production-grade Word2Vec implementation in PyTorch that takes these concepts further, see the <a href="/projects/modern-word2vec/">High-Performance Word2Vec project</a>.</p>
]]></content:encoded></item></channel></rss>