Hunter Heidenreich | AI Research Scientist & Engineer

Research

Publications and preprints spanning NLP, computational social science, document processing, foundation models, and AI safety.
Google Scholar

Document Processing

[Figure: GutenOCR mascot]

GutenOCR: A Grounded Vision-Language Front-End for Documents

GutenOCR is a family of vision-language models designed to serve as a ‘grounded OCR front-end’, providing high-quality text transcription and explicit geometric grounding.

2026-01-20 · Hunter Heidenreich
  • Computer-Vision
  • Transformers
  • Generative-Models
[Figure: Statistics of the PubMed-OCR dataset, including the number of articles, pages, words, and bounding boxes]

PubMed-OCR: PMC Open Access OCR Annotations

PubMed-OCR provides 1.5M pages of scientific articles with comprehensive OCR annotations and bounding boxes to support layout-aware modeling and document analysis.

2026-01-16 · Hunter Heidenreich
  • Datasets
  • Computer-Vision
[Figure: Stream accuracy versus relative throughput for Mistral-7B and XGBoost models]

LLMs for Insurance Document Automation

We explore LLM applications for page stream segmentation in insurance document processing, showing that parameter-efficient fine-tuning achieves strong accuracy while exposing significant calibration challenges that limit deployment confidence.

2025-01-01 · Hunter Heidenreich
  • Transformers
  • Benchmark
  • Evaluation
[Figure: Page stream segmentation workflow: an input stream of pages is processed through binary classification of page pairs to predict document breaks, producing segmented output documents]

LLMs for Page Stream Segmentation

We create TabMe++, an enhanced page stream segmentation benchmark with commercial-grade OCR, and show that parameter-efficiently fine-tuned decoder-based LLMs like Mistral-7B achieve 80% straight-through processing rates, dramatically outperforming encoder-based models.

2024-08-21 · Hunter Heidenreich
  • Transformers
  • Benchmark
  • Computer-Vision
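The workflow in the figure above (binary classification of consecutive page pairs to predict document breaks) reduces to a simple fold over the stream. A minimal sketch, with an arbitrary `is_break` predicate standing in for the fine-tuned LLM classifier; all names here are illustrative:

```python
def segment_stream(pages, is_break):
    """Split a flat stream of pages into documents.

    `is_break(prev, curr)` is any pairwise classifier returning True
    when `curr` starts a new document (the papers fine-tune LLMs for
    this; here it is a stand-in predicate).
    """
    documents = []
    current = []
    for page in pages:
        if current and is_break(current[-1], page):
            documents.append(current)
            current = [page]
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents


# Toy usage: declare a break whenever a page starts with "INVOICE".
pages = ["INVOICE a1", "page 2", "INVOICE b1", "INVOICE c1", "page 2"]
docs = segment_stream(pages, lambda prev, curr: curr.startswith("INVOICE"))
# docs -> [["INVOICE a1", "page 2"], ["INVOICE b1"], ["INVOICE c1", "page 2"]]
```

Straight-through processing corresponds, roughly, to streams in which every break is predicted correctly, so the document-level rate is stricter than per-pair accuracy.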

Scientific Machine Learning

Time Series Forecasting
[Figure: Forecasting comparison of different neural architectures on the multiscale Lorenz-96 system]

Optimizing Sequence Models for Dynamical Systems

We systematically ablate core mechanisms of Transformers and RNNs, finding that attention-augmented Recurrent Highway Networks outperform standard Transformers on forecasting high-dimensional chaotic systems.

2024-10-01 · Hunter Heidenreich
  • Transformers
  • Neural-Networks
  • Evaluation

Natural Language Processing

[Figure: Heatmap of the EigenNoise analytical co-occurrence prior matrix, showing word-rank relationships]

EigenNoise: Data-Free Word Vector Initialization

We develop EigenNoise, a zero-data initialization method for word vectors that synthesizes representations from Zipf’s Law alone, demonstrating competitive performance to GloVe after fine-tuning without requiring any pre-training corpus.

2022-05-01 · Hunter Heidenreich
  • Embeddings
  • Optimization
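EigenNoise's only corpus-side input is Zipf's law, under which the frequency of the rank-r word is proportional to 1/r^s. A minimal sketch of that rank-frequency marginal (the paper's actual co-occurrence prior construction is more involved):

```python
def zipf_distribution(vocab_size, s=1.0):
    """Normalized Zipfian rank-frequency distribution:
    p(r) proportional to 1 / r**s for ranks r = 1..vocab_size."""
    weights = [1.0 / (r ** s) for r in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


p = zipf_distribution(5)
# With s = 1, the rank-1 word is exactly twice as probable as the
# rank-2 word: the heavy-tailed rank structure the prior builds on.
```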
[Figure: Information Quality Ratio plot showing statistical dependencies decaying as window size increases]

Analytical Solution to Word2Vec Softmax & Bias Probing

We provide the first analytical solution to Word2Vec’s softmax skip-gram objective, introducing the Independent Frequencies Model and deriving a low-cost, training-free method for measuring semantic bias directly from corpus statistics.

2022-05-01 · Hunter Heidenreich
  • Embeddings
  • Algorithms
  • Evaluation
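The bias probe above works from corpus statistics alone, with no training. As a generic illustration of that idea (pointwise mutual information, a standard association measure, not necessarily the paper's exact estimator), word association can be read directly off co-occurrence counts:

```python
import math
from collections import Counter
from itertools import combinations


def pmi(corpus, w1, w2):
    """Pointwise mutual information log[p(w1, w2) / (p(w1) p(w2))],
    estimated from sentence-level co-occurrence counts alone."""
    n = len(corpus)
    unigram = Counter()
    pair = Counter()
    for sentence in corpus:
        words = set(sentence.split())
        unigram.update(words)
        pair.update(frozenset(p) for p in combinations(sorted(words), 2))
    p_joint = pair[frozenset((w1, w2))] / n
    return math.log(p_joint / ((unigram[w1] / n) * (unigram[w2] / n)))


# Toy corpus: "doctor" and "hospital" co-occur often, so PMI > 0.
corpus = [
    "doctor hospital nurse",
    "doctor hospital surgery",
    "doctor office",
    "car road engine",
]
assoc = pmi(corpus, "doctor", "hospital")
```

No model is fit anywhere; everything is a ratio of counts, which is what makes such probes essentially free compared to retraining embeddings.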
[Figure: Venn diagram of the semantic overlap between word senses of go, move, and proceed, illustrating the hierarchy induction algorithm]

Data-Driven WordNet Construction from Wiktionary

We present an unsupervised algorithm for inducing semantic networks from Wiktionary’s crowd-sourced data, creating a WordNet-like resource an order of magnitude larger than Princeton WordNet with over 344,000 linked example sentences.

2019-11-01 · Hunter Heidenreich
  • Embeddings
  • Datasets

Computational Social Science

[Figure: Universal Message schema, showing fields such as ID, Text, Author, and Reply Sets that normalize data across platforms]

Look, Don't Tweet: Unified Data Models for Social NLP

Bachelor’s thesis introducing PyConversations, an open-source library that normalizes over 308 million posts from Twitter, Reddit, Facebook, and 4chan into a unified data model for cross-platform social media research.

2021-06-30 · Hunter Heidenreich
  • Social-Media
  • Datasets
  • Python
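The unified data model in the schema figure can be illustrated with a plain record type. The field names below follow the figure (ID, Text, Author, Reply Sets) but are hypothetical, not the actual PyConversations API:

```python
from dataclasses import dataclass, field


@dataclass
class UniversalMessage:
    """Illustrative platform-agnostic post record, loosely following the
    Universal Message schema fields (ID, Text, Author, Reply Sets)."""
    uid: str
    text: str
    author: str
    platform: str
    reply_to: set = field(default_factory=set)


def from_tweet(tweet: dict) -> UniversalMessage:
    """Map a hypothetical Twitter-style payload into the unified model."""
    parent = tweet.get("in_reply_to_status_id")
    return UniversalMessage(
        uid=str(tweet["id"]),
        text=tweet["full_text"],
        author=tweet["user"]["screen_name"],
        platform="twitter",
        reply_to={str(parent)} if parent else set(),
    )


msg = from_tweet({"id": 1, "full_text": "hi", "user": {"screen_name": "a"}})
```

One such adapter per platform is all that is needed for downstream analysis code to stay platform-agnostic.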
[Figure: NewsTweet data collection pipeline: news outlets are crawled via Google News RSS feeds, articles are accessed to extract embedded tweets, and user timelines are downloaded from Twitter]

NewsTweet Dataset: Social Media in Digital Journalism

We introduce NewsTweet, a dataset and pipeline for studying embedded tweets in digital journalism, revealing that 13% of Google News articles incorporate tweets and providing insights into how social media becomes newsworthy.

2020-08-01 · Hunter Heidenreich
  • Datasets
  • Social-Media
[Figure: Sawtooth follower-growth patterns for @elonmusk and @realDonaldTrump, showing coordinated bot activity]

Coordinated Social Targeting on Twitter

We developed high-frequency monitoring tools to detect coordinated manipulation on Twitter, documenting anomalous follower patterns including sub-second spikes, sawtooth waves, circulating accounts, and weaponized ancient dormant accounts targeting political figures.

2020-07-01 · Hunter Heidenreich
  • Social-Media

AI Safety & Adversarial ML

AI Safety
[Figure: A nonsensical trigger sequence, 'WTC theoriesclimate Flat Hubbard Principle', is fed into GPT-2, which then generates Flat Earth conspiracy text]

GPT-2 Susceptibility to Universal Adversarial Triggers

We demonstrate that universal adversarial triggers can control both the topic and stance of GPT-2’s generated text, revealing security vulnerabilities in deployed language models and proposing constructive applications for bias auditing.

2021-05-01 · Hunter Heidenreich
  • Transformers
  • Adversarial-Machine-Learning
  • Neural-Networks
© 2026 Hunter Heidenreich · Powered by Hugo & PaperMod