
GutenOCR: A Grounded Vision-Language Front-End for Documents
GutenOCR is a family of vision-language models designed to serve as a ‘grounded OCR front-end’, providing high-quality text transcription and explicit geometric grounding.
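
A minimal sketch of what a grounded OCR record could look like, pairing each transcribed span with an explicit bounding box; the field names and the (x0, y0, x1, y1) pixel convention are assumptions for illustration, not GutenOCR's actual output schema.

```python
# Hypothetical sketch of a "grounded" OCR record: each transcribed span is
# paired with an explicit bounding box on the page. Field names and the
# coordinate convention are illustrative only.
from dataclasses import dataclass

@dataclass
class GroundedSpan:
    text: str                                  # transcribed text
    bbox: tuple[float, float, float, float]    # (x0, y0, x1, y1) in page pixels
    page: int                                  # zero-based page index

span = GroundedSpan(text="Introduction", bbox=(72.0, 96.5, 210.3, 118.0), page=0)
print(span)
```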

PubMed-OCR provides 1.5M pages of scientific articles with comprehensive OCR annotations and bounding boxes to support layout-aware modeling and document analysis.

We explore LLM applications for page stream segmentation in insurance document processing, demonstrating that parameter-efficient fine-tuning achieves strong accuracy while also revealing significant calibration challenges that limit deployment confidence.
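
As a concrete illustration of the calibration issue (a minimal sketch, not the paper's evaluation protocol), expected calibration error compares predicted confidence against observed accuracy within confidence bins; the bin count and example numbers below are illustrative.

```python
# Minimal expected calibration error (ECE) sketch: how far predicted
# confidences are from observed accuracy, averaged over confidence bins.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: predicted probability of the chosen class;
    correct: 1 if the prediction was right, else 0."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# A model that is ~95% confident but only ~70% accurate is miscalibrated.
print(expected_calibration_error(
    [0.95, 0.96, 0.94, 0.97, 0.95, 0.93, 0.95, 0.96, 0.94, 0.95],
    [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]))
```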

We create TabMe++, an enhanced page stream segmentation benchmark with commercial-grade OCR, and show that decoder-based LLMs such as Mistral-7B, fine-tuned with parameter-efficient methods, achieve 80% straight-through processing rates, dramatically outperforming encoder-based models.
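
A minimal sketch of one parameter-efficient recipe, using Hugging Face `peft` with LoRA adapters; the checkpoint name, adapter hyperparameters, and the framing of segmentation as per-page binary classification ("does a new document start here?") are assumptions for illustration, not the exact configuration used in the paper.

```python
# Illustrative LoRA setup for page stream segmentation framed as binary
# sequence classification. Hyperparameters and checkpoint are placeholders.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,                                  # low-rank adapter dimension (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapter weights are trainable
```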

We systematically ablate core mechanisms of Transformers and RNNs, finding that attention-augmented Recurrent Highway Networks outperform standard Transformers on forecasting high-dimensional chaotic systems.

We develop EigenNoise, a zero-data initialization method for word vectors that synthesizes representations from Zipf’s Law alone, demonstrating performance competitive with GloVe after fine-tuning, without requiring any pre-training corpus.
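
For context, the only corpus statistic the method assumes is Zipf's Law, which in its standard form says the frequency of the rank-$r$ word falls off roughly as a power law:

```latex
f(r) \propto \frac{1}{r^{s}}, \qquad s \approx 1
```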

We provide the first analytical solution to Word2Vec’s softmax skip-gram objective, introducing the Independent Frequencies Model and deriving a low-cost, training-free method for measuring semantic bias directly from corpus statistics.
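
For reference, the softmax skip-gram objective in question is the standard formulation (notation here is generic rather than the paper's): with input vector $\mathbf{v}_w$ for a center word $w$ and output vector $\mathbf{u}_c$ for a context word $c$,

```latex
p(c \mid w) = \frac{\exp\!\left(\mathbf{u}_c^{\top}\mathbf{v}_w\right)}
                   {\sum_{c' \in V} \exp\!\left(\mathbf{u}_{c'}^{\top}\mathbf{v}_w\right)},
\qquad
\mathcal{L} = \sum_{(w,\,c) \in D} \log p(c \mid w),
```

where $V$ is the vocabulary and $D$ is the set of observed center-context pairs.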

We present an unsupervised algorithm for inducing semantic networks from Wiktionary’s crowd-sourced data, creating a WordNet-like resource an order of magnitude larger than Princeton WordNet, with over 344,000 linked example sentences.

Bachelor’s thesis introducing PyConversations, an open-source library that normalizes over 308 million posts from Twitter, Reddit, Facebook, and 4chan into a unified data model for cross-platform social media research.

We introduce NewsTweet, a dataset and pipeline for studying embedded tweets in digital journalism, revealing that 13% of Google News articles incorporate tweets and providing insights into how social media becomes newsworthy.

We developed high-frequency monitoring tools to detect coordinated manipulation on Twitter, documenting anomalous follower patterns including sub-second spikes, sawtooth waves, circulating accounts, and long-dormant accounts weaponized against political figures.

We demonstrate that universal adversarial triggers can control both the topic and stance of GPT-2’s generated text, revealing security vulnerabilities in deployed language models and proposing constructive applications for bias auditing.
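
A minimal sketch of how such a trigger is applied at inference time, assuming an off-the-shelf Hugging Face GPT-2 checkpoint; the trigger string below is a placeholder, since real triggers are short token sequences found by gradient-guided search, not hand-written text.

```python
# Illustrative only: a universal adversarial trigger is a fixed token sequence
# prepended to any prompt to steer the model's continuation. The placeholder
# trigger below is NOT one of the triggers from the paper.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

trigger = "<placeholder trigger tokens>"     # hypothetical; learned in practice
prompt = "The new policy on immigration"
inputs = tokenizer(trigger + " " + prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```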