Hi, I’m Hunter.

I’m an ML Research Scientist at Roots.ai. I train large language and vision models at production scale on DGX H100 clusters, and my research roots are in scientific computing for molecular dynamics at Harvard. I publish at venues like COLING, EMNLP, and AIES, and I build open-source tools and datasets. I’m exploring how foundation model training transfers to scientific domains: computational chemistry, materials science, and molecular generation. More about me →
Document Processing
GutenOCR Mascot

GutenOCR: A Grounded Vision-Language Front-End for Documents

GutenOCR is a family of vision-language models designed to serve as a ‘grounded OCR front-end’, providing high-quality text transcription and explicit geometric grounding.

Time Series Forecasting
Forecasting comparison of different neural architectures on the Multiscale Lorenz-96 system

Optimizing Sequence Models for Dynamical Systems

We systematically ablate core mechanisms of Transformers and RNNs, finding that attention-augmented Recurrent Highway Networks outperform standard Transformers on forecasting high-dimensional chaotic systems.
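The Lorenz-96 benchmark behind this comparison is easy to reproduce. Below is a minimal single-scale integrator (an illustrative sketch, not the paper's multiscale setup): the state is cyclic in the index `i`, with forcing `F = 8` giving chaotic dynamics, advanced here with a classical Runge-Kutta step.

```python
def l96_rhs(x, F=8.0):
    """Right-hand side of the single-scale Lorenz-96 ODE (cyclic in i)."""
    n = len(x)
    return [(x[(i + 1) % n] - x[(i - 2) % n]) * x[(i - 1) % n] - x[i] + F
            for i in range(n)]

def rk4_step(x, dt=0.05):
    """One classical fourth-order Runge-Kutta step."""
    k1 = l96_rhs(x)
    k2 = l96_rhs([xi + 0.5 * dt * k for xi, k in zip(x, k1)])
    k3 = l96_rhs([xi + 0.5 * dt * k for xi, k in zip(x, k2)])
    k4 = l96_rhs([xi + dt * k for xi, k in zip(x, k3)])
    return [xi + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

# Perturb the unstable fixed point x_i = F and integrate 10 model time units.
state = [8.0] * 8
state[0] += 0.01
for _ in range(200):
    state = rk4_step(state)
```

Trajectories generated this way are what the sequence models are trained to forecast; the tiny initial perturbation grows because the system is chaotic.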

Computational Chemistry
BioT5 architecture showing SELFIES molecules, amino acid proteins, and scientific text feeding into a T5 encoder-decoder

BioT5: Cross-Modal Integration of Biology and Chemistry

BioT5 uses SELFIES representations and separate tokenization to pre-train a unified T5 model across molecules, proteins, and text, achieving state-of-the-art results on 10 of 15 downstream tasks.
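The "separate tokenization" idea can be sketched in a few lines (a toy illustration with made-up vocabularies, not BioT5's actual tokenizer): SELFIES symbols are bracketed units like `[C]` and `[=C]`, so they can be split off cleanly and mapped through a molecule vocabulary that is disjoint from the English subword vocabulary.

```python
import re

# Hypothetical toy vocabularies; the IDs live in disjoint ranges so that
# chemical symbols never collide with English subwords.
TEXT_VOCAB = {"the": 0, "molecule": 1, "is": 2, "aromatic": 3}
MOL_VOCAB = {"[C]": 100, "[=C]": 101, "[Ring1]": 102, "[Branch1]": 103}

def tokenize(text_part, selfies_part):
    """Route text through the word vocab and SELFIES through the molecule vocab."""
    text_ids = [TEXT_VOCAB[w] for w in text_part.lower().split()]
    mol_ids = [MOL_VOCAB[s] for s in re.findall(r"\[[^\]]*\]", selfies_part)]
    return text_ids + mol_ids          # one sequence, two disjoint ID ranges

tokenize("the molecule is aromatic", "[C][=C][C][=C][C][=C][Ring1]")
# → [0, 1, 2, 3, 100, 101, 100, 101, 100, 101, 102]
```

Because every SELFIES string decodes to a valid molecule, the model never has to learn SMILES syntax constraints through the text vocabulary.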

Computational Chemistry
ChatDrug pipeline from prompt design through ChatGPT to domain feedback and edited molecule output

ChatDrug: Conversational Drug Editing with ChatGPT

ChatDrug is a parameter-free framework that combines ChatGPT with retrieval-augmented domain feedback and iterative conversation to edit drugs across small molecules, peptides, and proteins.
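The iterate-until-satisfied structure can be sketched abstractly (an illustrative toy; in ChatDrug the proposer is ChatGPT and the check comes from retrieved domain feedback, not the string heuristics used here):

```python
def edit_until_valid(molecule, propose, satisfies, max_rounds=5):
    """Repeatedly propose an edit and check it against domain feedback."""
    for _ in range(max_rounds):
        candidate = propose(molecule)
        if satisfies(candidate):
            return candidate
        molecule = candidate   # feed the failed edit back as context
    return None

# Hypothetical stand-ins: append a hydroxyl until the string is "polar enough".
result = edit_until_valid("CCC", lambda m: m + "O", lambda m: m.count("O") >= 2)
# → "CCCOO"
```

The framework is parameter-free in the sense that nothing is trained; all adaptation happens through the conversation loop itself.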

Computational Chemistry
ChemCrow architecture with GPT-4 central planner connected to 18 chemistry tools via ReAct reasoning

ChemCrow: Augmenting LLMs with 18 Chemistry Tools

ChemCrow augments GPT-4 with 18 chemistry tools to autonomously plan and execute syntheses, discover novel chromophores, and solve diverse chemical reasoning tasks.
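The tool-augmentation pattern reduces to a registry of callables plus a loop that executes planned actions and collects observations (a toy sketch with hypothetical tool names and lookup tables; the real system wires GPT-4 to expert chemistry software through ReAct-style prompting):

```python
# Hypothetical mini tool registry standing in for ChemCrow's 18 tools.
MOLAR_MASS = {"H2O": 18.015, "CH4": 16.043}

TOOLS = {
    "molar_mass": lambda formula: MOLAR_MASS[formula],
    "atom_count": lambda smiles: sum(ch.isalpha() for ch in smiles),
}

def run_plan(plan):
    """Execute (tool, argument) actions, collecting the observations the
    planner would condition on in its next reasoning step."""
    return [(name, TOOLS[name](arg)) for name, arg in plan]

run_plan([("molar_mass", "H2O"), ("atom_count", "CCO")])
# → [("molar_mass", 18.015), ("atom_count", 3)]
```

In the full agent, the plan is not fixed up front: each observation is appended to the prompt and the LLM decides the next action.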

Computational Chemistry
ChemGE pipeline from integer chromosome through CFG grammar rules to valid SMILES output

ChemGE: Molecule Generation via Grammatical Evolution

ChemGE uses grammatical evolution over SMILES context-free grammars to generate diverse drug-like molecules in parallel, outperforming deep learning baselines in throughput and molecular diversity.
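The core decoding step of grammatical evolution is compact enough to show directly (a minimal sketch over a toy grammar; the real SMILES CFG is far larger): each integer codon selects a production for the leftmost nonterminal, codon modulo the number of applicable rules, with the chromosome wrapping around if it runs out.

```python
# Toy SMILES-like grammar: a molecule is a chain of atoms.
GRAMMAR = {
    "smiles": [["atom"], ["atom", "smiles"]],
    "atom":   [["C"], ["N"], ["O"]],
}

def decode(chromosome, start="smiles", max_steps=50):
    """Expand the start symbol codon by codon; return the derived string."""
    seq, out, i = [start], [], 0
    for _ in range(max_steps):
        if not seq:
            break
        sym = seq.pop(0)
        if sym not in GRAMMAR:          # terminal symbol: emit it
            out.append(sym)
            continue
        rules = GRAMMAR[sym]
        choice = chromosome[i % len(chromosome)] % len(rules)
        i += 1
        seq = rules[choice] + seq       # leftmost expansion
    return "".join(out)

decode([1, 0, 1, 2, 0, 1])   # → "CON"
```

Because every derivation follows the grammar, every decoded string is syntactically valid, which is what lets mutation act on raw integers without producing malformed SMILES.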

Computational Chemistry
ChemLLM pipeline from ChemData structured templates through fine-tuned InternLM2 to ChemBench evaluation

ChemLLM: A Chemical Large Language Model Framework

ChemLLM presents a comprehensive framework for chemistry-specific language modeling, including a 7M-sample instruction tuning dataset (ChemData), a 4,100-question benchmark (ChemBench), and a two-stage fine-tuned model that matches GPT-4 on core chemical tasks.

Computational Chemistry
Coscientist architecture with GPT-4 planner orchestrating web search, code execution, document search, and robot lab API modules

Coscientist: Autonomous Chemistry with LLM Agents

Coscientist is a GPT-4-driven AI system that autonomously designs and executes chemical experiments using web search, code execution, and robotic lab automation.


Computational Chemistry
Three data transfer methods for retrosynthesis: pre-training plus fine-tuning, multi-task learning, and self-training

Data Transfer Approaches for Seq-to-Seq Retrosynthesis

A systematic study of data transfer techniques (joint training, self-training, pre-training plus fine-tuning) applied to Transformer-based retrosynthesis. Pre-training on USPTO-Full followed by fine-tuning on USPTO-50K achieves the best results, improving top-1 accuracy from 35.3% to 57.4%.

Computational Chemistry
DrugAssist workflow from user instruction through LoRA fine-tuned Llama2 to optimized molecule output

DrugAssist: Interactive LLM Molecule Optimization

DrugAssist fine-tunes Llama2-7B-Chat on over one million molecule pairs for interactive, dialogue-based molecule optimization across six molecular properties.

Computational Chemistry
DrugChat architecture showing GNN encoder, linear adaptor, and Vicuna LLM for conversational drug analysis

DrugChat: Conversational QA on Drug Molecule Graphs

DrugChat is a prototype system that bridges molecular graph neural networks with large language models for interactive, multi-turn question answering about drug compounds. It trains only a lightweight linear adaptor between a frozen GNN encoder and Vicuna-13B using 143K curated QA pairs from ChEMBL and PubChem.
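The adaptor idea can be sketched in pure Python (an illustrative toy, not the actual DrugChat code): the GNN encoder and the LLM stay frozen, and only the linear map from graph-embedding space into the LLM's token-embedding space receives gradients.

```python
def frozen_gnn_encode(graph):
    """Stand-in for a frozen GNN: mean-pool per-node features (hypothetical)."""
    n = len(graph)
    return [sum(node[d] for node in graph) / n for d in range(len(graph[0]))]

def linear_adaptor(embedding, W, b):
    """The only trainable piece: project into the LLM embedding dimension."""
    return [sum(w_ij * e for w_ij, e in zip(row, embedding)) + b_j
            for row, b_j in zip(W, b)]

graph = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 nodes, 2 features each
emb = frozen_gnn_encode(graph)                  # 2-d graph embedding
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # toy 2-d → 3-d projection
b = [0.0, 0.0, 0.0]
soft_prompt = linear_adaptor(emb, W, b)         # fed to the LLM as a prefix
```

Training just this projection on the 143K QA pairs is what keeps the approach cheap: the compound's structure reaches the LLM as a soft prompt rather than through any fine-tuned weights.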

Computational Chemistry
Pareto front plot for multi-objective optimization alongside DrugEx v2 explorer-exploiter architecture

DrugEx v2: Pareto Multi-Objective RL for Drug Design

DrugEx v2 introduces Pareto-based multi-objective optimization and evolutionary exploration strategies into an RNN reinforcement learning framework for de novo drug design toward multiple protein targets.
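The Pareto-based selection at the heart of this setup is a non-dominated filter: a candidate survives unless some other candidate is at least as good on every objective and strictly better on one. A minimal sketch (illustrative scores, all maximized; the paper's objectives are affinities toward multiple protein targets plus drug-likeness terms):

```python
def dominates(a, b):
    """a dominates b if a >= b everywhere and a > b somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Keep only the non-dominated candidates."""
    return [s for s in scores
            if not any(dominates(o, s) for o in scores if o is not s)]

# Hypothetical scores: (affinity for target A, affinity for target B).
scores = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.1, 0.9)]
pareto_front(scores)   # → [(0.9, 0.2), (0.5, 0.5), (0.1, 0.9)]
```

Rewarding molecules by their Pareto rank, rather than by a single weighted sum, is what lets the RL generator cover trade-offs between targets instead of collapsing onto one.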