
GutenOCR: A Grounded Vision-Language Front-End for Documents
GutenOCR is a family of vision-language models designed to serve as a ‘grounded OCR front-end’, providing high-quality text transcription and explicit geometric grounding.

Optimizing Sequence Models for Dynamical Systems
We systematically ablate core mechanisms of Transformers and RNNs, finding that attention-augmented Recurrent Highway Networks outperform standard Transformers on forecasting high-dimensional chaotic systems.
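
The summary names Recurrent Highway Networks without spelling out the cell; below is a minimal sketch of a depth-1 RHN cell with the coupled carry gate (c = 1 - t) from Zilly et al. (2017), in PyTorch. The attention augmentation the paper ablates is omitted, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class RHNCell(nn.Module):
    """Depth-1 Recurrent Highway Network cell with coupled carry gate."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.x_proj = nn.Linear(input_size, 2 * hidden_size)   # feeds h and t
        self.s_proj = nn.Linear(hidden_size, 2 * hidden_size)

    def forward(self, x, s):
        h_in, t_in = (self.x_proj(x) + self.s_proj(s)).chunk(2, dim=-1)
        h = torch.tanh(h_in)          # candidate state
        t = torch.sigmoid(t_in)       # transform gate; carry gate is 1 - t
        return h * t + s * (1.0 - t)  # highway update of the recurrent state

cell = RHNCell(16, 32)
s = torch.zeros(4, 32)
for x_t in torch.randn(10, 4, 16):    # unroll over a length-10 sequence
    s = cell(x_t, s)
```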

BioT5: Cross-Modal Integration of Biology and Chemistry
BioT5 uses SELFIES representations and separate tokenization to pre-train a unified T5 model across molecules, proteins, and text, achieving state-of-the-art results on 10 of 15 downstream tasks.
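
A quick sketch of what the SELFIES choice buys, using the open-source `selfies` library: every string decodes to a valid molecule, and splitting at the symbol level yields a vocabulary of chemically meaningful tokens, mirroring the paper's separate tokenization for molecular text.

```python
import selfies as sf

smiles = "C1=CC=CC=C1"                        # benzene, as SMILES
selfies_str = sf.encoder(smiles)              # bracketed symbols, e.g. "[C][=C]..."
tokens = list(sf.split_selfies(selfies_str))  # one token per SELFIES symbol

print(tokens)
print(sf.decoder(selfies_str))                # always decodes to a valid SMILES
```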

ChatDrug: Conversational Drug Editing with ChatGPT
ChatDrug is a parameter-free framework that combines ChatGPT with retrieval-augmented domain feedback and iterative conversation to edit drugs across small molecules, peptides, and proteins.
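
A hedged sketch of the iterate-with-feedback loop this describes; `llm`, `satisfies`, and `retrieve_similar` are toy stubs standing in for ChatGPT, the domain oracle, and the retrieval module, not ChatDrug's actual interfaces.

```python
def edit_drug(seed, objective, llm, satisfies, retrieve_similar, max_turns=5):
    """Propose an edit, check it with domain feedback, and on failure fold a
    retrieved positive example back into the conversation."""
    history = [f"Edit {seed} so that it {objective}."]
    for _ in range(max_turns):
        candidate = llm(history)                       # LLM proposes an edit
        if satisfies(candidate, objective):            # domain feedback: done?
            return candidate
        hint = retrieve_similar(candidate, objective)  # retrieval-augmented hint
        history.append(f"{candidate} does not satisfy the goal; "
                       f"a similar molecule that does is {hint}. Try again.")
    return None

# Toy stubs so the loop runs end-to-end; the parrot "LLM" only succeeds once
# the retrieved hint appears in the conversation.
print(edit_drug("CCO", "is more soluble",
                llm=lambda hist: "CCN" if "CCN" in hist[-1] else "CCC",
                satisfies=lambda mol, obj: mol == "CCN",
                retrieve_similar=lambda mol, obj: "CCN"))
```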

ChemCrow: Augmenting LLMs with 18 Chemistry Tools
ChemCrow augments GPT-4 with 18 chemistry tools to autonomously plan and execute syntheses, discover novel chromophores, and solve diverse chemical reasoning tasks.

ChemGE: Molecule Generation via Grammatical Evolution
ChemGE uses grammatical evolution over SMILES context-free grammars to generate diverse drug-like molecules in parallel, outperforming deep learning baselines in throughput and molecular diversity.
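
The core of grammatical evolution is mapping an integer genotype to a string via modulo choice over grammar productions. The sketch below shows that mapping on a toy SMILES grammar; the grammar and codons are illustrative, not ChemGE's.

```python
GRAMMAR = {
    "chain": [["atom"], ["atom", "chain"], ["atom", "branch", "chain"]],
    "branch": [["(", "chain", ")"]],
    "atom": [["C"], ["N"], ["O"], ["c1ccccc1"]],
}

def derive(genotype, symbol="chain", depth=20):
    if symbol not in GRAMMAR:                       # terminal symbol: emit as-is
        return symbol
    rules = GRAMMAR[symbol]
    if depth <= 0 or not genotype:                  # budget exhausted: shortest rule
        rule = rules[0]
    else:
        rule = rules[genotype.pop(0) % len(rules)]  # standard GE modulo mapping
    return "".join(derive(genotype, s, depth - 1) for s in rule)

print(derive([2, 0, 3, 1, 2, 0, 0]))                # -> "C(OC)C"
```

Mutation and crossover then act on the flat integer genotype, which is what makes the search embarrassingly parallel.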

ChemLLM: A Chemical Large Language Model Framework
ChemLLM presents a comprehensive framework for chemistry-specific language modeling, including a 7M-sample instruction tuning dataset (ChemData), a 4,100-question benchmark (ChemBench), and a two-stage fine-tuned model that matches GPT-4 on core chemical tasks.

Coscientist: Autonomous Chemistry with LLM Agents
Coscientist is a GPT-4-driven AI system that autonomously designs and executes chemical experiments using web search, code execution, and robotic lab automation.

Data Transfer Approaches for Seq-to-Seq Retrosynthesis
A systematic study of data transfer techniques (joint training, self-training, pre-training plus fine-tuning) applied to Transformer-based retrosynthesis. Pre-training on USPTO-Full followed by fine-tuning on USPTO-50K achieves the best results, improving top-1 accuracy from 35.3% to 57.4%.
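
The winning recipe is the classic two-stage transfer loop. Below is a self-contained toy sketch of that schedule in PyTorch; the model, data, and hyperparameters are stand-ins, not the paper's setup.

```python
import torch
import torch.nn as nn

class ToySeq2Seq(nn.Module):
    """Tiny stand-in for a Transformer retrosynthesis model."""
    def __init__(self, vocab=64, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src, tgt):
        h, _ = self.rnn(self.emb(src))
        logits = self.out(h)                         # per-position vocab logits
        return nn.functional.cross_entropy(logits.transpose(1, 2), tgt)

def run_stage(model, batches, lr):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for src, tgt in batches:
        loss = model(src, tgt)
        opt.zero_grad(); loss.backward(); opt.step()

model = ToySeq2Seq()
big = [(torch.randint(0, 64, (8, 20)),) * 2 for _ in range(100)]   # "USPTO-Full"
small = [(torch.randint(0, 64, (8, 20)),) * 2 for _ in range(10)]  # "USPTO-50K"
run_stage(model, big, lr=1e-4)    # stage 1: pre-train on the large, noisy corpus
run_stage(model, small, lr=1e-5)  # stage 2: fine-tune on the clean benchmark
```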

DrugAssist: Interactive LLM Molecule Optimization
DrugAssist fine-tunes Llama2-7B-Chat on over one million molecule pairs for interactive, dialogue-based molecule optimization across six molecular properties.

DrugChat: Conversational QA on Drug Molecule Graphs
DrugChat is a prototype system that bridges molecular graph neural networks with large language models for interactive, multi-turn question answering about drug compounds. It trains only a lightweight linear adaptor between a frozen GNN encoder and Vicuna-13B using 143K curated QA pairs from ChEMBL and PubChem.
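
A hypothetical sketch of the one trainable piece: a linear layer projecting frozen-GNN graph embeddings into the LLM's token-embedding space as a "soft token". The 5120 output width matches Vicuna-13B's hidden size; the 300-dim GNN output is illustrative.

```python
import torch
import torch.nn as nn

class GraphAdaptor(nn.Module):
    """Maps a pooled molecule embedding into the LLM embedding space."""
    def __init__(self, gnn_dim=300, llm_dim=5120):
        super().__init__()
        self.proj = nn.Linear(gnn_dim, llm_dim)   # the only trained parameters

    def forward(self, graph_emb):
        # graph_emb: (batch, gnn_dim) from the frozen GNN encoder
        return self.proj(graph_emb).unsqueeze(1)  # (batch, 1, llm_dim) soft token

adaptor = GraphAdaptor()
print(adaptor(torch.randn(2, 300)).shape)         # torch.Size([2, 1, 5120])
```

The soft token is then prepended to the question's token embeddings, so gradients flow only through the adaptor while both the GNN and the LLM stay frozen.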

DrugEx v2: Pareto Multi-Objective RL for Drug Design
DrugEx v2 introduces Pareto-based multi-objective optimization and evolutionary exploration strategies into an RNN reinforcement learning framework for de novo drug design toward multiple protein targets.
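
A minimal sketch of the Pareto (non-dominated) selection step such multi-objective schemes rest on; scores are per-objective and "higher is better", with the toy objectives below chosen for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better once."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Indices of candidates no other candidate dominates."""
    return [i for i, a in enumerate(scores)
            if not any(dominates(b, a) for j, b in enumerate(scores) if j != i)]

# Toy objectives: (affinity to target 1, affinity to target 2, drug-likeness)
mols = [(0.9, 0.2, 0.6), (0.5, 0.8, 0.7), (0.4, 0.7, 0.6), (0.9, 0.8, 0.5)]
print(pareto_front(mols))  # -> [0, 1, 3]; molecule 2 is dominated by molecule 1
```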