Introduction
This post explores machine learning approaches for classifying congressional bills by policy area, using data from the 115th to 117th Congresses (2017-2023). We’ll examine:
- The fundamentals of bill classification
- Traditional machine learning models as baselines
- Performance analysis across different time periods and policy domains
This work establishes baselines for future deep learning approaches to legislative text classification.
This post builds on the data foundation established in Exploring the 117th U.S. Congress. For the complete learning path, see the Congressional Data Analysis series.
Why This Matters
Automatically classifying congressional bills by policy area serves several important purposes. It provides insights into legislative priorities, enables analysis of policy trends over time, and promotes transparency in governance. Machine learning can help researchers, journalists, and citizens navigate the thousands of bills introduced each Congress more effectively.
This work represents a step toward building tools that help people engage with the legislative process and understand how policies may affect their lives.
Data
The data comes from scraping Congress.gov for all bills from the 115th through 117th Congresses. Each bill includes:
- Bill ID and title
- Summary (when available) - the earliest summary provided
- Full text (when available) - the earliest text version
- Policy area classification
Our task is to predict policy area from text features:
$$ f(X) = \hat{y}, \quad \text{where} \quad X = \{ \text{title}, \text{summary}, \text{text} \}, \quad \hat{y} \in \{ \text{policy areas} \} $$
The complete dataset is available at Hugging Face: hheiden/us-congress-bill-policy-115_117.
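If you want to follow along, you can pull the dataset straight from the Hub with the datasets library. A minimal sketch (check the dataset card for the exact splits and column names rather than assuming them):

```python
# Minimal sketch: load the bill dataset from the Hugging Face Hub.
# The repo ID comes from this post; splits and column names should be
# verified against the dataset card rather than assumed.
from datasets import load_dataset

ds = load_dataset("hheiden/us-congress-bill-policy-115_117")
print(ds)  # shows the available splits and their columns
```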
Bills by Congress
Our dataset contains the following distribution:
Congress | Bills |
---|---|
115th | 13,555 |
116th | 16,601 |
117th | 17,817 |
Total | 47,973 |
Policy Areas
Each bill receives a policy area label from Congress.gov (see glossary). The dataset includes 33 policy areas, though these classes are highly imbalanced.
The following table shows the number of bills in each policy area across the three Congresses:
Policy Area | 115th | 116th | 117th | Total |
---|---|---|---|---|
Agriculture and Food | 312 | 328 | 398 | 1,038 |
Animals | 96 | 83 | 71 | 250 |
Armed Forces and National Security | 1,108 | 1,337 | 1,399 | 3,844 |
Arts, Culture, Religion | 81 | 79 | 103 | 263 |
Civil Rights and Liberties, Minority Issues | 175 | 205 | 220 | 600 |
Commerce | 312 | 593 | 633 | 1,538 |
Congress | 594 | 541 | 640 | 1,775 |
Crime and Law Enforcement | 827 | 904 | 1,022 | 2,753 |
Economics and Public Finance | 176 | 210 | 197 | 583 |
Education | 607 | 798 | 801 | 2,206 |
Emergency Management | 207 | 198 | 202 | 607 |
Energy | 316 | 370 | 530 | 1,216 |
Environmental Protection | 352 | 423 | 464 | 1,239 |
Families | 79 | 127 | 139 | 345 |
Finance and Financial Sector | 556 | 611 | 601 | 1,768 |
Foreign Trade and International Finance | 120 | 148 | 212 | 480 |
Government Operations and Politics | 1,008 | 1,258 | 1,272 | 3,538 |
Health | 1,526 | 2,109 | 2,276 | 5,911 |
Housing and Community Development | 142 | 250 | 231 | 623 |
Immigration | 398 | 466 | 591 | 1,455 |
International Affairs | 918 | 1,178 | 1,390 | 3,486 |
Labor and Employment | 348 | 452 | 552 | 1,352 |
Law | 109 | 162 | 175 | 446 |
Native Americans | 175 | 234 | 245 | 654 |
Public Lands and Natural Resources | 718 | 648 | 642 | 2,008 |
Science, Technology, Communications | 389 | 551 | 505 | 1,445 |
Social Sciences and History | 5 | 6 | 4 | 15 |
Social Welfare | 177 | 229 | 199 | 605 |
Sports and Recreation | 92 | 93 | 125 | 310 |
Taxation | 983 | 1,156 | 1,078 | 3,217 |
Transportation and Public Works | 492 | 672 | 742 | 1,906 |
Water Resources Development | 89 | 111 | 110 | 310 |
Private Legislation | 69 | 71 | 48 | 188 |
The class imbalance is severe: Social Sciences and History has only 15 bills across all three Congresses, while Health has 5,911. This imbalance presents modeling challenges, as minority classes may lack sufficient representative samples.
Text Statistics
We analyzed token counts using spaCy to understand the computational requirements for each text field.
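For reference, here is a rough sketch of how such counts can be computed with spaCy, using a blank, tokenizer-only English pipeline for speed (the original analysis may have used different settings):

```python
# Sketch: token-count statistics for a list of strings using spaCy's tokenizer.
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline; no tagger/parser needed

def token_stats(texts):
    counts = [len(doc) for doc in nlp.pipe(texts, batch_size=256)]
    return {
        "average": sum(counts) / len(counts),
        "min": min(counts),
        "max": max(counts),
        "total": sum(counts),
    }

# e.g. token_stats(df["title"].tolist()) for each Congress
```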
Title Token Statistics:
Congress | Average Tokens | Min Tokens | Max Tokens | Total Tokens |
---|---|---|---|---|
115th | 12.3 | 1 | 167 | 166,763 |
116th | 11.3 | 1 | 226 | 188,158 |
117th | 11.5 | 1 | 272 | 204,978 |
All | 11.7 | 1 | 272 | 559,419 |
Summary Token Statistics:
Congress | Average Tokens | Min Tokens | Max Tokens | Total Tokens |
---|---|---|---|---|
115th | 109.1 | 2 | 6,839 | 1,479,212 |
116th | 94.9 | 2 | 5,886 | 1,574,732 |
117th | 95.1 | 2 | 502 | 1,695,276 |
All | 99.0 | 2 | 6,839 | 4,749,220 |
Full Text Token Statistics:
Congress | Average Tokens | Min Tokens | Max Tokens | Total Tokens |
---|---|---|---|---|
115th | 2,588.7 | 91 | 304,478 | 35,092,075 |
116th | 2,760.3 | 70 | 973,173 | 45,824,498 |
117th | 2,706.7 | 71 | 1,013,608 | 48,224,757 |
All | 2,692.0 | 70 | 1,013,608 | 129,141,330 |
These statistics reveal computational trade-offs:
- Titles average ~12 tokens - computationally efficient but limited information
- Summaries average ~100 tokens - good balance of information and efficiency
- Full text averages ~2,700 tokens with 129M total tokens - detailed but computationally expensive
We’ll prototype with titles and summaries before considering full text, given the computational costs involved.
Evaluation Framework
Experimental Design
We train models on one Congress and test on others, creating a 3×3 evaluation grid. This setup evaluates both within-Congress performance (same session) and cross-Congress generalization (different sessions). We expect temporal drift between Congress sessions to impact performance.
Metrics and Hyperparameter Tuning
We use weighted average F1 score to handle class imbalance, ensuring fair evaluation across all policy areas regardless of frequency.
For within-Congress evaluation, we report cross-validated scores. For cross-Congress evaluation, we test on the entire target Congress dataset.
Hyperparameter tuning uses cross-validated grid search with the number of folds set to min(3, n_samples) to ensure all classes are represented. We apply the best parameters from training to test generalization across different Congresses.
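As a concrete sketch of that setup (the helper name is illustrative, and the fold rule shown is one reasonable reading of the constraint above):

```python
# Sketch: grid search scored by weighted F1, with the fold count capped so
# every class can appear in the training folds.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def make_search(pipeline, param_grid, y_train):
    _, class_counts = np.unique(y_train, return_counts=True)
    # Never use more folds than the rarest class has samples (and at least 2).
    n_folds = int(max(2, min(3, class_counts.min())))
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    return GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=cv, n_jobs=-1)
```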
Baseline Models
We evaluate three traditional machine learning approaches using TF-IDF vectorization:
Text Preprocessing
We convert text to numerical features using TF-IDF (term frequency-inverse document frequency), which weighs word importance by frequency within documents relative to the entire corpus. This creates normalized feature vectors suitable for machine learning classification.
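As a tiny example of what that vectorization step looks like in scikit-learn (toy documents, just to show the shape of the output):

```python
# Sketch: TF-IDF turns each document into a sparse, L2-normalized vector that
# up-weights terms frequent in the document but rare across the corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "A bill to improve rural crop insurance programs.",
    "A bill to expand veterans' health care benefits.",
]
vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)
print(X.shape, vectorizer.get_feature_names_out())
```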
Multinomial Naive Bayes
We start with Multinomial Naive Bayes as our simplest baseline. Despite its “naive” independence assumption between features, this model often performs surprisingly well for text classification tasks and serves as an important benchmark—if more complex models can’t beat Naive Bayes, it signals potential issues with the approach or data.
The model’s feature_log_prob_ attribute reveals the most influential words for each policy area, providing interpretable insights into classification patterns.
Below is a sketch of the Naive Bayes training setup: a minimal, illustrative reconstruction that pairs a TfidfVectorizer with MultinomialNB in a scikit-learn Pipeline and tunes both with grid search (here data is assumed to be a dict of pandas DataFrames keyed by Congress; the actual sweep_nb helper may differ in detail):
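```python
# Illustrative reconstruction of a sweep_nb-style helper: TF-IDF features into
# Multinomial Naive Bayes, tuned by grid search on one Congress and evaluated
# on the others with weighted F1. Not the exact code behind the results below.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def sweep_nb(data, X_key, y_key, tfidf_params, tfidf_grid, nb_params, nb_grid):
    # `data` is assumed to map a Congress number to a DataFrame of bills.
    for train_congress, train_df in data.items():
        pipe = Pipeline([
            ("tfidf", TfidfVectorizer(**tfidf_params)),
            ("clf", MultinomialNB(**nb_params)),
        ])
        grid = {f"tfidf__{k}": v for k, v in tfidf_grid.items()}
        grid.update({f"clf__{k}": v for k, v in nb_grid.items()})
        search = GridSearchCV(pipe, grid, scoring="f1_weighted", cv=3, n_jobs=-1)
        search.fit(train_df[X_key], train_df[y_key])
        print(f"Training on Congress {train_congress}")
        print(f"Best score: {search.best_score_:.3f}")
        for test_congress, test_df in data.items():
            if test_congress == train_congress:
                continue
            f1 = f1_score(test_df[y_key], search.predict(test_df[X_key]),
                          average="weighted")
            print(f"Testing on Congress {test_congress} F1: {f1}")
```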
Logistic Regression
Logistic regression provides a natural step up in complexity from Naive Bayes. It uses the logistic function to convert linear combinations of features into probabilities, making it an excellent baseline for comparison with more sophisticated models while remaining interpretable.
Below is a sketch of the corresponding logistic regression setup; only the classifier and its parameter grid change relative to the Naive Bayes pipeline (again an illustrative reconstruction, mirroring the sweep_logreg settings shown later):
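```python
# Illustrative pipeline and grid for the logistic regression baseline; these
# mirror the sweep_logreg settings used in the title-only experiments below.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

logreg_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("clf", LogisticRegression(max_iter=1000, random_state=42,
                               class_weight="balanced")),
])
logreg_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__max_df": (0.05, 0.1, 0.25),
    "clf__C": [0.1, 1, 10],
}
# The grid-search-and-evaluate loop is the same as in the Naive Bayes sketch.
```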
XGBoost
We include XGBoost as our tree-based ensemble method. While XGBoost typically excels on structured tabular data, we test whether its gradient boosting approach can effectively handle TF-IDF features for text classification.
Below is a sketch of the XGBoost setup (an illustrative reconstruction; note that XGBClassifier expects integer class labels, so the policy-area strings are label-encoded first):
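```python
# Illustrative pipeline for the XGBoost baseline. XGBClassifier needs integer
# labels, so policy areas are label-encoded; learning_rate corresponds to the
# eta value in the sweep settings shown later.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

label_encoder = LabelEncoder()  # y = label_encoder.fit_transform(df["policy_area"])
xgb_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("clf", XGBClassifier(max_depth=6, learning_rate=0.3, n_jobs=-1)),
])
```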
Results
We evaluate models on three input types:
- Title-only: Quick prototyping with limited context
- Summary-only: Balanced information content and computational efficiency
- Full text: Maximum context with computational constraints (limited hyperparameter tuning)
Title-Only Inputs
Naive Bayes
Title-only Naive Bayes experiments are run with the following settings:
sweep_nb(
    data,
    X_key='title',
    y_key='policy_area',
    tfidf_params={
        'lowercase': True,
        'dtype': np.float32,
    },
    tfidf_grid={
        'ngram_range': [(1, 1), (1, 2)],
        'max_df': (0.05, 0.1, 0.25, 0.5),
        'min_df': (1, 2, 5),
    },
    nb_params={},
    nb_grid={
        'alpha': (1, 0.1, 0.01, 0.001),
    },
)
which yields the following output:
Training on Congress 115
Best score: 0.661
Refit Time: 0.570
Best parameters set:
clf__alpha: 0.01
tfidf__max_df: 0.05
tfidf__min_df: 1
tfidf__ngram_range: (1, 2)
Testing on Congress 116 F1: 0.6369760774921475
Testing on Congress 117 F1: 0.5488274400521962
Training on Congress 116
Best score: 0.677
Refit Time: 0.499
Best parameters set:
clf__alpha: 0.01
tfidf__max_df: 0.05
tfidf__min_df: 1
tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.691175262953872
Testing on Congress 117 F1: 0.6798043069585031
Training on Congress 117
Best score: 0.670
Refit Time: 0.565
Best parameters set:
clf__alpha: 0.01
tfidf__max_df: 0.25
tfidf__min_df: 1
tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.6168474701996426
Testing on Congress 116 F1: 0.6981574942116808
Mean fit time: 0.54 ± 0.03s
Results Summary
The results demonstrate several key findings:
- Fast training: Sub-second training times make this highly practical
- Solid baseline performance: F1 scores around 0.65-0.70 provide a reasonable starting point
- Consistent hyperparameters: Similar optimal settings across Congresses suggest stable patterns
- Temporal effects: Performance generally decreases when training and testing on Congresses further apart in time
Training on the 116th Congress yields the best cross-Congress performance, likely due to its temporal proximity to both adjacent sessions.

Naive Bayes F1 scores show temporal effects, with better performance between adjacent Congresses
The model learns interpretable features for each policy area. For example, Agriculture bills are strongly associated with terms like “farm,” “crop,” and “livestock,” while Armed Forces bills correlate with “military,” “defense,” and “veterans.”
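Here is a sketch of how those per-class rankings can be pulled out of a fitted pipeline, assuming the tfidf/clf step names used in the sketches above:

```python
# Sketch: rank the highest-weight vocabulary terms per policy area using the
# fitted MultinomialNB's feature_log_prob_ matrix (n_classes x n_features).
import numpy as np

def top_features(fitted_pipeline, k=10):
    vocab = fitted_pipeline.named_steps["tfidf"].get_feature_names_out()
    nb = fitted_pipeline.named_steps["clf"]
    for policy_area, log_probs in zip(nb.classes_, nb.feature_log_prob_):
        top = vocab[np.argsort(log_probs)[::-1][:k]]
        print(f"{policy_area}: {', '.join(top)}")
```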

Naive Bayes Top Features for each of the 33 policy areas (Agriculture and Food through Private Legislation)
Logistic Regression
Title-only Logistic Regression experiments are run with the following settings:
sweep_logreg(
    data,
    X_key='title',
    y_key='policy_area',
    tfidf_params={
        'lowercase': True,
        'dtype': np.float32,
    },
    tfidf_grid={
        'ngram_range': [(1, 1), (1, 2)],
        'max_df': (0.05, 0.1, 0.25),
    },
    logreg_params={
        'max_iter': 1000,
        'random_state': 42,
        'class_weight': 'balanced',
    },
    logreg_grid={
        'C': [0.1, 1, 10],
    },
)
which yields the following output:
Training on Congress 115
Best score: 0.704
Refit Time: 32.063
Best parameters set:
clf__C: 10
tfidf__max_df: 0.05
tfidf__ngram_range: (1, 2)
Testing on Congress 116 F1: 0.6809188275881766
Testing on Congress 117 F1: 0.601917336933838
Training on Congress 116
Best score: 0.714
Refit Time: 31.227
Best parameters set:
clf__C: 10
tfidf__max_df: 0.05
tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.7408989977276476
Testing on Congress 117 F1: 0.7200639105208106
Training on Congress 117
Best score: 0.711
Refit Time: 34.083
Best parameters set:
clf__C: 10
tfidf__max_df: 0.05
tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.674418393892329
Testing on Congress 116 F1: 0.7405934743144291
Mean fit time: 32.46 ± 1.20s
Results Summary
Logistic regression improves upon Naive Bayes performance:
- Higher F1 scores: Roughly 4-6 percentage points better than Naive Bayes
- Consistent hyperparameters: Optimal settings remain stable across Congresses
- Reasonable training time: 30-35 seconds per model remains manageable
- Better cross-Congress generalization: Adjacent-Congress F1 scores reach 0.72-0.74, though the 115th/117th pairs still lag at 0.60-0.67

Logistic Regression Policy Area Classification F1 Score
XGBoost
Title-only XGBoost experiments are run with the following settings:
sweep_xgb(
    data,
    X_key='title',
    y_key='policy_area',
    tfidf_grid={
        'max_df': (0.05,),
    },
    xgb_grid={
        'max_depth': (6,),
        'eta': (0.3,),
    },
)
which yields the following output:
Training on Congress 115
Best score: 0.591
Refit Time: 198.063
Best parameters set:
clf__eta: 0.3
clf__max_depth: 6
clf__num_class: 33
tfidf__max_df: 0.05
Testing on Congress 116 F1: 0.5649530686141018
Testing on Congress 117 F1: 0.5215939580735101
Training on Congress 116
Best score: 0.600
Refit Time: 264.824
Best parameters set:
clf__eta: 0.3
clf__max_depth: 6
clf__num_class: 33
tfidf__max_df: 0.05
Testing on Congress 115 F1: 0.6037922738570368
Testing on Congress 117 F1: 0.5965027418245722
Training on Congress 117
Best score: 0.595
Refit Time: 249.799
Best parameters set:
clf__eta: 0.3
clf__max_depth: 6
clf__num_class: 33
tfidf__max_df: 0.05
Testing on Congress 115 F1: 0.5600491477899472
Testing on Congress 116 F1: 0.60815381664894
Mean fit time: 237.56 ± 28.60s
Results Summary
XGBoost underperforms relative to expectations:
- Poor performance: F1 scores well below the linear models (roughly 0.52-0.61)
- Long training times: Roughly four minutes per model, even with very limited hyperparameter exploration
- Questionable value: The computational cost doesn’t justify the poor performance
Given these results, we focus on the more promising linear models for subsequent experiments with longer text inputs.

XGBoost Policy Area Classification F1 Score
Training Efficiency
The computational costs vary dramatically:
Model | Training Time |
---|---|
Naive Bayes | 0.54 ± 0.03s |
Logistic Regression | 32.46 ± 1.20s |
XGBoost | 237.56 ± 28.60s |
XGBoost’s poor performance despite high computational cost suggests that tree-based methods may not be well-suited for sparse TF-IDF features. We’ll focus on linear models for the remaining experiments.
Summary-Only Results
Using bill summaries provides substantially more context than titles alone, leading to significant performance improvements.
Naive Bayes Performance
The summary-based models show dramatic improvement over title-only versions:
- F1 scores: 0.85+ within-Congress, 0.77-0.86 cross-Congress
- Training time: Still fast at ~3.4 seconds
- Strong generalization: Consistent performance across time periods

Summary-based models achieve 80%+ F1 scores across most Congress combinations
Logistic Regression Performance
Logistic regression slightly outperforms Naive Bayes on summaries:
- F1 scores: 0.86+ within-Congress, 0.79-0.87 cross-Congress
- Training time: Reasonable at ~12 seconds
- Stable hyperparameters: Consistent optimal settings across Congresses

Logistic regression maintains slight performance advantage over Naive Bayes
The performance difference between the two models is small, suggesting they likely rely on similar feature patterns, with logistic regression better capturing how those features combine.
Plotting the Naive Bayes F1 grid for summary inputs:

Naive Bayes Policy Area Classification F1 Score
This is a significant lift over the title-only inputs: F1 now exceeds 80% in nearly every train/test combination, with the weakest cell being a model trained on the 115th Congress and tested on the 117th, the pair farthest apart forwards in time.
Logistic Regression
Summary-only Logistic Regression experiments are run with the following settings:
sweep_logreg(
    data,
    X_key='summary',
    y_key='policy_area',
    tfidf_params={
        'lowercase': True,
        'dtype': np.float32,
    },
    tfidf_grid={
        # 'ngram_range': [(1, 1), (1, 2)],
        'max_df': (0.05, 0.1, 0.25),
    },
    logreg_params={
        'max_iter': 1000,
        'random_state': 42,
        'class_weight': 'balanced',
    },
    logreg_grid={
        'C': [0.1, 1, 10],
    },
)
which yields the following output:
Training on Congress 115
Best score: 0.862
Refit Time: 9.007
Best parameters set:
clf__C: 10
tfidf__max_df: 0.25
Testing on Congress 116 F1: 0.8284864693401133
Testing on Congress 117 F1: 0.7934161507811646
Training on Congress 116
Best score: 0.865
Refit Time: 13.897
Best parameters set:
clf__C: 10
tfidf__max_df: 0.25
Testing on Congress 115 F1: 0.8637852557418315
Testing on Congress 117 F1: 0.8594775615031977
Training on Congress 117
Best score: 0.862
Refit Time: 12.167
Best parameters set:
clf__C: 10
tfidf__max_df: 0.25
Testing on Congress 115 F1: 0.8355736563084967
Testing on Congress 116 F1: 0.8696403838390832
Mean fit time: 11.69 ± 2.02s
And plotted:

Logistic Regression Policy Area Classification F1 Score
Overall, it’s nice to see slightly better performance from the logistic regression model than from Naive Bayes. The performance lifts follow similar trends, and my hunch is that the two models are largely predicting off the same features, with logistic regression better capturing the relationships between those features and the target. That said, we’d need more rigorous testing to confirm that before acting on the assumption.
Full Text Results
We test whether complete bill text improves performance over summaries, using optimal hyperparameters from summary experiments.
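Concretely, that means fixing the summary-stage winners rather than re-running the grid search. A sketch for the logistic regression case (the column names here are assumptions):

```python
# Sketch: reuse the summary-stage winners (max_df=0.25, C=10) and fit directly
# on the full bill text, since a fresh grid search would be expensive here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

fulltext_logreg = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, max_df=0.25)),
    ("clf", LogisticRegression(C=10, max_iter=1000, random_state=42,
                               class_weight="balanced")),
])
# fulltext_logreg.fit(train_df["text"], train_df["policy_area"])
```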
Naive Bayes on Full Text
Surprisingly, full text yields slightly lower performance than summaries:
- F1 scores: 0.84-0.85 within-Congress, 0.77-0.86 cross-Congress
- Training time: ~50 seconds (roughly 15× slower than the ~3.4-second summary models)
- Performance drop: Likely due to increased noise in lengthy documents

Full text performance is slightly worse than summaries, suggesting diminishing returns
Logistic Regression on Full Text
Logistic regression shows the strongest performance on full text:
- F1 scores: 0.87-0.88 within-Congress, 0.83-0.89 cross-Congress
- Training time: ~70 seconds
- Best overall performance: Approaches 90% F1 on some Congress pairs

Logistic regression achieves the best performance using full bill text
The logistic regression model benefits from having access to complete legislative language while effectively regularizing against noise.
Key Findings
This baseline study establishes several important results:
Best performing model: Logistic regression trained on full bill text achieves up to 89% F1 score, providing a strong benchmark for future deep learning approaches.
Text input comparison:
- Titles: Limited but fast (F1 ~0.65-0.70)
- Summaries: Good balance of performance and efficiency (F1 ~0.85)
- Full text: Best performance but computationally expensive (F1 ~0.87-0.89)
Cross-Congress generalization: Models trained on one Congress generalize reasonably well to others, though performance decreases with temporal distance between sessions.
Model performance ranking: Logistic Regression > Naive Bayes ≫ XGBoost for this text classification task.
Next Steps
The strong baseline performance sets the stage for several research directions:
- Deep learning models: Transformer-based approaches using pre-trained language models
- Dataset expansion: Including additional Congresses and more detailed bill metadata
- Error analysis: Understanding failure cases and class-specific performance patterns
- Feature engineering: Exploring domain-specific text preprocessing and feature extraction
The complete dataset and experimental code are available for researchers interested in building upon these baselines.
Resources: