Introduction

This post explores machine learning approaches for classifying congressional bills by policy area, using data from the 115th to 117th Congresses (2017-2023). We’ll examine:

  • The fundamentals of bill classification
  • Traditional machine learning models as baselines
  • Performance analysis across different time periods and policy domains

This work establishes baselines for future deep learning approaches to legislative text classification.

This post builds on the data foundation established in Exploring the 117th U.S. Congress. For the complete learning path, see the Congressional Data Analysis series.

Why This Matters

Automatically classifying congressional bills by policy area serves several important purposes. It provides insights into legislative priorities, enables analysis of policy trends over time, and promotes transparency in governance. Machine learning can help researchers, journalists, and citizens navigate the thousands of bills introduced each Congress more effectively.

This work represents a step toward building tools that help people engage with the legislative process and understand how policies may affect their lives.

Data

The data comes from scraping Congress.gov for all bills from the 115th through 117th Congresses. Each bill includes:

  • Bill ID and title
  • Summary (when available) - the earliest summary provided
  • Full text (when available) - the earliest text version
  • Policy area classification

Our task is to predict policy area from text features:

$$ f(X) = \hat{y}, \quad \text{where} \quad X = \{ \text{title}, \text{summary}, \text{text} \}, \quad \hat{y} \in \{ \text{policy areas} \} $$

The complete dataset is available at Hugging Face: hheiden/us-congress-bill-policy-115_117.
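
If you want to load it yourself, the Hugging Face datasets library works (a sketch; check the dataset card for the exact splits and column names):

from datasets import load_dataset

# Pull the bill dataset from the Hugging Face Hub.
ds = load_dataset("hheiden/us-congress-bill-policy-115_117")
print(ds)  # inspect the available splits and features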

Bills by Congress

Our dataset contains the following distribution:

Congress | Bills
115th | 13,555
116th | 16,601
117th | 17,817
Total | 47,973

Policy Areas

Each bill receives a policy area label from Congress.gov (see glossary). The dataset includes 33 policy areas, though these classes are highly imbalanced.

The following table shows the number of bills in each policy area across the three Congresses:

Policy Area | 115th | 116th | 117th | Total
Agriculture and Food | 312 | 328 | 398 | 1,038
Animals | 96 | 83 | 71 | 250
Armed Forces and National Security | 1,108 | 1,337 | 1,399 | 3,844
Arts, Culture, Religion | 81 | 79 | 103 | 263
Civil Rights and Liberties, Minority Issues | 175 | 205 | 220 | 600
Commerce | 312 | 593 | 633 | 1,538
Congress | 594 | 541 | 640 | 1,775
Crime and Law Enforcement | 827 | 904 | 1,022 | 2,753
Economics and Public Finance | 176 | 210 | 197 | 583
Education | 607 | 798 | 801 | 2,206
Emergency Management | 207 | 198 | 202 | 607
Energy | 316 | 370 | 530 | 1,216
Environmental Protection | 352 | 423 | 464 | 1,239
Families | 79 | 127 | 139 | 345
Finance and Financial Sector | 556 | 611 | 601 | 1,768
Foreign Trade and International Finance | 120 | 148 | 212 | 480
Government Operations and Politics | 1,008 | 1,258 | 1,272 | 3,538
Health | 1,526 | 2,109 | 2,276 | 5,911
Housing and Community Development | 142 | 250 | 231 | 623
Immigration | 398 | 466 | 591 | 1,455
International Affairs | 918 | 1,178 | 1,390 | 3,486
Labor and Employment | 348 | 452 | 552 | 1,352
Law | 109 | 162 | 175 | 446
Native Americans | 175 | 234 | 245 | 654
Public Lands and Natural Resources | 718 | 648 | 642 | 2,008
Science, Technology, Communications | 389 | 551 | 505 | 1,445
Social Sciences and History | 5 | 6 | 4 | 15
Social Welfare | 177 | 229 | 199 | 605
Sports and Recreation | 92 | 93 | 125 | 310
Taxation | 983 | 1,156 | 1,078 | 3,217
Transportation and Public Works | 492 | 672 | 742 | 1,906
Water Resources Development | 89 | 111 | 110 | 310
Private Legislation | 69 | 71 | 48 | 188

The class imbalance is severe: Social Sciences and History has only 15 bills across all three Congresses, while Health has 5,911 bills. This imbalance presents modeling challenges, as minority classes may lack sufficient representative samples.

Text Statistics

We analyzed token counts using spaCy to understand the computational requirements for each text field.
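
The counts below can be reproduced with something like this (a sketch; the post does not say which spaCy pipeline was used, so a blank English tokenizer is assumed):

import spacy

nlp = spacy.blank("en")  # tokenizer only; no tagger or parser needed for counting

def token_count(text: str) -> int:
    """Number of spaCy tokens in a piece of text (0 for missing fields)."""
    return len(nlp(text)) if text else 0

# Example: average token count over a list of bill titles.
titles = ["To designate the facility of the United States Postal Service..."]
print(sum(token_count(t) for t in titles) / len(titles))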

Title Token Statistics:

Congress | Average Tokens | Min Tokens | Max Tokens | Total Tokens
115th | 12.3 | 1 | 167 | 166,763
116th | 11.3 | 1 | 226 | 188,158
117th | 11.5 | 1 | 272 | 204,978
All | 11.7 | 1 | 272 | 559,419

Summary Token Statistics:

Congress | Average Tokens | Min Tokens | Max Tokens | Total Tokens
115th | 109.1 | 2 | 6,839 | 1,479,212
116th | 94.9 | 2 | 5,886 | 1,574,732
117th | 95.1 | 2 | 502 | 1,695,276
All | 99.0 | 2 | 6,839 | 4,749,220

Full Text Token Statistics:

Congress | Average Tokens | Min Tokens | Max Tokens | Total Tokens
115th | 2,588.7 | 91 | 304,478 | 35,092,075
116th | 2,760.3 | 70 | 973,173 | 45,824,498
117th | 2,706.7 | 71 | 1,013,608 | 48,224,757
All | - | 70 | 1,013,608 | 129,141,330

These statistics reveal computational trade-offs:

  • Titles average ~12 tokens - computationally efficient but limited information
  • Summaries average ~100 tokens - good balance of information and efficiency
  • Full text averages ~2,700 tokens with 129M total tokens - detailed but computationally expensive

We’ll prototype with titles and summaries before considering full text, given the computational costs involved.

Evaluation Framework

Experimental Design

We train models on one Congress and test on others, creating a 3×3 evaluation grid. This setup evaluates both within-Congress performance (same session) and cross-Congress generalization (different sessions). We expect temporal drift between Congress sessions to impact performance.

Metrics and Hyperparameter Tuning

We use weighted average F1 score to handle class imbalance, ensuring fair evaluation across all policy areas regardless of frequency.

For within-Congress evaluation, we report cross-validated scores. For cross-Congress evaluation, we test on the entire target Congress dataset.

Hyperparameter tuning uses Cross-Validation Grid Search with folds set to min(3, n_samples) to ensure all classes are represented. We apply the best parameters from training to test generalization across different Congresses.
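
A rough sketch of this setup, assuming scikit-learn pipelines and a dict mapping each Congress to a DataFrame of bills (the sweep_* helpers shown in the Results section wrap something similar):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

def sweep(data, X_key, y_key, param_grid, n_folds=3):
    """Train on each Congress with grid-searched hyperparameters, then test on the others.

    `data` is assumed to map a Congress number to a DataFrame with text and label columns.
    """
    for train_congress, train_df in data.items():
        pipe = Pipeline([
            ("tfidf", TfidfVectorizer(lowercase=True)),
            ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
        ])
        search = GridSearchCV(pipe, param_grid, scoring="f1_weighted", cv=n_folds)
        search.fit(train_df[X_key], train_df[y_key])
        print(f"Training on Congress {train_congress} best CV F1: {search.best_score_:.3f}")

        # Cross-Congress generalization: evaluate the refit best model on the other sessions.
        for test_congress, test_df in data.items():
            if test_congress == train_congress:
                continue
            preds = search.predict(test_df[X_key])
            print(f"  Testing on Congress {test_congress} F1: "
                  f"{f1_score(test_df[y_key], preds, average='weighted'):.3f}")

# Example: sweep(data, "title", "policy_area", {"tfidf__max_df": [0.05, 0.1], "clf__C": [0.1, 1, 10]})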

Baseline Models

We evaluate three traditional machine learning approaches using TF-IDF vectorization:

Text Preprocessing

We convert text to numerical features using TF-IDF (term frequency-inverse document frequency), which weights each term by its frequency within a document, discounted by how common the term is across the corpus. This creates normalized feature vectors suitable for machine learning classification.
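
A toy illustration with scikit-learn's TfidfVectorizer (hypothetical titles; the real sweeps tune max_df, min_df, and the n-gram range):

from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "To amend the Internal Revenue Code of 1986 to extend certain tax credits.",
    "To improve health care programs for veterans, and for other purposes.",
]
vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, 2))
X = vectorizer.fit_transform(titles)  # sparse matrix with L2-normalized rows
print(X.shape)
print(vectorizer.get_feature_names_out()[:10])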

Multinomial Naive Bayes

We start with Multinomial Naive Bayes as our simplest baseline. Despite its “naive” independence assumption between features, this model often performs surprisingly well for text classification tasks and serves as an important benchmark—if more complex models can’t beat Naive Bayes, it signals potential issues with the approach or data.

The model’s feature_log_prob_ attribute reveals the most influential words for each policy area, providing interpretable insights into classification patterns.

A minimal sketch of the training pipeline, assuming scikit-learn's TfidfVectorizer feeding MultinomialNB (the sweep_nb helper used in the experiments below wraps a similar pipeline in a grid search):
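
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# TF-IDF features feeding Multinomial Naive Bayes; alpha and the TF-IDF
# settings are the hyperparameters swept in the grid searches below.
nb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, dtype=np.float32)),
    ("clf", MultinomialNB(alpha=0.01)),
])
# nb_pipeline.fit(train_df["title"], train_df["policy_area"])  # hypothetical DataFrame columns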

Logistic Regression

Logistic regression provides a natural step up in complexity from Naive Bayes. It uses the logistic function to convert linear combinations of features into probabilities, making it an excellent baseline for comparison with more sophisticated models while remaining interpretable.

A minimal sketch of the training pipeline, assuming scikit-learn's TfidfVectorizer feeding LogisticRegression (the sweep_logreg helper used in the experiments below wraps a similar pipeline in a grid search):
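
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Balanced class weights help with the heavy label imbalance; C is tuned in the sweeps below.
logreg_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, dtype=np.float32)),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42)),
])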

XGBoost

We include XGBoost as our tree-based ensemble method. While XGBoost typically excels on structured tabular data, we test whether its gradient boosting approach can effectively handle TF-IDF features for text classification.

A minimal sketch of the training pipeline, assuming TF-IDF features feeding xgboost's XGBClassifier (the sweep_xgb helper used in the experiments below wraps a similar pipeline in a grid search):
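
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Gradient-boosted trees over sparse TF-IDF features; labels must be integer-encoded
# (e.g. with LabelEncoder) before fitting. max_depth and eta are swept below.
xgb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("clf", XGBClassifier(max_depth=6, learning_rate=0.3, objective="multi:softprob")),
])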

Results

We evaluate models on three input types:

  • Title-only: Quick prototyping with limited context
  • Summary-only: Balanced information content and computational efficiency
  • Full text: Maximum context with computational constraints (limited hyperparameter tuning)

Title-Only Inputs

Naive Bayes

Title-only Naive Bayes experiments are run with the following settings:

sweep_nb(
    data,
    X_key='title',
    y_key='policy_area',
    tfidf_params={
        'lowercase': True,
        'dtype': np.float32,
    },
    tfidf_grid={
        'ngram_range': [(1, 1), (1, 2)],
        'max_df': (0.05, 0.1, 0.25, 0.5),
        'min_df': (1, 2, 5),
    },
    nb_params={},
    nb_grid={
        'alpha': (1, 0.1, 0.01, 0.001),
    },
)

which yields the following output:

Training on Congress 115
Best score: 0.661
Refit Time: 0.570
Best parameters set:
	clf__alpha: 0.01
	tfidf__max_df: 0.05
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 116 F1: 0.6369760774921475
Testing on Congress 117 F1: 0.5488274400521962

Training on Congress 116
Best score: 0.677
Refit Time: 0.499
Best parameters set:
	clf__alpha: 0.01
	tfidf__max_df: 0.05
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.691175262953872
Testing on Congress 117 F1: 0.6798043069585031

Training on Congress 117
Best score: 0.670
Refit Time: 0.565
Best parameters set:
	clf__alpha: 0.01
	tfidf__max_df: 0.25
	tfidf__min_df: 1
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.6168474701996426
Testing on Congress 116 F1: 0.6981574942116808

Mean fit time: 0.54 ± 0.03s

Results Summary

The results demonstrate several key findings:

  • Fast training: Sub-second training times make this highly practical
  • Solid baseline performance: F1 scores around 0.65-0.70 provide a reasonable starting point
  • Consistent hyperparameters: Similar optimal settings across Congresses suggest stable patterns
  • Temporal effects: Performance generally decreases when training and testing on Congresses further apart in time

Training on the 116th Congress yields the best cross-Congress performance, likely due to its temporal proximity to both adjacent sessions.

Naive Bayes Policy Area Classification F1 Score

Naive Bayes F1 scores show temporal effects, with better performance between adjacent Congresses

The model learns interpretable features for each policy area. For example, Agriculture bills are strongly associated with terms like “farm,” “crop,” and “livestock,” while Armed Forces bills correlate with “military,” “defense,” and “veterans.”
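
A sketch of how those top terms can be read out of a fitted pipeline (assuming the "tfidf" and "clf" step names used in the sketches above):

import numpy as np

def top_terms_per_class(fitted_pipeline, n_terms=10):
    """Highest log-probability terms per policy area from a fitted TF-IDF + MultinomialNB pipeline."""
    terms = fitted_pipeline.named_steps["tfidf"].get_feature_names_out()
    clf = fitted_pipeline.named_steps["clf"]
    return {
        label: [terms[i] for i in np.argsort(log_probs)[::-1][:n_terms]]
        for label, log_probs in zip(clf.classes_, clf.feature_log_prob_)
    }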

Naive Bayes Top Features by Policy Area

Top-weighted terms for each of the 33 policy areas, from Agriculture and Food through Private Legislation

Logistic Regression

Title-only Logistic Regression experiments are run with the following settings:

sweep_logreg(
    data,
    X_key='title',
    y_key='policy_area',
    tfidf_params={
        'lowercase': True,
        'dtype': np.float32,
    },
    tfidf_grid={
        'ngram_range': [(1, 1), (1, 2)],
        'max_df': (0.05, 0.1, 0.25),
    },
    logreg_params={
        'max_iter': 1000,
        'random_state': 42,
        'class_weight': 'balanced',
    },
    logreg_grid={
        'C': [0.1, 1, 10],
    },
)

which yields the following output:

Training on Congress 115
Best score: 0.704
Refit Time: 32.063
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.05
	tfidf__ngram_range: (1, 2)
Testing on Congress 116 F1: 0.6809188275881766
Testing on Congress 117 F1: 0.601917336933838

Training on Congress 116
Best score: 0.714
Refit Time: 31.227
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.05
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.7408989977276476
Testing on Congress 117 F1: 0.7200639105208106

Training on Congress 117
Best score: 0.711
Refit Time: 34.083
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.05
	tfidf__ngram_range: (1, 2)
Testing on Congress 115 F1: 0.674418393892329
Testing on Congress 116 F1: 0.7405934743144291

Mean fit time: 32.46 ± 1.20s

Results Summary

Logistic regression improves upon Naive Bayes performance:

  • Higher F1 scores: Roughly 4-6 percentage points better than Naive Bayes across the grid
  • Consistent hyperparameters: Optimal settings remain stable across Congresses
  • Reasonable training time: 30-35 seconds per model remains manageable
  • Reasonable cross-Congress generalization: Most cross-Congress F1 scores sit near or above 0.70, with the 115th-to-117th pair the weakest at 0.60

Logistic Regression Policy Area Classification F1 Score

XGBoost

Title-only XGBoost experiments are run with the following settings:

sweep_xgb(
    data,
    X_key='title',
    y_key='policy_area',
    tfidf_grid={
        'max_df': (0.05,),
    },
    xgb_grid={
        'max_depth': (6,),
        'eta': (0.3,),
    },
)

which yields the following output:

Training on Congress 115
Best score: 0.591
Refit Time: 198.063
Best parameters set:
	clf__eta: 0.3
	clf__max_depth: 6
	clf__num_class: 33
	tfidf__max_df: 0.05
Testing on Congress 116 F1: 0.5649530686141018
Testing on Congress 117 F1: 0.5215939580735101

Training on Congress 116
Best score: 0.600
Refit Time: 264.824
Best parameters set:
	clf__eta: 0.3
	clf__max_depth: 6
	clf__num_class: 33
	tfidf__max_df: 0.05
Testing on Congress 115 F1: 0.6037922738570368
Testing on Congress 117 F1: 0.5965027418245722

Training on Congress 117
Best score: 0.595
Refit Time: 249.799
Best parameters set:
	clf__eta: 0.3
	clf__max_depth: 6
	clf__num_class: 33
	tfidf__max_df: 0.05
Testing on Congress 115 F1: 0.5600491477899472
Testing on Congress 116 F1: 0.60815381664894

Mean fit time: 237.56 ± 28.60s

Results Summary

XGBoost underperforms relative to expectations:

  • Poor performance: F1 scores significantly below linear models (0.55-0.60 range)
  • Long training times: 4+ minutes per model with limited hyperparameter exploration
  • Questionable value: The computational cost doesn’t justify the poor performance

Given these results, we focus on the more promising linear models for subsequent experiments with longer text inputs.

XGBoost Policy Area Classification F1 Score

Training Efficiency

The computational costs vary dramatically:

Model | Training Time
Naive Bayes | 0.54 ± 0.03s
Logistic Regression | 32.46 ± 1.20s
XGBoost | 237.56 ± 28.60s

XGBoost’s poor performance despite high computational cost suggests that tree-based methods may not be well-suited for sparse TF-IDF features. We’ll focus on linear models for the remaining experiments.

Summary-Only Results

Using bill summaries provides substantially more context than titles alone, leading to significant performance improvements.

Naive Bayes Performance

The summary-based models show dramatic improvement over title-only versions:

  • F1 scores: 0.85+ within-Congress, 0.77-0.86 cross-Congress
  • Training time: Still fast at ~3.4 seconds
  • Strong generalization: Consistent performance across time periods

Naive Bayes Summary Performance

Summary-based models achieve 80%+ F1 scores across most Congress combinations

Logistic Regression Performance

Logistic regression slightly outperforms Naive Bayes on summaries:

  • F1 scores: 0.86+ within-Congress, 0.79-0.87 cross-Congress
  • Training time: Reasonable at ~12 seconds
  • Stable hyperparameters: Consistent optimal settings across Congresses

Logistic Regression Summary Performance

Logistic regression maintains slight performance advantage over Naive Bayes

The small performance gap suggests the two models rely on largely similar feature patterns, with logistic regression better capturing interactions among them. Plotting the Naive Bayes results:

Naive Bayes Policy Area Classification F1 Score

This shows a significant lift over the title-only inputs: F1 scores are now above 80% across the board, except when training on the 115th Congress and testing on the 117th, the pair farthest apart forward in time.

Logistic Regression

Summary-only Logistic Regression experiments are run with the following settings:

sweep_logreg(
    data,
    X_key='summary',
    y_key='policy_area',
    tfidf_params={
        'lowercase': True,
        'dtype': np.float32,
    },
    tfidf_grid={
        # 'ngram_range': [(1, 1), (1, 2)],
        'max_df': (0.05, 0.1, 0.25),
    },
    logreg_params={
        'max_iter': 1000,
        'random_state': 42,
        'class_weight': 'balanced',
    },
    logreg_grid={
        'C': [0.1, 1, 10],
    },
)

which yields the following output:

Training on Congress 115
Best score: 0.862
Refit Time: 9.007
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.25
Testing on Congress 116 F1: 0.8284864693401133
Testing on Congress 117 F1: 0.7934161507811646

Training on Congress 116
Best score: 0.865
Refit Time: 13.897
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.25
Testing on Congress 115 F1: 0.8637852557418315
Testing on Congress 117 F1: 0.8594775615031977

Training on Congress 117
Best score: 0.862
Refit Time: 12.167
Best parameters set:
	clf__C: 10
	tfidf__max_df: 0.25
Testing on Congress 115 F1: 0.8355736563084967
Testing on Congress 116 F1: 0.8696403838390832

Mean fit time: 11.69 ± 2.02s

And plotted:

Logistic Regression Policy Area Classification F1 Score

Overall, it’s nice to see slightly better performance from the Logistic Regression model over the Naive Bayes model. We see similar trends in the performance lifts, and my hunch is that the two models are basically predicting off the same features, but the Logistic Regression model is able to better capture the relationships between the features and the target. That said, we’d need to pin that down with more rigorous testing to be sure that’s the case before acting on such an assumption.

Full Text Results

We test whether complete bill text improves performance over summaries, using optimal hyperparameters from summary experiments.
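
For the logistic regression case, that amounts to something like the following (a sketch using the pipeline naming from earlier; the parameter values are the best settings reported in the summary sweep):

# Reuse the best summary-run hyperparameters instead of grid-searching again on
# the much longer full-text inputs (values taken from the summary sweep output).
best_params = {"tfidf__max_df": 0.25, "clf__C": 10}
full_text_pipeline = logreg_pipeline.set_params(**best_params)
# full_text_pipeline.fit(train_df["text"], train_df["policy_area"])  # hypothetical columns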

Naive Bayes on Full Text

Surprisingly, full text yields slightly lower performance than summaries:

  • F1 scores: 0.84-0.85 within-Congress, 0.77-0.86 cross-Congress
  • Training time: ~50 seconds (10× slower than summaries)
  • Performance drop: Likely due to increased noise in lengthy documents

Naive Bayes Full Text Performance

Full text performance is slightly worse than summaries, suggesting diminishing returns

Logistic Regression on Full Text

Logistic regression shows the strongest performance on full text:

  • F1 scores: 0.87-0.88 within-Congress, 0.83-0.89 cross-Congress
  • Training time: ~70 seconds
  • Best overall performance: Approaches 90% F1 on some Congress pairs

Logistic Regression Full Text Performance

Logistic regression achieves the best performance using full bill text

The logistic regression model benefits from having access to complete legislative language while effectively regularizing against noise.

Key Findings

This baseline study establishes several important results:

Best performing model: Logistic regression trained on full bill text achieves up to 89% F1 score, providing a strong benchmark for future deep learning approaches.

Text input comparison:

  • Titles: Limited but fast (F1 ~0.65-0.70)
  • Summaries: Good balance of performance and efficiency (F1 ~0.85)
  • Full text: Best performance but computationally expensive (F1 ~0.87-0.89)

Cross-Congress generalization: Models trained on one Congress generalize reasonably well to others, though performance decreases with temporal distance between sessions.

Model performance ranking: Logistic Regression > Naive Bayes >> XGBoost for this text classification task.

Next Steps

The strong baseline performance sets the stage for several research directions:

  1. Deep learning models: Transformer-based approaches using pre-trained language models
  2. Dataset expansion: Including additional Congresses and more detailed bill metadata
  3. Error analysis: Understanding failure cases and class-specific performance patterns
  4. Feature engineering: Exploring domain-specific text preprocessing and feature extraction

The complete dataset and experimental code are available for researchers interested in building upon these baselines.

Resources: