govex/small-area-estimator

Small Area Estimator

A Python package for estimating metrics for small geographic areas (census places, tracts) from county-level data.

About

This package was developed by the Bloomberg Center for Government Excellence (GovEx) Research & Analytics team to address a common challenge: publicly available data often exists at the county level but not at smaller geographic levels such as cities.

Core Approach

The default model is a linear mixed-effects regression. It learns the relationship between predictors (such as poverty rate) and your outcome across all counties while accounting for state-specific differences. That county-level relationship is then applied to estimate values for smaller areas.

Alternative models and additional features may perform better for specific use cases. The Trainer and Predictor classes include extension patterns.
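To make the core idea concrete, here is a minimal sketch that fits a shared poverty-rate slope with a separate baseline per state and then applies it to a smaller area. It is a simplification, not the package's implementation: it uses fixed state intercepts (within-state demeaning) instead of the random effects of a true mixed model, and the function names and toy numbers are made up for illustration.

```python
from collections import defaultdict
from statistics import fmean


def fit_state_intercept_model(rows):
    """Fit y = intercept[state] + slope * poverty_rate.

    rows: iterable of (state, poverty_rate, target) tuples.
    Returns (slope, {state: intercept}).
    """
    by_state = defaultdict(list)
    for state, x, y in rows:
        by_state[state].append((x, y))

    # Center x and y within each state so state-level offsets drop out,
    # then pool the centered data to estimate one shared slope.
    sxx = sxy = 0.0
    means = {}
    for state, pts in by_state.items():
        mx = fmean(x for x, _ in pts)
        my = fmean(y for _, y in pts)
        means[state] = (mx, my)
        for x, y in pts:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
    if sxx == 0:
        raise ValueError("poverty rate does not vary within any state")

    slope = sxy / sxx
    intercepts = {s: my - slope * mx for s, (mx, my) in means.items()}
    return slope, intercepts


def predict(slope, intercepts, state, poverty_rate):
    """Apply the county-level relationship to any area in a known state."""
    return intercepts[state] + slope * poverty_rate


# Toy county data (made up): same slope, different state baselines.
county_rows = [
    ("MD", 8.0, 84.0), ("MD", 12.0, 76.0), ("MD", 16.0, 68.0),
    ("TX", 10.0, 70.0), ("TX", 14.0, 62.0), ("TX", 20.0, 50.0),
]
slope, intercepts = fit_state_intercept_model(county_rows)

# Estimate for a hypothetical place in MD with a 22% poverty rate.
place_estimate = predict(slope, intercepts, "MD", 22.0)
```

Note that the place's poverty rate (22%) lies outside the range seen in the MD counties, which is exactly the kind of extrapolation the Critical Assumption section warns about.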

Relationship to Literature

This approach is inspired by the methodology of Zhang et al. (2014) but works in the opposite direction: they start with individual-level health survey data and aggregate up, whereas this package starts with county-level data and estimates down. Both approaches rely on hierarchical geographic structure and poverty rate.

Critical Assumption

⚠️ This approach assumes county-level model performance translates to small-area performance.

The only way to validate this is with datasets that exist at both levels. See examples/ground_truth_evaluation.ipynb for validation with ground truth on health insurance rates.

Before using this package:

  1. Check model performance on your county training data
  2. Validate with ground truth at smaller areas when possible
  3. Decide if performance is acceptable for your use case

Installation

Requires Python 3.12+

git clone https://github.com/govex/small-area-estimator.git
cd small-area-estimator
uv venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
uv pip install -e .

Quick Start

from sae.data import FeatureExtractor
from sae.model import Trainer, Predictor
from sae.evaluation import Evaluator

# 1. Fetch county poverty rates (Census API key required)
extractor = FeatureExtractor(api_key="your_key")
county_features = extractor.get_county_poverty(years=range(2012, 2024))

# 2. Merge with your county target data (must have: geo_id, year, target_metric)
# `your_county_data` is a DataFrame you supply; see Data Requirements below
training_data = county_features.merge(your_county_data, on=['geo_id', 'year'])

# 3. Train model
trainer = Trainer(target_col='target_metric')
trainer.fit(training_data)
trainer.save('model.pkl')

# 4. Evaluate performance on counties
evaluator = Evaluator(trainer.model_dict)
evaluator.print_summary()
evaluator.make_diagnostic_plots(output_dir='outputs/')

# 5. Predict on places
places_features = extractor.get_place_poverty(years=[2024])
predictor = Predictor.load('model.pkl')
predictions = predictor.predict(places_features)

Examples

Data Requirements

Your county-level target data needs:

  • geo_id: 5-digit county FIPS code
  • year: Year of observation (integer)
  • target_metric: The value you want to estimate. Use rates, percentages, or per-capita measures rather than raw counts that scale with population size.
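A small preparation step can enforce these three columns before merging. The sketch below assumes geo_id should be stored as a zero-padded string (FIPS codes lose their leading zero when read as integers) and that you have a population denominator for turning counts into rates; the helper names and the sample record are hypothetical.

```python
def normalize_geo_id(raw):
    """Return a 5-character, zero-padded county FIPS string."""
    geo_id = str(raw).strip().zfill(5)
    if len(geo_id) != 5 or not (geo_id.isascii() and geo_id.isdigit()):
        raise ValueError(f"not a 5-digit county FIPS code: {raw!r}")
    return geo_id


def to_rate(count, population):
    """Convert a raw count into a percentage of the population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 100.0 * count / population


# Made-up record: the FIPS code was read as an integer, the metric is a count.
record = {"geo_id": 1001, "year": 2022, "insured": 52_000, "population": 58_000}

row = {
    "geo_id": normalize_geo_id(record["geo_id"]),  # "01001"
    "year": int(record["year"]),
    "target_metric": to_rate(record["insured"], record["population"]),
}
```

Whether the package itself expects string or integer geo_id values is not stated here, so match whatever type FeatureExtractor returns before merging.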

For best results:

  • Hundreds or thousands of county-year observations across multiple years
  • Data from multiple states
  • Target metric is correlated with poverty rate (otherwise add additional features)

Understanding Results

After training, you'll see output like this (example values shown):

Model Performance Summary (on county training data)
  RMSE: 11.86 (19.8%)
  MAE: 9.89 (16.5%)
  Mean target: 59.82

What this means:

  • Your county predictions are typically off by ~10 percentage points (the MAE in this example). MAE and RMSE are easy to interpret but don't tell the whole story, so check the diagnostic plots
  • What constitutes "good" performance depends on your use case:
    • MAE <5pp might be excellent for some metrics
    • MAE of 10-15pp might be acceptable if no better alternative exists
    • Keep in mind that small-area errors could be higher than county errors

Diagnostic Plots

The package generates four diagnostic plots to assess model quality. Review these to understand if your model assumptions hold:

1. Actual vs Predicted (fit_plots.png)

What it shows: Blue dots = actual values, orange line = model predictions by state

What to look for:

  • Orange line should pass through the blue dots
  • Good fit = dots cluster around the line
  • Poor fit = systematic gaps between dots and line

Red flags:

  • Predictions consistently above or below actual values (bias)
  • Line is flat when dots show a clear trend (model missing the relationship)

2. Residuals Plot (residual_plots.png)

What it shows: Prediction errors (residuals) across poverty rates

What to look for:

  • Random scatter around the dashed zero line
  • No patterns or shapes (funnel, curve, clusters)

Red flags:

  • Errors get bigger at high/low values (heteroscedasticity)
  • Curve or other shape reflecting a non-linear relationship that the model is missing
  • All points above or below zero (systematic bias)

3. QQ Plot (qq_plots.png)

What it shows: Whether prediction errors follow a normal distribution

What to look for:

  • QQ Plot (diagonal dashed line): Points should follow the line, reflecting normal distribution
  • Histogram: Bell-shaped, reflecting normally distributed prediction errors

Red flags:

  • Non bell-shaped histogram: May reflect small sample size, consistent over/under-predictions, or groups of counties with different behaviors
  • Curved tails in QQ plot: Upper tail curves up = predictions too high; lower tail curves down = predictions too low
  • Many states showing these patterns: A couple of odd states are to be expected, but if many states show these patterns then model assumptions may not hold
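For readers who want to see what a QQ plot compares, this sketch pairs standardized residuals with the standard-normal quantiles they would have if the errors were normally distributed. The plotting positions (i + 0.5) / n are one common convention and may differ from what the package uses; the data is made up.

```python
from statistics import NormalDist, fmean, pstdev


def qq_points(residuals):
    """Pair theoretical standard-normal quantiles with standardized residuals."""
    n = len(residuals)
    mu, sigma = fmean(residuals), pstdev(residuals)
    observed = sorted((r - mu) / sigma for r in residuals)
    normal = NormalDist()
    # Positions (i + 0.5) / n avoid the undefined 0 and 1 quantiles.
    theoretical = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))


# Made-up residuals with one large positive outlier.
points = qq_points([-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 9.0])

# Distance of the worst point from the diagonal (observed == theoretical).
largest_gap = max(abs(obs - theo) for theo, obs in points)
```

With these values the last point sits well above the diagonal, which is the "upper tail curves up" pattern described above.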

4. Error Bars by State (error_bars.png)

What it shows: How model performance varies by state

What to look for:

  • Blue bars (RMSE) should be small relative to orange bars (mean target)
  • Error bars (standard deviation) show data variability

Red flags:

  • States with very large RMSE = poor fit in those states
  • May need state-specific investigation or different approach
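The same comparison can be done in a few lines if you want a list of states to investigate. In this sketch the 15% threshold is an arbitrary illustration, not a recommendation from the package, and the rows are made up.

```python
from collections import defaultdict
from math import sqrt
from statistics import fmean


def rmse_by_state(rows):
    """rows: iterable of (state, actual, predicted).

    Returns {state: (rmse, mean_target)}.
    """
    grouped = defaultdict(list)
    for state, actual, predicted in rows:
        grouped[state].append((actual, predicted))
    result = {}
    for state, pairs in grouped.items():
        rmse = sqrt(fmean((a - p) ** 2 for a, p in pairs))
        result[state] = (rmse, fmean(a for a, _ in pairs))
    return result


# Made-up county rows for two states.
rows = [
    ("MD", 80.0, 77.0), ("MD", 90.0, 94.0),
    ("TX", 60.0, 48.0), ("TX", 70.0, 86.0),
]
stats = rmse_by_state(rows)

# Flag states whose RMSE exceeds 15% of their mean target (arbitrary cutoff).
flagged = [s for s, (rmse, mean) in stats.items() if rmse > 0.15 * mean]
```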

Common Configurations

Train only on urban counties (for urban place predictions):

trainer = Trainer(
    target_col='your_metric',
    feature_filters={'urban_quartile': [0, 1]}  # 50%+ urban
)

Temporal validation (use only historical data):

trainer = Trainer(
    target_col='your_metric',
    year_filter='before',
    year_value=2024
)
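Conceptually, temporal validation means fitting on earlier years and checking predictions against later ones. The sketch below shows that split on plain dictionaries. It assumes 'before' excludes the cutoff year itself, which this README does not confirm, and the helper name is hypothetical.

```python
def split_by_year(rows, cutoff_year):
    """Split rows into (train, holdout) around cutoff_year.

    Assumption: rows from cutoff_year onward are held out.
    """
    train = [r for r in rows if r["year"] < cutoff_year]
    holdout = [r for r in rows if r["year"] >= cutoff_year]
    return train, holdout


# Made-up county rows.
rows = [
    {"geo_id": "24005", "year": 2022, "target_metric": 91.2},
    {"geo_id": "24005", "year": 2023, "target_metric": 91.8},
    {"geo_id": "24005", "year": 2024, "target_metric": 92.1},
]
train_rows, holdout_rows = split_by_year(rows, cutoff_year=2024)
```

Evaluating on the held-out year gives a more honest error estimate than metrics computed on the training data alone.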

Contributing

This is an early-stage project. Contributions welcome! Open issues for bugs or feature requests, or share your SAE approach!

License

MIT License. See LICENSE for details.

Authors

Developed by the Bloomberg Center for Government Excellence (GovEx) Research & Analytics team.
