ca-calfresh-coverage-gap

California county-level food insecurity analysis — OLS regression identifying poverty and physical food access as key predictors across 58 counties. CalFresh coverage gap. HC3 robust SEs. Python.

California CalFresh Coverage Gap Analysis

Which county-level factors predict food insecurity across California's 58 counties — and what survives when poverty is controlled?

This project uses cross-sectional county-level data from four public sources to identify the structural factors associated with food insecurity across California, with a focus on understanding what drives variation in CalFresh program need and uptake.

Key finding: Overall poverty rate is the dominant predictor of county food insecurity (r = 0.90, coefficient 1.45, p < 0.001). When poverty is explicitly controlled, most SNAP-related associations collapse — but physical food access barriers for SNAP recipients remain independently significant (p = 0.013), pointing to geographic distance to food retail as a driver beyond poverty itself.

📄 Read the full analysis: Medium article


Results at a Glance

| Finding | Result |
|---|---|
| Poverty vs food insecurity correlation | r = 0.90, p < 0.001 |
| Model fit (preferred M3P) | R² = 0.880, Adj-R² = 0.866 |
| Overall poverty rate (M3P) | +1.451 pp per SD (p < 0.001) |
| Low access, SNAP recipients (M3P) | +0.556 pp per SD (p = 0.013) |
| SNAP benefit per capita (M3P) | +0.611 pp per SD (p = 0.056) |
| SNAP eligibility rate without poverty (M3) | +1.117 pp per SD (p < 0.001) |
| SNAP eligibility rate with poverty (M3P) | +0.244 pp per SD (p = 0.490) |
| N counties | 58 |

Repository Structure

ca-calfresh-coverage-gap/
│
├── README.md
├── requirements.txt
│
├── data/                              # data folder (see instructions below)
│   └── README_data.md                 # exact download instructions
│
├── figures/                           # generated by scripts (not committed)
│
├── calfresh_01_load_data.py           # Step 1: load all sources, build panel
├── calfresh_02_eda.py                 # Step 2: exploratory data analysis
├── calfresh_03_feature_engineering.py # Step 3: transforms, VIF, standardize
├── calfresh_04_regression.py          # Step 4: OLS regression M1-M3P + M5
└── calfresh_poverty_scatter.py        # Supplemental: poverty vs FI scatter

Data Sources

1. Feeding America Map the Meal Gap (MMG)

URL: https://www.feedingamerica.org/research/map-the-meal-gap/by-county

Annual county-level food insecurity estimates for all US counties. Download the multi-year Excel file covering 2019–2023. Save to data/ folder.

Key variables used:

  • Overall Food Insecurity Rate — percentage of county population that is food insecure
  • Child Food Insecurity Rate
  • % FI <= SNAP Threshold — share of food-insecure people who meet the SNAP income eligibility criterion
  • Cost Per Meal

Note: MMG estimates are model-derived, not directly surveyed. Small county estimates carry more uncertainty.

2. USDA Food Environment Atlas (2025)

URL: https://www.ers.usda.gov/data-products/food-environment-atlas/

Download the full Excel workbook. Two sheets are used:

  • ASSISTANCE — SNAP benefit per capita (2017 and 2022)
  • ACCESS — food access variables by county

Important: Most SNAP participation columns in the Atlas are state-level constants — identical for all California counties. Only the four access variables and two benefit-per-capita variables have genuine county-level variation. The loading script checks this automatically.
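A check of this kind can be sketched in a few lines of pandas. The column names below are hypothetical placeholders, not the Atlas's real field names, and the loading script may implement its check differently; the idea is simply that a column with one distinct value across counties carries no county-level signal.

```python
import pandas as pd

def split_by_county_variation(df, id_cols=("FIPS", "County")):
    """Return (varying, constant) lists of non-ID column names."""
    varying, constant = [], []
    for col in df.columns:
        if col in id_cols:
            continue
        # nunique() ignores NaN here; one distinct value (or none) means
        # the column is a state-level constant repeated for every county.
        if df[col].nunique(dropna=True) <= 1:
            constant.append(col)
        else:
            varying.append(col)
    return varying, constant

# Toy frame with made-up values and hypothetical column names.
toy = pd.DataFrame({
    "FIPS": ["06001", "06003", "06005"],
    "County": ["Alameda", "Alpine", "Amador"],
    "snap_participation_state": [10.2, 10.2, 10.2],  # same for every county
    "snap_benefit_pc": [15.1, 9.8, 12.4],            # varies by county
})
varying, constant = split_by_county_variation(toy)
print("county-level:", varying)
print("state-level constants:", constant)
```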

3. ACS B17001 — Overall Poverty Rate

Pulled automatically via the Census Bureau API when calfresh_01_load_data.py is run. Requires a free Census API key — sign up at https://api.census.gov/data/key_signup.html

Add your key to the script:

API_KEY = "your_key_here"

Covers all California counties for 2019–2022 (5-year estimates). The yearly rates are averaged per county to align with the MMG panel average.
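For orientation, a request of this kind might look like the sketch below. It assumes the public ACS 5-year endpoint and the two B17001 cells usually used for a poverty rate (B17001_001E, population for whom poverty status is determined, and B17001_002E, population below the poverty level). The function names are illustrative, and the repository's script may build its request differently.

```python
ACS_URL = "https://api.census.gov/data/{year}/acs/acs5"

def build_params(api_key):
    """Query parameters for all counties in California (state FIPS 06)."""
    return {
        "get": "NAME,B17001_001E,B17001_002E",
        "for": "county:*",
        "in": "state:06",
        "key": api_key,
    }

def poverty_rates(rows):
    """Convert the API's list-of-lists response (header row first)
    into {county FIPS: poverty rate in percent}."""
    header, *records = rows
    i_total = header.index("B17001_001E")
    i_below = header.index("B17001_002E")
    i_county = header.index("county")
    rates = {}
    for rec in records:
        total = float(rec[i_total])
        below = float(rec[i_below])
        rates["06" + rec[i_county]] = 100.0 * below / total
    return rates

def fetch_year(year, api_key):
    """Fetch one ACS year. Needs network access and a valid key."""
    import requests  # imported here so the parser above works offline
    resp = requests.get(ACS_URL.format(year=year),
                        params=build_params(api_key), timeout=30)
    resp.raise_for_status()
    return poverty_rates(resp.json())

# Offline illustration with made-up numbers in the API's response shape.
sample = [
    ["NAME", "B17001_001E", "B17001_002E", "state", "county"],
    ["Example County, California", "1000", "150", "06", "001"],
]
print(poverty_rates(sample))
```

Averaging the per-year dictionaries returned by `fetch_year` would then give one poverty rate per county.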

4. DFA256 — California Admin CalFresh Data

URL: https://data.ca.gov (search "DFA256")

Monthly county-level CalFresh enrollment, 2004–2017. Loaded and inspected but not used in the final analysis due to a date gap with the MMG panel (which starts 2019). Included in the loading script for completeness.


Quickstart

Requirements

pip install -r requirements.txt

requirements.txt:

pandas>=2.0
numpy>=1.24
scipy>=1.10
matplotlib>=3.7
openpyxl>=3.1
requests>=2.28

No R. No statsmodels. No proprietary tools.

Configure paths

At the top of each script, update these two variables to match your local folder structure:

DATA = "/your/path/to/data/"      # folder containing downloaded data files
FIG  = "/your/path/to/figures/"   # folder for output figures

Step-by-step run order

Scripts must be run in order — each depends on outputs from the previous.
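If you prefer a single command, a small runner along these lines would execute the scripts in order and stop at the first failure. This helper is not part of the repository; it is a sketch that assumes it is run from the repository root and that `DATA` and `FIG` have already been set in each script.

```python
import subprocess
import sys
from pathlib import Path

# Script names taken from the repository structure, in run order.
SCRIPTS = [
    "calfresh_01_load_data.py",
    "calfresh_02_eda.py",
    "calfresh_03_feature_engineering.py",
    "calfresh_04_regression.py",
    "calfresh_poverty_scatter.py",
]

def run_pipeline(repo_dir=".", scripts=SCRIPTS):
    """Run each script with the current interpreter, in order."""
    repo = Path(repo_dir)
    for name in scripts:
        print(f"Running {name} ...")
        # check=True raises CalledProcessError if a script exits non-zero,
        # so later steps never run on missing or stale inputs.
        subprocess.run([sys.executable, name], cwd=repo, check=True)
```

Calling `run_pipeline()` from the repository root runs all five steps.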


Step 1 — calfresh_01_load_data.py

Reads all four data sources, pulls ACS B17001 overall poverty rate from the Census API, and saves clean CSVs.

Before running — add to data/ folder:

  • MMG2025_2019-2023_Data_To_Share.xlsx (Feeding America)
  • 2025-food-environment-atlas-data.xlsx (USDA)
  • Set API_KEY = "your_key_here" in the script

Outputs to data/: county_avg_2019_2023.csv, mmg_ca_panel.csv, food_env_atlas_ca.csv, b17001_ca_avg.csv, b17001_ca_panel.csv, b22003_ca_panel.csv, dfa256_annual.csv


Step 2 — calfresh_02_eda.py

Exploratory data analysis — distributions, correlations, bivariate scatter plots.

Inputs: county_avg_2019_2023.csv, food_env_atlas_ca.csv

Outputs to figures/: 01_univariate.png, 02_bivariate.png, 03_correlation_matrix.png


Step 3 — calfresh_03_feature_engineering.py

Transforms skewed variables, checks VIF and drops multicollinear predictors, standardizes all features to z-scores, merges overall poverty rate from ACS B17001.

Inputs: county_avg_2019_2023.csv, food_env_atlas_ca.csv, b17001_ca_avg.csv

Outputs to data/: county_features.csv

Outputs to figures/: 04_log_transforms.png
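The standardization in this step is what gives the regression coefficients their "pp per SD" reading. A minimal sketch, assuming population standard deviation (`ddof=0`); the script may use the sample version instead:

```python
import numpy as np

def zscore(x, ddof=0):
    """Center to mean 0 and scale to standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=ddof)

# Made-up poverty rates for five counties, in percent.
z = zscore([8.0, 11.5, 14.0, 17.5, 21.0])
print(z.round(3))
```

With predictors on this scale and the outcome left in percent, a coefficient of 1.451 means food insecurity is about 1.45 percentage points higher per one-standard-deviation increase in the predictor.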


Step 4 — calfresh_04_regression.py

Runs six OLS model specifications (M1, M2, M3, M3P, M4, and M5, as listed in the model table under Analytical Methods), prints coefficient tables, and generates the coefficient stability and residual diagnostic figures.

Inputs: county_features.csv

Outputs to figures/: 05_coef_stability.png, 06_residuals_m3p.png


Supplemental — calfresh_poverty_scatter.py

Scatter plot of overall poverty rate vs food insecurity rate, labeled with notable counties.

Inputs: county_features.csv

Outputs to figures/: calfresh_poverty_vs_fi.png


Analytical Methods

OLS Regression with HC3 Robust Standard Errors

Outcome: food_insecurity_rate (county average 2019–2023, %)

Unit: California county (N = 58)

Standard errors: HC3 heteroskedasticity-robust, implemented from scratch in NumPy. Standard OLS assumes constant residual variance across all observations — an assumption called homoskedasticity. MMG estimates for tiny rural counties carry far more uncertainty than estimates for large counties, because they are based on much thinner underlying data. HC3 corrects for this unequal variance without assuming any particular form for the heteroskedasticity.
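The HC3 estimator is compact enough to show in full. The sketch below is one way to write it in NumPy, not the repository's exact code. It computes the sandwich covariance (X'X)⁻¹ X' diag(eᵢ² / (1 − hᵢ)²) X (X'X)⁻¹, where eᵢ are the OLS residuals and hᵢ the leverages. The function name and synthetic data are illustrative.

```python
import numpy as np

def ols_hc3(X, y):
    """OLS with an intercept; returns (coefficients, HC3 standard errors)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])       # add intercept column
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    beta = XtX_inv @ Xd.T @ y
    resid = y - Xd @ beta
    # Leverage h_i = x_i' (X'X)^-1 x_i, the diagonal of the hat matrix.
    h = np.einsum("ij,jk,ik->i", Xd, XtX_inv, Xd)
    # HC3 inflates each squared residual by 1 / (1 - h_i)^2, which
    # up-weights high-leverage observations.
    w = (resid / (1.0 - h)) ** 2
    meat = Xd.T @ (Xd * w[:, None])
    cov = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))

# Synthetic data with noise that grows with |x1| (heteroskedastic).
rng = np.random.default_rng(0)
X = rng.normal(size=(58, 2))
y = 10 + 1.5 * X[:, 0] + rng.normal(scale=1 + np.abs(X[:, 0]), size=58)
beta, se = ols_hc3(X, y)
print("coef:", beta.round(3))
print("HC3 SE:", se.round(3))
```

The 1 / (1 − hᵢ)² factor is what separates HC3 from the simpler HC0 form, and it matters most in small samples such as N = 58, where individual counties can carry high leverage.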

Model specifications:

| Model | Specification | N | R² | Adj-R² |
|---|---|---|---|---|
| M1 | SNAP benefit per capita + cost per meal (baseline) | 58 | 0.705 | 0.695 |
| M2 | + Low access variables | 58 | 0.748 | 0.729 |
| M3 | + SNAP income eligibility rate | 58 | 0.817 | 0.800 |
| M3P | + Overall poverty rate (preferred) | 58 | 0.880 | 0.866 |
| M4 | + Benefit growth (full model) | 58 | 0.823 | 0.802 |
| M5 | Sensitivity: large counties only (N=39) | 39 | 0.835 | 0.810 |

Variable Transformation

Three transformations were tested for each right-skewed variable (log, sqrt, Box-Cox). Square root was chosen for pct_low_access_pop because log overcorrected (skew flipped from +2.20 to −1.31). Log was used for the other two access variables. Box-Cox was used as a diagnostic benchmark but not as the primary transform — it requires storing and reapplying a specific lambda parameter, unlike sqrt which is parameter-free.
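A comparison like this can be reproduced with `scipy.stats`. The sketch below assumes strictly positive input (required by both the log and `scipy.stats.boxcox`) and uses a synthetic right-skewed vector, not the project's data. The function name is illustrative.

```python
import numpy as np
from scipy import stats

def compare_skew(x):
    """Skewness of x under each candidate transform."""
    x = np.asarray(x, dtype=float)
    bc, lam = stats.boxcox(x)   # lambda is fitted by maximum likelihood
    return {
        "raw": stats.skew(x),
        "log": stats.skew(np.log(x)),
        "sqrt": stats.skew(np.sqrt(x)),
        "boxcox": stats.skew(bc),
        "boxcox_lambda": lam,
    }

# Synthetic right-skewed variable: the exponential of an evenly spaced grid.
x = np.exp(np.linspace(-2.0, 2.0, 58))
res = compare_skew(x)
for name, value in res.items():
    print(f"{name:14s} {value: .3f}")
```

On this toy vector the log removes the skew entirely and the square root only reduces it. On real data the ranking can differ, as with pct_low_access_pop above, where the log pushed the skew negative. The fitted `boxcox_lambda` is the parameter that would have to be stored and reapplied if Box-Cox were the primary transform.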

VIF Check

All predictors checked for multicollinearity using Variance Inflation Factor before model building. Three variables dropped: access_x_eligible_z (VIF = 70.51), log_pct_low_access_lowincome_z (VIF = 43.45), pct_fi_snap_eligible_z (VIF = 12.34). VIF for all retained variables in M3P is below 5.5.
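Because the project does not use statsmodels, VIF has to be computed by hand. One way to do it, sketched here with an illustrative function name and synthetic data: regress each predictor on all the others and take VIFⱼ = 1 / (1 − Rⱼ²).

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        centered = target - target.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic example: x3 is almost a copy of x1, x2 is unrelated.
rng = np.random.default_rng(1)
x1 = rng.normal(size=58)
x2 = rng.normal(size=58)
x3 = x1 + 0.1 * rng.normal(size=58)
v = vif(np.column_stack([x1, x2, x3]))
print(v.round(2))
```

In the toy data x1 and x3 get large VIFs and x2 stays near 1. Dropping one of the collinear pair and recomputing is the usual next step.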

Why M3P is the Preferred Model

M3 shows strong positive associations for both SNAP benefit per capita (1.177, p < 0.001) and SNAP income eligibility rate (1.117, p < 0.001). These look counterintuitive — higher SNAP activity associated with higher food insecurity. The explanation is that both variables are driven by poverty: counties with more poverty have more food insecurity, receive more SNAP benefits, and have more residents meeting the income eligibility criterion — all for the same reason.

M3P tests this directly by adding overall poverty rate. Result: SNAP benefit coefficient drops to 0.611 (borderline significant) and eligibility rate collapses to 0.244 (non-significant). Both were largely poverty proxies. The one predictor that survives poverty control is low food access for SNAP recipients (0.556, p = 0.013) — geographic distance to food retail as an independent driver beyond poverty.
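The proxy mechanism is easy to reproduce on synthetic data. In the sketch below, which uses invented numbers rather than the project's data, food insecurity depends only on poverty, and a "SNAP" variable is just poverty plus noise. The SNAP coefficient looks large on its own and shrinks toward zero once poverty enters the model.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 58
poverty = rng.normal(size=n)
snap = poverty + 0.5 * rng.normal(size=n)      # a noisy proxy for poverty
fi = 1.5 * poverty + 0.5 * rng.normal(size=n)  # driven by poverty only

def ols_slopes(y, *cols):
    """OLS slopes (intercept dropped) of y on the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

b_alone = ols_slopes(fi, snap)[0]          # SNAP without poverty control
b_ctrl = ols_slopes(fi, snap, poverty)[0]  # SNAP with poverty control
print(f"SNAP coefficient, no poverty control:   {b_alone:.3f}")
print(f"SNAP coefficient, with poverty control: {b_ctrl:.3f}")
```

This is the same pattern as the eligibility rate falling from 1.117 in M3 to 0.244 in M3P. A predictor that keeps its coefficient after the control is added, as low access for SNAP recipients does, is carrying information that poverty alone does not.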


Key Figures

| Figure | File | What it shows |
|---|---|---|
| Univariate distributions | figures/01_univariate.png | All nine candidate variables, 58 counties |
| Bivariate scatter plots | figures/02_bivariate.png | Food insecurity vs key predictors |
| Correlation matrix | figures/03_correlation_matrix.png | Pairwise correlations, multicollinearity visible |
| Variable transforms | figures/04_log_transforms.png | Before/after skew for three access variables |
| Poverty vs FI | figures/calfresh_poverty_vs_fi.png | r = 0.90 scatter, 58 counties labeled |
| Coefficient stability | figures/05_coef_stability.png | M1 through M3P coefficient trajectories |
| Residual diagnostics | figures/06_residuals_m3p.png | Fitted vs residuals, Q-Q, actual vs fitted |

Limitations

  • Cross-sectional data cannot establish causation. Each county appears once, as a multi-year average, so without temporal variation or a natural experiment the direction of causality cannot be determined from an association alone.
  • MMG food insecurity estimates are model-derived, not directly measured. Small county estimates carry more uncertainty.
  • N = 58 is a small sample. Sequential model building and aggressive VIF pruning kept the model parsimonious.
  • DFA256 administrative data ends in 2017, creating a gap with the 2019–2023 MMG panel.
  • The 2022 SNAP benefit variable reflects COVID-19 Emergency Allotments still active in California — elevated above typical benefit levels.
  • Spatial autocorrelation is not addressed. Neighboring Central Valley counties share unobserved traits. A spatial lag model or Moran's I test would be a natural next step.
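As a pointer for the spatial follow-up, Moran's I can be computed in NumPy without extra dependencies. This sketch is not part of the repository. It uses a toy chain of six units in place of a real county adjacency matrix, which would have to be built from county boundary data, and the natural input would be the M3P residuals.

```python
import numpy as np

def morans_i(values, W):
    """Moran's I = (n / sum(W)) * (z' W z) / (z' z), with z = x - mean(x)."""
    x = np.asarray(values, dtype=float)
    W = np.asarray(W, dtype=float)
    z = x - x.mean()
    return (len(x) / W.sum()) * (z @ W @ z) / (z @ z)

def chain_weights(n):
    """Binary adjacency for n units in a line (toy stand-in for counties)."""
    W = np.zeros((n, n))
    idx = np.arange(n - 1)
    W[idx, idx + 1] = 1.0
    W[idx + 1, idx] = 1.0
    return W

W = chain_weights(6)
clustered = morans_i([1, 1, 1, 5, 5, 5], W)    # similar values adjacent
alternating = morans_i([1, 5, 1, 5, 1, 5], W)  # dissimilar values adjacent
print(round(clustered, 3), round(alternating, 3))  # 0.6 -1.0
```

A clearly positive value on the regression residuals would indicate that neighboring counties share unexplained variation, which is the case where a spatial lag model becomes worth fitting.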

Citation

Kachwala, N. (2026). California CalFresh Coverage Gap Analysis.
GitHub: https://github.com/nishreenk/ca-calfresh-coverage-gap

Data citations:

Feeding America (2025). Map the Meal Gap 2025. feedingamerica.org
USDA Economic Research Service (2025). Food Environment Atlas. ers.usda.gov
U.S. Census Bureau. American Community Survey 5-Year Estimates B17001, 2019-2022.
California CDSS. DFA256 CalFresh Administrative Data, 2004-2017.

Contact

Nishrin Kachwala
GitHub: @nishreenk
Medium: @nishrin-kachwala
