Built at the University of Washington Databricks Hackathon 2026
- Overview
- The Problem
- The Solution
- Architecture Diagram
- Core Workflow
- Results
- Tech Stack
- Project Structure
- Key Technical Achievements
- Scenario Engine
- Getting Started
- Deployment
- Contributing
- License
## Overview

HarvestIQ is an end-to-end agricultural yield forecasting platform that predicts corn and soybean yield 60–90 days in advance at the US county level. It quantifies uncertainty to provide both expected yield and best/worst-case scenarios based on near-term weather forecasts.

"If temperatures exceed 95°F for more than 4 days during pollination, expected corn yield decreases by 12%."
## The Problem

Today, yield forecasting is reactive:
- Farmers discover yield losses after harvest
- Crop insurers price risk without county-level data
- Agronomists lack uncertainty quantification tools
There is no early warning system for weather-driven yield risk.
## The Solution

The platform uses 15 years of NOAA weather data and XGBoost quantile regression models to generate three forecast bands:
- Q10 (Pessimistic): 10th-percentile yield, the low-end scenario
- Q50 (Median): 50th-percentile yield, the central estimate
- Q90 (Optimistic): 90th-percentile yield, the high-end scenario
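Quantile models of this kind are usually fit by minimizing the pinball (quantile) loss, which penalizes under- and over-prediction asymmetrically. The sketch below is a plain-Python illustration of that loss, not the project's training code; the yield values are made up. Recent XGBoost releases expose the same idea through the `reg:quantileerror` objective, though whether HarvestIQ uses that objective is an assumption.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for a quantile level q in (0, 1)."""
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        error = actual - predicted
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * error if error >= 0 else (q - 1) * error
    return total / len(y_true)


# Hypothetical county yields in bu/acre, for illustration only.
yields = [150.0, 160.0, 170.0, 180.0, 190.0]

# A constant low guess is cheap under q=0.1 and expensive under q=0.9,
# which is what pushes the Q10 model toward the low end of the distribution.
low_guess = [150.0] * len(yields)
loss_q10 = pinball_loss(yields, low_guess, 0.1)
loss_q90 = pinball_loss(yields, low_guess, 0.9)
print(loss_q10, loss_q90)
```

Training one model per quantile level (0.1, 0.5, 0.9) with this loss yields the three bands listed above.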
---

## Architecture Diagram
+----------------------+   +----------------------+   +----------------------+
| NOAA GHCN Data       |   | US Census Data       |   | Weather Forecasts    |
| (Raw Stations)       |   | (TIGER/Line)         |   | (Future Conditions)  |
+----------+-----------+   +----------+-----------+   +----------+-----------+
           |                          |                          |
           v                          v                          v
+----------------------+   +----------------------+   +----------------------+
| PySpark ETL on       |   | KDTree Spatial       |   | Scenario             |
| Databricks           |   | Matching Engine      |   | Generation           |
|                      |   |                      |   |                      |
| - GDD calc           |   | - 44K stations       |   | - Drought            |
| - Heat stress        |   | - 3K+ counties       |   | - Heatwave           |
| - Precip features    |   | - County centroids   |   | - Excess Rain        |
+----------+-----------+   +----------+-----------+   +----------+-----------+
           |                          |                          |
           +--------------------------+--------------------------+
                                      |
                                      v
                     +---------------------------------+
                     | XGBoost Quantile Models         |
                     | (MLflow Tracked)                |
                     |                                 |
                     | - Corn model (q10,q50,q90)      |
                     | - Soybean model (q10,q50,q90)   |
                     | - R² ~0.65 avg                  |
                     +----------------+----------------+
                                      |
                                      v
                     +---------------------------------+
                     | Unity Catalog Storage           |
                     | Model Registry & Tables         |
                     +----------------+----------------+
                                      |
                                      v
                     +---------------------------------+
                     | Plotly Interactive              |
                     | Dashboards                      |
                     | - County-level maps             |
                     | - Time series charts            |
                     | - Scenario comparisons          |
                     +---------------------------------+
## Core Workflow

Raw NOAA GHCN Data (Databricks Marketplace)
|
v
PySpark Feature Engineering
(GDD, heat stress, precipitation features)
|
v
KDTree Spatial Matching
(Weather stations → County centroids)
|
v
XGBoost Quantile Regression
(q10, q50, q90 for corn & soybeans)
|
v
MLflow Tracking & Unity Catalog
(Model versioning & storage)
|
v
Plotly Dashboards
(Interactive visualization)
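The feature engineering step derives growing degree days (GDD) and heat stress from daily temperatures. The sketch below shows both calculations in plain Python rather than PySpark. The 50°F base and 86°F cap are the conventional corn thresholds and are assumptions here, as is reusing the 95°F threshold from the overview example; the temperature readings are made up.

```python
def growing_degree_days(tmax_f, tmin_f, base_f=50.0, cap_f=86.0):
    """Daily GDD in °F using the capped-average method common for corn."""
    # Clamp both temperatures into [base_f, cap_f] before averaging.
    tmax = min(max(tmax_f, base_f), cap_f)
    tmin = min(max(tmin_f, base_f), cap_f)
    return (tmax + tmin) / 2.0 - base_f


def heat_stress_days(daily_tmax_f, threshold_f=95.0):
    """Count days whose maximum temperature exceeds the threshold."""
    return sum(1 for tmax in daily_tmax_f if tmax > threshold_f)


# Hypothetical (tmax, tmin) readings in °F for one station-week.
week = [(88, 64), (92, 68), (96, 70), (97, 72), (99, 74), (85, 60), (45, 38)]

season_gdd = sum(growing_degree_days(tmax, tmin) for tmax, tmin in week)
stress = heat_stress_days([tmax for tmax, _ in week])
print(season_gdd, stress)
```

In a Spark pipeline the same logic would typically be expressed as column expressions and aggregated per station and growing season.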
## Results

| Crop | MAE | R² |
|---|---|---|
| Corn | 20.65 bu/acre | 0.67 |
| Soybeans | 5.58 bu/acre | 0.62 |
The models processed 83M+ weather observations and 108,573 training examples across 2010–2024.
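For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how MAE and R² are computed from point predictions (for example the Q50 forecast). The numbers are invented for illustration and are not the project's evaluation data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between actual and predicted yield."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of variance in actual yield explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual vs. predicted yields in bu/acre.
actual = [150.0, 170.0, 190.0, 210.0]
predicted = [160.0, 165.0, 185.0, 200.0]

mae = mean_absolute_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(mae, r2)
```

MAE is reported in the same units as yield (bu/acre), which is why the corn and soybean values are not directly comparable.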
## Tech Stack

| Component | Technology |
|---|---|
| Platform | Databricks Serverless |
| Data Processing | PySpark |
| ML Models | XGBoost Quantile Regression |
| Experiment Tracking | MLflow |
| Data Catalog | Unity Catalog |
| Spatial Analysis | scipy KDTree |
| Geospatial | US Census TIGER/Line Shapefiles |
| Visualization | Plotly |
| Language | Python 3.9+ |
## Project Structure

HarvestIQ/
├── data/                       # Raw & processed datasets
│   └── processed/              # Feature-engineered data
├── notebooks/                  # Databricks notebooks
│   ├── feature_engineering/    # PySpark feature pipelines
│   ├── model_training/         # XGBoost training & tuning
│   └── visualization/          # Plotly dashboard notebooks
├── preprocessing/              # Helper scripts
│   └── shapefile_processing/   # Census data processing
├── .gitignore
└── README.md
---

## Key Technical Achievements
- KDTree Spatial Matching: Efficiently matched 44,728 weather stations to 3,000+ counties using scipy KDTree.
- Custom Pipeline: Built and validated centroid extraction from raw US Census TIGER/Line shapefiles.
- Quantile Regression: Delivers calibrated uncertainty intervals (q10, q50, q90) rather than simple point estimates, enabling risk-aware decision making.
- Scalable Processing: PySpark distributed processing of 83M+ weather observations on Databricks Serverless.
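The station-to-county matching can be sketched with `scipy.spatial.KDTree` as follows. This is an illustrative example, not the project's code: the station IDs and coordinates are hypothetical, and converting latitude/longitude to 3D unit vectors (so that Euclidean nearest neighbors agree with great-circle nearest neighbors) is an assumption about how distances might be handled.

```python
import numpy as np
from scipy.spatial import KDTree


def to_unit_xyz(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to 3D unit vectors."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.column_stack(
        (np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat))
    )


# Hypothetical station and county-centroid coordinates as (lat, lon).
station_ids = ["STN_A", "STN_B", "STN_C"]
stations = np.array([[41.6, -93.6], [40.8, -96.7], [39.8, -89.6]])
county_centroids = np.array([[41.7, -93.5], [39.9, -89.7]])

# Build the tree once over all stations, then query every centroid in one call.
tree = KDTree(to_unit_xyz(stations[:, 0], stations[:, 1]))
_, nearest = tree.query(
    to_unit_xyz(county_centroids[:, 0], county_centroids[:, 1]), k=1
)

matched = [station_ids[i] for i in nearest]
print(matched)
```

Building the tree once and querying in bulk keeps the lookup roughly logarithmic per county, which is what makes matching tens of thousands of stations practical.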
## Scenario Engine

HarvestIQ simulates three primary weather stress scenarios:

| Scenario | Weather Change | Yield Effect |
|---|---|---|
| Drought | -30% precipitation | -12% yield |
| Heatwave | +5°C temperature | -15% yield |
| Excess Rain | +50% precipitation | -7% yield |
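One way to picture the scenario engine is as a set of perturbations applied to baseline weather features before they are re-scored by the trained models. The sketch below encodes the three perturbations from the table; the `WeatherFeatures` class, its field names, and the baseline values are hypothetical and not taken from the project.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WeatherFeatures:
    """Hypothetical seasonal weather summary for one county."""
    mean_temp_c: float
    total_precip_mm: float


# Perturbations from the scenario table: (temperature shift in °C, precip multiplier).
SCENARIOS = {
    "drought": (0.0, 0.70),
    "heatwave": (5.0, 1.00),
    "excess_rain": (0.0, 1.50),
}


def apply_scenario(features, name):
    """Return a copy of the features with the named scenario applied."""
    temp_shift_c, precip_factor = SCENARIOS[name]
    return replace(
        features,
        mean_temp_c=features.mean_temp_c + temp_shift_c,
        total_precip_mm=features.total_precip_mm * precip_factor,
    )


baseline = WeatherFeatures(mean_temp_c=22.0, total_precip_mm=400.0)
for name in SCENARIOS:
    print(name, apply_scenario(baseline, name))
```

Comparing model predictions on the baseline and perturbed features would then give a per-scenario yield effect.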
## Getting Started

- Databricks Account with Serverless compute enabled
- Python 3.9+ installed locally (for preprocessing scripts)
- pip package manager
- Access to NOAA GHCN data via Databricks Marketplace
git clone https://github.com/BUVKAUSHIK/HarvestIQ.git
cd HarvestIQ

- Open your Databricks workspace
- Navigate to Workspace → Import
- Upload notebooks from the `notebooks/` directory
- Attach NOAA GHCN dataset from Databricks Marketplace
- Upload TIGER/Line shapefiles for county boundaries
- Start with the `feature_engineering` notebook to process weather data
- Run the `model_training` notebook to train XGBoost quantile models
- Launch the `visualization` notebook for interactive dashboards
cd preprocessing
cd shapefile_processing
pip install geopandas shapely
python process_shapefiles.py

## Deployment

This project runs on Databricks Serverless. To deploy:
- Import all notebooks to your Databricks workspace
- Set up Unity Catalog for model registry and data governance
- Configure MLflow for experiment tracking
- Set permissions on the data and model catalogs
- Schedule jobs in Databricks Workflows for automated retraining
For local preprocessing, ensure you have geopandas and shapely installed.
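The local preprocessing step extracts county centroids from TIGER/Line polygons. With geopandas this is typically a one-liner via the `centroid` attribute; the dependency-free sketch below only illustrates what an area-weighted centroid computation involves for a single simple polygon. It is not the project's `process_shapefiles.py`, and it ignores holes, multi-part counties, and map projections.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices."""
    area2 = 0.0  # twice the signed area (shoelace formula)
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # The centroid formula divides by 6 * area, which equals 3 * area2.
    return cx / (3.0 * area2), cy / (3.0 * area2)


# Hypothetical rectangular "county" in projected coordinates.
rectangle = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
print(polygon_centroid(rectangle))
```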
## Contributing

Contributions are welcome! Please follow the standard workflow:
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature-name`
- Commit with clear messages: `git commit -am 'Add some feature'`
- Push to the branch: `git push origin feature/your-feature-name`
- Open a Pull Request
## License

This project is licensed under the MIT License.
Built with ❤️ by BUVKAUSHIK at UW Databricks Hackathon 2026