HarvestIQ — Early Warning Yield Intelligence for American Agriculture

Built at the University of Washington Databricks Hackathon 2026

Overview

HarvestIQ is an end-to-end agricultural yield forecasting platform that predicts corn and soybean yield 60–90 days in advance at the US county level. It quantifies uncertainty to provide both expected yield and best/worst-case scenarios based on near-term weather forecasts.

"If temperatures exceed 95°F for more than 4 days during pollination, expected corn yield decreases by 12%."

The Problem

Today, yield forecasting is reactive:

Farmers discover yield losses after harvest
Crop insurers price risk without county-level data
Agronomists lack uncertainty quantification tools

There is no early warning system for weather-driven yield risk.

The Solution

The platform uses 15 years of NOAA weather data and XGBoost quantile regression models to generate three forecast bands:

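Q10 (Pessimistic): Downside scenario; yield is expected to fall below this level about 10% of the time
Q50 (Median): Central estimate; yield is equally likely to land above or below this level
Q90 (Optimistic): Upside scenario; yield is expected to exceed this level about 10% of the time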
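
Quantile models are usually fit by minimizing the pinball (quantile) loss, which penalizes under- and over-prediction asymmetrically. This README does not state which objective HarvestIQ's XGBoost models use, so the following is only a minimal pure-Python sketch of that loss, with hypothetical yield values, to show how the three quantile levels score the same forecast differently.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Mean pinball (quantile) loss for one quantile level in (0, 1)."""
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must be strictly between 0 and 1")
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        error = actual - predicted
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += quantile * error if error >= 0 else (quantile - 1.0) * error
    return total / len(y_true)


# Hypothetical corn yields in bu/acre, for illustration only.
observed = [170.0, 182.0, 150.0, 195.0]
flat_forecast = [175.0] * 4

loss_q10 = pinball_loss(observed, flat_forecast, 0.10)
loss_q50 = pinball_loss(observed, flat_forecast, 0.50)
loss_q90 = pinball_loss(observed, flat_forecast, 0.90)
```

At the 0.50 level the pinball loss is exactly half the mean absolute error, which is why the Q50 model behaves like a median regressor.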

Architecture Diagram

+---------------------+     +---------------------+     +---------------------+
|  NOAA GHCN Data     |     |  US Census Data     |     |  Weather Forecasts  |
|  (Raw Stations)     |     |  (TIGER/Line)       |     | (Future Conditions) |
+----------+----------+     +----------+----------+     +----------+----------+
           |                           |                           |
           |                           |                           |
           v                           v                           v
+---------------------+     +---------------------+     +---------------------+
| PySpark ETL on      |     | KDTree Spatial      |     | Scenario            |
| Databricks          |     | Matching Engine     |     | Generation          |
|                     |     |                     |     |                     |
| - GDD calc          |     | - 44K stations      |     | - Drought           |
| - Heat stress       |     | - 3K+ counties      |     | - Heatwave          |
| - Precip features   |     | - County centroids  |     | - Excess Rain       |
+----------+----------+     +----------+----------+     +----------+----------+
           |                           |                           |
           +---------------------------+---------------------------+
                                       |
                                       v
                      +---------------------------------+
                      |  XGBoost Quantile Models        |
                      |  (MLflow Tracked)               |
                      |                                 |
                      |  - Corn model (q10,q50,q90)     |
                      |  - Soybean model (q10,q50,q90)  |
                      |  - R² ~0.65 avg                 |
                      +----------------+----------------+
                                       |
                                       v
                      +---------------------------------+
                      |  Unity Catalog Storage          |
                      |  Model Registry & Tables        |
                      +----------------+----------------+
                                       |
                                       v
                      +---------------------------------+
                      |  Plotly Interactive             |
                      |  Dashboards                     |
                      |  - County-level maps            |
                      |  - Time series charts           |
                      |  - Scenario comparisons         |
                      +---------------------------------+

Core Workflow

Raw NOAA GHCN Data (Databricks Marketplace)
         |
         v
  PySpark Feature Engineering
  (GDD, heat stress, precipitation features)
         |
         v
  KDTree Spatial Matching
  (Weather stations → County centroids)
         |
         v
  XGBoost Quantile Regression
  (q10, q50, q90 for corn & soybeans)
         |
         v
  MLflow Tracking & Unity Catalog
  (Model versioning & storage)
         |
         v
  Plotly Dashboards
  (Interactive visualization)
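
The feature-engineering step derives growing degree days (GDD) and heat stress counts from daily temperatures. The exact formulas used in the notebooks are not shown in this README, so the sketch below assumes the capped 86/50 °F GDD method commonly used for corn and a 95 °F heat-stress threshold (the threshold quoted in the Overview). It is plain Python for readability; in the pipeline the same arithmetic would be applied column-wise in PySpark. All readings are hypothetical.

```python
def growing_degree_days(tmax_f, tmin_f, base_f=50.0, cap_f=86.0):
    """Daily GDD in °F using the capped (86/50) method, an assumed convention."""
    # Clamp both readings into [base_f, cap_f] before averaging.
    capped_max = max(min(tmax_f, cap_f), base_f)
    capped_min = max(min(tmin_f, cap_f), base_f)
    return (capped_max + capped_min) / 2.0 - base_f


def heat_stress_days(daily_tmax_f, threshold_f=95.0):
    """Count days whose maximum temperature exceeds the threshold."""
    return sum(1 for tmax in daily_tmax_f if tmax > threshold_f)


# Hypothetical week of (tmax, tmin) readings in °F.
week = [(88, 64), (92, 68), (96, 71), (97, 73), (84, 60), (79, 48), (99, 75)]

weekly_gdd = sum(growing_degree_days(tmax, tmin) for tmax, tmin in week)
stress_days = heat_stress_days(tmax for tmax, _ in week)
```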

Results

Crop	MAE	R²
Corn	20.65 bu/acre	0.67
Soybeans	5.58 bu/acre	0.62

The pipeline processed 83M+ weather observations, and the models were trained on 108,573 examples spanning 2010–2024.
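
For readers who want to check the two metrics in the table, here is a minimal sketch of how MAE and R² are computed from observed and predicted yields. The numbers are hypothetical and unrelated to the reported results; the evaluation code in the notebooks may differ.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical county yields in bu/acre, for illustration only.
observed = [160.0, 180.0, 200.0, 140.0]
predicted = [170.0, 175.0, 190.0, 150.0]

mae = mean_absolute_error(observed, predicted)
r2 = r_squared(observed, predicted)
```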

Tech Stack

Component	Technology
Platform	Databricks Serverless
Data Processing	PySpark
ML Models	XGBoost Quantile Regression
Experiment Tracking	MLflow
Data Catalog	Unity Catalog
Spatial Analysis	scipy KDTree
Geospatial	US Census TIGER/Line Shapefiles
Visualization	Plotly
Language	Python 3.9+

Project Structure

HarvestIQ/
├── data/                     # Raw & processed datasets
│   └── processed/            # Feature-engineered data
├── notebooks/                # Databricks notebooks
│   ├── feature_engineering/  # PySpark feature pipelines
│   ├── model_training/       # XGBoost training & tuning
│   └── visualization/        # Plotly dashboard notebooks
├── preprocessing/            # Helper scripts
│   └── shapefile_processing/ # Census data processing
├── .gitignore
└── README.md

Key Technical Achievements

KDTree Spatial Matching: Efficiently matched 44,728 weather stations to 3,000+ counties using scipy KDTree.
Custom Pipeline: Built and validated centroid extraction from raw US Census TIGER/Line shapefiles.
Quantile Regression: Delivers calibrated uncertainty intervals (q10, q50, q90) rather than simple point estimates — enabling risk-aware decision making.
Scalable Processing: PySpark distributed processing of 83M+ weather observations on Databricks Serverless.
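
The matching code itself is not reproduced in this README. As a correctness reference for the station-to-county lookup, the sketch below uses a brute-force great-circle (haversine) search in plain Python rather than a KDTree. A KDTree built over 3D Cartesian coordinates on the unit sphere would return the same nearest neighbor, only much faster at the scale of 44K stations. Station IDs and coordinates are hypothetical and approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(centroid, stations):
    """Return (station_id, distance_km) for the station closest to a centroid."""
    lat, lon = centroid
    return min(
        ((sid, haversine_km(lat, lon, slat, slon)) for sid, (slat, slon) in stations.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical stations and county centroids (approximate coordinates).
stations = {
    "STATION_DES_MOINES": (41.53, -93.65),
    "STATION_OMAHA": (41.30, -95.90),
    "STATION_CHICAGO": (41.98, -87.90),
}
county_centroids = {
    "Story County, IA": (42.04, -93.47),
    "Cook County, IL": (41.84, -87.82),
}

matches = {name: nearest_station(c, stations) for name, c in county_centroids.items()}
```

Comparing a sample of KDTree results against a brute-force search like this is a cheap way to validate the spatial join before running it on every county.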

Scenario Engine

HarvestIQ simulates three primary weather stress scenarios:

Scenario	Weather Change	Yield Effect
Drought	-30% precipitation	-12% yield
Heatwave	+5°C temperature	-15% yield
Excess Rain	+50% precipitation	-7% yield
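
One plausible way to implement these scenarios is to perturb the baseline weather features and re-score them with the trained quantile models. The sketch below covers only the perturbation step. The feature names `precip_mm` and `mean_temp_c` are hypothetical, and the scaling factors are taken directly from the table above; how the actual engine applies them is an assumption.

```python
# Weather changes from the scenario table, expressed as a precipitation
# multiplier and an additive temperature shift in °C.
SCENARIOS = {
    "Drought": {"precip_scale": 0.70, "temp_shift_c": 0.0},
    "Heatwave": {"precip_scale": 1.00, "temp_shift_c": 5.0},
    "Excess Rain": {"precip_scale": 1.50, "temp_shift_c": 0.0},
}


def apply_scenario(features, scenario_name):
    """Return a perturbed copy of the baseline features; the input is not modified."""
    change = SCENARIOS[scenario_name]
    perturbed = dict(features)
    perturbed["precip_mm"] = features["precip_mm"] * change["precip_scale"]
    perturbed["mean_temp_c"] = features["mean_temp_c"] + change["temp_shift_c"]
    return perturbed


# Hypothetical baseline growing-season features for one county.
baseline = {"precip_mm": 400.0, "mean_temp_c": 22.0}

drought = apply_scenario(baseline, "Drought")
heatwave = apply_scenario(baseline, "Heatwave")
excess_rain = apply_scenario(baseline, "Excess Rain")
```

Each perturbed feature set would then be passed to the q10, q50, and q90 models to obtain a scenario-specific yield range.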

Getting Started

Prerequisites

Databricks Account with Serverless compute enabled
Python 3.9+ installed locally (for preprocessing scripts)
pip package manager
Access to NOAA GHCN data via Databricks Marketplace

Setup & Running

1. Clone the Repository

git clone https://github.com/BUVKAUSHIK/HarvestIQ.git
cd HarvestIQ

2. Upload Notebooks to Databricks

Open your Databricks workspace
Navigate to Workspace → Import
Upload notebooks from the notebooks/ directory

3. Configure Data Sources

Attach the NOAA GHCN dataset from Databricks Marketplace
Upload the TIGER/Line shapefiles for county boundaries

4. Run the Pipeline

Start with the feature_engineering notebook to process weather data
Run the model_training notebook to train the XGBoost quantile models
Launch the visualization notebook for interactive dashboards

5. Local Preprocessing (Optional)

cd preprocessing/shapefile_processing
pip install geopandas shapely
python process_shapefiles.py
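
The contents of process_shapefiles.py are not shown here. With geopandas, county centroids are typically available through the `centroid` attribute of a geometry column. To illustrate what that computes, the following is a minimal pure-Python sketch of the area-weighted centroid of a simple polygon (shoelace formula). It treats coordinates as planar, which is only an approximation for longitude/latitude data, and the example polygon is hypothetical.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    # Centroid = sum / (6 * area), and 6 * area == 3 * area2.
    return cx / (3.0 * area2), cy / (3.0 * area2)


# Hypothetical rectangular county outline.
rectangle = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
centroid_x, centroid_y = polygon_centroid(rectangle)
```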

Deployment

This project runs on Databricks Serverless. To deploy:

Import all notebooks to your Databricks workspace
Set up Unity Catalog for model registry and data governance
Configure MLflow for experiment tracking
Set permissions on the data and model catalogs
Schedule jobs in Databricks Workflows for automated retraining

For local preprocessing, ensure you have geopandas and shapely installed.

Contributing

Contributions are welcome! Please follow the standard workflow:

Fork the repository

Create a feature branch:

git checkout -b feature/your-feature-name

Commit with clear messages:
```
git commit -am 'Add some feature'
```

Push to the branch:

git push origin feature/your-feature-name

Open a Pull Request

License

This project is licensed under the MIT License.

Built with ❤️ by BUVKAUSHIK at UW Databricks Hackathon 2026
