
HarvestIQ: Early Warning Yield Intelligence for American Agriculture

Built at the University of Washington Databricks Hackathon 2026





Overview

HarvestIQ is an end-to-end agricultural yield forecasting platform that predicts corn and soybean yield 60–90 days in advance at the US county level. It quantifies uncertainty to provide both expected yield and best/worst-case scenarios based on near-term weather forecasts.

"If temperatures exceed 95°F for more than 4 days during pollination, expected corn yield decreases by 12%."


The Problem

Today, yield forecasting is reactive:

  • Farmers discover yield losses after harvest
  • Crop insurers price risk without county-level data
  • Agronomists lack uncertainty quantification tools

There is no early warning system for weather-driven yield risk.


The Solution

The platform uses 15 years of NOAA weather data and XGBoost quantile regression models to generate three forecast bands:

  • Q10 (Pessimistic): Worst-case scenario
  • Q50 (Median): Most likely yield
  • Q90 (Optimistic): Best-case scenario
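
Quantile models of this kind are typically fit by minimizing the pinball (quantile) loss, which is what XGBoost's `reg:quantileerror` objective optimizes. The sketch below is a dependency-free illustration of that loss, not the project's training code; the function name `pinball_loss` and the sample yields are hypothetical.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


# Hypothetical county yields in bu/acre, for illustration only.
yields = [150.0, 160.0, 170.0, 180.0, 190.0]

# A constant forecast of 160 is penalized lightly at q=0.1 but heavily at
# q=0.9, because most of the observed yields lie above it.
low = pinball_loss(yields, [160.0] * 5, q=0.1)
high = pinball_loss(yields, [160.0] * 5, q=0.9)
print(low, high)
```

Because the penalty is asymmetric, a model trained at q=0.1 is pushed toward the low end of the yield distribution and a model trained at q=0.9 toward the high end, which is what produces the three bands.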

Architecture Diagram

+---------------------+   +---------------------+   +---------------------+
| NOAA GHCN Data      |   | US Census Data      |   | Weather Forecasts   |
| (Raw Stations)      |   | (TIGER/Line)        |   | (Future Conditions) |
+----------+----------+   +----------+----------+   +----------+----------+
           |                         |                         |
           |                         |                         |
           v                         v                         v
+---------------------+   +---------------------+   +---------------------+
| PySpark ETL on      |   | KDTree Spatial      |   | Scenario            |
| Databricks          |   | Matching Engine     |   | Generation          |
|                     |   |                     |   |                     |
| - GDD calc          |   | - 44K stations      |   | - Drought           |
| - Heat stress       |   | - 3K+ counties      |   | - Heatwave          |
| - Precip features   |   | - County centroids  |   | - Excess Rain       |
+----------+----------+   +----------+----------+   +----------+----------+
           |                         |                         |
           +-------------------------+-------------------------+
                                     |
                                     v
                    +---------------------------------+
                    | XGBoost Quantile Models         |
                    | (MLflow Tracked)                |
                    |                                 |
                    | - Corn model (q10,q50,q90)      |
                    | - Soybean model (q10,q50,q90)   |
                    | - R² ~0.65 avg                  |
                    +----------------+----------------+
                                     |
                                     v
                    +---------------------------------+
                    | Unity Catalog Storage           |
                    | Model Registry & Tables         |
                    +----------------+----------------+
                                     |
                                     v
                    +---------------------------------+
                    | Plotly Interactive              |
                    | Dashboards                      |
                    | - County-level maps             |
                    | - Time series charts            |
                    | - Scenario comparisons          |
                    +---------------------------------+

Core Workflow

Raw NOAA GHCN Data (Databricks Marketplace)
         |
         v
  PySpark Feature Engineering
  (GDD, heat stress, precipitation features)
         |
         v
  KDTree Spatial Matching
  (Weather stations → County centroids)
         |
         v
  XGBoost Quantile Regression
  (q10, q50, q90 for corn & soybeans)
         |
         v
  MLflow Tracking & Unity Catalog
  (Model versioning & storage)
         |
         v
  Plotly Dashboards
  (Interactive visualization)
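
To make the feature-engineering step concrete, here is a minimal single-machine sketch of two of the features named above. It assumes the common corn convention of a 50°F base and an 86°F cap for growing degree days (GDD), and it reuses the 95°F threshold from the overview quote. The function names and sample temperatures are hypothetical, and the project's PySpark pipeline may compute these features differently.

```python
def daily_gdd(tmax_f, tmin_f, base=50.0, cap=86.0):
    """Growing degree days for one day, temperatures in Fahrenheit."""
    # Clamp both readings into [base, cap] before averaging.
    tmax = min(max(tmax_f, base), cap)
    tmin = min(max(tmin_f, base), cap)
    return (tmax + tmin) / 2.0 - base


def heat_stress_days(tmax_series, threshold=95.0):
    """Count days whose maximum temperature exceeds the threshold."""
    return sum(1 for t in tmax_series if t > threshold)


# Hypothetical week of (tmax, tmin) readings in Fahrenheit.
week = [(88, 62), (97, 70), (99, 72), (84, 48), (96, 68), (90, 66), (101, 74)]

weekly_gdd = sum(daily_gdd(hi, lo) for hi, lo in week)
stress_days = heat_stress_days(hi for hi, _ in week)
print(weekly_gdd, stress_days)
```

In a distributed pipeline the same arithmetic would be expressed as column operations and aggregated per station and growing season.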

Results

| Crop     | MAE (bu/acre) | R²   |
|----------|---------------|------|
| Corn     | 20.65         | 0.67 |
| Soybeans | 5.58          | 0.62 |

The pipeline processed 83M+ weather observations and 108,573 training examples covering 2010–2024.
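
For reference, the two metrics reported above follow their standard definitions. The sketch below computes them in plain Python on made-up numbers; the helper names and sample values are illustrative and are not drawn from the project's evaluation data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted yields in bu/acre.
observed = [150.0, 170.0, 190.0, 210.0]
predicted = [160.0, 165.0, 185.0, 200.0]

mae = mean_absolute_error(observed, predicted)
r2 = r_squared(observed, predicted)
print(mae, r2)
```

MAE is in the same units as the target, so the corn and soybean values are not directly comparable without considering each crop's typical yield range.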


Tech Stack

| Component           | Technology                         |
|---------------------|------------------------------------|
| Platform            | Databricks Serverless              |
| Data Processing     | PySpark                            |
| ML Models           | XGBoost Quantile Regression        |
| Experiment Tracking | MLflow                             |
| Data Catalog        | Unity Catalog                      |
| Spatial Analysis    | scipy KDTree                       |
| Geospatial          | US Census TIGER/Line Shapefiles    |
| Visualization       | Plotly                             |
| Language            | Python 3.9+                        |

Project Structure

HarvestIQ/
├── data/                     # Raw & processed datasets
│   └── processed/            # Feature-engineered data
├── notebooks/                # Databricks notebooks
│   ├── feature_engineering/  # PySpark feature pipelines
│   ├── model_training/       # XGBoost training & tuning
│   └── visualization/        # Plotly dashboard notebooks
├── preprocessing/            # Helper scripts
│   └── shapefile_processing/ # Census data processing
├── .gitignore
└── README.md

Key Technical Achievements

  • KDTree Spatial Matching: Efficiently matched 44,728 weather stations to 3,000+ counties using scipy KDTree.
  • Custom Pipeline: Built and validated centroid extraction from raw US Census TIGER/Line shapefiles.
  • Quantile Regression: Delivers calibrated uncertainty intervals (q10, q50, q90) rather than simple point estimates, enabling risk-aware decision making.
  • Scalable Processing: PySpark distributed processing of 83M+ weather observations on Databricks Serverless.
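
The station-to-county matching can be illustrated with a small nearest-neighbor search. The project uses scipy's KDTree; the sketch below swaps in a dependency-free k-d tree so the idea is visible end to end. Converting latitude/longitude to 3D unit vectors is an assumption of this sketch (it keeps Euclidean nearest neighbors consistent with great-circle nearest neighbors), and the station IDs and coordinates are hypothetical.

```python
import math


def to_xyz(lat_deg, lon_deg):
    """Project latitude/longitude onto the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def build_kdtree(items, depth=0):
    """Build a k-d tree from (point, label) pairs as nested tuples."""
    if not items:
        return None
    axis = depth % 3
    items = sorted(items, key=lambda item: item[0][axis])
    mid = len(items) // 2
    return (items[mid],
            build_kdtree(items[:mid], depth + 1),
            build_kdtree(items[mid + 1:], depth + 1))


def nearest(node, target, depth=0, best=None):
    """Return (squared_distance, label) of the closest stored point."""
    if node is None:
        return best
    (point, label), left, right = node
    dist2 = sum((a - b) ** 2 for a, b in zip(point, target))
    if best is None or dist2 < best[0]:
        best = (dist2, label)
    axis = depth % 3
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, depth + 1, best)
    # Search the far side only if the splitting plane is closer than the best hit.
    if diff * diff < best[0]:
        best = nearest(far, target, depth + 1, best)
    return best


# Hypothetical stations (ID -> latitude, longitude).
stations = {
    "ST_DES_MOINES": (41.59, -93.62),
    "ST_OMAHA": (41.26, -95.93),
    "ST_CHICAGO": (41.88, -87.63),
    "ST_MINNEAPOLIS": (44.98, -93.27),
}
tree = build_kdtree([(to_xyz(lat, lon), sid) for sid, (lat, lon) in stations.items()])

# Hypothetical county centroid in central Iowa.
county_centroid = to_xyz(42.04, -93.46)
_, matched_station = nearest(tree, county_centroid)
print(matched_station)
```

With scipy installed, building `scipy.spatial.KDTree` over the station coordinates and calling its `query` method on the county centroids performs the same lookup for all counties in one call.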

Scenario Engine

HarvestIQ simulates three primary weather stress scenarios:

| Scenario    | Impact              | Yield Effect |
|-------------|---------------------|--------------|
| Drought     | -30% precipitation  | -12% yield   |
| Heatwave    | +5°C temperature    | -15% yield   |
| Excess Rain | +50% precipitation  | -7% yield    |
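
One way to read these scenarios is as perturbations applied to baseline weather features before they are passed to the trained models. The sketch below applies only the input shifts from the Impact column. The feature names (`precip_mm`, `tmean_c`), the `SCENARIOS` mapping, and the baseline values are hypothetical, and the yield effects listed above come from the models, not from this code.

```python
# Input shifts taken from the Impact column of the scenario table.
SCENARIOS = {
    "drought": {"precip_scale": 0.70, "temp_shift_c": 0.0},
    "heatwave": {"precip_scale": 1.00, "temp_shift_c": 5.0},
    "excess_rain": {"precip_scale": 1.50, "temp_shift_c": 0.0},
}


def apply_scenario(features, scenario):
    """Return a perturbed copy of a county's baseline weather features."""
    shift = SCENARIOS[scenario]
    perturbed = dict(features)  # leave the baseline untouched
    perturbed["precip_mm"] = features["precip_mm"] * shift["precip_scale"]
    perturbed["tmean_c"] = features["tmean_c"] + shift["temp_shift_c"]
    return perturbed


# Hypothetical seasonal baseline for one county.
baseline = {"precip_mm": 400.0, "tmean_c": 22.0}

for name in SCENARIOS:
    print(name, apply_scenario(baseline, name))
```

Scoring each perturbed feature set with the q10, q50, and q90 models and comparing against the baseline prediction would give the scenario's yield effect.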

Getting Started

Prerequisites

  • Databricks Account with Serverless compute enabled
  • Python 3.9+ installed locally (for preprocessing scripts)
  • pip package manager
  • Access to NOAA GHCN data via Databricks Marketplace

Setup & Running

1. Clone the Repository

git clone https://github.com/BUVKAUSHIK/HarvestIQ.git
cd HarvestIQ

2. Upload Notebooks to Databricks

  • Open your Databricks workspace
  • Navigate to Workspace → Import
  • Upload notebooks from the notebooks/ directory

3. Configure Data Sources

  • Attach NOAA GHCN dataset from Databricks Marketplace
  • Upload TIGER/Line shapefiles for county boundaries

4. Run the Pipeline

  1. Start with the feature_engineering notebooks to process weather data
  2. Run the model_training notebooks to train the XGBoost quantile models
  3. Launch the visualization notebooks for interactive dashboards

5. Local Preprocessing (Optional)

cd preprocessing/shapefile_processing
pip install geopandas shapely
python process_shapefiles.py

Deployment

This project runs on Databricks Serverless. To deploy:

  1. Import all notebooks to your Databricks workspace
  2. Set up Unity Catalog for model registry and data governance
  3. Configure MLflow for experiment tracking
  4. Set permissions on the data and model catalogs
  5. Schedule jobs in Databricks Workflows for automated retraining

For local preprocessing, ensure you have geopandas and shapely installed.


Contributing

Contributions are welcome! Please follow the standard workflow:

  1. Fork the repository
  2. Create a feature branch:
    git checkout -b feature/your-feature-name
  3. Commit with clear messages:
    git commit -am 'Add some feature'
  4. Push to the branch:
    git push origin feature/your-feature-name
  5. Open a Pull Request

License

This project is licensed under the MIT License.


Built with ❤️ by BUVKAUSHIK at UW Databricks Hackathon 2026
