Research data infrastructure for quantitative strategy development. It ingests OHLCV data from multiple sources, performs statistical validation (gap detection, outlier filtering, corporate action adjustment), and stores the results in a columnar format optimized for backtesting queries.
Status: Beta - Core functionality implemented and tested. See notebooks/examples.py for usage examples.
## Features

- Multi-Source Ingestion: Yahoo Finance, Polygon.io, and CSV/Parquet files
- Data Validation Pipeline: Statistical anomaly detection, gap filling, outlier filtering
- Corporate Action Adjustments: Automatic handling of splits, dividends, and mergers
- Point-in-Time Data: Prevents lookahead bias with as-of date queries
- High-Performance Storage: DuckDB/Parquet backend with sub-second query latency
- Research-Optimized Queries: Pre-built queries for common backtesting patterns (returns, volatility, correlation)
- Data Quality Metrics: Automated reporting on data completeness and integrity
- Polars-First Design: Built on Polars for high-performance data processing
## Architecture

```
┌───────────────────────────────────────────────────────────────────────────┐
│                               Data Sources                                │
├──────────────────┬──────────────────┬──────────────────┬──────────────────┤
│  Yahoo Finance   │    Polygon.io    │   CSV/Parquet    │  Custom Sources  │
└────────┬─────────┴────────┬─────────┴────────┬─────────┴────────┬─────────┘
         │                  │                  │                  │
         ▼                  ▼                  ▼                  ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                              Ingestion Layer                              │
│  • Source adapters (BaseDataSource interface)                             │
│  • Configurable data fetching with date ranges                            │
│  • Automatic column normalization                                         │
└─────────────────────────────────────┬─────────────────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                            Validation Pipeline                            │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐    │
│  │     Gap     │   │   Outlier   │   │  Corporate  │   │   Quality   │    │
│  │  Detection  │ → │  Filtering  │ → │   Actions   │ → │   Scoring   │    │
│  └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘    │
└─────────────────────────────────────┬─────────────────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                               Storage Layer                               │
│  • DuckDB for analytics queries (default)                                 │
│  • Parquet files for archival (partitioned by date)                       │
│  • Point-in-time snapshots for backtesting                                │
└─────────────────────────────────────┬─────────────────────────────────────┘
                                      │
                                      ▼
┌───────────────────────────────────────────────────────────────────────────┐
│                         Query Layer (ResearchAPI)                         │
│  • OHLCV queries with filtering                                           │
│  • Returns calculation (daily/weekly/monthly, simple/log)                 │
│  • Rolling volatility and correlation analysis                            │
│  • Universe screening and summary statistics                              │
│  • Point-in-time queries (backtesting)                                    │
│  • Custom SQL support (DuckDB)                                            │
└───────────────────────────────────────────────────────────────────────────┘
```
## Installation

```bash
# Clone the repository
git clone https://github.com/pramurta/quant-data-pipeline.git
cd quant-data-pipeline
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install in development mode
pip install -e .
```

## Configuration

```bash
# Copy example config (optional)
cp config/config.example.yaml config/config.yaml
# Set API keys (optional - for Polygon.io)
export POLYGON_API_KEY="your_key_here"
```

## Quick Start

```python
from src.pipeline import DataPipeline
from src.ingestion import YahooFinanceSource
# Initialize pipeline
pipeline = DataPipeline(storage_path="./data")
# Ingest data
symbols = ["AAPL", "GOOGL", "MSFT", "AMZN", "META"]
pipeline.ingest(
    source=YahooFinanceSource(),
    symbols=symbols,
    start_date="2020-01-01",
    end_date="2024-01-01",
)
# Query data for research
df = pipeline.query.get_ohlcv(
    symbols=symbols,
    start_date="2023-01-01",
    end_date="2024-01-01",
    adjusted=True,  # Corporate action adjusted
)
# Calculate returns
returns = pipeline.query.get_returns(
    symbols=symbols,
    start_date="2023-01-01",
    frequency="daily",
)
# Get point-in-time data (prevents lookahead bias)
pit_data = pipeline.query.get_point_in_time(
    symbols=symbols,
    as_of_date="2023-06-15",
    lookback_days=252,
)
```

## Data Validation

The validation pipeline performs multiple checks on ingested data.

### Gap Detection
Identifies missing trading days and provides several fill methods:

```python
from src.validation import GapDetector
detector = GapDetector(
    max_gap_days=5,  # Flag gaps longer than 5 days
)
gaps = detector.detect(df)
filled_df = detector.fill(df, method="forward")  # or "linear", "interpolate"
```
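To make the fill semantics concrete, this is what forward filling does on a toy series (plain Polars, independent of GapDetector; the data is made up):

```python
import polars as pl

# Hypothetical series with a missing close on 2023-01-04.
toy = pl.DataFrame({
    "date": ["2023-01-03", "2023-01-04", "2023-01-05"],
    "close": [185.0, None, 187.5],
})

# Forward fill carries the last observed close across the gap, so
# 2023-01-04 becomes 185.0; "linear" would interpolate to 186.25 instead.
filled = toy.with_columns(pl.col("close").fill_null(strategy="forward"))
```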
### Outlier Filtering

Statistical methods for identifying anomalous price movements:

```python
from src.validation import OutlierDetector
detector = OutlierDetector(
    method="zscore",  # or "iqr", "isolation_forest"
    threshold=4.0,
)
outliers = detector.detect(df)
cleaned_df = detector.clean(df, method="clip")  # or "remove"
```
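For reference, the z-score rule flags observations whose distance from the mean exceeds `threshold` standard deviations. A minimal sketch of that test on per-symbol returns (conceptual only; the library's internals may differ):

```python
import polars as pl

# Flag a daily return r when |r - mean| / std > 4.0, computed per symbol.
flagged = (
    df.sort("symbol", "date")
      .with_columns(pl.col("close").pct_change().over("symbol").alias("ret"))
      .with_columns(
          (
              (pl.col("ret") - pl.col("ret").mean().over("symbol")).abs()
              / pl.col("ret").std().over("symbol")
              > 4.0
          ).alias("is_outlier")
      )
)
```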
### Corporate Action Adjustment

Handles splits and dividends:

```python
from src.validation import CorporateActionAdjuster
adjuster = CorporateActionAdjuster()
# Adjust historical prices for splits and dividends
adjusted_df = adjuster.adjust(df)
```
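The standard back-adjustment divides prices before a split by the split ratio (and multiplies volume by it) so that pre- and post-split bars are comparable. A hand-rolled version for a single, hypothetical 4-for-1 split, which `adjust` is meant to make unnecessary:

```python
from datetime import date

import polars as pl

split_date, ratio = date(2020, 8, 31), 4.0  # hypothetical 4-for-1 split
pre_split = pl.col("date") < split_date

manually_adjusted = df.with_columns(
    pl.when(pre_split).then(pl.col("close") / ratio).otherwise(pl.col("close")).alias("close"),
    pl.when(pre_split).then(pl.col("volume") * ratio).otherwise(pl.col("volume")).alias("volume"),
)
```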
### Quality Scoring

```python
from src.validation import QualityScorer

scorer = QualityScorer()
report = scorer.generate_report(df)
print(f"Completeness: {report['scores']['completeness']:.2%}")
print(f"Consistency: {report['scores']['consistency']:.2%}")
print(f"Overall Score: {report['scores']['overall']:.2%}")Data is stored efficiently using DuckDB (default) or Parquet files:
## Storage

Data is stored efficiently using DuckDB (default) or Parquet files.

DuckDB Backend (default):

```
data/
└── data.duckdb    # Single database file containing:
                   #   - ohlcv table (validated data)
                   #   - raw_data table (ingested data)
                   #   - snapshots table (point-in-time data)
```
Parquet Backend:

```
data/
└── parquet/
    └── ohlcv/                  # Partitioned by year/month
        └── year=2023/
            └── month=01/
                └── data.parquet
```
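With the Parquet backend, the hive-partitioned archive can also be scanned directly, for example with Polars (a sketch; the ResearchAPI below is the primary query interface):

```python
import polars as pl

# Lazily scan the partitioned archive; the year/month partition columns
# let the engine prune files before any data is read.
january = (
    pl.scan_parquet("data/parquet/ohlcv/**/*.parquet", hive_partitioning=True)
      .filter((pl.col("year") == 2023) & (pl.col("month") == 1))
      .select("symbol", "date", "close")
      .collect()
)
```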
## Research Queries

```python
# Get universe of stocks meeting criteria
universe = pipeline.query.screen(
    min_price=5.0,
    min_avg_volume=1_000_000,
    as_of_date="2023-12-31",
)
# Calculate rolling volatility
vol = pipeline.query.rolling_volatility(
    symbols=universe,
    window=20,
    annualize=True,
)
# Get correlation matrix
corr = pipeline.query.correlation_matrix(
    symbols=universe[:50],  # First 50 stocks from the screened universe
    start_date="2023-01-01",
    end_date="2023-12-31",
)
# Get summary statistics (returns, volatility, Sharpe ratio)
stats = pipeline.query.summary_statistics(
    symbols=["AAPL", "GOOGL", "MSFT"],
    start_date="2023-01-01",
    end_date="2023-12-31",
)
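# Note: the Sharpe ratio reported here is assumed to follow the usual
# daily-returns convention, mean(r) / std(r) * sqrt(252); whether a
# risk-free rate is subtracted is a detail of summary_statistics.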
# Custom SQL query (DuckDB backend only)
custom_df = pipeline.query.sql("""
    SELECT
        symbol,
        date,
        close,
        volume,
        close / LAG(close) OVER (PARTITION BY symbol ORDER BY date) - 1 AS return
    FROM ohlcv
    WHERE date >= '2023-01-01'
      AND symbol IN ('AAPL', 'GOOGL', 'MSFT')
    ORDER BY symbol, date
""")The pipeline is optimized for performance using:
- Polars for fast DataFrame operations (often 5-10x faster than pandas)
- DuckDB for analytical queries with columnar storage
- Vectorized operations for calculations (returns, volatility, etc.)
- Lazy evaluation where possible (see the sketch below)
Expected performance characteristics:
- OHLCV queries: Sub-second for most date ranges
- Returns calculation: Vectorized operations on millions of rows
- Correlation matrices: Efficient computation using numpy/polars
- Data validation: Parallel processing of validation rules
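As an illustration of the lazy, vectorized style the pipeline leans on, the following plain-Polars sketch (not the library's internal code) builds a single deferred plan for per-symbol returns and 20-day annualized volatility:

```python
import polars as pl

# Nothing is computed until .collect(); Polars can then prune columns,
# push filters down, and run the window operations vectorized.
plan = (
    pl.scan_parquet("data/parquet/ohlcv/**/*.parquet", hive_partitioning=True)
      .sort("symbol", "date")
      .with_columns(pl.col("close").pct_change().over("symbol").alias("ret"))
      .with_columns(
          (pl.col("ret").rolling_std(window_size=20).over("symbol") * 252**0.5)
          .alias("vol_20d_ann")
      )
)
result = plan.collect()
```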
## Project Structure

```
quant-data-pipeline/
├── src/
│   ├── __init__.py
│   ├── pipeline.py                  # Main pipeline orchestration
│   ├── ingestion/
│   │   ├── __init__.py
│   │   ├── base.py                  # Abstract base source
│   │   ├── yahoo.py                 # Yahoo Finance adapter
│   │   ├── polygon.py               # Polygon.io adapter
│   │   └── csv_source.py            # CSV/Parquet file source
│   ├── validation/
│   │   ├── __init__.py
│   │   ├── gaps.py                  # Gap detection and filling
│   │   ├── outliers.py              # Outlier detection
│   │   ├── corporate_actions.py     # Split/dividend adjustments
│   │   ├── quality.py               # Data quality scoring
│   │   └── pipeline.py              # Validation pipeline orchestration
│   ├── storage/
│   │   ├── __init__.py
│   │   ├── base.py                  # Storage interface
│   │   ├── duckdb_store.py          # DuckDB backend
│   │   └── parquet_store.py         # Parquet file backend
│   ├── query/
│   │   ├── __init__.py
│   │   └── research_api.py          # High-level research queries
│   └── utils/
│       ├── __init__.py
│       └── config.py                # Configuration management
├── tests/
│   ├── test_ingestion.py
│   ├── test_validation.py
│   ├── test_storage.py
│   └── test_query.py
├── config/
│   └── config.example.yaml
├── scripts/
│   └── ingest_daily.py              # Daily ingestion script
├── notebooks/
│   └── examples.py                  # Usage examples (jupytext format)
├── requirements.txt
├── pyproject.toml
├── setup.py
└── README.md
```
## Testing

```bash
# Run all tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=src --cov-report=html
# Run specific test module
pytest tests/test_validation.py -v
```

## Contributing

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## License

This project is licensed under the MIT License; see the LICENSE file for details.