Skip to content

pnascimento9596/medspend-normalize

Repository files navigation

medspend-normalize

CI Coverage Python License

Problem Statement

Hospitals use different ERP systems (Lawson, PeopleSoft, Workday) that export procurement data in incompatible formats with inconsistent vendor naming. This makes it impossible to compare pricing across facilities or detect overspend without manual normalization. medspend-normalize automates the entire pipeline — from raw ERP exports to actionable savings insights.

Key Features

  • Multi-format ETL — Ingest CSV/Excel from Lawson, PeopleSoft, and Workday with automatic schema mapping
  • Vendor normalization — Fuzzy string matching (rapidfuzz) resolves naming variations like "MEDTRONIC INC" vs "Medtronic plc"
  • UNSPSC validation — Validates and decodes Segment 42 commodity codes for medical supply categorization
  • Pricing anomaly detection — Ensemble method combining IQR, modified Z-score, and Isolation Forest
  • Facility benchmarking — Cross-facility price comparison with savings opportunity ranking
  • REST API — Full FastAPI backend with Swagger docs, pagination, and filtering
  • Source lineage tracking — Every normalized record traces back to its raw transaction, batch, and source file

Architecture

flowchart LR
    A[CSV / Excel<br>ERP Exports] -->|parse & validate| B[ETL Pipeline]
    B -->|load| C[(SQLite DB)]
    C -->|fuzzy match| D[Vendor Normalization]
    D --> C
    C --> E[Analytics Engine]
    E -->|IQR + Z-score + IF| F[Anomaly Detection]
    E -->|cross-facility| G[Benchmarking]
    F --> C
    G --> C
    C --> H[FastAPI REST API]
    H --> I[Savings Reports]
Loading

Quick Start

# Clone and install
git clone https://github.com/pnascimento9596/medspend-normalize.git
cd medspend-normalize
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Generate synthetic data and initialize
medspend generate-data
medspend init

# Run the full pipeline
medspend ingest data/facility_a_lawson.csv -s lawson -f FAC-001
medspend ingest data/facility_b_peoplesoft.csv -s peoplesoft -f FAC-002
medspend ingest data/facility_c_workday.csv -s workday -f FAC-003
medspend normalize
medspend detect-anomalies --method ensemble
medspend benchmark
medspend report

# Start the API
medspend serve
# or: uvicorn medspend_normalize.api.main:app --reload

Usage Examples

Health check:

curl http://localhost:8000/api/v1/health
{
  "status": "ok",
  "database": "connected",
  "record_counts": {
    "raw_transactions": 13000,
    "normalized_vendors": 23,
    "normalized_items": 4575,
    "anomalies": 1070
  },
  "last_ingestion": "2025-01-15T10:30:00"
}

Savings summary:

curl http://localhost:8000/api/v1/savings
{
  "anomaly_based_savings": 2426753.63,
  "benchmark_based_savings": 7006419.75,
  "total_estimated_savings": 9433173.38,
  "by_vendor": [
    {"category": "vendor", "identifier": "Medtronic plc", "estimated_savings": 498929.12, "transaction_count": 136}
  ],
  "by_facility": [
    {"category": "facility", "identifier": "FAC-002", "estimated_savings": 884458.18, "transaction_count": 363}
  ]
}

List anomalies with filtering:

curl "http://localhost:8000/api/v1/anomalies?severity_min=50&page_size=5"

Project Structure

medspend-normalize/
├── src/medspend_normalize/
│   ├── cli.py                  # Typer CLI entry point
│   ├── exceptions.py           # Custom exceptions
│   ├── config/
│   │   ├── settings.py         # pydantic-settings (env vars)
│   │   └── logging_config.py
│   ├── models/
│   │   ├── orm.py              # SQLAlchemy 2.0 models (7 tables)
│   │   ├── schemas.py          # Pydantic v2 request/response models
│   │   └── database.py         # Engine and session management
│   ├── ingestion/
│   │   ├── parsers.py          # CSV/Excel parsing with schema mapping
│   │   ├── validators.py       # Data validation rules
│   │   └── loader.py           # Bulk insert into DB
│   ├── normalization/
│   │   ├── vendor_matcher.py   # Fuzzy vendor name matching
│   │   ├── unspsc_validator.py # UNSPSC code validation
│   │   └── item_dedup.py       # Item deduplication
│   ├── analytics/
│   │   ├── anomaly_detector.py # IQR, Z-score, Isolation Forest, ensemble
│   │   ├── benchmarking.py     # Cross-facility price benchmarks
│   │   └── savings_estimator.py
│   └── api/
│       ├── main.py             # FastAPI app with middleware
│       ├── dependencies.py     # DI providers
│       └── routes/             # ingest, vendors, anomalies, benchmarks
├── config/
│   └── schema_mappings.yaml    # ERP column name mappings
├── scripts/
│   └── generate_data.py        # Synthetic data generator (Faker)
├── tests/
│   ├── unit/                   # 69 unit tests
│   └── integration/            # 27 integration + API tests
├── Dockerfile                  # Multi-stage build
├── docker-compose.yml
└── pyproject.toml

Testing

# Run all tests
pytest

# With coverage
pytest --cov=src --cov-report=term-missing

# Unit tests only
pytest tests/unit/

# Integration tests only
pytest tests/integration/

Tech Stack

Python 3.11+ | SQLAlchemy 2.0 | FastAPI | pandas | scikit-learn | rapidfuzz | Pydantic v2 | Typer | Rich | pytest

License

MIT

About

ETL pipeline and anomaly detection engine that normalizes multi-source healthcare spend data and identifies pricing outliers across hospital systems. Python, FastAPI, SQLAlchemy, scikit-learn.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors