medspend-normalize

Problem Statement

Hospitals use different ERP systems (Lawson, PeopleSoft, Workday) that export procurement data in incompatible formats with inconsistent vendor naming. This makes it impossible to compare pricing across facilities or detect overspend without manual normalization. medspend-normalize automates the entire pipeline — from raw ERP exports to actionable savings insights.

Key Features

Multi-format ETL — Ingest CSV/Excel from Lawson, PeopleSoft, and Workday with automatic schema mapping
Vendor normalization — Fuzzy string matching (rapidfuzz) resolves naming variations like "MEDTRONIC INC" vs "Medtronic plc"
UNSPSC validation — Validates and decodes Segment 42 commodity codes for medical supply categorization
Pricing anomaly detection — Ensemble method combining IQR, modified Z-score, and Isolation Forest
Facility benchmarking — Cross-facility price comparison with savings opportunity ranking
REST API — Full FastAPI backend with Swagger docs, pagination, and filtering
Source lineage tracking — Every normalized record traces back to its raw transaction, batch, and source file

Architecture

flowchart LR
    A[CSV / Excel<br>ERP Exports] -->|parse & validate| B[ETL Pipeline]
    B -->|load| C[(SQLite DB)]
    C -->|fuzzy match| D[Vendor Normalization]
    D --> C
    C --> E[Analytics Engine]
    E -->|IQR + Z-score + IF| F[Anomaly Detection]
    E -->|cross-facility| G[Benchmarking]
    F --> C
    G --> C
    C --> H[FastAPI REST API]
    H --> I[Savings Reports]

Quick Start

# Clone and install
git clone https://github.com/pnascimento9596/medspend-normalize.git
cd medspend-normalize
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Generate synthetic data and initialize
medspend generate-data
medspend init

# Run the full pipeline
medspend ingest data/facility_a_lawson.csv -s lawson -f FAC-001
medspend ingest data/facility_b_peoplesoft.csv -s peoplesoft -f FAC-002
medspend ingest data/facility_c_workday.csv -s workday -f FAC-003
medspend normalize
medspend detect-anomalies --method ensemble
medspend benchmark
medspend report

# Start the API
medspend serve
# or: uvicorn medspend_normalize.api.main:app --reload

Usage Examples

Health check:

curl http://localhost:8000/api/v1/health

{
  "status": "ok",
  "database": "connected",
  "record_counts": {
    "raw_transactions": 13000,
    "normalized_vendors": 23,
    "normalized_items": 4575,
    "anomalies": 1070
  },
  "last_ingestion": "2025-01-15T10:30:00"
}

Savings summary:

curl http://localhost:8000/api/v1/savings

{
  "anomaly_based_savings": 2426753.63,
  "benchmark_based_savings": 7006419.75,
  "total_estimated_savings": 9433173.38,
  "by_vendor": [
    {"category": "vendor", "identifier": "Medtronic plc", "estimated_savings": 498929.12, "transaction_count": 136}
  ],
  "by_facility": [
    {"category": "facility", "identifier": "FAC-002", "estimated_savings": 884458.18, "transaction_count": 363}
  ]
}

List anomalies with filtering:

curl "http://localhost:8000/api/v1/anomalies?severity_min=50&page_size=5"

Project Structure

medspend-normalize/
├── src/medspend_normalize/
│   ├── cli.py                  # Typer CLI entry point
│   ├── exceptions.py           # Custom exceptions
│   ├── config/
│   │   ├── settings.py         # pydantic-settings (env vars)
│   │   └── logging_config.py
│   ├── models/
│   │   ├── orm.py              # SQLAlchemy 2.0 models (7 tables)
│   │   ├── schemas.py          # Pydantic v2 request/response models
│   │   └── database.py         # Engine and session management
│   ├── ingestion/
│   │   ├── parsers.py          # CSV/Excel parsing with schema mapping
│   │   ├── validators.py       # Data validation rules
│   │   └── loader.py           # Bulk insert into DB
│   ├── normalization/
│   │   ├── vendor_matcher.py   # Fuzzy vendor name matching
│   │   ├── unspsc_validator.py # UNSPSC code validation
│   │   └── item_dedup.py       # Item deduplication
│   ├── analytics/
│   │   ├── anomaly_detector.py # IQR, Z-score, Isolation Forest, ensemble
│   │   ├── benchmarking.py     # Cross-facility price benchmarks
│   │   └── savings_estimator.py
│   └── api/
│       ├── main.py             # FastAPI app with middleware
│       ├── dependencies.py     # DI providers
│       └── routes/             # ingest, vendors, anomalies, benchmarks
├── config/
│   └── schema_mappings.yaml    # ERP column name mappings
├── scripts/
│   └── generate_data.py        # Synthetic data generator (Faker)
├── tests/
│   ├── unit/                   # 69 unit tests
│   └── integration/            # 27 integration + API tests
├── Dockerfile                  # Multi-stage build
├── docker-compose.yml
└── pyproject.toml

Testing

# Run all tests
pytest

# With coverage
pytest --cov=src --cov-report=term-missing

# Unit tests only
pytest tests/unit/

# Integration tests only
pytest tests/integration/

Tech Stack

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.github/workflows		.github/workflows
config		config
data		data
scripts		scripts
src/medspend_normalize		src/medspend_normalize
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

medspend-normalize

Problem Statement

Key Features

Architecture

Quick Start

Usage Examples

Project Structure

Testing

Tech Stack

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

medspend-normalize

Problem Statement

Key Features

Architecture

Quick Start

Usage Examples

Project Structure

Testing

Tech Stack

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages