2 changes: 2 additions & 0 deletions .gitignore
@@ -211,3 +211,5 @@ OMOP_CDM*.csv
*.db
.vscode/
.DS_Store
_temp/
notebooks/
8 changes: 7 additions & 1 deletion CHANGELOG.md
@@ -108,4 +108,10 @@
- literally just removing stale sqlalchemy-utils dependency

# 0.3.27
- adding minimum versions for dependabot alerts (dev deps only)
- adding minimum versions for dependabot alerts (dev deps only)

# 0.4.0
- update to handle psycopg (as opposed to psycopg2) cleanly
- overall api cleanup with the goal of being more explicit about selection of specific db backends
- general typing cleanup
- removed example notebooks until they can be cleaned up with working use-cases according to updated api
68 changes: 31 additions & 37 deletions README.md
@@ -4,28 +4,27 @@
https://github.com/AustralianCancerDataNetwork/orm-loader/actions/workflows/tests.yml
)

A lightweight, reusable foundation for building and validating SQLAlchemy-based clinical (and non-clinical) data models.
A lightweight foundation for building and validating SQLAlchemy-based data models.

This library provides general-purpose ORM infrastructure that sits below any specific data model (OMOP, PCORnet, custom CDMs, etc.), focusing on:
`orm-loader` sits below any particular schema or CDM. It gives you a small set of reusable pieces for defining tables, loading files through staging tables, and checking models against external specifications. It stays out of domain logic on purpose.

* declarative base configuration
* bulk ingestion patterns
* file-based validation & loading
* table introspection
* model-agnostic validation scaffolding
* safe, database-portable operational helpers
The library focuses on:

It intentionally contains no domain logic and no assumptions about a specific schema.
* ORM table mixins and introspection
* staged file loading
* loader and validation infrastructure
* operational helpers that work across supported backends

At the moment, the built-in backends are SQLite and PostgreSQL.

### What this library provides:

This library provides a small set of composable building blocks for defining, loading, inspecting, and validating SQLAlchemy-based data models.
All components are model-agnostic and can be selectively combined in downstream libraries.
### What this library provides

1. A minimal, opinionated ORM table base
The package is deliberately small. Most downstream projects only need a couple of these pieces.

ORMTableBase provides structural introspection utilities for SQLAlchemy-mapped tables, without imposing any domain semantics.
1. A minimal ORM table base

`ORMTableBase` provides structural utilities for mapped tables without pulling domain rules into the base layer.

It supports:
* mapper access and inspection
@@ -41,33 +40,31 @@ class MyTable(ORMTableBase, Base):
__tablename__ = "my_table"

```
This base is intended to be inherited by all ORM tables, either directly or via higher-level mixins.
You can inherit from it directly or pick it up through one of the higher-level mixins.

2. CSV-based ingestion mixins

CSVLoadableTableInterface adds opt-in CSV loading support for ORM tables using pandas, with a focus on correctness and scalability.
`CSVLoadableTableInterface` adds staged file loading to ORM tables. It can use pandas or PyArrow loaders, and on PostgreSQL it can use a fast `COPY` path when the input is clean enough.

Features include:
* staging table creation and cleanup
* chunked loading for large files
* optional per-table normalisation logic
* optional deduplication against existing database rows
* safe bulk inserts using SQLAlchemy sessions
* optional casting and deduplication before insert
* backend-specific merge behaviour
* PostgreSQL fast-path loading with ORM fallback
* backend-aware index handling during merge

```python
class MyTable(CSVLoadableTableInterface, ORMTableBase, Base):
__tablename__ = "my_table"

```

Downstream models may override:
* normalise_dataframe(...)
* dedupe_dataframe(...)
* csv_columns()
to implement table-specific ingestion policies.
The main extension points here are loader choice, column mapping, and the normal SQLAlchemy model definitions themselves. Most downstream projects do not need to override much beyond `csv_columns()` and the model schema.

3. Structured serialisation and hashing

SerialisableTableInterface adds lightweight, explicit serialisation helpers for ORM rows.
`SerialisableTableInterface` adds lightweight serialisation helpers for ORM rows.

It supports:
* conversion to dictionaries
@@ -92,7 +89,7 @@ This is useful for:

4. Model registry and validation scaffolding

The library includes model-agnostic validation infrastructure, designed to compare ORM models against external specifications.
The library includes validation infrastructure for comparing ORM models against external specifications.

This includes:
* a model registry
@@ -118,7 +115,8 @@ Validation output is available as:
* exit codes suitable for pipelines

5. Database bootstrap helpers
The library provides lightweight helpers for schema creation and bootstrapping, without imposing a migration strategy.

The library provides lightweight helpers for schema creation and bootstrapping. It does not try to replace migrations.

```python
from orm_loader.metadata import Base
@@ -127,24 +125,20 @@ from orm_loader.bootstrap import bootstrap
bootstrap(engine, create=True)
```

6. Safe bulk-loading utilities
6. Bulk-loading helpers

A reusable context manager simplifies trusted bulk ingestion workflows:
* temporarily disables foreign key checks where supported
* suppresses autoflush for performance
* ensures reliable rollback on failure
There are a few lower-level helpers for trusted bulk workflows, including backend-aware foreign key management and SQLite connection setup for heavy local loads.
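
As a rough illustration of what foreign key management for a trusted bulk load can look like on SQLite, the sketch below uses only the standard-library `sqlite3` module. The `relaxed_foreign_keys` name is hypothetical and is not part of `orm-loader`; the library's own helpers may work differently.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def relaxed_foreign_keys(conn: sqlite3.Connection) -> Iterator[sqlite3.Connection]:
    """Hypothetical helper: disable SQLite foreign key checks for one bulk load."""
    # PRAGMA foreign_keys is a no-op inside an open transaction, so commit first.
    conn.commit()
    conn.execute("PRAGMA foreign_keys = OFF")
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        # Runs after commit or rollback, so no transaction is open here.
        conn.execute("PRAGMA foreign_keys = ON")


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE child (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES parent(id))"
)

with relaxed_foreign_keys(conn):
    # During a bulk load, child rows may arrive before their parents.
    conn.execute("INSERT INTO child VALUES (1, 10)")
    conn.execute("INSERT INTO parent VALUES (10)")

fk_enabled = conn.execute("PRAGMA foreign_keys").fetchone()[0]
child_rows = conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]
```

Enforcement is switched back on in the `finally` block, so a failed load cannot leave the connection with checks disabled.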

## Summary

This library intentionally focuses on infrastructure, not semantics.
This library is meant to be the boring layer underneath downstream models:

It provides:
* reusable ORM mixins
* safe ingestion patterns
* staged ingestion patterns
* validation scaffolding
* database-portable utilities
* operational helpers

while leaving domain rules, business logic, and schema semantics to downstream libraries.
Domain rules, business logic, and schema semantics stay in the downstream project.

This makes it suitable as a shared foundation for:
* clinical data models
2 changes: 2 additions & 0 deletions TODO.txt
@@ -0,0 +1,2 @@
[] consider opt-in malformed text repair (as opposed to existing normalisation) - e.g. load_csv(..., text_repair: str | None = None)
- consider ftfy.fix_encoding()
19 changes: 6 additions & 13 deletions docs/index.md
@@ -3,25 +3,24 @@
A lightweight, reusable foundation for building and validating
SQLAlchemy-based data models.

`orm-loader` provides **infrastructure, not semantics**.
`orm-loader` provides infrastructure for SQLAlchemy-based data models. It is the shared plumbing layer, not the place where model-specific rules live.

It focuses on:

- ORM table introspection
- safe bulk ingestion patterns
- file-based loading via staging tables
- model-agnostic validation scaffolding
- database-portable operational helpers
- operational helpers for supported backends

No domain logic is included.
No schema assumptions are enforced.
It currently ships with backend implementations for SQLite and PostgreSQL.

---

## Core Concepts

- **Tables are structural** — semantics live downstream
- **Mixins define capabilities**, not behaviour contracts
- **Mixins define capabilities**
- **Protocols decouple infrastructure from implementations**
- **Ingestion is explicit and staged**

@@ -37,13 +36,7 @@ No schema assumptions are enforced.

# Design Philosophy

`orm-loader` is intentionally conservative.

It provides:

- *mechanisms*, not policies
- *capabilities*, not workflows
- *structure*, not semantics
`orm-loader` is intentionally conservative. It gives downstream libraries the machinery to load, inspect, and validate data without deciding what the data means.

The library is designed to sit **below**:

@@ -65,6 +58,7 @@ and **above**:
- No schema enforcement
- No migrations
- No concurrency guarantees
- No support yet for arbitrary database dialects

---

@@ -81,4 +75,3 @@ This allows downstream libraries to:
- replace base classes
- mock implementations
- incrementally adopt features
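
A minimal sketch of how protocol-based decoupling allows mocking, using only `typing.Protocol` from the standard library. `SupportsStaging`, `FakeTable`, and `load_rows` are hypothetical names chosen for illustration and are not taken from `orm-loader`.

```python
from typing import Iterable, Protocol, runtime_checkable


@runtime_checkable
class SupportsStaging(Protocol):
    """Hypothetical protocol: anything that can accept rows into a staging area."""

    def stage(self, rows: Iterable[dict]) -> int: ...


class FakeTable:
    """Test double: satisfies the protocol structurally, with no shared base class."""

    def __init__(self) -> None:
        self.staged: list[dict] = []

    def stage(self, rows: Iterable[dict]) -> int:
        batch = list(rows)
        self.staged.extend(batch)
        return len(batch)


def load_rows(target: SupportsStaging, rows: Iterable[dict]) -> int:
    """Infrastructure code depends only on the protocol, not on a concrete class."""
    return target.stage(rows)


fake = FakeTable()
count = load_rows(fake, [{"id": 1}, {"id": 2}])
```

Because the check is structural, a downstream project can swap in its own base class or a mock without inheriting from anything in the infrastructure layer.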

1 change: 1 addition & 0 deletions docs/loaders/context.md
@@ -25,6 +25,7 @@ on globals or implicit configuration.
| `chunksize` | Optional chunk size |
| `normalise` | Whether to cast values to ORM types |
| `dedupe` | Whether to deduplicate incoming data |
| `quote_mode` | CSV quoting mode for PostgreSQL fast-path loading |

::: orm_loader.loaders.data_classes.LoaderContext

22 changes: 10 additions & 12 deletions docs/loaders/helpers.md
@@ -1,8 +1,6 @@
# Loader Helper Utilities

This page documents low-level helper functions used by loaders.

These utilities are stateless and intentionally conservative.
This page covers the low-level functions that support the loader implementations.

---

@@ -37,36 +35,36 @@ Used by `ParquetLoader` for internal deduplication.

---

## Conservative CSV parsing
## Batch-oriented CSV parsing

### `conservative_load_parquet(...)`

Reads CSV files using PyArrow with:
Despite the name, this helper reads delimited text with PyArrow and yields batches:

- strict column inclusion
- malformed row skipping
- chunked batch iteration

This is used when loading CSVs via the Parquet pipeline.
This is used by the PyArrow-based loader path.
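
The three behaviours listed above can be sketched with the standard-library `csv` module. This is an illustration of the idea only: the real helper uses PyArrow, and `iter_batches` is a hypothetical name, not a function from `orm-loader`.

```python
import csv
import io
from typing import Iterator, TextIO


def iter_batches(
    handle: TextIO, columns: list[str], batch_size: int, delimiter: str = ","
) -> Iterator[list[dict]]:
    """Yield batches of row dicts, keeping only `columns` and skipping malformed rows."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield
    # Strict column inclusion: every requested column must exist in the file.
    missing = [c for c in columns if c not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    positions = [header.index(c) for c in columns]
    batch: list[dict] = []
    for row in reader:
        if len(row) != len(header):
            continue  # malformed row: wrong number of fields
        batch.append({c: row[i] for c, i in zip(columns, positions)})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


text = "id,name,extra\n1,a,x\n2,b\n3,c,z\n4,d,w\n"
batches = list(iter_batches(io.StringIO(text), ["id", "name"], batch_size=2))
```

Here the row `2,b` is dropped because it has two fields instead of three, and the `extra` column never reaches the caller.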

---

## PostgreSQL fast-path loading

### `quick_load_pg(...)`

Loads CSV files into PostgreSQL staging tables using `COPY`.
Loads CSV files into a PostgreSQL staging table using `COPY`.

### Characteristics

- Extremely fast
- Bypasses ORM
- Sensitive to data quality issues
- Fast
- Bypasses ORM row construction
- Works best on clean input

### Failure handling

- Errors trigger rollback
- Loader falls back to ORM-based loading
- No partial silent loads
- `CSVLoadableTableInterface` falls back to ORM-based loading
- Failures are noisy on purpose

This helper is only used when explicitly supported by the database.
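
The failure handling described above follows a general try-fast-then-fall-back pattern. The sketch below shows that pattern with plain callables; `load_with_fallback`, `strict_fast`, and `tolerant_slow` are hypothetical stand-ins, and nothing here calls `quick_load_pg` or any PostgreSQL API.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger(__name__)


def load_with_fallback(
    rows: Sequence[dict],
    fast_path: Callable[[Sequence[dict]], int],
    slow_path: Callable[[Sequence[dict]], int],
    rollback: Callable[[], None],
) -> tuple[int, str]:
    """Try the fast path; on any error roll back, log loudly, and use the slow path."""
    try:
        return fast_path(rows), "fast"
    except Exception as exc:
        rollback()  # undo whatever the fast path partially wrote
        logger.warning("fast path failed, falling back: %s", exc)
        return slow_path(rows), "fallback"


def strict_fast(rows: Sequence[dict]) -> int:
    # Stand-in for a bulk path that rejects the whole file on one bad value.
    if any(r.get("value") is None for r in rows):
        raise ValueError("null in required column")
    return len(rows)


def tolerant_slow(rows: Sequence[dict]) -> int:
    # Stand-in for a row-by-row path; skipping bad rows is illustrative only.
    return sum(1 for r in rows if r.get("value") is not None)


events: list[str] = []
result = load_with_fallback(
    [{"value": 1}, {"value": None}],
    strict_fast,
    tolerant_slow,
    lambda: events.append("rollback"),
)
```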
23 changes: 10 additions & 13 deletions docs/loaders/index.md
@@ -1,14 +1,13 @@
# Loaders

The `orm_loader.loaders` module provides **conservative, schema-aware file
ingestion infrastructure** for loading external data into ORM-backed
staging tables.
The `orm_loader.loaders` module provides conservative, schema-aware file
loading into ORM-backed staging tables.

This subsystem is designed to handle:

- untrusted or messy source files
- large datasets requiring chunked processing
- incremental and repeatable loads
- repeatable staged loads
- dialect-specific optimisations (e.g. PostgreSQL COPY)
- explicit, inspectable failure modes

@@ -23,7 +22,7 @@ they do not embed domain rules or business semantics.

[`LoaderContext`](context.md)

A `LoaderContext` object carries all state required to load a single file:
A `LoaderContext` object carries the state required to load one file:

- target ORM table
- database session
@@ -44,8 +43,7 @@ All loaders implement a common interface:
- `orm_file_load(ctx)` — orchestrates file ingestion
- `dedupe(data, ctx)` — defines deduplication semantics

Concrete implementations differ only in **how data is read and processed**,
not in how it is staged.
Concrete implementations mainly differ in how they read and transform incoming data.

---

@@ -54,11 +52,10 @@ not in how it is staged.
Loaders always write to **staging tables**, never directly to production
tables.

This allows:
This gives you:

- safe rollback
- repeatable merges
- database-level deduplication
- bulk loading optimisations

Final merge semantics are handled by the table mixins, not by loaders.
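
To make the staging idea concrete, here is a small sketch of a stage-then-merge round trip using the standard-library `sqlite3` module. The table names and the `merge_from_staging` function are hypothetical; the actual merge in `orm-loader` lives in the table mixins and may use different SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE person_staging (id INTEGER, name TEXT)")
conn.execute("INSERT INTO person VALUES (1, 'existing')")
conn.commit()


def merge_from_staging(conn: sqlite3.Connection, incoming: list[tuple[int, str]]) -> int:
    """Hypothetical merge: stage rows, copy the new ones across, then clear staging."""
    before = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
    try:
        conn.executemany("INSERT INTO person_staging VALUES (?, ?)", incoming)
        # Rows whose primary key already exists in the target are skipped.
        conn.execute(
            "INSERT OR IGNORE INTO person (id, name) "
            "SELECT id, name FROM person_staging"
        )
        conn.execute("DELETE FROM person_staging")
        conn.commit()
    except Exception:
        # The target table is untouched if anything fails before the commit.
        conn.rollback()
        raise
    after = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
    return after - before


merged = merge_from_staging(conn, [(1, "dup"), (2, "new"), (3, "new")])
names = dict(conn.execute("SELECT id, name FROM person ORDER BY id").fetchall())
```

Running the same merge again adds nothing, which is what makes the load repeatable.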
@@ -69,8 +66,8 @@ Final merge semantics are handled by the table mixins, not by loaders.

| Loader | Use case |
|------|----------|
| `PandasLoader` | Flexible, debuggable CSV ingestion |
| `ParquetLoader` | High-volume, columnar ingestion |
| `PandasLoader` | Flexible CSV and TSV ingestion |
| `ParquetLoader` | Columnar or batch-oriented ingestion |

Both loaders share the same lifecycle and guarantees.

@@ -81,11 +78,11 @@ Both loaders share the same lifecycle and guarantees.
1. Detect file format and encoding
2. Read data in chunks or batches
3. Optionally normalise to ORM column types
4. Optionally deduplicate (internal and/or database-level)
4. Optionally deduplicate within the incoming data
5. Insert into staging table
6. Return row count

No implicit commits or merges occur at this layer.
Final merge behaviour belongs to the table mixins and backend layer, not to the loader itself.
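
Steps 2 to 6 of the lifecycle can be sketched with the standard library alone. This assumes a two-column file and a pre-created `staging` table, skips step 1 (format and encoding detection), and uses a hypothetical `load_to_staging` function rather than any `orm-loader` API.

```python
import csv
import io
import sqlite3
from itertools import islice
from typing import TextIO


def load_to_staging(handle: TextIO, conn: sqlite3.Connection, chunk_size: int = 2) -> int:
    """Read in chunks, normalise, dedupe within the input, insert, return the row count."""
    reader = csv.DictReader(handle)
    seen: set[int] = set()
    total = 0
    while True:
        chunk = list(islice(reader, chunk_size))  # step 2: read a chunk
        if not chunk:
            break
        # Step 3: cast values to the target column types.
        rows = [(int(r["id"]), r["name"].strip()) for r in chunk]
        # Step 4: dedupe on id across everything read so far from this file.
        fresh = []
        for row in rows:
            if row[0] not in seen:
                seen.add(row[0])
                fresh.append(row)
        # Step 5: insert into staging; no commit is issued at this layer.
        conn.executemany("INSERT INTO staging (id, name) VALUES (?, ?)", fresh)
        total += len(fresh)
    return total  # step 6


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER, name TEXT)")
source = io.StringIO("id,name\n1, a\n2,b\n1,a\n3,c\n")
loaded = load_to_staging(source, conn)
```

The function leaves the transaction open, so the caller decides when to commit or merge.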

---
