AustralianCancerDataNetwork · gkennos · May 14, 2026
diff --git a/.gitignore b/.gitignore
@@ -211,3 +211,5 @@ OMOP_CDM*.csv
 *.db
 .vscode/
 .DS_Store
+_temp/
+notebooks/
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -108,4 +108,10 @@
 - literally just removing stale sqlalchemy-utils dependency
 
 # 0.3.27
-- adding minimum versions for dependabot alerts (dev deps only)
+- adding minimum versions for dependabot alerts (dev deps only)
+
+# 0.4.0
+- update to handle psycopg (as opposed to psycopg2) cleanly
+- overall api cleanup with the goal of being more explicit about selection of specific db backends
+- general typing cleanup
+- removed example notebooks until they can be cleaned up with working use-cases according to updated api
diff --git a/README.md b/README.md
@@ -4,28 +4,27 @@
 https://github.com/AustralianCancerDataNetwork/orm-loader/actions/workflows/tests.yml
 )
 
-A lightweight, reusable foundation for building and validating SQLAlchemy-based clinical (and non-clinical) data models.
+A lightweight foundation for building and validating SQLAlchemy-based data models.
 
-This library provides general-purpose ORM infrastructure that sits below any specific data model (OMOP, PCORnet, custom CDMs, etc.), focusing on:
+`orm-loader` sits below any particular schema or CDM. It gives you a small set of reusable pieces for defining tables, loading files through staging tables, and checking models against external specifications. It stays out of domain logic on purpose.
 
-* declarative base configuration
-* bulk ingestion patterns
-* file-based validation & loading
-* table introspection
-* model-agnostic validation scaffolding
-* safe, database-portable operational helpers
+The library focuses on:
 
-It intentionally contains no domain logic and no assumptions about a specific schema.
+* ORM table mixins and introspection
+* staged file loading
+* loader and validation infrastructure
+* operational helpers that work across supported backends
 
+At the moment, the built-in backends are SQLite and PostgreSQL.
 
-### What this library provides:
 
-This library provides a small set of composable building blocks for defining, loading, inspecting, and validating SQLAlchemy-based data models.
-All components are model-agnostic and can be selectively combined in downstream libraries.
+### What this library provides
 
-1. A minimal, opinionated ORM table base
+The package is deliberately small. Most downstream projects only need a couple of these pieces.
 
-ORMTableBase provides structural introspection utilities for SQLAlchemy-mapped tables, without imposing any domain semantics.
+1. A minimal ORM table base
+
+`ORMTableBase` provides structural utilities for mapped tables without pulling domain rules into the base layer.
 
 It supports:
 * mapper access and inspection
@@ -41,33 +40,31 @@ class MyTable(ORMTableBase, Base):
     __tablename__ = "my_table"
 
 ```
-This base is intended to be inherited by all ORM tables, either directly or via higher-level mixins.
+You can inherit from it directly or pick it up through one of the higher-level mixins.
 
 2. CSV-based ingestion mixins
 
-CSVLoadableTableInterface adds opt-in CSV loading support for ORM tables using pandas, with a focus on correctness and scalability.
+`CSVLoadableTableInterface` adds staged file loading to ORM tables. It can use pandas or PyArrow loaders, and on PostgreSQL it can use a fast `COPY` path when the input is clean enough.
 
 Features include:
+* staging table creation and cleanup
 * chunked loading for large files
-* optional per-table normalisation logic
-* optional deduplication against existing database rows
-* safe bulk inserts using SQLAlchemy sessions
+* optional casting and deduplication before insert
+* backend-specific merge behaviour
+* PostgreSQL fast-path loading with ORM fallback
+* backend-aware index handling during merge
 
 ```python
 class MyTable(CSVLoadableTableInterface, ORMTableBase, Base):
     __tablename__ = "my_table"
 
 ```
 
-Downstream models may override:
-* normalise_dataframe(...)
-* dedupe_dataframe(...)
-* csv_columns()
-to implement table-specific ingestion policies.
+The main extension points here are loader choice, column mapping, and the normal SQLAlchemy model definitions themselves. Most downstream projects do not need to override much beyond `csv_columns()` and the model schema.
 
 3. Structured serialisation and hashing
 
-SerialisableTableInterface adds lightweight, explicit serialisation helpers for ORM rows.
+`SerialisableTableInterface` adds lightweight serialisation helpers for ORM rows.
 
 It supports:
 * conversion to dictionaries
@@ -92,7 +89,7 @@ This is useful for:
 
 4. Model registry and validation scaffolding
 
-The library includes model-agnostic validation infrastructure, designed to compare ORM models against external specifications.
+The library includes validation infrastructure for comparing ORM models against external specifications.
 
 This includes:
 * a model registry
@@ -118,7 +115,8 @@ Validation output is available as:
 * exit codes suitable for pipelines
 
 5. Database bootstrap helpers
-The library provides lightweight helpers for schema creation and bootstrapping, without imposing a migration strategy.
+
+The library provides lightweight helpers for schema creation and bootstrapping. It does not try to replace migrations.
 
 ```python
 from orm_loader.metadata import Base
@@ -127,24 +125,20 @@ from orm_loader.bootstrap import bootstrap
 bootstrap(engine, create=True)
 ```
 
-6. Safe bulk-loading utilities
+6. Bulk-loading helpers
 
-A reusable context manager simplifies trusted bulk ingestion workflows:
-* temporarily disables foreign key checks where supported
-* suppresses autoflush for performance
-* ensures reliable rollback on failure
+There are a few lower-level helpers for trusted bulk workflows, including backend-aware foreign key management and SQLite connection setup for heavy local loads.
 
 ## Summary
 
-This library intentionally focuses on infrastructure, not semantics.
+This library is meant to be the boring layer underneath downstream models:
 
-It provides:
 * reusable ORM mixins
-* safe ingestion patterns
+* staged ingestion patterns
 * validation scaffolding
-* database-portable utilities
+* operational helpers
 
-while leaving domain rules, business logic, and schema semantics to downstream libraries.
+Domain rules, business logic, and schema semantics stay in the downstream project.
 
 This makes it suitable as a shared foundation for:
 * clinical data models

diff --git a/TODO.txt b/TODO.txt
@@ -0,0 +1,2 @@
+[] consider opt-in malformed text repair (as opposed to existing normalisation) - e.g. load_csv(..., text_repair: str | None = None)
+- consider ftfy.fix_encoding()
diff --git a/docs/index.md b/docs/index.md
@@ -3,25 +3,24 @@
 A lightweight, reusable foundation for building and validating
 SQLAlchemy-based data models.
 
-`orm-loader` provides **infrastructure, not semantics**.
+`orm-loader` provides infrastructure for SQLAlchemy-based data models. It is the shared plumbing layer, not the place where model-specific rules live.
 
 It focuses on:
 
 - ORM table introspection
 - safe bulk ingestion patterns
 - file-based loading via staging tables
 - model-agnostic validation scaffolding
-- database-portable operational helpers
+- operational helpers for supported backends
 
-No domain logic is included.
-No schema assumptions are enforced.
+It currently ships with backend implementations for SQLite and PostgreSQL.
 
 ---
 
 ## Core Concepts
 
 - **Tables are structural** — semantics live downstream
-- **Mixins define capabilities**, not behaviour contracts
+- **Mixins define capabilities**
 - **Protocols decouple infrastructure from implementations**
 - **Ingestion is explicit and staged**
 
@@ -37,13 +36,7 @@ No schema assumptions are enforced.
 
 # Design Philosophy
 
-`orm-loader` is intentionally conservative.
-
-It provides:
-
-- *mechanisms*, not policies
-- *capabilities*, not workflows
-- *structure*, not semantics
+`orm-loader` is intentionally conservative. It gives downstream libraries the machinery to load, inspect, and validate data without deciding what the data means.
 
 The library is designed to sit **below**:
 
@@ -65,6 +58,7 @@ and **above**:
 - No schema enforcement
 - No migrations
 - No concurrency guarantees
+- No support yet for arbitrary database dialects
 
 ---
 
@@ -81,4 +75,3 @@ This allows downstream libraries to:
 - replace base classes
 - mock implementations
 - incrementally adopt features
-
diff --git a/docs/loaders/context.md b/docs/loaders/context.md
@@ -25,6 +25,7 @@ on globals or implicit configuration.
 | `chunksize` | Optional chunk size |
 | `normalise` | Whether to cast values to ORM types |
 | `dedupe` | Whether to deduplicate incoming data |
+| `quote_mode` | CSV quoting mode for PostgreSQL fast-path loading |
 
 ::: orm_loader.loaders.data_classes.LoaderContext
 

diff --git a/docs/loaders/helpers.md b/docs/loaders/helpers.md
@@ -1,8 +1,6 @@
 # Loader Helper Utilities
 
-This page documents low-level helper functions used by loaders.
-
-These utilities are stateless and intentionally conservative.
+This page covers the low-level functions that support the loader implementations.
 
 ---
 
@@ -37,36 +35,36 @@ Used by `ParquetLoader` for internal deduplication.
 
 ---
 
-## Conservative CSV parsing
+## Batch-oriented CSV parsing
 
 ### `conservative_load_parquet(...)`
 
-Reads CSV files using PyArrow with:
+Despite the name, this helper reads delimited text with PyArrow and yields batches:
 
 - strict column inclusion
 - malformed row skipping
 - chunked batch iteration
 
-This is used when loading CSVs via the Parquet pipeline.
+This is used by the PyArrow-based loader path.
 
 ---
 
 ## PostgreSQL fast-path loading
 
 ### `quick_load_pg(...)`
 
-Loads CSV files into PostgreSQL staging tables using `COPY`.
+Loads CSV files into a PostgreSQL staging table using `COPY`.
 
 ### Characteristics
 
-- Extremely fast
-- Bypasses ORM
-- Sensitive to data quality issues
+- Fast
+- Bypasses ORM row construction
+- Works best on clean input
 
 ### Failure handling
 
 - Errors trigger rollback
-- Loader falls back to ORM-based loading
-- No partial silent loads
+- `CSVLoadableTableInterface` falls back to ORM-based loading
+- Failures are noisy on purpose
 
 This helper is only used when explicitly supported by the database.
diff --git a/docs/loaders/index.md b/docs/loaders/index.md
@@ -1,14 +1,13 @@
 # Loaders
 
-The `orm_loader.loaders` module provides **conservative, schema-aware file
-ingestion infrastructure** for loading external data into ORM-backed
-staging tables.
+The `orm_loader.loaders` module provides conservative, schema-aware file
+loading into ORM-backed staging tables.
 
 This subsystem is designed to handle:
 
 - untrusted or messy source files
 - large datasets requiring chunked processing
-- incremental and repeatable loads
+- repeatable staged loads
 - dialect-specific optimisations (e.g. PostgreSQL COPY)
 - explicit, inspectable failure modes
 
@@ -23,7 +22,7 @@ they do not embed domain rules or business semantics.
 
 [`LoaderContext`](context.md)
 
-A `LoaderContext` object carries all state required to load a single file:
+A `LoaderContext` object carries the state required to load one file:
 
 - target ORM table
 - database session
@@ -44,8 +43,7 @@ All loaders implement a common interface:
 - `orm_file_load(ctx)` — orchestrates file ingestion
 - `dedupe(data, ctx)` — defines deduplication semantics
 
-Concrete implementations differ only in **how data is read and processed**,
-not in how it is staged.
+Concrete implementations mainly differ in how they read and transform incoming data.
 
 ---
 
@@ -54,11 +52,10 @@ not in how it is staged.
 Loaders always write to **staging tables**, never directly to production
 tables.
 
-This allows:
+This gives you:
 
 - safe rollback
 - repeatable merges
-- database-level deduplication
 - bulk loading optimisations
 
 Final merge semantics are handled by the table mixins, not by loaders.
@@ -69,8 +66,8 @@ Final merge semantics are handled by the table mixins, not by loaders.
 
 | Loader | Use case |
 |------|----------|
-| `PandasLoader` | Flexible, debuggable CSV ingestion |
-| `ParquetLoader` | High-volume, columnar ingestion |
+| `PandasLoader` | Flexible CSV and TSV ingestion |
+| `ParquetLoader` | Columnar or batch-oriented ingestion |
 
 Both loaders share the same lifecycle and guarantees.
 
@@ -81,11 +78,11 @@ Both loaders share the same lifecycle and guarantees.
 1. Detect file format and encoding  
 2. Read data in chunks or batches  
 3. Optionally normalise to ORM column types  
-4. Optionally deduplicate (internal and/or database-level)  
+4. Optionally deduplicate within the incoming data  
 5. Insert into staging table  
 6. Return row count  
 
-No implicit commits or merges occur at this layer.
+Final merge behaviour belongs to the table mixins and backend layer, not to the loader itself.
 
 ---
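
The loader lifecycle described in the `docs/loaders/index.md` hunk above (read in chunks, optionally deduplicate within the incoming data, insert into a staging table, return a row count) can be sketched roughly as follows. This is a minimal illustration using only the standard library (`csv`, `sqlite3`), not `orm-loader`'s implementation: the function name `staged_load`, the table `person_staging`, and the rule of deduplicating on the first column are all hypothetical, and the optional normalisation step is left out.

```python
import csv
import io
import sqlite3
from itertools import islice


def staged_load(conn, source, staging_table, columns, chunksize=2):
    """Hypothetical sketch: read rows in chunks, dedupe within the incoming
    data on the first column, insert into a staging table, return row count."""
    reader = csv.DictReader(source)
    placeholders = ", ".join("?" for _ in columns)
    insert_sql = (
        f"INSERT INTO {staging_table} ({', '.join(columns)}) "
        f"VALUES ({placeholders})"
    )
    seen = set()   # keys already staged during this load
    inserted = 0
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            break
        rows = []
        for record in chunk:
            key = record[columns[0]]
            if key in seen:
                continue  # duplicate within the incoming file
            seen.add(key)
            rows.append(tuple(record[c] for c in columns))
        conn.executemany(insert_sql, rows)
        inserted += len(rows)
    # No commit and no merge here: the caller decides what happens next.
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person_staging (person_id TEXT, name TEXT)")
data = io.StringIO("person_id,name\n1,Ada\n2,Grace\n1,Ada\n3,Edsger\n")
row_count = staged_load(conn, data, "person_staging", ["person_id", "name"])
```

Keeping the commit and the merge out of the function mirrors the split the docs describe: the loader only fills the staging table, and whatever sits above it owns the final merge.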