AustralianCancerDataNetwork · nicoloesch · May 8, 2026 · May 11, 2026 · May 13, 2026 · May 13, 2026
diff --git a/.gitignore b/.gitignore
@@ -13,4 +13,7 @@ wheels/
 docs/backup/
 docs/omop_relationships.csv
 .vscode/
-.env
+.env
+resources/
+*.DS_Store
+logging/
diff --git a/README.md b/README.md
@@ -1,21 +1,12 @@
-# Architecture
-
-This library provides a lightweight, query-time knowledge-graph layer over an OMOP vocabulary database, with explicit separation between:
-
-* graph access (nodes, edges, predicates),
-* graph algorithms (traversal, pathfinding),
-* path scoring and explanation, and
-* presentation / inspection utilities.
-
 # omop-graph
 
 **omop-graph** is a lightweight, opinionated knowledge-graph traversal and path-analysis library built on top of the OMOP vocabulary model.
 
 It provides:
 - a stable **KnowledgeGraph façade** over OMOP concepts and relationships
 - flexible **graph traversal** (forward, backward, bidirectional)
-- **path discovery and ranking** with transparent scoring
-- **traceable explanations** of why one path is preferred over another
+- **path discovery** with transparent scoring
+- **traceable explanations** of traversal decisions
 - multiple **rendering backends** (text, HTML, Mermaid)
 
 The library is designed for:
@@ -31,105 +22,95 @@ The library is designed for:
 pip install omop-graph
 ```
 
+With embedding support (sqlite-vec backend, zero config):
+
+```bash
+pip install "omop-graph[emb]"
+```
+
+For larger deployments, install the `[pgvector]` or `[faiss-cpu]` extra instead of, or in addition to, `[emb]`.
+Full setup is covered in the [omop-emb documentation](https://australiancancerdatanetwork.github.io/omop-emb/).
+
+---
+
 ## Core Concepts
 
 ### KnowledgeGraph
 
-KnowledgeGraph is the main entry point. It wraps an existing SQLAlchemy session connected to an OMOP vocabulary schema. kg-core assumes OMOP semantics and tables.
+`KnowledgeGraph` is the main entry point. It wraps a SQLAlchemy `Engine` connected to an OMOP vocabulary schema and provides a high-level Pythonic API over the relational tables.
 
 ```python
+from sqlalchemy import create_engine
 from omop_graph.graph.kg import KnowledgeGraph
-```
 
-### Nodes and Edges
+engine = create_engine("postgresql://user:pass@localhost/omop")
+kg = KnowledgeGraph(engine)
 
-Nodes are OMOP Concepts; Edges are OMOP Concept_Relationships
+# Look up a concept by label
+match_group = kg.label_lookup("Atrial Fibrillation", fuzzy=False)
+concept = match_group.best_match
+print(f"ID: {concept.concept_id}, Name: {concept.matched_label}")
 
-Relationships are classified into semantic kinds:
+# Traverse the hierarchy
+parents = kg.parents(concept.concept_id)
+```
+
+### Nodes and Edges
 
-* ONTOLOGICAL
-* MAPPING
-* ATTRIBUTE
-* VERSIONING
-* METADATA
+Nodes are OMOP Concepts; Edges are OMOP Concept_Relationships.
 
-This classification drives traversal and scoring.
+Relationships are pre-classified into semantic kinds (`ClassIDEnum`):
 
-### Traversal, Paths and Scoring
+- `HIERARCHY` — parent/child ontological relationships
+- `IDENTITY` — mapping to standard concepts
+- `COMPOSITION` — part-of relationships
+- `ASSOCIATION` — lateral clinical associations
+- `ATTRIBUTE` — concept attribute relationships
 
-You can:
+This classification drives traversal filtering and scoring.
 
-* expand neighbourhoods
-* extract subgraphs
-* trace traversal decisions
-* control which relationship kinds are followed
-* discover multiple candidate paths between concepts and rank them
-* render simple HTML cards for easy interactive exploration
+### Traversal and Paths
 
 ```python
 from omop_graph.graph.paths import find_shortest_paths
 from omop_graph.extensions.omop_alchemy import ClassIDEnum
 
-ingredient = kg.concept_id_by_code("RxNorm", "6809") # Metformin
-drug = kg.concept_id_by_code("RxNorm", "860975") # Metformin 500 MG Oral Tablet
-
-kg.concept_view(drug) # ConceptView(id=40163924, RxNorm:860975, name='24 HR metformin hydrochloride 500 MG Extended Release Oral Tablet')
-kg.concept_view(ingredient) # ConceptView(id=1503297, RxNorm:6809, name='metformin')
+ingredient = kg.concept_id_by_code("RxNorm", "6809")    # Metformin
+drug = kg.concept_id_by_code("RxNorm", "860975")         # 24 HR metformin hydrochloride 500 MG Extended Release Oral Tablet
 
 paths, trace = find_shortest_paths(
     kg,
     source=drug,
     target=ingredient,
-    predicate_kinds={
-        ClassIDEnum.HIERARCHICAL,
-        ClassIDEnum.IDENTITY,
-    },
+    predicate_kinds=frozenset({ClassIDEnum.HIERARCHY, ClassIDEnum.IDENTITY}),
     max_depth=6,
     traced=True,
 )
-
-ranked = rank_paths(kg, paths)
-
-```
-
-### 
-
-```python
-paths = kg.find_shortest_paths(
-    source=a,
-    target=b,
-    max_depth=6,
-)
-ranked = kg.rank_paths(paths)
 ```
 
 ### Rendering
 
-Outputs can be rendered as:
+Outputs can be rendered as plain text, HTML (Jupyter), or Mermaid diagrams. Rendering auto-detects the environment.
 
-* plain text (CLI / logs)
-* HTML (Jupyter)
-* Mermaid diagrams
-
-Rendering auto-detects the environment.
-
-```python 
+```python
 from IPython.display import HTML, display
 from omop_graph.render import render_trace
 
 display(HTML(render_trace(kg, trace)))
 ```
 
+---
+
 ## Project Structure
-```graphql
 
+```
 omop_graph/
 ├── graph/          # graph logic, traversal, paths, scoring
 ├── render/         # HTML / text / Mermaid renderers
-├── reasoning/      # Ontology traversal methods for specific reasoner tasks
-├────── resolvers/  # Resolve labels for exact / fuzzy / synonym matches - TODO: embedding matches
-├────── phenotypes/ # Set operations to build efficient hierarchical groupings for reasoning
+├── reasoning/      # ontology traversal methods for specific reasoner tasks
+│   ├── resolvers/  # resolve labels via exact / fuzzy / full-text / synonym search
+│   └── phenotypes/ # set operations for hierarchical groupings
+├── oaklib_interface/  # OAK-compliant adapter
 ├── api.py          # stable public API surface
 └── db/             # session helpers
-
-```
+```
diff --git a/docs/predicate_classification.csv b/config/predicate_classification.csv
rename from docs/predicate_classification.csv
rename to config/predicate_classification.csv
diff --git a/docs/predicate_mapping.csv b/config/predicate_mapping.csv
rename from docs/predicate_mapping.csv
rename to config/predicate_mapping.csv
diff --git a/docker-compose.yaml b/docker-compose.yaml
@@ -0,0 +1,28 @@
+services:
+  omop-cdm-db:
+    image: postgres:16-alpine
+    restart: always
+    env_file: .env
+    environment:
+      - POSTGRES_USER=${OMOP_CDM_DB_USER:-omop}
+      - POSTGRES_PASSWORD=${OMOP_CDM_DB_PASSWORD:-omop}
+      - POSTGRES_DB=${OMOP_CDM_DB_NAME:-omop}
+      - PGDATA=/var/lib/postgresql/data/pgdata
+    volumes:
+      - db_data:/var/lib/postgresql/data
+    networks:
+      - omop-net
+    healthcheck:
+      test: ["CMD-SHELL", "pg_isready -U ${OMOP_CDM_DB_USER:-omop} -d ${OMOP_CDM_DB_NAME:-omop}"]
+      interval: 5s
+      timeout: 5s
+      retries: 5
+    ports:
+      - "5432:5432"
+
+networks:
+  omop-net:
+    name: omop-net
+
+volumes:
+  db_data:
diff --git a/docs/graph/edges.md b/docs/graph/edges.md
@@ -16,16 +16,16 @@ To allow reproduction and evaluation of this approach, we provide clear guidelin
 
 ??? "Expand to see the grouping classification of predicates"
 
-    {{ to_grouped_table('docs/predicate_classification.csv', [0, 1], [0, 1, 2, 3, 4], [0, 1],) }}
+    {{ to_grouped_table('config/predicate_classification.csv', [0, 1], [0, 1, 2, 3, 4], [0, 1],) }}
 
 ## Predicate Mappings
-Following the predicate classification guidelines of the previous seciton, we calssified the following predicates into their respective classification groups.
+Following the classification guidelines in the previous section, we assigned each of the following predicates to its classification group.
 
 !!! warning
 
     This classification is currently still under development and most likely may change with increased feedback from clinicians. The respective interface to store these classifications in the OMOP CDM has been prepared and we are in talks to potentially include this classification eventually in the official OMOP CDM.
 
 ??? "Expand to see the classification of all edge connections"
 
-    {{ to_grouped_table('docs/predicate_mapping.csv', [0, 1], [0, 1, 2, 3], [0, 1], {"r_id": "relationship_id", "r_name": "relationship_name"}) }}
+    {{ to_grouped_table('config/predicate_mapping.csv', [0, 1], [0, 1, 2, 3], [0, 1], {"r_id": "relationship_id", "r_name": "relationship_name"}) }}
 
diff --git a/docs/graph/kg.md b/docs/graph/kg.md
@@ -27,19 +27,14 @@ While the OMOP CDM is stored in a Relational Database Management System (RDBMS),
 
 ### Basic Usage
 
-The `KnowledgeGraph` can be used standalone after connecting to the OMOP CDM database on disk.
+The `KnowledgeGraph` can be used standalone after connecting to the OMOP CDM database.
 
 ```python
 from sqlalchemy import create_engine
-from sqlalchemy.orm import sessionmaker
 from omop_graph.graph.kg import KnowledgeGraph
 
-# Setup your SQLAlchemy session
 engine = create_engine("postgresql://user:pass@localhost/omop")
-SessionLocal = sessionmaker(bind=engine)
-
-# Initialize the Virtual Knowledge Graph
-kg = KnowledgeGraph(SessionLocal)
+kg = KnowledgeGraph(engine)
 
 # Lookup a concept by its label
 match_group = kg.label_lookup("Atrial Fibrillation", fuzzy=False)
@@ -59,41 +54,53 @@ print(f"Parent IDs: {parents}")
 To enable semantic similarity and RAG-based retrieval, pass a `KnowledgeGraphEmbeddingConfiguration` when initialising the graph.
 This requires the optional `omop-emb` package — see the [installation guide](../usage/installation.md#embedding-rag).
 
+!!! info "omop-emb documentation"
+    `omop-emb` manages all embedding storage, backends, and retrieval. Full documentation — including backend setup, CLI reference, FAISS sidecar, and configuration — is available at [australiancancerdatanetwork.github.io/omop-emb](https://australiancancerdatanetwork.github.io/omop-emb/).
+
 #### Read-only (pre-computed embeddings already in the DB)
 
 Use this when embeddings have already been indexed and you only need retrieval:
 
 ```python
+from sqlalchemy import create_engine
 from omop_graph.graph.kg import KnowledgeGraph, KnowledgeGraphEmbeddingConfiguration
-from omop_emb import BackendType, ProviderType
+from omop_emb.config import BackendType, MetricType, ProviderType
+
+engine = create_engine("postgresql://user:pass@localhost/omop")
 
 emb_config = KnowledgeGraphEmbeddingConfiguration(
-    backend_type=BackendType.FAISS,
+    backend_type=BackendType.PGVECTOR,      # or BackendType.SQLITEVEC
     provider_type=ProviderType.OLLAMA,
-    canonical_model_name="text-embedding-3-small:0.6b",
-    base_storage_dir="/data/embeddings",
+    model_name="nomic-embed-text:v1.5",     # must match the name used at ingestion time
+    metric_type=MetricType.COSINE,
 )
-kg = KnowledgeGraph(SessionLocal, emb_config=emb_config)
+kg = KnowledgeGraph(engine, emb_config=emb_config)
 ```
 
+The backend is resolved from `backend_type` or, as a fallback, from the `OMOP_EMB_BACKEND` environment variable.
+See the [omop-emb configuration reference](https://australiancancerdatanetwork.github.io/omop-emb/usage/configuration/) for all connection variables.
+
 #### Write-capable (generate and store embeddings at runtime)
 
-Provide an `EmbeddingClient` to enable both reading and writing embeddings:
+Provide an `EmbeddingClient` to enable both reading and writing embeddings. The `provider_type` and `model_name`
+are derived automatically from the client:
 
 ```python
 from omop_emb import EmbeddingClient
-from omop_emb import BackendType, ProviderType
+from omop_emb.config import BackendType, MetricType
 
-client = EmbeddingClient(...)  # configured for your provider
+client = EmbeddingClient(
+    model="nomic-embed-text:v1.5",
+    api_base="http://ollama:11434/v1",
+)
 
 emb_config = KnowledgeGraphEmbeddingConfiguration(
-    backend_type=BackendType.FAISS,
-    base_storage_dir="/data/embeddings",
+    backend_type=BackendType.PGVECTOR,
+    metric_type=MetricType.COSINE,
     client=client,
 )
-kg = KnowledgeGraph(SessionLocal, emb_config=emb_config)
+kg = KnowledgeGraph(engine, emb_config=emb_config)
 ```
-The `provider_type` will be automatically determined from the `client`.
 
 #### Fallback embedding calculation
 
@@ -107,12 +114,12 @@ for any missing concepts on-the-fly during a similarity call.
 
 ```python
 emb_config = KnowledgeGraphEmbeddingConfiguration(
-    backend_type="faiss",
-    base_storage_dir="/data/embeddings",
+    backend_type=BackendType.PGVECTOR,
+    metric_type=MetricType.COSINE,
     client=client,
-    compute_missing_embeddings=True,  # compute embeddings for concepts not yet in the store
+    compute_missing_embeddings=True,
 )
-kg = KnowledgeGraph(SessionLocal, emb_config=emb_config)
+kg = KnowledgeGraph(engine, emb_config=emb_config)
 ```
 
 | `compute_missing_embeddings` | `client` present | Behaviour when concepts are missing |

diff --git a/docs/index.md b/docs/index.md
@@ -1,32 +1,40 @@
 # omop-graph
 
-**omop-graph** is a lightweight virtual knowledge Graph (VKG) built on-top of the OMOP CDM.
-It transforms the static OMOP vocabulary tables into a dynamic graph environment suitable for NLP grounding, clinical reasoning and other tasks that benefit from a knowledge graph.
+**omop-graph** is a lightweight Virtual Knowledge Graph (VKG) built on top of the OMOP CDM.
+It transforms the static OMOP vocabulary tables into a dynamic graph environment suitable for NLP grounding, clinical reasoning, and other tasks that benefit from a knowledge graph.
 
 ## Why omop-graph?
 
 Unlike generic graph libraries, `omop-graph` is built specifically for clinical data:
 
-- **Semantic Awareness**: Understands the difference between relationships.
-- **Efficient Grounding**: Instead of traversing every possible path, the library uses a **Standard Anchor** approach: translating non-standard terms to standard concepts and leveraging the OMOP `concept_ancestor` table for high-speed hierarchy validation.
-- **Transparent Scoring**: Decisions aren't black boxes. Every path is scored based on textual similarity, graph distance (parsimony), and clinical generality (broadness).
-- **Pre-classification**: Relationships are already pre-classified into overarching groups, allowing quicker restrictions of connections and more efficient graph traversal.
+- **Semantic Awareness**: Understands the difference between relationship kinds (hierarchy, identity, composition, association, attribute).
+- **Efficient Grounding**: Instead of traversing every possible path, the library uses a **Standard Anchor** approach — translating non-standard terms to standard concepts and leveraging the OMOP `concept_ancestor` table for high-speed hierarchy validation.
+- **Transparent Scoring**: Decisions aren't black boxes. Every candidate concept is scored based on textual similarity, graph distance (parsimony), and clinical generality (broadness).
+- **Pre-classification**: Relationships are pre-classified into semantic groups, enabling quicker traversal restrictions and more targeted reasoning.
+
 ---
 
 ## Documentation Overview
 
 ### Core Components
-- [KnowledgeGraph](graph/kg.md): The VKG interface and what it attempts to solve.
-- [Relationships](graph/edges.md): Pre-classification of edges/relationships of the OMOP CDM.
-- [Oaklib Interface](oaklib/interface.md): `oaklib`-compliant interface
+- [KnowledgeGraph](graph/kg.md): The VKG interface — connecting to OMOP and traversing the graph.
+- [Relationships](graph/edges.md): Pre-classification of OMOP edges into semantic kinds.
+- [Oaklib Interface](oaklib/interface.md): OAK-compliant adapter for cross-ontology tooling.
 
 ### Reasoning
 Explore the grounding pipeline used by clinical NLP tools.
 
-- [Semantic grounding](reasoning/grounding.md): How regular search terms can be traced to a standard Ontology
+- [Semantic Grounding](reasoning/grounding.md): Mapping free-text terms to standard OMOP concepts.
+- [Resolver Pipelines](reasoning/resolvers.md): How candidate concepts are retrieved from the database.
+
+### Embedding Support
+
+!!! info "Powered by omop-emb"
+    Embedding-based similarity (vector search, RAG retrieval, on-the-fly embedding computation) is provided by the companion [`omop-emb`](https://australiancancerdatanetwork.github.io/omop-emb/) package.
+    Install it with `pip install "omop-graph[emb]"` and see [Knowledge Graph — Embedding Configuration](graph/kg.md#embedding-configuration) for integration details.
 
 ### Interactive Exploration
-`omop-graph` includes built-in HTML renderers for Jupyter Notebooks, allowing you to visualize concepts and relationship summaries instantly.
+`omop-graph` includes built-in HTML and Mermaid renderers for Jupyter notebooks, so you can visualise concepts, traversal traces, and relationship summaries inline.
 
 ### Testing
-- [Testing](usage/testing.md): How test configuration works, what is covered, and how to set up environment variables for local test runs.
+- [Testing](usage/testing.md): Test configuration, coverage, and how to set up environment variables for local runs.