Knowledge Service

A personal knowledge service that reads what you give it and turns it into a small, queryable RDF graph with source-traceable provenance. Documents, news, claims, or notes go in; structured triples come out, with per-source confidence, contradictions surfaced when sources disagree, and a hybrid RAG endpoint that cites the chunks each answer is grounded in.

Built by Hikmah Technologies | @hikmahtech | @arshadansari27. The primary consumer is AEGIS, where AI agents gain awareness of your accumulated knowledge; the service also stands alone as a reusable knowledge API.

The ontology is the product. Sources are just input channels.

Why it exists

Every "second brain" tool treats all content as equal — a bookmark, a note, a highlight are all flat objects with tags. No confidence. No provenance. No temporal validity. No contradiction detection. No inference. This system separates content (what you consumed) from knowledge (what you derived from it), and models knowledge with:

Uncertainty — triples carry a confidence score; when multiple sources assert the same fact, their confidences combine via Noisy-OR (1 − Π(1 − cᵢ)).
Provenance — every triple traces back to its source, extraction method, timestamp, and the specific chunk it was derived from.
Temporality — knowledge has valid_from / valid_until, not just created_at.
Ontological structure — concepts link to established vocabularies (Schema.org, Dublin Core, SKOS) so "PostgreSQL" in your codebase and "PostgreSQL" in an article resolve to the same entity.
Inference — inverse, transitive, and type-inheritance rules derive extra triples at ingestion time, with source triples preserved for retraction.

For the design rationale behind the non-obvious choices — Noisy-OR replacing 332 lines of ProbLog, the pyoxigraph ↔ Postgres outbox, named graphs as trust labels rather than filters — see docs/architecture.md.

Five-minute demo

The repo ships with a small public-domain corpus (eight short documents covering the November 2023 OpenAI board weekend) and a script that ingests it, runs the read-side APIs, and prints what came out.

export ADMIN_PASSWORD=changeme
docker compose up -d
uv run python scripts/demo.py --api-key changeme

Expect to see, in order: ingestion progress per document; a summary of how many triples landed in each named graph (ontology / asserted / extracted / inferred); the contradictions the engine surfaced (e.g. OpenAI's CEO predicate resolving to four different people over five days); and three RAG answers with source citations and evidence snippets.

The corpus lives in examples/openai-nov-2023/ and is a paraphrased synthesis of publicly reported events: it is not journalism, it is not drawn from any single outlet, and it is MIT-licensed alongside the rest of the repo. To swap in a different corpus, point the script at your own directory of .md files that use the same frontmatter format.

Design highlights

If you only have time to read one section of the architecture doc, read the Noisy-OR story — it's the clearest "right primitive at the right altitude" lesson the codebase carries.

Named graphs as trust labels, not filters — five named graphs separate triples by provenance class. The graph a triple lives in is surfaced to readers as a trust_tier label, but retrieval is tier-agnostic. Filtering on tier is a choice the caller makes, not the system.
Noisy-OR vs ProbLog: 332 lines to 4 — multi-source confidence combination, in its entirety, is one stdlib import and four lines. The story of how it got there from a 332-line probabilistic-logic engine is the headline architectural lesson of this project.
The outbox: two stores, one truth — triples live in pyoxigraph, provenance lives in PostgreSQL, and a single transaction can't cover both. A transactional outbox plus a startup-time drainer keeps the two consistent across process crashes. Every operation is idempotent by construction.
Reader-side status filter — /api/search and /api/ask only return content whose latest ingestion job has reached a terminal status. Without this, in-flight content matches by chunk before its KG triples have committed — the half-picture problem.
Forward-chaining inference — three rules (inverse, transitive, type-inheritance), BFS with a depth cap and cycle detection, retraction cascade when source triples change. Every rule guards against literal objects, a generalisation of a real production bug.
Two-phase extraction + Wikidata-QID coreference — entities are extracted first so their URIs are available to the relation pass. Same-QID entities across documents merge deterministically before triples reach the store.

Architecture

Single FastAPI process with embedded components. No microservices.

FastAPI Process
├── ParserRegistry     Pluggable document parsing (PDF, HTML, CSV, JSON, images)
├── NlpPhase           spaCy NER + Wikidata entity linking pre-pass
├── CoreferencePhase   Entity dedup by shared Wikidata QID
├── TripleStore        pyoxigraph — RDF 1.2, RDF-star, 5 named graphs by provenance
├── InferenceEngine    3 forward-chaining rules (inverse, transitive, type inheritance)
├── QueryClassifier    Intent routing (semantic / entity / graph), LLM-classified
├── RAGRetriever       Hybrid chunk retrieval (BM25 + vector RRF) + KG triple context
├── ContentStore       PostgreSQL + pgvector — BM25 + vector hybrid search (RRF)
├── ExtractionClient   LLM extraction with retry-on-5xx/timeout
└── ProvenanceStore    Per-source evidence rows with chunk_id FK

Pipeline: Parse → Chunk → Embed → NLP Pre-pass → Extract → Coreference → Process

PostgreSQL
├── content_metadata   Document metadata (url, title, source_type, tags, raw_text)
├── content            Chunks with embeddings, section headers, full-text search
├── provenance         Per-source evidence rows with chunk_id FK
├── entity_embeddings  Entity URIs with embeddings for resolution
├── entity_aliases     Coreference alias → canonical URI mappings
├── ingestion_jobs     Async job tracking with per-phase progress
└── triple_outbox      Staged pyoxigraph writes — drained after PG commit

ProcessPhase consistency. Triples live in pyoxigraph, provenance lives in PostgreSQL, and a single transaction cannot cover both. Each per-triple write is staged as a row in triple_outbox inside the same PG transaction as its provenance row; an OutboxDrainer applies staged rows to pyoxigraph after commit, and re-runs on application startup to recover from crashes between commit and drain. Every outbox operation is idempotent (content-addressed inserts, SPARQL ASK-guarded RDF-star annotations). See docs/superpowers/specs/2026-04-15-processphase-2pc-outbox-design.md.
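
The outbox flow described above can be sketched as follows. This is an illustration under stated assumptions, not the service's OutboxStore or OutboxDrainer: `sqlite3` stands in for PostgreSQL, a Python set stands in for pyoxigraph, and the column names (including `applied`) are hypothetical.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """In-memory stand-in for the PostgreSQL side (assumed minimal schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE provenance (triple_hash TEXT, source_url TEXT)")
    db.execute(
        "CREATE TABLE triple_outbox ("
        "id INTEGER PRIMARY KEY, triple_hash TEXT, triple TEXT, "
        "applied INTEGER DEFAULT 0)"
    )
    return db


def stage(db: sqlite3.Connection, triple_hash: str, triple: str, source_url: str) -> None:
    """Write the provenance row and the outbox row in one transaction."""
    with db:  # commits on success, rolls back both inserts on error
        db.execute("INSERT INTO provenance VALUES (?, ?)", (triple_hash, source_url))
        db.execute(
            "INSERT INTO triple_outbox (triple_hash, triple) VALUES (?, ?)",
            (triple_hash, triple),
        )


def drain(db: sqlite3.Connection, graph: set[str]) -> int:
    """Apply staged rows to the graph; safe to re-run after a crash."""
    rows = db.execute(
        "SELECT id, triple FROM triple_outbox WHERE applied = 0 ORDER BY id"
    ).fetchall()
    for row_id, triple in rows:
        graph.add(triple)  # set insertion is idempotent, so replays are harmless
        with db:
            db.execute("UPDATE triple_outbox SET applied = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Calling `drain` once at startup and once after each commit mirrors the recovery path: a crash between commit and drain leaves rows with `applied = 0`, and the next drain picks them up.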

Reader-side status filtering. /api/search, /api/ask, and RAGRetriever only return content whose latest ingestion_jobs.status is terminal (completed or failed), or has no job row. In-flight content is hidden in SQL via a LEFT JOIN LATERAL against ingestion_jobs, so chunks without their KG triples never reach the retriever. Controlled by READER_EXCLUDE_INFLIGHT (default true). /api/content/{id}/chunks is deliberately exempt. See docs/superpowers/specs/2026-04-15-reader-status-filter-design.md.
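
The visibility rule above is implemented in SQL; the sketch below restates the same rule in plain Python to make the three cases explicit (no job row, terminal latest job, in-flight latest job). The function name and the tuple layout of `jobs` are assumptions made for illustration.

```python
from datetime import datetime

TERMINAL = {"completed", "failed"}


def visible_content(
    content_ids: list[str],
    jobs: list[tuple[str, datetime, str]],
) -> list[str]:
    """Keep content whose latest job is terminal, or which has no job row.

    `jobs` holds (content_id, created_at, status) tuples (assumed layout).
    """
    latest: dict[str, tuple[datetime, str]] = {}
    for content_id, created_at, status in jobs:
        if content_id not in latest or created_at > latest[content_id][0]:
            latest[content_id] = (created_at, status)
    return [
        cid for cid in content_ids
        if cid not in latest or latest[cid][1] in TERMINAL
    ]
```

Only the latest job per content item matters: a document that completed once and is now being re-ingested is hidden until the new job reaches a terminal status.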

For deployment details, see docs/deployment.md.

Knowledge Types

The schema accepts three Pydantic input shapes (TripleInput / EventInput / EntityInput). The knowledge_type field is a free-form label that is preserved on each triple's RDF-star annotation and shown in the admin browser, but it does not drive validation — Pydantic resolves the union by shape.

Conventional labels:

| Label | Shape | Truth model | Example |
|---|---|---|---|
| Claim | `TripleInput` | Probabilistic (0.0–1.0) | "Intermittent fasting reduces inflammation", 0.7 from a YouTube video |
| Fact | `TripleInput` | High-confidence (≥0.9) | "Project AEGIS uses PostgreSQL 16", from a codebase scan |
| Relationship | `TripleInput` | Typed link between entities | "AEGIS depends-on PostgreSQL" |
| Event | `EventInput` | Timestamped occurrence | Salary payment received 2026-03-01 |
| Entity | `EntityInput` | Typed, ontology-linked | "AEGIS is a schema:SoftwareApplication" |

Time-bounded facts use TripleInput with valid_from and valid_until. Earlier versions of this doc described separate Conclusion and TemporalState shapes with custom field names — those shapes had no Pydantic model and the extraction prompts no longer emit them.

Confidence Model

Two-layer design:

RDF-star annotation on each triple — the combined confidence after Noisy-OR over all sources:
```
<<:cold_exposure :increases :dopamine>>
    ks:confidence "0.88"^^xsd:float .
```

PostgreSQL provenance table — one row per source per triple:

(triple_hash, source_url, source_type, extractor, confidence, ingested_at, valid_from, valid_until)

When the same claim arrives from multiple sources, Noisy-OR combines their individual confidences:

combined = 1 - product(1 - ci)

# Example: source A at 0.7, source B at 0.6
combined = 1 - (0.3 × 0.4) = 0.88

The combined value is written back to the RDF-star annotation. The inference engine propagates confidence through forward-chaining rules (inverse, transitive, type inheritance) for derived conclusions.
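
The Noisy-OR combination described above fits in a few lines of Python. This is a sketch of the formula, not a copy of the repository's `noisy_or.py`; the function name and signature are assumptions.

```python
from math import prod


def noisy_or(confidences: list[float]) -> float:
    """Combine per-source confidences as 1 - product(1 - c_i)."""
    return 1.0 - prod(1.0 - c for c in confidences)


# Source A at 0.7 and source B at 0.6, as in the example above.
combined = noisy_or([0.7, 0.6])
print(round(combined, 2))  # 0.88
```

Each additional agreeing source can only raise the combined value, and a single source returns its own confidence unchanged.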

API Reference

Base URL: http://localhost:8000

Interactive docs: http://localhost:8000/docs

Health

GET /health

Returns status of all components (knowledge store, PostgreSQL, LLM API).

Ingest Content

POST /api/content
Content-Type: application/json

Ingest content with knowledge items. Parses documents (auto-detects format), chunks text, runs NLP pre-pass + LLM extraction, resolves entities via coreference, and writes triples. If url is provided without raw_text and starts with http, the URL is fetched and parsed automatically.

Accepts a single object or a JSON array for batch processing.

Single request:

{
  "url": "https://example.com/article",
  "title": "Cold Exposure and Dopamine",
  "summary": "A review of studies on cold exposure effects.",
  "raw_text": "...",
  "source_type": "article",
  "tags": ["health", "neuroscience"],
  "metadata": {},
  "knowledge": [
    {
      "knowledge_type": "claim",
      "subject": "http://dbpedia.org/resource/Cold_shock_response",
      "predicate": "http://knowledge.local/schema/increases",
      "object": "http://dbpedia.org/resource/Dopamine",
      "confidence": 0.75
    },
    {
      "knowledge_type": "entity",
      "uri": "http://dbpedia.org/resource/Dopamine",
      "rdf_type": "schema:ChemicalSubstance",
      "label": "Dopamine",
      "properties": {}
    }
  ]
}

Response (202 Accepted):

{
  "content_id": "uuid",
  "job_id": "uuid",
  "status": "accepted",
  "chunks_total": 5,
  "chunks_capped_from": null
}

Processing happens asynchronously. Poll /api/content/{content_id}/status for progress.

Batch request — send an array, get an array:

[
  { "url": "https://a.com", "title": "Article A", "source_type": "article" },
  { "url": "https://b.com", "title": "Article B", "source_type": "article" }
]

Upload File

POST /api/content/upload
Content-Type: multipart/form-data

Upload a file (PDF, HTML, CSV, JSON, plain text) for ingestion. Format is auto-detected from filename, content-type, or magic bytes. Returns 202 with a job ID — poll /api/content/{id}/status for progress.

curl -X POST http://localhost:8000/api/content/upload \
  -H "X-API-Key: your-password" \
  -F "file=@paper.pdf;type=application/pdf" \
  -F "title=Research Paper" \
  -F "source_type=paper"

Response (202): {"content_id": "uuid", "job_id": "uuid", "chunks_total": 3}

Check Ingestion Status

GET /api/content/{content_id}/status

Returns job status with progress counters. Status values: embedding → analyzing → extracting → resolving → processing → completed / failed.
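
A client-side polling loop for this endpoint might look like the sketch below. It assumes the status response is a JSON object with a `status` field holding one of the values listed above; the function name, the injected `fetch_status` callable, and the `"timeout"` return value are illustrative choices, not part of the API.

```python
import time
from typing import Callable

TERMINAL = {"completed", "failed"}


def wait_for_job(
    fetch_status: Callable[[], dict],
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the job is terminal or the timeout expires.

    `fetch_status` should perform GET /api/content/{content_id}/status and
    return the decoded JSON body (the `status` key is an assumption).
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()["status"]
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            return "timeout"
        sleep(interval)
```

Injecting `fetch_status`, `sleep`, and `clock` keeps the loop independent of any HTTP library and makes it testable without a running server.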

Ingest Claims Directly

POST /api/claims
Content-Type: application/json

Ingest knowledge items without storing raw content. Useful for programmatic ingestion where content storage is not needed.

Accepts a single object or a JSON array for batch processing. When an array is sent, a matching array of responses is returned.

Single request:

{
  "source_url": "https://example.com/paper",
  "source_type": "paper",
  "extractor": "llm_qwen3:14b",
  "knowledge": [
    {
      "knowledge_type": "fact",
      "subject": "http://knowledge.local/data/aegis",
      "predicate": "http://schema.org/softwareRequirements",
      "object": "http://dbpedia.org/resource/PostgreSQL",
      "confidence": 0.99
    }
  ]
}

Batch request — send an array, get an array:

[
  {
    "source_url": "https://example.com/a",
    "source_type": "bookmark",
    "extractor": "n8n",
    "knowledge": [{ "knowledge_type": "claim", "subject": "...", "predicate": "...", "object": "...", "confidence": 0.85 }]
  },
  {
    "source_url": "https://example.com/b",
    "source_type": "bookmark",
    "extractor": "n8n",
    "knowledge": [{ "knowledge_type": "claim", "subject": "...", "predicate": "...", "object": "...", "confidence": 0.9 }]
  }
]

All three input shapes are accepted. Examples:

// EventInput
{
  "knowledge_type": "event",
  "subject": "http://knowledge.local/data/payment/2026-03-01",
  "occurred_at": "2026-03-01",
  "properties": { "amount": "4500", "currency": "GBP" }
}

// TripleInput with temporal bounds (time-bounded fact)
{
  "knowledge_type": "temporalfact",
  "subject": "http://dbpedia.org/resource/Bitcoin",
  "predicate": "http://schema.org/price",
  "object": "65000",
  "valid_from": "2024-03-01",
  "valid_until": "2024-03-31"
}

// TripleInput (Relationship)
{
  "knowledge_type": "relationship",
  "subject": "http://knowledge.local/data/aegis",
  "predicate": "http://schema.org/hasPart",
  "object": "http://knowledge.local/data/knowledge-service",
  "confidence": 0.99
}

Notes on Entity and Event payloads:

properties values may be a string or a list of strings. List values expand into one triple per item, all sharing the same predicate.
For Entity, if the model nests rdf_type or label inside properties, the field is lifted to the top level during validation.
For Event, an unparseable occurred_at string is coerced to null and the event is dropped (no triples emitted) rather than rejected outright.

Semantic Search

GET /api/search?q=cold+exposure+dopamine&limit=10&source_type=article

Searches ingested content by semantic similarity using pgvector cosine distance. Returns chunk-level results — each result is the most relevant chunk of a document, not the full document. Short documents have a single chunk; long documents (≥4000 chars) are split into overlapping chunks.

Parameters:

q (required) — query text
limit — max results (1–100, default 10)
source_type — filter by source type (article, video, etc.)
tags — filter by tags (repeat for multiple: ?tags=health&tags=neuroscience)

Response:

[
  {
    "content_id": "uuid",
    "url": "https://...",
    "title": "Cold Exposure and Dopamine",
    "summary": "...",
    "similarity": 0.94,
    "source_type": "article",
    "tags": ["health"],
    "ingested_at": "2026-03-18T10:00:00Z",
    "chunk_text": "The relevant section matching the query...",
    "chunk_index": 0
  }
]

Query the Knowledge Graph

GET /api/knowledge/query?subject=http://dbpedia.org/resource/Dopamine

Structured query with optional subject, predicate, object filters. Returns triples with confidence, knowledge type, temporal bounds, and provenance.

Parameters: subject, predicate, object (at least one required, all are URIs or literals)

Response:

[
  {
    "subject": "http://dbpedia.org/resource/Cold_shock_response",
    "predicate": "http://knowledge.local/schema/increases",
    "object": "http://dbpedia.org/resource/Dopamine",
    "confidence": 0.88,
    "knowledge_type": "claim",
    "valid_from": null,
    "valid_until": null,
    "provenance": [
      {
        "source_url": "https://example.com/article",
        "source_type": "article",
        "confidence": "0.75",
        "ingested_at": "2026-03-18T10:00:00+00:00"
      }
    ]
  }
]

Raw SPARQL Query

POST /api/knowledge/sparql

Execute a SPARQL 1.2 SELECT or ASK query against the knowledge graph. Supports RDF-star syntax for querying annotations. Only SELECT and ASK queries are allowed (no INSERT, DELETE, or UPDATE).

Accepts two content types:

JSON body:

{
  "query": "SELECT ?s ?p ?o ?conf WHERE { ?s ?p ?o . << ?s ?p ?o >> <http://knowledge.local/schema/confidence> ?conf . FILTER(?conf > 0.8) }"
}

Raw SPARQL body (Content-Type: application/sparql-query):

SELECT ?s ?p ?o ?conf WHERE {
  ?s ?p ?o .
  << ?s ?p ?o >> <http://knowledge.local/schema/confidence> ?conf .
  FILTER(?conf > 0.8)
}
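
A sketch of building either request form with the Python standard library. The helper name is hypothetical, and using the `X-API-Key` header here assumes this endpoint accepts the same header as the upload example; the request is only constructed, not sent.

```python
import json
import urllib.request


def build_sparql_request(
    base_url: str,
    query: str,
    api_key: str,
    raw: bool = False,
) -> urllib.request.Request:
    """Build a POST to /api/knowledge/sparql as JSON or as a raw SPARQL body."""
    if raw:
        body = query.encode("utf-8")
        content_type = "application/sparql-query"
    else:
        body = json.dumps({"query": query}).encode("utf-8")
        content_type = "application/json"
    return urllib.request.Request(
        f"{base_url}/api/knowledge/sparql",
        data=body,
        headers={"Content-Type": content_type, "X-API-Key": api_key},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it; the raw form avoids JSON-escaping quotes inside longer queries.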

Contradictions

GET /api/knowledge/contradictions?min_confidence=0.5

Surfaces contradictions in the knowledge graph. Detects two patterns:

Same predicate, different objects — only fires for predicates declared owl:FunctionalProperty in the ontology (e.g. ks:amount, ks:currency); multi-valued predicates like has_property are intentionally excluded
Opposite predicates — e.g., "increases dopamine" vs "decreases dopamine" (via ks:oppositePredicate declarations)

Two filters keep noise down: pairs that share a chunk_id are dropped (extraction conflation — the LLM emitted two distinct values from one paragraph under one subject URI, not a real disagreement across sources), and pairs whose objects are identical (a SPARQL artefact of opposite-predicate pairs pointing at the same object) are dropped.

Contradiction probability is the product of both claims' confidence scores.

Parameters:

min_confidence — filter to pairs where conf_a × conf_b ≥ threshold (default 0.0)
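
The scoring and the `min_confidence` threshold described above amount to the following sketch. The `Claim` dataclass and function names are hypothetical; only the shared-chunk filter and the product rule are modelled, and detection of the contradicting pairs themselves is assumed to happen upstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    obj: str
    confidence: float
    chunk_id: str


def contradiction_probability(a: Claim, b: Claim) -> float:
    """Product of the two claims' confidence scores."""
    return a.confidence * b.confidence


def keep_pair(a: Claim, b: Claim, min_confidence: float = 0.0) -> bool:
    """Drop same-chunk pairs, then apply the conf_a × conf_b threshold."""
    if a.chunk_id == b.chunk_id:
        return False  # extraction conflation, not a cross-source disagreement
    return contradiction_probability(a, b) >= min_confidence


a = Claim("beneficial", 0.75, "chunk-1")
b = Claim("harmful", 0.6, "chunk-2")
print(round(contradiction_probability(a, b), 2))  # 0.45
```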

Response:

[
  {
    "claim_a": {
      "subject": "...",
      "predicate": "...",
      "object": "beneficial",
      "confidence": 0.75
    },
    "claim_b": {
      "subject": "...",
      "predicate": "...",
      "object": "harmful",
      "confidence": 0.6
    },
    "contradiction_probability": 0.45,
    "provenance_a": [...],
    "provenance_b": [...]
  }
]

Ask a Question (RAG)

POST /api/ask
Content-Type: application/json

Ask a natural language question against the knowledge base. Retrieves relevant content (semantic search) and knowledge graph triples, checks for contradictions, and generates an LLM-powered answer grounded in your data.

{
  "question": "Does cold exposure increase dopamine?",
  "max_sources": 5,
  "min_confidence": 0.3
}

Parameters:

question (required) — natural language question (max 4000 chars)
max_sources — max content items to retrieve (1–20, default 5)
min_confidence — filter out knowledge triples below this confidence (0.0–1.0, default 0.0)

Response:

{
  "answer": "Based on your knowledge base, cold exposure likely increases dopamine...",
  "confidence": 0.88,
  "sources": [
    {
      "url": "https://example.com/article",
      "title": "Cold Exposure and Dopamine",
      "source_type": "article"
    }
  ],
  "knowledge_types_used": ["claim"],
  "contradictions": [],
  "evidence": [{"triple_subject": "...", "triple_predicate": "...", "triple_object": "...", "chunk_text": "...", "source_url": "..."}],
  "intent": "graph",
  "traversal_depth": 3
}

Admin Panel

A built-in web UI for monitoring and querying your knowledge base. Accessible at /admin after logging in.

Features

Dashboard — stats cards (triples, entities, content, events), confidence distribution chart, knowledge type breakdown, recent ingestion activity
Knowledge Explorer — searchable, filterable, paginated triple browser with entity detail views and content inspection
Chat — ask natural language questions against your knowledge base (uses the RAG pipeline), with source citations and confidence scores
Contradictions — visual side-by-side comparison of conflicting claims with confidence bars

Authentication

All routes (UI and API) are protected behind a password. Set ADMIN_PASSWORD in your .env file or environment:

ADMIN_PASSWORD=your-password-here

The service will not start without this variable. Visit /login to sign in — no username needed, just the password.

Sessions last 24 hours (signed cookie). Set SECRET_KEY for persistent sessions across restarts; if omitted, a random key is generated at startup (sessions lost on restart).

Tech

Server-rendered Jinja2 templates with Alpine.js and TailwindCSS (CDN). No JS build pipeline — everything ships inside the Python package.

Running Locally

Prerequisites

Python 3.12+
PostgreSQL 16 with pgvector extension
An LLM provider (Ollama or LiteLLM)

Option A: Ollama (simplest)

Ollama runs models locally with zero configuration.

Install Ollama and pull the required models:

ollama pull nomic-embed-text
ollama pull qwen3:14b

Start PostgreSQL (via docker-compose):

docker compose up -d postgres

Install and run:

pip install -e ".[dev]"
cp .env.example .env   # defaults work for Ollama — no changes needed
uvicorn knowledge_service.main:app --reload

Option B: LiteLLM Proxy

LiteLLM provides a unified OpenAI-compatible gateway to 100+ LLM providers.

Deploy LiteLLM with the required models in your litellm_config.yaml:

model_list:
  - model_name: nomic-embed-text
    litellm_params:
      model: ollama/nomic-embed-text
      api_base: http://localhost:11434
  - model_name: qwen3:14b
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://localhost:11434

Start LiteLLM:

litellm --config litellm_config.yaml

Configure and run:

pip install -e ".[dev]"
cp .env.example .env
# Edit .env:
#   LLM_BASE_URL=http://localhost:4000
#   LLM_API_KEY=sk-your-litellm-key
uvicorn knowledge_service.main:app --reload

With Docker (full stack)

A pre-built image is available on Docker Hub: arshadansari27/knowledge-service

# Set admin password (required)
export ADMIN_PASSWORD=changeme

# Using docker-compose (builds locally)
docker compose up -d

# Or pull the pre-built image directly
docker pull arshadansari27/knowledge-service:latest

Service available at http://localhost:8000. Admin panel at http://localhost:8000/login. Ollama must be running on the host machine.

Configuration

All settings via environment variables or .env file:

| Variable | Default | Description |
|---|---|---|
| `DATABASE_URL` | `postgresql://knowledge:knowledge@localhost:5433/knowledge` | PostgreSQL connection |
| `LLM_BASE_URL` | `http://localhost:11434` | LLM API endpoint (Ollama or LiteLLM) |
| `LLM_API_KEY` | (empty) | API key (leave empty for Ollama) |
| `LLM_EMBED_MODEL` | `nomic-embed-text` | Embedding model (768-dim vectors) |
| `LLM_CHAT_MODEL` | `qwen3:14b` | Chat model for knowledge extraction |
| `LLM_RAG_MODEL` | (empty) | RAG answer model (defaults to `LLM_CHAT_MODEL` if empty) |
| `OXIGRAPH_DATA_DIR` | `./data/oxigraph` | RDF store data directory |
| `API_HOST` | `0.0.0.0` | Bind address |
| `API_PORT` | `8000` | Port |
| `ADMIN_PASSWORD` | (required) | Password for admin panel login and API key auth |
| `SECRET_KEY` | (random at startup) | Session signing key; set it to keep sessions valid across restarts |
| `SPACY_DATA_DIR` | `/app/data/spacy` | spaCy Wikidata KB storage directory |
| `MAX_UPLOAD_SIZE` | `52428800` | Maximum file upload size in bytes (default 50MB) |
| `URL_FETCH_TIMEOUT` | `30` | Timeout for URL auto-fetch (seconds) |
| `NLP_ENTITY_CONFIDENCE` | `0.5` | Confidence for spaCy-only fallback entities |
| `READER_EXCLUDE_INFLIGHT` | `true` | Exclude in-flight content from hybrid retrieval |

Bulk Ingest CLI

A standalone script to ingest a directory of files and/or a list of URLs in one go.

Usage

# Ingest all supported files in a directory (recursive)
uv run python scripts/bulk_ingest.py ./documents/

# Ingest URLs from a file (one per line, # comments and blank lines ignored)
uv run python scripts/bulk_ingest.py --urls urls.txt

# Both at once, with tags and domain hints
uv run python scripts/bulk_ingest.py ./documents/ --urls urls.txt --tags health,research --domains health

# Preview what would be ingested without doing it
uv run python scripts/bulk_ingest.py ./documents/ --dry-run

Supported file types

.pdf, .html, .htm, .txt, .md, .json, .csv

Files are uploaded via POST /api/content/upload. URLs are submitted via POST /api/content (server auto-fetches).

Options

| Option | Default | Description |
|---|---|---|
| `path` (positional) | - | Directory to scan recursively |
| `--urls FILE` | - | Text file with one URL per line |
| `--server URL` | `KNOWLEDGE_URL` env or `http://localhost:8000` | Target server |
| `--api-key KEY` | `KNOWLEDGE_API_KEY` env | API key for authentication |
| `--tags t1,t2` | - | Comma-separated tags for all items |
| `--domains d1,d2` | - | Comma-separated domain hints for extraction |
| `--dry-run` | - | List items without ingesting |
| `--poll-timeout N` | `300` | Seconds to wait per item before giving up |

At least one of path or --urls is required.

How it works

Items are processed sequentially. For each item, the script POSTs to the server, then polls GET /api/content/{id}/status every 5 seconds until the job completes, fails, or times out. Progress is printed as it goes:

Bulk ingest: 6 items (5 files, 1 URLs)
Target: http://localhost:8000

  [1/6] report.pdf .................. OK  triples=14  23s
  [2/6] notes.txt ................... OK  triples=6   8s
  [3/6] https://example.com/article . FAIL  extraction failed  45s
  ...

Results: 4 passed, 1 failed, 1 skipped (total: 4m 32s)

Exit code is 0 if all items passed, 1 if any failed.

Against production

KNOWLEDGE_URL=https://knowledge.hikmahtech.in \
KNOWLEDGE_API_KEY=your-key \
  uv run python scripts/bulk_ingest.py ./documents/ --poll-timeout 600

Against local docker-compose

docker compose up -d
uv run python scripts/bulk_ingest.py ./documents/ --api-key changeme

Running Tests

pytest

All tests mock external dependencies — no PostgreSQL or LLM provider required.

CI/CD

GitHub Actions pipeline on every push/merge to main:

Lint — ruff check + ruff format --check
Test — pytest tests/ -v (~700 tests)
Version bump — auto-increments patch version in pyproject.toml, commits back to main, creates vX.Y.Z git tag
Docker build — builds and pushes to Docker Hub as arshadansari27/knowledge-service:X.Y.Z and :latest

Version is read from pyproject.toml and used as the Docker image tag and git tag. Bump commits include [skip ci] to prevent infinite loops.

Pull requests run lint + test only (no version bump or Docker push).

Project Structure

src/knowledge_service/
├── main.py                  # FastAPI app factory + lifespan
├── config.py                # Settings (pydantic-settings, .env)
├── models.py                # Pydantic input shapes (TripleInput / EventInput / EntityInput) + API contracts
├── _utils.py                # Shared RDF helpers + JSON extraction from LLM output
├── chunking.py              # Markdown-aware text splitting with section headers
├── admin/
│   ├── auth.py              # AuthMiddleware, login/logout, rate limiter, session cookies
│   ├── routes.py            # Admin page routes (dashboard, knowledge, chat, contradictions)
│   ├── stats.py             # /api/admin/stats/* and /api/admin/knowledge/triples endpoints
│   ├── jobs.py              # /api/admin/jobs
│   ├── content.py           # DELETE /api/admin/knowledge/content/{id} and /knowledge/source
│   └── templates/           # Jinja2 templates (base, dashboard, knowledge, chat, etc.)
├── api/
│   ├── content.py           # POST /api/content (JSON + URL auto-fetch)
│   ├── upload.py            # POST /api/content/upload (multipart file upload)
│   ├── claims.py            # POST /api/claims
│   ├── search.py            # GET /api/search
│   ├── knowledge.py         # GET /api/knowledge/query, POST /api/knowledge/sparql
│   ├── contradictions.py    # GET /api/knowledge/contradictions
│   ├── ask.py               # POST /api/ask (RAG question answering)
│   ├── changes.py           # GET /api/entity/{id}/changes
│   └── health.py            # GET /health
├── parsing/
│   ├── __init__.py          # ParserRegistry, ParsedDocument, Parser protocol
│   ├── pdf.py               # PdfParser (PyMuPDF)
│   ├── html.py              # HtmlParser (readability-lxml + BeautifulSoup)
│   ├── structured.py        # StructuredParser (JSON/CSV)
│   └── text.py              # TextParser (passthrough)
├── nlp/
│   ├── __init__.py          # NlpPhase, NlpResult, NlpEntity
│   └── bootstrap.py         # spaCy model + Wikidata KB loading
├── ingestion/
│   ├── pipeline.py          # Per-triple processing (delta, insert, contradiction, provenance, inference)
│   ├── worker.py            # 5-phase orchestrator (Embed → NLP → Extract → Coref → Process)
│   ├── phases.py            # EmbedPhase, ExtractPhase, ProcessPhase
│   ├── coreference.py       # CoreferencePhase (Wikidata QID merging)
│   └── outbox.py            # OutboxStore + OutboxDrainer (pyoxigraph/PG 2PC)
├── stores/
│   ├── __init__.py          # Stores dataclass
│   ├── triples.py           # pyoxigraph wrapper — RDF-star, named graphs
│   ├── content.py           # ContentStore — metadata + chunks + embeddings
│   ├── entities.py          # EntityStore — entity/predicate resolution + aliases
│   ├── provenance.py        # ProvenanceStore — per-source evidence rows
│   └── rag.py               # RAGRetriever — hybrid chunk + KG retrieval, intent-routed (3 strategies)
├── reasoning/
│   ├── engine.py            # InferenceEngine — forward-chaining rules
│   └── noisy_or.py          # Noisy-OR evidence combination
├── ontology/
│   ├── uri.py               # URI normalization (to_entity_uri, to_predicate_uri, slugify)
│   ├── namespaces.py        # ks:, schema:, dc:, skos:, prov: namespace constants
│   ├── registry.py          # DomainRegistry — predicate metadata from ontology
│   ├── bootstrap.py         # Loads schema.ttl + domains/*.ttl into ks:graph/ontology
│   ├── schema.ttl            # Knowledge type classes, properties
│   ├── domains/             # Domain TTL files (base, health, technology, research)
│   └── prompts/             # LLM extraction prompt templates (entities, relations)
├── clients/
│   ├── base.py              # BaseLLMClient — shared retry / timeout / auth handling
│   ├── llm.py               # EmbeddingClient + ExtractionClient
│   ├── prompt_builder.py    # Domain-aware extraction prompts from templates
│   └── rag.py               # RAGClient — LLM-powered answer generation
└── migrations/              # SQL migrations (auto-applied at startup)

Ontology

The system reuses established vocabularies and keeps the custom ks: namespace minimal:

| Domain | Namespace | Purpose |
|---|---|---|
| Content metadata | `dc:` / `dcterms:` | Title, creator, date, format, source |
| General entities | `schema:` (Schema.org) | People, organisations, software, events |
| Topic hierarchies | `skos:` | Broader/narrower relationships, labels |
| People and social | `foaf:` | Personal information, social connections |
| Provenance | `prov:` (PROV-O) | Activity, agent, entity chains |
| Service-specific | `ks:` | Confidence, knowledge type, temporal validity, extraction metadata |

Status

Deployed to production (~700 tests).

| Capability | What |
|---|---|
| Knowledge model | 3 Pydantic input shapes (`TripleInput` / `EventInput` / `EntityInput`); `knowledge_type` is preserved on each triple's RDF-star annotation as a lowercase canonical label (`claim` / `fact` / `event` / `entity` / `relationship` / …; `Relation` is collapsed to `relationship`); temporal validity via `valid_from` / `valid_until` |
| RDF store | pyoxigraph, 4 named graphs by provenance class (ontology / asserted / extracted / inferred) |
| Ingestion | Parse (PDF / HTML / CSV / JSON / text) → chunk → embed → spaCy NER + Wikidata linking → LLM extraction → QID-based coreference → ingest |
| Maintenance | Background asyncio task lowercases `ks:knowledgeType` annotations and remaps spaCy NER labels to schema.org canonical types on a configurable interval; manual trigger at `POST /api/admin/maintenance/run` |
| Hybrid retrieval | BM25 (OR-tokenised `to_tsquery`) + pgvector, fused via Reciprocal Rank Fusion |
| RAG endpoint | `/api/ask` with intent classification (semantic / entity / graph), returns answer + source chunks + triples + contradictions |
| Evidence combination | Noisy-OR across multi-source confidences: `1 - Π(1 − cᵢ)` |
| Inference | 3 forward-chaining rules (inverse / transitive / type-inheritance), retraction cascade when source triples change |
| Consistency | Outbox 2PC between pyoxigraph and Postgres, startup drainer recovers from crash-between-commit-and-drain |
| Reader-side filter | In-flight content excluded from retrieval until ingestion reaches a terminal status |

Not currently enforced (declared but not filtered on the read path): graph-level trust tiers, contradictions. Both are surfaced as labels / response fields, not used to re-rank or exclude results.
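
The hybrid retrieval row above names Reciprocal Rank Fusion as the way BM25 and vector rankings are merged. A generic sketch of that technique follows; the function name is hypothetical, and `k = 60` is the constant commonly used with RRF, not a value confirmed for this service.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by item id for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))


bm25_hits = ["c1", "c2", "c3"]
vector_hits = ["c2", "c4", "c1"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# ['c2', 'c1', 'c4', 'c3']
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale.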

Documentation

| Document | Description |
|---|---|
| Architecture notes | Design rationale for Noisy-OR, the outbox, named-graph trust tiers, inference, and the parts deliberately left out |
| API reference | Full endpoint reference with example payloads (interactive docs at `/docs` when running) |
| Deployment | Production AEGIS stack deployment |
| Demo corpus | Bundled public-domain corpus used by `scripts/demo.py` |

Tech Stack

| Component | Technology |
|---|---|
| API | Python 3.12, FastAPI, uvicorn |
| Knowledge store | pyoxigraph (embedded, RDF 1.2, SPARQL 1.2, RocksDB) |
| Confidence combination | Noisy-OR (4-line function) |
| Inference | 3 forward-chaining rules at ingestion time (inverse / transitive / type-inheritance) |
| Operational store | PostgreSQL 16 |
| Vector search | pgvector (HNSW index, halfvec) |
| Document parsing | PyMuPDF (PDF), readability-lxml + BeautifulSoup (HTML), stdlib (CSV/JSON) |
| NLP pre-pass | spaCy (en_core_web_sm) + spacy-entity-linker (Wikidata KB) |
| LLM gateway | Ollama (local) or LiteLLM (proxy); any OpenAI-compatible API |
| Embeddings | nomic-embed-text (768-dim) |
| Knowledge extraction | qwen3:14b (auto-extracts from raw_text) |
| Infrastructure | Docker Compose, GitHub Actions CI/CD, Docker Swarm (production) |
