A personal knowledge service that reads what you give it and turns it into a small, queryable RDF graph with source-traceable provenance. Documents, news, claims, or notes go in; out the other side: structured triples with per-source confidence, contradictions surfaced when sources disagree, and a hybrid RAG endpoint that cites the chunks it grounded each answer in.
Built by Hikmah Technologies | @hikmahtech | @arshadansari27. The primary consumer is AEGIS, where AI agents gain awareness of your accumulated knowledge; the service also stands alone as a reusable knowledge API.
The ontology is the product. Sources are just input channels.
Every "second brain" tool treats all content as equal — a bookmark, a note, a highlight are all flat objects with tags. No confidence. No provenance. No temporal validity. No contradiction detection. No inference. This system separates content (what you consumed) from knowledge (what you derived from it), and models knowledge with:
- Uncertainty: triples carry a confidence score; when multiple sources assert the same fact, their confidences combine via Noisy-OR (1 − Π(1 − cᵢ)).
- Provenance: every triple traces back to its source, extraction method, timestamp, and the specific chunk it was derived from.
- Temporality: knowledge has valid_from/valid_until, not just created_at.
- Ontological structure: concepts link to established vocabularies (Schema.org, Dublin Core, SKOS) so "PostgreSQL" in your codebase and "PostgreSQL" in an article resolve to the same entity.
- Inference: inverse, transitive, and type-inheritance rules derive extra triples at ingestion time, with source triples preserved for retraction.
For the design rationale behind the non-obvious choices — Noisy-OR replacing 332 lines of ProbLog, the pyoxigraph ↔ Postgres outbox, named graphs as trust labels rather than filters — see docs/architecture.md.
The repo ships with a small public-domain corpus (eight short documents covering the November 2023 OpenAI board weekend) and a script that ingests it, runs the read-side APIs, and prints what came out.
export ADMIN_PASSWORD=changeme
docker compose up -d
uv run python scripts/demo.py --api-key changeme
Expect to see, in order: ingestion progress per document; a summary of how many triples landed in each named graph (ontology / asserted / extracted / inferred); the contradictions the engine surfaced (e.g. OpenAI's CEO predicate resolving to four different people over five days); and three RAG answers with source citations and evidence snippets.
The corpus lives in examples/openai-nov-2023/ and is a paraphrased synthesis of publicly reported events: not journalism, not from any single outlet, and MIT-licensed alongside the rest of the repo. Point the script at your own directory of .md files with the same frontmatter format to swap in a different corpus.
If you only have time to read one section of the architecture doc, read the Noisy-OR story — it's the clearest "right primitive at the right altitude" lesson the codebase carries.
- Named graphs as trust labels, not filters: five named graphs separate triples by provenance class. The graph a triple lives in is surfaced to readers as a trust_tier label, but retrieval is tier-agnostic. Filtering on tier is a choice the caller makes, not the system.
- Noisy-OR vs ProbLog (332 lines to 4): multi-source confidence combination, in its entirety, is one stdlib import and four lines. The story of how it got there from a 332-line probabilistic-logic engine is the headline architectural lesson of this project.
- The outbox (two stores, one truth): triples live in pyoxigraph, provenance lives in PostgreSQL, and a single transaction can't cover both. A transactional outbox plus a startup-time drainer keeps the two consistent across process crashes. Every operation is idempotent by construction.
- Reader-side status filter: /api/search and /api/ask only return content whose latest ingestion job has reached a terminal status. Without this, in-flight content matches by chunk before its KG triples have committed (the half-picture problem).
- Forward-chaining inference: three rules (inverse, transitive, type-inheritance), BFS with a depth cap and cycle detection, retraction cascade when source triples change. Every rule guards against literal objects, a generalisation of a real production bug.
- Two-phase extraction + Wikidata-QID coreference: entities are extracted first so their URIs are available to the relation pass. Same-QID entities across documents merge deterministically before triples reach the store.
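The forward-chaining bullet above can be illustrated for the transitive rule alone. This is a minimal sketch, not the service's InferenceEngine: the function names, the tuple representation of triples, the `is_literal` heuristic, and the default depth cap of 3 are all assumptions made for the example.

```python
from collections import deque


def is_literal(term: str) -> bool:
    # Assumption for this sketch: IRIs start with "http"; anything else is a literal.
    return not term.startswith("http")


def infer_transitive(triples, predicate, max_depth=3):
    """Derive (s, predicate, o) triples reachable in 2..max_depth hops via BFS."""
    edges = {}
    for s, p, o in triples:
        # Guard: never chain through literal objects.
        if p == predicate and not is_literal(o):
            edges.setdefault(s, set()).add(o)
    asserted = {(s, o) for s, objs in edges.items() for o in objs}

    inferred = set()
    for start in edges:
        seen = {start}  # cycle detection: never revisit a node for this start
        queue = deque((o, 1) for o in edges[start])
        while queue:
            node, depth = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            if depth > 1 and (start, node) not in asserted:
                inferred.add((start, predicate, node))
            if depth < max_depth:  # depth cap
                queue.extend((nxt, depth + 1) for nxt in edges.get(node, ()))
    return inferred
```

Keeping the asserted pairs separate from the inferred set mirrors the idea that source triples are preserved, so derived triples can be retracted without touching them.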
Single FastAPI process with embedded components. No microservices.
FastAPI Process
├── ParserRegistry Pluggable document parsing (PDF, HTML, CSV, JSON, images)
├── NlpPhase spaCy NER + Wikidata entity linking pre-pass
├── CoreferencePhase Entity dedup by shared Wikidata QID
├── TripleStore pyoxigraph — RDF 1.2, RDF-star, 5 named graphs by provenance
├── InferenceEngine 3 forward-chaining rules (inverse, transitive, type inheritance)
├── QueryClassifier Intent routing (semantic / entity / graph), LLM-classified
├── RAGRetriever Hybrid chunk retrieval (BM25 + vector RRF) + KG triple context
├── ContentStore PostgreSQL + pgvector — BM25 + vector hybrid search (RRF)
├── ExtractionClient LLM extraction with retry-on-5xx/timeout
└── ProvenanceStore Per-source evidence rows with chunk_id FK
Pipeline: Parse → Chunk → Embed → NLP Pre-pass → Extract → Coreference → Process
PostgreSQL
├── content_metadata Document metadata (url, title, source_type, tags, raw_text)
├── content Chunks with embeddings, section headers, full-text search
├── provenance Per-source evidence rows with chunk_id FK
├── entity_embeddings Entity URIs with embeddings for resolution
├── entity_aliases Coreference alias → canonical URI mappings
├── ingestion_jobs Async job tracking with per-phase progress
└── triple_outbox Staged pyoxigraph writes — drained after PG commit
ProcessPhase consistency. Triples live in pyoxigraph, provenance lives in PostgreSQL, and a single transaction cannot cover both. Each per-triple write is staged as a row in triple_outbox inside the same PG transaction as its provenance row; an OutboxDrainer applies staged rows to pyoxigraph after commit, and re-runs on application startup to recover from crashes between commit and drain. Every outbox operation is idempotent (content-addressed inserts, SPARQL ASK-guarded RDF-star annotations). See docs/superpowers/specs/2026-04-15-processphase-2pc-outbox-design.md.
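The outbox pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not the service's OutboxStore or OutboxDrainer: an in-memory SQLite database stands in for PostgreSQL, a Python set stands in for pyoxigraph, and the column names (including `applied`) are hypothetical.

```python
import hashlib
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory SQLite stands in for PostgreSQL; column names are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE provenance (triple_hash TEXT, source_url TEXT);
        CREATE TABLE triple_outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            s TEXT, p TEXT, o TEXT,
            applied INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def stage(conn, s, p, o, source_url):
    """Write the provenance row and the outbox row in one transaction."""
    triple_hash = hashlib.sha256(f"{s}|{p}|{o}".encode()).hexdigest()
    with conn:  # commits both inserts together, or rolls both back
        conn.execute("INSERT INTO provenance VALUES (?, ?)", (triple_hash, source_url))
        conn.execute("INSERT INTO triple_outbox (s, p, o) VALUES (?, ?, ?)", (s, p, o))


def drain(conn, graph: set) -> int:
    """Apply unapplied outbox rows to the graph; safe to re-run after a crash."""
    rows = conn.execute(
        "SELECT id, s, p, o FROM triple_outbox WHERE applied = 0 ORDER BY id"
    ).fetchall()
    for row_id, s, p, o in rows:
        graph.add((s, p, o))  # set insertion is idempotent
        with conn:
            conn.execute("UPDATE triple_outbox SET applied = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Calling `drain` once at startup covers the crash window: a row that was applied to the graph but never marked is simply applied again, and the idempotent insert makes that harmless.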
Reader-side status filtering. /api/search, /api/ask, and RAGRetriever only return content whose latest ingestion_jobs.status is terminal (completed or failed), or has no job row. In-flight content is hidden in SQL via a LEFT JOIN LATERAL against ingestion_jobs, so chunks without their KG triples never reach the retriever. Controlled by READER_EXCLUDE_INFLIGHT (default true). /api/content/{id}/chunks is deliberately exempt. See docs/superpowers/specs/2026-04-15-reader-status-filter-design.md.
For deployment details, see docs/deployment.md.
The schema accepts three Pydantic input shapes
(TripleInput / EventInput / EntityInput). The knowledge_type field
is a free-form label that is preserved on each triple's RDF-star annotation
and shown in the admin browser, but it does not drive validation — Pydantic
resolves the union by shape.
Conventional labels:
| Label | Shape | Truth model | Example |
|---|---|---|---|
| Claim | TripleInput | Probabilistic (0.0–1.0) | "Intermittent fasting reduces inflammation" (0.7, from a YouTube video) |
| Fact | TripleInput | High-confidence (≥0.9) | "Project AEGIS uses PostgreSQL 16" (from codebase scan) |
| Relationship | TripleInput | Typed link between entities | "AEGIS depends-on PostgreSQL" |
| Event | EventInput | Timestamped occurrence | Salary payment received 2026-03-01 |
| Entity | EntityInput | Typed, ontology-linked | "AEGIS is a schema:SoftwareApplication" |
Time-bounded facts use TripleInput with valid_from and valid_until.
Earlier versions of this doc described separate Conclusion and
TemporalState shapes with custom field names — those shapes had no
Pydantic model and the extraction prompts no longer emit them.
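A reader of time-bounded facts typically needs to decide whether a triple applies on a given date. The helper below is a hypothetical sketch of that check; the function name is invented here, and treating both bounds as inclusive with a missing bound meaning "open-ended" is an assumption, not documented behaviour of the service.

```python
from datetime import date
from typing import Optional


def is_valid_at(valid_from: Optional[date], valid_until: Optional[date], at: date) -> bool:
    """Return True if `at` falls inside [valid_from, valid_until] (inclusive, assumed)."""
    if valid_from is not None and at < valid_from:
        return False
    if valid_until is not None and at > valid_until:
        return False
    return True
```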
Two-layer design:
- RDF-star annotation on each triple, holding the combined confidence after Noisy-OR over all sources:
  <<:cold_exposure :increases :dopamine>> ks:confidence "0.88"^^xsd:float .
- PostgreSQL provenance table, with one row per source per triple:
  (triple_hash, source_url, source_type, extractor, confidence, ingested_at, valid_from, valid_until)
When the same claim arrives from multiple sources, Noisy-OR combines their individual confidences:
combined = 1 - product(1 - ci)
# Example: source A at 0.7, source B at 0.6
combined = 1 - (0.3 × 0.4) = 0.88
The combined value is written back to the RDF-star annotation. The inference engine propagates confidence through forward-chaining rules (inverse, transitive, type inheritance) for derived conclusions.
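The combination rule above is small enough to write out in full. The following is a sketch of the formula, not a copy of the service's noisy_or.py; the function name and signature are assumptions.

```python
from math import prod


def noisy_or(confidences):
    """Combine independent per-source confidences: 1 - prod(1 - c_i)."""
    return 1.0 - prod(1.0 - c for c in confidences)


# Source A at 0.7 and source B at 0.6, as in the example above.
print(round(noisy_or([0.7, 0.6]), 2))  # 0.88
```

Each additional agreeing source can only raise the combined value, and a single source passes through unchanged.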
Base URL: http://localhost:8000
Interactive docs: http://localhost:8000/docs
GET /health
Returns status of all components (knowledge store, PostgreSQL, LLM API).
POST /api/content
Content-Type: application/json
Ingest content with knowledge items. Parses documents (auto-detects format), chunks text, runs NLP pre-pass + LLM extraction, resolves entities via coreference, and writes triples. If url is provided without raw_text and starts with http, the URL is fetched and parsed automatically.
Accepts a single object or a JSON array for batch processing.
Single request:
{
"url": "https://example.com/article",
"title": "Cold Exposure and Dopamine",
"summary": "A review of studies on cold exposure effects.",
"raw_text": "...",
"source_type": "article",
"tags": ["health", "neuroscience"],
"metadata": {},
"knowledge": [
{
"knowledge_type": "claim",
"subject": "http://dbpedia.org/resource/Cold_shock_response",
"predicate": "http://knowledge.local/schema/increases",
"object": "http://dbpedia.org/resource/Dopamine",
"confidence": 0.75
},
{
"knowledge_type": "entity",
"uri": "http://dbpedia.org/resource/Dopamine",
"rdf_type": "schema:ChemicalSubstance",
"label": "Dopamine",
"properties": {}
}
]
}
Response (202 Accepted):
{
"content_id": "uuid",
"job_id": "uuid",
"status": "accepted",
"chunks_total": 5,
"chunks_capped_from": null
}
Processing happens asynchronously. Poll /api/content/{content_id}/status for progress.
Batch request — send an array, get an array:
[
{ "url": "https://a.com", "title": "Article A", "source_type": "article" },
{ "url": "https://b.com", "title": "Article B", "source_type": "article" }
]
POST /api/content/upload
Content-Type: multipart/form-data
Upload a file (PDF, HTML, CSV, JSON, plain text) for ingestion. Format is auto-detected from filename, content-type, or magic bytes. Returns 202 with a job ID; poll /api/content/{id}/status for progress.
curl -X POST http://localhost:8000/api/content/upload \
-H "X-API-Key: your-password" \
-F "file=@paper.pdf;type=application/pdf" \
-F "title=Research Paper" \
-F "source_type=paper"
Response (202): {"content_id": "uuid", "job_id": "uuid", "chunks_total": 3}
GET /api/content/{content_id}/status
Returns job status with progress counters. Status values: embedding → analyzing → extracting → resolving → processing → completed / failed.
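A client that submits content usually polls this endpoint until the job reaches a terminal status. The sketch below shows one way to do that with the standard library only. The helper names are hypothetical, and reading the status from a top-level `status` field of the JSON response is an assumption; the X-API-Key header follows the upload example above.

```python
import json
import time
import urllib.request

TERMINAL_STATUSES = {"completed", "failed"}


def http_status_fetcher(base_url, content_id, api_key):
    """Build a zero-argument callable that fetches the current job status."""
    url = f"{base_url}/api/content/{content_id}/status"

    def fetch() -> str:
        request = urllib.request.Request(url, headers={"X-API-Key": api_key})
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)["status"]  # assumed field name

    return fetch


def wait_for_job(fetch_status, interval=5.0, timeout=300.0, sleep=time.sleep):
    """Poll until the status is terminal; raise TimeoutError if it never is."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout}s")
        sleep(interval)
```

Usage would look like `wait_for_job(http_status_fetcher("http://localhost:8000", content_id, api_key))`. Passing the fetcher as a callable keeps the polling loop testable without a running server.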
POST /api/claims
Content-Type: application/json
Ingest knowledge items without storing raw content. Useful for programmatic ingestion where content storage is not needed.
Accepts a single object or a JSON array for batch processing. When an array is sent, a matching array of responses is returned.
Single request:
{
"source_url": "https://example.com/paper",
"source_type": "paper",
"extractor": "llm_qwen3:14b",
"knowledge": [
{
"knowledge_type": "fact",
"subject": "http://knowledge.local/data/aegis",
"predicate": "http://schema.org/softwareRequirements",
"object": "http://dbpedia.org/resource/PostgreSQL",
"confidence": 0.99
}
]
}
Batch request (send an array, get an array):
[
{
"source_url": "https://example.com/a",
"source_type": "bookmark",
"extractor": "n8n",
"knowledge": [{ "knowledge_type": "claim", "subject": "...", "predicate": "...", "object": "...", "confidence": 0.85 }]
},
{
"source_url": "https://example.com/b",
"source_type": "bookmark",
"extractor": "n8n",
"knowledge": [{ "knowledge_type": "claim", "subject": "...", "predicate": "...", "object": "...", "confidence": 0.9 }]
}
]
All three input shapes are accepted. Examples:
// EventInput
{
"knowledge_type": "event",
"subject": "http://knowledge.local/data/payment/2026-03-01",
"occurred_at": "2026-03-01",
"properties": { "amount": "4500", "currency": "GBP" }
}
// TripleInput with temporal bounds (time-bounded fact)
{
"knowledge_type": "temporalfact",
"subject": "http://dbpedia.org/resource/Bitcoin",
"predicate": "http://schema.org/price",
"object": "65000",
"valid_from": "2024-03-01",
"valid_until": "2024-03-31"
}
// TripleInput (Relationship)
{
"knowledge_type": "relationship",
"subject": "http://knowledge.local/data/aegis",
"predicate": "http://schema.org/hasPart",
"object": "http://knowledge.local/data/knowledge-service",
"confidence": 0.99
}
Notes on Entity and Event payloads:
- properties values may be a string or a list of strings. List values expand into one triple per item, all sharing the same predicate.
- For Entity, if the model nests rdf_type or label inside properties, the field is lifted to the top level during validation.
- For Event, an unparseable occurred_at string is coerced to null and the event is dropped (no triples emitted) rather than rejected outright.
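The list-expansion and drop-on-bad-date rules in the notes above can be sketched as follows. This is an illustration only: the function names are hypothetical, property keys are used directly as predicates, ISO 8601 is assumed as the accepted date format, and the sketch omits whatever triple the service emits for occurred_at itself.

```python
from datetime import datetime
from typing import Optional, Union

PropertyValue = Union[str, list[str]]


def expand_properties(subject: str, properties: dict[str, PropertyValue]):
    """Emit one (subject, predicate, value) triple per property value."""
    triples = []
    for predicate, value in properties.items():
        items = value if isinstance(value, list) else [value]
        triples.extend((subject, predicate, item) for item in items)
    return triples


def parse_occurred_at(raw: str) -> Optional[datetime]:
    """Coerce an unparseable timestamp to None instead of raising."""
    try:
        return datetime.fromisoformat(raw)
    except ValueError:
        return None


def event_triples(subject: str, occurred_at: str, properties: dict[str, PropertyValue]):
    """Drop the whole event (no triples) when occurred_at cannot be parsed."""
    if parse_occurred_at(occurred_at) is None:
        return []
    return expand_properties(subject, properties)
```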
GET /api/search?q=cold+exposure+dopamine&limit=10&source_type=article
Searches ingested content by semantic similarity using pgvector cosine distance. Returns chunk-level results: each result is the most relevant chunk of a document, not the full document. Short documents have a single chunk; long documents (≥4000 chars) are split into overlapping chunks.
Parameters:
- q (required): query text
- limit: max results (1–100, default 10)
- source_type: filter by source type (article, video, etc.)
- tags: filter by tags (repeat for multiple: ?tags=health&tags=neuroscience)
Response:
[
{
"content_id": "uuid",
"url": "https://...",
"title": "Cold Exposure and Dopamine",
"summary": "...",
"similarity": 0.94,
"source_type": "article",
"tags": ["health"],
"ingested_at": "2026-03-18T10:00:00Z",
"chunk_text": "The relevant section matching the query...",
"chunk_index": 0
}
]
GET /api/knowledge/query?subject=http://dbpedia.org/resource/Dopamine
Structured query with optional subject, predicate, object filters. Returns triples with confidence, knowledge type, temporal bounds, and provenance.
Parameters: subject, predicate, object (at least one required, all are URIs or literals)
Response:
[
{
"subject": "http://dbpedia.org/resource/Cold_shock_response",
"predicate": "http://knowledge.local/schema/increases",
"object": "http://dbpedia.org/resource/Dopamine",
"confidence": 0.88,
"knowledge_type": "claim",
"valid_from": null,
"valid_until": null,
"provenance": [
{
"source_url": "https://example.com/article",
"source_type": "article",
"confidence": "0.75",
"ingested_at": "2026-03-18T10:00:00+00:00"
}
]
}
]
POST /api/knowledge/sparql
Execute a SPARQL 1.2 SELECT or ASK query against the knowledge graph. Supports RDF-star syntax for querying annotations. Only SELECT and ASK queries are allowed (no INSERT, DELETE, or UPDATE).
Accepts two content types:
JSON body:
{
"query": "SELECT ?s ?p ?o ?conf WHERE { ?s ?p ?o . << ?s ?p ?o >> <http://knowledge.local/schema/confidence> ?conf . FILTER(?conf > 0.8) }"
}
Raw SPARQL body (Content-Type: application/sparql-query):
SELECT ?s ?p ?o ?conf WHERE {
?s ?p ?o .
<< ?s ?p ?o >> <http://knowledge.local/schema/confidence> ?conf .
FILTER(?conf > 0.8)
}
GET /api/knowledge/contradictions?min_confidence=0.5
Surfaces contradictions in the knowledge graph. Detects two patterns:
- Same predicate, different objects: only fires for predicates declared owl:FunctionalProperty in the ontology (e.g. ks:amount, ks:currency); multi-valued predicates like has_property are intentionally excluded
- Opposite predicates: e.g., "increases dopamine" vs "decreases dopamine" (via ks:oppositePredicate declarations)
Two filters keep noise down: pairs that share a chunk_id are dropped (extraction conflation — the LLM emitted two distinct values from one paragraph under one subject URI, not a real disagreement across sources), and pairs whose objects are identical (a SPARQL artefact of opposite-predicate pairs pointing at the same object) are dropped.
Contradiction probability is the product of both claims' confidence scores.
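The scoring rule and the min_confidence threshold can be sketched together. The function names and the dictionary shape of a pair are assumptions for illustration; only the product rule and the threshold comparison come from the text.

```python
def contradiction_probability(conf_a: float, conf_b: float) -> float:
    """Probability that both conflicting claims hold: the product of confidences."""
    return conf_a * conf_b


def filter_pairs(pairs, min_confidence: float = 0.0):
    """Keep pairs whose contradiction probability meets the threshold."""
    kept = []
    for claim_a, claim_b in pairs:
        probability = contradiction_probability(claim_a["confidence"], claim_b["confidence"])
        if probability >= min_confidence:
            kept.append({
                "claim_a": claim_a,
                "claim_b": claim_b,
                "contradiction_probability": probability,
            })
    return kept
```

Because the score is a product, a pair only ranks highly when both claims are individually well supported; one weak claim pulls the whole pair down.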
Parameters:
- min_confidence: filter to pairs where conf_a × conf_b ≥ threshold (default 0.0)
Response:
[
{
"claim_a": {
"subject": "...",
"predicate": "...",
"object": "beneficial",
"confidence": 0.75
},
"claim_b": {
"subject": "...",
"predicate": "...",
"object": "harmful",
"confidence": 0.6
},
"contradiction_probability": 0.45,
"provenance_a": [...],
"provenance_b": [...]
}
]
POST /api/ask
Content-Type: application/json
Ask a natural language question against the knowledge base. Retrieves relevant content (semantic search) and knowledge graph triples, checks for contradictions, and generates an LLM-powered answer grounded in your data.
{
"question": "Does cold exposure increase dopamine?",
"max_sources": 5,
"min_confidence": 0.3
}
Parameters:
- question (required): natural language question (max 4000 chars)
- max_sources: max content items to retrieve (1–20, default 5)
- min_confidence: filter out knowledge triples below this confidence (0.0–1.0, default 0.0)
Response:
{
"answer": "Based on your knowledge base, cold exposure likely increases dopamine...",
"confidence": 0.88,
"sources": [
{
"url": "https://example.com/article",
"title": "Cold Exposure and Dopamine",
"source_type": "article"
}
],
"knowledge_types_used": ["claim"],
"contradictions": [],
"evidence": [{"triple_subject": "...", "triple_predicate": "...", "triple_object": "...", "chunk_text": "...", "source_url": "..."}],
"intent": "graph",
"traversal_depth": 3
}
A built-in web UI for monitoring and querying your knowledge base. Accessible at /admin after logging in.
- Dashboard — stats cards (triples, entities, content, events), confidence distribution chart, knowledge type breakdown, recent ingestion activity
- Knowledge Explorer — searchable, filterable, paginated triple browser with entity detail views and content inspection
- Chat — ask natural language questions against your knowledge base (uses the RAG pipeline), with source citations and confidence scores
- Contradictions — visual side-by-side comparison of conflicting claims with confidence bars
All routes (UI and API) are protected behind a password. Set ADMIN_PASSWORD in your .env file or environment:
ADMIN_PASSWORD=your-password-here
The service will not start without this variable. Visit /login to sign in. No username is needed, just the password.
Sessions last 24 hours (signed cookie). Set SECRET_KEY for persistent sessions across restarts; if omitted, a random key is generated at startup (sessions lost on restart).
Server-rendered Jinja2 templates with Alpine.js and TailwindCSS (CDN). No JS build pipeline — everything ships inside the Python package.
- Python 3.12+
- PostgreSQL 16 with pgvector extension
- An LLM provider (Ollama or LiteLLM)
Ollama runs models locally with zero configuration.
- Install Ollama and pull the required models:
ollama pull nomic-embed-text
ollama pull qwen3:14b
- Start PostgreSQL (via docker-compose):
docker compose up -d postgres
- Install and run:
pip install -e ".[dev]"
cp .env.example .env # defaults work for Ollama; no changes needed
uvicorn knowledge_service.main:app --reload
LiteLLM provides a unified OpenAI-compatible gateway to 100+ LLM providers.
- Deploy LiteLLM with the required models in your litellm_config.yaml:
model_list:
  - model_name: nomic-embed-text
    litellm_params:
      model: ollama/nomic-embed-text
      api_base: http://localhost:11434
  - model_name: qwen3:14b
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://localhost:11434
- Start LiteLLM:
litellm --config litellm_config.yaml
- Configure and run:
pip install -e ".[dev]"
cp .env.example .env
# Edit .env:
# LLM_BASE_URL=http://localhost:4000
# LLM_API_KEY=sk-your-litellm-key
uvicorn knowledge_service.main:app --reload
A pre-built image is available on Docker Hub: arshadansari27/knowledge-service
# Set admin password (required)
export ADMIN_PASSWORD=changeme
# Using docker-compose (builds locally)
docker compose up -d
# Or pull the pre-built image directly
docker pull arshadansari27/knowledge-service:latest
Service available at http://localhost:8000. Admin panel at http://localhost:8000/login. Ollama must be running on the host machine.
All settings via environment variables or .env file:
| Variable | Default | Description |
|---|---|---|
| DATABASE_URL | postgresql://knowledge:knowledge@localhost:5433/knowledge | PostgreSQL connection |
| LLM_BASE_URL | http://localhost:11434 | LLM API endpoint (Ollama or LiteLLM) |
| LLM_API_KEY | (empty) | API key (leave empty for Ollama) |
| LLM_EMBED_MODEL | nomic-embed-text | Embedding model (768-dim vectors) |
| LLM_CHAT_MODEL | qwen3:14b | Chat model for knowledge extraction |
| LLM_RAG_MODEL | (empty) | RAG answer model (defaults to LLM_CHAT_MODEL if empty) |
| OXIGRAPH_DATA_DIR | ./data/oxigraph | RDF store data directory |
| API_HOST | 0.0.0.0 | Bind address |
| API_PORT | 8000 | Port |
| ADMIN_PASSWORD | (required) | Password for admin panel login and API key auth |
| SECRET_KEY | (random at startup) | Session signing key; set it to keep sessions valid across restarts |
| SPACY_DATA_DIR | /app/data/spacy | spaCy Wikidata KB storage directory |
| MAX_UPLOAD_SIZE | 52428800 | Maximum file upload size in bytes (default 50MB) |
| URL_FETCH_TIMEOUT | 30 | Timeout for URL auto-fetch (seconds) |
| NLP_ENTITY_CONFIDENCE | 0.5 | Confidence for spaCy-only fallback entities |
| READER_EXCLUDE_INFLIGHT | true | Exclude in-flight content from hybrid retrieval |
A standalone script to ingest a directory of files and/or a list of URLs in one go.
# Ingest all supported files in a directory (recursive)
uv run python scripts/bulk_ingest.py ./documents/
# Ingest URLs from a file (one per line, # comments and blank lines ignored)
uv run python scripts/bulk_ingest.py --urls urls.txt
# Both at once, with tags and domain hints
uv run python scripts/bulk_ingest.py ./documents/ --urls urls.txt --tags health,research --domains health
# Preview what would be ingested without doing it
uv run python scripts/bulk_ingest.py ./documents/ --dry-run
Supported file extensions: .pdf, .html, .htm, .txt, .md, .json, .csv
Files are uploaded via POST /api/content/upload. URLs are submitted via POST /api/content (server auto-fetches).
| Option | Default | Description |
|---|---|---|
| path (positional) | (none) | Directory to scan recursively |
| --urls FILE | (none) | Text file with one URL per line |
| --server URL | KNOWLEDGE_URL env or http://localhost:8000 | Target server |
| --api-key KEY | KNOWLEDGE_API_KEY env | API key for authentication |
| --tags t1,t2 | (none) | Comma-separated tags for all items |
| --domains d1,d2 | (none) | Comma-separated domain hints for extraction |
| --dry-run | (none) | List items without ingesting |
| --poll-timeout N | 300 | Seconds to wait per item before giving up |
At least one of path or --urls is required.
Items are processed sequentially. For each item, the script POSTs to the server, then polls GET /api/content/{id}/status every 5 seconds until the job completes, fails, or times out. Progress is printed as it goes:
Bulk ingest: 6 items (5 files, 1 URLs)
Target: http://localhost:8000
[1/6] report.pdf .................. OK triples=14 23s
[2/6] notes.txt ................... OK triples=6 8s
[3/6] https://example.com/article . FAIL extraction failed 45s
...
Results: 4 passed, 1 failed, 1 skipped (total: 4m 32s)
Exit code is 0 if all items passed, 1 if any failed.
KNOWLEDGE_URL=https://knowledge.hikmahtech.in \
KNOWLEDGE_API_KEY=your-key \
uv run python scripts/bulk_ingest.py ./documents/ --poll-timeout 600
docker compose up -d
uv run python scripts/bulk_ingest.py ./documents/ --api-key changeme
pytest
All tests mock external dependencies, so no PostgreSQL or LLM provider is required.
GitHub Actions pipeline on every push/merge to main:
- Lint: ruff check + ruff format --check
- Test: pytest tests/ -v (~700 tests)
- Version bump: auto-increments patch version in pyproject.toml, commits back to main, creates vX.Y.Z git tag
- Docker build: builds and pushes to Docker Hub as arshadansari27/knowledge-service:X.Y.Z and :latest
Version is read from pyproject.toml and used as the Docker image tag and git tag. Bump commits include [skip ci] to prevent infinite loops.
Pull requests run lint + test only (no version bump or Docker push).
src/knowledge_service/
├── main.py # FastAPI app factory + lifespan
├── config.py # Settings (pydantic-settings, .env)
├── models.py # Pydantic input shapes (TripleInput / EventInput / EntityInput) + API contracts
├── _utils.py # Shared RDF helpers + JSON extraction from LLM output
├── chunking.py # Markdown-aware text splitting with section headers
├── admin/
│ ├── auth.py # AuthMiddleware, login/logout, rate limiter, session cookies
│ ├── routes.py # Admin page routes (dashboard, knowledge, chat, contradictions)
│ ├── stats.py # /api/admin/stats/* and /api/admin/knowledge/triples endpoints
│ ├── jobs.py # /api/admin/jobs
│ ├── content.py # DELETE /api/admin/knowledge/content/{id} and /knowledge/source
│ └── templates/ # Jinja2 templates (base, dashboard, knowledge, chat, etc.)
├── api/
│ ├── content.py # POST /api/content (JSON + URL auto-fetch)
│ ├── upload.py # POST /api/content/upload (multipart file upload)
│ ├── claims.py # POST /api/claims
│ ├── search.py # GET /api/search
│ ├── knowledge.py # GET /api/knowledge/query, POST /api/knowledge/sparql
│ ├── contradictions.py # GET /api/knowledge/contradictions
│ ├── ask.py # POST /api/ask (RAG question answering)
│ ├── changes.py # GET /api/entity/{id}/changes
│ └── health.py # GET /health
├── parsing/
│ ├── __init__.py # ParserRegistry, ParsedDocument, Parser protocol
│ ├── pdf.py # PdfParser (PyMuPDF)
│ ├── html.py # HtmlParser (readability-lxml + BeautifulSoup)
│ ├── structured.py # StructuredParser (JSON/CSV)
│ └── text.py # TextParser (passthrough)
├── nlp/
│ ├── __init__.py # NlpPhase, NlpResult, NlpEntity
│ └── bootstrap.py # spaCy model + Wikidata KB loading
├── ingestion/
│ ├── pipeline.py # Per-triple processing (delta, insert, contradiction, provenance, inference)
│ ├── worker.py # 5-phase orchestrator (Embed → NLP → Extract → Coref → Process)
│ ├── phases.py # EmbedPhase, ExtractPhase, ProcessPhase
│ ├── coreference.py # CoreferencePhase (Wikidata QID merging)
│ └── outbox.py # OutboxStore + OutboxDrainer (pyoxigraph/PG 2PC)
├── stores/
│ ├── __init__.py # Stores dataclass
│ ├── triples.py # pyoxigraph wrapper — RDF-star, named graphs
│ ├── content.py # ContentStore — metadata + chunks + embeddings
│ ├── entities.py # EntityStore — entity/predicate resolution + aliases
│ ├── provenance.py # ProvenanceStore — per-source evidence rows
│ └── rag.py # RAGRetriever — hybrid chunk + KG retrieval, intent-routed (3 strategies)
├── reasoning/
│ ├── engine.py # InferenceEngine — forward-chaining rules
│ └── noisy_or.py # Noisy-OR evidence combination
├── ontology/
│ ├── uri.py # URI normalization (to_entity_uri, to_predicate_uri, slugify)
│ ├── namespaces.py # ks:, schema:, dc:, skos:, prov: namespace constants
│ ├── registry.py # DomainRegistry — predicate metadata from ontology
│ ├── bootstrap.py # Loads schema.ttl + domains/*.ttl into ks:graph/ontology
│ ├── schema.ttl # Knowledge type classes, properties
│ ├── domains/ # Domain TTL files (base, health, technology, research)
│ └── prompts/ # LLM extraction prompt templates (entities, relations)
├── clients/
│ ├── base.py # BaseLLMClient — shared retry / timeout / auth handling
│ ├── llm.py # EmbeddingClient + ExtractionClient
│ ├── prompt_builder.py # Domain-aware extraction prompts from templates
│ └── rag.py # RAGClient — LLM-powered answer generation
└── migrations/ # SQL migrations (auto-applied at startup)
The system reuses established vocabularies and keeps the custom ks: namespace minimal:
| Domain | Namespace | Purpose |
|---|---|---|
| Content metadata | dc: / dcterms: | Title, creator, date, format, source |
| General entities | schema: (Schema.org) | People, organisations, software, events |
| Topic hierarchies | skos: | Broader/narrower relationships, labels |
| People and social | foaf: | Personal information, social connections |
| Provenance | prov: (PROV-O) | Activity, agent, entity chains |
| Service-specific | ks: | Confidence, knowledge type, temporal validity, extraction metadata |
Deployed to production (~700 tests).
| Capability | What |
|---|---|
| Knowledge model | 3 Pydantic input shapes (TripleInput / EventInput / EntityInput); knowledge_type is preserved on each triple's RDF-star annotation as a lowercase canonical label (claim / fact / event / entity / relationship / …; Relation is collapsed to relationship); temporal validity via valid_from / valid_until |
| RDF store | pyoxigraph, 4 named graphs by provenance class (ontology / asserted / extracted / inferred) |
| Ingestion | Parse (PDF / HTML / CSV / JSON / text) → chunk → embed → spaCy NER + Wikidata linking → LLM extraction → QID-based coreference → ingest |
| Maintenance | Background asyncio task lowercases ks:knowledgeType annotations and remaps spaCy NER labels to schema.org canonical types on a configurable interval; manual trigger at POST /api/admin/maintenance/run |
| Hybrid retrieval | BM25 (OR-tokenised to_tsquery) + pgvector, fused via Reciprocal Rank Fusion |
| RAG endpoint | /api/ask with intent classification (semantic / entity / graph), returns answer + source chunks + triples + contradictions |
| Evidence combination | Noisy-OR across multi-source confidences: 1 - Π(1 − cᵢ) |
| Inference | 3 forward-chaining rules (inverse / transitive / type-inheritance), retraction cascade when source triples change |
| Consistency | Outbox 2PC between pyoxigraph and Postgres, startup drainer recovers from crash-between-commit-and-drain |
| Reader-side filter | In-flight content excluded from retrieval until ingestion reaches a terminal status |
Not currently enforced (declared but not filtered on the read path): graph-level trust tiers, contradictions. Both are surfaced as labels / response fields, not used to re-rank or exclude results.
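The hybrid retrieval row in the table above fuses a BM25 ranking and a vector ranking with Reciprocal Rank Fusion. A minimal sketch of RRF follows; the function name is hypothetical, and the smoothing constant k=60 is the value commonly used in the literature, not one confirmed for this service.

```python
def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked lists of ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["c1", "c2", "c3"]
vector_hits = ["c2", "c3", "c4"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # ['c2', 'c3', 'c1', 'c4']
```

RRF works on ranks rather than raw scores, so BM25 scores and cosine similarities never need to be normalised onto a common scale.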
| Document | Description |
|---|---|
| Architecture notes | Design rationale for Noisy-OR, the outbox, named-graph trust tiers, inference, and the parts deliberately left out |
| API reference | Full endpoint reference with example payloads (interactive docs at /docs when running) |
| Deployment | Production AEGIS stack deployment |
| Demo corpus | Bundled public-domain corpus used by scripts/demo.py |
| Component | Technology |
|---|---|
| API | Python 3.12, FastAPI, uvicorn |
| Knowledge store | pyoxigraph (embedded, RDF 1.2, SPARQL 1.2, RocksDB) |
| Confidence combination | Noisy-OR (4-line function) |
| Inference | 3 forward-chaining rules at ingestion time (inverse / transitive / type-inheritance) |
| Operational store | PostgreSQL 16 |
| Vector search | pgvector (HNSW index, halfvec) |
| Document parsing | PyMuPDF (PDF), readability-lxml + BeautifulSoup (HTML), stdlib (CSV/JSON) |
| NLP pre-pass | spaCy (en_core_web_sm) + spacy-entity-linker (Wikidata KB) |
| LLM gateway | Ollama (local) or LiteLLM (proxy) — any OpenAI-compatible API |
| Embeddings | nomic-embed-text (768-dim) |
| Knowledge extraction | qwen3:14b (auto-extracts from raw_text) |
| Infrastructure | Docker Compose, GitHub Actions CI/CD, Docker Swarm (production) |