Knowledge Service

License: MIT

A personal knowledge service that reads what you give it and turns it into a small, queryable RDF graph with source-traceable provenance. Documents, news, claims, or notes go in; out the other side: structured triples with per-source confidence, contradictions surfaced when sources disagree, and a hybrid RAG endpoint that cites the chunks it grounded each answer in.

Built by Hikmah Technologies | @hikmahtech | @arshadansari27. The primary consumer is AEGIS, where AI agents gain awareness of your accumulated knowledge; the service also stands alone as a reusable knowledge API.

The ontology is the product. Sources are just input channels.


Why it exists

Every "second brain" tool treats all content as equal — a bookmark, a note, a highlight are all flat objects with tags. No confidence. No provenance. No temporal validity. No contradiction detection. No inference. This system separates content (what you consumed) from knowledge (what you derived from it), and models knowledge with:

  • Uncertainty — triples carry a confidence score; when multiple sources assert the same fact, their confidences combine via Noisy-OR (1 − Π(1 − cᵢ)).
  • Provenance — every triple traces back to its source, extraction method, timestamp, and the specific chunk it was derived from.
  • Temporality — knowledge has valid_from / valid_until, not just created_at.
  • Ontological structure — concepts link to established vocabularies (Schema.org, Dublin Core, SKOS) so "PostgreSQL" in your codebase and "PostgreSQL" in an article resolve to the same entity.
  • Inference — inverse, transitive, and type-inheritance rules derive extra triples at ingestion time, with source triples preserved for retraction.

For the design rationale behind the non-obvious choices — Noisy-OR replacing 332 lines of ProbLog, the pyoxigraph ↔ Postgres outbox, named graphs as trust labels rather than filters — see docs/architecture.md.


Five-minute demo

The repo ships with a small public-domain corpus (eight short documents covering the November 2023 OpenAI board weekend) and a script that ingests it, runs the read-side APIs, and prints what came out.

export ADMIN_PASSWORD=changeme
docker compose up -d
uv run python scripts/demo.py --api-key changeme

Expect to see, in order: ingestion progress per document; a summary of how many triples landed in each named graph (ontology / asserted / extracted / inferred); the contradictions the engine surfaced (e.g. OpenAI's CEO predicate resolving to four different people over five days); and three RAG answers with source citations and evidence snippets.

The corpus lives in examples/openai-nov-2023/ and is paraphrased synthesis of publicly reported events — not journalism, not from any single outlet, MIT-licensed alongside the rest of the repo. Point the script at your own directory of .md files with the same frontmatter format to swap in a different corpus.


Design highlights

If you only have time to read one section of the architecture doc, read the Noisy-OR story — it's the clearest "right primitive at the right altitude" lesson the codebase carries.

  • Named graphs as trust labels, not filters — five named graphs separate triples by provenance class. The graph a triple lives in is surfaced to readers as a trust_tier label, but retrieval is tier-agnostic. Filtering on tier is a choice the caller makes, not the system.
  • Noisy-OR vs ProbLog: 332 lines to 4 — multi-source confidence combination, in its entirety, is one stdlib import and four lines. The story of how it got there from a 332-line probabilistic-logic engine is the headline architectural lesson of this project.
  • The outbox: two stores, one truth — triples live in pyoxigraph, provenance lives in PostgreSQL, and a single transaction can't cover both. A transactional outbox plus a startup-time drainer keeps the two consistent across process crashes. Every operation is idempotent by construction.
  • Reader-side status filter: /api/search and /api/ask only return content whose latest ingestion job has reached a terminal status. Without this, in-flight content matches by chunk before its KG triples have committed (the half-picture problem).
  • Forward-chaining inference — three rules (inverse, transitive, type-inheritance), BFS with a depth cap and cycle detection, retraction cascade when source triples change. Every rule guards against literal objects, a generalisation of a real production bug.
  • Two-phase extraction + Wikidata-QID coreference — entities are extracted first so their URIs are available to the relation pass. Same-QID entities across documents merge deterministically before triples reach the store.
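
The forward-chaining bullet above can be illustrated with a minimal sketch. This is not the project's InferenceEngine: the rule tables, the quote-prefix convention for literals, and the depth cap of 3 are all assumptions made for illustration, and only the inverse and transitive rules are shown.

```python
from collections import deque

# Hypothetical rule tables; in the service these come from the ontology.
INVERSE = {"depends_on": "required_by"}
TRANSITIVE = {"part_of"}
MAX_DEPTH = 3  # assumed depth cap


def is_literal(term):
    # Assumption for this sketch: literals are written with a leading quote.
    return term.startswith('"')


def infer(asserted):
    """Return {derived_triple: [source_triples]} using breadth-first chaining."""
    known = set(asserted)
    derived = {}
    queue = deque((triple, 0) for triple in asserted)
    while queue:
        (s, p, o), depth = queue.popleft()
        # Guard: a literal object can never become the subject of a new triple.
        if depth >= MAX_DEPTH or is_literal(o):
            continue
        candidates = []
        if p in INVERSE:
            candidates.append(((o, INVERSE[p], s), [(s, p, o)]))
        if p in TRANSITIVE:
            for s2, p2, o2 in list(known):
                if p2 == p and s2 == o and not is_literal(o2):
                    candidates.append(((s, p, o2), [(s, p, o), (s2, p2, o2)]))
        for triple, sources in candidates:
            # Cycle detection: skip self-loops and anything already known.
            if triple in known or triple[0] == triple[2]:
                continue
            known.add(triple)
            derived[triple] = sources
            queue.append((triple, depth + 1))
    return derived
```

Keeping the source triples next to each derived triple is what would make a retraction cascade possible: when a source is removed, every derived triple that lists it can be found and dropped.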

Architecture

Single FastAPI process with embedded components. No microservices.

FastAPI Process
├── ParserRegistry     Pluggable document parsing (PDF, HTML, CSV, JSON, images)
├── NlpPhase           spaCy NER + Wikidata entity linking pre-pass
├── CoreferencePhase   Entity dedup by shared Wikidata QID
├── TripleStore        pyoxigraph — RDF 1.2, RDF-star, 5 named graphs by provenance
├── InferenceEngine    3 forward-chaining rules (inverse, transitive, type inheritance)
├── QueryClassifier    Intent routing (semantic / entity / graph), LLM-classified
├── RAGRetriever       Hybrid chunk retrieval (BM25 + vector RRF) + KG triple context
├── ContentStore       PostgreSQL + pgvector — BM25 + vector hybrid search (RRF)
├── ExtractionClient   LLM extraction with retry-on-5xx/timeout
└── ProvenanceStore    Per-source evidence rows with chunk_id FK

Pipeline: Parse → Chunk → Embed → NLP Pre-pass → Extract → Coreference → Process
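
The component list above mentions fusing BM25 and vector rankings with Reciprocal Rank Fusion (RRF). A minimal sketch of RRF follows; the function name and the constant k=60 (a commonly used default) are assumptions, not values taken from this codebase.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the order is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


bm25_hits = ["c1", "c2", "c3"]    # hypothetical BM25 ranking
vector_hits = ["c3", "c1", "c4"]  # hypothetical vector ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

RRF only needs ranks, not raw scores, so BM25 scores and cosine distances never have to be normalised onto a common scale.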

PostgreSQL
├── content_metadata   Document metadata (url, title, source_type, tags, raw_text)
├── content            Chunks with embeddings, section headers, full-text search
├── provenance         Per-source evidence rows with chunk_id FK
├── entity_embeddings  Entity URIs with embeddings for resolution
├── entity_aliases     Coreference alias → canonical URI mappings
├── ingestion_jobs     Async job tracking with per-phase progress
└── triple_outbox      Staged pyoxigraph writes — drained after PG commit
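
The entity_aliases table stores alias to canonical URI mappings produced by QID-based coreference. A minimal sketch of how such a mapping could be built is shown below. The QIDs are placeholders, and choosing the lexicographically smallest URI as canonical is an assumption made only to keep the merge deterministic.

```python
def build_alias_map(entities):
    """entities: iterable of (uri, qid_or_None). Returns {alias_uri: canonical_uri}."""
    by_qid = {}
    for uri, qid in entities:
        if qid is not None:  # entities without a QID are never merged
            by_qid.setdefault(qid, set()).add(uri)
    aliases = {}
    for uris in by_qid.values():
        canonical = min(uris)  # assumed tie-break rule
        for uri in uris:
            if uri != canonical:
                aliases[uri] = canonical
    return aliases


def rewrite(triple, aliases):
    """Replace aliased subject and object URIs before the triple reaches the store."""
    s, p, o = triple
    return (aliases.get(s, s), p, aliases.get(o, o))


entities = [
    ("http://knowledge.local/data/openai", "Q1"),      # placeholder QID
    ("http://knowledge.local/data/openai_inc", "Q1"),  # same QID, so merged
    ("http://knowledge.local/data/microsoft", "Q2"),
    ("http://knowledge.local/data/unlinked", None),
]
aliases = build_alias_map(entities)
```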

ProcessPhase consistency. Triples live in pyoxigraph, provenance lives in PostgreSQL, and a single transaction cannot cover both. Each per-triple write is staged as a row in triple_outbox inside the same PG transaction as its provenance row; an OutboxDrainer applies staged rows to pyoxigraph after commit, and re-runs on application startup to recover from crashes between commit and drain. Every outbox operation is idempotent (content-addressed inserts, SPARQL ASK-guarded RDF-star annotations). See docs/superpowers/specs/2026-04-15-processphase-2pc-outbox-design.md.
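
The outbox flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's OutboxStore or OutboxDrainer: SQLite stands in for PostgreSQL, a plain dict stands in for pyoxigraph, and the table columns and function names are hypothetical.

```python
import hashlib
import sqlite3


def triple_hash(s, p, o):
    # Content address: the same triple always maps to the same key.
    return hashlib.sha256(f"{s}\x1f{p}\x1f{o}".encode()).hexdigest()


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE provenance (triple_hash TEXT, source_url TEXT, confidence REAL);
        CREATE TABLE triple_outbox (
            id INTEGER PRIMARY KEY, triple_hash TEXT, s TEXT, p TEXT, o TEXT,
            applied INTEGER NOT NULL DEFAULT 0);
    """)
    return conn


def stage(conn, triple, source_url, confidence):
    """Write the provenance row and the outbox row in one transaction."""
    h = triple_hash(*triple)
    with conn:  # both rows commit together, or neither does
        conn.execute("INSERT INTO provenance VALUES (?, ?, ?)",
                     (h, source_url, confidence))
        conn.execute(
            "INSERT INTO triple_outbox (triple_hash, s, p, o) VALUES (?, ?, ?, ?)",
            (h, *triple))


def drain(conn, graph):
    """Apply unapplied outbox rows to the graph; safe to re-run after a crash."""
    rows = conn.execute(
        "SELECT id, triple_hash, s, p, o FROM triple_outbox "
        "WHERE applied = 0 ORDER BY id").fetchall()
    for row_id, h, s, p, o in rows:
        graph[h] = (s, p, o)  # content-addressed write: replaying is a no-op
        with conn:
            conn.execute("UPDATE triple_outbox SET applied = 1 WHERE id = ?",
                         (row_id,))
    return len(rows)
```

If the process dies after a graph write but before the row is marked applied, the next drain replays the same write, which is why idempotency matters here.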

Reader-side status filtering. /api/search, /api/ask, and RAGRetriever only return content whose latest ingestion_jobs.status is terminal (completed or failed), or has no job row. In-flight content is hidden in SQL via a LEFT JOIN LATERAL against ingestion_jobs, so chunks without their KG triples never reach the retriever. Controlled by READER_EXCLUDE_INFLIGHT (default true). /api/content/{id}/chunks is deliberately exempt. See docs/superpowers/specs/2026-04-15-reader-status-filter-design.md.
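
The visibility rule above is implemented in SQL by the service. The sketch below restates the same rule in plain Python to make the three cases explicit; the `created_at` ordering field and the function name are assumptions.

```python
TERMINAL = {"completed", "failed"}


def visible_content_ids(content_ids, jobs, exclude_inflight=True):
    """Return the content ids a reader may see.

    jobs is an iterable of (content_id, created_at, status). Only the latest
    job per content id counts, mirroring the LATERAL join on ingestion_jobs.
    """
    latest = {}
    for content_id, created_at, status in jobs:
        if content_id not in latest or created_at > latest[content_id][0]:
            latest[content_id] = (created_at, status)
    visible = []
    for content_id in content_ids:
        job = latest.get(content_id)
        # Visible when filtering is off, no job row exists, or the latest job is terminal.
        if not exclude_inflight or job is None or job[1] in TERMINAL:
            visible.append(content_id)
    return visible


jobs = [
    ("a", 1, "completed"),
    ("a", 2, "extracting"),  # re-ingestion in flight hides "a"
    ("b", 1, "failed"),
]
```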

For deployment details, see docs/deployment.md.


Knowledge Types

The schema accepts three Pydantic input shapes (TripleInput / EventInput / EntityInput). The knowledge_type field is a free-form label that is preserved on each triple's RDF-star annotation and shown in the admin browser, but it does not drive validation — Pydantic resolves the union by shape.

Conventional labels:

| Label | Shape | Truth model | Example |
| --- | --- | --- | --- |
| Claim | TripleInput | Probabilistic (0.0–1.0) | "Intermittent fasting reduces inflammation" (0.7 from a YouTube video) |
| Fact | TripleInput | High-confidence (≥0.9) | "Project AEGIS uses PostgreSQL 16" (from codebase scan) |
| Relationship | TripleInput | Typed link between entities | "AEGIS depends-on PostgreSQL" |
| Event | EventInput | Timestamped occurrence | Salary payment received 2026-03-01 |
| Entity | EntityInput | Typed, ontology-linked | "AEGIS is a schema:SoftwareApplication" |

Time-bounded facts use TripleInput with valid_from and valid_until. Earlier versions of this doc described separate Conclusion and TemporalState shapes with custom field names — those shapes had no Pydantic model and the extraction prompts no longer emit them.
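
As a small illustration of how valid_from and valid_until might be interpreted by a consumer, the sketch below checks whether a fact holds on a given day. The inclusive bounds and the treatment of a missing bound as open-ended are assumptions; the service's own semantics may differ.

```python
from datetime import date


def is_valid_on(day, valid_from=None, valid_until=None):
    """True if `day` falls inside the validity window (assumed inclusive)."""
    if valid_from is not None and day < valid_from:
        return False
    if valid_until is not None and day > valid_until:
        return False
    return True


# A fact valid for March 2024 only.
window = {"valid_from": date(2024, 3, 1), "valid_until": date(2024, 3, 31)}
in_window = is_valid_on(date(2024, 3, 15), **window)
after_window = is_valid_on(date(2024, 4, 1), **window)
```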


Confidence Model

Two-layer design:

  1. RDF-star annotation on each triple — the combined confidence after Noisy-OR over all sources:

    <<:cold_exposure :increases :dopamine>>
        ks:confidence "0.88"^^xsd:float .
  2. PostgreSQL provenance table — one row per source per triple:

    (triple_hash, source_url, source_type, extractor, confidence, ingested_at, valid_from, valid_until)
    

When the same claim arrives from multiple sources, Noisy-OR combines their individual confidences:

combined = 1 - product(1 - ci)

# Example: source A at 0.7, source B at 0.6
combined = 1 - (0.3 × 0.4) = 0.88

The combined value is written back to the RDF-star annotation. The inference engine propagates confidence through forward-chaining rules (inverse, transitive, type inheritance) for derived conclusions.
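
The Noisy-OR formula above fits in a few lines of Python. This is a sketch written from the formula, not a copy of reasoning/noisy_or.py, and the function name is an assumption.

```python
from math import prod


def noisy_or(confidences):
    """Combine per-source confidences as 1 - product(1 - c_i)."""
    return 1.0 - prod(1.0 - c for c in confidences)


two_sources = noisy_or([0.7, 0.6])         # the worked example above, about 0.88
three_sources = noisy_or([0.7, 0.6, 0.5])  # a third source raises it to about 0.94
```

Each additional agreeing source can only raise the combined value, and a single source is returned unchanged.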


API Reference

Base URL: http://localhost:8000
Interactive docs: http://localhost:8000/docs

Health

GET /health

Returns status of all components (knowledge store, PostgreSQL, LLM API).


Ingest Content

POST /api/content
Content-Type: application/json

Ingest content with knowledge items. Parses documents (auto-detects format), chunks text, runs NLP pre-pass + LLM extraction, resolves entities via coreference, and writes triples. If url is provided without raw_text and starts with http, the URL is fetched and parsed automatically.

Accepts a single object or a JSON array for batch processing.

Single request:

{
  "url": "https://example.com/article",
  "title": "Cold Exposure and Dopamine",
  "summary": "A review of studies on cold exposure effects.",
  "raw_text": "...",
  "source_type": "article",
  "tags": ["health", "neuroscience"],
  "metadata": {},
  "knowledge": [
    {
      "knowledge_type": "claim",
      "subject": "http://dbpedia.org/resource/Cold_shock_response",
      "predicate": "http://knowledge.local/schema/increases",
      "object": "http://dbpedia.org/resource/Dopamine",
      "confidence": 0.75
    },
    {
      "knowledge_type": "entity",
      "uri": "http://dbpedia.org/resource/Dopamine",
      "rdf_type": "schema:ChemicalSubstance",
      "label": "Dopamine",
      "properties": {}
    }
  ]
}

Response (202 Accepted):

{
  "content_id": "uuid",
  "job_id": "uuid",
  "status": "accepted",
  "chunks_total": 5,
  "chunks_capped_from": null
}

Processing happens asynchronously. Poll /api/content/{content_id}/status for progress.
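
A minimal client sketch for the submit-then-poll flow follows, using only the standard library. Several details are assumptions: that the X-API-Key header shown for uploads also applies to this endpoint, that the status response carries a "status" field, and the 5-second interval and 300-second timeout (borrowed from the bulk ingest CLI defaults). The helper names are hypothetical.

```python
import json
import time
import urllib.request

TERMINAL = {"completed", "failed"}


def build_ingest_request(base_url, api_key, payload):
    """Build (but do not send) the POST /api/content request."""
    return urllib.request.Request(
        f"{base_url}/api/content",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def fetch_status(base_url, api_key, content_id):
    """GET /api/content/{content_id}/status and decode the JSON body."""
    req = urllib.request.Request(
        f"{base_url}/api/content/{content_id}/status",
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_job(get_status, interval=5.0, timeout=300.0, sleep=time.sleep):
    """Call get_status() until the job is terminal or the timeout elapses."""
    waited = 0.0
    while True:
        job = get_status()
        if job.get("status") in TERMINAL:
            return job
        if waited >= timeout:
            raise TimeoutError(f"job still {job.get('status')!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

Against a running server, the pieces would be combined roughly as: send the request with `urllib.request.urlopen`, read `content_id` from the 202 response, then call `wait_for_job(lambda: fetch_status(base_url, api_key, content_id))`.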

Batch request — send an array, get an array:

[
  { "url": "https://a.com", "title": "Article A", "source_type": "article" },
  { "url": "https://b.com", "title": "Article B", "source_type": "article" }
]

Upload File

POST /api/content/upload
Content-Type: multipart/form-data

Upload a file (PDF, HTML, CSV, JSON, plain text) for ingestion. Format is auto-detected from filename, content-type, or magic bytes. Returns 202 with a job ID — poll /api/content/{id}/status for progress.

curl -X POST http://localhost:8000/api/content/upload \
  -H "X-API-Key: your-password" \
  -F "file=@paper.pdf;type=application/pdf" \
  -F "title=Research Paper" \
  -F "source_type=paper"

Response (202): {"content_id": "uuid", "job_id": "uuid", "chunks_total": 3}


Check Ingestion Status

GET /api/content/{content_id}/status

Returns job status with progress counters. Status values: embedding → analyzing → extracting → resolving → processing → completed / failed.


Ingest Claims Directly

POST /api/claims
Content-Type: application/json

Ingest knowledge items without storing raw content. Useful for programmatic ingestion where content storage is not needed.

Accepts a single object or a JSON array for batch processing. When an array is sent, a matching array of responses is returned.

Single request:

{
  "source_url": "https://example.com/paper",
  "source_type": "paper",
  "extractor": "llm_qwen3:14b",
  "knowledge": [
    {
      "knowledge_type": "fact",
      "subject": "http://knowledge.local/data/aegis",
      "predicate": "http://schema.org/softwareRequirements",
      "object": "http://dbpedia.org/resource/PostgreSQL",
      "confidence": 0.99
    }
  ]
}

Batch request — send an array, get an array:

[
  {
    "source_url": "https://example.com/a",
    "source_type": "bookmark",
    "extractor": "n8n",
    "knowledge": [{ "knowledge_type": "claim", "subject": "...", "predicate": "...", "object": "...", "confidence": 0.85 }]
  },
  {
    "source_url": "https://example.com/b",
    "source_type": "bookmark",
    "extractor": "n8n",
    "knowledge": [{ "knowledge_type": "claim", "subject": "...", "predicate": "...", "object": "...", "confidence": 0.9 }]
  }
]

All three input shapes are accepted. Examples:

// EventInput
{
  "knowledge_type": "event",
  "subject": "http://knowledge.local/data/payment/2026-03-01",
  "occurred_at": "2026-03-01",
  "properties": { "amount": "4500", "currency": "GBP" }
}

// TripleInput with temporal bounds (time-bounded fact)
{
  "knowledge_type": "temporalfact",
  "subject": "http://dbpedia.org/resource/Bitcoin",
  "predicate": "http://schema.org/price",
  "object": "65000",
  "valid_from": "2024-03-01",
  "valid_until": "2024-03-31"
}

// TripleInput (Relationship)
{
  "knowledge_type": "relationship",
  "subject": "http://knowledge.local/data/aegis",
  "predicate": "http://schema.org/hasPart",
  "object": "http://knowledge.local/data/knowledge-service",
  "confidence": 0.99
}

Notes on Entity and Event payloads:

  • properties values may be a string or a list of strings. List values expand into one triple per item, all sharing the same predicate.
  • For Entity, if the model nests rdf_type or label inside properties, the field is lifted to the top level during validation.
  • For Event, an unparseable occurred_at string is coerced to null and the event is dropped (no triples emitted) rather than rejected outright.
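
The first note above (list values expand into one triple per item) can be sketched as follows. The function name is hypothetical, and using the property key directly as the predicate is a simplification; the service presumably maps keys to predicate URIs.

```python
def expand_properties(subject, properties):
    """Turn a properties dict into (subject, predicate, value) triples.

    A string value yields one triple; a list yields one triple per item,
    all sharing the same predicate.
    """
    triples = []
    for predicate, value in properties.items():
        values = value if isinstance(value, list) else [value]
        triples.extend((subject, predicate, item) for item in values)
    return triples


event = "http://knowledge.local/data/payment/2026-03-01"
triples = expand_properties(event, {"amount": "4500", "tag": ["salary", "income"]})
```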

Semantic Search

GET /api/search?q=cold+exposure+dopamine&limit=10&source_type=article

Searches ingested content by semantic similarity using pgvector cosine distance. Returns chunk-level results — each result is the most relevant chunk of a document, not the full document. Short documents have a single chunk; long documents (≥4000 chars) are split into overlapping chunks.

Parameters:

  • q (required) — query text
  • limit — max results (1–100, default 10)
  • source_type — filter by source type (article, video, etc.)
  • tags — filter by tags (repeat for multiple: ?tags=health&tags=neuroscience)

Response:

[
  {
    "content_id": "uuid",
    "url": "https://...",
    "title": "Cold Exposure and Dopamine",
    "summary": "...",
    "similarity": 0.94,
    "source_type": "article",
    "tags": ["health"],
    "ingested_at": "2026-03-18T10:00:00Z",
    "chunk_text": "The relevant section matching the query...",
    "chunk_index": 0
  }
]

Query the Knowledge Graph

GET /api/knowledge/query?subject=http://dbpedia.org/resource/Dopamine

Structured query with optional subject, predicate, object filters. Returns triples with confidence, knowledge type, temporal bounds, and provenance.

Parameters: subject, predicate, object (at least one required, all are URIs or literals)

Response:

[
  {
    "subject": "http://dbpedia.org/resource/Cold_shock_response",
    "predicate": "http://knowledge.local/schema/increases",
    "object": "http://dbpedia.org/resource/Dopamine",
    "confidence": 0.88,
    "knowledge_type": "claim",
    "valid_from": null,
    "valid_until": null,
    "provenance": [
      {
        "source_url": "https://example.com/article",
        "source_type": "article",
        "confidence": "0.75",
        "ingested_at": "2026-03-18T10:00:00+00:00"
      }
    ]
  }
]

Raw SPARQL Query

POST /api/knowledge/sparql

Execute a SPARQL 1.2 SELECT or ASK query against the knowledge graph. Supports RDF-star syntax for querying annotations. Only SELECT and ASK queries are allowed (no INSERT, DELETE, or UPDATE).

Accepts two content types:

JSON body:

{
  "query": "SELECT ?s ?p ?o ?conf WHERE { ?s ?p ?o . << ?s ?p ?o >> <http://knowledge.local/schema/confidence> ?conf . FILTER(?conf > 0.8) }"
}

Raw SPARQL body (Content-Type: application/sparql-query):

SELECT ?s ?p ?o ?conf WHERE {
  ?s ?p ?o .
  << ?s ?p ?o >> <http://knowledge.local/schema/confidence> ?conf .
  FILTER(?conf > 0.8)
}

Contradictions

GET /api/knowledge/contradictions?min_confidence=0.5

Surfaces contradictions in the knowledge graph. Detects two patterns:

  • Same predicate, different objects — only fires for predicates declared owl:FunctionalProperty in the ontology (e.g. ks:amount, ks:currency); multi-valued predicates like has_property are intentionally excluded
  • Opposite predicates — e.g., "increases dopamine" vs "decreases dopamine" (via ks:oppositePredicate declarations)

Two filters keep noise down: pairs that share a chunk_id are dropped (extraction conflation — the LLM emitted two distinct values from one paragraph under one subject URI, not a real disagreement across sources), and pairs whose objects are identical (a SPARQL artefact of opposite-predicate pairs pointing at the same object) are dropped.

Contradiction probability is the product of both claims' confidence scores.

Parameters:

  • min_confidence — filter to pairs where conf_a × conf_b ≥ threshold (default 0.0)

Response:

[
  {
    "claim_a": {
      "subject": "...",
      "predicate": "...",
      "object": "beneficial",
      "confidence": 0.75
    },
    "claim_b": {
      "subject": "...",
      "predicate": "...",
      "object": "harmful",
      "confidence": 0.6
    },
    "contradiction_probability": 0.45,
    "provenance_a": [...],
    "provenance_b": [...]
  }
]
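
The scoring and the min_confidence filter described above amount to a product and a threshold. A minimal sketch, with a hypothetical function name and input shape:

```python
def score_contradictions(pairs, min_confidence=0.0):
    """pairs: iterable of (claim_a, claim_b) dicts, each with a 'confidence' key.

    Keeps pairs whose conf_a * conf_b is at least min_confidence.
    """
    results = []
    for claim_a, claim_b in pairs:
        probability = claim_a["confidence"] * claim_b["confidence"]
        if probability >= min_confidence:
            results.append({
                "claim_a": claim_a,
                "claim_b": claim_b,
                "contradiction_probability": probability,
            })
    return results


pairs = [
    ({"object": "beneficial", "confidence": 0.75}, {"object": "harmful", "confidence": 0.6}),
    ({"object": "increases", "confidence": 0.9}, {"object": "decreases", "confidence": 0.8}),
]
all_pairs = score_contradictions(pairs)
strong_pairs = score_contradictions(pairs, min_confidence=0.5)
```

The first pair scores about 0.45 (matching the response above) and is dropped at a threshold of 0.5, while the second scores about 0.72 and is kept.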

Ask a Question (RAG)

POST /api/ask
Content-Type: application/json

Ask a natural language question against the knowledge base. Retrieves relevant content (semantic search) and knowledge graph triples, checks for contradictions, and generates an LLM-powered answer grounded in your data.

{
  "question": "Does cold exposure increase dopamine?",
  "max_sources": 5,
  "min_confidence": 0.3
}

Parameters:

  • question (required) — natural language question (max 4000 chars)
  • max_sources — max content items to retrieve (1–20, default 5)
  • min_confidence — filter out knowledge triples below this confidence (0.0–1.0, default 0.0)

Response:

{
  "answer": "Based on your knowledge base, cold exposure likely increases dopamine...",
  "confidence": 0.88,
  "sources": [
    {
      "url": "https://example.com/article",
      "title": "Cold Exposure and Dopamine",
      "source_type": "article"
    }
  ],
  "knowledge_types_used": ["claim"],
  "contradictions": [],
  "evidence": [{"triple_subject": "...", "triple_predicate": "...", "triple_object": "...", "chunk_text": "...", "source_url": "..."}],
  "intent": "graph",
  "traversal_depth": 3
}

Admin Panel

A built-in web UI for monitoring and querying your knowledge base. Accessible at /admin after logging in.

Features

  • Dashboard — stats cards (triples, entities, content, events), confidence distribution chart, knowledge type breakdown, recent ingestion activity
  • Knowledge Explorer — searchable, filterable, paginated triple browser with entity detail views and content inspection
  • Chat — ask natural language questions against your knowledge base (uses the RAG pipeline), with source citations and confidence scores
  • Contradictions — visual side-by-side comparison of conflicting claims with confidence bars

Authentication

All routes (UI and API) are protected behind a password. Set ADMIN_PASSWORD in your .env file or environment:

ADMIN_PASSWORD=your-password-here

The service will not start without this variable. Visit /login to sign in — no username needed, just the password.

Sessions last 24 hours (signed cookie). Set SECRET_KEY for persistent sessions across restarts; if omitted, a random key is generated at startup (sessions lost on restart).

Tech

Server-rendered Jinja2 templates with Alpine.js and TailwindCSS (CDN). No JS build pipeline — everything ships inside the Python package.


Running Locally

Prerequisites

  • Python 3.12+
  • PostgreSQL 16 with pgvector extension
  • An LLM provider (Ollama or LiteLLM)

Option A: Ollama (simplest)

Ollama runs models locally with zero configuration.

  1. Install Ollama and pull the required models:
ollama pull nomic-embed-text
ollama pull qwen3:14b
  2. Start PostgreSQL (via docker-compose):
docker compose up -d postgres
  3. Install and run:
pip install -e ".[dev]"
cp .env.example .env   # defaults work for Ollama; no changes needed
uvicorn knowledge_service.main:app --reload

Option B: LiteLLM Proxy

LiteLLM provides a unified OpenAI-compatible gateway to 100+ LLM providers.

  1. Deploy LiteLLM with the required models in your litellm_config.yaml:
model_list:
  - model_name: nomic-embed-text
    litellm_params:
      model: ollama/nomic-embed-text
      api_base: http://localhost:11434
  - model_name: qwen3:14b
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://localhost:11434
  2. Start LiteLLM:
litellm --config litellm_config.yaml
  3. Configure and run:
pip install -e ".[dev]"
cp .env.example .env
# Edit .env:
#   LLM_BASE_URL=http://localhost:4000
#   LLM_API_KEY=sk-your-litellm-key
uvicorn knowledge_service.main:app --reload

With Docker (full stack)

A pre-built image is available on Docker Hub: arshadansari27/knowledge-service

# Set admin password (required)
export ADMIN_PASSWORD=changeme

# Using docker-compose (builds locally)
docker compose up -d

# Or pull the pre-built image directly
docker pull arshadansari27/knowledge-service:latest

Service available at http://localhost:8000. Admin panel at http://localhost:8000/login. Ollama must be running on the host machine.

Configuration

All settings via environment variables or .env file:

| Variable | Default | Description |
| --- | --- | --- |
| DATABASE_URL | postgresql://knowledge:knowledge@localhost:5433/knowledge | PostgreSQL connection |
| LLM_BASE_URL | http://localhost:11434 | LLM API endpoint (Ollama or LiteLLM) |
| LLM_API_KEY | (empty) | API key (leave empty for Ollama) |
| LLM_EMBED_MODEL | nomic-embed-text | Embedding model (768-dim vectors) |
| LLM_CHAT_MODEL | qwen3:14b | Chat model for knowledge extraction |
| LLM_RAG_MODEL | (empty) | RAG answer model (defaults to LLM_CHAT_MODEL if empty) |
| OXIGRAPH_DATA_DIR | ./data/oxigraph | RDF store data directory |
| API_HOST | 0.0.0.0 | Bind address |
| API_PORT | 8000 | Port |
| ADMIN_PASSWORD | (required) | Password for admin panel login and API key auth |
| SECRET_KEY | (random key generated at startup) | Session signing key; set it to keep sessions valid across restarts |
| SPACY_DATA_DIR | /app/data/spacy | spaCy Wikidata KB storage directory |
| MAX_UPLOAD_SIZE | 52428800 | Maximum file upload size in bytes (default 50MB) |
| URL_FETCH_TIMEOUT | 30 | Timeout for URL auto-fetch (seconds) |
| NLP_ENTITY_CONFIDENCE | 0.5 | Confidence for spaCy-only fallback entities |
| READER_EXCLUDE_INFLIGHT | true | Exclude in-flight content from hybrid retrieval |

Bulk Ingest CLI

A standalone script to ingest a directory of files and/or a list of URLs in one go.

Usage

# Ingest all supported files in a directory (recursive)
uv run python scripts/bulk_ingest.py ./documents/

# Ingest URLs from a file (one per line, # comments and blank lines ignored)
uv run python scripts/bulk_ingest.py --urls urls.txt

# Both at once, with tags and domain hints
uv run python scripts/bulk_ingest.py ./documents/ --urls urls.txt --tags health,research --domains health

# Preview what would be ingested without doing it
uv run python scripts/bulk_ingest.py ./documents/ --dry-run

Supported file types

.pdf, .html, .htm, .txt, .md, .json, .csv

Files are uploaded via POST /api/content/upload. URLs are submitted via POST /api/content (server auto-fetches).

Options

| Option | Default | Description |
| --- | --- | --- |
| path | (positional) | Directory to scan recursively |
| --urls FILE | | Text file with one URL per line |
| --server URL | KNOWLEDGE_URL env or http://localhost:8000 | Target server |
| --api-key KEY | KNOWLEDGE_API_KEY env | API key for authentication |
| --tags t1,t2 | | Comma-separated tags for all items |
| --domains d1,d2 | | Comma-separated domain hints for extraction |
| --dry-run | | List items without ingesting |
| --poll-timeout N | 300 | Seconds to wait per item before giving up |

At least one of path or --urls is required.

How it works

Items are processed sequentially. For each item, the script POSTs to the server, then polls GET /api/content/{id}/status every 5 seconds until the job completes, fails, or times out. Progress is printed as it goes:

Bulk ingest: 6 items (5 files, 1 URLs)
Target: http://localhost:8000

  [1/6] report.pdf .................. OK  triples=14  23s
  [2/6] notes.txt ................... OK  triples=6   8s
  [3/6] https://example.com/article . FAIL  extraction failed  45s
  ...

Results: 4 passed, 1 failed, 1 skipped (total: 4m 32s)

Exit code is 0 if all items passed, 1 if any failed.

Against production

KNOWLEDGE_URL=https://knowledge.hikmahtech.in \
KNOWLEDGE_API_KEY=your-key \
  uv run python scripts/bulk_ingest.py ./documents/ --poll-timeout 600

Against local docker-compose

docker compose up -d
uv run python scripts/bulk_ingest.py ./documents/ --api-key changeme

Running Tests

pytest

All tests mock external dependencies — no PostgreSQL or LLM provider required.


CI/CD

GitHub Actions pipeline on every push/merge to main:

  1. Lint: ruff check + ruff format --check
  2. Test: pytest tests/ -v (~700 tests)
  3. Version bump — auto-increments patch version in pyproject.toml, commits back to main, creates vX.Y.Z git tag
  4. Docker build — builds and pushes to Docker Hub as arshadansari27/knowledge-service:X.Y.Z and :latest

Version is read from pyproject.toml and used as the Docker image tag and git tag. Bump commits include [skip ci] to prevent infinite loops.

Pull requests run lint + test only (no version bump or Docker push).


Project Structure

src/knowledge_service/
├── main.py                  # FastAPI app factory + lifespan
├── config.py                # Settings (pydantic-settings, .env)
├── models.py                # Pydantic input shapes (TripleInput / EventInput / EntityInput) + API contracts
├── _utils.py                # Shared RDF helpers + JSON extraction from LLM output
├── chunking.py              # Markdown-aware text splitting with section headers
├── admin/
│   ├── auth.py              # AuthMiddleware, login/logout, rate limiter, session cookies
│   ├── routes.py            # Admin page routes (dashboard, knowledge, chat, contradictions)
│   ├── stats.py             # /api/admin/stats/* and /api/admin/knowledge/triples endpoints
│   ├── jobs.py              # /api/admin/jobs
│   ├── content.py           # DELETE /api/admin/knowledge/content/{id} and /knowledge/source
│   └── templates/           # Jinja2 templates (base, dashboard, knowledge, chat, etc.)
├── api/
│   ├── content.py           # POST /api/content (JSON + URL auto-fetch)
│   ├── upload.py            # POST /api/content/upload (multipart file upload)
│   ├── claims.py            # POST /api/claims
│   ├── search.py            # GET /api/search
│   ├── knowledge.py         # GET /api/knowledge/query, POST /api/knowledge/sparql
│   ├── contradictions.py    # GET /api/knowledge/contradictions
│   ├── ask.py               # POST /api/ask (RAG question answering)
│   ├── changes.py           # GET /api/entity/{id}/changes
│   └── health.py            # GET /health
├── parsing/
│   ├── __init__.py          # ParserRegistry, ParsedDocument, Parser protocol
│   ├── pdf.py               # PdfParser (PyMuPDF)
│   ├── html.py              # HtmlParser (readability-lxml + BeautifulSoup)
│   ├── structured.py        # StructuredParser (JSON/CSV)
│   └── text.py              # TextParser (passthrough)
├── nlp/
│   ├── __init__.py          # NlpPhase, NlpResult, NlpEntity
│   └── bootstrap.py         # spaCy model + Wikidata KB loading
├── ingestion/
│   ├── pipeline.py          # Per-triple processing (delta, insert, contradiction, provenance, inference)
│   ├── worker.py            # 5-phase orchestrator (Embed → NLP → Extract → Coref → Process)
│   ├── phases.py            # EmbedPhase, ExtractPhase, ProcessPhase
│   ├── coreference.py       # CoreferencePhase (Wikidata QID merging)
│   └── outbox.py            # OutboxStore + OutboxDrainer (pyoxigraph/PG 2PC)
├── stores/
│   ├── __init__.py          # Stores dataclass
│   ├── triples.py           # pyoxigraph wrapper — RDF-star, named graphs
│   ├── content.py           # ContentStore — metadata + chunks + embeddings
│   ├── entities.py          # EntityStore — entity/predicate resolution + aliases
│   ├── provenance.py        # ProvenanceStore — per-source evidence rows
│   └── rag.py               # RAGRetriever — hybrid chunk + KG retrieval, intent-routed (3 strategies)
├── reasoning/
│   ├── engine.py            # InferenceEngine — forward-chaining rules
│   └── noisy_or.py          # Noisy-OR evidence combination
├── ontology/
│   ├── uri.py               # URI normalization (to_entity_uri, to_predicate_uri, slugify)
│   ├── namespaces.py        # ks:, schema:, dc:, skos:, prov: namespace constants
│   ├── registry.py          # DomainRegistry — predicate metadata from ontology
│   ├── bootstrap.py         # Loads schema.ttl + domains/*.ttl into ks:graph/ontology
│   ├── schema.ttl            # Knowledge type classes, properties
│   ├── domains/             # Domain TTL files (base, health, technology, research)
│   └── prompts/             # LLM extraction prompt templates (entities, relations)
├── clients/
│   ├── base.py              # BaseLLMClient — shared retry / timeout / auth handling
│   ├── llm.py               # EmbeddingClient + ExtractionClient
│   ├── prompt_builder.py    # Domain-aware extraction prompts from templates
│   └── rag.py               # RAGClient — LLM-powered answer generation
└── migrations/              # SQL migrations (auto-applied at startup)

Ontology

The system reuses established vocabularies and keeps the custom ks: namespace minimal:

| Domain | Namespace | Purpose |
| --- | --- | --- |
| Content metadata | dc: / dcterms: | Title, creator, date, format, source |
| General entities | schema: (Schema.org) | People, organisations, software, events |
| Topic hierarchies | skos: | Broader/narrower relationships, labels |
| People and social | foaf: | Personal information, social connections |
| Provenance | prov: (PROV-O) | Activity, agent, entity chains |
| Service-specific | ks: | Confidence, knowledge type, temporal validity, extraction metadata |

Status

Deployed to production (~700 tests).

| Capability | What |
| --- | --- |
| Knowledge model | 3 Pydantic input shapes (TripleInput / EventInput / EntityInput); knowledge_type is preserved on each triple's RDF-star annotation as a lowercase canonical label (claim / fact / event / entity / relationship / …; Relation is collapsed to relationship); temporal validity via valid_from / valid_until |
| RDF store | pyoxigraph, 4 named graphs by provenance class (ontology / asserted / extracted / inferred) |
| Ingestion | Parse (PDF / HTML / CSV / JSON / text) → chunk → embed → spaCy NER + Wikidata linking → LLM extraction → QID-based coreference → ingest |
| Maintenance | Background asyncio task lowercases ks:knowledgeType annotations and remaps spaCy NER labels to schema.org canonical types on a configurable interval; manual trigger at POST /api/admin/maintenance/run |
| Hybrid retrieval | BM25 (OR-tokenised to_tsquery) + pgvector, fused via Reciprocal Rank Fusion |
| RAG endpoint | /api/ask with intent classification (semantic / entity / graph), returns answer + source chunks + triples + contradictions |
| Evidence combination | Noisy-OR across multi-source confidences: 1 - Π(1 − cᵢ) |
| Inference | 3 forward-chaining rules (inverse / transitive / type-inheritance), retraction cascade when source triples change |
| Consistency | Outbox 2PC between pyoxigraph and Postgres, startup drainer recovers from crash-between-commit-and-drain |
| Reader-side filter | In-flight content excluded from retrieval until ingestion reaches a terminal status |

Not currently enforced (declared but not filtered on the read path): graph-level trust tiers, contradictions. Both are surfaced as labels / response fields, not used to re-rank or exclude results.


Documentation

| Document | Description |
| --- | --- |
| Architecture notes | Design rationale for Noisy-OR, the outbox, named-graph trust tiers, inference, and the parts deliberately left out |
| API reference | Full endpoint reference with example payloads (interactive docs at /docs when running) |
| Deployment | Production AEGIS stack deployment |
| Demo corpus | Bundled public-domain corpus used by scripts/demo.py |

Tech Stack

| Component | Technology |
| --- | --- |
| API | Python 3.12, FastAPI, uvicorn |
| Knowledge store | pyoxigraph (embedded, RDF 1.2, SPARQL 1.2, RocksDB) |
| Confidence combination | Noisy-OR (4-line function) |
| Inference | 3 forward-chaining rules at ingestion time (inverse / transitive / type-inheritance) |
| Operational store | PostgreSQL 16 |
| Vector search | pgvector (HNSW index, halfvec) |
| Document parsing | PyMuPDF (PDF), readability-lxml + BeautifulSoup (HTML), stdlib (CSV/JSON) |
| NLP pre-pass | spaCy (en_core_web_sm) + spacy-entity-linker (Wikidata KB) |
| LLM gateway | Ollama (local) or LiteLLM (proxy); any OpenAI-compatible API |
| Embeddings | nomic-embed-text (768-dim) |
| Knowledge extraction | qwen3:14b (auto-extracts from raw_text) |
| Infrastructure | Docker Compose, GitHub Actions CI/CD, Docker Swarm (production) |
