S-Mahoney/Compopulate

Project Doc Compiler

This pipeline uses a local LM Studio model to collate a project document from many multi-format source files. It can be run once to generate the doc, or left running to monitor for changes: when the script detects a saved change to a source file, it automatically regenerates the collated doc and the accompanying knowledge graph.

The knowledge graph is a companion view of the doc that shows which parts of each source file contributed to which sections.

Setup

  • Download LM Studio
  • Use LM Studio to download an LLM
  • Load your LLM
  • Start the LLM server inside LM Studio
  • Copy the server's IP address and port into your config.yaml
  • Add the source files to the /sources directory, or point config.yaml at your desired source directory
  • Edit initial_request.txt with your doc-creation specifications
  • Create a .venv
python -m venv .venv
  • Install requirements and package
pip install -r requirements.txt
pip install -e .
  • Now you can run:
# Create the doc, then watch for changes:
rag-runner --init
# Or:
python -m rag_runner.run --init

# Run only the watcher:
rag-runner
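
A config.yaml along these lines ties the setup steps together. The key names below are illustrative assumptions, not the project's actual schema; adjust them to match the shipped file.

```yaml
# Hypothetical config.yaml layout -- key names are assumptions,
# not the project's actual schema.
llm:
  base_url: "http://127.0.0.1:1234/v1"   # LM Studio server IP/port
  model: "local-model"
paths:
  sources: "./sources"
  manuals: "./manuals"
watch:
  interval_seconds: 5
```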

System Flow: Automated RAG Manual Generator

Overview

The rag-runner system automates the creation and continuous maintenance of technical manuals using a Retrieval-Augmented Generation (RAG) pipeline connected to a local LLM (e.g., LM Studio or Ollama). It monitors a directory of source files (Word docs, CSVs, XLSX spreadsheets, plain text, etc.), extracts their content, embeds it into a vector store, and generates an updated Markdown manual whenever those sources change.

Components and Their Roles

run.py: The orchestrator. Manages startup, initialization (--init), watching for source updates, regenerating manuals, and invoking downstream modules.
parsers.py: Extracts structured text “facts” from supported file types (.docx, .csv, .xlsx, .txt). Each extracted fact becomes a chunk stored in the vector database.
embed_store.py: Manages embeddings and vector storage using ChromaDB. Uses the LM Studio / Ollama HTTP API, or a local SentenceTransformer fallback, for embeddings.
template_gen.py: Generates the initial manual template by prompting the LLM with the user’s initial_request.txt and the available source files.
prompts.py: Holds prompt templates for each phase: the RAG system, section updates, changelog generation, and template creation.
regenerate.py: Selectively updates sections of the manual using retrieved context. Handles diffing, citation maps, and changelog entries.
knowledge_graph.py: Builds a visual knowledge graph (knowledge_graph.html) linking source documents to the manual sections that reference them. Also generates an annotated HTML manual with clickable source highlights.
config.yaml: Configuration file specifying paths, model settings, watch directories, and update intervals.
initial_request.txt: The user’s instruction file describing what kind of manual or summary to create (“technical specification,” “summary report,” etc.).
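
A minimal sketch of the extension-based dispatch that parsers.py performs. The function names (extract_txt, extract_csv) and the one-fact-per-row convention are illustrative assumptions, not the module's actual API; only .txt and .csv are shown to keep the example stdlib-only.

```python
# Sketch: map file extensions to extraction functions, each returning
# a list of text "facts" ready to be embedded as chunks.
import csv
from pathlib import Path

def extract_txt(path: Path) -> list[str]:
    # Each non-empty line becomes one fact chunk.
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]

def extract_csv(path: Path) -> list[str]:
    with path.open(newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    # One fact per data row: "column=value" pairs joined together.
    return ["; ".join(f"{h}={v}" for h, v in zip(header, row)) for row in data]

PARSERS = {".txt": extract_txt, ".csv": extract_csv}

def extract_facts(path: Path) -> list[str]:
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        raise ValueError(f"unsupported file type: {path.suffix}")
    return parser(path)
```

A real implementation would add handlers for .docx and .xlsx (e.g., via python-docx and openpyxl) under the same dispatch table.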

Continuous Monitoring

  1. File Watcher

    • watchdog observes the configured sources/ directory.
    • On any file creation/modification/deletion:
      • Extract and re-embed the updated content.
      • Compare to previous version (diff).
      • Trigger selective regeneration for affected manual sections.
  2. Selective Update

    • regenerate.py identifies which manual sections cite the changed source.

    • Queries the updated embeddings for relevant context.

    • Sends a structured update prompt (UPDATE_SECTIONS_PROMPT) to the LLM.

    • The LLM rewrites only the affected section while preserving citations and unchanged text.

  3. Changelog Generation

    • Every update produces a structured changelog (CHANGELOG_SUMMARY_PROMPT) containing:

      • Changed sources and versions.

      • Sections touched.

      • Equations altered.

      • Summary of impact.

    • Changelog is appended to a JSON or Markdown file in /logs/.

  4. Knowledge Graph Refresh

    • After any manual regeneration, the graph files are rebuilt so the visualization stays current.

    • Clicking a source node highlights affected sections in the manual viewer.
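
The detect-change step above can be sketched with the standard library alone. The real pipeline gets events from watchdog callbacks; this polling version shows the same idea: hash every source file, compare against the previous snapshot, and report which files were created, modified, or deleted so only the affected sections are regenerated.

```python
# Pure-stdlib sketch of change detection (the real system uses watchdog).
import hashlib
from pathlib import Path

def snapshot(src_dir: Path) -> dict[str, str]:
    """Map each source file's relative path to a content hash."""
    return {
        str(p.relative_to(src_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(src_dir.rglob("*")) if p.is_file()
    }

def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify every path as created, deleted, or modified."""
    return {
        "created": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "modified": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```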

Data Flow Diagram (Simplified)

 ┌────────────────────┐
 │   Source Files     │ (.docx, .csv, .xlsx)
 └────────┬───────────┘
          │
   [Extract Facts]
          │
   parsers.py
          ▼
   [Embeddings]
          │
   embed_store.py ──► chroma vector DB
          │
          ▼
 [Query Context & Write Manual]
    regenerate.py / run.py
          │
          ▼
 [Manual Markdown] ───► manuals/
          │
          ├──► knowledge_graph.py ─► HTML Graph
          │
          └──► changelog.json

Key Design Features

    • RAG-based Contextualization: Uses a vector search per section for precise retrieval.

    • Source Traceability: Inline citations ([^refN]) link every fact to its originating file and version.

    • Automatic Changelog: Each regeneration logs what changed and why.

    • Visual Provenance: The D3.js graph shows dependencies and lets users inspect which manual sections rely on each source.

    • Classification Filtering: Template generation can skip classified sources (PUBLIC/INTERNAL/SECRET).

    • Model-Agnostic: Works with LM Studio, Ollama, or local Hugging Face models using OpenAI API format.

    • Local-Only Operation: No cloud calls required; all embeddings, models, and storage can run offline.
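
The source-traceability feature above hinges on mapping each [^refN] citation back to the section that contains it. A sketch of that mapping, assuming the manual uses Markdown ## headings for sections (an assumption for illustration; the actual section format may differ):

```python
# Sketch: build {section title -> cited refN ids} from the manual text,
# so a changed source can be traced to the sections that cite it.
import re

REF = re.compile(r"\[\^ref(\d+)\]")

def citation_map(manual_md: str) -> dict[str, set[str]]:
    sections: dict[str, set[str]] = {}
    current = "(preamble)"
    for line in manual_md.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections.setdefault(current, set())
        for m in REF.finditer(line):
            sections.setdefault(current, set()).add("ref" + m.group(1))
    return sections
```

Inverting this map (refN to sections) gives exactly the lookup regenerate.py needs when a cited source changes.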

About

Collate info from source documents into a user-specified project document.
