
This PR (#82) adds external documentation integration capabilities to Litho, allowing the tool to incorporate existing architecture documentation from local files into the documentation generation process.

Merged
sopaco merged 3 commits into sopaco:main from alexeysviridov:main on Jan 31, 2026

Conversation

@alexeysviridov

Key Features

  • Local Documentation Integration: Support for loading external docs from configured paths
  • Agent-Targeted Knowledge: Documents can be targeted to specific research/compose agents
  • Smart Chunking: Semantic, paragraph, and fixed-size chunking strategies for large documents
  • SQL Project Support: Specialized handling for .sqlproj files and SQL schema analysis
  • File Change Detection: Watch for changes in external docs and re-process automatically
  • Glob Pattern Support: Flexible file path patterns rooted to project directory

Configuration

New [knowledge.local_docs] section in litho.toml allows:

  • Enabling/disabling local docs integration
  • Defining document categories with paths and target agents
  • Customizing chunking behavior per category
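
An illustrative sketch of what such a configuration might look like. Field names are inferred from the KnowledgeConfig, LocalDocsConfig, DocumentCategory, and ChunkingConfig structures this PR adds and from the sync code in the review below, so details (including the agent name shown) may differ from the final implementation:

```toml
[knowledge.local_docs]
enabled = true
watch_for_changes = true     # re-sync automatically when source docs change
# cache_dir = ".litho/knowledge/local_docs"   # optional cache location override

[knowledge.local_docs.default_chunking]
strategy = "semantic"        # or "paragraph", "fixed"
max_chunk_size = 4000
chunk_overlap = 200

[[knowledge.local_docs.categories]]
name = "architecture"
description = "Existing architecture notes"
paths = ["docs/architecture/**/*.md"]   # glob patterns rooted at the project directory
target_agents = ["domain_modules_detector"]   # hypothetical agent identifier
# Per-category override of the default chunking behavior:
chunking = { strategy = "paragraph", max_chunk_size = 2000, chunk_overlap = 100 }
```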

Agents Enhanced

  • All 6 research agents can now receive external knowledge
  • All 5 compose/editor agents incorporate external knowledge
  • New DatabaseOverviewAnalyzer agent for SQL project analysis

Bug Fixes

  • Fixed watch_for_changes not detecting new/deleted files
  • Fixed target_agents filtering being ignored
  • Fixed glob patterns resolving relative to the CWD instead of the project path
  • Fixed CodePurpose enum deserialization with doc comment aliases
  • Fixed error message formatting with .replacen()
  • Added model name to ReAct error messages for better debugging

…any available external documentation.

- Updated DataSource enum to support external knowledge categorized by specific categories.
- Modified GeneratorPromptBuilder to include agent filtering for external knowledge.
- Improved KnowledgeSyncer to categorize documents and handle chunking based on configuration.
- Added support for processing SQL, YAML, and JSON files in LocalDocsProcessor.
- Implemented document chunking strategies for better handling of large documents.
- Enhanced formatting for LLM consumption to include document categories and chunk information.
- Updated i18n module to include new database overview documentation translations.
- Refactored token estimator to remove unused methods for better code clarity.
Copilot AI review requested due to automatic review settings January 23, 2026 00:57
Contributor

Copilot AI left a comment
Pull request overview

This PR adds comprehensive external documentation integration capabilities to Litho, enabling the tool to incorporate existing architecture documentation from local files into the documentation generation process. The PR includes SQL project support for .NET projects and implements document chunking strategies for large files.

Changes:

  • New knowledge configuration system with categorized local documentation support
  • SQL Server database project (.sqlproj) and SQL file (.sql) parsing and analysis
  • Document chunking with semantic, paragraph, and fixed-size strategies
  • Knowledge sync command for manually managing external documentation cache
  • Bug fixes for watch_for_changes, target_agents filtering, and glob pattern handling

Reviewed changes

Copilot reviewed 40 out of 41 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
src/config.rs Added KnowledgeConfig, LocalDocsConfig, DocumentCategory, and ChunkingConfig structures
src/cli.rs Added sync-knowledge subcommand for manual knowledge cache management
src/main.rs Added subcommand handling and sync_knowledge function
src/integrations/mod.rs New module for external integrations (knowledge sync, local docs)
src/integrations/local_docs.rs Document processing, chunking strategies, and file type detection
src/integrations/knowledge_sync.rs Knowledge synchronization, caching, and file change detection
src/types/code.rs Added CodePurpose serde aliases and SQL file classification logic
src/generator/preprocess/extractors/language_processors/csharp.rs Comprehensive SQL parsing for .sqlproj and .sql files with interface and dependency extraction
src/generator/preprocess/extractors/structure_extractor.rs Added importance scoring for SQL and database-related files
src/generator/research/types.rs Added DatabaseOverviewReport and related types for SQL analysis
src/generator/research/agents/database_overview_analyzer.rs New agent for analyzing SQL database projects and schemas
src/generator/research/orchestrator.rs Integrated DatabaseOverviewAnalyzer with conditional execution
src/generator/research/agents/*.rs Updated research agents to use categorized external knowledge
src/generator/compose/agents/database_editor.rs New editor for generating database documentation
src/generator/compose/mod.rs Integrated DatabaseEditor with conditional execution
src/generator/compose/agents/*.rs Updated compose agents to use categorized external knowledge
src/generator/step_forward_agent.rs Added agent_filter parameter and ExternalKnowledgeByCategory data source
src/generator/context.rs Added load_external_knowledge_by_categories method
src/generator/workflow.rs Integrated automatic knowledge sync during documentation generation
src/llm/client/mod.rs Fixed error message formatting with replacen()
src/llm/client/react_executor.rs Added model_name parameter to error messages
src/i18n.rs Added database documentation filenames for all supported languages
src/generator/outlet/mod.rs Added Database agent to DocTree
src/generator/compose/types.rs Added Database enum variant
src/utils/token_estimator.rs Removed unused helper methods
litho-example.toml Added comprehensive example configuration with knowledge integration
docs/ai-prompts.md Documentation of all AI prompts used in the system
docs/SQL_PROJECT_SUPPORT.md Documentation for SQL project support feature
Cargo.toml Added pdf-extract and base64 dependencies


```rust
    Context,
    /// command-line interface (CLI) commandsx or message/request handlers
    /// command-line interface (CLI) commands or message/request handlers
    #[serde(alias = "command-line interface (CLI) commands or message/request handlers", alias = "command-line interface (CLI) commands or message/request handlers")]
```

Copilot AI Jan 23, 2026


The Command variant has a duplicate serde alias. The same alias "command-line interface (CLI) commands or message/request handlers" appears twice. One of these should be removed.

```
- Flag gaps between docs and implementation
- Use established business terminology

Rrequired output style (extremely important):
```

Copilot AI Jan 23, 2026


Typo: "Rrequired" should be "Required" (double 'r' at the start).

Suggested change:

```diff
- Rrequired output style (extremely important):
+ Required output style (extremely important):
```

Comment on lines +587 to +606
```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_detect_file_type() {
        assert_eq!(
            LocalDocsProcessor::detect_file_type(Path::new("doc.pdf")).unwrap(),
            DocFileType::Pdf
        );
        assert_eq!(
            LocalDocsProcessor::detect_file_type(Path::new("readme.md")).unwrap(),
            DocFileType::Markdown
        );
        assert_eq!(
            LocalDocsProcessor::detect_file_type(Path::new("notes.txt")).unwrap(),
            DocFileType::Text
        );
    }
}
```

Copilot AI Jan 23, 2026


The local_docs.rs module has limited test coverage. Critical functions like expand_glob_patterns, process_file_with_chunking, and the chunking strategies (chunk_semantic, chunk_by_paragraph, chunk_fixed_size) lack unit tests. Given the complexity of the chunking logic and file processing, these should have comprehensive test coverage to prevent regressions.

Comment on lines +643 to +650
```rust
let create_table_re = regex::Regex::new(r"(?i)CREATE\s+TABLE\s+(?:\[?(\w+)\]?\.)?\[?(\w+)\]?").unwrap();
let alter_table_re = regex::Regex::new(r"(?i)ALTER\s+TABLE\s+(?:\[?(\w+)\]?\.)?\[?(\w+)\]?").unwrap();
let create_view_re = regex::Regex::new(r"(?i)CREATE\s+(?:OR\s+ALTER\s+)?VIEW\s+(?:\[?(\w+)\]?\.)?\[?(\w+)\]?").unwrap();
let create_proc_re = regex::Regex::new(r"(?i)CREATE\s+(?:OR\s+ALTER\s+)?PROC(?:EDURE)?\s+(?:\[?(\w+)\]?\.)?\[?(\w+)\]?").unwrap();
let create_func_re = regex::Regex::new(r"(?i)CREATE\s+(?:OR\s+ALTER\s+)?FUNCTION\s+(?:\[?(\w+)\]?\.)?\[?(\w+)\]?").unwrap();
let create_trigger_re = regex::Regex::new(r"(?i)CREATE\s+(?:OR\s+ALTER\s+)?TRIGGER\s+(?:\[?(\w+)\]?\.)?\[?(\w+)\]?").unwrap();
let create_index_re = regex::Regex::new(r"(?i)CREATE\s+(?:UNIQUE\s+)?(?:CLUSTERED\s+|NONCLUSTERED\s+)?INDEX\s+\[?(\w+)\]?\s+ON\s+(?:\[?(\w+)\]?\.)?\[?(\w+)\]?").unwrap();
let create_type_re = regex::Regex::new(r"(?i)CREATE\s+TYPE\s+(?:\[?(\w+)\]?\.)?\[?(\w+)\]?").unwrap();
```

Copilot AI Jan 23, 2026


The regex patterns for SQL parsing use .unwrap() on the Regex::new() calls. While these patterns are static and should compile successfully, if a pattern is malformed it will cause a panic. Consider using lazy_static or once_cell to compile these regexes once at startup, which would fail fast if patterns are invalid and avoid repeated compilation overhead.
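
As the comment suggests, these regexes can be compiled once behind a lazy static. A dependency-free sketch of the pattern using the standard library's LazyLock (stable since Rust 1.80), with a stand-in matcher in place of regex::Regex so the example stays self-contained; in the real code the closure body would be `Regex::new(r"(?i)CREATE\s+TABLE\s+...").expect("valid pattern")`:

```rust
use std::sync::LazyLock;

// Stand-in for a compiled regex: a tiny case-insensitive prefix matcher.
struct Matcher {
    prefix: &'static str,
}

impl Matcher {
    fn is_match(&self, s: &str) -> bool {
        s.to_ascii_uppercase().starts_with(self.prefix)
    }
}

// Compiled once, on first access. A malformed pattern would panic
// immediately at first use, instead of on every call to the parser,
// and the compilation cost is paid only once.
static CREATE_TABLE: LazyLock<Matcher> =
    LazyLock::new(|| Matcher { prefix: "CREATE TABLE" });

fn is_create_table(sql: &str) -> bool {
    CREATE_TABLE.is_match(sql.trim_start())
}

fn main() {
    assert!(is_create_table("create table [dbo].[Users] (Id INT)"));
    assert!(!is_create_table("DROP TABLE Users"));
    println!("ok");
}
```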

Comment on lines +196 to +198




Copilot AI Jan 23, 2026


There are extra blank lines (lines 196-198) that appear to be unintentional. These should be removed for code cleanliness.

Comment on lines +1 to +316
```rust
use anyhow::{Context, Result};
use serde::{Deserialize, Serialize};
use std::path::{Path, PathBuf};
use std::fs;
use std::collections::{HashMap, HashSet};
use chrono::{DateTime, Utc};

use crate::config::{Config, LocalDocsConfig};
use crate::integrations::local_docs::{LocalDocsProcessor, LocalDocMetadata};

/// Metadata about synced knowledge
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct KnowledgeMetadata {
    pub last_synced: DateTime<Utc>,
    pub local_docs: Vec<LocalDocMetadata>,
    /// Documents organized by category
    #[serde(default)]
    pub categories: HashMap<String, Vec<LocalDocMetadata>>,
}

/// Syncs external knowledge sources to local cache
pub struct KnowledgeSyncer {
    config: Config,
}

impl KnowledgeSyncer {
    /// Create a new knowledge syncer
    pub fn new(config: Config) -> Result<Self> {
        Ok(Self { config })
    }

    /// Sync all configured knowledge sources
    pub async fn sync_all(&self) -> Result<()> {
        let target_lang = self.config.target_language.display_name();
        println!("🔄 Syncing external knowledge sources (target language: {})...", target_lang);

        let mut synced_any = false;

        if let Some(ref local_docs_config) = self.config.knowledge.local_docs {
            if local_docs_config.enabled {
                self.sync_local_docs(local_docs_config).await?;
                synced_any = true;
            } else {
                println!("ℹ️ Local docs integration is disabled");
            }
        }

        if !synced_any {
            println!("ℹ️ No knowledge sources are configured");
        }

        println!("✅ Knowledge sync completed");
        Ok(())
    }

    /// Sync local documentation files
    async fn sync_local_docs(&self, config: &LocalDocsConfig) -> Result<()> {
        println!("\n📄 Processing local documentation files...");

        let cache_dir = config
            .cache_dir
            .clone()
            .unwrap_or_else(|| {
                self.config
                    .internal_path
                    .join("knowledge")
                    .join("local_docs")
            });

        fs::create_dir_all(&cache_dir).context("Failed to create local docs cache directory")?;

        let mut all_docs = Vec::new();
        let mut categories_map: HashMap<String, Vec<LocalDocMetadata>> = HashMap::new();
        let mut processed_count = 0;
        let mut chunked_count = 0;

        // Get default chunking config
        let default_chunking = config.default_chunking.clone();
        let project_root = self.config.project_path.as_path();

        // Process categorized documents
        for category in &config.categories {
            println!("\n  📁 Processing category: {} ({})", category.name, category.description);

            let files = LocalDocsProcessor::expand_glob_patterns(&category.paths, Some(project_root));

            // Determine chunking config for this category
            let chunking_config = category.chunking.as_ref().or(default_chunking.as_ref());

            for file_path in files {
                match LocalDocsProcessor::process_file_with_chunking(
                    &file_path,
                    &category.name,
                    &category.target_agents,
                    chunking_config,
                ) {
                    Ok(doc_metas) => {
                        let is_chunked = doc_metas.len() > 1;
                        if is_chunked {
                            println!("    ✓ [{}] {} (chunked into {} parts)",
                                category.name, file_path.display(), doc_metas.len());
                            chunked_count += 1;
                        } else {
                            println!("    ✓ [{}] {}", category.name, file_path.display());
                        }

                        for doc_meta in doc_metas {
                            // Add to category-specific map
                            categories_map
                                .entry(category.name.clone())
                                .or_default()
                                .push(doc_meta.clone());

                            // Also add to all_docs for combined access
                            all_docs.push(doc_meta);
                        }
                        processed_count += 1;
                    }
                    Err(e) => {
                        eprintln!("    ✗ Failed to process {}: {}", file_path.display(), e);
                    }
                }
            }
        }

        // Save metadata
        let metadata = KnowledgeMetadata {
            last_synced: Utc::now(),
            local_docs: all_docs,
            categories: categories_map,
        };

        let metadata_file = cache_dir.join("_metadata.json");
        let metadata_json =
            serde_json::to_string_pretty(&metadata).context("Failed to serialize metadata")?;
        fs::write(&metadata_file, metadata_json).context("Failed to write metadata")?;

        if chunked_count > 0 {
            println!("✅ Processed {} files ({} chunked into multiple parts)", processed_count, chunked_count);
        } else {
            println!("✅ Processed {} local documentation files", processed_count);
        }
        Ok(())
    }

    /// Check if knowledge needs to be re-synced
    pub fn should_sync(&self) -> Result<bool> {
        // Check if local docs need syncing
        if let Some(ref local_docs_config) = self.config.knowledge.local_docs {
            if !local_docs_config.enabled {
                return Ok(false);
            }

            let cache_dir = local_docs_config
                .cache_dir
                .clone()
                .unwrap_or_else(|| {
                    self.config
                        .internal_path
                        .join("knowledge")
                        .join("local_docs")
                });

            let metadata_file = cache_dir.join("_metadata.json");

            // Always sync local docs if cache doesn't exist or if watch_for_changes is true
            if !metadata_file.exists() {
                return Ok(true);
            }

            if local_docs_config.watch_for_changes {
                // Check if any source file has been modified since last sync
                let metadata_content = fs::read_to_string(&metadata_file)?;
                let metadata: KnowledgeMetadata = serde_json::from_str(&metadata_content)?;

                let mut cached_files: HashSet<PathBuf> = HashSet::new();
                for doc in &metadata.local_docs {
                    let cached_path = Path::new(&doc.file_path);
                    cached_files.insert(Self::normalize_path(cached_path));
                }

                let mut current_files: HashSet<PathBuf> = HashSet::new();
                let project_root = self.config.project_path.as_path();
                for category in &local_docs_config.categories {
                    let files = LocalDocsProcessor::expand_glob_patterns(&category.paths, Some(project_root));
                    for file_path in files {
                        current_files.insert(Self::normalize_path(&file_path));
                    }
                }

                // Detect new or removed files quickly
                if current_files.symmetric_difference(&cached_files).next().is_some() {
                    return Ok(true);
                }




                // Check if any source file has been modified
                for doc in &metadata.local_docs {
                    let source_path = PathBuf::from(&doc.file_path);
                    if source_path.exists() {
                        if let Ok(file_metadata) = fs::metadata(&source_path) {
                            if let Ok(modified) = file_metadata.modified() {
                                // Convert SystemTime to DateTime<Utc>
                                let modified_datetime: DateTime<Utc> = modified.into();
                                // Compare with cached modification time
                                if modified_datetime > metadata.last_synced {
                                    return Ok(true);
                                }
                            }
                        }
                    }
                }
                return Ok(false);
            }
        }

        Ok(false)
    }

    fn normalize_path(path: &Path) -> PathBuf {
        fs::canonicalize(path).unwrap_or_else(|_| path.to_path_buf())
    }

    /// Load cached knowledge for a specific category
    pub fn load_cached_knowledge_by_category(
        &self,
        category: &str,
        agent_filter: Option<&str>,
    ) -> Result<Option<String>> {
        let local_docs_config = match &self.config.knowledge.local_docs {
            Some(cfg) if cfg.enabled => cfg,
            _ => return Ok(None),
        };

        let cache_dir = local_docs_config
            .cache_dir
            .clone()
            .unwrap_or_else(|| {
                self.config
                    .internal_path
                    .join("knowledge")
                    .join("local_docs")
            });

        let metadata_file = cache_dir.join("_metadata.json");
        if !metadata_file.exists() {
            return Ok(None);
        }

        let metadata_content = fs::read_to_string(&metadata_file)?;
        let metadata: KnowledgeMetadata = serde_json::from_str(&metadata_content)?;

        // Get documents for the specified category
        let Some(docs) = metadata.categories.get(category) else {
            return Ok(None);
        };

        let filtered_docs: Vec<LocalDocMetadata> = docs
            .iter()
            .cloned()
            .filter(|doc| Self::doc_visible_to_agent(doc, agent_filter))
            .collect();

        if filtered_docs.is_empty() {
            return Ok(None);
        }

        let target_lang = self.config.target_language.display_name();
        let header = format!(
            "# {} Documentation ({})\n\nCategory: {}\nLast processed: {}\nDocuments in category: {}\n\n",
            Self::format_category_name(category),
            target_lang,
            category,
            metadata.last_synced.format("%Y-%m-%d %H:%M:%S UTC"),
            filtered_docs.len()
        );

        let formatted = LocalDocsProcessor::format_for_llm_with_options(
            &filtered_docs,
            Some(&header),
            false,
        );

        Ok(Some(formatted))
    }

    /// Format category name for display
    fn format_category_name(category: &str) -> String {
        match category {
            "architecture" => "Architecture".to_string(),
            "database" => "Database".to_string(),
            "deployment" => "Deployment & Infrastructure".to_string(),
            "api" => "API".to_string(),
            "adr" => "Architecture Decision Records".to_string(),
            "workflow" => "Workflow & Business Process".to_string(),
            "general" => "General".to_string(),
            other => other.chars().next().map(|c| c.to_uppercase().to_string()).unwrap_or_default()
                + &other.chars().skip(1).collect::<String>(),
        }
    }

    fn doc_visible_to_agent(doc: &LocalDocMetadata, agent_filter: Option<&str>) -> bool {
        match agent_filter {
            None => true,
            Some(agent) => {
                if doc.target_agents.is_empty() {
                    true
                } else {
                    doc.target_agents.iter().any(|configured| configured == agent)
                }
            }
        }
    }
}
```

Copilot AI Jan 23, 2026


The knowledge_sync.rs module has no test coverage. Critical functions like should_sync, sync_local_docs, load_cached_knowledge_by_category, and normalize_path should have unit tests, especially the file change detection logic and path normalization which can have platform-specific behavior.
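
As a sketch of what such a test might look like for the simplest of these helpers, here is a standalone copy of normalize_path under test (standalone because the real helper is a private associated function on KnowledgeSyncer):

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Standalone copy of the helper under test: canonicalize when possible,
// otherwise return the input path unchanged.
fn normalize_path(path: &Path) -> PathBuf {
    fs::canonicalize(path).unwrap_or_else(|_| path.to_path_buf())
}

fn main() {
    // Nonexistent paths fall back to the input unchanged.
    let ghost = Path::new("no/such/file.md");
    assert_eq!(normalize_path(ghost), ghost.to_path_buf());

    // Existing paths are canonicalized, so `.` segments disappear and
    // cached vs. rescanned paths compare equal in the change detector.
    let cwd = std::env::current_dir().unwrap();
    assert_eq!(normalize_path(&cwd.join(".")), fs::canonicalize(&cwd).unwrap());

    println!("ok");
}
```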

Comment on lines +463 to +476
```rust
/// Expand glob patterns to actual file paths
pub fn expand_glob_patterns(patterns: &[String], base_path: Option<&Path>) -> Vec<std::path::PathBuf> {
    let mut files = Vec::new();

    for pattern in patterns {
        let pattern_path = Path::new(pattern);
        let full_pattern = if pattern_path.is_absolute() {
            pattern.clone()
        } else if let Some(base) = base_path {
            base.join(pattern_path).to_string_lossy().to_string()
        } else {
            pattern.clone()
        };
```


Copilot AI Jan 23, 2026


Potential path traversal vulnerability: The expand_glob_patterns function accepts absolute paths and joins relative patterns without proper sanitization. While the code filters by file type, malicious glob patterns could potentially access files outside the intended project directory if an absolute pattern is provided. Consider validating that resolved paths remain within the project root directory or documenting this as a security consideration for administrators configuring litho.toml.
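
One way to enforce the containment the comment asks for is to canonicalize both the project root and each resolved file, then reject anything that escapes. A minimal sketch (not the PR's code; the helper name is made up): canonicalization resolves symlinks and `..` segments, so the check happens after resolution rather than on the raw pattern text.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Returns Some(canonical path) only if `candidate` resolves to a
/// location inside `project_root`. Both sides are canonicalized, so
/// `..` segments and symlinks cannot be used to escape the root.
fn confine_to_root(project_root: &Path, candidate: &Path) -> Option<PathBuf> {
    let root = fs::canonicalize(project_root).ok()?;
    let resolved = fs::canonicalize(candidate).ok()?;
    if resolved.starts_with(&root) { Some(resolved) } else { None }
}

fn main() {
    let root = std::env::temp_dir().join("litho_confine_demo");
    fs::create_dir_all(root.join("docs")).unwrap();
    fs::write(root.join("docs/arch.md"), "# arch").unwrap();

    // Inside the root: accepted.
    assert!(confine_to_root(&root, &root.join("docs/arch.md")).is_some());
    // `..` escape attempt: resolves outside the root, rejected.
    assert!(confine_to_root(&root, &root.join("docs/../..")).is_none());

    println!("ok");
}
```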

Comment on lines +72 to +87
```toml
excluded_files = [
    "litho.toml",
    "*.litho",
    "*.log",
    "*.tmp",
    "*.cache",
    "bun.lock",
    "package-lock.json",
    "yarn.lock",
    "pnpm-lock.yaml",
    "Cargo.lock",
    ".gitignore",
    "*.tpl",
    "*.md",
    "*.txt",
    ".env"
```

Copilot AI Jan 23, 2026


The excluded_files list includes ".md" and ".txt", but the knowledge integration feature supports these file types for external documentation. This could be confusing for users who want to use markdown or text files as external knowledge sources while excluding them from code analysis. Consider adding a clarifying comment or documentation noting that knowledge.local_docs paths override these exclusions.

Comment on lines +307 to +328
```rust
fn chunk_fixed_size(&self, content: &str) -> Vec<DocumentChunk> {
    let mut chunks = Vec::new();
    let chars: Vec<char> = content.chars().collect();
    let mut start = 0;

    while start < chars.len() {
        let end = (start + self.config.max_chunk_size).min(chars.len());
        let chunk_content: String = chars[start..end].iter().collect();

        chunks.push(DocumentChunk {
            content: chunk_content,
            chunk_index: chunks.len(),
            total_chunks: 0,
            section_context: format!("Part {}", chunks.len() + 1),
        });

        // Move start, accounting for overlap
        start = end.saturating_sub(self.config.chunk_overlap);
        if start >= end {
            break;
        }
    }
```

Copilot AI Jan 23, 2026


Potential infinite loop in chunk_fixed_size: If chunk_overlap is greater than or equal to max_chunk_size, the start position will not advance (line 324), causing the loop to continue indefinitely. Add validation to ensure chunk_overlap is less than max_chunk_size, either in ChunkingConfig validation or at the start of this function.
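
The guard the comment asks for can be as small as clamping the overlap before the loop. A standalone sketch of the function with that validation added (not the PR's final code; config fields are passed as plain parameters here):

```rust
/// Fixed-size chunking with the overlap clamped below the chunk size,
/// so `start` always advances and the loop is guaranteed to terminate.
fn chunk_fixed_size(content: &str, max_chunk_size: usize, chunk_overlap: usize) -> Vec<String> {
    assert!(max_chunk_size > 0, "max_chunk_size must be positive");
    // Clamp: an overlap >= chunk size would stop `start` from advancing.
    let overlap = chunk_overlap.min(max_chunk_size - 1);

    let chars: Vec<char> = content.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + max_chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start = end - overlap;
    }
    chunks
}

fn main() {
    // Normal case: 4-char chunks with a 2-char overlap.
    assert_eq!(chunk_fixed_size("abcdefghij", 4, 2),
               vec!["abcd", "cdef", "efgh", "ghij"]);
    // Pathological config from the review (overlap == chunk size):
    // the clamped version still terminates and still covers the input.
    let chunks = chunk_fixed_size("abcdefghij", 4, 4);
    assert!(chunks.last().unwrap().ends_with('j'));

    println!("ok");
}
```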

sopaco merged commit c7cbd20 into sopaco:main on Jan 31, 2026
6 of 7 checks passed