Graph RAG for SDK teams bot#15824

Draft
tadelesh wants to merge 11 commits into main from graph_rag

Conversation

tadelesh (Member) commented Jun 2, 2026

Under POC

Copilot AI review requested due to automatic review settings June 2, 2026 07:41
@tadelesh tadelesh marked this pull request as draft June 2, 2026 07:41
Copilot AI (Contributor) left a comment

Pull request overview

Adds a new Python-based “knowledge graph sync” pipeline that ports the existing doc-sync behavior and layers Microsoft GraphRAG indexing/publishing on top, plus bot-agent support for querying the graph (DRIFT search) and hot-reloading parquet artifacts via admin endpoints.

Changes:

  • Introduces azure-sdk-qa-bot-knowledge-graph-sync (Python): repo cloning + markdown processing + incremental/full GraphRAG indexing + publish/notify flow.
  • Adds bot-agent GraphRAG query plumbing (parquet loader + DRIFT search tool) and admin endpoints to reload/status the currently served graph build.
  • Updates documentation/CI and adds tests for the new Python sync project and bot-agent endpoints/tooling.

Reviewed changes

Copilot reviewed 34 out of 34 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| `tools/sdk-ai-bots/README.md` | Documents the new knowledge-graph-sync component and updates prerequisites/run instructions. |
| `tools/sdk-ai-bots/.gitignore` | Ignores Python artifacts and GraphRAG input/output/cache directories. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/README.md` | New project README describing architecture, usage, env vars, and structure. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/pyproject.toml` | Defines the Python package, dependencies, and CLI entry point. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/ci.yml` | Adds CI pipeline to install deps and run pytest for the new tool. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/config/knowledge-config.json` | New unified knowledge source configuration (repos + doc paths). |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/config/knowledge-config.schema.json` | JSON schema for the knowledge-config format. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/graphrag_config/settings.yaml` | GraphRAG indexing configuration (models, input, vector store). |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/__init__.py` | Package marker for the sync project. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/main.py` | CLI entry point orchestrating doc sync + GraphRAG indexing + publish/notify. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/daily_sync.py` | Implements the doc sync pipeline (clone, preprocess, transform, upload, cleanup). |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/graphrag/__init__.py` | Package marker for GraphRAG helpers. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/graphrag/run_indexing.py` | Runs full/incremental GraphRAG indexing using the Python API and manages inputs/outputs. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/graphrag/publish_output.py` | Uploads parquet snapshots to blob storage and triggers bot hot-reload. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/services/__init__.py` | Exposes service helpers (BlobService). |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/services/app_config.py` | Loads .env then Azure App Configuration into environment variables. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/services/app_secret.py` | Loads secrets from Key Vault into environment variables. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/services/configuration_loader.py` | Parses knowledge-config.json into typed models and flattened legacy views. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/services/metadata_resolver.py` | Applies path defaults and glob overrides to resolve scope/service_type metadata. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/services/storage_service.py` | Blob CRUD/download and change detection helpers for doc sync + GraphRAG artifacts. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/services/spector_processor.py` | Converts TypeSpec scenario files into markdown using Azure OpenAI. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/services/typespec_processor.py` | Converts TypeSpec library .tsp definitions into structured markdown. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/tests/__init__.py` | Marks the new test package. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/tests/conftest.py` | Test path setup to import the project’s src package. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/tests/test_configuration_loader.py` | Unit tests for config parsing/flattening behavior. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/tests/test_daily_sync.py` | Unit tests for daily_sync content processing helpers. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-agent/requirements.txt` | Adds GraphRAG + parquet dependencies needed for query-time loading. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-agent/config/graphrag/settings.yaml` | Adds query-side GraphRAG config mirroring indexing settings and DRIFT tuning. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-agent/utils/knowledge_graph.py` | Implements parquet download/load, DRIFT search wrapper, and atomic reload support. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-agent/tools/graph_knowledge_tools.py` | Adds a tool wrapper that exposes DRIFT search results as References. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-agent/agents/chat_agent/__init__.py` | Adds KNOWLEDGE_TOOL_MODE switching between vector search and graph search tool registration. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-agent/server.py` | Adds admin endpoints to reload/status the graph build with shared-secret auth. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-agent/tests/graph_knowledge_tools_test.py` | Adds a unit test for the graph knowledge tool. |
| `tools/sdk-ai-bots/azure-sdk-qa-bot-agent/tests/admin_graphrag_endpoints_test.py` | Adds unit tests for the new admin reload/status endpoints. |
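The summary mentions shared-secret auth on the admin endpoints, but `server.py` itself is not quoted in this conversation. As a rough illustration only, a shared-secret check is usually done with a constant-time comparison. The function name and the fail-closed behavior below are assumptions for the sketch, not the PR's implementation.

```python
import hmac
from typing import Optional


def is_authorized(presented: Optional[str], expected: str) -> bool:
    """Hypothetical helper: compare a presented secret with the configured one."""
    # Fail closed when either side is missing or empty.
    if not presented or not expected:
        return False
    # hmac.compare_digest avoids leaking how many leading bytes matched.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


print(is_authorized("s3cret", "s3cret"))
print(is_authorized(None, "s3cret"))
```

Rejecting an empty configured secret keeps the endpoint closed when the environment variable is unset, which a plain equality check against an empty string would not do.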

Comment on lines +60 to +67
blob_client = self._container_client.get_blob_client(blob_path)
data = content.encode("utf-8") if isinstance(content, str) else content
blob_client.upload_blob(
    data,
    overwrite=True,
    content_settings=ContentSettings(content_type=self._get_content_type(blob_path)),
    metadata=metadata,
)
Comment on lines +167 to +187
current_md5 = self._calculate_md5(content)
existing = existing_blobs.get(blob_path)

if existing is None:
    return True

# Check soft-delete flag
if existing.metadata and existing.metadata.get("IsDeleted") == "true":
    return True

existing_md5 = existing.properties.get("content_md5")
if not existing_md5:
    return True

# Azure returns content_md5 as bytearray; convert to base64 for comparison
if isinstance(existing_md5, (bytes, bytearray)):
    existing_md5_b64 = b64encode(existing_md5).decode()
else:
    existing_md5_b64 = str(existing_md5)

return existing_md5_b64 != current_md5
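The comparison in this hunk can be exercised in isolation. The sketch below assumes that `_calculate_md5` returns a base64-encoded digest (the hunk does not show it); `md5_b64` and `has_changed` are hypothetical stand-ins, and the soft-delete and missing-blob branches are left out.

```python
import hashlib
from base64 import b64encode
from typing import Optional, Union


def md5_b64(content: Union[str, bytes]) -> str:
    """Base64-encoded MD5 of the content (assumed format of _calculate_md5)."""
    data = content.encode("utf-8") if isinstance(content, str) else content
    return b64encode(hashlib.md5(data).digest()).decode()


def has_changed(
    content: Union[str, bytes],
    stored_md5: Optional[Union[bytes, bytearray, str]],
) -> bool:
    """Report whether content differs from the digest stored with a blob."""
    if not stored_md5:
        return True  # no digest recorded, so treat the blob as changed
    if isinstance(stored_md5, (bytes, bytearray)):
        stored_b64 = b64encode(stored_md5).decode()  # raw digest bytes
    else:
        stored_b64 = str(stored_md5)  # already a base64 string
    return stored_b64 != md5_b64(content)


raw_digest = bytearray(hashlib.md5(b"hello").digest())
print(has_changed("hello", raw_digest))   # same content
print(has_changed("hello!", raw_digest))  # edited content
```

If the stored value were a hex digest instead of raw bytes or base64, the `str` branch would always report a change, so the two sides need to agree on one encoding.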
Comment on lines +32 to +35
[tool.hatch.build.targets.wheel]
packages = ["src"]
testpaths = ["tests"]
asyncio_mode = "auto"
Comment on lines +151 to +152
"required": ["description", "folder", "metadata"],
"additionalProperties": false
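The effect of combining `required` with `"additionalProperties": false` can be mimicked with a small hand-rolled check. This is not a JSON Schema validator, and it assumes the schema's `properties` (not shown in the hunk) declare only the three required keys; `check_entry` is a hypothetical name.

```python
from typing import Any, Dict, List

REQUIRED = {"description", "folder", "metadata"}  # from the quoted "required" list
# Assumption: the unseen "properties" block declares nothing beyond these keys.
ALLOWED = set(REQUIRED)


def check_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the entry passes."""
    present = set(entry)
    problems = [f"missing required key: {k}" for k in sorted(REQUIRED - present)]
    problems += [f"unexpected key: {k}" for k in sorted(present - ALLOWED)]
    return problems


print(check_entry({"description": "Docs", "folder": "docs", "metadata": {}}))
print(check_entry({"description": "Docs", "folder": "docs", "scope": "x"}))
```

With `additionalProperties` set to `false`, any optional key has to be declared under `properties` or entries that use it are rejected.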
Comment on lines +100 to +103
| `AZURE_APP_CONFIG_ENDPOINT` | Azure App Configuration endpoint |
| `STORAGE_ACCOUNT_NAME` | Azure Storage account name |
| `STORAGE_KNOWLEDGE_CONTAINER` | Blob container for processed docs |
| `AI_SEARCH_ENDPOINT` | Azure AI Search endpoint URL |
| `AI_SEARCH_INDEX_TEXT_UNITS` | AI Search index for text unit embeddings |
| `AI_SEARCH_INDEX_ENTITIES` | AI Search index for entity embeddings |
| `AI_SEARCH_INDEX_COMMUNITIES` | AI Search index for community embeddings |
| `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint (for LLM + embeddings) |
Comment on lines +13 to +25
from config.tenant_config import TenantID
from tools.graph_knowledge_tools import GraphKnowledgeTools


@pytest.mark.asyncio
async def test_search_graph_knowledge_tool() -> None:
    query = "What does the TypeSpec JSON Schema emitter do?"

    result = await GraphKnowledgeTools().search_knowledge_graph(
        queries=[query], tenant_id=TenantID.TYPESPEC_CHANNEL_QA_BOT
    )

    assert len(result.results) > 0
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 38 out of 38 changed files in this pull request and generated 10 comments.

Comment on lines +194 to +200
if repo.auth_type == "local":
    # Copy from local path
    for folder in repo.sparse_checkout or []:
        src = os.path.join(clone_url, folder)
        dst = os.path.join(repo_path, folder)
        if os.path.exists(src):
            shutil.copytree(src, dst, dirs_exist_ok=True)
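The local-copy branch can be tried against temporary directories. In this sketch `copy_local_folders` is a hypothetical helper, and it checks `os.path.isdir` rather than `os.path.exists` (a deliberate difference from the hunk) because `shutil.copytree` raises when the source is a regular file.

```python
import os
import shutil
import tempfile
from typing import Iterable, List


def copy_local_folders(source_root: str, dest_root: str, folders: Iterable[str]) -> List[str]:
    """Copy each listed folder that exists under source_root; return those copied."""
    copied = []
    for folder in folders:
        src = os.path.join(source_root, folder)
        dst = os.path.join(dest_root, folder)
        if os.path.isdir(src):
            # dirs_exist_ok (Python 3.8+) lets a re-run merge into an existing tree.
            shutil.copytree(src, dst, dirs_exist_ok=True)
            copied.append(folder)
    return copied


with tempfile.TemporaryDirectory() as source_root, tempfile.TemporaryDirectory() as dest_root:
    os.makedirs(os.path.join(source_root, "docs", "guides"))
    with open(os.path.join(source_root, "docs", "guides", "intro.md"), "w", encoding="utf-8") as fh:
        fh.write("# Intro\n")
    copied = copy_local_folders(source_root, dest_root, ["docs/guides", "missing"])
    copied_file_exists = os.path.exists(os.path.join(dest_root, "docs", "guides", "intro.md"))

print(copied, copied_file_exists)
```

Folders that are missing on disk are skipped silently here, as in the hunk, so a typo in a sparse-checkout entry would not surface as an error.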
Comment on lines +400 to +406
return ProcessedFile(
    filename=converted["filename"],
    content=converted["content"],
    blob_path=blob_path,
    is_valid=True,
    metadata=metadata,
)
Comment on lines +418 to +426
# Fix 1: Replace # in code blocks
def fix_code_block(m: re.Match) -> str:
    lang = m.group(1)
    code = m.group(2)
    transformed = re.sub(r"^#(\s*)", r"//\1", code, flags=re.MULTILINE)
    return f"```{lang}\n{transformed}```"

result = re.sub(r"```(\w+)\s*\n([\s\S]*?)```", fix_code_block, content)
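The two regular expressions in this hunk can be run standalone to see what they match. The sketch reuses the quoted patterns unchanged; `FENCE`, `CODE_BLOCK`, and `rewrite` are names introduced here, and the fence string is built at runtime only so the listing stays readable.

```python
import re

FENCE = "`" * 3  # three backticks, built at runtime
CODE_BLOCK = re.compile(FENCE + r"(\w+)\s*\n([\s\S]*?)" + FENCE)


def fix_code_block(m: re.Match) -> str:
    lang, code = m.group(1), m.group(2)
    # Replace a leading "#" on each line of the block with "//".
    transformed = re.sub(r"^#(\s*)", r"//\1", code, flags=re.MULTILINE)
    return f"{FENCE}{lang}\n{transformed}{FENCE}"


def rewrite(content: str) -> str:
    return CODE_BLOCK.sub(fix_code_block, content)


tight = f"{FENCE}typespec\n# a comment\nmodel Foo {{}}\n{FENCE}\n"
spaced = f"{FENCE} typespec\n# a comment\n{FENCE}\n"

print(rewrite(tight))
print(rewrite(spaced) == spaced)
```

Two properties follow from the pattern as quoted: a fence written with a space before the language tag is left untouched, and the rewrite is language-agnostic, so a `#` comment in a shell or Python block would also become `//` unless surrounding code outside this hunk filters by language.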

Comment on lines +155 to +166
doc += (
    f"## Scenario: {data['heading']}\n"
    f"{desc}\n"
    f"``` typespec\n{cleaned}\n```\n\n"
)

doc += "## Full Sample: \n// main.tsp\n``` typespec\n"
doc += cls._remove_spector_content(main_spec) + "\n```\n"

if client_tsp:
    doc += "// client.tsp\n``` typespec\n"
    doc += cls._remove_spector_content(client_tsp) + "\n```\n"
Comment on lines +39 to +42
# Match both .md and .mdx — daily_sync._process_source_directory
# already harvests both extensions, so GraphRAG must index both too.
file_pattern: ".*\\.mdx?$$"
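To check which filenames this pattern accepts, it helps to unwind the two layers of escaping first. The sketch assumes the settings file goes through `string.Template`-style substitution before YAML parsing (so `$$` becomes a literal `$`), and it simulates the YAML double-quote unescaping with a plain string replace instead of a YAML parser.

```python
import re
from string import Template

# The value exactly as written between the YAML double quotes.
RAW = r".*\\.mdx?$$"

# Assumption: environment substitution collapses "$$" into a single "$".
after_template = Template(RAW).substitute({})
# YAML double-quoted scalars turn an escaped backslash into a single one.
effective = after_template.replace("\\\\", "\\")

pattern = re.compile(effective)
names = ["guide.md", "guide.mdx", "guide.mdxx", "notes.txt", "README.md.bak"]
matched = [name for name in names if pattern.search(name)]

print(effective)
print(matched)
```

Under those assumptions the effective expression is `.*\.mdx?$`, which accepts names ending in `.md` or `.mdx` and nothing else from the sample list.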

Comment on lines +47 to +56
index_schema:
  text_unit_text:
    index_name: "azuresdkqabot-dev-search-index-text-units"
    vector_size: 1536
  entity_description:
    index_name: "azuresdkqabot-dev-search-index-entities"
    vector_size: 1536
  community_full_content:
    index_name: "azuresdkqabot-dev-search-index-communities"
    vector_size: 1536
Comment on lines +486 to +501
temp_dir = Path(tempfile.mkdtemp(prefix="graphrag-output-"))
logger.info(
    "Downloading GraphRAG parquets from blob container '%s' (prefix='%s') "
    "to %s",
    self._blob_container,
    snapshot_prefix,
    temp_dir,
)

download_tasks = [
    self._download_one_parquet(name, snapshot_prefix, temp_dir)
    for name in _REQUIRED_PARQUETS
]
await asyncio.gather(*download_tasks)

return await asyncio.to_thread(self._load_parquets_from_path, temp_dir)
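The download-then-load pattern in this hunk can be sketched without Azure dependencies. Everything below is a stand-in: the file names are hypothetical, `download_one` writes placeholder bytes instead of fetching a blob, and the sketch swaps `tempfile.mkdtemp` for `TemporaryDirectory`, which is only safe here because the loaded result does not reference the files afterwards.

```python
import asyncio
import tempfile
from pathlib import Path
from typing import Dict, List

REQUIRED_FILES = ["entities.parquet", "relationships.parquet"]  # hypothetical names


async def download_one(name: str, target_dir: Path) -> None:
    """Stand-in for a blob download: writes placeholder bytes to disk."""
    await asyncio.sleep(0)  # yield to the event loop as real I/O would
    (target_dir / name).write_bytes(name.encode("utf-8"))


def load_all(target_dir: Path) -> Dict[str, int]:
    """Blocking load step; here it only records each file's size."""
    return {p.name: p.stat().st_size for p in sorted(target_dir.iterdir())}


async def download_and_load(names: List[str]) -> Dict[str, int]:
    with tempfile.TemporaryDirectory(prefix="graphrag-output-") as tmp:
        target_dir = Path(tmp)
        # Run all downloads concurrently; gather raises if any of them fails.
        await asyncio.gather(*(download_one(n, target_dir) for n in names))
        # Keep the blocking load off the event loop (asyncio.to_thread, Python 3.9+).
        return await asyncio.to_thread(load_all, target_dir)


sizes = asyncio.run(download_and_load(REQUIRED_FILES))
print(sizes)
```

With `mkdtemp`, as in the hunk, the directory persists until something removes it explicitly, so each reload would leave another temporary directory behind unless cleanup happens elsewhere.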
Comment on lines +17 to +25
@pytest.mark.asyncio
async def test_search_graph_knowledge_tool() -> None:
    query = "What does the TypeSpec JSON Schema emitter do?"

    result = await GraphKnowledgeTools().search_knowledge_graph(
        queries=[query], tenant_id=TenantID.TYPESPEC_CHANNEL_QA_BOT
    )

    assert len(result.results) > 0
Comment on lines +162 to +166
- **GraphRAG as single indexing engine**: No custom search indexing or Cosmos upload code. GraphRAG handles entity extraction, embedding generation, and vector store writes natively via its `azure_ai_search` vector store backend.
- **Native incremental update**: Uses `graphrag update` instead of custom change-tracking logic for the graph. The doc sync still detects file-level changes to minimize unnecessary downloads.
- **Blob Storage as source of truth**: Raw processed markdown is stored in blobs. GraphRAG reads from a local `input/` directory populated from these blobs.
- **Managed Identity auth**: Uses Azure Managed Identity for both Azure OpenAI and AI Search (no API keys in config).
- **12 entity types**: Decorator, Pattern, Tool, Service, API, ErrorCode, Guideline, Library, Operation, Model, Configuration, Protocol
Comment on lines +53 to +58
A Python application that extends the knowledge sync pipeline with a knowledge graph layer built using [Microsoft GraphRAG](https://github.com/microsoft/graphrag). It performs the same documentation sync as the TypeScript service (ported to Python), then additionally:

- Extracts entities (decorators, patterns, APIs, services, etc.) and relationships from documentation
- Detects communities of related concepts via hierarchical clustering
- Uploads the graph to Azure Cosmos DB for entity-aware retrieval at query time
- Supports **incremental indexing** — only re-processes documents that changed in the current sync run