Pull request overview
Adds a new Python-based “knowledge graph sync” pipeline that ports the existing doc-sync behavior and layers Microsoft GraphRAG indexing/publishing on top, plus bot-agent support for querying the graph (DRIFT search) and hot-reloading parquet artifacts via admin endpoints.
Changes:
- Introduces azure-sdk-qa-bot-knowledge-graph-sync (Python): repo cloning + markdown processing + incremental/full GraphRAG indexing + publish/notify flow.
- Adds bot-agent GraphRAG query plumbing (parquet loader + DRIFT search tool) and admin endpoints to reload/status the currently served graph build.
- Updates documentation/CI and adds tests for the new Python sync project and bot-agent endpoints/tooling.
Reviewed changes
Copilot reviewed 34 out of 34 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/sdk-ai-bots/README.md | Documents the new knowledge-graph-sync component and updates prerequisites/run instructions. |
| tools/sdk-ai-bots/.gitignore | Ignores Python artifacts and GraphRAG input/output/cache directories. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/README.md | New project README describing architecture, usage, env vars, and structure. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/pyproject.toml | Defines the Python package, dependencies, and CLI entry point. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/ci.yml | Adds CI pipeline to install deps and run pytest for the new tool. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/config/knowledge-config.json | New unified knowledge source configuration (repos + doc paths). |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/config/knowledge-config.schema.json | JSON schema for the knowledge-config format. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/graphrag_config/settings.yaml | GraphRAG indexing configuration (models, input, vector store). |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/__init__.py | Package marker for the sync project. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/main.py | CLI entry point orchestrating doc sync + GraphRAG indexing + publish/notify. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/daily_sync.py | Implements the doc sync pipeline (clone, preprocess, transform, upload, cleanup). |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/graphrag/__init__.py | Package marker for GraphRAG helpers. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/graphrag/run_indexing.py | Runs full/incremental GraphRAG indexing using the Python API and manages inputs/outputs. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/graphrag/publish_output.py | Uploads parquet snapshots to blob storage and triggers bot hot-reload. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/services/__init__.py | Exposes service helpers (BlobService). |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/services/app_config.py | Loads .env then Azure App Configuration into environment variables. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/services/app_secret.py | Loads secrets from Key Vault into environment variables. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/services/configuration_loader.py | Parses knowledge-config.json into typed models and flattened legacy views. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/services/metadata_resolver.py | Applies path defaults and glob overrides to resolve scope/service_type metadata. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/services/storage_service.py | Blob CRUD/download and change detection helpers for doc sync + GraphRAG artifacts. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/services/spector_processor.py | Converts TypeSpec scenario files into markdown using Azure OpenAI. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/src/services/typespec_processor.py | Converts TypeSpec library .tsp definitions into structured markdown. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/tests/__init__.py | Marks the new test package. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/tests/conftest.py | Test path setup to import the project’s src package. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/tests/test_configuration_loader.py | Unit tests for config parsing/flattening behavior. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-knowledge-graph-sync/tests/test_daily_sync.py | Unit tests for daily_sync content processing helpers. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-agent/requirements.txt | Adds GraphRAG + parquet dependencies needed for query-time loading. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-agent/config/graphrag/settings.yaml | Adds query-side GraphRAG config mirroring indexing settings and DRIFT tuning. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-agent/utils/knowledge_graph.py | Implements parquet download/load, DRIFT search wrapper, and atomic reload support. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-agent/tools/graph_knowledge_tools.py | Adds a tool wrapper that exposes DRIFT search results as References. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-agent/agents/chat_agent/__init__.py | Adds KNOWLEDGE_TOOL_MODE switching between vector search and graph search tool registration. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-agent/server.py | Adds admin endpoints to reload/status the graph build with shared-secret auth. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-agent/tests/graph_knowledge_tools_test.py | Adds a unit test for the graph knowledge tool. |
| tools/sdk-ai-bots/azure-sdk-qa-bot-agent/tests/admin_graphrag_endpoints_test.py | Adds unit tests for the new admin reload/status endpoints. |
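
The summary above says the admin reload/status endpoints in server.py use shared-secret auth. For reviewers unfamiliar with the pattern, here is a minimal sketch of a constant-time secret check. The function name and the way the secret reaches it are hypothetical and are not taken from the PR.

```python
import hmac
from typing import Optional


def is_authorized(provided_secret: Optional[str], expected_secret: str) -> bool:
    """Return True only when the caller supplied the expected shared secret.

    Hypothetical helper: the PR's actual check is not shown in this review.
    """
    # Reject missing values outright so an unset secret never authorizes anyone.
    if not provided_secret or not expected_secret:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(
        provided_secret.encode("utf-8"), expected_secret.encode("utf-8")
    )
```

Refusing to authorize when the expected secret is empty is a deliberate choice in this sketch: a missing environment variable then fails closed instead of open.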
Comment on lines +60 to +67
    blob_client = self._container_client.get_blob_client(blob_path)
    data = content.encode("utf-8") if isinstance(content, str) else content
    blob_client.upload_blob(
        data,
        overwrite=True,
        content_settings=ContentSettings(content_type=self._get_content_type(blob_path)),
        metadata=metadata,
    )
Comment on lines +167 to +187
    current_md5 = self._calculate_md5(content)
    existing = existing_blobs.get(blob_path)

    if existing is None:
        return True

    # Check soft-delete flag
    if existing.metadata and existing.metadata.get("IsDeleted") == "true":
        return True

    existing_md5 = existing.properties.get("content_md5")
    if not existing_md5:
        return True

    # Azure returns content_md5 as bytearray; convert to base64 for comparison
    if isinstance(existing_md5, (bytes, bytearray)):
        existing_md5_b64 = b64encode(existing_md5).decode()
    else:
        existing_md5_b64 = str(existing_md5)

    return existing_md5_b64 != current_md5
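
The comparison in this snippet only works if both sides use the same encoding. Below is a standalone sketch of the same change check, assuming `_calculate_md5` returns a base64-encoded MD5 digest (its body is not shown in the diff). `md5_b64` and `has_changed` are illustrative names, not functions from the PR.

```python
import hashlib
from base64 import b64encode
from typing import Optional, Union


def md5_b64(content: Union[str, bytes]) -> str:
    """Base64-encoded MD5 of the content (assumed format of _calculate_md5)."""
    data = content.encode("utf-8") if isinstance(content, str) else content
    return b64encode(hashlib.md5(data).digest()).decode()


def has_changed(
    content: Union[str, bytes],
    stored_md5: Optional[Union[bytes, bytearray, str]],
    is_deleted: bool = False,
) -> bool:
    """Return True when the blob should be (re)uploaded."""
    # Soft-deleted blobs and blobs without a stored hash are always re-uploaded.
    if is_deleted or not stored_md5:
        return True
    # Raw digest bytes must be base64-encoded before comparing with md5_b64().
    if isinstance(stored_md5, (bytes, bytearray)):
        stored = b64encode(bytes(stored_md5)).decode()
    else:
        stored = str(stored_md5)
    return stored != md5_b64(content)
```

If `_calculate_md5` returned a hex digest instead, every blob would look changed on every run, so the encoding is worth confirming.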
Comment on lines +32 to +35
    [tool.hatch.build.targets.wheel]
    packages = ["src"]
    testpaths = ["tests"]
    asyncio_mode = "auto"
Comment on lines +151 to +152
    "required": ["description", "folder", "metadata"],
    "additionalProperties": false
Comment on lines +100 to +103
    | `AZURE_APP_CONFIG_ENDPOINT` | Azure App Configuration endpoint |
    | `STORAGE_ACCOUNT_NAME` | Azure Storage account name |
    | `STORAGE_KNOWLEDGE_CONTAINER` | Blob container for processed docs |
    | `AI_SEARCH_ENDPOINT` | Azure AI Search endpoint URL |

    | `AI_SEARCH_INDEX_TEXT_UNITS` | AI Search index for text unit embeddings |
    | `AI_SEARCH_INDEX_ENTITIES` | AI Search index for entity embeddings |
    | `AI_SEARCH_INDEX_COMMUNITIES` | AI Search index for community embeddings |
    | `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint (for LLM + embeddings) |
Comment on lines +13 to +25
    from config.tenant_config import TenantID
    from tools.graph_knowledge_tools import GraphKnowledgeTools


    @pytest.mark.asyncio
    async def test_search_graph_knowledge_tool() -> None:
        query = "What does the TypeSpec JSON Schema emitter do?"

        result = await GraphKnowledgeTools().search_knowledge_graph(
            queries=[query], tenant_id=TenantID.TYPESPEC_CHANNEL_QA_BOT
        )

        assert len(result.results) > 0
Comment on lines +194 to +200
    if repo.auth_type == "local":
        # Copy from local path
        for folder in repo.sparse_checkout or []:
            src = os.path.join(clone_url, folder)
            dst = os.path.join(repo_path, folder)
            if os.path.exists(src):
                shutil.copytree(src, dst, dirs_exist_ok=True)
Comment on lines +400 to +406
    return ProcessedFile(
        filename=converted["filename"],
        content=converted["content"],
        blob_path=blob_path,
        is_valid=True,
        metadata=metadata,
    )
Comment on lines +418 to +426
    # Fix 1: Replace # in code blocks
    def fix_code_block(m: re.Match) -> str:
        lang = m.group(1)
        code = m.group(2)
        transformed = re.sub(r"^#(\s*)", r"//\1", code, flags=re.MULTILINE)
        return f"```{lang}\n{transformed}```"

    result = re.sub(r"```(\w+)\s*\n([\s\S]*?)```", fix_code_block, content)
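
To see what this substitution does in practice, here is a standalone reproduction of the quoted regexes. The wrapper name `fix_code_blocks` and the sample input are illustrative. The fence is built from a variable only so the example can sit inside a fenced block itself.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly to keep this example fence-safe


def fix_code_blocks(content: str) -> str:
    """Rewrite lines starting with '#' to '//' inside language-tagged code fences."""

    def fix_code_block(m: re.Match) -> str:
        lang = m.group(1)
        code = m.group(2)
        # Only a '#' in column 0 is matched; indented '#' lines are left alone.
        transformed = re.sub(r"^#(\s*)", r"//\1", code, flags=re.MULTILINE)
        return f"{FENCE}{lang}\n{transformed}{FENCE}"

    pattern = FENCE + r"(\w+)\s*\n([\s\S]*?)" + FENCE
    return re.sub(pattern, fix_code_block, content)


sample = (
    FENCE + "typespec\n"
    "# a comment\n"
    "#suppress \"example\"\n"
    "model Foo {}\n" + FENCE
)
fixed = fix_code_blocks(sample)
```

With the pattern as quoted, the `#suppress` line in the sample is rewritten to `//suppress` along with the real comment. The rewrite also applies to every tagged fence, whatever its language. It may be worth checking whether that is intended.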
Comment on lines +155 to +166
    doc += (
        f"## Scenario: {data['heading']}\n"
        f"{desc}\n"
        f"``` typespec\n{cleaned}\n```\n\n"
    )

    doc += "## Full Sample: \n// main.tsp\n``` typespec\n"
    doc += cls._remove_spector_content(main_spec) + "\n```\n"

    if client_tsp:
        doc += "// client.tsp\n``` typespec\n"
        doc += cls._remove_spector_content(client_tsp) + "\n```\n"
Comment on lines +39 to +42
    # Match both .md and .mdx — daily_sync._process_source_directory
    # already harvests both extensions, so GraphRAG must index both too.
    file_pattern: ".*\\.mdx?$$"
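
The doubled `$$` in `file_pattern` is easy to misread as a typo. A short sketch of why it can be correct, assuming the config loader treats `$$` as an escaped `$` the way Python's `string.Template` does (that loader behavior is an assumption here, not something shown in the diff):

```python
import re
from string import Template

# Value after YAML unescapes the double-quoted "\\." to "\."
yaml_value = r".*\.mdx?$$"

# string.Template collapses "$$" into a literal "$", leaving an end-of-string anchor.
pattern = Template(yaml_value).substitute()
matcher = re.compile(pattern)


def is_indexed(name: str) -> bool:
    """True when the file name would be picked up by the resolved pattern."""
    return matcher.match(name) is not None
```

Under that assumption the effective regex is `.*\.mdx?$`, which accepts both `.md` and `.mdx` and rejects longer extensions.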
Comment on lines +47 to +56
    index_schema:
      text_unit_text:
        index_name: "azuresdkqabot-dev-search-index-text-units"
        vector_size: 1536
      entity_description:
        index_name: "azuresdkqabot-dev-search-index-entities"
        vector_size: 1536
      community_full_content:
        index_name: "azuresdkqabot-dev-search-index-communities"
        vector_size: 1536
Comment on lines +486 to +501
    temp_dir = Path(tempfile.mkdtemp(prefix="graphrag-output-"))
    logger.info(
        "Downloading GraphRAG parquets from blob container '%s' (prefix='%s') "
        "to %s",
        self._blob_container,
        snapshot_prefix,
        temp_dir,
    )

    download_tasks = [
        self._download_one_parquet(name, snapshot_prefix, temp_dir)
        for name in _REQUIRED_PARQUETS
    ]
    await asyncio.gather(*download_tasks)

    return await asyncio.to_thread(self._load_parquets_from_path, temp_dir)
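
The snippet downloads concurrently with `asyncio.gather` and then loads off the event loop with `asyncio.to_thread`. Here is a self-contained sketch of that pattern with a stand-in download step. The file names and helper names are hypothetical. Unlike the quoted code, the sketch uses `tempfile.TemporaryDirectory`, so the directory is removed even if a download fails.

```python
import asyncio
import tempfile
from pathlib import Path

# Hypothetical artifact names; the PR's _REQUIRED_PARQUETS list is not shown.
REQUIRED = ("entities.parquet", "relationships.parquet")


async def download_one(name: str, dest: Path) -> None:
    """Stand-in for a blob download: writes a small placeholder file."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    (dest / name).write_bytes(b"placeholder")


def load_from_path(path: Path) -> dict:
    """Blocking load step; here it just records each file's size."""
    return {p.name: p.stat().st_size for p in sorted(path.iterdir())}


async def download_and_load() -> dict:
    # The context manager cleans up the directory on success and on error.
    with tempfile.TemporaryDirectory(prefix="graphrag-output-") as tmp:
        dest = Path(tmp)
        # Start all downloads together and wait for every one to finish.
        await asyncio.gather(*(download_one(n, dest) for n in REQUIRED))
        # Run the blocking load in a worker thread so the event loop stays free.
        return await asyncio.to_thread(load_from_path, dest)
```

This works because the load step fully materializes its result before the directory is deleted. If the real loader keeps lazy references to files on disk, the cleanup would have to be deferred until the next reload.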
Comment on lines +17 to +25
    @pytest.mark.asyncio
    async def test_search_graph_knowledge_tool() -> None:
        query = "What does the TypeSpec JSON Schema emitter do?"

        result = await GraphKnowledgeTools().search_knowledge_graph(
            queries=[query], tenant_id=TenantID.TYPESPEC_CHANNEL_QA_BOT
        )

        assert len(result.results) > 0
Comment on lines +162 to +166
    - **GraphRAG as single indexing engine**: No custom search indexing or Cosmos upload code. GraphRAG handles entity extraction, embedding generation, and vector store writes natively via its `azure_ai_search` vector store backend.
    - **Native incremental update**: Uses `graphrag update` instead of custom change-tracking logic for the graph. The doc sync still detects file-level changes to minimize unnecessary downloads.
    - **Blob Storage as source of truth**: Raw processed markdown is stored in blobs. GraphRAG reads from a local `input/` directory populated from these blobs.
    - **Managed Identity auth**: Uses Azure Managed Identity for both Azure OpenAI and AI Search (no API keys in config).
    - **12 entity types**: Decorator, Pattern, Tool, Service, API, ErrorCode, Guideline, Library, Operation, Model, Configuration, Protocol
Comment on lines +53 to +58
    A Python application that extends the knowledge sync pipeline with a knowledge graph layer built using [Microsoft GraphRAG](https://github.com/microsoft/graphrag). It performs the same documentation sync as the TypeScript service (ported to Python), then additionally:

    - Extracts entities (decorators, patterns, APIs, services, etc.) and relationships from documentation
    - Detects communities of related concepts via hierarchical clustering
    - Uploads the graph to Azure Cosmos DB for entity-aware retrieval at query time
    - Supports **incremental indexing** — only re-processes documents that changed in the current sync run
Under POC