Feat/gh endpoint#8
Open
warisgill wants to merge 2 commits into
Open
Conversation
added 2 commits
May 12, 2026 07:52
Extends dataset-scout beyond Azure OpenAI so recon, inspect, and judge can run against any litellm-supported provider — primarily GitHub Copilot via `github_copilot/<model>` (OAuth device-code), but also `github/`, `openai/`, `anthropic/`. Core refactor - New `llm_client.py` chokepoint: `resolve_llm_params` dispatches by model prefix. Only `azure/` attaches `api_base` + Entra token provider; other prefixes return bare kwargs and let litellm handle per-provider auth. - `effective_model_id` centralizes precedence: CLI override > `ctx.model` > synthesized `azure/<deployment>` for back-compat. Recon stages - `decompose`, `keyword_expansion`, `strategy`, `coverage`, `judge`, `embedding_fit`, `inspect_` all gate on `ctx.llm_configured` and cache-key on `effective_model_id`. Six cache versions bumped so cross-provider runs can't share stale entries. - `coverage.py` dropped its eager `_make_token_provider()` call and gained a `cache=` parameter. - `pipeline.METADATA_ONLY_NOTICE`, `LLM_RUNTIME_HINT`, and the Markdown/HTML callout text rewritten provider-agnostic. `render/_view.py` mode-detection now imports the constants so it can't drift. Embedder - New `embedder.py` with an `Embedder` Protocol. `SbertEmbedder` (lazy sentence-transformers) is the default; `AOAIEmbedder` remains for the legacy path. - Cache key includes embedder name + model so sbert/AOAI vectors never collide. AOAI gates on endpoint + embedding deployment only, so mixed-provider setups (Copilot chat + AOAI embeddings) work. - `embedding_fit` runs in metadata-only mode too — sbert is local. Configuration - `ScoutContext` gains `model`, `embedding_backend`, `embedding_model`. `from_env` reads `DATASET_SCOUT_MODEL`, `DATASET_SCOUT_EMBEDDING_BACKEND`, `DATASET_SCOUT_EMBEDDING_MODEL`. - `recon` CLI gets `--model` (mirrors `decompose`). - `pyproject.toml` adds `[local-embeddings]` extra; `.env.example` documents the new env vars. Bug fixes (uncovered during the audit pass) - `judge.run_judge(model=...)` was recorded in the cache key but not forwarded to the actual completion call. Now threaded through `_LiteLLMChatClient.model_override`. - `SbertEmbedder` model-load failures (offline first run, bad HF repo id) now surface as `LLMError` instead of raw torch tracebacks. Tests: 614 passed, 1 skipped. New `test_llm_client.py` and `test_embedder.py`; regression tests for the three bugs above. mypy + ruff clean.
Walks new users through the GitHub Copilot, GitHub Models (free tier), OpenAI, Anthropic, and Azure / Entra paths. Previously the README and docs/ assumed Azure OpenAI was the only option, even though the code has supported any litellm-style provider since the universal-LLM commit. - README.md: quick-start now leads with `DATASET_SCOUT_MODEL= github_copilot/...` and shows AOAI as one option among several. Adds a "no Azure account?" callout pointing at the Configuration doc. - docs/configuration.md: § 1 fully rewritten — provider matrix table, dedicated subsections for Copilot (with TOS caveat), GitHub Models, and AOAI. New § 1d documents the embedder selector (sbert default, AOAI alternative). Provider id is now noted as part of the cache key. - docs/getting-started.md: install step now lists ONE-OF blocks for each provider rather than assuming AOAI. - docs/concepts.md: mode-trigger row covers any provider. - docs/cli.md: `recon --model` flag documented; `judge` example uses the correct litellm prefix (`azure/`, `github_copilot/`) instead of the made-up `azure-openai/`. AOAI-only language in the See-also and inspect/recon notes loosened to "LLM provider". - docs/architecture.md: pipeline diagram + module listing + llm_available example + status table all reflect the provider-agnostic shape. - docs/judged-corpus-shape.md: example `model` value in the judged-record schema uses litellm-style ids. Tests: 614 passed, 1 skipped — no code touched.
| # now reached lazily via llm_client.resolve_llm_params only on the | ||
| # Azure branch — but the symbol stays so existing test patches don't | ||
| # break. | ||
| _make_token_provider = make_token_provider |
|
|
||
| def embed(self, texts: list[str]) -> list[list[float]]: | ||
| """Return one embedding vector per input text, in input order.""" | ||
| ... |
| # Re-exported for back-compat with tests that monkeypatch these names | ||
| # at the embedding_fit module path. Production code goes through the | ||
| # Embedder protocol now and never invokes either function from here. | ||
| from dataset_scout.llm_client import import_litellm, make_token_provider # noqa: F401 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What & why
Make the LLM provider universal across the pipeline. All call sites now route through a single
llm_clientchokepoint that dispatches by litellm-style model-id prefix, so contributors without an Azure account can usegithub_copilot/<model>(existing Copilot subscription, OAuth device-code), the freegithub/<model>tier,openai/, oranthropic/. The legacyAZURE_OPENAI_*path is unchanged.Behaviour change?
Yes, additive — no regressions for AOAI users.
DATASET_SCOUT_MODELenv var +--modelflag onrecon,decompose,inspect,judge.ctx.llm_configured(any provider) instead ofctx.aoai_configured. Metadata-only banner text rewritten provider-agnostic.EmbedderProtocol withSbertEmbedder(default, local — runs in metadata-only too) andAOAIEmbedder(legacy). Selector:DATASET_SCOUT_EMBEDDING_BACKEND={sbert|aoai|none}.How tested
uv run pytest -m "unit or recorded"— 614 passed, 1 skippeduv run ruff check .— cleanuv run mypy— cleantest_llm_client.py,test_embedder.py, regressions intest_judge.py/test_report_redesign.pyfor the bugs below.github_copilot/gpt-5-mini—decomposeworked end-to-end.Honest-limits impact
None — the rewrite sits below the documented limits (HF-lexical bound, ~20-cap on the assessor, paper-only datasets don't auto-promote). README /
docs/configuration.md/docs/getting-started.md/docs/concepts.md/docs/cli.md/docs/architecture.mdwere all Azure-only and are updated in the second commit to advertise the new providers.Notes for the reviewer
judge.run_judge(model=...)was in the cache key but never forwarded tolitellm.completion— silent provider mismatch.render/_view.pystill substring-matched the old AOAI notice text afterpipeline.pywas rewritten — the metadata-only banner never fired. Renderer now imports the constants directly so it can't drift again.docs/configuration.md§ 1a — Copilot's terms scope it to "code suggestions in editor," so batch use is a gray area. The free GitHub Models tier has no such caveat.judgeCLI--modelflag.