Feat/gh endpoint by warisgill · Pull Request #8 · mdressman/dataset-scout

warisgill · 2026-05-12T08:18:48Z

What & why

Make the LLM provider universal across the pipeline. All call sites now route through a single llm_client chokepoint that dispatches by litellm-style model-id prefix, so contributors without an Azure account can use github_copilot/<model> (existing Copilot subscription, OAuth device-code), the free github/<model> tier, openai/, or anthropic/. The legacy AZURE_OPENAI_* path is unchanged.

Behaviour change?

Yes, additive — no regressions for AOAI users.

New DATASET_SCOUT_MODEL env var + --model flag on recon, decompose, inspect, judge.
Mode-detection now gates on ctx.llm_configured (any provider) instead of ctx.aoai_configured. Metadata-only banner text rewritten provider-agnostic.
New Embedder Protocol with SbertEmbedder (default, local — runs in metadata-only too) and AOAIEmbedder (legacy). Selector: DATASET_SCOUT_EMBEDDING_BACKEND={sbert|aoai|none}.
Six cache versions bumped; cache key includes effective model id + embedder name.

How tested

uv run pytest -m "unit or recorded" — 614 passed, 1 skipped
uv run ruff check . — clean
uv run mypy — clean
New: test_llm_client.py, test_embedder.py, regressions in test_judge.py / test_report_redesign.py for the bugs below.
Manual smoke against github_copilot/gpt-5-mini — decompose worked end-to-end.

Honest-limits impact

None — the rewrite sits below the documented limits (HF-lexical bound, ~20-cap on the assessor, paper-only datasets don't auto-promote). README / docs/configuration.md / docs/getting-started.md / docs/concepts.md / docs/cli.md / docs/architecture.md were all Azure-only and are updated in the second commit to advertise the new providers.

Notes for the reviewer

Two bugs the rubber-duck pass surfaced are fixed here (regression tests included):
1. judge.run_judge(model=...) was in the cache key but never forwarded to litellm.completion — silent provider mismatch.
2. render/_view.py still substring-matched the old AOAI notice text after pipeline.py was rewritten — the metadata-only banner never fired. Renderer now imports the constants directly so it can't drift again.
Copilot TOS caveat documented in docs/configuration.md § 1a — Copilot's terms scope it to "code suggestions in editor," so batch use is a gray area. The free GitHub Models tier has no such caveat.
Deferred: per-provider concurrency cap (Copilot/Models 429 under fanout); judge CLI --model flag.

Extends dataset-scout beyond Azure OpenAI so recon, inspect, and judge can run against any litellm-supported provider — primarily GitHub Copilot via `github_copilot/<model>` (OAuth device-code), but also `github/`, `openai/`, `anthropic/`. Core refactor - New `llm_client.py` chokepoint: `resolve_llm_params` dispatches by model prefix. Only `azure/` attaches `api_base` + Entra token provider; other prefixes return bare kwargs and let litellm handle per-provider auth. - `effective_model_id` centralizes precedence: CLI override > `ctx.model` > synthesized `azure/<deployment>` for back-compat. Recon stages - `decompose`, `keyword_expansion`, `strategy`, `coverage`, `judge`, `embedding_fit`, `inspect_` all gate on `ctx.llm_configured` and cache-key on `effective_model_id`. Six cache versions bumped so cross-provider runs can't share stale entries. - `coverage.py` dropped its eager `_make_token_provider()` call and gained a `cache=` parameter. - `pipeline.METADATA_ONLY_NOTICE`, `LLM_RUNTIME_HINT`, and the Markdown/HTML callout text rewritten provider-agnostic. `render/_view.py` mode-detection now imports the constants so it can't drift. Embedder - New `embedder.py` with an `Embedder` Protocol. `SbertEmbedder` (lazy sentence-transformers) is the default; `AOAIEmbedder` remains for the legacy path. - Cache key includes embedder name + model so sbert/AOAI vectors never collide. AOAI gates on endpoint + embedding deployment only, so mixed-provider setups (Copilot chat + AOAI embeddings) work. - `embedding_fit` runs in metadata-only mode too — sbert is local. Configuration - `ScoutContext` gains `model`, `embedding_backend`, `embedding_model`. `from_env` reads `DATASET_SCOUT_MODEL`, `DATASET_SCOUT_EMBEDDING_BACKEND`, `DATASET_SCOUT_EMBEDDING_MODEL`. - `recon` CLI gets `--model` (mirrors `decompose`). - `pyproject.toml` adds `[local-embeddings]` extra; `.env.example` documents the new env vars. Bug fixes (uncovered during the audit pass) - `judge.run_judge(model=...)` was recorded in the cache key but not forwarded to the actual completion call. Now threaded through `_LiteLLMChatClient.model_override`. - `SbertEmbedder` model-load failures (offline first run, bad HF repo id) now surface as `LLMError` instead of raw torch tracebacks. Tests: 614 passed, 1 skipped. New `test_llm_client.py` and `test_embedder.py`; regression tests for the three bugs above. mypy + ruff clean.

Walks new users through the GitHub Copilot, GitHub Models (free tier), OpenAI, Anthropic, and Azure / Entra paths. Previously the README and docs/ assumed Azure OpenAI was the only option, even though the code has supported any litellm-style provider since the universal-LLM commit. - README.md: quick-start now leads with `DATASET_SCOUT_MODEL= github_copilot/...` and shows AOAI as one option among several. Adds a "no Azure account?" callout pointing at the Configuration doc. - docs/configuration.md: § 1 fully rewritten — provider matrix table, dedicated subsections for Copilot (with TOS caveat), GitHub Models, and AOAI. New § 1d documents the embedder selector (sbert default, AOAI alternative). Provider id is now noted as part of the cache key. - docs/getting-started.md: install step now lists ONE-OF blocks for each provider rather than assuming AOAI. - docs/concepts.md: mode-trigger row covers any provider. - docs/cli.md: `recon --model` flag documented; `judge` example uses the correct litellm prefix (`azure/`, `github_copilot/`) instead of the made-up `azure-openai/`. AOAI-only language in the See-also and inspect/recon notes loosened to "LLM provider". - docs/architecture.md: pipeline diagram + module listing + llm_available example + status table all reflect the provider-agnostic shape. - docs/judged-corpus-shape.md: example `model` value in the judged-record schema uses litellm-style ids. Tests: 614 passed, 1 skipped — no code touched.

+# now reached lazily via llm_client.resolve_llm_params only on the
+# Azure branch — but the symbol stays so existing test patches don't
+# break.
+_make_token_provider = make_token_provider


+
+    def embed(self, texts: list[str]) -> list[list[float]]:
+        """Return one embedding vector per input text, in input order."""
+        ...


+# Re-exported for back-compat with tests that monkeypatch these names
+# at the embedding_fit module path. Production code goes through the
+# Embedder protocol now and never invokes either function from here.
+from dataset_scout.llm_client import import_litellm, make_token_provider  # noqa: F401


Waris Gill added 2 commits May 12, 2026 07:52

warisgill requested a review from mdressman as a code owner May 12, 2026 08:18

github-advanced-security AI found potential problems May 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat/gh endpoint#8

Feat/gh endpoint#8
warisgill wants to merge 2 commits into
mdressman:mainfrom
warisgill:feat/gh-endpoint

warisgill commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

warisgill commented May 12, 2026

What & why

Behaviour change?

How tested

Honest-limits impact

Notes for the reviewer

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants