Skip to content

Feat/gh endpoint#8

Open
warisgill wants to merge 2 commits into
mdressman:mainfrom
warisgill:feat/gh-endpoint
Open

Feat/gh endpoint#8
warisgill wants to merge 2 commits into
mdressman:mainfrom
warisgill:feat/gh-endpoint

Conversation

@warisgill
Copy link
Copy Markdown

What & why

Make the LLM provider universal across the pipeline. All call sites now route through a single llm_client chokepoint that dispatches by litellm-style model-id prefix, so contributors without an Azure account can use github_copilot/<model> (existing Copilot subscription, OAuth device-code), the free github/<model> tier, openai/, or anthropic/. The legacy AZURE_OPENAI_* path is unchanged.

Behaviour change?

Yes, additive — no regressions for AOAI users.

  • New DATASET_SCOUT_MODEL env var + --model flag on recon, decompose, inspect, judge.
  • Mode-detection now gates on ctx.llm_configured (any provider) instead of ctx.aoai_configured. Metadata-only banner text rewritten provider-agnostic.
  • New Embedder Protocol with SbertEmbedder (default, local — runs in metadata-only too) and AOAIEmbedder (legacy). Selector: DATASET_SCOUT_EMBEDDING_BACKEND={sbert|aoai|none}.
  • Six cache versions bumped; cache key includes effective model id + embedder name.

How tested

  • uv run pytest -m "unit or recorded"614 passed, 1 skipped
  • uv run ruff check . — clean
  • uv run mypy — clean
  • New: test_llm_client.py, test_embedder.py, regressions in test_judge.py / test_report_redesign.py for the bugs below.
  • Manual smoke against github_copilot/gpt-5-minidecompose worked end-to-end.

Honest-limits impact

None — the rewrite sits below the documented limits (HF-lexical bound, ~20-cap on the assessor, paper-only datasets don't auto-promote). README / docs/configuration.md / docs/getting-started.md / docs/concepts.md / docs/cli.md / docs/architecture.md were all Azure-only and are updated in the second commit to advertise the new providers.

Notes for the reviewer

  • Two bugs the rubber-duck pass surfaced are fixed here (regression tests included):
    1. judge.run_judge(model=...) was in the cache key but never forwarded to litellm.completion — silent provider mismatch.
    2. render/_view.py still substring-matched the old AOAI notice text after pipeline.py was rewritten — the metadata-only banner never fired. Renderer now imports the constants directly so it can't drift again.
  • Copilot TOS caveat documented in docs/configuration.md § 1a — Copilot's terms scope it to "code suggestions in editor," so batch use is a gray area. The free GitHub Models tier has no such caveat.
  • Deferred: per-provider concurrency cap (Copilot/Models 429 under fanout); judge CLI --model flag.

Waris Gill added 2 commits May 12, 2026 07:52
Extends dataset-scout beyond Azure OpenAI so recon, inspect, and
judge can run against any litellm-supported provider — primarily
GitHub Copilot via `github_copilot/<model>` (OAuth device-code),
but also `github/`, `openai/`, `anthropic/`.

Core refactor
- New `llm_client.py` chokepoint: `resolve_llm_params` dispatches
  by model prefix. Only `azure/` attaches `api_base` + Entra token
  provider; other prefixes return bare kwargs and let litellm
  handle per-provider auth.
- `effective_model_id` centralizes precedence: CLI override >
  `ctx.model` > synthesized `azure/<deployment>` for back-compat.

Recon stages
- `decompose`, `keyword_expansion`, `strategy`, `coverage`,
  `judge`, `embedding_fit`, `inspect_` all gate on
  `ctx.llm_configured` and cache-key on `effective_model_id`.
  Six cache versions bumped so cross-provider runs can't share
  stale entries.
- `coverage.py` dropped its eager `_make_token_provider()` call
  and gained a `cache=` parameter.
- `pipeline.METADATA_ONLY_NOTICE`, `LLM_RUNTIME_HINT`, and the
  Markdown/HTML callout text rewritten provider-agnostic.
  `render/_view.py` mode-detection now imports the constants so
  it can't drift.

Embedder
- New `embedder.py` with an `Embedder` Protocol. `SbertEmbedder`
  (lazy sentence-transformers) is the default; `AOAIEmbedder`
  remains for the legacy path.
- Cache key includes embedder name + model so sbert/AOAI vectors
  never collide. AOAI gates on endpoint + embedding deployment
  only, so mixed-provider setups (Copilot chat + AOAI embeddings)
  work.
- `embedding_fit` runs in metadata-only mode too — sbert is local.

Configuration
- `ScoutContext` gains `model`, `embedding_backend`,
  `embedding_model`. `from_env` reads `DATASET_SCOUT_MODEL`,
  `DATASET_SCOUT_EMBEDDING_BACKEND`, `DATASET_SCOUT_EMBEDDING_MODEL`.
- `recon` CLI gets `--model` (mirrors `decompose`).
- `pyproject.toml` adds `[local-embeddings]` extra;
  `.env.example` documents the new env vars.

Bug fixes (uncovered during the audit pass)
- `judge.run_judge(model=...)` was recorded in the cache key but
  not forwarded to the actual completion call. Now threaded
  through `_LiteLLMChatClient.model_override`.
- `SbertEmbedder` model-load failures (offline first run, bad HF
  repo id) now surface as `LLMError` instead of raw torch
  tracebacks.

Tests: 614 passed, 1 skipped. New `test_llm_client.py` and
`test_embedder.py`; regression tests for the three bugs above.
mypy + ruff clean.
Walks new users through the GitHub Copilot, GitHub Models (free
tier), OpenAI, Anthropic, and Azure / Entra paths. Previously the
README and docs/ assumed Azure OpenAI was the only option, even
though the code has supported any litellm-style provider since the
universal-LLM commit.

- README.md: quick-start now leads with `DATASET_SCOUT_MODEL=
  github_copilot/...` and shows AOAI as one option among several.
  Adds a "no Azure account?" callout pointing at the Configuration
  doc.
- docs/configuration.md: § 1 fully rewritten — provider matrix
  table, dedicated subsections for Copilot (with TOS caveat),
  GitHub Models, and AOAI. New § 1d documents the embedder
  selector (sbert default, AOAI alternative). Provider id is now
  noted as part of the cache key.
- docs/getting-started.md: install step now lists ONE-OF blocks
  for each provider rather than assuming AOAI.
- docs/concepts.md: mode-trigger row covers any provider.
- docs/cli.md: `recon --model` flag documented; `judge` example
  uses the correct litellm prefix (`azure/`, `github_copilot/`)
  instead of the made-up `azure-openai/`. AOAI-only language in
  the See-also and inspect/recon notes loosened to "LLM provider".
- docs/architecture.md: pipeline diagram + module listing +
  llm_available example + status table all reflect the
  provider-agnostic shape.
- docs/judged-corpus-shape.md: example `model` value in the
  judged-record schema uses litellm-style ids.

Tests: 614 passed, 1 skipped — no code touched.
@warisgill warisgill requested a review from mdressman as a code owner May 12, 2026 08:18
# now reached lazily via llm_client.resolve_llm_params only on the
# Azure branch — but the symbol stays so existing test patches don't
# break.
_make_token_provider = make_token_provider

def embed(self, texts: list[str]) -> list[list[float]]:
"""Return one embedding vector per input text, in input order."""
...
# Re-exported for back-compat with tests that monkeypatch these names
# at the embedding_fit module path. Production code goes through the
# Embedder protocol now and never invokes either function from here.
from dataset_scout.llm_client import import_litellm, make_token_provider # noqa: F401
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants