
github-repo-discovery

Find and classify the top GitHub repositories in any category, with parallel sub-agents and a slop-aware rubric. Picks the right query type per category — and surfaces the answer that pure repo-name search misses, even when twenty real candidates exist.


A production-ready Claude Code skill that takes a category description, picks the right gh search strategy, dispatches three parallel sub-agents to fan out across established / rising / niche lanes, scores every candidate with a 16-signal composite rubric, and presents verified top results with rationale and caveats — grounded in your actual gh output, not training-data recall.


🎯 Why this exists

GitHub is huge. Finding the right repo for a given category is harder than it looks, for two reasons that compound each other:

Problem 1 — The query type matters more than the ranker. For niche capabilities, a naive gh search repos "<natural-language description>" often returns zero hits because no repo's name contains the user's phrasing. The same intent expressed as gh search code "<keyword>" --filename SKILL.md (or --filename plugin.json, or --filename Cargo.toml, etc.) typically returns dozens of real candidates because tool-specific files follow predictable filenames. For Claude Code capabilities, plugins, MCP servers, and any tooling that lives inside specific filenames, repo-name search is structurally the wrong tool. Most people never figure this out and conclude "the thing I want doesn't exist."

Problem 2 — Stars are gameable, and not by a little. A 2024 Carnegie Mellon study counted ~6 million suspected fake stars across 18,617 repos. By July 2024, 16.66% of repos with 50+ stars showed fake-star activity. Premium stargazer accounts now sell for up to $5,000. AI/LLM repos are the largest non-malicious category receiving fake stars (~177,000 suspected). The strongest single fake-star tell — fork-to-star ratio under 5% on a thousand-star repo — is something no UI surfaces by default.
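That fork-to-star tell is easy to check yourself. A minimal sketch (not part of this repo's scripts; the 5% / 1000-star thresholds simply mirror the tell described above) that reads the two counts from the standard REST repository object via an authenticated gh:

```python
#!/usr/bin/env python3
"""Back-of-the-envelope fork-to-star check. Illustrative only."""
import json
import subprocess
import sys

def fork_star_check(repo: str) -> None:
    # gh api repos/{owner}/{repo} returns the standard REST repository object.
    raw = subprocess.run(["gh", "api", f"repos/{repo}"],
                         check=True, capture_output=True, text=True).stdout
    data = json.loads(raw)
    stars, forks = data["stargazers_count"], data["forks_count"]
    ratio = forks / stars if stars else 0.0
    verdict = "suspicious" if stars >= 1000 and ratio < 0.05 else "ok"
    print(f"{repo}: {stars} stars / {forks} forks, fork-to-star ratio {ratio:.3f} ({verdict})")

if __name__ == "__main__":
    fork_star_check(sys.argv[1])  # e.g. python3 fork_check.py owner/repo
```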

This skill encodes both lessons. The category router picks the right query type before searching. The composite scorer log-scales stars (capped) and applies a hard penalty for AI-slop signatures. You get answers that survive contact with reality.


✨ What you get

When you trigger this skill in Claude Code, by the end of the conversation you have:

| Outcome | Detail |
|---|---|
| 🎯 A ranked top-N list | Each repo with score (0-100), stars, last push, license, one-liner, and a one-sentence "why it qualifies" |
| 🧪 Per-repo evidence | Score breakdown across activity / health / relevance / trust / popularity / scorecard / slop |
| 🚩 Caveats per repo | Deprecations, abandoned-but-popular flags, fork-of-X notes, suspicious growth patterns |
| 🪵 A search log | The strategy used, queries fired, candidates considered, filtered count with reasons, URLs spot-checked — so you can re-run with adjusted thresholds |
| 🔍 A classified verdict per candidate | Real evidence, not "based on common knowledge" — every repo went through a gh call in the current session |

Cost and runtime

A typical run is 60-180 seconds end-to-end and costs roughly $0.05-$0.15 in agent token usage. The gh calls and the OpenSSF Scorecard call are free. Compare to your time spent doing this manually.


🚀 Quickstart

1. Install

git clone https://github.com/MJWNA/github-repo-discovery.git ~/.claude/skills/github-repo-discovery

2. Authenticate gh and check Python

gh auth status        # should show "Logged in to github.com"
python3 --version     # 3.9 or newer

If gh isn't installed: brew install gh && gh auth login.

3. Restart Claude Code

The skill becomes available after restart. Claude Code reads ~/.claude/skills/ at session start.

4. Trigger it

In any project, just say things like:

  • "What are the top Python orchestration libraries for AI agents?"
  • "Is there an open-source vector database with good Rust support?"
  • "Top GitHub repos for [niche topic]?"
  • "Find me a tool to [do X]"
  • "What's the best library for [Y]?"
  • "Compare repos for [Z]"
  • "Find me a Claude Code skill that does [X]"
  • "Is there an MCP server for [Y]?"

The skill description has 17+ trigger phrases — it's deliberately easy to invoke.

5. Walk through

Claude will:

  1. Restate the category and announce the chosen strategy (so you can redirect)
  2. Run the primary gh query
  3. Decide whether to dispatch parallel sub-agents (>10 candidates → yes; ≤10 → score directly)
  4. Score every candidate with scripts/score_repo.py
  5. Spot-check 2 of every 5 returned URLs
  6. Present the top N with rationale, caveats, and the search log

Total time: ~30-90 seconds for narrow categories; ~2-3 minutes for broad ones with parallel-agent dispatch.


🧭 What the skill does

It splits any "find me top GitHub repos for X" task into six steps:

Step 1 — Classify the category and pick a strategy

Match the user's category against a five-row table in SKILL.md. Pick one primary strategy:

| Category signal | Primary strategy |
|---|---|
| Claude Code skill / capability | gh search code "<kw>" --filename SKILL.md --limit 20 |
| Claude Code plugin / marketplace | Anthropic plugin marketplace + --filename plugin.json |
| Language + tool type | gh search repos --topic ... --language ... --stars '>500' --pushed '>=YYYY-MM-DD' |
| Generic category | Topic search + awesome-list lookup |
| >1000 results | Partition on stars: ranges (binary-search the cutoff) |

The strategy is announced before searching so the user can correct it.

Step 2 — Run the primary query

Shell out to the gh CLI. Always sort explicitly (e.g. --sort stars); never rely on best-match ranking, which is opaque and changes without notice.

Step 3 — Decide on parallel sub-agents

If the primary query returned ≤10 high-quality candidates, score directly. If it returned >10 OR the category has clear sub-lanes, dispatch N=3 parallel sub-agents; example lane queries are sketched after this list:

  • A — Established: ≥1000 stars, created ≥2y ago, pushed in last 90d
  • B — Rising: ≥100 stars, created ≤12mo ago, pushed in last 30d, sort by stars-per-day
  • C — Niche: lower star floor, broader topic match, code search inside SKILL.md / plugin.json
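Roughly what the three lane queries can look like, sketched here for illustration. The keywords and dates below are placeholders, and the authoritative flags and thresholds live in SKILL.md:

```python
import subprocess

KEYWORDS = "agent orchestration"  # placeholder category keywords

# Search qualifiers ride along in the query string; all dates are placeholders.
LANE_QUERIES = {
    "A-established": f"{KEYWORDS} stars:>=1000 created:<2024-04-01 pushed:>=2026-01-01",
    "B-rising":      f"{KEYWORDS} stars:>=100 created:>=2025-04-01 pushed:>=2026-03-01",
}

for lane, query in LANE_QUERIES.items():
    # Lane B is re-ranked by stars-per-day afterwards; gh cannot sort on that directly.
    subprocess.run(["gh", "search", "repos", query, "--sort", "stars", "--limit", "30"],
                   check=True)

# Lane C (niche) drops the star floor and searches inside tool-specific files instead.
subprocess.run(["gh", "search", "code", KEYWORDS, "--filename", "SKILL.md", "--limit", "30"],
               check=True)
```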

Every brief contains a literal ## What other agents are covering — DO NOT DUPLICATE block. Anthropic's own multi-agent post-mortem identified vague-brief duplication as the #1 failure mode. The brief template is in references/sub-agent-brief-template.md.

Step 4 — Score candidates

python3 scripts/score_repo.py owner/repo --keywords "kw1,kw2,kw3"

Returns JSON with a 0-100 score plus a per-signal breakdown. The script implements the rubric in references/scoring-rubric.md:

  • Stars are log-scaled and capped (weight 6 of 100) — they cannot dominate
  • Slop penalty subtracts up to 15 after the weighted sum
  • Free OpenSSF Scorecard call (no auth, ~1M repos pre-computed)
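The real weights and the full 16-signal breakdown live in references/scoring-rubric.md and scripts/score_repo.py. Mechanically, the star and slop bullets above reduce to something like this toy sketch; the cap, weights, and per-hit penalty here are illustrative numbers, not the rubric's:

```python
import math

def star_signal(stars: int, cap: int = 100_000) -> float:
    """Log-scale and cap stars, normalised to 0-1, so popularity cannot dominate."""
    return math.log10(min(stars, cap) + 1) / math.log10(cap + 1)

def composite(signals: dict[str, float], weights: dict[str, float], slop_hits: int) -> float:
    """Weighted sum of 0-1 signals, then a hard slop subtraction of up to 15 points."""
    score = sum(weights[name] * signals.get(name, 0.0) for name in weights)
    score -= min(15, 5 * slop_hits)  # slop penalty comes after the weighted sum
    return max(0.0, min(100.0, score))

# A 40k-star repo with strong activity and relevance but three slop signatures
# (weights here cover only three of the signals, purely for illustration).
print(composite(
    signals={"activity": 0.9, "relevance": 0.8, "stars": star_signal(40_000)},
    weights={"activity": 50, "relevance": 44, "stars": 6},
    slop_hits=3,
))
```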

Step 5 — Verify

  1. Spot-check 2 of every 5 returned URLs via WebFetch or scripts/verify_urls.sh
  2. Auto-flag repos with <10 stars OR last commit >2 years for manual review
  3. Reject zero-tool-call sub-agent outputs (training-data fabrications)
  4. Note duplicate repos across agents (signals brief leakage)
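scripts/verify_urls.sh does step 1 with curl; a rough Python equivalent of the same spot-check (a URL passes if it answers 200, 301, or 302) looks like this sketch:

```python
import random
import urllib.request
from urllib.error import HTTPError, URLError

OK_STATUSES = {200, 301, 302}

def spot_check(urls: list[str]) -> list[str]:
    """HEAD roughly 2 of every 5 URLs; return the ones that fail the status check."""
    k = min(len(urls), max(1, len(urls) * 2 // 5)) if urls else 0
    bad = []
    for url in random.sample(urls, k):
        try:
            with urllib.request.urlopen(
                urllib.request.Request(url, method="HEAD"), timeout=10
            ) as resp:
                status = resp.status
        except HTTPError as err:
            status = err.code
        except URLError:
            status = None
        if status not in OK_STATUSES:
            bad.append(url)
    return bad
```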

Step 6 — Present

# 🔍 Top {N} repos for "{category}"

| # | Repo | Stars | Last push | Score | Why |
|---|---|---|---|---|---|

## Detail per repo
## Search log

The search log is the part most users skim past — but it's the part that lets you trust or distrust the result. If three agents each fired only one tool call, the answer is suspicious regardless of the score.


🎬 What it looks like

A typical run, in the abstract:

  1. You describe the category in natural language.
  2. The skill announces the chosen strategy (e.g. "routing to Claude-Code-skill strategy" or "routing to language+tool-type strategy") so you can redirect if it picked wrong.
  3. It runs the primary gh query and shows the candidate count.
  4. If the candidate set warrants it, the skill dispatches three parallel sub-agents (established / rising / niche).
  5. Every candidate is scored with scripts/score_repo.py against the rubric.
  6. The top N are returned in a markdown table — repo, stars, last push, score, one-line rationale — followed by per-repo detail and a search log showing what was searched, what was filtered, and what was verified.

A run from "I asked the question" to "I have a ranked, verified shortlist" is typically 60-180 seconds.


🏗️ How it works (technical breakdown)

For readers who want to know exactly what's happening under the hood, see docs/HOW-IT-WORKS.md. Highlights:

  • Routing is in SKILL.md, not a Python script. Category routing is a natural-language judgment call — the LLM is the right tool. Wrapping it in Python adds ceremony without value.
  • Scoring is a Python script. The rubric has 16 signals across activity / health / relevance / trust / popularity / scorecard / slop. Math is fiddly enough to warrant a script.
  • Sub-agent dispatch happens via Claude Code's native subagent system. No custom orchestration. The brief template is the only thing the skill enforces.
  • Verification is a bash one-liner. verify_urls.sh shells out to curl to spot-check that URLs return 200/301/302. Anything else is flagged.

The full design philosophy — why query type beats ranker, why N=3 is the sweet spot, why slop penalty is a hard subtract — is in docs/PHILOSOPHY.md.

The four parallel-agent research tracks that produced the design are in research/. Read these if you want to understand the why behind every number in the rubric.


❓ FAQ

Does this work for non-Claude-Code categories?

Yes. The category router picks a different strategy (topic + star floor + freshness) when the input doesn't smell like a Claude Code capability. The Claude-Code-specific path is one of five strategies, not the whole skill.

Will it find the literal top N or only what gh indexes?

Only what gh indexes. There are real-world repos that exist but aren't searchable via gh for hours after creation (indexing lag), and topics that aren't applied to otherwise-relevant repos (false negatives). The skill mitigates both — running the primary search plus a niche-lane code search inside SKILL.md / plugin.json catches a lot of the gaps — but you cannot count on this finding a 1-star repo created 30 minutes ago.

What if my category has more than 1000 results?

The skill partitions on stars: ranges. Both REST and GraphQL search cap at 1000 results regardless of paging — the workaround is to slice the query into chunks of ≤1000 and union the results. The slicing logic is in references/search-api-cheatsheet.md.
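The slice-and-union itself is just a loop over disjoint stars: ranges. A sketch with placeholder boundaries (the real cutoffs come from binary-searching for slices that each stay under 1000 hits; the --json field names are my assumption, so check gh search repos --help if they differ):

```python
import json
import subprocess

# Placeholder boundaries; each slice must return fewer than 1000 hits on its own.
STAR_RANGES = ["stars:>=5000", "stars:1000..4999", "stars:500..999", "stars:100..499"]

def sliced_search(query: str) -> list[dict]:
    seen, results = set(), []
    for star_range in STAR_RANGES:
        raw = subprocess.run(
            ["gh", "search", "repos", f"{query} {star_range}",
             "--sort", "stars", "--limit", "1000", "--json", "fullName,stargazersCount"],
            check=True, capture_output=True, text=True,
        ).stdout
        for repo in json.loads(raw):
            if repo["fullName"] not in seen:  # union, de-duplicated across slices
                seen.add(repo["fullName"])
                results.append(repo)
    return results
```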

Can I run the scorer standalone?

Yes:

python3 scripts/score_repo.py owner/repo --keywords "kw1,kw2,kw3"

Returns JSON. Useful for one-off ranking or ad-hoc checks.

How do I trust the score?

Read the breakdown. Every score has a per-signal table — if one signal is dragging it down (e.g. issue_health: 0.205 because the repo has many open issues), you decide whether that matters for your use case. The score is opinionated; the breakdown lets you override.

Does it cost money?

Marginally. Three sub-agents at ~10-15 tool calls each = ~$0.05–$0.15 per run on Sonnet/Opus. The gh and OpenSSF Scorecard calls are free. Compare to your time spent doing this manually.


🛠️ Customising

The skill is opinionated about workflow but flexible about content. Quick pointers:

| Want to change | Edit |
|---|---|
| Category router rules | The strategy table in SKILL.md |
| Scoring weights | The composite formula in scripts/score_repo.py and the rubric in references/scoring-rubric.md |
| Sub-agent brief structure | references/sub-agent-brief-template.md |
| Slop detection patterns | The SUPERLATIVES and LLM_TICS constants in score_repo.py |
| Recency thresholds (90d / 30d / 12mo) | SKILL.md Step 3 — the lane definitions |
| OpenSSF Scorecard fallback (default 0.5) | The fetch_scorecard function in score_repo.py |
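For reference, the public Scorecard API that the fallback guards against lives at https://api.securityscorecards.dev/projects/github.com/{owner}/{repo}. A minimal sketch of a fetch-with-fallback in that shape (not the actual fetch_scorecard implementation; the 0-1 normalisation is my assumption):

```python
import json
import urllib.request
from urllib.error import HTTPError, URLError

def fetch_scorecard(repo: str, fallback: float = 0.5) -> float:
    """Return the OpenSSF Scorecard score normalised to 0-1, or `fallback` if unavailable."""
    url = f"https://api.securityscorecards.dev/projects/github.com/{repo}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            data = json.load(resp)
        return data["score"] / 10.0  # the API reports 0-10
    except (HTTPError, URLError, KeyError, ValueError):
        return fallback  # repo not in the pre-computed dataset: neutral, not a fail
```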

🗺️ Roadmap

Possible future enhancements (not promises):

  • GraphQL-first scoring path — single round-trip per candidate instead of 3-4 REST calls
  • Awesome-list ingestion — automatically include curated entries from hesreallyhim/awesome-claude-code and friends as a fourth signal
  • Embedding-based relevance — cosine similarity between README and category brief for the top-50 candidates (currently lexical only)
  • Detect AI-generated commits — flag repos where most commits are from a bot account or have LLM-voice messages
  • Time-window selector — let the user pick 30d / 90d / 6mo / 12mo activity windows at runtime
  • JSON output mode — a --json flag for machine-readable output, for piping into dashboards

PRs welcome.


🤝 Contributing

When proposing changes, include:

  • A real run output showing the change in action (paste the markdown export from chat)
  • Updated documentation if the workflow changed
  • Reproduction recipe if you fixed a bug

PRs that:

  • Improve the category router → likely accepted
  • Improve the scoring rubric (with evidence) → likely accepted
  • Add new query strategies → likely accepted
  • Restructure the dispatch → discuss in an issue first

🙋 Common gotchas

These are real things that bit early users:

  • gh not authenticated — the script will warn but gh calls will hit unauth rate limits (10/min vs 30/min). Run gh auth login before first use.
  • Indexing lag — newly pushed READMEs and freshly applied topics aren't searchable for minutes to hours. Very recent repos may not show up; fall back to direct gh api repos/{o}/{r} if you have a candidate name.
  • OpenSSF Scorecard misses small repos — only ~1M repos are pre-computed in the public dataset. Smaller repos get the default 0.5 (neutral). This isn't a fail; it just means the signal didn't help or hurt.
  • Star inflation on AI/LLM repos — these are the most-faked category. Even after the slop penalty, expect to be skeptical of repos that show 100k+ stars on a single-author project. Cross-check fork ratio and commit history.
  • Topics are a precision filter, not a recall filter — many otherwise-relevant repos never apply topics. The skill widens recall via in:name,description, but if a repo has zero topics applied, the relevance score will lean entirely on description and README density.
  • pushed_at is repo-level, not branch-level — a bot updating a stale branch counts as activity. The GraphQL defaultBranchRef.target.committedDate is more accurate but costs an extra call, sketched just below this list.
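A sketch of that GraphQL call, issued through gh api graphql (one extra round-trip per candidate):

```python
import subprocess

QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    defaultBranchRef { target { ... on Commit { committedDate } } }
  }
}
"""

def default_branch_commit_date(owner: str, name: str) -> str:
    """Last commit date on the default branch; ignores bot pushes to other, stale branches."""
    out = subprocess.run(
        ["gh", "api", "graphql",
         "-f", f"query={QUERY}", "-f", f"owner={owner}", "-f", f"name={name}",
         "--jq", ".data.repository.defaultBranchRef.target.committedDate"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()
```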

📜 Credits

Built by Ronnie Meagher from four parallel-agent research tracks (April 2026); the tracks themselves are collected in research/.

Built with Claude Code — the skill discovers Claude Code skills, including itself.

Sister project: claude-config-audit — same author, same pattern, but for cleaning up your Claude Code installation instead of finding new things to add to it.


📜 License

MIT — use it, fork it, modify it, ship it. If you make improvements, PRs back to the main repo are appreciated but not required.


🔗 Related


The right query beats the smarter ranker.
