Feat/agent refine v2 #8
Merged
prakhar728 merged 4 commits into main on Mar 22, 2026
…e cleanup
- use idea_text-only embeddings with relevance_score and aligned flag
- expose {submission_id, novelty_score, aligned} via SkillCard.user_output_keys
- decouple routes via card.user_output_keys (no skill-internal imports)
- fix init greeting template and ready confirmation
- add 20 eval submissions and stabilize two-turn eval pipeline
- all 55 tests passing
…e detection
- Swap all-MiniLM-L6-v2 → all-mpnet-base-v2 (768d, better similarity quality)
- Remove compute_relevance_scores(); replaced by LLM-judged aligned (binary)
- Triage node now reads idea text inline, judges aligned (true/false) per submission
- Duplicate detection: near-duplicate pairs (sim > 0.7) surfaced to triage LLM for confirmation
- Only the later submission in a duplicate pair is flagged; safety net prevents all-flagged edge case
- Add nudge retry if triage returns flat format without aligned field
- SIMILARITY_DUPLICATE_THRESHOLD: 0.95 → 0.7
- Remove relevance_score from all outputs, models, guardrails, frontend types
- Add agentic ingest.py (text normalization node)
- Fix SCORE_MODEL: openai/gpt-4o → deepseek-ai/DeepSeek-V3.1
- 57 unit tests + 15 e2e tests pass
What this branch is
A refactor of the hackathon_novelty skill pipeline. Three broken/missing pieces from v2 fixed, one new node added. Same graph topology: triage → router → flag/score → finalize.
What changed
Embedding model (deterministic.py)
Swapped all-MiniLM-L6-v2 → all-mpnet-base-v2 (768d). Better semantic similarity quality, making duplicate detection viable at a reasonable threshold.
Duplicate detection (agent.py, config.py, init.py)
Threshold dropped 0.95 → 0.7. Near-duplicate pairs are pre-computed and passed into the triage context explicitly — the triage LLM sees which pairs are flagged and confirms whether they’re truly the same concept. Only the later submission in a pair is classified as duplicate; the earlier proceeds to scoring.
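The pairing step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in agent.py: the function names `cosine`, `near_duplicate_pairs`, and `duplicate_candidates`, the `(submission_id, vector)` input shape, and the idea that the list is already in submission order are all hypothetical. Only the 0.7 threshold and the rule that the later submission is the duplicate candidate come from this PR.

```python
import math
from itertools import combinations

SIMILARITY_DUPLICATE_THRESHOLD = 0.7  # value from this PR (previously 0.95)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def near_duplicate_pairs(embedded, threshold=SIMILARITY_DUPLICATE_THRESHOLD):
    """Return (earlier_id, later_id, similarity) for pairs above the threshold.

    `embedded` is a list of (submission_id, vector) tuples in submission
    order; that ordering is an assumption of this sketch.
    """
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embedded, 2):
        sim = cosine(vec_a, vec_b)
        if sim > threshold:
            pairs.append((id_a, id_b, round(sim, 3)))
    return pairs


def duplicate_candidates(pairs):
    """Map each later submission to the earliest submission it resembles."""
    candidates = {}
    for earlier, later, _sim in pairs:
        candidates.setdefault(later, earlier)
    return candidates
```

In this sketch the output of `near_duplicate_pairs` is what would be placed in the triage context; `duplicate_candidates` only proposes flags, and the triage LLM still has the final say on whether a pair is truly the same concept.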
Alignment judgment (agent.py)
aligned is now judged by the triage LLM: it reads each submission’s idea text inline alongside the operator guidelines and outputs true/false per submission. This replaces the previous check, MiniLM cosine similarity against a reference text, which was broken. The relevance_score field is removed everywhere; aligned (binary) is its replacement.
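A minimal sketch of how a triage reply might be validated, including the nudge retry mentioned in the commit message. The JSON shape (a list of objects with `submission_id` and `aligned`), the nudge wording, and the names `parse_triage`, `triage_with_nudge`, and `call_llm` are assumptions made for illustration; the actual prompt and parser live in agent.py and may differ.

```python
import json


def parse_triage(raw, expected_ids):
    """Return {submission_id: aligned}, or None if the reply is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None  # e.g. a flat object instead of one entry per submission
    verdicts = {}
    for item in data:
        if not isinstance(item, dict):
            return None
        sub_id, aligned = item.get("submission_id"), item.get("aligned")
        if not isinstance(sub_id, str) or not isinstance(aligned, bool):
            return None  # missing id or non-boolean `aligned`
        verdicts[sub_id] = aligned
    return verdicts if set(verdicts) == set(expected_ids) else None


def triage_with_nudge(call_llm, prompt, expected_ids, max_nudges=1):
    """Call the triage model, re-asking if `aligned` is missing from the reply."""
    nudge = (
        "\nReturn a JSON list of objects, each with submission_id "
        "and aligned (true or false)."
    )
    for attempt in range(max_nudges + 1):
        reply = call_llm(prompt if attempt == 0 else prompt + nudge)
        verdicts = parse_triage(reply, expected_ids)
        if verdicts is not None:
            return verdicts
    raise ValueError("triage reply never included a boolean aligned per submission")
```

Here `call_llm` is any callable that takes a prompt string and returns the model's text, which keeps the retry logic testable without a live model.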
Ingestion node (ingest.py — new file)
Agentic node that runs before the deterministic layer. Normalizes submission text from plain text, markdown, and docx. Summarizes anything over 300 words.
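The length gate described above can be sketched as follows. This is a simplified illustration, not the contents of ingest.py: the constant name `SUMMARY_WORD_LIMIT`, the regex-based markdown stripping, and the `summarize` callable (standing in for the LLM call) are assumptions, and docx handling is left out because it would need a third-party parser.

```python
import re

SUMMARY_WORD_LIMIT = 300  # "over 300 words" per this PR; the constant name is assumed


def normalize_text(text):
    """Strip light markdown markup and collapse whitespace."""
    text = re.sub(r"^\s{0,3}#{1,6}\s+", "", text, flags=re.MULTILINE)  # ATX headings
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)  # [label](url) -> label
    text = re.sub(r"[*`]+", "", text)  # emphasis and inline-code markers
    return re.sub(r"\s+", " ", text).strip()


def ingest(text, summarize):
    """Normalize a submission; summarize it only when it exceeds the limit."""
    clean = normalize_text(text)
    if len(clean.split()) > SUMMARY_WORD_LIMIT:
        return summarize(clean)
    return clean
```

Passing the summarizer in as a callable keeps the word-count rule deterministic and easy to test, while the model-backed part stays swappable.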
Role-based output (config.py, frontend)
USER_OUTPUT_KEYS = {submission_id, novelty_score, aligned}: participants see only these three. Admins see the full set, which also includes criteria_scores, status, analysis_depth, and duplicate_of.
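As a rough illustration of the role split, the filter below projects a result onto the keys each role may see. The key names come from this PR; the function name `view_for`, the `"admin"` role string, the `ADMIN_ONLY_KEYS` constant, and the assumption that the admin view is a superset of the participant view are hypothetical. In the real code the routes read the keys from card.user_output_keys rather than from a module-level constant.

```python
USER_OUTPUT_KEYS = {"submission_id", "novelty_score", "aligned"}  # from this PR
# Assumed name; the fields themselves are the admin-visible ones listed in the PR.
ADMIN_ONLY_KEYS = {"criteria_scores", "status", "analysis_depth", "duplicate_of"}


def view_for(role, result):
    """Return only the fields the given role is allowed to see."""
    if role == "admin":
        allowed = USER_OUTPUT_KEYS | ADMIN_ONLY_KEYS
    else:
        allowed = USER_OUTPUT_KEYS
    return {key: value for key, value in result.items() if key in allowed}
```

Any role other than `"admin"` falls back to the participant view, so an unknown or missing role fails closed in this sketch.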
Guardrails (guardrails.py)
Key whitelist prevents any unlisted fields from reaching API responses. Score bounds clamp out-of-range values. Leakage detection flags any result that contains a substring of raw submission input — tested against prompt injection attempts where adversarial text inside submissions tries to surface itself in outputs.
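The three checks above can be sketched together as shown below. This is a hedged illustration rather than guardrails.py itself: the score range of 0 to 10, the 40-character leakage window, and the names `ALLOWED_KEYS`, `clamp_score`, `leaks_input`, and `apply_guardrails` are assumptions, since the PR does not state the actual bounds or the minimum substring length.

```python
ALLOWED_KEYS = {
    "submission_id", "novelty_score", "aligned",
    "criteria_scores", "status", "analysis_depth", "duplicate_of",
}
SCORE_MIN, SCORE_MAX = 0.0, 10.0  # assumed bounds, not stated in the PR
MIN_LEAK_CHARS = 40  # assumed minimum length for a run to count as leakage


def clamp_score(value, low=SCORE_MIN, high=SCORE_MAX):
    """Clamp a score into the [low, high] range."""
    return max(low, min(high, value))


def leaks_input(result, raw_inputs, window=MIN_LEAK_CHARS):
    """True if any string value repeats a `window`-character run of raw input.

    Inputs shorter than `window` are never reported in this sketch.
    """
    strings = [v for v in result.values() if isinstance(v, str)]
    for raw in raw_inputs:
        for start in range(max(len(raw) - window + 1, 0)):
            chunk = raw[start:start + window]
            if any(chunk in s for s in strings):
                return True
    return False


def apply_guardrails(result, raw_inputs):
    """Whitelist keys, clamp the score, and report suspected leakage."""
    clean = {k: v for k, v in result.items() if k in ALLOWED_KEYS}
    if isinstance(clean.get("novelty_score"), (int, float)):
        clean["novelty_score"] = clamp_score(clean["novelty_score"])
    return clean, leaks_input(clean, raw_inputs)
```

The sliding-window check is quadratic in the worst case, which is acceptable for short submissions; a production version would more likely index the raw text once.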