This repository implements a synthetic-data-first pipeline for the Pocket-Agent tool-calling challenge.
Phase 1 is implemented:
- schema-first tool contract source of truth
- synthetic generation spec scaffold
- strict output canonicalizer/validator
- conversation history contract
- smoke test for Phase 1 behavior
Phase 2 is implemented:
- synthetic data generator with tool/refusal balance
- multi-turn sample generation
- prompt dedup and hash export
- optional Gemini-based paraphrase augmentation using
GEMINI_API_KEY - deterministic local augmentation fallback
- smoke test for dataset quality and formatting
Phase 3 is implemented:
- QLoRA fine-tuning pipeline for TinyLlama 1.1B (configurable)
- training text formatter for prompt/history/response supervision
- dry-run path for fast validation before GPU training
- adapter export to
artifacts/phase3/adapter
Phase 4 is implemented:
- local 4-bit quantization pipeline (quanto backend)
- adapter merge then quantize flow
- artifact size report for <=500 MB gate checks
- CPU latency benchmark with 20-turn default prompt pack
- dry-run smoke checks for fast validation
Phase 5 is implemented:
- grader-facing
inference.pycontract - runtime with model generation + rule fallback
- strict output canonicalization before returning responses
- multi-turn CLI chatbot demo with visible tool-call output
- smoke test for contract and behavior checks
Phase 6 is implemented:
- unified pre-submit gate checker for hard requirements
- strict and non-strict verification modes
- leakage overlap check against
starter/public_test.jsonlwhen available - forbidden import and
run()signature checks forinference.py - demo launch validation and report generation
Run from repository root:
PYTHONPATH=. python scripts/phase1_smoke_test.pyExpected output:
Phase 1 smoke test passed.
Place your key in .env:
GEMINI_API_KEY=your_key_here
Run generator from repository root:
PYTHONPATH=. python3 scripts/generate_synthetic_data.pyOptional deterministic mode (no network augmentation):
PYTHONPATH=. python3 scripts/generate_synthetic_data.py --disable-geminiOutputs:
data/synthetic/train.jsonldata/synthetic/validation.jsonldata/synthetic/all.jsonldata/synthetic/prompt_hashes.txtdata/synthetic/summary.json
PYTHONPATH=. python3 scripts/phase2_smoke_test.pyExpected output:
Phase 2 smoke test passed.
Validate formatting and dataset wiring quickly:
PYTHONPATH=. python3 scripts/phase3_smoke_test.pyExpected output:
Phase 3 smoke test passed.
You can also inspect dry-run summary directly:
PYTHONPATH=. python3 scripts/train_phase3.py --dry-run --max-train-samples 32 --max-eval-samples 8Run full QLoRA fine-tuning:
PYTHONPATH=. python3 scripts/train_phase3.pyOutputs:
artifacts/phase3/adapter/(LoRA adapter + tokenizer files)artifacts/phase3/training_summary.json
Dry run first:
PYTHONPATH=. python3 scripts/phase4_smoke_test.pyQuantize adapter + base model:
PYTHONPATH=. python3 scripts/quantize_phase4.pyQuantization outputs:
artifacts/phase4/quantized/artifacts/phase4/quantization_summary.json
Run benchmark on quantized model:
PYTHONPATH=. python3 scripts/benchmark_latency_phase4.pyOutput:
artifacts/phase4/latency_summary.json
Grader entry point is:
inference.pywithrun(prompt: str, history: list[dict]) -> str
Run smoke validation:
PYTHONPATH=. python3 scripts/phase5_smoke_test.pyExpected output:
Phase 5 smoke test passed.
Run multi-turn local chatbot demo:
PYTHONPATH=. python3 scripts/demo_cli.pyDemo behavior:
- keeps conversation history in memory
- prints assistant output each turn
- shows wrapped
<tool_call>...</tool_call>outputs when tools are selected
Development smoke check (non-strict):
PYTHONPATH=. python3 scripts/phase6_smoke_test.pyStrict pre-submit check:
PYTHONPATH=. python3 scripts/preflight_phase6.pyIf starter/public_test.jsonl is not present locally during development:
PYTHONPATH=. python3 scripts/preflight_phase6.py --non-strict --allow-pending-leakagePreflight reports:
artifacts/phase6/preflight_report.jsonartifacts/phase6/preflight_report.md
configs/tool_schemas.jsonconfigs/synthetic_generation_spec.jsonpocket_agent/contracts.pypocket_agent/schema_loader.pypocket_agent/output_validator.pyscripts/phase1_smoke_test.py
pocket_agent/synthetic_data.pyscripts/generate_synthetic_data.pyscripts/phase2_smoke_test.py
configs/training_config.jsonpocket_agent/training.pyscripts/train_phase3.pyscripts/phase3_smoke_test.py
configs/phase4_config.jsonconfigs/latency_prompts.jsonpocket_agent/quantization.pypocket_agent/latency.pyscripts/quantize_phase4.pyscripts/benchmark_latency_phase4.pyscripts/phase4_smoke_test.py
configs/inference_config.jsonpocket_agent/inference_runtime.pyinference.pyscripts/demo_cli.pyscripts/phase5_smoke_test.py
configs/phase6_config.jsonpocket_agent/preflight.pyscripts/preflight_phase6.pyscripts/phase6_smoke_test.py