Modular Enhancement Pipeline for Small Code LLMs (CS297 Project)

Author: Sakshi Sanskruti Tripathy
San José State University — Fall 2024
Advisor: Prof. Amith Kamath Belman

This repository holds the complete artifacts for the CS297 project “Modular Enhancement Pipeline for Code-Generating LLMs.” We test whether a 0.5B-parameter model (Qwen/Qwen2.5-Coder-0.5B-Instruct) can be improved through external prompting, debugging, error classification, and risk prediction, without any retraining. All runs used the 164 HumanEval tasks under controlled seeds on Google Colab.

📌 What’s here

HumanEval-style dataset with prompts, canonical solutions, and tests (combined_humanEval_leetcode_dataset.jsonl).
Rubrics/configs for structured prompting (llm_coding_tutor/configs/blueprint_weighted_v4.yaml).
Generated model candidates and repair attempts (llm_coding_tutor/generated/).
Evaluation summaries, ablations, and plots (llm_coding_tutor/results/).
Classifier assets for error/risk prediction (ECNN, GEPN).

📁 Repository structure

CodeTutor-All-LLMS/
├── combined_humanEval_leetcode_dataset.jsonl   # tasks + canonical tests
├── llm_coding_tutor/
│   ├── configs/               # ADW rubric + model assets
│   │   ├── blueprint_weighted_v4.yaml
│   │   ├── ecnn/ ecnn_v3/ ecnn_v4/ ecnn_plus/   # TF-IDF + model joblibs, metadata
│   │   └── gepn/              # GEPN metadata and settings
│   ├── generated/             # per-variant generations/repairs
│   │   ├── baseline_seed1/
│   │   ├── baseline_primitive_seed1/
│   │   ├── adw_weighted_v4_teaching/
│   │   ├── adw_weighted_v4_teaching_seed1/
│   │   ├── adb_v1_from_adw_weighted/
│   │   ├── adb_v11_from_adw_weighted/
│   │   └── adb_v11_from_adw_weighted_seed1/
│   └── results/               # CSV/JSON summaries, plots, ablations, error taxonomy
│       ├── ablation_barplot_v1.png
│       ├── ablation_cost_accuracy_scatter_v1.png
│       ├── ablation_curve_v1.png
│       ├── ablation_radar_v1.png
│       ├── error_heatmap_plot_v1.png
│       ├── pipeline_ablation_summary.csv
│       ├── summary_baseline.json
│       ├── adw_adb_pipeline_summary.json
│       └── summary_pipeline_v11.csv
└── README.md

🧩 Enhancement modules (implemented)

Baseline (40.24%): Raw Qwen2.5-Coder-0.5B-Instruct with a deterministic seed and a strict execution harness. Artifacts: generated/baseline_seed1/<task_id>/; summary in results/summary_baseline.json.
Primitive baseline (12.19%): Minimal prompt, higher temperature. Artifacts: generated/baseline_primitive_seed1/.
ADW v4 (29.88%): Advanced Dynamic Weighting driven by blueprint_weighted_v4.yaml; hierarchical rubric dimensions (~35 subcriteria), run with teaching-mode corruption to stress robustness. Artifacts: generated/adw_weighted_v4_teaching/, results/adw_weighted_v4_teaching.csv.
ADB v11 (38.26% recovery on ADW failures: 44 of 115 repaired, which is 26.83% of all 164 tasks): Single-round automated debugging; consumes the prompt, the failing code, and the traceback to patch shallow issues. Artifacts: generated/adb_v11_from_adw_weighted/, results/adb_from_adw_summary.csv.
ADW→ADB pipeline (56.71%): Sequential ADW, then ADB on the ADW failures (49 + 44 = 93 of 164). Combined view in results/adw_adb_pipeline_summary.json and results/summary_pipeline_v11.csv.
Controlled failure dataset: Synthetic and real labeled failures across six categories. Stored in results/controlled_failures_v1.csv and ecnn_v2/ecnn_train_mini.csv.
ECNN: 6-class error classifier (TF-IDF + structural signals). Assets in configs/ecnn* and ecnn_v2/.
GEPN: Risk predictor (TF-IDF + numeric features). Assets in configs/gepn/ and gepn/; gating results in results/adw_gepn_gate_v*.csv.
GEPN-gated controller (46.95%): Chooses cheap/normal/heavy generation and decides when to call ADB/ECNN. Artifacts: generated/gepn_gated_adw_adb_ecnn/, results/gepn_gated_summary.json.
Union-mode ensemble (56.71%): Counts a task as solved if any candidate across baseline/ADW/ADB/ECNN passes. See results/pipeline_ablation_summary.csv.
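
The union-mode rule can be illustrated with a minimal sketch. It assumes each variant's results have already been reduced to a set of passing task IDs; the function name `union_pass_rate`, the variant keys, and the toy task IDs are hypothetical and not taken from the repository.

```python
from typing import Dict, Set


def union_pass_rate(passed_by_variant: Dict[str, Set[str]], all_tasks: Set[str]) -> float:
    """Return the fraction of tasks solved by at least one variant."""
    if not all_tasks:
        return 0.0
    # A task counts as solved if it appears in any variant's pass set.
    solved = set().union(*passed_by_variant.values()) & all_tasks
    return len(solved) / len(all_tasks)


# Toy data only: four made-up task IDs and three made-up pass sets.
tasks = {"T1", "T2", "T3", "T4"}
passed = {
    "baseline": {"T1"},
    "adw": {"T1", "T2"},
    "adb": {"T3"},
}
rate = union_pass_rate(passed, tasks)
print(f"union pass rate: {rate:.2%}")
```

With real data, the pass sets would come from the per-variant summaries under results/; how those files encode pass/fail is not specified here, so the loading step is left out.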

📊 Results

| Variant / source file | Tasks | Passed | Pass rate | Notes |
| --- | --- | --- | --- | --- |
| Baseline (`summary_baseline.json`) | 328 (seeds 1–2) | 132 | 40.24% | Raw model runs |
| Baseline subset (`summary_pipeline_v11.csv`) | 164 | 66 | 40.24% | 164-task slice |
| ADW weighted v4 teaching (`summary_pipeline_v11.csv`) | 164 | 49 | 29.88% | Rubric prompting |
| ADB v11 from ADW (`summary_pipeline_v11.csv`) | 115 | 44 | 38.26% | Repairs ADW failures only |
| ADW→ADB combined (`adw_adb_pipeline_summary.json`) | 164 | 93 | 56.71% | Best combined in this set |
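
The pass rates above follow directly from the task and pass counts. The short check below recomputes them; the counts are copied from the table, while the helper name `pass_rate` is only illustrative.

```python
def pass_rate(passed: int, tasks: int) -> float:
    """Pass rate as a percentage, rounded to two decimals."""
    return round(100.0 * passed / tasks, 2)


# (label, tasks, passed) copied from the results table.
rows = [
    ("Baseline (seeds 1-2)", 328, 132),
    ("Baseline subset", 164, 66),
    ("ADW weighted v4 teaching", 164, 49),
    ("ADB v11 from ADW", 115, 44),
    ("ADW->ADB combined", 164, 93),
]
for label, tasks, passed in rows:
    print(f"{label:28s} {passed:3d}/{tasks:3d} = {pass_rate(passed, tasks):.2f}%")

# ADB only sees the tasks ADW failed, and the combined count is the sum.
adw_failures = 164 - 49        # 115 tasks handed to ADB
combined_passed = 49 + 44      # 93 tasks solved by ADW or by ADB repair
print(adw_failures, combined_passed, pass_rate(combined_passed, 164))
```

The 44 repaired tasks are 38.26% of the 115 ADW failures and 26.83% of the full 164-task set, so the denominator matters when comparing the ADB figure with the other rows.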

📚 How to browse

Problems/tests: open rows in combined_humanEval_leetcode_dataset.jsonl.
Specific generation: llm_coding_tutor/generated/<variant>/HumanEval/<task_id>/ → read .py and .log.txt.
Metrics/plots: CSV/JSON + PNGs under llm_coding_tutor/results/.
Diagrams already rendered: error_heatmap_plot_v1.png, plots_passrate_adw_adb_v11.png, ablation_barplot_v1.png, ablation_curve_v1.png, ablation_radar_v1.png, ablation_cost_accuracy_scatter_v1.png.
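
For browsing from a script rather than by hand, a minimal sketch along these lines may help. It assumes the dataset is standard JSON Lines (one JSON object per line) and makes no assumption about field names, so it only prints the keys it finds. The helper names `read_jsonl` and `generation_dir`, and the example task ID `"0"`, are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def read_jsonl(path) -> List[Dict[str, Any]]:
    """Read a JSON Lines file, skipping blank lines."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def generation_dir(repo_root, variant: str, task_id: str) -> Path:
    """Build the folder path described above: generated/<variant>/HumanEval/<task_id>/."""
    return Path(repo_root) / "llm_coding_tutor" / "generated" / variant / "HumanEval" / task_id


if __name__ == "__main__":
    root = Path(".")  # assumed to be the repository root
    dataset = root / "combined_humanEval_leetcode_dataset.jsonl"
    if dataset.exists():
        rows = read_jsonl(dataset)
        if rows:
            print(len(rows), "rows; fields in first row:", sorted(rows[0]))
    folder = generation_dir(root, "baseline_seed1", "0")  # task ID is a placeholder
    if folder.is_dir():
        for item in sorted(folder.iterdir()):
            print(item.name)
```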

🛠 How to reproduce (conceptual)

Runs were executed in Google Colab using notebooks under llm_coding_tutor/notebooks (not included here).
Typical flow: load HumanEval → run baseline or ADW → run ADB on ADW failures → optional ECNN/GEPN scoring → GEPN-gated controller → gather results/plots under results/ and generations under generated/.
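
Since the notebooks are not included, the flow can only be sketched. The example below is one possible shape for a per-task loop under stated assumptions: the model, the risk predictor, and the repair step are passed in as plain callables, the tier thresholds (0.3 and 0.7) are invented for illustration, and every function name is hypothetical. It uses `exec` without sandboxing or timeouts, so it is not suitable for running untrusted model output as written.

```python
import traceback
from typing import Callable, Tuple


def run_candidate(code: str, test_code: str) -> Tuple[bool, str]:
    """Execute candidate code and its tests in a fresh namespace.

    Returns (passed, traceback_text). No sandbox and no timeout: illustration only.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(test_code, namespace)
        return True, ""
    except Exception:
        return False, traceback.format_exc()


def choose_tier(risk: float, low: float = 0.3, high: float = 0.7) -> str:
    """Map a risk score to a generation tier; thresholds are assumed values."""
    if risk < low:
        return "cheap"
    if risk < high:
        return "normal"
    return "heavy"


def run_task(
    prompt: str,
    test_code: str,
    predict_risk: Callable[[str], float],
    generate: Callable[[str, str], str],
    repair: Callable[[str, str, str], str],
) -> Tuple[str, str]:
    """Gate, generate, test, then attempt one repair round on failure."""
    tier = choose_tier(predict_risk(prompt))
    code = generate(prompt, tier)
    passed, tb = run_candidate(code, test_code)
    if passed:
        return "passed_first_try", code
    # Single repair round: the repair step sees prompt + failing code + traceback.
    patched = repair(prompt, code, tb)
    passed, _ = run_candidate(patched, test_code)
    return ("passed_after_repair" if passed else "failed"), patched


# Toy stand-ins for the model calls, so the sketch runs without any model.
BUGGY = "def add(a, b):\n    return a - b\n"
FIXED = "def add(a, b):\n    return a + b\n"
status, final_code = run_task(
    prompt="Write add(a, b).",
    test_code="assert add(2, 3) == 5",
    predict_risk=lambda prompt: 0.5,
    generate=lambda prompt, tier: BUGGY,
    repair=lambda prompt, code, tb: FIXED,
)
print(status)
```

Passing the model-facing steps in as callables keeps the control flow testable without a GPU; in a real run they would wrap calls to Qwen/Qwen2.5-Coder-0.5B-Instruct and the GEPN assets.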

Notes

Model: Qwen/Qwen2.5-Coder-0.5B-Instruct.
Rubric for ADW/ADB: llm_coding_tutor/configs/blueprint_weighted_v4.yaml.
This repo is an artifact store; execution environment setup is not part of the repo.
