Checkpoint-backed transformer optimization and LLMOps lab for AG News classification.
BayesOptGPT is a config-driven engineering lab for AG News classification built around checkpoint-backed transformer workflows. Instead of presenting isolated training scripts, the repository links model configuration, bundle packaging, export validation, backend benchmarking, quantization analysis, and evidence curation into one reproducible system.
The codebase supports both a tiny LLaMA-inspired classifier and a modern transformer option through a shared model factory. It validates checkpoints and bundle compatibility, runs TorchScript and ONNX export/parity checks, evaluates backend behavior across CPU and CUDA paths, captures dynamic INT8 quantization outcomes, and publishes sanitized README-safe evidence under docs/evidence for transparent reporting.
| Area | Implementation |
|---|---|
| Task | AG News text classification |
| Model family | Tiny LLaMA-inspired classifier plus modern transformer option |
| Training/evaluation | Config-driven, checkpoint-backed workflows |
| Tracking | MLflow |
| Tuning | Optuna |
| Export targets | TorchScript and ONNX |
| Benchmark backends | PyTorch eager, TorchScript, ONNX Runtime |
| GPU validation | PyTorch CUDA and TorchScript CUDA; ONNX CUDA is provider-gated |
| Quantization | Dynamic INT8 on Linear modules |
| Evidence | Sanitized docs/evidence bundle |
| Quality gates | Ruff, mypy, pytest, check_environment, verify_implementations, validate_evidence |
| Validated Python | 3.12.13 |
| Validated tests | 258 passed, 1 skipped |
| CUDA device (local evidence) | NVIDIA GeForce RTX 4070 Laptop GPU |
| ONNX Runtime providers (local evidence) | AzureExecutionProvider, CPUExecutionProvider |
Transformer optimization claims are often reported without enough runtime context. BayesOptGPT is structured to answer practical engineering questions directly:
- Was the run checkpoint-backed or random-init?
- Did exported artifacts pass parity checks before runtime comparisons?
- Did the benchmark backend/provider actually match the requested target?
- Did quantization improve size, latency, both, or neither?
- Are committed results sanitized and reproducible from repository commands?
The repository's working commitments follow directly from those questions:
- Optimization decisions are validated against checkpoints, not only smoke runs.
- Export parity is treated as a prerequisite for downstream benchmark interpretation.
- GPU availability and provider availability are reported separately.
- Quantization outcomes are recorded even when latency does not improve.
- Evidence is collected and validated before being committed to documentation.
BayesOptGPT targets AG News text classification through the repository data pipeline:
- Dataset source: Hugging Face datasets AG News loader path used by the configured data module.
- Task: supervised news-category classification.
- Label space: 4 classes (World, Sports, Business, Sci/Tech) from label-map usage in serving and bundle tests.
- Model configs: configs/model.yaml and configs/model.modern.yaml.
- Export/benchmark/quantization configs: configs/export.yaml, configs/benchmark.yaml, configs/compression.yaml.
- Commit-safe evidence location: docs/evidence.
- Generated local runtime outputs: artifacts/runtime (ignored from Git).
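For orientation, the dataset can be pulled directly from the Hugging Face hub outside the repository's data module (which wraps this loader); a minimal sketch:

```python
from datasets import load_dataset

# AG News from the Hugging Face hub: 'text' and 'label' columns, four classes.
ds = load_dataset("ag_news")
print(ds["train"].features["label"].names)  # ['World', 'Sports', 'Business', 'Sci/Tech']
sample = ds["train"][0]
print(sample["label"], sample["text"][:80])
```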
```mermaid
flowchart TD
    A[Configs and Dataset] --> B[Model Factory]
    B --> C[Tiny LLaMA Classifier]
    B --> D[Modern Transformer Classifier]
    C --> E[Training and Evaluation]
    D --> E
    E --> F[Checkpoint and Bundle]
    F --> G[Export Validation]
    G --> H[TorchScript]
    G --> I[ONNX]
    F --> J[Benchmark Matrix]
    J --> K[CPU Backends]
    J --> L[CUDA Backends]
    F --> M[Dynamic INT8 Quantization]
    J --> N[README Evidence]
    M --> N
    N --> O[Sanitized docs/evidence]
```
Key directories:

```text
src/bayes_gp_llmops/models
src/bayes_gp_llmops/exporting
src/bayes_gp_llmops/benchmarking
src/bayes_gp_llmops/quantization
scripts
configs
docs/evidence
tests
```
BayesOptGPT keeps model families and runtime behavior declarative through config files.
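The factory's exact API lives in src/bayes_gp_llmops/models; a minimal sketch of the config-driven dispatch pattern it implies, with hypothetical class names and config keys, might look like:

```python
from pathlib import Path

import torch.nn as nn
import yaml

# Hypothetical names for illustration; the real factory and its config keys may differ.
class TinyLlamaClassifier(nn.Module): ...
class ModernTransformerClassifier(nn.Module): ...

_REGISTRY = {
    "tiny_llama": TinyLlamaClassifier,
    "modern_transformer": ModernTransformerClassifier,
}

def build_model(config_path: str) -> nn.Module:
    """Read a model YAML and dispatch to the registered architecture."""
    cfg = yaml.safe_load(Path(config_path).read_text())
    arch = cfg["architecture"]  # assumed key name
    if arch not in _REGISTRY:
        raise ValueError(f"Unknown architecture: {arch}")
    return _REGISTRY[arch](**cfg.get("model_kwargs", {}))

# build_model("configs/model.yaml")         -> tiny LLaMA-style classifier
# build_model("configs/model.modern.yaml")  -> modern transformer option
```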
The math of interest spans the token embedding, the transformer and feedforward residual blocks, the classification head, the training loss, and the dynamic INT8 weight approximation.
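Standard forms for a pre-norm, LLaMA-style classifier, stated here as assumptions rather than the repository's exact definitions:

```math
\begin{aligned}
h_0 &= E\,x && \text{token embedding} \\
h' &= h + \mathrm{Attn}\big(\mathrm{Norm}(h)\big) && \text{transformer residual block} \\
h'' &= h' + \mathrm{FFN}\big(\mathrm{Norm}(h')\big) && \text{feedforward residual block} \\
\hat{y} &= \mathrm{softmax}\big(W_{\text{cls}}\,\mathrm{pool}(h_L) + b\big) && \text{classification head} \\
\mathcal{L} &= -\textstyle\sum_{c=1}^{4} y_c \log \hat{y}_c && \text{cross-entropy training loss} \\
W &\approx s \cdot \mathrm{round}(W/s), \quad s = \tfrac{\max|W|}{127} && \text{dynamic INT8 approximation}
\end{aligned}
```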
- Quality gates: Ruff, mypy, pytest, check_environment, verify_implementations, validate_evidence.
- Export claims depend on parity/compatibility reporting.
- Benchmark interpretation is hardware-specific and provider-aware.
- Champion selection and bundle metadata remain deterministic and manifest-backed in the promotion flow.
Export workflows support TorchScript and ONNX with parity checks before benchmark interpretation.
- Random-init exports are smoke checks for path correctness.
- Checkpoint-backed exports are used for meaningful runtime and parity evidence.
- Artifacts are generated under artifacts/runtime and are not committed.
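The repository's export script handles this end to end; a stripped-down sketch of the parity idea, using standard torch and onnxruntime calls with assumed tolerances and input names, is:

```python
import numpy as np
import onnxruntime as ort
import torch

def export_and_check_parity(model: torch.nn.Module, example: torch.Tensor, out_dir: str) -> bool:
    """Export to TorchScript and ONNX, then compare both against eager outputs."""
    model.eval()
    with torch.no_grad():
        reference = model(example)

        # TorchScript parity via tracing
        scripted = torch.jit.trace(model, example)
        ts_ok = torch.allclose(scripted(example), reference, atol=1e-4)

    # ONNX parity (legacy exporter path, tracked as deprecated in the warnings registry)
    onnx_path = f"{out_dir}/model.onnx"
    torch.onnx.export(model, example, onnx_path, input_names=["input_ids"], output_names=["logits"])
    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    (onnx_out,) = session.run(None, {"input_ids": example.numpy()})
    onnx_ok = np.allclose(onnx_out, reference.numpy(), atol=1e-4)

    return ts_ok and onnx_ok
```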
```bash
uv run python scripts/export_model.py --config configs/export.yaml --model-config configs/model.yaml --checkpoint artifacts/checkpoints/best.ckpt --output-dir artifacts/runtime/export_checkpoint_tiny
```

Checkpoint-backed benchmark evidence (checkpoint_loaded=True, random_init_used=False) is summarized below from docs/evidence.
| Run Group | Backend | Device/Provider Used | Mean Latency (ms) | Throughput (samples/s) | Note |
|---|---|---|---|---|---|
| cpu | pytorch_eager | cpu | 3.3095 | 302.1612 | baseline CPU row |
| cpu | torchscript | cpu | 1.8710 | 534.4610 | CPU TorchScript row |
| cpu | onnxruntime | CPUExecutionProvider | 1.0837 | 922.7271 | CPU ONNX Runtime row |
| cuda | pytorch_eager | cuda | 5.2837 | 189.2630 | CUDA PyTorch row |
| cuda | torchscript | cuda | 2.2536 | 443.7349 | CUDA TorchScript row |
| onnx_cuda_attempt | onnxruntime | CPUExecutionProvider | 0.9997 | 1000.2981 | Not an ONNX CUDA result. |
ONNX CUDA attempt rows are not ONNX CUDA results when provider_used is CPUExecutionProvider.
Benchmark values are local-hardware measurements and should not be interpreted as universal backend rankings.
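The warmup/measured split used by scripts/benchmark_matrix.py follows a standard timing pattern; a hedged sketch of how mean latency and throughput figures like those above are typically computed (batch size 1 assumed):

```python
import time

import numpy as np
import torch

def benchmark(fn, example, warmup: int = 10, iters: int = 50, batch_size: int = 1):
    """Return (mean latency in ms, throughput in samples/s) for a callable."""
    with torch.no_grad():
        for _ in range(warmup):  # warmup iterations are discarded
            fn(example)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # keep queued CUDA work out of the timed region
        latencies = []
        for _ in range(iters):
            start = time.perf_counter()
            fn(example)
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            latencies.append(time.perf_counter() - start)
    mean_latency = float(np.mean(latencies))
    return mean_latency * 1000.0, batch_size / mean_latency
```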
From docs/evidence/quantization_summary.md and docs/evidence/quantization_summary.json:
- Method: dynamic_int8
- Target modules: Linear
- Quantized dynamic Linear modules: 58
- Original artifact size: 33,191,467 bytes
- Quantized artifact size: 20,631,109 bytes
- Compression ratio: about 1.61x
- Parity allclose: false
- INT8 artifacts were smaller but slower than FP32 in the tested local CPU run
| Metric | FP32 / Original | Dynamic INT8 |
|---|---|---|
| Artifact size | 33,191,467 bytes | 20,631,109 bytes |
| Compression ratio | 1.00x | 1.61x |
| Quantized Linear modules | n/a | 58 |
| Parity allclose | reference | false |
| Latency outcome | baseline | slower on tested host |
Dynamic INT8 is CPU-oriented and should be interpreted as a size/compression experiment here, not a guaranteed latency win.
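The numbers above come from scripts/quantize_model.py; conceptually, the core step is PyTorch's dynamic quantization of Linear modules. A minimal sketch, with an assumed parity tolerance:

```python
import os

import torch

def quantize_dynamic_int8(model: torch.nn.Module, example: torch.Tensor, out_dir: str) -> dict:
    """Apply dynamic INT8 quantization to Linear layers and report size and parity."""
    model.eval()
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    fp32_path = os.path.join(out_dir, "model_fp32.pt")
    int8_path = os.path.join(out_dir, "model_int8.pt")
    torch.save(model.state_dict(), fp32_path)
    torch.save(quantized.state_dict(), int8_path)

    with torch.no_grad():
        parity = torch.allclose(model(example), quantized(example), atol=1e-3)  # tolerance is an assumption

    return {
        "compression_ratio": os.path.getsize(fp32_path) / os.path.getsize(int8_path),
        "parity_allclose": parity,
    }
```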
The committed evidence bundle covers:
- Evidence index
- Validation summary
- Benchmark matrix summary
- Quantization summary
- Warnings and limitations
- Implementation timeline
Evidence workflow model:
- docs/evidence contains sanitized summaries suitable for committed documentation.
- artifacts/runtime contains generated local outputs and remains ignored.
- scripts/validate_evidence.py checks required caveats and path-leak safety.
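scripts/validate_evidence.py is the authoritative check; conceptually, the guard amounts to scanning committed evidence for absolute-path fragments and required caveat strings. The patterns below are illustrative assumptions, not the script's actual rules:

```python
import re
import sys
from pathlib import Path

# Illustrative patterns only; the real rules live in scripts/validate_evidence.py.
LEAK_PATTERNS = [re.compile(r"/home/[\w.-]+"), re.compile(r"[A-Za-z]:\\Users\\")]
REQUIRED_CAVEATS = ["Not an ONNX CUDA result"]

def validate_evidence(evidence_dir: str = "docs/evidence") -> int:
    problems = []
    for path in Path(evidence_dir).rglob("*.md"):
        text = path.read_text(encoding="utf-8")
        problems += [
            f"{path}: absolute path fragment matches {p.pattern}"
            for p in LEAK_PATTERNS if p.search(text)
        ]
    summary = Path(evidence_dir) / "benchmark_matrix_summary.md"
    if summary.exists():
        text = summary.read_text(encoding="utf-8")
        problems += [f"{summary}: missing caveat '{c}'" for c in REQUIRED_CAVEATS if c not in text]
    for problem in problems:
        print(problem)
    return 1 if problems else 0

if __name__ == "__main__":
    sys.exit(validate_evidence())
```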
| Issue | Root Cause | Resolution | Engineering Lesson |
|---|---|---|---|
| Python interpreter drift to 3.14 | Tooling can attach to a non-project interpreter | Enforced project-bound uv workflow and environment checks | Reproducibility begins with interpreter control |
| pytest import path mismatch | Mixed invocation contexts and module roots | Standardized repo-root test execution and pythonpath config | Test determinism depends on stable import roots |
| scripts package import issue | Script/module execution assumptions diverged | Kept script entry points under scripts with consistent path usage | CLI ergonomics should be explicit |
| architecture hardcoding risk | Bundle/model code could assume one architecture | Added architecture-aware loading and compatibility checks | Factory patterns reduce fragile assumptions |
| export path confusion from bare commands | Missing scripts/config prefixes in docs | Normalized commands to scripts/ and configs/ paths | Documentation quality is operational quality |
| benchmark/parity status mismatch risk | Benchmarks can be read without export compatibility context | Kept parity/validation metadata coupled to benchmark reporting | Performance claims require correctness checks first |
| CPU vs CUDA reporting ambiguity | Device/provider details can be conflated | Recorded requested/used device/provider fields in evidence | Runtime observability must be explicit |
| ONNX CUDA provider unavailable | CUDAExecutionProvider not present in local ONNX Runtime | Marked fallback rows as CPUExecutionProvider and non-CUDA results | Provider-gated reporting avoids false GPU claims |
| TorchScript CUDA compatibility uncertainty | Export/runtime CUDA compatibility may differ by host | Added explicit compatibility metadata and checks | Compatibility should be measured, not assumed |
| ONNX fallback speedup misinterpretation | CPU fallback rows can appear fast without context | Added explicit "Not an ONNX CUDA result" notes | Context labels prevent invalid conclusions |
| Evidence absolute path leak risk | Raw reports can include local path fragments | Added sanitized evidence collection plus validate_evidence checks | Commit-safe docs need automated guards |
| Dynamic INT8 smaller but slower outcome | Quantization objective trade-offs are hardware-dependent | Published size and latency together with caveats | Compression gains do not imply latency gains |
- Config-driven architecture and runtime behavior
- Model factory support with architecture compatibility validation
- Checkpoint-backed validation and reporting
- Export parity before benchmark interpretation
- Backend matrix across CPU, CUDA, and provider-aware ONNX attempts
- CUDA/provider-aware caveat handling in committed evidence
- Quantization results with both parity and latency interpretation
- Curated, sanitized evidence bundle for README claims
- Dedicated evidence validator for leak/caveat checks
- Local quality gates that are repeatable on repository root commands
- Generated runtime artifacts excluded from source control
These are production-style engineering patterns, even when runs are local and hardware-specific.
Environment setup and interpreter check:

```bash
uv sync --all-groups
uv run python --version
```

Quality gates:

```bash
uv run ruff check .
uv run mypy src scripts tests
uv run pytest
uv run python scripts/check_environment.py
uv run python scripts/verify_implementations.py --include-quality-gates
uv run python scripts/validate_evidence.py
```

Checkpoint-backed benchmarking, quantization, and evidence collection:

```bash
uv run python scripts/benchmark_matrix.py --benchmark-config configs/benchmark.yaml --model-config configs/model.yaml --checkpoint artifacts/checkpoints/best.ckpt --output-dir artifacts/runtime/benchmark_matrix_checkpoint_tiny --warmup-iterations 10 --measured-iterations 50
uv run python scripts/quantize_model.py --config configs/compression.yaml --model-config configs/model.yaml --checkpoint artifacts/checkpoints/best.ckpt --output-dir artifacts/runtime/quantization_checkpoint_tiny --warmup-iterations 10 --measured-iterations 50
uv run python scripts/collect_readme_evidence.py
uv run python scripts/validate_evidence.py
```

Local serving and dashboard are supported by repository scripts.
```bash
uv run python scripts/serve.py --config configs/serving.yaml
```

Relevant API routes:
- GET /
- GET /health
- GET /metadata
- POST /predict
- POST /predict/batch
- GET /version
- GET /docs
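Request and response schemas are defined by the serving code; assuming the default localhost binding and a JSON body with a text field (check the interactive docs at GET /docs for the actual models), a minimal smoke test of POST /predict might look like:

```python
import requests

# Assumptions: default host/port and a {"text": ...} payload shape.
resp = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"text": "NASA announces a new mission to study the outer planets."},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # expected to include the predicted AG News label
```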
Example optional environment overrides for local runtime:
- SERVING_BUNDLE_DIR
- SERVING_CONFIG_PATH
- SERVING_HOST
- SERVING_PORT
- SERVING_DEVICE_PREFERENCE
```bash
uv run streamlit run streamlit_app.py
```

This repository includes deployment-ready structure and documentation, while README claims here are scoped to validated local workflows and committed evidence.
No repository-level CI workflow is asserted in this README. Validation is currently driven by reproducible local quality gates and evidence checks.
Recommended next hardening step: add a CI docs gate that runs readme quality tests and scripts/validate_evidence.py on pull requests.
- docs/evidence/README_EVIDENCE.md
- docs/evidence/validation_summary.md
- docs/evidence/benchmark_matrix_summary.md
- docs/evidence/quantization_summary.md
- docs/evidence/warnings_limitations.md
- docs/evidence/implementation_timeline.md
See docs/evidence/warnings_limitations.md for the tracked registry.
- torch.jit and legacy ONNX exporter deprecations are tracked and intentionally deferred.
- Future torch.export modernization remains planned work.
- Future torchao migration remains planned work.
- ONNX CUDA execution requires CUDAExecutionProvider availability.
- Benchmark and quantization latency are hardware-specific.
- Dynamic INT8 improved artifact size but did not improve latency on the tested host.
Rohith Sundar Jonnalagadda
LinkedIn · MS Computer Science, Kennesaw State University
This project is licensed under the MIT License. See the LICENSE file for details.