Checkpoint-backed transformer optimization and LLMOps lab for AG News classification.
BayesOptGPT is a config-driven engineering lab for AG News classification built around checkpoint-backed transformer workflows. Instead of presenting isolated training scripts, the repository links model configuration, bundle packaging, export validation, backend benchmarking, quantization analysis, and evidence curation into one reproducible system.
The codebase supports both a tiny LLaMA-inspired classifier and a modern transformer option through a shared model factory. It validates checkpoints and bundle compatibility, runs TorchScript and ONNX export/parity checks, evaluates backend behavior across CPU and CUDA paths, captures dynamic INT8 quantization outcomes, and publishes sanitized README-safe evidence under docs/evidence for transparent reporting.
| Area | Implementation |
|---|---|
| Task | AG News text classification |
| Model family | Tiny LLaMA-inspired classifier plus modern transformer option |
| Training/evaluation | Config-driven, checkpoint-backed workflows |
| Tracking | MLflow |
| Tuning | Optuna |
| Export targets | TorchScript and ONNX |
| Benchmark backends | PyTorch eager, TorchScript, ONNX Runtime |
| GPU validation | PyTorch CUDA and TorchScript CUDA; ONNX CUDA is provider-gated |
| Quantization | Dynamic INT8 on Linear modules |
| Evidence | Sanitized docs/evidence bundle |
| Quality gates | Ruff, mypy, pytest, check_environment, verify_implementations, validate_evidence |
| Validated Python | 3.12.13 |
| Validated tests | 258 passed, 1 skipped |
| CUDA device (local evidence) | NVIDIA GeForce RTX 4070 Laptop GPU |
| ONNX Runtime providers (local evidence) | AzureExecutionProvider, CPUExecutionProvider |
Transformer optimization claims are often reported without enough runtime context. BayesOptGPT is structured to answer practical engineering questions directly:
- Was the run checkpoint-backed or random-init?
- Did exported artifacts pass parity checks before runtime comparisons?
- Did the benchmark backend/provider actually match the requested target?
- Did quantization improve size, latency, both, or neither?
- Are committed results sanitized and reproducible from repository commands?
The repository's working commitments follow directly from those questions:
- Optimization decisions are validated against checkpoints, not only smoke runs.
- Export parity is treated as a prerequisite for downstream benchmark interpretation.
- GPU availability and provider availability are reported separately.
- Quantization outcomes are recorded even when latency does not improve.
- Evidence is collected and validated before being committed to documentation.
BayesOptGPT targets AG News text classification through the repository data pipeline:
- Dataset source: Hugging Face datasets AG News loader path used by the configured data module.
- Task: supervised news-category classification.
- Label space: 4 classes (World, Sports, Business, Sci/Tech) from label-map usage in serving and bundle tests.
- Model configs: configs/model.yaml and configs/model.modern.yaml.
- Export/benchmark/quantization configs: configs/export.yaml, configs/benchmark.yaml, configs/compression.yaml.
- Commit-safe evidence location: docs/evidence.
- Generated local runtime outputs: artifacts/runtime (ignored from Git).
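For orientation, the dataset can be pulled directly from the Hugging Face hub outside the repository's data module (which wraps this loader); a minimal sketch:

```python
from datasets import load_dataset

# AG News from the Hugging Face hub: 'text' and 'label' columns, four classes.
ds = load_dataset("ag_news")
print(ds["train"].features["label"].names)  # ['World', 'Sports', 'Business', 'Sci/Tech']
sample = ds["train"][0]
print(sample["label"], sample["text"][:80])
```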
```mermaid
flowchart TD
    A[Configs and Dataset] --> B[Model Factory]
    B --> C[Tiny LLaMA Classifier]
    B --> D[Modern Transformer Classifier]
    C --> E[Training and Evaluation]
    D --> E
    E --> F[Checkpoint and Bundle]
    F --> G[Export Validation]
    G --> H[TorchScript]
    G --> I[ONNX]
    F --> J[Benchmark Matrix]
    J --> K[CPU Backends]
    J --> L[CUDA Backends]
    F --> M[Dynamic INT8 Quantization]
    J --> N[README Evidence]
    M --> N
    N --> O[Sanitized docs/evidence]
```
Key directories:

```text
src/bayes_gp_llmops/models
src/bayes_gp_llmops/exporting
src/bayes_gp_llmops/benchmarking
src/bayes_gp_llmops/quantization
scripts
configs
docs/evidence
tests
```
BayesOptGPT keeps model families and runtime behavior declarative through config files.
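The factory's exact API lives in src/bayes_gp_llmops/models; a minimal sketch of the config-driven dispatch pattern it implies, with hypothetical class names and config keys, might look like:

```python
from pathlib import Path

import torch.nn as nn
import yaml

# Hypothetical names for illustration; the real factory and its config keys may differ.
class TinyLlamaClassifier(nn.Module): ...
class ModernTransformerClassifier(nn.Module): ...

_REGISTRY = {
    "tiny_llama": TinyLlamaClassifier,
    "modern_transformer": ModernTransformerClassifier,
}

def build_model(config_path: str) -> nn.Module:
    """Read a model YAML and dispatch to the registered architecture."""
    cfg = yaml.safe_load(Path(config_path).read_text())
    arch = cfg["architecture"]  # assumed key name
    if arch not in _REGISTRY:
        raise ValueError(f"Unknown architecture: {arch}")
    return _REGISTRY[arch](**cfg.get("model_kwargs", {}))

# build_model("configs/model.yaml")         -> tiny LLaMA-style classifier
# build_model("configs/model.modern.yaml")  -> modern transformer option
```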
The math of interest spans the token embedding, the transformer and feedforward residual blocks, the classification head, the training loss, and the dynamic INT8 weight approximation.
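Standard forms for a pre-norm, LLaMA-style classifier, stated here as assumptions rather than the repository's exact definitions:

```math
\begin{aligned}
h_0 &= E\,x && \text{token embedding} \\
h' &= h + \mathrm{Attn}\big(\mathrm{Norm}(h)\big) && \text{transformer residual block} \\
h'' &= h' + \mathrm{FFN}\big(\mathrm{Norm}(h')\big) && \text{feedforward residual block} \\
\hat{y} &= \mathrm{softmax}\big(W_{\text{cls}}\,\mathrm{pool}(h_L) + b\big) && \text{classification head} \\
\mathcal{L} &= -\textstyle\sum_{c=1}^{4} y_c \log \hat{y}_c && \text{cross-entropy training loss} \\
W &\approx s \cdot \mathrm{round}(W/s), \quad s = \tfrac{\max|W|}{127} && \text{dynamic INT8 approximation}
\end{aligned}
```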
- Quality gates: Ruff, mypy, pytest, check_environment, verify_implementations, validate_evidence.
- Export claims depend on parity/compatibility reporting.
- Benchmark interpretation is hardware-specific and provider-aware.
- Champion selection and bundle metadata remain deterministic and manifest-backed in the promotion flow.
Export workflows support TorchScript and ONNX with parity checks before benchmark interpretation.
- Random-init exports are smoke checks for path correctness.
- Checkpoint-backed exports are used for meaningful runtime and parity evidence.
- Artifacts are generated under artifacts/runtime and are not committed.
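The repository's export script handles this end to end; a stripped-down sketch of the parity idea, using standard torch and onnxruntime calls with assumed tolerances and input names, is:

```python
import numpy as np
import onnxruntime as ort
import torch

def export_and_check_parity(model: torch.nn.Module, example: torch.Tensor, out_dir: str) -> bool:
    """Export to TorchScript and ONNX, then compare both against eager outputs."""
    model.eval()
    with torch.no_grad():
        reference = model(example)

        # TorchScript parity via tracing
        scripted = torch.jit.trace(model, example)
        ts_ok = torch.allclose(scripted(example), reference, atol=1e-4)

    # ONNX parity (legacy exporter path, tracked as deprecated in the warnings registry)
    onnx_path = f"{out_dir}/model.onnx"
    torch.onnx.export(model, example, onnx_path, input_names=["input_ids"], output_names=["logits"])
    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    (onnx_out,) = session.run(None, {"input_ids": example.numpy()})
    onnx_ok = np.allclose(onnx_out, reference.numpy(), atol=1e-4)

    return ts_ok and onnx_ok
```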
```bash
uv run python scripts/export_model.py --config configs/export.yaml --model-config configs/model.yaml --checkpoint artifacts/checkpoints/best.ckpt --output-dir artifacts/runtime/export_checkpoint_tiny
```

Checkpoint-backed benchmark evidence (checkpoint_loaded=True, random_init_used=False) is summarized below from docs/evidence.
| Run Group | Backend | Device/Provider Used | Mean Latency (ms) | Throughput (samples/s) | Note |
|---|---|---|---|---|---|
| cpu | pytorch_eager | cpu | 3.3095 | 302.1612 | baseline CPU row |
| cpu | torchscript | cpu | 1.8710 | 534.4610 | CPU TorchScript row |
| cpu | onnxruntime | CPUExecutionProvider | 1.0837 | 922.7271 | CPU ONNX Runtime row |
| cuda | pytorch_eager | cuda | 5.2837 | 189.2630 | CUDA PyTorch row |
| cuda | torchscript | cuda | 2.2536 | 443.7349 | CUDA TorchScript row |
| onnx_cuda_attempt | onnxruntime | CPUExecutionProvider | 0.9997 | 1000.2981 | Not an ONNX CUDA result. |
ONNX CUDA attempt rows are not ONNX CUDA results when provider_used is CPUExecutionProvider.
Benchmark values are local-hardware measurements and should not be interpreted as universal backend rankings.
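The warmup/measured split used by scripts/benchmark_matrix.py follows a standard timing pattern; a hedged sketch of how mean latency and throughput figures like those above are typically computed (batch size 1 assumed):

```python
import time

import numpy as np
import torch

def benchmark(fn, example, warmup: int = 10, iters: int = 50, batch_size: int = 1):
    """Return (mean latency in ms, throughput in samples/s) for a callable."""
    with torch.no_grad():
        for _ in range(warmup):  # warmup iterations are discarded
            fn(example)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # keep queued CUDA work out of the timed region
        latencies = []
        for _ in range(iters):
            start = time.perf_counter()
            fn(example)
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            latencies.append(time.perf_counter() - start)
    mean_latency = float(np.mean(latencies))
    return mean_latency * 1000.0, batch_size / mean_latency
```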
From docs/evidence/quantization_summary.md and docs/evidence/quantization_summary.json:
- Method: dynamic_int8
- Target modules: Linear
- Quantized dynamic Linear modules: 58
- Original artifact size: 33,191,467 bytes
- Quantized artifact size: 20,631,109 bytes
- Compression ratio: about 1.61x
- Parity allclose: false
- INT8 artifacts were smaller but slower than FP32 in the tested local CPU run
| Metric | FP32 / Original | Dynamic INT8 |
|---|---|---|
| Artifact size | 33,191,467 bytes | 20,631,109 bytes |
| Compression ratio | 1.00x | 1.61x |
| Quantized Linear modules | n/a | 58 |
| Parity allclose | reference | false |
| Latency outcome | baseline | slower on tested host |
Dynamic INT8 is CPU-oriented and should be interpreted as a size/compression experiment here, not a guaranteed latency win.
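The numbers above come from scripts/quantize_model.py; conceptually, the core step is PyTorch's dynamic quantization of Linear modules. A minimal sketch, with an assumed parity tolerance:

```python
import os

import torch

def quantize_dynamic_int8(model: torch.nn.Module, example: torch.Tensor, out_dir: str) -> dict:
    """Apply dynamic INT8 quantization to Linear layers and report size and parity."""
    model.eval()
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    fp32_path = os.path.join(out_dir, "model_fp32.pt")
    int8_path = os.path.join(out_dir, "model_int8.pt")
    torch.save(model.state_dict(), fp32_path)
    torch.save(quantized.state_dict(), int8_path)

    with torch.no_grad():
        parity = torch.allclose(model(example), quantized(example), atol=1e-3)  # tolerance is an assumption

    return {
        "compression_ratio": os.path.getsize(fp32_path) / os.path.getsize(int8_path),
        "parity_allclose": parity,
    }
```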
The committed evidence bundle covers:
- Evidence index
- Validation summary
- Benchmark matrix summary
- Quantization summary
- Warnings and limitations
- Implementation timeline
Evidence workflow model:
- docs/evidence contains sanitized summaries suitable for committed documentation.
- artifacts/runtime contains generated local outputs and remains ignored.
- scripts/validate_evidence.py checks required caveats and path-leak safety.
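scripts/validate_evidence.py is the authoritative check; conceptually, the guard amounts to scanning committed evidence for absolute-path fragments and required caveat strings. The patterns below are illustrative assumptions, not the script's actual rules:

```python
import re
import sys
from pathlib import Path

# Illustrative patterns only; the real rules live in scripts/validate_evidence.py.
LEAK_PATTERNS = [re.compile(r"/home/[\w.-]+"), re.compile(r"[A-Za-z]:\\Users\\")]
REQUIRED_CAVEATS = ["Not an ONNX CUDA result"]

def validate_evidence(evidence_dir: str = "docs/evidence") -> int:
    problems = []
    for path in Path(evidence_dir).rglob("*.md"):
        text = path.read_text(encoding="utf-8")
        problems += [
            f"{path}: absolute path fragment matches {p.pattern}"
            for p in LEAK_PATTERNS if p.search(text)
        ]
    summary = Path(evidence_dir) / "benchmark_matrix_summary.md"
    if summary.exists():
        text = summary.read_text(encoding="utf-8")
        problems += [f"{summary}: missing caveat '{c}'" for c in REQUIRED_CAVEATS if c not in text]
    for problem in problems:
        print(problem)
    return 1 if problems else 0

if __name__ == "__main__":
    sys.exit(validate_evidence())
```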
| Issue | Root Cause | Resolution | Engineering Lesson |
|---|---|---|---|
| Python interpreter drift to 3.14 | Tooling can attach to a non-project interpreter | Enforced project-bound uv workflow and environment checks | Reproducibility begins with interpreter control |
| pytest import path mismatch | Mixed invocation contexts and module roots | Standardized repo-root test execution and pythonpath config | Test determinism depends on stable import roots |
| scripts package import issue | Script/module execution assumptions diverged | Kept script entry points under scripts with consistent path usage | CLI ergonomics should be explicit |
| architecture hardcoding risk | Bundle/model code could assume one architecture | Added architecture-aware loading and compatibility checks | Factory patterns reduce fragile assumptions |
| export path confusion from bare commands | Missing scripts/config prefixes in docs | Normalized commands to scripts/ and configs/ paths | Documentation quality is operational quality |
| benchmark/parity status mismatch risk | Benchmarks can be read without export compatibility context | Kept parity/validation metadata coupled to benchmark reporting | Performance claims require correctness checks first |
| CPU vs CUDA reporting ambiguity | Device/provider details can be conflated | Recorded requested/used device/provider fields in evidence | Runtime observability must be explicit |
| ONNX CUDA provider unavailable | CUDAExecutionProvider not present in local ONNX Runtime | Marked fallback rows as CPUExecutionProvider and non-CUDA results | Provider-gated reporting avoids false GPU claims |
| TorchScript CUDA compatibility uncertainty | Export/runtime CUDA compatibility may differ by host | Added explicit compatibility metadata and checks | Compatibility should be measured, not assumed |
| ONNX fallback speedup misinterpretation | CPU fallback rows can appear fast without context | Added explicit "Not an ONNX CUDA result" notes | Context labels prevent invalid conclusions |
| Evidence absolute path leak risk | Raw reports can include local path fragments | Added sanitized evidence collection plus validate_evidence checks | Commit-safe docs need automated guards |
| Dynamic INT8 smaller but slower outcome | Quantization objective trade-offs are hardware-dependent | Published size and latency together with caveats | Compression gains do not imply latency gains |
- Config-driven architecture and runtime behavior
- Model factory support with architecture compatibility validation
- Checkpoint-backed validation and reporting
- Export parity before benchmark interpretation
- Backend matrix across CPU, CUDA, and provider-aware ONNX attempts
- CUDA/provider-aware caveat handling in committed evidence
- Quantization results with both parity and latency interpretation
- Curated, sanitized evidence bundle for README claims
- Dedicated evidence validator for leak/caveat checks
- Local quality gates that are repeatable on repository root commands
- Generated runtime artifacts excluded from source control
These are production-style engineering patterns, even when runs are local and hardware-specific.
Environment setup and interpreter check:

```bash
uv sync --all-groups
uv run python --version
```

Quality gates:

```bash
uv run ruff check .
uv run mypy src scripts tests
uv run pytest
uv run python scripts/check_environment.py
uv run python scripts/verify_implementations.py --include-quality-gates
uv run python scripts/validate_evidence.py
```

Checkpoint-backed benchmarking, quantization, and evidence collection:

```bash
uv run python scripts/benchmark_matrix.py --benchmark-config configs/benchmark.yaml --model-config configs/model.yaml --checkpoint artifacts/checkpoints/best.ckpt --output-dir artifacts/runtime/benchmark_matrix_checkpoint_tiny --warmup-iterations 10 --measured-iterations 50
uv run python scripts/quantize_model.py --config configs/compression.yaml --model-config configs/model.yaml --checkpoint artifacts/checkpoints/best.ckpt --output-dir artifacts/runtime/quantization_checkpoint_tiny --warmup-iterations 10 --measured-iterations 50
uv run python scripts/collect_readme_evidence.py
uv run python scripts/validate_evidence.py
```

Local serving and dashboard are supported by repository scripts.
```bash
uv run python scripts/serve.py --config configs/serving.yaml
```

Relevant API routes:
- GET /
- GET /health
- GET /metadata
- POST /predict
- POST /predict/batch
- GET /version
- GET /docs
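Request and response schemas are defined by the serving code; assuming the default localhost binding and a JSON body with a text field (check the interactive docs at GET /docs for the actual models), a minimal smoke test of POST /predict might look like:

```python
import requests

# Assumptions: default host/port and a {"text": ...} payload shape.
resp = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"text": "NASA announces a new mission to study the outer planets."},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # expected to include the predicted AG News label
```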
Example optional environment overrides for local runtime:
- SERVING_BUNDLE_DIR
- SERVING_CONFIG_PATH
- SERVING_HOST
- SERVING_PORT
- SERVING_DEVICE_PREFERENCE
```bash
uv run streamlit run streamlit_app.py
```

This repository includes deployment-ready structure and documentation, while README claims here are scoped to validated local workflows and committed evidence.
No repository-level CI workflow is asserted in this README. Validation is currently driven by reproducible local quality gates and evidence checks.
Recommended next hardening step: add a CI docs gate that runs readme quality tests and scripts/validate_evidence.py on pull requests.
- docs/evidence/README_EVIDENCE.md
- docs/evidence/validation_summary.md
- docs/evidence/benchmark_matrix_summary.md
- docs/evidence/quantization_summary.md
- docs/evidence/warnings_limitations.md
- docs/evidence/implementation_timeline.md
See docs/evidence/warnings_limitations.md for the tracked registry.
- torch.jit and legacy ONNX exporter deprecations are tracked and intentionally deferred.
- Future torch.export modernization remains planned work.
- Future torchao migration remains planned work.
- ONNX CUDA execution requires CUDAExecutionProvider availability.
- Benchmark and quantization latency are hardware-specific.
- Dynamic INT8 improved artifact size but did not improve latency on the tested host.
Rohith Sundar Jonnalagadda
LinkedIn · MS Computer Science, Kennesaw State University
This project is licensed under the MIT License. See the LICENSE file for details.