BayesOptGPT

Checkpoint-backed transformer optimization and LLMOps lab for AG News classification.


Project Description

BayesOptGPT is a config-driven engineering lab for AG News classification built around checkpoint-backed transformer workflows. Instead of presenting isolated training scripts, the repository links model configuration, bundle packaging, export validation, backend benchmarking, quantization analysis, and evidence curation into one reproducible system.

The codebase supports both a tiny LLaMA-inspired classifier and a modern transformer option through a shared model factory. It validates checkpoints and bundle compatibility, runs TorchScript and ONNX export/parity checks, evaluates backend behavior across CPU and CUDA paths, captures dynamic INT8 quantization outcomes, and publishes sanitized README-safe evidence under docs/evidence for transparent reporting.
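
To make the factory idea concrete, here is a minimal sketch of a config-driven model factory. Names, fields, and defaults are illustrative assumptions, not the repository's actual API under src/bayes_gp_llmops/models.

```python
# Illustrative config-driven model factory sketch; the real factory may differ.
from dataclasses import dataclass

import torch.nn as nn


@dataclass
class ModelConfig:
    architecture: str       # e.g. "tiny_llama" or "modern_transformer" (assumed names)
    vocab_size: int = 30522
    hidden_size: int = 256
    num_classes: int = 4    # AG News label space


def _placeholder_classifier(cfg: ModelConfig) -> nn.Module:
    # Stand-in body; the actual classifiers are transformer stacks.
    return nn.Sequential(
        nn.Embedding(cfg.vocab_size, cfg.hidden_size),
        nn.Linear(cfg.hidden_size, cfg.num_classes),
    )


_REGISTRY = {
    "tiny_llama": _placeholder_classifier,
    "modern_transformer": _placeholder_classifier,
}


def build_model(cfg: ModelConfig) -> nn.Module:
    """Resolve the requested architecture or fail loudly on an unknown one."""
    try:
        return _REGISTRY[cfg.architecture](cfg)
    except KeyError as err:
        raise ValueError(f"Unsupported architecture: {cfg.architecture}") from err
```

The point of the pattern is that configs/model.yaml and configs/model.modern.yaml can select different architectures without changing training, export, or benchmarking code.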

Executive Summary

| Area | Implementation |
| --- | --- |
| Task | AG News text classification |
| Model family | Tiny LLaMA-inspired classifier plus modern transformer option |
| Training/evaluation | Config-driven, checkpoint-backed workflows |
| Tracking | MLflow |
| Tuning | Optuna |
| Export targets | TorchScript and ONNX |
| Benchmark backends | PyTorch eager, TorchScript, ONNX Runtime |
| GPU validation | PyTorch CUDA and TorchScript CUDA; ONNX CUDA is provider-gated |
| Quantization | Dynamic INT8 on Linear modules |
| Evidence | Sanitized docs/evidence bundle |
| Quality gates | Ruff, mypy, pytest, check_environment, verify_implementations, validate_evidence |
| Validated Python | 3.12.13 |
| Validated tests | 258 passed, 1 skipped |
| CUDA device (local evidence) | NVIDIA GeForce RTX 4070 Laptop GPU |
| ONNX Runtime providers (local evidence) | AzureExecutionProvider, CPUExecutionProvider |

Problem Context

Transformer optimization claims are often reported without enough runtime context. BayesOptGPT is structured to answer practical engineering questions directly:

  • Was the run checkpoint-backed or random-init?
  • Did exported artifacts pass parity checks before runtime comparisons?
  • Did the benchmark backend/provider actually match the requested target?
  • Did quantization improve size, latency, both, or neither?
  • Are committed results sanitized and reproducible from repository commands?

Why BayesOptGPT

  • Optimization decisions are validated against checkpoints, not only smoke runs.
  • Export parity is treated as a prerequisite for downstream benchmark interpretation.
  • GPU availability and provider availability are reported separately.
  • Quantization outcomes are recorded even when latency does not improve.
  • Evidence is collected and validated before being committed to documentation.

Dataset and Task Lineage

BayesOptGPT targets AG News text classification through the repository data pipeline:

  • Dataset source: the AG News loader from the Hugging Face datasets library, used by the configured data module.
  • Task: supervised news-category classification.
  • Label space: 4 classes (World, Sports, Business, Sci/Tech), as reflected in the label maps used by serving and bundle tests (see the loading sketch after this list).
  • Model configs: configs/model.yaml and configs/model.modern.yaml.
  • Export/benchmark/quantization configs: configs/export.yaml, configs/benchmark.yaml, configs/compression.yaml.
  • Commit-safe evidence location: docs/evidence.
  • Generated local runtime outputs: artifacts/runtime (ignored from Git).
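
As a point of reference, the dataset and label space above can be loaded directly with the Hugging Face datasets library. This sketch bypasses the repository's configured data module and is for orientation only.

```python
# Illustrative only: load AG News and map integer labels to class names.
from datasets import load_dataset

LABELS = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

dataset = load_dataset("ag_news")  # train/test splits with "text" and "label" columns
example = dataset["train"][0]
print(example["text"][:80], "->", LABELS[example["label"]])
```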

System Architecture

```mermaid
flowchart TD
    A[Configs and Dataset] --> B[Model Factory]
    B --> C[Tiny LLaMA Classifier]
    B --> D[Modern Transformer Classifier]
    C --> E[Training and Evaluation]
    D --> E
    E --> F[Checkpoint and Bundle]
    F --> G[Export Validation]
    G --> H[TorchScript]
    G --> I[ONNX]
    F --> J[Benchmark Matrix]
    J --> K[CPU Backends]
    J --> L[CUDA Backends]
    F --> M[Dynamic INT8 Quantization]
    J --> N[README Evidence]
    M --> N
    N --> O[Sanitized docs/evidence]
```
Key implementation directories:

  • src/bayes_gp_llmops/models
  • src/bayes_gp_llmops/exporting
  • src/bayes_gp_llmops/benchmarking
  • src/bayes_gp_llmops/quantization
  • scripts
  • configs
  • docs/evidence
  • tests

Modeling Decisions and Formulas

BayesOptGPT keeps model families and runtime behavior declarative through config files.

Token embedding:

$$X = Embedding(input\_ids)$$

Transformer residual block:

$$H_{l+1} = H_l + Attention(Norm(H_l))$$

Feedforward residual block:

$$H_{l+1} = H_l + FFN(Norm(H_l))$$

Classification head:

$$logits = W \cdot Pool(H_L) + b$$

Training loss:

$$\mathcal{L} = CrossEntropy(logits, y)$$

Dynamic INT8 dequantization approximation:

$$W_{fp32} \approx scale \cdot (W_{int8} - zero\_point)$$
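
Taken together, the formulas describe a pre-norm residual transformer with a pooled classification head. The sketch below is a generic PyTorch rendering of that structure, not the repository's actual modules, which use LLaMA-inspired components.

```python
# Generic pre-norm transformer block and pooled classification head matching
# the formulas above (illustrative; layer choices differ from the repository).
import torch
import torch.nn as nn


class Block(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int) -> None:
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        x = self.norm1(h)
        attn_out, _ = self.attn(x, x, x)
        h = h + attn_out                     # H_{l+1} = H_l + Attention(Norm(H_l))
        return h + self.ffn(self.norm2(h))   # H_{l+1} = H_l + FFN(Norm(H_l))


class Classifier(nn.Module):
    def __init__(self, vocab: int, d_model: int, n_layers: int, n_classes: int = 4) -> None:
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)   # X = Embedding(input_ids)
        self.blocks = nn.ModuleList(Block(d_model, 4, 4 * d_model) for _ in range(n_layers))
        self.head = nn.Linear(d_model, n_classes)   # logits = W * Pool(H_L) + b

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(input_ids)
        for block in self.blocks:
            h = block(h)
        return self.head(h.mean(dim=1))             # mean pooling over tokens
```

Training then minimizes CrossEntropy(logits, y), which in PyTorch is nn.CrossEntropyLoss applied to the logits and integer labels.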

Metric and Selection Policy

  • Quality gates: Ruff, mypy, pytest, check_environment, verify_implementations, validate_evidence.
  • Export claims depend on parity/compatibility reporting.
  • Benchmark interpretation is hardware-specific and provider-aware.
  • Champion selection and bundle metadata remain deterministic and manifest-backed in the promotion flow.

Export and Parity Validation

Export workflows support TorchScript and ONNX with parity checks before benchmark interpretation.

  • Random-init exports are smoke checks for path correctness.
  • Checkpoint-backed exports are used for meaningful runtime and parity evidence.
  • Artifacts are generated under artifacts/runtime and are not committed.
uv run python scripts/export_model.py --config configs/export.yaml --model-config configs/model.yaml --checkpoint artifacts/checkpoints/best.ckpt --output-dir artifacts/runtime/export_checkpoint_tiny
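
The parity idea behind export validation is straightforward to sketch: run the same batch through the eager model and the exported artifact, then compare outputs within a tolerance. The snippet below is a simplified illustration, not the logic of scripts/export_model.py, which additionally handles configs, bundles, and compatibility metadata.

```python
# Simplified ONNX export-and-parity sketch (illustrative only).
import numpy as np
import onnxruntime as ort
import torch


def check_onnx_parity(model: torch.nn.Module, input_ids: torch.Tensor,
                      onnx_path: str, atol: float = 1e-4) -> bool:
    model.eval()
    torch.onnx.export(model, (input_ids,), onnx_path,
                      input_names=["input_ids"], output_names=["logits"])
    with torch.no_grad():
        reference = model(input_ids).numpy()            # eager PyTorch logits
    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    onnx_logits = session.run(["logits"], {"input_ids": input_ids.numpy()})[0]
    return bool(np.allclose(reference, onnx_logits, atol=atol))
```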

Experiment/Benchmark Results

Checkpoint-backed benchmark evidence (checkpoint_loaded=True, random_init_used=False) is summarized below from docs/evidence.

| Run Group | Backend | Device/Provider Used | Mean Latency (ms) | Throughput (samples/s) | Note |
| --- | --- | --- | --- | --- | --- |
| cpu | pytorch_eager | cpu | 3.3095 | 302.1612 | baseline CPU row |
| cpu | torchscript | cpu | 1.8710 | 534.4610 | CPU TorchScript row |
| cpu | onnxruntime | CPUExecutionProvider | 1.0837 | 922.7271 | CPU ONNX Runtime row |
| cuda | pytorch_eager | cuda | 5.2837 | 189.2630 | CUDA PyTorch row |
| cuda | torchscript | cuda | 2.2536 | 443.7349 | CUDA TorchScript row |
| onnx_cuda_attempt | onnxruntime | CPUExecutionProvider | 0.9997 | 1000.2981 | Not an ONNX CUDA result |

ONNX CUDA attempt rows are not ONNX CUDA results when provider_used is CPUExecutionProvider.

Benchmark values are local-hardware measurements and should not be interpreted as universal backend rankings.
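
For orientation, mean-latency and throughput numbers of this kind come from a warmup-plus-measurement loop. The sketch below shows the general shape of such a measurement; scripts/benchmark_matrix.py adds backend selection, device/provider recording, and CUDA-specific handling on top of this.

```python
# Generic latency/throughput measurement sketch (illustrative only).
import time

import torch


def measure(model: torch.nn.Module, batch: torch.Tensor,
            warmup: int = 10, iters: int = 50) -> tuple[float, float]:
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):                    # warm caches and JIT paths
            model(batch)
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        elapsed = time.perf_counter() - start
    mean_latency_ms = elapsed / iters * 1000.0
    throughput = batch.shape[0] * iters / elapsed  # samples per second
    return mean_latency_ms, throughput
```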

Dynamic INT8 Quantization Results

From docs/evidence/quantization_summary.md and docs/evidence/quantization_summary.json:

  • Method: dynamic_int8
  • Target modules: Linear
  • Quantized dynamic Linear modules: 58
  • Original artifact size: 33,191,467 bytes
  • Quantized artifact size: 20,631,109 bytes
  • Compression ratio: about 1.61x
  • Parity allclose: false
  • INT8 artifacts were smaller but slower than FP32 in the tested local CPU run

| Metric | FP32 / Original | Dynamic INT8 |
| --- | --- | --- |
| Artifact size | 33,191,467 bytes | 20,631,109 bytes |
| Compression ratio | 1.00x | 1.61x |
| Quantized Linear modules | n/a | 58 |
| Parity allclose | reference | false |
| Latency outcome | baseline | slower on tested host |

Dynamic INT8 is CPU-oriented and should be interpreted as a size/compression experiment here, not a guaranteed latency win.
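
Dynamic INT8 results of this kind come from PyTorch's built-in dynamic quantization API. A minimal sketch of the size comparison is shown below; scripts/quantize_model.py additionally records latency and parity alongside size.

```python
# Minimal dynamic INT8 quantization and size-comparison sketch (illustrative).
import os

import torch


def quantize_and_compare(model: torch.nn.Module, fp32_path: str, int8_path: str) -> float:
    torch.save(model.state_dict(), fp32_path)
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8  # only Linear modules are converted
    )
    torch.save(quantized.state_dict(), int8_path)
    # Compression ratio: original bytes divided by quantized bytes.
    return os.path.getsize(fp32_path) / os.path.getsize(int8_path)
```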

Evidence Bundle

Evidence workflow model:

  • docs/evidence contains sanitized summaries suitable for committed documentation.
  • artifacts/runtime contains generated local outputs and remains ignored.
  • scripts/validate_evidence.py checks required caveats and path-leak safety.
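
Conceptually, the path-leak part of evidence validation is a text scan over the committed files. The sketch below illustrates that idea with assumed patterns; it is not the implementation of scripts/validate_evidence.py.

```python
# Simplified path-leak scan over committed evidence files (illustrative only).
import re
from pathlib import Path

LEAK_PATTERNS = [r"/home/\w+", r"/Users/\w+", r"[A-Z]:\\Users"]  # assumed patterns


def find_path_leaks(evidence_dir: str = "docs/evidence") -> list[str]:
    findings = []
    for path in Path(evidence_dir).rglob("*"):
        if path.is_file() and path.suffix in {".md", ".json"}:
            text = path.read_text(encoding="utf-8", errors="ignore")
            for pattern in LEAK_PATTERNS:
                if re.search(pattern, text):
                    findings.append(f"{path}: matches {pattern}")
    return findings
```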

Issues Encountered and Resolutions

| Issue | Root Cause | Resolution | Engineering Lesson |
| --- | --- | --- | --- |
| Python interpreter drift to 3.14 | Tooling can attach to a non-project interpreter | Enforced project-bound uv workflow and environment checks | Reproducibility begins with interpreter control |
| pytest import path mismatch | Mixed invocation contexts and module roots | Standardized repo-root test execution and pythonpath config | Test determinism depends on stable import roots |
| scripts package import issue | Script/module execution assumptions diverged | Kept script entry points under scripts with consistent path usage | CLI ergonomics should be explicit |
| Architecture hardcoding risk | Bundle/model code could assume one architecture | Added architecture-aware loading and compatibility checks | Factory patterns reduce fragile assumptions |
| Export path confusion from bare commands | Missing scripts/config prefixes in docs | Normalized commands to scripts/ and configs/ paths | Documentation quality is operational quality |
| Benchmark/parity status mismatch risk | Benchmarks can be read without export compatibility context | Kept parity/validation metadata coupled to benchmark reporting | Performance claims require correctness checks first |
| CPU vs CUDA reporting ambiguity | Device/provider details can be conflated | Recorded requested/used device/provider fields in evidence | Runtime observability must be explicit |
| ONNX CUDA provider unavailable | CUDAExecutionProvider not present in local ONNX Runtime | Marked fallback rows as CPUExecutionProvider and non-CUDA results | Provider-gated reporting avoids false GPU claims |
| TorchScript CUDA compatibility uncertainty | Export/runtime CUDA compatibility may differ by host | Added explicit compatibility metadata and checks | Compatibility should be measured, not assumed |
| ONNX fallback speedup misinterpretation | CPU fallback rows can appear fast without context | Added explicit "Not an ONNX CUDA result" notes | Context labels prevent invalid conclusions |
| Evidence absolute path leak risk | Raw reports can include local path fragments | Added sanitized evidence collection plus validate_evidence checks | Commit-safe docs need automated guards |
| Dynamic INT8 smaller but slower outcome | Quantization objective trade-offs are hardware-dependent | Published size and latency together with caveats | Compression gains do not imply latency gains |

Why This Repository Is Industry-Standard

  • Config-driven architecture and runtime behavior
  • Model factory support with architecture compatibility validation
  • Checkpoint-backed validation and reporting
  • Export parity before benchmark interpretation
  • Backend matrix across CPU, CUDA, and provider-aware ONNX attempts
  • CUDA/provider-aware caveat handling in committed evidence
  • Quantization results with both parity and latency interpretation
  • Curated, sanitized evidence bundle for README claims
  • Dedicated evidence validator for leak/caveat checks
  • Local quality gates that are repeatable on repository root commands
  • Generated runtime artifacts excluded from source control

These are production-style engineering patterns, even when runs are local and hardware-specific.

How To Run

Setup

uv sync --all-groups

Local Quality Gates

uv run python --version
uv run ruff check .
uv run mypy src scripts tests
uv run pytest
uv run python scripts/check_environment.py
uv run python scripts/verify_implementations.py --include-quality-gates
uv run python scripts/validate_evidence.py

Benchmark Matrix

uv run python scripts/benchmark_matrix.py --benchmark-config configs/benchmark.yaml --model-config configs/model.yaml --checkpoint artifacts/checkpoints/best.ckpt --output-dir artifacts/runtime/benchmark_matrix_checkpoint_tiny --warmup-iterations 10 --measured-iterations 50

Quantization

uv run python scripts/quantize_model.py --config configs/compression.yaml --model-config configs/model.yaml --checkpoint artifacts/checkpoints/best.ckpt --output-dir artifacts/runtime/quantization_checkpoint_tiny --warmup-iterations 10 --measured-iterations 50

Evidence Refresh and Validation

uv run python scripts/collect_readme_evidence.py
uv run python scripts/validate_evidence.py

Serving/API/Dashboard

Local serving and dashboard are supported by repository scripts.

FastAPI service

uv run python scripts/serve.py --config configs/serving.yaml

Relevant API routes:

  • GET /
  • GET /health
  • GET /metadata
  • POST /predict
  • POST /predict/batch
  • GET /version
  • GET /docs
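
Once the service is running locally, the prediction route can be exercised with a short client call. The host, port, and request field below are assumptions for illustration; GET /docs exposes the service's actual schema.

```python
# Illustrative client call; the "text" field and default port are assumptions.
import requests

response = requests.post(
    "http://127.0.0.1:8000/predict",   # host/port depend on the serving config
    json={"text": "NASA announces a new mission to study the outer planets."},
    timeout=10,
)
response.raise_for_status()
print(response.json())                  # response shape is defined by the service schema
```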

Example optional environment overrides for local runtime:

  • SERVING_BUNDLE_DIR
  • SERVING_CONFIG_PATH
  • SERVING_HOST
  • SERVING_PORT
  • SERVING_DEVICE_PREFERENCE

Streamlit dashboard

uv run streamlit run streamlit_app.py

This repository includes deployment-ready structure and documentation, but the claims in this README are scoped to validated local workflows and committed evidence.

CI/CD or Local Quality Checks

No repository-level CI workflow is asserted in this README. Validation is currently driven by reproducible local quality gates and evidence checks.

Recommended next hardening step: add a CI docs gate that runs the README quality tests and scripts/validate_evidence.py on pull requests.

Verification Evidence

Committed verification evidence (benchmark, quantization, and environment summaries) lives under docs/evidence; it is refreshed with scripts/collect_readme_evidence.py and checked with scripts/validate_evidence.py.

Limitations and Future Work

See docs/evidence/warnings_limitations.md for the tracked registry.

  • torch.jit and legacy ONNX exporter deprecations are tracked and intentionally deferred.
  • Future torch.export modernization remains planned work.
  • Future torchao migration remains planned work.
  • ONNX CUDA execution requires CUDAExecutionProvider availability.
  • Benchmark and quantization latency are hardware-specific.
  • Dynamic INT8 improved artifact size but did not improve latency on the tested host.

Author

Rohith Sundar Jonnalagadda
LinkedIn · MS Computer Science, Kennesaw State University

License

This project is licensed under the MIT License. See the LICENSE file for details.
