
TD-Classifier Suite

A suite for detecting and classifying technical debt in software repositories using transformer models. It covers 18 TD categories (architecture, security, performance, code quality, and more), ships 17 pre-trained models on Hugging Face, and works end-to-end from raw GitHub issues to structured predictions — with or without a GPU.


Use Cases

| Scenario | How this tool helps |
| --- | --- |
| Repository health audit | Fetch all issues from a GitHub repo and flag which ones describe technical debt |
| Issue triage | Automatically tag incoming issues so engineers know which ones carry TD risk |
| Research / metrics | Measure TD density across projects, releases, or teams over time |
| Custom classifier | Fine-tune on your own labelled dataset to match your team's definition of TD |
| CI / pre-merge checks | Pipe the PR description or commit message through the model to surface TD before merge |
| Multi-category analysis | Run an ensemble to detect which type of TD is present (security, performance, etc.) |

Features

  • 18 TD categories — general TD, architecture, code quality, security, performance, defects, infrastructure, requirements, design, usability, compatibility, reliability, process, build, maintenance, automation, people, portability
  • 17 pre-trained models on Hugging Face Hub — zero training required for inference
  • ONNX-first inference — CPU by default, no PyTorch required. All 17 models ship model.onnx on Hugging Face Hub, auto-downloaded on first use; if model.onnx is absent, the engine exports one from safetensors via torch.onnx.export (requires torch + onnx)
  • GitHub issues pipeline — fetch → clean → classify in three commands
  • Custom training — fine-tune on your own data with cross-validation, class weighting, and early stopping
  • Ensemble inference — combine multiple category models with custom weights
  • Carbon tracking — CodeCarbon emissions tracking on every training and inference run
  • Gradio web UI — browser-based interface for non-CLI users

Installation

Default backend is ONNX (CPU). PyTorch and CUDA are optional extras — only needed for training or explicit GPU inference.

With UV (recommended)

UV is a fast Python package manager that replaces pip + venv.

# Install UV — macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install UV — Windows (PowerShell)
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

git clone https://github.com/KarthikShivasankar/text_classification
cd text_classification

uv venv                        # create .venv/
uv pip install -e .            # CPU inference (ONNX) — no GPU / PyTorch required
uv pip install -e ".[gpu]"     # + GPU inference via onnxruntime-gpu + torch (CUDA 12.4)
uv pip install -e ".[train]"   # + full training stack (torch, codecarbon, evaluate…)
uv pip install -e ".[onnx]"    # + onnx package for exporting your own models
uv pip install -e ".[dev]"     # + black, isort, flake8

# optimum (optional) — only needed if you want to use optimum's ORTModel API
# NOTE: optimum 2.x requires transformers<5; install in a separate venv if needed.
# The built-in torch.onnx.export fallback works without optimum.
uv pip install "optimum[onnxruntime]" "transformers<5"

After uv venv, set your IDE's Python interpreter to .venv/Scripts/python.exe (Windows) or .venv/bin/python (Linux/Mac) so imports resolve correctly.

With pip

git clone https://github.com/KarthikShivasankar/text_classification
cd text_classification
pip install -e .                       # CPU inference — no GPU needed
pip install -e ".[gpu,dev,test]"       # GPU + dev + test extras

CPU-only (no GPU, inference only)

The default pip install -e . already gives you full CPU inference via ONNX. No additional steps required — model.onnx is auto-downloaded from Hugging Face Hub the first time you run inference:

pip install -e .
# Run inference directly — model.onnx downloads automatically on first use
tdsuite-inference --model_name karths/binary_classification_train_TD \
    --text "The auth module has no rate limiting"

Publishing to PyPI

uv pip install build twine
python -m build
twine upload dist/*

Quick Start

Classify issues from a public GitHub repo in under 5 minutes — no GPU needed:

# Install (CPU-only, no PyTorch required)
git clone https://github.com/KarthikShivasankar/text_classification && cd text_classification
uv venv && uv pip install -e .

# Fetch the 100 most recent issues
python scripts/fetch_github_issues.py --repo microsoft/vscode --output issues.csv

# Extract the body text
python scripts/extract_issue_bodies.py --input issues.csv --output issue_texts.csv --min-length 50

# Classify — model.onnx is auto-downloaded from Hugging Face Hub on first run
tdsuite-inference \
    --model_name karths/binary_classification_train_TD \
    --input_file issue_texts.csv

Results land in a timestamped folder: outputs/.../inference_YYYYMMDD_HHMMSS/predictions_issue_texts.csv
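To summarise a scan programmatically, the predictions CSV loads cleanly with pandas. A minimal sketch (replace the path with the timestamped file printed above):

import pandas as pd

df = pd.read_csv("predictions_issue_texts.csv")   # use the timestamped path printed above
flagged = df[df["predicted_class"] == 1]
print(f"{len(flagged)} of {len(df)} issues flagged as technical debt")
print(flagged.sort_values("predicted_probability", ascending=False).head(10))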

GPU inference (requires pip install -e ".[gpu]"):

tdsuite-inference \
    --model_name karths/binary_classification_train_TD \
    --device cuda \
    --input_file issue_texts.csv

Use Case Walkthroughs

Use Case 1 — Scan a GitHub repo for technical debt

Audit any public repository without writing any code.

# Step 1: fetch issues (default: latest 100, newest-first)
python scripts/fetch_github_issues.py \
    --repo owner/repo \
    --output issues.csv

# Fetch more issues
python scripts/fetch_github_issues.py --repo owner/repo --limit 500 --output issues.csv

# Fetch everything (slow on large repos)
python scripts/fetch_github_issues.py --repo owner/repo --all --output issues.csv

Rate limit? Unauthenticated calls are capped at 60/hour. Add --token $GITHUB_TOKEN to raise this to 5,000/hour. Create a token (no scopes needed for public repos) at https://github.com/settings/tokens. The script prints the exact wait time and that link automatically when the limit is hit.

# Step 2: clean — keep only body text, drop short/duplicate entries
python scripts/extract_issue_bodies.py \
    --input issues.csv \
    --output issue_texts.csv \
    --min-length 50 \
    --drop-duplicates \
    --keep-metadata        # also keep 'number' and 'title' columns for traceability

# Step 3: classify — ONNX CPU by default, model.onnx auto-downloads from HF Hub
tdsuite-inference \
    --model_name karths/binary_classification_train_TD \
    --input_file issue_texts.csv

# Step 3 (GPU): requires pip install 'tdsuite[gpu]'
tdsuite-inference \
    --model_name karths/binary_classification_train_TD \
    --device cuda \
    --input_file issue_texts.csv

The output CSV has one row per issue with predicted_class (0/1) and predicted_probability. If you used --keep-metadata the number and title columns let you trace results directly back to GitHub issues.
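A small pandas sketch of that trace-back step, assuming --keep-metadata was used and substituting your own owner/repo:

import pandas as pd

preds = pd.read_csv("predictions_issue_texts.csv")   # replace with the real output path
flagged = preds[preds["predicted_class"] == 1]
for _, row in flagged.iterrows():
    print(f"#{row['number']}  {row['title']}  "
          f"(p={row['predicted_probability']:.2f})  "
          f"https://github.com/owner/repo/issues/{row['number']}")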


Use Case 2 — Classify your own dataset

You have a CSV of issue bodies, commit messages, or code review comments and want predictions.

# Your CSV must have a column with text. Default column name is 'text'.
# Example: my_issues.csv
#   text,project
#   "This module has no unit tests",projectA
#   "Hard-coded credentials in config.py",projectA

tdsuite-inference \
    --model_name karths/binary_classification_train_TD \
    --input_file my_issues.csv \
    --text_column text \
    --batch_size 64 \
    --output_file results/predictions.csv

If your CSV also has a label column with ground truth (0/1), metrics are computed automatically and saved alongside the predictions:

tdsuite-inference \
    --model_name karths/binary_classification_train_TD \
    --input_file labelled_data.csv \
    --output_file results/predictions.csv
# → results are saved + metrics.json, confusion_matrix.png, roc_curve.png
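A quick way to inspect those saved metrics from Python (a sketch; the exact key names depend on your metrics.json):

import json

# Path mirrors the Output Files layout; adjust to where your run saved metrics.json
with open("results/metrics/metrics.json") as f:
    metrics = json.load(f)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")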

Classify a single string:

tdsuite-inference \
    --model_name karths/binary_classification_train_TD \
    --text "The authentication module has no rate limiting and stores passwords in plain text"

Output:

{
  "text": "The authentication module has no rate limiting...",
  "predicted_class": 1,
  "predicted_probability": 0.94,
  "class_probabilities": [0.06, 0.94]
}

Use Case 3 — Fine-tune a model on your codebase

Use your own labelled data to build a classifier tuned to your team's codebase or TD definition.

Prepare your data

Your training CSV needs a text column and a label column. Labels can be:

  • Numeric (0 = not TD, 1 = TD): use --numeric_labels
  • Categorical (e.g. "TD", "non-TD"): use --positive_category TD

Example (categorical labels):

text,label
"No input validation on the API endpoint",TD
"Refactored the payment module",non-TD
"Missing error handling in the database layer",TD

Option A: train on a local file

tdsuite-train \
    --data_file data/my_labelled_issues.csv \
    --model_name distilbert-base-uncased \
    --positive_category TD \
    --output_dir outputs/my_model \
    --num_epochs 5 \
    --batch_size 16 \
    --learning_rate 2e-5 \
    --warmup_steps 500

Option B: train on a Hugging Face dataset

tdsuite-train \
    --data_file karths/binary-10IQR-TD \
    --model_name distilbert-base-uncased \
    --numeric_labels \
    --output_dir outputs/td_model \
    --num_epochs 5 \
    --batch_size 16

Option C: cross-validation (recommended for small datasets)

tdsuite-train \
    --data_file data/my_labelled_issues.csv \
    --model_name distilbert-base-uncased \
    --positive_category TD \
    --output_dir outputs/cv_model \
    --cross_validation \
    --n_splits 5 \
    --num_epochs 3

Each fold saves its own metrics and visualisations. After training, run inference with your new model:

tdsuite-inference \
    --model_path outputs/my_model \
    --input_file data/unlabelled.csv

No GPU? Training on CPU is possible but slow. The recommended approach is to fine-tune on a GPU machine (e.g. Google Colab), then export to ONNX for local CPU inference.


Use Case 4 — Run without a GPU

CPU inference is the default — no GPU, no PyTorch, no extra steps. All 17 pre-trained models ship a model.onnx file on Hugging Face Hub that downloads automatically on the first inference call.

Install (CPU — no GPU required)

pip install -e .   # onnxruntime is included; torch is NOT required

Classify on CPU — model downloads automatically

# Batch file — model.onnx auto-downloaded from HF Hub on first run
tdsuite-inference \
    --model_name karths/binary_classification_train_TD \
    --input_file issue_texts.csv

# Single string
tdsuite-inference \
    --model_name karths/binary_classification_train_TD \
    --text "No tests exist for this module"

Use a local ONNX file (offline / custom model)

# Export your own fine-tuned model once (requires pip install 'tdsuite[onnx]')
python scripts/export_onnx.py \
    --model_path outputs/my_model \
    --output models/my_model.onnx

# Then run offline
tdsuite-inference --onnx_path models/my_model.onnx --input_file issue_texts.csv

ONNX Runtime is typically 2–4× faster than PyTorch on CPU and has no dependency on CUDA or torch.
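For reference, direct ONNX inference on a local file looks roughly like the sketch below, using onnxruntime and the tokenizer saved next to the model. The input names are assumptions; inspect session.get_inputs() for your export:

# Sketch: direct onnxruntime inference with a local model.onnx
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/")   # saved next to the .onnx file
session = ort.InferenceSession("models/my_model.onnx",
                               providers=["CPUExecutionProvider"])

enc = tokenizer("No tests exist for this module",
                return_tensors="np", truncation=True, max_length=512)
# Input names ("input_ids", "attention_mask") depend on the exported graph
inputs = {inp.name: enc[inp.name] for inp in session.get_inputs()}
logits = session.run(None, inputs)[0]
logits = logits - logits.max(axis=-1, keepdims=True)   # numerically stable softmax
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(probs)   # e.g. [[0.07 0.93]]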


Use Case 5 — Detect multiple TD types (ensemble)

Run several category-specific models in parallel and combine their predictions. Useful when you want to know not just whether an issue is TD, but what kind.

# Weighted ensemble: general TD + security + code quality
tdsuite-inference \
    --model_names \
        karths/binary_classification_train_TD \
        karths/binary_classification_train_secu \
        karths/binary_classification_train_code \
    --input_file issue_texts.csv \
    --weights 0.5 0.3 0.2 \
    --output_file results/ensemble_predictions.csv

If --weights is omitted, equal weights are applied automatically. The final prediction is a weighted average of each model's class probabilities.
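In numpy terms, the combination step amounts to the following (the probabilities and weights here are illustrative):

import numpy as np

# Per-model class probabilities for one text: shape (n_models, n_classes)
probs = np.array([[0.10, 0.90],    # general TD model
                  [0.40, 0.60],    # security model
                  [0.30, 0.70]])   # code quality model
weights = np.array([0.5, 0.3, 0.2])
weights = weights / weights.sum()      # normalised; equal if --weights is omitted

ensemble = weights @ probs             # weighted average per class
print(ensemble, ensemble.argmax())     # [0.23 0.77] 1, i.e. predicted_class = 1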

Local ensemble (after training multiple models):

tdsuite-inference \
    --model_paths \
        outputs/fold_0 \
        outputs/fold_1 \
        outputs/fold_2 \
    --input_file test_data.csv

Use Case 6 — Use the web UI

A Gradio interface is available for teams that prefer not to use the CLI.

pip install gradio
python app.py
# → opens at http://localhost:7077

Fine-tune tab:

  1. Upload a labelled CSV (text + label columns)
  2. Select a base model and set train/test split
  3. Click Fine-tune — view accuracy, classification report, confusion matrix inline
  4. Download the predictions CSV

Evaluate tab:

  1. Upload an unlabelled CSV
  2. Select one or more pre-trained models (General TD, Code Quality, Types)
  3. Run — download predictions with per-class probabilities

Hugging Face Models & Datasets

Pre-trained Models

| Category | Model |
| --- | --- |
| General TD | karths/binary_classification_train_TD |
| Architecture | karths/binary_classification_train_architecture |
| Code Quality | karths/binary_classification_train_code |
| Defects | karths/binary_classification_train_defect |
| Infrastructure | karths/binary_classification_train_infrastructure |
| Performance | karths/binary_classification_train_perf |
| Requirements | karths/binary_classification_train_requirement |
| Design | karths/binary_classification_train_design |
| Security | karths/binary_classification_train_secu |
| Usability | karths/binary_classification_train_usab |
| Reliability | karths/binary_classification_train_reli |
| Process | karths/binary_classification_train_process |
| Build | karths/binary_classification_train_build |
| Maintenance | karths/binary_classification_train_main |
| Automation | karths/binary_classification_train_automation |
| People | karths/binary_classification_train_people |
| Portability | karths/binary_classification_train_port |

Available Datasets

| Category | Dataset |
| --- | --- |
| General TD | karths/binary-10IQR-TD |
| Architecture | karths/binary-10IQR-architecture |
| Code Quality | karths/binary-10IQR-code |
| Defects | karths/binary-10IQR-defect |
| Infrastructure | karths/binary-10IQR-infrastructure |
| Performance | karths/binary-10IQR-perf |
| Requirements | karths/binary-10IQR-requirement |
| Design | karths/binary-10IQR-design |
| Security | karths/binary-10IQR-secu |
| Usability | karths/binary-10IQR-usab |
| Compatibility | karths/binary-10IQR-comp |
| Reliability | karths/binary-10IQR-reli |
| Process | karths/binary-10IQR-process |
| Build | karths/binary-10IQR-build |
| Maintenance | karths/binary-10IQR-main |
| Automation | karths/binary-10IQR-automation |
| People | karths/binary-10IQR-people |
| Portability | karths/binary-10IQR-port |

Full CLI Reference

tdsuite-split-data

Splits a dataset into balanced train/test sets and optionally extracts top-contributing repositories.

tdsuite-split-data \
    --data_file karths/binary-10IQR-TD \
    --output_dir data/split \
    --is_numeric_labels \
    --repo_column repo \
    --is_huggingface_dataset

| Argument | Default | Description |
| --- | --- | --- |
| --data_file | (required) | Local file path (CSV/JSON/JSONL) or HF dataset name |
| --output_dir | (required) | Directory for train.csv, test.csv, top_repos.csv |
| --test_size | 0.2 | Fraction of data for the test split |
| --random_state | 42 | Random seed |
| --repo_column | | Column containing repository names (enables top-repo extraction) |
| --is_huggingface_dataset | false | Load from Hugging Face Hub |
| --is_numeric_labels | false | Labels are already 0/1 integers |
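Conceptually, the core split resembles a stratified scikit-learn split. The sketch below is illustrative only; DataSplitter additionally balances classes and extracts top repositories:

# Sketch: the core of a balanced train/test split (not DataSplitter itself)
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/labelled.csv")
train, test = train_test_split(df, test_size=0.2, random_state=42,
                               stratify=df["label"])   # preserve class ratios in both splits
train.to_csv("data/split/train.csv", index=False)
test.to_csv("data/split/test.csv", index=False)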

tdsuite-train

Fine-tune a transformer model on binary TD classification data.

tdsuite-train \
    --data_file karths/binary-10IQR-TD \
    --model_name distilbert-base-uncased \
    --numeric_labels \
    --output_dir outputs/my_model \
    --num_epochs 5 \
    --batch_size 16 \
    --learning_rate 2e-5 \
    --warmup_steps 1000

| Argument | Default | Description |
| --- | --- | --- |
| --data_file | (required) | Local file path or HF dataset name |
| --model_name | (required) | HF model ID or local path |
| --output_dir | (required) | Save directory for model and metrics |
| --text_column | text | Text column name |
| --label_column | label | Label column name |
| --positive_category | | Label string for the positive class (categorical labels) |
| --numeric_labels | false | Labels are already 0/1 integers |
| --is_huggingface_dataset | false | Load dataset from Hugging Face Hub |
| --num_epochs | 3 | Training epochs |
| --batch_size | 16 | Per-device batch size |
| --learning_rate | 2e-5 | Peak learning rate |
| --weight_decay | 0.01 | AdamW weight decay |
| --warmup_steps | 500 | Linear LR warmup steps |
| --gradient_accumulation_steps | 1 | Steps before a weight update |
| --cross_validation | false | Enable k-fold CV |
| --n_splits | 5 | Number of CV folds |
| --max_length | 512 | Maximum token sequence length |
| --seed | 42 | Random seed |
| --device | auto | cuda or cpu |

tdsuite-inference

Run predictions on a file or a single string. Defaults to ONNX on CPU — PyTorch is not required.

# CPU inference (default) — model.onnx auto-downloaded from HF Hub
tdsuite-inference --model_name karths/binary_classification_train_TD --input_file issues.csv

# Single string
tdsuite-inference --model_name karths/binary_classification_train_TD \
    --text "No input validation on this endpoint"

# GPU inference — ONNX with CUDAExecutionProvider (requires pip install 'tdsuite[gpu]')
tdsuite-inference --model_name karths/binary_classification_train_TD \
    --device cuda --input_file issues.csv

# Local ONNX file (offline)
tdsuite-inference --onnx_path models/td.onnx --input_file issues.csv

# Local model checkpoint (PyTorch, requires --use_torch)
tdsuite-inference --model_path outputs/my_model --use_torch --input_file issues.csv

# Ensemble (uses PyTorch backend automatically)
tdsuite-inference \
    --model_names karths/binary_classification_train_TD karths/binary_classification_train_secu \
    --input_file issues.csv \
    --weights 0.6 0.4

| Argument | Default | Description |
| --- | --- | --- |
| --model_path | | Local model directory (ONNX auto-detected; or use --use_torch) |
| --model_name | | HF model name — model.onnx downloaded automatically |
| --model_paths | | Multiple local directories (ensemble, PyTorch) |
| --model_names | | Multiple HF model names (ensemble, PyTorch) |
| --onnx_path | | Explicit path to a local .onnx file |
| --use_torch | false | Force PyTorch backend (requires pip install 'tdsuite[gpu]') |
| --text | | Single text string to classify |
| --input_file | | CSV or JSON file to classify |
| --text_column | text | Column containing text |
| --output_file | | Path to save the predictions CSV |
| --results_dir | | Custom results directory (default: timestamped subfolder) |
| --batch_size | 32 | Inference batch size |
| --max_length | 512 | Max token length |
| --device | cpu | cpu (ONNX CPU) or cuda (ONNX GPU via CUDAExecutionProvider) |
| --weights | | Per-model weights for ensemble averaging |
| --disable_progress_bar | false | Suppress tqdm bars |
| --track_emissions | true | Record carbon emissions via CodeCarbon |
--onnx_path, --model_path, --model_name, --model_paths, and --model_names are mutually exclusive.


export_onnx.py

Export a custom or fine-tuned model to ONNX format. The 17 pre-trained models already have model.onnx on Hugging Face Hub — you only need this for your own fine-tuned models.

Requires: pip install 'tdsuite[onnx]' (adds torch + onnx + onnxscript)

# Export a local fine-tuned checkpoint
python scripts/export_onnx.py --model_path outputs/my_model --output models/my_model.onnx

# Export directly from Hugging Face (e.g. your own model)
python scripts/export_onnx.py \
    --model_name my-org/my-custom-td-model \
    --output models/custom.onnx

| Argument | Default | Description |
| --- | --- | --- |
| --model_path | | Local model directory (mutually exclusive with --model_name) |
| --model_name | | HF model name (mutually exclusive with --model_path) |
| --output | (required) | Destination .onnx file path |
| --max_length | 512 | Sequence length for the export dummy input |
| --opset | 14 | ONNX opset version |

The tokenizer is saved alongside the .onnx file automatically.
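For reference, the export roughly corresponds to a plain torch.onnx.export call like the sketch below (the script wraps this with argument parsing and error handling):

# Sketch: exporting a fine-tuned HF classifier to ONNX
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("outputs/my_model").eval()
tokenizer = AutoTokenizer.from_pretrained("outputs/my_model")

dummy = tokenizer("dummy", return_tensors="pt", padding="max_length",
                  truncation=True, max_length=512)
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "models/my_model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"},
                  "attention_mask": {0: "batch", 1: "sequence"}},
    opset_version=14,
)
tokenizer.save_pretrained("models/")   # tokenizer saved alongside the .onnx file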


fetch_github_issues.py

Fetch issues from any public GitHub repository into a CSV file.

python scripts/fetch_github_issues.py --repo owner/repo --output issues.csv
python scripts/fetch_github_issues.py --repo owner/repo --limit 500 --output issues.csv
python scripts/fetch_github_issues.py --repo owner/repo --all --output issues.csv
python scripts/fetch_github_issues.py --repo owner/repo --token "$GITHUB_TOKEN" --output issues.csv

| Argument | Default | Description |
| --- | --- | --- |
| --repo | (required) | Repository in owner/repo format |
| --output | issues.csv | Output CSV file path |
| --state | all | Issue state: open, closed, or all |
| --limit | 100 | Max issues to fetch, newest-first (mutually exclusive with --all) |
| --all | false | Fetch every issue across all pages (mutually exclusive with --limit) |
| --token | | GitHub personal access token (raises rate limit from 60 to 5,000 req/h) |

Output columns: id, number, title, body, state, created_at, updated_at, closed_at, user_login, labels, comments, url.


extract_issue_bodies.py

Clean a GitHub issues CSV down to a text column ready for tdsuite-inference.

python scripts/extract_issue_bodies.py \
    --input issues.csv \
    --output issue_texts.csv \
    --min-length 50 \
    --drop-duplicates \
    --keep-metadata

| Argument | Default | Description |
| --- | --- | --- |
| --input | (required) | Input CSV (from fetch_github_issues.py) |
| --output | issue_texts.csv | Output CSV path |
| --body-column | body | Source column for issue text |
| --min-length | 20 | Drop rows shorter than N characters |
| --drop-duplicates | false | Remove duplicate body texts |
| --keep-metadata | false | Also retain number and title columns |
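The cleaning is regex-based. A rough sketch of the kinds of transformations involved (the real clean_text() covers more cases, such as emoji and headings):

# Rough sketch of the cleaning extract_issue_bodies.py performs (not the real clean_text)
import re

def clean_text(body: str) -> str:
    body = re.sub(r"```.*?```", " ", body, flags=re.DOTALL)  # fenced code blocks
    body = re.sub(r"`[^`]+`", " ", body)                     # inline code
    body = re.sub(r"<[^>]+>", " ", body)                     # HTML tags
    body = re.sub(r"!\[[^\]]*\]\([^)]*\)", " ", body)        # images
    body = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", body)     # links: keep anchor text
    body = re.sub(r"https?://\S+", " ", body)                # bare URLs
    return re.sub(r"\s+", " ", body).strip()                 # collapse whitespace

print(clean_text("See [the docs](https://example.com) and <b>this</b> `snippet`"))
# "See the docs and this"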

CI/CD Integration

TD-Classifier Suite works well as a gate in automated pipelines. The recommended approach for CI is ONNX inference — it requires no GPU, has no PyTorch dependency, and cold-starts fast enough for pull-request checks.

GitHub Actions — PR TD check

Flags a pull request if its description or changed commit messages contain technical debt language. The ONNX model is cached between runs so subsequent jobs skip the download.

# .github/workflows/td-check.yml
name: Technical Debt Check

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  td-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install tdsuite (CPU / ONNX — no GPU required)
        run: pip install -e .

      - name: Cache downloaded ONNX model
        uses: actions/cache@v4
        with:
          # HF Hub caches to ~/.cache/huggingface; cache it between runs
          path: ~/.cache/huggingface
          key: hf-onnx-td-v1

      - name: Write PR description to file
        env:
          PR_BODY: ${{ github.event.pull_request.body }}
        run: |
          echo "$PR_BODY" > /tmp/pr_text.txt

      - name: Classify PR description
        id: classify
        run: |
          # model.onnx auto-downloads from HF Hub on first run (cached above)
          result=$(tdsuite-inference \
            --model_name karths/binary_classification_train_TD \
            --text "$(cat /tmp/pr_text.txt)")
          echo "$result"
          # Fail if predicted_class == 1 (TD detected)
          echo "$result" | python -c "
          import sys, json
          data = json.load(sys.stdin)
          if data['predicted_class'] == 1:
              print(f\"::warning::TD detected (confidence {data['predicted_probability']:.0%}). Review before merging.\")
              sys.exit(1)
          "

Change sys.exit(1) to sys.exit(0) if you want informational warnings without blocking merges.


GitHub Actions — nightly repo scan

Runs a full issues scan on a schedule and uploads the results as a workflow artifact.

# .github/workflows/td-nightly.yml
name: Nightly TD Scan

on:
  schedule:
    - cron: "0 2 * * *"   # 02:00 UTC every day
  workflow_dispatch:        # allow manual runs

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install tdsuite (CPU / ONNX — no GPU required)
        run: pip install -e .

      - name: Cache HF Hub downloads
        uses: actions/cache@v4
        with:
          path: ~/.cache/huggingface
          key: hf-onnx-td-v1

      - name: Fetch issues
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          python scripts/fetch_github_issues.py \
            --repo ${{ github.repository }} \
            --token "$GITHUB_TOKEN" \
            --limit 200 \
            --output /tmp/issues.csv

      - name: Clean issues
        run: |
          python scripts/extract_issue_bodies.py \
            --input /tmp/issues.csv \
            --output /tmp/issue_texts.csv \
            --min-length 50 \
            --drop-duplicates \
            --keep-metadata

      - name: Classify (model.onnx auto-downloads from HF Hub)
        run: |
          tdsuite-inference \
            --model_name karths/binary_classification_train_TD \
            --input_file /tmp/issue_texts.csv \
            --output_file /tmp/td_predictions.csv

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: td-scan-${{ github.run_id }}
          path: /tmp/td_predictions.csv
          retention-days: 30

GitLab CI

# .gitlab-ci.yml (relevant excerpt)
variables:
  # GitLab can only cache paths inside $CI_PROJECT_DIR, so point the HF cache there
  HF_HOME: "$CI_PROJECT_DIR/.cache/huggingface"

td-check:
  stage: test
  image: python:3.11-slim
  cache:
    key: hf-onnx-td
    paths:
      - .cache/huggingface/
  before_script:
    - pip install -e . -q
  script:
    - |
      # model.onnx auto-downloads from HF Hub (cached between runs)
      tdsuite-inference \
        --model_name karths/binary_classification_train_TD \
        --text "$CI_MERGE_REQUEST_DESCRIPTION" \
        | python -c "
      import sys, json
      data = json.load(sys.stdin)
      prob = data['predicted_probability']
      if data['predicted_class'] == 1:
          print(f'TD detected — confidence {prob:.0%}')
          sys.exit(1)
      print(f'No TD detected (confidence {1-prob:.0%})')
      "
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"

Docker / self-hosted runners

For air-gapped or self-hosted environments, bake the ONNX model into your runner image so the download step is eliminated entirely.

# Dockerfile.runner — bake model.onnx into the image for offline / air-gapped use
FROM python:3.11-slim

WORKDIR /app
COPY . .
# Install tdsuite (CPU, no GPU), then pre-download model.onnx into the image
RUN pip install -e . && \
    python -c "from tdsuite.utils.onnx_inference import OnnxInferenceEngine; \
               OnnxInferenceEngine.from_pretrained('karths/binary_classification_train_TD')"

ENTRYPOINT ["tdsuite-inference"]

# Build once (model baked in — no network at runtime)
docker build -f Dockerfile.runner -t tdsuite-runner:latest .

# Use in any CI job
docker run --rm tdsuite-runner:latest \
    --model_name karths/binary_classification_train_TD \
    --text "Hard-coded API keys in the config module"

Tips for CI environments:

| Concern | Recommendation |
| --- | --- |
| Cold-start time | Cache ~/.cache/huggingface between runs — model.onnx is ~250–500 MB |
| No GPU available | Default ONNX backend runs entirely on CPU — no extra flags needed |
| GPU available | Add --device cuda and run pip install 'tdsuite[gpu]' |
| Blocking vs. warning | Use sys.exit(0) for informational-only checks |
| Multiple TD categories | Add --model_names with category models to detect which type of TD is present |
| Rate limits on issue fetch | Store GITHUB_TOKEN as a CI secret and pass it via --token |
| Air-gapped networks | Pre-bake the model into a Docker image as shown above |

Output Files

After training:

outputs/my_model/
├── pytorch_model.bin          # model weights
├── config.json                # HF model config
├── tokenizer_config.json      # tokenizer files
├── training_config.json       # CLI args used for this run
├── metrics.json               # accuracy, F1, MCC, AUC-ROC
├── confusion_matrix.png
├── roc_curve.png
└── emissions/
    └── emissions.csv          # CodeCarbon energy/CO2 data

After cross-validation training:

outputs/cv_model/
├── fold_0/ … fold_N/
│   ├── metrics.json
│   ├── confusion_matrix.png
│   └── roc_curve.png
├── cross_validation_results.json
└── cross_validation_visualization.png

After inference:

outputs/my_model/inference_YYYYMMDD_HHMMSS/
├── predictions_<input_filename>.csv   # original columns + predicted_class + probabilities
├── metrics/
│   ├── metrics.json
│   ├── confusion_matrix.png
│   └── roc_curve.png                  # only generated when ground truth labels are present
└── emissions/
    └── inference_emissions.csv

Project Structure

text_classification/
├── scripts/
│   ├── fetch_github_issues.py        # fetch issues from any public GitHub repo → CSV
│   ├── extract_issue_bodies.py       # clean issue CSV → text column for inference
│   ├── export_onnx.py                # export a single model to ONNX (one-time, needs torch)
│   └── export_and_upload_onnx.py     # batch-export all 17 TDSuite models and upload to HF Hub
├── tdsuite/
│   ├── cli.py                        # all argparse parsers (single source of truth)
│   ├── train.py                      # tdsuite-train entry point
│   ├── inference.py                  # tdsuite-inference entry point (ONNX default)
│   ├── split_data.py                 # tdsuite-split-data entry point
│   ├── upload_to_hf.py               # upload a trained model to Hugging Face Hub
│   ├── config/
│   │   └── config.py                 # ModelConfig, TrainingConfig, DataConfig, InferenceConfig
│   ├── data/
│   │   ├── dataset.py                # TDDataset, TDProcessor, BinaryTDProcessor
│   │   └── data_splitter.py          # DataSplitter — balanced splits, top-repo extraction
│   ├── models/
│   │   ├── base.py                   # BaseModel with weighted loss support
│   │   └── transformer.py            # TransformerModel (load, predict, save)
│   ├── trainers/
│   │   ├── base.py                   # WeightedLossTrainer, BaseTrainer (emissions tracking)
│   │   └── td_trainer.py             # TDTrainer — cross-validation, early stopping, ensemble
│   └── utils/
│       ├── onnx_inference.py         # OnnxInferenceEngine — default CPU/GPU inference (no torch)
│       ├── inference.py              # InferenceEngine, EnsembleInferenceEngine (PyTorch)
│       ├── metrics.py                # compute_metrics, confusion matrix, ROC plots
│       └── data_utils.py             # load_dataset, preprocess_text
├── app.py                            # Gradio web UI (port 7077)
├── pyproject.toml                    # packaging, tool config, optional deps (gpu/train/onnx/dev)
└── test-requirements.txt             # pytest, pytest-cov


Testing

The test suite uses pytest with coverage reporting. All tests run entirely offline — no GPU, no HuggingFace model downloads. Transformer model calls are mocked with unittest.mock.
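The pattern looks roughly like the following sketch (illustrative, not a test taken from the suite):

# Illustrative sketch of the offline mocking pattern (see tests/ for the real fixtures)
from unittest.mock import MagicMock, patch
import numpy as np

def test_runs_without_network_or_gpu():
    fake_session = MagicMock()
    fake_session.run.return_value = [np.array([[-1.0, 2.0]])]   # fake logits
    with patch("onnxruntime.InferenceSession", return_value=fake_session):
        import onnxruntime
        session = onnxruntime.InferenceSession("never-loaded.onnx")
        logits = session.run(None, {})[0]
        assert logits.shape == (1, 2)   # assertions run against the mock, fully offline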

Running the test suite

Install test dependencies first:

# UV (recommended)
uv pip install -r test-requirements.txt

# pip fallback
pip install pytest pytest-cov

Run all tests with coverage:

pytest

This uses the config in pyproject.toml (testpaths = ["tests"], --cov=tdsuite).

Run a specific test file:

pytest tests/test_config.py -v
pytest tests/test_data_splitter.py -v
pytest tests/test_metrics.py -v

Run a specific test class or function:

pytest tests/test_config.py::TestModelConfig -v
pytest tests/test_cli.py::TestInferenceParser::test_ensemble_model_paths -v

Run with detailed coverage report:

pytest --cov=tdsuite --cov-report=term-missing --cov-report=html
# Open htmlcov/index.html in a browser for the line-by-line report

Run only fast unit tests (skip any marked slow):

pytest -m "not slow" -v

Expected output (all passing):

tests/test_cli.py                      ............. 35 passed
tests/test_config.py                   ............. 22 passed
tests/test_data_splitter.py            ............. 17 passed
tests/test_data_utils.py               ............. 10 passed
tests/test_dataset.py                  ............. 22 passed
tests/test_extract_issue_bodies.py     ............. 20 passed
tests/test_inference.py                ............. 22 passed
tests/test_metrics.py                  ............. 14 passed
tests/test_onnx_inference.py           ............. 23 passed

Test coverage by module

| Module | Test file | What is tested |
| --- | --- | --- |
| tdsuite/config/config.py | test_config.py | ModelConfig, TrainingConfig, DataConfig, InferenceConfig, Config — defaults, to_dict, from_dict, save, load |
| tdsuite/utils/data_utils.py | test_data_utils.py | load_dataset_from_file (CSV/JSON/JSONL), load_dataset auto-detection, preprocess_text truncation & whitespace |
| tdsuite/data/dataset.py | test_dataset.py | TDDataset indexing & length; TDProcessor load/tokenize; BinaryTDProcessor binary conversion, extract_top_repo, extract_top_repos_by_category |
| tdsuite/data/data_splitter.py | test_data_splitter.py | DataSplitter load, preprocess, balance_classes, split_and_save with/without repo column; standalone split_data() |
| tdsuite/utils/metrics.py | test_metrics.py | compute_metrics return structure, value ranges, perfect-prediction case, JSON/PNG outputs |
| tdsuite/cli.py | test_cli.py | All six get_*_parser() functions — required args, defaults, flags, mutually-exclusive groups, error cases |
| scripts/extract_issue_bodies.py | test_extract_issue_bodies.py | clean_text — code blocks, HTML, Markdown, URLs, emoji, whitespace; full CSV pipeline |
| tdsuite/utils/inference.py | test_inference.py | InferenceEngine predict_single, predict_batch, predict_from_file; EnsembleInferenceEngine init, weight normalisation, predict_single, predict_batch — all mocked |
| tdsuite/utils/onnx_inference.py | test_onnx_inference.py | OnnxInferenceEngine predict_single, predict_batch, predict_from_file; from_pretrained Hub-download path and torch.onnx.export fallback — all mocked |

What each test file covers

tests/conftest.py — shared fixtures: binary_df, categorical_df, repo_df, csv_file, json_file, jsonl_file, predictions_df.

tests/test_config.py — verifies that every config dataclass serialises to/from JSON correctly and that save()/load() round-trips work for both file and directory targets.

tests/test_data_utils.py — verifies that local CSV/JSON/JSONL files load correctly, that missing files raise FileNotFoundError, unsupported formats raise ValueError, and that preprocess_text truncates and collapses whitespace.

tests/test_dataset.py — verifies TDDataset item shapes and label values; TDProcessor calls the tokenizer with the correct kwargs; BinaryTDProcessor converts categorical labels, raises on missing positive_category, and correctly extracts top repos.

tests/test_data_splitter.py — verifies train/test split files are created, sizes are approximately correct (±5%), train and test sets do not overlap, top-repo extraction writes top_repos.csv, and label mappings are saved for categorical data.

tests/test_metrics.py — verifies all metric keys are present, values are in [0, 1], perfect predictions yield 1.0 for accuracy/F1/MCC, metrics.json and PNG plots are written when output_dir is given, and nothing is written when output_dir is None.

tests/test_cli.py — parses argument vectors for all six CLI commands and asserts correct defaults, custom values, and that SystemExit is raised for invalid/missing arguments and mutually-exclusive conflicts.

tests/test_extract_issue_bodies.py — calls clean_text() with various inputs (fenced code blocks, inline code, HTML tags, markdown links, images, headings, bold, lists, blockquotes, emoji, extra whitespace) and verifies both removal of noise and preservation of prose; also exercises the CSV pipeline and deduplication logic.

tests/test_inference.py — patches TransformerModel and AutoTokenizer (for InferenceEngine) and AutoModelForSequenceClassification/AutoTokenizer (for EnsembleInferenceEngine) to avoid any network access or GPU requirement; verifies output structure, probability ranges, file I/O, and error handling.

tests/test_onnx_inference.py — patches onnxruntime.InferenceSession and AutoTokenizer so no network or GPU is needed; verifies OnnxInferenceEngine output structure, probability ranges, file I/O (CSV, JSON, JSONL), error handling, and that from_pretrained calls _export_to_onnx when model.onnx is absent from the Hub.


Contributing

Contributions are welcome. Please open a pull request with a clear description of the change and ensure flake8 passes:

flake8 tdsuite/ scripts/

License

MIT — see LICENSE.
