ChangeScout

Final CAS NLP project report: Joss_Lorenz_ChangeScout_CAS_NLP_Final_Project_Report.pdf

ChangeScout is a deterministic lead-prioritization and review-support pipeline that surfaces potentially TLM-relevant changes from official canton-level web sources.

ChangeScout is not an automatic TLM update system.

ChangeScout does not replace expert judgement.

A lead is not a confirmed TLM change.

A lead is a source that should be reviewed because it may describe a persistent TLM road or path geometry update.

The workflow remains human-in-the-loop: the final decision stays with human reviewers.

Purpose

ChangeScout supports reviewers by reducing manual search and screening effort.

It monitors manually curated registries of official canton-level sources, extracts relevant text, scores candidate documents, selects review leads, enriches them with review aids, and writes scoped outputs for manual inspection.

The operational use case is inference on one selected source registry.

Typical questions are:

Which official sources should a reviewer inspect first?
Which leads have strong TLM geometry signals?
Which leads are likely actionable according to the evaluated score and TF IDF setup?
Which locations or GeoAdmin hints may help manual review?

ChangeScout should not answer:

Should TLM be updated now?
Is this project already built?
Is this location a verified project geometry?

For the detailed project goal and evaluation framing, see:

docs/project_goal.md

For adding a new canton or source registry and running inference, see:

docs/inference_runbook.md

Current operational capability

The current scoped operational workflow supports:

source registry validation and discovery smoke tests
source registry resolution
discovery
crawling
HTML cleaning
hard filtering
thematic scoring
optional operational TF IDF inference
candidate selection with score_only or score_or_tfidf
local location hinting on selected leads
optional GeoAdmin enrichment on selected leads
scoped review export package
scoped monitoring summary
run metadata and logs

Operational outputs are written under:

artifacts/runs/<run_id>/

Operational runs do not write to:

data/annotation/evaluation/

or:

results/evaluation/

The historical MVP reproduction workflow remains separate:

bash scripts/operational/run.sh

Do not mix the scoped operational inference workflow with the frozen evaluation and reproduction workflow.

Repository structure

src/changescout/
  ingestion/      discovery, crawling, cleaning, filtering
  ranking/        scoring, candidate selection, decision logic
  enrichment/     geography, local hints, GeoAdmin enrichment
  ml/             TF IDF, classification, LLM explainability
  review/         leads, review export, inference QA
  annotation/     annotation helpers
  validation/     registry validation and snapshots

scripts/
  operational/    scoped run helpers and review exports
  annotation/     annotation dataset construction and expansion
  evaluation/     evaluation and result package builders
  ml/             model training and LLM experiment scripts
  legacy/         historical MVP reproduction helpers

data/
  annotation/labeled/      curated labeled datasets
  annotation/evaluation/  frozen evaluation datasets
  models/                 tracked operational model artifacts
  reference/              stable reference files

results/
  evaluation/     generated evaluation results and report package

artifacts/
  generated operational run outputs, ignored by Git

Environment setup

python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

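Optional local LLM dependencies are installed separately because PyTorch CUDA wheels are platform dependent. The index URL in the command below targets CUDA 12.1 (cu121); adjust it to match the local CUDA version.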

source .venv/bin/activate
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
python -m pip install -r requirements-llm.txt

Configuration

Monitoring is controlled by configuration.

Scope config

config/scope.yaml defines:

version
canton_id
languages
time_window_days
source_registry
source_policy

Source registry

config/sources/<registry>.yaml defines source entries.

Required fields for each source:

source_id
name
base_url
crawl_type
crawl_frequency_hours
active

For crawl_type: html_pattern, the source also requires:

include_patterns

Example mapping:

source_registry: "zh" maps to config/sources/zh.yaml

For detailed instructions, see:

docs/inference_runbook.md

Validate source registry

Before running a new source registry, validate its configuration.

PYTHONPATH=src python -m changescout.cli validate-registry \
  --config-dir config \
  --source-registry be

With discovery smoke test and scoped validation outputs:

PYTHONPATH=src python -m changescout.cli validate-registry \
  --config-dir config \
  --source-registry be \
  --smoke-discovery \
  --output-dir artifacts/registry_validation/be_001 \
  --timeout-seconds 10

This writes:

artifacts/registry_validation/<run_id>/
  registry_validation_report.json
  discovery_smoke.jsonl
  discovery_smoke_report.json

The validation command checks for missing required fields, duplicate source_id values, inactive sources, unsupported crawl_type values, missing include_patterns, invalid URLs, and overly broad include patterns.
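A few of the checks listed above can be sketched in plain Python. This is a minimal illustration, not the actual validate-registry implementation: the function name validate_sources and the message wording are hypothetical, and the sketch assumes the YAML registry has already been parsed into a list of dictionaries.

```python
from urllib.parse import urlparse

# Required fields as listed in the "Source registry" section.
REQUIRED_FIELDS = (
    "source_id", "name", "base_url",
    "crawl_type", "crawl_frequency_hours", "active",
)


def validate_sources(sources):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    seen_ids = set()
    for index, source in enumerate(sources):
        label = source.get("source_id", f"entry #{index}")
        for field in REQUIRED_FIELDS:
            if field not in source:
                problems.append(f"{label}: missing required field '{field}'")
        source_id = source.get("source_id")
        if source_id is not None:
            if source_id in seen_ids:
                problems.append(f"{label}: duplicate source_id")
            seen_ids.add(source_id)
        if "base_url" in source:
            parsed = urlparse(str(source["base_url"]))
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                problems.append(f"{label}: base_url is not a valid http(s) URL")
        # html_pattern sources additionally need include_patterns.
        if source.get("crawl_type") == "html_pattern" and not source.get("include_patterns"):
            problems.append(f"{label}: html_pattern requires include_patterns")
    return problems
```

The sketch collects every problem instead of stopping at the first one, so a single pass over a new registry shows everything that needs fixing.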

Resolve configured sources

PYTHONPATH=src python -m changescout.cli snapshot \
  --config-dir config \
  --snapshot-dir artifacts

This writes a resolved scope snapshot with the active sources.

Standard inference preset

For regular inference, use the concise preset command.

The preset runs the recommended high recall setup:

scoped operational run
score_or_tfidf candidate selection
default TF IDF artifact at data/models/tfidf_actionable/tfidf_actionable_v1
local location hinting
optional GeoAdmin enrichment

Minimal example:

PYTHONPATH=src python -m changescout.cli infer \
  --source-registry be \
  --canton-id be \
  --run-id be_infer_001

With optional GeoAdmin enrichment:

PYTHONPATH=src python -m changescout.cli infer \
  --source-registry be \
  --canton-id be \
  --run-id be_infer_geoadmin_001 \
  --enable-geoadmin-enrichment

The preset still writes only to:

artifacts/runs/<run_id>/

Use the full run command when custom thresholds, custom filter config, custom scoring config, or debugging options are needed.

Run scoped operational inference

Score only mode

This is the deterministic baseline mode.

PYTHONPATH=src python -m changescout.cli run \
  --config-dir config \
  --source-registry zh \
  --canton-id zh \
  --run-id zh_score_only_001 \
  --output-root artifacts/runs \
  --html-root data/crawling \
  --lead-threshold 0.10 \
  --candidate-selection-mode score_only \
  --timeout-seconds 10

Recommended high recall mode

The recommended current inference mode is score_or_tfidf.

It uses the union of:

thematic_score >= lead_threshold
tfidf_actionable_probability >= tfidf_threshold

It requires an explicit operational TF IDF model artifact.

PYTHONPATH=src python -m changescout.cli run \
  --config-dir config \
  --source-registry be \
  --canton-id be \
  --run-id be_hybrid_001 \
  --output-root artifacts/runs \
  --html-root data/crawling \
  --lead-threshold 0.10 \
  --tfidf-threshold 0.50 \
  --candidate-selection-mode score_or_tfidf \
  --tfidf-model-artifact data/models/tfidf_actionable/tfidf_actionable_v1 \
  --timeout-seconds 10
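The union rule behind score_or_tfidf can be written down in a few lines. The sketch below is an illustration under assumptions: the function name is_selected is hypothetical, and records are assumed to be dictionaries carrying the thematic_score and tfidf_actionable_probability fields named in this README.

```python
def is_selected(record, mode, lead_threshold, tfidf_threshold=None):
    """Decide whether one scored record becomes a review lead."""
    score_hit = record["thematic_score"] >= lead_threshold
    if mode == "score_only":
        return score_hit
    if mode == "score_or_tfidf":
        if tfidf_threshold is None:
            raise ValueError("score_or_tfidf needs a TF IDF threshold")
        probability = record.get("tfidf_actionable_probability")
        tfidf_hit = probability is not None and probability >= tfidf_threshold
        # Union: either signal alone is enough to select the record.
        return score_hit or tfidf_hit
    raise ValueError(f"unsupported candidate selection mode: {mode}")
```

Because the rule is a union, score_or_tfidf can only add leads relative to score_only at the same lead threshold, which is why it is the higher recall option.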

With optional GeoAdmin enrichment

GeoAdmin enrichment runs only on selected leads.

GeoAdmin hints are review aids only.

They are not verified project geometries.

PYTHONPATH=src python -m changescout.cli run \
  --config-dir config \
  --source-registry be \
  --canton-id be \
  --run-id be_hybrid_geoadmin_001 \
  --output-root artifacts/runs \
  --html-root data/crawling \
  --lead-threshold 0.10 \
  --tfidf-threshold 0.50 \
  --candidate-selection-mode score_or_tfidf \
  --tfidf-model-artifact data/models/tfidf_actionable/tfidf_actionable_v1 \
  --enable-geoadmin-enrichment \
  --timeout-seconds 10

Build review export package

After a scoped run, build a reviewer facing export package.

PYTHONPATH=src python scripts/operational/build_review_export.py \
  --run-dir artifacts/runs/be_hybrid_geoadmin_001 \
  --top-n 30

This writes:

artifacts/runs/<run_id>/review/
  review_leads.csv
  review_summary.md
  review_export_report.json

The review export reads the best available lead output in this order:

leads_with_llm_explanations.jsonl
leads_with_geoadmin_locations.jsonl
leads_with_locations.jsonl
leads.jsonl

The export deduplicates repeated leads by canonical URL and keeps the source ids of the duplicates it merges.
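The input priority and the deduplication step can be sketched as follows. This is not the code of build_review_export.py: the function names and the canonical_url and duplicate_source_ids field names are assumptions made for illustration.

```python
from pathlib import Path

# Priority order taken from the list above: richest output first.
LEAD_FILE_PRIORITY = (
    "leads_with_llm_explanations.jsonl",
    "leads_with_geoadmin_locations.jsonl",
    "leads_with_locations.jsonl",
    "leads.jsonl",
)


def pick_lead_file(run_dir):
    """Return the first existing lead file in priority order, or None."""
    for name in LEAD_FILE_PRIORITY:
        candidate = Path(run_dir) / name
        if candidate.is_file():
            return candidate
    return None


def dedupe_leads(leads):
    """Keep the first lead per canonical URL and remember other source ids."""
    merged = {}
    for lead in leads:
        key = lead["canonical_url"]  # assumed field name
        if key not in merged:
            merged[key] = {**lead, "duplicate_source_ids": []}
            continue
        kept = merged[key]
        source_id = lead["source_id"]
        if source_id != kept["source_id"] and source_id not in kept["duplicate_source_ids"]:
            kept["duplicate_source_ids"].append(source_id)
    return list(merged.values())
```

Keeping the extra source ids on the surviving row lets a reviewer see that several registry entries pointed at the same page.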

Build scoped monitoring summary

PYTHONPATH=src python scripts/operational/build_monitoring_summary.py \
  --run-id be_hybrid_geoadmin_001

This writes:

artifacts/runs/<run_id>/monitoring_summary.json
artifacts/runs/<run_id>/monitoring_summary.md

The monitoring summary reads scoped run metadata and scoped stage reports.

It is a lightweight run summary.

It is not production alerting.

Scoped operational output layout

A typical scoped run writes:

artifacts/runs/<run_id>/
  scope_snapshot.json
  discovery.jsonl
  crawl.jsonl
  cleaned.jsonl
  excluded.jsonl
  filtered.jsonl
  filtered_excluded.jsonl
  scored.jsonl
  scored_with_tfidf.jsonl
  leads.jsonl
  leads.csv
  leads_with_locations.jsonl
  leads_with_locations.csv
  leads_with_geoadmin_locations.jsonl
  leads_with_geoadmin_locations.csv
  monitoring_summary.json
  monitoring_summary.md
  review/
    review_leads.csv
    review_summary.md
    review_export_report.json
  reports/
    discovery_report.json
    crawl_report.json
    cleaning_report.json
    filter_report.json
    scoring_report.json
    tfidf_inference_report.json
    lead_generation_report.json
    location_hinting_report.json
    geoadmin_location_hinting_report.json
  metadata/
    run_metadata.json
  logs/
    run.log

Some files are optional and exist only when the corresponding stage is enabled.

Train operational TF IDF model artifact

PYTHONPATH=src python scripts/ml/train_operational_tfidf.py \
  --dataset data/annotation/evaluation/triage_3class_dataset.csv \
  --output-dir data/models/tfidf_actionable/tfidf_actionable_v1 \
  --model-version tfidf_actionable_v1

The training target is the binary actionable label, derived from the three triage classes:

positive: confirmed_relevant, needs_review
negative: not_relevant
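The label mapping above amounts to a small lookup. A minimal sketch, assuming the triage labels appear as plain strings in the dataset; the function name to_actionable is hypothetical.

```python
POSITIVE_LABELS = {"confirmed_relevant", "needs_review"}
NEGATIVE_LABELS = {"not_relevant"}


def to_actionable(label):
    """Map a triage label to the binary actionable target (1 or 0)."""
    if label in POSITIVE_LABELS:
        return 1
    if label in NEGATIVE_LABELS:
        return 0
    # Unknown labels are rejected rather than silently counted as negative.
    raise ValueError(f"unknown triage label: {label!r}")
```

Raising on unknown labels is a design choice of this sketch: a typo in an annotation file then fails loudly instead of shifting the class balance.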

The tracked operational artifact contains:

data/models/tfidf_actionable/tfidf_actionable_v1/
  model.joblib
  metadata.json

test_predictions.csv may be generated during training for audit purposes, but it is ignored by Git by default.

Operational runs load the artifact explicitly.

They do not retrain inside the inference run.

See:

docs/model_artifacts.md

Standalone TF IDF inference

PYTHONPATH=src python scripts/operational/run_tfidf_inference.py \
  --input artifacts/runs/<run_id>/scored.jsonl \
  --output artifacts/runs/<run_id>/scored_with_tfidf.jsonl \
  --report-output artifacts/runs/<run_id>/reports/tfidf_inference_report.json \
  --artifact-dir data/models/tfidf_actionable/tfidf_actionable_v1

This adds:

tfidf_model_version
tfidf_actionable_probability
tfidf_actionable_prediction
tfidf_actionable_threshold

TF IDF probability is a learned review signal.

It is not a confirmed TLM relevance decision.

Discovery only

PYTHONPATH=src python -m changescout.cli discover \
  --config-dir config \
  --output artifacts/discovery.jsonl

Discovery:

fetches source HTML
extracts links
normalizes URLs
filters by include patterns
removes binary assets
deduplicates canonical URLs
writes JSONL
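The link handling steps above (everything except fetching and writing) can be sketched with the standard library. Several details are assumptions: include patterns are treated as regular expressions, the binary suffix list is illustrative, and URL normalization is reduced to resolving relative links and dropping fragments.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

# Illustrative suffix list; the real exclusion rules may differ.
BINARY_SUFFIXES = (".pdf", ".jpg", ".jpeg", ".png", ".gif", ".zip", ".docx", ".xlsx")


class _LinkCollector(HTMLParser):
    """Collect href values of all anchor tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover_links(html, base_url, include_patterns):
    """Extract, normalize, filter, and deduplicate links from one page."""
    collector = _LinkCollector()
    collector.feed(html)
    seen, result = set(), []
    for href in collector.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, and similar
        if parsed.path.lower().endswith(BINARY_SUFFIXES):
            continue  # binary assets are not crawled as HTML
        if not any(re.search(pattern, absolute) for pattern in include_patterns):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

For example, calling discover_links(page_html, "https://example.org/", [r"/news/"]) keeps only http(s) links whose URL contains /news/, in page order and without repeats.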

Crawling only

PYTHONPATH=src python -m changescout.cli crawl \
  --input artifacts/discovery.jsonl \
  --output artifacts/crawl.jsonl \
  --html-base-dir data/crawling \
  --run-id run_001

Crawling:

fetches discovered URLs
stores raw HTML
computes content hash
writes structured crawl records
continues on errors
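The hashing and continue-on-error behaviour can be illustrated without network access by injecting the fetch function. This is a sketch under assumptions: SHA-256 as the hash, the record field names, and the fetch callable are all hypothetical, and storing the raw HTML on disk is left out.

```python
import hashlib


def crawl_urls(urls, fetch):
    """Fetch each URL with the injected callable and build crawl records.

    `fetch` takes a URL and returns the response body as bytes, or raises.
    """
    records = []
    for url in urls:
        try:
            body = fetch(url)
        except Exception as error:  # one failing URL must not stop the run
            records.append(
                {"url": url, "status": "error", "error": str(error), "content_hash": None}
            )
            continue
        records.append(
            {
                "url": url,
                "status": "ok",
                "error": None,
                "content_hash": hashlib.sha256(body).hexdigest(),
            }
        )
    return records
```

A content hash of this kind makes it cheap to notice that a page is byte-identical to a previous crawl.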

HTML cleaning only

PYTHONPATH=src python -m changescout.cli run \
  --config-dir config \
  --source-registry <registry> \
  --canton-id <canton> \
  --run-id <run_id>

HTML cleaning:

reads raw HTML files
extracts title and main content
removes boilerplate
normalizes text
applies basic quality filtering
writes cleaned and excluded outputs
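A much simplified version of these cleaning steps is sketched below. It is not the ChangeScout cleaner: the list of skipped tags, the min_chars quality threshold, and the output field names are assumptions chosen for illustration.

```python
import re
from html.parser import HTMLParser

# Tags whose content is treated as boilerplate in this sketch.
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class _TextExtractor(HTMLParser):
    """Collect the title and the visible text outside boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0:
            self.text_parts.append(data)


def clean_html(html, min_chars=200):
    """Return title, normalized text, and a basic quality flag."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    title = re.sub(r"\s+", " ", " ".join(extractor.title_parts)).strip()
    text = re.sub(r"\s+", " ", " ".join(extractor.text_parts)).strip()
    # Very short pages are flagged for the excluded output.
    return {"title": title, "text": text, "excluded": len(text) < min_chars}
```

A low min_chars keeps more pages, which is in line with the recall-first behaviour noted under the limitations of HTML cleaning.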

Hard filtering only

PYTHONPATH=src python -m changescout.cli filter \
  --input artifacts/cleaned.jsonl \
  --config config/filter.yaml \
  --output artifacts/filtered.jsonl \
  --excluded-output artifacts/filtered_excluded.jsonl \
  --report-output artifacts/filter_report.json

Hard filtering:

removes clearly non domain documents
preserves plausible infrastructure related content
writes filtering signals and report

Thematic scoring only

PYTHONPATH=src python -m changescout.cli score \
  --input artifacts/filtered.jsonl \
  --config config/scoring.yaml \
  --output artifacts/scored.jsonl \
  --report-output artifacts/scoring_report.json

Scoring:

computes deterministic thematic relevance signals
writes thematic_score
writes inspectable scoring signals
writes scoring report
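To make "deterministic" and "inspectable" concrete, here is a toy keyword scorer. The patterns and weights are invented placeholders, not the contents of config/scoring.yaml, and the capped weighted sum is only one plausible way to combine signals.

```python
import re

# Hypothetical pattern weights, for illustration only.
PATTERN_WEIGHTS = {
    r"\bnew road\b": 0.5,
    r"\bbypass\b": 0.3,
    r"\broundabout\b": 0.2,
}


def thematic_score(text, pattern_weights=PATTERN_WEIGHTS):
    """Score a text by summing the weights of matching patterns, capped at 1."""
    lowered = text.lower()
    signals = {
        pattern: weight
        for pattern, weight in pattern_weights.items()
        if re.search(pattern, lowered)
    }
    score = min(1.0, sum(signals.values()))
    # The matched patterns are returned so a reviewer can see why a text scored.
    return {"thematic_score": round(score, 3), "signals": signals}
```

The same text always yields the same score and the same list of matched patterns, which is the property the deterministic baseline relies on.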

Standalone lead enrichment

Local location hinting

PYTHONPATH=src python scripts/operational/add_location_hints_to_leads.py \
  --input artifacts/runs/<run_id>/leads.jsonl \
  --reference data/reference/location_hints_reference.csv \
  --output-jsonl artifacts/runs/<run_id>/leads_with_locations.jsonl \
  --output-csv artifacts/runs/<run_id>/leads_with_locations.csv \
  --report-output artifacts/runs/<run_id>/reports/location_hinting_report.json

GeoAdmin enrichment

PYTHONPATH=src python scripts/operational/enrich_location_hints_geoadmin.py \
  --input artifacts/runs/<run_id>/leads_with_locations.jsonl \
  --output-jsonl artifacts/runs/<run_id>/leads_with_geoadmin_locations.jsonl \
  --output-csv artifacts/runs/<run_id>/leads_with_geoadmin_locations.csv \
  --report-output artifacts/runs/<run_id>/reports/geoadmin_location_hinting_report.json \
  --cache data/reference/geoadmin_search_cache.jsonl \
  --max-queries 3

GeoAdmin enrichment is optional and non authoritative.

API failure does not invalidate lead generation.
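The "optional and non-fatal" behaviour can be sketched with an injected search function and an in-memory cache. Everything named here is hypothetical: the location_hints, geoadmin_hints, and geoadmin_errors fields, the search callable, and the reading of max_queries as a per-lead limit.

```python
def enrich_with_geoadmin(lead, search, cache, max_queries=3):
    """Attach search results to a lead without ever failing the lead itself.

    `search` takes a query string and returns a list of results, or raises.
    `cache` is a dict mapping query strings to earlier results.
    """
    hints, errors = [], []
    for query in lead.get("location_hints", [])[:max_queries]:
        if query in cache:
            hints.extend(cache[query])
            continue
        try:
            results = search(query)
        except Exception as error:  # API failure is recorded, not raised
            errors.append(f"{query}: {error}")
            continue
        cache[query] = results
        hints.extend(results)
    return {**lead, "geoadmin_hints": hints, "geoadmin_errors": errors}
```

The lead is returned in every case, so a failing or unreachable API only leaves the hint list empty and never removes a lead from the review set.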

Evaluation workflow

Frozen evaluation datasets are stored under:

data/annotation/evaluation/

Generated evaluation results are stored under:

results/evaluation/

The expanded annotation dataset contains 348 manually reviewed sources.

Generated evaluation datasets:

Dataset            Rows  Train  Test  Positive class                      Excluded class
strict_binary      264   211    53    confirmed_relevant                  needs_review
actionable_binary  348   278    70    confirmed_relevant or needs_review  none
triage_3class      348   278    70    three class target                  none

Build evaluation datasets:

PYTHONPATH=src python scripts/evaluation/build_evaluation_datasets.py

Evaluate deterministic score baseline:

PYTHONPATH=src python scripts/evaluation/evaluate_score_baseline.py

Evaluate classical TF IDF baseline:

PYTHONPATH=src python scripts/evaluation/evaluate_classical_text_classifier.py

Run local LLM triage evaluation:

PYTHONPATH=src python scripts/ml/run_local_llm_triage.py \
  --model-id Qwen/Qwen2.5-7B-Instruct \
  --prompt-variant hierarchical

PYTHONPATH=src python scripts/evaluation/evaluate_local_llm_triage.py \
  --predictions results/evaluation/local_llm/Qwen__Qwen2.5-7B-Instruct/hierarchical/llm_triage_predictions.jsonl

Evaluate aligned method comparison:

PYTHONPATH=src python scripts/evaluation/evaluate_aligned_method_comparison.py

Evaluate hybrid lead selection:

PYTHONPATH=src python scripts/evaluation/evaluate_hybrid_lead_selection.py \
  --llm-predictions results/evaluation/local_llm/Qwen__Qwen2.5-7B-Instruct/direct/llm_triage_predictions.jsonl \
  --output-dir results/evaluation/hybrid_lead_selection_qwen7b_direct

PYTHONPATH=src python scripts/evaluation/compare_hybrid_lead_selection_runs.py

Build report package:

PYTHONPATH=src python scripts/evaluation/build_evaluation_report_package.py

The report package is written to:

results/evaluation/report_package/

Current method findings

Task specific binary split results:

Dataset            Method                      Precision  Recall  F1
strict_binary      thematic_score              0.864      0.760   0.809
strict_binary      TF IDF Logistic Regression  0.852      0.920   0.885
actionable_binary  thematic_score              0.771      0.881   0.822
actionable_binary  TF IDF Logistic Regression  0.812      0.929   0.867

Aligned triage test split results:

Task               Method                      Precision  Recall  F1
strict_binary      thematic_score              0.957      0.880   0.917
actionable_binary  thematic_score              0.750      0.857   0.800
actionable_binary  TF IDF Logistic Regression  0.771      0.881   0.822
actionable_binary  Qwen2.5 14B hierarchical    0.969      0.738   0.838
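The F1 column in the two tables above is the harmonic mean of precision and recall. The short check below recomputes it from the reported precision and recall values; it is a reading aid, not part of the evaluation scripts.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the two tables above.
REPORTED = [
    (0.864, 0.760, 0.809),
    (0.852, 0.920, 0.885),
    (0.771, 0.881, 0.822),
    (0.812, 0.929, 0.867),
    (0.957, 0.880, 0.917),
    (0.750, 0.857, 0.800),
    (0.969, 0.738, 0.838),
]

for precision, recall, reported in REPORTED:
    computed = f1_score(precision, recall)
    print(f"P={precision:.3f} R={recall:.3f} F1={computed:.3f} reported={reported:.3f}")
```

Each recomputed value agrees with the reported F1 to three decimals, for example 2 * 0.864 * 0.760 / (0.864 + 0.760) ≈ 0.809.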

Hybrid lead selection findings:

Review depth  Best mode            Precision at N  Recall at N  False negatives
10            tfidf_only           1.000           0.238        32
20            hybrid_recall_guard  1.000           0.476        22
50            score_or_tfidf       0.780           0.929        3
70            score_or_tfidf       0.600           1.000        0
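Precision at N, recall at N, and false negatives all follow from a ranked list of true labels. The sketch below shows the arithmetic on synthetic data. The figure of 42 actionable items among 70 test rows is inferred from the table (10 / 42 ≈ 0.238 with 32 missed), not stated in the evaluation output, and the function name is hypothetical.

```python
def precision_recall_at_n(ranked_labels, n):
    """Compute precision, recall, and false negatives for the top n items.

    `ranked_labels` holds 1 (actionable) or 0, ordered best lead first.
    """
    top = ranked_labels[:n]
    true_positives = sum(top)
    total_positives = sum(ranked_labels)
    precision = true_positives / len(top) if top else 0.0
    recall = true_positives / total_positives if total_positives else 0.0
    false_negatives = total_positives - true_positives
    return precision, recall, false_negatives


# Synthetic perfect ranking: 42 actionable items followed by 28 others.
ranking = [1] * 42 + [0] * 28
print(precision_recall_at_n(ranking, 10))  # precision 1.0, recall about 0.238, 32 missed
print(precision_recall_at_n(ranking, 20))  # precision 1.0, recall about 0.476, 22 missed
```

Reading the table this way makes the trade-off explicit: a shallow review depth can be perfectly precise and still leave most actionable sources unseen.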

Recommended current setup:

use score_or_tfidf for high recall candidate selection
use local LLMs for explanation and review support, not hard exclusion
treat GeoAdmin hints as review aids only
keep the final decision with human reviewers

Current limitations

ChangeScout currently does not:

confirm whether a lead corresponds to a finished real world change
update TLM automatically
verify project geometries
provide a production scheduler
provide production alerting
guarantee canton independent generalization
provide a stable standalone LLM classifier

Additional limitations:

HTML cleaning prioritizes recall over precision
thematic scoring is transparent but keyword and pattern based
TF IDF is a learned review signal and can be miscalibrated on new source types
local LLMs are conservative and not stable enough as hard classifiers
GeoAdmin results are heuristic and sometimes noisy
review exports are aids for manual inspection, not final decisions

Script inventory

The repository contains operational code, MVP reproduction helpers, annotation tools, and evaluation scripts.

Script responsibilities are documented in:

docs/script_inventory.md

Current standard inference entry point:

PYTHONPATH=src python -m changescout.cli infer

Full operational entry point:

PYTHONPATH=src python -m changescout.cli run

Historical MVP reproduction entry point:

bash scripts/operational/run.sh

Tests

Run all tests:

PYTHONPATH=src pytest -q

Generated data policy

Generated operational outputs belong under:

artifacts/runs/<run_id>/

Raw crawled HTML belongs under:

data/crawling/<run_id>/

GeoAdmin cache is local generated data:

data/reference/geoadmin_search_cache.jsonl

Frozen evaluation datasets belong under:

data/annotation/evaluation/

Generated evaluation results belong under:

results/evaluation/

Operational inference must not overwrite frozen evaluation datasets or curated evaluation results.
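One way to enforce this rule in a script is a small path guard that runs before any write. This is a hedged sketch, not existing ChangeScout code: the function name is_protected and the repo_root parameter are hypothetical.

```python
from pathlib import Path

# Directories that operational inference must never write to.
PROTECTED_DIRS = (
    Path("data/annotation/evaluation"),
    Path("results/evaluation"),
)


def is_protected(path, repo_root="."):
    """Return True if path lies inside a frozen evaluation directory."""
    root = Path(repo_root).resolve()
    target = (root / path).resolve()  # also collapses ".." segments
    return any(target.is_relative_to(root / protected) for protected in PROTECTED_DIRS)
```

A writer could call is_protected on its output path and refuse to continue when it returns True. Path.is_relative_to requires Python 3.9 or newer.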

Demo Runbook

For a compact reproducible demo run, see docs/demo_runbook.md.

Labeled dataset

The canonical labeled dataset is data/annotation/labeled/annotation_dataset_expanded.csv. It contains 348 manually reviewed records from multiple cantons. annotation_dataset_expanded.jsonl is the equivalent machine readable copy.
