An asynchronous Python library that classifies HTML segments extracted by a page-segmenter into structured component types.
The segment_classifier implements a 4-stage classification pipeline with progressive fallback to optimize for cost and speed:
- Rule-based heuristics — Zero LLM cost. Uses DOM structure, text density, siblings, and attributes.
- L1 exact fingerprint cache — Zero LLM cost. Exact matching on structural DOM fingerprint hashes.
- L2 fuzzy cluster cache — Zero LLM cost. TF-IDF and cosine similarity on fingerprint tokens.
- LLM batch classification — Batched fallback via LiteLLM with feature-based model routing based on segment complexity.
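
The fallback order above can be illustrated with a small standalone sketch. It is not the library's implementation: the function names, the stage labels (`"rules"`, `"l1_exact"`, `"l2_fuzzy"`, `"llm"`), the toy rule, and the 0.8 threshold are all assumptions made for illustration. It also uses plain token counts instead of TF-IDF weights, runs synchronously on one segment at a time, and replaces the LLM call with a caller-supplied function.

```python
import hashlib
import math
from collections import Counter
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collects tag names and class names, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(tag)
        for name, value in attrs:
            if name == "class" and value:
                self.tokens.extend(f".{c}" for c in value.split())


def fingerprint_tokens(raw_html):
    """Reduce HTML to a list of structural tokens (hypothetical fingerprint)."""
    collector = _TagCollector()
    collector.feed(raw_html)
    collector.close()
    return collector.tokens


def fingerprint_hash(tokens):
    """Hash the token sequence so identical structures share one cache key."""
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


def cosine(a, b):
    """Cosine similarity between two token-count vectors (raw counts, not TF-IDF)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(raw_html, exact_cache, clusters, llm_fallback, threshold=0.8):
    """Return (label, stage), trying the cheapest stage first."""
    tokens = fingerprint_tokens(raw_html)

    # Stage 1: toy rule, assumed for illustration only.
    if tokens and tokens[0] in ("nav", "header", "footer"):
        return tokens[0], "rules"

    # Stage 2: exact match on the structural fingerprint hash.
    key = fingerprint_hash(tokens)
    if key in exact_cache:
        return exact_cache[key], "l1_exact"

    # Stage 3: nearest previously seen structure by cosine similarity.
    counts = Counter(tokens)
    best_label, best_score = None, 0.0
    for label, seen_counts in clusters:
        score = cosine(counts, seen_counts)
        if score > best_score:
            best_label, best_score = label, score
    if best_label is not None and best_score >= threshold:
        exact_cache[key] = best_label  # promote so the next identical hit is exact
        return best_label, "l2_fuzzy"

    # Stage 4: fall back to the (stand-in) LLM and remember the answer.
    label = llm_fallback(raw_html)
    exact_cache[key] = label
    clusters.append((label, counts))
    return label, "llm"


exact_cache, clusters, llm_calls = {}, [], []


def fake_llm(html):
    """Stand-in for the LLM stage; records each call and returns a fixed label."""
    llm_calls.append(html)
    return "product_card"


r_a = classify("<div class='product-card'><h3>A</h3><p>x</p></div>",
               exact_cache, clusters, fake_llm)
r_b = classify("<div class='product-card'><h3>B</h3><p>y</p></div>",
               exact_cache, clusters, fake_llm)
r_c = classify("<div class='product-card'><h3>C</h3><p>z</p><p>w</p></div>",
               exact_cache, clusters, fake_llm)
r_nav = classify("<nav><a href='/'>Home</a></nav>", exact_cache, clusters, fake_llm)
print(r_a, r_b, r_c, r_nav, len(llm_calls))
```

In this sketch only the first product card reaches the stand-in LLM. The second has an identical structure and resolves from the exact cache, and the third differs by one extra `<p>` and resolves through the similarity check.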

You can install the package using poetry:

    poetry install

Or via pip (once published):

    pip install segment-classifier

The library uses pydantic-settings to manage configuration via a .env file or environment variables.
Required environment variables:

    CLASSIFIER_LITELLM_API_KEY="your-api-key"

A minimal usage example:

    import asyncio

    from segment_classifier import ClassifierPipeline
    from segment_classifier.config import ClassifierSettings
    from segment_classifier.models import InputSegment, SegmentPosition


    async def main():
        # Settings are read from .env or environment variables.
        settings = ClassifierSettings()
        pipeline = ClassifierPipeline(settings)
        await pipeline.initialize()

        segments = [
            InputSegment(
                segment_id="seg_001",
                page_url="https://example.com/products",
                page_slug="products",
                raw_html="<div class='product-card'>...</div>",
                text_content="Product Item",
                position_hint=SegmentPosition.MIDDLE,
                sibling_count=3,
            )
        ]

        # Classify the batch, then release pipeline resources.
        result = await pipeline.run(segments)
        await pipeline.shutdown()

        for seg in result.classified:
            print(seg.component_type)


    asyncio.run(main())

Caches are stored by default in .cache/l1_fingerprints.json and .cache/l2_clusters.json / .cache/l2_embeddings.npy.
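
To check which of the default cache files exist after a run, a small helper like the following may be useful. It is a standalone sketch that relies only on the default paths listed above. The `cache_status` function is hypothetical and not part of the library, and it does not read or interpret the files' contents.

```python
from pathlib import Path

# Default cache paths as listed in this README.
DEFAULT_CACHE_FILES = (
    Path(".cache/l1_fingerprints.json"),
    Path(".cache/l2_clusters.json"),
    Path(".cache/l2_embeddings.npy"),
)


def cache_status(base_dir="."):
    """Map each default cache path to its size in bytes, or None if it is missing."""
    base = Path(base_dir)
    status = {}
    for rel in DEFAULT_CACHE_FILES:
        path = base / rel
        status[rel.as_posix()] = path.stat().st_size if path.is_file() else None
    return status


for name, size in cache_status().items():
    label = "missing" if size is None else f"{size} bytes"
    print(f"{name}: {label}")
```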
Every returned ClassifiedSegment carries a classification_stage value indicating which of the four stages resolved that segment.
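
One use for this field is to measure how often each stage resolves a segment, for example to see how much traffic still reaches the LLM. The sketch below uses a stand-in dataclass rather than the real ClassifiedSegment so that it runs on its own. The stage labels in the sample data and the `stage_breakdown` helper are assumptions, and the library's actual values may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeClassifiedSegment:
    """Stand-in exposing only the two fields mentioned in this README."""
    component_type: str
    classification_stage: str


def stage_breakdown(classified):
    """Count how many segments were resolved by each stage."""
    return Counter(seg.classification_stage for seg in classified)


# Hypothetical results; with the real pipeline this would be result.classified.
sample = [
    FakeClassifiedSegment("nav", "rules"),
    FakeClassifiedSegment("product_card", "llm"),
    FakeClassifiedSegment("product_card", "l1_exact"),
    FakeClassifiedSegment("product_card", "l1_exact"),
]

breakdown = stage_breakdown(sample)
for stage, count in breakdown.most_common():
    print(f"{stage}: {count} ({count / len(sample):.0%})")
```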