A modular Python application for processing and classifying log files through a configurable processor pipeline, with LLM-assisted regex pattern generation for semantic classification.
The Semantic Log Line Classifier processes log files line-by-line through a configurable pipeline of processors, then classifies processed lines into semantic categories using regex matching with LLM-assisted regex generation for new patterns.
Input File → [Line Reader] → [Processor Pipeline] → [Classifier] → [Report Generator]
The Processor Pipeline stage expands to:

Timestamp Remover → IP/Port Remover → IP Remover → GUID Remover → Tokenizer
    → Token Normalizer → Token Filter → Token Counter → Patcher
The default pipeline consists of the following processors, in order (a usage sketch follows the list):
- TimestampRemover - Removes ISO 8601 timestamps, common date formats, and Unix timestamps
- IPPortRemover - Removes IP addresses with port numbers (e.g., 10.68.21.11:48438)
- IPRemover - Removes standalone IP addresses (e.g., 127.0.0.1)
- GUIDRemover - Removes UUIDs and 32-char hex identifiers
- Tokenizer - Splits input on whitespace into tokens
- TokenNormalizer - Normalizes tokens (lowercase, remove non-alphanumeric)
- TokenFilter - Filters out single characters, empty strings, and stop words
- TokenCounter - Counts token occurrences
- Patcher - Converts token counts to a single string format
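For example, running one raw line through the default pipeline (the exact patched string produced by Patcher is an internal format, so the comment below describes the stages rather than promising a literal output):

from log_classifier.pipeline import Pipeline

pipeline = Pipeline.default()
raw = '2026-01-10T15:29:45Z 127.0.0.1 request_failed status 404'
processed = pipeline.run([raw])
# The timestamp and IP are stripped; what remains is tokenized, normalized,
# filtered, counted, and patched into a single string for regex matching.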
This project uses simple modular Python packages with no installation machinery. Install the required LLM provider package(s):
# For Anthropic/Claude (default)
pip install anthropic
# For OpenAI
pip install openai
# Or install both
pip install anthropic openai

No other external dependencies are required (uses the standard library for everything else).
from pathlib import Path
from log_classifier.pipeline import Pipeline
from log_classifier.classifier import Classifier, LLMClient, LLMConfig
from log_classifier.reporter import Reporter
# Setup components
# Option 1: Use factory function (recommended)
from log_classifier.classifier import create_llm_client
llm_client = create_llm_client("anthropic", api_key="your-api-key")
# Or for OpenAI:
# llm_client = create_llm_client("openai", api_key="your-api-key")
# Option 2: Use provider-specific configs
# from log_classifier.classifier import AnthropicLLMClient, AnthropicConfig
# config = AnthropicConfig.from_env() # Uses ANTHROPIC_API_KEY env var
# llm_client = AnthropicLLMClient(config)
classifier = Classifier(llm_client)
pipeline = Pipeline.default()
reporter = Reporter(Path("./output"))
# Process log file
with open("input.log") as f:
    for line_num, line in enumerate(f, 1):
        line = line.strip()
        if not line:
            continue

        # Process line through pipeline
        processed = pipeline.run([line])
        if processed:
            # Classify the processed line
            class_name = classifier.classify(processed, line, line_num)

# Generate reports
classes = classifier.get_all_classes()
reporter.generate(classes)

You can create a custom pipeline with specific processors:
from log_classifier.pipeline import Pipeline
from log_classifier.processors import (
    TimestampRemover, GUIDRemover, Tokenizer,
    TokenNormalizer, TokenFilter, TokenCounter, Patcher
)
# Create custom pipeline
custom_processors = [
    TimestampRemover(),
    GUIDRemover(),
    Tokenizer(),
    TokenNormalizer(),
    TokenFilter(),
    TokenCounter(),
    Patcher(),
]
custom_pipeline = Pipeline(custom_processors)

You can customize the stop words list for the TokenFilter:
from log_classifier.processors import TokenFilter
custom_stop_words = frozenset({"custom", "stop", "words"})
token_filter = TokenFilter(stop_words=custom_stop_words)

The classifier supports multiple LLM providers (Anthropic/Claude and OpenAI) and several configuration methods:
Use the create_llm_client() factory function for easy provider switching:
from log_classifier.classifier import create_llm_client
# Anthropic/Claude (default)
llm_client = create_llm_client("anthropic", api_key="your-api-key")
# OpenAI
llm_client = create_llm_client("openai", api_key="your-api-key", model="gpt-4o")
# Using environment variables
# export ANTHROPIC_API_KEY="your-api-key" or export OPENAI_API_KEY="your-api-key"
llm_client = create_llm_client("anthropic")  # Uses ANTHROPIC_API_KEY env var

Anthropic/Claude Configuration:
from log_classifier.classifier import AnthropicLLMClient, AnthropicConfig, ClaudeModels
# Using environment variables
config = AnthropicConfig.from_env()
llm_client = AnthropicLLMClient(config)
# Using model constants
config = AnthropicConfig.with_model(ClaudeModels.OPUS_4, api_key="your-api-key")
llm_client = AnthropicLLMClient(config)
# Direct configuration
config = AnthropicConfig(
    api_key="your-api-key",
    model=ClaudeModels.SONNET_4,
    max_tokens=1024
)
llm_client = AnthropicLLMClient(config)
# Available Claude models:
# - ClaudeModels.SONNET_4 (default)
# - ClaudeModels.OPUS_4
# - ClaudeModels.HAIKU_4
# - ClaudeModels.SONNET_3_5
# - ClaudeModels.OPUS_3
# - ClaudeModels.HAIKU_3

OpenAI Configuration:
from log_classifier.classifier import OpenAILLMClient, OpenAIConfig, OpenAIModels
# Using environment variables
config = OpenAIConfig.from_env()
llm_client = OpenAILLMClient(config)
# Using model constants
config = OpenAIConfig.with_model(OpenAIModels.GPT_4O, api_key="your-api-key")
llm_client = OpenAILLMClient(config)
# Direct configuration
config = OpenAIConfig(
    api_key="your-api-key",
    model=OpenAIModels.GPT_4O,
    max_tokens=1024,
    temperature=0.0  # Lower for more deterministic outputs
)
llm_client = OpenAILLMClient(config)
# Available OpenAI models:
# - OpenAIModels.GPT_4O (default)
# - OpenAIModels.GPT_4O_MINI
# - OpenAIModels.GPT_4_TURBO
# - OpenAIModels.GPT_4
# - OpenAIModels.GPT_3_5_TURBO

The old LLMConfig and LLMClient aliases still work for backward compatibility:
from log_classifier.classifier import LLMConfig, LLMClient
# Old-style configuration still works
config = LLMConfig(api_key="your-api-key")
llm_client = LLMClient(config)

The reporter generates two types of output:
Each class gets its own JSON file with all members:
{
  "class_name": "HTTP 404 Request Failed",
  "regex": ".*code.*404.*event.*request.*failed.*",
  "member_count": 3,
  "members": [
    {
      "original_line": "{\"code\": 404, \"event\": \"request_finished\"...}",
      "line_number": 2,
      "processed_line": "code\u00001\u0001event\u00001..."
    }
  ]
}

A summary of all classes:
{
  "total_classes": 3,
  "total_lines_processed": 7,
  "classes": [
    {
      "class_name": "HTTP 404 Request Failed",
      "regex": ".*code.*404.*event.*request.*failed.*",
      "member_count": 3
    }
  ]
}

log_classifier/
├── processors/ # Processing pipeline components
│ ├── base.py # Abstract Processor base class
│ ├── timestamp.py # TimestampRemover
│ ├── ip.py # IPRemover, IPPortRemover
│ ├── guid.py # GUIDRemover
│ ├── tokenizer.py # Tokenizer
│ ├── normalizer.py # TokenNormalizer
│ ├── filter.py # TokenFilter
│ ├── counter.py # TokenCounter
│ └── patcher.py # Patcher
├── pipeline/ # Pipeline orchestration
│ └── pipeline.py # Pipeline class
├── classifier/ # Classification system
│ ├── classifier.py # Classifier and data models
│ └── llm_client.py # LLM integration
├── reporter/ # Report generation
│ └── reporter.py # Reporter class
└── tests/ # Test suite
├── sample_data.py # Sample log lines
├── test_processors.py
├── test_pipeline.py
└── test_classifier.py
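Every stage in processors/ derives from the abstract Processor class in base.py. The interface is not reproduced in this README, so the method name below is an assumption; under that assumption, a custom stage that runs between Tokenizer and TokenCounter might look like:

from log_classifier.processors.base import Processor

# Hypothetical custom stage; the real Processor interface is defined in
# log_classifier/processors/base.py and its method name may differ.
class DigitRemover(Processor):
    def process(self, tokens):
        # Drop purely numeric tokens before counting (illustrative only)
        return [token for token in tokens if not token.isdigit()]

A list containing DigitRemover() can then be passed to Pipeline(...) as shown in the custom pipeline example above.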
- Python 3.10+
- At least one LLM provider package:
  - anthropic - For Anthropic/Claude models
  - openai - For OpenAI models
  - Or both for flexibility
Everything else uses the standard library:
- re - Regular expressions
- json - JSON handling
- dataclasses - Data structures
- pathlib - File paths
- typing - Type hints
- abc - Abstract base classes
- collections - Counter for token counting
- os - Environment variables
1. Processing: Each log line is processed through the pipeline:
   - Timestamps, IPs, and GUIDs are removed
   - Text is tokenized and normalized
   - Tokens are filtered and counted
   - Result is converted to a patched string format
2. Classification: Processed lines are matched against existing regex patterns (a simplified sketch follows this list):
   - If a match is found, the line is added to that class
   - If no match is found, the LLM generates a new regex pattern and class name
   - New classes are added to the registry in order
3. Reporting: Classification results are written to JSON files (see the reading example after this list):
   - One file per class with all member log lines
   - A summary file with class statistics
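A simplified sketch of the classification step (this is not the actual implementation in classifier/classifier.py; the registry shape and the llm_generate callback are illustrative):

import re

def classify(processed, registry, llm_generate):
    # registry: ordered list of (class_name, compiled_regex) pairs
    for class_name, pattern in registry:
        if pattern.match(processed):
            return class_name  # earliest-registered match wins
    # No match: ask the LLM for a new class name and regex, then register it
    class_name, regex = llm_generate(processed)
    registry.append((class_name, re.compile(regex)))
    return class_name

Downstream tooling can read the summary report back; the summary.json file name is an assumption, but the fields match the schema shown earlier:

import json
from pathlib import Path

summary = json.loads(Path("./output/summary.json").read_text())
for entry in summary["classes"]:
    print(entry["class_name"], entry["member_count"], entry["regex"])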
Environment variables are supported for both providers:
Anthropic/Claude:
- ANTHROPIC_API_KEY (required): Your Anthropic API key
- ANTHROPIC_MODEL (optional): Model identifier (default: claude-sonnet-4-20250514)
- ANTHROPIC_MAX_TOKENS (optional): Maximum tokens per request (default: 1024)
OpenAI:
- OPENAI_API_KEY (required): Your OpenAI API key
- OPENAI_MODEL (optional): Model identifier (default: gpt-4o)
- OPENAI_MAX_TOKENS (optional): Maximum tokens per request (default: 1024)
- OPENAI_TEMPERATURE (optional): Temperature for sampling (default: 0.0)
Set them in your shell:
# For Anthropic
export ANTHROPIC_API_KEY="your-api-key-here"
export ANTHROPIC_MODEL="claude-opus-4-20250514" # Optional
export ANTHROPIC_MAX_TOKENS="2048" # Optional
# For OpenAI
export OPENAI_API_KEY="your-api-key-here"
export OPENAI_MODEL="gpt-4o" # Optional
export OPENAI_MAX_TOKENS="2048" # Optional
export OPENAI_TEMPERATURE="0.0"  # Optional

Then use them in code:
from log_classifier.classifier import create_llm_client, AnthropicConfig, OpenAIConfig
# Using factory function (automatically uses env vars)
anthropic_client = create_llm_client("anthropic")
openai_client = create_llm_client("openai")
# Or using provider-specific configs
from log_classifier.classifier import AnthropicLLMClient, OpenAILLMClient
anthropic_config = AnthropicConfig.from_env()
openai_config = OpenAIConfig.from_env()
anthropic_client = AnthropicLLMClient(anthropic_config)
openai_client = OpenAILLMClient(openai_config)

Override specific values:
# Override model while using env var for API key
anthropic_client = create_llm_client("anthropic", model="claude-haiku-4-20250514")
openai_client = create_llm_client("openai", model="gpt-4o-mini")

Run the test suite:
cd log_classifier
python -m pytest tests/

Or run individual test files:
python -m unittest log_classifier.tests.test_processors
python -m unittest log_classifier.tests.test_pipeline
python -m unittest log_classifier.tests.test_classifier

The classifier expects JSON log lines like:
{"event": "Created RunSession 222202", "timestamp": "2026-01-10T15:29:20.266784Z", "trace_id": "a6dfb9dd-eb72-4e42-a1ff-cdac6105020e"}
{"code": 404, "event": "request_finished", "timestamp": "2026-01-10T15:29:45.865083Z"}
{"event": "request_failed", "status_code": 404, "remote_addr": "10.68.21.11:48438", "timestamp": "2026-01-10T15:29:45.865522Z"}- Requires an API key from at least one LLM provider (Anthropic or OpenAI)
- LLM API calls are made for each unmatched log line (consider caching or batching for large files)
- No retry logic for LLM API failures (exceptions are raised); a workaround sketch follows this list
- Classification is based on processed token patterns, not semantic understanding of log content
- OpenAI models require JSON mode support (GPT-3.5-turbo and GPT-4 series)
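As a workaround for the missing retry logic, callers can wrap classify() themselves. A minimal sketch with exponential backoff (the retry count and delays are arbitrary; the classify signature matches the usage example earlier in this README):

import time

def classify_with_retry(classifier, processed, original, line_num,
                        retries=3, backoff=1.0):
    # Retry transient LLM/API failures; re-raise on the final attempt
    for attempt in range(retries):
        try:
            return classifier.classify(processed, original, line_num)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)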
This project uses simple modular Python packages with direct imports. Each module should be independently testable.
[Add your license information here]