CarolinaRiascos/stylometry

stylefp

Extract stylistic fingerprints from text corpora.

stylefp analyzes a body of writing and produces a detailed stylistic profile: quantitative metrics across 8 dimensions plus optional LLM-powered qualitative analysis. Use it to understand a writer's style, generate actionable style guides, or rewrite documents to match a target voice.

pip install stylefp

Quick Start

# Analyze a collection of documents
stylefp analyze ./my-writing/ -o ./output

# Analyze without LLM (no API key needed)
stylefp analyze ./my-writing/ --no-qualitative

# Rewrite a document in a target style
stylefp rewrite draft.md -s ./output/stylefp_profile.json

Output:

  • stylefp_profile.json: full quantitative + qualitative fingerprint
  • style_guide.md: human-readable style guide (LLM-generated or template-based)

How It Works

                          ┌──────────────────┐
                          │   Input Corpus   │
                          │  .txt .md .html  │
                          │    .rst .htm     │
                          └────────┬─────────┘
                                   │
                          ┌────────▼─────────┐
                          │    spaCy NLP     │
                          │    Processing    │
                          └────────┬─────────┘
                                   │
            ┌──────────────────────┼──────────────────────┐
            │                      │                      │
   ┌────────▼────────┐    ┌────────▼────────┐   ┌─────────▼────────┐
   │ Sentence        │    │ Vocabulary      │   │ Punctuation      │
   │ Structure       │    │ Readability     │   │ Rhetorical       │
   │ Image           │    │ Writing Style   │   │                  │
   └────────┬────────┘    └────────┬────────┘   └─────────┬────────┘
            │                      │                      │
            └──────────────────────┼──────────────────────┘
                                   │
                     ┌─────────────▼─────────────┐
                     │  Qualitative Analysis     │
                     │  (Claude LLM, optional)   │
                     │  Fed by quantitative data │
                     └─────────────┬─────────────┘
                                   │
                     ┌─────────────▼─────────────┐
                     │         Outputs           │
                     │ JSON Profile + Style Guide│
                     └───────────────────────────┘

The pipeline is feed-forward: quantitative metrics are computed first, then passed to the LLM as context for richer qualitative analysis. Each analyzer can be independently enabled or disabled.

Rewrite Verification

When rewriting a document, stylefp extracts all numeric data points and structured data containers (tables, lists, key-value pairs) from the original text. The rewritten output is then verified against this whitelist:

  • Diagram validation: every number inside a generated Mermaid diagram must trace back to the source document. This check is zero-tolerance: a single fabricated value causes the entire diagram to be stripped.
  • Prose validation: numbers in the rewritten prose are cross-referenced against the original. Fabricated numbers are flagged with their surrounding context so the user can review them.

This prevents the LLM from hallucinating statistics, percentages, or chart data that don't exist in the source material.
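The two checks above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in validation.py: the regular expression, the function names, and the comma-stripping normalization are all assumptions made for the example.

```python
import re

# Assumed number pattern: integers and decimals, with optional thousands separators.
NUMBER_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")
FENCE = "`" * 3  # Markdown code fence, built here to avoid a literal fence in the source
MERMAID_RE = re.compile(FENCE + r"mermaid\n(.*?)" + FENCE, re.DOTALL)


def extract_numbers(text: str) -> set[str]:
    """Collect every numeric token in the text, normalized without commas."""
    return {m.group().replace(",", "") for m in NUMBER_RE.finditer(text)}


def strip_fabricated_diagrams(rewrite: str, whitelist: set[str]) -> str:
    """Remove any Mermaid block that contains a number missing from the whitelist."""
    def keep_or_drop(match: re.Match) -> str:
        numbers = extract_numbers(match.group(1))
        # Zero tolerance: one unknown number drops the whole diagram.
        return match.group(0) if numbers <= whitelist else ""
    return MERMAID_RE.sub(keep_or_drop, rewrite)


def flag_fabricated_numbers(
    rewrite: str, whitelist: set[str], context: int = 30
) -> list[tuple[str, str]]:
    """Return (number, surrounding text) for each number not found in the source."""
    flagged = []
    for m in NUMBER_RE.finditer(rewrite):
        if m.group().replace(",", "") not in whitelist:
            start, end = max(0, m.start() - context), m.end() + context
            flagged.append((m.group(), rewrite[start:end]))
    return flagged
```

A caller would build the whitelist once with extract_numbers(original_text), strip diagrams first, and then flag whatever numbers remain in the prose.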


Style Metrics

Analyzer       What it measures
Sentence       Length distributions, type ratios (declarative/interrogative/exclamatory/imperative), grammatical complexity, opener patterns
Vocabulary     TTR, MATTR, hapax ratio, Yule's K, formality score, POS distribution, TF-IDF characteristic words, jargon ratio
Punctuation    Per-sentence punctuation frequencies, quotation usage, emphasis markers (italics, bold, ALL-CAPS)
Structure      Paragraph and document length distributions, structural elements (headers, lists, blockquotes)
Readability    Flesch-Kincaid, Flesch Reading Ease, Gunning Fog, Coleman-Liau, ARI, SMOG, Dale-Chall
Rhetorical     Passive voice, hedging, intensifiers, contractions, pronoun ratios, dialogue ratio
Image          Image features and diagram detection
Writing Style  8 independent style dimensions (0.0–1.0): descriptive, persuasive, narrative, expository, review, technical, objective, subjective
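For readers unfamiliar with the lexical-diversity metrics in the Vocabulary row, the sketch below shows their textbook definitions. It is not taken from vocabulary.py. The function names, the default window size, and the choice to divide the hapax count by vocabulary size (some definitions divide by total tokens) are assumptions for illustration.

```python
from collections import Counter


def ttr(tokens: list[str]) -> float:
    """Type-token ratio: distinct words divided by total words."""
    return len(set(tokens)) / len(tokens)


def mattr(tokens: list[str], window: int = 50) -> float:
    """Moving-average TTR: mean TTR over every sliding window of a fixed size."""
    if len(tokens) <= window:
        return ttr(tokens)
    ratios = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)


def hapax_ratio(tokens: list[str]) -> float:
    """Share of distinct words that occur exactly once (assumed denominator: vocabulary size)."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(counts)


def yules_k(tokens: list[str]) -> float:
    """Yule's K = 10^4 * (sum(i^2 * V_i) - N) / N^2, with V_i = number of words seen i times."""
    n = len(tokens)
    freq_of_freqs = Counter(Counter(tokens).values())
    s2 = sum(i * i * v for i, v in freq_of_freqs.items())
    return 1e4 * (s2 - n) / (n * n)
```

MATTR is less sensitive to document length than plain TTR, which is why the two are commonly reported together.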

Qualitative Analysis (LLM-powered, optional)

When an Anthropic API key is available, Claude analyzes a representative sample and produces tone, mood, narrative voice, rhetorical devices, thematic patterns, distinctive quirks, audience assessment, and style register.


Installation

With uv (recommended)

git clone https://github.com/CarolinaRiascos/stylometry.git
cd stylometry
uv venv
uv pip install -e ".[dev]"
uv run python -m spacy download en_core_web_sm

With pip

pip install stylefp
python -m spacy download en_core_web_sm

Environment Variables

# Required for qualitative analysis and rewrite commands
export ANTHROPIC_API_KEY=sk-ant-...

CLI Reference

stylefp analyze

Analyze a corpus and extract a stylistic fingerprint.

stylefp analyze <paths>... [OPTIONS]
Option            Description                         Default
-o, --output      Output directory                    Current directory
--no-qualitative  Skip LLM analysis                   False
--spacy-model     spaCy model name                    en_core_web_sm
--json-only       JSON output only, skip style guide  False
-q, --quiet       Suppress progress output            False

Examples:

# Analyze a directory of Markdown files
stylefp analyze ./blog-posts/ -o ./analysis

# Analyze specific files
stylefp analyze essay1.txt essay2.txt essay3.md

# Fast analysis (no API key needed)
stylefp analyze ./docs/ --no-qualitative --json-only

stylefp rewrite

Rewrite a document to match a target writing style.

stylefp rewrite <input_file> -s <style_profile.json> [OPTIONS]
Option        Description                      Default
-s, --style   Path to a stylefp_profile.json   Required
--sample      Sample text in the target style  None
-o, --output  Output directory                 Current directory
-q, --quiet   Suppress progress output         False

The rewrite command automatically detects a style_guide.md in the same directory as the profile JSON and uses it for additional context.
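The sibling-file lookup described above amounts to something like the following. The helper name find_style_guide is hypothetical, and the real command may resolve paths differently.

```python
from pathlib import Path
from typing import Optional


def find_style_guide(profile_path: str) -> Optional[Path]:
    """Return the style_guide.md next to the profile JSON, or None if it is absent."""
    candidate = Path(profile_path).resolve().parent / "style_guide.md"
    return candidate if candidate.is_file() else None
```

In practice this means that keeping both outputs of stylefp analyze in the same directory is enough for rewrite to pick up the guide.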

Examples:

# Rewrite a draft to match an analyzed style
stylefp rewrite my-draft.md -s ./hemingway-analysis/stylefp_profile.json

# Include a sample of the target style for better matching
stylefp rewrite report.txt -s ./style/stylefp_profile.json --sample ./style/example.txt

Web App

A FastAPI web interface is also available, featuring a demo tab with a precomputed analysis and two style-transferred rewrites: an Eiffel Tower article and an AI in Logistics research paper.

uvicorn stylefp.web.app:app

stylefp schema

Print the JSON schema for the StyleFingerprint model.

stylefp schema

Supported Input Formats

Format            Extensions
Plain text        .txt
Markdown          .md, .markdown
reStructuredText  .rst
HTML              .html, .htm

Files are read as UTF-8 (with latin-1 fallback). Markdown and HTML formatting is stripped before analysis.
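A minimal sketch of the decoding fallback and HTML stripping described above, using only the standard library. The function and class names are assumptions, and corpus.py may rely on a different approach or a third-party parser.

```python
from html.parser import HTMLParser
from pathlib import Path


def read_text_with_fallback(path: str) -> str:
    """Decode a file as UTF-8, falling back to latin-1 if that fails."""
    raw = Path(path).read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this decode cannot fail.
        return raw.decode("latin-1")


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(markup: str) -> str:
    """Return the text content of an HTML fragment with tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

Because latin-1 never raises, a file in some other legacy encoding will still load, just with wrong characters, so converting sources to UTF-8 beforehand is the safer route.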


Output Format

stylefp_profile.json

A structured JSON file containing all computed metrics. Top-level fields:

{
  "corpus_name": "my-writing",
  "document_count": 12,
  "total_words": 45230,
  "sentence": { ... },
  "vocabulary": { ... },
  "punctuation": { ... },
  "structure": { ... },
  "readability": { ... },
  "rhetorical": { ... },
  "writing_style": { ... },
  "qualitative": { ... },
  "metadata": { "version": "0.1.0", "timestamp": "...", "spacy_model": "en_core_web_sm" }
}

Use stylefp schema to see the full JSON schema.

style_guide.md

A human-readable style guide covering voice, sentence structure, vocabulary, punctuation, and rhetorical patterns. When qualitative analysis is enabled, this is generated by Claude as an actionable writing guide. Otherwise, a template-based guide is produced from quantitative data alone.


Development

# Clone and install
git clone https://github.com/CarolinaRiascos/stylometry.git
cd stylometry
uv venv
uv pip install -e ".[dev]"
uv run python -m spacy download en_core_web_sm

# Run tests
uv run pytest tests/ -v

# Lint
uv run ruff check src/

# Type check
uv run mypy src/stylefp/

Architecture

src/stylefp/
├── __init__.py
├── __main__.py             # Entry point
├── cli.py                  # Typer CLI (analyze, rewrite, schema)
├── config.py               # StylefpConfig dataclass
├── corpus.py               # Document loading & text extraction
├── data_search.py          # Data point & container extraction
├── formula.py              # Formula handling
├── models.py               # Pydantic models for all features
├── nlp.py                  # spaCy NLP utilities
├── pipeline.py             # Analyzer orchestration
├── validation.py           # Fabricated diagram & number validation
├── analyzers/
│   ├── base.py             # BaseAnalyzer abstract class
│   ├── image.py            # Image feature analysis
│   ├── sentence.py         # Sentence types, complexity, openers
│   ├── vocabulary.py       # Lexical diversity, formality, TF-IDF
│   ├── punctuation.py      # Punctuation habits & emphasis
│   ├── structure.py        # Paragraph & document organization
│   ├── readability.py      # Standard readability indices
│   ├── rhetorical.py       # Voice, hedging, pronouns, dialogue
│   ├── writing_style.py    # 8-dimension style classification
│   └── qualitative.py      # LLM-powered analysis (Claude)
├── output/
│   ├── json_writer.py      # JSON profile output
│   └── markdown_writer.py  # Style guide output
├── prompts/
│   └── templates.py        # LLM prompt templates
└── web/
    ├── app.py              # FastAPI web interface
    ├── preloaded_examples.py # Bundled demo data loader
    ├── schemas.py          # Request/response models
    ├── data/               # Precomputed example files
    └── static/             # HTML, CSS, JS assets

All analyzers implement BaseAnalyzer.analyze(corpus, docs) and return strongly-typed Pydantic models. The pipeline registers analyzers as (field_name, label, instance) tuples, runs them sequentially, and assembles the results into a StyleFingerprint.
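The registration pattern described above might look roughly like this. The sketch is deliberately simplified: it uses dataclasses in place of Pydantic so that it runs without dependencies, and SentenceFeatures, run_pipeline, and the skip parameter are hypothetical names rather than the project's actual API. Only BaseAnalyzer, analyze(corpus, docs), the tuple shape, and StyleFingerprint come from the description.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


class BaseAnalyzer(ABC):
    @abstractmethod
    def analyze(self, corpus: list[str], docs: list[Any]) -> Any:
        """Compute one feature group for the whole corpus."""


@dataclass
class SentenceFeatures:  # hypothetical result model
    mean_length: float


class SentenceAnalyzer(BaseAnalyzer):
    def analyze(self, corpus: list[str], docs: list[Any]) -> SentenceFeatures:
        # Stand-in logic: split on periods instead of using spaCy sentence spans.
        sentences = [s for text in corpus for s in text.split(".") if s.strip()]
        words = sum(len(s.split()) for s in sentences)
        return SentenceFeatures(mean_length=words / len(sentences))


@dataclass
class StyleFingerprint:  # simplified stand-in for the real model
    features: dict[str, Any] = field(default_factory=dict)


def run_pipeline(
    corpus: list[str],
    docs: list[Any],
    analyzers: list[tuple[str, str, BaseAnalyzer]],
    skip: frozenset[str] = frozenset(),
) -> StyleFingerprint:
    """Run each registered analyzer in order and collect results by field name."""
    fingerprint = StyleFingerprint()
    for field_name, label, instance in analyzers:
        if field_name in skip:  # assumed mechanism for disabling an analyzer
            continue
        fingerprint.features[field_name] = instance.analyze(corpus, docs)
    return fingerprint
```

Keeping the field name in the tuple lets the pipeline assemble the result without each analyzer knowing where its output ends up.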


License

MIT
