LingFrame — Linguistic Atomization Framework

A computational rhetoric platform that decomposes text into hierarchical atomic units, applies six configurable analysis modules across every level of linguistic granularity, and generates interactive visualizations — spanning 46 canonical works in 15+ languages across 12 literary traditions.

Why This Exists
Philosophical Framework
What LingFrame Does
The Global Canonical Corpus
Technical Architecture
Installation and Quick Start
Three User Pathways
Analysis Methodology
Multilingual Support
Extension Points
Sample Projects
Technology Stack
Documentation
Roadmap
Cross-Organ Context
Citation
Contributing
License

Why This Exists

Rhetorical analysis — the millennia-old discipline of understanding how language persuades, instructs, moves, and transforms — remains overwhelmingly subjective. A scholar reads a text, identifies patterns through trained intuition, and produces an interpretation that another scholar may read entirely differently. Writers seeking feedback on argumentation quality receive impressionistic assessments. Translation studies rely on bilingual intuition rather than measurable divergence metrics. Comparative literature traces structural parallels across traditions by hand, one passage at a time.

The problem is not that human interpretation lacks value. It is that human interpretation lacks instruments. A musician does not abandon their ear when they use a tuner; the tuner reveals what the ear might miss and confirms what the ear suspects. Rhetorical analysis needs its tuner.

LingFrame is that tuner: a computational rhetoric platform that atomizes text into a hierarchical structure — theme, paragraph, sentence, word, letter — and applies configurable analysis pipelines at every level of that hierarchy. It does not replace the scholar's trained eye. It arms the scholar with pattern-detection capabilities that would take weeks to replicate manually, running across texts in 15+ languages spanning four millennia of literary production.

This is not a grammar checker, not a writing assistant, not a generative AI that rewrites your prose. LingFrame is a heuristic analysis instrument — it highlights patterns for human interpretation, quantifies what was previously only felt, and makes visible the invisible architectures of language.

Philosophical Framework

Linguistic Atomization as Epistemological Method

The central idea of LingFrame is deceptively simple: any text can be decomposed into atoms, and those atoms can be analyzed at multiple levels of granularity simultaneously. This is linguistic atomization — the systematic breaking of language into its constituent units, not to reduce it, but to reveal the structures that emerge at each scale.

The metaphor is deliberately drawn from physics. Just as matter reveals different properties at the atomic, molecular, and macroscopic scales, language reveals different patterns at the letter, word, sentence, paragraph, and thematic scales. A sentiment arc visible at the paragraph level may be invisible at the sentence level. A rhetorical strategy that operates through word choice (diction) functions differently from one that operates through sentence structure (syntax), even when both serve the same argumentative purpose.

The Hierarchy of Atoms

LingFrame's atomization model produces a five-level tree:

Theme (T)
  └── Paragraph (P)
        └── Sentence (S)
              └── Word (W)
                    └── Letter (L)

Each node in this tree is an "atom" — a discrete unit of text carrying metadata about its position, its content, and its relationships to adjacent and parent atoms. Analysis modules traverse this tree, computing metrics at each level and aggregating them upward. The result is not a single score or a single interpretation, but a multi-dimensional profile of the text that respects the irreducible complexity of language.

Why Hierarchical Decomposition Matters

Traditional computational linguistics tends to operate at a single level — bag-of-words models work at the word level, sentiment analyzers at the sentence or document level, topic models at the document level. LingFrame's hierarchical approach means that every analysis module produces results at every level simultaneously. This reveals patterns that single-level analysis misses entirely:

Cross-level rhetorical strategies: An author who uses emotionally charged individual words (Word level) within logically structured sentences (Sentence level) within thematically neutral paragraphs (Paragraph level) is executing a specific rhetorical technique — one that only becomes visible when all three levels are examined together.
Structural rhythm: The alternation between long and short paragraphs, between complex and simple sentences, between Latinate and Anglo-Saxon diction — these rhythmic patterns emerge from multi-level analysis and correspond to the "music" of prose that experienced readers feel but rarely quantify.
Translation fidelity at scale: When comparing translations, divergence may occur at different levels — one translator preserves sentence structure but alters word choice; another preserves vocabulary but restructures paragraphs. Hierarchical comparison reveals where and how translations diverge, not merely that they diverge.

Heuristic Honesty

LingFrame is explicitly a heuristic tool. Its scores are pattern-detection indicators, not validated psychometric measurements. The 9-step evaluation framework produces numbers between 0 and 100, but those numbers measure the density of detectable linguistic markers, not the "quality" of rhetoric in any absolute sense. A text that scores 30 on Pathos is not necessarily less emotionally effective than one that scores 80 — it may simply employ emotional strategies that LingFrame's current pattern library does not detect.

This philosophical honesty is foundational. LingFrame exists within ORGAN-I: Theoria, the epistemological organ of the organvm system. Theoria's mandate is not to produce convenient answers but to pursue rigorous inquiry — and rigorous inquiry requires acknowledging the boundaries of its instruments. LingFrame's documentation, scoring explanations, and methodology papers all maintain this commitment to epistemic transparency.

What LingFrame Does

LingFrame performs computational rhetorical analysis through five interconnected stages:

Atomization — Decomposes any text (plain text, PDF, uploaded document) into a five-level hierarchical structure using configurable naming strategies. Each atom receives a unique identifier and metadata about its position, content, and linguistic properties.
Multi-Dimensional Analysis — Applies six configurable analysis modules across the hierarchy. Each module operates independently, producing its own data structures, but all modules share the same atomic tree, enabling cross-module correlation.
Pattern Detection — Identifies linguistic markers (evidence markers, emotional appeals, authority signals, weakness indicators, transition patterns), rhetorical strategies, and structural patterns that emerge at individual and cross-level scales.
Visualization — Generates five types of interactive HTML visualizations (force-directed semantic networks, Sankey temporal flows, sentiment charts, entity browsers, evaluation dashboards) with cross-visualization linking that allows click-through navigation between views.
Report Generation — Produces human-readable narrative reports in HTML with findings, recommendations, and — optionally — revision suggestions generated through the framework's generation layer.

The Six Analysis Modules

Module	Method	What It Reveals
Heuristic Evaluation	Pattern matching against rhetorical markers for evidence, emotion, authority	A 9-step assessment (Critique, Logos, Pathos, Ethos, Logic Check, Blind Spots, Shatter Points, Bloom, Evolve) with per-level breakdowns and weighted aggregation
Semantic Analysis	TF-IDF similarity computation + entity co-occurrence mapping	Theme networks showing how concepts cluster and relate across the text; concept density at each hierarchical level
Temporal Analysis	Verb tense extraction + temporal marker detection	Timeline flow revealing how narrative moves through past, present, and future; tense distribution patterns
Sentiment Analysis	VADER + domain-specific lexicons (English) / XLM-RoBERTa (multilingual)	Emotional arcs across the text; valence and arousal mapping at sentence, paragraph, and theme levels
Entity Analysis	spaCy NER + domain-specific regex patterns (15+ languages)	Named entity extraction, co-occurrence matrices, entity density mapping across text sections
Translation Analysis	Sentence alignment + embedding-based comparison	Divergence metrics between source and translation; semantic fidelity scores at each hierarchical level

The Global Canonical Corpus

LingFrame ships with a curated corpus of 46 canonical works across 115 text files spanning 12 literary traditions and four millennia of human literary production. All texts are in the public domain.

Traditions Represented

Tradition	Works	Example Texts
Classical Greek & Latin	Iliad, Odyssey, Aeneid, Metamorphoses	Homer (Butler, 1898), Virgil (Latin original + Dryden, 1697 + Conington, 1866), Ovid
Ancient Near Eastern	Epic of Gilgamesh	Jastrow (1920), Thompson (1928)
Arabic-Persian	Quran, Arabian Nights, Rubaiyat, Conference of the Birds	Arabic original + 4 English translations (Pickthall, Rodwell, Sale, Yusuf Ali); Burton (16 vols); Fitzgerald
Chinese Classical	Tao Te Ching, Analects, Art of War, Journey to the West	Chinese originals + multiple translations (Legge, Goddard, Medhurst, Giles)
Sanskrit	Bhagavad Gita, Mahabharata, Ramayana, Rigveda	Sanskrit originals + English translations
Hebrew	Tanakh	Hebrew original + English translations
Japanese	Tale of Genji, Heike Monogatari, Kojiki, Oku no Hosomichi	Japanese originals + English translations
Medieval European	Beowulf, Canterbury Tales, Divine Comedy, Song of Roland	Old English, Middle English, Italian originals + translations
Renaissance	Don Quixote, The Prince, Hamlet, The Tempest	Shakespeare (First Folio, 1623), Machiavelli, Cervantes
Enlightenment	Candide, Gulliver's Travels, Paradise Lost	Voltaire, Swift, Milton
Romantic	Frankenstein, Pride and Prejudice, Faust, Les Fleurs du Mal, Eugene Onegin, Les Miserables	Shelley, Austen, Goethe, Baudelaire, Pushkin (Russian original), Hugo
Modern	Moby-Dick, Crime and Punishment, Madame Bovary, Dubliners, The Trial, We	Melville, Dostoevsky (Russian original), Flaubert, Joyce, Kafka, Zamyatin

Corpus Design Principles

The corpus is not a random collection. It is curated to enable specific analytical workflows:

Cross-translation comparison: Major works include the original-language text alongside 2-5 public-domain English translations (e.g., the Quran includes Arabic + 4 translations; the Tao Te Ching includes Chinese + 3 translations). This enables LingFrame's Translation Analysis module to quantify divergence across renderings of the same source.
Cross-tradition structural analysis: Works are selected to span rhetorical traditions — Homeric oral-formulaic composition, Quranic sajʿ (rhymed prose), Chinese classical parallelism, Sanskrit epic metre — enabling comparative rhetorical study across civilizational boundaries.
Temporal coverage: From the Epic of Gilgamesh (c. 2100 BCE) through Joyce's Dubliners (1914), the corpus enables diachronic study of how rhetorical strategies evolve across four millennia.
Script diversity: Texts span 8+ writing systems (Latin, Cyrillic, Greek, Arabic, Hebrew, Devanagari, CJK, Japanese), exercising LingFrame's multilingual tokenization pipeline and validating its script-aware analysis capabilities.

Technical Architecture

Directory Structure

linguistic-atomization-framework/
├── framework/                    # Core library
│   ├── core/                     # Atomization engine
│   │   ├── atomizer.py           # Text → hierarchical atom tree
│   │   ├── ontology.py           # Data structures (Atom, Corpus, etc.)
│   │   ├── naming.py             # 5 naming strategies
│   │   ├── pipeline.py           # Analysis orchestration
│   │   ├── language.py           # Language detection and configuration
│   │   ├── tokenizers.py         # Script-aware tokenization
│   │   ├── registry.py           # Component registry
│   │   ├── recursion.py          # Iterative analysis capability
│   │   └── reproducibility.py    # Checksum tracking
│   ├── analysis/                 # 6 analysis modules
│   │   ├── base.py               # BaseAnalysisModule interface
│   │   ├── evaluation.py         # 9-step heuristic evaluation
│   │   ├── semantic.py           # TF-IDF + entity co-occurrence
│   │   ├── temporal.py           # Verb tense + temporal markers
│   │   ├── sentiment.py          # VADER / XLM-RoBERTa
│   │   ├── entity.py             # spaCy NER + domain patterns
│   │   └── translation.py        # Alignment + embedding comparison
│   ├── visualization/            # 5 visualization adapters
│   │   ├── adapters/             # D3.js, Plotly, Chart.js adapters
│   │   ├── cross_linking.py      # Cross-visualization navigation
│   │   └── base.py               # BaseVisualizationAdapter interface
│   ├── output/                   # Report formatters
│   │   ├── narrative.py          # HTML narrative reports
│   │   └── scholarly.py          # LaTeX, TEI-XML, CoNLL export
│   ├── generation/               # Revision suggestion engine
│   ├── domains/                  # Domain profiles (military, literary, etc.)
│   ├── loaders/                  # PDF and document ingestion
│   ├── llm/                      # Optional LLM integration
│   └── schemas/                  # Atomization schemas (default, Arabic, Chinese, Japanese)
│
├── app/                          # Streamlit web interface
│   ├── streamlit_app.py          # Main web entry point
│   └── components/               # UI components
│       ├── corpus_observatory.py # Browse, preview, compare 46 texts
│       ├── rhetoric_gym.py       # Practice exercises with feedback
│       ├── analysis_engine.py    # Analysis orchestration
│       ├── visualization_bridge.py # Visualization rendering
│       ├── upload.py             # Document upload handler
│       └── results.py            # Results display
│
├── cli/                          # Command-line interfaces
│   ├── main.py                   # Full CLI (atomize, analyze, visualize, migrate, etc.)
│   └── simple.py                 # Simplified one-command interface
│
├── corpus/                       # 46 works, 115 files, 12 traditions
│   ├── classical/                # Greek & Latin
│   ├── arabic-persian/           # Quran, Arabian Nights, Rubaiyat, etc.
│   ├── chinese-classical/        # Tao Te Ching, Analects, Art of War, etc.
│   ├── sanskrit/                 # Bhagavad Gita, Mahabharata, Ramayana, Rigveda
│   ├── hebrew/                   # Tanakh
│   ├── japanese/                 # Tale of Genji, Heike Monogatari, etc.
│   ├── medieval/                 # Beowulf, Canterbury Tales, Divine Comedy, etc.
│   ├── renaissance/              # Don Quixote, Hamlet, The Prince
│   ├── enlightenment/            # Candide, Gulliver's Travels, Paradise Lost
│   ├── romantic/                 # Frankenstein, Faust, Eugene Onegin, etc.
│   ├── modern/                   # Moby-Dick, Crime and Punishment, The Trial, etc.
│   └── early-modern/             # The Tempest
│
├── projects/                     # Analysis projects with configurations
│   └── literary-analysis/        # Sample projects (tomb-unknowns, MET4MORFOSES)
│
├── templates/                    # HTML report templates
├── tests/                        # 142 tests across all modules
├── visualizations/               # Pre-generated interactive visualizations
├── docs/                         # Theory, methodology, limitations, tutorials
├── lingframe.py                  # CLI entry point
└── run_web.py                    # Web app launcher

Core Concepts

Hierarchical Atomization. Text is decomposed into a tree structure: Theme -> Paragraph -> Sentence -> Word -> Letter. Each node is an "atom" with an identifier, content, metadata, and parent-child relationships. The atomizer supports configurable detection of theme boundaries (heading patterns, whitespace heuristics) and delegates sentence/word tokenization to language-appropriate pipelines.

Five Naming Strategies. Every atom receives a unique identifier. The naming strategy determines how that identifier is constructed:

Strategy	Example ID	Use Case
`hybrid` (recommended)	`T001:section-title.P001.S001`	Readable + unique; best for most analyses
`hierarchical`	`T001.P001.S001`	Clean parent-child paths for programmatic access
`semantic`	`military-town.para-1.sent-1`	Content-derived slugs for human navigation
`legacy`	`T001`, `P0001`, `S00001`	Flat IDs; backward compatibility
`uuid`	`T_abc12345`	Globally unique; cross-corpus reference

Domain Profiles. Analysis behavior adapts to textual domain through YAML-defined profiles containing domain-specific sentiment lexicons and entity recognition patterns. The framework ships with military and base (general) profiles, with a documented pattern for creating literary, legal, technical, and other domain profiles.

Analysis Pipeline. The processing flow is:

Source Text → Atomization → [Analysis Modules] → Visualization → Report
                                    ↓
                           Generation (revision suggestions)
                                    ↓
                              [Recursion loop]

The optional recursion loop enables iterative analysis: apply suggestions, re-atomize, re-analyze, compare results. Reproducibility tracking ensures that each iteration is checksummed and can be exactly replicated.

Installation and Quick Start

Prerequisites

Python 3.10+ (3.11+ recommended)
pip or conda for package management
500MB+ disk space (corpus texts are included)

Installation

git clone https://github.com/organvm-i-theoria/linguistic-atomization-framework.git
cd linguistic-atomization-framework

python3 -m venv new_venv
source new_venv/bin/activate  # Windows: new_venv\Scripts\activate

pip install -r requirements.txt
python -m spacy download en_core_web_sm  # Base English model
# python -m spacy download en_core_web_trf  # Higher accuracy (optional, larger)

Quick Start — Analyze a Document

# Analyze a PDF or text file — opens interactive HTML report in browser
python lingframe.py analyze document.pdf

# Quick console summary (no browser required)
python lingframe.py quick document.txt

# Save report to specific location
python lingframe.py analyze essay.pdf -o analysis-report.html

Quick Start — Web Interface

python run_web.py
# Opens at http://localhost:8501
# Features: Corpus Observatory, Rhetoric Gym, document upload, interactive analysis

Quick Start — Project-Based Analysis

# List available analysis projects
python lingframe.py list-projects

# Run full pipeline on a sample project
python lingframe.py run -p literary-analysis/tomb-unknowns --visualize --verbose

# Individual pipeline stages
python lingframe.py atomize -p literary-analysis/tomb-unknowns
python lingframe.py analyze -p literary-analysis/tomb-unknowns
python lingframe.py visualize -p literary-analysis/tomb-unknowns

Three User Pathways

Scholar Pathway

Full platform access for linguistic research, corpus analysis, and methodology development. Scholars work with the project-based workflow, defining custom analysis pipelines through project.yaml configuration files, building domain-specific lexicons, and using the Corpus Observatory to browse and compare texts from the 46-work canonical corpus.

# Project-based corpus analysis with full visualization
lingframe run -p literary-analysis/corpus-name --visualize --verbose

# Custom pipeline stages
lingframe atomize -p my-project && lingframe analyze -p my-project

# Migrate a project to ontological naming
lingframe migrate --project my-project --category literary-analysis --naming hybrid

Writer Pathway

Simplified interface for document feedback and revision guidance. Writers upload a document (PDF, plain text) and receive an immediate analysis with a narrative report covering rhetorical strengths, weaknesses, and concrete revision suggestions.

# Instant analysis with HTML report
lingframe analyze essay.pdf

# Quick console summary for iterative editing
lingframe quick document.txt

Developer Pathway

API access for embedding analysis in applications. The framework exposes a clean Python API for atomization, analysis, and visualization, enabling integration into educational platforms, writing tools, and research pipelines.

from framework.core.pipeline import Pipeline
from framework.core.atomizer import Atomizer

atomizer = Atomizer(config)
corpus = atomizer.atomize(text)
results = Pipeline(config).run(corpus)

Analysis Methodology

The 9-Step Heuristic Evaluation Framework

The evaluation module — LingFrame's most distinctive analytical instrument — applies a structured assessment organized into four phases:

Phase	Steps	Focus
Evaluation	1. Critique, 2. Logos, 3. Pathos, 4. Ethos	Initial rhetorical assessment across the three Aristotelian appeals plus overall critique
Reinforcement	5. Logic Check	Argument flow, transition quality, logical consistency
Risk	6. Blind Spots, 7. Shatter Points	Vulnerabilities in reasoning, unstated assumptions, points where the argument is most susceptible to counterattack
Growth	8. Bloom, 9. Evolve	Emergent insights — patterns that suggest the text is reaching toward something it has not yet fully articulated; synthesis opportunities

How Scoring Works

Each step produces a score (0-100) based on:

Pattern matching against predefined linguistic markers (evidence markers, emotional markers, authority markers, weakness markers, transition markers)
Density calculations at each atomization level — how many markers appear per atom at each level of the hierarchy
Weighted aggregation across levels: Letter (5%), Word (15%), Sentence (35%), Paragraph (30%), Theme (15%)

The weighting reflects the relative rhetorical significance of each level — sentences and paragraphs carry the most rhetorical weight in Western argumentation traditions, while letter-level patterns (alliteration, assonance) contribute subtly but measurably.

Linguistic Marker Categories

The framework detects patterns across five marker categories:

Evidence Markers: Statistics, citations, logical connectors ("therefore," "because," "studies show"), quantifiers, specific examples
Emotional Markers: Appeals to shared values, urgency language, inclusive pronouns ("we," "our"), intensifiers, sensory language
Authority Markers: Credential references, source citations, trust builders ("research confirms"), appropriate hedging ("likely," "suggests")
Weakness Markers: Unsupported claims, vague quantifiers ("many," "some"), logical fallacies, contradiction indicators
Transition Markers: Addition ("furthermore," "moreover"), contrast ("however," "nevertheless"), cause-effect ("consequently," "therefore"), sequence ("first," "finally")

Epistemic Transparency

Every score in LingFrame comes with an explanation layer that shows exactly which markers were detected, at which level, and how they were weighted. This is not a black box. The docs/methodology.md and docs/theory.md documents provide full theoretical grounding for the marker taxonomy, the weighting scheme, and the known limitations of each detection heuristic.

Multilingual Support

LingFrame supports 15+ languages across 8+ writing systems, with script-aware tokenization pipelines that adapt to the specific segmentation requirements of each language family.

Script	Languages	Tokenization Strategy
Latin	English, German, French, Spanish, Italian, Portuguese, Latin	Whitespace + spaCy NLP models
Cyrillic	Russian, Ukrainian, Bulgarian	Whitespace + spaCy
Greek	Modern Greek, Ancient Greek	Whitespace + spaCy
Arabic	Arabic, Persian	CAMeL Tools / whitespace fallback
Hebrew	Hebrew	Whitespace + script-aware segmentation
Devanagari	Hindi, Sanskrit	Indic NLP Library
CJK	Chinese (Simplified/Traditional)	jieba (NLP) or character-level
Japanese	Japanese	fugashi/MeCab or character-level
Korean	Korean	KoNLPy or whitespace
Thai	Thai	PyThaiNLP

CJK Strategy Options

For Chinese, Japanese, and Korean texts, LingFrame offers three tokenization strategies:

nlp: Language-specific word segmentation (jieba for Chinese, fugashi/MeCab for Japanese, KoNLPy for Korean). Best for semantic analysis where word boundaries matter.
character: Character-by-character tokenization. Best for deep structural analysis of character-level patterns, calligraphic analysis, and texts where word segmentation is contested.
hybrid: NLP segmentation with character-level fallback for unsegmentable passages. Recommended default for mixed-script or uncertain texts.

Custom atomization schemas for Arabic (framework/schemas/arabic.yaml), Chinese (framework/schemas/chinese.yaml), and Japanese (framework/schemas/japanese.yaml) configure script-specific handling of diacritics, punctuation, and sentence boundaries.

Extension Points

LingFrame is designed for extensibility through three documented extension patterns.

Add an Analysis Module

# framework/analysis/my_analysis.py
from framework.analysis.base import BaseAnalysisModule

class MyAnalysis(BaseAnalysisModule):
    name = "my_analysis"

    def analyze(self, corpus, domain, config):
        # Traverse the atom tree, compute metrics at each level
        results = self._compute(corpus)
        return self.make_output(
            data=results,
            summary={"key_finding": value}
        )

Add a Visualization Adapter

# framework/visualization/adapters/my_viz.py
from framework.visualization.base import BaseVisualizationAdapter

class MyVizAdapter(BaseVisualizationAdapter):
    name = "my_viz"
    analysis_types = ["my_analysis"]

    def generate(self, analysis_output, config):
        return self._render_template("my_viz.html", data=analysis_output)

Add a Domain Profile

# framework/domains/legal/lexicon.yaml
formal_terms:
  whereas: 0.3
  hereby: 0.2
  stipulate: 0.1

# framework/domains/legal/patterns.yaml
entities:
  case_citation:
    pattern: '\d+\s+[A-Z][a-z]+\.?\s+\d+'
    label: LEGAL_CITATION

Sample Projects

literary-analysis/tomb-unknowns

Military memorial analysis demonstrating domain-specific analysis capabilities:

Domain lexicon: Military terminology, formal tone detection, ceremonial language patterns
Entity recognition: Ranks, units, locations, equipment — using custom regex patterns defined in framework/domains/military/
Full pipeline output: Atomized JSON, 4 analysis outputs (semantic, temporal, sentiment, entity), 5 interactive visualizations, narrative report
Use case: Understanding how official commemorative rhetoric constructs authority, invokes emotion, and manages the tension between individual sacrifice and institutional purpose

literary-analysis/MET4MORFOSES

Literary metamorphosis study demonstrating creative-analytical applications:

Narrative structure analysis: Tracking transformation arcs across three cycles of a creative manuscript
Character transformation tracking: Entity analysis applied to fictional metamorphosis
Thematic mapping: Semantic network visualization of how transformation themes cluster and evolve
Temporal flow: Sankey diagram showing narrative movement through time

Technology Stack

Layer	Technology	Purpose
Core	Python 3.11+, spaCy, NLTK, VADER	Text processing, NLP, sentiment baseline
Multilingual NLP	jieba, fugashi, CAMeL Tools, Indic NLP, langdetect	Script-aware tokenization and language detection
Analysis	scikit-learn (TF-IDF), transformers (XLM-RoBERTa), sentence-transformers	Semantic similarity, multilingual sentiment, embedding comparison
Visualization	D3.js (force graphs), Plotly.js (Sankey), Chart.js (charts)	Interactive browser-based visualizations
Web Interface	Streamlit	Corpus Observatory, Rhetoric Gym, upload-and-analyze
PDF Processing	pdfplumber, PyMuPDF	Document ingestion from PDF format
Unicode	unidecode, pysbd	Transliteration and sentence boundary detection
LLM Integration	Anthropic, OpenAI, local models (optional)	Enhanced analysis and revision suggestions
Export	LaTeX, TEI-XML, CoNLL	Scholarly output formats

Documentation

Document	Description
CLAUDE.md	Developer guide — commands, architecture patterns, data schemas
docs/theory.md	Theoretical foundation — linguistic frameworks, epistemological grounding
docs/methodology.md	Analysis methodology — marker taxonomy, scoring, weighting
docs/limitations.md	Scope and limitations — honest assessment of heuristic boundaries
docs/architecture.md	Technical architecture — component relationships, data flow
docs/tutorials/	4 tutorials: first analysis, comparative analysis, custom domains, building modules
CONTRIBUTING.md	Contribution guidelines — code style, PR process, module development

Roadmap

Completed

Planned

Async pipeline for parallel module execution
Plugin system for external module discovery
Embedding-based semantic comparison (beyond TF-IDF)
Interactive annotation interface
Real-time collaborative analysis

Cross-Organ Context

LingFrame is a flagship repository within ORGAN-I: Theoria — the epistemological and theoretical foundation of the organvm creative-institutional system. It exemplifies Theoria's mandate: build instruments of rigorous inquiry that make visible the hidden structures of knowledge.

Organ	Domain	Organization
I — Theoria	Theory, epistemology, recursion	`organvm-i-theoria`
II — Poiesis	Art, generative & experiential	`organvm-ii-poiesis`
III — Ergon	Commerce, SaaS & products	`organvm-iii-ergon`
IV — Taxis	Orchestration & governance	`organvm-iv-taxis`
V — Logos	Public process & essays	`organvm-v-logos`
VI — Koinonia	Community & salons	`organvm-vi-koinonia`
VII — Kerygma	Marketing & distribution	`organvm-vii-kerygma`
VIII — Meta	Umbrella organization	`meta-organvm`

Cross-Organ Relationships

ORGAN-II (Poiesis): LingFrame's analysis modules can be applied to creative texts produced within Poiesis. The MET4MORFOSES sample project demonstrates this intersection — literary analysis of creative metamorphosis narratives.
ORGAN-III (Ergon): The framework's analysis capabilities could be productized through Ergon as a SaaS platform, a writing-feedback API, or an educational tool. The three-pathway architecture (Scholar, Writer, Developer) anticipates these commercial applications.
ORGAN-V (Logos): LingFrame's methodology and findings serve as source material for public-process essays exploring computational approaches to rhetoric, the epistemology of text analysis, and the relationship between measurement and meaning.
ORGAN-IV (Taxis): The project's project.yaml configuration pattern and domain profile system demonstrate the kind of structured governance that Taxis orchestrates across the system.

LingFrame sits at the intersection of computational linguistics and classical rhetoric — a theoretical instrument that demonstrates ORGAN-I's commitment to rigorous, systematic inquiry. Its hierarchical atomization model and heuristic evaluation framework embody the epistemic recursion that defines the Theoria organ: the tool that examines language is itself subject to the same standards of clarity, honesty, and structural integrity that it applies to the texts it analyzes.

Citation

If you use LingFrame in academic work:

@software{lingframe2025,
  title     = {LingFrame: A Computational Rhetoric Platform for Linguistic Atomization},
  author    = {4444j99},
  year      = {2025},
  url       = {https://github.com/organvm-i-theoria/linguistic-atomization-framework},
  note      = {Hierarchical text atomization with 6 analysis modules across 15+ languages}
}

Contributing

Contributions are welcome. See CONTRIBUTING.md for guidelines covering code style, PR process, and the module development workflow.

Key areas for contribution:

New analysis modules — extend the BaseAnalysisModule interface
New visualization adapters — extend the BaseVisualizationAdapter interface
Domain profiles — YAML lexicons and pattern files for new domains (legal, scientific, journalistic, etc.)
Corpus additions — public-domain texts with original-language versions and multiple translations preferred
Language support — tokenization pipelines for additional languages and writing systems

License

MIT License. See LICENSE for details.

Author

@4444j99

LingFrame — Computational rhetoric for linguistic scholarship

_{Case Study · Portfolio · System Directory · ORGAN I · Theoria · Part of the ORGANVM eight-organ system}

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
.github		.github
.history		.history
.legacy		.legacy
.specstory		.specstory
absorb-alchemize		absorb-alchemize
app		app
cli		cli
corpus		corpus
data		data
desktop		desktop
docs		docs
ecosystem/pillar-dna		ecosystem/pillar-dna
framework		framework
projects		projects
scripts		scripts
templates		templates
tests		tests
visualizations		visualizations
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
.nojekyll		.nojekyll
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
GEMINI.md		GEMINI.md
LICENSE		LICENSE
README.md		README.md
ecosystem.yaml		ecosystem.yaml
lingframe.py		lingframe.py
requirements.txt		requirements.txt
run_web.py		run_web.py
seed.yaml		seed.yaml

Folders and files

Latest commit

History

Repository files navigation

LingFrame — Linguistic Atomization Framework

Table of Contents

Why This Exists

Philosophical Framework

Linguistic Atomization as Epistemological Method

The Hierarchy of Atoms

Why Hierarchical Decomposition Matters

Heuristic Honesty

What LingFrame Does

The Six Analysis Modules

The Global Canonical Corpus

Traditions Represented

Corpus Design Principles

Technical Architecture

Directory Structure

Core Concepts

Installation and Quick Start

Prerequisites

Installation

Quick Start — Analyze a Document

Quick Start — Web Interface

Quick Start — Project-Based Analysis

Three User Pathways

Scholar Pathway

Writer Pathway

Developer Pathway

Analysis Methodology

The 9-Step Heuristic Evaluation Framework

How Scoring Works

Linguistic Marker Categories

Epistemic Transparency

Multilingual Support

CJK Strategy Options

Extension Points

Add an Analysis Module

Add a Visualization Adapter

Add a Domain Profile

Sample Projects

literary-analysis/tomb-unknowns

literary-analysis/MET4MORFOSES

Technology Stack

Documentation

Roadmap

Completed

Planned

Cross-Organ Context

Cross-Organ Relationships

Citation

Contributing

License

Author

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages