A Python library for textual analysis for spatial and digital humanities research created as part of the Spatial Narratives Project. It supports spatio-textual annotation, analysis and visualisation for digital humanities projects, with initial applications to:
- Corpus of Lake District Writing (CLDW)
- Holocaust survivors' testimonies (e.g., USC Shoah Foundation archives)
This release includes sentence-safe chunking, list-of-texts input, file/segment IDs in output, robust saving of annotation (JSON/JSONL/TSV/CSV), efficient loading of annotation (Pandas friendly) and a multiprocessing CLI with tqdm progress report and a clean Python API.
Additional features:
- βοΈ Testimony segmentation (QβA-aware turns; roles: interviewer/witness)
- π Sentiment (rule backend; LLM/HF hooks to follow)
- πΆβπ«οΈ Emotion (Ekman-style labels via rule backend; LLM/HF hooks to follow)
- π§ Interpretation (summaries/themes; optional LLM hook)
- πΊοΈπ Visualisation (GeoJSON/Folium map; co-occurrence graphs)
Core
- spaCy model loader with optional EntityRuler patterns
- Place classification (COUNTRY, CITY, CONTINENT, CAMP, β¦)
- Sentenceβsafe chunking (no broken sentences)
- Flexible annotation APIs for single or multiple files and lists of texts
- Robust saving in JSON / JSONL / TSV / CSV + a pandasβfriendly loader
Testimony features
- βοΈ QβAβaware segmentation of testimonies (roles:
interviewer|witness|narration|unknown), withturnId,qaPairId,isQuestion,isAnswer
Affect analysis
- π Sentiment classification (rule backend; hooks for LLM/HF)
- πΆβπ«οΈ Emotion classification (Neutral, Joy, Surprise, Sadness, Fear, Anger, Disgust) via rule backend; hooks for LLM/HF
Interpretation
- π§ Recordβlevel summaries, affect explanations, simple theme tags (LLMβfriendly hook)
Visualisation
- πΊοΈ GeoJSON builder + optional Folium HTML map
- π Coβoccurrence edge list for graphs (entity coβmentions)
CLI & performance
- Multiprocessing with
--workers/--chunksize - Optional
--tqdmprogress bars
python -m venv venv
source venv/bin/activate # on Windows: venv\Scripts\activate
pip install -U pip wheel
pip install -r requirements.txt
# Download at least one spaCy model
python -m spacy download en_core_web_trf # transformer (higher quality)
# or
python -m spacy download en_core_web_sm # small (fast)Packaging tip: expose a console entry point (e.g.
spatio-textual=spatio_textual.cli:main) so users can call the CLI afterpip install ..
# Base
spacy>=3.6,<4.0
geonamescache>=2.0.0
# Progress bar (optional)
tqdm>=4.66.0
# Visualisation (optional)
folium>=0.15.0
# (Optional) Transformers/LLM backends for richer sentiment/emotion later
# transformers>=4.41.0
# torch>=2.0.0
# openai>=1.0.0
Place pattern lists under resources/ (or pass --resources-dir). Recognised files:
combined_geonouns.txt,non_verbals.txt,family_terms.txt,cleaned_holocaust_camps.txt
Details of the spatio_textual/resources/ files:
| File | Description |
|---|---|
combined_geonouns.txt |
Common geographic feature nouns (e.g., valley, road, lake) |
cleaned_holocaust_camps.txt |
Known Holocaust camp names (e.g., Auschwitz, Theresienstadt) |
ambiguous_cities.txt |
Locations with possible ambiguity (e.g., Paris, Lancaster) |
non_verbal.txt |
Non-verbal expressions by survivors ([PAUSES], [LAUGHS] etc) in the testimonies |
family_terms.txt |
Family-related entity names (e.g., mother, uncle) |
from spatio_textual.utils import load_spacy_model, Annotator, save_annotations, load_annotations
from spatio_textual.sentiment import SentimentAnalyzer
from spatio_textual.emotion import EmotionAnalyzer
from spatio_textual.analysis import analyze_records
nlp = load_spacy_model("en_core_web_sm")
ann = Annotator(nlp)
text = "Anne Frank was taken from Amsterdam to Auschwitz."
rec = ann.annotate(text, include_verbs=True) # entities + verbs
rec["fileId"], rec["segId"], rec["segCount"] = "example", 1, 1
rec["text"] = text
# Affect (rule backends work offline)
sent = SentimentAnalyzer("rule").predict([text])[0]
emo = EmotionAnalyzer("rule").predict([text])[0]
rec.update({
"sentiment_label": sent["label"], "sentiment_score": sent["score"],
"emotion_label": emo["label"], "emotion_score": emo["score"],
})
# Optional interpretation
[rec] = analyze_records([rec])
save_annotations([rec], "out/sample.jsonl")
df = load_annotations("out/sample.jsonl")
print(df[["fileId","segId","sentiment_label","emotion_label"]].head())from spatio_textual.annotate import annotate_files
# Chunk each file into `n_segments` sentence-safe segments and annotate
results = annotate_files(["data/"], chunk=True, n_segments=50, include_verbs=True)# Colab: install runtime deps
pip -q install spacy geonamescache tqdm folium
python -m spacy download en_core_web_trffrom spatio_textual.utils import load_spacy_model, Annotator, save_annotations, load_annotations
from spatio_textual.qa import segment_testimony
from spatio_textual.sentiment import SentimentAnalyzer
from spatio_textual.emotion import EmotionAnalyzer
from spatio_textual.analysis import analyze_records
nlp = load_spacy_model("en_core_web_sm")
ann = Annotator(nlp)
text = "Anne Frank was taken from Amsterdam to Auschwitz."
segments = [s.text for s in segment_testimony("Q: " + text, nlp=nlp)]
recs = ann.annotate_texts(segments, file_id="sample", include_text=True)
sent = SentimentAnalyzer("rule")
emo = EmotionAnalyzer("rule")
spans = [r["text"] for r in recs]
for r, s in zip(recs, sent.predict(spans)):
r["sentiment_label"], r["sentiment_score"] = s["label"], s["score"]
for r, e in zip(recs, emo.predict(spans)):
r["emotion_label"], r["emotion_score"] = e["label"], e["score"]
recs = analyze_records(recs)
save_annotations(recs, "sample.jsonl")
df = load_annotations("sample.jsonl")
df.head()# Show environment info
python -m spatio_textual.cli --info --spacy-model en_core_web_sm --resources-dir resources/
# Single file β chunked (~100 segs) β pretty JSON
python -m spatio_textual.cli -i data/sample.txt --pretty
# Whole-file mode (no chunking)
python -m spatio_textual.cli -i data/sample.txt --no-chunk -o out/sample.json --output-format json
# Directory (recursive) β JSONL stream, multiprocessing + tqdm, with verbs
python -m spatio_textual.cli \
-i corpus/ --glob "*.txt" --workers 6 --chunksize 16 --tqdm --verbs \
-o out/corpus.jsonl --output-format jsonl
# Testimony segmentation + affect + interpretation
python -m spatio_textual.cli \
-i data/testimonies/ --glob "*.txt" --tqdm \
--testimony --sentiment rule --emotion rule --interpret \
-o out/testimonies.jsonl --output-format jsonl
# List-of-texts mode (JSON array) β TSV
python -m spatio_textual.cli --segments-json segments.json --tqdm -o out/segments.tsv --output-format tsv-
load_spacy_model(model_name='en_core_web_trf', resources_dir=None, add_entity_ruler=True)βnlp -
split_into_segments(text, n_segments=100, nlp=None)β[str] -
save_annotations(records, path, fmt=None)β writes json/jsonl/tsv/csv -
load_annotations(path, fmt=None, ensure_columns=True)βpandas.DataFrame -
class Annotator(nlp):.annotate(text, include_entities=True, include_verbs=False).annotate_texts(texts, file_id=None, include_text=False, ...)β[dict].annotate_file_chunked(path, n_segments=100, ...)β[dict].annotate_inputs(inputs, glob_pattern, recursive, ...)β[dict]
annotate_text,annotate_texts,chunk_and_annotate_text,chunk_and_annotate_file,annotate_files
-
segment_testimony(text, nlp=None, speaker_patterns=None, join_continuations=True, sentence_safe=True)β[Segment]Segment:text, role, turn_id, is_question, is_answer, qa_pair_id
-
segment_testimony_file(path, **kwargs)
SentimentAnalyzer(backend='rule'|'llm', model_name=None, llm_fn=None).predict(list[str])β[{'label','score'}]EmotionAnalyzer(backend='rule'|'llm', model_name=None, llm_fn=None).predict(list[str])β[{'label','score', 'distribution'?}]
analyze_records(records, llm_fn=None, summarize=True, explain=True, tag_themes=True)β records +summary,interpretation,themes
to_geojson(records, geocoder=None)β GeoJSON FeatureCollectionmake_map_geojson(geojson, out_html='map.html')β saves Folium HTML (if installed)build_cooccurrence(records, nodes=("PERSON","GPE"), window=1)β(u,v,w)edge list
Common keys (always present or filled with empty defaults):
file, fileId, segId, segCount, entities, verb_data, text?, error?
# Testimony extras
role, turnId, qaPairId, isQuestion, isAnswer
# Affect
sentiment_label, sentiment_score, emotion_label, emotion_score, emotion_dist?
# Interpretation
summary, interpretation, themes
entities and verb_data are arrays (flattened to JSON strings in TSV/CSV).
- Prefer
--workers Nclose to your CPU core count; tune--chunksize(8β32) for large corpora - Use
--tqdmfor a progress bar - For massive analyses, save as JSONL and load with
load_annotations()
- spaCy model not found β run
python -m spacy download en_core_web_trf(or_sm) - Slow transformer model β try
en_core_web_smduring development - Empty map β you must supply a geocoder to
to_geojson - Pandas schema mismatches β
load_annotations(..., ensure_columns=True)fills missing columns
GNU GENERAL PUBLIC LICENSE. See LICENSE.
Feel free to fork, improve and contribute! Future improvements:
- Better disambiguation of place names
- Map integration and visualisation tools
- CLDW-specific training data integration
The project is funded in the UK from 2022 to 2025 by ESRC, project reference: ES/W003473/1. We also acknowledge the input and advice from the other members of the project team in generating requirements for our research presented here and the UCREL Hex team for providing the compute needs for this project. More details of the project can be found on the website: Spatial Narratives Project