A pipeline for collecting, classifying, and embedding news articles to analyze topic framing and ideological positioning across media sources.
This repository handles scraping, embedding, batching, and coordination for the Hugging Face dataset zanimal/anti-echo-artifacts. The goal is to construct a dataset and retrieval engine that can surface contrasting arguments on the same topic by comparing both topic similarity and stance divergence.
Run this notebook to scrape news feeds, classify articles, generate embeddings, and publish new batches to Hugging Face.
Run this notebook to rebuild the Chroma index from Hugging Face, upload an article, and discover contrasting perspectives from different outlets.
Media ecosystems often exhibit echo chamber effects—readers consume news primarily from sources aligned with their existing viewpoints, limiting exposure to opposing arguments. This project addresses that by building tools to:
- Identify articles discussing the same topic across different outlets
- Quantify differences in framing, tone, and ideological positioning
- Surface high-quality alternative perspectives on shared issues
The pipeline creates dual embedding spaces:
- Topic Space — What is each article about? (semantic meaning)
- Stance Space — How does it argue? (tone, ideology, framing)
Combined with source bias metadata, the system can identify article pairs where:
- Topic overlap is high (they address the same issue)
- Stance divergence is high (they argue from different positions)
- Neither is simply repeating outlet bias (genuine alternative view)
The pipeline operates in modular stages defined in config.yaml, processing free, factual sources (Reuters, The Guardian UK, etc.) with full transparency and auditability.
Purpose: Download and structure full articles and metadata from RSS feeds or Selenium scrapes.
Tools:
- `feedparser`, `requests`, `trafilatura`: RSS and HTML parsing
- `BeautifulSoup4`: fallback DOM extraction
- Configurable source list via `source_bias.json`
Output: source, url, title, date, author, section, content
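As a rough illustration of the output shape, the record below mirrors the listed fields. The class name `ArticleRecord`, the `is_usable` helper, and the example URL are hypothetical and not taken from the repository; treating `author` and `section` as optional is an assumption based on the DataFrame column table later in this README.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ArticleRecord:
    """One scraped article; field names follow the Output list above."""
    source: str
    url: str
    title: str
    date: str                      # ISO date string, e.g. "2025-01-01"
    content: str
    author: Optional[str] = None   # assumed optional
    section: Optional[str] = None  # assumed optional


def is_usable(record: ArticleRecord) -> bool:
    """Hypothetical check: every required field is a non-blank string."""
    required = (record.source, record.url, record.title, record.date, record.content)
    return all(isinstance(v, str) and bool(v.strip()) for v in required)


# Placeholder values for illustration only.
example = ArticleRecord(
    source="guardian",
    url="https://www.theguardian.com/world/example",
    title="Example headline",
    date="2025-01-01",
    content="Body text of the article.",
)
```

`asdict(example)` yields a plain dictionary that can be written as one line of a JSONL file.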
Purpose: Encode semantic content for clustering and retrieval.
| Parameter | Value |
|---|---|
| Model | intfloat/e5-base-v2 |
| Dimensionality | 768 |
| Chunk Size | 512 tokens |
| Normalization | yes |
| Collection | news_topic |
Method:
- Articles are chunked into ≤512 token windows
- Each chunk is embedded and averaged to a single topic vector
- Related topics are mapped using the taxonomy in `topics.json`
- Topic overlap detected via cosine similarity ≥ 0.4
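The chunk-average-compare steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the repository's implementation: `toy_embed` is a placeholder bag-of-words function standing in for the real embedding model, tokens are pre-split strings rather than model tokens, and all function names are hypothetical. Only the 512-token window and the 0.4 threshold come from the text above.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def chunk_tokens(tokens: Sequence[str], size: int = 512) -> List[List[str]]:
    """Split a token sequence into consecutive windows of at most `size` tokens."""
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def l2_normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def article_vector(tokens: Sequence[str],
                   embed: Callable[[List[str]], Vector],
                   size: int = 512) -> Vector:
    """Embed each chunk, average the chunk vectors, then normalize the mean."""
    chunks = chunk_tokens(tokens, size)
    if not chunks:
        raise ValueError("article has no tokens")
    vectors = [embed(chunk) for chunk in chunks]
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return l2_normalize(mean)


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity, computed as the dot product of unit-length copies."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def topics_overlap(a: Vector, b: Vector, threshold: float = 0.4) -> bool:
    """Apply the similarity threshold quoted in the Method list."""
    return cosine_similarity(a, b) >= threshold


def toy_embed(chunk: List[str]) -> Vector:
    """Placeholder embedding (word counts over a tiny vocabulary), NOT the real model."""
    vocab = ("tax", "market", "climate", "election")
    return [float(chunk.count(word)) for word in vocab]
```

In a real run, `embed` would wrap the configured model; the averaging and thresholding logic would stay the same.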
Purpose: Classify political leaning, implied policy stance, and rhetorical framing.
| Parameter | Value |
|---|---|
| Provider | OpenAI |
| Model | gpt-4o-mini |
| Temperature | 0.4 |
| Max Tokens | 256 |
Reference Ontologies:
- `political_leanings.json`: ideological families (progressive left, center, conservative right, etc.)
- `implied_stances.json`: policy positions (pro/anti regulation, austerity, etc.)
- `source_bias.json`: outlet bias metadata and expected positioning
Prompt Logic:
- Summarizes the article's argument in one sentence
- Assigns political leaning and implied stance
- Compares to outlet's expected bias family
Tone-Bias Alignment:
- In-bias — stance aligns with outlet's typical positioning
- Counter-bias — stance diverges from outlet's usual framing (editorial independence)
- Neutral/Mixed — no consistent ideological pattern
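One way to picture the three alignment categories is a small comparison function. The rule below (exact match on normalized labels) is an assumption made for illustration; the README does not specify how the comparison is actually performed, and the function name and the label spellings `counter-bias` and `neutral/mixed` are hypothetical.

```python
from typing import Optional

IN_BIAS = "in-bias"
COUNTER_BIAS = "counter-bias"   # assumed spelling
NEUTRAL = "neutral/mixed"       # assumed spelling


def _normalize(label: Optional[str]) -> str:
    """Lowercase, treat hyphens as spaces, and collapse whitespace."""
    if not label:
        return ""
    return " ".join(label.lower().replace("-", " ").split())


def tone_alignment(article_leaning: Optional[str],
                   outlet_family: Optional[str]) -> str:
    """Compare an article's classified leaning with its outlet's expected family.

    Assumed rule: a missing or explicitly mixed label is neutral, an exact
    match is in-bias, and anything else is counter-bias.
    """
    article = _normalize(article_leaning)
    outlet = _normalize(outlet_family)
    if not article or not outlet or article in {"mixed", "neutral"}:
        return NEUTRAL
    return IN_BIAS if article == outlet else COUNTER_BIAS
```

A production rule would likely need a notion of distance between ideological families rather than exact equality.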
Example Output:

    {
      "political_leaning": "center left",
      "implied_stance": "pro regulation",
      "summary": "Argues that public oversight is necessary to keep markets fair.",
      "tone_alignment": "in-bias"
    }

Purpose: Create dense embeddings representing worldview, tone, and rhetorical stance.
| Parameter | Value |
|---|---|
| Model | all-mpnet-base-v2 |
| Inputs | [political_leaning] + [implied_stance] + [summary] |
| Max Length | 4096 characters |
| Normalization | yes |
| Collection | news_stance |
The hybrid text captures ideological direction and emotional framing in a single consistent vector space.
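The hybrid input can be assembled with a short helper. This is a sketch under assumptions: the ` | ` separator and the function name are hypothetical, and only the field order and the 4096-character cap come from the table above.

```python
from typing import Optional

MAX_STANCE_CHARS = 4096  # "Max Length" from the parameter table


def build_stance_text(political_leaning: Optional[str],
                      implied_stance: Optional[str],
                      summary: Optional[str],
                      max_chars: int = MAX_STANCE_CHARS) -> str:
    """Join leaning, stance, and summary in that order, then truncate.

    Empty or missing parts are skipped so the separator never dangles.
    """
    parts = [p.strip() for p in (political_leaning, implied_stance, summary)
             if p and p.strip()]
    return " | ".join(parts)[:max_chars]


stance_text = build_stance_text(
    "center left",
    "pro regulation",
    "Argues that public oversight is necessary to keep markets fair.",
)
```

The resulting string is what would be passed to the stance embedding model.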
Local Storage: ChromaDB
- Path: `chroma_db/`
- Metric: cosine distance
- Collections: `news_topic`, `news_stance`
- Auto-rebuild if missing
Batch Artifacts:
- `embeddings_topic.npz`: topic vectors
- `embeddings_stance.npz`: stance vectors
- `metadata.jsonl`: article metadata
- `manifest.json`: batch manifest
Public Export: Hugging Face Dataset: zanimal/anti-echo-artifacts
Edit the scraper config to add new feeds:

    FEEDS = [
        "https://www.reuters.com/rssFeed/worldNews",
        "https://www.theguardian.com/world/rss",
        "https://www.theguardian.com/politics/rss",
    ]

Each article automatically flows through cleaning → classification → embedding → storage.
For pages requiring dynamic rendering, create a DataFrame:
| Column | Type | Description |
|---|---|---|
| `source` | str | outlet identifier |
| `url` | str | canonical link |
| `title` | str | headline |
| `date` | str | ISO date |
| `content` | str | article body |
| `author` | str | (optional) author |
| `section` | str | (optional) category |
Then process:

    from pipeline import process_dataframe

    process_dataframe(df)

Triggers the full classification, embedding, and tone-bias pipeline.
The analysis tool identifies ideological contrast across and within outlets.
Algorithm:
1. Topic Matching: Retrieve top-N articles by cosine similarity in `news_topic` (same subject)
2. Stance Divergence: Compute cosine distance in `news_stance` (opposite framing)
3. Bias Contrast Evaluation: Combine stance and tone-alignment metadata to highlight pairs where:
   - Both discuss the same topic
   - One is in-bias and the other counter-bias relative to their sources
   - Stance vectors indicate genuine disagreement (not just noise)
Example:
- Input: Guardian article (center left, pro regulation, in-bias)
- Retrieved: Wall Street Journal (center right, pro market, in-bias)
- Both tagged "Economy / Regulation"
- → Flagged as contrast pair with high anti-echo score
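The first two steps of the algorithm can be sketched without a vector database. Everything here is illustrative: the `Candidate` class, the `min_stance_dist` cutoff of 0.3, and the score (topic similarity multiplied by stance distance) are assumptions, since the README does not define how the anti-echo score is computed. Only the 0.4 topic threshold is quoted from the topic embedding section. The tone-alignment label is carried through unchanged so the caller can apply the step 3 rule.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Candidate:
    """Hypothetical in-memory stand-in for one stored article."""
    source: str
    topic_vec: List[float]
    stance_vec: List[float]
    tone_alignment: str


def contrast_candidates(query: Candidate,
                        pool: Sequence[Candidate],
                        top_n: int = 10,
                        min_topic_sim: float = 0.4,
                        min_stance_dist: float = 0.3) -> List[Tuple[float, Candidate]]:
    """Rank pool articles that share the query's topic but diverge in stance."""
    # Step 1: keep the top-N articles by topic similarity above the threshold.
    scored = [(cosine_similarity(query.topic_vec, c.topic_vec), c) for c in pool]
    matches = sorted((s for s in scored if s[0] >= min_topic_sim),
                     key=lambda s: s[0], reverse=True)[:top_n]

    # Step 2: cosine distance in stance space; drop near-identical stances.
    results = []
    for topic_sim, cand in matches:
        stance_dist = 1.0 - cosine_similarity(query.stance_vec, cand.stance_vec)
        if stance_dist >= min_stance_dist:
            results.append((topic_sim * stance_dist, cand))  # assumed score

    results.sort(key=lambda r: r[0], reverse=True)
    return results
```

With a real index, step 1 would be a nearest-neighbor query against `news_topic` and step 2 a lookup of the matching stance vectors in `news_stance`.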
Libraries: chromadb, numpy, optional faiss for scale.
- Logging Level: INFO
- Failure Records: saved to `logs/`
- Checkpointing: every 100 records
- Validation:
  - Non-zero embedding vectors
  - JSON schema compliance
  - Tone alignment computed successfully
  - Manual sample preview (≤500 chars)
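The validation checks listed above might look like the following. This is a sketch only: the required-key check is a simple stand-in for real JSON Schema validation, the key list is inferred from the example classifier output earlier in this README, and the function names and the `counter-bias` and `neutral/mixed` label spellings are hypothetical.

```python
import math
from typing import Any, Dict, List, Sequence

# Keys inferred from the example classifier output; treat as an assumption.
REQUIRED_KEYS = ("political_leaning", "implied_stance", "summary", "tone_alignment")
ALLOWED_ALIGNMENTS = {"in-bias", "counter-bias", "neutral/mixed"}


def validate_record(classification: Dict[str, Any],
                    vector: Sequence[float]) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []

    # Non-zero embedding vector (also rejects empty vectors and NaN values).
    values = list(vector)
    if not values or any(math.isnan(x) for x in values) or all(x == 0 for x in values):
        problems.append("embedding vector is empty, NaN, or all zeros")

    # Stand-in for schema compliance: required keys present and non-empty.
    missing = [k for k in REQUIRED_KEYS if not classification.get(k)]
    if missing:
        problems.append("missing fields: " + ", ".join(missing))

    # Tone alignment must be one of the known labels.
    alignment = classification.get("tone_alignment")
    if alignment and alignment not in ALLOWED_ALIGNMENTS:
        problems.append(f"unknown tone_alignment: {alignment!r}")

    return problems


def preview(text: str, limit: int = 500) -> str:
    """Truncate article text for a manual spot check."""
    return text[:limit]
```

Records with a non-empty problem list would be written to the failure log rather than embedded.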
Author: Zan Merrill (MSBA, UT Austin)
- Data Sourcing: Only scrape open, legally accessible text (Reuters, Guardian, AP, BBC, etc.)
- Copyright: Do not upload paywalled or copyrighted content
- Quality: Verify each record includes valid text, stance classification, and tone alignment before embedding
- New Sources: When adding sources, include bias metadata in `source_bias.json` so tone-alignment logic remains accurate
- Transparency: All embeddings are reproducible and auditable via config versions
This project is open source. Please respect the terms of use for all sourced news data.
- GitHub: AHMerrill/anti-echo-chamber
- Dataset: zanimal/anti-echo-artifacts
- Analysis Tools: See included Jupyter notebooks for interactive examples