Anti Echo Chamber

A pipeline for collecting, classifying, and embedding news articles to analyze topic framing and ideological positioning across media sources.

This repository handles scraping, embedding, batching, and coordination for the Hugging Face dataset zanimal/anti-echo-artifacts. The goal is to construct a dataset and retrieval engine that can surface contrasting arguments on the same topic by comparing both topic similarity and stance divergence.

Quick Start

Notebooks (Google Colab)

1. Scraper and Batch Builder

Run this notebook to scrape news feeds, classify articles, generate embeddings, and publish new batches to Hugging Face.

2. Analysis and Stance Comparison

Run this notebook to rebuild the Chroma index from Hugging Face, upload an article, and discover contrasting perspectives from different outlets.

Project Overview

The Problem

Media ecosystems often exhibit echo chamber effects: readers consume news primarily from sources aligned with their existing viewpoints, which limits their exposure to opposing arguments. This project addresses that by building tools to:

Identify articles discussing the same topic across different outlets
Quantify differences in framing, tone, and ideological positioning
Surface high-quality alternative perspectives on shared issues

The Solution

The pipeline creates dual embedding spaces:

Topic Space — What is each article about? (semantic meaning)
Stance Space — How does it argue? (tone, ideology, framing)

Combined with source bias metadata, the system can identify article pairs where:

Topic overlap is high (they address the same issue)
Stance divergence is high (they argue from different positions)
Neither article simply repeats its outlet's usual bias (a genuine alternative view)

Architecture

The pipeline runs as modular stages defined in config.yaml. It processes free, factual sources (Reuters, The Guardian UK, etc.), and each stage is designed to be transparent and auditable.

Stage 1: Collection

Purpose: Download and structure full articles and metadata from RSS feeds or Selenium scrapes.

Tools:

feedparser, requests, trafilatura — RSS and HTML parsing
BeautifulSoup4 — fallback DOM extraction
Configurable source list via source_bias.json

Output: source, url, title, date, author, section, content
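As a rough illustration of the output shape, the sketch below normalizes one feed entry plus its extracted body text into a record with the fields listed above. The `ArticleRecord` class, the `to_record` helper, and the entry keys `link`, `title`, and `published` are assumptions made for this example, not the repository's actual code.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ArticleRecord:
    """One collected article, using the Stage 1 output field names."""
    source: str
    url: str
    title: str
    date: str
    content: str
    author: Optional[str] = None
    section: Optional[str] = None


def to_record(source: str, entry: dict, body: str) -> Optional[ArticleRecord]:
    """Build a record from a feed entry and extracted body text.

    `entry` is assumed to be a plain dict with hypothetical keys
    "link", "title", and "published"; the real scraper may differ.
    Returns None when the URL or the body text is missing.
    """
    url = (entry.get("link") or "").strip()
    text = " ".join(body.split())  # collapse runs of whitespace
    if not url or not text:
        return None
    return ArticleRecord(
        source=source,
        url=url,
        title=(entry.get("title") or "").strip(),
        date=(entry.get("published") or "").strip(),
        content=text,
        author=entry.get("author"),
        section=entry.get("section"),
    )


record = to_record(
    "guardian",
    {"link": "https://example.com/a", "title": " Headline ", "published": "2024-01-01"},
    "First paragraph.\n\nSecond   paragraph.",
)
print(asdict(record))
```

Returning None for unusable entries lets the caller skip them and log the failure instead of passing empty text to later stages.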

Stage 2: Topic Embedding

Purpose: Encode semantic content for clustering and retrieval.

Parameter	Value
Model	`intfloat/e5-base-v2`
Dimensionality	768
Chunk Size	512 tokens
Normalization	yes
Collection	`news_topic`

Method:

Articles are chunked into windows of ≤512 tokens
Each chunk is embedded, and the chunk vectors are averaged into a single topic vector per article
Related topics are mapped using the taxonomy in topics.json
Topic overlap is detected when cosine similarity is ≥ 0.4
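The method above can be sketched in plain Python. This is a minimal illustration, not the pipeline's implementation: whitespace tokens stand in for the model tokenizer, and `toy_embed` is a hypothetical three-dimensional embedder used only so the example runs without downloading a model.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def chunk_tokens(text: str, max_tokens: int = 512) -> List[str]:
    """Split text into windows of at most max_tokens whitespace tokens.

    Whitespace splitting is a stand-in for the real model tokenizer.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]


def l2_normalize(vec: Sequence[float]) -> Vector:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def topic_vector(text: str, embed: Callable[[str], Sequence[float]],
                 max_tokens: int = 512) -> Vector:
    """Embed each chunk, average the chunk vectors, then L2-normalize."""
    chunks = chunk_tokens(text, max_tokens)
    if not chunks:
        raise ValueError("cannot embed empty text")
    vectors = [embed(c) for c in chunks]
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return l2_normalize(mean)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def topics_overlap(a: Sequence[float], b: Sequence[float],
                   threshold: float = 0.4) -> bool:
    """Apply the cosine similarity >= 0.4 rule from the method list."""
    return cosine_similarity(a, b) >= threshold


def toy_embed(chunk: str) -> Vector:
    """Hypothetical 3-d embedder: counts of three keywords (demo only)."""
    words = chunk.lower().split()
    return [float(words.count(k)) for k in ("tax", "climate", "trade")]


a = topic_vector("tax policy and trade rules tax", toy_embed)
b = topic_vector("new trade deal and tax cuts", toy_embed)
c = topic_vector("climate summit opens", toy_embed)
print(topics_overlap(a, b), topics_overlap(a, c))
```

In practice `embed` would wrap the real model. The `intfloat/e5-base-v2` model card asks for a `passage: ` or `query: ` prefix on inputs; whether this pipeline adds one is not stated here.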

Stage 3: Stance and Ideological Classification (LLM)

Purpose: Classify political leaning, implied policy stance, and rhetorical framing.

Parameter	Value
Provider	OpenAI
Model	`gpt-4o-mini`
Temperature	0.4
Max Tokens	256

Reference Ontologies:

political_leanings.json — ideological families (progressive left, center, conservative right, etc.)
implied_stances.json — policy positions (pro/anti regulation, austerity, etc.)
source_bias.json — outlet bias metadata and expected positioning

Prompt Logic:

Summarizes the article's argument in one sentence
Assigns political leaning and implied stance
Compares the result to the outlet's expected bias family

Tone-Bias Alignment:

In-bias — stance aligns with outlet's typical positioning
Counter-bias — stance diverges from outlet's usual framing (editorial independence)
Neutral/Mixed — no consistent ideological pattern
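One way to read the three categories is as a comparison between the article's leaning and the outlet's expected leaning. The sketch below is only an assumption about how that comparison might work: the `bias_family` mapping (final word of the label) and the rule that a centrist article from a non-centrist outlet counts as neutral/mixed are both hypothetical, and the real logic may live in the LLM prompt instead.

```python
def bias_family(leaning: str) -> str:
    """Collapse a leaning label to 'left', 'right', or 'center'.

    Assumption: the family is the final word of the label, so
    'progressive left' and 'center left' both map to 'left'.
    """
    words = leaning.lower().split()
    last = words[-1] if words else ""
    return last if last in ("left", "right") else "center"


def tone_alignment(article_leaning: str, outlet_leaning: str) -> str:
    """Return 'in-bias', 'counter-bias', or 'neutral/mixed'."""
    article = bias_family(article_leaning)
    outlet = bias_family(outlet_leaning)
    if article == outlet:
        return "in-bias"
    if article == "center":
        # Assumed rule: a centrist piece from a partisan outlet is mixed.
        return "neutral/mixed"
    return "counter-bias"


print(tone_alignment("center left", "progressive left"))
print(tone_alignment("conservative right", "center left"))
print(tone_alignment("center", "center left"))
```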

Example Output:

{
  "political_leaning": "center left",
  "implied_stance": "pro regulation",
  "summary": "Argues that public oversight is necessary to keep markets fair.",
  "tone_alignment": "in-bias"
}
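Because the classifier reply is free text from a language model, it may help to check it before use. The following sketch parses a reply and verifies the four fields shown in the example output. The `parse_classification` helper and the allowed `tone_alignment` spellings are assumptions for illustration.

```python
import json

REQUIRED_KEYS = ("political_leaning", "implied_stance", "summary", "tone_alignment")
# Assumed spellings of the three tone-alignment labels.
ALLOWED_ALIGNMENTS = {"in-bias", "counter-bias", "neutral/mixed"}


def parse_classification(raw: str) -> dict:
    """Parse a model reply and check it has the four expected string fields.

    Raises ValueError on malformed JSON (json.JSONDecodeError is a
    ValueError subclass), missing fields, or an unknown alignment label.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in REQUIRED_KEYS:
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {key}")
    cleaned = {key: data[key].strip() for key in REQUIRED_KEYS}
    cleaned["tone_alignment"] = cleaned["tone_alignment"].lower()
    if cleaned["tone_alignment"] not in ALLOWED_ALIGNMENTS:
        raise ValueError(f"unknown tone_alignment: {cleaned['tone_alignment']}")
    return cleaned


reply = """{
  "political_leaning": "center left",
  "implied_stance": "pro regulation",
  "summary": "Argues that public oversight is necessary to keep markets fair.",
  "tone_alignment": "in-bias"
}"""
result = parse_classification(reply)
print(result["political_leaning"], "/", result["tone_alignment"])
```

A caller can catch `ValueError` once to handle both malformed JSON and schema problems, then record the failure and move on.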

Stage 4: Stance Embedding (Hybrid)

Purpose: Create dense embeddings representing worldview, tone, and rhetorical stance.

Parameter	Value
Model	`all-mpnet-base-v2`
Inputs	`[political_leaning] + [implied_stance] + [summary]`
Max Length	4096 characters
Normalization	yes
Collection	`news_stance`

The hybrid text captures ideological direction and emotional framing in a single consistent vector space.
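A minimal sketch of how the hybrid input text could be assembled from the three classifier fields and capped at 4096 characters. The `build_stance_text` name and the ` | ` separator are assumptions; the table only states that the three fields are concatenated.

```python
def build_stance_text(political_leaning: str, implied_stance: str, summary: str,
                      max_chars: int = 4096, sep: str = " | ") -> str:
    """Join the three fields in a fixed order and truncate to max_chars.

    Empty fields are skipped so the separator never appears twice in a row.
    """
    parts = [p.strip() for p in (political_leaning, implied_stance, summary)
             if p and p.strip()]
    return sep.join(parts)[:max_chars]


text = build_stance_text(
    "center left",
    "pro regulation",
    "Argues that public oversight is necessary to keep markets fair.",
)
print(text)
```

Keeping the field order fixed matters more than the separator itself, since the same layout has to be used for every article for the vectors to be comparable.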

Stage 5: Storage and Dataset Export

Local Storage: ChromaDB

Path: chroma_db/
Metric: cosine distance
Collections: news_topic, news_stance
Rebuilt automatically if missing

Batch Artifacts:

embeddings_topic.npz — topic vectors
embeddings_stance.npz — stance vectors
metadata.jsonl — article metadata
manifest.json — batch manifest
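To make the batch layout concrete, here is a sketch that writes `metadata.jsonl` and a `manifest.json` into a directory. The manifest fields (`batch_id`, `record_count`, `files` with SHA-256 digests) are hypothetical, since the actual manifest schema is not described here, and the two `.npz` files are left out to keep the example standard-library only.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_batch(records: list, out_dir: Path, batch_id: str) -> dict:
    """Write metadata.jsonl (one JSON object per line) and manifest.json."""
    out_dir.mkdir(parents=True, exist_ok=True)
    meta_path = out_dir / "metadata.jsonl"
    with meta_path.open("w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
    # Hypothetical manifest layout: a digest lets readers verify the download.
    manifest = {
        "batch_id": batch_id,
        "record_count": len(records),
        "files": {
            "metadata.jsonl": hashlib.sha256(meta_path.read_bytes()).hexdigest(),
        },
    }
    (out_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    return manifest


demo_records = [
    {"source": "guardian", "url": "https://example.com/a", "title": "A"},
    {"source": "reuters", "url": "https://example.com/b", "title": "B"},
]
with tempfile.TemporaryDirectory() as tmp:
    manifest = write_batch(demo_records, Path(tmp), "batch-demo")
    lines = (Path(tmp) / "metadata.jsonl").read_text(encoding="utf-8").splitlines()
print(manifest["record_count"], len(lines))
```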

Public Export: Hugging Face Dataset: zanimal/anti-echo-artifacts

Extending the Pipeline

Adding RSS Feeds

Edit the scraper config to add new feeds:

FEEDS = [
    "https://www.reuters.com/rssFeed/worldNews",
    "https://www.theguardian.com/world/rss",
    "https://www.theguardian.com/politics/rss"
]

Each article automatically flows through cleaning → classification → embedding → storage.

Integrating Selenium Scrapes

For pages that require dynamic rendering, scrape them with Selenium and collect the results in a DataFrame with these columns:

Column	Type	Description
`source`	str	outlet identifier
`url`	str	canonical link
`title`	str	headline
`date`	str	ISO date
`content`	str	article body
`author`	str	(optional) author
`section`	str	(optional) category

Then process:

from pipeline import process_dataframe

# df: DataFrame with the columns listed above
process_dataframe(df)

This call triggers the full classification, embedding, and tone-bias pipeline.
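Before handing scraped rows to `process_dataframe`, it can be useful to check that the required columns are present and that the date is ISO formatted. The sketch below works on plain row dicts and is an assumption about what a pre-check might look like; `validate_row` is not part of the repository.

```python
from datetime import date

# Required columns from the table above; author and section are optional.
REQUIRED = ("source", "url", "title", "date", "content")


def validate_row(row: dict) -> list:
    """Return a list of problems for one row; an empty list means it is usable."""
    problems = []
    for col in REQUIRED:
        value = row.get(col)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing {col}")
    raw_date = row.get("date")
    if isinstance(raw_date, str) and raw_date.strip():
        try:
            # Only the YYYY-MM-DD prefix is checked, so timestamps also pass.
            date.fromisoformat(raw_date.strip()[:10])
        except ValueError:
            problems.append("date is not ISO formatted")
    return problems


good = {
    "source": "guardian",
    "url": "https://example.com/a",
    "title": "Headline",
    "date": "2024-05-01",
    "content": "Body text.",
}
bad = {"source": "guardian", "url": "", "title": "Headline",
       "date": "01/05/2024", "content": "Body text."}
print(validate_row(good))
print(validate_row(bad))
```

With pandas, `df.to_dict("records")` yields row dicts in this shape, so the same check can run over a whole DataFrame.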

Cross-Bias Comparison Logic

The analysis tool identifies ideological contrast across and within outlets.

Algorithm:

Topic Matching — Retrieve top-N articles by cosine similarity in news_topic (same subject)
Stance Divergence — Compute cosine distance in news_stance (opposite framing)
Bias Contrast Evaluation — Combine stance and tone-alignment metadata to highlight pairs where:
- Both discuss the same topic
- One is in-bias and the other counter-bias relative to their sources
- Stance vectors indicate genuine disagreement (not just noise)

Example:

Input: Guardian article (center left, pro regulation, in-bias)
Retrieved: Wall Street Journal (center right, pro market, in-bias)
Both tagged "Economy / Regulation"
→ Flagged as contrast pair with high anti-echo score
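The scoring step can be illustrated with a small sketch. The formula used here, topic similarity multiplied by stance cosine distance, is an assumption chosen for illustration; the text does not define how the anti-echo score is computed. The record keys `topic`, `stance`, and `url` are likewise hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def anti_echo_score(topic_a, topic_b, stance_a, stance_b,
                    min_topic_sim: float = 0.4) -> float:
    """Assumed score: topic similarity times stance cosine distance.

    Pairs below the topic threshold score 0 because they are not about
    the same subject, however different their stances are.
    """
    topic_sim = cosine_similarity(topic_a, topic_b)
    if topic_sim < min_topic_sim:
        return 0.0
    stance_dist = 1.0 - cosine_similarity(stance_a, stance_b)
    return topic_sim * stance_dist


def rank_contrasts(query: dict, candidates: List[dict],
                   top_n: int = 5) -> List[Tuple[float, str]]:
    """Return up to top_n (score, url) pairs with a positive score, best first."""
    scored = [
        (anti_echo_score(query["topic"], c["topic"], query["stance"], c["stance"]),
         c["url"])
        for c in candidates
    ]
    return sorted((s for s in scored if s[0] > 0), reverse=True)[:top_n]


query = {"topic": [1.0, 0.0], "stance": [1.0, 0.0]}
candidates = [
    {"url": "same-topic-opposed", "topic": [1.0, 0.0], "stance": [0.0, 1.0]},
    {"url": "same-topic-same-stance", "topic": [1.0, 0.0], "stance": [1.0, 0.0]},
    {"url": "other-topic", "topic": [0.0, 1.0], "stance": [-1.0, 0.0]},
]
print(rank_contrasts(query, candidates))
```

The tone-alignment filter from step 3 is left out of this sketch; it would be applied to the ranked pairs using the stored metadata.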

Libraries: chromadb and numpy, with faiss as an optional addition for scale.

Logging and Diagnostics

Logging Level: INFO
Failure Records: saved to logs/
Checkpointing: every 100 records
Validation:
- Non-zero embedding vectors
- JSON schema compliance
- Tone alignment computed successfully
- Manual sample preview (≤500 chars)
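A few of these checks can be sketched with the standard library. The record keys (`topic_vector`, `stance_vector`, `tone_alignment`, `url`) and the logger name are hypothetical, and JSON schema validation is omitted from this example.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("anti_echo.validate")  # hypothetical logger name


def is_nonzero(vec) -> bool:
    """True when the vector has at least one non-zero component."""
    return any(x != 0 for x in vec)


def preview(text: str, limit: int = 500) -> str:
    """Return at most `limit` characters for manual inspection."""
    return text[:limit]


def validate_batch(records: list, checkpoint_every: int = 100) -> list:
    """Collect per-record problems and log progress every N records."""
    failures = []
    for i, rec in enumerate(records, start=1):
        problems = []
        if not is_nonzero(rec.get("topic_vector", [])):
            problems.append("zero topic vector")
        if not is_nonzero(rec.get("stance_vector", [])):
            problems.append("zero stance vector")
        if not rec.get("tone_alignment"):
            problems.append("tone alignment missing")
        if problems:
            failures.append({"url": rec.get("url"), "problems": problems})
        if i % checkpoint_every == 0:
            log.info("checkpoint: %d records validated, %d failures",
                     i, len(failures))
    return failures


ok = {"url": "u-ok", "topic_vector": [0.1, 0.2], "stance_vector": [0.3, 0.0],
      "tone_alignment": "in-bias"}
broken = {"url": "u-bad", "topic_vector": [0.0, 0.0], "stance_vector": [0.3, 0.0],
          "tone_alignment": ""}
failures = validate_batch([ok] * 199 + [broken])
print(failures)
```

The returned failure list could then be written under logs/ alongside the failure records mentioned above.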

Contributors

Author: Zan Merrill (MSBA, UT Austin)

Contributor Notes

Data Sourcing: Only scrape open, legally accessible text (Reuters, Guardian, AP, BBC, etc.)
Copyright: Do not upload paywalled or copyrighted content
Quality: Verify each record includes valid text, stance classification, and tone alignment before embedding
New Sources: When adding sources, include bias metadata in source_bias.json so tone-alignment logic remains accurate
Transparency: All embeddings are reproducible and auditable via config versions

License

This project is open source. Please respect the terms of use for all sourced news data.

Links

GitHub: AHMerrill/anti-echo-chamber
Dataset: zanimal/anti-echo-artifacts
Analysis Tools: See included Jupyter notebooks for interactive examples
