AHMerrill/anti-echo-chamber

Anti Echo Chamber

A pipeline for collecting, classifying, and embedding news articles to analyze topic framing and ideological positioning across media sources.

This repository handles scraping, embedding, batching, and coordination for the Hugging Face dataset zanimal/anti-echo-artifacts. The goal is to construct a dataset and retrieval engine that can surface contrasting arguments on the same topic by comparing both topic similarity and stance divergence.


Quick Start

Notebooks (Google Colab)

1. Scraper and Batch Builder

Run this notebook to scrape news feeds, classify articles, generate embeddings, and publish new batches to Hugging Face.

2. Analysis and Stance Comparison

Run this notebook to rebuild the Chroma index from Hugging Face, upload an article, and discover contrasting perspectives from different outlets.


Project Overview

The Problem

Media ecosystems often exhibit echo chamber effects—readers consume news primarily from sources aligned with their existing viewpoints, limiting exposure to opposing arguments. This project addresses that by building tools to:

  • Identify articles discussing the same topic across different outlets
  • Quantify differences in framing, tone, and ideological positioning
  • Surface high-quality alternative perspectives on shared issues

The Solution

The pipeline creates dual embedding spaces:

  1. Topic Space — What is each article about? (semantic meaning)
  2. Stance Space — How does it argue? (tone, ideology, framing)

Combined with source bias metadata, the system can identify article pairs where:

  • Topic overlap is high (they address the same issue)
  • Stance divergence is high (they argue from different positions)
  • Neither is simply repeating outlet bias (genuine alternative view)

Architecture

The pipeline operates in modular stages defined in config.yaml, processing free, factual sources (Reuters, The Guardian UK, etc.) with full transparency and auditability.

Stage 1: Collection

Purpose: Download and structure full articles and metadata from RSS feeds or Selenium scrapes.

Tools:

  • feedparser, requests, trafilatura — RSS and HTML parsing
  • BeautifulSoup4 — fallback DOM extraction
  • Configurable source list via source_bias.json

Output: source, url, title, date, author, section, content
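The output fields listed above could be modelled as a small record type. This is a minimal sketch, not the repository's actual code: the `Article` class and its `is_valid` method are hypothetical names, the URL is a placeholder, and treating `author` and `section` as optional follows the DataFrame column table in the Selenium section of this README.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Article:
    """One collected article; field names mirror the Stage 1 output list."""
    source: str
    url: str
    title: str
    date: str                      # ISO date string, e.g. "2024-01-31"
    content: str
    author: Optional[str] = None   # assumed optional
    section: Optional[str] = None  # assumed optional

    def is_valid(self) -> bool:
        """Hypothetical check: every required text field is a non-empty string."""
        required = (self.source, self.url, self.title, self.date, self.content)
        return all(isinstance(v, str) and bool(v.strip()) for v in required)


# Placeholder values for illustration only.
example = Article(
    source="example-outlet",
    url="https://example.com/article",
    title="Example headline",
    date="2024-01-31",
    content="Example body text.",
)
record = asdict(example)  # plain dict, ready to serialize as one JSON line
```

Keeping the record a plain dataclass means the same object can be checked once after scraping and then passed unchanged to the later stages.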


Stage 2: Topic Embedding

Purpose: Encode semantic content for clustering and retrieval.

| Parameter | Value |
| --- | --- |
| Model | intfloat/e5-base-v2 |
| Dimensionality | 768 |
| Chunk Size | 512 tokens |
| Normalization | yes |
| Collection | news_topic |

Method:

  • Articles are split into windows of at most 512 tokens
  • Each chunk is embedded, and the chunk embeddings are averaged into a single topic vector per article
  • Related topics are mapped using the taxonomy in topics.json
  • Two articles are treated as overlapping in topic when the cosine similarity of their topic vectors is ≥ 0.4
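The averaging and threshold steps can be illustrated with toy vectors. This is a sketch under stated assumptions: in the real pipeline the chunk vectors would be 768-dimensional outputs of intfloat/e5-base-v2, the function names here are hypothetical, and normalizing after averaging (rather than before) is an assumption, since the README only says that normalization is applied.

```python
import math
from typing import List, Sequence

TOPIC_OVERLAP_THRESHOLD = 0.4  # threshold stated in the Method list


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mean_pool(chunk_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-chunk embeddings into one vector, then normalize it."""
    if not chunk_vectors:
        raise ValueError("need at least one chunk embedding")
    dim = len(chunk_vectors[0])
    sums = [0.0] * dim
    for vec in chunk_vectors:
        if len(vec) != dim:
            raise ValueError("all chunk embeddings must share one dimension")
        for i, x in enumerate(vec):
            sums[i] += x
    return l2_normalize([s / len(chunk_vectors) for s in sums])


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed as the dot product of unit vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def topics_overlap(a: Sequence[float], b: Sequence[float]) -> bool:
    return cosine_similarity(a, b) >= TOPIC_OVERLAP_THRESHOLD


# Toy 2-D stand-ins for real chunk embeddings.
article_a = mean_pool([[1.0, 0.0], [0.0, 1.0]])  # two chunks
article_b = mean_pool([[1.0, 0.0]])              # one chunk
```

With these toy inputs the two article vectors sit 45 degrees apart, so their similarity is about 0.71 and they count as overlapping under the 0.4 threshold.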

Stage 3: Stance and Ideological Classification (LLM)

Purpose: Classify political leaning, implied policy stance, and rhetorical framing.

| Parameter | Value |
| --- | --- |
| Provider | OpenAI |
| Model | gpt-4o-mini |
| Temperature | 0.4 |
| Max Tokens | 256 |

Reference Ontologies:

  • political_leanings.json — ideological families (progressive left, center, conservative right, etc.)
  • implied_stances.json — policy positions (pro/anti regulation, austerity, etc.)
  • source_bias.json — outlet bias metadata and expected positioning

Prompt Logic:

  1. Summarizes the article's argument in one sentence
  2. Assigns political leaning and implied stance
  3. Compares to outlet's expected bias family

Tone-Bias Alignment:

  • In-bias — stance aligns with outlet's typical positioning
  • Counter-bias — stance diverges from outlet's usual framing (editorial independence)
  • Neutral/Mixed — no consistent ideological pattern
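One way to read the three alignment categories is as a simple comparison between the article's assigned leaning and the outlet's expected leaning. The sketch below is an assumption about how that rule might look, not the pipeline's actual logic: the function name, the exact-match comparison, the lowercase label strings, and the use of a missing or `"mixed"` leaning for the third category are all hypothetical. A real implementation would likely compare at the level of the bias families in political_leanings.json.

```python
from typing import Optional


def tone_alignment(article_leaning: Optional[str], outlet_leaning: str) -> str:
    """Hypothetical rule mapping a leaning pair to an alignment label."""
    # Treat a missing or explicitly mixed leaning as no consistent pattern.
    if not article_leaning or article_leaning.strip().lower() == "mixed":
        return "neutral/mixed"
    # Exact match after light normalization (an assumed simplification).
    if article_leaning.strip().lower() == outlet_leaning.strip().lower():
        return "in-bias"
    return "counter-bias"
```

For example, `tone_alignment("center right", "center left")` returns `"counter-bias"` under this rule, while a `None` leaning falls into `"neutral/mixed"`.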

Example Output:

{
  "political_leaning": "center left",
  "implied_stance": "pro regulation",
  "summary": "Argues that public oversight is necessary to keep markets fair.",
  "tone_alignment": "in-bias"
}
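Because the classifier's response is free-form LLM output, it may be worth checking it before it reaches the embedding stage. The following sketch parses a response shaped like the example above and rejects incomplete ones. The function name and the set of allowed `tone_alignment` spellings are assumptions; the four keys are taken from the example output.

```python
import json

REQUIRED_KEYS = ("political_leaning", "implied_stance", "summary", "tone_alignment")
ALLOWED_TONE = {"in-bias", "counter-bias", "neutral/mixed"}  # assumed spellings


def parse_classification(raw: str) -> dict:
    """Parse the model's JSON reply and verify the expected fields."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in REQUIRED_KEYS:
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {key}")
    if data["tone_alignment"] not in ALLOWED_TONE:
        raise ValueError(f"unexpected tone_alignment: {data['tone_alignment']}")
    return data


# The example output from this README, used as test input.
sample = """{
  "political_leaning": "center left",
  "implied_stance": "pro regulation",
  "summary": "Argues that public oversight is necessary to keep markets fair.",
  "tone_alignment": "in-bias"
}"""
parsed = parse_classification(sample)
```

Failing early here keeps records with a missing stance or alignment out of the news_stance collection.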

Stage 4: Stance Embedding (Hybrid)

Purpose: Create dense embeddings representing worldview, tone, and rhetorical stance.

| Parameter | Value |
| --- | --- |
| Model | all-mpnet-base-v2 |
| Inputs | [political_leaning] + [implied_stance] + [summary] |
| Max Length | 4096 characters |
| Normalization | yes |
| Collection | news_stance |

The hybrid text captures ideological direction and emotional framing in a single consistent vector space.
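The hybrid input text can be pictured as a simple concatenation of the three classification fields, truncated to the stated maximum length. This is a sketch only: the function name and the `" | "` separator are assumptions, since the README does not say how the fields are joined.

```python
MAX_STANCE_CHARS = 4096  # "Max Length" from the parameter table


def build_stance_text(classification: dict, separator: str = " | ") -> str:
    """Join leaning, stance, and summary into one string for embedding."""
    parts = (
        classification["political_leaning"],
        classification["implied_stance"],
        classification["summary"],
    )
    text = separator.join(part.strip() for part in parts)
    return text[:MAX_STANCE_CHARS]  # hard character cap


stance_text = build_stance_text({
    "political_leaning": "center left",
    "implied_stance": "pro regulation",
    "summary": "Argues that public oversight is necessary to keep markets fair.",
})
```

The resulting string is what would be passed to the stance model (all-mpnet-base-v2 in the table above) to produce the stance vector.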


Stage 5: Storage and Dataset Export

Local Storage: ChromaDB

  • Path: chroma_db/
  • Metric: cosine distance
  • Collections: news_topic, news_stance
  • Auto-rebuild if missing

Batch Artifacts:

  • embeddings_topic.npz — topic vectors
  • embeddings_stance.npz — stance vectors
  • metadata.jsonl — article metadata
  • manifest.json — batch manifest
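To make the metadata and manifest files concrete, here is a hedged sketch of writing them with the standard library. The manifest fields (`batch_id`, `record_count`, `files` with SHA-256 digests) are hypothetical, as this README does not document the manifest schema, and the `.npz` embedding files are left out because they would need numpy.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_batch(batch_dir: Path, batch_id: str, records: list) -> dict:
    """Write metadata.jsonl and a manifest.json describing it."""
    batch_dir.mkdir(parents=True, exist_ok=True)

    # One JSON object per line, one line per article.
    metadata_path = batch_dir / "metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")

    # Hypothetical manifest layout: counts plus a checksum per file.
    manifest = {
        "batch_id": batch_id,
        "record_count": len(records),
        "files": {
            "metadata.jsonl": hashlib.sha256(metadata_path.read_bytes()).hexdigest(),
        },
    }
    (batch_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    return manifest


# Demonstration in a temporary directory with a placeholder record.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "batch_0001"
    manifest = write_batch(
        out_dir, "batch_0001",
        [{"url": "https://example.com/a", "title": "Example"}],
    )
    written_lines = (out_dir / "metadata.jsonl").read_text(encoding="utf-8").splitlines()
```

Recording a checksum per file is one way a consumer could verify a downloaded batch before rebuilding the local index.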

Public Export: Hugging Face Dataset: zanimal/anti-echo-artifacts


Extending the Pipeline

Adding RSS Feeds

Edit the scraper config to add new feeds:

FEEDS = [
    "https://www.reuters.com/rssFeed/worldNews",
    "https://www.theguardian.com/world/rss",
    "https://www.theguardian.com/politics/rss"
]

Each article automatically flows through cleaning → classification → embedding → storage.


Integrating Selenium Scrapes

For pages that require dynamic rendering, scrape them with Selenium and collect the results in a DataFrame with these columns:

| Column | Type | Description |
| --- | --- | --- |
| source | str | outlet identifier |
| url | str | canonical link |
| title | str | headline |
| date | str | ISO date |
| content | str | article body |
| author | str (optional) | author |
| section | str (optional) | category |

Then process:

from pipeline import process_dataframe
process_dataframe(df)

This call runs the full classification, embedding, and tone-bias pipeline on every row.
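Before handing a scraped DataFrame to `process_dataframe`, it may help to confirm that the required columns are present and that dates are in ISO format. The helpers below are hypothetical and not part of the `pipeline` module. They operate on plain column names and strings, so with pandas they would be called as `missing_columns(df.columns)`.

```python
from datetime import date
from typing import Iterable, List

# Column names taken from the table above.
REQUIRED_COLUMNS = ("source", "url", "title", "date", "content")
OPTIONAL_COLUMNS = ("author", "section")


def missing_columns(columns: Iterable[str]) -> List[str]:
    """Return required column names that are absent, in table order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


def is_iso_date(value: str) -> bool:
    """True if the value parses as an ISO calendar date such as 2024-01-31."""
    try:
        date.fromisoformat(value)
    except (TypeError, ValueError):
        return False
    return True
```

An empty list from `missing_columns` means the frame has every required column; the optional ones can be absent without failing the check.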


Cross-Bias Comparison Logic

The analysis tool identifies ideological contrast across and within outlets.

Algorithm:

  1. Topic Matching — Retrieve top-N articles by cosine similarity in news_topic (same subject)

  2. Stance Divergence — Compute cosine distance in news_stance (opposite framing)

  3. Bias Contrast Evaluation — Combine stance and tone-alignment metadata to highlight pairs where:

    • Both discuss the same topic
    • One is in-bias and the other counter-bias relative to their sources
    • Stance vectors indicate genuine disagreement (not just noise)

Example:

  • Input: Guardian article (center left, pro regulation, in-bias)
  • Retrieved: Wall Street Journal (center right, pro market, in-bias)
  • Both tagged "Economy / Regulation"
  • → Flagged as contrast pair with high anti-echo score
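The three steps can be sketched in plain Python on toy vectors. Several things here are assumptions rather than the repository's method: the function names, the record layout (`id`, `topic`, `stance`, `tone`), reusing 0.4 as the topic cutoff, and above all the score, which is taken to be topic similarity multiplied by stance distance because the README does not define the anti-echo score. The real tool queries the Chroma collections instead of looping over a list.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def find_contrast_pairs(query: Dict, candidates: List[Dict],
                        top_n: int = 5, min_topic_sim: float = 0.4) -> List[Dict]:
    # Step 1: topic matching, keeping the top-N candidates on the same subject.
    matches = []
    for cand in candidates:
        sim = cosine_similarity(query["topic"], cand["topic"])
        if sim >= min_topic_sim:
            matches.append((sim, cand))
    matches.sort(key=lambda pair: pair[0], reverse=True)

    # Step 2: stance divergence as cosine distance (1 - similarity).
    results = []
    for sim, cand in matches[:top_n]:
        distance = 1.0 - cosine_similarity(query["stance"], cand["stance"])
        results.append({
            "id": cand["id"],
            "topic_similarity": sim,
            "stance_distance": distance,
            # Step 3 input: tone labels are carried along for later filtering.
            "tone_pair": (query["tone"], cand["tone"]),
            "score": sim * distance,  # assumed scoring formula
        })
    results.sort(key=lambda r: r["score"], reverse=True)
    return results


# Toy 2-D vectors standing in for real embeddings.
query = {"id": "q", "topic": [1.0, 0.0], "stance": [1.0, 0.0], "tone": "in-bias"}
candidates = [
    {"id": "same-topic-opposite-stance", "topic": [0.9, 0.1], "stance": [-1.0, 0.0], "tone": "in-bias"},
    {"id": "same-topic-same-stance", "topic": [1.0, 0.0], "stance": [1.0, 0.0], "tone": "in-bias"},
    {"id": "other-topic", "topic": [0.0, 1.0], "stance": [-1.0, 0.0], "tone": "counter-bias"},
]
results = find_contrast_pairs(query, candidates)
```

In this toy run the off-topic candidate is dropped at step 1, and the candidate with the opposite stance ranks first because its stance distance is largest.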

Libraries: chromadb, numpy, optional faiss for scale.


Logging and Diagnostics

  • Logging Level: INFO
  • Failure Records: saved to logs/
  • Checkpointing: every 100 records
  • Validation:
    • Non-zero embedding vectors
    • JSON schema compliance
    • Tone alignment computed successfully
    • Manual sample preview (≤500 chars)
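The validation and checkpointing rules above might look like the following in code. This is a sketch with hypothetical names: the logger name, the record keys (`topic_vector`, `stance_vector`, `tone_alignment`, `content`), and the `save_checkpoint` callback are all assumptions, and JSON schema validation is omitted.

```python
import logging
from typing import Callable, Iterable, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("anti_echo")  # hypothetical logger name

CHECKPOINT_EVERY = 100  # "Checkpointing: every 100 records"
PREVIEW_CHARS = 500     # "Manual sample preview (≤500 chars)"


def validate_record(record: dict) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for key in ("topic_vector", "stance_vector"):
        vec = record.get(key)
        if not vec or not any(x != 0.0 for x in vec):
            problems.append(f"{key} is missing or all zeros")
    if not record.get("tone_alignment"):
        problems.append("tone_alignment was not computed")
    return problems


def preview(record: dict) -> str:
    """Short excerpt for manual spot checks."""
    return record.get("content", "")[:PREVIEW_CHARS]


def process(records: Iterable[dict], save_checkpoint: Callable[[int], None]) -> int:
    """Validate records, log failures, and checkpoint at fixed intervals."""
    valid = 0
    for count, record in enumerate(records, start=1):
        problems = validate_record(record)
        if problems:
            logger.warning("record %d rejected: %s", count, "; ".join(problems))
        else:
            valid += 1
        if count % CHECKPOINT_EVERY == 0:
            save_checkpoint(count)
    return valid
```

Passing the checkpoint step in as a callback keeps the loop independent of where progress is actually stored.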

Contributors

Author: Zan Merrill (MSBA, UT Austin)


Contributor Notes

  • Data Sourcing: Only scrape open, legally accessible text (Reuters, Guardian, AP, BBC, etc.)
  • Copyright: Do not upload paywalled or copyrighted content
  • Quality: Verify each record includes valid text, stance classification, and tone alignment before embedding
  • New Sources: When adding sources, include bias metadata in source_bias.json so tone-alignment logic remains accurate
  • Transparency: All embeddings are reproducible and auditable via config versions

License

This project is open source. Please respect the terms of use for all sourced news data.

