📰 Wikipedia Headline Tracker

A Python project I built to practice web scraping, building CLI tools, and data visualization.

About this project

I wanted to build something that actually does something — not just a tutorial clone. The idea is simple: visit Wikipedia's front page every day, grab all the headlines, save them, and track which words keep showing up over time.

After a week of daily runs, you have a real dataset that tells you what topics Wikipedia covered most that week. It's small, but it connects several things I've been learning: web scraping, data analysis, and building proper Python tools.

What it does

Scrapes Wikipedia's Spanish front page for today's headlines
Cleans the text — removes common filler words (stopwords), months, and punctuation
Saves the headlines to a CSV file, appending new data each day so it accumulates over time
Analyzes which words appear most often in that day's headlines
Visualizes word frequency trends as an interactive chart you can open in your browser

What I learned building this

Dealing with a real Windows encoding bug — the terminal was crashing on special characters because Windows defaults to cp1252 instead of UTF-8. Took some digging to fix properly.
Cleaning messy data — raw headlines are full of noise (months, prepositions, image captions). I built a stopword filter to strip all of that out so only meaningful words remain.
Organizing code into modules instead of dumping everything into one file. Each module (fetcher, storage, analyzer, visualizer) has one job.
Building a CLI others can actually use — flags like --top, --url, and --plot make the tool flexible without needing to touch the code.

Project structure

wikipedia_scraper/
├── scraper/
│   ├── __init__.py        # Ties the package together
│   ├── fetcher.py         # Downloads and parses the Wikipedia page
│   ├── storage.py         # Saves data to CSV (appends, never overwrites)
│   ├── analyzer.py        # Counts word frequency, filters noise
│   └── visualizer.py      # Generates the interactive Plotly chart
├── output/
│   └── .gitkeep           # Keeps the folder in git; data files are gitignored
├── main.py                # CLI entry point — two commands: run and visualize
└── requirements.txt

Installation

# 1. Clone the repo
git clone https://github.com/YOUR_USERNAME/wikipedia-scraper.git
cd wikipedia-scraper

# 2. Create and activate a virtual environment
python -m venv venv

# Windows
.\venv\Scripts\activate

# macOS / Linux
source venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

Usage

Windows note: add -X utf8 when running the script to avoid an encoding issue with special characters.

Scrape today's headlines and see the word frequency

python -X utf8 main.py run

Scrape today's data AND generate the chart in one step

python -X utf8 main.py run --plot

Generate the interactive chart from your accumulated data

python -X utf8 main.py visualize

Change how many top words to show

python -X utf8 main.py run --top 10

See all available options

python -X utf8 main.py run --help

Project Showcase

Top 5 words — April 19, 2026 (360 headlines tracked over 5 days)

After filtering out months, prepositions, and other noise, the words that actually came out on top:

Rank	Word	Occurrences	What it reflects
1	futbolista	27	Several footballer obituaries and sports news
2	día	30	International Days listed on Wikipedia
3	campeonato	22	Multiple active championships (NBA, snooker, judo)
4	nace	22	Birthday entries in the Efemérides section
5	actriz	18	Several actress entries in current events

What the CSV looks like inside

titular,seccion,fecha
"19 de abril: Elecciones legislativas de Bulgaria",Noticias de actualidad,2026-04-19T20:26:40
"19 de abril: Amstel Gold Race masculina y femenina de ciclismo",Noticias de actualidad,2026-04-19T20:26:40
"Pericles",Articulo destacado,2026-04-19T20:26:40

After several daily runs, main.py visualize creates an interactive chart at output/timeseries_plot.html. Open it in any browser — hover to see exact values, click the legend to toggle words on and off, and zoom in.

Dependencies

Package	What I used it for
`requests`	Making the HTTP request to download the Wikipedia page
`beautifulsoup4` + `lxml`	Parsing the HTML and extracting the headline text
`typer[all]`	Building the CLI with flags, help text, and colored output
`pandas`	Grouping and counting word frequencies by day
`plotly`	Generating the interactive HTML time-series chart

This is a learning project. I tried to write it cleanly — but I'm still improving and there's always more to learn!

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
output		output
scraper		scraper
.gitignore		.gitignore
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt
scraper.py		scraper.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

📰 Wikipedia Headline Tracker

About this project

What it does

What I learned building this

Project structure

Installation

Usage

Scrape today's headlines and see the word frequency

Scrape today's data AND generate the chart in one step

Generate the interactive chart from your accumulated data

Change how many top words to show

See all available options

Project Showcase

Top 5 words — April 19, 2026 (360 headlines tracked over 5 days)

What the CSV looks like inside

Dependencies

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

📰 Wikipedia Headline Tracker

About this project

What it does

What I learned building this

Project structure

Installation

Usage

Scrape today's headlines and see the word frequency

Scrape today's data AND generate the chart in one step

Generate the interactive chart from your accumulated data

Change how many top words to show

See all available options

Project Showcase

Top 5 words — April 19, 2026 (360 headlines tracked over 5 days)

What the CSV looks like inside

Dependencies

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages