Jossey28/wikipedia-data-analysis

Wikipedia Data Analytics

Context

Where the Idea Came From

This project idea came from watching adumb's graph-of-Wikipedia video.

YouTube Video Preview and Link

I enjoy watching videos about data analytics, so I thought, "Why not try it myself?" :)

Wikipedia's Data Dumps

Wikimedia hosts the latest data dumps for the English Wikipedia site at https://dumps.wikimedia.org/enwiki/latest

The data I'm using for this project is:

page IDs, titles, namespaces, redirect flags via sql.gz @ enwiki-latest-page.sql.gz

internal wikilinks between pages via sql.gz @ enwiki-latest-pagelinks.sql.gz

redirect source -> target mappings (page ID, target title, target namespace) via sql.gz @ enwiki-latest-redirect.sql.gz

page -> categories via sql.gz @ enwiki-latest-categorylinks.sql.gz

Project Set-Up (Windows)

Download and Extract Data

# Create the data and raw directories if they don't exist
New-Item -Path ".\data" -ItemType Directory -Force
New-Item -Path ".\raw" -ItemType Directory -Force

# Download separate data sources
## Article titles and IDs
Invoke-WebRequest "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz" -OutFile ".\raw\enwiki-latest-page.sql.gz"

## Internal links between articles (e.g. graph edges in the video)
Invoke-WebRequest "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pagelinks.sql.gz" -OutFile ".\raw\enwiki-latest-pagelinks.sql.gz"

## Redirect mappings: map aliases to article titles
Invoke-WebRequest "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-redirect.sql.gz" -OutFile ".\raw\enwiki-latest-redirect.sql.gz"
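The downloads above land in .\raw as .sql.gz files, while the import commands further down read plain .sql files from data/. One way to bridge that gap is to decompress each archive with .NET's GZipStream. This is a minimal sketch, not necessarily how the files were extracted for this project: the helper name Expand-GzipFile is hypothetical, and it assumes each dump is a single gzip stream and that there is enough free disk space for the uncompressed files.

```powershell
# Hypothetical helper (not part of the original project): decompress one .gz file.
function Expand-GzipFile {
    param(
        [Parameter(Mandatory)][string]$Source,       # full path to the .gz file
        [Parameter(Mandatory)][string]$Destination   # full path of the file to write
    )
    $inStream = [System.IO.File]::OpenRead($Source)
    try {
        $outStream = [System.IO.File]::Create($Destination)
        try {
            $gzip = [System.IO.Compression.GZipStream]::new(
                $inStream, [System.IO.Compression.CompressionMode]::Decompress)
            # Stream the data through so the whole dump is never held in memory
            try { $gzip.CopyTo($outStream) } finally { $gzip.Dispose() }
        } finally { $outStream.Dispose() }
    } finally { $inStream.Dispose() }
}

# .NET file APIs resolve relative paths against the process working directory,
# not the current PowerShell location, so work with absolute paths.
$dataDir = (New-Item -Path ".\data" -ItemType Directory -Force).FullName

Get-ChildItem -Path ".\raw" -Filter "*.sql.gz" | ForEach-Object {
    # BaseName drops only the last extension:
    # enwiki-latest-page.sql.gz -> enwiki-latest-page.sql
    $target = Join-Path $dataDir $_.BaseName
    Expand-GzipFile -Source $_.FullName -Destination $target
}
```

If a tool such as 7-Zip or gzip is already installed, it can do the same job; the sketch only avoids an extra dependency.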

Download Database Server for Quick Enumeration

Invoke-WebRequest "https://dev.mysql.com/get/Downloads/MySQLInstaller/mysql-installer-community-8.0.46.0.msi" -OutFile "$env:TEMP\mysql-installer.msi"

& "$env:TEMP\mysql-installer.msi" # I chose the full install


mysqlsh --sql -u root -p -e "CREATE DATABASE wikipedia;"

# These 3 imports can take a long time: roughly 40GB of data is imported in total
mysql -u root -p wikipedia --execute "SOURCE data/enwiki-latest-page.sql"
mysql -u root -p wikipedia --execute "SOURCE data/enwiki-latest-redirect.sql"
mysql -u root -p wikipedia --execute "SOURCE data/enwiki-latest-pagelinks.sql"
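Once the imports finish, a couple of quick queries can confirm that the tables loaded. This is an assumed sanity check rather than a step from the original setup; it relies on the standard MediaWiki column names (page_id, page_namespace, page_title, page_is_redirect in page; rd_from and rd_title in redirect).

```powershell
# Assumed sanity check: count non-redirect pages in the main article namespace (0)
mysql -u root -p wikipedia --execute "SELECT COUNT(*) AS articles FROM page WHERE page_namespace = 0 AND page_is_redirect = 0;"

# Assumed sanity check: show a few redirects as alias -> target title
mysql -u root -p wikipedia --execute "SELECT p.page_title AS alias, r.rd_title AS target FROM redirect r JOIN page p ON p.page_id = r.rd_from LIMIT 5;"
```

If either query fails with a missing-table error, the corresponding SOURCE command probably did not complete.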
