London Underground Usage: Analysis and Visualisation is a deep dive into two datasets, with the aim of gaining insight into how the Underground is used over the course of a year
- I chose two datasets for this project. The first was the 2024-2025 Station Footfall data from TfL (Transport for London)
- The second was a list of all Underground stations with their latitude and longitude coordinates
- My top requirement was for the data to be visualised at the end in the format that is easiest to understand while conveying as much information as possible
- My hypothesis was that overall usage would go up over time, as more jobs are created in London every day and the population keeps increasing
- The best way to validate this would be a line chart of Date vs Entry Count
- My first step was to find a dataset that fit my requirements; this ended up being two datasets
- I endeavoured to keep the integrity of the data as best as possible through the ETL and manipulation processes
- I decided on a scatter mapbox plot to address the business requirement. Having the stations plotted against the backdrop of the map makes it easy to spot geographical patterns.
- I employed descriptive and diagnostic analysis in my approach to understand the data and the reasons for it being that way.
- I started off with descriptive analytics for each visualisation to give an overview of what the data tells us, then employed diagnostic methods to understand the cause
- Initially, the first dataset (footfall) was a limitation because it had no coordinates, but this was quickly fixed by merging in the second dataset, which had the necessary information
- I primarily used generative AI tools such as ChatGPT to brainstorm potential routes for analysis, as well as which data visualisations to use
- The dataset was obtained from the TfL open data website
- I noticed during the ETL stage that the Kings Cross St Pancras station could cause a problem: it is a hybrid station, serving as an international hub as well as an Underground station, but its Underground footfall was not recorded separately. As this would be a major outlier and could produce false insights, I decided to exclude it from my analysis, which is focused purely on Underground stations
- I'm going to focus on data visualisation techniques, as I found that stage difficult in comparison to ETL
- Pandas for ETL and data manipulation
- Matplotlib for basic plots like line and bar plots
- Seaborn for advanced plots like heatmaps and violin plots
- Plotly for interactive plots like box and map scatter plots
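
The merge, the Kings Cross St Pancras exclusion, and the Date vs Entry Count aggregation described above could be sketched in Pandas roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`Station`, `Date`, `EntryCount`, `lat`, `lon`), the helper names `prepare` and `entries_by_date`, and the sample rows are all assumptions made up for the example, and the real TfL files will use their own headers.

```python
import pandas as pd

# Hypothetical sample rows standing in for the TfL footfall file.
# Column names and values are illustrative assumptions only.
footfall = pd.DataFrame({
    "Station": ["Oxford Circus", "Oxford Circus", "Kings Cross St Pancras",
                "Brixton", "Brixton"],
    "Date": ["2024-04-01", "2024-05-01", "2024-04-01",
             "2024-04-01", "2024-05-01"],
    "EntryCount": [1000, 1200, 5000, 300, 350],
})

# Hypothetical station list with approximate coordinates.
stations = pd.DataFrame({
    "Station": ["Oxford Circus", "Kings Cross St Pancras", "Brixton"],
    "lat": [51.5152, 51.5308, 51.4627],
    "lon": [-0.1419, -0.1238, -0.1145],
})


def prepare(footfall, stations, exclude=("Kings Cross St Pancras",)):
    """Drop excluded stations, parse dates, and attach coordinates."""
    df = footfall.copy()
    df["Date"] = pd.to_datetime(df["Date"])
    # Remove the hybrid station so it does not act as an outlier.
    df = df[~df["Station"].isin(exclude)]
    # Left join keeps every footfall row; validate guards against
    # duplicate station names in the coordinate list.
    return df.merge(stations, on="Station", how="left",
                    validate="many_to_one")


def entries_by_date(merged):
    """Total entries per date, ready for a Date vs Entry Count line chart."""
    return merged.groupby("Date")["EntryCount"].sum().sort_index()


merged = prepare(footfall, stations)
totals = entries_by_date(merged)
print(totals)
```

With Matplotlib installed, `totals.plot()` would draw the line chart used to check the hypothesis, and the `lat` and `lon` columns in `merged` are what a map scatter plot would need. A left join is used here so that any station missing from the coordinate list shows up as a row with empty coordinates rather than silently disappearing.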
