This project provides a comprehensive analysis and visualization of global air quality using Python data preprocessing and Power BI. It integrates raw pollutant data, data preprocessing scripts, SQL analysis, and interactive dashboards to deliver insights into global air pollution trends.
├── data/
│   ├── global_air_pollution_data.csv    # Original raw data (pollutant readings, AQI, etc.)
│   └── clean_air_quality.xlsx           # Cleaned dataset ready for Power BI import
│
├── notebooks/
│   └── data_processing.ipynb            # Jupyter notebook for data cleaning and transformation
│
├── dashboard/
│   ├── Project.pbix                     # Power BI dashboard file
│   ├── overview.png                     # Screenshot of the Overview dashboard
│   └── pollutant.png                    # Screenshot of the Pollutant Impact dashboard
│
├── sql_scripts/
│   ├── air_pollutant_share_by_type.sql
│   ├── countries_and_city_larger_zero.sql
│   ├── global_AQI_value_distribution.sql
│   └── pollutants_with_the_greatest_impact_on_global_average_AQI.sql
│
└── README.md                            # Project documentation
Before running the project, make sure you have the following installed:
- Python (>=3.10 recommended)
- Jupyter Notebook:
  - Option 1 (Recommended): Open the project folder in VS Code, then open the notebook `data_processing.ipynb`.
  - Option 2 (Optional): Jupyter Notebook (web interface via CMD/Terminal)
    - Open cmd/terminal and run: `pip install jupyter`
    - After installing, move to the project folder with: `cd path/to/Air-Quality-Analysis`
    - Run the notebook: `jupyter notebook notebooks/data_processing.ipynb`
    - Run each cell to preprocess the data (Shift + Enter)
- Required Python libraries:
  - All libraries are listed in `requirements.txt`, including:
    - pandas
    - sqlalchemy
    - openpyxl
  - They will be automatically installed when running the notebook
  - Or you can install them manually using: `pip install -r requirements.txt`
- Power BI Desktop (for dashboards)
- (Optional) PostgreSQL: Needed only if you want to run the SQL scripts in `sql_scripts/`.
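A quick way to confirm the prerequisites above is a small check script. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it only tests whether the three libraries named above can be imported and whether the interpreter meets the recommended version.

```python
import importlib.util
import sys

# Libraries named in the prerequisites list above.
REQUIRED_LIBRARIES = ["pandas", "sqlalchemy", "openpyxl"]


def missing_libraries(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_recent_enough(minimum=(3, 10)):
    """Check the running interpreter against the recommended minimum version."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    if not python_is_recent_enough():
        print("Python 3.10 or newer is recommended.")
    missing = missing_libraries(REQUIRED_LIBRARIES)
    if missing:
        print("Missing libraries:", ", ".join(missing))
        print("Install them with: pip install -r requirements.txt")
    else:
        print("All required libraries are available.")
```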
I. Terminal
- Clone the repository: `git clone https://github.com/KANH12/Air-Quality-Analysis.git`
- Navigate to the project directory: `cd Air-Quality-Analysis`
- Check the raw data
  - Ensure the file `global_air_pollution_data.csv` exists.
  - This is the input dataset for all processing.
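The raw-data check can also be done from Python before opening the notebook. This is a small sketch under the assumption that the file sits at `data/global_air_pollution_data.csv`, as shown in the project tree; `find_raw_data` is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path

# Location taken from the project tree: data/global_air_pollution_data.csv
RAW_DATA = Path("data") / "global_air_pollution_data.csv"


def find_raw_data(project_root="."):
    """Return the path to the raw CSV, or raise if it is missing."""
    path = Path(project_root) / RAW_DATA
    if not path.is_file():
        raise FileNotFoundError(f"Raw dataset not found: {path}")
    return path
```

Calling `find_raw_data()` from the repository root either returns the path or fails early with a clear message.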
II. Open the Notebook
🗺️ Open the notebook file in one of the following ways:
- Option A: Jupyter Notebook (Web Interface): `jupyter notebook notebooks/data_processing.ipynb`
  - This will open the notebook in your default web browser.
  - Run each cell (Shift + Enter) to preprocess and clean the data.
- Option B: Visual Studio Code
  - Open the folder in VS Code
  - Open `notebooks/data_processing.ipynb`
  - Run the notebook using the "Run All" button or `Shift + Enter` per cell.
- The dataset `global_air_pollution_data.csv` includes pollutant concentration data (PM2.5, Ozone, NO₂, CO) and computed AQI for major global cities.
- Data fields include:
  - `Country`, `City`
  - Pollutant: `PM2.5 value`, `PM2.5 category`, `Ozone value`, `Ozone category`, `NO₂ value`, `NO₂ category`, `CO value`, `CO category`
  - `AQI value` and `AQI category`
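Before cleaning, it can help to confirm that the CSV header actually contains the fields listed above. The sketch below uses only the standard library. The exact header spelling in the file (for example `NO2` written in ASCII) is an assumption, so adjust `EXPECTED_FIELDS` to match the real header.

```python
import csv
import io

# Field names as listed above; the exact spelling in the CSV header is an assumption.
EXPECTED_FIELDS = [
    "Country", "City",
    "AQI value", "AQI category",
    "PM2.5 value", "PM2.5 category",
    "Ozone value", "Ozone category",
    "NO2 value", "NO2 category",
    "CO value", "CO category",
]


def missing_fields(csv_file, expected=EXPECTED_FIELDS):
    """Return expected field names that are absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


# Example with an in-memory file standing in for the real dataset.
sample = io.StringIO("Country,City,AQI value,AQI category\n")
print(missing_fields(sample))
```

For the real file, pass an open handle instead, for example `open("data/global_air_pollution_data.csv", newline="", encoding="utf-8")`.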
Performed in `data_processing.ipynb` using Python libraries:
- Data Cleaning
  - Renamed columns to standardized and readable names.
  - Checked the `city` column for duplicate values.
  - Handled missing values by removing the affected records, particularly those with null Country fields. Records with null Country values were removed because, although other columns (including City) had data, each city appeared only once in the raw dataset. Without national reference data or repeated city entries, it was impossible to determine the corresponding country, so these records were excluded.
  - Filtered out invalid or inconsistent data points to ensure data quality.
- Data Transformation: No additional transformation was applied as each city record was unique.
- Output
  - Exports the cleaned dataset (`clean_air_quality.xlsx`).
  - Loads the same dataset into PostgreSQL for SQL-based analysis.
The project integrates with PostgreSQL to execute analytical queries for deeper air quality exploration.
Folder sql_scripts/ contains queries for data exploration and analysis:
- `air_pollutant_share_by_type.sql`: Compares pollutant proportions by type
- `countries_and_city_larger_zero.sql`: Filters valid countries/cities
- `global_AQI_value_distribution.sql`: Analyzes global AQI range distributions
- `pollutants_with_the_greatest_impact_on_global_average_AQI.sql`: Identifies major pollution drivers
💡 All SQL scripts operate on the cleaned dataset loaded into PostgreSQL from the ETL pipeline.
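The SQL scripts themselves are not reproduced here. As a rough Python analogue of what an AQI range distribution query computes, the sketch below counts readings per fixed-width range. The bucket width of 50 and the function name are assumptions, and the sample readings are invented.

```python
from collections import Counter


def aqi_range_distribution(aqi_values, width=50):
    """Count AQI values per range, e.g. 0-49 and 50-99 for width=50."""
    counts = Counter()
    for value in aqi_values:
        lower = int(value // width) * width
        counts[f"{lower}-{lower + width - 1}"] += 1
    return dict(counts)


# Hypothetical AQI readings for illustration only.
print(aqi_range_distribution([12, 48, 50, 73, 99, 101, 250]))
```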
This project follows a complete ETLV (Extract → Transform → Load → Visualize) workflow that connects multiple tools for end-to-end data analysis.
- Extract
  - Collected `global_air_pollution_data.csv` (CSV format) from the Global Air Quality dataset.
  - The dataset includes pollutant readings (PM2.5, NO₂, CO, O₃), AQI values, and geographic metadata.
- Transform and Cleaning
  - Cleaned and standardized raw data using Python (Pandas) in `data_processing.ipynb`.
  - Tasks performed:
    - Handle missing values and rename columns
    - Filter invalid values (to avoid meaningless or corrupted data)
    - Prepare structured data for analysis
  - No further transformation was required since the dataset already contained all necessary columns.
- Load
  - Exported transformed data to:
    - `clean_air_quality.xlsx`: used in Power BI for visualization
    - PostgreSQL: used for intermediate SQL analysis (queries in `/sql_scripts/`)
- Visualize
  - Built interactive dashboards in Power BI using the cleaned dataset.
  - Dashboards highlight trends, pollutant impacts, and geographic air quality differences.
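The Load step above, followed by a simple analysis query, can be sketched as follows. To keep the example runnable without a database server, an in-memory SQLite database stands in for PostgreSQL. The table name `air_quality`, the column names, and the records are all hypothetical.

```python
import sqlite3

# Hypothetical cleaned records; column names are assumptions.
cleaned = [
    ("Exampleland", "Alpha", 42),
    ("Exampleland", "Gamma", 130),
    ("Samplestan", "Delta", 88),
]

# An in-memory SQLite database stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE air_quality (country TEXT, city TEXT, aqi_value INTEGER)")
conn.executemany("INSERT INTO air_quality VALUES (?, ?, ?)", cleaned)
conn.commit()

# A simple aggregate of the kind the SQL analysis step might run.
rows = conn.execute(
    "SELECT country, COUNT(*) AS cities, AVG(aqi_value) AS avg_aqi "
    "FROM air_quality GROUP BY country ORDER BY country"
).fetchall()
print(rows)
```

Against a real PostgreSQL instance, the same pattern applies with a PostgreSQL driver or SQLAlchemy in place of `sqlite3`.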
Raw CSV
   ↓
Python (Cleaning)
   ↓                              ↓
[1] PostgreSQL (SQL Analysis)     [2] Excel (.xlsx)
   ↓                              ↓
Power BI (Visualization)
The project contains two interactive dashboards, designed for multi-dimensional analysis that highlights key air quality metrics.
Purpose (Overview dashboard): Provide a global-level summary of air quality distribution.
Key Visuals:
- Country, City & Status Filters: Dynamic filtering by geography and AQI status.
- KPI Cards:
  - Country count
  - City count
  - Average AQI
- Area Chart: AQI distribution by value range.
- Map Visualization: Global AQI levels by region.
- Treemap: Distribution of AQI categories (Good, Moderate, Unhealthy, etc.).
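The three KPI card values listed above are computed inside Power BI. For cross-checking them against the cleaned data, an equivalent calculation in Python might look like the sketch below. The field names follow the data fields described earlier and are assumptions about the cleaned file, and `overview_kpis` is a hypothetical helper.

```python
def overview_kpis(rows):
    """Compute country count, city count, and average AQI from cleaned records."""
    countries = {row["Country"] for row in rows}
    # A city is identified together with its country, so same-named cities stay distinct.
    cities = {(row["Country"], row["City"]) for row in rows}
    aqi_values = [float(row["AQI value"]) for row in rows]
    average = sum(aqi_values) / len(aqi_values) if aqi_values else None
    return {
        "country_count": len(countries),
        "city_count": len(cities),
        "average_aqi": average,
    }


# Hypothetical records for illustration only.
sample_rows = [
    {"Country": "Exampleland", "City": "Alpha", "AQI value": "40"},
    {"Country": "Exampleland", "City": "Gamma", "AQI value": "60"},
    {"Country": "Samplestan", "City": "Delta", "AQI value": "80"},
]
print(overview_kpis(sample_rows))
```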
Purpose (Pollutant Impact dashboard): Analyze air quality by pollutant types and their relative contributions.
Key Visuals:
- Country, City & Pollutants Filters: Dynamic filtering by geography and each pollutant.
- KPI Cards:
  - Countries and Cities recorded
  - Average PM2.5, Ozone, NO₂, CO concentrations
  - Active pollutants count
- Pie Chart: Pollutant share by type.
- Treemap: Block size and color indicate average concentration, highlighting the major contributors to air quality.
- Excel: `clean_air_quality.xlsx`
- SQL: `sql_scripts/` runs on PostgreSQL
- Power BI: dashboard.pbix
🟦 Key Insight 1: Global Air Quality Stability
- The global average AQI is 72.34, which falls within the Moderate range.
- Most countries maintain relatively low AQI levels, indicating overall stable and acceptable air quality worldwide.
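To show why an average of 72.34 counts as Moderate, the sketch below maps an AQI value to a category using the US EPA AQI category bounds (Good up to 50, Moderate up to 100, and so on). Whether the dataset's `AQI category` column was derived with exactly these bounds is an assumption.

```python
# Upper bounds of the US EPA AQI categories.
AQI_BREAKPOINTS = [
    (50, "Good"),
    (100, "Moderate"),
    (150, "Unhealthy for Sensitive Groups"),
    (200, "Unhealthy"),
    (300, "Very Unhealthy"),
]


def aqi_category(aqi):
    """Map an AQI value to its category name; values above 300 are Hazardous."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    for upper, label in AQI_BREAKPOINTS:
        if aqi <= upper:
            return label
    return "Hazardous"


print(aqi_category(72.34))  # Moderate
```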
🟥 Key Insight 2: AQI Distribution Patterns
- Most countries have AQI values below 100, concentrated in the lower range.
- Only a few countries exceed AQI 200, meaning severe pollution events are geographically limited rather than globally widespread.
🟨 Key Insight 3: Pollutant Composition
- PM2.5 dominates, contributing 63.9% of total air contamination.
- Ozone (33.3%) is the second largest contributor.
- NO₂ and CO have minor shares, showing that fine particulate matter and ozone are key global air quality concerns.
🟩 Key Insight 4: Pollutant Severity
- PM2.5 has the highest average concentration (68.88 µg/m³), nearly double that of Ozone (35.23 µg/m³).
- This highlights serious health risks from fine particles, especially in urban and industrial regions.
🟪 Key Insight 5: Global Coverage & Data Scope
- Dataset includes 175 countries and over 23,000 cities, ensuring broad global coverage.
- Such scale enhances the reliability of insights on worldwide air quality trends.
Overall, while global air quality appears moderately stable, the dominance of PM2.5 and Ozone indicates that ongoing monitoring and pollution control remain essential to sustain healthy atmospheric conditions.
| Category | Tools | Description |
|---|---|---|
| Visualization | Power BI | Data visualization and dashboard building |
| Programming | Python | Data preprocessing and scripting |
| Library | Pandas, NumPy | Data cleaning, manipulation, and analysis |
| Data Formats | Excel, CSV | Data storage and export formats |
| Query Language | SQL (PostgreSQL, MySQL) | Data querying and analysis |
- Integrate real-time air quality data from public APIs to enable live dashboard updates.
- Automate the ETL process using Python scripts and schedule with Apache Airflow or Cron.
- Deploy the dashboard on Power BI Service or Streamlit for public accessibility.
Le Nguyen Bao Khang [Khngzxz]
Data Analyst | Skilled in Python, SQL & Power BI
- 📧 baokhang1608@gmail.com
- GitHub | LinkedIn
- Screenshots in this documentation correspond to the Power BI dashboard views (`overview.png` and `pollutant.png` in `dashboard/`).
- Ensure the file paths are correct when connecting `clean_air_quality.xlsx` to Power BI.
© 2025 Le Nguyen Bao Khang. All rights reserved.