📊 Data Mining Project – Assignment #02 (Autumn 2025)

This repository contains Group 08's full submission for Assignment #02 in the Data Mining (KSDAMIN1KU-20252) course at IT University of Copenhagen. It includes our code, report, and documentation for a group-based data mining project exploring climate-health relationships using machine learning techniques.

🌍 Dataset: Global Climate-Health Impact Tracker (2015–2025)

We used the Global Climate-Health Impact Tracker (2015–2025) dataset curated by Sohum Gokhale, which links climate events to health outcomes across 25 countries over an 11-year period.

📦 Dataset Highlights

Records: 14,100 weekly entries
Features: 30 columns including climate, health, air quality, and socioeconomic indicators
Coverage: 25 countries across 8 regions and 3 income levels
Completeness: 100% (no missing values)
License: CC0 – Public Domain
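
The headline figures above are straightforward to re-check once the CSV is loaded. The sketch below is illustrative only: `summarize` is a hypothetical helper (not part of this repository), and it runs on a tiny in-memory frame so that it stays self-contained. To apply it to the real data, load the file with `pd.read_csv` using whatever path the `data/` folder actually holds.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return the figures quoted in the highlights list for a given frame."""
    return {
        "records": len(df),
        "features": df.shape[1],
        "countries": df["country_code"].nunique(),
        "missing_values": int(df.isna().sum().sum()),
    }


# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame(
    {
        "country_code": ["USA", "USA", "IND", "IND"],
        "date": ["2015-01-05", "2015-01-12", "2015-01-05", "2015-01-12"],
        "temperature_celsius": [1.2, 0.4, 18.9, 19.3],
    }
)
summary = summarize(toy)
print(summary)
```

On the full dataset, the same call would be expected to report 14,100 records, 30 features, 25 countries, and 0 missing values if the highlights hold.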

📁 Dataset Structure

Geographic Dimensions

record_id: Unique identifier for each record
country_code: ISO 3-letter country code
country_name: Full country name
region: Geographic region (8 regions)
income_level: World Bank income classification (High, Upper-Middle, Lower-Middle)
latitude, longitude: Geographic coordinates
population_millions: Population in millions

Temporal Dimensions

date: Week start date (YYYY-MM-DD)
year: Year (2015-2025)
month: Month (1-12)
week: ISO week number

Climate Indicators

temperature_celsius: Average weekly temperature
temp_anomaly_celsius: Temperature deviation from historical average
precipitation_mm: Total weekly precipitation
heat_wave_days: Number of heat wave days in the week
drought_indicator: Binary indicator for drought conditions
flood_indicator: Binary indicator for flood events
extreme_weather_events: Total count of extreme events

Air Quality

pm25_ugm3: PM2.5 particulate matter concentration (μg/m³)
air_quality_index: Overall air quality index (0-500)

Health Outcomes (rates per 100,000 population)

respiratory_disease_rate: Respiratory illness incidence rate
cardio_mortality_rate: Cardiovascular mortality rate
vector_disease_risk_score: Risk score for vector-borne diseases (malaria, dengue)
waterborne_disease_incidents: Waterborne disease incident rate
heat_related_admissions: Hospital admissions due to heat-related illness

Socioeconomic & Health System

healthcare_access_index: Healthcare system accessibility (0-100)
gdp_per_capita_usd: GDP per capita in USD

Wellbeing Indicators

mental_health_index: Population mental health score (0-100)
food_security_index: Food security score (0-100)
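
Several columns above come with a documented range (for example 0-100 indices, a 0-500 air quality index, and binary indicators). A minimal sketch of a range check follows. `EXPECTED_RANGES` and `out_of_range` are hypothetical names introduced here for illustration; the bounds are copied from the column descriptions, and the toy frame is made-up data.

```python
import pandas as pd

# Inclusive (low, high) bounds taken from the column descriptions above.
EXPECTED_RANGES = {
    "year": (2015, 2025),
    "month": (1, 12),
    "air_quality_index": (0, 500),
    "healthcare_access_index": (0, 100),
    "mental_health_index": (0, 100),
    "food_security_index": (0, 100),
    "drought_indicator": (0, 1),
    "flood_indicator": (0, 1),
}


def out_of_range(df: pd.DataFrame, ranges: dict = EXPECTED_RANGES) -> dict:
    """Count rows outside the documented range, for each listed column present.

    Missing values are counted as violations because `between` is False for NaN.
    """
    counts = {}
    for col, (low, high) in ranges.items():
        if col in df.columns:
            counts[col] = int((~df[col].between(low, high)).sum())
    return counts


# Toy frame with one bad month (13) and one bad AQI value (501).
toy = pd.DataFrame(
    {
        "month": [1, 12, 13],
        "air_quality_index": [42.0, 501.0, 0.0],
        "flood_indicator": [0, 1, 0],
    }
)
violations = out_of_range(toy)
print(violations)
```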

📈 Key Statistics

Temporal Coverage: 2015–2025 (weekly data)
Extreme Weather Events: 2,334 total events tracked
Temperature Range: -20.7°C to 38.3°C
Countries: USA, India, China, Brazil, Nigeria, Germany, Japan, UK, France, Australia, Kenya, Mexico, Indonesia, Pakistan, Bangladesh, Egypt, South Africa, Canada, Spain, Italy, Thailand, Philippines, Vietnam, Argentina, Colombia

🧠 Project Goals

We aimed to extract meaningful insights from the dataset using both supervised and unsupervised learning techniques. Our approach was guided by research questions that explored predictive modeling, clustering, correlation analysis, and regional comparisons.

🔍 Research Questions

We organized our research questions into thematic groups:

🌎 Climate Change & Regional Differences

How does the impact of climate change differ between the Global South and the Global North?
Have droughts and heatwaves intensified over the past 10 years due to global warming?

💰 Socioeconomic & Health Drivers

Which factor has the greatest influence on health — GDP, average income, or climate?
How has food security changed in a given country over the past 10 years, and what effect has this had on mortality rates?

🧑‍⚕️ Wellbeing & Mental Health

To what extent do food security and extreme weather affect mental health?

🌾 Climate Events & Food Security

How do drought indicators, flood indicators, and extreme weather events relate to food security?
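
One way to approach this question is a correlation table between the event columns and `food_security_index`, using both Pearson and Spearman coefficients. The sketch below is an assumption about how such a table could be built, not a record of our actual analysis. `event_correlations` is a hypothetical helper, and the toy data is randomly generated with a built-in negative effect purely to exercise the code.

```python
import numpy as np
import pandas as pd

EVENT_COLS = ["drought_indicator", "flood_indicator", "extreme_weather_events"]


def event_correlations(df: pd.DataFrame, target: str = "food_security_index") -> pd.DataFrame:
    """Pearson and Spearman correlation of each event column with the target."""
    cols = EVENT_COLS + [target]
    pearson = df[cols].corr(method="pearson")[target].drop(target)
    spearman = df[cols].corr(method="spearman")[target].drop(target)
    return pd.DataFrame({"pearson": pearson, "spearman": spearman})


# Synthetic toy data: food security is lowered by droughts and floods by construction.
rng = np.random.default_rng(0)
n = 200
drought = rng.integers(0, 2, n)
flood = rng.integers(0, 2, n)
toy = pd.DataFrame(
    {
        "drought_indicator": drought,
        "flood_indicator": flood,
        "extreme_weather_events": drought + flood,
        "food_security_index": 80 - 10 * drought - 5 * flood + rng.normal(0, 2, n),
    }
)
corr_table = event_correlations(toy)
print(corr_table)
```

Because two of the event columns are binary, Spearman and Pearson will often agree closely; reporting both mainly guards against the skew in `extreme_weather_events`. Correlation here describes association only and does not establish a causal effect.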

📊 Data Mining & Modeling

What patterns or anomalies can be identified in the dataset?
How accurately can we classify or predict key health outcomes based on climate indicators?
What dimensionality reduction or clustering techniques reveal hidden structure across regions?

🧪 Methods Applied

We applied a range of data mining techniques:

Preprocessing: Normalization, outlier detection, feature selection
Unsupervised Learning: K-Means clustering, PCA
Supervised Learning: Decision Trees, Logistic Regression, Random Forest
Evaluation Metrics: Accuracy, F1-score, Pearson/Spearman correlation, baseline comparison
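
As a rough illustration of how the normalization, PCA, and K-Means steps listed above can fit together, here is a minimal scikit-learn sketch. The feature subset, the cluster count, and the `cluster_records` helper are assumptions chosen for the example and do not reflect the settings used in our notebooks. The toy data consists of two artificial, well-separated groups.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative feature subset; the notebooks may use a different selection.
FEATURES = ["temperature_celsius", "precipitation_mm", "pm25_ugm3", "gdp_per_capita_usd"]


def cluster_records(df, n_clusters=3, n_components=2, seed=0):
    """Standardize features, then return PCA coordinates and K-Means labels."""
    X = StandardScaler().fit_transform(df[FEATURES])
    # PCA is used here only to get low-dimensional coordinates for plotting.
    components = PCA(n_components=n_components, random_state=seed).fit_transform(X)
    # Clustering runs on the standardized features, not on the PCA projection.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    return components, labels


def make_group(rng, temp, precip, pm25, gdp, size=30):
    """Generate one synthetic group of records around the given centre values."""
    return pd.DataFrame(
        {
            "temperature_celsius": rng.normal(temp, 1, size),
            "precipitation_mm": rng.normal(precip, 2, size),
            "pm25_ugm3": rng.normal(pm25, 1, size),
            "gdp_per_capita_usd": rng.normal(gdp, 500, size),
        }
    )


rng = np.random.default_rng(0)
toy = pd.concat(
    [make_group(rng, 5, 20, 10, 50_000), make_group(rng, 28, 60, 60, 5_000)],
    ignore_index=True,
)
components, labels = cluster_records(toy, n_clusters=2)
print(components.shape, sorted(set(labels)))
```

Standardizing first matters because `gdp_per_capita_usd` is several orders of magnitude larger than the other columns and would otherwise dominate the Euclidean distances that K-Means relies on.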

All code is written in Python using pandas, numpy, scikit-learn, matplotlib, and seaborn. The full dependency list is in requirements.txt.
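
To make the "baseline comparison" idea concrete, the sketch below compares a Random Forest against a majority-class baseline on accuracy and F1. Everything specific in it is an assumption for illustration: the binary target (above-median `respiratory_disease_rate`), the three climate predictors, the random 75/25 split, and the `compare_to_baseline` helper. The toy data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Illustrative predictor subset drawn from the climate indicator columns.
CLIMATE_COLS = ["temperature_celsius", "precipitation_mm", "heat_wave_days"]


def compare_to_baseline(df, target_col="respiratory_disease_rate", seed=0):
    """Score a Random Forest and a majority-class baseline on a held-out split."""
    X = df[CLIMATE_COLS]
    # Assumed labelling: 1 if the rate is above the overall median, else 0.
    y = (df[target_col] > df[target_col].median()).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    models = {
        "baseline": DummyClassifier(strategy="most_frequent"),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        scores[name] = {
            "accuracy": accuracy_score(y_te, pred),
            "f1": f1_score(y_te, pred, zero_division=0),
        }
    return scores


# Synthetic toy data in which the rate rises with temperature and heat wave days.
rng = np.random.default_rng(0)
n = 400
temperature = rng.normal(15, 8, n)
heat_wave_days = rng.integers(0, 4, n)
toy = pd.DataFrame(
    {
        "temperature_celsius": temperature,
        "precipitation_mm": rng.normal(30, 10, n),
        "heat_wave_days": heat_wave_days,
        "respiratory_disease_rate": 50 + 2 * temperature + 5 * heat_wave_days + rng.normal(0, 3, n),
    }
)
scores = compare_to_baseline(toy)
print(scores)
```

A random split is the simplest choice but ignores the weekly, per-country structure of the data; splitting by time or by country would give a stricter estimate of generalization.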

📊 Results & Insights

Our final report includes:

Visualizations of data distributions and model outputs
Quantitative evaluation of model performance
Discussion of challenges, limitations, and societal impact
Reflections on privacy, reproducibility, and ethical considerations

📁 Repository Structure

├── data/                # Raw and preprocessed datasets
├── notebooks/           # Jupyter Notebooks with analysis and experiments
├── src/                 # Python scripts for preprocessing and modeling
├── report/              # Final report in IEEE double-column format (PDF)
├── work_allocation.md   # Group member contributions
├── requirements.txt     # List of required Python packages
└── README.md            # Project overview and instructions

👥 Team & Contributions

This project was completed by Group #08. See work_allocation.md for a breakdown of individual contributions.

⚠️ Academic Integrity

All code and analysis are original or adapted from course labs with proper attribution. We affirm that this submission reflects our own work and understanding.

📚 Sources

This dataset synthesizes patterns from multiple authoritative sources including:

World Bank Climate Data
WHO Global Health Observatory
Climate reanalysis models
Air quality monitoring networks
National health statistics

The data follow realistic statistical distributions and correlations reported in the climate-health research literature.

Global Climate-Health Impact Tracker (2015–2025)
Created: October 2025
Source: Kaggle Datasets
