This repository contains Group 08's full submission for Assignment #02 in the Data Mining (KSDAMIN1KU-20252) course at IT University of Copenhagen. It includes our code, report, and documentation for a group-based data mining project exploring climate-health relationships using machine learning techniques.
We used the Global Climate-Health Impact Tracker (2015–2025) dataset curated by Sohum Gokhale, which links climate events to health outcomes across 25 countries over an 11-year period.
- Records: 14,100 weekly entries
- Features: 30 columns including climate, health, air quality, and socioeconomic indicators
- Coverage: 25 countries across 8 regions and 3 income levels
- Completeness: 100% (no missing values)
- License: CC0 – Public Domain
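The headline figures above are easy to re-check once the data is loaded. The following is a minimal sketch, not the project's actual loading code. The column names `country_code`, `region`, and `income_level` are taken from the dataset's column list. The function name `summarize_dataset`, the tiny demo frame, and its region labels are illustrative assumptions.

```python
import pandas as pd


def summarize_dataset(df: pd.DataFrame) -> dict:
    """Compute the headline figures quoted in the dataset summary."""
    return {
        "records": len(df),
        "features": df.shape[1],
        "countries": df["country_code"].nunique(),
        "regions": df["region"].nunique(),
        "income_levels": df["income_level"].nunique(),
        # Total count of missing cells across all columns.
        "missing_values": int(df.isna().sum().sum()),
    }


# Tiny stand-in frame (made-up values) so the sketch runs without the real CSV.
demo = pd.DataFrame(
    {
        "country_code": ["USA", "USA", "IND", "KEN"],
        "region": ["North America", "North America", "South Asia", "East Africa"],
        "income_level": ["High", "High", "Lower-Middle", "Lower-Middle"],
        "temperature_celsius": [12.1, 13.4, 29.8, 24.0],
    }
)

summary = summarize_dataset(demo)
print(summary)
```

If the figures listed above hold, calling the same function on the full dataset should report 14,100 records, 30 features, 25 countries, 8 regions, 3 income levels, and 0 missing values.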
- `record_id`: Unique identifier for each record
- `country_code`: ISO 3-letter country code
- `country_name`: Full country name
- `region`: Geographic region (8 regions)
- `income_level`: World Bank income classification (High, Upper-Middle, Lower-Middle)
- `latitude`, `longitude`: Geographic coordinates
- `population_millions`: Population in millions

- `date`: Week start date (YYYY-MM-DD)
- `year`: Year (2015–2025)
- `month`: Month (1–12)
- `week`: ISO week number

### Climate Indicators

- `temperature_celsius`: Average weekly temperature
- `temp_anomaly_celsius`: Temperature deviation from historical average
- `precipitation_mm`: Total weekly precipitation
- `heat_wave_days`: Number of heat wave days in the week
- `drought_indicator`: Binary indicator for drought conditions
- `flood_indicator`: Binary indicator for flood events
- `extreme_weather_events`: Total count of extreme events

- `pm25_ugm3`: PM2.5 particulate matter concentration (μg/m³)
- `air_quality_index`: Overall air quality index (0–500)

- `respiratory_disease_rate`: Respiratory illness incidence rate
- `cardio_mortality_rate`: Cardiovascular mortality rate
- `vector_disease_risk_score`: Risk score for vector-borne diseases (malaria, dengue)
- `waterborne_disease_incidents`: Waterborne disease incident rate
- `heat_related_admissions`: Hospital admissions due to heat-related illness

- `healthcare_access_index`: Healthcare system accessibility (0–100)
- `gdp_per_capita_usd`: GDP per capita in USD

### Wellbeing Indicators

- `mental_health_index`: Population mental health score (0–100)
- `food_security_index`: Food security score (0–100)
- Temporal Coverage: 2015–2025 (weekly data)
- Extreme Weather Events: 2,334 total events tracked
- Temperature Range: -20.7°C to 38.3°C
- Countries: USA, India, China, Brazil, Nigeria, Germany, Japan, UK, France, Australia, Kenya, Mexico, Indonesia, Pakistan, Bangladesh, Egypt, South Africa, Canada, Spain, Italy, Thailand, Philippines, Vietnam, Argentina, Colombia
We aimed to extract meaningful insights from the dataset using both supervised and unsupervised learning techniques. Our approach was guided by research questions that explored predictive modeling, clustering, correlation analysis, and regional comparisons.
We organized our research questions into thematic groups:
- How does the impact of climate change differ between the Global South and the Global North?
- Have droughts and heatwaves intensified over the past 10 years due to global warming?
- Which factor has the greatest influence on health: GDP, average income, or climate?
- How has food security changed in a given country over the past 10 years, and what effect has this had on mortality rates?
- To what extent do food security and extreme weather affect mental health?
- How do drought indicators, flood indicators, and extreme weather events relate to food security?
- What patterns or anomalies can be identified in the dataset?
- How accurately can we classify or predict key health outcomes based on climate indicators?
- What dimensionality reduction or clustering techniques reveal hidden structure across regions?
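Several of the questions above ask how two indicators relate, for example extreme weather events and food security. A minimal sketch of such a check with Pearson and Spearman correlation is shown here. The column names `extreme_weather_events` and `food_security_index` come from the dataset, while the helper `correlation_pair` and the synthetic demo values are assumptions made purely for illustration and say nothing about the real data.

```python
import numpy as np
import pandas as pd


def correlation_pair(df: pd.DataFrame, x: str, y: str) -> dict:
    """Return Pearson and Spearman correlation between two columns."""
    # Pearson measures linear association.
    pearson = df[x].corr(df[y])
    # Spearman is Pearson computed on ranks, so it captures any
    # monotonic relationship and needs no extra dependency.
    spearman = df[x].rank().corr(df[y].rank())
    return {"pearson": float(pearson), "spearman": float(spearman)}


# Synthetic stand-in data (NOT the real dataset): food security is
# constructed to fall as the weekly event count rises.
rng = np.random.default_rng(0)
events = rng.poisson(lam=1.0, size=200)
demo = pd.DataFrame(
    {
        "extreme_weather_events": events,
        "food_security_index": 80 - 5 * events + rng.normal(0, 3, size=200),
    }
)

result = correlation_pair(demo, "extreme_weather_events", "food_security_index")
print(result)
```

Reporting both coefficients is a cheap way to spot relationships that are monotonic but not linear, since in that case Spearman will be noticeably stronger than Pearson.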
We applied a range of data mining techniques:
- Preprocessing: Normalization, outlier detection, feature selection
- Unsupervised Learning: K-Means clustering, PCA
- Supervised Learning: Decision Trees, Logistic Regression, Random Forest
- Evaluation Metrics: Accuracy, F1-score, Pearson/Spearman correlation, baseline comparison
All code is written in Python using pandas, numpy, scikit-learn, matplotlib, and seaborn. The complete dependencies are saved in the requirements.txt file.
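As a rough illustration of how these techniques fit together in scikit-learn, the sketch below scales features, applies PCA and K-Means, and then compares a logistic regression classifier against a majority-class baseline using accuracy and F1. It is not the project's actual pipeline. The data is synthetic, and the binary target (an above-median "respiratory disease rate"), the choice of three features, and all parameter values are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 600

# Synthetic stand-ins for temperature_celsius, pm25_ugm3, heat_wave_days.
temperature = rng.normal(18, 9, n)
pm25 = rng.gamma(shape=2.0, scale=15.0, size=n)
heat_days = rng.poisson(0.5, n)
X = np.column_stack([temperature, pm25, heat_days])

# Hypothetical binary target: is the (synthetic) rate above its median?
rate = 0.8 * pm25 + 0.3 * temperature + rng.normal(0, 5, n)
y = (rate > np.median(rate)).astype(int)

# Unsupervised part: normalize, project to 2 components, cluster.
X_scaled = StandardScaler().fit_transform(X)
components = PCA(n_components=2).fit_transform(X_scaled)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(components)

# Supervised part: hold out a test set, then fit baseline and model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# Scaling lives inside the pipeline so it is fitted on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

scores = {}
for name, clf in {"baseline": baseline, "logistic_regression": model}.items():
    pred = clf.predict(X_test)
    scores[name] = {
        "accuracy": float(accuracy_score(y_test, pred)),
        "f1": float(f1_score(y_test, pred, zero_division=0)),
    }

print(scores)
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training, and the dummy baseline gives the accuracy and F1 numbers a reference point.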
Our final report includes:
- Visualizations of data distributions and model outputs
- Quantitative evaluation of model performance
- Discussion of challenges, limitations, and societal impact
- Reflections on privacy, reproducibility, and ethical considerations
├── data/                # Raw and preprocessed datasets
├── notebooks/           # Jupyter Notebooks with analysis and experiments
├── src/                 # Python scripts for preprocessing and modeling
├── report/              # Final report in IEEE double-column format (PDF)
├── work_allocation.md   # Group member contributions
├── requirements.txt     # List of required Python packages
└── README.md            # Project overview and instructions
This project was completed by Group #08. See work_allocation.md for a breakdown of individual contributions.
All code and analysis are original or adapted from course labs with proper attribution. We affirm that this submission reflects our own work and understanding.
This dataset synthesizes patterns from multiple authoritative sources, including:
- World Bank Climate Data
- WHO Global Health Observatory
- Climate reanalysis models
- Air quality monitoring networks
- National health statistics
The data follow realistic statistical distributions and correlations observed in the climate-health research literature.
Global Climate-Health Impact Tracker (2015–2025)
Created: October 2025
Source: Kaggle Datasets