JohannesHackl/DataMining_2025_A2

📊 Data Mining Project – Assignment #02 (Autumn 2025)

This repository contains Group 08's full submission for Assignment #02 in the Data Mining (KSDAMIN1KU-20252) course at IT University of Copenhagen. It includes our code, report, and documentation for a group-based data mining project exploring climate-health relationships using machine learning techniques.


🌍 Dataset: Global Climate-Health Impact Tracker (2015–2025)

We used the Global Climate-Health Impact Tracker (2015–2025) dataset curated by Sohum Gokhale, which links climate events to health outcomes across 25 countries over an 11-year period.

📦 Dataset Highlights

  • Records: 14,100 weekly entries
  • Features: 30 columns including climate, health, air quality, and socioeconomic indicators
  • Coverage: 25 countries across 8 regions and 3 income levels
  • Completeness: 100% (no missing values)
  • License: CC0 – Public Domain
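The record count and completeness figures above can be re-checked after loading the data. The following is a minimal sketch, not the project's actual code: `completeness_report` is a hypothetical helper, and the three-row frame is a synthetic stand-in because this README does not state the dataset's file name.

```python
import pandas as pd


def completeness_report(df: pd.DataFrame) -> dict:
    """Summarize row/column counts and the share of non-missing cells."""
    total_cells = df.shape[0] * df.shape[1]
    missing = int(df.isna().sum().sum())
    pct = 100.0 * (1 - missing / total_cells) if total_cells else 0.0
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_cells": missing,
        "completeness_pct": pct,
    }


# Synthetic stand-in rows (illustrative values, NOT taken from the dataset).
# With the real file one would build `df` via pd.read_csv(...) instead.
demo = pd.DataFrame(
    {
        "country_code": ["USA", "KEN", "IND"],
        "temperature_celsius": [12.4, 24.1, 29.8],
        "pm25_ugm3": [8.2, 21.5, 63.0],
    }
)

report = completeness_report(demo)
print(report)
```

On the full dataset, the same call would be expected to report 14,100 rows, 30 columns, and 100% completeness if the highlights above hold.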

📁 Dataset Structure

Geographic Dimensions

  • record_id: Unique identifier for each record
  • country_code: ISO 3-letter country code
  • country_name: Full country name
  • region: Geographic region (8 regions)
  • income_level: World Bank income classification (High, Upper-Middle, Lower-Middle)
  • latitude, longitude: Geographic coordinates
  • population_millions: Population in millions

Temporal Dimensions

  • date: Week start date (YYYY-MM-DD)
  • year: Year (2015-2025)
  • month: Month (1-12)
  • week: ISO week number

Climate Indicators

  • temperature_celsius: Average weekly temperature
  • temp_anomaly_celsius: Temperature deviation from historical average
  • precipitation_mm: Total weekly precipitation
  • heat_wave_days: Number of heat wave days in the week
  • drought_indicator: Binary indicator for drought conditions
  • flood_indicator: Binary indicator for flood events
  • extreme_weather_events: Total count of extreme events

Air Quality

  • pm25_ugm3: PM2.5 particulate matter concentration (μg/m³)
  • air_quality_index: Overall air quality index (0-500)

Health Outcomes (rates per 100,000 population)

  • respiratory_disease_rate: Respiratory illness incidence rate
  • cardio_mortality_rate: Cardiovascular mortality rate
  • vector_disease_risk_score: Risk score for vector-borne diseases (malaria, dengue)
  • waterborne_disease_incidents: Waterborne disease incident rate
  • heat_related_admissions: Hospital admissions due to heat-related illness

Socioeconomic & Health System

  • healthcare_access_index: Healthcare system accessibility (0-100)
  • gdp_per_capita_usd: GDP per capita in USD

Wellbeing Indicators

  • mental_health_index: Population mental health score (0-100)
  • food_security_index: Food security score (0-100)
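The column list above doubles as a schema. As a rough sketch (the helper names `missing_columns` and `out_of_range` are hypothetical, and the demo frame is synthetic), a loaded frame could be checked against the 30 documented columns and the stated 0-100 ranges like this:

```python
import pandas as pd

# The 30 column names documented in this README.
EXPECTED_COLUMNS = [
    "record_id", "country_code", "country_name", "region", "income_level",
    "latitude", "longitude", "population_millions",
    "date", "year", "month", "week",
    "temperature_celsius", "temp_anomaly_celsius", "precipitation_mm",
    "heat_wave_days", "drought_indicator", "flood_indicator",
    "extreme_weather_events",
    "pm25_ugm3", "air_quality_index",
    "respiratory_disease_rate", "cardio_mortality_rate",
    "vector_disease_risk_score", "waterborne_disease_incidents",
    "heat_related_admissions",
    "healthcare_access_index", "gdp_per_capita_usd",
    "mental_health_index", "food_security_index",
]


def missing_columns(df: pd.DataFrame) -> list:
    """Return documented columns that are absent from the frame."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


def out_of_range(df: pd.DataFrame, column: str, low: float, high: float) -> int:
    """Count values outside [low, high]; missing values also count as out of range."""
    return int((~df[column].between(low, high)).sum())


# Synthetic two-row frame with one deliberately invalid score.
demo = pd.DataFrame(
    {
        "food_security_index": [55.0, 101.0],
        "mental_health_index": [70.0, 64.5],
    }
)

absent = missing_columns(demo)
bad_scores = out_of_range(demo, "food_security_index", 0, 100)
print(len(absent), "documented columns absent;", bad_scores, "score out of range")
```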

📈 Key Statistics

  • Temporal Coverage: 2015–2025 (weekly data)
  • Extreme Weather Events: 2,334 total events tracked
  • Temperature Range: -20.7°C to 38.3°C
  • Countries: USA, India, China, Brazil, Nigeria, Germany, Japan, UK, France, Australia, Kenya, Mexico, Indonesia, Pakistan, Bangladesh, Egypt, South Africa, Canada, Spain, Italy, Thailand, Philippines, Vietnam, Argentina, Colombia

🧠 Project Goals

We aimed to extract meaningful insights from the dataset using both supervised and unsupervised learning techniques. Our approach was guided by research questions that explored predictive modeling, clustering, correlation analysis, and regional comparisons.


🔍 Research Questions

We organized our research questions into thematic groups:

🌎 Climate Change & Regional Differences

  • How does the impact of climate change differ between the Global South and the Global North?
  • Have droughts and heatwaves intensified over the past 10 years due to global warming?

💰 Socioeconomic & Health Drivers

  • Which factor has the greatest influence on health — GDP, average income, or climate?
  • How has food security changed in a given country over the past 10 years, and what effect has this had on mortality rates?

🧑‍⚕️ Wellbeing & Mental Health

  • To what extent do food security and extreme weather affect mental health?

🌾 Climate Events & Food Security

  • How do drought indicators, flood indicators, and extreme weather events relate to food security?
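One way to approach this question is a correlation table between the event columns and `food_security_index`. The sketch below is an illustration only: the data is randomly generated with a negative relationship built in by construction, so its numbers say nothing about the real dataset or about our findings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data: the dependence of food security on drought and
# flood is an ASSUMPTION baked into this demo, not a result from the dataset.
drought = rng.integers(0, 2, n)
flood = rng.integers(0, 2, n)
events = drought + flood + rng.integers(0, 2, n)
food = 80 - 10 * drought - 5 * flood + rng.normal(0, 3, n)

demo = pd.DataFrame(
    {
        "drought_indicator": drought,
        "flood_indicator": flood,
        "extreme_weather_events": events,
        "food_security_index": food,
    }
)

# Pearson measures linear association; Spearman works on ranks and is less
# sensitive to outliers and to non-linear but monotonic relationships.
pearson = demo.corr(method="pearson")["food_security_index"]
spearman = demo.corr(method="spearman")["food_security_index"]

print(pd.DataFrame({"pearson": pearson, "spearman": spearman}).round(2))
```

Because two of these columns are binary indicators, the coefficients are best read as a first screening step rather than as effect sizes.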

📊 Data Mining & Modeling

  • What patterns or anomalies can be identified in the dataset?
  • How accurately can we classify or predict key health outcomes based on climate indicators?
  • What dimensionality reduction or clustering techniques reveal hidden structure across regions?

🧪 Methods Applied

We applied a range of data mining techniques:

  • Preprocessing: Normalization, outlier detection, feature selection
  • Unsupervised Learning: K-Means clustering, PCA
  • Supervised Learning: Decision Trees, Logistic Regression, Random Forest
  • Evaluation Metrics: Accuracy, F1-score, Pearson/Spearman correlation, baseline comparison
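The preprocessing and unsupervised steps listed above can be chained in scikit-learn. This is a minimal sketch under stated assumptions, not the pipeline from our notebooks: it reads "normalization" as z-score scaling, uses two principal components and two clusters purely for illustration, and runs on synthetic data with two well-separated groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: two groups of 50 samples with 4 numeric features each.
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 4))
group_b = rng.normal(loc=6.0, scale=1.0, size=(50, 4))
X = np.vstack([group_a, group_b])

# Scale features to zero mean and unit variance, then project to 2 components.
reducer = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = reducer.fit_transform(X)

# Cluster in the reduced space; n_init is set explicitly for reproducibility.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_2d)

explained = reducer.named_steps["pca"].explained_variance_ratio_
print("explained variance ratio:", explained.round(3))
print("cluster sizes:", np.bincount(labels))
```

Scaling before PCA matters here because columns such as `gdp_per_capita_usd` and `heat_wave_days` live on very different scales, and unscaled PCA would be dominated by the largest one.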

All code is written in Python using pandas, numpy, scikit-learn, matplotlib, and seaborn. The complete dependencies are saved in the requirements.txt file.
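To illustrate the supervised learning and baseline comparison items listed above, here is a hedged sketch with scikit-learn. The features, the binary target, and all hyperparameters are assumptions chosen for the demo; the data is synthetic and the scores do not reflect our reported results.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Synthetic stand-in: 3 numeric features and a binary target that depends on
# the first two features plus noise (an assumption made for this demo).
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)

models = {
    "baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=7),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        # zero_division=0 avoids a warning if the baseline never predicts class 1.
        "f1": f1_score(y_test, pred, zero_division=0),
    }

for name, metrics in scores.items():
    print(f"{name}: accuracy={metrics['accuracy']:.2f}, f1={metrics['f1']:.2f}")
```

A random split is used here for brevity. For weekly records per country, a split by time or by country may give a more honest estimate, since neighbouring weeks are strongly related.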


📊 Results & Insights

Our final report includes:

  • Visualizations of data distributions and model outputs
  • Quantitative evaluation of model performance
  • Discussion of challenges, limitations, and societal impact
  • Reflections on privacy, reproducibility, and ethical considerations

📁 Repository Structure

├── data/               # Raw and preprocessed datasets
├── notebooks/          # Jupyter Notebooks with analysis and experiments
├── src/                # Python scripts for preprocessing and modeling
├── report/             # Final report in IEEE double-column format (PDF)
├── work_allocation.md  # Group member contributions
├── requirements.txt    # List of required Python packages
└── README.md           # Project overview and instructions


👥 Team & Contributions

This project was completed by Group #08. See work_allocation.md for a breakdown of individual contributions.


⚠️ Academic Integrity

All code and analysis are original or adapted from course labs with proper attribution. We affirm that this submission reflects our own work and understanding.


📚 Sources

This dataset synthesizes patterns from multiple authoritative sources including:

  • World Bank Climate Data
  • WHO Global Health Observatory
  • Climate reanalysis models
  • Air quality monitoring networks
  • National health statistics

Data follows realistic statistical distributions and correlations observed in climate-health research literature.

Global Climate-Health Impact Tracker (2015–2025)
Created: October 2025
Source: Kaggle Datasets
