Healthcare Insurance ETL & Data Visualization Project

Overview

This project is a basic analytics project whwre i performed an ETL (Extract, Transform, Load) workflow and create basic data visualizations using Python and associated libraries such as pandas for creating codes, markdown for comments and matplotlib, seaborn for basic visualisations and plotly was used to create a basic visualisation that is interactive. The dataset used is a CSV file containing healthcare insurance data.

Objectives

Extract data from a CSV file
Transform the data using cleaning and preprocessing techniques
Load the data into a pandas DataFrame for analysis
Perform exploratory data analysis (EDA)
Create visualizations using matplotlib, seaborn and plotly

Technologies Used

Python 3.12.8
pandas
matplotlib
seaborn
plotly
NumPy
Jupyter Notebook
Git & GitHub
vs code

Key Findings

Smokers are charged significantly higher
BMI and age positively correlate with increased insurance charges
Regional geographic differences are evident in pricing but not as significant as other personal features such as age, smoking habits and BMI.

Project Structure

Open Healthcare_insurance.ipynb in Jupyter Notebook in VS Code IDE.
Run the notebook cells to follow the ETL workflow and view the visualizations.

Key Steps

Extract: Load the CSV data using pandas.
Transform: Clean missing values, check data types, and preprocess as needed.
Load: Store the cleaned data in a DataFrame.
Visualize: Generate plots to explore relationships and trends in the data.

Agile Methodothology

This is a collaborative and sustainable way of managing projects by breaking the work into small, manageable tasks, delivering results step by step, and improving tasks based on regular feedbacks (Beck, K., et al.(2001). Manifesto for Agile Software Development. [https://agilemanifesto.org]

User Story

As a student Data Analyst, I want to build an ETL pipeline to extract, clean, transform, and load healthcare insurance cost data (including personal features such as age, gender, BMI, family size, smoking habits, and geographic region) from Kaggle, so that I can perform analysis, create and generate reports on how personal and geographic factors influence insurance costs.

Tasks

Extract Data
- Download and extract healthcare insurance data from Kaggle (CSV format) [https://www.kaggle.com/datasets/willianoliveiragibin/healthcare-insurance]
Clean Data
- Handle missing values (impute or drop as appropriate).
- Normalize data types and ensure correct formatting of all fields.
Transform Data
- Apply business logic to create derived columns (e.g., BMI category).
- Encode categorical variables for analytics.
Load Data
- Load the cleaned and transformed data into a Pandas DataFrame.
Validate & Visualize Data
- Check data quality and consistency (duplicates, nulls, ranges).
- Perform descriptive statistics and correlation analysis.
- Visualize insights, predictive models, and the impact of geographic regions.

Acceptance Criteria

1. Data Extraction

1 The pipeline shall extract raw data from Kaggle in CSV format, including fields: age, gender, BMI, smoking status, family size, region, and insurance cost.
2 The extraction process must log missing or duplicate records and report extraction success/failure.

2. Data Transformation

1 The pipeline shall clean data by removing duplicates, handling missing values, and standardizing field formats (e.g., gender as ‘Male’/‘Female’).
2 All numerical fields (age, BMI, family size, insurance cost) must be validated to fall within realistic, pre-defined ranges.
3 Categorical variables (gender, smoking status, region) must be encoded for downstream analytics.
4 The pipeline shall create derived features, such as BMI category and age group.

3. Data Loading

1 The cleaned and transformed data shall be loaded into a Pandas DataFrame following a pre-defined schema.
2 The loading process must be IDE potent, preventing duplicate entries.
3 The pipeline must validate and log each load, including row counts and data checksums.

4. Data Quality & Integrity

1 The pipeline shall include automated data quality checks (e.g., null checks, referential integrity) and alert on anomalies.
2 All transformation steps must be documented and auditable.

5. Security & Compliance

1 Sensitive attributes (age, region) must be handled in compliance with data privacy regulations (e.g., HIPAA, GDPR).
2 Access to data must be restricted and audit-logged.

6. Documentation & Monitoring

1 The ETL process shall be documented with setup, configuration, and troubleshooting instructions.
2 The pipeline must include monitoring and alerting for failures or significant data anomalies.

Additional Requirements for Analysis & Reporting

The pipeline must enable:
Descriptive statistics (average insurance charges by age, gender, region, etc.).
Correlation and visualisation reports between personal features and insurance cost.
Visualization of geographic factors on insurance cost using maps.

Visualizations

Distribution of insurance charges
Correlation heatmaps
Age vs. charges scatter plots
Categorical comparisons (e.g., smoker vs. non-smoker)
Gegographic factors in relation to healthcare insurance cost [https://greggpeter234.github.io/Individual_Project-1/]

Credits

Copilot for code correction, review and explanation
Chatgpt for code research and context.
Professional definition for Agile methodology. Beck, K., et al.(2001). Manifesto for Agile Software Development. [https://agilemanifesto.org]

Acknowledgements

First and foremost my partner who provided support and understanding through out the course of this project and undertood that i have to be left alone to work and also helped in bringing insights and suggestions for my visualisation section.
To my instructors at code institute who have been very supportive and helpful, who take their time to ensure we are all good even during times i lost confidence in myself. Most especially to Emma Lamont who is very reassuring and patient with us and John Rearden for always being there and ready to assist in helping to resolve all the coding and Technical stuff.

Created as part of my Data Analytics bootcamp journey!# Individual-Project-1

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
.vscode		.vscode
datasets		datasets
jupyter_notebooks		jupyter_notebooks
venv		venv
.gitignore		.gitignore
.slugignore		.slugignore
Healthcare_insurance.ipynb		Healthcare_insurance.ipynb
Procfile		Procfile
README.md		README.md
firstproject.py		firstproject.py
index.html		index.html
requirements.txt		requirements.txt
runtime.txt		runtime.txt
setup.sh		setup.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Healthcare Insurance ETL & Data Visualization Project

Overview

Objectives

Technologies Used

Key Findings

Project Structure

Key Steps

Agile Methodothology

User Story

Tasks

Acceptance Criteria

1. Data Extraction

2. Data Transformation

3. Data Loading

4. Data Quality & Integrity

5. Security & Compliance

6. Documentation & Monitoring

Additional Requirements for Analysis & Reporting

Visualizations

Credits

Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Healthcare Insurance ETL & Data Visualization Project

Overview

Objectives

Technologies Used

Key Findings

Project Structure

Key Steps

Agile Methodothology

User Story

Tasks

Acceptance Criteria

1. Data Extraction

2. Data Transformation

3. Data Loading

4. Data Quality & Integrity

5. Security & Compliance

6. Documentation & Monitoring

Additional Requirements for Analysis & Reporting

Visualizations

Credits

Acknowledgements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages