This is a study on how personal attributes and geographic factors influence insurance charges based on a healthcare insurance dataset extracted from Kaggle.
The dataset contains information on the relationship between personal attributes (age, gender, BMI, family size, smoking habits) and geographic factors and their impact on medical insurance charges. This can be downloaded from the link below.
https://www.kaggle.com/datasets/willianoliveiragibin/healthcare-insurance
- Display basic statistics, including average insurance charges by age, gender and region.
- Visualise correlations between different attributes and insurance charges.
- Geographic Analysis: Visualise the impact of geographic regions on insurance charges.
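The first two requirements can be sketched with Pandas as follows. This is a minimal illustration, not the project's actual notebook code: the column names (`age`, `sex`, `bmi`, `children`, `smoker`, `region`, `charges`) are assumed to match the Kaggle file, the rows are made-up stand-ins, and the age bands are an arbitrary choice.

```python
import pandas as pd

# Hypothetical stand-in rows; in practice the Kaggle CSV would be loaded instead.
# Column names are an assumption about the dataset's schema.
df = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 61],
    "sex": ["female", "male", "male", "female", "male", "female"],
    "bmi": [27.9, 22.7, 30.1, 33.4, 25.0, 29.5],
    "children": [0, 1, 2, 3, 0, 1],
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "region": ["southwest", "northwest", "southeast",
               "southeast", "northeast", "northwest"],
    "charges": [16000.0, 4500.0, 8200.0, 39000.0, 3800.0, 13500.0],
})

# Average charges by gender and by region
mean_by_sex = df.groupby("sex")["charges"].mean()
mean_by_region = df.groupby("region")["charges"].mean()

# Average charges by age band (band edges are illustrative only)
age_band = pd.cut(df["age"], bins=[17, 30, 45, 65],
                  labels=["18-30", "31-45", "46-65"])
mean_by_age_band = df.groupby(age_band, observed=True)["charges"].mean()

# Correlation of each numeric attribute with charges; smoker is encoded as 0/1
numeric = df.assign(smoker_flag=(df["smoker"] == "yes").astype(int))
corr_with_charges = numeric[
    ["age", "bmi", "children", "smoker_flag", "charges"]
].corr()["charges"]
```

Encoding `smoker` as a 0/1 flag is one simple way to include it in a correlation table; other encodings would work equally well.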
Insurance charge is expected to:
- increase with age
- be higher for smokers
- be lower for females
- vary with BMI
Family size may also be a contributing factor to insurance charges.
These hypotheses will be examined and validated through the analyses and visualisations set out under the business requirements.
The project is managed over six stages, supported by a GitHub project board (https://github.com/users/8osco/projects/6/views/1):
- Project setup with a new GitHub repo for storage and version control, VS Code as the IDE, and access to Kaggle for the dataset
- Data extract and familiarisation
- Data cleaning and preparation
- Data quality check
- Data analysis and visualisation
- README documentation
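The data cleaning and data quality check stages could look something like the sketch below. The helper names `quality_report` and `clean` are hypothetical, the sample rows are made up, and dropping rows with missing values is an assumption rather than the approach taken in this project.

```python
import pandas as pd

TEXT_COLUMNS = ["sex", "smoker", "region"]  # assumed categorical columns


def quality_report(df: pd.DataFrame) -> dict:
    """Summarise row count, missing values and exact duplicate rows."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Normalise text columns, then drop duplicates and incomplete rows."""
    out = df.copy()
    for col in TEXT_COLUMNS:
        # Trim whitespace and lower-case so "Male " and "male" match
        out[col] = out[col].str.strip().str.lower()
    out = out.drop_duplicates().dropna()
    return out.reset_index(drop=True)


# Made-up rows containing one duplicate, one missing BMI and untidy labels
raw = pd.DataFrame({
    "age": [19, 19, 33, 45],
    "sex": ["female", "female", "male", "Male "],
    "bmi": [27.9, 27.9, None, 30.1],
    "smoker": ["yes", "yes", "no", "no"],
    "region": ["southwest", "southwest", "northwest", "Southeast"],
    "charges": [16000.0, 16000.0, 4500.0, 8200.0],
})

before = quality_report(raw)
cleaned = clean(raw)
after = quality_report(cleaned)
```

Running the report before and after cleaning gives a quick check that the cleaning step removed what it was meant to remove.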
Insurance charges are set based on a number of factors. The relationships between charges and these factors are best examined and visualised in charts. Carefully designed interactive features are invaluable for comparing individual factors and combinations of factors. They can also make it easier to identify outliers or areas for further examination.
The relationship between charges and age, for example, is best represented in a chart, given the value range of both variables. Generating subplots that break the relationship down further by region, gender, smoker status, etc. provides more information to help understand the inter-relationships amongst the variables.
My analysis begins with data familiarisation and an attempt to understand the data distribution.
This provides some indication of the materiality of different factors and of the limitations of the analysis (e.g. due to the limited data points available).
This helps prioritise and feed into the design of the visualisations.
Interactive features are particularly useful for this project, as they make it possible to switch between and compare visualisations to help observe relationships and patterns. Code Institute learning materials, tutoring and AI tools have been helpful at every stage, from visual design and code optimisation to visualisation enhancements.
The data has already been anonymised at source.
The results of the analysis may raise ethical questions, for example if charges are noticeably higher in one region, or if there is a consistent differential between male and female charges.
I explored the use of dotted lines in the plots to make the distinction between lines clearer. Whilst the AI tool offered some suggestions, I was not able to implement them successfully. This needs further experimentation.
The image added at the top of this README file does not seem to load on GitHub, although it can be seen in the VS Code preview. I would like to follow up to understand why. I have also tried changing the image size in code, but it didn't work.
There are many routes, from both the coding and the data structuring perspectives, to arrive at the same output. I would like to explore the most efficient and effective way to do this through further testing and experimenting, and by getting a better handle on the strengths and limitations of different packages.
NumPy and Pandas for data interrogation and manipulation. Seaborn and Plotly, supported by Matplotlib, for the visualisations.
Code Institute course materials, SME, data coach, PDBA sessions. ChatGPT for code clarification and creation. The image at the top of this README file was sourced from Kaggle.
A special thanks to Project Group 2, Apr 2025 data analytics cohort and coaches. The support received and discussions have been invaluable.
