Project Car Price Analysis

Project Car Price Analysis is a comprehensive data analysis of a car price dataset. Our primary objective was to explore the key factors that influence car prices and to use these insights to inform the development of a professional, interactive dashboard. By uncovering which specifications, features, and configurations have the strongest impact on a vehicle's price, the analysis provides useful intelligence for automotive businesses and enthusiasts.

Dataset Content

The Car Price Prediction (Multiple Linear Regression) dataset, sourced from Kaggle (https://www.kaggle.com/datasets/hellbuoy/car-price-prediction), is a collection of information on various automobiles, designed for building a machine learning model to predict car prices.
The dataset consists of two files:

CarPrice_Assignment.CSV: This is the main data file, containing 205 observations and 26 features. The features include details such as the car's make and model, technical specifications (e.g., engine size, horsepower, city and highway MPG), physical dimensions (e.g., wheelbase, length, width, height), and other key attributes like fuel type, number of doors, and engine location. The target variable for prediction is the price of the car.
Data Dictionary - carprices.xlsx: This supplementary file provides a detailed data dictionary, offering clear descriptions for each of the features found in the CSV file. This helps in understanding the meaning and data type of each column, which is crucial for data analysis and model development.

Business Requirements

Core Objective: The primary objective is to serve as a data analytics consulting team for Geely Auto, providing a clear, data-backed strategy for their entry into the competitive US automobile market. The project addresses the key business questions related to car pricing and market dynamics.

Data Analysis Goals:

Analyse car prices based on various independent variables to understand the factors affecting car pricing in the US market.
Develop interactive dashboards to provide insights into car pricing dynamics, helping the company design cars and develop business strategies that meet market demands.

Context:

An automobile company aims to enter the US market and compete with local and European manufacturers. To adjust its strategies and designs accordingly, it needs to understand the factors influencing car prices in the US market. The dataset includes various car attributes, such as make, model, engine size, and price.

How the Project Addresses the Objectives:

Pricing Factor Analysis: We will analyze the provided dataset to identify which variables (e.g., horsepower, fuel efficiency, engine size, brand) are most significant in predicting a car's price in the American market. This directly answers Geely's need to understand how pricing works outside of their home market.
Competitive Strategy Development: Instead of a single, broad analysis, our project focuses on the four identified customer personas (Budget Buyer, Family Buyer, Luxury Buyer, and Eco-friendly Buyer). We will provide specific, tailored insights for how Geely can design cars to be competitive within each market segment.
Actionable Insights: The final deliverable will not just be a set of visualizations but a comprehensive strategy. The presentation will guide Geely on what features to prioritize and what market segments offer the most promising opportunities for a new competitor.

Hypothesis and how to validate?

Project Hypotheses

Car price is strongly influenced by engine size, horsepower, and brand. Validation: Use correlation analysis and regression plots to measure the relationship between price and these features. High correlation coefficients and clear trends in visualisations will support this hypothesis.
Composite scores for family/city, offroad, and sportscar categories effectively segment cars by suitability. Validation: Compare top-scoring cars in each category with their actual specifications and market positioning. Use visualisations to check if the scores align with expected car types.
Luxury cars (top 25% by price) exhibit distinct feature patterns compared to non-luxury cars. Validation: Use box plots and summary statistics to compare features (e.g., power_to_weight, car_volume) between luxury and non-luxury segments.
Fuel efficiency (mpg_ratio) and power-to-weight ratio are key differentiators for buyers with practical vs. performance needs. Validation: Segment cars by these features and analyze their distribution and impact on price and category scores.

Validation Approach

Perform statistical analysis (correlation, regression, summary statistics) using the processed dataset.
Use data visualisations (pairplots, box plots, heatmaps, score charts) to visually confirm relationships and segmentation.
Cross-check results with domain knowledge and market expectations to ensure findings are meaningful and actionable.
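
As a minimal sketch of the correlation step, the snippet below ranks numeric features by their absolute Pearson correlation with price. The helper name rank_price_drivers is hypothetical, and the small DataFrame holds made-up values for illustration only; in practice the input would be the processed dataset.

```python
import pandas as pd


def rank_price_drivers(df: pd.DataFrame, target: str = "Price") -> pd.Series:
    """Return absolute Pearson correlations with `target`, strongest first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


# Made-up rows purely for illustration (not taken from the dataset).
demo = pd.DataFrame(
    {
        "enginesize": [90, 110, 130, 180, 200],
        "horsepower": [60, 85, 100, 150, 180],
        "carheight": [54.0, 52.0, 55.0, 53.0, 54.5],
        "Price": [6000, 8500, 11000, 20000, 26000],
    }
)

ranking = rank_price_drivers(demo)
print(ranking)
```

A high value here supports a hypothesis only in the sense of a linear association; the regression plots and domain cross-checks listed above are still needed before drawing conclusions.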

🗺️ Project Plan

The project is delivered over four structured milestones, each designed to scaffold a reproducible analytics workflow and map directly to the repo structure.

🔹 Milestone 1 — Project Setup · Kick‑off (Day 0)

Objective
Establish a working, collaborative delivery environment so the team can start shipping immediately and predictably.

Key Actions

Create GitHub repository and Projects (Kanban) board with status columns, labels, and issue/PR templates.
Define folder structure: Data/raw, Data/clean, jupyter_notebooks, Media/, Dashboard/.
Stage the Kaggle dataset into Data/raw/ (cars.csv, Data Dictionary - carprices.xlsx).
Seed ETL/EDA backlog and link issues/PRs to milestones; assign owners/reviewers.

Success Criteria
Any teammate can clone, install deps, and run the first notebook without friction. The board provides real‑time visibility into planned, active, and completed work.

🔹 Milestone 2 — Initial ETL Build & First Visuals (Day 1)

Objective
Turn the raw dataset into a clean, analysis‑ready table and verify the pipeline end‑to‑end.

Key Actions

Implement notebook‑driven ETL (jupyter_notebooks/ETL.ipynb): parsing brand/model, standardizing categorical labels, converting units/types, engineering features, and exporting Data/clean/cars_processed.csv (index=False).
Handle missing values safely (e.g., Subaru model → dl), keep outliers by design, and document assumptions.
Produce first EDA visuals in jupyter_notebooks/Visualisation.ipynb and save artifacts to Media/.
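
The parsing and export steps might look roughly like the sketch below. It assumes the raw file exposes a single CarName column holding "brand model" strings; the function name split_car_name and the sample rows are illustrative, not taken from ETL.ipynb. The dl fallback mirrors the Subaru example above.

```python
import pandas as pd


def split_car_name(df: pd.DataFrame, name_col: str = "CarName") -> pd.DataFrame:
    """Split a 'brand model' string into separate brand and model columns."""
    out = df.copy()
    # Split on the first space only, so multi-word models stay intact.
    parts = out[name_col].str.strip().str.split(" ", n=1, expand=True)
    out["brand"] = parts[0].str.lower()
    out["model"] = parts[1] if parts.shape[1] > 1 else pd.NA
    # Fill the missing Subaru model explicitly rather than dropping the row.
    missing_subaru = out["model"].isna() & (out["brand"] == "subaru")
    out.loc[missing_subaru, "model"] = "dl"
    return out.drop(columns=[name_col])


# Made-up rows for illustration only.
raw = pd.DataFrame(
    {
        "CarName": ["alfa-romero giulia", "subaru", "toyota corolla 1200"],
        "price": [10000.0, 5000.0, 7000.0],
    }
)
clean = split_car_name(raw)

# index=False keeps the pandas row index out of the exported file.
csv_text = clean.to_csv(index=False)
print(csv_text)
```

Passing a path such as Data/clean/cars_processed.csv to to_csv instead of capturing the string would write the file to disk.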

🔹 Milestone 3 — ETL Refinement, Dashboards & Testing (Day 2)

Objective
Harden and extend the pipeline to support decision‑grade insights and prep the first dashboard build.

Key Actions

Add and validate comparability scores (City/Family, Outdoor/Off‑Road, Sport): per‑feature normalization (invert where higher is worse), sum, and re‑normalize to 0–1.
Quantify drivers of price: compute absolute correlations and visualize strongest relationships (heatmap + scatter); export to Media/.
Make paths robust (prefer pathlib), ensure idempotent/deterministic writes.
Draft Power BI page layout (overview + persona pages) and field naming conventions aligned with ETL (e.g., Engine Size (L)).
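
The scoring recipe described above (normalize each feature, invert where higher is worse, sum, re-normalize to 0–1) can be sketched as follows. This is an illustration under assumptions: min-max scaling is assumed as the normalization, the helper names are hypothetical, and the features chosen for the example city score are not necessarily the ones used in ETL.ipynb.

```python
import pandas as pd


def min_max(series: pd.Series) -> pd.Series:
    """Scale a numeric series to the 0-1 range (constant series map to 0)."""
    span = series.max() - series.min()
    if span == 0:
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span


def composite_score(df, higher_is_better, lower_is_better=()):
    """Normalize each feature, invert 'lower is better' ones, sum, rescale."""
    total = pd.Series(0.0, index=df.index)
    for col in higher_is_better:
        total += min_max(df[col])
    for col in lower_is_better:
        # Inversion: the smallest raw value earns the highest contribution.
        total += 1.0 - min_max(df[col])
    return min_max(total)


# Made-up rows; the feature selection is an assumption for illustration.
demo = pd.DataFrame(
    {
        "citympg": [20, 30, 40],
        "car_volume": [500.0, 600.0, 700.0],
        "curbweight": [3000, 2500, 2000],
    }
)
demo["city_score_normalized"] = composite_score(
    demo, higher_is_better=["citympg", "car_volume"], lower_is_better=["curbweight"]
)
print(demo)
```

Because every feature is scaled before summing, each one contributes equally; weighting features differently would be a separate design decision.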

🔹 Milestone 4 — Final Refinements, Presentation, Documentation & Publish (Day 3)

Objective
Polish deliverables for readability and accessibility, and package the work for sharing and demo.

Key Actions

Freeze Data/clean/cars_processed.csv and commit all Media/ visuals (PNGs/HTML).
Populate the presentation deck with an EDA narrative and figure references.
(Optional) Publish a GitHub Release bundling CSV, figures, and PBIX when ready.

📁 Repository Structure (scaffold)

car-price-analysis-hackathon-team4/
├─ Dashboard/                     # PBIX + exported screenshots
├─ Data/
│  ├─ raw/                        # Kaggle CSV + data dictionary
│  └─ clean/
│     └─ cars_processed.csv       # ETL output (written by ETL.ipynb)
├─ jupyter_notebooks/
│  ├─ ETL.ipynb                   # cleaning, feature engineering, scoring, export
│  ├─ Visualisation.ipynb         # EDA (dists, correlations, outliers, scores)
│  └─ Notebook_Template copy.ipynb
├─ Media/                         # EDA figures/HTML + deck assets (images, GIFs)
├─ .gitignore
├─ .python-version
├─ .slugignore
├─ Procfile
├─ README.md
├─ requirements.txt               
└─ setup.sh

1️⃣ High‑Level Steps Taken for the Analysis

Phase	Description
Setup	Repo creation, Kanban board setup, folder scaffolding, dataset staging
ETL Build	Raw‑to‑clean transformation, notebook‑driven pipeline, first visuals
EDA & Scores	Distributions, correlations, comparability scores, outlier rationale
Finalization	Presentation narrative, documentation, (optional) GitHub Release

2️⃣ Data Management Across All Phases

Collection — Kaggle dataset staged in Data/raw/, tracked via GitHub issues.
Processing — ETL notebooks version‑controlled; outputs stored in Data/clean/.
Analysis — EDA steps logged in notebooks; visuals exported to Media/.
Interpretation — Insights narrated in the presentation; dashboard framework prepared.
Collaboration — Issues/PRs tied to milestones; reviewers ensure reproducibility and clarity.

3️⃣ Justification of Research Methodologies

Methodology	Rationale
Notebook‑Driven ETL	Transparent, auditable, easy onboarding
Power BI (planned)	Stakeholder‑friendly, interactive exploration
Milestone‑Based Delivery	Agile iteration, clear accountability
GitHub Project Management	Labels, issues, reviewers → traceability

👥 Collaboration & Teamwork

GitHub Projects — Kanban board, labels, milestones, and linked issues/PRs.
Google Drive — Shared workspace for presentation and shared docs.
Discord — Daily check‑ins, async updates, links to issues and previews.

📦 Final Deliverables (current state)

✅ Cleaned dataset: Data/clean/cars_processed.csv
✅ EDA artifacts saved in Media/ (distributions, correlations, outliers, score‑vs‑price)
✅ Jupyter notebooks: jupyter_notebooks/ETL.ipynb, jupyter_notebooks/Visualisation.ipynb
🧭 Power BI dashboard: framework defined, PBIX to be built in Dashboard/

The rationale to map the business requirements to the Data Visualisations

Business Requirements

Understand key factors influencing car prices - identify which car attributes (e.g., engine size, horsepower, brand, body type) most strongly affect market price.
Segment cars by category and suitability - classify cars into family/city, offroad, and sportscar categories to support targeted recommendations.
Detect market diversity and outliers - recognize rare or luxury vehicles and price variability to reflect genuine market diversity.
Assess fuel efficiency and performance - compare cars based on fuel economy and power-to-weight ratios for practical and performance-oriented buyers.
Support decision-making for buyers and sellers - provide clear, actionable insights for different user groups (e.g., families, enthusiasts, dealers).

Rationale for Data Visualisations

To map the business requirements to meaningful data visualisations, we work with both raw and engineered features. Some features, such as power_to_weight, car_volume, and the various *_score metrics, were created during the ETL phase to provide more actionable insights. These features, along with the original data, serve as the input for the visualisation phase, enabling analysis of market segmentation, performance, fuel efficiency, and suitability for different driving needs.

Pairplots and correlation heatmaps reveal relationships between price and key numeric features, directly addressing the need to understand price drivers.
Box plots by car body and drivewheel visualize market segmentation and price variability, supporting category-based analysis and outlier detection.
Composite score visualisations (city, outdoor, sport) map cars to their suitability for different use cases, enabling targeted recommendations.
Fuel efficiency and performance charts help users compare cars on practical and technical criteria, meeting requirements for efficiency and performance assessment.
Metadata tables and summary statistics provide transparency and context, ensuring that insights are actionable and relevant for business decisions.
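
To make the box plot rationale concrete, the sketch below computes the per-group price quartiles and extremes that a box plot is built from. The helper name price_summary_by and the sample rows are illustrative assumptions. Note that plotting libraries typically draw whiskers at 1.5 times the interquartile range rather than at the min and max, so extreme cars appear as separate outlier points.

```python
import pandas as pd


def price_summary_by(df: pd.DataFrame, group_col: str, price_col: str = "Price") -> pd.DataFrame:
    """Per-group min, quartiles, and max of price."""
    summary = df.groupby(group_col)[price_col].describe()
    return summary[["min", "25%", "50%", "75%", "max"]]


# Made-up rows for illustration only.
demo = pd.DataFrame(
    {
        "carbody": ["sedan", "sedan", "sedan", "hatchback", "hatchback", "hatchback"],
        "Price": [10000, 12000, 20000, 6000, 7000, 8000],
    }
)

summary = price_summary_by(demo, "carbody")
print(summary)
```

The same helper can be called with drivewheel as the grouping column to mirror the second box plot.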

Car Dataset Features:

Column	Description	Significance
`car_ID`	Unique car identifier	Distinguishes records
`brand`	Car manufacturer	Trend & market analysis
`model`	Specific car model	Comparison between models
`symbolling`	Insurance risk rating	Indicates safety risk
`fueltype`	Fuel type (gas/diesel)	Impacts cost, emissions, performance
`aspiration`	Engine aspiration (standard/turbo)	Affects engine power & efficiency
`doornumber`	Number of doors	Family/city suitability
`carbody`	Car body style	Market segmentation & price
`drivewheel`	Drivetrain type	Handling & offroad capability
`enginelocation`	Engine placement	Affects balance & design
`wheelbase`	Distance between axles (inches)	Stability & interior space
`carlength`	Car length (inches)	Interior space & parking
`carwidth`	Car width (inches)	Maneuverability & comfort
`carheight`	Car height (inches)	Headroom & ground clearance
`curbweight`	Car weight (lbs)	Fuel efficiency & handling
`enginetype`	Engine type/configuration	Technical analysis
`cylindernumber`	Number of cylinders	Engine power & smoothness
`enginesize`	Engine size (cubic inches)	Predictor of power & price
`fuelsystem`	Fuel delivery system	Efficiency & performance
`boreratio`	Bore-Stroke ratio	Engine design metric
`stroke`	Piston stroke length	Engine design metric
`compressionratio`	Compression ratio	Efficiency & power
`horsepower`	Engine power output	Performance & price
`peakrpm`	Max engine RPM	Performance characteristics
`citympg`	City fuel efficiency	Cost & environmental analysis
`Price`	Market price ($)	Target variable
`power_to_weight`	Horsepower / curbweight	Real-world performance
`car_volume`	Estimated car volume	Interior space & comfort
`mpg_ratio`	City MPG / highway MPG	Compare fuel efficiency
`is_luxury`	Luxury car flag	Market segmentation
`city_score`	Family/city suitability	Composite score
`outdoor_score`	Offroad suitability	Composite score
`sport_score`	Sportscar suitability	Composite score
`city_score_normalized`	Normalized city score (0-1)	Comparison across cars
`outdoor_score_normalized`	Normalized outdoor score (0-1)	Comparison across cars
`sport_score_normalized`	Normalized sport score (0-1)	Comparison across cars
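
A hedged sketch of how the engineered columns in the table could be derived is shown below. The power_to_weight and mpg_ratio formulas follow the table, and the luxury flag follows the top-25%-by-price definition from the hypotheses. The bounding-box formula for car_volume, the highwaympg column name, and the function name add_derived_features are assumptions made for illustration.

```python
import pandas as pd


def add_derived_features(df: pd.DataFrame, price_col: str = "Price") -> pd.DataFrame:
    """Add ratio, volume, and luxury-flag columns to a copy of `df`."""
    out = df.copy()
    out["power_to_weight"] = out["horsepower"] / out["curbweight"]
    out["mpg_ratio"] = out["citympg"] / out["highwaympg"]
    # Assumption: volume approximated as a bounding box (cubic inches).
    out["car_volume"] = out["carlength"] * out["carwidth"] * out["carheight"]
    # Luxury = price strictly above the 75th percentile.
    out["is_luxury"] = out[price_col] > out[price_col].quantile(0.75)
    return out


# Made-up rows for illustration only.
demo = pd.DataFrame(
    {
        "horsepower": [100, 120, 150, 300],
        "curbweight": [2000, 2400, 3000, 4000],
        "citympg": [30, 24, 20, 15],
        "highwaympg": [40, 32, 25, 20],
        "carlength": [160, 170, 180, 190],
        "carwidth": [64, 65, 66, 70],
        "carheight": [52, 54, 55, 50],
        "Price": [5000, 10000, 15000, 40000],
    }
)

enriched = add_derived_features(demo)
print(enriched[["power_to_weight", "mpg_ratio", "car_volume", "is_luxury"]])
```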

Ethical Considerations

As data analysts, it's essential to address the ethical implications of this project—especially when sharing it publicly on platforms like GitHub and Kaggle. Our commitment to best practices ensures responsible and transparent data science.

1. Data and Privacy

We use a public dataset from Kaggle, which contains no personally identifiable information (PII).
The dataset focuses solely on vehicle attributes, not individual owners.
This approach respects user privacy and aligns with best practices for data anonymity.

2. Bias and Fairness

The model’s accuracy depends on the quality and representativeness of the training data.
We acknowledge the risk of algorithmic bias, especially if the dataset over-represents certain car types or regions.
To mitigate this:

We analyzed feature correlations
We documented potential biases in the analysis process
We aim to maintain fairness and transparency in predictions

3. Transparency and Accountability

This project is open-source, with clearly documented:

Data sources
Modeling assumptions
Known limitations

Accessibility for Color Blind Users

All visualizations in this project have been tested for accessibility using COBLIS (Color Blindness Simulator). The accompanying image demonstrates how the charts appear under a wide range of color vision conditions, including:

• Red-Weak (Protanomaly)
• Green-Weak (Deuteranomaly)
• Blue-Weak (Tritanomaly)
• Red-Blind (Protanopia)
• Green-Blind (Deuteranopia)
• Blue-Blind (Tritanopia)
• Monochromacy (Achromatopsia)
• Blue Cone Monochromacy

Each chart was evaluated across these conditions to ensure that key visual distinctions (such as data clusters, trends, and histogram distributions) remain perceptible and interpretable regardless of color vision type. Based on the analysis of the simulation grid, no further adjustments were necessary. The charts maintain sufficient contrast, shape differentiation, and layout clarity, making them accessible to users with all forms of color vision deficiency. This ensures that the visual integrity of the data is preserved without relying solely on color cues.

By proactively testing and validating our visualizations, we aim to support inclusive design and make data insights available to a broader audience.

Dashboard Design

We created three dashboard pages, each corresponding to a segment of the car market. Every page combines a range of visualisations, some interactive and some static, with filters.
Some visualisations use pre-engineered KPIs such as City Score, whereas others use more technical data such as Engine Size.
Easy-to-read visuals such as tables and donut charts serve a less technical audience, while scatter plots and histograms serve our technical audience.

Unfixed Bugs

Absolute file paths were used when reading the CSV file containing the unprocessed dataset. Unless the user manually changes this path, the notebook cannot load the dataset into a DataFrame.
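
One possible way to address this bug is to resolve the data path relative to the repository root instead of hard-coding it. The sketch below is not the project's current code: the helper names find_repo_root and raw_data_path are hypothetical, and using README.md as the root marker is an assumption based on the repository structure.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = "README.md") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker} found at or above {start}")


def raw_data_path(start: Path, filename: str) -> Path:
    """Build the path to a file in Data/raw/, wherever the notebook is run from."""
    return find_repo_root(start) / "Data" / "raw" / filename


# Typical notebook usage (commented out so the sketch runs anywhere):
# import pandas as pd
# df = pd.read_csv(raw_data_path(Path.cwd(), "cars.csv"))
```

With this approach the notebook loads the same file whether it is started from the repository root or from jupyter_notebooks/.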

Development Roadmap

Throughout the project, we adopted an agile workflow, holding frequent team meetings to ensure clear communication, avoid misunderstandings, and keep everyone aligned on objectives and deliverables. This collaborative approach helped us quickly address challenges and adapt to changing requirements.

Challenges and Strategies

We encountered several technical challenges, including data cleaning complexities, feature engineering decisions, and integration of multiple analysis tools.
Managing the project on GitHub presented issues such as merge conflicts and branching problems. With the support of our tutors, as well as through our own initiative, we learned to resolve these conflicts efficiently. This hands-on experience greatly improved our understanding of collaborative version control and project management.
Regular meetings and open communication channels allowed us to identify and address blockers early, ensuring steady progress and a shared understanding of project goals.

Skills and Tools for Future Development

Based on our experience, we plan to further develop our skills in advanced data visualization (e.g., Power BI, interactive dashboards), machine learning techniques, and cloud-based deployment solutions.
We aim to deepen our expertise in collaborative tools like GitHub, as well as project management methodologies to support larger, more complex analytics projects in the future.
Continuous learning and teamwork will remain central to our approach, building on the strong foundation established during this project.

Deployment

This project is designed to be run locally in a Jupyter Notebook environment. To deploy and use the analysis:

Clone the repository: Download or clone the project files to your local machine.
Set up the environment: Install the required Python packages with pip install -r requirements.txt, and ensure that Jupyter Notebook or JupyterLab is installed.
Prepare the data: Place the raw dataset files in the Data/raw/ directory (the files are listed under Dataset Content).
Run the notebooks: Open and execute the ETL.ipynb notebook to process and clean the data, then run the Visualisation.ipynb notebook to generate visualisations and insights.
View the Power BI file: Open the provided .pbix file in Microsoft Power BI Desktop, which can be downloaded for free from the official Microsoft website, and use it to explore interactive dashboards and visualisations based on the processed dataset.

Outputs

The processed dataset is saved to Data/clean/cars_processed.csv. Visual outputs and metadata tables are generated within the notebooks.

Notes

No web or cloud deployment is required; all analysis is performed locally. For sharing results, export notebook outputs, processed data, or Power BI dashboards as needed. If you wish to deploy as a web app or dashboard, consider using frameworks like Streamlit or Dash (not included in this project).

Main Data Analysis Libraries

This project leverages a combination of Python libraries and BI tools for data exploration, visualization, and modeling:

Pandas — for data manipulation and cleaning
NumPy — for numerical operations and array handling
Matplotlib — for basic plotting and static visualizations
Seaborn — for statistical graphics and enhanced plots
Plotly — for interactive and dynamic visualizations
Power BI — for dashboarding and business intelligence reporting

Credits

Content

Dataset
Car Price Prediction (Multiple Linear Regression) — sourced from Kaggle.
This dataset was used to explore factors influencing car prices and build regression models.
References
- GitHub repositories consulted for best practices in modular project setup and Git workflows
Generative AI Tools
- ChatGPT (Microsoft Copilot) — for ideation, documentation scaffolding, and troubleshooting support
- GitHub Copilot — for code suggestions and inline development assistance
Team Collaboration
- Contributions from team members in planning, coding, and documentation phases

Media

Screenshots of the Git Kanban Board and other visual assets are stored in the Media/ directory

Acknowledgements

Thanks to the team for collaboration across GitHub Projects, Google Drive, and Discord — and to mentors/coaches for feedback throughout the hackathon.
