Project Car Price Analysis

Project Car Price Analysis is a data analysis of a car price dataset. Our primary objective was to explore the key factors that influence car prices and to use these insights to inform the development of a professional, interactive dashboard. By uncovering which specifications, features, and configurations have the strongest impact on a vehicle's price, the analysis provides useful intelligence for automotive businesses and enthusiasts.

Dataset Content

  • The Car Price Prediction Multiple Linear Regression dataset, sourced from Kaggle https://www.kaggle.com/datasets/hellbuoy/car-price-prediction, is a comprehensive collection of information on various automobiles, designed for building a machine learning model to predict car prices.

  • The dataset is composed of two primary files:

  1. CarPrice_Assignment.CSV: This is the main data file, containing 205 observations and 26 features. The features include details such as the car's make and model, technical specifications (e.g., engine size, horsepower, city and highway MPG), physical dimensions (e.g., wheelbase, length, width, height), and other key attributes like fuel type, number of doors, and engine location. The target variable for prediction is the price of the car.

  2. Data Dictionary - carprices.xlsx: This supplementary file provides a detailed data dictionary, offering clear descriptions for each of the features found in the CSV file. This helps in understanding the meaning and data type of each column, which is crucial for data analysis and model development.

Business Requirements

  • Core Objective: The primary objective is to serve as a data analytics consulting team for Geely Auto, providing a clear, data-backed strategy for their entry into the competitive US automobile market. The project addresses the key business questions related to car pricing and market dynamics.

Data Analysis Goals:

  • Analyse car prices based on various independent variables to understand the factors affecting car pricing in the US market.

  • Develop interactive dashboards to provide insights into car pricing dynamics, helping the company design cars and develop business strategies that meet market demands.

Context:

An automobile company aims to enter the US market and compete with local and European manufacturers. To adjust its strategies and designs accordingly, it needs to understand the factors influencing car prices in the US market. The dataset includes various car attributes, such as make, model, engine size, and price.

How the Project Addresses the Objectives:

  • Pricing Factor Analysis: We will analyze the provided dataset to identify which variables (e.g., horsepower, fuel efficiency, engine size, brand) are most significant in predicting a car's price in the American market. This directly answers Geely's need to understand how pricing works outside of their home market.

  • Competitive Strategy Development: Instead of a single, broad analysis, our project focuses on the four identified customer personas (Budget Buyer, Family Buyer, Luxury Buyer, and Eco-friendly Buyer). We will provide specific, tailored insights for how Geely can design cars to be competitive within each market segment.

  • Actionable Insights: The final deliverable will not just be a set of visualizations but a comprehensive strategy. The presentation will guide Geely on what features to prioritize and what market segments offer the most promising opportunities for a new competitor.

Hypothesis and how to validate?

Project Hypotheses

  1. Car price is strongly influenced by engine size, horsepower, and brand. Validation: Use correlation analysis and regression plots to measure the relationship between price and these features. High correlation coefficients and clear trends in visualisations will support this hypothesis.

  2. Composite scores for family/city, offroad, and sportscar categories effectively segment cars by suitability. Validation: Compare top-scoring cars in each category with their actual specifications and market positioning. Use visualisations to check if the scores align with expected car types.

  3. Luxury cars (top 25% by price) exhibit distinct feature patterns compared to non-luxury cars. Validation: Use box plots and summary statistics to compare features (e.g., power_to_weight, car_volume) between luxury and non-luxury segments.

  4. Fuel efficiency (mpg_ratio) and power-to-weight ratio are key differentiators for buyers with practical vs. performance needs. Validation: Segment cars by these features and analyze their distribution and impact on price and category scores.

Validation Approach

  • Perform statistical analysis (correlation, regression, summary statistics) using the processed dataset.
  • Use data visualisations (pairplots, box plots, heatmaps, score charts) to visually confirm relationships and segmentation.
  • Cross-check results with domain knowledge and market expectations to ensure findings are meaningful and actionable.
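
As an illustration of how hypothesis 3 could be checked with summary statistics, here is a minimal sketch. It assumes the processed table has `price`, `horsepower`, and `curbweight` columns and that "top 25% by price" means strictly above the 75th percentile. The helper name `flag_luxury` is hypothetical, and the rows are made-up illustration values, not data from the Kaggle file.

```python
import pandas as pd


def flag_luxury(df: pd.DataFrame, price_col: str = "price", q: float = 0.75) -> pd.Series:
    """Return True for cars priced above the q-th quantile (top 25% by default)."""
    threshold = df[price_col].quantile(q)
    return df[price_col] > threshold


# Made-up illustration rows (NOT taken from the Kaggle dataset).
cars = pd.DataFrame(
    {
        "price": [5000, 7000, 9000, 12000, 16000, 22000, 30000, 41000],
        "horsepower": [60, 70, 85, 100, 115, 150, 180, 260],
        "curbweight": [1900, 2000, 2200, 2400, 2600, 2900, 3200, 3700],
    }
)

cars["is_luxury"] = flag_luxury(cars)
cars["power_to_weight"] = cars["horsepower"] / cars["curbweight"]

# Per-segment summary statistics: the numeric counterpart of a box plot.
summary = cars.groupby("is_luxury")["power_to_weight"].describe()
print(summary[["count", "mean", "50%"]])
```

If the luxury segment shows a clearly different centre and spread for a feature, that supports the hypothesis. A box plot of the same grouping would present the comparison visually.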

🗺️ Project Plan

The project is delivered over four structured milestones, each designed to scaffold a reproducible analytics workflow and map directly to the repo structure.

🔹 Milestone 1 — Project Setup · Kick‑off (Day 0)

Objective
Establish a working, collaborative delivery environment so the team can start shipping immediately and predictably.

Key Actions

  • Create GitHub repository and Projects (Kanban) board with status columns, labels, and issue/PR templates.
  • Define folder structure: Data/raw, Data/clean, jupyter_notebooks, Media/, Dashboard/.
  • Stage the Kaggle dataset into Data/raw/ (cars.csv, Data Dictionary - carprices.xlsx).
  • Seed ETL/EDA backlog and link issues/PRs to milestones; assign owners/reviewers.

Success Criteria
Any teammate can clone, install deps, and run the first notebook without friction. The board provides real‑time visibility into planned, active, and completed work.


🔹 Milestone 2 — Initial ETL Build & First Visuals (Day 1)

Objective
Turn the raw dataset into a clean, analysis‑ready table and verify the pipeline end‑to‑end.

Key Actions

  • Implement notebook‑driven ETL (jupyter_notebooks/ETL.ipynb): parsing brand/model, standardizing categorical labels, converting units/types, engineering features, and exporting Data/clean/cars_processed.csv (index=False).
  • Handle missing values safely (e.g., Subaru model → dl), keep outliers by design, and document assumptions.
  • Produce first EDA visuals in jupyter_notebooks/Visualisation.ipynb and save artifacts to Media/.
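
The ETL steps above could look roughly like the sketch below. It assumes the raw file stores brand and model together in a single `CarName` column (an assumption about the raw schema) and uses an in-memory sample with made-up values in place of the real files. Only the `dl` fallback for Subaru and the `index=False` export come from this README.

```python
import io

import pandas as pd

# Made-up sample in the shape assumed for the raw file; the real columns may differ.
raw_csv = io.StringIO(
    "car_ID,CarName,price\n"
    "1,alfa-romero giulia,10000\n"
    "2,subaru,5000\n"
    "3,toyota corona mark ii,6000\n"
)
raw = pd.read_csv(raw_csv)

# Split on the first space only: first token is the brand, the rest is the model.
parts = raw["CarName"].str.strip().str.split(" ", n=1, expand=True)
raw["brand"] = parts[0].str.lower()
raw["model"] = parts[1]

# Fill the missing Subaru model with "dl", following the example given above.
missing_subaru = raw["model"].isna() & (raw["brand"] == "subaru")
raw.loc[missing_subaru, "model"] = "dl"

clean = raw.drop(columns=["CarName"])

# index=False keeps the pandas row index out of the exported file.
out = io.StringIO()  # stand-in for Data/clean/cars_processed.csv
clean.to_csv(out, index=False)
print(out.getvalue())
```

Splitting with `n=1` keeps multi-word model names such as "corona mark ii" intact instead of scattering them across several columns.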

🔹 Milestone 3 — ETL Refinement, Dashboards & Testing (Day 2)

Objective
Harden and extend the pipeline to support decision‑grade insights and prep the first dashboard build.

Key Actions

  • Add and validate comparability scores (City/Family, Outdoor/Off‑Road, Sport): per‑feature normalization (invert where higher is worse), sum, and re‑normalize to 0–1.
  • Quantify drivers of price: compute absolute correlations and visualize strongest relationships (heatmap + scatter); export to Media/.
  • Make paths robust (prefer pathlib), ensure idempotent/deterministic writes.
  • Draft Power BI page layout (overview + persona pages) and field naming conventions aligned with ETL (e.g., Engine Size (L)).
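
A minimal sketch of the scoring recipe described above: normalize each feature, invert where higher is worse, sum, and re-normalize to 0–1. The helper names and the feature choice for the sport score (horsepower up, curb weight down) are hypothetical and may not match the notebook. The values are made up.

```python
import pandas as pd


def min_max(series: pd.Series) -> pd.Series:
    """Scale a numeric series to the 0-1 range (a constant series maps to 0)."""
    span = series.max() - series.min()
    if span == 0:
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span


def composite_score(df: pd.DataFrame, higher_is_better, lower_is_better) -> pd.Series:
    """Normalize each feature, invert the 'lower is better' ones, sum, re-normalize."""
    parts = [min_max(df[col]) for col in higher_is_better]
    parts += [1.0 - min_max(df[col]) for col in lower_is_better]
    return min_max(sum(parts))


# Made-up illustration rows (NOT taken from the dataset).
cars = pd.DataFrame(
    {
        "horsepower": [70, 110, 200],
        "curbweight": [2400, 2000, 3000],
    }
)

cars["sport_score_normalized"] = composite_score(
    cars, higher_is_better=["horsepower"], lower_is_better=["curbweight"]
)
print(cars)
```

Because every feature is scaled before summing, no single feature dominates the score just because of its units. The final re-normalization makes scores comparable across the three categories.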

🔹 Milestone 4 — Final Refinements, Presentation, Documentation & Publish (Day 3)

Objective
Polish deliverables for readability and accessibility, and package the work for sharing and demo.

Key Actions

  • Freeze Data/clean/cars_processed.csv and commit all Media/ visuals (PNGs/HTML).
  • Populate the presentation deck with an EDA narrative and figure references.
  • (Optional) Publish a GitHub Release bundling CSV, figures, and PBIX when ready.

📁 Repository Structure (scaffold)

car-price-analysis-hackathon-team4/
├─ Dashboard/                     # PBIX + exported screenshots
├─ Data/
│  ├─ raw/                        # Kaggle CSV + data dictionary
│  └─ clean/
│     └─ cars_processed.csv       # ETL output (written by ETL.ipynb)
├─ jupyter_notebooks/
│  ├─ ETL.ipynb                   # cleaning, feature engineering, scoring, export
│  ├─ Visualisation.ipynb         # EDA (dists, correlations, outliers, scores)
│  └─ Notebook_Template copy.ipynb
├─ Media/                         # EDA figures/HTML + deck assets (images, GIFs)
├─ .gitignore
├─ .python-version
├─ .slugignore
├─ Procfile
├─ README.md
├─ requirements.txt               
└─ setup.sh

1️⃣ High‑Level Steps Taken for the Analysis

| Phase | Description |
| --- | --- |
| Setup | Repo creation, Kanban board setup, folder scaffolding, dataset staging |
| ETL Build | Raw‑to‑clean transformation, notebook‑driven pipeline, first visuals |
| EDA & Scores | Distributions, correlations, comparability scores, outlier rationale |
| Finalization | Presentation narrative, documentation, (optional) GitHub Release |

2️⃣ Data Management Across All Phases

  • Collection — Kaggle dataset staged in Data/raw/, tracked via GitHub issues.
  • Processing — ETL notebooks version‑controlled; outputs stored in Data/clean/.
  • Analysis — EDA steps logged in notebooks; visuals exported to Media/.
  • Interpretation — Insights narrated in the presentation; dashboard framework prepared.
  • Collaboration — Issues/PRs tied to milestones; reviewers ensure reproducibility and clarity.

3️⃣ Justification of Research Methodologies

| Methodology | Rationale |
| --- | --- |
| Notebook‑Driven ETL | Transparent, auditable, easy onboarding |
| Power BI (planned) | Stakeholder‑friendly, interactive exploration |
| Milestone‑Based Delivery | Agile iteration, clear accountability |
| GitHub Project Management | Labels, issues, reviewers → traceability |

👥 Collaboration & Teamwork

  • GitHub Projects — Kanban board, labels, milestones, and linked issues/PRs.
  • Google Drive — Shared workspace for presentation and shared docs.
  • Discord — Daily check‑ins, async updates, links to issues and previews.

📦 Final Deliverables (current state)

  • ✅ Cleaned dataset: Data/clean/cars_processed.csv
  • ✅ EDA artifacts saved in Media/ (distributions, correlations, outliers, score‑vs‑price)
  • ✅ Jupyter notebooks: jupyter_notebooks/ETL.ipynb, jupyter_notebooks/Visualisation.ipynb
  • 🧭 Power BI dashboard: framework defined, PBIX to be built in Dashboard/

The rationale to map the business requirements to the Data Visualisations

Business Requirements

  • Understand key factors influencing car prices - identify which car attributes (e.g., engine size, horsepower, brand, body type) most strongly affect market price.
  • Segment cars by category and suitability - classify cars into family/city, offroad, and sportscar categories to support targeted recommendations.
  • Detect market diversity and outliers - recognize rare or luxury vehicles and price variability to reflect genuine market diversity.
  • Assess fuel efficiency and performance - compare cars based on fuel economy and power-to-weight ratios for practical and performance-oriented buyers.
  • Support decision-making for buyers and sellers - provide clear, actionable insights for different user groups (e.g., families, enthusiasts, dealers).

Rationale for Data Visualisations

In order to map the business requirements to meaningful data visualizations, the dataset includes both raw and processed features. Some features, such as power_to_weight, car_volume, and the various *_score metrics, were created during the ETL phase to provide more actionable insights. These features, along with the original data, serve as the input for the visualization phase, enabling analysis of market segmentation, performance, fuel efficiency, and suitability for different driving needs.

  • Pairplots and correlation heatmaps reveal relationships between price and key numeric features, directly addressing the need to understand price drivers.
  • Box plots by car body and drivewheel visualize market segmentation and price variability, supporting category-based analysis and outlier detection.
  • Composite score visualisations (city, outdoor, sport) map cars to their suitability for different use cases, enabling targeted recommendations.
  • Fuel efficiency and performance charts help users compare cars on practical and technical criteria, meeting requirements for efficiency and performance assessment.
  • Metadata tables and summary statistics provide transparency and context, ensuring that insights are actionable and relevant for business decisions.
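
To show how the price-driver view could be quantified before plotting, the sketch below ranks numeric features by the absolute value of their correlation with price. The column names are assumed to match the processed dataset and the rows are made-up illustration values.

```python
import pandas as pd

# Made-up illustration rows (NOT taken from the dataset).
cars = pd.DataFrame(
    {
        "price": [5000, 7000, 9000, 12000, 16000, 22000],
        "enginesize": [90, 98, 110, 130, 150, 200],
        "horsepower": [60, 70, 85, 100, 115, 150],
        "citympg": [38, 33, 30, 26, 22, 17],
        "carheight": [52, 56, 50, 55, 53, 54],
    }
)

# Pearson correlation matrix over the numeric columns only.
corr = cars.select_dtypes("number").corr()

# Correlation of each feature with price, ranked by strength regardless of sign.
price_corr = corr["price"].drop("price")
ranked = price_corr.abs().sort_values(ascending=False)
print(ranked)
```

Ranking by absolute value matters because a strong negative relationship, such as fuel economy against price, is as informative as a strong positive one. The same `corr` matrix can be passed to `seaborn.heatmap` for the heatmap view.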

Car Dataset Features:

| Column | Description | Significance |
| --- | --- | --- |
| car_ID | Unique car identifier | Distinguishes records |
| brand | Car manufacturer | Trend & market analysis |
| model | Specific car model | Comparison between models |
| symboling | Insurance risk rating | Indicates safety risk |
| fueltype | Fuel type (gas/diesel) | Impacts cost, emissions, performance |
| aspiration | Engine aspiration (standard/turbo) | Affects engine power & efficiency |
| doornumber | Number of doors | Family/city suitability |
| carbody | Car body style | Market segmentation & price |
| drivewheel | Drivetrain type | Handling & offroad capability |
| enginelocation | Engine placement | Affects balance & design |
| wheelbase | Distance between axles (inches) | Stability & interior space |
| carlength | Car length (inches) | Interior space & parking |
| carwidth | Car width (inches) | Maneuverability & comfort |
| carheight | Car height (inches) | Headroom & ground clearance |
| curbweight | Car weight (lbs) | Fuel efficiency & handling |
| enginetype | Engine type/configuration | Technical analysis |
| cylindernumber | Number of cylinders | Engine power & smoothness |
| enginesize | Engine size (cubic inches) | Predictor of power & price |
| fuelsystem | Fuel delivery system | Efficiency & performance |
| boreratio | Bore-Stroke ratio | Engine design metric |
| stroke | Piston stroke length | Engine design metric |
| compressionratio | Compression ratio | Efficiency & power |
| horsepower | Engine power output | Performance & price |
| peakrpm | Max engine RPM | Performance characteristics |
| citympg | City fuel efficiency | Cost & environmental analysis |
| Price | Market price ($) | Target variable |
| power_to_weight | Horsepower / curbweight | Real-world performance |
| car_volume | Estimated car volume | Interior space & comfort |
| mpg_ratio | City MPG / highway MPG | Compare fuel efficiency |
| is_luxury | Luxury car flag | Market segmentation |
| city_score | Family/city suitability | Composite score |
| outdoor_score | Offroad suitability | Composite score |
| sport_score | Sportscar suitability | Composite score |
| city_score_normalized | Normalized city score (0-1) | Comparison across cars |
| outdoor_score_normalized | Normalized outdoor score (0-1) | Comparison across cars |
| sport_score_normalized | Normalized sport score (0-1) | Comparison across cars |
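
The engineered ratio features in the table can be derived in a few lines. This is a sketch under two assumptions: the highway column is named `highwaympg`, and `car_volume` is estimated as length × width × height, which the table does not state explicitly. The rows are made-up illustration values.

```python
import pandas as pd

# Made-up illustration rows (NOT taken from the dataset).
cars = pd.DataFrame(
    {
        "horsepower": [100, 150],
        "curbweight": [2000, 3000],
        "citympg": [30, 20],
        "highwaympg": [40, 25],  # assumed column name
        "carlength": [170.0, 180.0],
        "carwidth": [65.0, 70.0],
        "carheight": [50.0, 55.0],
    }
)

# Horsepower per pound of curb weight.
cars["power_to_weight"] = cars["horsepower"] / cars["curbweight"]

# City MPG relative to highway MPG.
cars["mpg_ratio"] = cars["citympg"] / cars["highwaympg"]

# ASSUMPTION: bounding-box volume in cubic inches as the "estimated car volume".
cars["car_volume"] = cars["carlength"] * cars["carwidth"] * cars["carheight"]

print(cars[["power_to_weight", "mpg_ratio", "car_volume"]])
```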

Ethical Considerations

As data analysts, it's essential to address the ethical implications of this project—especially when sharing it publicly on platforms like GitHub and Kaggle. Our commitment to best practices ensures responsible and transparent data science.


1. Data and Privacy

  • We use a public dataset from Kaggle, which contains no personally identifiable information (PII).
    The dataset focuses solely on vehicle attributes, not individual owners.
    This approach respects user privacy and aligns with best practices for data anonymity.

2. Bias and Fairness

  • The model’s accuracy depends on the quality and representativeness of the training data.
    We acknowledge the risk of algorithmic bias, especially if the dataset over-represents certain car types or regions.
    To mitigate this:
    • We analyzed feature correlations
    • We documented potential biases in the analysis process
    • We aim to maintain fairness and transparency in predictions

3. Transparency and Accountability

  • This project is open-source, with clearly documented:
    • Data sources
    • Modeling assumptions
    • Known limitations

Accessibility for Color Blind Users

All visualizations in this project have been tested for accessibility using COBLIS (Color Blindness Simulator). The accompanying image demonstrates how the charts appear under a wide range of color vision conditions, including:

  • Red-Weak (Protanomaly)
  • Green-Weak (Deuteranomaly)
  • Blue-Weak (Tritanomaly)
  • Red-Blind (Protanopia)
  • Green-Blind (Deuteranopia)
  • Blue-Blind (Tritanopia)
  • Monochromacy (Achromatopsia)
  • Blue Cone Monochromacy

Each chart was evaluated across these conditions to ensure that key visual distinctions (such as data clusters, trends, and histogram distributions) remain perceptible and interpretable regardless of color vision type. Based on the analysis of the simulation grid, no further adjustments were necessary. The charts maintain sufficient contrast, shape differentiation, and layout clarity, making them accessible to users with all forms of color vision deficiency. This ensures that the visual integrity of the data is preserved without relying solely on color cues.

By proactively testing and validating our visualizations, we aim to support inclusive design and make data insights available to a broader audience.

Example: COBLIS Test Result

Dashboard Design

  • We created three pages, each corresponding to a segment of the car market. Each page offers a varied set of visualisations, both interactive and static, along with filters.
  • Some visualisations use pre-engineered KPIs such as City Score, whereas others use more technical data such as Engine Size.
  • Easy-to-read visuals such as tables and donut charts serve a less technical audience, while scatter plots and histograms serve our technical audience.

Unfixed Bugs

  • Absolute file paths were used when reading the CSV file that contains the unprocessed dataset. Unless users manually change this path in the notebook, they will not be able to load the dataset into a DataFrame.
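
One possible fix, sketched here and not the notebook's current code, is to resolve paths relative to the repository root with `pathlib`. The helper `find_repo_root` is hypothetical, and `cars.csv` is used only as a placeholder file name. The temporary directory mimics the repository layout so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def find_repo_root(start: Path, marker: str = "README.md") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker} found at or above {start}")


# Demo: build a throwaway folder that mimics the repo layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "README.md").touch()
    notebooks = root / "jupyter_notebooks"
    notebooks.mkdir()

    # In a real notebook, `notebooks` would be Path.cwd().
    found = find_repo_root(notebooks)
    raw_csv = found / "Data" / "raw" / "cars.csv"  # placeholder file name
    print(raw_csv.relative_to(found))
```

With this approach the notebook would call `find_repo_root(Path.cwd())` once and build every data path from the result, so the same code works on any machine that has cloned the repository.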

Development Roadmap

Throughout the project, we adopted an agile workflow, holding frequent team meetings to ensure clear communication, avoid misunderstandings, and keep everyone aligned on objectives and deliverables. This collaborative approach helped us quickly address challenges and adapt to changing requirements.

Challenges and Strategies

  • We encountered several technical challenges, including data cleaning complexities, feature engineering decisions, and integration of multiple analysis tools.
  • Managing the project on GitHub presented issues such as merge conflicts and branching problems. With the support of our tutors, as well as through our own initiative, we learned to resolve these conflicts efficiently. This hands-on experience greatly improved our understanding of collaborative version control and project management.
  • Regular meetings and open communication channels allowed us to identify and address blockers early, ensuring steady progress and a shared understanding of project goals.

Skills and Tools for Future Development

  • Based on our experience, we plan to further develop our skills in advanced data visualization (e.g., Power BI, interactive dashboards), machine learning techniques, and cloud-based deployment solutions.
  • We aim to deepen our expertise in collaborative tools like GitHub, as well as project management methodologies to support larger, more complex analytics projects in the future.
  • Continuous learning and teamwork will remain central to our approach, building on the strong foundation established during this project.

Deployment

This project is designed to be run locally in a Jupyter Notebook environment. To deploy and use the analysis:

  1. Clone the repository
     Download or clone the project files to your local machine.

  2. Set up the environment
     Install required Python packages using pip install -r requirements.txt. Ensure you have Jupyter Notebook or JupyterLab installed.

  3. Prepare the data
     Place the raw dataset files in the Data/raw/ directory, as described in the Dataset Content section.

  4. Run the notebooks
     Open and execute the ETL.ipynb notebook to process and clean the data. Then run the Visualisation.ipynb notebook to generate visualisations and insights.

  5. View the Power BI file
     Open the provided .pbix Power BI file using Microsoft Power BI Desktop. You can download Power BI Desktop for free from the official Microsoft website. Use the Power BI file to explore interactive dashboards and visualisations based on the processed dataset.

Outputs

  • The processed dataset will be saved in Data/clean/cars_processed.csv.
  • Visual outputs and metadata tables are generated within the notebooks.

Notes

  • No web or cloud deployment is required; all analysis is performed locally.
  • For sharing results, export notebook outputs, processed data, or Power BI dashboards as needed.
  • If you wish to deploy as a web app or dashboard, consider using frameworks like Streamlit or Dash (not included in this project).

Main Data Analysis Libraries

This project leverages a combination of Python libraries and BI tools for data exploration, visualization, and modeling:

  • Pandas — for data manipulation and cleaning
  • NumPy — for numerical operations and array handling
  • Matplotlib — for basic plotting and static visualizations
  • Seaborn — for statistical graphics and enhanced plots
  • Plotly — for interactive and dynamic visualizations
  • Power BI — for dashboarding and business intelligence reporting

Credits

Content

  • Dataset
    Car Price Prediction (Multiple Linear Regression) — sourced from Kaggle.
    This dataset was used to explore factors influencing car prices and build regression models.

  • References
    • GitHub repositories consulted for best practices in modular project setup and Git workflows

  • Generative AI Tools
    • ChatGPT (Microsoft Copilot): for ideation, documentation scaffolding, and troubleshooting support
    • GitHub Copilot: for code suggestions and inline development assistance

  • Team Collaboration
    • Contributions from team members in planning, coding, and documentation phases

Media

  • Screenshots of the Git Kanban Board and other visual assets are stored in the Media/ directory

Acknowledgements

  • Thanks to the team for collaboration across GitHub Projects, Google Drive, and Discord — and to mentors/coaches for feedback throughout the hackathon.

About

Model the price of cars using the available independent variables. The model is used to understand how prices vary with those variables, so that car design and business strategy can be adjusted to meet target price levels.
