Car Price Analysis is a data analysis project built on a car price dataset. Our primary objective was to explore the key factors that influence car prices and to use these insights to inform the development of a professional, interactive dashboard. By uncovering which specifications, features, and configurations have the strongest impact on a vehicle's price, this analysis provides valuable intelligence for automotive businesses and enthusiasts.
- The Car Price Prediction Multiple Linear Regression dataset, sourced from Kaggle (https://www.kaggle.com/datasets/hellbuoy/car-price-prediction), is a comprehensive collection of information on various automobiles, designed for building a machine learning model to predict car prices.
- The dataset is composed of two primary files:
  - CarPrice_Assignment.CSV: This is the main data file, containing 205 observations and 26 features. The features include details such as the car's make and model, technical specifications (e.g., engine size, horsepower, city and highway MPG), physical dimensions (e.g., wheelbase, length, width, height), and other key attributes like fuel type, number of doors, and engine location. The target variable for prediction is the price of the car.
  - Data Dictionary - carprices.xlsx: This supplementary file provides a detailed data dictionary, offering clear descriptions for each of the features found in the CSV file. This helps in understanding the meaning and data type of each column, which is crucial for data analysis and model development.
Core Objective: The primary objective is to serve as a data analytics consulting team for Geely Auto, providing a clear, data-backed strategy for their entry into the competitive US automobile market. The project addresses the key business questions related to car pricing and market dynamics.
Data Analysis Goals:
- Analyse car prices based on various independent variables to understand the factors affecting car pricing in the US market.
- Develop interactive dashboards to provide insights into car pricing dynamics, helping the company design cars and develop business strategies that meet market demands.
An automobile company aims to enter the US market and compete with local and European manufacturers. To adjust its strategies and designs accordingly, it needs to understand the factors influencing car prices in the US market. The dataset includes various car attributes, such as make, model, engine size, and price.
- Pricing Factor Analysis: We will analyze the provided dataset to identify which variables (e.g., horsepower, fuel efficiency, engine size, brand) are most significant in predicting a car's price in the American market. This directly answers Geely's need to understand how pricing works outside of their home market.
- Competitive Strategy Development: Instead of a single, broad analysis, our project focuses on the four identified customer personas (Budget Buyer, Family Buyer, Luxury Buyer, and Eco-friendly Buyer). We will provide specific, tailored insights for how Geely can design cars to be competitive within each market segment.
- Actionable Insights: The final deliverable will not just be a set of visualizations but a comprehensive strategy. The presentation will guide Geely on what features to prioritize and what market segments offer the most promising opportunities for a new competitor.
Project Hypotheses
- Car price is strongly influenced by engine size, horsepower, and brand. Validation: Use correlation analysis and regression plots to measure the relationship between price and these features. High correlation coefficients and clear trends in visualisations will support this hypothesis.
- Composite scores for family/city, offroad, and sportscar categories effectively segment cars by suitability. Validation: Compare top-scoring cars in each category with their actual specifications and market positioning. Use visualisations to check if the scores align with expected car types.
- Luxury cars (top 25% by price) exhibit distinct feature patterns compared to non-luxury cars. Validation: Use box plots and summary statistics to compare features (e.g., power_to_weight, car_volume) between luxury and non-luxury segments.
- Fuel efficiency (mpg_ratio) and power-to-weight ratio are key differentiators for buyers with practical vs. performance needs. Validation: Segment cars by these features and analyze their distribution and impact on price and category scores.
Validation Approach
- Perform statistical analysis (correlation, regression, summary statistics) using the processed dataset.
- Use data visualisations (pairplots, box plots, heatmaps, score charts) to visually confirm relationships and segmentation.
- Cross-check results with domain knowledge and market expectations to ensure findings are meaningful and actionable.
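As a rough illustration of the correlation step described above, the sketch below ranks numeric features by the absolute value of their Pearson correlation with price. The helper name `rank_price_drivers`, the default target column name `price`, and the toy values are assumptions made for this example only; they are not taken from the project notebooks or from the dataset.

```python
import pandas as pd


def rank_price_drivers(df: pd.DataFrame, target: str = "price") -> pd.Series:
    """Absolute Pearson correlation of each numeric feature with the target, strongest first."""
    numeric = df.select_dtypes(include="number")
    corr_with_target = numeric.corr()[target].drop(target)
    return corr_with_target.abs().sort_values(ascending=False)


# Toy values invented for illustration; they are NOT rows from the Kaggle file.
toy = pd.DataFrame(
    {
        "enginesize": [90, 110, 130, 180, 210],
        "horsepower": [60, 85, 100, 150, 200],
        "citympg": [38, 31, 26, 19, 15],
        "price": [6000, 8500, 11000, 19000, 30000],
    }
)

drivers = rank_price_drivers(toy)
print(drivers)
```

Taking the absolute value keeps strongly negative relationships (for example, fuel economy against price) in the ranking alongside positive ones. A high value supports a hypothesis only as an association, so the plots and domain cross-checks listed above are still needed.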
The project is delivered over four structured milestones, each designed to scaffold a reproducible analytics workflow and map directly to the repo structure.
Objective
Establish a working, collaborative delivery environment so the team can start shipping immediately and predictably.
Key Actions
- Create GitHub repository and Projects (Kanban) board with status columns, labels, and issue/PR templates.
- Define folder structure: `Data/raw`, `Data/clean`, `jupyter_notebooks`, `Media/`, `Dashboard/`.
- Stage the Kaggle dataset into `Data/raw/` (`cars.csv`, `Data Dictionary - carprices.xlsx`).
- Seed ETL/EDA backlog and link issues/PRs to milestones; assign owners/reviewers.
Success Criteria
Any teammate can clone, install deps, and run the first notebook without friction. The board provides real‑time visibility into planned, active, and completed work.
Objective
Turn the raw dataset into a clean, analysis‑ready table and verify the pipeline end‑to‑end.
Key Actions
- Implement notebook-driven ETL (`jupyter_notebooks/ETL.ipynb`): parsing brand/model, standardizing categorical labels, converting units/types, engineering features, and exporting `Data/clean/cars_processed.csv` (`index=False`).
- Handle missing values safely (e.g., Subaru model → `dl`), keep outliers by design, and document assumptions.
- Produce first EDA visuals in `jupyter_notebooks/Visualisation.ipynb` and save artifacts to `Media/`.
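The brand/model parsing and export steps listed above could look roughly like the following. This is a minimal sketch, not the code in `ETL.ipynb`: it assumes the raw file stores make and model together in a single column called `CarName`, the helper `split_car_name` is hypothetical, and the toy rows and prices are invented. The output is written to a temporary folder here so the example does not touch the real `Data/clean/` directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def split_car_name(df: pd.DataFrame, name_col: str = "CarName") -> pd.DataFrame:
    """Split a combined 'brand model' column into separate brand and model columns."""
    out = df.copy()
    # Split on the first space only, so multi-word model names stay intact.
    parts = out[name_col].str.strip().str.split(" ", n=1, expand=True)
    out["brand"] = parts[0].str.lower()
    out["model"] = parts[1] if parts.shape[1] > 1 else None
    # Fill the missing Subaru model as described in the ETL notes above.
    missing_subaru = (out["brand"] == "subaru") & out["model"].isna()
    out.loc[missing_subaru, "model"] = "dl"
    return out.drop(columns=[name_col])


# Toy rows invented for illustration only.
raw = pd.DataFrame(
    {
        "CarName": ["alfa-romero giulia", "subaru", "toyota corolla"],
        "price": [13000.0, 5000.0, 7000.0],
    }
)
clean = split_car_name(raw)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "cars_processed.csv"
    clean.to_csv(out_path, index=False)  # index=False keeps the row index out of the file
    reloaded = pd.read_csv(out_path)

print(reloaded)
```

Writing with `index=False` means the CSV contains only real columns, so reloading it does not produce a stray unnamed index column.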
Objective
Harden and extend the pipeline to support decision‑grade insights and prep the first dashboard build.
Key Actions
- Add and validate comparability scores (City/Family, Outdoor/Off‑Road, Sport): per‑feature normalization (invert where higher is worse), sum, and re‑normalize to 0–1.
- Quantify drivers of price: compute absolute correlations and visualize strongest relationships (heatmap + scatter); export to `Media/`.
- Make paths robust (prefer `pathlib`), ensure idempotent/deterministic writes.
- Draft Power BI page layout (overview + persona pages) and field naming conventions aligned with ETL (e.g., Engine Size (L)).
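The scoring recipe in the first action above (normalize each feature, invert where higher is worse, sum, then re-normalize to 0-1) can be sketched as follows. Min-max scaling is an assumption, since the text only says "normalization", and the choice of features for the city score is hypothetical; the notebooks may use a different feature set or weighting.

```python
import pandas as pd


def min_max(series: pd.Series) -> pd.Series:
    """Scale a series to the 0-1 range; a constant series maps to all zeros."""
    span = series.max() - series.min()
    if span == 0:
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span


def composite_score(
    df: pd.DataFrame, higher_is_better: list[str], lower_is_better: list[str]
) -> pd.Series:
    """Normalize each feature, invert the 'lower is better' ones, sum, and re-normalize."""
    parts = [min_max(df[col]) for col in higher_is_better]
    parts += [1.0 - min_max(df[col]) for col in lower_is_better]
    raw_score = pd.concat(parts, axis=1).sum(axis=1)
    return min_max(raw_score)


# Toy values invented for illustration only.
toy = pd.DataFrame(
    {
        "citympg": [40, 30, 20],
        "car_volume": [500, 600, 700],
        "curbweight": [2000, 2500, 3000],
    },
    index=["car_a", "car_b", "car_c"],
)

# Hypothetical feature choice for a city/family score.
city_score_normalized = composite_score(
    toy,
    higher_is_better=["citympg", "car_volume"],
    lower_is_better=["curbweight"],
)
print(city_score_normalized)
```

Because every feature is scaled before summing, no single column dominates just because of its units. The final re-normalization makes scores comparable across the three categories, but it also means a score is relative to the cars in the table rather than an absolute rating.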
Objective
Polish deliverables for readability and accessibility, and package the work for sharing and demo.
Key Actions
- Freeze `Data/clean/cars_processed.csv` and commit all `Media/` visuals (PNGs/HTML).
- Populate the presentation deck with an EDA narrative and figure references.
- (Optional) Publish a GitHub Release bundling CSV, figures, and PBIX when ready.
car-price-analysis-hackathon-team4/
├─ Dashboard/                          # PBIX + exported screenshots
├─ Data/
│  ├─ raw/                             # Kaggle CSV + data dictionary
│  └─ clean/
│     └─ cars_processed.csv            # ETL output (written by ETL.ipynb)
├─ jupyter_notebooks/
│  ├─ ETL.ipynb                        # cleaning, feature engineering, scoring, export
│  ├─ Visualisation.ipynb              # EDA (dists, correlations, outliers, scores)
│  └─ Notebook_Template copy.ipynb
├─ Media/                              # EDA figures/HTML + deck assets (images, GIFs)
├─ .gitignore
├─ .python-version
├─ .slugignore
├─ Procfile
├─ README.md
├─ requirements.txt
└─ setup.sh
| Phase | Description |
|---|---|
| Setup | Repo creation, Kanban board setup, folder scaffolding, dataset staging |
| ETL Build | Raw‑to‑clean transformation, notebook‑driven pipeline, first visuals |
| EDA & Scores | Distributions, correlations, comparability scores, outlier rationale |
| Finalization | Presentation narrative, documentation, (optional) GitHub Release |
- Collection: Kaggle dataset staged in `Data/raw/`, tracked via GitHub issues.
- Processing: ETL notebooks version-controlled; outputs stored in `Data/clean/`.
- Analysis: EDA steps logged in notebooks; visuals exported to `Media/`.
- Interpretation: Insights narrated in the presentation; dashboard framework prepared.
- Collaboration: Issues/PRs tied to milestones; reviewers ensure reproducibility and clarity.
| Methodology | Rationale |
|---|---|
| Notebook‑Driven ETL | Transparent, auditable, easy onboarding |
| Power BI (planned) | Stakeholder‑friendly, interactive exploration |
| Milestone‑Based Delivery | Agile iteration, clear accountability |
| GitHub Project Management | Labels, issues, reviewers → traceability |
- GitHub Projects — Kanban board, labels, milestones, and linked issues/PRs.
- Google Drive — Shared workspace for presentation and shared docs.
- Discord — Daily check‑ins, async updates, links to issues and previews.
- ✅ Cleaned dataset: `Data/clean/cars_processed.csv`
- ✅ EDA artifacts saved in `Media/` (distributions, correlations, outliers, score-vs-price)
- ✅ Jupyter notebooks: `jupyter_notebooks/ETL.ipynb`, `jupyter_notebooks/Visualisation.ipynb`
- 🧭 Power BI dashboard: framework defined, PBIX to be built in `Dashboard/`
Business Requirements
- Understand key factors influencing car prices - identify which car attributes (e.g., engine size, horsepower, brand, body type) most strongly affect market price.
- Segment cars by category and suitability - classify cars into family/city, offroad, and sportscar categories to support targeted recommendations.
- Detect market diversity and outliers - recognize rare or luxury vehicles and price variability to reflect genuine market diversity.
- Assess fuel efficiency and performance - compare cars based on fuel economy and power-to-weight ratios for practical and performance-oriented buyers.
- Support decision-making for buyers and sellers - provide clear, actionable insights for different user groups (e.g., families, enthusiasts, dealers).
Rationale for Data Visualisations
In order to map the business requirements to meaningful data visualizations, the dataset includes both raw and processed features. Some features, such as power_to_weight, car_volume, and the various *_score metrics, were created during the ETL phase to provide more actionable insights. These features, along with the original data, serve as the input for the visualization phase, enabling analysis of market segmentation, performance, fuel efficiency, and suitability for different driving needs.
- Pairplots and correlation heatmaps reveal relationships between price and key numeric features, directly addressing the need to understand price drivers.
- Box plots by car body and drivewheel visualize market segmentation and price variability, supporting category-based analysis and outlier detection.
- Composite score visualisations (city, outdoor, sport) map cars to their suitability for different use cases, enabling targeted recommendations.
- Fuel efficiency and performance charts help users compare cars on practical and technical criteria, meeting requirements for efficiency and performance assessment.
- Metadata tables and summary statistics provide transparency and context, ensuring that insights are actionable and relevant for business decisions.
Car Dataset Features:
| Column | Description | Significance |
|---|---|---|
| `car_ID` | Unique car identifier | Distinguishes records |
| `brand` | Car manufacturer | Trend & market analysis |
| `model` | Specific car model | Comparison between models |
| `symbolling` | Insurance risk rating | Indicates safety risk |
| `fueltype` | Fuel type (gas/diesel) | Impacts cost, emissions, performance |
| `aspiration` | Engine aspiration (standard/turbo) | Affects engine power & efficiency |
| `doornumber` | Number of doors | Family/city suitability |
| `carbody` | Car body style | Market segmentation & price |
| `drivewheel` | Drivetrain type | Handling & offroad capability |
| `enginelocation` | Engine placement | Affects balance & design |
| `wheelbase` | Distance between axles (inches) | Stability & interior space |
| `carlength` | Car length (inches) | Interior space & parking |
| `carwidth` | Car width (inches) | Maneuverability & comfort |
| `carheight` | Car height (inches) | Headroom & ground clearance |
| `curbweight` | Car weight (lbs) | Fuel efficiency & handling |
| `enginetype` | Engine type/configuration | Technical analysis |
| `cylindernumber` | Number of cylinders | Engine power & smoothness |
| `enginesize` | Engine size (cubic inches) | Predictor of power & price |
| `fuelsystem` | Fuel delivery system | Efficiency & performance |
| `boreratio` | Bore-Stroke ratio | Engine design metric |
| `stroke` | Piston stroke length | Engine design metric |
| `compressionratio` | Compression ratio | Efficiency & power |
| `horsepower` | Engine power output | Performance & price |
| `peakrpm` | Max engine RPM | Performance characteristics |
| `citympg` | City fuel efficiency | Cost & environmental analysis |
| `Price` | Market price ($) | Target variable |
| `power_to_weight` | Horsepower / curbweight | Real-world performance |
| `car_volume` | Estimated car volume | Interior space & comfort |
| `mpg_ratio` | City MPG / highway MPG | Compare fuel efficiency |
| `is_luxury` | Luxury car flag | Market segmentation |
| `city_score` | Family/city suitability | Composite score |
| `outdoor_score` | Offroad suitability | Composite score |
| `sport_score` | Sportscar suitability | Composite score |
| `city_score_normalized` | Normalized city score (0-1) | Comparison across cars |
| `outdoor_score_normalized` | Normalized outdoor score (0-1) | Comparison across cars |
| `sport_score_normalized` | Normalized sport score (0-1) | Comparison across cars |
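The engineered columns in the table can be derived with a few vectorized operations. The sketch below follows the descriptions given in this README (horsepower divided by curb weight, city MPG divided by highway MPG, luxury as the top 25% by price) but makes several assumptions of its own: car volume is approximated as length × width × height, the luxury cut-off is taken as "at or above the 75th percentile", a `highwaympg` column is available, and the price column is passed in by name (defaulting to `price`). The helper name and toy values are invented.

```python
import pandas as pd


def add_derived_features(df: pd.DataFrame, price_col: str = "price") -> pd.DataFrame:
    """Add ratio, volume, and luxury-flag columns to a copy of the input table."""
    out = df.copy()
    out["power_to_weight"] = out["horsepower"] / out["curbweight"]
    # Assumption: a simple bounding-box estimate of volume (cubic inches).
    out["car_volume"] = out["carlength"] * out["carwidth"] * out["carheight"]
    out["mpg_ratio"] = out["citympg"] / out["highwaympg"]
    # Assumption: "top 25% by price" means at or above the 75th percentile.
    out["is_luxury"] = out[price_col] >= out[price_col].quantile(0.75)
    return out


# Toy values invented for illustration only.
toy = pd.DataFrame(
    {
        "horsepower": [100, 120, 150, 300],
        "curbweight": [2000, 2400, 3000, 3750],
        "carlength": [160.0, 170.0, 180.0, 190.0],
        "carwidth": [64.0, 65.0, 66.0, 70.0],
        "carheight": [52.0, 54.0, 55.0, 50.0],
        "citympg": [30, 27, 24, 15],
        "highwaympg": [40, 36, 30, 20],
        "price": [5000, 10000, 15000, 40000],
    }
)

featured = add_derived_features(toy)

# Summary statistics per segment, as in the luxury vs. non-luxury hypothesis.
segment_medians = featured.groupby("is_luxury")[["power_to_weight", "car_volume"]].median()
print(segment_medians)
```

The grouped medians give a quick numeric counterpart to the box plots: if the luxury and non-luxury rows differ clearly here, the same separation should be visible in the charts.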
As data analysts, it's essential to address the ethical implications of this project—especially when sharing it publicly on platforms like GitHub and Kaggle. Our commitment to best practices ensures responsible and transparent data science.
- We use a public dataset from Kaggle, which contains no personally identifiable information (PII).
  The dataset focuses solely on vehicle attributes, not individual owners.
  This approach respects user privacy and aligns with best practices for data anonymity.
- The model's accuracy depends on the quality and representativeness of the training data.
  We acknowledge the risk of algorithmic bias, especially if the dataset over-represents certain car types or regions.
  To mitigate this:
  - We analyzed feature correlations
  - We documented potential biases in the analysis process
  - We aim to maintain fairness and transparency in predictions
- This project is open-source, with clearly documented:
  - Data sources
  - Modeling assumptions
  - Known limitations
All visualizations in this project have been tested for accessibility using COBLIS (Color Blindness Simulator). The accompanying image demonstrates how the charts appear under a wide range of color vision conditions, including:
- Red-Weak (Protanomaly)
- Green-Weak (Deuteranomaly)
- Blue-Weak (Tritanomaly)
- Red-Blind (Protanopia)
- Green-Blind (Deuteranopia)
- Blue-Blind (Tritanopia)
- Monochromacy (Achromatopsia)
- Blue Cone Monochromacy

Each chart was evaluated across these conditions to ensure that key visual distinctions (such as data clusters, trends, and histogram distributions) remain perceptible and interpretable regardless of color vision type.

Based on the analysis of the simulation grid, no further adjustments were necessary. The charts maintain sufficient contrast, shape differentiation, and layout clarity, making them accessible to users with all forms of color vision deficiency. This ensures that the visual integrity of the data is preserved without relying solely on color cues. By proactively testing and validating our visualizations, we aim to support inclusive design and make data insights available to a broader audience.
- We created three dashboard pages, each corresponding to a segment of the car market. Each page offers a diverse set of visualisations, both interactive and static, with filters.
- Some visualisations use pre-engineered KPIs such as City Score, whereas others use more technical data such as Engine Size.
- Easy-to-read visuals such as tables and donut charts serve a less technical audience, while scatter plots and histograms serve our technical audience.
- Full (absolute) directory paths were used when reading the CSV file that contains the unprocessed dataset. Unless users manually change this path, the notebook cannot load the dataset into a DataFrame on their machine.
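One way to address the path limitation in the last point is to build paths relative to the repository root with `pathlib`, so the notebooks work from any clone location. The sketch below is an illustration rather than the project's code: the helper `find_repo_root` is hypothetical, it assumes `README.md` sits at the repository root (as in the folder layout), and the file name `cars.csv` follows the staging step in this README and may differ in your copy. A temporary folder stands in for the repository so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def find_repo_root(start: Path, marker: str = "README.md") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker} found in {start} or its parents")


# Build a throwaway folder that mimics the repository layout.
with tempfile.TemporaryDirectory() as tmp:
    fake_root = Path(tmp).resolve()
    (fake_root / "README.md").touch()
    notebooks_dir = fake_root / "jupyter_notebooks"
    notebooks_dir.mkdir()

    # From inside the notebooks folder, locate the root and derive the data path.
    found_root = find_repo_root(notebooks_dir)
    raw_csv = found_root / "Data" / "raw" / "cars.csv"

print(raw_csv)
```

In a notebook, the equivalent call would start from `Path.cwd()`, and the resulting path can be passed directly to `pandas.read_csv`.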
Throughout the project, we adopted an agile workflow, holding frequent team meetings to ensure clear communication, avoid misunderstandings, and keep everyone aligned on objectives and deliverables. This collaborative approach helped us quickly address challenges and adapt to changing requirements.
- We encountered several technical challenges, including data cleaning complexities, feature engineering decisions, and integration of multiple analysis tools.
- Managing the project on GitHub presented issues such as merge conflicts and branching problems. With the support of our tutors, as well as through our own initiative, we learned to resolve these conflicts efficiently. This hands-on experience greatly improved our understanding of collaborative version control and project management.
- Regular meetings and open communication channels allowed us to identify and address blockers early, ensuring steady progress and a shared understanding of project goals.
- Based on our experience, we plan to further develop our skills in advanced data visualization (e.g., Power BI, interactive dashboards), machine learning techniques, and cloud-based deployment solutions.
- We aim to deepen our expertise in collaborative tools like GitHub, as well as project management methodologies to support larger, more complex analytics projects in the future.
- Continuous learning and teamwork will remain central to our approach, building on the strong foundation established during this project.
This project is designed to be run locally in a Jupyter Notebook environment. To deploy and use the analysis:
- Clone the repository: Download or clone the project files to your local machine.
- Set up the environment: Install the required Python packages using `pip install -r requirements.txt`. Ensure you have Jupyter Notebook or JupyterLab installed.
- Prepare the data: Place the raw dataset files in the raw directory as described in the Inputs section.
- Run the notebooks: Open and execute the `ETL.ipynb` notebook to process and clean the data. Then run the `Visualisation.ipynb` notebook to generate visualisations and insights.
- View the Power BI file: Open the provided `.pbix` Power BI file using Microsoft Power BI Desktop. You can download Power BI Desktop for free from the official Microsoft website. Use the Power BI file to explore interactive dashboards and visualisations based on the processed dataset.

Outputs
- The processed dataset will be saved in `cars_processed.csv`.
- Visual outputs and metadata tables are generated within the notebooks.

Notes
- No web or cloud deployment is required; all analysis is performed locally.
- For sharing results, export notebook outputs, processed data, or Power BI dashboards as needed.
- If you wish to deploy as a web app or dashboard, consider using frameworks like Streamlit or Dash (not included in this project).
This project leverages a combination of Python libraries and BI tools for data exploration, visualization, and modeling:
- Pandas — for data manipulation and cleaning
- NumPy — for numerical operations and array handling
- Matplotlib — for basic plotting and static visualizations
- Seaborn — for statistical graphics and enhanced plots
- Plotly — for interactive and dynamic visualizations
- Power BI — for dashboarding and business intelligence reporting
- Dataset
  - Car Price Prediction (Multiple Linear Regression), sourced from Kaggle. This dataset was used to explore factors influencing car prices and build regression models.
- References
  - GitHub repositories consulted for best practices in modular project setup and Git workflows
- Generative AI Tools
  - ChatGPT (Microsoft Copilot): for ideation, documentation scaffolding, and troubleshooting support
  - GitHub Copilot: for code suggestions and inline development assistance
- Team Collaboration
  - Contributions from team members in planning, coding, and documentation phases
  - Screenshots of the Git Kanban Board and other visual assets are stored in the `Media/` directory
- Thanks to the team for collaboration across GitHub Projects, Google Drive, and Discord — and to mentors/coaches for feedback throughout the hackathon.

