This project uses machine learning algorithms to predict car prices based on various features such as engine specifications, fuel type, body style, and more. The model is trained on a dataset containing detailed information about different car models and their corresponding prices.
Source: Kaggle - Car Price Prediction Dataset
The dataset contains information about cars with the following key features:
- Engine specifications: Engine size, horsepower, fuel system
- Physical attributes: Length, width, height, curb weight
- Performance metrics: City MPG, highway MPG, compression ratio
- Categories: Car company, fuel type, aspiration, body style, drive wheels
- Target variable: Price (what we're predicting)
- `CarPrice_Assignment.csv` - Main dataset with car features and prices
- `Data Dictionary - carprices.xlsx` - Detailed description of all features
- Python 3.x
- pandas - Data manipulation and analysis
- numpy - Numerical computing
- scikit-learn - Machine learning algorithms and tools
- matplotlib - Data visualization
- Custom transformers - Log transformation for specific features
PredictCarPrice/
│
├── dataset/
│ ├── CarPrice_Assignment.csv # Main dataset
│ └── Data Dictionary - carprices.xlsx # Feature descriptions
│
├── main.py # Main script with model training
├── loadData.py # Data loading and splitting utilities
├── dataCleanup.py # Data preprocessing and feature engineering
├── custom_transformers.py # Custom sklearn transformers
├── .gitignore # Git ignore file
└── README.md # This file
- Log transformation applied to `horsepower` and `enginesize` for a better distribution
- Robust scaling for numerical features to handle outliers
- One-hot encoding for categorical variables
- Custom pipeline for seamless data preprocessing
- LogTransformer: Applies log transformation to specified numerical features
- RobustScaler: Scales numerical features while being robust to outliers
- OneHotEncoder: Converts categorical variables to numerical format
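The three components above can be combined along these lines. This is a minimal sketch, not the project's actual pipeline: the data frame is invented, the column groupings (`curbweight`, `fueltype`, `carbody`) are assumptions, and scikit-learn's `FunctionTransformer(np.log)` stands in for the project's own `LogTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, RobustScaler

# Tiny invented frame; the real data lives in dataset/CarPrice_Assignment.csv.
df = pd.DataFrame({
    "horsepower": [111.0, 154.0, 102.0, 115.0],
    "enginesize": [130.0, 152.0, 109.0, 136.0],
    "curbweight": [2548.0, 2823.0, 2337.0, 2824.0],
    "fueltype": ["gas", "gas", "diesel", "gas"],
    "carbody": ["convertible", "hatchback", "sedan", "sedan"],
})

log_cols = ["horsepower", "enginesize"]   # log-transformed, per the README
num_cols = ["curbweight"]                 # assumed: other numerical features
cat_cols = ["fueltype", "carbody"]        # assumed: categorical features

# Log transform first, then robust scaling (the ordering is an assumption).
log_pipeline = Pipeline([
    ("log", FunctionTransformer(np.log)),  # stand-in for LogTransformer
    ("scale", RobustScaler()),
])

preprocess = ColumnTransformer([
    ("log_num", log_pipeline, log_cols),
    ("num", RobustScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

X = preprocess.fit_transform(df)
print(X.shape)  # 3 numerical columns plus one column per category value
```

Keeping all steps in one `ColumnTransformer` means the same fitted transformations can be reused on the test set without refitting.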
The project evaluates multiple regression algorithms:
- Linear Regression - Baseline model
- Decision Tree Regressor - Non-linear relationships
- Random Forest Regressor - Ensemble method (final choice)
- Cross-validation with 10 folds for robust evaluation
- Grid Search for hyperparameter tuning
- RMSE (Root Mean Square Error) as the primary evaluation metric
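A rough sketch of the 10-fold cross-validated RMSE comparison could look like the following. It runs on synthetic data from `make_regression` rather than the car dataset, and the model settings are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed car features and prices.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=30, random_state=42),
}

rmse_by_model = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so MSE is reported negated.
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10)
    rmse = np.sqrt(-scores)  # one RMSE value per fold
    rmse_by_model[name] = rmse
    print(f"{name}: mean RMSE = {rmse.mean():.2f}, std = {rmse.std():.2f}")
```

Reporting the standard deviation alongside the mean shows how much the score varies between folds, which matters on a small dataset.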
The Random Forest model is optimized using GridSearchCV with the following parameters:
- `n_estimators`: [3, 10, 30]
- `max_features`: [2, 4, 6, 8]
- `bootstrap`: [True, False]
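As an illustration, the grid above could be passed to `GridSearchCV` roughly as follows. The synthetic data, the 5-fold setting, and the scoring choice are assumptions; the project's `main.py` may differ.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data with 8 features so that max_features=8 is valid.
X, y = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=42)

param_grid = {
    "n_estimators": [3, 10, 30],
    "max_features": [2, 4, 6, 8],
    "bootstrap": [True, False],
}

grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,                                # assumed fold count for the search
    scoring="neg_mean_squared_error",    # negated so that higher is better
)
grid.fit(X, y)

best_rmse = np.sqrt(-grid.best_score_)
print(grid.best_params_, f"RMSE = {best_rmse:.2f}")
```

The full grid has 3 x 4 x 2 = 24 combinations, each fitted once per fold, so the search cost grows quickly as more values are added.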
1. Clone the repository:

       git clone https://github.com/ArnavGRao/PredictCarPriceWithRegression.git
       cd PredictCarPriceWithRegression

2. Install required packages:

       pip install -r requirements.txt

   Or install manually:

       pip install pandas numpy scikit-learn matplotlib scipy seaborn

3. Run the main script:

       python main.py
`main.py`
- Main execution script
- Model training, evaluation, and comparison
- Hyperparameter tuning with GridSearchCV
- Final model evaluation and feature importance analysis

`loadData.py`
- Data loading utilities
- Train-test split functionality

`dataCleanup.py`
- Data preprocessing functions
- Feature engineering pipeline

`custom_transformers.py`
- `LogTransformer`: Custom sklearn transformer for log transformation
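For readers unfamiliar with custom transformers, a log transformer of this kind might be written as below. This is a hypothetical sketch, not the code in `custom_transformers.py`: the `columns` parameter, the use of the natural log, and the DataFrame input are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class LogTransformer(BaseEstimator, TransformerMixin):
    """Apply a natural log to selected DataFrame columns (values must be > 0)."""

    def __init__(self, columns=None):
        self.columns = columns  # hypothetical parameter: columns to transform

    def fit(self, X, y=None):
        # Nothing to learn; defined so the class works inside a Pipeline.
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's frame untouched
        cols = self.columns if self.columns is not None else list(X.columns)
        for col in cols:
            X[col] = np.log(X[col].astype(float))
        return X


# Invented sample values for illustration.
df = pd.DataFrame({"horsepower": [111, 154, 102], "enginesize": [130, 152, 109]})
out = LogTransformer(columns=["horsepower"]).fit_transform(df)
print(out)
```

Inheriting from `BaseEstimator` and `TransformerMixin` provides `get_params`, `set_params`, and `fit_transform`, which is what lets the class slot into a scikit-learn `Pipeline`.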
The project includes comprehensive model evaluation:
- Cross-validation scores for model reliability
- Grid search results for optimal hyperparameters
- Feature importance rankings to understand which car attributes drive price predictions
- Test set evaluation for final model performance assessment
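The last two items can be sketched as follows, assuming a fitted Random Forest and a held-out test split. The data is synthetic and the feature names are placeholders, so the numbers say nothing about the car dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data in which only 2 of the 5 features carry signal.
X, y = make_regression(n_samples=200, n_features=5, n_informative=2,
                       noise=5.0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=30, random_state=42)
model.fit(X_train, y_train)

# Rank features from most to least important.
ranking = sorted(zip(model.feature_importances_, feature_names), reverse=True)
for importance, name in ranking:
    print(f"{name}: {importance:.3f}")

# Evaluate once on the held-out test set.
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Test RMSE: {test_rmse:.2f}")
```

The test set is touched only once, after model selection, so the reported RMSE is not biased by the tuning process.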
- End-to-end ML pipeline from data loading to model evaluation
- Proper data preprocessing techniques for mixed data types
- Model comparison and selection methodology
- Hyperparameter optimization using grid search
- Custom transformer creation for specialized preprocessing needs
ArnavGRao
- GitHub: @ArnavGRao
- Email: arnavgrao@gmail.com
This project is open source and available under the MIT License.