An end-to-end Machine Learning pipeline for predicting cab fares using batch training and real-time streaming with Kafka.
This project builds a Cab Fare Prediction System that:
- Cleans and preprocesses raw cab ride data
- Trains a Linear Regression model
- Evaluates performance (RMSE, MAE, R²)
- Serves predictions via:
- User Interface (batch prediction)
- Kafka Streaming (real-time prediction)
The system consists of the following components:
INPUT → PREPROCESSING → ML MODEL → SAVED MODEL
USER → INTERFACE & KAFKA PIPELINE
- Raw dataset:
cab_rides.csv
- Remove null values
- Filter invalid data
- Datetime extraction
- Derived features
- Encoding categorical variables
Includes:
- Indexer
- Encoder
- Vector Assembler
- Scaler
- Algorithm: Linear Regression
- RMSE (Root Mean Squared Error)
- MAE (Mean Absolute Error)
- R² Score
- Saved pipeline
- Saved trained model
User Input Form → Load Models → Process Input → Predict Fare → Display Result
Allows users to manually input ride details and receive predicted fare.
- Reads:
ride_features.csv - Sends features to Kafka topic:
cab_price_features
- Manages topic and streaming data
- Read Kafka stream
- Parse JSON
- Load saved model
- Process stream
- Predict prices
- Output results
Enables real-time fare prediction.
- Python
- Apache Spark (ML Pipeline)
- Linear Regression
- Apache Kafka
- JSON Streaming
- Scikit-learn / Spark ML (depending on implementation)
| Metric | Description |
|---|---|
| RMSE | Measures prediction error magnitude |
| MAE | Average absolute error |
| R² | Variance explained by the model |
📌 Features ✅ End-to-end ML pipeline ✅ Feature engineering automation ✅ Real-time streaming prediction ✅ Batch and streaming support ✅ Modular architecture
python train_model.py
2️⃣ Start Kafka
zookeeper-server-start.sh config/zookeeper.properties
kafka-server-start.sh config/server.properties
3️⃣ Start Producer
python kafka_producer.py
4️⃣ Start Consumer
python kafka_consumer.py
5️⃣ Run UI
python app.py