🚀 Hazardous Asteroid Prediction

A machine learning pipeline to predict potentially hazardous asteroids using NASA's Near Earth Object Web Service (NeoWs) dataset.
This project applies data preprocessing, feature engineering, synthetic data balancing (SMOTE), and machine learning model training using PyCaret.

📌 Project Overview

Preprocessing: Cleans the dataset, removes unnecessary columns, and encodes categorical variables.
EDA: Generates visual insights into asteroid characteristics.
Feature Engineering: Applies SMOTE to balance class distribution and scales features.
Model Training: Compares multiple models using PyCaret and selects the best one.
Evaluation: Stores model performance metrics for analysis.

📌 Data Pipeline

1️⃣ Preprocessing

Removes redundant features (e.g., duplicate distance measurements).
Converts categorical values (e.g., "Hazardous" → 1/0).
Saves cleaned dataset.
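
The preprocessing steps above can be sketched roughly as follows. This is a dependency-free illustration, not the project's actual code: the "Miss Dist" column names are hypothetical stand-ins for the duplicate distance measurements, and only the "Hazardous" label is named in this README.

```python
# Hypothetical names for the duplicate distance columns (assumption).
REDUNDANT_COLUMNS = {"Miss Dist (miles)", "Miss Dist (lunar)"}


def clean_row(row):
    """Drop redundant columns and encode the Hazardous label as 1/0."""
    cleaned = {key: value for key, value in row.items()
               if key not in REDUNDANT_COLUMNS}
    # Accept booleans or the strings "True"/"False" for the label.
    is_hazardous = str(row["Hazardous"]).strip().lower() == "true"
    cleaned["Hazardous"] = 1 if is_hazardous else 0
    return cleaned


# Made-up rows, for illustration only.
raw_rows = [
    {"Miss Dist (km)": 6.2e7, "Miss Dist (miles)": 3.9e7,
     "Miss Dist (lunar)": 162.0, "Hazardous": "True"},
    {"Miss Dist (km)": 4.1e7, "Miss Dist (miles)": 2.5e7,
     "Miss Dist (lunar)": 107.0, "Hazardous": False},
]
cleaned_rows = [clean_row(row) for row in raw_rows]
```

Keeping a single distance column avoids feeding the model the same measurement in several units.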

2️⃣ Exploratory Data Analysis (EDA)

EDA visualizations help us understand the dataset distribution and feature relationships.

Class Distribution of Hazardous vs. Non-Hazardous Asteroids

🔍 Observations:

The dataset is highly imbalanced, with far more non-hazardous asteroids (0) than hazardous ones (1).
This imbalance explains why the initial model was biased toward the majority class.
SMOTE was applied to balance the dataset and ensure the model learns to detect hazardous asteroids correctly.
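
To make the idea behind SMOTE concrete, here is a minimal sketch of its interpolation step: each synthetic point lies on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote_like` and the feature values are made up for illustration; a real pipeline would more likely rely on an existing SMOTE implementation than on this simplified version.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples.
        others = [point for point in minority if point is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic


# Made-up two-feature samples for the hazardous (minority) class.
hazardous = [(0.02, 20.1), (0.03, 21.0), (0.04, 19.5), (0.01, 21.8)]
new_points = smote_like(hazardous, n_new=6)
```

Because every synthetic point is a blend of two real minority samples, the new points stay inside the region the minority class already occupies.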

📌 Why Does This Matter?

Without balancing, the model would classify most asteroids as "not hazardous", leading to poor recall.
SMOTE is meant to improve recall on the hazardous class, so that fewer actual threats are missed.
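
A small worked example shows why accuracy alone is misleading here. The labels below are made up (95 non-hazardous, 5 hazardous) and are not the project's real class counts; the "model" simply predicts the majority class every time.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall for class 1) for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    positives = true_pos + false_neg
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall


# Made-up labels: 95 non-hazardous (0) and 5 hazardous (1).
y_true = [0] * 95 + [1] * 5
# A classifier that always answers "not hazardous".
y_pred = [0] * 100

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The always-negative classifier scores 95% accuracy while detecting none of the hazardous asteroids, which is why recall is the metric to watch.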

Feature Correlation Heatmap

🔍 Observations:

High correlation values (>0.8) suggest feature redundancy:
- Est Dia in KM(min) and Est Dia in KM(max) are strongly correlated (1.00) → Keeping both may be unnecessary.
- Perihelion Time and Epoch Osculation are strongly correlated (0.98).
- Mean Motion and Orbital Period are negatively correlated (-0.99) → Likely convey the same information.
Low correlation with Hazardous (~0.2 - 0.3):
- No single feature directly determines whether an asteroid is hazardous.
- The model must learn complex, non-linear relationships instead.

📌 Why Does This Matter?

Removing redundant features improves model efficiency.
Low correlation with Hazardous means the model must combine multiple weak signals to make predictions.
Feature selection and regularization help avoid overfitting.
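
One possible way to act on the heatmap is to drop one feature from every pair whose absolute correlation exceeds 0.8. The sketch below is an assumption about how that could be done, not the project's implementation; the sample values and the column `feature_x` are invented, and only the two "Est Dia" names come from the observations above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def redundant_features(columns, threshold=0.8):
    """For each pair with |r| > threshold, mark the later feature for removal."""
    names = list(columns)
    dropped = []
    for i, first in enumerate(names):
        if first in dropped:
            continue
        for second in names[i + 1:]:
            if second in dropped:
                continue
            if abs(pearson(columns[first], columns[second])) > threshold:
                dropped.append(second)
    return dropped


# Made-up values, for illustration only.
columns = {
    "Est Dia in KM(min)": [0.10, 0.25, 0.40, 0.80, 1.20],
    "Est Dia in KM(max)": [0.22, 0.56, 0.89, 1.79, 2.68],
    "feature_x": [3.0, 1.0, 4.0, 1.0, 5.0],  # hypothetical unrelated feature
}
to_drop = redundant_features(columns)
```

Using the absolute value means strongly negative pairs, such as Mean Motion and Orbital Period, are flagged as well.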

📌 Model Training & Evaluation

4️⃣ Model Training

Trains Decision Tree, Random Forest, AdaBoost, Gradient Boosting, and more.
Uses PyCaret to compare models automatically.
Saves model performance metrics.
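
The comparison step might look roughly like the sketch below. It assumes the cleaned data is a pandas DataFrame with a "Hazardous" target column; the function name `compare_and_save` and the `session_id` value are illustrative choices, not taken from the project's source.

```python
def compare_and_save(df, target="Hazardous",
                     report_path="reports/model_performance.csv"):
    """Compare PyCaret classifiers on df and save the score table as CSV."""
    # Imported lazily so this file can be loaded without PyCaret installed.
    from pycaret.classification import compare_models, pull, setup

    # session_id fixes the random seed so runs are repeatable (assumed value).
    setup(data=df, target=target, session_id=42)
    best_model = compare_models()  # trains and ranks the available models
    results = pull()               # score grid produced by compare_models()
    results.to_csv(report_path, index=False)
    return best_model, results
```

PyCaret's `setup` also accepts `fix_imbalance=True`, which could be an alternative to resampling the data by hand before training.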

Model	Accuracy	Precision	Recall	F1-Score
Gradient Boosting Classifier	🚀 1.00	1.00	1.00	1.00

📌 Full performance table saved in: reports/model_performance.csv

🛠 Why is Accuracy 100%?

Hazardous asteroids are rare in the dataset → A few strongly informative features might separate the classes almost perfectly.
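SMOTE added synthetic minority samples → If resampling happened before the train/test split, near-copies of training points may have leaked into the test set and inflated the scores.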
Tree-based models can memorize training data → Need stronger regularization.
Next steps:
- Apply cross-validation to validate performance.
- Limit tree depth to prevent memorization.
- Experiment with alternative resampling methods.
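
As a sketch of the cross-validation step, the snippet below builds stratified folds so that each fold keeps the same class ratio. It is a hand-rolled, dependency-free illustration with made-up labels; the name `stratified_folds` is hypothetical, and the project may prefer an existing library splitter instead.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with roughly equal class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Made-up labels: 90 non-hazardous (0) and 10 hazardous (1).
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, k=5)

splits = []
for held_out, test_idx in enumerate(folds):
    train_idx = [i for f, fold in enumerate(folds) if f != held_out
                 for i in fold]
    # Resample and fit on train_idx only; score on the untouched test_idx.
    splits.append((train_idx, test_idx))
```

Resampling inside each training fold, rather than once on the whole dataset, keeps synthetic samples out of the data used for scoring.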

📌 How to Run the Project

1️⃣ Install Dependencies

Ensure you have all required libraries installed:

pip install -r requirements.txt
