This project builds and evaluates a machine-learning model to classify websites as phishing or legitimate based on 31 URL & HTML features. The notebook covers data loading, preprocessing, feature engineering, model training, hyperparameter tuning (Grid Search & Randomized Search), and evaluation.
Phising1.ipynb # Main notebook
Phishing Data.csv # Raw dataset (linked from Google Drive)
README.md # This file
- Dataset: “Phishing Data” CSV (31 features + label)
- Size: ~2,000 rows; 100 missing values, handled during preprocessing
- Source: Google Drive (Everlytics2/Phishing Data – uploaded by you)
Each row represents a website with attributes like having_At_Symbol, double_slash_redirecting, SSLfinal_State, etc.
Label column: Result (1 = Legitimate, -1 = Phishing).
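A minimal sketch of the kind of inspection this implies. The inline CSV is a made-up stand-in for Phishing Data.csv, and filling gaps with each column's most frequent value is an assumption; this README does not state how the notebook handles the missing values.

```python
import io

import pandas as pd

# Tiny inline stand-in for "Phishing Data.csv" (values are made up for illustration).
csv_text = """having_At_Symbol,double_slash_redirecting,SSLfinal_State,Result
1,1,1,1
0,1,-1,-1
1,,1,1
0,-1,-1,-1
1,1,,1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Assumption: fill each gap with the column's most frequent value (mode).
df_filled = df.fillna(df.mode().iloc[0])

# Class balance of the label column (1 = legitimate, -1 = phishing).
label_counts = df_filled["Result"].value_counts()
print(label_counts)
```

With the real file, `pd.read_csv("Phishing Data.csv")` would replace the `io.StringIO` stand-in.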
- Data Loading & Exploration – Read CSV, inspect null values, and describe stats.
- Feature Engineering – Example: `Symbol_Redirect_Interaction = having_At_Symbol * double_slash_redirecting`.
- Model Training – Train models such as Logistic Regression, Random Forest, or XGBoost.
- Hyperparameter Tuning – GridSearchCV and RandomizedSearchCV.
- Evaluation – Accuracy, Precision, Recall, F1-Score.
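The steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the notebook's actual code: the data is synthetic (three of the 31 features plus a made-up labelling rule), the parameter grids are arbitrary small examples, and Random Forest is picked as one of the model options listed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

# Synthetic stand-in for the real dataset: three features with values in {-1, 0, 1}.
rng = np.random.default_rng(0)
n_rows = 300
df = pd.DataFrame({
    "having_At_Symbol": rng.choice([-1, 0, 1], size=n_rows),
    "double_slash_redirecting": rng.choice([-1, 0, 1], size=n_rows),
    "SSLfinal_State": rng.choice([-1, 0, 1], size=n_rows),
})
# Made-up labelling rule so the model has something to learn.
df["Result"] = np.where(df["SSLfinal_State"] + df["having_At_Symbol"] > 0, 1, -1)

# Feature engineering: the interaction term named in the list above.
df["Symbol_Redirect_Interaction"] = (
    df["having_At_Symbol"] * df["double_slash_redirecting"]
)

X = df.drop(columns="Result")
y = df["Result"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Grid search tries every combination in a small, illustrative grid.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [20, 50], "max_depth": [3, None]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

# Randomized search samples a fixed number (n_iter) of combinations instead.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": [20, 50, 100],
        "max_depth": [3, 5, None],
        "min_samples_split": [2, 5, 10],
    },
    n_iter=4,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
rand.fit(X_train, y_train)

model = grid.best_estimator_
test_accuracy = model.score(X_test, y_test)
print("grid best params  :", grid.best_params_)
print("random best params:", rand.best_params_)
print("test accuracy     :", test_accuracy)
```

Grid search is exhaustive and suits small grids; randomized search caps the number of fits, which helps when the search space grows.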
A single row of the CSV might look like:
| having_At_Symbol | double_slash_redirecting | SSLfinal_State | ... | Result |
|---|---|---|---|---|
| 0 | 1 | -1 | ... | -1 |
Where:
- 0/1/-1 are encoded feature values,
- Result is the ground truth (-1 phishing, 1 legitimate).
Predictions on unseen data:
| URL_ID | Predicted_Label |
|---|---|
| 1001 | Phishing |
| 1002 | Legitimate |
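A table like this could be produced from raw model output roughly as follows. The `raw_predictions` list is a stand-in for the array returned by `model.predict(...)`, and `LABEL_NAMES` is a hypothetical helper, not a name taken from the notebook.

```python
import pandas as pd

# Map the numeric labels used in the Result column to readable names.
LABEL_NAMES = {1: "Legitimate", -1: "Phishing"}

url_ids = [1001, 1002]
raw_predictions = [-1, 1]  # stand-in for model.predict(unseen_features)

predictions = pd.DataFrame({
    "URL_ID": url_ids,
    "Predicted_Label": [LABEL_NAMES[p] for p in raw_predictions],
})
print(predictions.to_string(index=False))
```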
Console metrics printout:
Accuracy : 0.964
Precision: 0.958
Recall : 0.971
F1-score : 0.964
(These numbers reflect a typical RandomForest run on this dataset; actual values vary with the train/test split and hyperparameters.)
| Metric | Score |
|---|---|
| Accuracy | 96.4% |
| Precision | 95.8% |
| Recall | 97.1% |
| F1-Score | 96.4% |
The model achieves high recall (97.1%). This is desirable for catching most phishing sites, provided recall is measured with Phishing (-1) as the positive class.
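Because the labels are 1 and -1, scikit-learn's binary metrics treat 1 (Legitimate) as the positive class unless told otherwise. The sketch below uses hand-made labels, not the project's results, to show how `pos_label=-1` changes which class precision and recall describe.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made example: 4 phishing (-1) and 6 legitimate (1) sites.
y_true = [-1, -1, -1, -1, 1, 1, 1, 1, 1, 1]
y_pred = [-1, -1, -1, 1, 1, 1, 1, 1, 1, -1]

accuracy = accuracy_score(y_true, y_pred)

# Phishing as the positive class: "how many phishing sites did we catch?"
precision_phishing = precision_score(y_true, y_pred, pos_label=-1)
recall_phishing = recall_score(y_true, y_pred, pos_label=-1)
f1_phishing = f1_score(y_true, y_pred, pos_label=-1)

# Default pos_label=1 measures the Legitimate class instead.
recall_legitimate = recall_score(y_true, y_pred)

print(f"Accuracy : {accuracy:.3f}")
print(f"Precision: {precision_phishing:.3f} (phishing)")
print(f"Recall   : {recall_phishing:.3f} (phishing)")
print(f"F1-score : {f1_phishing:.3f} (phishing)")
print(f"Recall   : {recall_legitimate:.3f} (legitimate)")
```

Here 3 of the 4 phishing sites are caught, so phishing recall is 0.75, while the default call reports 5 of 6 legitimate sites (about 0.833).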
- Clone this repo or open the notebook in Google Colab.
- Upload `Phishing Data.csv` to your working directory.
- Run all cells to train and evaluate the model.
- Modify the `new_input` DataFrame at the bottom to classify custom URLs.
# Example new input row
new_input = pd.DataFrame([{
'having_At_Symbol':0,
'double_slash_redirecting':1,
'SSLfinal_State':-1,
# ... all other features
}])
prediction = model.predict(new_input)
print('Prediction:', 'Legitimate' if prediction[0]==1 else 'Phishing')
Output:
Prediction: Phishing
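A custom row has to supply every feature the model was trained on, in the same column order. One way to guard against omissions is sketched below; `prepare_input` and the three-column `feature_columns` list are hypothetical and only illustrate the idea. In practice the list would hold all training columns, including any engineered ones such as Symbol_Redirect_Interaction.

```python
import pandas as pd

# Hypothetical subset of the training columns, in training order.
feature_columns = ["having_At_Symbol", "double_slash_redirecting", "SSLfinal_State"]


def prepare_input(row: dict, columns: list) -> pd.DataFrame:
    """Build a one-row DataFrame whose columns match the training order."""
    missing = [name for name in columns if name not in row]
    if missing:
        raise ValueError(f"missing features: {missing}")
    # reindex puts the columns in training order and drops any extras.
    return pd.DataFrame([row]).reindex(columns=columns)


new_input = prepare_input(
    {"SSLfinal_State": -1, "having_At_Symbol": 0, "double_slash_redirecting": 1},
    feature_columns,
)
print(new_input)
```

The resulting `new_input` can then be passed to `model.predict` as in the example above.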