GRUMPY-TUCKER/Phishing_ML_Project


🛡️ Phishing Website Detection

This project builds and evaluates a machine-learning model to classify websites as phishing or legitimate based on 31 URL & HTML features. The notebook covers data loading, preprocessing, feature engineering, model training, hyperparameter tuning (Grid Search & Randomized Search), and evaluation.


📂 Project Structure

Phising1.ipynb   # Main notebook
Phishing Data.csv # Raw dataset (linked from Google Drive)
README.md        # This file

📈 Architecture

[Image: architecture diagram]

🔍 Data Source

  • Dataset: “Phishing Data” CSV (31 features + label)
  • Size: ~2,000 rows; 100 missing values, handled during preprocessing
  • Source: Google Drive (Everlytics2/Phishing Data)

Each row represents a website with attributes like having_At_Symbol, double_slash_redirecting, SSLfinal_State, etc. Label column: Result (1 = Legitimate, -1 = Phishing).
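As a rough illustration of inspecting this data, the sketch below loads a tiny inline stand-in for the CSV, counts missing values, and checks the label balance. The four sample rows are made up, and dropping incomplete rows is an assumption; the notebook may handle missing values differently (for example by imputing).

```python
import io

import pandas as pd

# Tiny made-up stand-in for "Phishing Data.csv"; the real file has 31 features.
csv_text = """having_At_Symbol,double_slash_redirecting,SSLfinal_State,Result
1,1,1,1
-1,1,-1,-1
1,,0,1
1,-1,-1,-1
"""

# In the notebook this would be pd.read_csv("Phishing Data.csv").
df = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column before deciding how to handle them.
null_counts = df.isnull().sum()
print(null_counts)

# Assumed handling: drop incomplete rows (the notebook may impute instead).
clean = df.dropna().reset_index(drop=True)

# Class balance of the label column (1 = legitimate, -1 = phishing).
label_counts = clean["Result"].value_counts()
print(label_counts)
```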


⚙️ Workflow

  1. Data Loading & Exploration – Read CSV, inspect null values, and describe stats.
  2. Feature Engineering – Example: Symbol_Redirect_Interaction = having_At_Symbol * double_slash_redirecting.
  3. Model Training – Train models such as Logistic Regression, Random Forest, or XGBoost.
  4. Hyperparameter Tuning – Grid search and randomized search (GridSearchCV, RandomizedSearchCV).
  5. Evaluation – Accuracy, Precision, Recall, F1-Score.
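The steps above can be sketched end to end on synthetic data. This is a minimal sketch, not the notebook's actual code: the random data, the toy labelling rule, the choice of Random Forest, the parameter grid, and the 80/20 split are all assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

# Synthetic stand-in data; the real notebook reads "Phishing Data.csv".
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "having_At_Symbol": rng.choice([-1, 1], size=n),
    "double_slash_redirecting": rng.choice([-1, 1], size=n),
    "SSLfinal_State": rng.choice([-1, 0, 1], size=n),
})
# Toy labelling rule so the model has something to learn (not the real labels).
df["Result"] = np.where(df["SSLfinal_State"] >= 0, 1, -1)

# Step 2: the interaction feature named in the workflow.
df["Symbol_Redirect_Interaction"] = (
    df["having_At_Symbol"] * df["double_slash_redirecting"]
)

X = df.drop(columns="Result")
y = df["Result"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 3-4: exhaustive search over a small, assumed parameter grid.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="accuracy"
)
grid.fit(X_train, y_train)

# Randomized search samples n_iter combinations from the same space.
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=3,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_train, y_train)

# Step 5: score the refit best model on the held-out split.
test_accuracy = grid.score(X_test, y_test)
print("Grid search best params:", grid.best_params_)
print("Randomized search best params:", random_search.best_params_)
print("Held-out accuracy:", test_accuracy)
```

Grid search tries every combination, while randomized search evaluates only a sample of them, which tends to matter more as the search space grows.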

📝 Sample Input

A single row of the CSV might look like:

having_At_Symbol  double_slash_redirecting  SSLfinal_State  ...  Result
0                 1                         -1              ...  -1

Where:

  • 0/1/-1 are encoded feature values,
  • Result is the ground truth (-1 phishing, 1 legitimate).

📝 Sample Output

Predictions on unseen data:

URL_ID  Predicted_Label
1001    Phishing
1002    Legitimate
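A table like this can be produced by mapping the model's numeric output back to readable labels. In the sketch below the IDs and raw predictions are hard-coded placeholders standing in for the output of model.predict; the `label_names` mapping follows the Result encoding described earlier.

```python
import pandas as pd

# Hypothetical IDs and raw predictions, e.g. the output of model.predict(X_new).
url_ids = [1001, 1002]
raw_predictions = [-1, 1]

# Result encoding from the dataset: -1 = phishing, 1 = legitimate.
label_names = {-1: "Phishing", 1: "Legitimate"}

report = pd.DataFrame({
    "URL_ID": url_ids,
    "Predicted_Label": [label_names[p] for p in raw_predictions],
})
print(report.to_string(index=False))
```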

Console metrics printout:

Accuracy : 0.964
Precision: 0.958
Recall   : 0.971
F1-score : 0.964

(These numbers reflect a typical Random Forest run on this dataset; actual values may vary depending on the train/test split and hyperparameters.)


📊 Model Performance

Metric     Score
Accuracy   96.4%
Precision  95.8%
Recall     97.1%
F1-Score   96.4%

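The model achieves high recall (97.1%), which is desirable for catching most phishing sites. Because the labels are encoded as 1 = legitimate and -1 = phishing, this reading holds only if recall is computed with phishing (-1) as the positive class.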
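To make the positive-class point concrete, here is a small sketch using scikit-learn's metric functions on made-up labels. By default these functions treat label 1 (legitimate here) as the positive class, so `pos_label=-1` is passed to measure how well phishing sites are caught. The label vectors are invented for illustration and are not results from this project.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels for illustration (1 = legitimate, -1 = phishing).
y_true = [-1, -1, -1, -1, 1, 1, 1, 1, 1, 1]
y_pred = [-1, -1, -1, 1, 1, 1, 1, 1, 1, -1]

accuracy = accuracy_score(y_true, y_pred)

# pos_label=-1 makes phishing the positive class for these three metrics.
phishing_precision = precision_score(y_true, y_pred, pos_label=-1)
phishing_recall = recall_score(y_true, y_pred, pos_label=-1)
phishing_f1 = f1_score(y_true, y_pred, pos_label=-1)

# Default pos_label=1 measures recall on the legitimate class instead.
legitimate_recall = recall_score(y_true, y_pred)

print(f"Accuracy          : {accuracy:.3f}")
print(f"Phishing precision: {phishing_precision:.3f}")
print(f"Phishing recall   : {phishing_recall:.3f}")
print(f"Phishing F1       : {phishing_f1:.3f}")
print(f"Legitimate recall : {legitimate_recall:.3f}")
```

In this toy example the two recall values differ (0.75 for phishing, about 0.83 for legitimate), which is why it helps to state which class a reported recall refers to.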


🚀 How to Run

  1. Clone this repo or open the notebook in Google Colab.
  2. Upload Phishing Data.csv to your working directory.
  3. Run all cells to train and evaluate the model.
  4. Modify the new_input DataFrame at the bottom to classify a custom website, given as a row of encoded feature values.

🧪 Example Prediction in Notebook

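import pandas as pd

# Example new input row; `model` is the classifier trained earlier in the notebook
new_input = pd.DataFrame([{
    'having_At_Symbol': 0,
    'double_slash_redirecting': 1,
    'SSLfinal_State': -1,
    # ... all other features
}])

# Result encoding: 1 = Legitimate, -1 = Phishing
prediction = model.predict(new_input)
print('Prediction:', 'Legitimate' if prediction[0] == 1 else 'Phishing')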

Output:

Prediction: Phishing
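Before calling model.predict on a hand-built row, it can help to check that the row has every feature the model was trained on and that the columns are in the same order. The sketch below shows one way to do that; `training_columns` is a hypothetical three-feature list standing in for the notebook's full training column list (for example X_train.columns).

```python
import pandas as pd

# Hypothetical training column order; in the notebook this would come from
# the training features (e.g. list(X_train.columns)).
training_columns = ["having_At_Symbol", "double_slash_redirecting", "SSLfinal_State"]

# A hand-built row whose keys are in a different order than training.
new_input = pd.DataFrame([{
    "SSLfinal_State": -1,
    "having_At_Symbol": 0,
    "double_slash_redirecting": 1,
}])

# Fail early with a clear message if any training feature is absent.
missing = [c for c in training_columns if c not in new_input.columns]
if missing:
    raise ValueError(f"new_input is missing features: {missing}")

# Reorder the columns to match the order used during training.
aligned = new_input[training_columns]
print(aligned)
```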
