🚀 Malware Classification Using Assembly Instruction Analysis

📌 Overview

This project focuses on malware classification by analyzing the frequency of assembly instructions extracted from disassembled files. It leverages a hybrid ML/DL pipeline that combines:

  • ✅ Deep Learning (MLP) for complex pattern recognition
  • ✅ Traditional ML (RandomForest, XGBoost, StackingClassifier) for ensemble robustness
  • ✅ SHAP Explainability for interpretable AI and feature impact analysis

🛠 Tech Stack

  • Machine Learning: RandomForestClassifier, XGBoost, StackingClassifier
  • Deep Learning: MLP (Multi-Layer Perceptron) built with TensorFlow/Keras
  • Feature Engineering: Opcode frequency extraction from .asm files
  • Libraries Used: Scikit-learn, XGBoost, TensorFlow, SHAP, Matplotlib, Pandas, NumPy
  • Visualization & Explainability: SHAP, Matplotlib

📂 Dataset & Features

Each sample in the dataset is a malware file represented by the counts of 13 key assembly instructions, with a label identifying its malware family.

Extracted Instructions: mov, jmp, call, push, pop, cmp, lea, xor, test, sub, add, shr, shl
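As a rough illustration of how these counts could be produced, here is a minimal sketch that scans disassembly text and counts the 13 mnemonics. It is not the project's extraction code: the function name `count_opcodes`, the regular expression, and the assumption that `;` starts a comment are all hypothetical choices made for this example.

```python
import re
from collections import Counter

# The 13 instructions listed above, in a fixed order so that every sample
# yields a feature vector with the same layout.
OPCODES = ["mov", "jmp", "call", "push", "pop", "cmp", "lea",
           "xor", "test", "sub", "add", "shr", "shl"]

# Assumption: identifiers are letters, digits and underscores, so a label
# such as "sub_401020" is one token and is not mistaken for "sub".
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def count_opcodes(asm_text):
    """Return opcode counts for one disassembly listing, ordered as OPCODES."""
    counts = Counter()
    for line in asm_text.splitlines():
        code = line.split(";", 1)[0]  # assumption: ';' starts a comment
        for token in TOKEN_RE.findall(code):
            token = token.lower()
            if token in OPCODES:
                counts[token] += 1
                break  # count at most one mnemonic per line
    return [counts[op] for op in OPCODES]


# Hypothetical listing in an IDA-like layout, for illustration only.
sample = """
.text:00401000 push ebp          ; save frame pointer
.text:00401001 mov  ebp, esp
.text:00401003 xor  eax, eax
.text:00401005 call sub_401020
.text:0040100A pop  ebp
"""

features = count_opcodes(sample)
```

Real `.asm` dumps vary by disassembler, so the tokenizing rules would need adjusting to the actual file layout.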

Target Labels (Malware Families):

| Label | Malware Type |
|-------|--------------|
| 0 | Trojan 🐴 |
| 1 | Ransomware 💰 |
| 2 | Worm 🪱 |
| 3 | Spyware 🔍 |
| 4 | Adware 📢 |
| 5 | Rootkit 🛠️ |
| 6 | Backdoor 🔓 |
| 7 | Keylogger ⌨️ |
| 8 | Fileless Malware 🌫 |

📊 Project Workflow

🔹 1. Data Preprocessing

  • ✅ Extract opcode frequency features
  • ✅ Normalize features using StandardScaler
  • ✅ Select top 10 features via RandomForest importance
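The preprocessing steps above can be sketched with scikit-learn as follows. This is an assumed reconstruction, not the repository's code: the helper name `select_top_features`, the forest size, the random seed, and the synthetic data are all placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

FEATURES = ["mov", "jmp", "call", "push", "pop", "cmp", "lea",
            "xor", "test", "sub", "add", "shr", "shl"]


def select_top_features(X, y, k=10, seed=0):
    """Standardize X, rank columns by RandomForest importance, keep the top k."""
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Hyperparameters here are assumptions, not the project's settings.
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    forest.fit(X_scaled, y)

    # Column indices sorted from most to least important.
    top_idx = np.argsort(forest.feature_importances_)[::-1][:k]
    return X_scaled[:, top_idx], top_idx


# Synthetic stand-in data: 180 samples, 13 opcode counts, 9 classes.
rng = np.random.default_rng(0)
X = rng.poisson(lam=20, size=(180, 13)).astype(float)
y = rng.integers(0, 9, size=180)

X_top, top_idx = select_top_features(X, y)
selected_names = [FEATURES[i] for i in top_idx]
```

In a real experiment the scaler and the forest would normally be fitted on the training split only, so that the test split does not influence feature selection.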

🔹 2. Model Training

🧠 Deep Learning – MLP Architecture

  • Input: Top 10 most important features
  • Hidden Layers: 128 → 64 neurons (ReLU + Dropout)
  • Output: Softmax over 9 malware classes
  • Loss: Categorical Crossentropy
  • Optimizer: Adam
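A network matching the description above might look like the following Keras sketch. The dropout rate of 0.3 and the placement of a Dropout layer after each hidden layer are assumptions, since the README only states "ReLU + Dropout"; `build_mlp` is a hypothetical helper name.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_mlp(n_features=10, n_classes=9, dropout_rate=0.3):
    """128 -> 64 ReLU MLP with dropout and a softmax output layer."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(128, activation="relu"),
        layers.Dropout(dropout_rate),   # assumed rate
        layers.Dense(64, activation="relu"),
        layers.Dropout(dropout_rate),   # assumed rate
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",  # expects one-hot encoded labels
        metrics=["accuracy"],
    )
    return model


model = build_mlp()

# Forward pass on dummy input to check the output shape.
probs = model.predict(np.zeros((4, 10), dtype="float32"), verbose=0)
```

Because the loss is categorical cross-entropy, integer labels 0 to 8 would first be one-hot encoded, for example with `keras.utils.to_categorical(y, num_classes=9)`.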

🤖 Traditional ML – Stacking Model

  • Base Models: RandomForest 🌲, XGBoost ⚡
  • Meta Learner: RandomForest
  • Trained on selected top features

🔎 Model Evaluation

📈 Metrics

  • Accuracy Score
  • Classification Report
  • Confusion Matrix
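These three metrics are available in `sklearn.metrics`. The sketch below computes them on a tiny hand-made label set; the values are toy data chosen for illustration and are not results from this project.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Toy labels for illustration only (4 classes, 8 samples).
y_true = [0, 1, 2, 2, 1, 0, 3, 3]
y_pred = [0, 1, 2, 1, 1, 0, 3, 2]

# Fraction of samples whose predicted label matches the true label.
accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)

# Per-class precision, recall and F1 as a formatted string.
report = classification_report(y_true, y_pred, zero_division=0)

print(f"accuracy = {accuracy:.2f}")
print(matrix)
print(report)
```

Off-diagonal entries of the confusion matrix show which malware families are mistaken for each other, which a single accuracy figure hides.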

🎯 Results

| Model | Accuracy |
|-------|----------|
| MLP | 90.20% |
| Stacking Model | 97.84% |

🔍 Explainability & Visualization

  • SHAP Summary Plot: Shows feature contributions for predictions
  • Feature Importance Plot: Highlights top influential instructions
  • Instruction Frequency by Malware Type: Explains behavioral patterns
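The data behind an instruction-frequency-by-type view can be computed with a pandas group-by, as in this small sketch. The column name `label` and the counts are hypothetical; only three instructions and three classes are shown to keep the example short.

```python
import pandas as pd

# Hypothetical opcode counts for six samples from three classes.
df = pd.DataFrame({
    "label": [0, 0, 1, 1, 2, 2],
    "mov":   [10, 14, 30, 34, 5, 7],
    "xor":   [2, 4, 20, 22, 1, 1],
    "call":  [8, 8, 3, 5, 12, 10],
})

# Mean count of each instruction per malware class.
mean_counts = df.groupby("label").mean()
print(mean_counts)
```

With Matplotlib installed, `mean_counts.plot(kind="bar")` would turn this table into a grouped bar chart, one group per class.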

👨‍💻 Contributors

| Name | GitHub |
|------|--------|
| Shobhana Shankar | @Shobhanashankar |
| Madhumita | @Madhumita-05 |
| Sahil Khan | @Sahil-Khan10 |
| Arivumathi | @Arivumathi007 |
