🚀 Malware Classification Using Assembly Instruction Analysis

📌 Overview

This project classifies malware into families by analyzing how often selected assembly instructions appear in disassembled files. It uses a hybrid ML/DL pipeline that combines:

✅ Deep Learning (MLP) for complex pattern recognition
✅ Traditional ML (RandomForest, XGBoost, StackingClassifier) for ensemble robustness
✅ SHAP Explainability for interpretable AI and feature impact analysis

🛠 Tech Stack

Machine Learning: RandomForestClassifier, XGBoost, StackingClassifier
Deep Learning: MLP (Multi-Layer Perceptron) built with TensorFlow/Keras
Feature Engineering: Opcode frequency extraction from .asm files
Libraries Used: Scikit-learn, XGBoost, TensorFlow, SHAP, Matplotlib, Pandas, NumPy
Visualization & Explainability: SHAP, Matplotlib

📂 Dataset & Features

Each malware sample in the dataset is represented by the counts of 13 assembly instructions found in its disassembly.

Extracted Instructions: mov, jmp, call, push, pop, cmp, lea, xor, test, sub, add, shr, shl
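
The extraction step can be sketched as follows. This is a minimal illustration, assuming a simplified .asm layout in which the mnemonic is the first token on a line after an optional label; real disassembler output with address or byte columns would need a different parser. The function name `count_opcodes` and the sample listing are hypothetical and not taken from the project notebook.

```python
import re
from collections import Counter

# The 13 instructions listed above, in a fixed order so every sample
# yields a feature vector with the same layout.
OPCODES = ["mov", "jmp", "call", "push", "pop", "cmp", "lea",
           "xor", "test", "sub", "add", "shr", "shl"]

# Assumed line shape: optional "label:" prefix, then the mnemonic.
_LINE = re.compile(r"^\s*(?:[\w.$@?]+:)?\s*([a-zA-Z]+)\b")


def count_opcodes(asm_text):
    """Return counts of the tracked opcodes, ordered as in OPCODES."""
    counts = Counter()
    for line in asm_text.splitlines():
        code = line.split(";", 1)[0]  # drop trailing comments
        match = _LINE.match(code)
        if match:
            mnemonic = match.group(1).lower()
            if mnemonic in OPCODES:
                counts[mnemonic] += 1
    return [counts[op] for op in OPCODES]


# Hypothetical listing used only to exercise the function.
sample = """
start:  push ebp
        mov ebp, esp
        xor eax, eax      ; clear eax
loop_top:
        add eax, 1
        cmp eax, 10
        jmp loop_top
        pop ebp
        ret
"""
features = count_opcodes(sample)
print(dict(zip(OPCODES, features)))
```

Keeping `OPCODES` in a fixed order means the returned list can be used directly as one row of the feature matrix.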

Target Labels (Malware Families):

Label	Malware Type
0	Trojan 🐴
1	Ransomware 💰
2	Worm 🪱
3	Spyware 🔍
4	Adware 📢
5	Rootkit 🛠️
6	Backdoor 🔓
7	Keylogger ⌨️
8	Fileless Malware 🌫

📊 Project Workflow

🔹 1. Data Preprocessing

✅ Extract opcode frequency features
✅ Normalize features using StandardScaler
✅ Select the top 10 of the 13 features by RandomForest feature importance
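
The scaling and selection steps above might look like the sketch below. The data here is synthetic (random Poisson counts and random labels) and the hyperparameters are assumptions; only the use of StandardScaler and RandomForest importance comes from this README.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

OPCODES = ["mov", "jmp", "call", "push", "pop", "cmp", "lea",
           "xor", "test", "sub", "add", "shr", "shl"]

# Synthetic stand-in data: 200 samples x 13 opcode counts, 9 class labels.
rng = np.random.default_rng(0)
X = rng.poisson(lam=20, size=(200, len(OPCODES))).astype(float)
y = rng.integers(0, 9, size=200)

# Standardize each opcode count to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Rank features by impurity-based importance from a random forest
# (n_estimators=100 is an assumed setting).
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_scaled, y)

# Keep the 10 highest-ranked columns.
top_idx = np.argsort(forest.feature_importances_)[::-1][:10]
top_features = [OPCODES[i] for i in top_idx]
X_top = X_scaled[:, top_idx]
print(top_features)
```

When a held-out test set is used, fitting the scaler and the ranking forest on the training split only avoids leaking test information into the selected features.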

🔹 2. Model Training

🧠 Deep Learning – MLP Architecture

Input: Top 10 most important features
Hidden Layers: 128 → 64 neurons (ReLU + Dropout)
Output: Softmax over 9 malware classes
Loss: Categorical Crossentropy
Optimizer: Adam
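
A Keras model matching the architecture listed above could be built as in this sketch. The dropout rate of 0.3 and the helper name `build_mlp` are assumptions; the README gives only the layer sizes, activations, loss, and optimizer.

```python
import tensorflow as tf

NUM_FEATURES = 10    # top 10 selected opcode features
NUM_CLASSES = 9      # malware families 0-8
DROPOUT_RATE = 0.3   # assumed value; not stated in the README


def build_mlp(num_features=NUM_FEATURES, num_classes=NUM_CLASSES,
              dropout_rate=DROPOUT_RATE):
    """Build and compile a 128 -> 64 MLP with a softmax output."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(num_features,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_mlp()
model.summary()
```

Categorical crossentropy expects one-hot targets, so integer labels 0-8 would first be converted, for example with `tf.keras.utils.to_categorical(y, num_classes=9)`.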

🤖 Traditional ML – Stacking Model

Base Models: RandomForest 🌲, XGBoost ⚡
Meta Learner: RandomForest
Trained on selected top features

🔎 Model Evaluation

📈 Metrics

Accuracy Score
Classification Report
Confusion Matrix
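
The three metrics above are available in scikit-learn. The sketch below uses a handful of made-up labels for three of the nine classes, purely to show the calls; it does not reproduce the project's evaluation.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Toy labels: 0 = Trojan, 1 = Ransomware, 2 = Worm.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
report = classification_report(
    y_true, y_pred, target_names=["Trojan", "Ransomware", "Worm"])
matrix = confusion_matrix(y_true, y_pred)  # rows: true, columns: predicted

print(f"accuracy: {accuracy:.2f}")
print(report)
print(matrix)
```

The confusion matrix is the most useful of the three for seeing which families get mistaken for each other, since accuracy alone hides per-class errors.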

🎯 Results

Model	Accuracy
MLP	90.20%
Stacking Model	97.84%

🔍 Explainability & Visualization

SHAP Summary Plot: Shows feature contributions for predictions
Feature Importance Plot: Highlights top influential instructions
Instruction Frequency by Malware Type: Compares instruction counts across malware families to reveal behavioral patterns

👨‍💻 Contributors

Name	GitHub
Shobhana Shankar	@Shobhanashankar
Madhumita	@Madhumita-05
Sahil Khan	@Sahil-Khan10
Arivumathi	@Arivumathi007
