This project is inspired by the winners of the "Microsoft Malware Classification Challenge". Its objective is to classify executable files as either benign or as belonging to one of nine malware classes.
To achieve this goal, we used two models:
- Machine Learning - Our main feature is based on opcode counts: we read the disassembly of EXE files and split it into n-grams. We used the XGBoost package (an implementation of gradient boosted decision trees) to build multiple decision trees and combine them into an improved model.
- Deep Learning - We implemented a convolutional neural network based on Raff et al.'s groundbreaking paper 'Malware Detection by Eating a Whole EXE'.
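The opcode n-gram counting behind the Machine Learning model can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `opcode_ngrams` and `to_feature_vector` and the sample opcode sequence are hypothetical, and it assumes the opcodes have already been extracted from the disassembly as a list of mnemonics.

```python
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Count every run of n consecutive opcodes (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty range, hence an empty Counter.
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def to_feature_vector(counts, vocabulary):
    """Project n-gram counts onto a fixed, ordered vocabulary."""
    return [counts.get(gram, 0) for gram in vocabulary]


# Hypothetical opcode sequence, as if read from a disassembly listing.
opcodes = ["push", "mov", "call", "push", "mov", "ret"]
bigrams = opcode_ngrams(opcodes, 2)

# In practice the vocabulary would be built once over the whole training set.
vocabulary = sorted(bigrams)
features = to_feature_vector(bigrams, vocabulary)
```

Fixed-length count vectors of this kind, one per file, are the sort of input a gradient boosted tree classifier such as XGBoost expects.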
We examined files that fall into ten different classes (one benign class and nine malware classes). Moreover, we ensured that each class received equal representation in the test set, so that we can verify the model does not simply classify all files into the same class.
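One way to build such a class-balanced test set is to draw the same number of samples from every class. The sketch below is an assumption about how this could be done, not the project's own procedure: `balanced_split`, the `per_class` parameter, and the toy labels are all hypothetical.

```python
import random
from collections import defaultdict


def balanced_split(labels, per_class, seed=0):
    """Draw `per_class` test samples from every class (hypothetical helper).

    Returns (train_indices, test_indices).
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    test = []
    for label, indices in by_class.items():
        if len(indices) < per_class:
            raise ValueError(f"class {label!r} has only {len(indices)} samples")
        test.extend(rng.sample(indices, per_class))

    test_set = set(test)
    train = [i for i in range(len(labels)) if i not in test_set]
    return train, sorted(test)


# Hypothetical labels: 0 = benign, 1..9 = malware classes, unevenly sized.
labels = [c for c in range(10) for _ in range(5 + c)]
train_idx, test_idx = balanced_split(labels, per_class=3)
```

With ten equally represented classes, a model that always predicts a single class scores only 10% accuracy on the test set, so such a failure is easy to spot.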
| | Accuracy | Average loss |
|---|---|---|
| Train set | 99.487231% | 0.013942 |
| Test set | 94.611516% | 0.249856 |

| | Accuracy | Average loss |
|---|---|---|
| Train set | 99.256321% | 0.025617 |
| Test set | 91.666667% | 0.368867 |
- xgboost
- numpy
- sklearn
- pydasm
- pytorch
- numpy

