Raz99/ole-macro-malware-detection

OLE Macro Malware Detection System

Overview

This repository implements a lightweight static malware detection system for legacy Microsoft Office documents stored in the OLE (Object Linking and Embedding) format (.doc, .xls, .ppt).

The system focuses on detecting malicious VBA macros through static analysis without executing the document. It extracts a unified feature vector from embedded macro code and classifies files using a trained Random Forest model.

The project combines academic research methodology with a production-style microservices architecture.


System Architecture

The system follows a microservices architecture using Docker Compose.

Components:

  • FastAPI backend (api.py)
  • Celery worker (worker.py, tasks.py)
  • Redis (message broker and result backend)
  • Streamlit frontend (streamlit_app.py)
  • Machine learning model (Random Forest)

Processing flow:

  1. The user uploads one or more files through the Streamlit UI
  2. FastAPI receives each file and enqueues a Celery task
  3. A Celery worker extracts features and runs inference
  4. The result is stored in the Redis result backend
  5. The API returns the verdict and confidence score when the client requests the result by task ID

User Interface

Upload & Batch Analysis

The system allows uploading multiple OLE files simultaneously.
Files are processed asynchronously and progress is tracked in real time.

[Screenshot: Upload UI]


Detailed File Results

For each analyzed file, the system provides:

  • Verdict (Benign / Malicious)
  • Confidence score
  • VBA macro size
  • Malicious vs benign probability distribution
  • Top feature importances influencing the decision

[Screenshot: Results UI]


Feature Engineering

The model is trained on 112 static features extracted from VBA macro code.

Baseline 1 – Keyword-Based Features (92)

  • File size
  • Important VBA keywords
  • Critical high-risk keywords
  • Auto-execution routines
  • System interaction patterns
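
To illustrate the keyword-based group above, here is a minimal sketch of counting keyword occurrences in extracted macro text. The keyword lists, the feature names, and the `keyword_features` function are hypothetical; the actual 92 features live in extract_features.py and are not reproduced here.

```python
import re

# Hypothetical keyword groups for illustration only; not the project's real list.
AUTO_EXEC = ["AutoOpen", "Auto_Open", "Document_Open", "Workbook_Open"]
HIGH_RISK = ["Shell", "CreateObject", "URLDownloadToFile"]


def keyword_features(vba_code: str, file_size: int) -> dict:
    """Return the file size plus one whole-word, case-insensitive count per keyword."""
    features = {"file_size": file_size}
    for kw in AUTO_EXEC + HIGH_RISK:
        # \b keeps "Shell" from matching inside a longer identifier such as "ShellExec".
        pattern = r"\b" + re.escape(kw) + r"\b"
        hits = re.findall(pattern, vba_code, flags=re.IGNORECASE)
        features["kw_" + kw.lower()] = len(hits)
    return features
```

VBA is case-insensitive, which is why the match ignores case. Whole-word matching is a design choice in this sketch and may differ from what the project does.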

Baseline 2 – Obfuscation-Based Features (15)

  • Entropy-based metrics
  • Identifier length statistics
  • String concatenation patterns
  • Encoding-related function usage
  • Logic complexity indicators
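
The obfuscation signals listed above can be approximated with a few lines of standard-library Python. This is a sketch under assumptions: the feature names, the crude identifier tokenization, and the choice of `Chr`/`ChrW` as the encoding functions to count are illustrative and may not match the project's 15 features.

```python
import math
import re
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def obfuscation_features(vba_code: str) -> dict:
    # Crude tokenization: this also picks up VBA keywords and words inside strings.
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", vba_code)
    lengths = [len(name) for name in identifiers]
    return {
        "entropy": shannon_entropy(vba_code),
        "mean_identifier_len": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_identifier_len": max(lengths, default=0),
        # "&" is the VBA string concatenation operator (it also prefixes &H literals).
        "concat_count": vba_code.count("&"),
        # Chr(...), ChrW(...), and Chr$(...) calls are often used to hide strings.
        "chr_calls": len(re.findall(r"\bChrW?\$?\s*\(", vba_code, flags=re.IGNORECASE)),
    }
```

High entropy and long, random-looking identifiers tend to indicate machine-generated or packed code, which is why these statistics complement plain keyword counts.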

Optimized Features (5)

  • Density-normalized keyword features
  • Hex density
  • Longest string length
  • Suspicious keyword density

These features combine semantic detection and obfuscation awareness for robust classification.
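
As a rough illustration of the density-style features in the optimized group, the sketch below normalizes counts by macro length so that large and small macros are comparable. The exact definitions are assumptions: here "hex density" means the share of characters inside runs of eight or more hex digits, and the `SUSPICIOUS` keyword list is hypothetical.

```python
import re

# Hypothetical list; the project's suspicious keywords are not given in this README.
SUSPICIOUS = ("shell", "createobject", "urldownloadtofile")


def optimized_features(vba_code: str) -> dict:
    n = max(len(vba_code), 1)  # avoid division by zero on empty input
    # Assumed definition: characters belonging to runs of 8+ hex digits.
    hex_runs = re.findall(r"[0-9A-Fa-f]{8,}", vba_code)
    # Simplified string-literal match; ignores VBA's doubled-quote escape ("").
    strings = re.findall(r'"([^"\r\n]*)"', vba_code)
    lowered = vba_code.lower()
    hits = sum(lowered.count(kw) for kw in SUSPICIOUS)
    return {
        "hex_density": sum(len(run) for run in hex_runs) / n,
        "longest_string_len": max((len(s) for s in strings), default=0),
        "suspicious_kw_density": hits / n,
    }
```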


Machine Learning Model

  • Algorithm: Random Forest
  • Stratified train/test split: 75% / 25%
  • Custom decision threshold prioritizing recall
  • Short-circuit logic for very small macro code
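
The last two bullets can be sketched as a small decision function applied to the model's malicious-class probability. Everything numeric here is an assumption for illustration: this README does not state the threshold, the short-circuit size limit, or what verdict the short circuit returns. The 150-byte value simply mirrors the dataset exclusion rule, and `decide` and `Verdict` are hypothetical names.

```python
from dataclasses import dataclass

# Assumed values, for illustration only.
THRESHOLD = 0.35        # below 0.5 so that borderline files lean toward "Malicious"
MIN_MACRO_BYTES = 150   # macros smaller than this skip the model entirely


@dataclass
class Verdict:
    label: str
    confidence: float
    short_circuited: bool


def decide(prob_malicious: float, macro_size: int,
           threshold: float = THRESHOLD,
           min_bytes: int = MIN_MACRO_BYTES) -> Verdict:
    """Map a malicious-class probability to a verdict."""
    if macro_size < min_bytes:
        # Assumption: very small macros are reported as benign without inference.
        return Verdict("Benign", 1.0, True)
    if prob_malicious >= threshold:
        return Verdict("Malicious", prob_malicious, False)
    return Verdict("Benign", 1.0 - prob_malicious, False)
```

Lowering the threshold below 0.5 trades some false positives for fewer missed malicious files, which is what prioritizing recall means in practice.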

Dataset

The dataset used in this project is not included in this repository, as it contains malicious samples that cannot be publicly redistributed.

The model was trained on 1,034 legacy OLE documents:

  • 539 malicious samples (MalwareBazaar)
  • 495 benign samples (GovDocs corpus)

Files with macro content shorter than 150 bytes were excluded.


API Endpoints

Single File Prediction

POST /predict

Returns:

{
  "task_id": "...",
  "filename": "...",
  "status": "PENDING"
}

Retrieve Result

GET /result/{task_id}

Returns:

{
  "status": "SUCCESS",
  "result": {
    "verdict": "Malicious",
    "confidence": 0.87,
    "top_features": {...}
  }
}
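
Because prediction is asynchronous, a client has to poll the result endpoint until the task finishes. The following is a minimal standard-library sketch of that loop. The base URL and port, the `wait_for_result` and `http_get_json` helpers, and the handling of a `FAILURE` status (a standard Celery task state that this API may or may not expose) are assumptions.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000"  # assumed API address; not stated in this README


def http_get_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def wait_for_result(task_id: str,
                    fetch: Callable[[str], dict] = http_get_json,
                    interval: float = 1.0,
                    max_attempts: int = 30) -> dict:
    """Poll GET /result/{task_id} until the status is SUCCESS, then return the result."""
    url = f"{BASE_URL}/result/{task_id}"
    for _ in range(max_attempts):
        payload = fetch(url)
        status = payload.get("status")
        if status == "SUCCESS":
            return payload["result"]
        if status == "FAILURE":  # assumed to be passed through from Celery
            raise RuntimeError(f"task {task_id} failed")
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} not finished after {max_attempts} attempts")
```

Passing `fetch` as a parameter keeps the loop testable without a running server. In normal use, call `wait_for_result(task_id)` with the `task_id` returned by POST /predict.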

Batch Prediction

POST /predict/batch  
GET /results/batch

Running the Project

Using Docker Compose

docker compose up --build

Access Frontend

Streamlit default:

http://localhost:8501


Repository Structure

.  
├── api.py  
├── extract_features.py  
├── tasks.py  
├── worker.py  
├── train_model.py  
├── streamlit_app.py  
├── docker-compose.yml  
├── Dockerfile.app  
├── Dockerfile.worker  
├── requirements-app.txt  
├── requirements-worker.txt  
├── models/  
├── baselines/  
└── docs/  

Academic Artifacts

The repository includes:


Security Notes

  • No macro execution is performed.
  • Do not commit real malware samples.
  • Use hashes or safe public dataset references when demonstrating detection capability.

Authors

Raz Cohen Priva
Dael Hacohen Waingarten
