OLE Macro Malware Detection System

Overview

This repository implements a lightweight static malware detection system for legacy Microsoft Office documents based on the OLE (Object Linking and Embedding) format (.doc, .xls, .ppt).

The system focuses on detecting malicious VBA macros through static analysis without executing the document. It extracts a unified feature vector from embedded macro code and classifies files using a trained Random Forest model.

The project combines academic research methodology with a production-style microservices architecture.

System Architecture

The system is organized as a set of microservices orchestrated with Docker Compose.

Components:

FastAPI backend (api.py)
Celery worker (worker.py, tasks.py)
Redis message broker
Streamlit frontend (streamlit_app.py)
Machine learning model (Random Forest)

Processing flow:

User uploads file via Streamlit UI
FastAPI receives file and enqueues a Celery task
Celery worker extracts features and runs inference
Result is stored in Redis backend
API returns verdict and confidence score
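
The flow above can be illustrated with a standard-library stand-in. This is only a sketch of the asynchronous pattern, not the project's code: `queue.Queue` plays the role of the Redis broker, a plain dictionary plays the role of the Redis result backend, and `analyze` is a hypothetical placeholder for feature extraction and inference.

```python
import queue
import threading
import uuid

task_queue = queue.Queue()  # stand-in for the Redis broker
result_store = {}           # stand-in for the Redis result backend


def analyze(data: bytes) -> dict:
    """Hypothetical placeholder for feature extraction + model inference."""
    verdict = "Malicious" if b"Shell" in data else "Benign"
    return {"verdict": verdict, "macro_size": len(data)}


def submit(filename: str, data: bytes) -> str:
    """API side: register a PENDING task, enqueue it, return its id."""
    task_id = str(uuid.uuid4())
    result_store[task_id] = {"status": "PENDING", "filename": filename}
    task_queue.put((task_id, data))
    return task_id


def worker() -> None:
    """Worker side: take tasks off the queue and store their results."""
    while True:
        item = task_queue.get()
        if item is None:  # sentinel value: stop the worker
            task_queue.task_done()
            break
        task_id, data = item
        result_store[task_id] = {"status": "SUCCESS", "result": analyze(data)}
        task_queue.task_done()


def run_demo() -> dict:
    """Submit one fake upload and return what the result lookup would see."""
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    task_id = submit("sample.doc", b'Sub AutoOpen()\nShell "calc"\nEnd Sub')
    task_queue.join()     # block until the worker has processed the task
    task_queue.put(None)  # ask the worker to exit
    thread.join()
    return result_store[task_id]
```

The point of the split is that `submit` returns immediately with a task id, while the expensive analysis runs in the worker and is picked up later by id.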

User Interface

Upload & Batch Analysis

Multiple OLE files can be uploaded at once.
They are processed asynchronously, and progress is tracked in real time.

Detailed File Results

For each analyzed file, the system provides:

Verdict (Benign / Malicious)
Confidence score
VBA macro size
Malicious vs benign probability distribution
Top feature importances influencing the decision

Feature Engineering

The model is trained on 112 static features extracted from VBA macro code.

Baseline 1 – Keyword-Based Features (92)

File size
Important VBA keywords
Critical high-risk keywords
Auto-execution routines
System interaction patterns
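
A minimal sketch of how keyword-based features of this kind could be computed from extracted VBA source. The keyword lists and feature names below are illustrative assumptions; they are not the project's 92-feature set, and `code_size` measures the macro text rather than the file on disk.

```python
import re

# Illustrative keyword lists (assumptions, not the project's actual lists).
AUTO_EXEC = ("AutoOpen", "Auto_Open", "Document_Open", "Workbook_Open")
HIGH_RISK = ("Shell", "CreateObject", "URLDownloadToFile")


def count_keyword(code: str, keyword: str) -> int:
    """Count case-insensitive whole-word occurrences of a keyword."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, code, flags=re.IGNORECASE))


def keyword_features(code: str) -> dict:
    """Build a flat feature dict: one count per keyword plus two summaries."""
    features = {"code_size": len(code.encode("utf-8"))}
    for kw in AUTO_EXEC + HIGH_RISK:
        features["kw_" + kw.lower()] = count_keyword(code, kw)
    # 1 if any auto-execution routine name appears, else 0.
    features["has_auto_exec"] = int(
        any(features["kw_" + kw.lower()] for kw in AUTO_EXEC)
    )
    return features
```

For example, a macro that defines `AutoOpen` and calls `CreateObject("WScript.Shell")` yields a count of 1 for each of `kw_autoopen`, `kw_createobject`, and `kw_shell`.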

Baseline 2 – Obfuscation-Based Features (15)

Entropy-based metrics
Identifier length statistics
String concatenation patterns
Encoding-related function usage
Logic complexity indicators
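
The obfuscation signals listed above might be approximated as follows. This is a rough sketch under assumed definitions: the regular expressions, the feature names, and the choice of `Chr`, `ChrW`, `Asc`, and `StrReverse` as encoding-related functions are illustrative, not taken from the repository.

```python
import math
import re
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character (0.0 for empty text)."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(text).values()
    )


def obfuscation_features(code: str) -> dict:
    """Crude obfuscation indicators for a VBA source string."""
    # Blank out string literals so their contents are not read as identifiers.
    no_strings = re.sub(r'"[^"\n]*"', '""', code)
    # Identifier-like tokens; this also picks up VBA keywords such as Sub.
    idents = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", no_strings)
    lengths = [len(tok) for tok in idents]
    return {
        "entropy": shannon_entropy(code),
        "mean_ident_len": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_ident_len": max(lengths, default=0),
        # '&' operators that touch a string literal on either side.
        "concat_ops": len(re.findall(r'"\s*&|&\s*"', code)),
        # Calls to character/encoding helpers often used to hide strings.
        "encoding_calls": len(
            re.findall(r"\b(?:ChrW?|Asc|StrReverse)\s*\(", code, flags=re.IGNORECASE)
        ),
    }
```

Heavily obfuscated macros tend to show long random identifiers, many small concatenated string fragments, and frequent `Chr`-style calls, which is what these counters try to surface.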

Optimized Features (5)

Density-normalized keyword features
Hex density
Longest string length
Suspicious keyword density

Together, these features capture both what a macro does (keywords, auto-execution, system interaction) and how heavily it is obfuscated.
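
As an illustration of the density-normalized ("optimized") features, the sketch below divides raw counts by the macro length so that long and short macros are comparable. The exact definitions are assumptions: hex density is taken here as the share of characters inside runs of at least four hex digits, and the suspicious keyword list is illustrative.

```python
import re

# Illustrative list; the project's actual suspicious-keyword set may differ.
SUSPICIOUS = ("Shell", "CreateObject", "URLDownloadToFile")


def optimized_features(code: str) -> dict:
    """Length-normalized features for a VBA source string (assumed definitions)."""
    n = len(code)
    if n == 0:
        return {"hex_density": 0.0, "longest_string": 0, "suspicious_density": 0.0}

    # Characters that sit inside runs of at least 4 hex digits.
    hex_chars = sum(len(run) for run in re.findall(r"[0-9A-Fa-f]{4,}", code))

    # Longest double-quoted string literal on a single line.
    literals = re.findall(r'"([^"\n]*)"', code)
    longest = max((len(s) for s in literals), default=0)

    # Whole-word, case-insensitive hits across all suspicious keywords.
    hits = sum(
        len(re.findall(r"\b" + re.escape(kw) + r"\b", code, flags=re.IGNORECASE))
        for kw in SUSPICIOUS
    )
    return {
        "hex_density": hex_chars / n,
        "longest_string": longest,
        "suspicious_density": hits / n,
    }
```

Normalizing by length keeps a large benign macro with a few incidental keyword hits from looking as suspicious as a tiny macro made almost entirely of them.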

Machine Learning Model

Algorithm: Random Forest
Stratified train/test split: 75% / 25%
Custom decision threshold prioritizing recall
Short-circuit logic for very small macro code
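
The last two points can be sketched as a small decision function. In a scikit-learn setup, the stratified split would typically come from `train_test_split(..., test_size=0.25, stratify=y)` and the malicious probability from `RandomForestClassifier.predict_proba`; the function below only covers what happens after that probability is known. The threshold value, the 150-byte cutoff (borrowed from the dataset filter), and the choice to return Benign on a short circuit are all assumptions for illustration.

```python
def decide(
    p_malicious: float,
    macro_size: int,
    threshold: float = 0.35,     # hypothetical; below 0.5 to favor recall
    min_macro_bytes: int = 150,  # assumption: reuses the dataset cutoff
) -> dict:
    """Turn a malicious-class probability into a verdict."""
    # Short circuit: macro too small to analyze, skip the model's opinion.
    if macro_size < min_macro_bytes:
        return {"verdict": "Benign", "confidence": None, "short_circuit": True}

    # Lowering the threshold flags more files, trading precision for recall.
    if p_malicious >= threshold:
        return {"verdict": "Malicious", "confidence": p_malicious, "short_circuit": False}
    return {"verdict": "Benign", "confidence": 1.0 - p_malicious, "short_circuit": False}
```

With the hypothetical threshold of 0.35, a file scored at 0.40 is reported as Malicious, whereas the default 0.5 cutoff would have labeled it Benign.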

Dataset

The dataset used in this project is not included in this repository, as it contains malicious samples that cannot be publicly redistributed.

The model was trained on 1,034 legacy OLE documents:

539 malicious samples (MalwareBazaar)
495 benign samples (GovDocs corpus)

Files with macro content shorter than 150 bytes were excluded.

API Endpoints

Single File Prediction

POST /predict

Returns:

{
  "task_id": "...",
  "filename": "...",
  "status": "PENDING"
}

Retrieve Result

GET /result/{task_id}

Returns:

{
  "status": "SUCCESS",
  "result": {
    "verdict": "Malicious",
    "confidence": 0.87,
    "top_features": {...}
  }
}
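
Because prediction is asynchronous, a client has to poll this endpoint until the task leaves the `PENDING` state. The helper below is a minimal sketch using only the standard library. `BASE_URL` is an assumption (the API address is not stated here), and treating `FAILURE` as a terminal status relies on Celery's standard task states; whether this API exposes it is also an assumption.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: adjust to the API's real address
TERMINAL = {"SUCCESS", "FAILURE"}   # FAILURE: standard Celery state, assumed here


def fetch_result(task_id: str) -> dict:
    """GET /result/{task_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/result/{task_id}") as resp:
        return json.load(resp)


def wait_for_result(task_id, fetch=fetch_result, interval=1.0,
                    timeout=60.0, sleep=time.sleep) -> dict:
    """Poll until the task reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch(task_id)
        if payload.get("status") in TERMINAL:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} not finished after {timeout}s")
        sleep(interval)
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without a running server, for example by passing a function that returns canned responses.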

Batch Prediction

POST /predict/batch  
GET /results/batch

Running the Project

Using Docker Compose

docker compose up --build

Access Frontend

Streamlit default:

http://localhost:8501

Repository Structure

.  
├── api.py  
├── extract_features.py  
├── tasks.py  
├── worker.py  
├── train_model.py  
├── streamlit_app.py  
├── docker-compose.yml  
├── Dockerfile.app  
├── Dockerfile.worker  
├── requirements-app.txt  
├── requirements-worker.txt  
├── models/  
├── baselines/  
└── docs/

Academic Artifacts

The repository includes:

Security Notes

No macro execution is performed.
Do not commit real malware samples.
Use hashes or safe public dataset references when demonstrating detection capability.

Authors

Raz Cohen Priva
Dael Hacohen Waingarten
