This repository implements a lightweight static malware detection system for legacy Microsoft Office documents based on the OLE (Object Linking and Embedding) format (.doc, .xls, .ppt).
The system focuses on detecting malicious VBA macros through static analysis without executing the document. It extracts a unified feature vector from embedded macro code and classifies files using a trained Random Forest model.
The project combines academic research methodology with a production-style microservices architecture.
The services run in separate containers orchestrated with Docker Compose.
Components:
- FastAPI backend (api.py)
- Celery worker (worker.py, tasks.py)
- Redis message broker
- Streamlit frontend (streamlit_app.py)
- Machine learning model (Random Forest)
Processing flow:
- User uploads file via Streamlit UI
- FastAPI receives file and enqueues a Celery task
- Celery worker extracts features and runs inference
- Result is stored in Redis backend
- API returns verdict and confidence score
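The enqueue-then-poll pattern in the flow above can be modeled in a few lines. This is a minimal in-process sketch, not the project's code: a `queue.Queue` stands in for the Redis broker, a dict stands in for the Redis result backend, and the names `submit`, `run_worker_once`, and `get_result` are hypothetical.

```python
import queue
import uuid
from typing import Callable, Dict

# Stand-ins for Redis (assumptions for illustration only).
task_queue: queue.Queue = queue.Queue()   # plays the role of the message broker
results: Dict[str, Dict] = {}             # plays the role of the result backend


def submit(filename: str, data: bytes) -> Dict:
    """API side: register the task as PENDING and enqueue it."""
    task_id = str(uuid.uuid4())
    results[task_id] = {"status": "PENDING"}
    task_queue.put((task_id, filename, data))
    return {"task_id": task_id, "filename": filename, "status": "PENDING"}


def run_worker_once(analyze: Callable[[bytes], Dict]) -> None:
    """Worker side: take one task, analyze it, and store the result."""
    task_id, _filename, data = task_queue.get_nowait()
    results[task_id] = {"status": "SUCCESS", "result": analyze(data)}


def get_result(task_id: str) -> Dict:
    """API side: report whatever the result store currently holds."""
    return results.get(task_id, {"status": "PENDING"})
```

The point of the split is that `submit` returns immediately, so the upload request never blocks on feature extraction or inference.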
The system allows uploading multiple OLE files simultaneously.
Files are processed asynchronously and progress is tracked in real time.
For each analyzed file, the system provides:
- Verdict (Benign / Malicious)
- Confidence score
- VBA macro size
- Malicious vs benign probability distribution
- Top feature importances influencing the decision
The model is trained on 112 static features extracted from VBA macro code. The feature groups include:
- File size
- Important VBA keywords
- Critical high-risk keywords
- Auto-execution routines
- System interaction patterns
- Entropy-based metrics
- Identifier length statistics
- String concatenation patterns
- Encoding-related function usage
- Logic complexity indicators
- Density-normalized keyword features
- Hex density
- Longest string length
- Suspicious keyword density
These features combine semantic detection and obfuscation awareness for robust classification.
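To make a few of these feature groups concrete, here is a minimal sketch of how entropy, hex density, longest string length, and keyword density might be computed from macro source text. The exact definitions used in extract_features.py are not shown in this README, so the formulas, the function names, and the keyword list below are all assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical keyword list; the project's real list is not shown here.
SUSPICIOUS_KEYWORDS = ("shell", "createobject", "autoopen")


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for empty input."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def hex_density(text: str) -> float:
    """Fraction of characters inside runs of 8+ hex digits (assumed definition)."""
    if not text:
        return 0.0
    hex_chars = sum(len(run) for run in re.findall(r"[0-9A-Fa-f]{8,}", text))
    return hex_chars / len(text)


def longest_string_length(text: str) -> int:
    """Length of the longest double-quoted literal (ignores "" escapes)."""
    literals = re.findall(r'"([^"]*)"', text)
    return max((len(s) for s in literals), default=0)


def keyword_density(text: str, keywords=SUSPICIOUS_KEYWORDS) -> float:
    """Case-insensitive keyword hits per 100 characters of macro code."""
    if not text:
        return 0.0
    lowered = text.lower()
    hits = sum(lowered.count(k) for k in keywords)
    return 100.0 * hits / len(text)
```

Normalizing by code length keeps a long benign macro that mentions a keyword once from scoring the same as a short macro that is mostly suspicious calls.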
- Algorithm: Random Forest
- Stratified train/test split: 75% / 25%
- Custom decision threshold prioritizing recall
- Short-circuit logic for very small macro code
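The last two points can be sketched as a small decision function. The threshold value, the size cutoff, the verdict returned on short-circuit, and the name `decide` are assumptions for illustration; the README does not state the actual values used.

```python
from typing import Callable, Optional, Tuple

# Illustrative values only, not the project's configuration.
DECISION_THRESHOLD = 0.35   # below 0.5, so borderline files are flagged (favors recall)
MIN_MACRO_BYTES = 150       # assumed cutoff for "very small" macro code


def decide(
    macro_code: bytes,
    predict_malicious_proba: Callable[[bytes], float],
    threshold: float = DECISION_THRESHOLD,
    min_bytes: int = MIN_MACRO_BYTES,
) -> Tuple[str, Optional[float]]:
    """Return (verdict, confidence); confidence is None when the model is skipped."""
    # Short-circuit: tiny macros skip inference entirely (assumed to be benign).
    if len(macro_code) < min_bytes:
        return "Benign", None
    p = predict_malicious_proba(macro_code)
    # Custom threshold: flag as malicious at a lower probability than 0.5.
    if p >= threshold:
        return "Malicious", p
    return "Benign", 1.0 - p
```

Lowering the threshold trades more false positives for fewer missed malicious files, which is usually the preferred trade-off for a detector.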
The dataset used in this project is not included in this repository, as it contains malicious samples that cannot be publicly redistributed.
The model was trained on 1,034 legacy OLE documents:
- 539 malicious samples (MalwareBazaar)
- 495 benign samples (GovDocs corpus)
Files with macro content shorter than 150 bytes were excluded.
POST /predict
Returns:
{
  "task_id": "...",
  "filename": "...",
  "status": "PENDING"
}
GET /result/{task_id}
Returns:
{
  "status": "SUCCESS",
  "result": {
    "verdict": "Malicious",
    "confidence": 0.87,
    "top_features": {...}
  }
}
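A client can poll GET /result/{task_id} until the status becomes SUCCESS. The sketch below uses only the standard library. The base URL, the helper names, and the handling of a FAILURE status (a standard Celery state that the API may or may not expose) are assumptions.

```python
import json
import time
import urllib.request
from typing import Callable, Dict

BASE_URL = "http://localhost:8000"  # assumption: adjust to where the API listens


def http_get_json(url: str) -> Dict:
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_result(
    task_id: str,
    fetch: Callable[[str], Dict] = http_get_json,
    interval: float = 1.0,
    max_attempts: int = 30,
    base_url: str = BASE_URL,
) -> Dict:
    """Poll /result/{task_id} until SUCCESS, then return the "result" object."""
    url = f"{base_url}/result/{task_id}"
    for _ in range(max_attempts):
        payload = fetch(url)
        status = payload.get("status")
        if status == "SUCCESS":
            return payload["result"]
        if status == "FAILURE":  # assumed to be surfaced by the API
            raise RuntimeError(f"task {task_id} failed")
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} not finished after {max_attempts} polls")
```

Passing `fetch` as a parameter keeps the polling logic testable without a running server.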
POST /predict/batch
GET /results/batch
docker compose up --build
Streamlit default:
.
├── api.py
├── extract_features.py
├── tasks.py
├── worker.py
├── train_model.py
├── streamlit_app.py
├── docker-compose.yml
├── Dockerfile.app
├── Dockerfile.worker
├── requirements-app.txt
├── requirements-worker.txt
├── models/
├── baselines/
└── docs/
The repository includes:
- No macro execution is performed.
- Do not commit real malware samples.
- Use hashes or safe public dataset references when demonstrating detection capability.
Raz Cohen Priva
Dael Hacohen Waingarten

