TIEF is an end-to-end threat intelligence extraction framework that takes raw cybersecurity reports in PDF form, automatically surfaces Tactics, Techniques, and Procedures (TTPs) aligned to the MITRE ATT&CK framework, extracts Indicators of Compromise (IOCs), and packages everything into machine-readable STIX 2.1 bundles. The entire pipeline — scraping ground-truth labels, augmenting under-represented classes with a large language model, training a multi-label classifier, running topic modelling, and emitting structured threat reports — is implemented and documented in the notebooks in this repository.
- Project Overview
- Data Collection — Scraping MITRE ATT&CK
- Dataset Augmentation with Claude 3.5 Sonnet
- Dataset Preparation and Preprocessing
- Tokenization and Multi-Label Classification Training
- Evaluation and Metrics
- Validation and Error Analysis
- Inference Pipeline — PDF to TTP
- Two Workflow Variants
- Technology Stack Summary
## Project Overview

Cyber threat intelligence is typically locked inside unstructured PDF reports — analyst write-ups, APT group profiles, incident disclosures, and LinkedIn-style threat actor summaries. TIEF automates the process of reading those reports and producing a structured, standardised output that can be ingested by SIEM platforms, threat-sharing ecosystems (e.g., TAXII/STIX servers), or downstream analytics pipelines.
At its core, the system:
- Uses a fine-tuned DistilBERT model to classify free-text descriptions of attacker behaviour into MITRE ATT&CK labels (Tactic, Technique, Sub-Technique, and Technique ID simultaneously as a multi-label problem).
- Uses BERTopic to first compress and group the raw text from a PDF into semantically coherent topics before classifying, substantially reducing noise and token cost.
- Extracts concrete IOCs (IP addresses, domains, URLs, file hashes, CVEs, registry keys, emails, etc.) using a custom soft-tagging module.
- Outputs one or more STIX 2.1 JSON bundles per report, with `AttackPattern`, `Indicator`, `Relationship`, and `Report` objects fully linked.
## Data Collection — Scraping MITRE ATT&CK

Location: `Mitre Scrapped Data/`
The ground-truth training data was scraped from the MITRE ATT&CK knowledge base using Selenium, a browser-automation web scraping tool. The scrape covered all three MITRE ATT&CK domains — Enterprise, Industrial Control Systems (ICS), and Mobile — producing an initial dataset of 11,068 rows and 7 columns. Each row corresponds to one real-world procedure — a documented instance of a specific threat group or piece of malware using a particular technique. The schema is:
| Column | Description |
|---|---|
| `Technique` | Technique category label |
| `id` | MITRE ATT&CK Technique ID (e.g., T1049, T1027.002) |
| `Tactic-Name` | ATT&CK Tactic (e.g., Discovery, Defense Evasion) |
| `Technique-Name` | ATT&CK Technique name (e.g., System Network Connections Discovery) |
| `SubTechnique-Name` | Sub-technique name, or `No sub-Technique` |
| `Platforms` | Target platforms (e.g., Windows, Linux, macOS) |
| `Description` | Free-text description of the real-world attack procedure |
The raw training split (train_data.csv) and held-out test split (test_set.csv) are kept separate from the outset to prevent data leakage.
The MITRE ATT&CK framework covers 560 techniques and sub-techniques across its three domains. Columns pertaining to defensive strategies and metadata (Detection, Mitigations, Version, Created, Last Modified) were excluded, since they are not directly relevant to classifying adversarial behaviour. The scraped data is also severely imbalanced: sentence counts per class ranged from 1 to 300, with some dominant technique IDs (e.g., T1489) holding hundreds of examples while many sub-techniques had only one or two. Such an imbalance would bias a trained model toward high-frequency classes and leave it unable to generalise to rare ones, which motivated the augmentation step below.
## Dataset Augmentation with Claude 3.5 Sonnet

Location: `claude-dataset-augmentation.ipynb`
Stack: anthropic Python SDK · pandas · re · time
To ensure every technique ID had at least 100 labelled examples before training, a synthetic data generation pipeline was built using Anthropic's Claude 3.5 Sonnet (claude-3-5-sonnet-20240620). In total, approximately 64,953 new descriptions were generated across all 560 technique and sub-technique IDs, bringing the combined training corpus to a far more balanced state.
Claude 3.5 Sonnet was used because it can generate contextually accurate, varied natural-language descriptions of attacker behaviour while staying faithful to specific ATT&CK technique semantics when given few-shot examples. Its 8192-token output window allows generating dozens of new descriptions in a single API call. The data generation was driven by previous incidents taken from the original dataset, ensuring newly generated content remained closely aligned with existing descriptions in terms of context.
1. **Identify under-represented classes.** The dataset is grouped by `id` and all technique IDs with fewer than 100 examples are flagged (`id_counts[id_counts < 100]`).
2. **Chunk IDs into batches.** The list of under-represented IDs is split into chunks of 50 to keep per-batch API payloads manageable. This produced 11 chunk files (`ids_chunk_1.csv` through `ids_chunk_11.csv`).
3. **Prompt construction.** For each technique ID in a chunk, all existing real descriptions for that technique are collected and inserted into a few-shot prompt:

   > Based on the following real-life attack scenarios: '`<existing_descriptions>`', generate `<n_descriptions>` additional procedure descriptions that closely relate to the MITRE technique ID `<id>`, technique name '`<technique_name>`', and tactic name '`<tactic_name>`'. Ensure the new descriptions emphasise the attack method and strategic objective used in these scenarios.

   The prompt grounds Claude in real attack evidence rather than asking it to speculate freely, keeping the generated text close to the ground-truth distribution.
4. **Retrying and rate limiting.** If a call returns fewer descriptions than requested, the loop retries up to 10 times with a 12-second inter-call sleep (respecting the 5 requests-per-minute rate limit on the Anthropic free tier). Each call uses `max_tokens=8192` (see the sketch after this list).
5. **Post-processing.** A `clean_text()` function strips numbered list prefixes (e.g., `1.`, `2.`) and excess whitespace introduced by the model, then splits the output into individual sentences.
6. **Appending back.** Generated rows inherit all metadata columns (`Tactic-Name`, `Technique-Name`, `SubTechnique-Name`, `Platforms`, `id`) from the original group, preserving label integrity.
7. **Saving per-chunk and whole-dataset outputs.** Each augmented chunk is saved separately for inspection, and a `missing_augmented.csv` captures techniques with previously zero or very few examples. Techniques that already had ≥ 100 examples were left untouched.
8. **Quality assurance.** The generation process used three quality controls: (a) structured prompt engineering that explicitly referenced technique IDs, technique names, and tactic names to ensure alignment with existing attack patterns; (b) contextual anchoring with authentic examples to guide relevant outputs; and (c) expert validation through manual review by graduate and PhD cybersecurity students to verify technical accuracy and MITRE framework consistency.
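A minimal sketch of the retry-and-rate-limit loop from step 4, using the `anthropic` SDK (function and variable names are illustrative, not the notebook's exact code):

```python
import time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_descriptions(prompt: str, max_retries: int = 10) -> str:
    """Call Claude 3.5 Sonnet, sleeping between calls to respect 5 req/min."""
    for _ in range(max_retries):
        response = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=8192,
            messages=[{"role": "user", "content": prompt}],
        )
        text = response.content[0].text
        time.sleep(12)  # ~5 requests per minute on the free tier
        if text.strip():  # retry if the model returned nothing usable
            return text
    return ""
```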
The synthetic data did not negatively impact classifier accuracy. The marginal decrease in ROC-AUC (97.45% → 96.9%) was outweighed by a dramatic increase in F1 score (35% → 93%), confirming that LLM augmentation successfully resolved class imbalance without compromising discriminative ability. The final combined dataset — original scraped data plus all augmented rows — is saved to Model Training/Final_Dataset.csv.
## Dataset Preparation and Preprocessing

Location: `Model Training/MultiLabel Classification.ipynb`
Stack: pandas · scikit-learn (MultiLabelBinarizer) · matplotlib
Once the augmented dataset was in place, preparation for training involved several cleaning and structuring steps.
After merging TrainCombine.csv (the augmented training set) with test_set.csv, the combined DataFrame was deduplicated using drop_duplicates(). NaN values were inspected column-by-column. Rows where Tactic-Name, Technique-Name, or SubTechnique-Name were missing were audited; for a known set of parent-level IDs (e.g., T1055, T1078, T1027, T1090, T1110, T1016) that lack a sub-technique by design, the metadata was manually filled in to ensure no rows were dropped that contained usable descriptions.
A single Labels column was constructed by concatenating the four label columns with a comma separator:
```python
merged_df['Labels'] = merged_df[['Tactic-Name', 'Technique-Name', 'SubTechnique-Name', 'id']].agg(', '.join, axis=1)
```

This turned each row's four classification targets into one string, e.g.:

```
discovery, system network connections discovery, no sub-technique, t1049
```
The Labels column was then split back into a Python list and passed to MultiLabelBinarizer from scikit-learn, which produced a sparse binary matrix where each column corresponds to one unique label. The full label vocabulary covers all tactic names, technique names, sub-technique names, and technique IDs from the 560 MITRE ATT&CK classes spanning Enterprise, ICS, and Mobile domains.
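For illustration, the encoding step might look like this (a sketch assuming the merged DataFrame described above):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Split the comma-joined label string back into a list of labels per row
label_lists = merged_df['Labels'].str.split(', ')

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_lists)  # binary matrix: (n_rows, n_unique_labels)
print(len(mlb.classes_))            # size of the combined label vocabulary
```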
An 80 / 20 stratified split was performed using train_test_split(..., random_state=42), yielding training and validation sets that were used throughout training and evaluation.
The length distribution of descriptions was plotted with matplotlib to understand the spread. The analysis confirmed that most descriptions fall well within DistilBERT's 512-token limit at the word level, validating the max_length=128 tokenisation cap chosen for training (short cybersecurity procedures rarely exceed 128 tokens).
## Tokenization and Multi-Label Classification Training

Location: `Model Training/MultiLabel Classification.ipynb`
Stack: transformers (DistilBertTokenizer, DistilBertForSequenceClassification) · torch · accelerate · scikit-learn
A comparative evaluation of three BERT-based architectures was conducted using Optuna for automated hyperparameter optimisation: DistilBERT (distilbert-base-uncased), SecBERT, and SecureBERT. Performance was assessed on accuracy, ROC-AUC, Hamming loss, and validation loss.
DistilBERT was selected for several reasons:

- It achieved the best balance of high accuracy (93.29%) and exceptional discriminative power (96.67% ROC-AUC) while using 40% fewer parameters than BERT-base.
- It retains ~97% of BERT's language understanding at 60% of the inference cost — important for a pipeline that must classify hundreds of text chunks per PDF.
- Its bidirectional attention captures the full sentence context of cybersecurity descriptions, which are often structured as `[Subject] used [technique], resulting in [outcome]`.
- The HuggingFace `DistilBertForSequenceClassification` class natively supports `problem_type="multi_label_classification"`, applying binary cross-entropy loss across all label columns simultaneously — correctly handling the fact that a single text can map to multiple labels (e.g., a technique that belongs to both Defense Evasion and Privilege Escalation).
SecBERT and SecureBERT, while domain-specific to cybersecurity, did not outperform DistilBERT in this multi-label setting when hyperparameters were properly optimised via Optuna.
`DistilBertTokenizer.from_pretrained("distilbert-base-uncased")` was used. Each description was tokenised with:

- `truncation=True` — descriptions longer than 128 tokens are truncated
- `padding="max_length"` — shorter descriptions are padded to 128 tokens
- `max_length=128`
- Output: `input_ids` and `attention_mask` tensors
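For example (a minimal sketch; the sample sentence is illustrative):

```python
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
encoding = tokenizer(
    "APT29 used PowerShell to enumerate network connections on the host.",
    truncation=True,
    padding="max_length",
    max_length=128,
    return_tensors="pt",
)
print(encoding["input_ids"].shape)       # torch.Size([1, 128])
print(encoding["attention_mask"].shape)  # torch.Size([1, 128])
```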
A CustomDataset class (inheriting from torch.utils.data.Dataset) wraps the tokenised inputs and the multi-hot label vectors, returning flattened tensors per sample so they can be collated by HuggingFace's Trainer.
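A sketch consistent with that description (attribute names are assumptions):

```python
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    """Wraps tokenised encodings and multi-hot label vectors for the Trainer."""

    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict of tensors from the tokenizer
        self.labels = labels        # multi-hot label matrix (one row per sample)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item
```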
The model was trained using HuggingFace's Trainer API. Hyperparameters were determined through automated search via Optuna. The optimal configuration — identified in Trial 3 of the Optuna sweep — is:
| Hyperparameter | Optimal Value |
|---|---|
| `num_train_epochs` | 4 |
| `per_device_train_batch_size` | 32 |
| `per_device_eval_batch_size` | 32 |
| `learning_rate` | 1.023e-05 |
| `warmup_ratio` | 0.0866 |
| `weight_decay` | 0.0475 |
| `classification_threshold` | 0.391 |
| `evaluation_strategy` | epoch |
| `save_total_limit` | 2 |
| `checkpointing_interval` | every 1000 steps |
The low learning rate (1.023e-5) with a warm-up ratio of 0.0866 stabilises training in the early epochs before the full learning rate is applied, preventing large destabilising gradient updates right at the start of fine-tuning. Weight decay acts as L2 regularisation to prevent over-fitting on the augmented dataset. The classification threshold of 0.391 (rather than 0.5) was derived from the Optuna sweep to produce the highest macro F1 — a slightly permissive threshold improves recall across rare sub-technique classes at an acceptable precision cost.
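Mapped onto the `Trainer` API, this configuration corresponds roughly to the following sketch (`num_labels`, `train_dataset`, and `val_dataset` are assumed to come from the preparation steps; the `evaluation_strategy` keyword matches older `transformers` releases):

```python
from transformers import (DistilBertForSequenceClassification, Trainer,
                          TrainingArguments)

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=num_labels,                      # size of the multi-hot vector
    problem_type="multi_label_classification",  # BCE-with-logits loss
)

args = TrainingArguments(
    output_dir="distilbert-finetuned",
    num_train_epochs=4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1.023e-5,
    warmup_ratio=0.0866,
    weight_decay=0.0475,
    evaluation_strategy="epoch",
    save_steps=1000,
    save_total_limit=2,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=val_dataset)
trainer.train()
```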
After training, the following artifacts are persisted for use during inference:
- `distilbert-finetuned/` — model weights + tokenizer (via `trainer.save_model()` and `tokenizer.save_pretrained()`)
- `multilabel_binarizer.pkl` — the fitted `MultiLabelBinarizer` needed to convert binary predictions back to human-readable label strings
- `tokenizer_best.pkl` — a pickle of the tokenizer for environments without the HuggingFace cache
The model architecture itself consists of 6 transformer layers and 12 attention heads in the DistilBERT encoder, followed by a linear classification layer with sigmoid activation that generates independent probabilities for each of the 560 TTP labels.
## Evaluation and Metrics

Location: `Model Training/MultiLabel Classification.ipynb`
Stack: scikit-learn · torch
Since this is a multi-label classification problem — where the positive class frequency varies enormously across 560 labels — standard accuracy would be misleading. Three metrics were computed at each evaluation epoch:
- `f1_score(y_true, y_pred, average='macro')` — Macro F1 computes F1 independently per label and averages the results, giving equal weight to rare labels. This is the primary metric because the goal is to classify all MITRE ATT&CK techniques equally well, not just the common ones.
- `roc_auc_score(y_true, y_pred, average='macro')` — AUC measures the model's ability to rank positive predictions above negatives across all thresholds. The macro average ensures rare classes are not drowned out by high-frequency ones.
- `hamming_loss(y_true, y_pred)` — Hamming loss is the fraction of label positions that are incorrectly predicted (both false positives and false negatives contribute). Lower is better. It provides an overall picture of the per-label error rate.
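These could be wired into the `Trainer`'s `compute_metrics` hook along these lines (a hedged sketch; the notebook's exact implementation may differ):

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, roc_auc_score

def compute_metrics(eval_pred):
    logits, y_true = eval_pred
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid over independent label logits
    y_pred = (probs >= 0.391).astype(int)  # Optuna-tuned decision threshold
    return {
        "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "roc_auc_macro": roc_auc_score(y_true, probs, average="macro"),
        "hamming_loss": hamming_loss(y_true, y_pred),
    }
```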
| Metric | Value |
|---|---|
| F1 Score (macro avg across epochs) | 0.933 |
| Best F1 Score (final evaluation) | 0.9339 |
| ROC-AUC (macro avg) | 0.964 |
| Hamming Loss | 0.000393 |
| Runtime per epoch | ~22.2 seconds |
| Throughput | ~540 samples/second · ~16.8 steps/second |
| Confusion matrix (TPs / TNs / errors each) | 6036 / 6036 / 459 |
The F1 score of 0.933 compares favourably with prior work: TTPDrill achieved 84% precision / 82% recall, rcATT achieved 80% F1, and TTPHunter (SecureBERT-based) achieved 97.09% — but all of these classified at technique level only and did not cover sub-techniques. TIEF is the first to classify across all 560 MITRE ATT&CK technique and sub-technique classes while achieving comparable accuracy.
During inference, logits are passed through a sigmoid activation, and any probability ≥ 0.391 (determined by Optuna) is treated as a positive prediction for that label. This threshold was tuned to maximise the macro F1 across all 560 labels during the Optuna sweep.
## Validation and Error Analysis

For model validation, 1,500 records were randomly sampled from the CTI-Bench dataset (which contains 3,115 entries of real-world CTI descriptions). Because the original CTI-Bench dataset lacked the procedure-level MITRE ATT&CK labels critical for evaluation, a team of graduate and PhD cybersecurity students manually reviewed and annotated each prediction generated by TIEF.
The model correctly predicted 831 out of 1,500 samples, yielding an accuracy of approximately 55.4%. This moderate accuracy reflects a granularity mismatch: TIEF is designed to classify MITRE ATT&CK procedures — detailed, step-by-step descriptions of adversarial actions. CTI-Bench entries are higher-level, generalised summaries that mention affected products and vulnerability types but lack concrete procedural context or step-by-step behavioural detail. There are currently no publicly available benchmark datasets that fully align with procedure-level classification.
Of the 669 misclassified cases, 60 examples were manually reviewed to identify error causes:
| Error Category | Share | Explanation |
|---|---|---|
| Semantic overlap | ~58% | Model confused closely related sub-techniques sharing vocabulary (e.g., file upload vulnerabilities vs. system discovery via terms like "systemConfig" or "upload") |
| Insufficient procedural detail | ~25% | Input descriptions lacked explicit attacker actions or concrete operational context required for precise sub-technique disambiguation |
| Label noise / ambiguity | ~17% | Inconsistencies or vagueness in the underlying reference data |
TIEF uses DistilBERT Base-Uncased, a general-purpose language model not specifically pre-trained on cybersecurity corpora. Domain-specific terminology and subtle expressions of attack methods may not be fully captured. Models trained on cybersecurity-specific corpora (SecBERT, SecureBERT, CyberBERT) may further improve performance. Future work includes developing procedure-aligned benchmark datasets and integrating domain-adapted language models. The topic modelling component is also planned to be upgraded from BERTopic to U-BERTopic — an urgency-aware, BERT-enhanced topic modelling approach designed for cybersecurity contexts — to improve grouping of semantically related threat sentences.
## Inference Pipeline — PDF to TTP

Location: `Final Workflow/Final Workflow-Paragraph.ipynb` and `Final Workflow/Final Workflow-Sentences.ipynb`
Stack: pymupdf · spacy · bertopic · umap-learn · hdbscan · sentence-transformers · transformers · torch · stix2 · nltk · Custom TTPelement module
The inference pipeline is a seven-step process that converts a raw PDF into a folder of STIX 2.1 JSON files.
Library: PyMuPDF (pymupdf / fitz)
```python
import fitz  # PyMuPDF

with fitz.open(pdf_path) as doc:
    text = ""
    for page in doc:
        text += page.get_text()
```

PyMuPDF was chosen because it handles complex PDF layouts (multi-column, embedded tables, mixed fonts) more reliably than PyPDF2 or pdfminer for cybersecurity report formats, and unlike Tika it requires no Java runtime. It returns UTF-8 text directly, preserving paragraph structure.
Libraries: re (regex) · spaCy (en_core_web_sm / en_core_web_lg)
The raw extracted text is cleaned in two passes:
- Whitespace normalisation — `re.sub(r'\s+', ' ', text)` collapses all run-on whitespace from PDF line breaks.
- Character filtering — `re.sub(r'[^a-zA-Z0-9\s.,!?:/()\[\]@_-]+', '', text)` retains only characters relevant to natural language and IOC patterns (IP notation, URLs, file paths, email addresses) while removing PDF artefacts, bullet glyphs, and non-ASCII junk.
Sentence boundary detection is then done with spaCy (nlp(text).sents). spaCy was chosen over NLTK's sent_tokenize because its rule-based sentence detector handles the heavy use of abbreviations, acronyms, and version strings common in cybersecurity text without splitting mid-sentence on strings like CVE-2021-26855 or v3.2.exe.
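Putting the two cleaning passes and the spaCy sentence splitter together (a condensed sketch):

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def clean_and_split(raw_text: str) -> list[str]:
    text = re.sub(r'\s+', ' ', raw_text)                        # whitespace normalisation
    text = re.sub(r'[^a-zA-Z0-9\s.,!?:/()\[\]@_-]+', '', text)  # character filtering
    return [sent.text.strip() for sent in nlp(text).sents]
```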
Libraries: BERTopic · sentence-transformers · UMAP · HDBSCAN · CountVectorizer · nltk (PorterStemmer) · spaCy
A PDF security report can yield hundreds of sentences. Passing every sentence through the DistilBERT classifier would be expensive and noisy. Instead, BERTopic is used to group sentences into semantic topics and extract only the most representative sentences per topic, dramatically reducing the classification input while preserving coverage.
The BERTopic model is assembled from custom sub-components:
BERTopic uses sentence-transformers (all-MiniLM or similar) under the hood to embed each sentence into a dense vector space.
`UMAP(n_neighbors=3, n_components=5, min_dist=0.0, metric='cosine', random_state=42)`

UMAP reduces the high-dimensional sentence embeddings to 5 dimensions before clustering. `n_neighbors=3` and `min_dist=0.0` are set aggressively small for a small corpus (a single PDF's sentences), so tightly grouped semantically similar sentences collapse into clear clusters. Cosine distance is used because direction (not magnitude) encodes semantic similarity in embedding space.
`HDBSCAN(min_cluster_size=2, min_samples=1, metric='euclidean', cluster_selection_method='eom', prediction_data=True)`

HDBSCAN is a density-based hierarchical clustering algorithm that does not require specifying the number of clusters in advance — ideal since different reports will have different numbers of topics. `min_cluster_size=2` ensures even pairs of similar sentences form a topic. Points that do not belong to any cluster are labelled as noise (topic -1) and can be handled separately. `prediction_data=True` enables the `.transform()` call for assigning new points to existing clusters.
Three representation models run in parallel to produce human-readable topic keywords:
- `KeyBERTInspired` — extracts keywords from each topic's representative documents using BERT-style cosine similarity between document embeddings and candidate n-grams. This forms the "Main" representation.
- `PartOfSpeech` (spaCy) — filters candidates by POS patterns tuned for cybersecurity semantics: `ADJ+NOUN` (e.g., malicious software), bare `NOUN` (e.g., malware), `VERB+NOUN` (e.g., exploit vulnerability), `NOUN+NOUN` (e.g., threat actor), and `PROPN` (e.g., APT29, Lazarus Group). This ensures topic labels are substantive rather than stop-word debris.
- `MaximalMarginalRelevance(diversity=0.4)` — diversifies the final keyword set so the N returned keywords are not all synonyms of each other, giving a broader thematic label.
A custom CountVectorizer is configured with:
- `ngram_range=(1, 2)` — captures both single keywords and common two-word cybersecurity phrases
- A `StemTokenizer` that applies NLTK's `PorterStemmer` — reducing encrypted/encrypting/encryption to a single stem reduces vocabulary fragmentation in short text
- `min_df=1` — even words appearing once are included (important for rare CVE or malware names)
- `max_df=0.95` — discards words appearing in 95%+ of documents (noise / common English words not caught by the English stop list)
After fitting, topic_model.get_representative_docs() returns the 3 most centroid-adjacent documents for each discovered topic. These representative sentences, rather than all sentences, are passed to subsequent stages.
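A sketch of how these components plug into BERTopic (a hedged reconstruction from the parameters above; the notebook's exact wiring may differ):

```python
from bertopic import BERTopic
from bertopic.representation import (KeyBERTInspired, MaximalMarginalRelevance,
                                     PartOfSpeech)
from hdbscan import HDBSCAN
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

class StemTokenizer:
    """Tokenise on whitespace and reduce each token to its Porter stem."""
    def __init__(self):
        self.stemmer = PorterStemmer()
    def __call__(self, doc):
        return [self.stemmer.stem(token) for token in doc.split()]

pos_patterns = [  # ADJ+NOUN, NOUN, VERB+NOUN, NOUN+NOUN, PROPN
    [{"POS": "ADJ"}, {"POS": "NOUN"}], [{"POS": "NOUN"}],
    [{"POS": "VERB"}, {"POS": "NOUN"}], [{"POS": "NOUN"}, {"POS": "NOUN"}],
    [{"POS": "PROPN"}],
]

topic_model = BERTopic(
    umap_model=UMAP(n_neighbors=3, n_components=5, min_dist=0.0,
                    metric='cosine', random_state=42),
    hdbscan_model=HDBSCAN(min_cluster_size=2, min_samples=1, metric='euclidean',
                          cluster_selection_method='eom', prediction_data=True),
    vectorizer_model=CountVectorizer(ngram_range=(1, 2), min_df=1, max_df=0.95,
                                     tokenizer=StemTokenizer()),
    representation_model={
        "Main": KeyBERTInspired(),
        "POS": PartOfSpeech("en_core_web_sm", pos_patterns=pos_patterns),
        "MMR": MaximalMarginalRelevance(diversity=0.4),
    },
)

topics, probs = topic_model.fit_transform(sentences)  # sentences: list[str]
rep_docs = topic_model.get_representative_docs()      # top documents per topic
```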
Module: Custom TTPelement
Before classification, each representative sentence is passed through a custom IOC extraction module that systematically identifies 13 IOC types using a hybrid approach combining Regular Expressions (regex) and a gazetteer:
- Regex handles pattern-structured artifacts (IP addresses, hashes, URLs, file paths, CVEs, registry keys, email addresses) regardless of obfuscation — critical for inconsistently formatted threat reports.
- Gazetteer handles vocabulary-based recognition for terms that do not conform to strict syntactic rules. Communication protocols, for example, are matched against a curated dictionary of terms: `http`, `https`, `ftp`, `smtp`, `pop3`, `dns`, etc.
| IOC Type | Extraction Method | Examples |
|---|---|---|
| `ipv4` | Regex | 45.33.32.156, 203.0.113.60 |
| `ipv6` | Regex | IPv6 addresses |
| `domain` | Regex | compromised-site.com, malicious-ads.io |
| `url` | Regex | http://phishing-site.com/login |
| `email` | Regex | noreply@security-alerts.com |
| `filename` | Regex | UpdateInstaller.exe, ransom_payload_v3.2.exe |
| `hash` | Regex | MD5/SHA file hashes |
| `filepath` | Regex | Absolute file paths |
| `cve` | Regex | CVE-2021-26855 |
| `regkey` | Regex | Windows registry keys |
| `asn` | Regex | Autonomous system numbers |
| `communicationprotocol` | Gazetteer | HTTPS, FTP, SMTP, DNS |
| `encodeencryptalgorithms` | Gazetteer | AES, RSA, XOR encryption names |
The module performs soft tagging: each IOC instance in the text is replaced with a generic token representing its category, while the original values are preserved in a dictionary of arrays indexed by their respective categories. For example:
- Input: `"The C2 server at 203.0.113.60 communicated via HTTPS"`
- Soft-tagged output: `"The C2 server at ipv4 communicated via communicationprotocol"`
- IOC dictionary: `{"ipv4": ["203.0.113.60"], "communicationprotocol": ["HTTPS"]}`
This abstraction serves two purposes: reducing data variability (so the classifier never overfits to specific numeric IPs or domain strings) and preventing the model from pattern-matching on individual IOC instances rather than learning the behavioural semantics. The soft-tagged version is what the DistilBERT classifier receives; the original IOC dictionary is passed separately to the STIX generation stage.
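A minimal sketch of the soft-tagging idea for two of the 13 IOC types (the regex patterns and gazetteer here are simplified illustrations, not the `TTPelement` module's actual patterns):

```python
import re

# Simplified illustrative patterns; the real TTPelement module is more thorough.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,7}\b", re.IGNORECASE),
}
PROTOCOL_GAZETTEER = {"http", "https", "ftp", "smtp", "pop3", "dns"}

def soft_tag(text: str):
    iocs = {}
    # Regex pass: record each match, then replace it with its category token
    for tag, pattern in IOC_PATTERNS.items():
        for match in pattern.findall(text):
            iocs.setdefault(tag, []).append(match)
        text = pattern.sub(tag, text)
    # Gazetteer pass: replace known protocol names word by word
    tagged_words = []
    for word in text.split():
        if word.lower().strip(".,") in PROTOCOL_GAZETTEER:
            iocs.setdefault("communicationprotocol", []).append(word)
            tagged_words.append("communicationprotocol")
        else:
            tagged_words.append(word)
    return " ".join(tagged_words), iocs

tagged, iocs = soft_tag("The C2 server at 203.0.113.60 communicated via HTTPS")
print(tagged)  # The C2 server at ipv4 communicated via communicationprotocol
print(iocs)    # {'ipv4': ['203.0.113.60'], 'communicationprotocol': ['HTTPS']}
```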
Libraries: transformers · torch · pandas
The fine-tuned DistilBERT model is loaded from `distilbert-finetuned/` along with `multilabel_binarizer.pkl`. The soft-tagged sentence is tokenised at `max_length=512` (a longer limit than in training, to accommodate occasional longer representative paragraphs), pushed through the model's 6-layer encoder and linear classification head, and predictions are extracted:
```python
sigmoid = torch.nn.Sigmoid()
probs = sigmoid(logits).cpu().numpy()
preds = (probs >= 0.391).astype(int)  # threshold from Optuna optimisation
labels = multilabel_binarizer.inverse_transform(preds)
```

The raw predicted labels are a flat tuple of strings such as `('defense evasion', 'phishing', 'spearphishing link', 't1566.002')`. These are categorised against the `Labels.csv` lookup table into four structured fields:
- Tactic Name — matched against the `Tactic-Name` column
- Technique Name — matched against the `Technique-Name` column
- Sub-Technique Name — matched against the `SubTechnique-Name` column
- Technique ID — matched against the `id` column
The model will not always predict all four label types simultaneously. The analyze_output function implements a progressive crosswalk strategy using the MITRE ATT&CK taxonomy table (Labels.csv):
- If all four label types are predicted and jointly match a row in the table → accept as-is.
- If three of four are predicted and match a row → infer the missing fourth from that row.
- If two of four are predicted → check all pairwise combinations against the table, reconstruct a merged answer.
- If only one type is predicted → perform a single-column lookup to retrieve the full row.
This guarantees that even partial model outputs are expanded into complete, coherent ATT&CK mappings rather than producing incomplete records.
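A hedged sketch of the crosswalk idea (the notebook's `analyze_output` is more elaborate; this version simply returns the taxonomy rows that agree with the most predicted values):

```python
import pandas as pd

# Taxonomy lookup table (see Data Collection above)
labels_df = pd.read_csv("Labels.csv")
COLUMNS = ["Tactic-Name", "Technique-Name", "SubTechnique-Name", "id"]

def crosswalk(predicted: set[str]) -> list[dict]:
    """Expand a partial set of predicted label strings into full ATT&CK rows."""
    # Count, for each taxonomy row, how many of its four values were predicted
    matches = labels_df[COLUMNS].apply(
        lambda row: sum(str(v).lower() in predicted for v in row), axis=1)
    if matches.max() == 0:
        return []  # nothing in the taxonomy matches this prediction
    best = labels_df[matches == matches.max()]
    return best[COLUMNS].to_dict("records")

# e.g. the model emitted only a technique ID:
print(crosswalk({"t1049"}))
```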
Paragraph-mode fallback: In Final Workflow-Paragraph.ipynb, if a merged paragraph-level chunk returns empty predictions, the system falls back to sentence-level processing, running predict() on each sentence individually.
Library: stix2
Once TTPs and IOCs are known for a chunk, three types of STIX 2.1 objects are created:
One object per TTP prediction, containing:
- `name`: `"<Technique Name> - <Sub-Technique Name>"`
- `description`: the original (non-replaced) text chunk
- `external_references`: source `mitre-attack` with the Technique ID as `external_id`
- `kill_chain_phases`: tactic name normalised to kebab-case as `phase_name` under the `mitre-attack` kill chain
One object per extracted IOC, containing:
- `name`: the IOC type
- `indicator_types`: `["malicious-activity"]`
- `pattern`: a type-specific STIX pattern string, e.g.:
  - IP: `[ipv4-addr:value = '45.33.32.156']`
  - Domain: `[domain-name:value = 'compromised-site.com']`
  - URL: `[url:value = 'http://phishing-site.com/login']`
  - CVE: `[vulnerability:cve = 'CVE-2021-26855']`
  - Registry key: `[windows-registry-key:key = '...']`
One "indicates" relationship per Indicator → AttackPattern link, describing which IOC is associated with which attack technique. Timestamps are set at creation time.
All objects are collected into a STIX Report (with the full text as description and labels=["threat-report"]) and then a Bundle. The bundle is serialised as a JSON file. Each chunk produces one or more JSON files (one per TTP prediction set) saved to a structured folder hierarchy:
```
Results_Topic_Para/
└── <PDF report name>/
    └── topic_<id>/
        └── chunk_<n>/
            ├── prediction_1.json
            └── prediction_2.json
```
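A condensed sketch of the object construction and serialisation with the `stix2` library (all values are illustrative; identifiers and timestamps are generated automatically):

```python
from stix2 import AttackPattern, Bundle, Indicator, Relationship, Report

attack = AttackPattern(
    name="Phishing - Spearphishing Link",
    description="Original (non-soft-tagged) text chunk ...",
    external_references=[{"source_name": "mitre-attack",
                          "external_id": "T1566.002"}],
    kill_chain_phases=[{"kill_chain_name": "mitre-attack",
                        "phase_name": "initial-access"}],
)

indicator = Indicator(
    name="ipv4",
    indicator_types=["malicious-activity"],
    pattern="[ipv4-addr:value = '203.0.113.60']",
    pattern_type="stix",
)

rel = Relationship(indicator, "indicates", attack)

report = Report(
    name="Example threat report",
    labels=["threat-report"],
    published="2024-01-01T00:00:00Z",
    description="Full report text ...",
    object_refs=[attack, indicator, rel],
)

bundle = Bundle(attack, indicator, rel, report)

with open("prediction_1.json", "w") as f:
    f.write(bundle.serialize(pretty=True))
```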
## Two Workflow Variants

The `Final Workflow/` directory contains two notebooks that differ in how the text is chunked before topic modelling and classification:
**Sentence mode** (`Final Workflow-Sentences.ipynb`): Each sentence from the cleaned PDF text is treated as a discrete unit. Topic modelling and IOC extraction are applied per sentence. This produces finer-grained STIX objects and is better suited to reports where individual sentences describe discrete attack steps. Results are saved to `Results_Topic_Sentences/`.
**Paragraph mode** (`Final Workflow-Paragraph.ipynb`): After topic modelling, the representative documents for each topic are merged into a single paragraph before IOC extraction and classification. This coarser granularity produces fewer STIX bundles, but each bundle captures a fuller contextual picture of the technique. It is better suited to narrative reports where context is distributed across multiple sentences. Results are saved to `Results_Topic_Para/`.
Both variants share identical code for IOC extraction, model inference, label completion, and STIX generation. The process_reports() batch function present in both notebooks iterates over all PDFs in a Reports/ folder, processes up to 100 at a time, logs per-file status to processing.log, and reports total wall-clock time at the end.
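The batch driver might look roughly like this (a sketch; `process_single_pdf` stands in for the per-file pipeline described above and is hypothetical):

```python
import logging
import os
import time

logging.basicConfig(filename="processing.log", level=logging.INFO)

def process_reports(reports_dir: str = "Reports", limit: int = 100) -> None:
    """Run the PDF-to-STIX pipeline over a folder of reports, logging status."""
    start = time.time()
    pdfs = [f for f in os.listdir(reports_dir) if f.lower().endswith(".pdf")][:limit]
    for name in pdfs:
        try:
            process_single_pdf(os.path.join(reports_dir, name))  # hypothetical per-file pipeline
            logging.info("Processed: %s", name)
        except Exception as exc:
            logging.error("Failed: %s (%s)", name, exc)
    print(f"Processed {len(pdfs)} reports in {time.time() - start:.1f}s")
```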
## Technology Stack Summary

| Stage | Tools / Libraries | Reason for Choice |
|---|---|---|
| Data scraping | Selenium (browser automation) | Automated extraction of all 560 ATT&CK technique/sub-technique procedures across Enterprise, ICS, and Mobile domains from the MITRE ATT&CK website |
| Dataset augmentation | `anthropic` (Claude 3.5 Sonnet) | High-quality few-shot generation of on-topic attack descriptions; 8192-token output window |
| Data wrangling | `pandas`, `re` | Standard tabular manipulation; regex for description cleaning |
| Label encoding | scikit-learn `MultiLabelBinarizer` | Native support for variable-length label sets → binary matrix for BCE loss |
| Tokenization | `transformers` `DistilBertTokenizer` | Sub-word WordPiece tokenizer aligned with the pre-trained weights; handles technical jargon and OOV terms gracefully |
| Multi-label classification | `DistilBertForSequenceClassification` (HuggingFace) | 40% smaller than BERT with comparable accuracy; native multi-label classification mode with BCE loss |
| Hyperparameter optimisation | Optuna | Automated search across batch size, learning rate, warmup ratio, weight decay, and classification threshold; DistilBERT selected over SecBERT/SecureBERT |
| Training loop | `transformers` `Trainer`, `accelerate` | Handles gradient accumulation, evaluation, checkpointing, and mixed precision without boilerplate |
| Evaluation | scikit-learn (F1, ROC-AUC, Hamming loss) | Macro-averaged metrics give equal importance to rare ATT&CK techniques |
| PDF extraction | PyMuPDF (`fitz`) | Reliable, layout-agnostic text extraction from complex PDFs |
| NLP / sentence chunking | spaCy (`en_core_web_sm`/`lg`) | Production-grade sentence boundary detection; handles abbreviations and technical strings |
| Topic modelling | BERTopic | Combines transformer embeddings + UMAP + HDBSCAN; auto-determines topic count |
| Dimensionality reduction | `umap-learn` | Preserves the local topology of the semantic embedding space; better than PCA/t-SNE for downstream clustering |
| Clustering | `hdbscan` | Parameter-light density clustering; handles the noise class natively; no k to specify |
| Topic representation | `KeyBERTInspired`, `PartOfSpeech`, `MaximalMarginalRelevance` | Layered keyword extraction tuned to cybersecurity vocabulary and POS patterns |
| Vectorization | scikit-learn `CountVectorizer` + NLTK `PorterStemmer` | Stemmed n-gram vocabulary reduces fragmentation in short technical text |
| IOC extraction | Custom `TTPelement` module (regex + gazetteer) | Hybrid approach: regex for structured artifacts (IPs, hashes, CVEs, URLs), gazetteer for vocabulary-based terms (protocols); 13 IOC types; produces soft-tagged text for the classifier |
| STIX output | `stix2` | Industry-standard structured threat-sharing format; native Python library for STIX 2.1 object construction |
| Serialization | `json`, `pickle` | Bundle export to `.json`; model/binarizer persistence via pickle |
| Logging / orchestration | `logging`, `os`, `sys`, `time` | Per-file processing logs, structured output folder creation, rate-limit management |