TIEF is an end-to-end threat intelligence extraction framework that takes raw cybersecurity reports in PDF form, automatically surfaces Tactics, Techniques, and Procedures (TTPs) aligned to the MITRE ATT&CK framework, extracts Indicators of Compromise (IOCs), and packages everything into machine-readable STIX 2.1 bundles. The entire pipeline — scraping ground-truth labels, augmenting under-represented classes with a large language model, training a multi-label classifier, running topic modelling, and emitting structured threat reports — is implemented and documented in the notebooks in this repository.
- Project Overview
- Data Collection — Scraping MITRE ATT&CK
- Dataset Augmentation with Claude 3.5 Sonnet
- Dataset Preparation and Preprocessing
- Tokenization and Multi-Label Classification Training
- Evaluation and Metrics
- Validation and Error Analysis
- Inference Pipeline — PDF to TTP
- Two Workflow Variants
- Technology Stack Summary
## Project Overview

Cyber threat intelligence is typically locked inside unstructured PDF reports — analyst write-ups, APT group profiles, incident disclosures, and LinkedIn-style threat actor summaries. TIEF automates the process of reading those reports and producing a structured, standardised output that can be ingested by SIEM platforms, threat-sharing ecosystems (e.g., TAXII/STIX servers), or downstream analytics pipelines.
At its core, the system:
- Uses a fine-tuned DistilBERT model to classify free-text descriptions of attacker behaviour into MITRE ATT&CK labels (Tactic, Technique, Sub-Technique, and Technique ID simultaneously as a multi-label problem).
- Uses BERTopic to first compress and group the raw text from a PDF into semantically coherent topics before classifying, substantially reducing noise and token cost.
- Extracts concrete IOCs (IP addresses, domains, URLs, file hashes, CVEs, registry keys, emails, etc.) using a custom soft-tagging module.
- Outputs one or more STIX 2.1 JSON bundles per report, with `AttackPattern`, `Indicator`, `Relationship`, and `Report` objects fully linked.
## Data Collection — Scraping MITRE ATT&CK

Location: `Mitre Scrapped Data/`
The ground-truth training data was scraped from the MITRE ATT&CK knowledge base using Selenium, a browser-automation web scraping tool. The scrape covered all three MITRE ATT&CK domains — Enterprise, Industrial Control Systems (ICS), and Mobile — producing an initial dataset of 11,068 rows and 7 columns. Each row corresponds to one real-world procedure — a documented instance of a specific threat group or piece of malware using a particular technique. The schema is:
| Column | Description |
|---|---|
| `Technique` | Technique category label |
| `id` | MITRE ATT&CK Technique ID (e.g., T1049, T1027.002) |
| `Tactic-Name` | ATT&CK Tactic (e.g., Discovery, Defense Evasion) |
| `Technique-Name` | ATT&CK Technique name (e.g., System Network Connections Discovery) |
| `SubTechnique-Name` | Sub-technique name, or `No sub-Technique` |
| `Platforms` | Target platforms (e.g., Windows, Linux, macOS) |
| `Description` | Free-text description of the real-world attack procedure |
The raw training split (train_data.csv) and held-out test split (test_set.csv) are kept separate from the outset to prevent data leakage.
The MITRE ATT&CK framework covers 560 techniques and sub-techniques across its three domains. Columns pertaining to defensive strategies and metadata (Detection, Mitigations, Version, Created, Last Modified) were excluded, since they are not directly relevant to classifying adversarial behaviour. The scraped data is also severely imbalanced: sentence counts per class ranged from 1 to 300, with some dominant technique IDs (e.g., T1489) holding hundreds of examples while many sub-techniques had only one or two. Such an imbalance would bias a trained model toward high-frequency classes and leave it unable to generalise to rare ones, which motivated the augmentation step below.
## Dataset Augmentation with Claude 3.5 Sonnet

Location: `claude-dataset-augmentation.ipynb`
Stack: anthropic Python SDK · pandas · re · time
To ensure every technique ID had at least 100 labelled examples before training, a synthetic data generation pipeline was built using Anthropic's Claude 3.5 Sonnet (claude-3-5-sonnet-20240620). In total, approximately 64,953 new descriptions were generated across all 560 technique and sub-technique IDs, bringing the combined training corpus to a far more balanced state.
Claude 3.5 Sonnet was used because it can generate contextually accurate, varied natural-language descriptions of attacker behaviour while staying faithful to specific ATT&CK technique semantics when given few-shot examples. Its 8192-token output window allows generating dozens of new descriptions in a single API call. The data generation was driven by previous incidents taken from the original dataset, ensuring newly generated content remained closely aligned with existing descriptions in terms of context.
1. **Identify under-represented classes.** The dataset is grouped by `id` and all technique IDs with fewer than 100 examples are flagged (`id_counts[id_counts < 100]`).
2. **Chunk IDs into batches.** The list of under-represented IDs is split into chunks of 50 to keep per-batch API payloads manageable. This produced 11 chunk files (`ids_chunk_1.csv` through `ids_chunk_11.csv`).
3. **Prompt construction.** For each technique ID in a chunk, all existing real descriptions for that technique are collected and inserted into a few-shot prompt:

   > Based on the following real-life attack scenarios: '`<existing_descriptions>`', generate `<n_descriptions>` additional procedure descriptions that closely relate to the MITRE technique ID `<id>`, technique name '`<technique_name>`', and tactic name '`<tactic_name>`'. Ensure the new descriptions emphasise the attack method and strategic objective used in these scenarios.

   The prompt grounds Claude in real attack evidence rather than asking it to speculate freely, keeping the generated text close to the ground-truth distribution.
4. **Retrying and rate limiting.** If a call returns fewer descriptions than requested, the loop retries up to 10 times with a 12-second inter-call sleep (respecting the 5 requests-per-minute rate limit on the Anthropic free tier). Each call uses `max_tokens=8192` (see the sketch after this list).
5. **Post-processing.** A `clean_text()` function strips numbered list prefixes (e.g., `1.`, `2.`) and excess whitespace introduced by the model, then splits the output into individual sentences.
6. **Appending back.** Generated rows inherit all metadata columns (`Tactic-Name`, `Technique-Name`, `SubTechnique-Name`, `Platforms`, `id`) from the original group, preserving label integrity.
7. **Saving per-chunk and whole-dataset outputs.** Each augmented chunk is saved separately for inspection, and a `missing_augmented.csv` captures techniques with previously zero or very few examples. Techniques that already had ≥ 100 examples were left untouched.
8. **Quality assurance.** The generation process used three quality controls: (a) structured prompt engineering that explicitly referenced technique IDs, technique names, and tactic names to ensure alignment with existing attack patterns; (b) contextual anchoring with authentic examples to guide relevant outputs; and (c) expert validation through manual review by graduate and PhD cybersecurity students to verify technical accuracy and MITRE framework consistency.
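A minimal sketch of the retry-and-rate-limit loop from step 4, using the `anthropic` SDK (function and variable names are illustrative, not the notebook's exact code):

```python
import time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_descriptions(prompt: str, max_retries: int = 10) -> str:
    """Call Claude 3.5 Sonnet, sleeping between calls to respect 5 req/min."""
    for _ in range(max_retries):
        response = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=8192,
            messages=[{"role": "user", "content": prompt}],
        )
        text = response.content[0].text
        time.sleep(12)  # ~5 requests per minute on the free tier
        if text.strip():  # retry if the model returned nothing usable
            return text
    return ""
```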
The synthetic data did not negatively impact classifier accuracy. The marginal decrease in ROC-AUC (97.45% → 96.9%) was outweighed by a dramatic increase in F1 score (35% → 93%), confirming that LLM augmentation successfully resolved class imbalance without compromising discriminative ability. The final combined dataset — original scraped data plus all augmented rows — is saved to Model Training/Final_Dataset.csv.
## Dataset Preparation and Preprocessing

Location: `Model Training/MultiLabel Classification.ipynb`
Stack: pandas · scikit-learn (MultiLabelBinarizer) · matplotlib
Once the augmented dataset was in place, preparation for training involved several cleaning and structuring steps.
After merging TrainCombine.csv (the augmented training set) with test_set.csv, the combined DataFrame was deduplicated using drop_duplicates(). NaN values were inspected column-by-column. Rows where Tactic-Name, Technique-Name, or SubTechnique-Name were missing were audited; for a known set of parent-level IDs (e.g., T1055, T1078, T1027, T1090, T1110, T1016) that lack a sub-technique by design, the metadata was manually filled in to ensure no rows were dropped that contained usable descriptions.
A single Labels column was constructed by concatenating the four label columns with a comma separator:
```python
merged_df['Labels'] = merged_df[['Tactic-Name', 'Technique-Name', 'SubTechnique-Name', 'id']].agg(', '.join, axis=1)
```

This turned each row's four classification targets into one string, e.g.:

```
discovery, system network connections discovery, no sub-technique, t1049
```
The Labels column was then split back into a Python list and passed to MultiLabelBinarizer from scikit-learn, which produced a sparse binary matrix where each column corresponds to one unique label. The full label vocabulary covers all tactic names, technique names, sub-technique names, and technique IDs from the 560 MITRE ATT&CK classes spanning Enterprise, ICS, and Mobile domains.
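For illustration, the encoding step might look like this (a sketch assuming the merged DataFrame described above):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Split the comma-joined label string back into a list of labels per row
label_lists = merged_df['Labels'].str.split(', ')

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_lists)  # binary matrix: (n_rows, n_unique_labels)
print(len(mlb.classes_))            # size of the combined label vocabulary
```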
An 80 / 20 stratified split was performed using train_test_split(..., random_state=42), yielding training and validation sets that were used throughout training and evaluation.
The length distribution of descriptions was plotted with matplotlib to understand the spread. The analysis confirmed that most descriptions fall well within DistilBERT's 512-token limit at the word level, validating the max_length=128 tokenisation cap chosen for training (short cybersecurity procedures rarely exceed 128 tokens).
## Tokenization and Multi-Label Classification Training

Location: `Model Training/MultiLabel Classification.ipynb`
Stack: transformers (DistilBertTokenizer, DistilBertForSequenceClassification) · torch · accelerate · scikit-learn
A comparative evaluation of three BERT-based architectures was conducted using Optuna for automated hyperparameter optimisation: DistilBERT (distilbert-base-uncased), SecBERT, and SecureBERT. Performance was assessed on accuracy, ROC-AUC, Hamming loss, and validation loss.
DistilBERT was selected for several reasons:

- It achieved the best balance of high accuracy (93.29%) and exceptional discriminative power (96.67% ROC-AUC) while using 40% fewer parameters than BERT-base.
- It retains ~97% of BERT's language understanding at 60% of the inference cost — important for a pipeline that must classify hundreds of text chunks per PDF.
- Its bidirectional attention captures the full sentence context of cybersecurity descriptions, which are often structured as `[Subject] used [technique], resulting in [outcome]`.
- The HuggingFace `DistilBertForSequenceClassification` class natively supports `problem_type="multi_label_classification"`, applying binary cross-entropy loss across all label columns simultaneously — correctly handling the fact that a single text can map to multiple labels (e.g., a technique that belongs to both Defense Evasion and Privilege Escalation).
SecBERT and SecureBERT, while domain-specific to cybersecurity, did not outperform DistilBERT in this multi-label setting when hyperparameters were properly optimised via Optuna.
`DistilBertTokenizer.from_pretrained("distilbert-base-uncased")` was used. Each description was tokenised with:

- `truncation=True` — descriptions longer than 128 tokens are truncated
- `padding="max_length"` — shorter descriptions are padded to 128 tokens
- `max_length=128`
- Output: `input_ids` and `attention_mask` tensors
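For example (a minimal sketch; the sample sentence is illustrative):

```python
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
encoding = tokenizer(
    "APT29 used PowerShell to enumerate network connections on the host.",
    truncation=True,
    padding="max_length",
    max_length=128,
    return_tensors="pt",
)
print(encoding["input_ids"].shape)       # torch.Size([1, 128])
print(encoding["attention_mask"].shape)  # torch.Size([1, 128])
```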
A CustomDataset class (inheriting from torch.utils.data.Dataset) wraps the tokenised inputs and the multi-hot label vectors, returning flattened tensors per sample so they can be collated by HuggingFace's Trainer.
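A sketch consistent with that description (attribute names are assumptions):

```python
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    """Wraps tokenised encodings and multi-hot label vectors for the Trainer."""

    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict of tensors from the tokenizer
        self.labels = labels        # multi-hot label matrix (one row per sample)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item
```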
The model was trained using HuggingFace's Trainer API. Hyperparameters were determined through automated search via Optuna. The optimal configuration — identified in Trial 3 of the Optuna sweep — is:
| Hyperparameter | Optimal Value |
|---|---|
| `num_train_epochs` | 4 |
| `per_device_train_batch_size` | 32 |
| `per_device_eval_batch_size` | 32 |
| `learning_rate` | 1.023e-05 |
| `warmup_ratio` | 0.0866 |
| `weight_decay` | 0.0475 |
| `classification_threshold` | 0.391 |
| `evaluation_strategy` | epoch |
| `save_total_limit` | 2 |
| `checkpointing_interval` | every 1000 steps |
The low learning rate (1.023e-5) with a warm-up ratio of 0.0866 stabilises training in the early epochs before the full learning rate is applied, preventing large destabilising gradient updates right at the start of fine-tuning. Weight decay acts as L2 regularisation to prevent over-fitting on the augmented dataset. The classification threshold of 0.391 (rather than 0.5) was derived from the Optuna sweep to produce the highest macro F1 — a slightly permissive threshold improves recall across rare sub-technique classes at an acceptable precision cost.
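Mapped onto the `Trainer` API, this configuration corresponds roughly to the following sketch (`num_labels`, `train_dataset`, and `val_dataset` are assumed to come from the preparation steps; the `evaluation_strategy` keyword matches older `transformers` releases):

```python
from transformers import (DistilBertForSequenceClassification, Trainer,
                          TrainingArguments)

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=num_labels,                      # size of the multi-hot vector
    problem_type="multi_label_classification",  # BCE-with-logits loss
)

args = TrainingArguments(
    output_dir="distilbert-finetuned",
    num_train_epochs=4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1.023e-5,
    warmup_ratio=0.0866,
    weight_decay=0.0475,
    evaluation_strategy="epoch",
    save_steps=1000,
    save_total_limit=2,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=val_dataset)
trainer.train()
```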
After training, the following artifacts are persisted for use during inference:
- `distilbert-finetuned/` — model weights + tokenizer (via `trainer.save_model()` and `tokenizer.save_pretrained()`)
- `multilabel_binarizer.pkl` — the fitted `MultiLabelBinarizer` needed to convert binary predictions back to human-readable label strings
- `tokenizer_best.pkl` — a pickle of the tokenizer for environments without the HuggingFace cache
The model architecture itself consists of 6 transformer layers and 12 attention heads in the DistilBERT encoder, followed by a linear classification layer with sigmoid activation that generates independent probabilities for each of the 560 TTP labels.
## Evaluation and Metrics

Location: `Model Training/MultiLabel Classification.ipynb`
Stack: scikit-learn · torch
Since this is a multi-label classification problem — where the positive class frequency varies enormously across 560 labels — standard accuracy would be misleading. Three metrics were computed at each evaluation epoch:
- `f1_score(y_true, y_pred, average='macro')` — Macro F1 computes F1 independently per label and averages the results, giving equal weight to rare labels. This is the primary metric because the goal is to classify all MITRE ATT&CK techniques equally well, not just the common ones.
- `roc_auc_score(y_true, y_pred, average='macro')` — AUC measures the model's ability to rank positive predictions above negatives across all thresholds. The macro average ensures rare classes are not drowned out by high-frequency ones.
- `hamming_loss(y_true, y_pred)` — Hamming loss is the fraction of label positions that are incorrectly predicted (both false positives and false negatives contribute). Lower is better. It provides an overall picture of the per-label error rate.
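These could be wired into the `Trainer`'s `compute_metrics` hook along these lines (a hedged sketch; the notebook's exact implementation may differ):

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, roc_auc_score

def compute_metrics(eval_pred):
    logits, y_true = eval_pred
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid over independent label logits
    y_pred = (probs >= 0.391).astype(int)  # Optuna-tuned decision threshold
    return {
        "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "roc_auc_macro": roc_auc_score(y_true, probs, average="macro"),
        "hamming_loss": hamming_loss(y_true, y_pred),
    }
```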
| Metric | Value |
|---|---|
| F1 Score (macro avg across epochs) | 0.933 |
| Best F1 Score (final evaluation) | 0.9339 |
| ROC-AUC (macro avg) | 0.964 |
| Hamming Loss | 0.000393 |
| Runtime per epoch | ~22.2 seconds |
| Throughput | ~540 samples/second · ~16.8 steps/second |
| Confusion matrix (TPs / TNs / errors each) | 6036 / 6036 / 459 |
The F1 score of 0.933 compares favourably with prior work: TTPDrill achieved 84% precision / 82% recall, rcATT achieved 80% F1, and TTPHunter (SecureBERT-based) achieved 97.09% — but all of these classified at technique level only and did not cover sub-techniques. TIEF is the first to classify across all 560 MITRE ATT&CK technique and sub-technique classes while achieving comparable accuracy.
During inference, logits are passed through a sigmoid activation, and any probability ≥ 0.391 (determined by Optuna) is treated as a positive prediction for that label. This threshold was tuned to maximise the macro F1 across all 560 labels during the Optuna sweep.
## Validation and Error Analysis

For model validation, 1,500 records were randomly sampled from the CTI-Bench dataset (which contains 3,115 entries of real-world CTI descriptions). Because the original CTI-Bench dataset lacked the procedure-level MITRE ATT&CK labels critical for evaluation, a team of graduate and PhD cybersecurity students manually reviewed and annotated each prediction generated by TIEF.
The model correctly predicted 831 out of 1,500 samples, yielding an accuracy of approximately 55.4%. This moderate accuracy reflects a granularity mismatch: TIEF is designed to classify MITRE ATT&CK procedures — detailed, step-by-step descriptions of adversarial actions. CTI-Bench entries are higher-level, generalised summaries that mention affected products and vulnerability types but lack concrete procedural context or step-by-step behavioural detail. There are currently no publicly available benchmark datasets that fully align with procedure-level classification.
Of the 669 misclassified cases, 60 examples were manually reviewed to identify error causes:
| Error Category | Share | Explanation |
|---|---|---|
| Semantic overlap | ~58% | Model confused closely related sub-techniques sharing vocabulary (e.g., file upload vulnerabilities vs. system discovery via terms like "systemConfig" or "upload") |
| Insufficient procedural detail | ~25% | Input descriptions lacked explicit attacker actions or concrete operational context required for precise sub-technique disambiguation |
| Label noise / ambiguity | ~17% | Inconsistencies or vagueness in the underlying reference data |
TIEF uses DistilBERT Base-Uncased, a general-purpose language model not specifically pre-trained on cybersecurity corpora. Domain-specific terminology and subtle expressions of attack methods may not be fully captured. Models trained on cybersecurity-specific corpora (SecBERT, SecureBERT, CyberBERT) may further improve performance. Future work includes developing procedure-aligned benchmark datasets and integrating domain-adapted language models. The topic modelling component is also planned to be upgraded from BERTopic to U-BERTopic — an urgency-aware, BERT-enhanced topic modelling approach designed for cybersecurity contexts — to improve grouping of semantically related threat sentences.
## Inference Pipeline — PDF to TTP

Location: `Final Workflow/Final Workflow-Paragraph.ipynb` and `Final Workflow/Final Workflow-Sentences.ipynb`
Stack: pymupdf · spacy · bertopic · umap-learn · hdbscan · sentence-transformers · transformers · torch · stix2 · nltk · Custom TTPelement module
The inference pipeline is a seven-step process that converts a raw PDF into a folder of STIX 2.1 JSON files.
Library: PyMuPDF (pymupdf / fitz)
```python
import fitz  # PyMuPDF

with fitz.open(pdf_path) as doc:
    text = ""
    for page in doc:
        text += page.get_text()
```

PyMuPDF was chosen because it handles complex PDF layouts (multi-column, embedded tables, mixed fonts) more reliably than PyPDF2 or pdfminer for cybersecurity report formats, and unlike Tika it requires no Java runtime. It returns UTF-8 text directly, preserving paragraph structure.
Libraries: re (regex) · spaCy (en_core_web_sm / en_core_web_lg)
The raw extracted text is cleaned in two passes:
- Whitespace normalisation — `re.sub(r'\s+', ' ', text)` collapses all run-on whitespace from PDF line breaks.
- Character filtering — `re.sub(r'[^a-zA-Z0-9\s.,!?:/()\[\]@_-]+', '', text)` retains only characters relevant to natural language and IOC patterns (IP notation, URLs, file paths, email addresses) while removing PDF artefacts, bullet glyphs, and non-ASCII junk.
Sentence boundary detection is then done with spaCy (nlp(text).sents). spaCy was chosen over NLTK's sent_tokenize because its rule-based sentence detector handles the heavy use of abbreviations, acronyms, and version strings common in cybersecurity text without splitting mid-sentence on strings like CVE-2021-26855 or v3.2.exe.
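Putting the two cleaning passes and the spaCy sentence splitter together (a condensed sketch):

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def clean_and_split(raw_text: str) -> list[str]:
    text = re.sub(r'\s+', ' ', raw_text)                        # whitespace normalisation
    text = re.sub(r'[^a-zA-Z0-9\s.,!?:/()\[\]@_-]+', '', text)  # character filtering
    return [sent.text.strip() for sent in nlp(text).sents]
```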
Libraries: BERTopic · sentence-transformers · UMAP · HDBSCAN · CountVectorizer · nltk (PorterStemmer) · spaCy
A PDF security report can yield hundreds of sentences. Passing every sentence through the DistilBERT classifier would be expensive and noisy. Instead, BERTopic is used to group sentences into semantic topics and extract only the most representative sentences per topic, dramatically reducing the classification input while preserving coverage.
The BERTopic model is assembled from custom sub-components:
BERTopic uses sentence-transformers (all-MiniLM or similar) under the hood to embed each sentence into a dense vector space.
`UMAP(n_neighbors=3, n_components=5, min_dist=0.0, metric='cosine', random_state=42)`

UMAP reduces the high-dimensional sentence embeddings to 5 dimensions before clustering. `n_neighbors=3` and `min_dist=0.0` are set aggressively small for a small corpus (a single PDF's sentences), so tightly grouped semantically similar sentences collapse into clear clusters. Cosine distance is used because direction (not magnitude) encodes semantic similarity in embedding space.
`HDBSCAN(min_cluster_size=2, min_samples=1, metric='euclidean', cluster_selection_method='eom', prediction_data=True)`

HDBSCAN is a density-based hierarchical clustering algorithm that does not require specifying the number of clusters in advance — ideal since different reports will have different numbers of topics. `min_cluster_size=2` ensures even pairs of similar sentences form a topic. Points that do not belong to any cluster are labelled as noise (topic -1) and can be handled separately. `prediction_data=True` enables the `.transform()` call for assigning new points to existing clusters.
Three representation models run in parallel to produce human-readable topic keywords:
- `KeyBERTInspired` — extracts keywords from each topic's representative documents using BERT-style cosine similarity between document embeddings and candidate n-grams. This forms the "Main" representation.
- `PartOfSpeech` (spaCy) — filters candidates by POS patterns tuned for cybersecurity semantics: `ADJ+NOUN` (e.g., malicious software), bare `NOUN` (e.g., malware), `VERB+NOUN` (e.g., exploit vulnerability), `NOUN+NOUN` (e.g., threat actor), and `PROPN` (e.g., APT29, Lazarus Group). This ensures topic labels are substantive rather than stop-word debris.
- `MaximalMarginalRelevance(diversity=0.4)` — diversifies the final keyword set so the N returned keywords are not all synonyms of each other, giving a broader thematic label.
A custom CountVectorizer is configured with:
- `ngram_range=(1, 2)` — captures both single keywords and common two-word cybersecurity phrases
- A `StemTokenizer` that applies NLTK's `PorterStemmer` — reducing encrypted/encrypting/encryption to a single stem reduces vocabulary fragmentation in short text
- `min_df=1` — even words appearing once are included (important for rare CVE or malware names)
- `max_df=0.95` — discards words appearing in 95%+ of documents (noise / common English words not caught by the English stop list)
After fitting, topic_model.get_representative_docs() returns the 3 most centroid-adjacent documents for each discovered topic. These representative sentences, rather than all sentences, are passed to subsequent stages.
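A sketch of how these components plug into BERTopic (a hedged reconstruction from the parameters above; the notebook's exact wiring may differ):

```python
from bertopic import BERTopic
from bertopic.representation import (KeyBERTInspired, MaximalMarginalRelevance,
                                     PartOfSpeech)
from hdbscan import HDBSCAN
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

class StemTokenizer:
    """Tokenise on whitespace and reduce each token to its Porter stem."""
    def __init__(self):
        self.stemmer = PorterStemmer()
    def __call__(self, doc):
        return [self.stemmer.stem(token) for token in doc.split()]

pos_patterns = [  # ADJ+NOUN, NOUN, VERB+NOUN, NOUN+NOUN, PROPN
    [{"POS": "ADJ"}, {"POS": "NOUN"}], [{"POS": "NOUN"}],
    [{"POS": "VERB"}, {"POS": "NOUN"}], [{"POS": "NOUN"}, {"POS": "NOUN"}],
    [{"POS": "PROPN"}],
]

topic_model = BERTopic(
    umap_model=UMAP(n_neighbors=3, n_components=5, min_dist=0.0,
                    metric='cosine', random_state=42),
    hdbscan_model=HDBSCAN(min_cluster_size=2, min_samples=1, metric='euclidean',
                          cluster_selection_method='eom', prediction_data=True),
    vectorizer_model=CountVectorizer(ngram_range=(1, 2), min_df=1, max_df=0.95,
                                     tokenizer=StemTokenizer()),
    representation_model={
        "Main": KeyBERTInspired(),
        "POS": PartOfSpeech("en_core_web_sm", pos_patterns=pos_patterns),
        "MMR": MaximalMarginalRelevance(diversity=0.4),
    },
)

topics, probs = topic_model.fit_transform(sentences)  # sentences: list[str]
rep_docs = topic_model.get_representative_docs()      # top documents per topic
```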
Module: Custom TTPelement
Before classification, each representative sentence is passed through a custom IOC extraction module that systematically identifies 13 IOC types using a hybrid approach combining Regular Expressions (regex) and a gazetteer:
- Regex handles pattern-structured artifacts (IP addresses, hashes, URLs, file paths, CVEs, registry keys, email addresses) regardless of obfuscation — critical for inconsistently formatted threat reports.
- Gazetteer handles vocabulary-based recognition for terms that do not conform to strict syntactic rules. Communication protocols, for example, are matched against a curated dictionary of terms: `http`, `https`, `ftp`, `smtp`, `pop3`, `dns`, etc.
| IOC Type | Extraction Method | Examples |
|---|---|---|
| `ipv4` | Regex | 45.33.32.156, 203.0.113.60 |
| `ipv6` | Regex | IPv6 addresses |
| `domain` | Regex | compromised-site.com, malicious-ads.io |
| `url` | Regex | http://phishing-site.com/login |
| `email` | Regex | noreply@security-alerts.com |
| `filename` | Regex | UpdateInstaller.exe, ransom_payload_v3.2.exe |
| `hash` | Regex | MD5/SHA file hashes |
| `filepath` | Regex | Absolute file paths |
| `cve` | Regex | CVE-2021-26855 |
| `regkey` | Regex | Windows registry keys |
| `asn` | Regex | Autonomous system numbers |
| `communicationprotocol` | Gazetteer | HTTPS, FTP, SMTP, DNS |
| `encodeencryptalgorithms` | Gazetteer | AES, RSA, XOR encryption names |
The module performs soft tagging: each IOC instance in the text is replaced with a generic token representing its category, while the original values are preserved in a dictionary of arrays indexed by their respective categories. For example:
- Input: `"The C2 server at 203.0.113.60 communicated via HTTPS"`
- Soft-tagged output: `"The C2 server at ipv4 communicated via communicationprotocol"`
- IOC dictionary: `{"ipv4": ["203.0.113.60"], "communicationprotocol": ["HTTPS"]}`
This abstraction serves two purposes: reducing data variability (so the classifier never overfits to specific numeric IPs or domain strings) and preventing the model from pattern-matching on individual IOC instances rather than learning the behavioural semantics. The soft-tagged version is what the DistilBERT classifier receives; the original IOC dictionary is passed separately to the STIX generation stage.
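A minimal sketch of the soft-tagging idea for two of the 13 IOC types (the regex patterns and gazetteer here are simplified illustrations, not the `TTPelement` module's actual patterns):

```python
import re

# Simplified illustrative patterns; the real TTPelement module is more thorough.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,7}\b", re.IGNORECASE),
}
PROTOCOL_GAZETTEER = {"http", "https", "ftp", "smtp", "pop3", "dns"}

def soft_tag(text: str):
    iocs = {}
    # Regex pass: record each match, then replace it with its category token
    for tag, pattern in IOC_PATTERNS.items():
        for match in pattern.findall(text):
            iocs.setdefault(tag, []).append(match)
        text = pattern.sub(tag, text)
    # Gazetteer pass: replace known protocol names word by word
    tagged_words = []
    for word in text.split():
        if word.lower().strip(".,") in PROTOCOL_GAZETTEER:
            iocs.setdefault("communicationprotocol", []).append(word)
            tagged_words.append("communicationprotocol")
        else:
            tagged_words.append(word)
    return " ".join(tagged_words), iocs

tagged, iocs = soft_tag("The C2 server at 203.0.113.60 communicated via HTTPS")
print(tagged)  # The C2 server at ipv4 communicated via communicationprotocol
print(iocs)    # {'ipv4': ['203.0.113.60'], 'communicationprotocol': ['HTTPS']}
```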
Libraries: transformers · torch · pandas
The fine-tuned DistilBERT model is loaded from `distilbert-finetuned/` along with `multilabel_binarizer.pkl`. The soft-tagged sentence is tokenised at `max_length=512` (a longer limit than in training, to accommodate occasional longer representative paragraphs), pushed through the model's 6-layer encoder and linear classification head, and predictions are extracted:
```python
sigmoid = torch.nn.Sigmoid()
probs = sigmoid(logits).cpu().numpy()
preds = (probs >= 0.391).astype(int)  # threshold from Optuna optimisation
labels = multilabel_binarizer.inverse_transform(preds)
```

The raw predicted labels are a flat tuple of strings such as `('defense evasion', 'phishing', 'spearphishing link', 't1566.002')`. These are categorised against the `Labels.csv` lookup table into four structured fields:
- Tactic Name — matched against the `Tactic-Name` column
- Technique Name — matched against the `Technique-Name` column
- Sub-Technique Name — matched against the `SubTechnique-Name` column
- Technique ID — matched against the `id` column
The model will not always predict all four label types simultaneously. The analyze_output function implements a progressive crosswalk strategy using the MITRE ATT&CK taxonomy table (Labels.csv):
- If all four label types are predicted and jointly match a row in the table → accept as-is.
- If three of four are predicted and match a row → infer the missing fourth from that row.
- If two of four are predicted → check all pairwise combinations against the table, reconstruct a merged answer.
- If only one type is predicted → perform a single-column lookup to retrieve the full row.
This guarantees that even partial model outputs are expanded into complete, coherent ATT&CK mappings rather than producing incomplete records.
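A hedged sketch of the crosswalk idea (the notebook's `analyze_output` is more elaborate; this version simply returns the taxonomy rows that agree with the most predicted values):

```python
import pandas as pd

# Taxonomy lookup table (see Data Collection above)
labels_df = pd.read_csv("Labels.csv")
COLUMNS = ["Tactic-Name", "Technique-Name", "SubTechnique-Name", "id"]

def crosswalk(predicted: set[str]) -> list[dict]:
    """Expand a partial set of predicted label strings into full ATT&CK rows."""
    # Count, for each taxonomy row, how many of its four values were predicted
    matches = labels_df[COLUMNS].apply(
        lambda row: sum(str(v).lower() in predicted for v in row), axis=1)
    if matches.max() == 0:
        return []  # nothing in the taxonomy matches this prediction
    best = labels_df[matches == matches.max()]
    return best[COLUMNS].to_dict("records")

# e.g. the model emitted only a technique ID:
print(crosswalk({"t1049"}))
```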
Paragraph-mode fallback: In Final Workflow-Paragraph.ipynb, if a merged paragraph-level chunk returns empty predictions, the system falls back to sentence-level processing, running predict() on each sentence individually.
Library: stix2
Once TTPs and IOCs are known for a chunk, three types of STIX 2.1 objects are created:
One object per TTP prediction, containing:
- `name`: `"<Technique Name> - <Sub-Technique Name>"`
- `description`: the original (non-replaced) text chunk
- `external_references`: source `mitre-attack` with the Technique ID as `external_id`
- `kill_chain_phases`: tactic name normalised to kebab-case as `phase_name` under the `mitre-attack` kill chain
One object per extracted IOC, containing:
- `name`: the IOC type
- `indicator_types`: `["malicious-activity"]`
- `pattern`: a type-specific STIX pattern string, e.g.:
  - IP: `[ipv4-addr:value = '45.33.32.156']`
  - Domain: `[domain-name:value = 'compromised-site.com']`
  - URL: `[url:value = 'http://phishing-site.com/login']`
  - CVE: `[vulnerability:cve = 'CVE-2021-26855']`
  - Registry key: `[windows-registry-key:key = '...']`
One "indicates" relationship per Indicator → AttackPattern link, describing which IOC is associated with which attack technique. Timestamps are set at creation time.
All objects are collected into a STIX Report (with the full text as description and labels=["threat-report"]) and then a Bundle. The bundle is serialised as a JSON file. Each chunk produces one or more JSON files (one per TTP prediction set) saved to a structured folder hierarchy:
```
Results_Topic_Para/
└── <PDF report name>/
    └── topic_<id>/
        └── chunk_<n>/
            ├── prediction_1.json
            └── prediction_2.json
```
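A condensed sketch of the object construction and serialisation with the `stix2` library (all values are illustrative; identifiers and timestamps are generated automatically):

```python
from stix2 import AttackPattern, Bundle, Indicator, Relationship, Report

attack = AttackPattern(
    name="Phishing - Spearphishing Link",
    description="Original (non-soft-tagged) text chunk ...",
    external_references=[{"source_name": "mitre-attack",
                          "external_id": "T1566.002"}],
    kill_chain_phases=[{"kill_chain_name": "mitre-attack",
                        "phase_name": "initial-access"}],
)

indicator = Indicator(
    name="ipv4",
    indicator_types=["malicious-activity"],
    pattern="[ipv4-addr:value = '203.0.113.60']",
    pattern_type="stix",
)

rel = Relationship(indicator, "indicates", attack)

report = Report(
    name="Example threat report",
    labels=["threat-report"],
    published="2024-01-01T00:00:00Z",
    description="Full report text ...",
    object_refs=[attack, indicator, rel],
)

bundle = Bundle(attack, indicator, rel, report)

with open("prediction_1.json", "w") as f:
    f.write(bundle.serialize(pretty=True))
```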
## Two Workflow Variants

The `Final Workflow/` directory contains two notebooks that differ in how the text is chunked before topic modelling and classification:
**Sentence mode** (`Final Workflow-Sentences.ipynb`): Each sentence from the cleaned PDF text is treated as a discrete unit. Topic modelling and IOC extraction are applied per sentence. This produces finer-grained STIX objects and is better suited to reports where individual sentences describe discrete attack steps. Results are saved to `Results_Topic_Sentences/`.
**Paragraph mode** (`Final Workflow-Paragraph.ipynb`): After topic modelling, the representative documents for each topic are merged into a single paragraph before IOC extraction and classification. This coarser granularity produces fewer STIX bundles, but each bundle captures a fuller contextual picture of the technique. It is better suited to narrative reports where context is distributed across multiple sentences. Results are saved to `Results_Topic_Para/`.
Both variants share identical code for IOC extraction, model inference, label completion, and STIX generation. The process_reports() batch function present in both notebooks iterates over all PDFs in a Reports/ folder, processes up to 100 at a time, logs per-file status to processing.log, and reports total wall-clock time at the end.
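The batch driver might look roughly like this (a sketch; `process_single_pdf` stands in for the per-file pipeline described above and is hypothetical):

```python
import logging
import os
import time

logging.basicConfig(filename="processing.log", level=logging.INFO)

def process_reports(reports_dir: str = "Reports", limit: int = 100) -> None:
    """Run the PDF-to-STIX pipeline over a folder of reports, logging status."""
    start = time.time()
    pdfs = [f for f in os.listdir(reports_dir) if f.lower().endswith(".pdf")][:limit]
    for name in pdfs:
        try:
            process_single_pdf(os.path.join(reports_dir, name))  # hypothetical per-file pipeline
            logging.info("Processed: %s", name)
        except Exception as exc:
            logging.error("Failed: %s (%s)", name, exc)
    print(f"Processed {len(pdfs)} reports in {time.time() - start:.1f}s")
```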
## Technology Stack Summary

| Stage | Tools / Libraries | Reason for Choice |
|---|---|---|
| Data scraping | Selenium (browser automation) | Automated extraction of all 560 ATT&CK technique/sub-technique procedures across Enterprise, ICS, and Mobile domains from the MITRE ATT&CK website |
| Dataset augmentation | `anthropic` (Claude 3.5 Sonnet) | High-quality few-shot generation of on-topic attack descriptions; 8192-token output window |
| Data wrangling | `pandas`, `re` | Standard tabular manipulation; regex for description cleaning |
| Label encoding | scikit-learn `MultiLabelBinarizer` | Native support for variable-length label sets → binary matrix for BCE loss |
| Tokenization | `transformers` `DistilBertTokenizer` | Sub-word WordPiece tokenizer aligned with the pre-trained weights; handles technical jargon and OOV terms gracefully |
| Multi-label classification | `DistilBertForSequenceClassification` (HuggingFace) | 40% smaller than BERT with comparable accuracy; native multi-label classification mode with BCE loss |
| Hyperparameter optimisation | Optuna | Automated search across batch size, learning rate, warmup ratio, weight decay, and classification threshold; DistilBERT selected over SecBERT/SecureBERT |
| Training loop | `transformers` `Trainer`, `accelerate` | Handles gradient accumulation, evaluation, checkpointing, and mixed precision without boilerplate |
| Evaluation | scikit-learn (F1, ROC-AUC, Hamming loss) | Macro-averaged metrics give equal importance to rare ATT&CK techniques |
| PDF extraction | PyMuPDF (`fitz`) | Reliable, layout-agnostic text extraction from complex PDFs |
| NLP / sentence chunking | spaCy (`en_core_web_sm`/`lg`) | Production-grade sentence boundary detection; handles abbreviations and technical strings |
| Topic modelling | BERTopic | Combines transformer embeddings + UMAP + HDBSCAN; auto-determines topic count |
| Dimensionality reduction | `umap-learn` | Preserves the local topology of the semantic embedding space; better than PCA/t-SNE for downstream clustering |
| Clustering | `hdbscan` | Parameter-light density clustering; handles the noise class natively; no k to specify |
| Topic representation | `KeyBERTInspired`, `PartOfSpeech`, `MaximalMarginalRelevance` | Layered keyword extraction tuned to cybersecurity vocabulary and POS patterns |
| Vectorization | scikit-learn `CountVectorizer` + NLTK `PorterStemmer` | Stemmed n-gram vocabulary reduces fragmentation in short technical text |
| IOC extraction | Custom `TTPelement` module (regex + gazetteer) | Hybrid approach: regex for structured artifacts (IPs, hashes, CVEs, URLs), gazetteer for vocabulary-based terms (protocols); 13 IOC types; produces soft-tagged text for the classifier |
| STIX output | `stix2` | Industry-standard structured threat-sharing format; native Python library for STIX 2.1 object construction |
| Serialization | `json`, `pickle` | Bundle export to `.json`; model/binarizer persistence via pickle |
| Logging / orchestration | `logging`, `os`, `sys`, `time` | Per-file processing logs, structured output folder creation, rate-limit management |