A compact and educational replication of the original BERT model (Devlin et al., 2018), MiniBERT is designed for practical understanding and experimentation under constrained compute environments. This project pre-trains a simplified BERT architecture on ~30M tokens and fine-tunes it on two NLP tasks: SST-2 for sentiment classification and SWAG for commonsense inference.
"Pre-train on unlabeled data, fine-tune on everything else."
MiniBERT recreates the BERT pipeline — from scratch — with:
- 4 Transformer layers
- Hidden size of 256
- 4 attention heads
- Max sequence length: 128
- Pre-training on BookCorpus + Wikipedia (trimmed)
- Fine-tuning on SST-2 & SWAG
This makes it an ideal reference for anyone learning about Transformers, BERT architecture, or resource-efficient deep learning.
| Component | MiniBERT | BERT-Base |
|---|---|---|
| Layers | 4 | 12 |
| Hidden Size | 256 | 768 |
| Attention Heads | 4 | 12 |
| Seq Length | 128 | 512 |
| Vocabulary Size | 30,522 | 30,522 |
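To make the size gap in the table concrete, the sketch below estimates encoder parameter counts from the listed hyperparameters. It is an illustration, not code from the notebook: the `EncoderConfig` class and `count_encoder_params` helper are hypothetical names, the feed-forward width is assumed to be 4x the hidden size (as in the original BERT), the position table is assumed to match the maximum sequence length, and the MLM/NSP heads are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for the hyperparameters in the table above."""
    layers: int
    hidden: int
    heads: int
    max_positions: int
    vocab_size: int = 30522
    type_vocab_size: int = 2  # segment A / segment B

    @property
    def head_dim(self) -> int:
        return self.hidden // self.heads

    @property
    def ffn(self) -> int:
        # Assumption: feed-forward width is 4x hidden, as in the original BERT.
        return 4 * self.hidden


def count_encoder_params(cfg: EncoderConfig) -> int:
    """Count weights and biases of a BERT-style encoder (no MLM/NSP heads)."""
    h = cfg.hidden
    layer_norm = 2 * h  # scale + shift
    embeddings = (cfg.vocab_size + cfg.max_positions + cfg.type_vocab_size) * h + layer_norm
    attention = 4 * (h * h + h) + layer_norm  # Q, K, V, output projections
    feed_forward = (h * cfg.ffn + cfg.ffn) + (cfg.ffn * h + h) + layer_norm
    pooler = h * h + h  # dense layer applied to the [CLS] vector
    return embeddings + cfg.layers * (attention + feed_forward) + pooler


MINI_BERT = EncoderConfig(layers=4, hidden=256, heads=4, max_positions=128)
BERT_BASE = EncoderConfig(layers=12, hidden=768, heads=12, max_positions=512)

for name, cfg in [("MiniBERT", MINI_BERT), ("BERT-Base", BERT_BASE)]:
    print(f"{name}: head_dim={cfg.head_dim}, params={count_encoder_params(cfg):,}")
```

Under these assumptions the encoder comes to roughly 11M parameters for MiniBERT against roughly 109M for BERT-Base, and both configurations use 64-dimensional attention heads. Because the vocabulary is unchanged, the token embedding table accounts for most of MiniBERT's parameters.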
Built in PyTorch, with inspiration from Hugging Face's `transformers` library.
- MLM: 15% of tokens masked (80% `[MASK]`, 10% random, 10% unchanged)
- MLM loss calculated only on masked tokens
- NSP: 50% `IsNext` (same document), 50% `NotNext` (random pairing)
- Input format: `[CLS] Sentence A [SEP] Sentence B [SEP]`
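The two pre-training objectives above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's implementation: `mask_tokens` and `make_nsp_example` are hypothetical helpers, tokens are plain strings instead of vocabulary ids, each token is selected independently with probability 0.15 (an approximation of masking exactly 15% of positions), and every document is assumed to contain at least two sentences.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked_tokens, labels); labels are None where no loss is taken."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens are never selected for prediction.
        if tok in (CLS, SEP) or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the loss is computed only at selected positions
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels


def make_nsp_example(docs, rng):
    """docs: list of documents, each a list of sentences (lists of tokens)."""
    doc_idx = rng.randrange(len(docs))
    doc = docs[doc_idx]
    i = rng.randrange(len(doc) - 1)
    sent_a = doc[i]
    if rng.random() < 0.5:
        # IsNext: the sentence that actually follows in the same document.
        sent_b, is_next = doc[i + 1], 1
    else:
        # NotNext: a random sentence from a different document.
        other = rng.choice([d for j, d in enumerate(docs) if j != doc_idx])
        sent_b, is_next = rng.choice(other), 0
    tokens = [CLS] + sent_a + [SEP] + sent_b + [SEP]
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segment_ids, is_next


rng = random.Random(0)
docs = [
    [["the", "cat", "sat"], ["it", "purred"]],
    [["rain", "fell"], ["roads", "flooded"]],
]
tokens, segment_ids, is_next = make_nsp_example(docs, rng)
masked, labels = mask_tokens(tokens, ["cat", "rain", "fell"], rng)
print(tokens, segment_ids, is_next)
print(masked, labels)
```

Positions whose label is `None` contribute nothing to the MLM loss, which matches the "loss only on masked tokens" rule; in a tensor-based pipeline the same role is usually played by an ignore index.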
- SST-2: binary sentiment classification, 87.2% validation accuracy
- SWAG: 4-choice multiple choice, 37.3% validation accuracy (limited by compute/resources)
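For the 4-choice task, a common way to reuse a sentence-pair encoder is to pair the context with each candidate ending, score each pair, and take a softmax over the four scores. The sketch below shows only that input layout and the final selection step. It assumes string tokens and made-up scores; `build_choice_inputs` and `predict` are hypothetical names, and the notebook's actual fine-tuning head may differ.

```python
import math


def build_choice_inputs(context, endings):
    """One [CLS] context [SEP] ending [SEP] sequence per candidate ending."""
    return [["[CLS]"] + context + ["[SEP]"] + ending + ["[SEP]"] for ending in endings]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores):
    """Return (index of the highest-probability ending, probabilities)."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


context = ["she", "opens", "the", "door"]
endings = [
    ["and", "walks", "in"],
    ["and", "flies", "away"],
    ["then", "eats", "it"],
    ["while", "asleep"],
]
inputs = build_choice_inputs(context, endings)

# Placeholder scores standing in for one encoder score per (context, ending) pair.
example_scores = [2.0, 0.1, -1.0, 0.5]
choice, probs = predict(example_scores)
print(inputs[choice], probs)
```

With four candidates, uniform guessing scores 25%, which is the baseline against which the reported SWAG accuracy should be read.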
- Python
- PyTorch
- Hugging Face `transformers`
- Google Colab (T4 GPU)
- Datasets: BookCorpus, Wikipedia, SST-2, SWAG
- Clone the repo and install dependencies: `pip install -r requirements.txt`
- Open the notebook: `jupyter notebook DLbertcode.ipynb`
- Run each section:
  - Preprocessing + Dataset loading
  - Model implementation
  - Pre-training (MLM + NSP)
  - Fine-tuning on SST-2 & SWAG
| File | Description |
|---|---|
| `DLbertcode.ipynb` | Main notebook with full implementation |
| `DLbert.pdf` | Project report (background, results) |
| `requirements.txt` | Dependencies list |
| `models/` | (Optional) Saved checkpoints |
| `data/` | (Optional) Preprocessed datasets |
- Small model capacity
- Limited compute (1 GPU, 3 epochs)
- Restricted token count (~30M vs. BERT's 3.3B+)
- Only 2 downstream tasks evaluated
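- Devlin et al., 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Vaswani et al., 2017. Attention Is All You Need
- Zellers et al., 2018. SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
- Socher et al., 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank (source of SST-2)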
Thanks to the contributors and the Department of CSE (AIML) for their support and guidance.