A minimal, educational implementation of the IBM Granite 4.0 350M transformer model in pure C. It demonstrates the core concepts of large language model inference without relying on external deep learning frameworks. The project is designed for learning and understanding transformer architectures, so it prioritizes code clarity and readability over performance. Do not expect production-level speed or efficiency.
Granite.c is a from-scratch implementation of a decoder-only transformer model that can run inference on the IBM Granite 4.0 350M parameter model. The codebase demonstrates:
- Transformer Architecture: Multi-head grouped-query attention, SwiGLU activations, RoPE positional embeddings
- Custom Data Types: BFloat16 (BF16) weight storage with runtime conversion to FP32
- Memory Management: Manual memory allocation and KV-cache implementation
- Tokenization: Basic BPE-style tokenizer with vocabulary lookup
- Autoregressive Generation: Token-by-token text generation with streaming output
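To illustrate the BF16-to-FP32 conversion mentioned in the list above, here is a minimal sketch. The type name bf16_t and both function names are hypothetical; the project's dtype.h may define them differently. It relies only on the fact that a BF16 value is the upper 16 bits of an IEEE-754 float32.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical type name; the real dtype.h may use something else. */
typedef uint16_t bf16_t;

/* Widen BF16 to FP32: place the 16 stored bits in the high half of a
 * 32-bit word and leave the low 16 mantissa bits zero. */
static inline float bf16_to_f32(bf16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f); /* bit copy avoids aliasing problems */
    return f;
}

/* Narrow FP32 to BF16 with round-to-nearest-even.
 * NaN inputs are not special-cased in this sketch. */
static inline bf16_t f32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
    return (bf16_t)((bits + rounding) >> 16);
}
```

Because the exponent field is unchanged, the widening step is exact; only the narrowing step loses mantissa precision.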
The implementation includes the following components:
- Vocabulary Size: 100,352 tokens
- Context Length: 32,768 tokens
- Model Dimension: 1,024
- Number of Layers: 28
- Attention Heads: 16 (with 4 KV heads for grouped-query attention)
- Feed-Forward Dimension: 2,048
- Positional Encoding: RoPE (Rotary Position Embedding)
- Grouped-Query Attention (GQA): Reduces KV-cache memory by sharing key/value heads across query heads
- SwiGLU Activation: Gated activation function in the feed-forward network
- RMS Normalization: Efficient normalization alternative to LayerNorm
- BFloat16 Weights: Reduces model size while maintaining float32 dynamic range
- Streaming Output: Real-time token generation with terminal output
granite.c/
├── src/
│ ├── main.c # Entry point and generation loop
│ ├── model.h # Model architecture and hyperparameters
│ ├── model_init.c # Model initialization and weight loading
│ ├── forward.c # Forward pass implementation
│ ├── tokenizer.c # Tokenization and detokenization
│ ├── weights_loader.c # Binary weight file loading
│ ├── dtype.h # BFloat16 type and conversions
│ ├── math_utils.h # Matrix operations and normalization
│ ├── activations.h # Activation functions (SiLU, SwiGLU)
│ └── utils.h # Utility macros
├── granite-4.0-350m-BF16/ # Model weights directory
│ ├── model.gguf # Original GGUF format model
│ ├── vocab.txt # Vocabulary file
│ ├── weights_index.json # Weight metadata
│ └── weights/ # Extracted binary weight files
├── extract_weights.py # GGUF to binary weight converter
├── export_vocab.py # Vocabulary extraction script
├── tokenize.py # Tokenization script
├── Makefile # Build configuration
└── README.md # This file
- C Compiler: GCC (OpenMP support is optional but helpful)
- Python 3.8+: For weight extraction scripts
- Python Libraries:
  - numpy
  - gguf (for GGUF file reading)
  - transformers (for vocabulary extraction)
Download the IBM Granite 4.0 350M model and place it in the project directory:
mkdir -p granite-4.0-350m-BF16
wget "https://huggingface.co/ibm-granite/granite-4.0-350m-GGUF/resolve/main/granite-4.0-350m-bf16.gguf?download=true" -O granite-4.0-350m-BF16/model.gguf
Run the Python scripts to convert the model weights from GGUF format to binary files:
# Create and activate a virtual environment (optional)
python3 -m venv .venv
source .venv/bin/activate # On Linux/macOS
# .venv\Scripts\activate # On Windows
# Install dependencies
pip install numpy gguf transformers
# Extract weights from GGUF file
python3 extract_weights.py
# Export vocabulary
python3 export_vocab.py
This will create the granite-4.0-350m-BF16/weights/ directory with individual binary files for each tensor and a vocab.txt file.
Compile the C source code using the provided Makefile:
make
This will create the executable at build/granite-c.
Execute the compiled binary:
./build/release/granite-c
The program will generate text starting from the initial token "Hello" and display the output in real time. To start from a different prompt, change the initial token(s) in src/main.c and rebuild. You can use tokenize.py to get the token IDs for the desired input.
The forward pass consists of:
- Embedding Lookup: Convert token ID to embedding vector with scaling
- Layer Processing: 28 transformer layers, each with:
- RMS normalization
- Multi-head grouped-query attention with RoPE
- Residual connection
- RMS normalization
- SwiGLU feed-forward network
- Residual connection
- Output Normalization: Final RMS norm
- Logit Computation: Matrix multiplication with embedding weights (tied weights)
The implementation uses Grouped-Query Attention (GQA):
- 16 query heads
- 4 key/value heads (each KV head is shared by 4 query heads)
- Rotary Position Embeddings (RoPE) applied to queries and keys
- Attention scores scaled by 0.015625, which equals 1/head_dim for head_dim = 1024/16 = 64 (the conventional 1/sqrt(head_dim) scaling would give 0.125)
To avoid recomputing attention for previous tokens during autoregressive generation, the model maintains a key-value cache:
- Stores key and value projections for each layer and position
- Dimensions: [N_LAYERS][MAX_SEQ_LEN][N_KV_HEADS * HEAD_DIM]
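A flat buffer with the dimensions above can be indexed as in the following sketch. The helper names are hypothetical, HEAD_DIM = 64 is derived from 1,024 / 16, and FP32 cache entries are an assumption.

```c
#include <stddef.h>

enum { N_LAYERS = 28, MAX_SEQ_LEN = 32768, N_KV_HEADS = 4, HEAD_DIM = 64 };
#define KV_DIM (N_KV_HEADS * HEAD_DIM)

/* Number of floats needed for the key cache (the value cache is the same). */
static size_t kv_cache_floats(void) {
    return (size_t)N_LAYERS * MAX_SEQ_LEN * KV_DIM;
}

/* Offset, in floats, of the KV_DIM-wide row for (layer, pos). */
static size_t kv_offset(int layer, int pos) {
    return ((size_t)layer * MAX_SEQ_LEN + (size_t)pos) * KV_DIM;
}

/* Pointer to the row where the projections for (layer, pos) are stored. */
static float *kv_row(float *cache, int layer, int pos) {
    return cache + kv_offset(layer, pos);
}
```

Under these assumptions each of the two caches holds 28 * 32,768 * 256 = 234,881,024 floats, or 896 MiB at full context length, so reducing MAX_SEQ_LEN is the simplest way to cut memory use.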
The tokenizer implements:
- Vocabulary lookup from vocab.txt
- Special token handling (spaces as Ġ, newlines as Ċ)
- Detokenization with proper spacing and formatting
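The marker handling during detokenization can be sketched as below. The function name decode_token is hypothetical, and the sketch assumes vocab.txt is UTF-8, where Ġ (U+0120) is the byte pair C4 A0 and Ċ (U+010A) is C4 8A. It handles only these two markers; a complete byte-level BPE decoder maps more characters than this.

```c
#include <stddef.h>

/* Copy a vocabulary entry into `out`, replacing the UTF-8 encodings of
 * Ġ with a space and Ċ with a newline. Output is truncated to fit and
 * always NUL-terminated when out_cap > 0.
 * Returns the number of bytes written, excluding the terminator. */
static size_t decode_token(const char *tok, char *out, size_t out_cap) {
    size_t n = 0;
    if (out_cap == 0) return 0;
    for (size_t i = 0; tok[i] != '\0' && n + 1 < out_cap; ) {
        unsigned char a = (unsigned char)tok[i];
        /* tok[i] is not NUL, so reading tok[i + 1] is in bounds. */
        unsigned char b = (unsigned char)tok[i + 1];
        if (a == 0xC4 && b == 0xA0) {        /* Ġ -> space */
            out[n++] = ' ';
            i += 2;
        } else if (a == 0xC4 && b == 0x8A) { /* Ċ -> newline */
            out[n++] = '\n';
            i += 2;
        } else {
            out[n++] = tok[i];
            i += 1;
        }
    }
    out[n] = '\0';
    return n;
}
```

In a streaming loop, the decoded bytes of each generated token would be written to stdout and flushed right away.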
This implementation is not optimized for speed; it uses only basic OpenMP parallelization.
Expected Performance: Slow. This is intended for educational purposes to understand how transformers work at a low level.
To use a different model or variant:
- Update the constants in src/model.h (vocabulary size, layers, dimensions, etc.)
- Modify BASE_DIR in src/main.c to point to your model directory
- Ensure weight extraction scripts handle your model format
Modify in src/main.c:
- NUM_TOKENS: Number of tokens to generate
- MAX_SEQ_LEN: Maximum sequence length (in src/model.h)
- Initial prompt tokens
Currently uses greedy decoding (argmax). To implement other sampling methods:
- Modify the sampling logic in src/main.c after forward_token()
- Consider temperature scaling, top-k, top-p, or other techniques
This implementation demonstrates concepts from:
- "Attention Is All You Need" (Vaswani et al., 2017) - Original Transformer paper
- "RoFormer: Enhanced Transformer with Rotary Position Embedding" (Su et al., 2021)
- "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (Ainslie et al., 2023)
- "GLU Variants Improve Transformer" (Shazeer, 2020) - SwiGLU activation
This project is for educational purposes. Please refer to IBM Granite's license for model usage terms.