Granite.c

A minimal, educational implementation of the IBM Granite 4.0 350M transformer model in pure C. This project demonstrates the core concepts of large language model inference without relying on external deep learning frameworks. This project is designed for learning and understanding transformer architectures. It prioritizes code clarity and readability over performance optimization. Do not expect production-level speed or efficiency.

Overview

Granite.c is a from-scratch implementation of a decoder-only transformer model that can run inference on the IBM Granite 4.0 350M parameter model. The codebase demonstrates:

Transformer Architecture: Multi-head grouped-query attention, SwiGLU activations, RoPE positional embeddings
Custom Data Types: BFloat16 (BF16) weight storage with runtime conversion to FP32
Memory Management: Manual memory allocation and KV-cache implementation
Tokenization: Basic BPE-style tokenizer with vocabulary lookup
Autoregressive Generation: Token-by-token text generation with streaming output

Architecture Details

The implementation includes the following components:

Model Configuration (Granite 4.0 350M)

Vocabulary Size: 100,352 tokens
Context Length: 32,768 tokens
Model Dimension: 1,024
Number of Layers: 28
Attention Heads: 16 (with 4 KV heads for grouped-query attention)
Feed-Forward Dimension: 2,048
Positional Encoding: RoPE (Rotary Position Embedding)

Key Features

Grouped-Query Attention (GQA): Reduces KV-cache memory by sharing key/value heads across query heads
SwiGLU Activation: Gated activation function in the feed-forward network
RMS Normalization: Efficient normalization alternative to LayerNorm
BFloat16 Weights: Reduces model size while maintaining float32 dynamic range
Streaming Output: Real-time token generation with terminal output

Project Structure

granite.c/
├── src/
│   ├── main.c              # Entry point and generation loop
│   ├── model.h             # Model architecture and hyperparameters
│   ├── model_init.c        # Model initialization and weight loading
│   ├── forward.c           # Forward pass implementation
│   ├── tokenizer.c         # Tokenization and detokenization
│   ├── weights_loader.c    # Binary weight file loading
│   ├── dtype.h             # BFloat16 type and conversions
│   ├── math_utils.h        # Matrix operations and normalization
│   ├── activations.h       # Activation functions (SiLU, SwiGLU)
│   └── utils.h             # Utility macros
├── granite-4.0-350m-BF16/  # Model weights directory
│   ├── model.gguf          # Original GGUF format model
│   ├── vocab.txt           # Vocabulary file
│   ├── weights_index.json  # Weight metadata
│   └── weights/            # Extracted binary weight files
├── extract_weights.py      # GGUF to binary weight converter
├── export_vocab.py         # Vocabulary extraction script
├── tokenize.py             # Tokenization script
├── Makefile                # Build configuration
└── README.md               # This file

Prerequisites

C Compiler: GCC (OpenMP support is optional but helpful)
Python 3.8+: For weight extraction scripts
Python Libraries:
- numpy
- gguf (for GGUF file reading)
- transformers (for vocabulary extraction)

Getting Started

1. Download the Model

Download the IBM Granite 4.0 350M model and place it in the project directory:

mkdir -p granite-4.0-350m-BF16
wget https://huggingface.co/ibm-granite/granite-4.0-350m-GGUF/resolve/main/granite-4.0-350m-bf16.gguf?download=true -O granite-4.0-350m-BF16/model.gguf

2. Extract Weights and Vocabulary

Run the Python scripts to convert the model weights from GGUF format to binary files:

# Activate virtual environment (if using one)
python3 -m venv .venv
source .venv/bin/activate  # On Linux/macOS
# .venv\Scripts\activate   # On Windows

# Install dependencies
pip install numpy gguf transformers

# Extract weights from GGUF file
python3 extract_weights.py

# Export vocabulary
python3 export_vocab.py

This will create the granite-4.0-350m-BF16/weights/ directory with individual binary files for each tensor and a vocab.txt file.

3. Build the Project

Compile the C source code using the provided Makefile:

make

This will create the executable at build/granite-c.

4. Run Inference

Execute the compiled binary:

./build/release/granite-c

The program will generate text starting from the initial token "Hello" and display the output in real-time. If you want to set a different starting point for the model change the initial token(s) in src/main.c and rebuild. You can use tokenize.py to get the tokenization for the desired input.

Code Walkthrough

Forward Pass (`src/forward.c`)

The forward pass consists of:

Embedding Lookup: Convert token ID to embedding vector with scaling
Layer Processing: 28 transformer layers, each with:
- RMS normalization
- Multi-head grouped-query attention with RoPE
- Residual connection
- RMS normalization
- SwiGLU feed-forward network
- Residual connection
Output Normalization: Final RMS norm
Logit Computation: Matrix multiplication with embedding weights (tied weights)

Attention Mechanism

The implementation uses Grouped-Query Attention (GQA):

16 query heads
4 key/value heads (each KV head is shared by 4 query heads)
Rotary Position Embeddings (RoPE) applied to queries and keys
Attention scores scaled by 1/sqrt(head_dim) = 0.015625

KV-Cache

To avoid recomputing attention for previous tokens during autoregressive generation, the model maintains a key-value cache:

Stores key and value projections for each layer and position
Dimensions: [N_LAYERS][MAX_SEQ_LEN][N_KV_HEADS * HEAD_DIM]

Tokenization

The tokenizer implements:

Vocabulary lookup from vocab.txt
Special token handling (spaces as Ġ, newlines as Ċ)
Detokenization with proper spacing and formatting

Performance Notes

This implementation is not optimized for speed with basic OpenMP parallelization only.

Expected Performance: Slow. This is intended for educational purposes to understand how transformers work at a low level.

Customization

Changing the Model

To use a different model or variant:

Update the constants in src/model.h (vocabulary size, layers, dimensions, etc.)
Modify BASE_DIR in src/main.c to point to your model directory
Ensure weight extraction scripts handle your model format

Adjusting Generation

Modify in src/main.c:

NUM_TOKENS: Number of tokens to generate
MAX_SEQ_LEN: Maximum sequence length (in src/model.h)
Initial prompt tokens

Sampling Strategy

Currently uses greedy decoding (argmax). To implement other sampling methods:

Modify the sampling logic in src/main.c after forward_token()
Consider temperature scaling, top-k, top-p, or other techniques

Learning Resources

This implementation demonstrates concepts from:

"Attention Is All You Need" (Vaswani et al., 2017) - Original Transformer paper
"RoFormer: Enhanced Transformer with Rotary Position Embedding" (Su et al., 2021)
"GQA: Training Generalized Multi-Query Transformer" (Ainslie et al., 2023)
"GLU Variants Improve Transformer" (Shazeer, 2020) - SwiGLU activation

License

This project is for educational purposes. Please refer to IBM Granite's license for model usage terms.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Granite.c

Overview

Architecture Details

Model Configuration (Granite 4.0 350M)

Key Features

Project Structure

Prerequisites

Getting Started

1. Download the Model

2. Extract Weights and Vocabulary

3. Build the Project

4. Run Inference

Code Walkthrough

Forward Pass (`src/forward.c`)

Attention Mechanism

KV-Cache

Tokenization

Performance Notes

Customization

Changing the Model

Adjusting Generation

Sampling Strategy

Learning Resources

License

About

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 84 Commits
.vscode		.vscode
src		src
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
export_vocab.py		export_vocab.py
extract_weights.py		extract_weights.py
tokenize.py		tokenize.py

Folders and files

Latest commit

History

Repository files navigation

Granite.c

Overview

Architecture Details

Model Configuration (Granite 4.0 350M)

Key Features

Project Structure

Prerequisites

Getting Started

1. Download the Model

2. Extract Weights and Vocabulary

3. Build the Project

4. Run Inference

Code Walkthrough

Forward Pass (src/forward.c)

Attention Mechanism

KV-Cache

Tokenization

Performance Notes

Customization

Changing the Model

Adjusting Generation

Sampling Strategy

Learning Resources

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages

Forward Pass (`src/forward.c`)