Ternary Weights Explained

Ternary Weights Explained — Why {-1, 0, +1} Changes Everything

How to eliminate float multiplies from neural network inference

The problem with normal neural networks

A standard neural network layer computes y = W @ x, where W is a matrix of float32 weights and x is the input vector. Each output element is the dot product of one row of W with x: every weight in the row is multiplied by the corresponding input and the products are summed. For a 256x64 matrix, that is 16,384 float multiplies per layer.
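For reference, the dense baseline described above can be sketched roughly as follows. The function name `dense_matvec`, the row-major layout, and the multiply counter are assumptions made for illustration; they are not taken from ATOME.

```c
#include <stddef.h>

/* Dense baseline sketch: y = W @ x with row-major float32 weights.
 * Returns the number of float multiplies performed (rows * cols).
 * Name and layout are hypothetical, chosen for this example only. */
size_t dense_matvec(const float *W, const float *x, float *y,
                    size_t rows, size_t cols)
{
    size_t multiplies = 0;
    for (size_t j = 0; j < rows; ++j) {
        float acc = 0.0f;
        for (size_t i = 0; i < cols; ++i) {
            acc += W[j * cols + i] * x[i];   /* one float multiply per weight */
            ++multiplies;
        }
        y[j] = acc;
    }
    return multiplies;
}
```

For the 256x64 matrix in the text, the counter comes out at 256 * 64 = 16,384, matching the figure above.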

On a microcontroller without an FPU, a single float multiply takes 10-50 clock cycles. On a $2 chip running at 160 MHz, this becomes the bottleneck that makes neural networks impractical.

The ternary solution

What if every weight is restricted to just three values: -1, 0, or +1?

Then the multiply becomes:

| Weight | Operation | Cost |
|---|---|---|
| +1 | `y += x[i]` | 1 add |
| -1 | `y -= x[i]` | 1 subtract |
| 0 | skip | 0 |

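No per-weight multiplies remain: the matrix-vector product reduces to additions and subtractions. On an MCU, an integer add or subtract takes 1 cycle, so the cost of each nonzero weight drops by 10-50x compared to a software float multiply, and zero weights add nothing to the sum.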

How it works in ATOME

for (int i = 0; i < cols; ++i) {
    int8_t w = atome_unpack_trit(packed, row_offset + i);
    if (w == 1)       acc += x[i];
    else if (w == -1) acc -= x[i];
    /* w == 0: skip */
}
out[j] = acc * scale;

The only float multiply is the final scale factor — one per output element, not one per weight.
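The loop above is an excerpt. A self-contained version might look like the sketch below. Several things here are assumptions rather than ATOME's actual code: `unpack_trit` is a hypothetical stand-in for `atome_unpack_trit` (whose real signature and bit order are not shown on this page), the first weight of each byte is assumed to sit in the two least-significant bits, and the activations are assumed to be `int16_t` with an `int32_t` accumulator so that the inner loop stays integer-only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for atome_unpack_trit: decode trit `idx` from a
 * 2-bit packed array (00 = 0, 01 = +1, 11 = -1). Assumes the first weight
 * of each byte occupies the two least-significant bits. */
static int8_t unpack_trit(const uint8_t *packed, size_t idx)
{
    uint8_t code = (uint8_t)((packed[idx / 4] >> ((idx % 4) * 2)) & 0x3u);
    if (code == 0x1u) return 1;
    if (code == 0x3u) return -1;
    return 0;   /* 00, and the unassigned code 10, decode to zero */
}

/* out = scale * (T @ x), where T is a rows x cols ternary matrix stored
 * row-major in `packed`. Integer activations are an assumption. */
void ternary_matvec(const uint8_t *packed, const int16_t *x, float *out,
                    size_t rows, size_t cols, float scale)
{
    for (size_t j = 0; j < rows; ++j) {
        int32_t acc = 0;                 /* per-output accumulator */
        size_t row_offset = j * cols;    /* index of the row's first trit */
        for (size_t i = 0; i < cols; ++i) {
            int8_t w = unpack_trit(packed, row_offset + i);
            if (w == 1)       acc += x[i];
            else if (w == -1) acc -= x[i];
            /* w == 0: skip */
        }
        out[j] = (float)acc * scale;     /* the only float multiply per output */
    }
}
```

With a float accumulator instead, the structure is the same, but on a chip without an FPU each add would itself be emulated in software.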

2-bit packing: 4x compression

Since each weight has only 3 possible values, it fits in 2 bits:

00 = 0
01 = +1
11 = -1

Four weights per byte. A 256x64 weight matrix takes just 4,096 bytes instead of 65,536 bytes (float32) or 16,384 bytes (int8). This is critical for fitting models into MCU flash.
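The packing step can be sketched as below, using the 2-bit codes listed above. The names `packed_bytes` and `pack_trits` are hypothetical, and the position of each weight inside a byte (weight k at bit offset 2 * (k % 4)) is an assumption; the page does not state ATOME's actual bit order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes needed to store n trits at four per byte, rounded up. */
size_t packed_bytes(size_t n)
{
    return (n + 3) / 4;
}

/* Pack n weights, each -1, 0 or +1, into 2-bit codes
 * (00 = 0, 01 = +1, 11 = -1). Assumed layout: weight k goes into
 * byte k / 4 at bit offset 2 * (k % 4). `packed` must provide at
 * least packed_bytes(n) bytes. */
void pack_trits(const int8_t *weights, size_t n, uint8_t *packed)
{
    memset(packed, 0, packed_bytes(n));
    for (size_t k = 0; k < n; ++k) {
        uint8_t code = 0x0u;                    /* 00 = 0 */
        if (weights[k] == 1)       code = 0x1u; /* 01 = +1 */
        else if (weights[k] == -1) code = 0x3u; /* 11 = -1 */
        packed[k / 4] |= (uint8_t)(code << ((k % 4) * 2));
    }
}
```

`packed_bytes(256 * 64)` gives the 4,096 bytes quoted above. The three codes coincide with 2-bit two's complement, which leaves 10 as the one unassigned pattern.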

The scale factor

During training, ternary weights are derived from full-precision weights using a sign function plus a learned scale:

ternary_W = sign(W) * scale

Here scale = mean(|W|), the average absolute value of the full-precision weights. This single float per weight matrix preserves the magnitude information that the signs alone would lose.
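A minimal sketch of this derivation is shown below. The function name `ternarize` and the `threshold` parameter are assumptions for illustration. With `threshold = 0` the function reproduces sign(W) exactly as written above, which yields 0 only for weights that are exactly zero. A positive threshold is one common way to send small weights to 0; whether ATOME does this, and with what value, is not stated on this page.

```c
#include <stddef.h>
#include <stdint.h>

/* Derive ternary weights and a scale from full-precision weights.
 * Writes -1, 0 or +1 into out[k] and returns scale = mean(|w|).
 * threshold = 0 gives plain sign(w); threshold > 0 (an assumption,
 * not taken from ATOME) maps |w| <= threshold to 0. */
float ternarize(const float *w, size_t n, float threshold, int8_t *out)
{
    float sum_abs = 0.0f;
    for (size_t k = 0; k < n; ++k) {
        sum_abs += (w[k] < 0.0f) ? -w[k] : w[k];
        if (w[k] > threshold)       out[k] = 1;
        else if (w[k] < -threshold) out[k] = -1;
        else                        out[k] = 0;
    }
    return (n > 0) ? sum_abs / (float)n : 0.0f;
}
```

The returned scale is computed from all weights regardless of the threshold, matching the mean(|W|) definition in the text.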

Does it actually work?

Yes. Research has shown that ternary and binary neural networks can achieve surprisingly good performance:

BitNet b1.58 (Microsoft, 2024) reported ternary weights matching full-precision baselines at scale
Ternary Weight Networks (Li et al., 2016) demonstrated the approach on ImageNet
ATOME's own model produces identical outputs between Python (full training framework) and C (ternary-only inference), verified on all test inputs

The key insight: neural networks are surprisingly tolerant of weight quantization. Most of the information is in the sign pattern, not the magnitude.

Where ternary shines

MCUs without FPU: eliminates the float multiply bottleneck entirely
Memory-constrained devices: 4x compression vs int8, 16x vs float32
Battery-powered devices: fewer cycles per inference = longer battery life
High-throughput sensors: can run inference on every reading, not just sampled ones

Where ternary struggles

Large models: at billions of parameters, the accuracy gap narrows but doesn't disappear
Fine-grained tasks: tasks requiring very precise numerical reasoning lose more from quantization
Training: ternary training requires special techniques (straight-through estimator, learned scales)

ATOME focuses on the sweet spot: small models (10K–500K parameters) on microcontrollers where ternary's advantages are overwhelming.
