
Ternary Weights Explained

thesavantidiot edited this page Apr 18, 2026 · 1 revision

Ternary Weights Explained — Why {-1, 0, +1} Changes Everything

How to eliminate float multiplies from neural network inference

The problem with normal neural networks

A standard neural network layer does this: y = W @ x, where W is a matrix of float32 weights and x is the input vector. Each output element is a dot product: every weight in one row of W is multiplied by the corresponding input, and the products are summed. For a 256x64 matrix, that is 16,384 float multiplies per layer.
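As a baseline for comparison, here is a minimal sketch of the float layer described above. The function name `dense_matvec` and the row-major layout are assumptions made for illustration; they are not taken from ATOME.

```c
#include <stddef.h>

/* Reference dense layer: y = W @ x with row-major float weights.
 * W holds rows * cols entries. Each output element costs `cols`
 * float multiplies, so the whole product costs rows * cols of them. */
void dense_matvec(const float *W, const float *x, float *y,
                  size_t rows, size_t cols)
{
    for (size_t j = 0; j < rows; ++j) {
        float acc = 0.0f;
        for (size_t i = 0; i < cols; ++i) {
            acc += W[j * cols + i] * x[i];  /* one float multiply per weight */
        }
        y[j] = acc;
    }
}
```

The inner statement runs rows * cols times, which is where the 16,384 multiplies for a 256x64 matrix come from.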

On a microcontroller without an FPU, a single float multiply takes 10-50 clock cycles. On a $2 chip running at 160 MHz, this becomes the bottleneck that makes neural networks impractical.
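To put those numbers together, the following sketch estimates the time spent on multiplies alone. The helper name `float_mul_time_us` is hypothetical, and the estimate ignores loads, adds, and loop overhead.

```c
#include <stdint.h>

/* Rough time in microseconds for n_mul software float multiplies,
 * given an assumed cycle cost per multiply and a clock in MHz
 * (a clock of N MHz executes N cycles per microsecond). */
uint32_t float_mul_time_us(uint32_t n_mul, uint32_t cycles_per_mul,
                           uint32_t clock_mhz)
{
    uint64_t cycles = (uint64_t)n_mul * cycles_per_mul;  /* avoid 32-bit overflow */
    return (uint32_t)(cycles / clock_mhz);
}
```

For the 256x64 layer at 160 MHz, `float_mul_time_us(16384, 10, 160)` gives 1024 and `float_mul_time_us(16384, 50, 160)` gives 5120, so roughly 1 to 5 ms per layer for the multiplies alone.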

The ternary solution

What if every weight is restricted to just three values: -1, 0, or +1?

Then the multiply becomes:

Weight   Operation    Cost
+1       y += x[i]    1 add
-1       y -= x[i]    1 subtract
 0       skip         0

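Zero float multiplies. The entire matrix-vector product is pure addition and subtraction. On an MCU, an integer add or subtract typically takes 1 cycle, so with integer inputs the arithmetic per weight is roughly 10-50x cheaper than a software float multiply, before unpacking and branching overhead.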

How it works in ATOME

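/* Accumulate output row j. acc must be zero on entry;
   row_offset is the index of that row's first trit. */
for (int i = 0; i < cols; ++i) {
    int8_t w = atome_unpack_trit(packed, row_offset + i);  /* -1, 0 or +1 */
    if (w == 1)       acc += x[i];
    else if (w == -1) acc -= x[i];
    /* w == 0: skip */
}
out[j] = acc * scale;  /* one multiply per output element */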

The only float multiply is the final scale factor — one per output element, not one per weight.
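For readers who want something they can compile on its own, here is a minimal sketch of the same loop wrapped in a full function. It is not ATOME's code: the name `ternary_matvec` is hypothetical, the weights are assumed to be already unpacked to one `int8_t` per trit, and the activations are assumed to be `int16_t` so that the accumulation stays in integer arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

/* Ternary matrix-vector product on unpacked weights.
 * t:     rows * cols trits, each -1, 0 or +1, row-major
 * x:     cols integer activations (int16 is an assumption)
 * scale: one float for the whole matrix
 * out:   rows float outputs */
void ternary_matvec(const int8_t *t, const int16_t *x, float scale,
                    float *out, size_t rows, size_t cols)
{
    for (size_t j = 0; j < rows; ++j) {
        int32_t acc = 0;                       /* reset for every output row */
        for (size_t i = 0; i < cols; ++i) {
            int8_t w = t[j * cols + i];
            if (w == 1)       acc += x[i];     /* +1: add */
            else if (w == -1) acc -= x[i];     /* -1: subtract */
            /* w == 0: skip */
        }
        out[j] = (float)acc * scale;           /* the only float multiply per row */
    }
}
```

With trits {1, 0, -1} and {-1, -1, 0}, inputs {10, 20, 30}, and a scale of 0.5, the two outputs are -10.0 and -15.0.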

2-bit packing: 4x compression

Since each weight has only 3 possible values, it fits in 2 bits:

  • 00 = 0
  • 01 = +1
  • 11 = -1

Four weights per byte. A 256x64 weight matrix takes just 4,096 bytes instead of 65,536 bytes (float32) or 16,384 bytes (int8). This is critical for fitting models into MCU flash.
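The encoding above can be sketched as a pair of pack and unpack helpers. The names `pack_trit`, `unpack_trit`, and `packed_size_bytes` are hypothetical, and the bit order inside each byte (lowest two bits first) is an assumption; ATOME's actual `atome_unpack_trit` may lay the bits out differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Write trit t (-1, 0 or +1) at position idx.
 * Assumed layout: trit idx lives in byte idx / 4 at bit offset 2 * (idx % 4). */
void pack_trit(uint8_t *packed, size_t idx, int8_t t)
{
    unsigned code  = (unsigned)(t & 0x3);        /* 0 -> 00, +1 -> 01, -1 -> 11 */
    unsigned shift = 2u * (unsigned)(idx % 4);
    unsigned byte  = packed[idx / 4];
    byte = (byte & ~(0x3u << shift)) | (code << shift);  /* clear slot, then set */
    packed[idx / 4] = (uint8_t)byte;
}

/* Read the trit at position idx back as -1, 0 or +1. */
int8_t unpack_trit(const uint8_t *packed, size_t idx)
{
    unsigned code = (packed[idx / 4] >> (2u * (unsigned)(idx % 4))) & 0x3u;
    if (code == 0x1u) return 1;
    if (code == 0x3u) return -1;
    return 0;  /* 00; the unused code 10 is also treated as 0 in this sketch */
}

/* Bytes needed to store n_weights trits at four trits per byte. */
size_t packed_size_bytes(size_t n_weights)
{
    return (n_weights + 3) / 4;
}
```

Because 11 is the 2-bit two's-complement form of -1, packing reduces to masking the low two bits. `packed_size_bytes(256 * 64)` returns 4096, matching the figure above.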

The scale factor

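During training, ternary weights are derived from the full-precision weights using a sign function plus a scale factor computed from those weights:

ternary_W = sign(W) * scale

where scale = mean(|W|), the average absolute value of the weights. This single float per weight matrix preserves the magnitude information that the signs alone would lose. Note that sign(W) by itself maps every nonzero weight to -1 or +1; a weight becomes 0 only if the quantizer also zeroes small-magnitude values, for example with a threshold.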
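The sign-plus-scale rule can be sketched as follows. The function name `ternarize` and the `delta_ratio` parameter are hypothetical and not confirmed to match ATOME's training code. With `delta_ratio` set to 0 the function reproduces plain sign(W); a positive value adds a dead zone around zero so that small weights become 0, in the spirit of the thresholding used in the Ternary Weight Networks paper (which suggests a threshold near 0.7 times mean(|W|)).

```c
#include <stddef.h>
#include <stdint.h>

/* Quantize n float weights to trits and return the per-matrix scale.
 * scale = mean(|W|), as in the text.
 * Weights with |w| <= delta_ratio * scale become 0 (assumed dead zone);
 * the remaining weights become +1 or -1 according to their sign. */
float ternarize(const float *W, int8_t *t, size_t n, float delta_ratio)
{
    float sum_abs = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum_abs += (W[i] < 0.0f) ? -W[i] : W[i];
    }
    float scale = (n > 0) ? sum_abs / (float)n : 0.0f;
    float delta = delta_ratio * scale;

    for (size_t i = 0; i < n; ++i) {
        if (W[i] > delta)       t[i] = 1;
        else if (W[i] < -delta) t[i] = -1;
        else                    t[i] = 0;
    }
    return scale;
}
```

For weights {0.5, -1.5, 0.25, -0.75} the scale is 0.75. With `delta_ratio` 0 the trits are {1, -1, 1, -1}; with `delta_ratio` 0.5 the threshold is 0.375 and the trits become {1, -1, 0, -1}.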

Does it actually work?

Yes. Research has shown that ternary and binary neural networks can achieve surprisingly good performance:

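  • BitNet (Microsoft, 2023) introduced 1-bit weights for language models; its follow-up, BitNet b1.58 (2024), reported that ternary-weight models can match full-precision baselines from about 3B parameters upward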
  • Ternary Weight Networks (Li et al., 2016) demonstrated the approach on ImageNet
  • ATOME's own model produces identical outputs between Python (full training framework) and C (ternary-only inference), verified on all test inputs

The key insight: neural networks are surprisingly tolerant of weight quantization. For many tasks, much of the useful information is in the sign pattern, not the magnitude.

Where ternary shines

  • MCUs without FPU: eliminates the float multiply bottleneck entirely
  • Memory-constrained devices: 4x compression vs int8, 16x vs float32
  • Battery-powered devices: fewer cycles per inference = longer battery life
  • High-throughput sensors: can run inference on every reading, not just sampled ones

Where ternary struggles

  • Large models: at billions of parameters, the reported accuracy gap narrows, but it is not guaranteed to disappear on every task
  • Fine-grained tasks: tasks requiring very precise numerical reasoning lose more from quantization
  • Training: ternary training requires special techniques (straight-through estimator, learned scales)
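To make the straight-through estimator concrete, here is a minimal sketch of one training step for a single ternary neuron. Everything in it is an illustrative assumption rather than ATOME's training code: the function names, the squared-error loss, the plain sign quantizer, and the fixed (not learned) scale. The idea is that the forward pass uses the quantized weights, while the backward pass treats the quantizer as the identity and applies the gradient to the latent float weights.

```c
#include <stddef.h>

/* Forward pass: y = scale * sum(t_i * x_i), with t_i = sign(w_i)
 * derived on the fly from the latent float weights w. */
float ste_forward(const float *w, const float *x, size_t n, float scale)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (w[i] > 0.0f)      acc += x[i];
        else if (w[i] < 0.0f) acc -= x[i];
    }
    return acc * scale;
}

/* One SGD step on loss = 0.5 * (y - target)^2.
 * Straight-through estimator: d(t_i)/d(w_i) is taken to be 1, so the
 * gradient with respect to the latent weight is err * scale * x_i.
 * Returns the loss measured before the update. */
float ste_sgd_step(float *w, const float *x, size_t n,
                   float scale, float target, float lr)
{
    float err = ste_forward(w, x, n, scale) - target;
    for (size_t i = 0; i < n; ++i) {
        w[i] -= lr * err * scale * x[i];
    }
    return 0.5f * err * err;
}
```

Starting from latent weights {0.5, -0.25}, inputs {1, 2}, scale 1, target 1, and learning rate 0.25, one step moves the weights to {1.0, 0.75}. The second weight crosses zero, so its trit flips from -1 to +1 on the next forward pass, which is how gradient updates reach the discrete weights.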

ATOME focuses on the sweet spot: small models (10K–500K parameters) on microcontrollers where ternary's advantages are overwhelming.