
PUMA Core

Aayush Ankit edited this page Sep 24, 2018 · 13 revisions

Instruction latency - in terms of cycles

Fetch and Decode take 1 cycle each.
Execute latency varies across instructions.
| Instruction | Execute | Comment |
|---|---|---|
| Load/Store | 24 = `vec_width*(16/256)*3` | for vec_width = 128 |
| ALU - non-transcendental | 128 = `vec_width*1` | add, sub, mul etc. for vec_width = 128 |
| MVM | 2304 = `(16+2)*128` | for operand_precision of 16 bits |
| Copy/Set | 128 = `vec_width*1` | for vec_width = 128 |
| ALU - transcendental | 384 = `vec_width*3` | tanh, sig for vec_width = 128 |
| Send/Receive | 158 = `31+vec_width-1` | in practice, this depends on the distance between tiles; we assume average latency with receive_rate = send_rate, i.e. no network saturation |
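
The formulas in the table above can be checked with a small Python sketch. This is illustrative only and not part of the PUMA simulator: the function names, the instruction labels, and the round-up for load/store are assumptions, and the constants (16, 256, 3, 31, and the 128 in the MVM row) are copied from the table without interpretation. The total simply adds the one-cycle fetch and decode stages to the execute latency and ignores any overlap between instructions.

```python
def execute_latency(instr: str, vec_width: int = 128, operand_precision: int = 16) -> int:
    """Execute-stage latency in cycles, following the table's formulas.

    The instruction labels used here are hypothetical names for the table rows.
    """
    if instr == "load_store":
        # vec_width * (16/256) * 3. Rounding up is an assumption for widths
        # where the product is not a whole number (it is exact for 128).
        return (vec_width * 16 * 3 + 255) // 256
    if instr in ("alu", "copy_set"):
        # One cycle per vector element (add, sub, mul, copy, set).
        return vec_width * 1
    if instr == "alu_transcendental":
        # Three cycles per vector element (tanh, sig).
        return vec_width * 3
    if instr == "mvm":
        # (operand_precision + 2) * 128. The table does not say what the
        # 128 stands for, so it is kept as a literal constant.
        return (operand_precision + 2) * 128
    if instr == "send_receive":
        # Average case, assuming receive_rate == send_rate (no saturation).
        return 31 + vec_width - 1
    raise ValueError(f"unknown instruction class: {instr}")


def total_latency(instr: str, **kwargs) -> int:
    """Fetch (1 cycle) + decode (1 cycle) + execute, with no overlap assumed."""
    return 1 + 1 + execute_latency(instr, **kwargs)


if __name__ == "__main__":
    for name in ("load_store", "alu", "mvm", "copy_set",
                 "alu_transcendental", "send_receive"):
        print(name, execute_latency(name), total_latency(name))
```

With the default `vec_width = 128` and `operand_precision = 16`, the execute latencies reproduce the values listed in the table.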
