PUMA Core

Instruction	Execute	Comment
Load/Store	24 = `vec_width`*(16/256)*3	for `vec_width` = 128
ALU - non-transcendental	128 = `vec_width`*1	`add`, `sub`, `mul` etc. for `vec_width` = 128
MVM	2304 = (16+2)*128	for `operand_precision` of 16 bits
Copy/Set	128 = `vec_width`*1	for `vec_width` = 128
ALU - transcendental	384 = `vec_width`*3	`tanh`, `sig` for `vec_width` = 128
Send/Receive	158 = 31+`vec_width`-1	in practice this depends on the distance between tiles; we assume average latency with `receive_rate` = `send_rate`, i.e. no network saturation
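
The arithmetic in the table can be checked with a short script. This is a minimal sketch, not code from the simulator: the function name `execute_latencies` and the dictionary keys are hypothetical, and the numeric constants (16/256, 3, 2, 128, 31) are copied from the table without further interpretation. Treating the 16 in the MVM row as `operand_precision` follows that row's comment.

```python
def execute_latencies(vec_width=128, operand_precision=16):
    """Recompute the Execute column of the table above.

    Hypothetical helper for illustration only; the constants are taken
    verbatim from the table and are not read from any simulator config.
    """
    return {
        # vec_width * (16/256) * 3, kept in integer arithmetic
        "load_store": vec_width * 16 // 256 * 3,
        # one unit per vector element
        "alu_non_transcendental": vec_width * 1,
        # (operand_precision + 2) * 128; the 128 is the table's constant
        "mvm": (operand_precision + 2) * 128,
        "copy_set": vec_width * 1,
        # three units per vector element (tanh, sig)
        "alu_transcendental": vec_width * 3,
        # 31 + vec_width - 1, assuming no network saturation
        "send_receive": 31 + vec_width - 1,
    }


if __name__ == "__main__":
    for name, value in execute_latencies().items():
        print(f"{name}: {value}")
```

With the defaults (`vec_width` = 128, `operand_precision` = 16) the values match the Execute column: 24, 128, 2304, 128, 384 and 158.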
