The 128-bit AVX multiply+add benchmark fails to reach the theoretical 4 instructions/cycle on AMD Zen when run with a single thread. With two threads sharing the core (SMT), 4/cycle is reachable.
See if a single thread can reach 4/cycle on its own, without the help of SMT.
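
One thing that may be worth ruling out first is whether the single-thread loop simply has too few independent dependency chains to keep every pipe busy. Below is a minimal sketch of such a kernel, not the benchmark's actual code. It assumes GCC or Clang (it uses their vector extension, which with `-mavx` is expected to compile to 128-bit `vmulps`/`vaddps`), and `CHAINS = 6` is a guess based on an assumed two pipes per operation type and a latency of about three cycles. The names `mul_add_kernel` and `CHAINS` are hypothetical.

```c
/* 128-bit vector of four floats (GCC/Clang vector extension). */
typedef float v4sf __attribute__((vector_size(16)));

/* Assumption: independent chains per operation type
   (about 2 pipes x 3 cycles of latency). Tune this when experimenting. */
#define CHAINS 6

/* Runs `iters` rounds of CHAINS independent multiplies and CHAINS
   independent adds. No chain reads another chain's result, so in
   principle the multiplies and adds can all be in flight at once.
   Returns the sum of every lane of every accumulator so the work
   cannot be discarded as dead code. */
float mul_add_kernel(long iters, float mul_factor, float add_term)
{
    const v4sf f = {mul_factor, mul_factor, mul_factor, mul_factor};
    const v4sf t = {add_term, add_term, add_term, add_term};
    v4sf m[CHAINS], a[CHAINS];

    for (int c = 0; c < CHAINS; c++) {
        m[c] = (v4sf){1.0f, 1.0f, 1.0f, 1.0f};  /* multiply accumulators */
        a[c] = (v4sf){0.0f, 0.0f, 0.0f, 0.0f};  /* add accumulators */
    }

    for (long i = 0; i < iters; i++) {
        for (int c = 0; c < CHAINS; c++) {
            m[c] = m[c] * f;  /* one multiply per chain per round */
            a[c] = a[c] + t;  /* one add per chain per round */
        }
    }

    float sum = 0.0f;
    for (int c = 0; c < CHAINS; c++)
        for (int lane = 0; lane < 4; lane++)
            sum += m[c][lane] + a[c][lane];
    return sum;
}
```

Each round issues `2 * CHAINS` vector instructions, so instructions/cycle can be estimated as `2 * CHAINS * iters` divided by the measured core cycles. It is worth checking the disassembly before trusting any number: the compiler needs to have unrolled the inner loop and kept all accumulators in registers, otherwise loads and stores will distort the result. If raising the chain count does not close the gap, the limit is presumably elsewhere in the core.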