I saw that https://github.com/SuriyaaMM/feather was released not too long ago.
I was wondering, is that something that would ideally be implemented here? Would it basically speed up all fp8 operations, assuming the computation finishes long before the data is copied from VRAM to the registers? It seems to be in line with the kernels here.