⚠ We use dtype FP16, because FP32 is much slower due to the hardware limit: TFLOPS = 32 (INT8) / 16 (FP16) / 2 (FP32). INT8 does not even work properly; we tried twice :( ...
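
To put those numbers in perspective, here is a minimal sketch of the arithmetic behind the choice. It only uses the three peak figures quoted above; `PEAK_TFLOPS` and `relative_speed` are hypothetical names made up for this illustration, and the ratios are theoretical peaks, so real-world speedups will likely be smaller.

```python
# Peak throughput figures quoted in the note above (assumed to be hardware peaks).
PEAK_TFLOPS = {"INT8": 32, "FP16": 16, "FP32": 2}


def relative_speed(dtype: str, baseline: str = "FP32") -> float:
    """Theoretical peak speedup of `dtype` over `baseline` (hypothetical helper)."""
    return PEAK_TFLOPS[dtype] / PEAK_TFLOPS[baseline]


# Print each dtype's peak throughput relative to FP32.
for name in PEAK_TFLOPS:
    print(f"{name}: {relative_speed(name):.0f}x the FP32 peak")
```

On paper FP16 has 8x the FP32 peak, and INT8 would double that again, which is why it was worth trying even though it did not work out.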