So recently I've been using the QwenTTS backend to make a novel reading tool, and I ran into a problem: whenever I set QWEN3_TTS_CODEC_GPU=1, my GPU driver crashes. I've tried both an AMD and an NVIDIA GPU, and both show the same problem.
After several hours of debugging I tracked down the culprit: conv-transpose-1d.cu. So the Metal hang on M1 you wrote about at the beginning of qwen3_tts.cpp is probably not a coincidence. It looks like the same kind of kernel problem, just on a different backend.
A dumb way to work around it is to make supports_op return false for GGML_OP_CONV_TRANSPOSE_1D:
case GGML_OP_CONV_TRANSPOSE_1D:
    // Original type check, disabled: accepting this op on the GPU triggers the crash.
    // ggml_type src0_type = op->src[0]->type;
    // ggml_type src1_type = op->src[1]->type;
    // if ((src0_type == GGML_TYPE_F32 || src0_type == GGML_TYPE_F16) && src1_type == GGML_TYPE_F32) {
    //     return true;
    // }
    // Report the op as unsupported so it is not run on this backend.
    return false;
This is the best workaround I've found. With it, the GPU handles all codec computation except conv1d_transpose, and TTS speeds up by about 50%. It works on both AMD and NVIDIA cards.
I think something is wrong with the ggml kernel: it just can't handle this many conv1d_transpose calls in TTS.
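For anyone wondering why a single `return false` is enough, here is a toy sketch of the routing idea. It assumes the scheduler only places an op on a backend that reports support for it and otherwise falls back to the CPU. The names `Op`, `Device`, `gpu_supports_op` and `place` are made up for illustration and are not ggml's API.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; these are not ggml types.
enum class Op { MatMul, Conv1D, ConvTranspose1D };
enum class Device { GPU, CPU };

// Mirrors the workaround: when disable_conv_transpose is set, the GPU
// backend declines the transposed convolution and accepts everything else.
bool gpu_supports_op(Op op, bool disable_conv_transpose) {
    if (op == Op::ConvTranspose1D && disable_conv_transpose) {
        return false;
    }
    return true;
}

// Toy placement rule: use the GPU when it reports support, else the CPU.
Device place(Op op, bool disable_conv_transpose) {
    return gpu_supports_op(op, disable_conv_transpose) ? Device::GPU
                                                       : Device::CPU;
}
```

With the flag set, `place(Op::ConvTranspose1D, true)` yields `Device::CPU` while every other op stays on `Device::GPU`, which matches what I see: only the transposed conv moves off the GPU.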
I also tested conv1d_transpose on s2.cpp (another TTS implementation that uses ggml), and it shares the same problem, just less severe: s2.cpp only uses transposed conv in the quantizer upsample stage, while CrispASR hits it in every decoder block. So it's not a project-specific issue. It's the ggml kernel that can't handle this op at TTS scale.
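To separate "slow" from "wrong" when testing a backend, a naive CPU reference for the op can be handy for small inputs. The sketch below is not ggml's implementation. It assumes a single channel, no padding and dilation 1, so the output length is (n_in - 1) * stride + k. The function names are my own.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive single-channel 1D transposed convolution (no padding, dilation 1).
// Each input sample scatters a scaled copy of the kernel into the output,
// starting at offset i * stride.
std::vector<float> conv_transpose_1d_ref(const std::vector<float>& x,
                                         const std::vector<float>& w,
                                         std::size_t stride) {
    assert(stride > 0 && !x.empty() && !w.empty());
    std::vector<float> y((x.size() - 1) * stride + w.size(), 0.0f);
    for (std::size_t i = 0; i < x.size(); ++i) {
        for (std::size_t k = 0; k < w.size(); ++k) {
            y[i * stride + k] += x[i] * w[k];
        }
    }
    return y;
}

// Largest absolute element-wise difference between two equal-length buffers,
// e.g. a backend's output versus the reference above.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float m = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        m = std::max(m, std::fabs(a[i] - b[i]));
    }
    return m;
}
```

For example, input {1, 2, 3} with kernel {1, 1} and stride 2 gives {1, 1, 2, 2, 3, 3}. Comparing a GPU result against this with `max_abs_diff` would show whether the kernel is also producing bad values or is only too slow.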
Data
CrispASR (Qwen3-TTS 1.7B), same Japanese utterance:
| Case | AMD RX 7900 XTX (HIP) | NVIDIA RTX 5060 Ti (CUDA) |
| --- | --- | --- |
| Before fix (7 frames) | GPU 2142 ms vs CPU 305 ms | - |
| Before fix (longer) | driver crash vs CPU 655 ms | driver crash |
| 59-frame standalone codec | driver crash | - |
| After fix | codec 161 ms, rtf 0.3 | - |
s2.cpp (FishSpeech), same long Chinese utterance, 100+ frames, AMD RX 7900 XTX only:
| Case | decode (ms) | rtf |
| --- | --- | --- |
| conv on GPU | ~204-277 | ~0.80-0.85 |
| conv on CPU | ~210-237 | ~0.79-0.87 |
| diff | ~5-15 ms slower on GPU | - |
s2.cpp uses transposed conv only in the quantizer upsample stage (far fewer calls than CrispASR), so the overhead is small and no crash occurs. The kernel itself is still slow, just not slow enough to trigger TDR (the driver's Timeout Detection and Recovery reset).