Codec GPU Bug in GGML_OP_CONV_TRANSPOSE_1D (and my dumb fix) #155

@Rafa00127

So recently I've been using the QwenTTS backend to make a novel reading tool, and I ran into a problem: whenever I set QWEN3_TTS_CODEC_GPU=1, my GPU driver crashes. I've tried both an AMD and an NVIDIA GPU, and both show the same problem.

After several hours of debugging I tracked down the culprit — conv-transpose-1d.cu. So the Metal hang on M1 you wrote about at the beginning of qwen3_tts.cpp is probably not a coincidence — it's the same kernel bug, just on a different backend.

A dumb way to fix it is to force GGML_OP_CONV_TRANSPOSE_1D to return false in supports_op:

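case GGML_OP_CONV_TRANSPOSE_1D:
    // Workaround: the original type check below is commented out, because
    // accepting this op on the GPU is what leads to the driver crash.
    // ggml_type src0_type = op->src[0]->type;
    // ggml_type src1_type = op->src[1]->type;
    // if ((src0_type == GGML_TYPE_F32 || src0_type == GGML_TYPE_F16) && src1_type == GGML_TYPE_F32) {
    //     return true;
    // }
    return false; // report the op as unsupported so it does not run on the GPU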

This is the best workaround I found: with it, the GPU still handles all codec computation except conv1d_transpose, and TTS speeds up by about 50%. It works on both the AMD and the NVIDIA card.
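To picture what returning false does, here is a toy model of per-op routing. It is only a sketch: `Op`, `Device`, `gpu_supports_op`, `assign_devices` and `count_device_switches` are hypothetical names invented for illustration, not ggml's real types or scheduler, and it assumes any op the GPU backend declines ends up on the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins for illustration; NOT ggml's real types.
enum class Op { MatMul, Conv1D, ConvTranspose1D, Add };
enum class Device { GPU, CPU };

// Mirrors the workaround: the GPU backend declines transposed conv only.
bool gpu_supports_op(Op op) {
    return op != Op::ConvTranspose1D;
}

// Assumption: every op the GPU declines is routed to the CPU instead.
std::vector<Device> assign_devices(const std::vector<Op>& graph) {
    std::vector<Device> plan;
    plan.reserve(graph.size());
    for (Op op : graph) {
        plan.push_back(gpu_supports_op(op) ? Device::GPU : Device::CPU);
    }
    return plan;
}

// Counts how often consecutive ops run on different devices. Each switch
// implies moving a tensor between GPU and CPU memory.
std::size_t count_device_switches(const std::vector<Device>& plan) {
    std::size_t switches = 0;
    for (std::size_t i = 1; i < plan.size(); ++i) {
        if (plan[i] != plan[i - 1]) {
            ++switches;
        }
    }
    return switches;
}
```

For a graph like Conv1D, ConvTranspose1D, Add, ConvTranspose1D, MatMul, this model puts both transposed convs on the CPU and everything else on the GPU, at the cost of four device switches. That copy overhead is the price of the workaround, which is why it is a stopgap and not a kernel fix.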

I think something is wrong with the ggml kernel: it just can't handle the amount of conv1d_transpose work that TTS throws at it.

I also tested conv1d_transpose on s2.cpp (another TTS implementation that uses ggml), and it shares the same problem, just less severe — s2.cpp only uses transposed conv in the quantizer upsample stage, while CrispASR hits it in every decoder block. So it's not a project-specific issue, it's the ggml kernel that can't handle this op at TTS scale.
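For anyone who wants to check a GPU result against a known-good one, a plain CPU reference for 1D transposed convolution could look like the sketch below. It is not ggml's kernel: the function names and the memory layout (input as [c_in][in_len], kernel as [c_in][c_out][kernel_len], output as [c_out][out_len]) are assumptions made for this example, and it assumes zero padding and dilation 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Output length for padding 0 and dilation 1 (assumed here):
// (in_len - 1) * stride + kernel_len.
std::size_t conv_transpose_1d_out_len(std::size_t in_len,
                                      std::size_t kernel_len,
                                      std::size_t stride) {
    assert(in_len > 0 && kernel_len > 0 && stride > 0);
    return (in_len - 1) * stride + kernel_len;
}

// Naive reference: every input sample scatters a scaled copy of the kernel
// into the output, starting at position i * stride.
std::vector<float> conv_transpose_1d_ref(const std::vector<float>& input,
                                         const std::vector<float>& kernel,
                                         std::size_t c_in, std::size_t c_out,
                                         std::size_t in_len,
                                         std::size_t kernel_len,
                                         std::size_t stride) {
    assert(input.size() == c_in * in_len);
    assert(kernel.size() == c_in * c_out * kernel_len);
    const std::size_t out_len =
        conv_transpose_1d_out_len(in_len, kernel_len, stride);
    std::vector<float> out(c_out * out_len, 0.0f);
    for (std::size_t ci = 0; ci < c_in; ++ci) {
        for (std::size_t co = 0; co < c_out; ++co) {
            const float* w = &kernel[(ci * c_out + co) * kernel_len];
            for (std::size_t i = 0; i < in_len; ++i) {
                const float x = input[ci * in_len + i];
                for (std::size_t k = 0; k < kernel_len; ++k) {
                    out[co * out_len + i * stride + k] += x * w[k];
                }
            }
        }
    }
    return out;
}
```

With one input and one output channel, input {1, 2, 3}, kernel {1, 1} and stride 2, the output has length 6 and each sample is simply repeated twice. The loop nest also shows the cost: c_in * c_out * in_len * kernel_len multiply-adds per call.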


Data

CrispASR (Qwen3-TTS 1.7B), same Japanese utterance:

AMD RX 7900 XTX (HIP) NVIDIA RTX 5060 Ti (CUDA)
Before fix (7 frames) GPU 2142ms vs CPU 305ms
Before fix (longer) driver crash vs CPU 655ms driver crash
59-frame standalone codec driver crash
After fix codec 161ms, rtf 0.3

s2.cpp (FishSpeech), same long Chinese utterance, 100+ frames, AMD RX 7900 XTX only:

| | decode (ms) | rtf |
|---|---|---|
| conv on GPU | ~204-277 | ~0.80-0.85 |
| conv on CPU | ~210-237 | ~0.79-0.87 |
| diff | ~5-15ms slower on GPU | |

s2.cpp uses transposed conv only in the quantizer upsample stage (far fewer calls than CrispASR), so the overhead is small and no crash occurs. The kernel itself is still slow, just not slow enough to trigger TDR (the Windows GPU timeout detection and recovery watchdog).
