Codec GPU Bug in GGML_OP_CONV_TRANSPOSE_1D (and my dumb fix)

I've recently been using the QwenTTS backend to build a novel-reading tool, and I ran into a problem: whenever I set `QWEN3_TTS_CODEC_GPU=1`, my GPU driver crashes. I've tried both an AMD and an NVIDIA GPU, and both show the same kind of problem.

After several hours of debugging I tracked down the culprit: `conv-transpose-1d.cu`. So the Metal hang on M1 you wrote about at the beginning of `qwen3_tts.cpp` is probably not a coincidence. It looks like the same op misbehaving, just on a different backend.

A dumb way to fix it is to force `GGML_OP_CONV_TRANSPOSE_1D` to return `false` in `supports_op`:

```cpp
case GGML_OP_CONV_TRANSPOSE_1D:
    // Workaround: report the op as unsupported so it is not run on the GPU.
    // Original check, disabled because enabling it triggers the crash:
    // ggml_type src0_type = op->src[0]->type;
    // ggml_type src1_type = op->src[1]->type;
    // if ((src0_type == GGML_TYPE_F32 || src0_type == GGML_TYPE_F16) && src1_type == GGML_TYPE_F32) {
    //     return true;
    // }
    return false;
```

This is the best workaround I've found: with it, the GPU handles all codec computation except `conv1d_transpose`, which falls back to the CPU, and TTS speeds up by about 50%. It works on both AMD and NVIDIA cards.
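If hard-coding `return false` feels too blunt, the same fallback could be made switchable at runtime. Below is a minimal sketch, assuming a hypothetical environment variable `GGML_FORCE_CPU_CONV_TRANSPOSE_1D` and a hypothetical helper `conv_transpose_1d_forced_to_cpu()`; neither exists in ggml, they are only for illustration.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper, NOT part of ggml: returns true when the user asked to
// keep GGML_OP_CONV_TRANSPOSE_1D off the GPU.
// The variable name below is an assumption made for this sketch.
static bool conv_transpose_1d_forced_to_cpu() {
    const char * v = std::getenv("GGML_FORCE_CPU_CONV_TRANSPOSE_1D");
    // Unset, empty, or "0" means: keep the backend's normal behavior.
    return v != nullptr && v[0] != '\0' && std::strcmp(v, "0") != 0;
}
```

Inside the `GGML_OP_CONV_TRANSPOSE_1D` case, one would return `false` when this helper returns `true` and otherwise run the original type check, so people who don't hit the crash keep the GPU path.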

I think something is wrong with the ggml kernel: it just can't handle the amount of `conv1d_transpose` work that TTS throws at it.

I also tested `conv1d_transpose` in s2.cpp (another TTS implementation that uses ggml), and it has the same problem, just less severe: s2.cpp only uses transposed conv in the quantizer upsample stage, while CrispASR hits it in every decoder block. So it's not a project-specific issue; the ggml kernel can't handle this op at TTS scale.
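For anyone who wants to reproduce this outside the TTS pipeline, it helps to have a plain CPU reference of what the op computes and how its cost scales. The sketch below is a naive 1D transposed convolution with stride only (no padding, dilation 1). The function name and the memory layout are assumptions made for illustration, not ggml's actual tensor layout.

```cpp
#include <cstddef>
#include <vector>

// Naive 1D transposed convolution (stride only, no padding, dilation 1).
// Layouts are an assumption for this sketch, not ggml's tensor layout:
//   input : [c_in][l_in]
//   kernel: [c_in][c_out][k]
//   output: [c_out][l_out], with l_out = (l_in - 1) * stride + k
std::vector<float> conv_transpose_1d_ref(const std::vector<float> & input,
                                         const std::vector<float> & kernel,
                                         std::size_t c_in, std::size_t c_out,
                                         std::size_t l_in, std::size_t k,
                                         std::size_t stride) {
    if (l_in == 0 || k == 0) {
        return {};  // nothing to compute; also avoids unsigned underflow below
    }
    const std::size_t l_out = (l_in - 1) * stride + k;
    std::vector<float> out(c_out * l_out, 0.0f);
    for (std::size_t ci = 0; ci < c_in; ++ci) {
        for (std::size_t co = 0; co < c_out; ++co) {
            for (std::size_t i = 0; i < l_in; ++i) {
                for (std::size_t j = 0; j < k; ++j) {
                    // Each input sample scatters a scaled copy of the kernel
                    // into the output, starting at position i * stride.
                    out[co * l_out + i * stride + j] +=
                        input[ci * l_in + i] * kernel[(ci * c_out + co) * k + j];
                }
            }
        }
    }
    return out;
}
```

One call performs `c_in * c_out * l_in * k` multiply-adds, so the work grows linearly with the number of frames, and it is repeated for every layer that uses the op.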

---

### Data

**CrispASR (Qwen3-TTS 1.7B), same Japanese utterance:**

| Scenario | AMD RX 7900 XTX (HIP) | NVIDIA RTX 5060 Ti (CUDA) |
|---|---|---|
| Before fix (7 frames) | GPU 2142ms vs CPU 305ms | — |
| Before fix (longer) | driver crash vs CPU 655ms | driver crash |
| 59-frame standalone codec | driver crash | — |
| After fix | codec **161ms**, rtf **0.3** | — |

**s2.cpp (FishSpeech), same long Chinese utterance, 100+ frames, AMD RX 7900 XTX only:**

| Configuration | decode (ms) | rtf |
|---|---|---|
| conv on GPU | ~204-277 | ~0.80-0.85 |
| conv on CPU | ~210-237 | ~0.79-0.87 |
| diff | ~5-15ms slower on GPU | — |

s2.cpp uses transposed conv only in the quantizer upsample stage (far fewer calls than CrispASR), so the overhead is small and no crash occurs. The kernel itself is still slow, just not slow enough to trigger a TDR (Timeout Detection and Recovery, the driver reset after a GPU timeout).

