
Channel dimension mismatch in encodec/msstftd.py #93

Description

@angelmf97

🐛 Bug Report

There is a dimension mismatch in how the input channels of the inner convolution layers are handled in the DiscriminatorSTFT class.

Problem description

Currently, the code sets the input channels for the second convolution in line 70 using:

in_chs = min(filters_scale * self.filters, max_filters)

However, the first convolution layer outputs self.filters channels, so whenever filters_scale > 1 (and max_filters is larger than self.filters), the second layer expects more input channels than the first one actually produces. For example, with filters=64 and filters_scale=2, the first layer outputs 64 channels, but the second expects 128.

This results in a dimension mismatch error.
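The arithmetic behind that example can be checked without building the model. The snippet below is only a sketch: `second_conv_channels` is a hypothetical helper, and `max_filters=1024` is an assumed value chosen so that the cap does not come into play.

```python
def second_conv_channels(filters: int, filters_scale: int, max_filters: int):
    """Return (channels produced by the first conv, channels expected by the second)."""
    # Per the report, the first convolution outputs self.filters channels.
    produced = filters
    # Mirrors the in_chs expression quoted above.
    expected = min(filters_scale * filters, max_filters)
    return produced, expected


# max_filters=1024 is an assumption made for illustration only.
produced, expected = second_conv_channels(filters=64, filters_scale=2, max_filters=1024)
print(f"first conv produces {produced} channels, second conv expects {expected}")
```

With these values the two numbers are 64 and 128, which is the gap described above.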


To Reproduce

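No special setup is needed to reproduce; the error follows directly from the model definition logic. For instance, DiscriminatorSTFT(filters=64, filters_scale=2) will hit the dimension mismatch as soon as an input is passed through it, since PyTorch checks channel counts in the forward pass rather than at construction.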

Expected behavior

The number of input channels for each convolution layer should match the number of output channels from the previous layer. There should be no dimension mismatch.
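This invariant is easy to state as a check over a list of (in_channels, out_channels) pairs. The following is a minimal sketch, not code from the repository: `find_channel_mismatches` is a hypothetical helper, and the layer lists (including the 2 input channels of the first layer) are made-up values for illustration.

```python
from typing import List, Tuple


def find_channel_mismatches(layers: List[Tuple[int, int]]) -> List[int]:
    """Return indices of layers whose in_channels differ from the previous out_channels."""
    return [
        i
        for i in range(1, len(layers))
        if layers[i][0] != layers[i - 1][1]
    ]


# Illustrative (in_channels, out_channels) pairs, not taken from the repository.
reported = [(2, 64), (128, 128)]    # second layer expects 128, first produces 64
consistent = [(2, 64), (64, 128)]   # second layer expects what the first produces

print(find_channel_mismatches(reported))    # [1]
print(find_channel_mismatches(consistent))  # []
```

A check like this could run in a unit test over the model's convolution list, so a mismatch is caught before any forward pass.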

Actual Behavior

When filters_scale > 1, there is a mismatch between the output channels of one convolution and the expected input channels of the next, leading to a runtime error.

Solution

An easy fix would be to change line 70 so that the input channels of the second convolution match the output channels of the first:

in_chs = self.filters
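To compare the current expression with the proposed one, the sketch below evaluates both for a few values of filters_scale. The function names are hypothetical, and `filters=64` with `max_filters=1024` are assumed values for illustration.

```python
def in_chs_current(filters: int, filters_scale: int, max_filters: int) -> int:
    """Input channels of the second conv as computed by the current line 70."""
    return min(filters_scale * filters, max_filters)


def in_chs_proposed(filters: int) -> int:
    """Input channels of the second conv under the proposed fix."""
    return filters


filters, max_filters = 64, 1024  # assumed values for illustration
for scale in (1, 2, 4):
    current = in_chs_current(filters, scale, max_filters)
    proposed = in_chs_proposed(filters)
    status = "ok" if current == filters else "mismatch"
    print(f"filters_scale={scale}: current={current} ({status}), proposed={proposed}")
```

Under these assumptions the two expressions agree only for filters_scale=1, which would explain why configurations that leave the scale at 1 never trigger the error.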

Labels: bug