🐛 Bug Report
There is a dimension mismatch in how the input channels of the inner convolution layers are handled in the DiscriminatorSTFT class.
Problem description
Currently, the code sets the input channels for the second convolution in line 70 using:
in_chs = min(filters_scale * self.filters, max_filters)
However, the output of the first convolution layer has self.filters channels, so if filters_scale > 1, the next layer expects more channels than are actually produced. For example, if filters=64 and filters_scale=2, the first layer outputs 64 channels, but the second expects 128.
This results in a dimension mismatch error.
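The arithmetic behind the example can be checked without building the model. The following is only a sketch of the channel bookkeeping described above, not code from the repository: the helper name `second_conv_in_channels` is hypothetical, and the `max_filters` value of 1024 is an assumption chosen so that the cap does not interfere.

```python
def second_conv_in_channels(filters: int, filters_scale: int, max_filters: int) -> int:
    """Input channels of the second convolution, computed as in the quoted line."""
    return min(filters_scale * filters, max_filters)


# Values from the example in the report; max_filters is an assumed value.
filters, filters_scale, max_filters = 64, 2, 1024

first_out = filters  # the first convolution outputs `filters` channels
second_in = second_conv_in_channels(filters, filters_scale, max_filters)

print(f"first conv outputs {first_out} channels, second conv expects {second_in}")

# With filters_scale == 1 the two values coincide, so the mismatch stays hidden.
hidden_case = second_conv_in_channels(filters, 1, max_filters)
```

With `filters_scale=1` both numbers are 64, which would explain why the problem only surfaces for larger scale factors.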
To Reproduce
No specific steps are needed to reproduce; this is a static error in the model definition logic. For instance, constructing DiscriminatorSTFT(filters=64, filters_scale=2) and running a forward pass will produce a dimension mismatch.
Expected behavior
The number of input channels for each convolution layer should match the number of output channels from the previous layer. There should be no dimension mismatch.
Actual Behavior
When filters_scale > 1, there is a mismatch between the output channels of one convolution and the expected input channels of the next, leading to a runtime error.
Solution
An easy fix would be changing line 70 to match the input channels for the second convolution to the output of the first one as follows:
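The snippet for the proposed change is not reproduced in this text. Going by the description alone, it would presumably initialize `in_chs` from `self.filters` instead of the scaled value. The sketch below illustrates that idea in plain Python under stated assumptions: `plan_channels`, `is_consistent`, the number of inner layers, the `in_channels` value, and the per-layer growth rule are all hypothetical and may not match the actual class.

```python
def plan_channels(in_channels, filters, filters_scale, max_filters, n_inner):
    """Return (in_chs, out_chs) pairs for each convolution.

    Assumed layout: one first convolution, then n_inner inner convolutions
    whose widths grow by filters_scale per layer, capped at max_filters.
    """
    plan = [(in_channels, filters)]
    in_chs = filters  # presumed fix: start from the first conv's output width
    for i in range(n_inner):
        out_chs = min(filters_scale ** (i + 1) * filters, max_filters)
        plan.append((in_chs, out_chs))
        in_chs = out_chs  # the next layer consumes what this one produces
    return plan


def is_consistent(plan):
    """True if every layer's input width equals the previous layer's output width."""
    return all(prev[1] == nxt[0] for prev, nxt in zip(plan, plan[1:]))


plan = plan_channels(in_channels=2, filters=64, filters_scale=2,
                     max_filters=1024, n_inner=3)
print(plan)
print(is_consistent(plan))
```

Under these assumptions the second convolution takes 64 input channels, matching the first layer's output, and every later layer chains from the previous `out_chs`.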