S3: rewrite metrics tests, stop collecting errors from missing keys #118

Merged
ahouene merged 4 commits into PowerDNS:main from ahouene:s3-metrics-rewrite-tests
Jan 7, 2026

@ahouene ahouene commented Jan 6, 2026

The TestMetrics tests are failing on my machine and on Luit's, as the Minio Go SDK seems to issue different errors on macOS (a difference in resolving?).

That prompted me to rewrite the tests to make sure the use of a healthy backend sees no metrics being collected, and test that metrics are collected with an error we have more control over.

Moreover, working on this made me realise that we collect ErrNotExist just like any other error. Since it does not reflect an unhealthy backend but merely a missing key, I also propose we no longer collect it.

Lastly, we no longer return ErrNotExist for a missing bucket, as it is a completely different error.

@ahouene ahouene requested review from Luit and neilcook January 6, 2026 14:30
@ahouene ahouene self-assigned this Jan 6, 2026
}
errRes := minio.ToErrorResponse(err)
- if !isList && errRes.StatusCode == 404 {
+ if !isList && errRes.Code == "NoSuchKey" {
Contributor Author

This change is not very obvious about what it does: a missing bucket also has errRes.StatusCode == 404. This makes the function only return ErrNotExist when we have a missing key. A missing bucket will return a non-converted error.

Contributor

Maybe add a comment in the code to this effect?

Contributor Author

Done :)

Contributor

I like being able to differentiate between missing key and missing bucket, and looking at your tests it seems that there is a "NoSuchBucket" code, so the label will be correct.

SecretKey: "bar",
Bucket: "test-bucket",
CreateBucket: false,
DialTimeout: 1 * time.Second,
Contributor

I found why this test results in different behavior for me:

$ time dig +short nosuchhost

real	0m3,920s
user	0m0,008s
sys	0m0,003s

Contributor Author

Interesting, on my side it takes less than a second, and changing the timeout to 20s just in case doesn't change the diff in the test failure! Does raising the timeout fix it for you?

Contributor

Ugh slow DNS

Contributor

If I increase the DialTimeout, something else is getting REAL slow, because the default 30s timeout in my editor aborts the test.

Contributor

It does fix some of it, though not the storage_s3_call_error_by_type_total{error="NotFound",method="load"} 3 discrepancy.

Contributor

Leaving DialTimeout at 1 second but changing nosuchhost to nosuchhost.invalid yields the same results but takes one fifth as long.

Contributor

I think this would have been solved with the metricCallErrorsType.Reset() change that Rod added.

Contributor Author

@Luit The tests should probably pass if you omit -run ^TestMetrics$ in your command then?

Neither adding .invalid nor extending the DialTimeout (or both) changes the outcome for me. Also, the contributor in #117 had a similar problem.

Contributor

Nope, same output for a full go test -count=1 ./....

Contributor

Oh, sorry, if I fix my env for devcontainers it succeeds.

Contributor

@neilcook neilcook left a comment

A couple of comments, but overall LGTM

if err = convertMinioError(err, false); err != nil {
metricCallErrors.WithLabelValues("load").Inc()
metricCallErrorsType.WithLabelValues("load", errorToMetricsLabel(err)).Inc()
if !errors.Is(err, os.ErrNotExist) {
Contributor

Can you give some more info on why you think "missing bucket" should be reported for this metric, but not "missing key"? I guess because a missing key is likely to happen for all kinds of good reasons (like cleanup of files by another process) but a missing bucket is always bad?

Contributor Author

Yes, a missing bucket means either the ability to create a bucket was not given to the client in the options and the bucket does not exist, or the bucket was removed from another client we have no control over.

I'm documenting it better in aa828a5

@ahouene ahouene changed the title from "S3: rewrite metrics tests, stop collecting ErrNotExist" to "S3: rewrite metrics tests, stop collecting errors from missing keys" on Jan 6, 2026
@ahouene ahouene merged commit 49a2a8d into PowerDNS:main Jan 7, 2026
2 checks passed