
Conversation

@mateczagany (Contributor) commented Jan 13, 2026

What is the purpose of the change

Update flink-shaded to 21.0, with the biggest change being the new Netty version.

Brief change log

  • Update flink-shaded to 21.0
  • Update rest_api_v1.snapshot

Verifying this change

  • Compiled locally and started example jobs
  • Make sure that CI passes

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): yes
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): yes
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot (Collaborator) commented Jan 13, 2026

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

pom.xml (outdated diff)
    <groupId>io.netty</groupId>
    <artifactId>netty-bom</artifactId>
-   <version>4.1.100.Final</version>
+   <version>4.2.6.Final</version>
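For context, a minimal sketch of how this BOM is typically consumed (standard Maven BOM import; the surrounding layout of Flink's actual pom.xml may differ):

```xml
<!-- Sketch, not Flink's exact pom.xml: importing netty-bom in
     dependencyManagement pins every io.netty module to one version,
     so the single <version> bump above upgrades all Netty artifacts. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>io.netty</groupId>
      <artifactId>netty-bom</artifactId>
      <version>4.2.6.Final</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
```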
Contributor
Can we update flink-shaded and bump Netty in separate PRs? That would probably simplify the review.

Contributor Author

Thank you, I think that's a great idea. I'll trim this PR down to contain only the flink-shaded upgrade, and create another PR for the Netty-specific changes.

Contributor Author

@snuyanzin thank you for your reviews. I will create follow-up PRs after this one is merged:

  • Upgrade Netty
  • Upgrade Pekko

I also experienced a few test failures after upgrading flink-shaded-netty, but I couldn't pinpoint exactly what causes them. What I do know is that they are all intermittent failures caused by various race conditions. My suspicion is that some Netty internals have changed, as all of the newly failing tests relied on subtle behavior and timing between components.

I don't think CI will pass as it is, but I'll create PRs to fix each flaky test.

@github-actions bot added the community-reviewed label (PR has been reviewed by the community) on Jan 14, 2026
@snuyanzin (Contributor)
"checkpointed_size" : {
"type" : "object",
"id" : "urn:jsonschema:org:apache:flink:runtime:rest:messages:util.stats:StatsSummaryDto",
"id" : "urn:jsonschema:org:apache:flink:runtime:rest:messages:util:stats:StatsSummaryDto",

Why is this related to the flink-shaded upgrade?

Contributor Author

This file is generated by flink-shaded-jackson-module-jsonSchema, and I had to re-generate it because RuntimeRestAPIStabilityTest was failing otherwise, with instructions to re-generate this file.

According to the test, the API was modified in a compatible way by this upgrade.
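For illustration, a minimal sketch of how such a schema "id" is produced, using the plain jackson-module-jsonSchema API rather than Flink's shaded variant or its actual snapshot generator (the DTO name here is hypothetical):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.module.jsonSchema.JsonSchema;
import com.fasterxml.jackson.module.jsonSchema.JsonSchemaGenerator;

public class SchemaIdSketch {
    // Hypothetical DTO standing in for StatsSummaryDto.
    static class SampleDto {
        public long min;
        public long max;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonSchemaGenerator generator = new JsonSchemaGenerator(mapper);
        JsonSchema schema = generator.generateSchema(SampleDto.class);
        // The "id" is a URN derived from the fully qualified class name.
        // The diff above suggests the module's escaping of package
        // separators changed with the upgrade (util.stats -> util:stats).
        System.out.println(
                mapper.writerWithDefaultPrettyPrinter().writeValueAsString(schema));
    }
}
```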

    messageHeaders.getResponseStatusCode(),
-   responseHeaders));
+   responseHeaders),
+   ctx.executor());
@mateczagany (Contributor Author) commented Jan 15, 2026
@snuyanzin the issue was this stuck thread:

Jan 14 21:15:14 	at org.apache.flink.core.testutils.BlockerSync.awaitBlocker(BlockerSync.java:59)
Jan 14 21:15:14 	- locked <0x00000000aa141c48> (a java.lang.Object)
Jan 14 21:15:14 	at org.apache.flink.runtime.rest.RestServerEndpointITCase.testShouldWaitForHandlersWhenClosing(RestServerEndpointITCase.java:597)

Something has changed internally in Netty, and this test now consistently times out on Azure CI. The issue arises when the response is not written from the event loop: Netty instead enqueues the write operations on the event loop, and when we then close the REST server itself (which is the point of the test), the writes don't always go through properly.

By running the write operations on the event loop directly, the issue goes away. I think it's better to do it this way anyway, so that Netty does not have to create three separate write tasks that eventually execute on the event loop thread. A minimal sketch of the idea is below.
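A sketch of the approach the diff above takes, passing ctx.executor() so the write callback runs on the channel's event loop (only ChannelHandlerContext and writeAndFlush are Netty API here; the surrounding names are illustrative, not Flink's actual code):

```java
import java.util.concurrent.CompletableFuture;

import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.http.FullHttpResponse;

final class EventLoopWriteSketch {
    // Completes the response future on the channel's event loop, so the
    // subsequent writeAndFlush() runs there directly instead of being
    // wrapped in extra tasks that Netty must enqueue one by one.
    static void writeWhenReady(
            ChannelHandlerContext ctx, CompletableFuture<FullHttpResponse> response) {
        // ctx.executor() is the channel's EventExecutor, which implements
        // java.util.concurrent.Executor and can be passed to *Async stages.
        response.thenAcceptAsync(ctx::writeAndFlush, ctx.executor());
    }
}
```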

Never mind, unfortunately this only solved the issue on my own machine; the CI still failed at the exact same steps. I'll continue digging...

@mateczagany (Contributor Author) commented Jan 15, 2026

Oh, never mind, this is the same issue as before; I was on the wrong branch locally. Unfortunately, keeping the Netty version as-is did not help get rid of this.

If you run this test suite locally, you will see these errors: Received fatal alert: certificate_unknown. This is because, starting from Netty 4.2, SSL hostname verification is enabled by default. Setting security.ssl.verify-hostname has NO effect, because the option is not used anywhere. So I suggest changing its default value to false and using it to set SSL hostname verification, as that would give us the same behavior as before. A sketch of what such a toggle would control follows.
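A minimal sketch, using the plain JDK SSLEngine API rather than Flink's actual SSL setup code, of the endpoint-identification switch that hostname verification boils down to:

```java
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

final class HostnameVerificationSketch {
    // Enables or disables the hostname check on a client-mode SSLEngine.
    static void setVerifyHostname(SSLEngine engine, boolean verify) {
        SSLParameters params = engine.getSSLParameters();
        // "HTTPS" turns on RFC 2818 hostname verification during the
        // handshake; null disables it (the pre-Netty-4.2 behavior).
        params.setEndpointIdentificationAlgorithm(verify ? "HTTPS" : null);
        engine.setSSLParameters(params);
    }
}
```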

The reason you don't see any errors in the CI is that this error causes one of the tests to hang indefinitely, so none of the error output is ever shown.

