
Initial support of querying sdk worker status #3

Open

xinyu-liu-glean wants to merge 2 commits into timmy-2.59 from sdk-worker-status

Conversation


@xinyu-liu-glean xinyu-liu-glean commented Jun 4, 2025

This patch adds the WorkerStatus gRPC server on the taskManager side, which allows the Python SDK worker to connect and report status. It also creates an HTTP server inside the taskManager so we can query an endpoint to get a stack dump as well as a heap dump.

This feature is gated by the --enable_worker_status option. We will add configs to enable this option.
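
For illustration, here is a minimal sketch of the kind of HTTP dump endpoint described above, using the JDK's built-in com.sun.net.httpserver. The class name, port parameter, and /stackdump path are assumptions for this sketch, not the names used in the patch:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: an HTTP server inside the taskManager that returns a
// stack dump of all live threads when its endpoint is queried.
public final class StatusHttpServer {
  public static HttpServer start(int port) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
    server.createContext("/stackdump", exchange -> {
      // Collect a stack trace for every live thread in this JVM.
      StringBuilder dump = new StringBuilder();
      Thread.getAllStackTraces().forEach((thread, frames) -> {
        dump.append(thread).append('\n');
        for (StackTraceElement frame : frames) {
          dump.append("    at ").append(frame).append('\n');
        }
      });
      byte[] body = dump.toString().getBytes(StandardCharsets.UTF_8);
      exchange.sendResponseHeaders(200, body.length);
      try (OutputStream out = exchange.getResponseBody()) {
        out.write(body);
      }
    });
    server.start(); // serves until server.stop(...) is called
    return server;
  }
}
```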


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

  • Build python source distribution and wheels
  • Python tests
  • Java tests
  • Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

Comment on lines +54 to +57
if (DefaultJobBundleFactory.getEnableWorkerStatus(jobInfo)) {
  SdkWorkerStatusServer.create();
}


Just for learning's sake: can you explain why we initialize the server here?

My understanding is this: FlinkExecutableStageContextFactory is called upon initialization of an ExecutableStageDoFnOperator on the taskManager, i.e., each time the taskManager loads an operator (which encapsulates a series of Beam transformations, i.e., an executable stage).

SdkWorkerStatusServer is a wrapper that invokes BeamWorkerStatusGrpcService, which is connected to gRPC clients on the SDK harnesses and will use those connections to fetch and aggregate their statuses. (Instrumenting the status gRPC service is something that needs to be done in future PRs as well.)

The changes in DefaultJobBundleFactory below are to initialize the resources that the FlinkExecutableStageContextFactory will rely on (since the job bundle is initialized before the executable stage operators).

Is that right?
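
As a side note on the fetch-and-aggregate step described above, a minimal sketch might look like the following. The getAllWorkerStatuses call and its signature are assumptions about BeamWorkerStatusGrpcService for this sketch, not confirmed from this PR:

```java
import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.apache.beam.runners.fnexecution.status.BeamWorkerStatusGrpcService;

// Hypothetical sketch: join every connected SDK harness's status report
// into a single dump, keyed by worker id.
public final class WorkerStatusAggregator {
  public static String aggregate(BeamWorkerStatusGrpcService service)
      throws InterruptedException {
    // One entry per connected SDK harness (assumed API shape).
    Map<String, String> statuses = service.getAllWorkerStatuses(10, TimeUnit.SECONDS);
    StringBuilder report = new StringBuilder();
    statuses.forEach((workerId, status) ->
        report.append("--- worker ").append(workerId).append(" ---\n")
            .append(status)
            .append('\n'));
    return report.toString();
  }
}
```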

@xinyu-liu-glean xinyu-liu-glean (Author) commented Jun 4, 2025


Good question. The server is started during the creation of the ExecutableStageContext.Factory for this jobId. Based on the javadoc above ("This map should only ever have a single element..."), there should be only a single factory, so the server should be created once. I also added some logic in the server to be safe. Since this server is only used when we run the Beam Flink portable runner, I added it to the FlinkExecutableStageContextFactory instead of the common portability parts such as DefaultJobBundleFactory or DefaultExecutableStageContext.

The changes in DefaultJobBundleFactory below are to create the WorkerStatus gRPC server so we can fetch the worker status from there. Once the status gRPC URL is set in the provision info, the Python worker side should automatically connect to the status server. I am going to test this out to verify we can get worker status from the server endpoint.
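
To illustrate the "created once" safety logic mentioned above, a minimal sketch could use an atomic guard like this. The class and method names mirror the snippet under review, but the body is an assumption for this sketch:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: even if create() is reached by more than one
// factory initialization, only the first call starts the server.
public final class SdkWorkerStatusServer {
  private static final AtomicBoolean started = new AtomicBoolean(false);

  public static void create() {
    // compareAndSet flips false -> true exactly once across all threads.
    if (!started.compareAndSet(false, true)) {
      return; // server already started by an earlier factory creation
    }
    startGrpcAndHttpServers();
  }

  private static void startGrpcAndHttpServers() {
    // ... bind the status gRPC service and start the HTTP dump endpoint ...
  }
}
```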

@steve-scio steve-scio commented Jun 4, 2025


Thanks! One follow-up question: if the boundary between DefaultJobBundleFactory vs. ExecutableStageContextFactory is that the former is for all jobs while the latter is for portability-framework jobs only, then theoretically today, for non-portability jobs, we'd be creating the gRPC server but never using it. Is that right? (Non-blocking, given we don't use this custom fork for Java jobs at the moment.)

@xinyu-liu-glean xinyu-liu-glean (Author) commented Jun 4, 2025


My experience is that DefaultJobBundleFactory is only used for the portability framework too. The difference is that DefaultJobBundleFactory is used across runners, while FlinkExecutableStageContextFactory is only used by Flink. For Java pipelines, Flink uses a different runner (FlinkRunner) instead of FlinkPipelineRunner (portability). All open source runners have both a Java runner and a portability runner. I am not sure about Dataflow though. Do you know whether Google converged on a single (portable) runner?

The cost of adding the gRPC server and HTTP server without much traffic should be pretty small. I am wondering whether we can enable this by default so we can do profiling anytime. Right now I use a pipeline option to gate it, but it seems pretty cumbersome to pass in the option and restart the pipeline.
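
For reference, gating a feature like this behind a Beam pipeline option typically looks like the sketch below. The interface name is hypothetical; the accessor matches the getEnableWorkerStatus call seen in the snippet above:

```java
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

// Hypothetical sketch: the option defaults to off, so the status servers
// are only started when the flag is passed explicitly at submission time.
public interface WorkerStatusOptions extends PipelineOptions {
  @Description("Start the SDK worker status gRPC/HTTP servers on the taskManager.")
  @Default.Boolean(false)
  boolean getEnableWorkerStatus();

  void setEnableWorkerStatus(boolean value);
}
```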

@steve-scio

Adding @timmy-xiao-glean; you might be interested in this stuff.
