Skip to content

[CI] Stabilize PostCommit Java ValidatesRunner Dataflow Streaming workflow#38753

Open
durgaprasadml wants to merge 5 commits into
apache:masterfrom
durgaprasadml:final-streaming-fix-38710
Open

[CI] Stabilize PostCommit Java ValidatesRunner Dataflow Streaming workflow#38753
durgaprasadml wants to merge 5 commits into
apache:masterfrom
durgaprasadml:final-streaming-fix-38710

Conversation

@durgaprasadml
Copy link
Copy Markdown
Contributor

Summary

This PR stabilizes the PostCommit Java ValidatesRunner Dataflow Streaming workflow, which is currently failing more than 50% of the time due to infrastructure contention, legacy streaming runner limitations, and overly aggressive streaming job cancellation behavior.

Fixes #38710


Root Causes Identified

1. Unbounded Parallelism / Resource Exhaustion

The validatesRunner tasks were configured with:

groovy id="n9b5m2" maxParallelForks Integer.MAX_VALUE

Combined with GitHub Actions max-workers: 12, this could launch up to 12 concurrent Dataflow streaming jobs simultaneously.

This frequently exhausted:

  • Compute Engine IP quotas
  • CPUs
  • concurrent Dataflow job quotas
  • self-hosted runner resources

leading to worker startup starvation and test timeouts.


2. Legacy Streaming Worker Non-Termination

The workflow previously used the legacy VM-based streaming execution path:

bash id="ewl1pu" :runners:google-cloud-dataflow-java:validatesRunnerStreaming

Bounded streaming pipelines under the legacy runner often failed to terminate automatically, remaining in RUNNING state until the 15-minute timeout cancelled them.


3. Aggressive Failure Cancellation

TestDataflowRunner immediately cancelled jobs upon encountering any JOB_MESSAGE_ERROR, even for transient worker/network issues that Dataflow could automatically recover from.

This caused false-negative failures in CI.


Changes Implemented

Throttle validatesRunner concurrency

Reduced validatesRunner concurrency to:

groovy id="pvb0u9" maxParallelForks = 4

with support for overriding via:

bash id="6a4s1z" -PmaxParallelForks=

This reduces quota pressure and runner overload.


Migrate workflow to Streaming Engine

Updated the workflow to run:

bash id="0kcg4q" :runners:google-cloud-dataflow-java:validatesRunnerStreamingEngine

Benefits:

  • faster startup
  • improved bounded-source termination
  • reduced infrastructure overhead
  • improved stability

Add metrics-driven early termination

Enhanced TestDataflowRunner to continuously poll:

  • PAssertSuccess
  • PAssertFailure

during streaming execution.

Behavior:

  • early cancel on assertion success
  • early cancel on assertion failure

This reduces successful test runtime from ~15 minutes to ~2–3 minutes.


Delay cancellation on transient worker errors

Added a recovery window before cancelling jobs due to transient JOB_MESSAGE_ERROR entries, allowing Dataflow retries and self-healing to stabilize the pipeline.


Add Gradle test retry support

Integrated the org.gradle.test-retry plugin for CI integration tests to reduce transient infrastructure-related failures.


Validation

Added/updated tests covering:

  • streaming early-success termination
  • streaming early-failure termination
  • metrics polling behavior

Verification command:

bash id="j7bw3w" ./gradlew :runners:google-cloud-dataflow-java:test \ --tests "org.apache.beam.runners.dataflow.TestDataflowRunnerTest"

Streaming validation command:

bash id="o3llzt" ./gradlew :runners:google-cloud-dataflow-java:validatesRunnerStreamingEngine \ -PtestFilter="org.apache.beam.sdk.transforms.GroupByKeyTest"


Expected Impact

These changes are expected to:

  • significantly reduce CI flakiness
  • reduce GCP quota pressure
  • improve workflow runtime stability
  • shorten streaming test execution time
  • improve overall validatesRunner reliability

…kflow

- throttle validatesRunner parallelism
- migrate workflow to Streaming Engine
- add metrics-driven streaming termination
- delay transient JOB_MESSAGE_ERROR cancellations
- add Gradle test retry support
- add streaming runner tests

Fixes apache#38710
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a series of stability improvements for the PostCommit Java ValidatesRunner Dataflow Streaming workflow. By addressing resource contention through concurrency throttling, migrating to the more efficient Streaming Engine, and implementing smarter, metrics-driven job termination, the changes aim to drastically reduce CI flakiness and improve the reliability of the testing infrastructure.

Highlights

  • Concurrency Throttling: Reduced the default parallelism for validatesRunner tasks to 4 to prevent resource exhaustion and quota issues, with an option to override via -PmaxParallelForks.
  • Streaming Engine Migration: Updated the workflow to utilize the Dataflow Streaming Engine, improving startup times, termination reliability, and overall stability.
  • Metrics-Driven Termination: Implemented early termination for streaming jobs by polling PAssert metrics, significantly reducing test execution time.
  • Resilience Improvements: Added a recovery window for transient worker errors and integrated the Gradle test-retry plugin to mitigate CI flakiness.
New Features

🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/beam_PostCommit_Java_ValidatesRunner_Dataflow_Streaming.yml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the org.gradle.test-retry plugin to automatically retry failed tests on CI, throttles default test parallelism in the Dataflow runner to prevent quota exhaustion, and implements early success/failure cancellation for streaming jobs in TestDataflowRunner by polling metrics asynchronously. The review feedback highlights critical robustness improvements for the background monitoring thread to handle transient API exceptions without silently terminating, and suggests updating the new unit tests to explicitly verify asynchronous job cancellation using Mockito's timeout verification.

Comment thread buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy Outdated
@github-actions github-actions Bot added the java label May 30, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The PostCommit Java ValidatesRunner Dataflow Streaming job is flaky

2 participants