Skip to content

[Fix-17817] [Master] Fix workflow timeout alerts failed#17819

Open
Zzih96 wants to merge 13 commits intoapache:devfrom
Zzih96:add_workflow_timeout
Open

[Fix-17817] [Master] Fix workflow timeout alerts failed#17819
Zzih96 wants to merge 13 commits intoapache:devfrom
Zzih96:add_workflow_timeout

Conversation

@Zzih96
Copy link
Contributor

@Zzih96 Zzih96 commented Dec 24, 2025

Purpose of the pull request

fix #17817

Brief change log

Verify this pull request

This pull request is code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(or)

Pull Request Notice

Pull Request Notice

If your pull request contains incompatible change, you should also add it to docs/docs/en/guide/upgrade/incompatible.md

@SbloodyS SbloodyS changed the title [Fix-17817] [Master]add workflow timeout [Fix-17817] [Master] Fix workflow timeout alerts failed Dec 24, 2025
@SbloodyS SbloodyS added the bug Something isn't working label Dec 24, 2025
@SbloodyS SbloodyS added this to the 3.4.0 milestone Dec 24, 2025
Copy link
Member

@ruanwenjun ruanwenjun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, it's better to add IT case.

* @param modifyBy modifyBy
*/
public void sendWorkflowTimeoutAlert(WorkflowInstance workflowInstance, ProjectUser projectUser) {
public void sendWorkflowTimeoutAlert(WorkflowInstance workflowInstance, ProjectUser projectUser, String modifyBy) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't change this, this PR should only fix the alert bug.

Comment on lines +52 to +55
// Calculate remaining time until timeout: timeout - elapsed time
long delayTime = TimeUnit.MINUTES.toMillis(timeout)
- (System.currentTimeMillis() - workflowInstance.getStartTime().getTime());
// Ensure delayTime is not negative (trigger immediately if already timeout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not clear in which case the delayTime might be negative, since System.currentTimeMillis() - workflowInstance.getStartTime().getTime() should always > 0.

Comment on lines +63 to +65
private void doWorkflowTimeoutAlert(final WorkflowInstance workflowInstance) {
// ProjectUser will be built in WorkflowAlertManager
workflowAlertManager.sendWorkflowTimeoutAlert(workflowInstance, null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should fix like #17818, otherwise will throw NPE

@SbloodyS SbloodyS modified the milestones: 3.4.0, 3.4.1 Jan 5, 2026
@ruanwenjun ruanwenjun force-pushed the add_workflow_timeout branch from 8e10927 to 3de2674 Compare January 8, 2026 13:00
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our environment, we've already resolved the task timeout alerts and workflow timeout alerts. Here's how I determine whether a workflow has completed:

final IWorkflowExecutionGraph workflowExecutionGraph = workflowExecutionRunnable.getWorkflowExecutionGraph();
if (workflowExecutionGraph.isAllTaskExecutionRunnableChainFinish()) {
// all the TaskExecutionRunnable chain in the graph is finish, means the workflow is already finished.
return;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isFinalState() reflects the persistent final state, making it more reliable. isAllTaskExecutionRunnableChainFinish(), on the other hand, only reflects the task completion state in memory; it may not yet have transitioned to the final workflow state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isFinalState() reflects the persistent final state, making it more reliable. isAllTaskExecutionRunnableChainFinish(), on the other hand, only reflects the task completion state in memory; it may not yet have transitioned to the final workflow state.

In my opinion, If the workflow is already in a completed state in memory, does that mean timeout handling is no longer meaningful?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being equal to zero doesn't seem reasonable either.

final int timeout = workflowInstance.getTimeout();
checkState(timeout > 0, "The workflow timeout: %s must >0 minutes", timeout);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it shouldn't be 0, but I need to double-check.

@Zzih96
Copy link
Contributor Author

Zzih96 commented Jan 9, 2026

@ruanwenjun I've already modified it as requested. Could you please check it again?

Copy link
Member

@ruanwenjun ruanwenjun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please removed the UT, this kind of ut help little, we should add IT case.

import org.mockito.junit.jupiter.MockitoExtension;

@ExtendWith(MockitoExtension.class)
class WorkflowTimeoutLifecycleEventHandlerTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this UT.

import org.mockito.junit.jupiter.MockitoExtension;

@ExtendWith(MockitoExtension.class)
class WorkflowStartLifecycleEventHandlerTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this UT.

import org.mockito.junit.jupiter.MockitoExtension;

@ExtendWith(MockitoExtension.class)
class WorkflowTimeoutLifecycleEventTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this UT.

@sonarqubecloud
Copy link

Please retry analysis of this Pull-Request directly on SonarQube Cloud

…ler/server/master/engine/workflow/lifecycle/handler/WorkflowTimeoutLifecycleEventHandler.java

Co-authored-by: Wenjun Ruan <wenjun@apache.org>
Zzih96 and others added 2 commits January 26, 2026 16:33
…ler/server/master/engine/workflow/lifecycle/handler/WorkflowTimeoutLifecycleEventHandler.java

Co-authored-by: Wenjun Ruan <wenjun@apache.org>
@SbloodyS SbloodyS modified the milestones: 3.4.1, 3.4.2 Feb 28, 2026
@honkerjha
Copy link

In DolphinScheduler 3.4.1, when both "Timeout Alert" and "Timeout Failure" are selected in the timeout policy settings, only the alert is sent, but the task does not fail as expected. Will this be fixed later?

@njnu-seafish
Copy link
Contributor

In DolphinScheduler 3.4.1, when both "Timeout Alert" and "Timeout Failure" are selected in the timeout policy settings, only the alert is sent, but the task does not fail as expected. Will this be fixed later?

Task timeout fail bug is fixed by this PR: #17818

I test it as follow:
企业微信截图_17733185198974

2026-03-12 20:24:10.819 INFO - Final Script Content:
====================
#!/bin/bash
BASEDIR=$(cd dirname $0; pwd)
cd $BASEDIR
source /usr/local/apache-dolphinscheduler-dev-SNAPSHOT-bin/bin/env/dolphinscheduler_env.sh
sleep 2m;
====================
2026-03-12 20:24:10.820 INFO - Executing shell command : sudo -u root -i /tmp/dolphinscheduler/exec/process/42/42.sh
2026-03-12 20:24:10.830 INFO - process start, process id is: 59604
2026-03-12 20:25:10.866 INFO - Begin killing task instance, processId: 59604
2026-03-12 20:25:12.268 INFO - prepare to parse pid, raw pid string: sudo(59604)---42.sh(59609)---sleep(59802)

2026-03-12 20:25:15.785 INFO - Sending 2 to process group: 59604 59609 59802, command: sudo -u root -i kill -2 59604 59609 59802
2026-03-12 20:25:30.036 INFO - Kill command: sudo -u root -i kill -2 59604 59609 59802, timed out, still running PIDs: 59604 59609 59802
2026-03-12 20:25:33.314 INFO - Sending 15 to process group: 59604 59609 59802, command: sudo -u root -i kill -15 59604 59609 59802
2026-03-12 20:25:34.401 ERROR - process has failure due to timeout kill, timeout value is:60, timeoutStrategy is:WARNFAILED
2026-03-12 20:25:34.404 INFO - process has killed. execute path:/tmp/dolphinscheduler/exec/process/42, processId:59604 ,exitStatusCode:-1 ,processWaitForStatus:false ,processExitValue:143
2026-03-12 20:25:37.716 INFO - Successfully killed process tree by SIGTERM, processId: 59604
2026-03-12 20:25:37.717 INFO - Process tree for task: 42 is killed or already finished, pid: 59604

@honkerjha
Copy link

honkerjha commented Mar 13, 2026

In DolphinScheduler 3.4.1, when both "Timeout Alert" and "Timeout Failure" are selected in the timeout policy settings, only the alert is sent, but the task does not fail as expected. Will this be fixed later?

Task timeout fail bug is fixed by this PR: #17818

Has this issue been fixed in 3.4.1? I set up timeout failure for stored procedure tasks in DolphinScheduler 3.4.1, but it's not working as expected. @njnu-seafish

image image image image image

Looking at the execution logs of the stored procedure, the following content is present:

2026-03-13 10:00:13.781 INFO  - setSqlParamsMap: Property with paramName: value1 put in sqlParamsMap of content call aSP_test20260204(null,null,${value1}); successfully.
2026-03-13 10:01:13.783 INFO  - Try to cancel this procedure task
2026-03-13 10:01:13.791 INFO  - This procedure task was canceled
2026-03-13 10:02:44.029 INFO  - out parameter varchar key : value1 , value : 0

So does the timeout failure strategy only cancel the execution of the stored procedure, rather than changing the status of this node to "failed"? I am wondering whether this is a bug or a software feature.

@njnu-seafish
Copy link
Contributor

In DolphinScheduler 3.4.1, when both "Timeout Alert" and "Timeout Failure" are selected in the timeout policy settings, only the alert is sent, but the task does not fail as expected. Will this be fixed later?

Task timeout fail bug is fixed by this PR: #17818

Has this issue been fixed in 3.4.1? I set up timeout failure for stored procedure tasks in DolphinScheduler 3.4.1, but it's not working as expected. @njnu-seafish

image image image image image
Looking at the execution logs of the stored procedure, the following content is present:

2026-03-13 10:00:13.781 INFO  - setSqlParamsMap: Property with paramName: value1 put in sqlParamsMap of content call aSP_test20260204(null,null,${value1}); successfully.
2026-03-13 10:01:13.783 INFO  - Try to cancel this procedure task
2026-03-13 10:01:13.791 INFO  - This procedure task was canceled
2026-03-13 10:02:44.029 INFO  - out parameter varchar key : value1 , value : 0

So does the timeout failure strategy only cancel the execution of the stored procedure, rather than changing the status of this node to "failed"? I am wondering whether this is a bug or a software feature.

Other Task Type timeout is ok? The logs indicate that the timeout cancellation logic was triggered; however, it appears the stored procedure failed to cancel successfully.

@njnu-seafish
Copy link
Contributor

@Zzih96 @ruanwenjun This bug involves a critical feature and has been stagnant for several months. Since I actually resolved the task timeout and workflow timeout issue quite some time ago, I would like to finalize the process and get it closed.
企业微信截图_1773370742381

@honkerjha
Copy link

In DolphinScheduler 3.4.1, when both "Timeout Alert" and "Timeout Failure" are selected in the timeout policy settings, only the alert is sent, but the task does not fail as expected. Will this be fixed later?

Task timeout fail bug is fixed by this PR: #17818

Has this issue been fixed in 3.4.1? I set up timeout failure for stored procedure tasks in DolphinScheduler 3.4.1, but it's not working as expected. @njnu-seafish
image image image image image
Looking at the execution logs of the stored procedure, the following content is present:

2026-03-13 10:00:13.781 INFO  - setSqlParamsMap: Property with paramName: value1 put in sqlParamsMap of content call aSP_test20260204(null,null,${value1}); successfully.
2026-03-13 10:01:13.783 INFO  - Try to cancel this procedure task
2026-03-13 10:01:13.791 INFO  - This procedure task was canceled
2026-03-13 10:02:44.029 INFO  - out parameter varchar key : value1 , value : 0

So does the timeout failure strategy only cancel the execution of the stored procedure, rather than changing the status of this node to "failed"? I am wondering whether this is a bug or a software feature.

Other Task Type timeout is ok? The logs indicate that the timeout cancellation logic was triggered; however, it appears the stored procedure failed to cancel successfully.

Other tasks work fine, but stored procedures with SLEEP cannot be stopped at all—this is likely a database-specific behavior.

@njnu-seafish
Copy link
Contributor

In DolphinScheduler 3.4.1, when both "Timeout Alert" and "Timeout Failure" are selected in the timeout policy settings, only the alert is sent, but the task does not fail as expected. Will this be fixed later?

Task timeout fail bug is fixed by this PR: #17818

Has this issue been fixed in 3.4.1? I set up timeout failure for stored procedure tasks in DolphinScheduler 3.4.1, but it's not working as expected. @njnu-seafish
image image image image image
Looking at the execution logs of the stored procedure, the following content is present:

2026-03-13 10:00:13.781 INFO  - setSqlParamsMap: Property with paramName: value1 put in sqlParamsMap of content call aSP_test20260204(null,null,${value1}); successfully.
2026-03-13 10:01:13.783 INFO  - Try to cancel this procedure task
2026-03-13 10:01:13.791 INFO  - This procedure task was canceled
2026-03-13 10:02:44.029 INFO  - out parameter varchar key : value1 , value : 0

So does the timeout failure strategy only cancel the execution of the stored procedure, rather than changing the status of this node to "failed"? I am wondering whether this is a bug or a software feature.

Other Task Type timeout is ok? The logs indicate that the timeout cancellation logic was triggered; however, it appears the stored procedure failed to cancel successfully.

Other tasks work fine, but stored procedures with SLEEP cannot be stopped at all—this is likely a database-specific behavior.

Both SqlTask and ProcedureTask may fail to respond to interruption signals under certain circumstances. Specifically, the database can enter an 'uninterruptible' state where even simple SQL statements become atomic operations during specific execution phases, preventing them from responding to cancellation requests immediately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend bug Something isn't working test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] [Master] Missing workflow timeout alerts

5 participants