[Fix-17817] [Master] Fix workflow timeout alerts failed #17819
Zzih96 wants to merge 13 commits into apache:dev
...inscheduler/server/master/engine/workflow/lifecycle/event/WorkflowTimeoutLifecycleEvent.java
ruanwenjun left a comment
LGTM, but it would be better to add an IT case.
| * @param modifyBy modifyBy | ||
| */ | ||
| public void sendWorkflowTimeoutAlert(WorkflowInstance workflowInstance, ProjectUser projectUser) { | ||
| public void sendWorkflowTimeoutAlert(WorkflowInstance workflowInstance, ProjectUser projectUser, String modifyBy) { |
Please don't change this; this PR should only fix the alert bug.
| // Calculate remaining time until timeout: timeout - elapsed time | ||
| long delayTime = TimeUnit.MINUTES.toMillis(timeout) | ||
| - (System.currentTimeMillis() - workflowInstance.getStartTime().getTime()); | ||
| // Ensure delayTime is not negative (trigger immediately if already timeout) |
It is not clear to me in which case delayTime might be negative, since System.currentTimeMillis() - workflowInstance.getStartTime().getTime() should always be > 0.
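For illustration, here is a minimal sketch of the remaining-delay arithmetic from the diff above. The names `TimeoutDelaySketch` and `remainingDelayMillis` are hypothetical and not taken from the PR, and the current time is passed in as a parameter so the calculation can be checked deterministically. The elapsed time is non-negative, but `timeout - elapsed` becomes negative once more time has elapsed than the timeout allows, for example if the calculation runs after the deadline has already passed. That is the case the clamp covers.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of DolphinScheduler: it illustrates the arithmetic only.
final class TimeoutDelaySketch {

    /**
     * Returns the milliseconds left until the timeout fires, clamped at zero.
     *
     * @param timeoutMinutes  configured timeout in minutes
     * @param startTimeMillis workflow start time (epoch millis)
     * @param nowMillis       current time (epoch millis), injected for testability
     */
    static long remainingDelayMillis(long timeoutMinutes, long startTimeMillis, long nowMillis) {
        long elapsed = nowMillis - startTimeMillis;
        long remaining = TimeUnit.MINUTES.toMillis(timeoutMinutes) - elapsed;
        // A negative value means the deadline has already passed: fire immediately.
        return Math.max(0L, remaining);
    }

    public static void main(String[] args) {
        long start = 1_000_000L;
        // 4 minutes elapsed of a 10-minute timeout: 6 minutes (360000 ms) remain.
        System.out.println(remainingDelayMillis(10, start, start + TimeUnit.MINUTES.toMillis(4)));
        // 15 minutes elapsed of a 10-minute timeout: overdue, so the result clamps to 0.
        System.out.println(remainingDelayMillis(10, start, start + TimeUnit.MINUTES.toMillis(15)));
    }
}
```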
| private void doWorkflowTimeoutAlert(final WorkflowInstance workflowInstance) { | ||
| // ProjectUser will be built in WorkflowAlertManager | ||
| workflowAlertManager.sendWorkflowTimeoutAlert(workflowInstance, null); |
8e10927 to 3de2674
In our environment, we've already resolved the task timeout alerts and workflow timeout alerts. Here's how I determine whether a workflow has completed:
final IWorkflowExecutionGraph workflowExecutionGraph = workflowExecutionRunnable.getWorkflowExecutionGraph();
if (workflowExecutionGraph.isAllTaskExecutionRunnableChainFinish()) {
// all the TaskExecutionRunnable chain in the graph is finish, means the workflow is already finished.
return;
}
isFinalState() reflects the persistent final state, making it more reliable. isAllTaskExecutionRunnableChainFinish(), on the other hand, only reflects the task completion state in memory; it may not yet have transitioned to the final workflow state.
> isFinalState() reflects the persistent final state, making it more reliable. isAllTaskExecutionRunnableChainFinish(), on the other hand, only reflects the task completion state in memory; it may not yet have transitioned to the final workflow state.
In my opinion, if the workflow is already in a completed state in memory, does that mean timeout handling is no longer meaningful?
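The two completion signals discussed in this thread can be checked together. The sketch below is a simplified illustration, not the PR's code: `WorkflowView`, `isPersistedFinalState`, and `isAllTaskChainsFinished` are hypothetical stand-ins for the `isFinalState()` and `isAllTaskExecutionRunnableChainFinish()` calls mentioned above. It assumes the alert should be skipped when either signal reports the workflow as finished.

```java
// Hypothetical stand-in types; the real DolphinScheduler interfaces differ.
final class TimeoutAlertGuardSketch {

    /** Minimal view of the two completion signals discussed in the review. */
    interface WorkflowView {
        boolean isPersistedFinalState();   // stand-in for isFinalState()
        boolean isAllTaskChainsFinished(); // stand-in for isAllTaskExecutionRunnableChainFinish()
    }

    /** Returns true only when neither signal reports the workflow as finished. */
    static boolean shouldSendTimeoutAlert(WorkflowView workflow) {
        if (workflow.isPersistedFinalState()) {
            return false; // final state already persisted
        }
        if (workflow.isAllTaskChainsFinished()) {
            return false; // finished in memory, final state not yet persisted
        }
        return true;
    }

    /** Builds a fixed-value view for demonstration purposes. */
    static WorkflowView view(boolean persistedFinal, boolean chainsFinished) {
        return new WorkflowView() {
            @Override public boolean isPersistedFinalState() { return persistedFinal; }
            @Override public boolean isAllTaskChainsFinished() { return chainsFinished; }
        };
    }

    public static void main(String[] args) {
        System.out.println(shouldSendTimeoutAlert(view(false, false))); // true: still running
        System.out.println(shouldSendTimeoutAlert(view(false, true)));  // false: finished in memory only
        System.out.println(shouldSendTimeoutAlert(view(true, true)));   // false: finished and persisted
    }
}
```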
Being equal to zero doesn't seem reasonable either.
final int timeout = workflowInstance.getTimeout();
checkState(timeout > 0, "The workflow timeout: %s must >0 minutes", timeout);
I think it shouldn't be 0, but I need to double-check.
@ruanwenjun I've already modified it as requested. Could you please check it again?
ruanwenjun left a comment
Please remove the UT; this kind of UT helps little. We should add an IT case instead.
...er/server/master/engine/workflow/lifecycle/handler/WorkflowTimeoutLifecycleEventHandler.java
| import org.mockito.junit.jupiter.MockitoExtension; | ||
| @ExtendWith(MockitoExtension.class) | ||
| class WorkflowTimeoutLifecycleEventHandlerTest { |
| import org.mockito.junit.jupiter.MockitoExtension; | ||
| @ExtendWith(MockitoExtension.class) | ||
| class WorkflowStartLifecycleEventHandlerTest { |
| import org.mockito.junit.jupiter.MockitoExtension; | ||
| @ExtendWith(MockitoExtension.class) | ||
| class WorkflowTimeoutLifecycleEventTest { |
…ler/server/master/engine/workflow/lifecycle/handler/WorkflowTimeoutLifecycleEventHandler.java Co-authored-by: Wenjun Ruan <wenjun@apache.org>
…ler/server/master/engine/workflow/lifecycle/handler/WorkflowTimeoutLifecycleEventHandler.java Co-authored-by: Wenjun Ruan <wenjun@apache.org>
In DolphinScheduler 3.4.1, when both "Timeout Alert" and "Timeout Failure" are selected in the timeout policy settings, only the alert is sent, but the task does not fail as expected. Will this be fixed later?
Task timeout fail bug is fixed by this PR: #17818
Has this issue been fixed in 3.4.1? I set up timeout failure for stored procedure tasks in DolphinScheduler 3.4.1, but it's not working as expected. @njnu-seafish
Looking at the execution logs of the stored procedure, the following content is present:
2026-03-12 20:25:15.785 INFO - Sending 2 to process group: 59604 59609 59802, command: sudo -u root -i kill -2 59604 59609 59802
So does the timeout failure strategy only cancel the execution of the stored procedure, rather than changing the status of this node to "failed"? I am wondering whether this is a bug or a software feature.
Does the timeout work for other task types? The logs indicate that the timeout cancellation logic was triggered; however, it appears the stored procedure was not cancelled successfully.
@Zzih96 @ruanwenjun This bug involves a critical feature and has been stagnant for several months. Since I actually resolved the task timeout and workflow timeout issue quite some time ago, I would like to finalize the process and get it closed.
Other tasks work fine, but stored procedures with SLEEP cannot be stopped at all; this is likely a database-specific behavior.
Both SqlTask and ProcedureTask may fail to respond to interruption signals under certain circumstances. Specifically, the database can enter an 'uninterruptible' state where even simple SQL statements become atomic operations during specific execution phases, preventing them from responding to cancellation requests immediately.
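As a hedged illustration of why cancellation can appear to do nothing: with JDBC, a timeout is commonly enforced by calling `Statement.cancel()` from another thread (or by setting `Statement.setQueryTimeout`), and whether the running statement actually stops depends on the driver and the database. The sketch below is not DolphinScheduler's implementation; `StatementCancelSketch` and `scheduleCancel` are hypothetical names used only for this example.

```java
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of DolphinScheduler.
final class StatementCancelSketch {

    /**
     * Schedules a best-effort cancel of the statement after the given delay.
     * The future yields true if the cancel request was issued without error.
     * That does NOT prove the database stopped the statement: the driver or
     * the database may ignore or defer the request.
     */
    static ScheduledFuture<Boolean> scheduleCancel(ScheduledExecutorService scheduler,
                                                   Statement statement,
                                                   long delay,
                                                   TimeUnit unit) {
        return scheduler.schedule(() -> {
            try {
                statement.cancel();
                return true;
            } catch (SQLException e) {
                // Includes drivers that do not support cancellation at all.
                return false;
            }
        }, delay, unit);
    }
}
```

The caller would cancel the returned future once the statement completes normally, so that a late cancel request is not sent after the work has already finished.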












Purpose of the pull request
fix #17817
Brief change log
Verify this pull request
This pull request is code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(or)
Pull Request Notice
If your pull request contains an incompatible change, you should also add it to docs/docs/en/guide/upgrade/incompatible.md