Skip to content

Conversation

@cxzl25
Copy link
Contributor

@cxzl25 cxzl25 commented Jan 22, 2026

Which issue does this PR close?

Closes #1937

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

How was this patch tested?

@cxzl25 cxzl25 changed the title Q14b q14b Jan 22, 2026
@cxzl25 cxzl25 marked this pull request as draft January 22, 2026 17:41
@cxzl25 cxzl25 requested a review from Copilot January 22, 2026 17:41
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a thread-safety issue in the deserializeExpression method by adding synchronization around object deserialization when a TaskContext is present. The fix re-enables the previously flaky q14b TPC-DS query test.

Changes:

  • Added synchronization on taskMetrics() during expression deserialization to fix thread-safety issues with Spark's TaskMetrics#externalAccums
  • Re-enabled q14b in the TPC-DS test suite after addressing the flaky failure

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
spark-extension/src/main/scala/org/apache/spark/sql/auron/NativeConverters.scala Added TaskContext import and synchronized block around readObject calls in deserializeExpression to fix thread-safety issue
.github/workflows/tpcds-reusable.yml Re-enabled q14b test and removed the comment about the flaky failure

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@cxzl25 cxzl25 closed this Jan 22, 2026
@cxzl25 cxzl25 reopened this Jan 22, 2026
@cxzl25 cxzl25 changed the title q14b [AURON #1937][CI] Enable TPCDS test q14b Jan 22, 2026
@cxzl25 cxzl25 closed this Jan 23, 2026
@cxzl25 cxzl25 reopened this Jan 23, 2026
@yew1eb
Copy link
Contributor

yew1eb commented Jan 23, 2026

Nice fix @cxzl25! Could you share the root cause and troubleshooting clues for this q14b issue?

@cxzl25
Copy link
Contributor Author

cxzl25 commented Jan 23, 2026

Could you share the root cause and troubleshooting clues for this q14b issue

Judging from the call stack, the accumUpdates of TaskResult contains null values.

ERROR TaskResultGetter: Exception while getting task result
java.lang.NullPointerException
	at org.apache.spark.scheduler.TaskResultGetter$$anon$3.$anonfun$run$3(TaskResultGetter.scala:109)

The only place where TaskMetrics#externalAccums might be updated is TaskMetrics#registerAccumulator, which in turn is invoked via taskContext.registerAccumulator(this) called by the deserialization of AccumulatorV2#readObject. From the perspective of the code, it is highly unlikely for a null value to be written here.

Since externalAccums is an ArrayBuffer and not thread-safe, I tried adding the synchronized keyword to TaskMetrics#registerAccumulator, and the NPE issue was resolved. I then added some logging to record which threads were accessing this method concurrently, and ultimately identified that Auron's deserialization of expressions may be the root cause of this problem.

@cxzl25 cxzl25 marked this pull request as ready for review January 26, 2026 12:34
@richox richox merged commit 5ea773c into apache:master Jan 27, 2026
346 of 347 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI tpcds q14b flaky failure on spark 3.0/3.1/3.2

3 participants