[spark] Refactor PaimonSparkWriter write flow by huangxiaopingRD · Pull Request #8417 · apache/paimon

huangxiaopingRD · 2026-07-01T12:06:42Z

Purpose

Refactor PaimonSparkWriter to make the Spark write flow easier to follow without changing write semantics.
This change mainly:

- splits bucket-mode handling in write into dedicated private methods
- extracts write context initialization from the main write flow
- extracts the shared per-partition write pattern for bucketed and non-bucketed writes
- keeps the existing write topology and commit merge behavior unchanged
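
For readers without the diff at hand, the shape described in the list above can be sketched roughly as follows. This is a minimal, Spark-free illustration and not the code in this PR: `Mode`, `BufferingWriter`, `writePartition`, `writeFixedBucket`, and `writeUnbucketed` are hypothetical names chosen for the example, and plain `Seq[Seq[String]]` stands in for partitions of a Dataset.

```scala
object WriteFlowSketch {

  // Hypothetical stand-ins for illustration; not Paimon's real bucket modes.
  sealed trait Mode
  case object FixedBucket extends Mode
  case object Unbucketed extends Mode

  // Hypothetical writer: buffers rows and returns "commit messages" on finish.
  final class BufferingWriter(tag: String) {
    private val buf = scala.collection.mutable.ArrayBuffer.empty[String]
    def write(row: String): Unit = { buf += s"$tag:$row" }
    def finish(): Seq[String] = buf.toList
    def close(): Unit = { buf.clear() }
  }

  // Shared per-partition pattern: create a writer, feed it rows, always close it.
  private def writePartition(
      rows: Iterator[String],
      newWriter: () => BufferingWriter): Seq[String] = {
    val writer = newWriter()
    try {
      rows.foreach(row => writer.write(row))
      writer.finish()
    } finally {
      writer.close()
    }
  }

  // One private method per mode; both reuse the shared per-partition helper.
  private def writeFixedBucket(parts: Seq[Seq[String]]): Seq[String] =
    parts.flatMap(p => writePartition(p.iterator, () => new BufferingWriter("bucketed")))

  private def writeUnbucketed(parts: Seq[Seq[String]]): Seq[String] =
    parts.flatMap(p => writePartition(p.iterator, () => new BufferingWriter("plain")))

  // The top-level flow only dispatches on the mode.
  def write(mode: Mode, parts: Seq[Seq[String]]): Seq[String] = mode match {
    case FixedBucket => writeFixedBucket(parts)
    case Unbucketed  => writeUnbucketed(parts)
  }
}
```

The point of the sketch is only the structure: the entry point reads as a dispatch table, and the try/finally writer lifecycle lives in one place instead of being repeated per mode.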

Tests

Not adding new tests.
Reason:

- this change is a readability refactor and does not intend to change behavior
- the affected paths are already covered indirectly by existing Spark write, dynamic bucket, update, delete, and merge test suites

JingsongLi · 2026-07-03T05:14:27Z

cc @Zouxxyy

JingsongLi · 2026-07-03T10:33:04Z

    batchId: Option[Long] = None)
  extends WriteHelper {

+  private case class WriteContext(


The task closures below capture this whole WriteContext via ctx.newWrite() / ctx.bucketColIdx. Because the context also stores driver-side objects (SparkSession, the original DataFrame, and preparedData), those objects become part of every executor closure even though they are only needed while building the plan on the driver. This can make Spark closure serialization fail or drag SparkContext/Dataset state into tasks. Please keep WriteContext limited to serializable task inputs, or extract only the needed scalars/functions before mapPartitions and keep SparkSession/DataFrame out of the captured object.
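
The capture problem described in this comment can be reproduced without Spark. The sketch below is a hedged illustration, not code from this PR: `DriverOnlyState` stands in for objects such as SparkSession or the DataFrame, `WideContext` is a hypothetical simplification of WriteContext, and `isSerializable` uses plain Java serialization as an approximation of what happens when Spark ships a task closure to executors.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureSketch {

  // Hypothetical stand-in for a driver-only object (e.g. a session or DataFrame).
  final class DriverOnlyState

  // Hypothetical context that mixes driver-side state with a task input.
  final case class WideContext(driver: DriverOnlyState, bucketColIdx: Int)

  // Tries to Java-serialize the object; returns false if anything reachable
  // from it is not serializable.
  def isSerializable(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }

  // Captures the whole context, so the driver-only field travels with the closure.
  def capturingWholeContext(ctx: WideContext): Seq[Int] => Int =
    row => row(ctx.bucketColIdx)

  // Copies the scalar into a local first, so only an Int is captured.
  def capturingScalarOnly(ctx: WideContext): Seq[Int] => Int = {
    val bucketColIdx = ctx.bucketColIdx
    row => row(bucketColIdx)
  }
}
```

With a context built as `WideContext(new DriverOnlyState, 1)`, the first closure fails the serialization check while the second passes, which is the same distinction as extracting the needed values before mapPartitions versus referencing ctx inside it.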

huangxiaopingRD added 4 commits July 1, 2026 18:56

[spark] Refactor PaimonSparkWriter write flow

995a6c3

[spark] Fix hash dynamic writer closure serialization

a2fdcff

[spark] Tidy PaimonSparkWriter formatting

75be08f

[spark] Fix clustering input chaining in PaimonSparkWriter

a1142af

huangxiaopingRD marked this pull request as draft July 1, 2026 15:12

[spark] Restore Dataset write path for clustering plan

ae6c59c

huangxiaopingRD marked this pull request as ready for review July 1, 2026 16:19

JingsongLi reviewed Jul 3, 2026

huangxiaopingRD wants to merge 5 commits into apache:master from huangxiaopingRD:refactor-paimon-spark-writer
