[SPARK-56125][SQL] Simplify schema calculation for Merge Into Schema Evolution#54934
[SPARK-56125][SQL] Simplify schema calculation for Merge Into Schema Evolution#54934szehon-ho wants to merge 2 commits intoapache:masterfrom
Conversation
| * @param root type schema | ||
| * @param path name segments | ||
| */ | ||
| def fieldExistsAtPath( |
There was a problem hiding this comment.
unfortunately this is still needed, but only for the top level unresolved reference case
| private def fieldExistsAtPathInternal( | ||
| dt: DataType, | ||
| parts: Seq[String]): Boolean = { | ||
| def checkAndRecurse( |
There was a problem hiding this comment.
This looks correct.
nit: checkAndRecurse seems unnecessary, can we inline the logic? Also, can we consider rewriting the recursion using pattern matching on parts so the base case is handled in one place?
| * @param valueType type of the assignment value at this path (typically source column) | ||
| * @param changes accumulator for [[TableChange]] instances | ||
| * @param fieldPath qualified path segments for nested columns (`element` / `key` / `value` | ||
| * under arrays and mapss) |
| val changes = mutable.LinkedHashSet.empty[TableChange] | ||
| val failIncompatible: () => Nothing = () => | ||
| throw QueryExecutionErrors.failedToMergeIncompatibleSchemasError( | ||
| originalTarget, originalSource, null) |
There was a problem hiding this comment.
nit: failIncompatible passes null as the cause, the error only shows full target/source schemas with no hint about which field path actually conflicts. Since fieldPath, keyType, and valueType are already available at the call site, should we include them in the exception? Would make debugging much easier for deeply nested schemas.
|
Actually I will close this pr, because #54488 makes another dependency on the method computeSchemaChanges() (it's called by MERGE INTO and INSERT), so can't refactor like this. May target a smaller refactor later |
|
closing in favor of: #55124 |
What changes were proposed in this pull request?
Replace 'sourceSchemaForSchemaEvolution' with simply 'pendingChanges'. Also reduced path based comparisons where possible (where have the resolved type from target/source)
Why are the changes needed?
This was suggested by @aokolnychyi after the initial pr was merged. The 'sourceSchemaForSchemaEvolution' is confusing, it is supposed to be a view of the source schema, pruned by the fields actually referred by the MERGE into statement. It is used by the subsequent logic (that compares it with the target table schema) but it is hard to explain.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Run existing tests
Was this patch authored or co-authored using generative AI tooling?
Yes cursor