[SPARK-57142][INFRA] Share SBT precompile artifact with tpcds-1g CI job #56200

Open
zhengruifeng wants to merge 1 commit into apache:master from zhengruifeng:precompile-tpcds-ci-share-dev5

@zhengruifeng
Contributor

What changes were proposed in this pull request?

This PR wires the tpcds-1g job in .github/workflows/build_and_test.yml to consume the shared precompile artifact, extending the pattern already applied to docker-integration-tests and k8s-integration-tests (SPARK-57069; parent SPARK-56830).

Concretely:

  • The precompile job's if: gate is extended to also fire when tpcds-1g == 'true' in the precondition output, so the artifact is available whenever the job runs.
  • tpcds-1g:
    • needs: precondition -> needs: [precondition, precompile]
    • if: extended with (!cancelled()) && so the job still runs when precompile fails or is skipped (it stops only if the run itself is cancelled).
    • Adds "Download precompiled artifact" + "Extract precompiled artifact" steps after Java install, with graceful fallback (continue-on-error: true).
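The wiring described above could look roughly like the fragment below. This is a sketch, not the PR's actual diff: the `fromJson(needs.precondition.outputs.required)` lookup, the `build` key standing in for the existing consumers, the runner label, and the placeholder steps are all assumptions made for illustration.

```yaml
# Illustrative fragment of .github/workflows/build_and_test.yml (assumed shape).
jobs:
  precompile:
    needs: precondition
    # Assumed gate: an existing consumer (here "build", illustrative only)
    # plus the new tpcds-1g condition added by this PR.
    if: >-
      fromJson(needs.precondition.outputs.required).build == 'true' ||
      fromJson(needs.precondition.outputs.required).tpcds-1g == 'true'
    continue-on-error: true
    runs-on: ubuntu-latest
    steps:
      - run: echo "compile and upload the artifact (placeholder)"

  tpcds-1g:
    # Was: needs: precondition
    needs: [precondition, precompile]
    # (!cancelled()) replaces the implicit success() check, so a failed or
    # skipped precompile does not skip this job.
    if: (!cancelled()) && fromJson(needs.precondition.outputs.required).tpcds-1g == 'true'
    runs-on: ubuntu-latest
    steps:
      - run: echo "install Java, then download and extract (placeholder)"
```

The parentheses around `!cancelled()` matter in YAML: a plain scalar that begins with `!` would be parsed as a tag rather than as an expression.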

The tpcds-1g job drives SBT directly via build/sbt "sql/testOnly ..." (and build/sbt "sql/Test/runMain org.apache.spark.sql.GenTPCDSData ..." on a TPC-DS data cache miss), so it does not go through dev/run-tests.py and needs no SKIP_SCALA_BUILD flag -- the same situation as k8s-integration-tests. The first SBT invocation otherwise compiles sql/core (main + test) from scratch. The precompile job already runs Test/package, which compiles the sql/core test classes this job depends on (TPCDSQueryTestSuite, TPCDSCollationQueryTestSuite, GenTPCDSData, TPCDSSchema). Extracting the precompiled target/ lets SBT skip that compile and run the test phase directly.

Optional: graceful fallback if precompile fails

Same pattern as the prior consumers:

  • precompile keeps continue-on-error: true.
  • The "Download precompiled artifact" step is gated on needs.precompile.result == 'success' and has continue-on-error: true.
  • "Extract precompiled artifact" is gated on the download succeeding and has continue-on-error: true.
  • If extraction fails or the artifact is missing, SBT compiles from scratch exactly as before.

In the worst case the job degrades to the pre-PR behavior rather than failing the workflow.
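A minimal sketch of the two gated steps, under stated assumptions: the step id `download-precompiled`, the artifact name `precompiled-artifact`, and the archive name `precompiled.tar.gz` are hypothetical, and `actions/download-artifact@v4` is assumed as the download action. The real workflow may differ in all of these.

```yaml
# Illustrative steps for the tpcds-1g job (names below are hypothetical).
steps:
  - name: Download precompiled artifact
    id: download-precompiled              # hypothetical step id
    # Only attempt the download when the precompile job actually succeeded.
    if: needs.precompile.result == 'success'
    continue-on-error: true
    uses: actions/download-artifact@v4
    with:
      name: precompiled-artifact          # hypothetical artifact name

  - name: Extract precompiled artifact
    # "outcome" is the step result before continue-on-error is applied, so a
    # failed or skipped download skips the extraction as well.
    if: steps.download-precompiled.outcome == 'success'
    continue-on-error: true
    run: |
      ls -lh precompiled.tar.gz           # hypothetical file name; logs the size
      tar -xzf precompiled.tar.gz
```

Gating on `outcome` rather than `conclusion` is the important detail here: with `continue-on-error: true`, a failed step still reports a `conclusion` of `success`, which would let the extract step run against a missing file.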

Note: the existing comment "# Any TPC-DS related updates on this job need to be applied to tpcds-1g-gen job of benchmark.yml as well" refers to TPC-DS data-generation parameters (scale factor, tpcds-kit ref, GenTPCDSData args). This PR changes none of those; it only adds build-artifact reuse. Because benchmark.yml is a standalone workflow with no shared precompile job, no corresponding change is needed there.

Why are the changes needed?

Today every run of build_and_test.yml that requires tpcds-1g re-runs the same sql/core SBT compile that the precompile job already produced for pyspark / sparkr / build / docker / k8s. Wiring tpcds-1g to the existing artifact removes that duplicate compile for free (precompile is already running).

Does this PR introduce any user-facing change?

No. CI infrastructure change only.

How was this patch tested?

The change is exercised by the CI run of this PR itself. The Download/Extract steps log the artifact size; if the precompile job is forced to fail (or its artifact is missing), the job falls back to the original local SBT build.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

Wire the tpcds-1g job to consume the shared precompile artifact, extending the
pattern already used by docker-integration-tests and k8s-integration-tests
(SPARK-57069).

The tpcds-1g job drives SBT directly via 'build/sbt "sql/testOnly ..."', so the
first SBT invocation otherwise compiles sql/core (main + test) from scratch.
The precompile job already runs 'Test/package', which compiles the sql/core
test classes (TPCDSQueryTestSuite, TPCDSCollationQueryTestSuite, GenTPCDSData,
TPCDSSchema). Extracting the precompiled target/ lets SBT skip that compile and
run the test phase directly, the same way the k8s job reuses the artifact (no
SKIP_SCALA_BUILD needed since the job does not go through dev/run-tests.py).

- precompile 'if:' gate fires on tpcds-1g == 'true'.
- tpcds-1g: 'needs: precondition' -> 'needs: [precondition, precompile]', plus
  '(!cancelled()) &&' so it still runs if precompile is cancelled.
- Download/Extract steps after Java install, with graceful fallback
  (continue-on-error). If the artifact is missing, SBT compiles from scratch as
  before.

Generated-by: Claude Code (Opus 4.7)
@zhengruifeng changed the title from "[INFRA] Share SBT precompile artifact with tpcds-1g CI job" to "[SPARK-57142][INFRA] Share SBT precompile artifact with tpcds-1g CI job" on May 29, 2026
@zhengruifeng marked this pull request as ready for review on May 29, 2026 11:01
@zhengruifeng
Contributor Author

CI performance: before vs after

Samples: BEFORE = 3 build_uds scheduled runs (2026-05-04, 2026-05-19, 2026-05-25), taken when the precompile job was already running but tpcds-1g was not yet consuming it. AFTER = 1 PR run (2026-05-29, this PR).

| Step | BEFORE avg (n=3) | AFTER | Savings |
| --- | --- | --- | --- |
| Generate TPC-DS data (SBT compile + data gen) | 9m28s | 2m44s | ~6m44s |
| Total job wall-time | 86m56s | 81m06s | ~5m50s (~7%) |
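
For reference, the savings column can be rechecked by converting both wall-times to seconds. This is only the arithmetic behind the figures already in the table, with no new measurements:

$$
\begin{aligned}
\text{step: } & 9\text{m}28\text{s} - 2\text{m}44\text{s} = 568\,\text{s} - 164\,\text{s} = 404\,\text{s} = 6\text{m}44\text{s} \\
\text{job: } & 86\text{m}56\text{s} - 81\text{m}06\text{s} = 5216\,\text{s} - 4866\,\text{s} = 350\,\text{s} = 5\text{m}50\text{s} \\
\text{relative: } & \frac{350}{5216} \approx 0.067 \approx 7\%
\end{aligned}
$$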

Where the savings come from

Before this PR, the tpcds-1g job's first SBT invocation was sql/Test/runMain GenTPCDSData, which also drove a full sql/core compile from scratch before the data generator could run. That is why the "Generate TPC-DS data" step took ~9.5 min even when no new data was needed -- roughly 7 of those minutes were SBT compile. With the precompile artifact, SBT sees the pre-built target/ and goes straight to data generation: the step drops to 2m44s, a ~6m44s reduction against a BEFORE average whose 3 samples are tightly clustered (9m24s--9m35s).

Test phase -- unaffected (as expected)

| Phase | BEFORE avg | AFTER | Delta |
| --- | --- | --- | --- |
| Sort merge join | 18m29s | 18m31s | +0m02s |
| Broadcast hash join | 7m04s | 7m05s | +0m01s |
| Shuffled hash join | 18m47s | 17m54s | −0m53s |
| Collated data | 31m26s | 31m04s | −0m22s |

All within CI noise -- the precompile artifact has no effect on SQL query execution speed.

Bottom line

~7% wall-clock reduction per tpcds-1g run (~5m50s), driven by eliminating the duplicate sql/core compile that the shared precompile job already produced. Like the docker/k8s jobs in SPARK-57069, the savings are real and stable but partially masked by the long (~75min) test-dominated job.
