feat(test-harness): add parallel execution, better error reporting, and lint fixes #2944
markphelps wants to merge 16 commits into main from mphelps/test-harness-parallel-and-error-reporting
Conversation
…nd lint fixes

Add a `--concurrency` flag (default 4) to run model builds/tests in parallel using errgroup with a configurable semaphore. All three commands (`run`, `build`, `schema-compare`) support parallel execution with:

- Pre-allocated result slices to preserve manifest ordering
- Per-model progress lines under a mutex
- Quiet mode that captures build output instead of streaming, to avoid interleaved output in parallel mode
- Per-model isolated work directories so models sharing a repo don't clobber each other's setup commands

Improve error reporting:

- The final error message now includes per-model failure details (error message or failing test names) instead of just listing model names
- Setup commands run with `set -euo pipefail` so missing tools (e.g. `yq`) are caught immediately instead of silently producing empty files
- Validate that `cog.yaml` is non-empty after setup, to catch silent failures

Fix all 30 golangci-lint issues across the test-harness:

- copyloopvar: remove loop variable copies made unnecessary by Go 1.22
- errcheck: handle intentionally ignored errors
- gocritic: use `0o` octal literals, rewrite if-else chains as switch
- gosec: annotate intentional HTTP calls to known APIs
- misspell: fix British/American spelling
- modernize: use `strings.SplitSeq`, `maps.Copy`, built-in `min`
I'm Bonk, and I've done a quick review of your PR.
The PR adds parallel execution support, improved error reporting, setup hardening with set -euo pipefail, and lint fixes to the test-harness tool.
I found one issue in the sequential execution path where test failures could be silently ignored.
I've posted my review. The key issue is in the sequential execution path.
…odel manifest

Add a `requires_tools` field to the manifest model schema so models can declare CLI tools needed for setup commands. The harness checks that the tools exist on PATH before running setup, and errors with a clear message listing the missing tools.

Improve setup error reporting: both quiet and non-quiet modes now capture stderr and include it in the error message, so users see the actual failure instead of a downstream error from an empty `cog.yaml`.

Add manifest-replicate.yaml with 15 Replicate production models.
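A PATH check like the one described here can be done with `exec.LookPath`; this is a minimal sketch (`checkRequiredTools` is a hypothetical name, not necessarily the harness's):

```go
package main

import (
	"fmt"
	"os/exec"
)

// checkRequiredTools returns an error listing every declared tool that
// is not found on PATH, so setup fails with a clear message instead of
// a confusing downstream error from an empty cog.yaml.
func checkRequiredTools(tools []string) error {
	var missing []string
	for _, t := range tools {
		if _, err := exec.LookPath(t); err != nil {
			missing = append(missing, t)
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("missing required tools on PATH: %v", missing)
	}
	return nil
}

func main() {
	fmt.Println(checkRequiredTools(nil)) // prints <nil>: nothing required, nothing missing
}
```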
@markphelps Bonk workflow was cancelled. To retry, trigger Bonk again.
I'm Bonk, and I've done a quick review of your PR.
The PR adds parallel execution support, improved error reporting, and lint fixes to the test-harness tool.
I found that the same issue from the previous review exists in two more files: the sequential execution paths in `build.go` and `schema_compare.go` also lack pass/fail feedback, inconsistent with the parallel mode, which shows "+ passed" or "x FAILED".
I reviewed the PR files and confirmed the same issue from the previous review exists in `build.go` and `schema_compare.go` as well.
Replace the quiet/buffered approach with a `prefixWriter` that prepends each output line with the model name, like docker-compose:

```
[flux-schnell ] Step 3/12: RUN pip install ...
[chatterbox   ] Building image cog-harness-chatterbox:test
[flux-schnell ] Successfully built abc123
```

Output streams in real-time even in parallel mode. Sequential mode (`--concurrency 1`) streams without prefixes, as before.
I'm Bonk, and I've done a quick review of your PR.
The PR adds parallel execution support, improved error reporting, setup hardening with set -euo pipefail, and lint fixes to the test-harness tool.
The same issue from previous reviews still exists in all three files: the sequential execution paths lack pass/fail feedback, inconsistent with the parallel mode which shows "+ passed" or "x FAILED". This contradicts the PR description claiming --concurrency 1 preserves real-time feedback behavior.
The same issue from previous reviews still exists in all three files.
Print `===` banners with timing info at each phase boundary so users can see where time is being spent:

```
=== Preparing flux-schnell...
=== Building flux-schnell (timeout 900s)...
=== Build complete (62.3s)
=== Predict test 1/1: generate image from prompt (timeout 900s)...
=== Predict test 1/1 PASSED (48.7s)
```

In parallel mode these are prefixed with the model name, like all other output.
I'm Bonk, and I've done a quick review of your PR.
The PR adds parallel execution support, improved error reporting, setup hardening with set -euo pipefail, and lint fixes to the test-harness tool.
The same issue from previous reviews still exists in all three files: the sequential execution paths lack pass/fail feedback, inconsistent with the parallel mode which shows "+ passed" or "x FAILED". This contradicts the PR description claiming --concurrency 1 preserves real-time feedback behavior.
I reviewed the three changed Go files to verify whether previous feedback had been addressed. Finding the same issue still existed (missing pass/fail feedback in sequential execution mode across all three files), I posted a batch GitHub review with code suggestions to fix the inconsistent behavior.
Expose the existing `SkipSchemaValidation` option as a CLI flag. This is useful when building models that fail schema validation due to import-time side effects (e.g. accessing GPU/weights files), or when using a remote Docker context where the legacy schema-validation container cannot access local resources.

Also add a `skip_schema_validation` field to the test harness manifest so models can opt out of schema validation per-model.
I'm Bonk, and I've done a quick review of your PR.
The PR adds parallel execution support, improved error reporting, setup hardening with set -euo pipefail, and lint fixes to the test-harness tool.
The same issue from previous reviews still exists in all three files: the sequential execution paths lack pass/fail feedback, inconsistent with the parallel mode which shows "+ passed" or "x FAILED". This contradicts the PR description claiming --concurrency 1 preserves real-time feedback behavior.
Review posted successfully with suggestions for all three files.
Signed-off-by: Mark Phelps <mphelps@cloudflare.com>
`cog predict` has a default `--setup-timeout` of 300s, which kills the container if model setup (weight downloads) takes longer. Pass the model's timeout value as the setup timeout so large models have enough time to download weights during the first run.
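Forwarding the manifest timeout might look like this sketch (`predictArgs` and `timeoutSecs` are hypothetical names, not the harness's actual code):

```go
package main

import (
	"fmt"
	"strconv"
)

// predictArgs builds the cog predict argument list, forwarding the
// model's manifest timeout as --setup-timeout so weight downloads on
// first run are not killed by cog's 300s default.
func predictArgs(timeoutSecs int) []string {
	return []string{"predict", "--setup-timeout", strconv.Itoa(timeoutSecs)}
}

func main() {
	fmt.Println(predictArgs(900)) // prints [predict --setup-timeout 900]
}
```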
I'm Bonk, and I've done a quick review of your PR.
The PR adds parallel execution support, improved error reporting, setup hardening with set -euo pipefail, and lint fixes to the test-harness tool.
The same issue from previous reviews still exists in all three files: the sequential execution paths lack pass/fail feedback, inconsistent with the parallel mode which shows "+ passed" or "x FAILED". This contradicts the PR description claiming --concurrency 1 preserves real-time feedback behavior.
Review posted successfully with suggestions for all three files.
The sequential code paths (`--concurrency 1`) were missing per-model status output after each build/run/compare, unlike the parallel paths. Now both modes print consistent pass/fail/skip feedback.
LGTM |
LGTM |
Signed-off-by: Mark Phelps <mphelps@cloudflare.com>
…ttps://github.com/replicate/cog into mphelps/test-harness-parallel-and-error-reporting

* 'mphelps/test-harness-parallel-and-error-reporting' of https://github.com/replicate/cog:
  Bump version to 0.18.0 (#2946)
LGTM |
…runner

- Replace `errgroup.WithContext` with `WaitGroup.Go` + a semaphore channel; the goroutines never returned errors, so the errgroup semantics were misleading and `g.Wait()` was silently discarded
- Extract a generic `runModels[T]` helper to deduplicate the parallel/sequential loop that was copy-pasted across build.go, run.go, and schema_compare.go
- Use `singleflight.Group` in `cloneRepo` instead of holding a mutex during the entire git clone operation, which serialized all parallel clones
- Capture `cloneRepo` stderr into a buffer instead of writing directly to `os.Stderr`, preventing interleaved output in parallel mode
- Add `--concurrency` flag validation (reject values < 1 to prevent a panic)
- Split `modelOutput` into `modelLoggers` + `modelOutput` so `RunModel`/`BuildModel` don't allocate capture buffers they never use
- Cap the `prefixWriter` line buffer at 64 KiB to prevent unbounded growth from long lines without newlines (e.g. progress bars)
- Truncate `runSetupCommands` error output to the last 2000 bytes, matching the pattern already used in `buildModelWithEnv`
Signed-off-by: Mark Phelps <mphelps@cloudflare.com>
LGTM |
Improve schema-compare output so differences are immediately readable:

- Replace the terse one-line-per-path diff format with unified-diff-style output showing actual JSON values, with `--- static` / `+++ runtime` headers and `@@ path @@` hunks (like git diff)
- Sort diff keys alphabetically for deterministic output
- Suppress docker build log streaming during schema-comparison builds (`quietBuildModelWithEnv`) so build output no longer intermingles with the diff report; build logs are still captured for error messages
- Add spacing around diff blocks in the console report for readability
- Add table-driven tests for the new `jsonDiff` format
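The sorted-keys, unified-diff-style rendering could be sketched as follows. `formatDiff` is a hypothetical helper over pre-computed per-path differences, not the PR's `jsonDiff` itself:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatDiff renders per-path (static, runtime) value pairs in a
// unified-diff style. Sorting the paths first makes the report
// deterministic across runs, which keeps test output stable.
func formatDiff(diffs map[string][2]string) string {
	paths := make([]string, 0, len(diffs))
	for p := range diffs {
		paths = append(paths, p)
	}
	sort.Strings(paths)

	var b strings.Builder
	b.WriteString("--- static\n+++ runtime\n")
	for _, p := range paths {
		fmt.Fprintf(&b, "@@ %s @@\n-%s\n+%s\n", p, diffs[p][0], diffs[p][1])
	}
	return b.String()
}

func main() {
	fmt.Print(formatDiff(map[string][2]string{
		"components.schemas.Input.properties.seed": {
			`{"type":"integer"}`,
			`{"type":"integer","default":42}`,
		},
	}))
}
```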
Looking at the current code, I can see the PR has been updated with a refactored runner. The previous issues about missing pass/fail feedback appear to be resolved.
LGTM
Summary
- `--concurrency` flag (default 4) to run model builds/tests in parallel across the `run`, `build`, and `schema-compare` commands
- `set -euo pipefail` and empty-`cog.yaml` detection to catch missing tools (e.g. `yq`) immediately

Parallel execution
Models are dispatched via `errgroup` with `SetLimit(concurrency)`. Each model gets its own isolated work directory (copied from the shared clone) so setup commands like `script/select.sh` don't clobber each other. In parallel mode, build output is captured rather than streamed to avoid interleaving. `--concurrency 1` preserves the original sequential behavior with real-time streaming.

Error reporting
Before:
After:
Setup hardening
Setup commands now run with `set -euo pipefail`, so a missing `yq` (or any other tool) fails fast instead of silently producing an empty `cog.yaml` that later fails with a confusing `python_version is required` error.
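One way to wrap setup commands in this prologue is a sketch like the following. `runSetup` is a hypothetical helper (and the missing-tool name is made up); the actual harness also captures and truncates output for error messages.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runSetup wraps a setup script in the hardened bash prologue: with
// pipefail, a missing tool anywhere in a pipeline surfaces as a non-zero
// exit instead of being masked by a successful downstream command.
func runSetup(script string) error {
	cmd := exec.Command("bash", "-c", "set -euo pipefail; "+script)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("setup failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	fmt.Println(runSetup("true") == nil) // prints true: harmless command succeeds
	// A pipeline whose left side is a missing tool now fails fast
	// instead of producing an empty output file:
	fmt.Println(runSetup("no-such-tool-xyz 2>/dev/null | cat") != nil)
}
```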
Lint fixes

Resolved all 30 golangci-lint issues: copyloopvar, errcheck, gocritic (octal literals, if-else chains), gosec (SSRF annotations), misspell, modernize (`SplitSeq`, `maps.Copy`, built-in `min`).