

feat: adding spammoor test to benchmark#3105

Open
chatton wants to merge 8 commits into main from cian/experiment-with-granularspans

Conversation

@chatton
Contributor

@chatton chatton commented Feb 24, 2026

Overview

Uploads the spamoor test results to https://ev.xyz/dev/bench/ at per-span granularity.

Summary by CodeRabbit

  • New Features

    • Block Executor and Spamoor trace benchmarks are now collected and published automatically.
    • Tests can optionally emit stable JSON benchmark outputs for external analysis.
  • Chores

    • CI workflow updated to run multi-job benchmark aggregation with artifact uploads and conditional publishing.
    • Improved gh-pages reset and unconditional result-storage steps to reduce race conditions.
  • Tests

    • Added per-operation timing aggregation and optional benchmark output in e2e smoke tests.

@coderabbitai
Contributor

coderabbitai bot commented Feb 24, 2026

📝 Walkthrough

Restructures benchmark CI into parallel jobs (evm-benchmark, spamoor-benchmark, block-executor) with artifact uploads and a consolidated publish-benchmarks job that aggregates and conditionally publishes results; adds per-operation span aggregation and JSON benchmark export from E2E tests (BENCH_JSON_OUTPUT).

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Benchmark Workflow Orchestration (.github/workflows/benchmark.yml) | Adds a pull_request trigger, splits benchmarking into multiple jobs (evm-benchmark, spamoor-benchmark, block-executor), uploads artifacts per job, adds a publish-benchmarks job that downloads artifacts, resets local gh-pages before publishing, and conditionally auto-pushes/saves benchmark data depending on event type. |
| E2E Smoke Test, JSON Export (test/e2e/evm_spamoor_smoke_test.go) | Adds an optional benchmark JSON output path controlled by BENCH_JSON_OUTPUT; when set, writes ev-node span aggregates to the specified JSON file after trace collection. |
| Span Aggregation & Reporting (test/e2e/evm_test_common.go) | Introduces spanStats and aggregateSpanStats for per-operation count/total/min/max aggregation, refactors printTraceReport to use the aggregator, and adds writeTraceBenchmarkJSON to emit a stable customSmallerIsBetter JSON benchmark file. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GH as GitHub Actions
    participant EVM as evm-benchmark job
    participant SPAM as spamoor-benchmark job
    participant BE as block-executor job
    participant ART as Artifact Storage
    participant PUB as publish-benchmarks job
    GH->>EVM: start evm-benchmark
    GH->>SPAM: start spamoor-benchmark
    GH->>BE: start block-executor
    EVM->>ART: upload evm-benchmark artifact
    SPAM->>ART: upload spamoor-bench artifact
    BE->>ART: upload block-executor artifact
    ART->>PUB: artifacts available
    PUB->>ART: download artifacts
    PUB->>PUB: reset local gh-pages, process and store benchmarks
    PUB->>GH: conditional push/save to gh-pages
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • alpe

Poem

🐰
I hopped through CI with a grin so wide,
Split the jobs and sent their logs to hide,
Spans I counted, JSON neat and small,
Artifacts gathered in a cozy stall,
Now benchmarks sing — rejoice, let's stride! 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is incomplete; it only states the goal without providing context, background, rationale, or explaining the changes in the workflow architecture and new jobs. | Expand the description to explain the workflow changes, the new jobs (spamoor-benchmark, publish-benchmarks), artifact handling, and the motivation for this restructuring. |
| Title check | ❓ Inconclusive | The title is vague and partially related; it mentions 'spammoor' and 'benchmark' but lacks clarity about the main change: multi-job benchmark aggregation, artifact handling, and results publishing. | Clarify the title to reflect the primary change, such as 'ci: restructure benchmark workflow with multi-job aggregation and publishing' or similar. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch cian/experiment-with-granularspans


@github-actions
Contributor

github-actions bot commented Feb 24, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Feb 24, 2026, 4:25 PM |

@claude
Contributor

claude bot commented Feb 24, 2026

Claude finished @chatton's task in 2m 1s.


Code Review

  • Gather context and read changed files
  • Review .github/workflows/benchmark.yml
  • Review test/e2e/evm_test_common.go
  • Review test/e2e/evm_spamoor_smoke_test.go
  • Post final review

Overall this is a clean, well-scoped PR. The fan-out/fan-in pattern for benchmarks is the right approach for avoiding gh-pages race conditions. A few issues to consider:

.github/workflows/benchmark.yml

if: always() on publish steps may produce confusing failures

All store steps in publish-benchmarks use if: always(). If the spamoor-benchmark job fails (e.g. test flake, Docker unavailable), the artifact won't be uploaded. The Download Spamoor benchmark results step will succeed vacuously, but Store Spamoor Trace result will then fail because test/e2e/spamoor_bench.json doesn't exist — yielding a cryptic action error rather than a clear "artifact not found" message.

Consider conditional execution for the spamoor store step:

```yaml
- name: Store Spamoor Trace result
  if: needs.spamoor-benchmark.result == 'success'
  uses: benchmark-action/github-action-benchmark@...
```

And update the needs condition at the job level to allow partial success:

```yaml
publish-benchmarks:
  needs: [evm-benchmark, spamoor-benchmark]
  if: always()
  runs-on: ubuntu-latest
```

needs doesn't include if: always() on the publish job itself

As written, if spamoor-benchmark fails, publish-benchmarks won't run at all (the default needs behavior is to skip on upstream failure). If the intent is to always publish whatever results are available, the job also needs if: always().

test/e2e/evm_test_common.go

aggregateSpanStats refactor is clean — good separation of concerns. No issues with the aggregation logic; the first-span initialisation (min: d, max: d) and subsequent comparisons are correct.

writeTraceBenchmarkJSON — only avg is emitted

The current implementation only writes average values per span operation. The benchmarkEntry struct supports a single Value float64, so this is a format constraint, not a bug. The last commit message (ci: only emit avg in benchmark JSON to reduce alert noise) confirms this is intentional.

One minor nit: t.Fatalf for the os.WriteFile error is appropriate in test infrastructure but will abort the entire test run if disk I/O fails. A t.Errorf + early return would be less disruptive if you want benchmark output failures to be non-fatal. Low priority given this is test code.

File permissions on benchmark JSON output

```go
os.WriteFile(outputPath, data, 0644)
```

0644 is fine for a benchmark output file — readable by owner/group/other, writable only by owner.

test/e2e/evm_spamoor_smoke_test.go

Redundant nil-guard at call site

```go
// line 169-171
if outputPath := os.Getenv("BENCH_JSON_OUTPUT"); outputPath != "" {
    writeTraceBenchmarkJSON(t, "SpamoorSmoke", toTraceSpans(evNodeSpans), outputPath)
}
```

writeTraceBenchmarkJSON already returns early if outputPath == "" (line 948-950 of evm_test_common.go). The guard here is harmless but redundant. Either pattern is fine — the explicit env-var check at the call site arguably makes the intent clearer to readers.

Only ev-node spans are exported

evRethSpans are not written to the benchmark JSON. Given the PR focus is ev-node performance, this seems intentional — worth a brief comment to make it explicit.

Summary

| Area | Severity | Finding |
| --- | --- | --- |
| benchmark.yml | Medium | publish-benchmarks will silently fail when the spamoor-benchmark artifact is missing; add if: always() to the job plus a conditional step |
| evm_test_common.go | Low | t.Fatalf for benchmark write errors is test-aborting; consider t.Errorf |
| evm_spamoor_smoke_test.go | Nit | Redundant env-var guard before calling writeTraceBenchmarkJSON |
| evm_spamoor_smoke_test.go | Nit | No comment explaining why only ev-node spans are exported |

The core logic and architecture are sound. The main actionable item is the always() / conditional publishing concern in the workflow.

@codecov

codecov bot commented Feb 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.90%. Comparing base (67e18bd) to head (c1da279).

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #3105      +/-   ##
==========================================
- Coverage   60.91%   60.90%   -0.02%
==========================================
  Files         113      113
  Lines       11617    11617
==========================================
- Hits         7077     7075       -2
- Misses       3742     3743       +1
- Partials      798      799       +1
```
| Flag | Coverage Δ |
| --- | --- |
| combined | 60.90% <ø> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

```yaml
go test -bench=BenchmarkProduceBlock -benchmem -run='^$' \
  ./block/internal/executing/... > block_executor_output.txt
- name: Store Block Executor benchmark result
- name: Store Block Executor result
```
Contributor Author


Because there are now multiple jobs pushing benchmarks, to avoid race conditions we gather up all the results and then push them sequentially, one by one.
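That gather-then-push flow might look roughly like the sketch below. This is not the PR's actual workflow; the job names, artifact paths, and action versions are assumptions, and only one of the sequential store steps is shown:

```yaml
publish-benchmarks:
  needs: [evm-benchmark, spamoor-benchmark, block-executor]
  if: always()  # publish whatever results are available
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/download-artifact@v4  # gather all benchmark artifacts
    # One store step per artifact, run sequentially: a single job pushing
    # to gh-pages avoids the race between concurrent benchmark jobs.
    - name: Store Spamoor Trace result
      if: needs.spamoor-benchmark.result == 'success'
      uses: benchmark-action/github-action-benchmark@v1
      with:
        tool: customSmallerIsBetter
        output-file-path: spamoor-bench/spamoor_bench.json
        auto-push: ${{ github.event_name != 'pull_request' }}
```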

@chatton chatton changed the title from "[WIP] feat: adding spammoor test to benchmark" to "feat: adding spammoor test to benchmark" on Feb 24, 2026
Contributor

@github-actions github-actions bot left a comment


⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Spamoor Trace Benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.50.

| Benchmark suite | Current: f4bf8e6 | Previous: f4a09a2 | Ratio |
| --- | --- | --- | --- |
| SpamoorSmoke - ForcedInclusionRetriever.RetrieveForcedIncludedTxs (max) | 44 us | 20 us | 2.20 |
| SpamoorSmoke - Sequencer.SubmitBatchTxs (max) | 208 us | 132 us | 1.58 |
| SpamoorSmoke - Store.NewBatch (max) | 106 us | 70 us | 1.51 |

This comment was automatically generated by workflow using github-action-benchmark.
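The alert logic behind that table is a simple ratio comparison against the configured threshold (1.50, i.e. alert-threshold: '150%'). A tiny conceptual sketch, not github-action-benchmark's actual code:

```go
package main

import "fmt"

// regression reports the current/previous ratio and whether it exceeds
// the alert threshold, conceptually mirroring how the rows above were
// flagged. For customSmallerIsBetter data, a larger ratio is worse.
func regression(current, previous, threshold float64) (float64, bool) {
	ratio := current / previous
	return ratio, ratio > threshold
}

func main() {
	// ForcedInclusionRetriever (max): 44 us now vs 20 us before.
	ratio, alert := regression(44, 20, 1.50)
	fmt.Printf("%.2f %v\n", ratio, alert) // 2.20 true
}
```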

@chatton chatton marked this pull request as ready for review February 24, 2026 15:56
```yaml
save-data-file: ${{ github.event_name != 'pull_request' }}
github-token: ${{ secrets.GITHUB_TOKEN }}
alert-threshold: '150%'
fail-on-alert: false
```
Contributor Author


For now this fails on nothing; I'm just working on getting results pushed and visible.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/evm_test_common.go`:
- Around line 937-981: The avg computation in writeTraceBenchmarkJSON uses
s.total.Microseconds() which truncates sub-microsecond precision; change it to a
float-based conversion (e.g. use s.total.Seconds()*1e6 or
float64(s.total.Nanoseconds())/1e3) and divide by float64(s.count) so avg keeps
fractional microsecond values; update the avg assignment that references s.total
and s.count accordingly.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f4bf8e6 and c1da279.

📒 Files selected for processing (1)
  • test/e2e/evm_test_common.go

Comment on lines +937 to +981
```go
// benchmarkEntry matches the customSmallerIsBetter format for github-action-benchmark.
type benchmarkEntry struct {
	Name  string  `json:"name"`
	Unit  string  `json:"unit"`
	Value float64 `json:"value"`
}

// writeTraceBenchmarkJSON aggregates spans and writes a customSmallerIsBetter JSON file.
// If outputPath is empty, the function is a no-op.
func writeTraceBenchmarkJSON(t testing.TB, label string, spans []traceSpan, outputPath string) {
	t.Helper()
	if outputPath == "" {
		return
	}
	m := aggregateSpanStats(spans)
	if len(m) == 0 {
		t.Logf("WARNING: no span stats to write for %s", label)
		return
	}

	// sort by name for stable output
	names := make([]string, 0, len(m))
	for name := range m {
		names = append(names, name)
	}
	sort.Strings(names)

	var entries []benchmarkEntry
	for _, name := range names {
		s := m[name]
		avg := float64(s.total.Microseconds()) / float64(s.count)
		entries = append(entries,
			benchmarkEntry{Name: fmt.Sprintf("%s - %s (avg)", label, name), Unit: "us", Value: avg},
		)
	}

	data, err := json.MarshalIndent(entries, "", " ")
	if err != nil {
		t.Fatalf("failed to marshal benchmark JSON: %v", err)
	}
	if err := os.WriteFile(outputPath, data, 0644); err != nil {
		t.Fatalf("failed to write benchmark JSON to %s: %v", outputPath, err)
	}
	t.Logf("wrote %d benchmark entries to %s", len(entries), outputPath)
}
```
Contributor


⚠️ Potential issue | 🟡 Minor

Avoid microsecond truncation to preserve small-span accuracy.

Duration.Microseconds() truncates sub-micro values, which can collapse fast spans to 0. Use a float conversion to keep fractional microseconds.

🔧 Suggested fix

```diff
-		avg := float64(s.total.Microseconds()) / float64(s.count)
+		avg := (float64(s.total) / float64(time.Microsecond)) / float64(s.count)
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```go
// benchmarkEntry matches the customSmallerIsBetter format for github-action-benchmark.
type benchmarkEntry struct {
	Name  string  `json:"name"`
	Unit  string  `json:"unit"`
	Value float64 `json:"value"`
}

// writeTraceBenchmarkJSON aggregates spans and writes a customSmallerIsBetter JSON file.
// If outputPath is empty, the function is a no-op.
func writeTraceBenchmarkJSON(t testing.TB, label string, spans []traceSpan, outputPath string) {
	t.Helper()
	if outputPath == "" {
		return
	}
	m := aggregateSpanStats(spans)
	if len(m) == 0 {
		t.Logf("WARNING: no span stats to write for %s", label)
		return
	}

	// sort by name for stable output
	names := make([]string, 0, len(m))
	for name := range m {
		names = append(names, name)
	}
	sort.Strings(names)

	var entries []benchmarkEntry
	for _, name := range names {
		s := m[name]
		avg := (float64(s.total) / float64(time.Microsecond)) / float64(s.count)
		entries = append(entries,
			benchmarkEntry{Name: fmt.Sprintf("%s - %s (avg)", label, name), Unit: "us", Value: avg},
		)
	}

	data, err := json.MarshalIndent(entries, "", " ")
	if err != nil {
		t.Fatalf("failed to marshal benchmark JSON: %v", err)
	}
	if err := os.WriteFile(outputPath, data, 0644); err != nil {
		t.Fatalf("failed to write benchmark JSON to %s: %v", outputPath, err)
	}
	t.Logf("wrote %d benchmark entries to %s", len(entries), outputPath)
}
```
