Only increment stats when the worker acknowledged the test#373
Only increment stats when the worker acknowledged the test#373
Conversation
ruby/lib/minitest/queue/runner.rb
Outdated
| attributes.merge(type: type) | ||
| end.compact | ||
| File.open(queue_config.warnings_file, 'w') do |f| | ||
| File.open(queue_config.warnings_file, 'a') do |f| |
There was a problem hiding this comment.
We call this multiple times so we need to use append.
a9c8024 to
14cf9e8
Compare
| @queue.increment_test_failed if acknowledged == 1 | ||
| if acknowledged | ||
| # if another worker already acknowledged the test, we don't need to update the global stats or increment the test failed count | ||
| record_stats(stats) |
There was a problem hiding this comment.
This may not be enough.
Considering this scenario:
- Worker A runs test T1 (reclaim), fails, calls record_error. acknowledged = 0 → we don’t call record_stats.
- BuildStatusRecorder has already done self.failures += 1 (or self.errors += 1) for T1 in record() before calling build.record_error.
- Worker A then runs test T2, fails, calls record_error. acknowledged = 1 → we do call record_stats(stats).
stats is built from current self.failures / self.errors, which still include T1. So we send e.g. failures: 2 (T1 + T2) for worker A. - Redis ends up with worker A’s failures = 2, but only one of those (T2) was a first ack; T1 was a duplicate and was never supposed to be counted in stats.
So we over-count whenever a worker has both a duplicate ack and a later first ack: the duplicate stays in the in-memory counter and is included in the next record_stats.
There was a problem hiding this comment.
Yes this could be a problem. In that case you need to flip the logging around which makes the diff / refactor a lot more complicated.
Something like this https://github.com/Shopify/ci-queue/compare/cbruckmayer/fix-logging-of-tests?expand=1
There was a problem hiding this comment.
Sure, I can continue the work on your branch.
There was a problem hiding this comment.
I would probably keep it simple and remove the ignored state and introduce it in a later dedicated PR.
| end | ||
|
|
||
| def record_error(id, payload, stats: nil) | ||
| acknowledged, _ = redis.pipelined do |pipeline| |
There was a problem hiding this comment.
Current code uses one pipeline: acknowledge + record_stats together. The new code goes more round trip which may cause performance regression.
There was a problem hiding this comment.
Yes, if you want to keep it pipelined you need to inline it into the lua script which is a lot more complicated as you cannot rely on the result of the previous command in a pipeline. I don't think it's a significant performance regression so I prefer simplicity here.
There was a problem hiding this comment.
Make sense. We can start simple first
| def record_error(id, payload, stats: nil) | ||
| acknowledged, _ = redis.pipelined do |pipeline| | ||
| # Run acknowledge first so we know whether we're the first to ack | ||
| acknowledged = redis.pipelined do |pipeline| |
There was a problem hiding this comment.
If there is only one command, you don't need a pipeline.
There was a problem hiding this comment.
Yeah, you are right. This lua script is a single round-trip even though it does several Redis calls inside.
I will remove pipeline in here.
| @queue.acknowledge(id, error: payload, pipeline: pipeline) | ||
| record_stats(stats, pipeline: pipeline) | ||
| end | ||
| end.first |
There was a problem hiding this comment.
Yeah, it is not necessary. Removing it along with pipeline.
| nil | ||
| if acknowledged | ||
| # We were the first to ack; another worker already ack'd would get falsy from SADD | ||
| redis.pipelined do |pipeline| |
There was a problem hiding this comment.
If there is only one command, we don't need a pipeline.
There was a problem hiding this comment.
inside record_stats method (this is not lua script), it does multiple commands (hset and expire). So we can get benefit with pipeline. I prefer to keep it in here.
| @queue.increment_test_failed | ||
| end | ||
| # Return so caller can roll back local counter when not acknowledged | ||
| !!acknowledged |
There was a problem hiding this comment.
Do we need to update the other implementations of record_error too?
There was a problem hiding this comment.
Don't think so.
- lib/ci/queue/build_record.rb (base BuildRecord): Static/local queue, single process, no Redis, no distributed workers.
- lib/ci/queue/redis/grind_record.rb (GrindRecord): Grind mode (run tests many times). Different model from the main queue. It just lpushes onto an error list and updates stats. No “processed” set or SADD-based first-ack semantics.
There was a problem hiding this comment.
What I mean is: Are they used with the reporter because they all return nil now which means we would always decrement the error count.
| @queue.increment_test_failed | ||
| end | ||
| # Return so caller can roll back local counter when not acknowledged | ||
| !!acknowledged |
There was a problem hiding this comment.
!! does not change whether the value is “success” or “failure”. It only turns that into a real boolean.
acknowledge.lua already returns a boolean, so the !! is not a must. However, this can make sure the future change to acknowledge.lua will be safe.
|
|
||
| private | ||
|
|
||
| def stat_delta(counter, test) |
There was a problem hiding this comment.
Why is this needed?
0fff968 to
93edc70
Compare
…place - Record stats only when worker acknowledges; duplicate acks do not increment - Redis: record_stats_delta (HINCRBY); record_success returns true when ack'd or replaced - Stat correction when success replaces failure; real assertion count (test.assertions) in delta - Test helper: Requeue before Skip when both set; test_aggregation and integration expectations updated - Remove [stats] debug logging from Redis BuildRecord; test_redis_reporter assertions = 8
b5df285 to
b1ea42b
Compare
Only increment error stats when the worker acknowledged the test otherwise we end up with an incorrect counter.