fix: commit request queue dedup cache only after batch_add_requests succeeds #975
vdusek wants to merge 8 commits into
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #975 +/- ##
==========================================
+ Coverage 89.90% 91.69% +1.78%
==========================================
Files 49 50 +1
Lines 3091 3203 +112
==========================================
+ Hits 2779 2937 +158
+ Misses 312 266 -46
…-on-success # Conflicts: # tests/unit/storage_clients/test_apify_request_queue_client.py
…form writes
Commit-after-success regressed concurrent deduplication. Overlapping producers re-sent the same request and multiplied platform writes, which the parallel-dedup integration test caught. In-flight adds are now tracked as per-request futures that concurrent producers await instead of re-sending. A producer learns the request is present only once the platform accepts it. If the original add fails, awaiters are told it was not committed and report it unprocessed so it gets retried, never falsely succeeded.
Pull request overview
This PR fixes request queue deduplication correctness for both Apify request queue client implementations by ensuring local dedup caches are only committed after a successful batch_add_requests call, and by coordinating concurrent producers via per-request in-flight futures.
Changes:
- Deferred local cache/head updates until `batch_add_requests` succeeds, committing only requests accepted by the platform (skipping `unprocessed_requests`).
- Added per-request “in-flight add” tracking to deduplicate concurrent producers without reporting false success on failures.
- Added unit tests covering failure cases, concurrent producer deduplication, and partial unprocessed behavior for both single/shared clients.
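
The commit-after-success rule from the first change can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the PR: `FakeApi`, `BatchResult`, `DedupClient`, and the `processed`/`unprocessed` field names are stand-ins chosen for illustration, and the real clients track more state than a set of keys.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class BatchResult:
    """Hypothetical stand-in for the platform's batch response."""
    processed: list = field(default_factory=list)
    unprocessed: list = field(default_factory=list)


class FakeApi:
    """Hypothetical API stub: rejects keys in `reject`, raises if `fail` is set."""

    def __init__(self, reject=(), fail=False):
        self.reject = set(reject)
        self.fail = fail

    async def batch_add_requests(self, keys):
        if self.fail:
            raise RuntimeError("platform unavailable")
        return BatchResult(
            processed=[k for k in keys if k not in self.reject],
            unprocessed=[k for k in keys if k in self.reject],
        )


class DedupClient:
    """Illustrative client that caches keys only after the API accepts them."""

    def __init__(self, api):
        self.api = api
        self.cache = set()

    async def add_batch(self, keys):
        new_keys = [k for k in keys if k not in self.cache]
        if not new_keys:
            return BatchResult()
        # This call may raise. Nothing is cached yet, so a retry is sent again.
        result = await self.api.batch_add_requests(new_keys)
        # Commit only what the platform accepted; unprocessed keys stay uncached.
        self.cache.update(result.processed)
        return result
```

With this ordering, a key the platform reports as unprocessed is never cached, and an exception from the API leaves the cache untouched, so the next attempt reaches the platform again.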
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/unit/storage_clients/test_apify_request_queue_client.py | Adds unit tests for failed adds, concurrent in-flight deduplication, and partial acceptance behavior across both clients. |
| src/apify/storage_clients/_apify/_utils.py | Introduces helper functions for settling and awaiting in-flight per-request additions. |
| src/apify/storage_clients/_apify/_request_queue_single_client.py | Defers cache/head commits until API success and coordinates concurrent producers via in-flight futures. |
| src/apify/storage_clients/_apify/_request_queue_shared_client.py | Defers dedup-cache commits until API success and coordinates concurrent producers via in-flight futures. |
Both Apify request queue clients (`ApifyRequestQueueSingleClient`, `ApifyRequestQueueSharedClient`) cached new requests locally (the single client also updates `_head_requests`) before calling `batch_add_requests`. That caused two bugs:

- … `was_already_present` before the first call finished. If that call then failed, the producer had already reported success for a request that never reached the platform.

Now the cache and head are committed only after `batch_add_requests` succeeds, and only for requests the platform accepted (`unprocessed_requests` are skipped). A failed call commits nothing.

To keep deduplicating concurrent producers (so overlapping batches do not multiply platform writes), each in-flight add is tracked by a per-request future. A concurrent producer of the same request awaits it instead of re-sending. It is told the request is present only if the original add committed it. If the original add failed, the producer reports the request unprocessed so Crawlee retries it, rather than receiving false success.

Within-batch deduplication is unaffected, since Crawlee already collapses a batch by `unique_key` (via `_transform_requests`) before it reaches these clients.

Tests (parametrized over both clients, plus integration):

- … `batch_add_requests` leaves no cached entry, and the retry reaches the platform.
- `test_request_queue_parallel_deduplication` confirms overlapping concurrent producers write each request to the platform exactly once.
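
For readers who want the in-flight coordination in concrete terms, here is a rough sketch of the per-request future idea. It is an assumption-laden illustration and not the code in this PR: `InFlightDedup`, its `send` callable, and the string return values are all hypothetical names, and the real clients work on batches rather than single keys.

```python
import asyncio


class InFlightDedup:
    """Hypothetical sketch: one future per request key that is being added."""

    def __init__(self, send):
        self._send = send        # async callable taking a key; raises on failure
        self._cache = set()      # keys the platform has accepted
        self._in_flight = {}     # key -> Future resolving to True if committed

    async def add(self, key):
        if key in self._cache:
            return "already_present"

        pending = self._in_flight.get(key)
        if pending is not None:
            # Another producer is sending this key: wait for its outcome
            # instead of re-sending.
            committed = await pending
            return "already_present" if committed else "unprocessed"

        future = asyncio.get_running_loop().create_future()
        self._in_flight[key] = future
        committed = False
        try:
            await self._send(key)
            self._cache.add(key)  # commit only after the send succeeded
            committed = True
            return "added"
        finally:
            # Runs on success, failure, and cancellation, so awaiters never hang.
            del self._in_flight[key]
            future.set_result(committed)
```

Settling the future in `finally` is a design choice of this sketch: awaiters are always woken, and they see `False` unless the send actually committed, which is what lets them report the request as unprocessed rather than claim success.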