refactor!: Align `RequestQueueClient` interface with its Python counterpart by janbuchar · Pull Request #3729 · apify/crawlee

janbuchar · 2026-06-10T10:04:21Z

request-queue sibling of refactor!: Align DatasetClient and KeyValueStoreClient interfaces with their Python counterparts #3627
notably, this removes the RequestProvider -> RequestQueue(V2) hierarchy, only RequestQueue is left
Closes Rework the storage client system #3075
Coincidentally closes Remove unlocking requests from BasicCrawler #2840
the memory storage changes don't deserve too much attention — we're going to replace the implementation with crawlee-storage soon anyway
- this also covers MemoryStorageEmulator, which is slated for removal in Split off FileSystemStorage from MemoryStorage #3593. Its getRequestQueueItems helper now drains the queue via fetchNextRequest() and reclaims everything afterwards. Because reclaimRequest recomputes orderNo from scratch, this does not faithfully restore the original ordering (forefront requests lose their sign/priority, and same-millisecond reclaims can collapse relative order). That's fine here: all current callers use non-forefront requests with order-tolerant assertions, and the helper is going away with Split off FileSystemStorage from MemoryStorage #3593 — so it's not worth fixing.
- isEmpty()/isFinished() in the memory client now do a full head scan (with forced per-entry disk reads) on every call instead of the old in-memory head-cache short-circuit, so they're O(N) per AutoscaledPool tick. We're not optimizing this: a client-local cache can't be made correct across multiple consumers sharing an on-disk queue (it would be blind to foreign locks), and this implementation is being replaced by crawlee-storage anyway — production cost lands in the Apify client, which computes finished-ness server-side.
related to Move Apify RequestQueue-specific logic to SDK #3328

Reduce `RequestQueueClient` from 12 methods to 9. Replace `listHead`/`addRequest`/`batchAddRequests`/`updateRequest`/ `listAndLockHead`/`prolongRequestLock`/`deleteRequestLock` with `addBatchOfRequests`/`fetchNextRequest`/`markRequestAsHandled`/ `reclaimRequest`/`isEmpty`, and drop the now-dead head/lock option types (`QueueHead`, `ListOptions`, `ListAndLockOptions`, `ListAndLockHeadResult`, `ProlongRequestLockOptions`, `ProlongRequestLockResult`, `DeleteRequestLockOptions`, `RequestQueueHeadItem`). Locking/coordination of multiple clients on the same queue is now an internal concern of the client implementation, not part of the interface. Part of #3075.

Reimplement the in-memory request queue client against the new interface. The client now owns the pending/in-progress/handled bookkeeping (an `inProgress` set on top of the existing `orderNo`-based ordering): - `addBatchOfRequests` replaces `addRequest`/`batchAddRequests` - `fetchNextRequest` pops the next pending request and marks it in progress - `markRequestAsHandled`/`reclaimRequest` replace `updateRequest` and operate on in-progress requests (returning `null` when not in progress) - `getRequest` is keyed by `uniqueKey` - `isEmpty` reports whether any pending request remains (in-progress requests are not counted) The lock-based `listHead`/`listAndLockHead`/`prolongRequestLock`/ `deleteRequestLock` methods are removed. Part of #3075.

…estQueue` class `RequestProvider`, `RequestQueueV1` and `RequestQueueV2` no longer differ in behaviour — request coordination (locking, queue-head management) is now an internal concern of the storage client — so they are merged into a single concrete `RequestQueue` class: - `RequestProvider` becomes `RequestQueue`; the `request_provider.ts` and `request_queue_v2.ts` modules are removed and the implementation lives in `request_queue.ts`. - `fetchNextRequest`/`markRequestHandled`/`reclaimRequest`/`isEmpty` delegate to the slim client; the queue-head, locking, consistency and `recentlyHandledRequests` bookkeeping is gone. - `isFinished` returns `false` while a background add batch is in flight, otherwise `client.isEmpty()`. - `isEmpty()` reflects only pending requests (the next `fetchNextRequest()` would return `null`), preserving the crawler's task-scheduling contract. - `getRequest` is keyed by `uniqueKey`. The `RequestProvider`/`RequestQueueV1`/`RequestQueueV2` exports are removed; use `RequestQueue` instead. `RequestProviderOptions` is renamed to `RequestQueueOptions`. Part of #3075.

Adapt `BasicCrawler` to the slim `RequestQueueClient` and the merged `RequestQueue` class: - Use `RequestQueue` in place of the removed `RequestProvider`/`RequestQueueV1` (instanceof checks, `open()`, parameter and field types). - The same-domain-delay path no longer pokes the queue's private `inProgress` set; it relies on `reclaimRequest` to return the request to the queue. - The error-handling safety net reclaims the request via `reclaimRequest` instead of calling the removed `deleteRequestLock`. Reclaiming a request that is no longer in progress is a harmless no-op on the client. Part of #3075.

Rewrite the request-queue test suites against the new API. Obsolete white-box tests of the removed locking/queue-head machinery are replaced with behavioral tests using a real MemoryStorage-backed client: - memory-storage forefront/handledRequestCount/ignore-non-json tests now drive `addBatchOfRequests`/`fetchNextRequest`/`markRequestAsHandled`/`reclaimRequest`/ `isEmpty`. - core `request-queue-v2` and `request_queue` tests cover the new lifecycle and `isEmpty`/`isFinished` semantics. - `MemoryStorageEmulator.getRequestQueueItems` drains pending requests via `fetchNextRequest` and reclaims them, since `listHead` no longer exists. Part of #3075.

Expand the `RequestQueueClient` migration table in the v4 upgrading guide with the full method mapping, the new fetch/handle/reclaim lifecycle, the `isEmpty` semantics, and the merge of `RequestProvider`/`RequestQueueV1`/`RequestQueueV2` into a single `RequestQueue` class. List the removed head/lock types. Drop the obsolete request-locking experiment guide (locking is no longer an opt-in experiment) and remove its now-empty "Experiments" sidebar category. Update the parallel scraping guide to use `RequestQueue` and drop the `requestLocking` experiment flag. Closes #3075.

janbuchar · 2026-06-10T16:00:14Z

@vdusek — a heads-up on a RequestQueueClient.is_empty semantic that surfaced while porting this RQ-client rework to crawlee-js. It's not a bug, but the contract is subtler than the method name and the base docstring suggest, and I think it's worth documenting on the Python side too.

TL;DR — In crawlee-python, RequestQueueClient.is_empty() does not mean "the next fetch_next_request() would return None". It means "there is no outstanding work left at all" — including requests currently in progress / locked (fetched but not yet handled or reclaimed). It's effectively a building block for is_finished(), not an "is there anything fetchable right now" check.

Where this comes from — all three clients agree, even though the base interface doesn't state it:

FS client: returns False as soon as len(state.in_progress_requests) > 0.
Apify single client: return not self._head_requests and not self._requests_in_progress.
Apify shared client: return len(head.items) == 0 and not self._queue_has_locked_requests.

The RequestQueue.is_empty() storage-wrapper docstring states it correctly ("either pending or being processed"), but the RequestQueueClient.is_empty() base interface docstring just says "True if the request queue is empty" — which reads like the weaker guarantee. The fetch_next_request docstrings ("a None return value does not mean processing finished … use is_finished") reinforce the wrong mental model.

Why it matters (multi-consumer) — is_empty() feeds is_finished(), and the autoscaled pool calls is_finished() on every orchestrator loop iteration, not only when idle. So a narrower is_empty() would let a consumer stop while it (or, in the shared case, another consumer) still holds the last requests under a lock. queue_has_locked_requests is the multi-consumer analogue of the single client's _requests_in_progress set.

is_empty() feeds is_finished()

... and also this is kinda weird in itself.

A request fetched via `fetchNextRequest` is locked (in progress) for 3 minutes by persisting a future `orderNo` to disk. If the process ends before the request is handled or reclaimed, that lock used to linger until it expired, blocking the request for the next consumer of the same on-disk queue. The in-memory client now tracks the requests it has locked and releases them in `MemoryStorage.teardown()`, resetting their `orderNo` (sign preserved, so forefront/normal ordering survives) so they become immediately fetchable again. Only this client needs the cleanup: the Apify platform releases a run's locks automatically on migrate/abort, and the file-system storage does not lock at all.

`RequestQueueClient.isEmpty()` previously reported `true` as soon as no request was immediately fetchable, ignoring requests that are merely locked (in progress) by another consumer. With multiple consumers sharing a queue, a consumer could therefore see the queue as empty — and let the crawler finish — while another consumer still held the last requests under a lock. `listPendingHead` now also reports whether it skipped any unhandled-but- locked request, and `isEmpty()` returns `true` only when nothing is pending AND nothing is locked. This mirrors the Apify platform shared client, whose `isEmpty` accounts for `queueHasLockedRequests`. Tests that encoded the old "in-progress means empty" semantics are updated accordingly (an in-progress request now keeps the queue non-empty and unfinished until it is handled).

Request locking is internal to the storage client, but the only sensible lock duration — long enough to outlive the request handler — is known only to the crawler. Thread that knowledge across the boundary via a new `setExpectedRequestProcessingTime(secs)` hint: - `RequestQueueClient` (optional) and `IRequestManager` gain the method. - The memory-storage client uses it to size how long `fetchNextRequest` locks a request (defaults to 3 minutes), fixing long handlers (>3 min) having their request handed out — and processed — a second time. - The `RequestQueue` frontend owns the multi-consumer policy: it only ever raises the duration, never lowers it, so a short-lived consumer sharing the queue cannot cut short a long-lived one's reservation. The client stays a dumb setter. - `RequestManagerTandem` forwards the hint to its underlying manager (applying it once the lazily-opened manager resolves), so the RequestList/tandem path is covered too. - `BasicCrawler` hints `requestHandlerTimeoutMillis + padding` to whatever request manager it uses, not just a plain `RequestQueue`.

… and `RequestQueue.internalTimeoutMillis` Now that V1/V2 are merged and locking lives in the storage client, two things became inert and are removed: - `RequestQueue.internalTimeoutMillis` — nothing reads it anymore (the old `RequestProvider` that did is gone). `BasicCrawler` no longer writes it, so `applyRequestManagerTimeouts` drops its `instanceof RequestQueue` branch and `_getRequestQueue` is inlined into `openOwnedRequestQueue`. - The `requestLocking` crawler experiment, the `experiments` option, and the `CrawlerExperiments` type — both branches of the old `_getRequestQueue` returned the same `RequestQueue.open(...)`, so the flag only logged a lie. BREAKING CHANGE: the `experiments` crawler option and the `CrawlerExperiments` type are removed. Request locking has been the default since v3.10 with no alternative to opt out to, so any `experiments: { requestLocking }` can be deleted.

`markRequestAsHandled` and `reclaimRequest` guarded on the request still being locked (`isRequestLocked`). A consumer whose processing outlived the lock — a slow handler, a GC/event-loop pause, or a low `setExpectedRequestProcessingTime` — would therefore have its `markRequestAsHandled` call silently return `null` and do nothing: the request stayed pending, was never counted as handled, and got handed out again on the next `fetchNextRequest`. With every pass exceeding the lock, the request was reprocessed forever and the queue never finished. The guards now require only that the request exists and is not already handled (`orderNo !== null`), regardless of lock state, matching the pre-refactor unconditional behavior. `reclaimRequest` likewise no longer drops a `forefront` reorder after expiry.

janbuchar · 2026-06-11T17:22:12Z

In addition to #3729 (comment):

I digged through the Python Crawlee+SDK history.

Phase 1 — the original (correct) design. Both SDKs split the two predicates: is_empty() was weak (ensure_head_is_non_empty(); return len(queue_head) == 0 — "anything fetchable now?") and is_finished() was strong, independently computed via return not self._queue_has_locked_requests ("all work done anywhere, incl. other clients' locks?"). Task-readiness rode weak is_empty; termination rode strong is_finished. No spin, name matched meaning.

Phase 2 — the collapse. crawlee-python apify/crawlee-python#1194 ("Introduce new storage client system") rewrote storages ground-up and incidentally merged the predicates: frontend is_empty() → thin client.is_empty(), is_finished() → bg-tasks-check + is_empty(). The PR description never mentions the semantic merge — it was drift, not a decision.

Phase 3 — the compensation. To keep termination correct after the collapse, the Apify client's is_empty had to absorb the strong semantics. apify-sdk-python apify/apify-sdk-python#470 shipped len(head.items) == 0 and not self._queue_has_locked_requests; apify/apify-sdk-python#573 carried it into both single (not head and not in_progress) and shared variants. The name stopped matching the behavior here.

Takeaway. I will add isFinished in this PR and I believe we should do the same in crawlee-python and apify-sdk-python so that the extension point is easier to understand. It will also save some negligible busy waits in AutoscaledPool down the line.

barjin

Thanks @janbuchar! Aside from these nits, I didn't find anything, so approving 👍

Claude hints at a performance issue with RequestQueue in memory storage, but I didn't get anything measurable in any conceivable scenario. Together with mem-storage being sunset by the new Rust storage, it's a no-op for me

barjin · 2026-06-12T11:51:56Z

+     * the same request out again while it is still being processed. Implementations that do not need
+     * this hint may leave it `undefined`.
+     */
+    setExpectedRequestProcessingTime?(secs: number): void;


In other time-related symbols, we use the ...Secs or ...Millis suffix. Any chance we can get this here as well?

barjin · 2026-06-12T12:16:40Z

+     * queue. Returns information about the operation, or `null` if the request was not in progress.
+     */
+    reclaimRequest(request: UpdateRequestSchema, options?: RequestOptions): Promise<QueueOperationInfo | null>;


or null if the request was not in progress

In this case, will / should the request be added to the queue, or is it an invalid operation (and therefore an error)?

janbuchar force-pushed the refactor/align-request-queue-client-interface branch from 6803f1f to acc2337 Compare June 10, 2026 10:13

janbuchar force-pushed the refactor/align-request-queue-client-interface branch 2 times, most recently from c16bb30 to 01d189a Compare June 10, 2026 13:16

janbuchar added 5 commits June 10, 2026 16:31

janbuchar force-pushed the refactor/align-request-queue-client-interface branch from 01d189a to 1d3ccf5 Compare June 10, 2026 14:35

janbuchar marked this pull request as draft June 10, 2026 14:40

janbuchar added 4 commits June 10, 2026 18:07

janbuchar force-pushed the refactor/align-request-queue-client-interface branch from 7cd8151 to a22f5e7 Compare June 11, 2026 15:17

fix: Split off isFinished from isEmpty

a9a82b2

janbuchar requested review from B4nan and barjin June 11, 2026 17:35

janbuchar marked this pull request as ready for review June 11, 2026 17:36

janbuchar added 6 commits June 11, 2026 19:38

refactor: Make markRequestAsHandled naming consistent

25fef87

refactor: Remove dead code in RequestQueue

b62a781

fix: Avoid full queue scan in fetchNextRequest

67b6bb4

docs: Update stale naming

4f56482

fix: Handle concurrency in memory storage more correctly

c2b6446

fix: Reset better on purge

e354e96

barjin approved these changes Jun 12, 2026

View reviewed changes

janbuchar mentioned this pull request Jun 12, 2026

fix: Finalize the RequestQueueClient apify/crawlee-storage#33

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor!: Align `RequestQueueClient` interface with its Python counterpart#3729

refactor!: Align `RequestQueueClient` interface with its Python counterpart#3729
janbuchar wants to merge 18 commits into
v4from
refactor/align-request-queue-client-interface

janbuchar commented Jun 10, 2026 •

edited

Loading

Uh oh!

janbuchar commented Jun 10, 2026 •

edited

Loading

Uh oh!

janbuchar commented Jun 11, 2026

Uh oh!

barjin left a comment

Uh oh!

barjin Jun 12, 2026

Uh oh!

barjin Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

janbuchar commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

janbuchar commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

janbuchar commented Jun 11, 2026

Uh oh!

barjin left a comment

Choose a reason for hiding this comment

Uh oh!

barjin Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

barjin Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

janbuchar commented Jun 10, 2026 •

edited

Loading

janbuchar commented Jun 10, 2026 •

edited

Loading