Skip to content

refactor!: Align RequestQueueClient interface with its Python counterpart#3729

Open
janbuchar wants to merge 18 commits into
v4from
refactor/align-request-queue-client-interface
Open

refactor!: Align RequestQueueClient interface with its Python counterpart#3729
janbuchar wants to merge 18 commits into
v4from
refactor/align-request-queue-client-interface

Conversation

@janbuchar

@janbuchar janbuchar commented Jun 10, 2026

Copy link
Copy Markdown
Contributor
  • request-queue sibling of refactor!: Align DatasetClient and KeyValueStoreClient interfaces with their Python counterparts #3627
  • notably, this removes the RequestProvider -> RequestQueue(V2) hierarchy, only RequestQueue is left
  • Closes Rework the storage client system #3075
  • Coincidentally closes Remove unlocking requests from BasicCrawler #2840
  • the memory storage changes don't deserve too much attention — we're going to replace the implementation with crawlee-storage soon anyway
    • this also covers MemoryStorageEmulator, which is slated for removal in Split off FileSystemStorage from MemoryStorage #3593. Its getRequestQueueItems helper now drains the queue via fetchNextRequest() and reclaims everything afterwards. Because reclaimRequest recomputes orderNo from scratch, this does not faithfully restore the original ordering (forefront requests lose their sign/priority, and same-millisecond reclaims can collapse relative order). That's fine here: all current callers use non-forefront requests with order-tolerant assertions, and the helper is going away with Split off FileSystemStorage from MemoryStorage #3593 — so it's not worth fixing.
    • isEmpty()/isFinished() in the memory client now do a full head scan (with forced per-entry disk reads) on every call instead of the old in-memory head-cache short-circuit, so they're O(N) per AutoscaledPool tick. We're not optimizing this: a client-local cache can't be made correct across multiple consumers sharing an on-disk queue (it would be blind to foreign locks), and this implementation is being replaced by crawlee-storage anyway — production cost lands in the Apify client, which computes finished-ness server-side.
  • related to Move Apify RequestQueue-specific logic to SDK #3328

@janbuchar janbuchar force-pushed the refactor/align-request-queue-client-interface branch from 6803f1f to acc2337 Compare June 10, 2026 10:13
Reduce `RequestQueueClient` from 12 methods to 9.
Replace `listHead`/`addRequest`/`batchAddRequests`/`updateRequest`/
`listAndLockHead`/`prolongRequestLock`/`deleteRequestLock` with
`addBatchOfRequests`/`fetchNextRequest`/`markRequestAsHandled`/
`reclaimRequest`/`isEmpty`, and drop the now-dead head/lock option types
(`QueueHead`, `ListOptions`, `ListAndLockOptions`, `ListAndLockHeadResult`,
`ProlongRequestLockOptions`, `ProlongRequestLockResult`,
`DeleteRequestLockOptions`, `RequestQueueHeadItem`).

Locking/coordination of multiple clients on the same queue is now an
internal concern of the client implementation, not part of the interface.

Part of #3075.
@janbuchar janbuchar force-pushed the refactor/align-request-queue-client-interface branch 2 times, most recently from c16bb30 to 01d189a Compare June 10, 2026 13:16
Reimplement the in-memory request queue client against the new interface.
The client now owns the pending/in-progress/handled bookkeeping (an
`inProgress` set on top of the existing `orderNo`-based ordering):

- `addBatchOfRequests` replaces `addRequest`/`batchAddRequests`
- `fetchNextRequest` pops the next pending request and marks it in progress
- `markRequestAsHandled`/`reclaimRequest` replace `updateRequest` and
  operate on in-progress requests (returning `null` when not in progress)
- `getRequest` is keyed by `uniqueKey`
- `isEmpty` reports whether any pending request remains (in-progress
  requests are not counted)

The lock-based `listHead`/`listAndLockHead`/`prolongRequestLock`/
`deleteRequestLock` methods are removed.

Part of #3075.
…estQueue` class

`RequestProvider`, `RequestQueueV1` and `RequestQueueV2` no longer differ in
behaviour — request coordination (locking, queue-head management) is now an
internal concern of the storage client — so they are merged into a single
concrete `RequestQueue` class:

- `RequestProvider` becomes `RequestQueue`; the `request_provider.ts` and
  `request_queue_v2.ts` modules are removed and the implementation lives in
  `request_queue.ts`.
- `fetchNextRequest`/`markRequestHandled`/`reclaimRequest`/`isEmpty` delegate to
  the slim client; the queue-head, locking, consistency and
  `recentlyHandledRequests` bookkeeping is gone.
- `isFinished` returns `false` while a background add batch is in flight,
  otherwise `client.isEmpty()`.
- `isEmpty()` reflects only pending requests (the next `fetchNextRequest()`
  would return `null`), preserving the crawler's task-scheduling contract.
- `getRequest` is keyed by `uniqueKey`.

The `RequestProvider`/`RequestQueueV1`/`RequestQueueV2` exports are removed; use
`RequestQueue` instead. `RequestProviderOptions` is renamed to
`RequestQueueOptions`.

Part of #3075.
Adapt `BasicCrawler` to the slim `RequestQueueClient` and the merged
`RequestQueue` class:

- Use `RequestQueue` in place of the removed `RequestProvider`/`RequestQueueV1`
  (instanceof checks, `open()`, parameter and field types).
- The same-domain-delay path no longer pokes the queue's private `inProgress`
  set; it relies on `reclaimRequest` to return the request to the queue.
- The error-handling safety net reclaims the request via `reclaimRequest`
  instead of calling the removed `deleteRequestLock`. Reclaiming a request that
  is no longer in progress is a harmless no-op on the client.

Part of #3075.
Rewrite the request-queue test suites against the new API. Obsolete white-box
tests of the removed locking/queue-head machinery are replaced with behavioral
tests using a real MemoryStorage-backed client:

- memory-storage forefront/handledRequestCount/ignore-non-json tests now drive
  `addBatchOfRequests`/`fetchNextRequest`/`markRequestAsHandled`/`reclaimRequest`/
  `isEmpty`.
- core `request-queue-v2` and `request_queue` tests cover the new lifecycle and
  `isEmpty`/`isFinished` semantics.
- `MemoryStorageEmulator.getRequestQueueItems` drains pending requests via
  `fetchNextRequest` and reclaims them, since `listHead` no longer exists.

Part of #3075.
Expand the `RequestQueueClient` migration table in the v4 upgrading guide with
the full method mapping, the new fetch/handle/reclaim lifecycle, the `isEmpty`
semantics, and the merge of `RequestProvider`/`RequestQueueV1`/`RequestQueueV2`
into a single `RequestQueue` class. List the removed head/lock types.

Drop the obsolete request-locking experiment guide (locking is no longer an
opt-in experiment) and remove its now-empty "Experiments" sidebar category.
Update the parallel scraping guide to use `RequestQueue` and drop the
`requestLocking` experiment flag.

Closes #3075.
@janbuchar janbuchar force-pushed the refactor/align-request-queue-client-interface branch from 01d189a to 1d3ccf5 Compare June 10, 2026 14:35
@janbuchar janbuchar marked this pull request as draft June 10, 2026 14:40
@janbuchar

janbuchar commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

@vdusek — a heads-up on a RequestQueueClient.is_empty semantic that surfaced while porting this RQ-client rework to crawlee-js. It's not a bug, but the contract is subtler than the method name and the base docstring suggest, and I think it's worth documenting on the Python side too.

TL;DR — In crawlee-python, RequestQueueClient.is_empty() does not mean "the next fetch_next_request() would return None". It means "there is no outstanding work left at all" — including requests currently in progress / locked (fetched but not yet handled or reclaimed). It's effectively a building block for is_finished(), not an "is there anything fetchable right now" check.

Where this comes from — all three clients agree, even though the base interface doesn't state it:

  • FS client: returns False as soon as len(state.in_progress_requests) > 0.
  • Apify single client: return not self._head_requests and not self._requests_in_progress.
  • Apify shared client: return len(head.items) == 0 and not self._queue_has_locked_requests.

The RequestQueue.is_empty() storage-wrapper docstring states it correctly ("either pending or being processed"), but the RequestQueueClient.is_empty() base interface docstring just says "True if the request queue is empty" — which reads like the weaker guarantee. The fetch_next_request docstrings ("a None return value does not mean processing finished … use is_finished") reinforce the wrong mental model.

Why it matters (multi-consumer)is_empty() feeds is_finished(), and the autoscaled pool calls is_finished() on every orchestrator loop iteration, not only when idle. So a narrower is_empty() would let a consumer stop while it (or, in the shared case, another consumer) still holds the last requests under a lock. queue_has_locked_requests is the multi-consumer analogue of the single client's _requests_in_progress set.

is_empty() feeds is_finished()

... and also this is kinda weird in itself.

A request fetched via `fetchNextRequest` is locked (in progress) for 3
minutes by persisting a future `orderNo` to disk. If the process ends
before the request is handled or reclaimed, that lock used to linger
until it expired, blocking the request for the next consumer of the same
on-disk queue.

The in-memory client now tracks the requests it has locked and releases
them in `MemoryStorage.teardown()`, resetting their `orderNo` (sign
preserved, so forefront/normal ordering survives) so they become
immediately fetchable again.

Only this client needs the cleanup: the Apify platform releases a run's
locks automatically on migrate/abort, and the file-system storage does
not lock at all.
`RequestQueueClient.isEmpty()` previously reported `true` as soon as no
request was immediately fetchable, ignoring requests that are merely
locked (in progress) by another consumer. With multiple consumers
sharing a queue, a consumer could therefore see the queue as empty —
and let the crawler finish — while another consumer still held the last
requests under a lock.

`listPendingHead` now also reports whether it skipped any unhandled-but-
locked request, and `isEmpty()` returns `true` only when nothing is
pending AND nothing is locked. This mirrors the Apify platform shared
client, whose `isEmpty` accounts for `queueHasLockedRequests`.

Tests that encoded the old "in-progress means empty" semantics are
updated accordingly (an in-progress request now keeps the queue
non-empty and unfinished until it is handled).
Request locking is internal to the storage client, but the only sensible
lock duration — long enough to outlive the request handler — is known only
to the crawler. Thread that knowledge across the boundary via a new
`setExpectedRequestProcessingTime(secs)` hint:

- `RequestQueueClient` (optional) and `IRequestManager` gain the method.
- The memory-storage client uses it to size how long `fetchNextRequest`
  locks a request (defaults to 3 minutes), fixing long handlers (>3 min)
  having their request handed out — and processed — a second time.
- The `RequestQueue` frontend owns the multi-consumer policy: it only ever
  raises the duration, never lowers it, so a short-lived consumer sharing
  the queue cannot cut short a long-lived one's reservation. The client
  stays a dumb setter.
- `RequestManagerTandem` forwards the hint to its underlying manager
  (applying it once the lazily-opened manager resolves), so the
  RequestList/tandem path is covered too.
- `BasicCrawler` hints `requestHandlerTimeoutMillis + padding` to whatever
  request manager it uses, not just a plain `RequestQueue`.
… and `RequestQueue.internalTimeoutMillis`

Now that V1/V2 are merged and locking lives in the storage client, two
things became inert and are removed:

- `RequestQueue.internalTimeoutMillis` — nothing reads it anymore (the old
  `RequestProvider` that did is gone). `BasicCrawler` no longer writes it,
  so `applyRequestManagerTimeouts` drops its `instanceof RequestQueue`
  branch and `_getRequestQueue` is inlined into `openOwnedRequestQueue`.
- The `requestLocking` crawler experiment, the `experiments` option, and
  the `CrawlerExperiments` type — both branches of the old `_getRequestQueue`
  returned the same `RequestQueue.open(...)`, so the flag only logged a lie.

BREAKING CHANGE: the `experiments` crawler option and the `CrawlerExperiments`
type are removed. Request locking has been the default since v3.10 with no
alternative to opt out to, so any `experiments: { requestLocking }` can be
deleted.
@janbuchar janbuchar force-pushed the refactor/align-request-queue-client-interface branch from 7cd8151 to a22f5e7 Compare June 11, 2026 15:17
`markRequestAsHandled` and `reclaimRequest` guarded on the request still
being locked (`isRequestLocked`). A consumer whose processing outlived the
lock — a slow handler, a GC/event-loop pause, or a low
`setExpectedRequestProcessingTime` — would therefore have its
`markRequestAsHandled` call silently return `null` and do nothing: the
request stayed pending, was never counted as handled, and got handed out
again on the next `fetchNextRequest`. With every pass exceeding the lock,
the request was reprocessed forever and the queue never finished.

The guards now require only that the request exists and is not already
handled (`orderNo !== null`), regardless of lock state, matching the
pre-refactor unconditional behavior. `reclaimRequest` likewise no longer
drops a `forefront` reorder after expiry.
@janbuchar

Copy link
Copy Markdown
Contributor Author

In addition to #3729 (comment):

I digged through the Python Crawlee+SDK history.

Phase 1 — the original (correct) design. Both SDKs split the two predicates: is_empty() was weak (ensure_head_is_non_empty(); return len(queue_head) == 0 — "anything fetchable now?") and is_finished() was strong, independently computed via return not self._queue_has_locked_requests ("all work done anywhere, incl. other clients' locks?"). Task-readiness rode weak is_empty; termination rode strong is_finished. No spin, name matched meaning.

Phase 2 — the collapse. crawlee-python apify/crawlee-python#1194 ("Introduce new storage client system") rewrote storages ground-up and incidentally merged the predicates: frontend is_empty() → thin client.is_empty(), is_finished()bg-tasks-check + is_empty(). The PR description never mentions the semantic merge — it was drift, not a decision.

Phase 3 — the compensation. To keep termination correct after the collapse, the Apify client's is_empty had to absorb the strong semantics. apify-sdk-python apify/apify-sdk-python#470 shipped len(head.items) == 0 and not self._queue_has_locked_requests; apify/apify-sdk-python#573 carried it into both single (not head and not in_progress) and shared variants. The name stopped matching the behavior here.

Takeaway. I will add isFinished in this PR and I believe we should do the same in crawlee-python and apify-sdk-python so that the extension point is easier to understand. It will also save some negligible busy waits in AutoscaledPool down the line.

@janbuchar janbuchar requested review from B4nan and barjin June 11, 2026 17:35
@janbuchar janbuchar marked this pull request as ready for review June 11, 2026 17:36

@barjin barjin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @janbuchar! Aside from these nits, I didn't find anything, so approving 👍


Claude hints at a performance issue with RequestQueue in memory storage, but I didn't get anything measurable in any conceivable scenario. Together with mem-storage being sunset by the new Rust storage, it's a no-op for me

* the same request out again while it is still being processed. Implementations that do not need
* this hint may leave it `undefined`.
*/
setExpectedRequestProcessingTime?(secs: number): void;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other time-related symbols, we use the ...Secs or ...Millis suffix. Any chance we can get this here as well?

Comment on lines +286 to +288
* queue. Returns information about the operation, or `null` if the request was not in progress.
*/
reclaimRequest(request: UpdateRequestSchema, options?: RequestOptions): Promise<QueueOperationInfo | null>;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or null if the request was not in progress

In this case, will / should the request be added to the queue, or is it an invalid operation (and therefore an error)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants