refactor!: Align RequestQueueClient interface with its Python counterpart#3729
refactor!: Align RequestQueueClient interface with its Python counterpart#3729janbuchar wants to merge 18 commits into
RequestQueueClient interface with its Python counterpart#3729Conversation
6803f1f to
acc2337
Compare
Reduce `RequestQueueClient` from 12 methods to 9. Replace `listHead`/`addRequest`/`batchAddRequests`/`updateRequest`/ `listAndLockHead`/`prolongRequestLock`/`deleteRequestLock` with `addBatchOfRequests`/`fetchNextRequest`/`markRequestAsHandled`/ `reclaimRequest`/`isEmpty`, and drop the now-dead head/lock option types (`QueueHead`, `ListOptions`, `ListAndLockOptions`, `ListAndLockHeadResult`, `ProlongRequestLockOptions`, `ProlongRequestLockResult`, `DeleteRequestLockOptions`, `RequestQueueHeadItem`). Locking/coordination of multiple clients on the same queue is now an internal concern of the client implementation, not part of the interface. Part of #3075.
c16bb30 to
01d189a
Compare
Reimplement the in-memory request queue client against the new interface. The client now owns the pending/in-progress/handled bookkeeping (an `inProgress` set on top of the existing `orderNo`-based ordering): - `addBatchOfRequests` replaces `addRequest`/`batchAddRequests` - `fetchNextRequest` pops the next pending request and marks it in progress - `markRequestAsHandled`/`reclaimRequest` replace `updateRequest` and operate on in-progress requests (returning `null` when not in progress) - `getRequest` is keyed by `uniqueKey` - `isEmpty` reports whether any pending request remains (in-progress requests are not counted) The lock-based `listHead`/`listAndLockHead`/`prolongRequestLock`/ `deleteRequestLock` methods are removed. Part of #3075.
…estQueue` class `RequestProvider`, `RequestQueueV1` and `RequestQueueV2` no longer differ in behaviour — request coordination (locking, queue-head management) is now an internal concern of the storage client — so they are merged into a single concrete `RequestQueue` class: - `RequestProvider` becomes `RequestQueue`; the `request_provider.ts` and `request_queue_v2.ts` modules are removed and the implementation lives in `request_queue.ts`. - `fetchNextRequest`/`markRequestHandled`/`reclaimRequest`/`isEmpty` delegate to the slim client; the queue-head, locking, consistency and `recentlyHandledRequests` bookkeeping is gone. - `isFinished` returns `false` while a background add batch is in flight, otherwise `client.isEmpty()`. - `isEmpty()` reflects only pending requests (the next `fetchNextRequest()` would return `null`), preserving the crawler's task-scheduling contract. - `getRequest` is keyed by `uniqueKey`. The `RequestProvider`/`RequestQueueV1`/`RequestQueueV2` exports are removed; use `RequestQueue` instead. `RequestProviderOptions` is renamed to `RequestQueueOptions`. Part of #3075.
Adapt `BasicCrawler` to the slim `RequestQueueClient` and the merged `RequestQueue` class: - Use `RequestQueue` in place of the removed `RequestProvider`/`RequestQueueV1` (instanceof checks, `open()`, parameter and field types). - The same-domain-delay path no longer pokes the queue's private `inProgress` set; it relies on `reclaimRequest` to return the request to the queue. - The error-handling safety net reclaims the request via `reclaimRequest` instead of calling the removed `deleteRequestLock`. Reclaiming a request that is no longer in progress is a harmless no-op on the client. Part of #3075.
Rewrite the request-queue test suites against the new API. Obsolete white-box tests of the removed locking/queue-head machinery are replaced with behavioral tests using a real MemoryStorage-backed client: - memory-storage forefront/handledRequestCount/ignore-non-json tests now drive `addBatchOfRequests`/`fetchNextRequest`/`markRequestAsHandled`/`reclaimRequest`/ `isEmpty`. - core `request-queue-v2` and `request_queue` tests cover the new lifecycle and `isEmpty`/`isFinished` semantics. - `MemoryStorageEmulator.getRequestQueueItems` drains pending requests via `fetchNextRequest` and reclaims them, since `listHead` no longer exists. Part of #3075.
Expand the `RequestQueueClient` migration table in the v4 upgrading guide with the full method mapping, the new fetch/handle/reclaim lifecycle, the `isEmpty` semantics, and the merge of `RequestProvider`/`RequestQueueV1`/`RequestQueueV2` into a single `RequestQueue` class. List the removed head/lock types. Drop the obsolete request-locking experiment guide (locking is no longer an opt-in experiment) and remove its now-empty "Experiments" sidebar category. Update the parallel scraping guide to use `RequestQueue` and drop the `requestLocking` experiment flag. Closes #3075.
01d189a to
1d3ccf5
Compare
|
@vdusek — a heads-up on a TL;DR — In crawlee-python, Where this comes from — all three clients agree, even though the base interface doesn't state it:
The Why it matters (multi-consumer) —
... and also this is kinda weird in itself. |
A request fetched via `fetchNextRequest` is locked (in progress) for 3 minutes by persisting a future `orderNo` to disk. If the process ends before the request is handled or reclaimed, that lock used to linger until it expired, blocking the request for the next consumer of the same on-disk queue. The in-memory client now tracks the requests it has locked and releases them in `MemoryStorage.teardown()`, resetting their `orderNo` (sign preserved, so forefront/normal ordering survives) so they become immediately fetchable again. Only this client needs the cleanup: the Apify platform releases a run's locks automatically on migrate/abort, and the file-system storage does not lock at all.
`RequestQueueClient.isEmpty()` previously reported `true` as soon as no request was immediately fetchable, ignoring requests that are merely locked (in progress) by another consumer. With multiple consumers sharing a queue, a consumer could therefore see the queue as empty — and let the crawler finish — while another consumer still held the last requests under a lock. `listPendingHead` now also reports whether it skipped any unhandled-but- locked request, and `isEmpty()` returns `true` only when nothing is pending AND nothing is locked. This mirrors the Apify platform shared client, whose `isEmpty` accounts for `queueHasLockedRequests`. Tests that encoded the old "in-progress means empty" semantics are updated accordingly (an in-progress request now keeps the queue non-empty and unfinished until it is handled).
Request locking is internal to the storage client, but the only sensible lock duration — long enough to outlive the request handler — is known only to the crawler. Thread that knowledge across the boundary via a new `setExpectedRequestProcessingTime(secs)` hint: - `RequestQueueClient` (optional) and `IRequestManager` gain the method. - The memory-storage client uses it to size how long `fetchNextRequest` locks a request (defaults to 3 minutes), fixing long handlers (>3 min) having their request handed out — and processed — a second time. - The `RequestQueue` frontend owns the multi-consumer policy: it only ever raises the duration, never lowers it, so a short-lived consumer sharing the queue cannot cut short a long-lived one's reservation. The client stays a dumb setter. - `RequestManagerTandem` forwards the hint to its underlying manager (applying it once the lazily-opened manager resolves), so the RequestList/tandem path is covered too. - `BasicCrawler` hints `requestHandlerTimeoutMillis + padding` to whatever request manager it uses, not just a plain `RequestQueue`.
… and `RequestQueue.internalTimeoutMillis`
Now that V1/V2 are merged and locking lives in the storage client, two
things became inert and are removed:
- `RequestQueue.internalTimeoutMillis` — nothing reads it anymore (the old
`RequestProvider` that did is gone). `BasicCrawler` no longer writes it,
so `applyRequestManagerTimeouts` drops its `instanceof RequestQueue`
branch and `_getRequestQueue` is inlined into `openOwnedRequestQueue`.
- The `requestLocking` crawler experiment, the `experiments` option, and
the `CrawlerExperiments` type — both branches of the old `_getRequestQueue`
returned the same `RequestQueue.open(...)`, so the flag only logged a lie.
BREAKING CHANGE: the `experiments` crawler option and the `CrawlerExperiments`
type are removed. Request locking has been the default since v3.10 with no
alternative to opt out to, so any `experiments: { requestLocking }` can be
deleted.
7cd8151 to
a22f5e7
Compare
`markRequestAsHandled` and `reclaimRequest` guarded on the request still being locked (`isRequestLocked`). A consumer whose processing outlived the lock — a slow handler, a GC/event-loop pause, or a low `setExpectedRequestProcessingTime` — would therefore have its `markRequestAsHandled` call silently return `null` and do nothing: the request stayed pending, was never counted as handled, and got handed out again on the next `fetchNextRequest`. With every pass exceeding the lock, the request was reprocessed forever and the queue never finished. The guards now require only that the request exists and is not already handled (`orderNo !== null`), regardless of lock state, matching the pre-refactor unconditional behavior. `reclaimRequest` likewise no longer drops a `forefront` reorder after expiry.
|
In addition to #3729 (comment): I digged through the Python Crawlee+SDK history. Phase 1 — the original (correct) design. Both SDKs split the two predicates: Phase 2 — the collapse. crawlee-python apify/crawlee-python#1194 ("Introduce new storage client system") rewrote storages ground-up and incidentally merged the predicates: frontend Phase 3 — the compensation. To keep termination correct after the collapse, the Apify client's Takeaway. I will add |
barjin
left a comment
There was a problem hiding this comment.
Thanks @janbuchar! Aside from these nits, I didn't find anything, so approving 👍
Claude hints at a performance issue with RequestQueue in memory storage, but I didn't get anything measurable in any conceivable scenario. Together with mem-storage being sunset by the new Rust storage, it's a no-op for me
| * the same request out again while it is still being processed. Implementations that do not need | ||
| * this hint may leave it `undefined`. | ||
| */ | ||
| setExpectedRequestProcessingTime?(secs: number): void; |
There was a problem hiding this comment.
In other time-related symbols, we use the ...Secs or ...Millis suffix. Any chance we can get this here as well?
| * queue. Returns information about the operation, or `null` if the request was not in progress. | ||
| */ | ||
| reclaimRequest(request: UpdateRequestSchema, options?: RequestOptions): Promise<QueueOperationInfo | null>; |
There was a problem hiding this comment.
or
nullif the request was not in progress
In this case, will / should the request be added to the queue, or is it an invalid operation (and therefore an error)?
DatasetClientandKeyValueStoreClientinterfaces with their Python counterparts #3627RequestProvider->RequestQueue(V2)hierarchy, onlyRequestQueueis leftBasicCrawler#2840MemoryStorageEmulator, which is slated for removal in Split off FileSystemStorage from MemoryStorage #3593. ItsgetRequestQueueItemshelper now drains the queue viafetchNextRequest()and reclaims everything afterwards. BecausereclaimRequestrecomputesorderNofrom scratch, this does not faithfully restore the original ordering (forefront requests lose their sign/priority, and same-millisecond reclaims can collapse relative order). That's fine here: all current callers use non-forefront requests with order-tolerant assertions, and the helper is going away with Split off FileSystemStorage from MemoryStorage #3593 — so it's not worth fixing.isEmpty()/isFinished()in the memory client now do a full head scan (with forced per-entry disk reads) on every call instead of the old in-memory head-cache short-circuit, so they're O(N) perAutoscaledPooltick. We're not optimizing this: a client-local cache can't be made correct across multiple consumers sharing an on-disk queue (it would be blind to foreign locks), and this implementation is being replaced by crawlee-storage anyway — production cost lands in the Apify client, which computes finished-ness server-side.