Keep a HTTPX session per thread instead of sharing it across threads #80
Conversation
We're relying on HTTPX's `:persistent` plugin, which is designed for fiber-based concurrency within a single thread, not multi-threaded access.
From a session with Claude, examining logs of jobs that were running together with the …
I did a bit of spelunking with Claude as well and this is what "we" came up with in case it's helpful at all: https://gist.github.com/trevorturk/3b034da90954fa8e41cc063b40da11db
It might be worth copying in the Async and HTTPX maintainers at this point? This is an area that's rapidly evolving. That being said, if this gem depends on Rails (IIRC it does), we could probably consider option 2 here, which would leverage the … Anyway, hope this is somewhat helpful, or at least gives us some threads (and/or fibers) to pull on! 🤣
Great digging @rosa!
Thank you!
Ohhh, I hadn't thought of this, and I think it makes sense! I'll change to this and will test it out. Basically, this, from the document:
This gem is basically for Rails only, so yes, I think it makes perfect sense.
So this supports both Fiber and Thread isolation levels. Thanks to @trevorturk for the idea!
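To make the fiber-vs-thread isolation distinction concrete, here is a minimal sketch in plain Ruby (names like `current_session` and `build_session` are illustrative, not the gem's actual API). One subtlety worth noting: `Thread.current[:key]` is actually *fiber-local* in Ruby, while `Thread#thread_variable_get`/`set` are truly thread-local.

```ruby
# Illustrative sketch only; build_session stands in for the real
# HTTPX session construction.
def build_session
  Object.new # placeholder for the real HTTPX.plugin(:persistent) setup
end

# Thread.current[:key] is *fiber-local* in Ruby;
# thread_variable_get/set are truly *thread-local*.
def current_session(isolation)
  case isolation
  when :thread
    Thread.current.thread_variable_get(:httpx_session) ||
      Thread.current.thread_variable_set(:httpx_session, build_session)
  when :fiber
    Thread.current[:httpx_session] ||= build_session
  end
end
```

With `:thread` isolation, two fibers on the same thread share one session but two threads never do; with `:fiber` isolation, every fiber gets its own.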
Cool, please do let us know how it goes!
This tries to address an issue we ran into when running `ActionPushNative::NotificationJob` jobs in several threads (with Solid Queue's multiple thread option). I described it in #25 (comment). Even though we were running these in 3 threads per process (worker), we'd get frequent `HTTPX::PoolTimeoutError` exceptions, even though the max number of connections is 5 by default (more connections than threads). We increased the number of connections to 10 and the errors decreased, but they still happened from time to time. They disappeared when we increased the number from 10 to 20.
Still, things weren't working fine. The jobs ran mixed with other jobs, all processed by the same workers. We noticed that certain jobs would see a big delay in all queries, from 100ms to 800ms extra, on every query. We also saw that the servers where these jobs were running got frequent CPU spikes that didn't happen before. All of this was really strange, and it only fixed itself when we changed these workers to run with a single thread. With a single thread, nothing would go wrong: we could return to the default max connections of 5, and there wouldn't be any problems.
I left it that way, running in a single thread, as it was working fine, until I finally had time to investigate properly. I started looking into it last week, first by trying to reproduce it in a single VM, with 3 threads, some `ActionPushNative::NotificationJob` jobs mixed with other not-user-facing jobs, and `HTTPX_DEBUG` set to `1`. This was a very small number of `ActionPushNative::NotificationJob` jobs, and the error didn't happen. Then, I just set all jobs to run with 3 threads, mixed with those not-user-facing jobs. The `HTTPX::PoolTimeoutError` happened right away, so we could get a bunch of logs to investigate. Here's an extract from those logs:
Some things to note from the logs and from the usage of `HTTPX::Session`:

- There's a single `HTTPX::Session` instance per service and per application configured, shared between all threads (action_push_native/lib/action_push_native/service/apns.rb, lines 62 to 65 in 65e32ee)
- The session uses the `:persistent` plugin, which keeps a `@contexts` hash that is not thread-safe.

Now, looking at the logs in more detail:
The logs show two thread/fiber pairs: tid:175520, fid:175528 and tid:143944, fid:143960. So, connection #134080 was used by Thread A first (stream 541), then by Thread B later (stream 543). However, Thread A got the pool timeout around 15:50:07, and then Thread B successfully used #134080 at ~15:50:37. Why couldn't Thread A get it from the pool? 😕
I'm not completely sure about the answer, but I suspect it might be around this:
`@contexts` not being thread-safe.

That connection is now in Thread A's selector only. When Thread B reuses the same connection, Thread B adds its request to `@contexts[Thread_B_fiber]`, but the connection might still only be in Thread A's selector. Then, when Thread A runs its selector loop and finishes reading the response for its last request, the connection will start being skipped in that loop. At this point, the connection will be in use by Thread B, but it won't be in Thread B's selector, so nobody would be reading from that socket.

I'm not completely sure about this; I still feel I'm missing some piece of the puzzle here. I also can't explain the consistent slowdown of all queries in the other jobs, which I also saw happening. I imagine it might have to do with the selector loop ending up in a state it doesn't expect (it'd expect a connection to wait on, but would get nothing, yet it'll have a request waiting), and the GVL.
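To make the fiber-keyed bookkeeping concrete, here is a toy model (not HTTPX's actual code) of a `@contexts`-style hash shared across threads: each thread writes under its own root fiber's key, so the hash accumulates one unsynchronized entry per fiber, with nothing tying an entry back to the selector that actually holds the connection.

```ruby
# Toy model, not HTTPX's implementation: a plain Hash keyed by
# Fiber.current, shared between threads with no locking.
contexts = {}

threads = 2.times.map do |i|
  Thread.new do
    # Each thread has its own root fiber, so each thread
    # writes under a distinct key.
    contexts[Fiber.current] = "request from thread #{i}"
  end
end
threads.each(&:join)

# The hash now holds one entry per thread's fiber. Nothing connects
# a request registered under one fiber's key to another thread's
# selector loop -- the kind of gap the analysis above suspects.
puts contexts.size # => 2
```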
In any case, it feels like having the persistent session shared between threads, when a part of it (the part implemented in the `:fiber_concurrency` plugin) is not thread-safe, might be the source of trouble here. So this PR just makes the sessions per-thread (and per service and application). Running this in production, with the same setup as before (3 threads, 5 max size of the connection pool, mixed with other jobs), is working perfectly.
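A condensed sketch of the direction the PR describes: key the session store by thread, and within each thread by service and application, so no session object is ever shared across threads. All names here (`SESSIONS_KEY`, `session_for`, `build_session`) are illustrative, not the gem's actual identifiers.

```ruby
# Illustrative sketch; build_session stands in for the real
# HTTPX.plugin(:persistent) setup per service/application.
def build_session(service, application)
  { service: service, application: application } # placeholder object
end

SESSIONS_KEY = :push_native_sessions # assumed thread-variable name

def session_for(service, application)
  # Thread-local store: each thread lazily builds its own hash.
  store = Thread.current.thread_variable_get(SESSIONS_KEY) ||
          Thread.current.thread_variable_set(SESSIONS_KEY, {})
  # One session per (service, application) pair, private to this thread,
  # so the pool's fiber-keyed internals never see cross-thread access.
  store[[service, application]] ||= build_session(service, application)
end
```

Within one thread the session is still reused across jobs (keeping the benefit of persistent connections), while a second thread building its own session avoids the shared `@contexts` hazard entirely.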