Skip to content

perf: reduce allocations on the chunk reading hot path#339

Open
matshch wants to merge 3 commits into
buildbarn:mainfrom
matshch:fix-buffer-reuse
Open

perf: reduce allocations on the chunk reading hot path#339
matshch wants to merge 3 commits into
buildbarn:mainfrom
matshch:fix-buffer-reuse

Conversation

@matshch
Copy link
Copy Markdown

@matshch matshch commented May 14, 2026

Note

This PR was authored with AI assistance and reviewed by me before opening. I have not verified it yet on the production traffic, but all existing tests do pass.

readerBackedChunkReader previously allocated a fresh maximumChunkSizeBytes slice on every Read() call. With the default 64 KiB chunk size and production traffic around 20-30 KRPS, this allocation site accounted for the bulk of GC pressure on bb_storage, with the runtime spending up to 90% of CPU in GC under load. Go's GC is also hardcoded to use at most 25% of available CPU threads, so the remaining cores stayed idle.

The reader now owns a single buffer for its entire lifetime, acquired lazily on first Read() and returned to a package-private sync.Pool on Close(). The ChunkReader contract is tightened to reflect the long-standing implicit invariant that returned slices are only valid until the next Read()/Close() call — every in-tree consumer already respected this. multiplexedChunkReader's existing channel-based barrier already serializes consumers, which keeps buffer reuse safe across CloneStream().

Benchmark results:

BlobSize ns/op (before -> after) allocs/op throughput GB/s
4 KiB Serial 8879 -> 165 5 -> 4 0.4 -> 25
64 KiB Serial 16691 -> 804 6 -> 4 4 -> 82
256 KiB Serial 41601 -> 2750 9 -> 4 6 -> 95
4 MiB Serial 507039 -> 64856 69 -> 4 8 -> 65
64 MiB Serial 5971350 -> 1826229 1029 -> 4 11 -> 37
4 KiB Parallel 12127 -> 59 5 -> 4 0.3 -> 69
64 KiB Parallel 24879 -> 88 6 -> 4 3 -> 747
256 KiB Parallel 60543 -> 158 9 -> 4 4 -> 1656 (!)
4 MiB Parallel 601633 -> 2504 69 -> 4 7 -> 1675 (!)
64 MiB Parallel 3611204 -> 97334 1029 -> 4 19 -> 689

The remaining 4 allocs/op is from test-harness Buffer wrappers rather than the chunk buffer itself.

A similar make() per Read() pattern exists in pkg/blobstore/grpcclients/cas_blob_access.go's zstdByteStreamChunkReader and is left for a follow-up PR.

matshch added 2 commits May 14, 2026 09:26
Exercises readerBackedChunkReader through the public Buffer API at blob sizes spanning the distribution observed in production (4 KiB, 64 KiB, 256 KiB, 4 MiB, 64 MiB) and a parallel variant covering the same range. Each sub-benchmark reports ns/op, allocs/op, and throughput via b.SetBytes() to enable comparison once the implementation changes.
readerBackedChunkReader previously allocated a fresh maximumChunkSizeBytes
slice on every Read() call. With the default 64 KiB chunk size and production
traffic around 20-30 KRPS, this allocation site accounted for the bulk of GC
pressure on bb_storage, with the runtime spending up to 90% of CPU in GC under
load. Go's GC is also hardcoded to use at most 25% of available CPU threads,
so the remaining cores stayed idle.

The reader now owns a single buffer for its entire lifetime, acquired lazily
on first Read() and returned to a package-private sync.Pool on Close(). The
ChunkReader contract is tightened to reflect the long-standing implicit
invariant that returned slices are only valid until the next Read()/Close()
call -- every in-tree consumer already respected this. multiplexedChunkReader's
existing channel-based barrier already serializes consumers, which keeps buffer
reuse safe across CloneStream().

Benchmark results:

  BlobSize          ns/op (before -> after)  allocs/op   throughput GB/s
    4 KiB Serial           8879 ->     165      5 -> 4   0.4 ->   25
   64 KiB Serial          16691 ->     804      6 -> 4     4 ->   82
  256 KiB Serial          41601 ->    2750      9 -> 4     6 ->   95
    4 MiB Serial         507039 ->   64856     69 -> 4     8 ->   65
   64 MiB Serial        5971350 -> 1826229   1029 -> 4    11 ->   37
    4 KiB Parallel        12127 ->      59      5 -> 4   0.3 ->   69
   64 KiB Parallel        24879 ->      88      6 -> 4     3 ->  747
  256 KiB Parallel        60543 ->     158      9 -> 4     4 -> 1656
    4 MiB Parallel       601633 ->    2504     69 -> 4     7 -> 1675
   64 MiB Parallel      3611204 ->   97334   1029 -> 4    19 ->  689

The remaining 4 allocs/op is from test-harness Buffer wrappers rather than the chunk buffer itself.

A similar make() per Read() pattern exists in pkg/blobstore/grpcclients/cas_blob_access.go's zstdByteStreamChunkReader and is left for a follow-up PR.
Copy link
Copy Markdown
Member

@EdSchouten EdSchouten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure that this is safe? In the bytestream server we place these chunks in messages that we pass to gRPC's .Send(). Is it safe to mutate/reuse these buffers immediately afterwards?

@matshch
Copy link
Copy Markdown
Author

matshch commented May 15, 2026

From the practical point of view it is safe: the current implementation of grpc-go always marshals the message before returning from Send(), that is what my AI agent noticed and used to come to the conclusion that this code is safe.

But it looks like the contract actually does not allow that, https://pkg.go.dev/github.com/grpc/grpc-go#ServerStream.SendMsg:

It is not safe to modify the message after calling SendMsg. Tracing libraries and stats handlers may use the message lazily.

(and that is the method called in https://github.com/googleapis/go-genproto/blob/3700d4141b60f18b5e65fb06d6b8ccebd3ff6384/googleapis/bytestream/bytestream.pb.go#L842)

The exact same issue was brought up in grpc/grpc-go#8186, and the maintainers do recommend using grpc.PreparedMsg as a workaround. It moves the marshaling outside of the Send call, and passes the already marshaled message around. And during marshaling all the bytes are copied into the new buffer, so it should be safe to change the old one.

Let me know if that works for you, or if you deem this too unsafe to merge.

@MarshallOfSound
Copy link
Copy Markdown
Contributor

Fwiw this was a follow up to my last perf PR that I was still bashing around a bit in a prod cluster.

I can confirm the prod wins just the safety of it gave me pause. I'll compare what I've been running to this PR and see if there's any edge case differences

@matshch
Copy link
Copy Markdown
Author

matshch commented May 18, 2026

We've rolled out pre-grpc.PreparedMsg version in production and it looks functional. We need more time to say for sure, but it looks like both #338 and this one together give us 2x performance boost.

Comment thread pkg/blobstore/grpcservers/byte_stream_server.go Outdated
… buffers

The previous commit made readerBackedChunkReader reuse its 64 KiB
buffer across Read() calls. That alone violates the grpc.SendMsg
contract, which documents that 'It is not safe to modify the message
after calling SendMsg' because stats handlers, tracing, and binlogs
may capture the message lazily.

grpc.PreparedMsg is the documented escape hatch: PreparedMsg.Encode
marshals and copies the wire bytes synchronously, and prepareMsg
short-circuits on *PreparedMsg, so the source []byte is no longer
referenced by gRPC after SendMsg returns. This is the approach
recommended by the grpc-go maintainers for exactly this use case
(grpc/grpc-go#8186).

Applied to both the IDENTITY path and the ZSTD path
(readStreamWriter), since both hand a caller-owned []byte to Send.
@matshch matshch force-pushed the fix-buffer-reuse branch from e2abf2e to 9e0eb0f Compare May 18, 2026 20:45
@matshch matshch requested review from EdSchouten and tamird May 18, 2026 20:53
Copy link
Copy Markdown

@tamird tamird left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Directionally I think this PR is not right. sync.Pool is basically a hack here to attempt to improve the performance of an implementation of ChunkReader but that interface is inherently broken. It is simply not reasonable or possible in the general case to return owned byte slices from an interface - any interface - without forcing oodles of allocations.

The second problem is that the amortized allocations do not actually vanish in this PR because of the continuously allocated PreparedMsg. This PR is just moving the allocations around: before they were in readerBackedChunkReader and now they are in the grpc.PreparedMsg.Encode call.

The benchmarks in this PR are misleading; microbenchmarking around the broken interface doesn't demonstrate anything useful.

I'm planning to send another PR - which is sadly performance-neutral - that removes the broken ChunkReader interface entirely.

@matshch
Copy link
Copy Markdown
Author

matshch commented May 18, 2026

sync.Pool is basically a hack here to attempt to improve the performance of an implementation of ChunkReader but that interface is inherently broken. It is simply not reasonable or possible in the general case to return owned byte slices from an interface - any interface - without forcing oodles of allocations.

I agree, but in the program which whole job is to read some chunks from disk and then chug them into the network — there is not whole lot to do except to manage memory, so that is what we have to optimize.

I'm not quite sure how to approach this issue without owning the buffers, I would be glad to see your PR.

The second problem is that the amortized allocations do not actually vanish in this PR because of the continuously allocated PreparedMsg. This PR is just moving the allocations around: before they were in readerBackedChunkReader and now they are in the grpc.PreparedMsg.Encode call.

You are mistaking two different allocation paths:

  • Allocations from readerBackedChunkReader are gone.
  • Allocations from ServerStream.SendMsg have moved to PreparedMsg.Encode.

So this PR should be performance beneficial, but I am not sure how to demonstrate that easily. I guess we need another benchmark that will hook this server to some client to test out the whole pipeline? I'll measure it empirically on our production soon.

return readErr
}
if writeErr := out.Send(&bytestream.ReadResponse{Data: readBuf}); writeErr != nil {
// readerBackedChunkReader may reuse the slice returned by
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/readerBackedChunkReader/ChunkReader/

The fact that there is a type named readerBackedChunkReader is an implementation of the existing buffer layer. I wouldn't refer to that type here.

b := make([]byte, r.maximumChunkSizeBytes)
n, err := io.ReadFull(r.r, b[:])
if r.buf == nil {
r.buf = getChunkBuffer(r.maximumChunkSizeBytes)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, you're trying to solve the issue that transmitting a large file causes O(n) allocations. However, with this specific change you're trying to reduce this to 'less-than-1' allocation by using a sync.Pool. Is that really necessary?

In other words, why don't we just write this instead?

r.buf = make([]byte, r.maximumChunkSizeBytes)

Sure, it's one more allocation. But are we sure that that's actually part of the bottleneck?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main issue is that bb-storage on our workload is GC-bound:
Stock bb-storage CPU flamegraph

And the biggest source of allocations is this place:
Stock bb-storage allocations flamegraph

We have all kinds of blob sizes in CAS. Many of them do fit in the default chunk size, but we also have a huge tail of blobs which are in the realm of gigabytes:
image

With this distribution of sizes it looks like we need to optimize both small and huge files, that is why I've looked into sync.Pool first, and then found that we can actually reuse one buffer (except for the gRPC part, which is really unfortunate).

@matshch
Copy link
Copy Markdown
Author

matshch commented May 19, 2026

Here is a new flame graph of allocations: we can see that both readerBackedChunkReader and serverStream.SendMsg are gone, now we only see grpc.PreparedMsg.Encode:
PR bb-storage allocations flame graph

Old flame graph to compare side by sideStock bb-storage allocations flame graph

That means:

  • readerBackedChunkReader is successfully optimized.
  • grpc.PreparedMsg does not introduce additional allocations, we just moved allocations outside of serverStream.SendMsg and now control the allocations.

If we speak about CPU load, here is a new CPU flame graph:
PR bb-storage CPU flame graph

Old flame graph to compare side by sideStock bb-storage CPU flame graph

Now GC takes only ~25% of CPU time. The main win here is that it is <=25%:

  • The GC mark phase is limited to 25% of allocatable CPU in Go runtime. This limit is hardcoded in the runtime, you cannot really affect it in any way.
  • If you are GC-bound and GC takes more than 25% of your CPU time, that effectively means that you cannot utilize all allocated CPU, as the GC mark phase cannot take more than 25% of allocatable CPU, and you are waiting for mark phase to finish before you can continue serving other requests.
  • That means that previously bb-storage would never reach the allocated CPU limit, but it would still throttle on GC and make some requests time out.
  • Now GC as the whole is less than 25%, so you can better control the allocated CPU.

I'll run this PR in our cluster a couple more days and report back how we feel, but overall it looks like a huge win for us.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants