
Conversation

svilupp commented Mar 9, 2023

Proposal for #132

This is an initial proposal to tackle in-place decoding by providing the correctly-sized buffer.

It would be hugely beneficial for the Arrow format, which is the workhorse of modern data analytics (even Pandas is moving to it as a backend). I've done a quick benchmark and Arrow.jl is among the slowest parsers, especially with compressed files. Based on profiling, the time spent resizing buffers in transcode() is roughly the same as the decoding itself!
Thanks to the Arrow IPC file specification, we always know the output size of a field; however, at the moment we cannot take advantage of it.

This PR defines a new function transcode! that writes into a user-provided Buffer.
The benefits are quantified in the code snippet below (e.g., 3x faster processing of large texts).
I haven't yet explored the failure modes; I just wanted to start the discussion.

using TranscodingStreams, CodecZstd
using BenchmarkTools


text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean sollicitudin
mauris non nisi consectetur, a dapibus urna pretium. Vestibulum non posuere
erat. Donec luctus a turpis eget aliquet. Cras tristique iaculis ex, eu
malesuada sem interdum sed. Vestibulum ante ipsum primis in faucibus orci luctus
et ultrices posuere cubilia Curae; Etiam volutpat, risus nec gravida ultricies,
erat ex bibendum ipsum, sed varius ipsum ipsum vitae dui.
"""

# Array API
decompressor = ZstdDecompressor()
compressed = transcode(ZstdCompressor, text)
@assert sizeof(compressed) < sizeof(text)
@assert transcode(decompressor, compressed) == Vector{UInt8}(text)

# Proposed in-place API, writes directly to output buffer.
output_length = ncodeunits(text)
output = TranscodingStreams.Buffer(output_length)
TranscodingStreams.transcode!(decompressor, compressed, output);
@assert output.data == Vector{UInt8}(text)


# Benchmark on smaller text
compressed = transcode(ZstdCompressor, text)
output = TranscodingStreams.Buffer(ncodeunits(text))
@btime transcode($decompressor, $compressed);
# 2.120 μs (4 allocations: 736 bytes)

@btime(TranscodingStreams.transcode!($decompressor, $compressed, output), evals=1, setup=(output = deepcopy($output)));
# 2.041 μs (2 allocations: 64 bytes)

# Benchmark on longer text
longtext = text^100
compressed = transcode(ZstdCompressor, longtext)
@btime transcode($decompressor, $compressed);
# 10.167 μs (10 allocations: 232.92 KiB)

@btime output = TranscodingStreams.Buffer(ncodeunits($longtext));
# 443.409 ns (3 allocations: 43.47 KiB)
@btime(TranscodingStreams.transcode!($decompressor, $compressed, output), evals=1, setup=(output = deepcopy($output)));
# 3.791 μs (2 allocations: 64 bytes)
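
To make the Arrow use-case concrete, here is a hypothetical sketch (the decode_field name and signature are illustrative, not part of this PR) of how a consumer that knows the uncompressed length up front, as Arrow IPC does, could call the proposed transcode!:

```julia
using TranscodingStreams, CodecZstd

# Hypothetical helper (illustrative only): decode a compressed field whose
# uncompressed length is known ahead of time, as for Arrow IPC buffers.
function decode_field(decompressor, compressed::Vector{UInt8}, uncompressed_length::Int)
    # Allocate the output exactly once at the known size...
    output = TranscodingStreams.Buffer(uncompressed_length)
    # ...and decode straight into it, skipping transcode()'s buffer resizing.
    TranscodingStreams.transcode!(decompressor, compressed, output)
    return output.data
end
```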

Moelf commented Mar 9, 2023

This is great! I'm not very familiar with the internal Buffer type, but eventually I want the byte data back. Can I assume I can construct a Buffer myself and extract the bytes out afterwards?
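
For reference, a minimal sketch of doing exactly that, assuming only the Buffer(size) constructor and the .data field already used in this PR's snippet (Buffer is an internal type, so this is not a stable public API):

```julia
using TranscodingStreams

# Construct the internal Buffer directly, sized to the expected output.
buf = TranscodingStreams.Buffer(16)

# The bytes live in the underlying Vector{UInt8}, so they can be
# extracted after transcode! has filled them in.
@assert buf.data isa Vector{UInt8}
@assert length(buf.data) == 16
```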

@baumgold
Contributor

@svilupp - this suggestion looks great to me. Can this PR be moved out of draft status so it can be reviewed/approved/merged/released? It seems quite a useful feature, so I'd like to get it out there ASAP. Thanks!

@baumgold
Contributor

I think we can implement this without as much code duplication. See #136. Thoughts?

svilupp closed this Mar 15, 2023

Moelf commented Mar 16, 2023

Thanks, @svilupp, for the initiative!
