
Conversation

svilupp commented Mar 9, 2023

Proposal for #132

This is an initial proposal to tackle in-place decoding by providing the correctly-sized buffer.

It would be hugely beneficial for the Arrow format, which is the workhorse of modern data analytics (even Pandas is moving to it as a backend). I've done a quick benchmark and Arrow.jl is among the slowest parsers, especially with compressed files. Based on profiling, the time spent resizing buffers in transcode() is roughly the same as the decoding itself!
Thanks to the Arrow IPC file specification, we always know the output size of a field; however, at the moment we cannot take advantage of it.

This PR defines a new function transcode! that writes into a user-provided Buffer.
The benefits are quantified in the code snippet below (e.g., 3x faster processing of large texts).
I haven't yet explored the failure modes; I just wanted to start the discussion.

using TranscodingStreams, CodecZstd
using BenchmarkTools


text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean sollicitudin
mauris non nisi consectetur, a dapibus urna pretium. Vestibulum non posuere
erat. Donec luctus a turpis eget aliquet. Cras tristique iaculis ex, eu
malesuada sem interdum sed. Vestibulum ante ipsum primis in faucibus orci luctus
et ultrices posuere cubilia Curae; Etiam volutpat, risus nec gravida ultricies,
erat ex bibendum ipsum, sed varius ipsum ipsum vitae dui.
"""

# Array API
decompressor = ZstdDecompressor()
compressed = transcode(ZstdCompressor, text)
@assert sizeof(compressed) < sizeof(text)
@assert transcode(decompressor, compressed) == Vector{UInt8}(text)

# Proposed in-place API, writes directly to output buffer.
output_length = ncodeunits(text)
output = TranscodingStreams.Buffer(output_length)
TranscodingStreams.transcode!(decompressor, compressed, output);
@assert output.data == Vector{UInt8}(text)


# Benchmark on smaller text
compressed = transcode(ZstdCompressor, text)
output = TranscodingStreams.Buffer(ncodeunits(text))
@btime transcode($decompressor, $compressed);
# 2.120 μs (4 allocations: 736 bytes)

@btime(TranscodingStreams.transcode!($decompressor, $compressed, output), evals=1, setup=(output = deepcopy($output)));
# 2.041 μs (2 allocations: 64 bytes)

# Benchmark on longer text
longtext = text^100
compressed = transcode(ZstdCompressor, longtext)
@btime transcode($decompressor, $compressed);
# 10.167 μs (10 allocations: 232.92 KiB)

@btime output = TranscodingStreams.Buffer(ncodeunits($longtext));
# 443.409 ns (3 allocations: 43.47 KiB)
@btime(TranscodingStreams.transcode!($decompressor, $compressed, output), evals=1, setup=(output = deepcopy($output)));
# 3.791 μs (2 allocations: 64 bytes)
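
To make the Arrow use-case concrete, here is a hypothetical sketch (the decode_field name and signature are illustrative, not part of this PR) of how a consumer that knows the uncompressed length up front, as Arrow IPC does, could call the proposed transcode!:

```julia
using TranscodingStreams, CodecZstd

# Hypothetical helper (illustrative only): decode a compressed field whose
# uncompressed length is known ahead of time, as for Arrow IPC buffers.
function decode_field(decompressor, compressed::Vector{UInt8}, uncompressed_length::Int)
    # Allocate the output exactly once at the known size...
    output = TranscodingStreams.Buffer(uncompressed_length)
    # ...and decode straight into it, skipping transcode()'s buffer resizing.
    TranscodingStreams.transcode!(decompressor, compressed, output)
    return output.data
end
```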

Moelf commented Mar 9, 2023

This is great! I'm not very familiar with the internal Buffer type, but eventually I want the byte data back. Can I assume I can construct a Buffer myself and extract the bytes out afterwards?
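
For reference, a minimal sketch of doing exactly that, assuming only the Buffer(size) constructor and the .data field already used in this PR's snippet (Buffer is an internal type, so this is not a stable public API):

```julia
using TranscodingStreams

# Construct the internal Buffer directly, sized to the expected output.
buf = TranscodingStreams.Buffer(16)

# The bytes live in the underlying Vector{UInt8}, so they can be
# extracted after transcode! has filled them in.
@assert buf.data isa Vector{UInt8}
@assert length(buf.data) == 16
```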

@baumgold
Contributor

@svilupp - this suggestion looks great to me. Can this PR be moved out of draft status so it can be reviewed/approved/merged/released? It seems quite a useful feature, so I'd like to get it out there ASAP. Thanks!

@baumgold
Contributor

I think we can implement this without as much code duplication. See #136. Thoughts?

svilupp closed this Mar 15, 2023

Moelf commented Mar 16, 2023

Thanks, @svilupp, for the initiative!
