Conversation

@tkf tkf commented May 14, 2021

This patch implements two JuliaFolds interfaces on ChainedVector: the sequential iteration protocol (a.k.a. foldl), using FGenerators.jl syntax, and the SplittablesBase.jl interface for parallel reductions.

Arguably, the dependency tree pulled in via FGenerators.jl, especially Transducers.jl, is rather large. I'm not sure if you want to pull this in at this stage (i.e., you'd probably want to wait until I extract it out as FoldsBase.jl). But I thought it'd be interesting to demonstrate that JuliaFolds' iteration facility can be beneficial not only for parallel reduction but also for sequential iteration. For example, it may make some parts of an optimization like #42 easier.

This patch uses FGenerators.jl, which provides syntactic sugar for defining Transducers.__foldl__. This is mainly because writing __foldl__ by hand is slightly tedious, and also because I may need to tweak the interface at some point to solve some subtle problems in parallel reduction. But I expect the syntactic sugar provided by FGenerators.jl to be more stable.
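For reference, a hand-written __foldl__ for ChainedVector might look roughly like the sketch below. This is only an illustration of the pattern that @fgenerator abstracts away (the @next and complete names are from Transducers.jl); the actual expansion may differ:

    using Transducers
    using Transducers: @next, complete

    # Hedged sketch: a sequential fold over the underlying arrays.
    # @next feeds each element to the reducing function rf and supports
    # early termination; complete finalizes the accumulator.
    function Transducers.__foldl__(rf, acc, A::ChainedVector)
        for array in A.arrays          # iterate chunk by chunk
            for x in array             # tight inner loop over a plain array
                acc = @next(rf, acc, x)
            end
        end
        return complete(rf, acc)
    end

Because the inner loop runs over a plain array with known bounds, the compiler has a much easier time vectorizing it than the state-machine-style iterate of ChainedVector.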

Microbenchmark

A simple summation of ChainedVector{Int} is 4x faster with @floop, which uses foldl as the iteration mechanism. Looking at the LLVM IR, the @floop version is vectorized but the iterate version is not.

julia> using FLoops

julia> function sum_iter(xs)
           acc = zero(eltype(xs))
           for x in xs
               acc += x
           end
           acc
       end
sum_iter (generic function with 1 method)

julia> function sum_foldl(xs)
           @floop begin
               acc = zero(eltype(xs))
               for x in xs
                   acc += x
               end
           end
           acc
       end
sum_foldl (generic function with 1 method)

julia> A = ChainedVector([ones(Int, 2^8) for _ in 1:2^8]);

julia> @btime sum_iter(A)
  43.279 μs (1 allocation: 16 bytes)
65536

julia> @btime sum_foldl(A)
  9.500 μs (1 allocation: 16 bytes)
65536

Note: I'm using Int as the element type so that vectorization is triggered easily. Supporting @simd for floats is possible, but at the moment it requires a rather ugly macro.

I think it's a big win, especially considering that the @yield-based syntax is much simpler than the complex iterate implementation:

@fgenerator(A::ChainedVector) do
    for array in A.arrays
        for x in array
            @yield x
        end
    end
end
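The SplittablesBase.jl side mentioned at the top can be similarly compact. A minimal sketch of a halve method, splitting the chained vector in half by element count, is shown below; this is hypothetical and the actual patch may instead split A.arrays at a chunk boundary so that neither half straddles a chunk:

    using SplittablesBase

    # Hedged sketch: split a ChainedVector into two halves so that a
    # parallel reduction can recurse on each half independently.
    function SplittablesBase.halve(A::ChainedVector)
        mid = length(A) ÷ 2
        return view(A, 1:mid), view(A, mid+1:length(A))
    end

halve is the single entry point parallel folds need: reductions recursively halve the collection down to base-case chunks and combine the partial results.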


@quinnj quinnj (Member) commented May 25, 2021

Woohoo! This is awesome! It's indeed quite painful to eke out as much performance as possible using the standard iteration protocols from Base. So in #42, I basically have to overload every custom array operation from Base to avoid iteration and sequential indexing.

I am worried about the current dependency tree here; this package has become a "foundational" package of the data ecosystem, so it's hard to justify adding such heavy dependencies. I love the idea of FoldsBase.jl, though, which would allow a lightweight "hook" into all the folds/transducers magic.
