A technique worth considering to reduce memory usage when loading data from databases into Arrow is memory pre-allocation.
In this context, memory pre-allocation means reserving the memory that the entire dataset will occupy before downloading any data. This minimizes the amount of volatile memory used. To accomplish it, some metadata is needed up front: the total row count and the data type of each column. Obtaining this metadata adds overhead, which can often be mitigated by optimizing the queries involved, for example by constructing appropriate indexes.
Still, we only minimize memory consumption, getting close to a theoretical minimum but almost never reaching it. This is due to the dynamic nature of some column types, such as strings or dynamic arrays (often called lists), where the real length of each row is unknown in advance.
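To make the idea concrete, here is a minimal sketch of pre-allocating a fixed-width column buffer from metadata. The function name and the 64-byte rounding mirror Arrow's buffer alignment, but this is illustrative code, not the library's API; in practice the row count would come from a metadata query such as `SELECT COUNT(*)`.

```rust
/// Bytes needed for a fixed-width column of `row_count` values, rounded up
/// to a multiple of 64 (mirroring Arrow's buffer alignment). Illustrative
/// helper, not part of any library.
fn preallocation_size(row_count: usize, type_size_bytes: usize) -> usize {
    let required = row_count * type_size_bytes;
    (required + 63) / 64 * 64
}

fn main() {
    // 6 million rows of u32 (4 bytes each): one upfront allocation,
    // instead of many doublings starting from a small buffer.
    let rows = 6_000_000;
    let bytes = preallocation_size(rows, 4);
    let buffer: Vec<u8> = Vec::with_capacity(bytes);
    // The reserved capacity covers the whole column before any row arrives.
    assert!(buffer.capacity() >= rows * 4);
    println!("pre-allocated {} bytes", bytes);
}
```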
Why does pre-allocating save memory?
In the Rust implementation, Arrow builders are dynamic: you can keep appending values. Every time a new value is added, `buffer.reserve` is called:
```rust
#[inline(always)]
pub fn reserve(&mut self, additional: usize) {
    let required_cap = self.len + additional;
    if required_cap > self.layout.size() {
        let new_capacity = bit_util::round_upto_multiple_of_64(required_cap);
        let new_capacity = std::cmp::max(new_capacity, self.layout.size() * 2);
        self.reallocate(new_capacity)
    }
}
```
`additional` is the number of bytes that will be used by the new value(s), calculated as `elements * size_of_type_bytes`.
As values are added, `required_cap` eventually exceeds the currently allocated size, and the buffer is resized to whichever is bigger: the next multiple of 64, or twice the current size. Doubling the allocation is a common technique to avoid many small reallocations, whose overhead adds up. The capacity is always kept at a multiple of 64 for better cache and SIMD performance.
Pre-allocating memory avoids this repeated doubling, along with the copies and transient over-allocation it causes, as values are appended one by one.
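The effect is easy to observe with `std`'s `Vec`, whose amortized doubling strategy is analogous to the buffer growth above (`Vec` does not round to multiples of 64; that detail is Arrow-specific). This sketch counts how many times the backing allocation changes when appending one value at a time versus pre-allocating:

```rust
/// Count how many times a Vec's backing allocation changes while
/// pushing `n` values one by one.
fn count_reallocations(n: usize) -> usize {
    let mut v: Vec<u32> = Vec::new();
    let mut reallocs = 0;
    let mut last_cap = v.capacity();
    for i in 0..n {
        v.push(i as u32);
        if v.capacity() != last_cap {
            reallocs += 1;
            last_cap = v.capacity();
        }
    }
    reallocs
}

fn main() {
    // Appending 1M values one by one triggers a series of doublings...
    let grown = count_reallocations(1_000_000);
    // ...while pre-allocating the exact capacity triggers none.
    let mut pre: Vec<u32> = Vec::with_capacity(1_000_000);
    let cap_before = pre.capacity();
    for i in 0..1_000_000u32 {
        pre.push(i);
    }
    assert_eq!(pre.capacity(), cap_before); // no growth occurred
    println!("append-one-by-one reallocated {} times; pre-allocated 0 times", grown);
}
```

Each reallocation in the growing path may also copy the existing data to the new, larger block, which is the hidden cost pre-allocation removes.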
Let's have a look at one example:
Imagine a u32 builder whose buffer currently takes 100 MB (1e8 bytes). If we append one more item, reserve will be called as reserve(1 * 4), since a u32 takes 4 bytes. The new allocation will be max(100_000_064, 200_000_000) = 2e8 bytes, doubling the currently allocated memory.
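That growth step can be checked directly. The rounding helper below is re-implemented for illustration (the real one lives in arrow's `bit_util`):

```rust
/// Round `n` up to the next multiple of 64 (illustrative re-implementation).
fn round_upto_multiple_of_64(n: usize) -> usize {
    (n + 63) & !63
}

fn main() {
    let current_size = 100_000_000; // 1e8 bytes already allocated
    let required_cap = current_size + 4; // one more u32 appended
    let new_capacity = round_upto_multiple_of_64(required_cap); // 100_000_064
    let new_capacity = std::cmp::max(new_capacity, current_size * 2);
    assert_eq!(new_capacity, 200_000_000); // doubling wins over rounding
    println!("new capacity: {} bytes", new_capacity);
}
```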
Should we pre-allocate?
Memory pre-allocation only makes sense if the saved memory amortizes the time spent fetching the metadata to do it.
To put things into perspective:
TPC-H lineitem 10x (60M rows)
| library | Time (s) | Memory (MB) | Has Index | Pre-Allocated |
| --- | --- | --- | --- | --- |
| conecta | 89.80 | 8320.34 | True | False |
| conecta | 90.80 | 7804.08 | True | True |
| conecta | 105.35 | 8320.34 | False | False |
| conecta | 170.43 | 7804.08 | False | True |
| connectorx | 156.31 | 7695.11 | False | False |
| connectorx | 103.02 | 7695.11 | True | False |
TPC-H lineitem 1x (6M rows)
| library | Time (s) | Memory (MB) | Has Index | Pre-Allocated |
| --- | --- | --- | --- | --- |
| conecta | 1.88 | 147.35 | False | True |
| conecta | 1.83 | 212.40 | False | False |
| conecta | 1.82 | 214.44 | True | False |
| conecta | 1.87 | 147.65 | True | True |
| connectorx | 1.95 | 161.47 | False | False |