Hash table (also called hash map) is a data structure that implements an associative array abstract data type, a structure that can map keys to values.
ββββββββββββββ
β Hash table β
ββββββββββββββ€
ββββββββββββββ€ βββββββββββ βββββββββββ βββββββββββ
β slot 1 ββ>β€ entry 0 ββ>β€ entry 1 ββ...βββ€ entry n ββ...
ββββββββββββββ€ βββββββββββ βββββββββββ βββββββββββ
ββββββββββββββ€ βββββββββββ βββββββββββ βββββββββββ
β slot 2 ββ>β€ entry 0 ββ>β€ entry 1 ββ>β€ entry 2 ββ...
ββββββββββββββ€ βββββββββββ βββββββββββ βββββββββββ
β ... β
ββββββββββββββ€
β slot 255 ββ...
ββββββββββββββ
A chained hash table fundamentally consists of an array of linked lists. Each list forms a slot in which we place all elements hashing to a specific position in the array (see figure above).
Hash collision is when two pieces of data in a hash table share the same hash value. The hash value in this case is derived from a hash function.
ββββββββββββββββββββ¦βββββββββββββββββββββββββββββββββββββββββββββββββ β Hash function β Description β ββββββββββββββββββββ©βββββββββββββββββββββββββββββββββββββββββββββββββ ββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββ β one_hash β h(x) = 1 β ββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββ€ β first_ascii_hash β h(x) = x[0] β ββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββ€ β sum_ascii_hash β h(x) = x[0] + x[1] + ... + x[len(x)] β ββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββ€ β ror_hash β for i in len(x): h(x) = x[i] xor (h(x) ror 1) β ββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββ€ β length_hash β h(x) = len(x) β ββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββ€ β wsum_ascii_hash β h(x) = 1*x[0] + 2*x[1] + ... β ββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββ€ β crc32_hash β ... β ββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββββββββββ
Collisions test is described on the figure below:
ββββββββββ βββββββββββββββββββββββ ββββββββββββββββββ β 824528 β β Load to hash table β β Get collisions β β words β >>> β and calculate words β >>> β number β β array β β duplicates β β β ββββββββββ βββββββββββββββββββββββ ββββββββββββββββββ
After analyzing the figures below, we can say that hashes sum_ascii_hash, crc32_hash and ror_hash can be used as 'good' hash functions.
Note that the hash comparison was done with a fixed hash table size of 256. And it cannot be extrapolated to large table sizes.
'Premature optimization is the root of all evil' Donald Knuth
Almost impossible to defeat compiler when optimizing C code. Also it is stupid idea to compare -O2 optimized code with a strange -O'default' solutions. And every problem we are having is because of playing by the compiler rules. However, there are ways to defeat C compiler.
The performance test is super simple. We'll feed the Bible text right into our hash table 100 times. perf linux utility is used to analyze performance.
Now we are playing by our rules. That is why we will choose 32-byte buffer to hold the hash table key. But to be honest we will tell the compiler about the keys size.
Hash table key format:
βββ¬ββ¬ββ¬ββ¬ββ¬ββ¬ββ¬...ββ¬ββ¬ββ¬ββ
β β β β β β β β β β β β
βββ΄ββ΄ββ΄ββ΄ββ΄ββ΄ββ΄...ββ΄ββ΄ββ΄ββ
ββββββββββββββββββββββββββ
32 bytes
For the comparation we will choose -O2 optimized program described below.
(Before optimization)
25β―522β―189β―344 cycles:u # 4,342 GHz
48β―879β―342β―457 instructions:u # 1,92 insn per cycle
9β―299β―653β―700 branches:u # 1,582 G/sec
243β―804β―891 branch-misses:u # 2,62% of all branches
5,893119184 seconds time elapsed
Let's find the heaviest functions being called:
43,15% perf perf [.] crc32_hash 31,92% perf libc.so.6 [.] __strncmp_avx2 16,93% perf perf [.] htab_list_find 4,67% perf perf [.] htab_find 1,72% perf perf [.] strncmp@plt
The heaviest ones are crc32_hash, __strncmp_avx2 and htab_list_find. The code is listed below:
/* crc32_hash */
hash_t crc32_hash(hkey *key)
{
unsigned int byte = 0, mask = 0;
unsigned int crc = 0xFFFFFFFF;
/* 2,88 β20: mov %eax,%ecx
16,82 β and $0x1,%eax
26,77 β neg %eax
2,72 β shr %ecx
13,35 β and $0xedb88320,%eax
26,41 β xor %ecx,%eax */
for (int i = 0; (*key)[i] != 0; i++) {
byte = (*key)[i];
crc = crc ^ byte;
for (int j = 7; j >= 0; j--) {
mask = -(crc & 1);
crc = (crc >> 1) ^ (0xEDB88320 & mask);
}
}
return (hash_t)(~crc);
}int compare_keys(hkey *k1, hkey *k2)
{
/* __strncmp_avx2 */
/* Here compiler chooses avx2 strncmp version */
return strncmp(*k1, *k2, 32);
}/* htab_list_find */
ptrdiff_t htab_list_find(list *const lst, hkey *key)
{
/* 0,02 β4d: mov $0x20,%edx
β mov %r13,%rsi
9,69 β mov %rbx,%rdi
0,53 β β call strncmp@plt
β test %eax,%eax
8,08 β β jne 38
25,14 β61: add $0x8,%rsp
β mov %r12,%rax
8,61 β pop %rbx */
ptrdiff_t cur = lst->head;
while (cur && compare_keys(&lst->nodes[cur].data.key, key))
cur = lst->nodes[cur].next;
return cur;
}If you want to make something great - make it by yourself. It means if compiler "can't understand" that we need crc32 hash, let's write it by hands.
crc32_sse:
mov eax, 0x04c11db7
lea rsi, [rdi + 32]
.hash:
crc32 eax, dword [rdi]
add rdi, 0x4
cmp rdi, rsi
jne .hash
retAfter perf analysis we get:
β crc32_sse.hash():
91,21 β0: crc32l (%rdi),%eax
0,62 β add $0x4,%rdi
β cmp %rsi,%rdi
7,25 β β jne 0
0,92 β β ret
More than 90% of function is occupied by crc32 command. I think we did our best...
The idea is to use YMM registers as key value type.
#include <immintrin.h>
typedef __m256i hkey;Then inline compare_keys. There are two ways of doing that.
How to inline compare_keys()?
β
ββββββββββββββββ΄ββββββββββββββ
Inline asm Intrinsics
I hate it. Wonderful assembly
It makes both programmer's alternative.
and compiler's lives miserable
Two ways are coded below:
/* Intrinsics version */
ptrdiff_t htab_list_find(list *const lst, hkey *key)
{
...
/* 10,67 β40:β mov 0x40(%rcx),%rdx
0,87 β β test %rdx,%rdx
β ββ je 90
0,42 β49:β lea (%rdx,%rdx,2),%rdx
3,10 β β shl $0x5,%rdx
1,74 β β lea (%rdi,%rdx,1),%rcx
here ---> 51,53 β β vpcmpeqq (%rcx),%ymm1,%ymm0
0,02 β β vpmovmskb %ymm0,%esi
0,65 β β cmp $0xffffffff,%esi
2,86 β ββ jne 40
β β test %r13,%r13
9,59 β ββ je 76 */
int find = _mm256_movemask_epi8(
_mm256_cmpeq_epi64(
_mm256_load_si256(&lst->nodes[cur].data.key),
_mm256_load_si256(key)
)/* Inline assembly verison */
ptrdiff_t htab_list_find(list *const lst, hkey *key)
{
...
/* 9,08 β40: mov 0x40(%rcx),%rdx
0,96 β test %rdx,%rdx
β β je 90
0,49 β49: lea (%rdx,%rdx,2),%rdx
3,32 β shl $0x5,%rdx
1,46 β lea (%rdi,%rdx,1),%rcx
Both ---> 29,10 β vmovdqa (%rcx),%ymm1
here ---> 21,38 β vpcmpeqq %ymm1,%ymm0,%ymm1
0,16 β vpmovmskb %ymm1,%esi
0,59 β cmp $0xffffffff,%esi
2,71 β β jne 40
β test %r13,%r13
9,74 β β je 7a */
asm (
"vpcmpeqq %[k0], %[k1], %[k0];"
"vpmovmskb %[k0], %[cmp];"
: [cmp] "=r" (find)
: [k0] "x" (lst->nodes[cur].data.key), [k1] "x" (*key)
);Looking ahead a bit, Iβll say that both solutions give approximately the same performance boost.
After optimizations we can see the following results (around 2x boost):
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β Just -O2 β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β 25β―522β―189β―344 cycles:u # 4,342 GHz β β 48β―879β―342β―457 instructions:u # 1,92 insn per cycle β β 9β―299β―653β―700 branches:u # 1,582 G/sec β β 243β―804β―891 branch-misses:u # 2,62% of all branches β β β β 5,893119184 seconds time elapsed β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β 43,15% perf perf [.] crc32_hash β β 31,92% perf libc.so.6 [.] __strncmp_avx2 β β 16,93% perf perf [.] htab_list_find β β 4,67% perf perf [.] htab_find β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β Intrinsics + crc32_sse version + -O2 β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β 13β―550β―343β―744 cycles:u # 4,429 GHz β β 10β―736β―726β―776 instructions:u # 0,79 insn per cycle β β 2β―276β―226β―023 branches:u # 744,046 M/sec β β 94β―733β―400 branch-misses:u # 4,16% of all branches β β β β 3,060782918 seconds time elapsed β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β 82,83% perf perf [.] htab_find β β 13,91% perf perf [.] crc32_sse.hash β β 1,84% perf perf [.] main β β 0,56% perf perf [.] parse_words β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β Inline asm + crc32_sse + -O2 β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β 13β―817β―734β―807 cycles:u # 4,330 GHz β β 11β―172β―228β―370 instructions:u # 0,81 insn per cycle β β 2β―276β―226β―672 branches:u # 713,359 M/sec β β 94β―618β―194 branch-misses:u # 4,16% of all branches β β β β 3,200113053 seconds time elapsed β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ β β β 83,47% perf perf [.] htab_find β β 13,39% perf perf [.] crc32_sse.hash β β 1,79% perf perf [.] main β β 0,55% perf perf [.] parse_words β β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Let me present some numerical estimates of the results of comparing the performance:
- Around 2x performance boost
- ded32 performance coefficient:
