During the development of the Concurrency Kit hash set and hash table, detailed microbenchmarks were used to measure the latency and variability of various operations relative to various open-source hash table implementations. For read-mostly workloads, the implementation was at least twice as fast as Google Dense Hash on reference machines, even though it provides stronger forward-progress guarantees for concurrent workloads. For example, it is lock-free for readers and wait-free for writers in single-writer many-reader use-cases. However, a recent use-case required the hash table implementations to handle delete-heavy workloads. As with many open-addressing schemes, the implementation failed miserably under this workload due to tombstone accumulation. The strength of the collision resolution mechanism would very quickly lead to the complete elimination of probe sequence termination markers (empty slots) in favor of tombstones. Google Dense Hash performed in a stable manner under the delete-heavy workloads, at the cost of a higher operation completion floor for more favorable workloads (due to the increased clumping of the quadratic probing it uses). The performance gap for delete-heavy workloads has since been closed, with some trade-offs.
A fellow engineer at AppNexus who had the delete-heavy use-case wrote a self-contained program that executes long-lived interactions with the hash table. A variant of this benchmark was used to generate this data. The source code is not available, but I urge readers to implement their own benchmarks and share their results. The use-case here involves data in which value objects can be derived from key pointers, so pointer packing was enabled for the hash set. I do hope to work on a much more detailed hash table benchmark suite in the future that addresses not only operation latency distribution, but also longer-lived dynamic workloads.
This analysis is by no means complete, and typically I would use more granular value distribution metrics for these benchmarks. In addition, gettimeofday is used for timing, so the measurements include vsyscall overhead. Values were normalized to ease human consumption of the graphs. If anything, though, this benchmark does showcase the promising performance properties of the Concurrency Kit hash table implementations, even in the absence of concurrent workloads that require fast lock-free progress guarantees. It also highlights the performance advantages of the excellent Google Dense Hash.
These benchmarks were run on RHEL5 machines (glibc-2.5-81) with Linux kernel 2.6.32-100.0.19.el5. The processor is an Intel Xeon L5640 at 2.27GHz. The benchmarks were run under the SCHED_FIFO scheduling policy at the highest priority, with strong affinity to core 2 on the system.
These graphs represent data from a simple get-put workload. Keys are 40-byte
randomly generated sequences and the benchmark will simply check
if the key exists in the hash set, and if not, will insert it. This
benchmark is run until 200M entries have accumulated. Hash function
overhead has been factored out, and time has been normalized to
nanoseconds per-operation. No delete operations are executed in this
workload.
The following graph represents the same data set as above, but has been zoomed out to show the full cost of hash resize operations as more entries are inserted.
These graphs represent data from a delete-heavy workload. After adding 10M unique entries (a get-then-put workload) to the hash table, every new entry will delete the oldest entry previously inserted. Hash function overhead has been factored out, and time has been normalized to nanoseconds per-operation. The X-axis represents the number of executed transactions (get, then put, then delete) at increments of 10M transactions.
The slot bound string in ck_hs indicates that the CK_HS_MODE_DELETE hint has been passed to enable stronger probe bound guarantees, and GC indicates a variant in which the writer executed a full-table GC operation (ck_hs_gc) every 300M transactions. Amortized GC (which is less bursty but also sub-optimal) is possible, but has not been measured in this benchmark. GC time was factored into delete time for simplicity of comparison.
—Samy Bahra (Dec 19, 2013)