// This module exists only to provide a separate page for the implementation
// documentation.

//! Notes on `sharded-slab`'s implementation and design.
//!
//! # Design
//!
//! The sharded slab's design is strongly inspired by the ideas presented by
//! Leijen, Zorn, and de Moura in [Mimalloc: Free List Sharding in
//! Action][mimalloc]. In this report, the authors present a novel design for a
//! memory allocator based on a concept of _free list sharding_.
//!
//! Memory allocators must keep track of what memory regions are not currently
//! allocated ("free") in order to provide them to future allocation requests.
//! The term [_free list_][freelist] refers to a technique for performing this
//! bookkeeping, where each free block stores a pointer to the next free block,
//! forming a linked list. The memory allocator keeps a pointer to the most
//! recently freed block, the _head_ of the free list. To allocate more memory,
//! the allocator pops from the free list by setting the head pointer to the
//! next free block of the current head block, and returning the previous head.
//! To deallocate a block, the block is pushed to the free list by setting its
//! first word to the current head pointer, and the head pointer is set to point
//! to the deallocated block. Most implementations of slab allocators backed by
//! arrays or vectors use a similar technique, where pointers are replaced by
//! indices into the backing array.
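//!
//! For illustration only (this is not `sharded-slab`'s actual code), a minimal
//! sketch of such an index-based free list over a `Vec`-backed slab might look
//! like the following, with the hypothetical `SimpleSlab` type standing in for
//! a real slab:
//!
//! ```rust,ignore
//! /// A slot is either occupied by a value or free; a free slot stores the
//! /// index of the next free slot, forming an index-based linked list.
//! enum Slot<T> {
//!     Occupied(T),
//!     Free(Option<usize>),
//! }
//!
//! struct SimpleSlab<T> {
//!     slots: Vec<Slot<T>>,
//!     /// Index of the most recently freed slot: the head of the free list.
//!     free_head: Option<usize>,
//! }
//!
//! impl<T> SimpleSlab<T> {
//!     fn insert(&mut self, value: T) -> usize {
//!         match self.free_head {
//!             Some(head) => {
//!                 // Pop the head of the free list: the popped slot's stored
//!                 // "next" index becomes the new head.
//!                 if let Slot::Free(next) = &self.slots[head] {
//!                     self.free_head = *next;
//!                 }
//!                 self.slots[head] = Slot::Occupied(value);
//!                 head
//!             }
//!             None => {
//!                 // The free list is empty: grow the backing vector instead.
//!                 self.slots.push(Slot::Occupied(value));
//!                 self.slots.len() - 1
//!             }
//!         }
//!     }
//!
//!     fn remove(&mut self, index: usize) {
//!         // Push the freed slot onto the free list: it records the old head
//!         // as its "next" index and becomes the new head.
//!         self.slots[index] = Slot::Free(self.free_head);
//!         self.free_head = Some(index);
//!     }
//! }
//! ```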
//!
//! When allocations and deallocations can occur concurrently across threads,
//! they must synchronize accesses to the free list; either by putting the
//! entire allocator state inside of a lock, or by using atomic operations to
//! treat the free list as a lock-free structure (such as a [Treiber stack]). In
//! both cases, there is a significant performance cost — even when the free
//! list is lock-free, it is likely that a noticeable amount of time will be
//! spent in compare-and-swap loops. Ideally, the global synchronization point
//! created by the single global free list could be avoided as much as possible.
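//!
//! For example, popping from a global, index-based free list treated as a
//! Treiber-style stack looks roughly like the sketch below (illustrative only,
//! and ignoring the ABA problem that a production version must also handle);
//! note the compare-and-swap loop that every allocating thread contends on:
//!
//! ```rust,ignore
//! use std::sync::atomic::{AtomicUsize, Ordering};
//!
//! /// Sentinel index meaning "end of the list" (an assumption of this sketch).
//! const NIL: usize = usize::MAX;
//!
//! /// Pop the head of a global free list shared by every thread. `head` holds
//! /// the index of the first free slot, and `next[i]` holds the index of the
//! /// free slot after slot `i`.
//! fn pop_free(head: &AtomicUsize, next: &[AtomicUsize]) -> Option<usize> {
//!     let mut curr = head.load(Ordering::Acquire);
//!     loop {
//!         if curr == NIL {
//!             return None; // the free list is empty
//!         }
//!         let succ = next[curr].load(Ordering::Acquire);
//!         // Every allocating thread races on this compare-and-swap; under
//!         // contention, a noticeable amount of time is spent retrying here.
//!         match head.compare_exchange_weak(
//!             curr,
//!             succ,
//!             Ordering::AcqRel,
//!             Ordering::Acquire,
//!         ) {
//!             Ok(_) => return Some(curr),
//!             Err(observed) => curr = observed,
//!         }
//!     }
//! }
//! ```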
//!
//! The approach presented by Leijen, Zorn, and de Moura is to introduce
//! sharding and thus increase the granularity of synchronization significantly.
//! In mimalloc, the heap is _sharded_ so that each thread has its own
//! thread-local heap. Objects are always allocated from the local heap of the
//! thread where the allocation is performed. Because allocations are always
//! done from a thread's local heap, they need not be synchronized.
//!
//! However, since objects can move between threads before being deallocated,
//! _deallocations_ may still occur concurrently. Therefore, Leijen et al.
//! introduce a concept of _local_ and _global_ free lists. When an object is
//! deallocated on the same thread it was originally allocated on, it is placed
//! on the local free list; if it is deallocated on another thread, it goes on
//! the global free list for the heap of the thread from which it originated. To
//! allocate, the local free list is used first; if it is empty, the entire
//! global free list is popped onto the local free list. Since the local free
//! list is only ever accessed by the thread it belongs to, it does not require
//! synchronization at all, and because the global free list is popped from
//! infrequently, the cost of synchronization has a reduced impact. A majority
//! of allocations can occur without any synchronization at all; and
//! deallocations only require synchronization when an object has left its
//! parent thread (a relatively uncommon case).
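//!
//! A hedged sketch of that allocation strategy, using index-based free lists
//! and a hypothetical `Shard` type (again, not this crate's actual code), might
//! look like:
//!
//! ```rust,ignore
//! use std::sync::atomic::{AtomicUsize, Ordering};
//!
//! /// Sentinel index meaning "empty list" (an assumption of this sketch).
//! const NIL: usize = usize::MAX;
//!
//! struct Shard {
//!     /// Head of the local free list; only the owning thread ever touches
//!     /// it, so it needs no synchronization.
//!     local_head: usize,
//!     /// Head of the global ("remote") free list; other threads push freed
//!     /// slots here, so it must be atomic.
//!     remote_head: AtomicUsize,
//! }
//!
//! impl Shard {
//!     /// Allocate a slot index, preferring the unsynchronized local list.
//!     fn alloc(&mut self, next: &[AtomicUsize]) -> Option<usize> {
//!         if self.local_head == NIL {
//!             // The local list is empty: take the *entire* global list in a
//!             // single atomic swap and make it the new local list.
//!             self.local_head = self.remote_head.swap(NIL, Ordering::Acquire);
//!         }
//!         let slot = self.local_head;
//!         if slot == NIL {
//!             return None; // both lists are empty; the caller must grow
//!         }
//!         self.local_head = next[slot].load(Ordering::Relaxed);
//!         Some(slot)
//!     }
//! }
//! ```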
//!
//! [mimalloc]: https://www.microsoft.com/en-us/research/uploads/prod/2019/06/mimalloc-tr-v1.pdf
//! [freelist]: https://en.wikipedia.org/wiki/Free_list
//! [Treiber stack]: https://en.wikipedia.org/wiki/Treiber_stack
//!
//! # Implementation
//!
//! A slab is represented as an array of [`MAX_THREADS`] _shards_. A shard
//! consists of a vector of one or more _pages_ plus associated metadata.
//! Finally, a page consists of an array of _slots_, plus head indices for the
//! local and remote free lists.
//!
//! ```text
//! ┌─────────────┐
//! │ shard 1     │
//! │             │    ┌─────────────┐        ┌────────┐
//! │ pages───────┼───▶│ page 1      │        │        │
//! ├─────────────┤    ├─────────────┤  ┌────▶│  next──┼─┐
//! │ shard 2     │    │ page 2      │  │     ├────────┤ │
//! ├─────────────┤    │             │  │     │XXXXXXXX│ │
//! │ shard 3     │    │ local_head──┼──┘     ├────────┤ │
//! └─────────────┘    │ remote_head─┼──┐     │        │◀┘
//!       ...          ├─────────────┤  │     │  next──┼─┐
//! ┌─────────────┐    │ page 3      │  │     ├────────┤ │
//! │ shard n     │    └─────────────┘  │     │XXXXXXXX│ │
//! └─────────────┘          ...        │     ├────────┤ │
//!                    ┌─────────────┐  │     │XXXXXXXX│ │
//!                    │ page n      │  │     ├────────┤ │
//!                    └─────────────┘  │     │        │◀┘
//!                                     └────▶│  next──┼───▶  ...
//!                                            ├────────┤
//!                                            │XXXXXXXX│
//!                                            └────────┘
//! ```
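//!
//! The hierarchy shown above can be summarized by the following simplified type
//! definitions. These are only a sketch of the shape of the data structure, not
//! the crate's actual types, which carry additional metadata and atomics; the
//! field names follow the diagram:
//!
//! ```rust,ignore
//! use std::sync::atomic::AtomicUsize;
//!
//! struct Slab<T> {
//!     /// One shard per thread, up to `MAX_THREADS`.
//!     shards: Box<[Shard<T>]>,
//! }
//!
//! struct Shard<T> {
//!     /// Pages are allocated lazily, in order of increasing size.
//!     pages: Vec<Page<T>>,
//! }
//!
//! struct Page<T> {
//!     /// Head index of the local free list (touched only by the owning thread).
//!     local_head: usize,
//!     /// Head index of the remote free list (pushed to by other threads).
//!     remote_head: AtomicUsize,
//!     slots: Box<[Slot<T>]>,
//! }
//!
//! struct Slot<T> {
//!     /// The stored value, if the slot is currently occupied.
//!     value: Option<T>,
//!     /// Index of the next slot in whichever free list this slot is on.
//!     next: usize,
//! }
//! ```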
//!
//! The size of the first page in a shard is always a power of two, and every
//! subsequent page added after the first is twice as large as the page that
//! precedes it.
//!
//! ```text
//!
//!  pg.
//! ┌───┐   ┌─┬─┐
//! │ 0 │───▶ │ │
//! ├───┤   ├─┼─┼─┬─┐
//! │ 1 │───▶ │ │ │ │
//! ├───┤   ├─┼─┼─┼─┼─┬─┬─┬─┐
//! │ 2 │───▶ │ │ │ │ │ │ │ │
//! ├───┤   ├─┼─┼─┼─┼─┼─┼─┼─┼─┬─┬─┬─┬─┬─┬─┬─┐
//! │ 3 │───▶ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
//! └───┘   └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
//! ```
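//!
//! Concretely, page `i` holds `INITIAL_PAGE_SIZE << i` slots. For example, with
//! an initial page size of 32 (an example value standing in for the configured
//! `INITIAL_PAGE_SIZE`), the pages hold 32, 64, 128, 256, ... slots:
//!
//! ```rust,ignore
//! // Example value only; the real value comes from the slab's configuration.
//! const INITIAL_PAGE_SIZE: usize = 32;
//!
//! /// The number of slots in page `i`: each page is twice as large as the one
//! /// before it, and all sizes are powers of two.
//! fn page_size(i: usize) -> usize {
//!     INITIAL_PAGE_SIZE << i
//! }
//!
//! fn main() {
//!     assert_eq!(page_size(0), 32);
//!     assert_eq!(page_size(1), 64);
//!     assert_eq!(page_size(2), 128);
//!     assert_eq!(page_size(3), 256);
//! }
//! ```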
//!
//! When searching for a free slot, the smallest page is searched first; if it
//! is full, the search proceeds to the next page, and so on, until either a
//! free slot is found or every currently allocated page has been searched. If
//! every existing page is full and the maximum number of pages has not yet
//! been reached, a new page is then allocated.
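//!
//! A self-contained sketch of that search order follows. Here a `Page` is
//! reduced to a simple bump counter purely to show the order of the search;
//! the real pages hand out slots from their free lists as described above:
//!
//! ```rust,ignore
//! struct Page {
//!     used: usize,
//!     capacity: usize,
//! }
//!
//! impl Page {
//!     fn try_alloc(&mut self) -> Option<usize> {
//!         if self.used < self.capacity {
//!             let slot = self.used;
//!             self.used += 1;
//!             Some(slot)
//!         } else {
//!             None // this page is full
//!         }
//!     }
//! }
//!
//! /// Search pages from smallest to largest, allocating a new page (twice the
//! /// size of the last one) only if every existing page is full and the
//! /// configured maximum has not been reached.
//! fn alloc_slot(
//!     pages: &mut Vec<Page>,
//!     initial_size: usize,
//!     max_pages: usize,
//! ) -> Option<(usize, usize)> {
//!     for (i, page) in pages.iter_mut().enumerate() {
//!         if let Some(slot) = page.try_alloc() {
//!             return Some((i, slot));
//!         }
//!     }
//!     if pages.len() < max_pages {
//!         let i = pages.len();
//!         pages.push(Page { used: 0, capacity: initial_size << i });
//!         return pages[i].try_alloc().map(|slot| (i, slot));
//!     }
//!     None // every page is full and no more pages may be allocated
//! }
//! ```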
//!
//! Since every page is twice as large as the previous page, and all page sizes
//! are powers of two, we can determine the page index that contains a given
//! address by shifting the address down by the smallest page size and
//! looking at how many twos places are necessary to represent that number,
//! which tells us what power of two page size it fits inside of. We can
//! determine the number of twos places by counting the number of leading
//! zeros (unused twos places) in the number's binary representation, and
//! subtracting that count from the total number of bits in a word.
//!
//! The formula for determining the page number that contains an offset is thus:
//!
//! ```rust,ignore
//! WIDTH - ((offset + INITIAL_PAGE_SIZE) >> INDEX_SHIFT).leading_zeros()
//! ```
//!
//! where `WIDTH` is the number of bits in a `usize`, and `INDEX_SHIFT` is
//!
//! ```rust,ignore
//! INITIAL_PAGE_SIZE.trailing_zeros() + 1;
//! ```
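//!
//! As a worked example, assume a hypothetical `INITIAL_PAGE_SIZE` of 32, so
//! that `INDEX_SHIFT` is 6. Page 0 then holds offsets `0..32`, page 1 holds
//! `32..96`, page 2 holds `96..224`, and the formula recovers exactly those
//! page numbers:
//!
//! ```rust,ignore
//! const INITIAL_PAGE_SIZE: usize = 32; // example value only
//! const INDEX_SHIFT: usize = INITIAL_PAGE_SIZE.trailing_zeros() as usize + 1;
//! const WIDTH: usize = usize::BITS as usize;
//!
//! fn page_index(offset: usize) -> usize {
//!     WIDTH - ((offset + INITIAL_PAGE_SIZE) >> INDEX_SHIFT).leading_zeros() as usize
//! }
//!
//! fn main() {
//!     // Page 0 holds offsets 0..32, page 1 holds 32..96, page 2 holds 96..224.
//!     assert_eq!(page_index(0), 0);
//!     assert_eq!(page_index(31), 0);
//!     assert_eq!(page_index(32), 1);
//!     assert_eq!(page_index(95), 1);
//!     assert_eq!(page_index(96), 2);
//!     assert_eq!(page_index(224), 3);
//! }
//! ```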
//!
//! [`MAX_THREADS`]: https://docs.rs/sharded-slab/latest/sharded_slab/trait.Config.html#associatedconstant.MAX_THREADS