// This module exists only to provide a separate page for the implementation
// documentation.

//! Notes on `sharded-slab`'s implementation and design.
//!
//! # Design
//!
//! The sharded slab's design is strongly inspired by the ideas presented by
//! Leijen, Zorn, and de Moura in [Mimalloc: Free List Sharding in
//! Action][mimalloc]. In this report, the authors present a novel design for a
//! memory allocator based on a concept of _free list sharding_.
//!
//! Memory allocators must keep track of what memory regions are not currently
//! allocated ("free") in order to provide them to future allocation requests.
//! The term [_free list_][freelist] refers to a technique for performing this
//! bookkeeping, where each free block stores a pointer to the next free block,
//! forming a linked list. The memory allocator keeps a pointer to the most
//! recently freed block, the _head_ of the free list. To allocate more memory,
//! the allocator pops from the free list by setting the head pointer to the
//! next free block of the current head block, and returning the previous head.
//! To deallocate a block, the block is pushed to the free list by setting its
//! first word to the current head pointer, and the head pointer is set to point
//! to the deallocated block. Most implementations of slab allocators backed by
//! arrays or vectors use a similar technique, where pointers are replaced by
//! indices into the backing array.
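//!
//! As a concrete illustration of the array-backed variant (a sketch only, not
//! this crate's actual code; the `ArraySlab` type and its methods are
//! hypothetical), popping and pushing the free list looks like this:
//!
//! ```rust,ignore
//! const NULL: usize = usize::MAX; // sentinel marking the end of the free list
//!
//! struct ArraySlab<T> {
//!     slots: Vec<Option<T>>,
//!     next_free: Vec<usize>, // per-slot "next free slot" links
//!     free_head: usize,      // index of the most recently freed slot
//! }
//!
//! impl<T> ArraySlab<T> {
//!     /// Pop a slot off the free list and store `value` in it.
//!     fn insert(&mut self, value: T) -> Option<usize> {
//!         let idx = self.free_head;
//!         if idx == NULL {
//!             return None; // no free slots; a real slab would grow here
//!         }
//!         self.free_head = self.next_free[idx];
//!         self.slots[idx] = Some(value);
//!         Some(idx)
//!     }
//!
//!     /// Take the value out of `idx` and push the slot back onto the free list.
//!     fn remove(&mut self, idx: usize) -> Option<T> {
//!         let value = self.slots[idx].take()?;
//!         self.next_free[idx] = self.free_head;
//!         self.free_head = idx;
//!         Some(value)
//!     }
//! }
//! ```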
//!
//! When allocations and deallocations can occur concurrently across threads,
//! they must synchronize accesses to the free list; either by putting the
//! entire allocator state inside of a lock, or by using atomic operations to
//! treat the free list as a lock-free structure (such as a [Treiber stack]). In
//! both cases, there is a significant performance cost — even when the free
//! list is lock-free, it is likely that a noticeable amount of time will be
//! spent in compare-and-swap loops. Ideally, the global synchronization point
//! created by the single global free list could be avoided as much as possible.
//!
//! The approach presented by Leijen, Zorn, and de Moura is to introduce
//! sharding and thus increase the granularity of synchronization significantly.
//! In mimalloc, the heap is _sharded_ so that each thread has its own
//! thread-local heap. Objects are always allocated from the local heap of the
//! thread where the allocation is performed. Because allocations are always
//! done from a thread's local heap, they need not be synchronized.
//!
//! However, since objects can move between threads before being deallocated,
//! _deallocations_ may still occur concurrently. Therefore, Leijen et al.
//! introduce a concept of _local_ and _global_ free lists. When an object is
//! deallocated on the same thread it was originally allocated on, it is placed
//! on the local free list; if it is deallocated on another thread, it goes on
//! the global free list for the heap of the thread from which it originated. To
//! allocate, the local free list is used first; if it is empty, the entire
//! global free list is popped onto the local free list. Since the local free
//! list is only ever accessed by the thread it belongs to, it does not require
//! synchronization at all, and because the global free list is popped from
//! infrequently, the cost of synchronization has a reduced impact. A majority
//! of allocations can occur without any synchronization at all, and
//! deallocations require synchronization only when an object has left its
//! parent thread (a relatively uncommon case).
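//!
//! The following is a minimal sketch of that split, using slot indices rather
//! than pointers. It is illustrative only; the `Heap` type, its methods, and
//! the choice of orderings are assumptions made for this example, not
//! mimalloc's or this crate's actual code:
//!
//! ```rust,ignore
//! use std::sync::atomic::{AtomicUsize, Ordering};
//!
//! const NULL: usize = usize::MAX;
//!
//! struct Heap {
//!     next: Vec<AtomicUsize>,   // per-block "next free block" links
//!     local_free: AtomicUsize,  // only the owning thread reads or writes this
//!     global_free: AtomicUsize, // other threads push freed blocks here
//! }
//!
//! impl Heap {
//!     /// Allocate from the local list; if it is empty, claim the entire
//!     /// global list with a single atomic swap.
//!     fn alloc(&self) -> Option<usize> {
//!         let mut head = self.local_free.load(Ordering::Relaxed);
//!         if head == NULL {
//!             head = self.global_free.swap(NULL, Ordering::Acquire);
//!         }
//!         if head == NULL {
//!             return None; // no free blocks at all
//!         }
//!         self.local_free
//!             .store(self.next[head].load(Ordering::Relaxed), Ordering::Relaxed);
//!         Some(head)
//!     }
//!
//!     /// Free a block on its owning thread: a plain push, no compare-and-swap.
//!     fn free_local(&self, idx: usize) {
//!         let head = self.local_free.load(Ordering::Relaxed);
//!         self.next[idx].store(head, Ordering::Relaxed);
//!         self.local_free.store(idx, Ordering::Relaxed);
//!     }
//!
//!     /// Free a block from another thread: push it onto the global list with
//!     /// a compare-and-swap loop.
//!     fn free_remote(&self, idx: usize) {
//!         let mut head = self.global_free.load(Ordering::Relaxed);
//!         loop {
//!             self.next[idx].store(head, Ordering::Relaxed);
//!             match self.global_free.compare_exchange_weak(
//!                 head, idx, Ordering::Release, Ordering::Relaxed,
//!             ) {
//!                 Ok(_) => return,
//!                 Err(actual) => head = actual,
//!             }
//!         }
//!     }
//! }
//! ```
//!
//! The common paths (`alloc` and `free_local`) never loop on a
//! compare-and-swap; only `free_remote`, the cross-thread case, does.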
//!
//! [mimalloc]: https://www.microsoft.com/en-us/research/uploads/prod/2019/06/mimalloc-tr-v1.pdf
//! [freelist]: https://en.wikipedia.org/wiki/Free_list
//! [Treiber stack]: https://en.wikipedia.org/wiki/Treiber_stack
//!
//! # Implementation
//!
//! A slab is represented as an array of [`MAX_THREADS`] _shards_. A shard
//! consists of a vector of one or more _pages_ plus associated metadata.
//! Finally, a page consists of an array of _slots_, plus the head indices for
//! the local and remote free lists.
//!
//! ```text
//! ┌─────────────┐
//! │ shard 1     │
//! │             │    ┌─────────────┐        ┌────────┐
//! │ pages───────┼───▶│ page 1      │        │        │
//! ├─────────────┤    ├─────────────┤  ┌────▶│  next──┼─┐
//! │ shard 2     │    │ page 2      │  │     ├────────┤ │
//! ├─────────────┤    │             │  │     │XXXXXXXX│ │
//! │ shard 3     │    │ local_head──┼──┘     ├────────┤ │
//! └─────────────┘    │ remote_head─┼──┐     │        │◀┘
//!       ...          ├─────────────┤  │     │  next──┼─┐
//! ┌─────────────┐    │ page 3      │  │     ├────────┤ │
//! │ shard n     │    └─────────────┘  │     │XXXXXXXX│ │
//! └─────────────┘          ...        │     ├────────┤ │
//!                    ┌─────────────┐  │     │XXXXXXXX│ │
//!                    │ page n      │  │     ├────────┤ │
//!                    └─────────────┘  │     │        │◀┘
//!                                     └────▶│  next──┼───▶  ...
//!                                           ├────────┤
//!                                           │XXXXXXXX│
//!                                           └────────┘
//! ```
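//!
//! In terms of (hypothetical, simplified) Rust types, that hierarchy might be
//! pictured roughly as follows; the crate's real types carry more metadata,
//! and wrap fields that can be touched by other threads in the appropriate
//! synchronization primitives:
//!
//! ```rust,ignore
//! use std::sync::atomic::AtomicUsize;
//!
//! struct Slab<T> {
//!     shards: Box<[Shard<T>]>, // one shard per thread, up to MAX_THREADS
//! }
//!
//! struct Shard<T> {
//!     pages: Vec<Page<T>>, // allocated lazily; each page doubles in size
//! }
//!
//! struct Page<T> {
//!     local_head: usize,        // head of the local free list (owner thread only)
//!     remote_head: AtomicUsize, // head of the remote free list (any thread)
//!     slots: Box<[Slot<T>]>,
//! }
//!
//! struct Slot<T> {
//!     next: AtomicUsize, // index of the next free slot, when this slot is free
//!     item: Option<T>,
//! }
//! ```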
//!
//! The size of the first page in a shard is always a power of two, and every
//! subsequent page added after the first is twice as large as the page that
//! precedes it.
//!
//! ```text
//!
//!   pg.
//! ┌───┐     ┌─┬─┐
//! │ 0 │───▶ │ │
//! ├───┤     ├─┼─┼─┬─┐
//! │ 1 │───▶ │ │ │ │
//! ├───┤     ├─┼─┼─┼─┼─┬─┬─┬─┐
//! │ 2 │───▶ │ │ │ │ │ │ │ │
//! ├───┤     ├─┼─┼─┼─┼─┼─┼─┼─┼─┬─┬─┬─┬─┬─┬─┬─┐
//! │ 3 │───▶ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
//! └───┘     └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
//! ```
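//!
//! In other words, with an initial page size of `INITIAL_PAGE_SIZE` slots,
//! page `i` holds `INITIAL_PAGE_SIZE << i` slots and begins at the shard-wide
//! offset `INITIAL_PAGE_SIZE * ((1 << i) - 1)`, the sum of the sizes of all
//! preceding pages. A sketch, using an illustrative initial page size of 32
//! (the real value comes from the slab's configuration):
//!
//! ```rust,ignore
//! const INITIAL_PAGE_SIZE: usize = 32; // illustrative value
//!
//! /// Number of slots in page `i`: 32, 64, 128, 256, ...
//! fn page_size(i: usize) -> usize {
//!     INITIAL_PAGE_SIZE << i
//! }
//!
//! /// Shard-wide offset of the first slot in page `i`: 0, 32, 96, 224, ...
//! fn page_start(i: usize) -> usize {
//!     INITIAL_PAGE_SIZE * ((1 << i) - 1)
//! }
//! ```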
//!
//! When searching for a free slot, the smallest page is searched first, and if
//! it is full, the search proceeds to the next page until either a free slot is
//! found or all available pages have been searched. If all available pages have
//! been searched and the maximum number of pages has not yet been reached, a
//! new page is then allocated.
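//!
//! Continuing the hypothetical sketches above (the `Shard` and `Page` types,
//! `Page::alloc`, `Page::with_capacity`, and the `page_size`/`page_start`
//! helpers are all illustrative, not this crate's API), the search might look
//! roughly like this:
//!
//! ```rust,ignore
//! fn alloc_slot<T>(shard: &mut Shard<T>) -> usize {
//!     // Try the existing pages, smallest first.
//!     for (i, page) in shard.pages.iter_mut().enumerate() {
//!         if let Some(slot) = page.alloc() {
//!             return page_start(i) + slot; // convert to a shard-wide offset
//!         }
//!     }
//!     // Every existing page is full: grow the shard by one page, twice the
//!     // size of the last one (assuming the page limit has not been reached).
//!     let i = shard.pages.len();
//!     shard.pages.push(Page::with_capacity(page_size(i)));
//!     let slot = shard.pages[i].alloc().expect("a fresh page has free slots");
//!     page_start(i) + slot
//! }
//! ```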
//!
//! Since every page is twice as large as the previous page, and all page sizes
//! are powers of two, we can determine the page index that contains a given
//! address by shifting the address down by the smallest page size and
//! looking at how many twos places are necessary to represent that number,
//! which tells us what power-of-two page size it fits inside of. We can
//! determine the number of twos places by counting the number of leading
//! zeros (unused twos places) in the number's binary representation, and
//! subtracting that count from the total number of bits in a word.
//!
//! The formula for determining the page number that contains an offset is thus:
//!
//! ```rust,ignore
//! WIDTH - ((offset + INITIAL_PAGE_SIZE) >> INDEX_SHIFT).leading_zeros()
//! ```
//!
//! where `WIDTH` is the number of bits in a `usize`, and `INDEX_SHIFT` is
//!
//! ```rust,ignore
//! INITIAL_PAGE_SIZE.trailing_zeros() + 1;
//! ```
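//!
//! As a worked example (assuming a 64-bit `usize` and an illustrative initial
//! page size of 32, so that `INDEX_SHIFT` is 6), the formula can be wrapped in
//! a small function:
//!
//! ```rust,ignore
//! fn page_index(offset: usize) -> usize {
//!     const INITIAL_PAGE_SIZE: usize = 32; // illustrative value
//!     let width = usize::BITS;                                   // 64
//!     let index_shift = INITIAL_PAGE_SIZE.trailing_zeros() + 1;  // 5 + 1 = 6
//!     (width - ((offset + INITIAL_PAGE_SIZE) >> index_shift).leading_zeros()) as usize
//! }
//!
//! // offsets   0..=31  -> (offset + 32) >> 6 == 0     -> 64 - 64 -> page 0
//! // offsets  32..=95  -> (offset + 32) >> 6 == 1     -> 64 - 63 -> page 1
//! // offsets  96..=223 -> (offset + 32) >> 6 in 2..=3 -> 64 - 62 -> page 2
//! // offsets 224..=479 -> (offset + 32) >> 6 in 4..=7 -> 64 - 61 -> page 3
//! ```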
//!
//! [`MAX_THREADS`]: https://docs.rs/sharded-slab/latest/sharded_slab/trait.Config.html#associatedconstant.MAX_THREADS