| 1 | /*! |
| 2 | This library provides heavily optimized routines for string search primitives. |
| 3 | |
| 4 | # Overview |
| 5 | |
| 6 | This section gives a brief high level overview of what this crate offers. |
| 7 | |
| 8 | * The top-level module provides routines for searching for 1, 2 or 3 bytes |
| 9 | in the forward or reverse direction. When searching for more than one byte, |
| 10 | positions are considered a match if the byte at that position matches any |
| 11 | of the bytes. |
| 12 | * The [`memmem`] sub-module provides forward and reverse substring search |
| 13 | routines. |
| 14 | |
| 15 | In all such cases, routines operate on `&[u8]` without regard to encoding. This |
| 16 | is exactly what you want when searching either UTF-8 or arbitrary bytes. |
| 17 | |
| 18 | # Example: using `memchr` |
| 19 | |
| 20 | This example shows how to use `memchr` to find the first occurrence of `z` in |
| 21 | a haystack: |
| 22 | |
| 23 | ``` |
| 24 | use memchr::memchr; |
| 25 | |
| 26 | let haystack = b"foo bar baz quuz" ; |
| 27 | assert_eq!(Some(10), memchr(b'z' , haystack)); |
| 28 | ``` |
| 29 | |
| 30 | # Example: matching one of three possible bytes |
| 31 | |
| 32 | This examples shows how to use `memrchr3` to find occurrences of `a`, `b` or |
| 33 | `c`, starting at the end of the haystack. |
| 34 | |
| 35 | ``` |
| 36 | use memchr::memchr3_iter; |
| 37 | |
| 38 | let haystack = b"xyzaxyzbxyzc" ; |
| 39 | |
| 40 | let mut it = memchr3_iter(b'a' , b'b' , b'c' , haystack).rev(); |
| 41 | assert_eq!(Some(11), it.next()); |
| 42 | assert_eq!(Some(7), it.next()); |
| 43 | assert_eq!(Some(3), it.next()); |
| 44 | assert_eq!(None, it.next()); |
| 45 | ``` |
| 46 | |
| 47 | # Example: iterating over substring matches |
| 48 | |
| 49 | This example shows how to use the [`memmem`] sub-module to find occurrences of |
| 50 | a substring in a haystack. |
| 51 | |
| 52 | ``` |
| 53 | use memchr::memmem; |
| 54 | |
| 55 | let haystack = b"foo bar foo baz foo" ; |
| 56 | |
| 57 | let mut it = memmem::find_iter(haystack, "foo" ); |
| 58 | assert_eq!(Some(0), it.next()); |
| 59 | assert_eq!(Some(8), it.next()); |
| 60 | assert_eq!(Some(16), it.next()); |
| 61 | assert_eq!(None, it.next()); |
| 62 | ``` |
| 63 | |
| 64 | # Example: repeating a search for the same needle |
| 65 | |
| 66 | It may be possible for the overhead of constructing a substring searcher to be |
| 67 | measurable in some workloads. In cases where the same needle is used to search |
| 68 | many haystacks, it is possible to do construction once and thus to avoid it for |
| 69 | subsequent searches. This can be done with a [`memmem::Finder`]: |
| 70 | |
| 71 | ``` |
| 72 | use memchr::memmem; |
| 73 | |
| 74 | let finder = memmem::Finder::new("foo" ); |
| 75 | |
| 76 | assert_eq!(Some(4), finder.find(b"baz foo quux" )); |
| 77 | assert_eq!(None, finder.find(b"quux baz bar" )); |
| 78 | ``` |
| 79 | |
| 80 | # Why use this crate? |
| 81 | |
| 82 | At first glance, the APIs provided by this crate might seem weird. Why provide |
| 83 | a dedicated routine like `memchr` for something that could be implemented |
| 84 | clearly and trivially in one line: |
| 85 | |
| 86 | ``` |
| 87 | fn memchr(needle: u8, haystack: &[u8]) -> Option<usize> { |
| 88 | haystack.iter().position(|&b| b == needle) |
| 89 | } |
| 90 | ``` |
| 91 | |
| 92 | Or similarly, why does this crate provide substring search routines when Rust's |
| 93 | core library already provides them? |
| 94 | |
| 95 | ``` |
| 96 | fn search(haystack: &str, needle: &str) -> Option<usize> { |
| 97 | haystack.find(needle) |
| 98 | } |
| 99 | ``` |
| 100 | |
| 101 | The primary reason for both of them to exist is performance. When it comes to |
| 102 | performance, at a high level at least, there are two primary ways to look at |
| 103 | it: |
| 104 | |
| 105 | * **Throughput**: For this, think about it as, "given some very large haystack |
| 106 | and a byte that never occurs in that haystack, how long does it take to |
| 107 | search through it and determine that it, in fact, does not occur?" |
| 108 | * **Latency**: For this, think about it as, "given a tiny haystack---just a |
| 109 | few bytes---how long does it take to determine if a byte is in it?" |
| 110 | |
| 111 | The `memchr` routine in this crate has _slightly_ worse latency than the |
| 112 | solution presented above, however, its throughput can easily be over an |
| 113 | order of magnitude faster. This is a good general purpose trade off to make. |
| 114 | You rarely lose, but often gain big. |
| 115 | |
| 116 | **NOTE:** The name `memchr` comes from the corresponding routine in `libc`. A |
| 117 | key advantage of using this library is that its performance is not tied to its |
| 118 | quality of implementation in the `libc` you happen to be using, which can vary |
| 119 | greatly from platform to platform. |
| 120 | |
| 121 | But what about substring search? This one is a bit more complicated. The |
| 122 | primary reason for its existence is still indeed performance, but it's also |
| 123 | useful because Rust's core library doesn't actually expose any substring |
| 124 | search routine on arbitrary bytes. The only substring search routine that |
| 125 | exists works exclusively on valid UTF-8. |
| 126 | |
| 127 | So if you have valid UTF-8, is there a reason to use this over the standard |
| 128 | library substring search routine? Yes. This routine is faster on almost every |
| 129 | metric, including latency. The natural question then, is why isn't this |
| 130 | implementation in the standard library, even if only for searching on UTF-8? |
| 131 | The reason is that the implementation details for using SIMD in the standard |
| 132 | library haven't quite been worked out yet. |
| 133 | |
| 134 | **NOTE:** Currently, only `x86_64`, `wasm32` and `aarch64` targets have vector |
| 135 | accelerated implementations of `memchr` (and friends) and `memmem`. |
| 136 | |
| 137 | # Crate features |
| 138 | |
| 139 | * **std** - When enabled (the default), this will permit features specific to |
| 140 | the standard library. Currently, the only thing used from the standard library |
| 141 | is runtime SIMD CPU feature detection. This means that this feature must be |
| 142 | enabled to get AVX2 accelerated routines on `x86_64` targets without enabling |
| 143 | the `avx2` feature at compile time, for example. When `std` is not enabled, |
| 144 | this crate will still attempt to use SSE2 accelerated routines on `x86_64`. It |
| 145 | will also use AVX2 accelerated routines when the `avx2` feature is enabled at |
| 146 | compile time. In general, enable this feature if you can. |
| 147 | * **alloc** - When enabled (the default), APIs in this crate requiring some |
| 148 | kind of allocation will become available. For example, the |
| 149 | [`memmem::Finder::into_owned`](crate::memmem::Finder::into_owned) API and the |
| 150 | [`arch::all::shiftor`](crate::arch::all::shiftor) substring search |
| 151 | implementation. Otherwise, this crate is designed from the ground up to be |
| 152 | usable in core-only contexts, so the `alloc` feature doesn't add much |
| 153 | currently. Notably, disabling `std` but enabling `alloc` will **not** result |
| 154 | in the use of AVX2 on `x86_64` targets unless the `avx2` feature is enabled |
| 155 | at compile time. (With `std` enabled, AVX2 can be used even without the `avx2` |
| 156 | feature enabled at compile time by way of runtime CPU feature detection.) |
| 157 | * **logging** - When enabled (disabled by default), the `log` crate is used |
| 158 | to emit log messages about what kinds of `memchr` and `memmem` algorithms |
| 159 | are used. Namely, both `memchr` and `memmem` have a number of different |
| 160 | implementation choices depending on the target and CPU, and the log messages |
| 161 | can help show what specific implementations are being used. Generally, this is |
| 162 | useful for debugging performance issues. |
| 163 | * **libc** - **DEPRECATED**. Previously, this enabled the use of the target's |
| 164 | `memchr` function from whatever `libc` was linked into the program. This |
| 165 | feature is now a no-op because this crate's implementation of `memchr` should |
| 166 | now be sufficiently fast on a number of platforms that `libc` should no longer |
| 167 | be needed. (This feature is somewhat of a holdover from this crate's origins. |
| 168 | Originally, this crate was literally just a safe wrapper function around the |
| 169 | `memchr` function from `libc`.) |
| 170 | */ |
| 171 | |
| 172 | #![deny (missing_docs)] |
| 173 | #![no_std ] |
| 174 | // It's just not worth trying to squash all dead code warnings. Pretty |
| 175 | // unfortunate IMO. Not really sure how to fix this other than to either |
| 176 | // live with it or sprinkle a whole mess of `cfg` annotations everywhere. |
| 177 | #![cfg_attr ( |
| 178 | not(any( |
| 179 | all(target_arch = "x86_64" , target_feature = "sse2" ), |
| 180 | all(target_arch = "wasm32" , target_feature = "simd128" ), |
| 181 | target_arch = "aarch64" , |
| 182 | )), |
| 183 | allow(dead_code) |
| 184 | )] |
| 185 | // Same deal for miri. |
| 186 | #![cfg_attr (miri, allow(dead_code, unused_macros))] |
| 187 | |
| 188 | // Supporting 8-bit (or others) would be fine. If you need it, please submit a |
| 189 | // bug report at https://github.com/BurntSushi/memchr |
| 190 | #[cfg (not(any( |
| 191 | target_pointer_width = "16" , |
| 192 | target_pointer_width = "32" , |
| 193 | target_pointer_width = "64" |
| 194 | )))] |
| 195 | compile_error!("memchr currently not supported on non-{16,32,64}" ); |
| 196 | |
| 197 | #[cfg (any(test, feature = "std" ))] |
| 198 | extern crate std; |
| 199 | |
| 200 | #[cfg (any(test, feature = "alloc" ))] |
| 201 | extern crate alloc; |
| 202 | |
| 203 | pub use crate::memchr::{ |
| 204 | memchr, memchr2, memchr2_iter, memchr3, memchr3_iter, memchr_iter, |
| 205 | memrchr, memrchr2, memrchr2_iter, memrchr3, memrchr3_iter, memrchr_iter, |
| 206 | Memchr, Memchr2, Memchr3, |
| 207 | }; |
| 208 | |
| 209 | #[macro_use ] |
| 210 | mod macros; |
| 211 | |
| 212 | #[cfg (test)] |
| 213 | #[macro_use ] |
| 214 | mod tests; |
| 215 | |
| 216 | pub mod arch; |
| 217 | mod cow; |
| 218 | mod ext; |
| 219 | mod memchr; |
| 220 | pub mod memmem; |
| 221 | mod vector; |
| 222 | |