/*!
This library provides heavily optimized routines for string search primitives.

# Overview

This section gives a brief high level overview of what this crate offers.

* The top-level module provides routines for searching for 1, 2 or 3 bytes
  in the forward or reverse direction. When searching for more than one byte,
  positions are considered a match if the byte at that position matches any
  of the bytes.
* The [`memmem`] sub-module provides forward and reverse substring search
  routines.

In all such cases, routines operate on `&[u8]` without regard to encoding. This
is exactly what you want when searching either UTF-8 or arbitrary bytes.
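
As a minimal sketch of why byte-oriented search is also safe for UTF-8, the
following std-only helper searches the bytes of a `&str`. The name
`find_ascii_byte` is hypothetical and not part of this crate, and the plain
iterator search stands in for this crate's optimized routines. Because an
ASCII byte never occurs inside a multi-byte UTF-8 sequence, the reported
offset always falls on a character boundary:

```rust
// Hypothetical helper for illustration; not part of this crate's API.
// A plain iterator search stands in for the optimized `memchr` routine.
fn find_ascii_byte(haystack: &str, needle: u8) -> Option<usize> {
    // The character boundary guarantee only holds for ASCII needles.
    assert!(needle.is_ascii());
    // The search itself never looks at the encoding, only at raw bytes.
    haystack.as_bytes().iter().position(|&b| b == needle)
}
```

For example, `find_ascii_byte("café zoo", b'z')` returns `Some(6)` because `é`
occupies two bytes, and slicing the string at that offset is valid.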

# Example: using `memchr`

This example shows how to use `memchr` to find the first occurrence of `z` in
a haystack:

```
use memchr::memchr;

let haystack = b"foo bar baz quuz";
assert_eq!(Some(10), memchr(b'z', haystack));
```

# Example: matching one of three possible bytes

This example shows how to use `memchr3_iter` to find occurrences of `a`, `b`
or `c`, starting at the end of the haystack by reversing the iterator.

```
use memchr::memchr3_iter;

let haystack = b"xyzaxyzbxyzc";

let mut it = memchr3_iter(b'a', b'b', b'c', haystack).rev();
assert_eq!(Some(11), it.next());
assert_eq!(Some(7), it.next());
assert_eq!(Some(3), it.next());
assert_eq!(None, it.next());
```

# Example: iterating over substring matches

This example shows how to use the [`memmem`] sub-module to find occurrences of
a substring in a haystack.

```
use memchr::memmem;

let haystack = b"foo bar foo baz foo";

let mut it = memmem::find_iter(haystack, "foo");
assert_eq!(Some(0), it.next());
assert_eq!(Some(8), it.next());
assert_eq!(Some(16), it.next());
assert_eq!(None, it.next());
```

# Example: repeating a search for the same needle

Constructing a substring searcher has some overhead, which may be measurable
in some workloads. When the same needle is used to search many haystacks, the
searcher can be constructed once and reused for subsequent searches. This can
be done with a [`memmem::Finder`]:

```
use memchr::memmem;

let finder = memmem::Finder::new("foo");

assert_eq!(Some(4), finder.find(b"baz foo quux"));
assert_eq!(None, finder.find(b"quux baz bar"));
```
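
To illustrate why building a searcher once can pay off, here is a simplified
sketch of a prebuilt searcher. It is not how `memmem::Finder` is implemented.
`ToyFinder` is a hypothetical type that precomputes a Horspool-style shift
table in `new` and reuses it on every call to `find`:

```rust
// Hypothetical type for illustration; not this crate's `memmem::Finder`.
struct ToyFinder<'n> {
    needle: &'n [u8],
    // shift[b] is how far to advance when the window's last byte is `b`.
    shift: [usize; 256],
}

impl<'n> ToyFinder<'n> {
    // All per-needle work happens here, once.
    fn new(needle: &'n [u8]) -> ToyFinder<'n> {
        // Bytes that do not occur in the needle allow a full-length skip.
        let mut shift = [needle.len().max(1); 256];
        // Every needle byte except the last records its distance to the end.
        if let Some((_, init)) = needle.split_last() {
            for (i, &b) in init.iter().enumerate() {
                shift[b as usize] = needle.len() - 1 - i;
            }
        }
        ToyFinder { needle, shift }
    }

    // Each search only reads the precomputed table.
    fn find(&self, haystack: &[u8]) -> Option<usize> {
        let n = self.needle.len();
        if n == 0 {
            return Some(0);
        }
        let mut pos = 0;
        while pos + n <= haystack.len() {
            let window = &haystack[pos..pos + n];
            if window == self.needle {
                return Some(pos);
            }
            // Skip ahead based on the last byte of the current window.
            pos += self.shift[window[n - 1] as usize];
        }
        None
    }
}
```

The cost of `new` is paid once, however many haystacks are later passed to
`find`, which is the same trade that a reusable finder makes.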

# Why use this crate?

At first glance, the APIs provided by this crate might seem weird. Why provide
a dedicated routine like `memchr` for something that could be implemented
clearly and trivially in one line:

```
fn memchr(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}
```

Or similarly, why does this crate provide substring search routines when Rust's
core library already provides them?

```
fn search(haystack: &str, needle: &str) -> Option<usize> {
    haystack.find(needle)
}
```

The primary reason for both of them to exist is performance. At a high level,
there are two primary ways to look at performance:

* **Throughput**: For this, think about it as, "given some very large haystack
  and a byte that never occurs in that haystack, how long does it take to
  search through it and determine that it, in fact, does not occur?"
* **Latency**: For this, think about it as, "given a tiny haystack---just a
  few bytes---how long does it take to determine if a byte is in it?"

The `memchr` routine in this crate has _slightly_ worse latency than the
one-line `position` solution presented above. However, its throughput can
easily be over an order of magnitude faster. This is a good general purpose
trade off to make. You rarely lose, but often gain big.
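
One way to see where a throughput gap can come from: the one-line version
inspects one byte per iteration, while optimized routines inspect many bytes
per step. The sketch below is an illustration only and not this crate's
implementation. `memchr_swar` is a hypothetical name. It tests 8 bytes at a
time using ordinary 64-bit arithmetic and falls back to a byte-wise scan only
to locate the exact position within a matching chunk:

```rust
// Hypothetical function for illustration; not this crate's implementation.
fn memchr_swar(needle: u8, haystack: &[u8]) -> Option<usize> {
    const LO: u64 = 0x0101_0101_0101_0101;
    const HI: u64 = 0x8080_8080_8080_8080;

    // Broadcast the needle into every byte of a 64-bit word.
    let pattern = LO * (needle as u64);

    let mut chunks = haystack.chunks_exact(8);
    let mut offset = 0;
    for chunk in &mut chunks {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        // Bytes equal to the needle become zero after the XOR.
        let x = u64::from_le_bytes(buf) ^ pattern;
        // This expression is non-zero exactly when some byte of `x` is zero.
        if (x.wrapping_sub(LO) & !x & HI) != 0 {
            if let Some(i) = chunk.iter().position(|&b| b == needle) {
                return Some(offset + i);
            }
        }
        offset += 8;
    }
    // Fewer than 8 bytes remain, so scan them one at a time.
    chunks
        .remainder()
        .iter()
        .position(|&b| b == needle)
        .map(|i| offset + i)
}
```

The extra setup is also a plausible source of the slightly worse latency on
tiny haystacks: for inputs shorter than 8 bytes, this sketch does strictly
more work than the one-line version.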

**NOTE:** The name `memchr` comes from the corresponding routine in `libc`. A
key advantage of using this library is that its performance is not tied to the
quality of the `memchr` implementation in the `libc` you happen to be using,
which can vary greatly from platform to platform.

But what about substring search? This one is a bit more complicated. The
primary reason for its existence is still indeed performance, but it's also
useful because Rust's core library doesn't actually expose any substring
search routine on arbitrary bytes. The only substring search routine that
exists works exclusively on valid UTF-8.
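
For comparison, searching arbitrary bytes using only the standard library
means writing something like the following sketch, where `find_bytes` is a
hypothetical name. It is correct for any `&[u8]`, valid UTF-8 or not, but in
the worst case it takes time proportional to the product of the two lengths:

```rust
// Hypothetical helper for illustration; not part of this crate's API.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    // `windows` panics on a size of zero, so handle the empty needle first.
    if needle.is_empty() {
        return Some(0);
    }
    // Compare the needle against every window of the same length.
    haystack.windows(needle.len()).position(|window| window == needle)
}
```

This works on input such as `b"\xFF\xFEfoo"`, which cannot be represented as
a `&str` at all.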

So if you have valid UTF-8, is there a reason to use this over the standard
library substring search routine? Yes. This routine is faster on almost every
metric, including latency. The natural question, then, is why this
implementation isn't in the standard library, even if only for searching on
UTF-8. The reason is that the implementation details for using SIMD in the
standard library haven't quite been worked out yet.

**NOTE:** Currently, only `x86_64`, `wasm32` and `aarch64` targets have vector
accelerated implementations of `memchr` (and friends) and `memmem`.

# Crate features

* **std** - When enabled (the default), this will permit features specific to
the standard library. Currently, the only thing used from the standard library
is runtime SIMD CPU feature detection. This means that this feature must be
enabled to get AVX2 accelerated routines on `x86_64` targets without enabling
the `avx2` feature at compile time, for example. When `std` is not enabled,
this crate will still attempt to use SSE2 accelerated routines on `x86_64`. It
will also use AVX2 accelerated routines when the `avx2` feature is enabled at
compile time. In general, enable this feature if you can.
* **alloc** - When enabled (the default), APIs in this crate requiring some
kind of allocation will become available. For example, the
[`memmem::Finder::into_owned`](crate::memmem::Finder::into_owned) API and the
[`arch::all::shiftor`](crate::arch::all::shiftor) substring search
implementation. Otherwise, this crate is designed from the ground up to be
usable in core-only contexts, so the `alloc` feature doesn't add much
currently. Notably, disabling `std` but enabling `alloc` will **not** result
in the use of AVX2 on `x86_64` targets unless the `avx2` feature is enabled
at compile time. (With `std` enabled, AVX2 can be used even without the `avx2`
feature enabled at compile time by way of runtime CPU feature detection.)
* **logging** - When enabled (disabled by default), the `log` crate is used
to emit log messages about what kinds of `memchr` and `memmem` algorithms
are used. Namely, both `memchr` and `memmem` have a number of different
implementation choices depending on the target and CPU, and the log messages
can help show what specific implementations are being used. Generally, this is
useful for debugging performance issues.
* **libc** - **DEPRECATED**. Previously, this enabled the use of the target's
`memchr` function from whatever `libc` was linked into the program. This
feature is now a no-op because this crate's implementation of `memchr` should
now be sufficiently fast on a number of platforms that `libc` should no longer
be needed. (This feature is somewhat of a holdover from this crate's origins.
Originally, this crate was literally just a safe wrapper function around the
`memchr` function from `libc`.)
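
The difference between compile time and runtime AVX2 detection described for
the `std` feature can be sketched with standard library facilities alone. The
function names below are hypothetical and this is not the crate's own dispatch
code. It only shows the two mechanisms, assuming `std` is available:

```rust
// Hypothetical helpers for illustration; not this crate's dispatch logic.

/// True if the compiler was told that the target supports AVX2, for example
/// via `-C target-feature=+avx2`. This check does not need `std`.
fn avx2_at_compile_time() -> bool {
    cfg!(all(target_arch = "x86_64", target_feature = "avx2"))
}

/// True if the CPU running the program reports AVX2. This needs `std`.
#[cfg(target_arch = "x86_64")]
fn avx2_at_runtime() -> bool {
    std::is_x86_feature_detected!("avx2")
}

/// On other architectures there is no AVX2 to detect.
#[cfg(not(target_arch = "x86_64"))]
fn avx2_at_runtime() -> bool {
    false
}
```

A binary built without the compile time flag can still see `true` from the
runtime check on a capable CPU, which is the case that requires `std`.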
*/

#![deny(missing_docs)]
#![no_std]
// It's just not worth trying to squash all dead code warnings. Pretty
// unfortunate IMO. Not really sure how to fix this other than to either
// live with it or sprinkle a whole mess of `cfg` annotations everywhere.
#![cfg_attr(
    not(any(
        all(target_arch = "x86_64", target_feature = "sse2"),
        target_arch = "wasm32",
        target_arch = "aarch64",
    )),
    allow(dead_code)
)]
// Same deal for miri.
#![cfg_attr(miri, allow(dead_code, unused_macros))]

// Supporting 8-bit (or others) would be fine. If you need it, please submit a
// bug report at https://github.com/BurntSushi/memchr
#[cfg(not(any(
    target_pointer_width = "16",
    target_pointer_width = "32",
    target_pointer_width = "64"
)))]
compile_error!("memchr currently not supported on non-{16,32,64}");

#[cfg(any(test, feature = "std"))]
extern crate std;

#[cfg(any(test, feature = "alloc"))]
extern crate alloc;

pub use crate::memchr::{
    memchr, memchr2, memchr2_iter, memchr3, memchr3_iter, memchr_iter,
    memrchr, memrchr2, memrchr2_iter, memrchr3, memrchr3_iter, memrchr_iter,
    Memchr, Memchr2, Memchr3,
};

#[macro_use]
mod macros;

#[cfg(test)]
#[macro_use]
mod tests;

pub mod arch;
mod cow;
mod ext;
mod memchr;
pub mod memmem;
mod vector;