1 | /*! |
2 | An implementation of the [Rabin-Karp substring search algorithm][rabinkarp]. |
3 | |
4 | Rabin-Karp works by creating a hash of the needle provided and then computing |
5 | a rolling hash for each needle sized window in the haystack. When the rolling |
6 | hash matches the hash of the needle, a byte-wise comparison is done to check |
7 | if a match exists. The worst case time complexity of Rabin-Karp is `O(m * |
8 | n)` where `m ~ len(needle)` and `n ~ len(haystack)`. Its worst case space |
9 | complexity is constant. |
10 | |
11 | The main utility of Rabin-Karp is that the searcher can be constructed very |
12 | quickly with very little memory. This makes it especially useful when searching |
13 | for small needles in small haystacks, as it might finish its search before a |
14 | beefier algorithm (like Two-Way) even starts. |
15 | |
16 | [rabinkarp]: https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm |
17 | */ |
18 | |
19 | /* |
20 | (This was the comment I wrote for this module originally when it was not |
21 | exposed. The comment still looks useful, but it's a bit in the weeds, so it's |
22 | not public itself.) |
23 | |
24 | This module implements the classical Rabin-Karp substring search algorithm, |
25 | with no extra frills. While its use would seem to break our time complexity |
26 | guarantee of O(m+n) (RK's time complexity is O(mn)), we are careful to only |
27 | ever use RK on a constant subset of haystacks. The main point here is that |
28 | RK has good latency properties for small needles/haystacks. It's very quick |
29 | to compute a needle hash and zip through the haystack when compared to |
30 | initializing Two-Way, for example. And this is especially useful for cases |
31 | where the haystack is just too short for vector instructions to do much good. |
32 | |
33 | The hashing function used here is the same one recommended by ESMAJ. |
34 | |
35 | Another choice instead of Rabin-Karp would be Shift-Or. But its latency |
36 | isn't quite as good since its preprocessing time is a bit more expensive |
37 | (both in practice and in theory). However, perhaps Shift-Or has a place |
38 | somewhere else for short patterns. I think the main problem is that it |
39 | requires space proportional to the alphabet and the needle. If we, for |
40 | example, supported needles up to length 16, then the total table size would be |
41 | len(alphabet)*size_of::<u16>()==512 bytes. Which isn't exactly small, and it's |
42 | probably bad to put that on the stack. So ideally, we'd throw it on the heap, |
43 | but we'd really like to write as much code without using alloc/std as possible. |
44 | But maybe it's worth the special casing. It's a TODO to benchmark. |
45 | |
46 | Wikipedia has a decent explanation, if a bit heavy on the theory: |
47 | https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm |
48 | |
49 | But ESMAJ provides something a bit more concrete: |
50 | http://www-igm.univ-mlv.fr/~lecroq/string/node5.html |
51 | |
52 | Finally, aho-corasick uses Rabin-Karp for multiple pattern match in some cases: |
53 | https://github.com/BurntSushi/aho-corasick/blob/3852632f10587db0ff72ef29e88d58bf305a0946/src/packed/rabinkarp.rs |
54 | */ |
55 | |
56 | use crate::ext::Pointer; |
57 | |
58 | /// A forward substring searcher using the Rabin-Karp algorithm. |
59 | /// |
60 | /// Note that, as a lower level API, a `Finder` does not have access to the |
61 | /// needle it was constructed with. For this reason, executing a search |
62 | /// with a `Finder` requires passing both the needle and the haystack, |
63 | /// where the needle is exactly equivalent to the one given to the `Finder` |
64 | /// at construction time. This design was chosen so that callers can have |
65 | /// more precise control over where and how many times a needle is stored. |
66 | /// For example, in cases where Rabin-Karp is just one of several possible |
67 | /// substring search algorithms. |
68 | #[derive (Clone, Debug)] |
69 | pub struct Finder { |
70 | /// The actual hash. |
71 | hash: Hash, |
72 | /// The factor needed to multiply a byte by in order to subtract it from |
73 | /// the hash. It is defined to be 2^(n-1) (using wrapping exponentiation), |
74 | /// where n is the length of the needle. This is how we "remove" a byte |
75 | /// from the hash once the hash window rolls past it. |
76 | hash_2pow: u32, |
77 | } |
78 | |
79 | impl Finder { |
80 | /// Create a new Rabin-Karp forward searcher for the given `needle`. |
81 | /// |
82 | /// The needle may be empty. The empty needle matches at every byte offset. |
83 | /// |
84 | /// Note that callers must pass the same needle to all search calls using |
85 | /// this `Finder`. |
86 | #[inline ] |
87 | pub fn new(needle: &[u8]) -> Finder { |
88 | let mut s = Finder { hash: Hash::new(), hash_2pow: 1 }; |
89 | let first_byte = match needle.get(0) { |
90 | None => return s, |
91 | Some(&first_byte) => first_byte, |
92 | }; |
93 | s.hash.add(first_byte); |
94 | for b in needle.iter().copied().skip(1) { |
95 | s.hash.add(b); |
96 | s.hash_2pow = s.hash_2pow.wrapping_shl(1); |
97 | } |
98 | s |
99 | } |
100 | |
101 | /// Return the first occurrence of the `needle` in the `haystack` |
102 | /// given. If no such occurrence exists, then `None` is returned. |
103 | /// |
104 | /// The `needle` provided must match the needle given to this finder at |
105 | /// construction time. |
106 | /// |
107 | /// The maximum value this can return is `haystack.len()`, which can only |
108 | /// occur when the needle and haystack both have length zero. Otherwise, |
109 | /// for non-empty haystacks, the maximum value is `haystack.len() - 1`. |
110 | #[inline ] |
111 | pub fn find(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> { |
112 | unsafe { |
113 | let hstart = haystack.as_ptr(); |
114 | let hend = hstart.add(haystack.len()); |
115 | let nstart = needle.as_ptr(); |
116 | let nend = nstart.add(needle.len()); |
117 | let found = self.find_raw(hstart, hend, nstart, nend)?; |
118 | Some(found.distance(hstart)) |
119 | } |
120 | } |
121 | |
122 | /// Like `find`, but accepts and returns raw pointers. |
123 | /// |
124 | /// When a match is found, the pointer returned is guaranteed to be |
125 | /// `>= start` and `<= end`. The pointer returned is only ever equivalent |
126 | /// to `end` when both the needle and haystack are empty. (That is, the |
127 | /// empty string matches the empty string.) |
128 | /// |
129 | /// This routine is useful if you're already using raw pointers and would |
130 | /// like to avoid converting back to a slice before executing a search. |
131 | /// |
132 | /// # Safety |
133 | /// |
134 | /// Note that `start` and `end` below refer to both pairs of pointers given |
135 | /// to this routine. That is, the conditions apply to both `hstart`/`hend` |
136 | /// and `nstart`/`nend`. |
137 | /// |
138 | /// * Both `start` and `end` must be valid for reads. |
139 | /// * Both `start` and `end` must point to an initialized value. |
140 | /// * Both `start` and `end` must point to the same allocated object and |
141 | /// must either be in bounds or at most one byte past the end of the |
142 | /// allocated object. |
143 | /// * Both `start` and `end` must be _derived from_ a pointer to the same |
144 | /// object. |
145 | /// * The distance between `start` and `end` must not overflow `isize`. |
146 | /// * The distance being in bounds must not rely on "wrapping around" the |
147 | /// address space. |
148 | /// * It must be the case that `start <= end`. |
149 | #[inline ] |
150 | pub unsafe fn find_raw( |
151 | &self, |
152 | hstart: *const u8, |
153 | hend: *const u8, |
154 | nstart: *const u8, |
155 | nend: *const u8, |
156 | ) -> Option<*const u8> { |
157 | let hlen = hend.distance(hstart); |
158 | let nlen = nend.distance(nstart); |
159 | if nlen > hlen { |
160 | return None; |
161 | } |
162 | let mut cur = hstart; |
163 | let end = hend.sub(nlen); |
164 | let mut hash = Hash::forward(cur, cur.add(nlen)); |
165 | loop { |
166 | if self.hash == hash && is_equal_raw(cur, nstart, nlen) { |
167 | return Some(cur); |
168 | } |
169 | if cur >= end { |
170 | return None; |
171 | } |
172 | hash.roll(self, cur.read(), cur.add(nlen).read()); |
173 | cur = cur.add(1); |
174 | } |
175 | } |
176 | } |
177 | |
178 | /// A reverse substring searcher using the Rabin-Karp algorithm. |
179 | #[derive (Clone, Debug)] |
180 | pub struct FinderRev(Finder); |
181 | |
182 | impl FinderRev { |
183 | /// Create a new Rabin-Karp reverse searcher for the given `needle`. |
184 | #[inline ] |
185 | pub fn new(needle: &[u8]) -> FinderRev { |
186 | let mut s = FinderRev(Finder { hash: Hash::new(), hash_2pow: 1 }); |
187 | let last_byte = match needle.last() { |
188 | None => return s, |
189 | Some(&last_byte) => last_byte, |
190 | }; |
191 | s.0.hash.add(last_byte); |
192 | for b in needle.iter().rev().copied().skip(1) { |
193 | s.0.hash.add(b); |
194 | s.0.hash_2pow = s.0.hash_2pow.wrapping_shl(1); |
195 | } |
196 | s |
197 | } |
198 | |
199 | /// Return the last occurrence of the `needle` in the `haystack` |
200 | /// given. If no such occurrence exists, then `None` is returned. |
201 | /// |
202 | /// The `needle` provided must match the needle given to this finder at |
203 | /// construction time. |
204 | /// |
205 | /// The maximum value this can return is `haystack.len()`, which can only |
206 | /// occur when the needle and haystack both have length zero. Otherwise, |
207 | /// for non-empty haystacks, the maximum value is `haystack.len() - 1`. |
208 | #[inline ] |
209 | pub fn rfind(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> { |
210 | unsafe { |
211 | let hstart = haystack.as_ptr(); |
212 | let hend = hstart.add(haystack.len()); |
213 | let nstart = needle.as_ptr(); |
214 | let nend = nstart.add(needle.len()); |
215 | let found = self.rfind_raw(hstart, hend, nstart, nend)?; |
216 | Some(found.distance(hstart)) |
217 | } |
218 | } |
219 | |
220 | /// Like `rfind`, but accepts and returns raw pointers. |
221 | /// |
222 | /// When a match is found, the pointer returned is guaranteed to be |
223 | /// `>= start` and `<= end`. The pointer returned is only ever equivalent |
224 | /// to `end` when both the needle and haystack are empty. (That is, the |
225 | /// empty string matches the empty string.) |
226 | /// |
227 | /// This routine is useful if you're already using raw pointers and would |
228 | /// like to avoid converting back to a slice before executing a search. |
229 | /// |
230 | /// # Safety |
231 | /// |
232 | /// Note that `start` and `end` below refer to both pairs of pointers given |
233 | /// to this routine. That is, the conditions apply to both `hstart`/`hend` |
234 | /// and `nstart`/`nend`. |
235 | /// |
236 | /// * Both `start` and `end` must be valid for reads. |
237 | /// * Both `start` and `end` must point to an initialized value. |
238 | /// * Both `start` and `end` must point to the same allocated object and |
239 | /// must either be in bounds or at most one byte past the end of the |
240 | /// allocated object. |
241 | /// * Both `start` and `end` must be _derived from_ a pointer to the same |
242 | /// object. |
243 | /// * The distance between `start` and `end` must not overflow `isize`. |
244 | /// * The distance being in bounds must not rely on "wrapping around" the |
245 | /// address space. |
246 | /// * It must be the case that `start <= end`. |
247 | #[inline ] |
248 | pub unsafe fn rfind_raw( |
249 | &self, |
250 | hstart: *const u8, |
251 | hend: *const u8, |
252 | nstart: *const u8, |
253 | nend: *const u8, |
254 | ) -> Option<*const u8> { |
255 | let hlen = hend.distance(hstart); |
256 | let nlen = nend.distance(nstart); |
257 | if nlen > hlen { |
258 | return None; |
259 | } |
260 | let mut cur = hend.sub(nlen); |
261 | let start = hstart; |
262 | let mut hash = Hash::reverse(cur, cur.add(nlen)); |
263 | loop { |
264 | if self.0.hash == hash && is_equal_raw(cur, nstart, nlen) { |
265 | return Some(cur); |
266 | } |
267 | if cur <= start { |
268 | return None; |
269 | } |
270 | cur = cur.sub(1); |
271 | hash.roll(&self.0, cur.add(nlen).read(), cur.read()); |
272 | } |
273 | } |
274 | } |
275 | |
276 | /// Whether RK is believed to be very fast for the given needle/haystack. |
277 | #[inline ] |
278 | pub(crate) fn is_fast(haystack: &[u8], _needle: &[u8]) -> bool { |
279 | haystack.len() < 16 |
280 | } |
281 | |
282 | /// A Rabin-Karp hash. This might represent the hash of a needle, or the hash |
283 | /// of a rolling window in the haystack. |
284 | #[derive (Clone, Copy, Debug, Default, Eq, PartialEq)] |
285 | struct Hash(u32); |
286 | |
287 | impl Hash { |
288 | /// Create a new hash that represents the empty string. |
289 | #[inline (always)] |
290 | fn new() -> Hash { |
291 | Hash(0) |
292 | } |
293 | |
294 | /// Create a new hash from the bytes given for use in forward searches. |
295 | /// |
296 | /// # Safety |
297 | /// |
298 | /// The given pointers must be valid to read from within their range. |
299 | #[inline (always)] |
300 | unsafe fn forward(mut start: *const u8, end: *const u8) -> Hash { |
301 | let mut hash = Hash::new(); |
302 | while start < end { |
303 | hash.add(start.read()); |
304 | start = start.add(1); |
305 | } |
306 | hash |
307 | } |
308 | |
309 | /// Create a new hash from the bytes given for use in reverse searches. |
310 | /// |
311 | /// # Safety |
312 | /// |
313 | /// The given pointers must be valid to read from within their range. |
314 | #[inline (always)] |
315 | unsafe fn reverse(start: *const u8, mut end: *const u8) -> Hash { |
316 | let mut hash = Hash::new(); |
317 | while start < end { |
318 | end = end.sub(1); |
319 | hash.add(end.read()); |
320 | } |
321 | hash |
322 | } |
323 | |
324 | /// Add 'new' and remove 'old' from this hash. The given needle hash should |
325 | /// correspond to the hash computed for the needle being searched for. |
326 | /// |
327 | /// This is meant to be used when the rolling window of the haystack is |
328 | /// advanced. |
329 | #[inline (always)] |
330 | fn roll(&mut self, finder: &Finder, old: u8, new: u8) { |
331 | self.del(finder, old); |
332 | self.add(new); |
333 | } |
334 | |
335 | /// Add a byte to this hash. |
336 | #[inline (always)] |
337 | fn add(&mut self, byte: u8) { |
338 | self.0 = self.0.wrapping_shl(1).wrapping_add(u32::from(byte)); |
339 | } |
340 | |
341 | /// Remove a byte from this hash. The given needle hash should correspond |
342 | /// to the hash computed for the needle being searched for. |
343 | #[inline (always)] |
344 | fn del(&mut self, finder: &Finder, byte: u8) { |
345 | let factor = finder.hash_2pow; |
346 | self.0 = self.0.wrapping_sub(u32::from(byte).wrapping_mul(factor)); |
347 | } |
348 | } |
349 | |
350 | /// Returns true when `x[i] == y[i]` for all `0 <= i < n`. |
351 | /// |
352 | /// We forcefully don't inline this to hint at the compiler that it is unlikely |
353 | /// to be called. This causes the inner rabinkarp loop above to be a bit |
354 | /// tighter and leads to some performance improvement. See the |
355 | /// memmem/krate/prebuilt/sliceslice-words/words benchmark. |
356 | /// |
357 | /// # Safety |
358 | /// |
359 | /// Same as `crate::arch::all::is_equal_raw`. |
360 | #[cold ] |
361 | #[inline (never)] |
362 | unsafe fn is_equal_raw(x: *const u8, y: *const u8, n: usize) -> bool { |
363 | crate::arch::all::is_equal_raw(x, y, n) |
364 | } |
365 | |
366 | #[cfg (test)] |
367 | mod tests { |
368 | use super::*; |
369 | |
370 | define_substring_forward_quickcheck!(|h, n| Some( |
371 | Finder::new(n).find(h, n) |
372 | )); |
373 | define_substring_reverse_quickcheck!(|h, n| Some( |
374 | FinderRev::new(n).rfind(h, n) |
375 | )); |
376 | |
377 | #[test ] |
378 | fn forward() { |
379 | crate::tests::substring::Runner::new() |
380 | .fwd(|h, n| Some(Finder::new(n).find(h, n))) |
381 | .run(); |
382 | } |
383 | |
384 | #[test ] |
385 | fn reverse() { |
386 | crate::tests::substring::Runner::new() |
387 | .rev(|h, n| Some(FinderRev::new(n).rfind(h, n))) |
388 | .run(); |
389 | } |
390 | } |
391 | |