1/*!
2An implementation of the [Rabin-Karp substring search algorithm][rabinkarp].
3
4Rabin-Karp works by creating a hash of the needle provided and then computing
5a rolling hash for each needle sized window in the haystack. When the rolling
6hash matches the hash of the needle, a byte-wise comparison is done to check
7if a match exists. The worst case time complexity of Rabin-Karp is `O(m *
8n)` where `m ~ len(needle)` and `n ~ len(haystack)`. Its worst case space
9complexity is constant.
10
11The main utility of Rabin-Karp is that the searcher can be constructed very
12quickly with very little memory. This makes it especially useful when searching
13for small needles in small haystacks, as it might finish its search before a
14beefier algorithm (like Two-Way) even starts.
15
16[rabinkarp]: https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
17*/
18
19/*
20(This was the comment I wrote for this module originally when it was not
21exposed. The comment still looks useful, but it's a bit in the weeds, so it's
22not public itself.)
23
24This module implements the classical Rabin-Karp substring search algorithm,
25with no extra frills. While its use would seem to break our time complexity
26guarantee of O(m+n) (RK's time complexity is O(mn)), we are careful to only
27ever use RK on a constant subset of haystacks. The main point here is that
28RK has good latency properties for small needles/haystacks. It's very quick
29to compute a needle hash and zip through the haystack when compared to
30initializing Two-Way, for example. And this is especially useful for cases
31where the haystack is just too short for vector instructions to do much good.
32
33The hashing function used here is the same one recommended by ESMAJ.
34
35Another choice instead of Rabin-Karp would be Shift-Or. But its latency
36isn't quite as good since its preprocessing time is a bit more expensive
37(both in practice and in theory). However, perhaps Shift-Or has a place
38somewhere else for short patterns. I think the main problem is that it
39requires space proportional to the alphabet and the needle. If we, for
40example, supported needles up to length 16, then the total table size would be
41len(alphabet)*size_of::<u16>()==512 bytes. Which isn't exactly small, and it's
42probably bad to put that on the stack. So ideally, we'd throw it on the heap,
43but we'd really like to write as much code without using alloc/std as possible.
44But maybe it's worth the special casing. It's a TODO to benchmark.
45
46Wikipedia has a decent explanation, if a bit heavy on the theory:
47https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
48
49But ESMAJ provides something a bit more concrete:
50http://www-igm.univ-mlv.fr/~lecroq/string/node5.html
51
52Finally, aho-corasick uses Rabin-Karp for multiple pattern match in some cases:
53https://github.com/BurntSushi/aho-corasick/blob/3852632f10587db0ff72ef29e88d58bf305a0946/src/packed/rabinkarp.rs
54*/
55
56use crate::ext::Pointer;
57
58/// A forward substring searcher using the Rabin-Karp algorithm.
59///
60/// Note that, as a lower level API, a `Finder` does not have access to the
61/// needle it was constructed with. For this reason, executing a search
62/// with a `Finder` requires passing both the needle and the haystack,
63/// where the needle is exactly equivalent to the one given to the `Finder`
64/// at construction time. This design was chosen so that callers can have
65/// more precise control over where and how many times a needle is stored.
66/// For example, in cases where Rabin-Karp is just one of several possible
67/// substring search algorithms.
68#[derive(Clone, Debug)]
69pub struct Finder {
70 /// The actual hash.
71 hash: Hash,
72 /// The factor needed to multiply a byte by in order to subtract it from
73 /// the hash. It is defined to be 2^(n-1) (using wrapping exponentiation),
74 /// where n is the length of the needle. This is how we "remove" a byte
75 /// from the hash once the hash window rolls past it.
76 hash_2pow: u32,
77}
78
79impl Finder {
80 /// Create a new Rabin-Karp forward searcher for the given `needle`.
81 ///
82 /// The needle may be empty. The empty needle matches at every byte offset.
83 ///
84 /// Note that callers must pass the same needle to all search calls using
85 /// this `Finder`.
86 #[inline]
87 pub fn new(needle: &[u8]) -> Finder {
88 let mut s = Finder { hash: Hash::new(), hash_2pow: 1 };
89 let first_byte = match needle.get(0) {
90 None => return s,
91 Some(&first_byte) => first_byte,
92 };
93 s.hash.add(first_byte);
94 for b in needle.iter().copied().skip(1) {
95 s.hash.add(b);
96 s.hash_2pow = s.hash_2pow.wrapping_shl(1);
97 }
98 s
99 }
100
101 /// Return the first occurrence of the `needle` in the `haystack`
102 /// given. If no such occurrence exists, then `None` is returned.
103 ///
104 /// The `needle` provided must match the needle given to this finder at
105 /// construction time.
106 ///
107 /// The maximum value this can return is `haystack.len()`, which can only
108 /// occur when the needle and haystack both have length zero. Otherwise,
109 /// for non-empty haystacks, the maximum value is `haystack.len() - 1`.
110 #[inline]
111 pub fn find(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> {
112 unsafe {
113 let hstart = haystack.as_ptr();
114 let hend = hstart.add(haystack.len());
115 let nstart = needle.as_ptr();
116 let nend = nstart.add(needle.len());
117 let found = self.find_raw(hstart, hend, nstart, nend)?;
118 Some(found.distance(hstart))
119 }
120 }
121
122 /// Like `find`, but accepts and returns raw pointers.
123 ///
124 /// When a match is found, the pointer returned is guaranteed to be
125 /// `>= start` and `<= end`. The pointer returned is only ever equivalent
126 /// to `end` when both the needle and haystack are empty. (That is, the
127 /// empty string matches the empty string.)
128 ///
129 /// This routine is useful if you're already using raw pointers and would
130 /// like to avoid converting back to a slice before executing a search.
131 ///
132 /// # Safety
133 ///
134 /// Note that `start` and `end` below refer to both pairs of pointers given
135 /// to this routine. That is, the conditions apply to both `hstart`/`hend`
136 /// and `nstart`/`nend`.
137 ///
138 /// * Both `start` and `end` must be valid for reads.
139 /// * Both `start` and `end` must point to an initialized value.
140 /// * Both `start` and `end` must point to the same allocated object and
141 /// must either be in bounds or at most one byte past the end of the
142 /// allocated object.
143 /// * Both `start` and `end` must be _derived from_ a pointer to the same
144 /// object.
145 /// * The distance between `start` and `end` must not overflow `isize`.
146 /// * The distance being in bounds must not rely on "wrapping around" the
147 /// address space.
148 /// * It must be the case that `start <= end`.
149 #[inline]
150 pub unsafe fn find_raw(
151 &self,
152 hstart: *const u8,
153 hend: *const u8,
154 nstart: *const u8,
155 nend: *const u8,
156 ) -> Option<*const u8> {
157 let hlen = hend.distance(hstart);
158 let nlen = nend.distance(nstart);
159 if nlen > hlen {
160 return None;
161 }
162 let mut cur = hstart;
163 let end = hend.sub(nlen);
164 let mut hash = Hash::forward(cur, cur.add(nlen));
165 loop {
166 if self.hash == hash && is_equal_raw(cur, nstart, nlen) {
167 return Some(cur);
168 }
169 if cur >= end {
170 return None;
171 }
172 hash.roll(self, cur.read(), cur.add(nlen).read());
173 cur = cur.add(1);
174 }
175 }
176}
177
178/// A reverse substring searcher using the Rabin-Karp algorithm.
179#[derive(Clone, Debug)]
180pub struct FinderRev(Finder);
181
182impl FinderRev {
183 /// Create a new Rabin-Karp reverse searcher for the given `needle`.
184 #[inline]
185 pub fn new(needle: &[u8]) -> FinderRev {
186 let mut s = FinderRev(Finder { hash: Hash::new(), hash_2pow: 1 });
187 let last_byte = match needle.last() {
188 None => return s,
189 Some(&last_byte) => last_byte,
190 };
191 s.0.hash.add(last_byte);
192 for b in needle.iter().rev().copied().skip(1) {
193 s.0.hash.add(b);
194 s.0.hash_2pow = s.0.hash_2pow.wrapping_shl(1);
195 }
196 s
197 }
198
199 /// Return the last occurrence of the `needle` in the `haystack`
200 /// given. If no such occurrence exists, then `None` is returned.
201 ///
202 /// The `needle` provided must match the needle given to this finder at
203 /// construction time.
204 ///
205 /// The maximum value this can return is `haystack.len()`, which can only
206 /// occur when the needle and haystack both have length zero. Otherwise,
207 /// for non-empty haystacks, the maximum value is `haystack.len() - 1`.
208 #[inline]
209 pub fn rfind(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> {
210 unsafe {
211 let hstart = haystack.as_ptr();
212 let hend = hstart.add(haystack.len());
213 let nstart = needle.as_ptr();
214 let nend = nstart.add(needle.len());
215 let found = self.rfind_raw(hstart, hend, nstart, nend)?;
216 Some(found.distance(hstart))
217 }
218 }
219
220 /// Like `rfind`, but accepts and returns raw pointers.
221 ///
222 /// When a match is found, the pointer returned is guaranteed to be
223 /// `>= start` and `<= end`. The pointer returned is only ever equivalent
224 /// to `end` when both the needle and haystack are empty. (That is, the
225 /// empty string matches the empty string.)
226 ///
227 /// This routine is useful if you're already using raw pointers and would
228 /// like to avoid converting back to a slice before executing a search.
229 ///
230 /// # Safety
231 ///
232 /// Note that `start` and `end` below refer to both pairs of pointers given
233 /// to this routine. That is, the conditions apply to both `hstart`/`hend`
234 /// and `nstart`/`nend`.
235 ///
236 /// * Both `start` and `end` must be valid for reads.
237 /// * Both `start` and `end` must point to an initialized value.
238 /// * Both `start` and `end` must point to the same allocated object and
239 /// must either be in bounds or at most one byte past the end of the
240 /// allocated object.
241 /// * Both `start` and `end` must be _derived from_ a pointer to the same
242 /// object.
243 /// * The distance between `start` and `end` must not overflow `isize`.
244 /// * The distance being in bounds must not rely on "wrapping around" the
245 /// address space.
246 /// * It must be the case that `start <= end`.
247 #[inline]
248 pub unsafe fn rfind_raw(
249 &self,
250 hstart: *const u8,
251 hend: *const u8,
252 nstart: *const u8,
253 nend: *const u8,
254 ) -> Option<*const u8> {
255 let hlen = hend.distance(hstart);
256 let nlen = nend.distance(nstart);
257 if nlen > hlen {
258 return None;
259 }
260 let mut cur = hend.sub(nlen);
261 let start = hstart;
262 let mut hash = Hash::reverse(cur, cur.add(nlen));
263 loop {
264 if self.0.hash == hash && is_equal_raw(cur, nstart, nlen) {
265 return Some(cur);
266 }
267 if cur <= start {
268 return None;
269 }
270 cur = cur.sub(1);
271 hash.roll(&self.0, cur.add(nlen).read(), cur.read());
272 }
273 }
274}
275
276/// Whether RK is believed to be very fast for the given needle/haystack.
277#[inline]
278pub(crate) fn is_fast(haystack: &[u8], _needle: &[u8]) -> bool {
279 haystack.len() < 16
280}
281
282/// A Rabin-Karp hash. This might represent the hash of a needle, or the hash
283/// of a rolling window in the haystack.
284#[derive(Clone, Copy, Debug, Default, Eq, PartialEq)]
285struct Hash(u32);
286
287impl Hash {
288 /// Create a new hash that represents the empty string.
289 #[inline(always)]
290 fn new() -> Hash {
291 Hash(0)
292 }
293
294 /// Create a new hash from the bytes given for use in forward searches.
295 ///
296 /// # Safety
297 ///
298 /// The given pointers must be valid to read from within their range.
299 #[inline(always)]
300 unsafe fn forward(mut start: *const u8, end: *const u8) -> Hash {
301 let mut hash = Hash::new();
302 while start < end {
303 hash.add(start.read());
304 start = start.add(1);
305 }
306 hash
307 }
308
309 /// Create a new hash from the bytes given for use in reverse searches.
310 ///
311 /// # Safety
312 ///
313 /// The given pointers must be valid to read from within their range.
314 #[inline(always)]
315 unsafe fn reverse(start: *const u8, mut end: *const u8) -> Hash {
316 let mut hash = Hash::new();
317 while start < end {
318 end = end.sub(1);
319 hash.add(end.read());
320 }
321 hash
322 }
323
324 /// Add 'new' and remove 'old' from this hash. The given needle hash should
325 /// correspond to the hash computed for the needle being searched for.
326 ///
327 /// This is meant to be used when the rolling window of the haystack is
328 /// advanced.
329 #[inline(always)]
330 fn roll(&mut self, finder: &Finder, old: u8, new: u8) {
331 self.del(finder, old);
332 self.add(new);
333 }
334
335 /// Add a byte to this hash.
336 #[inline(always)]
337 fn add(&mut self, byte: u8) {
338 self.0 = self.0.wrapping_shl(1).wrapping_add(u32::from(byte));
339 }
340
341 /// Remove a byte from this hash. The given needle hash should correspond
342 /// to the hash computed for the needle being searched for.
343 #[inline(always)]
344 fn del(&mut self, finder: &Finder, byte: u8) {
345 let factor = finder.hash_2pow;
346 self.0 = self.0.wrapping_sub(u32::from(byte).wrapping_mul(factor));
347 }
348}
349
350/// Returns true when `x[i] == y[i]` for all `0 <= i < n`.
351///
352/// We forcefully don't inline this to hint at the compiler that it is unlikely
353/// to be called. This causes the inner rabinkarp loop above to be a bit
354/// tighter and leads to some performance improvement. See the
355/// memmem/krate/prebuilt/sliceslice-words/words benchmark.
356///
357/// # Safety
358///
359/// Same as `crate::arch::all::is_equal_raw`.
360#[cold]
361#[inline(never)]
362unsafe fn is_equal_raw(x: *const u8, y: *const u8, n: usize) -> bool {
363 crate::arch::all::is_equal_raw(x, y, n)
364}
365
366#[cfg(test)]
367mod tests {
368 use super::*;
369
370 define_substring_forward_quickcheck!(|h, n| Some(
371 Finder::new(n).find(h, n)
372 ));
373 define_substring_reverse_quickcheck!(|h, n| Some(
374 FinderRev::new(n).rfind(h, n)
375 ));
376
377 #[test]
378 fn forward() {
379 crate::tests::substring::Runner::new()
380 .fwd(|h, n| Some(Finder::new(n).find(h, n)))
381 .run();
382 }
383
384 #[test]
385 fn reverse() {
386 crate::tests::substring::Runner::new()
387 .rev(|h, n| Some(FinderRev::new(n).rfind(h, n)))
388 .run();
389 }
390}
391