| 1 | /*! |
| 2 | An implementation of the [Rabin-Karp substring search algorithm][rabinkarp]. |
| 3 | |
| 4 | Rabin-Karp works by creating a hash of the needle provided and then computing |
| 5 | a rolling hash for each needle sized window in the haystack. When the rolling |
| 6 | hash matches the hash of the needle, a byte-wise comparison is done to check |
| 7 | if a match exists. The worst case time complexity of Rabin-Karp is `O(m * |
| 8 | n)` where `m ~ len(needle)` and `n ~ len(haystack)`. Its worst case space |
| 9 | complexity is constant. |
| 10 | |
| 11 | The main utility of Rabin-Karp is that the searcher can be constructed very |
| 12 | quickly with very little memory. This makes it especially useful when searching |
| 13 | for small needles in small haystacks, as it might finish its search before a |
| 14 | beefier algorithm (like Two-Way) even starts. |
| 15 | |
| 16 | [rabinkarp]: https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm |
| 17 | */ |
| 18 | |
| 19 | /* |
| 20 | (This was the comment I wrote for this module originally when it was not |
| 21 | exposed. The comment still looks useful, but it's a bit in the weeds, so it's |
| 22 | not public itself.) |
| 23 | |
| 24 | This module implements the classical Rabin-Karp substring search algorithm, |
| 25 | with no extra frills. While its use would seem to break our time complexity |
| 26 | guarantee of O(m+n) (RK's time complexity is O(mn)), we are careful to only |
| 27 | ever use RK on a constant subset of haystacks. The main point here is that |
| 28 | RK has good latency properties for small needles/haystacks. It's very quick |
| 29 | to compute a needle hash and zip through the haystack when compared to |
| 30 | initializing Two-Way, for example. And this is especially useful for cases |
| 31 | where the haystack is just too short for vector instructions to do much good. |
| 32 | |
| 33 | The hashing function used here is the same one recommended by ESMAJ. |
| 34 | |
| 35 | Another choice instead of Rabin-Karp would be Shift-Or. But its latency |
| 36 | isn't quite as good since its preprocessing time is a bit more expensive |
| 37 | (both in practice and in theory). However, perhaps Shift-Or has a place |
| 38 | somewhere else for short patterns. I think the main problem is that it |
| 39 | requires space proportional to the alphabet and the needle. If we, for |
| 40 | example, supported needles up to length 16, then the total table size would be |
| 41 | len(alphabet)*size_of::<u16>()==512 bytes. Which isn't exactly small, and it's |
| 42 | probably bad to put that on the stack. So ideally, we'd throw it on the heap, |
| 43 | but we'd really like to write as much code without using alloc/std as possible. |
| 44 | But maybe it's worth the special casing. It's a TODO to benchmark. |
| 45 | |
| 46 | Wikipedia has a decent explanation, if a bit heavy on the theory: |
| 47 | https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm |
| 48 | |
| 49 | But ESMAJ provides something a bit more concrete: |
| 50 | http://www-igm.univ-mlv.fr/~lecroq/string/node5.html |
| 51 | |
| 52 | Finally, aho-corasick uses Rabin-Karp for multiple pattern match in some cases: |
| 53 | https://github.com/BurntSushi/aho-corasick/blob/3852632f10587db0ff72ef29e88d58bf305a0946/src/packed/rabinkarp.rs |
| 54 | */ |
| 55 | |
| 56 | use crate::ext::Pointer; |
| 57 | |
| 58 | /// A forward substring searcher using the Rabin-Karp algorithm. |
| 59 | /// |
| 60 | /// Note that, as a lower level API, a `Finder` does not have access to the |
| 61 | /// needle it was constructed with. For this reason, executing a search |
| 62 | /// with a `Finder` requires passing both the needle and the haystack, |
| 63 | /// where the needle is exactly equivalent to the one given to the `Finder` |
| 64 | /// at construction time. This design was chosen so that callers can have |
| 65 | /// more precise control over where and how many times a needle is stored. |
| 66 | /// For example, in cases where Rabin-Karp is just one of several possible |
| 67 | /// substring search algorithms. |
| 68 | #[derive (Clone, Debug)] |
| 69 | pub struct Finder { |
| 70 | /// The actual hash. |
| 71 | hash: Hash, |
| 72 | /// The factor needed to multiply a byte by in order to subtract it from |
| 73 | /// the hash. It is defined to be 2^(n-1) (using wrapping exponentiation), |
| 74 | /// where n is the length of the needle. This is how we "remove" a byte |
| 75 | /// from the hash once the hash window rolls past it. |
| 76 | hash_2pow: u32, |
| 77 | } |
| 78 | |
| 79 | impl Finder { |
| 80 | /// Create a new Rabin-Karp forward searcher for the given `needle`. |
| 81 | /// |
| 82 | /// The needle may be empty. The empty needle matches at every byte offset. |
| 83 | /// |
| 84 | /// Note that callers must pass the same needle to all search calls using |
| 85 | /// this `Finder`. |
| 86 | #[inline ] |
| 87 | pub fn new(needle: &[u8]) -> Finder { |
| 88 | let mut s = Finder { hash: Hash::new(), hash_2pow: 1 }; |
| 89 | let first_byte = match needle.get(0) { |
| 90 | None => return s, |
| 91 | Some(&first_byte) => first_byte, |
| 92 | }; |
| 93 | s.hash.add(first_byte); |
| 94 | for b in needle.iter().copied().skip(1) { |
| 95 | s.hash.add(b); |
| 96 | s.hash_2pow = s.hash_2pow.wrapping_shl(1); |
| 97 | } |
| 98 | s |
| 99 | } |
| 100 | |
| 101 | /// Return the first occurrence of the `needle` in the `haystack` |
| 102 | /// given. If no such occurrence exists, then `None` is returned. |
| 103 | /// |
| 104 | /// The `needle` provided must match the needle given to this finder at |
| 105 | /// construction time. |
| 106 | /// |
| 107 | /// The maximum value this can return is `haystack.len()`, which can only |
| 108 | /// occur when the needle and haystack both have length zero. Otherwise, |
| 109 | /// for non-empty haystacks, the maximum value is `haystack.len() - 1`. |
| 110 | #[inline ] |
| 111 | pub fn find(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> { |
| 112 | unsafe { |
| 113 | let hstart = haystack.as_ptr(); |
| 114 | let hend = hstart.add(haystack.len()); |
| 115 | let nstart = needle.as_ptr(); |
| 116 | let nend = nstart.add(needle.len()); |
| 117 | let found = self.find_raw(hstart, hend, nstart, nend)?; |
| 118 | Some(found.distance(hstart)) |
| 119 | } |
| 120 | } |
| 121 | |
| 122 | /// Like `find`, but accepts and returns raw pointers. |
| 123 | /// |
| 124 | /// When a match is found, the pointer returned is guaranteed to be |
| 125 | /// `>= start` and `<= end`. The pointer returned is only ever equivalent |
| 126 | /// to `end` when both the needle and haystack are empty. (That is, the |
| 127 | /// empty string matches the empty string.) |
| 128 | /// |
| 129 | /// This routine is useful if you're already using raw pointers and would |
| 130 | /// like to avoid converting back to a slice before executing a search. |
| 131 | /// |
| 132 | /// # Safety |
| 133 | /// |
| 134 | /// Note that `start` and `end` below refer to both pairs of pointers given |
| 135 | /// to this routine. That is, the conditions apply to both `hstart`/`hend` |
| 136 | /// and `nstart`/`nend`. |
| 137 | /// |
| 138 | /// * Both `start` and `end` must be valid for reads. |
| 139 | /// * Both `start` and `end` must point to an initialized value. |
| 140 | /// * Both `start` and `end` must point to the same allocated object and |
| 141 | /// must either be in bounds or at most one byte past the end of the |
| 142 | /// allocated object. |
| 143 | /// * Both `start` and `end` must be _derived from_ a pointer to the same |
| 144 | /// object. |
| 145 | /// * The distance between `start` and `end` must not overflow `isize`. |
| 146 | /// * The distance being in bounds must not rely on "wrapping around" the |
| 147 | /// address space. |
| 148 | /// * It must be the case that `start <= end`. |
| 149 | #[inline ] |
| 150 | pub unsafe fn find_raw( |
| 151 | &self, |
| 152 | hstart: *const u8, |
| 153 | hend: *const u8, |
| 154 | nstart: *const u8, |
| 155 | nend: *const u8, |
| 156 | ) -> Option<*const u8> { |
| 157 | let hlen = hend.distance(hstart); |
| 158 | let nlen = nend.distance(nstart); |
| 159 | if nlen > hlen { |
| 160 | return None; |
| 161 | } |
| 162 | let mut cur = hstart; |
| 163 | let end = hend.sub(nlen); |
| 164 | let mut hash = Hash::forward(cur, cur.add(nlen)); |
| 165 | loop { |
| 166 | if self.hash == hash && is_equal_raw(cur, nstart, nlen) { |
| 167 | return Some(cur); |
| 168 | } |
| 169 | if cur >= end { |
| 170 | return None; |
| 171 | } |
| 172 | hash.roll(self, cur.read(), cur.add(nlen).read()); |
| 173 | cur = cur.add(1); |
| 174 | } |
| 175 | } |
| 176 | } |
| 177 | |
| 178 | /// A reverse substring searcher using the Rabin-Karp algorithm. |
| 179 | #[derive (Clone, Debug)] |
| 180 | pub struct FinderRev(Finder); |
| 181 | |
| 182 | impl FinderRev { |
| 183 | /// Create a new Rabin-Karp reverse searcher for the given `needle`. |
| 184 | #[inline ] |
| 185 | pub fn new(needle: &[u8]) -> FinderRev { |
| 186 | let mut s = FinderRev(Finder { hash: Hash::new(), hash_2pow: 1 }); |
| 187 | let last_byte = match needle.last() { |
| 188 | None => return s, |
| 189 | Some(&last_byte) => last_byte, |
| 190 | }; |
| 191 | s.0.hash.add(last_byte); |
| 192 | for b in needle.iter().rev().copied().skip(1) { |
| 193 | s.0.hash.add(b); |
| 194 | s.0.hash_2pow = s.0.hash_2pow.wrapping_shl(1); |
| 195 | } |
| 196 | s |
| 197 | } |
| 198 | |
| 199 | /// Return the last occurrence of the `needle` in the `haystack` |
| 200 | /// given. If no such occurrence exists, then `None` is returned. |
| 201 | /// |
| 202 | /// The `needle` provided must match the needle given to this finder at |
| 203 | /// construction time. |
| 204 | /// |
| 205 | /// The maximum value this can return is `haystack.len()`, which can only |
| 206 | /// occur when the needle and haystack both have length zero. Otherwise, |
| 207 | /// for non-empty haystacks, the maximum value is `haystack.len() - 1`. |
| 208 | #[inline ] |
| 209 | pub fn rfind(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> { |
| 210 | unsafe { |
| 211 | let hstart = haystack.as_ptr(); |
| 212 | let hend = hstart.add(haystack.len()); |
| 213 | let nstart = needle.as_ptr(); |
| 214 | let nend = nstart.add(needle.len()); |
| 215 | let found = self.rfind_raw(hstart, hend, nstart, nend)?; |
| 216 | Some(found.distance(hstart)) |
| 217 | } |
| 218 | } |
| 219 | |
| 220 | /// Like `rfind`, but accepts and returns raw pointers. |
| 221 | /// |
| 222 | /// When a match is found, the pointer returned is guaranteed to be |
| 223 | /// `>= start` and `<= end`. The pointer returned is only ever equivalent |
| 224 | /// to `end` when both the needle and haystack are empty. (That is, the |
| 225 | /// empty string matches the empty string.) |
| 226 | /// |
| 227 | /// This routine is useful if you're already using raw pointers and would |
| 228 | /// like to avoid converting back to a slice before executing a search. |
| 229 | /// |
| 230 | /// # Safety |
| 231 | /// |
| 232 | /// Note that `start` and `end` below refer to both pairs of pointers given |
| 233 | /// to this routine. That is, the conditions apply to both `hstart`/`hend` |
| 234 | /// and `nstart`/`nend`. |
| 235 | /// |
| 236 | /// * Both `start` and `end` must be valid for reads. |
| 237 | /// * Both `start` and `end` must point to an initialized value. |
| 238 | /// * Both `start` and `end` must point to the same allocated object and |
| 239 | /// must either be in bounds or at most one byte past the end of the |
| 240 | /// allocated object. |
| 241 | /// * Both `start` and `end` must be _derived from_ a pointer to the same |
| 242 | /// object. |
| 243 | /// * The distance between `start` and `end` must not overflow `isize`. |
| 244 | /// * The distance being in bounds must not rely on "wrapping around" the |
| 245 | /// address space. |
| 246 | /// * It must be the case that `start <= end`. |
| 247 | #[inline ] |
| 248 | pub unsafe fn rfind_raw( |
| 249 | &self, |
| 250 | hstart: *const u8, |
| 251 | hend: *const u8, |
| 252 | nstart: *const u8, |
| 253 | nend: *const u8, |
| 254 | ) -> Option<*const u8> { |
| 255 | let hlen = hend.distance(hstart); |
| 256 | let nlen = nend.distance(nstart); |
| 257 | if nlen > hlen { |
| 258 | return None; |
| 259 | } |
| 260 | let mut cur = hend.sub(nlen); |
| 261 | let start = hstart; |
| 262 | let mut hash = Hash::reverse(cur, cur.add(nlen)); |
| 263 | loop { |
| 264 | if self.0.hash == hash && is_equal_raw(cur, nstart, nlen) { |
| 265 | return Some(cur); |
| 266 | } |
| 267 | if cur <= start { |
| 268 | return None; |
| 269 | } |
| 270 | cur = cur.sub(1); |
| 271 | hash.roll(&self.0, cur.add(nlen).read(), cur.read()); |
| 272 | } |
| 273 | } |
| 274 | } |
| 275 | |
| 276 | /// Whether RK is believed to be very fast for the given needle/haystack. |
| 277 | #[inline ] |
| 278 | pub(crate) fn is_fast(haystack: &[u8], _needle: &[u8]) -> bool { |
| 279 | haystack.len() < 16 |
| 280 | } |
| 281 | |
| 282 | /// A Rabin-Karp hash. This might represent the hash of a needle, or the hash |
| 283 | /// of a rolling window in the haystack. |
| 284 | #[derive (Clone, Copy, Debug, Default, Eq, PartialEq)] |
| 285 | struct Hash(u32); |
| 286 | |
| 287 | impl Hash { |
| 288 | /// Create a new hash that represents the empty string. |
| 289 | #[inline (always)] |
| 290 | fn new() -> Hash { |
| 291 | Hash(0) |
| 292 | } |
| 293 | |
| 294 | /// Create a new hash from the bytes given for use in forward searches. |
| 295 | /// |
| 296 | /// # Safety |
| 297 | /// |
| 298 | /// The given pointers must be valid to read from within their range. |
| 299 | #[inline (always)] |
| 300 | unsafe fn forward(mut start: *const u8, end: *const u8) -> Hash { |
| 301 | let mut hash = Hash::new(); |
| 302 | while start < end { |
| 303 | hash.add(start.read()); |
| 304 | start = start.add(1); |
| 305 | } |
| 306 | hash |
| 307 | } |
| 308 | |
| 309 | /// Create a new hash from the bytes given for use in reverse searches. |
| 310 | /// |
| 311 | /// # Safety |
| 312 | /// |
| 313 | /// The given pointers must be valid to read from within their range. |
| 314 | #[inline (always)] |
| 315 | unsafe fn reverse(start: *const u8, mut end: *const u8) -> Hash { |
| 316 | let mut hash = Hash::new(); |
| 317 | while start < end { |
| 318 | end = end.sub(1); |
| 319 | hash.add(end.read()); |
| 320 | } |
| 321 | hash |
| 322 | } |
| 323 | |
| 324 | /// Add 'new' and remove 'old' from this hash. The given needle hash should |
| 325 | /// correspond to the hash computed for the needle being searched for. |
| 326 | /// |
| 327 | /// This is meant to be used when the rolling window of the haystack is |
| 328 | /// advanced. |
| 329 | #[inline (always)] |
| 330 | fn roll(&mut self, finder: &Finder, old: u8, new: u8) { |
| 331 | self.del(finder, old); |
| 332 | self.add(new); |
| 333 | } |
| 334 | |
| 335 | /// Add a byte to this hash. |
| 336 | #[inline (always)] |
| 337 | fn add(&mut self, byte: u8) { |
| 338 | self.0 = self.0.wrapping_shl(1).wrapping_add(u32::from(byte)); |
| 339 | } |
| 340 | |
| 341 | /// Remove a byte from this hash. The given needle hash should correspond |
| 342 | /// to the hash computed for the needle being searched for. |
| 343 | #[inline (always)] |
| 344 | fn del(&mut self, finder: &Finder, byte: u8) { |
| 345 | let factor = finder.hash_2pow; |
| 346 | self.0 = self.0.wrapping_sub(u32::from(byte).wrapping_mul(factor)); |
| 347 | } |
| 348 | } |
| 349 | |
| 350 | /// Returns true when `x[i] == y[i]` for all `0 <= i < n`. |
| 351 | /// |
| 352 | /// We forcefully don't inline this to hint at the compiler that it is unlikely |
| 353 | /// to be called. This causes the inner rabinkarp loop above to be a bit |
| 354 | /// tighter and leads to some performance improvement. See the |
| 355 | /// memmem/krate/prebuilt/sliceslice-words/words benchmark. |
| 356 | /// |
| 357 | /// # Safety |
| 358 | /// |
| 359 | /// Same as `crate::arch::all::is_equal_raw`. |
| 360 | #[cold ] |
| 361 | #[inline (never)] |
| 362 | unsafe fn is_equal_raw(x: *const u8, y: *const u8, n: usize) -> bool { |
| 363 | crate::arch::all::is_equal_raw(x, y, n) |
| 364 | } |
| 365 | |
| 366 | #[cfg (test)] |
| 367 | mod tests { |
| 368 | use super::*; |
| 369 | |
| 370 | define_substring_forward_quickcheck!(|h, n| Some( |
| 371 | Finder::new(n).find(h, n) |
| 372 | )); |
| 373 | define_substring_reverse_quickcheck!(|h, n| Some( |
| 374 | FinderRev::new(n).rfind(h, n) |
| 375 | )); |
| 376 | |
| 377 | #[test ] |
| 378 | fn forward() { |
| 379 | crate::tests::substring::Runner::new() |
| 380 | .fwd(|h, n| Some(Finder::new(n).find(h, n))) |
| 381 | .run(); |
| 382 | } |
| 383 | |
| 384 | #[test ] |
| 385 | fn reverse() { |
| 386 | crate::tests::substring::Runner::new() |
| 387 | .rev(|h, n| Some(FinderRev::new(n).rfind(h, n))) |
| 388 | .run(); |
| 389 | } |
| 390 | } |
| 391 | |