1 | /*! |
---|---|

2 | An implementation of the [Rabin-Karp substring search algorithm][rabinkarp]. |

3 | |

4 | Rabin-Karp works by creating a hash of the needle provided and then computing |

5 | a rolling hash for each needle sized window in the haystack. When the rolling |

6 | hash matches the hash of the needle, a byte-wise comparison is done to check |

7 | if a match exists. The worst case time complexity of Rabin-Karp is `O(m * |

8 | n)` where `m ~ len(needle)` and `n ~ len(haystack)`. Its worst case space |

9 | complexity is constant. |

10 | |

11 | The main utility of Rabin-Karp is that the searcher can be constructed very |

12 | quickly with very little memory. This makes it especially useful when searching |

13 | for small needles in small haystacks, as it might finish its search before a |

14 | beefier algorithm (like Two-Way) even starts. |

15 | |

16 | [rabinkarp]: https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm |

17 | */ |

18 | |

19 | /* |

20 | (This was the comment I wrote for this module originally when it was not |

21 | exposed. The comment still looks useful, but it's a bit in the weeds, so it's |

22 | not public itself.) |

23 | |

24 | This module implements the classical Rabin-Karp substring search algorithm, |

25 | with no extra frills. While its use would seem to break our time complexity |

26 | guarantee of O(m+n) (RK's time complexity is O(mn)), we are careful to only |

27 | ever use RK on a constant subset of haystacks. The main point here is that |

28 | RK has good latency properties for small needles/haystacks. It's very quick |

29 | to compute a needle hash and zip through the haystack when compared to |

30 | initializing Two-Way, for example. And this is especially useful for cases |

31 | where the haystack is just too short for vector instructions to do much good. |

32 | |

33 | The hashing function used here is the same one recommended by ESMAJ. |

34 | |

35 | Another choice instead of Rabin-Karp would be Shift-Or. But its latency |

36 | isn't quite as good since its preprocessing time is a bit more expensive |

37 | (both in practice and in theory). However, perhaps Shift-Or has a place |

38 | somewhere else for short patterns. I think the main problem is that it |

39 | requires space proportional to the alphabet and the needle. If we, for |

40 | example, supported needles up to length 16, then the total table size would be |

41 | len(alphabet)*size_of::<u16>()==512 bytes. Which isn't exactly small, and it's |

42 | probably bad to put that on the stack. So ideally, we'd throw it on the heap, |

43 | but we'd really like to write as much code without using alloc/std as possible. |

44 | But maybe it's worth the special casing. It's a TODO to benchmark. |

45 | |

46 | Wikipedia has a decent explanation, if a bit heavy on the theory: |

47 | https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm |

48 | |

49 | But ESMAJ provides something a bit more concrete: |

50 | http://www-igm.univ-mlv.fr/~lecroq/string/node5.html |

51 | |

52 | Finally, aho-corasick uses Rabin-Karp for multiple pattern match in some cases: |

53 | https://github.com/BurntSushi/aho-corasick/blob/3852632f10587db0ff72ef29e88d58bf305a0946/src/packed/rabinkarp.rs |

54 | */ |

55 | |

56 | use crate::ext::Pointer; |

57 | |

58 | /// A forward substring searcher using the Rabin-Karp algorithm. |

59 | /// |

60 | /// Note that, as a lower level API, a `Finder` does not have access to the |

61 | /// needle it was constructed with. For this reason, executing a search |

62 | /// with a `Finder` requires passing both the needle and the haystack, |

63 | /// where the needle is exactly equivalent to the one given to the `Finder` |

64 | /// at construction time. This design was chosen so that callers can have |

65 | /// more precise control over where and how many times a needle is stored. |

66 | /// For example, in cases where Rabin-Karp is just one of several possible |

67 | /// substring search algorithms. |

68 | #[derive(Clone, Debug)] |

69 | pub struct Finder { |

70 | /// The actual hash. |

71 | hash: Hash, |

72 | /// The factor needed to multiply a byte by in order to subtract it from |

73 | /// the hash. It is defined to be 2^(n-1) (using wrapping exponentiation), |

74 | /// where n is the length of the needle. This is how we "remove" a byte |

75 | /// from the hash once the hash window rolls past it. |

76 | hash_2pow: u32, |

77 | } |

78 | |

79 | impl Finder { |

80 | /// Create a new Rabin-Karp forward searcher for the given `needle`. |

81 | /// |

82 | /// The needle may be empty. The empty needle matches at every byte offset. |

83 | /// |

84 | /// Note that callers must pass the same needle to all search calls using |

85 | /// this `Finder`. |

86 | #[inline] |

87 | pub fn new(needle: &[u8]) -> Finder { |

88 | let mut s = Finder { hash: Hash::new(), hash_2pow: 1 }; |

89 | let first_byte = match needle.get(0) { |

90 | None => return s, |

91 | Some(&first_byte) => first_byte, |

92 | }; |

93 | s.hash.add(first_byte); |

94 | for b in needle.iter().copied().skip(1) { |

95 | s.hash.add(b); |

96 | s.hash_2pow = s.hash_2pow.wrapping_shl(1); |

97 | } |

98 | s |

99 | } |

100 | |

101 | /// Return the first occurrence of the `needle` in the `haystack` |

102 | /// given. If no such occurrence exists, then `None` is returned. |

103 | /// |

104 | /// The `needle` provided must match the needle given to this finder at |

105 | /// construction time. |

106 | /// |

107 | /// The maximum value this can return is `haystack.len()`, which can only |

108 | /// occur when the needle and haystack both have length zero. Otherwise, |

109 | /// for non-empty haystacks, the maximum value is `haystack.len() - 1`. |

110 | #[inline] |

111 | pub fn find(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> { |

112 | unsafe { |

113 | let hstart = haystack.as_ptr(); |

114 | let hend = hstart.add(haystack.len()); |

115 | let nstart = needle.as_ptr(); |

116 | let nend = nstart.add(needle.len()); |

117 | let found = self.find_raw(hstart, hend, nstart, nend)?; |

118 | Some(found.distance(hstart)) |

119 | } |

120 | } |

121 | |

122 | /// Like `find`, but accepts and returns raw pointers. |

123 | /// |

124 | /// When a match is found, the pointer returned is guaranteed to be |

125 | /// `>= start` and `<= end`. The pointer returned is only ever equivalent |

126 | /// to `end` when both the needle and haystack are empty. (That is, the |

127 | /// empty string matches the empty string.) |

128 | /// |

129 | /// This routine is useful if you're already using raw pointers and would |

130 | /// like to avoid converting back to a slice before executing a search. |

131 | /// |

132 | /// # Safety |

133 | /// |

134 | /// Note that `start` and `end` below refer to both pairs of pointers given |

135 | /// to this routine. That is, the conditions apply to both `hstart`/`hend` |

136 | /// and `nstart`/`nend`. |

137 | /// |

138 | /// * Both `start` and `end` must be valid for reads. |

139 | /// * Both `start` and `end` must point to an initialized value. |

140 | /// * Both `start` and `end` must point to the same allocated object and |

141 | /// must either be in bounds or at most one byte past the end of the |

142 | /// allocated object. |

143 | /// * Both `start` and `end` must be _derived from_ a pointer to the same |

144 | /// object. |

145 | /// * The distance between `start` and `end` must not overflow `isize`. |

146 | /// * The distance being in bounds must not rely on "wrapping around" the |

147 | /// address space. |

148 | /// * It must be the case that `start <= end`. |

149 | #[inline] |

150 | pub unsafe fn find_raw( |

151 | &self, |

152 | hstart: *const u8, |

153 | hend: *const u8, |

154 | nstart: *const u8, |

155 | nend: *const u8, |

156 | ) -> Option<*const u8> { |

157 | let hlen = hend.distance(hstart); |

158 | let nlen = nend.distance(nstart); |

159 | if nlen > hlen { |

160 | return None; |

161 | } |

162 | let mut cur = hstart; |

163 | let end = hend.sub(nlen); |

164 | let mut hash = Hash::forward(cur, cur.add(nlen)); |

165 | loop { |

166 | if self.hash == hash && is_equal_raw(cur, nstart, nlen) { |

167 | return Some(cur); |

168 | } |

169 | if cur >= end { |

170 | return None; |

171 | } |

172 | hash.roll(self, cur.read(), cur.add(nlen).read()); |

173 | cur = cur.add(1); |

174 | } |

175 | } |

176 | } |

177 | |

178 | /// A reverse substring searcher using the Rabin-Karp algorithm. |

179 | #[derive(Clone, Debug)] |

180 | pub struct FinderRev(Finder); |

181 | |

182 | impl FinderRev { |

183 | /// Create a new Rabin-Karp reverse searcher for the given `needle`. |

184 | #[inline] |

185 | pub fn new(needle: &[u8]) -> FinderRev { |

186 | let mut s = FinderRev(Finder { hash: Hash::new(), hash_2pow: 1 }); |

187 | let last_byte = match needle.last() { |

188 | None => return s, |

189 | Some(&last_byte) => last_byte, |

190 | }; |

191 | s.0.hash.add(last_byte); |

192 | for b in needle.iter().rev().copied().skip(1) { |

193 | s.0.hash.add(b); |

194 | s.0.hash_2pow = s.0.hash_2pow.wrapping_shl(1); |

195 | } |

196 | s |

197 | } |

198 | |

199 | /// Return the last occurrence of the `needle` in the `haystack` |

200 | /// given. If no such occurrence exists, then `None` is returned. |

201 | /// |

202 | /// The `needle` provided must match the needle given to this finder at |

203 | /// construction time. |

204 | /// |

205 | /// The maximum value this can return is `haystack.len()`, which can only |

206 | /// occur when the needle and haystack both have length zero. Otherwise, |

207 | /// for non-empty haystacks, the maximum value is `haystack.len() - 1`. |

208 | #[inline] |

209 | pub fn rfind(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> { |

210 | unsafe { |

211 | let hstart = haystack.as_ptr(); |

212 | let hend = hstart.add(haystack.len()); |

213 | let nstart = needle.as_ptr(); |

214 | let nend = nstart.add(needle.len()); |

215 | let found = self.rfind_raw(hstart, hend, nstart, nend)?; |

216 | Some(found.distance(hstart)) |

217 | } |

218 | } |

219 | |

220 | /// Like `rfind`, but accepts and returns raw pointers. |

221 | /// |

222 | /// When a match is found, the pointer returned is guaranteed to be |

223 | /// `>= start` and `<= end`. The pointer returned is only ever equivalent |

224 | /// to `end` when both the needle and haystack are empty. (That is, the |

225 | /// empty string matches the empty string.) |

226 | /// |

227 | /// This routine is useful if you're already using raw pointers and would |

228 | /// like to avoid converting back to a slice before executing a search. |

229 | /// |

230 | /// # Safety |

231 | /// |

232 | /// Note that `start` and `end` below refer to both pairs of pointers given |

233 | /// to this routine. That is, the conditions apply to both `hstart`/`hend` |

234 | /// and `nstart`/`nend`. |

235 | /// |

236 | /// * Both `start` and `end` must be valid for reads. |

237 | /// * Both `start` and `end` must point to an initialized value. |

238 | /// * Both `start` and `end` must point to the same allocated object and |

239 | /// must either be in bounds or at most one byte past the end of the |

240 | /// allocated object. |

241 | /// * Both `start` and `end` must be _derived from_ a pointer to the same |

242 | /// object. |

243 | /// * The distance between `start` and `end` must not overflow `isize`. |

244 | /// * The distance being in bounds must not rely on "wrapping around" the |

245 | /// address space. |

246 | /// * It must be the case that `start <= end`. |

247 | #[inline] |

248 | pub unsafe fn rfind_raw( |

249 | &self, |

250 | hstart: *const u8, |

251 | hend: *const u8, |

252 | nstart: *const u8, |

253 | nend: *const u8, |

254 | ) -> Option<*const u8> { |

255 | let hlen = hend.distance(hstart); |

256 | let nlen = nend.distance(nstart); |

257 | if nlen > hlen { |

258 | return None; |

259 | } |

260 | let mut cur = hend.sub(nlen); |

261 | let start = hstart; |

262 | let mut hash = Hash::reverse(cur, cur.add(nlen)); |

263 | loop { |

264 | if self.0.hash == hash && is_equal_raw(cur, nstart, nlen) { |

265 | return Some(cur); |

266 | } |

267 | if cur <= start { |

268 | return None; |

269 | } |

270 | cur = cur.sub(1); |

271 | hash.roll(&self.0, cur.add(nlen).read(), cur.read()); |

272 | } |

273 | } |

274 | } |

275 | |

276 | /// Whether RK is believed to be very fast for the given needle/haystack. |

277 | #[inline] |

278 | pub(crate) fn is_fast(haystack: &[u8], _needle: &[u8]) -> bool { |

279 | haystack.len() < 16 |

280 | } |

281 | |

282 | /// A Rabin-Karp hash. This might represent the hash of a needle, or the hash |

283 | /// of a rolling window in the haystack. |

284 | #[derive(Clone, Copy, Debug, Default, Eq, PartialEq)] |

285 | struct Hash(u32); |

286 | |

287 | impl Hash { |

288 | /// Create a new hash that represents the empty string. |

289 | #[inline(always)] |

290 | fn new() -> Hash { |

291 | Hash(0) |

292 | } |

293 | |

294 | /// Create a new hash from the bytes given for use in forward searches. |

295 | /// |

296 | /// # Safety |

297 | /// |

298 | /// The given pointers must be valid to read from within their range. |

299 | #[inline(always)] |

300 | unsafe fn forward(mut start: *const u8, end: *const u8) -> Hash { |

301 | let mut hash = Hash::new(); |

302 | while start < end { |

303 | hash.add(start.read()); |

304 | start = start.add(1); |

305 | } |

306 | hash |

307 | } |

308 | |

309 | /// Create a new hash from the bytes given for use in reverse searches. |

310 | /// |

311 | /// # Safety |

312 | /// |

313 | /// The given pointers must be valid to read from within their range. |

314 | #[inline(always)] |

315 | unsafe fn reverse(start: *const u8, mut end: *const u8) -> Hash { |

316 | let mut hash = Hash::new(); |

317 | while start < end { |

318 | end = end.sub(1); |

319 | hash.add(end.read()); |

320 | } |

321 | hash |

322 | } |

323 | |

324 | /// Add 'new' and remove 'old' from this hash. The given needle hash should |

325 | /// correspond to the hash computed for the needle being searched for. |

326 | /// |

327 | /// This is meant to be used when the rolling window of the haystack is |

328 | /// advanced. |

329 | #[inline(always)] |

330 | fn roll(&mut self, finder: &Finder, old: u8, new: u8) { |

331 | self.del(finder, old); |

332 | self.add(new); |

333 | } |

334 | |

335 | /// Add a byte to this hash. |

336 | #[inline(always)] |

337 | fn add(&mut self, byte: u8) { |

338 | self.0 = self.0.wrapping_shl(1).wrapping_add(u32::from(byte)); |

339 | } |

340 | |

341 | /// Remove a byte from this hash. The given needle hash should correspond |

342 | /// to the hash computed for the needle being searched for. |

343 | #[inline(always)] |

344 | fn del(&mut self, finder: &Finder, byte: u8) { |

345 | let factor = finder.hash_2pow; |

346 | self.0 = self.0.wrapping_sub(u32::from(byte).wrapping_mul(factor)); |

347 | } |

348 | } |

349 | |

350 | /// Returns true when `x[i] == y[i]` for all `0 <= i < n`. |

351 | /// |

352 | /// We forcefully don't inline this to hint at the compiler that it is unlikely |

353 | /// to be called. This causes the inner rabinkarp loop above to be a bit |

354 | /// tighter and leads to some performance improvement. See the |

355 | /// memmem/krate/prebuilt/sliceslice-words/words benchmark. |

356 | /// |

357 | /// # Safety |

358 | /// |

359 | /// Same as `crate::arch::all::is_equal_raw`. |

360 | #[cold] |

361 | #[inline(never)] |

362 | unsafe fn is_equal_raw(x: *const u8, y: *const u8, n: usize) -> bool { |

363 | crate::arch::all::is_equal_raw(x, y, n) |

364 | } |

365 | |

366 | #[cfg(test)] |

367 | mod tests { |

368 | use super::*; |

369 | |

370 | define_substring_forward_quickcheck!(|h, n| Some( |

371 | Finder::new(n).find(h, n) |

372 | )); |

373 | define_substring_reverse_quickcheck!(|h, n| Some( |

374 | FinderRev::new(n).rfind(h, n) |

375 | )); |

376 | |

377 | #[test] |

378 | fn forward() { |

379 | crate::tests::substring::Runner::new() |

380 | .fwd(|h, n| Some(Finder::new(n).find(h, n))) |

381 | .run(); |

382 | } |

383 | |

384 | #[test] |

385 | fn reverse() { |

386 | crate::tests::substring::Runner::new() |

387 | .rev(|h, n| Some(FinderRev::new(n).rfind(h, n))) |

388 | .run(); |

389 | } |

390 | } |

391 |