rabinkarp.rs - Codebrowser

1	/!*
2	An implementation of the [Rabin-Karp substring search algorithm][rabinkarp].
3
4	Rabin-Karp works by creating a hash of the needle provided and then computing
5	a rolling hash for each needle sized window in the haystack. When the rolling
6	hash matches the hash of the needle, a byte-wise comparison is done to check
7	if a match exists. The worst case time complexity of Rabin-Karp is `O(m *
8	n)` where `m ~ len(needle)` and `n ~ len(haystack)`. Its worst case space
9	complexity is constant.
10
11	The main utility of Rabin-Karp is that the searcher can be constructed very
12	quickly with very little memory. This makes it especially useful when searching
13	for small needles in small haystacks, as it might finish its search before a
14	beefier algorithm (like Two-Way) even starts.
15
16	[rabinkarp]: https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
17	*/
18
19	/*
20	(This was the comment I wrote for this module originally when it was not
21	exposed. The comment still looks useful, but it's a bit in the weeds, so it's
22	not public itself.)
23
24	This module implements the classical Rabin-Karp substring search algorithm,
25	with no extra frills. While its use would seem to break our time complexity
26	guarantee of O(m+n) (RK's time complexity is O(mn)), we are careful to only
27	ever use RK on a constant subset of haystacks. The main point here is that
28	RK has good latency properties for small needles/haystacks. It's very quick
29	to compute a needle hash and zip through the haystack when compared to
30	initializing Two-Way, for example. And this is especially useful for cases
31	where the haystack is just too short for vector instructions to do much good.
32
33	The hashing function used here is the same one recommended by ESMAJ.
34
35	Another choice instead of Rabin-Karp would be Shift-Or. But its latency
36	isn't quite as good since its preprocessing time is a bit more expensive
37	(both in practice and in theory). However, perhaps Shift-Or has a place
38	somewhere else for short patterns. I think the main problem is that it
39	requires space proportional to the alphabet and the needle. If we, for
40	example, supported needles up to length 16, then the total table size would be
41	len(alphabet)size_of::<u16>()==512 bytes. Which isn't exactly small, and it's*
42	probably bad to put that on the stack. So ideally, we'd throw it on the heap,
43	but we'd really like to write as much code without using alloc/std as possible.
44	But maybe it's worth the special casing. It's a TODO to benchmark.
45
46	Wikipedia has a decent explanation, if a bit heavy on the theory:
47	https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
48
49	But ESMAJ provides something a bit more concrete:
50	http://www-igm.univ-mlv.fr/~lecroq/string/node5.html
51
52	Finally, aho-corasick uses Rabin-Karp for multiple pattern match in some cases:
53	https://github.com/BurntSushi/aho-corasick/blob/3852632f10587db0ff72ef29e88d58bf305a0946/src/packed/rabinkarp.rs
54	*/
55
56	use crate::ext::Pointer;
57
58	/// A forward substring searcher using the Rabin-Karp algorithm.
59	///
60	/// Note that, as a lower level API, a `Finder` does not have access to the
61	/// needle it was constructed with. For this reason, executing a search
62	/// with a `Finder` requires passing both the needle and the haystack,
63	/// where the needle is exactly equivalent to the one given to the `Finder`
64	/// at construction time. This design was chosen so that callers can have
65	/// more precise control over where and how many times a needle is stored.
66	/// For example, in cases where Rabin-Karp is just one of several possible
67	/// substring search algorithms.
68	#[derive(Clone, Debug)]
69	pub struct Finder {
70	/// The actual hash.
71	hash: Hash,
72	/// The factor needed to multiply a byte by in order to subtract it from
73	/// the hash. It is defined to be 2^(n-1) (using wrapping exponentiation),
74	/// where n is the length of the needle. This is how we "remove" a byte
75	/// from the hash once the hash window rolls past it.
76	hash_2pow: u32,
77	}
78
79	impl Finder {
80	/// Create a new Rabin-Karp forward searcher for the given `needle`.
81	///
82	/// The needle may be empty. The empty needle matches at every byte offset.
83	///
84	/// Note that callers must pass the same needle to all search calls using
85	/// this `Finder`.
86	#[inline]
87	pub fn new(needle: &[u8]) -> Finder {
88	let mut s = Finder { hash: Hash::new(), hash_2pow: `1` };
89	let first_byte = match needle.get(`0`) {
90	None => return s,
91	Some(&first_byte) => first_byte,
92	};
93	s.hash.add(first_byte);
94	for b in needle.iter().copied().skip(`1`) {
95	s.hash.add(b);
96	s.hash_2pow = s.hash_2pow.wrapping_shl(`1`);
97	}
98	s
99	}
100
101	/// Return the first occurrence of the `needle` in the `haystack`
102	/// given. If no such occurrence exists, then `None` is returned.
103	///
104	/// The `needle` provided must match the needle given to this finder at
105	/// construction time.
106	///
107	/// The maximum value this can return is `haystack.len()`, which can only
108	/// occur when the needle and haystack both have length zero. Otherwise,
109	/// for non-empty haystacks, the maximum value is `haystack.len() - 1`.
110	#[inline]
111	pub fn find(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> {
112	unsafe {
113	let hstart = haystack.as_ptr();
114	let hend = hstart.add(haystack.len());
115	let nstart = needle.as_ptr();
116	let nend = nstart.add(needle.len());
117	let found = self.find_raw(hstart, hend, nstart, nend)?;
118	Some(found.distance(hstart))
119	}
120	}
121
122	/// Like `find`, but accepts and returns raw pointers.
123	///
124	/// When a match is found, the pointer returned is guaranteed to be
125	/// `>= start` and `<= end`. The pointer returned is only ever equivalent
126	/// to `end` when both the needle and haystack are empty. (That is, the
127	/// empty string matches the empty string.)
128	///
129	/// This routine is useful if you're already using raw pointers and would
130	/// like to avoid converting back to a slice before executing a search.
131	///
132	/// # Safety
133	///
134	/// Note that `start` and `end` below refer to both pairs of pointers given
135	/// to this routine. That is, the conditions apply to both `hstart`/`hend`
136	/// and `nstart`/`nend`.
137	///
138	/// Both `start` and `end` must be valid for reads.*
139	/// Both `start` and `end` must point to an initialized value.*
140	/// Both `start` and `end` must point to the same allocated object and*
141	/// must either be in bounds or at most one byte past the end of the
142	/// allocated object.
143	/// Both `start` and `end` must be _derived from_ a pointer to the same*
144	/// object.
145	/// The distance between `start` and `end` must not overflow `isize`.*
146	/// The distance being in bounds must not rely on "wrapping around" the*
147	/// address space.
148	/// It must be the case that `start <= end`.*
149	#[inline]
150	pub unsafe fn find_raw(
151	&self,
152	hstart: *const u8,
153	hend: *const u8,
154	nstart: *const u8,
155	nend: *const u8,
156	) -> Option<*const u8> {
157	let hlen = hend.distance(hstart);
158	let nlen = nend.distance(nstart);
159	if nlen > hlen {
160	return None;
161	}
162	let mut cur = hstart;
163	let end = hend.sub(nlen);
164	let mut hash = Hash::forward(cur, cur.add(nlen));
165	loop {
166	if self.hash == hash && is_equal_raw(cur, nstart, nlen) {
167	return Some(cur);
168	}
169	if cur >= end {
170	return None;
171	}
172	hash.roll(self, cur.read(), cur.add(nlen).read());
173	cur = cur.add(`1`);
174	}
175	}
176	}
177
178	/// A reverse substring searcher using the Rabin-Karp algorithm.
179	#[derive(Clone, Debug)]
180	pub struct FinderRev(Finder);
181
182	impl FinderRev {
183	/// Create a new Rabin-Karp reverse searcher for the given `needle`.
184	#[inline]
185	pub fn new(needle: &[u8]) -> FinderRev {
186	let mut s = FinderRev(Finder { hash: Hash::new(), hash_2pow: `1` });
187	let last_byte = match needle.last() {
188	None => return s,
189	Some(&last_byte) => last_byte,
190	};
191	s.0.hash.add(last_byte);
192	for b in needle.iter().rev().copied().skip(`1`) {
193	s.0.hash.add(b);
194	s.0.hash_2pow = s.0.hash_2pow.wrapping_shl(`1`);
195	}
196	s
197	}
198
199	/// Return the last occurrence of the `needle` in the `haystack`
200	/// given. If no such occurrence exists, then `None` is returned.
201	///
202	/// The `needle` provided must match the needle given to this finder at
203	/// construction time.
204	///
205	/// The maximum value this can return is `haystack.len()`, which can only
206	/// occur when the needle and haystack both have length zero. Otherwise,
207	/// for non-empty haystacks, the maximum value is `haystack.len() - 1`.
208	#[inline]
209	pub fn rfind(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> {
210	unsafe {
211	let hstart = haystack.as_ptr();
212	let hend = hstart.add(haystack.len());
213	let nstart = needle.as_ptr();
214	let nend = nstart.add(needle.len());
215	let found = self.rfind_raw(hstart, hend, nstart, nend)?;
216	Some(found.distance(hstart))
217	}
218	}
219
220	/// Like `rfind`, but accepts and returns raw pointers.
221	///
222	/// When a match is found, the pointer returned is guaranteed to be
223	/// `>= start` and `<= end`. The pointer returned is only ever equivalent
224	/// to `end` when both the needle and haystack are empty. (That is, the
225	/// empty string matches the empty string.)
226	///
227	/// This routine is useful if you're already using raw pointers and would
228	/// like to avoid converting back to a slice before executing a search.
229	///
230	/// # Safety
231	///
232	/// Note that `start` and `end` below refer to both pairs of pointers given
233	/// to this routine. That is, the conditions apply to both `hstart`/`hend`
234	/// and `nstart`/`nend`.
235	///
236	/// Both `start` and `end` must be valid for reads.*
237	/// Both `start` and `end` must point to an initialized value.*
238	/// Both `start` and `end` must point to the same allocated object and*
239	/// must either be in bounds or at most one byte past the end of the
240	/// allocated object.
241	/// Both `start` and `end` must be _derived from_ a pointer to the same*
242	/// object.
243	/// The distance between `start` and `end` must not overflow `isize`.*
244	/// The distance being in bounds must not rely on "wrapping around" the*
245	/// address space.
246	/// It must be the case that `start <= end`.*
247	#[inline]
248	pub unsafe fn rfind_raw(
249	&self,
250	hstart: *const u8,
251	hend: *const u8,
252	nstart: *const u8,
253	nend: *const u8,
254	) -> Option<*const u8> {
255	let hlen = hend.distance(hstart);
256	let nlen = nend.distance(nstart);
257	if nlen > hlen {
258	return None;
259	}
260	let mut cur = hend.sub(nlen);
261	let start = hstart;
262	let mut hash = Hash::reverse(cur, cur.add(nlen));
263	loop {
264	if self.0.hash == hash && is_equal_raw(cur, nstart, nlen) {
265	return Some(cur);
266	}
267	if cur <= start {
268	return None;
269	}
270	cur = cur.sub(`1`);
271	hash.roll(&self.0, cur.add(nlen).read(), cur.read());
272	}
273	}
274	}
275
276	/// Whether RK is believed to be very fast for the given needle/haystack.
277	#[inline]
278	pub(crate) fn is_fast(haystack: &[u8], _needle: &[u8]) -> bool {
279	haystack.len() < `16`
280	}
281
282	/// A Rabin-Karp hash. This might represent the hash of a needle, or the hash
283	/// of a rolling window in the haystack.
284	#[derive(Clone, Copy, Debug, Default, Eq, PartialEq)]
285	struct Hash(u32);
286
287	impl Hash {
288	/// Create a new hash that represents the empty string.
289	#[inline(always)]
290	fn new() -> Hash {
291	Hash(`0`)
292	}
293
294	/// Create a new hash from the bytes given for use in forward searches.
295	///
296	/// # Safety
297	///
298	/// The given pointers must be valid to read from within their range.
299	#[inline(always)]
300	unsafe fn forward(mut start: *const u8, end: *const u8) -> Hash {
301	let mut hash = Hash::new();
302	while start < end {
303	hash.add(start.read());
304	start = start.add(`1`);
305	}
306	hash
307	}
308
309	/// Create a new hash from the bytes given for use in reverse searches.
310	///
311	/// # Safety
312	///
313	/// The given pointers must be valid to read from within their range.
314	#[inline(always)]
315	unsafe fn reverse(start: *const u8, mut end: *const u8) -> Hash {
316	let mut hash = Hash::new();
317	while start < end {
318	end = end.sub(`1`);
319	hash.add(end.read());
320	}
321	hash
322	}
323
324	/// Add 'new' and remove 'old' from this hash. The given needle hash should
325	/// correspond to the hash computed for the needle being searched for.
326	///
327	/// This is meant to be used when the rolling window of the haystack is
328	/// advanced.
329	#[inline(always)]
330	fn roll(&mut self, finder: &Finder, old: u8, new: u8) {
331	self.del(finder, old);
332	self.add(new);
333	}
334
335	/// Add a byte to this hash.
336	#[inline(always)]
337	fn add(&mut self, byte: u8) {
338	self.0 = self.0.wrapping_shl(`1`).wrapping_add(u32::from(byte));
339	}
340
341	/// Remove a byte from this hash. The given needle hash should correspond
342	/// to the hash computed for the needle being searched for.
343	#[inline(always)]
344	fn del(&mut self, finder: &Finder, byte: u8) {
345	let factor = finder.hash_2pow;
346	self.0 = self.0.wrapping_sub(u32::from(byte).wrapping_mul(factor));
347	}
348	}
349
350	/// Returns true when `x[i] == y[i]` for all `0 <= i < n`.
351	///
352	/// We forcefully don't inline this to hint at the compiler that it is unlikely
353	/// to be called. This causes the inner rabinkarp loop above to be a bit
354	/// tighter and leads to some performance improvement. See the
355	/// memmem/krate/prebuilt/sliceslice-words/words benchmark.
356	///
357	/// # Safety
358	///
359	/// Same as `crate::arch::all::is_equal_raw`.
360	#[cold]
361	#[inline(never)]
362	unsafe fn is_equal_raw(x: *const u8, y: *const u8, n: usize) -> bool {
363	crate::arch::all::is_equal_raw(x, y, n)
364	}
365
366	#[cfg(test)]
367	mod tests {
368	use super::*;
369
370	define_substring_forward_quickcheck!(\|h, n\| Some(
371	Finder::new(n).find(h, n)
372	));
373	define_substring_reverse_quickcheck!(\|h, n\| Some(
374	FinderRev::new(n).rfind(h, n)
375	));
376
377	#[test]
378	fn forward() {
379	crate::tests::substring::Runner::new()
380	.fwd(\|h, n\| Some(Finder::new(n).find(h, n)))
381	.run();
382	}
383
384	#[test]
385	fn reverse() {
386	crate::tests::substring::Runner::new()
387	.rev(\|h, n\| Some(FinderRev::new(n).rfind(h, n)))
388	.run();
389	}
390	}
391

Provided by KDAB

Definitions