1 | /*! |
2 | A library for finding occurrences of many patterns at once. This library |
3 | provides multiple pattern search principally through an implementation of the |
4 | [Aho-Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm), |
5 | which builds a fast finite state machine for executing searches in linear time. |
6 | |
7 | Additionally, this library provides a number of configuration options for |
8 | building the automaton that permit controlling the space versus time trade |
9 | off. Other features include simple ASCII case insensitive matching, finding |
10 | overlapping matches, replacements, searching streams and even searching and |
11 | replacing text in streams. |
12 | |
13 | Finally, unlike all other (known) Aho-Corasick implementations, this one |
14 | supports enabling |
15 | [leftmost-first](enum.MatchKind.html#variant.LeftmostFirst) |
16 | or |
17 | [leftmost-longest](enum.MatchKind.html#variant.LeftmostFirst) |
18 | match semantics, using a (seemingly) novel alternative construction algorithm. |
19 | For more details on what match semantics means, see the |
20 | [`MatchKind`](enum.MatchKind.html) |
21 | type. |
22 | |
23 | # Overview |
24 | |
25 | This section gives a brief overview of the primary types in this crate: |
26 | |
27 | * [`AhoCorasick`](struct.AhoCorasick.html) is the primary type and represents |
28 | an Aho-Corasick automaton. This is the type you use to execute searches. |
29 | * [`AhoCorasickBuilder`](struct.AhoCorasickBuilder.html) can be used to build |
30 | an Aho-Corasick automaton, and supports configuring a number of options. |
31 | * [`Match`](struct.Match.html) represents a single match reported by an |
32 | Aho-Corasick automaton. Each match has two pieces of information: the pattern |
33 | that matched and the start and end byte offsets corresponding to the position |
34 | in the haystack at which it matched. |
35 | |
36 | Additionally, the [`packed`](packed/index.html) sub-module contains a lower |
37 | level API for using fast vectorized routines for finding a small number of |
38 | patterns in a haystack. |
39 | |
40 | # Example: basic searching |
41 | |
42 | This example shows how to search for occurrences of multiple patterns |
43 | simultaneously. Each match includes the pattern that matched along with the |
44 | byte offsets of the match. |
45 | |
46 | ``` |
47 | use aho_corasick::AhoCorasick; |
48 | |
49 | let patterns = &["apple" , "maple" , "Snapple" ]; |
50 | let haystack = "Nobody likes maple in their apple flavored Snapple." ; |
51 | |
52 | let ac = AhoCorasick::new(patterns); |
53 | let mut matches = vec![]; |
54 | for mat in ac.find_iter(haystack) { |
55 | matches.push((mat.pattern(), mat.start(), mat.end())); |
56 | } |
57 | assert_eq!(matches, vec![ |
58 | (1, 13, 18), |
59 | (0, 28, 33), |
60 | (2, 43, 50), |
61 | ]); |
62 | ``` |
63 | |
64 | # Example: case insensitivity |
65 | |
66 | This is like the previous example, but matches `Snapple` case insensitively |
67 | using `AhoCorasickBuilder`: |
68 | |
69 | ``` |
70 | use aho_corasick::AhoCorasickBuilder; |
71 | |
72 | let patterns = &["apple" , "maple" , "snapple" ]; |
73 | let haystack = "Nobody likes maple in their apple flavored Snapple." ; |
74 | |
75 | let ac = AhoCorasickBuilder::new() |
76 | .ascii_case_insensitive(true) |
77 | .build(patterns); |
78 | let mut matches = vec![]; |
79 | for mat in ac.find_iter(haystack) { |
80 | matches.push((mat.pattern(), mat.start(), mat.end())); |
81 | } |
82 | assert_eq!(matches, vec![ |
83 | (1, 13, 18), |
84 | (0, 28, 33), |
85 | (2, 43, 50), |
86 | ]); |
87 | ``` |
88 | |
89 | # Example: replacing matches in a stream |
90 | |
91 | This example shows how to execute a search and replace on a stream without |
92 | loading the entire stream into memory first. |
93 | |
94 | ``` |
95 | use aho_corasick::AhoCorasick; |
96 | |
97 | # fn example() -> Result<(), ::std::io::Error> { |
98 | let patterns = &["fox" , "brown" , "quick" ]; |
99 | let replace_with = &["sloth" , "grey" , "slow" ]; |
100 | |
101 | // In a real example, these might be `std::fs::File`s instead. All you need to |
102 | // do is supply a pair of `std::io::Read` and `std::io::Write` implementations. |
103 | let rdr = "The quick brown fox." ; |
104 | let mut wtr = vec![]; |
105 | |
106 | let ac = AhoCorasick::new(patterns); |
107 | ac.stream_replace_all(rdr.as_bytes(), &mut wtr, replace_with)?; |
108 | assert_eq!(b"The slow grey sloth." .to_vec(), wtr); |
109 | # Ok(()) }; example().unwrap() |
110 | ``` |
111 | |
112 | # Example: finding the leftmost first match |
113 | |
114 | In the textbook description of Aho-Corasick, its formulation is typically |
115 | structured such that it reports all possible matches, even when they overlap |
116 | with another. In many cases, overlapping matches may not be desired, such as |
117 | the case of finding all successive non-overlapping matches like you might with |
118 | a standard regular expression. |
119 | |
120 | Unfortunately the "obvious" way to modify the Aho-Corasick algorithm to do |
121 | this doesn't always work in the expected way, since it will report matches as |
122 | soon as they are seen. For example, consider matching the regex `Samwise|Sam` |
123 | against the text `Samwise`. Most regex engines (that are Perl-like, or |
124 | non-POSIX) will report `Samwise` as a match, but the standard Aho-Corasick |
125 | algorithm modified for reporting non-overlapping matches will report `Sam`. |
126 | |
127 | A novel contribution of this library is the ability to change the match |
128 | semantics of Aho-Corasick (without additional search time overhead) such that |
129 | `Samwise` is reported instead. For example, here's the standard approach: |
130 | |
131 | ``` |
132 | use aho_corasick::AhoCorasick; |
133 | |
134 | let patterns = &["Samwise" , "Sam" ]; |
135 | let haystack = "Samwise" ; |
136 | |
137 | let ac = AhoCorasick::new(patterns); |
138 | let mat = ac.find(haystack).expect("should have a match" ); |
139 | assert_eq!("Sam" , &haystack[mat.start()..mat.end()]); |
140 | ``` |
141 | |
142 | And now here's the leftmost-first version, which matches how a Perl-like |
143 | regex will work: |
144 | |
145 | ``` |
146 | use aho_corasick::{AhoCorasickBuilder, MatchKind}; |
147 | |
148 | let patterns = &["Samwise" , "Sam" ]; |
149 | let haystack = "Samwise" ; |
150 | |
151 | let ac = AhoCorasickBuilder::new() |
152 | .match_kind(MatchKind::LeftmostFirst) |
153 | .build(patterns); |
154 | let mat = ac.find(haystack).expect("should have a match" ); |
155 | assert_eq!("Samwise" , &haystack[mat.start()..mat.end()]); |
156 | ``` |
157 | |
158 | In addition to leftmost-first semantics, this library also supports |
159 | leftmost-longest semantics, which match the POSIX behavior of a regular |
160 | expression alternation. See |
161 | [`MatchKind`](enum.MatchKind.html) |
162 | for more details. |
163 | |
164 | # Prefilters |
165 | |
166 | While an Aho-Corasick automaton can perform admirably when compared to more |
167 | naive solutions, it is generally slower than more specialized algorithms that |
168 | are accelerated using vector instructions such as SIMD. |
169 | |
170 | For that reason, this library will internally use a "prefilter" to attempt |
171 | to accelerate searches when possible. Currently, this library has several |
172 | different algorithms it might use depending on the patterns provided. Once the |
173 | number of patterns gets too big, prefilters are no longer used. |
174 | |
175 | While a prefilter is generally good to have on by default since it works |
176 | well in the common case, it can lead to less predictable or even sub-optimal |
177 | performance in some cases. For that reason, prefilters can be explicitly |
178 | disabled via |
179 | [`AhoCorasickBuilder::prefilter`](struct.AhoCorasickBuilder.html#method.prefilter). |
180 | */ |
181 | |
182 | #![deny (missing_docs)] |
183 | |
184 | // We can never be truly no_std, but we could be alloc-only some day, so |
185 | // require the std feature for now. |
186 | #[cfg (not(feature = "std" ))] |
187 | compile_error!("`std` feature is currently required to build this crate" ); |
188 | |
189 | // #[cfg(doctest)] |
190 | // #[macro_use] |
191 | // extern crate doc_comment; |
192 | |
193 | // #[cfg(doctest)] |
194 | // doctest!("../README.md"); |
195 | |
196 | pub use crate::ahocorasick::{ |
197 | AhoCorasick, AhoCorasickBuilder, FindIter, FindOverlappingIter, MatchKind, |
198 | StreamFindIter, |
199 | }; |
200 | pub use crate::error::{Error, ErrorKind}; |
201 | pub use crate::state_id::StateID; |
202 | |
203 | mod ahocorasick; |
204 | mod automaton; |
205 | mod buffer; |
206 | mod byte_frequencies; |
207 | mod classes; |
208 | mod dfa; |
209 | mod error; |
210 | mod nfa; |
211 | pub mod packed; |
212 | mod prefilter; |
213 | mod state_id; |
214 | #[cfg (test)] |
215 | mod tests; |
216 | |
217 | /// A representation of a match reported by an Aho-Corasick automaton. |
218 | /// |
219 | /// A match has two essential pieces of information: the identifier of the |
220 | /// pattern that matched, along with the start and end offsets of the match |
221 | /// in the haystack. |
222 | /// |
223 | /// # Examples |
224 | /// |
225 | /// Basic usage: |
226 | /// |
227 | /// ``` |
228 | /// use aho_corasick::AhoCorasick; |
229 | /// |
230 | /// let ac = AhoCorasick::new(&[ |
231 | /// "foo" , "bar" , "baz" , |
232 | /// ]); |
233 | /// let mat = ac.find("xxx bar xxx" ).expect("should have a match" ); |
234 | /// assert_eq!(1, mat.pattern()); |
235 | /// assert_eq!(4, mat.start()); |
236 | /// assert_eq!(7, mat.end()); |
237 | /// ``` |
238 | #[derive (Clone, Debug, Eq, Hash, PartialEq)] |
239 | pub struct Match { |
240 | /// The pattern id. |
241 | pattern: usize, |
242 | /// The length of this match, such that the starting position of the match |
243 | /// is `end - len`. |
244 | /// |
245 | /// We use length here because, other than the pattern id, the only |
246 | /// information about each pattern that the automaton stores is its length. |
247 | /// So using the length here is just a bit more natural. But it isn't |
248 | /// technically required. |
249 | len: usize, |
250 | /// The end offset of the match, exclusive. |
251 | end: usize, |
252 | } |
253 | |
254 | impl Match { |
255 | /// Returns the identifier of the pattern that matched. |
256 | /// |
257 | /// The identifier of a pattern is derived from the position in which it |
258 | /// was originally inserted into the corresponding automaton. The first |
259 | /// pattern has identifier `0`, and each subsequent pattern is `1`, `2` |
260 | /// and so on. |
261 | #[inline ] |
262 | pub fn pattern(&self) -> usize { |
263 | self.pattern |
264 | } |
265 | |
266 | /// The starting position of the match. |
267 | #[inline ] |
268 | pub fn start(&self) -> usize { |
269 | self.end - self.len |
270 | } |
271 | |
272 | /// The ending position of the match. |
273 | #[inline ] |
274 | pub fn end(&self) -> usize { |
275 | self.end |
276 | } |
277 | |
278 | /// The length, in bytes, of the match. |
279 | #[inline ] |
280 | pub fn len(&self) -> usize { |
281 | self.len |
282 | } |
283 | |
284 | /// Returns true if and only if this match is empty. That is, when |
285 | /// `start() == end()`. |
286 | /// |
287 | /// An empty match can only be returned when the empty string was among |
288 | /// the patterns used to build the Aho-Corasick automaton. |
289 | #[inline ] |
290 | pub fn is_empty(&self) -> bool { |
291 | self.len == 0 |
292 | } |
293 | |
294 | #[inline ] |
295 | fn increment(&self, by: usize) -> Match { |
296 | Match { pattern: self.pattern, len: self.len, end: self.end + by } |
297 | } |
298 | |
299 | #[inline ] |
300 | fn from_span(id: usize, start: usize, end: usize) -> Match { |
301 | Match { pattern: id, len: end - start, end } |
302 | } |
303 | } |
304 | |