1/*!
2A library for finding occurrences of many patterns at once. This library
3provides multiple pattern search principally through an implementation of the
4[Aho-Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm),
5which builds a fast finite state machine for executing searches in linear time.
6
7Additionally, this library provides a number of configuration options for
8building the automaton that permit controlling the space versus time trade
9off. Other features include simple ASCII case insensitive matching, finding
10overlapping matches, replacements, searching streams and even searching and
11replacing text in streams.
12
13Finally, unlike most other Aho-Corasick implementations, this one
14supports enabling [leftmost-first](MatchKind::LeftmostFirst) or
15[leftmost-longest](MatchKind::LeftmostLongest) match semantics, using a
16(seemingly) novel alternative construction algorithm. For more details on what
17match semantics means, see the [`MatchKind`] type.
18
19# Overview
20
21This section gives a brief overview of the primary types in this crate:
22
23* [`AhoCorasick`] is the primary type and represents an Aho-Corasick automaton.
24This is the type you use to execute searches.
25* [`AhoCorasickBuilder`] can be used to build an Aho-Corasick automaton, and
26supports configuring a number of options.
27* [`Match`] represents a single match reported by an Aho-Corasick automaton.
28Each match has two pieces of information: the pattern that matched and the
29start and end byte offsets corresponding to the position in the haystack at
30which it matched.
31
32# Example: basic searching
33
34This example shows how to search for occurrences of multiple patterns
35simultaneously. Each match includes the pattern that matched along with the
36byte offsets of the match.
37
38```
39use aho_corasick::{AhoCorasick, PatternID};
40
41let patterns = &["apple", "maple", "Snapple"];
42let haystack = "Nobody likes maple in their apple flavored Snapple.";
43
44let ac = AhoCorasick::new(patterns).unwrap();
45let mut matches = vec![];
46for mat in ac.find_iter(haystack) {
47 matches.push((mat.pattern(), mat.start(), mat.end()));
48}
49assert_eq!(matches, vec![
50 (PatternID::must(1), 13, 18),
51 (PatternID::must(0), 28, 33),
52 (PatternID::must(2), 43, 50),
53]);
54```
55
56# Example: case insensitivity
57
58This is like the previous example, but matches `Snapple` case insensitively
59using `AhoCorasickBuilder`:
60
61```
62use aho_corasick::{AhoCorasick, PatternID};
63
64let patterns = &["apple", "maple", "snapple"];
65let haystack = "Nobody likes maple in their apple flavored Snapple.";
66
67let ac = AhoCorasick::builder()
68 .ascii_case_insensitive(true)
69 .build(patterns)
70 .unwrap();
71let mut matches = vec![];
72for mat in ac.find_iter(haystack) {
73 matches.push((mat.pattern(), mat.start(), mat.end()));
74}
75assert_eq!(matches, vec![
76 (PatternID::must(1), 13, 18),
77 (PatternID::must(0), 28, 33),
78 (PatternID::must(2), 43, 50),
79]);
80```
81
82# Example: replacing matches in a stream
83
84This example shows how to execute a search and replace on a stream without
85loading the entire stream into memory first.
86
87```
88# #[cfg(feature = "std")] {
89use aho_corasick::AhoCorasick;
90
91# fn example() -> Result<(), std::io::Error> {
92let patterns = &["fox", "brown", "quick"];
93let replace_with = &["sloth", "grey", "slow"];
94
95// In a real example, these might be `std::fs::File`s instead. All you need to
96// do is supply a pair of `std::io::Read` and `std::io::Write` implementations.
97let rdr = "The quick brown fox.";
98let mut wtr = vec![];
99
100let ac = AhoCorasick::new(patterns).unwrap();
101ac.try_stream_replace_all(rdr.as_bytes(), &mut wtr, replace_with)?;
102assert_eq!(b"The slow grey sloth.".to_vec(), wtr);
103# Ok(()) }; example().unwrap()
104# }
105```
106
107# Example: finding the leftmost first match
108
109In the textbook description of Aho-Corasick, its formulation is typically
110structured such that it reports all possible matches, even when they overlap
111with another. In many cases, overlapping matches may not be desired, such as
112the case of finding all successive non-overlapping matches like you might with
113a standard regular expression.
114
115Unfortunately the "obvious" way to modify the Aho-Corasick algorithm to do
116this doesn't always work in the expected way, since it will report matches as
117soon as they are seen. For example, consider matching the regex `Samwise|Sam`
118against the text `Samwise`. Most regex engines (that are Perl-like, or
119non-POSIX) will report `Samwise` as a match, but the standard Aho-Corasick
120algorithm modified for reporting non-overlapping matches will report `Sam`.
121
122A novel contribution of this library is the ability to change the match
123semantics of Aho-Corasick (without additional search time overhead) such that
124`Samwise` is reported instead. For example, here's the standard approach:
125
126```
127use aho_corasick::AhoCorasick;
128
129let patterns = &["Samwise", "Sam"];
130let haystack = "Samwise";
131
132let ac = AhoCorasick::new(patterns).unwrap();
133let mat = ac.find(haystack).expect("should have a match");
134assert_eq!("Sam", &haystack[mat.start()..mat.end()]);
135```
136
137And now here's the leftmost-first version, which matches how a Perl-like
138regex will work:
139
140```
141use aho_corasick::{AhoCorasick, MatchKind};
142
143let patterns = &["Samwise", "Sam"];
144let haystack = "Samwise";
145
146let ac = AhoCorasick::builder()
147 .match_kind(MatchKind::LeftmostFirst)
148 .build(patterns)
149 .unwrap();
150let mat = ac.find(haystack).expect("should have a match");
151assert_eq!("Samwise", &haystack[mat.start()..mat.end()]);
152```
153
154In addition to leftmost-first semantics, this library also supports
155leftmost-longest semantics, which match the POSIX behavior of a regular
156expression alternation. See [`MatchKind`] for more details.
157
158# Prefilters
159
160While an Aho-Corasick automaton can perform admirably when compared to more
161naive solutions, it is generally slower than more specialized algorithms that
162are accelerated using vector instructions such as SIMD.
163
164For that reason, this library will internally use a "prefilter" to attempt
165to accelerate searches when possible. Currently, this library has several
166different algorithms it might use depending on the patterns provided. Once the
167number of patterns gets too big, prefilters are no longer used.
168
169While a prefilter is generally good to have on by default since it works
170well in the common case, it can lead to less predictable or even sub-optimal
171performance in some cases. For that reason, prefilters can be explicitly
172disabled via [`AhoCorasickBuilder::prefilter`].
173
174# Lower level APIs
175
176This crate also provides several sub-modules that collectively expose many of
177the implementation details of the main [`AhoCorasick`] type. Most users of this
178library can completely ignore the submodules and their contents, but if you
179needed finer grained control, some parts of them may be useful to you. Here is
180a brief overview of each and why you might want to use them:
181
182* The [`packed`] sub-module contains a lower level API for using fast
183vectorized routines for finding a small number of patterns in a haystack.
184You might want to use this API when you want to completely side-step using
185Aho-Corasick automata. Otherwise, the fast vectorized routines are used
186automatically as prefilters for `AhoCorasick` searches whenever possible.
187* The [`automaton`] sub-module provides a lower level finite state
188machine interface that the various Aho-Corasick implementations in
189this crate implement. This sub-module's main contribution is the
190[`Automaton`](automaton::Automaton) trait, which permits manually walking the
191state transitions of an Aho-Corasick automaton.
192* The [`dfa`] and [`nfa`] sub-modules provide DFA and NFA implementations of
193the aforementioned `Automaton` trait. The main reason one might want to use
194these sub-modules is to get access to a type that implements the `Automaton`
195trait. (The top-level `AhoCorasick` type does not implement the `Automaton`
196trait.)
197
198As mentioned above, if you aren't sure whether you need these sub-modules,
199you should be able to safely ignore them and just focus on the [`AhoCorasick`]
200type.
201
202# Crate features
203
204This crate exposes a few features for controlling dependency usage and whether
205this crate can be used without the standard library.
206
207* **std** -
208 Enables support for the standard library. This feature is enabled by
209 default. When disabled, only `core` and `alloc` are used. At an API
210 level, enabling `std` enables `std::error::Error` trait impls for the
211 various error types, and higher level stream search routines such as
212 [`AhoCorasick::try_stream_find_iter`]. But the `std` feature is also required
213 to enable vectorized prefilters. Prefilters can greatly accelerate searches,
214 but generally only apply when the number of patterns is small (less than
215 ~100).
216* **perf-literal** -
217 Enables support for literal prefilters that use vectorized routines from
218 external crates. This feature is enabled by default. If you're only using
219 Aho-Corasick for large numbers of patterns or otherwise can abide lower
220 throughput when searching with a small number of patterns, then it is
221 reasonable to disable this feature.
222* **logging** -
223 Enables a dependency on the `log` crate and emits messages to aide in
224 diagnostics. This feature is disabled by default.
225*/
226
227#![no_std]
228#![deny(missing_docs)]
229#![deny(rustdoc::broken_intra_doc_links)]
230#![cfg_attr(docsrs, feature(doc_auto_cfg))]
231
232extern crate alloc;
233#[cfg(any(test, feature = "std"))]
234extern crate std;
235
236#[cfg(doctest)]
237doc_comment::doctest!("../README.md");
238
239#[cfg(feature = "std")]
240pub use crate::ahocorasick::StreamFindIter;
241pub use crate::{
242 ahocorasick::{
243 AhoCorasick, AhoCorasickBuilder, AhoCorasickKind, FindIter,
244 FindOverlappingIter,
245 },
246 util::{
247 error::{BuildError, MatchError, MatchErrorKind},
248 primitives::{PatternID, PatternIDError},
249 search::{Anchored, Input, Match, MatchKind, Span, StartKind},
250 },
251};
252
253#[macro_use]
254mod macros;
255
256mod ahocorasick;
257pub mod automaton;
258pub mod dfa;
259pub mod nfa;
260pub mod packed;
261#[cfg(test)]
262mod tests;
263// I wrote out the module for implementing fst::Automaton only to later realize
264// that this would make fst a public dependency and fst is not at 1.0 yet. I
265// decided to just keep the code in tree, but build it only during tests.
266//
267// TODO: I think I've changed my mind again. I'm considering pushing it out
268// into either a separate crate or into 'fst' directly as an optional feature.
269// #[cfg(test)]
270// #[allow(dead_code)]
271// mod transducer;
272pub(crate) mod util;
273
274#[cfg(test)]
275mod testoibits {
276 use std::panic::{RefUnwindSafe, UnwindSafe};
277
278 use super::*;
279
280 fn assert_all<T: Send + Sync + UnwindSafe + RefUnwindSafe>() {}
281
282 #[test]
283 fn oibits_main() {
284 assert_all::<AhoCorasick>();
285 assert_all::<AhoCorasickBuilder>();
286 assert_all::<AhoCorasickKind>();
287 assert_all::<FindIter>();
288 assert_all::<FindOverlappingIter>();
289
290 assert_all::<BuildError>();
291 assert_all::<MatchError>();
292 assert_all::<MatchErrorKind>();
293
294 assert_all::<Anchored>();
295 assert_all::<Input>();
296 assert_all::<Match>();
297 assert_all::<MatchKind>();
298 assert_all::<Span>();
299 assert_all::<StartKind>();
300 }
301
302 #[test]
303 fn oibits_automaton() {
304 use crate::{automaton, dfa::DFA};
305
306 assert_all::<automaton::FindIter<DFA>>();
307 assert_all::<automaton::FindOverlappingIter<DFA>>();
308 #[cfg(feature = "std")]
309 assert_all::<automaton::StreamFindIter<DFA, std::io::Stdin>>();
310 assert_all::<automaton::OverlappingState>();
311
312 assert_all::<automaton::Prefilter>();
313 assert_all::<automaton::Candidate>();
314 }
315
316 #[test]
317 fn oibits_packed() {
318 use crate::packed;
319
320 assert_all::<packed::Config>();
321 assert_all::<packed::Builder>();
322 assert_all::<packed::Searcher>();
323 assert_all::<packed::FindIter>();
324 assert_all::<packed::MatchKind>();
325 }
326}
327