1 | /*! |
---|---|

2 | A library for finding occurrences of many patterns at once. This library |

3 | provides multiple pattern search principally through an implementation of the |

4 | [Aho-Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm), |

5 | which builds a fast finite state machine for executing searches in linear time. |

6 | |

7 | Additionally, this library provides a number of configuration options for |

8 | building the automaton that permit controlling the space versus time trade |

9 | off. Other features include simple ASCII case insensitive matching, finding |

10 | overlapping matches, replacements, searching streams and even searching and |

11 | replacing text in streams. |

12 | |

13 | Finally, unlike most other Aho-Corasick implementations, this one |

14 | supports enabling [leftmost-first](MatchKind::LeftmostFirst) or |

15 | [leftmost-longest](MatchKind::LeftmostLongest) match semantics, using a |

16 | (seemingly) novel alternative construction algorithm. For more details on what |

17 | match semantics means, see the [`MatchKind`] type. |

18 | |

19 | # Overview |

20 | |

21 | This section gives a brief overview of the primary types in this crate: |

22 | |

23 | * [`AhoCorasick`] is the primary type and represents an Aho-Corasick automaton. |

24 | This is the type you use to execute searches. |

25 | * [`AhoCorasickBuilder`] can be used to build an Aho-Corasick automaton, and |

26 | supports configuring a number of options. |

27 | * [`Match`] represents a single match reported by an Aho-Corasick automaton. |

28 | Each match has two pieces of information: the pattern that matched and the |

29 | start and end byte offsets corresponding to the position in the haystack at |

30 | which it matched. |

31 | |

32 | # Example: basic searching |

33 | |

34 | This example shows how to search for occurrences of multiple patterns |

35 | simultaneously. Each match includes the pattern that matched along with the |

36 | byte offsets of the match. |

37 | |

38 | ``` |

39 | use aho_corasick::{AhoCorasick, PatternID}; |

40 | |

41 | let patterns = &["apple", "maple", "Snapple"]; |

42 | let haystack = "Nobody likes maple in their apple flavored Snapple."; |

43 | |

44 | let ac = AhoCorasick::new(patterns).unwrap(); |

45 | let mut matches = vec![]; |

46 | for mat in ac.find_iter(haystack) { |

47 | matches.push((mat.pattern(), mat.start(), mat.end())); |

48 | } |

49 | assert_eq!(matches, vec![ |

50 | (PatternID::must(1), 13, 18), |

51 | (PatternID::must(0), 28, 33), |

52 | (PatternID::must(2), 43, 50), |

53 | ]); |

54 | ``` |

55 | |

56 | # Example: case insensitivity |

57 | |

58 | This is like the previous example, but matches `Snapple` case insensitively |

59 | using `AhoCorasickBuilder`: |

60 | |

61 | ``` |

62 | use aho_corasick::{AhoCorasick, PatternID}; |

63 | |

64 | let patterns = &["apple", "maple", "snapple"]; |

65 | let haystack = "Nobody likes maple in their apple flavored Snapple."; |

66 | |

67 | let ac = AhoCorasick::builder() |

68 | .ascii_case_insensitive(true) |

69 | .build(patterns) |

70 | .unwrap(); |

71 | let mut matches = vec![]; |

72 | for mat in ac.find_iter(haystack) { |

73 | matches.push((mat.pattern(), mat.start(), mat.end())); |

74 | } |

75 | assert_eq!(matches, vec![ |

76 | (PatternID::must(1), 13, 18), |

77 | (PatternID::must(0), 28, 33), |

78 | (PatternID::must(2), 43, 50), |

79 | ]); |

80 | ``` |

81 | |

82 | # Example: replacing matches in a stream |

83 | |

84 | This example shows how to execute a search and replace on a stream without |

85 | loading the entire stream into memory first. |

86 | |

87 | ``` |

88 | # #[cfg(feature = "std")] { |

89 | use aho_corasick::AhoCorasick; |

90 | |

91 | # fn example() -> Result<(), std::io::Error> { |

92 | let patterns = &["fox", "brown", "quick"]; |

93 | let replace_with = &["sloth", "grey", "slow"]; |

94 | |

95 | // In a real example, these might be `std::fs::File`s instead. All you need to |

96 | // do is supply a pair of `std::io::Read` and `std::io::Write` implementations. |

97 | let rdr = "The quick brown fox."; |

98 | let mut wtr = vec![]; |

99 | |

100 | let ac = AhoCorasick::new(patterns).unwrap(); |

101 | ac.try_stream_replace_all(rdr.as_bytes(), &mut wtr, replace_with)?; |

102 | assert_eq!(b"The slow grey sloth.".to_vec(), wtr); |

103 | # Ok(()) }; example().unwrap() |

104 | # } |

105 | ``` |

106 | |

107 | # Example: finding the leftmost first match |

108 | |

109 | In the textbook description of Aho-Corasick, its formulation is typically |

110 | structured such that it reports all possible matches, even when they overlap |

111 | with another. In many cases, overlapping matches may not be desired, such as |

112 | the case of finding all successive non-overlapping matches like you might with |

113 | a standard regular expression. |

114 | |

115 | Unfortunately the "obvious" way to modify the Aho-Corasick algorithm to do |

116 | this doesn't always work in the expected way, since it will report matches as |

117 | soon as they are seen. For example, consider matching the regex `Samwise|Sam` |

118 | against the text `Samwise`. Most regex engines (that are Perl-like, or |

119 | non-POSIX) will report `Samwise` as a match, but the standard Aho-Corasick |

120 | algorithm modified for reporting non-overlapping matches will report `Sam`. |

121 | |

122 | A novel contribution of this library is the ability to change the match |

123 | semantics of Aho-Corasick (without additional search time overhead) such that |

124 | `Samwise` is reported instead. For example, here's the standard approach: |

125 | |

126 | ``` |

127 | use aho_corasick::AhoCorasick; |

128 | |

129 | let patterns = &["Samwise", "Sam"]; |

130 | let haystack = "Samwise"; |

131 | |

132 | let ac = AhoCorasick::new(patterns).unwrap(); |

133 | let mat = ac.find(haystack).expect("should have a match"); |

134 | assert_eq!("Sam", &haystack[mat.start()..mat.end()]); |

135 | ``` |

136 | |

137 | And now here's the leftmost-first version, which matches how a Perl-like |

138 | regex will work: |

139 | |

140 | ``` |

141 | use aho_corasick::{AhoCorasick, MatchKind}; |

142 | |

143 | let patterns = &["Samwise", "Sam"]; |

144 | let haystack = "Samwise"; |

145 | |

146 | let ac = AhoCorasick::builder() |

147 | .match_kind(MatchKind::LeftmostFirst) |

148 | .build(patterns) |

149 | .unwrap(); |

150 | let mat = ac.find(haystack).expect("should have a match"); |

151 | assert_eq!("Samwise", &haystack[mat.start()..mat.end()]); |

152 | ``` |

153 | |

154 | In addition to leftmost-first semantics, this library also supports |

155 | leftmost-longest semantics, which match the POSIX behavior of a regular |

156 | expression alternation. See [`MatchKind`] for more details. |

157 | |

158 | # Prefilters |

159 | |

160 | While an Aho-Corasick automaton can perform admirably when compared to more |

161 | naive solutions, it is generally slower than more specialized algorithms that |

162 | are accelerated using vector instructions such as SIMD. |

163 | |

164 | For that reason, this library will internally use a "prefilter" to attempt |

165 | to accelerate searches when possible. Currently, this library has several |

166 | different algorithms it might use depending on the patterns provided. Once the |

167 | number of patterns gets too big, prefilters are no longer used. |

168 | |

169 | While a prefilter is generally good to have on by default since it works |

170 | well in the common case, it can lead to less predictable or even sub-optimal |

171 | performance in some cases. For that reason, prefilters can be explicitly |

172 | disabled via [`AhoCorasickBuilder::prefilter`]. |

173 | |

174 | # Lower level APIs |

175 | |

176 | This crate also provides several sub-modules that collectively expose many of |

177 | the implementation details of the main [`AhoCorasick`] type. Most users of this |

178 | library can completely ignore the submodules and their contents, but if you |

179 | needed finer grained control, some parts of them may be useful to you. Here is |

180 | a brief overview of each and why you might want to use them: |

181 | |

182 | * The [`packed`] sub-module contains a lower level API for using fast |

183 | vectorized routines for finding a small number of patterns in a haystack. |

184 | You might want to use this API when you want to completely side-step using |

185 | Aho-Corasick automata. Otherwise, the fast vectorized routines are used |

186 | automatically as prefilters for `AhoCorasick` searches whenever possible. |

187 | * The [`automaton`] sub-module provides a lower level finite state |

188 | machine interface that the various Aho-Corasick implementations in |

189 | this crate implement. This sub-module's main contribution is the |

190 | [`Automaton`](automaton::Automaton) trait, which permits manually walking the |

191 | state transitions of an Aho-Corasick automaton. |

192 | * The [`dfa`] and [`nfa`] sub-modules provide DFA and NFA implementations of |

193 | the aforementioned `Automaton` trait. The main reason one might want to use |

194 | these sub-modules is to get access to a type that implements the `Automaton` |

195 | trait. (The top-level `AhoCorasick` type does not implement the `Automaton` |

196 | trait.) |

197 | |

198 | As mentioned above, if you aren't sure whether you need these sub-modules, |

199 | you should be able to safely ignore them and just focus on the [`AhoCorasick`] |

200 | type. |

201 | |

202 | # Crate features |

203 | |

204 | This crate exposes a few features for controlling dependency usage and whether |

205 | this crate can be used without the standard library. |

206 | |

207 | * **std** - |

208 | Enables support for the standard library. This feature is enabled by |

209 | default. When disabled, only `core` and `alloc` are used. At an API |

210 | level, enabling `std` enables `std::error::Error` trait impls for the |

211 | various error types, and higher level stream search routines such as |

212 | [`AhoCorasick::try_stream_find_iter`]. But the `std` feature is also required |

213 | to enable vectorized prefilters. Prefilters can greatly accelerate searches, |

214 | but generally only apply when the number of patterns is small (less than |

215 | ~100). |

216 | * **perf-literal** - |

217 | Enables support for literal prefilters that use vectorized routines from |

218 | external crates. This feature is enabled by default. If you're only using |

219 | Aho-Corasick for large numbers of patterns or otherwise can abide lower |

220 | throughput when searching with a small number of patterns, then it is |

221 | reasonable to disable this feature. |

222 | * **logging** - |

223 | Enables a dependency on the `log` crate and emits messages to aide in |

224 | diagnostics. This feature is disabled by default. |

225 | */ |

226 | |

227 | #![no_std] |

228 | #![deny(missing_docs)] |

229 | #![deny(rustdoc::broken_intra_doc_links)] |

230 | #![cfg_attr(docsrs, feature(doc_auto_cfg))] |

231 | |

232 | extern crate alloc; |

233 | #[cfg(any(test, feature = "std"))] |

234 | extern crate std; |

235 | |

236 | #[cfg(doctest)] |

237 | doc_comment::doctest!("../README.md"); |

238 | |

239 | #[cfg(feature = "std")] |

240 | pub use crate::ahocorasick::StreamFindIter; |

241 | pub use crate::{ |

242 | ahocorasick::{ |

243 | AhoCorasick, AhoCorasickBuilder, AhoCorasickKind, FindIter, |

244 | FindOverlappingIter, |

245 | }, |

246 | util::{ |

247 | error::{BuildError, MatchError, MatchErrorKind}, |

248 | primitives::{PatternID, PatternIDError}, |

249 | search::{Anchored, Input, Match, MatchKind, Span, StartKind}, |

250 | }, |

251 | }; |

252 | |

253 | #[macro_use] |

254 | mod macros; |

255 | |

256 | mod ahocorasick; |

257 | pub mod automaton; |

258 | pub mod dfa; |

259 | pub mod nfa; |

260 | pub mod packed; |

261 | #[cfg(test)] |

262 | mod tests; |

263 | // I wrote out the module for implementing fst::Automaton only to later realize |

264 | // that this would make fst a public dependency and fst is not at 1.0 yet. I |

265 | // decided to just keep the code in tree, but build it only during tests. |

266 | // |

267 | // TODO: I think I've changed my mind again. I'm considering pushing it out |

268 | // into either a separate crate or into 'fst' directly as an optional feature. |

269 | // #[cfg(test)] |

270 | // #[allow(dead_code)] |

271 | // mod transducer; |

272 | pub(crate) mod util; |

273 | |

274 | #[cfg(test)] |

275 | mod testoibits { |

276 | use std::panic::{RefUnwindSafe, UnwindSafe}; |

277 | |

278 | use super::*; |

279 | |

280 | fn assert_all<T: Send + Sync + UnwindSafe + RefUnwindSafe>() {} |

281 | |

282 | #[test] |

283 | fn oibits_main() { |

284 | assert_all::<AhoCorasick>(); |

285 | assert_all::<AhoCorasickBuilder>(); |

286 | assert_all::<AhoCorasickKind>(); |

287 | assert_all::<FindIter>(); |

288 | assert_all::<FindOverlappingIter>(); |

289 | |

290 | assert_all::<BuildError>(); |

291 | assert_all::<MatchError>(); |

292 | assert_all::<MatchErrorKind>(); |

293 | |

294 | assert_all::<Anchored>(); |

295 | assert_all::<Input>(); |

296 | assert_all::<Match>(); |

297 | assert_all::<MatchKind>(); |

298 | assert_all::<Span>(); |

299 | assert_all::<StartKind>(); |

300 | } |

301 | |

302 | #[test] |

303 | fn oibits_automaton() { |

304 | use crate::{automaton, dfa::DFA}; |

305 | |

306 | assert_all::<automaton::FindIter<DFA>>(); |

307 | assert_all::<automaton::FindOverlappingIter<DFA>>(); |

308 | #[cfg(feature = "std")] |

309 | assert_all::<automaton::StreamFindIter<DFA, std::io::Stdin>>(); |

310 | assert_all::<automaton::OverlappingState>(); |

311 | |

312 | assert_all::<automaton::Prefilter>(); |

313 | assert_all::<automaton::Candidate>(); |

314 | } |

315 | |

316 | #[test] |

317 | fn oibits_packed() { |

318 | use crate::packed; |

319 | |

320 | assert_all::<packed::Config>(); |

321 | assert_all::<packed::Builder>(); |

322 | assert_all::<packed::Searcher>(); |

323 | assert_all::<packed::FindIter>(); |

324 | assert_all::<packed::MatchKind>(); |

325 | } |

326 | } |

327 |