1 | /*! |
2 | A module for building and searching with determinstic finite automata (DFAs). |
3 | |
4 | Like other modules in this crate, DFAs support a rich regex syntax with Unicode |
5 | features. DFAs also have extensive options for configuring the best space vs |
6 | time trade off for your use case and provides support for cheap deserialization |
7 | of automata for use in `no_std` environments. |
8 | |
9 | If you're looking for lazy DFAs that build themselves incrementally during |
10 | search, then please see the top-level [`hybrid` module](crate::hybrid). |
11 | |
12 | # Overview |
13 | |
14 | This section gives a brief overview of the primary types in this module: |
15 | |
16 | * A [`regex::Regex`] provides a way to search for matches of a regular |
17 | expression using DFAs. This includes iterating over matches with both the start |
18 | and end positions of each match. |
19 | * A [`dense::DFA`] provides low level access to a DFA that uses a dense |
20 | representation (uses lots of space, but fast searching). |
21 | * A [`sparse::DFA`] provides the same API as a `dense::DFA`, but uses a sparse |
22 | representation (uses less space, but slower searching). |
23 | * An [`Automaton`] trait that defines an interface that both dense and sparse |
24 | DFAs implement. (A `regex::Regex` is generic over this trait.) |
25 | * Both dense DFAs and sparse DFAs support serialization to raw bytes (e.g., |
26 | [`dense::DFA::to_bytes_little_endian`]) and cheap deserialization (e.g., |
27 | [`dense::DFA::from_bytes`]). |
28 | |
29 | # Example: basic regex searching |
30 | |
31 | This example shows how to compile a regex using the default configuration |
32 | and then use it to find matches in a byte string: |
33 | |
34 | ``` |
35 | use regex_automata::{MultiMatch, dfa::regex::Regex}; |
36 | |
37 | let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" )?; |
38 | let text = b"2018-12-24 2016-10-08" ; |
39 | let matches: Vec<MultiMatch> = re.find_leftmost_iter(text).collect(); |
40 | assert_eq!(matches, vec![ |
41 | MultiMatch::must(0, 0, 10), |
42 | MultiMatch::must(0, 11, 21), |
43 | ]); |
44 | # Ok::<(), Box<dyn std::error::Error>>(()) |
45 | ``` |
46 | |
47 | # Example: searching with regex sets |
48 | |
49 | The DFAs in this module all fully support searching with multiple regexes |
50 | simultaneously. You can use this support with standard leftmost-first style |
51 | searching to find non-overlapping matches: |
52 | |
53 | ``` |
54 | use regex_automata::{MultiMatch, dfa::regex::Regex}; |
55 | |
56 | let re = Regex::new_many(&[r"\w+" , r"\S+" ])?; |
57 | let text = b"@foo bar" ; |
58 | let matches: Vec<MultiMatch> = re.find_leftmost_iter(text).collect(); |
59 | assert_eq!(matches, vec![ |
60 | MultiMatch::must(1, 0, 4), |
61 | MultiMatch::must(0, 5, 8), |
62 | ]); |
63 | # Ok::<(), Box<dyn std::error::Error>>(()) |
64 | ``` |
65 | |
66 | Or use overlapping style searches to find all possible occurrences: |
67 | |
68 | ``` |
69 | use regex_automata::{MatchKind, MultiMatch, dfa::{dense, regex::Regex}}; |
70 | |
71 | // N.B. For overlapping searches, we need the underlying DFA to report all |
72 | // possible matches. |
73 | let re = Regex::builder() |
74 | .dense(dense::Config::new().match_kind(MatchKind::All)) |
75 | .build_many(&[r"\w{3}" , r"\S{3}" ])?; |
76 | let text = b"@foo bar" ; |
77 | let matches: Vec<MultiMatch> = re.find_overlapping_iter(text).collect(); |
78 | assert_eq!(matches, vec![ |
79 | MultiMatch::must(1, 0, 3), |
80 | MultiMatch::must(0, 1, 4), |
81 | MultiMatch::must(1, 1, 4), |
82 | MultiMatch::must(0, 5, 8), |
83 | MultiMatch::must(1, 5, 8), |
84 | ]); |
85 | # Ok::<(), Box<dyn std::error::Error>>(()) |
86 | ``` |
87 | |
88 | # Example: use sparse DFAs |
89 | |
90 | By default, compiling a regex will use dense DFAs internally. This uses more |
91 | memory, but executes searches more quickly. If you can abide slower searches |
92 | (somewhere around 3-5x), then sparse DFAs might make more sense since they can |
93 | use significantly less space. |
94 | |
95 | Using sparse DFAs is as easy as using `Regex::new_sparse` instead of |
96 | `Regex::new`: |
97 | |
98 | ``` |
99 | use regex_automata::{MultiMatch, dfa::regex::Regex}; |
100 | |
101 | let re = Regex::new_sparse(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" ).unwrap(); |
102 | let text = b"2018-12-24 2016-10-08" ; |
103 | let matches: Vec<MultiMatch> = re.find_leftmost_iter(text).collect(); |
104 | assert_eq!(matches, vec![ |
105 | MultiMatch::must(0, 0, 10), |
106 | MultiMatch::must(0, 11, 21), |
107 | ]); |
108 | # Ok::<(), Box<dyn std::error::Error>>(()) |
109 | ``` |
110 | |
111 | If you already have dense DFAs for some reason, they can be converted to sparse |
112 | DFAs and used to build a new `Regex`. For example: |
113 | |
114 | ``` |
115 | use regex_automata::{MultiMatch, dfa::regex::Regex}; |
116 | |
117 | let dense_re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" ).unwrap(); |
118 | let sparse_re = Regex::builder().build_from_dfas( |
119 | dense_re.forward().to_sparse()?, |
120 | dense_re.reverse().to_sparse()?, |
121 | ); |
122 | let text = b"2018-12-24 2016-10-08" ; |
123 | let matches: Vec<MultiMatch> = sparse_re.find_leftmost_iter(text).collect(); |
124 | assert_eq!(matches, vec![ |
125 | MultiMatch::must(0, 0, 10), |
126 | MultiMatch::must(0, 11, 21), |
127 | ]); |
128 | # Ok::<(), Box<dyn std::error::Error>>(()) |
129 | ``` |
130 | |
131 | # Example: deserialize a DFA |
132 | |
133 | This shows how to first serialize a DFA into raw bytes, and then deserialize |
134 | those raw bytes back into a DFA. While this particular example is a |
135 | bit contrived, this same technique can be used in your program to |
136 | deserialize a DFA at start up time or by memory mapping a file. |
137 | |
138 | ``` |
139 | use regex_automata::{MultiMatch, dfa::{dense, regex::Regex}}; |
140 | |
141 | let re1 = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" ).unwrap(); |
142 | // serialize both the forward and reverse DFAs, see note below |
143 | let (fwd_bytes, fwd_pad) = re1.forward().to_bytes_native_endian(); |
144 | let (rev_bytes, rev_pad) = re1.reverse().to_bytes_native_endian(); |
145 | // now deserialize both---we need to specify the correct type! |
146 | let fwd: dense::DFA<&[u32]> = dense::DFA::from_bytes(&fwd_bytes[fwd_pad..])?.0; |
147 | let rev: dense::DFA<&[u32]> = dense::DFA::from_bytes(&rev_bytes[rev_pad..])?.0; |
148 | // finally, reconstruct our regex |
149 | let re2 = Regex::builder().build_from_dfas(fwd, rev); |
150 | |
151 | // we can use it like normal |
152 | let text = b"2018-12-24 2016-10-08" ; |
153 | let matches: Vec<MultiMatch> = re2.find_leftmost_iter(text).collect(); |
154 | assert_eq!(matches, vec![ |
155 | MultiMatch::must(0, 0, 10), |
156 | MultiMatch::must(0, 11, 21), |
157 | ]); |
158 | # Ok::<(), Box<dyn std::error::Error>>(()) |
159 | ``` |
160 | |
161 | There are a few points worth noting here: |
162 | |
163 | * We need to extract the raw DFAs used by the regex and serialize those. You |
164 | can build the DFAs manually yourself using [`dense::Builder`], but using |
165 | the DFAs from a `Regex` guarantees that the DFAs are built correctly. (In |
166 | particular, a `Regex` constructs a reverse DFA for finding the starting |
167 | location of matches.) |
168 | * To convert the DFA to raw bytes, we use the `to_bytes_native_endian` method. |
169 | In practice, you'll want to use either [`dense::DFA::to_bytes_little_endian`] |
170 | or [`dense::DFA::to_bytes_big_endian`], depending on which platform you're |
171 | deserializing your DFA from. If you intend to deserialize on either platform, |
172 | then you'll need to serialize both and deserialize the right one depending on |
173 | your target's endianness. |
174 | * Safely deserializing a DFA requires verifying the raw bytes, particularly if |
175 | they are untrusted, since an invalid DFA could cause logical errors, panics |
176 | or even undefined behavior. This verification step requires visiting all of |
177 | the transitions in the DFA, which can be costly. If cheaper verification is |
178 | desired, then [`dense::DFA::from_bytes_unchecked`] is available that only does |
179 | verification that can be performed in constant time. However, one can only use |
180 | this routine if the caller can guarantee that the bytes provided encoded a |
181 | valid DFA. |
182 | |
183 | The same process can be achieved with sparse DFAs as well: |
184 | |
185 | ``` |
186 | use regex_automata::{MultiMatch, dfa::{sparse, regex::Regex}}; |
187 | |
188 | let re1 = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" ).unwrap(); |
189 | // serialize both |
190 | let fwd_bytes = re1.forward().to_sparse()?.to_bytes_native_endian(); |
191 | let rev_bytes = re1.reverse().to_sparse()?.to_bytes_native_endian(); |
192 | // now deserialize both---we need to specify the correct type! |
193 | let fwd: sparse::DFA<&[u8]> = sparse::DFA::from_bytes(&fwd_bytes)?.0; |
194 | let rev: sparse::DFA<&[u8]> = sparse::DFA::from_bytes(&rev_bytes)?.0; |
195 | // finally, reconstruct our regex |
196 | let re2 = Regex::builder().build_from_dfas(fwd, rev); |
197 | |
198 | // we can use it like normal |
199 | let text = b"2018-12-24 2016-10-08" ; |
200 | let matches: Vec<MultiMatch> = re2.find_leftmost_iter(text).collect(); |
201 | assert_eq!(matches, vec![ |
202 | MultiMatch::must(0, 0, 10), |
203 | MultiMatch::must(0, 11, 21), |
204 | ]); |
205 | # Ok::<(), Box<dyn std::error::Error>>(()) |
206 | ``` |
207 | |
208 | Note that unlike dense DFAs, sparse DFAs have no alignment requirements. |
209 | Conversely, dense DFAs must be be aligned to the same alignment as a |
210 | [`StateID`](crate::util::id::StateID). |
211 | |
212 | # Support for `no_std` and `alloc`-only |
213 | |
214 | This crate comes with `alloc` and `std` features that are enabled by default. |
215 | When the `alloc` or `std` features are enabled, the API of this module will |
216 | include the facilities necessary for compiling, serializing, deserializing |
217 | and searching with DFAs. When only the `alloc` feature is enabled, then |
218 | implementations of the `std::error::Error` trait are dropped, but everything |
219 | else generally remains the same. When both the `alloc` and `std` features are |
220 | disabled, the API of this module will shrink such that it only includes the |
221 | facilities necessary for deserializing and searching with DFAs. |
222 | |
223 | The intended workflow for `no_std` environments is thus as follows: |
224 | |
225 | * Write a program with the `alloc` or `std` features that compiles and |
226 | serializes a regular expression. You may need to serialize both little and big |
227 | endian versions of each DFA. (So that's 4 DFAs in total for each regex.) |
228 | * In your `no_std` environment, follow the examples above for deserializing |
229 | your previously serialized DFAs into regexes. You can then search with them as |
230 | you would any regex. |
231 | |
232 | Deserialization can happen anywhere. For example, with bytes embedded into a |
233 | binary or with a file memory mapped at runtime. |
234 | |
235 | TODO: Include link to `regex-cli` here pointing out how to generate Rust code |
236 | for deserializing DFAs. |
237 | |
238 | # Syntax |
239 | |
240 | This module supports the same syntax as the `regex` crate, since they share the |
241 | same parser. You can find an exhaustive list of supported syntax in the |
242 | [documentation for the `regex` crate](https://docs.rs/regex/1/regex/#syntax). |
243 | |
244 | There are two things that are not supported by the DFAs in this module: |
245 | |
246 | * Capturing groups. The DFAs (and [`Regex`](regex::Regex)es built on top |
247 | of them) can only find the offsets of an entire match, but cannot resolve |
248 | the offsets of each capturing group. This is because DFAs do not have the |
249 | expressive power necessary. |
250 | * Unicode word boundaries. These present particularly difficult challenges for |
251 | DFA construction and would result in an explosion in the number of states. |
252 | One can enable [`dense::Config::unicode_word_boundary`] though, which provides |
253 | heuristic support for Unicode word boundaries that only works on ASCII text. |
254 | Otherwise, one can use `(?-u:\b)` for an ASCII word boundary, which will work |
255 | on any input. |
256 | |
257 | There are no plans to lift either of these limitations. |
258 | |
259 | Note that these restrictions are identical to the restrictions on lazy DFAs. |
260 | |
261 | # Differences with general purpose regexes |
262 | |
263 | The main goal of the [`regex`](https://docs.rs/regex) crate is to serve as a |
264 | general purpose regular expression engine. It aims to automatically balance low |
265 | compile times, fast search times and low memory usage, while also providing |
266 | a convenient API for users. In contrast, this module provides a lower level |
267 | regular expression interface based exclusively on DFAs that is a bit less |
268 | convenient while providing more explicit control over memory usage and search |
269 | times. |
270 | |
271 | Here are some specific negative differences: |
272 | |
273 | * **Compilation can take an exponential amount of time and space** in the size |
274 | of the regex pattern. While most patterns do not exhibit worst case exponential |
275 | time, such patterns do exist. For example, `[01]*1[01]{N}` will build a DFA |
276 | with approximately `2^(N+2)` states. For this reason, untrusted patterns should |
277 | not be compiled with this module. (In the future, the API may expose an option |
278 | to return an error if the DFA gets too big.) |
279 | * This module does not support sub-match extraction via capturing groups, which |
280 | can be achieved with the regex crate's "captures" API. |
281 | * While the regex crate doesn't necessarily sport fast compilation times, |
282 | the regexes in this module are almost universally slow to compile, especially |
283 | when they contain large Unicode character classes. For example, on my system, |
284 | compiling `\w{50}` takes about 1 second and almost 15MB of memory! (Compiling |
285 | a sparse regex takes about the same time but only uses about 1.2MB of |
286 | memory.) Conversly, compiling the same regex without Unicode support, e.g., |
287 | `(?-u)\w{50}`, takes under 1 millisecond and about 15KB of memory. For this |
288 | reason, you should only use Unicode character classes if you absolutely need |
289 | them! (They are enabled by default though.) |
290 | * This module does not support Unicode word boundaries. ASCII word bondaries |
291 | may be used though by disabling Unicode or selectively doing so in the syntax, |
292 | e.g., `(?-u:\b)`. There is also an option to |
293 | [heuristically enable Unicode word boundaries](crate::dfa::dense::Config::unicode_word_boundary), |
294 | where the corresponding DFA will give up if any non-ASCII byte is seen. |
295 | * As a lower level API, this module does not do literal optimizations |
296 | automatically. Although it does provide hooks in its API to make use of the |
297 | [`Prefilter`](crate::util::prefilter::Prefilter) trait. Missing literal |
298 | optimizations means that searches may run much slower than what you're |
299 | accustomed to, although, it does provide more predictable and consistent |
300 | performance. |
301 | * There is no `&str` API like in the regex crate. In this module, all APIs |
302 | operate on `&[u8]`. By default, match indices are guaranteed to fall on UTF-8 |
303 | boundaries, unless any of [`SyntaxConfig::utf8`](crate::SyntaxConfig::utf8), |
304 | [`nfa::thompson::Config::utf8`](crate::nfa::thompson::Config::utf8) or |
305 | [`regex::Config::utf8`] are disabled. |
306 | |
307 | With some of the downsides out of the way, here are some positive differences: |
308 | |
309 | * Both dense and sparse DFAs can be serialized to raw bytes, and then cheaply |
310 | deserialized. Deserialization can be done in constant time with the unchecked |
311 | APIs, since searching can be performed directly on the raw serialized bytes of |
312 | a DFA. |
313 | * This module was specifically designed so that the searching phase of a |
314 | DFA has minimal runtime requirements, and can therefore be used in `no_std` |
315 | environments. While `no_std` environments cannot compile regexes, they can |
316 | deserialize pre-compiled regexes. |
317 | * Since this module builds DFAs ahead of time, it will generally out-perform |
318 | the `regex` crate on equivalent tasks. The performance difference is likely |
319 | not large. However, because of a complex set of optimizations in the regex |
320 | crate (like literal optimizations), an accurate performance comparison may be |
321 | difficult to do. |
322 | * Sparse DFAs provide a way to build a DFA ahead of time that sacrifices search |
323 | performance a small amount, but uses much less storage space. Potentially even |
324 | less than what the regex crate uses. |
325 | * This module exposes DFAs directly, such as [`dense::DFA`] and |
326 | [`sparse::DFA`], which enables one to do less work in some cases. For example, |
327 | if you only need the end of a match and not the start of a match, then you can |
328 | use a DFA directly without building a `Regex`, which always requires a second |
329 | DFA to find the start of a match. |
330 | * This module provides more control over memory usage. Aside from choosing |
331 | between dense and sparse DFAs, one can also choose a smaller state identifier |
332 | representation to use less space. Also, one can enable DFA minimization |
333 | via [`dense::Config::minimize`], but it can increase compilation times |
334 | dramatically. |
335 | */ |
336 | |
337 | pub use crate::dfa::automaton::{Automaton, OverlappingState}; |
338 | #[cfg (feature = "alloc" )] |
339 | pub use crate::dfa::error::Error; |
340 | |
341 | /// This is an alias for a state ID of zero. It has special significance |
342 | /// because it always corresponds to the first state in a DFA, and the first |
343 | /// state in a DFA is always "dead." That is, the dead state always has all |
344 | /// of its transitions set to itself. Moreover, the dead state is used as a |
345 | /// sentinel for various things. e.g., In search, reaching a dead state means |
346 | /// that the search must stop. |
347 | const DEAD: crate::util::id::StateID = crate::util::id::StateID::ZERO; |
348 | |
349 | mod accel; |
350 | mod automaton; |
351 | pub mod dense; |
352 | #[cfg (feature = "alloc" )] |
353 | mod determinize; |
354 | #[cfg (feature = "alloc" )] |
355 | pub(crate) mod error; |
356 | #[cfg (feature = "alloc" )] |
357 | mod minimize; |
358 | pub mod regex; |
359 | mod search; |
360 | pub mod sparse; |
361 | mod special; |
362 | #[cfg (feature = "transducer" )] |
363 | mod transducer; |
364 | |