1 | /*! |
2 | A module for building and searching with deterministic finite automata (DFAs). |
3 | |
4 | Like other modules in this crate, DFAs support a rich regex syntax with Unicode |
5 | features. DFAs also have extensive options for configuring the best space vs |
6 | time trade off for your use case and provides support for cheap deserialization |
7 | of automata for use in `no_std` environments. |
8 | |
9 | If you're looking for lazy DFAs that build themselves incrementally during |
10 | search, then please see the top-level [`hybrid` module](crate::hybrid). |
11 | |
12 | # Overview |
13 | |
14 | This section gives a brief overview of the primary types in this module: |
15 | |
16 | * A [`regex::Regex`] provides a way to search for matches of a regular |
17 | expression using DFAs. This includes iterating over matches with both the start |
18 | and end positions of each match. |
19 | * A [`dense::DFA`] provides low level access to a DFA that uses a dense |
20 | representation (uses lots of space, but fast searching). |
21 | * A [`sparse::DFA`] provides the same API as a `dense::DFA`, but uses a sparse |
22 | representation (uses less space, but slower searching). |
23 | * An [`Automaton`] trait that defines an interface that both dense and sparse |
24 | DFAs implement. (A `regex::Regex` is generic over this trait.) |
25 | * Both dense DFAs and sparse DFAs support serialization to raw bytes (e.g., |
26 | [`dense::DFA::to_bytes_little_endian`]) and cheap deserialization (e.g., |
27 | [`dense::DFA::from_bytes`]). |
28 | |
29 | There is also a [`onepass`] module that provides a [one-pass |
30 | DFA](onepass::DFA). The unique advantage of this DFA is that, for the class |
31 | of regexes it can be built with, it supports reporting the spans of matching |
32 | capturing groups. It is the only DFA in this crate capable of such a thing. |
33 | |
34 | # Example: basic regex searching |
35 | |
36 | This example shows how to compile a regex using the default configuration |
37 | and then use it to find matches in a byte string: |
38 | |
39 | ``` |
40 | use regex_automata::{Match, dfa::regex::Regex}; |
41 | |
42 | let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" )?; |
43 | let text = b"2018-12-24 2016-10-08" ; |
44 | let matches: Vec<Match> = re.find_iter(text).collect(); |
45 | assert_eq!(matches, vec![ |
46 | Match::must(0, 0..10), |
47 | Match::must(0, 11..21), |
48 | ]); |
49 | # Ok::<(), Box<dyn std::error::Error>>(()) |
50 | ``` |
51 | |
52 | # Example: searching with regex sets |
53 | |
54 | The DFAs in this module all fully support searching with multiple regexes |
55 | simultaneously. You can use this support with standard leftmost-first style |
56 | searching to find non-overlapping matches: |
57 | |
58 | ``` |
59 | # if cfg!(miri) { return Ok(()); } // miri takes too long |
60 | use regex_automata::{Match, dfa::regex::Regex}; |
61 | |
62 | let re = Regex::new_many(&[r"\w+" , r"\S+" ])?; |
63 | let text = b"@foo bar" ; |
64 | let matches: Vec<Match> = re.find_iter(text).collect(); |
65 | assert_eq!(matches, vec![ |
66 | Match::must(1, 0..4), |
67 | Match::must(0, 5..8), |
68 | ]); |
69 | # Ok::<(), Box<dyn std::error::Error>>(()) |
70 | ``` |
71 | |
72 | # Example: use sparse DFAs |
73 | |
74 | By default, compiling a regex will use dense DFAs internally. This uses more |
75 | memory, but executes searches more quickly. If you can abide slower searches |
76 | (somewhere around 3-5x), then sparse DFAs might make more sense since they can |
77 | use significantly less space. |
78 | |
79 | Using sparse DFAs is as easy as using `Regex::new_sparse` instead of |
80 | `Regex::new`: |
81 | |
82 | ``` |
83 | use regex_automata::{Match, dfa::regex::Regex}; |
84 | |
85 | let re = Regex::new_sparse(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" ).unwrap(); |
86 | let text = b"2018-12-24 2016-10-08" ; |
87 | let matches: Vec<Match> = re.find_iter(text).collect(); |
88 | assert_eq!(matches, vec![ |
89 | Match::must(0, 0..10), |
90 | Match::must(0, 11..21), |
91 | ]); |
92 | # Ok::<(), Box<dyn std::error::Error>>(()) |
93 | ``` |
94 | |
95 | If you already have dense DFAs for some reason, they can be converted to sparse |
96 | DFAs and used to build a new `Regex`. For example: |
97 | |
98 | ``` |
99 | use regex_automata::{Match, dfa::regex::Regex}; |
100 | |
101 | let dense_re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" ).unwrap(); |
102 | let sparse_re = Regex::builder().build_from_dfas( |
103 | dense_re.forward().to_sparse()?, |
104 | dense_re.reverse().to_sparse()?, |
105 | ); |
106 | let text = b"2018-12-24 2016-10-08" ; |
107 | let matches: Vec<Match> = sparse_re.find_iter(text).collect(); |
108 | assert_eq!(matches, vec![ |
109 | Match::must(0, 0..10), |
110 | Match::must(0, 11..21), |
111 | ]); |
112 | # Ok::<(), Box<dyn std::error::Error>>(()) |
113 | ``` |
114 | |
115 | # Example: deserialize a DFA |
116 | |
117 | This shows how to first serialize a DFA into raw bytes, and then deserialize |
118 | those raw bytes back into a DFA. While this particular example is a |
119 | bit contrived, this same technique can be used in your program to |
120 | deserialize a DFA at start up time or by memory mapping a file. |
121 | |
122 | ``` |
123 | use regex_automata::{Match, dfa::{dense, regex::Regex}}; |
124 | |
125 | let re1 = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" ).unwrap(); |
126 | // serialize both the forward and reverse DFAs, see note below |
127 | let (fwd_bytes, fwd_pad) = re1.forward().to_bytes_native_endian(); |
128 | let (rev_bytes, rev_pad) = re1.reverse().to_bytes_native_endian(); |
129 | // now deserialize both---we need to specify the correct type! |
130 | let fwd: dense::DFA<&[u32]> = dense::DFA::from_bytes(&fwd_bytes[fwd_pad..])?.0; |
131 | let rev: dense::DFA<&[u32]> = dense::DFA::from_bytes(&rev_bytes[rev_pad..])?.0; |
132 | // finally, reconstruct our regex |
133 | let re2 = Regex::builder().build_from_dfas(fwd, rev); |
134 | |
135 | // we can use it like normal |
136 | let text = b"2018-12-24 2016-10-08" ; |
137 | let matches: Vec<Match> = re2.find_iter(text).collect(); |
138 | assert_eq!(matches, vec![ |
139 | Match::must(0, 0..10), |
140 | Match::must(0, 11..21), |
141 | ]); |
142 | # Ok::<(), Box<dyn std::error::Error>>(()) |
143 | ``` |
144 | |
145 | There are a few points worth noting here: |
146 | |
147 | * We need to extract the raw DFAs used by the regex and serialize those. You |
148 | can build the DFAs manually yourself using [`dense::Builder`], but using |
149 | the DFAs from a `Regex` guarantees that the DFAs are built correctly. (In |
150 | particular, a `Regex` constructs a reverse DFA for finding the starting |
151 | location of matches.) |
152 | * To convert the DFA to raw bytes, we use the `to_bytes_native_endian` method. |
153 | In practice, you'll want to use either [`dense::DFA::to_bytes_little_endian`] |
154 | or [`dense::DFA::to_bytes_big_endian`], depending on which platform you're |
155 | deserializing your DFA from. If you intend to deserialize on either platform, |
156 | then you'll need to serialize both and deserialize the right one depending on |
157 | your target's endianness. |
158 | * Safely deserializing a DFA requires verifying the raw bytes, particularly if |
159 | they are untrusted, since an invalid DFA could cause logical errors, panics |
160 | or even undefined behavior. This verification step requires visiting all of |
161 | the transitions in the DFA, which can be costly. If cheaper verification is |
162 | desired, then [`dense::DFA::from_bytes_unchecked`] is available that only does |
163 | verification that can be performed in constant time. However, one can only use |
164 | this routine if the caller can guarantee that the bytes provided encoded a |
165 | valid DFA. |
166 | |
167 | The same process can be achieved with sparse DFAs as well: |
168 | |
169 | ``` |
170 | use regex_automata::{Match, dfa::{sparse, regex::Regex}}; |
171 | |
172 | let re1 = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" ).unwrap(); |
173 | // serialize both |
174 | let fwd_bytes = re1.forward().to_sparse()?.to_bytes_native_endian(); |
175 | let rev_bytes = re1.reverse().to_sparse()?.to_bytes_native_endian(); |
176 | // now deserialize both---we need to specify the correct type! |
177 | let fwd: sparse::DFA<&[u8]> = sparse::DFA::from_bytes(&fwd_bytes)?.0; |
178 | let rev: sparse::DFA<&[u8]> = sparse::DFA::from_bytes(&rev_bytes)?.0; |
179 | // finally, reconstruct our regex |
180 | let re2 = Regex::builder().build_from_dfas(fwd, rev); |
181 | |
182 | // we can use it like normal |
183 | let text = b"2018-12-24 2016-10-08" ; |
184 | let matches: Vec<Match> = re2.find_iter(text).collect(); |
185 | assert_eq!(matches, vec![ |
186 | Match::must(0, 0..10), |
187 | Match::must(0, 11..21), |
188 | ]); |
189 | # Ok::<(), Box<dyn std::error::Error>>(()) |
190 | ``` |
191 | |
192 | Note that unlike dense DFAs, sparse DFAs have no alignment requirements. |
193 | Conversely, dense DFAs must be be aligned to the same alignment as a |
194 | [`StateID`](crate::util::primitives::StateID). |
195 | |
196 | # Support for `no_std` and `alloc`-only |
197 | |
198 | This crate comes with `alloc` and `std` features that are enabled by default. |
199 | When the `alloc` or `std` features are enabled, the API of this module will |
200 | include the facilities necessary for compiling, serializing, deserializing |
201 | and searching with DFAs. When only the `alloc` feature is enabled, then |
202 | implementations of the `std::error::Error` trait are dropped, but everything |
203 | else generally remains the same. When both the `alloc` and `std` features are |
204 | disabled, the API of this module will shrink such that it only includes the |
205 | facilities necessary for deserializing and searching with DFAs. |
206 | |
207 | The intended workflow for `no_std` environments is thus as follows: |
208 | |
209 | * Write a program with the `alloc` or `std` features that compiles and |
210 | serializes a regular expression. You may need to serialize both little and big |
211 | endian versions of each DFA. (So that's 4 DFAs in total for each regex.) |
212 | * In your `no_std` environment, follow the examples above for deserializing |
213 | your previously serialized DFAs into regexes. You can then search with them as |
214 | you would any regex. |
215 | |
216 | Deserialization can happen anywhere. For example, with bytes embedded into a |
217 | binary or with a file memory mapped at runtime. |
218 | |
219 | The `regex-cli` command (found in the same repository as this crate) can be |
220 | used to serialize DFAs to files and generate Rust code to read them. |
221 | |
222 | # Syntax |
223 | |
224 | This module supports the same syntax as the `regex` crate, since they share the |
225 | same parser. You can find an exhaustive list of supported syntax in the |
226 | [documentation for the `regex` crate](https://docs.rs/regex/1/regex/#syntax). |
227 | |
228 | There are two things that are not supported by the DFAs in this module: |
229 | |
230 | * Capturing groups. The DFAs (and [`Regex`](regex::Regex)es built on top |
231 | of them) can only find the offsets of an entire match, but cannot resolve |
232 | the offsets of each capturing group. This is because DFAs do not have the |
233 | expressive power necessary. |
234 | * Unicode word boundaries. These present particularly difficult challenges for |
235 | DFA construction and would result in an explosion in the number of states. |
236 | One can enable [`dense::Config::unicode_word_boundary`] though, which provides |
237 | heuristic support for Unicode word boundaries that only works on ASCII text. |
238 | Otherwise, one can use `(?-u:\b)` for an ASCII word boundary, which will work |
239 | on any input. |
240 | |
241 | There are no plans to lift either of these limitations. |
242 | |
243 | Note that these restrictions are identical to the restrictions on lazy DFAs. |
244 | |
245 | # Differences with general purpose regexes |
246 | |
247 | The main goal of the [`regex`](https://docs.rs/regex) crate is to serve as a |
248 | general purpose regular expression engine. It aims to automatically balance low |
249 | compile times, fast search times and low memory usage, while also providing |
250 | a convenient API for users. In contrast, this module provides a lower level |
251 | regular expression interface based exclusively on DFAs that is a bit less |
252 | convenient while providing more explicit control over memory usage and search |
253 | times. |
254 | |
255 | Here are some specific negative differences: |
256 | |
257 | * **Compilation can take an exponential amount of time and space** in the size |
258 | of the regex pattern. While most patterns do not exhibit worst case exponential |
259 | time, such patterns do exist. For example, `[01]*1[01]{N}` will build a DFA |
260 | with approximately `2^(N+2)` states. For this reason, untrusted patterns should |
261 | not be compiled with this module. (In the future, the API may expose an option |
262 | to return an error if the DFA gets too big.) |
263 | * This module does not support sub-match extraction via capturing groups, which |
264 | can be achieved with the regex crate's "captures" API. |
265 | * While the regex crate doesn't necessarily sport fast compilation times, |
266 | the regexes in this module are almost universally slow to compile, especially |
267 | when they contain large Unicode character classes. For example, on my system, |
268 | compiling `\w{50}` takes about 1 second and almost 15MB of memory! (Compiling |
269 | a sparse regex takes about the same time but only uses about 1.2MB of |
270 | memory.) Conversely, compiling the same regex without Unicode support, e.g., |
271 | `(?-u)\w{50}`, takes under 1 millisecond and about 15KB of memory. For this |
272 | reason, you should only use Unicode character classes if you absolutely need |
273 | them! (They are enabled by default though.) |
274 | * This module does not support Unicode word boundaries. ASCII word bondaries |
275 | may be used though by disabling Unicode or selectively doing so in the syntax, |
276 | e.g., `(?-u:\b)`. There is also an option to |
277 | [heuristically enable Unicode word boundaries](crate::dfa::dense::Config::unicode_word_boundary), |
278 | where the corresponding DFA will give up if any non-ASCII byte is seen. |
279 | * As a lower level API, this module does not do literal optimizations |
280 | automatically. Although it does provide hooks in its API to make use of the |
281 | [`Prefilter`](crate::util::prefilter::Prefilter) trait. Missing literal |
282 | optimizations means that searches may run much slower than what you're |
283 | accustomed to, although, it does provide more predictable and consistent |
284 | performance. |
285 | * There is no `&str` API like in the regex crate. In this module, all APIs |
286 | operate on `&[u8]`. By default, match indices are |
287 | guaranteed to fall on UTF-8 boundaries, unless either of |
288 | [`syntax::Config::utf8`](crate::util::syntax::Config::utf8) or |
289 | [`thompson::Config::utf8`](crate::nfa::thompson::Config::utf8) are disabled. |
290 | |
291 | With some of the downsides out of the way, here are some positive differences: |
292 | |
293 | * Both dense and sparse DFAs can be serialized to raw bytes, and then cheaply |
294 | deserialized. Deserialization can be done in constant time with the unchecked |
295 | APIs, since searching can be performed directly on the raw serialized bytes of |
296 | a DFA. |
297 | * This module was specifically designed so that the searching phase of a |
298 | DFA has minimal runtime requirements, and can therefore be used in `no_std` |
299 | environments. While `no_std` environments cannot compile regexes, they can |
300 | deserialize pre-compiled regexes. |
301 | * Since this module builds DFAs ahead of time, it will generally out-perform |
302 | the `regex` crate on equivalent tasks. The performance difference is likely |
303 | not large. However, because of a complex set of optimizations in the regex |
304 | crate (like literal optimizations), an accurate performance comparison may be |
305 | difficult to do. |
306 | * Sparse DFAs provide a way to build a DFA ahead of time that sacrifices search |
307 | performance a small amount, but uses much less storage space. Potentially even |
308 | less than what the regex crate uses. |
309 | * This module exposes DFAs directly, such as [`dense::DFA`] and |
310 | [`sparse::DFA`], which enables one to do less work in some cases. For example, |
311 | if you only need the end of a match and not the start of a match, then you can |
312 | use a DFA directly without building a `Regex`, which always requires a second |
313 | DFA to find the start of a match. |
314 | * This module provides more control over memory usage. Aside from choosing |
315 | between dense and sparse DFAs, one can also choose a smaller state identifier |
316 | representation to use less space. Also, one can enable DFA minimization |
317 | via [`dense::Config::minimize`], but it can increase compilation times |
318 | dramatically. |
319 | */ |
320 | |
321 | #[cfg (feature = "dfa-search" )] |
322 | pub use crate::dfa::{ |
323 | automaton::{Automaton, OverlappingState, StartError}, |
324 | start::StartKind, |
325 | }; |
326 | |
327 | /// This is an alias for a state ID of zero. It has special significance |
328 | /// because it always corresponds to the first state in a DFA, and the first |
329 | /// state in a DFA is always "dead." That is, the dead state always has all |
330 | /// of its transitions set to itself. Moreover, the dead state is used as a |
331 | /// sentinel for various things. e.g., In search, reaching a dead state means |
332 | /// that the search must stop. |
333 | const DEAD: crate::util::primitives::StateID = |
334 | crate::util::primitives::StateID::ZERO; |
335 | |
336 | #[cfg (feature = "dfa-search" )] |
337 | pub mod dense; |
338 | #[cfg (feature = "dfa-onepass" )] |
339 | pub mod onepass; |
340 | #[cfg (feature = "dfa-search" )] |
341 | pub mod regex; |
342 | #[cfg (feature = "dfa-search" )] |
343 | pub mod sparse; |
344 | |
345 | #[cfg (feature = "dfa-search" )] |
346 | pub(crate) mod accel; |
347 | #[cfg (feature = "dfa-search" )] |
348 | mod automaton; |
349 | #[cfg (feature = "dfa-build" )] |
350 | mod determinize; |
351 | #[cfg (feature = "dfa-build" )] |
352 | mod minimize; |
353 | #[cfg (any(feature = "dfa-build" , feature = "dfa-onepass" ))] |
354 | mod remapper; |
355 | #[cfg (feature = "dfa-search" )] |
356 | mod search; |
357 | #[cfg (feature = "dfa-search" )] |
358 | mod special; |
359 | #[cfg (feature = "dfa-search" )] |
360 | mod start; |
361 | |