| 1 | /*! |
| 2 | A module for building and searching with determinstic finite automata (DFAs). |
| 3 | |
| 4 | Like other modules in this crate, DFAs support a rich regex syntax with Unicode |
| 5 | features. DFAs also have extensive options for configuring the best space vs |
| 6 | time trade off for your use case and provides support for cheap deserialization |
| 7 | of automata for use in `no_std` environments. |
| 8 | |
| 9 | If you're looking for lazy DFAs that build themselves incrementally during |
| 10 | search, then please see the top-level [`hybrid` module](crate::hybrid). |
| 11 | |
| 12 | # Overview |
| 13 | |
| 14 | This section gives a brief overview of the primary types in this module: |
| 15 | |
| 16 | * A [`regex::Regex`] provides a way to search for matches of a regular |
| 17 | expression using DFAs. This includes iterating over matches with both the start |
| 18 | and end positions of each match. |
| 19 | * A [`dense::DFA`] provides low level access to a DFA that uses a dense |
| 20 | representation (uses lots of space, but fast searching). |
| 21 | * A [`sparse::DFA`] provides the same API as a `dense::DFA`, but uses a sparse |
| 22 | representation (uses less space, but slower searching). |
| 23 | * An [`Automaton`] trait that defines an interface that both dense and sparse |
| 24 | DFAs implement. (A `regex::Regex` is generic over this trait.) |
| 25 | * Both dense DFAs and sparse DFAs support serialization to raw bytes (e.g., |
| 26 | [`dense::DFA::to_bytes_little_endian`]) and cheap deserialization (e.g., |
| 27 | [`dense::DFA::from_bytes`]). |
| 28 | |
| 29 | # Example: basic regex searching |
| 30 | |
| 31 | This example shows how to compile a regex using the default configuration |
| 32 | and then use it to find matches in a byte string: |
| 33 | |
| 34 | ``` |
| 35 | use regex_automata::{MultiMatch, dfa::regex::Regex}; |
| 36 | |
| 37 | let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" )?; |
| 38 | let text = b"2018-12-24 2016-10-08" ; |
| 39 | let matches: Vec<MultiMatch> = re.find_leftmost_iter(text).collect(); |
| 40 | assert_eq!(matches, vec![ |
| 41 | MultiMatch::must(0, 0, 10), |
| 42 | MultiMatch::must(0, 11, 21), |
| 43 | ]); |
| 44 | # Ok::<(), Box<dyn std::error::Error>>(()) |
| 45 | ``` |
| 46 | |
| 47 | # Example: searching with regex sets |
| 48 | |
| 49 | The DFAs in this module all fully support searching with multiple regexes |
| 50 | simultaneously. You can use this support with standard leftmost-first style |
| 51 | searching to find non-overlapping matches: |
| 52 | |
| 53 | ``` |
| 54 | use regex_automata::{MultiMatch, dfa::regex::Regex}; |
| 55 | |
| 56 | let re = Regex::new_many(&[r"\w+" , r"\S+" ])?; |
| 57 | let text = b"@foo bar" ; |
| 58 | let matches: Vec<MultiMatch> = re.find_leftmost_iter(text).collect(); |
| 59 | assert_eq!(matches, vec![ |
| 60 | MultiMatch::must(1, 0, 4), |
| 61 | MultiMatch::must(0, 5, 8), |
| 62 | ]); |
| 63 | # Ok::<(), Box<dyn std::error::Error>>(()) |
| 64 | ``` |
| 65 | |
| 66 | Or use overlapping style searches to find all possible occurrences: |
| 67 | |
| 68 | ``` |
| 69 | use regex_automata::{MatchKind, MultiMatch, dfa::{dense, regex::Regex}}; |
| 70 | |
| 71 | // N.B. For overlapping searches, we need the underlying DFA to report all |
| 72 | // possible matches. |
| 73 | let re = Regex::builder() |
| 74 | .dense(dense::Config::new().match_kind(MatchKind::All)) |
| 75 | .build_many(&[r"\w{3}" , r"\S{3}" ])?; |
| 76 | let text = b"@foo bar" ; |
| 77 | let matches: Vec<MultiMatch> = re.find_overlapping_iter(text).collect(); |
| 78 | assert_eq!(matches, vec![ |
| 79 | MultiMatch::must(1, 0, 3), |
| 80 | MultiMatch::must(0, 1, 4), |
| 81 | MultiMatch::must(1, 1, 4), |
| 82 | MultiMatch::must(0, 5, 8), |
| 83 | MultiMatch::must(1, 5, 8), |
| 84 | ]); |
| 85 | # Ok::<(), Box<dyn std::error::Error>>(()) |
| 86 | ``` |
| 87 | |
| 88 | # Example: use sparse DFAs |
| 89 | |
| 90 | By default, compiling a regex will use dense DFAs internally. This uses more |
| 91 | memory, but executes searches more quickly. If you can abide slower searches |
| 92 | (somewhere around 3-5x), then sparse DFAs might make more sense since they can |
| 93 | use significantly less space. |
| 94 | |
| 95 | Using sparse DFAs is as easy as using `Regex::new_sparse` instead of |
| 96 | `Regex::new`: |
| 97 | |
| 98 | ``` |
| 99 | use regex_automata::{MultiMatch, dfa::regex::Regex}; |
| 100 | |
| 101 | let re = Regex::new_sparse(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" ).unwrap(); |
| 102 | let text = b"2018-12-24 2016-10-08" ; |
| 103 | let matches: Vec<MultiMatch> = re.find_leftmost_iter(text).collect(); |
| 104 | assert_eq!(matches, vec![ |
| 105 | MultiMatch::must(0, 0, 10), |
| 106 | MultiMatch::must(0, 11, 21), |
| 107 | ]); |
| 108 | # Ok::<(), Box<dyn std::error::Error>>(()) |
| 109 | ``` |
| 110 | |
| 111 | If you already have dense DFAs for some reason, they can be converted to sparse |
| 112 | DFAs and used to build a new `Regex`. For example: |
| 113 | |
| 114 | ``` |
| 115 | use regex_automata::{MultiMatch, dfa::regex::Regex}; |
| 116 | |
| 117 | let dense_re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" ).unwrap(); |
| 118 | let sparse_re = Regex::builder().build_from_dfas( |
| 119 | dense_re.forward().to_sparse()?, |
| 120 | dense_re.reverse().to_sparse()?, |
| 121 | ); |
| 122 | let text = b"2018-12-24 2016-10-08" ; |
| 123 | let matches: Vec<MultiMatch> = sparse_re.find_leftmost_iter(text).collect(); |
| 124 | assert_eq!(matches, vec![ |
| 125 | MultiMatch::must(0, 0, 10), |
| 126 | MultiMatch::must(0, 11, 21), |
| 127 | ]); |
| 128 | # Ok::<(), Box<dyn std::error::Error>>(()) |
| 129 | ``` |
| 130 | |
| 131 | # Example: deserialize a DFA |
| 132 | |
| 133 | This shows how to first serialize a DFA into raw bytes, and then deserialize |
| 134 | those raw bytes back into a DFA. While this particular example is a |
| 135 | bit contrived, this same technique can be used in your program to |
| 136 | deserialize a DFA at start up time or by memory mapping a file. |
| 137 | |
| 138 | ``` |
| 139 | use regex_automata::{MultiMatch, dfa::{dense, regex::Regex}}; |
| 140 | |
| 141 | let re1 = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" ).unwrap(); |
| 142 | // serialize both the forward and reverse DFAs, see note below |
| 143 | let (fwd_bytes, fwd_pad) = re1.forward().to_bytes_native_endian(); |
| 144 | let (rev_bytes, rev_pad) = re1.reverse().to_bytes_native_endian(); |
| 145 | // now deserialize both---we need to specify the correct type! |
| 146 | let fwd: dense::DFA<&[u32]> = dense::DFA::from_bytes(&fwd_bytes[fwd_pad..])?.0; |
| 147 | let rev: dense::DFA<&[u32]> = dense::DFA::from_bytes(&rev_bytes[rev_pad..])?.0; |
| 148 | // finally, reconstruct our regex |
| 149 | let re2 = Regex::builder().build_from_dfas(fwd, rev); |
| 150 | |
| 151 | // we can use it like normal |
| 152 | let text = b"2018-12-24 2016-10-08" ; |
| 153 | let matches: Vec<MultiMatch> = re2.find_leftmost_iter(text).collect(); |
| 154 | assert_eq!(matches, vec![ |
| 155 | MultiMatch::must(0, 0, 10), |
| 156 | MultiMatch::must(0, 11, 21), |
| 157 | ]); |
| 158 | # Ok::<(), Box<dyn std::error::Error>>(()) |
| 159 | ``` |
| 160 | |
| 161 | There are a few points worth noting here: |
| 162 | |
| 163 | * We need to extract the raw DFAs used by the regex and serialize those. You |
| 164 | can build the DFAs manually yourself using [`dense::Builder`], but using |
| 165 | the DFAs from a `Regex` guarantees that the DFAs are built correctly. (In |
| 166 | particular, a `Regex` constructs a reverse DFA for finding the starting |
| 167 | location of matches.) |
| 168 | * To convert the DFA to raw bytes, we use the `to_bytes_native_endian` method. |
| 169 | In practice, you'll want to use either [`dense::DFA::to_bytes_little_endian`] |
| 170 | or [`dense::DFA::to_bytes_big_endian`], depending on which platform you're |
| 171 | deserializing your DFA from. If you intend to deserialize on either platform, |
| 172 | then you'll need to serialize both and deserialize the right one depending on |
| 173 | your target's endianness. |
| 174 | * Safely deserializing a DFA requires verifying the raw bytes, particularly if |
| 175 | they are untrusted, since an invalid DFA could cause logical errors, panics |
| 176 | or even undefined behavior. This verification step requires visiting all of |
| 177 | the transitions in the DFA, which can be costly. If cheaper verification is |
| 178 | desired, then [`dense::DFA::from_bytes_unchecked`] is available that only does |
| 179 | verification that can be performed in constant time. However, one can only use |
| 180 | this routine if the caller can guarantee that the bytes provided encoded a |
| 181 | valid DFA. |
| 182 | |
| 183 | The same process can be achieved with sparse DFAs as well: |
| 184 | |
| 185 | ``` |
| 186 | use regex_automata::{MultiMatch, dfa::{sparse, regex::Regex}}; |
| 187 | |
| 188 | let re1 = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" ).unwrap(); |
| 189 | // serialize both |
| 190 | let fwd_bytes = re1.forward().to_sparse()?.to_bytes_native_endian(); |
| 191 | let rev_bytes = re1.reverse().to_sparse()?.to_bytes_native_endian(); |
| 192 | // now deserialize both---we need to specify the correct type! |
| 193 | let fwd: sparse::DFA<&[u8]> = sparse::DFA::from_bytes(&fwd_bytes)?.0; |
| 194 | let rev: sparse::DFA<&[u8]> = sparse::DFA::from_bytes(&rev_bytes)?.0; |
| 195 | // finally, reconstruct our regex |
| 196 | let re2 = Regex::builder().build_from_dfas(fwd, rev); |
| 197 | |
| 198 | // we can use it like normal |
| 199 | let text = b"2018-12-24 2016-10-08" ; |
| 200 | let matches: Vec<MultiMatch> = re2.find_leftmost_iter(text).collect(); |
| 201 | assert_eq!(matches, vec![ |
| 202 | MultiMatch::must(0, 0, 10), |
| 203 | MultiMatch::must(0, 11, 21), |
| 204 | ]); |
| 205 | # Ok::<(), Box<dyn std::error::Error>>(()) |
| 206 | ``` |
| 207 | |
| 208 | Note that unlike dense DFAs, sparse DFAs have no alignment requirements. |
| 209 | Conversely, dense DFAs must be be aligned to the same alignment as a |
| 210 | [`StateID`](crate::util::id::StateID). |
| 211 | |
| 212 | # Support for `no_std` and `alloc`-only |
| 213 | |
| 214 | This crate comes with `alloc` and `std` features that are enabled by default. |
| 215 | When the `alloc` or `std` features are enabled, the API of this module will |
| 216 | include the facilities necessary for compiling, serializing, deserializing |
| 217 | and searching with DFAs. When only the `alloc` feature is enabled, then |
| 218 | implementations of the `std::error::Error` trait are dropped, but everything |
| 219 | else generally remains the same. When both the `alloc` and `std` features are |
| 220 | disabled, the API of this module will shrink such that it only includes the |
| 221 | facilities necessary for deserializing and searching with DFAs. |
| 222 | |
| 223 | The intended workflow for `no_std` environments is thus as follows: |
| 224 | |
| 225 | * Write a program with the `alloc` or `std` features that compiles and |
| 226 | serializes a regular expression. You may need to serialize both little and big |
| 227 | endian versions of each DFA. (So that's 4 DFAs in total for each regex.) |
| 228 | * In your `no_std` environment, follow the examples above for deserializing |
| 229 | your previously serialized DFAs into regexes. You can then search with them as |
| 230 | you would any regex. |
| 231 | |
| 232 | Deserialization can happen anywhere. For example, with bytes embedded into a |
| 233 | binary or with a file memory mapped at runtime. |
| 234 | |
| 235 | TODO: Include link to `regex-cli` here pointing out how to generate Rust code |
| 236 | for deserializing DFAs. |
| 237 | |
| 238 | # Syntax |
| 239 | |
| 240 | This module supports the same syntax as the `regex` crate, since they share the |
| 241 | same parser. You can find an exhaustive list of supported syntax in the |
| 242 | [documentation for the `regex` crate](https://docs.rs/regex/1/regex/#syntax). |
| 243 | |
| 244 | There are two things that are not supported by the DFAs in this module: |
| 245 | |
| 246 | * Capturing groups. The DFAs (and [`Regex`](regex::Regex)es built on top |
| 247 | of them) can only find the offsets of an entire match, but cannot resolve |
| 248 | the offsets of each capturing group. This is because DFAs do not have the |
| 249 | expressive power necessary. |
| 250 | * Unicode word boundaries. These present particularly difficult challenges for |
| 251 | DFA construction and would result in an explosion in the number of states. |
| 252 | One can enable [`dense::Config::unicode_word_boundary`] though, which provides |
| 253 | heuristic support for Unicode word boundaries that only works on ASCII text. |
| 254 | Otherwise, one can use `(?-u:\b)` for an ASCII word boundary, which will work |
| 255 | on any input. |
| 256 | |
| 257 | There are no plans to lift either of these limitations. |
| 258 | |
| 259 | Note that these restrictions are identical to the restrictions on lazy DFAs. |
| 260 | |
| 261 | # Differences with general purpose regexes |
| 262 | |
| 263 | The main goal of the [`regex`](https://docs.rs/regex) crate is to serve as a |
| 264 | general purpose regular expression engine. It aims to automatically balance low |
| 265 | compile times, fast search times and low memory usage, while also providing |
| 266 | a convenient API for users. In contrast, this module provides a lower level |
| 267 | regular expression interface based exclusively on DFAs that is a bit less |
| 268 | convenient while providing more explicit control over memory usage and search |
| 269 | times. |
| 270 | |
| 271 | Here are some specific negative differences: |
| 272 | |
| 273 | * **Compilation can take an exponential amount of time and space** in the size |
| 274 | of the regex pattern. While most patterns do not exhibit worst case exponential |
| 275 | time, such patterns do exist. For example, `[01]*1[01]{N}` will build a DFA |
| 276 | with approximately `2^(N+2)` states. For this reason, untrusted patterns should |
| 277 | not be compiled with this module. (In the future, the API may expose an option |
| 278 | to return an error if the DFA gets too big.) |
| 279 | * This module does not support sub-match extraction via capturing groups, which |
| 280 | can be achieved with the regex crate's "captures" API. |
| 281 | * While the regex crate doesn't necessarily sport fast compilation times, |
| 282 | the regexes in this module are almost universally slow to compile, especially |
| 283 | when they contain large Unicode character classes. For example, on my system, |
| 284 | compiling `\w{50}` takes about 1 second and almost 15MB of memory! (Compiling |
| 285 | a sparse regex takes about the same time but only uses about 1.2MB of |
| 286 | memory.) Conversly, compiling the same regex without Unicode support, e.g., |
| 287 | `(?-u)\w{50}`, takes under 1 millisecond and about 15KB of memory. For this |
| 288 | reason, you should only use Unicode character classes if you absolutely need |
| 289 | them! (They are enabled by default though.) |
| 290 | * This module does not support Unicode word boundaries. ASCII word bondaries |
| 291 | may be used though by disabling Unicode or selectively doing so in the syntax, |
| 292 | e.g., `(?-u:\b)`. There is also an option to |
| 293 | [heuristically enable Unicode word boundaries](crate::dfa::dense::Config::unicode_word_boundary), |
| 294 | where the corresponding DFA will give up if any non-ASCII byte is seen. |
| 295 | * As a lower level API, this module does not do literal optimizations |
| 296 | automatically. Although it does provide hooks in its API to make use of the |
| 297 | [`Prefilter`](crate::util::prefilter::Prefilter) trait. Missing literal |
| 298 | optimizations means that searches may run much slower than what you're |
| 299 | accustomed to, although, it does provide more predictable and consistent |
| 300 | performance. |
| 301 | * There is no `&str` API like in the regex crate. In this module, all APIs |
| 302 | operate on `&[u8]`. By default, match indices are guaranteed to fall on UTF-8 |
| 303 | boundaries, unless any of [`SyntaxConfig::utf8`](crate::SyntaxConfig::utf8), |
| 304 | [`nfa::thompson::Config::utf8`](crate::nfa::thompson::Config::utf8) or |
| 305 | [`regex::Config::utf8`] are disabled. |
| 306 | |
| 307 | With some of the downsides out of the way, here are some positive differences: |
| 308 | |
| 309 | * Both dense and sparse DFAs can be serialized to raw bytes, and then cheaply |
| 310 | deserialized. Deserialization can be done in constant time with the unchecked |
| 311 | APIs, since searching can be performed directly on the raw serialized bytes of |
| 312 | a DFA. |
| 313 | * This module was specifically designed so that the searching phase of a |
| 314 | DFA has minimal runtime requirements, and can therefore be used in `no_std` |
| 315 | environments. While `no_std` environments cannot compile regexes, they can |
| 316 | deserialize pre-compiled regexes. |
| 317 | * Since this module builds DFAs ahead of time, it will generally out-perform |
| 318 | the `regex` crate on equivalent tasks. The performance difference is likely |
| 319 | not large. However, because of a complex set of optimizations in the regex |
| 320 | crate (like literal optimizations), an accurate performance comparison may be |
| 321 | difficult to do. |
| 322 | * Sparse DFAs provide a way to build a DFA ahead of time that sacrifices search |
| 323 | performance a small amount, but uses much less storage space. Potentially even |
| 324 | less than what the regex crate uses. |
| 325 | * This module exposes DFAs directly, such as [`dense::DFA`] and |
| 326 | [`sparse::DFA`], which enables one to do less work in some cases. For example, |
| 327 | if you only need the end of a match and not the start of a match, then you can |
| 328 | use a DFA directly without building a `Regex`, which always requires a second |
| 329 | DFA to find the start of a match. |
| 330 | * This module provides more control over memory usage. Aside from choosing |
| 331 | between dense and sparse DFAs, one can also choose a smaller state identifier |
| 332 | representation to use less space. Also, one can enable DFA minimization |
| 333 | via [`dense::Config::minimize`], but it can increase compilation times |
| 334 | dramatically. |
| 335 | */ |
| 336 | |
| 337 | pub use crate::dfa::automaton::{Automaton, OverlappingState}; |
| 338 | #[cfg (feature = "alloc" )] |
| 339 | pub use crate::dfa::error::Error; |
| 340 | |
| 341 | /// This is an alias for a state ID of zero. It has special significance |
| 342 | /// because it always corresponds to the first state in a DFA, and the first |
| 343 | /// state in a DFA is always "dead." That is, the dead state always has all |
| 344 | /// of its transitions set to itself. Moreover, the dead state is used as a |
| 345 | /// sentinel for various things. e.g., In search, reaching a dead state means |
| 346 | /// that the search must stop. |
| 347 | const DEAD: crate::util::id::StateID = crate::util::id::StateID::ZERO; |
| 348 | |
| 349 | mod accel; |
| 350 | mod automaton; |
| 351 | pub mod dense; |
| 352 | #[cfg (feature = "alloc" )] |
| 353 | mod determinize; |
| 354 | #[cfg (feature = "alloc" )] |
| 355 | pub(crate) mod error; |
| 356 | #[cfg (feature = "alloc" )] |
| 357 | mod minimize; |
| 358 | pub mod regex; |
| 359 | mod search; |
| 360 | pub mod sparse; |
| 361 | mod special; |
| 362 | #[cfg (feature = "transducer" )] |
| 363 | mod transducer; |
| 364 | |