| 1 | /*! |
| 2 | A module for building and searching with lazy deterministic finite automata |
| 3 | (DFAs). |
| 4 | |
| 5 | Like other modules in this crate, lazy DFAs support a rich regex syntax with |
| 6 | Unicode features. The key feature of a lazy DFA is that it builds itself |
| 7 | incrementally during search, and never uses more than a configured capacity of |
| 8 | memory. Thus, when searching with a lazy DFA, one must supply a mutable "cache" |
| 9 | in which the actual DFA's transition table is stored. |
| 10 | |
| 11 | If you're looking for fully compiled DFAs, then please see the top-level |
| 12 | [`dfa` module](crate::dfa). |
| 13 | |
| 14 | # Overview |
| 15 | |
| 16 | This section gives a brief overview of the primary types in this module: |
| 17 | |
| 18 | * A [`regex::Regex`] provides a way to search for matches of a regular |
| 19 | expression using lazy DFAs. This includes iterating over matches with both the |
| 20 | start and end positions of each match. |
| 21 | * A [`dfa::DFA`] provides direct low level access to a lazy DFA. |
| 22 | |
| 23 | # Example: basic regex searching |
| 24 | |
| 25 | This example shows how to compile a regex using the default configuration |
| 26 | and then use it to find matches in a byte string: |
| 27 | |
| 28 | ``` |
| 29 | use regex_automata::{hybrid::regex::Regex, Match}; |
| 30 | |
| 31 | let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" )?; |
| 32 | let mut cache = re.create_cache(); |
| 33 | |
| 34 | let haystack = "2018-12-24 2016-10-08" ; |
| 35 | let matches: Vec<Match> = re.find_iter(&mut cache, haystack).collect(); |
| 36 | assert_eq!(matches, vec![ |
| 37 | Match::must(0, 0..10), |
| 38 | Match::must(0, 11..21), |
| 39 | ]); |
| 40 | # Ok::<(), Box<dyn std::error::Error>>(()) |
| 41 | ``` |
| 42 | |
| 43 | # Example: searching with multiple regexes |
| 44 | |
| 45 | The lazy DFAs in this module all fully support searching with multiple regexes |
| 46 | simultaneously. You can use this support with standard leftmost-first style |
| 47 | searching to find non-overlapping matches: |
| 48 | |
| 49 | ``` |
| 50 | # if cfg!(miri) { return Ok(()); } // miri takes too long |
| 51 | use regex_automata::{hybrid::regex::Regex, Match}; |
| 52 | |
| 53 | let re = Regex::new_many(&[r"\w+" , r"\S+" ])?; |
| 54 | let mut cache = re.create_cache(); |
| 55 | |
| 56 | let haystack = "@foo bar" ; |
| 57 | let matches: Vec<Match> = re.find_iter(&mut cache, haystack).collect(); |
| 58 | assert_eq!(matches, vec![ |
| 59 | Match::must(1, 0..4), |
| 60 | Match::must(0, 5..8), |
| 61 | ]); |
| 62 | # Ok::<(), Box<dyn std::error::Error>>(()) |
| 63 | ``` |
| 64 | |
| 65 | # When should I use this? |
| 66 | |
| 67 | Generally speaking, if you can abide the use of mutable state during search, |
| 68 | and you don't need things like capturing groups or Unicode word boundary |
| 69 | support in non-ASCII text, then a lazy DFA is likely a robust choice with |
| 70 | respect to both search speed and memory usage. Note however that its speed |
| 71 | may be worse than a general purpose regex engine if you don't select a good |
| 72 | [prefilter](crate::util::prefilter). |
| 73 | |
| 74 | If you know ahead of time that your pattern would result in a very large DFA |
| 75 | if it was fully compiled, it may be better to use an NFA simulation instead |
| 76 | of a lazy DFA. Either that, or increase the cache capacity of your lazy DFA |
| 77 | to something that is big enough to hold the state machine (likely through |
| 78 | experimentation). The issue here is that if the cache is too small, then it |
| 79 | could wind up being reset too frequently and this might decrease searching |
| 80 | speed significantly. |
| 81 | |
| 82 | # Differences with fully compiled DFAs |
| 83 | |
| 84 | A [`hybrid::regex::Regex`](crate::hybrid::regex::Regex) and a |
| 85 | [`dfa::regex::Regex`](crate::dfa::regex::Regex) both have the same capabilities |
| 86 | (and similarly for their underlying DFAs), but they achieve them through |
| 87 | different means. The main difference is that a hybrid or "lazy" regex builds |
| 88 | its DFA lazily during search, where as a fully compiled regex will build its |
| 89 | DFA at construction time. While building a DFA at search time might sound like |
| 90 | it's slow, it tends to work out where most bytes seen during a search will |
| 91 | reuse pre-built parts of the DFA and thus can be almost as fast as a fully |
| 92 | compiled DFA. The main downside is that searching requires mutable space to |
| 93 | store the DFA, and, in the worst case, a search can result in a new state being |
| 94 | created for each byte seen, which would make searching quite a bit slower. |
| 95 | |
| 96 | A fully compiled DFA never has to worry about searches being slower once |
| 97 | it's built. (Aside from, say, the transition table being so large that it |
| 98 | is subject to harsh CPU cache effects.) However, of course, building a full |
| 99 | DFA can be quite time consuming and memory hungry. Particularly when large |
| 100 | Unicode character classes are used, which tend to translate into very large |
| 101 | DFAs. |
| 102 | |
| 103 | A lazy DFA strikes a nice balance _in practice_, particularly in the |
| 104 | presence of Unicode mode, by only building what is needed. It avoids the |
| 105 | worst case exponential time complexity of DFA compilation by guaranteeing that |
| 106 | it will only build at most one state per byte searched. While the worst |
| 107 | case here can lead to a very high constant, it will never be exponential. |
| 108 | |
| 109 | # Syntax |
| 110 | |
| 111 | This module supports the same syntax as the `regex` crate, since they share the |
| 112 | same parser. You can find an exhaustive list of supported syntax in the |
| 113 | [documentation for the `regex` crate](https://docs.rs/regex/1/regex/#syntax). |
| 114 | |
| 115 | There are two things that are not supported by the lazy DFAs in this module: |
| 116 | |
| 117 | * Capturing groups. The DFAs (and [`Regex`](regex::Regex)es built on top |
| 118 | of them) can only find the offsets of an entire match, but cannot resolve |
| 119 | the offsets of each capturing group. This is because DFAs do not have the |
| 120 | expressive power necessary. Note that it is okay to build a lazy DFA from an |
| 121 | NFA that contains capture groups. The capture groups will simply be ignored. |
| 122 | * Unicode word boundaries. These present particularly difficult challenges for |
| 123 | DFA construction and would result in an explosion in the number of states. |
| 124 | One can enable [`dfa::Config::unicode_word_boundary`] though, which provides |
| 125 | heuristic support for Unicode word boundaries that only works on ASCII text. |
| 126 | Otherwise, one can use `(?-u:\b)` for an ASCII word boundary, which will work |
| 127 | on any input. |
| 128 | |
| 129 | There are no plans to lift either of these limitations. |
| 130 | |
| 131 | Note that these restrictions are identical to the restrictions on fully |
| 132 | compiled DFAs. |
| 133 | */ |
| 134 | |
| 135 | pub use self::{ |
| 136 | error::{BuildError, CacheError, StartError}, |
| 137 | id::LazyStateID, |
| 138 | }; |
| 139 | |
| 140 | pub mod dfa; |
| 141 | mod error; |
| 142 | mod id; |
| 143 | pub mod regex; |
| 144 | mod search; |
| 145 | |