1 | /*! |
2 | A module for building and searching with lazy deterministic finite automata |
3 | (DFAs). |
4 | |
5 | Like other modules in this crate, lazy DFAs support a rich regex syntax with |
6 | Unicode features. The key feature of a lazy DFA is that it builds itself |
7 | incrementally during search, and never uses more than a configured capacity of |
8 | memory. Thus, when searching with a lazy DFA, one must supply a mutable "cache" |
9 | in which the actual DFA's transition table is stored. |
10 | |
11 | If you're looking for fully compiled DFAs, then please see the top-level |
12 | [`dfa` module](crate::dfa). |
13 | |
14 | # Overview |
15 | |
16 | This section gives a brief overview of the primary types in this module: |
17 | |
18 | * A [`regex::Regex`] provides a way to search for matches of a regular |
19 | expression using lazy DFAs. This includes iterating over matches with both the |
20 | start and end positions of each match. |
21 | * A [`dfa::DFA`] provides direct low level access to a lazy DFA. |
22 | |
23 | # Example: basic regex searching |
24 | |
25 | This example shows how to compile a regex using the default configuration |
26 | and then use it to find matches in a byte string: |
27 | |
28 | ``` |
29 | use regex_automata::{hybrid::regex::Regex, Match}; |
30 | |
31 | let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" )?; |
32 | let mut cache = re.create_cache(); |
33 | |
34 | let haystack = "2018-12-24 2016-10-08" ; |
35 | let matches: Vec<Match> = re.find_iter(&mut cache, haystack).collect(); |
36 | assert_eq!(matches, vec![ |
37 | Match::must(0, 0..10), |
38 | Match::must(0, 11..21), |
39 | ]); |
40 | # Ok::<(), Box<dyn std::error::Error>>(()) |
41 | ``` |
42 | |
43 | # Example: searching with multiple regexes |
44 | |
45 | The lazy DFAs in this module all fully support searching with multiple regexes |
46 | simultaneously. You can use this support with standard leftmost-first style |
47 | searching to find non-overlapping matches: |
48 | |
49 | ``` |
50 | # if cfg!(miri) { return Ok(()); } // miri takes too long |
51 | use regex_automata::{hybrid::regex::Regex, Match}; |
52 | |
53 | let re = Regex::new_many(&[r"\w+" , r"\S+" ])?; |
54 | let mut cache = re.create_cache(); |
55 | |
56 | let haystack = "@foo bar" ; |
57 | let matches: Vec<Match> = re.find_iter(&mut cache, haystack).collect(); |
58 | assert_eq!(matches, vec![ |
59 | Match::must(1, 0..4), |
60 | Match::must(0, 5..8), |
61 | ]); |
62 | # Ok::<(), Box<dyn std::error::Error>>(()) |
63 | ``` |
64 | |
65 | # When should I use this? |
66 | |
67 | Generally speaking, if you can abide the use of mutable state during search, |
68 | and you don't need things like capturing groups or Unicode word boundary |
69 | support in non-ASCII text, then a lazy DFA is likely a robust choice with |
70 | respect to both search speed and memory usage. Note however that its speed |
71 | may be worse than a general purpose regex engine if you don't select a good |
72 | [prefilter](crate::util::prefilter). |
73 | |
74 | If you know ahead of time that your pattern would result in a very large DFA |
75 | if it was fully compiled, it may be better to use an NFA simulation instead |
76 | of a lazy DFA. Either that, or increase the cache capacity of your lazy DFA |
77 | to something that is big enough to hold the state machine (likely through |
78 | experimentation). The issue here is that if the cache is too small, then it |
79 | could wind up being reset too frequently and this might decrease searching |
80 | speed significantly. |
81 | |
82 | # Differences with fully compiled DFAs |
83 | |
84 | A [`hybrid::regex::Regex`](crate::hybrid::regex::Regex) and a |
85 | [`dfa::regex::Regex`](crate::dfa::regex::Regex) both have the same capabilities |
86 | (and similarly for their underlying DFAs), but they achieve them through |
87 | different means. The main difference is that a hybrid or "lazy" regex builds |
88 | its DFA lazily during search, where as a fully compiled regex will build its |
89 | DFA at construction time. While building a DFA at search time might sound like |
90 | it's slow, it tends to work out where most bytes seen during a search will |
91 | reuse pre-built parts of the DFA and thus can be almost as fast as a fully |
92 | compiled DFA. The main downside is that searching requires mutable space to |
93 | store the DFA, and, in the worst case, a search can result in a new state being |
94 | created for each byte seen, which would make searching quite a bit slower. |
95 | |
96 | A fully compiled DFA never has to worry about searches being slower once |
97 | it's built. (Aside from, say, the transition table being so large that it |
98 | is subject to harsh CPU cache effects.) However, of course, building a full |
99 | DFA can be quite time consuming and memory hungry. Particularly when large |
100 | Unicode character classes are used, which tend to translate into very large |
101 | DFAs. |
102 | |
103 | A lazy DFA strikes a nice balance _in practice_, particularly in the |
104 | presence of Unicode mode, by only building what is needed. It avoids the |
105 | worst case exponential time complexity of DFA compilation by guaranteeing that |
106 | it will only build at most one state per byte searched. While the worst |
107 | case here can lead to a very high constant, it will never be exponential. |
108 | |
109 | # Syntax |
110 | |
111 | This module supports the same syntax as the `regex` crate, since they share the |
112 | same parser. You can find an exhaustive list of supported syntax in the |
113 | [documentation for the `regex` crate](https://docs.rs/regex/1/regex/#syntax). |
114 | |
115 | There are two things that are not supported by the lazy DFAs in this module: |
116 | |
117 | * Capturing groups. The DFAs (and [`Regex`](regex::Regex)es built on top |
118 | of them) can only find the offsets of an entire match, but cannot resolve |
119 | the offsets of each capturing group. This is because DFAs do not have the |
120 | expressive power necessary. Note that it is okay to build a lazy DFA from an |
121 | NFA that contains capture groups. The capture groups will simply be ignored. |
122 | * Unicode word boundaries. These present particularly difficult challenges for |
123 | DFA construction and would result in an explosion in the number of states. |
124 | One can enable [`dfa::Config::unicode_word_boundary`] though, which provides |
125 | heuristic support for Unicode word boundaries that only works on ASCII text. |
126 | Otherwise, one can use `(?-u:\b)` for an ASCII word boundary, which will work |
127 | on any input. |
128 | |
129 | There are no plans to lift either of these limitations. |
130 | |
131 | Note that these restrictions are identical to the restrictions on fully |
132 | compiled DFAs. |
133 | */ |
134 | |
135 | pub use self::{ |
136 | error::{BuildError, CacheError, StartError}, |
137 | id::LazyStateID, |
138 | }; |
139 | |
140 | pub mod dfa; |
141 | mod error; |
142 | mod id; |
143 | pub mod regex; |
144 | mod search; |
145 | |