mod.rs source code [crates/regex-automata-0.4.6/src/hybrid/mod.rs]

1	/!*
2	A module for building and searching with lazy deterministic finite automata
3	(DFAs).
4
5	Like other modules in this crate, lazy DFAs support a rich regex syntax with
6	Unicode features. The key feature of a lazy DFA is that it builds itself
7	incrementally during search, and never uses more than a configured capacity of
8	memory. Thus, when searching with a lazy DFA, one must supply a mutable "cache"
9	in which the actual DFA's transition table is stored.
10
11	If you're looking for fully compiled DFAs, then please see the top-level
12	[`dfa` module](crate::dfa).
13
14	# Overview
15
16	This section gives a brief overview of the primary types in this module:
17
18	* A [`regex::Regex`] provides a way to search for matches of a regular
19	expression using lazy DFAs. This includes iterating over matches with both the
20	start and end positions of each match.
21	* A [`dfa::DFA`] provides direct low level access to a lazy DFA.
22
23	# Example: basic regex searching
24
25	This example shows how to compile a regex using the default configuration
26	and then use it to find matches in a byte string:
27
28	```
29	use regex_automata::{hybrid::regex::Regex, Match};
30
31	let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")?;
32	let mut cache = re.create_cache();
33
34	let haystack = "2018-12-24 2016-10-08";
35	let matches: Vec<Match> = re.find_iter(&mut cache, haystack).collect();
36	assert_eq!(matches, vec![
37	Match::must(`0`, `0`..`10`),
38	Match::must(`0`, `11`..`21`),
39	]);
40	# Ok::<(), Box<dyn std::error::Error>>(())
41	```
42
43	# Example: searching with multiple regexes
44
45	The lazy DFAs in this module all fully support searching with multiple regexes
46	simultaneously. You can use this support with standard leftmost-first style
47	searching to find non-overlapping matches:
48
49	```
50	# if cfg!(miri) { return Ok(()); } // miri takes too long
51	use regex_automata::{hybrid::regex::Regex, Match};
52
53	let re = Regex::new_many(&[r"\w+", r"\S+"])?;
54	let mut cache = re.create_cache();
55
56	let haystack = "@foo bar";
57	let matches: Vec<Match> = re.find_iter(&mut cache, haystack).collect();
58	assert_eq!(matches, vec![
59	Match::must(`1`, `0`..`4`),
60	Match::must(`0`, `5`..`8`),
61	]);
62	# Ok::<(), Box<dyn std::error::Error>>(())
63	```
64
65	# When should I use this?
66
67	Generally speaking, if you can abide the use of mutable state during search,
68	and you don't need things like capturing groups or Unicode word boundary
69	support in non-ASCII text, then a lazy DFA is likely a robust choice with
70	respect to both search speed and memory usage. Note however that its speed
71	may be worse than a general purpose regex engine if you don't select a good
72	[prefilter](crate::util::prefilter).
73
74	If you know ahead of time that your pattern would result in a very large DFA
75	if it was fully compiled, it may be better to use an NFA simulation instead
76	of a lazy DFA. Either that, or increase the cache capacity of your lazy DFA
77	to something that is big enough to hold the state machine (likely through
78	experimentation). The issue here is that if the cache is too small, then it
79	could wind up being reset too frequently and this might decrease searching
80	speed significantly.
81
82	# Differences with fully compiled DFAs
83
84	A [`hybrid::regex::Regex`](crate::hybrid::regex::Regex) and a
85	[`dfa::regex::Regex`](crate::dfa::regex::Regex) both have the same capabilities
86	(and similarly for their underlying DFAs), but they achieve them through
87	different means. The main difference is that a hybrid or "lazy" regex builds
88	its DFA lazily during search, where as a fully compiled regex will build its
89	DFA at construction time. While building a DFA at search time might sound like
90	it's slow, it tends to work out where most bytes seen during a search will
91	reuse pre-built parts of the DFA and thus can be almost as fast as a fully
92	compiled DFA. The main downside is that searching requires mutable space to
93	store the DFA, and, in the worst case, a search can result in a new state being
94	created for each byte seen, which would make searching quite a bit slower.
95
96	A fully compiled DFA never has to worry about searches being slower once
97	it's built. (Aside from, say, the transition table being so large that it
98	is subject to harsh CPU cache effects.) However, of course, building a full
99	DFA can be quite time consuming and memory hungry. Particularly when large
100	Unicode character classes are used, which tend to translate into very large
101	DFAs.
102
103	A lazy DFA strikes a nice balance _in practice_, particularly in the
104	presence of Unicode mode, by only building what is needed. It avoids the
105	worst case exponential time complexity of DFA compilation by guaranteeing that
106	it will only build at most one state per byte searched. While the worst
107	case here can lead to a very high constant, it will never be exponential.
108
109	# Syntax
110
111	This module supports the same syntax as the `regex` crate, since they share the
112	same parser. You can find an exhaustive list of supported syntax in the
113	[documentation for the `regex` crate](https://docs.rs/regex/1/regex/#syntax).
114
115	There are two things that are not supported by the lazy DFAs in this module:
116
117	* Capturing groups. The DFAs (and [`Regex`](regex::Regex)es built on top
118	of them) can only find the offsets of an entire match, but cannot resolve
119	the offsets of each capturing group. This is because DFAs do not have the
120	expressive power necessary. Note that it is okay to build a lazy DFA from an
121	NFA that contains capture groups. The capture groups will simply be ignored.
122	* Unicode word boundaries. These present particularly difficult challenges for
123	DFA construction and would result in an explosion in the number of states.
124	One can enable [`dfa::Config::unicode_word_boundary`] though, which provides
125	heuristic support for Unicode word boundaries that only works on ASCII text.
126	Otherwise, one can use `(?-u:\b)` for an ASCII word boundary, which will work
127	on any input.
128
129	There are no plans to lift either of these limitations.
130
131	Note that these restrictions are identical to the restrictions on fully
132	compiled DFAs.
133	*/
134
135	pub use self::{
136	error::{BuildError, CacheError, StartError},
137	id::LazyStateID,
138	};
139
140	pub mod dfa;
141	mod error;
142	mod id;
143	pub mod regex;
144	mod search;
145