literal.rs source code [crates/regex-syntax-0.7.2/src/hir/literal.rs]

1	/!*
2	Provides literal extraction from `Hir` expressions.
3
4	An [`Extractor`] pulls literals out of [`Hir`] expressions and returns a
5	[`Seq`] of [`Literal`]s.
6
7	The purpose of literal extraction is generally to provide avenues for
8	optimizing regex searches. The main idea is that substring searches can be an
9	order of magnitude faster than a regex search. Therefore, if one can execute
10	a substring search to find candidate match locations and only run the regex
11	search at those locations, then it is possible for huge improvements in
12	performance to be realized.
13
14	With that said, literal optimizations are generally a black art because even
15	though substring search is generally faster, if the number of candidates
16	produced is high, then it can create a lot of overhead by ping-ponging between
17	the substring search and the regex search.
18
19	Here are some heuristics that might be used to help increase the chances of
20	effective literal optimizations:
21
22	* Stick to small [`Seq`]s. If you search for too many literals, it's likely
23	to lead to substring search that is only a little faster than a regex search,
24	and thus the overhead of using literal optimizations in the first place might
25	make things slower overall.
26	* The literals in your [`Seq`] shoudn't be too short. In general, longer is
27	better. A sequence corresponding to single bytes that occur frequently in the
28	haystack, for example, is probably a bad literal optimization because it's
29	likely to produce many false positive candidates. Longer literals are less
30	likely to match, and thus probably produce fewer false positives.
31	* If it's possible to estimate the approximate frequency of each byte according
32	to some pre-computed background distribution, it is possible to compute a score
33	of how "good" a `Seq` is. If a `Seq` isn't good enough, you might consider
34	skipping the literal optimization and just use the regex engine.
35
36	(It should be noted that there are always pathological cases that can make
37	any kind of literal optimization be a net slower result. This is why it
38	might be a good idea to be conservative, or to even provide a means for
39	literal optimizations to be dynamically disabled if they are determined to be
40	ineffective according to some measure.)
41
42	You're encouraged to explore the methods on [`Seq`], which permit shrinking
43	the size of sequences in a preference-order preserving fashion.
44
45	Finally, note that it isn't strictly necessary to use an [`Extractor`]. Namely,
46	an `Extractor` only uses public APIs of the [`Seq`] and [`Literal`] types,
47	so it is possible to implement your own extractor. For example, for n-grams
48	or "inner" literals (i.e., not prefix or suffix literals). The `Extractor`
49	is mostly responsible for the case analysis over `Hir` expressions. Much of
50	the "trickier" parts are how to combine literal sequences, and that is all
51	implemented on [`Seq`].
52	*/
53
54	use core::{cmp, mem};
55
56	use alloc::{vec, vec::Vec};
57
58	use crate::hir::{self, Hir};
59
60	/// Extracts prefix or suffix literal sequences from [`Hir`] expressions.
61	///
62	/// Literal extraction is based on the following observations:
63	///
64	/// Many regexes start with one or a small number of literals.*
65	/// Substring search for literals is often much faster (sometimes by an order*
66	/// of magnitude) than a regex search.
67	///
68	/// Thus, in many cases, one can search for literals to find candidate starting
69	/// locations of a match, and then only run the full regex engine at each such
70	/// location instead of over the full haystack.
71	///
72	/// The main downside of literal extraction is that it can wind up causing a
73	/// search to be slower overall. For example, if there are many matches or if
74	/// there are many candidates that don't ultimately lead to a match, then a
75	/// lot of overhead will be spent in shuffing back-and-forth between substring
76	/// search and the regex engine. This is the fundamental reason why literal
77	/// optimizations for regex patterns is sometimes considered a "black art."
78	///
79	/// # Look-around assertions
80	///
81	/// Literal extraction treats all look-around assertions as-if they match every
82	/// empty string. So for example, the regex `\bquux\b` will yield a sequence
83	/// containing a single exact literal `quux`. However, not all occurrences
84	/// of `quux` correspond to a match a of the regex. For example, `\bquux\b`
85	/// does not match `ZquuxZ` anywhere because `quux` does not fall on a word
86	/// boundary.
87	///
88	/// In effect, if your regex contains look-around assertions, then a match of
89	/// an exact literal does not necessarily mean the regex overall matches. So
90	/// you may still need to run the regex engine in such cases to confirm the
91	/// match.
92	///
93	/// The precise guarantee you get from a literal sequence is: if every literal
94	/// in the sequence is exact and the original regex contains zero look-around
95	/// assertions, then a preference-order multi-substring search of those
96	/// literals will precisely match a preference-order search of the original
97	/// regex.
98	///
99	/// # Example
100	///
101	/// This shows how to extract prefixes:
102	///
103	/// ```
104	/// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
105	///
106	/// let hir = parse(r"(a\|b\|c)(x\|y\|z)[A-Z]+foo")?;
107	/// let got = Extractor::new().extract(&hir);
108	/// // All literals returned are "inexact" because none of them reach the
109	/// // match state.
110	/// let expected = Seq::from_iter([
111	/// Literal::inexact("ax"),
112	/// Literal::inexact("ay"),
113	/// Literal::inexact("az"),
114	/// Literal::inexact("bx"),
115	/// Literal::inexact("by"),
116	/// Literal::inexact("bz"),
117	/// Literal::inexact("cx"),
118	/// Literal::inexact("cy"),
119	/// Literal::inexact("cz"),
120	/// ]);
121	/// assert_eq!(expected, got);
122	///
123	/// # Ok::<(), Box<dyn std::error::Error>>(())
124	/// ```
125	///
126	/// This shows how to extract suffixes:
127	///
128	/// ```
129	/// use regex_syntax::{
130	/// hir::literal::{Extractor, ExtractKind, Literal, Seq},
131	/// parse,
132	/// };
133	///
134	/// let hir = parse(r"foo\|[A-Z]+bar")?;
135	/// let got = Extractor::new().kind(ExtractKind::Suffix).extract(&hir);
136	/// // Since 'foo' gets to a match state, it is considered exact. But 'bar'
137	/// // does not because of the '[A-Z]+', and thus is marked inexact.
138	/// let expected = Seq::from_iter([
139	/// Literal::exact("foo"),
140	/// Literal::inexact("bar"),
141	/// ]);
142	/// assert_eq!(expected, got);
143	///
144	/// # Ok::<(), Box<dyn std::error::Error>>(())
145	/// ```
146	#[derive(Clone, Debug)]
147	pub struct Extractor {
148	kind: ExtractKind,
149	limit_class: usize,
150	limit_repeat: usize,
151	limit_literal_len: usize,
152	limit_total: usize,
153	}
154
155	impl Extractor {
156	/// Create a new extractor with a default configuration.
157	///
158	/// The extractor can be optionally configured before calling
159	/// [`Extractor::extract`] to get a literal sequence.
160	pub fn new() -> Extractor {
161	Extractor {
162	kind: ExtractKind::Prefix,
163	limit_class: `10`,
164	limit_repeat: `10`,
165	limit_literal_len: `100`,
166	limit_total: `250`,
167	}
168	}
169
170	/// Execute the extractor and return a sequence of literals.
171	pub fn extract(&self, hir: &Hir) -> Seq {
172	use crate::hir::HirKind::*;
173
174	match *hir.kind() {
175	Empty \| Look(_) => Seq::singleton(self::Literal::exact(vec![])),
176	Literal(hir::Literal(ref bytes)) => {
177	let mut seq =
178	Seq::singleton(self::Literal::exact(bytes.to_vec()));
179	self.enforce_literal_len(&mut seq);
180	seq
181	}
182	Class(hir::Class::Unicode(ref cls)) => {
183	self.extract_class_unicode(cls)
184	}
185	Class(hir::Class::Bytes(ref cls)) => self.extract_class_bytes(cls),
186	Repetition(ref rep) => self.extract_repetition(rep),
187	Capture(hir::Capture { ref sub, .. }) => self.extract(sub),
188	Concat(ref hirs) => match self.kind {
189	ExtractKind::Prefix => self.extract_concat(hirs.iter()),
190	ExtractKind::Suffix => self.extract_concat(hirs.iter().rev()),
191	},
192	Alternation(ref hirs) => {
193	// Unlike concat, we always union starting from the beginning,
194	// since the beginning corresponds to the highest preference,
195	// which doesn't change based on forwards vs reverse.
196	self.extract_alternation(hirs.iter())
197	}
198	}
199	}
200
201	/// Set the kind of literal sequence to extract from an [`Hir`] expression.
202	///
203	/// The default is to extract prefixes, but suffixes can be selected
204	/// instead. The contract for prefixes is that every match of the
205	/// corresponding `Hir` must start with one of the literals in the sequence
206	/// returned. Moreover, the _order_ of the sequence returned corresponds to
207	/// the preference order.
208	///
209	/// Suffixes satisfy a similar contract in that every match of the
210	/// corresponding `Hir` must end with one of the literals in the sequence
211	/// returned. However, there is no guarantee that the literals are in
212	/// preference order.
213	///
214	/// Remember that a sequence can be infinite. For example, unless the
215	/// limits are configured to be impractically large, attempting to extract
216	/// prefixes (or suffixes) for the pattern `[A-Z]` will return an infinite
217	/// sequence. Generally speaking, if the sequence returned is infinite,
218	/// then it is presumed to be unwise to do prefix (or suffix) optimizations
219	/// for the pattern.
220	pub fn kind(&mut self, kind: ExtractKind) -> &mut Extractor {
221	self.kind = kind;
222	self
223	}
224
225	/// Configure a limit on the length of the sequence that is permitted for
226	/// a character class. If a character class exceeds this limit, then the
227	/// sequence returned for it is infinite.
228	///
229	/// This prevents classes like `[A-Z]` or `\pL` from getting turned into
230	/// huge and likely unproductive sequences of literals.
231	///
232	/// # Example
233	///
234	/// This example shows how this limit can be lowered to decrease the tolerance
235	/// for character classes being turned into literal sequences.
236	///
237	/// ```
238	/// use regex_syntax::{hir::literal::{Extractor, Seq}, parse};
239	///
240	/// let hir = parse(r"[0-9]")?;
241	///
242	/// let got = Extractor::new().extract(&hir);
243	/// let expected = Seq::new([
244	/// "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
245	/// ]);
246	/// assert_eq!(expected, got);
247	///
248	/// // Now let's shrink the limit and see how that changes things.
249	/// let got = Extractor::new().limit_class(`4`).extract(&hir);
250	/// let expected = Seq::infinite();
251	/// assert_eq!(expected, got);
252	///
253	/// # Ok::<(), Box<dyn std::error::Error>>(())
254	/// ```
255	pub fn limit_class(&mut self, limit: usize) -> &mut Extractor {
256	self.limit_class = limit;
257	self
258	}
259
260	/// Configure a limit on the total number of repetitions that is permitted
261	/// before literal extraction is stopped.
262	///
263	/// This is useful for limiting things like `(abcde){50}`, or more
264	/// insidiously, `(?:){1000000000}`. This limit prevents any one single
265	/// repetition from adding too much to a literal sequence.
266	///
267	/// With this limit set, repetitions that exceed it will be stopped and any
268	/// literals extracted up to that point will be made inexact.
269	///
270	/// # Example
271	///
272	/// This shows how to decrease the limit and compares it with the default.
273	///
274	/// ```
275	/// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
276	///
277	/// let hir = parse(r"(abc){8}")?;
278	///
279	/// let got = Extractor::new().extract(&hir);
280	/// let expected = Seq::new(["abcabcabcabcabcabcabcabc"]);
281	/// assert_eq!(expected, got);
282	///
283	/// // Now let's shrink the limit and see how that changes things.
284	/// let got = Extractor::new().limit_repeat(`4`).extract(&hir);
285	/// let expected = Seq::from_iter([
286	/// Literal::inexact("abcabcabcabc"),
287	/// ]);
288	/// assert_eq!(expected, got);
289	///
290	/// # Ok::<(), Box<dyn std::error::Error>>(())
291	/// ```
292	pub fn limit_repeat(&mut self, limit: usize) -> &mut Extractor {
293	self.limit_repeat = limit;
294	self
295	}
296
297	/// Configure a limit on the maximum length of any literal in a sequence.
298	///
299	/// This is useful for limiting things like `(abcde){5}{5}{5}{5}`. While
300	/// each repetition or literal in that regex is small, when all the
301	/// repetitions are applied, one ends up with a literal of length `5^4 =
302	/// 625`.
303	///
304	/// With this limit set, literals that exceed it will be made inexact and
305	/// thus prevented from growing.
306	///
307	/// # Example
308	///
309	/// This shows how to decrease the limit and compares it with the default.
310	///
311	/// ```
312	/// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
313	///
314	/// let hir = parse(r"(abc){2}{2}{2}")?;
315	///
316	/// let got = Extractor::new().extract(&hir);
317	/// let expected = Seq::new(["abcabcabcabcabcabcabcabc"]);
318	/// assert_eq!(expected, got);
319	///
320	/// // Now let's shrink the limit and see how that changes things.
321	/// let got = Extractor::new().limit_literal_len(`14`).extract(&hir);
322	/// let expected = Seq::from_iter([
323	/// Literal::inexact("abcabcabcabcab"),
324	/// ]);
325	/// assert_eq!(expected, got);
326	///
327	/// # Ok::<(), Box<dyn std::error::Error>>(())
328	/// ```
329	pub fn limit_literal_len(&mut self, limit: usize) -> &mut Extractor {
330	self.limit_literal_len = limit;
331	self
332	}
333
334	/// Configure a limit on the total number of literals that will be
335	/// returned.
336	///
337	/// This is useful as a practical measure for avoiding the creation of
338	/// large sequences of literals. While the extractor will automatically
339	/// handle local creations of large sequences (for example, `[A-Z]` yields
340	/// an infinite sequence by default), large sequences can be created
341	/// through non-local means as well.
342	///
343	/// For example, `[ab]{3}{3}` would yield a sequence of length `512 = 2^9`
344	/// despite each of the repetitions being small on their own. This limit
345	/// thus represents a "catch all" for avoiding locally small sequences from
346	/// combining into large sequences.
347	///
348	/// # Example
349	///
350	/// This example shows how reducing the limit will change the literal
351	/// sequence returned.
352	///
353	/// ```
354	/// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
355	///
356	/// let hir = parse(r"[ab]{2}{2}")?;
357	///
358	/// let got = Extractor::new().extract(&hir);
359	/// let expected = Seq::new([
360	/// "aaaa", "aaab", "aaba", "aabb",
361	/// "abaa", "abab", "abba", "abbb",
362	/// "baaa", "baab", "baba", "babb",
363	/// "bbaa", "bbab", "bbba", "bbbb",
364	/// ]);
365	/// assert_eq!(expected, got);
366	///
367	/// // The default limit is not too big, but big enough to extract all
368	/// // literals from '[ab]{2}{2}'. If we shrink the limit to less than 16,
369	/// // then we'll get a truncated set. Notice that it returns a sequence of
370	/// // length 4 even though our limit was 10. This is because the sequence
371	/// // is difficult to increase without blowing the limit. Notice also
372	/// // that every literal in the sequence is now inexact because they were
373	/// // stripped of some suffix.
374	/// let got = Extractor::new().limit_total(`10`).extract(&hir);
375	/// let expected = Seq::from_iter([
376	/// Literal::inexact("aa"),
377	/// Literal::inexact("ab"),
378	/// Literal::inexact("ba"),
379	/// Literal::inexact("bb"),
380	/// ]);
381	/// assert_eq!(expected, got);
382	///
383	/// # Ok::<(), Box<dyn std::error::Error>>(())
384	/// ```
385	pub fn limit_total(&mut self, limit: usize) -> &mut Extractor {
386	self.limit_total = limit;
387	self
388	}
389
390	/// Extract a sequence from the given concatenation. Sequences from each of
391	/// the child HIR expressions are combined via cross product.
392	///
393	/// This short circuits once the cross product turns into a sequence
394	/// containing only inexact literals.
395	fn extract_concat<'a, I: Iterator<Item = &'a Hir>>(&self, it: I) -> Seq {
396	let mut seq = Seq::singleton(self::Literal::exact(vec![]));
397	for hir in it {
398	// If every element in the sequence is inexact, then a cross
399	// product will always be a no-op. Thus, there is nothing else we
400	// can add to it and can quit early. Note that this also includes
401	// infinite sequences.
402	if seq.is_inexact() {
403	break;
404	}
405	// Note that 'cross' also dispatches based on whether we're
406	// extracting prefixes or suffixes.
407	seq = self.cross(seq, &mut self.extract(hir));
408	}
409	seq
410	}
411
412	/// Extract a sequence from the given alternation.
413	///
414	/// This short circuits once the union turns into an infinite sequence.
415	fn extract_alternation<'a, I: Iterator<Item = &'a Hir>>(
416	&self,
417	it: I,
418	) -> Seq {
419	let mut seq = Seq::empty();
420	for hir in it {
421	// Once our 'seq' is infinite, every subsequent union
422	// operation on it will itself always result in an
423	// infinite sequence. Thus, it can never change and we can
424	// short-circuit.
425	if !seq.is_finite() {
426	break;
427	}
428	seq = self.union(seq, &mut self.extract(hir));
429	}
430	seq
431	}
432
433	/// Extract a sequence of literals from the given repetition. We do our
434	/// best, Some examples:
435	///
436	/// 'a' => [inexact(a), exact("")]*
437	/// 'a?' => [exact(""), inexact(a)]*
438	/// 'a+' => [inexact(a)]
439	/// 'a{3}' => [exact(aaa)]
440	/// 'a{3,5} => [inexact(aaa)]
441	///
442	/// The key here really is making sure we get the 'inexact' vs 'exact'
443	/// attributes correct on each of the literals we add. For example, the
444	/// fact that 'a' gives us an inexact 'a' and an exact empty string means*
445	/// that a regex like 'abc' will result in [inexact(ab), exact(ac)]*
446	/// literals being extracted, which might actually be a better prefilter
447	/// than just 'a'.
448	fn extract_repetition(&self, rep: &hir::Repetition) -> Seq {
449	let mut subseq = self.extract(&rep.sub);
450	match *rep {
451	hir::Repetition { min: `0`, max, greedy, .. } => {
452	// When 'max=1', we can retain exactness, since 'a?' is
453	// equivalent to 'a\|'. Similarly below, 'a??' is equivalent to
454	// '\|a'.
455	if max != Some(`1`) {
456	subseq.make_inexact();
457	}
458	let mut empty = Seq::singleton(Literal::exact(vec![]));
459	if !greedy {
460	mem::swap(&mut subseq, &mut empty);
461	}
462	self.union(subseq, &mut empty)
463	}
464	hir::Repetition { min, max: Some(max), .. } if min == max => {
465	assert!(min > `0`); // handled above
466	let limit =
467	u32::try_from(self.limit_repeat).unwrap_or(u32::MAX);
468	let mut seq = Seq::singleton(Literal::exact(vec![]));
469	for _ in `0`..cmp::min(min, limit) {
470	if seq.is_inexact() {
471	break;
472	}
473	seq = self.cross(seq, &mut subseq.clone());
474	}
475	if usize::try_from(min).is_err() \|\| min > limit {
476	seq.make_inexact();
477	}
478	seq
479	}
480	hir::Repetition { min, max: Some(max), .. } if min < max => {
481	assert!(min > `0`); // handled above
482	let limit =
483	u32::try_from(self.limit_repeat).unwrap_or(u32::MAX);
484	let mut seq = Seq::singleton(Literal::exact(vec![]));
485	for _ in `0`..cmp::min(min, limit) {
486	if seq.is_inexact() {
487	break;
488	}
489	seq = self.cross(seq, &mut subseq.clone());
490	}
491	seq.make_inexact();
492	seq
493	}
494	hir::Repetition { .. } => {
495	subseq.make_inexact();
496	subseq
497	}
498	}
499	}
500
501	/// Convert the given Unicode class into a sequence of literals if the
502	/// class is small enough. If the class is too big, return an infinite
503	/// sequence.
504	fn extract_class_unicode(&self, cls: &hir::ClassUnicode) -> Seq {
505	if self.class_over_limit_unicode(cls) {
506	return Seq::infinite();
507	}
508	let mut seq = Seq::empty();
509	for r in cls.iter() {
510	for ch in r.start()..=r.end() {
511	seq.push(Literal::from(ch));
512	}
513	}
514	self.enforce_literal_len(&mut seq);
515	seq
516	}
517
518	/// Convert the given byte class into a sequence of literals if the class
519	/// is small enough. If the class is too big, return an infinite sequence.
520	fn extract_class_bytes(&self, cls: &hir::ClassBytes) -> Seq {
521	if self.class_over_limit_bytes(cls) {
522	return Seq::infinite();
523	}
524	let mut seq = Seq::empty();
525	for r in cls.iter() {
526	for b in r.start()..=r.end() {
527	seq.push(Literal::from(b));
528	}
529	}
530	self.enforce_literal_len(&mut seq);
531	seq
532	}
533
534	/// Returns true if the given Unicode class exceeds the configured limits
535	/// on this extractor.
536	fn class_over_limit_unicode(&self, cls: &hir::ClassUnicode) -> bool {
537	let mut count = `0`;
538	for r in cls.iter() {
539	if count > self.limit_class {
540	return `true`;
541	}
542	count += r.len();
543	}
544	count > self.limit_class
545	}
546
547	/// Returns true if the given byte class exceeds the configured limits on
548	/// this extractor.
549	fn class_over_limit_bytes(&self, cls: &hir::ClassBytes) -> bool {
550	let mut count = `0`;
551	for r in cls.iter() {
552	if count > self.limit_class {
553	return `true`;
554	}
555	count += r.len();
556	}
557	count > self.limit_class
558	}
559
560	/// Compute the cross product of the two sequences if the result would be
561	/// within configured limits. Otherwise, make `seq2` infinite and cross the
562	/// infinite sequence with `seq1`.
563	fn cross(&self, mut seq1: Seq, seq2: &mut Seq) -> Seq {
564	if seq1.max_cross_len(seq2).map_or(`false`, \|len\| len > self.limit_total)
565	{
566	seq2.make_infinite();
567	}
568	if let ExtractKind::Suffix = self.kind {
569	seq1.cross_reverse(seq2);
570	} else {
571	seq1.cross_forward(seq2);
572	}
573	assert!(seq1.len().map_or(`true`, \|x\| x <= self.limit_total));
574	self.enforce_literal_len(&mut seq1);
575	seq1
576	}
577
578	/// Union the two sequences if the result would be within configured
579	/// limits. Otherwise, make `seq2` infinite and union the infinite sequence
580	/// with `seq1`.
581	fn union(&self, mut seq1: Seq, seq2: &mut Seq) -> Seq {
582	if seq1.max_union_len(seq2).map_or(`false`, \|len\| len > self.limit_total)
583	{
584	// We try to trim our literal sequences to see if we can make
585	// room for more literals. The idea is that we'd rather trim down
586	// literals already in our sequence if it means we can add a few
587	// more and retain a finite sequence. Otherwise, we'll union with
588	// an infinite sequence and that infects everything and effectively
589	// stops literal extraction in its tracks.
590	//
591	// We do we keep 4 bytes here? Well, it's a bit of an abstraction
592	// leakage. Downstream, the literals may wind up getting fed to
593	// the Teddy algorithm, which supports searching literals up to
594	// length 4. So that's why we pick that number here. Arguably this
595	// should be a tuneable parameter, but it seems a little tricky to
596	// describe. And I'm still unsure if this is the right way to go
597	// about culling literal sequences.
598	match self.kind {
599	ExtractKind::Prefix => {
600	seq1.keep_first_bytes(`4`);
601	seq2.keep_first_bytes(`4`);
602	}
603	ExtractKind::Suffix => {
604	seq1.keep_last_bytes(`4`);
605	seq2.keep_last_bytes(`4`);
606	}
607	}
608	seq1.dedup();
609	seq2.dedup();
610	if seq1
611	.max_union_len(seq2)
612	.map_or(`false`, \|len\| len > self.limit_total)
613	{
614	seq2.make_infinite();
615	}
616	}
617	seq1.union(seq2);
618	assert!(seq1.len().map_or(`true`, \|x\| x <= self.limit_total));
619	seq1
620	}
621
622	/// Applies the literal length limit to the given sequence. If none of the
623	/// literals in the sequence exceed the limit, then this is a no-op.
624	fn enforce_literal_len(&self, seq: &mut Seq) {
625	let len = self.limit_literal_len;
626	match self.kind {
627	ExtractKind::Prefix => seq.keep_first_bytes(len),
628	ExtractKind::Suffix => seq.keep_last_bytes(len),
629	}
630	}
631	}
632
633	impl Default for Extractor {
634	fn default() -> Extractor {
635	Extractor::new()
636	}
637	}
638
639	/// The kind of literals to extract from an [`Hir`] expression.
640	///
641	/// The default extraction kind is `Prefix`.
642	#[non_exhaustive]
643	#[derive(Clone, Debug)]
644	pub enum ExtractKind {
645	/// Extracts only prefix literals from a regex.
646	Prefix,
647	/// Extracts only suffix literals from a regex.
648	///
649	/// Note that the sequence returned by suffix literals currently may
650	/// not correctly represent leftmost-first or "preference" order match
651	/// semantics.
652	Suffix,
653	}
654
655	impl ExtractKind {
656	/// Returns true if this kind is the `Prefix` variant.
657	pub fn is_prefix(&self) -> bool {
658	matches!(*self, ExtractKind::Prefix)
659	}
660
661	/// Returns true if this kind is the `Suffix` variant.
662	pub fn is_suffix(&self) -> bool {
663	matches!(*self, ExtractKind::Suffix)
664	}
665	}
666
667	impl Default for ExtractKind {
668	fn default() -> ExtractKind {
669	ExtractKind::Prefix
670	}
671	}
672
673	/// A sequence of literals.
674	///
675	/// A `Seq` is very much like a set in that it represents a union of its
676	/// members. That is, it corresponds to a set of literals where at least one
677	/// must match in order for a particular [`Hir`] expression to match. (Whether
678	/// this corresponds to the entire `Hir` expression, a prefix of it or a suffix
679	/// of it depends on how the `Seq` was extracted from the `Hir`.)
680	///
681	/// It is also unlike a set in that multiple identical literals may appear,
682	/// and that the order of the literals in the `Seq` matters. For example, if
683	/// the sequence is `[sam, samwise]` and leftmost-first matching is used, then
684	/// `samwise` can never match and the sequence is equivalent to `[sam]`.
685	///
686	/// # States of a sequence
687	///
688	/// A `Seq` has a few different logical states to consider:
689	///
690	/// The sequence can represent "any" literal. When this happens, the set does*
691	/// not have a finite size. The purpose of this state is to inhibit callers
692	/// from making assumptions about what literals are required in order to match
693	/// a particular [`Hir`] expression. Generally speaking, when a set is in this
694	/// state, literal optimizations are inhibited. A good example of a regex that
695	/// will cause this sort of set to apppear is `[A-Za-z]`. The character class
696	/// is just too big (and also too narrow) to be usefully expanded into 52
697	/// different literals. (Note that the decision for when a seq should become
698	/// infinite is determined by the caller. A seq itself has no hard-coded
699	/// limits.)
700	/// The sequence can be empty, in which case, it is an affirmative statement*
701	/// that there are no literals that can match the corresponding `Hir`.
702	/// Consequently, the `Hir` never matches any input. For example, `[a&&b]`.
703	/// The sequence can be non-empty, in which case, at least one of the*
704	/// literals must match in order for the corresponding `Hir` to match.
705	///
706	/// # Example
707	///
708	/// This example shows how literal sequences can be simplified by stripping
709	/// suffixes and minimizing while maintaining preference order.
710	///
711	/// ```
712	/// use regex_syntax::hir::literal::{Literal, Seq};
713	///
714	/// let mut seq = Seq::new(&[
715	/// "farm",
716	/// "appliance",
717	/// "faraway",
718	/// "apple",
719	/// "fare",
720	/// "gap",
721	/// "applicant",
722	/// "applaud",
723	/// ]);
724	/// seq.keep_first_bytes(`3`);
725	/// seq.minimize_by_preference();
726	/// // Notice that 'far' comes before 'app', which matches the order in the
727	/// // original sequence. This guarantees that leftmost-first semantics are
728	/// // not altered by simplifying the set.
729	/// let expected = Seq::from_iter([
730	/// Literal::inexact("far"),
731	/// Literal::inexact("app"),
732	/// Literal::exact("gap"),
733	/// ]);
734	/// assert_eq!(expected, seq);
735	/// ```
736	#[derive(Clone, Eq, PartialEq)]
737	pub struct Seq {
738	/// The members of this seq.
739	///
740	/// When `None`, the seq represents all possible literals. That is, it
741	/// prevents one from making assumptions about specific literals in the
742	/// seq, and forces one to treat it as if any literal might be in the seq.
743	///
744	/// Note that `Some(vec![])` is valid and corresponds to the empty seq of
745	/// literals, i.e., a regex that can never match. For example, `[a&&b]`.
746	/// It is distinct from `Some(vec![""])`, which corresponds to the seq
747	/// containing an empty string, which matches at every position.
748	literals: Option<Vec<Literal>>,
749	}
750
751	impl Seq {
752	/// Returns an empty sequence.
753	///
754	/// An empty sequence matches zero literals, and thus corresponds to a
755	/// regex that itself can never match.
756	#[inline]
757	pub fn empty() -> Seq {
758	Seq { literals: Some(vec![]) }
759	}
760
761	/// Returns a sequence of literals without a finite size and may contain
762	/// any literal.
763	///
764	/// A sequence without finite size does not reveal anything about the
765	/// characteristics of the literals in its set. There are no fixed prefixes
766	/// or suffixes, nor are lower or upper bounds on the length of the literals
767	/// in the set known.
768	///
769	/// This is useful to represent constructs in a regex that are "too big"
770	/// to useful represent as a sequence of literals. For example, `[A-Za-z]`.
771	/// When sequences get too big, they lose their discriminating nature and
772	/// are more likely to produce false positives, which in turn makes them
773	/// less likely to speed up searches.
774	///
775	/// More pragmatically, for many regexes, enumerating all possible literals
776	/// is itself not possible or might otherwise use too many resources. So
777	/// constraining the size of sets during extraction is a practical trade
778	/// off to make.
779	#[inline]
780	pub fn infinite() -> Seq {
781	Seq { literals: None }
782	}
783
784	/// Returns a sequence containing a single literal.
785	#[inline]
786	pub fn singleton(lit: Literal) -> Seq {
787	Seq { literals: Some(vec![lit]) }
788	}
789
790	/// Returns a sequence of exact literals from the given byte strings.
791	#[inline]
792	pub fn new<I, B>(it: I) -> Seq
793	where
794	I: IntoIterator<Item = B>,
795	B: AsRef<[u8]>,
796	{
797	it.into_iter().map(\|b\| Literal::exact(b.as_ref())).collect()
798	}
799
800	/// If this is a finite sequence, return its members as a slice of
801	/// literals.
802	///
803	/// The slice returned may be empty, in which case, there are no literals
804	/// that can match this sequence.
805	#[inline]
806	pub fn literals(&self) -> Option<&[Literal]> {
807	self.literals.as_deref()
808	}
809
810	/// Push a literal to the end of this sequence.
811	///
812	/// If this sequence is not finite, then this is a no-op.
813	///
814	/// Similarly, if the most recently added item of this sequence is
815	/// equivalent to the literal given, then it is not added. This reflects
816	/// a `Seq`'s "set like" behavior, and represents a practical trade off.
817	/// Namely, there is never any need to have two adjacent and equivalent
818	/// literals in the same sequence, _and_ it is easy to detect in some
819	/// cases.
820	#[inline]
821	pub fn push(&mut self, lit: Literal) {
822	let lits = match self.literals {
823	None => return,
824	Some(ref mut lits) => lits,
825	};
826	if lits.last().map_or(`false`, \|m\| m == &lit) {
827	return;
828	}
829	lits.push(lit);
830	}
831
832	/// Make all of the literals in this sequence inexact.
833	///
834	/// This is a no-op if this sequence is not finite.
835	#[inline]
836	pub fn make_inexact(&mut self) {
837	let lits = match self.literals {
838	None => return,
839	Some(ref mut lits) => lits,
840	};
841	for lit in lits.iter_mut() {
842	lit.make_inexact();
843	}
844	}
845
846	/// Converts this sequence to an infinite sequence.
847	///
848	/// This is a no-op if the sequence is already infinite.
849	#[inline]
850	pub fn make_infinite(&mut self) {
851	self.literals = None;
852	}
853
854	/// Modify this sequence to contain the cross product between it and the
855	/// sequence given.
856	///
857	/// The cross product only considers literals in this sequence that are
858	/// exact. That is, inexact literals are not extended.
859	///
860	/// The literals are always drained from `other`, even if none are used.
861	/// This permits callers to reuse the sequence allocation elsewhere.
862	///
863	/// If this sequence is infinite, then this is a no-op, regardless of what
864	/// `other` contains (and in this case, the literals are still drained from
865	/// `other`). If `other` is infinite and this sequence is finite, then this
866	/// is a no-op, unless this sequence contains a zero-length literal. In
867	/// which case, the infiniteness of `other` infects this sequence, and this
868	/// sequence is itself made infinite.
869	///
870	/// Like [`Seq::union`], this may attempt to deduplicate literals. See
871	/// [`Seq::dedup`] for how deduplication deals with exact and inexact
872	/// literals.
873	///
874	/// # Example
875	///
876	/// This example shows basic usage and how exact and inexact literals
877	/// interact.
878	///
879	/// ```
880	/// use regex_syntax::hir::literal::{Literal, Seq};
881	///
882	/// let mut seq1 = Seq::from_iter([
883	/// Literal::exact("foo"),
884	/// Literal::inexact("bar"),
885	/// ]);
886	/// let mut seq2 = Seq::from_iter([
887	/// Literal::inexact("quux"),
888	/// Literal::exact("baz"),
889	/// ]);
890	/// seq1.cross_forward(&mut seq2);
891	///
892	/// // The literals are pulled out of seq2.
893	/// assert_eq!(Some(`0`), seq2.len());
894	///
895	/// let expected = Seq::from_iter([
896	/// Literal::inexact("fooquux"),
897	/// Literal::exact("foobaz"),
898	/// Literal::inexact("bar"),
899	/// ]);
900	/// assert_eq!(expected, seq1);
901	/// ```
902	///
903	/// This example shows the behavior of when `other` is an infinite
904	/// sequence.
905	///
906	/// ```
907	/// use regex_syntax::hir::literal::{Literal, Seq};
908	///
909	/// let mut seq1 = Seq::from_iter([
910	/// Literal::exact("foo"),
911	/// Literal::inexact("bar"),
912	/// ]);
913	/// let mut seq2 = Seq::infinite();
914	/// seq1.cross_forward(&mut seq2);
915	///
916	/// // When seq2 is infinite, cross product doesn't add anything, but
917	/// // ensures all members of seq1 are inexact.
918	/// let expected = Seq::from_iter([
919	/// Literal::inexact("foo"),
920	/// Literal::inexact("bar"),
921	/// ]);
922	/// assert_eq!(expected, seq1);
923	/// ```
924	///
925	/// This example is like the one above, but shows what happens when this
926	/// sequence contains an empty string. In this case, an infinite `other`
927	/// sequence infects this sequence (because the empty string means that
928	/// there are no finite prefixes):
929	///
930	/// ```
931	/// use regex_syntax::hir::literal::{Literal, Seq};
932	///
933	/// let mut seq1 = Seq::from_iter([
934	/// Literal::exact("foo"),
935	/// Literal::exact(""), // inexact provokes same behavior
936	/// Literal::inexact("bar"),
937	/// ]);
938	/// let mut seq2 = Seq::infinite();
939	/// seq1.cross_forward(&mut seq2);
940	///
941	/// // seq1 is now infinite!
942	/// assert!(!seq1.is_finite());
943	/// ```
944	///
945	/// This example shows the behavior of this sequence is infinite.
946	///
947	/// ```
948	/// use regex_syntax::hir::literal::{Literal, Seq};
949	///
950	/// let mut seq1 = Seq::infinite();
951	/// let mut seq2 = Seq::from_iter([
952	/// Literal::exact("foo"),
953	/// Literal::inexact("bar"),
954	/// ]);
955	/// seq1.cross_forward(&mut seq2);
956	///
957	/// // seq1 remains unchanged.
958	/// assert!(!seq1.is_finite());
959	/// // Even though the literals in seq2 weren't used, it was still drained.
960	/// assert_eq!(Some(`0`), seq2.len());
961	/// ```
962	#[inline]
963	pub fn cross_forward(&mut self, other: &mut Seq) {
964	let (lits1, lits2) = match self.cross_preamble(other) {
965	None => return,
966	Some((lits1, lits2)) => (lits1, lits2),
967	};
968	let newcap = lits1.len().saturating_mul(lits2.len());
969	for selflit in mem::replace(lits1, Vec::with_capacity(newcap)) {
970	if !selflit.is_exact() {
971	lits1.push(selflit);
972	continue;
973	}
974	for otherlit in lits2.iter() {
975	let mut newlit = Literal::exact(Vec::with_capacity(
976	selflit.len() + otherlit.len(),
977	));
978	newlit.extend(&selflit);
979	newlit.extend(&otherlit);
980	if !otherlit.is_exact() {
981	newlit.make_inexact();
982	}
983	lits1.push(newlit);
984	}
985	}
986	lits2.drain(..);
987	self.dedup();
988	}
989
990	/// Modify this sequence to contain the cross product between it and
991	/// the sequence given, where the sequences are treated as suffixes
992	/// instead of prefixes. Namely, the sequence `other` is prepended
993	/// to `self` (as opposed to `other` being appended* to `self` in*
994	/// [`Seq::cross_forward`]).
995	///
996	/// The cross product only considers literals in this sequence that are
997	/// exact. That is, inexact literals are not extended.
998	///
999	/// The literals are always drained from `other`, even if none are used.
1000	/// This permits callers to reuse the sequence allocation elsewhere.
1001	///
1002	/// If this sequence is infinite, then this is a no-op, regardless of what
1003	/// `other` contains (and in this case, the literals are still drained from
1004	/// `other`). If `other` is infinite and this sequence is finite, then this
1005	/// is a no-op, unless this sequence contains a zero-length literal. In
1006	/// which case, the infiniteness of `other` infects this sequence, and this
1007	/// sequence is itself made infinite.
1008	///
1009	/// Like [`Seq::union`], this may attempt to deduplicate literals. See
1010	/// [`Seq::dedup`] for how deduplication deals with exact and inexact
1011	/// literals.
1012	///
1013	/// # Example
1014	///
1015	/// This example shows basic usage and how exact and inexact literals
1016	/// interact.
1017	///
1018	/// ```
1019	/// use regex_syntax::hir::literal::{Literal, Seq};
1020	///
1021	/// let mut seq1 = Seq::from_iter([
1022	/// Literal::exact("foo"),
1023	/// Literal::inexact("bar"),
1024	/// ]);
1025	/// let mut seq2 = Seq::from_iter([
1026	/// Literal::inexact("quux"),
1027	/// Literal::exact("baz"),
1028	/// ]);
1029	/// seq1.cross_reverse(&mut seq2);
1030	///
1031	/// // The literals are pulled out of seq2.
1032	/// assert_eq!(Some(`0`), seq2.len());
1033	///
1034	/// let expected = Seq::from_iter([
1035	/// Literal::inexact("quuxfoo"),
1036	/// Literal::inexact("bar"),
1037	/// Literal::exact("bazfoo"),
1038	/// ]);
1039	/// assert_eq!(expected, seq1);
1040	/// ```
1041	///
1042	/// This example shows the behavior of when `other` is an infinite
1043	/// sequence.
1044	///
1045	/// ```
1046	/// use regex_syntax::hir::literal::{Literal, Seq};
1047	///
1048	/// let mut seq1 = Seq::from_iter([
1049	/// Literal::exact("foo"),
1050	/// Literal::inexact("bar"),
1051	/// ]);
1052	/// let mut seq2 = Seq::infinite();
1053	/// seq1.cross_reverse(&mut seq2);
1054	///
1055	/// // When seq2 is infinite, cross product doesn't add anything, but
1056	/// // ensures all members of seq1 are inexact.
1057	/// let expected = Seq::from_iter([
1058	/// Literal::inexact("foo"),
1059	/// Literal::inexact("bar"),
1060	/// ]);
1061	/// assert_eq!(expected, seq1);
1062	/// ```
1063	///
1064	/// This example is like the one above, but shows what happens when this
1065	/// sequence contains an empty string. In this case, an infinite `other`
1066	/// sequence infects this sequence (because the empty string means that
1067	/// there are no finite suffixes):
1068	///
1069	/// ```
1070	/// use regex_syntax::hir::literal::{Literal, Seq};
1071	///
1072	/// let mut seq1 = Seq::from_iter([
1073	/// Literal::exact("foo"),
1074	/// Literal::exact(""), // inexact provokes same behavior
1075	/// Literal::inexact("bar"),
1076	/// ]);
1077	/// let mut seq2 = Seq::infinite();
1078	/// seq1.cross_reverse(&mut seq2);
1079	///
1080	/// // seq1 is now infinite!
1081	/// assert!(!seq1.is_finite());
1082	/// ```
1083	///
1084	/// This example shows the behavior when this sequence is infinite.
1085	///
1086	/// ```
1087	/// use regex_syntax::hir::literal::{Literal, Seq};
1088	///
1089	/// let mut seq1 = Seq::infinite();
1090	/// let mut seq2 = Seq::from_iter([
1091	/// Literal::exact("foo"),
1092	/// Literal::inexact("bar"),
1093	/// ]);
1094	/// seq1.cross_reverse(&mut seq2);
1095	///
1096	/// // seq1 remains unchanged.
1097	/// assert!(!seq1.is_finite());
1098	/// // Even though the literals in seq2 weren't used, it was still drained.
1099	/// assert_eq!(Some(`0`), seq2.len());
1100	/// ```
1101	#[inline]
1102	pub fn cross_reverse(&mut self, other: &mut Seq) {
1103	let (lits1, lits2) = match self.cross_preamble(other) {
1104	None => return,
1105	Some((lits1, lits2)) => (lits1, lits2),
1106	};
1107	// We basically proceed as we do in 'cross_forward' at this point,
1108	// except that the outer loop is now 'other' and the inner loop is now
1109	// 'self'. That's because 'self' corresponds to suffixes and 'other'
1110	// corresponds to the sequence we want to prepend* to the suffixes.*
1111	let newcap = lits1.len().saturating_mul(lits2.len());
1112	let selflits = mem::replace(lits1, Vec::with_capacity(newcap));
1113	for (i, otherlit) in lits2.drain(..).enumerate() {
1114	for selflit in selflits.iter() {
1115	if !selflit.is_exact() {
1116	// If the suffix isn't exact, then we can't prepend
1117	// anything to it. However, we still want to keep it. But
1118	// we only want to keep one of them, to avoid duplication.
1119	// (The duplication is okay from a correctness perspective,
1120	// but wasteful.)
1121	if i == `0` {
1122	lits1.push(selflit.clone());
1123	}
1124	continue;
1125	}
1126	let mut newlit = Literal::exact(Vec::with_capacity(
1127	otherlit.len() + selflit.len(),
1128	));
1129	newlit.extend(&otherlit);
1130	newlit.extend(&selflit);
1131	if !otherlit.is_exact() {
1132	newlit.make_inexact();
1133	}
1134	lits1.push(newlit);
1135	}
1136	}
1137	self.dedup();
1138	}
1139
1140	/// A helper function the corresponds to the subtle preamble for both
1141	/// `cross_forward` and `cross_reverse`. In effect, it handles the cases
1142	/// of infinite sequences for both `self` and `other`, as well as ensuring
1143	/// that literals from `other` are drained even if they aren't used.
1144	fn cross_preamble<'a>(
1145	&'a mut self,
1146	other: &'a mut Seq,
1147	) -> Option<(&'a mut Vec<Literal>, &'a mut Vec<Literal>)> {
1148	let lits2 = match other.literals {
1149	None => {
1150	// If our current seq contains the empty string and the seq
1151	// we're adding matches any literal, then it follows that the
1152	// current seq must now also match any literal.
1153	//
1154	// Otherwise, we just have to make sure everything in this
1155	// sequence is inexact.
1156	if self.min_literal_len() == Some(`0`) {
1157	*self = Seq::infinite();
1158	} else {
1159	self.make_inexact();
1160	}
1161	return None;
1162	}
1163	Some(ref mut lits) => lits,
1164	};
1165	let lits1 = match self.literals {
1166	None => {
1167	// If we aren't going to make it to the end of this routine
1168	// where lits2 is drained, then we need to do it now.
1169	lits2.drain(..);
1170	return None;
1171	}
1172	Some(ref mut lits) => lits,
1173	};
1174	Some((lits1, lits2))
1175	}
1176
1177	/// Unions the `other` sequence into this one.
1178	///
1179	/// The literals are always drained out of the given `other` sequence,
1180	/// even if they are being unioned into an infinite sequence. This permits
1181	/// the caller to reuse the `other` sequence in another context.
1182	///
1183	/// Some literal deduping may be performed. If any deduping happens,
1184	/// any leftmost-first or "preference" order match semantics will be
1185	/// preserved.
1186	///
1187	/// # Example
1188	///
1189	/// This example shows basic usage.
1190	///
1191	/// ```
1192	/// use regex_syntax::hir::literal::Seq;
1193	///
1194	/// let mut seq1 = Seq::new(&["foo", "bar"]);
1195	/// let mut seq2 = Seq::new(&["bar", "quux", "foo"]);
1196	/// seq1.union(&mut seq2);
1197	///
1198	/// // The literals are pulled out of seq2.
1199	/// assert_eq!(Some(`0`), seq2.len());
1200	///
1201	/// // Adjacent literals are deduped, but non-adjacent literals may not be.
1202	/// assert_eq!(Seq::new(&["foo", "bar", "quux", "foo"]), seq1);
1203	/// ```
1204	///
1205	/// This example shows that literals are drained from `other` even when
1206	/// they aren't necessarily used.
1207	///
1208	/// ```
1209	/// use regex_syntax::hir::literal::Seq;
1210	///
1211	/// let mut seq1 = Seq::infinite();
1212	/// // Infinite sequences have no finite length.
1213	/// assert_eq!(None, seq1.len());
1214	///
1215	/// let mut seq2 = Seq::new(&["bar", "quux", "foo"]);
1216	/// seq1.union(&mut seq2);
1217	///
1218	/// // seq1 is still infinite and seq2 has been drained.
1219	/// assert_eq!(None, seq1.len());
1220	/// assert_eq!(Some(`0`), seq2.len());
1221	/// ```
1222	#[inline]
1223	pub fn union(&mut self, other: &mut Seq) {
1224	let lits2 = match other.literals {
1225	None => {
1226	// Unioning with an infinite sequence always results in an
1227	// infinite sequence.
1228	self.make_infinite();
1229	return;
1230	}
1231	Some(ref mut lits) => lits.drain(..),
1232	};
1233	let lits1 = match self.literals {
1234	None => return,
1235	Some(ref mut lits) => lits,
1236	};
1237	lits1.extend(lits2);
1238	self.dedup();
1239	}
1240
1241	/// Unions the `other` sequence into this one by splice the `other`
1242	/// sequence at the position of the first zero-length literal.
1243	///
1244	/// This is useful for preserving preference order semantics when combining
1245	/// two literal sequences. For example, in the regex `(a\|\|f)+foo`, the
1246	/// correct preference order prefix sequence is `[a, foo, f]`.
1247	///
1248	/// The literals are always drained out of the given `other` sequence,
1249	/// even if they are being unioned into an infinite sequence. This permits
1250	/// the caller to reuse the `other` sequence in another context. Note that
1251	/// the literals are drained even if no union is performed as well, i.e.,
1252	/// when this sequence does not contain a zero-length literal.
1253	///
1254	/// Some literal deduping may be performed. If any deduping happens,
1255	/// any leftmost-first or "preference" order match semantics will be
1256	/// preserved.
1257	///
1258	/// # Example
1259	///
1260	/// This example shows basic usage.
1261	///
1262	/// ```
1263	/// use regex_syntax::hir::literal::Seq;
1264	///
1265	/// let mut seq1 = Seq::new(&["a", "", "f", ""]);
1266	/// let mut seq2 = Seq::new(&["foo"]);
1267	/// seq1.union_into_empty(&mut seq2);
1268	///
1269	/// // The literals are pulled out of seq2.
1270	/// assert_eq!(Some(`0`), seq2.len());
1271	/// // 'foo' gets spliced into seq1 where the first empty string occurs.
1272	/// assert_eq!(Seq::new(&["a", "foo", "f"]), seq1);
1273	/// ```
1274	///
1275	/// This example shows that literals are drained from `other` even when
1276	/// they aren't necessarily used.
1277	///
1278	/// ```
1279	/// use regex_syntax::hir::literal::Seq;
1280	///
1281	/// let mut seq1 = Seq::new(&["foo", "bar"]);
1282	/// let mut seq2 = Seq::new(&["bar", "quux", "foo"]);
1283	/// seq1.union_into_empty(&mut seq2);
1284	///
1285	/// // seq1 has no zero length literals, so no splicing happens.
1286	/// assert_eq!(Seq::new(&["foo", "bar"]), seq1);
1287	/// // Even though no splicing happens, seq2 is still drained.
1288	/// assert_eq!(Some(`0`), seq2.len());
1289	/// ```
1290	#[inline]
1291	pub fn union_into_empty(&mut self, other: &mut Seq) {
1292	let lits2 = other.literals.as_mut().map(\|lits\| lits.drain(..));
1293	let lits1 = match self.literals {
1294	None => return,
1295	Some(ref mut lits) => lits,
1296	};
1297	let first_empty = match lits1.iter().position(\|m\| m.is_empty()) {
1298	None => return,
1299	Some(i) => i,
1300	};
1301	let lits2 = match lits2 {
1302	None => {
1303	// Note that we are only here if we've found an empty literal,
1304	// which implies that an infinite sequence infects this seq and
1305	// also turns it into an infinite sequence.
1306	self.literals = None;
1307	return;
1308	}
1309	Some(lits) => lits,
1310	};
1311	// Clearing out the empties needs to come before the splice because
1312	// the splice might add more empties that we don't want to get rid
1313	// of. Since we're splicing into the position of the first empty, the
1314	// 'first_empty' position computed above is still correct.
1315	lits1.retain(\|m\| !m.is_empty());
1316	lits1.splice(first_empty..first_empty, lits2);
1317	self.dedup();
1318	}
1319
1320	/// Deduplicate adjacent equivalent literals in this sequence.
1321	///
1322	/// If adjacent literals are equivalent strings but one is exact and the
1323	/// other inexact, the inexact literal is kept and the exact one is
1324	/// removed.
1325	///
1326	/// Deduping an infinite sequence is a no-op.
1327	///
1328	/// # Example
1329	///
1330	/// This example shows how literals that are duplicate byte strings but
1331	/// are not equivalent with respect to exactness are resolved.
1332	///
1333	/// ```
1334	/// use regex_syntax::hir::literal::{Literal, Seq};
1335	///
1336	/// let mut seq = Seq::from_iter([
1337	/// Literal::exact("foo"),
1338	/// Literal::inexact("foo"),
1339	/// ]);
1340	/// seq.dedup();
1341	///
1342	/// assert_eq!(Seq::from_iter([Literal::inexact("foo")]), seq);
1343	/// ```
1344	#[inline]
1345	pub fn dedup(&mut self) {
1346	if let Some(ref mut lits) = self.literals {
1347	lits.dedup_by(\|lit1, lit2\| {
1348	if lit1.as_bytes() != lit2.as_bytes() {
1349	return `false`;
1350	}
1351	if lit1.is_exact() != lit2.is_exact() {
1352	lit1.make_inexact();
1353	lit2.make_inexact();
1354	}
1355	`true`
1356	});
1357	}
1358	}
1359
1360	/// Sorts this sequence of literals lexicographically.
1361	///
1362	/// Note that if, before sorting, if a literal that is a prefix of another
1363	/// literal appears after it, then after sorting, the sequence will not
1364	/// represent the same preference order match semantics. For example,
1365	/// sorting the sequence `[samwise, sam]` yields the sequence `[sam,
1366	/// samwise]`. Under preference order semantics, the latter sequence will
1367	/// never match `samwise` where as the first sequence can.
1368	///
1369	/// # Example
1370	///
1371	/// This example shows basic usage.
1372	///
1373	/// ```
1374	/// use regex_syntax::hir::literal::Seq;
1375	///
1376	/// let mut seq = Seq::new(&["foo", "quux", "bar"]);
1377	/// seq.sort();
1378	///
1379	/// assert_eq!(Seq::new(&["bar", "foo", "quux"]), seq);
1380	/// ```
1381	#[inline]
1382	pub fn sort(&mut self) {
1383	if let Some(ref mut lits) = self.literals {
1384	lits.sort();
1385	}
1386	}
1387
1388	/// Reverses all of the literals in this sequence.
1389	///
1390	/// The order of the sequence itself is preserved.
1391	///
1392	/// # Example
1393	///
1394	/// This example shows basic usage.
1395	///
1396	/// ```
1397	/// use regex_syntax::hir::literal::Seq;
1398	///
1399	/// let mut seq = Seq::new(&["oof", "rab"]);
1400	/// seq.reverse_literals();
1401	/// assert_eq!(Seq::new(&["foo", "bar"]), seq);
1402	/// ```
1403	#[inline]
1404	pub fn reverse_literals(&mut self) {
1405	if let Some(ref mut lits) = self.literals {
1406	for lit in lits.iter_mut() {
1407	lit.reverse();
1408	}
1409	}
1410	}
1411
1412	/// Shrinks this seq to its minimal size while respecting the preference
1413	/// order of its literals.
1414	///
1415	/// While this routine will remove duplicate literals from this seq, it
1416	/// will also remove literals that can never match in a leftmost-first or
1417	/// "preference order" search. Similar to [`Seq::dedup`], if a literal is
1418	/// deduped, then the one that remains is made inexact.
1419	///
1420	/// This is a no-op on seqs that are empty or not finite.
1421	///
1422	/// # Example
1423	///
1424	/// This example shows the difference between `{sam, samwise}` and
1425	/// `{samwise, sam}`.
1426	///
1427	/// ```
1428	/// use regex_syntax::hir::literal::{Literal, Seq};
1429	///
1430	/// // If 'sam' comes before 'samwise' and a preference order search is
1431	/// // executed, then 'samwise' can never match.
1432	/// let mut seq = Seq::new(&["sam", "samwise"]);
1433	/// seq.minimize_by_preference();
1434	/// assert_eq!(Seq::from_iter([Literal::inexact("sam")]), seq);
1435	///
1436	/// // But if they are reversed, then it's possible for 'samwise' to match
1437	/// // since it is given higher preference.
1438	/// let mut seq = Seq::new(&["samwise", "sam"]);
1439	/// seq.minimize_by_preference();
1440	/// assert_eq!(Seq::new(&["samwise", "sam"]), seq);
1441	/// ```
1442	///
1443	/// This example shows that if an empty string is in this seq, then
1444	/// anything that comes after it can never match.
1445	///
1446	/// ```
1447	/// use regex_syntax::hir::literal::{Literal, Seq};
1448	///
1449	/// // An empty string is a prefix of all strings, so it automatically
1450	/// // inhibits any subsequent strings from matching.
1451	/// let mut seq = Seq::new(&["foo", "bar", "", "quux", "fox"]);
1452	/// seq.minimize_by_preference();
1453	/// let expected = Seq::from_iter([
1454	/// Literal::exact("foo"),
1455	/// Literal::exact("bar"),
1456	/// Literal::inexact(""),
1457	/// ]);
1458	/// assert_eq!(expected, seq);
1459	///
1460	/// // And of course, if it's at the beginning, then it makes it impossible
1461	/// // for anything else to match.
1462	/// let mut seq = Seq::new(&["", "foo", "quux", "fox"]);
1463	/// seq.minimize_by_preference();
1464	/// assert_eq!(Seq::from_iter([Literal::inexact("")]), seq);
1465	/// ```
1466	#[inline]
1467	pub fn minimize_by_preference(&mut self) {
1468	if let Some(ref mut lits) = self.literals {
1469	PreferenceTrie::minimize(lits, `false`);
1470	}
1471	}
1472
1473	/// Trims all literals in this seq such that only the first `len` bytes
1474	/// remain. If a literal has less than or equal to `len` bytes, then it
1475	/// remains unchanged. Otherwise, it is trimmed and made inexact.
1476	///
1477	/// # Example
1478	///
1479	/// ```
1480	/// use regex_syntax::hir::literal::{Literal, Seq};
1481	///
1482	/// let mut seq = Seq::new(&["a", "foo", "quux"]);
1483	/// seq.keep_first_bytes(`2`);
1484	///
1485	/// let expected = Seq::from_iter([
1486	/// Literal::exact("a"),
1487	/// Literal::inexact("fo"),
1488	/// Literal::inexact("qu"),
1489	/// ]);
1490	/// assert_eq!(expected, seq);
1491	/// ```
1492	#[inline]
1493	pub fn keep_first_bytes(&mut self, len: usize) {
1494	if let Some(ref mut lits) = self.literals {
1495	for m in lits.iter_mut() {
1496	m.keep_first_bytes(len);
1497	}
1498	}
1499	}
1500
1501	/// Trims all literals in this seq such that only the last `len` bytes
1502	/// remain. If a literal has less than or equal to `len` bytes, then it
1503	/// remains unchanged. Otherwise, it is trimmed and made inexact.
1504	///
1505	/// # Example
1506	///
1507	/// ```
1508	/// use regex_syntax::hir::literal::{Literal, Seq};
1509	///
1510	/// let mut seq = Seq::new(&["a", "foo", "quux"]);
1511	/// seq.keep_last_bytes(`2`);
1512	///
1513	/// let expected = Seq::from_iter([
1514	/// Literal::exact("a"),
1515	/// Literal::inexact("oo"),
1516	/// Literal::inexact("ux"),
1517	/// ]);
1518	/// assert_eq!(expected, seq);
1519	/// ```
1520	#[inline]
1521	pub fn keep_last_bytes(&mut self, len: usize) {
1522	if let Some(ref mut lits) = self.literals {
1523	for m in lits.iter_mut() {
1524	m.keep_last_bytes(len);
1525	}
1526	}
1527	}
1528
1529	/// Returns true if this sequence is finite.
1530	///
1531	/// When false, this sequence is infinite and must be treated as if it
1532	/// contains every possible literal.
1533	#[inline]
1534	pub fn is_finite(&self) -> bool {
1535	self.literals.is_some()
1536	}
1537
1538	/// Returns true if and only if this sequence is finite and empty.
1539	///
1540	/// An empty sequence never matches anything. It can only be produced by
1541	/// literal extraction when the corresponding regex itself cannot match.
1542	#[inline]
1543	pub fn is_empty(&self) -> bool {
1544	self.len() == Some(`0`)
1545	}
1546
1547	/// Returns the number of literals in this sequence if the sequence is
1548	/// finite. If the sequence is infinite, then `None` is returned.
1549	#[inline]
1550	pub fn len(&self) -> Option<usize> {
1551	self.literals.as_ref().map(\|lits\| lits.len())
1552	}
1553
1554	/// Returns true if and only if all literals in this sequence are exact.
1555	///
1556	/// This returns false if the sequence is infinite.
1557	#[inline]
1558	pub fn is_exact(&self) -> bool {
1559	self.literals().map_or(`false`, \|lits\| lits.iter().all(\|x\| x.is_exact()))
1560	}
1561
1562	/// Returns true if and only if all literals in this sequence are inexact.
1563	///
1564	/// This returns true if the sequence is infinite.
1565	#[inline]
1566	pub fn is_inexact(&self) -> bool {
1567	self.literals().map_or(`true`, \|lits\| lits.iter().all(\|x\| !x.is_exact()))
1568	}
1569
1570	/// Return the maximum length of the sequence that would result from
1571	/// unioning `self` with `other`. If either set is infinite, then this
1572	/// returns `None`.
1573	#[inline]
1574	fn max_union_len(&self, other: &Seq) -> Option<usize> {
1575	let len1 = self.len()?;
1576	let len2 = other.len()?;
1577	Some(len1.saturating_add(len2))
1578	}
1579
1580	/// Return the maximum length of the sequence that would result from the
1581	/// cross product of `self` with `other`. If either set is infinite, then
1582	/// this returns `None`.
1583	#[inline]
1584	fn max_cross_len(&self, other: &Seq) -> Option<usize> {
1585	let len1 = self.len()?;
1586	let len2 = other.len()?;
1587	Some(len1.saturating_mul(len2))
1588	}
1589
1590	/// Returns the length of the shortest literal in this sequence.
1591	///
1592	/// If the sequence is infinite or empty, then this returns `None`.
1593	#[inline]
1594	pub fn min_literal_len(&self) -> Option<usize> {
1595	self.literals.as_ref()?.iter().map(\|x\| x.len()).min()
1596	}
1597
1598	/// Returns the length of the longest literal in this sequence.
1599	///
1600	/// If the sequence is infinite or empty, then this returns `None`.
1601	#[inline]
1602	pub fn max_literal_len(&self) -> Option<usize> {
1603	self.literals.as_ref()?.iter().map(\|x\| x.len()).max()
1604	}
1605
1606	/// Returns the longest common prefix from this seq.
1607	///
1608	/// If the seq matches any literal or other contains no literals, then
1609	/// there is no meaningful prefix and this returns `None`.
1610	///
1611	/// # Example
1612	///
1613	/// This shows some example seqs and their longest common prefix.
1614	///
1615	/// ```
1616	/// use regex_syntax::hir::literal::Seq;
1617	///
1618	/// let seq = Seq::new(&["foo", "foobar", "fo"]);
1619	/// assert_eq!(Some(&b"fo"[..]), seq.longest_common_prefix());
1620	/// let seq = Seq::new(&["foo", "foo"]);
1621	/// assert_eq!(Some(&b"foo"[..]), seq.longest_common_prefix());
1622	/// let seq = Seq::new(&["foo", "bar"]);
1623	/// assert_eq!(Some(&b""[..]), seq.longest_common_prefix());
1624	/// let seq = Seq::new(&[""]);
1625	/// assert_eq!(Some(&b""[..]), seq.longest_common_prefix());
1626	///
1627	/// let seq = Seq::infinite();
1628	/// assert_eq!(None, seq.longest_common_prefix());
1629	/// let seq = Seq::empty();
1630	/// assert_eq!(None, seq.longest_common_prefix());
1631	/// ```
1632	#[inline]
1633	pub fn longest_common_prefix(&self) -> Option<&[u8]> {
1634	// If we match everything or match nothing, then there's no meaningful
1635	// longest common prefix.
1636	let lits = match self.literals {
1637	None => return None,
1638	Some(ref lits) => lits,
1639	};
1640	if lits.len() == `0` {
1641	return None;
1642	}
1643	let base = lits[`0`].as_bytes();
1644	let mut len = base.len();
1645	for m in lits.iter().skip(`1`) {
1646	len = m
1647	.as_bytes()
1648	.iter()
1649	.zip(base[..len].iter())
1650	.take_while(\|&(a, b)\| a == b)
1651	.count();
1652	if len == `0` {
1653	return Some(&[]);
1654	}
1655	}
1656	Some(&base[..len])
1657	}
1658
1659	/// Returns the longest common suffix from this seq.
1660	///
1661	/// If the seq matches any literal or other contains no literals, then
1662	/// there is no meaningful suffix and this returns `None`.
1663	///
1664	/// # Example
1665	///
1666	/// This shows some example seqs and their longest common suffix.
1667	///
1668	/// ```
1669	/// use regex_syntax::hir::literal::Seq;
1670	///
1671	/// let seq = Seq::new(&["oof", "raboof", "of"]);
1672	/// assert_eq!(Some(&b"of"[..]), seq.longest_common_suffix());
1673	/// let seq = Seq::new(&["foo", "foo"]);
1674	/// assert_eq!(Some(&b"foo"[..]), seq.longest_common_suffix());
1675	/// let seq = Seq::new(&["foo", "bar"]);
1676	/// assert_eq!(Some(&b""[..]), seq.longest_common_suffix());
1677	/// let seq = Seq::new(&[""]);
1678	/// assert_eq!(Some(&b""[..]), seq.longest_common_suffix());
1679	///
1680	/// let seq = Seq::infinite();
1681	/// assert_eq!(None, seq.longest_common_suffix());
1682	/// let seq = Seq::empty();
1683	/// assert_eq!(None, seq.longest_common_suffix());
1684	/// ```
1685	#[inline]
1686	pub fn longest_common_suffix(&self) -> Option<&[u8]> {
1687	// If we match everything or match nothing, then there's no meaningful
1688	// longest common suffix.
1689	let lits = match self.literals {
1690	None => return None,
1691	Some(ref lits) => lits,
1692	};
1693	if lits.len() == `0` {
1694	return None;
1695	}
1696	let base = lits[`0`].as_bytes();
1697	let mut len = base.len();
1698	for m in lits.iter().skip(`1`) {
1699	len = m
1700	.as_bytes()
1701	.iter()
1702	.rev()
1703	.zip(base[base.len() - len..].iter().rev())
1704	.take_while(\|&(a, b)\| a == b)
1705	.count();
1706	if len == `0` {
1707	return Some(&[]);
1708	}
1709	}
1710	Some(&base[base.len() - len..])
1711	}
1712
1713	/// Optimizes this seq while treating its literals as prefixes and
1714	/// respecting the preference order of its literals.
1715	///
1716	/// The specific way "optimization" works is meant to be an implementation
1717	/// detail, as it essentially represents a set of heuristics. The goal
1718	/// that optimization tries to accomplish is to make the literals in this
1719	/// set reflect inputs that will result in a more effective prefilter.
1720	/// Principally by reducing the false positive rate of candidates found by
1721	/// the literals in this sequence. That is, when a match of a literal is
1722	/// found, we would like it to be a strong predictor of the overall match
1723	/// of the regex. If it isn't, then much time will be spent starting and
1724	/// stopping the prefilter search and attempting to confirm the match only
1725	/// to have it fail.
1726	///
1727	/// Some of those heuristics might be:
1728	///
1729	/// Identifying a common prefix from a larger sequence of literals, and*
1730	/// shrinking the sequence down to that single common prefix.
1731	/// Rejecting the sequence entirely if it is believed to result in very*
1732	/// high false positive rate. When this happens, the sequence is made
1733	/// infinite.
1734	/// Shrinking the sequence to a smaller number of literals representing*
1735	/// prefixes, but not shrinking it so much as to make literals too short.
1736	/// (A sequence with very short literals, of 1 or 2 bytes, will typically
1737	/// result in a higher false positive rate.)
1738	///
1739	/// Optimization should only be run once extraction is complete. Namely,
1740	/// optimization may make assumptions that do not compose with other
1741	/// operations in the middle of extraction. For example, optimization will
1742	/// reduce `[E(sam), E(samwise)]` to `[E(sam)]`, but such a transformation
1743	/// is only valid if no other extraction will occur. If other extraction
1744	/// may occur, then the correct transformation would be to `[I(sam)]`.
1745	///
1746	/// The [`Seq::optimize_for_suffix_by_preference`] does the same thing, but
1747	/// for suffixes.
1748	///
1749	/// # Example
1750	///
1751	/// This shows how optimization might transform a sequence. Note that
1752	/// the specific behavior is not a documented guarantee. The heuristics
1753	/// used are an implementation detail and may change over time in semver
1754	/// compatible releases.
1755	///
1756	/// ```
1757	/// use regex_syntax::hir::literal::{Seq, Literal};
1758	///
1759	/// let mut seq = Seq::new(&[
1760	/// "samantha",
1761	/// "sam",
1762	/// "samwise",
1763	/// "frodo",
1764	/// ]);
1765	/// seq.optimize_for_prefix_by_preference();
1766	/// assert_eq!(Seq::from_iter([
1767	/// Literal::exact("samantha"),
1768	/// // Kept exact even though 'samwise' got pruned
1769	/// // because optimization assumes literal extraction
1770	/// // has finished.
1771	/// Literal::exact("sam"),
1772	/// Literal::exact("frodo"),
1773	/// ]), seq);
1774	/// ```
1775	///
1776	/// # Example: optimization may make the sequence infinite
1777	///
1778	/// If the heuristics deem that the sequence could cause a very high false
1779	/// positive rate, then it may make the sequence infinite, effectively
1780	/// disabling its use as a prefilter.
1781	///
1782	/// ```
1783	/// use regex_syntax::hir::literal::{Seq, Literal};
1784	///
1785	/// let mut seq = Seq::new(&[
1786	/// "samantha",
1787	/// // An empty string matches at every position,
1788	/// // thus rendering the prefilter completely
1789	/// // ineffective.
1790	/// "",
1791	/// "sam",
1792	/// "samwise",
1793	/// "frodo",
1794	/// ]);
1795	/// seq.optimize_for_prefix_by_preference();
1796	/// assert!(!seq.is_finite());
1797	/// ```
1798	///
1799	/// Do note that just because there is a `" "` in the sequence, that
1800	/// doesn't mean the sequence will always be made infinite after it is
1801	/// optimized. Namely, if the sequence is considered exact (any match
1802	/// corresponds to an overall match of the original regex), then any match
1803	/// is an overall match, and so the false positive rate is always `0`.
1804	///
1805	/// To demonstrate this, we remove `samwise` from our sequence. This
1806	/// results in no optimization happening and all literals remain exact.
1807	/// Thus the entire sequence is exact, and it is kept as-is, even though
1808	/// one is an ASCII space:
1809	///
1810	/// ```
1811	/// use regex_syntax::hir::literal::{Seq, Literal};
1812	///
1813	/// let mut seq = Seq::new(&[
1814	/// "samantha",
1815	/// " ",
1816	/// "sam",
1817	/// "frodo",
1818	/// ]);
1819	/// seq.optimize_for_prefix_by_preference();
1820	/// assert!(seq.is_finite());
1821	/// ```
1822	#[inline]
1823	pub fn optimize_for_prefix_by_preference(&mut self) {
1824	self.optimize_by_preference(`true`);
1825	}
1826
1827	/// Optimizes this seq while treating its literals as suffixes and
1828	/// respecting the preference order of its literals.
1829	///
1830	/// Optimization should only be run once extraction is complete.
1831	///
1832	/// The [`Seq::optimize_for_prefix_by_preference`] does the same thing, but
1833	/// for prefixes. See its documentation for more explanation.
1834	#[inline]
1835	pub fn optimize_for_suffix_by_preference(&mut self) {
1836	self.optimize_by_preference(`false`);
1837	}
1838
1839	fn optimize_by_preference(&mut self, prefix: bool) {
1840	let origlen = match self.len() {
1841	None => return,
1842	Some(len) => len,
1843	};
1844	// Make sure we start with the smallest sequence possible. We use a
1845	// special version of preference minimization that retains exactness.
1846	// This is legal because optimization is only expected to occur once
1847	// extraction is complete.
1848	if prefix {
1849	if let Some(ref mut lits) = self.literals {
1850	PreferenceTrie::minimize(lits, `true`);
1851	}
1852	}
1853
1854	// Look for a common prefix (or suffix). If we found one of those and
1855	// it's long enough, then it's a good bet that it will be our fastest
1856	// possible prefilter since single-substring search is so fast.
1857	let fix = if prefix {
1858	self.longest_common_prefix()
1859	} else {
1860	self.longest_common_suffix()
1861	};
1862	if let Some(fix) = fix {
1863	// As a special case, if we have a common prefix and the leading
1864	// byte of that prefix is one that we think probably occurs rarely,
1865	// then strip everything down to just that single byte. This should
1866	// promote the use of memchr.
1867	//
1868	// ... we only do this though if our sequence has more than one
1869	// literal. Otherwise, we'd rather just stick with a single literal
1870	// scan. That is, using memchr is probably better than looking
1871	// for 2 or more literals, but probably not as good as a straight
1872	// memmem search.
1873	//
1874	// ... and also only do this when the prefix is short and probably
1875	// not too discriminatory anyway. If it's longer, then it's
1876	// probably quite discriminatory and thus is likely to have a low
1877	// false positive rate.
1878	if prefix
1879	&& origlen > `1`
1880	&& fix.len() >= `1`
1881	&& fix.len() <= `3`
1882	&& rank(fix[`0`]) < `200`
1883	{
1884	self.keep_first_bytes(`1`);
1885	self.dedup();
1886	return;
1887	}
1888	// We only strip down to the common prefix/suffix if we think
1889	// the existing set of literals isn't great, or if the common
1890	// prefix/suffix is expected to be particularly discriminatory.
1891	let isfast =
1892	self.is_exact() && self.len().map_or(`false`, \|len\| len <= `16`);
1893	let usefix = fix.len() > `4` \|\| (fix.len() > `1` && !isfast);
1894	if usefix {
1895	// If we keep exactly the number of bytes equal to the length
1896	// of the prefix (or suffix), then by the definition of a
1897	// prefix, every literal in the sequence will be equivalent.
1898	// Thus, 'dedup' will leave us with one literal.
1899	//
1900	// We do it this way to avoid an alloc, but also to make sure
1901	// the exactness of literals is kept (or not).
1902	if prefix {
1903	self.keep_first_bytes(fix.len());
1904	} else {
1905	self.keep_last_bytes(fix.len());
1906	}
1907	self.dedup();
1908	assert_eq!(Some(`1`), self.len());
1909	// We still fall through here. In particular, we want our
1910	// longest common prefix to be subject to the poison check.
1911	}
1912	}
1913	// Everything below this check is more-or-less about trying to
1914	// heuristically reduce the false positive rate of a prefilter. But
1915	// if our sequence is completely exact, then it's possible the regex
1916	// engine can be skipped entirely. In this case, the false positive
1917	// rate is zero because every literal match corresponds to a regex
1918	// match.
1919	//
1920	// This is OK even if the sequence contains a poison literal. Remember,
1921	// a literal is only poisononous because of what we assume about its
1922	// impact on the false positive rate. However, we do still check for
1923	// an empty string. Empty strings are weird and it's best to let the
1924	// regex engine handle those.
1925	//
1926	// We do currently do this check after the longest common prefix (or
1927	// suffix) check, under the theory that single-substring search is so
1928	// fast that we want that even if we'd end up turning an exact sequence
1929	// into an inexact one. But this might be wrong...
1930	if self.is_exact()
1931	&& self.min_literal_len().map_or(`false`, \|len\| len > `0`)
1932	{
1933	return;
1934	}
1935	// Now we attempt to shorten the sequence. The idea here is that we
1936	// don't want to look for too many literals, but we want to shorten
1937	// our sequence enough to improve our odds of using better algorithms
1938	// downstream (such as Teddy).
1939	const ATTEMPTS: [(usize, usize); `5`] =
1940	[(`5`, `64`), (`4`, `64`), (`3`, `64`), (`2`, `64`), (`1`, `10`)];
1941	for (keep, limit) in ATTEMPTS {
1942	let len = match self.len() {
1943	None => break,
1944	Some(len) => len,
1945	};
1946	if len <= limit {
1947	break;
1948	}
1949	if prefix {
1950	self.keep_first_bytes(keep);
1951	} else {
1952	self.keep_last_bytes(keep);
1953	}
1954	self.minimize_by_preference();
1955	}
1956	// Check for a poison literal. A poison literal is one that is short
1957	// and is believed to have a very high match count. These poisons
1958	// generally lead to a prefilter with a very high false positive rate,
1959	// and thus overall worse performance.
1960	//
1961	// We do this last because we could have gone from a non-poisonous
1962	// sequence to a poisonous one. Perhaps we should add some code to
1963	// prevent such transitions in the first place, but then again, we
1964	// likely only made the transition in the first place if the sequence
1965	// was itself huge. And huge sequences are themselves poisonous. So...
1966	if let Some(lits) = self.literals() {
1967	if lits.iter().any(\|lit\| lit.is_poisonous()) {
1968	self.make_infinite();
1969	}
1970	}
1971	}
1972	}
1973
1974	impl core::fmt::Debug for Seq {
1975	fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
1976	write!(f, "Seq")?;
1977	if let Some(lits: &[Literal]) = self.literals() {
1978	f.debug_list().entries(lits.iter()).finish()
1979	} else {
1980	write!(f, "[∅]")
1981	}
1982	}
1983	}
1984
1985	impl FromIterator<Literal> for Seq {
1986	fn from_iter<T: IntoIterator<Item = Literal>>(it: T) -> Seq {
1987	let mut seq: Seq = Seq::empty();
1988	for literal: Literal in it {
1989	seq.push(lit:literal);
1990	}
1991	seq
1992	}
1993	}
1994
1995	/// A single literal extracted from an [`Hir`] expression.
1996	///
1997	/// A literal is composed of two things:
1998	///
1999	/// A sequence of bytes. No guarantees with respect to UTF-8 are provided.*
2000	/// In particular, even if the regex a literal is extracted from is UTF-8, the
2001	/// literal extracted may not be valid UTF-8. (For example, if an [`Extractor`]
2002	/// limit resulted in trimming a literal in a way that splits a codepoint.)
2003	/// Whether the literal is "exact" or not. An "exact" literal means that it*
2004	/// has not been trimmed, and may continue to be extended. If a literal is
2005	/// "exact" after visiting the entire `Hir` expression, then this implies that
2006	/// the literal leads to a match state. (Although it doesn't necessarily imply
2007	/// all occurrences of the literal correspond to a match of the regex, since
2008	/// literal extraction ignores look-around assertions.)
2009	#[derive(Clone, Eq, PartialEq, PartialOrd, Ord)]
2010	pub struct Literal {
2011	bytes: Vec<u8>,
2012	exact: bool,
2013	}
2014
2015	impl Literal {
2016	/// Returns a new exact literal containing the bytes given.
2017	#[inline]
2018	pub fn exact<B: Into<Vec<u8>>>(bytes: B) -> Literal {
2019	Literal { bytes: bytes.into(), exact: `true` }
2020	}
2021
2022	/// Returns a new inexact literal containing the bytes given.
2023	#[inline]
2024	pub fn inexact<B: Into<Vec<u8>>>(bytes: B) -> Literal {
2025	Literal { bytes: bytes.into(), exact: `false` }
2026	}
2027
2028	/// Returns the bytes in this literal.
2029	#[inline]
2030	pub fn as_bytes(&self) -> &[u8] {
2031	&self.bytes
2032	}
2033
2034	/// Yields ownership of the bytes inside this literal.
2035	///
2036	/// Note that this throws away whether the literal is "exact" or not.
2037	#[inline]
2038	pub fn into_bytes(self) -> Vec<u8> {
2039	self.bytes
2040	}
2041
2042	/// Returns the length of this literal in bytes.
2043	#[inline]
2044	pub fn len(&self) -> usize {
2045	self.as_bytes().len()
2046	}
2047
2048	/// Returns true if and only if this literal has zero bytes.
2049	#[inline]
2050	pub fn is_empty(&self) -> bool {
2051	self.len() == `0`
2052	}
2053
2054	/// Returns true if and only if this literal is exact.
2055	#[inline]
2056	pub fn is_exact(&self) -> bool {
2057	self.exact
2058	}
2059
2060	/// Marks this literal as inexact.
2061	///
2062	/// Inexact literals can never be extended. For example,
2063	/// [`Seq::cross_forward`] will not extend inexact literals.
2064	#[inline]
2065	pub fn make_inexact(&mut self) {
2066	self.exact = `false`;
2067	}
2068
2069	/// Reverse the bytes in this literal.
2070	#[inline]
2071	pub fn reverse(&mut self) {
2072	self.bytes.reverse();
2073	}
2074
2075	/// Extend this literal with the literal given.
2076	///
2077	/// If this literal is inexact, then this is a no-op.
2078	#[inline]
2079	pub fn extend(&mut self, lit: &Literal) {
2080	if !self.is_exact() {
2081	return;
2082	}
2083	self.bytes.extend_from_slice(&lit.bytes);
2084	}
2085
2086	/// Trims this literal such that only the first `len` bytes remain. If
2087	/// this literal has fewer than `len` bytes, then it remains unchanged.
2088	/// Otherwise, the literal is marked as inexact.
2089	#[inline]
2090	pub fn keep_first_bytes(&mut self, len: usize) {
2091	if len >= self.len() {
2092	return;
2093	}
2094	self.make_inexact();
2095	self.bytes.truncate(len);
2096	}
2097
2098	/// Trims this literal such that only the last `len` bytes remain. If this
2099	/// literal has fewer than `len` bytes, then it remains unchanged.
2100	/// Otherwise, the literal is marked as inexact.
2101	#[inline]
2102	pub fn keep_last_bytes(&mut self, len: usize) {
2103	if len >= self.len() {
2104	return;
2105	}
2106	self.make_inexact();
2107	self.bytes.drain(..self.len() - len);
2108	}
2109
2110	/// Returns true if it is believe that this literal is likely to match very
2111	/// frequently, and is thus not a good candidate for a prefilter.
2112	fn is_poisonous(&self) -> bool {
2113	self.is_empty() \|\| (self.len() == `1` && rank(self.as_bytes()[`0`]) >= `250`)
2114	}
2115	}
2116
2117	impl From<u8> for Literal {
2118	fn from(byte: u8) -> Literal {
2119	Literal::exact(bytes:vec![byte])
2120	}
2121	}
2122
2123	impl From<char> for Literal {
2124	fn from(ch: char) -> Literal {
2125	use alloc::string::ToString;
2126	Literal::exact(bytes:ch.encode_utf8(&mut [`0`; `4`]).to_string())
2127	}
2128	}
2129
2130	impl AsRef<[u8]> for Literal {
2131	fn as_ref(&self) -> &[u8] {
2132	self.as_bytes()
2133	}
2134	}
2135
2136	impl core::fmt::Debug for Literal {
2137	fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
2138	let tag: &str = if self.exact { "E" } else { "I" };
2139	f&mut DebugTuple<'_, '_>.debug_tuple(name:tag)
2140	.field(&crate::debug::Bytes(self.as_bytes()))
2141	.finish()
2142	}
2143	}
2144
2145	/// A "preference" trie that rejects literals that will never match when
2146	/// executing a leftmost first or "preference" search.
2147	///
2148	/// For example, if 'sam' is inserted, then trying to insert 'samwise' will be
2149	/// rejected because 'samwise' can never match since 'sam' will always take
2150	/// priority. However, if 'samwise' is inserted first, then inserting 'sam'
2151	/// after it is accepted. In this case, either 'samwise' or 'sam' can match in
2152	/// a "preference" search.
2153	///
2154	/// Note that we only use this trie as a "set." That is, given a sequence of
2155	/// literals, we insert each one in order. An `insert` will reject a literal
2156	/// if a prefix of that literal already exists in the trie. Thus, to rebuild
2157	/// the "minimal" sequence, we simply only keep literals that were successfully
2158	/// inserted. (Since we don't need traversal, one wonders whether we can make
2159	/// some simplifications here, but I haven't given it a ton of thought and I've
2160	/// never seen this show up on a profile. Because of the heuristic limits
2161	/// imposed on literal extractions, the size of the inputs here is usually
2162	/// very small.)
2163	#[derive(Debug, Default)]
2164	struct PreferenceTrie {
2165	/// The states in this trie. The index of a state in this vector is its ID.
2166	states: Vec<State>,
2167	/// The index to allocate to the next literal added to this trie. Starts at
2168	/// 0 and increments by 1 for every literal successfully added to the trie.
2169	next_literal_index: usize,
2170	}
2171
2172	/// A single state in a trie. Uses a sparse representation for its transitions.
2173	#[derive(Debug, Default)]
2174	struct State {
2175	/// Sparse representation of the transitions out of this state. Transitions
2176	/// are sorted by byte. There is at most one such transition for any
2177	/// particular byte.
2178	trans: Vec<(u8, usize)>,
2179	/// Whether this is a matching state or not. If it is, then it contains the
2180	/// index to the matching literal.
2181	literal_index: Option<usize>,
2182	}
2183
2184	impl PreferenceTrie {
2185	/// Minimizes the given sequence of literals while preserving preference
2186	/// order semantics.
2187	///
2188	/// When `keep_exact` is true, the exactness of every literal retained is
2189	/// kept. This is useful when dealing with a fully extracted `Seq` that
2190	/// only contains exact literals. In that case, we can keep all retained
2191	/// literals as exact because we know we'll never need to match anything
2192	/// after them and because any removed literals are guaranteed to never
2193	/// match.
2194	fn minimize(literals: &mut Vec<Literal>, keep_exact: bool) {
2195	use core::cell::RefCell;
2196
2197	// MSRV(1.61): Use retain_mut here to avoid interior mutability.
2198	let trie = RefCell::new(PreferenceTrie::default());
2199	let mut make_inexact = vec![];
2200	literals.retain(\|lit\| {
2201	match trie.borrow_mut().insert(lit.as_bytes()) {
2202	Ok(_) => `true`,
2203	Err(i) => {
2204	if !keep_exact {
2205	make_inexact.push(i);
2206	}
2207	`false`
2208	}
2209	}
2210	});
2211	for i in make_inexact {
2212	literals[i].make_inexact();
2213	}
2214	}
2215
2216	/// Returns `Ok` if the given byte string is accepted into this trie and
2217	/// `Err` otherwise. The index for the success case corresponds to the
2218	/// index of the literal added. The index for the error case corresponds to
2219	/// the index of the literal already in the trie that prevented the given
2220	/// byte string from being added. (Which implies it is a prefix of the one
2221	/// given.)
2222	///
2223	/// In short, the byte string given is accepted into the trie if and only
2224	/// if it is possible for it to match when executing a preference order
2225	/// search.
2226	fn insert(&mut self, bytes: &[u8]) -> Result<usize, usize> {
2227	let mut prev = self.root();
2228	if let Some(idx) = self.states[prev].literal_index {
2229	return Err(idx);
2230	}
2231	for &b in bytes.iter() {
2232	match self.states[prev].trans.binary_search_by_key(&b, \|t\| t.0) {
2233	Ok(i) => {
2234	prev = self.states[prev].trans[i].1;
2235	if let Some(idx) = self.states[prev].literal_index {
2236	return Err(idx);
2237	}
2238	}
2239	Err(i) => {
2240	let next = self.create_state();
2241	self.states[prev].trans.insert(i, (b, next));
2242	prev = next;
2243	}
2244	}
2245	}
2246	let idx = self.next_literal_index;
2247	self.next_literal_index += `1`;
2248	self.states[prev].literal_index = Some(idx);
2249	Ok(idx)
2250	}
2251
2252	/// Returns the root state ID, and if it doesn't exist, creates it.
2253	fn root(&mut self) -> usize {
2254	if !self.states.is_empty() {
2255	`0`
2256	} else {
2257	self.create_state()
2258	}
2259	}
2260
2261	/// Creates a new empty state and returns its ID.
2262	fn create_state(&mut self) -> usize {
2263	let id = self.states.len();
2264	self.states.push(State::default());
2265	id
2266	}
2267	}
2268
2269	/// Returns the "rank" of the given byte.
2270	///
2271	/// The minimum rank value is `0` and the maximum rank value is `255`.
2272	///
2273	/// The rank of a byte is derived from a heuristic background distribution of
2274	/// relative frequencies of bytes. The heuristic says that lower the rank of a
2275	/// byte, the less likely that byte is to appear in any arbitrary haystack.
2276	pub fn rank(byte: u8) -> u8 {
2277	crate::rank::BYTE_FREQUENCIES[usize::from(byte)]
2278	}
2279
2280	#[cfg(test)]
2281	mod tests {
2282	use super::*;
2283
2284	fn parse(pattern: &str) -> Hir {
2285	crate::ParserBuilder::new().utf8(`false`).build().parse(pattern).unwrap()
2286	}
2287
2288	fn prefixes(pattern: &str) -> Seq {
2289	Extractor::new().kind(ExtractKind::Prefix).extract(&parse(pattern))
2290	}
2291
2292	fn suffixes(pattern: &str) -> Seq {
2293	Extractor::new().kind(ExtractKind::Suffix).extract(&parse(pattern))
2294	}
2295
2296	fn e(pattern: &str) -> (Seq, Seq) {
2297	(prefixes(pattern), suffixes(pattern))
2298	}
2299
2300	#[allow(non_snake_case)]
2301	fn E(x: &str) -> Literal {
2302	Literal::exact(x.as_bytes())
2303	}
2304
2305	#[allow(non_snake_case)]
2306	fn I(x: &str) -> Literal {
2307	Literal::inexact(x.as_bytes())
2308	}
2309
2310	fn seq<I: IntoIterator<Item = Literal>>(it: I) -> Seq {
2311	Seq::from_iter(it)
2312	}
2313
2314	fn infinite() -> (Seq, Seq) {
2315	(Seq::infinite(), Seq::infinite())
2316	}
2317
2318	fn inexact<I1, I2>(it1: I1, it2: I2) -> (Seq, Seq)
2319	where
2320	I1: IntoIterator<Item = Literal>,
2321	I2: IntoIterator<Item = Literal>,
2322	{
2323	(Seq::from_iter(it1), Seq::from_iter(it2))
2324	}
2325
2326	fn exact<B: AsRef<[u8]>, I: IntoIterator<Item = B>>(it: I) -> (Seq, Seq) {
2327	let s1 = Seq::new(it);
2328	let s2 = s1.clone();
2329	(s1, s2)
2330	}
2331
2332	fn opt<B: AsRef<[u8]>, I: IntoIterator<Item = B>>(it: I) -> (Seq, Seq) {
2333	let (mut p, mut s) = exact(it);
2334	p.optimize_for_prefix_by_preference();
2335	s.optimize_for_suffix_by_preference();
2336	(p, s)
2337	}
2338
2339	#[test]
2340	fn literal() {
2341	assert_eq!(exact(["a"]), e("a"));
2342	assert_eq!(exact(["aaaaa"]), e("aaaaa"));
2343	assert_eq!(exact(["A", "a"]), e("(?i-u)a"));
2344	assert_eq!(exact(["AB", "Ab", "aB", "ab"]), e("(?i-u)ab"));
2345	assert_eq!(exact(["abC", "abc"]), e("ab(?i-u)c"));
2346
2347	assert_eq!(exact([b"`\xFF`"]), e(r"(?-u:\xFF)"));
2348
2349	#[cfg(feature = "unicode-case")]
2350	{
2351	assert_eq!(exact(["☃"]), e("☃"));
2352	assert_eq!(exact(["☃"]), e("(?i)☃"));
2353	assert_eq!(exact(["☃☃☃☃☃"]), e("☃☃☃☃☃"));
2354
2355	assert_eq!(exact(["Δ"]), e("Δ"));
2356	assert_eq!(exact(["δ"]), e("δ"));
2357	assert_eq!(exact(["Δ", "δ"]), e("(?i)Δ"));
2358	assert_eq!(exact(["Δ", "δ"]), e("(?i)δ"));
2359
2360	assert_eq!(exact(["S", "s", "ſ"]), e("(?i)S"));
2361	assert_eq!(exact(["S", "s", "ſ"]), e("(?i)s"));
2362	assert_eq!(exact(["S", "s", "ſ"]), e("(?i)ſ"));
2363	}
2364
2365	let letters = "ͱͳͷΐάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋ";
2366	assert_eq!(exact([letters]), e(letters));
2367	}
2368
2369	#[test]
2370	fn class() {
2371	assert_eq!(exact(["a", "b", "c"]), e("[abc]"));
2372	assert_eq!(exact(["a1b", "a2b", "a3b"]), e("a[123]b"));
2373	assert_eq!(exact(["δ", "ε"]), e("[εδ]"));
2374	#[cfg(feature = "unicode-case")]
2375	{
2376	assert_eq!(exact(["Δ", "Ε", "δ", "ε", "ϵ"]), e(r"(?i)[εδ]"));
2377	}
2378	}
2379
2380	#[test]
2381	fn look() {
2382	assert_eq!(exact(["ab"]), e(r"a\Ab"));
2383	assert_eq!(exact(["ab"]), e(r"a\zb"));
2384	assert_eq!(exact(["ab"]), e(r"a(?m:^)b"));
2385	assert_eq!(exact(["ab"]), e(r"a(?m:$)b"));
2386	assert_eq!(exact(["ab"]), e(r"a\bb"));
2387	assert_eq!(exact(["ab"]), e(r"a\Bb"));
2388	assert_eq!(exact(["ab"]), e(r"a(?-u:\b)b"));
2389	assert_eq!(exact(["ab"]), e(r"a(?-u:\B)b"));
2390
2391	assert_eq!(exact(["ab"]), e(r"^ab"));
2392	assert_eq!(exact(["ab"]), e(r"$ab"));
2393	assert_eq!(exact(["ab"]), e(r"(?m:^)ab"));
2394	assert_eq!(exact(["ab"]), e(r"(?m:$)ab"));
2395	assert_eq!(exact(["ab"]), e(r"\bab"));
2396	assert_eq!(exact(["ab"]), e(r"\Bab"));
2397	assert_eq!(exact(["ab"]), e(r"(?-u:\b)ab"));
2398	assert_eq!(exact(["ab"]), e(r"(?-u:\B)ab"));
2399
2400	assert_eq!(exact(["ab"]), e(r"ab^"));
2401	assert_eq!(exact(["ab"]), e(r"ab$"));
2402	assert_eq!(exact(["ab"]), e(r"ab(?m:^)"));
2403	assert_eq!(exact(["ab"]), e(r"ab(?m:$)"));
2404	assert_eq!(exact(["ab"]), e(r"ab\b"));
2405	assert_eq!(exact(["ab"]), e(r"ab\B"));
2406	assert_eq!(exact(["ab"]), e(r"ab(?-u:\b)"));
2407	assert_eq!(exact(["ab"]), e(r"ab(?-u:\B)"));
2408
2409	let expected = (seq([I("aZ"), E("ab")]), seq([I("Zb"), E("ab")]));
2410	assert_eq!(expected, e(r"^aZ*b"));
2411	}
2412
2413	#[test]
2414	fn repetition() {
2415	assert_eq!(exact(["a", ""]), e(r"a?"));
2416	assert_eq!(exact(["", "a"]), e(r"a??"));
2417	assert_eq!(inexact([I("a"), E("")], [I("a"), E("")]), e(r"a*"));
2418	assert_eq!(inexact([E(""), I("a")], [E(""), I("a")]), e(r"a*?"));
2419	assert_eq!(inexact([I("a")], [I("a")]), e(r"a+"));
2420	assert_eq!(inexact([I("a")], [I("a")]), e(r"(a+)+"));
2421
2422	assert_eq!(exact(["ab"]), e(r"aZ{0}b"));
2423	assert_eq!(exact(["aZb", "ab"]), e(r"aZ?b"));
2424	assert_eq!(exact(["ab", "aZb"]), e(r"aZ??b"));
2425	assert_eq!(
2426	inexact([I("aZ"), E("ab")], [I("Zb"), E("ab")]),
2427	e(r"aZ*b")
2428	);
2429	assert_eq!(
2430	inexact([E("ab"), I("aZ")], [E("ab"), I("Zb")]),
2431	e(r"aZ*?b")
2432	);
2433	assert_eq!(inexact([I("aZ")], [I("Zb")]), e(r"aZ+b"));
2434	assert_eq!(inexact([I("aZ")], [I("Zb")]), e(r"aZ+?b"));
2435
2436	assert_eq!(exact(["aZZb"]), e(r"aZ{2}b"));
2437	assert_eq!(inexact([I("aZZ")], [I("ZZb")]), e(r"aZ{2,3}b"));
2438
2439	assert_eq!(exact(["abc", ""]), e(r"(abc)?"));
2440	assert_eq!(exact(["", "abc"]), e(r"(abc)??"));
2441
2442	assert_eq!(inexact([I("a"), E("b")], [I("ab"), E("b")]), e(r"a*b"));
2443	assert_eq!(inexact([E("b"), I("a")], [E("b"), I("ab")]), e(r"a*?b"));
2444	assert_eq!(inexact([I("ab")], [I("b")]), e(r"ab+"));
2445	assert_eq!(inexact([I("a"), I("b")], [I("b")]), e(r"a*b+"));
2446
2447	// FIXME: The suffixes for this don't look quite right to me. I think
2448	// the right suffixes would be: [I(ac), I(bc), E(c)]. The main issue I
2449	// think is that suffixes are computed by iterating over concatenations
2450	// in reverse, and then [bc, ac, c] ordering is indeed correct from
2451	// that perspective. We also test a few more equivalent regexes, and
2452	// we get the same result, so it is consistent at least I suppose.
2453	//
2454	// The reason why this isn't an issue is that it only messes up
2455	// preference order, and currently, suffixes are never used in a
2456	// context where preference order matters. For prefixes it matters
2457	// because we sometimes want to use prefilters without confirmation
2458	// when all of the literals are exact (and there's no look-around). But
2459	// we never do that for suffixes. Any time we use suffixes, we always
2460	// include a confirmation step. If that ever changes, then it's likely
2461	// this bug will need to be fixed, but last time I looked, it appears
2462	// hard to do so.
2463	assert_eq!(
2464	inexact([I("a"), I("b"), E("c")], [I("bc"), I("ac"), E("c")]),
2465	e(r"abc")
2466	);
2467	assert_eq!(
2468	inexact([I("a"), I("b"), E("c")], [I("bc"), I("ac"), E("c")]),
2469	e(r"(a+)?(b+)?c")
2470	);
2471	assert_eq!(
2472	inexact([I("a"), I("b"), E("c")], [I("bc"), I("ac"), E("c")]),
2473	e(r"(a+\|)(b+\|)c")
2474	);
2475	// A few more similarish but not identical regexes. These may have a
2476	// similar problem as above.
2477	assert_eq!(
2478	inexact(
2479	[I("a"), I("b"), I("c"), E("")],
2480	[I("c"), I("b"), I("a"), E("")]
2481	),
2482	e(r"abc*")
2483	);
2484	assert_eq!(inexact([I("a"), I("b"), I("c")], [I("c")]), e(r"abc+"));
2485	assert_eq!(inexact([I("a"), I("b")], [I("bc")]), e(r"a*b+c"));
2486	assert_eq!(inexact([I("a"), I("b")], [I("c"), I("b")]), e(r"ab+c"));
2487	assert_eq!(inexact([I("ab"), E("a")], [I("b"), E("a")]), e(r"ab*"));
2488	assert_eq!(
2489	inexact([I("ab"), E("ac")], [I("bc"), E("ac")]),
2490	e(r"ab*c")
2491	);
2492	assert_eq!(inexact([I("ab")], [I("b")]), e(r"ab+"));
2493	assert_eq!(inexact([I("ab")], [I("bc")]), e(r"ab+c"));
2494
2495	assert_eq!(
2496	inexact([I("z"), E("azb")], [I("zazb"), E("azb")]),
2497	e(r"z*azb")
2498	);
2499
2500	let expected =
2501	exact(["aaa", "aab", "aba", "abb", "baa", "bab", "bba", "bbb"]);
2502	assert_eq!(expected, e(r"[ab]{3}"));
2503	let expected = inexact(
2504	[
2505	I("aaa"),
2506	I("aab"),
2507	I("aba"),
2508	I("abb"),
2509	I("baa"),
2510	I("bab"),
2511	I("bba"),
2512	I("bbb"),
2513	],
2514	[
2515	I("aaa"),
2516	I("aab"),
2517	I("aba"),
2518	I("abb"),
2519	I("baa"),
2520	I("bab"),
2521	I("bba"),
2522	I("bbb"),
2523	],
2524	);
2525	assert_eq!(expected, e(r"[ab]{3,4}"));
2526	}
2527
2528	#[test]
2529	fn concat() {
2530	let empty: [&str; `0`] = [];
2531
2532	assert_eq!(exact(["abcxyz"]), e(r"abc()xyz"));
2533	assert_eq!(exact(["abcxyz"]), e(r"(abc)(xyz)"));
2534	assert_eq!(exact(["abcmnoxyz"]), e(r"abc()mno()xyz"));
2535	assert_eq!(exact(empty), e(r"abc[a&&b]xyz"));
2536	assert_eq!(exact(["abcxyz"]), e(r"abc[a&&b]*xyz"));
2537	}
2538
2539	#[test]
2540	fn alternation() {
2541	assert_eq!(exact(["abc", "mno", "xyz"]), e(r"abc\|mno\|xyz"));
2542	assert_eq!(
2543	inexact(
2544	[E("abc"), I("mZ"), E("mo"), E("xyz")],
2545	[E("abc"), I("Zo"), E("mo"), E("xyz")]
2546	),
2547	e(r"abc\|mZ*o\|xyz")
2548	);
2549	assert_eq!(exact(["abc", "xyz"]), e(r"abc\|M[a&&b]N\|xyz"));
2550	assert_eq!(exact(["abc", "MN", "xyz"]), e(r"abc\|M[a&&b]*N\|xyz"));
2551
2552	assert_eq!(exact(["aaa", "aaaaa"]), e(r"(?:\|aa)aaa"));
2553	assert_eq!(
2554	inexact(
2555	[I("aaa"), E(""), I("aaaaa"), E("aa")],
2556	[I("aaa"), E(""), E("aa")]
2557	),
2558	e(r"(?:\|aa)(?:aaa)*")
2559	);
2560	assert_eq!(
2561	inexact(
2562	[E(""), I("aaa"), E("aa"), I("aaaaa")],
2563	[E(""), I("aaa"), E("aa")]
2564	),
2565	e(r"(?:\|aa)(?:aaa)*?")
2566	);
2567
2568	assert_eq!(
2569	inexact([E("a"), I("b"), E("")], [E("a"), I("b"), E("")]),
2570	e(r"a\|b*")
2571	);
2572	assert_eq!(inexact([E("a"), I("b")], [E("a"), I("b")]), e(r"a\|b+"));
2573
2574	assert_eq!(
2575	inexact([I("a"), E("b"), E("c")], [I("ab"), E("b"), E("c")]),
2576	e(r"a*b\|c")
2577	);
2578
2579	assert_eq!(
2580	inexact(
2581	[E("a"), E("b"), I("c"), E("")],
2582	[E("a"), E("b"), I("c"), E("")]
2583	),
2584	e(r"a\|(?:b\|c*)")
2585	);
2586
2587	assert_eq!(
2588	inexact(
2589	[I("a"), I("b"), E("c"), I("a"), I("ab"), E("c")],
2590	[I("ac"), I("bc"), E("c"), I("ac"), I("abc"), E("c")],
2591	),
2592	e(r"(a\|b)c\|(a\|ab)c")
2593	);
2594
2595	assert_eq!(
2596	exact(["abef", "abgh", "cdef", "cdgh"]),
2597	e(r"(ab\|cd)(ef\|gh)")
2598	);
2599	assert_eq!(
2600	exact([
2601	"abefij", "abefkl", "abghij", "abghkl", "cdefij", "cdefkl",
2602	"cdghij", "cdghkl",
2603	]),
2604	e(r"(ab\|cd)(ef\|gh)(ij\|kl)")
2605	);
2606	}
2607
2608	#[test]
2609	fn impossible() {
2610	let empty: [&str; `0`] = [];
2611
2612	assert_eq!(exact(empty), e(r"[a&&b]"));
2613	assert_eq!(exact(empty), e(r"a[a&&b]"));
2614	assert_eq!(exact(empty), e(r"[a&&b]b"));
2615	assert_eq!(exact(empty), e(r"a[a&&b]b"));
2616	assert_eq!(exact(["a", "b"]), e(r"a\|[a&&b]\|b"));
2617	assert_eq!(exact(["a", "b"]), e(r"a\|c[a&&b]\|b"));
2618	assert_eq!(exact(["a", "b"]), e(r"a\|[a&&b]d\|b"));
2619	assert_eq!(exact(["a", "b"]), e(r"a\|c[a&&b]d\|b"));
2620	assert_eq!(exact([""]), e(r"[a&&b]*"));
2621	assert_eq!(exact(["MN"]), e(r"M[a&&b]*N"));
2622	}
2623
2624	// This tests patterns that contain something that defeats literal
2625	// detection, usually because it would blow some limit on the total number
2626	// of literals that can be returned.
2627	//
2628	// The main idea is that when literal extraction sees something that
2629	// it knows will blow a limit, it replaces it with a marker that says
2630	// "any literal will match here." While not necessarily true, the
2631	// over-estimation is just fine for the purposes of literal extraction,
2632	// because the imprecision doesn't matter: too big is too big.
2633	//
2634	// This is one of the trickier parts of literal extraction, since we need
2635	// to make sure all of our literal extraction operations correctly compose
2636	// with the markers.
2637	#[test]
2638	fn anything() {
2639	assert_eq!(infinite(), e(r"."));
2640	assert_eq!(infinite(), e(r"(?s)."));
2641	assert_eq!(infinite(), e(r"[A-Za-z]"));
2642	assert_eq!(infinite(), e(r"[A-Z]"));
2643	assert_eq!(exact([""]), e(r"[A-Z]{0}"));
2644	assert_eq!(infinite(), e(r"[A-Z]?"));
2645	assert_eq!(infinite(), e(r"[A-Z]*"));
2646	assert_eq!(infinite(), e(r"[A-Z]+"));
2647	assert_eq!((seq([I("1")]), Seq::infinite()), e(r"1[A-Z]"));
2648	assert_eq!((seq([I("1")]), seq([I("2")])), e(r"1[A-Z]2"));
2649	assert_eq!((Seq::infinite(), seq([I("123")])), e(r"[A-Z]+123"));
2650	assert_eq!(infinite(), e(r"[A-Z]+123[A-Z]+"));
2651	assert_eq!(infinite(), e(r"1\|[A-Z]\|3"));
2652	assert_eq!(
2653	(seq([E("1"), I("2"), E("3")]), Seq::infinite()),
2654	e(r"1\|2[A-Z]\|3"),
2655	);
2656	assert_eq!(
2657	(Seq::infinite(), seq([E("1"), I("2"), E("3")])),
2658	e(r"1\|[A-Z]2\|3"),
2659	);
2660	assert_eq!(
2661	(seq([E("1"), I("2"), E("4")]), seq([E("1"), I("3"), E("4")])),
2662	e(r"1\|2[A-Z]3\|4"),
2663	);
2664	assert_eq!((Seq::infinite(), seq([I("2")])), e(r"(?:\|1)[A-Z]2"));
2665	assert_eq!(inexact([I("a")], [I("z")]), e(r"a.z"));
2666	}
2667
2668	// Like the 'anything' test, but it uses smaller limits in order to test
2669	// the logic for effectively aborting literal extraction when the seqs get
2670	// too big.
2671	#[test]
2672	fn anything_small_limits() {
2673	fn prefixes(pattern: &str) -> Seq {
2674	Extractor::new()
2675	.kind(ExtractKind::Prefix)
2676	.limit_total(`10`)
2677	.extract(&parse(pattern))
2678	}
2679
2680	fn suffixes(pattern: &str) -> Seq {
2681	Extractor::new()
2682	.kind(ExtractKind::Suffix)
2683	.limit_total(`10`)
2684	.extract(&parse(pattern))
2685	}
2686
2687	fn e(pattern: &str) -> (Seq, Seq) {
2688	(prefixes(pattern), suffixes(pattern))
2689	}
2690
2691	assert_eq!(
2692	(
2693	seq([
2694	I("aaa"),
2695	I("aab"),
2696	I("aba"),
2697	I("abb"),
2698	I("baa"),
2699	I("bab"),
2700	I("bba"),
2701	I("bbb")
2702	]),
2703	seq([
2704	I("aaa"),
2705	I("aab"),
2706	I("aba"),
2707	I("abb"),
2708	I("baa"),
2709	I("bab"),
2710	I("bba"),
2711	I("bbb")
2712	])
2713	),
2714	e(r"[ab]{3}{3}")
2715	);
2716
2717	assert_eq!(infinite(), e(r"ab\|cd\|ef\|gh\|ij\|kl\|mn\|op\|qr\|st\|uv\|wx\|yz"));
2718	}
2719
2720	#[test]
2721	fn empty() {
2722	assert_eq!(exact([""]), e(r""));
2723	assert_eq!(exact([""]), e(r"^"));
2724	assert_eq!(exact([""]), e(r"$"));
2725	assert_eq!(exact([""]), e(r"(?m:^)"));
2726	assert_eq!(exact([""]), e(r"(?m:$)"));
2727	assert_eq!(exact([""]), e(r"\b"));
2728	assert_eq!(exact([""]), e(r"\B"));
2729	assert_eq!(exact([""]), e(r"(?-u:\b)"));
2730	assert_eq!(exact([""]), e(r"(?-u:\B)"));
2731	}
2732
2733	#[test]
2734	fn odds_and_ends() {
2735	assert_eq!((Seq::infinite(), seq([I("a")])), e(r".a"));
2736	assert_eq!((seq([I("a")]), Seq::infinite()), e(r"a."));
2737	assert_eq!(infinite(), e(r"a\|."));
2738	assert_eq!(infinite(), e(r".\|a"));
2739
2740	let pat = r"M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]";
2741	let expected = inexact(
2742	["Mo'am", "Moam", "Mu'am", "Muam"].map(I),
2743	[
2744	"ddafi", "ddafy", "dhafi", "dhafy", "dzafi", "dzafy", "dafi",
2745	"dafy", "tdafi", "tdafy", "thafi", "thafy", "tzafi", "tzafy",
2746	"tafi", "tafy", "zdafi", "zdafy", "zhafi", "zhafy", "zzafi",
2747	"zzafy", "zafi", "zafy",
2748	]
2749	.map(I),
2750	);
2751	assert_eq!(expected, e(pat));
2752
2753	assert_eq!(
2754	(seq(["fn is_", "fn as_"].map(I)), Seq::infinite()),
2755	e(r"fn is_([A-Z]+)\|fn as_([A-Z]+)"),
2756	);
2757	assert_eq!(
2758	inexact([I("foo")], [I("quux")]),
2759	e(r"foo[A-Z]+bar[A-Z]+quux")
2760	);
2761	assert_eq!(infinite(), e(r"[A-Z]+bar[A-Z]+"));
2762	assert_eq!(
2763	exact(["Sherlock Holmes"]),
2764	e(r"(?m)^Sherlock Holmes\|Sherlock Holmes$")
2765	);
2766
2767	assert_eq!(exact(["sa", "sb"]), e(r"\bs(?:[ab])"));
2768	}
2769
2770	// This tests a specific regex along with some heuristic steps to reduce
2771	// the sequences extracted. This is meant to roughly correspond to the
2772	// types of heuristics used to shrink literal sets in practice. (Shrinking
2773	// is done because you want to balance "spend too much work looking for
2774	// too many literals" and "spend too much work processing false positive
2775	// matches from short literals.")
2776	#[test]
2777	#[cfg(feature = "unicode-case")]
2778	fn holmes() {
2779	let expected = inexact(
2780	["HOL", "HOl", "HoL", "Hol", "hOL", "hOl", "hoL", "hol"].map(I),
2781	[
2782	"MES", "MEs", "Eſ", "MeS", "Mes", "eſ", "mES", "mEs", "meS",
2783	"mes",
2784	]
2785	.map(I),
2786	);
2787	let (mut prefixes, mut suffixes) = e(r"(?i)Holmes");
2788	prefixes.keep_first_bytes(`3`);
2789	suffixes.keep_last_bytes(`3`);
2790	prefixes.minimize_by_preference();
2791	suffixes.minimize_by_preference();
2792	assert_eq!(expected, (prefixes, suffixes));
2793	}
2794
2795	// This tests that we get some kind of literals extracted for a beefier
2796	// alternation with case insensitive mode enabled. At one point during
2797	// development, this returned nothing, and motivated some special case
2798	// code in Extractor::union to try and trim down the literal sequences
2799	// if the union would blow the limits set.
2800	#[test]
2801	#[cfg(feature = "unicode-case")]
2802	fn holmes_alt() {
2803	let mut pre =
2804	prefixes(r"(?i)Sherlock\|Holmes\|Watson\|Irene\|Adler\|John\|Baker");
2805	assert!(pre.len().unwrap() > `0`);
2806	pre.optimize_for_prefix_by_preference();
2807	assert!(pre.len().unwrap() > `0`);
2808	}
2809
2810	// See: https://github.com/rust-lang/regex/security/advisories/GHSA-m5pq-gvj9-9vr8
2811	// See: CVE-2022-24713
2812	//
2813	// We test this here to ensure literal extraction completes in reasonable
2814	// time and isn't materially impacted by these sorts of pathological
2815	// repeats.
2816	#[test]
2817	fn crazy_repeats() {
2818	assert_eq!(inexact([I("")], [I("")]), e(r"(?:){4294967295}"));
2819	assert_eq!(
2820	inexact([I("")], [I("")]),
2821	e(r"(?:){64}{64}{64}{64}{64}{64}")
2822	);
2823	assert_eq!(inexact([I("")], [I("")]), e(r"x{0}{4294967295}"));
2824	assert_eq!(inexact([I("")], [I("")]), e(r"(?:\|){4294967295}"));
2825
2826	assert_eq!(
2827	inexact([E("")], [E("")]),
2828	e(r"(?:){8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}")
2829	);
2830	let repa = "a".repeat(`100`);
2831	assert_eq!(
2832	inexact([I(&repa)], [I(&repa)]),
2833	e(r"a{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}")
2834	);
2835	}
2836
2837	#[test]
2838	fn huge() {
2839	let pat = r#"(?-u)
2840	2(?:
2841	[45]\d{3}\|
2842	7(?:
2843	1[0-267]\|
2844	2[0-289]\|
2845	3[0-29]\|
2846	4[01]\|
2847	5[1-3]\|
2848	6[013]\|
2849	7[0178]\|
2850	91
2851	)\|
2852	8(?:
2853	0[125]\|
2854	[139][1-6]\|
2855	2[0157-9]\|
2856	41\|
2857	6[1-35]\|
2858	7[1-5]\|
2859	8[1-8]\|
2860	90
2861	)\|
2862	9(?:
2863	0[0-2]\|
2864	1[0-4]\|
2865	2[568]\|
2866	3[3-6]\|
2867	5[5-7]\|
2868	6[0167]\|
2869	7[15]\|
2870	8[0146-9]
2871	)
2872	)\d{4}\|
2873	3(?:
2874	12?[5-7]\d{2}\|
2875	0(?:
2876	2(?:
2877	[025-79]\d\|
2878	[348]\d{1,2}
2879	)\|
2880	3(?:
2881	[2-4]\d\|
2882	[56]\d?
2883	)
2884	)\|
2885	2(?:
2886	1\d{2}\|
2887	2(?:
2888	[12]\d\|
2889	[35]\d{1,2}\|
2890	4\d?
2891	)
2892	)\|
2893	3(?:
2894	1\d{2}\|
2895	2(?:
2896	[2356]\d\|
2897	4\d{1,2}
2898	)
2899	)\|
2900	4(?:
2901	1\d{2}\|
2902	2(?:
2903	2\d{1,2}\|
2904	[47]\|
2905	5\d{2}
2906	)
2907	)\|
2908	5(?:
2909	1\d{2}\|
2910	29
2911	)\|
2912	[67]1\d{2}\|
2913	8(?:
2914	1\d{2}\|
2915	2(?:
2916	2\d{2}\|
2917	3\|
2918	4\d
2919	)
2920	)
2921	)\d{3}\|
2922	4(?:
2923	0(?:
2924	2(?:
2925	[09]\d\|
2926	7
2927	)\|
2928	33\d{2}
2929	)\|
2930	1\d{3}\|
2931	2(?:
2932	1\d{2}\|
2933	2(?:
2934	[25]\d?\|
2935	[348]\d\|
2936	[67]\d{1,2}
2937	)
2938	)\|
2939	3(?:
2940	1\d{2}(?:
2941	\d{2}
2942	)?\|
2943	2(?:
2944	[045]\d\|
2945	[236-9]\d{1,2}
2946	)\|
2947	32\d{2}
2948	)\|
2949	4(?:
2950	[18]\d{2}\|
2951	2(?:
2952	[2-46]\d{2}\|
2953	3
2954	)\|
2955	5[25]\d{2}
2956	)\|
2957	5(?:
2958	1\d{2}\|
2959	2(?:
2960	3\d\|
2961	5
2962	)
2963	)\|
2964	6(?:
2965	[18]\d{2}\|
2966	2(?:
2967	3(?:
2968	\d{2}
2969	)?\|
2970	[46]\d{1,2}\|
2971	5\d{2}\|
2972	7\d
2973	)\|
2974	5(?:
2975	3\d?\|
2976	4\d\|
2977	[57]\d{1,2}\|
2978	6\d{2}\|
2979	8
2980	)
2981	)\|
2982	71\d{2}\|
2983	8(?:
2984	[18]\d{2}\|
2985	23\d{2}\|
2986	54\d{2}
2987	)\|
2988	9(?:
2989	[18]\d{2}\|
2990	2[2-5]\d{2}\|
2991	53\d{1,2}
2992	)
2993	)\d{3}\|
2994	5(?:
2995	02[03489]\d{2}\|
2996	1\d{2}\|
2997	2(?:
2998	1\d{2}\|
2999	2(?:
3000	2(?:
3001	\d{2}
3002	)?\|
3003	[457]\d{2}
3004	)
3005	)\|
3006	3(?:
3007	1\d{2}\|
3008	2(?:
3009	[37](?:
3010	\d{2}
3011	)?\|
3012	[569]\d{2}
3013	)
3014	)\|
3015	4(?:
3016	1\d{2}\|
3017	2[46]\d{2}
3018	)\|
3019	5(?:
3020	1\d{2}\|
3021	26\d{1,2}
3022	)\|
3023	6(?:
3024	[18]\d{2}\|
3025	2\|
3026	53\d{2}
3027	)\|
3028	7(?:
3029	1\|
3030	24
3031	)\d{2}\|
3032	8(?:
3033	1\|
3034	26
3035	)\d{2}\|
3036	91\d{2}
3037	)\d{3}\|
3038	6(?:
3039	0(?:
3040	1\d{2}\|
3041	2(?:
3042	3\d{2}\|
3043	4\d{1,2}
3044	)
3045	)\|
3046	2(?:
3047	2[2-5]\d{2}\|
3048	5(?:
3049	[3-5]\d{2}\|
3050	7
3051	)\|
3052	8\d{2}
3053	)\|
3054	3(?:
3055	1\|
3056	2[3478]
3057	)\d{2}\|
3058	4(?:
3059	1\|
3060	2[34]
3061	)\d{2}\|
3062	5(?:
3063	1\|
3064	2[47]
3065	)\d{2}\|
3066	6(?:
3067	[18]\d{2}\|
3068	6(?:
3069	2(?:
3070	2\d\|
3071	[34]\d{2}
3072	)\|
3073	5(?:
3074	[24]\d{2}\|
3075	3\d\|
3076	5\d{1,2}
3077	)
3078	)
3079	)\|
3080	72[2-5]\d{2}\|
3081	8(?:
3082	1\d{2}\|
3083	2[2-5]\d{2}
3084	)\|
3085	9(?:
3086	1\d{2}\|
3087	2[2-6]\d{2}
3088	)
3089	)\d{3}\|
3090	7(?:
3091	(?:
3092	02\|
3093	[3-589]1\|
3094	6[12]\|
3095	72[24]
3096	)\d{2}\|
3097	21\d{3}\|
3098	32
3099	)\d{3}\|
3100	8(?:
3101	(?:
3102	4[12]\|
3103	[5-7]2\|
3104	1\d?
3105	)\|
3106	(?:
3107	0\|
3108	3[12]\|
3109	[5-7]1\|
3110	217
3111	)\d
3112	)\d{4}\|
3113	9(?:
3114	[35]1\|
3115	(?:
3116	[024]2\|
3117	81
3118	)\d\|
3119	(?:
3120	1\|
3121	[24]1
3122	)\d{2}
3123	)\d{3}
3124	"#;
3125	// TODO: This is a good candidate of a seq of literals that could be
3126	// shrunk quite a bit and still be very productive with respect to
3127	// literal optimizations.
3128	let (prefixes, suffixes) = e(pat);
3129	assert!(!suffixes.is_finite());
3130	assert_eq!(Some(`243`), prefixes.len());
3131	}
3132
3133	#[test]
3134	fn optimize() {
3135	// This gets a common prefix that isn't too short.
3136	let (p, s) =
3137	opt(["foobarfoobar", "foobar", "foobarzfoobar", "foobarfoobar"]);
3138	assert_eq!(seq([I("foobar")]), p);
3139	assert_eq!(seq([I("foobar")]), s);
3140
3141	// This also finds a common prefix, but since it's only one byte, it
3142	// prefers the multiple literals.
3143	let (p, s) = opt(["abba", "akka", "abccba"]);
3144	assert_eq!(exact(["abba", "akka", "abccba"]), (p, s));
3145
3146	let (p, s) = opt(["sam", "samwise"]);
3147	assert_eq!((seq([E("sam")]), seq([E("sam"), E("samwise")])), (p, s));
3148
3149	// The empty string is poisonous, so our seq becomes infinite, even
3150	// though all literals are exact.
3151	let (p, s) = opt(["foobarfoo", "foo", "", "foozfoo", "foofoo"]);
3152	assert!(!p.is_finite());
3153	assert!(!s.is_finite());
3154
3155	// A space is also poisonous, so our seq becomes infinite. But this
3156	// only gets triggered when we don't have a completely exact sequence.
3157	// When the sequence is exact, spaces are okay, since we presume that
3158	// any prefilter will match a space more quickly than the regex engine.
3159	// (When the sequence is exact, there's a chance of the prefilter being
3160	// used without needing the regex engine at all.)
3161	let mut p = seq([E("foobarfoo"), I("foo"), E(" "), E("foofoo")]);
3162	p.optimize_for_prefix_by_preference();
3163	assert!(!p.is_finite());
3164	}
3165	}
3166