1 | /*! |
2 | Provides literal extraction from `Hir` expressions. |
3 | |
4 | An [`Extractor`] pulls literals out of [`Hir`] expressions and returns a |
5 | [`Seq`] of [`Literal`]s. |
6 | |
7 | The purpose of literal extraction is generally to provide avenues for |
8 | optimizing regex searches. The main idea is that substring searches can be an |
9 | order of magnitude faster than a regex search. Therefore, if one can execute |
10 | a substring search to find candidate match locations and only run the regex |
11 | search at those locations, then it is possible for huge improvements in |
12 | performance to be realized. |
13 | |
14 | With that said, literal optimizations are generally a black art because even |
15 | though substring search is generally faster, if the number of candidates |
16 | produced is high, then it can create a lot of overhead by ping-ponging between |
17 | the substring search and the regex search. |
18 | |
19 | Here are some heuristics that might be used to help increase the chances of |
20 | effective literal optimizations: |
21 | |
22 | * Stick to small [`Seq`]s. If you search for too many literals, it's likely |
23 | to lead to substring search that is only a little faster than a regex search, |
24 | and thus the overhead of using literal optimizations in the first place might |
25 | make things slower overall. |
26 | * The literals in your [`Seq`] shoudn't be too short. In general, longer is |
27 | better. A sequence corresponding to single bytes that occur frequently in the |
28 | haystack, for example, is probably a bad literal optimization because it's |
29 | likely to produce many false positive candidates. Longer literals are less |
30 | likely to match, and thus probably produce fewer false positives. |
31 | * If it's possible to estimate the approximate frequency of each byte according |
32 | to some pre-computed background distribution, it is possible to compute a score |
33 | of how "good" a `Seq` is. If a `Seq` isn't good enough, you might consider |
34 | skipping the literal optimization and just use the regex engine. |
35 | |
36 | (It should be noted that there are always pathological cases that can make |
37 | any kind of literal optimization be a net slower result. This is why it |
38 | might be a good idea to be conservative, or to even provide a means for |
39 | literal optimizations to be dynamically disabled if they are determined to be |
40 | ineffective according to some measure.) |
41 | |
42 | You're encouraged to explore the methods on [`Seq`], which permit shrinking |
43 | the size of sequences in a preference-order preserving fashion. |
44 | |
45 | Finally, note that it isn't strictly necessary to use an [`Extractor`]. Namely, |
46 | an `Extractor` only uses public APIs of the [`Seq`] and [`Literal`] types, |
47 | so it is possible to implement your own extractor. For example, for n-grams |
48 | or "inner" literals (i.e., not prefix or suffix literals). The `Extractor` |
49 | is mostly responsible for the case analysis over `Hir` expressions. Much of |
50 | the "trickier" parts are how to combine literal sequences, and that is all |
51 | implemented on [`Seq`]. |
52 | */ |
53 | |
54 | use core::{cmp, mem}; |
55 | |
56 | use alloc::{vec, vec::Vec}; |
57 | |
58 | use crate::hir::{self, Hir}; |
59 | |
60 | /// Extracts prefix or suffix literal sequences from [`Hir`] expressions. |
61 | /// |
62 | /// Literal extraction is based on the following observations: |
63 | /// |
64 | /// * Many regexes start with one or a small number of literals. |
65 | /// * Substring search for literals is often much faster (sometimes by an order |
66 | /// of magnitude) than a regex search. |
67 | /// |
68 | /// Thus, in many cases, one can search for literals to find candidate starting |
69 | /// locations of a match, and then only run the full regex engine at each such |
70 | /// location instead of over the full haystack. |
71 | /// |
72 | /// The main downside of literal extraction is that it can wind up causing a |
73 | /// search to be slower overall. For example, if there are many matches or if |
74 | /// there are many candidates that don't ultimately lead to a match, then a |
75 | /// lot of overhead will be spent in shuffing back-and-forth between substring |
76 | /// search and the regex engine. This is the fundamental reason why literal |
77 | /// optimizations for regex patterns is sometimes considered a "black art." |
78 | /// |
79 | /// # Look-around assertions |
80 | /// |
81 | /// Literal extraction treats all look-around assertions as-if they match every |
82 | /// empty string. So for example, the regex `\bquux\b` will yield a sequence |
83 | /// containing a single exact literal `quux`. However, not all occurrences |
84 | /// of `quux` correspond to a match a of the regex. For example, `\bquux\b` |
85 | /// does not match `ZquuxZ` anywhere because `quux` does not fall on a word |
86 | /// boundary. |
87 | /// |
88 | /// In effect, if your regex contains look-around assertions, then a match of |
89 | /// an exact literal does not necessarily mean the regex overall matches. So |
90 | /// you may still need to run the regex engine in such cases to confirm the |
91 | /// match. |
92 | /// |
93 | /// The precise guarantee you get from a literal sequence is: if every literal |
94 | /// in the sequence is exact and the original regex contains zero look-around |
95 | /// assertions, then a preference-order multi-substring search of those |
96 | /// literals will precisely match a preference-order search of the original |
97 | /// regex. |
98 | /// |
99 | /// # Example |
100 | /// |
101 | /// This shows how to extract prefixes: |
102 | /// |
103 | /// ``` |
104 | /// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse}; |
105 | /// |
106 | /// let hir = parse(r"(a|b|c)(x|y|z)[A-Z]+foo" )?; |
107 | /// let got = Extractor::new().extract(&hir); |
108 | /// // All literals returned are "inexact" because none of them reach the |
109 | /// // match state. |
110 | /// let expected = Seq::from_iter([ |
111 | /// Literal::inexact("ax" ), |
112 | /// Literal::inexact("ay" ), |
113 | /// Literal::inexact("az" ), |
114 | /// Literal::inexact("bx" ), |
115 | /// Literal::inexact("by" ), |
116 | /// Literal::inexact("bz" ), |
117 | /// Literal::inexact("cx" ), |
118 | /// Literal::inexact("cy" ), |
119 | /// Literal::inexact("cz" ), |
120 | /// ]); |
121 | /// assert_eq!(expected, got); |
122 | /// |
123 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |
124 | /// ``` |
125 | /// |
126 | /// This shows how to extract suffixes: |
127 | /// |
128 | /// ``` |
129 | /// use regex_syntax::{ |
130 | /// hir::literal::{Extractor, ExtractKind, Literal, Seq}, |
131 | /// parse, |
132 | /// }; |
133 | /// |
134 | /// let hir = parse(r"foo|[A-Z]+bar" )?; |
135 | /// let got = Extractor::new().kind(ExtractKind::Suffix).extract(&hir); |
136 | /// // Since 'foo' gets to a match state, it is considered exact. But 'bar' |
137 | /// // does not because of the '[A-Z]+', and thus is marked inexact. |
138 | /// let expected = Seq::from_iter([ |
139 | /// Literal::exact("foo" ), |
140 | /// Literal::inexact("bar" ), |
141 | /// ]); |
142 | /// assert_eq!(expected, got); |
143 | /// |
144 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |
145 | /// ``` |
146 | #[derive (Clone, Debug)] |
147 | pub struct Extractor { |
148 | kind: ExtractKind, |
149 | limit_class: usize, |
150 | limit_repeat: usize, |
151 | limit_literal_len: usize, |
152 | limit_total: usize, |
153 | } |
154 | |
155 | impl Extractor { |
156 | /// Create a new extractor with a default configuration. |
157 | /// |
158 | /// The extractor can be optionally configured before calling |
159 | /// [`Extractor::extract`] to get a literal sequence. |
160 | pub fn new() -> Extractor { |
161 | Extractor { |
162 | kind: ExtractKind::Prefix, |
163 | limit_class: 10, |
164 | limit_repeat: 10, |
165 | limit_literal_len: 100, |
166 | limit_total: 250, |
167 | } |
168 | } |
169 | |
170 | /// Execute the extractor and return a sequence of literals. |
171 | pub fn extract(&self, hir: &Hir) -> Seq { |
172 | use crate::hir::HirKind::*; |
173 | |
174 | match *hir.kind() { |
175 | Empty | Look(_) => Seq::singleton(self::Literal::exact(vec![])), |
176 | Literal(hir::Literal(ref bytes)) => { |
177 | let mut seq = |
178 | Seq::singleton(self::Literal::exact(bytes.to_vec())); |
179 | self.enforce_literal_len(&mut seq); |
180 | seq |
181 | } |
182 | Class(hir::Class::Unicode(ref cls)) => { |
183 | self.extract_class_unicode(cls) |
184 | } |
185 | Class(hir::Class::Bytes(ref cls)) => self.extract_class_bytes(cls), |
186 | Repetition(ref rep) => self.extract_repetition(rep), |
187 | Capture(hir::Capture { ref sub, .. }) => self.extract(sub), |
188 | Concat(ref hirs) => match self.kind { |
189 | ExtractKind::Prefix => self.extract_concat(hirs.iter()), |
190 | ExtractKind::Suffix => self.extract_concat(hirs.iter().rev()), |
191 | }, |
192 | Alternation(ref hirs) => { |
193 | // Unlike concat, we always union starting from the beginning, |
194 | // since the beginning corresponds to the highest preference, |
195 | // which doesn't change based on forwards vs reverse. |
196 | self.extract_alternation(hirs.iter()) |
197 | } |
198 | } |
199 | } |
200 | |
201 | /// Set the kind of literal sequence to extract from an [`Hir`] expression. |
202 | /// |
203 | /// The default is to extract prefixes, but suffixes can be selected |
204 | /// instead. The contract for prefixes is that every match of the |
205 | /// corresponding `Hir` must start with one of the literals in the sequence |
206 | /// returned. Moreover, the _order_ of the sequence returned corresponds to |
207 | /// the preference order. |
208 | /// |
209 | /// Suffixes satisfy a similar contract in that every match of the |
210 | /// corresponding `Hir` must end with one of the literals in the sequence |
211 | /// returned. However, there is no guarantee that the literals are in |
212 | /// preference order. |
213 | /// |
214 | /// Remember that a sequence can be infinite. For example, unless the |
215 | /// limits are configured to be impractically large, attempting to extract |
216 | /// prefixes (or suffixes) for the pattern `[A-Z]` will return an infinite |
217 | /// sequence. Generally speaking, if the sequence returned is infinite, |
218 | /// then it is presumed to be unwise to do prefix (or suffix) optimizations |
219 | /// for the pattern. |
220 | pub fn kind(&mut self, kind: ExtractKind) -> &mut Extractor { |
221 | self.kind = kind; |
222 | self |
223 | } |
224 | |
225 | /// Configure a limit on the length of the sequence that is permitted for |
226 | /// a character class. If a character class exceeds this limit, then the |
227 | /// sequence returned for it is infinite. |
228 | /// |
229 | /// This prevents classes like `[A-Z]` or `\pL` from getting turned into |
230 | /// huge and likely unproductive sequences of literals. |
231 | /// |
232 | /// # Example |
233 | /// |
234 | /// This example shows how this limit can be lowered to decrease the tolerance |
235 | /// for character classes being turned into literal sequences. |
236 | /// |
237 | /// ``` |
238 | /// use regex_syntax::{hir::literal::{Extractor, Seq}, parse}; |
239 | /// |
240 | /// let hir = parse(r"[0-9]" )?; |
241 | /// |
242 | /// let got = Extractor::new().extract(&hir); |
243 | /// let expected = Seq::new([ |
244 | /// "0" , "1" , "2" , "3" , "4" , "5" , "6" , "7" , "8" , "9" , |
245 | /// ]); |
246 | /// assert_eq!(expected, got); |
247 | /// |
248 | /// // Now let's shrink the limit and see how that changes things. |
249 | /// let got = Extractor::new().limit_class(4).extract(&hir); |
250 | /// let expected = Seq::infinite(); |
251 | /// assert_eq!(expected, got); |
252 | /// |
253 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |
254 | /// ``` |
255 | pub fn limit_class(&mut self, limit: usize) -> &mut Extractor { |
256 | self.limit_class = limit; |
257 | self |
258 | } |
259 | |
260 | /// Configure a limit on the total number of repetitions that is permitted |
261 | /// before literal extraction is stopped. |
262 | /// |
263 | /// This is useful for limiting things like `(abcde){50}`, or more |
264 | /// insidiously, `(?:){1000000000}`. This limit prevents any one single |
265 | /// repetition from adding too much to a literal sequence. |
266 | /// |
267 | /// With this limit set, repetitions that exceed it will be stopped and any |
268 | /// literals extracted up to that point will be made inexact. |
269 | /// |
270 | /// # Example |
271 | /// |
272 | /// This shows how to decrease the limit and compares it with the default. |
273 | /// |
274 | /// ``` |
275 | /// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse}; |
276 | /// |
277 | /// let hir = parse(r"(abc){8}" )?; |
278 | /// |
279 | /// let got = Extractor::new().extract(&hir); |
280 | /// let expected = Seq::new(["abcabcabcabcabcabcabcabc" ]); |
281 | /// assert_eq!(expected, got); |
282 | /// |
283 | /// // Now let's shrink the limit and see how that changes things. |
284 | /// let got = Extractor::new().limit_repeat(4).extract(&hir); |
285 | /// let expected = Seq::from_iter([ |
286 | /// Literal::inexact("abcabcabcabc" ), |
287 | /// ]); |
288 | /// assert_eq!(expected, got); |
289 | /// |
290 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |
291 | /// ``` |
292 | pub fn limit_repeat(&mut self, limit: usize) -> &mut Extractor { |
293 | self.limit_repeat = limit; |
294 | self |
295 | } |
296 | |
297 | /// Configure a limit on the maximum length of any literal in a sequence. |
298 | /// |
299 | /// This is useful for limiting things like `(abcde){5}{5}{5}{5}`. While |
300 | /// each repetition or literal in that regex is small, when all the |
301 | /// repetitions are applied, one ends up with a literal of length `5^4 = |
302 | /// 625`. |
303 | /// |
304 | /// With this limit set, literals that exceed it will be made inexact and |
305 | /// thus prevented from growing. |
306 | /// |
307 | /// # Example |
308 | /// |
309 | /// This shows how to decrease the limit and compares it with the default. |
310 | /// |
311 | /// ``` |
312 | /// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse}; |
313 | /// |
314 | /// let hir = parse(r"(abc){2}{2}{2}" )?; |
315 | /// |
316 | /// let got = Extractor::new().extract(&hir); |
317 | /// let expected = Seq::new(["abcabcabcabcabcabcabcabc" ]); |
318 | /// assert_eq!(expected, got); |
319 | /// |
320 | /// // Now let's shrink the limit and see how that changes things. |
321 | /// let got = Extractor::new().limit_literal_len(14).extract(&hir); |
322 | /// let expected = Seq::from_iter([ |
323 | /// Literal::inexact("abcabcabcabcab" ), |
324 | /// ]); |
325 | /// assert_eq!(expected, got); |
326 | /// |
327 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |
328 | /// ``` |
329 | pub fn limit_literal_len(&mut self, limit: usize) -> &mut Extractor { |
330 | self.limit_literal_len = limit; |
331 | self |
332 | } |
333 | |
334 | /// Configure a limit on the total number of literals that will be |
335 | /// returned. |
336 | /// |
337 | /// This is useful as a practical measure for avoiding the creation of |
338 | /// large sequences of literals. While the extractor will automatically |
339 | /// handle local creations of large sequences (for example, `[A-Z]` yields |
340 | /// an infinite sequence by default), large sequences can be created |
341 | /// through non-local means as well. |
342 | /// |
343 | /// For example, `[ab]{3}{3}` would yield a sequence of length `512 = 2^9` |
344 | /// despite each of the repetitions being small on their own. This limit |
345 | /// thus represents a "catch all" for avoiding locally small sequences from |
346 | /// combining into large sequences. |
347 | /// |
348 | /// # Example |
349 | /// |
350 | /// This example shows how reducing the limit will change the literal |
351 | /// sequence returned. |
352 | /// |
353 | /// ``` |
354 | /// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse}; |
355 | /// |
356 | /// let hir = parse(r"[ab]{2}{2}" )?; |
357 | /// |
358 | /// let got = Extractor::new().extract(&hir); |
359 | /// let expected = Seq::new([ |
360 | /// "aaaa" , "aaab" , "aaba" , "aabb" , |
361 | /// "abaa" , "abab" , "abba" , "abbb" , |
362 | /// "baaa" , "baab" , "baba" , "babb" , |
363 | /// "bbaa" , "bbab" , "bbba" , "bbbb" , |
364 | /// ]); |
365 | /// assert_eq!(expected, got); |
366 | /// |
367 | /// // The default limit is not too big, but big enough to extract all |
368 | /// // literals from '[ab]{2}{2}'. If we shrink the limit to less than 16, |
369 | /// // then we'll get a truncated set. Notice that it returns a sequence of |
370 | /// // length 4 even though our limit was 10. This is because the sequence |
371 | /// // is difficult to increase without blowing the limit. Notice also |
372 | /// // that every literal in the sequence is now inexact because they were |
373 | /// // stripped of some suffix. |
374 | /// let got = Extractor::new().limit_total(10).extract(&hir); |
375 | /// let expected = Seq::from_iter([ |
376 | /// Literal::inexact("aa" ), |
377 | /// Literal::inexact("ab" ), |
378 | /// Literal::inexact("ba" ), |
379 | /// Literal::inexact("bb" ), |
380 | /// ]); |
381 | /// assert_eq!(expected, got); |
382 | /// |
383 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |
384 | /// ``` |
385 | pub fn limit_total(&mut self, limit: usize) -> &mut Extractor { |
386 | self.limit_total = limit; |
387 | self |
388 | } |
389 | |
390 | /// Extract a sequence from the given concatenation. Sequences from each of |
391 | /// the child HIR expressions are combined via cross product. |
392 | /// |
393 | /// This short circuits once the cross product turns into a sequence |
394 | /// containing only inexact literals. |
395 | fn extract_concat<'a, I: Iterator<Item = &'a Hir>>(&self, it: I) -> Seq { |
396 | let mut seq = Seq::singleton(self::Literal::exact(vec![])); |
397 | for hir in it { |
398 | // If every element in the sequence is inexact, then a cross |
399 | // product will always be a no-op. Thus, there is nothing else we |
400 | // can add to it and can quit early. Note that this also includes |
401 | // infinite sequences. |
402 | if seq.is_inexact() { |
403 | break; |
404 | } |
405 | // Note that 'cross' also dispatches based on whether we're |
406 | // extracting prefixes or suffixes. |
407 | seq = self.cross(seq, &mut self.extract(hir)); |
408 | } |
409 | seq |
410 | } |
411 | |
412 | /// Extract a sequence from the given alternation. |
413 | /// |
414 | /// This short circuits once the union turns into an infinite sequence. |
415 | fn extract_alternation<'a, I: Iterator<Item = &'a Hir>>( |
416 | &self, |
417 | it: I, |
418 | ) -> Seq { |
419 | let mut seq = Seq::empty(); |
420 | for hir in it { |
421 | // Once our 'seq' is infinite, every subsequent union |
422 | // operation on it will itself always result in an |
423 | // infinite sequence. Thus, it can never change and we can |
424 | // short-circuit. |
425 | if !seq.is_finite() { |
426 | break; |
427 | } |
428 | seq = self.union(seq, &mut self.extract(hir)); |
429 | } |
430 | seq |
431 | } |
432 | |
433 | /// Extract a sequence of literals from the given repetition. We do our |
434 | /// best, Some examples: |
435 | /// |
436 | /// 'a*' => [inexact(a), exact("")] |
437 | /// 'a*?' => [exact(""), inexact(a)] |
438 | /// 'a+' => [inexact(a)] |
439 | /// 'a{3}' => [exact(aaa)] |
440 | /// 'a{3,5} => [inexact(aaa)] |
441 | /// |
442 | /// The key here really is making sure we get the 'inexact' vs 'exact' |
443 | /// attributes correct on each of the literals we add. For example, the |
444 | /// fact that 'a*' gives us an inexact 'a' and an exact empty string means |
445 | /// that a regex like 'ab*c' will result in [inexact(ab), exact(ac)] |
446 | /// literals being extracted, which might actually be a better prefilter |
447 | /// than just 'a'. |
448 | fn extract_repetition(&self, rep: &hir::Repetition) -> Seq { |
449 | let mut subseq = self.extract(&rep.sub); |
450 | match *rep { |
451 | hir::Repetition { min: 0, max, greedy, .. } => { |
452 | // When 'max=1', we can retain exactness, since 'a?' is |
453 | // equivalent to 'a|'. Similarly below, 'a??' is equivalent to |
454 | // '|a'. |
455 | if max != Some(1) { |
456 | subseq.make_inexact(); |
457 | } |
458 | let mut empty = Seq::singleton(Literal::exact(vec![])); |
459 | if !greedy { |
460 | mem::swap(&mut subseq, &mut empty); |
461 | } |
462 | self.union(subseq, &mut empty) |
463 | } |
464 | hir::Repetition { min, max: Some(max), .. } if min == max => { |
465 | assert!(min > 0); // handled above |
466 | let limit = |
467 | u32::try_from(self.limit_repeat).unwrap_or(u32::MAX); |
468 | let mut seq = Seq::singleton(Literal::exact(vec![])); |
469 | for _ in 0..cmp::min(min, limit) { |
470 | if seq.is_inexact() { |
471 | break; |
472 | } |
473 | seq = self.cross(seq, &mut subseq.clone()); |
474 | } |
475 | if usize::try_from(min).is_err() || min > limit { |
476 | seq.make_inexact(); |
477 | } |
478 | seq |
479 | } |
480 | hir::Repetition { min, max: Some(max), .. } if min < max => { |
481 | assert!(min > 0); // handled above |
482 | let limit = |
483 | u32::try_from(self.limit_repeat).unwrap_or(u32::MAX); |
484 | let mut seq = Seq::singleton(Literal::exact(vec![])); |
485 | for _ in 0..cmp::min(min, limit) { |
486 | if seq.is_inexact() { |
487 | break; |
488 | } |
489 | seq = self.cross(seq, &mut subseq.clone()); |
490 | } |
491 | seq.make_inexact(); |
492 | seq |
493 | } |
494 | hir::Repetition { .. } => { |
495 | subseq.make_inexact(); |
496 | subseq |
497 | } |
498 | } |
499 | } |
500 | |
501 | /// Convert the given Unicode class into a sequence of literals if the |
502 | /// class is small enough. If the class is too big, return an infinite |
503 | /// sequence. |
504 | fn extract_class_unicode(&self, cls: &hir::ClassUnicode) -> Seq { |
505 | if self.class_over_limit_unicode(cls) { |
506 | return Seq::infinite(); |
507 | } |
508 | let mut seq = Seq::empty(); |
509 | for r in cls.iter() { |
510 | for ch in r.start()..=r.end() { |
511 | seq.push(Literal::from(ch)); |
512 | } |
513 | } |
514 | self.enforce_literal_len(&mut seq); |
515 | seq |
516 | } |
517 | |
518 | /// Convert the given byte class into a sequence of literals if the class |
519 | /// is small enough. If the class is too big, return an infinite sequence. |
520 | fn extract_class_bytes(&self, cls: &hir::ClassBytes) -> Seq { |
521 | if self.class_over_limit_bytes(cls) { |
522 | return Seq::infinite(); |
523 | } |
524 | let mut seq = Seq::empty(); |
525 | for r in cls.iter() { |
526 | for b in r.start()..=r.end() { |
527 | seq.push(Literal::from(b)); |
528 | } |
529 | } |
530 | self.enforce_literal_len(&mut seq); |
531 | seq |
532 | } |
533 | |
534 | /// Returns true if the given Unicode class exceeds the configured limits |
535 | /// on this extractor. |
536 | fn class_over_limit_unicode(&self, cls: &hir::ClassUnicode) -> bool { |
537 | let mut count = 0; |
538 | for r in cls.iter() { |
539 | if count > self.limit_class { |
540 | return true; |
541 | } |
542 | count += r.len(); |
543 | } |
544 | count > self.limit_class |
545 | } |
546 | |
547 | /// Returns true if the given byte class exceeds the configured limits on |
548 | /// this extractor. |
549 | fn class_over_limit_bytes(&self, cls: &hir::ClassBytes) -> bool { |
550 | let mut count = 0; |
551 | for r in cls.iter() { |
552 | if count > self.limit_class { |
553 | return true; |
554 | } |
555 | count += r.len(); |
556 | } |
557 | count > self.limit_class |
558 | } |
559 | |
560 | /// Compute the cross product of the two sequences if the result would be |
561 | /// within configured limits. Otherwise, make `seq2` infinite and cross the |
562 | /// infinite sequence with `seq1`. |
563 | fn cross(&self, mut seq1: Seq, seq2: &mut Seq) -> Seq { |
564 | if seq1.max_cross_len(seq2).map_or(false, |len| len > self.limit_total) |
565 | { |
566 | seq2.make_infinite(); |
567 | } |
568 | if let ExtractKind::Suffix = self.kind { |
569 | seq1.cross_reverse(seq2); |
570 | } else { |
571 | seq1.cross_forward(seq2); |
572 | } |
573 | assert!(seq1.len().map_or(true, |x| x <= self.limit_total)); |
574 | self.enforce_literal_len(&mut seq1); |
575 | seq1 |
576 | } |
577 | |
578 | /// Union the two sequences if the result would be within configured |
579 | /// limits. Otherwise, make `seq2` infinite and union the infinite sequence |
580 | /// with `seq1`. |
581 | fn union(&self, mut seq1: Seq, seq2: &mut Seq) -> Seq { |
582 | if seq1.max_union_len(seq2).map_or(false, |len| len > self.limit_total) |
583 | { |
584 | // We try to trim our literal sequences to see if we can make |
585 | // room for more literals. The idea is that we'd rather trim down |
586 | // literals already in our sequence if it means we can add a few |
587 | // more and retain a finite sequence. Otherwise, we'll union with |
588 | // an infinite sequence and that infects everything and effectively |
589 | // stops literal extraction in its tracks. |
590 | // |
591 | // We do we keep 4 bytes here? Well, it's a bit of an abstraction |
592 | // leakage. Downstream, the literals may wind up getting fed to |
593 | // the Teddy algorithm, which supports searching literals up to |
594 | // length 4. So that's why we pick that number here. Arguably this |
595 | // should be a tuneable parameter, but it seems a little tricky to |
596 | // describe. And I'm still unsure if this is the right way to go |
597 | // about culling literal sequences. |
598 | match self.kind { |
599 | ExtractKind::Prefix => { |
600 | seq1.keep_first_bytes(4); |
601 | seq2.keep_first_bytes(4); |
602 | } |
603 | ExtractKind::Suffix => { |
604 | seq1.keep_last_bytes(4); |
605 | seq2.keep_last_bytes(4); |
606 | } |
607 | } |
608 | seq1.dedup(); |
609 | seq2.dedup(); |
610 | if seq1 |
611 | .max_union_len(seq2) |
612 | .map_or(false, |len| len > self.limit_total) |
613 | { |
614 | seq2.make_infinite(); |
615 | } |
616 | } |
617 | seq1.union(seq2); |
618 | assert!(seq1.len().map_or(true, |x| x <= self.limit_total)); |
619 | seq1 |
620 | } |
621 | |
622 | /// Applies the literal length limit to the given sequence. If none of the |
623 | /// literals in the sequence exceed the limit, then this is a no-op. |
624 | fn enforce_literal_len(&self, seq: &mut Seq) { |
625 | let len = self.limit_literal_len; |
626 | match self.kind { |
627 | ExtractKind::Prefix => seq.keep_first_bytes(len), |
628 | ExtractKind::Suffix => seq.keep_last_bytes(len), |
629 | } |
630 | } |
631 | } |
632 | |
633 | impl Default for Extractor { |
634 | fn default() -> Extractor { |
635 | Extractor::new() |
636 | } |
637 | } |
638 | |
639 | /// The kind of literals to extract from an [`Hir`] expression. |
640 | /// |
641 | /// The default extraction kind is `Prefix`. |
642 | #[non_exhaustive ] |
643 | #[derive (Clone, Debug)] |
644 | pub enum ExtractKind { |
645 | /// Extracts only prefix literals from a regex. |
646 | Prefix, |
647 | /// Extracts only suffix literals from a regex. |
648 | /// |
649 | /// Note that the sequence returned by suffix literals currently may |
650 | /// not correctly represent leftmost-first or "preference" order match |
651 | /// semantics. |
652 | Suffix, |
653 | } |
654 | |
655 | impl ExtractKind { |
656 | /// Returns true if this kind is the `Prefix` variant. |
657 | pub fn is_prefix(&self) -> bool { |
658 | matches!(*self, ExtractKind::Prefix) |
659 | } |
660 | |
661 | /// Returns true if this kind is the `Suffix` variant. |
662 | pub fn is_suffix(&self) -> bool { |
663 | matches!(*self, ExtractKind::Suffix) |
664 | } |
665 | } |
666 | |
667 | impl Default for ExtractKind { |
668 | fn default() -> ExtractKind { |
669 | ExtractKind::Prefix |
670 | } |
671 | } |
672 | |
673 | /// A sequence of literals. |
674 | /// |
675 | /// A `Seq` is very much like a set in that it represents a union of its |
676 | /// members. That is, it corresponds to a set of literals where at least one |
677 | /// must match in order for a particular [`Hir`] expression to match. (Whether |
678 | /// this corresponds to the entire `Hir` expression, a prefix of it or a suffix |
679 | /// of it depends on how the `Seq` was extracted from the `Hir`.) |
680 | /// |
681 | /// It is also unlike a set in that multiple identical literals may appear, |
682 | /// and that the order of the literals in the `Seq` matters. For example, if |
683 | /// the sequence is `[sam, samwise]` and leftmost-first matching is used, then |
684 | /// `samwise` can never match and the sequence is equivalent to `[sam]`. |
685 | /// |
686 | /// # States of a sequence |
687 | /// |
688 | /// A `Seq` has a few different logical states to consider: |
689 | /// |
690 | /// * The sequence can represent "any" literal. When this happens, the set does |
691 | /// not have a finite size. The purpose of this state is to inhibit callers |
692 | /// from making assumptions about what literals are required in order to match |
693 | /// a particular [`Hir`] expression. Generally speaking, when a set is in this |
694 | /// state, literal optimizations are inhibited. A good example of a regex that |
695 | /// will cause this sort of set to apppear is `[A-Za-z]`. The character class |
696 | /// is just too big (and also too narrow) to be usefully expanded into 52 |
697 | /// different literals. (Note that the decision for when a seq should become |
698 | /// infinite is determined by the caller. A seq itself has no hard-coded |
699 | /// limits.) |
700 | /// * The sequence can be empty, in which case, it is an affirmative statement |
701 | /// that there are no literals that can match the corresponding `Hir`. |
702 | /// Consequently, the `Hir` never matches any input. For example, `[a&&b]`. |
703 | /// * The sequence can be non-empty, in which case, at least one of the |
704 | /// literals must match in order for the corresponding `Hir` to match. |
705 | /// |
706 | /// # Example |
707 | /// |
708 | /// This example shows how literal sequences can be simplified by stripping |
709 | /// suffixes and minimizing while maintaining preference order. |
710 | /// |
711 | /// ``` |
712 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
713 | /// |
714 | /// let mut seq = Seq::new(&[ |
715 | /// "farm" , |
716 | /// "appliance" , |
717 | /// "faraway" , |
718 | /// "apple" , |
719 | /// "fare" , |
720 | /// "gap" , |
721 | /// "applicant" , |
722 | /// "applaud" , |
723 | /// ]); |
724 | /// seq.keep_first_bytes(3); |
725 | /// seq.minimize_by_preference(); |
726 | /// // Notice that 'far' comes before 'app', which matches the order in the |
727 | /// // original sequence. This guarantees that leftmost-first semantics are |
728 | /// // not altered by simplifying the set. |
729 | /// let expected = Seq::from_iter([ |
730 | /// Literal::inexact("far" ), |
731 | /// Literal::inexact("app" ), |
732 | /// Literal::exact("gap" ), |
733 | /// ]); |
734 | /// assert_eq!(expected, seq); |
735 | /// ``` |
736 | #[derive (Clone, Eq, PartialEq)] |
737 | pub struct Seq { |
738 | /// The members of this seq. |
739 | /// |
740 | /// When `None`, the seq represents all possible literals. That is, it |
741 | /// prevents one from making assumptions about specific literals in the |
742 | /// seq, and forces one to treat it as if any literal might be in the seq. |
743 | /// |
744 | /// Note that `Some(vec![])` is valid and corresponds to the empty seq of |
745 | /// literals, i.e., a regex that can never match. For example, `[a&&b]`. |
746 | /// It is distinct from `Some(vec![""])`, which corresponds to the seq |
747 | /// containing an empty string, which matches at every position. |
748 | literals: Option<Vec<Literal>>, |
749 | } |
750 | |
751 | impl Seq { |
752 | /// Returns an empty sequence. |
753 | /// |
754 | /// An empty sequence matches zero literals, and thus corresponds to a |
755 | /// regex that itself can never match. |
756 | #[inline ] |
757 | pub fn empty() -> Seq { |
758 | Seq { literals: Some(vec![]) } |
759 | } |
760 | |
761 | /// Returns a sequence of literals without a finite size and may contain |
762 | /// any literal. |
763 | /// |
764 | /// A sequence without finite size does not reveal anything about the |
765 | /// characteristics of the literals in its set. There are no fixed prefixes |
766 | /// or suffixes, nor are lower or upper bounds on the length of the literals |
767 | /// in the set known. |
768 | /// |
769 | /// This is useful to represent constructs in a regex that are "too big" |
770 | /// to useful represent as a sequence of literals. For example, `[A-Za-z]`. |
771 | /// When sequences get too big, they lose their discriminating nature and |
772 | /// are more likely to produce false positives, which in turn makes them |
773 | /// less likely to speed up searches. |
774 | /// |
775 | /// More pragmatically, for many regexes, enumerating all possible literals |
776 | /// is itself not possible or might otherwise use too many resources. So |
777 | /// constraining the size of sets during extraction is a practical trade |
778 | /// off to make. |
779 | #[inline ] |
780 | pub fn infinite() -> Seq { |
781 | Seq { literals: None } |
782 | } |
783 | |
784 | /// Returns a sequence containing a single literal. |
785 | #[inline ] |
786 | pub fn singleton(lit: Literal) -> Seq { |
787 | Seq { literals: Some(vec![lit]) } |
788 | } |
789 | |
790 | /// Returns a sequence of exact literals from the given byte strings. |
791 | #[inline ] |
792 | pub fn new<I, B>(it: I) -> Seq |
793 | where |
794 | I: IntoIterator<Item = B>, |
795 | B: AsRef<[u8]>, |
796 | { |
797 | it.into_iter().map(|b| Literal::exact(b.as_ref())).collect() |
798 | } |
799 | |
800 | /// If this is a finite sequence, return its members as a slice of |
801 | /// literals. |
802 | /// |
803 | /// The slice returned may be empty, in which case, there are no literals |
804 | /// that can match this sequence. |
805 | #[inline ] |
806 | pub fn literals(&self) -> Option<&[Literal]> { |
807 | self.literals.as_deref() |
808 | } |
809 | |
810 | /// Push a literal to the end of this sequence. |
811 | /// |
812 | /// If this sequence is not finite, then this is a no-op. |
813 | /// |
814 | /// Similarly, if the most recently added item of this sequence is |
815 | /// equivalent to the literal given, then it is not added. This reflects |
816 | /// a `Seq`'s "set like" behavior, and represents a practical trade off. |
817 | /// Namely, there is never any need to have two adjacent and equivalent |
818 | /// literals in the same sequence, _and_ it is easy to detect in some |
819 | /// cases. |
820 | #[inline ] |
821 | pub fn push(&mut self, lit: Literal) { |
822 | let lits = match self.literals { |
823 | None => return, |
824 | Some(ref mut lits) => lits, |
825 | }; |
826 | if lits.last().map_or(false, |m| m == &lit) { |
827 | return; |
828 | } |
829 | lits.push(lit); |
830 | } |
831 | |
832 | /// Make all of the literals in this sequence inexact. |
833 | /// |
834 | /// This is a no-op if this sequence is not finite. |
835 | #[inline ] |
836 | pub fn make_inexact(&mut self) { |
837 | let lits = match self.literals { |
838 | None => return, |
839 | Some(ref mut lits) => lits, |
840 | }; |
841 | for lit in lits.iter_mut() { |
842 | lit.make_inexact(); |
843 | } |
844 | } |
845 | |
846 | /// Converts this sequence to an infinite sequence. |
847 | /// |
848 | /// This is a no-op if the sequence is already infinite. |
849 | #[inline ] |
850 | pub fn make_infinite(&mut self) { |
851 | self.literals = None; |
852 | } |
853 | |
854 | /// Modify this sequence to contain the cross product between it and the |
855 | /// sequence given. |
856 | /// |
857 | /// The cross product only considers literals in this sequence that are |
858 | /// exact. That is, inexact literals are not extended. |
859 | /// |
860 | /// The literals are always drained from `other`, even if none are used. |
861 | /// This permits callers to reuse the sequence allocation elsewhere. |
862 | /// |
863 | /// If this sequence is infinite, then this is a no-op, regardless of what |
864 | /// `other` contains (and in this case, the literals are still drained from |
865 | /// `other`). If `other` is infinite and this sequence is finite, then this |
866 | /// is a no-op, unless this sequence contains a zero-length literal. In |
867 | /// which case, the infiniteness of `other` infects this sequence, and this |
868 | /// sequence is itself made infinite. |
869 | /// |
870 | /// Like [`Seq::union`], this may attempt to deduplicate literals. See |
871 | /// [`Seq::dedup`] for how deduplication deals with exact and inexact |
872 | /// literals. |
873 | /// |
874 | /// # Example |
875 | /// |
876 | /// This example shows basic usage and how exact and inexact literals |
877 | /// interact. |
878 | /// |
879 | /// ``` |
880 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
881 | /// |
882 | /// let mut seq1 = Seq::from_iter([ |
883 | /// Literal::exact("foo" ), |
884 | /// Literal::inexact("bar" ), |
885 | /// ]); |
886 | /// let mut seq2 = Seq::from_iter([ |
887 | /// Literal::inexact("quux" ), |
888 | /// Literal::exact("baz" ), |
889 | /// ]); |
890 | /// seq1.cross_forward(&mut seq2); |
891 | /// |
892 | /// // The literals are pulled out of seq2. |
893 | /// assert_eq!(Some(0), seq2.len()); |
894 | /// |
895 | /// let expected = Seq::from_iter([ |
896 | /// Literal::inexact("fooquux" ), |
897 | /// Literal::exact("foobaz" ), |
898 | /// Literal::inexact("bar" ), |
899 | /// ]); |
900 | /// assert_eq!(expected, seq1); |
901 | /// ``` |
902 | /// |
903 | /// This example shows the behavior of when `other` is an infinite |
904 | /// sequence. |
905 | /// |
906 | /// ``` |
907 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
908 | /// |
909 | /// let mut seq1 = Seq::from_iter([ |
910 | /// Literal::exact("foo" ), |
911 | /// Literal::inexact("bar" ), |
912 | /// ]); |
913 | /// let mut seq2 = Seq::infinite(); |
914 | /// seq1.cross_forward(&mut seq2); |
915 | /// |
916 | /// // When seq2 is infinite, cross product doesn't add anything, but |
917 | /// // ensures all members of seq1 are inexact. |
918 | /// let expected = Seq::from_iter([ |
919 | /// Literal::inexact("foo" ), |
920 | /// Literal::inexact("bar" ), |
921 | /// ]); |
922 | /// assert_eq!(expected, seq1); |
923 | /// ``` |
924 | /// |
925 | /// This example is like the one above, but shows what happens when this |
926 | /// sequence contains an empty string. In this case, an infinite `other` |
927 | /// sequence infects this sequence (because the empty string means that |
928 | /// there are no finite prefixes): |
929 | /// |
930 | /// ``` |
931 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
932 | /// |
933 | /// let mut seq1 = Seq::from_iter([ |
934 | /// Literal::exact("foo" ), |
935 | /// Literal::exact("" ), // inexact provokes same behavior |
936 | /// Literal::inexact("bar" ), |
937 | /// ]); |
938 | /// let mut seq2 = Seq::infinite(); |
939 | /// seq1.cross_forward(&mut seq2); |
940 | /// |
941 | /// // seq1 is now infinite! |
942 | /// assert!(!seq1.is_finite()); |
943 | /// ``` |
944 | /// |
945 | /// This example shows the behavior of this sequence is infinite. |
946 | /// |
947 | /// ``` |
948 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
949 | /// |
950 | /// let mut seq1 = Seq::infinite(); |
951 | /// let mut seq2 = Seq::from_iter([ |
952 | /// Literal::exact("foo" ), |
953 | /// Literal::inexact("bar" ), |
954 | /// ]); |
955 | /// seq1.cross_forward(&mut seq2); |
956 | /// |
957 | /// // seq1 remains unchanged. |
958 | /// assert!(!seq1.is_finite()); |
959 | /// // Even though the literals in seq2 weren't used, it was still drained. |
960 | /// assert_eq!(Some(0), seq2.len()); |
961 | /// ``` |
962 | #[inline ] |
963 | pub fn cross_forward(&mut self, other: &mut Seq) { |
964 | let (lits1, lits2) = match self.cross_preamble(other) { |
965 | None => return, |
966 | Some((lits1, lits2)) => (lits1, lits2), |
967 | }; |
968 | let newcap = lits1.len().saturating_mul(lits2.len()); |
969 | for selflit in mem::replace(lits1, Vec::with_capacity(newcap)) { |
970 | if !selflit.is_exact() { |
971 | lits1.push(selflit); |
972 | continue; |
973 | } |
974 | for otherlit in lits2.iter() { |
975 | let mut newlit = Literal::exact(Vec::with_capacity( |
976 | selflit.len() + otherlit.len(), |
977 | )); |
978 | newlit.extend(&selflit); |
979 | newlit.extend(&otherlit); |
980 | if !otherlit.is_exact() { |
981 | newlit.make_inexact(); |
982 | } |
983 | lits1.push(newlit); |
984 | } |
985 | } |
986 | lits2.drain(..); |
987 | self.dedup(); |
988 | } |
989 | |
990 | /// Modify this sequence to contain the cross product between it and |
991 | /// the sequence given, where the sequences are treated as suffixes |
992 | /// instead of prefixes. Namely, the sequence `other` is *prepended* |
993 | /// to `self` (as opposed to `other` being *appended* to `self` in |
994 | /// [`Seq::cross_forward`]). |
995 | /// |
996 | /// The cross product only considers literals in this sequence that are |
997 | /// exact. That is, inexact literals are not extended. |
998 | /// |
999 | /// The literals are always drained from `other`, even if none are used. |
1000 | /// This permits callers to reuse the sequence allocation elsewhere. |
1001 | /// |
1002 | /// If this sequence is infinite, then this is a no-op, regardless of what |
1003 | /// `other` contains (and in this case, the literals are still drained from |
1004 | /// `other`). If `other` is infinite and this sequence is finite, then this |
1005 | /// is a no-op, unless this sequence contains a zero-length literal. In |
1006 | /// which case, the infiniteness of `other` infects this sequence, and this |
1007 | /// sequence is itself made infinite. |
1008 | /// |
1009 | /// Like [`Seq::union`], this may attempt to deduplicate literals. See |
1010 | /// [`Seq::dedup`] for how deduplication deals with exact and inexact |
1011 | /// literals. |
1012 | /// |
1013 | /// # Example |
1014 | /// |
1015 | /// This example shows basic usage and how exact and inexact literals |
1016 | /// interact. |
1017 | /// |
1018 | /// ``` |
1019 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
1020 | /// |
1021 | /// let mut seq1 = Seq::from_iter([ |
1022 | /// Literal::exact("foo" ), |
1023 | /// Literal::inexact("bar" ), |
1024 | /// ]); |
1025 | /// let mut seq2 = Seq::from_iter([ |
1026 | /// Literal::inexact("quux" ), |
1027 | /// Literal::exact("baz" ), |
1028 | /// ]); |
1029 | /// seq1.cross_reverse(&mut seq2); |
1030 | /// |
1031 | /// // The literals are pulled out of seq2. |
1032 | /// assert_eq!(Some(0), seq2.len()); |
1033 | /// |
1034 | /// let expected = Seq::from_iter([ |
1035 | /// Literal::inexact("quuxfoo" ), |
1036 | /// Literal::inexact("bar" ), |
1037 | /// Literal::exact("bazfoo" ), |
1038 | /// ]); |
1039 | /// assert_eq!(expected, seq1); |
1040 | /// ``` |
1041 | /// |
1042 | /// This example shows the behavior of when `other` is an infinite |
1043 | /// sequence. |
1044 | /// |
1045 | /// ``` |
1046 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
1047 | /// |
1048 | /// let mut seq1 = Seq::from_iter([ |
1049 | /// Literal::exact("foo" ), |
1050 | /// Literal::inexact("bar" ), |
1051 | /// ]); |
1052 | /// let mut seq2 = Seq::infinite(); |
1053 | /// seq1.cross_reverse(&mut seq2); |
1054 | /// |
1055 | /// // When seq2 is infinite, cross product doesn't add anything, but |
1056 | /// // ensures all members of seq1 are inexact. |
1057 | /// let expected = Seq::from_iter([ |
1058 | /// Literal::inexact("foo" ), |
1059 | /// Literal::inexact("bar" ), |
1060 | /// ]); |
1061 | /// assert_eq!(expected, seq1); |
1062 | /// ``` |
1063 | /// |
1064 | /// This example is like the one above, but shows what happens when this |
1065 | /// sequence contains an empty string. In this case, an infinite `other` |
1066 | /// sequence infects this sequence (because the empty string means that |
1067 | /// there are no finite suffixes): |
1068 | /// |
1069 | /// ``` |
1070 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
1071 | /// |
1072 | /// let mut seq1 = Seq::from_iter([ |
1073 | /// Literal::exact("foo" ), |
1074 | /// Literal::exact("" ), // inexact provokes same behavior |
1075 | /// Literal::inexact("bar" ), |
1076 | /// ]); |
1077 | /// let mut seq2 = Seq::infinite(); |
1078 | /// seq1.cross_reverse(&mut seq2); |
1079 | /// |
1080 | /// // seq1 is now infinite! |
1081 | /// assert!(!seq1.is_finite()); |
1082 | /// ``` |
1083 | /// |
1084 | /// This example shows the behavior when this sequence is infinite. |
1085 | /// |
1086 | /// ``` |
1087 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
1088 | /// |
1089 | /// let mut seq1 = Seq::infinite(); |
1090 | /// let mut seq2 = Seq::from_iter([ |
1091 | /// Literal::exact("foo" ), |
1092 | /// Literal::inexact("bar" ), |
1093 | /// ]); |
1094 | /// seq1.cross_reverse(&mut seq2); |
1095 | /// |
1096 | /// // seq1 remains unchanged. |
1097 | /// assert!(!seq1.is_finite()); |
1098 | /// // Even though the literals in seq2 weren't used, it was still drained. |
1099 | /// assert_eq!(Some(0), seq2.len()); |
1100 | /// ``` |
1101 | #[inline ] |
1102 | pub fn cross_reverse(&mut self, other: &mut Seq) { |
1103 | let (lits1, lits2) = match self.cross_preamble(other) { |
1104 | None => return, |
1105 | Some((lits1, lits2)) => (lits1, lits2), |
1106 | }; |
1107 | // We basically proceed as we do in 'cross_forward' at this point, |
1108 | // except that the outer loop is now 'other' and the inner loop is now |
1109 | // 'self'. That's because 'self' corresponds to suffixes and 'other' |
1110 | // corresponds to the sequence we want to *prepend* to the suffixes. |
1111 | let newcap = lits1.len().saturating_mul(lits2.len()); |
1112 | let selflits = mem::replace(lits1, Vec::with_capacity(newcap)); |
1113 | for (i, otherlit) in lits2.drain(..).enumerate() { |
1114 | for selflit in selflits.iter() { |
1115 | if !selflit.is_exact() { |
1116 | // If the suffix isn't exact, then we can't prepend |
1117 | // anything to it. However, we still want to keep it. But |
1118 | // we only want to keep one of them, to avoid duplication. |
1119 | // (The duplication is okay from a correctness perspective, |
1120 | // but wasteful.) |
1121 | if i == 0 { |
1122 | lits1.push(selflit.clone()); |
1123 | } |
1124 | continue; |
1125 | } |
1126 | let mut newlit = Literal::exact(Vec::with_capacity( |
1127 | otherlit.len() + selflit.len(), |
1128 | )); |
1129 | newlit.extend(&otherlit); |
1130 | newlit.extend(&selflit); |
1131 | if !otherlit.is_exact() { |
1132 | newlit.make_inexact(); |
1133 | } |
1134 | lits1.push(newlit); |
1135 | } |
1136 | } |
1137 | self.dedup(); |
1138 | } |
1139 | |
1140 | /// A helper function the corresponds to the subtle preamble for both |
1141 | /// `cross_forward` and `cross_reverse`. In effect, it handles the cases |
1142 | /// of infinite sequences for both `self` and `other`, as well as ensuring |
1143 | /// that literals from `other` are drained even if they aren't used. |
1144 | fn cross_preamble<'a>( |
1145 | &'a mut self, |
1146 | other: &'a mut Seq, |
1147 | ) -> Option<(&'a mut Vec<Literal>, &'a mut Vec<Literal>)> { |
1148 | let lits2 = match other.literals { |
1149 | None => { |
1150 | // If our current seq contains the empty string and the seq |
1151 | // we're adding matches any literal, then it follows that the |
1152 | // current seq must now also match any literal. |
1153 | // |
1154 | // Otherwise, we just have to make sure everything in this |
1155 | // sequence is inexact. |
1156 | if self.min_literal_len() == Some(0) { |
1157 | *self = Seq::infinite(); |
1158 | } else { |
1159 | self.make_inexact(); |
1160 | } |
1161 | return None; |
1162 | } |
1163 | Some(ref mut lits) => lits, |
1164 | }; |
1165 | let lits1 = match self.literals { |
1166 | None => { |
1167 | // If we aren't going to make it to the end of this routine |
1168 | // where lits2 is drained, then we need to do it now. |
1169 | lits2.drain(..); |
1170 | return None; |
1171 | } |
1172 | Some(ref mut lits) => lits, |
1173 | }; |
1174 | Some((lits1, lits2)) |
1175 | } |
1176 | |
1177 | /// Unions the `other` sequence into this one. |
1178 | /// |
1179 | /// The literals are always drained out of the given `other` sequence, |
1180 | /// even if they are being unioned into an infinite sequence. This permits |
1181 | /// the caller to reuse the `other` sequence in another context. |
1182 | /// |
1183 | /// Some literal deduping may be performed. If any deduping happens, |
1184 | /// any leftmost-first or "preference" order match semantics will be |
1185 | /// preserved. |
1186 | /// |
1187 | /// # Example |
1188 | /// |
1189 | /// This example shows basic usage. |
1190 | /// |
1191 | /// ``` |
1192 | /// use regex_syntax::hir::literal::Seq; |
1193 | /// |
1194 | /// let mut seq1 = Seq::new(&["foo" , "bar" ]); |
1195 | /// let mut seq2 = Seq::new(&["bar" , "quux" , "foo" ]); |
1196 | /// seq1.union(&mut seq2); |
1197 | /// |
1198 | /// // The literals are pulled out of seq2. |
1199 | /// assert_eq!(Some(0), seq2.len()); |
1200 | /// |
1201 | /// // Adjacent literals are deduped, but non-adjacent literals may not be. |
1202 | /// assert_eq!(Seq::new(&["foo" , "bar" , "quux" , "foo" ]), seq1); |
1203 | /// ``` |
1204 | /// |
1205 | /// This example shows that literals are drained from `other` even when |
1206 | /// they aren't necessarily used. |
1207 | /// |
1208 | /// ``` |
1209 | /// use regex_syntax::hir::literal::Seq; |
1210 | /// |
1211 | /// let mut seq1 = Seq::infinite(); |
1212 | /// // Infinite sequences have no finite length. |
1213 | /// assert_eq!(None, seq1.len()); |
1214 | /// |
1215 | /// let mut seq2 = Seq::new(&["bar" , "quux" , "foo" ]); |
1216 | /// seq1.union(&mut seq2); |
1217 | /// |
1218 | /// // seq1 is still infinite and seq2 has been drained. |
1219 | /// assert_eq!(None, seq1.len()); |
1220 | /// assert_eq!(Some(0), seq2.len()); |
1221 | /// ``` |
1222 | #[inline ] |
1223 | pub fn union(&mut self, other: &mut Seq) { |
1224 | let lits2 = match other.literals { |
1225 | None => { |
1226 | // Unioning with an infinite sequence always results in an |
1227 | // infinite sequence. |
1228 | self.make_infinite(); |
1229 | return; |
1230 | } |
1231 | Some(ref mut lits) => lits.drain(..), |
1232 | }; |
1233 | let lits1 = match self.literals { |
1234 | None => return, |
1235 | Some(ref mut lits) => lits, |
1236 | }; |
1237 | lits1.extend(lits2); |
1238 | self.dedup(); |
1239 | } |
1240 | |
1241 | /// Unions the `other` sequence into this one by splice the `other` |
1242 | /// sequence at the position of the first zero-length literal. |
1243 | /// |
1244 | /// This is useful for preserving preference order semantics when combining |
1245 | /// two literal sequences. For example, in the regex `(a||f)+foo`, the |
1246 | /// correct preference order prefix sequence is `[a, foo, f]`. |
1247 | /// |
1248 | /// The literals are always drained out of the given `other` sequence, |
1249 | /// even if they are being unioned into an infinite sequence. This permits |
1250 | /// the caller to reuse the `other` sequence in another context. Note that |
1251 | /// the literals are drained even if no union is performed as well, i.e., |
1252 | /// when this sequence does not contain a zero-length literal. |
1253 | /// |
1254 | /// Some literal deduping may be performed. If any deduping happens, |
1255 | /// any leftmost-first or "preference" order match semantics will be |
1256 | /// preserved. |
1257 | /// |
1258 | /// # Example |
1259 | /// |
1260 | /// This example shows basic usage. |
1261 | /// |
1262 | /// ``` |
1263 | /// use regex_syntax::hir::literal::Seq; |
1264 | /// |
1265 | /// let mut seq1 = Seq::new(&["a" , "" , "f" , "" ]); |
1266 | /// let mut seq2 = Seq::new(&["foo" ]); |
1267 | /// seq1.union_into_empty(&mut seq2); |
1268 | /// |
1269 | /// // The literals are pulled out of seq2. |
1270 | /// assert_eq!(Some(0), seq2.len()); |
1271 | /// // 'foo' gets spliced into seq1 where the first empty string occurs. |
1272 | /// assert_eq!(Seq::new(&["a" , "foo" , "f" ]), seq1); |
1273 | /// ``` |
1274 | /// |
1275 | /// This example shows that literals are drained from `other` even when |
1276 | /// they aren't necessarily used. |
1277 | /// |
1278 | /// ``` |
1279 | /// use regex_syntax::hir::literal::Seq; |
1280 | /// |
1281 | /// let mut seq1 = Seq::new(&["foo" , "bar" ]); |
1282 | /// let mut seq2 = Seq::new(&["bar" , "quux" , "foo" ]); |
1283 | /// seq1.union_into_empty(&mut seq2); |
1284 | /// |
1285 | /// // seq1 has no zero length literals, so no splicing happens. |
1286 | /// assert_eq!(Seq::new(&["foo" , "bar" ]), seq1); |
1287 | /// // Even though no splicing happens, seq2 is still drained. |
1288 | /// assert_eq!(Some(0), seq2.len()); |
1289 | /// ``` |
1290 | #[inline ] |
1291 | pub fn union_into_empty(&mut self, other: &mut Seq) { |
1292 | let lits2 = other.literals.as_mut().map(|lits| lits.drain(..)); |
1293 | let lits1 = match self.literals { |
1294 | None => return, |
1295 | Some(ref mut lits) => lits, |
1296 | }; |
1297 | let first_empty = match lits1.iter().position(|m| m.is_empty()) { |
1298 | None => return, |
1299 | Some(i) => i, |
1300 | }; |
1301 | let lits2 = match lits2 { |
1302 | None => { |
1303 | // Note that we are only here if we've found an empty literal, |
1304 | // which implies that an infinite sequence infects this seq and |
1305 | // also turns it into an infinite sequence. |
1306 | self.literals = None; |
1307 | return; |
1308 | } |
1309 | Some(lits) => lits, |
1310 | }; |
1311 | // Clearing out the empties needs to come before the splice because |
1312 | // the splice might add more empties that we don't want to get rid |
1313 | // of. Since we're splicing into the position of the first empty, the |
1314 | // 'first_empty' position computed above is still correct. |
1315 | lits1.retain(|m| !m.is_empty()); |
1316 | lits1.splice(first_empty..first_empty, lits2); |
1317 | self.dedup(); |
1318 | } |
1319 | |
1320 | /// Deduplicate adjacent equivalent literals in this sequence. |
1321 | /// |
1322 | /// If adjacent literals are equivalent strings but one is exact and the |
1323 | /// other inexact, the inexact literal is kept and the exact one is |
1324 | /// removed. |
1325 | /// |
1326 | /// Deduping an infinite sequence is a no-op. |
1327 | /// |
1328 | /// # Example |
1329 | /// |
1330 | /// This example shows how literals that are duplicate byte strings but |
1331 | /// are not equivalent with respect to exactness are resolved. |
1332 | /// |
1333 | /// ``` |
1334 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
1335 | /// |
1336 | /// let mut seq = Seq::from_iter([ |
1337 | /// Literal::exact("foo" ), |
1338 | /// Literal::inexact("foo" ), |
1339 | /// ]); |
1340 | /// seq.dedup(); |
1341 | /// |
1342 | /// assert_eq!(Seq::from_iter([Literal::inexact("foo" )]), seq); |
1343 | /// ``` |
1344 | #[inline ] |
1345 | pub fn dedup(&mut self) { |
1346 | if let Some(ref mut lits) = self.literals { |
1347 | lits.dedup_by(|lit1, lit2| { |
1348 | if lit1.as_bytes() != lit2.as_bytes() { |
1349 | return false; |
1350 | } |
1351 | if lit1.is_exact() != lit2.is_exact() { |
1352 | lit1.make_inexact(); |
1353 | lit2.make_inexact(); |
1354 | } |
1355 | true |
1356 | }); |
1357 | } |
1358 | } |
1359 | |
1360 | /// Sorts this sequence of literals lexicographically. |
1361 | /// |
1362 | /// Note that if, before sorting, if a literal that is a prefix of another |
1363 | /// literal appears after it, then after sorting, the sequence will not |
1364 | /// represent the same preference order match semantics. For example, |
1365 | /// sorting the sequence `[samwise, sam]` yields the sequence `[sam, |
1366 | /// samwise]`. Under preference order semantics, the latter sequence will |
1367 | /// never match `samwise` where as the first sequence can. |
1368 | /// |
1369 | /// # Example |
1370 | /// |
1371 | /// This example shows basic usage. |
1372 | /// |
1373 | /// ``` |
1374 | /// use regex_syntax::hir::literal::Seq; |
1375 | /// |
1376 | /// let mut seq = Seq::new(&["foo" , "quux" , "bar" ]); |
1377 | /// seq.sort(); |
1378 | /// |
1379 | /// assert_eq!(Seq::new(&["bar" , "foo" , "quux" ]), seq); |
1380 | /// ``` |
1381 | #[inline ] |
1382 | pub fn sort(&mut self) { |
1383 | if let Some(ref mut lits) = self.literals { |
1384 | lits.sort(); |
1385 | } |
1386 | } |
1387 | |
1388 | /// Reverses all of the literals in this sequence. |
1389 | /// |
1390 | /// The order of the sequence itself is preserved. |
1391 | /// |
1392 | /// # Example |
1393 | /// |
1394 | /// This example shows basic usage. |
1395 | /// |
1396 | /// ``` |
1397 | /// use regex_syntax::hir::literal::Seq; |
1398 | /// |
1399 | /// let mut seq = Seq::new(&["oof" , "rab" ]); |
1400 | /// seq.reverse_literals(); |
1401 | /// assert_eq!(Seq::new(&["foo" , "bar" ]), seq); |
1402 | /// ``` |
1403 | #[inline ] |
1404 | pub fn reverse_literals(&mut self) { |
1405 | if let Some(ref mut lits) = self.literals { |
1406 | for lit in lits.iter_mut() { |
1407 | lit.reverse(); |
1408 | } |
1409 | } |
1410 | } |
1411 | |
1412 | /// Shrinks this seq to its minimal size while respecting the preference |
1413 | /// order of its literals. |
1414 | /// |
1415 | /// While this routine will remove duplicate literals from this seq, it |
1416 | /// will also remove literals that can never match in a leftmost-first or |
1417 | /// "preference order" search. Similar to [`Seq::dedup`], if a literal is |
1418 | /// deduped, then the one that remains is made inexact. |
1419 | /// |
1420 | /// This is a no-op on seqs that are empty or not finite. |
1421 | /// |
1422 | /// # Example |
1423 | /// |
1424 | /// This example shows the difference between `{sam, samwise}` and |
1425 | /// `{samwise, sam}`. |
1426 | /// |
1427 | /// ``` |
1428 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
1429 | /// |
1430 | /// // If 'sam' comes before 'samwise' and a preference order search is |
1431 | /// // executed, then 'samwise' can never match. |
1432 | /// let mut seq = Seq::new(&["sam" , "samwise" ]); |
1433 | /// seq.minimize_by_preference(); |
1434 | /// assert_eq!(Seq::from_iter([Literal::inexact("sam" )]), seq); |
1435 | /// |
1436 | /// // But if they are reversed, then it's possible for 'samwise' to match |
1437 | /// // since it is given higher preference. |
1438 | /// let mut seq = Seq::new(&["samwise" , "sam" ]); |
1439 | /// seq.minimize_by_preference(); |
1440 | /// assert_eq!(Seq::new(&["samwise" , "sam" ]), seq); |
1441 | /// ``` |
1442 | /// |
1443 | /// This example shows that if an empty string is in this seq, then |
1444 | /// anything that comes after it can never match. |
1445 | /// |
1446 | /// ``` |
1447 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
1448 | /// |
1449 | /// // An empty string is a prefix of all strings, so it automatically |
1450 | /// // inhibits any subsequent strings from matching. |
1451 | /// let mut seq = Seq::new(&["foo" , "bar" , "" , "quux" , "fox" ]); |
1452 | /// seq.minimize_by_preference(); |
1453 | /// let expected = Seq::from_iter([ |
1454 | /// Literal::exact("foo" ), |
1455 | /// Literal::exact("bar" ), |
1456 | /// Literal::inexact("" ), |
1457 | /// ]); |
1458 | /// assert_eq!(expected, seq); |
1459 | /// |
1460 | /// // And of course, if it's at the beginning, then it makes it impossible |
1461 | /// // for anything else to match. |
1462 | /// let mut seq = Seq::new(&["" , "foo" , "quux" , "fox" ]); |
1463 | /// seq.minimize_by_preference(); |
1464 | /// assert_eq!(Seq::from_iter([Literal::inexact("" )]), seq); |
1465 | /// ``` |
1466 | #[inline ] |
1467 | pub fn minimize_by_preference(&mut self) { |
1468 | if let Some(ref mut lits) = self.literals { |
1469 | PreferenceTrie::minimize(lits, false); |
1470 | } |
1471 | } |
1472 | |
1473 | /// Trims all literals in this seq such that only the first `len` bytes |
1474 | /// remain. If a literal has less than or equal to `len` bytes, then it |
1475 | /// remains unchanged. Otherwise, it is trimmed and made inexact. |
1476 | /// |
1477 | /// # Example |
1478 | /// |
1479 | /// ``` |
1480 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
1481 | /// |
1482 | /// let mut seq = Seq::new(&["a" , "foo" , "quux" ]); |
1483 | /// seq.keep_first_bytes(2); |
1484 | /// |
1485 | /// let expected = Seq::from_iter([ |
1486 | /// Literal::exact("a" ), |
1487 | /// Literal::inexact("fo" ), |
1488 | /// Literal::inexact("qu" ), |
1489 | /// ]); |
1490 | /// assert_eq!(expected, seq); |
1491 | /// ``` |
1492 | #[inline ] |
1493 | pub fn keep_first_bytes(&mut self, len: usize) { |
1494 | if let Some(ref mut lits) = self.literals { |
1495 | for m in lits.iter_mut() { |
1496 | m.keep_first_bytes(len); |
1497 | } |
1498 | } |
1499 | } |
1500 | |
1501 | /// Trims all literals in this seq such that only the last `len` bytes |
1502 | /// remain. If a literal has less than or equal to `len` bytes, then it |
1503 | /// remains unchanged. Otherwise, it is trimmed and made inexact. |
1504 | /// |
1505 | /// # Example |
1506 | /// |
1507 | /// ``` |
1508 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
1509 | /// |
1510 | /// let mut seq = Seq::new(&["a" , "foo" , "quux" ]); |
1511 | /// seq.keep_last_bytes(2); |
1512 | /// |
1513 | /// let expected = Seq::from_iter([ |
1514 | /// Literal::exact("a" ), |
1515 | /// Literal::inexact("oo" ), |
1516 | /// Literal::inexact("ux" ), |
1517 | /// ]); |
1518 | /// assert_eq!(expected, seq); |
1519 | /// ``` |
1520 | #[inline ] |
1521 | pub fn keep_last_bytes(&mut self, len: usize) { |
1522 | if let Some(ref mut lits) = self.literals { |
1523 | for m in lits.iter_mut() { |
1524 | m.keep_last_bytes(len); |
1525 | } |
1526 | } |
1527 | } |
1528 | |
1529 | /// Returns true if this sequence is finite. |
1530 | /// |
1531 | /// When false, this sequence is infinite and must be treated as if it |
1532 | /// contains every possible literal. |
1533 | #[inline ] |
1534 | pub fn is_finite(&self) -> bool { |
1535 | self.literals.is_some() |
1536 | } |
1537 | |
1538 | /// Returns true if and only if this sequence is finite and empty. |
1539 | /// |
1540 | /// An empty sequence never matches anything. It can only be produced by |
1541 | /// literal extraction when the corresponding regex itself cannot match. |
1542 | #[inline ] |
1543 | pub fn is_empty(&self) -> bool { |
1544 | self.len() == Some(0) |
1545 | } |
1546 | |
1547 | /// Returns the number of literals in this sequence if the sequence is |
1548 | /// finite. If the sequence is infinite, then `None` is returned. |
1549 | #[inline ] |
1550 | pub fn len(&self) -> Option<usize> { |
1551 | self.literals.as_ref().map(|lits| lits.len()) |
1552 | } |
1553 | |
1554 | /// Returns true if and only if all literals in this sequence are exact. |
1555 | /// |
1556 | /// This returns false if the sequence is infinite. |
1557 | #[inline ] |
1558 | pub fn is_exact(&self) -> bool { |
1559 | self.literals().map_or(false, |lits| lits.iter().all(|x| x.is_exact())) |
1560 | } |
1561 | |
1562 | /// Returns true if and only if all literals in this sequence are inexact. |
1563 | /// |
1564 | /// This returns true if the sequence is infinite. |
1565 | #[inline ] |
1566 | pub fn is_inexact(&self) -> bool { |
1567 | self.literals().map_or(true, |lits| lits.iter().all(|x| !x.is_exact())) |
1568 | } |
1569 | |
1570 | /// Return the maximum length of the sequence that would result from |
1571 | /// unioning `self` with `other`. If either set is infinite, then this |
1572 | /// returns `None`. |
1573 | #[inline ] |
1574 | fn max_union_len(&self, other: &Seq) -> Option<usize> { |
1575 | let len1 = self.len()?; |
1576 | let len2 = other.len()?; |
1577 | Some(len1.saturating_add(len2)) |
1578 | } |
1579 | |
1580 | /// Return the maximum length of the sequence that would result from the |
1581 | /// cross product of `self` with `other`. If either set is infinite, then |
1582 | /// this returns `None`. |
1583 | #[inline ] |
1584 | fn max_cross_len(&self, other: &Seq) -> Option<usize> { |
1585 | let len1 = self.len()?; |
1586 | let len2 = other.len()?; |
1587 | Some(len1.saturating_mul(len2)) |
1588 | } |
1589 | |
1590 | /// Returns the length of the shortest literal in this sequence. |
1591 | /// |
1592 | /// If the sequence is infinite or empty, then this returns `None`. |
1593 | #[inline ] |
1594 | pub fn min_literal_len(&self) -> Option<usize> { |
1595 | self.literals.as_ref()?.iter().map(|x| x.len()).min() |
1596 | } |
1597 | |
1598 | /// Returns the length of the longest literal in this sequence. |
1599 | /// |
1600 | /// If the sequence is infinite or empty, then this returns `None`. |
1601 | #[inline ] |
1602 | pub fn max_literal_len(&self) -> Option<usize> { |
1603 | self.literals.as_ref()?.iter().map(|x| x.len()).max() |
1604 | } |
1605 | |
1606 | /// Returns the longest common prefix from this seq. |
1607 | /// |
1608 | /// If the seq matches any literal or other contains no literals, then |
1609 | /// there is no meaningful prefix and this returns `None`. |
1610 | /// |
1611 | /// # Example |
1612 | /// |
1613 | /// This shows some example seqs and their longest common prefix. |
1614 | /// |
1615 | /// ``` |
1616 | /// use regex_syntax::hir::literal::Seq; |
1617 | /// |
1618 | /// let seq = Seq::new(&["foo" , "foobar" , "fo" ]); |
1619 | /// assert_eq!(Some(&b"fo" [..]), seq.longest_common_prefix()); |
1620 | /// let seq = Seq::new(&["foo" , "foo" ]); |
1621 | /// assert_eq!(Some(&b"foo" [..]), seq.longest_common_prefix()); |
1622 | /// let seq = Seq::new(&["foo" , "bar" ]); |
1623 | /// assert_eq!(Some(&b"" [..]), seq.longest_common_prefix()); |
1624 | /// let seq = Seq::new(&["" ]); |
1625 | /// assert_eq!(Some(&b"" [..]), seq.longest_common_prefix()); |
1626 | /// |
1627 | /// let seq = Seq::infinite(); |
1628 | /// assert_eq!(None, seq.longest_common_prefix()); |
1629 | /// let seq = Seq::empty(); |
1630 | /// assert_eq!(None, seq.longest_common_prefix()); |
1631 | /// ``` |
1632 | #[inline ] |
1633 | pub fn longest_common_prefix(&self) -> Option<&[u8]> { |
1634 | // If we match everything or match nothing, then there's no meaningful |
1635 | // longest common prefix. |
1636 | let lits = match self.literals { |
1637 | None => return None, |
1638 | Some(ref lits) => lits, |
1639 | }; |
1640 | if lits.len() == 0 { |
1641 | return None; |
1642 | } |
1643 | let base = lits[0].as_bytes(); |
1644 | let mut len = base.len(); |
1645 | for m in lits.iter().skip(1) { |
1646 | len = m |
1647 | .as_bytes() |
1648 | .iter() |
1649 | .zip(base[..len].iter()) |
1650 | .take_while(|&(a, b)| a == b) |
1651 | .count(); |
1652 | if len == 0 { |
1653 | return Some(&[]); |
1654 | } |
1655 | } |
1656 | Some(&base[..len]) |
1657 | } |
1658 | |
1659 | /// Returns the longest common suffix from this seq. |
1660 | /// |
1661 | /// If the seq matches any literal or other contains no literals, then |
1662 | /// there is no meaningful suffix and this returns `None`. |
1663 | /// |
1664 | /// # Example |
1665 | /// |
1666 | /// This shows some example seqs and their longest common suffix. |
1667 | /// |
1668 | /// ``` |
1669 | /// use regex_syntax::hir::literal::Seq; |
1670 | /// |
1671 | /// let seq = Seq::new(&["oof" , "raboof" , "of" ]); |
1672 | /// assert_eq!(Some(&b"of" [..]), seq.longest_common_suffix()); |
1673 | /// let seq = Seq::new(&["foo" , "foo" ]); |
1674 | /// assert_eq!(Some(&b"foo" [..]), seq.longest_common_suffix()); |
1675 | /// let seq = Seq::new(&["foo" , "bar" ]); |
1676 | /// assert_eq!(Some(&b"" [..]), seq.longest_common_suffix()); |
1677 | /// let seq = Seq::new(&["" ]); |
1678 | /// assert_eq!(Some(&b"" [..]), seq.longest_common_suffix()); |
1679 | /// |
1680 | /// let seq = Seq::infinite(); |
1681 | /// assert_eq!(None, seq.longest_common_suffix()); |
1682 | /// let seq = Seq::empty(); |
1683 | /// assert_eq!(None, seq.longest_common_suffix()); |
1684 | /// ``` |
1685 | #[inline ] |
1686 | pub fn longest_common_suffix(&self) -> Option<&[u8]> { |
1687 | // If we match everything or match nothing, then there's no meaningful |
1688 | // longest common suffix. |
1689 | let lits = match self.literals { |
1690 | None => return None, |
1691 | Some(ref lits) => lits, |
1692 | }; |
1693 | if lits.len() == 0 { |
1694 | return None; |
1695 | } |
1696 | let base = lits[0].as_bytes(); |
1697 | let mut len = base.len(); |
1698 | for m in lits.iter().skip(1) { |
1699 | len = m |
1700 | .as_bytes() |
1701 | .iter() |
1702 | .rev() |
1703 | .zip(base[base.len() - len..].iter().rev()) |
1704 | .take_while(|&(a, b)| a == b) |
1705 | .count(); |
1706 | if len == 0 { |
1707 | return Some(&[]); |
1708 | } |
1709 | } |
1710 | Some(&base[base.len() - len..]) |
1711 | } |
1712 | |
1713 | /// Optimizes this seq while treating its literals as prefixes and |
1714 | /// respecting the preference order of its literals. |
1715 | /// |
1716 | /// The specific way "optimization" works is meant to be an implementation |
1717 | /// detail, as it essentially represents a set of heuristics. The goal |
1718 | /// that optimization tries to accomplish is to make the literals in this |
1719 | /// set reflect inputs that will result in a more effective prefilter. |
1720 | /// Principally by reducing the false positive rate of candidates found by |
1721 | /// the literals in this sequence. That is, when a match of a literal is |
1722 | /// found, we would like it to be a strong predictor of the overall match |
1723 | /// of the regex. If it isn't, then much time will be spent starting and |
1724 | /// stopping the prefilter search and attempting to confirm the match only |
1725 | /// to have it fail. |
1726 | /// |
1727 | /// Some of those heuristics might be: |
1728 | /// |
1729 | /// * Identifying a common prefix from a larger sequence of literals, and |
1730 | /// shrinking the sequence down to that single common prefix. |
1731 | /// * Rejecting the sequence entirely if it is believed to result in very |
1732 | /// high false positive rate. When this happens, the sequence is made |
1733 | /// infinite. |
1734 | /// * Shrinking the sequence to a smaller number of literals representing |
1735 | /// prefixes, but not shrinking it so much as to make literals too short. |
1736 | /// (A sequence with very short literals, of 1 or 2 bytes, will typically |
1737 | /// result in a higher false positive rate.) |
1738 | /// |
1739 | /// Optimization should only be run once extraction is complete. Namely, |
1740 | /// optimization may make assumptions that do not compose with other |
1741 | /// operations in the middle of extraction. For example, optimization will |
1742 | /// reduce `[E(sam), E(samwise)]` to `[E(sam)]`, but such a transformation |
1743 | /// is only valid if no other extraction will occur. If other extraction |
1744 | /// may occur, then the correct transformation would be to `[I(sam)]`. |
1745 | /// |
1746 | /// The [`Seq::optimize_for_suffix_by_preference`] does the same thing, but |
1747 | /// for suffixes. |
1748 | /// |
1749 | /// # Example |
1750 | /// |
1751 | /// This shows how optimization might transform a sequence. Note that |
1752 | /// the specific behavior is not a documented guarantee. The heuristics |
1753 | /// used are an implementation detail and may change over time in semver |
1754 | /// compatible releases. |
1755 | /// |
1756 | /// ``` |
1757 | /// use regex_syntax::hir::literal::{Seq, Literal}; |
1758 | /// |
1759 | /// let mut seq = Seq::new(&[ |
1760 | /// "samantha" , |
1761 | /// "sam" , |
1762 | /// "samwise" , |
1763 | /// "frodo" , |
1764 | /// ]); |
1765 | /// seq.optimize_for_prefix_by_preference(); |
1766 | /// assert_eq!(Seq::from_iter([ |
1767 | /// Literal::exact("samantha" ), |
1768 | /// // Kept exact even though 'samwise' got pruned |
1769 | /// // because optimization assumes literal extraction |
1770 | /// // has finished. |
1771 | /// Literal::exact("sam" ), |
1772 | /// Literal::exact("frodo" ), |
1773 | /// ]), seq); |
1774 | /// ``` |
1775 | /// |
1776 | /// # Example: optimization may make the sequence infinite |
1777 | /// |
1778 | /// If the heuristics deem that the sequence could cause a very high false |
1779 | /// positive rate, then it may make the sequence infinite, effectively |
1780 | /// disabling its use as a prefilter. |
1781 | /// |
1782 | /// ``` |
1783 | /// use regex_syntax::hir::literal::{Seq, Literal}; |
1784 | /// |
1785 | /// let mut seq = Seq::new(&[ |
1786 | /// "samantha" , |
1787 | /// // An empty string matches at every position, |
1788 | /// // thus rendering the prefilter completely |
1789 | /// // ineffective. |
1790 | /// "" , |
1791 | /// "sam" , |
1792 | /// "samwise" , |
1793 | /// "frodo" , |
1794 | /// ]); |
1795 | /// seq.optimize_for_prefix_by_preference(); |
1796 | /// assert!(!seq.is_finite()); |
1797 | /// ``` |
1798 | /// |
1799 | /// Do note that just because there is a `" "` in the sequence, that |
1800 | /// doesn't mean the sequence will always be made infinite after it is |
1801 | /// optimized. Namely, if the sequence is considered exact (any match |
1802 | /// corresponds to an overall match of the original regex), then any match |
1803 | /// is an overall match, and so the false positive rate is always `0`. |
1804 | /// |
1805 | /// To demonstrate this, we remove `samwise` from our sequence. This |
1806 | /// results in no optimization happening and all literals remain exact. |
1807 | /// Thus the entire sequence is exact, and it is kept as-is, even though |
1808 | /// one is an ASCII space: |
1809 | /// |
1810 | /// ``` |
1811 | /// use regex_syntax::hir::literal::{Seq, Literal}; |
1812 | /// |
1813 | /// let mut seq = Seq::new(&[ |
1814 | /// "samantha" , |
1815 | /// " " , |
1816 | /// "sam" , |
1817 | /// "frodo" , |
1818 | /// ]); |
1819 | /// seq.optimize_for_prefix_by_preference(); |
1820 | /// assert!(seq.is_finite()); |
1821 | /// ``` |
1822 | #[inline ] |
1823 | pub fn optimize_for_prefix_by_preference(&mut self) { |
1824 | self.optimize_by_preference(true); |
1825 | } |
1826 | |
1827 | /// Optimizes this seq while treating its literals as suffixes and |
1828 | /// respecting the preference order of its literals. |
1829 | /// |
1830 | /// Optimization should only be run once extraction is complete. |
1831 | /// |
1832 | /// The [`Seq::optimize_for_prefix_by_preference`] does the same thing, but |
1833 | /// for prefixes. See its documentation for more explanation. |
1834 | #[inline ] |
1835 | pub fn optimize_for_suffix_by_preference(&mut self) { |
1836 | self.optimize_by_preference(false); |
1837 | } |
1838 | |
1839 | fn optimize_by_preference(&mut self, prefix: bool) { |
1840 | let origlen = match self.len() { |
1841 | None => return, |
1842 | Some(len) => len, |
1843 | }; |
1844 | // Make sure we start with the smallest sequence possible. We use a |
1845 | // special version of preference minimization that retains exactness. |
1846 | // This is legal because optimization is only expected to occur once |
1847 | // extraction is complete. |
1848 | if prefix { |
1849 | if let Some(ref mut lits) = self.literals { |
1850 | PreferenceTrie::minimize(lits, true); |
1851 | } |
1852 | } |
1853 | |
1854 | // Look for a common prefix (or suffix). If we found one of those and |
1855 | // it's long enough, then it's a good bet that it will be our fastest |
1856 | // possible prefilter since single-substring search is so fast. |
1857 | let fix = if prefix { |
1858 | self.longest_common_prefix() |
1859 | } else { |
1860 | self.longest_common_suffix() |
1861 | }; |
1862 | if let Some(fix) = fix { |
1863 | // As a special case, if we have a common prefix and the leading |
1864 | // byte of that prefix is one that we think probably occurs rarely, |
1865 | // then strip everything down to just that single byte. This should |
1866 | // promote the use of memchr. |
1867 | // |
1868 | // ... we only do this though if our sequence has more than one |
1869 | // literal. Otherwise, we'd rather just stick with a single literal |
1870 | // scan. That is, using memchr is probably better than looking |
1871 | // for 2 or more literals, but probably not as good as a straight |
1872 | // memmem search. |
1873 | // |
1874 | // ... and also only do this when the prefix is short and probably |
1875 | // not too discriminatory anyway. If it's longer, then it's |
1876 | // probably quite discriminatory and thus is likely to have a low |
1877 | // false positive rate. |
1878 | if prefix |
1879 | && origlen > 1 |
1880 | && fix.len() >= 1 |
1881 | && fix.len() <= 3 |
1882 | && rank(fix[0]) < 200 |
1883 | { |
1884 | self.keep_first_bytes(1); |
1885 | self.dedup(); |
1886 | return; |
1887 | } |
1888 | // We only strip down to the common prefix/suffix if we think |
1889 | // the existing set of literals isn't great, or if the common |
1890 | // prefix/suffix is expected to be particularly discriminatory. |
1891 | let isfast = |
1892 | self.is_exact() && self.len().map_or(false, |len| len <= 16); |
1893 | let usefix = fix.len() > 4 || (fix.len() > 1 && !isfast); |
1894 | if usefix { |
1895 | // If we keep exactly the number of bytes equal to the length |
1896 | // of the prefix (or suffix), then by the definition of a |
1897 | // prefix, every literal in the sequence will be equivalent. |
1898 | // Thus, 'dedup' will leave us with one literal. |
1899 | // |
1900 | // We do it this way to avoid an alloc, but also to make sure |
1901 | // the exactness of literals is kept (or not). |
1902 | if prefix { |
1903 | self.keep_first_bytes(fix.len()); |
1904 | } else { |
1905 | self.keep_last_bytes(fix.len()); |
1906 | } |
1907 | self.dedup(); |
1908 | assert_eq!(Some(1), self.len()); |
1909 | // We still fall through here. In particular, we want our |
1910 | // longest common prefix to be subject to the poison check. |
1911 | } |
1912 | } |
1913 | // Everything below this check is more-or-less about trying to |
1914 | // heuristically reduce the false positive rate of a prefilter. But |
1915 | // if our sequence is completely exact, then it's possible the regex |
1916 | // engine can be skipped entirely. In this case, the false positive |
1917 | // rate is zero because every literal match corresponds to a regex |
1918 | // match. |
1919 | // |
1920 | // This is OK even if the sequence contains a poison literal. Remember, |
1921 | // a literal is only poisononous because of what we assume about its |
1922 | // impact on the false positive rate. However, we do still check for |
1923 | // an empty string. Empty strings are weird and it's best to let the |
1924 | // regex engine handle those. |
1925 | // |
1926 | // We do currently do this check after the longest common prefix (or |
1927 | // suffix) check, under the theory that single-substring search is so |
1928 | // fast that we want that even if we'd end up turning an exact sequence |
1929 | // into an inexact one. But this might be wrong... |
1930 | if self.is_exact() |
1931 | && self.min_literal_len().map_or(false, |len| len > 0) |
1932 | { |
1933 | return; |
1934 | } |
1935 | // Now we attempt to shorten the sequence. The idea here is that we |
1936 | // don't want to look for too many literals, but we want to shorten |
1937 | // our sequence enough to improve our odds of using better algorithms |
1938 | // downstream (such as Teddy). |
1939 | const ATTEMPTS: [(usize, usize); 5] = |
1940 | [(5, 64), (4, 64), (3, 64), (2, 64), (1, 10)]; |
1941 | for (keep, limit) in ATTEMPTS { |
1942 | let len = match self.len() { |
1943 | None => break, |
1944 | Some(len) => len, |
1945 | }; |
1946 | if len <= limit { |
1947 | break; |
1948 | } |
1949 | if prefix { |
1950 | self.keep_first_bytes(keep); |
1951 | } else { |
1952 | self.keep_last_bytes(keep); |
1953 | } |
1954 | self.minimize_by_preference(); |
1955 | } |
1956 | // Check for a poison literal. A poison literal is one that is short |
1957 | // and is believed to have a very high match count. These poisons |
1958 | // generally lead to a prefilter with a very high false positive rate, |
1959 | // and thus overall worse performance. |
1960 | // |
1961 | // We do this last because we could have gone from a non-poisonous |
1962 | // sequence to a poisonous one. Perhaps we should add some code to |
1963 | // prevent such transitions in the first place, but then again, we |
1964 | // likely only made the transition in the first place if the sequence |
1965 | // was itself huge. And huge sequences are themselves poisonous. So... |
1966 | if let Some(lits) = self.literals() { |
1967 | if lits.iter().any(|lit| lit.is_poisonous()) { |
1968 | self.make_infinite(); |
1969 | } |
1970 | } |
1971 | } |
1972 | } |
1973 | |
1974 | impl core::fmt::Debug for Seq { |
1975 | fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { |
1976 | write!(f, "Seq" )?; |
1977 | if let Some(lits: &[Literal]) = self.literals() { |
1978 | f.debug_list().entries(lits.iter()).finish() |
1979 | } else { |
1980 | write!(f, "[∅]" ) |
1981 | } |
1982 | } |
1983 | } |
1984 | |
1985 | impl FromIterator<Literal> for Seq { |
1986 | fn from_iter<T: IntoIterator<Item = Literal>>(it: T) -> Seq { |
1987 | let mut seq: Seq = Seq::empty(); |
1988 | for literal: Literal in it { |
1989 | seq.push(lit:literal); |
1990 | } |
1991 | seq |
1992 | } |
1993 | } |
1994 | |
1995 | /// A single literal extracted from an [`Hir`] expression. |
1996 | /// |
1997 | /// A literal is composed of two things: |
1998 | /// |
1999 | /// * A sequence of bytes. No guarantees with respect to UTF-8 are provided. |
2000 | /// In particular, even if the regex a literal is extracted from is UTF-8, the |
2001 | /// literal extracted may not be valid UTF-8. (For example, if an [`Extractor`] |
2002 | /// limit resulted in trimming a literal in a way that splits a codepoint.) |
2003 | /// * Whether the literal is "exact" or not. An "exact" literal means that it |
2004 | /// has not been trimmed, and may continue to be extended. If a literal is |
2005 | /// "exact" after visiting the entire `Hir` expression, then this implies that |
2006 | /// the literal leads to a match state. (Although it doesn't necessarily imply |
2007 | /// all occurrences of the literal correspond to a match of the regex, since |
2008 | /// literal extraction ignores look-around assertions.) |
2009 | #[derive (Clone, Eq, PartialEq, PartialOrd, Ord)] |
2010 | pub struct Literal { |
2011 | bytes: Vec<u8>, |
2012 | exact: bool, |
2013 | } |
2014 | |
2015 | impl Literal { |
2016 | /// Returns a new exact literal containing the bytes given. |
2017 | #[inline ] |
2018 | pub fn exact<B: Into<Vec<u8>>>(bytes: B) -> Literal { |
2019 | Literal { bytes: bytes.into(), exact: true } |
2020 | } |
2021 | |
2022 | /// Returns a new inexact literal containing the bytes given. |
2023 | #[inline ] |
2024 | pub fn inexact<B: Into<Vec<u8>>>(bytes: B) -> Literal { |
2025 | Literal { bytes: bytes.into(), exact: false } |
2026 | } |
2027 | |
2028 | /// Returns the bytes in this literal. |
2029 | #[inline ] |
2030 | pub fn as_bytes(&self) -> &[u8] { |
2031 | &self.bytes |
2032 | } |
2033 | |
2034 | /// Yields ownership of the bytes inside this literal. |
2035 | /// |
2036 | /// Note that this throws away whether the literal is "exact" or not. |
2037 | #[inline ] |
2038 | pub fn into_bytes(self) -> Vec<u8> { |
2039 | self.bytes |
2040 | } |
2041 | |
2042 | /// Returns the length of this literal in bytes. |
2043 | #[inline ] |
2044 | pub fn len(&self) -> usize { |
2045 | self.as_bytes().len() |
2046 | } |
2047 | |
2048 | /// Returns true if and only if this literal has zero bytes. |
2049 | #[inline ] |
2050 | pub fn is_empty(&self) -> bool { |
2051 | self.len() == 0 |
2052 | } |
2053 | |
2054 | /// Returns true if and only if this literal is exact. |
2055 | #[inline ] |
2056 | pub fn is_exact(&self) -> bool { |
2057 | self.exact |
2058 | } |
2059 | |
2060 | /// Marks this literal as inexact. |
2061 | /// |
2062 | /// Inexact literals can never be extended. For example, |
2063 | /// [`Seq::cross_forward`] will not extend inexact literals. |
2064 | #[inline ] |
2065 | pub fn make_inexact(&mut self) { |
2066 | self.exact = false; |
2067 | } |
2068 | |
2069 | /// Reverse the bytes in this literal. |
2070 | #[inline ] |
2071 | pub fn reverse(&mut self) { |
2072 | self.bytes.reverse(); |
2073 | } |
2074 | |
2075 | /// Extend this literal with the literal given. |
2076 | /// |
2077 | /// If this literal is inexact, then this is a no-op. |
2078 | #[inline ] |
2079 | pub fn extend(&mut self, lit: &Literal) { |
2080 | if !self.is_exact() { |
2081 | return; |
2082 | } |
2083 | self.bytes.extend_from_slice(&lit.bytes); |
2084 | } |
2085 | |
2086 | /// Trims this literal such that only the first `len` bytes remain. If |
2087 | /// this literal has fewer than `len` bytes, then it remains unchanged. |
2088 | /// Otherwise, the literal is marked as inexact. |
2089 | #[inline ] |
2090 | pub fn keep_first_bytes(&mut self, len: usize) { |
2091 | if len >= self.len() { |
2092 | return; |
2093 | } |
2094 | self.make_inexact(); |
2095 | self.bytes.truncate(len); |
2096 | } |
2097 | |
2098 | /// Trims this literal such that only the last `len` bytes remain. If this |
2099 | /// literal has fewer than `len` bytes, then it remains unchanged. |
2100 | /// Otherwise, the literal is marked as inexact. |
2101 | #[inline ] |
2102 | pub fn keep_last_bytes(&mut self, len: usize) { |
2103 | if len >= self.len() { |
2104 | return; |
2105 | } |
2106 | self.make_inexact(); |
2107 | self.bytes.drain(..self.len() - len); |
2108 | } |
2109 | |
2110 | /// Returns true if it is believe that this literal is likely to match very |
2111 | /// frequently, and is thus not a good candidate for a prefilter. |
2112 | fn is_poisonous(&self) -> bool { |
2113 | self.is_empty() || (self.len() == 1 && rank(self.as_bytes()[0]) >= 250) |
2114 | } |
2115 | } |
2116 | |
2117 | impl From<u8> for Literal { |
2118 | fn from(byte: u8) -> Literal { |
2119 | Literal::exact(bytes:vec![byte]) |
2120 | } |
2121 | } |
2122 | |
2123 | impl From<char> for Literal { |
2124 | fn from(ch: char) -> Literal { |
2125 | use alloc::string::ToString; |
2126 | Literal::exact(bytes:ch.encode_utf8(&mut [0; 4]).to_string()) |
2127 | } |
2128 | } |
2129 | |
2130 | impl AsRef<[u8]> for Literal { |
2131 | fn as_ref(&self) -> &[u8] { |
2132 | self.as_bytes() |
2133 | } |
2134 | } |
2135 | |
2136 | impl core::fmt::Debug for Literal { |
2137 | fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { |
2138 | let tag: &str = if self.exact { "E" } else { "I" }; |
2139 | f&mut DebugTuple<'_, '_>.debug_tuple(name:tag) |
2140 | .field(&crate::debug::Bytes(self.as_bytes())) |
2141 | .finish() |
2142 | } |
2143 | } |
2144 | |
2145 | /// A "preference" trie that rejects literals that will never match when |
2146 | /// executing a leftmost first or "preference" search. |
2147 | /// |
2148 | /// For example, if 'sam' is inserted, then trying to insert 'samwise' will be |
2149 | /// rejected because 'samwise' can never match since 'sam' will always take |
2150 | /// priority. However, if 'samwise' is inserted first, then inserting 'sam' |
2151 | /// after it is accepted. In this case, either 'samwise' or 'sam' can match in |
2152 | /// a "preference" search. |
2153 | /// |
2154 | /// Note that we only use this trie as a "set." That is, given a sequence of |
2155 | /// literals, we insert each one in order. An `insert` will reject a literal |
2156 | /// if a prefix of that literal already exists in the trie. Thus, to rebuild |
2157 | /// the "minimal" sequence, we simply only keep literals that were successfully |
2158 | /// inserted. (Since we don't need traversal, one wonders whether we can make |
2159 | /// some simplifications here, but I haven't given it a ton of thought and I've |
2160 | /// never seen this show up on a profile. Because of the heuristic limits |
2161 | /// imposed on literal extractions, the size of the inputs here is usually |
2162 | /// very small.) |
2163 | #[derive (Debug, Default)] |
2164 | struct PreferenceTrie { |
2165 | /// The states in this trie. The index of a state in this vector is its ID. |
2166 | states: Vec<State>, |
2167 | /// The index to allocate to the next literal added to this trie. Starts at |
2168 | /// 0 and increments by 1 for every literal successfully added to the trie. |
2169 | next_literal_index: usize, |
2170 | } |
2171 | |
2172 | /// A single state in a trie. Uses a sparse representation for its transitions. |
2173 | #[derive (Debug, Default)] |
2174 | struct State { |
2175 | /// Sparse representation of the transitions out of this state. Transitions |
2176 | /// are sorted by byte. There is at most one such transition for any |
2177 | /// particular byte. |
2178 | trans: Vec<(u8, usize)>, |
2179 | /// Whether this is a matching state or not. If it is, then it contains the |
2180 | /// index to the matching literal. |
2181 | literal_index: Option<usize>, |
2182 | } |
2183 | |
2184 | impl PreferenceTrie { |
2185 | /// Minimizes the given sequence of literals while preserving preference |
2186 | /// order semantics. |
2187 | /// |
2188 | /// When `keep_exact` is true, the exactness of every literal retained is |
2189 | /// kept. This is useful when dealing with a fully extracted `Seq` that |
2190 | /// only contains exact literals. In that case, we can keep all retained |
2191 | /// literals as exact because we know we'll never need to match anything |
2192 | /// after them and because any removed literals are guaranteed to never |
2193 | /// match. |
2194 | fn minimize(literals: &mut Vec<Literal>, keep_exact: bool) { |
2195 | use core::cell::RefCell; |
2196 | |
2197 | // MSRV(1.61): Use retain_mut here to avoid interior mutability. |
2198 | let trie = RefCell::new(PreferenceTrie::default()); |
2199 | let mut make_inexact = vec![]; |
2200 | literals.retain(|lit| { |
2201 | match trie.borrow_mut().insert(lit.as_bytes()) { |
2202 | Ok(_) => true, |
2203 | Err(i) => { |
2204 | if !keep_exact { |
2205 | make_inexact.push(i); |
2206 | } |
2207 | false |
2208 | } |
2209 | } |
2210 | }); |
2211 | for i in make_inexact { |
2212 | literals[i].make_inexact(); |
2213 | } |
2214 | } |
2215 | |
2216 | /// Returns `Ok` if the given byte string is accepted into this trie and |
2217 | /// `Err` otherwise. The index for the success case corresponds to the |
2218 | /// index of the literal added. The index for the error case corresponds to |
2219 | /// the index of the literal already in the trie that prevented the given |
2220 | /// byte string from being added. (Which implies it is a prefix of the one |
2221 | /// given.) |
2222 | /// |
2223 | /// In short, the byte string given is accepted into the trie if and only |
2224 | /// if it is possible for it to match when executing a preference order |
2225 | /// search. |
2226 | fn insert(&mut self, bytes: &[u8]) -> Result<usize, usize> { |
2227 | let mut prev = self.root(); |
2228 | if let Some(idx) = self.states[prev].literal_index { |
2229 | return Err(idx); |
2230 | } |
2231 | for &b in bytes.iter() { |
2232 | match self.states[prev].trans.binary_search_by_key(&b, |t| t.0) { |
2233 | Ok(i) => { |
2234 | prev = self.states[prev].trans[i].1; |
2235 | if let Some(idx) = self.states[prev].literal_index { |
2236 | return Err(idx); |
2237 | } |
2238 | } |
2239 | Err(i) => { |
2240 | let next = self.create_state(); |
2241 | self.states[prev].trans.insert(i, (b, next)); |
2242 | prev = next; |
2243 | } |
2244 | } |
2245 | } |
2246 | let idx = self.next_literal_index; |
2247 | self.next_literal_index += 1; |
2248 | self.states[prev].literal_index = Some(idx); |
2249 | Ok(idx) |
2250 | } |
2251 | |
2252 | /// Returns the root state ID, and if it doesn't exist, creates it. |
2253 | fn root(&mut self) -> usize { |
2254 | if !self.states.is_empty() { |
2255 | 0 |
2256 | } else { |
2257 | self.create_state() |
2258 | } |
2259 | } |
2260 | |
2261 | /// Creates a new empty state and returns its ID. |
2262 | fn create_state(&mut self) -> usize { |
2263 | let id = self.states.len(); |
2264 | self.states.push(State::default()); |
2265 | id |
2266 | } |
2267 | } |
2268 | |
2269 | /// Returns the "rank" of the given byte. |
2270 | /// |
2271 | /// The minimum rank value is `0` and the maximum rank value is `255`. |
2272 | /// |
2273 | /// The rank of a byte is derived from a heuristic background distribution of |
2274 | /// relative frequencies of bytes. The heuristic says that lower the rank of a |
2275 | /// byte, the less likely that byte is to appear in any arbitrary haystack. |
2276 | pub fn rank(byte: u8) -> u8 { |
2277 | crate::rank::BYTE_FREQUENCIES[usize::from(byte)] |
2278 | } |
2279 | |
2280 | #[cfg (test)] |
2281 | mod tests { |
2282 | use super::*; |
2283 | |
2284 | fn parse(pattern: &str) -> Hir { |
2285 | crate::ParserBuilder::new().utf8(false).build().parse(pattern).unwrap() |
2286 | } |
2287 | |
2288 | fn prefixes(pattern: &str) -> Seq { |
2289 | Extractor::new().kind(ExtractKind::Prefix).extract(&parse(pattern)) |
2290 | } |
2291 | |
2292 | fn suffixes(pattern: &str) -> Seq { |
2293 | Extractor::new().kind(ExtractKind::Suffix).extract(&parse(pattern)) |
2294 | } |
2295 | |
2296 | fn e(pattern: &str) -> (Seq, Seq) { |
2297 | (prefixes(pattern), suffixes(pattern)) |
2298 | } |
2299 | |
2300 | #[allow (non_snake_case)] |
2301 | fn E(x: &str) -> Literal { |
2302 | Literal::exact(x.as_bytes()) |
2303 | } |
2304 | |
2305 | #[allow (non_snake_case)] |
2306 | fn I(x: &str) -> Literal { |
2307 | Literal::inexact(x.as_bytes()) |
2308 | } |
2309 | |
2310 | fn seq<I: IntoIterator<Item = Literal>>(it: I) -> Seq { |
2311 | Seq::from_iter(it) |
2312 | } |
2313 | |
2314 | fn infinite() -> (Seq, Seq) { |
2315 | (Seq::infinite(), Seq::infinite()) |
2316 | } |
2317 | |
2318 | fn inexact<I1, I2>(it1: I1, it2: I2) -> (Seq, Seq) |
2319 | where |
2320 | I1: IntoIterator<Item = Literal>, |
2321 | I2: IntoIterator<Item = Literal>, |
2322 | { |
2323 | (Seq::from_iter(it1), Seq::from_iter(it2)) |
2324 | } |
2325 | |
2326 | fn exact<B: AsRef<[u8]>, I: IntoIterator<Item = B>>(it: I) -> (Seq, Seq) { |
2327 | let s1 = Seq::new(it); |
2328 | let s2 = s1.clone(); |
2329 | (s1, s2) |
2330 | } |
2331 | |
2332 | fn opt<B: AsRef<[u8]>, I: IntoIterator<Item = B>>(it: I) -> (Seq, Seq) { |
2333 | let (mut p, mut s) = exact(it); |
2334 | p.optimize_for_prefix_by_preference(); |
2335 | s.optimize_for_suffix_by_preference(); |
2336 | (p, s) |
2337 | } |
2338 | |
2339 | #[test ] |
2340 | fn literal() { |
2341 | assert_eq!(exact(["a" ]), e("a" )); |
2342 | assert_eq!(exact(["aaaaa" ]), e("aaaaa" )); |
2343 | assert_eq!(exact(["A" , "a" ]), e("(?i-u)a" )); |
2344 | assert_eq!(exact(["AB" , "Ab" , "aB" , "ab" ]), e("(?i-u)ab" )); |
2345 | assert_eq!(exact(["abC" , "abc" ]), e("ab(?i-u)c" )); |
2346 | |
2347 | assert_eq!(exact([b" \xFF" ]), e(r"(?-u:\xFF)" )); |
2348 | |
2349 | #[cfg (feature = "unicode-case" )] |
2350 | { |
2351 | assert_eq!(exact(["☃" ]), e("☃" )); |
2352 | assert_eq!(exact(["☃" ]), e("(?i)☃" )); |
2353 | assert_eq!(exact(["☃☃☃☃☃" ]), e("☃☃☃☃☃" )); |
2354 | |
2355 | assert_eq!(exact(["Δ" ]), e("Δ" )); |
2356 | assert_eq!(exact(["δ" ]), e("δ" )); |
2357 | assert_eq!(exact(["Δ" , "δ" ]), e("(?i)Δ" )); |
2358 | assert_eq!(exact(["Δ" , "δ" ]), e("(?i)δ" )); |
2359 | |
2360 | assert_eq!(exact(["S" , "s" , "ſ" ]), e("(?i)S" )); |
2361 | assert_eq!(exact(["S" , "s" , "ſ" ]), e("(?i)s" )); |
2362 | assert_eq!(exact(["S" , "s" , "ſ" ]), e("(?i)ſ" )); |
2363 | } |
2364 | |
2365 | let letters = "ͱͳͷΐάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋ" ; |
2366 | assert_eq!(exact([letters]), e(letters)); |
2367 | } |
2368 | |
2369 | #[test ] |
2370 | fn class() { |
2371 | assert_eq!(exact(["a" , "b" , "c" ]), e("[abc]" )); |
2372 | assert_eq!(exact(["a1b" , "a2b" , "a3b" ]), e("a[123]b" )); |
2373 | assert_eq!(exact(["δ" , "ε" ]), e("[εδ]" )); |
2374 | #[cfg (feature = "unicode-case" )] |
2375 | { |
2376 | assert_eq!(exact(["Δ" , "Ε" , "δ" , "ε" , "ϵ" ]), e(r"(?i)[εδ]" )); |
2377 | } |
2378 | } |
2379 | |
2380 | #[test ] |
2381 | fn look() { |
2382 | assert_eq!(exact(["ab" ]), e(r"a\Ab" )); |
2383 | assert_eq!(exact(["ab" ]), e(r"a\zb" )); |
2384 | assert_eq!(exact(["ab" ]), e(r"a(?m:^)b" )); |
2385 | assert_eq!(exact(["ab" ]), e(r"a(?m:$)b" )); |
2386 | assert_eq!(exact(["ab" ]), e(r"a\bb" )); |
2387 | assert_eq!(exact(["ab" ]), e(r"a\Bb" )); |
2388 | assert_eq!(exact(["ab" ]), e(r"a(?-u:\b)b" )); |
2389 | assert_eq!(exact(["ab" ]), e(r"a(?-u:\B)b" )); |
2390 | |
2391 | assert_eq!(exact(["ab" ]), e(r"^ab" )); |
2392 | assert_eq!(exact(["ab" ]), e(r"$ab" )); |
2393 | assert_eq!(exact(["ab" ]), e(r"(?m:^)ab" )); |
2394 | assert_eq!(exact(["ab" ]), e(r"(?m:$)ab" )); |
2395 | assert_eq!(exact(["ab" ]), e(r"\bab" )); |
2396 | assert_eq!(exact(["ab" ]), e(r"\Bab" )); |
2397 | assert_eq!(exact(["ab" ]), e(r"(?-u:\b)ab" )); |
2398 | assert_eq!(exact(["ab" ]), e(r"(?-u:\B)ab" )); |
2399 | |
2400 | assert_eq!(exact(["ab" ]), e(r"ab^" )); |
2401 | assert_eq!(exact(["ab" ]), e(r"ab$" )); |
2402 | assert_eq!(exact(["ab" ]), e(r"ab(?m:^)" )); |
2403 | assert_eq!(exact(["ab" ]), e(r"ab(?m:$)" )); |
2404 | assert_eq!(exact(["ab" ]), e(r"ab\b" )); |
2405 | assert_eq!(exact(["ab" ]), e(r"ab\B" )); |
2406 | assert_eq!(exact(["ab" ]), e(r"ab(?-u:\b)" )); |
2407 | assert_eq!(exact(["ab" ]), e(r"ab(?-u:\B)" )); |
2408 | |
2409 | let expected = (seq([I("aZ" ), E("ab" )]), seq([I("Zb" ), E("ab" )])); |
2410 | assert_eq!(expected, e(r"^aZ*b" )); |
2411 | } |
2412 | |
2413 | #[test ] |
2414 | fn repetition() { |
2415 | assert_eq!(exact(["a" , "" ]), e(r"a?" )); |
2416 | assert_eq!(exact(["" , "a" ]), e(r"a??" )); |
2417 | assert_eq!(inexact([I("a" ), E("" )], [I("a" ), E("" )]), e(r"a*" )); |
2418 | assert_eq!(inexact([E("" ), I("a" )], [E("" ), I("a" )]), e(r"a*?" )); |
2419 | assert_eq!(inexact([I("a" )], [I("a" )]), e(r"a+" )); |
2420 | assert_eq!(inexact([I("a" )], [I("a" )]), e(r"(a+)+" )); |
2421 | |
2422 | assert_eq!(exact(["ab" ]), e(r"aZ{0}b" )); |
2423 | assert_eq!(exact(["aZb" , "ab" ]), e(r"aZ?b" )); |
2424 | assert_eq!(exact(["ab" , "aZb" ]), e(r"aZ??b" )); |
2425 | assert_eq!( |
2426 | inexact([I("aZ" ), E("ab" )], [I("Zb" ), E("ab" )]), |
2427 | e(r"aZ*b" ) |
2428 | ); |
2429 | assert_eq!( |
2430 | inexact([E("ab" ), I("aZ" )], [E("ab" ), I("Zb" )]), |
2431 | e(r"aZ*?b" ) |
2432 | ); |
2433 | assert_eq!(inexact([I("aZ" )], [I("Zb" )]), e(r"aZ+b" )); |
2434 | assert_eq!(inexact([I("aZ" )], [I("Zb" )]), e(r"aZ+?b" )); |
2435 | |
2436 | assert_eq!(exact(["aZZb" ]), e(r"aZ{2}b" )); |
2437 | assert_eq!(inexact([I("aZZ" )], [I("ZZb" )]), e(r"aZ{2,3}b" )); |
2438 | |
2439 | assert_eq!(exact(["abc" , "" ]), e(r"(abc)?" )); |
2440 | assert_eq!(exact(["" , "abc" ]), e(r"(abc)??" )); |
2441 | |
2442 | assert_eq!(inexact([I("a" ), E("b" )], [I("ab" ), E("b" )]), e(r"a*b" )); |
2443 | assert_eq!(inexact([E("b" ), I("a" )], [E("b" ), I("ab" )]), e(r"a*?b" )); |
2444 | assert_eq!(inexact([I("ab" )], [I("b" )]), e(r"ab+" )); |
2445 | assert_eq!(inexact([I("a" ), I("b" )], [I("b" )]), e(r"a*b+" )); |
2446 | |
2447 | // FIXME: The suffixes for this don't look quite right to me. I think |
2448 | // the right suffixes would be: [I(ac), I(bc), E(c)]. The main issue I |
2449 | // think is that suffixes are computed by iterating over concatenations |
2450 | // in reverse, and then [bc, ac, c] ordering is indeed correct from |
2451 | // that perspective. We also test a few more equivalent regexes, and |
2452 | // we get the same result, so it is consistent at least I suppose. |
2453 | // |
2454 | // The reason why this isn't an issue is that it only messes up |
2455 | // preference order, and currently, suffixes are never used in a |
2456 | // context where preference order matters. For prefixes it matters |
2457 | // because we sometimes want to use prefilters without confirmation |
2458 | // when all of the literals are exact (and there's no look-around). But |
2459 | // we never do that for suffixes. Any time we use suffixes, we always |
2460 | // include a confirmation step. If that ever changes, then it's likely |
2461 | // this bug will need to be fixed, but last time I looked, it appears |
2462 | // hard to do so. |
2463 | assert_eq!( |
2464 | inexact([I("a" ), I("b" ), E("c" )], [I("bc" ), I("ac" ), E("c" )]), |
2465 | e(r"a*b*c" ) |
2466 | ); |
2467 | assert_eq!( |
2468 | inexact([I("a" ), I("b" ), E("c" )], [I("bc" ), I("ac" ), E("c" )]), |
2469 | e(r"(a+)?(b+)?c" ) |
2470 | ); |
2471 | assert_eq!( |
2472 | inexact([I("a" ), I("b" ), E("c" )], [I("bc" ), I("ac" ), E("c" )]), |
2473 | e(r"(a+|)(b+|)c" ) |
2474 | ); |
2475 | // A few more similarish but not identical regexes. These may have a |
2476 | // similar problem as above. |
2477 | assert_eq!( |
2478 | inexact( |
2479 | [I("a" ), I("b" ), I("c" ), E("" )], |
2480 | [I("c" ), I("b" ), I("a" ), E("" )] |
2481 | ), |
2482 | e(r"a*b*c*" ) |
2483 | ); |
2484 | assert_eq!(inexact([I("a" ), I("b" ), I("c" )], [I("c" )]), e(r"a*b*c+" )); |
2485 | assert_eq!(inexact([I("a" ), I("b" )], [I("bc" )]), e(r"a*b+c" )); |
2486 | assert_eq!(inexact([I("a" ), I("b" )], [I("c" ), I("b" )]), e(r"a*b+c*" )); |
2487 | assert_eq!(inexact([I("ab" ), E("a" )], [I("b" ), E("a" )]), e(r"ab*" )); |
2488 | assert_eq!( |
2489 | inexact([I("ab" ), E("ac" )], [I("bc" ), E("ac" )]), |
2490 | e(r"ab*c" ) |
2491 | ); |
2492 | assert_eq!(inexact([I("ab" )], [I("b" )]), e(r"ab+" )); |
2493 | assert_eq!(inexact([I("ab" )], [I("bc" )]), e(r"ab+c" )); |
2494 | |
2495 | assert_eq!( |
2496 | inexact([I("z" ), E("azb" )], [I("zazb" ), E("azb" )]), |
2497 | e(r"z*azb" ) |
2498 | ); |
2499 | |
2500 | let expected = |
2501 | exact(["aaa" , "aab" , "aba" , "abb" , "baa" , "bab" , "bba" , "bbb" ]); |
2502 | assert_eq!(expected, e(r"[ab]{3}" )); |
2503 | let expected = inexact( |
2504 | [ |
2505 | I("aaa" ), |
2506 | I("aab" ), |
2507 | I("aba" ), |
2508 | I("abb" ), |
2509 | I("baa" ), |
2510 | I("bab" ), |
2511 | I("bba" ), |
2512 | I("bbb" ), |
2513 | ], |
2514 | [ |
2515 | I("aaa" ), |
2516 | I("aab" ), |
2517 | I("aba" ), |
2518 | I("abb" ), |
2519 | I("baa" ), |
2520 | I("bab" ), |
2521 | I("bba" ), |
2522 | I("bbb" ), |
2523 | ], |
2524 | ); |
2525 | assert_eq!(expected, e(r"[ab]{3,4}" )); |
2526 | } |
2527 | |
2528 | #[test ] |
2529 | fn concat() { |
2530 | let empty: [&str; 0] = []; |
2531 | |
2532 | assert_eq!(exact(["abcxyz" ]), e(r"abc()xyz" )); |
2533 | assert_eq!(exact(["abcxyz" ]), e(r"(abc)(xyz)" )); |
2534 | assert_eq!(exact(["abcmnoxyz" ]), e(r"abc()mno()xyz" )); |
2535 | assert_eq!(exact(empty), e(r"abc[a&&b]xyz" )); |
2536 | assert_eq!(exact(["abcxyz" ]), e(r"abc[a&&b]*xyz" )); |
2537 | } |
2538 | |
2539 | #[test ] |
2540 | fn alternation() { |
2541 | assert_eq!(exact(["abc" , "mno" , "xyz" ]), e(r"abc|mno|xyz" )); |
2542 | assert_eq!( |
2543 | inexact( |
2544 | [E("abc" ), I("mZ" ), E("mo" ), E("xyz" )], |
2545 | [E("abc" ), I("Zo" ), E("mo" ), E("xyz" )] |
2546 | ), |
2547 | e(r"abc|mZ*o|xyz" ) |
2548 | ); |
2549 | assert_eq!(exact(["abc" , "xyz" ]), e(r"abc|M[a&&b]N|xyz" )); |
2550 | assert_eq!(exact(["abc" , "MN" , "xyz" ]), e(r"abc|M[a&&b]*N|xyz" )); |
2551 | |
2552 | assert_eq!(exact(["aaa" , "aaaaa" ]), e(r"(?:|aa)aaa" )); |
2553 | assert_eq!( |
2554 | inexact( |
2555 | [I("aaa" ), E("" ), I("aaaaa" ), E("aa" )], |
2556 | [I("aaa" ), E("" ), E("aa" )] |
2557 | ), |
2558 | e(r"(?:|aa)(?:aaa)*" ) |
2559 | ); |
2560 | assert_eq!( |
2561 | inexact( |
2562 | [E("" ), I("aaa" ), E("aa" ), I("aaaaa" )], |
2563 | [E("" ), I("aaa" ), E("aa" )] |
2564 | ), |
2565 | e(r"(?:|aa)(?:aaa)*?" ) |
2566 | ); |
2567 | |
2568 | assert_eq!( |
2569 | inexact([E("a" ), I("b" ), E("" )], [E("a" ), I("b" ), E("" )]), |
2570 | e(r"a|b*" ) |
2571 | ); |
2572 | assert_eq!(inexact([E("a" ), I("b" )], [E("a" ), I("b" )]), e(r"a|b+" )); |
2573 | |
2574 | assert_eq!( |
2575 | inexact([I("a" ), E("b" ), E("c" )], [I("ab" ), E("b" ), E("c" )]), |
2576 | e(r"a*b|c" ) |
2577 | ); |
2578 | |
2579 | assert_eq!( |
2580 | inexact( |
2581 | [E("a" ), E("b" ), I("c" ), E("" )], |
2582 | [E("a" ), E("b" ), I("c" ), E("" )] |
2583 | ), |
2584 | e(r"a|(?:b|c*)" ) |
2585 | ); |
2586 | |
2587 | assert_eq!( |
2588 | inexact( |
2589 | [I("a" ), I("b" ), E("c" ), I("a" ), I("ab" ), E("c" )], |
2590 | [I("ac" ), I("bc" ), E("c" ), I("ac" ), I("abc" ), E("c" )], |
2591 | ), |
2592 | e(r"(a|b)*c|(a|ab)*c" ) |
2593 | ); |
2594 | |
2595 | assert_eq!( |
2596 | exact(["abef" , "abgh" , "cdef" , "cdgh" ]), |
2597 | e(r"(ab|cd)(ef|gh)" ) |
2598 | ); |
2599 | assert_eq!( |
2600 | exact([ |
2601 | "abefij" , "abefkl" , "abghij" , "abghkl" , "cdefij" , "cdefkl" , |
2602 | "cdghij" , "cdghkl" , |
2603 | ]), |
2604 | e(r"(ab|cd)(ef|gh)(ij|kl)" ) |
2605 | ); |
2606 | } |
2607 | |
2608 | #[test ] |
2609 | fn impossible() { |
2610 | let empty: [&str; 0] = []; |
2611 | |
2612 | assert_eq!(exact(empty), e(r"[a&&b]" )); |
2613 | assert_eq!(exact(empty), e(r"a[a&&b]" )); |
2614 | assert_eq!(exact(empty), e(r"[a&&b]b" )); |
2615 | assert_eq!(exact(empty), e(r"a[a&&b]b" )); |
2616 | assert_eq!(exact(["a" , "b" ]), e(r"a|[a&&b]|b" )); |
2617 | assert_eq!(exact(["a" , "b" ]), e(r"a|c[a&&b]|b" )); |
2618 | assert_eq!(exact(["a" , "b" ]), e(r"a|[a&&b]d|b" )); |
2619 | assert_eq!(exact(["a" , "b" ]), e(r"a|c[a&&b]d|b" )); |
2620 | assert_eq!(exact(["" ]), e(r"[a&&b]*" )); |
2621 | assert_eq!(exact(["MN" ]), e(r"M[a&&b]*N" )); |
2622 | } |
2623 | |
2624 | // This tests patterns that contain something that defeats literal |
2625 | // detection, usually because it would blow some limit on the total number |
2626 | // of literals that can be returned. |
2627 | // |
2628 | // The main idea is that when literal extraction sees something that |
2629 | // it knows will blow a limit, it replaces it with a marker that says |
2630 | // "any literal will match here." While not necessarily true, the |
2631 | // over-estimation is just fine for the purposes of literal extraction, |
2632 | // because the imprecision doesn't matter: too big is too big. |
2633 | // |
2634 | // This is one of the trickier parts of literal extraction, since we need |
2635 | // to make sure all of our literal extraction operations correctly compose |
2636 | // with the markers. |
2637 | #[test ] |
2638 | fn anything() { |
2639 | assert_eq!(infinite(), e(r"." )); |
2640 | assert_eq!(infinite(), e(r"(?s)." )); |
2641 | assert_eq!(infinite(), e(r"[A-Za-z]" )); |
2642 | assert_eq!(infinite(), e(r"[A-Z]" )); |
2643 | assert_eq!(exact(["" ]), e(r"[A-Z]{0}" )); |
2644 | assert_eq!(infinite(), e(r"[A-Z]?" )); |
2645 | assert_eq!(infinite(), e(r"[A-Z]*" )); |
2646 | assert_eq!(infinite(), e(r"[A-Z]+" )); |
2647 | assert_eq!((seq([I("1" )]), Seq::infinite()), e(r"1[A-Z]" )); |
2648 | assert_eq!((seq([I("1" )]), seq([I("2" )])), e(r"1[A-Z]2" )); |
2649 | assert_eq!((Seq::infinite(), seq([I("123" )])), e(r"[A-Z]+123" )); |
2650 | assert_eq!(infinite(), e(r"[A-Z]+123[A-Z]+" )); |
2651 | assert_eq!(infinite(), e(r"1|[A-Z]|3" )); |
2652 | assert_eq!( |
2653 | (seq([E("1" ), I("2" ), E("3" )]), Seq::infinite()), |
2654 | e(r"1|2[A-Z]|3" ), |
2655 | ); |
2656 | assert_eq!( |
2657 | (Seq::infinite(), seq([E("1" ), I("2" ), E("3" )])), |
2658 | e(r"1|[A-Z]2|3" ), |
2659 | ); |
2660 | assert_eq!( |
2661 | (seq([E("1" ), I("2" ), E("4" )]), seq([E("1" ), I("3" ), E("4" )])), |
2662 | e(r"1|2[A-Z]3|4" ), |
2663 | ); |
2664 | assert_eq!((Seq::infinite(), seq([I("2" )])), e(r"(?:|1)[A-Z]2" )); |
2665 | assert_eq!(inexact([I("a" )], [I("z" )]), e(r"a.z" )); |
2666 | } |
2667 | |
2668 | // Like the 'anything' test, but it uses smaller limits in order to test |
2669 | // the logic for effectively aborting literal extraction when the seqs get |
2670 | // too big. |
2671 | #[test ] |
2672 | fn anything_small_limits() { |
2673 | fn prefixes(pattern: &str) -> Seq { |
2674 | Extractor::new() |
2675 | .kind(ExtractKind::Prefix) |
2676 | .limit_total(10) |
2677 | .extract(&parse(pattern)) |
2678 | } |
2679 | |
2680 | fn suffixes(pattern: &str) -> Seq { |
2681 | Extractor::new() |
2682 | .kind(ExtractKind::Suffix) |
2683 | .limit_total(10) |
2684 | .extract(&parse(pattern)) |
2685 | } |
2686 | |
2687 | fn e(pattern: &str) -> (Seq, Seq) { |
2688 | (prefixes(pattern), suffixes(pattern)) |
2689 | } |
2690 | |
2691 | assert_eq!( |
2692 | ( |
2693 | seq([ |
2694 | I("aaa" ), |
2695 | I("aab" ), |
2696 | I("aba" ), |
2697 | I("abb" ), |
2698 | I("baa" ), |
2699 | I("bab" ), |
2700 | I("bba" ), |
2701 | I("bbb" ) |
2702 | ]), |
2703 | seq([ |
2704 | I("aaa" ), |
2705 | I("aab" ), |
2706 | I("aba" ), |
2707 | I("abb" ), |
2708 | I("baa" ), |
2709 | I("bab" ), |
2710 | I("bba" ), |
2711 | I("bbb" ) |
2712 | ]) |
2713 | ), |
2714 | e(r"[ab]{3}{3}" ) |
2715 | ); |
2716 | |
2717 | assert_eq!(infinite(), e(r"ab|cd|ef|gh|ij|kl|mn|op|qr|st|uv|wx|yz" )); |
2718 | } |
2719 | |
2720 | #[test ] |
2721 | fn empty() { |
2722 | assert_eq!(exact(["" ]), e(r"" )); |
2723 | assert_eq!(exact(["" ]), e(r"^" )); |
2724 | assert_eq!(exact(["" ]), e(r"$" )); |
2725 | assert_eq!(exact(["" ]), e(r"(?m:^)" )); |
2726 | assert_eq!(exact(["" ]), e(r"(?m:$)" )); |
2727 | assert_eq!(exact(["" ]), e(r"\b" )); |
2728 | assert_eq!(exact(["" ]), e(r"\B" )); |
2729 | assert_eq!(exact(["" ]), e(r"(?-u:\b)" )); |
2730 | assert_eq!(exact(["" ]), e(r"(?-u:\B)" )); |
2731 | } |
2732 | |
2733 | #[test ] |
2734 | fn odds_and_ends() { |
2735 | assert_eq!((Seq::infinite(), seq([I("a" )])), e(r".a" )); |
2736 | assert_eq!((seq([I("a" )]), Seq::infinite()), e(r"a." )); |
2737 | assert_eq!(infinite(), e(r"a|." )); |
2738 | assert_eq!(infinite(), e(r".|a" )); |
2739 | |
2740 | let pat = r"M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]" ; |
2741 | let expected = inexact( |
2742 | ["Mo'am" , "Moam" , "Mu'am" , "Muam" ].map(I), |
2743 | [ |
2744 | "ddafi" , "ddafy" , "dhafi" , "dhafy" , "dzafi" , "dzafy" , "dafi" , |
2745 | "dafy" , "tdafi" , "tdafy" , "thafi" , "thafy" , "tzafi" , "tzafy" , |
2746 | "tafi" , "tafy" , "zdafi" , "zdafy" , "zhafi" , "zhafy" , "zzafi" , |
2747 | "zzafy" , "zafi" , "zafy" , |
2748 | ] |
2749 | .map(I), |
2750 | ); |
2751 | assert_eq!(expected, e(pat)); |
2752 | |
2753 | assert_eq!( |
2754 | (seq(["fn is_" , "fn as_" ].map(I)), Seq::infinite()), |
2755 | e(r"fn is_([A-Z]+)|fn as_([A-Z]+)" ), |
2756 | ); |
2757 | assert_eq!( |
2758 | inexact([I("foo" )], [I("quux" )]), |
2759 | e(r"foo[A-Z]+bar[A-Z]+quux" ) |
2760 | ); |
2761 | assert_eq!(infinite(), e(r"[A-Z]+bar[A-Z]+" )); |
2762 | assert_eq!( |
2763 | exact(["Sherlock Holmes" ]), |
2764 | e(r"(?m)^Sherlock Holmes|Sherlock Holmes$" ) |
2765 | ); |
2766 | |
2767 | assert_eq!(exact(["sa" , "sb" ]), e(r"\bs(?:[ab])" )); |
2768 | } |
2769 | |
2770 | // This tests a specific regex along with some heuristic steps to reduce |
2771 | // the sequences extracted. This is meant to roughly correspond to the |
2772 | // types of heuristics used to shrink literal sets in practice. (Shrinking |
2773 | // is done because you want to balance "spend too much work looking for |
2774 | // too many literals" and "spend too much work processing false positive |
2775 | // matches from short literals.") |
2776 | #[test ] |
2777 | #[cfg (feature = "unicode-case" )] |
2778 | fn holmes() { |
2779 | let expected = inexact( |
2780 | ["HOL" , "HOl" , "HoL" , "Hol" , "hOL" , "hOl" , "hoL" , "hol" ].map(I), |
2781 | [ |
2782 | "MES" , "MEs" , "Eſ" , "MeS" , "Mes" , "eſ" , "mES" , "mEs" , "meS" , |
2783 | "mes" , |
2784 | ] |
2785 | .map(I), |
2786 | ); |
2787 | let (mut prefixes, mut suffixes) = e(r"(?i)Holmes" ); |
2788 | prefixes.keep_first_bytes(3); |
2789 | suffixes.keep_last_bytes(3); |
2790 | prefixes.minimize_by_preference(); |
2791 | suffixes.minimize_by_preference(); |
2792 | assert_eq!(expected, (prefixes, suffixes)); |
2793 | } |
2794 | |
2795 | // This tests that we get some kind of literals extracted for a beefier |
2796 | // alternation with case insensitive mode enabled. At one point during |
2797 | // development, this returned nothing, and motivated some special case |
2798 | // code in Extractor::union to try and trim down the literal sequences |
2799 | // if the union would blow the limits set. |
2800 | #[test ] |
2801 | #[cfg (feature = "unicode-case" )] |
2802 | fn holmes_alt() { |
2803 | let mut pre = |
2804 | prefixes(r"(?i)Sherlock|Holmes|Watson|Irene|Adler|John|Baker" ); |
2805 | assert!(pre.len().unwrap() > 0); |
2806 | pre.optimize_for_prefix_by_preference(); |
2807 | assert!(pre.len().unwrap() > 0); |
2808 | } |
2809 | |
2810 | // See: https://github.com/rust-lang/regex/security/advisories/GHSA-m5pq-gvj9-9vr8 |
2811 | // See: CVE-2022-24713 |
2812 | // |
2813 | // We test this here to ensure literal extraction completes in reasonable |
2814 | // time and isn't materially impacted by these sorts of pathological |
2815 | // repeats. |
2816 | #[test ] |
2817 | fn crazy_repeats() { |
2818 | assert_eq!(inexact([I("" )], [I("" )]), e(r"(?:){4294967295}" )); |
2819 | assert_eq!( |
2820 | inexact([I("" )], [I("" )]), |
2821 | e(r"(?:){64}{64}{64}{64}{64}{64}" ) |
2822 | ); |
2823 | assert_eq!(inexact([I("" )], [I("" )]), e(r"x{0}{4294967295}" )); |
2824 | assert_eq!(inexact([I("" )], [I("" )]), e(r"(?:|){4294967295}" )); |
2825 | |
2826 | assert_eq!( |
2827 | inexact([E("" )], [E("" )]), |
2828 | e(r"(?:){8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}" ) |
2829 | ); |
2830 | let repa = "a" .repeat(100); |
2831 | assert_eq!( |
2832 | inexact([I(&repa)], [I(&repa)]), |
2833 | e(r"a{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}" ) |
2834 | ); |
2835 | } |
2836 | |
2837 | #[test ] |
2838 | fn huge() { |
2839 | let pat = r#"(?-u) |
2840 | 2(?: |
2841 | [45]\d{3}| |
2842 | 7(?: |
2843 | 1[0-267]| |
2844 | 2[0-289]| |
2845 | 3[0-29]| |
2846 | 4[01]| |
2847 | 5[1-3]| |
2848 | 6[013]| |
2849 | 7[0178]| |
2850 | 91 |
2851 | )| |
2852 | 8(?: |
2853 | 0[125]| |
2854 | [139][1-6]| |
2855 | 2[0157-9]| |
2856 | 41| |
2857 | 6[1-35]| |
2858 | 7[1-5]| |
2859 | 8[1-8]| |
2860 | 90 |
2861 | )| |
2862 | 9(?: |
2863 | 0[0-2]| |
2864 | 1[0-4]| |
2865 | 2[568]| |
2866 | 3[3-6]| |
2867 | 5[5-7]| |
2868 | 6[0167]| |
2869 | 7[15]| |
2870 | 8[0146-9] |
2871 | ) |
2872 | )\d{4}| |
2873 | 3(?: |
2874 | 12?[5-7]\d{2}| |
2875 | 0(?: |
2876 | 2(?: |
2877 | [025-79]\d| |
2878 | [348]\d{1,2} |
2879 | )| |
2880 | 3(?: |
2881 | [2-4]\d| |
2882 | [56]\d? |
2883 | ) |
2884 | )| |
2885 | 2(?: |
2886 | 1\d{2}| |
2887 | 2(?: |
2888 | [12]\d| |
2889 | [35]\d{1,2}| |
2890 | 4\d? |
2891 | ) |
2892 | )| |
2893 | 3(?: |
2894 | 1\d{2}| |
2895 | 2(?: |
2896 | [2356]\d| |
2897 | 4\d{1,2} |
2898 | ) |
2899 | )| |
2900 | 4(?: |
2901 | 1\d{2}| |
2902 | 2(?: |
2903 | 2\d{1,2}| |
2904 | [47]| |
2905 | 5\d{2} |
2906 | ) |
2907 | )| |
2908 | 5(?: |
2909 | 1\d{2}| |
2910 | 29 |
2911 | )| |
2912 | [67]1\d{2}| |
2913 | 8(?: |
2914 | 1\d{2}| |
2915 | 2(?: |
2916 | 2\d{2}| |
2917 | 3| |
2918 | 4\d |
2919 | ) |
2920 | ) |
2921 | )\d{3}| |
2922 | 4(?: |
2923 | 0(?: |
2924 | 2(?: |
2925 | [09]\d| |
2926 | 7 |
2927 | )| |
2928 | 33\d{2} |
2929 | )| |
2930 | 1\d{3}| |
2931 | 2(?: |
2932 | 1\d{2}| |
2933 | 2(?: |
2934 | [25]\d?| |
2935 | [348]\d| |
2936 | [67]\d{1,2} |
2937 | ) |
2938 | )| |
2939 | 3(?: |
2940 | 1\d{2}(?: |
2941 | \d{2} |
2942 | )?| |
2943 | 2(?: |
2944 | [045]\d| |
2945 | [236-9]\d{1,2} |
2946 | )| |
2947 | 32\d{2} |
2948 | )| |
2949 | 4(?: |
2950 | [18]\d{2}| |
2951 | 2(?: |
2952 | [2-46]\d{2}| |
2953 | 3 |
2954 | )| |
2955 | 5[25]\d{2} |
2956 | )| |
2957 | 5(?: |
2958 | 1\d{2}| |
2959 | 2(?: |
2960 | 3\d| |
2961 | 5 |
2962 | ) |
2963 | )| |
2964 | 6(?: |
2965 | [18]\d{2}| |
2966 | 2(?: |
2967 | 3(?: |
2968 | \d{2} |
2969 | )?| |
2970 | [46]\d{1,2}| |
2971 | 5\d{2}| |
2972 | 7\d |
2973 | )| |
2974 | 5(?: |
2975 | 3\d?| |
2976 | 4\d| |
2977 | [57]\d{1,2}| |
2978 | 6\d{2}| |
2979 | 8 |
2980 | ) |
2981 | )| |
2982 | 71\d{2}| |
2983 | 8(?: |
2984 | [18]\d{2}| |
2985 | 23\d{2}| |
2986 | 54\d{2} |
2987 | )| |
2988 | 9(?: |
2989 | [18]\d{2}| |
2990 | 2[2-5]\d{2}| |
2991 | 53\d{1,2} |
2992 | ) |
2993 | )\d{3}| |
2994 | 5(?: |
2995 | 02[03489]\d{2}| |
2996 | 1\d{2}| |
2997 | 2(?: |
2998 | 1\d{2}| |
2999 | 2(?: |
3000 | 2(?: |
3001 | \d{2} |
3002 | )?| |
3003 | [457]\d{2} |
3004 | ) |
3005 | )| |
3006 | 3(?: |
3007 | 1\d{2}| |
3008 | 2(?: |
3009 | [37](?: |
3010 | \d{2} |
3011 | )?| |
3012 | [569]\d{2} |
3013 | ) |
3014 | )| |
3015 | 4(?: |
3016 | 1\d{2}| |
3017 | 2[46]\d{2} |
3018 | )| |
3019 | 5(?: |
3020 | 1\d{2}| |
3021 | 26\d{1,2} |
3022 | )| |
3023 | 6(?: |
3024 | [18]\d{2}| |
3025 | 2| |
3026 | 53\d{2} |
3027 | )| |
3028 | 7(?: |
3029 | 1| |
3030 | 24 |
3031 | )\d{2}| |
3032 | 8(?: |
3033 | 1| |
3034 | 26 |
3035 | )\d{2}| |
3036 | 91\d{2} |
3037 | )\d{3}| |
3038 | 6(?: |
3039 | 0(?: |
3040 | 1\d{2}| |
3041 | 2(?: |
3042 | 3\d{2}| |
3043 | 4\d{1,2} |
3044 | ) |
3045 | )| |
3046 | 2(?: |
3047 | 2[2-5]\d{2}| |
3048 | 5(?: |
3049 | [3-5]\d{2}| |
3050 | 7 |
3051 | )| |
3052 | 8\d{2} |
3053 | )| |
3054 | 3(?: |
3055 | 1| |
3056 | 2[3478] |
3057 | )\d{2}| |
3058 | 4(?: |
3059 | 1| |
3060 | 2[34] |
3061 | )\d{2}| |
3062 | 5(?: |
3063 | 1| |
3064 | 2[47] |
3065 | )\d{2}| |
3066 | 6(?: |
3067 | [18]\d{2}| |
3068 | 6(?: |
3069 | 2(?: |
3070 | 2\d| |
3071 | [34]\d{2} |
3072 | )| |
3073 | 5(?: |
3074 | [24]\d{2}| |
3075 | 3\d| |
3076 | 5\d{1,2} |
3077 | ) |
3078 | ) |
3079 | )| |
3080 | 72[2-5]\d{2}| |
3081 | 8(?: |
3082 | 1\d{2}| |
3083 | 2[2-5]\d{2} |
3084 | )| |
3085 | 9(?: |
3086 | 1\d{2}| |
3087 | 2[2-6]\d{2} |
3088 | ) |
3089 | )\d{3}| |
3090 | 7(?: |
3091 | (?: |
3092 | 02| |
3093 | [3-589]1| |
3094 | 6[12]| |
3095 | 72[24] |
3096 | )\d{2}| |
3097 | 21\d{3}| |
3098 | 32 |
3099 | )\d{3}| |
3100 | 8(?: |
3101 | (?: |
3102 | 4[12]| |
3103 | [5-7]2| |
3104 | 1\d? |
3105 | )| |
3106 | (?: |
3107 | 0| |
3108 | 3[12]| |
3109 | [5-7]1| |
3110 | 217 |
3111 | )\d |
3112 | )\d{4}| |
3113 | 9(?: |
3114 | [35]1| |
3115 | (?: |
3116 | [024]2| |
3117 | 81 |
3118 | )\d| |
3119 | (?: |
3120 | 1| |
3121 | [24]1 |
3122 | )\d{2} |
3123 | )\d{3} |
3124 | "# ; |
3125 | // TODO: This is a good candidate of a seq of literals that could be |
3126 | // shrunk quite a bit and still be very productive with respect to |
3127 | // literal optimizations. |
3128 | let (prefixes, suffixes) = e(pat); |
3129 | assert!(!suffixes.is_finite()); |
3130 | assert_eq!(Some(243), prefixes.len()); |
3131 | } |
3132 | |
3133 | #[test ] |
3134 | fn optimize() { |
3135 | // This gets a common prefix that isn't too short. |
3136 | let (p, s) = |
3137 | opt(["foobarfoobar" , "foobar" , "foobarzfoobar" , "foobarfoobar" ]); |
3138 | assert_eq!(seq([I("foobar" )]), p); |
3139 | assert_eq!(seq([I("foobar" )]), s); |
3140 | |
3141 | // This also finds a common prefix, but since it's only one byte, it |
3142 | // prefers the multiple literals. |
3143 | let (p, s) = opt(["abba" , "akka" , "abccba" ]); |
3144 | assert_eq!(exact(["abba" , "akka" , "abccba" ]), (p, s)); |
3145 | |
3146 | let (p, s) = opt(["sam" , "samwise" ]); |
3147 | assert_eq!((seq([E("sam" )]), seq([E("sam" ), E("samwise" )])), (p, s)); |
3148 | |
3149 | // The empty string is poisonous, so our seq becomes infinite, even |
3150 | // though all literals are exact. |
3151 | let (p, s) = opt(["foobarfoo" , "foo" , "" , "foozfoo" , "foofoo" ]); |
3152 | assert!(!p.is_finite()); |
3153 | assert!(!s.is_finite()); |
3154 | |
3155 | // A space is also poisonous, so our seq becomes infinite. But this |
3156 | // only gets triggered when we don't have a completely exact sequence. |
3157 | // When the sequence is exact, spaces are okay, since we presume that |
3158 | // any prefilter will match a space more quickly than the regex engine. |
3159 | // (When the sequence is exact, there's a chance of the prefilter being |
3160 | // used without needing the regex engine at all.) |
3161 | let mut p = seq([E("foobarfoo" ), I("foo" ), E(" " ), E("foofoo" )]); |
3162 | p.optimize_for_prefix_by_preference(); |
3163 | assert!(!p.is_finite()); |
3164 | } |
3165 | } |
3166 | |