/*!
Provides literal extraction from `Hir` expressions.

An [`Extractor`] pulls literals out of [`Hir`] expressions and returns a
[`Seq`] of [`Literal`]s.

The purpose of literal extraction is generally to provide avenues for
optimizing regex searches. The main idea is that a substring search can be an
order of magnitude faster than a regex search. Therefore, if one can execute
a substring search to find candidate match locations and only run the regex
search at those locations, then it is possible for huge improvements in
performance to be realized.

With that said, literal optimizations are generally a black art because even
though substring search is generally faster, if the number of candidates
produced is high, then it can create a lot of overhead by ping-ponging between
the substring search and the regex search.

Here are some heuristics that might be used to help increase the chances of
effective literal optimizations:

* Stick to small [`Seq`]s. If you search for too many literals, it's likely
to lead to a substring search that is only a little faster than a regex
search, and thus the overhead of using literal optimizations in the first
place might make things slower overall.
* The literals in your [`Seq`] shouldn't be too short. In general, longer is
better. A sequence corresponding to single bytes that occur frequently in the
haystack, for example, is probably a bad literal optimization because it's
likely to produce many false positive candidates. Longer literals are less
likely to match, and thus probably produce fewer false positives.
* If it's possible to estimate the approximate frequency of each byte according
to some pre-computed background distribution, it is possible to compute a score
of how "good" a `Seq` is. If a `Seq` isn't good enough, you might consider
skipping the literal optimization and just using the regex engine.

(It should be noted that there are always pathological cases that can make
any kind of literal optimization a net loss. This is why it might be a good
idea to be conservative, or to even provide a means for literal optimizations
to be dynamically disabled if they are determined to be ineffective according
to some measure.)

You're encouraged to explore the methods on [`Seq`], which permit shrinking
the size of sequences in a preference-order preserving fashion.

Finally, note that it isn't strictly necessary to use an [`Extractor`].
Namely, an `Extractor` only uses public APIs of the [`Seq`] and [`Literal`]
types, so it is possible to implement your own extractor, for example, for
n-grams or "inner" literals (i.e., not prefix or suffix literals). The
`Extractor` is mostly responsible for the case analysis over `Hir`
expressions. Much of the "trickier" parts are how to combine literal
sequences, and that is all implemented on [`Seq`].
*/

use core::{cmp, mem, num::NonZeroUsize};

use alloc::{vec, vec::Vec};

use crate::hir::{self, Hir};

/// Extracts prefix or suffix literal sequences from [`Hir`] expressions.
///
/// Literal extraction is based on the following observations:
///
/// * Many regexes start with one or a small number of literals.
/// * Substring search for literals is often much faster (sometimes by an order
/// of magnitude) than a regex search.
///
/// Thus, in many cases, one can search for literals to find candidate starting
/// locations of a match, and then only run the full regex engine at each such
/// location instead of over the full haystack.
///
/// The main downside of literal extraction is that it can wind up causing a
/// search to be slower overall. For example, if there are many matches or if
/// there are many candidates that don't ultimately lead to a match, then a
/// lot of overhead will be spent in shuffling back-and-forth between the
/// substring search and the regex engine. This is the fundamental reason why
/// literal optimizations for regex patterns are sometimes considered a "black
/// art."
///
/// # Look-around assertions
///
/// Literal extraction treats all look-around assertions as-if they match every
/// empty string. So for example, the regex `\bquux\b` will yield a sequence
/// containing a single exact literal `quux`. However, not all occurrences
/// of `quux` correspond to a match of the regex. For example, `\bquux\b`
/// does not match `ZquuxZ` anywhere because `quux` does not fall on a word
/// boundary.
///
/// In effect, if your regex contains look-around assertions, then a match of
/// an exact literal does not necessarily mean the regex overall matches. So
/// you may still need to run the regex engine in such cases to confirm the
/// match.
///
/// The precise guarantee you get from a literal sequence is: if every literal
/// in the sequence is exact and the original regex contains zero look-around
/// assertions, then a preference-order multi-substring search of those
/// literals will precisely match a preference-order search of the original
/// regex.
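///
/// For example, `\bquux\b` yields an exact literal even though its
/// look-around assertions still need to be checked by the regex engine:
///
/// ```
/// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
///
/// let hir = parse(r"\bquux\b")?;
/// let got = Extractor::new().extract(&hir);
/// // "Exact" here refers only to literal extraction. A match of 'quux'
/// // must still be confirmed by the regex engine, since the word
/// // boundaries were treated as if they always match.
/// assert_eq!(Seq::from_iter([Literal::exact("quux")]), got);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```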
///
/// # Example
///
/// This shows how to extract prefixes:
///
/// ```
/// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
///
/// let hir = parse(r"(a|b|c)(x|y|z)[A-Z]+foo")?;
/// let got = Extractor::new().extract(&hir);
/// // All literals returned are "inexact" because none of them reach the
/// // match state.
/// let expected = Seq::from_iter([
///     Literal::inexact("ax"),
///     Literal::inexact("ay"),
///     Literal::inexact("az"),
///     Literal::inexact("bx"),
///     Literal::inexact("by"),
///     Literal::inexact("bz"),
///     Literal::inexact("cx"),
///     Literal::inexact("cy"),
///     Literal::inexact("cz"),
/// ]);
/// assert_eq!(expected, got);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// This shows how to extract suffixes:
///
/// ```
/// use regex_syntax::{
///     hir::literal::{Extractor, ExtractKind, Literal, Seq},
///     parse,
/// };
///
/// let hir = parse(r"foo|[A-Z]+bar")?;
/// let got = Extractor::new().kind(ExtractKind::Suffix).extract(&hir);
/// // Since 'foo' gets to a match state, it is considered exact. But 'bar'
/// // does not because of the '[A-Z]+', and thus is marked inexact.
/// let expected = Seq::from_iter([
///     Literal::exact("foo"),
///     Literal::inexact("bar"),
/// ]);
/// assert_eq!(expected, got);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[derive(Clone, Debug)]
pub struct Extractor {
    kind: ExtractKind,
    limit_class: usize,
    limit_repeat: usize,
    limit_literal_len: usize,
    limit_total: usize,
}

impl Extractor {
    /// Create a new extractor with a default configuration.
    ///
    /// The extractor can be optionally configured before calling
    /// [`Extractor::extract`] to get a literal sequence.
    pub fn new() -> Extractor {
        Extractor {
            kind: ExtractKind::Prefix,
            limit_class: 10,
            limit_repeat: 10,
            limit_literal_len: 100,
            limit_total: 250,
        }
    }

    /// Execute the extractor and return a sequence of literals.
    pub fn extract(&self, hir: &Hir) -> Seq {
        use crate::hir::HirKind::*;

        match *hir.kind() {
            Empty | Look(_) => Seq::singleton(self::Literal::exact(vec![])),
            Literal(hir::Literal(ref bytes)) => {
                let mut seq =
                    Seq::singleton(self::Literal::exact(bytes.to_vec()));
                self.enforce_literal_len(&mut seq);
                seq
            }
            Class(hir::Class::Unicode(ref cls)) => {
                self.extract_class_unicode(cls)
            }
            Class(hir::Class::Bytes(ref cls)) => self.extract_class_bytes(cls),
            Repetition(ref rep) => self.extract_repetition(rep),
            Capture(hir::Capture { ref sub, .. }) => self.extract(sub),
            Concat(ref hirs) => match self.kind {
                ExtractKind::Prefix => self.extract_concat(hirs.iter()),
                ExtractKind::Suffix => self.extract_concat(hirs.iter().rev()),
            },
            Alternation(ref hirs) => {
                // Unlike concat, we always union starting from the beginning,
                // since the beginning corresponds to the highest preference,
                // which doesn't change based on forwards vs reverse.
                self.extract_alternation(hirs.iter())
            }
        }
    }

    /// Set the kind of literal sequence to extract from an [`Hir`] expression.
    ///
    /// The default is to extract prefixes, but suffixes can be selected
    /// instead. The contract for prefixes is that every match of the
    /// corresponding `Hir` must start with one of the literals in the sequence
    /// returned. Moreover, the _order_ of the sequence returned corresponds to
    /// the preference order.
    ///
    /// Suffixes satisfy a similar contract in that every match of the
    /// corresponding `Hir` must end with one of the literals in the sequence
    /// returned. However, there is no guarantee that the literals are in
    /// preference order.
    ///
    /// Remember that a sequence can be infinite. For example, unless the
    /// limits are configured to be impractically large, attempting to extract
    /// prefixes (or suffixes) for the pattern `[A-Z]` will return an infinite
    /// sequence. Generally speaking, if the sequence returned is infinite,
    /// then it is presumed to be unwise to do prefix (or suffix) optimizations
    /// for the pattern.
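    ///
    /// # Example
    ///
    /// This demonstrates the infinite case described above: under the default
    /// limits, `[A-Z]` is too big to expand into literals.
    ///
    /// ```
    /// use regex_syntax::{hir::literal::{Extractor, Seq}, parse};
    ///
    /// let hir = parse(r"[A-Z]")?;
    /// let got = Extractor::new().extract(&hir);
    /// assert_eq!(Seq::infinite(), got);
    ///
    /// # Ok::<(), Box<dyn std::error::Error>>(())
    /// ```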
    pub fn kind(&mut self, kind: ExtractKind) -> &mut Extractor {
        self.kind = kind;
        self
    }

    /// Configure a limit on the length of the sequence that is permitted for
    /// a character class. If a character class exceeds this limit, then the
    /// sequence returned for it is infinite.
    ///
    /// This prevents classes like `[A-Z]` or `\pL` from getting turned into
    /// huge and likely unproductive sequences of literals.
    ///
    /// # Example
    ///
    /// This example shows how this limit can be lowered to decrease the
    /// tolerance for character classes being turned into literal sequences.
    ///
    /// ```
    /// use regex_syntax::{hir::literal::{Extractor, Seq}, parse};
    ///
    /// let hir = parse(r"[0-9]")?;
    ///
    /// let got = Extractor::new().extract(&hir);
    /// let expected = Seq::new([
    ///     "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
    /// ]);
    /// assert_eq!(expected, got);
    ///
    /// // Now let's shrink the limit and see how that changes things.
    /// let got = Extractor::new().limit_class(4).extract(&hir);
    /// let expected = Seq::infinite();
    /// assert_eq!(expected, got);
    ///
    /// # Ok::<(), Box<dyn std::error::Error>>(())
    /// ```
    pub fn limit_class(&mut self, limit: usize) -> &mut Extractor {
        self.limit_class = limit;
        self
    }

    /// Configure a limit on the total number of repetitions that is permitted
    /// before literal extraction is stopped.
    ///
    /// This is useful for limiting things like `(abcde){50}`, or more
    /// insidiously, `(?:){1000000000}`. This limit prevents any one single
    /// repetition from adding too much to a literal sequence.
    ///
    /// With this limit set, repetitions that exceed it will be stopped and any
    /// literals extracted up to that point will be made inexact.
    ///
    /// # Example
    ///
    /// This shows how to decrease the limit and compares it with the default.
    ///
    /// ```
    /// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
    ///
    /// let hir = parse(r"(abc){8}")?;
    ///
    /// let got = Extractor::new().extract(&hir);
    /// let expected = Seq::new(["abcabcabcabcabcabcabcabc"]);
    /// assert_eq!(expected, got);
    ///
    /// // Now let's shrink the limit and see how that changes things.
    /// let got = Extractor::new().limit_repeat(4).extract(&hir);
    /// let expected = Seq::from_iter([
    ///     Literal::inexact("abcabcabcabc"),
    /// ]);
    /// assert_eq!(expected, got);
    ///
    /// # Ok::<(), Box<dyn std::error::Error>>(())
    /// ```
    pub fn limit_repeat(&mut self, limit: usize) -> &mut Extractor {
        self.limit_repeat = limit;
        self
    }

    /// Configure a limit on the maximum length of any literal in a sequence.
    ///
    /// This is useful for limiting things like `(abcde){5}{5}{5}{5}`. While
    /// each repetition or literal in that regex is small, when all the
    /// repetitions are applied, one ends up with a literal of length `5^4 =
    /// 625`.
    ///
    /// With this limit set, literals that exceed it will be made inexact and
    /// thus prevented from growing.
    ///
    /// # Example
    ///
    /// This shows how to decrease the limit and compares it with the default.
    ///
    /// ```
    /// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
    ///
    /// let hir = parse(r"(abc){2}{2}{2}")?;
    ///
    /// let got = Extractor::new().extract(&hir);
    /// let expected = Seq::new(["abcabcabcabcabcabcabcabc"]);
    /// assert_eq!(expected, got);
    ///
    /// // Now let's shrink the limit and see how that changes things.
    /// let got = Extractor::new().limit_literal_len(14).extract(&hir);
    /// let expected = Seq::from_iter([
    ///     Literal::inexact("abcabcabcabcab"),
    /// ]);
    /// assert_eq!(expected, got);
    ///
    /// # Ok::<(), Box<dyn std::error::Error>>(())
    /// ```
    pub fn limit_literal_len(&mut self, limit: usize) -> &mut Extractor {
        self.limit_literal_len = limit;
        self
    }

    /// Configure a limit on the total number of literals that will be
    /// returned.
    ///
    /// This is useful as a practical measure for avoiding the creation of
    /// large sequences of literals. While the extractor will automatically
    /// handle local creations of large sequences (for example, `[A-Z]` yields
    /// an infinite sequence by default), large sequences can be created
    /// through non-local means as well.
    ///
    /// For example, `[ab]{3}{3}` would yield a sequence of length `512 = 2^9`
    /// despite each of the repetitions being small on their own. This limit
    /// thus represents a "catch all" for avoiding locally small sequences from
    /// combining into large sequences.
    ///
    /// # Example
    ///
    /// This example shows how reducing the limit will change the literal
    /// sequence returned.
    ///
    /// ```
    /// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
    ///
    /// let hir = parse(r"[ab]{2}{2}")?;
    ///
    /// let got = Extractor::new().extract(&hir);
    /// let expected = Seq::new([
    ///     "aaaa", "aaab", "aaba", "aabb",
    ///     "abaa", "abab", "abba", "abbb",
    ///     "baaa", "baab", "baba", "babb",
    ///     "bbaa", "bbab", "bbba", "bbbb",
    /// ]);
    /// assert_eq!(expected, got);
    ///
    /// // The default limit is not too big, but big enough to extract all
    /// // literals from '[ab]{2}{2}'. If we shrink the limit to less than 16,
    /// // then we'll get a truncated set. Notice that it returns a sequence of
    /// // length 4 even though our limit was 10. This is because the sequence
    /// // is difficult to increase without blowing the limit. Notice also
    /// // that every literal in the sequence is now inexact because they were
    /// // stripped of some suffix.
    /// let got = Extractor::new().limit_total(10).extract(&hir);
    /// let expected = Seq::from_iter([
    ///     Literal::inexact("aa"),
    ///     Literal::inexact("ab"),
    ///     Literal::inexact("ba"),
    ///     Literal::inexact("bb"),
    /// ]);
    /// assert_eq!(expected, got);
    ///
    /// # Ok::<(), Box<dyn std::error::Error>>(())
    /// ```
    pub fn limit_total(&mut self, limit: usize) -> &mut Extractor {
        self.limit_total = limit;
        self
    }

    /// Extract a sequence from the given concatenation. Sequences from each of
    /// the child HIR expressions are combined via cross product.
    ///
    /// This short circuits once the cross product turns into a sequence
    /// containing only inexact literals.
    fn extract_concat<'a, I: Iterator<Item = &'a Hir>>(&self, it: I) -> Seq {
        let mut seq = Seq::singleton(self::Literal::exact(vec![]));
        for hir in it {
            // If every element in the sequence is inexact, then a cross
            // product will always be a no-op. Thus, there is nothing else we
            // can add to it, so we can quit early. Note that this also
            // includes infinite sequences.
            if seq.is_inexact() {
                break;
            }
            // Note that 'cross' also dispatches based on whether we're
            // extracting prefixes or suffixes.
            seq = self.cross(seq, &mut self.extract(hir));
        }
        seq
    }

    /// Extract a sequence from the given alternation.
    ///
    /// This short circuits once the union turns into an infinite sequence.
    fn extract_alternation<'a, I: Iterator<Item = &'a Hir>>(
        &self,
        it: I,
    ) -> Seq {
        let mut seq = Seq::empty();
        for hir in it {
            // Once our 'seq' is infinite, every subsequent union
            // operation on it will itself always result in an
            // infinite sequence. Thus, it can never change and we can
            // short-circuit.
            if !seq.is_finite() {
                break;
            }
            seq = self.union(seq, &mut self.extract(hir));
        }
        seq
    }

    /// Extract a sequence of literals from the given repetition. We do our
    /// best. Some examples:
    ///
    /// 'a*' => [inexact(a), exact("")]
    /// 'a*?' => [exact(""), inexact(a)]
    /// 'a+' => [inexact(a)]
    /// 'a{3}' => [exact(aaa)]
    /// 'a{3,5}' => [inexact(aaa)]
    ///
    /// The key here really is making sure we get the 'inexact' vs 'exact'
    /// attributes correct on each of the literals we add. For example, the
    /// fact that 'a*' gives us an inexact 'a' and an exact empty string means
    /// that a regex like 'ab*c' will result in [inexact(ab), exact(ac)]
    /// literals being extracted, which might actually be a better prefilter
    /// than just 'a'.
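    ///
    /// For example, the 'ab*c' case above, demonstrated via the public API:
    ///
    /// ```
    /// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
    ///
    /// let hir = parse(r"ab*c")?;
    /// let got = Extractor::new().extract(&hir);
    /// let expected = Seq::from_iter([
    ///     Literal::inexact("ab"),
    ///     Literal::exact("ac"),
    /// ]);
    /// assert_eq!(expected, got);
    ///
    /// # Ok::<(), Box<dyn std::error::Error>>(())
    /// ```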
    fn extract_repetition(&self, rep: &hir::Repetition) -> Seq {
        let mut subseq = self.extract(&rep.sub);
        match *rep {
            hir::Repetition { min: 0, max, greedy, .. } => {
                // When 'max=1', we can retain exactness, since 'a?' is
                // equivalent to 'a|'. Similarly below, 'a??' is equivalent to
                // '|a'.
                if max != Some(1) {
                    subseq.make_inexact();
                }
                let mut empty = Seq::singleton(Literal::exact(vec![]));
                if !greedy {
                    mem::swap(&mut subseq, &mut empty);
                }
                self.union(subseq, &mut empty)
            }
            hir::Repetition { min, max: Some(max), .. } if min == max => {
                assert!(min > 0); // handled above
                let limit =
                    u32::try_from(self.limit_repeat).unwrap_or(u32::MAX);
                let mut seq = Seq::singleton(Literal::exact(vec![]));
                for _ in 0..cmp::min(min, limit) {
                    if seq.is_inexact() {
                        break;
                    }
                    seq = self.cross(seq, &mut subseq.clone());
                }
                if usize::try_from(min).is_err() || min > limit {
                    seq.make_inexact();
                }
                seq
            }
            hir::Repetition { min, .. } => {
                assert!(min > 0); // handled above
                let limit =
                    u32::try_from(self.limit_repeat).unwrap_or(u32::MAX);
                let mut seq = Seq::singleton(Literal::exact(vec![]));
                for _ in 0..cmp::min(min, limit) {
                    if seq.is_inexact() {
                        break;
                    }
                    seq = self.cross(seq, &mut subseq.clone());
                }
                seq.make_inexact();
                seq
            }
        }
    }

    /// Convert the given Unicode class into a sequence of literals if the
    /// class is small enough. If the class is too big, return an infinite
    /// sequence.
    fn extract_class_unicode(&self, cls: &hir::ClassUnicode) -> Seq {
        if self.class_over_limit_unicode(cls) {
            return Seq::infinite();
        }
        let mut seq = Seq::empty();
        for r in cls.iter() {
            for ch in r.start()..=r.end() {
                seq.push(Literal::from(ch));
            }
        }
        self.enforce_literal_len(&mut seq);
        seq
    }

    /// Convert the given byte class into a sequence of literals if the class
    /// is small enough. If the class is too big, return an infinite sequence.
    fn extract_class_bytes(&self, cls: &hir::ClassBytes) -> Seq {
        if self.class_over_limit_bytes(cls) {
            return Seq::infinite();
        }
        let mut seq = Seq::empty();
        for r in cls.iter() {
            for b in r.start()..=r.end() {
                seq.push(Literal::from(b));
            }
        }
        self.enforce_literal_len(&mut seq);
        seq
    }

    /// Returns true if the given Unicode class exceeds the configured limits
    /// on this extractor.
    fn class_over_limit_unicode(&self, cls: &hir::ClassUnicode) -> bool {
        let mut count = 0;
        for r in cls.iter() {
            if count > self.limit_class {
                return true;
            }
            count += r.len();
        }
        count > self.limit_class
    }

    /// Returns true if the given byte class exceeds the configured limits on
    /// this extractor.
    fn class_over_limit_bytes(&self, cls: &hir::ClassBytes) -> bool {
        let mut count = 0;
        for r in cls.iter() {
            if count > self.limit_class {
                return true;
            }
            count += r.len();
        }
        count > self.limit_class
    }

    /// Compute the cross product of the two sequences if the result would be
    /// within configured limits. Otherwise, make `seq2` infinite and cross the
    /// infinite sequence with `seq1`.
    fn cross(&self, mut seq1: Seq, seq2: &mut Seq) -> Seq {
        if seq1.max_cross_len(seq2).map_or(false, |len| len > self.limit_total)
        {
            seq2.make_infinite();
        }
        if let ExtractKind::Suffix = self.kind {
            seq1.cross_reverse(seq2);
        } else {
            seq1.cross_forward(seq2);
        }
        assert!(seq1.len().map_or(true, |x| x <= self.limit_total));
        self.enforce_literal_len(&mut seq1);
        seq1
    }

    /// Union the two sequences if the result would be within configured
    /// limits. Otherwise, make `seq2` infinite and union the infinite sequence
    /// with `seq1`.
    fn union(&self, mut seq1: Seq, seq2: &mut Seq) -> Seq {
        if seq1.max_union_len(seq2).map_or(false, |len| len > self.limit_total)
        {
            // We try to trim our literal sequences to see if we can make
            // room for more literals. The idea is that we'd rather trim down
            // literals already in our sequence if it means we can add a few
            // more and retain a finite sequence. Otherwise, we'll union with
            // an infinite sequence and that infects everything and effectively
            // stops literal extraction in its tracks.
            //
            // Why do we keep 4 bytes here? Well, it's a bit of an abstraction
            // leakage. Downstream, the literals may wind up getting fed to
            // the Teddy algorithm, which supports searching literals up to
            // length 4. So that's why we pick that number here. Arguably this
            // should be a tuneable parameter, but it seems a little tricky to
            // describe. And I'm still unsure if this is the right way to go
            // about culling literal sequences.
            match self.kind {
                ExtractKind::Prefix => {
                    seq1.keep_first_bytes(4);
                    seq2.keep_first_bytes(4);
                }
                ExtractKind::Suffix => {
                    seq1.keep_last_bytes(4);
                    seq2.keep_last_bytes(4);
                }
            }
            seq1.dedup();
            seq2.dedup();
            if seq1
                .max_union_len(seq2)
                .map_or(false, |len| len > self.limit_total)
            {
                seq2.make_infinite();
            }
        }
        seq1.union(seq2);
        assert!(seq1.len().map_or(true, |x| x <= self.limit_total));
        seq1
    }

    /// Applies the literal length limit to the given sequence. If none of the
    /// literals in the sequence exceed the limit, then this is a no-op.
    fn enforce_literal_len(&self, seq: &mut Seq) {
        let len = self.limit_literal_len;
        match self.kind {
            ExtractKind::Prefix => seq.keep_first_bytes(len),
            ExtractKind::Suffix => seq.keep_last_bytes(len),
        }
    }
}

impl Default for Extractor {
    fn default() -> Extractor {
        Extractor::new()
    }
}

/// The kind of literals to extract from an [`Hir`] expression.
///
/// The default extraction kind is `Prefix`.
#[non_exhaustive]
#[derive(Clone, Debug)]
pub enum ExtractKind {
    /// Extracts only prefix literals from a regex.
    Prefix,
    /// Extracts only suffix literals from a regex.
    ///
    /// Note that the sequence returned by suffix literals currently may
    /// not correctly represent leftmost-first or "preference" order match
    /// semantics.
    Suffix,
}

impl ExtractKind {
    /// Returns true if this kind is the `Prefix` variant.
    pub fn is_prefix(&self) -> bool {
        matches!(*self, ExtractKind::Prefix)
    }

    /// Returns true if this kind is the `Suffix` variant.
    pub fn is_suffix(&self) -> bool {
        matches!(*self, ExtractKind::Suffix)
    }
}

impl Default for ExtractKind {
    fn default() -> ExtractKind {
        ExtractKind::Prefix
    }
}
| 668 | |
| 669 | /// A sequence of literals. |
| 670 | /// |
| 671 | /// A `Seq` is very much like a set in that it represents a union of its |
| 672 | /// members. That is, it corresponds to a set of literals where at least one |
| 673 | /// must match in order for a particular [`Hir`] expression to match. (Whether |
| 674 | /// this corresponds to the entire `Hir` expression, a prefix of it or a suffix |
| 675 | /// of it depends on how the `Seq` was extracted from the `Hir`.) |
| 676 | /// |
| 677 | /// It is also unlike a set in that multiple identical literals may appear, |
| 678 | /// and that the order of the literals in the `Seq` matters. For example, if |
| 679 | /// the sequence is `[sam, samwise]` and leftmost-first matching is used, then |
| 680 | /// `samwise` can never match and the sequence is equivalent to `[sam]`. |
| 681 | /// |
| 682 | /// # States of a sequence |
| 683 | /// |
| 684 | /// A `Seq` has a few different logical states to consider: |
| 685 | /// |
| 686 | /// * The sequence can represent "any" literal. When this happens, the set does |
| 687 | /// not have a finite size. The purpose of this state is to inhibit callers |
| 688 | /// from making assumptions about what literals are required in order to match |
| 689 | /// a particular [`Hir`] expression. Generally speaking, when a set is in this |
| 690 | /// state, literal optimizations are inhibited. A good example of a regex that |
| 691 | /// will cause this sort of set to appear is `[A-Za-z]`. The character class |
| 692 | /// is just too big (and also too narrow) to be usefully expanded into 52 |
| 693 | /// different literals. (Note that the decision for when a seq should become |
| 694 | /// infinite is determined by the caller. A seq itself has no hard-coded |
| 695 | /// limits.) |
| 696 | /// * The sequence can be empty, in which case, it is an affirmative statement |
| 697 | /// that there are no literals that can match the corresponding `Hir`. |
| 698 | /// Consequently, the `Hir` never matches any input. For example, `[a&&b]`. |
| 699 | /// * The sequence can be non-empty, in which case, at least one of the |
| 700 | /// literals must match in order for the corresponding `Hir` to match. |
| 701 | /// |
| 702 | /// # Example |
| 703 | /// |
| 704 | /// This example shows how literal sequences can be simplified by stripping |
| 705 | /// suffixes and minimizing while maintaining preference order. |
| 706 | /// |
| 707 | /// ``` |
| 708 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
| 709 | /// |
| 710 | /// let mut seq = Seq::new(&[ |
| 711 | /// "farm" , |
| 712 | /// "appliance" , |
| 713 | /// "faraway" , |
| 714 | /// "apple" , |
| 715 | /// "fare" , |
| 716 | /// "gap" , |
| 717 | /// "applicant" , |
| 718 | /// "applaud" , |
| 719 | /// ]); |
| 720 | /// seq.keep_first_bytes(3); |
| 721 | /// seq.minimize_by_preference(); |
| 722 | /// // Notice that 'far' comes before 'app', which matches the order in the |
| 723 | /// // original sequence. This guarantees that leftmost-first semantics are |
| 724 | /// // not altered by simplifying the set. |
| 725 | /// let expected = Seq::from_iter([ |
| 726 | /// Literal::inexact("far"),
| 727 | /// Literal::inexact("app"),
| 728 | /// Literal::exact("gap"),
| 729 | /// ]); |
| 730 | /// assert_eq!(expected, seq); |
| 731 | /// ``` |
| 732 | #[derive(Clone, Eq, PartialEq)]
| 733 | pub struct Seq { |
| 734 | /// The members of this seq. |
| 735 | /// |
| 736 | /// When `None`, the seq represents all possible literals. That is, it |
| 737 | /// prevents one from making assumptions about specific literals in the |
| 738 | /// seq, and forces one to treat it as if any literal might be in the seq. |
| 739 | /// |
| 740 | /// Note that `Some(vec![])` is valid and corresponds to the empty seq of |
| 741 | /// literals, i.e., a regex that can never match. For example, `[a&&b]`. |
| 742 | /// It is distinct from `Some(vec![""])`, which corresponds to the seq |
| 743 | /// containing an empty string, which matches at every position. |
| 744 | literals: Option<Vec<Literal>>, |
| 745 | } |
| 746 | |
| 747 | impl Seq { |
| 748 | /// Returns an empty sequence. |
| 749 | /// |
| 750 | /// An empty sequence matches zero literals, and thus corresponds to a |
| 751 | /// regex that itself can never match. |
| 752 | #[inline]
| 753 | pub fn empty() -> Seq { |
| 754 | Seq { literals: Some(vec![]) } |
| 755 | } |
| 756 | |
| 757 | /// Returns a sequence of literals without a finite size that may contain
| 758 | /// any literal.
| 759 | /// |
| 760 | /// A sequence without finite size does not reveal anything about the |
| 761 | /// characteristics of the literals in its set. There are no fixed prefixes |
| 762 | /// or suffixes, nor are any lower or upper bounds on the lengths of the
| 763 | /// literals in the set known.
| 764 | /// |
| 765 | /// This is useful to represent constructs in a regex that are "too big" |
| 766 | /// to usefully represent as a sequence of literals. For example, `[A-Za-z]`.
| 767 | /// When sequences get too big, they lose their discriminating nature and |
| 768 | /// are more likely to produce false positives, which in turn makes them |
| 769 | /// less likely to speed up searches. |
| 770 | /// |
| 771 | /// More pragmatically, for many regexes, enumerating all possible literals |
| 772 | /// is itself not possible or might otherwise use too many resources. So |
| 773 | /// constraining the size of sets during extraction is a practical trade |
| 774 | /// off to make. |
| 775 | #[inline]
| 776 | pub fn infinite() -> Seq { |
| 777 | Seq { literals: None } |
| 778 | } |
| 779 | |
| 780 | /// Returns a sequence containing a single literal. |
| 781 | #[inline]
| 782 | pub fn singleton(lit: Literal) -> Seq { |
| 783 | Seq { literals: Some(vec![lit]) } |
| 784 | } |
| 785 | |
| 786 | /// Returns a sequence of exact literals from the given byte strings. |
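| | ///
| | /// # Example
| | ///
| | /// This example shows that every literal in the returned sequence is
| | /// exact.
| | ///
| | /// ```
| | /// use regex_syntax::hir::literal::{Literal, Seq};
| | ///
| | /// let seq = Seq::new(&["foo", "bar"]);
| | /// let expected = Seq::from_iter([
| | /// Literal::exact("foo"),
| | /// Literal::exact("bar"),
| | /// ]);
| | /// assert_eq!(expected, seq);
| | /// ```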
| 787 | #[inline]
| 788 | pub fn new<I, B>(it: I) -> Seq |
| 789 | where |
| 790 | I: IntoIterator<Item = B>, |
| 791 | B: AsRef<[u8]>, |
| 792 | { |
| 793 | it.into_iter().map(|b| Literal::exact(b.as_ref())).collect() |
| 794 | } |
| 795 | |
| 796 | /// If this is a finite sequence, return its members as a slice of |
| 797 | /// literals. |
| 798 | /// |
| 799 | /// The slice returned may be empty, in which case, there are no literals |
| 800 | /// that can match this sequence. |
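| | ///
| | /// # Example
| | ///
| | /// This example shows the difference between a finite and an infinite
| | /// sequence.
| | ///
| | /// ```
| | /// use regex_syntax::hir::literal::Seq;
| | ///
| | /// let seq = Seq::new(&["foo", "bar"]);
| | /// assert_eq!(Some(2), seq.literals().map(|lits| lits.len()));
| | /// // An infinite sequence has no slice of literals.
| | /// assert_eq!(None, Seq::infinite().literals());
| | /// ```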
| 801 | #[inline]
| 802 | pub fn literals(&self) -> Option<&[Literal]> { |
| 803 | self.literals.as_deref() |
| 804 | } |
| 805 | |
| 806 | /// Push a literal to the end of this sequence. |
| 807 | /// |
| 808 | /// If this sequence is not finite, then this is a no-op. |
| 809 | /// |
| 810 | /// Similarly, if the most recently added item of this sequence is |
| 811 | /// equivalent to the literal given, then it is not added. This reflects |
| 812 | /// a `Seq`'s "set like" behavior, and represents a practical trade off. |
| 813 | /// Namely, there is never any need to have two adjacent and equivalent |
| 814 | /// literals in the same sequence, _and_ it is easy to detect in some |
| 815 | /// cases. |
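| | ///
| | /// # Example
| | ///
| | /// This example shows that pushing a literal equivalent to the most
| | /// recently added literal is a no-op.
| | ///
| | /// ```
| | /// use regex_syntax::hir::literal::{Literal, Seq};
| | ///
| | /// let mut seq = Seq::empty();
| | /// seq.push(Literal::exact("foo"));
| | /// seq.push(Literal::exact("foo"));
| | /// seq.push(Literal::exact("bar"));
| | /// assert_eq!(Seq::new(&["foo", "bar"]), seq);
| | /// ```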
| 816 | #[inline]
| 817 | pub fn push(&mut self, lit: Literal) { |
| 818 | let lits = match self.literals { |
| 819 | None => return, |
| 820 | Some(ref mut lits) => lits, |
| 821 | }; |
| 822 | if lits.last().map_or(false, |m| m == &lit) { |
| 823 | return; |
| 824 | } |
| 825 | lits.push(lit); |
| 826 | } |
| 827 | |
| 828 | /// Make all of the literals in this sequence inexact. |
| 829 | /// |
| 830 | /// This is a no-op if this sequence is not finite. |
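| | ///
| | /// # Example
| | ///
| | /// This example shows basic usage.
| | ///
| | /// ```
| | /// use regex_syntax::hir::literal::{Literal, Seq};
| | ///
| | /// let mut seq = Seq::new(&["foo", "bar"]);
| | /// seq.make_inexact();
| | /// let expected = Seq::from_iter([
| | /// Literal::inexact("foo"),
| | /// Literal::inexact("bar"),
| | /// ]);
| | /// assert_eq!(expected, seq);
| | /// ```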
| 831 | #[inline]
| 832 | pub fn make_inexact(&mut self) { |
| 833 | let lits = match self.literals { |
| 834 | None => return, |
| 835 | Some(ref mut lits) => lits, |
| 836 | }; |
| 837 | for lit in lits.iter_mut() { |
| 838 | lit.make_inexact(); |
| 839 | } |
| 840 | } |
| 841 | |
| 842 | /// Converts this sequence to an infinite sequence. |
| 843 | /// |
| 844 | /// This is a no-op if the sequence is already infinite. |
| 845 | #[inline]
| 846 | pub fn make_infinite(&mut self) { |
| 847 | self.literals = None; |
| 848 | } |
| 849 | |
| 850 | /// Modify this sequence to contain the cross product between it and the |
| 851 | /// sequence given. |
| 852 | /// |
| 853 | /// The cross product only considers literals in this sequence that are |
| 854 | /// exact. That is, inexact literals are not extended. |
| 855 | /// |
| 856 | /// The literals are always drained from `other`, even if none are used. |
| 857 | /// This permits callers to reuse the sequence allocation elsewhere. |
| 858 | /// |
| 859 | /// If this sequence is infinite, then this is a no-op, regardless of what |
| 860 | /// `other` contains (and in this case, the literals are still drained from |
| 861 | /// `other`). If `other` is infinite and this sequence is finite, then this |
| 862 | /// is a no-op, unless this sequence contains a zero-length literal, in
| 863 | /// which case the infiniteness of `other` infects this sequence, and this
| 864 | /// sequence is itself made infinite.
| 865 | /// |
| 866 | /// Like [`Seq::union`], this may attempt to deduplicate literals. See |
| 867 | /// [`Seq::dedup`] for how deduplication deals with exact and inexact |
| 868 | /// literals. |
| 869 | /// |
| 870 | /// # Example |
| 871 | /// |
| 872 | /// This example shows basic usage and how exact and inexact literals |
| 873 | /// interact. |
| 874 | /// |
| 875 | /// ``` |
| 876 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
| 877 | /// |
| 878 | /// let mut seq1 = Seq::from_iter([ |
| 879 | /// Literal::exact("foo"),
| 880 | /// Literal::inexact("bar"),
| 881 | /// ]); |
| 882 | /// let mut seq2 = Seq::from_iter([ |
| 883 | /// Literal::inexact("quux"),
| 884 | /// Literal::exact("baz"),
| 885 | /// ]); |
| 886 | /// seq1.cross_forward(&mut seq2); |
| 887 | /// |
| 888 | /// // The literals are pulled out of seq2. |
| 889 | /// assert_eq!(Some(0), seq2.len()); |
| 890 | /// |
| 891 | /// let expected = Seq::from_iter([ |
| 892 | /// Literal::inexact("fooquux"),
| 893 | /// Literal::exact("foobaz"),
| 894 | /// Literal::inexact("bar"),
| 895 | /// ]); |
| 896 | /// assert_eq!(expected, seq1); |
| 897 | /// ``` |
| 898 | /// |
| 899 | /// This example shows the behavior when `other` is an infinite
| 900 | /// sequence. |
| 901 | /// |
| 902 | /// ``` |
| 903 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
| 904 | /// |
| 905 | /// let mut seq1 = Seq::from_iter([ |
| 906 | /// Literal::exact("foo"),
| 907 | /// Literal::inexact("bar"),
| 908 | /// ]); |
| 909 | /// let mut seq2 = Seq::infinite(); |
| 910 | /// seq1.cross_forward(&mut seq2); |
| 911 | /// |
| 912 | /// // When seq2 is infinite, cross product doesn't add anything, but |
| 913 | /// // ensures all members of seq1 are inexact. |
| 914 | /// let expected = Seq::from_iter([ |
| 915 | /// Literal::inexact("foo"),
| 916 | /// Literal::inexact("bar"),
| 917 | /// ]); |
| 918 | /// assert_eq!(expected, seq1); |
| 919 | /// ``` |
| 920 | /// |
| 921 | /// This example is like the one above, but shows what happens when this |
| 922 | /// sequence contains an empty string. In this case, an infinite `other` |
| 923 | /// sequence infects this sequence (because the empty string means that |
| 924 | /// there are no finite prefixes): |
| 925 | /// |
| 926 | /// ``` |
| 927 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
| 928 | /// |
| 929 | /// let mut seq1 = Seq::from_iter([ |
| 930 | /// Literal::exact("foo"),
| 931 | /// Literal::exact(""), // inexact provokes same behavior
| 932 | /// Literal::inexact("bar"),
| 933 | /// ]); |
| 934 | /// let mut seq2 = Seq::infinite(); |
| 935 | /// seq1.cross_forward(&mut seq2); |
| 936 | /// |
| 937 | /// // seq1 is now infinite! |
| 938 | /// assert!(!seq1.is_finite()); |
| 939 | /// ``` |
| 940 | /// |
| 941 | /// This example shows the behavior when this sequence is infinite.
| 942 | /// |
| 943 | /// ``` |
| 944 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
| 945 | /// |
| 946 | /// let mut seq1 = Seq::infinite(); |
| 947 | /// let mut seq2 = Seq::from_iter([ |
| 948 | /// Literal::exact("foo"),
| 949 | /// Literal::inexact("bar"),
| 950 | /// ]); |
| 951 | /// seq1.cross_forward(&mut seq2); |
| 952 | /// |
| 953 | /// // seq1 remains unchanged. |
| 954 | /// assert!(!seq1.is_finite()); |
| 955 | /// // Even though the literals in seq2 weren't used, it was still drained. |
| 956 | /// assert_eq!(Some(0), seq2.len()); |
| 957 | /// ``` |
| 958 | #[inline]
| 959 | pub fn cross_forward(&mut self, other: &mut Seq) { |
| 960 | let (lits1, lits2) = match self.cross_preamble(other) { |
| 961 | None => return, |
| 962 | Some((lits1, lits2)) => (lits1, lits2), |
| 963 | }; |
| 964 | let newcap = lits1.len().saturating_mul(lits2.len()); |
| 965 | for selflit in mem::replace(lits1, Vec::with_capacity(newcap)) { |
| 966 | if !selflit.is_exact() { |
| 967 | lits1.push(selflit); |
| 968 | continue; |
| 969 | } |
| 970 | for otherlit in lits2.iter() { |
| 971 | let mut newlit = Literal::exact(Vec::with_capacity( |
| 972 | selflit.len() + otherlit.len(), |
| 973 | )); |
| 974 | newlit.extend(&selflit); |
| 975 | newlit.extend(&otherlit); |
| 976 | if !otherlit.is_exact() { |
| 977 | newlit.make_inexact(); |
| 978 | } |
| 979 | lits1.push(newlit); |
| 980 | } |
| 981 | } |
| 982 | lits2.drain(..); |
| 983 | self.dedup(); |
| 984 | } |
| 985 | |
| 986 | /// Modify this sequence to contain the cross product between it and |
| 987 | /// the sequence given, where the sequences are treated as suffixes |
| 988 | /// instead of prefixes. Namely, the sequence `other` is *prepended* |
| 989 | /// to `self` (as opposed to `other` being *appended* to `self` in |
| 990 | /// [`Seq::cross_forward`]). |
| 991 | /// |
| 992 | /// The cross product only considers literals in this sequence that are |
| 993 | /// exact. That is, inexact literals are not extended. |
| 994 | /// |
| 995 | /// The literals are always drained from `other`, even if none are used. |
| 996 | /// This permits callers to reuse the sequence allocation elsewhere. |
| 997 | /// |
| 998 | /// If this sequence is infinite, then this is a no-op, regardless of what |
| 999 | /// `other` contains (and in this case, the literals are still drained from |
| 1000 | /// `other`). If `other` is infinite and this sequence is finite, then this |
| 1001 | /// is a no-op, unless this sequence contains a zero-length literal, in
| 1002 | /// which case the infiniteness of `other` infects this sequence, and this
| 1003 | /// sequence is itself made infinite.
| 1004 | /// |
| 1005 | /// Like [`Seq::union`], this may attempt to deduplicate literals. See |
| 1006 | /// [`Seq::dedup`] for how deduplication deals with exact and inexact |
| 1007 | /// literals. |
| 1008 | /// |
| 1009 | /// # Example |
| 1010 | /// |
| 1011 | /// This example shows basic usage and how exact and inexact literals |
| 1012 | /// interact. |
| 1013 | /// |
| 1014 | /// ``` |
| 1015 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
| 1016 | /// |
| 1017 | /// let mut seq1 = Seq::from_iter([ |
| 1018 | /// Literal::exact("foo"),
| 1019 | /// Literal::inexact("bar"),
| 1020 | /// ]); |
| 1021 | /// let mut seq2 = Seq::from_iter([ |
| 1022 | /// Literal::inexact("quux"),
| 1023 | /// Literal::exact("baz"),
| 1024 | /// ]); |
| 1025 | /// seq1.cross_reverse(&mut seq2); |
| 1026 | /// |
| 1027 | /// // The literals are pulled out of seq2. |
| 1028 | /// assert_eq!(Some(0), seq2.len()); |
| 1029 | /// |
| 1030 | /// let expected = Seq::from_iter([ |
| 1031 | /// Literal::inexact("quuxfoo"),
| 1032 | /// Literal::inexact("bar"),
| 1033 | /// Literal::exact("bazfoo"),
| 1034 | /// ]); |
| 1035 | /// assert_eq!(expected, seq1); |
| 1036 | /// ``` |
| 1037 | /// |
| 1038 | /// This example shows the behavior when `other` is an infinite
| 1039 | /// sequence. |
| 1040 | /// |
| 1041 | /// ``` |
| 1042 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
| 1043 | /// |
| 1044 | /// let mut seq1 = Seq::from_iter([ |
| 1045 | /// Literal::exact("foo"),
| 1046 | /// Literal::inexact("bar"),
| 1047 | /// ]); |
| 1048 | /// let mut seq2 = Seq::infinite(); |
| 1049 | /// seq1.cross_reverse(&mut seq2); |
| 1050 | /// |
| 1051 | /// // When seq2 is infinite, cross product doesn't add anything, but |
| 1052 | /// // ensures all members of seq1 are inexact. |
| 1053 | /// let expected = Seq::from_iter([ |
| 1054 | /// Literal::inexact("foo"),
| 1055 | /// Literal::inexact("bar"),
| 1056 | /// ]); |
| 1057 | /// assert_eq!(expected, seq1); |
| 1058 | /// ``` |
| 1059 | /// |
| 1060 | /// This example is like the one above, but shows what happens when this |
| 1061 | /// sequence contains an empty string. In this case, an infinite `other` |
| 1062 | /// sequence infects this sequence (because the empty string means that |
| 1063 | /// there are no finite suffixes): |
| 1064 | /// |
| 1065 | /// ``` |
| 1066 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
| 1067 | /// |
| 1068 | /// let mut seq1 = Seq::from_iter([ |
| 1069 | /// Literal::exact("foo"),
| 1070 | /// Literal::exact(""), // inexact provokes same behavior
| 1071 | /// Literal::inexact("bar"),
| 1072 | /// ]); |
| 1073 | /// let mut seq2 = Seq::infinite(); |
| 1074 | /// seq1.cross_reverse(&mut seq2); |
| 1075 | /// |
| 1076 | /// // seq1 is now infinite! |
| 1077 | /// assert!(!seq1.is_finite()); |
| 1078 | /// ``` |
| 1079 | /// |
| 1080 | /// This example shows the behavior when this sequence is infinite. |
| 1081 | /// |
| 1082 | /// ``` |
| 1083 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
| 1084 | /// |
| 1085 | /// let mut seq1 = Seq::infinite(); |
| 1086 | /// let mut seq2 = Seq::from_iter([ |
| 1087 | /// Literal::exact("foo"),
| 1088 | /// Literal::inexact("bar"),
| 1089 | /// ]); |
| 1090 | /// seq1.cross_reverse(&mut seq2); |
| 1091 | /// |
| 1092 | /// // seq1 remains unchanged. |
| 1093 | /// assert!(!seq1.is_finite()); |
| 1094 | /// // Even though the literals in seq2 weren't used, it was still drained. |
| 1095 | /// assert_eq!(Some(0), seq2.len()); |
| 1096 | /// ``` |
| 1097 | #[inline]
| 1098 | pub fn cross_reverse(&mut self, other: &mut Seq) { |
| 1099 | let (lits1, lits2) = match self.cross_preamble(other) { |
| 1100 | None => return, |
| 1101 | Some((lits1, lits2)) => (lits1, lits2), |
| 1102 | }; |
| 1103 | // We basically proceed as we do in 'cross_forward' at this point, |
| 1104 | // except that the outer loop is now 'other' and the inner loop is now |
| 1105 | // 'self'. That's because 'self' corresponds to suffixes and 'other' |
| 1106 | // corresponds to the sequence we want to *prepend* to the suffixes. |
| 1107 | let newcap = lits1.len().saturating_mul(lits2.len()); |
| 1108 | let selflits = mem::replace(lits1, Vec::with_capacity(newcap)); |
| 1109 | for (i, otherlit) in lits2.drain(..).enumerate() { |
| 1110 | for selflit in selflits.iter() { |
| 1111 | if !selflit.is_exact() { |
| 1112 | // If the suffix isn't exact, then we can't prepend |
| 1113 | // anything to it. However, we still want to keep it. But |
| 1114 | // we only want to keep one of them, to avoid duplication. |
| 1115 | // (The duplication is okay from a correctness perspective, |
| 1116 | // but wasteful.) |
| 1117 | if i == 0 { |
| 1118 | lits1.push(selflit.clone()); |
| 1119 | } |
| 1120 | continue; |
| 1121 | } |
| 1122 | let mut newlit = Literal::exact(Vec::with_capacity( |
| 1123 | otherlit.len() + selflit.len(), |
| 1124 | )); |
| 1125 | newlit.extend(&otherlit); |
| 1126 | newlit.extend(&selflit); |
| 1127 | if !otherlit.is_exact() { |
| 1128 | newlit.make_inexact(); |
| 1129 | } |
| 1130 | lits1.push(newlit); |
| 1131 | } |
| 1132 | } |
| 1133 | self.dedup(); |
| 1134 | } |
| 1135 | |
| 1136 | /// A helper function that corresponds to the subtle preamble for both
| 1137 | /// `cross_forward` and `cross_reverse`. In effect, it handles the cases |
| 1138 | /// of infinite sequences for both `self` and `other`, as well as ensuring |
| 1139 | /// that literals from `other` are drained even if they aren't used. |
| 1140 | fn cross_preamble<'a>( |
| 1141 | &'a mut self, |
| 1142 | other: &'a mut Seq, |
| 1143 | ) -> Option<(&'a mut Vec<Literal>, &'a mut Vec<Literal>)> { |
| 1144 | let lits2 = match other.literals { |
| 1145 | None => { |
| 1146 | // If our current seq contains the empty string and the seq |
| 1147 | // we're adding matches any literal, then it follows that the |
| 1148 | // current seq must now also match any literal. |
| 1149 | // |
| 1150 | // Otherwise, we just have to make sure everything in this |
| 1151 | // sequence is inexact. |
| 1152 | if self.min_literal_len() == Some(0) { |
| 1153 | *self = Seq::infinite(); |
| 1154 | } else { |
| 1155 | self.make_inexact(); |
| 1156 | } |
| 1157 | return None; |
| 1158 | } |
| 1159 | Some(ref mut lits) => lits, |
| 1160 | }; |
| 1161 | let lits1 = match self.literals { |
| 1162 | None => { |
| 1163 | // If we aren't going to make it to the end of this routine |
| 1164 | // where lits2 is drained, then we need to do it now. |
| 1165 | lits2.drain(..); |
| 1166 | return None; |
| 1167 | } |
| 1168 | Some(ref mut lits) => lits, |
| 1169 | }; |
| 1170 | Some((lits1, lits2)) |
| 1171 | } |
| 1172 | |
| 1173 | /// Unions the `other` sequence into this one. |
| 1174 | /// |
| 1175 | /// The literals are always drained out of the given `other` sequence, |
| 1176 | /// even if they are being unioned into an infinite sequence. This permits |
| 1177 | /// the caller to reuse the `other` sequence in another context. |
| 1178 | /// |
| 1179 | /// Some literal deduping may be performed. If any deduping happens, |
| 1180 | /// any leftmost-first or "preference" order match semantics will be |
| 1181 | /// preserved. |
| 1182 | /// |
| 1183 | /// # Example |
| 1184 | /// |
| 1185 | /// This example shows basic usage. |
| 1186 | /// |
| 1187 | /// ``` |
| 1188 | /// use regex_syntax::hir::literal::Seq; |
| 1189 | /// |
| 1190 | /// let mut seq1 = Seq::new(&["foo", "bar"]);
| 1191 | /// let mut seq2 = Seq::new(&["bar", "quux", "foo"]);
| 1192 | /// seq1.union(&mut seq2); |
| 1193 | /// |
| 1194 | /// // The literals are pulled out of seq2. |
| 1195 | /// assert_eq!(Some(0), seq2.len()); |
| 1196 | /// |
| 1197 | /// // Adjacent literals are deduped, but non-adjacent literals may not be. |
| 1198 | /// assert_eq!(Seq::new(&["foo", "bar", "quux", "foo"]), seq1);
| 1199 | /// ``` |
| 1200 | /// |
| 1201 | /// This example shows that literals are drained from `other` even when |
| 1202 | /// they aren't necessarily used. |
| 1203 | /// |
| 1204 | /// ``` |
| 1205 | /// use regex_syntax::hir::literal::Seq; |
| 1206 | /// |
| 1207 | /// let mut seq1 = Seq::infinite(); |
| 1208 | /// // Infinite sequences have no finite length. |
| 1209 | /// assert_eq!(None, seq1.len()); |
| 1210 | /// |
| 1211 | /// let mut seq2 = Seq::new(&["bar", "quux", "foo"]);
| 1212 | /// seq1.union(&mut seq2); |
| 1213 | /// |
| 1214 | /// // seq1 is still infinite and seq2 has been drained. |
| 1215 | /// assert_eq!(None, seq1.len()); |
| 1216 | /// assert_eq!(Some(0), seq2.len()); |
| 1217 | /// ``` |
| 1218 | #[inline]
| 1219 | pub fn union(&mut self, other: &mut Seq) { |
| 1220 | let lits2 = match other.literals { |
| 1221 | None => { |
| 1222 | // Unioning with an infinite sequence always results in an |
| 1223 | // infinite sequence. |
| 1224 | self.make_infinite(); |
| 1225 | return; |
| 1226 | } |
| 1227 | Some(ref mut lits) => lits.drain(..), |
| 1228 | }; |
| 1229 | let lits1 = match self.literals { |
| 1230 | None => return, |
| 1231 | Some(ref mut lits) => lits, |
| 1232 | }; |
| 1233 | lits1.extend(lits2); |
| 1234 | self.dedup(); |
| 1235 | } |
| 1236 | |
| 1237 | /// Unions the `other` sequence into this one by splicing it in at the
| 1238 | /// position of the first zero-length literal.
| 1239 | /// |
| 1240 | /// This is useful for preserving preference order semantics when combining |
| 1241 | /// two literal sequences. For example, in the regex `(a||f)+foo`, the |
| 1242 | /// correct preference order prefix sequence is `[a, foo, f]`. |
| 1243 | /// |
| 1244 | /// The literals are always drained out of the given `other` sequence, |
| 1245 | /// even if they are being unioned into an infinite sequence. This permits |
| 1246 | /// the caller to reuse the `other` sequence in another context. Note that |
| 1247 | /// the literals are drained even when no splicing is performed, i.e.,
| 1248 | /// when this sequence does not contain a zero-length literal.
| 1249 | /// |
| 1250 | /// Some literal deduping may be performed. If any deduping happens, |
| 1251 | /// any leftmost-first or "preference" order match semantics will be |
| 1252 | /// preserved. |
| 1253 | /// |
| 1254 | /// # Example |
| 1255 | /// |
| 1256 | /// This example shows basic usage. |
| 1257 | /// |
| 1258 | /// ``` |
| 1259 | /// use regex_syntax::hir::literal::Seq; |
| 1260 | /// |
| 1261 | /// let mut seq1 = Seq::new(&["a", "", "f", ""]);
| 1262 | /// let mut seq2 = Seq::new(&["foo"]);
| 1263 | /// seq1.union_into_empty(&mut seq2); |
| 1264 | /// |
| 1265 | /// // The literals are pulled out of seq2. |
| 1266 | /// assert_eq!(Some(0), seq2.len()); |
| 1267 | /// // 'foo' gets spliced into seq1 where the first empty string occurs. |
| 1268 | /// assert_eq!(Seq::new(&["a", "foo", "f"]), seq1);
| 1269 | /// ``` |
| 1270 | /// |
| 1271 | /// This example shows that literals are drained from `other` even when |
| 1272 | /// they aren't necessarily used. |
| 1273 | /// |
| 1274 | /// ``` |
| 1275 | /// use regex_syntax::hir::literal::Seq; |
| 1276 | /// |
| 1277 | /// let mut seq1 = Seq::new(&["foo", "bar"]);
| 1278 | /// let mut seq2 = Seq::new(&["bar", "quux", "foo"]);
| 1279 | /// seq1.union_into_empty(&mut seq2); |
| 1280 | /// |
| 1281 | /// // seq1 has no zero length literals, so no splicing happens. |
| 1282 | /// assert_eq!(Seq::new(&["foo", "bar"]), seq1);
| 1283 | /// // Even though no splicing happens, seq2 is still drained. |
| 1284 | /// assert_eq!(Some(0), seq2.len()); |
| 1285 | /// ``` |
| 1286 | #[inline]
| 1287 | pub fn union_into_empty(&mut self, other: &mut Seq) { |
| 1288 | let lits2 = other.literals.as_mut().map(|lits| lits.drain(..)); |
| 1289 | let lits1 = match self.literals { |
| 1290 | None => return, |
| 1291 | Some(ref mut lits) => lits, |
| 1292 | }; |
| 1293 | let first_empty = match lits1.iter().position(|m| m.is_empty()) { |
| 1294 | None => return, |
| 1295 | Some(i) => i, |
| 1296 | }; |
| 1297 | let lits2 = match lits2 { |
| 1298 | None => { |
| 1299 | // Note that we are only here if we've found an empty literal, |
| 1300 | // which implies that an infinite sequence infects this seq and |
| 1301 | // also turns it into an infinite sequence. |
| 1302 | self.literals = None; |
| 1303 | return; |
| 1304 | } |
| 1305 | Some(lits) => lits, |
| 1306 | }; |
| 1307 | // Clearing out the empties needs to come before the splice because |
| 1308 | // the splice might add more empties that we don't want to get rid |
| 1309 | // of. Since we're splicing into the position of the first empty, the |
| 1310 | // 'first_empty' position computed above is still correct. |
| 1311 | lits1.retain(|m| !m.is_empty()); |
| 1312 | lits1.splice(first_empty..first_empty, lits2); |
| 1313 | self.dedup(); |
| 1314 | } |
| 1315 | |
| 1316 | /// Deduplicate adjacent equivalent literals in this sequence. |
| 1317 | /// |
| 1318 | /// If adjacent literals are equivalent strings but one is exact and the |
| 1319 | /// other inexact, the inexact literal is kept and the exact one is |
| 1320 | /// removed. |
| 1321 | /// |
| 1322 | /// Deduping an infinite sequence is a no-op. |
| 1323 | /// |
| 1324 | /// # Example |
| 1325 | /// |
| 1326 | /// This example shows how literals that are duplicate byte strings but |
| 1327 | /// are not equivalent with respect to exactness are resolved. |
| 1328 | /// |
| 1329 | /// ``` |
| 1330 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
| 1331 | /// |
| 1332 | /// let mut seq = Seq::from_iter([ |
| 1333 | /// Literal::exact("foo"),
| 1334 | /// Literal::inexact("foo"),
| 1335 | /// ]); |
| 1336 | /// seq.dedup(); |
| 1337 | /// |
| 1338 | /// assert_eq!(Seq::from_iter([Literal::inexact("foo")]), seq);
| 1339 | /// ``` |
| 1340 | #[inline]
| 1341 | pub fn dedup(&mut self) { |
| 1342 | if let Some(ref mut lits) = self.literals { |
| 1343 | lits.dedup_by(|lit1, lit2| { |
| 1344 | if lit1.as_bytes() != lit2.as_bytes() { |
| 1345 | return false; |
| 1346 | } |
| 1347 | if lit1.is_exact() != lit2.is_exact() { |
| 1348 | lit1.make_inexact(); |
| 1349 | lit2.make_inexact(); |
| 1350 | } |
| 1351 | true |
| 1352 | }); |
| 1353 | } |
| 1354 | } |
| 1355 | |
| 1356 | /// Sorts this sequence of literals lexicographically. |
| 1357 | /// |
| 1358 | /// Note that if, before sorting, a literal that is a prefix of another
| 1359 | /// literal appears after it, then after sorting, the sequence will not
| 1360 | /// represent the same preference order match semantics. For example,
| 1361 | /// sorting the sequence `[samwise, sam]` yields the sequence `[sam,
| 1362 | /// samwise]`. Under preference order semantics, the latter sequence will
| 1363 | /// never match `samwise`, whereas the first sequence can.
| 1364 | /// |
| 1365 | /// # Example |
| 1366 | /// |
| 1367 | /// This example shows basic usage. |
| 1368 | /// |
| 1369 | /// ``` |
| 1370 | /// use regex_syntax::hir::literal::Seq; |
| 1371 | /// |
| 1372 | /// let mut seq = Seq::new(&["foo", "quux", "bar"]);
| 1373 | /// seq.sort(); |
| 1374 | /// |
| 1375 | /// assert_eq!(Seq::new(&["bar", "foo", "quux"]), seq);
| 1376 | /// ``` |
| 1377 | #[inline]
| 1378 | pub fn sort(&mut self) { |
| 1379 | if let Some(ref mut lits) = self.literals { |
| 1380 | lits.sort(); |
| 1381 | } |
| 1382 | } |
| 1383 | |
| 1384 | /// Reverses all of the literals in this sequence. |
| 1385 | /// |
| 1386 | /// The order of the sequence itself is preserved. |
| 1387 | /// |
| 1388 | /// # Example |
| 1389 | /// |
| 1390 | /// This example shows basic usage. |
| 1391 | /// |
| 1392 | /// ``` |
| 1393 | /// use regex_syntax::hir::literal::Seq; |
| 1394 | /// |
| 1395 | /// let mut seq = Seq::new(&["oof", "rab"]);
| 1396 | /// seq.reverse_literals(); |
| 1397 | /// assert_eq!(Seq::new(&["foo", "bar"]), seq);
| 1398 | /// ``` |
| 1399 | #[inline]
| 1400 | pub fn reverse_literals(&mut self) { |
| 1401 | if let Some(ref mut lits) = self.literals { |
| 1402 | for lit in lits.iter_mut() { |
| 1403 | lit.reverse(); |
| 1404 | } |
| 1405 | } |
| 1406 | } |
| 1407 | |
| 1408 | /// Shrinks this seq to its minimal size while respecting the preference |
| 1409 | /// order of its literals. |
| 1410 | /// |
| 1411 | /// While this routine will remove duplicate literals from this seq, it |
| 1412 | /// will also remove literals that can never match in a leftmost-first or |
| 1413 | /// "preference order" search. Similar to [`Seq::dedup`], if a literal is |
| 1414 | /// deduped, then the one that remains is made inexact. |
| 1415 | /// |
| 1416 | /// This is a no-op on seqs that are empty or not finite. |
| 1417 | /// |
| 1418 | /// # Example |
| 1419 | /// |
| 1420 | /// This example shows the difference between `{sam, samwise}` and |
| 1421 | /// `{samwise, sam}`. |
| 1422 | /// |
| 1423 | /// ``` |
| 1424 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
| 1425 | /// |
| 1426 | /// // If 'sam' comes before 'samwise' and a preference order search is |
| 1427 | /// // executed, then 'samwise' can never match. |
| 1428 | /// let mut seq = Seq::new(&["sam", "samwise"]);
| 1429 | /// seq.minimize_by_preference(); |
| 1430 | /// assert_eq!(Seq::from_iter([Literal::inexact("sam")]), seq);
| 1431 | /// |
| 1432 | /// // But if they are reversed, then it's possible for 'samwise' to match |
| 1433 | /// // since it is given higher preference. |
| 1434 | /// let mut seq = Seq::new(&["samwise", "sam"]);
| 1435 | /// seq.minimize_by_preference(); |
| 1436 | /// assert_eq!(Seq::new(&["samwise", "sam"]), seq);
| 1437 | /// ``` |
| 1438 | /// |
| 1439 | /// This example shows that if an empty string is in this seq, then |
| 1440 | /// anything that comes after it can never match. |
| 1441 | /// |
| 1442 | /// ``` |
| 1443 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
| 1444 | /// |
| 1445 | /// // An empty string is a prefix of all strings, so it automatically |
| 1446 | /// // inhibits any subsequent strings from matching. |
| 1447 | /// let mut seq = Seq::new(&["foo", "bar", "", "quux", "fox"]);
| 1448 | /// seq.minimize_by_preference(); |
| 1449 | /// let expected = Seq::from_iter([ |
| 1450 | /// Literal::exact("foo"),
| 1451 | /// Literal::exact("bar"),
| 1452 | /// Literal::inexact(""),
| 1453 | /// ]); |
| 1454 | /// assert_eq!(expected, seq); |
| 1455 | /// |
| 1456 | /// // And of course, if it's at the beginning, then it makes it impossible |
| 1457 | /// // for anything else to match. |
| 1458 | /// let mut seq = Seq::new(&["", "foo", "quux", "fox"]);
| 1459 | /// seq.minimize_by_preference(); |
| 1460 | /// assert_eq!(Seq::from_iter([Literal::inexact("")]), seq);
| 1461 | /// ``` |
| 1462 | #[inline]
| 1463 | pub fn minimize_by_preference(&mut self) { |
| 1464 | if let Some(ref mut lits) = self.literals { |
| 1465 | PreferenceTrie::minimize(lits, false); |
| 1466 | } |
| 1467 | } |
| 1468 | |
| 1469 | /// Trims all literals in this seq such that only the first `len` bytes |
| 1470 | /// remain. If a literal has less than or equal to `len` bytes, then it |
| 1471 | /// remains unchanged. Otherwise, it is trimmed and made inexact. |
| 1472 | /// |
| 1473 | /// # Example |
| 1474 | /// |
| 1475 | /// ``` |
| 1476 | /// use regex_syntax::hir::literal::{Literal, Seq}; |
| 1477 | /// |
| 1478 | /// let mut seq = Seq::new(&["a", "foo", "quux"]);
| 1479 | /// seq.keep_first_bytes(2); |
| 1480 | /// |
| 1481 | /// let expected = Seq::from_iter([ |
| 1482 | /// Literal::exact("a"),
| 1483 | /// Literal::inexact("fo"),
| 1484 | /// Literal::inexact("qu"),
| 1485 | /// ]); |
| 1486 | /// assert_eq!(expected, seq); |
| 1487 | /// ``` |
| 1488 | #[inline]
| 1489 | pub fn keep_first_bytes(&mut self, len: usize) { |
| 1490 | if let Some(ref mut lits) = self.literals { |
| 1491 | for m in lits.iter_mut() { |
| 1492 | m.keep_first_bytes(len); |
| 1493 | } |
| 1494 | } |
| 1495 | } |

    /// Trims all literals in this seq such that only the last `len` bytes
    /// remain. If a literal has less than or equal to `len` bytes, then it
    /// remains unchanged. Otherwise, it is trimmed and made inexact.
    ///
    /// # Example
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// let mut seq = Seq::new(&["a", "foo", "quux"]);
    /// seq.keep_last_bytes(2);
    ///
    /// let expected = Seq::from_iter([
    ///     Literal::exact("a"),
    ///     Literal::inexact("oo"),
    ///     Literal::inexact("ux"),
    /// ]);
    /// assert_eq!(expected, seq);
    /// ```
    #[inline]
    pub fn keep_last_bytes(&mut self, len: usize) {
        if let Some(ref mut lits) = self.literals {
            for m in lits.iter_mut() {
                m.keep_last_bytes(len);
            }
        }
    }

    /// Returns true if this sequence is finite.
    ///
    /// When false, this sequence is infinite and must be treated as if it
    /// contains every possible literal.
    #[inline]
    pub fn is_finite(&self) -> bool {
        self.literals.is_some()
    }

    /// Returns true if and only if this sequence is finite and empty.
    ///
    /// An empty sequence never matches anything. It can only be produced by
    /// literal extraction when the corresponding regex itself cannot match.
    #[inline]
    pub fn is_empty(&self) -> bool {
        self.len() == Some(0)
    }

    /// Returns the number of literals in this sequence if the sequence is
    /// finite. If the sequence is infinite, then `None` is returned.
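    ///
    /// # Example
    ///
    /// A small illustration, using only the constructors shown elsewhere in
    /// this module:
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let seq = Seq::new(&["foo", "bar"]);
    /// assert_eq!(Some(2), seq.len());
    /// assert_eq!(None, Seq::infinite().len());
    /// ```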
    #[inline]
    pub fn len(&self) -> Option<usize> {
        self.literals.as_ref().map(|lits| lits.len())
    }

    /// Returns true if and only if all literals in this sequence are exact.
    ///
    /// This returns false if the sequence is infinite.
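    ///
    /// # Example
    ///
    /// `Seq::new` builds exact literals, so a freshly constructed sequence is
    /// exact until one of its literals is made inexact:
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// let seq = Seq::new(&["foo", "bar"]);
    /// assert!(seq.is_exact());
    ///
    /// let seq = Seq::from_iter([Literal::inexact("foo")]);
    /// assert!(!seq.is_exact());
    /// ```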
    #[inline]
    pub fn is_exact(&self) -> bool {
        self.literals().map_or(false, |lits| lits.iter().all(|x| x.is_exact()))
    }

    /// Returns true if and only if all literals in this sequence are inexact.
    ///
    /// This returns true if the sequence is infinite.
    #[inline]
    pub fn is_inexact(&self) -> bool {
        self.literals().map_or(true, |lits| lits.iter().all(|x| !x.is_exact()))
    }

    /// Return the maximum length of the sequence that would result from
    /// unioning `self` with `other`. If either set is infinite, then this
    /// returns `None`.
    #[inline]
    pub fn max_union_len(&self, other: &Seq) -> Option<usize> {
        let len1 = self.len()?;
        let len2 = other.len()?;
        Some(len1.saturating_add(len2))
    }

    /// Return the maximum length of the sequence that would result from the
    /// cross product of `self` with `other`. If either set is infinite, then
    /// this returns `None`.
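    ///
    /// # Example
    ///
    /// The maximum cross product length is simply the product of the two
    /// sequence lengths:
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let seq1 = Seq::new(&["foo", "bar"]);
    /// let seq2 = Seq::new(&["quux", "baz", "x"]);
    /// assert_eq!(Some(6), seq1.max_cross_len(&seq2));
    /// assert_eq!(None, seq1.max_cross_len(&Seq::infinite()));
    /// ```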
    #[inline]
    pub fn max_cross_len(&self, other: &Seq) -> Option<usize> {
        let len1 = self.len()?;
        let len2 = other.len()?;
        Some(len1.saturating_mul(len2))
    }

    /// Returns the length of the shortest literal in this sequence.
    ///
    /// If the sequence is infinite or empty, then this returns `None`.
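    ///
    /// # Example
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let seq = Seq::new(&["a", "foo", "quux"]);
    /// assert_eq!(Some(1), seq.min_literal_len());
    /// assert_eq!(None, Seq::infinite().min_literal_len());
    /// ```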
    #[inline]
    pub fn min_literal_len(&self) -> Option<usize> {
        self.literals.as_ref()?.iter().map(|x| x.len()).min()
    }

    /// Returns the length of the longest literal in this sequence.
    ///
    /// If the sequence is infinite or empty, then this returns `None`.
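    ///
    /// # Example
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let seq = Seq::new(&["a", "foo", "quux"]);
    /// assert_eq!(Some(4), seq.max_literal_len());
    /// assert_eq!(None, Seq::infinite().max_literal_len());
    /// ```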
    #[inline]
    pub fn max_literal_len(&self) -> Option<usize> {
        self.literals.as_ref()?.iter().map(|x| x.len()).max()
    }

    /// Returns the longest common prefix from this seq.
    ///
    /// If the seq matches any literal or otherwise contains no literals, then
    /// there is no meaningful prefix and this returns `None`.
    ///
    /// # Example
    ///
    /// This shows some example seqs and their longest common prefix.
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let seq = Seq::new(&["foo", "foobar", "fo"]);
    /// assert_eq!(Some(&b"fo"[..]), seq.longest_common_prefix());
    /// let seq = Seq::new(&["foo", "foo"]);
    /// assert_eq!(Some(&b"foo"[..]), seq.longest_common_prefix());
    /// let seq = Seq::new(&["foo", "bar"]);
    /// assert_eq!(Some(&b""[..]), seq.longest_common_prefix());
    /// let seq = Seq::new(&[""]);
    /// assert_eq!(Some(&b""[..]), seq.longest_common_prefix());
    ///
    /// let seq = Seq::infinite();
    /// assert_eq!(None, seq.longest_common_prefix());
    /// let seq = Seq::empty();
    /// assert_eq!(None, seq.longest_common_prefix());
    /// ```
    #[inline]
    pub fn longest_common_prefix(&self) -> Option<&[u8]> {
        // If we match everything or match nothing, then there's no meaningful
        // longest common prefix.
        let lits = match self.literals {
            None => return None,
            Some(ref lits) => lits,
        };
        if lits.len() == 0 {
            return None;
        }
        let base = lits[0].as_bytes();
        let mut len = base.len();
        for m in lits.iter().skip(1) {
            len = m
                .as_bytes()
                .iter()
                .zip(base[..len].iter())
                .take_while(|&(a, b)| a == b)
                .count();
            if len == 0 {
                return Some(&[]);
            }
        }
        Some(&base[..len])
    }

    /// Returns the longest common suffix from this seq.
    ///
    /// If the seq matches any literal or otherwise contains no literals, then
    /// there is no meaningful suffix and this returns `None`.
    ///
    /// # Example
    ///
    /// This shows some example seqs and their longest common suffix.
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let seq = Seq::new(&["oof", "raboof", "of"]);
    /// assert_eq!(Some(&b"of"[..]), seq.longest_common_suffix());
    /// let seq = Seq::new(&["foo", "foo"]);
    /// assert_eq!(Some(&b"foo"[..]), seq.longest_common_suffix());
    /// let seq = Seq::new(&["foo", "bar"]);
    /// assert_eq!(Some(&b""[..]), seq.longest_common_suffix());
    /// let seq = Seq::new(&[""]);
    /// assert_eq!(Some(&b""[..]), seq.longest_common_suffix());
    ///
    /// let seq = Seq::infinite();
    /// assert_eq!(None, seq.longest_common_suffix());
    /// let seq = Seq::empty();
    /// assert_eq!(None, seq.longest_common_suffix());
    /// ```
    #[inline]
    pub fn longest_common_suffix(&self) -> Option<&[u8]> {
        // If we match everything or match nothing, then there's no meaningful
        // longest common suffix.
        let lits = match self.literals {
            None => return None,
            Some(ref lits) => lits,
        };
        if lits.len() == 0 {
            return None;
        }
        let base = lits[0].as_bytes();
        let mut len = base.len();
        for m in lits.iter().skip(1) {
            len = m
                .as_bytes()
                .iter()
                .rev()
                .zip(base[base.len() - len..].iter().rev())
                .take_while(|&(a, b)| a == b)
                .count();
            if len == 0 {
                return Some(&[]);
            }
        }
        Some(&base[base.len() - len..])
    }

    /// Optimizes this seq while treating its literals as prefixes and
    /// respecting the preference order of its literals.
    ///
    /// The specific way "optimization" works is meant to be an implementation
    /// detail, as it essentially represents a set of heuristics. The goal
    /// that optimization tries to accomplish is to make the literals in this
    /// set reflect inputs that will result in a more effective prefilter,
    /// principally by reducing the false positive rate of candidates found by
    /// the literals in this sequence. That is, when a match of a literal is
    /// found, we would like it to be a strong predictor of the overall match
    /// of the regex. If it isn't, then much time will be spent starting and
    /// stopping the prefilter search and attempting to confirm the match only
    /// to have it fail.
    ///
    /// Some of those heuristics might be:
    ///
    /// * Identifying a common prefix from a larger sequence of literals, and
    /// shrinking the sequence down to that single common prefix.
    /// * Rejecting the sequence entirely if it is believed to result in a very
    /// high false positive rate. When this happens, the sequence is made
    /// infinite.
    /// * Shrinking the sequence to a smaller number of literals representing
    /// prefixes, but not shrinking it so much as to make literals too short.
    /// (A sequence with very short literals, of 1 or 2 bytes, will typically
    /// result in a higher false positive rate.)
    ///
    /// Optimization should only be run once extraction is complete. Namely,
    /// optimization may make assumptions that do not compose with other
    /// operations in the middle of extraction. For example, optimization will
    /// reduce `[E(sam), E(samwise)]` to `[E(sam)]`, but such a transformation
    /// is only valid if no other extraction will occur. If other extraction
    /// may occur, then the correct transformation would be to `[I(sam)]`.
    ///
    /// [`Seq::optimize_for_suffix_by_preference`] does the same thing, but
    /// for suffixes.
    ///
    /// # Example
    ///
    /// This shows how optimization might transform a sequence. Note that
    /// the specific behavior is not a documented guarantee. The heuristics
    /// used are an implementation detail and may change over time in semver
    /// compatible releases.
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Seq, Literal};
    ///
    /// let mut seq = Seq::new(&[
    ///     "samantha",
    ///     "sam",
    ///     "samwise",
    ///     "frodo",
    /// ]);
    /// seq.optimize_for_prefix_by_preference();
    /// assert_eq!(Seq::from_iter([
    ///     Literal::exact("samantha"),
    ///     // Kept exact even though 'samwise' got pruned
    ///     // because optimization assumes literal extraction
    ///     // has finished.
    ///     Literal::exact("sam"),
    ///     Literal::exact("frodo"),
    /// ]), seq);
    /// ```
    ///
    /// # Example: optimization may make the sequence infinite
    ///
    /// If the heuristics deem that the sequence could cause a very high false
    /// positive rate, then it may make the sequence infinite, effectively
    /// disabling its use as a prefilter.
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Seq, Literal};
    ///
    /// let mut seq = Seq::new(&[
    ///     "samantha",
    ///     // An empty string matches at every position,
    ///     // thus rendering the prefilter completely
    ///     // ineffective.
    ///     "",
    ///     "sam",
    ///     "samwise",
    ///     "frodo",
    /// ]);
    /// seq.optimize_for_prefix_by_preference();
    /// assert!(!seq.is_finite());
    /// ```
    ///
    /// Do note that just because there is a `" "` in the sequence, that
    /// doesn't mean the sequence will always be made infinite after it is
    /// optimized. Namely, if the sequence is considered exact (any match
    /// corresponds to an overall match of the original regex), then any match
    /// is an overall match, and so the false positive rate is always `0`.
    ///
    /// To demonstrate this, we remove `samwise` from our sequence. This
    /// results in no optimization happening and all literals remain exact.
    /// Thus the entire sequence is exact, and it is kept as-is, even though
    /// one is an ASCII space:
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Seq, Literal};
    ///
    /// let mut seq = Seq::new(&[
    ///     "samantha",
    ///     " ",
    ///     "sam",
    ///     "frodo",
    /// ]);
    /// seq.optimize_for_prefix_by_preference();
    /// assert!(seq.is_finite());
    /// ```
    #[inline]
    pub fn optimize_for_prefix_by_preference(&mut self) {
        self.optimize_by_preference(true);
    }

    /// Optimizes this seq while treating its literals as suffixes and
    /// respecting the preference order of its literals.
    ///
    /// Optimization should only be run once extraction is complete.
    ///
    /// [`Seq::optimize_for_prefix_by_preference`] does the same thing, but
    /// for prefixes. See its documentation for more explanation.
    #[inline]
    pub fn optimize_for_suffix_by_preference(&mut self) {
        self.optimize_by_preference(false);
    }

    fn optimize_by_preference(&mut self, prefix: bool) {
        let origlen = match self.len() {
            None => return,
            Some(len) => len,
        };
        // Just give up now if our sequence contains an empty string.
        if self.min_literal_len().map_or(false, |len| len == 0) {
            // We squash the sequence so that nobody else gets any bright
            // ideas to try and use it. An empty string implies a match at
            // every position. A prefilter cannot help you here.
            self.make_infinite();
            return;
        }
        // Make sure we start with the smallest sequence possible. We use a
        // special version of preference minimization that retains exactness.
        // This is legal because optimization is only expected to occur once
        // extraction is complete.
        if prefix {
            if let Some(ref mut lits) = self.literals {
                PreferenceTrie::minimize(lits, true);
            }
        }

        // Look for a common prefix (or suffix). If we found one of those and
        // it's long enough, then it's a good bet that it will be our fastest
        // possible prefilter since single-substring search is so fast.
        let fix = if prefix {
            self.longest_common_prefix()
        } else {
            self.longest_common_suffix()
        };
        if let Some(fix) = fix {
            // As a special case, if we have a common prefix and the leading
            // byte of that prefix is one that we think probably occurs rarely,
            // then strip everything down to just that single byte. This should
            // promote the use of memchr.
            //
            // ... we only do this though if our sequence has more than one
            // literal. Otherwise, we'd rather just stick with a single literal
            // scan. That is, using memchr is probably better than looking
            // for 2 or more literals, but probably not as good as a straight
            // memmem search.
            //
            // ... and also only do this when the prefix is short and probably
            // not too discriminatory anyway. If it's longer, then it's
            // probably quite discriminatory and thus is likely to have a low
            // false positive rate.
            if prefix
                && origlen > 1
                && fix.len() >= 1
                && fix.len() <= 3
                && rank(fix[0]) < 200
            {
                self.keep_first_bytes(1);
                self.dedup();
                return;
            }
            // We only strip down to the common prefix/suffix if we think
            // the existing set of literals isn't great, or if the common
            // prefix/suffix is expected to be particularly discriminatory.
            let isfast =
                self.is_exact() && self.len().map_or(false, |len| len <= 16);
            let usefix = fix.len() > 4 || (fix.len() > 1 && !isfast);
            if usefix {
                // If we keep exactly the number of bytes equal to the length
                // of the prefix (or suffix), then by the definition of a
                // prefix, every literal in the sequence will be equivalent.
                // Thus, 'dedup' will leave us with one literal.
                //
                // We do it this way to avoid an alloc, but also to make sure
                // the exactness of literals is kept (or not).
                if prefix {
                    self.keep_first_bytes(fix.len());
                } else {
                    self.keep_last_bytes(fix.len());
                }
                self.dedup();
                assert_eq!(Some(1), self.len());
                // We still fall through here. In particular, we want our
                // longest common prefix to be subject to the poison check.
            }
        }
        // If we have an exact sequence, we *probably* just want to keep it
        // as-is. But there are some cases where we don't. So we save a copy of
        // the exact sequence now, and then try to do some more optimizations
        // below. If those don't work out, we go back to this exact sequence.
        //
        // The specific motivation for this is that we sometimes wind up with
        // an exact sequence with a hefty number of literals. Say, 100. If we
        // stuck with that, it would be too big for Teddy and would result in
        // using Aho-Corasick. Which is fine... but the lazy DFA is plenty
        // suitable in such cases. The real issue is that we will wind up not
        // using a fast prefilter at all. So in cases like this, even though
        // we have an exact sequence, it would be better to try and shrink the
        // sequence (which we do below) and use it as a prefilter that can
        // produce false positive matches.
        //
        // But if the shrinking below results in a sequence that "sucks," then
        // we don't want to use that because we already have an exact sequence
        // in hand.
        let exact: Option<Seq> =
            if self.is_exact() { Some(self.clone()) } else { None };
        // Now we attempt to shorten the sequence. The idea here is that we
        // don't want to look for too many literals, but we want to shorten
        // our sequence enough to improve our odds of using better algorithms
        // downstream (such as Teddy).
        //
        // The pair of numbers in this list corresponds to the maximal prefix
        // (in bytes) to keep for all literals and the length of the sequence
        // at which to do it.
        //
        // So for example, the pair (3, 500) would mean, "if we have more than
        // 500 literals in our sequence, then truncate all of our literals
        // such that they are at most 3 bytes in length and then minimize the
        // sequence."
        const ATTEMPTS: [(usize, usize); 5] =
            [(5, 10), (4, 10), (3, 64), (2, 64), (1, 10)];
        for (keep, limit) in ATTEMPTS {
            let len = match self.len() {
                None => break,
                Some(len) => len,
            };
            if len <= limit {
                break;
            }
            if prefix {
                self.keep_first_bytes(keep);
            } else {
                self.keep_last_bytes(keep);
            }
            if prefix {
                if let Some(ref mut lits) = self.literals {
                    PreferenceTrie::minimize(lits, true);
                }
            }
        }
        // Check for a poison literal. A poison literal is one that is short
        // and is believed to have a very high match count. These poisons
        // generally lead to a prefilter with a very high false positive rate,
        // and thus overall worse performance.
        //
        // We do this last because we could have gone from a non-poisonous
        // sequence to a poisonous one. Perhaps we should add some code to
        // prevent such transitions in the first place, but then again, we
        // likely only made the transition in the first place if the sequence
        // was itself huge. And huge sequences are themselves poisonous. So...
        if let Some(lits) = self.literals() {
            if lits.iter().any(|lit| lit.is_poisonous()) {
                self.make_infinite();
            }
        }
        // OK, if we had an exact sequence before attempting more optimizations
        // above and our post-optimized sequence sucks for some reason or
        // another, then we go back to the exact sequence.
        if let Some(exact) = exact {
            // If optimizing resulted in dropping our literals, then certainly
            // backup and use the exact sequence that we had.
            if !self.is_finite() {
                *self = exact;
                return;
            }
            // If our optimized sequence contains a short literal, then it's
            // *probably* not so great. So throw it away and revert to the
            // exact sequence.
            if self.min_literal_len().map_or(true, |len| len <= 2) {
                *self = exact;
                return;
            }
            // Finally, if our optimized sequence is "big" (i.e., can't use
            // Teddy), then also don't use it and rely on the exact sequence.
            if self.len().map_or(true, |len| len > 64) {
                *self = exact;
                return;
            }
        }
    }
}

impl core::fmt::Debug for Seq {
    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
        write!(f, "Seq")?;
        if let Some(lits) = self.literals() {
            f.debug_list().entries(lits.iter()).finish()
        } else {
            write!(f, "[∞]")
        }
    }
}

impl FromIterator<Literal> for Seq {
    fn from_iter<T: IntoIterator<Item = Literal>>(it: T) -> Seq {
        let mut seq = Seq::empty();
        for literal in it {
            seq.push(literal);
        }
        seq
    }
}

/// A single literal extracted from an [`Hir`] expression.
///
/// A literal is composed of two things:
///
/// * A sequence of bytes. No guarantees with respect to UTF-8 are provided.
/// In particular, even if the regex a literal is extracted from is UTF-8, the
/// literal extracted may not be valid UTF-8. (For example, if an [`Extractor`]
/// limit resulted in trimming a literal in a way that splits a codepoint.)
/// * Whether the literal is "exact" or not. An "exact" literal means that it
/// has not been trimmed, and may continue to be extended. If a literal is
/// "exact" after visiting the entire `Hir` expression, then this implies that
/// the literal leads to a match state. (Although it doesn't necessarily imply
/// all occurrences of the literal correspond to a match of the regex, since
/// literal extraction ignores look-around assertions.)
#[derive(Clone, Eq, PartialEq, PartialOrd, Ord)]
pub struct Literal {
    bytes: Vec<u8>,
    exact: bool,
}

impl Literal {
    /// Returns a new exact literal containing the bytes given.
    #[inline]
    pub fn exact<B: Into<Vec<u8>>>(bytes: B) -> Literal {
        Literal { bytes: bytes.into(), exact: true }
    }

    /// Returns a new inexact literal containing the bytes given.
    #[inline]
    pub fn inexact<B: Into<Vec<u8>>>(bytes: B) -> Literal {
        Literal { bytes: bytes.into(), exact: false }
    }

    /// Returns the bytes in this literal.
    #[inline]
    pub fn as_bytes(&self) -> &[u8] {
        &self.bytes
    }

    /// Yields ownership of the bytes inside this literal.
    ///
    /// Note that this throws away whether the literal is "exact" or not.
    #[inline]
    pub fn into_bytes(self) -> Vec<u8> {
        self.bytes
    }

    /// Returns the length of this literal in bytes.
    #[inline]
    pub fn len(&self) -> usize {
        self.as_bytes().len()
    }

    /// Returns true if and only if this literal has zero bytes.
    #[inline]
    pub fn is_empty(&self) -> bool {
        self.len() == 0
    }

    /// Returns true if and only if this literal is exact.
    #[inline]
    pub fn is_exact(&self) -> bool {
        self.exact
    }

    /// Marks this literal as inexact.
    ///
    /// Inexact literals can never be extended. For example,
    /// [`Seq::cross_forward`] will not extend inexact literals.
    #[inline]
    pub fn make_inexact(&mut self) {
        self.exact = false;
    }

    /// Reverse the bytes in this literal.
    #[inline]
    pub fn reverse(&mut self) {
        self.bytes.reverse();
    }

    /// Extend this literal with the literal given.
    ///
    /// If this literal is inexact, then this is a no-op.
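    ///
    /// # Example
    ///
    /// ```
    /// use regex_syntax::hir::literal::Literal;
    ///
    /// let mut lit = Literal::exact("foo");
    /// lit.extend(&Literal::exact("bar"));
    /// assert_eq!(&b"foobar"[..], lit.as_bytes());
    ///
    /// // Extending an inexact literal does nothing.
    /// let mut lit = Literal::inexact("foo");
    /// lit.extend(&Literal::exact("bar"));
    /// assert_eq!(&b"foo"[..], lit.as_bytes());
    /// ```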
    #[inline]
    pub fn extend(&mut self, lit: &Literal) {
        if !self.is_exact() {
            return;
        }
        self.bytes.extend_from_slice(&lit.bytes);
    }

    /// Trims this literal such that only the first `len` bytes remain. If
    /// this literal has `len` or fewer bytes, then it remains unchanged.
    /// Otherwise, the literal is trimmed and marked as inexact.
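    ///
    /// # Example
    ///
    /// ```
    /// use regex_syntax::hir::literal::Literal;
    ///
    /// let mut lit = Literal::exact("foobar");
    /// lit.keep_first_bytes(3);
    /// assert_eq!(&b"foo"[..], lit.as_bytes());
    /// assert!(!lit.is_exact());
    /// ```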
    #[inline]
    pub fn keep_first_bytes(&mut self, len: usize) {
        if len >= self.len() {
            return;
        }
        self.make_inexact();
        self.bytes.truncate(len);
    }

    /// Trims this literal such that only the last `len` bytes remain. If this
    /// literal has `len` or fewer bytes, then it remains unchanged.
    /// Otherwise, the literal is trimmed and marked as inexact.
    #[inline]
    pub fn keep_last_bytes(&mut self, len: usize) {
        if len >= self.len() {
            return;
        }
        self.make_inexact();
        self.bytes.drain(..self.len() - len);
    }

    /// Returns true if it is believed that this literal is likely to match
    /// very frequently, and is thus not a good candidate for a prefilter.
    fn is_poisonous(&self) -> bool {
        self.is_empty() || (self.len() == 1 && rank(self.as_bytes()[0]) >= 250)
    }
}

impl From<u8> for Literal {
    fn from(byte: u8) -> Literal {
        Literal::exact(vec![byte])
    }
}

impl From<char> for Literal {
    fn from(ch: char) -> Literal {
        use alloc::string::ToString;
        Literal::exact(ch.encode_utf8(&mut [0; 4]).to_string())
    }
}

impl AsRef<[u8]> for Literal {
    fn as_ref(&self) -> &[u8] {
        self.as_bytes()
    }
}

impl core::fmt::Debug for Literal {
    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
        let tag = if self.exact { "E" } else { "I" };
        f.debug_tuple(tag)
            .field(&crate::debug::Bytes(self.as_bytes()))
            .finish()
    }
}

/// A "preference" trie that rejects literals that will never match when
/// executing a leftmost first or "preference" search.
///
/// For example, if 'sam' is inserted, then trying to insert 'samwise' will be
/// rejected because 'samwise' can never match since 'sam' will always take
/// priority. However, if 'samwise' is inserted first, then inserting 'sam'
/// after it is accepted. In this case, either 'samwise' or 'sam' can match in
/// a "preference" search.
///
/// Note that we only use this trie as a "set." That is, given a sequence of
/// literals, we insert each one in order. An `insert` will reject a literal
/// if a prefix of that literal already exists in the trie. Thus, to rebuild
/// the "minimal" sequence, we simply only keep literals that were successfully
/// inserted. (Since we don't need traversal, one wonders whether we can make
/// some simplifications here, but I haven't given it a ton of thought and I've
/// never seen this show up on a profile. Because of the heuristic limits
/// imposed on literal extractions, the size of the inputs here is usually
/// very small.)
#[derive(Debug)]
struct PreferenceTrie {
    /// The states in this trie. The index of a state in this vector is its ID.
    states: Vec<State>,
    /// This vec indicates which states are match states. It always has
    /// the same length as `states` and is indexed by the same state ID.
    /// A state with identifier `sid` is a match state if and only if
    /// `matches[sid].is_some()`. The option contains the index of the literal
    /// corresponding to the match. The index is offset by 1 so that it fits in
    /// a `NonZeroUsize`.
    matches: Vec<Option<NonZeroUsize>>,
    /// The index to allocate to the next literal added to this trie. Starts at
    /// 1 and increments by 1 for every literal successfully added to the trie.
    next_literal_index: usize,
}

/// A single state in a trie. Uses a sparse representation for its transitions.
#[derive(Debug, Default)]
struct State {
    /// Sparse representation of the transitions out of this state. Transitions
    /// are sorted by byte. There is at most one such transition for any
    /// particular byte.
    trans: Vec<(u8, usize)>,
}

impl PreferenceTrie {
    /// Minimizes the given sequence of literals while preserving preference
    /// order semantics.
    ///
    /// When `keep_exact` is true, the exactness of every literal retained is
    /// kept. This is useful when dealing with a fully extracted `Seq` that
    /// only contains exact literals. In that case, we can keep all retained
    /// literals as exact because we know we'll never need to match anything
    /// after them and because any removed literals are guaranteed to never
    /// match.
    fn minimize(literals: &mut Vec<Literal>, keep_exact: bool) {
        let mut trie = PreferenceTrie {
            states: vec![],
            matches: vec![],
            next_literal_index: 1,
        };
        let mut make_inexact = vec![];
        literals.retain_mut(|lit| match trie.insert(lit.as_bytes()) {
            Ok(_) => true,
            Err(i) => {
                if !keep_exact {
                    make_inexact.push(i.checked_sub(1).unwrap());
                }
                false
            }
        });
        for i in make_inexact {
            literals[i].make_inexact();
        }
    }
| 2257 | |
| 2258 | /// Returns `Ok` if the given byte string is accepted into this trie and |
| 2259 | /// `Err` otherwise. The index for the success case corresponds to the |
| 2260 | /// index of the literal added. The index for the error case corresponds to |
| 2261 | /// the index of the literal already in the trie that prevented the given |
| 2262 | /// byte string from being added. (Which implies it is a prefix of the one |
| 2263 | /// given.) |
| 2264 | /// |
| 2265 | /// In short, the byte string given is accepted into the trie if and only |
| 2266 | /// if it is possible for it to match when executing a preference order |
| 2267 | /// search. |
| 2268 | fn insert(&mut self, bytes: &[u8]) -> Result<usize, usize> { |
| 2269 | let mut prev = self.root(); |
| 2270 | if let Some(idx) = self.matches[prev] { |
| 2271 | return Err(idx.get()); |
| 2272 | } |
| 2273 | for &b in bytes.iter() { |
| 2274 | match self.states[prev].trans.binary_search_by_key(&b, |t| t.0) { |
| 2275 | Ok(i) => { |
| 2276 | prev = self.states[prev].trans[i].1; |
| 2277 | if let Some(idx) = self.matches[prev] { |
| 2278 | return Err(idx.get()); |
| 2279 | } |
| 2280 | } |
| 2281 | Err(i) => { |
| 2282 | let next = self.create_state(); |
| 2283 | self.states[prev].trans.insert(i, (b, next)); |
| 2284 | prev = next; |
| 2285 | } |
| 2286 | } |
| 2287 | } |
| 2288 | let idx = self.next_literal_index; |
| 2289 | self.next_literal_index += 1; |
| 2290 | self.matches[prev] = NonZeroUsize::new(idx); |
| 2291 | Ok(idx) |
| 2292 | } |
| 2293 | |
| 2294 | /// Returns the root state ID, and if it doesn't exist, creates it. |
| 2295 | fn root(&mut self) -> usize { |
| 2296 | if !self.states.is_empty() { |
| 2297 | 0 |
| 2298 | } else { |
| 2299 | self.create_state() |
| 2300 | } |
| 2301 | } |
| 2302 | |
| 2303 | /// Creates a new empty state and returns its ID. |
| 2304 | fn create_state(&mut self) -> usize { |
| 2305 | let id = self.states.len(); |
| 2306 | self.states.push(State::default()); |
| 2307 | self.matches.push(None); |
| 2308 | id |
| 2309 | } |
| 2310 | } |
| 2311 | |
| 2312 | /// Returns the "rank" of the given byte. |
| 2313 | /// |
| 2314 | /// The minimum rank value is `0` and the maximum rank value is `255`. |
| 2315 | /// |
| 2316 | /// The rank of a byte is derived from a heuristic background distribution of |
| 2317 | /// relative frequencies of bytes. The heuristic says that the lower the rank
| 2318 | /// of a byte, the less likely that byte is to appear in any arbitrary haystack.
| 2319 | pub fn rank(byte: u8) -> u8 { |
| 2320 | crate::rank::BYTE_FREQUENCIES[usize::from(byte)] |
| 2321 | } |
| 2322 | |
| 2323 | #[cfg(test)]
| 2324 | mod tests { |
| 2325 | use super::*; |
| 2326 | |
| 2327 | fn parse(pattern: &str) -> Hir { |
| 2328 | crate::ParserBuilder::new().utf8(false).build().parse(pattern).unwrap() |
| 2329 | } |
| 2330 | |
| 2331 | fn prefixes(pattern: &str) -> Seq { |
| 2332 | Extractor::new().kind(ExtractKind::Prefix).extract(&parse(pattern)) |
| 2333 | } |
| 2334 | |
| 2335 | fn suffixes(pattern: &str) -> Seq { |
| 2336 | Extractor::new().kind(ExtractKind::Suffix).extract(&parse(pattern)) |
| 2337 | } |
| 2338 | |
| 2339 | fn e(pattern: &str) -> (Seq, Seq) { |
| 2340 | (prefixes(pattern), suffixes(pattern)) |
| 2341 | } |
| 2342 | |
| 2343 | #[allow(non_snake_case)]
| 2344 | fn E(x: &str) -> Literal { |
| 2345 | Literal::exact(x.as_bytes()) |
| 2346 | } |
| 2347 | |
| 2348 | #[allow(non_snake_case)]
| 2349 | fn I(x: &str) -> Literal { |
| 2350 | Literal::inexact(x.as_bytes()) |
| 2351 | } |
| 2352 | |
| 2353 | fn seq<I: IntoIterator<Item = Literal>>(it: I) -> Seq { |
| 2354 | Seq::from_iter(it) |
| 2355 | } |
| 2356 | |
| 2357 | fn infinite() -> (Seq, Seq) { |
| 2358 | (Seq::infinite(), Seq::infinite()) |
| 2359 | } |
| 2360 | |
| 2361 | fn inexact<I1, I2>(it1: I1, it2: I2) -> (Seq, Seq) |
| 2362 | where |
| 2363 | I1: IntoIterator<Item = Literal>, |
| 2364 | I2: IntoIterator<Item = Literal>, |
| 2365 | { |
| 2366 | (Seq::from_iter(it1), Seq::from_iter(it2)) |
| 2367 | } |
| 2368 | |
| 2369 | fn exact<B: AsRef<[u8]>, I: IntoIterator<Item = B>>(it: I) -> (Seq, Seq) { |
| 2370 | let s1 = Seq::new(it); |
| 2371 | let s2 = s1.clone(); |
| 2372 | (s1, s2) |
| 2373 | } |
| 2374 | |
| 2375 | fn opt<B: AsRef<[u8]>, I: IntoIterator<Item = B>>(it: I) -> (Seq, Seq) { |
| 2376 | let (mut p, mut s) = exact(it); |
| 2377 | p.optimize_for_prefix_by_preference(); |
| 2378 | s.optimize_for_suffix_by_preference(); |
| 2379 | (p, s) |
| 2380 | } |
| 2381 | |
| 2382 | #[test]
| 2383 | fn literal() {
| 2384 | assert_eq!(exact(["a"]), e("a"));
| 2385 | assert_eq!(exact(["aaaaa"]), e("aaaaa"));
| 2386 | assert_eq!(exact(["A", "a"]), e("(?i-u)a"));
| 2387 | assert_eq!(exact(["AB", "Ab", "aB", "ab"]), e("(?i-u)ab"));
| 2388 | assert_eq!(exact(["abC", "abc"]), e("ab(?i-u)c"));
| 2389 | 
| 2390 | assert_eq!(exact([b"\xFF"]), e(r"(?-u:\xFF)"));
| 2391 | 
| 2392 | #[cfg(feature = "unicode-case")]
| 2393 | {
| 2394 | assert_eq!(exact(["☃"]), e("☃"));
| 2395 | assert_eq!(exact(["☃"]), e("(?i)☃"));
| 2396 | assert_eq!(exact(["☃☃☃☃☃"]), e("☃☃☃☃☃"));
| 2397 | 
| 2398 | assert_eq!(exact(["Δ"]), e("Δ"));
| 2399 | assert_eq!(exact(["δ"]), e("δ"));
| 2400 | assert_eq!(exact(["Δ", "δ"]), e("(?i)Δ"));
| 2401 | assert_eq!(exact(["Δ", "δ"]), e("(?i)δ"));
| 2402 | 
| 2403 | assert_eq!(exact(["S", "s", "ſ"]), e("(?i)S"));
| 2404 | assert_eq!(exact(["S", "s", "ſ"]), e("(?i)s"));
| 2405 | assert_eq!(exact(["S", "s", "ſ"]), e("(?i)ſ"));
| 2406 | }
| 2407 | 
| 2408 | let letters = "ͱͳͷΐάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋ";
| 2409 | assert_eq!(exact([letters]), e(letters));
| 2410 | } |
| 2411 | |
| 2412 | #[test]
| 2413 | fn class() {
| 2414 | assert_eq!(exact(["a", "b", "c"]), e("[abc]"));
| 2415 | assert_eq!(exact(["a1b", "a2b", "a3b"]), e("a[123]b"));
| 2416 | assert_eq!(exact(["δ", "ε"]), e("[εδ]"));
| 2417 | #[cfg(feature = "unicode-case")]
| 2418 | {
| 2419 | assert_eq!(exact(["Δ", "Ε", "δ", "ε", "ϵ"]), e(r"(?i)[εδ]"));
| 2420 | }
| 2421 | } |
| 2422 | |
| 2423 | #[test]
| 2424 | fn look() {
| 2425 | assert_eq!(exact(["ab"]), e(r"a\Ab"));
| 2426 | assert_eq!(exact(["ab"]), e(r"a\zb"));
| 2427 | assert_eq!(exact(["ab"]), e(r"a(?m:^)b"));
| 2428 | assert_eq!(exact(["ab"]), e(r"a(?m:$)b"));
| 2429 | assert_eq!(exact(["ab"]), e(r"a\bb"));
| 2430 | assert_eq!(exact(["ab"]), e(r"a\Bb"));
| 2431 | assert_eq!(exact(["ab"]), e(r"a(?-u:\b)b"));
| 2432 | assert_eq!(exact(["ab"]), e(r"a(?-u:\B)b"));
| 2433 | 
| 2434 | assert_eq!(exact(["ab"]), e(r"^ab"));
| 2435 | assert_eq!(exact(["ab"]), e(r"$ab"));
| 2436 | assert_eq!(exact(["ab"]), e(r"(?m:^)ab"));
| 2437 | assert_eq!(exact(["ab"]), e(r"(?m:$)ab"));
| 2438 | assert_eq!(exact(["ab"]), e(r"\bab"));
| 2439 | assert_eq!(exact(["ab"]), e(r"\Bab"));
| 2440 | assert_eq!(exact(["ab"]), e(r"(?-u:\b)ab"));
| 2441 | assert_eq!(exact(["ab"]), e(r"(?-u:\B)ab"));
| 2442 | 
| 2443 | assert_eq!(exact(["ab"]), e(r"ab^"));
| 2444 | assert_eq!(exact(["ab"]), e(r"ab$"));
| 2445 | assert_eq!(exact(["ab"]), e(r"ab(?m:^)"));
| 2446 | assert_eq!(exact(["ab"]), e(r"ab(?m:$)"));
| 2447 | assert_eq!(exact(["ab"]), e(r"ab\b"));
| 2448 | assert_eq!(exact(["ab"]), e(r"ab\B"));
| 2449 | assert_eq!(exact(["ab"]), e(r"ab(?-u:\b)"));
| 2450 | assert_eq!(exact(["ab"]), e(r"ab(?-u:\B)"));
| 2451 | 
| 2452 | let expected = (seq([I("aZ"), E("ab")]), seq([I("Zb"), E("ab")]));
| 2453 | assert_eq!(expected, e(r"^aZ*b"));
| 2454 | } |
| 2455 | |
| 2456 | #[test]
| 2457 | fn repetition() {
| 2458 | assert_eq!(exact(["a", ""]), e(r"a?"));
| 2459 | assert_eq!(exact(["", "a"]), e(r"a??"));
| 2460 | assert_eq!(inexact([I("a"), E("")], [I("a"), E("")]), e(r"a*"));
| 2461 | assert_eq!(inexact([E(""), I("a")], [E(""), I("a")]), e(r"a*?"));
| 2462 | assert_eq!(inexact([I("a")], [I("a")]), e(r"a+"));
| 2463 | assert_eq!(inexact([I("a")], [I("a")]), e(r"(a+)+"));
| 2464 | 
| 2465 | assert_eq!(exact(["ab"]), e(r"aZ{0}b"));
| 2466 | assert_eq!(exact(["aZb", "ab"]), e(r"aZ?b"));
| 2467 | assert_eq!(exact(["ab", "aZb"]), e(r"aZ??b"));
| 2468 | assert_eq!(
| 2469 | inexact([I("aZ"), E("ab")], [I("Zb"), E("ab")]),
| 2470 | e(r"aZ*b")
| 2471 | );
| 2472 | assert_eq!(
| 2473 | inexact([E("ab"), I("aZ")], [E("ab"), I("Zb")]),
| 2474 | e(r"aZ*?b")
| 2475 | );
| 2476 | assert_eq!(inexact([I("aZ")], [I("Zb")]), e(r"aZ+b"));
| 2477 | assert_eq!(inexact([I("aZ")], [I("Zb")]), e(r"aZ+?b"));
| 2478 | 
| 2479 | assert_eq!(exact(["aZZb"]), e(r"aZ{2}b"));
| 2480 | assert_eq!(inexact([I("aZZ")], [I("ZZb")]), e(r"aZ{2,3}b"));
| 2481 | 
| 2482 | assert_eq!(exact(["abc", ""]), e(r"(abc)?"));
| 2483 | assert_eq!(exact(["", "abc"]), e(r"(abc)??"));
| 2484 | 
| 2485 | assert_eq!(inexact([I("a"), E("b")], [I("ab"), E("b")]), e(r"a*b"));
| 2486 | assert_eq!(inexact([E("b"), I("a")], [E("b"), I("ab")]), e(r"a*?b"));
| 2487 | assert_eq!(inexact([I("ab")], [I("b")]), e(r"ab+"));
| 2488 | assert_eq!(inexact([I("a"), I("b")], [I("b")]), e(r"a*b+"));
| 2489 | |
| 2490 | // FIXME: The suffixes for this don't look quite right to me. I think |
| 2491 | // the right suffixes would be: [I(ac), I(bc), E(c)]. The main issue I |
| 2492 | // think is that suffixes are computed by iterating over concatenations |
| 2493 | // in reverse, and then [bc, ac, c] ordering is indeed correct from |
| 2494 | // that perspective. We also test a few more equivalent regexes, and |
| 2495 | // we get the same result, so it is consistent at least I suppose. |
| 2496 | // |
| 2497 | // The reason why this isn't an issue is that it only messes up |
| 2498 | // preference order, and currently, suffixes are never used in a |
| 2499 | // context where preference order matters. For prefixes it matters |
| 2500 | // because we sometimes want to use prefilters without confirmation |
| 2501 | // when all of the literals are exact (and there's no look-around). But |
| 2502 | // we never do that for suffixes. Any time we use suffixes, we always |
| 2503 | // include a confirmation step. If that ever changes, then it's likely |
| 2504 | // this bug will need to be fixed, but last time I looked, it appears |
| 2505 | // hard to do so. |
| 2506 | assert_eq!(
| 2507 | inexact([I("a"), I("b"), E("c")], [I("bc"), I("ac"), E("c")]),
| 2508 | e(r"a*b*c")
| 2509 | );
| 2510 | assert_eq!(
| 2511 | inexact([I("a"), I("b"), E("c")], [I("bc"), I("ac"), E("c")]),
| 2512 | e(r"(a+)?(b+)?c")
| 2513 | );
| 2514 | assert_eq!(
| 2515 | inexact([I("a"), I("b"), E("c")], [I("bc"), I("ac"), E("c")]),
| 2516 | e(r"(a+|)(b+|)c")
| 2517 | );
| 2518 | // A few more similarish but not identical regexes. These may have a |
| 2519 | // similar problem as above. |
| 2520 | assert_eq!(
| 2521 | inexact(
| 2522 | [I("a"), I("b"), I("c"), E("")],
| 2523 | [I("c"), I("b"), I("a"), E("")]
| 2524 | ),
| 2525 | e(r"a*b*c*")
| 2526 | );
| 2527 | assert_eq!(inexact([I("a"), I("b"), I("c")], [I("c")]), e(r"a*b*c+"));
| 2528 | assert_eq!(inexact([I("a"), I("b")], [I("bc")]), e(r"a*b+c"));
| 2529 | assert_eq!(inexact([I("a"), I("b")], [I("c"), I("b")]), e(r"a*b+c*"));
| 2530 | assert_eq!(inexact([I("ab"), E("a")], [I("b"), E("a")]), e(r"ab*"));
| 2531 | assert_eq!(
| 2532 | inexact([I("ab"), E("ac")], [I("bc"), E("ac")]),
| 2533 | e(r"ab*c")
| 2534 | );
| 2535 | assert_eq!(inexact([I("ab")], [I("b")]), e(r"ab+"));
| 2536 | assert_eq!(inexact([I("ab")], [I("bc")]), e(r"ab+c"));
| 2537 | 
| 2538 | assert_eq!(
| 2539 | inexact([I("z"), E("azb")], [I("zazb"), E("azb")]),
| 2540 | e(r"z*azb")
| 2541 | );
| 2542 | 
| 2543 | let expected =
| 2544 | exact(["aaa", "aab", "aba", "abb", "baa", "bab", "bba", "bbb"]);
| 2545 | assert_eq!(expected, e(r"[ab]{3}"));
| 2546 | let expected = inexact(
| 2547 | [
| 2548 | I("aaa"),
| 2549 | I("aab"),
| 2550 | I("aba"),
| 2551 | I("abb"),
| 2552 | I("baa"),
| 2553 | I("bab"),
| 2554 | I("bba"),
| 2555 | I("bbb"),
| 2556 | ],
| 2557 | [
| 2558 | I("aaa"),
| 2559 | I("aab"),
| 2560 | I("aba"),
| 2561 | I("abb"),
| 2562 | I("baa"),
| 2563 | I("bab"),
| 2564 | I("bba"),
| 2565 | I("bbb"),
| 2566 | ],
| 2567 | );
| 2568 | assert_eq!(expected, e(r"[ab]{3,4}"));
| 2569 | } |
| 2570 | |
| 2571 | #[test]
| 2572 | fn concat() {
| 2573 | let empty: [&str; 0] = [];
| 2574 | 
| 2575 | assert_eq!(exact(["abcxyz"]), e(r"abc()xyz"));
| 2576 | assert_eq!(exact(["abcxyz"]), e(r"(abc)(xyz)"));
| 2577 | assert_eq!(exact(["abcmnoxyz"]), e(r"abc()mno()xyz"));
| 2578 | assert_eq!(exact(empty), e(r"abc[a&&b]xyz"));
| 2579 | assert_eq!(exact(["abcxyz"]), e(r"abc[a&&b]*xyz"));
| 2580 | } |
| 2581 | |
| 2582 | #[test]
| 2583 | fn alternation() {
| 2584 | assert_eq!(exact(["abc", "mno", "xyz"]), e(r"abc|mno|xyz"));
| 2585 | assert_eq!(
| 2586 | inexact(
| 2587 | [E("abc"), I("mZ"), E("mo"), E("xyz")],
| 2588 | [E("abc"), I("Zo"), E("mo"), E("xyz")]
| 2589 | ),
| 2590 | e(r"abc|mZ*o|xyz")
| 2591 | );
| 2592 | assert_eq!(exact(["abc", "xyz"]), e(r"abc|M[a&&b]N|xyz"));
| 2593 | assert_eq!(exact(["abc", "MN", "xyz"]), e(r"abc|M[a&&b]*N|xyz"));
| 2594 | 
| 2595 | assert_eq!(exact(["aaa", "aaaaa"]), e(r"(?:|aa)aaa"));
| 2596 | assert_eq!(
| 2597 | inexact(
| 2598 | [I("aaa"), E(""), I("aaaaa"), E("aa")],
| 2599 | [I("aaa"), E(""), E("aa")]
| 2600 | ),
| 2601 | e(r"(?:|aa)(?:aaa)*")
| 2602 | );
| 2603 | assert_eq!(
| 2604 | inexact(
| 2605 | [E(""), I("aaa"), E("aa"), I("aaaaa")],
| 2606 | [E(""), I("aaa"), E("aa")]
| 2607 | ),
| 2608 | e(r"(?:|aa)(?:aaa)*?")
| 2609 | );
| 2610 | 
| 2611 | assert_eq!(
| 2612 | inexact([E("a"), I("b"), E("")], [E("a"), I("b"), E("")]),
| 2613 | e(r"a|b*")
| 2614 | );
| 2615 | assert_eq!(inexact([E("a"), I("b")], [E("a"), I("b")]), e(r"a|b+"));
| 2616 | 
| 2617 | assert_eq!(
| 2618 | inexact([I("a"), E("b"), E("c")], [I("ab"), E("b"), E("c")]),
| 2619 | e(r"a*b|c")
| 2620 | );
| 2621 | 
| 2622 | assert_eq!(
| 2623 | inexact(
| 2624 | [E("a"), E("b"), I("c"), E("")],
| 2625 | [E("a"), E("b"), I("c"), E("")]
| 2626 | ),
| 2627 | e(r"a|(?:b|c*)")
| 2628 | );
| 2629 | 
| 2630 | assert_eq!(
| 2631 | inexact(
| 2632 | [I("a"), I("b"), E("c"), I("a"), I("ab"), E("c")],
| 2633 | [I("ac"), I("bc"), E("c"), I("ac"), I("abc"), E("c")],
| 2634 | ),
| 2635 | e(r"(a|b)*c|(a|ab)*c")
| 2636 | );
| 2637 | 
| 2638 | assert_eq!(
| 2639 | exact(["abef", "abgh", "cdef", "cdgh"]),
| 2640 | e(r"(ab|cd)(ef|gh)")
| 2641 | );
| 2642 | assert_eq!(
| 2643 | exact([
| 2644 | "abefij", "abefkl", "abghij", "abghkl", "cdefij", "cdefkl",
| 2645 | "cdghij", "cdghkl",
| 2646 | ]),
| 2647 | e(r"(ab|cd)(ef|gh)(ij|kl)")
| 2648 | );
| 2649 | 
| 2650 | assert_eq!(inexact([E("abab")], [E("abab")]), e(r"(ab){2}"));
| 2651 | 
| 2652 | assert_eq!(inexact([I("abab")], [I("abab")]), e(r"(ab){2,3}"));
| 2653 | 
| 2654 | assert_eq!(inexact([I("abab")], [I("abab")]), e(r"(ab){2,}"));
| 2655 | } |
| 2656 | |
| 2657 | #[test]
| 2658 | fn impossible() {
| 2659 | let empty: [&str; 0] = [];
| 2660 | 
| 2661 | assert_eq!(exact(empty), e(r"[a&&b]"));
| 2662 | assert_eq!(exact(empty), e(r"a[a&&b]"));
| 2663 | assert_eq!(exact(empty), e(r"[a&&b]b"));
| 2664 | assert_eq!(exact(empty), e(r"a[a&&b]b"));
| 2665 | assert_eq!(exact(["a", "b"]), e(r"a|[a&&b]|b"));
| 2666 | assert_eq!(exact(["a", "b"]), e(r"a|c[a&&b]|b"));
| 2667 | assert_eq!(exact(["a", "b"]), e(r"a|[a&&b]d|b"));
| 2668 | assert_eq!(exact(["a", "b"]), e(r"a|c[a&&b]d|b"));
| 2669 | assert_eq!(exact([""]), e(r"[a&&b]*"));
| 2670 | assert_eq!(exact(["MN"]), e(r"M[a&&b]*N"));
| 2671 | } |
| 2672 | |
| 2673 | // This tests patterns that contain something that defeats literal |
| 2674 | // detection, usually because it would blow some limit on the total number |
| 2675 | // of literals that can be returned. |
| 2676 | // |
| 2677 | // The main idea is that when literal extraction sees something that |
| 2678 | // it knows will blow a limit, it replaces it with a marker that says |
| 2679 | // "any literal will match here." While not necessarily true, the |
| 2680 | // over-estimation is just fine for the purposes of literal extraction, |
| 2681 | // because the imprecision doesn't matter: too big is too big. |
| 2682 | // |
| 2683 | // This is one of the trickier parts of literal extraction, since we need |
| 2684 | // to make sure all of our literal extraction operations correctly compose |
| 2685 | // with the markers. |
| 2686 | #[test]
| 2687 | fn anything() {
| 2688 | assert_eq!(infinite(), e(r"."));
| 2689 | assert_eq!(infinite(), e(r"(?s)."));
| 2690 | assert_eq!(infinite(), e(r"[A-Za-z]"));
| 2691 | assert_eq!(infinite(), e(r"[A-Z]"));
| 2692 | assert_eq!(exact([""]), e(r"[A-Z]{0}"));
| 2693 | assert_eq!(infinite(), e(r"[A-Z]?"));
| 2694 | assert_eq!(infinite(), e(r"[A-Z]*"));
| 2695 | assert_eq!(infinite(), e(r"[A-Z]+"));
| 2696 | assert_eq!((seq([I("1")]), Seq::infinite()), e(r"1[A-Z]"));
| 2697 | assert_eq!((seq([I("1")]), seq([I("2")])), e(r"1[A-Z]2"));
| 2698 | assert_eq!((Seq::infinite(), seq([I("123")])), e(r"[A-Z]+123"));
| 2699 | assert_eq!(infinite(), e(r"[A-Z]+123[A-Z]+"));
| 2700 | assert_eq!(infinite(), e(r"1|[A-Z]|3"));
| 2701 | assert_eq!(
| 2702 | (seq([E("1"), I("2"), E("3")]), Seq::infinite()),
| 2703 | e(r"1|2[A-Z]|3"),
| 2704 | );
| 2705 | assert_eq!(
| 2706 | (Seq::infinite(), seq([E("1"), I("2"), E("3")])),
| 2707 | e(r"1|[A-Z]2|3"),
| 2708 | );
| 2709 | assert_eq!(
| 2710 | (seq([E("1"), I("2"), E("4")]), seq([E("1"), I("3"), E("4")])),
| 2711 | e(r"1|2[A-Z]3|4"),
| 2712 | );
| 2713 | assert_eq!((Seq::infinite(), seq([I("2")])), e(r"(?:|1)[A-Z]2"));
| 2714 | assert_eq!(inexact([I("a")], [I("z")]), e(r"a.z"));
| 2715 | } |
| 2716 | |
| 2717 | // Like the 'anything' test, but it uses smaller limits in order to test |
| 2718 | // the logic for effectively aborting literal extraction when the seqs get |
| 2719 | // too big. |
| 2720 | #[test]
| 2721 | fn anything_small_limits() { |
| 2722 | fn prefixes(pattern: &str) -> Seq { |
| 2723 | Extractor::new() |
| 2724 | .kind(ExtractKind::Prefix) |
| 2725 | .limit_total(10) |
| 2726 | .extract(&parse(pattern)) |
| 2727 | } |
| 2728 | |
| 2729 | fn suffixes(pattern: &str) -> Seq { |
| 2730 | Extractor::new() |
| 2731 | .kind(ExtractKind::Suffix) |
| 2732 | .limit_total(10) |
| 2733 | .extract(&parse(pattern)) |
| 2734 | } |
| 2735 | |
| 2736 | fn e(pattern: &str) -> (Seq, Seq) { |
| 2737 | (prefixes(pattern), suffixes(pattern)) |
| 2738 | } |
| 2739 | |
| 2740 | assert_eq!(
| 2741 | (
| 2742 | seq([
| 2743 | I("aaa"),
| 2744 | I("aab"),
| 2745 | I("aba"),
| 2746 | I("abb"),
| 2747 | I("baa"),
| 2748 | I("bab"),
| 2749 | I("bba"),
| 2750 | I("bbb")
| 2751 | ]),
| 2752 | seq([
| 2753 | I("aaa"),
| 2754 | I("aab"),
| 2755 | I("aba"),
| 2756 | I("abb"),
| 2757 | I("baa"),
| 2758 | I("bab"),
| 2759 | I("bba"),
| 2760 | I("bbb")
| 2761 | ])
| 2762 | ),
| 2763 | e(r"[ab]{3}{3}")
| 2764 | );
| 2765 | 
| 2766 | assert_eq!(infinite(), e(r"ab|cd|ef|gh|ij|kl|mn|op|qr|st|uv|wx|yz"));
| 2767 | } |
| 2768 | |
| 2769 | #[test]
| 2770 | fn empty() { |
| 2771 | assert_eq!(exact([""]), e(r""));
| 2772 | assert_eq!(exact([""]), e(r"^"));
| 2773 | assert_eq!(exact([""]), e(r"$"));
| 2774 | assert_eq!(exact([""]), e(r"(?m:^)"));
| 2775 | assert_eq!(exact([""]), e(r"(?m:$)"));
| 2776 | assert_eq!(exact([""]), e(r"\b"));
| 2777 | assert_eq!(exact([""]), e(r"\B"));
| 2778 | assert_eq!(exact([""]), e(r"(?-u:\b)"));
| 2779 | assert_eq!(exact([""]), e(r"(?-u:\B)"));
| 2780 | } |
| 2781 | |
| 2782 | #[test]
| 2783 | fn odds_and_ends() { |
| 2784 | assert_eq!((Seq::infinite(), seq([I("a")])), e(r".a"));
| 2785 | assert_eq!((seq([I("a")]), Seq::infinite()), e(r"a."));
| 2786 | assert_eq!(infinite(), e(r"a|."));
| 2787 | assert_eq!(infinite(), e(r".|a"));
| 2788 | 
| 2789 | let pat = r"M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]";
| 2790 | let expected = inexact(
| 2791 | ["Mo'am", "Moam", "Mu'am", "Muam"].map(I),
| 2792 | [
| 2793 | "ddafi", "ddafy", "dhafi", "dhafy", "dzafi", "dzafy", "dafi",
| 2794 | "dafy", "tdafi", "tdafy", "thafi", "thafy", "tzafi", "tzafy",
| 2795 | "tafi", "tafy", "zdafi", "zdafy", "zhafi", "zhafy", "zzafi",
| 2796 | "zzafy", "zafi", "zafy",
| 2797 | ]
| 2798 | .map(I),
| 2799 | );
| 2800 | assert_eq!(expected, e(pat));
| 2801 | 
| 2802 | assert_eq!(
| 2803 | (seq(["fn is_", "fn as_"].map(I)), Seq::infinite()),
| 2804 | e(r"fn is_([A-Z]+)|fn as_([A-Z]+)"),
| 2805 | );
| 2806 | assert_eq!(
| 2807 | inexact([I("foo")], [I("quux")]),
| 2808 | e(r"foo[A-Z]+bar[A-Z]+quux")
| 2809 | );
| 2810 | assert_eq!(infinite(), e(r"[A-Z]+bar[A-Z]+"));
| 2811 | assert_eq!(
| 2812 | exact(["Sherlock Holmes"]),
| 2813 | e(r"(?m)^Sherlock Holmes|Sherlock Holmes$")
| 2814 | );
| 2815 | 
| 2816 | assert_eq!(exact(["sa", "sb"]), e(r"\bs(?:[ab])"));
| 2817 | } |
| 2818 | |
| 2819 | // This tests a specific regex along with some heuristic steps to reduce |
| 2820 | // the sequences extracted. This is meant to roughly correspond to the |
| 2821 | // types of heuristics used to shrink literal sets in practice. (Shrinking |
| 2822 | // is done because you want to balance "spend too much work looking for |
| 2823 | // too many literals" and "spend too much work processing false positive |
| 2824 | // matches from short literals.") |
| 2825 | #[test]
| 2826 | #[cfg(feature = "unicode-case")]
| 2827 | fn holmes() { |
| 2828 | let expected = inexact(
| 2829 | ["HOL", "HOl", "HoL", "Hol", "hOL", "hOl", "hoL", "hol"].map(I),
| 2830 | [
| 2831 | "MES", "MEs", "Eſ", "MeS", "Mes", "eſ", "mES", "mEs", "meS",
| 2832 | "mes",
| 2833 | ]
| 2834 | .map(I),
| 2835 | );
| 2836 | let (mut prefixes, mut suffixes) = e(r"(?i)Holmes");
| 2837 | prefixes.keep_first_bytes(3);
| 2838 | suffixes.keep_last_bytes(3);
| 2839 | prefixes.minimize_by_preference();
| 2840 | suffixes.minimize_by_preference();
| 2841 | assert_eq!(expected, (prefixes, suffixes));
| 2842 | } |
| 2843 | |
| 2844 | // This tests that we get some kind of literals extracted for a beefier |
| 2845 | // alternation with case insensitive mode enabled. At one point during |
| 2846 | // development, this returned nothing, and motivated some special case |
| 2847 | // code in Extractor::union to try and trim down the literal sequences |
| 2848 | // if the union would blow the limits set. |
| 2849 | #[test]
| 2850 | #[cfg(feature = "unicode-case")]
| 2851 | fn holmes_alt() { |
| 2852 | let mut pre = |
| 2853 | prefixes(r"(?i)Sherlock|Holmes|Watson|Irene|Adler|John|Baker");
| 2854 | assert!(pre.len().unwrap() > 0); |
| 2855 | pre.optimize_for_prefix_by_preference(); |
| 2856 | assert!(pre.len().unwrap() > 0); |
| 2857 | } |
| 2858 | |
| 2859 | // See: https://github.com/rust-lang/regex/security/advisories/GHSA-m5pq-gvj9-9vr8 |
| 2860 | // See: CVE-2022-24713 |
| 2861 | // |
| 2862 | // We test this here to ensure literal extraction completes in reasonable |
| 2863 | // time and isn't materially impacted by these sorts of pathological |
| 2864 | // repeats. |
| 2865 | #[test]
| 2866 | fn crazy_repeats() { |
| 2867 | assert_eq!(inexact([E("")], [E("")]), e(r"(?:){4294967295}"));
| 2868 | assert_eq!(
| 2869 | inexact([E("")], [E("")]),
| 2870 | e(r"(?:){64}{64}{64}{64}{64}{64}")
| 2871 | );
| 2872 | assert_eq!(inexact([E("")], [E("")]), e(r"x{0}{4294967295}"));
| 2873 | assert_eq!(inexact([E("")], [E("")]), e(r"(?:|){4294967295}"));
| 2874 | 
| 2875 | assert_eq!(
| 2876 | inexact([E("")], [E("")]),
| 2877 | e(r"(?:){8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}")
| 2878 | );
| 2879 | let repa = "a".repeat(100);
| 2880 | assert_eq!(
| 2881 | inexact([I(&repa)], [I(&repa)]),
| 2882 | e(r"a{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}")
| 2883 | );
| 2884 | } |
| 2885 | |
| 2886 | #[test]
| 2887 | fn huge() { |
| 2888 | let pat = r#"(?-u) |
| 2889 | 2(?: |
| 2890 | [45]\d{3}| |
| 2891 | 7(?: |
| 2892 | 1[0-267]| |
| 2893 | 2[0-289]| |
| 2894 | 3[0-29]| |
| 2895 | 4[01]| |
| 2896 | 5[1-3]| |
| 2897 | 6[013]| |
| 2898 | 7[0178]| |
| 2899 | 91 |
| 2900 | )| |
| 2901 | 8(?: |
| 2902 | 0[125]| |
| 2903 | [139][1-6]| |
| 2904 | 2[0157-9]| |
| 2905 | 41| |
| 2906 | 6[1-35]| |
| 2907 | 7[1-5]| |
| 2908 | 8[1-8]| |
| 2909 | 90 |
| 2910 | )| |
| 2911 | 9(?: |
| 2912 | 0[0-2]| |
| 2913 | 1[0-4]| |
| 2914 | 2[568]| |
| 2915 | 3[3-6]| |
| 2916 | 5[5-7]| |
| 2917 | 6[0167]| |
| 2918 | 7[15]| |
| 2919 | 8[0146-9] |
| 2920 | ) |
| 2921 | )\d{4}| |
| 2922 | 3(?: |
| 2923 | 12?[5-7]\d{2}| |
| 2924 | 0(?: |
| 2925 | 2(?: |
| 2926 | [025-79]\d| |
| 2927 | [348]\d{1,2} |
| 2928 | )| |
| 2929 | 3(?: |
| 2930 | [2-4]\d| |
| 2931 | [56]\d? |
| 2932 | ) |
| 2933 | )| |
| 2934 | 2(?: |
| 2935 | 1\d{2}| |
| 2936 | 2(?: |
| 2937 | [12]\d| |
| 2938 | [35]\d{1,2}| |
| 2939 | 4\d? |
| 2940 | ) |
| 2941 | )| |
| 2942 | 3(?: |
| 2943 | 1\d{2}| |
| 2944 | 2(?: |
| 2945 | [2356]\d| |
| 2946 | 4\d{1,2} |
| 2947 | ) |
| 2948 | )| |
| 2949 | 4(?: |
| 2950 | 1\d{2}| |
| 2951 | 2(?: |
| 2952 | 2\d{1,2}| |
| 2953 | [47]| |
| 2954 | 5\d{2} |
| 2955 | ) |
| 2956 | )| |
| 2957 | 5(?: |
| 2958 | 1\d{2}| |
| 2959 | 29 |
| 2960 | )| |
| 2961 | [67]1\d{2}| |
| 2962 | 8(?: |
| 2963 | 1\d{2}| |
| 2964 | 2(?: |
| 2965 | 2\d{2}| |
| 2966 | 3| |
| 2967 | 4\d |
| 2968 | ) |
| 2969 | ) |
| 2970 | )\d{3}| |
| 2971 | 4(?: |
| 2972 | 0(?: |
| 2973 | 2(?: |
| 2974 | [09]\d| |
| 2975 | 7 |
| 2976 | )| |
| 2977 | 33\d{2} |
| 2978 | )| |
| 2979 | 1\d{3}| |
| 2980 | 2(?: |
| 2981 | 1\d{2}| |
| 2982 | 2(?: |
| 2983 | [25]\d?| |
| 2984 | [348]\d| |
| 2985 | [67]\d{1,2} |
| 2986 | ) |
| 2987 | )| |
| 2988 | 3(?: |
| 2989 | 1\d{2}(?: |
| 2990 | \d{2} |
| 2991 | )?| |
| 2992 | 2(?: |
| 2993 | [045]\d| |
| 2994 | [236-9]\d{1,2} |
| 2995 | )| |
| 2996 | 32\d{2} |
| 2997 | )| |
| 2998 | 4(?: |
| 2999 | [18]\d{2}| |
| 3000 | 2(?: |
| 3001 | [2-46]\d{2}| |
| 3002 | 3 |
| 3003 | )| |
| 3004 | 5[25]\d{2} |
| 3005 | )| |
| 3006 | 5(?: |
| 3007 | 1\d{2}| |
| 3008 | 2(?: |
| 3009 | 3\d| |
| 3010 | 5 |
| 3011 | ) |
| 3012 | )| |
| 3013 | 6(?: |
| 3014 | [18]\d{2}| |
| 3015 | 2(?: |
| 3016 | 3(?: |
| 3017 | \d{2} |
| 3018 | )?| |
| 3019 | [46]\d{1,2}| |
| 3020 | 5\d{2}| |
| 3021 | 7\d |
| 3022 | )| |
| 3023 | 5(?: |
| 3024 | 3\d?| |
| 3025 | 4\d| |
| 3026 | [57]\d{1,2}| |
| 3027 | 6\d{2}| |
| 3028 | 8 |
| 3029 | ) |
| 3030 | )| |
| 3031 | 71\d{2}| |
| 3032 | 8(?: |
| 3033 | [18]\d{2}| |
| 3034 | 23\d{2}| |
| 3035 | 54\d{2} |
| 3036 | )| |
| 3037 | 9(?: |
| 3038 | [18]\d{2}| |
| 3039 | 2[2-5]\d{2}| |
| 3040 | 53\d{1,2} |
| 3041 | ) |
| 3042 | )\d{3}| |
| 3043 | 5(?: |
| 3044 | 02[03489]\d{2}| |
| 3045 | 1\d{2}| |
| 3046 | 2(?: |
| 3047 | 1\d{2}| |
| 3048 | 2(?: |
| 3049 | 2(?: |
| 3050 | \d{2} |
| 3051 | )?| |
| 3052 | [457]\d{2} |
| 3053 | ) |
| 3054 | )| |
| 3055 | 3(?: |
| 3056 | 1\d{2}| |
| 3057 | 2(?: |
| 3058 | [37](?: |
| 3059 | \d{2} |
| 3060 | )?| |
| 3061 | [569]\d{2} |
| 3062 | ) |
| 3063 | )| |
| 3064 | 4(?: |
| 3065 | 1\d{2}| |
| 3066 | 2[46]\d{2} |
| 3067 | )| |
| 3068 | 5(?: |
| 3069 | 1\d{2}| |
| 3070 | 26\d{1,2} |
| 3071 | )| |
| 3072 | 6(?: |
| 3073 | [18]\d{2}| |
| 3074 | 2| |
| 3075 | 53\d{2} |
| 3076 | )| |
| 3077 | 7(?: |
| 3078 | 1| |
| 3079 | 24 |
| 3080 | )\d{2}| |
| 3081 | 8(?: |
| 3082 | 1| |
| 3083 | 26 |
| 3084 | )\d{2}| |
| 3085 | 91\d{2} |
| 3086 | )\d{3}| |
| 3087 | 6(?: |
| 3088 | 0(?: |
| 3089 | 1\d{2}| |
| 3090 | 2(?: |
| 3091 | 3\d{2}| |
| 3092 | 4\d{1,2} |
| 3093 | ) |
| 3094 | )| |
| 3095 | 2(?: |
| 3096 | 2[2-5]\d{2}| |
| 3097 | 5(?: |
| 3098 | [3-5]\d{2}| |
| 3099 | 7 |
| 3100 | )| |
| 3101 | 8\d{2} |
| 3102 | )| |
| 3103 | 3(?: |
| 3104 | 1| |
| 3105 | 2[3478] |
| 3106 | )\d{2}| |
| 3107 | 4(?: |
| 3108 | 1| |
| 3109 | 2[34] |
| 3110 | )\d{2}| |
| 3111 | 5(?: |
| 3112 | 1| |
| 3113 | 2[47] |
| 3114 | )\d{2}| |
| 3115 | 6(?: |
| 3116 | [18]\d{2}| |
| 3117 | 6(?: |
| 3118 | 2(?: |
| 3119 | 2\d| |
| 3120 | [34]\d{2} |
| 3121 | )| |
| 3122 | 5(?: |
| 3123 | [24]\d{2}| |
| 3124 | 3\d| |
| 3125 | 5\d{1,2} |
| 3126 | ) |
| 3127 | ) |
| 3128 | )| |
| 3129 | 72[2-5]\d{2}| |
| 3130 | 8(?: |
| 3131 | 1\d{2}| |
| 3132 | 2[2-5]\d{2} |
| 3133 | )| |
| 3134 | 9(?: |
| 3135 | 1\d{2}| |
| 3136 | 2[2-6]\d{2} |
| 3137 | ) |
| 3138 | )\d{3}| |
| 3139 | 7(?: |
| 3140 | (?: |
| 3141 | 02| |
| 3142 | [3-589]1| |
| 3143 | 6[12]| |
| 3144 | 72[24] |
| 3145 | )\d{2}| |
| 3146 | 21\d{3}| |
| 3147 | 32 |
| 3148 | )\d{3}| |
| 3149 | 8(?: |
| 3150 | (?: |
| 3151 | 4[12]| |
| 3152 | [5-7]2| |
| 3153 | 1\d? |
| 3154 | )| |
| 3155 | (?: |
| 3156 | 0| |
| 3157 | 3[12]| |
| 3158 | [5-7]1| |
| 3159 | 217 |
| 3160 | )\d |
| 3161 | )\d{4}| |
| 3162 | 9(?: |
| 3163 | [35]1| |
| 3164 | (?: |
| 3165 | [024]2| |
| 3166 | 81 |
| 3167 | )\d| |
| 3168 | (?: |
| 3169 | 1| |
| 3170 | [24]1 |
| 3171 | )\d{2} |
| 3172 | )\d{3} |
| 3173 | "#;
| 3174 | // TODO: This is a good candidate of a seq of literals that could be |
| 3175 | // shrunk quite a bit and still be very productive with respect to |
| 3176 | // literal optimizations. |
| 3177 | let (prefixes, suffixes) = e(pat); |
| 3178 | assert!(!suffixes.is_finite()); |
| 3179 | assert_eq!(Some(243), prefixes.len()); |
| 3180 | } |
| 3181 | |
| 3182 | #[test]
| 3183 | fn optimize() { |
| 3184 | // This gets a common prefix that isn't too short. |
| 3185 | let (p, s) =
| 3186 | opt(["foobarfoobar", "foobar", "foobarzfoobar", "foobarfoobar"]);
| 3187 | assert_eq!(seq([I("foobar")]), p);
| 3188 | assert_eq!(seq([I("foobar")]), s);
| 3189 | |
| 3190 | // This also finds a common prefix, but since it's only one byte, it |
| 3191 | // prefers the multiple literals. |
| 3192 | let (p, s) = opt(["abba", "akka", "abccba"]);
| 3193 | assert_eq!(exact(["abba", "akka", "abccba"]), (p, s));
| 3194 | 
| 3195 | let (p, s) = opt(["sam", "samwise"]);
| 3196 | assert_eq!((seq([E("sam")]), seq([E("sam"), E("samwise")])), (p, s));
| 3197 | |
| 3198 | // The empty string is poisonous, so our seq becomes infinite, even |
| 3199 | // though all literals are exact. |
| 3200 | let (p, s) = opt(["foobarfoo", "foo", "", "foozfoo", "foofoo"]);
| 3201 | assert!(!p.is_finite()); |
| 3202 | assert!(!s.is_finite()); |
| 3203 | |
| 3204 | // A space is also poisonous, so our seq becomes infinite. But this |
| 3205 | // only gets triggered when we don't have a completely exact sequence. |
| 3206 | // When the sequence is exact, spaces are okay, since we presume that |
| 3207 | // any prefilter will match a space more quickly than the regex engine. |
| 3208 | // (When the sequence is exact, there's a chance of the prefilter being |
| 3209 | // used without needing the regex engine at all.) |
| 3210 | let mut p = seq([E("foobarfoo"), I("foo"), E(" "), E("foofoo")]);
| 3211 | p.optimize_for_prefix_by_preference(); |
| 3212 | assert!(!p.is_finite()); |
| 3213 | } |
| 3214 | } |
| 3215 | |