/*!
Provides literal extraction from `Hir` expressions.

An [`Extractor`] pulls literals out of [`Hir`] expressions and returns a
[`Seq`] of [`Literal`]s.

The purpose of literal extraction is generally to provide avenues for
optimizing regex searches. The main idea is that substring searches can be an
order of magnitude faster than a regex search. Therefore, if one can execute
a substring search to find candidate match locations and only run the regex
search at those locations, then it is possible for huge improvements in
performance to be realized.

With that said, literal optimizations are generally a black art because even
though substring search is generally faster, if the number of candidates
produced is high, then it can create a lot of overhead by ping-ponging between
the substring search and the regex search.

Here are some heuristics that might be used to help increase the chances of
effective literal optimizations:

* Stick to small [`Seq`]s. If you search for too many literals, it's likely
to lead to substring search that is only a little faster than a regex search,
and thus the overhead of using literal optimizations in the first place might
make things slower overall.
* The literals in your [`Seq`] shouldn't be too short. In general, longer is
better. A sequence corresponding to single bytes that occur frequently in the
haystack, for example, is probably a bad literal optimization because it's
likely to produce many false positive candidates. Longer literals are less
likely to match, and thus probably produce fewer false positives.
* If it's possible to estimate the approximate frequency of each byte according
to some pre-computed background distribution, it is possible to compute a score
of how "good" a `Seq` is. If a `Seq` isn't good enough, you might consider
skipping the literal optimization and just using the regex engine.

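To make the last heuristic concrete, here is a self-contained sketch of a
frequency-based score. It is not part of this crate's API, and the background
frequency table below is entirely made up for illustration:

```rust
// A literal's "badness" is the background frequency of its rarest byte,
// since a substring searcher can key on that byte. The whole set is then
// only as good as its worst (highest scoring) literal.
fn seq_score(lits: &[&[u8]], freq: &[u8; 256]) -> u32 {
    lits.iter()
        .map(|lit| {
            lit.iter()
                .map(|&b| u32::from(freq[usize::from(b)]))
                .min()
                .unwrap_or(u32::MAX)
        })
        .max()
        .unwrap_or(u32::MAX)
}

fn main() {
    // Hypothetical background distribution: 'e' is common, all else rare.
    let mut freq = [1u8; 256];
    freq[usize::from(b'e')] = 255;

    let with_rare_byte: &[&[u8]] = &[b"fez"];
    let common_only: &[&[u8]] = &[b"e"];
    // The set containing a rare byte scores lower, i.e., better.
    assert!(seq_score(with_rare_byte, &freq) < seq_score(common_only, &freq));
}
```

A caller might compare such a score against a threshold and fall back to the
regex engine when the candidate set looks too noisy.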
(It should be noted that there are always pathological cases that can turn
any kind of literal optimization into a net slowdown. This is why it
might be a good idea to be conservative, or to even provide a means for
literal optimizations to be dynamically disabled if they are determined to be
ineffective according to some measure.)

You're encouraged to explore the methods on [`Seq`], which permit shrinking
the size of sequences in a preference-order preserving fashion.
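
The flavor of those methods can be sketched with plain `Vec`s. This is an
illustration of the idea only, not the actual `Seq` API, and it ignores the
exact-vs-inexact bookkeeping that truncation requires in the real type:

```rust
// Shrink each literal to at most 'len' leading bytes, then drop duplicates
// while keeping the first (highest preference) occurrence of each literal.
// This mirrors the spirit of shrinking a Seq and then deduplicating it.
fn shrink_preserving_preference(lits: &mut Vec<Vec<u8>>, len: usize) {
    for lit in lits.iter_mut() {
        lit.truncate(len);
    }
    let mut seen = std::collections::HashSet::new();
    // 'insert' returns false for duplicates, so only first occurrences stay.
    lits.retain(|lit| seen.insert(lit.clone()));
}

fn main() {
    let mut lits: Vec<Vec<u8>> = vec![
        b"samwise".to_vec(),
        b"sam".to_vec(),
        b"frodo".to_vec(),
    ];
    shrink_preserving_preference(&mut lits, 3);
    // "samwise" and "sam" collapse into one entry; order is preserved.
    assert_eq!(lits, vec![b"sam".to_vec(), b"fro".to_vec()]);
}
```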

Finally, note that it isn't strictly necessary to use an [`Extractor`]. Namely,
an `Extractor` only uses public APIs of the [`Seq`] and [`Literal`] types,
so it is possible to implement your own extractor. For example, for n-grams
or "inner" literals (i.e., not prefix or suffix literals). The `Extractor`
is mostly responsible for the case analysis over `Hir` expressions. Much of
the "trickier" parts are how to combine literal sequences, and that is all
implemented on [`Seq`].
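
To make that division of labor concrete, here is a highly simplified sketch of
the case analysis over a toy expression type. `MiniHir` is invented purely for
illustration and is not this crate's `Hir`; it also ignores exactness tracking
and all of the limits discussed below:

```rust
// A toy expression type standing in for Hir. The match arms do the case
// analysis, while combining sequences (cross product for concatenation,
// union for alternation) is the part that Seq implements in this crate.
enum MiniHir {
    Lit(Vec<u8>),
    Concat(Vec<MiniHir>),
    Alt(Vec<MiniHir>),
}

fn prefixes(e: &MiniHir) -> Vec<Vec<u8>> {
    match e {
        MiniHir::Lit(bytes) => vec![bytes.clone()],
        // Concatenation is a cross product of the child sequences.
        MiniHir::Concat(subs) => subs.iter().fold(vec![vec![]], |acc, sub| {
            let rhs = prefixes(sub);
            acc.iter()
                .flat_map(|p| {
                    rhs.iter().map(move |s| {
                        let mut joined = p.clone();
                        joined.extend_from_slice(s);
                        joined
                    })
                })
                .collect()
        }),
        // Alternation is a union, preserving preference order.
        MiniHir::Alt(subs) => subs.iter().flat_map(prefixes).collect(),
    }
}

fn main() {
    // (a|b)c
    let e = MiniHir::Concat(vec![
        MiniHir::Alt(vec![
            MiniHir::Lit(b"a".to_vec()),
            MiniHir::Lit(b"b".to_vec()),
        ]),
        MiniHir::Lit(b"c".to_vec()),
    ]);
    assert_eq!(prefixes(&e), vec![b"ac".to_vec(), b"bc".to_vec()]);
}
```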
*/

use core::{cmp, mem, num::NonZeroUsize};

use alloc::{vec, vec::Vec};

use crate::hir::{self, Hir};

/// Extracts prefix or suffix literal sequences from [`Hir`] expressions.
///
/// Literal extraction is based on the following observations:
///
/// * Many regexes start with one or a small number of literals.
/// * Substring search for literals is often much faster (sometimes by an
/// order of magnitude) than a regex search.
///
/// Thus, in many cases, one can search for literals to find candidate starting
/// locations of a match, and then only run the full regex engine at each such
/// location instead of over the full haystack.
///
/// The main downside of literal extraction is that it can wind up causing a
/// search to be slower overall. For example, if there are many matches or if
/// there are many candidates that don't ultimately lead to a match, then a
/// lot of overhead will be spent in shuffling back-and-forth between
/// substring search and the regex engine. This is the fundamental reason why
/// literal optimizations for regex patterns are sometimes considered a
/// "black art."
///
/// # Look-around assertions
///
/// Literal extraction treats all look-around assertions as-if they match every
/// empty string. So for example, the regex `\bquux\b` will yield a sequence
/// containing a single exact literal `quux`. However, not all occurrences
/// of `quux` correspond to a match of the regex. For example, `\bquux\b`
/// does not match `ZquuxZ` anywhere because `quux` does not fall on a word
/// boundary.
///
/// In effect, if your regex contains look-around assertions, then a match of
/// an exact literal does not necessarily mean the regex overall matches. So
/// you may still need to run the regex engine in such cases to confirm the
/// match.
///
/// The precise guarantee you get from a literal sequence is: if every literal
/// in the sequence is exact and the original regex contains zero look-around
/// assertions, then a preference-order multi-substring search of those
/// literals will precisely match a preference-order search of the original
/// regex.
///
/// # Example
///
/// This shows how to extract prefixes:
///
/// ```
/// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
///
/// let hir = parse(r"(a|b|c)(x|y|z)[A-Z]+foo")?;
/// let got = Extractor::new().extract(&hir);
/// // All literals returned are "inexact" because none of them reach the
/// // match state.
/// let expected = Seq::from_iter([
///     Literal::inexact("ax"),
///     Literal::inexact("ay"),
///     Literal::inexact("az"),
///     Literal::inexact("bx"),
///     Literal::inexact("by"),
///     Literal::inexact("bz"),
///     Literal::inexact("cx"),
///     Literal::inexact("cy"),
///     Literal::inexact("cz"),
/// ]);
/// assert_eq!(expected, got);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// This shows how to extract suffixes:
///
/// ```
/// use regex_syntax::{
///     hir::literal::{Extractor, ExtractKind, Literal, Seq},
///     parse,
/// };
///
/// let hir = parse(r"foo|[A-Z]+bar")?;
/// let got = Extractor::new().kind(ExtractKind::Suffix).extract(&hir);
/// // Since 'foo' gets to a match state, it is considered exact. But 'bar'
/// // does not because of the '[A-Z]+', and thus is marked inexact.
/// let expected = Seq::from_iter([
///     Literal::exact("foo"),
///     Literal::inexact("bar"),
/// ]);
/// assert_eq!(expected, got);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[derive(Clone, Debug)]
pub struct Extractor {
    kind: ExtractKind,
    limit_class: usize,
    limit_repeat: usize,
    limit_literal_len: usize,
    limit_total: usize,
}

impl Extractor {
    /// Create a new extractor with a default configuration.
    ///
    /// The extractor can be optionally configured before calling
    /// [`Extractor::extract`] to get a literal sequence.
    pub fn new() -> Extractor {
        Extractor {
            kind: ExtractKind::Prefix,
            limit_class: 10,
            limit_repeat: 10,
            limit_literal_len: 100,
            limit_total: 250,
        }
    }

    /// Execute the extractor and return a sequence of literals.
    pub fn extract(&self, hir: &Hir) -> Seq {
        use crate::hir::HirKind::*;

        match *hir.kind() {
            Empty | Look(_) => Seq::singleton(self::Literal::exact(vec![])),
            Literal(hir::Literal(ref bytes)) => {
                let mut seq =
                    Seq::singleton(self::Literal::exact(bytes.to_vec()));
                self.enforce_literal_len(&mut seq);
                seq
            }
            Class(hir::Class::Unicode(ref cls)) => {
                self.extract_class_unicode(cls)
            }
            Class(hir::Class::Bytes(ref cls)) => self.extract_class_bytes(cls),
            Repetition(ref rep) => self.extract_repetition(rep),
            Capture(hir::Capture { ref sub, .. }) => self.extract(sub),
            Concat(ref hirs) => match self.kind {
                ExtractKind::Prefix => self.extract_concat(hirs.iter()),
                ExtractKind::Suffix => self.extract_concat(hirs.iter().rev()),
            },
            Alternation(ref hirs) => {
                // Unlike concat, we always union starting from the beginning,
                // since the beginning corresponds to the highest preference,
                // which doesn't change based on forwards vs reverse.
                self.extract_alternation(hirs.iter())
            }
        }
    }

    /// Set the kind of literal sequence to extract from an [`Hir`] expression.
    ///
    /// The default is to extract prefixes, but suffixes can be selected
    /// instead. The contract for prefixes is that every match of the
    /// corresponding `Hir` must start with one of the literals in the sequence
    /// returned. Moreover, the _order_ of the sequence returned corresponds to
    /// the preference order.
    ///
    /// Suffixes satisfy a similar contract in that every match of the
    /// corresponding `Hir` must end with one of the literals in the sequence
    /// returned. However, there is no guarantee that the literals are in
    /// preference order.
    ///
    /// Remember that a sequence can be infinite. For example, unless the
    /// limits are configured to be impractically large, attempting to extract
    /// prefixes (or suffixes) for the pattern `[A-Z]` will return an infinite
    /// sequence. Generally speaking, if the sequence returned is infinite,
    /// then it is presumed to be unwise to do prefix (or suffix) optimizations
    /// for the pattern.
    pub fn kind(&mut self, kind: ExtractKind) -> &mut Extractor {
        self.kind = kind;
        self
    }

    /// Configure a limit on the length of the sequence that is permitted for
    /// a character class. If a character class exceeds this limit, then the
    /// sequence returned for it is infinite.
    ///
    /// This prevents classes like `[A-Z]` or `\pL` from getting turned into
    /// huge and likely unproductive sequences of literals.
    ///
    /// # Example
    ///
    /// This example shows how this limit can be lowered to decrease the
    /// tolerance for character classes being turned into literal sequences.
    ///
    /// ```
    /// use regex_syntax::{hir::literal::{Extractor, Seq}, parse};
    ///
    /// let hir = parse(r"[0-9]")?;
    ///
    /// let got = Extractor::new().extract(&hir);
    /// let expected = Seq::new([
    ///     "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
    /// ]);
    /// assert_eq!(expected, got);
    ///
    /// // Now let's shrink the limit and see how that changes things.
    /// let got = Extractor::new().limit_class(4).extract(&hir);
    /// let expected = Seq::infinite();
    /// assert_eq!(expected, got);
    ///
    /// # Ok::<(), Box<dyn std::error::Error>>(())
    /// ```
    pub fn limit_class(&mut self, limit: usize) -> &mut Extractor {
        self.limit_class = limit;
        self
    }

    /// Configure a limit on the total number of repetitions that is permitted
    /// before literal extraction is stopped.
    ///
    /// This is useful for limiting things like `(abcde){50}`, or more
    /// insidiously, `(?:){1000000000}`. This limit prevents any one single
    /// repetition from adding too much to a literal sequence.
    ///
    /// With this limit set, repetitions that exceed it will be stopped and any
    /// literals extracted up to that point will be made inexact.
    ///
    /// # Example
    ///
    /// This shows how to decrease the limit and compares it with the default.
    ///
    /// ```
    /// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
    ///
    /// let hir = parse(r"(abc){8}")?;
    ///
    /// let got = Extractor::new().extract(&hir);
    /// let expected = Seq::new(["abcabcabcabcabcabcabcabc"]);
    /// assert_eq!(expected, got);
    ///
    /// // Now let's shrink the limit and see how that changes things.
    /// let got = Extractor::new().limit_repeat(4).extract(&hir);
    /// let expected = Seq::from_iter([
    ///     Literal::inexact("abcabcabcabc"),
    /// ]);
    /// assert_eq!(expected, got);
    ///
    /// # Ok::<(), Box<dyn std::error::Error>>(())
    /// ```
    pub fn limit_repeat(&mut self, limit: usize) -> &mut Extractor {
        self.limit_repeat = limit;
        self
    }

    /// Configure a limit on the maximum length of any literal in a sequence.
    ///
    /// This is useful for limiting things like `(abcde){5}{5}{5}{5}`. While
    /// each repetition or literal in that regex is small, when all the
    /// repetitions are applied, one ends up with a literal of length `5^4 =
    /// 625`.
    ///
    /// With this limit set, literals that exceed it will be made inexact and
    /// thus prevented from growing.
    ///
    /// # Example
    ///
    /// This shows how to decrease the limit and compares it with the default.
    ///
    /// ```
    /// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
    ///
    /// let hir = parse(r"(abc){2}{2}{2}")?;
    ///
    /// let got = Extractor::new().extract(&hir);
    /// let expected = Seq::new(["abcabcabcabcabcabcabcabc"]);
    /// assert_eq!(expected, got);
    ///
    /// // Now let's shrink the limit and see how that changes things.
    /// let got = Extractor::new().limit_literal_len(14).extract(&hir);
    /// let expected = Seq::from_iter([
    ///     Literal::inexact("abcabcabcabcab"),
    /// ]);
    /// assert_eq!(expected, got);
    ///
    /// # Ok::<(), Box<dyn std::error::Error>>(())
    /// ```
    pub fn limit_literal_len(&mut self, limit: usize) -> &mut Extractor {
        self.limit_literal_len = limit;
        self
    }

    /// Configure a limit on the total number of literals that will be
    /// returned.
    ///
    /// This is useful as a practical measure for avoiding the creation of
    /// large sequences of literals. While the extractor will automatically
    /// handle local creations of large sequences (for example, `[A-Z]` yields
    /// an infinite sequence by default), large sequences can be created
    /// through non-local means as well.
    ///
    /// For example, `[ab]{3}{3}` would yield a sequence of length `512 = 2^9`
    /// despite each of the repetitions being small on their own. This limit
    /// thus represents a "catch all" for avoiding locally small sequences from
    /// combining into large sequences.
    ///
    /// # Example
    ///
    /// This example shows how reducing the limit will change the literal
    /// sequence returned.
    ///
    /// ```
    /// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};
    ///
    /// let hir = parse(r"[ab]{2}{2}")?;
    ///
    /// let got = Extractor::new().extract(&hir);
    /// let expected = Seq::new([
    ///     "aaaa", "aaab", "aaba", "aabb",
    ///     "abaa", "abab", "abba", "abbb",
    ///     "baaa", "baab", "baba", "babb",
    ///     "bbaa", "bbab", "bbba", "bbbb",
    /// ]);
    /// assert_eq!(expected, got);
    ///
    /// // The default limit is not too big, but big enough to extract all
    /// // literals from '[ab]{2}{2}'. If we shrink the limit to less than 16,
    /// // then we'll get a truncated set. Notice that it returns a sequence of
    /// // length 4 even though our limit was 10. This is because the sequence
    /// // is difficult to increase without blowing the limit. Notice also
    /// // that every literal in the sequence is now inexact because they were
    /// // stripped of some suffix.
    /// let got = Extractor::new().limit_total(10).extract(&hir);
    /// let expected = Seq::from_iter([
    ///     Literal::inexact("aa"),
    ///     Literal::inexact("ab"),
    ///     Literal::inexact("ba"),
    ///     Literal::inexact("bb"),
    /// ]);
    /// assert_eq!(expected, got);
    ///
    /// # Ok::<(), Box<dyn std::error::Error>>(())
    /// ```
    pub fn limit_total(&mut self, limit: usize) -> &mut Extractor {
        self.limit_total = limit;
        self
    }

    /// Extract a sequence from the given concatenation. Sequences from each
    /// of the child HIR expressions are combined via cross product.
    ///
    /// This short circuits once the cross product turns into a sequence
    /// containing only inexact literals.
    fn extract_concat<'a, I: Iterator<Item = &'a Hir>>(&self, it: I) -> Seq {
        let mut seq = Seq::singleton(self::Literal::exact(vec![]));
        for hir in it {
            // If every element in the sequence is inexact, then a cross
            // product will always be a no-op. Thus, there is nothing else we
            // can add to it and we can quit early. Note that this also
            // includes infinite sequences.
            if seq.is_inexact() {
                break;
            }
            // Note that 'cross' also dispatches based on whether we're
            // extracting prefixes or suffixes.
            seq = self.cross(seq, &mut self.extract(hir));
        }
        seq
    }

    /// Extract a sequence from the given alternation.
    ///
    /// This short circuits once the union turns into an infinite sequence.
    fn extract_alternation<'a, I: Iterator<Item = &'a Hir>>(
        &self,
        it: I,
    ) -> Seq {
        let mut seq = Seq::empty();
        for hir in it {
            // Once our 'seq' is infinite, every subsequent union
            // operation on it will itself always result in an
            // infinite sequence. Thus, it can never change and we can
            // short-circuit.
            if !seq.is_finite() {
                break;
            }
            seq = self.union(seq, &mut self.extract(hir));
        }
        seq
    }

    /// Extract a sequence of literals from the given repetition. We do our
    /// best. Some examples:
    ///
    ///   'a*'     => [inexact(a), exact("")]
    ///   'a*?'    => [exact(""), inexact(a)]
    ///   'a+'     => [inexact(a)]
    ///   'a{3}'   => [exact(aaa)]
    ///   'a{3,5}' => [inexact(aaa)]
    ///
    /// The key here really is making sure we get the 'inexact' vs 'exact'
    /// attributes correct on each of the literals we add. For example, the
    /// fact that 'a*' gives us an inexact 'a' and an exact empty string means
    /// that a regex like 'ab*c' will result in [inexact(ab), exact(ac)]
    /// literals being extracted, which might actually be a better prefilter
    /// than just 'a'.
    fn extract_repetition(&self, rep: &hir::Repetition) -> Seq {
        let mut subseq = self.extract(&rep.sub);
        match *rep {
            hir::Repetition { min: 0, max, greedy, .. } => {
                // When 'max=1', we can retain exactness, since 'a?' is
                // equivalent to 'a|'. Similarly below, 'a??' is equivalent to
                // '|a'.
                if max != Some(1) {
                    subseq.make_inexact();
                }
                let mut empty = Seq::singleton(Literal::exact(vec![]));
                if !greedy {
                    mem::swap(&mut subseq, &mut empty);
                }
                self.union(subseq, &mut empty)
            }
            hir::Repetition { min, max: Some(max), .. } if min == max => {
                assert!(min > 0); // handled above
                let limit =
                    u32::try_from(self.limit_repeat).unwrap_or(u32::MAX);
                let mut seq = Seq::singleton(Literal::exact(vec![]));
                for _ in 0..cmp::min(min, limit) {
                    if seq.is_inexact() {
                        break;
                    }
                    seq = self.cross(seq, &mut subseq.clone());
                }
                if usize::try_from(min).is_err() || min > limit {
                    seq.make_inexact();
                }
                seq
            }
            hir::Repetition { min, .. } => {
                assert!(min > 0); // handled above
                let limit =
                    u32::try_from(self.limit_repeat).unwrap_or(u32::MAX);
                let mut seq = Seq::singleton(Literal::exact(vec![]));
                for _ in 0..cmp::min(min, limit) {
                    if seq.is_inexact() {
                        break;
                    }
                    seq = self.cross(seq, &mut subseq.clone());
                }
                seq.make_inexact();
                seq
            }
        }
    }

    /// Convert the given Unicode class into a sequence of literals if the
    /// class is small enough. If the class is too big, return an infinite
    /// sequence.
    fn extract_class_unicode(&self, cls: &hir::ClassUnicode) -> Seq {
        if self.class_over_limit_unicode(cls) {
            return Seq::infinite();
        }
        let mut seq = Seq::empty();
        for r in cls.iter() {
            for ch in r.start()..=r.end() {
                seq.push(Literal::from(ch));
            }
        }
        self.enforce_literal_len(&mut seq);
        seq
    }

    /// Convert the given byte class into a sequence of literals if the class
    /// is small enough. If the class is too big, return an infinite sequence.
    fn extract_class_bytes(&self, cls: &hir::ClassBytes) -> Seq {
        if self.class_over_limit_bytes(cls) {
            return Seq::infinite();
        }
        let mut seq = Seq::empty();
        for r in cls.iter() {
            for b in r.start()..=r.end() {
                seq.push(Literal::from(b));
            }
        }
        self.enforce_literal_len(&mut seq);
        seq
    }

    /// Returns true if the given Unicode class exceeds the configured limits
    /// on this extractor.
    fn class_over_limit_unicode(&self, cls: &hir::ClassUnicode) -> bool {
        let mut count = 0;
        for r in cls.iter() {
            if count > self.limit_class {
                return true;
            }
            count += r.len();
        }
        count > self.limit_class
    }

    /// Returns true if the given byte class exceeds the configured limits on
    /// this extractor.
    fn class_over_limit_bytes(&self, cls: &hir::ClassBytes) -> bool {
        let mut count = 0;
        for r in cls.iter() {
            if count > self.limit_class {
                return true;
            }
            count += r.len();
        }
        count > self.limit_class
    }

    /// Compute the cross product of the two sequences if the result would be
    /// within configured limits. Otherwise, make `seq2` infinite and cross the
    /// infinite sequence with `seq1`.
    fn cross(&self, mut seq1: Seq, seq2: &mut Seq) -> Seq {
        if seq1.max_cross_len(seq2).map_or(false, |len| len > self.limit_total)
        {
            seq2.make_infinite();
        }
        if let ExtractKind::Suffix = self.kind {
            seq1.cross_reverse(seq2);
        } else {
            seq1.cross_forward(seq2);
        }
        assert!(seq1.len().map_or(true, |x| x <= self.limit_total));
        self.enforce_literal_len(&mut seq1);
        seq1
    }

    /// Union the two sequences if the result would be within configured
    /// limits. Otherwise, make `seq2` infinite and union the infinite sequence
    /// with `seq1`.
    fn union(&self, mut seq1: Seq, seq2: &mut Seq) -> Seq {
        if seq1.max_union_len(seq2).map_or(false, |len| len > self.limit_total)
        {
            // We try to trim our literal sequences to see if we can make
            // room for more literals. The idea is that we'd rather trim down
            // literals already in our sequence if it means we can add a few
            // more and retain a finite sequence. Otherwise, we'll union with
            // an infinite sequence and that infects everything and effectively
            // stops literal extraction in its tracks.
            //
            // Why do we keep 4 bytes here? Well, it's a bit of an abstraction
            // leakage. Downstream, the literals may wind up getting fed to
            // the Teddy algorithm, which supports searching literals up to
            // length 4. So that's why we pick that number here. Arguably this
            // should be a tuneable parameter, but it seems a little tricky to
            // describe. And I'm still unsure if this is the right way to go
            // about culling literal sequences.
            match self.kind {
                ExtractKind::Prefix => {
                    seq1.keep_first_bytes(4);
                    seq2.keep_first_bytes(4);
                }
                ExtractKind::Suffix => {
                    seq1.keep_last_bytes(4);
                    seq2.keep_last_bytes(4);
                }
            }
            seq1.dedup();
            seq2.dedup();
            if seq1
                .max_union_len(seq2)
                .map_or(false, |len| len > self.limit_total)
            {
                seq2.make_infinite();
            }
        }
        seq1.union(seq2);
        assert!(seq1.len().map_or(true, |x| x <= self.limit_total));
        seq1
    }

    /// Applies the literal length limit to the given sequence. If none of the
    /// literals in the sequence exceed the limit, then this is a no-op.
    fn enforce_literal_len(&self, seq: &mut Seq) {
        let len = self.limit_literal_len;
        match self.kind {
            ExtractKind::Prefix => seq.keep_first_bytes(len),
            ExtractKind::Suffix => seq.keep_last_bytes(len),
        }
    }
}
|---|
| 628 |  | 
|---|
| 629 | impl Default for Extractor { | 
|---|
| 630 | fn default() -> Extractor { | 
|---|
| 631 | Extractor::new() | 
|---|
| 632 | } | 
|---|
| 633 | } | 
|---|
| 634 |  | 
|---|
| 635 | /// The kind of literals to extract from an [`Hir`] expression. | 
|---|
| 636 | /// | 
|---|
| 637 | /// The default extraction kind is `Prefix`. | 
|---|
| 638 | #[ non_exhaustive] | 
|---|
| 639 | #[ derive(Clone, Debug)] | 
|---|
| 640 | pub enum ExtractKind { | 
|---|
| 641 | /// Extracts only prefix literals from a regex. | 
|---|
| 642 | Prefix, | 
|---|
| 643 | /// Extracts only suffix literals from a regex. | 
|---|
| 644 | /// | 
|---|
| 645 | /// Note that the sequence returned by suffix literals currently may | 
|---|
| 646 | /// not correctly represent leftmost-first or "preference" order match | 
|---|
| 647 | /// semantics. | 
|---|
| 648 | Suffix, | 
|---|
| 649 | } | 
|---|
| 650 |  | 
|---|
| 651 | impl ExtractKind { | 
|---|
| 652 | /// Returns true if this kind is the `Prefix` variant. | 
|---|
| 653 | pub fn is_prefix(&self) -> bool { | 
|---|
| 654 | matches!(*self, ExtractKind::Prefix) | 
|---|
| 655 | } | 
|---|
| 656 |  | 
|---|
| 657 | /// Returns true if this kind is the `Suffix` variant. | 
|---|
| 658 | pub fn is_suffix(&self) -> bool { | 
|---|
| 659 | matches!(*self, ExtractKind::Suffix) | 
|---|
| 660 | } | 
|---|
| 661 | } | 
|---|
| 662 |  | 
|---|
| 663 | impl Default for ExtractKind { | 
|---|
| 664 | fn default() -> ExtractKind { | 
|---|
| 665 | ExtractKind::Prefix | 
|---|
| 666 | } | 
|---|
| 667 | } | 
|---|
| 668 |  | 
|---|
/// A sequence of literals.
///
/// A `Seq` is very much like a set in that it represents a union of its
/// members. That is, it corresponds to a set of literals where at least one
/// must match in order for a particular [`Hir`] expression to match. (Whether
/// this corresponds to the entire `Hir` expression, a prefix of it or a suffix
/// of it depends on how the `Seq` was extracted from the `Hir`.)
///
/// It is also unlike a set in that multiple identical literals may appear,
/// and that the order of the literals in the `Seq` matters. For example, if
/// the sequence is `[sam, samwise]` and leftmost-first matching is used, then
/// `samwise` can never match and the sequence is equivalent to `[sam]`.
///
/// # States of a sequence
///
/// A `Seq` has a few different logical states to consider:
///
/// * The sequence can represent "any" literal. When this happens, the set does
/// not have a finite size. The purpose of this state is to inhibit callers
/// from making assumptions about what literals are required in order to match
/// a particular [`Hir`] expression. Generally speaking, when a set is in this
/// state, literal optimizations are inhibited. A good example of a regex that
/// will cause this sort of set to appear is `[A-Za-z]`. The character class
/// is just too big (and also too narrow) to be usefully expanded into 52
/// different literals. (Note that the decision for when a seq should become
/// infinite is determined by the caller. A seq itself has no hard-coded
/// limits.)
/// * The sequence can be empty, in which case, it is an affirmative statement
/// that there are no literals that can match the corresponding `Hir`.
/// Consequently, the `Hir` never matches any input. For example, `[a&&b]`.
/// * The sequence can be non-empty, in which case, at least one of the
/// literals must match in order for the corresponding `Hir` to match.
///
/// # Example
///
/// This example shows how literal sequences can be simplified by stripping
/// suffixes and minimizing while maintaining preference order.
///
/// ```
/// use regex_syntax::hir::literal::{Literal, Seq};
///
/// let mut seq = Seq::new(&[
///     "farm",
///     "appliance",
///     "faraway",
///     "apple",
///     "fare",
///     "gap",
///     "applicant",
///     "applaud",
/// ]);
/// seq.keep_first_bytes(3);
/// seq.minimize_by_preference();
/// // Notice that 'far' comes before 'app', which matches the order in the
/// // original sequence. This guarantees that leftmost-first semantics are
/// // not altered by simplifying the set.
/// let expected = Seq::from_iter([
///     Literal::inexact("far"),
///     Literal::inexact("app"),
///     Literal::exact("gap"),
/// ]);
/// assert_eq!(expected, seq);
/// ```
#[derive(Clone, Eq, PartialEq)]
pub struct Seq {
    /// The members of this seq.
    ///
    /// When `None`, the seq represents all possible literals. That is, it
    /// prevents one from making assumptions about specific literals in the
    /// seq, and forces one to treat it as if any literal might be in the seq.
    ///
    /// Note that `Some(vec![])` is valid and corresponds to the empty seq of
    /// literals, i.e., a regex that can never match. For example, `[a&&b]`.
    /// It is distinct from `Some(vec![""])`, which corresponds to the seq
    /// containing an empty string, which matches at every position.
    literals: Option<Vec<Literal>>,
}

impl Seq {
    /// Returns an empty sequence.
    ///
    /// An empty sequence matches zero literals, and thus corresponds to a
    /// regex that itself can never match.
    #[inline]
    pub fn empty() -> Seq {
        Seq { literals: Some(vec![]) }
    }

    /// Returns a sequence of literals without a finite size that may contain
    /// any literal.
    ///
    /// A sequence without a finite size does not reveal anything about the
    /// characteristics of the literals in its set. There are no fixed
    /// prefixes or suffixes, nor are lower or upper bounds on the length of
    /// the literals in the set known.
    ///
    /// This is useful to represent constructs in a regex that are "too big"
    /// to usefully represent as a sequence of literals. For example,
    /// `[A-Za-z]`. When sequences get too big, they lose their discriminating
    /// nature and are more likely to produce false positives, which in turn
    /// makes them less likely to speed up searches.
    ///
    /// More pragmatically, for many regexes, enumerating all possible
    /// literals is itself not possible or might otherwise use too many
    /// resources. So constraining the size of sets during extraction is a
    /// practical trade off to make.
    #[inline]
    pub fn infinite() -> Seq {
        Seq { literals: None }
    }

    /// Returns a sequence containing a single literal.
    #[inline]
    pub fn singleton(lit: Literal) -> Seq {
        Seq { literals: Some(vec![lit]) }
    }

    /// Returns a sequence of exact literals from the given byte strings.
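    ///
    /// # Example
    ///
    /// A brief illustration; every literal built this way is exact:
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// let seq = Seq::new(&["foo", "bar"]);
    /// let expected = Seq::from_iter([
    ///     Literal::exact("foo"),
    ///     Literal::exact("bar"),
    /// ]);
    /// assert_eq!(expected, seq);
    /// ```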
    #[inline]
    pub fn new<I, B>(it: I) -> Seq
    where
        I: IntoIterator<Item = B>,
        B: AsRef<[u8]>,
    {
        it.into_iter().map(|b| Literal::exact(b.as_ref())).collect()
    }

    /// If this is a finite sequence, return its members as a slice of
    /// literals.
    ///
    /// The slice returned may be empty, in which case, there are no literals
    /// that can match this sequence.
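    ///
    /// # Example
    ///
    /// A brief illustration of both the finite and infinite cases:
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let seq = Seq::new(&["foo", "bar"]);
    /// assert_eq!(2, seq.literals().unwrap().len());
    ///
    /// // An infinite sequence has no slice of literals to inspect.
    /// assert!(Seq::infinite().literals().is_none());
    /// ```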
    #[inline]
    pub fn literals(&self) -> Option<&[Literal]> {
        self.literals.as_deref()
    }

    /// Push a literal to the end of this sequence.
    ///
    /// If this sequence is not finite, then this is a no-op.
    ///
    /// Similarly, if the most recently added item of this sequence is
    /// equivalent to the literal given, then it is not added. This reflects
    /// a `Seq`'s "set like" behavior, and represents a practical trade off.
    /// Namely, there is never any need to have two adjacent and equivalent
    /// literals in the same sequence, _and_ it is easy to detect in some
    /// cases.
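    ///
    /// # Example
    ///
    /// A brief illustration of the adjacent deduping described above:
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// let mut seq = Seq::empty();
    /// seq.push(Literal::exact("foo"));
    /// // Pushing an equivalent adjacent literal is a no-op.
    /// seq.push(Literal::exact("foo"));
    /// assert_eq!(Some(1), seq.len());
    /// // But a distinct literal is appended.
    /// seq.push(Literal::exact("bar"));
    /// assert_eq!(Some(2), seq.len());
    /// ```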
    #[inline]
    pub fn push(&mut self, lit: Literal) {
        let lits = match self.literals {
            None => return,
            Some(ref mut lits) => lits,
        };
        if lits.last().map_or(false, |m| m == &lit) {
            return;
        }
        lits.push(lit);
    }

    /// Make all of the literals in this sequence inexact.
    ///
    /// This is a no-op if this sequence is not finite.
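    ///
    /// # Example
    ///
    /// A brief illustration:
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// let mut seq = Seq::new(&["foo"]);
    /// seq.make_inexact();
    /// assert_eq!(Seq::from_iter([Literal::inexact("foo")]), seq);
    /// ```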
    #[inline]
    pub fn make_inexact(&mut self) {
        let lits = match self.literals {
            None => return,
            Some(ref mut lits) => lits,
        };
        for lit in lits.iter_mut() {
            lit.make_inexact();
        }
    }

    /// Converts this sequence to an infinite sequence.
    ///
    /// This is a no-op if the sequence is already infinite.
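    ///
    /// # Example
    ///
    /// A brief illustration; an infinite sequence has no finite length:
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let mut seq = Seq::new(&["foo"]);
    /// assert_eq!(Some(1), seq.len());
    /// seq.make_infinite();
    /// assert_eq!(None, seq.len());
    /// ```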
    #[inline]
    pub fn make_infinite(&mut self) {
        self.literals = None;
    }

    /// Modify this sequence to contain the cross product between it and the
    /// sequence given.
    ///
    /// The cross product only considers literals in this sequence that are
    /// exact. That is, inexact literals are not extended.
    ///
    /// The literals are always drained from `other`, even if none are used.
    /// This permits callers to reuse the sequence allocation elsewhere.
    ///
    /// If this sequence is infinite, then this is a no-op, regardless of what
    /// `other` contains (and in this case, the literals are still drained from
    /// `other`). If `other` is infinite and this sequence is finite, then this
    /// is a no-op, unless this sequence contains a zero-length literal. In
    /// which case, the infiniteness of `other` infects this sequence, and this
    /// sequence is itself made infinite.
    ///
    /// Like [`Seq::union`], this may attempt to deduplicate literals. See
    /// [`Seq::dedup`] for how deduplication deals with exact and inexact
    /// literals.
    ///
    /// # Example
    ///
    /// This example shows basic usage and how exact and inexact literals
    /// interact.
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// let mut seq1 = Seq::from_iter([
    ///     Literal::exact("foo"),
    ///     Literal::inexact("bar"),
    /// ]);
    /// let mut seq2 = Seq::from_iter([
    ///     Literal::inexact("quux"),
    ///     Literal::exact("baz"),
    /// ]);
    /// seq1.cross_forward(&mut seq2);
    ///
    /// // The literals are pulled out of seq2.
    /// assert_eq!(Some(0), seq2.len());
    ///
    /// let expected = Seq::from_iter([
    ///     Literal::inexact("fooquux"),
    ///     Literal::exact("foobaz"),
    ///     Literal::inexact("bar"),
    /// ]);
    /// assert_eq!(expected, seq1);
    /// ```
    ///
    /// This example shows the behavior when `other` is an infinite sequence.
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// let mut seq1 = Seq::from_iter([
    ///     Literal::exact("foo"),
    ///     Literal::inexact("bar"),
    /// ]);
    /// let mut seq2 = Seq::infinite();
    /// seq1.cross_forward(&mut seq2);
    ///
    /// // When seq2 is infinite, cross product doesn't add anything, but
    /// // ensures all members of seq1 are inexact.
    /// let expected = Seq::from_iter([
    ///     Literal::inexact("foo"),
    ///     Literal::inexact("bar"),
    /// ]);
    /// assert_eq!(expected, seq1);
    /// ```
    ///
    /// This example is like the one above, but shows what happens when this
    /// sequence contains an empty string. In this case, an infinite `other`
    /// sequence infects this sequence (because the empty string means that
    /// there are no finite prefixes):
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// let mut seq1 = Seq::from_iter([
    ///     Literal::exact("foo"),
    ///     Literal::exact(""), // inexact provokes same behavior
    ///     Literal::inexact("bar"),
    /// ]);
    /// let mut seq2 = Seq::infinite();
    /// seq1.cross_forward(&mut seq2);
    ///
    /// // seq1 is now infinite!
    /// assert!(!seq1.is_finite());
    /// ```
    ///
    /// This example shows the behavior when this sequence is infinite.
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// let mut seq1 = Seq::infinite();
    /// let mut seq2 = Seq::from_iter([
    ///     Literal::exact("foo"),
    ///     Literal::inexact("bar"),
    /// ]);
    /// seq1.cross_forward(&mut seq2);
    ///
    /// // seq1 remains unchanged.
    /// assert!(!seq1.is_finite());
    /// // Even though the literals in seq2 weren't used, it was still drained.
    /// assert_eq!(Some(0), seq2.len());
    /// ```
    #[inline]
    pub fn cross_forward(&mut self, other: &mut Seq) {
        let (lits1, lits2) = match self.cross_preamble(other) {
            None => return,
            Some((lits1, lits2)) => (lits1, lits2),
        };
        let newcap = lits1.len().saturating_mul(lits2.len());
        for selflit in mem::replace(lits1, Vec::with_capacity(newcap)) {
            if !selflit.is_exact() {
                lits1.push(selflit);
                continue;
            }
            for otherlit in lits2.iter() {
                let mut newlit = Literal::exact(Vec::with_capacity(
                    selflit.len() + otherlit.len(),
                ));
                newlit.extend(&selflit);
                newlit.extend(&otherlit);
                if !otherlit.is_exact() {
                    newlit.make_inexact();
                }
                lits1.push(newlit);
            }
        }
        lits2.drain(..);
        self.dedup();
    }

    /// Modify this sequence to contain the cross product between it and
    /// the sequence given, where the sequences are treated as suffixes
    /// instead of prefixes. Namely, the sequence `other` is *prepended*
    /// to `self` (as opposed to `other` being *appended* to `self` in
    /// [`Seq::cross_forward`]).
    ///
    /// The cross product only considers literals in this sequence that are
    /// exact. That is, inexact literals are not extended.
    ///
    /// The literals are always drained from `other`, even if none are used.
    /// This permits callers to reuse the sequence allocation elsewhere.
    ///
    /// If this sequence is infinite, then this is a no-op, regardless of what
    /// `other` contains (and in this case, the literals are still drained from
    /// `other`). If `other` is infinite and this sequence is finite, then this
    /// is a no-op, unless this sequence contains a zero-length literal. In
    /// which case, the infiniteness of `other` infects this sequence, and this
    /// sequence is itself made infinite.
    ///
    /// Like [`Seq::union`], this may attempt to deduplicate literals. See
    /// [`Seq::dedup`] for how deduplication deals with exact and inexact
    /// literals.
    ///
    /// # Example
    ///
    /// This example shows basic usage and how exact and inexact literals
    /// interact.
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// let mut seq1 = Seq::from_iter([
    ///     Literal::exact("foo"),
    ///     Literal::inexact("bar"),
    /// ]);
    /// let mut seq2 = Seq::from_iter([
    ///     Literal::inexact("quux"),
    ///     Literal::exact("baz"),
    /// ]);
    /// seq1.cross_reverse(&mut seq2);
    ///
    /// // The literals are pulled out of seq2.
    /// assert_eq!(Some(0), seq2.len());
    ///
    /// let expected = Seq::from_iter([
    ///     Literal::inexact("quuxfoo"),
    ///     Literal::inexact("bar"),
    ///     Literal::exact("bazfoo"),
    /// ]);
    /// assert_eq!(expected, seq1);
    /// ```
    ///
    /// This example shows the behavior when `other` is an infinite sequence.
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// let mut seq1 = Seq::from_iter([
    ///     Literal::exact("foo"),
    ///     Literal::inexact("bar"),
    /// ]);
    /// let mut seq2 = Seq::infinite();
    /// seq1.cross_reverse(&mut seq2);
    ///
    /// // When seq2 is infinite, cross product doesn't add anything, but
    /// // ensures all members of seq1 are inexact.
    /// let expected = Seq::from_iter([
    ///     Literal::inexact("foo"),
    ///     Literal::inexact("bar"),
    /// ]);
    /// assert_eq!(expected, seq1);
    /// ```
    ///
    /// This example is like the one above, but shows what happens when this
    /// sequence contains an empty string. In this case, an infinite `other`
    /// sequence infects this sequence (because the empty string means that
    /// there are no finite suffixes):
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// let mut seq1 = Seq::from_iter([
    ///     Literal::exact("foo"),
    ///     Literal::exact(""), // inexact provokes same behavior
    ///     Literal::inexact("bar"),
    /// ]);
    /// let mut seq2 = Seq::infinite();
    /// seq1.cross_reverse(&mut seq2);
    ///
    /// // seq1 is now infinite!
    /// assert!(!seq1.is_finite());
    /// ```
    ///
    /// This example shows the behavior when this sequence is infinite.
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// let mut seq1 = Seq::infinite();
    /// let mut seq2 = Seq::from_iter([
    ///     Literal::exact("foo"),
    ///     Literal::inexact("bar"),
    /// ]);
    /// seq1.cross_reverse(&mut seq2);
    ///
    /// // seq1 remains unchanged.
    /// assert!(!seq1.is_finite());
    /// // Even though the literals in seq2 weren't used, it was still drained.
    /// assert_eq!(Some(0), seq2.len());
    /// ```
    #[inline]
    pub fn cross_reverse(&mut self, other: &mut Seq) {
        let (lits1, lits2) = match self.cross_preamble(other) {
            None => return,
            Some((lits1, lits2)) => (lits1, lits2),
        };
        // We basically proceed as we do in 'cross_forward' at this point,
        // except that the outer loop is now 'other' and the inner loop is now
        // 'self'. That's because 'self' corresponds to suffixes and 'other'
        // corresponds to the sequence we want to *prepend* to the suffixes.
        let newcap = lits1.len().saturating_mul(lits2.len());
        let selflits = mem::replace(lits1, Vec::with_capacity(newcap));
        for (i, otherlit) in lits2.drain(..).enumerate() {
            for selflit in selflits.iter() {
                if !selflit.is_exact() {
                    // If the suffix isn't exact, then we can't prepend
                    // anything to it. However, we still want to keep it. But
                    // we only want to keep one of them, to avoid duplication.
                    // (The duplication is okay from a correctness perspective,
                    // but wasteful.)
                    if i == 0 {
                        lits1.push(selflit.clone());
                    }
                    continue;
                }
                let mut newlit = Literal::exact(Vec::with_capacity(
                    otherlit.len() + selflit.len(),
                ));
                newlit.extend(&otherlit);
                newlit.extend(&selflit);
                if !otherlit.is_exact() {
                    newlit.make_inexact();
                }
                lits1.push(newlit);
            }
        }
        self.dedup();
    }

    /// A helper function that corresponds to the subtle preamble for both
    /// `cross_forward` and `cross_reverse`. In effect, it handles the cases
    /// of infinite sequences for both `self` and `other`, as well as ensuring
    /// that literals from `other` are drained even if they aren't used.
    fn cross_preamble<'a>(
        &'a mut self,
        other: &'a mut Seq,
    ) -> Option<(&'a mut Vec<Literal>, &'a mut Vec<Literal>)> {
        let lits2 = match other.literals {
            None => {
                // If our current seq contains the empty string and the seq
                // we're adding matches any literal, then it follows that the
                // current seq must now also match any literal.
                //
                // Otherwise, we just have to make sure everything in this
                // sequence is inexact.
                if self.min_literal_len() == Some(0) {
                    *self = Seq::infinite();
                } else {
                    self.make_inexact();
                }
                return None;
            }
            Some(ref mut lits) => lits,
        };
        let lits1 = match self.literals {
            None => {
                // If we aren't going to make it to the end of this routine
                // where lits2 is drained, then we need to do it now.
                lits2.drain(..);
                return None;
            }
            Some(ref mut lits) => lits,
        };
        Some((lits1, lits2))
    }

    /// Unions the `other` sequence into this one.
    ///
    /// The literals are always drained out of the given `other` sequence,
    /// even if they are being unioned into an infinite sequence. This permits
    /// the caller to reuse the `other` sequence in another context.
    ///
    /// Some literal deduping may be performed. If any deduping happens,
    /// any leftmost-first or "preference" order match semantics will be
    /// preserved.
    ///
    /// # Example
    ///
    /// This example shows basic usage.
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let mut seq1 = Seq::new(&["foo", "bar"]);
    /// let mut seq2 = Seq::new(&["bar", "quux", "foo"]);
    /// seq1.union(&mut seq2);
    ///
    /// // The literals are pulled out of seq2.
    /// assert_eq!(Some(0), seq2.len());
    ///
    /// // Adjacent literals are deduped, but non-adjacent literals may not be.
    /// assert_eq!(Seq::new(&["foo", "bar", "quux", "foo"]), seq1);
    /// ```
    ///
    /// This example shows that literals are drained from `other` even when
    /// they aren't necessarily used.
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let mut seq1 = Seq::infinite();
    /// // Infinite sequences have no finite length.
    /// assert_eq!(None, seq1.len());
    ///
    /// let mut seq2 = Seq::new(&["bar", "quux", "foo"]);
    /// seq1.union(&mut seq2);
    ///
    /// // seq1 is still infinite and seq2 has been drained.
    /// assert_eq!(None, seq1.len());
    /// assert_eq!(Some(0), seq2.len());
    /// ```
    #[inline]
    pub fn union(&mut self, other: &mut Seq) {
        let lits2 = match other.literals {
            None => {
                // Unioning with an infinite sequence always results in an
                // infinite sequence.
                self.make_infinite();
                return;
            }
            Some(ref mut lits) => lits.drain(..),
        };
        let lits1 = match self.literals {
            None => return,
            Some(ref mut lits) => lits,
        };
        lits1.extend(lits2);
        self.dedup();
    }

    /// Unions the `other` sequence into this one by splicing the `other`
    /// sequence at the position of the first zero-length literal.
    ///
    /// This is useful for preserving preference order semantics when combining
    /// two literal sequences. For example, in the regex `(a||f)+foo`, the
    /// correct preference order prefix sequence is `[a, foo, f]`.
    ///
    /// The literals are always drained out of the given `other` sequence,
    /// even if they are being unioned into an infinite sequence. This permits
    /// the caller to reuse the `other` sequence in another context. Note that
    /// the literals are drained even if no union is performed as well, i.e.,
    /// when this sequence does not contain a zero-length literal.
    ///
    /// Some literal deduping may be performed. If any deduping happens,
    /// any leftmost-first or "preference" order match semantics will be
    /// preserved.
    ///
    /// # Example
    ///
    /// This example shows basic usage.
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let mut seq1 = Seq::new(&["a", "", "f", ""]);
    /// let mut seq2 = Seq::new(&["foo"]);
    /// seq1.union_into_empty(&mut seq2);
    ///
    /// // The literals are pulled out of seq2.
    /// assert_eq!(Some(0), seq2.len());
    /// // 'foo' gets spliced into seq1 where the first empty string occurs.
    /// assert_eq!(Seq::new(&["a", "foo", "f"]), seq1);
    /// ```
    ///
    /// This example shows that literals are drained from `other` even when
    /// they aren't necessarily used.
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let mut seq1 = Seq::new(&["foo", "bar"]);
    /// let mut seq2 = Seq::new(&["bar", "quux", "foo"]);
    /// seq1.union_into_empty(&mut seq2);
    ///
    /// // seq1 has no zero length literals, so no splicing happens.
    /// assert_eq!(Seq::new(&["foo", "bar"]), seq1);
    /// // Even though no splicing happens, seq2 is still drained.
    /// assert_eq!(Some(0), seq2.len());
    /// ```
    #[inline]
    pub fn union_into_empty(&mut self, other: &mut Seq) {
        let lits2 = other.literals.as_mut().map(|lits| lits.drain(..));
        let lits1 = match self.literals {
            None => return,
            Some(ref mut lits) => lits,
        };
        let first_empty = match lits1.iter().position(|m| m.is_empty()) {
            None => return,
            Some(i) => i,
        };
        let lits2 = match lits2 {
            None => {
                // Note that we are only here if we've found an empty literal,
                // which implies that an infinite sequence infects this seq and
                // also turns it into an infinite sequence.
                self.literals = None;
                return;
            }
            Some(lits) => lits,
        };
        // Clearing out the empties needs to come before the splice because
        // the splice might add more empties that we don't want to get rid
        // of. Since we're splicing into the position of the first empty, the
        // 'first_empty' position computed above is still correct.
        lits1.retain(|m| !m.is_empty());
        lits1.splice(first_empty..first_empty, lits2);
        self.dedup();
    }

    /// Deduplicate adjacent equivalent literals in this sequence.
    ///
    /// If adjacent literals are equivalent strings but one is exact and the
    /// other inexact, the inexact literal is kept and the exact one is
    /// removed.
    ///
    /// Deduping an infinite sequence is a no-op.
    ///
    /// # Example
    ///
    /// This example shows how literals that are duplicate byte strings but
    /// are not equivalent with respect to exactness are resolved.
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// let mut seq = Seq::from_iter([
    ///     Literal::exact("foo"),
    ///     Literal::inexact("foo"),
    /// ]);
    /// seq.dedup();
    ///
    /// assert_eq!(Seq::from_iter([Literal::inexact("foo")]), seq);
    /// ```
    #[inline]
    pub fn dedup(&mut self) {
        if let Some(ref mut lits) = self.literals {
            lits.dedup_by(|lit1, lit2| {
                if lit1.as_bytes() != lit2.as_bytes() {
                    return false;
                }
                if lit1.is_exact() != lit2.is_exact() {
                    lit1.make_inexact();
                    lit2.make_inexact();
                }
                true
            });
        }
    }

    /// Sorts this sequence of literals lexicographically.
    ///
    /// Note that if, before sorting, a literal that is a prefix of another
    /// literal appears after it, then after sorting, the sequence will not
    /// represent the same preference order match semantics. For example,
    /// sorting the sequence `[samwise, sam]` yields the sequence `[sam,
    /// samwise]`. Under preference order semantics, the latter sequence can
    /// never match `samwise`, whereas the former sequence can.
    ///
    /// # Example
    ///
    /// This example shows basic usage.
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let mut seq = Seq::new(&["foo", "quux", "bar"]);
    /// seq.sort();
    ///
    /// assert_eq!(Seq::new(&["bar", "foo", "quux"]), seq);
    /// ```
    #[inline]
    pub fn sort(&mut self) {
        if let Some(ref mut lits) = self.literals {
            lits.sort();
        }
    }

    /// Reverses all of the literals in this sequence.
    ///
    /// The order of the sequence itself is preserved.
    ///
    /// # Example
    ///
    /// This example shows basic usage.
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let mut seq = Seq::new(&["oof", "rab"]);
    /// seq.reverse_literals();
    /// assert_eq!(Seq::new(&["foo", "bar"]), seq);
    /// ```
    #[inline]
    pub fn reverse_literals(&mut self) {
        if let Some(ref mut lits) = self.literals {
            for lit in lits.iter_mut() {
                lit.reverse();
            }
        }
    }

    /// Shrinks this seq to its minimal size while respecting the preference
    /// order of its literals.
    ///
    /// While this routine will remove duplicate literals from this seq, it
    /// will also remove literals that can never match in a leftmost-first or
    /// "preference order" search. Similar to [`Seq::dedup`], if a literal is
    /// deduped, then the one that remains is made inexact.
    ///
    /// This is a no-op on seqs that are empty or not finite.
    ///
    /// # Example
    ///
    /// This example shows the difference between `{sam, samwise}` and
    /// `{samwise, sam}`.
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// // If 'sam' comes before 'samwise' and a preference order search is
    /// // executed, then 'samwise' can never match.
    /// let mut seq = Seq::new(&["sam", "samwise"]);
    /// seq.minimize_by_preference();
    /// assert_eq!(Seq::from_iter([Literal::inexact("sam")]), seq);
    ///
    /// // But if they are reversed, then it's possible for 'samwise' to match
    /// // since it is given higher preference.
    /// let mut seq = Seq::new(&["samwise", "sam"]);
    /// seq.minimize_by_preference();
    /// assert_eq!(Seq::new(&["samwise", "sam"]), seq);
    /// ```
    ///
    /// This example shows that if an empty string is in this seq, then
    /// anything that comes after it can never match.
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// // An empty string is a prefix of all strings, so it automatically
    /// // inhibits any subsequent strings from matching.
    /// let mut seq = Seq::new(&["foo", "bar", "", "quux", "fox"]);
    /// seq.minimize_by_preference();
    /// let expected = Seq::from_iter([
    ///     Literal::exact("foo"),
    ///     Literal::exact("bar"),
    ///     Literal::inexact(""),
    /// ]);
    /// assert_eq!(expected, seq);
    ///
    /// // And of course, if it's at the beginning, then it makes it impossible
    /// // for anything else to match.
    /// let mut seq = Seq::new(&["", "foo", "quux", "fox"]);
    /// seq.minimize_by_preference();
    /// assert_eq!(Seq::from_iter([Literal::inexact("")]), seq);
    /// ```
    #[inline]
    pub fn minimize_by_preference(&mut self) {
        if let Some(ref mut lits) = self.literals {
            PreferenceTrie::minimize(lits, false);
        }
    }

    /// Trims all literals in this seq such that only the first `len` bytes
    /// remain. If a literal has `len` bytes or fewer, then it remains
    /// unchanged. Otherwise, it is trimmed and made inexact.
    ///
    /// # Example
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// let mut seq = Seq::new(&["a", "foo", "quux"]);
    /// seq.keep_first_bytes(2);
    ///
    /// let expected = Seq::from_iter([
    ///     Literal::exact("a"),
    ///     Literal::inexact("fo"),
    ///     Literal::inexact("qu"),
    /// ]);
    /// assert_eq!(expected, seq);
    /// ```
    #[inline]
    pub fn keep_first_bytes(&mut self, len: usize) {
        if let Some(ref mut lits) = self.literals {
            for m in lits.iter_mut() {
                m.keep_first_bytes(len);
            }
        }
    }

    /// Trims all literals in this seq such that only the last `len` bytes
    /// remain. If a literal has `len` bytes or fewer, then it remains
    /// unchanged. Otherwise, it is trimmed and made inexact.
    ///
    /// # Example
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// let mut seq = Seq::new(&["a", "foo", "quux"]);
    /// seq.keep_last_bytes(2);
    ///
    /// let expected = Seq::from_iter([
    ///     Literal::exact("a"),
    ///     Literal::inexact("oo"),
    ///     Literal::inexact("ux"),
    /// ]);
    /// assert_eq!(expected, seq);
    /// ```
    #[inline]
    pub fn keep_last_bytes(&mut self, len: usize) {
        if let Some(ref mut lits) = self.literals {
            for m in lits.iter_mut() {
                m.keep_last_bytes(len);
            }
        }
    }

    /// Returns true if this sequence is finite.
    ///
    /// When false, this sequence is infinite and must be treated as if it
    /// contains every possible literal.
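    ///
    /// # Example
    ///
    /// A brief illustration, using only the `Seq::new` and `Seq::infinite`
    /// constructors that appear in this module's other examples:
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// assert!(Seq::new(&["foo", "bar"]).is_finite());
    /// assert!(!Seq::infinite().is_finite());
    /// ```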
    #[inline]
    pub fn is_finite(&self) -> bool {
        self.literals.is_some()
    }

    /// Returns true if and only if this sequence is finite and empty.
    ///
    /// An empty sequence never matches anything. It can only be produced by
    /// literal extraction when the corresponding regex itself cannot match.
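    ///
    /// # Example
    ///
    /// Note that an infinite sequence is not empty, since only finite
    /// sequences with zero literals are empty:
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// assert!(Seq::empty().is_empty());
    /// assert!(!Seq::new(&["foo"]).is_empty());
    /// assert!(!Seq::infinite().is_empty());
    /// ```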
    #[inline]
    pub fn is_empty(&self) -> bool {
        self.len() == Some(0)
    }

    /// Returns the number of literals in this sequence if the sequence is
    /// finite. If the sequence is infinite, then `None` is returned.
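    ///
    /// # Example
    ///
    /// This follows directly from the definition above:
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// assert_eq!(Some(2), Seq::new(&["foo", "bar"]).len());
    /// assert_eq!(Some(0), Seq::empty().len());
    /// assert_eq!(None, Seq::infinite().len());
    /// ```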
    #[inline]
    pub fn len(&self) -> Option<usize> {
        self.literals.as_ref().map(|lits| lits.len())
    }

    /// Returns true if and only if all literals in this sequence are exact.
    ///
    /// This returns false if the sequence is infinite.
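    ///
    /// # Example
    ///
    /// `Seq::new` produces exact literals, so a freshly constructed finite
    /// sequence is exact, while an infinite sequence never is:
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Literal, Seq};
    ///
    /// assert!(Seq::new(&["foo", "bar"]).is_exact());
    /// assert!(!Seq::from_iter([Literal::inexact("foo")]).is_exact());
    /// assert!(!Seq::infinite().is_exact());
    /// ```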
    #[inline]
    pub fn is_exact(&self) -> bool {
        self.literals().map_or(false, |lits| lits.iter().all(|x| x.is_exact()))
    }

    /// Returns true if and only if all literals in this sequence are inexact.
    ///
    /// This returns true if the sequence is infinite.
    #[inline]
    pub fn is_inexact(&self) -> bool {
        self.literals().map_or(true, |lits| lits.iter().all(|x| !x.is_exact()))
    }

    /// Return the maximum length of the sequence that would result from
    /// unioning `self` with `other`. If either set is infinite, then this
    /// returns `None`.
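    ///
    /// # Example
    ///
    /// When both sequences are finite, the result is simply the sum of their
    /// lengths:
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let seq1 = Seq::new(&["foo", "bar"]);
    /// let seq2 = Seq::new(&["quux"]);
    /// assert_eq!(Some(3), seq1.max_union_len(&seq2));
    /// assert_eq!(None, seq1.max_union_len(&Seq::infinite()));
    /// ```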
    #[inline]
    pub fn max_union_len(&self, other: &Seq) -> Option<usize> {
        let len1 = self.len()?;
        let len2 = other.len()?;
        Some(len1.saturating_add(len2))
    }

    /// Return the maximum length of the sequence that would result from the
    /// cross product of `self` with `other`. If either set is infinite, then
    /// this returns `None`.
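    ///
    /// # Example
    ///
    /// When both sequences are finite, the result is the product of their
    /// lengths:
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let seq1 = Seq::new(&["foo", "bar"]);
    /// let seq2 = Seq::new(&["quux", "baz"]);
    /// assert_eq!(Some(4), seq1.max_cross_len(&seq2));
    /// assert_eq!(None, seq1.max_cross_len(&Seq::infinite()));
    /// ```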
    #[inline]
    pub fn max_cross_len(&self, other: &Seq) -> Option<usize> {
        let len1 = self.len()?;
        let len2 = other.len()?;
        Some(len1.saturating_mul(len2))
    }

    /// Returns the length of the shortest literal in this sequence.
    ///
    /// If the sequence is infinite or empty, then this returns `None`.
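    ///
    /// # Example
    ///
    /// The shortest literal below is `"a"`, with length `1`:
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// assert_eq!(Some(1), Seq::new(&["a", "foo", "quux"]).min_literal_len());
    /// assert_eq!(None, Seq::infinite().min_literal_len());
    /// assert_eq!(None, Seq::empty().min_literal_len());
    /// ```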
    #[inline]
    pub fn min_literal_len(&self) -> Option<usize> {
        self.literals.as_ref()?.iter().map(|x| x.len()).min()
    }

    /// Returns the length of the longest literal in this sequence.
    ///
    /// If the sequence is infinite or empty, then this returns `None`.
    #[inline]
    pub fn max_literal_len(&self) -> Option<usize> {
        self.literals.as_ref()?.iter().map(|x| x.len()).max()
    }

    /// Returns the longest common prefix from this seq.
    ///
    /// If the seq is infinite (i.e., it matches any literal) or contains no
    /// literals, then there is no meaningful prefix and this returns `None`.
    ///
    /// # Example
    ///
    /// This shows some example seqs and their longest common prefix.
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let seq = Seq::new(&["foo", "foobar", "fo"]);
    /// assert_eq!(Some(&b"fo"[..]), seq.longest_common_prefix());
    /// let seq = Seq::new(&["foo", "foo"]);
    /// assert_eq!(Some(&b"foo"[..]), seq.longest_common_prefix());
    /// let seq = Seq::new(&["foo", "bar"]);
    /// assert_eq!(Some(&b""[..]), seq.longest_common_prefix());
    /// let seq = Seq::new(&[""]);
    /// assert_eq!(Some(&b""[..]), seq.longest_common_prefix());
    ///
    /// let seq = Seq::infinite();
    /// assert_eq!(None, seq.longest_common_prefix());
    /// let seq = Seq::empty();
    /// assert_eq!(None, seq.longest_common_prefix());
    /// ```
    #[inline]
    pub fn longest_common_prefix(&self) -> Option<&[u8]> {
        // If we match everything or match nothing, then there's no meaningful
        // longest common prefix.
        let lits = match self.literals {
            None => return None,
            Some(ref lits) => lits,
        };
        if lits.len() == 0 {
            return None;
        }
        let base = lits[0].as_bytes();
        let mut len = base.len();
        for m in lits.iter().skip(1) {
            len = m
                .as_bytes()
                .iter()
                .zip(base[..len].iter())
                .take_while(|&(a, b)| a == b)
                .count();
            if len == 0 {
                return Some(&[]);
            }
        }
        Some(&base[..len])
    }

    /// Returns the longest common suffix from this seq.
    ///
    /// If the seq is infinite (i.e., it matches any literal) or contains no
    /// literals, then there is no meaningful suffix and this returns `None`.
    ///
    /// # Example
    ///
    /// This shows some example seqs and their longest common suffix.
    ///
    /// ```
    /// use regex_syntax::hir::literal::Seq;
    ///
    /// let seq = Seq::new(&["oof", "raboof", "of"]);
    /// assert_eq!(Some(&b"of"[..]), seq.longest_common_suffix());
    /// let seq = Seq::new(&["foo", "foo"]);
    /// assert_eq!(Some(&b"foo"[..]), seq.longest_common_suffix());
    /// let seq = Seq::new(&["foo", "bar"]);
    /// assert_eq!(Some(&b""[..]), seq.longest_common_suffix());
    /// let seq = Seq::new(&[""]);
    /// assert_eq!(Some(&b""[..]), seq.longest_common_suffix());
    ///
    /// let seq = Seq::infinite();
    /// assert_eq!(None, seq.longest_common_suffix());
    /// let seq = Seq::empty();
    /// assert_eq!(None, seq.longest_common_suffix());
    /// ```
    #[inline]
    pub fn longest_common_suffix(&self) -> Option<&[u8]> {
        // If we match everything or match nothing, then there's no meaningful
        // longest common suffix.
        let lits = match self.literals {
            None => return None,
            Some(ref lits) => lits,
        };
        if lits.len() == 0 {
            return None;
        }
        let base = lits[0].as_bytes();
        let mut len = base.len();
        for m in lits.iter().skip(1) {
            len = m
                .as_bytes()
                .iter()
                .rev()
                .zip(base[base.len() - len..].iter().rev())
                .take_while(|&(a, b)| a == b)
                .count();
            if len == 0 {
                return Some(&[]);
            }
        }
        Some(&base[base.len() - len..])
    }

    /// Optimizes this seq while treating its literals as prefixes and
    /// respecting the preference order of its literals.
    ///
    /// The specific way "optimization" works is meant to be an implementation
    /// detail, as it essentially represents a set of heuristics. The goal
    /// that optimization tries to accomplish is to make the literals in this
    /// set reflect inputs that will result in a more effective prefilter,
    /// principally by reducing the false positive rate of candidates found
    /// by the literals in this sequence. That is, when a match of a literal
    /// is found, we would like it to be a strong predictor of the overall
    /// match of the regex. If it isn't, then much time will be spent starting
    /// and stopping the prefilter search and attempting to confirm the match
    /// only to have it fail.
    ///
    /// Some of those heuristics might be:
    ///
    /// * Identifying a common prefix from a larger sequence of literals, and
    /// shrinking the sequence down to that single common prefix.
    /// * Rejecting the sequence entirely if it is believed to result in a
    /// very high false positive rate. When this happens, the sequence is made
    /// infinite.
    /// * Shrinking the sequence to a smaller number of literals representing
    /// prefixes, but not shrinking it so much as to make literals too short.
    /// (A sequence with very short literals, of 1 or 2 bytes, will typically
    /// result in a higher false positive rate.)
    ///
    /// Optimization should only be run once extraction is complete. Namely,
    /// optimization may make assumptions that do not compose with other
    /// operations in the middle of extraction. For example, optimization will
    /// reduce `[E(sam), E(samwise)]` to `[E(sam)]`, but such a transformation
    /// is only valid if no other extraction will occur. If other extraction
    /// may occur, then the correct transformation would be to `[I(sam)]`.
    ///
    /// [`Seq::optimize_for_suffix_by_preference`] does the same thing, but
    /// for suffixes.
    ///
    /// # Example
    ///
    /// This shows how optimization might transform a sequence. Note that
    /// the specific behavior is not a documented guarantee. The heuristics
    /// used are an implementation detail and may change over time in semver
    /// compatible releases.
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Seq, Literal};
    ///
    /// let mut seq = Seq::new(&[
    ///     "samantha",
    ///     "sam",
    ///     "samwise",
    ///     "frodo",
    /// ]);
    /// seq.optimize_for_prefix_by_preference();
    /// assert_eq!(Seq::from_iter([
    ///     Literal::exact("samantha"),
    ///     // Kept exact even though 'samwise' got pruned
    ///     // because optimization assumes literal extraction
    ///     // has finished.
    ///     Literal::exact("sam"),
    ///     Literal::exact("frodo"),
    /// ]), seq);
    /// ```
    ///
    /// # Example: optimization may make the sequence infinite
    ///
    /// If the heuristics deem that the sequence could cause a very high false
    /// positive rate, then it may make the sequence infinite, effectively
    /// disabling its use as a prefilter.
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Seq, Literal};
    ///
    /// let mut seq = Seq::new(&[
    ///     "samantha",
    ///     // An empty string matches at every position,
    ///     // thus rendering the prefilter completely
    ///     // ineffective.
    ///     "",
    ///     "sam",
    ///     "samwise",
    ///     "frodo",
    /// ]);
    /// seq.optimize_for_prefix_by_preference();
    /// assert!(!seq.is_finite());
    /// ```
    ///
    /// Do note that just because there is a `" "` in the sequence, that
    /// doesn't mean the sequence will always be made infinite after it is
    /// optimized. Namely, if the sequence is considered exact (any match
    /// corresponds to an overall match of the original regex), then any match
    /// is an overall match, and so the false positive rate is always `0`.
    ///
    /// To demonstrate this, we remove `samwise` from our sequence. This
    /// results in no optimization happening and all literals remain exact.
    /// Thus the entire sequence is exact, and it is kept as-is, even though
    /// one is an ASCII space:
    ///
    /// ```
    /// use regex_syntax::hir::literal::{Seq, Literal};
    ///
    /// let mut seq = Seq::new(&[
    ///     "samantha",
    ///     " ",
    ///     "sam",
    ///     "frodo",
    /// ]);
    /// seq.optimize_for_prefix_by_preference();
    /// assert!(seq.is_finite());
    /// ```
    #[inline]
    pub fn optimize_for_prefix_by_preference(&mut self) {
        self.optimize_by_preference(true);
    }

    /// Optimizes this seq while treating its literals as suffixes and
    /// respecting the preference order of its literals.
    ///
    /// Optimization should only be run once extraction is complete.
    ///
    /// [`Seq::optimize_for_prefix_by_preference`] does the same thing, but
    /// for prefixes. See its documentation for more explanation.
    #[inline]
    pub fn optimize_for_suffix_by_preference(&mut self) {
        self.optimize_by_preference(false);
    }
|---|
| 1834 |  | 
|---|
| 1835 | fn optimize_by_preference(&mut self, prefix: bool) { | 
|---|
| 1836 | let origlen = match self.len() { | 
|---|
| 1837 | None => return, | 
|---|
| 1838 | Some(len) => len, | 
|---|
| 1839 | }; | 
|---|
| 1840 | // Just give up now if our sequence contains an empty string. | 
|---|
| 1841 | if self.min_literal_len().map_or(false, |len| len == 0) { | 
|---|
| 1842 | // We squash the sequence so that nobody else gets any bright | 
|---|
| 1843 | // ideas to try and use it. An empty string implies a match at | 
|---|
| 1844 | // every position. A prefilter cannot help you here. | 
|---|
| 1845 | self.make_infinite(); | 
|---|
| 1846 | return; | 
|---|
| 1847 | } | 
|---|
| 1848 | // Make sure we start with the smallest sequence possible. We use a | 
|---|
| 1849 | // special version of preference minimization that retains exactness. | 
|---|
| 1850 | // This is legal because optimization is only expected to occur once | 
|---|
| 1851 | // extraction is complete. | 
|---|
| 1852 | if prefix { | 
|---|
| 1853 | if let Some(ref mut lits) = self.literals { | 
|---|
| 1854 | PreferenceTrie::minimize(lits, true); | 
|---|
| 1855 | } | 
|---|
| 1856 | } | 
|---|
| 1857 |  | 
|---|
| 1858 | // Look for a common prefix (or suffix). If we found one of those and | 
|---|
| 1859 | // it's long enough, then it's a good bet that it will be our fastest | 
|---|
| 1860 | // possible prefilter since single-substring search is so fast. | 
|---|
| 1861 | let fix = if prefix { | 
|---|
| 1862 | self.longest_common_prefix() | 
|---|
| 1863 | } else { | 
|---|
| 1864 | self.longest_common_suffix() | 
|---|
| 1865 | }; | 
|---|
| 1866 | if let Some(fix) = fix { | 
|---|
| 1867 | // As a special case, if we have a common prefix and the leading | 
|---|
| 1868 | // byte of that prefix is one that we think probably occurs rarely, | 
|---|
| 1869 | // then strip everything down to just that single byte. This should | 
|---|
| 1870 | // promote the use of memchr. | 
|---|
| 1871 | // | 
|---|
| 1872 | // ... we only do this though if our sequence has more than one | 
|---|
| 1873 | // literal. Otherwise, we'd rather just stick with a single literal | 
|---|
| 1874 | // scan. That is, using memchr is probably better than looking | 
|---|
| 1875 | // for 2 or more literals, but probably not as good as a straight | 
|---|
| 1876 | // memmem search. | 
|---|
| 1877 | // | 
|---|
| 1878 | // ... and also only do this when the prefix is short and probably | 
|---|
| 1879 | // not too discriminatory anyway. If it's longer, then it's | 
|---|
| 1880 | // probably quite discriminatory and thus is likely to have a low | 
|---|
| 1881 | // false positive rate. | 
|---|
| 1882 | if prefix | 
|---|
| 1883 | && origlen > 1 | 
|---|
| 1884 | && fix.len() >= 1 | 
|---|
| 1885 | && fix.len() <= 3 | 
|---|
| 1886 | && rank(fix[0]) < 200 | 
|---|
| 1887 | { | 
|---|
| 1888 | self.keep_first_bytes(1); | 
|---|
| 1889 | self.dedup(); | 
|---|
| 1890 | return; | 
|---|
| 1891 | } | 
|---|
| 1892 | // We only strip down to the common prefix/suffix if we think | 
|---|
| 1893 | // the existing set of literals isn't great, or if the common | 
|---|
| 1894 | // prefix/suffix is expected to be particularly discriminatory. | 
|---|
| 1895 | let isfast = | 
|---|
| 1896 | self.is_exact() && self.len().map_or(false, |len| len <= 16); | 
|---|
| 1897 | let usefix = fix.len() > 4 || (fix.len() > 1 && !isfast); | 
|---|
| 1898 | if usefix { | 
|---|
| 1899 | // If we keep exactly the number of bytes equal to the length | 
|---|
| 1900 | // of the prefix (or suffix), then by the definition of a | 
|---|
| 1901 | // prefix, every literal in the sequence will be equivalent. | 
|---|
| 1902 | // Thus, 'dedup' will leave us with one literal. | 
|---|
| 1903 | // | 
|---|
| 1904 | // We do it this way to avoid an alloc, but also to make sure | 
|---|
| 1905 | // the exactness of literals is kept (or not). | 
|---|
| 1906 | if prefix { | 
|---|
| 1907 | self.keep_first_bytes(fix.len()); | 
|---|
| 1908 | } else { | 
|---|
| 1909 | self.keep_last_bytes(fix.len()); | 
|---|
| 1910 | } | 
|---|
| 1911 | self.dedup(); | 
|---|
| 1912 | assert_eq!(Some(1), self.len()); | 
|---|
| 1913 | // We still fall through here. In particular, we want our | 
|---|
| 1914 | // longest common prefix to be subject to the poison check. | 
|---|
| 1915 | } | 
|---|
| 1916 | } | 
|---|
        // If we have an exact sequence, we *probably* just want to keep it
        // as-is. But there are some cases where we don't. So we save a copy
        // of the exact sequence now, and then try to do some more
        // optimizations below. If those don't work out, we go back to this
        // exact sequence.
        //
        // The specific motivation for this is that we sometimes wind up with
        // an exact sequence with a hefty number of literals. Say, 100. If we
        // stuck with that, it would be too big for Teddy and would result in
        // using Aho-Corasick. Which is fine... but the lazy DFA is plenty
        // suitable in such cases. The real issue is that we will wind up not
        // using a fast prefilter at all. So in cases like this, even though
        // we have an exact sequence, it would be better to try and shrink
        // the sequence (which we do below) and use it as a prefilter that
        // can produce false positive matches.
        //
        // But if the shrinking below results in a sequence that "sucks,"
        // then we don't want to use that because we already have an exact
        // sequence in hand.
        let exact: Option<Seq> =
            if self.is_exact() { Some(self.clone()) } else { None };
        // Now we attempt to shorten the sequence. The idea here is that we
        // don't want to look for too many literals, but we want to shorten
        // our sequence enough to improve our odds of using better algorithms
        // downstream (such as Teddy).
        //
        // The pair of numbers in this list corresponds to the maximal prefix
        // (in bytes) to keep for all literals and the length of the sequence
        // at which to do it.
        //
        // So for example, the pair (3, 500) would mean, "if we have more
        // than 500 literals in our sequence, then truncate all of our
        // literals such that they are at most 3 bytes in length and then
        // minimize the sequence."
        const ATTEMPTS: [(usize, usize); 5] =
            [(5, 10), (4, 10), (3, 64), (2, 64), (1, 10)];
        for (keep, limit) in ATTEMPTS {
            let len = match self.len() {
                None => break,
                Some(len) => len,
            };
            if len <= limit {
                break;
            }
            if prefix {
                self.keep_first_bytes(keep);
            } else {
                self.keep_last_bytes(keep);
            }
            if prefix {
                if let Some(ref mut lits) = self.literals {
                    PreferenceTrie::minimize(lits, true);
                }
            }
        }
        // Check for a poison literal. A poison literal is one that is short
        // and is believed to have a very high match count. These poisons
        // generally lead to a prefilter with a very high false positive
        // rate, and thus overall worse performance.
        //
        // We do this last because we could have gone from a non-poisonous
        // sequence to a poisonous one. Perhaps we should add some code to
        // prevent such transitions in the first place, but then again, we
        // likely only made the transition in the first place if the sequence
        // was itself huge. And huge sequences are themselves poisonous. So...
        if let Some(lits) = self.literals() {
            if lits.iter().any(|lit| lit.is_poisonous()) {
                self.make_infinite();
            }
        }
        // OK, if we had an exact sequence before attempting more
        // optimizations above and our post-optimized sequence sucks for some
        // reason or another, then we go back to the exact sequence.
        if let Some(exact) = exact {
            // If optimizing resulted in dropping our literals, then
            // certainly back up and use the exact sequence that we had.
            if !self.is_finite() {
                *self = exact;
                return;
            }
            // If our optimized sequence contains a short literal, then it's
            // *probably* not so great. So throw it away and revert to the
            // exact sequence.
            if self.min_literal_len().map_or(true, |len| len <= 2) {
                *self = exact;
                return;
            }
            // Finally, if our optimized sequence is "big" (i.e., can't use
            // Teddy), then also don't use it and rely on the exact sequence.
            if self.len().map_or(true, |len| len > 64) {
                *self = exact;
                return;
            }
        }
    }
}
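The truncate-then-minimize schedule above can be sketched in isolation. This is a simplified model, not the crate's implementation: a plain `Vec<Vec<u8>>` stands in for `Seq`, and sort-plus-dedup stands in for the preference trie (which, unlike this sketch, preserves preference order):

```rust
// The same (keep, limit) schedule as above: the bigger the sequence,
// the more aggressively literals are truncated.
const ATTEMPTS: [(usize, usize); 5] =
    [(5, 10), (4, 10), (3, 64), (2, 64), (1, 10)];

/// Shrinks `lits` by truncating every literal to `keep` bytes and
/// deduplicating, stopping as soon as the sequence is small enough.
fn shrink(lits: &mut Vec<Vec<u8>>) {
    for (keep, limit) in ATTEMPTS {
        if lits.len() <= limit {
            break;
        }
        for lit in lits.iter_mut() {
            lit.truncate(keep);
        }
        // NOTE: sorting discards preference order; the real code uses a
        // preference trie precisely to avoid this.
        lits.sort();
        lits.dedup();
    }
}

fn main() {
    // 100 distinct 3-byte literals: "a00" through "a99".
    let mut lits: Vec<Vec<u8>> =
        (0..100).map(|i| format!("a{:02}", i).into_bytes()).collect();
    shrink(&mut lits);
    // Truncating to 2 bytes collapses them to "a0".."a9".
    assert_eq!(10, lits.len());
    assert!(lits.iter().all(|lit| lit.len() == 2));
}
```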

impl core::fmt::Debug for Seq {
    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
        write!(f, "Seq")?;
        if let Some(lits) = self.literals() {
            f.debug_list().entries(lits.iter()).finish()
        } else {
            write!(f, "[∞]")
        }
    }
}

impl FromIterator<Literal> for Seq {
    fn from_iter<T: IntoIterator<Item = Literal>>(it: T) -> Seq {
        let mut seq = Seq::empty();
        for literal in it {
            seq.push(literal);
        }
        seq
    }
}

/// A single literal extracted from an [`Hir`] expression.
///
/// A literal is composed of two things:
///
/// * A sequence of bytes. No guarantees with respect to UTF-8 are provided.
/// In particular, even if the regex a literal is extracted from is UTF-8, the
/// literal extracted may not be valid UTF-8. (For example, if an [`Extractor`]
/// limit resulted in trimming a literal in a way that splits a codepoint.)
/// * Whether the literal is "exact" or not. An "exact" literal means that it
/// has not been trimmed, and may continue to be extended. If a literal is
/// "exact" after visiting the entire `Hir` expression, then this implies that
/// the literal leads to a match state. (Although it doesn't necessarily imply
/// all occurrences of the literal correspond to a match of the regex, since
/// literal extraction ignores look-around assertions.)
#[derive(Clone, Eq, PartialEq, PartialOrd, Ord)]
pub struct Literal {
    bytes: Vec<u8>,
    exact: bool,
}

impl Literal {
    /// Returns a new exact literal containing the bytes given.
    #[inline]
    pub fn exact<B: Into<Vec<u8>>>(bytes: B) -> Literal {
        Literal { bytes: bytes.into(), exact: true }
    }

    /// Returns a new inexact literal containing the bytes given.
    #[inline]
    pub fn inexact<B: Into<Vec<u8>>>(bytes: B) -> Literal {
        Literal { bytes: bytes.into(), exact: false }
    }

    /// Returns the bytes in this literal.
    #[inline]
    pub fn as_bytes(&self) -> &[u8] {
        &self.bytes
    }

    /// Yields ownership of the bytes inside this literal.
    ///
    /// Note that this throws away whether the literal is "exact" or not.
    #[inline]
    pub fn into_bytes(self) -> Vec<u8> {
        self.bytes
    }

    /// Returns the length of this literal in bytes.
    #[inline]
    pub fn len(&self) -> usize {
        self.as_bytes().len()
    }

    /// Returns true if and only if this literal has zero bytes.
    #[inline]
    pub fn is_empty(&self) -> bool {
        self.len() == 0
    }

    /// Returns true if and only if this literal is exact.
    #[inline]
    pub fn is_exact(&self) -> bool {
        self.exact
    }

    /// Marks this literal as inexact.
    ///
    /// Inexact literals can never be extended. For example,
    /// [`Seq::cross_forward`] will not extend inexact literals.
    #[inline]
    pub fn make_inexact(&mut self) {
        self.exact = false;
    }

    /// Reverse the bytes in this literal.
    #[inline]
    pub fn reverse(&mut self) {
        self.bytes.reverse();
    }

    /// Extend this literal with the literal given.
    ///
    /// If this literal is inexact, then this is a no-op.
    #[inline]
    pub fn extend(&mut self, lit: &Literal) {
        if !self.is_exact() {
            return;
        }
        self.bytes.extend_from_slice(&lit.bytes);
    }

    /// Trims this literal such that only the first `len` bytes remain. If
    /// this literal has fewer than `len` bytes, then it remains unchanged.
    /// Otherwise, the literal is marked as inexact.
    #[inline]
    pub fn keep_first_bytes(&mut self, len: usize) {
        if len >= self.len() {
            return;
        }
        self.make_inexact();
        self.bytes.truncate(len);
    }

    /// Trims this literal such that only the last `len` bytes remain. If
    /// this literal has fewer than `len` bytes, then it remains unchanged.
    /// Otherwise, the literal is marked as inexact.
    #[inline]
    pub fn keep_last_bytes(&mut self, len: usize) {
        if len >= self.len() {
            return;
        }
        self.make_inexact();
        self.bytes.drain(..self.len() - len);
    }

    /// Returns true if it is believed that this literal is likely to match
    /// very frequently, and is thus not a good candidate for a prefilter.
    fn is_poisonous(&self) -> bool {
        self.is_empty() || (self.len() == 1 && rank(self.as_bytes()[0]) >= 250)
    }
}
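The `is_poisonous` heuristic above can be illustrated in a standalone sketch. The `rank` table here is a hypothetical stand-in for the crate's frequency-derived `rank` function, invented purely for this example:

```rust
// Hypothetical stand-in for the frequency-derived `rank`: pretend space
// and 'e' are extremely common, and everything else middling.
fn rank(byte: u8) -> u8 {
    match byte {
        b' ' | b'e' => 255,
        _ => 100,
    }
}

// Mirrors the poison check: empty literals and single very-common bytes
// match far too often to make useful prefilter candidates.
fn is_poisonous(lit: &[u8]) -> bool {
    lit.is_empty() || (lit.len() == 1 && rank(lit[0]) >= 250)
}

fn main() {
    assert!(is_poisonous(b""));
    assert!(is_poisonous(b" "));
    // A rare single byte is fine.
    assert!(!is_poisonous(b"Z"));
    // Two bytes are never poisonous under this check, even if common.
    assert!(!is_poisonous(b"ee"));
}
```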

impl From<u8> for Literal {
    fn from(byte: u8) -> Literal {
        Literal::exact(vec![byte])
    }
}

impl From<char> for Literal {
    fn from(ch: char) -> Literal {
        use alloc::string::ToString;
        Literal::exact(ch.encode_utf8(&mut [0; 4]).to_string())
    }
}

impl AsRef<[u8]> for Literal {
    fn as_ref(&self) -> &[u8] {
        self.as_bytes()
    }
}

impl core::fmt::Debug for Literal {
    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
        let tag = if self.exact { "E" } else { "I" };
        f.debug_tuple(tag)
            .field(&crate::debug::Bytes(self.as_bytes()))
            .finish()
    }
}

/// A "preference" trie that rejects literals that will never match when
/// executing a leftmost first or "preference" search.
///
/// For example, if 'sam' is inserted, then trying to insert 'samwise' will
/// be rejected because 'samwise' can never match since 'sam' will always
/// take priority. However, if 'samwise' is inserted first, then inserting
/// 'sam' after it is accepted. In this case, either 'samwise' or 'sam' can
/// match in a "preference" search.
///
/// Note that we only use this trie as a "set." That is, given a sequence of
/// literals, we insert each one in order. An `insert` will reject a literal
/// if a prefix of that literal already exists in the trie. Thus, to rebuild
/// the "minimal" sequence, we simply only keep literals that were
/// successfully inserted. (Since we don't need traversal, one wonders
/// whether we can make some simplifications here, but I haven't given it a
/// ton of thought and I've never seen this show up on a profile. Because of
/// the heuristic limits imposed on literal extractions, the size of the
/// inputs here is usually very small.)
#[derive(Debug)]
struct PreferenceTrie {
    /// The states in this trie. The index of a state in this vector is its
    /// ID.
    states: Vec<State>,
    /// This vec indicates which states are match states. It always has
    /// the same length as `states` and is indexed by the same state ID.
    /// A state with identifier `sid` is a match state if and only if
    /// `matches[sid].is_some()`. The option contains the index of the
    /// literal corresponding to the match. The index is offset by 1 so that
    /// it fits in a NonZeroUsize.
    matches: Vec<Option<NonZeroUsize>>,
    /// The index to allocate to the next literal added to this trie. Starts
    /// at 1 and increments by 1 for every literal successfully added to the
    /// trie.
    next_literal_index: usize,
}

/// A single state in a trie. Uses a sparse representation for its
/// transitions.
#[derive(Debug, Default)]
struct State {
    /// Sparse representation of the transitions out of this state.
    /// Transitions are sorted by byte. There is at most one such transition
    /// for any particular byte.
    trans: Vec<(u8, usize)>,
}

impl PreferenceTrie {
    /// Minimizes the given sequence of literals while preserving preference
    /// order semantics.
    ///
    /// When `keep_exact` is true, the exactness of every literal retained
    /// is kept. This is useful when dealing with a fully extracted `Seq`
    /// that only contains exact literals. In that case, we can keep all
    /// retained literals as exact because we know we'll never need to match
    /// anything after them and because any removed literals are guaranteed
    /// to never match.
    fn minimize(literals: &mut Vec<Literal>, keep_exact: bool) {
        let mut trie = PreferenceTrie {
            states: vec![],
            matches: vec![],
            next_literal_index: 1,
        };
        let mut make_inexact = vec![];
        literals.retain_mut(|lit| match trie.insert(lit.as_bytes()) {
            Ok(_) => true,
            Err(i) => {
                if !keep_exact {
                    make_inexact.push(i.checked_sub(1).unwrap());
                }
                false
            }
        });
        for i in make_inexact {
            literals[i].make_inexact();
        }
    }

    /// Returns `Ok` if the given byte string is accepted into this trie
    /// and `Err` otherwise. The index for the success case corresponds to
    /// the index of the literal added. The index for the error case
    /// corresponds to the index of the literal already in the trie that
    /// prevented the given byte string from being added. (Which implies it
    /// is a prefix of the one given.)
    ///
    /// In short, the byte string given is accepted into the trie if and
    /// only if it is possible for it to match when executing a preference
    /// order search.
    fn insert(&mut self, bytes: &[u8]) -> Result<usize, usize> {
        let mut prev = self.root();
        if let Some(idx) = self.matches[prev] {
            return Err(idx.get());
        }
        for &b in bytes.iter() {
            match self.states[prev].trans.binary_search_by_key(&b, |t| t.0) {
                Ok(i) => {
                    prev = self.states[prev].trans[i].1;
                    if let Some(idx) = self.matches[prev] {
                        return Err(idx.get());
                    }
                }
                Err(i) => {
                    let next = self.create_state();
                    self.states[prev].trans.insert(i, (b, next));
                    prev = next;
                }
            }
        }
        let idx = self.next_literal_index;
        self.next_literal_index += 1;
        self.matches[prev] = NonZeroUsize::new(idx);
        Ok(idx)
    }

    /// Returns the root state ID, and if it doesn't exist, creates it.
    fn root(&mut self) -> usize {
        if !self.states.is_empty() {
            0
        } else {
            self.create_state()
        }
    }

    /// Creates a new empty state and returns its ID.
    fn create_state(&mut self) -> usize {
        let id = self.states.len();
        self.states.push(State::default());
        self.matches.push(None);
        id
    }
}
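The rejection rule the trie implements can be restated without a trie at all. A quadratic sketch (fine for the small inputs noted above): keep a literal only if no earlier kept literal is a prefix of it, since that earlier literal would always win a leftmost-first search.

```rust
/// Keeps only the literals that can match under leftmost-first
/// ("preference") semantics: a literal is dropped when some earlier kept
/// literal is a prefix of it (including an equal literal).
fn minimize_preference(lits: &[&str]) -> Vec<String> {
    let mut kept: Vec<String> = Vec::new();
    for lit in lits {
        if !kept.iter().any(|k| lit.starts_with(k.as_str())) {
            kept.push((*lit).to_string());
        }
    }
    kept
}

fn main() {
    // "samwise" is dropped: "sam" comes first and always wins.
    assert_eq!(vec!["sam"], minimize_preference(&["sam", "samwise"]));
    // Reversed insertion order: both survive, either can match.
    assert_eq!(
        vec!["samwise", "sam"],
        minimize_preference(&["samwise", "sam"])
    );
}
```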
| 2311 |  | 
|---|
| 2312 | /// Returns the "rank" of the given byte. | 
|---|
| 2313 | /// | 
|---|
| 2314 | /// The minimum rank value is `0` and the maximum rank value is `255`. | 
|---|
| 2315 | /// | 
|---|
| 2316 | /// The rank of a byte is derived from a heuristic background distribution of | 
|---|
| 2317 | /// relative frequencies of bytes. The heuristic says that lower the rank of a | 
|---|
| 2318 | /// byte, the less likely that byte is to appear in any arbitrary haystack. | 
|---|
| 2319 | pub fn rank(byte: u8) -> u8 { | 
|---|
| 2320 | crate::rank::BYTE_FREQUENCIES[usize::from(byte)] | 
|---|
| 2321 | } | 
|---|
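One typical consumer of a rank function is picking the most discriminating byte of a literal for a prefilter. A sketch with a hypothetical rank table (the real one lives in `crate::rank::BYTE_FREQUENCIES`):

```rust
// Hypothetical rank table: common character classes get high ranks,
// everything else is treated as rare.
fn rank(byte: u8) -> u8 {
    match byte {
        b' ' => 255,
        b'a'..=b'z' => 200,
        b'A'..=b'Z' => 150,
        b'0'..=b'9' => 100,
        _ => 10,
    }
}

/// Returns the offset of the lowest-ranked (i.e., rarest) byte in `lit`,
/// a common way to choose which byte a prefilter scans for.
fn rarest_offset(lit: &[u8]) -> Option<usize> {
    lit.iter()
        .enumerate()
        .min_by_key(|&(_, &b)| rank(b))
        .map(|(i, _)| i)
}

fn main() {
    // 'Z' (rank 150) is rarer than the lowercase letters (rank 200).
    assert_eq!(Some(3), rarest_offset(b"fooZbar"));
    assert_eq!(None, rarest_offset(b""));
}
```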

#[cfg(test)]
mod tests {
    use super::*;

    fn parse(pattern: &str) -> Hir {
        crate::ParserBuilder::new().utf8(false).build().parse(pattern).unwrap()
    }

    fn prefixes(pattern: &str) -> Seq {
        Extractor::new().kind(ExtractKind::Prefix).extract(&parse(pattern))
    }

    fn suffixes(pattern: &str) -> Seq {
        Extractor::new().kind(ExtractKind::Suffix).extract(&parse(pattern))
    }

    fn e(pattern: &str) -> (Seq, Seq) {
        (prefixes(pattern), suffixes(pattern))
    }

    #[allow(non_snake_case)]
    fn E(x: &str) -> Literal {
        Literal::exact(x.as_bytes())
    }

    #[allow(non_snake_case)]
    fn I(x: &str) -> Literal {
        Literal::inexact(x.as_bytes())
    }

    fn seq<I: IntoIterator<Item = Literal>>(it: I) -> Seq {
        Seq::from_iter(it)
    }

    fn infinite() -> (Seq, Seq) {
        (Seq::infinite(), Seq::infinite())
    }

    fn inexact<I1, I2>(it1: I1, it2: I2) -> (Seq, Seq)
    where
        I1: IntoIterator<Item = Literal>,
        I2: IntoIterator<Item = Literal>,
    {
        (Seq::from_iter(it1), Seq::from_iter(it2))
    }

    fn exact<B: AsRef<[u8]>, I: IntoIterator<Item = B>>(it: I) -> (Seq, Seq) {
        let s1 = Seq::new(it);
        let s2 = s1.clone();
        (s1, s2)
    }

    fn opt<B: AsRef<[u8]>, I: IntoIterator<Item = B>>(it: I) -> (Seq, Seq) {
        let (mut p, mut s) = exact(it);
        p.optimize_for_prefix_by_preference();
        s.optimize_for_suffix_by_preference();
        (p, s)
    }

    #[test]
    fn literal() {
        assert_eq!(exact(["a"]), e("a"));
        assert_eq!(exact(["aaaaa"]), e("aaaaa"));
        assert_eq!(exact(["A", "a"]), e("(?i-u)a"));
        assert_eq!(exact(["AB", "Ab", "aB", "ab"]), e("(?i-u)ab"));
        assert_eq!(exact(["abC", "abc"]), e("ab(?i-u)c"));

        assert_eq!(exact([b"\xFF"]), e(r"(?-u:\xFF)"));

        #[cfg(feature = "unicode-case")]
        {
            assert_eq!(exact(["☃"]), e("☃"));
            assert_eq!(exact(["☃"]), e("(?i)☃"));
            assert_eq!(exact(["☃☃☃☃☃"]), e("☃☃☃☃☃"));

            assert_eq!(exact(["Δ"]), e("Δ"));
            assert_eq!(exact(["δ"]), e("δ"));
            assert_eq!(exact(["Δ", "δ"]), e("(?i)Δ"));
            assert_eq!(exact(["Δ", "δ"]), e("(?i)δ"));

            assert_eq!(exact(["S", "s", "ſ"]), e("(?i)S"));
            assert_eq!(exact(["S", "s", "ſ"]), e("(?i)s"));
            assert_eq!(exact(["S", "s", "ſ"]), e("(?i)ſ"));
        }

        let letters = "ͱͳͷΐάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋ";
        assert_eq!(exact([letters]), e(letters));
    }

    #[test]
    fn class() {
        assert_eq!(exact(["a", "b", "c"]), e("[abc]"));
        assert_eq!(exact(["a1b", "a2b", "a3b"]), e("a[123]b"));
        assert_eq!(exact(["δ", "ε"]), e("[εδ]"));
        #[cfg(feature = "unicode-case")]
        {
            assert_eq!(exact(["Δ", "Ε", "δ", "ε", "ϵ"]), e(r"(?i)[εδ]"));
        }
    }

    #[test]
    fn look() {
        assert_eq!(exact(["ab"]), e(r"a\Ab"));
        assert_eq!(exact(["ab"]), e(r"a\zb"));
        assert_eq!(exact(["ab"]), e(r"a(?m:^)b"));
        assert_eq!(exact(["ab"]), e(r"a(?m:$)b"));
        assert_eq!(exact(["ab"]), e(r"a\bb"));
        assert_eq!(exact(["ab"]), e(r"a\Bb"));
        assert_eq!(exact(["ab"]), e(r"a(?-u:\b)b"));
        assert_eq!(exact(["ab"]), e(r"a(?-u:\B)b"));

        assert_eq!(exact(["ab"]), e(r"^ab"));
        assert_eq!(exact(["ab"]), e(r"$ab"));
        assert_eq!(exact(["ab"]), e(r"(?m:^)ab"));
        assert_eq!(exact(["ab"]), e(r"(?m:$)ab"));
        assert_eq!(exact(["ab"]), e(r"\bab"));
        assert_eq!(exact(["ab"]), e(r"\Bab"));
        assert_eq!(exact(["ab"]), e(r"(?-u:\b)ab"));
        assert_eq!(exact(["ab"]), e(r"(?-u:\B)ab"));

        assert_eq!(exact(["ab"]), e(r"ab^"));
        assert_eq!(exact(["ab"]), e(r"ab$"));
        assert_eq!(exact(["ab"]), e(r"ab(?m:^)"));
        assert_eq!(exact(["ab"]), e(r"ab(?m:$)"));
        assert_eq!(exact(["ab"]), e(r"ab\b"));
        assert_eq!(exact(["ab"]), e(r"ab\B"));
        assert_eq!(exact(["ab"]), e(r"ab(?-u:\b)"));
        assert_eq!(exact(["ab"]), e(r"ab(?-u:\B)"));

        let expected = (seq([I("aZ"), E("ab")]), seq([I("Zb"), E("ab")]));
        assert_eq!(expected, e(r"^aZ*b"));
    }

    #[test]
    fn repetition() {
        assert_eq!(exact(["a", ""]), e(r"a?"));
        assert_eq!(exact(["", "a"]), e(r"a??"));
        assert_eq!(inexact([I("a"), E("")], [I("a"), E("")]), e(r"a*"));
        assert_eq!(inexact([E(""), I("a")], [E(""), I("a")]), e(r"a*?"));
        assert_eq!(inexact([I("a")], [I("a")]), e(r"a+"));
        assert_eq!(inexact([I("a")], [I("a")]), e(r"(a+)+"));

        assert_eq!(exact(["ab"]), e(r"aZ{0}b"));
        assert_eq!(exact(["aZb", "ab"]), e(r"aZ?b"));
        assert_eq!(exact(["ab", "aZb"]), e(r"aZ??b"));
        assert_eq!(
            inexact([I("aZ"), E("ab")], [I("Zb"), E("ab")]),
            e(r"aZ*b")
        );
        assert_eq!(
            inexact([E("ab"), I("aZ")], [E("ab"), I("Zb")]),
            e(r"aZ*?b")
        );
        assert_eq!(inexact([I("aZ")], [I("Zb")]), e(r"aZ+b"));
        assert_eq!(inexact([I("aZ")], [I("Zb")]), e(r"aZ+?b"));

        assert_eq!(exact(["aZZb"]), e(r"aZ{2}b"));
        assert_eq!(inexact([I("aZZ")], [I("ZZb")]), e(r"aZ{2,3}b"));

        assert_eq!(exact(["abc", ""]), e(r"(abc)?"));
        assert_eq!(exact(["", "abc"]), e(r"(abc)??"));

        assert_eq!(inexact([I("a"), E("b")], [I("ab"), E("b")]), e(r"a*b"));
        assert_eq!(inexact([E("b"), I("a")], [E("b"), I("ab")]), e(r"a*?b"));
        assert_eq!(inexact([I("ab")], [I("b")]), e(r"ab+"));
        assert_eq!(inexact([I("a"), I("b")], [I("b")]), e(r"a*b+"));

        // FIXME: The suffixes for this don't look quite right to me. I
        // think the right suffixes would be: [I(ac), I(bc), E(c)]. The main
        // issue I think is that suffixes are computed by iterating over
        // concatenations in reverse, and then [bc, ac, c] ordering is
        // indeed correct from that perspective. We also test a few more
        // equivalent regexes, and we get the same result, so it is
        // consistent at least, I suppose.
        //
        // The reason why this isn't an issue is that it only messes up
        // preference order, and currently, suffixes are never used in a
        // context where preference order matters. For prefixes it matters
        // because we sometimes want to use prefilters without confirmation
        // when all of the literals are exact (and there's no look-around).
        // But we never do that for suffixes. Any time we use suffixes, we
        // always include a confirmation step. If that ever changes, then
        // it's likely this bug will need to be fixed, but last time I
        // looked, it appears hard to do so.
        assert_eq!(
            inexact([I("a"), I("b"), E("c")], [I("bc"), I("ac"), E("c")]),
            e(r"a*b*c")
        );
        assert_eq!(
            inexact([I("a"), I("b"), E("c")], [I("bc"), I("ac"), E("c")]),
            e(r"(a+)?(b+)?c")
        );
        assert_eq!(
            inexact([I("a"), I("b"), E("c")], [I("bc"), I("ac"), E("c")]),
            e(r"(a+|)(b+|)c")
        );
        // A few more similarish but not identical regexes. These may have a
        // similar problem as above.
        assert_eq!(
            inexact(
                [I("a"), I("b"), I("c"), E("")],
                [I("c"), I("b"), I("a"), E("")]
            ),
            e(r"a*b*c*")
        );
        assert_eq!(inexact([I("a"), I("b"), I("c")], [I("c")]), e(r"a*b*c+"));
        assert_eq!(inexact([I("a"), I("b")], [I("bc")]), e(r"a*b+c"));
        assert_eq!(inexact([I("a"), I("b")], [I("c"), I("b")]), e(r"a*b+c*"));
        assert_eq!(inexact([I("ab"), E("a")], [I("b"), E("a")]), e(r"ab*"));
        assert_eq!(
            inexact([I("ab"), E("ac")], [I("bc"), E("ac")]),
            e(r"ab*c")
        );
        assert_eq!(inexact([I("ab")], [I("b")]), e(r"ab+"));
        assert_eq!(inexact([I("ab")], [I("bc")]), e(r"ab+c"));

        assert_eq!(
            inexact([I("z"), E("azb")], [I("zazb"), E("azb")]),
            e(r"z*azb")
        );

        let expected =
            exact(["aaa", "aab", "aba", "abb", "baa", "bab", "bba", "bbb"]);
        assert_eq!(expected, e(r"[ab]{3}"));
        let expected = inexact(
            [
                I("aaa"),
                I("aab"),
                I("aba"),
                I("abb"),
                I("baa"),
                I("bab"),
                I("bba"),
                I("bbb"),
            ],
            [
                I("aaa"),
                I("aab"),
                I("aba"),
                I("abb"),
                I("baa"),
                I("bab"),
                I("bba"),
                I("bbb"),
            ],
        );
        assert_eq!(expected, e(r"[ab]{3,4}"));
    }
| 2570 |  | 
|---|
| 2571 | #[ test] | 
|---|
| 2572 | fn concat() { | 
|---|
| 2573 | let empty: [&str; 0] = []; | 
|---|
| 2574 |  | 
|---|
| 2575 | assert_eq!(exact([ "abcxyz"]), e( r"abc()xyz")); | 
|---|
| 2576 | assert_eq!(exact([ "abcxyz"]), e( r"(abc)(xyz)")); | 
|---|
| 2577 | assert_eq!(exact([ "abcmnoxyz"]), e( r"abc()mno()xyz")); | 
|---|
| 2578 | assert_eq!(exact(empty), e( r"abc[a&&b]xyz")); | 
|---|
| 2579 | assert_eq!(exact([ "abcxyz"]), e( r"abc[a&&b]*xyz")); | 
|---|
| 2580 | } | 

    #[test]
    fn alternation() {
        assert_eq!(exact(["abc", "mno", "xyz"]), e(r"abc|mno|xyz"));
        assert_eq!(
            inexact(
                [E("abc"), I("mZ"), E("mo"), E("xyz")],
                [E("abc"), I("Zo"), E("mo"), E("xyz")]
            ),
            e(r"abc|mZ*o|xyz")
        );
        assert_eq!(exact(["abc", "xyz"]), e(r"abc|M[a&&b]N|xyz"));
        assert_eq!(exact(["abc", "MN", "xyz"]), e(r"abc|M[a&&b]*N|xyz"));

        assert_eq!(exact(["aaa", "aaaaa"]), e(r"(?:|aa)aaa"));
        assert_eq!(
            inexact(
                [I("aaa"), E(""), I("aaaaa"), E("aa")],
                [I("aaa"), E(""), E("aa")]
            ),
            e(r"(?:|aa)(?:aaa)*")
        );
        assert_eq!(
            inexact(
                [E(""), I("aaa"), E("aa"), I("aaaaa")],
                [E(""), I("aaa"), E("aa")]
            ),
            e(r"(?:|aa)(?:aaa)*?")
        );

        assert_eq!(
            inexact([E("a"), I("b"), E("")], [E("a"), I("b"), E("")]),
            e(r"a|b*")
        );
        assert_eq!(inexact([E("a"), I("b")], [E("a"), I("b")]), e(r"a|b+"));

        assert_eq!(
            inexact([I("a"), E("b"), E("c")], [I("ab"), E("b"), E("c")]),
            e(r"a*b|c")
        );

        assert_eq!(
            inexact(
                [E("a"), E("b"), I("c"), E("")],
                [E("a"), E("b"), I("c"), E("")]
            ),
            e(r"a|(?:b|c*)")
        );

        assert_eq!(
            inexact(
                [I("a"), I("b"), E("c"), I("a"), I("ab"), E("c")],
                [I("ac"), I("bc"), E("c"), I("ac"), I("abc"), E("c")],
            ),
            e(r"(a|b)*c|(a|ab)*c")
        );

        assert_eq!(
            exact(["abef", "abgh", "cdef", "cdgh"]),
            e(r"(ab|cd)(ef|gh)")
        );
        assert_eq!(
            exact([
                "abefij", "abefkl", "abghij", "abghkl", "cdefij", "cdefkl",
                "cdghij", "cdghkl",
            ]),
            e(r"(ab|cd)(ef|gh)(ij|kl)")
        );

        assert_eq!(inexact([E("abab")], [E("abab")]), e(r"(ab){2}"));

        assert_eq!(inexact([I("abab")], [I("abab")]), e(r"(ab){2,3}"));

        assert_eq!(inexact([I("abab")], [I("abab")]), e(r"(ab){2,}"));
    }

    #[test]
    fn impossible() {
        let empty: [&str; 0] = [];

        assert_eq!(exact(empty), e(r"[a&&b]"));
        assert_eq!(exact(empty), e(r"a[a&&b]"));
        assert_eq!(exact(empty), e(r"[a&&b]b"));
        assert_eq!(exact(empty), e(r"a[a&&b]b"));
        assert_eq!(exact(["a", "b"]), e(r"a|[a&&b]|b"));
        assert_eq!(exact(["a", "b"]), e(r"a|c[a&&b]|b"));
        assert_eq!(exact(["a", "b"]), e(r"a|[a&&b]d|b"));
        assert_eq!(exact(["a", "b"]), e(r"a|c[a&&b]d|b"));
        assert_eq!(exact([""]), e(r"[a&&b]*"));
        assert_eq!(exact(["MN"]), e(r"M[a&&b]*N"));
    }

    // This tests patterns that contain something that defeats literal
    // detection, usually because it would blow some limit on the total number
    // of literals that can be returned.
    //
    // The main idea is that when literal extraction sees something that
    // it knows will blow a limit, it replaces it with a marker that says
    // "any literal will match here." While not necessarily true, the
    // over-estimation is just fine for the purposes of literal extraction,
    // because the imprecision doesn't matter: too big is too big.
    //
    // This is one of the trickier parts of literal extraction, since we need
    // to make sure all of our literal extraction operations correctly compose
    // with the markers.
    #[test]
    fn anything() {
        assert_eq!(infinite(), e(r"."));
        assert_eq!(infinite(), e(r"(?s)."));
        assert_eq!(infinite(), e(r"[A-Za-z]"));
        assert_eq!(infinite(), e(r"[A-Z]"));
        assert_eq!(exact([""]), e(r"[A-Z]{0}"));
        assert_eq!(infinite(), e(r"[A-Z]?"));
        assert_eq!(infinite(), e(r"[A-Z]*"));
        assert_eq!(infinite(), e(r"[A-Z]+"));
        assert_eq!((seq([I("1")]), Seq::infinite()), e(r"1[A-Z]"));
        assert_eq!((seq([I("1")]), seq([I("2")])), e(r"1[A-Z]2"));
        assert_eq!((Seq::infinite(), seq([I("123")])), e(r"[A-Z]+123"));
        assert_eq!(infinite(), e(r"[A-Z]+123[A-Z]+"));
        assert_eq!(infinite(), e(r"1|[A-Z]|3"));
        assert_eq!(
            (seq([E("1"), I("2"), E("3")]), Seq::infinite()),
            e(r"1|2[A-Z]|3"),
        );
        assert_eq!(
            (Seq::infinite(), seq([E("1"), I("2"), E("3")])),
            e(r"1|[A-Z]2|3"),
        );
        assert_eq!(
            (seq([E("1"), I("2"), E("4")]), seq([E("1"), I("3"), E("4")])),
            e(r"1|2[A-Z]3|4"),
        );
        assert_eq!((Seq::infinite(), seq([I("2")])), e(r"(?:|1)[A-Z]2"));
        assert_eq!(inexact([I("a")], [I("z")]), e(r"a.z"));
    }

    // Like the 'anything' test, but it uses smaller limits in order to test
    // the logic for effectively aborting literal extraction when the seqs get
    // too big.
    #[test]
    fn anything_small_limits() {
        fn prefixes(pattern: &str) -> Seq {
            Extractor::new()
                .kind(ExtractKind::Prefix)
                .limit_total(10)
                .extract(&parse(pattern))
        }

        fn suffixes(pattern: &str) -> Seq {
            Extractor::new()
                .kind(ExtractKind::Suffix)
                .limit_total(10)
                .extract(&parse(pattern))
        }

        fn e(pattern: &str) -> (Seq, Seq) {
            (prefixes(pattern), suffixes(pattern))
        }

        assert_eq!(
            (
                seq([
                    I("aaa"),
                    I("aab"),
                    I("aba"),
                    I("abb"),
                    I("baa"),
                    I("bab"),
                    I("bba"),
                    I("bbb")
                ]),
                seq([
                    I("aaa"),
                    I("aab"),
                    I("aba"),
                    I("abb"),
                    I("baa"),
                    I("bab"),
                    I("bba"),
                    I("bbb")
                ])
            ),
            e(r"[ab]{3}{3}")
        );

        assert_eq!(infinite(), e(r"ab|cd|ef|gh|ij|kl|mn|op|qr|st|uv|wx|yz"));
    }

    #[test]
    fn empty() {
        assert_eq!(exact([""]), e(r""));
        assert_eq!(exact([""]), e(r"^"));
        assert_eq!(exact([""]), e(r"$"));
        assert_eq!(exact([""]), e(r"(?m:^)"));
        assert_eq!(exact([""]), e(r"(?m:$)"));
        assert_eq!(exact([""]), e(r"\b"));
        assert_eq!(exact([""]), e(r"\B"));
        assert_eq!(exact([""]), e(r"(?-u:\b)"));
        assert_eq!(exact([""]), e(r"(?-u:\B)"));
    }

    #[test]
    fn odds_and_ends() {
        assert_eq!((Seq::infinite(), seq([I("a")])), e(r".a"));
        assert_eq!((seq([I("a")]), Seq::infinite()), e(r"a."));
        assert_eq!(infinite(), e(r"a|."));
        assert_eq!(infinite(), e(r".|a"));

        let pat =
            r"M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]";
        let expected = inexact(
            ["Mo'am", "Moam", "Mu'am", "Muam"].map(I),
            [
                "ddafi", "ddafy", "dhafi", "dhafy", "dzafi", "dzafy", "dafi",
                "dafy", "tdafi", "tdafy", "thafi", "thafy", "tzafi", "tzafy",
                "tafi", "tafy", "zdafi", "zdafy", "zhafi", "zhafy", "zzafi",
                "zzafy", "zafi", "zafy",
            ]
            .map(I),
        );
        assert_eq!(expected, e(pat));

        assert_eq!(
            (seq(["fn is_", "fn as_"].map(I)), Seq::infinite()),
            e(r"fn is_([A-Z]+)|fn as_([A-Z]+)"),
        );
        assert_eq!(
            inexact([I("foo")], [I("quux")]),
            e(r"foo[A-Z]+bar[A-Z]+quux")
        );
        assert_eq!(infinite(), e(r"[A-Z]+bar[A-Z]+"));
        assert_eq!(
            exact(["Sherlock Holmes"]),
            e(r"(?m)^Sherlock Holmes|Sherlock Holmes$")
        );

        assert_eq!(exact(["sa", "sb"]), e(r"\bs(?:[ab])"));
    }

    // This tests a specific regex along with some heuristic steps to reduce
    // the sequences extracted. This is meant to roughly correspond to the
    // types of heuristics used to shrink literal sets in practice. (Shrinking
    // is done because you want to balance "spend too much work looking for
    // too many literals" and "spend too much work processing false positive
    // matches from short literals.")
    #[test]
    #[cfg(feature = "unicode-case")]
    fn holmes() {
        let expected = inexact(
            ["HOL", "HOl", "HoL", "Hol", "hOL", "hOl", "hoL", "hol"].map(I),
            [
                "MES", "MEs", "Eſ", "MeS", "Mes", "eſ", "mES", "mEs", "meS",
                "mes",
            ]
            .map(I),
        );
        let (mut prefixes, mut suffixes) = e(r"(?i)Holmes");
        prefixes.keep_first_bytes(3);
        suffixes.keep_last_bytes(3);
        prefixes.minimize_by_preference();
        suffixes.minimize_by_preference();
        assert_eq!(expected, (prefixes, suffixes));
    }

    // This tests that we get some kind of literals extracted for a beefier
    // alternation with case insensitive mode enabled. At one point during
    // development, this returned nothing, and motivated some special case
    // code in Extractor::union to try and trim down the literal sequences
    // if the union would blow the limits set.
    #[test]
    #[cfg(feature = "unicode-case")]
    fn holmes_alt() {
        let mut pre =
            prefixes(r"(?i)Sherlock|Holmes|Watson|Irene|Adler|John|Baker");
        assert!(pre.len().unwrap() > 0);
        pre.optimize_for_prefix_by_preference();
        assert!(pre.len().unwrap() > 0);
    }

    // See: https://github.com/rust-lang/regex/security/advisories/GHSA-m5pq-gvj9-9vr8
    // See: CVE-2022-24713
    //
    // We test this here to ensure literal extraction completes in reasonable
    // time and isn't materially impacted by these sorts of pathological
    // repeats.
    #[test]
    fn crazy_repeats() {
        assert_eq!(inexact([E("")], [E("")]), e(r"(?:){4294967295}"));
        assert_eq!(
            inexact([E("")], [E("")]),
            e(r"(?:){64}{64}{64}{64}{64}{64}")
        );
        assert_eq!(inexact([E("")], [E("")]), e(r"x{0}{4294967295}"));
        assert_eq!(inexact([E("")], [E("")]), e(r"(?:|){4294967295}"));

        assert_eq!(
            inexact([E("")], [E("")]),
            e(r"(?:){8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}")
        );
        let repa = "a".repeat(100);
        assert_eq!(
            inexact([I(&repa)], [I(&repa)]),
            e(r"a{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}")
        );
    }

    #[test]
    fn huge() {
        let pat = r#"(?-u)
        2(?:
          [45]\d{3}|
          7(?:
            1[0-267]|
            2[0-289]|
            3[0-29]|
            4[01]|
            5[1-3]|
            6[013]|
            7[0178]|
            91
          )|
          8(?:
            0[125]|
            [139][1-6]|
            2[0157-9]|
            41|
            6[1-35]|
            7[1-5]|
            8[1-8]|
            90
          )|
          9(?:
            0[0-2]|
            1[0-4]|
            2[568]|
            3[3-6]|
            5[5-7]|
            6[0167]|
            7[15]|
            8[0146-9]
          )
        )\d{4}|
        3(?:
          12?[5-7]\d{2}|
          0(?:
            2(?:
              [025-79]\d|
              [348]\d{1,2}
            )|
            3(?:
              [2-4]\d|
              [56]\d?
            )
          )|
          2(?:
            1\d{2}|
            2(?:
              [12]\d|
              [35]\d{1,2}|
              4\d?
            )
          )|
          3(?:
            1\d{2}|
            2(?:
              [2356]\d|
              4\d{1,2}
            )
          )|
          4(?:
            1\d{2}|
            2(?:
              2\d{1,2}|
              [47]|
              5\d{2}
            )
          )|
          5(?:
            1\d{2}|
            29
          )|
          [67]1\d{2}|
          8(?:
            1\d{2}|
            2(?:
              2\d{2}|
              3|
              4\d
            )
          )
        )\d{3}|
        4(?:
          0(?:
            2(?:
              [09]\d|
              7
            )|
            33\d{2}
          )|
          1\d{3}|
          2(?:
            1\d{2}|
            2(?:
              [25]\d?|
              [348]\d|
              [67]\d{1,2}
            )
          )|
          3(?:
            1\d{2}(?:
              \d{2}
            )?|
            2(?:
              [045]\d|
              [236-9]\d{1,2}
            )|
            32\d{2}
          )|
          4(?:
            [18]\d{2}|
            2(?:
              [2-46]\d{2}|
              3
            )|
            5[25]\d{2}
          )|
          5(?:
            1\d{2}|
            2(?:
              3\d|
              5
            )
          )|
          6(?:
            [18]\d{2}|
            2(?:
              3(?:
                \d{2}
              )?|
              [46]\d{1,2}|
              5\d{2}|
              7\d
            )|
            5(?:
              3\d?|
              4\d|
              [57]\d{1,2}|
              6\d{2}|
              8
            )
          )|
          71\d{2}|
          8(?:
            [18]\d{2}|
            23\d{2}|
            54\d{2}
          )|
          9(?:
            [18]\d{2}|
            2[2-5]\d{2}|
            53\d{1,2}
          )
        )\d{3}|
        5(?:
          02[03489]\d{2}|
          1\d{2}|
          2(?:
            1\d{2}|
            2(?:
              2(?:
                \d{2}
              )?|
              [457]\d{2}
            )
          )|
          3(?:
            1\d{2}|
            2(?:
              [37](?:
                \d{2}
              )?|
              [569]\d{2}
            )
          )|
          4(?:
            1\d{2}|
            2[46]\d{2}
          )|
          5(?:
            1\d{2}|
            26\d{1,2}
          )|
          6(?:
            [18]\d{2}|
            2|
            53\d{2}
          )|
          7(?:
            1|
            24
          )\d{2}|
          8(?:
            1|
            26
          )\d{2}|
          91\d{2}
        )\d{3}|
        6(?:
          0(?:
            1\d{2}|
            2(?:
              3\d{2}|
              4\d{1,2}
            )
          )|
          2(?:
            2[2-5]\d{2}|
            5(?:
              [3-5]\d{2}|
              7
            )|
            8\d{2}
          )|
          3(?:
            1|
            2[3478]
          )\d{2}|
          4(?:
            1|
            2[34]
          )\d{2}|
          5(?:
            1|
            2[47]
          )\d{2}|
          6(?:
            [18]\d{2}|
            6(?:
              2(?:
                2\d|
                [34]\d{2}
              )|
              5(?:
                [24]\d{2}|
                3\d|
                5\d{1,2}
              )
            )
          )|
          72[2-5]\d{2}|
          8(?:
            1\d{2}|
            2[2-5]\d{2}
          )|
          9(?:
            1\d{2}|
            2[2-6]\d{2}
          )
        )\d{3}|
        7(?:
          (?:
            02|
            [3-589]1|
            6[12]|
            72[24]
          )\d{2}|
          21\d{3}|
          32
        )\d{3}|
        8(?:
          (?:
            4[12]|
            [5-7]2|
            1\d?
          )|
          (?:
            0|
            3[12]|
            [5-7]1|
            217
          )\d
        )\d{4}|
        9(?:
          [35]1|
          (?:
            [024]2|
            81
          )\d|
          (?:
            1|
            [24]1
          )\d{2}
        )\d{3}
        "#;
        // TODO: This is a good candidate of a seq of literals that could be
        // shrunk quite a bit and still be very productive with respect to
        // literal optimizations.
        let (prefixes, suffixes) = e(pat);
        assert!(!suffixes.is_finite());
        assert_eq!(Some(243), prefixes.len());
    }

    #[test]
    fn optimize() {
        // This gets a common prefix that isn't too short.
        let (p, s) =
            opt(["foobarfoobar", "foobar", "foobarzfoobar", "foobarfoobar"]);
        assert_eq!(seq([I("foobar")]), p);
        assert_eq!(seq([I("foobar")]), s);

        // This also finds a common prefix, but since it's only one byte, it
        // prefers the multiple literals.
        let (p, s) = opt(["abba", "akka", "abccba"]);
        assert_eq!(exact(["abba", "akka", "abccba"]), (p, s));

        let (p, s) = opt(["sam", "samwise"]);
        assert_eq!((seq([E("sam")]), seq([E("sam"), E("samwise")])), (p, s));

        // The empty string is poisonous, so our seq becomes infinite, even
        // though all literals are exact.
        let (p, s) = opt(["foobarfoo", "foo", "", "foozfoo", "foofoo"]);
        assert!(!p.is_finite());
        assert!(!s.is_finite());

        // A space is also poisonous, so our seq becomes infinite. But this
        // only gets triggered when we don't have a completely exact sequence.
        // When the sequence is exact, spaces are okay, since we presume that
        // any prefilter will match a space more quickly than the regex engine.
        // (When the sequence is exact, there's a chance of the prefilter being
        // used without needing the regex engine at all.)
        let mut p = seq([E("foobarfoo"), I("foo"), E(" "), E("foofoo")]);
        p.optimize_for_prefix_by_preference();
        assert!(!p.is_finite());
    }
}