1 | /*! |
---|---|

2 | Provides literal extraction from `Hir` expressions. |

3 | |

4 | An [`Extractor`] pulls literals out of [`Hir`] expressions and returns a |

5 | [`Seq`] of [`Literal`]s. |

6 | |

7 | The purpose of literal extraction is generally to provide avenues for |

8 | optimizing regex searches. The main idea is that substring searches can be an |

9 | order of magnitude faster than a regex search. Therefore, if one can execute |

10 | a substring search to find candidate match locations and only run the regex |

11 | search at those locations, then it is possible for huge improvements in |

12 | performance to be realized. |

13 | |

14 | With that said, literal optimizations are generally a black art because even |

15 | though substring search is generally faster, if the number of candidates |

16 | produced is high, then it can create a lot of overhead by ping-ponging between |

17 | the substring search and the regex search. |

18 | |

19 | Here are some heuristics that might be used to help increase the chances of |

20 | effective literal optimizations: |

21 | |

22 | * Stick to small [`Seq`]s. If you search for too many literals, it's likely |

23 | to lead to substring search that is only a little faster than a regex search, |

24 | and thus the overhead of using literal optimizations in the first place might |

25 | make things slower overall. |

26 | * The literals in your [`Seq`] shouldn't be too short. In general, longer is |

27 | better. A sequence corresponding to single bytes that occur frequently in the |

28 | haystack, for example, is probably a bad literal optimization because it's |

29 | likely to produce many false positive candidates. Longer literals are less |

30 | likely to match, and thus probably produce fewer false positives. |

31 | * If it's possible to estimate the approximate frequency of each byte according |

32 | to some pre-computed background distribution, it is possible to compute a score |

33 | of how "good" a `Seq` is. If a `Seq` isn't good enough, you might consider |

34 | skipping the literal optimization and just use the regex engine. |

35 | |

36 | (It should be noted that there are always pathological cases that can make |

37 | any kind of literal optimization be a net slower result. This is why it |

38 | might be a good idea to be conservative, or to even provide a means for |

39 | literal optimizations to be dynamically disabled if they are determined to be |

40 | ineffective according to some measure.) |

41 | |

42 | You're encouraged to explore the methods on [`Seq`], which permit shrinking |

43 | the size of sequences in a preference-order preserving fashion. |

44 | |

45 | Finally, note that it isn't strictly necessary to use an [`Extractor`]. Namely, |

46 | an `Extractor` only uses public APIs of the [`Seq`] and [`Literal`] types, |

47 | so it is possible to implement your own extractor. For example, for n-grams |

48 | or "inner" literals (i.e., not prefix or suffix literals). The `Extractor` |

49 | is mostly responsible for the case analysis over `Hir` expressions. Much of |

50 | the "trickier" parts are how to combine literal sequences, and that is all |

51 | implemented on [`Seq`]. |

52 | */ |

53 | |

54 | use core::{cmp, mem, num::NonZeroUsize}; |

55 | |

56 | use alloc::{vec, vec::Vec}; |

57 | |

58 | use crate::hir::{self, Hir}; |

59 | |

60 | /// Extracts prefix or suffix literal sequences from [`Hir`] expressions. |

61 | /// |

62 | /// Literal extraction is based on the following observations: |

63 | /// |

64 | /// * Many regexes start with one or a small number of literals. |

65 | /// * Substring search for literals is often much faster (sometimes by an order |

66 | /// of magnitude) than a regex search. |

67 | /// |

68 | /// Thus, in many cases, one can search for literals to find candidate starting |

69 | /// locations of a match, and then only run the full regex engine at each such |

70 | /// location instead of over the full haystack. |

71 | /// |

72 | /// The main downside of literal extraction is that it can wind up causing a |

73 | /// search to be slower overall. For example, if there are many matches or if |

74 | /// there are many candidates that don't ultimately lead to a match, then a |

75 | /// lot of overhead will be spent in shuffing back-and-forth between substring |

76 | /// search and the regex engine. This is the fundamental reason why literal |

77 | /// optimizations for regex patterns is sometimes considered a "black art." |

78 | /// |

79 | /// # Look-around assertions |

80 | /// |

81 | /// Literal extraction treats all look-around assertions as-if they match every |

82 | /// empty string. So for example, the regex `\bquux\b` will yield a sequence |

83 | /// containing a single exact literal `quux`. However, not all occurrences |

84 | /// of `quux` correspond to a match a of the regex. For example, `\bquux\b` |

85 | /// does not match `ZquuxZ` anywhere because `quux` does not fall on a word |

86 | /// boundary. |

87 | /// |

88 | /// In effect, if your regex contains look-around assertions, then a match of |

89 | /// an exact literal does not necessarily mean the regex overall matches. So |

90 | /// you may still need to run the regex engine in such cases to confirm the |

91 | /// match. |

92 | /// |

93 | /// The precise guarantee you get from a literal sequence is: if every literal |

94 | /// in the sequence is exact and the original regex contains zero look-around |

95 | /// assertions, then a preference-order multi-substring search of those |

96 | /// literals will precisely match a preference-order search of the original |

97 | /// regex. |

98 | /// |

99 | /// # Example |

100 | /// |

101 | /// This shows how to extract prefixes: |

102 | /// |

103 | /// ``` |

104 | /// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse}; |

105 | /// |

106 | /// let hir = parse(r"(a|b|c)(x|y|z)[A-Z]+foo")?; |

107 | /// let got = Extractor::new().extract(&hir); |

108 | /// // All literals returned are "inexact" because none of them reach the |

109 | /// // match state. |

110 | /// let expected = Seq::from_iter([ |

111 | /// Literal::inexact("ax"), |

112 | /// Literal::inexact("ay"), |

113 | /// Literal::inexact("az"), |

114 | /// Literal::inexact("bx"), |

115 | /// Literal::inexact("by"), |

116 | /// Literal::inexact("bz"), |

117 | /// Literal::inexact("cx"), |

118 | /// Literal::inexact("cy"), |

119 | /// Literal::inexact("cz"), |

120 | /// ]); |

121 | /// assert_eq!(expected, got); |

122 | /// |

123 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |

124 | /// ``` |

125 | /// |

126 | /// This shows how to extract suffixes: |

127 | /// |

128 | /// ``` |

129 | /// use regex_syntax::{ |

130 | /// hir::literal::{Extractor, ExtractKind, Literal, Seq}, |

131 | /// parse, |

132 | /// }; |

133 | /// |

134 | /// let hir = parse(r"foo|[A-Z]+bar")?; |

135 | /// let got = Extractor::new().kind(ExtractKind::Suffix).extract(&hir); |

136 | /// // Since 'foo' gets to a match state, it is considered exact. But 'bar' |

137 | /// // does not because of the '[A-Z]+', and thus is marked inexact. |

138 | /// let expected = Seq::from_iter([ |

139 | /// Literal::exact("foo"), |

140 | /// Literal::inexact("bar"), |

141 | /// ]); |

142 | /// assert_eq!(expected, got); |

143 | /// |

144 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |

145 | /// ``` |

146 | #[derive(Clone, Debug)] |

147 | pub struct Extractor { |

148 | kind: ExtractKind, |

149 | limit_class: usize, |

150 | limit_repeat: usize, |

151 | limit_literal_len: usize, |

152 | limit_total: usize, |

153 | } |

154 | |

155 | impl Extractor { |

156 | /// Create a new extractor with a default configuration. |

157 | /// |

158 | /// The extractor can be optionally configured before calling |

159 | /// [`Extractor::extract`] to get a literal sequence. |

160 | pub fn new() -> Extractor { |

161 | Extractor { |

162 | kind: ExtractKind::Prefix, |

163 | limit_class: 10, |

164 | limit_repeat: 10, |

165 | limit_literal_len: 100, |

166 | limit_total: 250, |

167 | } |

168 | } |

169 | |

170 | /// Execute the extractor and return a sequence of literals. |

171 | pub fn extract(&self, hir: &Hir) -> Seq { |

172 | use crate::hir::HirKind::*; |

173 | |

174 | match *hir.kind() { |

175 | Empty | Look(_) => Seq::singleton(self::Literal::exact(vec![])), |

176 | Literal(hir::Literal(ref bytes)) => { |

177 | let mut seq = |

178 | Seq::singleton(self::Literal::exact(bytes.to_vec())); |

179 | self.enforce_literal_len(&mut seq); |

180 | seq |

181 | } |

182 | Class(hir::Class::Unicode(ref cls)) => { |

183 | self.extract_class_unicode(cls) |

184 | } |

185 | Class(hir::Class::Bytes(ref cls)) => self.extract_class_bytes(cls), |

186 | Repetition(ref rep) => self.extract_repetition(rep), |

187 | Capture(hir::Capture { ref sub, .. }) => self.extract(sub), |

188 | Concat(ref hirs) => match self.kind { |

189 | ExtractKind::Prefix => self.extract_concat(hirs.iter()), |

190 | ExtractKind::Suffix => self.extract_concat(hirs.iter().rev()), |

191 | }, |

192 | Alternation(ref hirs) => { |

193 | // Unlike concat, we always union starting from the beginning, |

194 | // since the beginning corresponds to the highest preference, |

195 | // which doesn't change based on forwards vs reverse. |

196 | self.extract_alternation(hirs.iter()) |

197 | } |

198 | } |

199 | } |

200 | |

201 | /// Set the kind of literal sequence to extract from an [`Hir`] expression. |

202 | /// |

203 | /// The default is to extract prefixes, but suffixes can be selected |

204 | /// instead. The contract for prefixes is that every match of the |

205 | /// corresponding `Hir` must start with one of the literals in the sequence |

206 | /// returned. Moreover, the _order_ of the sequence returned corresponds to |

207 | /// the preference order. |

208 | /// |

209 | /// Suffixes satisfy a similar contract in that every match of the |

210 | /// corresponding `Hir` must end with one of the literals in the sequence |

211 | /// returned. However, there is no guarantee that the literals are in |

212 | /// preference order. |

213 | /// |

214 | /// Remember that a sequence can be infinite. For example, unless the |

215 | /// limits are configured to be impractically large, attempting to extract |

216 | /// prefixes (or suffixes) for the pattern `[A-Z]` will return an infinite |

217 | /// sequence. Generally speaking, if the sequence returned is infinite, |

218 | /// then it is presumed to be unwise to do prefix (or suffix) optimizations |

219 | /// for the pattern. |

220 | pub fn kind(&mut self, kind: ExtractKind) -> &mut Extractor { |

221 | self.kind = kind; |

222 | self |

223 | } |

224 | |

225 | /// Configure a limit on the length of the sequence that is permitted for |

226 | /// a character class. If a character class exceeds this limit, then the |

227 | /// sequence returned for it is infinite. |

228 | /// |

229 | /// This prevents classes like `[A-Z]` or `\pL` from getting turned into |

230 | /// huge and likely unproductive sequences of literals. |

231 | /// |

232 | /// # Example |

233 | /// |

234 | /// This example shows how this limit can be lowered to decrease the tolerance |

235 | /// for character classes being turned into literal sequences. |

236 | /// |

237 | /// ``` |

238 | /// use regex_syntax::{hir::literal::{Extractor, Seq}, parse}; |

239 | /// |

240 | /// let hir = parse(r"[0-9]")?; |

241 | /// |

242 | /// let got = Extractor::new().extract(&hir); |

243 | /// let expected = Seq::new([ |

244 | /// "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", |

245 | /// ]); |

246 | /// assert_eq!(expected, got); |

247 | /// |

248 | /// // Now let's shrink the limit and see how that changes things. |

249 | /// let got = Extractor::new().limit_class(4).extract(&hir); |

250 | /// let expected = Seq::infinite(); |

251 | /// assert_eq!(expected, got); |

252 | /// |

253 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |

254 | /// ``` |

255 | pub fn limit_class(&mut self, limit: usize) -> &mut Extractor { |

256 | self.limit_class = limit; |

257 | self |

258 | } |

259 | |

260 | /// Configure a limit on the total number of repetitions that is permitted |

261 | /// before literal extraction is stopped. |

262 | /// |

263 | /// This is useful for limiting things like `(abcde){50}`, or more |

264 | /// insidiously, `(?:){1000000000}`. This limit prevents any one single |

265 | /// repetition from adding too much to a literal sequence. |

266 | /// |

267 | /// With this limit set, repetitions that exceed it will be stopped and any |

268 | /// literals extracted up to that point will be made inexact. |

269 | /// |

270 | /// # Example |

271 | /// |

272 | /// This shows how to decrease the limit and compares it with the default. |

273 | /// |

274 | /// ``` |

275 | /// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse}; |

276 | /// |

277 | /// let hir = parse(r"(abc){8}")?; |

278 | /// |

279 | /// let got = Extractor::new().extract(&hir); |

280 | /// let expected = Seq::new(["abcabcabcabcabcabcabcabc"]); |

281 | /// assert_eq!(expected, got); |

282 | /// |

283 | /// // Now let's shrink the limit and see how that changes things. |

284 | /// let got = Extractor::new().limit_repeat(4).extract(&hir); |

285 | /// let expected = Seq::from_iter([ |

286 | /// Literal::inexact("abcabcabcabc"), |

287 | /// ]); |

288 | /// assert_eq!(expected, got); |

289 | /// |

290 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |

291 | /// ``` |

292 | pub fn limit_repeat(&mut self, limit: usize) -> &mut Extractor { |

293 | self.limit_repeat = limit; |

294 | self |

295 | } |

296 | |

297 | /// Configure a limit on the maximum length of any literal in a sequence. |

298 | /// |

299 | /// This is useful for limiting things like `(abcde){5}{5}{5}{5}`. While |

300 | /// each repetition or literal in that regex is small, when all the |

301 | /// repetitions are applied, one ends up with a literal of length `5^4 = |

302 | /// 625`. |

303 | /// |

304 | /// With this limit set, literals that exceed it will be made inexact and |

305 | /// thus prevented from growing. |

306 | /// |

307 | /// # Example |

308 | /// |

309 | /// This shows how to decrease the limit and compares it with the default. |

310 | /// |

311 | /// ``` |

312 | /// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse}; |

313 | /// |

314 | /// let hir = parse(r"(abc){2}{2}{2}")?; |

315 | /// |

316 | /// let got = Extractor::new().extract(&hir); |

317 | /// let expected = Seq::new(["abcabcabcabcabcabcabcabc"]); |

318 | /// assert_eq!(expected, got); |

319 | /// |

320 | /// // Now let's shrink the limit and see how that changes things. |

321 | /// let got = Extractor::new().limit_literal_len(14).extract(&hir); |

322 | /// let expected = Seq::from_iter([ |

323 | /// Literal::inexact("abcabcabcabcab"), |

324 | /// ]); |

325 | /// assert_eq!(expected, got); |

326 | /// |

327 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |

328 | /// ``` |

329 | pub fn limit_literal_len(&mut self, limit: usize) -> &mut Extractor { |

330 | self.limit_literal_len = limit; |

331 | self |

332 | } |

333 | |

334 | /// Configure a limit on the total number of literals that will be |

335 | /// returned. |

336 | /// |

337 | /// This is useful as a practical measure for avoiding the creation of |

338 | /// large sequences of literals. While the extractor will automatically |

339 | /// handle local creations of large sequences (for example, `[A-Z]` yields |

340 | /// an infinite sequence by default), large sequences can be created |

341 | /// through non-local means as well. |

342 | /// |

343 | /// For example, `[ab]{3}{3}` would yield a sequence of length `512 = 2^9` |

344 | /// despite each of the repetitions being small on their own. This limit |

345 | /// thus represents a "catch all" for avoiding locally small sequences from |

346 | /// combining into large sequences. |

347 | /// |

348 | /// # Example |

349 | /// |

350 | /// This example shows how reducing the limit will change the literal |

351 | /// sequence returned. |

352 | /// |

353 | /// ``` |

354 | /// use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse}; |

355 | /// |

356 | /// let hir = parse(r"[ab]{2}{2}")?; |

357 | /// |

358 | /// let got = Extractor::new().extract(&hir); |

359 | /// let expected = Seq::new([ |

360 | /// "aaaa", "aaab", "aaba", "aabb", |

361 | /// "abaa", "abab", "abba", "abbb", |

362 | /// "baaa", "baab", "baba", "babb", |

363 | /// "bbaa", "bbab", "bbba", "bbbb", |

364 | /// ]); |

365 | /// assert_eq!(expected, got); |

366 | /// |

367 | /// // The default limit is not too big, but big enough to extract all |

368 | /// // literals from '[ab]{2}{2}'. If we shrink the limit to less than 16, |

369 | /// // then we'll get a truncated set. Notice that it returns a sequence of |

370 | /// // length 4 even though our limit was 10. This is because the sequence |

371 | /// // is difficult to increase without blowing the limit. Notice also |

372 | /// // that every literal in the sequence is now inexact because they were |

373 | /// // stripped of some suffix. |

374 | /// let got = Extractor::new().limit_total(10).extract(&hir); |

375 | /// let expected = Seq::from_iter([ |

376 | /// Literal::inexact("aa"), |

377 | /// Literal::inexact("ab"), |

378 | /// Literal::inexact("ba"), |

379 | /// Literal::inexact("bb"), |

380 | /// ]); |

381 | /// assert_eq!(expected, got); |

382 | /// |

383 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |

384 | /// ``` |

385 | pub fn limit_total(&mut self, limit: usize) -> &mut Extractor { |

386 | self.limit_total = limit; |

387 | self |

388 | } |

389 | |

390 | /// Extract a sequence from the given concatenation. Sequences from each of |

391 | /// the child HIR expressions are combined via cross product. |

392 | /// |

393 | /// This short circuits once the cross product turns into a sequence |

394 | /// containing only inexact literals. |

395 | fn extract_concat<'a, I: Iterator<Item = &'a Hir>>(&self, it: I) -> Seq { |

396 | let mut seq = Seq::singleton(self::Literal::exact(vec![])); |

397 | for hir in it { |

398 | // If every element in the sequence is inexact, then a cross |

399 | // product will always be a no-op. Thus, there is nothing else we |

400 | // can add to it and can quit early. Note that this also includes |

401 | // infinite sequences. |

402 | if seq.is_inexact() { |

403 | break; |

404 | } |

405 | // Note that 'cross' also dispatches based on whether we're |

406 | // extracting prefixes or suffixes. |

407 | seq = self.cross(seq, &mut self.extract(hir)); |

408 | } |

409 | seq |

410 | } |

411 | |

412 | /// Extract a sequence from the given alternation. |

413 | /// |

414 | /// This short circuits once the union turns into an infinite sequence. |

415 | fn extract_alternation<'a, I: Iterator<Item = &'a Hir>>( |

416 | &self, |

417 | it: I, |

418 | ) -> Seq { |

419 | let mut seq = Seq::empty(); |

420 | for hir in it { |

421 | // Once our 'seq' is infinite, every subsequent union |

422 | // operation on it will itself always result in an |

423 | // infinite sequence. Thus, it can never change and we can |

424 | // short-circuit. |

425 | if !seq.is_finite() { |

426 | break; |

427 | } |

428 | seq = self.union(seq, &mut self.extract(hir)); |

429 | } |

430 | seq |

431 | } |

432 | |

433 | /// Extract a sequence of literals from the given repetition. We do our |

434 | /// best, Some examples: |

435 | /// |

436 | /// 'a*' => [inexact(a), exact("")] |

437 | /// 'a*?' => [exact(""), inexact(a)] |

438 | /// 'a+' => [inexact(a)] |

439 | /// 'a{3}' => [exact(aaa)] |

440 | /// 'a{3,5} => [inexact(aaa)] |

441 | /// |

442 | /// The key here really is making sure we get the 'inexact' vs 'exact' |

443 | /// attributes correct on each of the literals we add. For example, the |

444 | /// fact that 'a*' gives us an inexact 'a' and an exact empty string means |

445 | /// that a regex like 'ab*c' will result in [inexact(ab), exact(ac)] |

446 | /// literals being extracted, which might actually be a better prefilter |

447 | /// than just 'a'. |

448 | fn extract_repetition(&self, rep: &hir::Repetition) -> Seq { |

449 | let mut subseq = self.extract(&rep.sub); |

450 | match *rep { |

451 | hir::Repetition { min: 0, max, greedy, .. } => { |

452 | // When 'max=1', we can retain exactness, since 'a?' is |

453 | // equivalent to 'a|'. Similarly below, 'a??' is equivalent to |

454 | // '|a'. |

455 | if max != Some(1) { |

456 | subseq.make_inexact(); |

457 | } |

458 | let mut empty = Seq::singleton(Literal::exact(vec![])); |

459 | if !greedy { |

460 | mem::swap(&mut subseq, &mut empty); |

461 | } |

462 | self.union(subseq, &mut empty) |

463 | } |

464 | hir::Repetition { min, max: Some(max), .. } if min == max => { |

465 | assert!(min > 0); // handled above |

466 | let limit = |

467 | u32::try_from(self.limit_repeat).unwrap_or(u32::MAX); |

468 | let mut seq = Seq::singleton(Literal::exact(vec![])); |

469 | for _ in 0..cmp::min(min, limit) { |

470 | if seq.is_inexact() { |

471 | break; |

472 | } |

473 | seq = self.cross(seq, &mut subseq.clone()); |

474 | } |

475 | if usize::try_from(min).is_err() || min > limit { |

476 | seq.make_inexact(); |

477 | } |

478 | seq |

479 | } |

480 | hir::Repetition { min, .. } => { |

481 | assert!(min > 0); // handled above |

482 | let limit = |

483 | u32::try_from(self.limit_repeat).unwrap_or(u32::MAX); |

484 | let mut seq = Seq::singleton(Literal::exact(vec![])); |

485 | for _ in 0..cmp::min(min, limit) { |

486 | if seq.is_inexact() { |

487 | break; |

488 | } |

489 | seq = self.cross(seq, &mut subseq.clone()); |

490 | } |

491 | seq.make_inexact(); |

492 | seq |

493 | } |

494 | } |

495 | } |

496 | |

497 | /// Convert the given Unicode class into a sequence of literals if the |

498 | /// class is small enough. If the class is too big, return an infinite |

499 | /// sequence. |

500 | fn extract_class_unicode(&self, cls: &hir::ClassUnicode) -> Seq { |

501 | if self.class_over_limit_unicode(cls) { |

502 | return Seq::infinite(); |

503 | } |

504 | let mut seq = Seq::empty(); |

505 | for r in cls.iter() { |

506 | for ch in r.start()..=r.end() { |

507 | seq.push(Literal::from(ch)); |

508 | } |

509 | } |

510 | self.enforce_literal_len(&mut seq); |

511 | seq |

512 | } |

513 | |

514 | /// Convert the given byte class into a sequence of literals if the class |

515 | /// is small enough. If the class is too big, return an infinite sequence. |

516 | fn extract_class_bytes(&self, cls: &hir::ClassBytes) -> Seq { |

517 | if self.class_over_limit_bytes(cls) { |

518 | return Seq::infinite(); |

519 | } |

520 | let mut seq = Seq::empty(); |

521 | for r in cls.iter() { |

522 | for b in r.start()..=r.end() { |

523 | seq.push(Literal::from(b)); |

524 | } |

525 | } |

526 | self.enforce_literal_len(&mut seq); |

527 | seq |

528 | } |

529 | |

530 | /// Returns true if the given Unicode class exceeds the configured limits |

531 | /// on this extractor. |

532 | fn class_over_limit_unicode(&self, cls: &hir::ClassUnicode) -> bool { |

533 | let mut count = 0; |

534 | for r in cls.iter() { |

535 | if count > self.limit_class { |

536 | return true; |

537 | } |

538 | count += r.len(); |

539 | } |

540 | count > self.limit_class |

541 | } |

542 | |

543 | /// Returns true if the given byte class exceeds the configured limits on |

544 | /// this extractor. |

545 | fn class_over_limit_bytes(&self, cls: &hir::ClassBytes) -> bool { |

546 | let mut count = 0; |

547 | for r in cls.iter() { |

548 | if count > self.limit_class { |

549 | return true; |

550 | } |

551 | count += r.len(); |

552 | } |

553 | count > self.limit_class |

554 | } |

555 | |

556 | /// Compute the cross product of the two sequences if the result would be |

557 | /// within configured limits. Otherwise, make `seq2` infinite and cross the |

558 | /// infinite sequence with `seq1`. |

559 | fn cross(&self, mut seq1: Seq, seq2: &mut Seq) -> Seq { |

560 | if seq1.max_cross_len(seq2).map_or(false, |len| len > self.limit_total) |

561 | { |

562 | seq2.make_infinite(); |

563 | } |

564 | if let ExtractKind::Suffix = self.kind { |

565 | seq1.cross_reverse(seq2); |

566 | } else { |

567 | seq1.cross_forward(seq2); |

568 | } |

569 | assert!(seq1.len().map_or(true, |x| x <= self.limit_total)); |

570 | self.enforce_literal_len(&mut seq1); |

571 | seq1 |

572 | } |

573 | |

574 | /// Union the two sequences if the result would be within configured |

575 | /// limits. Otherwise, make `seq2` infinite and union the infinite sequence |

576 | /// with `seq1`. |

577 | fn union(&self, mut seq1: Seq, seq2: &mut Seq) -> Seq { |

578 | if seq1.max_union_len(seq2).map_or(false, |len| len > self.limit_total) |

579 | { |

580 | // We try to trim our literal sequences to see if we can make |

581 | // room for more literals. The idea is that we'd rather trim down |

582 | // literals already in our sequence if it means we can add a few |

583 | // more and retain a finite sequence. Otherwise, we'll union with |

584 | // an infinite sequence and that infects everything and effectively |

585 | // stops literal extraction in its tracks. |

586 | // |

587 | // We do we keep 4 bytes here? Well, it's a bit of an abstraction |

588 | // leakage. Downstream, the literals may wind up getting fed to |

589 | // the Teddy algorithm, which supports searching literals up to |

590 | // length 4. So that's why we pick that number here. Arguably this |

591 | // should be a tuneable parameter, but it seems a little tricky to |

592 | // describe. And I'm still unsure if this is the right way to go |

593 | // about culling literal sequences. |

594 | match self.kind { |

595 | ExtractKind::Prefix => { |

596 | seq1.keep_first_bytes(4); |

597 | seq2.keep_first_bytes(4); |

598 | } |

599 | ExtractKind::Suffix => { |

600 | seq1.keep_last_bytes(4); |

601 | seq2.keep_last_bytes(4); |

602 | } |

603 | } |

604 | seq1.dedup(); |

605 | seq2.dedup(); |

606 | if seq1 |

607 | .max_union_len(seq2) |

608 | .map_or(false, |len| len > self.limit_total) |

609 | { |

610 | seq2.make_infinite(); |

611 | } |

612 | } |

613 | seq1.union(seq2); |

614 | assert!(seq1.len().map_or(true, |x| x <= self.limit_total)); |

615 | seq1 |

616 | } |

617 | |

618 | /// Applies the literal length limit to the given sequence. If none of the |

619 | /// literals in the sequence exceed the limit, then this is a no-op. |

620 | fn enforce_literal_len(&self, seq: &mut Seq) { |

621 | let len = self.limit_literal_len; |

622 | match self.kind { |

623 | ExtractKind::Prefix => seq.keep_first_bytes(len), |

624 | ExtractKind::Suffix => seq.keep_last_bytes(len), |

625 | } |

626 | } |

627 | } |

628 | |

629 | impl Default for Extractor { |

630 | fn default() -> Extractor { |

631 | Extractor::new() |

632 | } |

633 | } |

634 | |

635 | /// The kind of literals to extract from an [`Hir`] expression. |

636 | /// |

637 | /// The default extraction kind is `Prefix`. |

638 | #[non_exhaustive] |

639 | #[derive(Clone, Debug)] |

640 | pub enum ExtractKind { |

641 | /// Extracts only prefix literals from a regex. |

642 | Prefix, |

643 | /// Extracts only suffix literals from a regex. |

644 | /// |

645 | /// Note that the sequence returned by suffix literals currently may |

646 | /// not correctly represent leftmost-first or "preference" order match |

647 | /// semantics. |

648 | Suffix, |

649 | } |

650 | |

651 | impl ExtractKind { |

652 | /// Returns true if this kind is the `Prefix` variant. |

653 | pub fn is_prefix(&self) -> bool { |

654 | matches!(*self, ExtractKind::Prefix) |

655 | } |

656 | |

657 | /// Returns true if this kind is the `Suffix` variant. |

658 | pub fn is_suffix(&self) -> bool { |

659 | matches!(*self, ExtractKind::Suffix) |

660 | } |

661 | } |

662 | |

663 | impl Default for ExtractKind { |

664 | fn default() -> ExtractKind { |

665 | ExtractKind::Prefix |

666 | } |

667 | } |

668 | |

669 | /// A sequence of literals. |

670 | /// |

671 | /// A `Seq` is very much like a set in that it represents a union of its |

672 | /// members. That is, it corresponds to a set of literals where at least one |

673 | /// must match in order for a particular [`Hir`] expression to match. (Whether |

674 | /// this corresponds to the entire `Hir` expression, a prefix of it or a suffix |

675 | /// of it depends on how the `Seq` was extracted from the `Hir`.) |

676 | /// |

677 | /// It is also unlike a set in that multiple identical literals may appear, |

678 | /// and that the order of the literals in the `Seq` matters. For example, if |

679 | /// the sequence is `[sam, samwise]` and leftmost-first matching is used, then |

680 | /// `samwise` can never match and the sequence is equivalent to `[sam]`. |

681 | /// |

682 | /// # States of a sequence |

683 | /// |

684 | /// A `Seq` has a few different logical states to consider: |

685 | /// |

686 | /// * The sequence can represent "any" literal. When this happens, the set does |

687 | /// not have a finite size. The purpose of this state is to inhibit callers |

688 | /// from making assumptions about what literals are required in order to match |

689 | /// a particular [`Hir`] expression. Generally speaking, when a set is in this |

690 | /// state, literal optimizations are inhibited. A good example of a regex that |

691 | /// will cause this sort of set to appear is `[A-Za-z]`. The character class |

692 | /// is just too big (and also too narrow) to be usefully expanded into 52 |

693 | /// different literals. (Note that the decision for when a seq should become |

694 | /// infinite is determined by the caller. A seq itself has no hard-coded |

695 | /// limits.) |

696 | /// * The sequence can be empty, in which case, it is an affirmative statement |

697 | /// that there are no literals that can match the corresponding `Hir`. |

698 | /// Consequently, the `Hir` never matches any input. For example, `[a&&b]`. |

699 | /// * The sequence can be non-empty, in which case, at least one of the |

700 | /// literals must match in order for the corresponding `Hir` to match. |

701 | /// |

702 | /// # Example |

703 | /// |

704 | /// This example shows how literal sequences can be simplified by stripping |

705 | /// suffixes and minimizing while maintaining preference order. |

706 | /// |

707 | /// ``` |

708 | /// use regex_syntax::hir::literal::{Literal, Seq}; |

709 | /// |

710 | /// let mut seq = Seq::new(&[ |

711 | /// "farm", |

712 | /// "appliance", |

713 | /// "faraway", |

714 | /// "apple", |

715 | /// "fare", |

716 | /// "gap", |

717 | /// "applicant", |

718 | /// "applaud", |

719 | /// ]); |

720 | /// seq.keep_first_bytes(3); |

721 | /// seq.minimize_by_preference(); |

722 | /// // Notice that 'far' comes before 'app', which matches the order in the |

723 | /// // original sequence. This guarantees that leftmost-first semantics are |

724 | /// // not altered by simplifying the set. |

725 | /// let expected = Seq::from_iter([ |

726 | /// Literal::inexact("far"), |

727 | /// Literal::inexact("app"), |

728 | /// Literal::exact("gap"), |

729 | /// ]); |

730 | /// assert_eq!(expected, seq); |

731 | /// ``` |

732 | #[derive(Clone, Eq, PartialEq)] |

733 | pub struct Seq { |

734 | /// The members of this seq. |

735 | /// |

736 | /// When `None`, the seq represents all possible literals. That is, it |

737 | /// prevents one from making assumptions about specific literals in the |

738 | /// seq, and forces one to treat it as if any literal might be in the seq. |

739 | /// |

740 | /// Note that `Some(vec![])` is valid and corresponds to the empty seq of |

741 | /// literals, i.e., a regex that can never match. For example, `[a&&b]`. |

742 | /// It is distinct from `Some(vec![""])`, which corresponds to the seq |

743 | /// containing an empty string, which matches at every position. |

744 | literals: Option<Vec<Literal>>, |

745 | } |

746 | |

747 | impl Seq { |

748 | /// Returns an empty sequence. |

749 | /// |

750 | /// An empty sequence matches zero literals, and thus corresponds to a |

751 | /// regex that itself can never match. |

752 | #[inline] |

753 | pub fn empty() -> Seq { |

754 | Seq { literals: Some(vec![]) } |

755 | } |

756 | |

757 | /// Returns a sequence of literals without a finite size and may contain |

758 | /// any literal. |

759 | /// |

760 | /// A sequence without finite size does not reveal anything about the |

761 | /// characteristics of the literals in its set. There are no fixed prefixes |

762 | /// or suffixes, nor are lower or upper bounds on the length of the literals |

763 | /// in the set known. |

764 | /// |

765 | /// This is useful to represent constructs in a regex that are "too big" |

766 | /// to useful represent as a sequence of literals. For example, `[A-Za-z]`. |

767 | /// When sequences get too big, they lose their discriminating nature and |

768 | /// are more likely to produce false positives, which in turn makes them |

769 | /// less likely to speed up searches. |

770 | /// |

771 | /// More pragmatically, for many regexes, enumerating all possible literals |

772 | /// is itself not possible or might otherwise use too many resources. So |

773 | /// constraining the size of sets during extraction is a practical trade |

774 | /// off to make. |

775 | #[inline] |

776 | pub fn infinite() -> Seq { |

777 | Seq { literals: None } |

778 | } |

779 | |

780 | /// Returns a sequence containing a single literal. |

781 | #[inline] |

782 | pub fn singleton(lit: Literal) -> Seq { |

783 | Seq { literals: Some(vec![lit]) } |

784 | } |

785 | |

786 | /// Returns a sequence of exact literals from the given byte strings. |

787 | #[inline] |

788 | pub fn new<I, B>(it: I) -> Seq |

789 | where |

790 | I: IntoIterator<Item = B>, |

791 | B: AsRef<[u8]>, |

792 | { |

793 | it.into_iter().map(|b| Literal::exact(b.as_ref())).collect() |

794 | } |

795 | |

796 | /// If this is a finite sequence, return its members as a slice of |

797 | /// literals. |

798 | /// |

799 | /// The slice returned may be empty, in which case, there are no literals |

800 | /// that can match this sequence. |

801 | #[inline] |

802 | pub fn literals(&self) -> Option<&[Literal]> { |

803 | self.literals.as_deref() |

804 | } |

805 | |

806 | /// Push a literal to the end of this sequence. |

807 | /// |

808 | /// If this sequence is not finite, then this is a no-op. |

809 | /// |

810 | /// Similarly, if the most recently added item of this sequence is |

811 | /// equivalent to the literal given, then it is not added. This reflects |

812 | /// a `Seq`'s "set like" behavior, and represents a practical trade off. |

813 | /// Namely, there is never any need to have two adjacent and equivalent |

814 | /// literals in the same sequence, _and_ it is easy to detect in some |

815 | /// cases. |

816 | #[inline] |

817 | pub fn push(&mut self, lit: Literal) { |

818 | let lits = match self.literals { |

819 | None => return, |

820 | Some(ref mut lits) => lits, |

821 | }; |

822 | if lits.last().map_or(false, |m| m == &lit) { |

823 | return; |

824 | } |

825 | lits.push(lit); |

826 | } |

827 | |

828 | /// Make all of the literals in this sequence inexact. |

829 | /// |

830 | /// This is a no-op if this sequence is not finite. |

831 | #[inline] |

832 | pub fn make_inexact(&mut self) { |

833 | let lits = match self.literals { |

834 | None => return, |

835 | Some(ref mut lits) => lits, |

836 | }; |

837 | for lit in lits.iter_mut() { |

838 | lit.make_inexact(); |

839 | } |

840 | } |

841 | |

842 | /// Converts this sequence to an infinite sequence. |

843 | /// |

844 | /// This is a no-op if the sequence is already infinite. |

845 | #[inline] |

846 | pub fn make_infinite(&mut self) { |

847 | self.literals = None; |

848 | } |

849 | |

850 | /// Modify this sequence to contain the cross product between it and the |

851 | /// sequence given. |

852 | /// |

853 | /// The cross product only considers literals in this sequence that are |

854 | /// exact. That is, inexact literals are not extended. |

855 | /// |

856 | /// The literals are always drained from `other`, even if none are used. |

857 | /// This permits callers to reuse the sequence allocation elsewhere. |

858 | /// |

859 | /// If this sequence is infinite, then this is a no-op, regardless of what |

860 | /// `other` contains (and in this case, the literals are still drained from |

861 | /// `other`). If `other` is infinite and this sequence is finite, then this |

862 | /// is a no-op, unless this sequence contains a zero-length literal. In |

863 | /// which case, the infiniteness of `other` infects this sequence, and this |

864 | /// sequence is itself made infinite. |

865 | /// |

866 | /// Like [`Seq::union`], this may attempt to deduplicate literals. See |

867 | /// [`Seq::dedup`] for how deduplication deals with exact and inexact |

868 | /// literals. |

869 | /// |

870 | /// # Example |

871 | /// |

872 | /// This example shows basic usage and how exact and inexact literals |

873 | /// interact. |

874 | /// |

875 | /// ``` |

876 | /// use regex_syntax::hir::literal::{Literal, Seq}; |

877 | /// |

878 | /// let mut seq1 = Seq::from_iter([ |

879 | /// Literal::exact("foo"), |

880 | /// Literal::inexact("bar"), |

881 | /// ]); |

882 | /// let mut seq2 = Seq::from_iter([ |

883 | /// Literal::inexact("quux"), |

884 | /// Literal::exact("baz"), |

885 | /// ]); |

886 | /// seq1.cross_forward(&mut seq2); |

887 | /// |

888 | /// // The literals are pulled out of seq2. |

889 | /// assert_eq!(Some(0), seq2.len()); |

890 | /// |

891 | /// let expected = Seq::from_iter([ |

892 | /// Literal::inexact("fooquux"), |

893 | /// Literal::exact("foobaz"), |

894 | /// Literal::inexact("bar"), |

895 | /// ]); |

896 | /// assert_eq!(expected, seq1); |

897 | /// ``` |

898 | /// |

899 | /// This example shows the behavior of when `other` is an infinite |

900 | /// sequence. |

901 | /// |

902 | /// ``` |

903 | /// use regex_syntax::hir::literal::{Literal, Seq}; |

904 | /// |

905 | /// let mut seq1 = Seq::from_iter([ |

906 | /// Literal::exact("foo"), |

907 | /// Literal::inexact("bar"), |

908 | /// ]); |

909 | /// let mut seq2 = Seq::infinite(); |

910 | /// seq1.cross_forward(&mut seq2); |

911 | /// |

912 | /// // When seq2 is infinite, cross product doesn't add anything, but |

913 | /// // ensures all members of seq1 are inexact. |

914 | /// let expected = Seq::from_iter([ |

915 | /// Literal::inexact("foo"), |

916 | /// Literal::inexact("bar"), |

917 | /// ]); |

918 | /// assert_eq!(expected, seq1); |

919 | /// ``` |

920 | /// |

921 | /// This example is like the one above, but shows what happens when this |

922 | /// sequence contains an empty string. In this case, an infinite `other` |

923 | /// sequence infects this sequence (because the empty string means that |

924 | /// there are no finite prefixes): |

925 | /// |

926 | /// ``` |

927 | /// use regex_syntax::hir::literal::{Literal, Seq}; |

928 | /// |

929 | /// let mut seq1 = Seq::from_iter([ |

930 | /// Literal::exact("foo"), |

931 | /// Literal::exact(""), // inexact provokes same behavior |

932 | /// Literal::inexact("bar"), |

933 | /// ]); |

934 | /// let mut seq2 = Seq::infinite(); |

935 | /// seq1.cross_forward(&mut seq2); |

936 | /// |

937 | /// // seq1 is now infinite! |

938 | /// assert!(!seq1.is_finite()); |

939 | /// ``` |

940 | /// |

941 | /// This example shows the behavior of this sequence is infinite. |

942 | /// |

943 | /// ``` |

944 | /// use regex_syntax::hir::literal::{Literal, Seq}; |

945 | /// |

946 | /// let mut seq1 = Seq::infinite(); |

947 | /// let mut seq2 = Seq::from_iter([ |

948 | /// Literal::exact("foo"), |

949 | /// Literal::inexact("bar"), |

950 | /// ]); |

951 | /// seq1.cross_forward(&mut seq2); |

952 | /// |

953 | /// // seq1 remains unchanged. |

954 | /// assert!(!seq1.is_finite()); |

955 | /// // Even though the literals in seq2 weren't used, it was still drained. |

956 | /// assert_eq!(Some(0), seq2.len()); |

957 | /// ``` |

958 | #[inline] |

959 | pub fn cross_forward(&mut self, other: &mut Seq) { |

960 | let (lits1, lits2) = match self.cross_preamble(other) { |

961 | None => return, |

962 | Some((lits1, lits2)) => (lits1, lits2), |

963 | }; |

964 | let newcap = lits1.len().saturating_mul(lits2.len()); |

965 | for selflit in mem::replace(lits1, Vec::with_capacity(newcap)) { |

966 | if !selflit.is_exact() { |

967 | lits1.push(selflit); |

968 | continue; |

969 | } |

970 | for otherlit in lits2.iter() { |

971 | let mut newlit = Literal::exact(Vec::with_capacity( |

972 | selflit.len() + otherlit.len(), |

973 | )); |

974 | newlit.extend(&selflit); |

975 | newlit.extend(&otherlit); |

976 | if !otherlit.is_exact() { |

977 | newlit.make_inexact(); |

978 | } |

979 | lits1.push(newlit); |

980 | } |

981 | } |

982 | lits2.drain(..); |

983 | self.dedup(); |

984 | } |

985 | |

986 | /// Modify this sequence to contain the cross product between it and |

987 | /// the sequence given, where the sequences are treated as suffixes |

988 | /// instead of prefixes. Namely, the sequence `other` is *prepended* |

989 | /// to `self` (as opposed to `other` being *appended* to `self` in |

990 | /// [`Seq::cross_forward`]). |

991 | /// |

992 | /// The cross product only considers literals in this sequence that are |

993 | /// exact. That is, inexact literals are not extended. |

994 | /// |

995 | /// The literals are always drained from `other`, even if none are used. |

996 | /// This permits callers to reuse the sequence allocation elsewhere. |

997 | /// |

998 | /// If this sequence is infinite, then this is a no-op, regardless of what |

999 | /// `other` contains (and in this case, the literals are still drained from |

1000 | /// `other`). If `other` is infinite and this sequence is finite, then this |

1001 | /// is a no-op, unless this sequence contains a zero-length literal. In |

1002 | /// which case, the infiniteness of `other` infects this sequence, and this |

1003 | /// sequence is itself made infinite. |

1004 | /// |

1005 | /// Like [`Seq::union`], this may attempt to deduplicate literals. See |

1006 | /// [`Seq::dedup`] for how deduplication deals with exact and inexact |

1007 | /// literals. |

1008 | /// |

1009 | /// # Example |

1010 | /// |

1011 | /// This example shows basic usage and how exact and inexact literals |

1012 | /// interact. |

1013 | /// |

1014 | /// ``` |

1015 | /// use regex_syntax::hir::literal::{Literal, Seq}; |

1016 | /// |

1017 | /// let mut seq1 = Seq::from_iter([ |

1018 | /// Literal::exact("foo"), |

1019 | /// Literal::inexact("bar"), |

1020 | /// ]); |

1021 | /// let mut seq2 = Seq::from_iter([ |

1022 | /// Literal::inexact("quux"), |

1023 | /// Literal::exact("baz"), |

1024 | /// ]); |

1025 | /// seq1.cross_reverse(&mut seq2); |

1026 | /// |

1027 | /// // The literals are pulled out of seq2. |

1028 | /// assert_eq!(Some(0), seq2.len()); |

1029 | /// |

1030 | /// let expected = Seq::from_iter([ |

1031 | /// Literal::inexact("quuxfoo"), |

1032 | /// Literal::inexact("bar"), |

1033 | /// Literal::exact("bazfoo"), |

1034 | /// ]); |

1035 | /// assert_eq!(expected, seq1); |

1036 | /// ``` |

1037 | /// |

1038 | /// This example shows the behavior of when `other` is an infinite |

1039 | /// sequence. |

1040 | /// |

1041 | /// ``` |

1042 | /// use regex_syntax::hir::literal::{Literal, Seq}; |

1043 | /// |

1044 | /// let mut seq1 = Seq::from_iter([ |

1045 | /// Literal::exact("foo"), |

1046 | /// Literal::inexact("bar"), |

1047 | /// ]); |

1048 | /// let mut seq2 = Seq::infinite(); |

1049 | /// seq1.cross_reverse(&mut seq2); |

1050 | /// |

1051 | /// // When seq2 is infinite, cross product doesn't add anything, but |

1052 | /// // ensures all members of seq1 are inexact. |

1053 | /// let expected = Seq::from_iter([ |

1054 | /// Literal::inexact("foo"), |

1055 | /// Literal::inexact("bar"), |

1056 | /// ]); |

1057 | /// assert_eq!(expected, seq1); |

1058 | /// ``` |

1059 | /// |

1060 | /// This example is like the one above, but shows what happens when this |

1061 | /// sequence contains an empty string. In this case, an infinite `other` |

1062 | /// sequence infects this sequence (because the empty string means that |

1063 | /// there are no finite suffixes): |

1064 | /// |

1065 | /// ``` |

1066 | /// use regex_syntax::hir::literal::{Literal, Seq}; |

1067 | /// |

1068 | /// let mut seq1 = Seq::from_iter([ |

1069 | /// Literal::exact("foo"), |

1070 | /// Literal::exact(""), // inexact provokes same behavior |

1071 | /// Literal::inexact("bar"), |

1072 | /// ]); |

1073 | /// let mut seq2 = Seq::infinite(); |

1074 | /// seq1.cross_reverse(&mut seq2); |

1075 | /// |

1076 | /// // seq1 is now infinite! |

1077 | /// assert!(!seq1.is_finite()); |

1078 | /// ``` |

1079 | /// |

1080 | /// This example shows the behavior when this sequence is infinite. |

1081 | /// |

1082 | /// ``` |

1083 | /// use regex_syntax::hir::literal::{Literal, Seq}; |

1084 | /// |

1085 | /// let mut seq1 = Seq::infinite(); |

1086 | /// let mut seq2 = Seq::from_iter([ |

1087 | /// Literal::exact("foo"), |

1088 | /// Literal::inexact("bar"), |

1089 | /// ]); |

1090 | /// seq1.cross_reverse(&mut seq2); |

1091 | /// |

1092 | /// // seq1 remains unchanged. |

1093 | /// assert!(!seq1.is_finite()); |

1094 | /// // Even though the literals in seq2 weren't used, it was still drained. |

1095 | /// assert_eq!(Some(0), seq2.len()); |

1096 | /// ``` |

1097 | #[inline] |

1098 | pub fn cross_reverse(&mut self, other: &mut Seq) { |

1099 | let (lits1, lits2) = match self.cross_preamble(other) { |

1100 | None => return, |

1101 | Some((lits1, lits2)) => (lits1, lits2), |

1102 | }; |

1103 | // We basically proceed as we do in 'cross_forward' at this point, |

1104 | // except that the outer loop is now 'other' and the inner loop is now |

1105 | // 'self'. That's because 'self' corresponds to suffixes and 'other' |

1106 | // corresponds to the sequence we want to *prepend* to the suffixes. |

1107 | let newcap = lits1.len().saturating_mul(lits2.len()); |

1108 | let selflits = mem::replace(lits1, Vec::with_capacity(newcap)); |

1109 | for (i, otherlit) in lits2.drain(..).enumerate() { |

1110 | for selflit in selflits.iter() { |

1111 | if !selflit.is_exact() { |

1112 | // If the suffix isn't exact, then we can't prepend |

1113 | // anything to it. However, we still want to keep it. But |

1114 | // we only want to keep one of them, to avoid duplication. |

1115 | // (The duplication is okay from a correctness perspective, |

1116 | // but wasteful.) |

1117 | if i == 0 { |

1118 | lits1.push(selflit.clone()); |

1119 | } |

1120 | continue; |

1121 | } |

1122 | let mut newlit = Literal::exact(Vec::with_capacity( |

1123 | otherlit.len() + selflit.len(), |

1124 | )); |

1125 | newlit.extend(&otherlit); |

1126 | newlit.extend(&selflit); |

1127 | if !otherlit.is_exact() { |

1128 | newlit.make_inexact(); |

1129 | } |

1130 | lits1.push(newlit); |

1131 | } |

1132 | } |

1133 | self.dedup(); |

1134 | } |

1135 | |

1136 | /// A helper function the corresponds to the subtle preamble for both |

1137 | /// `cross_forward` and `cross_reverse`. In effect, it handles the cases |

1138 | /// of infinite sequences for both `self` and `other`, as well as ensuring |

1139 | /// that literals from `other` are drained even if they aren't used. |

1140 | fn cross_preamble<'a>( |

1141 | &'a mut self, |

1142 | other: &'a mut Seq, |

1143 | ) -> Option<(&'a mut Vec<Literal>, &'a mut Vec<Literal>)> { |

1144 | let lits2 = match other.literals { |

1145 | None => { |

1146 | // If our current seq contains the empty string and the seq |

1147 | // we're adding matches any literal, then it follows that the |

1148 | // current seq must now also match any literal. |

1149 | // |

1150 | // Otherwise, we just have to make sure everything in this |

1151 | // sequence is inexact. |

1152 | if self.min_literal_len() == Some(0) { |

1153 | *self = Seq::infinite(); |

1154 | } else { |

1155 | self.make_inexact(); |

1156 | } |

1157 | return None; |

1158 | } |

1159 | Some(ref mut lits) => lits, |

1160 | }; |

1161 | let lits1 = match self.literals { |

1162 | None => { |

1163 | // If we aren't going to make it to the end of this routine |

1164 | // where lits2 is drained, then we need to do it now. |

1165 | lits2.drain(..); |

1166 | return None; |

1167 | } |

1168 | Some(ref mut lits) => lits, |

1169 | }; |

1170 | Some((lits1, lits2)) |

1171 | } |

1172 | |

1173 | /// Unions the `other` sequence into this one. |

1174 | /// |

1175 | /// The literals are always drained out of the given `other` sequence, |

1176 | /// even if they are being unioned into an infinite sequence. This permits |

1177 | /// the caller to reuse the `other` sequence in another context. |

1178 | /// |

1179 | /// Some literal deduping may be performed. If any deduping happens, |

1180 | /// any leftmost-first or "preference" order match semantics will be |

1181 | /// preserved. |

1182 | /// |

1183 | /// # Example |

1184 | /// |

1185 | /// This example shows basic usage. |

1186 | /// |

1187 | /// ``` |

1188 | /// use regex_syntax::hir::literal::Seq; |

1189 | /// |

1190 | /// let mut seq1 = Seq::new(&["foo", "bar"]); |

1191 | /// let mut seq2 = Seq::new(&["bar", "quux", "foo"]); |

1192 | /// seq1.union(&mut seq2); |

1193 | /// |

1194 | /// // The literals are pulled out of seq2. |

1195 | /// assert_eq!(Some(0), seq2.len()); |

1196 | /// |

1197 | /// // Adjacent literals are deduped, but non-adjacent literals may not be. |

1198 | /// assert_eq!(Seq::new(&["foo", "bar", "quux", "foo"]), seq1); |

1199 | /// ``` |

1200 | /// |

1201 | /// This example shows that literals are drained from `other` even when |

1202 | /// they aren't necessarily used. |

1203 | /// |

1204 | /// ``` |

1205 | /// use regex_syntax::hir::literal::Seq; |

1206 | /// |

1207 | /// let mut seq1 = Seq::infinite(); |

1208 | /// // Infinite sequences have no finite length. |

1209 | /// assert_eq!(None, seq1.len()); |

1210 | /// |

1211 | /// let mut seq2 = Seq::new(&["bar", "quux", "foo"]); |

1212 | /// seq1.union(&mut seq2); |

1213 | /// |

1214 | /// // seq1 is still infinite and seq2 has been drained. |

1215 | /// assert_eq!(None, seq1.len()); |

1216 | /// assert_eq!(Some(0), seq2.len()); |

1217 | /// ``` |

1218 | #[inline] |

1219 | pub fn union(&mut self, other: &mut Seq) { |

1220 | let lits2 = match other.literals { |

1221 | None => { |

1222 | // Unioning with an infinite sequence always results in an |

1223 | // infinite sequence. |

1224 | self.make_infinite(); |

1225 | return; |

1226 | } |

1227 | Some(ref mut lits) => lits.drain(..), |

1228 | }; |

1229 | let lits1 = match self.literals { |

1230 | None => return, |

1231 | Some(ref mut lits) => lits, |

1232 | }; |

1233 | lits1.extend(lits2); |

1234 | self.dedup(); |

1235 | } |

1236 | |

1237 | /// Unions the `other` sequence into this one by splice the `other` |

1238 | /// sequence at the position of the first zero-length literal. |

1239 | /// |

1240 | /// This is useful for preserving preference order semantics when combining |

1241 | /// two literal sequences. For example, in the regex `(a||f)+foo`, the |

1242 | /// correct preference order prefix sequence is `[a, foo, f]`. |

1243 | /// |

1244 | /// The literals are always drained out of the given `other` sequence, |

1245 | /// even if they are being unioned into an infinite sequence. This permits |

1246 | /// the caller to reuse the `other` sequence in another context. Note that |

1247 | /// the literals are drained even if no union is performed as well, i.e., |

1248 | /// when this sequence does not contain a zero-length literal. |

1249 | /// |

1250 | /// Some literal deduping may be performed. If any deduping happens, |

1251 | /// any leftmost-first or "preference" order match semantics will be |

1252 | /// preserved. |

1253 | /// |

1254 | /// # Example |

1255 | /// |

1256 | /// This example shows basic usage. |

1257 | /// |

1258 | /// ``` |

1259 | /// use regex_syntax::hir::literal::Seq; |

1260 | /// |

1261 | /// let mut seq1 = Seq::new(&["a", "", "f", ""]); |

1262 | /// let mut seq2 = Seq::new(&["foo"]); |

1263 | /// seq1.union_into_empty(&mut seq2); |

1264 | /// |

1265 | /// // The literals are pulled out of seq2. |

1266 | /// assert_eq!(Some(0), seq2.len()); |

1267 | /// // 'foo' gets spliced into seq1 where the first empty string occurs. |

1268 | /// assert_eq!(Seq::new(&["a", "foo", "f"]), seq1); |

1269 | /// ``` |

1270 | /// |

1271 | /// This example shows that literals are drained from `other` even when |

1272 | /// they aren't necessarily used. |

1273 | /// |

1274 | /// ``` |

1275 | /// use regex_syntax::hir::literal::Seq; |

1276 | /// |

1277 | /// let mut seq1 = Seq::new(&["foo", "bar"]); |

1278 | /// let mut seq2 = Seq::new(&["bar", "quux", "foo"]); |

1279 | /// seq1.union_into_empty(&mut seq2); |

1280 | /// |

1281 | /// // seq1 has no zero length literals, so no splicing happens. |

1282 | /// assert_eq!(Seq::new(&["foo", "bar"]), seq1); |

1283 | /// // Even though no splicing happens, seq2 is still drained. |

1284 | /// assert_eq!(Some(0), seq2.len()); |

1285 | /// ``` |

1286 | #[inline] |

1287 | pub fn union_into_empty(&mut self, other: &mut Seq) { |

1288 | let lits2 = other.literals.as_mut().map(|lits| lits.drain(..)); |

1289 | let lits1 = match self.literals { |

1290 | None => return, |

1291 | Some(ref mut lits) => lits, |

1292 | }; |

1293 | let first_empty = match lits1.iter().position(|m| m.is_empty()) { |

1294 | None => return, |

1295 | Some(i) => i, |

1296 | }; |

1297 | let lits2 = match lits2 { |

1298 | None => { |

1299 | // Note that we are only here if we've found an empty literal, |

1300 | // which implies that an infinite sequence infects this seq and |

1301 | // also turns it into an infinite sequence. |

1302 | self.literals = None; |

1303 | return; |

1304 | } |

1305 | Some(lits) => lits, |

1306 | }; |

1307 | // Clearing out the empties needs to come before the splice because |

1308 | // the splice might add more empties that we don't want to get rid |

1309 | // of. Since we're splicing into the position of the first empty, the |

1310 | // 'first_empty' position computed above is still correct. |

1311 | lits1.retain(|m| !m.is_empty()); |

1312 | lits1.splice(first_empty..first_empty, lits2); |

1313 | self.dedup(); |

1314 | } |

1315 | |

1316 | /// Deduplicate adjacent equivalent literals in this sequence. |

1317 | /// |

1318 | /// If adjacent literals are equivalent strings but one is exact and the |

1319 | /// other inexact, the inexact literal is kept and the exact one is |

1320 | /// removed. |

1321 | /// |

1322 | /// Deduping an infinite sequence is a no-op. |

1323 | /// |

1324 | /// # Example |

1325 | /// |

1326 | /// This example shows how literals that are duplicate byte strings but |

1327 | /// are not equivalent with respect to exactness are resolved. |

1328 | /// |

1329 | /// ``` |

1330 | /// use regex_syntax::hir::literal::{Literal, Seq}; |

1331 | /// |

1332 | /// let mut seq = Seq::from_iter([ |

1333 | /// Literal::exact("foo"), |

1334 | /// Literal::inexact("foo"), |

1335 | /// ]); |

1336 | /// seq.dedup(); |

1337 | /// |

1338 | /// assert_eq!(Seq::from_iter([Literal::inexact("foo")]), seq); |

1339 | /// ``` |

1340 | #[inline] |

1341 | pub fn dedup(&mut self) { |

1342 | if let Some(ref mut lits) = self.literals { |

1343 | lits.dedup_by(|lit1, lit2| { |

1344 | if lit1.as_bytes() != lit2.as_bytes() { |

1345 | return false; |

1346 | } |

1347 | if lit1.is_exact() != lit2.is_exact() { |

1348 | lit1.make_inexact(); |

1349 | lit2.make_inexact(); |

1350 | } |

1351 | true |

1352 | }); |

1353 | } |

1354 | } |

1355 | |

1356 | /// Sorts this sequence of literals lexicographically. |

1357 | /// |

1358 | /// Note that if, before sorting, if a literal that is a prefix of another |

1359 | /// literal appears after it, then after sorting, the sequence will not |

1360 | /// represent the same preference order match semantics. For example, |

1361 | /// sorting the sequence `[samwise, sam]` yields the sequence `[sam, |

1362 | /// samwise]`. Under preference order semantics, the latter sequence will |

1363 | /// never match `samwise` where as the first sequence can. |

1364 | /// |

1365 | /// # Example |

1366 | /// |

1367 | /// This example shows basic usage. |

1368 | /// |

1369 | /// ``` |

1370 | /// use regex_syntax::hir::literal::Seq; |

1371 | /// |

1372 | /// let mut seq = Seq::new(&["foo", "quux", "bar"]); |

1373 | /// seq.sort(); |

1374 | /// |

1375 | /// assert_eq!(Seq::new(&["bar", "foo", "quux"]), seq); |

1376 | /// ``` |

1377 | #[inline] |

1378 | pub fn sort(&mut self) { |

1379 | if let Some(ref mut lits) = self.literals { |

1380 | lits.sort(); |

1381 | } |

1382 | } |

1383 | |

1384 | /// Reverses all of the literals in this sequence. |

1385 | /// |

1386 | /// The order of the sequence itself is preserved. |

1387 | /// |

1388 | /// # Example |

1389 | /// |

1390 | /// This example shows basic usage. |

1391 | /// |

1392 | /// ``` |

1393 | /// use regex_syntax::hir::literal::Seq; |

1394 | /// |

1395 | /// let mut seq = Seq::new(&["oof", "rab"]); |

1396 | /// seq.reverse_literals(); |

1397 | /// assert_eq!(Seq::new(&["foo", "bar"]), seq); |

1398 | /// ``` |

1399 | #[inline] |

1400 | pub fn reverse_literals(&mut self) { |

1401 | if let Some(ref mut lits) = self.literals { |

1402 | for lit in lits.iter_mut() { |

1403 | lit.reverse(); |

1404 | } |

1405 | } |

1406 | } |

1407 | |

1408 | /// Shrinks this seq to its minimal size while respecting the preference |

1409 | /// order of its literals. |

1410 | /// |

1411 | /// While this routine will remove duplicate literals from this seq, it |

1412 | /// will also remove literals that can never match in a leftmost-first or |

1413 | /// "preference order" search. Similar to [`Seq::dedup`], if a literal is |

1414 | /// deduped, then the one that remains is made inexact. |

1415 | /// |

1416 | /// This is a no-op on seqs that are empty or not finite. |

1417 | /// |

1418 | /// # Example |

1419 | /// |

1420 | /// This example shows the difference between `{sam, samwise}` and |

1421 | /// `{samwise, sam}`. |

1422 | /// |

1423 | /// ``` |

1424 | /// use regex_syntax::hir::literal::{Literal, Seq}; |

1425 | /// |

1426 | /// // If 'sam' comes before 'samwise' and a preference order search is |

1427 | /// // executed, then 'samwise' can never match. |

1428 | /// let mut seq = Seq::new(&["sam", "samwise"]); |

1429 | /// seq.minimize_by_preference(); |

1430 | /// assert_eq!(Seq::from_iter([Literal::inexact("sam")]), seq); |

1431 | /// |

1432 | /// // But if they are reversed, then it's possible for 'samwise' to match |

1433 | /// // since it is given higher preference. |

1434 | /// let mut seq = Seq::new(&["samwise", "sam"]); |

1435 | /// seq.minimize_by_preference(); |

1436 | /// assert_eq!(Seq::new(&["samwise", "sam"]), seq); |

1437 | /// ``` |

1438 | /// |

1439 | /// This example shows that if an empty string is in this seq, then |

1440 | /// anything that comes after it can never match. |

1441 | /// |

1442 | /// ``` |

1443 | /// use regex_syntax::hir::literal::{Literal, Seq}; |

1444 | /// |

1445 | /// // An empty string is a prefix of all strings, so it automatically |

1446 | /// // inhibits any subsequent strings from matching. |

1447 | /// let mut seq = Seq::new(&["foo", "bar", "", "quux", "fox"]); |

1448 | /// seq.minimize_by_preference(); |

1449 | /// let expected = Seq::from_iter([ |

1450 | /// Literal::exact("foo"), |

1451 | /// Literal::exact("bar"), |

1452 | /// Literal::inexact(""), |

1453 | /// ]); |

1454 | /// assert_eq!(expected, seq); |

1455 | /// |

1456 | /// // And of course, if it's at the beginning, then it makes it impossible |

1457 | /// // for anything else to match. |

1458 | /// let mut seq = Seq::new(&["", "foo", "quux", "fox"]); |

1459 | /// seq.minimize_by_preference(); |

1460 | /// assert_eq!(Seq::from_iter([Literal::inexact("")]), seq); |

1461 | /// ``` |

1462 | #[inline] |

1463 | pub fn minimize_by_preference(&mut self) { |

1464 | if let Some(ref mut lits) = self.literals { |

1465 | PreferenceTrie::minimize(lits, false); |

1466 | } |

1467 | } |

1468 | |

1469 | /// Trims all literals in this seq such that only the first `len` bytes |

1470 | /// remain. If a literal has less than or equal to `len` bytes, then it |

1471 | /// remains unchanged. Otherwise, it is trimmed and made inexact. |

1472 | /// |

1473 | /// # Example |

1474 | /// |

1475 | /// ``` |

1476 | /// use regex_syntax::hir::literal::{Literal, Seq}; |

1477 | /// |

1478 | /// let mut seq = Seq::new(&["a", "foo", "quux"]); |

1479 | /// seq.keep_first_bytes(2); |

1480 | /// |

1481 | /// let expected = Seq::from_iter([ |

1482 | /// Literal::exact("a"), |

1483 | /// Literal::inexact("fo"), |

1484 | /// Literal::inexact("qu"), |

1485 | /// ]); |

1486 | /// assert_eq!(expected, seq); |

1487 | /// ``` |

1488 | #[inline] |

1489 | pub fn keep_first_bytes(&mut self, len: usize) { |

1490 | if let Some(ref mut lits) = self.literals { |

1491 | for m in lits.iter_mut() { |

1492 | m.keep_first_bytes(len); |

1493 | } |

1494 | } |

1495 | } |

1496 | |

1497 | /// Trims all literals in this seq such that only the last `len` bytes |

1498 | /// remain. If a literal has less than or equal to `len` bytes, then it |

1499 | /// remains unchanged. Otherwise, it is trimmed and made inexact. |

1500 | /// |

1501 | /// # Example |

1502 | /// |

1503 | /// ``` |

1504 | /// use regex_syntax::hir::literal::{Literal, Seq}; |

1505 | /// |

1506 | /// let mut seq = Seq::new(&["a", "foo", "quux"]); |

1507 | /// seq.keep_last_bytes(2); |

1508 | /// |

1509 | /// let expected = Seq::from_iter([ |

1510 | /// Literal::exact("a"), |

1511 | /// Literal::inexact("oo"), |

1512 | /// Literal::inexact("ux"), |

1513 | /// ]); |

1514 | /// assert_eq!(expected, seq); |

1515 | /// ``` |

1516 | #[inline] |

1517 | pub fn keep_last_bytes(&mut self, len: usize) { |

1518 | if let Some(ref mut lits) = self.literals { |

1519 | for m in lits.iter_mut() { |

1520 | m.keep_last_bytes(len); |

1521 | } |

1522 | } |

1523 | } |

1524 | |

1525 | /// Returns true if this sequence is finite. |

1526 | /// |

1527 | /// When false, this sequence is infinite and must be treated as if it |

1528 | /// contains every possible literal. |

1529 | #[inline] |

1530 | pub fn is_finite(&self) -> bool { |

1531 | self.literals.is_some() |

1532 | } |

1533 | |

1534 | /// Returns true if and only if this sequence is finite and empty. |

1535 | /// |

1536 | /// An empty sequence never matches anything. It can only be produced by |

1537 | /// literal extraction when the corresponding regex itself cannot match. |

1538 | #[inline] |

1539 | pub fn is_empty(&self) -> bool { |

1540 | self.len() == Some(0) |

1541 | } |

1542 | |

1543 | /// Returns the number of literals in this sequence if the sequence is |

1544 | /// finite. If the sequence is infinite, then `None` is returned. |

1545 | #[inline] |

1546 | pub fn len(&self) -> Option<usize> { |

1547 | self.literals.as_ref().map(|lits| lits.len()) |

1548 | } |

1549 | |

1550 | /// Returns true if and only if all literals in this sequence are exact. |

1551 | /// |

1552 | /// This returns false if the sequence is infinite. |

1553 | #[inline] |

1554 | pub fn is_exact(&self) -> bool { |

1555 | self.literals().map_or(false, |lits| lits.iter().all(|x| x.is_exact())) |

1556 | } |

1557 | |

1558 | /// Returns true if and only if all literals in this sequence are inexact. |

1559 | /// |

1560 | /// This returns true if the sequence is infinite. |

1561 | #[inline] |

1562 | pub fn is_inexact(&self) -> bool { |

1563 | self.literals().map_or(true, |lits| lits.iter().all(|x| !x.is_exact())) |

1564 | } |

1565 | |

1566 | /// Return the maximum length of the sequence that would result from |

1567 | /// unioning `self` with `other`. If either set is infinite, then this |

1568 | /// returns `None`. |

1569 | #[inline] |

1570 | pub fn max_union_len(&self, other: &Seq) -> Option<usize> { |

1571 | let len1 = self.len()?; |

1572 | let len2 = other.len()?; |

1573 | Some(len1.saturating_add(len2)) |

1574 | } |

1575 | |

1576 | /// Return the maximum length of the sequence that would result from the |

1577 | /// cross product of `self` with `other`. If either set is infinite, then |

1578 | /// this returns `None`. |

1579 | #[inline] |

1580 | pub fn max_cross_len(&self, other: &Seq) -> Option<usize> { |

1581 | let len1 = self.len()?; |

1582 | let len2 = other.len()?; |

1583 | Some(len1.saturating_mul(len2)) |

1584 | } |

1585 | |

1586 | /// Returns the length of the shortest literal in this sequence. |

1587 | /// |

1588 | /// If the sequence is infinite or empty, then this returns `None`. |

1589 | #[inline] |

1590 | pub fn min_literal_len(&self) -> Option<usize> { |

1591 | self.literals.as_ref()?.iter().map(|x| x.len()).min() |

1592 | } |

1593 | |

1594 | /// Returns the length of the longest literal in this sequence. |

1595 | /// |

1596 | /// If the sequence is infinite or empty, then this returns `None`. |

1597 | #[inline] |

1598 | pub fn max_literal_len(&self) -> Option<usize> { |

1599 | self.literals.as_ref()?.iter().map(|x| x.len()).max() |

1600 | } |

1601 | |

1602 | /// Returns the longest common prefix from this seq. |

1603 | /// |

1604 | /// If the seq matches any literal or other contains no literals, then |

1605 | /// there is no meaningful prefix and this returns `None`. |

1606 | /// |

1607 | /// # Example |

1608 | /// |

1609 | /// This shows some example seqs and their longest common prefix. |

1610 | /// |

1611 | /// ``` |

1612 | /// use regex_syntax::hir::literal::Seq; |

1613 | /// |

1614 | /// let seq = Seq::new(&["foo", "foobar", "fo"]); |

1615 | /// assert_eq!(Some(&b"fo"[..]), seq.longest_common_prefix()); |

1616 | /// let seq = Seq::new(&["foo", "foo"]); |

1617 | /// assert_eq!(Some(&b"foo"[..]), seq.longest_common_prefix()); |

1618 | /// let seq = Seq::new(&["foo", "bar"]); |

1619 | /// assert_eq!(Some(&b""[..]), seq.longest_common_prefix()); |

1620 | /// let seq = Seq::new(&[""]); |

1621 | /// assert_eq!(Some(&b""[..]), seq.longest_common_prefix()); |

1622 | /// |

1623 | /// let seq = Seq::infinite(); |

1624 | /// assert_eq!(None, seq.longest_common_prefix()); |

1625 | /// let seq = Seq::empty(); |

1626 | /// assert_eq!(None, seq.longest_common_prefix()); |

1627 | /// ``` |

1628 | #[inline] |

1629 | pub fn longest_common_prefix(&self) -> Option<&[u8]> { |

1630 | // If we match everything or match nothing, then there's no meaningful |

1631 | // longest common prefix. |

1632 | let lits = match self.literals { |

1633 | None => return None, |

1634 | Some(ref lits) => lits, |

1635 | }; |

1636 | if lits.len() == 0 { |

1637 | return None; |

1638 | } |

1639 | let base = lits[0].as_bytes(); |

1640 | let mut len = base.len(); |

1641 | for m in lits.iter().skip(1) { |

1642 | len = m |

1643 | .as_bytes() |

1644 | .iter() |

1645 | .zip(base[..len].iter()) |

1646 | .take_while(|&(a, b)| a == b) |

1647 | .count(); |

1648 | if len == 0 { |

1649 | return Some(&[]); |

1650 | } |

1651 | } |

1652 | Some(&base[..len]) |

1653 | } |

1654 | |

1655 | /// Returns the longest common suffix from this seq. |

1656 | /// |

1657 | /// If the seq matches any literal or other contains no literals, then |

1658 | /// there is no meaningful suffix and this returns `None`. |

1659 | /// |

1660 | /// # Example |

1661 | /// |

1662 | /// This shows some example seqs and their longest common suffix. |

1663 | /// |

1664 | /// ``` |

1665 | /// use regex_syntax::hir::literal::Seq; |

1666 | /// |

1667 | /// let seq = Seq::new(&["oof", "raboof", "of"]); |

1668 | /// assert_eq!(Some(&b"of"[..]), seq.longest_common_suffix()); |

1669 | /// let seq = Seq::new(&["foo", "foo"]); |

1670 | /// assert_eq!(Some(&b"foo"[..]), seq.longest_common_suffix()); |

1671 | /// let seq = Seq::new(&["foo", "bar"]); |

1672 | /// assert_eq!(Some(&b""[..]), seq.longest_common_suffix()); |

1673 | /// let seq = Seq::new(&[""]); |

1674 | /// assert_eq!(Some(&b""[..]), seq.longest_common_suffix()); |

1675 | /// |

1676 | /// let seq = Seq::infinite(); |

1677 | /// assert_eq!(None, seq.longest_common_suffix()); |

1678 | /// let seq = Seq::empty(); |

1679 | /// assert_eq!(None, seq.longest_common_suffix()); |

1680 | /// ``` |

1681 | #[inline] |

1682 | pub fn longest_common_suffix(&self) -> Option<&[u8]> { |

1683 | // If we match everything or match nothing, then there's no meaningful |

1684 | // longest common suffix. |

1685 | let lits = match self.literals { |

1686 | None => return None, |

1687 | Some(ref lits) => lits, |

1688 | }; |

1689 | if lits.len() == 0 { |

1690 | return None; |

1691 | } |

1692 | let base = lits[0].as_bytes(); |

1693 | let mut len = base.len(); |

1694 | for m in lits.iter().skip(1) { |

1695 | len = m |

1696 | .as_bytes() |

1697 | .iter() |

1698 | .rev() |

1699 | .zip(base[base.len() - len..].iter().rev()) |

1700 | .take_while(|&(a, b)| a == b) |

1701 | .count(); |

1702 | if len == 0 { |

1703 | return Some(&[]); |

1704 | } |

1705 | } |

1706 | Some(&base[base.len() - len..]) |

1707 | } |

1708 | |

1709 | /// Optimizes this seq while treating its literals as prefixes and |

1710 | /// respecting the preference order of its literals. |

1711 | /// |

1712 | /// The specific way "optimization" works is meant to be an implementation |

1713 | /// detail, as it essentially represents a set of heuristics. The goal |

1714 | /// that optimization tries to accomplish is to make the literals in this |

1715 | /// set reflect inputs that will result in a more effective prefilter. |

1716 | /// Principally by reducing the false positive rate of candidates found by |

1717 | /// the literals in this sequence. That is, when a match of a literal is |

1718 | /// found, we would like it to be a strong predictor of the overall match |

1719 | /// of the regex. If it isn't, then much time will be spent starting and |

1720 | /// stopping the prefilter search and attempting to confirm the match only |

1721 | /// to have it fail. |

1722 | /// |

1723 | /// Some of those heuristics might be: |

1724 | /// |

1725 | /// * Identifying a common prefix from a larger sequence of literals, and |

1726 | /// shrinking the sequence down to that single common prefix. |

1727 | /// * Rejecting the sequence entirely if it is believed to result in very |

1728 | /// high false positive rate. When this happens, the sequence is made |

1729 | /// infinite. |

1730 | /// * Shrinking the sequence to a smaller number of literals representing |

1731 | /// prefixes, but not shrinking it so much as to make literals too short. |

1732 | /// (A sequence with very short literals, of 1 or 2 bytes, will typically |

1733 | /// result in a higher false positive rate.) |

1734 | /// |

1735 | /// Optimization should only be run once extraction is complete. Namely, |

1736 | /// optimization may make assumptions that do not compose with other |

1737 | /// operations in the middle of extraction. For example, optimization will |

1738 | /// reduce `[E(sam), E(samwise)]` to `[E(sam)]`, but such a transformation |

1739 | /// is only valid if no other extraction will occur. If other extraction |

1740 | /// may occur, then the correct transformation would be to `[I(sam)]`. |

1741 | /// |

1742 | /// The [`Seq::optimize_for_suffix_by_preference`] does the same thing, but |

1743 | /// for suffixes. |

1744 | /// |

1745 | /// # Example |

1746 | /// |

1747 | /// This shows how optimization might transform a sequence. Note that |

1748 | /// the specific behavior is not a documented guarantee. The heuristics |

1749 | /// used are an implementation detail and may change over time in semver |

1750 | /// compatible releases. |

1751 | /// |

1752 | /// ``` |

1753 | /// use regex_syntax::hir::literal::{Seq, Literal}; |

1754 | /// |

1755 | /// let mut seq = Seq::new(&[ |

1756 | /// "samantha", |

1757 | /// "sam", |

1758 | /// "samwise", |

1759 | /// "frodo", |

1760 | /// ]); |

1761 | /// seq.optimize_for_prefix_by_preference(); |

1762 | /// assert_eq!(Seq::from_iter([ |

1763 | /// Literal::exact("samantha"), |

1764 | /// // Kept exact even though 'samwise' got pruned |

1765 | /// // because optimization assumes literal extraction |

1766 | /// // has finished. |

1767 | /// Literal::exact("sam"), |

1768 | /// Literal::exact("frodo"), |

1769 | /// ]), seq); |

1770 | /// ``` |

1771 | /// |

1772 | /// # Example: optimization may make the sequence infinite |

1773 | /// |

1774 | /// If the heuristics deem that the sequence could cause a very high false |

1775 | /// positive rate, then it may make the sequence infinite, effectively |

1776 | /// disabling its use as a prefilter. |

1777 | /// |

1778 | /// ``` |

1779 | /// use regex_syntax::hir::literal::{Seq, Literal}; |

1780 | /// |

1781 | /// let mut seq = Seq::new(&[ |

1782 | /// "samantha", |

1783 | /// // An empty string matches at every position, |

1784 | /// // thus rendering the prefilter completely |

1785 | /// // ineffective. |

1786 | /// "", |

1787 | /// "sam", |

1788 | /// "samwise", |

1789 | /// "frodo", |

1790 | /// ]); |

1791 | /// seq.optimize_for_prefix_by_preference(); |

1792 | /// assert!(!seq.is_finite()); |

1793 | /// ``` |

1794 | /// |

1795 | /// Do note that just because there is a `" "` in the sequence, that |

1796 | /// doesn't mean the sequence will always be made infinite after it is |

1797 | /// optimized. Namely, if the sequence is considered exact (any match |

1798 | /// corresponds to an overall match of the original regex), then any match |

1799 | /// is an overall match, and so the false positive rate is always `0`. |

1800 | /// |

1801 | /// To demonstrate this, we remove `samwise` from our sequence. This |

1802 | /// results in no optimization happening and all literals remain exact. |

1803 | /// Thus the entire sequence is exact, and it is kept as-is, even though |

1804 | /// one is an ASCII space: |

1805 | /// |

1806 | /// ``` |

1807 | /// use regex_syntax::hir::literal::{Seq, Literal}; |

1808 | /// |

1809 | /// let mut seq = Seq::new(&[ |

1810 | /// "samantha", |

1811 | /// " ", |

1812 | /// "sam", |

1813 | /// "frodo", |

1814 | /// ]); |

1815 | /// seq.optimize_for_prefix_by_preference(); |

1816 | /// assert!(seq.is_finite()); |

1817 | /// ``` |

1818 | #[inline] |

1819 | pub fn optimize_for_prefix_by_preference(&mut self) { |

1820 | self.optimize_by_preference(true); |

1821 | } |

1822 | |

1823 | /// Optimizes this seq while treating its literals as suffixes and |

1824 | /// respecting the preference order of its literals. |

1825 | /// |

1826 | /// Optimization should only be run once extraction is complete. |

1827 | /// |

1828 | /// The [`Seq::optimize_for_prefix_by_preference`] does the same thing, but |

1829 | /// for prefixes. See its documentation for more explanation. |

1830 | #[inline] |

1831 | pub fn optimize_for_suffix_by_preference(&mut self) { |

1832 | self.optimize_by_preference(false); |

1833 | } |

1834 | |

1835 | fn optimize_by_preference(&mut self, prefix: bool) { |

1836 | let origlen = match self.len() { |

1837 | None => return, |

1838 | Some(len) => len, |

1839 | }; |

1840 | // Just give up now if our sequence contains an empty string. |

1841 | if self.min_literal_len().map_or(false, |len| len == 0) { |

1842 | // We squash the sequence so that nobody else gets any bright |

1843 | // ideas to try and use it. An empty string implies a match at |

1844 | // every position. A prefilter cannot help you here. |

1845 | self.make_infinite(); |

1846 | return; |

1847 | } |

1848 | // Make sure we start with the smallest sequence possible. We use a |

1849 | // special version of preference minimization that retains exactness. |

1850 | // This is legal because optimization is only expected to occur once |

1851 | // extraction is complete. |

1852 | if prefix { |

1853 | if let Some(ref mut lits) = self.literals { |

1854 | PreferenceTrie::minimize(lits, true); |

1855 | } |

1856 | } |

1857 | |

1858 | // Look for a common prefix (or suffix). If we found one of those and |

1859 | // it's long enough, then it's a good bet that it will be our fastest |

1860 | // possible prefilter since single-substring search is so fast. |

1861 | let fix = if prefix { |

1862 | self.longest_common_prefix() |

1863 | } else { |

1864 | self.longest_common_suffix() |

1865 | }; |

1866 | if let Some(fix) = fix { |

1867 | // As a special case, if we have a common prefix and the leading |

1868 | // byte of that prefix is one that we think probably occurs rarely, |

1869 | // then strip everything down to just that single byte. This should |

1870 | // promote the use of memchr. |

1871 | // |

1872 | // ... we only do this though if our sequence has more than one |

1873 | // literal. Otherwise, we'd rather just stick with a single literal |

1874 | // scan. That is, using memchr is probably better than looking |

1875 | // for 2 or more literals, but probably not as good as a straight |

1876 | // memmem search. |

1877 | // |

1878 | // ... and also only do this when the prefix is short and probably |

1879 | // not too discriminatory anyway. If it's longer, then it's |

1880 | // probably quite discriminatory and thus is likely to have a low |

1881 | // false positive rate. |

1882 | if prefix |

1883 | && origlen > 1 |

1884 | && fix.len() >= 1 |

1885 | && fix.len() <= 3 |

1886 | && rank(fix[0]) < 200 |

1887 | { |

1888 | self.keep_first_bytes(1); |

1889 | self.dedup(); |

1890 | return; |

1891 | } |

1892 | // We only strip down to the common prefix/suffix if we think |

1893 | // the existing set of literals isn't great, or if the common |

1894 | // prefix/suffix is expected to be particularly discriminatory. |

1895 | let isfast = |

1896 | self.is_exact() && self.len().map_or(false, |len| len <= 16); |

1897 | let usefix = fix.len() > 4 || (fix.len() > 1 && !isfast); |

1898 | if usefix { |

1899 | // If we keep exactly the number of bytes equal to the length |

1900 | // of the prefix (or suffix), then by the definition of a |

1901 | // prefix, every literal in the sequence will be equivalent. |

1902 | // Thus, 'dedup' will leave us with one literal. |

1903 | // |

1904 | // We do it this way to avoid an alloc, but also to make sure |

1905 | // the exactness of literals is kept (or not). |

1906 | if prefix { |

1907 | self.keep_first_bytes(fix.len()); |

1908 | } else { |

1909 | self.keep_last_bytes(fix.len()); |

1910 | } |

1911 | self.dedup(); |

1912 | assert_eq!(Some(1), self.len()); |

1913 | // We still fall through here. In particular, we want our |

1914 | // longest common prefix to be subject to the poison check. |

1915 | } |

1916 | } |

1917 | // If we have an exact sequence, we *probably* just want to keep it |

1918 | // as-is. But there are some cases where we don't. So we save a copy of |

1919 | // the exact sequence now, and then try to do some more optimizations |

1920 | // below. If those don't work out, we go back to this exact sequence. |

1921 | // |

1922 | // The specific motivation for this is that we sometimes wind up with |

1923 | // an exact sequence with a hefty number of literals. Say, 100. If we |

1924 | // stuck with that, it would be too big for Teddy and would result in |

1925 | // using Aho-Corasick. Which is fine... but the lazy DFA is plenty |

1926 | // suitable in such cases. The real issue is that we will wind up not |

1927 | // using a fast prefilter at all. So in cases like this, even though |

1928 | // we have an exact sequence, it would be better to try and shrink the |

1929 | // sequence (which we do below) and use it as a prefilter that can |

1930 | // produce false positive matches. |

1931 | // |

1932 | // But if the shrinking below results in a sequence that "sucks," then |

1933 | // we don't want to use that because we already have an exact sequence |

1934 | // in hand. |

1935 | let exact: Option<Seq> = |

1936 | if self.is_exact() { Some(self.clone()) } else { None }; |

1937 | // Now we attempt to shorten the sequence. The idea here is that we |

1938 | // don't want to look for too many literals, but we want to shorten |

1939 | // our sequence enough to improve our odds of using better algorithms |

1940 | // downstream (such as Teddy). |

1941 | // |

1942 | // The pair of numbers in this list corresponds to the maximal prefix |

1943 | // (in bytes) to keep for all literals and the length of the sequence |

1944 | // at which to do it. |

1945 | // |

1946 | // So for example, the pair (3, 500) would mean, "if we have more than |

1947 | // 500 literals in our sequence, then truncate all of our literals |

1948 | // such that they are at most 3 bytes in length and the minimize the |

1949 | // sequence." |

1950 | const ATTEMPTS: [(usize, usize); 5] = |

1951 | [(5, 10), (4, 10), (3, 64), (2, 64), (1, 10)]; |

1952 | for (keep, limit) in ATTEMPTS { |

1953 | let len = match self.len() { |

1954 | None => break, |

1955 | Some(len) => len, |

1956 | }; |

1957 | if len <= limit { |

1958 | break; |

1959 | } |

1960 | if prefix { |

1961 | self.keep_first_bytes(keep); |

1962 | } else { |

1963 | self.keep_last_bytes(keep); |

1964 | } |

1965 | if prefix { |

1966 | if let Some(ref mut lits) = self.literals { |

1967 | PreferenceTrie::minimize(lits, true); |

1968 | } |

1969 | } |

1970 | } |

1971 | // Check for a poison literal. A poison literal is one that is short |

1972 | // and is believed to have a very high match count. These poisons |

1973 | // generally lead to a prefilter with a very high false positive rate, |

1974 | // and thus overall worse performance. |

1975 | // |

1976 | // We do this last because we could have gone from a non-poisonous |

1977 | // sequence to a poisonous one. Perhaps we should add some code to |

1978 | // prevent such transitions in the first place, but then again, we |

1979 | // likely only made the transition in the first place if the sequence |

1980 | // was itself huge. And huge sequences are themselves poisonous. So... |

1981 | if let Some(lits) = self.literals() { |

1982 | if lits.iter().any(|lit| lit.is_poisonous()) { |

1983 | self.make_infinite(); |

1984 | } |

1985 | } |

1986 | // OK, if we had an exact sequence before attempting more optimizations |

1987 | // above and our post-optimized sequence sucks for some reason or |

1988 | // another, then we go back to the exact sequence. |

1989 | if let Some(exact) = exact { |

1990 | // If optimizing resulted in dropping our literals, then certainly |

1991 | // backup and use the exact sequence that we had. |

1992 | if !self.is_finite() { |

1993 | *self = exact; |

1994 | return; |

1995 | } |

1996 | // If our optimized sequence contains a short literal, then it's |

1997 | // *probably* not so great. So throw it away and revert to the |

1998 | // exact sequence. |

1999 | if self.min_literal_len().map_or(true, |len| len <= 2) { |

2000 | *self = exact; |

2001 | return; |

2002 | } |

2003 | // Finally, if our optimized sequence is "big" (i.e., can't use |

2004 | // Teddy), then also don't use it and rely on the exact sequence. |

2005 | if self.len().map_or(true, |len| len > 64) { |

2006 | *self = exact; |

2007 | return; |

2008 | } |

2009 | } |

2010 | } |

2011 | } |

2012 | |

2013 | impl core::fmt::Debug for Seq { |

2014 | fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { |

2015 | write!(f, "Seq")?; |

2016 | if let Some(lits) = self.literals() { |

2017 | f.debug_list().entries(lits.iter()).finish() |

2018 | } else { |

2019 | write!(f, "[∞]") |

2020 | } |

2021 | } |

2022 | } |

2023 | |

2024 | impl FromIterator<Literal> for Seq { |

2025 | fn from_iter<T: IntoIterator<Item = Literal>>(it: T) -> Seq { |

2026 | let mut seq = Seq::empty(); |

2027 | for literal in it { |

2028 | seq.push(literal); |

2029 | } |

2030 | seq |

2031 | } |

2032 | } |

2033 | |

2034 | /// A single literal extracted from an [`Hir`] expression. |

2035 | /// |

2036 | /// A literal is composed of two things: |

2037 | /// |

2038 | /// * A sequence of bytes. No guarantees with respect to UTF-8 are provided. |

2039 | /// In particular, even if the regex a literal is extracted from is UTF-8, the |

2040 | /// literal extracted may not be valid UTF-8. (For example, if an [`Extractor`] |

2041 | /// limit resulted in trimming a literal in a way that splits a codepoint.) |

2042 | /// * Whether the literal is "exact" or not. An "exact" literal means that it |

2043 | /// has not been trimmed, and may continue to be extended. If a literal is |

2044 | /// "exact" after visiting the entire `Hir` expression, then this implies that |

2045 | /// the literal leads to a match state. (Although it doesn't necessarily imply |

2046 | /// all occurrences of the literal correspond to a match of the regex, since |

2047 | /// literal extraction ignores look-around assertions.) |

2048 | #[derive(Clone, Eq, PartialEq, PartialOrd, Ord)] |

2049 | pub struct Literal { |

2050 | bytes: Vec<u8>, |

2051 | exact: bool, |

2052 | } |

2053 | |

2054 | impl Literal { |

2055 | /// Returns a new exact literal containing the bytes given. |

2056 | #[inline] |

2057 | pub fn exact<B: Into<Vec<u8>>>(bytes: B) -> Literal { |

2058 | Literal { bytes: bytes.into(), exact: true } |

2059 | } |

2060 | |

2061 | /// Returns a new inexact literal containing the bytes given. |

2062 | #[inline] |

2063 | pub fn inexact<B: Into<Vec<u8>>>(bytes: B) -> Literal { |

2064 | Literal { bytes: bytes.into(), exact: false } |

2065 | } |

2066 | |

2067 | /// Returns the bytes in this literal. |

2068 | #[inline] |

2069 | pub fn as_bytes(&self) -> &[u8] { |

2070 | &self.bytes |

2071 | } |

2072 | |

2073 | /// Yields ownership of the bytes inside this literal. |

2074 | /// |

2075 | /// Note that this throws away whether the literal is "exact" or not. |

2076 | #[inline] |

2077 | pub fn into_bytes(self) -> Vec<u8> { |

2078 | self.bytes |

2079 | } |

2080 | |

2081 | /// Returns the length of this literal in bytes. |

2082 | #[inline] |

2083 | pub fn len(&self) -> usize { |

2084 | self.as_bytes().len() |

2085 | } |

2086 | |

2087 | /// Returns true if and only if this literal has zero bytes. |

2088 | #[inline] |

2089 | pub fn is_empty(&self) -> bool { |

2090 | self.len() == 0 |

2091 | } |

2092 | |

2093 | /// Returns true if and only if this literal is exact. |

2094 | #[inline] |

2095 | pub fn is_exact(&self) -> bool { |

2096 | self.exact |

2097 | } |

2098 | |

2099 | /// Marks this literal as inexact. |

2100 | /// |

2101 | /// Inexact literals can never be extended. For example, |

2102 | /// [`Seq::cross_forward`] will not extend inexact literals. |

2103 | #[inline] |

2104 | pub fn make_inexact(&mut self) { |

2105 | self.exact = false; |

2106 | } |

2107 | |

2108 | /// Reverse the bytes in this literal. |

2109 | #[inline] |

2110 | pub fn reverse(&mut self) { |

2111 | self.bytes.reverse(); |

2112 | } |

2113 | |

2114 | /// Extend this literal with the literal given. |

2115 | /// |

2116 | /// If this literal is inexact, then this is a no-op. |

2117 | #[inline] |

2118 | pub fn extend(&mut self, lit: &Literal) { |

2119 | if !self.is_exact() { |

2120 | return; |

2121 | } |

2122 | self.bytes.extend_from_slice(&lit.bytes); |

2123 | } |

2124 | |

2125 | /// Trims this literal such that only the first `len` bytes remain. If |

2126 | /// this literal has fewer than `len` bytes, then it remains unchanged. |

2127 | /// Otherwise, the literal is marked as inexact. |

2128 | #[inline] |

2129 | pub fn keep_first_bytes(&mut self, len: usize) { |

2130 | if len >= self.len() { |

2131 | return; |

2132 | } |

2133 | self.make_inexact(); |

2134 | self.bytes.truncate(len); |

2135 | } |

2136 | |

2137 | /// Trims this literal such that only the last `len` bytes remain. If this |

2138 | /// literal has fewer than `len` bytes, then it remains unchanged. |

2139 | /// Otherwise, the literal is marked as inexact. |

2140 | #[inline] |

2141 | pub fn keep_last_bytes(&mut self, len: usize) { |

2142 | if len >= self.len() { |

2143 | return; |

2144 | } |

2145 | self.make_inexact(); |

2146 | self.bytes.drain(..self.len() - len); |

2147 | } |

2148 | |

2149 | /// Returns true if it is believe that this literal is likely to match very |

2150 | /// frequently, and is thus not a good candidate for a prefilter. |

2151 | fn is_poisonous(&self) -> bool { |

2152 | self.is_empty() || (self.len() == 1 && rank(self.as_bytes()[0]) >= 250) |

2153 | } |

2154 | } |

2155 | |

2156 | impl From<u8> for Literal { |

2157 | fn from(byte: u8) -> Literal { |

2158 | Literal::exact(vec![byte]) |

2159 | } |

2160 | } |

2161 | |

2162 | impl From<char> for Literal { |

2163 | fn from(ch: char) -> Literal { |

2164 | use alloc::string::ToString; |

2165 | Literal::exact(ch.encode_utf8(&mut [0; 4]).to_string()) |

2166 | } |

2167 | } |

2168 | |

2169 | impl AsRef<[u8]> for Literal { |

2170 | fn as_ref(&self) -> &[u8] { |

2171 | self.as_bytes() |

2172 | } |

2173 | } |

2174 | |

2175 | impl core::fmt::Debug for Literal { |

2176 | fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { |

2177 | let tag = if self.exact { "E"} else { "I"}; |

2178 | f.debug_tuple(tag) |

2179 | .field(&crate::debug::Bytes(self.as_bytes())) |

2180 | .finish() |

2181 | } |

2182 | } |

2183 | |

2184 | /// A "preference" trie that rejects literals that will never match when |

2185 | /// executing a leftmost first or "preference" search. |

2186 | /// |

2187 | /// For example, if 'sam' is inserted, then trying to insert 'samwise' will be |

2188 | /// rejected because 'samwise' can never match since 'sam' will always take |

2189 | /// priority. However, if 'samwise' is inserted first, then inserting 'sam' |

2190 | /// after it is accepted. In this case, either 'samwise' or 'sam' can match in |

2191 | /// a "preference" search. |

2192 | /// |

2193 | /// Note that we only use this trie as a "set." That is, given a sequence of |

2194 | /// literals, we insert each one in order. An `insert` will reject a literal |

2195 | /// if a prefix of that literal already exists in the trie. Thus, to rebuild |

2196 | /// the "minimal" sequence, we simply only keep literals that were successfully |

2197 | /// inserted. (Since we don't need traversal, one wonders whether we can make |

2198 | /// some simplifications here, but I haven't given it a ton of thought and I've |

2199 | /// never seen this show up on a profile. Because of the heuristic limits |

2200 | /// imposed on literal extractions, the size of the inputs here is usually |

2201 | /// very small.) |

2202 | #[derive(Debug)] |

2203 | struct PreferenceTrie { |

2204 | /// The states in this trie. The index of a state in this vector is its ID. |

2205 | states: Vec<State>, |

2206 | /// This vec indicates which states are match states. It always has |

2207 | /// the same length as `states` and is indexed by the same state ID. |

2208 | /// A state with identifier `sid` is a match state if and only if |

2209 | /// `matches[sid].is_some()`. The option contains the index of the literal |

2210 | /// corresponding to the match. The index is offset by 1 so that it fits in |

2211 | /// a NonZeroUsize. |

2212 | matches: Vec<Option<NonZeroUsize>>, |

2213 | /// The index to allocate to the next literal added to this trie. Starts at |

2214 | /// 1 and increments by 1 for every literal successfully added to the trie. |

2215 | next_literal_index: usize, |

2216 | } |

2217 | |

2218 | /// A single state in a trie. Uses a sparse representation for its transitions. |

2219 | #[derive(Debug, Default)] |

2220 | struct State { |

2221 | /// Sparse representation of the transitions out of this state. Transitions |

2222 | /// are sorted by byte. There is at most one such transition for any |

2223 | /// particular byte. |

2224 | trans: Vec<(u8, usize)>, |

2225 | } |

2226 | |

2227 | impl PreferenceTrie { |

2228 | /// Minimizes the given sequence of literals while preserving preference |

2229 | /// order semantics. |

2230 | /// |

2231 | /// When `keep_exact` is true, the exactness of every literal retained is |

2232 | /// kept. This is useful when dealing with a fully extracted `Seq` that |

2233 | /// only contains exact literals. In that case, we can keep all retained |

2234 | /// literals as exact because we know we'll never need to match anything |

2235 | /// after them and because any removed literals are guaranteed to never |

2236 | /// match. |

2237 | fn minimize(literals: &mut Vec<Literal>, keep_exact: bool) { |

2238 | let mut trie = PreferenceTrie { |

2239 | states: vec![], |

2240 | matches: vec![], |

2241 | next_literal_index: 1, |

2242 | }; |

2243 | let mut make_inexact = vec![]; |

2244 | literals.retain_mut(|lit| match trie.insert(lit.as_bytes()) { |

2245 | Ok(_) => true, |

2246 | Err(i) => { |

2247 | if !keep_exact { |

2248 | make_inexact.push(i.checked_sub(1).unwrap()); |

2249 | } |

2250 | false |

2251 | } |

2252 | }); |

2253 | for i in make_inexact { |

2254 | literals[i].make_inexact(); |

2255 | } |

2256 | } |

2257 | |

2258 | /// Returns `Ok` if the given byte string is accepted into this trie and |

2259 | /// `Err` otherwise. The index for the success case corresponds to the |

2260 | /// index of the literal added. The index for the error case corresponds to |

2261 | /// the index of the literal already in the trie that prevented the given |

2262 | /// byte string from being added. (Which implies it is a prefix of the one |

2263 | /// given.) |

2264 | /// |

2265 | /// In short, the byte string given is accepted into the trie if and only |

2266 | /// if it is possible for it to match when executing a preference order |

2267 | /// search. |

2268 | fn insert(&mut self, bytes: &[u8]) -> Result<usize, usize> { |

2269 | let mut prev = self.root(); |

2270 | if let Some(idx) = self.matches[prev] { |

2271 | return Err(idx.get()); |

2272 | } |

2273 | for &b in bytes.iter() { |

2274 | match self.states[prev].trans.binary_search_by_key(&b, |t| t.0) { |

2275 | Ok(i) => { |

2276 | prev = self.states[prev].trans[i].1; |

2277 | if let Some(idx) = self.matches[prev] { |

2278 | return Err(idx.get()); |

2279 | } |

2280 | } |

2281 | Err(i) => { |

2282 | let next = self.create_state(); |

2283 | self.states[prev].trans.insert(i, (b, next)); |

2284 | prev = next; |

2285 | } |

2286 | } |

2287 | } |

2288 | let idx = self.next_literal_index; |

2289 | self.next_literal_index += 1; |

2290 | self.matches[prev] = NonZeroUsize::new(idx); |

2291 | Ok(idx) |

2292 | } |

2293 | |

2294 | /// Returns the root state ID, and if it doesn't exist, creates it. |

2295 | fn root(&mut self) -> usize { |

2296 | if !self.states.is_empty() { |

2297 | 0 |

2298 | } else { |

2299 | self.create_state() |

2300 | } |

2301 | } |

2302 | |

2303 | /// Creates a new empty state and returns its ID. |

2304 | fn create_state(&mut self) -> usize { |

2305 | let id = self.states.len(); |

2306 | self.states.push(State::default()); |

2307 | self.matches.push(None); |

2308 | id |

2309 | } |

2310 | } |

2311 | |

2312 | /// Returns the "rank" of the given byte. |

2313 | /// |

2314 | /// The minimum rank value is `0` and the maximum rank value is `255`. |

2315 | /// |

2316 | /// The rank of a byte is derived from a heuristic background distribution of |

2317 | /// relative frequencies of bytes. The heuristic says that lower the rank of a |

2318 | /// byte, the less likely that byte is to appear in any arbitrary haystack. |

2319 | pub fn rank(byte: u8) -> u8 { |

2320 | crate::rank::BYTE_FREQUENCIES[usize::from(byte)] |

2321 | } |

2322 | |

2323 | #[cfg(test)] |

2324 | mod tests { |

2325 | use super::*; |

2326 | |

2327 | fn parse(pattern: &str) -> Hir { |

2328 | crate::ParserBuilder::new().utf8(false).build().parse(pattern).unwrap() |

2329 | } |

2330 | |

2331 | fn prefixes(pattern: &str) -> Seq { |

2332 | Extractor::new().kind(ExtractKind::Prefix).extract(&parse(pattern)) |

2333 | } |

2334 | |

2335 | fn suffixes(pattern: &str) -> Seq { |

2336 | Extractor::new().kind(ExtractKind::Suffix).extract(&parse(pattern)) |

2337 | } |

2338 | |

2339 | fn e(pattern: &str) -> (Seq, Seq) { |

2340 | (prefixes(pattern), suffixes(pattern)) |

2341 | } |

2342 | |

2343 | #[allow(non_snake_case)] |

2344 | fn E(x: &str) -> Literal { |

2345 | Literal::exact(x.as_bytes()) |

2346 | } |

2347 | |

2348 | #[allow(non_snake_case)] |

2349 | fn I(x: &str) -> Literal { |

2350 | Literal::inexact(x.as_bytes()) |

2351 | } |

2352 | |

2353 | fn seq<I: IntoIterator<Item = Literal>>(it: I) -> Seq { |

2354 | Seq::from_iter(it) |

2355 | } |

2356 | |

2357 | fn infinite() -> (Seq, Seq) { |

2358 | (Seq::infinite(), Seq::infinite()) |

2359 | } |

2360 | |

2361 | fn inexact<I1, I2>(it1: I1, it2: I2) -> (Seq, Seq) |

2362 | where |

2363 | I1: IntoIterator<Item = Literal>, |

2364 | I2: IntoIterator<Item = Literal>, |

2365 | { |

2366 | (Seq::from_iter(it1), Seq::from_iter(it2)) |

2367 | } |

2368 | |

2369 | fn exact<B: AsRef<[u8]>, I: IntoIterator<Item = B>>(it: I) -> (Seq, Seq) { |

2370 | let s1 = Seq::new(it); |

2371 | let s2 = s1.clone(); |

2372 | (s1, s2) |

2373 | } |

2374 | |

2375 | fn opt<B: AsRef<[u8]>, I: IntoIterator<Item = B>>(it: I) -> (Seq, Seq) { |

2376 | let (mut p, mut s) = exact(it); |

2377 | p.optimize_for_prefix_by_preference(); |

2378 | s.optimize_for_suffix_by_preference(); |

2379 | (p, s) |

2380 | } |

2381 | |

2382 | #[test] |

2383 | fn literal() { |

2384 | assert_eq!(exact(["a"]), e( "a")); |

2385 | assert_eq!(exact(["aaaaa"]), e( "aaaaa")); |

2386 | assert_eq!(exact(["A", "a"]), e( "(?i-u)a")); |

2387 | assert_eq!(exact(["AB", "Ab", "aB", "ab"]), e( "(?i-u)ab")); |

2388 | assert_eq!(exact(["abC", "abc"]), e( "ab(?i-u)c")); |

2389 | |

2390 | assert_eq!(exact([b" \xFF"]), e( r"(?-u:\xFF)")); |

2391 | |

2392 | #[cfg(feature = "unicode-case")] |

2393 | { |

2394 | assert_eq!(exact(["☃"]), e( "☃")); |

2395 | assert_eq!(exact(["☃"]), e( "(?i)☃")); |

2396 | assert_eq!(exact(["☃☃☃☃☃"]), e( "☃☃☃☃☃")); |

2397 | |

2398 | assert_eq!(exact(["Δ"]), e( "Δ")); |

2399 | assert_eq!(exact(["δ"]), e( "δ")); |

2400 | assert_eq!(exact(["Δ", "δ"]), e( "(?i)Δ")); |

2401 | assert_eq!(exact(["Δ", "δ"]), e( "(?i)δ")); |

2402 | |

2403 | assert_eq!(exact(["S", "s", "ſ"]), e( "(?i)S")); |

2404 | assert_eq!(exact(["S", "s", "ſ"]), e( "(?i)s")); |

2405 | assert_eq!(exact(["S", "s", "ſ"]), e( "(?i)ſ")); |

2406 | } |

2407 | |

2408 | let letters = "ͱͳͷΐάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋ"; |

2409 | assert_eq!(exact([letters]), e(letters)); |

2410 | } |

2411 | |

2412 | #[test] |

2413 | fn class() { |

2414 | assert_eq!(exact(["a", "b", "c"]), e( "[abc]")); |

2415 | assert_eq!(exact(["a1b", "a2b", "a3b"]), e( "a[123]b")); |

2416 | assert_eq!(exact(["δ", "ε"]), e( "[εδ]")); |

2417 | #[cfg(feature = "unicode-case")] |

2418 | { |

2419 | assert_eq!(exact(["Δ", "Ε", "δ", "ε", "ϵ"]), e( r"(?i)[εδ]")); |

2420 | } |

2421 | } |

2422 | |

2423 | #[test] |

2424 | fn look() { |

2425 | assert_eq!(exact(["ab"]), e( r"a\Ab")); |

2426 | assert_eq!(exact(["ab"]), e( r"a\zb")); |

2427 | assert_eq!(exact(["ab"]), e( r"a(?m:^)b")); |

2428 | assert_eq!(exact(["ab"]), e( r"a(?m:$)b")); |

2429 | assert_eq!(exact(["ab"]), e( r"a\bb")); |

2430 | assert_eq!(exact(["ab"]), e( r"a\Bb")); |

2431 | assert_eq!(exact(["ab"]), e( r"a(?-u:\b)b")); |

2432 | assert_eq!(exact(["ab"]), e( r"a(?-u:\B)b")); |

2433 | |

2434 | assert_eq!(exact(["ab"]), e( r"^ab")); |

2435 | assert_eq!(exact(["ab"]), e( r"$ab")); |

2436 | assert_eq!(exact(["ab"]), e( r"(?m:^)ab")); |

2437 | assert_eq!(exact(["ab"]), e( r"(?m:$)ab")); |

2438 | assert_eq!(exact(["ab"]), e( r"\bab")); |

2439 | assert_eq!(exact(["ab"]), e( r"\Bab")); |

2440 | assert_eq!(exact(["ab"]), e( r"(?-u:\b)ab")); |

2441 | assert_eq!(exact(["ab"]), e( r"(?-u:\B)ab")); |

2442 | |

2443 | assert_eq!(exact(["ab"]), e( r"ab^")); |

2444 | assert_eq!(exact(["ab"]), e( r"ab$")); |

2445 | assert_eq!(exact(["ab"]), e( r"ab(?m:^)")); |

2446 | assert_eq!(exact(["ab"]), e( r"ab(?m:$)")); |

2447 | assert_eq!(exact(["ab"]), e( r"ab\b")); |

2448 | assert_eq!(exact(["ab"]), e( r"ab\B")); |

2449 | assert_eq!(exact(["ab"]), e( r"ab(?-u:\b)")); |

2450 | assert_eq!(exact(["ab"]), e( r"ab(?-u:\B)")); |

2451 | |

2452 | let expected = (seq([I("aZ"), E( "ab")]), seq([I( "Zb"), E( "ab")])); |

2453 | assert_eq!(expected, e(r"^aZ*b")); |

2454 | } |

2455 | |

2456 | #[test] |

2457 | fn repetition() { |

2458 | assert_eq!(exact(["a", ""]), e( r"a?")); |

2459 | assert_eq!(exact(["", "a"]), e( r"a??")); |

2460 | assert_eq!(inexact([I("a"), E( "")], [I( "a"), E( "")]), e( r"a*")); |

2461 | assert_eq!(inexact([E(""), I( "a")], [E( ""), I( "a")]), e( r"a*?")); |

2462 | assert_eq!(inexact([I("a")], [I( "a")]), e( r"a+")); |

2463 | assert_eq!(inexact([I("a")], [I( "a")]), e( r"(a+)+")); |

2464 | |

2465 | assert_eq!(exact(["ab"]), e( r"aZ{0}b")); |

2466 | assert_eq!(exact(["aZb", "ab"]), e( r"aZ?b")); |

2467 | assert_eq!(exact(["ab", "aZb"]), e( r"aZ??b")); |

2468 | assert_eq!( |

2469 | inexact([I("aZ"), E( "ab")], [I( "Zb"), E( "ab")]), |

2470 | e(r"aZ*b") |

2471 | ); |

2472 | assert_eq!( |

2473 | inexact([E("ab"), I( "aZ")], [E( "ab"), I( "Zb")]), |

2474 | e(r"aZ*?b") |

2475 | ); |

2476 | assert_eq!(inexact([I("aZ")], [I( "Zb")]), e( r"aZ+b")); |

2477 | assert_eq!(inexact([I("aZ")], [I( "Zb")]), e( r"aZ+?b")); |

2478 | |

2479 | assert_eq!(exact(["aZZb"]), e( r"aZ{2}b")); |

2480 | assert_eq!(inexact([I("aZZ")], [I( "ZZb")]), e( r"aZ{2,3}b")); |

2481 | |

2482 | assert_eq!(exact(["abc", ""]), e( r"(abc)?")); |

2483 | assert_eq!(exact(["", "abc"]), e( r"(abc)??")); |

2484 | |

2485 | assert_eq!(inexact([I("a"), E( "b")], [I( "ab"), E( "b")]), e( r"a*b")); |

2486 | assert_eq!(inexact([E("b"), I( "a")], [E( "b"), I( "ab")]), e( r"a*?b")); |

2487 | assert_eq!(inexact([I("ab")], [I( "b")]), e( r"ab+")); |

2488 | assert_eq!(inexact([I("a"), I( "b")], [I( "b")]), e( r"a*b+")); |

2489 | |

2490 | // FIXME: The suffixes for this don't look quite right to me. I think |

2491 | // the right suffixes would be: [I(ac), I(bc), E(c)]. The main issue I |

2492 | // think is that suffixes are computed by iterating over concatenations |

2493 | // in reverse, and then [bc, ac, c] ordering is indeed correct from |

2494 | // that perspective. We also test a few more equivalent regexes, and |

2495 | // we get the same result, so it is consistent at least I suppose. |

2496 | // |

2497 | // The reason why this isn't an issue is that it only messes up |

2498 | // preference order, and currently, suffixes are never used in a |

2499 | // context where preference order matters. For prefixes it matters |

2500 | // because we sometimes want to use prefilters without confirmation |

2501 | // when all of the literals are exact (and there's no look-around). But |

2502 | // we never do that for suffixes. Any time we use suffixes, we always |

2503 | // include a confirmation step. If that ever changes, then it's likely |

2504 | // this bug will need to be fixed, but last time I looked, it appears |

2505 | // hard to do so. |

2506 | assert_eq!( |

2507 | inexact([I("a"), I( "b"), E( "c")], [I( "bc"), I( "ac"), E( "c")]), |

2508 | e(r"a*b*c") |

2509 | ); |

2510 | assert_eq!( |

2511 | inexact([I("a"), I( "b"), E( "c")], [I( "bc"), I( "ac"), E( "c")]), |

2512 | e(r"(a+)?(b+)?c") |

2513 | ); |

2514 | assert_eq!( |

2515 | inexact([I("a"), I( "b"), E( "c")], [I( "bc"), I( "ac"), E( "c")]), |

2516 | e(r"(a+|)(b+|)c") |

2517 | ); |

2518 | // A few more similarish but not identical regexes. These may have a |

2519 | // similar problem as above. |

2520 | assert_eq!( |

2521 | inexact( |

2522 | [I("a"), I( "b"), I( "c"), E( "")], |

2523 | [I("c"), I( "b"), I( "a"), E( "")] |

2524 | ), |

2525 | e(r"a*b*c*") |

2526 | ); |

2527 | assert_eq!(inexact([I("a"), I( "b"), I( "c")], [I( "c")]), e( r"a*b*c+")); |

2528 | assert_eq!(inexact([I("a"), I( "b")], [I( "bc")]), e( r"a*b+c")); |

2529 | assert_eq!(inexact([I("a"), I( "b")], [I( "c"), I( "b")]), e( r"a*b+c*")); |

2530 | assert_eq!(inexact([I("ab"), E( "a")], [I( "b"), E( "a")]), e( r"ab*")); |

2531 | assert_eq!( |

2532 | inexact([I("ab"), E( "ac")], [I( "bc"), E( "ac")]), |

2533 | e(r"ab*c") |

2534 | ); |

2535 | assert_eq!(inexact([I("ab")], [I( "b")]), e( r"ab+")); |

2536 | assert_eq!(inexact([I("ab")], [I( "bc")]), e( r"ab+c")); |

2537 | |

2538 | assert_eq!( |

2539 | inexact([I("z"), E( "azb")], [I( "zazb"), E( "azb")]), |

2540 | e(r"z*azb") |

2541 | ); |

2542 | |

2543 | let expected = |

2544 | exact(["aaa", "aab", "aba", "abb", "baa", "bab", "bba", "bbb"]); |

2545 | assert_eq!(expected, e(r"[ab]{3}")); |

2546 | let expected = inexact( |

2547 | [ |

2548 | I("aaa"), |

2549 | I("aab"), |

2550 | I("aba"), |

2551 | I("abb"), |

2552 | I("baa"), |

2553 | I("bab"), |

2554 | I("bba"), |

2555 | I("bbb"), |

2556 | ], |

2557 | [ |

2558 | I("aaa"), |

2559 | I("aab"), |

2560 | I("aba"), |

2561 | I("abb"), |

2562 | I("baa"), |

2563 | I("bab"), |

2564 | I("bba"), |

2565 | I("bbb"), |

2566 | ], |

2567 | ); |

2568 | assert_eq!(expected, e(r"[ab]{3,4}")); |

2569 | } |

2570 | |

2571 | #[test] |

2572 | fn concat() { |

2573 | let empty: [&str; 0] = []; |

2574 | |

2575 | assert_eq!(exact(["abcxyz"]), e( r"abc()xyz")); |

2576 | assert_eq!(exact(["abcxyz"]), e( r"(abc)(xyz)")); |

2577 | assert_eq!(exact(["abcmnoxyz"]), e( r"abc()mno()xyz")); |

2578 | assert_eq!(exact(empty), e(r"abc[a&&b]xyz")); |

2579 | assert_eq!(exact(["abcxyz"]), e( r"abc[a&&b]*xyz")); |

2580 | } |

2581 | |

2582 | #[test] |

2583 | fn alternation() { |

2584 | assert_eq!(exact(["abc", "mno", "xyz"]), e( r"abc|mno|xyz")); |

2585 | assert_eq!( |

2586 | inexact( |

2587 | [E("abc"), I( "mZ"), E( "mo"), E( "xyz")], |

2588 | [E("abc"), I( "Zo"), E( "mo"), E( "xyz")] |

2589 | ), |

2590 | e(r"abc|mZ*o|xyz") |

2591 | ); |

2592 | assert_eq!(exact(["abc", "xyz"]), e( r"abc|M[a&&b]N|xyz")); |

2593 | assert_eq!(exact(["abc", "MN", "xyz"]), e( r"abc|M[a&&b]*N|xyz")); |

2594 | |

2595 | assert_eq!(exact(["aaa", "aaaaa"]), e( r"(?:|aa)aaa")); |

2596 | assert_eq!( |

2597 | inexact( |

2598 | [I("aaa"), E( ""), I( "aaaaa"), E( "aa")], |

2599 | [I("aaa"), E( ""), E( "aa")] |

2600 | ), |

2601 | e(r"(?:|aa)(?:aaa)*") |

2602 | ); |

2603 | assert_eq!( |

2604 | inexact( |

2605 | [E(""), I( "aaa"), E( "aa"), I( "aaaaa")], |

2606 | [E(""), I( "aaa"), E( "aa")] |

2607 | ), |

2608 | e(r"(?:|aa)(?:aaa)*?") |

2609 | ); |

2610 | |

2611 | assert_eq!( |

2612 | inexact([E("a"), I( "b"), E( "")], [E( "a"), I( "b"), E( "")]), |

2613 | e(r"a|b*") |

2614 | ); |

2615 | assert_eq!(inexact([E("a"), I( "b")], [E( "a"), I( "b")]), e( r"a|b+")); |

2616 | |

2617 | assert_eq!( |

2618 | inexact([I("a"), E( "b"), E( "c")], [I( "ab"), E( "b"), E( "c")]), |

2619 | e(r"a*b|c") |

2620 | ); |

2621 | |

2622 | assert_eq!( |

2623 | inexact( |

2624 | [E("a"), E( "b"), I( "c"), E( "")], |

2625 | [E("a"), E( "b"), I( "c"), E( "")] |

2626 | ), |

2627 | e(r"a|(?:b|c*)") |

2628 | ); |

2629 | |

2630 | assert_eq!( |

2631 | inexact( |

2632 | [I("a"), I( "b"), E( "c"), I( "a"), I( "ab"), E( "c")], |

2633 | [I("ac"), I( "bc"), E( "c"), I( "ac"), I( "abc"), E( "c")], |

2634 | ), |

2635 | e(r"(a|b)*c|(a|ab)*c") |

2636 | ); |

2637 | |

2638 | assert_eq!( |

2639 | exact(["abef", "abgh", "cdef", "cdgh"]), |

2640 | e(r"(ab|cd)(ef|gh)") |

2641 | ); |

2642 | assert_eq!( |

2643 | exact([ |

2644 | "abefij", "abefkl", "abghij", "abghkl", "cdefij", "cdefkl", |

2645 | "cdghij", "cdghkl", |

2646 | ]), |

2647 | e(r"(ab|cd)(ef|gh)(ij|kl)") |

2648 | ); |

2649 | |

2650 | assert_eq!(inexact([E("abab")], [E( "abab")]), e( r"(ab){2}")); |

2651 | |

2652 | assert_eq!(inexact([I("abab")], [I( "abab")]), e( r"(ab){2,3}")); |

2653 | |

2654 | assert_eq!(inexact([I("abab")], [I( "abab")]), e( r"(ab){2,}")); |

2655 | } |

2656 | |

2657 | #[test] |

2658 | fn impossible() { |

2659 | let empty: [&str; 0] = []; |

2660 | |

2661 | assert_eq!(exact(empty), e(r"[a&&b]")); |

2662 | assert_eq!(exact(empty), e(r"a[a&&b]")); |

2663 | assert_eq!(exact(empty), e(r"[a&&b]b")); |

2664 | assert_eq!(exact(empty), e(r"a[a&&b]b")); |

2665 | assert_eq!(exact(["a", "b"]), e( r"a|[a&&b]|b")); |

2666 | assert_eq!(exact(["a", "b"]), e( r"a|c[a&&b]|b")); |

2667 | assert_eq!(exact(["a", "b"]), e( r"a|[a&&b]d|b")); |

2668 | assert_eq!(exact(["a", "b"]), e( r"a|c[a&&b]d|b")); |

2669 | assert_eq!(exact([""]), e( r"[a&&b]*")); |

2670 | assert_eq!(exact(["MN"]), e( r"M[a&&b]*N")); |

2671 | } |

2672 | |

2673 | // This tests patterns that contain something that defeats literal |

2674 | // detection, usually because it would blow some limit on the total number |

2675 | // of literals that can be returned. |

2676 | // |

2677 | // The main idea is that when literal extraction sees something that |

2678 | // it knows will blow a limit, it replaces it with a marker that says |

2679 | // "any literal will match here." While not necessarily true, the |

2680 | // over-estimation is just fine for the purposes of literal extraction, |

2681 | // because the imprecision doesn't matter: too big is too big. |

2682 | // |

2683 | // This is one of the trickier parts of literal extraction, since we need |

2684 | // to make sure all of our literal extraction operations correctly compose |

2685 | // with the markers. |

2686 | #[test] |

2687 | fn anything() { |

2688 | assert_eq!(infinite(), e(r".")); |

2689 | assert_eq!(infinite(), e(r"(?s).")); |

2690 | assert_eq!(infinite(), e(r"[A-Za-z]")); |

2691 | assert_eq!(infinite(), e(r"[A-Z]")); |

2692 | assert_eq!(exact([""]), e( r"[A-Z]{0}")); |

2693 | assert_eq!(infinite(), e(r"[A-Z]?")); |

2694 | assert_eq!(infinite(), e(r"[A-Z]*")); |

2695 | assert_eq!(infinite(), e(r"[A-Z]+")); |

2696 | assert_eq!((seq([I("1")]), Seq::infinite()), e( r"1[A-Z]")); |

2697 | assert_eq!((seq([I("1")]), seq([I( "2")])), e( r"1[A-Z]2")); |

2698 | assert_eq!((Seq::infinite(), seq([I("123")])), e( r"[A-Z]+123")); |

2699 | assert_eq!(infinite(), e(r"[A-Z]+123[A-Z]+")); |

2700 | assert_eq!(infinite(), e(r"1|[A-Z]|3")); |

2701 | assert_eq!( |

2702 | (seq([E("1"), I( "2"), E( "3")]), Seq::infinite()), |

2703 | e(r"1|2[A-Z]|3"), |

2704 | ); |

2705 | assert_eq!( |

2706 | (Seq::infinite(), seq([E("1"), I( "2"), E( "3")])), |

2707 | e(r"1|[A-Z]2|3"), |

2708 | ); |

2709 | assert_eq!( |

2710 | (seq([E("1"), I( "2"), E( "4")]), seq([E( "1"), I( "3"), E( "4")])), |

2711 | e(r"1|2[A-Z]3|4"), |

2712 | ); |

2713 | assert_eq!((Seq::infinite(), seq([I("2")])), e( r"(?:|1)[A-Z]2")); |

2714 | assert_eq!(inexact([I("a")], [I( "z")]), e( r"a.z")); |

2715 | } |

2716 | |

2717 | // Like the 'anything' test, but it uses smaller limits in order to test |

2718 | // the logic for effectively aborting literal extraction when the seqs get |

2719 | // too big. |

2720 | #[test] |

2721 | fn anything_small_limits() { |

2722 | fn prefixes(pattern: &str) -> Seq { |

2723 | Extractor::new() |

2724 | .kind(ExtractKind::Prefix) |

2725 | .limit_total(10) |

2726 | .extract(&parse(pattern)) |

2727 | } |

2728 | |

2729 | fn suffixes(pattern: &str) -> Seq { |

2730 | Extractor::new() |

2731 | .kind(ExtractKind::Suffix) |

2732 | .limit_total(10) |

2733 | .extract(&parse(pattern)) |

2734 | } |

2735 | |

2736 | fn e(pattern: &str) -> (Seq, Seq) { |

2737 | (prefixes(pattern), suffixes(pattern)) |

2738 | } |

2739 | |

2740 | assert_eq!( |

2741 | ( |

2742 | seq([ |

2743 | I("aaa"), |

2744 | I("aab"), |

2745 | I("aba"), |

2746 | I("abb"), |

2747 | I("baa"), |

2748 | I("bab"), |

2749 | I("bba"), |

2750 | I("bbb") |

2751 | ]), |

2752 | seq([ |

2753 | I("aaa"), |

2754 | I("aab"), |

2755 | I("aba"), |

2756 | I("abb"), |

2757 | I("baa"), |

2758 | I("bab"), |

2759 | I("bba"), |

2760 | I("bbb") |

2761 | ]) |

2762 | ), |

2763 | e(r"[ab]{3}{3}") |

2764 | ); |

2765 | |

2766 | assert_eq!(infinite(), e(r"ab|cd|ef|gh|ij|kl|mn|op|qr|st|uv|wx|yz")); |

2767 | } |

2768 | |

2769 | #[test] |

2770 | fn empty() { |

2771 | assert_eq!(exact([""]), e( r"")); |

2772 | assert_eq!(exact([""]), e( r"^")); |

2773 | assert_eq!(exact([""]), e( r"$")); |

2774 | assert_eq!(exact([""]), e( r"(?m:^)")); |

2775 | assert_eq!(exact([""]), e( r"(?m:$)")); |

2776 | assert_eq!(exact([""]), e( r"\b")); |

2777 | assert_eq!(exact([""]), e( r"\B")); |

2778 | assert_eq!(exact([""]), e( r"(?-u:\b)")); |

2779 | assert_eq!(exact([""]), e( r"(?-u:\B)")); |

2780 | } |

2781 | |

2782 | #[test] |

2783 | fn odds_and_ends() { |

2784 | assert_eq!((Seq::infinite(), seq([I("a")])), e( r".a")); |

2785 | assert_eq!((seq([I("a")]), Seq::infinite()), e( r"a.")); |

2786 | assert_eq!(infinite(), e(r"a|.")); |

2787 | assert_eq!(infinite(), e(r".|a")); |

2788 | |

2789 | let pat = r"M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]"; |

2790 | let expected = inexact( |

2791 | ["Mo'am", "Moam", "Mu'am", "Muam"].map(I), |

2792 | [ |

2793 | "ddafi", "ddafy", "dhafi", "dhafy", "dzafi", "dzafy", "dafi", |

2794 | "dafy", "tdafi", "tdafy", "thafi", "thafy", "tzafi", "tzafy", |

2795 | "tafi", "tafy", "zdafi", "zdafy", "zhafi", "zhafy", "zzafi", |

2796 | "zzafy", "zafi", "zafy", |

2797 | ] |

2798 | .map(I), |

2799 | ); |

2800 | assert_eq!(expected, e(pat)); |

2801 | |

2802 | assert_eq!( |

2803 | (seq(["fn is_", "fn as_"].map(I)), Seq::infinite()), |

2804 | e(r"fn is_([A-Z]+)|fn as_([A-Z]+)"), |

2805 | ); |

2806 | assert_eq!( |

2807 | inexact([I("foo")], [I( "quux")]), |

2808 | e(r"foo[A-Z]+bar[A-Z]+quux") |

2809 | ); |

2810 | assert_eq!(infinite(), e(r"[A-Z]+bar[A-Z]+")); |

2811 | assert_eq!( |

2812 | exact(["Sherlock Holmes"]), |

2813 | e(r"(?m)^Sherlock Holmes|Sherlock Holmes$") |

2814 | ); |

2815 | |

2816 | assert_eq!(exact(["sa", "sb"]), e( r"\bs(?:[ab])")); |

2817 | } |

2818 | |

2819 | // This tests a specific regex along with some heuristic steps to reduce |

2820 | // the sequences extracted. This is meant to roughly correspond to the |

2821 | // types of heuristics used to shrink literal sets in practice. (Shrinking |

2822 | // is done because you want to balance "spend too much work looking for |

2823 | // too many literals" and "spend too much work processing false positive |

2824 | // matches from short literals.") |

2825 | #[test] |

2826 | #[cfg(feature = "unicode-case")] |

2827 | fn holmes() { |

2828 | let expected = inexact( |

2829 | ["HOL", "HOl", "HoL", "Hol", "hOL", "hOl", "hoL", "hol"].map(I), |

2830 | [ |

2831 | "MES", "MEs", "Eſ", "MeS", "Mes", "eſ", "mES", "mEs", "meS", |

2832 | "mes", |

2833 | ] |

2834 | .map(I), |

2835 | ); |

2836 | let (mut prefixes, mut suffixes) = e(r"(?i)Holmes"); |

2837 | prefixes.keep_first_bytes(3); |

2838 | suffixes.keep_last_bytes(3); |

2839 | prefixes.minimize_by_preference(); |

2840 | suffixes.minimize_by_preference(); |

2841 | assert_eq!(expected, (prefixes, suffixes)); |

2842 | } |

2843 | |

2844 | // This tests that we get some kind of literals extracted for a beefier |

2845 | // alternation with case insensitive mode enabled. At one point during |

2846 | // development, this returned nothing, and motivated some special case |

2847 | // code in Extractor::union to try and trim down the literal sequences |

2848 | // if the union would blow the limits set. |

2849 | #[test] |

2850 | #[cfg(feature = "unicode-case")] |

2851 | fn holmes_alt() { |

2852 | let mut pre = |

2853 | prefixes(r"(?i)Sherlock|Holmes|Watson|Irene|Adler|John|Baker"); |

2854 | assert!(pre.len().unwrap() > 0); |

2855 | pre.optimize_for_prefix_by_preference(); |

2856 | assert!(pre.len().unwrap() > 0); |

2857 | } |

2858 | |

2859 | // See: https://github.com/rust-lang/regex/security/advisories/GHSA-m5pq-gvj9-9vr8 |

2860 | // See: CVE-2022-24713 |

2861 | // |

2862 | // We test this here to ensure literal extraction completes in reasonable |

2863 | // time and isn't materially impacted by these sorts of pathological |

2864 | // repeats. |

2865 | #[test] |

2866 | fn crazy_repeats() { |

2867 | assert_eq!(inexact([E("")], [E( "")]), e( r"(?:){4294967295}")); |

2868 | assert_eq!( |

2869 | inexact([E("")], [E( "")]), |

2870 | e(r"(?:){64}{64}{64}{64}{64}{64}") |

2871 | ); |

2872 | assert_eq!(inexact([E("")], [E( "")]), e( r"x{0}{4294967295}")); |

2873 | assert_eq!(inexact([E("")], [E( "")]), e( r"(?:|){4294967295}")); |

2874 | |

2875 | assert_eq!( |

2876 | inexact([E("")], [E( "")]), |

2877 | e(r"(?:){8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}") |

2878 | ); |

2879 | let repa = "a".repeat( 100); |

2880 | assert_eq!( |

2881 | inexact([I(&repa)], [I(&repa)]), |

2882 | e(r"a{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}{8}") |

2883 | ); |

2884 | } |

2885 | |

2886 | #[test] |

2887 | fn huge() { |

2888 | let pat = r#"(?-u) |

2889 | 2(?: |

2890 | [45]\d{3}| |

2891 | 7(?: |

2892 | 1[0-267]| |

2893 | 2[0-289]| |

2894 | 3[0-29]| |

2895 | 4[01]| |

2896 | 5[1-3]| |

2897 | 6[013]| |

2898 | 7[0178]| |

2899 | 91 |

2900 | )| |

2901 | 8(?: |

2902 | 0[125]| |

2903 | [139][1-6]| |

2904 | 2[0157-9]| |

2905 | 41| |

2906 | 6[1-35]| |

2907 | 7[1-5]| |

2908 | 8[1-8]| |

2909 | 90 |

2910 | )| |

2911 | 9(?: |

2912 | 0[0-2]| |

2913 | 1[0-4]| |

2914 | 2[568]| |

2915 | 3[3-6]| |

2916 | 5[5-7]| |

2917 | 6[0167]| |

2918 | 7[15]| |

2919 | 8[0146-9] |

2920 | ) |

2921 | )\d{4}| |

2922 | 3(?: |

2923 | 12?[5-7]\d{2}| |

2924 | 0(?: |

2925 | 2(?: |

2926 | [025-79]\d| |

2927 | [348]\d{1,2} |

2928 | )| |

2929 | 3(?: |

2930 | [2-4]\d| |

2931 | [56]\d? |

2932 | ) |

2933 | )| |

2934 | 2(?: |

2935 | 1\d{2}| |

2936 | 2(?: |

2937 | [12]\d| |

2938 | [35]\d{1,2}| |

2939 | 4\d? |

2940 | ) |

2941 | )| |

2942 | 3(?: |

2943 | 1\d{2}| |

2944 | 2(?: |

2945 | [2356]\d| |

2946 | 4\d{1,2} |

2947 | ) |

2948 | )| |

2949 | 4(?: |

2950 | 1\d{2}| |

2951 | 2(?: |

2952 | 2\d{1,2}| |

2953 | [47]| |

2954 | 5\d{2} |

2955 | ) |

2956 | )| |

2957 | 5(?: |

2958 | 1\d{2}| |

2959 | 29 |

2960 | )| |

2961 | [67]1\d{2}| |

2962 | 8(?: |

2963 | 1\d{2}| |

2964 | 2(?: |

2965 | 2\d{2}| |

2966 | 3| |

2967 | 4\d |

2968 | ) |

2969 | ) |

2970 | )\d{3}| |

2971 | 4(?: |

2972 | 0(?: |

2973 | 2(?: |

2974 | [09]\d| |

2975 | 7 |

2976 | )| |

2977 | 33\d{2} |

2978 | )| |

2979 | 1\d{3}| |

2980 | 2(?: |

2981 | 1\d{2}| |

2982 | 2(?: |

2983 | [25]\d?| |

2984 | [348]\d| |

2985 | [67]\d{1,2} |

2986 | ) |

2987 | )| |

2988 | 3(?: |

2989 | 1\d{2}(?: |

2990 | \d{2} |

2991 | )?| |

2992 | 2(?: |

2993 | [045]\d| |

2994 | [236-9]\d{1,2} |

2995 | )| |

2996 | 32\d{2} |

2997 | )| |

2998 | 4(?: |

2999 | [18]\d{2}| |

3000 | 2(?: |

3001 | [2-46]\d{2}| |

3002 | 3 |

3003 | )| |

3004 | 5[25]\d{2} |

3005 | )| |

3006 | 5(?: |

3007 | 1\d{2}| |

3008 | 2(?: |

3009 | 3\d| |

3010 | 5 |

3011 | ) |

3012 | )| |

3013 | 6(?: |

3014 | [18]\d{2}| |

3015 | 2(?: |

3016 | 3(?: |

3017 | \d{2} |

3018 | )?| |

3019 | [46]\d{1,2}| |

3020 | 5\d{2}| |

3021 | 7\d |

3022 | )| |

3023 | 5(?: |

3024 | 3\d?| |

3025 | 4\d| |

3026 | [57]\d{1,2}| |

3027 | 6\d{2}| |

3028 | 8 |

3029 | ) |

3030 | )| |

3031 | 71\d{2}| |

3032 | 8(?: |

3033 | [18]\d{2}| |

3034 | 23\d{2}| |

3035 | 54\d{2} |

3036 | )| |

3037 | 9(?: |

3038 | [18]\d{2}| |

3039 | 2[2-5]\d{2}| |

3040 | 53\d{1,2} |

3041 | ) |

3042 | )\d{3}| |

3043 | 5(?: |

3044 | 02[03489]\d{2}| |

3045 | 1\d{2}| |

3046 | 2(?: |

3047 | 1\d{2}| |

3048 | 2(?: |

3049 | 2(?: |

3050 | \d{2} |

3051 | )?| |

3052 | [457]\d{2} |

3053 | ) |

3054 | )| |

3055 | 3(?: |

3056 | 1\d{2}| |

3057 | 2(?: |

3058 | [37](?: |

3059 | \d{2} |

3060 | )?| |

3061 | [569]\d{2} |

3062 | ) |

3063 | )| |

3064 | 4(?: |

3065 | 1\d{2}| |

3066 | 2[46]\d{2} |

3067 | )| |

3068 | 5(?: |

3069 | 1\d{2}| |

3070 | 26\d{1,2} |

3071 | )| |

3072 | 6(?: |

3073 | [18]\d{2}| |

3074 | 2| |

3075 | 53\d{2} |

3076 | )| |

3077 | 7(?: |

3078 | 1| |

3079 | 24 |

3080 | )\d{2}| |

3081 | 8(?: |

3082 | 1| |

3083 | 26 |

3084 | )\d{2}| |

3085 | 91\d{2} |

3086 | )\d{3}| |

3087 | 6(?: |

3088 | 0(?: |

3089 | 1\d{2}| |

3090 | 2(?: |

3091 | 3\d{2}| |

3092 | 4\d{1,2} |

3093 | ) |

3094 | )| |

3095 | 2(?: |

3096 | 2[2-5]\d{2}| |

3097 | 5(?: |

3098 | [3-5]\d{2}| |

3099 | 7 |

3100 | )| |

3101 | 8\d{2} |

3102 | )| |

3103 | 3(?: |

3104 | 1| |

3105 | 2[3478] |

3106 | )\d{2}| |

3107 | 4(?: |

3108 | 1| |

3109 | 2[34] |

3110 | )\d{2}| |

3111 | 5(?: |

3112 | 1| |

3113 | 2[47] |

3114 | )\d{2}| |

3115 | 6(?: |

3116 | [18]\d{2}| |

3117 | 6(?: |

3118 | 2(?: |

3119 | 2\d| |

3120 | [34]\d{2} |

3121 | )| |

3122 | 5(?: |

3123 | [24]\d{2}| |

3124 | 3\d| |

3125 | 5\d{1,2} |

3126 | ) |

3127 | ) |

3128 | )| |

3129 | 72[2-5]\d{2}| |

3130 | 8(?: |

3131 | 1\d{2}| |

3132 | 2[2-5]\d{2} |

3133 | )| |

3134 | 9(?: |

3135 | 1\d{2}| |

3136 | 2[2-6]\d{2} |

3137 | ) |

3138 | )\d{3}| |

3139 | 7(?: |

3140 | (?: |

3141 | 02| |

3142 | [3-589]1| |

3143 | 6[12]| |

3144 | 72[24] |

3145 | )\d{2}| |

3146 | 21\d{3}| |

3147 | 32 |

3148 | )\d{3}| |

3149 | 8(?: |

3150 | (?: |

3151 | 4[12]| |

3152 | [5-7]2| |

3153 | 1\d? |

3154 | )| |

3155 | (?: |

3156 | 0| |

3157 | 3[12]| |

3158 | [5-7]1| |

3159 | 217 |

3160 | )\d |

3161 | )\d{4}| |

3162 | 9(?: |

3163 | [35]1| |

3164 | (?: |

3165 | [024]2| |

3166 | 81 |

3167 | )\d| |

3168 | (?: |

3169 | 1| |

3170 | [24]1 |

3171 | )\d{2} |

3172 | )\d{3} |

3173 | "#; |

3174 | // TODO: This is a good candidate of a seq of literals that could be |

3175 | // shrunk quite a bit and still be very productive with respect to |

3176 | // literal optimizations. |

3177 | let (prefixes, suffixes) = e(pat); |

3178 | assert!(!suffixes.is_finite()); |

3179 | assert_eq!(Some(243), prefixes.len()); |

3180 | } |

3181 | |

3182 | #[test] |

3183 | fn optimize() { |

3184 | // This gets a common prefix that isn't too short. |

3185 | let (p, s) = |

3186 | opt(["foobarfoobar", "foobar", "foobarzfoobar", "foobarfoobar"]); |

3187 | assert_eq!(seq([I("foobar")]), p); |

3188 | assert_eq!(seq([I("foobar")]), s); |

3189 | |

3190 | // This also finds a common prefix, but since it's only one byte, it |

3191 | // prefers the multiple literals. |

3192 | let (p, s) = opt(["abba", "akka", "abccba"]); |

3193 | assert_eq!(exact(["abba", "akka", "abccba"]), (p, s)); |

3194 | |

3195 | let (p, s) = opt(["sam", "samwise"]); |

3196 | assert_eq!((seq([E("sam")]), seq([E( "sam"), E( "samwise")])), (p, s)); |

3197 | |

3198 | // The empty string is poisonous, so our seq becomes infinite, even |

3199 | // though all literals are exact. |

3200 | let (p, s) = opt(["foobarfoo", "foo", "", "foozfoo", "foofoo"]); |

3201 | assert!(!p.is_finite()); |

3202 | assert!(!s.is_finite()); |

3203 | |

3204 | // A space is also poisonous, so our seq becomes infinite. But this |

3205 | // only gets triggered when we don't have a completely exact sequence. |

3206 | // When the sequence is exact, spaces are okay, since we presume that |

3207 | // any prefilter will match a space more quickly than the regex engine. |

3208 | // (When the sequence is exact, there's a chance of the prefilter being |

3209 | // used without needing the regex engine at all.) |

3210 | let mut p = seq([E("foobarfoo"), I( "foo"), E( " "), E( "foofoo")]); |

3211 | p.optimize_for_prefix_by_preference(); |

3212 | assert!(!p.is_finite()); |

3213 | } |

3214 | } |

3215 |