1 | /*! |
2 | Generic helpers for iteration of matches from a regex engine in a haystack. |
3 | |
4 | The principle type in this module is a [`Searcher`]. A `Searcher` provides |
5 | its own lower level iterator-like API in addition to methods for constructing |
6 | types that implement `Iterator`. The documentation for `Searcher` explains a |
7 | bit more about why these different APIs exist. |
8 | |
9 | Currently, this module supports iteration over any regex engine that works |
10 | with the [`HalfMatch`], [`Match`] or [`Captures`] types. |
11 | */ |
12 | |
13 | #[cfg (feature = "alloc" )] |
14 | use crate::util::captures::Captures; |
15 | use crate::util::search::{HalfMatch, Input, Match, MatchError}; |
16 | |
17 | /// A searcher for creating iterators and performing lower level iteration. |
18 | /// |
19 | /// This searcher encapsulates the logic required for finding all successive |
20 | /// non-overlapping matches in a haystack. In theory, iteration would look |
21 | /// something like this: |
22 | /// |
23 | /// 1. Setting the start position to `0`. |
24 | /// 2. Execute a regex search. If no match, end iteration. |
25 | /// 3. Report the match and set the start position to the end of the match. |
26 | /// 4. Go back to (2). |
27 | /// |
28 | /// And if this were indeed the case, it's likely that `Searcher` wouldn't |
29 | /// exist. Unfortunately, because a regex may match the empty string, the above |
30 | /// logic won't work for all possible regexes. Namely, if an empty match is |
31 | /// found, then step (3) would set the start position of the search to the |
32 | /// position it was at. Thus, iteration would never end. |
33 | /// |
34 | /// Instead, a `Searcher` knows how to detect these cases and forcefully |
35 | /// advance iteration in the case of an empty match that overlaps with a |
36 | /// previous match. |
37 | /// |
38 | /// If you know that your regex cannot match any empty string, then the simple |
39 | /// algorithm described above will work correctly. |
40 | /// |
41 | /// When possible, prefer the iterators defined on the regex engine you're |
42 | /// using. This tries to abstract over the regex engine and is thus a bit more |
43 | /// unwieldy to use. |
44 | /// |
45 | /// In particular, a `Searcher` is not itself an iterator. Instead, it provides |
46 | /// `advance` routines that permit moving the search along explicitly. It also |
47 | /// provides various routines, like [`Searcher::into_matches_iter`], that |
48 | /// accept a closure (representing how a regex engine executes a search) and |
49 | /// returns a conventional iterator. |
50 | /// |
51 | /// The lifetime parameters come from the [`Input`] type passed to |
52 | /// [`Searcher::new`]: |
53 | /// |
54 | /// * `'h` is the lifetime of the underlying haystack. |
55 | /// |
56 | /// # Searcher vs Iterator |
57 | /// |
58 | /// Why does a search type with "advance" APIs exist at all when we also have |
59 | /// iterators? Unfortunately, the reasoning behind this split is a complex |
60 | /// combination of the following things: |
61 | /// |
62 | /// 1. While many of the regex engines expose their own iterators, it is also |
63 | /// nice to expose this lower level iteration helper because it permits callers |
64 | /// to provide their own `Input` configuration. Moreover, a `Searcher` can work |
65 | /// with _any_ regex engine instead of only the ones defined in this crate. |
66 | /// This way, everyone benefits from a shared iteration implementation. |
67 | /// 2. There are many different regex engines that, while they have the same |
68 | /// match semantics, they have slightly different APIs. Iteration is just |
69 | /// complex enough to want to share code, and so we need a way of abstracting |
70 | /// over those different regex engines. While we could define a new trait that |
71 | /// describes any regex engine search API, it would wind up looking very close |
72 | /// to a closure. While there may still be reasons for the more generic trait |
73 | /// to exist, for now and for the purposes of iteration, we use a closure. |
74 | /// Closures also provide a lot of easy flexibility at the call site, in that |
75 | /// they permit the caller to borrow any kind of state they want for use during |
76 | /// each search call. |
77 | /// 3. As a result of using closures, and because closures are anonymous types |
78 | /// that cannot be named, it is difficult to encapsulate them without both |
79 | /// costs to speed and added complexity to the public API. For example, in |
80 | /// defining an iterator type like |
81 | /// [`dfa::regex::FindMatches`](crate::dfa::regex::FindMatches), |
82 | /// if we use a closure internally, it's not possible to name this type in the |
83 | /// return type of the iterator constructor. Thus, the only way around it is |
84 | /// to erase the type by boxing it and turning it into a `Box<dyn FnMut ...>`. |
85 | /// This boxed closure is unlikely to be inlined _and_ it infects the public |
86 | /// API in subtle ways. Namely, unless you declare the closure as implementing |
87 | /// `Send` and `Sync`, then the resulting iterator type won't implement it |
88 | /// either. But there are practical issues with requiring the closure to |
89 | /// implement `Send` and `Sync` that result in other API complexities that |
90 | /// are beyond the scope of this already long exposition. |
91 | /// 4. Some regex engines expose more complex match information than just |
92 | /// "which pattern matched" and "at what offsets." For example, the PikeVM |
93 | /// exposes match spans for each capturing group that participated in the |
94 | /// match. In such cases, it can be quite beneficial to reuse the capturing |
95 | /// group allocation on subsequent searches. A proper iterator doesn't permit |
96 | /// this API due to its interface, so it's useful to have something a bit lower |
97 | /// level that permits callers to amortize allocations while also reusing a |
98 | /// shared implementation of iteration. (See the documentation for |
99 | /// [`Searcher::advance`] for an example of using the "advance" API with the |
100 | /// PikeVM.) |
101 | /// |
102 | /// What this boils down to is that there are "advance" APIs which require |
103 | /// handing a closure to it for every call, and there are also APIs to create |
104 | /// iterators from a closure. The former are useful for _implementing_ |
105 | /// iterators or when you need more flexibility, while the latter are useful |
106 | /// for conveniently writing custom iterators on-the-fly. |
107 | /// |
108 | /// # Example: iterating with captures |
109 | /// |
110 | /// Several regex engines in this crate over convenient iterator APIs over |
111 | /// [`Captures`] values. To do so, this requires allocating a new `Captures` |
112 | /// value for each iteration step. This can perhaps be more costly than you |
113 | /// might want. Instead of implementing your own iterator to avoid that |
114 | /// cost (which can be a little subtle if you want to handle empty matches |
115 | /// correctly), you can use this `Searcher` to do it for you: |
116 | /// |
117 | /// ``` |
118 | /// use regex_automata::{ |
119 | /// nfa::thompson::pikevm::PikeVM, |
120 | /// util::iter::Searcher, |
121 | /// Input, Span, |
122 | /// }; |
123 | /// |
124 | /// let re = PikeVM::new("foo(?P<numbers>[0-9]+)" )?; |
125 | /// let haystack = "foo1 foo12 foo123" ; |
126 | /// |
127 | /// let mut caps = re.create_captures(); |
128 | /// let mut cache = re.create_cache(); |
129 | /// let mut matches = vec![]; |
130 | /// let mut searcher = Searcher::new(Input::new(haystack)); |
131 | /// while let Some(_) = searcher.advance(|input| { |
132 | /// re.search(&mut cache, input, &mut caps); |
133 | /// Ok(caps.get_match()) |
134 | /// }) { |
135 | /// // The unwrap is OK since 'numbers' matches if the pattern matches. |
136 | /// matches.push(caps.get_group_by_name("numbers" ).unwrap()); |
137 | /// } |
138 | /// assert_eq!(matches, vec![ |
139 | /// Span::from(3..4), |
140 | /// Span::from(8..10), |
141 | /// Span::from(14..17), |
142 | /// ]); |
143 | /// |
144 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |
145 | /// ``` |
146 | #[derive (Clone, Debug)] |
147 | pub struct Searcher<'h> { |
148 | /// The input parameters to give to each regex engine call. |
149 | /// |
150 | /// The start position of the search is mutated during iteration. |
151 | input: Input<'h>, |
152 | /// Records the end offset of the most recent match. This is necessary to |
153 | /// handle a corner case for preventing empty matches from overlapping with |
154 | /// the ending bounds of a prior match. |
155 | last_match_end: Option<usize>, |
156 | } |
157 | |
158 | impl<'h> Searcher<'h> { |
159 | /// Create a new fallible non-overlapping matches iterator. |
160 | /// |
161 | /// The given `input` provides the parameters (including the haystack), |
162 | /// while the `finder` represents a closure that calls the underlying regex |
163 | /// engine. The closure may borrow any additional state that is needed, |
164 | /// such as a prefilter scanner. |
165 | pub fn new(input: Input<'h>) -> Searcher<'h> { |
166 | Searcher { input, last_match_end: None } |
167 | } |
168 | |
169 | /// Returns the current `Input` used by this searcher. |
170 | /// |
171 | /// The `Input` returned is generally equivalent to the one given to |
172 | /// [`Searcher::new`], but its start position may be different to reflect |
173 | /// the start of the next search to be executed. |
174 | pub fn input<'s>(&'s self) -> &'s Input<'h> { |
175 | &self.input |
176 | } |
177 | |
178 | /// Return the next half match for an infallible search if one exists, and |
179 | /// advance to the next position. |
180 | /// |
181 | /// This is like `try_advance_half`, except errors are converted into |
182 | /// panics. |
183 | /// |
184 | /// # Panics |
185 | /// |
186 | /// If the given closure returns an error, then this panics. This is useful |
187 | /// when you know your underlying regex engine has been configured to not |
188 | /// return an error. |
189 | /// |
190 | /// # Example |
191 | /// |
192 | /// This example shows how to use a `Searcher` to iterate over all matches |
193 | /// when using a DFA, which only provides "half" matches. |
194 | /// |
195 | /// ``` |
196 | /// use regex_automata::{ |
197 | /// hybrid::dfa::DFA, |
198 | /// util::iter::Searcher, |
199 | /// HalfMatch, Input, |
200 | /// }; |
201 | /// |
202 | /// let re = DFA::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" )?; |
203 | /// let mut cache = re.create_cache(); |
204 | /// |
205 | /// let input = Input::new("2010-03-14 2016-10-08 2020-10-22" ); |
206 | /// let mut it = Searcher::new(input); |
207 | /// |
208 | /// let expected = Some(HalfMatch::must(0, 10)); |
209 | /// let got = it.advance_half(|input| re.try_search_fwd(&mut cache, input)); |
210 | /// assert_eq!(expected, got); |
211 | /// |
212 | /// let expected = Some(HalfMatch::must(0, 21)); |
213 | /// let got = it.advance_half(|input| re.try_search_fwd(&mut cache, input)); |
214 | /// assert_eq!(expected, got); |
215 | /// |
216 | /// let expected = Some(HalfMatch::must(0, 32)); |
217 | /// let got = it.advance_half(|input| re.try_search_fwd(&mut cache, input)); |
218 | /// assert_eq!(expected, got); |
219 | /// |
220 | /// let expected = None; |
221 | /// let got = it.advance_half(|input| re.try_search_fwd(&mut cache, input)); |
222 | /// assert_eq!(expected, got); |
223 | /// |
224 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |
225 | /// ``` |
226 | /// |
227 | /// This correctly moves iteration forward even when an empty match occurs: |
228 | /// |
229 | /// ``` |
230 | /// use regex_automata::{ |
231 | /// hybrid::dfa::DFA, |
232 | /// util::iter::Searcher, |
233 | /// HalfMatch, Input, |
234 | /// }; |
235 | /// |
236 | /// let re = DFA::new(r"a|" )?; |
237 | /// let mut cache = re.create_cache(); |
238 | /// |
239 | /// let input = Input::new("abba" ); |
240 | /// let mut it = Searcher::new(input); |
241 | /// |
242 | /// let expected = Some(HalfMatch::must(0, 1)); |
243 | /// let got = it.advance_half(|input| re.try_search_fwd(&mut cache, input)); |
244 | /// assert_eq!(expected, got); |
245 | /// |
246 | /// let expected = Some(HalfMatch::must(0, 2)); |
247 | /// let got = it.advance_half(|input| re.try_search_fwd(&mut cache, input)); |
248 | /// assert_eq!(expected, got); |
249 | /// |
250 | /// let expected = Some(HalfMatch::must(0, 4)); |
251 | /// let got = it.advance_half(|input| re.try_search_fwd(&mut cache, input)); |
252 | /// assert_eq!(expected, got); |
253 | /// |
254 | /// let expected = None; |
255 | /// let got = it.advance_half(|input| re.try_search_fwd(&mut cache, input)); |
256 | /// assert_eq!(expected, got); |
257 | /// |
258 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |
259 | /// ``` |
260 | #[inline ] |
261 | pub fn advance_half<F>(&mut self, finder: F) -> Option<HalfMatch> |
262 | where |
263 | F: FnMut(&Input<'_>) -> Result<Option<HalfMatch>, MatchError>, |
264 | { |
265 | match self.try_advance_half(finder) { |
266 | Ok(m) => m, |
267 | Err(err) => panic!( |
268 | "unexpected regex half find error: {}\n\ |
269 | to handle find errors, use 'try' or 'search' methods" , |
270 | err, |
271 | ), |
272 | } |
273 | } |
274 | |
275 | /// Return the next match for an infallible search if one exists, and |
276 | /// advance to the next position. |
277 | /// |
278 | /// The search is advanced even in the presence of empty matches by |
279 | /// forbidding empty matches from overlapping with any other match. |
280 | /// |
281 | /// This is like `try_advance`, except errors are converted into panics. |
282 | /// |
283 | /// # Panics |
284 | /// |
285 | /// If the given closure returns an error, then this panics. This is useful |
286 | /// when you know your underlying regex engine has been configured to not |
287 | /// return an error. |
288 | /// |
289 | /// # Example |
290 | /// |
291 | /// This example shows how to use a `Searcher` to iterate over all matches |
292 | /// when using a regex based on lazy DFAs: |
293 | /// |
294 | /// ``` |
295 | /// use regex_automata::{ |
296 | /// hybrid::regex::Regex, |
297 | /// util::iter::Searcher, |
298 | /// Match, Input, |
299 | /// }; |
300 | /// |
301 | /// let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" )?; |
302 | /// let mut cache = re.create_cache(); |
303 | /// |
304 | /// let input = Input::new("2010-03-14 2016-10-08 2020-10-22" ); |
305 | /// let mut it = Searcher::new(input); |
306 | /// |
307 | /// let expected = Some(Match::must(0, 0..10)); |
308 | /// let got = it.advance(|input| re.try_search(&mut cache, input)); |
309 | /// assert_eq!(expected, got); |
310 | /// |
311 | /// let expected = Some(Match::must(0, 11..21)); |
312 | /// let got = it.advance(|input| re.try_search(&mut cache, input)); |
313 | /// assert_eq!(expected, got); |
314 | /// |
315 | /// let expected = Some(Match::must(0, 22..32)); |
316 | /// let got = it.advance(|input| re.try_search(&mut cache, input)); |
317 | /// assert_eq!(expected, got); |
318 | /// |
319 | /// let expected = None; |
320 | /// let got = it.advance(|input| re.try_search(&mut cache, input)); |
321 | /// assert_eq!(expected, got); |
322 | /// |
323 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |
324 | /// ``` |
325 | /// |
326 | /// This example shows the same as above, but with the PikeVM. This example |
327 | /// is useful because it shows how to use this API even when the regex |
328 | /// engine doesn't directly return a `Match`. |
329 | /// |
330 | /// ``` |
331 | /// use regex_automata::{ |
332 | /// nfa::thompson::pikevm::PikeVM, |
333 | /// util::iter::Searcher, |
334 | /// Match, Input, |
335 | /// }; |
336 | /// |
337 | /// let re = PikeVM::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" )?; |
338 | /// let (mut cache, mut caps) = (re.create_cache(), re.create_captures()); |
339 | /// |
340 | /// let input = Input::new("2010-03-14 2016-10-08 2020-10-22" ); |
341 | /// let mut it = Searcher::new(input); |
342 | /// |
343 | /// let expected = Some(Match::must(0, 0..10)); |
344 | /// let got = it.advance(|input| { |
345 | /// re.search(&mut cache, input, &mut caps); |
346 | /// Ok(caps.get_match()) |
347 | /// }); |
348 | /// // Note that if we wanted to extract capturing group spans, we could |
349 | /// // do that here with 'caps'. |
350 | /// assert_eq!(expected, got); |
351 | /// |
352 | /// let expected = Some(Match::must(0, 11..21)); |
353 | /// let got = it.advance(|input| { |
354 | /// re.search(&mut cache, input, &mut caps); |
355 | /// Ok(caps.get_match()) |
356 | /// }); |
357 | /// assert_eq!(expected, got); |
358 | /// |
359 | /// let expected = Some(Match::must(0, 22..32)); |
360 | /// let got = it.advance(|input| { |
361 | /// re.search(&mut cache, input, &mut caps); |
362 | /// Ok(caps.get_match()) |
363 | /// }); |
364 | /// assert_eq!(expected, got); |
365 | /// |
366 | /// let expected = None; |
367 | /// let got = it.advance(|input| { |
368 | /// re.search(&mut cache, input, &mut caps); |
369 | /// Ok(caps.get_match()) |
370 | /// }); |
371 | /// assert_eq!(expected, got); |
372 | /// |
373 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |
374 | /// ``` |
375 | #[inline ] |
376 | pub fn advance<F>(&mut self, finder: F) -> Option<Match> |
377 | where |
378 | F: FnMut(&Input<'_>) -> Result<Option<Match>, MatchError>, |
379 | { |
380 | match self.try_advance(finder) { |
381 | Ok(m) => m, |
382 | Err(err) => panic!( |
383 | "unexpected regex find error: {}\n\ |
384 | to handle find errors, use 'try' or 'search' methods" , |
385 | err, |
386 | ), |
387 | } |
388 | } |
389 | |
390 | /// Return the next half match for a fallible search if one exists, and |
391 | /// advance to the next position. |
392 | /// |
393 | /// This is like `advance_half`, except it permits callers to handle errors |
394 | /// during iteration. |
395 | #[inline ] |
396 | pub fn try_advance_half<F>( |
397 | &mut self, |
398 | mut finder: F, |
399 | ) -> Result<Option<HalfMatch>, MatchError> |
400 | where |
401 | F: FnMut(&Input<'_>) -> Result<Option<HalfMatch>, MatchError>, |
402 | { |
403 | let mut m = match finder(&self.input)? { |
404 | None => return Ok(None), |
405 | Some(m) => m, |
406 | }; |
407 | if Some(m.offset()) == self.last_match_end { |
408 | m = match self.handle_overlapping_empty_half_match(m, finder)? { |
409 | None => return Ok(None), |
410 | Some(m) => m, |
411 | }; |
412 | } |
413 | self.input.set_start(m.offset()); |
414 | self.last_match_end = Some(m.offset()); |
415 | Ok(Some(m)) |
416 | } |
417 | |
418 | /// Return the next match for a fallible search if one exists, and advance |
419 | /// to the next position. |
420 | /// |
421 | /// This is like `advance`, except it permits callers to handle errors |
422 | /// during iteration. |
423 | #[inline ] |
424 | pub fn try_advance<F>( |
425 | &mut self, |
426 | mut finder: F, |
427 | ) -> Result<Option<Match>, MatchError> |
428 | where |
429 | F: FnMut(&Input<'_>) -> Result<Option<Match>, MatchError>, |
430 | { |
431 | let mut m = match finder(&self.input)? { |
432 | None => return Ok(None), |
433 | Some(m) => m, |
434 | }; |
435 | if m.is_empty() && Some(m.end()) == self.last_match_end { |
436 | m = match self.handle_overlapping_empty_match(m, finder)? { |
437 | None => return Ok(None), |
438 | Some(m) => m, |
439 | }; |
440 | } |
441 | self.input.set_start(m.end()); |
442 | self.last_match_end = Some(m.end()); |
443 | Ok(Some(m)) |
444 | } |
445 | |
446 | /// Given a closure that executes a single search, return an iterator over |
447 | /// all successive non-overlapping half matches. |
448 | /// |
449 | /// The iterator returned yields result values. If the underlying regex |
450 | /// engine is configured to never return an error, consider calling |
451 | /// [`TryHalfMatchesIter::infallible`] to convert errors into panics. |
452 | /// |
453 | /// # Example |
454 | /// |
455 | /// This example shows how to use a `Searcher` to create a proper |
456 | /// iterator over half matches. |
457 | /// |
458 | /// ``` |
459 | /// use regex_automata::{ |
460 | /// hybrid::dfa::DFA, |
461 | /// util::iter::Searcher, |
462 | /// HalfMatch, Input, |
463 | /// }; |
464 | /// |
465 | /// let re = DFA::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" )?; |
466 | /// let mut cache = re.create_cache(); |
467 | /// |
468 | /// let input = Input::new("2010-03-14 2016-10-08 2020-10-22" ); |
469 | /// let mut it = Searcher::new(input).into_half_matches_iter(|input| { |
470 | /// re.try_search_fwd(&mut cache, input) |
471 | /// }); |
472 | /// |
473 | /// let expected = Some(Ok(HalfMatch::must(0, 10))); |
474 | /// assert_eq!(expected, it.next()); |
475 | /// |
476 | /// let expected = Some(Ok(HalfMatch::must(0, 21))); |
477 | /// assert_eq!(expected, it.next()); |
478 | /// |
479 | /// let expected = Some(Ok(HalfMatch::must(0, 32))); |
480 | /// assert_eq!(expected, it.next()); |
481 | /// |
482 | /// let expected = None; |
483 | /// assert_eq!(expected, it.next()); |
484 | /// |
485 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |
486 | /// ``` |
487 | #[inline ] |
488 | pub fn into_half_matches_iter<F>( |
489 | self, |
490 | finder: F, |
491 | ) -> TryHalfMatchesIter<'h, F> |
492 | where |
493 | F: FnMut(&Input<'_>) -> Result<Option<HalfMatch>, MatchError>, |
494 | { |
495 | TryHalfMatchesIter { it: self, finder } |
496 | } |
497 | |
498 | /// Given a closure that executes a single search, return an iterator over |
499 | /// all successive non-overlapping matches. |
500 | /// |
501 | /// The iterator returned yields result values. If the underlying regex |
502 | /// engine is configured to never return an error, consider calling |
503 | /// [`TryMatchesIter::infallible`] to convert errors into panics. |
504 | /// |
505 | /// # Example |
506 | /// |
507 | /// This example shows how to use a `Searcher` to create a proper |
508 | /// iterator over matches. |
509 | /// |
510 | /// ``` |
511 | /// use regex_automata::{ |
512 | /// hybrid::regex::Regex, |
513 | /// util::iter::Searcher, |
514 | /// Match, Input, |
515 | /// }; |
516 | /// |
517 | /// let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}" )?; |
518 | /// let mut cache = re.create_cache(); |
519 | /// |
520 | /// let input = Input::new("2010-03-14 2016-10-08 2020-10-22" ); |
521 | /// let mut it = Searcher::new(input).into_matches_iter(|input| { |
522 | /// re.try_search(&mut cache, input) |
523 | /// }); |
524 | /// |
525 | /// let expected = Some(Ok(Match::must(0, 0..10))); |
526 | /// assert_eq!(expected, it.next()); |
527 | /// |
528 | /// let expected = Some(Ok(Match::must(0, 11..21))); |
529 | /// assert_eq!(expected, it.next()); |
530 | /// |
531 | /// let expected = Some(Ok(Match::must(0, 22..32))); |
532 | /// assert_eq!(expected, it.next()); |
533 | /// |
534 | /// let expected = None; |
535 | /// assert_eq!(expected, it.next()); |
536 | /// |
537 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |
538 | /// ``` |
539 | #[inline ] |
540 | pub fn into_matches_iter<F>(self, finder: F) -> TryMatchesIter<'h, F> |
541 | where |
542 | F: FnMut(&Input<'_>) -> Result<Option<Match>, MatchError>, |
543 | { |
544 | TryMatchesIter { it: self, finder } |
545 | } |
546 | |
547 | /// Given a closure that executes a single search, return an iterator over |
548 | /// all successive non-overlapping `Captures` values. |
549 | /// |
550 | /// The iterator returned yields result values. If the underlying regex |
551 | /// engine is configured to never return an error, consider calling |
552 | /// [`TryCapturesIter::infallible`] to convert errors into panics. |
553 | /// |
554 | /// Unlike the other iterator constructors, this accepts an initial |
555 | /// `Captures` value. This `Captures` value is reused for each search, and |
556 | /// the iterator implementation clones it before returning it. The caller |
557 | /// must provide this value because the iterator is purposely ignorant |
558 | /// of the underlying regex engine and thus doesn't know how to create |
559 | /// one itself. More to the point, a `Captures` value itself has a few |
560 | /// different constructors, which change which kind of information is |
561 | /// available to query in exchange for search performance. |
562 | /// |
563 | /// # Example |
564 | /// |
565 | /// This example shows how to use a `Searcher` to create a proper iterator |
566 | /// over `Captures` values, which provides access to all capturing group |
567 | /// spans for each match. |
568 | /// |
569 | /// ``` |
570 | /// use regex_automata::{ |
571 | /// nfa::thompson::pikevm::PikeVM, |
572 | /// util::iter::Searcher, |
573 | /// Input, |
574 | /// }; |
575 | /// |
576 | /// let re = PikeVM::new( |
577 | /// r"(?P<y>[0-9]{4})-(?P<m>[0-9]{2})-(?P<d>[0-9]{2})" , |
578 | /// )?; |
579 | /// let (mut cache, caps) = (re.create_cache(), re.create_captures()); |
580 | /// |
581 | /// let haystack = "2010-03-14 2016-10-08 2020-10-22" ; |
582 | /// let input = Input::new(haystack); |
583 | /// let mut it = Searcher::new(input) |
584 | /// .into_captures_iter(caps, |input, caps| { |
585 | /// re.search(&mut cache, input, caps); |
586 | /// Ok(()) |
587 | /// }); |
588 | /// |
589 | /// let got = it.next().expect("first date" )?; |
590 | /// let year = got.get_group_by_name("y" ).expect("must match" ); |
591 | /// assert_eq!("2010" , &haystack[year]); |
592 | /// |
593 | /// let got = it.next().expect("second date" )?; |
594 | /// let month = got.get_group_by_name("m" ).expect("must match" ); |
595 | /// assert_eq!("10" , &haystack[month]); |
596 | /// |
597 | /// let got = it.next().expect("third date" )?; |
598 | /// let day = got.get_group_by_name("d" ).expect("must match" ); |
599 | /// assert_eq!("22" , &haystack[day]); |
600 | /// |
601 | /// assert!(it.next().is_none()); |
602 | /// |
603 | /// # Ok::<(), Box<dyn std::error::Error>>(()) |
604 | /// ``` |
605 | #[cfg (feature = "alloc" )] |
606 | #[inline ] |
607 | pub fn into_captures_iter<F>( |
608 | self, |
609 | caps: Captures, |
610 | finder: F, |
611 | ) -> TryCapturesIter<'h, F> |
612 | where |
613 | F: FnMut(&Input<'_>, &mut Captures) -> Result<(), MatchError>, |
614 | { |
615 | TryCapturesIter { it: self, caps, finder } |
616 | } |
617 | |
618 | /// Handles the special case of a match that begins where the previous |
619 | /// match ended. Without this special handling, it'd be possible to get |
620 | /// stuck where an empty match never results in forward progress. This |
621 | /// also makes it more consistent with how presiding general purpose regex |
622 | /// engines work. |
623 | #[cold ] |
624 | #[inline (never)] |
625 | fn handle_overlapping_empty_half_match<F>( |
626 | &mut self, |
627 | _: HalfMatch, |
628 | mut finder: F, |
629 | ) -> Result<Option<HalfMatch>, MatchError> |
630 | where |
631 | F: FnMut(&Input<'_>) -> Result<Option<HalfMatch>, MatchError>, |
632 | { |
633 | // Since we are only here when 'm.offset()' matches the offset of the |
634 | // last match, it follows that this must have been an empty match. |
635 | // Since we both need to make progress *and* prevent overlapping |
636 | // matches, we discard this match and advance the search by 1. |
637 | // |
638 | // Note that this may start a search in the middle of a codepoint. The |
639 | // regex engines themselves are expected to deal with that and not |
640 | // report any matches within a codepoint if they are configured in |
641 | // UTF-8 mode. |
642 | self.input.set_start(self.input.start().checked_add(1).unwrap()); |
643 | finder(&self.input) |
644 | } |
645 | |
646 | /// Handles the special case of an empty match by ensuring that 1) the |
647 | /// iterator always advances and 2) empty matches never overlap with other |
648 | /// matches. |
649 | /// |
650 | /// (1) is necessary because we principally make progress by setting the |
651 | /// starting location of the next search to the ending location of the last |
652 | /// match. But if a match is empty, then this results in a search that does |
653 | /// not advance and thus does not terminate. |
654 | /// |
655 | /// (2) is not strictly necessary, but makes intuitive sense and matches |
656 | /// the presiding behavior of most general purpose regex engines. The |
657 | /// "intuitive sense" here is that we want to report NON-overlapping |
658 | /// matches. So for example, given the regex 'a|(?:)' against the haystack |
659 | /// 'a', without the special handling, you'd get the matches [0, 1) and [1, |
660 | /// 1), where the latter overlaps with the end bounds of the former. |
661 | /// |
662 | /// Note that we mark this cold and forcefully prevent inlining because |
663 | /// handling empty matches like this is extremely rare and does require |
664 | /// quite a bit of code, comparatively. Keeping this code out of the main |
665 | /// iterator function keeps it smaller and more amenable to inlining |
666 | /// itself. |
667 | #[cold ] |
668 | #[inline (never)] |
669 | fn handle_overlapping_empty_match<F>( |
670 | &mut self, |
671 | m: Match, |
672 | mut finder: F, |
673 | ) -> Result<Option<Match>, MatchError> |
674 | where |
675 | F: FnMut(&Input<'_>) -> Result<Option<Match>, MatchError>, |
676 | { |
677 | assert!(m.is_empty()); |
678 | self.input.set_start(self.input.start().checked_add(1).unwrap()); |
679 | finder(&self.input) |
680 | } |
681 | } |
682 | |
683 | /// An iterator over all non-overlapping half matches for a fallible search. |
684 | /// |
685 | /// The iterator yields a `Result<HalfMatch, MatchError>` value until no more |
686 | /// matches could be found. |
687 | /// |
688 | /// The type parameters are as follows: |
689 | /// |
690 | /// * `F` represents the type of a closure that executes the search. |
691 | /// |
692 | /// The lifetime parameters come from the [`Input`] type: |
693 | /// |
694 | /// * `'h` is the lifetime of the underlying haystack. |
695 | /// |
696 | /// When possible, prefer the iterators defined on the regex engine you're |
697 | /// using. This tries to abstract over the regex engine and is thus a bit more |
698 | /// unwieldy to use. |
699 | /// |
700 | /// This iterator is created by [`Searcher::into_half_matches_iter`]. |
701 | pub struct TryHalfMatchesIter<'h, F> { |
702 | it: Searcher<'h>, |
703 | finder: F, |
704 | } |
705 | |
706 | impl<'h, F> TryHalfMatchesIter<'h, F> { |
707 | /// Return an infallible version of this iterator. |
708 | /// |
709 | /// Any item yielded that corresponds to an error results in a panic. This |
710 | /// is useful if your underlying regex engine is configured in a way that |
711 | /// it is guaranteed to never return an error. |
712 | pub fn infallible(self) -> HalfMatchesIter<'h, F> { |
713 | HalfMatchesIter(self) |
714 | } |
715 | |
716 | /// Returns the current `Input` used by this iterator. |
717 | /// |
718 | /// The `Input` returned is generally equivalent to the one used to |
719 | /// construct this iterator, but its start position may be different to |
720 | /// reflect the start of the next search to be executed. |
721 | pub fn input<'i>(&'i self) -> &'i Input<'h> { |
722 | self.it.input() |
723 | } |
724 | } |
725 | |
726 | impl<'h, F> Iterator for TryHalfMatchesIter<'h, F> |
727 | where |
728 | F: FnMut(&Input<'_>) -> Result<Option<HalfMatch>, MatchError>, |
729 | { |
730 | type Item = Result<HalfMatch, MatchError>; |
731 | |
732 | #[inline ] |
733 | fn next(&mut self) -> Option<Result<HalfMatch, MatchError>> { |
734 | self.it.try_advance_half(&mut self.finder).transpose() |
735 | } |
736 | } |
737 | |
738 | impl<'h, F> core::fmt::Debug for TryHalfMatchesIter<'h, F> { |
739 | fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result { |
740 | f&mut DebugStruct<'_, '_>.debug_struct("TryHalfMatchesIter" ) |
741 | .field("it" , &self.it) |
742 | .field(name:"finder" , &"<closure>" ) |
743 | .finish() |
744 | } |
745 | } |
746 | |
747 | /// An iterator over all non-overlapping half matches for an infallible search. |
748 | /// |
749 | /// The iterator yields a [`HalfMatch`] value until no more matches could be |
750 | /// found. |
751 | /// |
752 | /// The type parameters are as follows: |
753 | /// |
754 | /// * `F` represents the type of a closure that executes the search. |
755 | /// |
756 | /// The lifetime parameters come from the [`Input`] type: |
757 | /// |
758 | /// * `'h` is the lifetime of the underlying haystack. |
759 | /// |
760 | /// When possible, prefer the iterators defined on the regex engine you're |
761 | /// using. This tries to abstract over the regex engine and is thus a bit more |
762 | /// unwieldy to use. |
763 | /// |
764 | /// This iterator is created by [`Searcher::into_half_matches_iter`] and |
765 | /// then calling [`TryHalfMatchesIter::infallible`]. |
766 | #[derive (Debug)] |
767 | pub struct HalfMatchesIter<'h, F>(TryHalfMatchesIter<'h, F>); |
768 | |
769 | impl<'h, F> HalfMatchesIter<'h, F> { |
770 | /// Returns the current `Input` used by this iterator. |
771 | /// |
772 | /// The `Input` returned is generally equivalent to the one used to |
773 | /// construct this iterator, but its start position may be different to |
774 | /// reflect the start of the next search to be executed. |
775 | pub fn input<'i>(&'i self) -> &'i Input<'h> { |
776 | self.0.it.input() |
777 | } |
778 | } |
779 | |
780 | impl<'h, F> Iterator for HalfMatchesIter<'h, F> |
781 | where |
782 | F: FnMut(&Input<'_>) -> Result<Option<HalfMatch>, MatchError>, |
783 | { |
784 | type Item = HalfMatch; |
785 | |
786 | #[inline ] |
787 | fn next(&mut self) -> Option<HalfMatch> { |
788 | match self.0.next()? { |
789 | Ok(m: HalfMatch) => Some(m), |
790 | Err(err: MatchError) => panic!( |
791 | "unexpected regex half find error: {}\n\ |
792 | to handle find errors, use 'try' or 'search' methods" , |
793 | err, |
794 | ), |
795 | } |
796 | } |
797 | } |
798 | |
799 | /// An iterator over all non-overlapping matches for a fallible search. |
800 | /// |
801 | /// The iterator yields a `Result<Match, MatchError>` value until no more |
802 | /// matches could be found. |
803 | /// |
804 | /// The type parameters are as follows: |
805 | /// |
806 | /// * `F` represents the type of a closure that executes the search. |
807 | /// |
808 | /// The lifetime parameters come from the [`Input`] type: |
809 | /// |
810 | /// * `'h` is the lifetime of the underlying haystack. |
811 | /// |
812 | /// When possible, prefer the iterators defined on the regex engine you're |
813 | /// using. This tries to abstract over the regex engine and is thus a bit more |
814 | /// unwieldy to use. |
815 | /// |
816 | /// This iterator is created by [`Searcher::into_matches_iter`]. |
817 | pub struct TryMatchesIter<'h, F> { |
818 | it: Searcher<'h>, |
819 | finder: F, |
820 | } |
821 | |
822 | impl<'h, F> TryMatchesIter<'h, F> { |
823 | /// Return an infallible version of this iterator. |
824 | /// |
825 | /// Any item yielded that corresponds to an error results in a panic. This |
826 | /// is useful if your underlying regex engine is configured in a way that |
827 | /// it is guaranteed to never return an error. |
828 | pub fn infallible(self) -> MatchesIter<'h, F> { |
829 | MatchesIter(self) |
830 | } |
831 | |
832 | /// Returns the current `Input` used by this iterator. |
833 | /// |
834 | /// The `Input` returned is generally equivalent to the one used to |
835 | /// construct this iterator, but its start position may be different to |
836 | /// reflect the start of the next search to be executed. |
837 | pub fn input<'i>(&'i self) -> &'i Input<'h> { |
838 | self.it.input() |
839 | } |
840 | } |
841 | |
842 | impl<'h, F> Iterator for TryMatchesIter<'h, F> |
843 | where |
844 | F: FnMut(&Input<'_>) -> Result<Option<Match>, MatchError>, |
845 | { |
846 | type Item = Result<Match, MatchError>; |
847 | |
848 | #[inline ] |
849 | fn next(&mut self) -> Option<Result<Match, MatchError>> { |
850 | self.it.try_advance(&mut self.finder).transpose() |
851 | } |
852 | } |
853 | |
854 | impl<'h, F> core::fmt::Debug for TryMatchesIter<'h, F> { |
855 | fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result { |
856 | f&mut DebugStruct<'_, '_>.debug_struct("TryMatchesIter" ) |
857 | .field("it" , &self.it) |
858 | .field(name:"finder" , &"<closure>" ) |
859 | .finish() |
860 | } |
861 | } |
862 | |
863 | /// An iterator over all non-overlapping matches for an infallible search. |
864 | /// |
865 | /// The iterator yields a [`Match`] value until no more matches could be found. |
866 | /// |
867 | /// The type parameters are as follows: |
868 | /// |
869 | /// * `F` represents the type of a closure that executes the search. |
870 | /// |
871 | /// The lifetime parameters come from the [`Input`] type: |
872 | /// |
873 | /// * `'h` is the lifetime of the underlying haystack. |
874 | /// |
875 | /// When possible, prefer the iterators defined on the regex engine you're |
876 | /// using. This tries to abstract over the regex engine and is thus a bit more |
877 | /// unwieldy to use. |
878 | /// |
879 | /// This iterator is created by [`Searcher::into_matches_iter`] and |
880 | /// then calling [`TryMatchesIter::infallible`]. |
881 | #[derive (Debug)] |
882 | pub struct MatchesIter<'h, F>(TryMatchesIter<'h, F>); |
883 | |
884 | impl<'h, F> MatchesIter<'h, F> { |
885 | /// Returns the current `Input` used by this iterator. |
886 | /// |
887 | /// The `Input` returned is generally equivalent to the one used to |
888 | /// construct this iterator, but its start position may be different to |
889 | /// reflect the start of the next search to be executed. |
890 | pub fn input<'i>(&'i self) -> &'i Input<'h> { |
891 | self.0.it.input() |
892 | } |
893 | } |
894 | |
895 | impl<'h, F> Iterator for MatchesIter<'h, F> |
896 | where |
897 | F: FnMut(&Input<'_>) -> Result<Option<Match>, MatchError>, |
898 | { |
899 | type Item = Match; |
900 | |
901 | #[inline ] |
902 | fn next(&mut self) -> Option<Match> { |
903 | match self.0.next()? { |
904 | Ok(m: Match) => Some(m), |
905 | Err(err: MatchError) => panic!( |
906 | "unexpected regex find error: {}\n\ |
907 | to handle find errors, use 'try' or 'search' methods" , |
908 | err, |
909 | ), |
910 | } |
911 | } |
912 | } |
913 | |
914 | /// An iterator over all non-overlapping captures for a fallible search. |
915 | /// |
916 | /// The iterator yields a `Result<Captures, MatchError>` value until no more |
917 | /// matches could be found. |
918 | /// |
919 | /// The type parameters are as follows: |
920 | /// |
921 | /// * `F` represents the type of a closure that executes the search. |
922 | /// |
923 | /// The lifetime parameters come from the [`Input`] type: |
924 | /// |
925 | /// * `'h` is the lifetime of the underlying haystack. |
926 | /// |
927 | /// When possible, prefer the iterators defined on the regex engine you're |
928 | /// using. This tries to abstract over the regex engine and is thus a bit more |
929 | /// unwieldy to use. |
930 | /// |
931 | /// This iterator is created by [`Searcher::into_captures_iter`]. |
932 | #[cfg (feature = "alloc" )] |
933 | pub struct TryCapturesIter<'h, F> { |
934 | it: Searcher<'h>, |
935 | caps: Captures, |
936 | finder: F, |
937 | } |
938 | |
939 | #[cfg (feature = "alloc" )] |
940 | impl<'h, F> TryCapturesIter<'h, F> { |
941 | /// Return an infallible version of this iterator. |
942 | /// |
943 | /// Any item yielded that corresponds to an error results in a panic. This |
944 | /// is useful if your underlying regex engine is configured in a way that |
945 | /// it is guaranteed to never return an error. |
946 | pub fn infallible(self) -> CapturesIter<'h, F> { |
947 | CapturesIter(self) |
948 | } |
949 | } |
950 | |
951 | #[cfg (feature = "alloc" )] |
952 | impl<'h, F> Iterator for TryCapturesIter<'h, F> |
953 | where |
954 | F: FnMut(&Input<'_>, &mut Captures) -> Result<(), MatchError>, |
955 | { |
956 | type Item = Result<Captures, MatchError>; |
957 | |
958 | #[inline ] |
959 | fn next(&mut self) -> Option<Result<Captures, MatchError>> { |
960 | let TryCapturesIter { ref mut it: &mut Searcher<'_>, ref mut caps: &mut Captures, ref mut finder: &mut F } = |
961 | *self; |
962 | let result: Result = itResult |
963 | .try_advance(|input: &Input<'_>| { |
964 | (finder)(input, caps)?; |
965 | Ok(caps.get_match()) |
966 | }) |
967 | .transpose()?; |
968 | match result { |
969 | Ok(_) => Some(Ok(caps.clone())), |
970 | Err(err: MatchError) => Some(Err(err)), |
971 | } |
972 | } |
973 | } |
974 | |
975 | #[cfg (feature = "alloc" )] |
976 | impl<'h, F> core::fmt::Debug for TryCapturesIter<'h, F> { |
977 | fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result { |
978 | f&mut DebugStruct<'_, '_>.debug_struct("TryCapturesIter" ) |
979 | .field("it" , &self.it) |
980 | .field("caps" , &self.caps) |
981 | .field(name:"finder" , &"<closure>" ) |
982 | .finish() |
983 | } |
984 | } |
985 | |
986 | /// An iterator over all non-overlapping captures for an infallible search. |
987 | /// |
988 | /// The iterator yields a [`Captures`] value until no more matches could be |
989 | /// found. |
990 | /// |
991 | /// The type parameters are as follows: |
992 | /// |
993 | /// * `F` represents the type of a closure that executes the search. |
994 | /// |
995 | /// The lifetime parameters come from the [`Input`] type: |
996 | /// |
997 | /// * `'h` is the lifetime of the underlying haystack. |
998 | /// |
999 | /// When possible, prefer the iterators defined on the regex engine you're |
1000 | /// using. This tries to abstract over the regex engine and is thus a bit more |
1001 | /// unwieldy to use. |
1002 | /// |
1003 | /// This iterator is created by [`Searcher::into_captures_iter`] and then |
1004 | /// calling [`TryCapturesIter::infallible`]. |
1005 | #[cfg (feature = "alloc" )] |
1006 | #[derive (Debug)] |
1007 | pub struct CapturesIter<'h, F>(TryCapturesIter<'h, F>); |
1008 | |
1009 | #[cfg (feature = "alloc" )] |
1010 | impl<'h, F> Iterator for CapturesIter<'h, F> |
1011 | where |
1012 | F: FnMut(&Input<'_>, &mut Captures) -> Result<(), MatchError>, |
1013 | { |
1014 | type Item = Captures; |
1015 | |
1016 | #[inline ] |
1017 | fn next(&mut self) -> Option<Captures> { |
1018 | match self.0.next()? { |
1019 | Ok(m: Captures) => Some(m), |
1020 | Err(err: MatchError) => panic!( |
1021 | "unexpected regex captures error: {}\n\ |
1022 | to handle find errors, use 'try' or 'search' methods" , |
1023 | err, |
1024 | ), |
1025 | } |
1026 | } |
1027 | } |
1028 | |