/*!
This module provides helper routines for dealing with zero-width matches.

The main problem being solved here is this:

1. The caller wants to search something that they know is valid UTF-8, such
as a Rust `&str`.
2. The regex used by the caller can match the empty string. For example, `a*`.
3. The caller should never get match offsets returned that occur within the
encoding of a UTF-8 codepoint. It is logically incorrect, and also means that,
e.g., slicing the `&str` at those offsets will lead to a panic.

So the question here is, how do we prevent the caller from getting match
offsets that split a codepoint? For example, strictly speaking, the regex `a*`
matches `☃` at the positions `[0, 0]`, `[1, 1]`, `[2, 2]` and `[3, 3]` since
the UTF-8 encoding of `☃` is `\xE2\x98\x83`. In particular, the `NFA` that
underlies all of the matching engines in this crate doesn't have anything in
its state graph that prevents matching between UTF-8 code units. Indeed, any
engine derived from the `NFA` will match at those positions by virtue of the
fact that the `NFA` is byte oriented. That is, its transitions are defined over
bytes and the matching engines work by proceeding one byte at a time.

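The boundary structure of `☃` can be seen with a short, std-only sketch:
offsets `1` and `2` fall inside the codepoint's encoding, so a byte-oriented
engine has no inherent reason to avoid reporting empty matches there:

```rust
fn main() {
    let snowman = "☃";
    // The UTF-8 encoding of '☃' is the three bytes \xE2\x98\x83.
    assert_eq!(snowman.as_bytes(), &[0xE2, 0x98, 0x83]);
    // Offsets 0 and 3 are valid char boundaries, but offsets 1 and 2
    // fall within the codepoint's encoding. Slicing a `&str` at those
    // offsets panics, which is why they must never be match offsets.
    let boundaries: Vec<bool> =
        (0..=3).map(|i| snowman.is_char_boundary(i)).collect();
    assert_eq!(boundaries, vec![true, false, false, true]);
}
```
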
(An alternative architecture would be to define the transitions in an `NFA`
over codepoints, or `char`. And then make the matching engines proceed by
decoding one codepoint at a time. This is a viable strategy, but it doesn't
work for DFA matching engines because designing a fast and memory efficient
transition table for an alphabet as large as Unicode is quite difficult. More
to the point, the top-level `regex` crate supports matching on arbitrary bytes
when Unicode mode is disabled and one is searching a `&[u8]`. So in that case,
you can't just limit yourself to decoding codepoints and matching those. You
really do need to be able to follow byte oriented transitions on the `NFA`.)

In an older version of the regex crate, we handled this case not in the regex
engine, but in the iterators over matches. Namely, since this case only arises
when the match is empty, we "just" incremented the next starting position
of the search by `N`, where `N` is the length of the codepoint encoded at
the current position. The alternative or more "natural" solution of just
incrementing by `1` would result in executing a search of `a*` on `☃` like
this:

* Start search at `0`.
* Found match at `[0, 0]`.
* Next start position is `0`.
* To avoid an infinite loop, since it's an empty match, increment by `1`.
* Start search at `1`.
* Found match at `[1, 1]`. Oops.

But if we instead incremented by `3` (the length in bytes of `☃`), then we get
the following:

* Start search at `0`.
* Found match at `[0, 0]`.
* Next start position is `0`.
* To avoid an infinite loop, since it's an empty match, increment by `3`.
* Start search at `3`.
* Found match at `[3, 3]`.

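The increment-by-codepoint-length strategy can be sketched with a std-only
toy. This is not the old crate's actual iterator; the body of the loop stands
in for a real engine and reports an empty match wherever the search starts,
mimicking `a*` on a haystack with no `a`s:

```rust
fn main() {
    let haystack = "☃☃";
    let mut at = 0;
    let mut matches = Vec::new();
    while at <= haystack.len() {
        // Stand-in engine: `a*` matches the empty string at `at`.
        matches.push((at, at));
        // Advance by the length of the codepoint at `at` (or by 1 once
        // we're past the last one), so the next empty match can't split
        // a codepoint.
        at += haystack[at..].chars().next().map_or(1, |ch| ch.len_utf8());
    }
    // Only codepoint-aligned empty matches are produced.
    assert_eq!(matches, vec![(0, 0), (3, 3), (6, 6)]);
}
```
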
And we get the correct result. But does this technique work in all cases?
Crucially, it requires that a zero-width match that splits a codepoint never
occurs beyond the starting position of the search. Because if it did, merely
incrementing the start position by the number of bytes in the codepoint at
the current position wouldn't be enough. A zero-width match could just occur
anywhere. It turns out that this is _almost_ true. We can convince ourselves
by looking at all possible patterns that can match the empty string:

* Patterns like `a*`, `a{0}`, `(?:)`, `a|` and `|a` all unconditionally match
the empty string. That is, assuming there isn't an `a` at the current position,
they will all match the empty string at the start of a search. There is no way
to move past it because any other match would not be "leftmost."
* `^` only matches at the beginning of the haystack, where the start position
is `0`. Since we know we're searching valid UTF-8 (if it isn't valid UTF-8,
then this entire problem goes away because it implies your string type supports
invalid UTF-8 and thus must deal with offsets that not only split a codepoint
but occur in entirely invalid UTF-8 somehow), it follows that `^` never matches
between the code units of a codepoint because the start of a valid UTF-8 string
is never within the encoding of a codepoint.
* `$` follows basically the same logic as `^`, but for the end of a string. A
valid UTF-8 string can't have an incomplete codepoint at the end of it.
* `(?m:^)` follows similarly to `^`, but it can match immediately following
a `\n`. However, since a `\n` is always a codepoint itself and can never
appear within a codepoint, it follows that the position immediately following
a `\n` in a string that is valid UTF-8 is guaranteed to not be between the
code units of another codepoint. (One caveat here is that the line terminator
for multi-line anchors can now be changed to any arbitrary byte, including
things like `\x98` which might occur within a codepoint. However, this wasn't
supported by the old regex crate. If it was, it would pose the same problems
as `(?-u:\B)`, as we'll discuss below.)
* `(?m:$)` follows a similar argument as for `(?m:^)`. The only difference is
that `(?m:$)` matches just before a `\n`. But the same argument applies.
* `(?Rm:^)` and `(?Rm:$)` weren't supported by the old regex crate, but the
CRLF aware line anchors follow a similar argument as for `(?m:^)` and `(?m:$)`.
Namely, they only ever match at a boundary where one side is either a `\r` or
a `\n`, neither of which can occur within a codepoint.
* `\b` only matches at positions where both sides are valid codepoints, so
this cannot split a codepoint.
* `\B`, like `\b`, also only matches at positions where both sides are valid
codepoints. So this cannot split a codepoint either.
* `(?-u:\b)` matches only at positions where at least one side of it is an
ASCII word byte. Since ASCII bytes cannot appear as code units in non-ASCII
codepoints (one of the many amazing qualities of UTF-8), it follows that this
too cannot split a codepoint.
* `(?-u:\B)` finally represents a problem. It can match between *any* two
bytes that are either both word bytes or both non-word bytes. Since code units
like `\xE2` and `\x98` (from the UTF-8 encoding of `☃`) are both non-word
bytes, `(?-u:\B)` will match at the position between them.

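The UTF-8 property relied on above, that ASCII bytes never appear inside a
multi-byte encoding, is easy to verify in a std-only sketch:

```rust
fn main() {
    // Every byte of a multi-byte UTF-8 encoding has its high bit set,
    // so an ASCII byte (< 0x80) can never appear within the encoding
    // of a non-ASCII codepoint. This is why `(?-u:\b)` is safe: at
    // least one side of any match must be an ASCII word byte.
    for ch in ['☃', 'é', '𝛃'] {
        let mut buf = [0u8; 4];
        for &b in ch.encode_utf8(&mut buf).as_bytes() {
            assert!(b >= 0x80);
        }
    }
}
```
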
Thus, our approach of incrementing one codepoint at a time after seeing an
empty match is flawed because `(?-u:\B)` can result in an empty match that
splits a codepoint at a position past the starting point of a search. For
example, searching `(?-u:\B)` on `a☃` would produce the following matches:
`[2, 2]`, `[3, 3]` and `[4, 4]`. The positions at `0` and `1` don't match
because they correspond to word boundaries since `a` is an ASCII word byte.

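Those positions can be checked with a std-only sketch that applies the ASCII
`\B` rule by hand to the bytes of `a☃` (treating positions past either end of
the haystack as non-word):

```rust
fn main() {
    // `a☃` as bytes: [0x61, 0xE2, 0x98, 0x83].
    let hay = "a☃".as_bytes();
    let is_word = |b: u8| b == b'_' || b.is_ascii_alphanumeric();
    // `(?-u:\B)` matches at a position when both sides agree on
    // "wordness"; positions past either end count as non-word.
    let matches: Vec<usize> = (0..=hay.len())
        .filter(|&at| {
            let left = at.checked_sub(1).map_or(false, |i| is_word(hay[i]));
            let right = hay.get(at).map_or(false, |&b| is_word(b));
            left == right
        })
        .collect();
    // Offsets 2 and 3 split the encoding of '☃'.
    assert_eq!(matches, vec![2, 3, 4]);
}
```
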
So what did the old regex crate do to avoid this? It banned `(?-u:\B)` from
regexes that could match `&str`. That might sound extreme, but a lot of other
things were banned too. For example, all of `(?-u:.)`, `(?-u:[^a])` and
`(?-u:\W)` can match invalid UTF-8 too, including individual code units within
a codepoint. The key difference is that those expressions could never produce
an empty match. That ban happens when translating an `Ast` to an `Hir`,
because that process can reason about whether an `Hir` can produce *non-empty*
matches at invalid UTF-8 boundaries. Bottom line though is that we side-stepped
the `(?-u:\B)` issue by banning it.
123
124If banning `(?-u:\B)` were the only issue with the old regex crate's approach,
125then I probably would have kept it. `\B` is rarely used, so it's not such a big
126deal to have to work-around it. However, the problem with the above approach
127is that it doesn't compose. The logic for avoiding splitting a codepoint only
128lived in the iterator, which means if anyone wants to implement their own
129iterator over regex matches, they have to deal with this extremely subtle edge
130case to get full correctness.
131
132Instead, in this crate, we take the approach of pushing this complexity down
133to the lowest layers of each regex engine. The approach is pretty simple:
134
135* If this corner case doesn't apply, don't do anything. (For example, if UTF-8
136mode isn't enabled or if the regex cannot match the empty string.)
137* If an empty match is reported, explicitly check if it splits a codepoint.
138* If it doesn't, we're done, return the match.
139* If it does, then ignore the match and re-run the search.
140* Repeat the above process until the end of the haystack is reached or a match
141is found that doesn't split a codepoint or isn't zero width.
142
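The retry loop above can be sketched with a std-only toy. The names here are
hypothetical, not this module's actual API, and the `find` closure stands in
for a real regex engine; it reports an empty match wherever the search starts,
mimicking `(?-u:\B)` landing inside a codepoint:

```rust
// A minimal sketch of the retry loop described above.
fn retry_past_splits(
    haystack: &str,
    mut start: usize,
    find: impl Fn(usize) -> Option<(usize, usize)>,
) -> Option<(usize, usize)> {
    loop {
        let (s, e) = find(start)?;
        // A non-empty match, or an empty match on a char boundary, is fine.
        if s != e || haystack.is_char_boundary(e) {
            return Some((s, e));
        }
        // Otherwise, skip past the offending position and retry.
        start = e + 1;
        if start > haystack.len() {
            return None;
        }
    }
}

fn main() {
    // Hypothetical engine: an empty match wherever the search starts.
    let find = |at: usize| if at <= 3 { Some((at, at)) } else { None };
    // Starting inside '☃' at offset 1, the wrapper retries until it
    // reaches offset 3, a valid boundary.
    assert_eq!(retry_past_splits("☃", 1, find), Some((3, 3)));
}
```
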
And that's pretty much what this module provides. Every regex engine uses these
methods in their lowest level public APIs, but just above the layer where
their internal engine is used. That way, all regex engines can be arbitrarily
composed without worrying about handling this case, and iterators don't need to
handle it explicitly.

(It turns out that a new feature I added, support for changing the line
terminator in a regex to any arbitrary byte, also provokes the above problem.
Namely, the byte could be invalid UTF-8 or a UTF-8 continuation byte. So that
support would need to be limited or banned when UTF-8 mode is enabled, just
like we did for `(?-u:\B)`. But thankfully our more robust approach in this
crate handles that case just fine too.)
*/

use crate::util::search::{Input, MatchError};

#[cold]
#[inline(never)]
pub(crate) fn skip_splits_fwd<T, F>(
    input: &Input<'_>,
    init_value: T,
    match_offset: usize,
    find: F,
) -> Result<Option<T>, MatchError>
where
    F: FnMut(&Input<'_>) -> Result<Option<(T, usize)>, MatchError>,
{
    skip_splits(true, input, init_value, match_offset, find)
}

#[cold]
#[inline(never)]
pub(crate) fn skip_splits_rev<T, F>(
    input: &Input<'_>,
    init_value: T,
    match_offset: usize,
    find: F,
) -> Result<Option<T>, MatchError>
where
    F: FnMut(&Input<'_>) -> Result<Option<(T, usize)>, MatchError>,
{
    skip_splits(false, input, init_value, match_offset, find)
}

fn skip_splits<T, F>(
    forward: bool,
    input: &Input<'_>,
    init_value: T,
    mut match_offset: usize,
    mut find: F,
) -> Result<Option<T>, MatchError>
where
    F: FnMut(&Input<'_>) -> Result<Option<(T, usize)>, MatchError>,
{
    // If our config says to do an anchored search, then we're definitely
    // done. We just need to determine whether we have a valid match or
    // not. If we don't, then we're not allowed to continue, so we report
    // no match.
    //
    // This is actually quite a subtle correctness thing. The key here is
    // that if we got an empty match that splits a codepoint after doing an
    // anchored search in UTF-8 mode, then that implies that we must have
    // *started* the search at a location that splits a codepoint. This
    // follows from the fact that if a match is reported from an anchored
    // search, then the start offset of the match *must* match the start
    // offset of the search.
    //
    // It also follows that no other non-empty match is possible. For
    // example, you might write a regex like '(?:)|SOMETHING' and start its
    // search in the middle of a codepoint. The first branch is an empty
    // regex that will bubble up a match at the first position, and then
    // get rejected here and report no match. But what if 'SOMETHING' could
    // have matched? We reason that such a thing is impossible, because
    // if it does, it must report a match that starts in the middle of a
    // codepoint. This in turn implies that a match is reported whose span
    // does not correspond to valid UTF-8, and this breaks the promise
    // made when UTF-8 mode is enabled. (That promise *can* be broken, for
    // example, by enabling UTF-8 mode but building an NFA by hand that
    // produces non-empty matches that span invalid UTF-8. This is an
    // unchecked precondition violation of UTF-8 mode, and is documented
    // to have unspecified behavior.)
    //
    // I believe this actually means that if an anchored search is run, and
    // UTF-8 mode is enabled and the start position splits a codepoint,
    // then it is correct to immediately report no match without even
    // executing the regex engine. But it doesn't really seem worth writing
    // out that case in every regex engine to save a tiny bit of work in an
    // extremely pathological case, so we just handle it here.
    if input.get_anchored().is_anchored() {
        return Ok(if input.is_char_boundary(match_offset) {
            Some(init_value)
        } else {
            None
        });
    }
    // Otherwise, we have an unanchored search, so just keep looking for
    // matches until we have one that does not split a codepoint or we hit
    // EOI.
    let mut value = init_value;
    let mut input = input.clone();
    while !input.is_char_boundary(match_offset) {
        if forward {
            // The unwrap is OK here because overflowing usize while
            // iterating over a slice is impossible, as it would require
            // a slice of length greater than isize::MAX, which is itself
            // impossible.
            input.set_start(input.start().checked_add(1).unwrap());
        } else {
            input.set_end(match input.end().checked_sub(1) {
                None => return Ok(None),
                Some(end) => end,
            });
        }
        match find(&input)? {
            None => return Ok(None),
            Some((new_value, new_match_end)) => {
                value = new_value;
                match_offset = new_match_end;
            }
        }
    }
    Ok(Some(value))
}