1/*
2I've called the primary data structure in this module a "range trie." As far
3as I can tell, there is no prior art on a data structure like this, however,
4it's likely someone somewhere has built something like it. Searching for
5"range trie" turns up the paper "Range Tries for Scalable Address Lookup,"
6but it does not appear relevant.
7
8The range trie is just like a trie in that it is a special case of a
9deterministic finite state machine. It has states and each state has a set
10of transitions to other states. It is acyclic, and, like a normal trie,
11it makes no attempt to reuse common suffixes among its elements. The key
12difference between a normal trie and a range trie below is that a range trie
13operates on *contiguous sequences* of bytes instead of singleton bytes.
14One could say say that our alphabet is ranges of bytes instead of bytes
15themselves, except a key part of range trie construction is splitting ranges
16apart to ensure there is at most one transition that can be taken for any
17byte in a given state.
18
19I've tried to explain the details of how the range trie works below, so
20for now, we are left with trying to understand what problem we're trying to
21solve. Which is itself fairly involved!
22
23At the highest level, here's what we want to do. We want to convert a
24sequence of Unicode codepoints into a finite state machine whose transitions
25are over *bytes* and *not* Unicode codepoints. We want this because it makes
26said finite state machines much smaller and much faster to execute. As a
27simple example, consider a byte oriented automaton for all Unicode scalar
28values (0x00 through 0x10FFFF, not including surrogate codepoints):
29
30 [00-7F]
31 [C2-DF][80-BF]
32 [E0-E0][A0-BF][80-BF]
33 [E1-EC][80-BF][80-BF]
34 [ED-ED][80-9F][80-BF]
35 [EE-EF][80-BF][80-BF]
36 [F0-F0][90-BF][80-BF][80-BF]
37 [F1-F3][80-BF][80-BF][80-BF]
38 [F4-F4][80-8F][80-BF][80-BF]
39
40(These byte ranges are generated via the regex-syntax::utf8 module, which
41was based on Russ Cox's code in RE2, which was in turn based on Ken
42Thompson's implementation of the same idea in his Plan9 implementation of
43grep.)
44
45It should be fairly straight-forward to see how one could compile this into
46a DFA. The sequences are sorted and non-overlapping. Essentially, you could
47build a trie from this fairly easy. The problem comes when your initial
48range (in this case, 0x00-0x10FFFF) isn't so nice. For example, the class
49represented by '\w' contains only a tenth of the codepoints that
500x00-0x10FFFF contains, but if we were to write out the byte based ranges
51as we did above, the list would stretch to 892 entries! This turns into
52quite a large NFA with a few thousand states. Turning this beast into a DFA
53takes quite a bit of time. We are thus left with trying to trim down the
54number of states we produce as early as possible.
55
56One approach (used by RE2 and still by the regex crate, at time of writing)
57is to try to find common suffixes while building NFA states for the above
58and reuse them. This is very cheap to do and one can control precisely how
59much extra memory you want to use for the cache.
60
61Another approach, however, is to reuse an algorithm for constructing a
62*minimal* DFA from a sorted sequence of inputs. I don't want to go into
63the full details here, but I explain it in more depth in my blog post on
64FSTs[1]. Note that the algorithm was not invented by me, but was published
65in paper by Daciuk et al. in 2000 called "Incremental Construction of
66MinimalAcyclic Finite-State Automata." Like the suffix cache approach above,
67it is also possible to control the amount of extra memory one uses, although
68this usually comes with the cost of sacrificing true minimality. (But it's
69typically close enough with a reasonably sized cache of states.)
70
71The catch is that Daciuk's algorithm only works if you add your keys in
72lexicographic ascending order. In our case, since we're dealing with ranges,
73we also need the additional requirement that ranges are either equivalent
74or do not overlap at all. For example, if one were given the following byte
75ranges:
76
77 [BC-BF][80-BF]
78 [BC-BF][90-BF]
79
80Then Daciuk's algorithm would not work, since there is nothing to handle the
81fact that the ranges overlap. They would need to be split apart. Thankfully,
82Thompson's algorithm for producing byte ranges for Unicode codepoint ranges
83meets both of our requirements. (A proof for this eludes me, but it appears
84true.)
85
86... however, we would also like to be able to compile UTF-8 automata in
87reverse. We want this because in order to find the starting location of a
88match using a DFA, we need to run a second DFA---a reversed version of the
89forward DFA---backwards to discover the match location. Unfortunately, if
90we reverse our byte sequences for 0x00-0x10FFFF, we get sequences that are
91can overlap, even if they are sorted:
92
93 [00-7F]
94 [80-BF][80-9F][ED-ED]
95 [80-BF][80-BF][80-8F][F4-F4]
96 [80-BF][80-BF][80-BF][F1-F3]
97 [80-BF][80-BF][90-BF][F0-F0]
98 [80-BF][80-BF][E1-EC]
99 [80-BF][80-BF][EE-EF]
100 [80-BF][A0-BF][E0-E0]
101 [80-BF][C2-DF]
102
103For example, '[80-BF][80-BF][EE-EF]' and '[80-BF][A0-BF][E0-E0]' have
104overlapping ranges between '[80-BF]' and '[A0-BF]'. Thus, there is no
105simple way to apply Daciuk's algorithm.
106
107And thus, the range trie was born. The range trie's only purpose is to take
108sequences of byte ranges like the ones above, collect them into a trie and then
109spit them out in a sorted fashion with no overlapping ranges. For example,
1100x00-0x10FFFF gets translated to:
111
112 [0-7F]
113 [80-BF][80-9F][80-8F][F1-F3]
114 [80-BF][80-9F][80-8F][F4]
115 [80-BF][80-9F][90-BF][F0]
116 [80-BF][80-9F][90-BF][F1-F3]
117 [80-BF][80-9F][E1-EC]
118 [80-BF][80-9F][ED]
119 [80-BF][80-9F][EE-EF]
120 [80-BF][A0-BF][80-8F][F1-F3]
121 [80-BF][A0-BF][80-8F][F4]
122 [80-BF][A0-BF][90-BF][F0]
123 [80-BF][A0-BF][90-BF][F1-F3]
124 [80-BF][A0-BF][E0]
125 [80-BF][A0-BF][E1-EC]
126 [80-BF][A0-BF][EE-EF]
127 [80-BF][C2-DF]
128
129We've thus satisfied our requirements for running Daciuk's algorithm. All
130sequences of ranges are sorted, and any corresponding ranges are either
131exactly equivalent or non-overlapping.
132
133In effect, a range trie is building a DFA from a sequence of arbitrary byte
134ranges. But it uses an algorithm custom tailored to its input, so it is not as
135costly as traditional DFA construction. While it is still quite a bit more
136costly than the forward case (which only needs Daciuk's algorithm), it winds
137up saving a substantial amount of time if one is doing a full DFA powerset
138construction later by virtue of producing a much much smaller NFA.
139
140[1] - https://blog.burntsushi.net/transducers/
141[2] - https://www.mitpressjournals.org/doi/pdfplus/10.1162/089120100561601
142*/
143
144use core::{cell::RefCell, convert::TryFrom, fmt, mem, ops::RangeInclusive};
145
146use alloc::{format, string::String, vec, vec::Vec};
147
148use regex_syntax::utf8::Utf8Range;
149
150use crate::util::primitives::StateID;
151
152/// There is only one final state in this trie. Every sequence of byte ranges
153/// added shares the same final state.
154const FINAL: StateID = StateID::ZERO;
155
156/// The root state of the trie.
157const ROOT: StateID = StateID::new_unchecked(1);
158
159/// A range trie represents an ordered set of sequences of bytes.
160///
161/// A range trie accepts as input a sequence of byte ranges and merges
162/// them into the existing set such that the trie can produce a sorted
163/// non-overlapping sequence of byte ranges. The sequence emitted corresponds
164/// precisely to the sequence of bytes matched by the given keys, although the
165/// byte ranges themselves may be split at different boundaries.
166///
167/// The order complexity of this data structure seems difficult to analyze.
168/// If the size of a byte is held as a constant, then insertion is clearly
169/// O(n) where n is the number of byte ranges in the input key. However, if
170/// k=256 is our alphabet size, then insertion could be O(k^2 * n). In
171/// particular it seems possible for pathological inputs to cause insertion
172/// to do a lot of work. However, for what we use this data structure for,
173/// there should be no pathological inputs since the ultimate source is always
174/// a sorted set of Unicode scalar value ranges.
175///
176/// Internally, this trie is setup like a finite state machine. Note though
177/// that it is acyclic.
178#[derive(Clone)]
179pub struct RangeTrie {
180 /// The states in this trie. The first is always the shared final state.
181 /// The second is always the root state. Otherwise, there is no
182 /// particular order.
183 states: Vec<State>,
184 /// A free-list of states. When a range trie is cleared, all of its states
185 /// are added to this list. Creating a new state reuses states from this
186 /// list before allocating a new one.
187 free: Vec<State>,
188 /// A stack for traversing this trie to yield sequences of byte ranges in
189 /// lexicographic order.
190 iter_stack: RefCell<Vec<NextIter>>,
191 /// A buffer that stores the current sequence during iteration.
192 iter_ranges: RefCell<Vec<Utf8Range>>,
193 /// A stack used for traversing the trie in order to (deeply) duplicate
194 /// a state. States are recursively duplicated when ranges are split.
195 dupe_stack: Vec<NextDupe>,
196 /// A stack used for traversing the trie during insertion of a new
197 /// sequence of byte ranges.
198 insert_stack: Vec<NextInsert>,
199}
200
201/// A single state in this trie.
202#[derive(Clone)]
203struct State {
204 /// A sorted sequence of non-overlapping transitions to other states. Each
205 /// transition corresponds to a single range of bytes.
206 transitions: Vec<Transition>,
207}
208
209/// A transition is a single range of bytes. If a particular byte is in this
210/// range, then the corresponding machine may transition to the state pointed
211/// to by `next_id`.
212#[derive(Clone)]
213struct Transition {
214 /// The byte range.
215 range: Utf8Range,
216 /// The next state to transition to.
217 next_id: StateID,
218}
219
220impl RangeTrie {
221 /// Create a new empty range trie.
222 pub fn new() -> RangeTrie {
223 let mut trie = RangeTrie {
224 states: vec![],
225 free: vec![],
226 iter_stack: RefCell::new(vec![]),
227 iter_ranges: RefCell::new(vec![]),
228 dupe_stack: vec![],
229 insert_stack: vec![],
230 };
231 trie.clear();
232 trie
233 }
234
235 /// Clear this range trie such that it is empty. Clearing a range trie
236 /// and reusing it can beneficial because this may reuse allocations.
237 pub fn clear(&mut self) {
238 self.free.extend(self.states.drain(..));
239 self.add_empty(); // final
240 self.add_empty(); // root
241 }
242
243 /// Iterate over all of the sequences of byte ranges in this trie, and
244 /// call the provided function for each sequence. Iteration occurs in
245 /// lexicographic order.
246 pub fn iter<E, F: FnMut(&[Utf8Range]) -> Result<(), E>>(
247 &self,
248 mut f: F,
249 ) -> Result<(), E> {
250 let mut stack = self.iter_stack.borrow_mut();
251 stack.clear();
252 let mut ranges = self.iter_ranges.borrow_mut();
253 ranges.clear();
254
255 // We do iteration in a way that permits us to use a single buffer
256 // for our keys. We iterate in a depth first fashion, while being
257 // careful to expand our frontier as we move deeper in the trie.
258 stack.push(NextIter { state_id: ROOT, tidx: 0 });
259 while let Some(NextIter { mut state_id, mut tidx }) = stack.pop() {
260 // This could be implemented more simply without an inner loop
261 // here, but at the cost of more stack pushes.
262 loop {
263 let state = self.state(state_id);
264 // If we've visited all transitions in this state, then pop
265 // back to the parent state.
266 if tidx >= state.transitions.len() {
267 ranges.pop();
268 break;
269 }
270
271 let t = &state.transitions[tidx];
272 ranges.push(t.range);
273 if t.next_id == FINAL {
274 f(&ranges)?;
275 ranges.pop();
276 tidx += 1;
277 } else {
278 // Expand our frontier. Once we come back to this state
279 // via the stack, start in on the next transition.
280 stack.push(NextIter { state_id, tidx: tidx + 1 });
281 // Otherwise, move to the first transition of the next
282 // state.
283 state_id = t.next_id;
284 tidx = 0;
285 }
286 }
287 }
288 Ok(())
289 }
290
291 /// Inserts a new sequence of ranges into this trie.
292 ///
293 /// The sequence given must be non-empty and must not have a length
294 /// exceeding 4.
295 pub fn insert(&mut self, ranges: &[Utf8Range]) {
296 assert!(!ranges.is_empty());
297 assert!(ranges.len() <= 4);
298
299 let mut stack = mem::replace(&mut self.insert_stack, vec![]);
300 stack.clear();
301
302 stack.push(NextInsert::new(ROOT, ranges));
303 while let Some(next) = stack.pop() {
304 let (state_id, ranges) = (next.state_id(), next.ranges());
305 assert!(!ranges.is_empty());
306
307 let (mut new, rest) = (ranges[0], &ranges[1..]);
308
309 // i corresponds to the position of the existing transition on
310 // which we are operating. Typically, the result is to remove the
311 // transition and replace it with two or more new transitions
312 // corresponding to the partitions generated by splitting the
313 // 'new' with the ith transition's range.
314 let mut i = self.state(state_id).find(new);
315
316 // In this case, there is no overlap *and* the new range is greater
317 // than all existing ranges. So we can just add it to the end.
318 if i == self.state(state_id).transitions.len() {
319 let next_id = NextInsert::push(self, &mut stack, rest);
320 self.add_transition(state_id, new, next_id);
321 continue;
322 }
323
324 // The need for this loop is a bit subtle, buf basically, after
325 // we've handled the partitions from our initial split, it's
326 // possible that there will be a partition leftover that overlaps
327 // with a subsequent transition. If so, then we have to repeat
328 // the split process again with the leftovers and that subsequent
329 // transition.
330 'OUTER: loop {
331 let old = self.state(state_id).transitions[i].clone();
332 let split = match Split::new(old.range, new) {
333 Some(split) => split,
334 None => {
335 let next_id = NextInsert::push(self, &mut stack, rest);
336 self.add_transition_at(i, state_id, new, next_id);
337 continue;
338 }
339 };
340 let splits = split.as_slice();
341 // If we only have one partition, then the ranges must be
342 // equivalent. There's nothing to do here for this state, so
343 // just move on to the next one.
344 if splits.len() == 1 {
345 // ... but only if we have anything left to do.
346 if !rest.is_empty() {
347 stack.push(NextInsert::new(old.next_id, rest));
348 }
349 break;
350 }
351 // At this point, we know that 'split' is non-empty and there
352 // must be some overlap AND that the two ranges are not
353 // equivalent. Therefore, the existing range MUST be removed
354 // and split up somehow. Instead of actually doing the removal
355 // and then a subsequent insertion---with all the memory
356 // shuffling that entails---we simply overwrite the transition
357 // at position `i` for the first new transition we want to
358 // insert. After that, we're forced to do expensive inserts.
359 let mut first = true;
360 let mut add_trans =
361 |trie: &mut RangeTrie, pos, from, range, to| {
362 if first {
363 trie.set_transition_at(pos, from, range, to);
364 first = false;
365 } else {
366 trie.add_transition_at(pos, from, range, to);
367 }
368 };
369 for (j, &srange) in splits.iter().enumerate() {
370 match srange {
371 SplitRange::Old(r) => {
372 // Deep clone the state pointed to by the ith
373 // transition. This is always necessary since 'old'
374 // is always coupled with at least a 'both'
375 // partition. We don't want any new changes made
376 // via the 'both' partition to impact the part of
377 // the transition that doesn't overlap with the
378 // new range.
379 let dup_id = self.duplicate(old.next_id);
380 add_trans(self, i, state_id, r, dup_id);
381 }
382 SplitRange::New(r) => {
383 // This is a bit subtle, but if this happens to be
384 // the last partition in our split, it is possible
385 // that this overlaps with a subsequent transition.
386 // If it does, then we must repeat the whole
387 // splitting process over again with `r` and the
388 // subsequent transition.
389 {
390 let trans = &self.state(state_id).transitions;
391 if j + 1 == splits.len()
392 && i < trans.len()
393 && intersects(r, trans[i].range)
394 {
395 new = r;
396 continue 'OUTER;
397 }
398 }
399
400 // ... otherwise, setup exploration for a new
401 // empty state and add a brand new transition for
402 // this new range.
403 let next_id =
404 NextInsert::push(self, &mut stack, rest);
405 add_trans(self, i, state_id, r, next_id);
406 }
407 SplitRange::Both(r) => {
408 // Continue adding the remaining ranges on this
409 // path and update the transition with the new
410 // range.
411 if !rest.is_empty() {
412 stack.push(NextInsert::new(old.next_id, rest));
413 }
414 add_trans(self, i, state_id, r, old.next_id);
415 }
416 }
417 i += 1;
418 }
419 // If we've reached this point, then we know that there are
420 // no subsequent transitions with any overlap. Therefore, we
421 // can stop processing this range and move on to the next one.
422 break;
423 }
424 }
425 self.insert_stack = stack;
426 }
427
428 pub fn add_empty(&mut self) -> StateID {
429 let id = match StateID::try_from(self.states.len()) {
430 Ok(id) => id,
431 Err(_) => {
432 // This generally should not happen since a range trie is
433 // only ever used to compile a single sequence of Unicode
434 // scalar values. If we ever got to this point, we would, at
435 // *minimum*, be using 96GB in just the range trie alone.
436 panic!("too many sequences added to range trie");
437 }
438 };
439 // If we have some free states available, then use them to avoid
440 // more allocations.
441 if let Some(mut state) = self.free.pop() {
442 state.clear();
443 self.states.push(state);
444 } else {
445 self.states.push(State { transitions: vec![] });
446 }
447 id
448 }
449
450 /// Performs a deep clone of the given state and returns the duplicate's
451 /// state ID.
452 ///
453 /// A "deep clone" in this context means that the state given along with
454 /// recursively all states that it points to are copied. Once complete,
455 /// the given state ID and the returned state ID share nothing.
456 ///
457 /// This is useful during range trie insertion when a new range overlaps
458 /// with an existing range that is bigger than the new one. The part
459 /// of the existing range that does *not* overlap with the new one is
460 /// duplicated so that adding the new range to the overlap doesn't disturb
461 /// the non-overlapping portion.
462 ///
463 /// There's one exception: if old_id is the final state, then it is not
464 /// duplicated and the same final state is returned. This is because all
465 /// final states in this trie are equivalent.
466 fn duplicate(&mut self, old_id: StateID) -> StateID {
467 if old_id == FINAL {
468 return FINAL;
469 }
470
471 let mut stack = mem::replace(&mut self.dupe_stack, vec![]);
472 stack.clear();
473
474 let new_id = self.add_empty();
475 // old_id is the state we're cloning and new_id is the ID of the
476 // duplicated state for old_id.
477 stack.push(NextDupe { old_id, new_id });
478 while let Some(NextDupe { old_id, new_id }) = stack.pop() {
479 for i in 0..self.state(old_id).transitions.len() {
480 let t = self.state(old_id).transitions[i].clone();
481 if t.next_id == FINAL {
482 // All final states are the same, so there's no need to
483 // duplicate it.
484 self.add_transition(new_id, t.range, FINAL);
485 continue;
486 }
487
488 let new_child_id = self.add_empty();
489 self.add_transition(new_id, t.range, new_child_id);
490 stack.push(NextDupe {
491 old_id: t.next_id,
492 new_id: new_child_id,
493 });
494 }
495 }
496 self.dupe_stack = stack;
497 new_id
498 }
499
500 /// Adds the given transition to the given state.
501 ///
502 /// Callers must ensure that all previous transitions in this state
503 /// are lexicographically smaller than the given range.
504 fn add_transition(
505 &mut self,
506 from_id: StateID,
507 range: Utf8Range,
508 next_id: StateID,
509 ) {
510 self.state_mut(from_id)
511 .transitions
512 .push(Transition { range, next_id });
513 }
514
515 /// Like `add_transition`, except this inserts the transition just before
516 /// the ith transition.
517 fn add_transition_at(
518 &mut self,
519 i: usize,
520 from_id: StateID,
521 range: Utf8Range,
522 next_id: StateID,
523 ) {
524 self.state_mut(from_id)
525 .transitions
526 .insert(i, Transition { range, next_id });
527 }
528
529 /// Overwrites the transition at position i with the given transition.
530 fn set_transition_at(
531 &mut self,
532 i: usize,
533 from_id: StateID,
534 range: Utf8Range,
535 next_id: StateID,
536 ) {
537 self.state_mut(from_id).transitions[i] = Transition { range, next_id };
538 }
539
540 /// Return an immutable borrow for the state with the given ID.
541 fn state(&self, id: StateID) -> &State {
542 &self.states[id]
543 }
544
545 /// Return a mutable borrow for the state with the given ID.
546 fn state_mut(&mut self, id: StateID) -> &mut State {
547 &mut self.states[id]
548 }
549}
550
551impl State {
552 /// Find the position at which the given range should be inserted in this
553 /// state.
554 ///
555 /// The position returned is always in the inclusive range
556 /// [0, transitions.len()]. If 'transitions.len()' is returned, then the
557 /// given range overlaps with no other range in this state *and* is greater
558 /// than all of them.
559 ///
560 /// For all other possible positions, the given range either overlaps
561 /// with the transition at that position or is otherwise less than it
562 /// with no overlap (and is greater than the previous transition). In the
563 /// former case, careful attention must be paid to inserting this range
564 /// as a new transition. In the latter case, the range can be inserted as
565 /// a new transition at the given position without disrupting any other
566 /// transitions.
567 fn find(&self, range: Utf8Range) -> usize {
568 /// Returns the position `i` at which `pred(xs[i])` first returns true
569 /// such that for all `j >= i`, `pred(xs[j]) == true`. If `pred` never
570 /// returns true, then `xs.len()` is returned.
571 ///
572 /// We roll our own binary search because it doesn't seem like the
573 /// standard library's binary search can be used here. Namely, if
574 /// there is an overlapping range, then we want to find the first such
575 /// occurrence, but there may be many. Or at least, it's not quite
576 /// clear to me how to do it.
577 fn binary_search<T, F>(xs: &[T], mut pred: F) -> usize
578 where
579 F: FnMut(&T) -> bool,
580 {
581 let (mut left, mut right) = (0, xs.len());
582 while left < right {
583 // Overflow is impossible because xs.len() <= 256.
584 let mid = (left + right) / 2;
585 if pred(&xs[mid]) {
586 right = mid;
587 } else {
588 left = mid + 1;
589 }
590 }
591 left
592 }
593
594 // Benchmarks suggest that binary search is just a bit faster than
595 // straight linear search. Specifically when using the debug tool:
596 //
597 // hyperfine "regex-cli debug thompson -qr --no-captures '\w{90} ecurB'"
598 binary_search(&self.transitions, |t| range.start <= t.range.end)
599 }
600
601 /// Clear this state such that it has zero transitions.
602 fn clear(&mut self) {
603 self.transitions.clear();
604 }
605}
606
607/// The next state to process during duplication.
608#[derive(Clone, Debug)]
609struct NextDupe {
610 /// The state we want to duplicate.
611 old_id: StateID,
612 /// The ID of the new state that is a duplicate of old_id.
613 new_id: StateID,
614}
615
616/// The next state (and its corresponding transition) that we want to visit
617/// during iteration in lexicographic order.
618#[derive(Clone, Debug)]
619struct NextIter {
620 state_id: StateID,
621 tidx: usize,
622}
623
624/// The next state to process during insertion and any remaining ranges that we
625/// want to add for a particular sequence of ranges. The first such instance
626/// is always the root state along with all ranges given.
627#[derive(Clone, Debug)]
628struct NextInsert {
629 /// The next state to begin inserting ranges. This state should be the
630 /// state at which `ranges[0]` should be inserted.
631 state_id: StateID,
632 /// The ranges to insert. We used a fixed-size array here to avoid an
633 /// allocation.
634 ranges: [Utf8Range; 4],
635 /// The number of valid ranges in the above array.
636 len: u8,
637}
638
639impl NextInsert {
640 /// Create the next item to visit. The given state ID should correspond
641 /// to the state at which the first range in the given slice should be
642 /// inserted. The slice given must not be empty and it must be no longer
643 /// than 4.
644 fn new(state_id: StateID, ranges: &[Utf8Range]) -> NextInsert {
645 let len = ranges.len();
646 assert!(len > 0);
647 assert!(len <= 4);
648
649 let mut tmp = [Utf8Range { start: 0, end: 0 }; 4];
650 tmp[..len].copy_from_slice(ranges);
651 NextInsert { state_id, ranges: tmp, len: u8::try_from(len).unwrap() }
652 }
653
654 /// Push a new empty state to visit along with any remaining ranges that
655 /// still need to be inserted. The ID of the new empty state is returned.
656 ///
657 /// If ranges is empty, then no new state is created and FINAL is returned.
658 fn push(
659 trie: &mut RangeTrie,
660 stack: &mut Vec<NextInsert>,
661 ranges: &[Utf8Range],
662 ) -> StateID {
663 if ranges.is_empty() {
664 FINAL
665 } else {
666 let next_id = trie.add_empty();
667 stack.push(NextInsert::new(next_id, ranges));
668 next_id
669 }
670 }
671
672 /// Return the ID of the state to visit.
673 fn state_id(&self) -> StateID {
674 self.state_id
675 }
676
677 /// Return the remaining ranges to insert.
678 fn ranges(&self) -> &[Utf8Range] {
679 &self.ranges[..usize::try_from(self.len).unwrap()]
680 }
681}
682
683/// Split represents a partitioning of two ranges into one or more ranges. This
684/// is the secret sauce that makes a range trie work, as it's what tells us
685/// how to deal with two overlapping but unequal ranges during insertion.
686///
687/// Essentially, either two ranges overlap or they don't. If they don't, then
688/// handling insertion is easy: just insert the new range into its
689/// lexicographically correct position. Since it does not overlap with anything
690/// else, no other transitions are impacted by the new range.
691///
692/// If they do overlap though, there are generally three possible cases to
693/// handle:
694///
695/// 1. The part where the two ranges actually overlap. i.e., The intersection.
696/// 2. The part of the existing range that is not in the the new range.
697/// 3. The part of the new range that is not in the old range.
698///
699/// (1) is guaranteed to always occur since all overlapping ranges have a
700/// non-empty intersection. If the two ranges are not equivalent, then at
701/// least one of (2) or (3) is guaranteed to occur as well. In some cases,
702/// e.g., `[0-4]` and `[4-9]`, all three cases will occur.
703///
704/// This `Split` type is responsible for providing (1), (2) and (3) for any
705/// possible pair of byte ranges.
706///
707/// As for insertion, for the overlap in (1), the remaining ranges to insert
708/// should be added by following the corresponding transition. However, this
709/// should only be done for the overlapping parts of the range. If there was
710/// a part of the existing range that was not in the new range, then that
711/// existing part must be split off from the transition and duplicated. The
712/// remaining parts of the overlap can then be added to using the new ranges
713/// without disturbing the existing range.
714///
715/// Handling the case for the part of a new range that is not in an existing
716/// range is seemingly easy. Just treat it as if it were a non-overlapping
717/// range. The problem here is that if this new non-overlapping range occurs
718/// after both (1) and (2), then it's possible that it can overlap with the
719/// next transition in the current state. If it does, then the whole process
720/// must be repeated!
721///
722/// # Details of the 3 cases
723///
724/// The following details the various cases that are implemented in code
725/// below. It's plausible that the number of cases is not actually minimal,
726/// but it's important for this code to remain at least somewhat readable.
727///
728/// Given [a,b] and [x,y], where a <= b, x <= y, b < 256 and y < 256, we define
729/// the follow distinct relationships where at least one must apply. The order
730/// of these matters, since multiple can match. The first to match applies.
731///
732/// 1. b < x <=> [a,b] < [x,y]
733/// 2. y < a <=> [x,y] < [a,b]
734///
735/// In the case of (1) and (2), these are the only cases where there is no
736/// overlap. Or otherwise, the intersection of [a,b] and [x,y] is empty. In
737/// order to compute the intersection, one can do [max(a,x), min(b,y)]. The
738/// intersection in all of the following cases is non-empty.
739///
740/// 3. a = x && b = y <=> [a,b] == [x,y]
741/// 4. a = x && b < y <=> [x,y] right-extends [a,b]
742/// 5. b = y && a > x <=> [x,y] left-extends [a,b]
743/// 6. x = a && y < b <=> [a,b] right-extends [x,y]
744/// 7. y = b && x > a <=> [a,b] left-extends [x,y]
745/// 8. a > x && b < y <=> [x,y] covers [a,b]
746/// 9. x > a && y < b <=> [a,b] covers [x,y]
747/// 10. b = x && a < y <=> [a,b] is left-adjacent to [x,y]
748/// 11. y = a && x < b <=> [x,y] is left-adjacent to [a,b]
749/// 12. b > x && b < y <=> [a,b] left-overlaps [x,y]
750/// 13. y > a && y < b <=> [x,y] left-overlaps [a,b]
751///
752/// In cases 3-13, we can form rules that partition the ranges into a
753/// non-overlapping ordered sequence of ranges:
754///
755/// 3. [a,b]
756/// 4. [a,b], [b+1,y]
757/// 5. [x,a-1], [a,b]
758/// 6. [x,y], [y+1,b]
759/// 7. [a,x-1], [x,y]
760/// 8. [x,a-1], [a,b], [b+1,y]
761/// 9. [a,x-1], [x,y], [y+1,b]
762/// 10. [a,b-1], [b,b], [b+1,y]
763/// 11. [x,y-1], [y,y], [y+1,b]
764/// 12. [a,x-1], [x,b], [b+1,y]
765/// 13. [x,a-1], [a,y], [y+1,b]
766///
767/// In the code below, we go a step further and identify each of the above
768/// outputs as belonging either to the overlap of the two ranges or to one
769/// of [a,b] or [x,y] exclusively.
770#[derive(Clone, Debug, Eq, PartialEq)]
771struct Split {
772 partitions: [SplitRange; 3],
773 len: usize,
774}
775
776/// A tagged range indicating how it was derived from a pair of ranges.
777#[derive(Clone, Copy, Debug, Eq, PartialEq)]
778enum SplitRange {
779 Old(Utf8Range),
780 New(Utf8Range),
781 Both(Utf8Range),
782}
783
784impl Split {
785 /// Create a partitioning of the given ranges.
786 ///
787 /// If the given ranges have an empty intersection, then None is returned.
788 fn new(o: Utf8Range, n: Utf8Range) -> Option<Split> {
789 let range = |r: RangeInclusive<u8>| Utf8Range {
790 start: *r.start(),
791 end: *r.end(),
792 };
793 let old = |r| SplitRange::Old(range(r));
794 let new = |r| SplitRange::New(range(r));
795 let both = |r| SplitRange::Both(range(r));
796
797 // Use same names as the comment above to make it easier to compare.
798 let (a, b, x, y) = (o.start, o.end, n.start, n.end);
799
800 if b < x || y < a {
801 // case 1, case 2
802 None
803 } else if a == x && b == y {
804 // case 3
805 Some(Split::parts1(both(a..=b)))
806 } else if a == x && b < y {
807 // case 4
808 Some(Split::parts2(both(a..=b), new(b + 1..=y)))
809 } else if b == y && a > x {
810 // case 5
811 Some(Split::parts2(new(x..=a - 1), both(a..=b)))
812 } else if x == a && y < b {
813 // case 6
814 Some(Split::parts2(both(x..=y), old(y + 1..=b)))
815 } else if y == b && x > a {
816 // case 7
817 Some(Split::parts2(old(a..=x - 1), both(x..=y)))
818 } else if a > x && b < y {
819 // case 8
820 Some(Split::parts3(new(x..=a - 1), both(a..=b), new(b + 1..=y)))
821 } else if x > a && y < b {
822 // case 9
823 Some(Split::parts3(old(a..=x - 1), both(x..=y), old(y + 1..=b)))
824 } else if b == x && a < y {
825 // case 10
826 Some(Split::parts3(old(a..=b - 1), both(b..=b), new(b + 1..=y)))
827 } else if y == a && x < b {
828 // case 11
829 Some(Split::parts3(new(x..=y - 1), both(y..=y), old(y + 1..=b)))
830 } else if b > x && b < y {
831 // case 12
832 Some(Split::parts3(old(a..=x - 1), both(x..=b), new(b + 1..=y)))
833 } else if y > a && y < b {
834 // case 13
835 Some(Split::parts3(new(x..=a - 1), both(a..=y), old(y + 1..=b)))
836 } else {
837 unreachable!()
838 }
839 }
840
841 /// Create a new split with a single partition. This only occurs when two
842 /// ranges are equivalent.
843 fn parts1(r1: SplitRange) -> Split {
844 // This value doesn't matter since it is never accessed.
845 let nada = SplitRange::Old(Utf8Range { start: 0, end: 0 });
846 Split { partitions: [r1, nada, nada], len: 1 }
847 }
848
849 /// Create a new split with two partitions.
850 fn parts2(r1: SplitRange, r2: SplitRange) -> Split {
851 // This value doesn't matter since it is never accessed.
852 let nada = SplitRange::Old(Utf8Range { start: 0, end: 0 });
853 Split { partitions: [r1, r2, nada], len: 2 }
854 }
855
856 /// Create a new split with three partitions.
857 fn parts3(r1: SplitRange, r2: SplitRange, r3: SplitRange) -> Split {
858 Split { partitions: [r1, r2, r3], len: 3 }
859 }
860
861 /// Return the partitions in this split as a slice.
862 fn as_slice(&self) -> &[SplitRange] {
863 &self.partitions[..self.len]
864 }
865}
866
867impl fmt::Debug for RangeTrie {
868 fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
869 writeln!(f, "")?;
870 for (i, state) in self.states.iter().enumerate() {
871 let status = if i == FINAL.as_usize() { '*' } else { ' ' };
872 writeln!(f, "{}{:06}: {:?}", status, i, state)?;
873 }
874 Ok(())
875 }
876}
877
878impl fmt::Debug for State {
879 fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
880 let rs = self
881 .transitions
882 .iter()
883 .map(|t| format!("{:?}", t))
884 .collect::<Vec<String>>()
885 .join(", ");
886 write!(f, "{}", rs)
887 }
888}
889
890impl fmt::Debug for Transition {
891 fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
892 if self.range.start == self.range.end {
893 write!(
894 f,
895 "{:02X} => {:02X}",
896 self.range.start,
897 self.next_id.as_usize(),
898 )
899 } else {
900 write!(
901 f,
902 "{:02X}-{:02X} => {:02X}",
903 self.range.start,
904 self.range.end,
905 self.next_id.as_usize(),
906 )
907 }
908 }
909}
910
911/// Returns true if and only if the given ranges intersect.
912fn intersects(r1: Utf8Range, r2: Utf8Range) -> bool {
913 !(r1.end < r2.start || r2.end < r1.start)
914}
915
916#[cfg(test)]
917mod tests {
918 use core::ops::RangeInclusive;
919
920 use regex_syntax::utf8::Utf8Range;
921
922 use super::*;
923
924 fn r(range: RangeInclusive<u8>) -> Utf8Range {
925 Utf8Range { start: *range.start(), end: *range.end() }
926 }
927
928 fn split_maybe(
929 old: RangeInclusive<u8>,
930 new: RangeInclusive<u8>,
931 ) -> Option<Split> {
932 Split::new(r(old), r(new))
933 }
934
935 fn split(
936 old: RangeInclusive<u8>,
937 new: RangeInclusive<u8>,
938 ) -> Vec<SplitRange> {
939 split_maybe(old, new).unwrap().as_slice().to_vec()
940 }
941
942 #[test]
943 fn no_splits() {
944 // case 1
945 assert_eq!(None, split_maybe(0..=1, 2..=3));
946 // case 2
947 assert_eq!(None, split_maybe(2..=3, 0..=1));
948 }
949
950 #[test]
951 fn splits() {
952 let range = |r: RangeInclusive<u8>| Utf8Range {
953 start: *r.start(),
954 end: *r.end(),
955 };
956 let old = |r| SplitRange::Old(range(r));
957 let new = |r| SplitRange::New(range(r));
958 let both = |r| SplitRange::Both(range(r));
959
960 // case 3
961 assert_eq!(split(0..=0, 0..=0), vec![both(0..=0)]);
962 assert_eq!(split(9..=9, 9..=9), vec![both(9..=9)]);
963
964 // case 4
965 assert_eq!(split(0..=5, 0..=6), vec![both(0..=5), new(6..=6)]);
966 assert_eq!(split(0..=5, 0..=8), vec![both(0..=5), new(6..=8)]);
967 assert_eq!(split(5..=5, 5..=8), vec![both(5..=5), new(6..=8)]);
968
969 // case 5
970 assert_eq!(split(1..=5, 0..=5), vec![new(0..=0), both(1..=5)]);
971 assert_eq!(split(3..=5, 0..=5), vec![new(0..=2), both(3..=5)]);
972 assert_eq!(split(5..=5, 0..=5), vec![new(0..=4), both(5..=5)]);
973
974 // case 6
975 assert_eq!(split(0..=6, 0..=5), vec![both(0..=5), old(6..=6)]);
976 assert_eq!(split(0..=8, 0..=5), vec![both(0..=5), old(6..=8)]);
977 assert_eq!(split(5..=8, 5..=5), vec![both(5..=5), old(6..=8)]);
978
979 // case 7
980 assert_eq!(split(0..=5, 1..=5), vec![old(0..=0), both(1..=5)]);
981 assert_eq!(split(0..=5, 3..=5), vec![old(0..=2), both(3..=5)]);
982 assert_eq!(split(0..=5, 5..=5), vec![old(0..=4), both(5..=5)]);
983
984 // case 8
985 assert_eq!(
986 split(3..=6, 2..=7),
987 vec![new(2..=2), both(3..=6), new(7..=7)],
988 );
989 assert_eq!(
990 split(3..=6, 1..=8),
991 vec![new(1..=2), both(3..=6), new(7..=8)],
992 );
993
994 // case 9
995 assert_eq!(
996 split(2..=7, 3..=6),
997 vec![old(2..=2), both(3..=6), old(7..=7)],
998 );
999 assert_eq!(
1000 split(1..=8, 3..=6),
1001 vec![old(1..=2), both(3..=6), old(7..=8)],
1002 );
1003
1004 // case 10
1005 assert_eq!(
1006 split(3..=6, 6..=7),
1007 vec![old(3..=5), both(6..=6), new(7..=7)],
1008 );
1009 assert_eq!(
1010 split(3..=6, 6..=8),
1011 vec![old(3..=5), both(6..=6), new(7..=8)],
1012 );
1013 assert_eq!(
1014 split(5..=6, 6..=7),
1015 vec![old(5..=5), both(6..=6), new(7..=7)],
1016 );
1017
1018 // case 11
1019 assert_eq!(
1020 split(6..=7, 3..=6),
1021 vec![new(3..=5), both(6..=6), old(7..=7)],
1022 );
1023 assert_eq!(
1024 split(6..=8, 3..=6),
1025 vec![new(3..=5), both(6..=6), old(7..=8)],
1026 );
1027 assert_eq!(
1028 split(6..=7, 5..=6),
1029 vec![new(5..=5), both(6..=6), old(7..=7)],
1030 );
1031
1032 // case 12
1033 assert_eq!(
1034 split(3..=7, 5..=9),
1035 vec![old(3..=4), both(5..=7), new(8..=9)],
1036 );
1037 assert_eq!(
1038 split(3..=5, 4..=6),
1039 vec![old(3..=3), both(4..=5), new(6..=6)],
1040 );
1041
1042 // case 13
1043 assert_eq!(
1044 split(5..=9, 3..=7),
1045 vec![new(3..=4), both(5..=7), old(8..=9)],
1046 );
1047 assert_eq!(
1048 split(4..=6, 3..=5),
1049 vec![new(3..=3), both(4..=5), old(6..=6)],
1050 );
1051 }
1052
1053 // Arguably there should be more tests here, but in practice, this data
1054 // structure is well covered by the huge number of regex tests.
1055}
1056