range_trie.rs - Codebrowser

1	/*
2	I've called the primary data structure in this module a "range trie." As far
3	as I can tell, there is no prior art on a data structure like this, however,
4	it's likely someone somewhere has built something like it. Searching for
5	"range trie" turns up the paper "Range Tries for Scalable Address Lookup,"
6	but it does not appear relevant.
7
8	The range trie is just like a trie in that it is a special case of a
9	deterministic finite state machine. It has states and each state has a set
10	of transitions to other states. It is acyclic, and, like a normal trie,
11	it makes no attempt to reuse common suffixes among its elements. The key
12	difference between a normal trie and a range trie below is that a range trie
13	operates on contiguous sequences* of bytes instead of singleton bytes.*
14	One could say say that our alphabet is ranges of bytes instead of bytes
15	themselves, except a key part of range trie construction is splitting ranges
16	apart to ensure there is at most one transition that can be taken for any
17	byte in a given state.
18
19	I've tried to explain the details of how the range trie works below, so
20	for now, we are left with trying to understand what problem we're trying to
21	solve. Which is itself fairly involved!
22
23	At the highest level, here's what we want to do. We want to convert a
24	sequence of Unicode codepoints into a finite state machine whose transitions
25	are over bytes* and not Unicode codepoints. We want this because it makes*
26	said finite state machines much smaller and much faster to execute. As a
27	simple example, consider a byte oriented automaton for all Unicode scalar
28	values (0x00 through 0x10FFFF, not including surrogate codepoints):
29
30	[00-7F]
31	[C2-DF][80-BF]
32	[E0-E0][A0-BF][80-BF]
33	[E1-EC][80-BF][80-BF]
34	[ED-ED][80-9F][80-BF]
35	[EE-EF][80-BF][80-BF]
36	[F0-F0][90-BF][80-BF][80-BF]
37	[F1-F3][80-BF][80-BF][80-BF]
38	[F4-F4][80-8F][80-BF][80-BF]
39
40	(These byte ranges are generated via the regex-syntax::utf8 module, which
41	was based on Russ Cox's code in RE2, which was in turn based on Ken
42	Thompson's implementation of the same idea in his Plan9 implementation of
43	grep.)
44
45	It should be fairly straight-forward to see how one could compile this into
46	a DFA. The sequences are sorted and non-overlapping. Essentially, you could
47	build a trie from this fairly easy. The problem comes when your initial
48	range (in this case, 0x00-0x10FFFF) isn't so nice. For example, the class
49	represented by '\w' contains only a tenth of the codepoints that
50	0x00-0x10FFFF contains, but if we were to write out the byte based ranges
51	as we did above, the list would stretch to 892 entries! This turns into
52	quite a large NFA with a few thousand states. Turning this beast into a DFA
53	takes quite a bit of time. We are thus left with trying to trim down the
54	number of states we produce as early as possible.
55
56	One approach (used by RE2 and still by the regex crate, at time of writing)
57	is to try to find common suffixes while building NFA states for the above
58	and reuse them. This is very cheap to do and one can control precisely how
59	much extra memory you want to use for the cache.
60
61	Another approach, however, is to reuse an algorithm for constructing a
62	minimal DFA from a sorted sequence of inputs. I don't want to go into
63	the full details here, but I explain it in more depth in my blog post on
64	FSTs[1]. Note that the algorithm was not invented by me, but was published
65	in paper by Daciuk et al. in 2000 called "Incremental Construction of
66	MinimalAcyclic Finite-State Automata." Like the suffix cache approach above,
67	it is also possible to control the amount of extra memory one uses, although
68	this usually comes with the cost of sacrificing true minimality. (But it's
69	typically close enough with a reasonably sized cache of states.)
70
71	The catch is that Daciuk's algorithm only works if you add your keys in
72	lexicographic ascending order. In our case, since we're dealing with ranges,
73	we also need the additional requirement that ranges are either equivalent
74	or do not overlap at all. For example, if one were given the following byte
75	ranges:
76
77	[BC-BF][80-BF]
78	[BC-BF][90-BF]
79
80	Then Daciuk's algorithm would not work, since there is nothing to handle the
81	fact that the ranges overlap. They would need to be split apart. Thankfully,
82	Thompson's algorithm for producing byte ranges for Unicode codepoint ranges
83	meets both of our requirements. (A proof for this eludes me, but it appears
84	true.)
85
86	... however, we would also like to be able to compile UTF-8 automata in
87	reverse. We want this because in order to find the starting location of a
88	match using a DFA, we need to run a second DFA---a reversed version of the
89	forward DFA---backwards to discover the match location. Unfortunately, if
90	we reverse our byte sequences for 0x00-0x10FFFF, we get sequences that are
91	can overlap, even if they are sorted:
92
93	[00-7F]
94	[80-BF][80-9F][ED-ED]
95	[80-BF][80-BF][80-8F][F4-F4]
96	[80-BF][80-BF][80-BF][F1-F3]
97	[80-BF][80-BF][90-BF][F0-F0]
98	[80-BF][80-BF][E1-EC]
99	[80-BF][80-BF][EE-EF]
100	[80-BF][A0-BF][E0-E0]
101	[80-BF][C2-DF]
102
103	For example, '[80-BF][80-BF][EE-EF]' and '[80-BF][A0-BF][E0-E0]' have
104	overlapping ranges between '[80-BF]' and '[A0-BF]'. Thus, there is no
105	simple way to apply Daciuk's algorithm.
106
107	And thus, the range trie was born. The range trie's only purpose is to take
108	sequences of byte ranges like the ones above, collect them into a trie and then
109	spit them out in a sorted fashion with no overlapping ranges. For example,
110	0x00-0x10FFFF gets translated to:
111
112	[0-7F]
113	[80-BF][80-9F][80-8F][F1-F3]
114	[80-BF][80-9F][80-8F][F4]
115	[80-BF][80-9F][90-BF][F0]
116	[80-BF][80-9F][90-BF][F1-F3]
117	[80-BF][80-9F][E1-EC]
118	[80-BF][80-9F][ED]
119	[80-BF][80-9F][EE-EF]
120	[80-BF][A0-BF][80-8F][F1-F3]
121	[80-BF][A0-BF][80-8F][F4]
122	[80-BF][A0-BF][90-BF][F0]
123	[80-BF][A0-BF][90-BF][F1-F3]
124	[80-BF][A0-BF][E0]
125	[80-BF][A0-BF][E1-EC]
126	[80-BF][A0-BF][EE-EF]
127	[80-BF][C2-DF]
128
129	We've thus satisfied our requirements for running Daciuk's algorithm. All
130	sequences of ranges are sorted, and any corresponding ranges are either
131	exactly equivalent or non-overlapping.
132
133	In effect, a range trie is building a DFA from a sequence of arbitrary byte
134	ranges. But it uses an algorithm custom tailored to its input, so it is not as
135	costly as traditional DFA construction. While it is still quite a bit more
136	costly than the forward case (which only needs Daciuk's algorithm), it winds
137	up saving a substantial amount of time if one is doing a full DFA powerset
138	construction later by virtue of producing a much much smaller NFA.
139
140	[1] - https://blog.burntsushi.net/transducers/
141	[2] - https://www.mitpressjournals.org/doi/pdfplus/10.1162/089120100561601
142	*/
143
144	use core::{cell::RefCell, convert::TryFrom, fmt, mem, ops::RangeInclusive};
145
146	use alloc::{format, string::String, vec, vec::Vec};
147
148	use regex_syntax::utf8::Utf8Range;
149
150	use crate::util::primitives::StateID;
151
152	/// There is only one final state in this trie. Every sequence of byte ranges
153	/// added shares the same final state.
154	const FINAL: StateID = StateID::ZERO;
155
156	/// The root state of the trie.
157	const ROOT: StateID = StateID::new_unchecked(`1`);
158
159	/// A range trie represents an ordered set of sequences of bytes.
160	///
161	/// A range trie accepts as input a sequence of byte ranges and merges
162	/// them into the existing set such that the trie can produce a sorted
163	/// non-overlapping sequence of byte ranges. The sequence emitted corresponds
164	/// precisely to the sequence of bytes matched by the given keys, although the
165	/// byte ranges themselves may be split at different boundaries.
166	///
167	/// The order complexity of this data structure seems difficult to analyze.
168	/// If the size of a byte is held as a constant, then insertion is clearly
169	/// O(n) where n is the number of byte ranges in the input key. However, if
170	/// k=256 is our alphabet size, then insertion could be O(k^2 n). In*
171	/// particular it seems possible for pathological inputs to cause insertion
172	/// to do a lot of work. However, for what we use this data structure for,
173	/// there should be no pathological inputs since the ultimate source is always
174	/// a sorted set of Unicode scalar value ranges.
175	///
176	/// Internally, this trie is setup like a finite state machine. Note though
177	/// that it is acyclic.
178	#[derive(Clone)]
179	pub struct RangeTrie {
180	/// The states in this trie. The first is always the shared final state.
181	/// The second is always the root state. Otherwise, there is no
182	/// particular order.
183	states: Vec<State>,
184	/// A free-list of states. When a range trie is cleared, all of its states
185	/// are added to this list. Creating a new state reuses states from this
186	/// list before allocating a new one.
187	free: Vec<State>,
188	/// A stack for traversing this trie to yield sequences of byte ranges in
189	/// lexicographic order.
190	iter_stack: RefCell<Vec<NextIter>>,
191	/// A buffer that stores the current sequence during iteration.
192	iter_ranges: RefCell<Vec<Utf8Range>>,
193	/// A stack used for traversing the trie in order to (deeply) duplicate
194	/// a state. States are recursively duplicated when ranges are split.
195	dupe_stack: Vec<NextDupe>,
196	/// A stack used for traversing the trie during insertion of a new
197	/// sequence of byte ranges.
198	insert_stack: Vec<NextInsert>,
199	}
200
201	/// A single state in this trie.
202	#[derive(Clone)]
203	struct State {
204	/// A sorted sequence of non-overlapping transitions to other states. Each
205	/// transition corresponds to a single range of bytes.
206	transitions: Vec<Transition>,
207	}
208
209	/// A transition is a single range of bytes. If a particular byte is in this
210	/// range, then the corresponding machine may transition to the state pointed
211	/// to by `next_id`.
212	#[derive(Clone)]
213	struct Transition {
214	/// The byte range.
215	range: Utf8Range,
216	/// The next state to transition to.
217	next_id: StateID,
218	}
219
220	impl RangeTrie {
221	/// Create a new empty range trie.
222	pub fn new() -> RangeTrie {
223	let mut trie = RangeTrie {
224	states: vec![],
225	free: vec![],
226	iter_stack: RefCell::new(vec![]),
227	iter_ranges: RefCell::new(vec![]),
228	dupe_stack: vec![],
229	insert_stack: vec![],
230	};
231	trie.clear();
232	trie
233	}
234
235	/// Clear this range trie such that it is empty. Clearing a range trie
236	/// and reusing it can beneficial because this may reuse allocations.
237	pub fn clear(&mut self) {
238	self.free.extend(self.states.drain(..));
239	self.add_empty(); // final
240	self.add_empty(); // root
241	}
242
243	/// Iterate over all of the sequences of byte ranges in this trie, and
244	/// call the provided function for each sequence. Iteration occurs in
245	/// lexicographic order.
246	pub fn iter<E, F: FnMut(&[Utf8Range]) -> Result<(), E>>(
247	&self,
248	mut f: F,
249	) -> Result<(), E> {
250	let mut stack = self.iter_stack.borrow_mut();
251	stack.clear();
252	let mut ranges = self.iter_ranges.borrow_mut();
253	ranges.clear();
254
255	// We do iteration in a way that permits us to use a single buffer
256	// for our keys. We iterate in a depth first fashion, while being
257	// careful to expand our frontier as we move deeper in the trie.
258	stack.push(NextIter { state_id: ROOT, tidx: `0` });
259	while let Some(NextIter { mut state_id, mut tidx }) = stack.pop() {
260	// This could be implemented more simply without an inner loop
261	// here, but at the cost of more stack pushes.
262	loop {
263	let state = self.state(state_id);
264	// If we've visited all transitions in this state, then pop
265	// back to the parent state.
266	if tidx >= state.transitions.len() {
267	ranges.pop();
268	break;
269	}
270
271	let t = &state.transitions[tidx];
272	ranges.push(t.range);
273	if t.next_id == FINAL {
274	f(&ranges)?;
275	ranges.pop();
276	tidx += `1`;
277	} else {
278	// Expand our frontier. Once we come back to this state
279	// via the stack, start in on the next transition.
280	stack.push(NextIter { state_id, tidx: tidx + `1` });
281	// Otherwise, move to the first transition of the next
282	// state.
283	state_id = t.next_id;
284	tidx = `0`;
285	}
286	}
287	}
288	Ok(())
289	}
290
291	/// Inserts a new sequence of ranges into this trie.
292	///
293	/// The sequence given must be non-empty and must not have a length
294	/// exceeding 4.
295	pub fn insert(&mut self, ranges: &[Utf8Range]) {
296	assert!(!ranges.is_empty());
297	assert!(ranges.len() <= `4`);
298
299	let mut stack = mem::replace(&mut self.insert_stack, vec![]);
300	stack.clear();
301
302	stack.push(NextInsert::new(ROOT, ranges));
303	while let Some(next) = stack.pop() {
304	let (state_id, ranges) = (next.state_id(), next.ranges());
305	assert!(!ranges.is_empty());
306
307	let (mut new, rest) = (ranges[`0`], &ranges[`1`..]);
308
309	// i corresponds to the position of the existing transition on
310	// which we are operating. Typically, the result is to remove the
311	// transition and replace it with two or more new transitions
312	// corresponding to the partitions generated by splitting the
313	// 'new' with the ith transition's range.
314	let mut i = self.state(state_id).find(new);
315
316	// In this case, there is no overlap and* the new range is greater*
317	// than all existing ranges. So we can just add it to the end.
318	if i == self.state(state_id).transitions.len() {
319	let next_id = NextInsert::push(self, &mut stack, rest);
320	self.add_transition(state_id, new, next_id);
321	continue;
322	}
323
324	// The need for this loop is a bit subtle, buf basically, after
325	// we've handled the partitions from our initial split, it's
326	// possible that there will be a partition leftover that overlaps
327	// with a subsequent transition. If so, then we have to repeat
328	// the split process again with the leftovers and that subsequent
329	// transition.
330	'OUTER: loop {
331	let old = self.state(state_id).transitions[i].clone();
332	let split = match Split::new(old.range, new) {
333	Some(split) => split,
334	None => {
335	let next_id = NextInsert::push(self, &mut stack, rest);
336	self.add_transition_at(i, state_id, new, next_id);
337	continue;
338	}
339	};
340	let splits = split.as_slice();
341	// If we only have one partition, then the ranges must be
342	// equivalent. There's nothing to do here for this state, so
343	// just move on to the next one.
344	if splits.len() == `1` {
345	// ... but only if we have anything left to do.
346	if !rest.is_empty() {
347	stack.push(NextInsert::new(old.next_id, rest));
348	}
349	break;
350	}
351	// At this point, we know that 'split' is non-empty and there
352	// must be some overlap AND that the two ranges are not
353	// equivalent. Therefore, the existing range MUST be removed
354	// and split up somehow. Instead of actually doing the removal
355	// and then a subsequent insertion---with all the memory
356	// shuffling that entails---we simply overwrite the transition
357	// at position `i` for the first new transition we want to
358	// insert. After that, we're forced to do expensive inserts.
359	let mut first = `true`;
360	let mut add_trans =
361	\|trie: &mut RangeTrie, pos, from, range, to\| {
362	if first {
363	trie.set_transition_at(pos, from, range, to);
364	first = `false`;
365	} else {
366	trie.add_transition_at(pos, from, range, to);
367	}
368	};
369	for (j, &srange) in splits.iter().enumerate() {
370	match srange {
371	SplitRange::Old(r) => {
372	// Deep clone the state pointed to by the ith
373	// transition. This is always necessary since 'old'
374	// is always coupled with at least a 'both'
375	// partition. We don't want any new changes made
376	// via the 'both' partition to impact the part of
377	// the transition that doesn't overlap with the
378	// new range.
379	let dup_id = self.duplicate(old.next_id);
380	add_trans(self, i, state_id, r, dup_id);
381	}
382	SplitRange::New(r) => {
383	// This is a bit subtle, but if this happens to be
384	// the last partition in our split, it is possible
385	// that this overlaps with a subsequent transition.
386	// If it does, then we must repeat the whole
387	// splitting process over again with `r` and the
388	// subsequent transition.
389	{
390	let trans = &self.state(state_id).transitions;
391	if j + `1` == splits.len()
392	&& i < trans.len()
393	&& intersects(r, trans[i].range)
394	{
395	new = r;
396	continue 'OUTER;
397	}
398	}
399
400	// ... otherwise, setup exploration for a new
401	// empty state and add a brand new transition for
402	// this new range.
403	let next_id =
404	NextInsert::push(self, &mut stack, rest);
405	add_trans(self, i, state_id, r, next_id);
406	}
407	SplitRange::Both(r) => {
408	// Continue adding the remaining ranges on this
409	// path and update the transition with the new
410	// range.
411	if !rest.is_empty() {
412	stack.push(NextInsert::new(old.next_id, rest));
413	}
414	add_trans(self, i, state_id, r, old.next_id);
415	}
416	}
417	i += `1`;
418	}
419	// If we've reached this point, then we know that there are
420	// no subsequent transitions with any overlap. Therefore, we
421	// can stop processing this range and move on to the next one.
422	break;
423	}
424	}
425	self.insert_stack = stack;
426	}
427
428	pub fn add_empty(&mut self) -> StateID {
429	let id = match StateID::try_from(self.states.len()) {
430	Ok(id) => id,
431	Err(_) => {
432	// This generally should not happen since a range trie is
433	// only ever used to compile a single sequence of Unicode
434	// scalar values. If we ever got to this point, we would, at
435	// minimum, be using 96GB in just the range trie alone.
436	panic!("too many sequences added to range trie");
437	}
438	};
439	// If we have some free states available, then use them to avoid
440	// more allocations.
441	if let Some(mut state) = self.free.pop() {
442	state.clear();
443	self.states.push(state);
444	} else {
445	self.states.push(State { transitions: vec![] });
446	}
447	id
448	}
449
450	/// Performs a deep clone of the given state and returns the duplicate's
451	/// state ID.
452	///
453	/// A "deep clone" in this context means that the state given along with
454	/// recursively all states that it points to are copied. Once complete,
455	/// the given state ID and the returned state ID share nothing.
456	///
457	/// This is useful during range trie insertion when a new range overlaps
458	/// with an existing range that is bigger than the new one. The part
459	/// of the existing range that does not* overlap with the new one is*
460	/// duplicated so that adding the new range to the overlap doesn't disturb
461	/// the non-overlapping portion.
462	///
463	/// There's one exception: if old_id is the final state, then it is not
464	/// duplicated and the same final state is returned. This is because all
465	/// final states in this trie are equivalent.
466	fn duplicate(&mut self, old_id: StateID) -> StateID {
467	if old_id == FINAL {
468	return FINAL;
469	}
470
471	let mut stack = mem::replace(&mut self.dupe_stack, vec![]);
472	stack.clear();
473
474	let new_id = self.add_empty();
475	// old_id is the state we're cloning and new_id is the ID of the
476	// duplicated state for old_id.
477	stack.push(NextDupe { old_id, new_id });
478	while let Some(NextDupe { old_id, new_id }) = stack.pop() {
479	for i in `0`..self.state(old_id).transitions.len() {
480	let t = self.state(old_id).transitions[i].clone();
481	if t.next_id == FINAL {
482	// All final states are the same, so there's no need to
483	// duplicate it.
484	self.add_transition(new_id, t.range, FINAL);
485	continue;
486	}
487
488	let new_child_id = self.add_empty();
489	self.add_transition(new_id, t.range, new_child_id);
490	stack.push(NextDupe {
491	old_id: t.next_id,
492	new_id: new_child_id,
493	});
494	}
495	}
496	self.dupe_stack = stack;
497	new_id
498	}
499
500	/// Adds the given transition to the given state.
501	///
502	/// Callers must ensure that all previous transitions in this state
503	/// are lexicographically smaller than the given range.
504	fn add_transition(
505	&mut self,
506	from_id: StateID,
507	range: Utf8Range,
508	next_id: StateID,
509	) {
510	self.state_mut(from_id)
511	.transitions
512	.push(Transition { range, next_id });
513	}
514
515	/// Like `add_transition`, except this inserts the transition just before
516	/// the ith transition.
517	fn add_transition_at(
518	&mut self,
519	i: usize,
520	from_id: StateID,
521	range: Utf8Range,
522	next_id: StateID,
523	) {
524	self.state_mut(from_id)
525	.transitions
526	.insert(i, Transition { range, next_id });
527	}
528
529	/// Overwrites the transition at position i with the given transition.
530	fn set_transition_at(
531	&mut self,
532	i: usize,
533	from_id: StateID,
534	range: Utf8Range,
535	next_id: StateID,
536	) {
537	self.state_mut(from_id).transitions[i] = Transition { range, next_id };
538	}
539
540	/// Return an immutable borrow for the state with the given ID.
541	fn state(&self, id: StateID) -> &State {
542	&self.states[id]
543	}
544
545	/// Return a mutable borrow for the state with the given ID.
546	fn state_mut(&mut self, id: StateID) -> &mut State {
547	&mut self.states[id]
548	}
549	}
550
551	impl State {
552	/// Find the position at which the given range should be inserted in this
553	/// state.
554	///
555	/// The position returned is always in the inclusive range
556	/// [0, transitions.len()]. If 'transitions.len()' is returned, then the
557	/// given range overlaps with no other range in this state and* is greater*
558	/// than all of them.
559	///
560	/// For all other possible positions, the given range either overlaps
561	/// with the transition at that position or is otherwise less than it
562	/// with no overlap (and is greater than the previous transition). In the
563	/// former case, careful attention must be paid to inserting this range
564	/// as a new transition. In the latter case, the range can be inserted as
565	/// a new transition at the given position without disrupting any other
566	/// transitions.
567	fn find(&self, range: Utf8Range) -> usize {
568	/// Returns the position `i` at which `pred(xs[i])` first returns true
569	/// such that for all `j >= i`, `pred(xs[j]) == true`. If `pred` never
570	/// returns true, then `xs.len()` is returned.
571	///
572	/// We roll our own binary search because it doesn't seem like the
573	/// standard library's binary search can be used here. Namely, if
574	/// there is an overlapping range, then we want to find the first such
575	/// occurrence, but there may be many. Or at least, it's not quite
576	/// clear to me how to do it.
577	fn binary_search<T, F>(xs: &[T], mut pred: F) -> usize
578	where
579	F: FnMut(&T) -> bool,
580	{
581	let (mut left, mut right) = (`0`, xs.len());
582	while left < right {
583	// Overflow is impossible because xs.len() <= 256.
584	let mid = (left + right) / `2`;
585	if pred(&xs[mid]) {
586	right = mid;
587	} else {
588	left = mid + `1`;
589	}
590	}
591	left
592	}
593
594	// Benchmarks suggest that binary search is just a bit faster than
595	// straight linear search. Specifically when using the debug tool:
596	//
597	// hyperfine "regex-cli debug thompson -qr --no-captures '\w{90} ecurB'"
598	binary_search(&self.transitions, \|t\| range.start <= t.range.end)
599	}
600
601	/// Clear this state such that it has zero transitions.
602	fn clear(&mut self) {
603	self.transitions.clear();
604	}
605	}
606
607	/// The next state to process during duplication.
608	#[derive(Clone, Debug)]
609	struct NextDupe {
610	/// The state we want to duplicate.
611	old_id: StateID,
612	/// The ID of the new state that is a duplicate of old_id.
613	new_id: StateID,
614	}
615
616	/// The next state (and its corresponding transition) that we want to visit
617	/// during iteration in lexicographic order.
618	#[derive(Clone, Debug)]
619	struct NextIter {
620	state_id: StateID,
621	tidx: usize,
622	}
623
624	/// The next state to process during insertion and any remaining ranges that we
625	/// want to add for a particular sequence of ranges. The first such instance
626	/// is always the root state along with all ranges given.
627	#[derive(Clone, Debug)]
628	struct NextInsert {
629	/// The next state to begin inserting ranges. This state should be the
630	/// state at which `ranges[0]` should be inserted.
631	state_id: StateID,
632	/// The ranges to insert. We used a fixed-size array here to avoid an
633	/// allocation.
634	ranges: [Utf8Range; `4`],
635	/// The number of valid ranges in the above array.
636	len: u8,
637	}
638
639	impl NextInsert {
640	/// Create the next item to visit. The given state ID should correspond
641	/// to the state at which the first range in the given slice should be
642	/// inserted. The slice given must not be empty and it must be no longer
643	/// than 4.
644	fn new(state_id: StateID, ranges: &[Utf8Range]) -> NextInsert {
645	let len = ranges.len();
646	assert!(len > `0`);
647	assert!(len <= `4`);
648
649	let mut tmp = [Utf8Range { start: `0`, end: `0` }; `4`];
650	tmp[..len].copy_from_slice(ranges);
651	NextInsert { state_id, ranges: tmp, len: u8::try_from(len).unwrap() }
652	}
653
654	/// Push a new empty state to visit along with any remaining ranges that
655	/// still need to be inserted. The ID of the new empty state is returned.
656	///
657	/// If ranges is empty, then no new state is created and FINAL is returned.
658	fn push(
659	trie: &mut RangeTrie,
660	stack: &mut Vec<NextInsert>,
661	ranges: &[Utf8Range],
662	) -> StateID {
663	if ranges.is_empty() {
664	FINAL
665	} else {
666	let next_id = trie.add_empty();
667	stack.push(NextInsert::new(next_id, ranges));
668	next_id
669	}
670	}
671
672	/// Return the ID of the state to visit.
673	fn state_id(&self) -> StateID {
674	self.state_id
675	}
676
677	/// Return the remaining ranges to insert.
678	fn ranges(&self) -> &[Utf8Range] {
679	&self.ranges[..usize::try_from(self.len).unwrap()]
680	}
681	}
682
683	/// Split represents a partitioning of two ranges into one or more ranges. This
684	/// is the secret sauce that makes a range trie work, as it's what tells us
685	/// how to deal with two overlapping but unequal ranges during insertion.
686	///
687	/// Essentially, either two ranges overlap or they don't. If they don't, then
688	/// handling insertion is easy: just insert the new range into its
689	/// lexicographically correct position. Since it does not overlap with anything
690	/// else, no other transitions are impacted by the new range.
691	///
692	/// If they do overlap though, there are generally three possible cases to
693	/// handle:
694	///
695	/// 1. The part where the two ranges actually overlap. i.e., The intersection.
696	/// 2. The part of the existing range that is not in the the new range.
697	/// 3. The part of the new range that is not in the old range.
698	///
699	/// (1) is guaranteed to always occur since all overlapping ranges have a
700	/// non-empty intersection. If the two ranges are not equivalent, then at
701	/// least one of (2) or (3) is guaranteed to occur as well. In some cases,
702	/// e.g., `[0-4]` and `[4-9]`, all three cases will occur.
703	///
704	/// This `Split` type is responsible for providing (1), (2) and (3) for any
705	/// possible pair of byte ranges.
706	///
707	/// As for insertion, for the overlap in (1), the remaining ranges to insert
708	/// should be added by following the corresponding transition. However, this
709	/// should only be done for the overlapping parts of the range. If there was
710	/// a part of the existing range that was not in the new range, then that
711	/// existing part must be split off from the transition and duplicated. The
712	/// remaining parts of the overlap can then be added to using the new ranges
713	/// without disturbing the existing range.
714	///
715	/// Handling the case for the part of a new range that is not in an existing
716	/// range is seemingly easy. Just treat it as if it were a non-overlapping
717	/// range. The problem here is that if this new non-overlapping range occurs
718	/// after both (1) and (2), then it's possible that it can overlap with the
719	/// next transition in the current state. If it does, then the whole process
720	/// must be repeated!
721	///
722	/// # Details of the 3 cases
723	///
724	/// The following details the various cases that are implemented in code
725	/// below. It's plausible that the number of cases is not actually minimal,
726	/// but it's important for this code to remain at least somewhat readable.
727	///
728	/// Given [a,b] and [x,y], where a <= b, x <= y, b < 256 and y < 256, we define
729	/// the follow distinct relationships where at least one must apply. The order
730	/// of these matters, since multiple can match. The first to match applies.
731	///
732	/// 1. b < x <=> [a,b] < [x,y]
733	/// 2. y < a <=> [x,y] < [a,b]
734	///
735	/// In the case of (1) and (2), these are the only cases where there is no
736	/// overlap. Or otherwise, the intersection of [a,b] and [x,y] is empty. In
737	/// order to compute the intersection, one can do [max(a,x), min(b,y)]. The
738	/// intersection in all of the following cases is non-empty.
739	///
740	/// 3. a = x && b = y <=> [a,b] == [x,y]
741	/// 4. a = x && b < y <=> [x,y] right-extends [a,b]
742	/// 5. b = y && a > x <=> [x,y] left-extends [a,b]
743	/// 6. x = a && y < b <=> [a,b] right-extends [x,y]
744	/// 7. y = b && x > a <=> [a,b] left-extends [x,y]
745	/// 8. a > x && b < y <=> [x,y] covers [a,b]
746	/// 9. x > a && y < b <=> [a,b] covers [x,y]
747	/// 10. b = x && a < y <=> [a,b] is left-adjacent to [x,y]
748	/// 11. y = a && x < b <=> [x,y] is left-adjacent to [a,b]
749	/// 12. b > x && b < y <=> [a,b] left-overlaps [x,y]
750	/// 13. y > a && y < b <=> [x,y] left-overlaps [a,b]
751	///
752	/// In cases 3-13, we can form rules that partition the ranges into a
753	/// non-overlapping ordered sequence of ranges:
754	///
755	/// 3. [a,b]
756	/// 4. [a,b], [b+1,y]
757	/// 5. [x,a-1], [a,b]
758	/// 6. [x,y], [y+1,b]
759	/// 7. [a,x-1], [x,y]
760	/// 8. [x,a-1], [a,b], [b+1,y]
761	/// 9. [a,x-1], [x,y], [y+1,b]
762	/// 10. [a,b-1], [b,b], [b+1,y]
763	/// 11. [x,y-1], [y,y], [y+1,b]
764	/// 12. [a,x-1], [x,b], [b+1,y]
765	/// 13. [x,a-1], [a,y], [y+1,b]
766	///
767	/// In the code below, we go a step further and identify each of the above
768	/// outputs as belonging either to the overlap of the two ranges or to one
769	/// of [a,b] or [x,y] exclusively.
770	#[derive(Clone, Debug, Eq, PartialEq)]
771	struct Split {
772	partitions: [SplitRange; `3`],
773	len: usize,
774	}
775
776	/// A tagged range indicating how it was derived from a pair of ranges.
777	#[derive(Clone, Copy, Debug, Eq, PartialEq)]
778	enum SplitRange {
779	Old(Utf8Range),
780	New(Utf8Range),
781	Both(Utf8Range),
782	}
783
784	impl Split {
785	/// Create a partitioning of the given ranges.
786	///
787	/// If the given ranges have an empty intersection, then None is returned.
788	fn new(o: Utf8Range, n: Utf8Range) -> Option<Split> {
789	let range = \|r: RangeInclusive<u8>\| Utf8Range {
790	start: *r.start(),
791	end: *r.end(),
792	};
793	let old = \|r\| SplitRange::Old(range(r));
794	let new = \|r\| SplitRange::New(range(r));
795	let both = \|r\| SplitRange::Both(range(r));
796
797	// Use same names as the comment above to make it easier to compare.
798	let (a, b, x, y) = (o.start, o.end, n.start, n.end);
799
800	if b < x \|\| y < a {
801	// case 1, case 2
802	None
803	} else if a == x && b == y {
804	// case 3
805	Some(Split::parts1(both(a..=b)))
806	} else if a == x && b < y {
807	// case 4
808	Some(Split::parts2(both(a..=b), new(b + `1`..=y)))
809	} else if b == y && a > x {
810	// case 5
811	Some(Split::parts2(new(x..=a - `1`), both(a..=b)))
812	} else if x == a && y < b {
813	// case 6
814	Some(Split::parts2(both(x..=y), old(y + `1`..=b)))
815	} else if y == b && x > a {
816	// case 7
817	Some(Split::parts2(old(a..=x - `1`), both(x..=y)))
818	} else if a > x && b < y {
819	// case 8
820	Some(Split::parts3(new(x..=a - `1`), both(a..=b), new(b + `1`..=y)))
821	} else if x > a && y < b {
822	// case 9
823	Some(Split::parts3(old(a..=x - `1`), both(x..=y), old(y + `1`..=b)))
824	} else if b == x && a < y {
825	// case 10
826	Some(Split::parts3(old(a..=b - `1`), both(b..=b), new(b + `1`..=y)))
827	} else if y == a && x < b {
828	// case 11
829	Some(Split::parts3(new(x..=y - `1`), both(y..=y), old(y + `1`..=b)))
830	} else if b > x && b < y {
831	// case 12
832	Some(Split::parts3(old(a..=x - `1`), both(x..=b), new(b + `1`..=y)))
833	} else if y > a && y < b {
834	// case 13
835	Some(Split::parts3(new(x..=a - `1`), both(a..=y), old(y + `1`..=b)))
836	} else {
837	unreachable!()
838	}
839	}
840
841	/// Create a new split with a single partition. This only occurs when two
842	/// ranges are equivalent.
843	fn parts1(r1: SplitRange) -> Split {
844	// This value doesn't matter since it is never accessed.
845	let nada = SplitRange::Old(Utf8Range { start: `0`, end: `0` });
846	Split { partitions: [r1, nada, nada], len: `1` }
847	}
848
849	/// Create a new split with two partitions.
850	fn parts2(r1: SplitRange, r2: SplitRange) -> Split {
851	// This value doesn't matter since it is never accessed.
852	let nada = SplitRange::Old(Utf8Range { start: `0`, end: `0` });
853	Split { partitions: [r1, r2, nada], len: `2` }
854	}
855
856	/// Create a new split with three partitions.
857	fn parts3(r1: SplitRange, r2: SplitRange, r3: SplitRange) -> Split {
858	Split { partitions: [r1, r2, r3], len: `3` }
859	}
860
861	/// Return the partitions in this split as a slice.
862	fn as_slice(&self) -> &[SplitRange] {
863	&self.partitions[..self.len]
864	}
865	}
866
867	impl fmt::Debug for RangeTrie {
868	fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
869	writeln!(f, "")?;
870	for (i, state) in self.states.iter().enumerate() {
871	let status = if i == FINAL.as_usize() { '' } else* { ' ' };
872	writeln!(f, "{}{:06}: {:?}", status, i, state)?;
873	}
874	Ok(())
875	}
876	}
877
878	impl fmt::Debug for State {
879	fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
880	let rs = self
881	.transitions
882	.iter()
883	.map(\|t\| format!("{:?}", t))
884	.collect::<Vec<String>>()
885	.join(", ");
886	write!(f, "{}", rs)
887	}
888	}
889
890	impl fmt::Debug for Transition {
891	fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
892	if self.range.start == self.range.end {
893	write!(
894	f,
895	"{:02X} => {:02X}",
896	self.range.start,
897	self.next_id.as_usize(),
898	)
899	} else {
900	write!(
901	f,
902	"{:02X}-{:02X} => {:02X}",
903	self.range.start,
904	self.range.end,
905	self.next_id.as_usize(),
906	)
907	}
908	}
909	}
910
911	/// Returns true if and only if the given ranges intersect.
912	fn intersects(r1: Utf8Range, r2: Utf8Range) -> bool {
913	!(r1.end < r2.start \|\| r2.end < r1.start)
914	}
915
916	#[cfg(test)]
917	mod tests {
918	use core::ops::RangeInclusive;
919
920	use regex_syntax::utf8::Utf8Range;
921
922	use super::*;
923
924	fn r(range: RangeInclusive<u8>) -> Utf8Range {
925	Utf8Range { start: range.start(), end: range.end() }
926	}
927
928	fn split_maybe(
929	old: RangeInclusive<u8>,
930	new: RangeInclusive<u8>,
931	) -> Option<Split> {
932	Split::new(r(old), r(new))
933	}
934
935	fn split(
936	old: RangeInclusive<u8>,
937	new: RangeInclusive<u8>,
938	) -> Vec<SplitRange> {
939	split_maybe(old, new).unwrap().as_slice().to_vec()
940	}
941
942	#[test]
943	fn no_splits() {
944	// case 1
945	assert_eq!(None, split_maybe(`0`..=`1`, `2`..=`3`));
946	// case 2
947	assert_eq!(None, split_maybe(`2`..=`3`, `0`..=`1`));
948	}
949
950	#[test]
951	fn splits() {
952	let range = \|r: RangeInclusive<u8>\| Utf8Range {
953	start: *r.start(),
954	end: *r.end(),
955	};
956	let old = \|r\| SplitRange::Old(range(r));
957	let new = \|r\| SplitRange::New(range(r));
958	let both = \|r\| SplitRange::Both(range(r));
959
960	// case 3
961	assert_eq!(split(`0`..=`0`, `0`..=`0`), vec![both(`0`..=`0`)]);
962	assert_eq!(split(`9`..=`9`, `9`..=`9`), vec![both(`9`..=`9`)]);
963
964	// case 4
965	assert_eq!(split(`0`..=`5`, `0`..=`6`), vec![both(`0`..=`5`), new(`6`..=`6`)]);
966	assert_eq!(split(`0`..=`5`, `0`..=`8`), vec![both(`0`..=`5`), new(`6`..=`8`)]);
967	assert_eq!(split(`5`..=`5`, `5`..=`8`), vec![both(`5`..=`5`), new(`6`..=`8`)]);
968
969	// case 5
970	assert_eq!(split(`1`..=`5`, `0`..=`5`), vec![new(`0`..=`0`), both(`1`..=`5`)]);
971	assert_eq!(split(`3`..=`5`, `0`..=`5`), vec![new(`0`..=`2`), both(`3`..=`5`)]);
972	assert_eq!(split(`5`..=`5`, `0`..=`5`), vec![new(`0`..=`4`), both(`5`..=`5`)]);
973
974	// case 6
975	assert_eq!(split(`0`..=`6`, `0`..=`5`), vec![both(`0`..=`5`), old(`6`..=`6`)]);
976	assert_eq!(split(`0`..=`8`, `0`..=`5`), vec![both(`0`..=`5`), old(`6`..=`8`)]);
977	assert_eq!(split(`5`..=`8`, `5`..=`5`), vec![both(`5`..=`5`), old(`6`..=`8`)]);
978
979	// case 7
980	assert_eq!(split(`0`..=`5`, `1`..=`5`), vec![old(`0`..=`0`), both(`1`..=`5`)]);
981	assert_eq!(split(`0`..=`5`, `3`..=`5`), vec![old(`0`..=`2`), both(`3`..=`5`)]);
982	assert_eq!(split(`0`..=`5`, `5`..=`5`), vec![old(`0`..=`4`), both(`5`..=`5`)]);
983
984	// case 8
985	assert_eq!(
986	split(`3`..=`6`, `2`..=`7`),
987	vec![new(`2`..=`2`), both(`3`..=`6`), new(`7`..=`7`)],
988	);
989	assert_eq!(
990	split(`3`..=`6`, `1`..=`8`),
991	vec![new(`1`..=`2`), both(`3`..=`6`), new(`7`..=`8`)],
992	);
993
994	// case 9
995	assert_eq!(
996	split(`2`..=`7`, `3`..=`6`),
997	vec![old(`2`..=`2`), both(`3`..=`6`), old(`7`..=`7`)],
998	);
999	assert_eq!(
1000	split(`1`..=`8`, `3`..=`6`),
1001	vec![old(`1`..=`2`), both(`3`..=`6`), old(`7`..=`8`)],
1002	);
1003
1004	// case 10
1005	assert_eq!(
1006	split(`3`..=`6`, `6`..=`7`),
1007	vec![old(`3`..=`5`), both(`6`..=`6`), new(`7`..=`7`)],
1008	);
1009	assert_eq!(
1010	split(`3`..=`6`, `6`..=`8`),
1011	vec![old(`3`..=`5`), both(`6`..=`6`), new(`7`..=`8`)],
1012	);
1013	assert_eq!(
1014	split(`5`..=`6`, `6`..=`7`),
1015	vec![old(`5`..=`5`), both(`6`..=`6`), new(`7`..=`7`)],
1016	);
1017
1018	// case 11
1019	assert_eq!(
1020	split(`6`..=`7`, `3`..=`6`),
1021	vec![new(`3`..=`5`), both(`6`..=`6`), old(`7`..=`7`)],
1022	);
1023	assert_eq!(
1024	split(`6`..=`8`, `3`..=`6`),
1025	vec![new(`3`..=`5`), both(`6`..=`6`), old(`7`..=`8`)],
1026	);
1027	assert_eq!(
1028	split(`6`..=`7`, `5`..=`6`),
1029	vec![new(`5`..=`5`), both(`6`..=`6`), old(`7`..=`7`)],
1030	);
1031
1032	// case 12
1033	assert_eq!(
1034	split(`3`..=`7`, `5`..=`9`),
1035	vec![old(`3`..=`4`), both(`5`..=`7`), new(`8`..=`9`)],
1036	);
1037	assert_eq!(
1038	split(`3`..=`5`, `4`..=`6`),
1039	vec![old(`3`..=`3`), both(`4`..=`5`), new(`6`..=`6`)],
1040	);
1041
1042	// case 13
1043	assert_eq!(
1044	split(`5`..=`9`, `3`..=`7`),
1045	vec![new(`3`..=`4`), both(`5`..=`7`), old(`8`..=`9`)],
1046	);
1047	assert_eq!(
1048	split(`4`..=`6`, `3`..=`5`),
1049	vec![new(`3`..=`3`), both(`4`..=`5`), old(`6`..=`6`)],
1050	);
1051	}
1052
1053	// Arguably there should be more tests here, but in practice, this data
1054	// structure is well covered by the huge number of regex tests.
1055	}
1056

Provided by KDAB

Definitions