range_trie.rs - Codebrowser

1	// I've called the primary data structure in this module a "range trie." As far
2	// as I can tell, there is no prior art on a data structure like this, however,
3	// it's likely someone somewhere has built something like it. Searching for
4	// "range trie" turns up the paper "Range Tries for Scalable Address Lookup,"
5	// but it does not appear relevant.
6	//
7	// The range trie is just like a trie in that it is a special case of a
8	// deterministic finite state machine. It has states and each state has a set
9	// of transitions to other states. It is acyclic, and, like a normal trie,
10	// it makes no attempt to reuse common suffixes among its elements. The key
11	// difference between a normal trie and a range trie below is that a range trie
12	// operates on *contiguous sequences* of bytes instead of singleton bytes.
13	// One could say that our alphabet is ranges of bytes instead of bytes
14	// themselves, except a key part of range trie construction is splitting ranges
15	// apart to ensure there is at most one transition that can be taken for any
16	// byte in a given state.
17	//
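// As a rough illustration (not code from this module), the "at most one
// transition per byte" property means a lookup in a single state can be
// a search over sorted, non-overlapping ranges. The `Transition` struct and
// `next_state` function below are hypothetical, simplified stand-ins for the
// types defined later in this file.

```rust
/// A transition over an inclusive range of bytes (hypothetical, simplified).
#[derive(Clone, Copy, Debug)]
struct Transition {
    start: u8,
    end: u8,
    next_id: u32,
}

/// Returns the state reached on `byte`, if any.
///
/// Assumes `transitions` is sorted and that no two ranges overlap, which is
/// the invariant a range trie maintains for every state.
fn next_state(transitions: &[Transition], byte: u8) -> Option<u32> {
    // Index of the first transition whose range does not end before `byte`.
    let i = transitions.partition_point(|t| t.end < byte);
    match transitions.get(i) {
        // Because ranges do not overlap, at most one candidate can match.
        Some(t) if t.start <= byte => Some(t.next_id),
        _ => None,
    }
}
```

// Since the ranges never overlap, the first range ending at or after the
// byte is the only one that can contain it.
//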
18	// I've tried to explain the details of how the range trie works below, so
19	// for now, we are left with trying to understand what problem we're trying to
20	// solve. Which is itself fairly involved!
21	//
22	// At the highest level, here's what we want to do. We want to convert a
23	// sequence of Unicode codepoints into a finite state machine whose transitions
24	// are over *bytes* and not Unicode codepoints. We want this because it makes
25	// said finite state machines much smaller and much faster to execute. As a
26	// simple example, consider a byte oriented automaton for all Unicode scalar
27	// values (0x00 through 0x10FFFF, not including surrogate codepoints):
28	//
29	// [00-7F]
30	// [C2-DF][80-BF]
31	// [E0-E0][A0-BF][80-BF]
32	// [E1-EC][80-BF][80-BF]
33	// [ED-ED][80-9F][80-BF]
34	// [EE-EF][80-BF][80-BF]
35	// [F0-F0][90-BF][80-BF][80-BF]
36	// [F1-F3][80-BF][80-BF][80-BF]
37	// [F4-F4][80-8F][80-BF][80-BF]
38	//
39	// (These byte ranges are generated via the regex-syntax::utf8 module, which
40	// was based on Russ Cox's code in RE2, which was in turn based on Ken
41	// Thompson's implementation of the same idea in his Plan9 implementation of
42	// grep.)
43	//
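// To make the table above concrete, here is a small sketch that hard-codes
// those nine sequences and checks whether a byte string is the encoding of
// exactly one Unicode scalar value. The names `UTF8_SEQUENCES` and
// `is_scalar_value_encoding` are made up for this example; the real table is
// generated by regex-syntax rather than written out by hand.

```rust
/// The byte-range sequences for all Unicode scalar values, copied from the
/// table above. Each inner slice is one sequence of inclusive byte ranges.
const UTF8_SEQUENCES: &[&[(u8, u8)]] = &[
    &[(0x00, 0x7F)],
    &[(0xC2, 0xDF), (0x80, 0xBF)],
    &[(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    &[(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    &[(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)],
];

/// Returns true if `bytes` matches one of the sequences in full, i.e. it has
/// the same length and every byte falls in the corresponding range.
fn is_scalar_value_encoding(bytes: &[u8]) -> bool {
    UTF8_SEQUENCES.iter().any(|seq| {
        seq.len() == bytes.len()
            && seq
                .iter()
                .zip(bytes)
                .all(|(&(lo, hi), &b)| lo <= b && b <= hi)
    })
}
```

// Note how the [ED-ED][80-9F][80-BF] row is what excludes the surrogate
// codepoints: a second byte in A0-BF after ED has no matching sequence.
//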
44	// It should be fairly straightforward to see how one could compile this into
45	// a DFA. The sequences are sorted and non-overlapping. Essentially, you could
46	// build a trie from this fairly easily. The problem comes when your initial
47	// range (in this case, 0x00-0x10FFFF) isn't so nice. For example, the class
48	// represented by '\w' contains only a tenth of the codepoints that
49	// 0x00-0x10FFFF contains, but if we were to write out the byte based ranges
50	// as we did above, the list would stretch to 892 entries! This turns into
51	// quite a large NFA with a few thousand states. Turning this beast into a DFA
52	// takes quite a bit of time. We are thus left with trying to trim down the
53	// number of states we produce as early as possible.
54	//
55	// One approach (used by RE2 and still by the regex crate, at time of writing)
56	// is to try to find common suffixes while building NFA states for the above
57	// and reuse them. This is very cheap to do and one can control precisely how
58	// much extra memory you want to use for the cache.
59	//
60	// Another approach, however, is to reuse an algorithm for constructing a
61	// *minimal* DFA from a sorted sequence of inputs. I don't want to go into
62	// the full details here, but I explain it in more depth in my blog post on
63	// FSTs[1]. Note that the algorithm was not invented by me, but was published
64	// in a paper by Daciuk et al. in 2000 called "Incremental Construction of
65	// Minimal Acyclic Finite-State Automata."[2] Like the suffix cache approach above,
66	// it is also possible to control the amount of extra memory one uses, although
67	// this usually comes with the cost of sacrificing true minimality. (But it's
68	// typically close enough with a reasonably sized cache of states.)
69	//
70	// The catch is that Daciuk's algorithm only works if you add your keys in
71	// lexicographic ascending order. In our case, since we're dealing with ranges,
72	// we also need the additional requirement that ranges are either equivalent
73	// or do not overlap at all. For example, if one were given the following byte
74	// ranges:
75	//
76	// [BC-BF][80-BF]
77	// [BC-BF][90-BF]
78	//
79	// Then Daciuk's algorithm also would not work, since there is nothing to
80	// handle the fact that the ranges overlap. They would need to be split apart.
81	// Thankfully, Thompson's algorithm for producing byte ranges for Unicode
82	// codepoint ranges meets both of our requirements.
83	//
84	// ... however, we would also like to be able to compile UTF-8 automata in
85	// reverse. We want this because in order to find the starting location of a
86	// match using a DFA, we need to run a second DFA---a reversed version of the
87	// forward DFA---backwards to discover the match location. Unfortunately, if
88	// we reverse our byte sequences for 0x00-0x10FFFF, we get sequences that
89	// can overlap, even if they are sorted:
90	//
91	// [00-7F]
92	// [80-BF][80-9F][ED-ED]
93	// [80-BF][80-BF][80-8F][F4-F4]
94	// [80-BF][80-BF][80-BF][F1-F3]
95	// [80-BF][80-BF][90-BF][F0-F0]
96	// [80-BF][80-BF][E1-EC]
97	// [80-BF][80-BF][EE-EF]
98	// [80-BF][A0-BF][E0-E0]
99	// [80-BF][C2-DF]
100	//
101	// For example, '[80-BF][80-BF][EE-EF]' and '[80-BF][A0-BF][E0-E0]' have
102	// overlapping ranges between '[80-BF]' and '[A0-BF]'. Thus, there is no
103	// simple way to apply Daciuk's algorithm.
104	//
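// One way to state the "equivalent or non-overlapping" requirement as code is
// sketched below. This is an assumed formalization, not something taken from
// this module: two sequences are treated as compatible if, walking them in
// lockstep, the first position where their ranges differ holds two disjoint
// ranges. The names `Range`, `overlaps`, `compatible` and `all_compatible`
// are hypothetical.

```rust
/// An inclusive byte range, written as (start, end).
type Range = (u8, u8);

/// Returns true if the two inclusive ranges share at least one byte.
fn overlaps(a: Range, b: Range) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}

/// Returns true if the first differing pair of corresponding ranges is
/// disjoint. Sequences where one is a prefix of the other are accepted
/// here, which is an assumption of this sketch.
fn compatible(s1: &[Range], s2: &[Range]) -> bool {
    for (&r1, &r2) in s1.iter().zip(s2) {
        if r1 == r2 {
            continue;
        }
        // The paths diverge here, so later positions no longer matter.
        return !overlaps(r1, r2);
    }
    true
}

/// Checks every pair of sequences in the given set.
fn all_compatible(seqs: &[&[Range]]) -> bool {
    seqs.iter().enumerate().all(|(i, s1)| {
        seqs[i + 1..].iter().all(|s2| compatible(s1, s2))
    })
}
```

// Under this definition, the pair '[80-BF][80-BF][EE-EF]' and
// '[80-BF][A0-BF][E0-E0]' is rejected because '[80-BF]' and '[A0-BF]' differ
// yet share bytes.
//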
105	// And thus, the range trie was born. The range trie's only purpose is to take
106	// sequences of byte ranges like the ones above, collect them into a trie and
107	// then spit them out in a sorted fashion with no overlapping ranges. For example,
108	// 0x00-0x10FFFF gets translated to:
109	//
110	// [0-7F]
111	// [80-BF][80-9F][80-8F][F1-F3]
112	// [80-BF][80-9F][80-8F][F4]
113	// [80-BF][80-9F][90-BF][F0]
114	// [80-BF][80-9F][90-BF][F1-F3]
115	// [80-BF][80-9F][E1-EC]
116	// [80-BF][80-9F][ED]
117	// [80-BF][80-9F][EE-EF]
118	// [80-BF][A0-BF][80-8F][F1-F3]
119	// [80-BF][A0-BF][80-8F][F4]
120	// [80-BF][A0-BF][90-BF][F0]
121	// [80-BF][A0-BF][90-BF][F1-F3]
122	// [80-BF][A0-BF][E0]
123	// [80-BF][A0-BF][E1-EC]
124	// [80-BF][A0-BF][EE-EF]
125	// [80-BF][C2-DF]
126	//
127	// We've thus satisfied our requirements for running Daciuk's algorithm. All
128	// sequences of ranges are sorted, and any corresponding ranges are either
129	// exactly equivalent or non-overlapping.
130	//
131	// In effect, a range trie is building a DFA from a sequence of arbitrary
132	// byte ranges. But it uses an algorithm custom tailored to its input, so it
133	// is not as costly as traditional DFA construction. While it is still quite
134	// a bit more costly than the forward case (which only needs Daciuk's
135	// algorithm), it winds up saving a substantial amount of time if one is doing
136	// a full DFA powerset construction later by virtue of producing a much, much
137	// smaller NFA.
138	//
139	// [1] - https://blog.burntsushi.net/transducers/
140	// [2] - https://www.mitpressjournals.org/doi/pdfplus/10.1162/089120100561601
141
142	use std::cell::RefCell;
143	use std::fmt;
144	use std::mem;
145	use std::ops::RangeInclusive;
146	use std::u32;
147
148	use regex_syntax::utf8::Utf8Range;
149
150	/// A smaller state ID means more effective use of the CPU cache and less
151	/// time spent copying. The implementation below will panic if the state ID
152	/// space is exhausted, but in order for that to happen, the range trie itself
153	/// would use well over 100GB of memory. Moreover, it's likely impossible
154	/// for the state ID space to get that big. In fact, it's likely that even a
155	/// u16 would be good enough here. But it's not quite clear how to prove this.
156	type StateID = u32;
157
158	/// There is only one final state in this trie. Every sequence of byte ranges
159	/// added shares the same final state.
160	const FINAL: StateID = 0;
161
162	/// The root state of the trie.
163	const ROOT: StateID = 1;
164
165	/// A range trie represents an ordered set of sequences of bytes.
166	///
167	/// A range trie accepts as input a sequence of byte ranges and merges
168	/// them into the existing set such that the trie can produce a sorted
169	/// non-overlapping sequence of byte ranges. The sequence emitted corresponds
170	/// precisely to the sequence of bytes matched by the given keys, although the
171	/// byte ranges themselves may be split at different boundaries.
172	///
173	/// The order complexity of this data structure seems difficult to analyze.
174	/// If the size of a byte is held as a constant, then insertion is clearly
175	/// O(n) where n is the number of byte ranges in the input key. However, if
176	/// k=256 is our alphabet size, then insertion could be O(k^2 * n). In
177	/// particular it seems possible for pathological inputs to cause insertion
178	/// to do a lot of work. However, for what we use this data structure for,
179	/// there should be no pathological inputs since the ultimate source is always
180	/// a sorted set of Unicode scalar value ranges.
181	///
182	/// Internally, this trie is set up like a finite state machine. Note though
183	/// that it is acyclic.
184	#[derive(Clone)]
185	pub struct RangeTrie {
186	/// The states in this trie. The first is always the shared final state.
187	/// The second is always the root state. Otherwise, there is no
188	/// particular order.
189	states: Vec<State>,
190	/// A free-list of states. When a range trie is cleared, all of its states
191	/// are added to this list. Creating a new state reuses states from this list
192	/// before allocating a new one.
193	free: Vec<State>,
194	/// A stack for traversing this trie to yield sequences of byte ranges in
195	/// lexicographic order.
196	iter_stack: RefCell<Vec<NextIter>>,
197	/// A buffer that stores the current sequence during iteration.
198	iter_ranges: RefCell<Vec<Utf8Range>>,
199	/// A stack used for traversing the trie in order to (deeply) duplicate
200	/// a state.
201	dupe_stack: Vec<NextDupe>,
202	/// A stack used for traversing the trie during insertion of a new
203	/// sequence of byte ranges.
204	insert_stack: Vec<NextInsert>,
205	}
206
207	/// A single state in this trie.
208	#[derive(Clone)]
209	struct State {
210	/// A sorted sequence of non-overlapping transitions to other states. Each
211	/// transition corresponds to a single range of bytes.
212	transitions: Vec<Transition>,
213	}
214
215	/// A transition is a single range of bytes. If a particular byte is in this
216	/// range, then the corresponding machine may transition to the state pointed
217	/// to by `next_id`.
218	#[derive(Clone)]
219	struct Transition {
220	/// The byte range.
221	range: Utf8Range,
222	/// The next state to transition to.
223	next_id: StateID,
224	}
225
226	impl RangeTrie {
227	/// Create a new empty range trie.
228	pub fn new() -> RangeTrie {
229	let mut trie = RangeTrie {
230	states: vec![],
231	free: vec![],
232	iter_stack: RefCell::new(vec![]),
233	iter_ranges: RefCell::new(vec![]),
234	dupe_stack: vec![],
235	insert_stack: vec![],
236	};
237	trie.clear();
238	trie
239	}
240
241	/// Clear this range trie such that it is empty. Clearing a range trie
242	/// and reusing it can be beneficial because this may reuse allocations.
243	pub fn clear(&mut self) {
244	self.free.extend(self.states.drain(..));
245	self.add_empty(); // final
246	self.add_empty(); // root
247	}
248
249	/// Iterate over all of the sequences of byte ranges in this trie, and
250	/// call the provided function for each sequence. Iteration occurs in
251	/// lexicographic order.
252	pub fn iter<F: FnMut(&[Utf8Range])>(&self, mut f: F) {
253	let mut stack = self.iter_stack.borrow_mut();
254	stack.clear();
255	let mut ranges = self.iter_ranges.borrow_mut();
256	ranges.clear();
257
258	// We do iteration in a way that permits us to use a single buffer
259	// for our keys. We iterate in a depth first fashion, while being
260	// careful to expand our frontier as we move deeper in the trie.
261	stack.push(NextIter { state_id: ROOT, tidx: 0 });
262	while let Some(NextIter { mut state_id, mut tidx }) = stack.pop() {
263	// This could be implemented more simply without an inner loop
264	// here, but at the cost of more stack pushes.
265	loop {
266	let state = self.state(state_id);
267	// If we've visited all transitions in this state, then pop
268	// back to the parent state.
269	if tidx >= state.transitions.len() {
270	ranges.pop();
271	break;
272	}
273
274	let t = &state.transitions[tidx];
275	ranges.push(t.range);
276	if t.next_id == FINAL {
277	f(&ranges);
278	ranges.pop();
279	tidx += 1;
280	} else {
281	// Expand our frontier. Once we come back to this state
282	// via the stack, start in on the next transition.
283	stack.push(NextIter { state_id, tidx: tidx + 1 });
284	// Otherwise, move to the first transition of the next
285	// state.
286	state_id = t.next_id;
287	tidx = 0;
288	}
289	}
290	}
291	}
292
293	/// Inserts a new sequence of ranges into this trie.
294	///
295	/// The sequence given must be non-empty and must not have a length
296	/// exceeding 4.
297	pub fn insert(&mut self, ranges: &[Utf8Range]) {
298	assert!(!ranges.is_empty());
299	assert!(ranges.len() <= 4);
300
301	let mut stack = mem::replace(&mut self.insert_stack, vec![]);
302	stack.clear();
303
304	stack.push(NextInsert::new(ROOT, ranges));
305	while let Some(next) = stack.pop() {
306	let (state_id, ranges) = (next.state_id(), next.ranges());
307	assert!(!ranges.is_empty());
308
309	let (mut new, rest) = (ranges[0], &ranges[1..]);
310
311	// i corresponds to the position of the existing transition on
312	// which we are operating. Typically, the result is to remove the
313	// transition and replace it with two or more new transitions
314	// corresponding to the partitions generated by splitting the
315	// 'new' with the ith transition's range.
316	let mut i = self.state(state_id).find(new);
317
318	// In this case, there is no overlap *and* the new range is greater
319	// than all existing ranges. So we can just add it to the end.
320	if i == self.state(state_id).transitions.len() {
321	let next_id = NextInsert::push(self, &mut stack, rest);
322	self.add_transition(state_id, new, next_id);
323	continue;
324	}
325
326	// The need for this loop is a bit subtle, but basically, after
327	// we've handled the partitions from our initial split, it's
328	// possible that there will be a partition leftover that overlaps
329	// with a subsequent transition. If so, then we have to repeat
330	// the split process again with the leftovers and that subsequent
331	// transition.
332	'OUTER: loop {
333	let old = self.state(state_id).transitions[i].clone();
334	let split = match Split::new(old.range, new) {
335	Some(split) => split,
336	None => {
337	let next_id = NextInsert::push(self, &mut stack, rest);
338	self.add_transition_at(i, state_id, new, next_id);
339	continue;
340	}
341	};
342	let splits = split.as_slice();
343	// If we only have one partition, then the ranges must be
344	// equivalent. There's nothing to do here for this state, so
345	// just move on to the next one.
346	if splits.len() == 1 {
347	// ... but only if we have anything left to do.
348	if !rest.is_empty() {
349	stack.push(NextInsert::new(old.next_id, rest));
350	}
351	break;
352	}
353	// At this point, we know that 'split' is non-empty and there
354	// must be some overlap AND that the two ranges are not
355	// equivalent. Therefore, the existing range MUST be removed
356	// and split up somehow. Instead of actually doing the removal
357	// and then a subsequent insertion---with all the memory
358	// shuffling that entails---we simply overwrite the transition
359	// at position `i` for the first new transition we want to
360	// insert. After that, we're forced to do expensive inserts.
361	let mut first = true;
362	let mut add_trans =
363	|trie: &mut RangeTrie, pos, from, range, to| {
364	if first {
365	trie.set_transition_at(pos, from, range, to);
366	first = false;
367	} else {
368	trie.add_transition_at(pos, from, range, to);
369	}
370	};
371	for (j, &srange) in splits.iter().enumerate() {
372	match srange {
373	SplitRange::Old(r) => {
374	// Deep clone the state pointed to by the ith
375	// transition. This is always necessary since 'old'
376	// is always coupled with at least a 'both'
377	// partition. We don't want any new changes made
378	// via the 'both' partition to impact the part of
379	// the transition that doesn't overlap with the
380	// new range.
381	let dup_id = self.duplicate(old.next_id);
382	add_trans(self, i, state_id, r, dup_id);
383	}
384	SplitRange::New(r) => {
385	// This is a bit subtle, but if this happens to be
386	// the last partition in our split, it is possible
387	// that this overlaps with a subsequent transition.
388	// If it does, then we must repeat the whole
389	// splitting process over again with `r` and the
390	// subsequent transition.
391	{
392	let trans = &self.state(state_id).transitions;
393	if j + 1 == splits.len()
394	&& i < trans.len()
395	&& intersects(r, trans[i].range)
396	{
397	new = r;
398	continue 'OUTER;
399	}
400	}
401
402	// ... otherwise, set up exploration for a new
403	// empty state and add a brand new transition for
404	// this new range.
405	let next_id =
406	NextInsert::push(self, &mut stack, rest);
407	add_trans(self, i, state_id, r, next_id);
408	}
409	SplitRange::Both(r) => {
410	// Continue adding the remaining ranges on this
411	// path and update the transition with the new
412	// range.
413	if !rest.is_empty() {
414	stack.push(NextInsert::new(old.next_id, rest));
415	}
416	add_trans(self, i, state_id, r, old.next_id);
417	}
418	}
419	i += 1;
420	}
421	// If we've reached this point, then we know that there are
422	// no subsequent transitions with any overlap. Therefore, we
423	// can stop processing this range and move on to the next one.
424	break;
425	}
426	}
427	self.insert_stack = stack;
428	}
429
430	pub fn add_empty(&mut self) -> StateID {
431	if self.states.len() as u64 > u32::MAX as u64 {
432	// This generally should not happen since a range trie is only
433	// ever used to compile a single sequence of Unicode scalar values.
434	// If we ever got to this point, we would, at minimum, be using
435	// 96GB in just the range trie alone.
436	panic!("too many sequences added to range trie");
437	}
438	let id = self.states.len() as StateID;
439	// If we have some free states available, then use them to avoid
440	// more allocations.
441	if let Some(mut state) = self.free.pop() {
442	state.clear();
443	self.states.push(state);
444	} else {
445	self.states.push(State { transitions: vec![] });
446	}
447	id
448	}
449
450	/// Performs a deep clone of the given state and returns the duplicate's
451	/// state ID.
452	///
453	/// A "deep clone" in this context means that the state given along with
454	/// recursively all states that it points to are copied. Once complete,
455	/// the given state ID and the returned state ID share nothing.
456	///
457	/// This is useful during range trie insertion when a new range overlaps
458	/// with an existing range that is bigger than the new one. The part of
459	/// the existing range that does *not* overlap with the new one is then
460	/// duplicated so that adding the new range to the overlap doesn't disturb
461	/// the non-overlapping portion.
462	///
463	/// There's one exception: if old_id is the final state, then it is not
464	/// duplicated and the same final state is returned. This is because all
465	/// final states in this trie are equivalent.
466	fn duplicate(&mut self, old_id: StateID) -> StateID {
467	if old_id == FINAL {
468	return FINAL;
469	}
470
471	let mut stack = mem::replace(&mut self.dupe_stack, vec![]);
472	stack.clear();
473
474	let new_id = self.add_empty();
475	// old_id is the state we're cloning and new_id is the ID of the
476	// duplicated state for old_id.
477	stack.push(NextDupe { old_id, new_id });
478	while let Some(NextDupe { old_id, new_id }) = stack.pop() {
479	for i in 0..self.state(old_id).transitions.len() {
480	let t = self.state(old_id).transitions[i].clone();
481	if t.next_id == FINAL {
482	// All final states are the same, so there's no need to
483	// duplicate it.
484	self.add_transition(new_id, t.range, FINAL);
485	continue;
486	}
487
488	let new_child_id = self.add_empty();
489	self.add_transition(new_id, t.range, new_child_id);
490	stack.push(NextDupe {
491	old_id: t.next_id,
492	new_id: new_child_id,
493	});
494	}
495	}
496	self.dupe_stack = stack;
497	new_id
498	}
499
500	/// Adds the given transition to the given state.
501	///
502	/// Callers must ensure that all previous transitions in this state
503	/// are lexicographically smaller than the given range.
504	fn add_transition(
505	&mut self,
506	from_id: StateID,
507	range: Utf8Range,
508	next_id: StateID,
509	) {
510	self.state_mut(from_id)
511	.transitions
512	.push(Transition { range, next_id });
513	}
514
515	/// Like `add_transition`, except this inserts the transition just before
516	/// the ith transition.
517	fn add_transition_at(
518	&mut self,
519	i: usize,
520	from_id: StateID,
521	range: Utf8Range,
522	next_id: StateID,
523	) {
524	self.state_mut(from_id)
525	.transitions
526	.insert(i, Transition { range, next_id });
527	}
528
529	/// Overwrites the transition at position i with the given transition.
530	fn set_transition_at(
531	&mut self,
532	i: usize,
533	from_id: StateID,
534	range: Utf8Range,
535	next_id: StateID,
536	) {
537	self.state_mut(from_id).transitions[i] = Transition { range, next_id };
538	}
539
540	/// Return an immutable borrow for the state with the given ID.
541	fn state(&self, id: StateID) -> &State {
542	&self.states[id as usize]
543	}
544
545	/// Return a mutable borrow for the state with the given ID.
546	fn state_mut(&mut self, id: StateID) -> &mut State {
547	&mut self.states[id as usize]
548	}
549	}
550
551	impl State {
552	/// Find the position at which the given range should be inserted in this
553	/// state.
554	///
555	/// The position returned is always in the inclusive range
556	/// [0, transitions.len()]. If 'transitions.len()' is returned, then the
557	/// given range overlaps with no other range in this state *and* is greater
558	/// than all of them.
559	///
560	/// For all other possible positions, the given range either overlaps
561	/// with the transition at that position or is otherwise less than it
562	/// with no overlap (and is greater than the previous transition). In the
563	/// former case, careful attention must be paid to inserting this range
564	/// as a new transition. In the latter case, the range can be inserted as
565	/// a new transition at the given position without disrupting any other
566	/// transitions.
567	fn find(&self, range: Utf8Range) -> usize {
568	/// Returns the position `i` at which `pred(xs[i])` first returns true
569	/// such that for all `j >= i`, `pred(xs[j]) == true`. If `pred` never
570	/// returns true, then `xs.len()` is returned.
571	///
572	/// We roll our own binary search because it doesn't seem like the
573	/// standard library's binary search can be used here. Namely, if
574	/// there is an overlapping range, then we want to find the first such
575	/// occurrence, but there may be many. Or at least, it's not quite
576	/// clear to me how to do it.
577	fn binary_search<T, F>(xs: &[T], mut pred: F) -> usize
578	where
579	F: FnMut(&T) -> bool,
580	{
581	let (mut left, mut right) = (0, xs.len());
582	while left < right {
583	// Overflow is impossible because xs.len() <= 256.
584	let mid = (left + right) / 2;
585	if pred(&xs[mid]) {
586	right = mid;
587	} else {
588	left = mid + 1;
589	}
590	}
591	left
592	}
593
594	// Benchmarks suggest that binary search is just a bit faster than
595	// straight linear search. Specifically when using the debug tool:
596	//
597	// hyperfine "regex-automata-debug debug -acqr '\w{40} ecurB'"
598	binary_search(&self.transitions, |t| range.start <= t.range.end)
599	}
600
601	/// Clear this state such that it has zero transitions.
602	fn clear(&mut self) {
603	self.transitions.clear();
604	}
605	}
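
// The search in `State::find` is a "partition point" search: it returns the
// first index at which a predicate flips from false to true. As a sketch
// only, and assuming a toolchain that provides `slice::partition_point`
// (stable since Rust 1.52, as far as I know), the same index can be computed
// with the standard library by negating the predicate. `ByteRange`,
// `find_manual` and `find_std` are hypothetical names for this example.

```rust
/// An inclusive byte range (a simplified stand-in for `Utf8Range`).
#[derive(Clone, Copy, Debug)]
struct ByteRange {
    start: u8,
    end: u8,
}

/// Hand-rolled search mirroring the loop in `State::find`: returns the first
/// index `i` such that `range.start <= xs[i].end`, or `xs.len()` if none.
fn find_manual(xs: &[ByteRange], range: ByteRange) -> usize {
    let (mut left, mut right) = (0, xs.len());
    while left < right {
        let mid = (left + right) / 2;
        if range.start <= xs[mid].end {
            right = mid;
        } else {
            left = mid + 1;
        }
    }
    left
}

/// The same search using the standard library. `partition_point` returns the
/// first index where its predicate is false, so the predicate is negated.
fn find_std(xs: &[ByteRange], range: ByteRange) -> usize {
    xs.partition_point(|t| t.end < range.start)
}
```

// Both functions require `xs` to be sorted and non-overlapping, which is the
// invariant that every state's transitions already satisfy.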
606
607	/// The next state to process during duplication.
608	#[derive(Clone, Debug)]
609	struct NextDupe {
610	/// The state we want to duplicate.
611	old_id: StateID,
612	/// The ID of the new state that is a duplicate of old_id.
613	new_id: StateID,
614	}
615
616	/// The next state (and its corresponding transition) that we want to visit
617	/// during iteration in lexicographic order.
618	#[derive(Clone, Debug)]
619	struct NextIter {
620	state_id: StateID,
621	tidx: usize,
622	}
623
624	/// The next state to process during insertion and any remaining ranges that we
625	/// want to add for a particular sequence of ranges. The first such instance
626	/// is always the root state along with all ranges given.
627	#[derive(Clone, Debug)]
628	struct NextInsert {
629	/// The next state to begin inserting ranges. This state should be the
630	/// state at which `ranges[0]` should be inserted.
631	state_id: StateID,
632	/// The ranges to insert. We use a fixed-size array here to avoid an
633	/// allocation.
634	ranges: [Utf8Range; 4],
635	/// The number of valid ranges in the above array.
636	len: u8,
637	}
638
639	impl NextInsert {
640	/// Create the next item to visit. The given state ID should correspond
641	/// to the state at which the first range in the given slice should be
642	/// inserted. The slice given must not be empty and it must be no longer
643	/// than 4.
644	fn new(state_id: StateID, ranges: &[Utf8Range]) -> NextInsert {
645	let len = ranges.len();
646	assert!(len > 0);
647	assert!(len <= 4);
648
649	let mut tmp = [Utf8Range { start: 0, end: 0 }; 4];
650	tmp[..len].copy_from_slice(ranges);
651	NextInsert { state_id, ranges: tmp, len: len as u8 }
652	}
653
654	/// Push a new empty state to visit along with any remaining ranges that
655	/// still need to be inserted. The ID of the new empty state is returned.
656	///
657	/// If ranges is empty, then no new state is created and FINAL is returned.
658	fn push(
659	trie: &mut RangeTrie,
660	stack: &mut Vec<NextInsert>,
661	ranges: &[Utf8Range],
662	) -> StateID {
663	if ranges.is_empty() {
664	FINAL
665	} else {
666	let next_id = trie.add_empty();
667	stack.push(NextInsert::new(next_id, ranges));
668	next_id
669	}
670	}
671
672	/// Return the ID of the state to visit.
673	fn state_id(&self) -> StateID {
674	self.state_id
675	}
676
677	/// Return the remaining ranges to insert.
678	fn ranges(&self) -> &[Utf8Range] {
679	&self.ranges[..self.len as usize]
680	}
681	}
682
683	/// Split represents a partitioning of two ranges into one or more ranges. This
684	/// is the secret sauce that makes a range trie work, as it's what tells us
685	/// how to deal with two overlapping but unequal ranges during insertion.
686	///
687	/// Essentially, either two ranges overlap or they don't. If they don't, then
688	/// handling insertion is easy: just insert the new range into its
689	/// lexicographically correct position. Since it does not overlap with anything
690	/// else, no other transitions are impacted by the new range.
691	///
692	/// If they do overlap though, there are generally three possible cases to
693	/// handle:
694	///
695	/// 1. The part where the two ranges actually overlap. i.e., The intersection.
696	/// 2. The part of the existing range that is not in the new range.
697	/// 3. The part of the new range that is not in the old range.
698	///
699	/// (1) is guaranteed to always occur since all overlapping ranges have a
700	/// non-empty intersection. If the two ranges are not equivalent, then at
701	/// least one of (2) or (3) is guaranteed to occur as well. In some cases,
702	/// e.g., `[0-4]` and `[4-9]`, all three cases will occur.
703	///
704	/// This `Split` type is responsible for providing (1), (2) and (3) for any
705	/// possible pair of byte ranges.
706	///
707	/// As for insertion, for the overlap in (1), the remaining ranges to insert
708	/// should be added by following the corresponding transition. However, this
709	/// should only be done for the overlapping parts of the range. If there was
710	/// a part of the existing range that was not in the new range, then that
711	/// existing part must be split off from the transition and duplicated. The
712	/// remaining parts of the overlap can then be added to using the new ranges
713	/// without disturbing the existing range.
714	///
715	/// Handling the case for the part of a new range that is not in an existing
716	/// range is seemingly easy. Just treat it as if it were a non-overlapping
717	/// range. The problem here is that if this new non-overlapping range occurs
718	/// after both (1) and (2), then it's possible that it can overlap with the
719	/// next transition in the current state. If it does, then the whole process
720	/// must be repeated!
721	///
722	/// # Details of the 3 cases
723	///
724	/// The following details the various cases that are implemented in code
725	/// below. It's plausible that the number of cases is not actually minimal,
726	/// but it's important for this code to remain at least somewhat readable.
727	///
728	/// Given [a,b] and [x,y], where a <= b, x <= y, b < 256 and y < 256, we define
729	/// the following distinct relationships where at least one must apply. The order
730	/// of these matters, since multiple can match. The first to match applies.
731	///
732	/// 1. b < x <=> [a,b] < [x,y]
733	/// 2. y < a <=> [x,y] < [a,b]
734	///
735	/// In the case of (1) and (2), these are the only cases where there is no
736	/// overlap. Or otherwise, the intersection of [a,b] and [x,y] is empty. In
737	/// order to compute the intersection, one can do [max(a,x), min(b,y)]. The
738	/// intersection in all of the following cases is non-empty.
739	///
740	/// 3. a = x && b = y <=> [a,b] == [x,y]
741	/// 4. a = x && b < y <=> [x,y] right-extends [a,b]
742	/// 5. b = y && a > x <=> [x,y] left-extends [a,b]
743	/// 6. x = a && y < b <=> [a,b] right-extends [x,y]
744	/// 7. y = b && x > a <=> [a,b] left-extends [x,y]
745	/// 8. a > x && b < y <=> [x,y] covers [a,b]
746	/// 9. x > a && y < b <=> [a,b] covers [x,y]
747	/// 10. b = x && a < y <=> [a,b] is left-adjacent to [x,y]
748	/// 11. y = a && x < b <=> [x,y] is left-adjacent to [a,b]
749	/// 12. b > x && b < y <=> [a,b] left-overlaps [x,y]
750	/// 13. y > a && y < b <=> [x,y] left-overlaps [a,b]
751	///
752	/// In cases 3-13, we can form rules that partition the ranges into a
753	/// non-overlapping ordered sequence of ranges:
754	///
755	/// 3. [a,b]
756	/// 4. [a,b], [b+1,y]
757	/// 5. [x,a-1], [a,b]
758	/// 6. [x,y], [y+1,b]
759	/// 7. [a,x-1], [x,y]
760	/// 8. [x,a-1], [a,b], [b+1,y]
761	/// 9. [a,x-1], [x,y], [y+1,b]
762	/// 10. [a,b-1], [b,b], [b+1,y]
763	/// 11. [x,y-1], [y,y], [y+1,b]
764	/// 12. [a,x-1], [x,b], [b+1,y]
765	/// 13. [x,a-1], [a,y], [y+1,b]
766	///
767	/// In the code below, we go a step further and identify each of the above
768	/// outputs as belonging either to the overlap of the two ranges or to one
769	/// of [a,b] or [x,y] exclusively.
770	#[derive(Clone, Debug, Eq, PartialEq)]
771	struct Split {
772	partitions: [SplitRange; 3],
773	len: usize,
774	}
775
776	/// A tagged range indicating how it was derived from a pair of ranges.
777	#[derive(Clone, Copy, Debug, Eq, PartialEq)]
778	enum SplitRange {
779	Old(Utf8Range),
780	New(Utf8Range),
781	Both(Utf8Range),
782	}
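
// The thirteen cases described above can also be viewed through the
// intersection [max(a,x), min(b,y)]: whatever lies to the left of it belongs
// to whichever range starts first, and whatever lies to the right belongs to
// whichever range ends last. The sketch below is an alternative formulation
// for illustration, not the implementation used in this module, and the
// names `Range`, `Part` and `split` are hypothetical.

```rust
/// An inclusive byte range, written as (start, end).
type Range = (u8, u8);

/// A partition tagged by where it came from (mirrors `SplitRange`).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Part {
    Old(Range),
    New(Range),
    Both(Range),
}

/// Splits `old` = [a,b] and `new` = [x,y] into at most three ordered,
/// non-overlapping partitions. Returns None if the ranges do not intersect.
fn split(old: Range, new: Range) -> Option<Vec<Part>> {
    let (a, b) = old;
    let (x, y) = new;
    if b < x || y < a {
        return None;
    }
    // The intersection is non-empty at this point.
    let (lo, hi) = (a.max(x), b.min(y));
    let mut parts = Vec::with_capacity(3);
    // Left remainder: `lo - 1` cannot underflow since `lo` exceeds a or x.
    if a < lo {
        parts.push(Part::Old((a, lo - 1)));
    } else if x < lo {
        parts.push(Part::New((x, lo - 1)));
    }
    parts.push(Part::Both((lo, hi)));
    // Right remainder: `hi + 1` cannot overflow since `hi` is below b or y.
    if b > hi {
        parts.push(Part::Old((hi + 1, b)));
    } else if y > hi {
        parts.push(Part::New((hi + 1, y)));
    }
    Some(parts)
}
```

// For the `[0-4]` and `[4-9]` example mentioned above, this yields an old
// part [0-3], a shared part [4-4] and a new part [5-9], in that order.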
783
784	impl Split {
785	/// Create a partitioning of the given ranges.
786	///
787	/// If the given ranges have an empty intersection, then None is returned.
788	fn new(o: Utf8Range, n: Utf8Range) -> Option<Split> {
789	let range = |r: RangeInclusive<u8>| Utf8Range {
790	start: *r.start(),
791	end: *r.end(),
792	};
793	let old = |r| SplitRange::Old(range(r));
794	let new = |r| SplitRange::New(range(r));
795	let both = |r| SplitRange::Both(range(r));
796
797	// Use same names as the comment above to make it easier to compare.
798	let (a, b, x, y) = (o.start, o.end, n.start, n.end);
799
800	if b < x || y < a {
801	// case 1, case 2
802	None
803	} else if a == x && b == y {
804	// case 3
805	Some(Split::parts1(both(a..=b)))
806	} else if a == x && b < y {
807	// case 4
808	Some(Split::parts2(both(a..=b), new(b + 1..=y)))
809	} else if b == y && a > x {
810	// case 5
811	Some(Split::parts2(new(x..=a - 1), both(a..=b)))
812	} else if x == a && y < b {
813	// case 6
814	Some(Split::parts2(both(x..=y), old(y + 1..=b)))
815	} else if y == b && x > a {
816	// case 7
817	Some(Split::parts2(old(a..=x - 1), both(x..=y)))
818	} else if a > x && b < y {
819	// case 8
820	Some(Split::parts3(new(x..=a - 1), both(a..=b), new(b + 1..=y)))
821	} else if x > a && y < b {
822	// case 9
823	Some(Split::parts3(old(a..=x - 1), both(x..=y), old(y + 1..=b)))
824	} else if b == x && a < y {
825	// case 10
826	Some(Split::parts3(old(a..=b - 1), both(b..=b), new(b + 1..=y)))
827	} else if y == a && x < b {
828	// case 11
829	Some(Split::parts3(new(x..=y - 1), both(y..=y), old(y + 1..=b)))
830	} else if b > x && b < y {
831	// case 12
832	Some(Split::parts3(old(a..=x - 1), both(x..=b), new(b + 1..=y)))
833	} else if y > a && y < b {
834	// case 13
835	Some(Split::parts3(new(x..=a - 1), both(a..=y), old(y + 1..=b)))
836	} else {
837	unreachable!()
838	}
839	}
840
841	/// Create a new split with a single partition. This only occurs when two
842	/// ranges are equivalent.
843	fn parts1(r1: SplitRange) -> Split {
844	// Filler for the unused slots; `as_slice` never exposes it.
845	let nada = SplitRange::Old(Utf8Range { start: 0, end: 0 });
846	Split { partitions: [r1, nada, nada], len: 1 }
847	}
848
849	/// Create a new split with two partitions.
850	fn parts2(r1: SplitRange, r2: SplitRange) -> Split {
851	// Filler for the unused slot; `as_slice` never exposes it.
852	let nada = SplitRange::Old(Utf8Range { start: 0, end: 0 });
853	Split { partitions: [r1, r2, nada], len: 2 }
854	}
855
856	/// Create a new split with three partitions.
857	fn parts3(r1: SplitRange, r2: SplitRange, r3: SplitRange) -> Split {
858	Split { partitions: [r1, r2, r3], len: 3 }
859	}
860
861	/// Return the partitions in this split as a slice.
862	fn as_slice(&self) -> &[SplitRange] {
863	&self.partitions[..self.len]
864	}
865	}
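// A minimal sketch of the storage pattern `Split` uses: a fixed array of three
// slots plus a length, so a split never touches the heap and `as_slice` hides
// the unused slots. `Inline3` and `from_slice` are hypothetical names that do
// not exist in this module; the real code builds the array through `parts1`,
// `parts2` and `parts3` instead.

```rust
/// A list of one to three items stored inline (no heap allocation).
/// Hypothetical type mirroring the `partitions`/`len` pair of `Split`.
#[derive(Clone, Copy, Debug)]
struct Inline3<T: Copy> {
    items: [T; 3],
    len: usize,
}

impl<T: Copy> Inline3<T> {
    /// Build from a slice holding one, two or three items. Unused slots are
    /// filled with a copy of the first item and are never exposed.
    fn from_slice(src: &[T]) -> Inline3<T> {
        assert!(!src.is_empty() && src.len() <= 3);
        let mut items = [src[0]; 3];
        items[..src.len()].copy_from_slice(src);
        Inline3 { items, len: src.len() }
    }

    /// Only the first `len` slots are visible to callers.
    fn as_slice(&self) -> &[T] {
        &self.items[..self.len]
    }
}
```

// The trade-off is that every value occupies three slots even when only one is
// in use, which is cheap here because each slot is a small `Copy` value.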
866
867	impl fmt::Debug for RangeTrie {
868	fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
869	writeln!(f, "")?;
870	for (i, state) in self.states.iter().enumerate() {
871	let status = if i == FINAL as usize { '*' } else { ' ' };
872	writeln!(f, "{}{:06}: {:?}", status, i, state)?;
873	}
874	Ok(())
875	}
876	}
877
878	impl fmt::Debug for State {
879	fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
880	let rs = self
881	.transitions
882	.iter()
883	.map(|t| format!("{:?}", t))
884	.collect::<Vec<String>>()
885	.join(", ");
886	write!(f, "{}", rs)
887	}
888	}
889
890	impl fmt::Debug for Transition {
891	fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
892	if self.range.start == self.range.end {
893	write!(f, "{:02X} => {:02X}", self.range.start, self.next_id)
894	} else {
895	write!(
896	f,
897	"{:02X}-{:02X} => {:02X}",
898	self.range.start, self.range.end, self.next_id
899	)
900	}
901	}
902	}
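// A small sketch of the text the `Debug` impl for `Transition` produces, to
// make dumps of the trie easier to read. `fmt_transition` is a hypothetical
// free function, and treating the next state ID as a `u32` is an assumption;
// the real impl formats the fields of a `Transition` directly.

```rust
/// Render one transition: a single byte prints as `61 => 02`, while a range
/// of bytes prints as `80-BF => 0A`. All numbers are upper-case hex.
fn fmt_transition(start: u8, end: u8, next_id: u32) -> String {
    if start == end {
        format!("{:02X} => {:02X}", start, next_id)
    } else {
        format!("{:02X}-{:02X} => {:02X}", start, end, next_id)
    }
}
```

// Note that `02` in `{:02X}` is a minimum width, so a state ID of 300 is
// rendered as `12C` instead of being truncated.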
903
904	/// Returns true if and only if the given ranges intersect.
905	fn intersects(r1: Utf8Range, r2: Utf8Range) -> bool {
906	!(r1.end < r2.start || r2.end < r1.start)
907	}
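// A sketch showing that the "not disjoint" test used by `intersects` agrees
// with the [max(a,x), min(b,y)] formulation from the `Split` documentation.
// The functions below work on plain (start, end) pairs and are illustrative
// stand-ins; `intersects_via_bounds` and `disagreements` are hypothetical
// names, and well-formed ranges (start <= end) are assumed throughout.

```rust
/// The same predicate as `intersects` above, on `(start, end)` pairs.
fn intersects(r1: (u8, u8), r2: (u8, u8)) -> bool {
    !(r1.1 < r2.0 || r2.1 < r1.0)
}

/// Equivalent formulation: the overlap `[max(starts), min(ends)]` is
/// non-empty exactly when its start does not exceed its end.
fn intersects_via_bounds(r1: (u8, u8), r2: (u8, u8)) -> bool {
    r1.0.max(r2.0) <= r1.1.min(r2.1)
}

/// Count the pairs of well-formed ranges with bounds in `0..=limit` on which
/// the two predicates disagree.
fn disagreements(limit: u8) -> usize {
    let mut count = 0;
    for a in 0..=limit {
        for b in a..=limit {
            for x in 0..=limit {
                for y in x..=limit {
                    let (r1, r2) = ((a, b), (x, y));
                    if intersects(r1, r2) != intersects_via_bounds(r1, r2) {
                        count += 1;
                    }
                }
            }
        }
    }
    count
}
```

// Calling `disagreements` with a small limit such as 15 checks every pair of
// ranges over that alphabet, including ranges that merely share an endpoint.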
908
909	#[cfg(test)]
910	mod tests {
911	use std::ops::RangeInclusive;
912
913	use regex_syntax::utf8::Utf8Range;
914
915	use super::*;
916
917	fn r(range: RangeInclusive<u8>) -> Utf8Range {
918	Utf8Range { start: *range.start(), end: *range.end() }
919	}
920
921	fn split_maybe(
922	old: RangeInclusive<u8>,
923	new: RangeInclusive<u8>,
924	) -> Option<Split> {
925	Split::new(r(old), r(new))
926	}
927
928	fn split(
929	old: RangeInclusive<u8>,
930	new: RangeInclusive<u8>,
931	) -> Vec<SplitRange> {
932	split_maybe(old, new).unwrap().as_slice().to_vec()
933	}
934
935	#[test]
936	fn no_splits() {
937	// case 1
938	assert_eq!(None, split_maybe(0..=1, 2..=3));
939	// case 2
940	assert_eq!(None, split_maybe(2..=3, 0..=1));
941	}
942
943	#[test]
944	fn splits() {
945	let range = |r: RangeInclusive<u8>| Utf8Range {
946	start: *r.start(),
947	end: *r.end(),
948	};
949	let old = |r| SplitRange::Old(range(r));
950	let new = |r| SplitRange::New(range(r));
951	let both = |r| SplitRange::Both(range(r));
952
953	// case 3
954	assert_eq!(split(0..=0, 0..=0), vec![both(0..=0)]);
955	assert_eq!(split(9..=9, 9..=9), vec![both(9..=9)]);
956	
957	// case 4
958	assert_eq!(split(0..=5, 0..=6), vec![both(0..=5), new(6..=6)]);
959	assert_eq!(split(0..=5, 0..=8), vec![both(0..=5), new(6..=8)]);
960	assert_eq!(split(5..=5, 5..=8), vec![both(5..=5), new(6..=8)]);
961	
962	// case 5
963	assert_eq!(split(1..=5, 0..=5), vec![new(0..=0), both(1..=5)]);
964	assert_eq!(split(3..=5, 0..=5), vec![new(0..=2), both(3..=5)]);
965	assert_eq!(split(5..=5, 0..=5), vec![new(0..=4), both(5..=5)]);
966	
967	// case 6
968	assert_eq!(split(0..=6, 0..=5), vec![both(0..=5), old(6..=6)]);
969	assert_eq!(split(0..=8, 0..=5), vec![both(0..=5), old(6..=8)]);
970	assert_eq!(split(5..=8, 5..=5), vec![both(5..=5), old(6..=8)]);
971	
972	// case 7
973	assert_eq!(split(0..=5, 1..=5), vec![old(0..=0), both(1..=5)]);
974	assert_eq!(split(0..=5, 3..=5), vec![old(0..=2), both(3..=5)]);
975	assert_eq!(split(0..=5, 5..=5), vec![old(0..=4), both(5..=5)]);
976
977	// case 8
978	assert_eq!(
979	split(3..=6, 2..=7),
980	vec![new(2..=2), both(3..=6), new(7..=7)],
981	);
982	assert_eq!(
983	split(3..=6, 1..=8),
984	vec![new(1..=2), both(3..=6), new(7..=8)],
985	);
986	
987	// case 9
988	assert_eq!(
989	split(2..=7, 3..=6),
990	vec![old(2..=2), both(3..=6), old(7..=7)],
991	);
992	assert_eq!(
993	split(1..=8, 3..=6),
994	vec![old(1..=2), both(3..=6), old(7..=8)],
995	);
996	
997	// case 10
998	assert_eq!(
999	split(3..=6, 6..=7),
1000	vec![old(3..=5), both(6..=6), new(7..=7)],
1001	);
1002	assert_eq!(
1003	split(3..=6, 6..=8),
1004	vec![old(3..=5), both(6..=6), new(7..=8)],
1005	);
1006	assert_eq!(
1007	split(5..=6, 6..=7),
1008	vec![old(5..=5), both(6..=6), new(7..=7)],
1009	);
1010	
1011	// case 11
1012	assert_eq!(
1013	split(6..=7, 3..=6),
1014	vec![new(3..=5), both(6..=6), old(7..=7)],
1015	);
1016	assert_eq!(
1017	split(6..=8, 3..=6),
1018	vec![new(3..=5), both(6..=6), old(7..=8)],
1019	);
1020	assert_eq!(
1021	split(6..=7, 5..=6),
1022	vec![new(5..=5), both(6..=6), old(7..=7)],
1023	);
1024	
1025	// case 12
1026	assert_eq!(
1027	split(3..=7, 5..=9),
1028	vec![old(3..=4), both(5..=7), new(8..=9)],
1029	);
1030	assert_eq!(
1031	split(3..=5, 4..=6),
1032	vec![old(3..=3), both(4..=5), new(6..=6)],
1033	);
1034	
1035	// case 13
1036	assert_eq!(
1037	split(5..=9, 3..=7),
1038	vec![new(3..=4), both(5..=7), old(8..=9)],
1039	);
1040	assert_eq!(
1041	split(4..=6, 3..=5),
1042	vec![new(3..=3), both(4..=5), old(6..=6)],
1043	);
1044	}
1045
1046	// Arguably there should be more tests here, but in practice, this data
1047	// structure is well covered by the huge number of regex tests.
1048	}
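// One way the hand-written cases in the tests above could be supplemented is
// with a brute-force oracle: classify every byte individually and merge runs
// of bytes with the same owner. This is only a sketch under stated
// assumptions. `Owner` and `oracle` are hypothetical names, ranges are plain
// (u8, u8) pairs, and nothing like this exists in the module. For two
// intersecting ranges the runs should match the pieces `Split::new` returns.

```rust
/// Which input range(s) contain a given run of bytes.
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
enum Owner {
    Old,
    New,
    Both,
}

/// Classify every byte in the union of two inclusive ranges and merge
/// consecutive bytes with the same owner into `(owner, start, end)` runs.
fn oracle(old: (u8, u8), new: (u8, u8)) -> Vec<(Owner, u8, u8)> {
    let in_old = |c: u8| old.0 <= c && c <= old.1;
    let in_new = |c: u8| new.0 <= c && c <= new.1;
    let mut runs: Vec<(Owner, u8, u8)> = Vec::new();
    for c in 0..=u8::MAX {
        let owner = match (in_old(c), in_new(c)) {
            (true, true) => Owner::Both,
            (true, false) => Owner::Old,
            (false, true) => Owner::New,
            (false, false) => continue, // byte is in neither range
        };
        match runs.last_mut() {
            // Extend the current run when the owner is unchanged and the
            // byte is contiguous with it.
            Some(last) if last.0 == owner && last.2 + 1 == c => last.2 = c,
            _ => runs.push((owner, c, c)),
        }
    }
    runs
}
```

// Unlike `Split::new`, the oracle does not return `None` for disjoint inputs.
// It simply reports one Old run and one New run with no Both run, so a
// comparison against `Split::new` would need to handle that case separately.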
1049