1/*!
2This module defines a DFA state representation and builders for constructing
3DFA states.
4
5This representation is specifically for use in implementations of NFA-to-DFA
6conversion via powerset construction. (Also called "determinization" in this
7crate.)
8
9The term "DFA state" is somewhat overloaded in this crate. In some cases, it
10refers to the set of transitions over an alphabet for a particular state. In
11other cases, it refers to a set of NFA states. The former is really about the
12final representation of a state in a DFA's transition table, where as the
13latter---what this module is focused on---is closer to an intermediate form
14that is used to help eventually build the transition table.
15
16This module exports four types. All four types represent the same idea: an
17ordered set of NFA states. This ordered set represents the epsilon closure of a
18particular NFA state, where the "epsilon closure" is the set of NFA states that
19can be transitioned to without consuming any input. i.e., Follow all of the NFA
20state's epsilon transitions. In addition, this implementation of DFA states
21cares about two other things: the ordered set of pattern IDs corresponding
22to the patterns that match if the state is a match state, and the set of
23look-behind assertions that were true when the state was created.
24
25The first, `State`, is a frozen representation of a state that cannot be
26modified. It may be cheaply cloned without copying the state itself and can be
27accessed safely from multiple threads simultaneously. This type is useful for
28when one knows that the DFA state being constructed is distinct from any other
29previously constructed states. Namely, powerset construction, in practice,
30requires one to keep a cache of previously created DFA states. Otherwise,
31the number of DFA states created in memory balloons to an impractically
32large number. For this reason, equivalent states should endeavor to have an
33equivalent byte-level representation. (In general, "equivalency" here means,
34"equivalent assertions, pattern IDs and NFA state IDs." We do not require that
35full DFA minimization be implemented here. This form of equivalency is only
36surface deep and is more-or-less a practical necessity.)
37
38The other three types represent different phases in the construction of a
39DFA state. Internally, these three types (and `State`) all use the same
40byte-oriented representation. That means one can use any of the builder types
41to check whether the state it represents already exists or not. If it does,
42then there is no need to freeze it into a `State` (which requires an alloc and
43a copy). Here are the three types described succinctly:
44
45* `StateBuilderEmpty` represents a state with no pattern IDs, no assertions
46and no NFA states. Creating a `StateBuilderEmpty` performs no allocs. A
47`StateBuilderEmpty` can only be used to query its underlying memory capacity,
48or to convert into a builder for recording pattern IDs and/or assertions.
49
50* `StateBuilderMatches` represents a state with zero or more pattern IDs, zero
51or more satisfied assertions and zero NFA state IDs. A `StateBuilderMatches`
52can only be used for adding pattern IDs and recording assertions.
53
54* `StateBuilderNFA` represents a state with zero or more pattern IDs, zero or
55more satisfied assertions and zero or more NFA state IDs. A `StateBuilderNFA`
56can only be used for adding NFA state IDs and recording some assertions.
57
58The expected flow here is to use the above builders to construct a candidate
59DFA state to check if it already exists. If it does, then there's no need to
60freeze it into a `State`. It it doesn't exist, then `StateBuilderNFA::to_state`
61can be called to freeze the builder into an immutable `State`. In either
62case, `clear` should be called on the builder to turn it back into a
63`StateBuilderEmpty` that reuses the underlying memory.
64
65The main purpose for splitting the builder into these distinct types is to
66make it impossible to do things like adding a pattern ID after adding an NFA
67state ID. Namely, this makes it simpler to use a space-and-time efficient
68binary representation for the state. (The format is documented on the `Repr`
69type below.) If we just used one type for everything, it would be possible for
70callers to use an incorrect interleaving of calls and thus result in a corrupt
71representation. I chose to use more type machinery to make this impossible to
72do because 1) determinization is itself pretty complex and it wouldn't be too
73hard to foul this up and 2) there isn't too much machinery involved and it's
74well contained.
75
76As an optimization, sometimes states won't have certain things set. For
77example, if the underlying NFA has no word boundary assertions, then there is
78no reason to set a state's look-behind assertion as to whether it was generated
79from a word byte or not. Similarly, if a state has no NFA states corresponding
80to look-around assertions, then there is no reason to set `look_have` to a
81non-empty set. Finally, callers usually omit unconditional epsilon transitions
82when adding NFA state IDs since they aren't discriminatory.
83
84Finally, the binary representation used by these states is, thankfully, not
85serialized anywhere. So any kind of change can be made with reckless abandon,
86as long as everything in this module agrees.
87*/
88
89use core::{convert::TryFrom, mem};
90
91use alloc::{sync::Arc, vec::Vec};
92
93use crate::util::{
94 int::{I32, U32},
95 look::LookSet,
96 primitives::{PatternID, StateID},
97 wire::{self, Endian},
98};
99
100/// A DFA state that, at its core, is represented by an ordered set of NFA
101/// states.
102///
103/// This type is intended to be used only in NFA-to-DFA conversion via powerset
104/// construction.
105///
106/// It may be cheaply cloned and accessed safely from multiple threads
107/// simultaneously.
108#[derive(Clone, Eq, Hash, PartialEq, PartialOrd, Ord)]
109pub(crate) struct State(Arc<[u8]>);
110
111/// This Borrow impl permits us to lookup any state in a map by its byte
112/// representation. This is particularly convenient when one has a StateBuilder
113/// and we want to see if a correspondingly equivalent state already exists. If
114/// one does exist, then we can reuse the allocation required by StateBuilder
115/// without having to convert it into a State first.
116impl core::borrow::Borrow<[u8]> for State {
117 fn borrow(&self) -> &[u8] {
118 &*self.0
119 }
120}
121
122impl core::fmt::Debug for State {
123 fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
124 f.debug_tuple("State").field(&self.repr()).finish()
125 }
126}
127
128/// For docs on these routines, see the internal Repr and ReprVec types below.
129impl State {
130 pub(crate) fn dead() -> State {
131 StateBuilderEmpty::new().into_matches().into_nfa().to_state()
132 }
133
134 pub(crate) fn is_match(&self) -> bool {
135 self.repr().is_match()
136 }
137
138 pub(crate) fn is_from_word(&self) -> bool {
139 self.repr().is_from_word()
140 }
141
142 pub(crate) fn is_half_crlf(&self) -> bool {
143 self.repr().is_half_crlf()
144 }
145
146 pub(crate) fn look_have(&self) -> LookSet {
147 self.repr().look_have()
148 }
149
150 pub(crate) fn look_need(&self) -> LookSet {
151 self.repr().look_need()
152 }
153
154 pub(crate) fn match_len(&self) -> usize {
155 self.repr().match_len()
156 }
157
158 pub(crate) fn match_pattern(&self, index: usize) -> PatternID {
159 self.repr().match_pattern(index)
160 }
161
162 pub(crate) fn match_pattern_ids(&self) -> Option<Vec<PatternID>> {
163 self.repr().match_pattern_ids()
164 }
165
166 #[cfg(all(test, not(miri)))]
167 pub(crate) fn iter_match_pattern_ids<F: FnMut(PatternID)>(&self, f: F) {
168 self.repr().iter_match_pattern_ids(f)
169 }
170
171 pub(crate) fn iter_nfa_state_ids<F: FnMut(StateID)>(&self, f: F) {
172 self.repr().iter_nfa_state_ids(f)
173 }
174
175 pub(crate) fn memory_usage(&self) -> usize {
176 self.0.len()
177 }
178
179 fn repr(&self) -> Repr<'_> {
180 Repr(&*self.0)
181 }
182}
183
184/// A state builder that represents an empty state.
185///
186/// This is a useful "initial condition" for state construction. It has no
187/// NFA state IDs, no assertions set and no pattern IDs. No allocations are
188/// made when new() is called. Its main use is for being converted into a
189/// builder that can capture assertions and pattern IDs.
190#[derive(Clone, Debug)]
191pub(crate) struct StateBuilderEmpty(Vec<u8>);
192
193/// For docs on these routines, see the internal Repr and ReprVec types below.
194impl StateBuilderEmpty {
195 pub(crate) fn new() -> StateBuilderEmpty {
196 StateBuilderEmpty(alloc::vec![])
197 }
198
199 pub(crate) fn into_matches(mut self) -> StateBuilderMatches {
200 self.0.extend_from_slice(&[0, 0, 0, 0, 0, 0, 0, 0, 0]);
201 StateBuilderMatches(self.0)
202 }
203
204 fn clear(&mut self) {
205 self.0.clear();
206 }
207
208 pub(crate) fn capacity(&self) -> usize {
209 self.0.capacity()
210 }
211}
212
213/// A state builder that collects assertions and pattern IDs.
214///
215/// When collecting pattern IDs is finished, this can be converted into a
216/// builder that collects NFA state IDs.
217#[derive(Clone)]
218pub(crate) struct StateBuilderMatches(Vec<u8>);
219
220impl core::fmt::Debug for StateBuilderMatches {
221 fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
222 f.debug_tuple("StateBuilderMatches").field(&self.repr()).finish()
223 }
224}
225
226/// For docs on these routines, see the internal Repr and ReprVec types below.
227impl StateBuilderMatches {
228 pub(crate) fn into_nfa(mut self) -> StateBuilderNFA {
229 self.repr_vec().close_match_pattern_ids();
230 StateBuilderNFA { repr: self.0, prev_nfa_state_id: StateID::ZERO }
231 }
232
233 pub(crate) fn set_is_from_word(&mut self) {
234 self.repr_vec().set_is_from_word()
235 }
236
237 pub(crate) fn set_is_half_crlf(&mut self) {
238 self.repr_vec().set_is_half_crlf()
239 }
240
241 pub(crate) fn look_have(&self) -> LookSet {
242 LookSet::read_repr(&self.0[1..])
243 }
244
245 pub(crate) fn set_look_have(
246 &mut self,
247 set: impl FnMut(LookSet) -> LookSet,
248 ) {
249 self.repr_vec().set_look_have(set)
250 }
251
252 pub(crate) fn add_match_pattern_id(&mut self, pid: PatternID) {
253 self.repr_vec().add_match_pattern_id(pid)
254 }
255
256 fn repr(&self) -> Repr<'_> {
257 Repr(&self.0)
258 }
259
260 fn repr_vec(&mut self) -> ReprVec<'_> {
261 ReprVec(&mut self.0)
262 }
263}
264
265/// A state builder that collects some assertions and NFA state IDs.
266///
267/// When collecting NFA state IDs is finished, this can be used to build a
268/// `State` if necessary.
269///
270/// When dont with building a state (regardless of whether it got kept or not),
271/// it's usually a good idea to call `clear` to get an empty builder back so
272/// that it can be reused to build the next state.
273#[derive(Clone)]
274pub(crate) struct StateBuilderNFA {
275 repr: Vec<u8>,
276 prev_nfa_state_id: StateID,
277}
278
279impl core::fmt::Debug for StateBuilderNFA {
280 fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
281 f.debug_tuple("StateBuilderNFA").field(&self.repr()).finish()
282 }
283}
284
285/// For docs on these routines, see the internal Repr and ReprVec types below.
286impl StateBuilderNFA {
287 pub(crate) fn to_state(&self) -> State {
288 State(Arc::from(&*self.repr))
289 }
290
291 pub(crate) fn clear(self) -> StateBuilderEmpty {
292 let mut builder = StateBuilderEmpty(self.repr);
293 builder.clear();
294 builder
295 }
296
297 pub(crate) fn look_need(&self) -> LookSet {
298 self.repr().look_need()
299 }
300
301 pub(crate) fn set_look_have(
302 &mut self,
303 set: impl FnMut(LookSet) -> LookSet,
304 ) {
305 self.repr_vec().set_look_have(set)
306 }
307
308 pub(crate) fn set_look_need(
309 &mut self,
310 set: impl FnMut(LookSet) -> LookSet,
311 ) {
312 self.repr_vec().set_look_need(set)
313 }
314
315 pub(crate) fn add_nfa_state_id(&mut self, sid: StateID) {
316 ReprVec(&mut self.repr)
317 .add_nfa_state_id(&mut self.prev_nfa_state_id, sid)
318 }
319
320 pub(crate) fn as_bytes(&self) -> &[u8] {
321 &self.repr
322 }
323
324 fn repr(&self) -> Repr<'_> {
325 Repr(&self.repr)
326 }
327
328 fn repr_vec(&mut self) -> ReprVec<'_> {
329 ReprVec(&mut self.repr)
330 }
331}
332
333/// Repr is a read-only view into the representation of a DFA state.
334///
335/// Primarily, a Repr is how we achieve DRY: we implement decoding the format
336/// in one place, and then use a Repr to implement the various methods on the
337/// public state types.
338///
339/// The format is as follows:
340///
341/// The first three bytes correspond to bitsets.
342///
343/// Byte 0 is a bitset corresponding to miscellaneous flags associated with the
344/// state. Bit 0 is set to 1 if the state is a match state. Bit 1 is set to 1
345/// if the state has pattern IDs explicitly written to it. (This is a flag that
346/// is not meant to be set by determinization, but rather, is used as part of
347/// an internal space-saving optimization.) Bit 2 is set to 1 if the state was
348/// generated by a transition over a "word" byte. (Callers may not always set
349/// this. For example, if the NFA has no word boundary assertion, then needing
350/// to track whether a state came from a word byte or not is superfluous and
351/// wasteful.) Bit 3 is set to 1 if the state was generated by a transition
352/// from a `\r` (forward search) or a `\n` (reverse search) when CRLF mode is
353/// enabled.
354///
355/// Bytes 1..5 correspond to the look-behind assertions that were satisfied
356/// by the transition that created this state. (Look-ahead assertions are not
357/// tracked as part of states. Instead, these are applied by re-computing the
358/// epsilon closure of a state when computing the transition function. See
359/// `next` in the parent module.)
360///
361/// Bytes 5..9 correspond to the set of look-around assertions (including both
362/// look-behind and look-ahead) that appear somewhere in this state's set of
363/// NFA state IDs. This is used to determine whether this state's epsilon
364/// closure should be re-computed when computing the transition function.
365/// Namely, look-around assertions are "just" conditional epsilon transitions,
366/// so if there are new assertions available when computing the transition
367/// function, we should only re-compute the epsilon closure if those new
368/// assertions are relevant to this particular state.
369///
370/// Bytes 9..13 correspond to a 32-bit native-endian encoded integer
371/// corresponding to the number of patterns encoded in this state. If the state
372/// is not a match state (byte 0 bit 0 is 0) or if it's only pattern ID is
373/// PatternID::ZERO, then no integer is encoded at this position. Instead, byte
374/// offset 3 is the position at which the first NFA state ID is encoded.
375///
376/// For a match state with at least one non-ZERO pattern ID, the next bytes
377/// correspond to a sequence of 32-bit native endian encoded integers that
378/// represent each pattern ID, in order, that this match state represents.
379///
380/// After the pattern IDs (if any), NFA state IDs are delta encoded as
381/// varints.[1] The first NFA state ID is encoded as itself, and each
382/// subsequent NFA state ID is encoded as the difference between itself and the
383/// previous NFA state ID.
384///
385/// [1] - https://developers.google.com/protocol-buffers/docs/encoding#varints
386struct Repr<'a>(&'a [u8]);
387
388impl<'a> Repr<'a> {
389 /// Returns true if and only if this is a match state.
390 ///
391 /// If callers have added pattern IDs to this state, then callers MUST set
392 /// this state as a match state explicitly. However, as a special case,
393 /// states that are marked as match states but with no pattern IDs, then
394 /// the state is treated as if it had a single pattern ID equivalent to
395 /// PatternID::ZERO.
396 fn is_match(&self) -> bool {
397 self.0[0] & (1 << 0) > 0
398 }
399
400 /// Returns true if and only if this state has had at least one pattern
401 /// ID added to it.
402 ///
403 /// This is an internal-only flag that permits the representation to save
404 /// space in the common case of an NFA with one pattern in it. In that
405 /// case, a match state can only ever have exactly one pattern ID:
406 /// PatternID::ZERO. So there's no need to represent it.
407 fn has_pattern_ids(&self) -> bool {
408 self.0[0] & (1 << 1) > 0
409 }
410
411 /// Returns true if and only if this state is marked as having been created
412 /// from a transition over a word byte. This is useful for checking whether
413 /// a word boundary assertion is true or not, which requires look-behind
414 /// (whether the current state came from a word byte or not) and look-ahead
415 /// (whether the transition byte is a word byte or not).
416 ///
417 /// Since states with this set are distinct from states that don't have
418 /// this set (even if they are otherwise equivalent), callers should not
419 /// set this assertion unless the underlying NFA has at least one word
420 /// boundary assertion somewhere. Otherwise, a superfluous number of states
421 /// may be created.
422 fn is_from_word(&self) -> bool {
423 self.0[0] & (1 << 2) > 0
424 }
425
426 /// Returns true if and only if this state is marked as being inside of a
427 /// CRLF terminator. In the forward direction, this means the state was
428 /// created after seeing a `\r`. In the reverse direction, this means the
429 /// state was created after seeing a `\n`.
430 fn is_half_crlf(&self) -> bool {
431 self.0[0] & (1 << 3) > 0
432 }
433
434 /// The set of look-behind assertions that were true in the transition that
435 /// created this state.
436 ///
437 /// Generally, this should be empty if 'look_need' is empty, since there is
438 /// no reason to track which look-behind assertions are true if the state
439 /// has no conditional epsilon transitions.
440 ///
441 /// Satisfied look-ahead assertions are not tracked in states. Instead,
442 /// these are re-computed on demand via epsilon closure when computing the
443 /// transition function.
444 fn look_have(&self) -> LookSet {
445 LookSet::read_repr(&self.0[1..])
446 }
447
448 /// The set of look-around (both behind and ahead) assertions that appear
449 /// at least once in this state's set of NFA states.
450 ///
451 /// This is used to determine whether the epsilon closure needs to be
452 /// re-computed when computing the transition function. Namely, if the
453 /// state has no conditional epsilon transitions, then there is no need
454 /// to re-compute the epsilon closure.
455 fn look_need(&self) -> LookSet {
456 LookSet::read_repr(&self.0[5..])
457 }
458
459 /// Returns the total number of match pattern IDs in this state.
460 ///
461 /// If this state is not a match state, then this always returns 0.
462 fn match_len(&self) -> usize {
463 if !self.is_match() {
464 return 0;
465 } else if !self.has_pattern_ids() {
466 1
467 } else {
468 self.encoded_pattern_len()
469 }
470 }
471
472 /// Returns the pattern ID for this match state at the given index.
473 ///
474 /// If the given index is greater than or equal to `match_len()` for this
475 /// state, then this could panic or return incorrect results.
476 fn match_pattern(&self, index: usize) -> PatternID {
477 if !self.has_pattern_ids() {
478 PatternID::ZERO
479 } else {
480 let offset = 13 + index * PatternID::SIZE;
481 // This is OK since we only ever serialize valid PatternIDs to
482 // states.
483 wire::read_pattern_id_unchecked(&self.0[offset..]).0
484 }
485 }
486
487 /// Returns a copy of all match pattern IDs in this state. If this state
488 /// is not a match state, then this returns None.
489 fn match_pattern_ids(&self) -> Option<Vec<PatternID>> {
490 if !self.is_match() {
491 return None;
492 }
493 let mut pids = alloc::vec![];
494 self.iter_match_pattern_ids(|pid| pids.push(pid));
495 Some(pids)
496 }
497
498 /// Calls the given function on every pattern ID in this state.
499 fn iter_match_pattern_ids<F: FnMut(PatternID)>(&self, mut f: F) {
500 if !self.is_match() {
501 return;
502 }
503 // As an optimization for a very common case, when this is a match
504 // state for an NFA with only one pattern, we don't actually write the
505 // pattern ID to the state representation. Instead, we know it must
506 // be there since it is the only possible choice.
507 if !self.has_pattern_ids() {
508 f(PatternID::ZERO);
509 return;
510 }
511 let mut pids = &self.0[13..self.pattern_offset_end()];
512 while !pids.is_empty() {
513 let pid = wire::read_u32(pids);
514 pids = &pids[PatternID::SIZE..];
515 // This is OK since we only ever serialize valid PatternIDs to
516 // states. And since pattern IDs can never exceed a usize, the
517 // unwrap is OK.
518 f(PatternID::new_unchecked(usize::try_from(pid).unwrap()));
519 }
520 }
521
522 /// Calls the given function on every NFA state ID in this state.
523 fn iter_nfa_state_ids<F: FnMut(StateID)>(&self, mut f: F) {
524 let mut sids = &self.0[self.pattern_offset_end()..];
525 let mut prev = 0i32;
526 while !sids.is_empty() {
527 let (delta, nr) = read_vari32(sids);
528 sids = &sids[nr..];
529 let sid = prev + delta;
530 prev = sid;
531 // This is OK since we only ever serialize valid StateIDs to
532 // states. And since state IDs can never exceed an isize, they must
533 // always be able to fit into a usize, and thus cast is OK.
534 f(StateID::new_unchecked(sid.as_usize()))
535 }
536 }
537
538 /// Returns the offset into this state's representation where the pattern
539 /// IDs end and the NFA state IDs begin.
540 fn pattern_offset_end(&self) -> usize {
541 let encoded = self.encoded_pattern_len();
542 if encoded == 0 {
543 return 9;
544 }
545 // This arithmetic is OK since we were able to address this many bytes
546 // when writing to the state, thus, it must fit into a usize.
547 encoded.checked_mul(4).unwrap().checked_add(13).unwrap()
548 }
549
550 /// Returns the total number of *encoded* pattern IDs in this state.
551 ///
552 /// This may return 0 even when this is a match state, since the pattern
553 /// ID `PatternID::ZERO` is not encoded when it's the only pattern ID in
554 /// the match state (the overwhelming common case).
555 fn encoded_pattern_len(&self) -> usize {
556 if !self.has_pattern_ids() {
557 return 0;
558 }
559 // This unwrap is OK since the total number of patterns is always
560 // guaranteed to fit into a usize.
561 usize::try_from(wire::read_u32(&self.0[9..13])).unwrap()
562 }
563}
564
565impl<'a> core::fmt::Debug for Repr<'a> {
566 fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
567 let mut nfa_ids = alloc::vec![];
568 self.iter_nfa_state_ids(|sid| nfa_ids.push(sid));
569 f.debug_struct("Repr")
570 .field("is_match", &self.is_match())
571 .field("is_from_word", &self.is_from_word())
572 .field("is_half_crlf", &self.is_half_crlf())
573 .field("look_have", &self.look_have())
574 .field("look_need", &self.look_need())
575 .field("match_pattern_ids", &self.match_pattern_ids())
576 .field("nfa_state_ids", &nfa_ids)
577 .finish()
578 }
579}
580
581/// ReprVec is a write-only view into the representation of a DFA state.
582///
583/// See Repr for more details on the purpose of this type and also the format.
584///
585/// Note that not all possible combinations of methods may be called. This is
586/// precisely what the various StateBuilder types encapsulate: they only
587/// permit valid combinations via Rust's linear typing.
588struct ReprVec<'a>(&'a mut Vec<u8>);
589
590impl<'a> ReprVec<'a> {
591 /// Set this state as a match state.
592 ///
593 /// This should not be exposed explicitly outside of this module. It is
594 /// set automatically when a pattern ID is added.
595 fn set_is_match(&mut self) {
596 self.0[0] |= 1 << 0;
597 }
598
599 /// Set that this state has pattern IDs explicitly written to it.
600 ///
601 /// This should not be exposed explicitly outside of this module. This is
602 /// used internally as a space saving optimization. Namely, if the state
603 /// is a match state but does not have any pattern IDs written to it,
604 /// then it is automatically inferred to have a pattern ID of ZERO.
605 fn set_has_pattern_ids(&mut self) {
606 self.0[0] |= 1 << 1;
607 }
608
609 /// Set this state as being built from a transition over a word byte.
610 ///
611 /// Setting this is only necessary when one needs to deal with word
612 /// boundary assertions. Therefore, if the underlying NFA has no word
613 /// boundary assertions, callers should not set this.
614 fn set_is_from_word(&mut self) {
615 self.0[0] |= 1 << 2;
616 }
617
618 /// Set this state as having seen half of a CRLF terminator.
619 ///
620 /// In the forward direction, this should be set when a `\r` has been seen.
621 /// In the reverse direction, this should be set when a `\n` has been seen.
622 fn set_is_half_crlf(&mut self) {
623 self.0[0] |= 1 << 3;
624 }
625
626 /// The set of look-behind assertions that were true in the transition that
627 /// created this state.
628 fn look_have(&self) -> LookSet {
629 self.repr().look_have()
630 }
631
632 /// The set of look-around (both behind and ahead) assertions that appear
633 /// at least once in this state's set of NFA states.
634 fn look_need(&self) -> LookSet {
635 self.repr().look_need()
636 }
637
638 /// Mutate the set of look-behind assertions that were true in the
639 /// transition that created this state.
640 fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
641 set(self.look_have()).write_repr(&mut self.0[1..]);
642 }
643
644 /// Mutate the set of look-around (both behind and ahead) assertions that
645 /// appear at least once in this state's set of NFA states.
646 fn set_look_need(&mut self, mut set: impl FnMut(LookSet) -> LookSet) {
647 set(self.look_need()).write_repr(&mut self.0[5..]);
648 }
649
650 /// Add a pattern ID to this state. All match states must have at least
651 /// one pattern ID associated with it.
652 ///
653 /// Callers must never add duplicative pattern IDs.
654 ///
655 /// The order in which patterns are added must correspond to the order
656 /// in which patterns are reported as matches.
657 fn add_match_pattern_id(&mut self, pid: PatternID) {
658 // As a (somewhat small) space saving optimization, in the case where
659 // a matching state has exactly one pattern ID, PatternID::ZERO, we do
660 // not write either the pattern ID or the number of patterns encoded.
661 // Instead, all we do is set the 'is_match' bit on this state. Overall,
662 // this saves 8 bytes per match state for the overwhelming majority of
663 // match states.
664 //
665 // In order to know whether pattern IDs need to be explicitly read or
666 // not, we use another internal-only bit, 'has_pattern_ids', to
667 // indicate whether they have been explicitly written or not.
668 if !self.repr().has_pattern_ids() {
669 if pid == PatternID::ZERO {
670 self.set_is_match();
671 return;
672 }
673 // Make room for 'close_match_pattern_ids' to write the total
674 // number of pattern IDs written.
675 self.0.extend(core::iter::repeat(0).take(PatternID::SIZE));
676 self.set_has_pattern_ids();
677 // If this was already a match state, then the only way that's
678 // possible when the state doesn't have pattern IDs is if
679 // PatternID::ZERO was added by the caller previously. In this
680 // case, we are now adding a non-ZERO pattern ID after it, in
681 // which case, we want to make sure to represent ZERO explicitly
682 // now.
683 if self.repr().is_match() {
684 write_u32(self.0, 0)
685 } else {
686 // Otherwise, just make sure the 'is_match' bit is set.
687 self.set_is_match();
688 }
689 }
690 write_u32(self.0, pid.as_u32());
691 }
692
693 /// Indicate that no more pattern IDs will be added to this state.
694 ///
695 /// Once this is called, callers must not call it or 'add_match_pattern_id'
696 /// again.
697 ///
698 /// This should not be exposed explicitly outside of this module. It
699 /// should be called only when converting a StateBuilderMatches into a
700 /// StateBuilderNFA.
701 fn close_match_pattern_ids(&mut self) {
702 // If we never wrote any pattern IDs, then there's nothing to do here.
703 if !self.repr().has_pattern_ids() {
704 return;
705 }
706 let patsize = PatternID::SIZE;
707 let pattern_bytes = self.0.len() - 13;
708 // Every pattern ID uses 4 bytes, so number of bytes should be
709 // divisible by 4.
710 assert_eq!(pattern_bytes % patsize, 0);
711 // This unwrap is OK since we are guaranteed that the maximum number
712 // of possible patterns fits into a u32.
713 let count32 = u32::try_from(pattern_bytes / patsize).unwrap();
714 wire::NE::write_u32(count32, &mut self.0[9..13]);
715 }
716
717 /// Add an NFA state ID to this state. The order in which NFA states are
718 /// added matters. It is the caller's responsibility to ensure that
719 /// duplicate NFA state IDs are not added.
720 fn add_nfa_state_id(&mut self, prev: &mut StateID, sid: StateID) {
721 let delta = sid.as_i32() - prev.as_i32();
722 write_vari32(self.0, delta);
723 *prev = sid;
724 }
725
726 /// Return a read-only view of this state's representation.
727 fn repr(&self) -> Repr<'_> {
728 Repr(self.0.as_slice())
729 }
730}
731
732/// Write a signed 32-bit integer using zig-zag encoding.
733///
734/// https://developers.google.com/protocol-buffers/docs/encoding#varints
735fn write_vari32(data: &mut Vec<u8>, n: i32) {
736 let mut un = n.to_bits() << 1;
737 if n < 0 {
738 un = !un;
739 }
740 write_varu32(data, un)
741}
742
743/// Read a signed 32-bit integer using zig-zag encoding. Also, return the
744/// number of bytes read.
745///
746/// https://developers.google.com/protocol-buffers/docs/encoding#varints
747fn read_vari32(data: &[u8]) -> (i32, usize) {
748 let (un, i) = read_varu32(data);
749 let mut n = i32::from_bits(un >> 1);
750 if un & 1 != 0 {
751 n = !n;
752 }
753 (n, i)
754}
755
756/// Write an unsigned 32-bit integer as a varint. In essence, `n` is written
757/// as a sequence of bytes where all bytes except for the last one have the
758/// most significant bit set. The least significant 7 bits correspond to the
759/// actual bits of `n`. So in the worst case, a varint uses 5 bytes, but in
760/// very common cases, it uses fewer than 4.
761///
762/// https://developers.google.com/protocol-buffers/docs/encoding#varints
763fn write_varu32(data: &mut Vec<u8>, mut n: u32) {
764 while n >= 0b1000_0000 {
765 data.push(n.low_u8() | 0b1000_0000);
766 n >>= 7;
767 }
768 data.push(n.low_u8());
769}
770
771/// Read an unsigned 32-bit varint. Also, return the number of bytes read.
772///
773/// https://developers.google.com/protocol-buffers/docs/encoding#varints
774fn read_varu32(data: &[u8]) -> (u32, usize) {
775 // N.B. We can assume correctness here since we know that all varuints are
776 // written with write_varu32. Hence, the 'as' uses and unchecked arithmetic
777 // is all okay.
778 let mut n: u32 = 0;
779 let mut shift: u32 = 0;
780 for (i, &b) in data.iter().enumerate() {
781 if b < 0b1000_0000 {
782 return (n | (u32::from(b) << shift), i + 1);
783 }
784 n |= (u32::from(b) & 0b0111_1111) << shift;
785 shift += 7;
786 }
787 (0, 0)
788}
789
790/// Push a native-endian encoded `n` on to `dst`.
791fn write_u32(dst: &mut Vec<u8>, n: u32) {
792 use crate::util::wire::NE;
793
794 let start = dst.len();
795 dst.extend(core::iter::repeat(0).take(mem::size_of::<u32>()));
796 NE::write_u32(n, &mut dst[start..]);
797}
798
799#[cfg(test)]
800mod tests {
801 use alloc::vec;
802
803 use quickcheck::quickcheck;
804
805 use super::*;
806
807 #[cfg(not(miri))]
808 quickcheck! {
809 fn prop_state_read_write_nfa_state_ids(sids: Vec<StateID>) -> bool {
810 // Builders states do not permit duplicate IDs.
811 let sids = dedup_state_ids(sids);
812
813 let mut b = StateBuilderEmpty::new().into_matches().into_nfa();
814 for &sid in &sids {
815 b.add_nfa_state_id(sid);
816 }
817 let s = b.to_state();
818 let mut got = vec![];
819 s.iter_nfa_state_ids(|sid| got.push(sid));
820 got == sids
821 }
822
823 fn prop_state_read_write_pattern_ids(pids: Vec<PatternID>) -> bool {
824 // Builders states do not permit duplicate IDs.
825 let pids = dedup_pattern_ids(pids);
826
827 let mut b = StateBuilderEmpty::new().into_matches();
828 for &pid in &pids {
829 b.add_match_pattern_id(pid);
830 }
831 let s = b.into_nfa().to_state();
832 let mut got = vec![];
833 s.iter_match_pattern_ids(|pid| got.push(pid));
834 got == pids
835 }
836
837 fn prop_state_read_write_nfa_state_and_pattern_ids(
838 sids: Vec<StateID>,
839 pids: Vec<PatternID>
840 ) -> bool {
841 // Builders states do not permit duplicate IDs.
842 let sids = dedup_state_ids(sids);
843 let pids = dedup_pattern_ids(pids);
844
845 let mut b = StateBuilderEmpty::new().into_matches();
846 for &pid in &pids {
847 b.add_match_pattern_id(pid);
848 }
849
850 let mut b = b.into_nfa();
851 for &sid in &sids {
852 b.add_nfa_state_id(sid);
853 }
854
855 let s = b.to_state();
856 let mut got_pids = vec![];
857 s.iter_match_pattern_ids(|pid| got_pids.push(pid));
858 let mut got_sids = vec![];
859 s.iter_nfa_state_ids(|sid| got_sids.push(sid));
860 got_pids == pids && got_sids == sids
861 }
862 }
863
864 quickcheck! {
865 fn prop_read_write_varu32(n: u32) -> bool {
866 let mut buf = vec![];
867 write_varu32(&mut buf, n);
868 let (got, nread) = read_varu32(&buf);
869 nread == buf.len() && got == n
870 }
871
872 fn prop_read_write_vari32(n: i32) -> bool {
873 let mut buf = vec![];
874 write_vari32(&mut buf, n);
875 let (got, nread) = read_vari32(&buf);
876 nread == buf.len() && got == n
877 }
878 }
879
880 #[cfg(not(miri))]
881 fn dedup_state_ids(sids: Vec<StateID>) -> Vec<StateID> {
882 let mut set = alloc::collections::BTreeSet::new();
883 let mut deduped = vec![];
884 for sid in sids {
885 if set.contains(&sid) {
886 continue;
887 }
888 set.insert(sid);
889 deduped.push(sid);
890 }
891 deduped
892 }
893
894 #[cfg(not(miri))]
895 fn dedup_pattern_ids(pids: Vec<PatternID>) -> Vec<PatternID> {
896 let mut set = alloc::collections::BTreeSet::new();
897 let mut deduped = vec![];
898 for pid in pids {
899 if set.contains(&pid) {
900 continue;
901 }
902 set.insert(pid);
903 deduped.push(pid);
904 }
905 deduped
906 }
907}
908