1 | /*! |
2 | This module defines a DFA state representation and builders for constructing |
3 | DFA states. |
4 | |
5 | This representation is specifically for use in implementations of NFA-to-DFA |
6 | conversion via powerset construction. (Also called "determinization" in this |
7 | crate.) |
8 | |
9 | The term "DFA state" is somewhat overloaded in this crate. In some cases, it |
10 | refers to the set of transitions over an alphabet for a particular state. In |
11 | other cases, it refers to a set of NFA states. The former is really about the |
12 | final representation of a state in a DFA's transition table, where as the |
13 | latter---what this module is focused on---is closer to an intermediate form |
14 | that is used to help eventually build the transition table. |
15 | |
16 | This module exports four types. All four types represent the same idea: an |
17 | ordered set of NFA states. This ordered set represents the epsilon closure of a |
18 | particular NFA state, where the "epsilon closure" is the set of NFA states that |
19 | can be transitioned to without consuming any input. i.e., Follow all of the NFA |
20 | state's epsilon transitions. In addition, this implementation of DFA states |
21 | cares about two other things: the ordered set of pattern IDs corresponding |
22 | to the patterns that match if the state is a match state, and the set of |
23 | look-behind assertions that were true when the state was created. |
24 | |
25 | The first, `State`, is a frozen representation of a state that cannot be |
26 | modified. It may be cheaply cloned without copying the state itself and can be |
27 | accessed safely from multiple threads simultaneously. This type is useful for |
28 | when one knows that the DFA state being constructed is distinct from any other |
29 | previously constructed states. Namely, powerset construction, in practice, |
30 | requires one to keep a cache of previously created DFA states. Otherwise, |
31 | the number of DFA states created in memory balloons to an impractically |
32 | large number. For this reason, equivalent states should endeavor to have an |
33 | equivalent byte-level representation. (In general, "equivalency" here means, |
34 | "equivalent assertions, pattern IDs and NFA state IDs." We do not require that |
35 | full DFA minimization be implemented here. This form of equivalency is only |
36 | surface deep and is more-or-less a practical necessity.) |
37 | |
38 | The other three types represent different phases in the construction of a |
39 | DFA state. Internally, these three types (and `State`) all use the same |
40 | byte-oriented representation. That means one can use any of the builder types |
41 | to check whether the state it represents already exists or not. If it does, |
42 | then there is no need to freeze it into a `State` (which requires an alloc and |
43 | a copy). Here are the three types described succinctly: |
44 | |
45 | * `StateBuilderEmpty` represents a state with no pattern IDs, no assertions |
46 | and no NFA states. Creating a `StateBuilderEmpty` performs no allocs. A |
47 | `StateBuilderEmpty` can only be used to query its underlying memory capacity, |
48 | or to convert into a builder for recording pattern IDs and/or assertions. |
49 | |
50 | * `StateBuilderMatches` represents a state with zero or more pattern IDs, zero |
51 | or more satisfied assertions and zero NFA state IDs. A `StateBuilderMatches` |
52 | can only be used for adding pattern IDs and recording assertions. |
53 | |
54 | * `StateBuilderNFA` represents a state with zero or more pattern IDs, zero or |
55 | more satisfied assertions and zero or more NFA state IDs. A `StateBuilderNFA` |
56 | can only be used for adding NFA state IDs and recording some assertions. |
57 | |
58 | The expected flow here is to use the above builders to construct a candidate |
59 | DFA state to check if it already exists. If it does, then there's no need to |
60 | freeze it into a `State`. It it doesn't exist, then `StateBuilderNFA::to_state` |
61 | can be called to freeze the builder into an immutable `State`. In either |
62 | case, `clear` should be called on the builder to turn it back into a |
63 | `StateBuilderEmpty` that reuses the underlying memory. |
64 | |
65 | The main purpose for splitting the builder into these distinct types is to |
66 | make it impossible to do things like adding a pattern ID after adding an NFA |
67 | state ID. Namely, this makes it simpler to use a space-and-time efficient |
68 | binary representation for the state. (The format is documented on the `Repr` |
69 | type below.) If we just used one type for everything, it would be possible for |
70 | callers to use an incorrect interleaving of calls and thus result in a corrupt |
71 | representation. I chose to use more type machinery to make this impossible to |
72 | do because 1) determinization is itself pretty complex and it wouldn't be too |
73 | hard to foul this up and 2) there isn't too much machinery involved and it's |
74 | well contained. |
75 | |
76 | As an optimization, sometimes states won't have certain things set. For |
77 | example, if the underlying NFA has no word boundary assertions, then there is |
78 | no reason to set a state's look-behind assertion as to whether it was generated |
79 | from a word byte or not. Similarly, if a state has no NFA states corresponding |
80 | to look-around assertions, then there is no reason to set `look_have` to a |
81 | non-empty set. Finally, callers usually omit unconditional epsilon transitions |
82 | when adding NFA state IDs since they aren't discriminatory. |
83 | |
84 | Finally, the binary representation used by these states is, thankfully, not |
85 | serialized anywhere. So any kind of change can be made with reckless abandon, |
86 | as long as everything in this module agrees. |
87 | */ |
88 | |
89 | use core::mem; |
90 | |
91 | use alloc::{sync::Arc, vec::Vec}; |
92 | |
93 | use crate::util::{ |
94 | int::{I32, U32}, |
95 | look::LookSet, |
96 | primitives::{PatternID, StateID}, |
97 | wire::{self, Endian}, |
98 | }; |
99 | |
100 | /// A DFA state that, at its core, is represented by an ordered set of NFA |
101 | /// states. |
102 | /// |
103 | /// This type is intended to be used only in NFA-to-DFA conversion via powerset |
104 | /// construction. |
105 | /// |
106 | /// It may be cheaply cloned and accessed safely from multiple threads |
107 | /// simultaneously. |
108 | #[derive (Clone, Eq, Hash, PartialEq, PartialOrd, Ord)] |
109 | pub(crate) struct State(Arc<[u8]>); |
110 | |
111 | /// This Borrow impl permits us to lookup any state in a map by its byte |
112 | /// representation. This is particularly convenient when one has a StateBuilder |
113 | /// and we want to see if a correspondingly equivalent state already exists. If |
114 | /// one does exist, then we can reuse the allocation required by StateBuilder |
115 | /// without having to convert it into a State first. |
116 | impl core::borrow::Borrow<[u8]> for State { |
117 | fn borrow(&self) -> &[u8] { |
118 | &*self.0 |
119 | } |
120 | } |
121 | |
122 | impl core::fmt::Debug for State { |
123 | fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { |
124 | f.debug_tuple(name:"State" ).field(&self.repr()).finish() |
125 | } |
126 | } |
127 | |
128 | /// For docs on these routines, see the internal Repr and ReprVec types below. |
129 | impl State { |
130 | pub(crate) fn dead() -> State { |
131 | StateBuilderEmpty::new().into_matches().into_nfa().to_state() |
132 | } |
133 | |
134 | pub(crate) fn is_match(&self) -> bool { |
135 | self.repr().is_match() |
136 | } |
137 | |
138 | pub(crate) fn is_from_word(&self) -> bool { |
139 | self.repr().is_from_word() |
140 | } |
141 | |
142 | pub(crate) fn is_half_crlf(&self) -> bool { |
143 | self.repr().is_half_crlf() |
144 | } |
145 | |
146 | pub(crate) fn look_have(&self) -> LookSet { |
147 | self.repr().look_have() |
148 | } |
149 | |
150 | pub(crate) fn look_need(&self) -> LookSet { |
151 | self.repr().look_need() |
152 | } |
153 | |
154 | pub(crate) fn match_len(&self) -> usize { |
155 | self.repr().match_len() |
156 | } |
157 | |
158 | pub(crate) fn match_pattern(&self, index: usize) -> PatternID { |
159 | self.repr().match_pattern(index) |
160 | } |
161 | |
162 | pub(crate) fn match_pattern_ids(&self) -> Option<Vec<PatternID>> { |
163 | self.repr().match_pattern_ids() |
164 | } |
165 | |
166 | #[cfg (all(test, not(miri)))] |
167 | pub(crate) fn iter_match_pattern_ids<F: FnMut(PatternID)>(&self, f: F) { |
168 | self.repr().iter_match_pattern_ids(f) |
169 | } |
170 | |
171 | pub(crate) fn iter_nfa_state_ids<F: FnMut(StateID)>(&self, f: F) { |
172 | self.repr().iter_nfa_state_ids(f) |
173 | } |
174 | |
175 | pub(crate) fn memory_usage(&self) -> usize { |
176 | self.0.len() |
177 | } |
178 | |
179 | fn repr(&self) -> Repr<'_> { |
180 | Repr(&*self.0) |
181 | } |
182 | } |
183 | |
184 | /// A state builder that represents an empty state. |
185 | /// |
186 | /// This is a useful "initial condition" for state construction. It has no |
187 | /// NFA state IDs, no assertions set and no pattern IDs. No allocations are |
188 | /// made when new() is called. Its main use is for being converted into a |
189 | /// builder that can capture assertions and pattern IDs. |
190 | #[derive (Clone, Debug)] |
191 | pub(crate) struct StateBuilderEmpty(Vec<u8>); |
192 | |
193 | /// For docs on these routines, see the internal Repr and ReprVec types below. |
194 | impl StateBuilderEmpty { |
195 | pub(crate) fn new() -> StateBuilderEmpty { |
196 | StateBuilderEmpty(alloc::vec![]) |
197 | } |
198 | |
199 | pub(crate) fn into_matches(mut self) -> StateBuilderMatches { |
200 | self.0.extend_from_slice(&[0, 0, 0, 0, 0, 0, 0, 0, 0]); |
201 | StateBuilderMatches(self.0) |
202 | } |
203 | |
204 | fn clear(&mut self) { |
205 | self.0.clear(); |
206 | } |
207 | |
208 | pub(crate) fn capacity(&self) -> usize { |
209 | self.0.capacity() |
210 | } |
211 | } |
212 | |
213 | /// A state builder that collects assertions and pattern IDs. |
214 | /// |
215 | /// When collecting pattern IDs is finished, this can be converted into a |
216 | /// builder that collects NFA state IDs. |
217 | #[derive (Clone)] |
218 | pub(crate) struct StateBuilderMatches(Vec<u8>); |
219 | |
220 | impl core::fmt::Debug for StateBuilderMatches { |
221 | fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { |
222 | f.debug_tuple(name:"StateBuilderMatches" ).field(&self.repr()).finish() |
223 | } |
224 | } |
225 | |
226 | /// For docs on these routines, see the internal Repr and ReprVec types below. |
227 | impl StateBuilderMatches { |
228 | pub(crate) fn into_nfa(mut self) -> StateBuilderNFA { |
229 | self.repr_vec().close_match_pattern_ids(); |
230 | StateBuilderNFA { repr: self.0, prev_nfa_state_id: StateID::ZERO } |
231 | } |
232 | |
233 | pub(crate) fn set_is_from_word(&mut self) { |
234 | self.repr_vec().set_is_from_word() |
235 | } |
236 | |
237 | pub(crate) fn set_is_half_crlf(&mut self) { |
238 | self.repr_vec().set_is_half_crlf() |
239 | } |
240 | |
241 | pub(crate) fn look_have(&self) -> LookSet { |
242 | LookSet::read_repr(&self.0[1..]) |
243 | } |
244 | |
245 | pub(crate) fn set_look_have( |
246 | &mut self, |
247 | set: impl FnMut(LookSet) -> LookSet, |
248 | ) { |
249 | self.repr_vec().set_look_have(set) |
250 | } |
251 | |
252 | pub(crate) fn add_match_pattern_id(&mut self, pid: PatternID) { |
253 | self.repr_vec().add_match_pattern_id(pid) |
254 | } |
255 | |
256 | fn repr(&self) -> Repr<'_> { |
257 | Repr(&self.0) |
258 | } |
259 | |
260 | fn repr_vec(&mut self) -> ReprVec<'_> { |
261 | ReprVec(&mut self.0) |
262 | } |
263 | } |
264 | |
265 | /// A state builder that collects some assertions and NFA state IDs. |
266 | /// |
267 | /// When collecting NFA state IDs is finished, this can be used to build a |
268 | /// `State` if necessary. |
269 | /// |
270 | /// When dont with building a state (regardless of whether it got kept or not), |
271 | /// it's usually a good idea to call `clear` to get an empty builder back so |
272 | /// that it can be reused to build the next state. |
273 | #[derive (Clone)] |
274 | pub(crate) struct StateBuilderNFA { |
275 | repr: Vec<u8>, |
276 | prev_nfa_state_id: StateID, |
277 | } |
278 | |
279 | impl core::fmt::Debug for StateBuilderNFA { |
280 | fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { |
281 | f.debug_tuple(name:"StateBuilderNFA" ).field(&self.repr()).finish() |
282 | } |
283 | } |
284 | |
285 | /// For docs on these routines, see the internal Repr and ReprVec types below. |
286 | impl StateBuilderNFA { |
287 | pub(crate) fn to_state(&self) -> State { |
288 | State(Arc::from(&*self.repr)) |
289 | } |
290 | |
291 | pub(crate) fn clear(self) -> StateBuilderEmpty { |
292 | let mut builder = StateBuilderEmpty(self.repr); |
293 | builder.clear(); |
294 | builder |
295 | } |
296 | |
297 | pub(crate) fn look_need(&self) -> LookSet { |
298 | self.repr().look_need() |
299 | } |
300 | |
301 | pub(crate) fn set_look_have( |
302 | &mut self, |
303 | set: impl FnMut(LookSet) -> LookSet, |
304 | ) { |
305 | self.repr_vec().set_look_have(set) |
306 | } |
307 | |
308 | pub(crate) fn set_look_need( |
309 | &mut self, |
310 | set: impl FnMut(LookSet) -> LookSet, |
311 | ) { |
312 | self.repr_vec().set_look_need(set) |
313 | } |
314 | |
315 | pub(crate) fn add_nfa_state_id(&mut self, sid: StateID) { |
316 | ReprVec(&mut self.repr) |
317 | .add_nfa_state_id(&mut self.prev_nfa_state_id, sid) |
318 | } |
319 | |
320 | pub(crate) fn as_bytes(&self) -> &[u8] { |
321 | &self.repr |
322 | } |
323 | |
324 | fn repr(&self) -> Repr<'_> { |
325 | Repr(&self.repr) |
326 | } |
327 | |
328 | fn repr_vec(&mut self) -> ReprVec<'_> { |
329 | ReprVec(&mut self.repr) |
330 | } |
331 | } |
332 | |
333 | /// Repr is a read-only view into the representation of a DFA state. |
334 | /// |
335 | /// Primarily, a Repr is how we achieve DRY: we implement decoding the format |
336 | /// in one place, and then use a Repr to implement the various methods on the |
337 | /// public state types. |
338 | /// |
339 | /// The format is as follows: |
340 | /// |
341 | /// The first three bytes correspond to bitsets. |
342 | /// |
343 | /// Byte 0 is a bitset corresponding to miscellaneous flags associated with the |
344 | /// state. Bit 0 is set to 1 if the state is a match state. Bit 1 is set to 1 |
345 | /// if the state has pattern IDs explicitly written to it. (This is a flag that |
346 | /// is not meant to be set by determinization, but rather, is used as part of |
347 | /// an internal space-saving optimization.) Bit 2 is set to 1 if the state was |
348 | /// generated by a transition over a "word" byte. (Callers may not always set |
349 | /// this. For example, if the NFA has no word boundary assertion, then needing |
350 | /// to track whether a state came from a word byte or not is superfluous and |
351 | /// wasteful.) Bit 3 is set to 1 if the state was generated by a transition |
352 | /// from a `\r` (forward search) or a `\n` (reverse search) when CRLF mode is |
353 | /// enabled. |
354 | /// |
355 | /// Bytes 1..5 correspond to the look-behind assertions that were satisfied |
356 | /// by the transition that created this state. (Look-ahead assertions are not |
357 | /// tracked as part of states. Instead, these are applied by re-computing the |
358 | /// epsilon closure of a state when computing the transition function. See |
359 | /// `next` in the parent module.) |
360 | /// |
361 | /// Bytes 5..9 correspond to the set of look-around assertions (including both |
362 | /// look-behind and look-ahead) that appear somewhere in this state's set of |
363 | /// NFA state IDs. This is used to determine whether this state's epsilon |
364 | /// closure should be re-computed when computing the transition function. |
365 | /// Namely, look-around assertions are "just" conditional epsilon transitions, |
366 | /// so if there are new assertions available when computing the transition |
367 | /// function, we should only re-compute the epsilon closure if those new |
368 | /// assertions are relevant to this particular state. |
369 | /// |
370 | /// Bytes 9..13 correspond to a 32-bit native-endian encoded integer |
371 | /// corresponding to the number of patterns encoded in this state. If the state |
372 | /// is not a match state (byte 0 bit 0 is 0) or if it's only pattern ID is |
373 | /// PatternID::ZERO, then no integer is encoded at this position. Instead, byte |
374 | /// offset 3 is the position at which the first NFA state ID is encoded. |
375 | /// |
376 | /// For a match state with at least one non-ZERO pattern ID, the next bytes |
377 | /// correspond to a sequence of 32-bit native endian encoded integers that |
378 | /// represent each pattern ID, in order, that this match state represents. |
379 | /// |
380 | /// After the pattern IDs (if any), NFA state IDs are delta encoded as |
381 | /// varints.[1] The first NFA state ID is encoded as itself, and each |
382 | /// subsequent NFA state ID is encoded as the difference between itself and the |
383 | /// previous NFA state ID. |
384 | /// |
385 | /// [1] - https://developers.google.com/protocol-buffers/docs/encoding#varints |
386 | struct Repr<'a>(&'a [u8]); |
387 | |
388 | impl<'a> Repr<'a> { |
389 | /// Returns true if and only if this is a match state. |
390 | /// |
391 | /// If callers have added pattern IDs to this state, then callers MUST set |
392 | /// this state as a match state explicitly. However, as a special case, |
393 | /// states that are marked as match states but with no pattern IDs, then |
394 | /// the state is treated as if it had a single pattern ID equivalent to |
395 | /// PatternID::ZERO. |
396 | fn is_match(&self) -> bool { |
397 | self.0[0] & (1 << 0) > 0 |
398 | } |
399 | |
400 | /// Returns true if and only if this state has had at least one pattern |
401 | /// ID added to it. |
402 | /// |
403 | /// This is an internal-only flag that permits the representation to save |
404 | /// space in the common case of an NFA with one pattern in it. In that |
405 | /// case, a match state can only ever have exactly one pattern ID: |
406 | /// PatternID::ZERO. So there's no need to represent it. |
407 | fn has_pattern_ids(&self) -> bool { |
408 | self.0[0] & (1 << 1) > 0 |
409 | } |
410 | |
411 | /// Returns true if and only if this state is marked as having been created |
412 | /// from a transition over a word byte. This is useful for checking whether |
413 | /// a word boundary assertion is true or not, which requires look-behind |
414 | /// (whether the current state came from a word byte or not) and look-ahead |
415 | /// (whether the transition byte is a word byte or not). |
416 | /// |
417 | /// Since states with this set are distinct from states that don't have |
418 | /// this set (even if they are otherwise equivalent), callers should not |
419 | /// set this assertion unless the underlying NFA has at least one word |
420 | /// boundary assertion somewhere. Otherwise, a superfluous number of states |
421 | /// may be created. |
422 | fn is_from_word(&self) -> bool { |
423 | self.0[0] & (1 << 2) > 0 |
424 | } |
425 | |
426 | /// Returns true if and only if this state is marked as being inside of a |
427 | /// CRLF terminator. In the forward direction, this means the state was |
428 | /// created after seeing a `\r`. In the reverse direction, this means the |
429 | /// state was created after seeing a `\n`. |
430 | fn is_half_crlf(&self) -> bool { |
431 | self.0[0] & (1 << 3) > 0 |
432 | } |
433 | |
434 | /// The set of look-behind assertions that were true in the transition that |
435 | /// created this state. |
436 | /// |
437 | /// Generally, this should be empty if 'look_need' is empty, since there is |
438 | /// no reason to track which look-behind assertions are true if the state |
439 | /// has no conditional epsilon transitions. |
440 | /// |
441 | /// Satisfied look-ahead assertions are not tracked in states. Instead, |
442 | /// these are re-computed on demand via epsilon closure when computing the |
443 | /// transition function. |
444 | fn look_have(&self) -> LookSet { |
445 | LookSet::read_repr(&self.0[1..]) |
446 | } |
447 | |
448 | /// The set of look-around (both behind and ahead) assertions that appear |
449 | /// at least once in this state's set of NFA states. |
450 | /// |
451 | /// This is used to determine whether the epsilon closure needs to be |
452 | /// re-computed when computing the transition function. Namely, if the |
453 | /// state has no conditional epsilon transitions, then there is no need |
454 | /// to re-compute the epsilon closure. |
455 | fn look_need(&self) -> LookSet { |
456 | LookSet::read_repr(&self.0[5..]) |
457 | } |
458 | |
459 | /// Returns the total number of match pattern IDs in this state. |
460 | /// |
461 | /// If this state is not a match state, then this always returns 0. |
462 | fn match_len(&self) -> usize { |
463 | if !self.is_match() { |
464 | return 0; |
465 | } else if !self.has_pattern_ids() { |
466 | 1 |
467 | } else { |
468 | self.encoded_pattern_len() |
469 | } |
470 | } |
471 | |
472 | /// Returns the pattern ID for this match state at the given index. |
473 | /// |
474 | /// If the given index is greater than or equal to `match_len()` for this |
475 | /// state, then this could panic or return incorrect results. |
476 | fn match_pattern(&self, index: usize) -> PatternID { |
477 | if !self.has_pattern_ids() { |
478 | PatternID::ZERO |
479 | } else { |
480 | let offset = 13 + index * PatternID::SIZE; |
481 | // This is OK since we only ever serialize valid PatternIDs to |
482 | // states. |
483 | wire::read_pattern_id_unchecked(&self.0[offset..]).0 |
484 | } |
485 | } |
486 | |
487 | /// Returns a copy of all match pattern IDs in this state. If this state |
488 | /// is not a match state, then this returns None. |
489 | fn match_pattern_ids(&self) -> Option<Vec<PatternID>> { |
490 | if !self.is_match() { |
491 | return None; |
492 | } |
493 | let mut pids = alloc::vec![]; |
494 | self.iter_match_pattern_ids(|pid| pids.push(pid)); |
495 | Some(pids) |
496 | } |
497 | |
498 | /// Calls the given function on every pattern ID in this state. |
499 | fn iter_match_pattern_ids<F: FnMut(PatternID)>(&self, mut f: F) { |
500 | if !self.is_match() { |
501 | return; |
502 | } |
503 | // As an optimization for a very common case, when this is a match |
504 | // state for an NFA with only one pattern, we don't actually write the |
505 | // pattern ID to the state representation. Instead, we know it must |
506 | // be there since it is the only possible choice. |
507 | if !self.has_pattern_ids() { |
508 | f(PatternID::ZERO); |
509 | return; |
510 | } |
511 | let mut pids = &self.0[13..self.pattern_offset_end()]; |
512 | while !pids.is_empty() { |
513 | let pid = wire::read_u32(pids); |
514 | pids = &pids[PatternID::SIZE..]; |
515 | // This is OK since we only ever serialize valid PatternIDs to |
516 | // states. And since pattern IDs can never exceed a usize, the |
517 | // unwrap is OK. |
518 | f(PatternID::new_unchecked(usize::try_from(pid).unwrap())); |
519 | } |
520 | } |
521 | |
522 | /// Calls the given function on every NFA state ID in this state. |
523 | fn iter_nfa_state_ids<F: FnMut(StateID)>(&self, mut f: F) { |
524 | let mut sids = &self.0[self.pattern_offset_end()..]; |
525 | let mut prev = 0i32; |
526 | while !sids.is_empty() { |
527 | let (delta, nr) = read_vari32(sids); |
528 | sids = &sids[nr..]; |
529 | let sid = prev + delta; |
530 | prev = sid; |
531 | // This is OK since we only ever serialize valid StateIDs to |
532 | // states. And since state IDs can never exceed an isize, they must |
533 | // always be able to fit into a usize, and thus cast is OK. |
534 | f(StateID::new_unchecked(sid.as_usize())) |
535 | } |
536 | } |
537 | |
538 | /// Returns the offset into this state's representation where the pattern |
539 | /// IDs end and the NFA state IDs begin. |
540 | fn pattern_offset_end(&self) -> usize { |
541 | let encoded = self.encoded_pattern_len(); |
542 | if encoded == 0 { |
543 | return 9; |
544 | } |
545 | // This arithmetic is OK since we were able to address this many bytes |
546 | // when writing to the state, thus, it must fit into a usize. |
547 | encoded.checked_mul(4).unwrap().checked_add(13).unwrap() |
548 | } |
549 | |
550 | /// Returns the total number of *encoded* pattern IDs in this state. |
551 | /// |
552 | /// This may return 0 even when this is a match state, since the pattern |
553 | /// ID `PatternID::ZERO` is not encoded when it's the only pattern ID in |
554 | /// the match state (the overwhelming common case). |
555 | fn encoded_pattern_len(&self) -> usize { |
556 | if !self.has_pattern_ids() { |
557 | return 0; |
558 | } |
559 | // This unwrap is OK since the total number of patterns is always |
560 | // guaranteed to fit into a usize. |
561 | usize::try_from(wire::read_u32(&self.0[9..13])).unwrap() |
562 | } |
563 | } |
564 | |
565 | impl<'a> core::fmt::Debug for Repr<'a> { |
566 | fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { |
567 | let mut nfa_ids: Vec = alloc::vec![]; |
568 | self.iter_nfa_state_ids(|sid: StateID| nfa_ids.push(sid)); |
569 | f&mut DebugStruct<'_, '_>.debug_struct("Repr" ) |
570 | .field("is_match" , &self.is_match()) |
571 | .field("is_from_word" , &self.is_from_word()) |
572 | .field("is_half_crlf" , &self.is_half_crlf()) |
573 | .field("look_have" , &self.look_have()) |
574 | .field("look_need" , &self.look_need()) |
575 | .field("match_pattern_ids" , &self.match_pattern_ids()) |
576 | .field(name:"nfa_state_ids" , &nfa_ids) |
577 | .finish() |
578 | } |
579 | } |
580 | |
581 | /// ReprVec is a write-only view into the representation of a DFA state. |
582 | /// |
583 | /// See Repr for more details on the purpose of this type and also the format. |
584 | /// |
585 | /// Note that not all possible combinations of methods may be called. This is |
586 | /// precisely what the various StateBuilder types encapsulate: they only |
587 | /// permit valid combinations via Rust's linear typing. |
588 | struct ReprVec<'a>(&'a mut Vec<u8>); |
589 | |
590 | impl<'a> ReprVec<'a> { |
591 | /// Set this state as a match state. |
592 | /// |
593 | /// This should not be exposed explicitly outside of this module. It is |
594 | /// set automatically when a pattern ID is added. |
595 | fn set_is_match(&mut self) { |
596 | self.0[0] |= 1 << 0; |
597 | } |
598 | |
599 | /// Set that this state has pattern IDs explicitly written to it. |
600 | /// |
601 | /// This should not be exposed explicitly outside of this module. This is |
602 | /// used internally as a space saving optimization. Namely, if the state |
603 | /// is a match state but does not have any pattern IDs written to it, |
604 | /// then it is automatically inferred to have a pattern ID of ZERO. |
605 | fn set_has_pattern_ids(&mut self) { |
606 | self.0[0] |= 1 << 1; |
607 | } |
608 | |
609 | /// Set this state as being built from a transition over a word byte. |
610 | /// |
611 | /// Setting this is only necessary when one needs to deal with word |
612 | /// boundary assertions. Therefore, if the underlying NFA has no word |
613 | /// boundary assertions, callers should not set this. |
614 | fn set_is_from_word(&mut self) { |
615 | self.0[0] |= 1 << 2; |
616 | } |
617 | |
618 | /// Set this state as having seen half of a CRLF terminator. |
619 | /// |
620 | /// In the forward direction, this should be set when a `\r` has been seen. |
621 | /// In the reverse direction, this should be set when a `\n` has been seen. |
622 | fn set_is_half_crlf(&mut self) { |
623 | self.0[0] |= 1 << 3; |
624 | } |
625 | |
626 | /// The set of look-behind assertions that were true in the transition that |
627 | /// created this state. |
628 | fn look_have(&self) -> LookSet { |
629 | self.repr().look_have() |
630 | } |
631 | |
632 | /// The set of look-around (both behind and ahead) assertions that appear |
633 | /// at least once in this state's set of NFA states. |
634 | fn look_need(&self) -> LookSet { |
635 | self.repr().look_need() |
636 | } |
637 | |
638 | /// Mutate the set of look-behind assertions that were true in the |
639 | /// transition that created this state. |
640 | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { |
641 | set(self.look_have()).write_repr(&mut self.0[1..]); |
642 | } |
643 | |
644 | /// Mutate the set of look-around (both behind and ahead) assertions that |
645 | /// appear at least once in this state's set of NFA states. |
646 | fn set_look_need(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { |
647 | set(self.look_need()).write_repr(&mut self.0[5..]); |
648 | } |
649 | |
650 | /// Add a pattern ID to this state. All match states must have at least |
651 | /// one pattern ID associated with it. |
652 | /// |
653 | /// Callers must never add duplicative pattern IDs. |
654 | /// |
655 | /// The order in which patterns are added must correspond to the order |
656 | /// in which patterns are reported as matches. |
657 | fn add_match_pattern_id(&mut self, pid: PatternID) { |
658 | // As a (somewhat small) space saving optimization, in the case where |
659 | // a matching state has exactly one pattern ID, PatternID::ZERO, we do |
660 | // not write either the pattern ID or the number of patterns encoded. |
661 | // Instead, all we do is set the 'is_match' bit on this state. Overall, |
662 | // this saves 8 bytes per match state for the overwhelming majority of |
663 | // match states. |
664 | // |
665 | // In order to know whether pattern IDs need to be explicitly read or |
666 | // not, we use another internal-only bit, 'has_pattern_ids', to |
667 | // indicate whether they have been explicitly written or not. |
668 | if !self.repr().has_pattern_ids() { |
669 | if pid == PatternID::ZERO { |
670 | self.set_is_match(); |
671 | return; |
672 | } |
673 | // Make room for 'close_match_pattern_ids' to write the total |
674 | // number of pattern IDs written. |
675 | self.0.extend(core::iter::repeat(0).take(PatternID::SIZE)); |
676 | self.set_has_pattern_ids(); |
677 | // If this was already a match state, then the only way that's |
678 | // possible when the state doesn't have pattern IDs is if |
679 | // PatternID::ZERO was added by the caller previously. In this |
680 | // case, we are now adding a non-ZERO pattern ID after it, in |
681 | // which case, we want to make sure to represent ZERO explicitly |
682 | // now. |
683 | if self.repr().is_match() { |
684 | write_u32(self.0, 0) |
685 | } else { |
686 | // Otherwise, just make sure the 'is_match' bit is set. |
687 | self.set_is_match(); |
688 | } |
689 | } |
690 | write_u32(self.0, pid.as_u32()); |
691 | } |
692 | |
693 | /// Indicate that no more pattern IDs will be added to this state. |
694 | /// |
695 | /// Once this is called, callers must not call it or 'add_match_pattern_id' |
696 | /// again. |
697 | /// |
698 | /// This should not be exposed explicitly outside of this module. It |
699 | /// should be called only when converting a StateBuilderMatches into a |
700 | /// StateBuilderNFA. |
701 | fn close_match_pattern_ids(&mut self) { |
702 | // If we never wrote any pattern IDs, then there's nothing to do here. |
703 | if !self.repr().has_pattern_ids() { |
704 | return; |
705 | } |
706 | let patsize = PatternID::SIZE; |
707 | let pattern_bytes = self.0.len() - 13; |
708 | // Every pattern ID uses 4 bytes, so number of bytes should be |
709 | // divisible by 4. |
710 | assert_eq!(pattern_bytes % patsize, 0); |
711 | // This unwrap is OK since we are guaranteed that the maximum number |
712 | // of possible patterns fits into a u32. |
713 | let count32 = u32::try_from(pattern_bytes / patsize).unwrap(); |
714 | wire::NE::write_u32(count32, &mut self.0[9..13]); |
715 | } |
716 | |
717 | /// Add an NFA state ID to this state. The order in which NFA states are |
718 | /// added matters. It is the caller's responsibility to ensure that |
719 | /// duplicate NFA state IDs are not added. |
720 | fn add_nfa_state_id(&mut self, prev: &mut StateID, sid: StateID) { |
721 | let delta = sid.as_i32() - prev.as_i32(); |
722 | write_vari32(self.0, delta); |
723 | *prev = sid; |
724 | } |
725 | |
726 | /// Return a read-only view of this state's representation. |
727 | fn repr(&self) -> Repr<'_> { |
728 | Repr(self.0.as_slice()) |
729 | } |
730 | } |
731 | |
732 | /// Write a signed 32-bit integer using zig-zag encoding. |
733 | /// |
734 | /// https://developers.google.com/protocol-buffers/docs/encoding#varints |
735 | fn write_vari32(data: &mut Vec<u8>, n: i32) { |
736 | let mut un: u32 = n.to_bits() << 1; |
737 | if n < 0 { |
738 | un = !un; |
739 | } |
740 | write_varu32(data, n:un) |
741 | } |
742 | |
743 | /// Read a signed 32-bit integer using zig-zag encoding. Also, return the |
744 | /// number of bytes read. |
745 | /// |
746 | /// https://developers.google.com/protocol-buffers/docs/encoding#varints |
747 | fn read_vari32(data: &[u8]) -> (i32, usize) { |
748 | let (un: u32, i: usize) = read_varu32(data); |
749 | let mut n: i32 = i32::from_bits(un >> 1); |
750 | if un & 1 != 0 { |
751 | n = !n; |
752 | } |
753 | (n, i) |
754 | } |
755 | |
756 | /// Write an unsigned 32-bit integer as a varint. In essence, `n` is written |
757 | /// as a sequence of bytes where all bytes except for the last one have the |
758 | /// most significant bit set. The least significant 7 bits correspond to the |
759 | /// actual bits of `n`. So in the worst case, a varint uses 5 bytes, but in |
760 | /// very common cases, it uses fewer than 4. |
761 | /// |
762 | /// https://developers.google.com/protocol-buffers/docs/encoding#varints |
763 | fn write_varu32(data: &mut Vec<u8>, mut n: u32) { |
764 | while n >= 0b1000_0000 { |
765 | data.push(n.low_u8() | 0b1000_0000); |
766 | n >>= 7; |
767 | } |
768 | data.push(n.low_u8()); |
769 | } |
770 | |
771 | /// Read an unsigned 32-bit varint. Also, return the number of bytes read. |
772 | /// |
773 | /// https://developers.google.com/protocol-buffers/docs/encoding#varints |
774 | fn read_varu32(data: &[u8]) -> (u32, usize) { |
775 | // N.B. We can assume correctness here since we know that all varuints are |
776 | // written with write_varu32. Hence, the 'as' uses and unchecked arithmetic |
777 | // is all okay. |
778 | let mut n: u32 = 0; |
779 | let mut shift: u32 = 0; |
780 | for (i: usize, &b: u8) in data.iter().enumerate() { |
781 | if b < 0b1000_0000 { |
782 | return (n | (u32::from(b) << shift), i + 1); |
783 | } |
784 | n |= (u32::from(b) & 0b0111_1111) << shift; |
785 | shift += 7; |
786 | } |
787 | (0, 0) |
788 | } |
789 | |
790 | /// Push a native-endian encoded `n` on to `dst`. |
791 | fn write_u32(dst: &mut Vec<u8>, n: u32) { |
792 | use crate::util::wire::NE; |
793 | |
794 | let start: usize = dst.len(); |
795 | dst.extend(iter:core::iter::repeat(elt:0).take(mem::size_of::<u32>())); |
796 | NE::write_u32(n, &mut dst[start..]); |
797 | } |
798 | |
799 | #[cfg (test)] |
800 | mod tests { |
801 | use alloc::vec; |
802 | |
803 | use quickcheck::quickcheck; |
804 | |
805 | use super::*; |
806 | |
807 | #[cfg (not(miri))] |
808 | quickcheck! { |
809 | fn prop_state_read_write_nfa_state_ids(sids: Vec<StateID>) -> bool { |
810 | // Builders states do not permit duplicate IDs. |
811 | let sids = dedup_state_ids(sids); |
812 | |
813 | let mut b = StateBuilderEmpty::new().into_matches().into_nfa(); |
814 | for &sid in &sids { |
815 | b.add_nfa_state_id(sid); |
816 | } |
817 | let s = b.to_state(); |
818 | let mut got = vec![]; |
819 | s.iter_nfa_state_ids(|sid| got.push(sid)); |
820 | got == sids |
821 | } |
822 | |
823 | fn prop_state_read_write_pattern_ids(pids: Vec<PatternID>) -> bool { |
824 | // Builders states do not permit duplicate IDs. |
825 | let pids = dedup_pattern_ids(pids); |
826 | |
827 | let mut b = StateBuilderEmpty::new().into_matches(); |
828 | for &pid in &pids { |
829 | b.add_match_pattern_id(pid); |
830 | } |
831 | let s = b.into_nfa().to_state(); |
832 | let mut got = vec![]; |
833 | s.iter_match_pattern_ids(|pid| got.push(pid)); |
834 | got == pids |
835 | } |
836 | |
837 | fn prop_state_read_write_nfa_state_and_pattern_ids( |
838 | sids: Vec<StateID>, |
839 | pids: Vec<PatternID> |
840 | ) -> bool { |
841 | // Builders states do not permit duplicate IDs. |
842 | let sids = dedup_state_ids(sids); |
843 | let pids = dedup_pattern_ids(pids); |
844 | |
845 | let mut b = StateBuilderEmpty::new().into_matches(); |
846 | for &pid in &pids { |
847 | b.add_match_pattern_id(pid); |
848 | } |
849 | |
850 | let mut b = b.into_nfa(); |
851 | for &sid in &sids { |
852 | b.add_nfa_state_id(sid); |
853 | } |
854 | |
855 | let s = b.to_state(); |
856 | let mut got_pids = vec![]; |
857 | s.iter_match_pattern_ids(|pid| got_pids.push(pid)); |
858 | let mut got_sids = vec![]; |
859 | s.iter_nfa_state_ids(|sid| got_sids.push(sid)); |
860 | got_pids == pids && got_sids == sids |
861 | } |
862 | } |
863 | |
864 | quickcheck! { |
865 | fn prop_read_write_varu32(n: u32) -> bool { |
866 | let mut buf = vec![]; |
867 | write_varu32(&mut buf, n); |
868 | let (got, nread) = read_varu32(&buf); |
869 | nread == buf.len() && got == n |
870 | } |
871 | |
872 | fn prop_read_write_vari32(n: i32) -> bool { |
873 | let mut buf = vec![]; |
874 | write_vari32(&mut buf, n); |
875 | let (got, nread) = read_vari32(&buf); |
876 | nread == buf.len() && got == n |
877 | } |
878 | } |
879 | |
880 | #[cfg (not(miri))] |
881 | fn dedup_state_ids(sids: Vec<StateID>) -> Vec<StateID> { |
882 | let mut set = alloc::collections::BTreeSet::new(); |
883 | let mut deduped = vec![]; |
884 | for sid in sids { |
885 | if set.contains(&sid) { |
886 | continue; |
887 | } |
888 | set.insert(sid); |
889 | deduped.push(sid); |
890 | } |
891 | deduped |
892 | } |
893 | |
894 | #[cfg (not(miri))] |
895 | fn dedup_pattern_ids(pids: Vec<PatternID>) -> Vec<PatternID> { |
896 | let mut set = alloc::collections::BTreeSet::new(); |
897 | let mut deduped = vec![]; |
898 | for pid in pids { |
899 | if set.contains(&pid) { |
900 | continue; |
901 | } |
902 | set.insert(pid); |
903 | deduped.push(pid); |
904 | } |
905 | deduped |
906 | } |
907 | } |
908 | |