| 1 | /*! | 
| 2 | This module defines a DFA state representation and builders for constructing | 
|---|
| 3 | DFA states. | 
|---|
| 4 |  | 
|---|
| 5 | This representation is specifically for use in implementations of NFA-to-DFA | 
|---|
| 6 | conversion via powerset construction. (Also called "determinization" in this | 
|---|
| 7 | crate.) | 
|---|
| 8 |  | 
|---|
| 9 | The term "DFA state" is somewhat overloaded in this crate. In some cases, it | 
|---|
| 10 | refers to the set of transitions over an alphabet for a particular state. In | 
|---|
| 11 | other cases, it refers to a set of NFA states. The former is really about the | 
|---|
| 12 | final representation of a state in a DFA's transition table, where as the | 
|---|
| 13 | latter---what this module is focused on---is closer to an intermediate form | 
|---|
| 14 | that is used to help eventually build the transition table. | 
|---|
| 15 |  | 
|---|
| 16 | This module exports four types. All four types represent the same idea: an | 
|---|
| 17 | ordered set of NFA states. This ordered set represents the epsilon closure of a | 
|---|
| 18 | particular NFA state, where the "epsilon closure" is the set of NFA states that | 
|---|
| 19 | can be transitioned to without consuming any input. i.e., Follow all of the NFA | 
|---|
| 20 | state's epsilon transitions. In addition, this implementation of DFA states | 
|---|
| 21 | cares about two other things: the ordered set of pattern IDs corresponding | 
|---|
| 22 | to the patterns that match if the state is a match state, and the set of | 
|---|
| 23 | look-behind assertions that were true when the state was created. | 
|---|
| 24 |  | 
|---|
| 25 | The first, `State`, is a frozen representation of a state that cannot be | 
|---|
| 26 | modified. It may be cheaply cloned without copying the state itself and can be | 
|---|
| 27 | accessed safely from multiple threads simultaneously. This type is useful for | 
|---|
| 28 | when one knows that the DFA state being constructed is distinct from any other | 
|---|
| 29 | previously constructed states. Namely, powerset construction, in practice, | 
|---|
| 30 | requires one to keep a cache of previously created DFA states. Otherwise, | 
|---|
| 31 | the number of DFA states created in memory balloons to an impractically | 
|---|
| 32 | large number. For this reason, equivalent states should endeavor to have an | 
|---|
| 33 | equivalent byte-level representation. (In general, "equivalency" here means, | 
|---|
| 34 | "equivalent assertions, pattern IDs and NFA state IDs." We do not require that | 
|---|
| 35 | full DFA minimization be implemented here. This form of equivalency is only | 
|---|
| 36 | surface deep and is more-or-less a practical necessity.) | 
|---|
| 37 |  | 
|---|
| 38 | The other three types represent different phases in the construction of a | 
|---|
| 39 | DFA state. Internally, these three types (and `State`) all use the same | 
|---|
| 40 | byte-oriented representation. That means one can use any of the builder types | 
|---|
| 41 | to check whether the state it represents already exists or not. If it does, | 
|---|
| 42 | then there is no need to freeze it into a `State` (which requires an alloc and | 
|---|
| 43 | a copy). Here are the three types described succinctly: | 
|---|
| 44 |  | 
|---|
| 45 | * `StateBuilderEmpty` represents a state with no pattern IDs, no assertions | 
|---|
| 46 | and no NFA states. Creating a `StateBuilderEmpty` performs no allocs. A | 
|---|
| 47 | `StateBuilderEmpty` can only be used to query its underlying memory capacity, | 
|---|
| 48 | or to convert into a builder for recording pattern IDs and/or assertions. | 
|---|
| 49 |  | 
|---|
| 50 | * `StateBuilderMatches` represents a state with zero or more pattern IDs, zero | 
|---|
| 51 | or more satisfied assertions and zero NFA state IDs. A `StateBuilderMatches` | 
|---|
| 52 | can only be used for adding pattern IDs and recording assertions. | 
|---|
| 53 |  | 
|---|
| 54 | * `StateBuilderNFA` represents a state with zero or more pattern IDs, zero or | 
|---|
| 55 | more satisfied assertions and zero or more NFA state IDs. A `StateBuilderNFA` | 
|---|
| 56 | can only be used for adding NFA state IDs and recording some assertions. | 
|---|
| 57 |  | 
|---|
| 58 | The expected flow here is to use the above builders to construct a candidate | 
|---|
| 59 | DFA state to check if it already exists. If it does, then there's no need to | 
|---|
| 60 | freeze it into a `State`. If it doesn't exist, then `StateBuilderNFA::to_state` | 
|---|
| 61 | can be called to freeze the builder into an immutable `State`. In either | 
|---|
| 62 | case, `clear` should be called on the builder to turn it back into a | 
|---|
| 63 | `StateBuilderEmpty` that reuses the underlying memory. | 
|---|
| 64 |  | 
|---|
| 65 | The main purpose for splitting the builder into these distinct types is to | 
|---|
| 66 | make it impossible to do things like adding a pattern ID after adding an NFA | 
|---|
| 67 | state ID. Namely, this makes it simpler to use a space-and-time efficient | 
|---|
| 68 | binary representation for the state. (The format is documented on the `Repr` | 
|---|
| 69 | type below.) If we just used one type for everything, it would be possible for | 
|---|
| 70 | callers to use an incorrect interleaving of calls and thus result in a corrupt | 
|---|
| 71 | representation. I chose to use more type machinery to make this impossible to | 
|---|
| 72 | do because 1) determinization is itself pretty complex and it wouldn't be too | 
|---|
| 73 | hard to foul this up and 2) there isn't too much machinery involved and it's | 
|---|
| 74 | well contained. | 
|---|
| 75 |  | 
|---|
| 76 | As an optimization, sometimes states won't have certain things set. For | 
|---|
| 77 | example, if the underlying NFA has no word boundary assertions, then there is | 
|---|
| 78 | no reason to set a state's look-behind assertion as to whether it was generated | 
|---|
| 79 | from a word byte or not. Similarly, if a state has no NFA states corresponding | 
|---|
| 80 | to look-around assertions, then there is no reason to set `look_have` to a | 
|---|
| 81 | non-empty set. Finally, callers usually omit unconditional epsilon transitions | 
|---|
| 82 | when adding NFA state IDs since they aren't discriminatory. | 
|---|
| 83 |  | 
|---|
| 84 | Finally, the binary representation used by these states is, thankfully, not | 
|---|
| 85 | serialized anywhere. So any kind of change can be made with reckless abandon, | 
|---|
| 86 | as long as everything in this module agrees. | 
|---|
| 87 | */ | 
|---|
| 88 |  | 
|---|
| 89 | use core::mem; | 
|---|
| 90 |  | 
|---|
| 91 | use alloc::{sync::Arc, vec::Vec}; | 
|---|
| 92 |  | 
|---|
| 93 | use crate::util::{ | 
|---|
| 94 | int::{I32, U32}, | 
|---|
| 95 | look::LookSet, | 
|---|
| 96 | primitives::{PatternID, StateID}, | 
|---|
| 97 | wire::{self, Endian}, | 
|---|
| 98 | }; | 
|---|
| 99 |  | 
|---|
| 100 | /// A DFA state that, at its core, is represented by an ordered set of NFA | 
|---|
| 101 | /// states. | 
|---|
| 102 | /// | 
|---|
| 103 | /// This type is intended to be used only in NFA-to-DFA conversion via powerset | 
|---|
| 104 | /// construction. | 
|---|
| 105 | /// | 
|---|
| 106 | /// It may be cheaply cloned and accessed safely from multiple threads | 
|---|
| 107 | /// simultaneously. | 
|---|
| 108 | #[ derive(Clone, Eq, Hash, PartialEq, PartialOrd, Ord)] | 
|---|
| 109 | pub(crate) struct State(Arc<[u8]>); | 
|---|
| 110 |  | 
|---|
| 111 | /// This Borrow impl permits us to lookup any state in a map by its byte | 
|---|
| 112 | /// representation. This is particularly convenient when one has a StateBuilder | 
|---|
| 113 | /// and we want to see if a correspondingly equivalent state already exists. If | 
|---|
| 114 | /// one does exist, then we can reuse the allocation required by StateBuilder | 
|---|
| 115 | /// without having to convert it into a State first. | 
|---|
| 116 | impl core::borrow::Borrow<[u8]> for State { | 
|---|
| 117 | fn borrow(&self) -> &[u8] { | 
|---|
| 118 | &*self.0 | 
|---|
| 119 | } | 
|---|
| 120 | } | 
|---|
| 121 |  | 
|---|
| 122 | impl core::fmt::Debug for State { | 
|---|
| 123 | fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { | 
|---|
| 124 | f.debug_tuple(name: "State").field(&self.repr()).finish() | 
|---|
| 125 | } | 
|---|
| 126 | } | 
|---|
| 127 |  | 
|---|
| 128 | /// For docs on these routines, see the internal Repr and ReprVec types below. | 
|---|
| 129 | impl State { | 
|---|
| 130 | pub(crate) fn dead() -> State { | 
|---|
| 131 | StateBuilderEmpty::new().into_matches().into_nfa().to_state() | 
|---|
| 132 | } | 
|---|
| 133 |  | 
|---|
| 134 | pub(crate) fn is_match(&self) -> bool { | 
|---|
| 135 | self.repr().is_match() | 
|---|
| 136 | } | 
|---|
| 137 |  | 
|---|
| 138 | pub(crate) fn is_from_word(&self) -> bool { | 
|---|
| 139 | self.repr().is_from_word() | 
|---|
| 140 | } | 
|---|
| 141 |  | 
|---|
| 142 | pub(crate) fn is_half_crlf(&self) -> bool { | 
|---|
| 143 | self.repr().is_half_crlf() | 
|---|
| 144 | } | 
|---|
| 145 |  | 
|---|
| 146 | pub(crate) fn look_have(&self) -> LookSet { | 
|---|
| 147 | self.repr().look_have() | 
|---|
| 148 | } | 
|---|
| 149 |  | 
|---|
| 150 | pub(crate) fn look_need(&self) -> LookSet { | 
|---|
| 151 | self.repr().look_need() | 
|---|
| 152 | } | 
|---|
| 153 |  | 
|---|
| 154 | pub(crate) fn match_len(&self) -> usize { | 
|---|
| 155 | self.repr().match_len() | 
|---|
| 156 | } | 
|---|
| 157 |  | 
|---|
| 158 | pub(crate) fn match_pattern(&self, index: usize) -> PatternID { | 
|---|
| 159 | self.repr().match_pattern(index) | 
|---|
| 160 | } | 
|---|
| 161 |  | 
|---|
| 162 | pub(crate) fn match_pattern_ids(&self) -> Option<Vec<PatternID>> { | 
|---|
| 163 | self.repr().match_pattern_ids() | 
|---|
| 164 | } | 
|---|
| 165 |  | 
|---|
| 166 | #[ cfg(all(test, not(miri)))] | 
|---|
| 167 | pub(crate) fn iter_match_pattern_ids<F: FnMut(PatternID)>(&self, f: F) { | 
|---|
| 168 | self.repr().iter_match_pattern_ids(f) | 
|---|
| 169 | } | 
|---|
| 170 |  | 
|---|
| 171 | pub(crate) fn iter_nfa_state_ids<F: FnMut(StateID)>(&self, f: F) { | 
|---|
| 172 | self.repr().iter_nfa_state_ids(f) | 
|---|
| 173 | } | 
|---|
| 174 |  | 
|---|
| 175 | pub(crate) fn memory_usage(&self) -> usize { | 
|---|
| 176 | self.0.len() | 
|---|
| 177 | } | 
|---|
| 178 |  | 
|---|
| 179 | fn repr(&self) -> Repr<'_> { | 
|---|
| 180 | Repr(&*self.0) | 
|---|
| 181 | } | 
|---|
| 182 | } | 
|---|
| 183 |  | 
|---|
| 184 | /// A state builder that represents an empty state. | 
|---|
| 185 | /// | 
|---|
| 186 | /// This is a useful "initial condition" for state construction. It has no | 
|---|
| 187 | /// NFA state IDs, no assertions set and no pattern IDs. No allocations are | 
|---|
| 188 | /// made when new() is called. Its main use is for being converted into a | 
|---|
| 189 | /// builder that can capture assertions and pattern IDs. | 
|---|
| 190 | #[ derive(Clone, Debug)] | 
|---|
| 191 | pub(crate) struct StateBuilderEmpty(Vec<u8>); | 
|---|
| 192 |  | 
|---|
| 193 | /// For docs on these routines, see the internal Repr and ReprVec types below. | 
|---|
| 194 | impl StateBuilderEmpty { | 
|---|
| 195 | pub(crate) fn new() -> StateBuilderEmpty { | 
|---|
| 196 | StateBuilderEmpty(alloc::vec![]) | 
|---|
| 197 | } | 
|---|
| 198 |  | 
|---|
| 199 | pub(crate) fn into_matches(mut self) -> StateBuilderMatches { | 
|---|
| 200 | self.0.extend_from_slice(&[0, 0, 0, 0, 0, 0, 0, 0, 0]); | 
|---|
| 201 | StateBuilderMatches(self.0) | 
|---|
| 202 | } | 
|---|
| 203 |  | 
|---|
| 204 | fn clear(&mut self) { | 
|---|
| 205 | self.0.clear(); | 
|---|
| 206 | } | 
|---|
| 207 |  | 
|---|
| 208 | pub(crate) fn capacity(&self) -> usize { | 
|---|
| 209 | self.0.capacity() | 
|---|
| 210 | } | 
|---|
| 211 | } | 
|---|
| 212 |  | 
|---|
| 213 | /// A state builder that collects assertions and pattern IDs. | 
|---|
| 214 | /// | 
|---|
| 215 | /// When collecting pattern IDs is finished, this can be converted into a | 
|---|
| 216 | /// builder that collects NFA state IDs. | 
|---|
| 217 | #[ derive(Clone)] | 
|---|
| 218 | pub(crate) struct StateBuilderMatches(Vec<u8>); | 
|---|
| 219 |  | 
|---|
| 220 | impl core::fmt::Debug for StateBuilderMatches { | 
|---|
| 221 | fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { | 
|---|
| 222 | f.debug_tuple(name: "StateBuilderMatches").field(&self.repr()).finish() | 
|---|
| 223 | } | 
|---|
| 224 | } | 
|---|
| 225 |  | 
|---|
| 226 | /// For docs on these routines, see the internal Repr and ReprVec types below. | 
|---|
| 227 | impl StateBuilderMatches { | 
|---|
| 228 | pub(crate) fn into_nfa(mut self) -> StateBuilderNFA { | 
|---|
| 229 | self.repr_vec().close_match_pattern_ids(); | 
|---|
| 230 | StateBuilderNFA { repr: self.0, prev_nfa_state_id: StateID::ZERO } | 
|---|
| 231 | } | 
|---|
| 232 |  | 
|---|
| 233 | pub(crate) fn set_is_from_word(&mut self) { | 
|---|
| 234 | self.repr_vec().set_is_from_word() | 
|---|
| 235 | } | 
|---|
| 236 |  | 
|---|
| 237 | pub(crate) fn set_is_half_crlf(&mut self) { | 
|---|
| 238 | self.repr_vec().set_is_half_crlf() | 
|---|
| 239 | } | 
|---|
| 240 |  | 
|---|
| 241 | pub(crate) fn look_have(&self) -> LookSet { | 
|---|
| 242 | LookSet::read_repr(&self.0[1..]) | 
|---|
| 243 | } | 
|---|
| 244 |  | 
|---|
| 245 | pub(crate) fn set_look_have( | 
|---|
| 246 | &mut self, | 
|---|
| 247 | set: impl FnMut(LookSet) -> LookSet, | 
|---|
| 248 | ) { | 
|---|
| 249 | self.repr_vec().set_look_have(set) | 
|---|
| 250 | } | 
|---|
| 251 |  | 
|---|
| 252 | pub(crate) fn add_match_pattern_id(&mut self, pid: PatternID) { | 
|---|
| 253 | self.repr_vec().add_match_pattern_id(pid) | 
|---|
| 254 | } | 
|---|
| 255 |  | 
|---|
| 256 | fn repr(&self) -> Repr<'_> { | 
|---|
| 257 | Repr(&self.0) | 
|---|
| 258 | } | 
|---|
| 259 |  | 
|---|
| 260 | fn repr_vec(&mut self) -> ReprVec<'_> { | 
|---|
| 261 | ReprVec(&mut self.0) | 
|---|
| 262 | } | 
|---|
| 263 | } | 
|---|
| 264 |  | 
|---|
| 265 | /// A state builder that collects some assertions and NFA state IDs. | 
|---|
| 266 | /// | 
|---|
| 267 | /// When collecting NFA state IDs is finished, this can be used to build a | 
|---|
| 268 | /// `State` if necessary. | 
|---|
| 269 | /// | 
|---|
| 270 | /// When dont with building a state (regardless of whether it got kept or not), | 
|---|
| 271 | /// it's usually a good idea to call `clear` to get an empty builder back so | 
|---|
| 272 | /// that it can be reused to build the next state. | 
|---|
| 273 | #[ derive(Clone)] | 
|---|
| 274 | pub(crate) struct StateBuilderNFA { | 
|---|
| 275 | repr: Vec<u8>, | 
|---|
| 276 | prev_nfa_state_id: StateID, | 
|---|
| 277 | } | 
|---|
| 278 |  | 
|---|
| 279 | impl core::fmt::Debug for StateBuilderNFA { | 
|---|
| 280 | fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { | 
|---|
| 281 | f.debug_tuple(name: "StateBuilderNFA").field(&self.repr()).finish() | 
|---|
| 282 | } | 
|---|
| 283 | } | 
|---|
| 284 |  | 
|---|
| 285 | /// For docs on these routines, see the internal Repr and ReprVec types below. | 
|---|
| 286 | impl StateBuilderNFA { | 
|---|
| 287 | pub(crate) fn to_state(&self) -> State { | 
|---|
| 288 | State(Arc::from(&*self.repr)) | 
|---|
| 289 | } | 
|---|
| 290 |  | 
|---|
| 291 | pub(crate) fn clear(self) -> StateBuilderEmpty { | 
|---|
| 292 | let mut builder = StateBuilderEmpty(self.repr); | 
|---|
| 293 | builder.clear(); | 
|---|
| 294 | builder | 
|---|
| 295 | } | 
|---|
| 296 |  | 
|---|
| 297 | pub(crate) fn look_need(&self) -> LookSet { | 
|---|
| 298 | self.repr().look_need() | 
|---|
| 299 | } | 
|---|
| 300 |  | 
|---|
| 301 | pub(crate) fn set_look_have( | 
|---|
| 302 | &mut self, | 
|---|
| 303 | set: impl FnMut(LookSet) -> LookSet, | 
|---|
| 304 | ) { | 
|---|
| 305 | self.repr_vec().set_look_have(set) | 
|---|
| 306 | } | 
|---|
| 307 |  | 
|---|
| 308 | pub(crate) fn set_look_need( | 
|---|
| 309 | &mut self, | 
|---|
| 310 | set: impl FnMut(LookSet) -> LookSet, | 
|---|
| 311 | ) { | 
|---|
| 312 | self.repr_vec().set_look_need(set) | 
|---|
| 313 | } | 
|---|
| 314 |  | 
|---|
| 315 | pub(crate) fn add_nfa_state_id(&mut self, sid: StateID) { | 
|---|
| 316 | ReprVec(&mut self.repr) | 
|---|
| 317 | .add_nfa_state_id(&mut self.prev_nfa_state_id, sid) | 
|---|
| 318 | } | 
|---|
| 319 |  | 
|---|
| 320 | pub(crate) fn as_bytes(&self) -> &[u8] { | 
|---|
| 321 | &self.repr | 
|---|
| 322 | } | 
|---|
| 323 |  | 
|---|
| 324 | fn repr(&self) -> Repr<'_> { | 
|---|
| 325 | Repr(&self.repr) | 
|---|
| 326 | } | 
|---|
| 327 |  | 
|---|
| 328 | fn repr_vec(&mut self) -> ReprVec<'_> { | 
|---|
| 329 | ReprVec(&mut self.repr) | 
|---|
| 330 | } | 
|---|
| 331 | } | 
|---|
| 332 |  | 
|---|
| 333 | /// Repr is a read-only view into the representation of a DFA state. | 
|---|
| 334 | /// | 
|---|
| 335 | /// Primarily, a Repr is how we achieve DRY: we implement decoding the format | 
|---|
| 336 | /// in one place, and then use a Repr to implement the various methods on the | 
|---|
| 337 | /// public state types. | 
|---|
| 338 | /// | 
|---|
| 339 | /// The format is as follows: | 
|---|
| 340 | /// | 
|---|
| 341 | /// The first three bytes correspond to bitsets. | 
|---|
| 342 | /// | 
|---|
| 343 | /// Byte 0 is a bitset corresponding to miscellaneous flags associated with the | 
|---|
| 344 | /// state. Bit 0 is set to 1 if the state is a match state. Bit 1 is set to 1 | 
|---|
| 345 | /// if the state has pattern IDs explicitly written to it. (This is a flag that | 
|---|
| 346 | /// is not meant to be set by determinization, but rather, is used as part of | 
|---|
| 347 | /// an internal space-saving optimization.) Bit 2 is set to 1 if the state was | 
|---|
| 348 | /// generated by a transition over a "word" byte. (Callers may not always set | 
|---|
| 349 | /// this. For example, if the NFA has no word boundary assertion, then needing | 
|---|
| 350 | /// to track whether a state came from a word byte or not is superfluous and | 
|---|
| 351 | /// wasteful.) Bit 3 is set to 1 if the state was generated by a transition | 
|---|
| 352 | /// from a `\r` (forward search) or a `\n` (reverse search) when CRLF mode is | 
|---|
| 353 | /// enabled. | 
|---|
| 354 | /// | 
|---|
| 355 | /// Bytes 1..5 correspond to the look-behind assertions that were satisfied | 
|---|
| 356 | /// by the transition that created this state. (Look-ahead assertions are not | 
|---|
| 357 | /// tracked as part of states. Instead, these are applied by re-computing the | 
|---|
| 358 | /// epsilon closure of a state when computing the transition function. See | 
|---|
| 359 | /// `next` in the parent module.) | 
|---|
| 360 | /// | 
|---|
| 361 | /// Bytes 5..9 correspond to the set of look-around assertions (including both | 
|---|
| 362 | /// look-behind and look-ahead) that appear somewhere in this state's set of | 
|---|
| 363 | /// NFA state IDs. This is used to determine whether this state's epsilon | 
|---|
| 364 | /// closure should be re-computed when computing the transition function. | 
|---|
| 365 | /// Namely, look-around assertions are "just" conditional epsilon transitions, | 
|---|
| 366 | /// so if there are new assertions available when computing the transition | 
|---|
| 367 | /// function, we should only re-compute the epsilon closure if those new | 
|---|
| 368 | /// assertions are relevant to this particular state. | 
|---|
| 369 | /// | 
|---|
| 370 | /// Bytes 9..13 correspond to a 32-bit native-endian encoded integer | 
|---|
| 371 | /// corresponding to the number of patterns encoded in this state. If the state | 
|---|
| 372 | /// is not a match state (byte 0 bit 0 is 0) or if it's only pattern ID is | 
|---|
| 373 | /// PatternID::ZERO, then no integer is encoded at this position. Instead, byte | 
|---|
| 374 | /// offset 3 is the position at which the first NFA state ID is encoded. | 
|---|
| 375 | /// | 
|---|
| 376 | /// For a match state with at least one non-ZERO pattern ID, the next bytes | 
|---|
| 377 | /// correspond to a sequence of 32-bit native endian encoded integers that | 
|---|
| 378 | /// represent each pattern ID, in order, that this match state represents. | 
|---|
| 379 | /// | 
|---|
| 380 | /// After the pattern IDs (if any), NFA state IDs are delta encoded as | 
|---|
| 381 | /// varints.[1] The first NFA state ID is encoded as itself, and each | 
|---|
| 382 | /// subsequent NFA state ID is encoded as the difference between itself and the | 
|---|
| 383 | /// previous NFA state ID. | 
|---|
| 384 | /// | 
|---|
| 385 | /// [1] - https://developers.google.com/protocol-buffers/docs/encoding#varints | 
|---|
| 386 | struct Repr<'a>(&'a [u8]); | 
|---|
| 387 |  | 
|---|
| 388 | impl<'a> Repr<'a> { | 
|---|
| 389 | /// Returns true if and only if this is a match state. | 
|---|
| 390 | /// | 
|---|
| 391 | /// If callers have added pattern IDs to this state, then callers MUST set | 
|---|
| 392 | /// this state as a match state explicitly. However, as a special case, | 
|---|
| 393 | /// states that are marked as match states but with no pattern IDs, then | 
|---|
| 394 | /// the state is treated as if it had a single pattern ID equivalent to | 
|---|
| 395 | /// PatternID::ZERO. | 
|---|
| 396 | fn is_match(&self) -> bool { | 
|---|
| 397 | self.0[0] & (1 << 0) > 0 | 
|---|
| 398 | } | 
|---|
| 399 |  | 
|---|
| 400 | /// Returns true if and only if this state has had at least one pattern | 
|---|
| 401 | /// ID added to it. | 
|---|
| 402 | /// | 
|---|
| 403 | /// This is an internal-only flag that permits the representation to save | 
|---|
| 404 | /// space in the common case of an NFA with one pattern in it. In that | 
|---|
| 405 | /// case, a match state can only ever have exactly one pattern ID: | 
|---|
| 406 | /// PatternID::ZERO. So there's no need to represent it. | 
|---|
| 407 | fn has_pattern_ids(&self) -> bool { | 
|---|
| 408 | self.0[0] & (1 << 1) > 0 | 
|---|
| 409 | } | 
|---|
| 410 |  | 
|---|
| 411 | /// Returns true if and only if this state is marked as having been created | 
|---|
| 412 | /// from a transition over a word byte. This is useful for checking whether | 
|---|
| 413 | /// a word boundary assertion is true or not, which requires look-behind | 
|---|
| 414 | /// (whether the current state came from a word byte or not) and look-ahead | 
|---|
| 415 | /// (whether the transition byte is a word byte or not). | 
|---|
| 416 | /// | 
|---|
| 417 | /// Since states with this set are distinct from states that don't have | 
|---|
| 418 | /// this set (even if they are otherwise equivalent), callers should not | 
|---|
| 419 | /// set this assertion unless the underlying NFA has at least one word | 
|---|
| 420 | /// boundary assertion somewhere. Otherwise, a superfluous number of states | 
|---|
| 421 | /// may be created. | 
|---|
| 422 | fn is_from_word(&self) -> bool { | 
|---|
| 423 | self.0[0] & (1 << 2) > 0 | 
|---|
| 424 | } | 
|---|
| 425 |  | 
|---|
| 426 | /// Returns true if and only if this state is marked as being inside of a | 
|---|
| 427 | /// CRLF terminator. In the forward direction, this means the state was | 
|---|
| 428 | /// created after seeing a `\r`. In the reverse direction, this means the | 
|---|
| 429 | /// state was created after seeing a `\n`. | 
|---|
| 430 | fn is_half_crlf(&self) -> bool { | 
|---|
| 431 | self.0[0] & (1 << 3) > 0 | 
|---|
| 432 | } | 
|---|
| 433 |  | 
|---|
| 434 | /// The set of look-behind assertions that were true in the transition that | 
|---|
| 435 | /// created this state. | 
|---|
| 436 | /// | 
|---|
| 437 | /// Generally, this should be empty if 'look_need' is empty, since there is | 
|---|
| 438 | /// no reason to track which look-behind assertions are true if the state | 
|---|
| 439 | /// has no conditional epsilon transitions. | 
|---|
| 440 | /// | 
|---|
| 441 | /// Satisfied look-ahead assertions are not tracked in states. Instead, | 
|---|
| 442 | /// these are re-computed on demand via epsilon closure when computing the | 
|---|
| 443 | /// transition function. | 
|---|
| 444 | fn look_have(&self) -> LookSet { | 
|---|
| 445 | LookSet::read_repr(&self.0[1..]) | 
|---|
| 446 | } | 
|---|
| 447 |  | 
|---|
| 448 | /// The set of look-around (both behind and ahead) assertions that appear | 
|---|
| 449 | /// at least once in this state's set of NFA states. | 
|---|
| 450 | /// | 
|---|
| 451 | /// This is used to determine whether the epsilon closure needs to be | 
|---|
| 452 | /// re-computed when computing the transition function. Namely, if the | 
|---|
| 453 | /// state has no conditional epsilon transitions, then there is no need | 
|---|
| 454 | /// to re-compute the epsilon closure. | 
|---|
| 455 | fn look_need(&self) -> LookSet { | 
|---|
| 456 | LookSet::read_repr(&self.0[5..]) | 
|---|
| 457 | } | 
|---|
| 458 |  | 
|---|
| 459 | /// Returns the total number of match pattern IDs in this state. | 
|---|
| 460 | /// | 
|---|
| 461 | /// If this state is not a match state, then this always returns 0. | 
|---|
| 462 | fn match_len(&self) -> usize { | 
|---|
| 463 | if !self.is_match() { | 
|---|
| 464 | return 0; | 
|---|
| 465 | } else if !self.has_pattern_ids() { | 
|---|
| 466 | 1 | 
|---|
| 467 | } else { | 
|---|
| 468 | self.encoded_pattern_len() | 
|---|
| 469 | } | 
|---|
| 470 | } | 
|---|
| 471 |  | 
|---|
| 472 | /// Returns the pattern ID for this match state at the given index. | 
|---|
| 473 | /// | 
|---|
| 474 | /// If the given index is greater than or equal to `match_len()` for this | 
|---|
| 475 | /// state, then this could panic or return incorrect results. | 
|---|
| 476 | fn match_pattern(&self, index: usize) -> PatternID { | 
|---|
| 477 | if !self.has_pattern_ids() { | 
|---|
| 478 | PatternID::ZERO | 
|---|
| 479 | } else { | 
|---|
| 480 | let offset = 13 + index * PatternID::SIZE; | 
|---|
| 481 | // This is OK since we only ever serialize valid PatternIDs to | 
|---|
| 482 | // states. | 
|---|
| 483 | wire::read_pattern_id_unchecked(&self.0[offset..]).0 | 
|---|
| 484 | } | 
|---|
| 485 | } | 
|---|
| 486 |  | 
|---|
| 487 | /// Returns a copy of all match pattern IDs in this state. If this state | 
|---|
| 488 | /// is not a match state, then this returns None. | 
|---|
| 489 | fn match_pattern_ids(&self) -> Option<Vec<PatternID>> { | 
|---|
| 490 | if !self.is_match() { | 
|---|
| 491 | return None; | 
|---|
| 492 | } | 
|---|
| 493 | let mut pids = alloc::vec![]; | 
|---|
| 494 | self.iter_match_pattern_ids(|pid| pids.push(pid)); | 
|---|
| 495 | Some(pids) | 
|---|
| 496 | } | 
|---|
| 497 |  | 
|---|
| 498 | /// Calls the given function on every pattern ID in this state. | 
|---|
| 499 | fn iter_match_pattern_ids<F: FnMut(PatternID)>(&self, mut f: F) { | 
|---|
| 500 | if !self.is_match() { | 
|---|
| 501 | return; | 
|---|
| 502 | } | 
|---|
| 503 | // As an optimization for a very common case, when this is a match | 
|---|
| 504 | // state for an NFA with only one pattern, we don't actually write the | 
|---|
| 505 | // pattern ID to the state representation. Instead, we know it must | 
|---|
| 506 | // be there since it is the only possible choice. | 
|---|
| 507 | if !self.has_pattern_ids() { | 
|---|
| 508 | f(PatternID::ZERO); | 
|---|
| 509 | return; | 
|---|
| 510 | } | 
|---|
| 511 | let mut pids = &self.0[13..self.pattern_offset_end()]; | 
|---|
| 512 | while !pids.is_empty() { | 
|---|
| 513 | let pid = wire::read_u32(pids); | 
|---|
| 514 | pids = &pids[PatternID::SIZE..]; | 
|---|
| 515 | // This is OK since we only ever serialize valid PatternIDs to | 
|---|
| 516 | // states. And since pattern IDs can never exceed a usize, the | 
|---|
| 517 | // unwrap is OK. | 
|---|
| 518 | f(PatternID::new_unchecked(usize::try_from(pid).unwrap())); | 
|---|
| 519 | } | 
|---|
| 520 | } | 
|---|
| 521 |  | 
|---|
| 522 | /// Calls the given function on every NFA state ID in this state. | 
|---|
| 523 | fn iter_nfa_state_ids<F: FnMut(StateID)>(&self, mut f: F) { | 
|---|
| 524 | let mut sids = &self.0[self.pattern_offset_end()..]; | 
|---|
| 525 | let mut prev = 0i32; | 
|---|
| 526 | while !sids.is_empty() { | 
|---|
| 527 | let (delta, nr) = read_vari32(sids); | 
|---|
| 528 | sids = &sids[nr..]; | 
|---|
| 529 | let sid = prev + delta; | 
|---|
| 530 | prev = sid; | 
|---|
| 531 | // This is OK since we only ever serialize valid StateIDs to | 
|---|
| 532 | // states. And since state IDs can never exceed an isize, they must | 
|---|
| 533 | // always be able to fit into a usize, and thus cast is OK. | 
|---|
| 534 | f(StateID::new_unchecked(sid.as_usize())) | 
|---|
| 535 | } | 
|---|
| 536 | } | 
|---|
| 537 |  | 
|---|
| 538 | /// Returns the offset into this state's representation where the pattern | 
|---|
| 539 | /// IDs end and the NFA state IDs begin. | 
|---|
| 540 | fn pattern_offset_end(&self) -> usize { | 
|---|
| 541 | let encoded = self.encoded_pattern_len(); | 
|---|
| 542 | if encoded == 0 { | 
|---|
| 543 | return 9; | 
|---|
| 544 | } | 
|---|
| 545 | // This arithmetic is OK since we were able to address this many bytes | 
|---|
| 546 | // when writing to the state, thus, it must fit into a usize. | 
|---|
| 547 | encoded.checked_mul(4).unwrap().checked_add(13).unwrap() | 
|---|
| 548 | } | 
|---|
| 549 |  | 
|---|
| 550 | /// Returns the total number of *encoded* pattern IDs in this state. | 
|---|
| 551 | /// | 
|---|
| 552 | /// This may return 0 even when this is a match state, since the pattern | 
|---|
| 553 | /// ID `PatternID::ZERO` is not encoded when it's the only pattern ID in | 
|---|
| 554 | /// the match state (the overwhelming common case). | 
|---|
| 555 | fn encoded_pattern_len(&self) -> usize { | 
|---|
| 556 | if !self.has_pattern_ids() { | 
|---|
| 557 | return 0; | 
|---|
| 558 | } | 
|---|
| 559 | // This unwrap is OK since the total number of patterns is always | 
|---|
| 560 | // guaranteed to fit into a usize. | 
|---|
| 561 | usize::try_from(wire::read_u32(&self.0[9..13])).unwrap() | 
|---|
| 562 | } | 
|---|
| 563 | } | 
|---|
| 564 |  | 
|---|
| 565 | impl<'a> core::fmt::Debug for Repr<'a> { | 
|---|
| 566 | fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { | 
|---|
| 567 | let mut nfa_ids: Vec = alloc::vec![]; | 
|---|
| 568 | self.iter_nfa_state_ids(|sid: StateID| nfa_ids.push(sid)); | 
|---|
| 569 | f&mut DebugStruct<'_, '_>.debug_struct( "Repr") | 
|---|
| 570 | .field( "is_match", &self.is_match()) | 
|---|
| 571 | .field( "is_from_word", &self.is_from_word()) | 
|---|
| 572 | .field( "is_half_crlf", &self.is_half_crlf()) | 
|---|
| 573 | .field( "look_have", &self.look_have()) | 
|---|
| 574 | .field( "look_need", &self.look_need()) | 
|---|
| 575 | .field( "match_pattern_ids", &self.match_pattern_ids()) | 
|---|
| 576 | .field(name: "nfa_state_ids", &nfa_ids) | 
|---|
| 577 | .finish() | 
|---|
| 578 | } | 
|---|
| 579 | } | 
|---|
| 580 |  | 
|---|
| 581 | /// ReprVec is a write-only view into the representation of a DFA state. | 
|---|
| 582 | /// | 
|---|
| 583 | /// See Repr for more details on the purpose of this type and also the format. | 
|---|
| 584 | /// | 
|---|
| 585 | /// Note that not all possible combinations of methods may be called. This is | 
|---|
| 586 | /// precisely what the various StateBuilder types encapsulate: they only | 
|---|
| 587 | /// permit valid combinations via Rust's linear typing. | 
|---|
| 588 | struct ReprVec<'a>(&'a mut Vec<u8>); | 
|---|
| 589 |  | 
|---|
| 590 | impl<'a> ReprVec<'a> { | 
|---|
| 591 | /// Set this state as a match state. | 
|---|
| 592 | /// | 
|---|
| 593 | /// This should not be exposed explicitly outside of this module. It is | 
|---|
| 594 | /// set automatically when a pattern ID is added. | 
|---|
| 595 | fn set_is_match(&mut self) { | 
|---|
| 596 | self.0[0] |= 1 << 0; | 
|---|
| 597 | } | 
|---|
| 598 |  | 
|---|
| 599 | /// Set that this state has pattern IDs explicitly written to it. | 
|---|
| 600 | /// | 
|---|
| 601 | /// This should not be exposed explicitly outside of this module. This is | 
|---|
| 602 | /// used internally as a space saving optimization. Namely, if the state | 
|---|
| 603 | /// is a match state but does not have any pattern IDs written to it, | 
|---|
| 604 | /// then it is automatically inferred to have a pattern ID of ZERO. | 
|---|
| 605 | fn set_has_pattern_ids(&mut self) { | 
|---|
| 606 | self.0[0] |= 1 << 1; | 
|---|
| 607 | } | 
|---|
| 608 |  | 
|---|
| 609 | /// Set this state as being built from a transition over a word byte. | 
|---|
| 610 | /// | 
|---|
| 611 | /// Setting this is only necessary when one needs to deal with word | 
|---|
| 612 | /// boundary assertions. Therefore, if the underlying NFA has no word | 
|---|
| 613 | /// boundary assertions, callers should not set this. | 
|---|
| 614 | fn set_is_from_word(&mut self) { | 
|---|
| 615 | self.0[0] |= 1 << 2; | 
|---|
| 616 | } | 
|---|
| 617 |  | 
|---|
| 618 | /// Set this state as having seen half of a CRLF terminator. | 
|---|
| 619 | /// | 
|---|
| 620 | /// In the forward direction, this should be set when a `\r` has been seen. | 
|---|
| 621 | /// In the reverse direction, this should be set when a `\n` has been seen. | 
|---|
| 622 | fn set_is_half_crlf(&mut self) { | 
|---|
| 623 | self.0[0] |= 1 << 3; | 
|---|
| 624 | } | 
|---|
| 625 |  | 
|---|
| 626 | /// The set of look-behind assertions that were true in the transition that | 
|---|
| 627 | /// created this state. | 
|---|
| 628 | fn look_have(&self) -> LookSet { | 
|---|
| 629 | self.repr().look_have() | 
|---|
| 630 | } | 
|---|
| 631 |  | 
|---|
| 632 | /// The set of look-around (both behind and ahead) assertions that appear | 
|---|
| 633 | /// at least once in this state's set of NFA states. | 
|---|
| 634 | fn look_need(&self) -> LookSet { | 
|---|
| 635 | self.repr().look_need() | 
|---|
| 636 | } | 
|---|
| 637 |  | 
|---|
| 638 | /// Mutate the set of look-behind assertions that were true in the | 
|---|
| 639 | /// transition that created this state. | 
|---|
| 640 | fn set_look_have(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 
|---|
| 641 | set(self.look_have()).write_repr(&mut self.0[1..]); | 
|---|
| 642 | } | 
|---|
| 643 |  | 
|---|
| 644 | /// Mutate the set of look-around (both behind and ahead) assertions that | 
|---|
| 645 | /// appear at least once in this state's set of NFA states. | 
|---|
| 646 | fn set_look_need(&mut self, mut set: impl FnMut(LookSet) -> LookSet) { | 
|---|
| 647 | set(self.look_need()).write_repr(&mut self.0[5..]); | 
|---|
| 648 | } | 
|---|
| 649 |  | 
|---|
| 650 | /// Add a pattern ID to this state. All match states must have at least | 
|---|
| 651 | /// one pattern ID associated with it. | 
|---|
| 652 | /// | 
|---|
| 653 | /// Callers must never add duplicative pattern IDs. | 
|---|
| 654 | /// | 
|---|
| 655 | /// The order in which patterns are added must correspond to the order | 
|---|
| 656 | /// in which patterns are reported as matches. | 
|---|
| 657 | fn add_match_pattern_id(&mut self, pid: PatternID) { | 
|---|
| 658 | // As a (somewhat small) space saving optimization, in the case where | 
|---|
| 659 | // a matching state has exactly one pattern ID, PatternID::ZERO, we do | 
|---|
| 660 | // not write either the pattern ID or the number of patterns encoded. | 
|---|
| 661 | // Instead, all we do is set the 'is_match' bit on this state. Overall, | 
|---|
| 662 | // this saves 8 bytes per match state for the overwhelming majority of | 
|---|
| 663 | // match states. | 
|---|
| 664 | // | 
|---|
| 665 | // In order to know whether pattern IDs need to be explicitly read or | 
|---|
| 666 | // not, we use another internal-only bit, 'has_pattern_ids', to | 
|---|
| 667 | // indicate whether they have been explicitly written or not. | 
|---|
| 668 | if !self.repr().has_pattern_ids() { | 
|---|
| 669 | if pid == PatternID::ZERO { | 
|---|
| 670 | self.set_is_match(); | 
|---|
| 671 | return; | 
|---|
| 672 | } | 
|---|
| 673 | // Make room for 'close_match_pattern_ids' to write the total | 
|---|
| 674 | // number of pattern IDs written. | 
|---|
| 675 | self.0.extend(core::iter::repeat(0).take(PatternID::SIZE)); | 
|---|
| 676 | self.set_has_pattern_ids(); | 
|---|
| 677 | // If this was already a match state, then the only way that's | 
|---|
| 678 | // possible when the state doesn't have pattern IDs is if | 
|---|
| 679 | // PatternID::ZERO was added by the caller previously. In this | 
|---|
| 680 | // case, we are now adding a non-ZERO pattern ID after it, in | 
|---|
| 681 | // which case, we want to make sure to represent ZERO explicitly | 
|---|
| 682 | // now. | 
|---|
| 683 | if self.repr().is_match() { | 
|---|
| 684 | write_u32(self.0, 0) | 
|---|
| 685 | } else { | 
|---|
| 686 | // Otherwise, just make sure the 'is_match' bit is set. | 
|---|
| 687 | self.set_is_match(); | 
|---|
| 688 | } | 
|---|
| 689 | } | 
|---|
| 690 | write_u32(self.0, pid.as_u32()); | 
|---|
| 691 | } | 
|---|
| 692 |  | 
|---|
| 693 | /// Indicate that no more pattern IDs will be added to this state. | 
|---|
| 694 | /// | 
|---|
| 695 | /// Once this is called, callers must not call it or 'add_match_pattern_id' | 
|---|
| 696 | /// again. | 
|---|
| 697 | /// | 
|---|
| 698 | /// This should not be exposed explicitly outside of this module. It | 
|---|
| 699 | /// should be called only when converting a StateBuilderMatches into a | 
|---|
| 700 | /// StateBuilderNFA. | 
|---|
| 701 | fn close_match_pattern_ids(&mut self) { | 
|---|
| 702 | // If we never wrote any pattern IDs, then there's nothing to do here. | 
|---|
| 703 | if !self.repr().has_pattern_ids() { | 
|---|
| 704 | return; | 
|---|
| 705 | } | 
|---|
| 706 | let patsize = PatternID::SIZE; | 
|---|
| 707 | let pattern_bytes = self.0.len() - 13; | 
|---|
| 708 | // Every pattern ID uses 4 bytes, so number of bytes should be | 
|---|
| 709 | // divisible by 4. | 
|---|
| 710 | assert_eq!(pattern_bytes % patsize, 0); | 
|---|
| 711 | // This unwrap is OK since we are guaranteed that the maximum number | 
|---|
| 712 | // of possible patterns fits into a u32. | 
|---|
| 713 | let count32 = u32::try_from(pattern_bytes / patsize).unwrap(); | 
|---|
| 714 | wire::NE::write_u32(count32, &mut self.0[9..13]); | 
|---|
| 715 | } | 
|---|
| 716 |  | 
|---|
| 717 | /// Add an NFA state ID to this state. The order in which NFA states are | 
|---|
| 718 | /// added matters. It is the caller's responsibility to ensure that | 
|---|
| 719 | /// duplicate NFA state IDs are not added. | 
|---|
| 720 | fn add_nfa_state_id(&mut self, prev: &mut StateID, sid: StateID) { | 
|---|
| 721 | let delta = sid.as_i32() - prev.as_i32(); | 
|---|
| 722 | write_vari32(self.0, delta); | 
|---|
| 723 | *prev = sid; | 
|---|
| 724 | } | 
|---|
| 725 |  | 
|---|
| 726 | /// Return a read-only view of this state's representation. | 
|---|
| 727 | fn repr(&self) -> Repr<'_> { | 
|---|
| 728 | Repr(self.0.as_slice()) | 
|---|
| 729 | } | 
|---|
| 730 | } | 
|---|
| 731 |  | 
|---|
| 732 | /// Write a signed 32-bit integer using zig-zag encoding. | 
|---|
| 733 | /// | 
|---|
| 734 | /// https://developers.google.com/protocol-buffers/docs/encoding#varints | 
|---|
| 735 | fn write_vari32(data: &mut Vec<u8>, n: i32) { | 
|---|
| 736 | let mut un: u32 = n.to_bits() << 1; | 
|---|
| 737 | if n < 0 { | 
|---|
| 738 | un = !un; | 
|---|
| 739 | } | 
|---|
| 740 | write_varu32(data, n:un) | 
|---|
| 741 | } | 
|---|
| 742 |  | 
|---|
| 743 | /// Read a signed 32-bit integer using zig-zag encoding. Also, return the | 
|---|
| 744 | /// number of bytes read. | 
|---|
| 745 | /// | 
|---|
| 746 | /// https://developers.google.com/protocol-buffers/docs/encoding#varints | 
|---|
| 747 | fn read_vari32(data: &[u8]) -> (i32, usize) { | 
|---|
| 748 | let (un: u32, i: usize) = read_varu32(data); | 
|---|
| 749 | let mut n: i32 = i32::from_bits(un >> 1); | 
|---|
| 750 | if un & 1 != 0 { | 
|---|
| 751 | n = !n; | 
|---|
| 752 | } | 
|---|
| 753 | (n, i) | 
|---|
| 754 | } | 
|---|
| 755 |  | 
|---|
| 756 | /// Write an unsigned 32-bit integer as a varint. In essence, `n` is written | 
|---|
| 757 | /// as a sequence of bytes where all bytes except for the last one have the | 
|---|
| 758 | /// most significant bit set. The least significant 7 bits correspond to the | 
|---|
| 759 | /// actual bits of `n`. So in the worst case, a varint uses 5 bytes, but in | 
|---|
| 760 | /// very common cases, it uses fewer than 4. | 
|---|
| 761 | /// | 
|---|
| 762 | /// https://developers.google.com/protocol-buffers/docs/encoding#varints | 
|---|
| 763 | fn write_varu32(data: &mut Vec<u8>, mut n: u32) { | 
|---|
| 764 | while n >= 0b1000_0000 { | 
|---|
| 765 | data.push(n.low_u8() | 0b1000_0000); | 
|---|
| 766 | n >>= 7; | 
|---|
| 767 | } | 
|---|
| 768 | data.push(n.low_u8()); | 
|---|
| 769 | } | 
|---|
| 770 |  | 
|---|
| 771 | /// Read an unsigned 32-bit varint. Also, return the number of bytes read. | 
|---|
| 772 | /// | 
|---|
| 773 | /// https://developers.google.com/protocol-buffers/docs/encoding#varints | 
|---|
| 774 | fn read_varu32(data: &[u8]) -> (u32, usize) { | 
|---|
| 775 | // N.B. We can assume correctness here since we know that all varuints are | 
|---|
| 776 | // written with write_varu32. Hence, the 'as' uses and unchecked arithmetic | 
|---|
| 777 | // is all okay. | 
|---|
| 778 | let mut n: u32 = 0; | 
|---|
| 779 | let mut shift: u32 = 0; | 
|---|
| 780 | for (i: usize, &b: u8) in data.iter().enumerate() { | 
|---|
| 781 | if b < 0b1000_0000 { | 
|---|
| 782 | return (n | (u32::from(b) << shift), i + 1); | 
|---|
| 783 | } | 
|---|
| 784 | n |= (u32::from(b) & 0b0111_1111) << shift; | 
|---|
| 785 | shift += 7; | 
|---|
| 786 | } | 
|---|
| 787 | (0, 0) | 
|---|
| 788 | } | 
|---|
| 789 |  | 
|---|
| 790 | /// Push a native-endian encoded `n` on to `dst`. | 
|---|
| 791 | fn write_u32(dst: &mut Vec<u8>, n: u32) { | 
|---|
| 792 | use crate::util::wire::NE; | 
|---|
| 793 |  | 
|---|
| 794 | let start: usize = dst.len(); | 
|---|
| 795 | dst.extend(iter:core::iter::repeat(elt:0).take(mem::size_of::<u32>())); | 
|---|
| 796 | NE::write_u32(n, &mut dst[start..]); | 
|---|
| 797 | } | 
|---|
| 798 |  | 
|---|
| 799 | #[ cfg(test)] | 
|---|
| 800 | mod tests { | 
|---|
| 801 | use alloc::vec; | 
|---|
| 802 |  | 
|---|
| 803 | use quickcheck::quickcheck; | 
|---|
| 804 |  | 
|---|
| 805 | use super::*; | 
|---|
| 806 |  | 
|---|
| 807 | #[ cfg(not(miri))] | 
|---|
| 808 | quickcheck! { | 
|---|
| 809 | fn prop_state_read_write_nfa_state_ids(sids: Vec<StateID>) -> bool { | 
|---|
| 810 | // Builders states do not permit duplicate IDs. | 
|---|
| 811 | let sids = dedup_state_ids(sids); | 
|---|
| 812 |  | 
|---|
| 813 | let mut b = StateBuilderEmpty::new().into_matches().into_nfa(); | 
|---|
| 814 | for &sid in &sids { | 
|---|
| 815 | b.add_nfa_state_id(sid); | 
|---|
| 816 | } | 
|---|
| 817 | let s = b.to_state(); | 
|---|
| 818 | let mut got = vec![]; | 
|---|
| 819 | s.iter_nfa_state_ids(|sid| got.push(sid)); | 
|---|
| 820 | got == sids | 
|---|
| 821 | } | 
|---|
| 822 |  | 
|---|
| 823 | fn prop_state_read_write_pattern_ids(pids: Vec<PatternID>) -> bool { | 
|---|
| 824 | // Builders states do not permit duplicate IDs. | 
|---|
| 825 | let pids = dedup_pattern_ids(pids); | 
|---|
| 826 |  | 
|---|
| 827 | let mut b = StateBuilderEmpty::new().into_matches(); | 
|---|
| 828 | for &pid in &pids { | 
|---|
| 829 | b.add_match_pattern_id(pid); | 
|---|
| 830 | } | 
|---|
| 831 | let s = b.into_nfa().to_state(); | 
|---|
| 832 | let mut got = vec![]; | 
|---|
| 833 | s.iter_match_pattern_ids(|pid| got.push(pid)); | 
|---|
| 834 | got == pids | 
|---|
| 835 | } | 
|---|
| 836 |  | 
|---|
| 837 | fn prop_state_read_write_nfa_state_and_pattern_ids( | 
|---|
| 838 | sids: Vec<StateID>, | 
|---|
| 839 | pids: Vec<PatternID> | 
|---|
| 840 | ) -> bool { | 
|---|
| 841 | // Builders states do not permit duplicate IDs. | 
|---|
| 842 | let sids = dedup_state_ids(sids); | 
|---|
| 843 | let pids = dedup_pattern_ids(pids); | 
|---|
| 844 |  | 
|---|
| 845 | let mut b = StateBuilderEmpty::new().into_matches(); | 
|---|
| 846 | for &pid in &pids { | 
|---|
| 847 | b.add_match_pattern_id(pid); | 
|---|
| 848 | } | 
|---|
| 849 |  | 
|---|
| 850 | let mut b = b.into_nfa(); | 
|---|
| 851 | for &sid in &sids { | 
|---|
| 852 | b.add_nfa_state_id(sid); | 
|---|
| 853 | } | 
|---|
| 854 |  | 
|---|
| 855 | let s = b.to_state(); | 
|---|
| 856 | let mut got_pids = vec![]; | 
|---|
| 857 | s.iter_match_pattern_ids(|pid| got_pids.push(pid)); | 
|---|
| 858 | let mut got_sids = vec![]; | 
|---|
| 859 | s.iter_nfa_state_ids(|sid| got_sids.push(sid)); | 
|---|
| 860 | got_pids == pids && got_sids == sids | 
|---|
| 861 | } | 
|---|
| 862 | } | 
|---|
| 863 |  | 
|---|
| 864 | quickcheck! { | 
|---|
| 865 | fn prop_read_write_varu32(n: u32) -> bool { | 
|---|
| 866 | let mut buf = vec![]; | 
|---|
| 867 | write_varu32(&mut buf, n); | 
|---|
| 868 | let (got, nread) = read_varu32(&buf); | 
|---|
| 869 | nread == buf.len() && got == n | 
|---|
| 870 | } | 
|---|
| 871 |  | 
|---|
| 872 | fn prop_read_write_vari32(n: i32) -> bool { | 
|---|
| 873 | let mut buf = vec![]; | 
|---|
| 874 | write_vari32(&mut buf, n); | 
|---|
| 875 | let (got, nread) = read_vari32(&buf); | 
|---|
| 876 | nread == buf.len() && got == n | 
|---|
| 877 | } | 
|---|
| 878 | } | 
|---|
| 879 |  | 
|---|
| 880 | #[ cfg(not(miri))] | 
|---|
| 881 | fn dedup_state_ids(sids: Vec<StateID>) -> Vec<StateID> { | 
|---|
| 882 | let mut set = alloc::collections::BTreeSet::new(); | 
|---|
| 883 | let mut deduped = vec![]; | 
|---|
| 884 | for sid in sids { | 
|---|
| 885 | if set.contains(&sid) { | 
|---|
| 886 | continue; | 
|---|
| 887 | } | 
|---|
| 888 | set.insert(sid); | 
|---|
| 889 | deduped.push(sid); | 
|---|
| 890 | } | 
|---|
| 891 | deduped | 
|---|
| 892 | } | 
|---|
| 893 |  | 
|---|
| 894 | #[ cfg(not(miri))] | 
|---|
| 895 | fn dedup_pattern_ids(pids: Vec<PatternID>) -> Vec<PatternID> { | 
|---|
| 896 | let mut set = alloc::collections::BTreeSet::new(); | 
|---|
| 897 | let mut deduped = vec![]; | 
|---|
| 898 | for pid in pids { | 
|---|
| 899 | if set.contains(&pid) { | 
|---|
| 900 | continue; | 
|---|
| 901 | } | 
|---|
| 902 | set.insert(pid); | 
|---|
| 903 | deduped.push(pid); | 
|---|
| 904 | } | 
|---|
| 905 | deduped | 
|---|
| 906 | } | 
|---|
| 907 | } | 
|---|
| 908 |  | 
|---|