/*!
Search for regex matches in `&[u8]` haystacks.

This module provides a nearly identical API via [`Regex`] to the one found in
the top-level of this crate. There are two important differences:

1. Matching is done on `&[u8]` instead of `&str`. Additionally, `Vec<u8>`
   is used where `String` would have been used in the top-level API.
2. Unicode support can be disabled even when disabling it would result in
   matching invalid UTF-8 bytes.

# Example: match null terminated string

This shows how to find all null-terminated strings in a slice of bytes. This
works even if a C string contains invalid UTF-8.

```rust
use regex::bytes::Regex;

let re = Regex::new(r"(?-u)(?<cstr>[^\x00]+)\x00").unwrap();
let hay = b"foo\x00qu\xFFux\x00baz\x00";

// Extract all of the strings without the NUL terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> =
    re.captures_iter(hay)
      .map(|c| c.name("cstr").unwrap().as_bytes())
      .collect();
assert_eq!(cstrs, vec![&b"foo"[..], &b"qu\xFFux"[..], &b"baz"[..]]);
```

# Example: selectively enable Unicode support

This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded
string (e.g., to extract a title from a Matroska file):

```rust
use regex::bytes::Regex;

let re = Regex::new(
    r"(?-u)\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))"
).unwrap();
let hay = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65";

// Notice that despite the `.*` at the end, it will only match valid UTF-8
// because Unicode mode was enabled with the `u` flag. Without the `u` flag,
// the `.*` would match the rest of the bytes regardless of whether they were
// valid UTF-8.
let (_, [title]) = re.captures(hay).unwrap().extract();
assert_eq!(title, b"\xE2\x98\x83");
// We can UTF-8 decode the title now. And the unwrap here
// is correct because the existence of a match guarantees
// that `title` is valid UTF-8.
let title = std::str::from_utf8(title).unwrap();
assert_eq!(title, "☃");
```

In general, if the Unicode flag is enabled in a capture group and that capture
is part of the overall match, then the capture is *guaranteed* to be valid
UTF-8.

# Syntax

The supported syntax is pretty much the same as the syntax for Unicode
regular expressions with a few changes that make sense for matching arbitrary
bytes:

1. The `u` flag can be disabled even when disabling it might cause the regex to
   match invalid UTF-8. When the `u` flag is disabled, the regex is said to be
   in "ASCII compatible" mode.
2. In ASCII compatible mode, Unicode character classes are not allowed. Literal
   Unicode scalar values outside of character classes are allowed.
3. In ASCII compatible mode, Perl character classes (`\w`, `\d` and `\s`)
   revert to their typical ASCII definition. `\w` maps to `[[:word:]]`, `\d`
   maps to `[[:digit:]]` and `\s` maps to `[[:space:]]`.
4. In ASCII compatible mode, word boundaries use the ASCII compatible `\w` to
   determine whether a byte is a word byte or not.
5. Hexadecimal notation can be used to specify arbitrary bytes instead of
   Unicode codepoints. For example, in ASCII compatible mode, `\xFF` matches
   the literal byte `\xFF`, while in Unicode mode, `\xFF` is the Unicode
   codepoint `U+00FF` that matches its UTF-8 encoding of `\xC3\xBF`. Similarly
   for octal notation when enabled.
6. In ASCII compatible mode, `.` matches any *byte* except for `\n`. When the
   `s` flag is additionally enabled, `.` matches any byte.

# Performance

In general, one should expect performance on `&[u8]` to be roughly similar to
performance on `&str`.
*/
pub use crate::{builders::bytes::*, regex::bytes::*, regexset::bytes::*};
