/*!
Search for regex matches in `&[u8]` haystacks.

This module provides a nearly identical API via [`Regex`] to the one found in
the top-level of this crate. There are two important differences:

1. Matching is done on `&[u8]` instead of `&str`. Additionally, `Vec<u8>`
is used where `String` would have been used in the top-level API.
2. Unicode support can be disabled even when disabling it would result in
matching invalid UTF-8 bytes.

# Example: match null-terminated strings

This shows how to find all null-terminated strings in a slice of bytes. This
works even if a C string contains invalid UTF-8.

```rust
use regex::bytes::Regex;

let re = Regex::new(r"(?-u)(?<cstr>[^\x00]+)\x00").unwrap();
let hay = b"foo\x00qu\xFFux\x00baz\x00";

// Extract all of the strings without the NUL terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> =
    re.captures_iter(hay)
      .map(|c| c.name("cstr").unwrap().as_bytes())
      .collect();
assert_eq!(cstrs, vec![&b"foo"[..], &b"qu\xFFux"[..], &b"baz"[..]]);
```

# Example: selectively enable Unicode support

This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded
string (e.g., to extract a title from a Matroska file):

```rust
use regex::bytes::Regex;

let re = Regex::new(
    r"(?-u)\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))"
).unwrap();
let hay = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65";

// Notice that despite the `.*` at the end, it will only match valid UTF-8
// because Unicode mode was enabled with the `u` flag. Without the `u` flag,
// the `.*` would match the rest of the bytes regardless of whether they were
// valid UTF-8.
let (_, [title]) = re.captures(hay).unwrap().extract();
assert_eq!(title, b"\xE2\x98\x83");
// We can UTF-8 decode the title now. And the unwrap here
// is correct because the existence of a match guarantees
// that `title` is valid UTF-8.
let title = std::str::from_utf8(title).unwrap();
assert_eq!(title, "☃");
```

In general, if the Unicode flag is enabled in a capture group and that capture
is part of the overall match, then the capture is *guaranteed* to be valid
UTF-8.

# Syntax

The supported syntax is nearly the same as the syntax for Unicode regular
expressions, with a few changes that make sense for matching arbitrary bytes:

1. The `u` flag can be disabled even when disabling it might cause the regex to
match invalid UTF-8. When the `u` flag is disabled, the regex is said to be in
"ASCII compatible" mode.
2. In ASCII compatible mode, Unicode character classes are not allowed. Literal
Unicode scalar values outside of character classes are allowed.
3. In ASCII compatible mode, Perl character classes (`\w`, `\d` and `\s`)
revert to their typical ASCII definition. `\w` maps to `[[:word:]]`, `\d` maps
to `[[:digit:]]` and `\s` maps to `[[:space:]]`.
4. In ASCII compatible mode, word boundaries use the ASCII compatible `\w` to
determine whether a byte is a word byte or not.
5. Hexadecimal notation can be used to specify arbitrary bytes instead of
Unicode codepoints. For example, in ASCII compatible mode, `\xFF` matches the
literal byte `\xFF`, while in Unicode mode, `\xFF` is the Unicode codepoint
`U+00FF` that matches its UTF-8 encoding of `\xC3\xBF`. Similarly for octal
notation when enabled.
6. In ASCII compatible mode, `.` matches any *byte* except for `\n`. When the
`s` flag is additionally enabled, `.` matches any byte.

# Performance

In general, one should expect performance on `&[u8]` to be roughly similar to
performance on `&str`.
*/
pub use crate::{builders::bytes::*, regex::bytes::*, regexset::bytes::*};
