/*!
Search for regex matches in `&[u8]` haystacks.

This module provides a nearly identical API via [`Regex`] to the one found in
the top-level of this crate. There are two important differences:

1. Matching is done on `&[u8]` instead of `&str`. Additionally, `Vec<u8>`
is used where `String` would have been used in the top-level API.
2. Unicode support can be disabled even when disabling it would result in
matching invalid UTF-8 bytes.

# Example: match null terminated string

This shows how to find all null-terminated strings in a slice of bytes. This
works even if a C string contains invalid UTF-8.

```rust
use regex::bytes::Regex;

let re = Regex::new(r"(?-u)(?<cstr>[^\x00]+)\x00").unwrap();
let hay = b"foo\x00qu\xFFux\x00baz\x00";

// Extract all of the strings without the NUL terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> =
    re.captures_iter(hay)
      .map(|c| c.name("cstr").unwrap().as_bytes())
      .collect();
assert_eq!(cstrs, vec![&b"foo"[..], &b"qu\xFFux"[..], &b"baz"[..]]);
```

# Example: selectively enable Unicode support

This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded
string (e.g., to extract a title from a Matroska file):

```rust
use regex::bytes::Regex;

let re = Regex::new(
    r"(?-u)\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))"
).unwrap();
let hay = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65";

// Notice that despite the `.*` at the end, it will only match valid UTF-8
// because Unicode mode was enabled with the `u` flag. Without the `u` flag,
// the `.*` would match the rest of the bytes regardless of whether they were
// valid UTF-8.
let (_, [title]) = re.captures(hay).unwrap().extract();
assert_eq!(title, b"\xE2\x98\x83");
// We can UTF-8 decode the title now. And the unwrap here
// is correct because the existence of a match guarantees
// that `title` is valid UTF-8.
let title = std::str::from_utf8(title).unwrap();
assert_eq!(title, "☃");
```

In general, if the Unicode flag is enabled in a capture group and that capture
is part of the overall match, then the capture is *guaranteed* to be valid
UTF-8.

# Syntax

The supported syntax is pretty much the same as the syntax for Unicode
regular expressions with a few changes that make sense for matching arbitrary
bytes:

1. The `u` flag can be disabled even when disabling it might cause the regex to
match invalid UTF-8. When the `u` flag is disabled, the regex is said to be in
"ASCII compatible" mode.
2. In ASCII compatible mode, Unicode character classes are not allowed. Literal
Unicode scalar values outside of character classes are allowed.
3. In ASCII compatible mode, Perl character classes (`\w`, `\d` and `\s`)
revert to their typical ASCII definition. `\w` maps to `[[:word:]]`, `\d` maps
to `[[:digit:]]` and `\s` maps to `[[:space:]]`.
4. In ASCII compatible mode, word boundaries use the ASCII compatible `\w` to
determine whether a byte is a word byte or not.
5. Hexadecimal notation can be used to specify arbitrary bytes instead of
Unicode codepoints. For example, in ASCII compatible mode, `\xFF` matches the
literal byte `\xFF`, while in Unicode mode, `\xFF` is the Unicode codepoint
`U+00FF` that matches its UTF-8 encoding of `\xC3\xBF`. Similarly for octal
notation when enabled.
6. In ASCII compatible mode, `.` matches any *byte* except for `\n`. When the
`s` flag is additionally enabled, `.` matches any byte.

# Performance

In general, one should expect performance on `&[u8]` to be roughly similar to
performance on `&str`.
*/
pub use crate::{builders::bytes::*, regex::bytes::*, regexset::bytes::*};