| 1 | /*! |
| 2 | This crate provides a **lightweight** regex engine for searching strings. The |
| 3 | regex syntax supported by this crate is nearly identical to what is found in |
| 4 | the [`regex`](https://docs.rs/regex) crate. Like the `regex` crate, all regex |
| 5 | searches in this crate have worst case `O(m * n)` time complexity, where `m` is |
| 6 | proportional to the size of the regex and `n` is proportional to the size of |
| 7 | the string being searched. |
| 8 | |
| 9 | The principal difference between the `regex` and `regex-lite` crates is that |
| 10 | the latter prioritizes smaller binary sizes and shorter Rust compile times |
| 11 | over performance and functionality. As a result, regex searches in this crate |
| 12 | are typically substantially slower than what is provided by the `regex` crate. |
| 13 | Moreover, this crate only has the most basic level of Unicode support: it |
| 14 | matches codepoint by codepoint but otherwise doesn't support Unicode case |
| 15 | insensivity or things like `\p{Letter}`. In exchange, this crate contributes |
| 16 | far less to binary size and compiles much more quickly. |
| 17 | |
| 18 | If you just want API documentation, then skip to the [`Regex`] type. Otherwise, |
| 19 | here's a quick example showing one way of parsing the output of a grep-like |
| 20 | program: |
| 21 | |
| 22 | ```rust |
| 23 | use regex_lite::Regex; |
| 24 | |
| 25 | let re = Regex::new(r"(?m)^([^:]+):([0-9]+):(.+)$" ).unwrap(); |
| 26 | let hay = "\ |
| 27 | path/to/foo:54:Blue Harvest |
| 28 | path/to/bar:90:Something, Something, Something, Dark Side |
| 29 | path/to/baz:3:It's a Trap! |
| 30 | " ; |
| 31 | |
| 32 | let mut results = vec![]; |
| 33 | for (_, [path, lineno, line]) in re.captures_iter(hay).map(|c| c.extract()) { |
| 34 | results.push((path, lineno.parse::<u64>()?, line)); |
| 35 | } |
| 36 | assert_eq!(results, vec![ |
| 37 | ("path/to/foo" , 54, "Blue Harvest" ), |
| 38 | ("path/to/bar" , 90, "Something, Something, Something, Dark Side" ), |
| 39 | ("path/to/baz" , 3, "It's a Trap!" ), |
| 40 | ]); |
| 41 | # Ok::<(), Box<dyn std::error::Error>>(()) |
| 42 | ``` |
| 43 | |
| 44 | # Overview |
| 45 | |
| 46 | The primary type in this crate is a [`Regex`]. Its most important methods are |
| 47 | as follows: |
| 48 | |
| 49 | * [`Regex::new`] compiles a regex using the default configuration. A |
| 50 | [`RegexBuilder`] permits setting a non-default configuration. (For example, |
| 51 | case insensitive matching, verbose mode and others.) |
| 52 | * [`Regex::is_match`] reports whether a match exists in a particular haystack. |
| 53 | * [`Regex::find`] reports the byte offsets of a match in a haystack, if one |
| 54 | exists. [`Regex::find_iter`] returns an iterator over all such matches. |
| 55 | * [`Regex::captures`] returns a [`Captures`], which reports both the byte |
| 56 | offsets of a match in a haystack and the byte offsets of each matching capture |
| 57 | group from the regex in the haystack. |
| 58 | [`Regex::captures_iter`] returns an iterator over all such matches. |
| 59 | |
| 60 | Otherwise, this top-level crate documentation is organized as follows: |
| 61 | |
| 62 | * [Usage](#usage) shows how to add the `regex` crate to your Rust project. |
| 63 | * [Examples](#examples) provides a limited selection of regex search examples. |
| 64 | * [Differences with the regex crate](#differences-with-the-regex-crate) |
| 65 | provides a precise description of how `regex-lite` differs from `regex`. |
| 66 | * [Syntax](#syntax) enumerates the specific regex syntax supported by this |
| 67 | crate. |
| 68 | * [Untrusted input](#untrusted-input) discusses how this crate deals with regex |
| 69 | patterns or haystacks that are untrusted. |
| 70 | |
| 71 | # Usage |
| 72 | |
| 73 | The `regex-lite` crate is [on crates.io](https://crates.io/crates/regex-lite) |
| 74 | and can be used by adding `regex-lite` to your dependencies in your project's |
| 75 | `Cargo.toml`. Or more simply, just run `cargo add regex-lite`. |
| 76 | |
| 77 | Here is a complete example that creates a new Rust project, adds a dependency |
| 78 | on `regex-lite`, creates the source code for a regex search and then runs the |
| 79 | program. |
| 80 | |
| 81 | First, create the project in a new directory: |
| 82 | |
| 83 | ```text |
| 84 | $ mkdir regex-example |
| 85 | $ cd regex-example |
| 86 | $ cargo init |
| 87 | ``` |
| 88 | |
| 89 | Second, add a dependency on `regex`: |
| 90 | |
| 91 | ```text |
| 92 | $ cargo add regex-lite |
| 93 | ``` |
| 94 | |
| 95 | Third, edit `src/main.rs`. Delete what's there and replace it with this: |
| 96 | |
| 97 | ``` |
| 98 | use regex_lite::Regex; |
| 99 | |
| 100 | fn main() { |
| 101 | let re = Regex::new(r"Hello (?<name>\w+)!" ).unwrap(); |
| 102 | let Some(caps) = re.captures("Hello Murphy!" ) else { |
| 103 | println!("no match!" ); |
| 104 | return; |
| 105 | }; |
| 106 | println!("The name is: {}" , &caps["name" ]); |
| 107 | } |
| 108 | ``` |
| 109 | |
| 110 | Fourth, run it with `cargo run`: |
| 111 | |
| 112 | ```text |
| 113 | $ cargo run |
| 114 | Compiling regex-lite v0.1.0 |
| 115 | Compiling regex-example v0.1.0 (/tmp/regex-example) |
| 116 | Finished dev [unoptimized + debuginfo] target(s) in 4.22s |
| 117 | Running `target/debug/regex-example` |
| 118 | The name is: Murphy |
| 119 | ``` |
| 120 | |
| 121 | The first time you run the program will show more output like above. But |
| 122 | subsequent runs shouldn't have to re-compile the dependencies. |
| 123 | |
| 124 | # Examples |
| 125 | |
| 126 | This section provides a few examples, in tutorial style, showing how to |
| 127 | search a haystack with a regex. There are more examples throughout the API |
| 128 | documentation. |
| 129 | |
| 130 | Before starting though, it's worth defining a few terms: |
| 131 | |
| 132 | * A **regex** is a Rust value whose type is `Regex`. We use `re` as a |
| 133 | variable name for a regex. |
| 134 | * A **pattern** is the string that is used to build a regex. We use `pat` as |
| 135 | a variable name for a pattern. |
| 136 | * A **haystack** is the string that is searched by a regex. We use `hay` as a |
| 137 | variable name for a haystack. |
| 138 | |
| 139 | Sometimes the words "regex" and "pattern" are used interchangeably. |
| 140 | |
| 141 | General use of regular expressions in this crate proceeds by compiling a |
| 142 | **pattern** into a **regex**, and then using that regex to search, split or |
| 143 | replace parts of a **haystack**. |
| 144 | |
| 145 | ### Example: find a middle initial |
| 146 | |
| 147 | We'll start off with a very simple example: a regex that looks for a specific |
| 148 | name but uses a wildcard to match a middle initial. Our pattern serves as |
| 149 | something like a template that will match a particular name with *any* middle |
| 150 | initial. |
| 151 | |
| 152 | ```rust |
| 153 | use regex_lite::Regex; |
| 154 | |
| 155 | // We use 'unwrap()' here because it would be a bug in our program if the |
| 156 | // pattern failed to compile to a regex. Panicking in the presence of a bug |
| 157 | // is okay. |
| 158 | let re = Regex::new(r"Homer (.)\. Simpson" ).unwrap(); |
| 159 | let hay = "Homer J. Simpson" ; |
| 160 | let Some(caps) = re.captures(hay) else { return }; |
| 161 | assert_eq!("J" , &caps[1]); |
| 162 | ``` |
| 163 | |
| 164 | There are a few things worth noticing here in our first example: |
| 165 | |
| 166 | * The `.` is a special pattern meta character that means "match any single |
| 167 | character except for new lines." (More precisely, in this crate, it means |
| 168 | "match any UTF-8 encoding of any Unicode scalar value other than `\n`.") |
| 169 | * We can match an actual `.` literally by escaping it, i.e., `\.`. |
| 170 | * We use Rust's [raw strings] to avoid needing to deal with escape sequences in |
| 171 | both the regex pattern syntax and in Rust's string literal syntax. If we didn't |
| 172 | use raw strings here, we would have had to use `\\.` to match a literal `.` |
| 173 | character. That is, `r"\."` and `"\\."` are equivalent patterns. |
| 174 | * We put our wildcard `.` instruction in parentheses. These parentheses have a |
| 175 | special meaning that says, "make whatever part of the haystack matches within |
| 176 | these parentheses available as a capturing group." After finding a match, we |
| 177 | access this capture group with `&caps[1]`. |
| 178 | |
| 179 | [raw strings]: https://doc.rust-lang.org/stable/reference/tokens.html#raw-string-literals |
| 180 | |
| 181 | Otherwise, we execute a search using `re.captures(hay)` and return from our |
| 182 | function if no match occurred. We then reference the middle initial by asking |
| 183 | for the part of the haystack that matched the capture group indexed at `1`. |
| 184 | (The capture group at index 0 is implicit and always corresponds to the entire |
| 185 | match. In this case, that's `Homer J. Simpson`.) |
| 186 | |
| 187 | ### Example: named capture groups |
| 188 | |
| 189 | Continuing from our middle initial example above, we can tweak the pattern |
| 190 | slightly to give a name to the group that matches the middle initial: |
| 191 | |
| 192 | ```rust |
| 193 | use regex_lite::Regex; |
| 194 | |
| 195 | // Note that (?P<middle>.) is a different way to spell the same thing. |
| 196 | let re = Regex::new(r"Homer (?<middle>.)\. Simpson" ).unwrap(); |
| 197 | let hay = "Homer J. Simpson" ; |
| 198 | let Some(caps) = re.captures(hay) else { return }; |
| 199 | assert_eq!("J" , &caps["middle" ]); |
| 200 | ``` |
| 201 | |
| 202 | Giving a name to a group can be useful when there are multiple groups in |
| 203 | a pattern. It makes the code referring to those groups a bit easier to |
| 204 | understand. |
| 205 | |
| 206 | ### Example: validating a particular date format |
| 207 | |
| 208 | This examples shows how to confirm whether a haystack, in its entirety, matches |
| 209 | a particular date format: |
| 210 | |
| 211 | ```rust |
| 212 | use regex_lite::Regex; |
| 213 | |
| 214 | let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$" ).unwrap(); |
| 215 | assert!(re.is_match("2010-03-14" )); |
| 216 | ``` |
| 217 | |
| 218 | Notice the use of the `^` and `$` anchors. In this crate, every regex search is |
| 219 | run with an implicit `(?s:.)*?` at the beginning of its pattern, which allows |
| 220 | the regex to match anywhere in a haystack. Anchors, as above, can be used to |
| 221 | ensure that the full haystack matches a pattern. |
| 222 | |
| 223 | ### Example: finding dates in a haystack |
| 224 | |
| 225 | In the previous example, we showed how one might validate that a haystack, |
| 226 | in its entirety, corresponded to a particular date format. But what if we wanted |
| 227 | to extract all things that look like dates in a specific format from a haystack? |
| 228 | To do this, we can use an iterator API to find all matches (notice that we've |
| 229 | removed the anchors): |
| 230 | |
| 231 | ```rust |
| 232 | use regex_lite::Regex; |
| 233 | |
| 234 | let re = Regex::new(r"\d{4}-\d{2}-\d{2}" ).unwrap(); |
| 235 | let hay = "What do 1865-04-14, 1881-07-02, 1901-09-06 and 1963-11-22 have in common?" ; |
| 236 | // 'm' is a 'Match', and 'as_str()' returns the matching part of the haystack. |
| 237 | let dates: Vec<&str> = re.find_iter(hay).map(|m| m.as_str()).collect(); |
| 238 | assert_eq!(dates, vec![ |
| 239 | "1865-04-14" , |
| 240 | "1881-07-02" , |
| 241 | "1901-09-06" , |
| 242 | "1963-11-22" , |
| 243 | ]); |
| 244 | ``` |
| 245 | |
| 246 | We can also iterate over [`Captures`] values instead of [`Match`] values, and |
| 247 | that in turn permits accessing each component of the date via capturing groups: |
| 248 | |
| 249 | ```rust |
| 250 | use regex_lite::Regex; |
| 251 | |
| 252 | let re = Regex::new(r"(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})" ).unwrap(); |
| 253 | let hay = "What do 1865-04-14, 1881-07-02, 1901-09-06 and 1963-11-22 have in common?" ; |
| 254 | // 'm' is a 'Match', and 'as_str()' returns the matching part of the haystack. |
| 255 | let dates: Vec<(&str, &str, &str)> = re.captures_iter(hay).map(|caps| { |
| 256 | // The unwraps are okay because every capture group must match if the whole |
| 257 | // regex matches, and in this context, we know we have a match. |
| 258 | // |
| 259 | // Note that we use `caps.name("y").unwrap().as_str()` instead of |
| 260 | // `&caps["y"]` because the lifetime of the former is the same as the |
| 261 | // lifetime of `hay` above, but the lifetime of the latter is tied to the |
| 262 | // lifetime of `caps` due to how the `Index` trait is defined. |
| 263 | let year = caps.name("y" ).unwrap().as_str(); |
| 264 | let month = caps.name("m" ).unwrap().as_str(); |
| 265 | let day = caps.name("d" ).unwrap().as_str(); |
| 266 | (year, month, day) |
| 267 | }).collect(); |
| 268 | assert_eq!(dates, vec![ |
| 269 | ("1865" , "04" , "14" ), |
| 270 | ("1881" , "07" , "02" ), |
| 271 | ("1901" , "09" , "06" ), |
| 272 | ("1963" , "11" , "22" ), |
| 273 | ]); |
| 274 | ``` |
| 275 | |
| 276 | ### Example: simpler capture group extraction |
| 277 | |
| 278 | One can use [`Captures::extract`] to make the code from the previous example a |
| 279 | bit simpler in this case: |
| 280 | |
| 281 | ```rust |
| 282 | use regex_lite::Regex; |
| 283 | |
| 284 | let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})" ).unwrap(); |
| 285 | let hay = "What do 1865-04-14, 1881-07-02, 1901-09-06 and 1963-11-22 have in common?" ; |
| 286 | let dates: Vec<(&str, &str, &str)> = re.captures_iter(hay).map(|caps| { |
| 287 | let (_, [year, month, day]) = caps.extract(); |
| 288 | (year, month, day) |
| 289 | }).collect(); |
| 290 | assert_eq!(dates, vec![ |
| 291 | ("1865" , "04" , "14" ), |
| 292 | ("1881" , "07" , "02" ), |
| 293 | ("1901" , "09" , "06" ), |
| 294 | ("1963" , "11" , "22" ), |
| 295 | ]); |
| 296 | ``` |
| 297 | |
| 298 | `Captures::extract` works by ensuring that the number of matching groups match |
| 299 | the number of groups requested via the `[year, month, day]` syntax. If they do, |
| 300 | then the substrings for each corresponding capture group are automatically |
| 301 | returned in an appropriately sized array. Rust's syntax for pattern matching |
| 302 | arrays does the rest. |
| 303 | |
| 304 | ### Example: replacement with named capture groups |
| 305 | |
| 306 | Building on the previous example, perhaps we'd like to rearrange the date |
| 307 | formats. This can be done by finding each match and replacing it with |
| 308 | something different. The [`Regex::replace_all`] routine provides a convenient |
| 309 | way to do this, including by supporting references to named groups in the |
| 310 | replacement string: |
| 311 | |
| 312 | ```rust |
| 313 | use regex_lite::Regex; |
| 314 | |
| 315 | let re = Regex::new(r"(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})" ).unwrap(); |
| 316 | let before = "1973-01-05, 1975-08-25 and 1980-10-18" ; |
| 317 | let after = re.replace_all(before, "$m/$d/$y" ); |
| 318 | assert_eq!(after, "01/05/1973, 08/25/1975 and 10/18/1980" ); |
| 319 | ``` |
| 320 | |
| 321 | The replace methods are actually polymorphic in the replacement, which |
| 322 | provides more flexibility than is seen here. (See the documentation for |
| 323 | [`Regex::replace`] for more details.) |
| 324 | |
| 325 | ### Example: verbose mode |
| 326 | |
| 327 | When your regex gets complicated, you might consider using something other |
| 328 | than regex. But if you stick with regex, you can use the `x` flag to enable |
| 329 | insignificant whitespace mode or "verbose mode." In this mode, whitespace |
| 330 | is treated as insignificant and one may write comments. This may make your |
| 331 | patterns easier to comprehend. |
| 332 | |
| 333 | ```rust |
| 334 | use regex_lite::Regex; |
| 335 | |
| 336 | let re = Regex::new(r"(?x) |
| 337 | (?P<y>\d{4}) # the year |
| 338 | - |
| 339 | (?P<m>\d{2}) # the month |
| 340 | - |
| 341 | (?P<d>\d{2}) # the day |
| 342 | " ).unwrap(); |
| 343 | |
| 344 | let before = "1973-01-05, 1975-08-25 and 1980-10-18" ; |
| 345 | let after = re.replace_all(before, "$m/$d/$y" ); |
| 346 | assert_eq!(after, "01/05/1973, 08/25/1975 and 10/18/1980" ); |
| 347 | ``` |
| 348 | |
| 349 | If you wish to match against whitespace in this mode, you can still use `\s`, |
| 350 | `\n`, `\t`, etc. For escaping a single space character, you can escape it |
| 351 | directly with `\ `, use its hex character code `\x20` or temporarily disable |
| 352 | the `x` flag, e.g., `(?-x: )`. |
| 353 | |
| 354 | # Differences with the `regex` crate |
| 355 | |
| 356 | As mentioned in the introduction above, the purpose of this crate is to |
| 357 | prioritize small binary sizes and shorter Rust compilation times as much as |
| 358 | possible. Namely, while the `regex` crate tends to eschew both binary size |
| 359 | and compilation time in favor of faster searches and features, the |
| 360 | `regex-lite` crate gives up faster searches and some functionality in exchange |
| 361 | for smaller binary sizes and faster compilation times. |
| 362 | |
| 363 | The precise set of differences at the syntax level: |
| 364 | |
| 365 | * The Perl character classes are limited to ASCII codepoints. That is, |
| 366 | `\d` is `[0-9]`, `\s` is `[\t\n\v\f\r ]` and `\w` is `[0-9A-Za-z_]`. |
| 367 | * Unicode character classes of the form `\p{...}` and `\P{...}` are not |
| 368 | supported at all. Note though that things like `[^β]` are still supported and |
| 369 | will match any Unicode scalar value except for `β`. |
| 370 | * Case insensitive searching is limited to ASCII case insensitivity. |
| 371 | * Character class set operations other than union are not supported. That is, |
| 372 | difference (`--`), intersection (`&&`) and symmetric difference (`~~`) are |
| 373 | not available. These tend to be most useful with Unicode classes, which also |
| 374 | aren't available. |
| 375 | * Opt-in octal support is not available in this crate. |
| 376 | |
| 377 | And now at the API level: |
| 378 | |
| 379 | * Currently, this crate only supports searching `&str`. It does not have APIs |
| 380 | for searching `&[u8]` haystacks, although it is planned to add these in the |
| 381 | future if there's demand. |
| 382 | * There is no `RegexSet` in this crate and there are no plans to add it. |
| 383 | * The `Error` type in this crate is completely opaque. |
| 384 | |
| 385 | Other than these things, the `regex-lite` crate is intended to be a drop-in |
| 386 | replacement for the `regex` crate. In most cases, you can just replace `use |
| 387 | regex::Regex;` with `use regex_lite::Regex;` and everything will work. (Unless |
| 388 | you're depending on Unicode support in your regexes.) |
| 389 | |
| 390 | # Syntax |
| 391 | |
| 392 | The syntax supported in this crate is documented below. |
| 393 | |
| 394 | ### Matching one character |
| 395 | |
| 396 | <pre class="rust"> |
| 397 | . any character except new line (includes new line with s flag) |
| 398 | [0-9] any ASCII digit |
| 399 | \d digit ([0-9]) |
| 400 | \D not digit |
| 401 | </pre> |
| 402 | |
| 403 | ### Character classes |
| 404 | |
| 405 | <pre class="rust"> |
| 406 | [xyz] A character class matching either x, y or z (union). |
| 407 | [^xyz] A character class matching any character except x, y and z. |
| 408 | [a-z] A character class matching any character in range a-z. |
| 409 | [[:alpha:]] ASCII character class ([A-Za-z]) |
| 410 | [[:^alpha:]] Negated ASCII character class ([^A-Za-z]) |
| 411 | [\[\]] Escaping in character classes (matching [ or ]) |
| 412 | </pre> |
| 413 | |
| 414 | Any ASCII or Perl character class may appear inside a bracketed `[...]` character |
| 415 | class. For example, `[\s[:digit:]]` matches any digit or space character. |
| 416 | |
| 417 | Precedence in character classes, from most binding to least: |
| 418 | |
| 419 | 1. Ranges: `[a-cd]` == `[[a-c]d]` |
| 420 | 2. Union: `[ab&&bc]` == `[[ab]&&[bc]]` |
| 421 | 3. Negation: `[^a-z&&b]` == `[^[a-z&&b]]`. |
| 422 | |
| 423 | ### Composites |
| 424 | |
| 425 | <pre class="rust"> |
| 426 | xy concatenation (x followed by y) |
| 427 | x|y alternation (x or y, prefer x) |
| 428 | </pre> |
| 429 | |
| 430 | This example shows how an alternation works, and what it means to prefer a |
| 431 | branch in the alternation over subsequent branches. |
| 432 | |
| 433 | ``` |
| 434 | use regex_lite::Regex; |
| 435 | |
| 436 | let haystack = "samwise" ; |
| 437 | // If 'samwise' comes first in our alternation, then it is |
| 438 | // preferred as a match, even if the regex engine could |
| 439 | // technically detect that 'sam' led to a match earlier. |
| 440 | let re = Regex::new(r"samwise|sam" ).unwrap(); |
| 441 | assert_eq!("samwise" , re.find(haystack).unwrap().as_str()); |
| 442 | // But if 'sam' comes first, then it will match instead. |
| 443 | // In this case, it is impossible for 'samwise' to match |
| 444 | // because 'sam' is a prefix of it. |
| 445 | let re = Regex::new(r"sam|samwise" ).unwrap(); |
| 446 | assert_eq!("sam" , re.find(haystack).unwrap().as_str()); |
| 447 | ``` |
| 448 | |
| 449 | ### Repetitions |
| 450 | |
| 451 | <pre class="rust"> |
| 452 | x* zero or more of x (greedy) |
| 453 | x+ one or more of x (greedy) |
| 454 | x? zero or one of x (greedy) |
| 455 | x*? zero or more of x (ungreedy/lazy) |
| 456 | x+? one or more of x (ungreedy/lazy) |
| 457 | x?? zero or one of x (ungreedy/lazy) |
| 458 | x{n,m} at least n x and at most m x (greedy) |
| 459 | x{n,} at least n x (greedy) |
| 460 | x{n} exactly n x |
| 461 | x{n,m}? at least n x and at most m x (ungreedy/lazy) |
| 462 | x{n,}? at least n x (ungreedy/lazy) |
| 463 | x{n}? exactly n x |
| 464 | </pre> |
| 465 | |
| 466 | ### Empty matches |
| 467 | |
| 468 | <pre class="rust"> |
| 469 | ^ the beginning of a haystack (or start-of-line with multi-line mode) |
| 470 | $ the end of a haystack (or end-of-line with multi-line mode) |
| 471 | \A only the beginning of a haystack (even with multi-line mode enabled) |
| 472 | \z only the end of a haystack (even with multi-line mode enabled) |
| 473 | \b an ASCII word boundary (\w on one side and \W, \A, or \z on other) |
| 474 | \B not an ASCII word boundary |
| 475 | \b{start}, \< an ASCII start-of-word boundary (\W|\A on the left, \w on the right) |
| 476 | \b{end}, \> an ASCII end-of-word boundary (\w on the left, \W|\z on the right)) |
| 477 | \b{start-half} half of an ASCII start-of-word boundary (\W|\A on the left) |
| 478 | \b{end-half} half of an ASCII end-of-word boundary (\W|\z on the right) |
| 479 | </pre> |
| 480 | |
| 481 | The empty regex is valid and matches the empty string. For example, the |
| 482 | empty regex matches `abc` at positions `0`, `1`, `2` and `3`. When using the |
| 483 | top-level [`Regex`] on `&str` haystacks, an empty match that splits a codepoint |
| 484 | is guaranteed to never be returned. For example: |
| 485 | |
| 486 | ```rust |
| 487 | let re = regex_lite::Regex::new(r"" ).unwrap(); |
| 488 | let ranges: Vec<_> = re.find_iter("💩" ).map(|m| m.range()).collect(); |
| 489 | assert_eq!(ranges, vec![0..0, 4..4]); |
| 490 | ``` |
| 491 | |
| 492 | Note that an empty regex is distinct from a regex that can never match. For |
| 493 | example, the regex `[^\s\S]` is a character class that represents the negation |
| 494 | of `[\s\S]`, where the union of `\s` and `\S` corresponds to all Unicode scalar |
| 495 | values. The negation of everything is nothing, which means the character class |
| 496 | is empty. Since nothing is in the empty set, `[^\s\S]` matches nothing, not |
| 497 | even the empty string. |
| 498 | |
| 499 | ### Grouping and flags |
| 500 | |
| 501 | <pre class="rust"> |
| 502 | (exp) numbered capture group (indexed by opening parenthesis) |
| 503 | (?P<name>exp) named (also numbered) capture group (names must be alpha-numeric) |
| 504 | (?<name>exp) named (also numbered) capture group (names must be alpha-numeric) |
| 505 | (?:exp) non-capturing group |
| 506 | (?flags) set flags within current group |
| 507 | (?flags:exp) set flags for exp (non-capturing) |
| 508 | </pre> |
| 509 | |
| 510 | Capture group names must be any sequence of alpha-numeric Unicode codepoints, |
| 511 | in addition to `.`, `_`, `[` and `]`. Names must start with either an `_` or |
| 512 | an alphabetic codepoint. Alphabetic codepoints correspond to the `Alphabetic` |
| 513 | Unicode property, while numeric codepoints correspond to the union of the |
| 514 | `Decimal_Number`, `Letter_Number` and `Other_Number` general categories. |
| 515 | |
| 516 | Flags are each a single character. For example, `(?x)` sets the flag `x` |
| 517 | and `(?-x)` clears the flag `x`. Multiple flags can be set or cleared at |
| 518 | the same time: `(?xy)` sets both the `x` and `y` flags and `(?x-y)` sets |
| 519 | the `x` flag and clears the `y` flag. |
| 520 | |
| 521 | All flags are by default disabled unless stated otherwise. They are: |
| 522 | |
| 523 | <pre class="rust"> |
| 524 | i case-insensitive: letters match both upper and lower case |
| 525 | m multi-line mode: ^ and $ match begin/end of line |
| 526 | s allow . to match \n |
| 527 | R enables CRLF mode: when multi-line mode is enabled, \r\n is used |
| 528 | U swap the meaning of x* and x*? |
| 529 | x verbose mode, ignores whitespace and allow line comments (starting with `#`) |
| 530 | </pre> |
| 531 | |
| 532 | Note that in verbose mode, whitespace is ignored everywhere, including within |
| 533 | character classes. To insert whitespace, use its escaped form or a hex literal. |
| 534 | For example, `\ ` or `\x20` for an ASCII space. |
| 535 | |
| 536 | Flags can be toggled within a pattern. Here's an example that matches |
| 537 | case-insensitively for the first part but case-sensitively for the second part: |
| 538 | |
| 539 | ```rust |
| 540 | use regex_lite::Regex; |
| 541 | |
| 542 | let re = Regex::new(r"(?i)a+(?-i)b+" ).unwrap(); |
| 543 | let m = re.find("AaAaAbbBBBb" ).unwrap(); |
| 544 | assert_eq!(m.as_str(), "AaAaAbb" ); |
| 545 | ``` |
| 546 | |
| 547 | Notice that the `a+` matches either `a` or `A`, but the `b+` only matches |
| 548 | `b`. |
| 549 | |
| 550 | Multi-line mode means `^` and `$` no longer match just at the beginning/end of |
| 551 | the input, but also at the beginning/end of lines: |
| 552 | |
| 553 | ``` |
| 554 | use regex_lite::Regex; |
| 555 | |
| 556 | let re = Regex::new(r"(?m)^line \d+" ).unwrap(); |
| 557 | let m = re.find("line one \nline 2 \n" ).unwrap(); |
| 558 | assert_eq!(m.as_str(), "line 2" ); |
| 559 | ``` |
| 560 | |
| 561 | Note that `^` matches after new lines, even at the end of input: |
| 562 | |
| 563 | ``` |
| 564 | use regex_lite::Regex; |
| 565 | |
| 566 | let re = Regex::new(r"(?m)^" ).unwrap(); |
| 567 | let m = re.find_iter("test \n" ).last().unwrap(); |
| 568 | assert_eq!((m.start(), m.end()), (5, 5)); |
| 569 | ``` |
| 570 | |
| 571 | When both CRLF mode and multi-line mode are enabled, then `^` and `$` will |
| 572 | match either `\r` and `\n`, but never in the middle of a `\r\n`: |
| 573 | |
| 574 | ``` |
| 575 | use regex_lite::Regex; |
| 576 | |
| 577 | let re = Regex::new(r"(?mR)^foo$" ).unwrap(); |
| 578 | let m = re.find(" \r\nfoo \r\n" ).unwrap(); |
| 579 | assert_eq!(m.as_str(), "foo" ); |
| 580 | ``` |
| 581 | |
| 582 | ### Escape sequences |
| 583 | |
| 584 | Note that this includes all possible escape sequences, even ones that are |
| 585 | documented elsewhere. |
| 586 | |
| 587 | <pre class="rust"> |
| 588 | \* literal *, applies to all ASCII except [0-9A-Za-z<>] |
| 589 | \a bell (\x07) |
| 590 | \f form feed (\x0C) |
| 591 | \t horizontal tab |
| 592 | \n new line |
| 593 | \r carriage return |
| 594 | \v vertical tab (\x0B) |
| 595 | \A matches at the beginning of a haystack |
| 596 | \z matches at the end of a haystack |
| 597 | \b word boundary assertion |
| 598 | \B negated word boundary assertion |
| 599 | \b{start}, \< start-of-word boundary assertion |
| 600 | \b{end}, \> end-of-word boundary assertion |
| 601 | \b{start-half} half of a start-of-word boundary assertion |
| 602 | \b{end-half} half of a end-of-word boundary assertion |
| 603 | \x7F hex character code (exactly two digits) |
| 604 | \x{10FFFF} any hex character code corresponding to a Unicode code point |
| 605 | \u007F hex character code (exactly four digits) |
| 606 | \u{7F} any hex character code corresponding to a Unicode code point |
| 607 | \U0000007F hex character code (exactly eight digits) |
| 608 | \U{7F} any hex character code corresponding to a Unicode code point |
| 609 | \d, \s, \w Perl character class |
| 610 | \D, \S, \W negated Perl character class |
| 611 | </pre> |
| 612 | |
| 613 | ### Perl character classes (ASCII only) |
| 614 | |
| 615 | These character classes are short-hands for common groups of characters. In |
| 616 | this crate, `\d`, `\s` and `\w` only consist of ASCII codepoints. |
| 617 | |
| 618 | <pre class="rust"> |
| 619 | \d digit ([0-9]) |
| 620 | \D not digit |
| 621 | \s whitespace ([\t\n\v\f\r ]) |
| 622 | \S not whitespace |
| 623 | \w word character ([0-9A-Za-z_]) |
| 624 | \W not word character |
| 625 | </pre> |
| 626 | |
| 627 | ### ASCII character classes |
| 628 | |
| 629 | These reflect additional groups of characters taken from POSIX regex syntax |
| 630 | that are sometimes useful to have. In this crate, all of these classes only |
| 631 | consist of ASCII codepoints. |
| 632 | |
| 633 | <pre class="rust"> |
| 634 | [[:alnum:]] alphanumeric ([0-9A-Za-z]) |
| 635 | [[:alpha:]] alphabetic ([A-Za-z]) |
| 636 | [[:ascii:]] ASCII ([\x00-\x7F]) |
| 637 | [[:blank:]] blank ([\t ]) |
| 638 | [[:cntrl:]] control ([\x00-\x1F\x7F]) |
| 639 | [[:digit:]] digits ([0-9]) |
| 640 | [[:graph:]] graphical ([!-~]) |
| 641 | [[:lower:]] lower case ([a-z]) |
| 642 | [[:print:]] printable ([ -~]) |
| 643 | [[:punct:]] punctuation ([!-/:-@\[-`{-~]) |
| 644 | [[:space:]] whitespace ([\t\n\v\f\r ]) |
| 645 | [[:upper:]] upper case ([A-Z]) |
| 646 | [[:word:]] word characters ([0-9A-Za-z_]) |
| 647 | [[:xdigit:]] hex digit ([0-9A-Fa-f]) |
| 648 | </pre> |
| 649 | |
| 650 | # Untrusted input |
| 651 | |
| 652 | This crate is meant to be able to run regex searches on untrusted haystacks |
| 653 | without fear of [ReDoS]. This crate also, to a certain extent, supports |
| 654 | untrusted patterns. |
| 655 | |
| 656 | [ReDoS]: https://en.wikipedia.org/wiki/ReDoS |
| 657 | |
| 658 | This crate differs from most (but not all) other regex engines in that it |
| 659 | doesn't use unbounded backtracking to run a regex search. In those cases, |
| 660 | one generally cannot use untrusted patterns *or* untrusted haystacks because |
| 661 | it can be very difficult to know whether a particular pattern will result in |
| 662 | catastrophic backtracking or not. |
| 663 | |
| 664 | We'll first discuss how this crate deals with untrusted inputs and then wrap |
| 665 | it up with a realistic discussion about what practice really looks like. |
| 666 | |
| 667 | ### Panics |
| 668 | |
| 669 | Outside of clearly documented cases, most APIs in this crate are intended to |
| 670 | never panic regardless of the inputs given to them. For example, `Regex::new`, |
| 671 | `Regex::is_match`, `Regex::find` and `Regex::captures` should never panic. That |
| 672 | is, it is an API promise that those APIs will never panic no matter what inputs |
| 673 | are given to them. With that said, regex engines are complicated beasts, and |
| 674 | providing a rock solid guarantee that these APIs literally never panic is |
| 675 | essentially equivalent to saying, "there are no bugs in this library." That is |
| 676 | a bold claim, and not really one that can be feasibly made with a straight |
| 677 | face. |
| 678 | |
| 679 | Don't get the wrong impression here. This crate is extensively tested, not just |
| 680 | with unit and integration tests, but also via fuzz testing. For example, this |
| 681 | crate is part of the [OSS-fuzz project]. Panics should be incredibly rare, but |
| 682 | it is possible for bugs to exist, and thus possible for a panic to occur. If |
| 683 | you need a rock solid guarantee against panics, then you should wrap calls into |
| 684 | this library with [`std::panic::catch_unwind`]. |
| 685 | |
| 686 | It's also worth pointing out that this library will generally panic when other |
| 687 | regex engines would commit undefined behavior. When undefined behavior occurs, |
| 688 | your program might continue as if nothing bad has happened, but it also might |
| 689 | mean your program is open to the worst kinds of exploits. In contrast, the |
| 690 | worst thing a panic can do is a denial of service. |
| 691 | |
| 692 | [OSS-fuzz project]: https://android.googlesource.com/platform/external/oss-fuzz/+/refs/tags/android-t-preview-1/projects/rust-regex/ |
| 693 | [`std::panic::catch_unwind`]: https://doc.rust-lang.org/std/panic/fn.catch_unwind.html |
| 694 | |
| 695 | ### Untrusted patterns |
| 696 | |
| 697 | The principal way this crate deals with them is by limiting their size by |
| 698 | default. The size limit can be configured via [`RegexBuilder::size_limit`]. The |
| 699 | idea of a size limit is that compiling a pattern into a `Regex` will fail if it |
| 700 | becomes "too big." Namely, while *most* resources consumed by compiling a regex |
| 701 | are approximately proportional to the length of the pattern itself, there is |
| 702 | one particular exception to this: counted repetitions. Namely, this pattern: |
| 703 | |
| 704 | ```text |
| 705 | a{5}{5}{5}{5}{5}{5} |
| 706 | ``` |
| 707 | |
| 708 | Is equivalent to this pattern: |
| 709 | |
| 710 | ```text |
| 711 | a{15625} |
| 712 | ``` |
| 713 | |
| 714 | In both of these cases, the actual pattern string is quite small, but the |
| 715 | resulting `Regex` value is quite large. Indeed, as the first pattern shows, |
| 716 | it isn't enough to locally limit the size of each repetition because they can |
| 717 | be stacked in a way that results in exponential growth. |
| 718 | |
| 719 | To provide a bit more context, a simplified view of regex compilation looks |
| 720 | like this: |
| 721 | |
| 722 | * The pattern string is parsed into a structured representation called an HIR |
| 723 | (high-level intermediate representation). Counted repetitions are not expanded |
| 724 | in this stage. That is, the size of the HIR is proportional to the size |
| 725 | of the pattern with "reasonable" constant factors. In other words, one can |
| 726 | reasonably limit the memory used by an HIR by limiting the length of the |
| 727 | pattern string. |
| 728 | * The HIR is compiled into a [Thompson NFA]. This is the stage at which |
| 729 | something like `\w{5}` is rewritten to `\w\w\w\w\w`. Thus, this is the stage |
| 730 | at which [`RegexBuilder::size_limit`] is enforced. If the NFA exceeds the |
| 731 | configured size, then this stage will fail. |
| 732 | |
| 733 | [Thompson NFA]: https://en.wikipedia.org/wiki/Thompson%27s_construction |
| 734 | |
| 735 | The size limit helps avoid two different kinds of exorbitant resource usage: |
| 736 | |
| 737 | * It avoids permitting exponential memory usage based on the size of the |
| 738 | pattern string. |
| 739 | * It avoids long search times. This will be discussed in more detail in the |
| 740 | next section, but worst case search time *is* dependent on the size of the |
| 741 | regex. So keeping regexes limited to a reasonable size is also a way of keeping |
| 742 | search times reasonable. |
| 743 | |
| 744 | Finally, it's worth pointing out that regex compilation is guaranteed to take |
| 745 | worst case `O(m)` time, where `m` is proportional to the size of regex. The |
| 746 | size of the regex here is *after* the counted repetitions have been expanded. |
| 747 | |
| 748 | **Advice for those using untrusted regexes**: limit the pattern length to |
| 749 | something small and expand it as needed. Configure [`RegexBuilder::size_limit`] |
| 750 | to something small and then expand it as needed. |
| 751 | |
| 752 | ### Untrusted haystacks |
| 753 | |
| 754 | The main way this crate guards against searches from taking a long time is by |
| 755 | using algorithms that guarantee a `O(m * n)` worst case time and space bound. |
| 756 | Namely: |
| 757 | |
| 758 | * `m` is proportional to the size of the regex, where the size of the regex |
| 759 | includes the expansion of all counted repetitions. (See the previous section on |
| 760 | untrusted patterns.) |
| 761 | * `n` is proportional to the length, in bytes, of the haystack. |
| 762 | |
| 763 | In other words, if you consider `m` to be a constant (for example, the regex |
| 764 | pattern is a literal in the source code), then the search can be said to run |
| 765 | in "linear time." Or equivalently, "linear time with respect to the size of the |
| 766 | haystack." |
| 767 | |
| 768 | But the `m` factor here is important not to ignore. If a regex is |
| 769 | particularly big, the search times can get quite slow. This is why, in part, |
| 770 | [`RegexBuilder::size_limit`] exists. |
| 771 | |
| 772 | **Advice for those searching untrusted haystacks**: As long as your regexes |
| 773 | are not enormous, you should expect to be able to search untrusted haystacks |
| 774 | without fear. If you aren't sure, you should benchmark it. Unlike backtracking |
| 775 | engines, if your regex is so big that it's likely to result in slow searches, |
| 776 | this is probably something you'll be able to observe regardless of what the |
| 777 | haystack is made up of. |
| 778 | |
| 779 | ### Iterating over matches |
| 780 | |
| 781 | One thing that is perhaps easy to miss is that the worst case time |
| 782 | complexity bound of `O(m * n)` applies to methods like [`Regex::is_match`], |
| 783 | [`Regex::find`] and [`Regex::captures`]. It does **not** apply to |
| 784 | [`Regex::find_iter`] or [`Regex::captures_iter`]. Namely, since iterating over |
| 785 | all matches can execute many searches, and each search can scan the entire |
| 786 | haystack, the worst case time complexity for iterators is `O(m * n^2)`. |
| 787 | |
| 788 | One example of where this occurs is when a pattern consists of an alternation, |
| 789 | where an earlier branch of the alternation requires scanning the entire |
| 790 | haystack only to discover that there is no match. It also requires a later |
| 791 | branch of the alternation to have matched at the beginning of the search. For |
| 792 | example, consider the pattern `.*[^A-Z]|[A-Z]` and the haystack `AAAAA`. The |
| 793 | first search will scan to the end looking for matches of `.*[^A-Z]` even though |
| 794 | a finite automata engine (as in this crate) knows that `[A-Z]` has already |
| 795 | matched the first character of the haystack. This is due to the greedy nature |
| 796 | of regex searching. That first search will report a match at the first `A` only |
| 797 | after scanning to the end to discover that no other match exists. The next |
| 798 | search then begins at the second `A` and the behavior repeats. |
| 799 | |
| 800 | There is no way to avoid this. This means that if both patterns and haystacks |
| 801 | are untrusted and you're iterating over all matches, you're susceptible |
| 802 | to worst case quadratic time complexity. One possible way to mitigate |
| 803 | this is to switch to the lower level `regex-automata` crate and use its |
| 804 | `meta::Regex` iterator APIs. There, you can configure the search to operate |
| 805 | in "earliest" mode by passing a `Input::new(haystack).earliest(true)` to |
| 806 | `meta::Regex::find_iter` (for example). By enabling this mode, you give up |
| 807 | the normal greedy match semantics of regex searches and instead ask the regex |
| 808 | engine to immediately stop as soon as a match has been found. Enabling this |
| 809 | mode will thus restore the worst case `O(m * n)` time complexity bound, but at |
| 810 | the cost of different semantics. |
| 811 | |
| 812 | ### Untrusted inputs in practice |
| 813 | |
| 814 | While providing a `O(m * n)` worst case time bound on all searches goes a long |
| 815 | way toward preventing [ReDoS], that doesn't mean every search you can possibly |
| 816 | run will complete without burning CPU time. In general, there are a few ways |
| 817 | for the `m * n` time bound to still bite you: |
| 818 | |
| 819 | * You are searching an exceptionally long haystack. No matter how you slice |
| 820 | it, a longer haystack will take more time to search. |
| 821 | * Very large regexes can searches to be quite slow due to increasing the size |
| 822 | `m` in the worst case `O(m * n)` bound. This is especially true when they |
| 823 | are combined with counted repetitions. While the regex size limit above will |
| 824 | protect you from the most egregious cases, the default size limit still |
| 825 | permits pretty big regexes that can execute more slowly than one might expect. |
| 826 | * While routines like [`Regex::find`] and [`Regex::captures`] guarantee |
| 827 | worst case `O(m * n)` search time, routines like [`Regex::find_iter`] and |
| 828 | [`Regex::captures_iter`] actually have worst case `O(m * n^2)` search time. |
| 829 | This is because `find_iter` runs many searches, and each search takes worst |
| 830 | case `O(m * n)` time. Thus, iteration of all matches in a haystack has |
| 831 | worst case `O(m * n^2)`. A good example of a pattern that exhibits this is |
| 832 | `(?:A+){1000}|` or even `.*[^A-Z]|[A-Z]`. |
| 833 | |
| 834 | In general, unstrusted haystacks are easier to stomach than untrusted patterns. |
| 835 | Untrusted patterns give a lot more control to the caller to impact the |
| 836 | performance of a search. Therefore, permitting untrusted patterns means that |
| 837 | your only line of defense is to put a limit on how big `m` (and perhaps also |
| 838 | `n`) can be in `O(m * n)`. `n` is limited by simply inspecting the length |
| 839 | of the haystack while `m` is limited by *both* applying a limit to the |
| 840 | length of the pattern *and* a limit on the compiled size of the regex via |
| 841 | [`RegexBuilder::size_limit`]. |
| 842 | |
| 843 | It bears repeating: if you're accepting untrusted patterns, it would be a good |
| 844 | idea to start with conservative limits on `m` and `n`, and then carefully |
| 845 | increase them as needed. |
| 846 | */ |
| 847 | |
| 848 | #![no_std ] |
| 849 | // I'm not ideologically opposed to allowing non-safe code in this crate, but |
| 850 | // IMO it needs really excellent justification. One likely place where this |
| 851 | // could show up is if and when we support a non-std alloc mode. In that case, |
| 852 | // we need some way to synchronize access to a PikeVM cache. That in turn will |
| 853 | // likely require rolling our own primitive spin-lock or similar structure. |
| 854 | #![forbid (unsafe_code)] |
| 855 | #![deny (missing_docs, rustdoc::broken_intra_doc_links)] |
| 856 | #![warn (missing_debug_implementations)] |
| 857 | // When the main features are disabled, squash dead code warnings. The |
| 858 | // alternative is to litter conditional compilation directives everywhere, |
| 859 | // which is super annoying. |
| 860 | #![cfg_attr (not(feature = "string" ), allow(dead_code))] |
| 861 | |
| 862 | #[cfg (not(feature = "std" ))] |
| 863 | compile_error!("'std' is currently a required feature, please file an issue" ); |
| 864 | |
| 865 | #[cfg (not(any(target_pointer_width = "32" , target_pointer_width = "64" )))] |
| 866 | compile_error!("not supported on non-{32,64}, please file an issue" ); |
| 867 | |
| 868 | extern crate alloc; |
| 869 | #[cfg (any(test, feature = "std" ))] |
| 870 | extern crate std; |
| 871 | |
| 872 | #[cfg (feature = "string" )] |
| 873 | pub use self::string::*; |
| 874 | pub use self::{error::Error, hir::escape}; |
| 875 | |
| 876 | mod error; |
| 877 | mod hir; |
| 878 | mod int; |
| 879 | mod interpolate; |
| 880 | mod nfa; |
| 881 | mod pikevm; |
| 882 | mod pool; |
| 883 | #[cfg (feature = "string" )] |
| 884 | mod string; |
| 885 | mod utf8; |
| 886 | |