1 | /*! |
2 | This crate provides a **lightweight** regex engine for searching strings. The |
3 | regex syntax supported by this crate is nearly identical to what is found in |
4 | the [`regex`](https://docs.rs/regex) crate. Like the `regex` crate, all regex |
5 | searches in this crate have worst case `O(m * n)` time complexity, where `m` is |
6 | proportional to the size of the regex and `n` is proportional to the size of |
7 | the string being searched. |
8 | |
9 | The principal difference between the `regex` and `regex-lite` crates is that |
10 | the latter prioritizes smaller binary sizes and shorter Rust compile times |
11 | over performance and functionality. As a result, regex searches in this crate |
12 | are typically substantially slower than what is provided by the `regex` crate. |
13 | Moreover, this crate only has the most basic level of Unicode support: it |
14 | matches codepoint by codepoint but otherwise doesn't support Unicode case |
15 | insensivity or things like `\p{Letter}`. In exchange, this crate contributes |
16 | far less to binary size and compiles much more quickly. |
17 | |
18 | If you just want API documentation, then skip to the [`Regex`] type. Otherwise, |
19 | here's a quick example showing one way of parsing the output of a grep-like |
20 | program: |
21 | |
22 | ```rust |
23 | use regex_lite::Regex; |
24 | |
25 | let re = Regex::new(r"(?m)^([^:]+):([0-9]+):(.+)$" ).unwrap(); |
26 | let hay = "\ |
27 | path/to/foo:54:Blue Harvest |
28 | path/to/bar:90:Something, Something, Something, Dark Side |
29 | path/to/baz:3:It's a Trap! |
30 | " ; |
31 | |
32 | let mut results = vec![]; |
33 | for (_, [path, lineno, line]) in re.captures_iter(hay).map(|c| c.extract()) { |
34 | results.push((path, lineno.parse::<u64>()?, line)); |
35 | } |
36 | assert_eq!(results, vec![ |
37 | ("path/to/foo" , 54, "Blue Harvest" ), |
38 | ("path/to/bar" , 90, "Something, Something, Something, Dark Side" ), |
39 | ("path/to/baz" , 3, "It's a Trap!" ), |
40 | ]); |
41 | # Ok::<(), Box<dyn std::error::Error>>(()) |
42 | ``` |
43 | |
44 | # Overview |
45 | |
46 | The primary type in this crate is a [`Regex`]. Its most important methods are |
47 | as follows: |
48 | |
49 | * [`Regex::new`] compiles a regex using the default configuration. A |
50 | [`RegexBuilder`] permits setting a non-default configuration. (For example, |
51 | case insensitive matching, verbose mode and others.) |
52 | * [`Regex::is_match`] reports whether a match exists in a particular haystack. |
53 | * [`Regex::find`] reports the byte offsets of a match in a haystack, if one |
54 | exists. [`Regex::find_iter`] returns an iterator over all such matches. |
55 | * [`Regex::captures`] returns a [`Captures`], which reports both the byte |
56 | offsets of a match in a haystack and the byte offsets of each matching capture |
57 | group from the regex in the haystack. |
58 | [`Regex::captures_iter`] returns an iterator over all such matches. |
59 | |
60 | Otherwise, this top-level crate documentation is organized as follows: |
61 | |
62 | * [Usage](#usage) shows how to add the `regex` crate to your Rust project. |
63 | * [Examples](#examples) provides a limited selection of regex search examples. |
64 | * [Differences with the regex crate](#differences-with-the-regex-crate) |
65 | provides a precise description of how `regex-lite` differs from `regex`. |
66 | * [Syntax](#syntax) enumerates the specific regex syntax supported by this |
67 | crate. |
68 | * [Untrusted input](#untrusted-input) discusses how this crate deals with regex |
69 | patterns or haystacks that are untrusted. |
70 | |
71 | # Usage |
72 | |
73 | The `regex-lite` crate is [on crates.io](https://crates.io/crates/regex-lite) |
74 | and can be used by adding `regex-lite` to your dependencies in your project's |
75 | `Cargo.toml`. Or more simply, just run `cargo add regex-lite`. |
76 | |
77 | Here is a complete example that creates a new Rust project, adds a dependency |
78 | on `regex-lite`, creates the source code for a regex search and then runs the |
79 | program. |
80 | |
81 | First, create the project in a new directory: |
82 | |
83 | ```text |
84 | $ mkdir regex-example |
85 | $ cd regex-example |
86 | $ cargo init |
87 | ``` |
88 | |
89 | Second, add a dependency on `regex`: |
90 | |
91 | ```text |
92 | $ cargo add regex-lite |
93 | ``` |
94 | |
95 | Third, edit `src/main.rs`. Delete what's there and replace it with this: |
96 | |
97 | ``` |
98 | use regex_lite::Regex; |
99 | |
100 | fn main() { |
101 | let re = Regex::new(r"Hello (?<name>\w+)!" ).unwrap(); |
102 | let Some(caps) = re.captures("Hello Murphy!" ) else { |
103 | println!("no match!" ); |
104 | return; |
105 | }; |
106 | println!("The name is: {}" , &caps["name" ]); |
107 | } |
108 | ``` |
109 | |
110 | Fourth, run it with `cargo run`: |
111 | |
112 | ```text |
113 | $ cargo run |
114 | Compiling regex-lite v0.1.0 |
115 | Compiling regex-example v0.1.0 (/tmp/regex-example) |
116 | Finished dev [unoptimized + debuginfo] target(s) in 4.22s |
117 | Running `target/debug/regex-example` |
118 | The name is: Murphy |
119 | ``` |
120 | |
121 | The first time you run the program will show more output like above. But |
122 | subsequent runs shouldn't have to re-compile the dependencies. |
123 | |
124 | # Examples |
125 | |
126 | This section provides a few examples, in tutorial style, showing how to |
127 | search a haystack with a regex. There are more examples throughout the API |
128 | documentation. |
129 | |
130 | Before starting though, it's worth defining a few terms: |
131 | |
132 | * A **regex** is a Rust value whose type is `Regex`. We use `re` as a |
133 | variable name for a regex. |
134 | * A **pattern** is the string that is used to build a regex. We use `pat` as |
135 | a variable name for a pattern. |
136 | * A **haystack** is the string that is searched by a regex. We use `hay` as a |
137 | variable name for a haystack. |
138 | |
139 | Sometimes the words "regex" and "pattern" are used interchangeably. |
140 | |
141 | General use of regular expressions in this crate proceeds by compiling a |
142 | **pattern** into a **regex**, and then using that regex to search, split or |
143 | replace parts of a **haystack**. |
144 | |
145 | ### Example: find a middle initial |
146 | |
147 | We'll start off with a very simple example: a regex that looks for a specific |
148 | name but uses a wildcard to match a middle initial. Our pattern serves as |
149 | something like a template that will match a particular name with *any* middle |
150 | initial. |
151 | |
152 | ```rust |
153 | use regex_lite::Regex; |
154 | |
155 | // We use 'unwrap()' here because it would be a bug in our program if the |
156 | // pattern failed to compile to a regex. Panicking in the presence of a bug |
157 | // is okay. |
158 | let re = Regex::new(r"Homer (.)\. Simpson" ).unwrap(); |
159 | let hay = "Homer J. Simpson" ; |
160 | let Some(caps) = re.captures(hay) else { return }; |
161 | assert_eq!("J" , &caps[1]); |
162 | ``` |
163 | |
164 | There are a few things worth noticing here in our first example: |
165 | |
166 | * The `.` is a special pattern meta character that means "match any single |
167 | character except for new lines." (More precisely, in this crate, it means |
168 | "match any UTF-8 encoding of any Unicode scalar value other than `\n`.") |
169 | * We can match an actual `.` literally by escaping it, i.e., `\.`. |
170 | * We use Rust's [raw strings] to avoid needing to deal with escape sequences in |
171 | both the regex pattern syntax and in Rust's string literal syntax. If we didn't |
172 | use raw strings here, we would have had to use `\\.` to match a literal `.` |
173 | character. That is, `r"\."` and `"\\."` are equivalent patterns. |
174 | * We put our wildcard `.` instruction in parentheses. These parentheses have a |
175 | special meaning that says, "make whatever part of the haystack matches within |
176 | these parentheses available as a capturing group." After finding a match, we |
177 | access this capture group with `&caps[1]`. |
178 | |
179 | [raw strings]: https://doc.rust-lang.org/stable/reference/tokens.html#raw-string-literals |
180 | |
181 | Otherwise, we execute a search using `re.captures(hay)` and return from our |
182 | function if no match occurred. We then reference the middle initial by asking |
183 | for the part of the haystack that matched the capture group indexed at `1`. |
184 | (The capture group at index 0 is implicit and always corresponds to the entire |
185 | match. In this case, that's `Homer J. Simpson`.) |
186 | |
187 | ### Example: named capture groups |
188 | |
189 | Continuing from our middle initial example above, we can tweak the pattern |
190 | slightly to give a name to the group that matches the middle initial: |
191 | |
192 | ```rust |
193 | use regex_lite::Regex; |
194 | |
195 | // Note that (?P<middle>.) is a different way to spell the same thing. |
196 | let re = Regex::new(r"Homer (?<middle>.)\. Simpson" ).unwrap(); |
197 | let hay = "Homer J. Simpson" ; |
198 | let Some(caps) = re.captures(hay) else { return }; |
199 | assert_eq!("J" , &caps["middle" ]); |
200 | ``` |
201 | |
202 | Giving a name to a group can be useful when there are multiple groups in |
203 | a pattern. It makes the code referring to those groups a bit easier to |
204 | understand. |
205 | |
206 | ### Example: validating a particular date format |
207 | |
208 | This examples shows how to confirm whether a haystack, in its entirety, matches |
209 | a particular date format: |
210 | |
211 | ```rust |
212 | use regex_lite::Regex; |
213 | |
214 | let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$" ).unwrap(); |
215 | assert!(re.is_match("2010-03-14" )); |
216 | ``` |
217 | |
218 | Notice the use of the `^` and `$` anchors. In this crate, every regex search is |
219 | run with an implicit `(?s:.)*?` at the beginning of its pattern, which allows |
220 | the regex to match anywhere in a haystack. Anchors, as above, can be used to |
221 | ensure that the full haystack matches a pattern. |
222 | |
223 | ### Example: finding dates in a haystack |
224 | |
225 | In the previous example, we showed how one might validate that a haystack, |
226 | in its entirety, corresponded to a particular date format. But what if we wanted |
227 | to extract all things that look like dates in a specific format from a haystack? |
228 | To do this, we can use an iterator API to find all matches (notice that we've |
229 | removed the anchors): |
230 | |
231 | ```rust |
232 | use regex_lite::Regex; |
233 | |
234 | let re = Regex::new(r"\d{4}-\d{2}-\d{2}" ).unwrap(); |
235 | let hay = "What do 1865-04-14, 1881-07-02, 1901-09-06 and 1963-11-22 have in common?" ; |
236 | // 'm' is a 'Match', and 'as_str()' returns the matching part of the haystack. |
237 | let dates: Vec<&str> = re.find_iter(hay).map(|m| m.as_str()).collect(); |
238 | assert_eq!(dates, vec![ |
239 | "1865-04-14" , |
240 | "1881-07-02" , |
241 | "1901-09-06" , |
242 | "1963-11-22" , |
243 | ]); |
244 | ``` |
245 | |
246 | We can also iterate over [`Captures`] values instead of [`Match`] values, and |
247 | that in turn permits accessing each component of the date via capturing groups: |
248 | |
249 | ```rust |
250 | use regex_lite::Regex; |
251 | |
252 | let re = Regex::new(r"(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})" ).unwrap(); |
253 | let hay = "What do 1865-04-14, 1881-07-02, 1901-09-06 and 1963-11-22 have in common?" ; |
254 | // 'm' is a 'Match', and 'as_str()' returns the matching part of the haystack. |
255 | let dates: Vec<(&str, &str, &str)> = re.captures_iter(hay).map(|caps| { |
256 | // The unwraps are okay because every capture group must match if the whole |
257 | // regex matches, and in this context, we know we have a match. |
258 | // |
259 | // Note that we use `caps.name("y").unwrap().as_str()` instead of |
260 | // `&caps["y"]` because the lifetime of the former is the same as the |
261 | // lifetime of `hay` above, but the lifetime of the latter is tied to the |
262 | // lifetime of `caps` due to how the `Index` trait is defined. |
263 | let year = caps.name("y" ).unwrap().as_str(); |
264 | let month = caps.name("m" ).unwrap().as_str(); |
265 | let day = caps.name("d" ).unwrap().as_str(); |
266 | (year, month, day) |
267 | }).collect(); |
268 | assert_eq!(dates, vec![ |
269 | ("1865" , "04" , "14" ), |
270 | ("1881" , "07" , "02" ), |
271 | ("1901" , "09" , "06" ), |
272 | ("1963" , "11" , "22" ), |
273 | ]); |
274 | ``` |
275 | |
276 | ### Example: simpler capture group extraction |
277 | |
278 | One can use [`Captures::extract`] to make the code from the previous example a |
279 | bit simpler in this case: |
280 | |
281 | ```rust |
282 | use regex_lite::Regex; |
283 | |
284 | let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})" ).unwrap(); |
285 | let hay = "What do 1865-04-14, 1881-07-02, 1901-09-06 and 1963-11-22 have in common?" ; |
286 | let dates: Vec<(&str, &str, &str)> = re.captures_iter(hay).map(|caps| { |
287 | let (_, [year, month, day]) = caps.extract(); |
288 | (year, month, day) |
289 | }).collect(); |
290 | assert_eq!(dates, vec![ |
291 | ("1865" , "04" , "14" ), |
292 | ("1881" , "07" , "02" ), |
293 | ("1901" , "09" , "06" ), |
294 | ("1963" , "11" , "22" ), |
295 | ]); |
296 | ``` |
297 | |
298 | `Captures::extract` works by ensuring that the number of matching groups match |
299 | the number of groups requested via the `[year, month, day]` syntax. If they do, |
300 | then the substrings for each corresponding capture group are automatically |
301 | returned in an appropriately sized array. Rust's syntax for pattern matching |
302 | arrays does the rest. |
303 | |
304 | ### Example: replacement with named capture groups |
305 | |
306 | Building on the previous example, perhaps we'd like to rearrange the date |
307 | formats. This can be done by finding each match and replacing it with |
308 | something different. The [`Regex::replace_all`] routine provides a convenient |
309 | way to do this, including by supporting references to named groups in the |
310 | replacement string: |
311 | |
312 | ```rust |
313 | use regex_lite::Regex; |
314 | |
315 | let re = Regex::new(r"(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})" ).unwrap(); |
316 | let before = "1973-01-05, 1975-08-25 and 1980-10-18" ; |
317 | let after = re.replace_all(before, "$m/$d/$y" ); |
318 | assert_eq!(after, "01/05/1973, 08/25/1975 and 10/18/1980" ); |
319 | ``` |
320 | |
321 | The replace methods are actually polymorphic in the replacement, which |
322 | provides more flexibility than is seen here. (See the documentation for |
323 | [`Regex::replace`] for more details.) |
324 | |
325 | ### Example: verbose mode |
326 | |
327 | When your regex gets complicated, you might consider using something other |
328 | than regex. But if you stick with regex, you can use the `x` flag to enable |
329 | insignificant whitespace mode or "verbose mode." In this mode, whitespace |
330 | is treated as insignificant and one may write comments. This may make your |
331 | patterns easier to comprehend. |
332 | |
333 | ```rust |
334 | use regex_lite::Regex; |
335 | |
336 | let re = Regex::new(r"(?x) |
337 | (?P<y>\d{4}) # the year |
338 | - |
339 | (?P<m>\d{2}) # the month |
340 | - |
341 | (?P<d>\d{2}) # the day |
342 | " ).unwrap(); |
343 | |
344 | let before = "1973-01-05, 1975-08-25 and 1980-10-18" ; |
345 | let after = re.replace_all(before, "$m/$d/$y" ); |
346 | assert_eq!(after, "01/05/1973, 08/25/1975 and 10/18/1980" ); |
347 | ``` |
348 | |
349 | If you wish to match against whitespace in this mode, you can still use `\s`, |
350 | `\n`, `\t`, etc. For escaping a single space character, you can escape it |
351 | directly with `\ `, use its hex character code `\x20` or temporarily disable |
352 | the `x` flag, e.g., `(?-x: )`. |
353 | |
354 | # Differences with the `regex` crate |
355 | |
356 | As mentioned in the introduction above, the purpose of this crate is to |
357 | prioritize small binary sizes and shorter Rust compilation times as much as |
358 | possible. Namely, while the `regex` crate tends to eschew both binary size |
359 | and compilation time in favor of faster searches and features, the |
360 | `regex-lite` crate gives up faster searches and some functionality in exchange |
361 | for smaller binary sizes and faster compilation times. |
362 | |
363 | The precise set of differences at the syntax level: |
364 | |
365 | * The Perl character classes are limited to ASCII codepoints. That is, |
366 | `\d` is `[0-9]`, `\s` is `[\t\n\v\f\r ]` and `\w` is `[0-9A-Za-z_]`. |
367 | * Unicode character classes of the form `\p{...}` and `\P{...}` are not |
368 | supported at all. Note though that things like `[^β]` are still supported and |
369 | will match any Unicode scalar value except for `β`. |
370 | * Case insensitive searching is limited to ASCII case insensitivity. |
371 | * Character class set operations other than union are not supported. That is, |
372 | difference (`--`), intersection (`&&`) and symmetric difference (`~~`) are |
373 | not available. These tend to be most useful with Unicode classes, which also |
374 | aren't available. |
375 | * Opt-in octal support is not available in this crate. |
376 | |
377 | And now at the API level: |
378 | |
379 | * Currently, this crate only supports searching `&str`. It does not have APIs |
380 | for searching `&[u8]` haystacks, although it is planned to add these in the |
381 | future if there's demand. |
382 | * There is no `RegexSet` in this crate and there are no plans to add it. |
383 | * The `Error` type in this crate is completely opaque. |
384 | |
385 | Other than these things, the `regex-lite` crate is intended to be a drop-in |
386 | replacement for the `regex` crate. In most cases, you can just replace `use |
387 | regex::Regex;` with `use regex_lite::Regex;` and everything will work. (Unless |
388 | you're depending on Unicode support in your regexes.) |
389 | |
390 | # Syntax |
391 | |
392 | The syntax supported in this crate is documented below. |
393 | |
394 | ### Matching one character |
395 | |
396 | <pre class="rust"> |
397 | . any character except new line (includes new line with s flag) |
398 | [0-9] any ASCII digit |
399 | \d digit ([0-9]) |
400 | \D not digit |
401 | </pre> |
402 | |
403 | ### Character classes |
404 | |
405 | <pre class="rust"> |
406 | [xyz] A character class matching either x, y or z (union). |
407 | [^xyz] A character class matching any character except x, y and z. |
408 | [a-z] A character class matching any character in range a-z. |
409 | [[:alpha:]] ASCII character class ([A-Za-z]) |
410 | [[:^alpha:]] Negated ASCII character class ([^A-Za-z]) |
411 | [\[\]] Escaping in character classes (matching [ or ]) |
412 | </pre> |
413 | |
414 | Any ASCII or Perl character class may appear inside a bracketed `[...]` character |
415 | class. For example, `[\s[:digit:]]` matches any digit or space character. |
416 | |
417 | Precedence in character classes, from most binding to least: |
418 | |
419 | 1. Ranges: `[a-cd]` == `[[a-c]d]` |
420 | 2. Union: `[ab&&bc]` == `[[ab]&&[bc]]` |
421 | 3. Negation: `[^a-z&&b]` == `[^[a-z&&b]]`. |
422 | |
423 | ### Composites |
424 | |
425 | <pre class="rust"> |
426 | xy concatenation (x followed by y) |
427 | x|y alternation (x or y, prefer x) |
428 | </pre> |
429 | |
430 | This example shows how an alternation works, and what it means to prefer a |
431 | branch in the alternation over subsequent branches. |
432 | |
433 | ``` |
434 | use regex_lite::Regex; |
435 | |
436 | let haystack = "samwise" ; |
437 | // If 'samwise' comes first in our alternation, then it is |
438 | // preferred as a match, even if the regex engine could |
439 | // technically detect that 'sam' led to a match earlier. |
440 | let re = Regex::new(r"samwise|sam" ).unwrap(); |
441 | assert_eq!("samwise" , re.find(haystack).unwrap().as_str()); |
442 | // But if 'sam' comes first, then it will match instead. |
443 | // In this case, it is impossible for 'samwise' to match |
444 | // because 'sam' is a prefix of it. |
445 | let re = Regex::new(r"sam|samwise" ).unwrap(); |
446 | assert_eq!("sam" , re.find(haystack).unwrap().as_str()); |
447 | ``` |
448 | |
449 | ### Repetitions |
450 | |
451 | <pre class="rust"> |
452 | x* zero or more of x (greedy) |
453 | x+ one or more of x (greedy) |
454 | x? zero or one of x (greedy) |
455 | x*? zero or more of x (ungreedy/lazy) |
456 | x+? one or more of x (ungreedy/lazy) |
457 | x?? zero or one of x (ungreedy/lazy) |
458 | x{n,m} at least n x and at most m x (greedy) |
459 | x{n,} at least n x (greedy) |
460 | x{n} exactly n x |
461 | x{n,m}? at least n x and at most m x (ungreedy/lazy) |
462 | x{n,}? at least n x (ungreedy/lazy) |
463 | x{n}? exactly n x |
464 | </pre> |
465 | |
466 | ### Empty matches |
467 | |
468 | <pre class="rust"> |
469 | ^ the beginning of a haystack (or start-of-line with multi-line mode) |
470 | $ the end of a haystack (or end-of-line with multi-line mode) |
471 | \A only the beginning of a haystack (even with multi-line mode enabled) |
472 | \z only the end of a haystack (even with multi-line mode enabled) |
473 | \b an ASCII word boundary (\w on one side and \W, \A, or \z on other) |
474 | \B not an ASCII word boundary |
475 | \b{start}, \< an ASCII start-of-word boundary (\W|\A on the left, \w on the right) |
476 | \b{end}, \> an ASCII end-of-word boundary (\w on the left, \W|\z on the right)) |
477 | \b{start-half} half of an ASCII start-of-word boundary (\W|\A on the left) |
478 | \b{end-half} half of an ASCII end-of-word boundary (\W|\z on the right) |
479 | </pre> |
480 | |
481 | The empty regex is valid and matches the empty string. For example, the |
482 | empty regex matches `abc` at positions `0`, `1`, `2` and `3`. When using the |
483 | top-level [`Regex`] on `&str` haystacks, an empty match that splits a codepoint |
484 | is guaranteed to never be returned. For example: |
485 | |
486 | ```rust |
487 | let re = regex_lite::Regex::new(r"" ).unwrap(); |
488 | let ranges: Vec<_> = re.find_iter("💩" ).map(|m| m.range()).collect(); |
489 | assert_eq!(ranges, vec![0..0, 4..4]); |
490 | ``` |
491 | |
492 | Note that an empty regex is distinct from a regex that can never match. For |
493 | example, the regex `[^\s\S]` is a character class that represents the negation |
494 | of `[\s\S]`, where the union of `\s` and `\S` corresponds to all Unicode scalar |
495 | values. The negation of everything is nothing, which means the character class |
496 | is empty. Since nothing is in the empty set, `[^\s\S]` matches nothing, not |
497 | even the empty string. |
498 | |
499 | ### Grouping and flags |
500 | |
501 | <pre class="rust"> |
502 | (exp) numbered capture group (indexed by opening parenthesis) |
503 | (?P<name>exp) named (also numbered) capture group (names must be alpha-numeric) |
504 | (?<name>exp) named (also numbered) capture group (names must be alpha-numeric) |
505 | (?:exp) non-capturing group |
506 | (?flags) set flags within current group |
507 | (?flags:exp) set flags for exp (non-capturing) |
508 | </pre> |
509 | |
510 | Capture group names must be any sequence of alpha-numeric Unicode codepoints, |
511 | in addition to `.`, `_`, `[` and `]`. Names must start with either an `_` or |
512 | an alphabetic codepoint. Alphabetic codepoints correspond to the `Alphabetic` |
513 | Unicode property, while numeric codepoints correspond to the union of the |
514 | `Decimal_Number`, `Letter_Number` and `Other_Number` general categories. |
515 | |
516 | Flags are each a single character. For example, `(?x)` sets the flag `x` |
517 | and `(?-x)` clears the flag `x`. Multiple flags can be set or cleared at |
518 | the same time: `(?xy)` sets both the `x` and `y` flags and `(?x-y)` sets |
519 | the `x` flag and clears the `y` flag. |
520 | |
521 | All flags are by default disabled unless stated otherwise. They are: |
522 | |
523 | <pre class="rust"> |
524 | i case-insensitive: letters match both upper and lower case |
525 | m multi-line mode: ^ and $ match begin/end of line |
526 | s allow . to match \n |
527 | R enables CRLF mode: when multi-line mode is enabled, \r\n is used |
528 | U swap the meaning of x* and x*? |
529 | x verbose mode, ignores whitespace and allow line comments (starting with `#`) |
530 | </pre> |
531 | |
532 | Note that in verbose mode, whitespace is ignored everywhere, including within |
533 | character classes. To insert whitespace, use its escaped form or a hex literal. |
534 | For example, `\ ` or `\x20` for an ASCII space. |
535 | |
536 | Flags can be toggled within a pattern. Here's an example that matches |
537 | case-insensitively for the first part but case-sensitively for the second part: |
538 | |
539 | ```rust |
540 | use regex_lite::Regex; |
541 | |
542 | let re = Regex::new(r"(?i)a+(?-i)b+" ).unwrap(); |
543 | let m = re.find("AaAaAbbBBBb" ).unwrap(); |
544 | assert_eq!(m.as_str(), "AaAaAbb" ); |
545 | ``` |
546 | |
547 | Notice that the `a+` matches either `a` or `A`, but the `b+` only matches |
548 | `b`. |
549 | |
550 | Multi-line mode means `^` and `$` no longer match just at the beginning/end of |
551 | the input, but also at the beginning/end of lines: |
552 | |
553 | ``` |
554 | use regex_lite::Regex; |
555 | |
556 | let re = Regex::new(r"(?m)^line \d+" ).unwrap(); |
557 | let m = re.find("line one \nline 2 \n" ).unwrap(); |
558 | assert_eq!(m.as_str(), "line 2" ); |
559 | ``` |
560 | |
561 | Note that `^` matches after new lines, even at the end of input: |
562 | |
563 | ``` |
564 | use regex_lite::Regex; |
565 | |
566 | let re = Regex::new(r"(?m)^" ).unwrap(); |
567 | let m = re.find_iter("test \n" ).last().unwrap(); |
568 | assert_eq!((m.start(), m.end()), (5, 5)); |
569 | ``` |
570 | |
571 | When both CRLF mode and multi-line mode are enabled, then `^` and `$` will |
572 | match either `\r` and `\n`, but never in the middle of a `\r\n`: |
573 | |
574 | ``` |
575 | use regex_lite::Regex; |
576 | |
577 | let re = Regex::new(r"(?mR)^foo$" ).unwrap(); |
578 | let m = re.find(" \r\nfoo \r\n" ).unwrap(); |
579 | assert_eq!(m.as_str(), "foo" ); |
580 | ``` |
581 | |
582 | ### Escape sequences |
583 | |
584 | Note that this includes all possible escape sequences, even ones that are |
585 | documented elsewhere. |
586 | |
587 | <pre class="rust"> |
588 | \* literal *, applies to all ASCII except [0-9A-Za-z<>] |
589 | \a bell (\x07) |
590 | \f form feed (\x0C) |
591 | \t horizontal tab |
592 | \n new line |
593 | \r carriage return |
594 | \v vertical tab (\x0B) |
595 | \A matches at the beginning of a haystack |
596 | \z matches at the end of a haystack |
597 | \b word boundary assertion |
598 | \B negated word boundary assertion |
599 | \b{start}, \< start-of-word boundary assertion |
600 | \b{end}, \> end-of-word boundary assertion |
601 | \b{start-half} half of a start-of-word boundary assertion |
602 | \b{end-half} half of a end-of-word boundary assertion |
603 | \x7F hex character code (exactly two digits) |
604 | \x{10FFFF} any hex character code corresponding to a Unicode code point |
605 | \u007F hex character code (exactly four digits) |
606 | \u{7F} any hex character code corresponding to a Unicode code point |
607 | \U0000007F hex character code (exactly eight digits) |
608 | \U{7F} any hex character code corresponding to a Unicode code point |
609 | \d, \s, \w Perl character class |
610 | \D, \S, \W negated Perl character class |
611 | </pre> |
612 | |
613 | ### Perl character classes (ASCII only) |
614 | |
615 | These character classes are short-hands for common groups of characters. In |
616 | this crate, `\d`, `\s` and `\w` only consist of ASCII codepoints. |
617 | |
618 | <pre class="rust"> |
619 | \d digit ([0-9]) |
620 | \D not digit |
621 | \s whitespace ([\t\n\v\f\r ]) |
622 | \S not whitespace |
623 | \w word character ([0-9A-Za-z_]) |
624 | \W not word character |
625 | </pre> |
626 | |
627 | ### ASCII character classes |
628 | |
629 | These reflect additional groups of characters taken from POSIX regex syntax |
630 | that are sometimes useful to have. In this crate, all of these classes only |
631 | consist of ASCII codepoints. |
632 | |
633 | <pre class="rust"> |
634 | [[:alnum:]] alphanumeric ([0-9A-Za-z]) |
635 | [[:alpha:]] alphabetic ([A-Za-z]) |
636 | [[:ascii:]] ASCII ([\x00-\x7F]) |
637 | [[:blank:]] blank ([\t ]) |
638 | [[:cntrl:]] control ([\x00-\x1F\x7F]) |
639 | [[:digit:]] digits ([0-9]) |
640 | [[:graph:]] graphical ([!-~]) |
641 | [[:lower:]] lower case ([a-z]) |
642 | [[:print:]] printable ([ -~]) |
643 | [[:punct:]] punctuation ([!-/:-@\[-`{-~]) |
644 | [[:space:]] whitespace ([\t\n\v\f\r ]) |
645 | [[:upper:]] upper case ([A-Z]) |
646 | [[:word:]] word characters ([0-9A-Za-z_]) |
647 | [[:xdigit:]] hex digit ([0-9A-Fa-f]) |
648 | </pre> |
649 | |
650 | # Untrusted input |
651 | |
652 | This crate is meant to be able to run regex searches on untrusted haystacks |
653 | without fear of [ReDoS]. This crate also, to a certain extent, supports |
654 | untrusted patterns. |
655 | |
656 | [ReDoS]: https://en.wikipedia.org/wiki/ReDoS |
657 | |
658 | This crate differs from most (but not all) other regex engines in that it |
659 | doesn't use unbounded backtracking to run a regex search. In those cases, |
660 | one generally cannot use untrusted patterns *or* untrusted haystacks because |
661 | it can be very difficult to know whether a particular pattern will result in |
662 | catastrophic backtracking or not. |
663 | |
664 | We'll first discuss how this crate deals with untrusted inputs and then wrap |
665 | it up with a realistic discussion about what practice really looks like. |
666 | |
667 | ### Panics |
668 | |
669 | Outside of clearly documented cases, most APIs in this crate are intended to |
670 | never panic regardless of the inputs given to them. For example, `Regex::new`, |
671 | `Regex::is_match`, `Regex::find` and `Regex::captures` should never panic. That |
672 | is, it is an API promise that those APIs will never panic no matter what inputs |
673 | are given to them. With that said, regex engines are complicated beasts, and |
674 | providing a rock solid guarantee that these APIs literally never panic is |
675 | essentially equivalent to saying, "there are no bugs in this library." That is |
676 | a bold claim, and not really one that can be feasibly made with a straight |
677 | face. |
678 | |
679 | Don't get the wrong impression here. This crate is extensively tested, not just |
680 | with unit and integration tests, but also via fuzz testing. For example, this |
681 | crate is part of the [OSS-fuzz project]. Panics should be incredibly rare, but |
682 | it is possible for bugs to exist, and thus possible for a panic to occur. If |
683 | you need a rock solid guarantee against panics, then you should wrap calls into |
684 | this library with [`std::panic::catch_unwind`]. |
685 | |
686 | It's also worth pointing out that this library will generally panic when other |
687 | regex engines would commit undefined behavior. When undefined behavior occurs, |
688 | your program might continue as if nothing bad has happened, but it also might |
689 | mean your program is open to the worst kinds of exploits. In contrast, the |
690 | worst thing a panic can do is a denial of service. |
691 | |
692 | [OSS-fuzz project]: https://android.googlesource.com/platform/external/oss-fuzz/+/refs/tags/android-t-preview-1/projects/rust-regex/ |
693 | [`std::panic::catch_unwind`]: https://doc.rust-lang.org/std/panic/fn.catch_unwind.html |
694 | |
695 | ### Untrusted patterns |
696 | |
697 | The principal way this crate deals with them is by limiting their size by |
698 | default. The size limit can be configured via [`RegexBuilder::size_limit`]. The |
699 | idea of a size limit is that compiling a pattern into a `Regex` will fail if it |
700 | becomes "too big." Namely, while *most* resources consumed by compiling a regex |
701 | are approximately proportional to the length of the pattern itself, there is |
702 | one particular exception to this: counted repetitions. Namely, this pattern: |
703 | |
704 | ```text |
705 | a{5}{5}{5}{5}{5}{5} |
706 | ``` |
707 | |
708 | Is equivalent to this pattern: |
709 | |
710 | ```text |
711 | a{15625} |
712 | ``` |
713 | |
714 | In both of these cases, the actual pattern string is quite small, but the |
715 | resulting `Regex` value is quite large. Indeed, as the first pattern shows, |
716 | it isn't enough to locally limit the size of each repetition because they can |
717 | be stacked in a way that results in exponential growth. |
718 | |
719 | To provide a bit more context, a simplified view of regex compilation looks |
720 | like this: |
721 | |
722 | * The pattern string is parsed into a structured representation called an HIR |
723 | (high-level intermediate representation). Counted repetitions are not expanded |
724 | in this stage. That is, the size of the HIR is proportional to the size |
725 | of the pattern with "reasonable" constant factors. In other words, one can |
726 | reasonably limit the memory used by an HIR by limiting the length of the |
727 | pattern string. |
728 | * The HIR is compiled into a [Thompson NFA]. This is the stage at which |
729 | something like `\w{5}` is rewritten to `\w\w\w\w\w`. Thus, this is the stage |
730 | at which [`RegexBuilder::size_limit`] is enforced. If the NFA exceeds the |
731 | configured size, then this stage will fail. |
732 | |
733 | [Thompson NFA]: https://en.wikipedia.org/wiki/Thompson%27s_construction |
734 | |
735 | The size limit helps avoid two different kinds of exorbitant resource usage: |
736 | |
737 | * It avoids permitting exponential memory usage based on the size of the |
738 | pattern string. |
739 | * It avoids long search times. This will be discussed in more detail in the |
740 | next section, but worst case search time *is* dependent on the size of the |
741 | regex. So keeping regexes limited to a reasonable size is also a way of keeping |
742 | search times reasonable. |
743 | |
744 | Finally, it's worth pointing out that regex compilation is guaranteed to take |
745 | worst case `O(m)` time, where `m` is proportional to the size of regex. The |
746 | size of the regex here is *after* the counted repetitions have been expanded. |
747 | |
748 | **Advice for those using untrusted regexes**: limit the pattern length to |
749 | something small and expand it as needed. Configure [`RegexBuilder::size_limit`] |
750 | to something small and then expand it as needed. |
751 | |
752 | ### Untrusted haystacks |
753 | |
754 | The main way this crate guards against searches from taking a long time is by |
755 | using algorithms that guarantee a `O(m * n)` worst case time and space bound. |
756 | Namely: |
757 | |
758 | * `m` is proportional to the size of the regex, where the size of the regex |
759 | includes the expansion of all counted repetitions. (See the previous section on |
760 | untrusted patterns.) |
761 | * `n` is proportional to the length, in bytes, of the haystack. |
762 | |
763 | In other words, if you consider `m` to be a constant (for example, the regex |
764 | pattern is a literal in the source code), then the search can be said to run |
765 | in "linear time." Or equivalently, "linear time with respect to the size of the |
766 | haystack." |
767 | |
768 | But the `m` factor here is important not to ignore. If a regex is |
769 | particularly big, the search times can get quite slow. This is why, in part, |
770 | [`RegexBuilder::size_limit`] exists. |
771 | |
772 | **Advice for those searching untrusted haystacks**: As long as your regexes |
773 | are not enormous, you should expect to be able to search untrusted haystacks |
774 | without fear. If you aren't sure, you should benchmark it. Unlike backtracking |
775 | engines, if your regex is so big that it's likely to result in slow searches, |
776 | this is probably something you'll be able to observe regardless of what the |
777 | haystack is made up of. |
778 | |
779 | ### Iterating over matches |
780 | |
781 | One thing that is perhaps easy to miss is that the worst case time |
782 | complexity bound of `O(m * n)` applies to methods like [`Regex::is_match`], |
783 | [`Regex::find`] and [`Regex::captures`]. It does **not** apply to |
784 | [`Regex::find_iter`] or [`Regex::captures_iter`]. Namely, since iterating over |
785 | all matches can execute many searches, and each search can scan the entire |
786 | haystack, the worst case time complexity for iterators is `O(m * n^2)`. |
787 | |
788 | One example of where this occurs is when a pattern consists of an alternation, |
789 | where an earlier branch of the alternation requires scanning the entire |
790 | haystack only to discover that there is no match. It also requires a later |
791 | branch of the alternation to have matched at the beginning of the search. For |
792 | example, consider the pattern `.*[^A-Z]|[A-Z]` and the haystack `AAAAA`. The |
793 | first search will scan to the end looking for matches of `.*[^A-Z]` even though |
794 | a finite automata engine (as in this crate) knows that `[A-Z]` has already |
795 | matched the first character of the haystack. This is due to the greedy nature |
796 | of regex searching. That first search will report a match at the first `A` only |
797 | after scanning to the end to discover that no other match exists. The next |
798 | search then begins at the second `A` and the behavior repeats. |
799 | |
800 | There is no way to avoid this. This means that if both patterns and haystacks |
801 | are untrusted and you're iterating over all matches, you're susceptible |
802 | to worst case quadratic time complexity. One possible way to mitigate |
803 | this is to switch to the lower level `regex-automata` crate and use its |
804 | `meta::Regex` iterator APIs. There, you can configure the search to operate |
805 | in "earliest" mode by passing a `Input::new(haystack).earliest(true)` to |
806 | `meta::Regex::find_iter` (for example). By enabling this mode, you give up |
807 | the normal greedy match semantics of regex searches and instead ask the regex |
808 | engine to immediately stop as soon as a match has been found. Enabling this |
809 | mode will thus restore the worst case `O(m * n)` time complexity bound, but at |
810 | the cost of different semantics. |
811 | |
812 | ### Untrusted inputs in practice |
813 | |
814 | While providing a `O(m * n)` worst case time bound on all searches goes a long |
815 | way toward preventing [ReDoS], that doesn't mean every search you can possibly |
816 | run will complete without burning CPU time. In general, there are a few ways |
817 | for the `m * n` time bound to still bite you: |
818 | |
819 | * You are searching an exceptionally long haystack. No matter how you slice |
820 | it, a longer haystack will take more time to search. |
821 | * Very large regexes can searches to be quite slow due to increasing the size |
822 | `m` in the worst case `O(m * n)` bound. This is especially true when they |
823 | are combined with counted repetitions. While the regex size limit above will |
824 | protect you from the most egregious cases, the default size limit still |
825 | permits pretty big regexes that can execute more slowly than one might expect. |
826 | * While routines like [`Regex::find`] and [`Regex::captures`] guarantee |
827 | worst case `O(m * n)` search time, routines like [`Regex::find_iter`] and |
828 | [`Regex::captures_iter`] actually have worst case `O(m * n^2)` search time. |
829 | This is because `find_iter` runs many searches, and each search takes worst |
830 | case `O(m * n)` time. Thus, iteration of all matches in a haystack has |
831 | worst case `O(m * n^2)`. A good example of a pattern that exhibits this is |
832 | `(?:A+){1000}|` or even `.*[^A-Z]|[A-Z]`. |
833 | |
834 | In general, unstrusted haystacks are easier to stomach than untrusted patterns. |
835 | Untrusted patterns give a lot more control to the caller to impact the |
836 | performance of a search. Therefore, permitting untrusted patterns means that |
837 | your only line of defense is to put a limit on how big `m` (and perhaps also |
838 | `n`) can be in `O(m * n)`. `n` is limited by simply inspecting the length |
839 | of the haystack while `m` is limited by *both* applying a limit to the |
840 | length of the pattern *and* a limit on the compiled size of the regex via |
841 | [`RegexBuilder::size_limit`]. |
842 | |
843 | It bears repeating: if you're accepting untrusted patterns, it would be a good |
844 | idea to start with conservative limits on `m` and `n`, and then carefully |
845 | increase them as needed. |
846 | */ |
847 | |
848 | #![no_std ] |
849 | // I'm not ideologically opposed to allowing non-safe code in this crate, but |
850 | // IMO it needs really excellent justification. One likely place where this |
851 | // could show up is if and when we support a non-std alloc mode. In that case, |
852 | // we need some way to synchronize access to a PikeVM cache. That in turn will |
853 | // likely require rolling our own primitive spin-lock or similar structure. |
854 | #![forbid (unsafe_code)] |
855 | #![deny (missing_docs, rustdoc::broken_intra_doc_links)] |
856 | #![warn (missing_debug_implementations)] |
857 | // When the main features are disabled, squash dead code warnings. The |
858 | // alternative is to litter conditional compilation directives everywhere, |
859 | // which is super annoying. |
860 | #![cfg_attr (not(feature = "string" ), allow(dead_code))] |
861 | |
862 | #[cfg (not(feature = "std" ))] |
863 | compile_error!("'std' is currently a required feature, please file an issue" ); |
864 | |
865 | #[cfg (not(any(target_pointer_width = "32" , target_pointer_width = "64" )))] |
866 | compile_error!("not supported on non-{32,64}, please file an issue" ); |
867 | |
868 | extern crate alloc; |
869 | #[cfg (any(test, feature = "std" ))] |
870 | extern crate std; |
871 | |
872 | #[cfg (feature = "string" )] |
873 | pub use self::string::*; |
874 | pub use self::{error::Error, hir::escape}; |
875 | |
876 | mod error; |
877 | mod hir; |
878 | mod int; |
879 | mod interpolate; |
880 | mod nfa; |
881 | mod pikevm; |
882 | mod pool; |
883 | #[cfg (feature = "string" )] |
884 | mod string; |
885 | mod utf8; |
886 | |