1 | /*! |
2 | This crate provides a robust regular expression parser. |
3 | |
4 | This crate defines two primary types: |
5 | |
6 | * [`Ast`](ast/enum.Ast.html) is the abstract syntax of a regular expression. |
7 | An abstract syntax corresponds to a *structured representation* of the |
8 | concrete syntax of a regular expression, where the concrete syntax is the |
9 | pattern string itself (e.g., `foo(bar)+`). Given some abstract syntax, it |
10 | can be converted back to the original concrete syntax (modulo some details, |
11 | like whitespace). To a first approximation, the abstract syntax is complex |
12 | and difficult to analyze. |
13 | * [`Hir`](hir/struct.Hir.html) is the high-level intermediate representation |
14 | ("HIR" or "high-level IR" for short) of regular expression. It corresponds to |
15 | an intermediate state of a regular expression that sits between the abstract |
16 | syntax and the low level compiled opcodes that are eventually responsible for |
17 | executing a regular expression search. Given some high-level IR, it is not |
possible to produce the original concrete syntax (it is possible to produce
an equivalent concrete syntax, but it will likely scarcely resemble the
original pattern). To a first approximation, the high-level IR is simple
21 | and easy to analyze. |
22 | |
23 | These two types come with conversion routines: |
24 | |
25 | * An [`ast::parse::Parser`](ast/parse/struct.Parser.html) converts concrete |
26 | syntax (a `&str`) to an [`Ast`](ast/enum.Ast.html). |
27 | * A [`hir::translate::Translator`](hir/translate/struct.Translator.html) |
28 | converts an [`Ast`](ast/enum.Ast.html) to a [`Hir`](hir/struct.Hir.html). |
29 | |
30 | As a convenience, the above two conversion routines are combined into one via |
31 | the top-level [`Parser`](struct.Parser.html) type. This `Parser` will first |
32 | convert your pattern to an `Ast` and then convert the `Ast` to an `Hir`. |
33 | |
34 | |
35 | # Example |
36 | |
37 | This example shows how to parse a pattern string into its HIR: |
38 | |
39 | ``` |
40 | use regex_syntax::Parser; |
41 | use regex_syntax::hir::{self, Hir}; |
42 | |
let hir = Parser::new().parse("a|b").unwrap();
assert_eq!(hir, Hir::alternation(vec![
    Hir::literal(hir::Literal::Unicode('a')),
    Hir::literal(hir::Literal::Unicode('b')),
]));
48 | ``` |
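
The same conversion can also be run as two explicit stages using the types
described above. The following is a minimal sketch, assuming the default
configurations of both types; the combined `Parser` is usually more
convenient:

```
use regex_syntax::ast::parse::Parser as AstParser;
use regex_syntax::hir::translate::Translator;

// Stage 1: concrete syntax -> Ast.
let ast = AstParser::new().parse("foo(bar)+").unwrap();
// Stage 2: Ast -> Hir. `translate` expects both the original pattern string
// and its Ast.
let hir = Translator::new().translate("foo(bar)+", &ast).unwrap();
```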
49 | |
50 | |
51 | # Concrete syntax supported |
52 | |
53 | The concrete syntax is documented as part of the public API of the |
54 | [`regex` crate](https://docs.rs/regex/%2A/regex/#syntax). |
55 | |
56 | |
57 | # Input safety |
58 | |
59 | A key feature of this library is that it is safe to use with end user facing |
60 | input. This plays a significant role in the internal implementation. In |
61 | particular: |
62 | |
63 | 1. Parsers provide a `nest_limit` option that permits callers to control how |
64 | deeply nested a regular expression is allowed to be. This makes it possible |
65 | to do case analysis over an `Ast` or an `Hir` using recursion without |
worrying about stack overflow (see the sketch following this list).
67 | 2. Since relying on a particular stack size is brittle, this crate goes to |
68 | great lengths to ensure that all interactions with both the `Ast` and the |
69 | `Hir` do not use recursion. Namely, they use constant stack space and heap |
70 | space proportional to the size of the original pattern string (in bytes). |
This includes the types' corresponding destructors. (One exception to this
72 | is literal extraction, but this will eventually get fixed.) |
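
For item 1 above, the nest limit is set through `ParserBuilder`. A minimal
sketch (the limit of 5 and the patterns below are arbitrary choices for
illustration):

```
use regex_syntax::ParserBuilder;

let mut parser = ParserBuilder::new().nest_limit(5).build();
// A shallow pattern parses fine under the limit...
assert!(parser.parse("a|b|c").is_ok());
// ...while a deeply nested one is rejected with an error instead of risking
// unbounded recursion during later analysis.
assert!(parser.parse("((((((((((a))))))))))").is_err());
```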
73 | |
74 | |
75 | # Error reporting |
76 | |
77 | The `Display` implementations on all `Error` types exposed in this library |
78 | provide nice human readable errors that are suitable for showing to end users |
79 | in a monospace font. |
80 | |
81 | |
82 | # Literal extraction |
83 | |
84 | This crate provides limited support for |
85 | [literal extraction from `Hir` values](hir/literal/struct.Literals.html). |
86 | Be warned that literal extraction currently uses recursion, and therefore, |
87 | stack size proportional to the size of the `Hir`. |
88 | |
89 | The purpose of literal extraction is to speed up searches. That is, if you |
90 | know a regular expression must match a prefix or suffix literal, then it is |
91 | often quicker to search for instances of that literal, and then confirm or deny |
92 | the match using the full regular expression engine. These optimizations are |
93 | done automatically in the `regex` crate. |
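
A rough sketch of what extraction looks like (assuming the `prefixes`
constructor and `literals` accessor on the linked `Literals` type; the exact
literals extracted depend on the crate's heuristics):

```
use regex_syntax::Parser;
use regex_syntax::hir::literal::Literals;

let hir = Parser::new().parse("foo(bar|baz)").unwrap();
let prefixes = Literals::prefixes(&hir);
// Each extracted literal is a candidate that a fast substring search can
// scan for before handing the result off to the full regex engine.
for lit in prefixes.literals() {
    println!("{:?}", lit);
}
```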
94 | |
95 | |
96 | # Crate features |
97 | |
98 | An important feature provided by this crate is its Unicode support. This |
99 | includes things like case folding, boolean properties, general categories, |
100 | scripts and Unicode-aware support for the Perl classes `\w`, `\s` and `\d`. |
101 | However, a downside of this support is that it requires bundling several |
102 | Unicode data tables that are substantial in size. |
103 | |
104 | A fair number of use cases do not require full Unicode support. For this |
105 | reason, this crate exposes a number of features to control which Unicode |
106 | data is available. |
107 | |
108 | If a regular expression attempts to use a Unicode feature that is not available |
109 | because the corresponding crate feature was disabled, then translating that |
regular expression to an `Hir` will return an error. (It is still possible to
111 | construct an `Ast` for such a regular expression, since Unicode data is not |
112 | used until translation to an `Hir`.) Stated differently, enabling or disabling |
113 | any of the features below can only add or subtract from the total set of valid |
114 | regular expressions. Enabling or disabling a feature will never modify the |
115 | match semantics of a regular expression. |
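
For example, a script class such as `\p{Greek}` always parses to an `Ast`,
but only translates to an `Hir` when the `unicode-script` feature is enabled.
A minimal sketch:

```
use regex_syntax::ast::parse::Parser as AstParser;
use regex_syntax::hir::translate::Translator;

let pattern = r"\p{Greek}";
// Building the Ast never consults Unicode data, so this step always works.
let ast = AstParser::new().parse(pattern).unwrap();
// Translation is where missing Unicode data shows up. With the default
// features (which include `unicode-script`) this returns an `Hir`; with
// `unicode-script` disabled it returns an error instead.
let _result = Translator::new().translate(pattern, &ast);
```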
116 | |
117 | The following features are available: |
118 | |
119 | * **unicode** - |
120 | Enables all Unicode features. This feature is enabled by default, and will |
121 | always cover all Unicode features, even if more are added in the future. |
122 | * **unicode-age** - |
123 | Provide the data for the |
124 | [Unicode `Age` property](https://www.unicode.org/reports/tr44/tr44-24.html#Character_Age). |
125 | This makes it possible to use classes like `\p{Age:6.0}` to refer to all |
codepoints first introduced in Unicode 6.0.
127 | * **unicode-bool** - |
128 | Provide the data for numerous Unicode boolean properties. The full list |
129 | is not included here, but contains properties like `Alphabetic`, `Emoji`, |
130 | `Lowercase`, `Math`, `Uppercase` and `White_Space`. |
131 | * **unicode-case** - |
132 | Provide the data for case insensitive matching using |
133 | [Unicode's "simple loose matches" specification](https://www.unicode.org/reports/tr18/#Simple_Loose_Matches). |
134 | * **unicode-gencat** - |
135 | Provide the data for |
[Unicode general categories](https://www.unicode.org/reports/tr44/tr44-24.html#General_Category_Values).
137 | This includes, but is not limited to, `Decimal_Number`, `Letter`, |
138 | `Math_Symbol`, `Number` and `Punctuation`. |
139 | * **unicode-perl** - |
140 | Provide the data for supporting the Unicode-aware Perl character classes, |
141 | corresponding to `\w`, `\s` and `\d`. This is also necessary for using |
142 | Unicode-aware word boundary assertions. Note that if this feature is |
143 | disabled, the `\s` and `\d` character classes are still available if the |
144 | `unicode-bool` and `unicode-gencat` features are enabled, respectively. |
145 | * **unicode-script** - |
146 | Provide the data for |
147 | [Unicode scripts and script extensions](https://www.unicode.org/reports/tr24/). |
148 | This includes, but is not limited to, `Arabic`, `Cyrillic`, `Hebrew`, |
149 | `Latin` and `Thai`. |
150 | * **unicode-segment** - |
Provide the data for the properties used to implement the
152 | [Unicode text segmentation algorithms](https://www.unicode.org/reports/tr29/). |
153 | This enables using classes like `\p{gcb=Extend}`, `\p{wb=Katakana}` and |
154 | `\p{sb=ATerm}`. |
155 | */ |
156 | |
#![deny(missing_docs)]
#![warn(missing_debug_implementations)]
#![forbid(unsafe_code)]
160 | |
161 | pub use crate::error::{Error, Result}; |
162 | pub use crate::parser::{Parser, ParserBuilder}; |
163 | pub use crate::unicode::UnicodeWordError; |
164 | |
165 | pub mod ast; |
166 | mod either; |
167 | mod error; |
168 | pub mod hir; |
169 | mod parser; |
170 | mod unicode; |
171 | mod unicode_tables; |
172 | pub mod utf8; |
173 | |
174 | /// Escapes all regular expression meta characters in `text`. |
175 | /// |
176 | /// The string returned may be safely used as a literal in a regular |
177 | /// expression. |
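///
/// # Example
///
/// The `+` below is a meta character and gets escaped, while `=` is not and
/// is left untouched:
///
/// ```
/// use regex_syntax::escape;
///
/// assert_eq!(escape("1+1=2"), r"1\+1=2");
/// ```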
178 | pub fn escape(text: &str) -> String { |
179 | let mut quoted = String::new(); |
180 | escape_into(text, &mut quoted); |
181 | quoted |
182 | } |
183 | |
184 | /// Escapes all meta characters in `text` and writes the result into `buf`. |
185 | /// |
186 | /// This will append escape characters into the given buffer. The characters |
187 | /// that are appended are safe to use as a literal in a regular expression. |
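///
/// # Example
///
/// A small sketch of building a larger pattern around an escaped,
/// user-provided fragment:
///
/// ```
/// use regex_syntax::escape_into;
///
/// let mut pattern = String::from("^");
/// escape_into("a+b", &mut pattern);
/// pattern.push('$');
/// assert_eq!(pattern, r"^a\+b$");
/// ```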
188 | pub fn escape_into(text: &str, buf: &mut String) { |
189 | buf.reserve(text.len()); |
190 | for c in text.chars() { |
191 | if is_meta_character(c) { |
buf.push('\\');
193 | } |
194 | buf.push(c); |
195 | } |
196 | } |
197 | |
198 | /// Returns true if the given character has significance in a regex. |
199 | /// |
200 | /// These are the only characters that are allowed to be escaped, with one |
201 | /// exception: an ASCII space character may be escaped when extended mode (with |
202 | /// the `x` flag) is enabled. In particular, `is_meta_character(' ')` returns |
203 | /// `false`. |
204 | /// |
205 | /// Note that the set of characters for which this function returns `true` or |
206 | /// `false` is fixed and won't change in a semver compatible release. |
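///
/// # Example
///
/// A few representative cases:
///
/// ```
/// use regex_syntax::is_meta_character;
///
/// assert!(is_meta_character('?'));
/// assert!(is_meta_character('-'));
/// assert!(!is_meta_character('%'));
/// assert!(!is_meta_character(' '));
/// ```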
207 | pub fn is_meta_character(c: char) -> bool { |
208 | match c { |
'\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']' | '{'
210 | | '}' | '^' | '$' | '#' | '&' | '-' | '~' => true, |
211 | _ => false, |
212 | } |
213 | } |
214 | |
215 | /// Returns true if and only if the given character is a Unicode word |
216 | /// character. |
217 | /// |
218 | /// A Unicode word character is defined by |
219 | /// [UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties). |
220 | /// In particular, a character |
221 | /// is considered a word character if it is in either of the `Alphabetic` or |
222 | /// `Join_Control` properties, or is in one of the `Decimal_Number`, `Mark` |
223 | /// or `Connector_Punctuation` general categories. |
224 | /// |
225 | /// # Panics |
226 | /// |
227 | /// If the `unicode-perl` feature is not enabled, then this function panics. |
228 | /// For this reason, it is recommended that callers use |
229 | /// [`try_is_word_character`](fn.try_is_word_character.html) |
230 | /// instead. |
231 | pub fn is_word_character(c: char) -> bool { |
try_is_word_character(c).expect("unicode-perl feature must be enabled")
233 | } |
234 | |
235 | /// Returns true if and only if the given character is a Unicode word |
236 | /// character. |
237 | /// |
238 | /// A Unicode word character is defined by |
239 | /// [UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties). |
240 | /// In particular, a character |
241 | /// is considered a word character if it is in either of the `Alphabetic` or |
242 | /// `Join_Control` properties, or is in one of the `Decimal_Number`, `Mark` |
243 | /// or `Connector_Punctuation` general categories. |
244 | /// |
245 | /// # Errors |
246 | /// |
247 | /// If the `unicode-perl` feature is not enabled, then this function always |
248 | /// returns an error. |
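///
/// # Example
///
/// A sketch that works whether or not the `unicode-perl` feature is enabled
/// (the lookup only succeeds when it is):
///
/// ```
/// if let Ok(is_word) = regex_syntax::try_is_word_character('β') {
///     assert!(is_word);
/// }
/// ```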
249 | pub fn try_is_word_character( |
250 | c: char, |
251 | ) -> std::result::Result<bool, UnicodeWordError> { |
252 | unicode::is_word_character(c) |
253 | } |
254 | |
255 | /// Returns true if and only if the given character is an ASCII word character. |
256 | /// |
257 | /// An ASCII word character is defined by the following character class: |
/// `[_0-9a-zA-Z]`.
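///
/// # Example
///
/// A couple of representative cases:
///
/// ```
/// use regex_syntax::is_word_byte;
///
/// assert!(is_word_byte(b'_'));
/// assert!(!is_word_byte(b' '));
/// ```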
259 | pub fn is_word_byte(c: u8) -> bool { |
260 | match c { |
b'_' | b'0'..=b'9' | b'a'..=b'z' | b'A'..=b'Z' => true,
262 | _ => false, |
263 | } |
264 | } |
265 | |
#[cfg(test)]
267 | mod tests { |
268 | use super::*; |
269 | |
270 | #[test] |
271 | fn escape_meta() { |
272 | assert_eq!( |
escape(r"\.+*?()|[]{}^$#&-~"),
r"\\\.\+\*\?\(\)\|\[\]\{\}\^\$\#\&\-\~".to_string()
275 | ); |
276 | } |
277 | |
278 | #[test] |
279 | fn word_byte() { |
assert!(is_word_byte(b'a'));
assert!(!is_word_byte(b'-'));
282 | } |
283 | |
284 | #[test] |
#[cfg(feature = "unicode-perl")]
286 | fn word_char() { |
assert!(is_word_character('a'), "ASCII");
assert!(is_word_character('à'), "Latin-1");
assert!(is_word_character('β'), "Greek");
assert!(is_word_character('\u{11011}'), "Brahmi (Unicode 6.0)");
assert!(is_word_character('\u{11611}'), "Modi (Unicode 7.0)");
assert!(is_word_character('\u{11711}'), "Ahom (Unicode 8.0)");
assert!(is_word_character('\u{17828}'), "Tangut (Unicode 9.0)");
assert!(is_word_character('\u{1B1B1}'), "Nushu (Unicode 10.0)");
assert!(is_word_character('\u{16E40}'), "Medefaidrin (Unicode 11.0)");
assert!(!is_word_character('-'));
assert!(!is_word_character('☃'));
298 | } |
299 | |
300 | #[test] |
#[should_panic]
#[cfg(not(feature = "unicode-perl"))]
fn word_char_disabled_panic() {
assert!(is_word_character('a'));
305 | } |
306 | |
307 | #[test] |
#[cfg(not(feature = "unicode-perl"))]
fn word_char_disabled_error() {
assert!(try_is_word_character('a').is_err());
311 | } |
312 | } |
313 | |