| 1 | /*! | 
| 2 | This crate provides a robust regular expression parser. | 
|---|
| 3 |  | 
|---|
| 4 | This crate defines two primary types: | 
|---|
| 5 |  | 
|---|
| 6 | * [`Ast`](ast::Ast) is the abstract syntax of a regular expression. | 
|---|
| 7 | An abstract syntax corresponds to a *structured representation* of the | 
|---|
| 8 | concrete syntax of a regular expression, where the concrete syntax is the | 
|---|
| 9 | pattern string itself (e.g., `foo(bar)+`). Given some abstract syntax, it | 
|---|
| 10 | can be converted back to the original concrete syntax (modulo some details, | 
|---|
| 11 | like whitespace). To a first approximation, the abstract syntax is complex | 
|---|
| 12 | and difficult to analyze. | 
|---|
| 13 | * [`Hir`](hir::Hir) is the high-level intermediate representation | 
|---|
| 14 | ("HIR" or "high-level IR" for short) of regular expression. It corresponds to | 
|---|
| 15 | an intermediate state of a regular expression that sits between the abstract | 
|---|
| 16 | syntax and the low level compiled opcodes that are eventually responsible for | 
|---|
| 17 | executing a regular expression search. Given some high-level IR, it is not | 
|---|
| 18 | possible to produce the original concrete syntax (although it is possible to | 
|---|
| 19 | produce an equivalent concrete syntax, but it will likely scarcely resemble | 
|---|
| 20 | the original pattern). To a first approximation, the high-level IR is simple | 
|---|
| 21 | and easy to analyze. | 
|---|
| 22 |  | 
|---|
| 23 | These two types come with conversion routines: | 
|---|
| 24 |  | 
|---|
| 25 | * An [`ast::parse::Parser`] converts concrete syntax (a `&str`) to an | 
|---|
| 26 | [`Ast`](ast::Ast). | 
|---|
| 27 | * A [`hir::translate::Translator`] converts an [`Ast`](ast::Ast) to a | 
|---|
| 28 | [`Hir`](hir::Hir). | 
|---|
| 29 |  | 
|---|
| 30 | As a convenience, the above two conversion routines are combined into one via | 
|---|
| 31 | the top-level [`Parser`] type. This `Parser` will first convert your pattern to | 
|---|
| 32 | an `Ast` and then convert the `Ast` to an `Hir`. It's also exposed as top-level | 
|---|
| 33 | [`parse`] free function. | 
|---|
| 34 |  | 
|---|
| 35 |  | 
|---|
| 36 | # Example | 
|---|
| 37 |  | 
|---|
| 38 | This example shows how to parse a pattern string into its HIR: | 
|---|
| 39 |  | 
|---|
| 40 | ``` | 
|---|
| 41 | use regex_syntax::{hir::Hir, parse}; | 
|---|
| 42 |  | 
|---|
| 43 | let hir = parse( "a|b")?; | 
|---|
| 44 | assert_eq!(hir, Hir::alternation(vec![ | 
|---|
| 45 | Hir::literal( "a".as_bytes()), | 
|---|
| 46 | Hir::literal( "b".as_bytes()), | 
|---|
| 47 | ])); | 
|---|
| 48 | # Ok::<(), Box<dyn std::error::Error>>(()) | 
|---|
| 49 | ``` | 
|---|
| 50 |  | 
|---|
| 51 |  | 
|---|
| 52 | # Concrete syntax supported | 
|---|
| 53 |  | 
|---|
| 54 | The concrete syntax is documented as part of the public API of the | 
|---|
| 55 | [`regex` crate](https://docs.rs/regex/%2A/regex/#syntax). | 
|---|
| 56 |  | 
|---|
| 57 |  | 
|---|
| 58 | # Input safety | 
|---|
| 59 |  | 
|---|
| 60 | A key feature of this library is that it is safe to use with end user facing | 
|---|
| 61 | input. This plays a significant role in the internal implementation. In | 
|---|
| 62 | particular: | 
|---|
| 63 |  | 
|---|
| 64 | 1. Parsers provide a `nest_limit` option that permits callers to control how | 
|---|
| 65 | deeply nested a regular expression is allowed to be. This makes it possible | 
|---|
| 66 | to do case analysis over an `Ast` or an `Hir` using recursion without | 
|---|
| 67 | worrying about stack overflow. | 
|---|
| 68 | 2. Since relying on a particular stack size is brittle, this crate goes to | 
|---|
| 69 | great lengths to ensure that all interactions with both the `Ast` and the | 
|---|
| 70 | `Hir` do not use recursion. Namely, they use constant stack space and heap | 
|---|
| 71 | space proportional to the size of the original pattern string (in bytes). | 
|---|
| 72 | This includes the type's corresponding destructors. (One exception to this | 
|---|
| 73 | is literal extraction, but this will eventually get fixed.) | 
|---|
| 74 |  | 
|---|
| 75 |  | 
|---|
| 76 | # Error reporting | 
|---|
| 77 |  | 
|---|
| 78 | The `Display` implementations on all `Error` types exposed in this library | 
|---|
| 79 | provide nice human readable errors that are suitable for showing to end users | 
|---|
| 80 | in a monospace font. | 
|---|
| 81 |  | 
|---|
| 82 |  | 
|---|
| 83 | # Literal extraction | 
|---|
| 84 |  | 
|---|
| 85 | This crate provides limited support for [literal extraction from `Hir` | 
|---|
| 86 | values](hir::literal). Be warned that literal extraction uses recursion, and | 
|---|
| 87 | therefore, stack size proportional to the size of the `Hir`. | 
|---|
| 88 |  | 
|---|
| 89 | The purpose of literal extraction is to speed up searches. That is, if you | 
|---|
| 90 | know a regular expression must match a prefix or suffix literal, then it is | 
|---|
| 91 | often quicker to search for instances of that literal, and then confirm or deny | 
|---|
| 92 | the match using the full regular expression engine. These optimizations are | 
|---|
| 93 | done automatically in the `regex` crate. | 
|---|
| 94 |  | 
|---|
| 95 |  | 
|---|
| 96 | # Crate features | 
|---|
| 97 |  | 
|---|
| 98 | An important feature provided by this crate is its Unicode support. This | 
|---|
| 99 | includes things like case folding, boolean properties, general categories, | 
|---|
| 100 | scripts and Unicode-aware support for the Perl classes `\w`, `\s` and `\d`. | 
|---|
| 101 | However, a downside of this support is that it requires bundling several | 
|---|
| 102 | Unicode data tables that are substantial in size. | 
|---|
| 103 |  | 
|---|
| 104 | A fair number of use cases do not require full Unicode support. For this | 
|---|
| 105 | reason, this crate exposes a number of features to control which Unicode | 
|---|
| 106 | data is available. | 
|---|
| 107 |  | 
|---|
| 108 | If a regular expression attempts to use a Unicode feature that is not available | 
|---|
| 109 | because the corresponding crate feature was disabled, then translating that | 
|---|
| 110 | regular expression to an `Hir` will return an error. (It is still possible | 
|---|
| 111 | construct an `Ast` for such a regular expression, since Unicode data is not | 
|---|
| 112 | used until translation to an `Hir`.) Stated differently, enabling or disabling | 
|---|
| 113 | any of the features below can only add or subtract from the total set of valid | 
|---|
| 114 | regular expressions. Enabling or disabling a feature will never modify the | 
|---|
| 115 | match semantics of a regular expression. | 
|---|
| 116 |  | 
|---|
| 117 | The following features are available: | 
|---|
| 118 |  | 
|---|
| 119 | * **std** - | 
|---|
| 120 | Enables support for the standard library. This feature is enabled by default. | 
|---|
| 121 | When disabled, only `core` and `alloc` are used. Otherwise, enabling `std` | 
|---|
| 122 | generally just enables `std::error::Error` trait impls for the various error | 
|---|
| 123 | types. | 
|---|
| 124 | * **unicode** - | 
|---|
| 125 | Enables all Unicode features. This feature is enabled by default, and will | 
|---|
| 126 | always cover all Unicode features, even if more are added in the future. | 
|---|
| 127 | * **unicode-age** - | 
|---|
| 128 | Provide the data for the | 
|---|
| 129 | [Unicode `Age` property](https://www.unicode.org/reports/tr44/tr44-24.html#Character_Age). | 
|---|
| 130 | This makes it possible to use classes like `\p{Age:6.0}` to refer to all | 
|---|
| 131 | codepoints first introduced in Unicode 6.0 | 
|---|
| 132 | * **unicode-bool** - | 
|---|
| 133 | Provide the data for numerous Unicode boolean properties. The full list | 
|---|
| 134 | is not included here, but contains properties like `Alphabetic`, `Emoji`, | 
|---|
| 135 | `Lowercase`, `Math`, `Uppercase` and `White_Space`. | 
|---|
| 136 | * **unicode-case** - | 
|---|
| 137 | Provide the data for case insensitive matching using | 
|---|
| 138 | [Unicode's "simple loose matches" specification](https://www.unicode.org/reports/tr18/#Simple_Loose_Matches). | 
|---|
| 139 | * **unicode-gencat** - | 
|---|
| 140 | Provide the data for | 
|---|
| 141 | [Unicode general categories](https://www.unicode.org/reports/tr44/tr44-24.html#General_Category_Values). | 
|---|
| 142 | This includes, but is not limited to, `Decimal_Number`, `Letter`, | 
|---|
| 143 | `Math_Symbol`, `Number` and `Punctuation`. | 
|---|
| 144 | * **unicode-perl** - | 
|---|
| 145 | Provide the data for supporting the Unicode-aware Perl character classes, | 
|---|
| 146 | corresponding to `\w`, `\s` and `\d`. This is also necessary for using | 
|---|
| 147 | Unicode-aware word boundary assertions. Note that if this feature is | 
|---|
| 148 | disabled, the `\s` and `\d` character classes are still available if the | 
|---|
| 149 | `unicode-bool` and `unicode-gencat` features are enabled, respectively. | 
|---|
| 150 | * **unicode-script** - | 
|---|
| 151 | Provide the data for | 
|---|
| 152 | [Unicode scripts and script extensions](https://www.unicode.org/reports/tr24/). | 
|---|
| 153 | This includes, but is not limited to, `Arabic`, `Cyrillic`, `Hebrew`, | 
|---|
| 154 | `Latin` and `Thai`. | 
|---|
| 155 | * **unicode-segment** - | 
|---|
| 156 | Provide the data necessary to provide the properties used to implement the | 
|---|
| 157 | [Unicode text segmentation algorithms](https://www.unicode.org/reports/tr29/). | 
|---|
| 158 | This enables using classes like `\p{gcb=Extend}`, `\p{wb=Katakana}` and | 
|---|
| 159 | `\p{sb=ATerm}`. | 
|---|
| 160 | * **arbitrary** - | 
|---|
| 161 | Enabling this feature introduces a public dependency on the | 
|---|
| 162 | [`arbitrary`](https://crates.io/crates/arbitrary) | 
|---|
| 163 | crate. Namely, it implements the `Arbitrary` trait from that crate for the | 
|---|
| 164 | [`Ast`](crate::ast::Ast) type. This feature is disabled by default. | 
|---|
| 165 | */ | 
|---|
| 166 |  | 
|---|
| 167 | #![ no_std] | 
|---|
| 168 | #![ forbid(unsafe_code)] | 
|---|
| 169 | #![ deny(missing_docs, rustdoc::broken_intra_doc_links)] | 
|---|
| 170 | #![ warn(missing_debug_implementations)] | 
|---|
| 171 | #![ cfg_attr(docsrs, feature(doc_auto_cfg))] | 
|---|
| 172 |  | 
|---|
| 173 | #[ cfg(any(test, feature = "std"))] | 
|---|
| 174 | extern crate std; | 
|---|
| 175 |  | 
|---|
| 176 | extern crate alloc; | 
|---|
| 177 |  | 
|---|
| 178 | pub use crate::{ | 
|---|
| 179 | error::Error, | 
|---|
| 180 | parser::{parse, Parser, ParserBuilder}, | 
|---|
| 181 | unicode::UnicodeWordError, | 
|---|
| 182 | }; | 
|---|
| 183 |  | 
|---|
| 184 | use alloc::string::String; | 
|---|
| 185 |  | 
|---|
| 186 | pub mod ast; | 
|---|
| 187 | mod debug; | 
|---|
| 188 | mod either; | 
|---|
| 189 | mod error; | 
|---|
| 190 | pub mod hir; | 
|---|
| 191 | mod parser; | 
|---|
| 192 | mod rank; | 
|---|
| 193 | mod unicode; | 
|---|
| 194 | mod unicode_tables; | 
|---|
| 195 | pub mod utf8; | 
|---|
| 196 |  | 
|---|
| 197 | /// Escapes all regular expression meta characters in `text`. | 
|---|
| 198 | /// | 
|---|
| 199 | /// The string returned may be safely used as a literal in a regular | 
|---|
| 200 | /// expression. | 
|---|
| 201 | pub fn escape(text: &str) -> String { | 
|---|
| 202 | let mut quoted: String = String::new(); | 
|---|
| 203 | escape_into(text, &mut quoted); | 
|---|
| 204 | quoted | 
|---|
| 205 | } | 
|---|
| 206 |  | 
|---|
| 207 | /// Escapes all meta characters in `text` and writes the result into `buf`. | 
|---|
| 208 | /// | 
|---|
| 209 | /// This will append escape characters into the given buffer. The characters | 
|---|
| 210 | /// that are appended are safe to use as a literal in a regular expression. | 
|---|
| 211 | pub fn escape_into(text: &str, buf: &mut String) { | 
|---|
| 212 | buf.reserve(additional:text.len()); | 
|---|
| 213 | for c: char in text.chars() { | 
|---|
| 214 | if is_meta_character(c) { | 
|---|
| 215 | buf.push(ch: '\\ '); | 
|---|
| 216 | } | 
|---|
| 217 | buf.push(ch:c); | 
|---|
| 218 | } | 
|---|
| 219 | } | 
|---|
| 220 |  | 
|---|
| 221 | /// Returns true if the given character has significance in a regex. | 
|---|
| 222 | /// | 
|---|
| 223 | /// Generally speaking, these are the only characters which _must_ be escaped | 
|---|
| 224 | /// in order to match their literal meaning. For example, to match a literal | 
|---|
| 225 | /// `|`, one could write `\|`. Sometimes escaping isn't always necessary. For | 
|---|
| 226 | /// example, `-` is treated as a meta character because of its significance | 
|---|
| 227 | /// for writing ranges inside of character classes, but the regex `-` will | 
|---|
| 228 | /// match a literal `-` because `-` has no special meaning outside of character | 
|---|
| 229 | /// classes. | 
|---|
| 230 | /// | 
|---|
| 231 | /// In order to determine whether a character may be escaped at all, the | 
|---|
| 232 | /// [`is_escapeable_character`] routine should be used. The difference between | 
|---|
| 233 | /// `is_meta_character` and `is_escapeable_character` is that the latter will | 
|---|
| 234 | /// return true for some characters that are _not_ meta characters. For | 
|---|
| 235 | /// example, `%` and `\%` both match a literal `%` in all contexts. In other | 
|---|
| 236 | /// words, `is_escapeable_character` includes "superfluous" escapes. | 
|---|
| 237 | /// | 
|---|
| 238 | /// Note that the set of characters for which this function returns `true` or | 
|---|
| 239 | /// `false` is fixed and won't change in a semver compatible release. (In this | 
|---|
| 240 | /// case, "semver compatible release" actually refers to the `regex` crate | 
|---|
| 241 | /// itself, since reducing or expanding the set of meta characters would be a | 
|---|
| 242 | /// breaking change for not just `regex-syntax` but also `regex` itself.) | 
|---|
| 243 | /// | 
|---|
| 244 | /// # Example | 
|---|
| 245 | /// | 
|---|
| 246 | /// ``` | 
|---|
| 247 | /// use regex_syntax::is_meta_character; | 
|---|
| 248 | /// | 
|---|
| 249 | /// assert!(is_meta_character( '?')); | 
|---|
| 250 | /// assert!(is_meta_character( '-')); | 
|---|
| 251 | /// assert!(is_meta_character( '&')); | 
|---|
| 252 | /// assert!(is_meta_character( '#')); | 
|---|
| 253 | /// | 
|---|
| 254 | /// assert!(!is_meta_character( '%')); | 
|---|
| 255 | /// assert!(!is_meta_character( '/')); | 
|---|
| 256 | /// assert!(!is_meta_character( '!')); | 
|---|
| 257 | /// assert!(!is_meta_character( '"')); | 
|---|
| 258 | /// assert!(!is_meta_character( 'e')); | 
|---|
| 259 | /// ``` | 
|---|
| 260 | pub fn is_meta_character(c: char) -> bool { | 
|---|
| 261 | match c { | 
|---|
| 262 | '\\ '| '.'| '+'| '*'| '?'| '('| ')'| '|'| '['| ']'| '{' | 
|---|
| 263 | | '}'| '^'| '$'| '#'| '&'| '-'| '~'=> true, | 
|---|
| 264 | _ => false, | 
|---|
| 265 | } | 
|---|
| 266 | } | 
|---|
| 267 |  | 
|---|
| 268 | /// Returns true if the given character can be escaped in a regex. | 
|---|
| 269 | /// | 
|---|
| 270 | /// This returns true in all cases that `is_meta_character` returns true, but | 
|---|
| 271 | /// also returns true in some cases where `is_meta_character` returns false. | 
|---|
| 272 | /// For example, `%` is not a meta character, but it is escapeable. That is, | 
|---|
| 273 | /// `%` and `\%` both match a literal `%` in all contexts. | 
|---|
| 274 | /// | 
|---|
| 275 | /// The purpose of this routine is to provide knowledge about what characters | 
|---|
| 276 | /// may be escaped. Namely, most regex engines permit "superfluous" escapes | 
|---|
| 277 | /// where characters without any special significance may be escaped even | 
|---|
| 278 | /// though there is no actual _need_ to do so. | 
|---|
| 279 | /// | 
|---|
| 280 | /// This will return false for some characters. For example, `e` is not | 
|---|
| 281 | /// escapeable. Therefore, `\e` will either result in a parse error (which is | 
|---|
| 282 | /// true today), or it could backwards compatibly evolve into a new construct | 
|---|
| 283 | /// with its own meaning. Indeed, that is the purpose of banning _some_ | 
|---|
| 284 | /// superfluous escapes: it provides a way to evolve the syntax in a compatible | 
|---|
| 285 | /// manner. | 
|---|
| 286 | /// | 
|---|
| 287 | /// # Example | 
|---|
| 288 | /// | 
|---|
| 289 | /// ``` | 
|---|
| 290 | /// use regex_syntax::is_escapeable_character; | 
|---|
| 291 | /// | 
|---|
| 292 | /// assert!(is_escapeable_character( '?')); | 
|---|
| 293 | /// assert!(is_escapeable_character( '-')); | 
|---|
| 294 | /// assert!(is_escapeable_character( '&')); | 
|---|
| 295 | /// assert!(is_escapeable_character( '#')); | 
|---|
| 296 | /// assert!(is_escapeable_character( '%')); | 
|---|
| 297 | /// assert!(is_escapeable_character( '/')); | 
|---|
| 298 | /// assert!(is_escapeable_character( '!')); | 
|---|
| 299 | /// assert!(is_escapeable_character( '"')); | 
|---|
| 300 | /// | 
|---|
| 301 | /// assert!(!is_escapeable_character( 'e')); | 
|---|
| 302 | /// ``` | 
|---|
| 303 | pub fn is_escapeable_character(c: char) -> bool { | 
|---|
| 304 | // Certainly escapeable if it's a meta character. | 
|---|
| 305 | if is_meta_character(c) { | 
|---|
| 306 | return true; | 
|---|
| 307 | } | 
|---|
| 308 | // Any character that isn't ASCII is definitely not escapeable. There's | 
|---|
| 309 | // no real need to allow things like \☃ right? | 
|---|
| 310 | if !c.is_ascii() { | 
|---|
| 311 | return false; | 
|---|
| 312 | } | 
|---|
| 313 | // Otherwise, we basically say that everything is escapeable unless it's a | 
|---|
| 314 | // letter or digit. Things like \3 are either octal (when enabled) or an | 
|---|
| 315 | // error, and we should keep it that way. Otherwise, letters are reserved | 
|---|
| 316 | // for adding new syntax in a backwards compatible way. | 
|---|
| 317 | match c { | 
|---|
| 318 | '0'..= '9'| 'A'..= 'Z'| 'a'..= 'z'=> false, | 
|---|
| 319 | // While not currently supported, we keep these as not escapeable to | 
|---|
| 320 | // give us some flexibility with respect to supporting the \< and | 
|---|
| 321 | // \> word boundary assertions in the future. By rejecting them as | 
|---|
| 322 | // escapeable, \< and \> will result in a parse error. Thus, we can | 
|---|
| 323 | // turn them into something else in the future without it being a | 
|---|
| 324 | // backwards incompatible change. | 
|---|
| 325 | // | 
|---|
| 326 | // OK, now we support \< and \>, and we need to retain them as *not* | 
|---|
| 327 | // escapeable here since the escape sequence is significant. | 
|---|
| 328 | '<'| '>'=> false, | 
|---|
| 329 | _ => true, | 
|---|
| 330 | } | 
|---|
| 331 | } | 
|---|
| 332 |  | 
|---|
| 333 | /// Returns true if and only if the given character is a Unicode word | 
|---|
| 334 | /// character. | 
|---|
| 335 | /// | 
|---|
| 336 | /// A Unicode word character is defined by | 
|---|
| 337 | /// [UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties). | 
|---|
| 338 | /// In particular, a character | 
|---|
| 339 | /// is considered a word character if it is in either of the `Alphabetic` or | 
|---|
| 340 | /// `Join_Control` properties, or is in one of the `Decimal_Number`, `Mark` | 
|---|
| 341 | /// or `Connector_Punctuation` general categories. | 
|---|
| 342 | /// | 
|---|
| 343 | /// # Panics | 
|---|
| 344 | /// | 
|---|
| 345 | /// If the `unicode-perl` feature is not enabled, then this function | 
|---|
| 346 | /// panics. For this reason, it is recommended that callers use | 
|---|
| 347 | /// [`try_is_word_character`] instead. | 
|---|
| 348 | pub fn is_word_character(c: char) -> bool { | 
|---|
| 349 | try_is_word_character(c).expect(msg: "unicode-perl feature must be enabled") | 
|---|
| 350 | } | 
|---|
| 351 |  | 
|---|
| 352 | /// Returns true if and only if the given character is a Unicode word | 
|---|
| 353 | /// character. | 
|---|
| 354 | /// | 
|---|
| 355 | /// A Unicode word character is defined by | 
|---|
| 356 | /// [UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties). | 
|---|
| 357 | /// In particular, a character | 
|---|
| 358 | /// is considered a word character if it is in either of the `Alphabetic` or | 
|---|
| 359 | /// `Join_Control` properties, or is in one of the `Decimal_Number`, `Mark` | 
|---|
| 360 | /// or `Connector_Punctuation` general categories. | 
|---|
| 361 | /// | 
|---|
| 362 | /// # Errors | 
|---|
| 363 | /// | 
|---|
| 364 | /// If the `unicode-perl` feature is not enabled, then this function always | 
|---|
| 365 | /// returns an error. | 
|---|
| 366 | pub fn try_is_word_character( | 
|---|
| 367 | c: char, | 
|---|
| 368 | ) -> core::result::Result<bool, UnicodeWordError> { | 
|---|
| 369 | unicode::is_word_character(c) | 
|---|
| 370 | } | 
|---|
| 371 |  | 
|---|
| 372 | /// Returns true if and only if the given character is an ASCII word character. | 
|---|
| 373 | /// | 
|---|
| 374 | /// An ASCII word character is defined by the following character class: | 
|---|
| 375 | /// `[_0-9a-zA-Z]`. | 
|---|
| 376 | pub fn is_word_byte(c: u8) -> bool { | 
|---|
| 377 | match c { | 
|---|
| 378 | b'_'| b'0'..= b'9'| b'a'..= b'z'| b'A'..= b'Z'=> true, | 
|---|
| 379 | _ => false, | 
|---|
| 380 | } | 
|---|
| 381 | } | 
|---|
| 382 |  | 
|---|
| 383 | #[ cfg(test)] | 
|---|
| 384 | mod tests { | 
|---|
| 385 | use alloc::string::ToString; | 
|---|
| 386 |  | 
|---|
| 387 | use super::*; | 
|---|
| 388 |  | 
|---|
| 389 | #[ test] | 
|---|
| 390 | fn escape_meta() { | 
|---|
| 391 | assert_eq!( | 
|---|
| 392 | escape( r"\.+*?()|[]{}^$#&-~"), | 
|---|
| 393 | r"\\\.\+\*\?\(\)\|\[\]\{\}\^\$\#\&\-\~".to_string() | 
|---|
| 394 | ); | 
|---|
| 395 | } | 
|---|
| 396 |  | 
|---|
| 397 | #[ test] | 
|---|
| 398 | fn word_byte() { | 
|---|
| 399 | assert!(is_word_byte( b'a')); | 
|---|
| 400 | assert!(!is_word_byte( b'-')); | 
|---|
| 401 | } | 
|---|
| 402 |  | 
|---|
| 403 | #[ test] | 
|---|
| 404 | #[ cfg(feature = "unicode-perl")] | 
|---|
| 405 | fn word_char() { | 
|---|
| 406 | assert!(is_word_character( 'a'), "ASCII"); | 
|---|
| 407 | assert!(is_word_character( 'à'), "Latin-1"); | 
|---|
| 408 | assert!(is_word_character( 'β'), "Greek"); | 
|---|
| 409 | assert!(is_word_character( '\u{11011} '), "Brahmi (Unicode 6.0)"); | 
|---|
| 410 | assert!(is_word_character( '\u{11611} '), "Modi (Unicode 7.0)"); | 
|---|
| 411 | assert!(is_word_character( '\u{11711} '), "Ahom (Unicode 8.0)"); | 
|---|
| 412 | assert!(is_word_character( '\u{17828} '), "Tangut (Unicode 9.0)"); | 
|---|
| 413 | assert!(is_word_character( '\u{1B1B1} '), "Nushu (Unicode 10.0)"); | 
|---|
| 414 | assert!(is_word_character( '\u{16E40} '), "Medefaidrin (Unicode 11.0)"); | 
|---|
| 415 | assert!(!is_word_character( '-')); | 
|---|
| 416 | assert!(!is_word_character( '☃')); | 
|---|
| 417 | } | 
|---|
| 418 |  | 
|---|
| 419 | #[ test] | 
|---|
| 420 | #[ should_panic] | 
|---|
| 421 | #[ cfg(not(feature = "unicode-perl"))] | 
|---|
| 422 | fn word_char_disabled_panic() { | 
|---|
| 423 | assert!(is_word_character( 'a')); | 
|---|
| 424 | } | 
|---|
| 425 |  | 
|---|
| 426 | #[ test] | 
|---|
| 427 | #[ cfg(not(feature = "unicode-perl"))] | 
|---|
| 428 | fn word_char_disabled_error() { | 
|---|
| 429 | assert!(try_is_word_character( 'a').is_err()); | 
|---|
| 430 | } | 
|---|
| 431 | } | 
|---|
| 432 |  | 
|---|