/*!
A byte string library.

Byte strings are just like standard Unicode strings with one very important
difference: byte strings are only *conventionally* UTF-8 while Rust's standard
Unicode strings are *guaranteed* to be valid UTF-8. The primary motivation for
byte strings is for handling arbitrary bytes that are mostly UTF-8.

# Overview

This crate provides two important traits that provide string oriented methods
on `&[u8]` and `Vec<u8>` types:

* [`ByteSlice`](trait.ByteSlice.html) extends the `[u8]` type with additional
  string oriented methods.
* [`ByteVec`](trait.ByteVec.html) extends the `Vec<u8>` type with additional
  string oriented methods.

Additionally, this crate provides two concrete byte string types that deref to
`[u8]` and `Vec<u8>`. These are useful for storing byte string types, and come
with convenient `std::fmt::Debug` implementations:

* [`BStr`](struct.BStr.html) is a byte string slice, analogous to `str`.
* [`BString`](struct.BString.html) is an owned growable byte string buffer,
  analogous to `String`.

Additionally, the free function [`B`](fn.B.html) serves as a convenient short
hand for writing byte string literals.

# Quick examples

Byte strings build on the existing APIs for `Vec<u8>` and `&[u8]`, with
additional string oriented methods. Operations such as iterating over
graphemes, searching for substrings, replacing substrings, trimming and case
conversion are examples of things not provided on the standard library `&[u8]`
APIs but are provided by this crate. For example, this code iterates over all
occurrences of a substring:

```
use bstr::ByteSlice;

let s = b"foo bar foo foo quux foo";

let mut matches = vec![];
for start in s.find_iter("foo") {
    matches.push(start);
}
assert_eq!(matches, [0, 8, 12, 21]);
```

Here's another example showing how to do a search and replace (and also showing
use of the `B` function):

```
# #[cfg(feature = "alloc")] {
use bstr::{B, ByteSlice};

let old = B("foo ☃☃☃ foo foo quux foo");
let new = old.replace("foo", "hello");
assert_eq!(new, B("hello ☃☃☃ hello hello quux hello"));
# }
```

And here's an example that shows case conversion, even in the presence of
invalid UTF-8:

```
# #[cfg(all(feature = "alloc", feature = "unicode"))] {
use bstr::{ByteSlice, ByteVec};

let mut lower = Vec::from("hello β");
lower[0] = b'\xFF';
// lowercase β is uppercased to Β
assert_eq!(lower.to_uppercase(), b"\xFFELLO \xCE\x92");
# }
```

# Convenient debug representation

When working with byte strings, it is often useful to be able to print them
as if they were byte strings and not sequences of integers. While this crate
cannot affect the `std::fmt::Debug` implementations for `[u8]` and `Vec<u8>`,
this crate does provide the `BStr` and `BString` types which have convenient
`std::fmt::Debug` implementations.

For example, this

```
use bstr::ByteSlice;

let mut bytes = Vec::from("hello β");
bytes[0] = b'\xFF';

println!("{:?}", bytes.as_bstr());
```

will output `"\xFFello β"`.

This example works because the
[`ByteSlice::as_bstr`](trait.ByteSlice.html#method.as_bstr)
method converts any `&[u8]` to a `&BStr`.
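
For contrast, here is a minimal sketch of what one gets with only the standard
library. The helpers `debug_as_ints` and `escape_ascii_only` are hypothetical
names introduced for this illustration and are not part of this crate. Note
that the second helper escapes every non-ASCII byte, so its output is not the
same as the `BStr` output shown above, which keeps valid UTF-8 such as `β`
readable.

```rust
/// Hypothetical helper: the standard `Debug` output for a byte slice, which
/// is a list of integers.
fn debug_as_ints(bytes: &[u8]) -> String {
    format!("{:?}", bytes)
}

/// Hypothetical helper: escape bytes one at a time with the standard
/// library's `std::ascii::escape_default`. Printable ASCII is kept as-is and
/// every other byte becomes a `\x..` escape, even when it is part of valid
/// UTF-8.
fn escape_ascii_only(bytes: &[u8]) -> String {
    bytes
        .iter()
        .flat_map(|&b| std::ascii::escape_default(b))
        .map(char::from)
        .collect()
}
```

With these, `debug_as_ints(b"hi")` yields `[104, 105]`, which is rarely what
one wants to read when the bytes are mostly text.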

# When should I use byte strings?

This library reflects my belief that UTF-8 by convention is a better trade
off in some circumstances than guaranteed UTF-8.

The first time this idea hit me was in the implementation of Rust's regex
engine. In particular, very little of the internal implementation cares at all
about searching valid UTF-8 encoded strings. Indeed, internally, the
implementation converts `&str` from the API to `&[u8]` fairly quickly and
just deals with raw bytes. UTF-8 match boundaries are then guaranteed by the
finite state machine itself rather than any specific string type. This makes it
possible to not only run regexes on `&str` values, but also on `&[u8]` values.

Why would you ever want to run a regex on a `&[u8]` though? Well, `&[u8]` is
the fundamental way in which one reads data from all sorts of streams, via the
standard library's [`Read`](https://doc.rust-lang.org/std/io/trait.Read.html)
trait. In particular, there is no platform independent way to determine whether
what you're reading from is some binary file or a human readable text file.
Therefore, if you're writing a program to search files, you probably need to
deal with `&[u8]` directly unless you're okay with first converting it to a
`&str` and dropping any bytes that aren't valid UTF-8. (Or otherwise determine
the encoding---which is often impractical---and perform a transcoding step.)
Often, the simplest and most robust way to approach this is to simply treat the
contents of a file as if it were mostly valid UTF-8 and pass through invalid
UTF-8 untouched. This may not be the most correct approach though!
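
To make the "pass through invalid UTF-8 untouched" idea concrete, here is a
minimal sketch that uses only the standard library. Both `read_all` and
`find_all` are hypothetical helpers written for this illustration. They are
not part of this crate, and the naive search is only a stand-in for what
`ByteSlice::find_iter` provides.

```rust
use std::io::{self, Read};

/// Hypothetical helper: read everything from `rdr` as raw bytes. No UTF-8
/// validation or transcoding happens here.
fn read_all<R: Read>(mut rdr: R) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    rdr.read_to_end(&mut buf)?;
    Ok(buf)
}

/// Hypothetical helper: a naive search that returns the starting offsets of
/// all non-overlapping occurrences of `needle`. It only compares bytes, so
/// invalid UTF-8 elsewhere in `haystack` does not matter.
fn find_all(haystack: &[u8], needle: &[u8]) -> Vec<usize> {
    let mut starts = Vec::new();
    if needle.is_empty() {
        return starts;
    }
    let mut i = 0;
    while i + needle.len() <= haystack.len() {
        if &haystack[i..i + needle.len()] == needle {
            starts.push(i);
            i += needle.len();
        } else {
            i += 1;
        }
    }
    starts
}
```

Searching the bytes `foo \xFF\xFE foo` for `foo` with these helpers finds both
occurrences even though the data as a whole is not valid UTF-8.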

One case in particular exacerbates these issues, and that's memory mapping
a file. When you memory map a file, that file may be gigabytes big, but all
you get is a `&[u8]`. Converting that to a `&str` all in one go is generally
not a good idea because of the costs associated with doing so, and also
because it generally causes one to do two passes over the data instead of
one, which is quite undesirable. It is of course usually possible to do it in
an incremental way by only parsing chunks at a time, but this is often complex
to do or impractical. For example, many regex engines only accept one
contiguous sequence of bytes at a time with no way to perform incremental
matching.
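
As a rough sketch of why the incremental approach takes some care, consider
validating UTF-8 one chunk at a time with only the standard library. The
function `is_utf8_chunked` is a hypothetical helper for this illustration and
not part of this crate. The wrinkle is that a codepoint may be split across
two chunks, so the incomplete tail of one chunk has to be carried over to the
next.

```rust
/// Hypothetical helper: report whether the concatenation of `chunks` is
/// valid UTF-8 without ever joining all of the chunks together.
fn is_utf8_chunked<'a, I>(chunks: I) -> bool
where
    I: IntoIterator<Item = &'a [u8]>,
{
    // Bytes from the end of the previous chunk that may be the start of a
    // codepoint completed by the next chunk.
    let mut carry: Vec<u8> = Vec::new();
    for chunk in chunks {
        carry.extend_from_slice(chunk);
        // Discard the `&str` immediately so `carry` can be mutated below.
        match std::str::from_utf8(&carry).map(|_| ()) {
            Ok(()) => carry.clear(),
            Err(err) => match err.error_len() {
                // A byte that can never be part of valid UTF-8 here.
                Some(_) => return false,
                // The input ended mid-codepoint. Keep only the incomplete
                // tail and wait for more bytes.
                None => {
                    carry.drain(..err.valid_up_to());
                }
            },
        }
    }
    // Leftover bytes mean the data ended in the middle of a codepoint.
    carry.is_empty()
}
```

Even this small task needs state that survives across chunks, and search
routines that must match across chunk boundaries tend to be harder still.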

# `bstr` in public APIs

This library is past version `1` and is expected to remain at version `1` for
the foreseeable future. Therefore, it is encouraged to put types from `bstr`
(like `BStr` and `BString`) in your public API if that makes sense for your
crate.

With that said, in general, it should be possible to avoid putting anything
in this crate into your public APIs. Namely, you should never need to use the
`ByteSlice` or `ByteVec` traits as bounds on public APIs, since their only
purpose is to extend the methods on the concrete types `[u8]` and `Vec<u8>`,
respectively. Similarly, it should not be necessary to put either the `BStr` or
`BString` types into public APIs. If you want to use them internally, then they
can be converted to/from `[u8]`/`Vec<u8>` as needed. The conversions are free.

So while it shouldn't ever be 100% necessary to make `bstr` a public
dependency, there may be cases where it is convenient to do so. This is an
explicitly supported use case of `bstr`, and as such, major version releases
should be exceptionally rare.

# Differences with standard strings

The primary difference between `[u8]` and `str` is that the former is
conventionally UTF-8 while the latter is guaranteed to be UTF-8. The phrase
"conventionally UTF-8" means that a `[u8]` may contain bytes that do not form
a valid UTF-8 sequence, but operations defined on the type in this crate are
generally most useful on valid UTF-8 sequences. For example, iterating over
Unicode codepoints or grapheme clusters is an operation that is only defined
on valid UTF-8. Therefore, when invalid UTF-8 is encountered, the Unicode
replacement codepoint is substituted. Thus, a byte string that is not UTF-8 at
all is of limited utility when using this crate.

However, not all operations on byte strings are specifically Unicode aware. For
example, substring search has no specific Unicode semantics ascribed to it. It
works just as well for byte strings that are completely valid UTF-8 as for byte
strings that contain no valid UTF-8 at all. Similarly for replacements and
various other operations that do not need any Unicode specific tailoring.

Aside from the difference in how UTF-8 is handled, the APIs between `[u8]` and
`str` (and `Vec<u8>` and `String`) are intentionally very similar, including
maintaining the same behavior for corner cases in things like substring
splitting. There are, however, some differences:

* Substring search is not done with `matches`, but instead, `find_iter`.
  In general, this crate does not define any generic
  [`Pattern`](https://doc.rust-lang.org/std/str/pattern/trait.Pattern.html)
  infrastructure, and instead prefers adding new methods for different
  argument types. For example, `matches` can search by a `char` or a `&str`,
  whereas `find_iter` can only search by a byte string. `find_char` can be
  used for searching by a `char`.
* Since `SliceConcatExt` in the standard library is unstable, it is not
  possible to reuse that to implement `join` and `concat` methods. Instead,
  [`join`](fn.join.html) and [`concat`](fn.concat.html) are provided as free
  functions that perform a similar task.
* This library bundles in a few more Unicode operations, such as grapheme,
  word and sentence iterators. More operations, such as normalization and
  case folding, may be provided in the future.
* Some `String`/`str` APIs will panic if a particular index was not on a valid
  UTF-8 code unit sequence boundary. Conversely, no such checking is performed
  in this crate, as is consistent with treating byte strings as a sequence of
  bytes. This means callers are responsible for maintaining a UTF-8 invariant
  if that's important.
* Some routines provided by this crate, such as `starts_with_str`, have a
  `_str` suffix to differentiate them from similar routines already defined
  on the `[u8]` type. The difference is that `starts_with` requires its
  parameter to be a `&[u8]`, whereas `starts_with_str` permits its parameter
  to be anything that implements `AsRef<[u8]>`, which is more flexible. This
  means you can write `bytes.starts_with_str("☃")` instead of
  `bytes.starts_with("☃".as_bytes())`.

Otherwise, you should find most of the APIs between this crate and the standard
library string APIs to be very similar, if not identical.
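
The last two differences in the list above can be sketched with only the
standard library. The helpers below (`slice_str`, `slice_bytes` and
`starts_with_flexible`) are hypothetical names made up for this illustration.
In particular, `starts_with_flexible` only mimics the shape of an
`AsRef<[u8]>` parameter and is not this crate's implementation of
`starts_with_str`.

```rust
/// Hypothetical helper: slice a `str` by byte offsets. This returns `None`
/// when an offset is not on a UTF-8 boundary (plain indexing would panic).
fn slice_str(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

/// Hypothetical helper: slice the same data as bytes. Only the bounds are
/// checked, so cutting a codepoint in half is allowed.
fn slice_bytes(s: &str, start: usize, end: usize) -> Option<&[u8]> {
    s.as_bytes().get(start..end)
}

/// Hypothetical helper: accept anything that can be viewed as bytes, such
/// as `&str`, `&[u8]`, byte string literals or `Vec<u8>`.
fn starts_with_flexible<P: AsRef<[u8]>>(haystack: &[u8], prefix: P) -> bool {
    haystack.starts_with(prefix.as_ref())
}
```

For example, `slice_str("a☃", 0, 2)` is `None` because offset `2` falls inside
the three byte encoding of `☃`, while `slice_bytes("a☃", 0, 2)` happily
returns the bytes `a\xE2`.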

# Handling of invalid UTF-8

Since byte strings are only *conventionally* UTF-8, there is no guarantee
that byte strings contain valid UTF-8. Indeed, it is perfectly legal for a
byte string to contain arbitrary bytes. However, since this library defines
a *string* type, it provides many operations specified by Unicode. These
operations are typically only defined over codepoints, and thus have no real
meaning on bytes that are invalid UTF-8 because they do not map to a particular
codepoint.

For this reason, whenever operations defined only on codepoints are used, this
library will automatically convert invalid UTF-8 to the Unicode replacement
codepoint, `U+FFFD`, which looks like this: `�`. For example, an
[iterator over codepoints](struct.Chars.html) will yield a Unicode
replacement codepoint whenever it comes across bytes that are not valid UTF-8:

```
use bstr::ByteSlice;

let bs = b"a\xFF\xFFz";
let chars: Vec<char> = bs.chars().collect();
assert_eq!(vec!['a', '\u{FFFD}', '\u{FFFD}', 'z'], chars);
```

There are a few ways in which invalid bytes can be substituted with a Unicode
replacement codepoint. One way, not used by this crate, is to replace every
individual invalid byte with a single replacement codepoint. In contrast, the
approach this crate uses is called the "substitution of maximal subparts," as
specified by the Unicode Standard (Chapter 3, Section 9). (This approach is
also used by [W3C's Encoding Standard](https://www.w3.org/TR/encoding/).) In
this strategy, a replacement codepoint is inserted whenever a byte is found
that cannot possibly lead to a valid UTF-8 code unit sequence. If there were
previous bytes that represented a *prefix* of a well-formed UTF-8 code unit
sequence, then all of those bytes (up to 3) are substituted with a single
replacement codepoint. For example:

```
use bstr::ByteSlice;

let bs = b"a\xF0\x9F\x87z";
let chars: Vec<char> = bs.chars().collect();
// The bytes \xF0\x9F\x87 could lead to a valid UTF-8 sequence, but 3 of them
// on their own are invalid. Only one replacement codepoint is substituted,
// which demonstrates the "substitution of maximal subparts" strategy.
assert_eq!(vec!['a', '\u{FFFD}', 'z'], chars);
```
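
To see how the two strategies differ, here is a minimal sketch of the *other*
strategy, the one this crate does not use, written with only the standard
library. The function `per_byte_lossy` is a hypothetical helper for this
illustration. For comparison, the standard library's
`String::from_utf8_lossy` collapses a truncated sequence into a single
replacement codepoint, in line with its documented example.

```rust
/// Hypothetical helper: lossily decode `bytes`, substituting one replacement
/// codepoint for every individual invalid byte. This is NOT the strategy
/// used by this crate.
fn per_byte_lossy(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    loop {
        match std::str::from_utf8(bytes) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // Everything before the error offset is known to be valid.
                out.push_str(std::str::from_utf8(valid).unwrap());
                // Replace exactly one byte, then resume right after it.
                out.push('\u{FFFD}');
                bytes = &rest[1..];
            }
        }
    }
}
```

On the input `a\xF0\x9F\x87z`, this helper produces three replacement
codepoints between `a` and `z`, whereas substitution of maximal subparts
produces only one.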

If you do need to access the raw bytes for some reason in an iterator like
`Chars`, then you should use the iterator's "indices" variant, which gives
the byte offsets containing the invalid UTF-8 bytes that were substituted with
the replacement codepoint. For example:

```
use bstr::{B, ByteSlice};

let bs = b"a\xE2\x98z";
let chars: Vec<(usize, usize, char)> = bs.char_indices().collect();
// Even though the replacement codepoint is encoded as 3 bytes itself, the
// byte range given here is only two bytes, corresponding to the original
// raw bytes.
assert_eq!(vec![(0, 1, 'a'), (1, 3, '\u{FFFD}'), (3, 4, 'z')], chars);

// Thus, getting the original raw bytes is as simple as slicing the original
// byte string:
let chars: Vec<&[u8]> = bs.char_indices().map(|(s, e, _)| &bs[s..e]).collect();
assert_eq!(vec![B("a"), B(b"\xE2\x98"), B("z")], chars);
```
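
For comparison, similar byte ranges can be recovered with only the standard
library by inspecting `std::str::Utf8Error`. The function `invalid_ranges` is
a hypothetical helper for this sketch and not part of this crate. It reports
only the ranges that would be substituted, not the valid codepoints in
between.

```rust
/// Hypothetical helper: return the `(start, end)` byte ranges of each
/// invalid UTF-8 sequence in `bytes`.
fn invalid_ranges(bytes: &[u8]) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut offset = 0;
    while offset < bytes.len() {
        match std::str::from_utf8(&bytes[offset..]) {
            // The remainder is entirely valid, so there is nothing left to
            // report.
            Ok(_) => break,
            Err(err) => {
                let start = offset + err.valid_up_to();
                let end = match err.error_len() {
                    // An unexpected byte ended the sequence after `len`
                    // bytes.
                    Some(len) => start + len,
                    // The input ended in the middle of a sequence.
                    None => bytes.len(),
                };
                ranges.push((start, end));
                offset = end;
            }
        }
    }
    ranges
}
```

For the byte string `a\xE2\x98z` used above, this yields the single range
`(1, 3)`, matching the range reported by `char_indices`.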

# File paths and OS strings

One of the premier features of Rust's standard library is how it handles file
paths. In particular, it makes it very hard to write incorrect code while
simultaneously providing a correct cross platform abstraction for manipulating
file paths. The key challenge that one faces with file paths across platforms
is derived from the following observations:

* On most Unix-like systems, file paths are an arbitrary sequence of bytes.
* On Windows, file paths are an arbitrary sequence of 16-bit integers.

(In both cases, certain sequences aren't allowed. For example a `NUL` byte is
not allowed in either case. But we can ignore this for the purposes of this
section.)

Byte strings, like the ones provided in this crate, line up really well with
file paths on Unix like systems, which are themselves just arbitrary sequences
of bytes. It turns out that if you treat them as "mostly UTF-8," then things
work out pretty well. In contrast, byte strings _don't_ really work
that well on Windows because it's not possible to correctly roundtrip file
paths between 16-bit integers and something that looks like UTF-8 _without_
explicitly defining an encoding to do this for you, which is anathema to byte
strings, which are just bytes.

Rust's standard library elegantly solves this problem by specifying an
internal encoding for file paths that's only used on Windows called
[WTF-8](https://simonsapin.github.io/wtf-8/). Its key properties are that it
permits losslessly roundtripping file paths on Windows by extending UTF-8 to
support an encoding of surrogate codepoints, while simultaneously supporting
zero-cost conversion from Rust's Unicode strings to file paths. (Since UTF-8 is
a proper subset of WTF-8.)

The fundamental point at which the above strategy fails is when you want to
treat file paths as things that look like strings in a zero cost way. In most
cases, this is actually the wrong thing to do, but some cases call for it,
for example, glob or regex matching on file paths. This is because WTF-8 is
treated as an internal implementation detail, and there is no way to access
those bytes via a public API. Therefore, such consumers are limited in what
they can do:

1. One could re-implement WTF-8 and re-encode file paths on Windows to WTF-8
   by accessing their underlying 16-bit integer representation. Unfortunately,
   this isn't zero cost (it introduces a second WTF-8 decoding step) and it's
   not clear this is a good thing to do, since WTF-8 should ideally remain an
   internal implementation detail. This is roughly the approach taken by the
   [`os_str_bytes`](https://crates.io/crates/os_str_bytes) crate.
2. One could instead declare that they will not handle paths on Windows that
   are not valid UTF-16, and return an error when one is encountered.
3. Like (2), but instead of returning an error, lossily decode the file path
   on Windows that isn't valid UTF-16 into UTF-16 by replacing invalid bytes
   with the Unicode replacement codepoint.
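
Options (2) and (3) can be sketched in a platform independent way by working
on raw 16-bit code units with the standard library. The helpers
`decode_strict` and `decode_lossy` are hypothetical names for this
illustration. They decode into a Rust `String` and are not the conversion
routines provided by this crate.

```rust
/// Hypothetical helper for option (2): refuse anything that is not valid
/// UTF-16, such as an unpaired surrogate.
fn decode_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Hypothetical helper for option (3): substitute the Unicode replacement
/// codepoint for anything that is not valid UTF-16.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

Given the code units `[0x61, 0xD800, 0x62]`, where `0xD800` is an unpaired
surrogate, `decode_strict` returns `None` while `decode_lossy` returns
`a�b`. In the lossy case the original code units can no longer be recovered,
which is why roundtripping fails.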

While this library may provide facilities for (1) in the future, currently,
this library only provides facilities for (2) and (3). In particular, a suite
of conversion functions are provided that permit converting between byte
strings, OS strings and file paths. For owned byte strings, they are:

* [`ByteVec::from_os_string`](trait.ByteVec.html#method.from_os_string)
* [`ByteVec::from_os_str_lossy`](trait.ByteVec.html#method.from_os_str_lossy)
* [`ByteVec::from_path_buf`](trait.ByteVec.html#method.from_path_buf)
* [`ByteVec::from_path_lossy`](trait.ByteVec.html#method.from_path_lossy)
* [`ByteVec::into_os_string`](trait.ByteVec.html#method.into_os_string)
* [`ByteVec::into_os_string_lossy`](trait.ByteVec.html#method.into_os_string_lossy)
* [`ByteVec::into_path_buf`](trait.ByteVec.html#method.into_path_buf)
* [`ByteVec::into_path_buf_lossy`](trait.ByteVec.html#method.into_path_buf_lossy)

For byte string slices, they are:

* [`ByteSlice::from_os_str`](trait.ByteSlice.html#method.from_os_str)
* [`ByteSlice::from_path`](trait.ByteSlice.html#method.from_path)
* [`ByteSlice::to_os_str`](trait.ByteSlice.html#method.to_os_str)
* [`ByteSlice::to_os_str_lossy`](trait.ByteSlice.html#method.to_os_str_lossy)
* [`ByteSlice::to_path`](trait.ByteSlice.html#method.to_path)
* [`ByteSlice::to_path_lossy`](trait.ByteSlice.html#method.to_path_lossy)

On Unix, all of these conversions are rigorously zero cost, which gives one
a way to ergonomically deal with raw file paths exactly as they are using
normal string-related functions. On Windows, these conversion routines perform
a UTF-8 check and either return an error or lossily decode the file path
into valid UTF-8, depending on which function you use. This means that you
cannot roundtrip all file paths on Windows correctly using these conversion
routines. However, this may be an acceptable downside since such file paths
are exceptionally rare. Moreover, roundtripping isn't always necessary, for
example, if all you're doing is filtering based on file paths.
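
The platform split described above can be sketched with only the standard
library. The function `path_to_bytes` is a hypothetical helper written for
this illustration. It shows the general shape of a fallible path-to-bytes
conversion and should not be read as this crate's actual implementation.

```rust
use std::path::Path;

/// Hypothetical helper: view a path as bytes. On Unix this is a free
/// reinterpretation that always succeeds.
#[cfg(unix)]
fn path_to_bytes(path: &Path) -> Option<&[u8]> {
    use std::os::unix::ffi::OsStrExt;
    Some(path.as_os_str().as_bytes())
}

/// Hypothetical helper: on other platforms, fall back to a UTF-8 check and
/// return `None` when the path is not valid Unicode.
#[cfg(not(unix))]
fn path_to_bytes(path: &Path) -> Option<&[u8]> {
    path.to_str().map(|s| s.as_bytes())
}
```

Callers that only filter on paths can treat `None` as "skip this entry",
while callers that must roundtrip every path need a different strategy.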

The reason why using byte strings for this is potentially superior to the
standard library's approach is that a lot of Rust code is already lossily
converting file paths to Rust's Unicode strings, which are required to be valid
UTF-8, and thus contain latent bugs on Unix where paths with invalid UTF-8 are
not terribly uncommon. If you instead use byte strings, then you're guaranteed
to write correct code for Unix, at the cost of getting a corner case wrong on
Windows.
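
The kind of latent bug meant here can be sketched without touching the file
system. The function `lossy_names_collide` is a hypothetical helper for this
illustration, and it uses `String::from_utf8_lossy` on raw bytes as a stand-in
for lossy path conversion on Unix, where a path is just a sequence of bytes.

```rust
/// Hypothetical helper: report whether two *different* byte strings become
/// indistinguishable after lossy conversion to a Unicode string.
fn lossy_names_collide(a: &[u8], b: &[u8]) -> bool {
    a != b && String::from_utf8_lossy(a) == String::from_utf8_lossy(b)
}
```

For example, the distinct names `report-\xFF.txt` and `report-\xFE.txt` both
become `report-�.txt`, so code that keys on the lossy string would confuse the
two files, and neither original name can be recovered from it.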

# Cargo features

This crate comes with a few features that control standard library, serde
and Unicode support.

* `std` - **Enabled** by default. This provides APIs that require the standard
  library, such as `Vec<u8>` and `PathBuf`. Enabling this feature also enables
  the `alloc` feature and any other relevant `std` features for dependencies.
* `alloc` - **Enabled** by default. This provides APIs that require allocations
  via the `alloc` crate, such as `Vec<u8>`.
* `unicode` - **Enabled** by default. This provides APIs that require sizable
  Unicode data compiled into the binary. This includes, but is not limited to,
  grapheme/word/sentence segmenters. When this is disabled, basic support such
  as UTF-8 decoding is still included. Note that currently, enabling this
  feature also requires enabling the `std` feature. It is expected that this
  limitation will be lifted at some point.
* `serde` - Enables implementations of serde traits for `BStr`, and also
  `BString` when `alloc` is enabled.
*/

#![cfg_attr(not(any(feature = "std", test)), no_std)]
#![cfg_attr(docsrs, feature(doc_auto_cfg))]

// Why do we do this? Well, in order for us to use once_cell's 'Lazy' type to
// load DFAs, it requires enabling its 'std' feature. Yet, there is really
// nothing about our 'unicode' feature that requires 'std'. We could declare
// that 'unicode = [std, ...]', which would be fine, but once regex-automata
// 0.3 is a thing, I believe we can drop once_cell altogether and thus drop
// the need for 'std' to be enabled when 'unicode' is enabled. But if we make
// 'unicode' also enable 'std', then it would be a breaking change to remove
// 'std' from that list.
//
// So, for right now, we force folks to explicitly say they want 'std' if they
// want 'unicode'. In the future, we should be able to relax this.
#[cfg(all(feature = "unicode", not(feature = "std")))]
compile_error!("enabling 'unicode' requires enabling 'std'");

#[cfg(feature = "alloc")]
extern crate alloc;

pub use crate::bstr::BStr;
#[cfg(feature = "alloc")]
pub use crate::bstring::BString;
pub use crate::escape_bytes::EscapeBytes;
#[cfg(feature = "unicode")]
pub use crate::ext_slice::Fields;
pub use crate::ext_slice::{
    ByteSlice, Bytes, FieldsWith, Find, FindReverse, Finder, FinderReverse,
    Lines, LinesWithTerminator, Split, SplitN, SplitNReverse, SplitReverse, B,
};
#[cfg(feature = "alloc")]
pub use crate::ext_vec::{concat, join, ByteVec, DrainBytes, FromUtf8Error};
#[cfg(feature = "unicode")]
pub use crate::unicode::{
    GraphemeIndices, Graphemes, SentenceIndices, Sentences, WordIndices,
    Words, WordsWithBreakIndices, WordsWithBreaks,
};
pub use crate::utf8::{
    decode as decode_utf8, decode_last as decode_last_utf8, CharIndices,
    Chars, Utf8Chunk, Utf8Chunks, Utf8Error,
};

mod ascii;
mod bstr;
#[cfg(feature = "alloc")]
mod bstring;
mod byteset;
mod escape_bytes;
mod ext_slice;
#[cfg(feature = "alloc")]
mod ext_vec;
mod impls;
#[cfg(feature = "std")]
pub mod io;
#[cfg(all(test, feature = "std"))]
mod tests;
#[cfg(feature = "unicode")]
mod unicode;
mod utf8;

#[cfg(all(test, feature = "std"))]
mod apitests {
    use crate::{
        bstr::BStr,
        bstring::BString,
        ext_slice::{Finder, FinderReverse},
    };

    #[test]
    fn oibits() {
        use std::panic::{RefUnwindSafe, UnwindSafe};

        fn assert_send<T: Send>() {}
        fn assert_sync<T: Sync>() {}
        fn assert_unwind_safe<T: RefUnwindSafe + UnwindSafe>() {}

        assert_send::<&BStr>();
        assert_sync::<&BStr>();
        assert_unwind_safe::<&BStr>();
        assert_send::<BString>();
        assert_sync::<BString>();
        assert_unwind_safe::<BString>();

        assert_send::<Finder<'_>>();
        assert_sync::<Finder<'_>>();
        assert_unwind_safe::<Finder<'_>>();
        assert_send::<FinderReverse<'_>>();
        assert_sync::<FinderReverse<'_>>();
        assert_unwind_safe::<FinderReverse<'_>>();
    }
}