/*!
A byte string library.

Byte strings are just like standard Unicode strings with one very important
difference: byte strings are only *conventionally* UTF-8 while Rust's standard
Unicode strings are *guaranteed* to be valid UTF-8. The primary motivation for
byte strings is for handling arbitrary bytes that are mostly UTF-8.

# Overview

This crate provides two important traits that provide string oriented methods
on `&[u8]` and `Vec<u8>` types:

* [`ByteSlice`](trait.ByteSlice.html) extends the `[u8]` type with additional
  string oriented methods.
* [`ByteVec`](trait.ByteVec.html) extends the `Vec<u8>` type with additional
  string oriented methods.

Additionally, this crate provides two concrete byte string types that deref to
`[u8]` and `Vec<u8>`. These are useful for storing byte string types, and come
with convenient `std::fmt::Debug` implementations:

* [`BStr`](struct.BStr.html) is a byte string slice, analogous to `str`.
* [`BString`](struct.BString.html) is an owned growable byte string buffer,
  analogous to `String`.

Additionally, the free function [`B`](fn.B.html) serves as a convenient short
hand for writing byte string literals.

# Quick examples

Byte strings build on the existing APIs for `Vec<u8>` and `&[u8]`, with
additional string oriented methods. Operations such as iterating over
graphemes, searching for substrings, replacing substrings, trimming and case
conversion are examples of things not provided on the standard library `&[u8]`
APIs but are provided by this crate. For example, this code iterates over all
occurrences of a substring:

```
use bstr::ByteSlice;

let s = b"foo bar foo foo quux foo";

let mut matches = vec![];
for start in s.find_iter("foo") {
    matches.push(start);
}
assert_eq!(matches, [0, 8, 12, 21]);
```

Here's another example showing how to do a search and replace (and also showing
use of the `B` function):

```
# #[cfg(feature = "alloc")] {
use bstr::{B, ByteSlice};

let old = B("foo ☃☃☃ foo foo quux foo");
let new = old.replace("foo", "hello");
assert_eq!(new, B("hello ☃☃☃ hello hello quux hello"));
# }
```

And here's an example that shows case conversion, even in the presence of
invalid UTF-8:

```
# #[cfg(all(feature = "alloc", feature = "unicode"))] {
use bstr::{ByteSlice, ByteVec};

let mut lower = Vec::from("hello β");
lower[0] = b'\xFF';
// lowercase β is uppercased to Β
assert_eq!(lower.to_uppercase(), b"\xFFELLO \xCE\x92");
# }
```

# Convenient debug representation

When working with byte strings, it is often useful to be able to print them
as if they were byte strings and not sequences of integers. While this crate
cannot affect the `std::fmt::Debug` implementations for `[u8]` and `Vec<u8>`,
this crate does provide the `BStr` and `BString` types which have convenient
`std::fmt::Debug` implementations.

For example, this

```
use bstr::ByteSlice;

let mut bytes = Vec::from("hello β");
bytes[0] = b'\xFF';

println!("{:?}", bytes.as_bstr());
```

will output `"\xFFello β"`.

This example works because the
[`ByteSlice::as_bstr`](trait.ByteSlice.html#method.as_bstr)
method converts any `&[u8]` to a `&BStr`.

# When should I use byte strings?

This library reflects my belief that UTF-8 by convention is a better trade
off in some circumstances than guaranteed UTF-8.

The first time this idea hit me was in the implementation of Rust's regex
engine. In particular, very little of the internal implementation cares at all
about searching valid UTF-8 encoded strings. Indeed, internally, the
implementation converts `&str` from the API to `&[u8]` fairly quickly and
just deals with raw bytes. UTF-8 match boundaries are then guaranteed by the
finite state machine itself rather than any specific string type. This makes it
possible to not only run regexes on `&str` values, but also on `&[u8]` values.

Why would you ever want to run a regex on a `&[u8]` though? Well, `&[u8]` is
the fundamental way in which one reads data from all sorts of streams, via the
standard library's [`Read`](https://doc.rust-lang.org/std/io/trait.Read.html)
trait. In particular, there is no platform independent way to determine whether
what you're reading from is some binary file or a human readable text file.
Therefore, if you're writing a program to search files, you probably need to
deal with `&[u8]` directly unless you're okay with first converting it to a
`&str` and dropping any bytes that aren't valid UTF-8. (Or otherwise determine
the encoding---which is often impractical---and perform a transcoding step.)
Often, the simplest and most robust way to approach this is to simply treat the
contents of a file as if it were mostly valid UTF-8 and pass through invalid
UTF-8 untouched. This may not be the most correct approach though!
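
To make that last approach concrete, here is a minimal sketch that uses only
the standard library, so none of this crate's APIs appear in it. The helper
names `contains_bytes` and `count_matching_lines` are hypothetical, and the
naive `windows` scan merely stands in for a real substring searcher:

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Naive substring test on raw bytes. No UTF-8 validation is involved.
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    // `windows(0)` would panic, so the empty needle is handled first.
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}

/// Counts the lines in `source` that contain `needle`, passing any invalid
/// UTF-8 through untouched.
fn count_matching_lines<R: Read>(source: R, needle: &[u8]) -> io::Result<usize> {
    let mut count = 0;
    // `split` yields each line as a `Vec<u8>`, never as a `String`.
    for line in BufReader::new(source).split(b'\n') {
        if contains_bytes(&line?, needle) {
            count += 1;
        }
    }
    Ok(count)
}
```

For instance, given the input `b"foo \xFF bar\nquux\n\xFFfoo"`, counting the
lines that contain `foo` gives `2`, even though two of the three lines are not
valid UTF-8.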

One case in particular exacerbates these issues, and that's memory mapping
a file. When you memory map a file, that file may be gigabytes big, but all
you get is a `&[u8]`. Converting that to a `&str` all in one go is generally
not a good idea because of the costs associated with doing so, and also
because it generally causes one to do two passes over the data instead of
one, which is quite undesirable. It is of course usually possible to do it in
an incremental way by only parsing chunks at a time, but this is often complex
to do or impractical. For example, many regex engines only accept one
contiguous sequence of bytes at a time with no way to perform incremental
matching.

# `bstr` in public APIs

This library is past version `1` and is expected to remain at version `1` for
the foreseeable future. Therefore, it is encouraged to put types from `bstr`
(like `BStr` and `BString`) in your public API if that makes sense for your
crate.

With that said, in general, it should be possible to avoid putting anything
in this crate into your public APIs. Namely, you should never need to use the
`ByteSlice` or `ByteVec` traits as bounds on public APIs, since their only
purpose is to extend the methods on the concrete types `[u8]` and `Vec<u8>`,
respectively. Similarly, it should not be necessary to put either the `BStr` or
`BString` types into public APIs. If you want to use them internally, then they
can be converted to/from `[u8]`/`Vec<u8>` as needed. The conversions are free.

So while it shouldn't ever be 100% necessary to make `bstr` a public
dependency, there may be cases where it is convenient to do so. This is an
explicitly supported use case of `bstr`, and as such, major version releases
should be exceptionally rare.
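
As a sketch of what keeping `bstr` out of a signature might look like, the
hypothetical function below accepts anything that can be viewed as `[u8]`. It
uses only the standard library so that it stands alone; in a real crate, the
body is where `ByteSlice` methods would be used:

```rust
/// Hypothetical public function: the signature mentions only `[u8]`, so
/// callers can pass a `&str`, `String`, `&[u8]` or `Vec<u8>` directly.
pub fn count_fields(haystack: impl AsRef<[u8]>) -> usize {
    // Internally, a crate could bring `bstr::ByteSlice` into scope here and
    // call its methods on the slice. This sketch sticks to std.
    haystack
        .as_ref()
        .split(|b| b.is_ascii_whitespace())
        .filter(|field| !field.is_empty())
        .count()
}
```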

# Differences with standard strings

The primary difference between `[u8]` and `str` is that the former is
conventionally UTF-8 while the latter is guaranteed to be UTF-8. The phrase
"conventionally UTF-8" means that a `[u8]` may contain bytes that do not form
a valid UTF-8 sequence, but operations defined on the type in this crate are
generally most useful on valid UTF-8 sequences. For example, iterating over
Unicode codepoints or grapheme clusters is an operation that is only defined
on valid UTF-8. Therefore, when invalid UTF-8 is encountered, the Unicode
replacement codepoint is substituted. Thus, a byte string that is not UTF-8 at
all is of limited utility when using this crate.

However, not all operations on byte strings are specifically Unicode aware. For
example, substring search has no specific Unicode semantics ascribed to it. It
works just as well for byte strings that are completely valid UTF-8 as for byte
strings that contain no valid UTF-8 at all. Similarly for replacements and
various other operations that do not need any Unicode specific tailoring.

Aside from the difference in how UTF-8 is handled, the APIs between `[u8]` and
`str` (and `Vec<u8>` and `String`) are intentionally very similar, including
maintaining the same behavior for corner cases in things like substring
splitting. There are, however, some differences:

* Substring search is not done with `matches`, but instead, `find_iter`.
  In general, this crate does not define any generic
  [`Pattern`](https://doc.rust-lang.org/std/str/pattern/trait.Pattern.html)
  infrastructure, and instead prefers adding new methods for different
  argument types. For example, `matches` can search by a `char` or a `&str`,
  whereas `find_iter` can only search by a byte string. `find_char` can be
  used for searching by a `char`.
* Since `SliceConcatExt` in the standard library is unstable, it is not
  possible to reuse that to implement `join` and `concat` methods. Instead,
  [`join`](fn.join.html) and [`concat`](fn.concat.html) are provided as free
  functions that perform a similar task.
* This library bundles in a few more Unicode operations, such as grapheme,
  word and sentence iterators. More operations, such as normalization and
  case folding, may be provided in the future.
* Some `String`/`str` APIs will panic if a particular index was not on a valid
  UTF-8 code unit sequence boundary. Conversely, no such checking is performed
  in this crate, as is consistent with treating byte strings as a sequence of
  bytes. This means callers are responsible for maintaining a UTF-8 invariant
  if that's important.
* Some routines provided by this crate, such as `starts_with_str`, have a
  `_str` suffix to differentiate them from similar routines already defined
  on the `[u8]` type. The difference is that `starts_with` requires its
  parameter to be a `&[u8]`, whereas `starts_with_str` permits its parameter
  to be anything that implements `AsRef<[u8]>`, which is more flexible. This
  means you can write `bytes.starts_with_str("☃")` instead of
  `bytes.starts_with("☃".as_bytes())`.

Otherwise, you should find most of the APIs between this crate and the standard
library string APIs to be very similar, if not identical.
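
The point about index checking can be seen with the standard library alone.
In the sketch below, `split_both` is a hypothetical helper that splits the same
data at the same byte offset, once as a `str` and once as a `[u8]`. It assumes
`mid` is in bounds and panics otherwise:

```rust
/// Hypothetical helper: tries to split `text` at byte offset `mid`, first as
/// a `str` and then as a `[u8]`.
fn split_both(text: &str, mid: usize) -> (Option<(&str, &str)>, (&[u8], &[u8])) {
    // A `str` may only be split on a UTF-8 sequence boundary. The boundary is
    // checked explicitly here; calling `text.split_at(mid)` off a boundary
    // would panic instead.
    let as_str = if text.is_char_boundary(mid) {
        Some(text.split_at(mid))
    } else {
        None
    };
    // A byte slice splits anywhere in bounds, even inside a codepoint.
    let as_bytes = text.as_bytes().split_at(mid);
    (as_str, as_bytes)
}
```

Splitting `"☃x"` at offset `1` lands inside the three byte encoding of `☃`, so
the `str` half is `None` while the byte half is `(b"\xE2", b"\x98\x83x")`.
Keeping such halves meaningful is the caller's job when working with bytes.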

# Handling of invalid UTF-8

Since byte strings are only *conventionally* UTF-8, there is no guarantee
that byte strings contain valid UTF-8. Indeed, it is perfectly legal for a
byte string to contain arbitrary bytes. However, since this library defines
a *string* type, it provides many operations specified by Unicode. These
operations are typically only defined over codepoints, and thus have no real
meaning on bytes that are invalid UTF-8 because they do not map to a particular
codepoint.

For this reason, whenever operations defined only on codepoints are used, this
library will automatically convert invalid UTF-8 to the Unicode replacement
codepoint, `U+FFFD`, which looks like this: `�`. For example, an
[iterator over codepoints](struct.Chars.html) will yield a Unicode
replacement codepoint whenever it comes across bytes that are not valid UTF-8:

```
use bstr::ByteSlice;

let bs = b"a\xFF\xFFz";
let chars: Vec<char> = bs.chars().collect();
assert_eq!(vec!['a', '\u{FFFD}', '\u{FFFD}', 'z'], chars);
```

There are a few ways in which invalid bytes can be substituted with a Unicode
replacement codepoint. One way, not used by this crate, is to replace every
individual invalid byte with a single replacement codepoint. In contrast, the
approach this crate uses is called the "substitution of maximal subparts," as
specified by the Unicode Standard (Chapter 3, Section 9). (This approach is
also used by [W3C's Encoding Standard](https://www.w3.org/TR/encoding/).) In
this strategy, a replacement codepoint is inserted whenever a byte is found
that cannot possibly lead to a valid UTF-8 code unit sequence. If there were
previous bytes that represented a *prefix* of a well-formed UTF-8 code unit
sequence, then all of those bytes (up to 3) are substituted with a single
replacement codepoint. For example:

```
use bstr::ByteSlice;

let bs = b"a\xF0\x9F\x87z";
let chars: Vec<char> = bs.chars().collect();
// The bytes \xF0\x9F\x87 could lead to a valid UTF-8 sequence, but 3 of them
// on their own are invalid. Only one replacement codepoint is substituted,
// which demonstrates the "substitution of maximal subparts" strategy.
assert_eq!(vec!['a', '\u{FFFD}', 'z'], chars);
```
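
To contrast the two strategies side by side, here is a sketch that uses only
the standard library. `replace_each_byte` is a hypothetical helper for the
per-byte strategy that this crate does not use, and `replace_maximal_subparts`
is a hypothetical wrapper around `String::from_utf8_lossy`, which, as far as I
can tell, follows the same maximal subparts practice:

```rust
/// Hypothetical helper: lossy decoding that emits one U+FFFD per invalid
/// byte. This is the strategy that this crate does NOT use.
fn replace_each_byte(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    loop {
        match std::str::from_utf8(bytes) {
            // Everything that remains is valid UTF-8.
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                out.push_str(std::str::from_utf8(valid).expect("validated prefix"));
                // `None` means the input ended in the middle of a sequence.
                let bad = err.error_len().unwrap_or(rest.len());
                for _ in 0..bad {
                    out.push('\u{FFFD}');
                }
                bytes = &rest[bad..];
            }
        }
    }
}

/// Lossy decoding via the standard library, which substitutes one U+FFFD
/// per maximal invalid subpart.
fn replace_maximal_subparts(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

Given `b"a\xF0\x9F\x87z"`, `replace_each_byte` puts three replacement
codepoints between `a` and `z`, while `replace_maximal_subparts` puts one,
matching the `chars()` result above.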

If you do need to access the raw bytes for some reason in an iterator like
`Chars`, then you should use the iterator's "indices" variant, which gives
the byte offsets containing the invalid UTF-8 bytes that were substituted with
the replacement codepoint. For example:

```
use bstr::{B, ByteSlice};

let bs = b"a\xE2\x98z";
let chars: Vec<(usize, usize, char)> = bs.char_indices().collect();
// Even though the replacement codepoint is encoded as 3 bytes itself, the
// byte range given here is only two bytes, corresponding to the original
// raw bytes.
assert_eq!(vec![(0, 1, 'a'), (1, 3, '\u{FFFD}'), (3, 4, 'z')], chars);

// Thus, getting the original raw bytes is as simple as slicing the original
// byte string:
let chars: Vec<&[u8]> = bs.char_indices().map(|(s, e, _)| &bs[s..e]).collect();
assert_eq!(vec![B("a"), B(b"\xE2\x98"), B("z")], chars);
```

# File paths and OS strings

One of the premier features of Rust's standard library is how it handles file
paths. In particular, it makes it very hard to write incorrect code while
simultaneously providing a correct cross platform abstraction for manipulating
file paths. The key challenge that one faces with file paths across platforms
is derived from the following observations:

* On most Unix-like systems, file paths are an arbitrary sequence of bytes.
* On Windows, file paths are an arbitrary sequence of 16-bit integers.

(In both cases, certain sequences aren't allowed. For example a `NUL` byte is
not allowed in either case. But we can ignore this for the purposes of this
section.)

Byte strings, like the ones provided in this crate, line up really well with
file paths on Unix-like systems, which are themselves just arbitrary sequences
of bytes. It turns out that if you treat them as "mostly UTF-8," then things
work out pretty well. On the contrary, byte strings _don't_ really work
that well on Windows because it's not possible to correctly roundtrip file
paths between 16-bit integers and something that looks like UTF-8 _without_
explicitly defining an encoding to do this for you, which is anathema to byte
strings, which are just bytes.

Rust's standard library elegantly solves this problem by specifying an
internal encoding for file paths that's only used on Windows called
[WTF-8](https://simonsapin.github.io/wtf-8/). Its key property is that it
permits losslessly roundtripping file paths on Windows by extending UTF-8 to
support an encoding of surrogate codepoints, while simultaneously supporting
zero-cost conversion from Rust's Unicode strings to file paths. (Since UTF-8 is
a proper subset of WTF-8.)

The fundamental point at which the above strategy fails is when you want to
treat file paths as things that look like strings in a zero cost way. In most
cases, this is actually the wrong thing to do, but some cases call for it,
for example, glob or regex matching on file paths. This is because WTF-8 is
treated as an internal implementation detail, and there is no way to access
those bytes via a public API. Therefore, such consumers are limited in what
they can do:

1. One could re-implement WTF-8 and re-encode file paths on Windows to WTF-8
   by accessing their underlying 16-bit integer representation. Unfortunately,
   this isn't zero cost (it introduces a second WTF-8 decoding step) and it's
   not clear this is a good thing to do, since WTF-8 should ideally remain an
   internal implementation detail. This is roughly the approach taken by the
   [`os_str_bytes`](https://crates.io/crates/os_str_bytes) crate.
2. One could instead declare that they will not handle paths on Windows that
   are not valid UTF-16, and return an error when one is encountered.
3. Like (2), but instead of returning an error, lossily decode the file path
   on Windows that isn't valid UTF-16 into UTF-16 by replacing invalid bytes
   with the Unicode replacement codepoint.

While this library may provide facilities for (1) in the future, currently,
this library only provides facilities for (2) and (3). In particular, a suite
of conversion functions is provided that permits converting between byte
strings, OS strings and file paths. For owned byte strings, they are:

* [`ByteVec::from_os_string`](trait.ByteVec.html#method.from_os_string)
* [`ByteVec::from_os_str_lossy`](trait.ByteVec.html#method.from_os_str_lossy)
* [`ByteVec::from_path_buf`](trait.ByteVec.html#method.from_path_buf)
* [`ByteVec::from_path_lossy`](trait.ByteVec.html#method.from_path_lossy)
* [`ByteVec::into_os_string`](trait.ByteVec.html#method.into_os_string)
* [`ByteVec::into_os_string_lossy`](trait.ByteVec.html#method.into_os_string_lossy)
* [`ByteVec::into_path_buf`](trait.ByteVec.html#method.into_path_buf)
* [`ByteVec::into_path_buf_lossy`](trait.ByteVec.html#method.into_path_buf_lossy)

For byte string slices, they are:

* [`ByteSlice::from_os_str`](trait.ByteSlice.html#method.from_os_str)
* [`ByteSlice::from_path`](trait.ByteSlice.html#method.from_path)
* [`ByteSlice::to_os_str`](trait.ByteSlice.html#method.to_os_str)
* [`ByteSlice::to_os_str_lossy`](trait.ByteSlice.html#method.to_os_str_lossy)
* [`ByteSlice::to_path`](trait.ByteSlice.html#method.to_path)
* [`ByteSlice::to_path_lossy`](trait.ByteSlice.html#method.to_path_lossy)

On Unix, all of these conversions are rigorously zero cost, which gives one
a way to ergonomically deal with raw file paths exactly as they are using
normal string-related functions. On Windows, these conversion routines perform
a UTF-8 check and either return an error or lossily decode the file path
into valid UTF-8, depending on which function you use. This means that you
cannot roundtrip all file paths on Windows correctly using these conversion
routines. However, this may be an acceptable downside since such file paths
are exceptionally rare. Moreover, roundtripping isn't always necessary, for
example, if all you're doing is filtering based on file paths.
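
The behavior described above can be approximated with the standard library
alone. The following is only a sketch of options (2) and (3) for the
path-to-bytes direction, not this crate's implementation; `path_bytes` and
`path_bytes_lossy` are hypothetical names:

```rust
use std::borrow::Cow;
use std::path::Path;

/// Sketch of option (2): borrow the path's bytes, or fail.
#[cfg(unix)]
fn path_bytes(path: &Path) -> Option<&[u8]> {
    use std::os::unix::ffi::OsStrExt;
    // On Unix, a path already is a sequence of bytes, so this always
    // succeeds and copies nothing.
    Some(path.as_os_str().as_bytes())
}

/// Sketch of option (2) for other platforms.
#[cfg(not(unix))]
fn path_bytes(path: &Path) -> Option<&[u8]> {
    // Elsewhere, only paths that are valid Unicode can be borrowed as UTF-8.
    path.to_str().map(str::as_bytes)
}

/// Sketch of option (3): never fail, but substitute U+FFFD where needed.
fn path_bytes_lossy(path: &Path) -> Cow<'_, [u8]> {
    match path_bytes(path) {
        Some(bytes) => Cow::Borrowed(bytes),
        // Only reachable off Unix in this sketch: allocate a lossy copy.
        None => Cow::Owned(path.to_string_lossy().into_owned().into_bytes()),
    }
}
```

The `Cow` return type mirrors the trade-off in the text: the common case
borrows for free, and only a path that cannot be represented pays for an
allocation and loses information.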

The reason why using byte strings for this is potentially superior to the
standard library's approach is that a lot of Rust code is already lossily
converting file paths to Rust's Unicode strings, which are required to be valid
UTF-8, and thus contain latent bugs on Unix where paths with invalid UTF-8 are
not terribly uncommon. If you instead use byte strings, then you're guaranteed
to write correct code for Unix, at the cost of getting a corner case wrong on
Windows.
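
One way such a latent bug can show up is two different file names becoming
indistinguishable after a lossy conversion. The sketch below is Unix-only and
uses just the standard library; `collide_when_lossy` is a hypothetical helper
written for this illustration:

```rust
/// Hypothetical helper: returns true if two *different* file names turn into
/// the same string once they are lossily converted to UTF-8.
#[cfg(unix)]
fn collide_when_lossy(a: &[u8], b: &[u8]) -> bool {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;

    let (a, b) = (OsStr::from_bytes(a), OsStr::from_bytes(b));
    // The raw names differ, but each invalid byte becomes U+FFFD.
    a != b && a.to_string_lossy() == b.to_string_lossy()
}
```

For example, the names `b"report-\xFF.txt"` and `b"report-\xFE.txt"` are
distinct on disk but collide under this check, whereas comparing them as byte
strings keeps them apart.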

# Cargo features

This crate comes with a few features that control standard library, serde
and Unicode support.

* `std` - Enabled by default. This provides APIs that require the standard
  library, such as `Vec<u8>` and `PathBuf`. Enabling this feature also enables
  the `alloc` feature and any other relevant `std` features for dependencies.
* `alloc` - Enabled by default. This provides APIs that require allocations
  via the `alloc` crate, such as `Vec<u8>`.
* `unicode` - Enabled by default. This provides APIs that require sizable
  Unicode data compiled into the binary. This includes, but is not limited to,
  grapheme/word/sentence segmenters. When this is disabled, basic support such
  as UTF-8 decoding is still included. Note that currently, enabling this
  feature also requires enabling the `std` feature. It is expected that this
  limitation will be lifted at some point.
* `serde` - Enables implementations of serde traits for `BStr`, and also
  `BString` when `alloc` is enabled.
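
As an illustration, a `no_std` consumer that still has an allocator might
select features along these lines. This is a sketch of a `Cargo.toml` entry
rather than a recommendation, and it assumes a `1.x` release of this crate:

```toml
[dependencies]
# Turn off the default `std` and `unicode` features, keeping only `alloc`.
bstr = { version = "1", default-features = false, features = ["alloc"] }
```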
*/

// #![cfg_attr(not(any(feature = "std", test)), no_std)]
#![no_std]
#![cfg_attr(docsrs, feature(doc_auto_cfg))]

#[cfg(any(test, feature = "std"))]
extern crate std;

#[cfg(any(test, feature = "alloc"))]
extern crate alloc;

pub use crate::bstr::BStr;
#[cfg(feature = "alloc")]
pub use crate::bstring::BString;
pub use crate::escape_bytes::EscapeBytes;
#[cfg(feature = "unicode")]
pub use crate::ext_slice::Fields;
pub use crate::ext_slice::{
    ByteSlice, Bytes, FieldsWith, Find, FindReverse, Finder, FinderReverse,
    Lines, LinesWithTerminator, Split, SplitN, SplitNReverse, SplitReverse, B,
};
#[cfg(feature = "alloc")]
pub use crate::ext_vec::{concat, join, ByteVec, DrainBytes, FromUtf8Error};
#[cfg(feature = "unicode")]
pub use crate::unicode::{
    GraphemeIndices, Graphemes, SentenceIndices, Sentences, WordIndices,
    Words, WordsWithBreakIndices, WordsWithBreaks,
};
pub use crate::utf8::{
    decode as decode_utf8, decode_last as decode_last_utf8, CharIndices,
    Chars, Utf8Chunk, Utf8Chunks, Utf8Error,
};

mod ascii;
mod bstr;
#[cfg(feature = "alloc")]
mod bstring;
mod byteset;
mod escape_bytes;
mod ext_slice;
#[cfg(feature = "alloc")]
mod ext_vec;
mod impls;
#[cfg(feature = "std")]
pub mod io;
#[cfg(all(test, feature = "std"))]
mod tests;
#[cfg(feature = "unicode")]
mod unicode;
mod utf8;

#[cfg(all(test, feature = "std"))]
mod apitests {
    use crate::{
        bstr::BStr,
        bstring::BString,
        ext_slice::{Finder, FinderReverse},
    };

    #[test]
    fn oibits() {
        use std::panic::{RefUnwindSafe, UnwindSafe};

        fn assert_send<T: Send>() {}
        fn assert_sync<T: Sync>() {}
        fn assert_unwind_safe<T: RefUnwindSafe + UnwindSafe>() {}

        assert_send::<&BStr>();
        assert_sync::<&BStr>();
        assert_unwind_safe::<&BStr>();
        assert_send::<BString>();
        assert_sync::<BString>();
        assert_unwind_safe::<BString>();

        assert_send::<Finder<'_>>();
        assert_sync::<Finder<'_>>();
        assert_unwind_safe::<Finder<'_>>();
        assert_send::<FinderReverse<'_>>();
        assert_sync::<FinderReverse<'_>>();
        assert_unwind_safe::<FinderReverse<'_>>();
    }
}