String types considered harmful
This is an excerpt adapted from my old Zig book.
The Zig, the String, and the Unicode
A question I commonly see is “Does Zig not have support for strings/UTF-8?”
Well, yes and no. Zig has decided not to treat strings specially because, really, there is no simple correct way to do it.
For example, when iterating over a string, you might expect to iterate over characters. The problem is that Unicode doesn’t even define what a character is: it might range from a single byte to an entire grapheme cluster.
Take "👩‍👦‍👦": what we see as a single character is actually made out of 5 codepoints (U+1F469 U+200D U+1F466 U+200D U+1F466), which are encoded as 18 bytes (assuming UTF-8), and each must be correctly preserved, otherwise we end up with a different emoji!
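To make those numbers concrete, here is a small test sketch that counts the bytes and codepoints of that emoji. The string is spelled out with \u{...} escapes so each codepoint is visible; the constant and test names are only illustrative.

```zig
const std = @import("std");

// The family emoji, written codepoint by codepoint:
// woman, zero-width joiner, boy, zero-width joiner, boy.
const family = "\u{1F469}\u{200D}\u{1F466}\u{200D}\u{1F466}";

test "one visible character, 5 codepoints, 18 bytes" {
    // A Zig string literal is just bytes, so .len counts UTF-8 bytes.
    try std.testing.expectEqual(@as(usize, 18), family.len);

    // Counting codepoints requires decoding, and fails on invalid UTF-8.
    const codepoints = try std.unicode.utf8CountCodepoints(family);
    try std.testing.expectEqual(@as(usize, 5), codepoints);
}
```

Neither number is the "1" a reader sees on screen; getting that one requires grapheme cluster segmentation.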
So, if your language has built-in Unicode support, what should it do? Iterate over codepoints? Over grapheme clusters? What if I want to iterate through all of them? Is the Unicode table defining grapheme clusters (separators, joiners, etc.) always up to date? That’s why Zig leaves Unicode as something for the standard library and other libraries to handle.
A simple example of thinking strings are simple and failing catastrophically is JavaScript, a programming language supposed to handle strings, but where "👩‍👦‍👦".length is equal to 8 (it counts UTF-16 code units)!
In almost all cases this is not the intended behaviour when getting a string’s length.
So really, JavaScript (and most languages that supposedly handle strings) tricks you into believing your code will work intuitively when in fact, human language is just complicated. That is why, like for many things, Zig wants you to think about what you want in order to get correct behaviour.
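The 8 is not arbitrary: it is the number of UTF-16 code units in the string. A minimal sketch of that count in Zig follows, assuming a hypothetical helper named utf16Length written only for this illustration (std.unicode has its own UTF-16 utilities; this just spells out the rule).

```zig
const std = @import("std");

/// Hypothetical helper: number of UTF-16 code units needed for a UTF-8 string.
fn utf16Length(utf8: []const u8) !usize {
    const view = try std.unicode.Utf8View.init(utf8);
    var it = view.iterator();
    var units: usize = 0;
    while (it.nextCodepoint()) |cp| {
        // Codepoints above U+FFFF need a surrogate pair, i.e. 2 code units.
        units += if (cp >= 0x10000) @as(usize, 2) else 1;
    }
    return units;
}

test "where a length of 8 comes from" {
    const family = "\u{1F469}\u{200D}\u{1F466}\u{200D}\u{1F466}";
    // 3 emoji at 2 units each, plus 2 joiners at 1 unit each.
    try std.testing.expectEqual(@as(usize, 8), try utf16Length(family));
}
```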
For basic string manipulation in Zig, you’d look at std.mem and std.ascii. std.mem contains useful functions like indexOf or replace, while std.ascii contains ASCII-specific functions (as its name suggests).
The reason these functions live in std.mem is that they can operate on any type of slice, so you could just as well replace every int with another.
It’s also great if you’re forced to use UTF-16 (win32.. 🥶)
const std = @import("std");

const str = "abc";
// indexOfScalar returns null when the value is not found
if (std.mem.indexOfScalar(u8, str, 'b')) |index| {
    std.debug.assert(index == 1);
}
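As a rough illustration of the "any type of slice" point, the sketch below uses the same std.mem functions on integers and on UTF-16 code units, next to a couple of std.ascii helpers. The values are made up for the example.

```zig
const std = @import("std");

test "std.mem is generic over the element type" {
    // Replace every 2 with 9 in a slice of integers, in place.
    var numbers = [_]u32{ 1, 2, 3, 2 };
    std.mem.replaceScalar(u32, &numbers, 2, 9);
    try std.testing.expectEqualSlices(u32, &[_]u32{ 1, 9, 3, 9 }, &numbers);

    // The same search function works on UTF-16 code units (u16).
    const wide = [_]u16{ 'a', 'b', 'c' };
    try std.testing.expectEqual(@as(?usize, 1), std.mem.indexOfScalar(u16, &wide, 'b'));

    // std.ascii only knows about the 128 ASCII characters.
    try std.testing.expect(std.ascii.eqlIgnoreCase("Zig", "zIG"));
    try std.testing.expectEqual(@as(u8, 'Z'), std.ascii.toUpper('z'));
}
```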
For Unicode-y things, you can use.. std.unicode. It contains functions for iterating over codepoints and for encoding and decoding UTF-16 and UTF-8.
const str = "Forêt UTF-8 ⚡";
// init returns an error if 'str' is not valid UTF-8
const view = try std.unicode.Utf8View.init(str);
var iterator = view.iterator();
while (iterator.nextCodepoint()) |codepoint| {
    // {u} prints a codepoint (u21) as UTF-8; {c} only accepts a single byte
    std.debug.print("{u}", .{ codepoint });
}
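The iterator can also hand back the raw bytes of each codepoint, which makes the variable width of UTF-8 visible. A small sketch, with the "ê" written as an escape so the source stays ASCII:

```zig
const std = @import("std");

test "codepoints have different byte widths" {
    // "Forêt": four ASCII letters plus ê (U+00EA), which takes 2 bytes.
    const str = "For\u{EA}t";
    const view = try std.unicode.Utf8View.init(str);
    var it = view.iterator();

    var count: usize = 0;
    var widest: usize = 0;
    // nextCodepointSlice yields the UTF-8 bytes of each codepoint.
    while (it.nextCodepointSlice()) |bytes| {
        count += 1;
        if (bytes.len > widest) widest = bytes.len;
    }

    try std.testing.expectEqual(@as(usize, 5), count); // 5 codepoints
    try std.testing.expectEqual(@as(usize, 2), widest); // ê is the widest
    try std.testing.expectEqual(@as(usize, 6), str.len); // 6 bytes in total

    // Invalid input is rejected up front instead of being silently decoded.
    try std.testing.expectError(error.InvalidUtf8, std.unicode.Utf8View.init("\xff"));
}
```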
Finally, more complex things are where Zig needs an external library, like many other languages (yes, even those with that advertised Unicode support). This is notably the case for normalization, which is very important before storing a string because, it turns out, a character can have multiple Unicode representations. A library is also required for correctly upper-casing or lower-casing a string, which is a complex task because, actually, English isn’t the only language in the world, and letter case isn’t universal (think of Chinese)…
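To see why normalization matters, consider "é", which can be written either as one precomposed codepoint or as "e" followed by a combining accent. The sketch below only uses std.mem to show the mismatch; actually normalizing the strings is the part that needs a library.

```zig
const std = @import("std");

test "same text on screen, different bytes" {
    const precomposed = "\u{E9}"; // é as a single codepoint (U+00E9)
    const decomposed = "e\u{301}"; // e followed by a combining acute accent

    // Without normalization, a byte comparison sees two different strings.
    try std.testing.expect(!std.mem.eql(u8, precomposed, decomposed));
    try std.testing.expectEqual(@as(usize, 2), precomposed.len);
    try std.testing.expectEqual(@as(usize, 3), decomposed.len);
}
```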
Zig currently has a good library for that, named ziglyph. It bundles all the required Unicode data and behaves correctly for Unicode characters. It should be used for separating strings into grapheme clusters (necessary for most emojis) or for ordering strings.
But really, in most cases, you should reconsider whether you really want to do those string operations because, often, they don’t make sense in other human languages.
In fact, even with support for grapheme clusters there can be problems.
Think about “œ” (used in languages like French): it’s one grapheme but stands for two letters, yet you would expect the reverse of “œuf” to be “fueo” (which would need special handling, imagine that for all the characters in the world..). If we reverse it again, we get “oeuf” which, linguistically, isn’t the same word. So even in Latin languages, those job-interview string problems cause problems.
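Even before graphemes enter the picture, the naive version of that interview exercise already goes wrong at the byte level. A small sketch, with “œ” written as an escape:

```zig
const std = @import("std");

test "reversing bytes breaks UTF-8" {
    // "œuf": œ (U+0153) takes 2 bytes, u and f take 1 byte each.
    var buf = "\u{153}uf".*;
    try std.testing.expect(std.unicode.utf8ValidateSlice(&buf));

    // The classic interview answer: reverse the bytes in place.
    std.mem.reverse(u8, &buf);

    // The two bytes of œ are now in the wrong order,
    // so the result is no longer valid UTF-8.
    try std.testing.expect(!std.unicode.utf8ValidateSlice(&buf));
}
```

Reversing codepoints instead would keep the bytes valid, but it would still scramble multi-codepoint emoji and say nothing about the linguistic question above.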