The string in the following code sample consists of 5 letters and an exclamation mark:
const message = 'Hello!';
Thinking about strings as a sequence of visible characters also suggests that the number of characters in the 'Hello!' string is equal to 6:
const message = 'Hello!'; message.length; // => 6
But as soon as you deal with more complex characters, for example the emoticons (😀, 😁, 😈), modeling the strings by visible characters becomes inaccurate.
Consider the following string:
const smile = '😀';
You can see that the string contains just one character: the grinning face.
But if you use the smile.length property to determine the number of characters, you might be surprised that it reports 2 units:
const smile = '😀'; smile.length; // => 2
How could that happen: you see one character, while length indicates 2 of them?
The answer lies in how the ECMAScript specification defines the String type:
The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values (“elements”). The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a UTF-16 code unit value.
A code unit is just a number from 0x0000 to 0xFFFF. The magic happens because there is a mapping between the code unit value and a specific character.
For example, the code unit 0x0048 is rendered as the actual character H using the Unicode escape sequence \u0048:
const letter = '\u0048'; letter === 'H' // => true
Now let’s use UTF-16 code units directly to create the 'Hello!' string:
const message = '\u0048\u0065\u006C\u006C\u006F\u0021';
message === 'Hello!'; // => true
message.length; // => 6
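Going the other way, you can read the code units back out of a string with charCodeAt(). The following is a minimal sketch; toCodeUnits is a hypothetical helper name introduced here for illustration, not something from the language or from this post.

```javascript
// List the UTF-16 code units of a string as 4-digit uppercase hex values.
// `toCodeUnits` is a hypothetical helper used only for illustration.
function toCodeUnits(str) {
  const result = [];
  // str.length counts code units, so this loop visits every code unit once.
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i); // a number between 0x0000 and 0xFFFF
    result.push(unit.toString(16).toUpperCase().padStart(4, '0'));
  }
  return result;
}

toCodeUnits('Hello!'); // => ['0048', '0065', '006C', '006C', '006F', '0021']
toCodeUnits('😀');     // => ['D83D', 'DE00'] (two code units for one visible character)
```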
A Unicode character from the Basic Multilingual Plane is encoded with one code unit in UTF-16.
However, characters outside the Basic Multilingual Plane require an inseparable pair of code units (named a surrogate pair) to be encoded in UTF-16.
For example, the grinning face character '😀', which has the code point 0x1F600 (the number 0x1F600 is bigger than 0xFFFF and thus doesn’t fit into 16 bits), is encoded with a sequence of 2 code units:
const smile = '\uD83D\uDE00';
smile === '😀'; // => true
smile.length; // => 2
\uD83D\uDE00 is a special pair named a surrogate pair.
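To see where the two code units come from, here is a rough sketch of the standard UTF-16 arithmetic: subtract 0x10000 from the code point, then put the top 10 bits into a high surrogate (starting at 0xD800) and the bottom 10 bits into a low surrogate (starting at 0xDC00). The toSurrogatePair function is a hypothetical helper written for this illustration, and it assumes its argument is a code point above 0xFFFF.

```javascript
// Split a code point above 0xFFFF into its UTF-16 surrogate pair.
// `toSurrogatePair` is a hypothetical helper; it does not validate its input.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600);
high.toString(16); // => 'd83d'
low.toString(16);  // => 'de00'

String.fromCharCode(high, low) === '😀'; // => true
'😀'.codePointAt(0) === 0x1F600;         // => true
```

In everyday code you rarely need this arithmetic: String.fromCodePoint() and codePointAt() handle surrogate pairs for you.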
smile.length evaluates to 2, which shows that the length property of the string primitive counts code units.
The string iterator is aware of surrogate pairs. When you invoke the string iterator, for example using the spread operator ..., it counts a surrogate pair as one length unit:
const message = 'Hello!';
const smile = '😀';
[...message].length; // => 6
[...smile].length; // => 1
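The same iterator powers for...of and Array.from(), so any of them can be used to count code points instead of code units. Below is a small sketch; countCodePoints is a hypothetical helper name chosen for this example.

```javascript
// Count code points (a surrogate pair counts once) using the string iterator.
// `countCodePoints` is a hypothetical helper used only for illustration.
function countCodePoints(str) {
  let count = 0;
  // for...of steps through the string one code point at a time.
  for (const _symbol of str) {
    count++;
  }
  return count;
}

countCodePoints('Hello!'); // => 6
countCodePoints('😀');     // => 1

const mixed = 'a😀b';
mixed.length;              // => 4 (code units)
countCodePoints(mixed);    // => 3 (code points)
Array.from(mixed);         // => ['a', '😀', 'b']
```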
The string.length property determines the number of code units, rather than the number of visible characters.
Understanding that a string is a sequence of code units is necessary if you work with characters above the Basic Multilingual Plane.