The Enigma
Story of string encoding
Our thoughts are exchanged using a language. Any technological advancement we make involves language, as it is the only known way for us to communicate. Programming “languages” use it too.
"public" "class" …
"def" "square_root"(x)
<"script" type="text/javascript"…>
The double-quoted parts of the text above are standard tokens you will see across many code files in different languages. We know that computers use the binary digits 0 and 1 to store these, but do you spend much time thinking about the convention followed to capture those 0s and 1s?
Convention
The name might evoke strong responses, as every developer has their own sentiment attached to a convention: individual conventions and, yes, team conventions. Similarly, the convention used to store and re-process the binaries that represent language is called Unicode, a universal character encoding standard maintained by the Unicode Consortium. This convention is a vast ocean, so much so that you could relate it to water on earth.
Did you know the human body is made of water? As much as 70%. Yes, of course you knew it; it is common science. Now, did you know that as a developer you use this convention daily? You have likely heard of it, and you may well agree with the question posed a sentence ago, but we are sure you do not feel that you use it every moment. Yet the moment you introduce your first string, this convention comes into play. The moment you compare two strings, there again it comes into play. You see, it is that pervasive.
Codepoint
The foundational unit of Unicode is the codepoint. Read it as a mapping between a number and a character that we know in a language. So, let us pick some quintessential letters and see their codepoints –
E (the quintessential letter in English) has a codepoint, a.k.a. number, of 45; ழ (the culturally significant letter in Tamil) has a codepoint of BB4 (😊); ओ (the resonant sound in Hindi poetry) has a codepoint of 913; 日 (hi / nichi, which symbolises the sun in Japan, the land of the rising sun) is 65E5; and 福 (which represents happiness, good luck and fortune in our next-door neighbour China) is 798F.
You would have discovered the reason for our smile: these codepoints are not in binary; they are hexadecimal. We talked about a convention for the binaries, but rest assured, the conversion between hexadecimal and binary is purely mechanical, so no further convention is needed for it.
We kept repeating the word number earlier. You may know the convention of writing "0x" in front of hexadecimal numbers to indicate the base; for codepoints, the Unicode convention dictates a "U+" prefix instead. So the letters mentioned above read as
E = U+0045
ழ = U+0BB4
ओ = U+0913
日 = U+65E5
福 = U+798F
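These mappings are easy to verify yourself. In Python, for instance, the built-ins ord and chr translate between a character and its codepoint; a minimal sketch using our sample letters:

```python
# ord() maps a character to its codepoint; chr() goes the other way.
for ch in "Eழओ日福":
    print(ch, f"U+{ord(ch):04X}")
# prints E U+0045, ழ U+0BB4, ओ U+0913, 日 U+65E5, 福 U+798F

# And back again, from the number to the character:
assert chr(0x0BB4) == "ழ"
```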
As you can see, letters from Eastern-hemisphere scripts require larger numbers than the Latin-based ones.
Encoding
Just as, a few dispatches ago, we looked at encoding information in qubits in the quantum world, similar conventions exist in Unicode too.
UTF-8, UTF-16 and UTF-32 are the popular ones that you will know. UTF-8 uses variable-length encoding: a character may take anywhere from 1 to 4 bytes (remember, a byte is 8 bits), and for the first 128 characters the encoded text is identical to ASCII, byte for byte. UTF-16 uses 2 bytes for most characters; the ones that cannot fit in 2 bytes (because the number representing the character is too big) are stored using surrogate pairs (4 bytes). UTF-32 uses 4 bytes for every character, which covers the known span of characters in written languages across the globe.
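The differing widths are easy to see by encoding a few of our sample characters in Python; a small sketch (the -be, big-endian, variants are used only so no byte-order mark pads the count):

```python
# Compare how many bytes each encoding spends on the same character.
for ch in ("E", "日", "🐝"):
    widths = {enc: len(ch.encode(enc))
              for enc in ("utf-8", "utf-16-be", "utf-32-be")}
    print(f"U+{ord(ch):04X}", widths)
```

E costs a single UTF-8 byte, 日 costs three, and the bee emoji costs four; under UTF-16 the first two fit in 2 bytes while the bee needs a surrogate pair (4 bytes); UTF-32 spends 4 bytes on everything.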
Character and Glyph
Not a technical topic, but certainly at the core of the Unicode standard, is the character. The standard distinguishes between a character and a glyph. A glyph is the mark a character makes on screen or paper: a written representation that can take many forms. The character, in contrast, is the abstract unit of a language. Glyphs typically live in font files, where variants such as bold, italic and other styles are specified.
Normalisation
True to the word, a normalised form is the common denominator of a letter. É, for instance, can be written either as a single precomposed codepoint (U+00C9) or as the capital letter E combined with the diacritical mark  ́ (U+0301); normalisation converts text into one agreed form so that the two spellings compare as equal.
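Python's standard unicodedata module can demonstrate this; a minimal sketch comparing the two spellings of É:

```python
import unicodedata

precomposed = "\u00C9"   # É as a single codepoint
decomposed = "E\u0301"   # E followed by the combining acute accent U+0301

print(precomposed == decomposed)  # False: different codepoint sequences
# NFC composes to the single codepoint; NFD decomposes to letter + mark.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```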
Unicode in programming languages
Many programming languages have native support for handling Unicode: string-typed variables are held in memory in a Unicode representation.
Python
Python 3 defaults to UTF-8 for source files, and every string is a sequence of Unicode codepoints. Earlier versions of Python needed developers to declare the encoding explicitly (the familiar # -*- coding: utf-8 -*- line).
With all strings defaulting to a Unicode representation, you can initialise a string in Unicode directly using a Unicode escape sequence –
bumble_bee = "\U0001F41D"
This is the symbol for the 🐝 bumble bee emoji. Notice the variable-length aspect of UTF-8 at work here: this number is too large to fit in 2 bytes, so UTF-8 spends 4 bytes on it.
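A quick way to convince yourself of this, staying in Python:

```python
bumble_bee = "\U0001F41D"
print(len(bumble_bee))                  # 1: one codepoint
print(hex(ord(bumble_bee)))             # 0x1f41d
print(len(bumble_bee.encode("utf-8")))  # 4: UTF-8 spends 4 bytes on it
```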
What if you had an existing string and wanted its Unicode-encoded bytes? Then we have
unicode_encoded = "Hello".encode('utf-8')
and the reverse will be
simple_string = unicode_encoded.decode('utf-8')
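Putting the two together, a small round trip shows that encode yields a bytes object whose length is counted in bytes, not codepoints:

```python
text = "Hello, ழ!"
data = text.encode("utf-8")

print(type(data).__name__)  # bytes
print(len(text))            # 9: length in codepoints
print(len(data))            # 11: ழ alone costs 3 bytes in UTF-8
assert data.decode("utf-8") == text  # lossless round trip
```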
JavaScript
JavaScript uses UTF-16 by default for all strings, i.e. 2-byte code units, with surrogate pairs (4 bytes) for the characters that do not fit in one unit. Similar to Python, we can initialise a string directly in Unicode using the respective codepoint.
let smiley = '\uD83D\uDE00';
That pair of \u escapes spells out a surrogate pair by hand; for larger codepoints it is easier to use the braced escape form
let bumble_bee = '\u{1F41D}';
One can read codepoints back out of an existing string using the functions charCodeAt and codePointAt. The notable point is the "At" in the names: both read one indexed position at a time, and they differ in what they return. charCodeAt gives a single UTF-16 code unit (only half of a surrogate pair for emoji and the like), while codePointAt gives the full codepoint. To retrieve the values for every character in a string, one can combine it with Array.from, e.g. Array.from(text, c => c.codePointAt(0)).
We will reserve examples from other languages and further nuances for a future dispatch. Until then, happy new year of building interesting applications.