From ASCII to Unicode: A JavaScript Developer's Guide to Text Encoding

Have you ever come across terms like ASCII or UTF-8? You've almost certainly seen the latter: all modern IDEs and code editors display some UTF variant when you work with files.

Most of the time, you pay little attention to it because it doesn't directly affect your work. But what if I told you the opposite: understanding what exactly UTF-8 and the other UTFs mean directly impacts your job?
Just imagine that simply switching from UTF-8 to UTF-16 can double your file size. Do you want to know why?
In this article, we'll dive into what the ASCII and UTFs are, how we're using them on a daily basis, and what problems misusing an encoding scheme can cause.
Bytes and characters
When you look at any text on modern devices, you see words. Each of those words consists of individual characters. Have you ever thought about what a character actually is?
There are at least two major parts to it. The first one is how the character looks. It could be the same character "A" but in different fonts, or in the same font but in regular, bold, or italic variants.

The thing we see is called a glyph. Different fonts have different glyphs for the same character. You can compare a glyph to an application frontend. We can display the same value that comes from a backend in different shapes and forms. The same is true for a glyph.
But what is a backend in this case? Let's call it a character code. The character code is a unit of information that allows different glyphs to represent the same character.

In the previous article, we discussed bits and bytes, and how understanding them can help you write better JavaScript code.
We can apply this knowledge directly to characters to understand them better. Each character you see on the internet has an actual size in bytes. Knowing how many characters a file contains makes it easy to calculate the file size. If there are 1,000,000 characters and each character takes 1 byte to store, then the file size is 1,000,000 bytes, or 1 megabyte.
Another application of this knowledge is related to the binary numeric system and how bytes can be represented in binary. Here is an example of a single byte written in binary: 11111111. Any group of eight binary digits represents a single byte value.
A character code can be represented in a binary numeric system as well. Here is how the popular "Hello world" phrase looks in binary.
01001000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100
You can tell how many bytes there are by counting the groups of 8 digits. In this case, there are 11 groups, which means there are 11 bytes.
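As a quick sketch, we can reproduce that binary string in JavaScript: charCodeAt returns a character's code, and toString(2) formats it in binary.

```javascript
// Turn each character of the phrase into its character code,
// then format that code as 8 binary digits.
const phrase = 'Hello world';

const binary = [...phrase]
  .map((ch) => ch.charCodeAt(0).toString(2).padStart(8, '0'))
  .join(' ');

console.log(binary); // 01001000 01100101 ... — 11 groups of 8 bits
console.log(binary.split(' ').length); // 11 bytes
```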
When dealing with binary, context is king. Context is what makes the difference between mere 0s and 1s and the commands of some programming language or a file. An encoding scheme is a context that allows us to turn a set of binary numbers into a human-readable phrase.
The encoding of the “Hello world” phrase into its binary representation above actually uses one of those schemes, called the American Standard Code for Information Interchange (ASCII).
From ASCII to Unicode
ASCII is an encoding scheme formalized in 1967. It remains one of the most significant, if not the most significant, standards in the tech industry. The reason is simple: it is the first widespread encoding standard specifically developed for the tech industry.
ASCII has two versions: the base version, which contains 128 characters and fits in 7 bits, and the extended version, which contains 256 characters and fills a full 8 bits, or one byte of information.
This is how you encode “A”, “9”, and “/” characters in ASCII.

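One quick way to verify these codes yourself is JavaScript's charCodeAt, which for ASCII-range characters returns the same values as the ASCII table:

```javascript
// For characters in the ASCII range, charCodeAt returns
// exactly the ASCII code of the character.
console.log('A'.charCodeAt(0)); // 65  (binary 01000001)
console.log('9'.charCodeAt(0)); // 57  (binary 00111001)
console.log('/'.charCodeAt(0)); // 47  (binary 00101111)
```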
As the name suggests, the encoding was developed in the US and designed specifically for English. 256 characters are enough to encode most English text.
In fact, ASCII became so popular that people in other countries started using it. But you can’t easily use an English-based standard to encode other languages because of its limited character set.
Trying to solve the problem of the limited ASCII character set, people created supersets of ASCII, such as the Japanese Industrial Standard (JIS) encodings, to keep using the popular format with customizations for their needs.
However, the problem was clear: ASCII is too limited. In 1988, work began on the successor of ASCII, called Unicode.
Unlike ASCII, Unicode was initially developed as a 2-byte encoding. In its extended version, 1-byte ASCII can encode up to 256 characters, while 2-byte Unicode can encode 65,536 characters.
That’s a decent leap. This version of Unicode allows the encoding of most of the widely used characters in the most popular languages.
A unique feature of Unicode is its extendability. It is not a fixed standard, and adding new languages can be easily achieved if there is a demand.
Unicode
Unicode is an engineering masterpiece. Let’s look closely at what makes it so unique and why it became so important.
Backward compatibility with ASCII
One of the main goals of creating Unicode was to create a standard for encoding a vast amount of different information. The goal has been achieved.
At the time of Unicode's creation, a lot of information was produced using ASCII. It wasn’t an option to just drop support of everything people had created so far and adopt a new standard.
That’s why the first 128 characters of Unicode are the same as ASCII characters. This makes Unicode backward compatible with ASCII, and the transition from ASCII to Unicode is seamless.

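You can observe this overlap directly from JavaScript: a buffer of ASCII bytes decodes cleanly with a UTF-8 decoder, and the resulting code points match the ASCII values. A small sketch using the standard TextDecoder API:

```javascript
// "Hi!" encoded as ASCII bytes: 72 = H, 105 = i, 33 = !
const asciiBytes = new Uint8Array([72, 105, 33]);

// A UTF-8 decoder reads ASCII bytes without any conversion,
// because the first 128 Unicode code points are ASCII.
const text = new TextDecoder().decode(asciiBytes);
console.log(text); // "Hi!"
console.log('H'.codePointAt(0)); // 72 — same value as in the ASCII table
```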
Unicode's transformation format
Unicode was initially developed as a 16-bit or 2-byte encoding standard. This amount of information was enough to encode most of the popular languages.
However, it is not enough to encode all possible information. Ancient scripts, dead languages, and emoji are just a few examples of information missing from the initial standard. That’s why it is now not a 16-bit encoding but a 21-bit one, with more room for growth if needed.
What this means is that Unicode code points can now go up to U+10FFFF, a value that takes 21 bits to represent.
But what if your text is in plain English, contains no special characters, and can be encoded using the first 128 characters of Unicode? It would be nicer to encode it the ASCII way, where each character takes only 8 bits, about 2.6 times less space to store.
Unicode is a flexible encoding, and thanks to the different Unicode transformation formats, or UTFs, you can encode text using different numbers of bits. There are three major formats: UTF-8, UTF-16, and UTF-32. The number indicates how many bits a single code unit of that format uses.
You can use UTF-8 for plain English text. Every character in such text takes exactly 8 bits to store. But what if your text contains a character beyond the scope of the first 128? You can still use UTF-8 because it is a flexible standard: depending on the character, it allocates as many bytes as needed to store it.

In this example, we use UTF-8 to encode all characters. Each character in the word “Hello” is encoded using only 1 byte. However, the Thai Ko Kai (ก) character is encoded with 3 bytes using the same encoding scheme.
However, Thai characters don't have to occupy 3 bytes each. When using UTF-16, each of them takes only 2 bytes to store.

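You can check the UTF-8 byte counts with the standard TextEncoder API, which always encodes to UTF-8:

```javascript
// TextEncoder always produces UTF-8; the length of the resulting
// Uint8Array is the byte size of the encoded text.
const utf8 = new TextEncoder();

console.log(utf8.encode('Hello').length); // 5 — 1 byte per English character
console.log(utf8.encode('ก').length);     // 3 — the Thai character needs 3 bytes
```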
That’s why the UTF-16 and UTF-32 transformation formats are still valuable and not going anywhere, despite the vast adoption of UTF-8.
If you know that the text you’re dealing with is written entirely in Thai, it doesn’t make sense to use UTF-8 as the encoding. It will work, but it takes 50% more space than UTF-16: 3 bytes per character instead of 2.
It works in the other direction as well. Using UTF-16 for plain English text makes it twice as large in byte size as UTF-8 because each character is encoded with at least 2 bytes.

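As a sketch, Node.js's Buffer.byteLength makes this difference easy to measure (browser environments would need TextEncoder plus manual arithmetic for UTF-16):

```javascript
// Measure the byte size of the same text under two encodings.
// Buffer.byteLength is Node.js-specific.
const english = 'Hello world';

console.log(Buffer.byteLength(english, 'utf8'));    // 11 bytes
console.log(Buffer.byteLength(english, 'utf16le')); // 22 bytes — twice as large
```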
Code point and Code unit
Each character in Unicode has a unique numerical identifier. Such a unique identifier is called a code point.
Every code point is unique, regardless of the UTF you’re working with. Every code point is written in the following format: U+XXXX, where XXXX is a hexadecimal number. The range of code points goes from U+0000 to U+10FFFF. For example, the code point for the character “A” is U+0041.
While a code point identifies a Unicode character regardless of the UTF, a code unit is specific to a particular UTF. Depending on the encoding, a code point may be represented by one or more code units.


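JavaScript strings expose both views: codePointAt returns the code point, while length and charCodeAt count and return UTF-16 code units. A character outside the first 65,536 code points makes the difference visible:

```javascript
const rocket = '🚀'; // code point U+1F680

// One code point...
console.log(rocket.codePointAt(0).toString(16)); // "1f680"

// ...but two UTF-16 code units (a surrogate pair).
console.log(rocket.length); // 2
console.log(rocket.charCodeAt(0).toString(16)); // "d83d" — the high surrogate
```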
Unicode planes
When Unicode was first introduced in the 1980s, its creators believed that 65,536 code points would be enough to encode all the world's popular writing systems.
This initial set of code points is now known as the Basic Multilingual Plane (BMP). The BMP contains characters you use every day, including Latin letters, common symbols, and characters from widely used non-Latin scripts.
However, as the standard progressed, it became clear that more space would be needed. In Unicode 2.0 (1996), supplementary planes were introduced, expanding from one multilingual plane to 17.
Each plane contains 65,536 code points, which extends the initial capacity to 1,114,112. This expansion was crucial for several reasons:
Accommodating complex writing systems: Scripts like Han (used in Chinese, Japanese, and Korean) required far more characters than initially anticipated.
Future-proofing: The additional planes provided room for newly discovered historical scripts and potential future writing systems.
Special-purpose characters: Planes were allocated for technical symbols, emoji, and private-use characters.
The introduction of supplementary planes marked a significant milestone in Unicode's development, transforming it from a limited character encoding system to a comprehensive standard capable of representing virtually all known writing systems.
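Since each plane holds 65,536 code points, a character's plane is simply its code point divided by 0x10000 (65,536). A small illustrative helper (the name planeOf is made up for this sketch):

```javascript
// Which of the 17 planes does a character belong to?
const planeOf = (ch) => Math.floor(ch.codePointAt(0) / 0x10000);

console.log(planeOf('A'));  // 0 — Basic Multilingual Plane
console.log(planeOf('ก'));  // 0 — Thai is also in the BMP
console.log(planeOf('😀')); // 1 — most emoji live in a supplementary plane
```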
UTF encoding and JavaScript
Now, it is time to look at encodings in the context of JavaScript.
JavaScript internally uses UTF-16 encoding for strings. That applies to any string, whether it came from a file, the network, or anywhere else. If a string makes it into the JavaScript world, you can be sure it is encoded as UTF-16.
It is just a specification requirement, and we can do little about it. The positive side is that things just get simpler. We work with one encoding and one encoding only.
If we create a variable for an error message whose text is 20 characters long, we can be sure that storing this string takes precisely 40 bytes: 2 bytes per character.
// The variable string content occupies 40 bytes of memory
const errorMessages = 'Something went wrong';
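As a sanity check, in Node.js we can measure this with Buffer.byteLength:

```javascript
// 20 characters, each stored as one 2-byte UTF-16 code unit.
const message = 'Something went wrong';

console.log(message.length); // 20 characters
console.log(Buffer.byteLength(message, 'utf16le')); // 40 bytes
```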
At the same time, there is a way to encode a string in JavaScript using less memory. It is possible to do so only using buffers.
// "Sun" string encoded in hexadecimal numeric system
const buffer = new Uint8Array([0x53, 0x75, 0x6e]);
// The default decoding scheme is UTF-8
const decoder = new TextDecoder();
console.log(decoder.decode(buffer)) // Prints "Sun";
You don't need to understand the whole buffer workflow just yet. We'll talk about it in a future article.
The string in the buffer is UTF-8 encoded; the string that the decoder.decode() function returns is UTF-16 encoded again, because that is how the TextDecoder API and the whole buffer workflow operate.
The interesting thing is that if we mismatch the encoding and decoding schemes, we get completely unexpected results.
The data we save in the buffer is the same, and the type of the buffer is the same, but the decoding scheme is different. Because of that, the output no longer resembles the original text.
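Here is a small sketch of such a mismatch: the same four bytes read as UTF-8 and as UTF-16LE produce entirely different strings (utf-16le is one of the encoding labels the TextDecoder API accepts):

```javascript
// "Sun!" encoded as UTF-8 bytes.
const bytes = new Uint8Array([0x53, 0x75, 0x6e, 0x21]);

console.log(new TextDecoder('utf-8').decode(bytes)); // "Sun!"

// The same bytes decoded as UTF-16LE are read in 2-byte pairs,
// producing two unrelated characters instead of four.
const wrong = new TextDecoder('utf-16le').decode(bytes);
console.log(wrong.length); // 2
console.log(wrong === 'Sun!'); // false
```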
Conclusion
ASCII and UTFs are encoding schemes that allow text information to be shared across different machines and the Internet without losing any information.
With Unicode, we can encode up to 1,114,112 characters, which is more than enough for the foreseeable future. The Unicode standard consists of multiple parts, such as code points, code units, planes, etc.
Unicode's transformation formats (UTFs) provide the ability to encode the same exact text using different schemes.
Internally, JavaScript uses UTF-16 for all strings. However, it doesn't mean we can't use different encodings to store strings in a format that we want.
You have to be mindful when working with different encodings because using mismatching encoding and decoding schemes can lead to unexpected results.
