In a previous unit, we explained how to represent numbers as bits. The next question is: how do we represent text as bits? Well, the answer, in short, is that we can represent text as numbers. Here, for example, we could represent the word banana as a sequence of numbers corresponding to each letter's position in the alphabet: b corresponds to 2, a to 1, and n to 14. In principle, though, the associations between individual characters and numbers may be arbitrary, so long as both the writer and reader of the text data agree on which numbers correspond to which characters. Here, for example, b is represented as 52, a is represented as 97, and n is represented as 4. A set of mappings between characters and numbers is called a character set. Over the history of computing, there have been many commonly used character sets, but the predominant character set for many decades was ASCII, standing for American Standard Code for Information Interchange.

ASCII contains 128 different characters, including the uppercase and lowercase letters of the English alphabet, the numerals 0 through 9, and the common punctuation characters used in English, basically everything on a standard US English keyboard. ASCII also includes a few whitespace characters, most notably a character representing a single space. Here, for example, when representing the text "a banana" as ASCII, the space itself is represented by the number 32, because the space character in ASCII is mapped to the character code 32. Also note that uppercase A and lowercase a are separate characters, and so are represented with different code points.

The other whitespace characters are horizontal tab, vertical tab, line feed, and carriage return. Exactly what these characters signify differs between programs. For example, a horizontal tab generally denotes some amount of space that's greater than a single space character, but how much space exactly is up to the program displaying the text. Aside from the space itself, these whitespace characters are considered to be so-called control characters. The control characters in ASCII are characters which don't necessarily represent a visible symbol but rather some action; the idea is that programs or hardware reading ASCII text data perform an action when reading these characters. For example, back in the 1950s and 60s, teletype machines would sound a chime when receiving the bell character and advance their paper reel when receiving the form feed character. As the original uses of the ASCII control characters are now mostly archaic, most programs ignore the control characters. Two major exceptions, though, are line feed and carriage return. In the conventions of UNIX programs, a single line feed character is used to designate the end of a line, i.e. a new line. Hence, the character is often called a newline even though its official ASCII name is line feed. In the Windows world, the competing convention was adopted that a new line should be indicated by two characters used together: a carriage return followed by a line feed. This difference of convention used to be a big headache, but most recent programs cope just fine with text data that uses either convention.
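To make these mappings concrete, here is a minimal Python sketch (not part of the original lesson; it assumes a standard Python 3 interpreter) that prints the actual ASCII code of each character in the text "a banana" and shows the two newline conventions:

    # Print the ASCII code of each character in "a banana".
    text = "a banana"
    for ch in text:
        print(repr(ch), ord(ch))   # 'a' 97, ' ' 32, 'b' 98, 'n' 110, ...

    # The two newline conventions: UNIX uses a lone line feed (10),
    # Windows uses a carriage return followed by a line feed (13, 10).
    print([ord(c) for c in "\n"])     # [10]
    print([ord(c) for c in "\r\n"])   # [13, 10]

Running this shows that the space really is code 32, and that the actual ASCII codes for b, a, and n are 98, 97, and 110, rather than the arbitrary numbers used in the illustration above.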
The term plain text refers to data consisting of just a sequence of text characters and nothing else, no font styling or formatting of any kind. The source code we write as programmers is plain text. A Microsoft Word document, on the other hand, is not plain text, because it contains formatting information and many other things in addition to the text content of the document. A text editor is a program which creates and edits files of plain text. The Notepad program included in Windows is one such example. Whereas Notepad is a very simple program, some other text editors, most notably Vi and Emacs, contain numerous advanced features.

When talking about character sets, it is important to distinguish between the notion of a character and a glyph. A character is an abstract thing, whereas a glyph is a concrete visual representation of a character. Here, for example, we have two different glyphs of the lowercase j character. The lowercase j character is an abstract idea, but it has many concrete visual representations in different fonts and styles.

It's also important to understand the difference between a character set and a character encoding. Whereas a character set designates which characters are associated with which numbers, a character encoding specifies how exactly to represent those numbers as bits and bytes. For example, ASCII is most commonly encoded as one byte per character, but back when memory and storage were especially precious, ASCII was sometimes encoded with just 7 bits per character. The ASCII code points run from 0 to 127, so 7 bits (2 to the 7th is 128) are enough to represent all those possible values. So when a program reads and writes text data, it needs to be mindful not only of the character set but also of the encoding. As we'll see shortly, some character encodings can be fairly complicated.

The obvious problem with ASCII is that it suits English OK but not most other languages. For decades, non-English languages were represented with a hodgepodge of other character sets. To solve this problem, a standard character set called Unicode was introduced in the early 1990s. Unicode not only includes all of the characters of every known written language in history, it also includes a large number of other symbols, such as symbols of mathematical and musical notation. The code points of Unicode are denoted as hexadecimal numbers preceded by a U and a plus sign. The code points range from U+0000 up to U+10FFFF, meaning Unicode has 1,114,112 code points. Currently, however, most code points are left unused, allowing for future expansion.

The code points are divided into 17 groups called planes, each with 65,536 code points. The first plane, plane 0, is called the BMP, the basic multilingual plane. The BMP contains the characters of modern human languages, though many of the less common ideographs of Chinese, Japanese, and Korean are located in plane 2, known as the supplementary ideographic plane. Plane 1, the supplementary multilingual plane, contains characters of historical languages. Planes 3 through 13 are currently almost entirely unused. Plane 14, the supplementary special purpose plane, contains non-graphical symbols used in some textual data formats like XML. The last two planes are reserved for private use, such that programs can designate their own meanings for those code points, with, of course, the understanding that those imbued meanings won't be recognized by other programs. Notice that the hex digits of a code point other than the last four hex digits effectively designate the plane. For example, the code point 2ABCD is in plane 2.
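To illustrate code points and planes (this sketch is not from the original lesson and assumes a Python 3 interpreter), we can print a character's code point in the U+ notation and compute its plane by shifting away the last four hex digits:

    # Print a character's Unicode code point and the plane it belongs to.
    def describe(ch):
        cp = ord(ch)        # the code point as an integer
        plane = cp >> 16    # everything above the last four hex digits
        print(f"U+{cp:04X} is in plane {plane}")

    describe("A")            # U+0041 is in plane 0, the BMP
    describe("\U0001F600")   # U+1F600 is in plane 1, the SMP
    describe("\U0002ABCD")   # U+2ABCD is in plane 2, the SIP, as noted above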
The Unicode standard specifies not just a character set, but also several standard encodings for that character set. We'll look at the three most common, starting with the simplest one, UTF-32. UTF stands for Unicode (or UCS) Transformation Format, and the 32 indicates that this encoding uses 32 bits to represent each character. In UTF-32, we simply represent each code point as a 32-bit number. So, for example, code point 3FF01 is represented as the bytes 00 03 FF 01. Likewise, code point 40077 is represented as 00 04 00 77. Because the Unicode code points only go up to 10FFFF, the first byte of the four will always be 00. It may seem wasteful for each character to take up an extra byte, but CPUs typically deal best with memory in chunks of four bytes rather than odd-sized chunks like three. So when memory or storage usage are not a concern, UTF-32 may actually be the most efficient option.

The UTF-16 encoding uses 16 bits for characters of the BMP, the basic multilingual plane, but 32 bits for the characters of the other planes. The case of BMP characters is simple. Here we have the code points 0065 and F10F, both encoded as just two bytes. For the other planes, things get more complicated. In the first two bytes, the first six bits are always 110110, the next four bits designate the plane, and the next six bits represent part of the code point. In the last two bytes, the first six bits are always 110111, and the remaining ten bits make up the rest of the code point within the plane. For example, the code point 3FF01 is represented as the bytes D8 BF DF 01. The bits in red are the 12 pre-designated bits. The green bits specify that this is plane 3, and the remaining orange bits are where we put the value FF01, the code point's offset within the plane, just split across the four bytes. Understand that the four green bits specify one of the 16 planes other than the BMP, so plane 1 is represented by the value 0, plane 2 by the value 1, plane 3 by the value 2, and so forth.

Now, you may be wondering how a program reading these four bytes will know they are meant to encode just one character and not two. Well, the code points D800 through DFFF are called surrogates, meaning that they are left unassigned expressly so that a program won't mistake these byte pairs for individual characters. Also note that in these four-byte encodings, the six fixed bits at the start of each byte pair differ between the first and second pair, so a program can always distinguish which byte pair represents the first two bytes and which represents the last two bytes.

Because UTF-16 uses just two bytes for characters in the basic multilingual plane, it's a good choice for representing text of non-English languages. However, for text that uses many characters outside the BMP, such as the less common ideographs of Chinese, Japanese, and Korean in the supplementary ideographic plane, UTF-16 does not save much memory or storage space compared to UTF-32, and so may not be worth the extra processing work.
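As a quick check (this sketch is not from the original lesson; it assumes Python 3 and uses the big-endian codecs so the byte order matches the examples above), we can ask Python to encode the example code points in UTF-32 and UTF-16:

    # Encode the example code points with big-endian UTF-32 and UTF-16.
    for cp in (0x3FF01, 0x40077, 0x0065):
        ch = chr(cp)
        utf32 = ch.encode("utf-32-be")   # always four bytes per character
        utf16 = ch.encode("utf-16-be")   # two bytes for the BMP, four otherwise
        print(f"U+{cp:04X}: UTF-32 = {utf32.hex()}  UTF-16 = {utf16.hex()}")

    # U+3FF01: UTF-32 = 0003ff01  UTF-16 = d8bfdf01
    # U+40077: UTF-32 = 00040077  UTF-16 = d8c0dc77
    # U+0065: UTF-32 = 00000065  UTF-16 = 0065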
The last encoding we'll discuss, and perhaps the most widely used, is UTF-8. In the UTF-8 encoding, characters are represented with as few as eight bits, but as many as 32. Code points 0000 to 007F require just one byte, code points 0080 to 07FF require two bytes, code points 0800 to FFFF require three bytes, and all remaining code points require the full four bytes.

In the single-byte encoding, the byte always begins with 0, and the remaining bits specify the code point. In the two-byte encoding, the first byte always begins with 110, the second byte begins with 10, and the remaining bits specify the code point. In the three-byte encoding, the first byte always begins with 1110, the second and third bytes both begin with 10, and the remaining bits specify the code point. In the four-byte encoding, the first byte always begins with 11110, the second, third, and fourth bytes all begin with 10, and the remaining bits specify the code point.

So notice a few things here. First, if a byte begins with 10, it cannot be the first byte of a character. Second, the leading bits of the first byte effectively indicate the number of bytes in the character.

Looking at a few examples: code point 0031 is represented as a single byte, 0700 as two bytes, 86FF as three bytes, and 50000 as four bytes. Be clear that the binary form of the hex code point is split across the orange bits of the multiple bytes. For example, 86FF in binary is 1000 0110 1111 1111. You might think it would be okay to encode a character in more bytes than it needs, but doing so is not allowed by UTF-8. For example, it is invalid to encode code point 0031 as four bytes.

Because the first 128 code points of Unicode are the same 128 characters of ASCII, English text encoded in UTF-8 will consist mostly of single-byte encodings. The code points up to 07FF cover the characters used for most other European languages and some Semitic languages, so UTF-8 can represent text in those languages using mostly two-byte encodings. For other languages, however, UTF-16 is generally the more efficient option.
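Here, finally, is a small Python sketch (not from the original lesson; it assumes Python 3) that encodes each of the example code points as UTF-8 and shows how many bytes each one takes:

    # Encode the example code points as UTF-8 and show the resulting bytes.
    for cp in (0x0031, 0x0700, 0x86FF, 0x50000):
        encoded = chr(cp).encode("utf-8")
        print(f"U+{cp:04X} -> {len(encoded)} byte(s): {encoded.hex()}")

    # U+0031 -> 1 byte(s): 31
    # U+0700 -> 2 byte(s): dc80
    # U+86FF -> 3 byte(s): e89bbf
    # U+50000 -> 4 byte(s): f1908080

Note also that standard UTF-8 decoders, Python's included, reject the overlong encodings mentioned above, so a four-byte encoding of code point 0031 would fail to decode.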