Hi, my name is Tomas. I work at the Czech Technical University in Prague and I am also a member of the R Core Team. I would like to tell you something about encoding support in R.

An encoding defines how characters map to sequences of bytes. Different encodings support different sets of characters, and they map them differently. Commonly used encodings with R are ASCII on all operating systems, UTF-8 on Unix systems, and Latin-1 (and many others as well) on Windows.

ASCII maps one character to one byte, using only seven bits of the byte. The text "hello" on this slide is mapped to the five bytes shown here. ASCII only supports the English language.

Latin-1 is a superset of ASCII. It uses all eight bits of a byte to represent a single character. The German text "Grüße" is represented using the bytes shown here; "G", "r" and "e" are the same bytes as in ASCII. Latin-1 is a common encoding in R on Windows for users from the United States and Western European countries, but it is also used in other countries around the world.

Shift-JIS, for example, is an encoding used in Japan. It also supports Russian and English, but primarily it is for Japanese. It is a double-byte character set, which means that one character is encoded using one or two bytes. In this example we have three characters: the first two are each represented by two bytes, and the third, an exclamation mark, is represented using a single byte, the same byte as it would be in ASCII. There are a number of other double-byte character sets used on Windows, mostly by users in Asia.

UTF-8 is an encoding that maps one character to one to four bytes; it is a so-called multi-byte character set. The text "Grüße" is represented using the bytes shown here: "G", "r" and "e" are represented using the same bytes as in ASCII, and "ü" and "ß" are each represented using two bytes. What is important is that UTF-8 supports all languages.
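The byte-level mappings just described can be inspected from R itself with `charToRaw`; a minimal sketch (the `\u` escapes stand in for "ü" and "ß"):

```r
# ASCII: one character per byte, so "hello" is five bytes
charToRaw("hello")
# [1] 68 65 6c 6c 6f

# "Grüße" in UTF-8: "ü" and "ß" take two bytes each, 7 bytes in total
utf8 <- "Gr\u00fc\u00dfe"
length(charToRaw(utf8))    # 7

# The same text converted to Latin-1: one byte per character, 5 bytes
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")
length(charToRaw(latin1))  # 5
```

The same approach works for any encoding supported by `iconv`, so it is a handy way to see exactly which bytes a given encoding produces.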
UTF-8 also dominates web use today, and it is the default on all current operating systems that are used with R, except Windows.

Operating systems have the concept of a native encoding: the encoding that is currently expected by the operating system and the C library in the strings passed to them. On Unix there is one native encoding at a time, and today by default it is UTF-8, even though one can change it. On Windows things are more complicated: there are two current native encodings, one for the C library and one for the operating system, with additional details that complicate things. Also, applications may choose to use UTF-16 instead, ignoring the encoding that is set by the OS.

In R, character objects on the heap represent strings: one character object for one element of a character vector. Each character object can be stored in a different encoding, and there is a flag saying which encoding it is. The flag can say it is UTF-8 or Latin-1. It may also say "unknown", which means the string is either ASCII, which is a subset of all encodings supported by R, or that it is in the native encoding when that is neither UTF-8 nor Latin-1; so, for instance, it could be the Shift-JIS encoding for Japanese on Windows. In addition to these three values one can also use the "bytes" encoding, which is not really an encoding: it means that we are not working with strings but with byte arrays, and this is mostly for expert use only.

This works nicely on Unix. When the native encoding is UTF-8 and we enter the string "Grüße" by typing or pasting, it gets the UTF-8 encoding flag. The string "hello" gets the encoding flag "unknown" because it is ASCII. On Windows running in a Latin-1 locale, you can again enter "Grüße" by pasting, or by typing if you have a German keyboard, and it will get the Latin-1 flag. We can also enter "Grüße" using the \u escapes, in which case it will be a UTF-8 string.
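The per-string encoding flag can be queried and set with `Encoding()`; a small sketch of the behaviour described above:

```r
# A string entered with \u escapes is stored as UTF-8 and flagged as such
x <- "Gr\u00fc\u00dfe"
Encoding(x)        # "UTF-8"

# Pure ASCII strings get the flag "unknown": ASCII is a subset of all
# encodings R supports, so no more specific flag is needed
Encoding("hello")  # "unknown"

# "bytes" marks a byte array rather than a string (expert use only)
y <- x
Encoding(y) <- "bytes"
Encoding(y)        # "bytes"
```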
R will automatically convert strings to the right encoding when it needs to, and things work well, except when they don't, and I will speak about that later.

As for the internal encoding support in R: it supports multiple encodings on input and output, and very many encodings are supported via the iconv library. Character objects carry the encoding flags I have described, and, very importantly, sometimes R needs to convert strings to the native encoding. This is often necessary when passing strings to external libraries, or to code that comes from external projects but has been incorporated into R. Sometimes it is also necessary when passing strings to R subsystems that have been designed to work with byte arrays. Symbols in R, for instance, are in the native encoding.

So this brings a limitation to R on Windows. The native encoding on Windows cannot be UTF-8, nor any other Unicode encoding, and hence it cannot represent all characters. Printing in R already requires conversion to the native encoding, so the Japanese string I have shown before we cannot even print: we get some weird escapes, as shown here, when running on Windows in a Latin-1 locale. It would work fine if Windows were running in a Japanese locale, but not in Latin-1.

So how to avoid getting into these problems? Only work with strings that can be represented in your native encoding, and if there is no Windows locale with such an encoding, use Linux; unfortunately, that is the simplest advice today, and there you will not have these problems. I would also recommend using only ASCII for file names, directory names and user names, and, not for encoding reasons but in addition, only letters, numbers, underscore and dot.

But still, how to improve things on the R side, how to become more permissive? Using UTF-16 on Windows is not a solution. UTF-16 (or UCS-2, as it was) was adopted by Windows a long time ago, perhaps too early.
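The conversions to the native encoding happen implicitly, but they can also be performed and inspected explicitly with `enc2native` and `iconv`. A sketch, using an arbitrary Japanese string purely for illustration (the result of `enc2native` depends on your locale):

```r
x <- "Gr\u00fc\u00dfe"                       # UTF-8 string

# Explicit conversion between encodings, via the iconv library
lat <- iconv(x, from = "UTF-8", to = "latin1")
Encoding(lat)     # "latin1"

# Conversion to the native encoding, as R does internally (e.g. for
# symbols); in a UTF-8 locale this is a no-op
nat <- enc2native(x)

# The problem case: characters with no representation in the target
# encoding are lost; converting Japanese text to Latin-1 yields NA
iconv("\u3053\u3093\u306b\u3061\u306f", from = "UTF-8", to = "latin1")
# [1] NA
```

That `NA` is exactly the situation described above: a Latin-1 native encoding simply has no bytes for those characters.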
It is an encoding that is not used by most software; most open-source software uses UTF-8, and one needs to write platform-specific, Windows-specific code to be able to take advantage of the UTF-16 API. We could change R to do it, and in parts we have changed R to do it on Windows, but this could never be the main encoding in R, because we rely on so much external software that will never work with UTF-16, and it is not really used by platform-independent code.

But there is a solution now. Very recently, Windows 10 came with support for UTF-8 that is already sufficient. One can set UTF-8 as the default native encoding, for both of them (as I said before, there are two on Windows), and it can be done selectively per application, so for R. This is great. R has to be changed only a little bit to support UTF-8 as the native encoding. I made that change, and it was simple because we already support it on other systems.

The problem is that UTF-8 on Windows requires a new C library. It requires UCRT, the Universal C Runtime in Microsoft terminology, and to be able to use it we unfortunately need a new toolchain: a toolchain that will build everything with this new C runtime. Rtools 4 is not new enough, and we have to rebuild our libraries for packages, all packages, and R itself.

As a proof of concept, I have created an experimental toolchain for R that is good enough to build R itself, the base packages and the recommended packages, and I have described the details in my blog post and even more details in a technical write-up. I also created a custom R build with an installer that anyone can use to play with this, and one can download my toolchain as well. The toolchain only supports some packages, not all CRAN and Bioconductor packages. With this custom build, in RTerm, after setting the code page to UTF-8 and choosing a suitable font, one can print the Japanese string I have shown before.
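Whether a given R session actually runs with UTF-8 as the native encoding can be checked with `l10n_info()`; a minimal sketch (the values, of course, depend on the system and locale you run it on):

```r
# l10n_info() reports properties of the current native encoding:
#   $MBCS      - is a multi-byte character set in use?
#   $`UTF-8`   - is the native encoding UTF-8?
#   $`Latin-1` - is it Latin-1?
# (on Windows it additionally reports the code page)
info <- l10n_info()
str(info)

# On a UCRT build of R on Windows 10 with the UTF-8 code page enabled,
# as on current Unix systems, this element is TRUE:
info$`UTF-8`
```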
The native encoding is UTF-8, so nothing happens when printing: there is no conversion, because the string is already in UTF-8, and it can be represented in UTF-8.

So that is all from me. I hope I have explained the basics of R's encoding support; more is available in my blog and of course in the R manuals. And I hope I have also shown you that R actually supports Unicode. The problem on Windows is setting UTF-8 as the native encoding, and what we need for that is a new toolchain to rebuild R and the packages.