Welcome to rC3. Hello and welcome to the Franconian stage. Sorry for the delay, we had some technical problems. We have to get used to it. I want to present an interesting talk from Miro. He talks about names. If you think about names, a name can have a first name, last name, middle name, prefix, and suffix, but think about encoding: it can be difficult. Miro will tell us about some of the difficulties he has experienced. So welcome, Miro. I'm glad to hear your talk.
Thank you very much. Hello, everyone. My name is Miroslav Šedivý. No, that's wrong. No, that's still wrong. There's still some character wrong. No, that's still the wrong encoding. No, no, no, no. Yes, now it works. My name is Miroslav Šedivý, and what I hear quite often is that my name is unfortunately invalid. That cannot be true. Miroslav Šedivý: this is how you write it, this is how you pronounce it in IPA, and if you have a standard keyboard layout and use the compose key, you can type it as well.
Today I'm going to speak about names in IT, in programming. Python is just the example programming language. If you use a different one, most of this very probably applies to the same extent or a little bit differently, and I would be interested to hear about your experiences. I'm going to speak about strings and bytes, about encoding, normalizing, case folding, sorting, and regular expressions; and about names on the web, in web forms, and in databases: how they are formed, how they consist of different parts (prefix, first, middle, and last names), and which characters are allowed. So: how you should program your app so that it handles people's names correctly and without problems.
In Python 3 we have strings and bytes, which are two different types. Strings consist of characters, the actual characters that are used to write in some language.
And we have more than one million possible code points in Unicode. These strings exist only in memory, in the working memory. If you read something from a file or over the network, it comes as bytes, the good old bytes with 256 possible values from eight bits. But as soon as you have read the bytes, you convert them to a string, because, as we are going to see, one character can consist of several bytes: there are many more than 256 possible characters. Then in memory you work with those strings, and at the end you convert them to bytes again when you want to save them to a file on disk or send them over the network.
If your name is Chuck Norris, you don't need any encodings. But of course the string Chuck Norris is still encoded into bytes using the .encode() method, and the other way around, you decode the bytes back to Chuck Norris. Both look the same. In Python 3 the default encoding is UTF-8, and Chuck Norris looks the same in UTF-8 as in the original characters, so there is no change. You see that Chuck Norris is 12 characters, or 12 bytes when encoded as UTF-8.
If you are from Germany and your last name is Müller, it is a little bit different, because in UTF-8 the u with diaeresis, the umlaut, doesn't fit into one single byte: two bytes encode your ü. If your last name is 你好 (of course this is not a last name, it means nǐ hǎo, hello), the two Chinese characters are encoded as six bytes in UTF-8. It works one way and the other way around: if you know that your bytes are in UTF-8, you can read them, decode them, and get your original Chinese characters back.
But there are also other encodings apart from UTF-8. UTF-8 is great because it works for Unicode and for most characters that we need. But much, much earlier there was ASCII.
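The round trip between strings and bytes described above can be tried with a few lines of Python. This is only a minimal sketch; the helper name `utf8_size` is made up for illustration and is not from the talk.

```python
# Round trip between str (characters) and bytes, using UTF-8.

def utf8_size(name):
    """Return (number of characters, number of UTF-8 bytes) for a name."""
    return len(name), len(name.encode("utf-8"))

names = ["Chuck Norris", "Müller", "你好"]

for name in names:
    encoded = name.encode("utf-8")      # str -> bytes, e.g. before writing a file
    decoded = encoded.decode("utf-8")   # bytes -> str, e.g. after reading a file
    assert decoded == name              # the round trip loses nothing
    characters, size = utf8_size(name)
    print(f"{name!r}: {characters} characters, {size} bytes, {encoded!r}")
```

Only the pure-ASCII name has the same number of characters and bytes; the ü takes two bytes and each Chinese character takes three.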
And ASCII knows only seven bits and only 128 possible values, so there is a limited number of characters that you can encode directly. In the case of Chuck Norris it works. But if your name is Müller, you need something like the Latin-1 encoding. Latin-1, or ISO 8859-1, is an encoding that works very well for some Western European languages like German, French, Spanish, Italian, and others. It knows how several additional characters map to their respective bytes. So the u with umlaut has a place in the Latin-1 encoding table, but many other characters don't, because the number of characters that can be encoded is quite limited.
My last name, Šedivý, which means grey-haired in Czech and Slovak (it comes from Czechoslovakia), cannot be encoded in Latin-1. For languages like Czech, Slovak, Polish, Hungarian, and other Central European languages we used Latin-2, which has some other characters in the places of characters that are encoded in Latin-1. So the s with caron has a place there, but it cannot be encoded in Latin-1. You need to encode it in Latin-2.
Living in Germany, I sometimes get packages by mail from German companies. And you see that they have some problem with my last name: on the web form my last name is written correctly, Šedivý, but on the sticker on the package the Š is replaced by a question mark. I just wanted to know why. Š and a question mark look quite similar, but that's not the problem. How is it possible that they encoded my Š as a question mark?
In Python, if I encode my last name, Šedivý, into Latin-2, it works. But if I encode it as Latin-1, then of course it will not find the s with caron, the Š character, and it will raise an exception: UnicodeEncodeError, I don't know the Š because it is not contained in Latin-1.
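Here is a minimal sketch of the failure just described, together with the `errors=` options that the talk turns to next. The handler name `replace_randomly` follows the talk's wording, but its body is an assumption about what such a handler could look like.

```python
import codecs
import random

surname = "Šedivý"

# Latin-2 (ISO 8859-2) has a slot for Š, so this works.
latin2_bytes = surname.encode("latin2")

# Latin-1 (ISO 8859-1) has no Š; the default errors="strict" raises.
try:
    surname.encode("latin1")
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print("strict:", exc)

# errors="replace" puts "?" in place of each character that cannot be encoded.
replaced = surname.encode("latin1", errors="replace")
print("replace:", replaced)


def replace_randomly(error):
    """Illustrative handler: one random digit per unencodable character."""
    count = error.end - error.start
    digits = "".join(random.choice("0123456789") for _ in range(count))
    return digits, error.end   # (replacement text, position to resume at)


codecs.register_error("replace_randomly", replace_randomly)
randomized = surname.encode("latin1", errors="replace_randomly")
print("replace_randomly:", randomized)
```

Note that ý survives in both encodings because Latin-1 does contain it; only the Š is lost.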
Okay, but I receive a package. I don't receive an exception. So there is probably some way they fix it so that it works and I get my package, although with a wrongly printed name. And yes, there is a parameter in Python: you can encode to Latin-1 and say that if there is an error, the character is replaced. The default for errors is to raise an exception, as in the first case. But if I tell it to replace on error, the character is replaced by a question mark. Why did they choose a question mark? No idea. It is not configurable. This is probably how that big company encoded my last name and converted it to Latin-1, and that's where the question mark comes from.
But there is a small hack that lets you replace the missing character with something else. You can write a short one-liner in Python, for example like this, where I say that if there is an error, use the replace_randomly function. In that case it just puts in a random digit. You could write something else, some funny character that is printed instead of the missing character. In this case I got a 5 instead of the Š. Well, a 5 looks more like Š than a question mark does. That's fine.
There are some other big companies in Germany. One, I will not tell you the name, but they have beautiful, big trains all around the country. On the mail that I get, on the customer card, and on the online tickets, they manage to use a different encoding each time and to write my last name differently each time. There is another big company that has big airplanes. When I wanted to buy a ticket and wrote my last name, they told me: you can only enter letters in the adult's last name field. Hey, my last name consists of letters.
What is a letter? In Python 3 you can name a variable using any character that looks like a letter. So, for example, I could do something like that.
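As a small illustration of that point (the variable name and the list of candidates are made up here, not taken from the slides), Python can be asked directly which strings it accepts as identifiers:

```python
# A non-ASCII letter is a perfectly good Python 3 identifier.
šedivý = "grey-haired"

# str.isidentifier() reports whether a string could be used as a name.
candidates = ["š", "šedivý", "name2", "2name", "?", "☺"]
valid = {candidate: candidate.isidentifier() for candidate in candidates}

for candidate, ok in valid.items():
    print(f"{candidate!r}: {'identifier' if ok else 'not an identifier'}")
```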
But I cannot name a variable, for example, with a question mark or a smiley. How does Python know that š is a letter and that a question mark, a smiley, or some arrow is not a letter? If you import the standard unicodedata library, there are some functions that let you inspect what the characters look like. So now I have a few characters: ä, the sharp s in lowercase and uppercase (yes, there is an uppercase sharp s), the dot, and the smiley. And then I ask for the category and for the name. What you see in the first column is the character, then you see two letters: Ll, Lu, Zs, Po, So. This is the category of the character. If it starts with an L, it is a letter, and a following u or l means it is an uppercase or a lowercase letter. Zs, Po, and So are other categories; there are more for symbols, numbers, digits, and so on. And then you see the name, for example LATIN SMALL LETTER A. This is the list of all categories. Every character in the Unicode table belongs to some category.
You can access the information behind that. So if it looks like a letter, I can use it as a letter. If it is a number, I can probably get its decimal value, even if the character does not look like one of our digits, such as a number from some other alphabet. There is also the Character Map app, which gives you all the information about a character: the category, the name, and all its properties.
Case folding. Case folding is the possibility to switch between lowercase and uppercase letters, which works in the Latin alphabet, the Greek alphabet, and the Cyrillic alphabet. It doesn't work in Chinese. It works only in some alphabets, including ours. So if I have some characters in lowercase and call .upper(), I get the uppercase version, and vice versa. There are some exceptions.
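The basic conversion, and the exceptions the talk goes through next, can be tried with the built-in string methods. This is only a sketch of the standard, locale-independent behaviour of `str.upper()` and `str.lower()`; the sample word is chosen for illustration.

```python
word = "šedivý"
upper = word.upper()                  # 'ŠEDIVÝ'
assert upper.lower() == word          # the usual case: a clean round trip

# An uppercase sharp s exists (ẞ, U+1E9E), but upper() produces "SS".
sharp_s_upper = "ß".upper()
capital_sharp_s_lower = "ẞ".lower()   # this direction gives 'ß'
round_trip = sharp_s_upper.lower()    # 'ss', not 'ß': not symmetrical

# Python's str methods know no locale, so Turkish gets the English rules:
dotted_lower = "I".lower()            # 'i'; Turkish would expect dotless ı
dotless_upper = "\u0131".upper()      # ı (dotless) becomes plain 'I'
dotted_capital_lower = "\u0130".lower()   # İ becomes 'i' + combining dot above

print(upper, sharp_s_upper, capital_sharp_s_lower, round_trip)
print(dotted_lower, dotless_upper, len(dotted_capital_lower))
```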
So for example, the sharp s: there is an uppercase sharp s, but Python converts the lowercase one to an uppercase SS, which is probably wrong. The other way around, if I have an uppercase sharp s and convert it to lowercase, it works, but this is not a symmetrical operation. So this works for all characters that can be lowercased or uppercased, but it doesn't always work correctly.
We have seen the case of the sharp s, and there is one other case that is contained even within ASCII, even within the basic 26 letters of the Latin alphabet. It is something that we don't see as a problem, but we broke it for some other languages. That is the case of the letter I. Look at the difference between the lowercase i and the uppercase I. There is one tiny difference, the dot: the lowercase i has a dot and the uppercase I has none. And there is at least one language, or a family of several languages, that distinguishes between these two letters. It is, for example, Turkish. The i with a dot is pronounced [i] and the dotless ı is pronounced [ɯ]. Now imagine that you have some Turkish text (not you, but your Turkish colleagues) and you want to convert between uppercase and lowercase: it can go wrong. Sometimes it can go so wrong that a word means something different with the dot than without it. So our Turkish colleagues actually have to use a workaround: they have to import ICU, which is the International Components for Unicode library, then set their locale, and then convert it this way. That's a little bit more complicated.
Normalizing is something that you probably don't usually see. Normalizing is the decomposition of characters into their parts. For example, we have the German word süß, which means sweet. And I have two words that look the same. The first one has three characters. The second one is in the normalized NFD form. What does that mean?
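A rough stand-in for the kind of character-listing script used in the talk, applied to both forms of the word, might look like this. The function name `describe` is an assumption; `unicodedata.normalize`, `category`, and `name` are standard library calls.

```python
import unicodedata


def describe(text):
    """Return (character, category, name) for every code point in text."""
    return [
        (ch, unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


composed = unicodedata.normalize("NFC", "süß")     # ü as one code point
decomposed = unicodedata.normalize("NFD", "süß")   # u followed by a combining mark

for label, word in (("NFC", composed), ("NFD", decomposed)):
    print(f"{label}: {word} ({len(word)} code points)")
    for ch, category, name in describe(word):
        print(f"  U+{ord(ch):04X}  {category}  {name}")
```

Both strings render identically, but they compare as different until they are normalized to the same form.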
I again take my tiny script that shows me all the characters within the string. You see that the first word contains three characters and the second one contains four. The difference is in the u with umlaut. In the first case it is the u with umlaut as one character, LATIN SMALL LETTER U WITH DIAERESIS. In the second case there are two characters: the first one is LATIN SMALL LETTER U, and the second one is COMBINING DIAERESIS. You also see that in the second word the column in the line with the combining diaeresis has shifted by one character. That means that the combining diaeresis doesn't have a width; it has zero width. It is just glued to the character before it, and then it looks like a u with umlaut.
There are plenty of combining characters. You can actually make a sharp s with a diaeresis on top, or combine most characters with other combining characters. Here is an example: this is a Stack Overflow answer to the question of how to parse HTML with regular expressions. Of course you can't. But what you see at the end are just characters with plenty and plenty of random combining characters after them. And it looks cool.
Alphabetic sorting. There is the built-in function sorted in Python that takes a list, or a string, which is actually a sequence of characters, and sorts it according to some rules. If you have numbers, great, that's easy. But if you have characters, it sorts them too. What you see in the example is that at the beginning I have the uppercase A, O, U, then the lowercase a, o, u, then the uppercase umlauts, then the sharp s, then the lowercase umlauts, then some Central European letters from Czech and Slovak, and the uppercase sharp s comes at the very end. The order doesn't look very natural. This is because all these characters are converted to Unicode code points, that is, numbers, their positions in the Unicode table.
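The code-point ordering just described, and the locale-based sorting with `locale.strxfrm` that the talk turns to next, can be compared in a short sketch. The letter selection, the helper `sorted_for_locale`, and the locale name `de_DE.UTF-8` are assumptions; whether that locale is installed depends on the system, so the helper returns `None` when it is missing.

```python
import locale

letters = list("aouAOUäöüÄÖÜßẞčšž")

# Default behaviour: characters are compared by Unicode code point.
by_code_point = sorted(letters)


def sorted_for_locale(items, locale_name):
    """Sort with locale.strxfrm; return None if the locale is unavailable.

    setlocale() changes the whole process, so the previous collation
    setting is restored before returning.
    """
    previous = locale.setlocale(locale.LC_COLLATE)   # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(items, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


german = sorted_for_locale(letters, "de_DE.UTF-8")

print("code points:", "".join(by_code_point))
print("German:     ", "".join(german) if german else "locale not available here")
```

Even with the restore step, the locale is switched for the whole process while the sort runs, which is why this pattern is risky in code that serves several users at once.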
And then they are sorted according to those numbers. The uppercase sharp s was added later, so it is at the end. This is not what you would like to see as an alphabetical list, in a phone list or a list of names.
Now we want to sort according to the German language, because sorting may be a little bit different for every language. So we import locale, we say that our locale is German, and we sort these characters using the locale.strxfrm method. It looks better: first I have both a's, then the a with umlaut, and then b. This is how it should look in a German dictionary, a German phone list, or a list of names.
But for a Swedish user, seeing the a with umlaut between a and b is not natural. Swedes expect the umlaut characters at the end of the alphabet, after z. So in the Swedish alphabet it should be at the end. In Hungarian there is the "ch" sound, which is written as cs, and cs doesn't go between cr and ct: cs is a separate letter between c and d. So the word csípős, which means sharp like a chili, comes after all the c words. In Czech and Slovak we also have the "ch" sound, but we write it as c with caron, č. You have already seen the s with caron in my last name. There are plenty of such characters; the Slovak alphabet has 43 letters with all the possible diacritics. Another thing is, for example, the ch, which is the [x] sound and comes alphabetically between h and i. But there are also exceptions: if you have two words glued together, and the first one ends with a c and the second one starts with an h, it is not the letter ch, it is c followed by h, and then it is sorted differently.
In French this is even more interesting. They sort some things from the beginning of the word and other things from the end of the word. Usually, when they sort words, they first sort everything according to the ASCII form.
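The French rule, which the talk finishes describing in the next few sentences, breaks ties between words with the same unaccented form by comparing the accents starting from the end of the word. A rough sketch of that idea follows; the function name `french_accent_key` and the four sample words are assumptions chosen for illustration, and this is a simplification, not a full collation algorithm.

```python
import unicodedata


def french_accent_key(word):
    """Sort key: unaccented letters first, then accents compared from the END."""
    base = []     # the letters without their accents
    marks = []    # marks[i] collects the combining marks attached to base[i]
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if marks:
                marks[-1] += ch
        else:
            base.append(ch.lower())
            marks.append("")
    # Reversing the accent list makes the last letter's accent decide first.
    return "".join(base), marks[::-1]


def forward_accent_key(word):
    """Same key, but with accents compared from the start, for contrast."""
    base, reversed_marks = french_accent_key(word)
    return base, reversed_marks[::-1]


words = ["côté", "coté", "côte", "cote"]
french_order = sorted(words, key=french_accent_key)
forward_order = sorted(words, key=forward_accent_key)

print("accents from the end:  ", french_order)
print("accents from the start:", forward_order)
```

All four words share the base form "cote", so only the position and kind of accent decides the order, and the two directions give different results.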
Then, if there are several words with the same ASCII form, like these four, they look at the end, at the last syllable. You see that the first two words have nothing on the e and the other two words have an accent aigu. So first come the two words where the last syllable has no accent, then the words where the last syllable has an accent, and within each pair they sort according to the penultimate syllable, the one before the last, and so on and so on. So that's French. That's okay. If you have seen their keyboard layout, you understand why they are doing it like that.
The problem is that the locale is tied to the process. If you do something like setlocale in your code, and it is a library or a website with plenty of users and plenty of threads running, it changes the locale of the whole process. This is not what you want, because if you have two users, and one of them clicks to sort a list according to German rules and the other according to Swedish rules, they will just break it for each other.
But we have already seen the ICU library, which lets you use locales as objects, in an object-oriented way. It lets you do something in your corner, in your method, using all these things, while the process as a whole is not changed. Another possibility, which is much more lightweight, is pyuca. You see that it sorts nicely, but it sorts according to more or less English rules, because you don't define which locale, which language, it should use. So it gives you better sorting than the default sorting by Unicode code point at least. It is not optimal for every language, but if you need one general list, you can go with pyuca.
Now, regular expressions. If you have a problem, use regular expressions. Now you have two problems. But anyway, let's say that we have the string München 123 and want to extract something from it.
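The extraction problem just posed can be sketched with the standard `re` module. The sample string is written with an escape so that the ü is a single code point. The last pattern, `[^\W\d_]+`, is a standard-library approximation of "letters only" and is not the talk's approach; the talk uses the third-party `regex` package, shown here only if it happens to be installed.

```python
import re

text = "M\u00fcnchen 123"   # "München 123"

# a-z and A-Z miss the ü, so the word is cut in two.
ascii_only = re.findall(r"[a-zA-Z]+", text)

# \w matches Unicode letters, but also digits and the underscore.
word_chars = re.findall(r"\w+", text)

# Word characters minus digits and underscore: roughly "letters only".
letters_only = re.findall(r"[^\W\d_]+", text)

print(ascii_only, word_chars, letters_only)

try:
    import regex   # third-party package, optional here
except ImportError:
    regex = None

if regex is not None:
    # \p{L} matches any character in a Unicode letter category.
    print(regex.findall(r"\p{L}+", text))
```

With `regex`, `\p{Lu}` and `\p{Ll}` narrow the match to uppercase or lowercase letters, which the standard library pattern cannot express as directly.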
München is the German name for Munich. München 123, and we want to extract the name München. So I import the regular expressions module and ask for all the characters from a to z and A to Z. It sees the capital M, and then "nchen"; it doesn't see the ü, because it doesn't belong to the range a to z. There is backslash w, which finds the ü. So it finds the word München, but it also finds all the digits. I'm not interested in digits. I just want to see München. So how can I extract München with a regular expression? There is the third-party library regex, which works identically to the standard re but has some more functionality. In this case, that's the possibility to ask for backslash p, which matches a character with a given property from the Unicode list. And the capital L, as you remember, is the category for letters. So L is any letter, Lu would be an uppercase letter, and Ll a lowercase letter. This is how you can use regular expressions to find words that contain characters beyond ASCII.
So I came here for Python, but I said I would talk about names. That was the programming part. Now let's have a look at the names. I cannot see you, but you can just raise your hand if your name fits into the first name, last name scheme. Mine fits. But maybe there are some people here who have a middle name, or who have a patronymic and a matronymic surname like in Spanish, such as Rodríguez Zapatero. That is not one last name; there are two last names, two surnames. Maybe there is someone from Hungary or from East Asia who has the last name first and the first name last. Are there any popes or queens or kings here? Maybe somebody who has only one name. Or take the Nordic countries: in Iceland, if someone is called Sigur and their father is Johan, this person is called Sigur Johansen. But Johansen is not the last name. Johansen is the patronymic.
You can call them Sigur, you can call them Mr. Sigur Johansen, but you don't call them Mr. Johansen. And in an alphabetical list they are not under J, like Johansen; they are under S, Sigur Johansen.
There are also different grammatical forms of names. For example, in Czech and Slovak the masculine and the feminine forms of names are different. My name is Šedivý, and all the women in my family are called Šedivá. It means grey-haired, and that is simply the feminine grammatical form of this adjective. With nouns, in Czech and Slovak, if a man is called something like Müller, the women are called Müllerová: an -ová comes at the end. And this is not only done with Czech and Slovak names, it is also done with other names. So if you read a Czech or Slovak newspaper, you will see Angela Merkelová. Merkelová is still okay, because Merkel sounds like a last name that would be acceptable in Czech and Slovak as well. But there are also names from Africa or from Asia that are grammatically not compatible with our language, and they always get the -ová at the end anyway.
And of course, if you have some title, a "von und zu" or some academic title, that is part of your name, it gets even more complicated to decide how to write it in a form. If a form asks for a first name and a last name, where do you write your title, your second name, your patronymic, or the other parts of your name? So what I suggest is to have one field for the full name, where you write the name as it officially is, as in your passport, and then one for how we should call you. In my case the full name is Miroslav Šedivý, and you can call me Miro. And there are some other people who make it really clear how you should call them and how you should write their names.
Yeah, this is what I sometimes see when I write my name in forms here in Germany.
"Please enter characters from the European character set only." What is the European character set? Šedivý is a Central European, Czech or Slovak, name, and it contains only characters from the Central European, so from the European, character set. "Please enter a full, valid name." I'm sorry, I don't understand what you mean. My name is valid. The name of a person cannot be invalid.
So if you program something that has to do with names (I'm not speaking about GDPR, I'm speaking about common sense): don't assume anything. Don't put a random limit on the length of a name. There are short names. There are long names. There are very long names. There are names so long that if someone makes a typo in them on Wikipedia, not even the person would notice. Don't use stop words. If it is a stop word in your language, it is probably a perfectly valid name in another language.
As I said, family members don't necessarily have the same family name. That is the case for names in Czech, Slovak, Polish, and other languages. There are different transcriptions from non-Latin alphabets. If you take the Russian name Chekhov, every European language writes it differently, and it is the same with Chinese. On the other hand, I went to Russia twice with the same passport and got two visas. On each visa there was my whole name in the Latin and in the Cyrillic alphabet, and on those two pages the Cyrillic transcription of my last name was different. So the Russian officials also see my last name and try to transcribe it somehow into the Cyrillic alphabet. It works the other way around, too.
Men change their family names too. So asking for a maiden name in your form is probably not what you want, because there are plenty of men who change their family name when they get married. A one-letter name is probably not an initial. Take Benoit B Mandelbrot, the French guy who did quite a lot of beautiful stuff with fractals.
The B is not an initial, it's just a B. So probably everything that is printable is fine. You have to expect anything in a name. You may have heard about this guy, Christopher Null: "Hello, I'm Mr. Null. My name makes me invisible to computers." If your program has problems with that, I'm sorry for you. Someone ordered a customized license plate for their car and wanted to have NULL on it. The guy thought: hey, this is great, if I get a speeding ticket, they are not going to be able to attribute it to my license plate, because it says NULL. In the end he received all the speeding tickets in the county that could not be attributed to any license plate, because they were all mapped to NULL. So he received way too many. And if your database has a problem with a guy named Robert'); DROP TABLE Students;-- okay, see you in the Q&A.
Then there are streets and cities. They are interesting too, because they are like names. If you're in Germany, do you know what the most common street name in Germany is? Yes, it's Einbahnstraße. No, just joking. Einbahnstraße means one-way street, but many foreigners think that it is the name of the street, and when they park their car they just write down "I am in the Einbahnstraße", and then they need quite a long time to find their car again. What you can see in many U.S. directories of companies from Germany is "HauptstraBe", because their OCR, or some other program, is probably not able to recognize the sharp s and writes it as an uppercase B.
The names of places can be very short, like Å somewhere in Scandinavia or Y somewhere in France. Its inhabitants call themselves Ypsiloniens. So if you live in a place like this and you get some security question like "What is your mother's maiden name?" or "What is your place of birth?", and it says "Oh, you have to enter at least six characters": no, don't do that.
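The advice so far (one field for the full name, one for how to address the person, no length limits, no letters-only rule, and a database that copes with any name) could be sketched as below. Everything here is illustrative: the `Person` class, `clean_name`, and the table layout are assumptions, and "printable" is interpreted loosely as "contains no control characters".

```python
import sqlite3
import unicodedata
from dataclasses import dataclass


@dataclass
class Person:
    full_name: str        # as the person writes it officially
    preferred_name: str   # the answer to "how should we call you?"


def clean_name(raw):
    """Permissive check: normalize, trim, and reject only control characters."""
    name = unicodedata.normalize("NFC", raw).strip()
    if not name:
        raise ValueError("please enter a name")
    if any(unicodedata.category(ch) == "Cc" for ch in name):
        raise ValueError("name contains control characters")
    return name   # no minimum or maximum length, no stop words


def store(connection, person):
    # Placeholders keep the name as data; it never becomes part of the SQL text.
    connection.execute(
        "INSERT INTO people (full_name, preferred_name) VALUES (?, ?)",
        (person.full_name, person.preferred_name),
    )


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE people (full_name TEXT NOT NULL, preferred_name TEXT NOT NULL)"
)

entries = [
    ("Miroslav Šedivý", "Miro"),
    ("Christopher Null", "Christopher"),
    ("Å", "Å"),
    ("Robert'); DROP TABLE Students;--", "Bobby"),
]
for full_name, preferred_name in entries:
    store(connection, Person(clean_name(full_name), clean_name(preferred_name)))

stored = [
    row[0]
    for row in connection.execute("SELECT full_name FROM people ORDER BY rowid")
]
print(stored)
```

The string "Null" is stored as ordinary text, distinct from a missing value, and the last entry is stored verbatim instead of being executed.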
And there are some places like Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch that are a little bit longer, or Chrząszczyżewoszyce, powiat Łękołody. So you really need much more space, and you must not stop after 10, 15, 20, or 30 characters, because places have really long names. And sometimes places don't even need names, because somewhere in Iceland you can just draw a map on the envelope and it will arrive.
There are plenty of things that you have to think about when you are programming something with names, and they can surprise you. There are pages like this one, "Falsehoods Programmers Believe About Names". I invite you to read them, and you will see quite a lot of interesting things that you have not thought about before.
Your name is invalid? No, your name is not invalid. Please, as a developer, respect the names of your users, because their names are not invalid. Don't break the locale, so import ICU if you are in Python. Convert from bytes to strings as early as possible and from strings to bytes as late as possible, or work with strings the whole time. It's cool. UTF-8 is cool, Python 3 is cool. Be cool too and use Python 3 and UTF-8. And if you tell a user "your name is invalid", you will land on the Twitter account "Your Name Is Valid". Actually this is also a limit of Twitter: an account name can have at most 15 characters, so "your name is invalid" wouldn't fit, and so there is the Twitter account "your name is valid". And be nice. Thank you very much.
Miro, thanks for your interesting talk.
Thank you.
It was a pleasure to hear what kinds of problems you can have with names and encodings, and how you can avoid them as a programmer; and you really should avoid them. I have some questions from the audience, and they are very specific, about non-letter characters.
Which characters?
Non-letter characters like the apostrophe and the so-called em dash. Do you think they should be allowed in names? How do you handle that?
At least in passports it looks like they are allowed. If they are part of a person's name, then they are valid. There are numbers in names too; maybe all of that should be allowed. So, as I said, you have to accept almost anything and deal with it.
That was a question from the audience. Thank you very much for your talk; it was interesting for all of us. Sorry for the problems we had with the stream. If you missed something, just go to the recording afterwards: there will be a full recording of this session, including the questions and answers. Thanks again, Miro.
Thank you very much, and enjoy the congress. Bye.