Welcome to the next afternoon session. Our next speaker is Miroslav Šedivý. He will talk about "Your Name Is Invalid": names of people cannot be invalid, and your app shouldn't say they are. I guess he will mention many things about these wrong names. So, okay, Miro, please start.

Thank you. Hi, let me introduce myself. My name is Miroslav.

We still have some problems with the transmission. Wait, wait, wait. We couldn't prepare it beforehand. I will have to switch something here. No, that's still not it. Ah, no. Okay.

My name is Miroslav Šedivý, and sometimes I still see somewhere that my name is invalid. If you want to pronounce it correctly, this is the IPA for how to pronounce it, and this is how to type it on a standard keyboard layout with the Compose key. I work at solute GmbH here in Karlsruhe, and if you would like to see the region with the best German wine and the best French beer, come to see us in Karlsruhe. We do a lot of great stuff with Python and other open source technology. Originally, I was born in Czechoslovakia, studied in France, and now I'm here in Germany.

So today, first about strings and names in Python: how strings and bytes work, encoding, normalizing, case folding, sorting, and regular expressions. And then about names on the web, in apps, and in databases: how to work with names correctly and how to test whether a name is valid or not.

So in Python 3, there are two types of objects that work with strings. The first is the standard string, which for each character knows more than one million possible characters. That is Unicode, and it is something that lives in working memory. You can work with it, but you don't send it over the network or store it in a file. And then there are bytes, with 256 possible values each, which are used to store data in a file or send it over the network. You have to convert between the two somehow.
For example, if your name is Chuck Norris and you just encode it, you get a series of bytes that looks almost the same, and the other way around, from bytes back to a string, you convert it using the decode method. You can recognize bytes because they have a small b before the quote, and strings have nothing. If you encode without arguments, Python uses UTF-8 as the default encoding, and in this case it will be 12 characters as a string and 12 bytes as bytes, so there is no difference. But if, for example, you have a German name, Müller, and you encode it, then from six characters you get seven bytes, because the letter ü, u with umlaut, is stored as two bytes. If you have a Chinese name (this is not a name, this is nǐ hǎo, a simple hello in Chinese), the two Chinese characters will be stored in UTF-8 as six bytes.

And there are other encodings. UTF-8 is relatively the newest one, although it is over two decades old. There is ASCII, which has been around for many decades. If your name is Chuck Norris, of course you can be encoded in any encoding, and actually you are the one who decides about the encoding. But in this case, Chuck Norris with 12 characters will be encoded into 12 bytes using ASCII. If you have the name Müller, you cannot do it with ASCII. You can do it, for example, with Latin-1, which knows more characters and even knows this ü, and in this case the six characters will be encoded as six bytes using Latin-1. My last name, Šedivý, comes from Czechoslovakia, which doesn't use Latin-1. It uses the Latin-2 encoding, and our six characters will be encoded as six bytes.

But a byte is always something that has only 256 possibilities, so the bytes don't carry the information about which encoding they are in. In this case we knew that we were going to use Latin-1 or Latin-2, but normally you don't know which encoding it is.
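As a rough sketch of the conversions just described, the snippet below encodes a few sample strings and compares character counts with byte counts. The exact spellings of the sample names and all variable names are assumptions made for illustration; only `str.encode` and `bytes.decode` come from the talk.

```python
# Sample strings standing in for the names on the slides (assumed spellings).
chuck = "Chuck Norris"
mueller = "M\u00fcller"      # "Müller", with ü as a single code point
ni_hao = "\u4f60\u597d"      # "你好"
sedivy = "\u0160ediv\u00fd"  # "Šedivý"

# str -> bytes: encode() without arguments uses UTF-8 in Python 3.
chuck_utf8 = chuck.encode()
mueller_utf8 = mueller.encode("utf-8")
ni_hao_utf8 = ni_hao.encode("utf-8")

print(len(chuck), len(chuck_utf8))      # 12 12
print(len(mueller), len(mueller_utf8))  # 6 7
print(len(ni_hao), len(ni_hao_utf8))    # 2 6

# Single-byte encodings: one byte per character, but only for the
# characters that the encoding actually contains.
mueller_latin1 = mueller.encode("latin-1")
sedivy_latin2 = sedivy.encode("iso-8859-2")
print(len(mueller_latin1), len(sedivy_latin2))  # 6 6

# bytes -> str: decode() only round-trips if you name the right encoding.
assert mueller_utf8.decode("utf-8") == mueller

# The bytes themselves do not say which encoding they are in: reading
# Latin-2 bytes as Latin-1 succeeds, but yields a different string.
misread = sedivy_latin2.decode("latin-1")
print(misread == sedivy)  # False
```

The last two lines are the practical consequence of "you don't know which encoding it is": decoding with the wrong single-byte encoding raises no error, it just quietly produces the wrong text.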
And sometimes something like this can happen: I get mail here in Germany that spells my last name wrongly. How is it possible that instead of the Š with caron I get a question mark? Do they use Python? If I take my last name and encode it with Latin-2, I get the correct six bytes, and there is no question mark. With Latin-1, I get a UnicodeEncodeError, because the Š cannot be encoded in Latin-1. Among those 256 possible bytes of Latin-1, the letter Š is not there.

But in Python, you can define what to do in the case of an error. The default is to raise an exception, but it is possible to say: replace it with something. And the default replacement is a question mark. So if you encode a string that contains characters that are outside of the encoding, for example Š in the case of Latin-1, and you say that in the case of an error it should be replaced, it will automatically be replaced with a question mark. And this is probably how this happened.

You can do even more. You can define your own function if you don't want a question mark the whole time. You register an error handler in codecs and then say: if I use the errors value "replace randomly", call this lambda function, which returns a random digit, and this x.start + 1 means: then go on with the next character. You can define more behaviors for such cases. So in this case I can say that I want to encode my last name with Latin-1, and if a character cannot be encoded, put some random digit there. And this is probably how they are doing it: random characters. Sometimes there is a space. Sometimes there is no space. Sometimes there is something like this. Sometimes there is an inverted question mark.
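The error handling described above might look roughly like the sketch below. The spelling of the sample name, the handler name `replace_randomly`, and the variable names are assumptions; the talk shows a lambda, while a named function is used here for readability. `codecs.register_error` and the `errors=` argument are standard library features.

```python
import codecs
import random

last_name = "\u0160ediv\u00fd"  # "Šedivý" (assumed spelling of the sample name)

# Latin-2 contains Š, so this works and gives six bytes.
latin2_bytes = last_name.encode("iso-8859-2")

# Latin-1 does not contain Š, so the default error mode raises.
try:
    last_name.encode("latin-1")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.object[exc.start:exc.end])

# errors="replace" substitutes a question mark for each unencodable character.
replaced = last_name.encode("latin-1", errors="replace")
print(replaced)  # b'?ediv\xfd'


def replace_randomly(exc):
    """Hypothetical handler: put a random digit where a character cannot be encoded."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # Return the replacement text and the position at which to resume.
    return (str(random.randint(0, 9)), exc.start + 1)


codecs.register_error("replace_randomly", replace_randomly)
randomized = last_name.encode("latin-1", errors="replace_randomly")
print(randomized)  # a random digit followed by b'ediv\xfd'
```

A handler like this is obviously a joke, but it shows why mangled names vary so much between systems: each one made its own choice about what to substitute instead of fixing the encoding.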
Literally all three pictures come from the same company, which has different means of communication, like a letter, a customer card, or an online form, although I entered my name correctly in the web form when I signed up for their service.

There are some other cases. Here is some no-name German airline that has huge airplanes but has a problem with its form. They tell me: hey, you can only enter letters in the adult's last name field. What are letters? My last name consists only of letters. What is a letter?

In Python 3, you can do something like this: you can give a variable a name that is not pure ASCII, that consists of other characters. But you cannot give a variable a name starting with a number. But Š is not a number. It's a character. It's a letter. How does Python know that Š is a letter, but a smiley is not a letter? A question mark is not a letter. There is the standard module unicodedata, which has the functions category and name. You give it any character and it tells you which category it belongs to and what its name is. So in this case I gave it a few numbers, a smiley, a dot, and so on. There you see all these characters, and then you see two letters: Ll, Lu, Zs, Po, So, and so on. And then there is the official name of the character. For these two-letter codes, when they start with an L, it means a letter. Ll is a lowercase letter, Lu is an uppercase letter. And so it is defined: every character in Unicode belongs to some category, and it is a letter or not. So you can use the capital letter sharp S as a variable name in Python.

These are all the categories. We are going to see a little bit later how we can use them. Also, if you open the character map application, you will see all the information about the character. And you can actually use the name to write the character, with backslash, capital N, and curly braces.

Case folding.
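Before the case-folding part, the category lookup just described can be sketched as follows. The list of sample characters and the helper `is_letter` are assumptions made for illustration; `unicodedata.category`, `unicodedata.name`, and the `\N{...}` escape are standard Python.

```python
import unicodedata

# A few sample characters: letters, a digit, a space, punctuation, a smiley.
samples = ["\u0161", "\u0160", "\u1e9e", "1", " ", ".", "?", "\u263a"]

for ch in samples:
    # category() returns a two-letter code, name() the official Unicode name.
    print(repr(ch), unicodedata.category(ch), unicodedata.name(ch))


def is_letter(ch):
    """Hypothetical helper: every category code starting with 'L' is a letter."""
    return unicodedata.category(ch).startswith("L")


# A character can be written by its official name with \N{...}.
caron_s = "\N{LATIN SMALL LETTER S WITH CARON}"

# Identifiers may contain non-ASCII letters, but may not start with a digit.
print("\u0160ediv\u00fd".isidentifier())  # True
print("1abc".isidentifier())              # False
```

Checking the category is a far more defensible definition of "letter" for a form field than an ASCII range, although, as the rest of the talk argues, real names need more than letters anyway.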
Case folding means the conversion between uppercase and lowercase. We have some letters in lowercase, we get uppercase, and vice versa. It has a few exceptions. For example, the German lowercase sharp s converts to SS as uppercase, although there is an official uppercase sharp S character. The other way around it converts correctly. And so it works for most other characters between uppercase and lowercase. Of course, you also have, for example, Chinese characters, which have no case.

But with uppercase and lowercase there is one problem even in basic ASCII. There is one ASCII pair, lowercase and uppercase, that we in the Western European language group get wrong. Which one is it? It's I. You see, if I take lowercase i and convert it to uppercase, I get this. The other way around as well: from the uppercase I, I get the lowercase i. You see the difference? The lowercase i has a dot; the uppercase I doesn't have a dot. How do you explain that to someone from Turkey? Because they have the i with a dot, lowercase and uppercase, that is i and İ, and the i without a dot, lowercase and uppercase, that is ı and I. And when they convert between these two pairs of characters, they always get the other one. So if you want to work with Turkish text, you have to import the International Components for Unicode library and write somewhat more complicated code that converts between the Turkish pairs of i correctly.

Normalization is something different. If you already have your Unicode text and you think, okay, that's great, now I can work with it, I can access the characters, the length, and everything: you will probably get two strings that look the same and are not the same, because depending on how they are normalized, some characters will be split into two characters. The first one has three characters. The second one looks the same, with the dots, the diaeresis, but in this second version the diaeresis comes behind the u. So the first one is a standard u with the value 117.
And the second one, which then comes after it, is this combining diaeresis, which is of the category Mn, a combining character. And also, this is not an error, but you see that the whole line is a little bit offset to the left. This means that this combining diaeresis doesn't have a width. You can convert between those two versions, between the normal forms, and they will not be the same. So it means that if you have a lot of texts, try to normalize them to one form and only then compare them.

This combining diaeresis is one of plenty of combining characters. You can do some fun stuff: you can put more combining characters after each other, and then you get something like this. This is actually the Stack Overflow answer to the question of how to parse HTML with regular expressions. You see at the end there are some characters with plenty, plenty of combining characters.

Sorting? No, that's not so easy. If you just sort using the standard Python sort, you will not get alphabetical order, because you see that there is uppercase, then lowercase, then you have some other letters, like the Latin-1 letters, then there is something like this Š, and then the sharp S comes at the end because it has a really high ordinal value.

You can use locale. You set the locale to German, for example, and then you sort these characters, and we get the umlaut between a and d. But if your locale is, for example, Swedish, then it will be different, because in Swedish all the umlauts are at the end. And there are some other exceptions. For example, in Hungarian, cs is an extra letter between c and d. In Slovak, we have plenty of characters with carons that we put just behind the base letter. There is ch, which is between h and i. And in French, for example, they have even more rules.
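Three of the pitfalls covered up to this point (built-in case conversion, the two normal forms, and the default sort order) can be reproduced with the standard library alone. This is only a sketch: the sample word "für", the word list, and the helper `same_text` are assumptions, not taken from the slides.

```python
import unicodedata

# Case conversion as built into str (it is not locale-aware).
sharp_s_upper = "\u00df".upper()          # lowercase sharp s becomes "SS"
capital_sharp_s_lower = "\u1e9e".lower()  # capital sharp S becomes lowercase sharp s
# The Turkish dotless i uppercases to a plain I, which lowercases to a
# dotted i, so the round trip silently changes the letter.
round_trip = "\u0131".upper().lower()

# Two spellings of the same visible text (hypothetical sample word "für").
composed = unicodedata.normalize("NFC", "f\u00fcr")    # ü as one code point
decomposed = unicodedata.normalize("NFD", "f\u00fcr")  # u + combining diaeresis


def same_text(a, b):
    """Hypothetical helper: compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The default sort compares code points: uppercase ASCII first, then
# lowercase ASCII, then everything outside ASCII.
code_point_order = sorted(["apple", "Zoo", "\u00c4pfel"])

print(sharp_s_upper, capital_sharp_s_lower, round_trip)
print(len(composed), len(decomposed), composed == decomposed)
print(same_text(composed, decomposed))
print(code_point_order)
```

Normalizing once at the boundary (when text enters the program) and comparing only normalized strings is usually simpler than normalizing at every comparison.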
It means that they sort everything according to ASCII first, but if there are accents, they sort them from the end of the word. So first they look at the last syllable, then the penultimate, and so on. Those are the rules.

The problem with locale is that if you set it in your library, the locale of the whole process is changed. This means that if you have some multilingual program that has to handle the input of several users at the same time, such as a website, and you switch the locale the whole time, there will be chaos. That's why there is ICU, which we have already used for the Turkish case conversion, and which you can use to sort something without changing the locale of the system.

I have read some recommendations to use pyuca, an implementation of the Unicode Collation Algorithm. The difference between these two is that with pyuca you don't define which locale or which language it uses. It has a great database included, and it is announced that it works very well for characters outside of English. Yes, you can use it if you only need some English sorted list and you want to have those other characters included at a not-so-bad position. But don't use it if you have pure German, Swedish, or Czech text, because it doesn't use the specific sorting of the specific language. For that case I prefer to use ICU.

And at the end, Unicode and regular expressions. Let's have a look at this text: there is a word, a space, and 123, and I want to extract the word. If I do it like this, all characters between a and z, uppercase and lowercase, it will ignore the umlaut and return two pieces, like that. It's wrong. If I use backslash w, that also includes numeric characters, so it will also return the 123, which I don't want to have.
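The two failing patterns just described are easy to reproduce with the standard `re` module. The sample text below is an assumption (the slide's exact text is not recoverable from the recording), and the third pattern is a standard-library approximation added here for illustration; it is not something shown in the talk.

```python
import re

# Hypothetical sample text: a word with an umlaut, a space, and digits.
text = "M\u00fcller 123"  # "Müller 123"

# ASCII-only character class: the match breaks at the umlaut.
ascii_only = re.findall(r"[a-zA-Z]+", text)
print(ascii_only)  # ['M', 'ller']

# \w is Unicode-aware for str patterns, but it also matches digits.
word_chars = re.findall(r"\w+", text)
print(word_chars)  # ['Müller', '123']

# Approximation with the standard library only: word characters that are
# neither digits nor underscores. This is close to "letters", but not the
# same thing as matching a Unicode category.
letters_only = re.findall(r"[^\W\d_]+", text)
print(letters_only)  # ['Müller']
```

The double-negative class is only an approximation, because `\w` also covers some numeric characters that `\d` does not. Matching on exact Unicode categories needs category-aware syntax, which the standard `re` module does not offer.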
In that case, we have to use regex. It is a third-party library that works 100% like the standard re and only has a few extra features. In this case, with backslash p you can match any Unicode characters that belong to some category, and you have seen at the beginning that the capital L means that it is a letter. You can also specify something like Lu for uppercase letters, Ll for lowercase letters, and so on. You have seen that at the beginning in the list of categories. And in that case we get the word.

"I came here for Python, but I stay for the names." So this was the Python part; now let's have a look at the names.

At a normal, offline conference I would just ask: raise your hand, who can write their first and last name like this? I don't know how many of you can; you can write it later in the discussion. Probably, okay, I can tell my first name. But maybe some of you have something like a middle name, or a patronymic or matronymic name, as in Portuguese or Swedish, sorry, Spanish. There is last name first, then first name, in many Asian languages or in Hungarian. There is a name with a number, if we have some popes or queens or kings here. Or there is a simple name, and then something like the name of the father, saying that you are a son or a daughter of that father, as for example in Icelandic. If your name is Sigur Johansson, your father was Johan. But you have to work with these names differently, because Sigur Johansson doesn't mean that he is Mr. Johansson. There is no Mr. Johansson. Sigur Johansson is Sigur, or Mr. Sigur Johansson, and in the phone book you will find him under S, like Sigur Johansson, and not under J, because they don't sort their names according to this.

Of course, there are plenty of particles like von, and so on.
That is in different languages, and many times you just don't know whether to put it together with the last name or the first name or somewhere else. Of course, there are doctors: in some languages, the doctor title became part of your last name.

So how to do it correctly? Actually, if you ask someone for a name, just ask for the full name, let them write the full name, and then you can ask what you should call them. Take, for example, Mr. T. He says: oh, my first name is Mr, my middle name is the dot, or period, and my last name is T.

And then, even if you can write it correctly, you will sometimes get an error message like "Please enter characters from the European character set". This happened in Germany. My last name should be fine; it comes from not far away from Germany. Sorry, what is the European character set? And the same service, a few pages later: "Please enter a full valid name." They didn't want to accept my name. And my name is valid. The names of people cannot be invalid.

So if you work with names in some forms, don't assume anything. Don't put a random limit on their length, minimum or maximum. There is no minimum. There is no maximum. For example, Pippi Longstocking: her full name is like this. Or Karl-Theodor zu Guttenberg: this was a German minister some years ago. His name even appeared on Wikipedia with a typo, you can even see that.

Don't use stop words. If it's a stop word or a rude word in your language, it can be a perfect name in another language. All family members don't necessarily have the same name. For example, all the male members of my family are Šedivý; all the female members of my family are Šedivá.

There are different transcriptions from non-Latin alphabets. For example, if you have the Russian name Chekhov, every European language writes it a little bit differently. And the other way around.
For example, I was in Russia twice with the same passport, and I got two visas, and both times my last name was spelled differently. Or in Chinese, the same.

You use something like the maiden name, but, you know, men change their family names too. A one-letter name is probably not an initial, [unclear] is not an abbreviation. So actually you should accept almost all printable characters. "Hello, my name is Mr. Null. My name makes me invisible to computers": in that article, I don't understand how someone can program something that doesn't accept a last name. Or the same with little Bobby Tables.

Yeah, the names of the streets and cities and so on. The most widespread street name in Germany is of course Einbahnstraße, so one-way street; you see it almost everywhere. The German Straße, like street, is written with the sharp s, and you will find a lot of American lists of German companies with a capital B instead of the sharp s. Or the names of places: somewhere in Scandinavia there is the place Å, or Y in the Somme in France. The main question is "What is your mother's maiden name?", at least six characters. Or they ask you for your place of birth, at least six characters. No, it doesn't work like that. Or on the other hand you have places like [unclear], which get a little bit longer. Or sometimes you don't even have an address, because you just describe very well where you want it to go.

So if you are working with the names of people, and I now ignore GDPR and everything: programmatically respect their names, and don't break the locale, so use ICU. Use the hamburger principle: convert from bytes to strings as soon as possible and convert back to bytes as late as possible. If you tell your user, because of your spitefulness, ignorance, laziness, or something, that their name is invalid, this is not cool. There is a Twitter account, "Your name is valid". It's not "is invalid".
It is "your name is valid" because there is a limit on the characters on Twitter. That is maybe where your case can appear if you do something like that. That's all. Thank you very much.

Thank you very much. Are there any questions? I see there is a first question. Unfortunately, I can't read it. It's probably the wrong encoding. So while you can type in questions, I do actually have a question. At the moment, I'm always using UTF-8 for anything, everything actually. Is there a reason not to use UTF-8, and when and why?

If you control both ends of the communication, or you have some power to do that, use UTF-8. Of course, there are cases where you cannot use it, if you are programming some special printer or something that doesn't support it. But if you control both ends of communication, use UTF-8.

Okay. Any more questions? You can click this Q&A button down on your screen if you are in Zoom and ask your questions there.

Am I in Zoom?

Yes, there is a question coming in. No, don't worry, I read the questions for you. Johnny Zhang is asking: some old Chinese software is still using Big5. So what are your thoughts about that?

Rewrite it, if you can.

Rewrite it, okay. Why that? What's the major drawback of Big5?

I never used that. But sometimes it would be nice to have one encoding that works everywhere, because you also have combined documents. In my case, I have also used some Chinese characters within European text, and Cyrillic, and so on. So if you want to have multilingual documents, you need something that works for everything.

Okay, so are there more questions? You can always ask Miro later in Discord. Just press Command and K in Discord, and then you see a window and you can enter "invalid" there. And then you will see the channel for the talk "Your Name Is Invalid". And you already got big applause there: "Really great talk, poignant and entertaining." While other people are maybe asking more questions... Last chance. If not, please go.
Oh, even more applause in the channel. So if not, please, Miro, go to the channel for "Your Name Is Invalid", and maybe there are more questions there. And in the meantime, we maybe have time for another advertisement. Thank you. Thank you again, Miro. Oh, that's applause, a second wave.