Bonsoir Montréal-Python! Good evening Montréal-Python, and welcome to Montréal-Python 95: Antiseptic Zebra. As you might guess from the name of the event, this is the last one in our current series. After this we'll start a new series of equally random and whimsical names, and we invite you to follow our events to discover them with us. Before we get to tonight's programme, a few housekeeping notes. You can join us on the Montréal-Python Slack — the URL is here, it's mtlpy.org — and we're in the meeting channel. We have a code of conduct at Montréal-Python that essentially says: be excellent to each other. If you see anything that doesn't seem excellent to you, find me on Slack. I'm Yannick, and I'll do my best to resolve the situation. Tonight's meetup will be shorter than usual, probably because it's summer and we want to keep part of the evening for the patio afterwards. We'll have a presentation by Andrew Cunningham on internationalization and localization — and I managed to say that in two languages without tripping over my words. That will be followed by our virtual 5 à 7 (happy hour). As for future events: next Monday we'll have another programming night. 
If you've never attended one of our programming nights, it's a bit like pair programming: we work on a common problem in a shared document, sometimes a Google Colab and sometimes a Jupyter notebook. Next Monday we'll run two sessions in parallel: one on web programming with Django and the other on core Python concepts. So whether you already know Python really well or you're starting next Monday, it will be a good chance to limber up your fingertips. After that, we take a break for the summer: no events in July and August, but we'll be back strong in September. I invite you to check our calendar, at the URL shown here, to see what we'll be doing this fall. Last but not least, I have to thank our partner FJNR, who do consulting in Django and other web development projects. They're based in Montreal, and they're hosting our virtual happy hour tonight. 
And then there's Edzy — I'm sorry, I can't pronounce your name correctly, but I'll try: Pamoon Kass — who took this magnificent photo of the blue snake. Isn't it beautiful? So that's it for the announcements, and I'll invite Andrew to join us. Hello, Andrew! Andrew is the person who knows the most about internationalization and Unicode of anyone I've ever met. He has worked on a bunch of projects in this space, including with the W3C and on the Unicode specification, and he's going to talk to us about some of these concerns. Andrew, the stage is yours.

Hello everyone. One of the things about writing Python code is that I don't know in advance which languages I'll have to handle, so it's all about being flexible. And what I've found, a lot of the time, is that I usually have to go beyond the core built-in functions in Python. So let me start with something slightly controversial, to frame what we're doing today: Python 3 has no internationalization model; it's an accretion over time. Python's core functions are language-neutral and locale-independent, and that can create unusual consequences one way or another. First, terminology. What do I mean by internationalization? Internationalization is the process of designing a software application or a web service so that it can be adapted to different languages and regions without engineering changes. Internationalization is about the architecture of a service. Now, if I go to the Python documentation, there is an internationalization section; however, it specifically discusses the locale module and gettext. 
That is, it covers the infrastructure needed for localization, rather than all the aspects of internationalization in Python. If you search for internationalization in Python, 99% of what you find will be about localization and localization infrastructure, not internationalization in the broader sense. Localization, by the way, is the process of adapting software or a service for a specific region: adding locale-specific components and translating text. And not just text — localization also involves images, colours, layouts, text expansion in the user interface, and typesetting conventions. Globalization, a third term you'll see, combines the other two: a service is globalized when it is internationalized and has multiple localizations. It's ready for the world. Localization in Python is one aspect of this, and today we're going to look at a couple of aspects of internationalization. One of the things I've been doing in my spare time — not that there's much of it — is helping to document different aspects of internationalization in Python. One thing I've documented is sorting and collation in a linguistically appropriate way. We'll look at a few of the problems, very quickly; it's a long document with a lot of detail, but we'll just get an overview, and then we'll look at a few other problems. Locales are tricky little beasts. In Python we have the locale module for interacting with the system's locales, but there are different types of locales. There are the Unicode CLDR locales — the Common Locale Data Repository, which is where locale data is actively being developed — 
and then there's ISO/IEC 15897, which is more prevalent on BSD, Linux and Unix systems. Python's locale module, and the commands and functions in it, are very much modelled on ISO 15897. The important thing about locales is that they aren't really suitable for cross-platform code. What's available on different platforms varies, both in the type of locale — CLDR versus ISO 15897 — and in terms of which locales are available and which locale properties are supported. For example, on Google Colab I'm limited to four locales: the C locale, US English, something they call POSIX, and a couple of variants. So if I really want to do language work on Google Colab, I need to install all the necessary language support myself, because the locales just aren't there. Python initially uses the C locale; if you want to use anything else, you have to change it explicitly. But the Python documentation has some warnings, and essentially the advice is: leave it at the system default and don't change it, because locale changes are not guaranteed to be thread-safe. In one of your threads you can change the locale, and it immediately changes behaviour across the rest of the application. So basically, you import locale and set the locale to an empty string — that's Python's way of saying set it to your system default. And there are different kinds of locale identifiers. The most common one you'll see is language and region with an encoding identifier and sometimes some modifiers; on some systems, they also add the Unicode script. An example would be ja_JP.UTF-8 — although that's somewhat redundant, because you could get away with just ja.UTF-8, except that POSIX locales require the region. 
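The "set it to the system default" pattern just described might look like this minimal sketch:

```python
import locale

# Setting the locale to the empty string is Python's way of saying
# "use the system default" (taken from LANG/LC_* on POSIX systems).
current = locale.setlocale(locale.LC_ALL, "")
print(current)  # e.g. "en_US.UTF-8", depending on the system
```

As the documentation warns, setlocale is process-global and not thread-safe, so if you do this at all, do it once at startup.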
Other examples: German in Austria using ISO 8859-15 (de_AT.ISO8859-15); Greek in Greece; or Chinese in Hong Kong with a modifier so that collation is based on the pinyin values of the Chinese characters, the ideographs. Alternatively, some systems, like newer versions of Windows, use a BCP 47-based tag with U-extensions or T-extensions, which are managed by Unicode. A simple one would be Australian English, or simplified Chinese in the People's Republic of China. Then it gets fun, because with the U-extension you can have, say, Thai using the Buddhist calendar and Thai numerals as the numbering system. This one is a lot more complicated: English, but reordering Greek characters before Latin characters, and then digits after Latin but before everything else. The next one is Hindi using native digits. And this one is Canadian French, but ignoring case when sorting. So you can see that the Unicode locale identifiers are actually quite powerful: you can tailor a lot just from the language tag itself. But this only really works with PyICU, which is a Python wrapper around ICU4C; with it you can tailor lots of different aspects just from the locale tag. Now, quickly, collation. Collation includes sorting, indexing, searching and matching, but I'll limit myself to sorting for the moment because, of those collation functions, that's the only one supported in core Python. For indexing, searching and matching you need other libraries integrated into Python. The default sort in Python, using either sort or sorted, gives you a locale-neutral, language-independent sort based on code points, not on Unicode collation (ISO 14651). The key thing to know about Python's operations is that some are based on Unicode and some are not — and some are sort of based on Unicode, but not quite. 
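Since the slides aren't visible in the recording, here are illustrative BCP 47 tags of the kind being described. These exact strings are reconstructions from the verbal descriptions, not necessarily what was on the slides:

```python
# Illustrative BCP 47 locale tags, some with Unicode u-extensions.
tags = [
    "en-AU",                        # Australian English
    "zh-Hans-CN",                   # Simplified Chinese, People's Republic of China
    "th-TH-u-ca-buddhist-nu-thai",  # Thai: Buddhist calendar, Thai digits
    "hi-IN-u-nu-native",            # Hindi using native digits
    "fr-CA-u-ks-level2",            # Canadian French, ignoring case when sorting
]
for tag in tags:
    print(tag)
```

The u-extension keys (ca for calendar, nu for numbering system, ks for collation strength) are defined in Unicode's LDML specification.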
So, I have a string of characters — not quite random, I've deliberately chosen a few — and a couple of them, like a-grave, appear both as precomposed characters and as decomposed sequences: precomposed here, decomposed over here. The interesting thing is that the uppercase characters sort before their base lowercase characters. (I'm sorry for the interruption — my daughter came in with a question; it's time for them to go to school.) The other thing you'll notice is that à sorts in two different places: once earlier in the list, after A, and once later in the list. So one of the key things here is that uppercase and lowercase characters are separated — they don't sort together — and that decomposed and precomposed data, NFD versus NFC, will give you different sorts. If all your data is decomposed (NFD), it will sort differently than if it were all precomposed. You can, to some extent, normalise this. For instance, in this particular piece of code I use the unicodedata module: I case-fold the strings and normalise them to NFD, and that gives me a different sort from the default — I'm actually getting the decomposed and precomposed characters to sort together. You can also do locale-specific sorting: import your locale module, set your locale, and then sort based on the locale using the key parameter. You'll notice that it gives you better results than the raw sort. And you can tailor the locale sort further. In this case, I'm taking the locale sort keys and applying a normalisation called NFKC_Casefold. That's not a normalisation form supported by the built-in unicodedata, so I have to create my own function: I normalise everything to NFKC, case-fold it, then normalise it to NFC, and then do the comparisons for the sort. That gives me somewhat different, more tailored results than a standard locale sort. 
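A sketch of the case-fold-plus-NFD idea just described (the exact strings from the slide aren't recoverable, so the sample words here are my own):

```python
import unicodedata

# Sample data mixing case, and precomposed ("\u00e0") vs decomposed ("a\u0300") forms.
words = ["Zèbre", "apple", "\u00e0 la carte", "a\u0300 la carte", "Apple"]

def sort_key(s: str) -> str:
    # Case-fold, then normalise to NFD, so canonically equivalent strings
    # (precomposed and decomposed) produce identical keys and sort together.
    return unicodedata.normalize("NFD", s.casefold())

print(sorted(words, key=sort_key))
```

Unlike the raw code point sort, "apple" and "Apple" now land next to each other, as do the two spellings of "à la carte".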
One thing I'd say before we go any further: if you're working on a Mac, or you need cross-platform code that will work on a Mac or on BSD-derived systems, you've got problems. The collation tables for most locales on these platforms are symbolically linked to a single table — i.e., they will not give you a language-tailored or locale-tailored sort. So there's no way to sort by locale on the Mac, or on FreeBSD, NetBSD, and so on. And for me, with my code needing to run on Macs as well as other platforms, that means the locale module, and using locale for sorting, just won't work. So, alternative solutions. There's a module called pyuca which implements the Default Unicode Collation Element Table — basically a language-neutral approach. It currently ships with support for Unicode version 10, but it allows you to download the appropriate collation key files from the Unicode site and use those instead, so you can actually use more recent versions of Unicode. The option I tend to use is PyICU, which is an extension wrapping ICU4C. If you've got ICU4C on your system — and it's easy to install on a Mac using Homebrew — then that provides a way of creating cross-platform code that is locale-sensitive. In the code, I import Locale and Collator from icu. I define my locale, initiate a collator using that locale, and then pass that collator's getSortKey to the key parameter of sorted. That gives me a platform-independent sort: it will be consistent across all platforms that have ICU installed. It's also my preferred way of handling a lot of string operations. Next, a useful but also problematic module: unicodedata. This one is built into Python; it comes with any Python installation. But the version of Unicode it supports depends on the version of Python you're running — unicodedata cannot be updated separately. 
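The PyICU approach described above can be sketched as follows. It assumes PyICU and ICU4C are installed; the fallback branch is my own addition, so the snippet degrades to the default code point sort where ICU is absent:

```python
words = ["côte", "coté", "Côte", "apple", "Zèbre"]

try:
    from icu import Collator, Locale  # PyICU, the Python wrapper around ICU4C

    # Build a collator for a locale and hand its sort-key function to sorted().
    collator = Collator.createInstance(Locale("fr_CA"))
    result = sorted(words, key=collator.getSortKey)
except ImportError:
    # Without PyICU, fall back to the locale-neutral code point sort.
    result = sorted(words)

print(result)
```

With ICU available, the same locale gives the same ordering on every platform, which is exactly what the locale module cannot promise on macOS or the BSDs.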
So, it's tied to what existed when that Python version was built. If you need the latest version of Unicode, you'd normally have to upgrade to the latest version of Python to get that support — though there are ways around that. As for its uses: obviously, it exposes a lot of information about each character, so it's useful for testing against a subset of the Unicode Character Database properties. I use it for Unicode normalisation, and for converting native digits to the Western Arabic digits we normally use. As an example: I import unicodedata as ud and pass it a string. I can use the normalize function in unicodedata to normalise the string to NFD, and then the is_normalized function to test whether it actually is NFD — in this case, true. I can pass it Lao digits and then use ud.decimal to convert each digit in the string. One important thing to remember about Python is that most numbers are treated as strings unless they fit very specific criteria: the number has to be made of Western Arabic digits, it mustn't have a thousands (grouping) separator, and it must use a period as the decimal separator. Anything else won't be treated as a number but as a string. There are drop-in replacements for unicodedata, and they're updated within a month or two of Unicode releasing a new version. unicodedata2 is a direct drop-in replacement; unicodedataplus and tangled-up-in-unicode have everything unicodedata has, plus some additional Unicode Character Database properties that unicodedata doesn't support. String operations: most are Unicode-based, but not all. Here's an example with a Dinka Wikipedia article title, which I title-case. You'll notice that it doesn't title-case only at the start of the string or of a word: it will also uppercase within a word, owing to the peculiarities of how Python defines a word. 
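The normalize / is_normalized / decimal workflow mentioned above can be sketched like this (the sample strings are mine; is_normalized needs Python 3.8+):

```python
import unicodedata as ud

s = "k\u00f3"                       # precomposed "kó"
nfd = ud.normalize("NFD", s)        # decompose to base letter + combining acute
print(ud.is_normalized("NFD", nfd)) # True

# Converting Lao digits to Western Arabic digits with ud.decimal().
lao = "\u0ed1\u0ed2\u0ed3"          # Lao digits one, two, three
western = "".join(str(ud.decimal(ch)) for ch in lao)
print(western)                      # 123
```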
Within this word you have combining diacritics, because some of these combinations don't exist as single Unicode code points — they have to be made up of two, an open O plus a combining diaeresis. Now, combining characters — marks of any kind — are not treated as alphabetic characters by Python, so any character after a non-alphabetic character gets title-cased by this function. Ways around it: roll your own, or use PyICU. For instance, if I take the word "français" and test whether it's an alphabetic sequence, it is. But if I use the decomposed NFD equivalent, it's false — it's not considered an alphabetic string, because of the presence of the combining diacritic. But there are workarounds. I tend to use the regex module rather than re, because it has much, much better Unicode support: it supports a lot of the Unicode properties and categories. Two of the things missing from the Alphabetic property are certain types of marks that are word-forming characters. So basically, my pattern matches everything with the Alphabetic property, plus those types of marks, and I've also added the interpunct for Catalan — yes, I've had to work with Catalan in the past. This tests whether a single character or a group of characters is alphabetic, based on this expanded definition of alphabetic. So there are workarounds for some of the limitations. Likewise, when working with digits I tend to use ICU; I use that kind of code for converting to native numbers in some of the code I write. Another headache is plotting. One of the most commonly used libraries is Matplotlib, and there's a whole ecosystem built on that core — pandas' default plotting, seaborn, wordcloud, and lots of other software all use Matplotlib at their core. The problem with Matplotlib is that it doesn't support complex-script font rendering, and it does not support bidirectional text. 
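Before moving on to plotting: the expanded "alphabetic" test described a moment ago uses the regex module's Unicode properties; the stdlib approximation below is my own sketch of the same idea:

```python
import unicodedata

def is_alpha_expanded(s: str) -> bool:
    """Alphabetic test widened to accept word-forming combining marks
    (categories Mn/Mc) and the Catalan interpunct. A stdlib approximation
    of the regex-module property pattern described in the talk."""
    return bool(s) and all(
        ch.isalpha()
        or unicodedata.category(ch) in ("Mn", "Mc")
        or ch == "\u00b7"  # interpunct, word-forming in Catalan ("col·lecció")
        for ch in s
    )

decomposed = "franc\u0327ais"         # "français" with a combining cedilla
print(decomposed.isalpha())           # False: the combining mark breaks isalpha()
print(is_alpha_expanded(decomposed))  # True under the expanded definition
```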
So if you're doing anything in Hebrew, Arabic, Syriac, or all sorts of other languages, nasty things happen. This is just a standard Matplotlib bar chart with Arabic labels. One of the problems is that all the text is non-joining: it's using the isolated forms, not the initial, medial and final forms of the characters. Arabic characters can have up to four versions of a glyph, and they need to be contextually rendered. The other problem with this string is that bidi isn't supported, so the whole string is reversed, back to front. The joining is broken and the ordering is broken. The solution is to use mplcairo, which is also part of the overall Matplotlib project. It's an alternative backend for Matplotlib, and it uses FriBidi and Raqm: FriBidi for Unicode bidirectional algorithm support, and Raqm for complex font rendering. So if I use that instead as my backend for Matplotlib — say with seaborn — I can do things like this. This is a Sorani Kurdish bar graph. The code I've used here can generate two versions, a mirrored version and a non-mirrored version. The text is correctly formatted, it runs in the right direction, and the proper initial, medial, final and isolated forms of the glyphs are used. The other thing the code is doing is swapping the Western Arabic digits for Eastern Arabic digits. So that gives you a quick overview of just some of the internationalisation issues in Python, some of the workarounds, and some of the alternatives. If you want more detailed information on any of these, go to that GitHub repository — that's where I'm slowly depositing all my notebooks and notes. And that's probably it; we're going over time. If you have any questions, let me know. Back to you.

Thank you, Andrew. Everyone, if you have questions for Andrew, you can ask them either in the YouTube comments or on Slack, and I will be happy to relay them out loud to Andrew. 
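Switching Matplotlib to the mplcairo backend is essentially a one-line configuration change; here is a hedged sketch (the try/except guard is mine, for systems without the packages installed):

```python
try:
    import matplotlib
    # The backend must be selected before pyplot is imported; mplcairo then
    # renders text via FriBidi (bidi reordering) and Raqm (complex-script shaping).
    matplotlib.use("module://mplcairo.base")
    import matplotlib.pyplot as plt
    backend = matplotlib.get_backend()
except ImportError:
    backend = None  # matplotlib and/or mplcairo not installed
print(backend)
```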
And I will start with one of my own questions — I have many, because it's a fascinating topic, but I'll limit myself to one for now. How does PyICU, for, say, sorting, compare with offloading these problems to the database? I know I can ask MySQL or Postgres to sort using a specific locale. Are they doing the right thing, or should I be worried about what they're going to give me, and do it in Python instead?

Well, they are most likely using ICU behind the scenes as well. So if you use exactly the locale that they're using, you'll get the same, identical results. Postgres, MySQL, MariaDB — they would all be using ICU. The difference is that within the databases, you only have a fixed set of locales. In Python, you're interacting with ICU at a fairly low level, because the functions in ICU are very low-level and PyICU is a direct wrapper around those functions. So you sometimes find you have to write fairly extensive code to make use of some of the functions, because there aren't any user-friendly wrappers to facilitate all that. But that said, it gives you a lot of power. As I was showing with some of the BCP 47 locale-based identifiers, the locale identifier alone can control how ICU sorts and lets you tailor it in very, very sophisticated ways — and that won't be available directly in the databases.

I see. So some of these very powerful sorting examples you gave us must be consulting a fairly complex lookup table. Does that significantly impact performance? Can we still assume that sort is N log N? Well, it really comes down to what you need to achieve, and the purpose. 
That said, it will be slower than your native sort, because the native sort has been optimized in lots of different ways. But your native sort isn't going to give you the sort you need for French, or for English, or for lots of other cases — it's not going to give you a Unicode-tailored sort at all. The most basic example is that all your uppercase characters sort first: uppercase Z sorts before lowercase a, so "Zebra" with a capital Z will sort before "apple" with a lowercase a.

So we should remember those peculiarities of the sort algorithm in Python. I think the examples you showed us apply perfectly well to French, where characters with an accent should sort with the unaccented character. Yes — in Unicode collation, accents are a secondary-level distinction, and casing, uppercase versus lowercase, is tertiary; whereas with Python's native sort, they end up all over the place, depending on what the code point is in Unicode.

Excellent. That's a great motivator for us to heavily leverage PyICU. I don't see any questions, so I'm going to re-emphasise to everyone: don't be shy, ask your questions, and I'll be happy to relay them to Andrew. I'm going to ask one last one of my own. A problem I often run into is that I cannot tell whether a Unicode glyph is going to take one or more characters when I try to render it, and that makes it hard to predict the physical layout of the text. Is there an easy way to predict how wide a character sequence is going to end up?

There are probably a few different approaches. Probably the easiest would be to split your string into a list based on graphemes rather than characters. If you typecast a string into a list, it breaks it up into individual characters — but you can use the regex module with findall to split the string into graphemes. 
A grapheme is the base character plus any combining characters that go with it to form one visual glyph. That would be one way: you can then just count how many characters are in each string in the list. That's one approach; there are others as well.

I like that approach because it seems very flexible — much better than knowing specifically which font you're using and then measuring.

That's another issue, because with the font you're using there are really two questions. How many characters the grapheme takes up is one; the second is how many glyphs the font uses to display that cluster. The ideal scenario is that a font uses a single glyph for a cluster, but that doesn't always happen. You might have an emoji made up of five or six code points. Some fonts will support that combination — they'll properly render the skin-tone modifiers and colour variations and various other things and give you a single glyph. Other fonts will only partially support it, so those five or six characters are rendered as three glyphs. So you've got both components: is this a grapheme cluster, and how many characters are in the grapheme cluster? And then, what on earth does my font do with it? They're very different questions, and on the font side there's probably no easy way of doing it within Python — unless you start looking at what the original string is and what the projected rendering is, which would get quite complex.

Got it. I thought there was an easy solution for it, but it looks like there are some edge cases I should still be aware of. Very good. Andrew, I think that is it for us; I don't see any more questions. I'm very thankful for your time, and I know it's getting late for you in a different way than for us. I forgot to mention that Andrew is talking to us from the future — it's already Tuesday there in Australia. But as usual we have our... 
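The grapheme-splitting approach described here uses the regex module's \X pattern, as in regex.findall(r"\X", s). The stdlib fallback below is my own rough approximation — it only attaches combining marks, and does not handle ZWJ emoji sequences or other full grapheme-cluster rules:

```python
import unicodedata

def graphemes(s: str) -> list[str]:
    """Split s into grapheme-like clusters: a base character plus any
    combining marks that follow it. An approximation of regex's r"\\X"."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the combining mark to its base
        else:
            clusters.append(ch)
    return clusters

print(graphemes("e\u0301a"))       # two clusters: decomposed "é", then "a"
print(len(graphemes("e\u0301a")))  # 2, even though the string has 3 characters
```

Counting clusters rather than characters gives a better (though still font-dependent) estimate of how many visible positions a string occupies.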
So we have our happy hour, which might be a little too early for Andrew. I'm going to put the address here — it's also in the description of this video on YouTube: pymtlmeet.fjnr.ca/mp95. I don't know, Andrew, if you'll be able to join us, or if it's a good time for re-caffeination for you, but we're going to meet over there. All right, thank you everyone, and that's it for us. I'll see many of you next Monday at the programming night, and for the rest of you, I wish you a great summer and we'll see you this September. Bye!