So, I'd like to present Denis Barbier, who will be talking about GNU libc and locales. Thank you.

Thanks for coming. I will first talk about POSIX locales as they were defined in glibc 2.0, then talk about a new format introduced in an ISO technical report, and explain in more detail collation, which is very important and very difficult to understand at first. So, here are some links to the POSIX locale specification. First you have the specifications themselves, which are very interesting, and there is a link to the technical report. You should also have a look at Peter's pages about GNU libc locales; they have many useful links. And this presentation, with all the sample files I present, is on a page on people.debian.org.

The POSIX locale specifications contain definitions for six categories. The first one is LC_COLLATE, which defines the collation order, which means how strings are ordered according to cultural conventions. The second one is LC_CTYPE: which characters are digits, letters and so on; it also contains the mappings between lowercase and uppercase characters. The next one is LC_MESSAGES, which is used by gettext to output messages, and also for interactive answers, like yes/no questions. Next, LC_MONETARY is about formatting monetary amounts, LC_NUMERIC is for non-monetary amounts, and finally LC_TIME is for date and time formats.

A locale data file is a text file which can be edited by hand; it is very easy to read, as you will see. It can contain one or several locale categories. Each category can either have a copy keyword, which means that definitions will be copied from another locale, or it can have a set of keyword/value pairs. You can have comments, and a continuation character so that long lines can be split. The specifications say that characters should be encoded with symbolic names within angle brackets. They also say that characters may be used literally or with a numerical representation, but this is a very bad idea because you then have an encoding problem.
One should always use symbolic names. Strings are used when there are several characters, so you have to use double quotes. Everything is pretty simple, as you can see. And when a keyword can contain several values, it is always a semicolon-separated list, so this is very easy.

A locale file looks like this. First, at the very top of the file, you can define an escape character and a comment character. By default the escape character is a backslash, and in all locales glibc developers prefer to have a slash. And by default the comment character is a hash sign, and again they always use a percent sign. So I took an excerpt from the Finnish locale, and there is a comment. Categories begin with the name of the category and finish with "END" and the name of the category, and between these markers you have keywords and definitions. So here you have, for instance, an example for the abbreviated day names. You see symbolic names, which is very readable, because these are ASCII letters. But in order to map these symbolic names to characters, one has to use so-called repertoire maps. So there was a repertoiremap line to say in which file symbolic names were defined. And the format of that file is at the bottom; it is very simple: you have the symbolic name, then the Unicode value, and then a comment.

This format is very well supported by applications, but it has many weaknesses. The first one is that it cannot be extended: you can have either a copy keyword or a set of keywords, so you cannot extend locales, which is very, very annoying. And repertoire maps became obsolete with the Unicode repertoire, so it does not make sense anymore to have them. They could also cause problems if you used different repertoire maps which defined the same symbols, so it was a pretty bad idea. And there was no transliteration, and some people wanted to have new locale categories.
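To make the shape of this POSIX-era format concrete, here is a sketch of such a file. The symbolic names and values are illustrative, not copied from a real locale or repertoire map:

```
comment_char %
escape_char /
% An LC_TIME category in the old POSIX style, using symbolic
% names; long lines are split with the continuation character.
LC_TIME
abday   "<s><u><n>";"<m><o><n>";"<t><u><e>";"<w><e><d>";/
        "<t><h><u>";"<f><r><i>";"<s><a><t>"
END LC_TIME

% Repertoire map format: symbolic name, Unicode value, comment.
% <A>   <U0041>   LATIN CAPITAL LETTER A
```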
So, in order to improve POSIX locales, a working group was created to try to produce a new standard. They drafted ISO 14652, which is called "Specification method for cultural conventions". But in fact this draft has some controversial issues, and so several people did not want to have this draft become an ISO standard, and the group was disbanded in 2004. But it doesn't mean that everything is over: a new group will continue to improve this locale format, so this is not a dead end. And at the same time, it was released as an ISO technical report, in order to reach a larger audience.

One main difference with this new format, as implemented in glibc 2.1, is that repertoire maps are no longer used, only Unicode code points. As you can see, this is much less pretty to work with; it's really annoying, but it is much more robust. There were also some interesting additions, like the use of ellipses: for instance, you can define the digits with an ellipsis, it's very simple. And the addition of transliteration and of new categories. They added six categories; I will present only two, LC_IDENTIFICATION and LC_PAPER, because I do not really see how to use the other ones, and I'm not sure they are so useful. LC_IDENTIFICATION tells to which document the category definitions are conforming. And LC_PAPER is about paper formats.

So this new standard was designed to really improve POSIX locales. The problem is that it is only theoretical: there was no implementation. So several people complained that some definitions don't make sense and did not want to implement them. It was sent to GNU libc developers around 1997, and they implemented most of this draft; the technical report has some very minor changes. But GNU libc developers told us that they didn't want to blindly follow this standard because of these issues.
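As an illustration of the two syntactic changes just mentioned, the Unicode code points and the ellipses, a class definition in the new style might look like this (a sketch, not copied from a real locale):

```
LC_CTYPE
% Unicode code points instead of symbolic names; the ellipsis
% covers the whole range from <U0030> ('0') to <U0039> ('9').
digit   <U0030>..<U0039>
END LC_CTYPE
```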
So in fact, sometimes the only way to really know what glibc expects is to read the source code. It's also one of the reasons why I wanted to give this talk, because it's not always very easy. So if I come back to a locale data file, we now have the exact same beginning: you define an escape character and a comment character. It's important to tell glibc developers which character sets are supported.

UTF-8, yes? I'd just like to point out that this standard was superseded in the 90s by SFS 4900, which mandates Latin 9 as the new encoding and at the same time recommends a transition to UTF-8.

UTF-8 is implicit, you don't have to mention it. But for the other encodings you should tell glibc developers so that they can support them; otherwise they do not know.

In LC_IDENTIFICATION, an important keyword is the category keyword. As I said before, it should tell to which specification this category conforms. The problem is that there are no standardized values. So I was told that the following values can be used: "posix:1993" to refer to the POSIX specification, or "i18n" values to refer either to the format implemented in glibc 2.1 or to the technical report. The problem is that this field is wrong in all GNU locales. For instance, in the Finnish locale, at the beginning you have some keywords which are explained in the technical report, so I will not repeat them. As you can see, the category value does not conform: it should in fact be an "i18n" value, but almost all locales have this wrong value.

LC_CTYPE contains definitions for what characters are: letters, digits and so on. Those are the predefined classes, and also mappings. This one should be automatically generated from the Unicode tables, so there is no need to modify them. But for certain locales it may make sense to define some new classes or new mappings.
POSIX and the ISO technical report have different keywords for both: charclass in POSIX versus class in the ISO technical report, and charconv versus map. And glibc supports both. LC_CTYPE also contains definitions for transliteration.

A simple example: the i18n file is copied by many locales, so it contains the basic definitions. The predefined classes are defined with their name and the list of characters, so it's very simple. And you can see here how comments and trailing slashes are used to split long lines, and toupper, a mapping between lowercase and uppercase letters. Another example is Japanese, because they have to define new classes. In ja_JP they use the POSIX syntax: they first declare the classes with charclass, and after that, for each class, they give the list of characters, and similarly for mappings. They could use the new syntax, which in my opinion is more compact and more readable, and could look like this.

Transliteration is inserted between the translit_start and translit_end keywords. It's pretty easy: you can include predefined transliterations, and glibc ships many of them. There are two fields here: the first one refers to a file and the second one refers to a repertoire map; that one is now obsolete, so it is always empty. To define a transliteration, for instance like in Danish, you put first the character which needs to be transliterated, and then a list of strings which can represent this character. And glibc will display the first string whose characters are all available in the current encoding.

LC_NUMERIC is pretty simple; I will only discuss the grouping keyword, which is used to group digits. In most European countries we have grouping by groups of three, but it depends, and not all locales have this setting. The grouping keyword contains a list of groups. The first value is for the digits which are immediately to the left of the decimal separator, and then the other values are for the following groups, from right to left.
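A transliteration rule of the kind just described for Danish might look like this. This is a sketch: &lt;U00F8&gt; is the letter ø, transliterated to "oe" with a fallback to a plain o:

```
translit_start
% Try "oe" first; fall back to plain "o" if 'e' is not
% available in the current encoding.
<U00F8> "<U006F><U0065>";<U006F>
translit_end
```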
There is one special value, -1, which tells that there is no more grouping; and if this value is not the last value of the field, the last value is repeated for all the remaining groups. The only use I know of for this grouping is a glibc extension: there is a %' printf flag, so for instance you can write printf("%'f", x) and you will then see the grouping. Finnish was not very interesting here, so I chose Tamil, in India. Its LC_NUMERIC defines the character (or string) used for the decimal separator and for the thousands separator, and the grouping keyword. Here they have two values, 3 and 2, so this means that the first group before the decimal separator has three digits and all the others have two.

LC_MONETARY has many keywords, but I will focus only on three of them, which are pretty interesting. You have x_cs_precedes, where "cs" means the currency symbol, and x means either p or n, to distinguish between positive and negative amounts. You can have a list of values to tell whether the currency symbol precedes or follows the amount, whether they need to be separated by a space, where to write the sign, and so on. You can produce any format you want with that. I put here the whole view of these keywords: you have currency_symbol, which is the euro in Finland; int_curr_symbol is the international currency code in three letters followed by a space. mon_grouping is similar to numeric grouping. And the three fields I was referring to have some strange values in Finnish, because they say that the currency symbol does not precede the amount, so it comes after; they say that the sign comes before the amount; and that there is a space between the sign and the currency symbol if they are adjacent. The problem is that with these settings they can never be adjacent, so there is something wrong in this locale. I do not know why, because I'm not Finnish, but it could be fixed. Here I use PHP to show examples because it was the simplest; you can use whatever you want.
With these settings, %n means that you want the currency symbol and %i the international currency symbol. They are different, as you can see: in one case there is a euro sign and in the other case it is EUR in letters. Yen, for instance, has other settings, and you can see that the output is very different. You can define whatever you want to fit your local usage. It should also be noted that all these keywords can have an int_ prefix, to mean that the value applies to the %i format. This was introduced because Finnish officials told the working group that in Finnish the euro symbol should come after the amount, but when written in letters it should come before. This is why it was introduced, but it has not been implemented this way by the Finnish locale writers, so I don't know what to do. You can see that for the same locale you can have a different output with %n and %i. Can I see the time? Yes.

It is worth noting that the formatting of international currency symbols was broken a few years ago and then fixed. That might be the explanation why the Finnish locale is incorrect now: we fixed the implementation to actually behave the way it was specified, and a lot of locales had tried to get the output right by tweaking the values until they looked right, instead of actually using the specification to write them correctly. Okay, thank you.

LC_TIME is maybe the most controversial, because, for instance, in the POSIX specification, when the days of the week are translated, the first day is Sunday, whereas the technical report says that the translation is for the first day of the week according to the location, which is quite different; and so they introduced several other keywords to tell which day is the first day of the week. The problem is that GNU libc developers disagree with this approach.
I believe, I'm not sure, but I believe that they disagree, so we still have the POSIX behavior; the new keywords are understood but not really supported. And for instance in French, we have abday for the abbreviated names of the days of the week, mon for the months, and so on. More interesting, you can define date and time formats. The last five fields are the formats: d_t_fmt is used to display the date together with the time, as printed by date +%c, so there is an example below; and d_fmt is used when the date is alone, and this one looks pretty good in French. You also have formats to display the time of day; in France we do not use the AM/PM notation, so the t_fmt_ampm field is empty. And the last field, date_fmt, was requested by the coreutils maintainer so that translators have a chance to produce a localized output for the date command when it has no arguments. The problem is that when it was added, the POSIX value was copied into every locale, and in French it looks really wrong; this is also true for almost any locale, so this could be fixed. There are also other notations you can use here, but I don't have time to go into that.
The next section is about LC_MESSAGES. It is used by gettext, but also to answer simple yes/no questions. You have two fields, the yesexpr and noexpr expressions, which are extended regular expressions. And we can notice that almost all locales have a trailing dot-star wildcard, which is pretty useless because these are regular expressions and there is nothing after it. LC_PAPER defines the height and width of the paper mostly used in the locale; units are in millimeters.

So if you read the specifications and this slide, you are able to modify things if they are broken for your locale. This can be easily done, and you do not have to mess up your system settings: you can export the LOCPATH environment variable and glibc will then search for locale data in this directory. So you copy your locale into a working directory, you edit it, you run localedef to compile the locale data, and you can then play with this data to see how to fix things.

I will now talk about collation, which is in my opinion very important, because it looks pretty cryptic at first, but in fact it is very simple; anyone can have a look, and if your language has some strange settings they can be easily changed. The reference for this is ISO 14651, written by Alain LaBonté, about international string ordering; it describes the algorithm which will be explained here. Unicode also has a working group, and they have an algorithm which is pretty similar to the one described by Alain LaBonté, with small variations; mostly they use different notations, but it is very similar. And there is also a nice presentation by Mr.
Chilton about the Tibetan script. It's nice because the Tibetan script is pretty nice to look at, even though I can't read it, and it is very simple.

The key point of both the Unicode and Alain LaBonté algorithms is to replace sequences of characters by other values, so that the usual strcmp can give the expected results. I will explain now what this means. For instance, if you only consider ASCII letters, you assign 1 to letter a, 2 to letter b, and so on up to 26 for z. Then, to compare "car" and "cat", you can see that with these numerical values you can compare these strings. This is pretty simple, but of course we use many other characters. For instance, the first problem is how to sort lowercase and uppercase characters. In the POSIX locale you have first all uppercase characters and then all lowercase characters, but in en_US, for instance, words are sorted like in a dictionary, which means that there is no distinction between uppercase and lowercase characters. So this needs to be refined. The first solution one could propose, to interleave uppercase and lowercase characters, does not work; the idea instead is to add other levels. So we keep the same level-1 codes, which means that lowercase and uppercase A have the same level-1 weight. These values are called collation weights. For the case distinction, en_US locales
assign a weight of 1 to lowercase characters and of 2 to uppercase characters, and when this is done you concatenate all the level-1 weights and then, afterwards, all the level-2 weights. So here, if you want to know how "cars" and "cats", written with mixed uppercase and lowercase characters, are sorted, glibc will perform this: it will put all the weights together in one array per string, and run strcmp on these two arrays to know how the strings compare to each other.

But in fact this example is broken; it does not work like this. The reason is that if you try, for instance, to compare "tar" and "tara", there is a problem, because you end up mixing level-1 and level-2 weights. So something needs to be done, and there are mainly two solutions. You can either decide that all weights for level 1 are greater than any weight of level 2 (if I remember correctly this is what Unicode does, but I'm not sure), or you can have a separator value which is lower than any other weight, and this is what is done by glibc. Obviously zero is not a good value, because strcmp stops on it, and so 1 has been chosen. So all the other weights have been shifted by one, and now the lowercase level-2 weight is 2 and the uppercase one is 3. If I consider the same example, the addition of this level separator ensures that everything works as expected: now "tar" is sorted before "tara".

I will give other examples, and here is a sample program to display collation weights; it is available on the website if you want to download it, and I will use it in the presentation. So if I consider this sample collation definition: in LC_COLLATE you have to define symbols with collating-symbol, and after that you assign weights to your symbols. They are assigned sequentially from 2, so they are 2, 3, 4, and so on. But in fact, as always, there is an internal optimization: localedef tries to lower weights without changing the relative order. Here it sees that the uppercase and lowercase symbols are only used in level 2 and do not interact with the others, so automatically it
chooses the weights which were shown before. So you can compile this locale with localedef; you have to add the -c flag because the locale is not complete, and localedef would complain if you tried to compile it without, but with the -c flag it works fine. You compile it, and you can see that the results are exactly what we had previously, so it works exactly as expected. So: symbols must be defined before being used, which means that a collating-symbol keyword must be present; weights are automatically assigned from 2; and there is always an optimization.

And now, if we consider another example, we can try to deal with diacritics. It is always the same idea: you define symbols for your accents, you combine accents and symbols, and it will work as expected. So here, in this example, I chose to assign the weight 2 to characters which have no diacritic, 3 to the circumflex accent and 4 to the acute accent. And if I run this sample with the French words "cote", "coté", "côte" and "côté" (it's a very well-known example), you can see that they are sorted this way by this locale. The problem is that in French we have a very strange convention, which is that when strings are quasi-homographs, which means that if you do not consider diacritics they have the same letters, then accents are compared from right to left. And this is easily performed in glibc: you take the order_start keyword, whose second field here reads forward;forward, and you replace it by forward;backward, which means that on level 2 only you want backward ordering. And now we have the right collation weights, which means that this locale can sort these words according to the French rules.

So, how many levels do we want? At least three: base characters, case, and diacritics. But we need another one, because we need to make distinctions between all the symbols which are not letters, like punctuation marks and so on. So in fact almost all
locales use four levels, but this is not a rule: you can define as many levels as you want. So here is a typical example with four levels. You first define your symbols with collating-symbol, then you declare their relative order, and localedef will assign values to them automatically; and after that, for each character you want to represent, you tell which weights should be assigned. So it's very, very simple.

And it is even more powerful than that, because in some languages you can combine characters. For instance, in German the ß ligature is sorted like "ss", and on the other hand, in traditional Spanish, "ch" is considered a letter between c and d. So here is an example with the German ß ligature: you only have to say that, for the ß character, you assign the weights of the characters you want it to represent. As you can see, the two first rows have the same level-1 weights, so there is no difference at level 1; only the other levels differ. It is similar for the Æ ligature: in Norwegian Bokmål the Æ ligature is a letter, so some locales can have other definitions. That was a one-to-many mapping; in the other case, the many-to-one mapping, one could believe that it would be quite natural to have this kind of definition, but it does not work this way: you have to define a collating-element, use it with the given symbols, and after that assign weights to it, as you can see. And by combining both you can have many-to-many mappings. The problem is that localedef has some bugs, and sometimes you will discover that it does not work exactly as you want.

ISO 14651 had a similar aim: they wanted to ease writing locales, so they propose a scheme which you can see here: you copy a table and you only modify the characters you want, so you can reorder weights to change relative values, and it's very simple. So my conclusion: modifying locales is very easy when you have some tutorial, and I would like everybody to help in fixing locale bugs. The problem is that with this notation it cannot comfortably be written by hand, so we need some tools to work with
these locale files. These tools do not really exist. I have some tools myself, but in my opinion they are not usable by everyone; this is a problem. And eventually, if you have problems, please ask on the libc-locales mailing list; you will be very welcome. I also have a final slide for developers who want to play with localedef, and there are some open issues I wanted to investigate, but it's not very important; maybe your questions will be more interesting. Just one word about Estonian: there is a problem because they sort z between s and t, so when you write a regular expression like [a-z] it will give a very strange result; so many programs are broken and something needs to be done. It's pretty strange. So, if you have questions?

Yes, more comments than a question, actually. Denis said that there are no tools to write locales and that they are lacking. I would insist on it: they are desperately lacking. During the d-i work on debian-installer, I had to write locales without knowing how on earth this thing has to be written, and it is a nightmare, a complete nightmare. So if there are coders who want to find a very funky project, just try to think about writing a simple tool for people who want to write locales. Writing a locale needs a deep knowledge of languages, but it should not need such a deep knowledge of the Unicode tables. Do I need to know the whole Unicode table by heart? That's crazy, isn't it? For this tool, it would be helpful to have a specification of the file format, because there is no such specification; you have to piece together the simple parts, about LC_TIME and all these things, which are very boring to write down. And I think that Jordi came here also because he was very lazy, and for writing the Catalan in France locale he just merged together the Catalan Spanish one and the French one, because this is hard to do. I did the Andorra one, and for Andorra you had to sit down and write
Unicode by hand.

You are a hacker!

What do you mean? Yeah.

So, about the Estonian example: one question that comes to mind is that right after woody was released, a bug was filed against glibc complaining that the encoding should have been ISO-8859-15. Was that ever fixed in the end?

Sorry, I don't understand the question.

For Estonian, the default encoding, let's call it the legacy encoding, should be -15, and it has been -1 for a long time.

Yes. In the first version the Estonian locale had a Latin-1 encoding, but it is wrong: it should always have been Latin 9. But glibc maintainers refused to modify it, because there is a strong policy about locale names: they never change the encoding of a locale. This is why we had to switch to a new name, because you do not want to change the encoding of a given locale. So, as they had it wrong the first time, they didn't want to modify it even in this case.

Do you agree with that policy?
I have no other choice. In some way I agree with it, because it's all about expectations when you use a locale. There are many more differences between -1 and -15 than just the euro symbol; there are other characters, not that many, but still more than enough, and this different behavior in such an important locale is enough to give it a new name. But getting back to the question of Estonian: it has never really been -1; in fact it should have been -4, which is what had been used in the Baltic countries until recently, and by the time Estonia submitted their data for the locale it was -15 already. At that point it was mentioned very clearly, both by the submitters and by the national linguistic institute; it was never in the standard until that one was decided. So it was a funny incident. I'm not directly maintaining it, so I can do nothing, but I just remember that on the Debian side the question was asked: could this be changed?

No. Maybe Peter? The approach used for changing the character set of a locale is to give it a new name: you use the same file, with the new character set added to the name, and this will be the new name of the locale. So that will have to be done in Debian as well, to avoid being in conflict with the settings on other distributions.

Another question? Yes. I was looking at the Indian locales; for most of the major languages they use UTF-8, but there are a number of other legacy encodings. Is this a problem? You mean in the locale data file?
No; all you need to do is to generate your locale with a different encoding. It is the same: you have only one locale data file, because it uses Unicode, so it can be compiled for any encoding.

Because I'm thinking specifically of ISCII, the Indian Standard Code for Information Interchange.

Yes, it's like an ISO encoding, so it doesn't matter. I believe that it is supported, but I'm not sure; it may be, but I didn't check.

Any other question? Just one quick comment that might be of interest to Finns, or to people interested in the Finnish language. I have the graduation paper of a student at the Kobe University who recently tried creating a locale for the Karelian language. Karelian is basically the missing link between Estonian and Finnish; it is hardly spoken nowadays anymore, but there are still a few people in Eastern Finland, along the Russian border, speaking it. A very interesting paper; I had brought it along. It basically describes all the steps he had to take to create the locale, all the problems, researching what the problem was, and the joy of finding out that there is no ISO code for it.

If you have any other questions, you can talk to me later on, or please ask on the libc-locales mailing list.

One small comment to the programmers: there was the comment about Turkish and similar languages, and how case conversion does not really work one-to-one. This is a common problem for quite a few programs: they expect that if you have one character and you convert it to lowercase or uppercase in some locale, you will still have one character. This is not true: you might get two or three, you have no idea, so they have to be stored in an array. Thank you.