This is a somewhat more technical presentation. [unintelligible]

First, something about me. I have been active as a volunteer in several free and open source projects for several years. I do all of this in my spare time, which is very, very thin, so if you consider that it is spread among different projects, my contributions to each individual project tend to be quite small. In the past, [unintelligible] the Italian spell-check dictionary [unintelligible].

So, the status of the Italian dictionary. History: it has been developed for about 15 years, so it is an old dictionary, and being an old dictionary does mean something. We will see that even dictionaries can have legacy issues. It has had different maintainers, and while it is only text files, it is amazing how many conventions you can mix up and mess up when many maintainers take care of the so-called code, even though it is just words and rules. I have been maintaining the dictionary for approximately the last 10 years. It is built for OpenOffice, but it is used as is, or with minor changes such as packaging it into an OXT file, just to go back to the previous talk, by many free and open source applications. So basically, every time you have an Italian spell checker somewhere, the underlying dictionary files probably come from either this project or a variant of it.

As the maintainer, my burden is mostly support requests and user reports. They were frequent in the early stages, when the dictionary was still being built. The dictionary was incomplete, and you got nagged every day about missing words and so on. Now they are much less frequent, since the product is mature.
So we can consider the dictionary to really be in maintenance mode. It has fulfilled its task, and now there are just minor fixes. But maintenance is always required. Even Italian is changing, of course. As recently as a few months ago, and I don't know how many of you speak Italian, so I will keep this generic, a child in Italy saw a flower and invented a word that does not exist in Italian for a flower with a lot of petals around it. The word did not exist, but the issue was brought to the national authority, so to speak, on the Italian language, and they officially approved this new word. So the next day my inbox was full of people asking me to add this word to the dictionary. And of course, if an eight-year-old child can do that, then everybody else started inventing words and asking for them to be formally approved. So it was a bit of a frenzy, even though the language is supposed to be stable.

The request types we get are these. Type one is very easy: you type a normal word, you find that it is underlined in red, and as the next step you just send me an email asking to add it to the dictionary. This is now very, very rare. Most of the time when this happens, it is actually a type two request. We still have this small flow of requests that cannot be addressed because they are just wrong. People do not take one minute to actually check in a real dictionary whether what they are suggesting should be there or not. And most of the time it should not be there, so I already have canned answers saying: no, this word does not exist. I know that a lot of people use it, but it is not a real word. Please do not ask for it to be included, or fork your own dictionary, but then it will not be Italian.

Type three is the interesting one, since it is what we are discussing now: I get an increasing number of requests to remove wrong words.
People say: I misspelled a word, but it was still recognized as valid. In these cases, the words they believe do not exist actually do exist and are legitimate Italian words. It is just that nobody uses them any longer. So our issue is probably that the dictionary has now grown to the point, and this holds for Italian, but surely we are seeing the same issue for the English dictionary and for many others, of being too complete for its intended use case.

The standard answer you give to these requests in office applications is: please supplement your spell checker with a grammar checker. A grammar checker is able to identify rare words, and it can help in cases like this: you see something that is not wrong, so it cannot be underlined as wrong, but it is underlined in some other way to tell you that if you are using this word, either you are talking about something really strange or there is something wrong in your text. But it is not a perfect solution, because a grammar checker, the last time I checked this usage, was not really able to identify all variants of a word: it identified the base word, but not all of its variants. And it is not compatible with all use cases. In an office suite it makes sense to have a grammar checker, but this dictionary is used everywhere: in small applications, probably also for spell checking on smartphones, and in many small programs that do not implement an interface for grammar checking because they do not need one.

So it is time to trim the dictionary, or at least to think about whether it is right to trim the dictionary.

Tools of the trade. We have the good old simple text editor, since what we are working on is text files. Some statistical analysis tools for medium-sized data; we are not talking big data here. It is hundreds of thousands of words, which is big, but not the kind of data that would properly be classified as big data. And the Hunspell tools and custom scripts.
We have an example here, in case people are not familiar with the format we use. It is simply one huge text file with the stems, meaning the base forms of the words, like this one, parlare, which is "to talk" in Italian, each followed by a combination of letters. What does this mean? It means that this is a verb, to talk, and that here you have the rules to form all derived forms of this word. So basically, if this were English, you would have "talk", and here a rule, defined separately, that says that from "talk" you derive "talking", you derive "talked", you derive everything that is related to talking.

One could follow a completely different approach, which is expanding all of this and creating an even bigger text file. But that will not work with synonyms, for example. For synonyms, it is always important that you can reduce a word back to its stem, since that is what you need for looking up the synonyms. Otherwise, in office applications, the synonyms will break, or will be much, much less efficient.

So what I tried to do is to create a corpus, meaning finding all possible Italian words from somewhere. This is impossible as a task, of course, but you can try to get some approximation. What you can use is news, Wikipedia articles and recent books, because I do not want words that were used two centuries ago. Italian is quite an old language, by language standards, and it has been in almost its current form for eight centuries, more or less. Manual sanitization is necessary, because news will contain person names and other things that you do not want in a dictionary. And no matter how hard I tried, I never managed to get past 50% of the base words, which basically means half of my dictionary is useless in real use cases. And what is even worse is that only about 5% of all possible variants are covered, because here it is very much a matter of copy-pasting.
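As an illustration of the stem-plus-flags format and of the base-word coverage figure mentioned above, here is a minimal sketch. It is not the script used for the talk. The sample entries, the flag letters `AB` and `C`, and the function names `parse_dic` and `stem_coverage` are assumptions made up for the example; the only things taken from the format are that each entry is a stem optionally followed by `/` and flag letters, and that the first line of the file holds an entry count.

```python
def parse_dic(lines):
    """Return {stem: flags} from the lines of a Hunspell-style .dic file."""
    entries = {}
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line:
            continue
        # The first line of a .dic file is an (approximate) entry count.
        if i == 0 and line.isdigit():
            continue
        word = line.split()[0]  # ignore anything after the first whitespace
        stem, _, flags = word.partition("/")
        entries[stem] = flags   # flags is "" when the entry has no rules
    return entries


def stem_coverage(entries, corpus_tokens):
    """Fraction of dictionary stems that occur at least once in the corpus."""
    seen = {token.lower() for token in corpus_tokens}
    hits = sum(1 for stem in entries if stem.lower() in seen)
    return hits / len(entries) if entries else 0.0


# Hypothetical data: three entries, two of which appear in the tiny corpus.
sample_dic = ["3", "parlare/AB", "casa/C", "desueto"]
corpus = "Voglio parlare con te a casa".split()

entries = parse_dic(sample_dic)
coverage = stem_coverage(entries, corpus)
print(entries["parlare"], round(coverage, 2))
```

This only matches corpus tokens that are identical to a stem. A realistic measurement would first have to reduce each inflected form in the corpus back to its stem using the affix rules, which this sketch does not attempt.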
If a verb behaves like another verb, then I just copy-paste all of its variants, but some of them sound really strange in Italian; they are not real Italian words.

So what we did in the end was this. We removed all words that do not appear in the corpus. And, this part is much trickier, so if any of you has some smart ideas, come to me after the talk and we can work out how to do it in an efficient way, the variants that are never used are removed too. Of course this is more complex, because you do not just remove a line but only part of it, and that becomes quite complex.

What about the future of this prototype? I think the full dictionary will still be maintained. The one that will be properly maintained is the full dictionary, and it will be the default one, the one from which the dictionaries for office suites, for Thunderbird and other email clients, and of course for text editors are created. But we might decide to accompany it with a light version. The use cases are all those where a normal dictionary could be too big, uselessly big: when you have very limited resources, as in lightweight or mobile applications. For language learners, we might publish a simple Italian dictionary, and people who are learning Italian would know that the underlined words are words they are not supposed to know, because they are advanced words, and that everything else is within their expected range of skills, for example. Other use cases may be found, and if you have any creative ideas on how to use lightweight dictionaries, I am happy for you to talk to me after my talk.

Well, that's it. Thank you.
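The simpler half of the trimming step described in the talk, dropping whole entries whose stem never appears in the corpus, might look roughly like the sketch below. It is an assumption-laden illustration, not the actual prototype: `trim_dic` and `corpus_stems` are hypothetical names, the flag letters are invented, and the input is assumed to start with the usual entry-count line. Removing individual unused variants would mean editing the flags of an entry, which is the hard part and is not attempted here.

```python
def trim_dic(lines, corpus_stems):
    """Keep only .dic entries whose stem is in corpus_stems.

    Assumes lines[0] is the entry-count header; the header is rewritten
    to match the number of entries that survive.
    """
    body = [line.strip() for line in lines[1:] if line.strip()]
    kept = [
        line
        for line in body
        # The stem is everything before the first "/" of the first field.
        if line.split()[0].partition("/")[0] in corpus_stems
    ]
    return [str(len(kept))] + kept


# Hypothetical data: "desueto" is never seen in the corpus, so it is dropped.
full_dic = ["3", "parlare/AB", "casa/C", "desueto"]
corpus_stems = {"parlare", "casa"}

trimmed = trim_dic(full_dic, corpus_stems)
print("\n".join(trimmed))
```

Entries are kept with their flags untouched, so every variant of a surviving stem is still accepted by the spell checker.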