Hi, my name is Stanislav Horáček and I am from the Czech LibreOffice community. I am also a long-term member of The Document Foundation, and I deal mainly with the localization of LibreOffice into Czech. Today I will speak about spellcheck dictionaries. My talk follows up on one given by Marina Latini at an earlier LibreOffice conference, where she spoke about the Italian community and about spellcheck dictionaries that had not been updated for a long time. The situation in the Czech community is similar: our spellcheck dictionary dates back to 2005-2006. Of course, the language has evolved since then; we now use words for things like electric bikes and smartphones that we could hardly have dreamed of in 2005. There were two main motivations for starting a new dictionary. First, the existing dictionary is released under the GPL, which is limiting for some uses; a new dictionary could instead be released under the CC0 license, which is as permissive as possible. Second, we now have a new source of data: the lexicographical data in Wikidata, which is itself released under CC0, so we could build on it and create an alternative dictionary. In the meantime, the original GPL dictionary was also updated by Miroslav Poštá, a translator, that is, a person who works with the language every day.
When building a dictionary, we also have to decide which word forms to include. Some forms are valid but very rare, and if such a rare form of a valid word looks like a typo in practice, it can make sense to mark it as incorrect anyway. We somehow hit the limitations of Hunspell here, because Hunspell provides only a binary spell check. For the user it would be perfect to have some probability assigned to that check, but that is something we can only hope for in the future. The updated dictionary shipped with LibreOffice 7.2. It is worth noting that LibreOffice uses Hunspell as its spellchecking engine, which is also used by Mozilla and many other applications. In 2019, Masaryk University, in cooperation with Red Hat, released their dictionary, which was based on an analysis of a language corpus and some semi-automatic processing of it. It contains only some selected parts of speech, namely nouns, adjectives and verbs. There are also some missing parts: proper nouns are missing, and the comparison of adjectives is missing. But it is still quite a sufficient database of tens of thousands of lexemes, and we used it as a base for the new experimental dictionary. And the main point: in 2018 Wikidata introduced a new namespace with a database of lexicographical data. The basic unit of that database is the so-called lexeme, which is typically a word, but can also be a phrase or a part of a word, like a suffix or a prefix. We can also store metadata there, such as characteristics of the lexeme, relations to other lexemes and so on. For the Czech language, we currently have more than 2,000 lexemes there. And as we will see in a few moments, there is a user interface available to create and maintain lexemes, which gives normal users access to that database.
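The binary-versus-probabilistic point above can be sketched in a few lines. This is a purely illustrative example with made-up frequency counts, not Hunspell's actual API: a Hunspell-style check returns only yes/no, while a frequency-aware check could give the user a score for rare but valid forms.

```python
# Illustrative only: made-up word frequencies, not real corpus data.
WORDS = {"house": 120000, "houses": 35000, "housen": 0}
TOTAL = 155000

def binary_check(word: str) -> bool:
    """Hunspell-style result: the word is either in the dictionary or not."""
    return WORDS.get(word, 0) > 0

def scored_check(word: str) -> float:
    """A probability-like score based on (made-up) corpus frequency,
    which would let a UI treat rare valid forms differently from typos."""
    return WORDS.get(word, 0) / TOTAL
```

With such a score, a spellchecker could, for example, underline words below some threshold in a different color instead of flatly rejecting them.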
It is also important that it is a Wikimedia Foundation project, which guarantees stability; that is better than any of the homemade solutions we were considering. And although we are talking about spellcheck dictionaries, it is a common database which we can use for any tool dealing with language: also for hyphenation dictionaries and dictionaries of synonyms, both of which we can find in LibreOffice as well. But I can also imagine tools for adding diacritics to text, for translation, or for grammar checking, and so on. What we are doing here is a perfect example of free-software and free-data principles, because we are developing a common database shared by many projects, in one place and by joint forces. How does it look? This is a page with an example lexeme, a Czech adverb. If you are familiar with Wikidata, you know that the IDs of Wikidata items start with Q; the ID of a lexeme starts with L. Here we have information about the language, the part of speech and the sense of the lexeme, and the list of forms follows. In this case we have the three degrees of comparison of a Czech adverb. Another example is "prasátko", which means a small pig. Here we have information about its etymology and then about its language style: it is a diminutive. We can provide many, many types of language style, for instance whether a word is formal or informal, whether it is archaic, or whether it belongs to a dialect. It is a bit of a pity that Hunspell cannot work with this information, because as a user I would love to say: I want to do my spell check using only formal words, or I want to include informal words as well, and so on. We can also connect lexemes with real Wikidata items, I mean the items starting with Q. So here we have the sense connected with an item; this is the link to the Wikidata item, and we can then, for instance, check the Wikipedia page about this term. We can also state connections between lexemes.
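The lexeme structure described above (an L-ID, a language, and a list of forms) can be read programmatically. A minimal sketch, assuming the entity JSON layout served by Wikidata's `Special:EntityData` endpoint, where each form carries per-language `representations`:

```python
import json
from urllib.request import urlopen

def lexeme_forms(entity: dict, lang: str = "cs") -> list[str]:
    """Collect the surface forms of a lexeme entity in the given language."""
    forms = []
    for form in entity.get("forms", []):
        rep = form.get("representations", {}).get(lang)
        if rep:
            forms.append(rep["value"])
    return forms

def fetch_lexeme(lexeme_id: str) -> dict:
    """Download one lexeme entity, e.g. 'L1234' (requires network access)."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{lexeme_id}.json"
    with urlopen(url) as resp:
        return json.load(resp)["entities"][lexeme_id]
```

For a spellcheck dictionary, the output of `lexeme_forms` over all Czech lexemes is essentially the word list we want.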
In this case we say that this lexeme has a synonym, and as I said before, this information can be directly used to create a dictionary of synonyms. We can also add other metadata such as pronunciation, examples of usage and so on, but they are not usable by spellcheck dictionaries, so I will not talk about them. I mentioned the user interface for maintaining lexemes. This is the basic page for creating a new lexeme: I just insert the lemma (the basic form), the language and the lexical category, which is the part of speech, and I am done. Then I need to add all the forms of the lexeme by hand, which is not so convenient, so there are also templates provided. For instance, this template allows us to add all forms of a verb in one step. If we have more complicated cases, like a Czech adjective, which can have more than 100 forms, we can also use this batch loading; this input is probably prepared somehow automatically. Of course, we can also use API access to Wikidata, and specifically to Wikidata lexemes. There is a query language for Wikidata, and we can also use Python packages for accessing the data: this one is a general one, and LexData is a package specialized for lexemes, but there are others too. A bit of statistics. Here we can see the number of lexemes by language. We find here big and traditional languages like Russian, English, German and French, but also smaller ones, which probably have enthusiastic volunteers entering the data. In general, the first places in these charts mean that the data has been imported automatically from some sources. But it is not only the number of lexemes that matters; this chart shows another aspect of the data: how much of the Wikipedia texts is covered by the lexeme database. For instance, in the case of Swedish, only about 10% of the words in Wikipedia are not covered by the lexicographical database. We are talking here about tokens, which means all words, not only unique words, so the frequency of the words is taken into account in this metric.
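The query language mentioned above is SPARQL, served by the public Wikidata Query Service at query.wikidata.org. A small sketch of how one might list lemmas of Czech lexemes; the query uses the `wikibase:lemma` and `dct:language` predicates of the lexeme namespace (Q9056 is the Wikidata item for the Czech language), and the HTTP wrapper here is an assumption of mine, not part of any official client:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WDQS = "https://query.wikidata.org/sparql"

def czech_lemma_query(limit: int = 10) -> str:
    """Build a SPARQL query over the lexeme namespace: lemmas of
    lexemes whose language is Czech (Wikidata item Q9056)."""
    return f"""
    SELECT ?lexeme ?lemma WHERE {{
      ?lexeme dct:language wd:Q9056 ;
              wikibase:lemma ?lemma .
    }} LIMIT {limit}
    """

def run_query(query: str) -> list[dict]:
    """Send the query to the public endpoint (requires network access)."""
    req = Request(WDQS + "?" + urlencode({"query": query, "format": "json"}),
                  headers={"User-Agent": "lexeme-demo/0.1"})
    with urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]
```

Iterating over such results (or, for bulk work, over the lexeme dumps) is how a word list for a dictionary can be extracted.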
This is something I have already said: we decided to create a new Czech spellcheck dictionary licensed under CC0, created as a combination of a morphological dictionary and data from Wikidata, and that dictionary is still considered experimental. It is provided as an extension for LibreOffice, and a version is also prepared for Mozilla applications; we can also check words online at the web page www.českajslovníky.cz. Let's look at how successful the old and new Czech dictionaries are. In this chart we see a characteristic similar to the coverage of Wikipedia words, with the difference that we do not have a Wikipedia corpus here but some small, yet still representative, samples of texts containing a few hundred thousand words. In the chart we can again see the number of missing tokens, i.e. those marked by the spellcheck as incorrect. In the bottom part, in red, we have the traditional GPL dictionary which is shipped with LibreOffice. This is the original, former version; there was also an alternative version, which was not part of LibreOffice but was slightly better. And this one is the latest version, the result of the update done by Miroslav Poštá, and we can see that his changes were really significant and improved the dictionary a lot. Regarding the new experimental CC0 dictionary, this is the initial state, which took only the data from the morphological dictionary with no Wikidata. If we include Wikidata, we see a significant improvement, and over time, as the database on Wikidata grew, the improvement became more and more significant. So this is the current version, but we still see that the experimental dictionary is not at the same quality level as the original one. So the main message of my presentation is that we have a database of words, or lexemes, at Wikidata, and this database is, for any language, licensed under a really permissive license.
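The token-based metric used in these charts is straightforward to compute. A minimal sketch (my own formulation of the metric as described, not the speaker's actual evaluation script): count how many running tokens of a text, not unique words, are present in the dictionary, so that frequent words weigh more.

```python
import re

def token_coverage(text: str, dictionary: set[str]) -> float:
    """Share of running tokens found in the dictionary. Because tokens
    repeat, frequent words contribute more to the score than rare ones."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in dictionary)
    return hits / len(tokens)
```

The "missing tokens" figure in the charts would then be `1 - token_coverage(...)` over the sample texts.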
We can do whatever we want with that data. We can access it through a quite convenient user interface or through a programming interface, and the previous chart showed that we applied that data successfully to improve the performance of a spellchecking dictionary for the Czech language. So if I may advise you: just check the lexemes and try to incorporate them into your spellchecking dictionary, because they are available for free and you lose nothing by testing them; and if you like them, you are welcome to contribute to the lexeme data. In my opinion, Wikidata lexemes are a perfect place to store our dictionary data and to maintain it. So that's all. Thanks for your attention, and let's go to the questions, if there are any.