Hello, hello, welcome to my session on Ninai and Udiron, tools to generate Abstract Wikipedia-like text using Wikidata items and lexemes. My name is Mahir Morshed; I go by user Mahir256 on wiki. I am a graduate student in electrical and computer engineering at the University of Illinois, and I am an administrator on Wikidata and the Bengali Wikisource. In the last couple of years, since the Abstract Wikipedia and Wikifunctions projects were announced, I have been interested in getting lexicographical data ready for text generation purposes. Some of you may have seen me at previous events discussing matters which have some relation to that. Well, today I want to discuss the sorts of tasks and planning that I have found necessary to get your language to the point where text can be generated for it, starting with work on Wikidata items and lexemes, then moving on to abstract constructors and concrete renderers, and finally putting these all together. Before we do so, however, it may be worth briefly overviewing Wikidata items and lexemes. So what are Wikidata items? In short, Wikidata items are essentially structured, multilingual, linked, machine-readable pages about different concepts. All sorts of things, from people and places to objects and ideas alike, can have Wikidata items. Information about these concepts is stored in the form of individual statements, which basically represent individual facts that we can separately add qualifications and citations to. The items themselves also have human-readable names, brief descriptions, and possible alternate names that can be translated into your language, and many of them also link out to various other Wikimedia projects. It's possible to get information out of all of these items using the Wikidata Query Service, and there are any number of outside projects that use Wikidata's data every day.
No matter what language you speak, it's important that Wikidata items be modeled consistently and well so that we can get information out of them properly. And what are lexemes? Whereas items represent different language-independent concepts, lexemes represent words and other linguistic units used in a single language, from nouns and verbs to entire phrases and everything in between. Each lexeme includes not only information about the language it is part of and its grammatical category, whether it's an adjective, an adverb, or something else; it also has individual forms listed for the word, which change depending on how and where the word is used, and individual senses representing the different meanings the lexeme might signify when used. Lexemes, forms, and senses all have statements just like items, so that information about words, phrases, and their parts can be similarly qualified and sourced. The senses on a lexeme in particular, and the statements on those senses which link out to items and other senses, are critical for ensuring that Abstract Wikipedia can discuss different concepts in appropriate ways. This brings us now to Abstract Wikipedia. In general, its aim is to make information accessible in more languages by essentially putting article information into an abstract, language-independent, machine-readable form, sort of like a Wikidata item with the potential to represent more complex ideas, and then having software turn this abstract information into text in any given language. Both the information and the text generation software are still meant to be provided and maintained by a community, with the software possibly hosted in the Wikifunctions project. Ultimately, the idea is to ensure that anyone in any language can work on the abstract article information, while those interested in text generation in their own language can separately work on aspects related to it.
Now, there are two types of objects that are important for the rest of this presentation, besides items and lexemes, of course: constructors and renderers. So what is a constructor? Well, a constructor represents an abstract unit of meaning. This unit can be an entire sentence with multiple parts to be filled in, a type of phrase with a single slot, or a singleton for some concept that modifies other constructors. The machine-readable articles of Abstract Wikipedia are ultimately planned to consist of constructors used in different ways. These constructors may also reside in Wikifunctions. It is possible that one constructor may be made to correspond to an entire sentence of an article, but the kinds of constructors that will be needed are best left a community decision once the project itself launches. Nothing stops us, however, from attempting to model potential constructors to handle different types of information in advance of the launch of the project. So what are renderers then? Well, renderers, on the other hand, take the constructors in an abstract article, transform the filled-in parts into pieces of text, and put those pieces of text together into coherent parts. In general, for a constructor to be turned into text in a given language, a renderer for that constructor-language pair needs to exist. These are ideally functions, hosted in the Wikifunctions project, that can be called on when needed by Abstract Wikipedia. Most of these will likely involve exploring relationships among Wikidata items and lexemes to find the right lexemes to use, and perhaps configuring them so that they fit naturally in the overall result. Like constructors, the renderers that will be needed are best provided by a language community, though nothing stops us from attempting to provide some of these now. So you've heard about Wikidata, you've heard about Abstract Wikipedia; what are Ninai and Udiron?
Well, Ninai and Udiron are names that I have given to the two parts of a text generation system that I have been intermittently working on since last August. Ninai is able to take a custom representation of constructors and search for the right Wikidata items and lexemes for different concepts, and Udiron provides functionality to build up sentences piece by piece, using the lexemes it retrieves from Wikidata. I began writing these then because I wanted to make discussions of Abstract Wikipedia much less nebulous and much more concrete than they were before. My hope was that having some sort of Wikidata-powered text generation system out there, no matter how much in flux it was, would not only, as a proof of concept, motivate people to improve lexemes and items for their own languages, but also would enrich the kinds of ideas that might be entertained when people talk about the as-yet-unlaunched project. Of course, whether I have succeeded in this change in discussion tone remains to be seen. The software itself is currently written in Python, but this is because, of course, Wikifunctions and Abstract Wikipedia do not exist yet. Aside from this, Python is one of the languages that a function in Wikifunctions can be written in, and the structure of Ninai and Udiron is intended to make transplanting them to Wikifunctions as easy as possible, respecting the technical limitations that have been brought up when it comes to that platform. There's still a lot to be done, and the source code is out there for those who want to try using and expanding it. Most of the work I have done with Ninai and Udiron concerns Bengali, German, and Swedish, with a few digressions into other languages, but the rest of this presentation involves going through the process of adding support to Ninai and Udiron for a single new language. This single new language will be Dagbani. I do want to emphasize a few things before we begin, though.
First of all, the steps needed for each language are most likely going to differ considerably between languages. The decisions made for Dagbani support may not necessarily work for Igbo, or Hausa, or Twi, or indeed any other language. Second of all, while it is possible to model things one way early on just to get something working, as I'm doing right now, it is important to refine the model soon enough to ensure that it can handle whatever new linguistic features and details need to be handled later. Third of all, it is entirely likely that the resources I'm using for this demonstration inadequately describe certain features of Dagbani, and so the modeling decisions that I, as a non-speaker, have made with them may well need to be undone or adjusted by Dagbani speakers or others better acquainted with this language than myself. With these disclaimers out of the way, let's begin. We should first start by considering precisely what statements we want to make with this system. In the interest of time, and to ensure that this demonstration doesn't get bogged down in too much detail, let's just make some simple assertions about the kinds of things different people and animals consume: someone drinking water, a speaker eating a yam, a sheep drinking a lot of water, and a monkey eating a banana. When turning these into abstract sentences, it is important to be able to break down their components and identify the relationships between them. Consider the sentence about monkeys and bananas. The sentence is at its core about eating, with someone doing the eating and something being eaten. Usually these two participants, by themselves, are not saying anything about themselves when they are uttered alone. They can, however, be linked by an action, to say something about both of them. One can argue that applying the action of eating to the two participants yields a meaningful statement about them at a given point in time.
In a way, it's almost like to eat is a function, in this case a constructor, which takes two participants and produces a statement. In Ninai, the actual activity to eat is split out into its own input to a more general constructor, Action, alongside the two participants, for reasons which will become clear later. We can transform the other three sentences to use constructors in a very similar way. Some of their components can themselves be broken down and reanalyzed into function-like formulations. In the sheep-water sentence, the relationship between much and water can be considered applying a general attribute to a noun, hence the Attribution constructor; and the position in space of the sheep is in closer proximity to the speaker, hence the Proximal constructor. Remember how Wikifunctions can host various things to be used by Abstract Wikipedia? This formulation of constructors as functions taking arguments is precisely inspired by the kinds of objects Wikifunctions is designed to hold. Now, if you were planning to generate exactly these sentences in your own language, you most likely will not need to repeat the process I just went through. It is entirely possible that the abstract representations of the relationships I described would be the same result if I started with other languages. What follows, however, is important for every language to do, possibly separately. This is where Wikidata comes into play. The first thing we need to do is create lexemes for the words in those sentences, or rather we should check first whether these lexemes already exist, searching for them by going to the search bar and prefixing "L:" to the lexeme we have in mind, and then expanding or fixing the ones that do exist if it's necessary to do so. Remember, we need to create only those lexemes which are actually missing. For the purposes of this demonstration, however, we will pretend that none of the lexemes we need exist, so we should create them all.
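To make the function-like formulation above a bit more concrete, here is a minimal sketch in Python of how such nested constructor applications might be represented. The class names, argument labels, and item IDs are my own placeholders for illustration, not Ninai's actual representation or real Wikidata IDs.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """A basic entity, referenced by a (placeholder) Wikidata item ID."""
    item: str


@dataclass
class Constructor:
    """A named unit of meaning with labeled argument slots to fill in."""
    name: str
    args: dict


# "A monkey eats a banana": the activity "to eat" is itself an input to
# the more general Action constructor, alongside the two participants.
# All item IDs below are made up for illustration.
sentence = Constructor("Action", {
    "activity": Entity("Q000001"),  # hypothetical item for "eating"
    "agent": Entity("Q000002"),     # hypothetical item for "monkey"
    "patient": Entity("Q000003"),   # hypothetical item for "banana"
})
```

Splitting the activity out as an argument, rather than having a dedicated "Eating" constructor, is what lets one general Action constructor (and its renderers) cover eating, drinking, and so on.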
Let's begin with the monkey-banana sentence. We will need at least three lexemes here: the two nouns for monkey and banana, and the one verb for to eat. In each case, we can go to create a new lexeme by clicking "create a new lexeme" in the sidebar. There, we need to specify the word in question, the language of that word, Dagbani in all three cases, and the lexical category of each word, noun or verb, as the case may be. Once we have specified each of these things, we can move on to adding some critical parts of these lexemes. The most important of these critical parts are those that tell people what each word can mean. These are the senses. Without senses, and without an external source that might provide any meanings, a lexeme is effectively meaningless. Let's add some to the lexemes we made. We can add a sense by clicking "add sense" and filling in a description in at least one language. We know that this one means monkey in English, but we might say that it means monkey in each of these other languages. A Dagbani speaker, rather than just repeating the word, should come by and add an explanation in Dagbani of what a monkey is to this sense. We can do the same thing with these other lexemes. Once we have added meanings to these lexemes, we should add at least one form to each of them. While the lemma is important for identifying a word, in many languages the lemma is not directly used in a sentence; some transformation of it is used instead. In Dagbani this is true of nouns, which have both a singular and a plural form that lend themselves to classification, and it is also true of verbs, but both of these discussions are a bit too long for this demonstration. We can add the base form of the verb as a form by clicking "add form", filling in the base form, and then saving it.
On nouns we can click "add form", fill in the singular form, and add a grammatical feature, singular, to that form. Then we can do likewise with the plural form. We've added a lot of information to each of these lexemes, but the power of text generation is only really realized once they are connected to other things on Wikidata, or rather when their senses are connected to other such things. This way, for a given concept, if we start with a Wikidata item for that concept, or a Wikidata lexeme in another language for that same concept, we can follow a property trail all the way back to any other language's lexemes for that concept. There are a lot of properties for this, but there are two in particular that will be useful for us. The first of these specifies, for a particular lexeme sense, an associated item for that sense. It links a meaning of a lexeme, typically a noun, to the Wikidata item corresponding to that meaning. To add this to the sense on monkey, we click on "add statement" under the sense in question, and then type "item for this sense" in the box that appears, or P5137, which is the ID for that property. We can then search for the Wikidata item for monkey, click it, and then save the statement. We now have our first link between this Dagbani lexeme and another entity on Wikidata. Since other lexemes in other languages with the meaning monkey can have exactly this type of link, this means there is now a link between this Dagbani lexeme and those other languages' lexemes. Let's do the same for banana, and then the same for to eat. Note that for to eat, which is a verb and not a noun, we use a slightly different property, predicate for, but the logic is pretty similar to item for this sense, so treating it separately will not be necessary here. The second of these properties is useful when there is not yet a Wikidata item corresponding to a particular meaning, so that item for this sense is not applicable.
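The "property trail" idea can be sketched as a small lookup: if every sense records which item it means (as "item for this sense" statements do), then finding a target-language word for a concept is just an inverted lookup. The lexeme IDs and links below are made up; a real implementation would query Wikidata rather than a hard-coded dict.

```python
# Map each (lexeme sense, language) pair to the item it means, mimicking
# "item for this sense" (P5137) statements. IDs are placeholders.
senses = {
    ("L11234-S1", "en"): "Q000002",   # English "monkey" sense
    ("L99001-S1", "dag"): "Q000002",  # Dagbani sense for the same item
}


def lexemes_for_item(item, language):
    """Return all lexeme senses in `language` linked to `item`.

    This inverts the sense-to-item links: start from a concept's item
    and arrive at any language's lexemes for that concept.
    """
    return [sense for (sense, lang), linked in senses.items()
            if linked == item and lang == language]
```

Starting from the item for monkey, `lexemes_for_item("Q000002", "dag")` would surface the Dagbani sense, which is exactly the trail a renderer needs to follow to pick words.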
If there are multiple senses across languages with that meaning, it may be more feasible to just specify that one of those senses is a translation of the other. While we don't have to use it for any of these lexemes, given that there are already items for monkey, to eat, and banana, let's try linking to eat to the Igbo lexeme sense with that same meaning. For this, we copy the lexeme sense ID from that lexeme, add a translation statement, and paste that ID into the box that appears. Once we do the reverse operation on the Igbo lexeme, we have now established a link between those two. No Wikidata items are required. If there is an appropriate Wikidata item you can link to, you do not need to add translation; I cannot emphasize that enough. On the verb to eat, there is one more property that you will need to add, and more on this property is shown in the link below, but we now need to move along to the actual act of forming sentences with renderers. Before going further into it, however, it is important to discuss briefly how we are forming sentences. In short, we consider sentences as formed through attaching their words together in a structured fashion, forming a sort of tree with a root at one word, in this case sleep, and branches to other words, namely the other words in the sentence. There is a project called Universal Dependencies, which has attempted to make trees like the one you see here for sentences in more than 100 languages, using the same types of connections in as consistent a fashion as possible. With respect to the Dagbani sentences, we can observe a few things about how they are structured. For example, in this diagram of the sheep-water sentence, we see that the subject of the sentence, namely the sheep, precedes the verb, namely the word for drinking, in this sentence. It therefore is important that when a subject needs to be connected to a verb, we must attach it to the left of that verb.
This can be specified as a method in Udiron by calling a function that takes a subject and a verb and performs exactly this attachment. Because lots of other languages link subjects and verbs in the same way, there was already a separate function made specifically to do that which could be called upon, so we didn't need to re-specify everything ourselves. Just a brief reminder before going on, when you look at this code sample: Wikifunctions doesn't exist yet. Otherwise it would be possible to turn each of these function and variable names into IDs and thus make the entire source code translatable. While Ninai and Udiron are set up to be as easily transplantable to Wikifunctions as possible, the lack of source code translatability is just one thing I could not address. A similar sort of function can be written to deal with each of the other structural relationships in this sentence. Objects of a sentence, like water here, which follow the verb in the sentence, can be handled with almost exactly the same sort of function that was used to deal with the subject-verb relationship; in this case the object needs to be attached to the right of the verb. Adjective-noun relationships also use similar handling, as does the demonstrative-noun relationship. The handling of the suffix that appears with the verb in this sentence is a bit similar but also a bit different, because the function in question, which is called whenever a verb must be marked for the imperfective aspect, can be written to always use that suffix rather than require a provided suffix, for reasons which I hope will be a bit clearer when we go into the parts of this system that actually interface with the meanings of words. These are the renderers. If we return to the original abstract sentences, and here I fleshed out the sheep-water sentence in sort of the way that Ninai expects it, there are two types of objects within them that need to be handled: the basic entities and the constructors.
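The attachment functions just described can be sketched roughly as follows. This is not Udiron's real API; the class and function names are mine, and the idea is only that a word node keeps ordered left and right dependents, with language-specific rules choosing the side.

```python
class Node:
    """A word in a dependency-style tree, with ordered dependents."""

    def __init__(self, text):
        self.text = text
        self.left = []    # dependents rendered before this word
        self.right = []   # dependents rendered after this word

    def linearize(self):
        """Flatten the tree into a space-separated string, left to right."""
        parts = [d.linearize() for d in self.left]
        parts.append(self.text)
        parts += [d.linearize() for d in self.right]
        return " ".join(parts)


def attach_subject(verb, subject):
    # In Dagbani the subject precedes the verb, so attach on the left.
    verb.left.append(subject)
    return verb


def attach_object(verb, obj):
    # Objects follow the verb in Dagbani, so attach on the right.
    verb.right.append(obj)
    return verb
```

With English placeholder words, attaching a subject and an object to a verb and linearizing would yield subject-verb-object order, mirroring the Dagbani word order described above. Since many languages attach subjects to the left of verbs, a shared function like `attach_subject` can be reused across languages rather than re-specified for each.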
With respect to the basic entities, Ninai natively has functionality, given an item like these four, or a lexeme sense, to search through Wikidata properties to find a lexeme in a particular target language, so we don't have to do anything special here. Remember the item for this sense property and the translation property? Here is precisely why they are so important. This then leads to the constructors, each of which will have a corresponding renderer. By the time we need to render a constructor, we can assume that its inputs have been processed, either into words or into sequences of words. The main activity of the renderers for Dagbani will be to invoke one of the functions mentioned in the previous section, the ones that attach things to verbs or to nouns. The renderers for Attribution and Proximal in Dagbani are in fact as simple as that. The Action renderer, however, given the potential generality of the Action constructor, actually is split into multiple sub-renderers: one handles subjects, like the sheep, one handles objects, like the water, and one handles inflections, like the habitual inflection. These sub-renderers for Dagbani, luckily, are just as simple as the other renderers. It's also possible to add custom functions that must be run at different points in the rendering process. For example, many constructors in different languages have functions that add end punctuation, such as periods, after rendering, if the constructor in question is the outermost. We can do the same with the Action constructor, which is here the outermost, in Dagbani. We can now try rendering our abstract sentences. Because Ninai and Udiron are currently written in Python, we need to start a Python session first. Then we need to import a bunch of stuff. Then we need to type out one of our abstract sentences. Then we need to call a render method on that sentence, specifying that we want to render it into Dagbani. And voila!
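As a rough illustration of the constructor-language pairing, here is a toy renderer registry in Python. The decorator, function names, and the word order shown are all my own assumptions for illustration; the real system builds lexeme trees rather than splicing strings.

```python
# Registry keyed by (constructor name, language): one renderer must
# exist per constructor-language pair for text to come out.
renderers = {}


def renderer(name, language):
    """Decorator registering a function as the renderer for a pair."""
    def register(fn):
        renderers[(name, language)] = fn
        return fn
    return register


@renderer("Attribution", "dag")
def render_attribution(args):
    # Word order here (attribute after noun) is purely illustrative.
    return f"{args['noun']} {args['attribute']}"


def render(constructor_name, args, language):
    """Dispatch to the renderer registered for this pair, if any."""
    return renderers[(constructor_name, language)](args)
```

A missing pair simply raises a lookup error, which reflects the point above: until someone provides a renderer for a constructor in a given language, that constructor cannot be turned into text in that language.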
You have now seen the basics of how Abstract Wikipedia-like text generation can work with Ninai and Udiron. It may have seemed like lots of aspects were presented at once, but the functionality shown generalizes to a whole lot of sentences just like the ones I showed. I'll admit that some things were discussed rather quickly, or were glossed over, but I believe that what you have seen is nevertheless reflective of the most important steps to take. It took lots of repetition and trial and error to design the system, and by the way, Ninai and Udiron are still living, continually developing pieces of software that I frequently consider expansions and revisions to. But I believe that this whole process helps ensure that the system which may well power Abstract Wikipedia remains both flexible and powerful for all languages. I hope from this demonstration you recognize some ways that you can help right now to make text generation in your language possible. Here is a recap of some of the most important things. First of all, create lexemes. The words that you use in your language right now will definitely need lexemes if they're going to be used in a text generation system like this one. Make sure to search for them before you create them. If they already exist, try to add more information to them, especially senses; a form or two might be nice. Link lexeme senses to items. If getting the right word is key, then you need to be able to arrive at it from other possible entities on Wikidata. Since Wikidata items are language-neutral entities, linking lexeme senses to Wikidata items via item for this sense, or predicate for, or demonym of, as the case may be, is perhaps the simplest way to ensure that the right lexeme sense can be found. Link lexeme senses to other lexemes. Of course, not all lexeme senses have corresponding items, and that's okay.
We have properties like synonym, translation, antonym, hyperonym, pertainym, and probably something else I haven't heard of, to ensure that these senses can still be found when starting from other lexemes. Think about how you put sentences together. Whenever you put a sentence together, no matter how simple or complex, there may be some rules as to how parts of this are done. Try to formulate these rules. They could be entirely informal or they could be based on linguistic expertise and research. Whatever the case, see how much you can capture in a sort of functional form. Think about how you express meanings. Similarly, when you come up with words to indicate a particular concept, there may be a number of different possibilities to arrive at equivalent expressions, or there may be a specific manner used all the time in your language for that concept. Think about how you might come up with some methods for these. And with that, thank you so much for watching. You can find me both on-wiki and on a variety of platforms if you have questions about Wikidata-based text generation in general and Ninai and Udiron in particular. Enjoy the rest of the event!