So Arabic Corpus was designed as a somewhat desperate attempt to overcome some of the challenges facing Arabic corpus work. And I consider it a kind of stop-gap measure that can help students and researchers until the time comes that someone is actually willing to invest the huge amount of resources and time it would take to create a large, lemmatized, part-of-speech-tagged corpus of Arabic. One of the reasons I'm actually giving this paper is just because I receive constant negative evaluations of Arabic Corpus, which I completely agree with, because it's based on this understanding of what Arabic Corpus is trying to be. It's a stop-gap measure more than anything else. Anyone who works with Arabic corpora can't help but develop a bad case of what I'm calling English corpus envy. For English, large, lemmatized, part-of-speech-tagged corpora are available, along with spectacular online tools that allow easy access to very advanced functions. There are also adequate tools for automatically lemmatizing and part-of-speech tagging in the first place, which facilitates the creation of these large corpora. So I want to quickly review these features while demonstrating some of the issues and challenges of dealing with non-lemmatized and non-part-of-speech-tagged corpora that are peculiar to Arabic. By lemmatization, I mean the grouping of word forms that are meant to be or felt to be instances of the same underlying word together for search and counting purposes. For example, grouping a noun with its plural, or grouping the various forms of a verb. You can see I've done some for English there. And we saw this morning in Latin that that was happening automatically. And that magic is not that easy to create, even for English. When searching for forms in a non-lemmatized English corpus, in many cases the forms you are searching for are right next to each other alphabetically.
And even in cases when they aren't right next to each other, they're typically not very far away. Further, there are typically a fairly restricted number of word forms for the huge majority of words and lemmas, typically just two or three. So even if you had to list them separately in a search, it would not create a huge problem for the user. Arabic, on the other hand, has a number of features that make searching for forms in a non-lemmatized corpus problematic. First, every verb has a large number of forms when you count the past and present conjugations for every possible subject pronoun. So I've listed them here. I've written the forms in both Arabic script and then in a one-to-one transliteration so that non-Arabic speakers can get some idea of what I'm referring to. The forms you see are for the verb kataba, hold on a second, see if I can get this. So we have the past tense verbs there, the present tense verbs there, these extra forms for subjunctive and imperative here. Some of the forms are the same, so I'm not counting those, okay? I only count the forms that are graphically different. So if you count all the forms that are graphically different for a simple verb like kataba, you come up with 30 graphically distinct forms. In the past tense, these forms do alphabetize together, but obviously in the present tense they don't, because the present tense uses prefixes and not suffixes. As if this were not enough, most verbs can take a variety of extra prefixes and suffixes beyond those involved in the actual conjugation of the verb. The suffixes are mainly pronoun endings that represent the object of the verb. The prefixes include the future particle sa, the subordinating conjunction li, and the conjunctions fa and wa. The conjunctions can be attached to any form, while the other particles are limited to the present tense forms. But for those forms that take the particles, the particle can also be combined with a conjunction.
So let's do a little math. We take all 30 forms that appear alone, and they can also appear with a wa or with a fa. That means you've got 90 possible word forms. Now 13 of the imperfect forms can be combined with sa, and with wa-sa and with fa-sa. So that adds 39 more forms. And 12 of them can be combined with li, wa-li and fa-li, which is 36 more forms. So to find all the examples of this very simple verb, you need to examine 165 separate word forms, which are literally all over the alphabet. Then you need to realize that every one of these forms, all 165 of them, can be combined with 11 graphically distinct possible pronoun endings, giving a total of 1,815 possible word forms for every verb that you might want to look up. If you're thinking about searching for these and saying, you know, I want to know how many times this author is using the verb kataba in this novel, you're going to get a big headache, that's what I can say. Nouns are going to be the same. They're not quite as bad. But if you just take any particular noun, it has the singular and it has the plural. That's two, but then it has four graphical forms of the dual. When you think in computer terms, it's not the same. You think, oh yeah, there's the singular, dual and plural. That's three. No, there's the singular, kitab, then there's kitabani, kitabayni, kitaba, and kitabay, because there's four possible graphical versions of it. So there's six forms of every noun in the simplest case. These could take attached prepositions, some of which don't allow for both of the dual forms. It ends up with 18 possible forms. You can add to those the conjunctions that we already talked about, fa and wa. And then you have to realize that almost all of those can appear with and without the article, and they can also appear with and without a pronoun ending. And so any noun in the language is going to appear in a corpus in almost 400 different separate word forms, okay? And these forms show up all over the alphabet.
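The verb arithmetic in this passage can be checked with a short sketch. The counts (30, 13, 12, 11) are the ones given in the talk; the constant and function names are made up for illustration.

```python
# Counts quoted in the talk; names are illustrative only.
BASE_FORMS = 30          # graphically distinct conjugated forms of one simple verb
CONJUNCTION_OPTIONS = 3  # bare form, form with wa-, form with fa-
SA_FORMS = 13            # imperfect forms that can take the future particle sa-
LI_FORMS = 12            # forms that can take the subordinating particle li-
PARTICLE_OPTIONS = 3     # particle alone, wa- + particle, fa- + particle
PRONOUN_ENDINGS = 11     # graphically distinct object-pronoun endings


def verb_form_counts():
    """Return (forms without pronoun endings, forms with pronoun endings)."""
    plain = BASE_FORMS * CONJUNCTION_OPTIONS          # 30 * 3 = 90
    with_sa = SA_FORMS * PARTICLE_OPTIONS             # 13 * 3 = 39
    with_li = LI_FORMS * PARTICLE_OPTIONS             # 12 * 3 = 36
    without_pronoun = plain + with_sa + with_li       # 165
    with_pronoun = without_pronoun * PRONOUN_ENDINGS  # 165 * 11 = 1815
    return without_pronoun, with_pronoun


if __name__ == "__main__":
    print(verb_form_counts())
```

The point of spelling it out is that each step multiplies the number of separate strings a user would have to search for by hand in a raw corpus.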
Another issue with Arabic is that it's written without short vowels. And so it's extremely common, for example, for any simple form one verb to have the verbal noun look exactly the same as the past tense "he" form. So for instance, daghata means he applied pressure and daght means pressure. Darasa means he studied and dars means lesson. And they look the same, graphically they are the same. And so if we had a part-of-speech-tagged corpus, that would all be figured out for you, but we don't, okay? And so it's a problem. Many English corpora, like the British National Corpus, are very carefully constructed to include a balanced selection of a wide variety of genres. This allows one not only to find many examples of the use of a word, but also to investigate whether or not the word is specialized or limited to particular genres as opposed to being in more general use. This information is much harder to get from non-representative corpora, which Arabic corpora are almost certain to be because of the difficulty and expense of creating a balanced corpus. English corpora have grown huge in the last decade. This makes it much more likely that you'll be able to find relevant examples of what you're looking for. And it also makes the lack of examples more meaningful. So for example, almost every textbook and all the verb books that talk about Arabic verbs give an alternate form of the jussive of doubled verbs. Don't worry about it if you don't understand this: you can say lam yamurra and lam yamrur. And they give them as just two alternatives. But you look at any of the small corpora that are available and you don't find lam yamrur, it just doesn't exist. But it's a small corpus and so you're not sure. But if you get 100-million-word corpora or 300-million-word corpora and you don't find any examples of lam yamrur, then that is a meaningful thing.
You say, oh, people just aren't using this particular version of the jussive. A corpus by itself is just simply a collection of texts. And that's either coded up, lemmatized, POS-tagged, et cetera, or simply left raw. Many corpora, both in English and Arabic, have been distributed in just that fashion, just as files. So if you Google Arabic corpora, you're going to get a bunch of corpora that you can download, but they're just files and you have to figure out what to do with those files. It's left to the user to find a way to search them. This is fine if the user happens to be very technically oriented. And I heard about this in your talk this morning, but it creates a fairly high hurdle for the non-geeky, in which I include myself, okay? And in fact, in which I include most students, teachers, linguists and humanities researchers in general, in that you have a certain amount of technical skill, but you're not comfortable. And so you end up not choosing to use the tool or the corpus. Modern web interfaces, like that available for the Corpus of Contemporary American English, COCA, have made English corpora many, many times more accessible to the non-geeky. This is one of the actual motivations behind the creation of Arabic Corpus: to provide a resource online with straightforward tools so something would be available for the huge majority of Arabic learners, teachers, and researchers who are uncomfortable kind of doing it by hand or doing it by themselves. Many of the English corpora are optimized for fast search, and we've heard about fast today. And it is almost miraculous. If you go to the Corpus of Contemporary American English, COCA, it's I think up to 300 million words or something and results come within two seconds. Mark Davies says he refuses to make anything where you can't get the results in two seconds. And so he does an amazing amount of preprocessing and uses huge amounts of data storage to be able to get that result.
That's not something we yet have available, so we sit around and wait, okay? Arabic is blessed with what I call massive graphical ambiguity, okay? This is partly caused by not writing the short vowels, and I illustrated that earlier. But even when you do write the short vowels, you have phonological ambiguity, maybe not quite as massive. I just give two short examples here. You think it's not a problem until you actually start working with it. And there's certain words where you just cannot get what you want. You get too much of every other thing. And so this ta-qaf-dal-mim is a good example. If you want to find the form one verb, qadima, or the form two verb, qaddama, you're gonna get a million form five verbs, because taqdamu, tuqaddimu and tuqdimu look exactly like taqaddama in form five. And they don't mean anything like the same thing. Taqdamu means she comes, form one present. Tuqaddimu, form two present. Tuqdimu, form four present. But taqaddama is form five, past, and with a different conjugation, third person masculine instead of feminine. And then taqaddam is an imperative and taqaddum is a verbal noun. And they all look exactly the same. And this is not a little tiny problem. This is a massive problem in Arabic online search or any kind of computer search. I love the next one, this ya-ayn-dal. Ya-ayn-dal is just the greatest word in Arabic because it can be so many things. It can be the jussive of 'ada, ya'udu, lam ya'ud. It can be the jussive of wa'ada, ya'idu, lam ya'id. It can be the jussive of a'ada, yu'idu, lam yu'id, okay? It can be just the regular present tense of "to count", ya'uddu. It can be the passive of that, yu'addu, which means to be considered, and it can be a form four present tense, yu'iddu. And these are not similar in meaning at all. They're completely different words. You don't want to group them together, okay?
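The effect of unwritten short vowels can be sketched in a few lines. This is only an illustration: the fully vocalized spellings below are standard forms supplied for the example, written as Unicode escapes so the combining marks are unambiguous, and the function name is made up.

```python
import unicodedata


def strip_short_vowels(word: str) -> str:
    """Drop combining marks (fatha, damma, kasra, sukun, shadda, ...)."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


# dal, ra, sin with different vowel marks
DARASA = "\u062F\u064E\u0631\u064E\u0633\u064E"   # darasa, "he studied"
DARS = "\u062F\u064E\u0631\u0652\u0633"           # dars, "lesson"

# ta, qaf, dal, mim with different vowel marks
TAQADDAMA = "\u062A\u064E\u0642\u064E\u062F\u0651\u064E\u0645\u064E"  # form five past
TUQADDIMU = "\u062A\u064F\u0642\u064E\u062F\u0651\u0650\u0645\u064F"  # form two present

if __name__ == "__main__":
    # Once the marks are gone, each pair is the same string on the page.
    print(strip_short_vowels(DARASA) == strip_short_vowels(DARS))
    print(strip_short_vowels(TAQADDAMA) == strip_short_vowels(TUQADDIMU))
```

Since ordinary Arabic text is already written in the stripped form, a raw string search has no way to tell these readings apart.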
In addition to that, for whatever reason, Arabs have decided that they are not going to be consistent in writing certain things. And so the hamza is written either on or under an alif at the beginning of a word. And it is entirely optional. You can either put it or not put it. And if you do a study of it, which I have done, you find that it's not like one book always does it the same way. Whatever book you choose, whatever page you choose, one word will have the hamza and one will not. You know, it is completely random. And it's the same with the alif maqsura at the end. If you have a ya, here, let's get this thing. If you have a ya at the end of a word, sometimes it will be written without the two dots. But unfortunately, the Ahram and a few other sources in Egypt, when they have a word that's supposed to be written without the two dots, write it with two dots. And so we have for instance the word Ali, which is a common name, and ala, which is the preposition "on". And it's absolutely random which way it is going to be spelled in the Ahram and in lots of other sources. And so if you want to find all the examples of ala, you have to accept Ali, which seems stupid and ridiculous, but there it is. Okay. Another thing that makes this more difficult is that if a lot of the prefixes and suffixes were very recognizable as just prefixes and suffixes, then you would just get rid of them. It would be easy. You would say, oh, alif-lam, al, that's the definite article. We'll take all the alif-lams off the words and then we'll be left with the word. No, because there's all kinds of words that have alif-lam, but it's part of the word. And so you have to read Arabic to get this, but I give the example of iltiham, which starts with an alif-lam, but it's part of the word. You can add an alif-lam to it and say al-iltiham. The same with a lot of pronoun endings.
For instance, hum is "their", like kitabuhum is "their book", but you can get a word like ittaham, which ends in what looks like hum. And of course you could say ittahamahum, where you could have a hum-hum. And the last one there is the same, nabbaha, nabbahahu. And bi at the front of a word. You think, oh, I'm gonna take all the bi's off the words. That means "in". But you can't do that, because there's all kinds of words that start with ba, even words with four letters. So like bulbul, bi-bulbul, it's tough. I think the whole point of that is just to say that it's extremely frustrating if you are given a large amount of Arabic text and you're just using a search function. You go into BBEdit, or any other normal thing with a really strong search function, and just start searching, and you get tons of stuff that you don't want. And you don't get all of what you want. And that's the whole point of these kinds of corpora, to get what you want and not what you don't want. Those are the two keys. And it just doesn't work for Arabic. The first thing I did in the digital humanities was a synonym dictionary. And I was supposed to find kind of core examples of all the words that I was using, to really show how this synonym was a little bit different from that synonym. And before I really understood how to use corpora, I thought, well, that's gonna be really fun. I have to read through a year's worth of newspapers, 75 times, to be able to get these examples. And then I figured out I could start searching for them. But as I started searching and developing search tools, I'd get so frustrated because you would get 400 examples, and only three of them would be examples of the thing you were searching for. Everything else would be garbage. It really is not straightforward, that's all I'm going to say.
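Why blind affix stripping fails can be shown with a small sketch. It assumes a Buckwalter-style one-letter-per-character transliteration (A for alif, no short vowels); the talk does not name its transliteration scheme, so the spellings and the function name here are illustrative only.

```python
def naive_strip(word, prefixes=("Al", "b"), suffixes=("hm",)):
    """Blindly remove one known prefix and one known suffix, if present.

    "Al" is the definite article, "b" the preposition bi-, and "hm" the
    pronoun ending -hum. No dictionary is consulted, which is the flaw.
    """
    for prefix in prefixes:
        if word.startswith(prefix):
            word = word[len(prefix):]
            break
    for suffix in suffixes:
        if word.endswith(suffix):
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    # Cases where stripping happens to be right:
    print(naive_strip("AlktAb"))  # al-kitab  -> ktAb
    print(naive_strip("ktAbhm"))  # kitabuhum -> ktAb
    # Cases where the "affix" is really part of the word:
    print(naive_strip("AlthAm"))  # iltiham -> thAm (wrong)
    print(naive_strip("Athm"))    # ittaham -> At   (wrong)
    print(naive_strip("blbl"))    # bulbul  -> lbl  (wrong)
```

The last three outputs are not words related to the input at all, which is the kind of garbage a raw search tool ends up matching.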
Okay, so Arabic has many other features, like multiple broken plurals and non-sound verbs, which make it difficult to gather the forms for a particular word. I mean, the one I love the most is the verb bana. You know, bana is to build. And if you search for bana and all the possible morphological forms of bana, 90% of the words you get have nothing to do with "to build". You get bint, for example, instead of banat. Bint is "girl", instead of "she built". And et cetera. Bunayya, for example, is "my son". Yeah, bunayya, and that has nothing to do with build. It just goes on and on, okay. So right now I just want to give you an introduction to Arabic Corpus, and at the same time, you know, first of all, just recognize that this was developed by me, a non-techy person without resources. I was not given, I've never gotten a grant for this, and I've never gotten help from a computer professional. And so I make no claims for it, I mean, as far as being a professional product. I was just trying to take what I knew how to do and make it available, literally as a rebuke to the computer professionals who should be doing this. This is meant as a rebuke. They don't take it as a rebuke. They just criticize it. But there are large Arabic corpora out there that are so techy, no one will use them except computer professionals. There's Arabic Gigaword, for example. There's CallHome Arabic. There's all kinds of things out there that 99.9% of students and teachers refuse to use because they can't figure out how to do it, okay. This tool, that is incredibly non-geeky and that has millions of problems that they immediately point out, has 10,000 users, all of whom have used it at least 10 times, and hundreds of users who have used it over 100 times. I mean, it just says something, you were talking about this this morning, that you have to make something that is simple enough and not scary, and as soon as you do that, people love it. I mean, they want it, they need it.
The tools have always been there and they could have used them, but they just weren't willing to until it got into a web interface. So anyway, I wanted to try to ameliorate some of the problems inherent in a raw corpus. There was no question that this was going to be a raw corpus. I didn't have the resources to even consider not making it a raw corpus. It turns out there are automatic POS tagging and lemmatization tools out there for Arabic. I spent years working with them. Don't tell anybody this, but they're crappy. They don't work, okay? They work great if you're a computer science professional and you're doing it for a professional paper that you're giving at a computer science conference and you're demonstrating that it has 65% accuracy and blah, blah, blah. But if you're doing it to try to find the words, no, they just don't work. It's not useful. We just simply are not there yet, okay? This is something, actually, I'm surprised that someone else said this today. But this is the most important decision I made, okay? I decided I needed to put the corpus on a server and provide access to a web-based program, okay? It seems like a no-brainer in hindsight, but it was not obvious to me at the time. And it's so hard to remember, for those of you who aren't old like me, the time when computers weren't so ubiquitous and when the internet was not just a foregone conclusion. We forget how fast things have changed, okay? I spent several years building this corpus and I used Perl on the command line to search it. It was great. That's how I did my synonym dictionary. It worked wonderfully. But I was so excited about it, I started telling my students and my colleagues about it, and I was willing to send them the files. I mean, you couldn't send them in those days. You had to hand them to them on a Zip drive or something and give them the program so they could do it themselves. I had zero takers. No one was willing to do it.
It just was too complicated, even though it wasn't really complicated. I could explain how useful it was, but it didn't matter. As soon as I put it online, the exact opposite thing happened. Everyone was willing to give it a try. And instead of me having to offer, people sought the program out, okay? It became popular. There's about a hundred references on Google Scholar from people that used it for dissertations, for books and for scholarly articles. People need a corpus, but they needed one that they could use. When I look through my usage, I have about 10 users who have thousands of searches each. And most of them, I know who they are, and they're all working on dictionaries. You know, and that's funny because I know some of these people and they're very, very sophisticated corpus linguists, and they know how to use Sketch Engine, and they themselves have huge corpora that they can search. But sometimes it's just easier to go to Arabic Corpus and find it there. I mean, that's the point. It's the ease of use that makes the difference, okay? People who are afraid of programming are not afraid of a browser. Nobody is afraid of a browser these days. And I think that's a really key point. Okay, I just wanted to run through what you do in Arabic Corpus. In Arabic Corpus, you have to choose a part of speech. And I'm going to go into that in a minute to explain why that's important. But you choose a part of speech, you tell the computer what you're looking for. And that helps the computer narrow down the range of things that it will find for you. Then you choose a corpus. You can choose all, but that gives you a long wait. This is about a 150-million-word corpus, it's large. It's not large in terms of modern American corpora, but it's large in terms of Arabic corpora. And it's a raw text corpus, so it takes quite a while. And a lot of the things that you might want to find out, you can find out in one newspaper.
And so what I've done is, all the newspapers I have, I've divided them into years. Each is about a year's worth of newspaper. And that's very intuitive for a lot of users. Because it's like saying, if I read the Ahram for a year, I'm gonna see this word 62 times. That's about once a week. Or I'm gonna see this word 723 times, which is about twice a day. It gives you a palpable feeling for the frequency of that word. This also kind of shows you a pathetic attempt to balance this corpus. I wanted to have some kind of balance. And literature was not much available. And so I started digitizing it. But hand digitizing, even with modern tools, is both expensive and fraught in various ways. I finally got a million words. And so I have one million words of literature out of 150 million words. So that's not balance. But at least it allows you to look at the literature separately. I have some nonfiction, I have Islamic discourse that comes out of Islamic websites. That can be useful for some people. A little bit of colloquial, and a bunch of pre-modern stuff that various people have given me. And the Arabic learner corpus that somebody else did and begged me to put on here. I can actually put anything on that anybody wants me to. If anybody hands me a file that's in Unicode, I'm usually willing to put it on if it's a fairly major piece of Arabic. But I'm not willing to digitize it myself. You can either type the thing you're looking for in Arabic or in transliteration. Notice that you type the verb in its dictionary form, in this case taraha, either in Arabic or in transliteration. There's a transliteration table available. Once you run it, it gives you a little bit of quantitative information. So for instance here, I ran it through the modern literature corpus. You can see right there, it tells you what you ran it through. It went very fast in this case. It found 101 examples of this verb.
And then it does the math for you and tells you how many instances per 100,000 words. If you wanna know the instances per million words, you know what to do. Add a zero. But for every single word you look up, you get this quantitative information. It turns out to be really helpful and pretty insightful for a lot of at least lexicographic and pedagogical uses. Then you go to the citations page. You can see here that you can click on any of these ways of looking at the data. And so if you go to the citations page, this is a typical concordance. It gives you the 10 words before and the 10 words after, and it sorts by the word before. So you can see the word before is right here, but it also puts it here. So you can look down quickly, and it does that 100 words at a time basically. I'm only showing you a little bit of the page. But that allows you to case out the word and how it's being used. You can also click here where it says sort by word after. So then instead of sorting by the word before, it'll sort by the word after. It allows you to look through and check for what words go with this particular word commonly. In an attempt to help with understanding what a balanced corpus would give you, not being able to provide a balanced corpus, I actually provide the piece of literature that the words come from. So this is just names of pieces of literature where these words were used. And so you can see actually there's some pieces of literature where it's very common for whatever reason and others not. In the case of newspapers, newspapers are all downloaded from websites, and the websites almost always include sections in the actual URL. And so I can pull out those sections and organize the pieces of newspaper.
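The per-100,000 figure and the sorted concordance described in this passage can be sketched roughly as follows. This is not the tool's actual code: the function names are invented, tokens are assumed to be pre-split on whitespace, and the one-million-word corpus size in the usage example is the approximate literature figure mentioned in the talk.

```python
def per_100k(hits: int, corpus_size: int) -> float:
    """Normalize a raw hit count to instances per 100,000 words."""
    return hits * 100_000 / corpus_size


def concordance(tokens, target, width=10):
    """Return (words before, hit, words after) for every exact match."""
    rows = []
    for i, token in enumerate(tokens):
        if token == target:
            before = tokens[max(0, i - width):i]
            after = tokens[i + 1:i + 1 + width]
            rows.append((before, token, after))
    return rows


def sort_by_word_before(rows):
    """Order citations by the word immediately preceding the hit."""
    return sorted(rows, key=lambda row: row[0][-1] if row[0] else "")


def sort_by_word_after(rows):
    """Order citations by the word immediately following the hit."""
    return sorted(rows, key=lambda row: row[2][0] if row[2] else "")


if __name__ == "__main__":
    # 101 hits in roughly one million words of literature.
    print(per_100k(101, 1_000_000))
    tokens = "z X a y X b".split()
    for before, hit, after in sort_by_word_before(concordance(tokens, "X", width=1)):
        print(before, hit, after)
```

Sorting on the neighbouring word is what lets a reader scan down the page and see recurring combinations at a glance.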
And so this means that the subsections are gonna be different in every single newspaper, but they're still relatively helpful because every newspaper is gonna have front page articles, Arab news articles; the basic categories are gonna be more or less the same. And so you'll be able to look down here and see, oh, this one seems to be used pretty broadly, about the same through all subsections, whereas another word, like if you look up the word haris, which is a guard, it turns out that you find that mainly in sports because it's the guard in soccer, you know, the goalie. And to me, that's important somehow. You know, in other words, for a student to just learn that the word haris means guard, but to also be able to tell the student, if you're gonna be reading the newspaper every day, 98% of the time you see this word is going to be in a sports article. It's gonna mean goalie. That's relatively useful information, okay? Then you can click on word forms, okay? And what it does is it gives you, oh, that is not word forms. There we go. Unfortunately, I put the wrong slide in here for word forms. So it shows you something else. But what it does is it shows you every word form that it found. And so if it was kitab, it would have kitab, al-kitab, bil-kitab, kitabuhu, kitabuha. It would show you everything it found, and it would show you how many of them it found, and you can click on one of them and see only the citations for that one. Here are the words before and after. This is just a compilation done the same way I sorted the citations, okay? And so you can see we're looking at the verb taraha, okay, and the most common word after is alayhi, then alayka, then ala. And then we have as'ila, su'al, al-as'ila.
Again, just that information right there is pretty helpful for students, you know, to realize that when I see taraha, I'm probably gonna see something about a question, you know, taraha and question go together because it means to pose a question. And you pose it to someone, that's where the ala comes from. There's just something satisfying about this kind of information. Finally, this is actually something I added later after a lot of complaints. You know, Arabic has fairly free word order, not totally free word order. And so not all the collocates are gonna be right before and right after, and I was letting people look at the words right before and letting them look at the words right after. But normally in a corpus, for a collocation, you set a range, you say within four words on the front and within four words on the end. And so I just did that so people could have it. And so this says, within four words on one side and four words on the other, what are the most common words that go with this word? And again, you see that ala and as'ila show up at the top, and then come su'al and a whole bunch of other alas as we go farther down. And it also allows you to download all the citations, which you can then bring into an Excel file. And it basically has one column with the 10 words after, one column with the word and one column with the 10 words before. And then you can sort those in various ways, count them, do whatever you want with them outside of the program. Okay, so that was kind of a very quick introduction. Now, I just wanted to go back to some of the aspects and talk about why I made those choices. So computers, as we know, are very good at searching for exact strings, but they're not good at understanding what humans just do intuitively, figuring out, oh, that's an example of that word; that is not an example of that word because of this context. At least computers being manipulated by non-geeky people like me are completely incapable of doing that, okay?
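The four-words-either-side collocation count mentioned in this passage can be sketched like this. It is a minimal illustration with an invented function name, assuming the text is already split into tokens and that hits are exact string matches.

```python
from collections import Counter


def collocates(tokens, target, window=4):
    """Count every word within `window` tokens before or after each hit."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        counts.update(left + right)
    return counts


if __name__ == "__main__":
    tokens = ["w1", "w2", "T", "w3", "w2", "w4", "w5", "w6", "T", "w2"]
    # Most frequent neighbours of "T" within two words on either side.
    print(collocates(tokens, "T", window=2).most_common(3))
```

Compared with looking only at the word immediately before or after, the window picks up collocates that free word order has pushed a few positions away.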
When you search for exact strings in Arabic, you typically generate so many false hits that the results seem useless. And so the reason I came up with the idea of the user choosing the part of speech is because that allows the program to reduce the number of false hits, okay? So the whole way this program works is you type in any sequence of letters. And the computer instantly finds every example of that string of letters. For instance, if you search for the string DRS, darasa, you will get basically thousands of hits. And these are all different. You get more than thousands if you count each individual one. But the idea is you have three categories. One category is that string with a whole bunch of extra letters around it, just different words. And you're definitely not searching for those. Those are different words. I'm not looking for madrasa. I'm not looking for madaris. I'm not looking for, you know, Landr Storman, which is a Finnish name. I mean, that happens to have DRS in it, you know? And so if I choose string, it's just gonna give me every example of DRS. But if I choose noun, it's gonna know that there's only some things that are possible with nouns. It can have the definite article. It can have any of those prepositions that can be on the front. It can have a pronoun on it, or it can be plain. But it can't have a mim on the front. And so it immediately cuts out all those things. The other two categories of things it finds are things that have to be a noun or that have to be a verb. So if you have DRS with an alif-lam on the front, the article, it is a noun, okay? If you have DRS with a kaf on the front, okay? That means "like". It has to be a noun. But if you have DRS with a ya on the front, that's a verb conjugation. It has to be a verb, okay? And so the second category of things that it finds, when it finds DRS, is all these things that are unambiguously one or the other.
And so by choosing noun, I cut out all the verb ones. And by choosing verb, I cut out all the noun ones. And that winnows down the false hits enough that then I'm willing to deal with the results, okay? The third category of things that it will find are things that are absolutely, totally ambiguous, okay? That's like the bare form. The bare form can be darasa, a verb, or dars, a noun. And there's nothing you can do about that. And that's another complaint I get constantly with Arabic Corpus. It's, I chose a noun and it gave me a verb. And I'm saying, you didn't read the instructions. I didn't promise you that it would find only nouns or that it would find only verbs. I promised you that the thing that it finds, totally out of context, could be that noun. Morphologically, graphically, it could be a noun or it could be a verb, who knows? And so basically what the part of speech does is it hugely reduces the number of hits. You can see this is still searching for the exact same thing, but I've chosen noun, okay? And here we have all sorts of nouns. What's that? So there's ad-dars, which has to be a noun, but there's the bare form, which could be a verb. There is darsan, and you would also say, oh, that has to be a noun. No, because it could be darasa, it could be the dual. Here's wa-darasa or wa-dars, ambiguous. Here's wa-d-dars, not ambiguous. Here's darsuna, that could be ambiguous, it could be qad darasna. And so, in other words, you have to go through on your own and decide whether it's ambiguous or not. But the point is that enough of these are not ambiguous that you can deal with it as a researcher instead of just throwing up your hands and saying, I'm not doing this, I'm not playing this game. Okay? So this is just an example of the verbs. Of course, in the verb section, you get a lot more forms because verbs are just so copious in the number of forms, but maybe fewer of them are ambiguous.
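The three categories described in this passage (different word, unambiguously noun or verb, fully ambiguous) can be sketched as a tiny classifier. This is a simplification and not the tool's real logic: it assumes a Buckwalter-style transliteration, handles only the prefixes the talk names explicitly (alif-lam, kaf, ya, plus the conjunctions wa and fa), ignores suffixes, and uses invented function names.

```python
import re

NOUN_ONLY_PREFIXES = ("Al", "k")  # definite article; kaf meaning "like"
VERB_ONLY_PREFIXES = ("y",)       # ya, a present-tense conjugation prefix


def classify(token, stem="drs"):
    """Label a token as 'noun', 'verb', 'ambiguous', or 'other word'."""
    prefix_alternatives = "|".join(NOUN_ONLY_PREFIXES + VERB_ONLY_PREFIXES)
    pattern = rf"(?:w|f)?({prefix_alternatives})?{re.escape(stem)}"
    match = re.fullmatch(pattern, token)
    if match is None:
        return "other word"   # e.g. madrasa: extra letters around the string
    prefix = match.group(1)
    if prefix in NOUN_ONLY_PREFIXES:
        return "noun"
    if prefix in VERB_ONLY_PREFIXES:
        return "verb"
    return "ambiguous"        # bare form, possibly with wa- or fa-


def search(tokens, part_of_speech, stem="drs"):
    """Keep hits that could be the chosen part of speech."""
    wanted = (part_of_speech, "ambiguous")
    return [t for t in tokens if classify(t, stem) in wanted]


if __name__ == "__main__":
    hits = ["Aldrs", "kdrs", "ydrs", "drs", "wdrs", "mdrsp"]
    print(search(hits, "noun"))
    print(search(hits, "verb"))
```

Note that the ambiguous forms come back under either choice, which matches the talk's caveat: choosing noun only promises that each hit could be a noun.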
The first one here is ambiguous, but then the second one is not, the third one is not, this one is, et cetera, going on. Okay, here's an example of the word form list, finally. The reason I give the word form list is so researchers and students, using their knowledge of Arabic, can get some kind of an idea about whether the numbers that this is providing should even be paid attention to. So you come to the word form list and say, how many of these hits are ambiguous? And you can just kind of go down and look at the word forms that it finds, and if you say that half of them are ambiguous, these numbers are not gonna tell me much. I have to actually look through the ambiguous ones and see before I can trust these numbers. And so this is just kind of a way of dealing with the fact that I don't have a POS-tagged corpus, I don't have a lemmatized corpus, but I have something that can get me a handle on that. Okay? Now, there are ways that Arabic Corpus provides to help you do a little bit better than that. But to do that, you have to get a little bit geeky. Okay? So this is something that I don't advertise too much, but it's in the instructions, and when people call me or email me asking how can I do this, I tell them. If you are willing to learn regular expressions: regular expressions are something that is in Perl, it's in Unix, it's in most computer languages, and they're not hard, but they're a little tricky. I mean, there's a little bit of a learning curve there. But if you are willing to use regular expressions, I allow you to type in a form, then a space and a dash and a space. And then you type in, in regular expression format, anything that you want to cut out. Okay? And so for example, here, my main offenders are these two words here. And so if I want to investigate dars as a noun, I want to get rid of those. And so I would just say, I don't want any form that starts with a D.
Or maybe I want it to, basically, give me just present tense verbs or whatever. I still am looking for nouns, so it'll give me those with alif-lam, it'll give me the ones with a preposition, but it's not gonna give me any of these potentially ambiguous ones, and that will cut out all those and make it just easier to deal with. This is an example of regular expressions. Regular expressions are not difficult, but I typed in DRS, okay? And I put the dash, and backslash b means word break. So this means I'm trying to cut out just the simple form DRS. And if you can read Arabic, you can see that in fact it doesn't show up. It disappeared there. The reason I allow for typing in transliteration is for the regular expressions. When you type regular expressions in Arabic, it just looks so bizarre. I mean, really, really bizarre. It actually works, but most people just throw up their hands. So it's just easier to do regular expressions with Latin letters. Another possible strategy is looking through that list, so we can look here and say, okay, these forms over here, like darsayn, darsihi, are not real common. Statistically, they're not gonna make much difference unless I happen to be investigating the dual or something like that. But look at this form here. I mean, about half of all the hits are in that one form. And so say I'm comparing this word to some other word. If I said, well, I'm just gonna look at that form, I'm only gonna look at this word with a definite article, then I know I have the noun. I know I have a very common form that statistically is gonna be consistent. And if I compare the same thing with the other noun, then I can be pretty sure, point oh whatever, that those statistics are gonna match the statistics of the overall form, which I can't get without actually reading through the thousand forms. I'm not willing to read through the thousand forms, so I'm willing to use the stand-in ad-dars for the whole entire noun dars.
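The "form, space, dash, space, regular expression" syntax described above might be handled roughly like this. It is a minimal sketch under stated assumptions, not the site's actual implementation: `parse_query` and `run_query` are hypothetical names, matching is a plain substring test on transliterated tokens (the real tool also applies the part-of-speech filter), and the exclusion patterns `^d` and `\bdrs\b` are my guesses at what "starts with a D" and "just the simple form DRS" would look like.

```python
import re

def parse_query(query):
    """Split 'form - regex' into the search form and an optional exclusion pattern."""
    form, sep, exclude = query.partition(" - ")
    pattern = re.compile(exclude.strip()) if sep else None
    return form.strip(), pattern

def run_query(query, tokens):
    """Return tokens containing the form, minus any token the exclusion regex matches."""
    form, exclude = parse_query(query)
    hits = [t for t in tokens if form in t]
    if exclude is not None:
        hits = [t for t in hits if not exclude.search(t)]
    return hits

tokens = ["drs", "aldrs", "wdrs", "drsna", "kdrs"]

# Cut out every form that starts with a d.
print(run_query(r"drs - ^d", tokens))        # ['aldrs', 'wdrs', 'kdrs']

# Cut out only the bare form: \b is a word break on both sides.
print(run_query(r"drs - \bdrs\b", tokens))   # ['aldrs', 'wdrs', 'drsna', 'kdrs']
```

Working in Latin transliteration, as here, keeps the pattern readable; the same patterns typed in Arabic script mix right-to-left letters with left-to-right regex symbols.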
And it turns out quite a few people use that technique. You do the same thing with verbs. You say there's all these ambiguous forms. We remember taqadam, taqadama, yataqadamu, taqadimu. And you go, oh, I can never ever find any information about this form. But if you use yataqadam as a stand-in form, then that reduces the ambiguity by maybe 90%. So you find a stand-in form, one that's very common but also unambiguous, and then use that as your comparison. And you get not all the way there, but part of the way there. Here's an example of stand-in forms. I just gave you the example of looking only for ad-dars, wad-dars, and fad-dars. Those are the only things it's finding. And then comparing it to al-ibra, fal-ibra, and wal-ibra. And that gives you an idea of the relative frequency: ad-dars, the lesson, shows up in this particular section, which is one year of a Jordanian newspaper, almost 600 times, whereas al-ibra, which also means the lesson in a slightly different context, is more like 200 times. So that gives you a feel for the frequency of those two words. To address corpus size and search time, I think I mentioned this before, I basically just allow people to search parts of the corpus one at a time. If time is a factor, you search one part of the corpus. If you're willing to wait three minutes, then you search the whole corpus, if you want that data for whatever reason. The same thing kind of applies to balancing. Although the corpus itself is not balanced, you can hand-balance by choosing to look at, say, only one newspaper, certain literary texts, and the Islamic texts or whatever. You can search different parts of the corpus that would therefore be balanced in your own way. Another fairly common technique that people have used, and that I've used quite a bit, is to count a sample. So for example, in looking for word senses for my dictionary, I have a word and it has three senses.
And I want to be able to tell the person using the dictionary which sense is most common. What is the relative frequency? But there are 2,000 examples even in a single year of a newspaper. And so what I do is I download those examples and use a program that I wrote, which is very fast, that chooses 50 randomly. It just picks out 50 and throws them in a file. I'm willing to read 50. And so I go through those 50 and code them and get the actual relative percentages of the three word senses. And then when I stick it in the dictionary, I make it very clear in the introduction, if anyone's willing to read it, and most of them aren't, that these percentages are not any truth about the world. They're just that sample. But that sample is better than nothing, and it usually rings true somehow. In other words, you look at the sample and you say, yeah, I can feel that that one is probably gonna be a lot more common and that one less common. And so this is an example from my current dictionary project. I have the word majlis. Majlis can mean council, or it can just be a seat, a chair you're sitting on, or a gathering, okay? And you can see that I have 36, 36 and 27. It turns out that in the particular corpus I was using, which is the literature corpus, they were about a third each. Each of them was fairly common. And you would not, I promise you, get that idea if you looked this up in a regular dictionary. You would not realize that majlis was actually pretty commonly used to mean party, get-together, you know, something like that. So I just wanted to show you a few kind of actual usages. Here is a search for majlis, looking through the words before and the words after to get a feel for the different word senses. And so here we get majlis ash-sha'b, okay? Majlis at-ta'awun and majlis al-umum, which all have to do with some kind of a council. But bid-dahk. Only the Arabic speakers are going to understand this, but there's something about laughter.
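The sampling step described above (pick 50 citations at random, code them by hand, report the percentages) is simple enough to sketch. The speaker's own program is not shown in the talk, so this is only an assumed equivalent: `draw_sample` and `sense_percentages` are hypothetical names, and the labels in the example are invented for illustration, not the speaker's data.

```python
import random
from collections import Counter

def draw_sample(citations, k=50, seed=None):
    """Pick k citations at random without replacement (all of them if fewer than k)."""
    rng = random.Random(seed)  # a seed makes the sample reproducible
    return rng.sample(citations, min(k, len(citations)))

def sense_percentages(labels):
    """Turn hand-coded sense labels into rounded percentages of the sample."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {sense: round(100 * n / total) for sense, n in counts.items()}

# 2,000 placeholder citations standing in for one year of a newspaper.
citations = [f"citation {i}" for i in range(2000)]
sample = draw_sample(citations, k=50, seed=0)
print(len(sample))  # 50

# Invented labels, as if a reader had coded four citations by hand.
labels = ["council", "seat", "gathering", "council"]
print(sense_percentages(labels))  # {'council': 50, 'seat': 25, 'gathering': 25}
```

As the talk stresses, the resulting percentages describe only that sample; a different seed gives a different set of 50 and somewhat different numbers.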
And you look up the examples of this and you realize this is the party one. The whole majlis erupted in laughter, okay? And so it allows you to quickly home in on specific word senses if you're looking for examples of something, without having to read through hundreds and hundreds of things. Here's another example I like to use a lot: wilaya. We teach that to students as meaning state, as in wilayat California. And if you look it up in a dictionary or even on Google Translate, it doesn't even manage to tell you that it means term in office, which turns out to be very common, and they don't very often mention the meaning that is used in the kind of Islamic sense of a certain kind of rule. But it turns out that in this particular corpus, al-Hayat, the particular year 1997, wilayat al-faqih, which refers to the Iranian rule of the mullahs, is the most common usage. Then we have wilayat California. But the next most common one is thaniya, al-wilaya ath-thaniya, which can only mean term in office. The second term in office. I know those of you who don't speak Arabic are saying, what the heck? I should stop doing this, but it's just fairly revealing. You're able to kind of get at word senses quite easily with this. Wilayat ar-ra'is, term of the president, et cetera. Also, you look down here and you find that you get California, Texas, Florida, New York, but you don't get any provinces in Egypt and you don't get any provinces in Lebanon. And so we teach students that wilaya means state, but we often neglect to tell them that it has nothing to do with any kind of division in most Arab countries. We're talking about Florida and California. Okay. And then one of the common words that came before wilaya was akbar. And I thought, akbar, why? Akbar means biggest. I couldn't figure out why people would so commonly say the biggest state. And so I looked it up. You can always click on it to see what the citations are.
And of course it turns out that it's the name of somebody, you know. It's Ali Akbar Wilayati, or probably Velayati, because I think he's a Persian mullah or something. And so it's just useful. It allows you quick access to find something that you have a question about and then dismiss it if it turns out to be nothing. Okay. It's very interesting. You can find creative ways of using this corpus to get at grammatical variability. And so for example, at one point I decided to look up tajib. Yajib means it is necessary. It's commonly what we might call... it doesn't necessarily agree with the noun it goes with. The noun it goes with is usually a sentence. But every once in a while you do get tajib. It's variable. Okay. Tajib is conjugated for she instead of for he. Okay. And so this allowed me to find all the tajib citations in a particular newspaper and then look at what the most common citations were. I wanted to be able to compare, to really get some kind of handle on what was more common, yajib or tajib. And so first of all, I looked up tajib and found out that the most common feminine verbal noun that went with it was muraja and the next most common was ishada. Tajib al-ishada or something. Okay. So I decided, well, that's what I'm going to compare. So I got the statistics for tajib al-ishada and for yajib al-ishada, so I could compare them to see which one is more common. And you can see it's not very hard to figure out that yajib al-ishada is quite a bit more common, but still the other one does exist. Okay. And if you're willing to use regular expressions, you can also expand that a little bit. Here I took the top eight feminine verbal nouns and put them all in a regular expression. So I was searching for all of them at once. Okay. So I have ishada, mualaja, munaqasha, muraja, mura'a, muaqaba and muhasaba, which are the most common feminine verbal nouns that might go with tajib or yajib. And so that gave me a better feel.
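Putting several nouns into one regular expression, as just described, amounts to an alternation. The sketch below is an assumption about how such a search could be written offline, not the query typed into Arabic Corpus: `build_pattern` and `count_by_verb` are hypothetical names, the noun spellings are copied from the transliteration used in this talk, and the sample text is invented purely to show the mechanics.

```python
import re
from collections import Counter

# Feminine verbal nouns as transliterated in the talk.
NOUNS = ["ishada", "mualaja", "munaqasha", "muraja", "mura'a", "muaqaba", "muhasaba"]

def build_pattern(verbs, nouns):
    """One regex matching any of the verbs followed by any of the nouns (optional al-)."""
    verb_alt = "|".join(re.escape(v) for v in verbs)
    noun_alt = "|".join(re.escape(n) for n in nouns)
    return re.compile(rf"\b({verb_alt})\s+(?:al-)?({noun_alt})\b")

def count_by_verb(text, pattern):
    """Count how often each verb form occurs with one of the listed nouns."""
    return Counter(m.group(1) for m in pattern.finditer(text))

pattern = build_pattern(["yajib", "tajib"], NOUNS)

# Invented toy text: three qualifying hits and one that should be ignored.
text = "yajib muraja al-qanun wa tajib muraja al-nass wa yajib al-munaqasha wa yajib kitaba"
print(count_by_verb(text, pattern))  # Counter({'yajib': 2, 'tajib': 1})
```

Comparing the two counts for the same set of nouns is what gives the masculine versus feminine agreement figures; pooling several nouns simply gives more hits to compare.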
And I got 19 of the feminine agreement and 53 of the masculine. So that gives me a fairly secure statistic. It's not perfect, but it's a lot better than nothing. And I think that's kind of the point that I'm making in this whole paper. Here is another kind of research project I did at one point, just looking for the future particle, sa or sawfa. You can have the long form sawfa or the short form sa. I used a regular expression to search for them both at once. And the results are really pretty interesting. If you look at al-Hayat for a year, the most common forms are sa-yakun. Oh, and I chose the form yakun because there are too many sa's that mean something other than this. And so I just chose yakun and takun. And look, there's sa-yakun, sa-takun, wa-sayakun, wa-satakun, all the sa's are first. They're the most common. And then all the sawfas are second. Whereas in al-Ahram, they're mixed up. There's a lot of sa's, but there's a lot of sawfas too. Egyptians really do use sawfa a lot more than people in the Levant. I've got a ton of evidence for that. This is a really good place to find rare forms. Okay, so I actually found a couple of yummadors, for example, finally, after years of searching, okay? And it's also really great for finding idiomatic expressions. My students were reading a newspaper article and it says, wa-hadha marbat al-faras. And being a non-native speaker, I'd never seen that before. It means, and that's where you tie up the horse. I mean, I understood it, but I didn't understand it, you know? But I looked up marbat here, and look, I mean, it's just so interesting, it's just such a window on the thing. Marbat al-faras, marbat an-na'ama, that's the ostrich. That's where you tie up the ostrich. Marbat al-khayl. This is a way to find out all the different words that mean a horse in Arabic. Khayl, you know? In other words, all of a sudden you get this feel, and also the frequency, you know?
And in one year, oh, this is all newspapers, but still, you know, 134 of that idiom. So, oh, this is an idiom I wanna learn. It's motivating. The conclusion, I think, is just that with a certain amount of creativity and patience, you know, you can get a lot of good out of fairly simple, non-sophisticated tools. I would love it if someone had the resources, Qatar maybe, or someone who's willing to put in $10 million and get an army of coders to code it up, POS-tag, lemmatize, and really get us, you know, a hundred-million-word corpus that is balanced and works. But until that time, I'm hoping that Arabic Corpus will help a lot of people get a handle on Arabic. Thank you.