Hello everyone, and welcome to our talk on a corpus for Palestinian and Lebanese. I am Mustafa Jarrar, and with me is my colleague Karim El Haff. We did this work with our colleagues Tymaa Hammouda and Fadi Zaraket. Before I start, I would like to mention that at Birzeit University we have released several lexical resources, which are available online, including the Arabic Ontology (or WordNet) and a very large lexicographic database, in addition to several morphologically annotated dialect corpora [unclear]. The goal we want to talk about in this presentation is to provide a morphologically annotated corpus for the Levantine dialect, which is spoken in this area. As we know, Arabic dialects have no standard orthography; only Classical Arabic or Modern Standard Arabic has one [unclear]. This makes processing the dialects challenging [unclear]. Now I will hand over to Karim to present the Lebanese corpus.
Hello everyone, thank you for tuning in. When we created the Baladi corpus, which we used to extend the Curras corpus into a more Levantine corpus, we collected about 9.6 thousand tokens, divided into 424 sentences, from many different sources: Facebook posts on social media, which have a more colloquial tone; blog posts written in the Lebanese dialect, which have a more informative tone; and traditional poems of the Zajal tradition in Lebanon. To annotate our corpus, we populated a smart Google Sheet with it, and it was annotated by four annotators over a period of 10 months. We have here an example of one annotated word with many different features, which we will talk about in detail in the next slides. So for example, this is one chunk of our annotated corpus.
We start with the token, which is the word itself as it was extracted. We made a conscious decision to work with the dialect written in the Arabic script, because people sometimes use the Latin script, or Arabizi as we call it. For convenience, we chose to work primarily on the Arabic script. The first feature that we annotated is the CODA. CODA is the conventional orthography for dialectal Arabic, a proposed orthographic system that was also used for the Palestinian dialect. We use it to unify and standardize the spelling inside the corpus, because people tend to write a non-standardized language or dialect in many different ways, and for computational purposes it is important to standardize writing and spelling. Then we annotated the affixes, that is, the prefixes and the suffixes, the morphemes that come before and after each word. The prefixes are mostly identical in Lebanese and Palestinian, but some morphemes differ due to regionalism, such as interrogative particles, where Lebanese uses one form and Palestinian another. The suffixes are also identical between Lebanese and Palestinian, but we noticed one striking difference: the plural in the second and third person. It is most noticeable in the ending of that plural, where Lebanese speakers use the letter N and Palestinians mostly use the letter M. In Northern Palestine, however, the varieties use the N like the Lebanese, so we can see the continuum there. That was one interesting feature that we noticed in this work. Then we have the stem. Basically, we use the tags used in SAMA, which is a morphological database developed by LDC.
In the stem field, we wrote the word stripped of its affixes and then added its part of speech from the SAMA tag set. Then we annotated some fairly standard features: the part of speech; the person, which can be first, second, or third; the aspect, which for verbs can be P (perfective), I (imperfective), or C (command); the gender, which is masculine, feminine, or none; and the number, which is singular, plural, or, as is common in many Semitic languages, dual. Then we annotated the Modern Standard Arabic lemma equivalent of the dialect word that we are annotating. We use the lemmas from the SAMA database, and if the lemma does not exist in SAMA, we wrote our own lemma and added a zero to mark it as a new solution. Then we annotated the dialect lemma. Some words in Levantine, whether in Lebanese or in Palestinian, naturally do not exist in Standard Arabic, so we had to write a new lemma for such a word that is exclusive to the dialect. We put it side by side with the Modern Standard Arabic lemma to show the equivalences and differences, which will be an interesting feature to look at in the future. Then we have the frequent functional words in Lebanese as opposed to some frequent functional words in Palestinian. They are mostly identical, but we detected some differences, mainly due to regionalisms, that we can see in this table. For example, Lebanese and Palestinian speakers use different words for "yes". Such examples are limited but quite noticeable, because of regionalisms and the dialect continuum. I will now pass the microphone to my colleague, Dr. Jarrar.
Thank you, Karim. To evaluate the annotations, we randomly selected some sentences, about 400 tokens. We re-annotated them and measured the inter-annotator agreement with kappa, and we reached good results: about 78.5.
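As a minimal sketch of how an agreement figure like the kappa mentioned above might be computed, the following assumes Cohen's kappa for two annotators who labeled the same tokens. The talk does not say which kappa variant was used, and the tag values below are made up for illustration, not taken from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical part-of-speech labels for six tokens (illustration only).
annotator_1 = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
annotator_2 = ["NOUN", "VERB", "ADJ", "ADJ", "NOUN", "NOUN"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually lower than the plain percentage of matching labels.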
We also had an expert re-annotate the 400 tokens, and we compared the expert's annotations with the original annotations from all annotators. We reached an F1 score of 90%. Now I will move on to the revisions of Curras. Curras was originally published a few years ago, but because it was used in several applications, and especially when we came to reuse the annotations for the Lebanese corpus, we found some mistakes and some things we wanted to improve. The improvements we made are the following. First of all, we went through all the tokenization and POS tags and made sure that they are correct. We also revisited the lemmatization, which was our main focus: the MSA lemmas and the dialect lemmas. We made very sure that every single lemma is either mapped to SAMA or marked with an underscore and a zero, to say that it is MSA but does not exist in SAMA, and that it is linked with the dialect lemma. We revisited other features as well, so altogether it was a revision of almost all annotations. At the end, we produced a table called the solution table, which contains the unique solutions. A word can appear with the same annotations several times in the corpus, so we removed the repeated annotations and generated the solution table. We used this solution table while annotating the Lebanese corpus. By doing this, both corpora, Baladi and Curras, can be used as one compatible corpus. Both corpora are available at this link, so people can search and retrieve all annotations and see whether each one comes from Lebanese or Palestinian, and they are also available for NLP research. To summarize, we presented a new morphologically annotated Lebanese corpus, and we revisited an existing morphologically annotated corpus. Together, they form a more Levantine corpus. The details can be found on the web. That's the end of our talk.
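The solution table described above can be sketched as follows. This is a minimal illustration, assuming each annotation is a flat record; the field names (`token`, `stem`, `pos`, `msa_lemma`, `dialect_lemma`) and the placeholder values are hypothetical and are not the actual columns or entries of Curras or Baladi.

```python
def build_solution_table(annotations):
    """Collapse repeated annotations into unique solutions.

    Two annotations count as the same solution when every field matches.
    First-seen order is kept, and each solution records how often it occurred.
    """
    table = {}
    for row in annotations:
        key = tuple(sorted(row.items()))  # hashable identity of the full record
        if key not in table:
            table[key] = {"solution": dict(row), "count": 0}
        table[key]["count"] += 1
    return list(table.values())


# Placeholder records (hypothetical values, for illustration only).
rows = [
    {"token": "tok_a", "stem": "stem_a", "pos": "NOUN",
     "msa_lemma": "lemma_a", "dialect_lemma": "lemma_a"},
    {"token": "tok_b", "stem": "stem_b", "pos": "VERB",
     "msa_lemma": "lemma_b", "dialect_lemma": "lemma_b"},
    {"token": "tok_a", "stem": "stem_a", "pos": "NOUN",
     "msa_lemma": "lemma_a", "dialect_lemma": "lemma_a"},
]
solutions = build_solution_table(rows)
for entry in solutions:
    print(entry["solution"]["token"], entry["count"])
```

Keeping one row per unique solution means a word that recurs with the same analysis can be looked up and reused when annotating new text, which is how the table is described as being used for the Lebanese corpus.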
Thank you very much for listening. We are happy to answer any questions by email or directly. Thank you very much.
Thank you very much. It will be interesting to see how this kind of corpus is used in future projects, because dialects are mostly under-resourced, and it would be sad to keep them that way. It would be nice to see what the future holds for dialects such as our native tongues, which are not usually represented in research. Hopefully many good things can be done, such as automatic translation, or pedagogy to teach Levantine as a second language to people in the diaspora, because we have such a huge diaspora worldwide as Palestinians, Lebanese, Syrians, and Jordanians in general. So hopefully this can be the start of something nice for this area of research. Thank you very much for tuning in, and we hope to see you in Marseille. Bye bye.