Hello. In this talk I will present our paper, in which we introduce a corpus and a model for Arabic nested named entity recognition. I am Mustafa Jarrar, and this is joint work with my colleagues Mohammed Khalilia and Sana Ghanem at Birzeit University. Before I begin, allow me to briefly introduce my research group, in which we have released several linguistic resources that we developed and made available online: a lexicographic database with many lexicons, an Arabic ontology, and several corpora covering MSA and dialects, some of them annotated. In addition, a large number of linguistic APIs give access to our data and services, such as WSD and NER and other natural language processing tasks.

Okay. So the problem we tackled in the paper is the following. We have seen mature results in NER systems for most languages: recognizing entities such as, in "Sami works at the Jimmy Carter Center", that Sami is a person and the Jimmy Carter Center is an organization. This is called flat NER, and it is quite mature for most languages; people have achieved good results. However, in nested NER we want to recognize entities inside other entities. So, in addition to saying that the Jimmy Carter Center is an organization, we want to recognize that Jimmy Carter is a person. Building such corpora and annotating them manually is challenging, and it is also challenging to train BERT models on them. Existing Arabic NER corpora actually support only flat annotations like this; they are small, they support only a limited number of classes or entity types, and they focus only on Modern Standard Arabic.

In this research, we present a corpus and a fine-tuned model. Compared with others, ours is the first Arabic corpus that supports nested named entities, the largest in terms of number of tokens, and the richest in terms of number of entities, supporting 21 entity types or classes. The corpus consists of MSA and dialect text and covers multiple domains.
The domains include media, history, culture, health, finance, ICT, law, elections, politics, migration, and tourism. The corpus is distributed with nested annotations, and it can also be used as a flat corpus. For the fine-tuned model, we took an Arabic pretrained BERT model, trained it on this corpus, and achieved an 88.4% F1 score.

The corpus was collected from three types of sources. Web articles make up almost half of the corpus. We also crawled the Palestinian Digital Archive, which covers history, culture, and other topics; this part of the corpus is very rich in nested entities, which is why we included it. In addition, we have social media text written in dialect. In total, the corpus contains 550,000 tokens.

The 21 entity classes we support are: person, group of people (NORP), occupation, organization, geopolitical entity, geographical location, facility, event, date, time, language, website, law, product, cardinal and ordinal numbers (which we distinguish), percentage, quantity, unit, money, and currency.

This is an example of an annotated sentence: منح مدير بنك القاهرة مبلغ مليون جنيه لاتحاد العاملين بجامعة القاهرة لدعم ميزانية 2022, that is, "The manager of Cairo Bank awarded one million pounds to the employees' union at Cairo University to support the 2022 budget." If you look here, منح ("awarded") is tagged O. "The manager of Cairo Bank" is an occupation; inside it, "Cairo Bank" is an organization, and "Cairo" is a geopolitical entity. "One million pounds" is money, and "pound" is a currency. "The employees' union at Cairo University" is, all of it, one organization, but "Cairo University" is another organization mentioned inside it, and "Cairo" is again a geopolitical entity. "2022" is a date. Please note that here we have entities of the same type nested inside each other, I-ORG inside I-ORG; we will come back to that.
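The same-type nesting just pointed out can be made concrete with a span-based encoding of the Cairo Bank sentence. The English token glosses and the (start, end, type) representation below are illustrative assumptions for readability, not the corpus's actual format:

```python
# The talk's Cairo Bank example, glossed in English for readability.
# Hypothetical span encoding: (start, end_exclusive, type), outer and inner spans together.
tokens = ["awarded", "manager", "Cairo", "Bank", "one", "million", "pound",
          "employees", "union", "at", "Cairo", "University", "to-support",
          "budget", "2022"]
entities = [
    (1, 4,  "OCC"),    # "the manager of Cairo Bank"
    (2, 4,  "ORG"),    # "Cairo Bank"
    (2, 3,  "GPE"),    # "Cairo"
    (4, 7,  "MONEY"),  # "one million pounds"
    (6, 7,  "CURR"),   # "pound"
    (7, 12, "ORG"),    # "the employees' union at Cairo University"
    (10, 12, "ORG"),   # "Cairo University", an ORG inside an ORG
    (10, 11, "GPE"),   # "Cairo"
    (14, 15, "DATE"),  # "2022"
]

def same_type_nestings(entities):
    """Pairs (outer, inner) where inner lies inside outer and shares its type."""
    return [(o, i) for o in entities for i in entities
            if i != o and o[2] == i[2]
            and o[0] <= i[0] and i[1] <= o[1]]

print(same_type_nestings(entities))
# [((7, 12, 'ORG'), (10, 12, 'ORG'))]
```

Running the helper surfaces exactly the case the talk highlights: an organization mention fully contained in another organization mention, the pattern that occurred 576 times in the corpus.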
Fourteen people participated in the annotation, two of them NER experts. The annotation was performed using Google Sheets (actually just a simple setup) over eight months, and we annotated the corpus in three phases. In the first phase, we asked the annotators to annotate the corpus fully manually. Then the experts reviewed the annotations and gave feedback to the annotators for revision. Finally, we fine-tuned the model, used it to predict tags, manually compared the predicted tags with our annotations, and corrected the mistakes we found; we did this twice. The idea of this third phase was to find missing annotations.

Okay. These are the counts of each entity type in the corpus, whether flat or nested. As you can see, we have about 7,000 mentions of persons; about 700 of them are nested, and the rest are flat. Groups of people (NORP) number about 5,000. The highest counts are geopolitical entities, locations, organizations, persons, NORP, and ordinals. The lowest counts are product and quantity/unit; money and currency are also not very high. Overall, 22% of the entities in the corpus are nested. We faced 576 cases of entities nested inside entities of the same type, like an organization inside another organization: "the employees' union at Cairo University" is one entity, and "Cairo University" is an entity inside it. Such cases were challenging to annotate and also to train on.

To evaluate inter-annotator agreement, we chose some sentences randomly, about 24,000 tokens, and asked the annotators to re-annotate them. We calculated Cohen's Kappa including the O tag, because most of the corpus is naturally tagged O. We were afraid this would overestimate the agreement, but it did not: the calculation without O gives almost the same result.
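The with-O versus without-O Kappa comparison described above can be sketched as follows. The tag sequences here are a tiny toy sample, not the paper's re-annotation data; the point is that Cohen's Kappa discounts the agreement expected by chance, so a dominant O class inflates both observed and expected agreement:

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's Kappa between two annotators' tag sequences."""
    assert len(a) == len(b)
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[t] * cb[t] for t in set(a) | set(b)) / (n * n)  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

# Toy re-annotation sample: mostly O, as in real NER corpora.
ann1 = ["O"] * 16 + ["B-PERS", "I-PERS", "B-ORG", "B-GPE"]
ann2 = ["O"] * 16 + ["B-PERS", "I-PERS", "B-ORG", "B-DATE"]

k_with_o = cohen_kappa(ann1, ann2)

# Excluding positions where both annotators wrote O:
pairs = [(x, y) for x, y in zip(ann1, ann2) if not (x == y == "O")]
k_without_o = cohen_kappa([x for x, _ in pairs], [y for _, y in pairs])

print(round(k_with_o, 3), round(k_without_o, 3))  # 0.858 0.692 on this toy sample
```

On this artificial sample the two values differ; the talk's finding is that on the real 24,000-token sample they came out almost the same.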
We also computed another measure, F1 score, which gave similar results: about 98% agreement.

That was the corpus. Now I will present the fine-tuning of BERT with nested entities. We fine-tuned an existing pretrained model, AraBERTv2, using multi-task learning: we built multiple classifiers, one for each of the 21 entity types, using these parameters. We divided the corpus into 70% for training, 10% for validation, and 20% for testing. We achieved the results shown here, an overall F1 score of 88.4%. For some of the tasks the results were very high, like geopolitical entities, persons, dates, ordinals, money, and currency; laws and organizations also have really high numbers. But there are some low numbers, like product, unit, and quantity, because we have very few such entities in the corpus. Last but not least, I would like to mention that nested entities of the same type were not supported in the training.

This is a link if you want to try our model online. It is a web service with different output formats, JSON and XML, and you can also see the entities highlighted.

To sum up, we presented in this paper a corpus and a model. The corpus supports nested entities; it is large and rich in terms of the number of entities and the number of entity classes; it covers MSA and dialect text and multiple domains, with high inter-annotator agreement. The model achieves an 88.4% F1 score. The code, the data, and the demo are available at this link. Okay, this is the end of my talk. Thank you very much for listening, and I am happy to take some questions. Thank you.
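The per-type and overall F1 evaluation reported in the talk can be sketched with span-level precision, recall, and F1, micro-averaged over all mentions. The gold and predicted spans below are toy data for illustration, not the paper's results:

```python
def prf1(gold, pred):
    """Span-level precision, recall, F1 from sets of (start, end, type) spans."""
    tp = len(gold & pred)                       # exact-match true positives
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy gold/predicted spans pooled over a test set, one entry per entity mention.
gold = {(0, 2, "PERS"), (5, 7, "ORG"), (9, 10, "GPE"), (12, 13, "DATE")}
pred = {(0, 2, "PERS"), (5, 7, "ORG"), (9, 10, "ORG")}

# Per-type scores, as in the paper's per-task results ...
for etype in sorted({t for *_, t in gold | pred}):
    g = {s for s in gold if s[2] == etype}
    q = {s for s in pred if s[2] == etype}
    print(etype, prf1(g, q))

# ... and the overall micro-averaged F1 over all mentions at once.
p, r, f = prf1(gold, pred)
print("micro-F1:", round(f, 3))  # 0.571 on this toy data
```

Types with few mentions, like product or unit in the corpus, get noisy per-type scores under this scheme, which is consistent with the low numbers the talk mentions for rare classes.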