Welcome to this deeper dive into named entity recognition. Since named entity recognition is a text mining task, I strongly recommend that you first watch my short introduction to the core concepts of biomedical text mining if you haven't already done so. The goal of named entity recognition is to find things in papers; in other words, to go through text and mark it up. It consists of two subtasks: recognition, which is to see that a name is, for example, the name of a gene or protein, and normalization, also known as grounding, which is to say which gene or protein it is. Since normalization is needed for most actual use cases, the two are often considered one big task and jointly referred to simply as named entity recognition. Today I'll cover how you build the all-important dictionary, how you can use a dictionary for doing named entity recognition, how you can improve the results further by using machine learning, and finally what you can use named entity recognition for.

The dictionary is essential because it defines the universe of entities you're looking for. For each entity, it provides a unique identifier, as well as the multiple different names under which it may be referred to. This means that a dictionary of gene names would tell you that there is something called cyclin-dependent kinase 1, and that CDK1 and CDC2 are two names for the same thing. It would also tell you that there is an organism called Saccharomyces cerevisiae, and that this is also referred to as budding yeast. You want to map all these names to interoperable identifiers, and for that reason you will typically want to start either from an existing database such as UniProt or NCBI Taxonomy, or from an existing ontology such as the Gene Ontology or the Disease Ontology. Even though these resources come with a lot of synonyms, you will want to add additional synonyms from other sources if you can, and even then you have to deal with the fact that your dictionary will never be completely comprehensive.
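To make the structure of such a dictionary concrete, here is a minimal sketch in Python. The identifiers and synonym lists are illustrative placeholders, not taken from a real UniProt or NCBI Taxonomy dump:

```python
# A minimal NER dictionary: each entity gets one stable identifier
# and a list of the names it may appear under in text.
# The identifiers below are made-up placeholders, not real records.
dictionary = {
    "GENE:0001": ["CDK1", "CDC2", "cyclin-dependent kinase 1"],
    "TAX:4932": ["Saccharomyces cerevisiae", "budding yeast"],
}

# Normalization (grounding) needs the reverse mapping:
# surface name -> entity identifier.
name_to_id = {
    name.lower(): entity_id
    for entity_id, names in dictionary.items()
    for name in names
}

# Both synonyms resolve to the same entity, which is exactly
# what normalization requires.
assert name_to_id["cdk1"] == name_to_id["cdc2"]
```

The reverse mapping is what lets the tagger attach a database identifier to every match, rather than just a highlighted string.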
For this reason we use name expansion to create additional names automatically. That handles things like prefixes and suffixes: for example, if you have a human gene name, authors may put an "h" in front of it to point out that it is indeed the human gene and not the mouse ortholog. We also want to handle abbreviated forms, so when NCBI Taxonomy tells you that there is a species called Saccharomyces cerevisiae, you still want to recognize it when it is written as S. cerevisiae. Similarly, you want to handle plural forms, so when the Gene Ontology tells you that there is something called a mitochondrion, you want to generate the plural form mitochondria to also identify that in text.

Once you have built a good dictionary, you can use it for doing named entity recognition. For that you need an algorithm to do flexible matching of names against text. This could, for example, be case insensitive, so that when you have the name CDK1, you can still match it when it is written with different casing of the letters. You may also want to handle spaces and hyphens, so that when you have a name like CDK1 in your dictionary, you can still find it in text when it is written with a hyphen or a space, such as CDK-1. All of this functionality is implemented in the tagger software developed in my lab, which is open source, so you are welcome to go use it. It uses a custom hash table behind the scenes to be very fast: it can handle more than 1,000 abstracts per second using just a single CPU core, and on top of that it is highly parallelized. This approach is also universal: you can build a dictionary for any kind of entities and use it on any text. If you do a good job making your dictionary, you can generally get to something in the order of 70-80% recall. That means your method can find 70-80% of what is mentioned in the text. However, initially you will have terrible precision; that is, it will make a lot of false positives.
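Both ideas, name expansion and flexible matching, can be sketched in a few lines of Python. The rules shown are a tiny, hand-picked subset of what a real system like the tagger implements, and the function names are my own:

```python
import re

def expand_names(name):
    """Generate extra name variants (a small subset of real expansion rules)."""
    variants = {name}
    parts = name.split()
    # Abbreviated species names: "Saccharomyces cerevisiae" -> "S. cerevisiae"
    if len(parts) == 2 and parts[0][:1].isupper():
        variants.add(f"{parts[0][0]}. {parts[1]}")
    # Greek/Latin plural: "mitochondrion" -> "mitochondria"
    if name.endswith("ion"):
        variants.add(name[:-2] + "a")
    return variants

def match_key(text):
    """Flexible-matching key: ignore case and treat spaces/hyphens alike."""
    return re.sub(r"[\s\-]+", "", text.lower())

assert "S. cerevisiae" in expand_names("Saccharomyces cerevisiae")
assert "mitochondria" in expand_names("mitochondrion")
assert match_key("CDK-1") == match_key("cdk 1")  # hyphen/space/case all collapse
```

A production tagger precomputes such keys for every dictionary name and stores them in a hash table, which is how it achieves its speed: each candidate span in the text is reduced to its key and looked up in constant time.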
However, this is fairly easy to fix, because usually a few bad names, such as the human gene named BAD, are responsible for most of the errors. That means that if we just manually identify those errors and create a curated blacklist of bad names that we don't want to use, we can get rid of most of the errors, and suddenly we have a precision of 80% or even 90%. That means that 80-90% of what we find in text is correct.

If we want to get even better, we can complement this with machine learning. Machine learning can look at the shape of the names: is it three uppercase letters followed by a digit, in which case it's likely a gene symbol? Is it three uppercase letters, which could be a gene name but could also be an acronym for something entirely different? Or is it three lowercase letters, in which case it's likely a common English word? You can also look at the context around the names, in other words, the other words around them. So if you have a gene name like SDS and the text says "hepatic SDS expression", you would think that this is clearly a gene name in this context. However, if the exact same name appeared in "Tris-glycine SDS buffer", you would know that this is for sure not a gene name. To learn this, you need an annotated corpus that you then use to train an entity type classifier: a classifier that looks at the words around the name and predicts whether this is likely to be a gene name, a disease name, or something completely different. Most approaches nowadays use deep learning for this. They start from a pre-trained language model such as BioBERT, mask the names, and train a classifier for the specific task. The downside is that you need an annotated corpus, and making one is very labor intensive. Also, these models can do recognition only; they cannot do normalization. They can only look at the context and say that the masked name in the middle is likely to be a gene; they cannot say which gene.
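Both filtering steps can be sketched in a few lines. The blacklist entries and the cue-word "classifier" below are toy stand-ins: a real entity type classifier would be trained on an annotated corpus, for example by fine-tuning BioBERT, rather than hand-written like this:

```python
# Step 1: curated blacklist of problematic names (illustrative entries).
blacklist = {"bad", "was", "can"}

matches = [("BAD", 17), ("CDK1", 53)]  # (matched name, position in text)
matches = [m for m in matches if m[0].lower() not in blacklist]
# Only the unambiguous CDK1 match survives.

# Step 2: toy context check standing in for a learned classifier.
def looks_like_gene(context):
    """Decide from surrounding words whether a match is a gene mention."""
    gene_cues = {"expression", "hepatic", "phosphorylation"}
    non_gene_cues = {"buffer", "gel"}
    words = set(context.lower().split())
    if words & non_gene_cues:
        return False
    return bool(words & gene_cues)

assert looks_like_gene("hepatic SDS expression")
assert not looks_like_gene("Tris-glycine SDS buffer")
```

The point of the toy version is only to show the shape of the decision: the same surface name can be accepted or rejected purely based on its context.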
And finally, deep learning is much slower than dictionary-based named entity recognition. For this reason, it is often used to filter dictionary-based named entity recognition results instead. That way you can improve the already good precision. You also have the further advantage that normalization has already been done thanks to the dictionary-based named entity recognition, so that problem is gone. And last but not least, it allows you to process much less text, because you only need to look at the text around names that were already found by the dictionary, as opposed to having to run all text in your corpus through the deep learning model.

What can we use named entity recognition for? One use is curation support. If you're a curator making, for example, the UniProt database, it can be useful to use a tool like EXTRACT, which produced this web page where the names are marked up, and where you can then click on the names and get pop-ups with useful information about the entities, including the database identifiers, which can make annotation faster. You can also use it to improve information retrieval. The way that works is that you get synonym information and ontology structure, so that if you search, for example, for psychiatric diseases, you will also find papers about mental disorders and papers about specific mental disorders like major depressive disorder. Finally, and certainly not least, you can use named entity recognition for relation extraction; in fact, it is a prerequisite for doing relation extraction. That is what I will cover in the next presentation. Thanks for your attention, and I hope that you will watch that one too.
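To tie the pieces of this presentation together, here is a minimal sketch of the combined pipeline described above: dictionary matching first, then an expensive classifier applied only to small context windows around the dictionary hits. The classifier is just a placeholder argument here; in practice it would be a trained model:

```python
def dictionary_tag(text, names):
    """Naive dictionary matching: find each name's first occurrence."""
    hits = []
    for name in names:
        pos = text.find(name)
        if pos != -1:
            hits.append((name, pos))
    return hits

def filter_hits(text, hits, classifier, window=25):
    """Run the slow classifier only on context windows around hits,
    not on the whole corpus."""
    kept = []
    for name, pos in hits:
        context = text[max(0, pos - window): pos + len(name) + window]
        if classifier(context):
            kept.append((name, pos))
    return kept

text = "We measured hepatic SDS expression in mouse liver."
hits = dictionary_tag(text, ["SDS"])
# Stand-in classifier: a real one would be a fine-tuned language model.
kept = filter_hits(text, hits, lambda ctx: "expression" in ctx)
# SDS is kept because its context window contains "expression".
```

Because `filter_hits` only sees a short window per hit, the expensive model processes a small fraction of the corpus, which is exactly the speed advantage mentioned above; and since the dictionary already grounded each name to an identifier, normalization comes for free.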