Cool. All right. Thanks for the introduction. So, again, my name is Dustin Wright, and I'm really happy to be able to speak to you all today about our work, which we presented at AKBC earlier this year. We were very happy to receive a best application paper award there. Just to reiterate, this is a collaboration between UC San Diego and IBM as part of the AI Horizons Network. So let's get into it.

The group I was working with at UCSD is the Center for Microbiome Innovation. A lot of what we focus on is the human microbiome and what kinds of AI solutions and tools we can use to understand it better. One of the major issues with studying the human microbiome is that the amount of literature is growing exponentially. There's hugely increased interest in the field, and there has been a lot of really great work associating the microbiome with different aspects of human health, and with that comes an exponential increase in the number of publications. So one thing that was identified is that it would be really useful to have an easily searchable, sortable database of facts related to the microbiome that come from the literature. At a very high level, the group is working on building structured information from all of this unstructured text data in the form of a knowledge base.

For anyone who doesn't know, a knowledge base is essentially a graph database which contains nodes and edges. The nodes are the particular entities we're interested in, for example different types of bacteria or different diseases, and the edges are relationships between them. To go from a textual example to how it would look as structured data, take a sentence from the literature that says, "Crohn's disease is associated with bacterial dysbiosis; it frequently includes colonization by adherent-invasive E. coli." The nodes in this example would be the disease and the bacterium: Crohn's disease and adherent-invasive E. coli (AIEC). The sentence is saying that colonization by adherent-invasive E. coli is associated with Crohn's disease, so you have an edge saying that an increase in AIEC is associated with Crohn's disease.

Being able to build this structure from loose text is really useful for people in the microbiome community. For example, clinicians can use it to help identify potential conditions, nutritionists can use it to potentially help develop nutritional supplements, and researchers can use it to find unseen connections for hypothesis generation. (Audience: I notice the word "colonization" is not highlighted, but it seems really important in that sentence.) Right. The first step is just looking at association, whether it's a positive, negative, or neutral association; colonization would be the trigger word for that relation. Anyway, that's a high-level view of what this is useful for.

There are really three major components to knowledge base construction. There's recognizing entities in text, which is just saying that a particular span is a disease or a bacterium. There's finding associations between them, otherwise known as relation extraction.
So, being able to see that, for example, an increase in AIEC is associated with Crohn's disease. And then there's what's known as linking or normalization; it goes by a ton of different names. That's taking the entities you found in text and grounding them in an ontology, normalizing across the different ways that you can name things.

The main focus of this work was on the particular problem of normalizing disease entity names. What that entails is essentially mapping names that we found in text, and which we know are diseases, to some concept ontology. In our case, we rely heavily on the CTD MEDIC disease dictionary, which is a structured ontology: essentially a hierarchy of diseases that comes with preferred names, synonyms, definitions, all of this data which is very useful for helping to build our models. And the main thing is that they're all grounded in a single concept ID. So essentially the task is: if we have a sentence like "Adherent-invasive E. coli is a disease-associated bacterium often found in CD," and we know that CD is a disease, we need to know which particular disease it is. In the context of the MEDIC disease dictionary, it could be celiac disease or it could be Crohn's disease. So we need to be able to use different features of the text, different contextual features, and the features in the ontology to understand what the particular disease is.

The state-of-the-art methods prior to our work were mostly based on feature engineering and shallow learning techniques. The two major state-of-the-art systems were DNorm and TaggerOne, which are both from the same group, and they do normalization in very similar ways. They use TF-IDF vectors, which are essentially engineered features based on frequency counts of tokens in text, and they learn a similarity matrix: they basically learn how to rank the similarity between two different TF-IDF vectors, for example one coming from text and one coming from a name in the ontology. Whatever is the closest name in the ontology is what they normalize the disease to.

Given this, the major research questions we had in this work were, first, how can recent advances in deep learning be used for this problem of disease name normalization? There had been a few attempts before us, but most of them didn't show very promising results; they couldn't quite be as performant as the shallow learning techniques. So we were interested in how we could approach this problem from a neural net perspective and see performance gains, or at least achieve state of the art. The second question is, what aspects of language can be useful for the problem of disease name normalization? This really goes hand in hand with how a neural model can be used for this problem: if we're using a neural model, what are the features that are useful for it? And then finally, and this is a critical problem we needed to investigate when using neural nets on this problem, given the lack of labeled training data in this domain, what are the sources of data that can be used to help improve the performance of deep models?
As probably many of you know, deep models are very data hungry, and we need tens of thousands to hundreds of thousands of samples to perform well on one of these problems. Within this domain, there's just not a lot of training data, so that's one of the major questions we had in this work. For the rest of the talk, I'll go through the first two points in depth, then I'll talk about the third question, and then I'll talk about how we evaluate our model and what datasets we tested it on.

Starting with the first two, we took a two-fold approach in terms of the linguistic features we were interested in. Essentially we're using semantic features, which are very popular in the context of neural models; these take the form of word embeddings and a very simple compositional phrase model. And then contextual features, which have also become increasingly popular in the field recently. In terms of contextual features, there are really two different types we could use. There are local features, which are based on the immediate text surrounding a mention, and then global features, which are what we were looking at. So we didn't use local features; we used global features. The global feature we use is disease coherence: essentially we're trying to normalize the diseases within a piece of text to a coherent set of diseases. This is motivated by the idea that Crohn's disease is much more likely to appear in context with celiac disease than it is with something like brain tumors or congenital abnormalities.

In terms of the architecture, our workflow is that we start with unstructured text and tag it with some external taggers to get all the diseases within the text. We then get semantic features for each of the entity mentions that we find: we tokenize them, get word embeddings, and compose them into one vector representation of the phrase for each entity in the document. Then we have a coherence model, where we pass each of those phrase representations through the coherence model to get a new representation for each mention based on the surrounding mentions. And finally, we use those representations to ground each entity by classifying it into its concept. I'll go a bit more in depth into what each of these models is, as well as how we perform classification.

The very first step is to get a vector representation of the phrase. We actually use a very simple compositional phrase model. This was motivated by work from Felix Hill in 2016, which basically showed that using a bag-of-words model and just training word embeddings for tasks like phrase similarity, paraphrase identification, and learning sentence representations is actually a very strong approach. So we start with this very simple compositional phrase model where we take the word embeddings, perform a bag-of-words summation over them, and then pass the result through a linear projection. This gives us our phrase representation.
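To make that concrete, here is a minimal PyTorch sketch of a bag-of-words phrase encoder along these lines. The class name, dimensions, and token ids are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class PhraseEncoder(nn.Module):
    """Bag-of-words phrase encoder: sum the word embeddings of a mention,
    then pass the result through a single linear projection."""

    def __init__(self, vocab_size, embed_dim, phrase_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.projection = nn.Linear(embed_dim, phrase_dim)

    def forward(self, token_ids):
        # token_ids: (batch, max_mention_len), with 0 used for padding
        vectors = self.embeddings(token_ids)   # (batch, len, embed_dim)
        summed = vectors.sum(dim=1)            # bag-of-words composition
        return self.projection(summed)         # (batch, phrase_dim)

# Hypothetical usage: encode one mention given made-up token ids.
encoder = PhraseEncoder(vocab_size=20000, embed_dim=200, phrase_dim=200)
mention = torch.tensor([[17, 342, 9]])         # e.g. "crohn 's disease"
phrase_vec = encoder(mention)                  # one vector per mention
```

The summation makes the encoder order-insensitive, which is usually fine for short disease mentions and keeps the number of parameters very small.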
The next step is our coherence representation. Previous work that has looked at coherence tries to model it using the joint probability of all the concepts that appear in a particular piece of text. The problem with this is efficiency: either you have to solve an NP-hard problem, where you're modeling the joint probability of all the diseases within a piece of text over all the different concepts in your ontology, or you have to use hacks like loopy belief propagation in a CRF. So we were motivated by how we could get around this NP-hard problem. The way we model it is actually just a single bidirectional GRU layer. Essentially, we take all the phrase representations for a particular document and pass them through the bidirectional GRU, and at the output we have a new representation for each mention which is based on all of the surrounding mentions.

The last step is classification. The classic way you would perform classification for this problem is to just use a softmax. But one of the things we were also motivated by is the data problem: how can we get around the lack of training data and learn representations of all of the concepts in the ontology, many of which never appear in training? Based on that, we decided to perform classification by modeling a concept embedding space. The concept embedding space is essentially a learned representation for every concept in the ontology: there is a vector representation for Crohn's disease, there is a vector representation for celiac disease, and we learn these representations while we train the phrase and coherence models. Classification is then performed by taking the representations from the phrase and coherence models and classifying a mention as whichever concept is the closest point in that concept embedding space. The embedding space and the rest of the model are trained together.

By doing this, we can't just use cross entropy as our loss; we need a different loss. So we use a modified margin ranking loss. Essentially, the margin ranking loss takes your predicted representation of a disease and the ground-truth representation from the concept ontology, say celiac disease, and pulls them closer together, and then takes a bunch of negative examples and pushes those representations further apart. The modifications we use are covered in the blog post if you're interested in seeing at a deeper level what they are, but at a high level that's essentially what we're doing.

That's all the modeling. In terms of implementation, we use PyTorch. We pretrained all of our word vectors using word2vec on all of PubMed and PubMed Central, so we get starting representations based on biomedical text. We initialize our concept embedding space using the ontology; this is actually one of the benefits we get by performing classification this way, because all of the embeddings in our concept embedding space can be initialized using the names in the ontology. Essentially, we take the preferred name of every concept in the ontology, do a summation of the word embeddings for that name, and initialize the concept embedding space that way.
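Here is a hedged sketch of how the coherence model, the concept embedding space, and the margin ranking loss could fit together around that phrase encoder. The class and function names, the Euclidean distance, and the random negative sampling are assumptions for illustration; the actual implementation differs in its details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoherenceModel(nn.Module):
    """Single bidirectional GRU over the sequence of phrase vectors in one
    document, producing a context-aware representation per mention."""

    def __init__(self, phrase_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRU(phrase_dim, hidden_dim, batch_first=True,
                          bidirectional=True)
        self.projection = nn.Linear(2 * hidden_dim, phrase_dim)

    def forward(self, phrase_vecs):
        # phrase_vecs: (1, n_mentions_in_doc, phrase_dim)
        outputs, _ = self.gru(phrase_vecs)
        return self.projection(outputs)        # (1, n_mentions, phrase_dim)

class ConceptSpace(nn.Module):
    """One learned embedding per ontology concept. A mention is classified as
    the concept whose vector is nearest to the mention representation."""

    def __init__(self, n_concepts, phrase_dim, init_vectors=None):
        super().__init__()
        self.concepts = nn.Embedding(n_concepts, phrase_dim)
        if init_vectors is not None:
            # e.g. summed word embeddings of each concept's preferred name
            self.concepts.weight.data.copy_(init_vectors)

    def distances(self, mention_vecs):
        # Euclidean distance from each mention vector to every concept vector.
        return torch.cdist(mention_vecs, self.concepts.weight)

def margin_ranking_loss(dists, gold_ids, margin=1.0, n_negatives=10):
    """Simplified margin ranking objective: make the gold concept closer to
    the mention than randomly sampled negatives, by at least `margin`."""
    pos = dists.gather(1, gold_ids.unsqueeze(1))                   # (batch, 1)
    neg_ids = torch.randint(0, dists.size(1), (dists.size(0), n_negatives))
    neg = dists.gather(1, neg_ids)                                 # (batch, k)
    return F.relu(margin + pos - neg).mean()
```

At test time the prediction is just `distances(...).argmin(dim=1)`, and, as described next, the phrase-only and coherence-based predictions can be blended with a learned weight.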
And then finally, we combine the predictions from the phrase model and the coherence model using a learned parameter. The motivation behind doing this is that we can use the names from the ontology to train the phrase model on its own, and the reason we want to do that is so that we can learn how to classify every concept in the ontology and not just the ones we see in the training data, which is very, very sparse. So we can train the phrase model on its own using the names in the ontology, and then we combine the predictions from both the phrase model and the coherence model to perform classification.

That first part was all the modeling and the first two research questions about which linguistic features we were interested in. The next part is data augmentation. Essentially, the problem is that these datasets have very limited training data, on the order of a few thousand examples. To train a neural model on this, we have to perform some sort of data augmentation in order to learn good representations. To overcome this problem, we use two techniques to augment our training data, both of which start from the same dataset. We use the BioASQ dataset, and the nature of that data is that it's PubMed abstracts, which is the same kind of data that we use for training and evaluation. But the mentions themselves aren't labeled; the documents are labeled with the concepts that appear in them. So, for example, a document might be labeled so that we know Crohn's disease and celiac disease both appear in it, but we don't know where they appear. Starting from this data, we wanted to see how we could leverage it to augment our training sets.

The first way is by using distant supervision. Essentially, the way that works is we take all of the concepts which we know appear in a document, take the names for those concepts from the ontology, and do a plain dictionary match: we match names and synonyms to spans in the text and get labeled data that way. The second method is using synthetically generated data. We take the concepts which we know appear within documents and compute co-occurrence statistics for all of them. Then, for each iteration of training, we generate random samples over the concepts in the ontology based on those co-occurrence statistics, so we get sets of concepts that are, at least allegedly, coherent.

Going into the numbers, these are the statistics for the datasets we evaluate on, and if you take a quick look you can see that there's just not a lot of training data. The NCBI Disease Corpus has only 793 articles and fewer than 7,000 annotations, and the BioCreative Chemical Disease Relation corpus has about 1,500 articles and fewer than 13,000 annotations. So, just not a lot of training data if you want to train a neural model. In terms of mentions as well, one thing that's really interesting and isn't reflected in this table is that between training and test there's a huge disparity in the concepts that appear. In the test set, for example in BioCreative, about 30% of the disease concepts never appear in training, so our models can essentially learn nothing about them.
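To make the two augmentation strategies described above concrete, here is a rough Python sketch, assuming we already have document-level concept labels (as in the BioASQ data) and a dictionary `concept_names` mapping each concept ID to its ontology names and synonyms. The exact matching and sampling procedures here are simplified assumptions.

```python
import random
from collections import Counter
from itertools import combinations

def distant_supervision(doc_text, doc_concepts, concept_names):
    """Label mention spans by dictionary-matching the ontology names/synonyms
    of the concepts known (from document-level labels) to appear in the doc."""
    examples = []
    lowered = doc_text.lower()
    for concept_id in doc_concepts:
        for name in concept_names[concept_id]:
            start = lowered.find(name.lower())   # first occurrence only, for simplicity
            if start != -1:
                examples.append((doc_text[start:start + len(name)], concept_id))
    return examples

def cooccurrence_counts(documents):
    """Count how often pairs of concepts are labeled on the same document."""
    counts = Counter()
    for doc_concepts in documents:
        for a, b in combinations(sorted(set(doc_concepts)), 2):
            counts[(a, b)] += 1
    return counts

def sample_synthetic_doc(anchor_concept, counts, size=5):
    """Generate one synthetic 'coherent' set of concepts by sampling
    neighbours of an anchor concept according to co-occurrence weights."""
    neighbours = [(pair[0] if pair[1] == anchor_concept else pair[1], c)
                  for pair, c in counts.items() if anchor_concept in pair]
    if not neighbours:
        return [anchor_concept]
    concepts, weights = zip(*neighbours)
    return [anchor_concept] + random.choices(list(concepts), weights=weights,
                                             k=min(size, len(concepts)))
```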
Some insight into why the shallow learning techniques cope with this train/test disparity is that they're entirely based on surface text features. With TF-IDF you're basically just learning how to map between frequency counts of tokens, so those methods are a bit more robust to this kind of disparity between training data and test data. With our distant supervision, though, we get a huge boost in the diversity of concepts that we train on, as well as in the number of examples we can train on. So this gives us a huge advantage when using a neural model, compared to using just the original training data.

Alright, so in the next part of the talk I'm going to go into how we evaluate the model, the metrics we use, and how our model performs in comparison to the baseline models. The first two metrics we use are micro-F1 and accuracy, which are the standard metrics for this task. Basically, micro-F1 asks which concepts within a document we recognize, and accuracy asks, assuming a perfect tagger, how many of the predictions are normalized correctly. Very simple. One issue we saw with accuracy, though, is that we have this concept ontology, so there is a way we can measure prediction quality, as opposed to just whether a prediction is correct or not. So we came up with this normalized LCA distance metric, which is essentially asking that question.

I'll give you an example of how it works. Say our ground-truth concept is eye abnormalities, and we have two different taggers. The first tagger predicts the disease is Crohn's disease, which has nothing in common with eye abnormalities except for the fact that they're both diseases. The second tagger predicts the concept is congenital abnormalities, which is actually the parent of eye abnormalities. Based on its position in the ontology, we can say that prediction 2 is a much closer prediction to the ground truth than Crohn's disease. So by looking at the predictions and their lowest common ancestor with the ground truth, we can get a measure of prediction quality. The way the metric works is we take those distances to the lowest common ancestor and average them over the entire test set. And then finally, we look at efficiency, which is essentially how long it takes to train a model given the same amount of training data.
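Coming back to the normalized LCA distance for a moment, here is a minimal sketch of how such a metric can be computed over a disease hierarchy. The single-parent view of the ontology, the normalization by path length to the root, and the handling of missing ancestors are illustrative choices, not necessarily the exact definition used in the paper.

```python
def ancestors_with_depth(concept, parent):
    """Map each ancestor of `concept` (including itself) to its distance up."""
    depths, depth = {}, 0
    node = concept
    while node is not None:
        depths[node] = depth
        node = parent.get(node)   # assume a single-parent (tree) view of the ontology
        depth += 1
    return depths

def lca_distance(pred, gold, parent):
    """Distance from prediction and gold to their lowest common ancestor,
    normalized by the total path length to the root (illustrative choice)."""
    pred_anc = ancestors_with_depth(pred, parent)
    gold_anc = ancestors_with_depth(gold, parent)
    common = [n for n in pred_anc if n in gold_anc]
    if not common:
        return 1.0                # no shared ancestor: maximally wrong
    lca = min(common, key=lambda n: pred_anc[n] + gold_anc[n])
    dist = pred_anc[lca] + gold_anc[lca]
    total = max(pred_anc.values()) + max(gold_anc.values())
    return dist / total if total else 0.0

# Toy hierarchy: eye abnormalities sits under congenital abnormalities.
parent = {"eye abnormalities": "congenital abnormalities",
          "congenital abnormalities": "diseases",
          "crohn's disease": "diseases",
          "diseases": None}
print(lca_distance("congenital abnormalities", "eye abnormalities", parent))  # small
print(lca_distance("crohn's disease", "eye abnormalities", parent))           # larger
```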
Going into the numbers: on the NCBI disease corpus, our model, which is in blue, versus TaggerOne in yellow and DNorm in gray, sees consistent gains. We see a big gain in precision, a nice gain in recall, and about a 3% improvement in F1. On BioCreative diseases, though, we still see a gain in precision but a slight drop in recall, so overall our F1 score was just slightly lower than the closest baseline. Basically, what this is saying is that our biggest gains are in the lower-resource setting.

On a dataset which has lower diversity in terms of how different mentions appear and which concepts appear, we see gains, and the reason this happens is that we're using the concept ontology for a lot of the augmentation, so the extra data we get is fairly low-diversity in terms of the mention text that appears. Because we're using just the names in the ontology for our distant supervision, we see gains on the smaller set of mentions and concepts, but it hurts a little bit when the mentions and concepts that appear are more diverse. One way to mitigate this would be to improve our distant supervision, so that's one area where we can improve these models. For accuracy it's a very similar story: we see a slight gain on the NCBI disease corpus and a slight drop on BioCreative.

One other test we did was to see how robust the models are to abbreviations, because all the models use an abbreviation resolver to get rid of abbreviations when they can find them. When we remove abbreviation resolution, we actually see gains on both datasets, so our model is a bit more robust to abbreviations. Next is prediction quality: our model was consistently better on both datasets in terms of prediction quality. Essentially, what this means is that the concept embedding space we're learning embeds similar diseases closer together, so even when we get a prediction wrong, we're still predicting something closer in the tree than the other two models would. This was a good result that we were pretty happy with. And finally, efficiency: our models were vastly more efficient than the baselines, orders of magnitude more efficient. A lot of this comes from the fact that our modeling is actually very simple, while the baseline models, even though they just use TF-IDF vectors, have to learn massive similarity matrices: a huge bilinear map between TF-IDF vectors that is the size of the vocabulary by the size of the vocabulary, so up to a 20k-by-20k matrix.

Cool, so the final takeaways from our work. One of the main things we showed is that simple semantic features and simple modeling are enough to approach, and in some cases achieve, state of the art on this task. The nuance is that we can use very intelligent data augmentation together with simple semantic features to do this with a neural model, and one of the takeaways from that is that the data augmentation is absolutely critical: we rely very heavily on the concept ontology and the names that appear there, and on modeling the concept embedding space rather than using a regular softmax classifier. Next, our models achieve higher prediction quality, which is a good result: if you want to avoid predicting diseases that are totally far off and want a bit higher prediction quality, our model is more suited for that. And finally, and again if you read the paper you'll see some of the nuance in it, the utility of coherence appears to be domain dependent. We don't see huge gains from the coherence model; the model performs pretty well with just the semantic features alone.
However, in other domains that we've looked at, such as bacteria, it appears that coherence might be a bit more useful. So with that, I'd like to thank everyone for coming to see the talk. A quick shout-out to everybody who worked on this project: it's a big collaboration between people here, people at UCSD, and a lot of really smart people, and this work would not have been possible without them. Thank you very much, and if you want to check out the code, there's a link up there; feel free to do anything with it. Thank you.