Well, let me welcome everybody back from the break. For those of you who caught a very quick preview during the break, you'll hear most of that again; bear with us. I'm delighted to welcome a team from Elsevier led by George Tsatsaronis. I think I did not do too much violence to that; if so, I apologize. This is a really timely topic, in the sense that the pandemic has produced an incredible flood of information, and the scholarly publishing apparatus has struggled in so many ways to deal with it. So we're going to hear about a number of tools and tactics that Elsevier has been experimenting with, and I think, interestingly, this will form a little bit of a complement to the first talk we heard at this event. Without further ado, let me turn it over to George and his colleagues.

Thanks Cliff, Diane and Paige for having us today. Welcome everyone: good morning, good afternoon or good evening from wherever you are joining. I am joined today by two of my colleagues, whom you will meet in a moment: Zubair Afzal, Director of Data Science in our department, and Efthymios Tsakonas, Senior Machine Learning Scientist and Manager of Data Science in our department as well. I am George Tsatsaronis, Vice President of Data Science for Research Content at Elsevier, and today we would like to demonstrate a couple of technologies, built with state-of-the-art machine learning techniques, that we have developed to help the research communities fight the COVID-19 pandemic that broke out in 2020.

First, just a very quick flash introduction of who we are. You probably know Elsevier mostly as a publisher, but next to our publishing business we have also grown a large global information analytics business, primarily serving research and academic communities, but also corporate markets, in the space of research applications and health and life sciences. What we aim to achieve is to help the communities we serve to progress science, to advance the fields they are working in, and in general to get the most insights out of the research content. What we offer is basically a unique combination of content, and not only Elsevier content: as you will see in the next slides of the presentation, we work with a number of partners and other resources.
With our data science community, which is a very vibrant community of 250-plus experts in machine learning, NLP and data science in general, we have had a number of very successful research collaborations in the past; recent examples include Harvard, and earlier Imperial College, the University of Melbourne and many others. This is just to give you an indication that we are not only implementing solutions but also doing cutting-edge research within our remit. In 2021, together with our colleagues from the AI lab we have with the University of Amsterdam and the Vrije Universiteit, we published one of the latest advances on how to query knowledge graphs, which received one of the outstanding paper awards at the ICLR 2021 conference; this has been one of the main conferences for machine learning, and deep learning in particular, in recent years.

So let me introduce today's topic. Today we will be talking about some of the things Elsevier has done to help the community combat the COVID-19 pandemic. Since the outset of the pandemic, one of the first things we did was to put together a very comprehensive resource directory, the Elsevier Coronavirus Resource Directory, which is publicly available: it points to more than 80,000 accessible full-text articles on COVID-19 and to various other tools that can help professionals, academics, clinicians and researchers alike aggregate, if you will, the necessary information and facts about the virus and the pandemic itself. With that, I will pass it on to my colleague Zubair, who will introduce the first of the two solutions we would like to present today.

Thank you, George, and hello everyone. I'm Zubair Afzal. Today I will talk briefly about one of the COVID-19 use cases we have at Elsevier, which is about identifying what we call the core, or primary, COVID-19 articles. As you know, ever since the pandemic broke out, the research community, industry and governments all joined forces to combat it, and a key requirement in this effort is to ensure fast, efficient, peer-reviewed communication of any novel research finding. At Elsevier we implemented an acceleration initiative with the objective of identifying what we call the primary COVID-19 articles (I will come back to what this means in the next slide), so that they can be shared with the scientific community as soon as possible. All articles in our COVID-19 catalog are available for free on ScienceDirect and on the Elsevier Novel Coronavirus Information Center page. We also contribute, through PMC, to the CORD-19 dataset, which is the largest research dataset out there on COVID-19, and we contribute our COVID-19 articles to the WHO catalog as well.

When we started this task, back in March 2020, we wanted a method to identify COVID-19 articles and make them available for free. At that time we could not apply machine learning, because to build a machine learning model we needed a training set, and we did not have any. So initially we started off with a very simple approach using keywords. These are the initial keywords that our subject-matter experts came up with to identify any relevant article.
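To make the idea concrete, a keyword filter of this kind can be sketched in a few lines of Python. The keyword list below is only an illustrative subset, not the actual list our subject-matter experts curated:

```python
import re

# Illustrative subset only; the real list was curated by subject-matter experts.
COVID_KEYWORDS = ["covid-19", "sars-cov-2", "coronavirus", "2019-ncov"]

def keyword_match(title: str, abstract: str) -> bool:
    """Return True if any COVID-19 keyword appears in the title or abstract."""
    text = f"{title} {abstract}".lower()
    return any(re.search(r"\b" + re.escape(kw) + r"\b", text)
               for kw in COVID_KEYWORDS)
```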
So all articles that contained one of these terms were identified, and all of them were first made available on our Coronavirus Information Center page and also on ScienceDirect, for the whole community. Soon we realized that this keyword list was rather short and not extensive, so in the next version we expanded our terminology: we added many more relevant keywords, and also combinations of keywords that we thought were useful for identifying such articles. As you can imagine, keyword queries of this type usually have very high recall, which means they will catch almost any relevant article, but they will also generate a lot of false positives, or noise.

To give you an example, here are a few recent articles. If you look at the titles, all of them have at least one of the keywords present, which I have highlighted. But if you look closely, the articles on the left do not really talk about the coronavirus, its diagnosis or its treatment; they talk about gambling at the time of COVID-19, or about how to optimize supply chain resilience. The articles on the right, if you look closely, carry more information suggesting that they actually do talk about the coronavirus and its treatment and vaccination. So we wanted to build a machine learning classifier that can learn to distinguish between these two sets of articles. The articles on the right are what we call primary impact articles, and those on the left are what we call secondary impact articles.

The task was now clear, but before we started with machine learning we interviewed clinicians, bioinformaticians, scientists and researchers to gather their information needs. Based on those interviews we came up with the following inclusion and exclusion criteria. The top box shows the inclusion criteria for primary impact articles: all articles that talk about diagnosis, treatment or vaccine development, articles about other types of coronaviruses, how hospitals are handling the pandemic, population-related phenomena, and the impact of COVID-19 on the healthcare system. We group all of these into what we call primary impact articles. All other articles, for example articles about the impact of COVID-19 on the economy, education, transport or social media, we group into secondary impact articles.

So now we had the definition, but we still did not have a training set to actually build a machine learning classifier. For any supervised machine learning classification you need a training set to train your algorithm, and we did not have any that matched the criteria we had come up with. What we did was use active learning to generate the training and test sets. Active learning, if you don't know it, is a sub-domain of machine learning. Typically, when you are creating a training set, you take a random sample from your whole population and then ask your subject-matter experts to label it; the quality of the annotation then also depends on your sampling, on how good it was and whether all the samples were actually useful for the algorithm. With active learning, it is not you who decides which articles or examples need to be labeled: it is the algorithm that decides which articles should be labeled next, in order to maximize its own improvement.
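In its simplest form this can be an uncertainty-sampling loop. Here is a minimal sketch, assuming a scikit-learn-style classifier whose predict_proba accepts raw texts; the names are illustrative, not our production code:

```python
import numpy as np

def select_for_labeling(model, unlabeled_texts, batch_size=50):
    """Uncertainty sampling: pick the articles the current model is least
    sure about, to send to subject-matter experts for labeling."""
    probs = model.predict_proba(unlabeled_texts)[:, 1]  # P(primary impact)
    uncertainty = -np.abs(probs - 0.5)  # closest to 0.5 = most uncertain
    return np.argsort(uncertainty)[-batch_size:]  # indices to label next
```

Each labeling round retrains the model on the newly labeled examples, so the training set and the classifier improve together.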
So we used active learning to create both a training set and a test set, and we used a large pre-trained BioBERT model. This is a large language model, based on the BERT architecture, trained on 4.5 billion words from PubMed and 13.5 billion words from PMC. We took that large language model and fine-tuned it, optimizing it so that it could identify the articles we call primary and secondary impact articles. The active learning approach helped us speed up the whole process of building the algorithm while we were still constructing the training and test sets.

We call our algorithm CORA, the COVID-19 Relevancy Algorithm, or CORA for short. The algorithm has two components. The first is the keyword classifier, which uses all the keywords I showed earlier (shown again below). The idea of using the keyword classifier is to ensure high recall: we did not want to miss any relevant articles, so any article containing one of these keywords will be picked up by this classifier. But, as we already know, many of these articles are false positives, so we then apply our machine learning classifier, which identifies the primary impact COVID-19 articles.

When we applied the CORA algorithm to the same set of articles I showed earlier, the algorithm assigned a probability score to each article. We did not use only the title of the article to classify them; we used the title and the abstract. As you can see, on the left the maximum probability the algorithm assigned is 0.46, well below 0.5, so these articles were not classified as primary. On the right, as we expected, the algorithm gave the articles very high scores. This way we know that these are the articles we wanted to capture and make available for free as soon as possible.

On the evaluation side, we of course evaluated the methodology: how good the keyword classifier was and how good our CORA classifier was. We used precision, recall and the F1 score, which are standard measures in machine learning and information retrieval. For those who don't know them: precision is the fraction of correct articles among all the articles that were identified. In this case, a precision of 0.74 means that for every 100 articles identified by the keyword classifier, 26 were false positives and only 74 were actually correct. Recall, also sometimes known as sensitivity, is the fraction of correct documents retrieved out of all the correct documents; here, a recall of 0.98 means that out of 100 correct articles, the keyword classifier identified 98 of them. This is what you would expect from a keyword classifier anyway, because it contains all the general keywords. The F1 score is simply the harmonic mean of precision and recall, combining them into one score. The CORA classifier, on the other hand, had a high precision of about 0.89 but slightly lower recall, which again is also expected.
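Putting the two stages together, a CORA-style pipeline can be wired up roughly as follows. This is only a sketch: it reuses the illustrative keyword_match from earlier, model stands in for the fine-tuned classifier with a hypothetical scikit-learn-style interface, and the 0.5 threshold mirrors the probability cut-off mentioned above:

```python
def classify_article(title: str, abstract: str, model, threshold: float = 0.5):
    """Two-stage sketch: keywords for recall, then a fine-tuned model for precision."""
    # Stage 1: high-recall keyword filter (see keyword_match above).
    if not keyword_match(title, abstract):
        return "not COVID-19 related"
    # Stage 2: the fine-tuned classifier scores title + abstract together.
    p = model.predict_proba([f"{title} {abstract}"])[0, 1]
    return "primary impact" if p >= threshold else "secondary impact"
```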
Keep in mind that when you are doing machine learning, you always try to find a balance between precision and recall. You can optimize your algorithm for high precision or for high recall, but usually one comes at the cost of the other: if you want to put more weight on precision, you will have to sacrifice some recall, and vice versa. Overall, as you can see, CORA's F1 score was higher than the keyword classifier's. These figures are from when we first deployed our machine learning model in production and started classifying all the incoming articles. Over time, however, things drifted: when we applied the same keyword classifier to newer articles, it turned out that its precision had dropped quite a lot, from 0.74 down to 0.36, which means that for every 100 articles identified by the keyword classifier only 36 were actually correct. The CORA classifier, on the other hand, still had very high precision, and its recall was even higher. This just shows the advantage and the power of these language models: how good they remain over time at identifying relevant articles with high precision and recall, even as the language and the terminology people use keep varying.

Just to summarize our contributions: we created a framework of inclusion and exclusion criteria that may be used as a generic guideline for annotating COVID-19 publications, and a simple but very efficient deep active learning approach built with the help of our subject-matter experts. By the way, the training and test sets we created through this process are also available for free, so anyone can download them and train their own version of CORA if they wish. We demonstrated through experimental evaluation that the algorithm achieves a very high F1 score in detecting primary and secondary impact articles. Today at Elsevier, CORA is ensuring that we identify all primary impact articles as early as submission time, so that they can be put on a fast track for peer review and publishing. That was, quickly, our first use case. I will now hand over to Efthymios, who will walk you through the second one.

Thank you very much. Let me share my screen; I hope you can all see it now. Yes, great. First of all, I would like to thank you all very much for attending the session. What I am going to do is present a search engine which we applied to the so-called CORD-19 dataset, and I am going to present the basic components of this engine. CORD-19 is essentially a dataset comprising literature around COVID-19 and other coronaviruses, and it is continuously updated by the Allen Institute for AI, among other organizations; Elsevier contributes to this dataset too, as Zubair and George just mentioned. That is the dataset we are going to apply our solution to. Now, in a modern search engine there are a few basic components, arranged in a cascade if you will. First of all there is the so-called indexing component, which is the part where your query hits the index. Then there are machine learning based components: it could be a learning-to-rank component, as we say, or a question answering component, or both.
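As a rough picture of this cascade, here is a sketch of the flow just described; the interfaces (index, ranker, qa_model and their methods) are hypothetical placeholders, not a real API:

```python
def search(query, index, ranker, qa_model, k=1000, top=20):
    """Cascade sketch: fast recall-oriented retrieval from the index,
    then feature-rich re-ranking, then answer extraction."""
    candidates = index.retrieve(query, k=k)          # indexing component
    ranked = ranker.rerank(query, candidates)[:top]  # learning to rank
    answers = qa_model.extract(query, ranked)        # question answering
    return ranked, answers
```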
In our case, the search engine that we developed specifically for this corpus, whose front page looks like this, actually has all three of these components. When a user submits a query like this one, for example, and note that the query here is a question, "what is the effectiveness of chloroquine for COVID-19", the indexing component very quickly retrieves a bucket of candidate documents that are relevant to the query. This bucket of documents is then sent to the learning-to-rank and question answering components, and their results are mixed and returned to the user. The question answering component in particular will attempt to provide relevant snippets that directly address the user's question; you can see this here on the right-hand side, shaded in gray. The learning-to-rank component serves the final ranked search results to the user, as you can see on the left-hand side. All of this happens in real time, and the user can then click and see, highlighted in yellow, the possible answers to his or her question.

Now, how is this possible in real time? Let's do a little deep dive, starting with indexing, the first component of our search engine, because indexing is essentially what makes search engines fast. If you think about the scale at which Google works, for instance, it becomes evident that you can't do the simple thing: take your query and compare it against every document you have in your database. With millions of queries and millions of documents, that is not going to scale, so you need something faster. The basic data structure, the basic idea, that we use to make things faster is the so-called inverted index. The best way to think about what this is, is the index at the back of a book: you have a list of keywords, and next to each keyword the pages that talk about it. That is an example of a manually constructed index. The inverted index that a search engine uses is very much the same type of data structure, with the difference that in a search engine you don't choose which words to index: you basically index everything. So the terms are the typical words, and for each one you store pointers to all the documents that contain a mention of it. This is a nice data structure to have because it allows very quick retrieval of material: once you have looked into the index, once you have looked up a certain word that is in the user's query, you can jump straight to the documents that contain it. It allows you to find things in sublinear time.

Here is a simple example I created to showcase the inverted index. Suppose we have a really simple collection of just five documents, d1 to d5, like these. Part of the inverted index pertaining to these documents is this one here. You have the terms from your documents as the heads of the inverted lists, as we say, and each term corresponds to an inverted list: a list of tuples, each consisting of a document ID together with the number of times the term appeared in that document.
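A toy version of this data structure is easy to write down; this is a minimal sketch, and the two documents are placeholders rather than the five on the slide:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to an inverted list of (doc_id, term_frequency) tuples."""
    postings = defaultdict(lambda: defaultdict(int))
    for doc_id, text in enumerate(docs, start=1):
        for term in text.lower().split():
            postings[term][doc_id] += 1
    return {term: sorted(counts.items()) for term, counts in postings.items()}

index = build_inverted_index([
    "he supports his local football team",
    "he pays his taxes",
])
print(index["he"])  # [(1, 1), (2, 1)]
```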
Take the term "support" here, for example: its inverted list says that it appears once in document number three and nowhere else, whereas the term "he" appears in every document, twice in the first one and once in every other. The key point is that, due to Zipf's law, which is essentially a statistical law governing language, the majority of these inverted lists are going to be very, very sparse. So this data structure allows very quick retrieval of material even for a multi-word query like "football pay" here: you pull up the inverted list for "football", you pull up the inverted list for "pay", and by incrementing a couple of pointers you can score the documents. You can use a Boolean score, match or no match depending on whether all your query terms are present, or you can have a more sophisticated score taking into account the term frequencies in your query and your documents. That is the basic inverted index, and there are fancier things you can further do with indexing, such as indexing phrases. Very recently it has also become possible to index numerical vectors, so as to retrieve very fast the k nearest neighbors of a query vector from your indexed vectors. People call this semantic indexing: one uses numerical vectors to represent pieces of text, in such a way that pieces of text that are semantically related are represented by vectors that are close in that vector space. There are of course even fancier things you can do, and you can do all of them in conjunction with indexing. We don't have time to go through everything, so what I'd like to do now is talk a little bit about the learning-to-rank and question answering components of a search engine.

What is the basic idea behind learning to rank? What do we want to do? This is how a learning-to-rank system performs the forward step, if you will. Assume you have a query q and a bunch of documents d_i that were returned by the indexing stage, and you want to find the most relevant ones. The first step is to create what we call in machine learning a joint feature vector representation for the query and each of the documents: for each query-document pair, for your query q and document d_i, you create the feature vector representation x_i. Then you have a scoring function F that you have learned, which assigns a score to this feature vector representation; once you have the scores for all the documents, you sort them in decreasing order and feed them back to the user. The learned scoring function F that you see here is typically much richer than the scoring functions used in indexing that we saw earlier, as it takes into account a much richer set of features. Now, in learning to rank there are basically three different families of methods one can use to learn the scoring function, and by far the most popular approach, which is also the approach we use, is the so-called pairwise learning to rank. In this case you have data in the form of preferences, for example that document a is more relevant than document b for query q, and given such preference pairs you can estimate the scoring function. The data we have used come from an open-source dataset.
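A minimal sketch of one pairwise update, in the style of RankNet (a standard pairwise method, standing in here for whatever model is actually used), with w a linear scoring function over the joint feature vectors:

```python
import numpy as np

def pairwise_update(w, x_a, x_b, lr=0.01):
    """One pairwise learning-to-rank step: for the same query, document a
    was judged more relevant than document b; x_a and x_b are their joint
    query-document feature vectors."""
    s = np.dot(w, x_a - x_b)                  # score difference F(x_a) - F(x_b)
    grad = -(x_a - x_b) / (1.0 + np.exp(s))   # gradient of log(1 + exp(-s))
    return w - lr * grad

def rank(w, X):
    """Forward step: score every candidate document and sort descending."""
    return np.argsort(-(X @ w))
```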
I am not going to go into the details here; instead, I would like to show you what we mean exactly by a joint feature vector representation of query-document pairs. You can see it schematically here. For each query-document pair we extract our features, and here is a brief categorization of them. There are a number of frequency-based features, for example counting the number of terms that are common between the query and the document, and other similarity features like TF-IDF or BM25, which are classical information retrieval measures. Another category is semantic features, coming from word embeddings for example. Yet another category has to do with positional features: you split your document into zones, such as the title, the abstract and perhaps other paragraphs, and you compute the frequency-based and semantic features on each of these zones of your document.

(Excuse me, Efthymios, we have one more minute.)

Now let me talk a little bit about question answering, which is basically a feature of the entire search experience. The top 20 scored documents coming out of the learning-to-rank component are fed as input to the question answering component, which attempts to directly address the user's question; and let me stress that the performance of the question answering component crucially hinges on the performance of the learning-to-rank component. Question answering may sound complicated, but today it is actually pretty easy to get a state-of-the-art model by doing the following: you take a pre-trained language model like BERT here, which is a very sophisticated language model, and you fine-tune this pre-trained model for question answering by feeding it a small amount of data from your application. In our case the final fine-tuning was performed with approximately 2,000 question-answer pairs for COVID-19, and this gets you pretty much state-of-the-art performance. Here we actually use a variation of BERT, giving it as input a question and a passage, and as target outputs a start-token and an end-token classifier. Each classifier has its own set of weights, which are used to pinpoint the start and end tokens of the answer at inference time. What is also important here is how you choose to mix the learning-to-rank and question answering results and present them back to the user to keep him engaged, because through user clicks we can utilize the user feedback to further fine-tune our models. This is actually very important when you design a search engine, a search system. Thank you very much; that is all I had to say.

Well, thank you, very interesting. We have some time for questions; we are happy to take one or two, or we can also take them offline, and we will obviously share this information with the audience here; very happy to do so. Yes, I think we probably have time for a quick question if there is one. Please jump in. I don't see one leaping up; there is a lot there for people to absorb, and I hope you'll be able to stay around at least for a little bit and perhaps field a couple of questions through chat. That was a really very interesting presentation, and I thank you very much for sharing that work with us. Thank you very much, and thank you for having us. Thank you.