Lucy and Kyle are at the Allen Institute for AI. So obviously to most of you, that's in Seattle. And they are, of course, going to talk about COVID. This is something I'm very excited to dig into, and I hope all of you are. So without further ado, here are Lucy and Kyle.

Awesome. Thank you so much, Ben. It is a real pleasure to be here today. Thank you for inviting us. My name is Lucy Lu Wang. I'm here with my colleague, Kyle Lo, and we're going to be discussing the COVID-19 Open Research Dataset, or CORD-19, and also ways that it's been leveraged by researchers to mine the COVID-19 literature. As Ben mentioned, we're both researchers on the Semantic Scholar research team at the Allen Institute for AI out here in sunny Seattle, Washington, and we both worked extensively on the CORD-19 corpus. So to start motivating this problem: COVID-19 literature has been published very quickly. This image, borrowed from COVID-19 Primer, shows that more than 70,000 new papers have been published this year alone since March, which amounts to several hundred new papers a day. It can be really hard for clinicians and researchers to keep up with the latest findings given this pace of publishing. So to help manage information overload and address these issues, there is a clear need for automated methods that leverage artificial intelligence techniques to assist readers in managing this large volume of papers. So how does one build an automated text mining system for scientific literature? There are a couple of steps. First, you have to construct a corpus of documents. To do this, you need to identify potentially relevant documents and preprocess them into the same machine-readable format. Then there's data enrichment. This includes adding annotations to important entities, maybe mentions of drugs or genes or other things, and then potentially labeling data if you're trying to train a model that does something else.
This is followed by model development, which is actually creating and building the system. And here the text-mining practitioner, the AI practitioner, has to decide what the system should do. Does it retrieve documents based on a query? Does it try to answer questions? Does it try to check facts against evidence in papers? And they also have to choose what models to use to do this. If they use a neural model, what architecture should it have? Should they use a pre-trained model, and so forth? And finally, one has to evaluate the performance of the system and figure out whether it actually satisfies user needs. Each step in this pipeline can be quite expensive and time-consuming. So this is where community efforts around open data, shared and reusable modeling and annotation resources, as well as community shared tasks can really come into play. These efforts have contributed a lot to speeding up this pipeline for COVID-19. And for this talk, we focus on introducing some of these community efforts and discussing how they've helped this process. So I'll start by talking about the CORD-19 corpus, which attempts to centralize the process of corpus creation. CORD-19 is a dataset of structured, machine-readable COVID-19 research papers. We released the first iteration of this dataset back in March in collaboration with several partner organizations, and since then, the dataset has grown a lot. Now it includes more than 300,000 entries, and full texts are available for about 120,000 documents. It's also updated daily. This dataset was one of the earliest COVID-19 literature datasets that was publicly released, and consequently, it served as the foundation for many dozens of text mining systems for COVID-19. To construct the dataset, we start by answering the question: what is a COVID-19 related paper? Each text mining corpus has to answer this question of what to include within the corpus and what to exclude.
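The corpus-construction step described above — identifying potentially relevant documents by keyword — can be sketched in a few lines. This is a toy illustration, not the actual CORD-19 ingestion code; the keyword list and papers are invented stand-ins.

```python
# Minimal sketch of keyword-based corpus construction. Keywords cover both
# COVID-19 terms and historical coronaviruses, as described in the talk.
KEYWORDS = {"covid-19", "sars-cov-2", "coronavirus", "sars", "mers"}

def is_relevant(doc: dict) -> bool:
    """A paper is included if any keyword appears in its title, abstract, or body."""
    text = " ".join([doc.get("title", ""),
                     doc.get("abstract", ""),
                     doc.get("body", "")]).lower()
    return any(kw in text for kw in KEYWORDS)

papers = [
    {"title": "Clinical features of COVID-19 pneumonia"},
    {"title": "Deep learning for image segmentation"},
    {"title": "A survey of MERS transmission"},
]
corpus = [p for p in papers if is_relevant(p)]
print(len(corpus))  # 2 of the 3 toy papers match a keyword
```

A real pipeline would additionally deduplicate entries across sources and harmonize metadata, which is what the Semantic Scholar ingestion described later does.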
So for CORD-19, we use keyword search to identify relevant papers. We also include papers from a number of curated COVID-19 specific databases. And a key thing to note here is that the keywords that we use not only cover COVID-19, but also historical coronaviruses such as SARS and MERS. We include papers where a keyword shows up in the title, the abstract, or the full text of the document. And here are the sources from which we derive these papers. This includes things like the World Health Organization's COVID-19 literature database, PubMed Central, PubMed, very familiar, as well as various preprint archives. These papers are ingested through the Semantic Scholar data pipeline, where we clean and harmonize the metadata for the papers and then also extract full text out of PDFs when those PDFs are open access and available to us. Then we release the metadata and this full text as part of the CORD-19 dataset. Here's what some of the data looks like. At the top there is a metadata entry with typical paper metadata fields. And we provide a link between that metadata entry and the full text extraction of the paper when an open access PDF is available. A snippet of full text is shown in that gray box. This full text extraction takes advantage of a custom PDF-to-JSON parser that Kyle and I created as part of the S2ORC project, and that's actually how we became involved in the CORD-19 project in the first place. In this full text extraction, we annotate things like citations and references to figures, tables, and other objects. But of course, additional annotations and post-processing can be performed on this full text. So now I'll give some examples of additional annotations that have been made available in the community and which can be used by others. There are many things that one could imagine annotating: things like named entities, genes, drugs, maybe relationships between entities.
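The layout just described — a metadata table linked to per-paper full-text JSON — can be illustrated with a toy example. The field names below are simplified stand-ins, not the exact CORD-19 schema.

```python
import csv, io

# Toy illustration of the CORD-19 layout: a metadata table with typical
# paper fields, plus a JSON-style full-text record linked by paper id.
metadata_csv = """paper_id,title,authors,has_full_text
abc123,Clinical features of COVID-19,Smith; Lee,True
def456,MERS transmission routes,Park,False
"""

full_text = {
    "abc123": {"body_text": [{"section": "Introduction",
                              "text": "COVID-19 emerged in late 2019 [1]."}]}
}

rows = list(csv.DictReader(io.StringIO(metadata_csv)))
# Only papers with an open-access PDF get a linked full-text extraction.
with_text = [r for r in rows if r["has_full_text"] == "True"]
first_para = full_text[with_text[0]["paper_id"]]["body_text"][0]["text"]
print(first_para)
```

Note the inline citation marker `[1]` in the body text: the extraction annotates citations and object references with character offsets, so downstream annotations can be layered on the same spans.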
If you're working with clinical trials, PICO elements could be very useful. These labels can be automatically generated, for example, using existing named entity recognition models or text classification models. And we see many examples of groups who have annotated various classes of named entities, for example, linking to terms in biomedical ontologies or to terms in UMLS. Some examples of this are CORD-19-on-FHIR and the SciBite CORD-19 annotations. Labeled data can also be generated in a crowdsourced fashion. CODA-19 is an example of this: this group used crowdsourcing to generate question answering labels on top of documents in CORD-19. Question answering is a specific natural language processing task where, given an input question, the model attempts to retrieve a span of text that corresponds to the answer to the question. And finally, annotations can be expert curated. These are perhaps the most valuable annotations, but they also tend to be the most expensive to collect. Medical experts have been asked by various groups to annotate or make judgments about the documents in CORD-19, especially as part of various shared tasks, which we'll discuss later. There are a couple of platforms out there that help to support the public sharing and discovery of these annotations. PubAnnotation is an example of this, where people can upload annotations, making them explorable and discoverable by other parties. And here are some example annotations made publicly available on this platform for CORD-19, linking to ontologies such as the Human Phenotype Ontology and the Monarch Disease Ontology; the SciBite annotations are available here as well. PubTator is another platform where annotations are shared, for LitCovid, which is another COVID-19 paper dataset released by PubMed. And as I mentioned before, shared tasks can also be a source of high quality annotations. Shared tasks provide infrastructure to compare performance between systems.
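To make the idea of automatically generated entity annotations concrete, here is a minimal sketch. Real systems use trained NER models linked to ontologies like UMLS; this toy version uses a dictionary lookup, and the entity dictionary and labels are invented for illustration. What matters is the output format: character offsets plus a type, layered on top of the source text.

```python
# Toy span annotator standing in for a trained NER model.
ENTITY_DICT = {"hydroxychloroquine": "DRUG", "ace2": "GENE", "fever": "SYMPTOM"}

def annotate(text: str):
    """Return labeled character spans for dictionary entities found in text."""
    spans = []
    lowered = text.lower()
    for surface, label in ENTITY_DICT.items():
        start = lowered.find(surface)
        if start != -1:
            spans.append({"start": start, "end": start + len(surface),
                          "label": label})
    return sorted(spans, key=lambda s: s["start"])

anns = annotate("Fever was common; hydroxychloroquine showed no benefit.")
print(anns)  # two spans: a SYMPTOM at offset 0 and a DRUG at offset 18
```

Because the spans are keyed by offsets into shared full text, annotations produced by different groups can be combined on the same documents, which is exactly the layering vision discussed next.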
They encourage groups to work on a single problem or task and then judge the performance of these systems against one another through expert assessment. So for example, in a shared task on document retrieval, medical experts may take the retrieved documents from each of these systems and judge them for relevance to a specific query. These assessments can then be converted into annotations or labels, which can then be used to further train models and improve existing model performance. And we'll hear more about these shared tasks later. So the vision is for researchers in the community to create these various layers of annotations on top of the structured full text in CORD-19 and then share these annotations publicly so that other folks can use them. And because these annotations are on a common dataset, it's easier to use annotations from many different groups, because they all apply to the same underlying dataset. So then when I or someone else is building a text mining system, I can go find the annotations that are useful for me and add those as inputs into my model. Or if I create new annotations, I can also share those, which will maximally benefit other groups in the community. Now before we move on to the next part of the presentation, I just want to briefly revisit this question of what is a COVID-19 paper, and also to share some other potentially useful open data resources beyond CORD-19. In CORD-19 we select papers based on keywords, but these papers are part of a network of papers, the whole of the scientific literature. So if we expand CORD-19 by following all of the citation relationships in those papers, we get something called the CORD-19 closure graph. This is joint work with folks at Microsoft Academic. The closure graph includes millions of papers referenced by papers within CORD-19, going beyond the limits of CORD-19 itself.
And looking even further to the whole of the scientific literature, we also refer people to S2ORC, which is the Semantic Scholar Open Research Corpus. This is a corpus of CORD-19-style full text data that's extracted from over 12 and a half million open access papers in the Semantic Scholar database. These papers are from across all domains of science, including some humanities fields as well. Both of these resources can provide more context around the papers in CORD-19. So now I'll turn it over to Kyle, who will talk a bit more about modeling resources and sharing.

Lucy, can you advance the slides for me? So we don't have to do this screen share thing. Okay, I'll go through these relatively quickly since we're kind of at time. I want to talk a little bit about the fact that AI models are pretty expensive to train. One way to dramatically cut down on the cost of developing a model is to pre-train a model and then share that model for others to use. One common form of this is document embeddings, which allow you to take papers, such as CORD-19 papers, and turn them into vectors such that documents that are similar to each other will appear close together in space, while documents that are dissimilar will be far apart from each other. This means that documents with similar fields of study or research themes will be clustered together. We call these embeddings, and SPECTER is one method of producing them. Next. Of course, AI systems benefit not only from embedding documents but also individual words. The current state-of-the-art method for doing this is called BERT, and people have been training variants of BERT to handle the specialized language inside of papers; CovidBERT, trained on CORD-19, is an example of this. Next. Finally, knowledge graphs are collections of concepts or entities which are represented by... You muted yourself. That was me. I muted you by accident. Sorry. Oh, hello. Apologies for that. No, sorry.
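The nearest-neighbour idea behind document embeddings — similar papers end up close together in vector space — can be sketched with cosine similarity. The three-dimensional vectors below are made up for illustration; real systems use learned embeddings of hundreds of dimensions, such as those produced by SPECTER.

```python
import math

# Toy document embeddings: two coronavirus papers and one unrelated paper.
embeddings = {
    "covid_transmission": [0.9, 0.1, 0.0],
    "sars_transmission":  [0.8, 0.2, 0.1],
    "image_segmentation": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, ~0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

q = embeddings["covid_transmission"]
ranked = sorted((k for k in embeddings if k != "covid_transmission"),
                key=lambda k: cosine(q, embeddings[k]), reverse=True)
print(ranked[0])  # the SARS paper is the nearest neighbour
```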
So yeah, relationships between entities are also represented in knowledge graphs. An example of this is COVID Graph, which contains information about COVID-19 concepts and their relationships, as well as incorporating other gene and chemical ontologies. Next. And I just wanted to walk through an example of how these might be used. A common pattern is for an AI system to take a query or a question, such as the one about hypertension above, and make use of embedding techniques... Next. ...to retrieve the relevant articles. So you can find documents that are similar to the query to limit the space that you then run your BERT model over, to extract the specific answers to that question. Next. And finally, you can augment the retrieved collection of documents or snippets by using a knowledge graph. A knowledge graph might contain known relationships between hypertension and other phenotypes, for example, and you can use that to find text snippets that might not exactly mention the word hypertension but are still very relevant to answering the question. Next. And finally, to give a quick overview of shared tasks and competitions. Next. I'll briefly talk about three shared tasks that have been launched on CORD-19: the Kaggle CORD-19 challenge, TREC-COVID, and EPIC-QA. Next. The Kaggle challenge was the first competition that launched alongside the CORD-19 release back in March. Participants were presented with questions that they had to answer using relevant information extracted from CORD-19 papers. And since the competition has concluded, one thing we've really learned is that we really want structured extractions, kind of like what you see here in a table format, because this makes the extractions much easier for medical experts to consume, as opposed to early on, when participants were extracting just lengthy text snippets that contained the answer. Next.
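The retrieve-then-augment pattern described above can be sketched end to end. Everything here is a stand-in: the documents, the keyword "retrieval" (in place of embedding similarity and a BERT reader), and the tiny synonym graph (in place of COVID Graph). The point is the knowledge-graph step: a document that never mentions "hypertension" is still retrieved via a known related phrase.

```python
# Toy retrieve-then-augment pipeline.
docs = {
    "d1": "Hypertension was associated with worse COVID-19 outcomes.",
    "d2": "Patients with high blood pressure showed elevated risk.",
    "d3": "Lung imaging protocols are described in this note.",
}

# Stand-in knowledge graph: maps a concept to related surface forms.
knowledge_graph = {"hypertension": {"high blood pressure"}}

def retrieve(query: str):
    """Return ids of documents matching the query or its graph neighbours."""
    terms = {query} | knowledge_graph.get(query, set())
    return [doc_id for doc_id, text in docs.items()
            if any(t in text.lower() for t in terms)]

hits = retrieve("hypertension")
print(hits)  # d2 only matches via the knowledge-graph synonym
```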
TREC-COVID is a related shared task which focuses on finding what documents are relevant to answering a question. This task was organized in five rounds, where we released new questions over time as new papers were being added to CORD-19. And to get a sense of what we've learned, we really need to look at how information needs have changed over time. So you can see in round one, back in March, we were asking very different questions about the virus... Next. ...compared to round three in May. Next. And finally, round five in July. The types of questions, the information that we were focusing on, was definitely very different over time. Next. And what we've sort of learned from this is that the notion of what a relevant document might be changes over time, along with the questions and the document collection. Early on, back in round one, we didn't really know very much about the virus, and so a lot of historical coronavirus research was considered to be relevant to answering these questions. But by the time we get to round five, we have more COVID-19 papers than we can possibly read. So do experts still consider those early historical papers to be relevant now, even if they were relevant back then? TREC-COVID gives us the opportunity to study this, because we have various snapshots of CORD-19 over time and various annotations benchmarked to the state of the CORD-19 corpus at each point in time. Next. And finally, EPIC-QA is sort of an extension built on top of TREC-COVID, which also wants to address this notion of whether relevance depends on something else besides just the document and the question. In this task, we're trying to answer the question of whether a relevant answer also depends on who's asking the question. So in EPIC-QA, we're trying to answer questions about COVID-19 in a way that would be most suitable for either a healthcare consumer or a medical expert to consume.
And so possibly, this is an ongoing shared task, what we'll find is that relevance is not just a function of time, but also a function of who's asking the question. Next. So in conclusion, AI techniques like text mining and natural language processing have the potential to be very helpful for reducing information overload and helping experts find answers to COVID-19 questions. The development of these systems is very expensive, and we've highlighted in this talk various ways that open data and resources have contributed to speeding up this process and relieving the burden on AI practitioners at various stages in the development pipeline. Next. So yeah, we just wanted to thank our many collaborators on CORD-19 and the various shared tasks. Next. And we've included links to download CORD-19, S2ORC, and various other resources, and that concludes our talk. Next.

Well, thank you very much for going quickly at the end, and thank you guys for an amazing talk. We are running about 11 minutes behind, so we are going to hold questions to the break period. That said, at least Lucy and probably Kyle are on Slack, and you can see their Twitter handles here. So please tweet to them, ask them questions in the Slack. Now we are going to transition to Imran Haque, who is going to give our next talk. Hello. Hey, so my advice to you is if you need a speaker, invite Imran. When I used to work at the National Library of Medicine, I invited him to give a talk about, well, I won't say what, because of what I'm about to say, but I will say he was very frank. He did not sugarcoat anything, his criticisms had a very mathematical basis, and at that point I realized this guy told it like it was, and I would always be happy to invite him to give another talk. So with that, I'll turn the floor over to Imran. Thanks so much, Ben. All right, let's get the screen share set up. All right, those slides look good to everyone? Yep. Very good.
Cool, so thanks so much to Ben and the other organizers for inviting me to come and speak today. I am currently the VP of Data Science at Recursion Pharmaceuticals. As Ben mentioned, I've been working in the biotech industry for a while, gave a talk at the NLM regarding cancer detection a few years ago, and have been interested in applications of computer science and open science to problems in biology for a number of years. What I want to talk to you about today is some of the work that we've done at Recursion against the COVID-19 pandemic, and in particular for this conference, our large open data release around that work, named RxRx19, which is a large morphological profiling image and metadata dataset addressing various aspects of COVID-19 as a disease. My Twitter handle is there, and I'll have contact information at the end. Happy to get in touch with anybody who's interested in any of the work that's going on here, any of the data. And of course, I'd be remiss if I didn't say: my team is hiring. So if you like what you see, please drop me a line. A little bit about Recursion: we're building a vertically integrated biotech company that builds massive empirical datasets at every step of the process in order to accelerate drug discovery. When I say massive datasets, what I mean is that our total imaging dataset now, I think the latest public number that I can talk about, is around four and a half petabytes. So 4,500 terabytes, four and a half million gigs of image data that we've collected from our assay, which I'll tell you a little bit more about, that we then use to drive programs through discovery and eventually into the clinic for a variety of conditions, both rare and non-rare.
And you can see on the slide a little bit how we start with building phenotypic models of disease in human cell culture, assessing them at super high throughput, and then taking it all the way down through that drug discovery process and actually moving it into the clinic, into human translation. But somewhat uniquely among pharma companies, along the way we also support open science. It's one of the reasons why I joined Recursion and why I'm very proud to be here today and contribute to these efforts. We have a site, rxrx.ai, that shows off all the datasets that we've released to date. We started with RxRx1 a couple of years ago, and just this year we've released RxRx2 and RxRx19A and B, which will be the main focus of my talk. So a quick outline for what I'm going to talk about. I'll explain what these RxRx datasets are: what is it actually that Recursion has released? Then I'll go into a little bit about the process of how we went about releasing them, and some of the considerations that come into play when you're doing a dataset release that's not only as huge as the one that we're putting out, but also when you're doing so from an industrial context. I'm guessing that a lot of the folks who are joining today are looking at open science from a primarily academic context, and there are some different considerations that apply when you're in industry. I think it's valuable for both sides to engage on this and understand what are the things that all of us have in common that we can share in order to push these ideals forward. And finally, I'll share a few notes on what's happening with these datasets now. Now that we've released them, it turns out that we actually have gotten quite a bit of utility out of making them more broadly available, and I'd like to share a couple of those notes with you.
For anybody who's interested in more about the dataset, more about the experiments or the results, I encourage you to check out our preprint. I have the link to it down here at the bottom; if you look up my name on bioRxiv, you'll find the link as well. And of course, the information is at the easy-to-remember URL rxrx.ai. So without any further ado, let's dive right into it. Let's talk about what these RxRx datasets are that we released and why they're potentially of interest. To understand that, the first thing that you need to know is a little bit about our assay and about our platform. We do, at ultra large scale, a particular kind of experiment known as morphological profiling. For those of you who aren't cell imaging aficionados, in other words people like myself before I joined Recursion, I'll explain a little bit. I previously came from the sequencing world, and the sequencing world has a little bit of a chip on its shoulder: hey, we've got the biggest data, our costs are scaling faster than Moore's law, everything's great, and now we can do single cell sequencing. What's interesting is that microscopy is an incredibly data-rich technique. It's intrinsically single cell, it's intrinsically spatial, and it's highly flexible, because you can tailor the kinds of things that you're looking at in such great detail. That was something that I didn't appreciate before I came to Recursion, and I've really been able to see the power here. Now, because microscopy is so powerful and so flexible, there are a number of different modes in which you can run it. One mode which has become extremely popular, particularly in drug discovery and similar beginning-of-the-chain steps in high-throughput biology, is a mode known as high content imaging.
In high content imaging, you'll use a couple of generic stains to be able to lock in on particular aspects of your cells, like where the nuclei are or where the cytoskeleton is, but then you'll use specific stains in order to highlight a single pathway that you're interested in. Now, this is super high content, because you can actually look at the trafficking of individual proteins; you can figure out where, at the cellular or subcellular level, they are. But it's challenging to scale, because it means that both the assay as well as the analysis are custom for every experiment. Recursion uses a different method, known as morphological profiling, where you use a larger number of stains, all of which are shared across all the experiments that you run. So for example, the stains that we use were described in the Cell Painting paper, the Bray et al. paper from Nature Protocols listed at the bottom there, where we're looking at a set of generic stains covering the nucleus, the ER, nucleoli, actin, the Golgi and plasma membrane, and the mitochondria. Now, you might think, well, that sounds great, it makes the assay really cheap and easy to scale, but do you get the same information out? The short summary is yes. You have to build quite a bit of computational infrastructure on the backend in order to do that. And in fact, that's where our first dataset release came from: RxRx1 was about running a competition to demonstrate that, in fact, you can get the kind of accuracy from generic stains that you would get out of specific stains. This is really interesting because it means that you can standardize the experiment, use computational methods to recover the information on the backend, and really scale up your assay capacity and your ability to generate data. So with that, now we can understand what actually is inside the RxRx family of datasets. There are four datasets. RxRx1 and RxRx2 are what we would call phenotype-only datasets.
These are imaging, morphological profiling experiments on a variety of human cell types, either primary cells or cell lines, where we have perturbed those cells with a variety of perturbants. In RxRx1, they're siRNA genetic perturbations, trying to knock down about a thousand different genes. In RxRx2, they're soluble factors, so things like cytokines and interferons and so on that affect the state of the cell. We also have the RxRx19 datasets, where not only do we have phenotypes, but we also have drug screens conducted against those models. In RxRx19A, we used both active as well as inactivated SARS-CoV-2 virus, the causative virus for COVID-19, and then screened drugs in dose response against those infected cells to see which ones would make the cells look more like the healthy state. In RxRx19B, we did the same thing, but instead of using virus, we used a cocktail of cytokines informed by measurements from the plasma of patients who had severe COVID-19, in order to look for agents that might actually be successful in knocking back the COVID-19 associated cytokine storm. Critically, these two sets of datasets have different licenses attached to them, and I'll explain why as we go a little bit further in the talk. Now, these datasets include not just the imaging that I mentioned, although they do have an awful lot of that: hundreds of gigs of five or six channel fluorescent microscopy data. RxRx1, 2, and 19B are all six-channel; 19A is five-channel, because we left out one of the dyes that has to be applied to live cells, basically for biosafety considerations. Beyond the images, we also include the image features derived from Recursion's internal deep learning models. As I mentioned, you have to do a lot of computer vision work in order to extract relevant information from these images, because they're all taken with generic stains. We've done that work, and we've actually provided the features along with the datasets.
Of course, you need the metadata in order to make any sense of this. So we include metadata files with technical information like the plate IDs and well IDs associated with each well, which perturbation was in it, and which treatment at which concentration. And finally, something I'll go into a little bit later in the talk: we also have an interactive visualization tool at covid19.rxrx.ai that allows you to step through and actually see what the results of the drug screens were for yourself. So with that, how did we actually go about releasing this dataset? There are a number of critical decisions that you have to make for releases of this size. One that seems a little bit obvious but comes right at the beginning is what formats you're going to use. Now, internally, we have pretty interesting infrastructure on the backend and file formats that are set up for rapid machine learning and so on, and people are working primarily with Unix systems. However, we wanted to make this data accessible to as many people as possible, on whatever size systems and whatever operating systems they were using. And so we ended up just picking lowest common denominator formats. The images are stored as large zip files; we didn't even use tarballs, because it's harder to work with tars on Windows systems. The images are encoded as PNGs rather than Zarr, and rather than storing the metadata in a complicated binary container, it's just CSV. Although these formats are less efficient, and perhaps not the ones that you'd use in an active machine learning system, they're more easily accessible to as many people as possible, because we wanted to maximize our reach. You also have to make decisions about where you're going to host this data. Now, although there are a number of repositories for open scientific data out there, the fact of the matter is most of them are not going to take releases of hundreds of gigs coming from a commercial institution.
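The lowest-common-denominator release layout described above — PNG images in a plain zip archive plus CSV metadata — can be worked with entirely from a standard library. The file names and metadata columns below are invented stand-ins, not the actual RxRx19 schema; the point is joining an image path back to its well metadata.

```python
import csv, io, zipfile

# Build a toy in-memory zip archive standing in for the image release.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("images/plate1/A01_w1.png", b"\x89PNG...fake bytes")

# Toy CSV metadata: one row per well, keyed by plate/well id.
metadata_csv = "well_id,treatment,concentration_uM\nplate1/A01,remdesivir,1.0\n"
wells = {r["well_id"]: r for r in csv.DictReader(io.StringIO(metadata_csv))}

# Join an image file name back to its well metadata.
with zipfile.ZipFile(archive) as zf:
    name = zf.namelist()[0]                       # "images/plate1/A01_w1.png"
    well = name.split("/", 1)[1].rsplit("_", 1)[0]  # "plate1/A01"
    record = wells[well]
print(record["treatment"])
```

The same join works at full scale because zip, PNG, and CSV readers exist on every platform, which is exactly the accessibility argument made in the talk.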
We already have a heavy investment in Google Cloud, and so we ended up hosting these on Google Cloud Storage and then linked to them from our landing pages on rxrx.ai. That's not the kind of thing that will work for everybody, but it's a really useful first step. And as it turns out, as I'll tell you a little bit later, as a consequence of that we've actually been able to put some interesting arrangements with Google in place for this data. And finally, you don't want to just drop a dataset on the community and then never come back to it. It's almost certain that the first time you put something out, there'll be bugs, or at least questions that people have about how to work with the data. The original RxRx1 release was done as a competition, and we hosted it on Kaggle and had a lot of back and forth there. For more recent releases, we created a GitHub repo with information, and we've been taking questions and bug reports through the GitHub issues interface. Now, I said that I was going to talk about how we go about releasing datasets. Licensing terms can be really contentious in industry, because there are commercial interests in play, and so it's important to separate out the goals that you have for each dataset. For RxRx1 and RxRx2, we licensed those under Creative Commons licenses that only allow non-commercial usage. Our interest in RxRx1 was in demonstrating the power of morphological profiling and convincing the community that, yeah, there's a ton of information you can get out of this data, and the data that we generate at Recursion is high quality. With RxRx2, we wanted to further demonstrate the power of Recursion's platform and what you can get out of it. So our interest here is in driving academic work, driving more learning from this data and reanalysis, without compromising the commercial IP interests that we have. For RxRx19, I'm proud to say we have explicitly disclaimed any commercial interest in this data.
There are patents that we have filed, but we've explicitly said that these are purely defensive, just to make sure that a bad actor can't lock up any potential drugs coming from this. And we've licensed it CC BY in order to drive broad sharing and collaboration: just make sure that you cite us, but we don't care what you do after that; it's in everybody's interest to make sure that this pandemic is solved as quickly as possible. And finally, another consideration for us was that we wanted to make sure that people could interact with the data. We're talking about a dataset that's almost 900 gigs in size, that has 400,000 images, and that deals with 1,500 to 2,000 different small molecules. This is not the kind of dataset that people can casually poke around in on their computer if they're just downloading the raw data. But I think it's really interesting to enable people to do that kind of interactive exploration. So along with our preprint, we also released an interactive server at covid19.rxrx.ai, allowing people to interact and play with the data. And I'll show a little demo of that here. You just open up your browser, go to covid19.rxrx.ai, and it loads up this web app that's built in a really nice open source framework called Dash. You can select the screen that we did on human cells or the ones on the Vero E6 monkey cells, and then you can pick the compounds that you want to see. What you get is this cool two-dimensional plot called a Prometheus plot, showing the results of the screen in the cytokine storm model on the left, and two repeats of the active viral infection in the middle and on the right. You'll also get a table of the hit scores as well as the compound structures that you selected, and it's then easy to go ahead and download the results to look at as you will. I think this server has been really powerful for enabling people to understand what you can do with the data and what is actually there.
So with that, what's happened now that we've released this data? We put out our first release, RxRx19A, in the first half of the year, and RxRx19B in August. And we've actually seen some really interesting traction from it. So like I said, we published RxRx19 on rxrx.ai, self-hosted, did all of that. But since then, because COVID-19 is such a global issue, we've been approached by a number of groups in order to share that data elsewhere as well. So I'm proud to say that our results are part of the ChEMBL COVID data portal. You can actually go into ChEMBL, and they've put up this great interface where you can see all the structures, hit scores, and so on, all in a unified place. The NIH and NCATS have an open data initiative that has also asked to host the microscopy images. And Google has a public datasets program, and in particular a COVID-19-related public datasets program, and I'm proud to say that they are now hosting that data on their end as well, in the public interest. Beyond that, I think it's been really useful in terms of communications. What I have here is a screenshot of a brief Twitter exchange that I had. It's a very reasonable question: hey, are you able to share what's been going on? It's really liberating to be able to say, of course I can share it. In fact, here's the URL, just look at it yourself. All of our data is up there. What's surprising is that not only is this useful in public communications, it's even useful in internal ones, right? I had a conversation with a couple of our scientists a month ago where they asked, hey, what's going on with this particular molecule? Hey, you can just look at the website, right? It's really easy to look this up. There are no complications or anything to remember. And I find it's really nice for hypothesis generation and exploration by a wide variety of people.
And finally, I think something that's been really cool for us is that we're actually using RxRx19A as a recruiting tool. One of the work sample tests that we use to evaluate new data scientists is to send them RxRx19A and ask a series of questions. It gives folks an opportunity to see our capabilities and see how cool our data is, and for us to see what interesting things people can do with this data set. So with that, I've said it a few times, just want to remind everyone: RxRx19 is available. It's a huge data set of images, metadata, deep learning embeddings, and an interactive server, present at rxrx.ai. If you're interested in drug screening work for COVID-19, either the virus or the cytokine storm that probably ends up causing most of the mortality later on, I encourage you to check it out. And just in conclusion, I've been really excited to work on this project. I think that generating these kinds of large data sets and making them available to the community is super valuable, and being able to host them on the cloud has really made that accessible to a much broader range of folks than would have been possible in the past. So thanks so much, happy to take any questions either here or in the Slack or in the Gather.Town after this session. Oh, right. Thank you very much for just another amazing talk. That was great. I think we have time for maybe one or two questions, either for Imran or Lucy and Kyle; direct message them to me in the intervening time. I'll just go ahead and ask Imran a question. I apologize, it's kind of a hackish question, but it seems to me that it would be quite useful to have single-cell transcriptome data for the cells in question. Any plans on doing that? Well, so there's obviously limited detail that I can comment on regarding our plans. What I will say is, we do have a couple of openings for folks working on next-generation-sequencing-based methods, both biologists as well as computational biologists and bioinformaticians.
So A, that should give you some ideas of directions in which we might head, and B, if you're interested, please come apply. What I will say is, something that I didn't talk about on the assay slide: even though single-cell sequencing has gotten really cheap, you might be looking at a buck a cell, something like that. Morphological profiling assays are even cheaper. You're talking literally one to two logs cheaper than even the cheapest single-cell sequencing. So our focus for scale has really been on the morphological profiling side of things. Some work done out of the lab of one of our scientific advisors, Anne Carpenter, actually showed that the number of perturbations that elicit a transcriptional response and the number that elicit an imaging-based phenotypic response are fairly comparable. So although there are certainly some interesting things that we think we could get out of single-cell RNA data on this, that's probably not something that's going to happen for this particular dataset. But if you're interested in working on that, please come join us. Awesome. Thank you. So, moving on, we have Yubin Kim. One thing I've noticed quickly: my thesis is sitting on a shelf somewhere and nobody will ever find it; hers, she is clearly very proud of, because it is the easiest-to-access doctoral thesis I have ever seen. But also, people ask me every day about clinical notes and parsing clinical notes, so I am excited about this talk and will be paying rapt attention. I hope you are too. So with that, Yubin, do you want to share your screen? And we will... Yes, let me get this going. Here we go. Can everybody see my slides? Absolutely, yeah. Okay, let me see if I can... You might want to hit present, but... There you go. Did that work? Yes. Perfect, okay, great. Thank you for the introduction. My name is Yubin Kim. I'm director of technology at UPMC Enterprises.
Just a little bit about me. I graduated from CMU, from the Language Technologies Institute, back in 2018. My thesis was in large-scale distributed search systems, but since I graduated, I've been with UPMC working on clinical notes and in the medical domain. In my current role, I work with stakeholders on both the provider and payer sides of UPMC to identify problems that can be solved using machine learning and NLP. One of the reasons I joined UPMC was because I was interested in working in an area where I could have access to real-world healthcare data. And I was certainly surprised at what that looked like in the real world when I first joined UPMC. So in this talk, I wanted to show you what EMRs, especially clinical notes, look like in real life, especially in a big system like UPMC. If you haven't heard of UPMC Enterprises, it is the technology and R&D arm of UPMC. At Enterprises, we have an in-house R&D team as well as a side of the house that works with startups and researchers at both Pitt and CMU, developing new technologies that will solve healthcare problems in the real world. One of the things we have developed at Enterprises is called Neutrino, which is a big document management engine. The thing about Neutrino is that it is a storage repository for all clinical notes that go through any UPMC facility, covering the past decade. So this includes progress notes, discharge summaries, radiology reports, pathology reports; any clinical note that is created within the UPMC system is stored in Neutrino. And furthermore, we have agreements with other hospital systems, like Butler and Lake Valley, and we have those notes as well when the patients going through those hospitals are covered by the UPMC health plan. So we get something like 700,000 new documents each week across 35 different sources. And this includes EMR vendors of all stripes and colors.
I don't even know half the EMRs listed on my slide, to be completely honest with you, but this is to illustrate that we have a very wide variety of sources of clinical notes. So long story short, it's a lot of data. And I'm going to show you what it looks like a little bit. And the news is that it's not good. So in the real world, we have an array of different issues in clinical notes. Some of the most well-known issues include ambiguity. The word "cold" can mean several different things: it could be the temperature, it could be a viral infection, and it is also a shorthand for chronic obstructive lung disease, which I did not know about until I looked it up. Abbreviations are very common: NKA, no known allergies; FROM, full range of motion. And astute observers will notice that "from" is usually considered a stop word when you're doing NLP; these are words that you typically strip out. So you can already see issues that occur in clinical notes. Clinical notes also include a lot of synonymy: heart attacks and myocardial infarctions often refer to very similar concepts. And of course misspellings, and especially dictation errors, turn up a lot in clinical notes. Dictation software is fairly frequently used, and the errors from that software are not usually corrected by the physician, so these all tend to make it into the EMR. Those are some of the most common, well-known issues, but today I wanted to introduce you to some that are maybe less frequently known. One issue that I've noticed in the clinical notes repositories is that physicians are busy, clinicians are busy; they often take shortcuts in writing out clinical notes, and they often make up abbreviations. So this is a fairly common abbreviation that I've seen, "appy", which stands for appendectomy, but that is not the only way this abbreviation is used. For example, it could be "appy" as in the appendectomy procedure, or "appy" as in appendicitis, the actual disease, or "appy" as in the actual organ, the appendix.
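To make the ambiguity problem concrete ("cold", "appy", and friends), here is a minimal sketch of context-cue disambiguation. The cue lists and sentences are made-up illustrations, not UPMC's actual vocabulary or system:

```python
# Toy context-cue disambiguation for an ambiguous clinical token.
# The cue lists below are illustrative assumptions, not a validated lexicon.
CUES = {
    "temperature": {"room", "compress", "pack", "weather"},
    "viral infection": {"cough", "congestion", "flu", "symptoms"},
    "chronic obstructive lung disease": {"copd", "smoker", "spirometry"},
}

def disambiguate(sentence, target="cold"):
    """Pick the sense whose cue words overlap most with the sentence."""
    tokens = set(sentence.lower().replace(",", " ").split())
    if target not in tokens:
        return "not mentioned"
    best = max(CUES, key=lambda sense: len(CUES[sense] & tokens))
    if not CUES[best] & tokens:  # no cue matched at all
        return "ambiguous"
    return best

print(disambiguate("patient reports cough and cold symptoms"))  # viral infection
print(disambiguate("applied cold compress to the ankle"))       # temperature
print(disambiguate("hx of cold"))                               # ambiguous
```

Real systems learn over much richer context, but the failure mode is the same: with no surrounding cues (the "hx of cold" case), the intended sense is simply unrecoverable from the text.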
So this made-up abbreviation is used in many ambiguous ways by the clinicians who are documenting these things. And of course, it could also be a misspelling of "apply", just to add insult to injury. Another common issue in clinical notes is this aspect of questions and answers that get written down into full text. In a lot of reports, you see basically multiple choice questions. So in the EMR, you're supposed to select a histological type for your tumor: they give you four options and you are supposed to select one particular option. This can appear in many different ways, as you can see on the slides. One way you can select things is by writing them down. Another way is to select a number. Another way is to have these little checkboxes that are checked off. And of course, if you're parsing this with NLP, trying to determine what the histological type of this actual tumor is, you want to say that it is a superficial spreading type tumor. A naive approach to NLP would pick up all of the option terms, but terms that are present in the note without being selected are not representative of what the patient is experiencing, so picking them up would be inaccurate; in reality, it is a superficial spreading type tumor. Another type of question and answer: social history notes contain a lot of these. A common example would be, "Do your friends influence you to use alcohol, tobacco, or illicit drugs?" And you would get these free-form answers like "yes to alcohol and drugs". And you would need to understand that it is not that the patient uses alcohol and drugs; it is that friends influence them toward alcohol and drugs, and it may not necessarily be that the patient actually uses alcohol and drugs. So there are a lot of ambiguities and nuances in clinical notes that you need to be careful of. Another common issue: sentence boundary detection.
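The checkbox pattern just described can be sketched as follows; the note text and regex are simplified assumptions about one particular layout, not a general-purpose parser:

```python
import re

# A made-up checkbox-style report fragment like the one described in the talk.
note = """
Histological type:
[ ] Lentigo maligna
[X] Superficial spreading
[ ] Nodular
[ ] Acral lentiginous
"""

def checked_options(text):
    """Return only the options whose checkbox is marked, ignoring the rest."""
    return [m.group(1).strip() for m in re.finditer(r"\[[xX]\]\s*(.+)", text)]

print(checked_options(note))  # ['Superficial spreading']
```

Naive term spotting, by contrast, would report all four histological types as present in this note.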
It's often one of the first steps for many NLP tasks. Even in the world of neural networks, sentence boundary detection is often a first step, to chunk things up into sentences to feed into your complicated machine learning network. And you'd think that would be easy, but often it is not. People are not necessarily the best at using periods or punctuation, even at boundaries where it would be obvious. And this is one particular example that I pulled from clinical notes; it has been de-identified and changed slightly to protect patient privacy. But this is an example of a real sentence that you might see in clinical notes. I'm not even going to try to read this, I'd probably trip over all the medical terms, but this is a very common example of a sentence. And so being able to do proper sentence boundary detection is a real issue as well. Oh, this is a fun thing: tables. So tables are terrible. Anybody who's worked with text data, I know, hates table parsing. Table parsing is difficult; you'd think that it would be easy, but it's hard to get things to work properly across lots of different types of notes. So this is an easy example of a table. These are vital signs of a patient. You can see that there are table headings and the actual values, but then there are also these normal ranges of vital signs in the table. So you need to know how to parse those out and not include them as part of a patient's medical history. But even more exciting are fishbone diagrams. If you don't know what fishbone diagrams are, these are basically shorthands that are often used in clinical settings to visually indicate a patient's vital signs, or important values like calcium levels, and things like that.
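The sentence boundary problem described above can be illustrated with a toy splitter; the abbreviation list is a tiny made-up sample, nowhere near a full clinical lexicon:

```python
import re

ABBREVS = {"dr", "mr", "mrs", "pt", "approx"}  # illustrative, far from complete

def naive_split(text):
    """Split after every period followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]

def split_sentences(text):
    """Like naive_split, but rejoin pieces ending in a known abbreviation."""
    out = []
    for part in naive_split(text):
        words = out[-1].rstrip(".").split() if out else []
        if words and words[-1].lower() in ABBREVS:
            out[-1] += " " + part
        else:
            out.append(part)
    return out

text = "Pt. seen by Dr. Smith. Vitals stable."
print(naive_split(text))      # ['Pt.', 'seen by Dr.', 'Smith.', 'Vitals stable.']
print(split_sentences(text))  # ['Pt. seen by Dr. Smith.', 'Vitals stable.']
```

Even this patched version breaks as soon as a period is missing entirely, which is exactly the long run-on case the speaker shows on the slide.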
They're standardized within a specific discipline; similar fishbone diagrams can mean different things across different medical disciplines, but it is a visual way for clinicians to glance at the thing and say, oh, okay, this patient is healthy, this patient is not. And when it gets turned into textual format, this is what it tends to look like. It's a giant mess. And what is worse, these are often mixes of spaces and tabs, so when you parse these things out, you need to know exactly how wide the tabs are in terms of spaces. In this particular example, I figured out that the tab is 10 spaces wide, but my tab settings are two spaces, so then it ends up looking like that, and it's impossible to parse the tables. Finally, I want to go over the titular issue my talk was named around: note types. In Neutrino, our document repository, we have close to 40,000 different note types. The biggest source of different note types is radiology: each type of radiology image generates a different type of note in the EMR system. But that still leaves us with 10,000 different note types that we need to parse through. And the reason why it is important to have an understanding of the different note types that exist in the EMR is, first of all, that certain note types are sensitive. Behavioral notes from psychiatric hospitals need to be treated at a higher level of confidentiality; not all medical staff are allowed to see these types of notes, you need a higher level of clearance, and similar with HIV status. And secondly, possibly more relevant to us as researchers, note types are also very, very diverse. Training on one note type does not guarantee performance on another. And it can be easy to fall into the trap of saying, oh, I trained this model and it has great F1 scores, but in reality, when you want to apply it to different note types, it may not work in your use case at all.
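The tab-width trap from the fishbone example can be reproduced directly; the row below and the ten-space tab stop mirror the talk's anecdote but are made-up data:

```python
import re

def parse_row(line, tab_width):
    """Expand tabs at the assumed width, then split columns on 2+ space runs."""
    return re.split(r"\s{2,}", line.expandtabs(tab_width).strip())

row = "140\t4.1\t101"      # e.g. Na, K, Cl values joined by tab characters
print(parse_row(row, 10))  # ['140', '4.1', '101'] -- columns recovered
print(parse_row(row, 2))   # ['140 4.1 101'] -- wrong width, columns fuse
```

Unless you know the source system's tab stop (10 here), narrow tab expansion leaves only single spaces between values and the columns silently merge.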
Another difficulty around note types in particular is that there is currently no clear way to organize or map these note types into a clean hierarchy. There is a LOINC document ontology that exists, but it's in a beta version, it's incomplete, and as far as I know, there hasn't been a lot of active work on developing out this ontology. So basically you're flying blind, and you're relying on the subject matter experts of your institution to help you try to figure out which note types belong together and which don't. Within UPMC Enterprises, we had an initiative to try to map note types into a standardized format; it was very manual and very, very time consuming, and it was done by a non-subject-matter expert, so the quality varies as well. Because this is such a big problem, I wanted to give you a little bit more of an in-depth understanding of what this really looks like. So I did some exploratory data analysis on a particular data set that was generated for the CARE project of the PHDA. A quick sidebar: the PHDA, the Pittsburgh Health Data Alliance, is UPMC's data sharing initiative. So if you have a collaborator at Carnegie Mellon University or the University of Pittsburgh and you have an idea for a collaboration, you can apply to work on a project that is funded by the PHDA. And this is a two-fold funding situation: not only do you get access to actual money, actual dollars, to fund your PhD students to do actual research with, you also get access to clinical information that UPMC has, to be able to do your research on. So this particular project was for disambiguating abbreviations in clinical notes. We de-identified five million different patient notes, covering 12,000 note types across 10 different sources. And again, radiology accounts for most of the note types, but that still leaves a substantial number of note types that need to be managed. So the analyses that follow were done on this particular data set.
As mentioned before, the document distribution over note types is highly skewed. The really long tail of note types that have very few documents is mostly from radiology, and about 343 note types from the PHDA data set contain 90% of the documents. Some of the most popular types were patient call logs and office visits, and office visits is a particularly interesting note type because it contains a very wide variety of different kinds of notes, which I'll talk about a little bit more later. To have a look into what the note types look like, I took the section headers of the clinical notes and parsed them out using the SecTag vocabulary; this was based on a paper that was published at AMIA. And you can see that across these three different note types here, the headers differ. This is a nutrition assessment note type: whether or not a patient is eating well; if a patient has malnutrition, they generally go talk to a nutritionist and this type of note is generated. This is a radiology note, I believe for a CT scan of the head. And these are EC70s, call logs from Epic. And you can see that the section headers that are present are very, very different. The frequencies of the sections are also very, very different. To look at these in a graphical format: on the x-axis are the different sections, ordered by frequency rank, and the y-axis shows, for a specific note type, what percentage of those notes contain that particular section. So this graph will give you a sense of how varied notes are within a single type. For example, for the EC70 call logs and the CT head scan radiology notes, there are a handful of sections that are present in almost all of the notes, and then there's a rapid drop-off. So from that, you might be able to conclude that these notes have a fairly specific format.
Whereas for these nutrition assessment notes, you're seeing a lot more different sections and a lot more variety in how frequently those sections appear. And this is just a handful of note types that I pulled out to graph what the sections look like, but you can see there's a wide variety in how coherent a note type is within its own type. To give you another view of the heterogeneity within a note type: these are both plastic surgery post-op notes, and they look fairly different. Again, these have been de-identified and changed to protect patient privacy. But you can see that the length is very different, the kind of information in them is very different, the section headers are very different. So even within a single note type, you have a lot of variety. And across note types, it can be apples-and-oranges different. To give you a sense of how important this is, we did an experiment where we were doing entity detection, trying to detect generic mentions of entities within clinical notes. We used a state-of-the-art CRF-based model from a paper, and we trained this model on just 1,000 notes of 600 different types, and we tested it on two different sets. Test set one contained note types that were mostly seen in the training set, and here you get not great, but okay, numbers: around 0.6 precision, recall is kind of low, okay F1 numbers. In test set two, we tested the model on note types that were wholly unseen, and you can see that there is a really sharp drop-off in performance. This is a night and day difference; these are fairly abysmal numbers. Which is one of the reasons why getting an understanding of note types, and making sure that you have different note types in your training data, is so important. I wanted to quickly go over one possible solution that you can use to try to tackle this particular problem.
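The per-note-type section-coverage statistic behind the frequency-rank plots discussed above can be sketched like this; the note types and headers are invented examples, not the PHDA data:

```python
from collections import Counter

# Made-up notes: each note is the set of section headers it contains.
notes = {
    "nutrition_assessment": [
        {"Diet History", "Weight", "Plan"},
        {"Diet History", "Labs"},
        {"Weight", "Plan", "Supplements"},
    ],
    "ct_head": [
        {"Indication", "Findings", "Impression"},
        {"Indication", "Findings", "Impression"},
    ],
}

def section_coverage(docs):
    """Fraction of notes containing each section, ordered by frequency rank."""
    counts = Counter(sec for doc in docs for sec in doc)
    return [(sec, c / len(docs)) for sec, c in counts.most_common()]

for note_type, docs in notes.items():
    print(note_type, section_coverage(docs))
```

A coherent type like the CT head notes sits at 100% for a few sections and then drops off sharply, while the nutrition notes spread their coverage across many sections, which is the shape difference the plots show.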
This is based on a paper that an intern of mine and I published in the Health Search and Data Mining workshop at WSDM 2020. What we did was use k-means clustering to cluster different note types based on textual and section-header-based similarities. And we saw that these clusters often aligned with the different source systems, and a little bit with the manually labeled types that we had, so we are semi-confident that these clusters are okay. Based on these clusters, we repeated the earlier experiment, this time training the NLP model on note types within the same cluster; each of these clusters contains lots of different note types. We saw some mixed results, but generally speaking, when the model was trained and tested within the same cluster, the training and testing performance were similar. And in the clusters where we saw poor performance, we investigated and saw that those clusters were actually less coherent. We computed a correlation between the silhouette coefficient of the clusters and their test scores, and we saw a strong correlation: when the clusters were very tight and coherent, the corresponding test scores were very high. So I rushed through that last bit quite a bit, but that is what I wanted to convey. Clinical notes in the real world are complicated, and there are a lot of challenges to tackle, a lot of low-hanging fruit. Please come talk to us about the PHDA program. Please work with us on helping solve these problems. That was fantastic. Thank you so much. We're running a bit behind time, and if we extend into the break session, we want to do that mostly with questions. So we're going to quickly transition over to Jeremy Weiss.
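The silhouette coefficient used in that last result can be computed from scratch; the 2-D points below stand in for note-type embeddings and are purely illustrative:

```python
import math

def silhouette(points, labels):
    """Mean silhouette over all points: (b - a) / max(a, b)."""
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        a = sum(same) / len(same)  # mean distance within own cluster
        b = min(                   # nearest other cluster's mean distance
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == lab)
            / labels.count(lab)
            for lab in set(labels) if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

labels = [0, 0, 1, 1]
tight = [(0, 0), (0, 1), (10, 10), (10, 11)]  # coherent clusters
loose = [(0, 0), (4, 5), (10, 10), (5, 4)]    # incoherent clusters
print(round(silhouette(tight, labels), 3))  # close to 1
print(round(silhouette(loose, labels), 3))  # near or below 0
```

This is the quantity the speaker correlated with test scores: tight, coherent clusters score near 1, and those were the clusters where within-cluster training transferred well.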
Jeremy is going to talk about actually taking that EMR data and bringing it back into clinical rounds. If you look at Jeremy's web pages, one thing you see very quickly, and a really salient piece of information, is that he is looking for grad students and postdocs. If you like what he has to say here, please contact him. With that, Jeremy, if you want to share your screen, we will get going, and then we'll have a little Q&A session for everybody after. Great. Thanks very much, Ben. Can you see my screen? Absolutely. Great. Hello. Let's try to get this up and running. Okay. So my name is Jeremy Weiss. I'm an assistant professor at Heinz College here at Carnegie Mellon. And today I want to talk about one reuse of data, that of structured electronic health records. So I won't be focusing on notes like the last talk, but on the structured component, and thinking about how we could integrate them and bring them to clinical rounds using machine learning techniques. So I'd like you to put on your doctor hats, put on your doctor shoes, don the white coat, and off we go. So this is a representation that we might pore over during rounds. We've got one individual; this is one patient's data across time, and there are many different clinical events being represented on the y-axis. This is actually from MIMIC. It's a de-identified data set, so the time is kind of scrambled, but nonetheless the unit of time here is one day. And on the y-axis you might have, say, a glucose value, you might have vital signs, procedures, other clinical events that are documented over time, and that's what's going to populate the health record for an individual. And so now we're here at this blue vertical line and we need to make decisions. What should we do next? How do we best take care of this individual?
Okay, so you're in your white coat, so you think, oh well, let's go talk to the patient: let's elicit symptoms, let's perform a physical exam, assess that information, propose some interventions, and then once you decide, oh, I think these are probably what I should do, you'll go back to discuss with the patient, obtain consent, and act. And you'll do this for, let's say, a dozen patients every day, and you'll do this every somewhere between two to six hours, maybe two to twelve hours. At the end of the day, you might go write your notes, but for now you'll have access to the previous days' notes and a lot of other information and measurements in structured form in the health record. And that's what you're going to be working with. Okay, so what if you don't know what to do? You've got a COVID-positive patient, they have chronic obstructive pulmonary disease, COPD, and, well, you're not sure, so maybe you go consult the specialist. That's probably the most common thing that you'll do, but you might also go back to the literature. You might review guidelines such as UpToDate, or you might go back and look at randomized controlled trials or observational studies, and this will help you deal with your clinical equipoise. Equipoise is a state of balance: you're not sure quite what to do. There's a fourth thing that you could do, which is use a predictive model. Now, many predictive models are used in the formulation or construction of guidelines or scientific evidence bases, and to some degree there are some embedded in electronic health records that help guide the decision-making process so that you can deal with your clinical equipoise. Now, I want to talk about how we bring this into the timeframe of actually treating a patient.
So you could take any of the many risk scores that are out there and apply them to the patient, and those risk scores would come from other sources, from large data sets or studies that have been done previously. But you're not really sure, when you approach an individual patient, what your question will be, and so there may not be any answer that you can draw quickly from the existing resources. That's why you might turn to a predictive model that's built within your own system. Okay, so here's that picture again. I'm going to focus on forecasting models built from this information, because we're trying to assess risk over time, because that may influence what we end up doing in the way that we treat this individual. Okay, so what are forecasting models typically used for? Well, they can quantify risk, the risk of an outcome we'd like to avoid. They can assess urgency: how quickly are those outcomes that we want to avoid going to occur? And then, does this individual belong to some subgroup with particular characteristics, where we know something about that subgroup and we can say, oh, because they're in the subgroup, we should act in this or that way. Now, there's good and bad with taking such an approach, taking electronic health records data and building models right at the point of care. On the good side, there's clearly a lot of data present, and there's a strong belief that there's high-quality content there, because right now physicians and clinicians, when they're looking at the health record, are using it not only to document but also to make decisions, because it's such a useful information source right there. On the downside, this data is quite messy. We just saw an example of the messiness of notes, but structured data is also messy in a lot of ways.
In particular, electronic health records are kind of like digital exhaust; they're passive data collection. You don't get to control what is measured, at least if you're not in total control of the system, and likely you're not: you don't control the entire system, you have the ability to control some small aspect of it, and of course you're always working with patients. So what's the status quo here? Our goal is to use this data as a complementary evidence source, and currently it takes months to years to conduct an analysis, publish it, and then determine whether this published model will apply to the patient at hand. But at this time scale, we're trying to do this without even having the question in hand until we see the patient present, and then being able to say something during the course of that encounter. So the time scale is really a big challenge here. Okay, so can we do better? One of the things that the machine learning for healthcare community is addressing is forecasting models and how we disseminate them, how we deploy them so that they can be used. So with the growth of machine learning, we've seen a growth in the machine learning for healthcare subfield, and now we have our own slew of conferences where we can communicate within this particular subfield, so that we're ensuring that the things we produce are really relevant for healthcare. And within this sub-area, there are a number of themes. We want high-performing and generalizable models; that's a good emphasis and a continuing emphasis. We've also seen outgrowths of machine learning to achieve a particular property: there are whole fields devoted to fairness, interpretability, and robustness, as a few examples.
And there's also been a push into pipelines: this idea that we want to be able to take in relatively unstructured data and develop predictive models that are actually useful to the end users, potentially clinicians or patients themselves. I'll focus on this last one, and particularly on forecasting within it. So of the end-to-end models that are in development or have been published, they oftentimes look a lot like this: maybe there'll be a Dockerfile, or here's a script and you're going to have to enter a hundred different arguments that characterize the parameters of what you want your algorithm to do. And you have to do this all upfront. So here's an example. This is from Mihaela van der Schaar's group; it's the Clairvoyance tool, and this is the type of call that you would make to produce a predictive model from some slightly processed data. And it's potentially hard to know how you would want to make all of these design choices upfront. So there's a possibility that maybe we should do this in a step-by-step process, and that's something that my lab has thought about and developed. So instead, what you might consider is using a visualization tool. We've built one; it's called TLA. In this case, because it's an interactive process, you can investigate the data and data details, like issues with the data, correct them, and then keep going through the machine learning processing pipeline, okay. And because our desired end users are people in the health space, we want this to feel a little bit like an electronic health record, so it should feel familiar to the users who want to use it. So here's a schematic of this machine learning pipeline. There are fundamentally four different panels: a cohort extraction panel, a timeline representation panel, a modeling panel, and an assessment panel.
And because we want it to feel like an electronic health record, we want to be able to show everything at the level of the individual, but we also need to be able to do the processing, and we divide that processing into in-memory approximate computation and then slower processing, where we're querying a database or pulling data from disk. To be interactive, we can't use all of the data; we have to look at subsets so that we get fluid interaction. There are a number of design choices built into this. Again, it should feel like an electronic health record. We want it to be reactive, but we also want it to be representative of the entire cohort. So we think a lot about how we can use visualization for machine learning and machine learning for visualization. And aside from that, there are also these checklists that are common in healthcare but are moving into this predictive or prognostication space, and we align ourselves with those. So let's take a look at the tool. I've pulled it up; I think I'm sharing my screen, so you should be able to see it. Here is one patient, and I'd like to show you what this tool allows you to do. I'll show you two parts: the cohort selection part, and then the assessment part, which I'll just very briefly highlight. So here's the data for one patient across time, and maybe we're interested in mental status. We might be interested in something called the GCS score, and we might consider it as a potential outcome. If we were to add that as an outcome, it will be highlighted in the data. And here now we can see that GCS is being measured repeatedly. So let's talk about decreases in mental status, which would be a low GCS score. What we can do is annotate this: we type in GCS, we find GCS, and say, if it's smaller than or equal to eight, that's a low score.
We add it, and that annotates the system; it takes a minute and then updates. Then all we have to do is add back our new annotation. And what we see is that we've actually identified a decrease in mental status occurring relatively early in this patient's clinical trajectory. If you wanted to verify that, you could go look at the values of those GCS scores, and in fact you would see that there was a transient decrease for this patient at that particular time, okay? So now maybe you want to select some features because you want to do some modeling. Let's select these features and get a sense of what kinds of features are represented here; these are primarily lab events. We'll go down here and add some features, and you see that they're represented. Now they're going to be features, and we're going to be predicting this orange as an outcome. And there's a whole bunch of other things you can do here. I'm going to put in some windowing, which is to say I want to focus only on the time period during which the patient is inside the intensive care unit. And now there's a line down here at the bottom indicating that, okay? When you're done with all this processing, you'd probably do a slower and more refined version: you would come down here, click the version button, and that would process the data and create a refined data set for you up here. Okay. Now you'd do this through the rest of the pipeline, but I just want to show you the end result of something that you might get. Over here in the assessment panel, I've loaded a representation object and a modeling object. And now we're able to see that we have a survival analysis for these individuals. This particular one is looking at low platelet counts, the onset of severe thrombocytopenia, but that's not too relevant for the demonstration of the tool. So here we can see we have survival curves over time.
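The outcome annotation and windowing steps from the demo can be sketched in a few lines of plain Python. Everything here is invented for illustration (the event tuples, the ICU windows, the function name); it's just the logic that an interactive tool like the one demoed automates behind its annotation panel, not the tool's actual implementation.

```python
# Hypothetical long-format events: (patient_id, hour, variable, value).
events = [
    (1, 0,  "GCS", 14),
    (1, 6,  "GCS", 7),
    (1, 12, "GCS", 13),
    (1, 30, "GCS", 6),   # falls outside the ICU window below, so ignored
    (2, 0,  "GCS", 15),
    (2, 8,  "GCS", 15),
]

# Windowing: restrict to the (hypothetical) ICU stay per patient.
icu_window = {1: (0, 24), 2: (0, 24)}

def first_low_gcs_onset(events, threshold=8):
    """First time each patient's GCS drops to <= threshold inside their window."""
    onset = {}
    for pid, hour, var, value in events:
        lo, hi = icu_window[pid]
        if var == "GCS" and lo <= hour <= hi and value <= threshold:
            onset[pid] = min(onset.get(pid, hour), hour)
    return onset

print(first_low_gcs_onset(events))  # → {1: 6}: patient 1 dips at hour 6
```

The point of the interactive tool is that a clinician expresses the "GCS ≤ 8 inside the ICU stay" rule by clicking, rather than writing and debugging code like this.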
So over the course of 20 days, the high-risk group, the one in purple, has relatively lower survival: about 70% survival free from having very low platelet counts. And that's based on a number of different features. If you want concordance plots, if you want to see whether the predictions match the actual rates, you can plot them here. And there are a bunch of different toggles that let you choose different settings. So if I wanted 10 groups, it will take a minute to process and then update with the results. So that's pretty nice. And then down here, if you're interested in some of the associations between the features that were included in the model and the outcome, you get a forest plot, and this is something that you can work with. So all of this can be done relatively quickly. It's a nice interactive tool that you can provide to healthcare professionals who understand the intricacies of the health process but may not necessarily have all the details of how to code all of these different little design choices throughout the process. Okay. So that was a whirlwind tour of the visualization tool. What I've shown here is really just the classic setup of doing survival analysis using a classic method, glmnet Cox. A big part of what my lab does is focus on developing methods for temporal analysis from electronic health records data. We've created a suite of methods, and I've listed a bunch of them down here. I'm a little over time, so, okay. First I'll highlight that if you want to play around with this visualization tool, you can click on this link, or you can go to my CMU website, and there will be a link on my research webpage. To get the fuller, more complete access, you have to use these credentials: demo and reverse. Okay. So just to highlight one other type of research that I do within this temporal framing.
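The survival curves shown in the assessment panel are standard survival-analysis output. As a minimal sketch of what such a curve is (not the tool's actual implementation, and with invented toy data), here is a bare-bones Kaplan-Meier estimator:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.
    times: observed times; events: 1 if the event occurred, 0 if censored."""
    pts = sorted({t for t, e in zip(times, events) if e == 1})
    surv, s = [], 1.0
    for t in pts:
        at_risk = sum(1 for x in times if x >= t)          # still under observation
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        s *= 1 - d / at_risk                               # survival drops at each event time
        surv.append((t, s))
    return surv

# Toy high-risk vs low-risk groups (hypothetical days to severe thrombocytopenia):
high = kaplan_meier([2, 4, 4, 7, 10, 20], [1, 1, 1, 0, 1, 0])
low  = kaplan_meier([5, 12, 20, 20, 20, 20], [1, 0, 0, 0, 0, 0])
print(high[-1], low[-1])  # the high-risk curve ends far lower than the low-risk curve
```

In the demo, the same kind of curve is fit per risk group from a penalized Cox model rather than raw group labels, but the reading is identical: a lower curve means worse event-free survival over the 20-day window.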
One thing that I'm interested in is understanding risk as opposed to classification, even though they're kind of two sides of the same coin. Most classification algorithms are trying to divide, let's say, black from orange points in this graph. But oftentimes in healthcare, we're interested in the risk for every individual, not just the ones that are close to the decision boundary. And unfortunately, most classification algorithms will focus on the decision boundary, because that's exactly where there's the greatest uncertainty, the highest entropy. But in healthcare, oftentimes you want to know the actual risk of the low-risk individuals. What this particular method does is identify how to appropriately focus on all individuals throughout the plot, but in a temporal frame. So instead of looking at classification, we're looking at risks across time, and we're thinking about rates of events. And if you think that this black point is at relatively low risk of being classified as yellow, if that's the outcome you want to avoid, well, what is that risk? That's what this set of methods will do. So practically, what does that mean? I'm going to jump straight to some results. If you were to train an LSTM, a deep learning model, and you tried two objective functions, one the standard maximum likelihood approach and the other a harmonic mean type of approach, what we see in our experiments is that the harmonic mean approach will straighten out the tail. So in the concordance plot, we're straightening out the tail; we're getting better predictions on our low-risk individuals. In simulation, the effects are strong; you see them to some degree in real data sets. So here I'm again pulling from MIMIC-III, looking at decreases in mental status, or Glasgow Coma Score. And here I'm showing the maximum likelihood versus harmonic mean approaches in the calibration plot.
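The talk doesn't spell out the exact objective, so the following is only one reading of a "harmonic mean type" aggregation, shown purely to contrast with the standard arithmetic-mean maximum-likelihood objective; both function names and the toy probabilities are invented, and the lab's actual method may differ.

```python
import math

def nll_maximum_likelihood(probs):
    # Standard objective: arithmetic mean of per-example negative log-likelihoods.
    return sum(-math.log(p) for p in probs) / len(probs)

def nll_harmonic(probs):
    # Hypothetical "harmonic mean type" aggregation of the same per-example
    # losses. The harmonic mean is pulled toward the SMALLEST losses, so
    # well-fit (often low-risk) examples keep contributing to the objective
    # instead of being drowned out by hard boundary cases.
    losses = [-math.log(p) for p in probs]
    return len(losses) / sum(1.0 / l for l in losses)

# Toy predicted probabilities of the correct outcome (all < 1, since a
# zero loss would break the harmonic mean):
probs = [0.9, 0.8, 0.6]
print(nll_maximum_likelihood(probs), nll_harmonic(probs))
```

Under this reading, the gradient of the harmonic-mean objective with respect to a small per-example loss is relatively larger than under the arithmetic mean, which is one way to see why such an objective could improve calibration in the low-risk tail.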
First of all, the subgroups identified by the maximum likelihood version aren't as low in terms of their empirical rates or hazards as the ones identified by our subgrouping. And you might say, well, maybe the orange points look a little bit further away from the diagonal line, which is what you want. But in fact, I've plotted this on a log-log scale, so if you look at the absolute differences for the lowest risk groups, the absolute difference is actually smaller for the orange points, the harmonic mean method, as compared to the maximum likelihood approach. So this allows you to identify the risk in people who are not really at super high risk. Oftentimes you might think of identifying risk factors that are applicable to the whole population, but if all you're using are algorithms that focus on high-risk individuals, oftentimes you won't get to see the associated risk factors for the broader population. Okay, so in summary, I've shown two different tools that my lab has produced. The first is TLite; the goal is to make it much simpler for healthcare professionals to produce analyses, going from, let's say, months into minutes, or, maybe a bit optimistic, into hours to days. And this can help them address their clinical requirements. The second point is that risk estimation is really important, and for forecasting, at least within healthcare, maybe more important than classification tasks. We've illustrated one method: if you want to look at general populations, or in particular low-risk populations, here's a method you might think about. These build together toward risk estimation that is useful for the individuals who are taking care of patients, these healthcare professionals, and this is something that we're pushing out to the community. So thank you for your attention. This is joint work with a whole bunch of people; I've listed some up here. Thank you. All right, thank you.
And we are way over time, but that second part was totally worth it. I mean, really, when you started, I was like, uh-oh, and then I was like, this is awesome; we're just gonna wait a minute. But that said, we have eight minutes till the next session, and we're going to try to have a brief Q&A for all of the speakers. That's great. My question was where to get this. If you want to pop that slide back up that had the demo, that would be awesome; I'd love to see that real quick. I've got questions for a few of the speakers. First, Yubin. And I think a lot of the clinicians in the audience know the answer to this, or never want to hear the answer to this: has UPMC tried any behavior modification with docs to help with those? For example, asking them not to use abbreviations. Oh, dear God. That's hard. That's hard. Yes. Okay, yeah. This is a question for Imran, but anyone can answer: how do you handle human subjects considerations, such as removal of personal identifiers, both direct and indirect, for extremely large datasets? Something that probably everybody in the room deals with a little bit. I can take this one. Within UPMC, we have an automated tool that goes through text and tries to strip out names and the identifiers that are defined as PHI by HIPAA. It's difficult in text; you can't do it 100%, but we do the best we can. And this is one of the reasons why sharing clinical notes as data sources is so difficult: there is no clear way to do this such that we would be 100% certain that all PHI is stripped out. So it's definitely a big challenge, especially for free-form text. Cool. Anybody else want to jump in on that? Yeah. I mean, that's kind of madness, but yeah, it has to happen. And yeah, just encouraging people to look things up in terms of collaboration, and hopefully neutrino and other systems will start working in a wider way with other health systems. Anybody else? Any other questions?
All right, well, with that, I'd like to thank all the speakers. Thank you so much. We have about a five-minute break. Obviously, the vast majority of you are using laptops, so please take those anywhere you need to go. And we will reconvene in five minutes for the panel led by Sean Davis. So we will see you back in five minutes. Thanks again to all the speakers, and that's it for this session.