Hi, my name is Karandeep Singh. I'm an assistant professor of learning health sciences, internal medicine, urology, and information at the University of Michigan, and I'm excited, on behalf of our research group, to present clinspacy, an R package for clinical natural language processing using spaCy, scispacy, and medspacy. Historically, clinical NLP has been fairly Java-centric. Most of the existing technologies have relied on the Apache Unstructured Information Management Architecture, also known as UIMA, which was originally developed at IBM. Two prominent examples of UIMA-based software are MetaMap and Apache cTAKES. MetaMap was developed by Dr. Alan Aronson at the National Library of Medicine, and Apache cTAKES was developed by a team of researchers at the Mayo Clinic before eventually becoming an Apache project. While both of these tools are powerful, they require a lot of configuration, and they don't seamlessly integrate with data science languages like R and Python. A few years ago, a Python package became available that turned the natural language processing space on its head: spaCy. In contrast to UIMA, spaCy was relatively easy to set up. It was readily available in Python, supported multiple languages, offered multiple pre-trained pipelines, integrated easily with neural networks, and came with pre-trained word vectors. With the rise of machine learning and neural networks in natural language processing, having pre-trained word vectors was a huge plus that simply wasn't there in other clinical NLP pipelines at the time. With all these features, and because it was built to be fast, spaCy saw wide adoption in Python. And spaCy can do a lot of things that clinical NLP tools simply can't.
In contrast to the UIMA-based tools, which were fairly difficult to set up and required figuring out a lot of dependencies, including the right Java version, spaCy was easy to install. You can install it right from pip in Python, to the extent that anything in Python is easily installable; this is about as easy as Python installation gets. There were also already interfaces opening up from spaCy to R. This was evident in the spacyr package, which is fantastic, and in other R packages like cleanNLP that provide interfaces to spaCy. As I mentioned earlier, spaCy supports entity vectors, or embeddings, which we'll talk about more later in the talk, as well as multiple language models. All of this has led spaCy to develop a fast-growing ecosystem. If you go to the spaCy website and look at its "universe," you'll see several packages available in Python and other languages that are built on spaCy, including cleanNLP, which, as I mentioned, provides an interface to spaCy from within R. However, spaCy is not sufficient on its own for clinical NLP. If you look at output from the spacyr package, which wraps spaCy, you'll see that spaCy is great at identifying tokens, lemmas, parts of speech, and entities, but it can't recognize clinical entities. The basic English language model that comes with spaCy isn't built for clinical or biomedical text. For example, it won't parse "chronic kidney disease" as a single entity; it will split it into separate entities instead. It also can't map phrases to Unified Medical Language System, or UMLS, codes. The UMLS is a metathesaurus: you can think of it as a dictionary of dictionaries that ties together many commonly used clinical vocabularies.
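To make the R interface concrete, here is a minimal sketch of what parsing text through spacyr looks like, using spacyr's documented setup and parsing functions; the example sentence is illustrative.

```r
# Minimal sketch: parsing text from R via the spacyr package.
library(spacyr)

spacy_install()     # one-time setup: installs a conda environment with spaCy
spacy_initialize()  # load the default English language model

parsed <- spacy_parse("The patient has chronic kidney disease.")
# Returns one row per token, with lemma, part-of-speech, and entity tags.
# The general-purpose English model will not treat "chronic kidney disease"
# as a single clinical entity.
head(parsed)
```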
And so without support for the UMLS, spaCy can't serve on its own as an alternative to existing clinical NLP tools. It also can't recognize negations and hypotheticals, and if you've read clinical notes, you know there are lots of situations where phrases are used hypothetically or with negation. For example, part of the workup for someone who comes in with, let's say, shortness of breath is to figure out whether they have chest pain. So it's not uncommon to see a sentence asserting "the patient denies chest pain," in which case we assume the patient does not have chest pain. Similarly, if someone comes in with chest pain and gets a CT angiogram, the reason for that CT angiogram might be to rule out a pulmonary embolism. The phrase "rule out PE" doesn't mean they have a PE; it just means a PE is possible, and it's uncertain whether they have one. scispacy and medspacy are additional Python packages that add functionality to spaCy and help quite a bit. scispacy was developed by the Allen Institute for Artificial Intelligence. It brings biomedical language models to spaCy, including entity vectors, or embeddings, that are specific to clinical and biomedical terms. scispacy also includes a UMLS entity linker, which means it can link phrases to UMLS codes. This is exciting because, assuming the mapping is done correctly, the phrase "CKD" would get mapped to the same UMLS code as the phrase "chronic kidney disease." medspacy was developed by a highly experienced research team, including folks at the University of Utah, and it incorporates several pieces. The pieces we use in clinspacy are the ConText algorithm, which detects negations and hypotheticals as well as references to family members, and the sectionizer, which tries to figure out which section of a note a given concept or entity is found in.
For example, a medication listed under the medications section of a note might just mean that it's a medication the patient already takes, whereas a medication listed under the plan might mean that it's being newly prescribed. But scispacy and medspacy are also incomplete. Neither package directly converts annotations to a tidy data frame format, meaning one row per entity. There is a Python package called dframcy that does this, but nothing in R. And neither package, nor spaCy itself, supports transforming data from a one-row-per-entity format to a one-row-per-patient format. So suppose you have a dataset with two patients, one of whom is 58 and the other 42, each with a systolic blood pressure and a phrase containing their history. If you're going to build a prediction model from this, you'll need to convert it to a format where each entity in the history column becomes a separate column with a count of how many times it showed up. Neither scispacy nor medspacy can do this out of the box. And notice how one patient has no hypertension: we want that to be recognized as an HTN value of zero, even though the word "HTN" shows up in the history. This is where negation detection matters, so that we can exclude negated mentions, or, if we want to include them, set that manually. So the goals of clinspacy are pretty simple. We want to make setup and installation easy for all the Python components. We want to make annotation of clinical text simple. We want to make the results tidy. We know that processing many items can be slow, so we want to make it easy to write the results to file and then pipe those file names directly into other functions.
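The reshaping problem described above, going from one row per entity to one row per patient with one count column per entity, can be sketched in plain tidyverse R; this toy example is not clinspacy itself, and the data is made up for illustration.

```r
# Toy sketch of the one-row-per-entity -> one-row-per-patient transformation.
library(dplyr)
library(tidyr)

entities <- tibble(
  patient_id = c(1, 1, 2, 2, 2),
  entity     = c("diabetes", "htn", "ckd", "htn", "htn")
)

entities %>%
  count(patient_id, entity) %>%
  pivot_wider(names_from = entity, values_from = n, values_fill = 0)
# One row per patient; each entity becomes a column of counts,
# with 0 filled in where the entity never appeared.
```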
And then we want to be able to create this one-row-per-patient format by binding the results onto the original data frame. This binding took inspiration from tidytext's bind_tf_idf function, which binds the term frequency, the inverse document frequency, and the tf-idf onto the existing data frame. Except in clinspacy, we have two functions that do this. One is called bind_clinspacy, which supports one column per entity: in this case, for example, diabetes would be a column, hypertension would be a column, and the cell values would represent the counts for each entity. We also wanted to support entity vectors, so that if a patient has diabetes and hypertension, each of those maps to a vector of a certain dimension, and we can take the averages to end up with one row per patient as our final format for modeling. And then, of course, we want to make working with UMLS data easy, provided your machine has the RAM to handle scispacy's UMLS entity-linking requirements, which turns out to be 12 gigabytes. All right, so let's look at clinspacy under the hood. Once you've installed clinspacy and load it with library(), you'll get a message where clinspacy tells you that, by default, if this is your first time, initializing clinspacy will install Miniconda, create the clinspacy conda environment, and install all the dependencies within that environment. You can override this behavior if you want, but we're going to go ahead and use the default clinspacy_init() function to get things initialized. So what does clinspacy_init() do under the hood? It makes setup and installation easy. First, it installs Miniconda, or, like I said, you can override this if you want. It configures a new conda virtual environment called clinspacy.
It installs the correct package versions, which is no small feat: there are multiple version dependency issues between scispacy and medspacy. For example, scispacy supports spaCy 3, but medspacy does not. clinspacy_init() takes care of all those versioning issues, including the correct versions of the scispacy language models. In general, you should install clinspacy from CRAN. But right now, the latest version of Miniconda, which installs Python 3.9, is actually incompatible with spaCy 2.3.0, so for the moment I'd recommend using the GitHub version of clinspacy. This should be fixed in clinspacy 1.0.3, which should be submitted to CRAN relatively soon. Once you run clinspacy_init(), you'll see it working its magic. Here it's installing all the dependencies, and several minutes later it finishes, loads and imports scispacy, spaCy, and medspacy into a pipeline, and loads the default large scientific language model. All right, so the first goal of clinspacy is to make annotation of clinical text simple. If we feed in a simple vector, "This patient has diabetes and CKD stage 3, but no hypertension," we get back a data frame with one row per entity. Here there's no mapping to a UMLS code; it's just the entity and the lemma, plus some attributes from the medspacy package about whether the entity is historical, hypothetical, or negated. And if you look down the output, you can see that hypertension is picked up as being negated. If you feed clinspacy a vector with multiple items, then by default it will assign different clinspacy IDs to those items. And here you'll notice that "diabetes mellitus" and "DM" actually get assigned different entities, because in clinspacy's, or rather scispacy's, default language model those were treated as two different entities, even though they might refer to the same thing.
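Putting the setup and annotation steps together, a minimal session looks roughly like this; the input sentence is illustrative, and the exact output columns depend on the installed clinspacy version.

```r
# Minimal sketch of a first clinspacy session.
library(clinspacy)

clinspacy_init()  # first run: installs Miniconda + the clinspacy conda env

clinspacy("This patient has diabetes and CKD stage 3, but no hypertension.")
# Returns one row per entity with the entity text, its lemma, and
# medspacy-derived attributes (e.g. negation/historical/hypothetical flags);
# "hypertension" should come back flagged as negated.
```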
You can also pass a data frame right into clinspacy. If you do that and provide the df_col argument to tell it which column contains the text data, it will process it and give you the same output as before. This is beneficial because then you can refer to identifiers that are already in your dataset. For example, if you had a row ID like the id column here and provided the df_id argument, you could process the output the same way, except now the clinspacy ID refers to the existing identifier in your dataset. There are two types of diabetes here. There are type 1 and type 2 diabetes in medicine, but that's not what this is about: what we're trying to disambiguate is "DM" from "diabetes." And there are really two ways we can do this. One is to map all entities to UMLS CUIs, which are the UMLS concept codes. Or we can try to map the entities into a vector space where, hopefully, "DM" and "diabetes mellitus" sit close to each other. Let's first try mapping all the entities to UMLS codes. How do we do this? We first turn on the linker, which is the UMLS entity linker, by rerunning clinspacy_init(). Make sure you have enough RAM available before you try this. Once we've turned the linker on and we run the same function as before, pointing out which column contains the text, you can see that we get a huge mess. "Pt" maps to Portugal; "Pt" maps to positron emission tomography, or PET scan. The one thing "Pt" doesn't map to is "patient." "Diabetes mellitus" is correctly mapped to its definition. And if you look at "HTN" and "CKD," it's pretty cool, because you can see they correctly map to hypertensive disease and chronic kidney disease, respectively. But did we at least fix the DM issue? It turns out we didn't.
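A sketch of what this looks like in code, assuming clinspacy's documented use_linker, df_col, and df_id arguments; the data frame here is made up for illustration.

```r
# Sketch: enabling the UMLS entity linker and annotating a data frame.
library(clinspacy)

clinspacy_init(use_linker = TRUE)  # needs roughly 12 GB of RAM

notes <- data.frame(
  id   = 1:2,
  text = c("Pt has DM and HTN.", "Pt with CKD stage 3.")
)

clinspacy(notes, df_col = "text", df_id = "id")
# With the linker on, each entity row also carries UMLS information
# (CUI, semantic type, definition), keyed to your own id column.
```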
So "DM" here is thought of as potentially referring to dexamethasone or dextromethorphan, but not to diabetes mellitus. So entity linking isn't perfect. It's slow. We can remove some extraneous matches by restricting by semantic type, but lots of mismatches remain, and it struggles with abbreviations and disambiguation. My personal preference is to keep the linker off, but it depends on the task. So how do we put these predictors in a model? clinspacy returns one-row-per-entity output, but to build a model, we want one row per patient. To put our predictors in a model, we can rerun clinspacy_init() with the linker off and then use the bind_clinspacy function. bind_clinspacy does exactly what it sounds like: it takes your clinspacy output and adds a column for each entity it identified. In this case, it uses the lemma, which is why everything is lowercase, and the numbers represent counts. Because things showed up either zero times or one time, you're seeing only zeros and ones. Again, it didn't quite solve our diabetes mellitus and DM issue, as we expected, but otherwise this is pretty close to what we wanted. We could also use embeddings here. An embedding is basically a multi-dimensional representation of an entity; you can think of it as points on a coordinate system. The thought process is that similar entities, like "diabetes mellitus" and "DM," should be close to one another on this coordinate system. Here's an example of a three-dimensional word embedding where you can see that "man" and "woman" are close to each other and "king" and "queen" are close to each other, but also that "king" relates to "man" the way "queen" relates to "woman," at least in terms of direction. So word embeddings can capture really intricate relationships, not just the meaning of a word but also its tense and its relationship to other words.
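The entity-counts step described above can be sketched as follows, using bind_clinspacy as documented in the package; the notes data frame is illustrative.

```r
# Sketch: from one row per entity to one row per note with bind_clinspacy.
library(clinspacy)
clinspacy_init()  # linker off (the default)

notes <- data.frame(
  id   = 1:2,
  text = c("Diabetes mellitus and HTN.", "No hypertension. CKD stage 3.")
)

output <- clinspacy(notes, df_col = "text")
bind_clinspacy(output, notes)
# Result: the original columns of `notes` plus one column per lemma,
# where the cell values are entity counts (zeros for negated/absent terms
# if negated mentions are excluded).
```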
clinspacy includes two sets of entity embeddings: the scispacy entity embeddings, which are available when the linker is off, and the cui2vec embeddings, which are only available when the linker is on. Because scispacy embeddings are returned as part of the spaCy pipeline, we need to explicitly request them when running clinspacy. So if you look at the code below, you'll see that if you want the scispacy embeddings, you have to set return_scispacy_embeddings to TRUE. And you'll notice that the first three of 200 embedding dimensions are shown on the screen below. Here they've been rounded to two digits, although the actual embeddings carry many more. We turn these off by default for speed and memory reasons, but you can easily get them as long as you specify this argument in the clinspacy function. Even though "DM" and "diabetes mellitus" were assigned separate UMLS codes, when we look at them in the entity embedding space, and these are scispacy embeddings, we see that they're actually fairly close to one another. This is the code that shows how you would include just the first 10 embeddings as predictors using the bind_clinspacy_embeddings function, mapping the output and then linking it back to the text data, so that you can join those results back to the original file and end up with one-row-per-patient data. And this also shows, though I won't go through it, how you can save your results to output files to save memory and then pipe those output files directly into the bind_clinspacy functions; this is actually my preferred approach. clinspacy is a precursor to recipes. Because things take a while, I don't recommend that you use it inside of recipes, but rather that you pre-process text before you send it to recipes. For more details, feel free to check out our GitHub page and package documentation. Thank you.
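The embedding workflow described in the talk can be sketched like this, using the documented return_scispacy_embeddings argument and bind_clinspacy_embeddings function; the notes data frame is illustrative.

```r
# Sketch: requesting scispacy embeddings and binding averaged
# per-note embeddings for modeling.
library(clinspacy)
clinspacy_init()  # scispacy embeddings are available with the linker off

notes <- data.frame(
  id   = 1:2,
  text = c("DM and HTN.", "Diabetes mellitus.")
)

output <- clinspacy(notes, df_col = "text",
                    return_scispacy_embeddings = TRUE)

bind_clinspacy_embeddings(output, notes)
# Result: the original columns plus one column per embedding dimension,
# averaged over the entities in each note -- one row per note,
# ready to hand off to a modeling framework such as recipes.
```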