Hello, my name is Sophia Shalhout and I'm a postdoc and data scientist working in David Miller's lab at Massachusetts General Hospital and Harvard Medical School. David is a medical oncologist, dermatologist, and clinical informaticist, and he's the director of the Merkel Cell Carcinoma Center at Mass General. He'll be delivering the second half of our talk today, and we're both really excited to share some of our latest work here. Briefly, we oversee the Harvard Cancer Center Merkel cell carcinoma patient registry, and we work toward developing data science solutions for pretty much every aspect of this program, from trying to come up with ways to automate data capture from the electronic health records to developing data science tools to aid in analytics and outcomes research. Today we're really excited to discuss one such product that we use in our multi-institutional patient registry, and that's GENETEX, a genomics report text mining R package and Shiny app designed to capture real-world clinical genomic data. We have no relevant disclosures for today's talk. I'll start with an overview. First I'll discuss the unmet need GENETEX was designed to address. Then I'll provide details on the accompanying REDCap-based genomics instrument; essentially, the instrument is used to house the structured, abstracted data. We'll review the GENETEX schema and provide a demonstration of the application, and then we'll finish off with limitations and solutions. So what is the unmet need? Clinical genomic data obtained from routine clinical practice can greatly increase our understanding of clinical oncology, but a number of barriers impede capitalizing on these critical real-world data: for example, prolonged time to analysis, secondary to the difficulties of actually capturing the data from heterogeneous sources, or the challenges of attempting to process the vast amounts of genomic information.
These hurdles increase the time to insight from real-world data and threaten our ability to fully capitalize on advances in molecular and information technologies. There are a number of sources of clinical genomic data, and we like to break them down into two. The first is commercial vendors: PDF documents and reports are often provided by the commercial vendors for their respective test results. Here I'm showing a Guardant360 report and a FoundationOne report in PDF format. More commonly, though, clinical genomic data are found in text files and pathology reports from the institutions, embedded in the electronic health records. Most institutions these days have their own institutional next-generation sequencing platform, and they often report those sequencing results in text files or embed them in pathology reports in the electronic health records. Moving on to the genomics instrument: we've previously published the REDCap-based genomics instrument on our website, themillerlab.io. There we provide the data dictionary as a downloadable CSV file. Once downloaded, it can be uploaded into REDCap, and the genomics instrument can then accompany the GENETEX R package and Shiny app; it's essentially designed to house the mined, structured data. To get details on how we optimized the user experience and user interface, as well as to obtain the data dictionary, you just have to find the post titled "Genomics Electronic Data Capture Instrument" under our "Optimizing Real World Data Collection" posts. The instrument was designed to capture relevant clinical metadata and, of course, the sequencing and genomics results. For example, in the first field you can place a string describing the tissue that was biopsied and sent off for sequencing; we call it a lesion tag. In this example, we performed next-generation sequencing on a right anterior thigh skin primary Merkel cell carcinoma.
The metadata is obviously very important. Some of our patients have, say, 20 lesions, and we send off primary lesions and metastatic lesions for sequencing, as well as perhaps blood samples for liquid biopsies, so the metadata is essential here. We specify the primary cutaneous tissue, provide the date of acquisition, and record the sequencing platform used, for example, here, FoundationOne. And then, obviously, the instrument is designed to capture that rich genomic data and potential biomarkers, such as mismatch repair status and tumor mutational burden. As I said before, we've tried our hardest to optimize the instrument's UX/UI with the manual data abstractor in mind, as well as to accompany the GENETEX app. For example, a manual data abstractor need only start typing the beginning of a gene variant name and an autofill drop-down menu appears. This helps navigate the full, comprehensive HUGO gene name list and streamlines data capture. But we're here today because we appreciate that there are data science tools that can help circumvent the need for a purely manual, labor-intensive, time-intensive data capture method. Thus, we designed and developed GENETEX to augment data capture. Here I'm providing a schematic of how GENETEX works. Briefly, as I mentioned, the sequencing results and reports are often contained in the electronic health records. Whether the results are PDF reports from commercial vendors or institutional pathology text files from the institution's next-generation sequencing platform, GENETEX takes that input; imports, parses, tokenizes, wrangles, and transforms the data; and then ultimately uploads and imports it into the genomics instrument in REDCap. Now I'll hand it off to David Miller to explain how exactly GENETEX accomplishes this, along with a demo. Thank you for that introduction, Sophia.
What we want to do now is provide an overview of some of the natural language processing techniques that the GENETEX package utilizes. NLP is simply the discipline of making human language understandable to computers. Ultimately, we want to convert unstructured data into structured data. We use a variety of techniques in the package; here we'll highlight tokenization and regular expressions. This is an example of a corpus of text. Presented this way, as free text, it is a challenge for data scientists to analyze. One way to simplify it is to tokenize the text: take each individual word and place it into a specific cell in a data frame. This is an example of tokenized text as tidy data. In addition to tokenization, we made use of regular expressions, which are sequences of characters that specify a search pattern. We created a variety of regular expressions to help us capture gene names, nucleotide sequences, and amino acids. By pairing tokenization and regular expressions, we are able to target specific elements of clinical genomic reports to abstract. Now let's go ahead and provide a demonstration of the GENETEX application. First we're going to show you how to install it. It's available at our Miller Lab GitHub repository, where we provide a README page with an overview and the code to install it using the devtools package. On the repository, we also have the code for the Shiny application, which you can copy and paste into an RStudio project, as shown here. Then you can run this code, which will pull up a web browser running the GENETEX application. We're going to go ahead and open this in our browser. This is what the user interface looks like for the GENETEX package. We have a variety of input boxes and drop-down menus that allow us to direct this information into REDCap. That's a good segue; let's talk briefly about REDCap.
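As a rough illustration of how tokenization and regular expressions pair together, here is a minimal sketch in base R. The report line and the two patterns are simplified stand-ins for illustration, not the actual regular expressions the GENETEX package uses.

```r
# Illustrative sketch (not GENETEX's actual code): tokenize one line of a
# genomic report, then match tokens against simple search patterns.
report_line <- "BRAF c.1799T>A p.V600E variant allele fraction 42%"

# Tokenization: split the free text on whitespace, one token per element
tokens <- unlist(strsplit(report_line, "\\s+"))

# Hypothetical patterns for an HGVS-style protein change and nucleotide change
protein_pattern    <- "^p\\.[A-Z][0-9]+[A-Z]$"     # e.g. p.V600E
nucleotide_pattern <- "^c\\.[0-9]+[ACGT]>[ACGT]$"  # e.g. c.1799T>A

tokens[grepl(protein_pattern, tokens)]     # "p.V600E"
tokens[grepl(nucleotide_pattern, tokens)]  # "c.1799T>A"
```

Because each token lives in its own cell, the matched elements can be wrangled directly into the columns of a tidy data frame downstream.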
REDCap, the Research Electronic Data Capture system, is a widely used platform that I think many of you are familiar with. Here we have a project called GENETEX that we're going to be using for demonstration purposes. In this project we have only one record, and we're going to add to this project using the GENETEX application. We're going to capture data from the sample report that we see here. This is a sample OncoPanel report: a genomic analysis of a primary tumor from a Merkel cell carcinoma patient. This is a fictionalized, synthetic patient; there is no actual patient protected health information here. It's a good example of the data that might be contained in an OncoPanel report, where we have a series of information that would be desirable to capture. We have a number of actionable findings, such as tumor mutational burden and mismatch repair status, as well as a number of genetic variants found in this tumor. There's a lot of data here that we want to capture, but also a lot of data that is extraneous and that we do not want to capture. You could certainly capture this with classical abstraction, capturing the information manually. But to make it more streamlined, you can simply copy this information to the clipboard and paste it into the GENETEX application in the browser, and then you can add other information, including the subject ID. Here we'll call it 2. Next is the instrument instance that you want to put the data into; if you have a series of lesions that are sequenced, you can put them into repeating instruments. Then you select the genomics platform that was used to generate the data. In this example we used OncoPanel. We also have an input box that allows you to put in metadata about the lesion itself. This is optional; for example, let's say we captured a right arm primary.
Next, you have a dropdown to select the lesion type. This is of interest in our Merkel cell carcinoma registry, adding specificity to the type of tumor that's being captured. Is it a primary tumor? Is it a metastasis? Is it a local recurrence, or a blood biopsy? For this example, we are capturing a primary tumor. Next, you can enter the date the tissue was obtained, here, for example, January 1st, 2021. Now we're going to put in information that directs the application to a specific REDCap project. You need to enter the REDCap address for your project, which we'll do here, and you need to enter your API token. This is a specific password that each user has, and because it's private, we've hidden it. But it is essential for the package to know where to send the information; under the hood, we're using the REDCapR package to direct this information directly into REDCap. Now that we've captured all the relevant and necessary information, we click the "Run GENETEX to REDCap" button. Now let me direct your attention to the sidebar, where we have three tabs. The first is the Inputs tab, which we used to enter the data. The next is the Report tab, which shows us an exact facsimile of the report that we just captured. This is an important quality control step: you can look at the report and make sure that what you pasted into that input box is exactly the report of interest. Finally, you click on the third tab, the Data tab, which calls the GENETEX-to-REDCap function that does the text mining. The output is a data table of two columns: one column of the variables of the genomics instrument, and one column of the results, which is the information that we either entered into the initial browser or, more importantly, that was abstracted by the GENETEX application.
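The under-the-hood import step described above can be sketched with the REDCapR package. The field names, URI, and values here are placeholders chosen to mirror the demo, not the genomics instrument's actual variable names.

```r
# Sketch of sending mined, structured data into REDCap via REDCapR.
# All field names and the URI are illustrative placeholders.
library(REDCapR)

abstracted <- data.frame(
  record_id                = "2",                  # subject ID from the demo
  redcap_repeat_instrument = "genomics",           # repeating instrument name
  redcap_repeat_instance   = 1,                    # instrument instance
  genomics_platform        = "OncoPanel",
  lesion_tag               = "right arm primary",
  date_obtained            = "2021-01-01",
  stringsAsFactors         = FALSE
)

redcap_write(
  ds_to_write = abstracted,
  redcap_uri  = "https://redcap.example.edu/api/",  # your project's API address
  token       = Sys.getenv("REDCAP_API_TOKEN")      # keep the API token private
)
```

Reading the token from an environment variable, rather than hard-coding it, is one simple way to keep that per-user password out of shared scripts.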
And here's another quality control step: we can look at the report side by side with the data table from GENETEX to make sure that the information we captured is indeed accurate. After we have captured that and run GENETEX to REDCap, we can go to our REDCap project, and after we refresh the browser we see that, instead of one record, we now have two records; that information was just sent through the API into REDCap. The information that we saw in that data frame has now been transmitted to the genomics instrument. And you see the information that we captured: we had a right arm primary lesion obtained on January 1st, 2021, and we see the lesion type and the genomics platform that was used, OncoPanel. Importantly, we see all of the genes and genetic variants that were captured. Again, you could use this step as another opportunity to do quality control. After you've gone through and confirmed the information is accurate, you can go ahead and save and exit the form. So that's our demonstration. It's important to note that we have the ability to capture text files and also PDFs, because many of the platforms use PDFs, and we want to be able to capture from both. Now that we've seen the package, we want to make a comment about its integrity. We analyzed the time to capture and the accuracy of the information: we had two data abstractors capture seven genomic reports, and had the GENETEX package capture those same seven. We found, as you might imagine, that it was much more efficient to capture using the GENETEX package. Importantly, we also found good agreement rates: greater than 99% agreement between abstractor one and GENETEX, and between abstractor two and GENETEX. So GENETEX was fast and also quite accurate.
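As a small illustration of ingesting both input formats mentioned above, here is a sketch in R. It assumes the pdftools package for PDFs, and the file names are placeholders.

```r
# Illustrative sketch of importing a report in either supported format.
library(pdftools)  # assumed dependency for PDF extraction

# Institutional NGS results delivered as a plain text file
report_text <- readLines("oncopanel_report.txt")   # one string per line

# Commercial vendor results delivered as a PDF report
pdf_report <- pdf_text("vendor_report.pdf")        # one string per page
```

Either way, the result is a character vector of free text that can then flow into the same tokenization and text mining steps.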
Finally, we want to touch on a few limitations and potential solutions. The first limitation is that the optimal functionality of GENETEX requires a REDCap project. Importantly, GENETEX can still perform text mining on clinical genomic data and create structured data tables without one, and that data can be analyzed on its own, but the overall application was clearly designed to take that information and store it in REDCap as part of a broader registry. Second, GENETEX currently only supports four genomic platforms. However, the code is open source, and developers can add code to support other platforms as needed. And finally, GENETEX is for research purposes only, and clinical decisions should not be made on its output. With that, we want to conclude by acknowledging the Linux Foundation and the R/Medicine organizers for inviting us to give this talk. We want to thank Project Data Sphere, the American Skin Association, and ECOG-ACRIN for funding this research. We also want to thank the Project Data Sphere program manager, Ravi Kumar Commander, who helped with some thoughtful edits for our manuscript. Finally, I want to thank all of you for your time and attention, and we'll be happy to take questions. Thank you.