Okay, we're going to get started. Thank you for joining us today, where we'll be talking about analyzing clinical and genomic oncology data with the genieBPC and gnomeR packages. I'm Samantha Brown, and I'm here with Jessica Lavery and Karissa Whiting, and we're all research biostatisticians at Memorial Sloan Kettering Cancer Center. In terms of our agenda today, we'll start with a brief overview of Project GENIE and GENIE BPC. Then we'll look at the clinical genomic data processing pipeline and delve into a case study, where we'll work through clinical data processing with genieBPC and genomic data processing with gnomeR, and then we'll wrap it all up with a conclusion. The primary goal of GENIE BPC, or the Project GENIE BioPharma Collaborative, is to augment the existing registry of genomic data from AACR Project GENIE with enhanced clinical (phenomic) data to support clinical genomic analyses. The phenomic data in GENIE BPC are curated using the PRISSMM curation model to capture detailed information on many different aspects of the EHR, such as cancer diagnoses, treatment regimens, disease status from radiology and pathology reports, and medical oncologist assessments. These are structured over several different data sets with hundreds of different variables, and one reason the package is so helpful, which we'll get into in a few minutes, is that it streamlines working with all of these data sets. Analyses using linked clinical and genomic databases such as GENIE BPC will help drive advancements in precision oncology by identifying the genomic alterations and therapies that optimize clinical outcomes. To give a broad overview of the genomic data included in AACR Project GENIE: genomic data come in various formats and types, and the AACR Project GENIE data repository is comprised of one type of genomic data, tumor DNA sequencing assays.
These are data collected from tumor samples via biopsy or resection, and they compare the DNA sequence in cancer cells with that of normal cells. The sequencing assays can be broad or targeted: broad regions refer to whole genome or whole exome sequencing, whereas targeted regions refer to gene panels. The GENIE data consist of data from targeted gene panels from high-throughput sequencing assays, which we also refer to as next-generation sequencing. The GENIE BPC data are publicly released by cancer cohort: non-small cell lung cancer, colorectal cancer, breast, pancreas, prostate, and bladder. New versions of the data are periodically released to include additional patients and variables and to incorporate data corrections. The data files are stored as CSV and text files, and they're available for download from Sage Bionetworks' Synapse data sharing platform. Downloading each file individually poses challenges for efficient and reproducible workflows, which leads us to our package genieBPC: a pipeline to programmatically access the data corresponding to each release from Synapse, to support reproducibility, and to create data sets linking clinical and genomic data for analysis. Moving on to the gnomeR package, this package provides a consistent framework for genomic data wrangling, processing, visualization, and analysis. Okay, so for our demo today, we're going to actually be working with the GENIE BPC data. The first step is to register for a Synapse account. Jessica has put the Synapse link in the chat, and if you could just follow the instructions for registering, we'll give you a few minutes to do that. One thing to note: please do not connect to Synapse via your Google account; make sure you create a username and password. Okay, we're going to move on. If you have any questions, please put them in the chat and we'll be monitoring throughout the hour.
Okay, so the next step for our demo is to install the genieBPC and gnomeR packages. The instructions to download these packages are shown here, and they're also in our GitHub repository, which Jessica just put the link to in the chat, in the demo.R script. Further, our package details are available on the package websites linked here. Just note that the R packages require an R version greater than or equal to 3.5. Okay, so with that, we'll move into the data processing pipeline for the clinical genomic data. First, the data are stored on Synapse, which, as we mentioned, is the Sage Bionetworks data sharing platform. So the first part of the genieBPC R package is to import the data using pull_data_synapse and the synapse_version helper function. Then we'll get into data processing with the create_analytic_cohort and select_unique_ngs functions. We'll then go into data visualization with drug_regimen_sunburst, and move into genomic processing with gnomeR using the create_gene_binary and tbl_genomic functions. Okay, so we'll move into a case study that highlights all of these functions, where we create a cohort of patients who were diagnosed with stage IV adenocarcinoma non-small cell lung cancer and received carboplatin and pemetrexed with or without bevacizumab, or cisplatin and pemetrexed with or without bevacizumab, as their first cancer-directed drug regimen after diagnosis. And just a note to please follow along using the demo.R script on our GitHub; all of the code is provided there. Okay, so our first step is to actually import the data into R. To do this, you need to set your Synapse credentials using the set_synapse_credentials function, which stores your username and password for Synapse during each R session. So if you just run the function with your username and your password, then you'll get a message that you're connected.
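As a quick sketch of the setup steps we just walked through (the GitHub repository name and the placeholder credentials below are assumptions; use the links from the chat and your own Synapse login):

```r
# Install the packages: genieBPC is on CRAN; a development version of
# gnomeR can be installed from GitHub if needed (repository name assumed)
install.packages(c("genieBPC", "remotes"))
remotes::install_github("MSKCC-Epi-Bio/gnomeR")

library(genieBPC)

# Store your Synapse username and password for this R session.
# Reminder: this must be a Synapse username/password, not a Google login.
set_synapse_credentials(
  username = "your_synapse_username",  # placeholder
  password = "your_synapse_password"   # placeholder
)
```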
We also have the synapse_version helper function, which returns a table of the genieBPC data releases that are currently available. This function has one input, most_recent, which takes TRUE or FALSE. Calling most_recent = TRUE will return a table with each cancer cohort that's been released and its latest data release version, and calling most_recent = FALSE will return all data releases available. So if we call most_recent = TRUE, this is the table that's returned, and we can keep this in mind moving into the pull_data_synapse function. The pull_data_synapse function pulls the GENIE BPC clinical and genomic data directly from Synapse into R. It has two arguments: the cancer type, or cohort, and the version of the data, or version. Keep in mind that versions of the data are updated periodically, with releases of new variables and additional QA. This function returns a list of data frames for each cancer cohort for the accompanying version of the data release. So moving into our demo — and just know that all of these yellow slides involve everyone running code, and we'll give you a minute or two to do that — we can first load the genieBPC package library and set the Synapse credentials using your username and password for Synapse. Then we'll run the pull_data_synapse function to pull the non-small cell lung cancer, or NSCLC, version 2.0 public data. We can name that object nsclc_synapse_data, as shown in the demo script, and what's returned will be a list of data sets. Okay, we're going to move on to the data processing aspect of the package, and I'll pass it off to Jessica for that. Okay, great. Thank you, Sammy. So we'll talk about data processing for the clinical data to start, covering the functions create_analytic_cohort and select_unique_ngs — that one will come a little bit later — and we'll also talk about visualizing some of that clinical data.
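The version lookup and data pull just described can be sketched as follows (the exact version string should match whatever synapse_version() reports for your session):

```r
library(genieBPC)

# Table of available data releases
synapse_version(most_recent = TRUE)   # latest release per cancer cohort
synapse_version(most_recent = FALSE)  # all releases

# Pull the NSCLC v2.0-public clinical and genomic data into R
nsclc_synapse_data <- pull_data_synapse(
  cohort  = "NSCLC",
  version = "v2.0-public"
)

# The result is a nested list of data frames for the requested release
names(nsclc_synapse_data)
```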
So the create_analytic_cohort function is used to create a cohort from the GENIE BPC data. As Sammy mentioned at the beginning, the clinical data that are part of GENIE BPC include cancer diagnosis information — cancer cohort, treating institution, histology, and stage at diagnosis — and we also have a lot of cancer-directed regimen information, including the regimen name, the regimen order, and the cancer that the drug regimen was given for, in addition to all of the pathology, radiology, and medical oncology information. The create_analytic_cohort function focuses on the diagnosis and cancer-directed regimen information to build a cohort, and then it returns all clinical and genomic data for the selected patients that met the criteria we specify. To go through the arguments for create_analytic_cohort: the first argument is data_synapse, which is the list that was returned from pull_data_synapse on the previous demo slide. Next is an argument indicating the index cancer sequence. In GENIE BPC, the term index cancer, or project cancer, refers to the cancer that has associated genomic sequencing. We're not going to touch on this too much today; it defaults to picking the patient's first cancer that had genomic sequencing, and that's what we'll leave it at. There are multiple institutions that contribute data to GENIE BPC. By default, create_analytic_cohort does not subset by institution, but if you wanted to, you could provide the institution input parameter to look at cases from a particular institution. We can build an analytic cohort based on stage at diagnosis, which we'll do for our case study. By default, no subsetting is done based on stage, but you can subset based on stage I, II, III, I-III not otherwise specified, or stage IV. Another way that you might build a cohort is based on histology, and for our case study, we're going to look at adenocarcinoma histology.
For this input parameter as well, the default is not to subset by histology, but the available histologies are listed to the right. When creating an analytic cohort by treatment information, you might be interested in a subset of patients who received a particular treatment, so you can specify regimen_drugs. There is a lookup table provided with the package so that you know exactly how the drugs are named in the data. Then regimen_type indicates whether you want patients who received the exact regimen you specified or a regimen that contained those drugs. So if you set regimen_drugs to carboplatin and regimen_type to exact, you would only get patients who received carboplatin alone. But if you set regimen_drugs to carboplatin and regimen_type to containing, you would get anyone who received any regimen that contained carboplatin — that would include carboplatin alone, or carboplatin given with other drugs, such as paclitaxel or pemetrexed, and so on. The other input parameters for building a cohort based on regimen are regimen_order and regimen_order_type. regimen_order refers to the order of the cancer-directed regimen. You might want to pick the first regimen that a patient received for their cancer, so you'd say regimen_order is 1 and regimen_order_type is "within cancer". Or maybe you're interested in looking at the second time someone received a specific regimen, so you would say regimen_order is 2 and regimen_order_type is "within regimen". The final argument to create_analytic_cohort is return_summary, which specifies whether summary tables are returned using the gtsummary package. By default this is FALSE, but we really recommend setting it to TRUE so that you can take a look at the data returned by create_analytic_cohort in some nice summary tables.
So now we're on a demo slide, and just a reminder that our case study is based on building a cohort of patients who were diagnosed with stage IV adenocarcinoma non-small cell lung cancer and received carboplatin and pemetrexed with or without bevacizumab, or cisplatin and pemetrexed with or without bevacizumab, as their first cancer-directed drug regimen after diagnosis. To build this cohort, we're going to create an object called nsclc_cohort and call the create_analytic_cohort function. We'll first specify data_synapse — this is the list object that was returned from pull_data_synapse, so it's nsclc_synapse_data, and we'll indicate that we want the NSCLC version 2.0 data set. We'll specify stage at diagnosis as stage IV and histology as adenocarcinoma. We're interested in four different drug regimens, and we want those exact drug regimens — we don't want any other drugs given with those drugs. We want this regimen to be the first regimen that the patient received for their cancer diagnosis, and we will return the summary tables. So we'll pause here to give you a minute to run this code, and then we'll go through what the output looks like. Okay, let's take a look at what the result of running this create_analytic_cohort code chunk returns. The first summary table is called tbl_overall_summary, and it shows the number of diagnoses, regimens, and cancer panel test reports. We use cancer panel test report, or CPT, interchangeably with next-generation sequencing, or NGS. We see that we have data returned for 241 patients, and each patient had one diagnosis, since the default was to select one index cancer per patient; and because we specified a particular drug regimen and wanted it to be the first regimen, we have one regimen per patient.
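The call on the demo slide looks roughly like this; the drug name strings are my best reading of how the regimens are spelled in the data, so confirm them against the package's drug lookup table:

```r
library(genieBPC)

# Stage IV adenocarcinoma NSCLC, first regimen exactly one of the
# four platinum + pemetrexed combinations, with summary tables returned
nsclc_cohort <- create_analytic_cohort(
  data_synapse  = nsclc_synapse_data$NSCLC_v2.0,
  stage_dx      = "Stage IV",
  histology     = "Adenocarcinoma",
  regimen_drugs = c(
    "Carboplatin, Pemetrexed Disodium",
    "Bevacizumab, Carboplatin, Pemetrexed Disodium",
    "Cisplatin, Pemetrexed Disodium",
    "Bevacizumab, Cisplatin, Pemetrexed Disodium"
  ),
  regimen_type       = "Exact",
  regimen_order      = 1,
  regimen_order_type = "within cancer",
  return_summary     = TRUE
)
```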
So we have 241 regimens, and we'll come back to this in a little bit, but we see that most people only had one next-generation sequencing report — 92% — though there are a handful of people who did have multiple next-generation sequencing reports performed over time. We'll talk about how to handle this in the coming slides. The next summary table is a summary of the cohort. We can see that we have patients who have lung cancer. We did not subset by institution, so we get Dana-Farber, MSK, and Vanderbilt cases. Everyone was stage IV at diagnosis and their histology was adenocarcinoma, so our case study is clean. Basically this table just tells us that the function is working as we expected it to, but if you had additional strata — say you wanted to look at stage I and II — you would see the breakdown of stage on this table, or if you had multiple histologies, it would show you the breakdown of each of these characteristics across your cohort. The drug regimen table includes the cohort and institution variables again, and also tells us the proportion of patients that received each of the four drug regimens we specified. The most common was carboplatin plus pemetrexed, followed by carboplatin plus pemetrexed with bevacizumab, and then cisplatin plus pemetrexed with or without bevacizumab after that. Next is the summary table for the next-generation sequencing data. We see that the oncotree code, which is a way of characterizing the type of lung cancer, was most frequently LUAD, which is lung adenocarcinoma — that makes sense given that we explicitly specified adenocarcinoma histology. And then we see that there were multiple sequence assay IDs, meaning multiple panels were used to perform the next-generation sequencing. We'll talk a lot about the impact of having different panels, or patients sequenced on different panels, when we get to the gnomeR part of the demo today.
Next, we'll visualize the drug regimen data for patients in our cohort. We specified what we wanted their first drug regimen to be, but we can visualize all of the treatment information for patients in our cohort. This is an example of a sunburst plot to the right, and it's a way to visualize the complete treatment course for a selected cancer diagnosis. Each ring corresponds to a regimen, where the innermost ring is the first regimen, the second innermost ring is the second regimen, and so on. When you run this in R, it comes out as an interactive figure, so you can hover over bands in the plot to see the regimen name and the percent of patients that received that regimen pattern. The input parameters for this function are data_synapse, data_cohort, and the maximum number of regimens. In data_synapse we'll specify all of the data that was returned — so this is the entire non-small cell lung cancer version 2.0 public release. Then data_cohort specifies the specific cohort that we're looking at, returned from create_analytic_cohort. And max_n_regimens is just a way of subsetting the sunburst so that it doesn't get too unwieldy if you're only interested in the first few regimens. So we're at another demo slide, and we'll pause for a little bit here. We'll create an object called nsclc_sunburst, where we call the drug_regimen_sunburst function and specify data_synapse as nsclc_synapse_data with the non-small cell lung cancer version 2.0 object. Then data_cohort is our nsclc_cohort, and we're not going to specify a maximum number of regimens. So we'll pause here to give you a minute to run this and play around with the resulting sunburst figure. Okay, so on the next slide, we have the resulting sunburst plot.
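The sunburst call can be sketched as below (object names follow the demo script):

```r
library(genieBPC)

# Interactive sunburst of all treatment trajectories for the cohort
nsclc_sunburst <- drug_regimen_sunburst(
  data_synapse = nsclc_synapse_data$NSCLC_v2.0,  # full public release
  data_cohort  = nsclc_cohort                    # from create_analytic_cohort
  # max_n_regimens can be set to truncate to the first few rings
)

nsclc_sunburst  # prints as an interactive figure in the viewer
```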
If you hover over the orange band here, followed by the green band, you can see that 10.4 percent of patients had carboplatin plus pemetrexed followed by nivolumab. You can hover over the different trajectories, and looking at this, you can really see that there's a lot of heterogeneity in the treatments that patients received. You'll also notice when you run it interactively that there is a little box on the side that says legend, but we recommend turning that off and instead hovering over the portions you're interested in, using the legend up on the top left so that you can see which color corresponds to which regimen. Next, we're going to talk about genomic data processing, with one function from the genieBPC package and the rest coming from the gnomeR package. So we're on the last box here in our pipeline. We'll talk about create_gene_binary to process the genomic data for mutations, copy number alterations, and fusions, and we'll talk about tbl_genomic and a couple of other functions for summarizing the genomic data in tables and figures. So back to our case study: we have patients with non-small cell lung cancer in this nsclc_cohort object that we created with create_analytic_cohort. We'll talk about how to process the data into an analysis-ready matrix of gene alteration events, and then we'll summarize the genomic alterations and analyze differences between men and women in the study. We can see in the table to the right that 60% of the 241 patients that met our inclusion criteria were female. To recap the genomic data that are available in GENIE BPC: we'll be processing and analyzing data on mutations, copy number alterations (CNA), and fusions. The names of each of those objects are listed here, and more information about them can be found in the genomic data guide posted by AACR. So let's jump into a summary of some issues when processing multi-institutional genomic data.
One issue we saw in the next-generation sequencing data summary: most patients have only one next-generation sequencing report, but in the case that they have multiple, we need to figure out how to select a single sequencing report for analysis. We'll talk about that with the select_unique_ngs function. Other issues that we'll go into in greater detail in a little bit include data formats and gene standards often being inconsistent — variable names, data formats, and gene names may differ between studies, or sometimes even within studies. Next is cohort inclusion: in the mutations, fusions, and copy number alterations files, only instances of genes that were altered are included. So if someone did not have any alterations, they might not be in that data set, and we'll have to be really careful to make sure we don't accidentally drop patients from our analysis. And lastly, multi-institutional studies use several gene panels, so samples might be sequenced on different panels across different institutions, and even within the same institution over time, next-generation sequencing panels have evolved to include more genes. So we need to be careful about annotating mutation status and analyzing overlapping genes. We'll start with selecting one next-generation sequencing sample per patient, and we'll do this with the select_unique_ngs function. This function prioritizes characteristics of interest, and importantly, if the patient only has one report, it's returned regardless of the criteria in this function. The data_cohort argument is the output of the create_analytic_cohort function. The oncotree code is a way of characterizing the cancer diagnosis and is reported for all next-generation sequencing reports. sample_type specifies which type of genomic sample to prioritize if the patient had multiple next-generation sequencing reports — sometimes that happens if their primary tumor was sampled,
and then later on in time, their site of metastasis was sequenced as well. So you can specify which of the two you would prefer to include if they did have multiple. And lastly, you can specify selecting a sequencing report with respect to time, so you can select the earliest sequencing report or the latest sequencing report. So we're on another demo slide. We can see that in nsclc_cohort, the cohort_ngs data set had 262 rows, even though we only had 241 patients. So we know from here, and from our summary table returned from create_analytic_cohort, that there are patients with multiple sequencing reports. So we'll run select_unique_ngs, and we'll specify data_cohort as the cohort_ngs data set from nsclc_cohort. We'll specify the LUAD oncotree code — this is lung adenocarcinoma — so we'll prioritize a lung adenocarcinoma sample if one is available; if not, we'll look for a metastatic sample; and if there are still multiple, we'll pick the one that is latest in time. You can see that running this returns 241 rows, corresponding to one row per patient. So we'll pause here for a little bit to allow you to run this code. Next, we'll talk about formatting the data in an analysis-ready matrix. The create_gene_binary function from gnomeR will give us a data frame of N patients by X alterations, as you can see here. The alteration columns are denoted by the gene name if it's a mutation, such as TP53, or the gene name plus a suffix of .Amp, .fus, or .Del for other alteration types. So we can see here that MYC.Del is a deletion. Each cell will have a zero if there's no alteration, a one if there is an alteration, or an NA if that gene was not tested in that patient. To get the data into a standardized format, we need to talk about the data formats and gene standards that can be inconsistent in multi-institutional data.
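The selection rule just described — prefer a LUAD sample, then a metastatic sample, then the latest report — can be sketched as:

```r
library(genieBPC)

# One NGS report per patient; patients with a single report are kept
# regardless of these criteria
nsclc_ngs <- select_unique_ngs(
  data_cohort   = nsclc_cohort$cohort_ngs,  # NGS data from create_analytic_cohort
  oncotree_code = "LUAD",                   # prefer lung adenocarcinoma samples
  sample_type   = "Metastasis",             # then prefer metastatic samples
  min_max_time  = "max"                     # then take the latest report in time
)

nrow(nsclc_ngs)  # one row per patient
```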
So the gnomeR functions are designed to work with the standard alteration data formats used by common data platforms like cBioPortal. The GENIE copy number alteration and fusion data are slightly different from that standard, so we need to reformat them using the reformat_fusion and pivot_cna_longer functions. Looking at the reformat_fusion function: initially in the GENIE fusion data, we have, for example, these two genes that are fused together, and it's in a long format with one row per gene partner. When we run the reformat_fusion function, we get just one row to indicate one fusion between the two genes. Next, using pivot_cna_longer: the GENIE raw CNA data are in a wide format, but the standard is a longer format, so the pivot_cna_longer function transposes the copy number alteration data into a longer data set. So here we'll get ready for the demo, where we save the mutations, copy number alterations, and fusion data as objects in the R environment, and then reformat the fusions and the CNA data using the reformat_fusion and pivot_cna_longer functions. Now that we have the copy number alteration and fusion data reformatted, we're ready to run create_gene_binary, and we'll add additional arguments to create_gene_binary to help address the remaining data processing issues. The next issue that we're going to discuss is cohort inclusion, where samples with no alterations may be dropped when pulling the raw genomic data. The samples argument will ensure that all study IDs have a row in the resulting data, even if they're not present in the genomic files. We'll also cover how multi-institutional studies use several gene panels: samples may be sequenced using different panels and therefore may have non-overlapping genes that we need to annotate as missing. The specify_panel argument in create_gene_binary can insert NAs when we know that a gene was not tested for a specific patient or sample.
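The reformatting steps can be sketched as follows; the element names of the genieBPC list (mutations_extended, fusions, cna) are assumptions, so check names(nsclc_synapse_data$NSCLC_v2.0) for the exact spelling:

```r
library(gnomeR)

# Save the raw genomic data frames as objects (element names assumed)
mutations <- nsclc_synapse_data$NSCLC_v2.0$mutations_extended
fusions   <- nsclc_synapse_data$NSCLC_v2.0$fusions
cna_raw   <- nsclc_synapse_data$NSCLC_v2.0$cna

# Collapse the fusion data to one row per fusion event
fusions_reformatted <- reformat_fusion(fusions)

# Pivot the wide CNA matrix into the standard long format
cna_long <- pivot_cna_longer(cna_raw)
```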
To use the specify_panel argument, we first need to create a data frame indicating which patient IDs were sequenced on which panels. We'll use the data set from select_unique_ngs and pull the sample ID and the panel name, or assay ID, to create a two-column data frame of the samples and their corresponding panel names. After running with the specify_panel argument, we can see the difference: if we do not specify the panels, it looks like the genes that were not tested are simply missing, whereas when we include the specify_panel argument using the data frame of sample IDs and panel names, we see the correct indication that these genes were not tested for these samples. We also need to make sure that the gene names are consistent across all studies, and we do that with the recode_aliases argument. It's important to note that some gene names have evolved or been updated over time, and this argument ensures that all panels use the most up-to-date gene names for consistency. So I just threw a lot out there — let's give you some time to run the demo code to create the specify_panel object and run create_gene_binary. All right, I'll pass it off to Karissa to continue with the genomic data processing. Thanks, Sammy. So now I'm going to talk a little bit about how to analyze the data that we've just processed. We're going to use the data from the previous section to make some basic summary tables and do some exploratory analysis using the gnomeR package. As with processing data, there are some common issues that arise when you're analyzing multi-institutional genomic data. Firstly, often we're working with pretty large targeted panels, so if we want to do some analysis with clinical endpoints, or associations between gene frequencies and clinical characteristics, we might be conducting a lot of hypothesis tests.
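Putting the pieces together, create_gene_binary can be called roughly like this; the ID column names pulled from the select_unique_ngs output are assumptions about the GENIE BPC variable naming, so check the data dictionary:

```r
library(dplyr)
library(gnomeR)

# Two-column data frame: each selected sample and its panel
# (column names in nsclc_ngs are assumed)
panels <- nsclc_ngs %>%
  select(sample_id = cpt_genie_sample_id,
         panel_id  = cpt_seq_assay_id)

# Binary alteration matrix: 1 = altered, 0 = tested but not altered,
# NA = gene not on that sample's panel
gene_binary <- create_gene_binary(
  samples       = panels$sample_id,   # keeps samples with no alterations
  mutation      = mutations,
  fusion        = fusions_reformatted,
  cna           = cna_long,
  specify_panel = panels
  # recode_aliases harmonizes outdated gene names (left at its default here)
)
```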
As we know, the larger the number of statistical tests, the greater the risk that some of those significant findings are significant due to chance. Additionally, a lot of the time we're working with pretty low prevalence events, or low prevalence alterations, which are not very informative. One way to handle this that's often done is to choose a prevalence threshold — depending on your sample size, it could be 1%, 5%, or more — before you start conducting your analysis. You might also want to consider adjusting your p-values for multiple testing, and there are a couple of ways to do that; there are some gnomeR helper functions to help with this. Additionally, as I mentioned, a lot of the time we're working with low event frequencies or low prevalence alterations, so depending on your cancer type or disease area, it might be biologically meaningful to summarize on the gene level or the pathway level, as opposed to the specific alteration level. There are some gnomeR functions to help with that as well, which we'll talk about. Returning to our case study, we're going to take our processed gene binary object, which is just that processed binary data frame, and we're going to summarize the genomic alterations overall in the cohort, and also do a summary by sex — in this case just male or female, given the data that we have. In order to do this, first we need to join our clinical sex variable onto our processed genomic data frame. So on this slide — and this code is kind of long, but it is available in that demo.R file — we're just using dplyr select and joins to grab the patient ID and the sex variable and then merge those onto our processed gene binary object. So we'll give you a couple of seconds to do that. Okay, so as I mentioned, we might want to subset by a prevalence threshold, and the subset_by_frequency function in gnomeR will help you do that.
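A minimal sketch of that join, assuming a sample-to-patient mapping in the select_unique_ngs output and a sex variable in the cohort data (all column names here are assumptions; the demo.R script has the exact ones):

```r
library(dplyr)

# Attach the clinical sex variable to the processed binary matrix
gene_binary_sex <- gene_binary %>%
  left_join(
    nsclc_ngs %>%
      select(sample_id = cpt_genie_sample_id, record_id),  # names assumed
    by = "sample_id"
  ) %>%
  left_join(
    nsclc_cohort$cohort_ca_dx %>%
      select(record_id, sex = naaccr_sex_code),            # names assumed
    by = "record_id"
  ) %>%
  select(-record_id)
```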
You pass your processed gene binary data into subset_by_frequency and choose the threshold, which is the t argument and can be between 0 and 1, i.e., 0 to 100% prevalence. You'll get back a data set with your sample ID, as well as only the genes that meet or exceed the given threshold. Additionally, there's another argument, other_vars, that helps retain any other variables you might want to keep in your analysis data set. In this case, we want to retain the sex variable because we're going to do some analysis with it later, so we just pass it to the other_vars argument. Here's an example we can quickly run through, where we take gene_binary and pass it to subset_by_frequency. I chose a pretty high prevalence that probably is not too commonly chosen in a real analysis — a 40% prevalence threshold — just for example purposes, and I specified that I wanted to retain the sex variable in the final data set. As you can see, we started out with 1,400-ish genes; in the resulting data set we have six columns: one is the sample ID column, one is the sex column, and the other four are the only genes that met or exceeded that 40% threshold. So we'll give you about 10 more seconds to go through that. Additionally, I just wanted to point out this other helper function, subset_by_panel, that works in a similar way, but instead of subsetting by an alteration threshold, it subsets by a specific targeted panel. In the gnomeR package itself, there are a lot of built-in, commonly used panels in oncology research, so you can just call out one of those IDs in the panel_id argument — you can look at the documentation to see what those IDs are. Similarly, you'll get out a data set with just the genes within that given panel, as well as any other variables that you want to retain.
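Both subsetting helpers in one sketch; gene_binary_sex stands for the binary matrix with the sex variable joined on (name assumed), and the panel_id string is just an illustrative built-in panel ID — the documentation lists the real ones:

```r
library(gnomeR)

# Keep alterations with >= 40% prevalence, retaining the sex variable
nsclc_subset <- subset_by_frequency(
  gene_binary_sex,
  t          = 0.4,
  other_vars = sex
)

# Or subset to the genes on a specific built-in targeted panel
nsclc_panel <- subset_by_panel(
  gene_binary_sex,
  panel_id   = "IMPACT341",  # illustrative panel ID; see the package docs
  other_vars = sex
)
```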
So here, as an example, we looked at the MSK-IMPACT 300 panel, and our resulting data set has 220 genes that were both in our data set and in that panel, along with our sex variable. This is especially useful if you're working with multi-institutional data that might have several different panels at once. Next, we're going to create a simple summary table of our alteration frequencies using the tbl_genomic function. tbl_genomic is a function in gnomeR, but it is built on top of the gtsummary tbl_summary function with some specific defaults that make it helpful for presenting genomic data. If you haven't heard of it, gtsummary is a wonderful package written by Daniel Sjoberg that helps you create summary tables in R, and it's very useful for a lot of applications. Because the tbl_genomic output is a gtsummary object, you can also leverage the many customization functions available in gtsummary to make your table more presentable: you can use different statistics, and you can basically specify any aspect of the table. So check out the gtsummary documentation; we'll show an example of using the bold_labels function here. So here we can just run through it: we have subset our data, so this nsclc_subset is the 40%-or-higher cutoff version of our data, and we're going to pass it to tbl_genomic, and then add the gtsummary bold_labels on top just to make it look a little nicer. We get this really nice table out of the box on the right, and it's showing us those four alterations that reached the prevalence threshold. So as I mentioned, depending on the cancer area or disease type you're working on, it might make sense to analyze on the gene level as opposed to the specific alteration level.
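The table itself is a couple of lines; because the result is a gtsummary object, any gtsummary styling function can be chained on (nsclc_subset is the 40%-threshold matrix; name follows the demo):

```r
library(gnomeR)
library(gtsummary)

# Out-of-the-box summary table of the high-prevalence alterations
nsclc_subset %>%
  tbl_genomic() %>%
  bold_labels()  # one of many gtsummary customization functions
```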
And that basically means that you're going to bin together things like copy number alterations, deletions, or fusions on a gene with the mutations, and count any of those things as an event for that patient on that gene. So in gnomeR there is a helper function called summarize_by_gene() that you can pass your data to, and if your data is coded in the documented way, it will give you that data set summarized to events on the gene level. So here on the right is an example of what that gene-level table would look like if we ran it through the tbl_genomic() code. And I'll just point out KRAS, for instance: here we have 107 events in our 241 patients. And if we go back to the previous slide, you'll see KRAS is 98. And that's because the previous version is not at the gene level but at the alteration level, so those are just counting the mutations, whereas in this table, we're counting any sort of genomic event from our data as a KRAS event. Just to note that if you are going to use this, you definitely want to subset by frequency after you summarize by gene, because if you first subset by frequency, you might be getting rid of some low-frequency events that should count towards those gene-level totals. So first summarize, then subset, and then you can pass it to tbl_genomic(), or any plotting functions, or anything you want to do with it after that. Additionally, you might want to analyze even one level up from the gene level, which is the gene pathway level. So pathways are groups of genes that work together biologically to control processes like tumor growth, response to treatment, or cancer progression. And there are some common pathways often used in cancer research analysis; you can check out the paper cited at the bottom that lists some of those common pathways. And a lot of these pathways are built in by default to the package, to allow you to process your data and analyze these pathways pretty quickly.
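The ordering point above (summarize first, subset second) can be sketched like this, again with the hypothetical nsclc_binary placeholder:

```r
library(gnomeR)

# Order matters: summarize to the gene level FIRST, then apply the
# frequency threshold, so low-frequency alterations still contribute
# to their gene's event count before any filtering happens.
gene_level <- nsclc_binary |>
  summarize_by_gene() |>
  subset_by_frequency(t = 0.4)
```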
You can do this using the add_pathways() function. So basically, if you take that processed binary data set and pass it through add_pathways(), it'll add a column for each of these common oncogenic pathways onto your data set. So here, this is a list of those default built-in pathways, and again, these will just be automatically added as additional event columns to your binary matrix. Additionally, on the next slide, we'll talk about the custom_pathways argument. So certain cancer types or disease types have very specific genomic pathways outside of those common oncogenic pathways, and we wanted to allow the functionality to analyze those easily as well. So here, you can pass custom_pathways and specify the specific alterations that you want counted towards that new pathway, and it'll process it and add it to your data set, just like it did for those default pathways. So here we can take 30 seconds to just add this add_pathways() call onto our processed data. And if you run this through tbl_genomic(), you'll get this nice pathway-specific table on the right. And again, this is counting any gene that's within that pathway: if any of those genes is altered, then that pathway is considered activated for that patient. So again, let's go back to the issue of presenting a table that splits the alteration frequencies by male and female. So we can take our previous tbl_genomic() code and just add one small tweak, which is the by argument. So we subset the same way, we pass it to tbl_genomic(), and we just specify by = sex. And then we bold the labels just to make it look nicer. And we get this really nice table on the right that has our overall column, which we've seen before, as well as the frequencies split by male and female in this case. And then as we mentioned before, we might want to additionally do some adjustment for multiple testing. gtsummary makes this really easy.
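A sketch of the pathway step, using the same hypothetical nsclc_binary object; the alteration names passed to custom_pathways are made-up examples, so check the add_pathways() documentation for the exact coding convention it expects for alteration-level names.

```r
library(gnomeR)

# Add the default oncogenic pathway columns, plus a custom pathway
# defined by specific alterations (hypothetical example alterations --
# see ?add_pathways for the expected naming convention).
pathway_data <- nsclc_binary |>
  add_pathways(
    custom_pathways = c("TP53.mut", "KRAS.mut")
  )
```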
They have this add_p() function, which by default will add a Pearson chi-squared test p-value, and then also add_q(), which will do multiple testing adjustment of that p-value; by default it uses the false discovery rate, although you can specify pretty much any method you want there: some are built in, or you can write your own custom one and pass it to add_q(). So it makes it really easy to quickly do these exploratory analyses and add these multiple-testing-adjusted q-values. We'll just take a minute to go through that. And then I think there's a question about analyzing other types of pathways from other sources. So in the custom_pathways argument, I believe you can also pass a named list of several other pathways, like a list of custom pathways you want to analyze. So I think you could pretty easily turn a JSON into a list in R, and then you'd be able to pass that list into the function. But I haven't really tried it, so check it out and file an issue if you run into any problems, and we can work it in; this package is still under development. And then lastly, I'll just quickly mention that there are a lot of default plots and custom tables and plots you can use aside from the ones that we've gone over in the gnomeR package. This is just a basic co-mutation plot using the top altered genes; you can pass a MAF file here and it'll give you this kind of nice out-of-the-box visualization. And since most of the visualizations in gnomeR are ggplot objects, you can then add additional ggplot customizations on top of them pretty easily. This is another one that I use a lot, which is just looking at the top genes and splitting them by the variant classification.
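Putting the by-variable split and the multiple-testing adjustment together, a sketch with the same hypothetical nsclc_binary placeholder:

```r
library(gnomeR)
library(gtsummary)

# Alteration frequencies split by sex, with chi-squared p-values and
# FDR-adjusted q-values added by gtsummary.
tbl <- nsclc_binary |>
  subset_by_frequency(t = 0.4, other_vars = sex) |>
  tbl_genomic(by = sex) |>
  bold_labels() |>
  add_p() |>             # Pearson chi-squared test by default
  add_q(method = "fdr")  # false discovery rate adjustment
```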
So that was just a couple of visualizations, but there are several more in the package, as well as color palettes that are useful for genomic data, because a lot of the time with genomic data you want a lot of contrasting colors if you're working with high-dimensional data. So gnomeR has some built-in color palettes that are especially useful for that. Additionally, depending on where your data are sourced from, you might need to do some additional data checks. So the gnomeR vignette, which is linked here, has some helpful tips on just QC-ing this sort of genomic data in general. It might be appropriate to OncoKB-annotate your data and only analyze oncogenic or likely oncogenic alterations, so I recommend checking out oncokb.org, and there's some functionality woven into gnomeR to do that as well. And lastly, this one's still under development, but we're working on some additional functions to look at CNA segmentation data. So there's some forthcoming documentation and some additional functions that can help you make really cool segmentation heat maps very easily. So that's coming soon. So I'm going to pass it back to Jessica to wrap us up. Okay, so all together, these two packages, the genieBPC package and the gnomeR package, offer a reproducible pipeline to create cohorts for clinical genomic data analysis. Starting with the genieBPC package, it streamlines data access from the Synapse platform into the R environment for clinical data processing from multiple complicated, unwieldy clinical data files of varying structure that are high-dimensional; there are a lot of variables and a lot of different data files. So the genieBPC package allows you to, in a streamlined way, read in the data and create your analytic cohort. And then from there, the gnomeR package facilitates annotation and analysis of complicated genomic data.
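Because gnomeR's plots are ggplot2 objects, customizing them is just a matter of adding layers. A minimal sketch, where p is a hypothetical name for a plot returned by one of gnomeR's plotting functions:

```r
library(ggplot2)

# `p` stands in for a ggplot object returned by one of gnomeR's
# plotting functions (hypothetical name for illustration).
p +
  theme_minimal() +
  labs(title = "Top altered genes") +
  theme(legend.position = "bottom")
```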
And importantly, we want to highlight that while we used gnomeR today with the genieBPC data, it can be used more broadly than that: it can be used for genomic data processing and analysis outside of the genieBPC project. So thank you all for being here today. I know we had a lot of information to go over. The slides are available on GitHub, and the demo code is available on GitHub. We appreciate your time, and we're happy to answer questions if you reach out to us via email or on GitHub, since I know we're almost at time. And we want to say special thanks to Anna Fuchs, who contributed a lot to the packages in this presentation but wasn't able to be here today, and also to others in our department here at MSK who have contributed to these packages along the way. So thank you all. All right, thank you. I'm going to go ahead and end the session now.