Welcome to session 8A, Statistics and Bioinformatics. Today we have a couple of great speakers, and this session is sponsored by Roche. To get our session started, here you can see the different topics we'll have: a talk about MolEvolvR, another one about using R in a precision dosing web application, a third one about visualizing highly multiplexed imaging data, and then a sponsor talk on analyzing clinical trial data using R.

Our first speaker today is Janani Ravi, who is currently an assistant professor in the Department of Pathobiology and Diagnostic Investigation at Michigan State University. She completed her PhD in computational biology at Virginia Tech and her postdoctoral research at the Public Health Research Institute at Rutgers Biomedical and Health Sciences. Her research group at MSU works on developing computational approaches to study the molecular basis of pathogenesis and intervention in infectious diseases using protein sequence-structure-function relationships, comparative genomics, and drug repurposing. Janani is also actively engaged in training, education, and outreach, focusing on increasing the participation of underrepresented minorities in data science and R programming. She is the founder of R-Ladies East Lansing and co-founder of Women+ Data Science, which together engage over 1,000 members. So with that, we can start with Janani's presentation.

Hello, everyone. I'm excited to be here today and present our ongoing work at useR! 2021. Special thanks to the program committee for this wonderful opportunity. It's my first research talk at an international R conference, and I'm thrilled to be here. I'm still learning to make content like this as accessible as possible; please email me at janani@msu.edu in case I can do anything better on this front. Today I'm going to talk to you about an R Shiny web app that we are actively developing to characterize proteins across the tree of life using a computational evolutionary framework that we developed. We would love to receive your feedback during our live Q&A and in the lounge.

In my research group at MSU, we work on both the pathogen biology itself and the host responses to pathogens. We develop computational methods to compare genes, proteins, and entire genomes of pathogens so that we can learn about the mechanisms underlying infectious diseases. We have a few favorite pathogens and zoonotic infectious diseases that we have been working on recently. We are also developing methods to figure out ways in which we can repurpose existing FDA-approved drugs to help us humans fight infectious diseases better.

This current slide has a few animations, and with every click I have some text written up. OK, so in our group we are very interested in understanding several aspects of what makes each pathogen tick through the lens of computational evolutionary biology. Most of us working on pathogens are interested in fundamental questions such as pathogenicity, host specificity, environmental versus pathogenic adaptation, and, in recent years, the evolution of antibiotic resistance. If not fundamental biology, we are always thinking of ways to squash the bug. Most of these fundamental and translational questions can be addressed, at least in great part, by studying conserved versus unique molecular features and their specificity versus sensitivity.
We built streamlined workflows for molecular evolution and phylogeny at the protein level, and for comparative pathogenomics and pangenomics at the genome and gene levels. Today, I will focus on the former. I first started working on this molecular evolution approach to study a pan-bacterial stress response system during my postdoc with Maria Laura Gennaro at Rutgers and our collaborators at NIH, Aravind and Vivek. Since then, the approach has been generalized and is now being tested by us and others to better understand the nature and evolution of proteins and protein families. The generalized approach that I will be showing you was built by my undergrads Sam, Lauren, and Joe. We have now streamlined this approach to build a user-friendly R Shiny web application for biologists and an R package for computational biologists.

Now I'm at slide five. I want to begin by describing the problem that motivated me to develop this entire approach. The phage shock protein response system, or PSP system in short, is critical for the bacterial envelope and cell-surface stress response and for membrane stability and integrity. While the stress response function is highly conserved across phyla, there is tremendous diversity in the sequence and structure of the accessory proteins, the regulatory mechanisms, and the membrane stress response dynamics across phyla. In this cartoon here, which is actually really colorful and seems like it has a lot of moving parts, the only thing that is important is that the namesake protein in black is the one that is conserved across lineages, while the other colorful partner proteins are not. We soon realized that the identity of these different stress response players, their protein architectures, and their phyletic spreads remained to be characterized in depth across the tree of life. So I decided to embark on a comprehensive computational evolutionary analysis to figure out this PSP system across the tree of life.

Over the next few slides, I will explain at a high level what this evolutionary analysis entails. We start by identifying homologs, or similar proteins, of all PSP members across the tree of life, which is what I'm showing here. We do this in a few different ways. We use multiple starting points and collate the results. We use iterative searches as well, so that we pick up all of the remote homologs. And finally, in addition to full-protein searches, we also do domain-centric searches; I'll tell you in a bit what these domains are. Next, for each and every one of these thousands of homologs, we reconstruct the domain architectures by characterizing the proteins. We do this using sequence alignment and clustering algorithms to identify homologous groups and construct protein families. We do profile matching using domain, homology, and orthology databases. And finally, we also use prediction algorithms to identify key motifs like signal peptides, transmembrane regions, or cellular localization. In the special case of bacterial proteins, we perform additional analyses to ascertain function. Since genomic context largely dictates bacterial protein function, we next scan the genomic neighborhoods to identify putative co-transcribed genes that are co-localized within the bacterial genome.
Finally, we analyze our many, many homologs, their domain architectures, and their genomic neighborhoods in the context of evolution, using phyletic spreads, multiple sequence alignments, and phylogenetic trees to infer the biological function and ancestry of these genes and proteins. Yes, the biological analyses that I just explained are pretty important, but what I want to show you folks today is how we managed to integrate our many, many results into a neat little web app. So now let's move on to the PSP web application for a quick demo. Now I'm at slide number 10.

OK, so this is how the PSP web application looks, and this was built by my undergrad Sam. We start with the Data tab. Here you might note that this app is actually pretty interactive, like most Shiny apps are: it's queryable and sortable. If I'm starting with multiple proteins, I have a dropdown that I can pull up to restrict all of the data by the different query proteins of interest. All of the data can be downloaded here. In this table, for instance, I'm showing the different proteins, linked to the NCBI databases. We have characterized the domain architectures for every one of these thousands of homologs. We have the genomic context here, represented by arrows. We also have the lineage information, meaning which bacterial phylum the protein is coming from. So these are the different query proteins, and the phyletic spread, or lineage-wise spread, of the different proteins.

Next, as I mentioned, we look at the domain architecture, meaning the different combinations of domains that these proteins have across the different homologs. We have a summary table that is, once again, interactive. If you click on any of the rows, you can find the different lineages that these are present in, and again, all of this is searchable. We also have a slider: for instance, if you say you're only interested in the top 95% of the data, we see that there is very little variety; there is only PspA without any of the fusions. And again, we can do this for every one of our other proteins, too. Next, we construct UpSet plots using the UpSetR package. It is pretty cool, a very nice alternative to traditional Venn diagrams for looking at intersecting sets. Here we have the different domains and the combinations in which they come together. And finally, we have an interactive domain proximity network, not just for the domain of interest selected from the dropdown, in this case Toastrack; we can also go to "all" to capture something that's really cool, which is the entire PSP network. We discovered a lot of novel components here using these detailed approaches. We also look at the genomic context, because very many of these homologs are actually in bacteria and genomic contexts are important. The rest of the tabs, please feel free to explore later on; it's very similar to what we do with domain architectures, but now we do it with the genomic contexts instead. And finally, we look at the phylogeny of these different proteins and their homologs. We have an interactive sunburst plot here to look at the lineage-wise spread. Again, for every one of the proteins, we construct a colorful phylogenetic tree, and each of the colors here in this PDF represents a lineage. We also have the multiple sequence alignments in case you're interested in exploring these different homologs further.
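As an aside for readers following along in R: the kind of UpSet plot described above can in principle be produced with the UpSetR package. The domain names and membership lists below are made-up placeholders, not the actual PSP data; only the plotting call reflects the general approach.

library(UpSetR)

# Hypothetical sets: which (fake) proteins contain which domain
domain_sets <- list(
  PspA      = c("prot1", "prot2", "prot3", "prot5"),
  PspB      = c("prot2", "prot3"),
  PspC      = c("prot3", "prot4", "prot5"),
  Toastrack = c("prot4", "prot5")
)

# fromList() converts the membership lists into the binary matrix UpSetR expects;
# order.by = "freq" sorts the intersections by size
upset(fromList(domain_sets), order.by = "freq")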
We also have a paralog table. Paralogs are cases where you have multiple copies of similar proteins within the same genome, and we can also explore them further to study their evolution. OK, and with that, I'm going back from the app, and I'm now at slide number 11.

We have combined comparative genomics and detailed sequence-structure-function analysis to systematically identify sequence homologs, phyletic patterns, domain architectures, and genomic neighborhoods, and to trace the evolutionary patterns of the PSP proteins across the tree of life. We have succeeded in showing that these protein systems date all the way back to LUCA, the last universal common ancestor. We have discovered the larger stress response network. We have also identified several novel partners and key players in the PSP system and linked their roles to the biology and to what it means for membrane stability in bacteria, archaea, and eukaryotes.

This PSP web app was just one instance that helped us collate all of our results. The problem was that we had too many results, we didn't know the best way to look at them, and they were very difficult to parse through, which is when the first version of the PSP web app was born. Eventually, we kept adding lots and lots of features so that we could search through these results, create multiple visualizations on the go, subset the different results, and also change the aesthetics as we go along, because we found that to be very useful for exploring the results. But what is most important is that it helped us identify a few practical challenges. A lot of the current tools can only leverage subsets of the data covering these different facets of bacterial proteins and genomes. Independent tools for protein sequence similarity searches and all of these analyses exist, but there are no unified frameworks and tools to perform all of them. In short, these data sources reside in several siloed servers and databases, and the tools are not designed to really talk to each other. And even if a scientist manages to address these challenges, they still need to know how to manage a server and run these computationally heavy tasks. Finally, there is also the issue of streamlining and managing the data analysis and visualization.

This brought us to developing a general computational evolutionary approach that can now be applied to any gene or protein of interest, and for any genomes or pathogens of interest. For this, we have developed a web application, again using R and Shiny, for biologists as the first version, and a companion R package that we're developing for computational biologists. And we have a few favorite pathogens that we have tested our approach on. The MolEvolvR web application that we have now developed basically allows users to start with one or more proteins of interest. We construct the comprehensive set of homologs; or, if you are using other tools, you can bring in your own data from, say, BLAST or InterProScan. You can perform a detailed domain architecture analysis and, finally, the full suite of phylogenetic analyses, and you can do any combination of these, as I will show you in the demo. So MolEvolvR beta is now here; please do test it out. I will also throw the link into the Zoom chat and the lounge. I would love to hear your feedback; this is still work in development. So now I'm at slide 17.
Let's explore our new R Shiny web application really quickly, because we're also short on time, and I'm going to move to the web app. OK, so I've opened the MolEvolvR web application here, and on the front page we basically have the schematic that I just showed. When I click to start the analysis, it takes me to the upload tab. There are different ways in which I can upload data here, in different formats. I can generate a multiple sequence alignment and then perform the analysis here. If I'm using other services, I can bring in those results here too and run phylogenetic or other appropriate analyses. We can go to the corresponding help pages; there is very detailed documentation here that I would encourage you to look at if it's the first time you're exploring the app. And finally, once you submit your analysis, it gives you a six-letter alphanumeric code that you can bring back to fetch your results. In this case, I'm just going to use an example code to retrieve my data. If you're doing the full analysis, it might take about two to three hours because there's a lot of computation involved. Once I retrieve the data within the app, I can explore the data and all of the visualizations dynamically, as we did in the PSP web app. This is a quick summary of the different kinds of analyses that have been done for our proteins of interest. You may notice that a lot of these analyses are very similar to the PSP app, except that now it's really cool because these are your proteins of interest. I can go to domain architecture; again, these are similar to what we saw for the PSP app, and I'm in that tab now. But additionally, we also have an actual visualization of how these domain architectures come together, and we can select which analysis we want. Again, all of this is dynamic: I can select which protein to run it on, I can change all of these, and I can explore the entire data here as before. There are a lot of additional columns if you want to add them, and you have your input FASTA sequence. And finally, we also have the phylogenetic analysis and the tree.

So now I'm back to my slides, at slide number 18. We've already been able to successfully study a number of infections and molecular systems, and there is so much more exciting work to be done here. For those of you interested in exploring our web app and upcoming R package, please do get in touch. Here are the key people who made the web apps and the current R package in development possible. And here's my full group; believe it or not, they're all undergrads, and I'm really lucky to be working with them. We have worked on a lot of these cool applications thanks to these wonderful collaborators at MSU and other institutes. Needless to say, I love all things R, and I'm very nicely integrated with this beautiful R and data science community through R-Ladies, Women+ Data Science, rOpenSci, and Bioconductor. It's wonderful working with all of you, and more recently with the useR! program committee. Finally, I want to thank our funding sources, and with that, I would like to thank all of you for your attention. I will be happy to take any questions either during the live Q&A or in the lounge. And please feel free to share any written feedback via the feedback URL provided here. Thank you so much.
Just as a reminder, we have the Q&A where you can post questions, and we have 10 minutes at the end of the session where speakers can answer any questions, because we're a little bit behind on time. We're going to go on to the next presentation, which is from Gergely from Rx Studio, and he's going to be sharing their experiences on using R in a precision dosing web application.

Good morning, good afternoon, or good evening, everyone. My name is Gergely Daróczi. This talk will be about how we have built a precision dosing web application heavily using R. First of all, it's a great pleasure to be virtually here. useR! is my favorite conference, not only because I met my co-founder for this project at useR! 2014 in LA, but also because I had the chance to present on different topics, and over the past years I had the chance to meet many other R users and developers, which changed many things in my life. Anyway, this would not have been possible without a supportive full-time job where I was not only allowed but even encouraged to work on open source packages related to reporting, API integrations, and using R in production, but also to organize and run local R meetups and conferences, to continue teaching heavily related to the R language, and to work on some side projects, for example Rx Studio more recently. So, heads up: this is about Rx Studio and not RStudio, mind the extra x, sorry for the confusion.

What we do is personalized medicine, or PK/PD modeling, or model-informed precision dosing, as a web application and platform, which means that we are also integrated in some EHR systems. To give you a better understanding of what we do, I prepared a quick demo of how you can run this application. Here you can see a dashboard of your past PK/PD model simulations. Let me run the probability of target attainment plot for cefepime as a quick example. You need to provide some patient information and fill in some other forms; these forms are defined in R, all the validation rules are passed from R, and then the results are evaluated in R. Let me just quickly jump to the results. This is an interactive ggplot2-based SVG that we show here, but let me switch to the actual application where I can show you some more complex examples. This is Bayesian adaptive dosing for vancomycin. A little more detail needs to be provided here, for example whether you would like to estimate the probability of nephrotoxicity. We also need historical records, so these are PHI data that need to be stored in a special way; I will mention that later on. To keep it short, you need to provide past doses and some lab results so that we can fit the PK parameters of the patient. I'll just jump to the results to keep it short. Here you can see the recommended dose for this patient, and all the previous results and future estimations are provided on this single SVG, where we start with the historical doses. You can see the measured blood concentrations that we use to fit the patient's PK parameters, and based on that, we suggested these doses to be taken every eight hours. After reaching steady state, you can also see how the blood concentration might look in the future for this patient compared to the population model that you can also see in the background. Let me scroll down; you can see some further plots, for example the previously mentioned probability of nephrotoxicity, or the simulated PK parameters for the patient and how they compare to the population.
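For readers who want a feel for what such a concentration-time simulation involves, here is a deliberately simplified sketch in R. This is not Rx Studio's model: it assumes a generic one-compartment model with bolus dosing and invented clearance and volume values, purely to illustrate the idea of simulating concentrations under a repeated dosing schedule.

# Generic one-compartment, IV-bolus sketch; all parameter values are invented
simulate_conc <- function(times, dose = 1000, tau = 8, n_doses = 6,
                          CL = 5, V = 50) {
  ke <- CL / V                                   # elimination rate constant (1/h)
  dose_times <- (seq_len(n_doses) - 1) * tau     # a dose every tau hours
  sapply(times, function(t) {
    given <- dose_times[dose_times <= t]
    sum((dose / V) * exp(-ke * (t - given)))     # superposition of past doses
  })
}

times <- seq(0, 48, by = 0.25)
plot(times, simulate_conc(times), type = "l",
     xlab = "Time (h)", ylab = "Concentration (mg/L)")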
All right, let me switch back. This is probably enough introduction to the application; let me switch to how we have built it and how it is running in production, or in other words, how you can build enterprise-ready software using open source tools and some cloud services.

First, let me define what production is. I included here a few examples from my past projects where I consider that I used R in production. To keep it short, I think the important thing is that you are writing R scripts to be run outside of your IDE: not ad hoc analysis, but automated tools, for example, integrated into an application, used for stream processing, or training models in batch processes and later using them in APIs for live scoring. To summarize all these, using R in production means that you're using R without manual intervention in a highly standardized environment, for example using Docker images and containers with pinned R and R package versions. There is also the importance of security, which is pretty straightforward for software engineers who have been trained for a long time about SQL injection and a few other things, but these might not be that trivial for R users and developers. And one of my favorite topics is making sure that you are logging every single detail that might be important later on, maybe for auditing or debugging purposes. For this, it's not enough to use print statements writing some text to the console; you need a proper logging framework that can deliver the log records to a central place with real-time alerts and notifications.

But let me switch to our overall infrastructure overview and show you how we are actually using R in production. This is our end user, using a laptop or mobile device and visiting our site, which is a static site: pure HTML, JavaScript, some images and so on, hosted on AWS Amplify. Sitting between the user and the static site, there's a web application firewall, just to comply with all the regulatory requirements. So this is a static site, but behind that we do have a modeling API and some other cloud functions. Before the user can interact with those, they need to authenticate, for which we use a cloud service. That could be, say, AWS Cognito, or in GCP it could be Firebase Authentication or its HIPAA-compliant version, the Cloud Identity Platform, which makes it super easy to integrate authentication into your application. On top of that, when you are using databases, for example DynamoDB or Firestore, you can very easily set up rule-based access so that you can specify which document can be fetched, updated, or deleted by a user or group of users. Similarly, for the cloud functions, we also prescribe who has access to run those, and we could use AWS Lambda or Google Cloud Functions. What I would like to mention here is how we store secrets: fortunately, both cloud providers have a secrets manager, which makes it super easy to access credentials based on rules instead of having access to hard-coded credentials.

Let me switch to the modeling API, which is probably more important to this audience. I will not read out all the cloud service acronyms; to keep it short, we have Docker containers running on a bunch of virtual servers sitting behind a load balancer. Whenever an authenticated user makes a request to our API, there's a single plumber process processing the information and returning the response to the user.
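To make the plumber part concrete, here is a minimal, self-contained sketch of a plumber API with one filter and one endpoint. The endpoint path, the fields, and the placeholder calculation are hypothetical, not the actual Rx Studio API.

# api.R: minimal plumber sketch (endpoint and payload are hypothetical)
library(plumber)

#* Log every incoming request, then pass it on to the matching endpoint
#* @filter log_request
function(req) {
  message(sprintf("%s %s", req$REQUEST_METHOD, req$PATH_INFO))
  plumber::forward()
}

#* Run a dummy "simulation" for a patient
#* @post /simulate
function(weight = 70, dose = 1000) {
  list(
    weight = as.numeric(weight),
    dose   = as.numeric(dose),
    result = as.numeric(dose) / as.numeric(weight)  # placeholder calculation
  )
}

# Inside the Docker container this could then be launched with something like:
# plumber::pr_run(plumber::pr("api.R"), host = "0.0.0.0", port = 8000)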
Both the modeling API and the cloud functions write to a central logging storage, which could be CloudWatch or Google Cloud Logging, sending the data to your data warehouse for further processing, but also doing some real-time monitoring and alerting, with alerts sent to our developers. The developers write code on feature branches, push those to Git, and create pull requests, all of which need to be reviewed by other members of the team. Besides that, unit tests, integration tests, and end-to-end tests need to pass as well. Once a pull request is merged to the main branch, there's an automated process rebuilding the production Docker image and doing the deployment, in this case using Amazon Elastic Container Service. And there's one last bit that I have not mentioned, which is Aptible, our GRC tool.

Let me switch to the next topic, which made the release of this application a little bit more complex compared to my previous R-related projects. Working in a regulated environment definitely has some advantages: you are not building up tech debt, you come up with a clean design that is nicely documented, you have really good code coverage, and so on. On the other hand, the monthly or quarterly reviews make it somewhat cumbersome, and anyone who has gone through audits, either in fintech or healthcare, knows that these are not that straightforward processes. Fortunately, R does have a history when it comes to use in regulated environments: if you look at the CRAN page, it even comes with a guidance document on how to use R in regulated clinical trials. There have also been commercial solutions, for example Mango's ValidR. And more recently, I was very happy to see the R Consortium project called the R Validation Hub, which will help us prove that these open source tools can be used in regulated environments.

Still, at the end of the day, you have to do your own audit, so I included a list here which might be useful for those of you who might consider writing similar software. First of all, use common sense. I think many software engineers would by default write software which complies with HIPAA security and privacy rules, but it's still good to go through these. First, you need to research what is to be done, get those policies, apply and fine-tune them to your environment, and then make sure from time to time that you're complying with your own policies. A quick example is data management: pretty trivial things like encryption. I think nowadays it would be impossible not to encrypt, but I still need to emphasize that data need to be encrypted when stored somewhere and also when being transmitted between services. And just to highlight it: identifying PHI and handling it in a special way is also important, although it's probably much better to handle everything as if it were PHI. The most important experience when it comes to vendor management is to make sure that you onboard only those vendors that you really, really need; otherwise, you will end up with extra risk and time spent on vendor security assessments if the vendor might have access to PHI, the need to sign a BAA, and so on. So if possible, get rid of all that trouble. Code review is required for everything, and all kinds of automated unit tests as well.

The next difficulty is internationalization, which is much more than just translating strings.
For example, there are different units of measurement in the States versus other countries, different date and time formats, and so on. To overcome that, we are heavy users of gettext, both on the front end and the back end. Just to show you some interesting examples: even drug names need to be translated in some languages (for example, this is the Portuguese translation), and highly technical terms like AUC or MIC have their own local versions, so even the tiny bits of a ggplot need to be translated. I will not get into the details, but we plan to open source these helper functions from the Rx Studio package. To keep it short, we store the translations basically as hash tables in environments, and doing the translation is a very quick lookup. For reading the translation files, we provide a basic PO file parser. I know there are packages providing that functionality, but this one comes without dependencies, which helps a lot in a regulated environment. The most complex thing when it came to the translation files was coming up with a helper that extracts all the strings from the code sources that need to be translated, which means not only the R sources but also documentation and a few other things. Fortunately, we could build on previous tools like xgettext to make that possible. As a high-level overview, as you can see here, we have almost 4,000 words for every single language that need to be translated for the front end and the back end. That's quite a lot, and it's not that straightforward to make sure that everything is always properly translated. For this, we have some automation in the form of GitHub Actions, which check out a Docker image, build a test Docker image, and run tests to make sure that all the strings that need to be translated are extracted from the package and that there's a translation for all the supported languages.

To give you a few examples of the already open-sourced pieces of the infrastructure: logger is one, and botor is another, among the others that you can find on termicata.dev. Here you can see a quick example of using logger. The important point of this framework is that it separates the log formatter from the layout and the appender, which means that you can format your message with glue, paste, or whatever string concatenation method you like, and that is separate from the log layout. So you could store your log message as a JSON blob in a database, and you could log the very same message on your console printed as a single line, and this flexible framework makes that rather easy.

Just to give you some examples, this is a log record for a request, which stores not only the IP, the user ID, and a few other things about the user, but also about the environment, like the Docker container ID or the instance running the Docker container. We also store a similar log message for the response, where you can see how long a request took and how much memory it required to run that simulation, and everything is tracked by a request ID so that you can join the related records.
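Since the logger package just mentioned is open source, here is a small sketch of the formatter/layout/appender separation described above; the file name and the message are of course just examples.

library(logger)

# The formatter builds the message, the layout decides how a record is rendered,
# and the appender decides where it goes; all three are configured independently
log_formatter(formatter_glue)            # glue-style message interpolation
log_layout(layout_json())                # render each record as a JSON blob
log_appender(appender_file("app.log"))   # write to a file instead of the console

user_id <- "demo-user"
log_info("simulation requested by {user_id}")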
I already spoiled this animated GIF previously: we already have thousands of commits, although we just started about a year ago. Unfortunately, these commit messages are not that great when it comes to communicating with the end users, although they are really great for auditing, so for that purpose we came up with our own semantic versioning of the models. For example, what you can see here is that for cefepime, for a given method and PK/PD targets, we came up with a major release when something changed in the UI. Minor releases do not change the UI but might return slightly different results, for example because we updated how we compute different things or the model parameters changed, and patch versions are just visual changes. This is how it's rendered to the end user in the application.

How do we run the actual models in the background? As I mentioned previously, we are using plumber, heavily using filters, in this case to figure out what language our end user needs so that we return ggplots in Portuguese, Spanish, English, and so on. Other than that, we do the authentication part as well, in this case using the Firebase Admin Python module with the fantastic reticulate R package from RStudio, do some logging right away, and then evaluate the simulations. We heavily use forked processes, making sure that we capture standard output and standard error in these cases. Let me just show you a quick overview of a model definition, with some metadata at the top and the list of inputs that are required to run a model. I will skip these; I will share the slides later if you are interested in the details, and I will just switch to some examples of our calculations, really tiny helpers here.

About licensing: most of our stuff is not open source yet, but it's free for individuals, and we are also dedicated to making access to these tools free for education and smaller clinics. EHR integration also comes with affordable pricing, and we are always looking for contributors to this project. To give you a quick glimpse of our SDK: this is a model definition; you provide your inputs and how to calculate the results, and then we provide a HIPAA-compliant environment with an automatically generated user interface for you. And all right, time is up, so I will stop here. Please check out our homepage, register for free in the application, and reach out if you have any questions. Thank you very much. Bye-bye.

All right, thank you very much, Gergely. Just as a reminder, please feel free to ask your questions in the Q&A, and we'll have 10 minutes at the end of the session for answering your questions. We're going to move on to the third presentation, which is by Nils Eling from the University of Zurich. He's going to tell us about the cytomapper package and how it can be used for visualizing highly multiplexed imaging data.

Thank you very much. Hi, everyone. I'm really excited to present to you cytomapper, our Bioconductor package to visualize highly multiplexed imaging data directly in R. The cytomapper package contains three broad functionalities. It visualizes pixel intensities as composites of up to six channels using the plotPixels function. It can also visualize cell-specific features on segmentation masks using the plotCells function. And it allows interactive gating of cells and the visualization of selected cells on images using the cytomapperShiny function, but today I will not talk about this last function. The package can be installed directly from Bioconductor using BiocManager::install for the release version, or directly from GitHub for the development version using remotes::install_github and then BodenmillerGroup/cytomapper.
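In code, the two installation routes just mentioned look like this:

# Release version from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("cytomapper")

# Development version from GitHub
remotes::install_github("BodenmillerGroup/cytomapper")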
Before I show you the functionality, I will introduce you to the data structure. cytomapper uses the SingleCellExperiment data container, a common class in the Bioconductor framework, to store cell-specific metadata and intensity features. You can see this in the figure here on the right in A: the SingleCellExperiment contains cells in columns and channels in rows, and in the colData slot you can store more cell-specific metadata. cytomapper also exports the CytoImageList data container, which is new. This stores image objects in the form of a SimpleList container. In A, you can see that we store segmentation masks in the CytoImageList container, so multiple segmentation masks depending on how many images you have; here, a single cell is identified by a set of pixels with the same ID. In B, you can see a CytoImageList container that stores multiple multi-channel images, and cytomapper can then generate composite images, as shown on the right. Visualization of cell-specific metadata is possible by linking CytoImageList and SingleCellExperiment objects: in the functions, you will see an image ID and a cell ID that match cells and images between the SingleCellExperiment and the CytoImageList object. You will need to work with processed data, and I will not talk about how image segmentation works, but you can have a look at the BodenmillerGroup IMC segmentation pipeline in the link below to see how you can segment cells on these images.

Before visualizing the data, we first read in some example data, which is also provided in the BodenmillerGroup cytomapper demos repository on GitHub. cytomapper exports the loadImages function that you can use to read in multi-channel images and segmentation masks. First, we read in the multi-channel images. They are stored as TIFFs, but cytomapper also supports PNGs and JPEGs. It directly formats these images into a CytoImageList, which now contains three images; here you see the names of the images, and each image contains 38 channels. In the next step, we read in the segmentation masks, also as TIFFs. Here we need to set as.is = TRUE to read in 16-bit unsigned integer images correctly. Again, this produces a CytoImageList object containing three images; this time, each image only contains one channel, which holds the cell IDs. In this GitHub repository, we also provide a formatted SingleCellExperiment; you can read it in using the readRDS function. Here you can see that across these three images we detected around 6,000 individual cells, and we have measured 38 different channels. Cell-specific metadata are stored in colData and channel-specific metadata are stored in rowData.

Before we can start visualizing the images, we need to slightly format the data. First, we need to set the channel names of the multi-channel images using the getter and setter function channelNames. Here we know that the order of the channels is the same for the multi-channel images and the SingleCellExperiment, so we can just transfer the row names. We also need to set the element metadata of the images and the segmentation masks in the form of a DataFrame, which contains only one entry, the image name; this will later be used to match segmentation masks and multi-channel images.
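A condensed sketch of the reading and formatting steps just described might look as follows; the file paths and the "ImageName" metadata column are placeholders inferred from the narration rather than the exact demo code.

library(cytomapper)

# Multi-channel images and segmentation masks, both stored as TIFFs
images <- loadImages("data/images/", pattern = ".tiff")
masks  <- loadImages("data/masks/",  pattern = ".tiff", as.is = TRUE)  # 16-bit masks

# Pre-formatted SingleCellExperiment provided alongside the images
sce <- readRDS("data/sce.rds")

# Channel order matches between images and SCE, so the row names can be transferred
channelNames(images) <- rownames(sce)

# Element metadata used later to match masks and images
mcols(images) <- mcols(masks) <- S4Vectors::DataFrame(ImageName = names(images))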
A relatively new function of the cytomapper package is measureObjects, which computes morphological features (cell shape, size, and location) and intensity features for every image and every cell. By default, the expression intensities are the mean intensity per channel and per cell. To use the measureObjects function, you supply the masks and the images, and mask and image matching is done using the image ID parameter, which we set to the image name. This directly creates a SingleCellExperiment with the same dimensions as the SingleCellExperiment that I showed you before: it stores the mean intensity per cell and channel in the counts assay and stores the morphological features in colData.

Now we come to the visualization of the images. First, I will show you the plotPixels function, which takes at least the image parameter; this takes a CytoImageList object containing multi-channel images. By setting the colour_by parameter, you can specify which channels you want to visualize in the form of a composite image. Here, we select the channels proinsulin, CD4, and CD8; the biological purpose is to visualize pancreatic islets and immune cells. Proinsulin you can see in yellow, CD4 in blue, and CD8 in red. For every channel, you can specify a color gradient in the form of a named list. You can also adjust the background, the contrast, and the gamma level for every channel using the bcg parameter. This is a named list that takes a numeric vector of length three, where the first entry specifies the background, which is added to the image, the second a factor with which the image is multiplied, and the third the exponent that increases the gamma level. The image_title parameter defines the attributes of the image title; in this case, we only change the text of the title. Image_title takes a named list, and I will show you later which other features you can change. Scale_bar is also a named list to change different attributes of the scale bar, which you can see here on the right; we set the length to 100 pixels and generate a nicer label for the scale bar.

In the next step, I will show you how to use the plotCells function to visualize segmented cells. To show the full functionality of this function, I will subset the SingleCellExperiment to only contain cells of certain types: beta, alpha, and delta cells, which are present in the pancreatic islets, and cytotoxic and helper T cells, which are immune cells. The plotCells function takes at least the mask parameter, for which you need to specify a CytoImageList object storing segmentation masks. You can also supply the object parameter, which takes a SingleCellExperiment. Here you link cell-specific metadata stored in the SingleCellExperiment to segmentation masks via the cell ID parameter, which is contained in the colData of the SingleCellExperiment and is also stored as sets of pixel values in the segmentation mask. The image ID parameter is used to match cells contained in the SingleCellExperiment to individual segmentation masks. Again, we can set the image title; we can now also change the color of the title. We also change the scale bar slightly from the default and change its color to black. The missing_colour parameter here defines the color of all cells that are not contained in the SingleCellExperiment; they are white, and you can see these on the images: all the white cells are not contained in the supplied SingleCellExperiment object. Background_colour sets the color of the background, which is now shown in gray here.
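Put together in code, the quantification and the two plotting calls described above look roughly like this. The channel names ("PIN", "CD4", "CD8") and metadata columns ("CellNb", "ImageName", "CellType") follow the narration and the cytomapper documentation; they may differ for other datasets.

# Per-cell morphology and mean intensities, matching masks to images via img_id
sce2 <- measureObjects(masks, images, img_id = "ImageName")

# Composite of three channels with per-channel colour gradients and bcg adjustment
plotPixels(
  image       = images,
  colour_by   = c("PIN", "CD4", "CD8"),
  colour      = list(PIN = c("black", "yellow"),
                     CD4 = c("black", "blue"),
                     CD8 = c("black", "red")),
  bcg         = list(PIN = c(0, 10, 1)),          # background, contrast, gamma
  image_title = list(text = names(images)),
  scale_bar   = list(length = 100, label = "100 um")
)

# Cell-level metadata (e.g. cell type) coloured onto the segmentation masks
plotCells(
  mask              = masks,
  object            = sce,
  cell_id           = "CellNb",
  img_id            = "ImageName",
  colour_by         = "CellType",
  missing_colour    = "white",
  background_colour = "gray"
)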
Finally, I will show you how to outline cells on images. For this, we again use the plotPixels function, which combines multi-channel images, a SingleCellExperiment object, and segmentation mask objects. Again, via the cell ID and image ID, the plotPixels function links individual cells contained in the SingleCellExperiment to segmentation masks, and it matches segmentation masks and multi-channel images to the SingleCellExperiment container via the image ID. Again, you can specify which channels you want to visualize as a composite image. Here, you can also specify the color of the outline of the cells. When you look at the images, certain cells are outlined, and here the color of the outline is dictated by the cell type: you can see pancreatic islet cells in green and red and immune cells in blue. Again, you can specify the color gradient per channel, the background, contrast, and gamma level per channel, and set the image title and the scale bar, and here the thick parameter defines whether the outline should be drawn thick or not.

With this, I would like to thank my co-authors, Nicolas, Tobias, and Bernd, who helped with developing the package and with publishing it. I also want to thank my funders, the Marie Curie and EMBO fellowships. If you want to know more about the cytomapper package, have a look at the publication and the GitHub repository for analyzing example data. You can also find more info in the main vignette, called "Visualization of imaging cytometry data in R", and in a newer vignette with information on how to read in and store images directly on disk instead of in memory. We've also been working on new packages: imcdatasets, which provides publicly available example multiplexed imaging data, and imcRtools, which is under development but will contain analysis functions for multiplexed imaging data. Finally, this is the session info; these are the main Bioconductor packages used in the displayed analysis, and I'm looking forward to your questions.

Okay, thank you very much, Nils. There's a question in the Q&A about whether it is on RStudio. I see that you're typing an answer, but if you could just say it for the recording, that would be great. I mean, it's installable from Bioconductor directly; I guess that's what the question was about. Yeah. I'll ask you another question, too: are you thinking of expanding cytomapper for spatial transcriptomics data? Aren't the images there, like, 20 gigabytes big? Yeah, I know. So there are on-disk approaches now. The issue is really that, as long as you can read a single image into memory, you can also write it out again as kind of a subsettable image, but at the moment it's really only supported for pixel data, and I think spatial transcriptomics doesn't really support this format. Thanks.

Let's see. So let's move on to the last talk of the session, and just as a reminder again, please use the Q&A; we'll have 10 minutes at the end of the session for more questions. Our next and last talk is by Adrian Waddell, who's going to tell us about analyzing clinical trial data using R for exploratory and regulatory analyses. Adrian works at Roche, which is the session sponsor. Thank you. Thank you very much for having me present here. I'm going to talk today about how we use R to analyze clinical trial data, both for exploratory analysis and also for regulatory analysis, where regulatory means it at some point gets sent to the health authorities.
That in turn requires, as was mentioned in the previous talk, validation, predefined workflows, and documentation. The talk will mainly focus on the demonstration I will give: I will quickly go through the interactive part and spend some time on the static output generation and how that can be transformed into the static PDF that will then be used downstream by various stakeholders.

The motivation in general for pharma, and here I'm talking mainly about late-stage pharma, because a lot of early development already happens in R, is that it's a highly regulated environment. Historically, SAS has been used for most of the submissions to health authorities, and that is slowly changing. There are some reasons for that. One big reason is that a lot of talent, a lot of students coming from academia, already know and like R, and R is a very powerful tool which meets the business needs. Also, R Shiny has become really, really powerful for providing interactive analysis environments to various stakeholders.

From what I've seen so far, a lot of the tooling at many companies revolves around static output creation, and static output creation has quite particular requirements around labeling, so that the displayed output is put into context: it has a title, footnotes, and information on when the data was cut and what you see in the plot. That is kind of the main use, but these days there is more data, including more high-dimensional data, and so exploratory analysis has become very important. What we provide is essentially web applications that offer a very flexible way of subsetting the data, but also of choosing what should be analyzed and what kind of encodings should be chosen for the visualization or the statistical analysis. And I think the last, and also very important, group of users of these analytical tools are those who actually program: the statisticians who perform the analysis using R Markdown and so on. For them it's very important to have a good environment and a good API where you can run ad hoc analyses and then integrate them into your analysis workflow. I think R provides very nice environments for all of these, and I'm going to show you how we have designed our environment to support these three phases of interacting with the data.

We use a mix of internal and community packages; we also make some of our packages available as open source, and we're working on making some of the others available in the future. rtables is an R package to create general tables, so to calculate, then tabulate, and render output. tern is a package that takes rtables, ggplot2, and other packages and adds the business logic to them: it does the statistical modeling, it does the descriptive statistics, and it provides the components to build the tables, listings, and graphs that are used in clinical trial analysis. And teal is a Shiny-based framework that then uses tern, rtables, and other analysis packages and provides those flexible Shiny applications where you can dynamically subset data and change how the data should be analyzed. So teal is the Shiny-based framework, and tern is used to create the static outputs but also to do ad hoc analysis via the command-line interface.
One particular design aspect of teal is that it is possible to reproduce the code via a Show R Code button: there is code reproducibility from Shiny, which can then be used to create the static outputs or do further ad hoc analysis. As I mentioned before, rtables is already open source; the other packages are currently not, but we are internally discussing how we proceed with those packages. As also mentioned, we do use many other packages; we don't write something if there are already community efforts that meet our business needs. So that's the setup.

In the demonstration, I will first show you how we set up the teal application and how we export the code, and then, with that code, how you create an output that is paginated and has title and footnotes. It's important to say that all the data shown is synthetic, meaning there's no real patient behind it; it's all random number generation. The data here is created with random.cdisc.data, which is proprietary, but we have an open source version of that that we are currently working on. It depends on respectables, which is another open source project we're working on, and that should make it more flexible to collaborate and create synthetic data for various uses.

For the demo, I've prepared an R session here on a production environment using RStudio. What we provide is a way to essentially load our released R packages, and then to create the interactive session. Here you see we use random.cdisc.data to get the ADSL. I made an example dataset here to show how we create random data; usually it involves lookup tables, as you can see in respectables and the synthetic CDISC data, then some relating and some random sampling, and that gets you very far in terms of development and demonstration, but then also later for other use cases that are useful for planning. So if we create those, we create the teal app. The teal app is a Shiny app; you see that the object up here then gets passed to Shiny. You need to tell teal what the data relations are; we use the standard relations for CDISC data, and we have wrapper functions that provide the right defaults, so it is pretty much about the parent datasets and what the key variables are, plus some reproducibility setup so that when you click on Show R Code you get the right reproducible code. And then it's modular, meaning you can add modules with no parametrization. Here I have an adverse events table with parametrization, so you can see a little bit what the parametrization looks like; I'm not going to spend too much time here because I have already presented teal at an RStudio webinar, which you can look up. And then you can give it a title and footnotes, and if you run that, excuse me, something just hung; maybe I said that multiple times before. Apologies for the hiccup.

So the teal application always has a similar layout: there's a menu, there's a filter panel on the right where you can select variables to filter on, and then you have your modules. The modules can be used to look at raw data; for example, if you would like to look at the AE variables that we analyze, you can add them and remove the pieces you don't need. We also have variable browsers where you can look at the data interactively. And if you filter, it also filters the data in here, as you would expect with Shiny interactivity.
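For orientation, here is a heavily simplified sketch of what such a teal app setup can look like, using the public teal API (which differs in detail from the internal, parametrized adverse-event module shown in the demo); the dataset below is a trivial stand-in and example_module() is just teal's built-in demo module.

library(teal)

# Trivial stand-in for a real ADSL dataset
ADSL <- data.frame(
  STUDYID = "S1",
  USUBJID = sprintf("S1-%03d", 1:10),
  ARM     = rep(c("A: Drug X", "B: Placebo"), 5)
)

app <- init(
  data    = teal_data(ADSL = ADSL),      # declare the datasets the app can use
  modules = modules(example_module())    # add one or more analysis modules
)

shiny::shinyApp(app$ui, app$server)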
If you then go to the adverse event table, you have the output here in the center and the encodings on the side, so you can change the arm variable. Here I didn't add that many encodings: ARMCD versus ARM, where ARM has a little bit more markup, and you can use different terms. You can add and remove terms, and you can change the sort criteria of the terms, for example alphabetically versus by frequency of the total, and so on. And you can dynamically filter the result set: you can go to the country variable and say, for example, that you would like to exclude this country; it updates here what your filtered dataset contains, and it updates the output here.

The important thing here, to continue the story from exploratory to regulatory, is that we have the Show R Code button, and that gives the exact code that actually created the output in the center. If you click Show R Code, it's going to take a while because we style it with the styler package, and then you can copy that code. If you focus on one of those numbers, for example headache is 90, you can stop your Shiny app, create a new script, and copy-paste this in. And so even if you delete everything, you don't start from scratch: you will get exactly the same numbers in a static way. There are some packages here that get attached; we don't prune the attached packages that are not needed. Here's the data generation, you see the lookup table, you see we create ADAE from the setup before, and then here's the subsetting, so it's the age and country. We do this, and then here's the tabulation, the calculation, and then we create the tables. And if you look at the table here, we would expect that headache is 90, 84.9, and what we've seen before, right here, 90, 84.9, is the same number. So this is a great way, if a clinical scientist or any analyst sees an output that they need as a static production output: they can copy that and send it to the statistical programmer, and the statistical programming team can create a production output, which I have prepared here in a little bit more readable way, where I removed some things, and I'm reading it here.

And so with that, I'm going to delete this and start from scratch again. If we were to create the production output, and of course there would be workflow and pipeline tools to do this at scale, but at a very general level, you start by setting up the environment and loading the data. You filter your data; here I use a slightly different filter, and usually you would do a matching merge so that it also filters the other datasets, and then you go into table creation. Here I've added title, footnotes, and provenance information, because that's what we need in the static output. The rest is the same: you split by the arm variable, you add column counts, which are the Ns under the arms, you add a total column, you summarize the patients, then you split by body system, you summarize it by unique subject ID, and then you count occurrences of the different terms, for this adverse event table. I encourage you to read the vignettes of rtables; it's a very flexible framework to essentially calculate and display any table. And so if you build that, you get a table.
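As a flavor of the rtables layout grammar just described (a minimal sketch, not the exact adverse-event table from the demo), a layout is built up from column splits, row splits, and analyze steps, and only then applied to data:

library(rtables)

# Minimal layout: columns split by arm with Ns, rows split by sex, mean age analyzed
lyt <- basic_table() %>%
  split_cols_by("ARM") %>%
  add_colcounts() %>%
  split_rows_by("SEX") %>%
  analyze("AGE", afun = mean, format = "xx.x")

# ex_adsl is example ADSL-like data made available by rtables (via the formatters package)
tbl <- build_table(lyt, ex_adsl)
tbl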
If there were zero counts, you could go and prune the results, which is not really needed here because it comes from teal, where it is already pruned, and then you can sort it, frequency-based. The table itself has tree information, so we can use paths to sort it. That gives us the final result, with title and footnotes in ASCII here, that we would now like to put into a PDF.

One part that's provided in rtables, so that's the table object here, is the paginate function. Because the table itself is a tree, you can make a fairly simple rule based on how many lines per page you can have and how many siblings in a leaf are the minimum, meaning, for example, if this is a leaf element, you cannot cut between those two elements: you want to show at least two terms next to each other from the same system organ class. And if you do that, it's not such a big table, so it's fairly fast; what you see here is five different tables, and that gives you the pagination. But if you then want to make it into a publication-ready PDF, for example, you need to match the column widths between the tables, because the pagination treats every table independently. That matches the column widths, meaning now, if we make this bigger, this column is nicely aligned with that one. And then the last step, which is at the grid low level here, but we're going to add a render and export function in rtables with the next release, is that you can essentially say: I want to write this as a text file or into a PDF. So if I go to the file step here and I do this, I get a table PDF, and this would give us something that is commonly used for communication and reporting. And so that is kind of the round trip: from setting up and designing the outputs, to using them in an interactive application where you can subset and change the encodings, to then creating the static output, where you can dynamically adjust things ad hoc in the command-line interface, and then creating the output that is used downstream by various functions. Good, thank you very much. And as I mentioned, rtables and the synthetic CDISC data are already open source; please keep an eye open for the other tools that I've mentioned.

Hi, thank you very much, Adrian. You have a question in the Q&A from Esguerra Carlo. He asks: what types of profiles do you have at Roche to be able to develop R packages? Do you have R developers working on these packages? Yes, so I'm the technical lead of the project, which is called NEST, and we have around 20 developers, varying by year between 15 and 25 developers, who work on those R packages. And then we have the statistical programmers who use those in their daily study work, and then there's a DevOps cycle where their feedback gets incorporated and put on the backlog, we prioritize it, and then it gets released, at the moment bi-monthly. But yeah, we have R developers, and we also hire analysts: we hire people with R skills to then perform the analysis. Okay, thanks. I'll ask you a question too. You showed us how you can download the R code from the Shiny app and then run it on your computer; how do you deal with the different package versions and maybe even different operating systems? Yes, so that's a good question.
Hi, thank you very much, Adrian. You have a question on the Q&A from Esguerra Carlo. He asks: what types of profiles do you have at Roche to be able to develop R packages? Do you have R developers working on these packages? Yes, so I'm the technical lead of the project that's called NEST, and we have around 20 developers; it varies by year, but it's roughly 15 to 25 developers who work on those R packages. And then we have the statistical programmers who use those in their daily study work, and there's a DevOps cycle where their feedback gets incorporated and put on the backlog, we prioritize it, and then it gets released, at the moment bi-monthly. But yeah, so we have R developers, and for the analysts we also hire people with R skills to then perform the analyses. Okay, thanks.

I'll ask you a question too. You showed us how you can download the R code from the Shiny app and then run it on your computer; how do you deal with the different package versions, and maybe even different operating systems? Yes, so that's a good question. We essentially give the package versions as a list, because we have a production environment with released versions that we control, so the NEST release that was used contains that information, and you can install those and most likely get the same results. For the future, we do plan to also provide a lock file when you click Show R Code, and together with RStudio functionality like RStudio Package Manager, that should be fairly good for recreating the environment. And then of course in production, for the real analyses, you have a much more tightly controlled setup, with Docker environments where you store the Docker images.

We have a question from Slack, which is: may I ask how you're dealing with statistical disclosure control, rounding and cell suppression, in rtables? Can you repeat that? Sure, it says: may I ask how you're dealing with statistical disclosure control, which in practice is rounding and cell suppression, in rtables. Yeah, so rtables separates the rendering from the actual data. The rtables object itself has the complete information, with no rounding and no formatting applied to it, and then there is a step where you apply the rounding rules. There are vignettes on that, and I think it can already do a lot. Some things are missing, for example precision-based rounding, where given the precision of your input data you round to one more decimal place; that, for example, is missing, and for that we would probably have to update the formatting abstraction and extend that language a little bit. But you can essentially change the rounding to your needs once you have the table, or when you define the table. I hope that answers the question.

All right, yeah. Otherwise there's also the statistics and bioinformatics talks channel on Slack, so if anyone has questions beyond the Q&A, feel free to ask them there for any of the speakers. We're now in the ten-minute window at the end of the session, so if you have any questions for any of the speakers, please feel free to use the Q&A. And there is one question; should I answer that, or should I wait for other questions? So: how does teal do code generation, do you use shinymeta? So shinymeta came later, and as you can read on the shinymeta web page, I'm actually mentioned there. We took a different approach to code generation; we talked to RStudio, and I think that was part of the input to shinymeta. At the moment we have not switched, because code generation is not the most difficult problem on our side. I think shinymeta is a great tool, and it's more flexible than ours, but we have had other priorities. At the moment we use a different mechanism that was developed in-house and is a little bit less verbose than shinymeta. So shinymeta is a very nice piece of work which at some point I hope we can switch over to.
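For context on the shinymeta approach mentioned in that answer (which, to be clear, is not what teal uses internally), a minimal sketch of how shinymeta can expose the reproducible code behind a reactive output might look roughly like this, following the pattern from the shinymeta documentation; the mtcars data and the input names are purely illustrative.

    library(shiny)
    library(shinymeta)

    ui <- fluidPage(
      selectInput("cyl", "Cylinders", choices = c(4, 6, 8)),
      verbatimTextOutput("summary"),
      actionButton("show_code", "Show R Code")
    )

    server <- function(input, output, session) {
      # metaReactive() records the expression; !! splices the current input
      # value in as a literal so the expanded code stands on its own.
      dataset <- metaReactive({
        subset(mtcars, cyl == !!as.numeric(input$cyl))
      })
      output$summary <- metaRender(renderPrint, {
        summary(!!dataset())
      })
      observeEvent(input$show_code, {
        # Walk the dependency chain and show the standalone script.
        displayCodeModal(expandChain(output$summary()))
      })
    }

    shinyApp(ui, server)

The underlying idea, recording the code that produced an interactive output so it can be re-run as a plain script, is the same one the Show R Code button in the demo provides, just via a different mechanism.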
Thanks. I have a question for Janani. You mentioned that one of the challenges is having a single home for all these tools related to molecular evolution, and you showed us your Shiny app. Are you thinking of providing software that people can download, which would install all the dependencies so they can run more analyses, or are you aiming for Shiny to be the end solution? That's a great question, Leo, and in fact that's the main part of the talk that got cut today. So please do check out the video link that I posted on Slack for the last five minutes. But anyway, we have now developed a Shiny web application, MolEvolvR, which is in the title, where we allow people to run the full analysis on their own proteins. The initial web app that I had shown is just a mockup with data from the other stress response system we were looking at, but right now people can run this entire analysis on the new alpha/beta version of the Shiny app. We are also developing a companion R package that can be downloaded and used within RStudio. That is still in development; a lot of our code went into the Shiny app, so we thought we might as well make that available too. We are working on that, and hopefully it should be on Bioconductor later this year. All right, good luck with the Bioconductor submission. Thank you. Nils has experience with that. Oh, wonderful, I'll be in touch.

So Gergely, in your talk you showed a lot of different pieces of software that you're developing; some of them you're planning to make open source, some of them are closed source. How does your company balance how much effort is put into each of them? Yeah, that's a great question, and I think that comes up at many companies. I can answer that from my full-time job: I've been lucky enough to work at companies where we could open source basically anything that is not the core business logic, and because I've been working in fintech and adtech, that covers almost everything we do. So it's been great to open source all those Python and R packages. Regarding Rx Studio, which is a side project, it's a little bit more difficult because we have limited resources, and when you open source something you really need to clean up all the related documentation and prepare an API that can be used by others. So to keep it short, we just need to find the time to clean it up and separate it nicely from our core package, and that should hopefully happen in the next couple of months, but honestly, no guarantee. To answer your question, I think open sourcing needs some dedication and resources from the company. Hopefully that answers the question. Thanks, yeah, it's always a tricky thing.

Nils, for cytomapper: were you asked to compare it with MATLAB tools, for example, or why would someone who is a MATLAB user use cytomapper? I know MATLAB is expensive, but I don't know if you have any thoughts on that comparison. Yeah, I'm mainly comparing this to tools that the rest of the group is using, which are all Python based or use graphical user interfaces, mainly CellProfiler. There is a strong overlap with other tools; I think the main point here was to integrate it into the framework that the group is using for data analysis. So there is this data container, an S4 class, that contains all the meta information that is needed, and there was just a tool missing that directly visualizes this on images. In Python you would probably use squidpy or scanpy for this; for MATLAB I'm not sure. All right, thanks.
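To make that visualization workflow a bit more concrete, here is a rough sketch based on cytomapper's own pancreas example data; the object names and the metadata/channel columns below are as used in the package vignette, so treat them as assumptions and double-check them against your installation.

    library(cytomapper)

    # Example data shipped with cytomapper: a SingleCellExperiment plus
    # CytoImageList objects holding segmentation masks and pixel images.
    data("pancreasSCE", "pancreasMasks", "pancreasImages")

    # Colour segmented cells on the masks by a cell-level metadata column
    # of the SingleCellExperiment (the id columns link cells to their image).
    plotCells(mask = pancreasMasks, object = pancreasSCE,
              img_id = "ImageNb", cell_id = "CellNb",
              colour_by = "CellType")

    # Or display the raw pixel data, colouring by one or more marker channels.
    plotPixels(image = pancreasImages, colour_by = c("H3", "CD99"))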
We have a question on the Q&A for Janani and Nils: with the increasing complexity of Shiny apps, do your tests incorporate reactivity testing in these apps? Can you talk a little bit about your experience with this? So, yeah, I didn't really talk about the Shiny app, but cytomapper has one. I started testing, writing unit tests or snapshot tests using shinytest, I think, and in the end it wasn't really straightforward to deal with random version changes and to set up continuous integration for the Shiny snapshot tests. That's where it kind of stopped for me, so it's a bit tricky. We found something similar: we started using shinytest, and we were also looking at unit tests for the R package, but it got too complicated too quickly, because we have a lot of categorical classification and it's mostly visualization based. So we are still finding the best way to move forward with these tests; for the time being, we just have the mockup app where everything works, and we're trying to move it to the current Shiny app. So we don't have any details on the testing yet, I'm sorry, but it's something that's work in progress.

Okay, we're running out of time, so we'll have one last question, and then please feel free to ask more questions on Slack afterwards. This question is from William Michaels for Janani: is there a known sequence alignment program behind the Shiny app, for example Clustal, or possibly a program that your group has developed at MSU? So we are currently using four different alignment tools in the backend for MSA: Clustal Omega, MUSCLE, T-Coffee, and Kalign. But Kevin and other people at MSU are also developing their own MSA algorithms. The idea is that the Shiny app and the R package are meant to be modular, where people can add any newer algorithms or methods that come along. We just want to provide this general-purpose framework for any kind of molecular evolution and phylogenetic analysis, but we have four in place right now for MSA. I hope that answers your question. And yes, Leo, we are using Biostrings now.

Okay, awesome. So with that, thank you very much to all our speakers and our session sponsor Roche, and to all the participants for asking questions. Please feel free to network on Slack, and we'll see you around. Thank you, bye bye.