So, welcome all to this SIB virtual computational biology seminar series. Today we have the pleasure to host here at ETH two speakers, Chandrasekhar Ramakrishnan and Michał Okoniewski from the Scientific IT Services of ETH Zurich and the SIB. And I also want to welcome all of you who are online at the moment. I will go briefly through the bios of our two speakers. Sekar studied mathematics at the University of California, Berkeley, where he obtained his bachelor's degree in 1997, and also studied computer science and art at the University of California, Santa Barbara, where he obtained a master's in 2003. He worked as a software developer and consultant for companies and research institutions in the U.S. and in Germany, and since 2009 he has been at ETH Zurich supporting researchers with software for data management, analysis and visualization. Michał studied computer science at the Warsaw University of Technology in Poland, where he obtained his PhD in 2002. He worked on a number of consulting IT projects in industry and for the United Nations. He then switched to bioinformatics in 2003, doing a postdoc at the University of Antwerp in Belgium and then at Cancer Research UK in Manchester. In 2008 he started working as a bioinformatics expert for the Functional Genomics Center Zurich, and in 2014 he joined the Scientific IT Services here at ETH Zurich. I will also briefly describe what the group does. The Scientific IT Services, also called SIS, is led by Bernd Rinn. It is an interdisciplinary bioinformatics and scientific IT support group which develops computational tools. These tools range from lab databases to reusable framework components that enable and support both data analysis and data management in life science research and beyond. The group collaborates with Swiss and European research groups and industry in the life science sector.
The group improves and ports scientific software, develops data management solutions and provides associated services. The group members also integrate and operate data analysis pipelines and provide training and consulting in databases, scientific software development, and high performance and cloud computing. So today we have the pleasure of having Sekar and Michał, who will tell us about Expose, a suite of tools for visualization and publishing of single-cell RNA-seq data. So, Michał and Sekar, thank you again for accepting this invitation, and the floor is yours. Thank you, Diana. Welcome everyone. The outline of the talk is as follows. I will give a short introduction to publishing of genomic data; in particular I will talk about SRA upload and data retrieval. Then we switch for a moment to the hot topic of single-cell RNA sequencing and its data analysis. This will lead us to the design principles of our system. Then we switch to Sekar for the live demo of Expose. Sekar will also discuss the technical aspects of the web GUI in more detail. Then we talk about the possibilities of customization and future work. So, everyone who generates experiments that include sequencing data should know that when publishing a paper with this data, one needs to make the dataset public. The goal is to allow for reproducible research, to allow other people to repeat the data analysis. That's why all the data should be deposited in the public repositories. Here you have the logos of these repositories; many of the major ones exchange information or mirror each other. Typically what is deposited are the FASTQ reads, or alternatively BAM alignments; these are the raw data that are expected to be published. The data needs to be properly annotated with metadata and then uploaded to the storage of the repository. The goal for the biological paper is to get an identifier, like here in the example, this paper by Sun et al.
They deposited the data in the European Genome-phenome Archive, got this accession number, and with this accession number you will be pointed from the paper to the data. So let's have a quick look at how it is done, using the example of the Short Read Archive (SRA). When you deposit the data, you have to enter the information about the biological project, about the samples, and about the people who generated the data. This is done in the form of a wizard. It's pretty convenient, though there are still some smaller tricks. Once the project is submitted, you have a navigation web interface where you can see your samples, projects and other aspects of the uploaded data. Once the metadata are entered, you upload your actual data; this can be done via FTP or with the Aspera plugin. Then you can see the finished project. The project can have a release date which is arbitrarily selected, or you can decide to release your data, for example, exactly when the paper is published. This is the typical policy: your data is kept private, not released to the public, until a given date. That's a normal feature. Then the data gets into the repository and can be downloaded. The metadata can be browsed, for example, with the SRA Run Selector. Here you have, for example, the experimental description of this particular RNA-seq experiment; you can see single samples, you can filter, you can choose your samples. To download the data, the trick is that SRA uses a proprietary compressed format for storing the data, namely the SRA format. So you have to use the SRA Toolkit, a set of tools used to operate on these archives, mainly to unpack them back into FASTQ. There is also a number of other utility tools. For example, there is the SRAdb library in R, which essentially fetches an SQLite database of all the SRA experiments; its structure is like here, in these five tables.
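To make the command-line download route concrete, here is a minimal sketch assuming the SRA Toolkit's `prefetch` and `fasterq-dump` binaries are installed; the accession `SRR0000000` is a placeholder, and the commands are only assembled, not executed.

```python
# Sketch: assembling the two SRA Toolkit calls needed to fetch one run
# and unpack it from the proprietary SRA format back into FASTQ.

def sra_download_commands(run_accession, outdir="fastq"):
    """Build the shell commands used to retrieve and unpack one SRA run."""
    prefetch = ["prefetch", run_accession]           # downloads <run>.sra
    unpack = ["fasterq-dump", run_accession,         # converts .sra -> FASTQ
              "--outdir", outdir, "--split-files"]   # paired-end reads to _1/_2
    return [prefetch, unpack]

if __name__ == "__main__":
    for cmd in sra_download_commands("SRR0000000"):
        print(" ".join(cmd))
```

In practice these two steps would be wrapped in a loop over all run accessions of the experiment.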
Once you have this SQLite database on your computer, you can query it for the metadata of the experiment and then download the SRA archives from the R level. So let's switch to the story of single-cell RNA sequencing, which is, as I said, a hot topic among gene expression techniques. There is now a number of machines and library prep techniques for next generation sequencing that can be used to read the expression level from a very small number of cells. "Single cell" is partly a buzzword; typically there are at least several cells. RNA is extracted; then, like in other library prep techniques, this RNA is reverse-transcribed into cDNA, amplified, packaged into the sequencing library, and sequenced, typically on Illumina. Then for each single cell you have an expression profile, which can be analyzed and used in a feedback loop for further experiments. That's the very general outline, but these single-cell experiments are really being done right now; they are popping up. One of them was the main motivation for our adaptation of the Expose system for visualization. The analysis of single-cell data is mostly established at the level of the primary analysis, i.e., the standard analysis of RNA-seq. Here you see the typical jungle of software and formats that is used for RNA-seq analysis. For single-cell RNA-seq it's basically done in the same way: alignment, feature counting, and then creating the count table. You can also try to look at splicing and similar events. The problem is that the secondary analysis, the statistical methods, is mostly not developed yet. Statistical groups are working on it; locally, for example, I know that Mark Robinson's group at the University of Zurich is doing research on appropriate methods for single-cell RNA-seq, but it's not there yet.
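The primary-analysis chain just described (alignment, feature counting, count table) can be sketched with one common toolchain among many, STAR plus featureCounts; the file names below are placeholders, and the commands are only assembled, not run.

```python
# Sketch of one possible primary-analysis toolchain for (single-cell)
# RNA-seq: align reads with STAR, then count reads per gene with
# featureCounts. All paths and sample names are hypothetical.

def primary_analysis_commands(fastq, genome_index, gtf, sample):
    """Build the align + count commands for one sample/cell."""
    align = ["STAR", "--genomeDir", genome_index,
             "--readFilesIn", fastq,
             "--outSAMtype", "BAM", "SortedByCoordinate",
             "--outFileNamePrefix", f"{sample}."]
    count = ["featureCounts", "-a", gtf,
             "-o", f"{sample}.counts.txt",
             f"{sample}.Aligned.sortedByCoord.out.bam"]
    return [align, count]

if __name__ == "__main__":
    for cmd in primary_analysis_commands("cell1.fastq", "mm10_index",
                                         "genes.gtf", "cell1"):
        print(" ".join(cmd))
```

Repeated over hundreds or thousands of cells, this is exactly the "data grinding" on a cluster that one would rather not ask every reader of the paper to redo.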
Still, there is a number of tools, for instance in the Bioconductor repository and its courses, which you can use for analysis of your data. Mainly it is now at the level of quality control and looking at the data in general; then you look deeper into the profiles of specific cells and specific genes. In the tutorial by Lun, McCarthy and Marioni you can see histograms, density plots and violin plots. The Cambridge tutorial gives a bit more insight if you would like to read more. The data analysis challenges of single-cell RNA-seq come mostly from the fact that the number of samples, the number of cells sequenced, is much bigger than in a standard RNA-seq experiment. An RNA-seq experiment typically has several samples, sometimes more; a typical single-cell experiment can go into hundreds, maybe even thousands of cells. So the count table has more columns, and as the coverage per cell is lower, the count table is more sparse, has more zeros, and thus different distributions. As I said, there are no established guidelines on the secondary analysis. Sometimes there is a need to compare and link to standard RNA-seq; we have such a case in our project. One of the tricks is also that, with that big number of cells, there is no economical way of re-running the primary analysis. If a single-cell experiment is deposited in, for example, SRA, downloading the data and running the alignment and counting takes a lot of time and a lot of expertise. You would need a dedicated bioinformatics postdoc for a couple of months to do it properly, because even though this can be standardized, there are always different aspects. And this is a lot of data grinding which has to be done on some kind of cluster. So you cannot expect your public users to fully re-run the analysis.
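The sparsity point can be illustrated with a toy comparison in pandas; all counts below are invented, not real experiment data.

```python
# Toy illustration: single-cell count tables have many more columns
# (cells) and a much higher fraction of zeros than bulk RNA-seq tables.
import pandas as pd

def zero_fraction(counts):
    """Fraction of entries in a count table that are exactly zero."""
    return float((counts == 0).to_numpy().mean())

# Bulk RNA-seq: a few samples, dense counts.
bulk = pd.DataFrame({"sample_1": [250, 80, 12], "sample_2": [190, 95, 7]},
                    index=["GeneA", "GeneB", "GeneC"])

# Single cell: many columns, low coverage, lots of dropout zeros.
single_cell = pd.DataFrame(
    {f"cell_{i:03d}": [3 if i % 7 == 0 else 0, 0, 1 if i % 5 == 0 else 0]
     for i in range(200)},
    index=["GeneA", "GeneB", "GeneC"])

print(zero_fraction(bulk), zero_fraction(single_cell))
```

The different zero structure is one reason why bulk RNA-seq statistics cannot simply be reused unchanged.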
That's why, in discussion with Verdon Taylor and the lab members of the group of Dagmar Iber, we discussed and wrote down the user interface design assumptions for a system which would be used to visualize and publish the results, i.e., the data after the primary analysis, the expression levels. So yes, we want to publish the already prepared data, mapped and counted, so that this level can be skipped; we just work on the expression tables and annotation files. The goal is also to give the end user an interactive way of playing with these results, with various ad hoc visualizations. In that way we avoid re-running the primary analysis, save time and computational resources, and hopefully can do it without a bioinformatics expert for the primary analysis. In this particular project, as I said, we have both single-cell and classic RNA-seq population samples. The users need to select specific genes, specific cells, samples in the population, and specific conditions. The interface obviously needs to be web-based, efficient and responsive. And the main feature needed is interactive drilling down in the data to see the local patterns of expression. So those are the assumptions, and these are the input data. We have the count table for the cells, which, as I said, is more sparse. (These are just example data, not the real ones.) In parallel you have the standard RNA-seq samples, and then there is the sample annotation, which in this project includes cell type and time point; the treatment is also made up here. Those four tables are the input tables for the system. As a model app we had seen a nice example, done by a collaboration of groups from Basel and Geneva, I think, which presents a bunch of violin plots and other types of visualizations built from R in Shiny, for a particular data set, for a particular Science paper.
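A minimal sketch of how those input tables fit together, assuming a pandas-based backend; all table contents and names here are made up for illustration.

```python
# Joining a cell count table with the sample annotation so that
# drill-down selections (by gene, cell type, condition) can filter it.
import pandas as pd

cell_counts = pd.DataFrame(
    {"cell_1": [5, 0], "cell_2": [0, 2], "cell_3": [7, 1]},
    index=["GeneA", "GeneB"])

annotation = pd.DataFrame({
    "cell": ["cell_1", "cell_2", "cell_3"],
    "cell_type": ["type1", "type1", "type2"],
    "time_point": ["E11", "E12", "E11"],
})

# Long format: one row per (gene, cell) pair, joined to its annotation.
long = (cell_counts.rename_axis("gene").reset_index()
        .melt(id_vars="gene", var_name="cell", value_name="count")
        .merge(annotation, on="cell"))

# Example drill-down: GeneA in cells of type1.
print(long[(long.cell_type == "type1") & (long.gene == "GeneA")])
```

The population count table would be handled the same way, with samples instead of cells in the columns.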
It works nicely, but when we chose the technology we had concerns, because Shiny and R are known not to be too fast. Shiny in general is a framework for quickly building nice web data-driven apps from the level of R; typically R developers can build a Shiny app in a few mouse clicks. But the possible drawback is the performance and the scalability with the number of users. So we used Shiny, but mainly for prototyping. Then we discussed with Sekar, and this is the point where I switch to Sekar, who will tell you more about the framework that we use. Okay, hello, I'm Sekar. Can everyone hear me? Is it okay? Yeah. All right. Great. So whereas Michał is more of a bioinformatician, I'm more of a software developer and data visualization person, so I came to the project from a different perspective. The tools that we use in our group are much more general purpose programming languages and general purpose architecture. We do a lot of work with HTML, with Python, with Java. Nowadays, especially with frameworks like NumPy, SciPy and Pandas, Python is a really viable environment for data analysis and data manipulation, because you can take advantage of fast matrix operations and you have the data frame model that comes from R, which makes it very convenient to work with and manipulate data. We also use component architectures like React for constructing user interfaces, and libraries like D3 for building interactive visualizations. Admittedly these tools require a greater amount of knowledge, but they also, I think, offer the potential for efficient UIs that are fast, scalable and allow a great deal of interaction. And, I guess most importantly in the context of this project, they are tools that we're familiar with.
The background behind how we came to choose these tools requires taking a step back, before this particular realization of the project. The term Expose and the background behind what we've done in the NeuroStemX project date back to an earlier project that was done with the plant biology lab of Wilhelm Gruissem at ETH; a postdoc and a staff member, Matias, came to us with a proposal for a piece of software that they wanted to have. This was an application for biology-aware exploratory data analysis. They phrased it as somewhere between R and tools like Tableau or Spotfire. The reason for that designation was that, on one hand, you have tools like R which are very powerful and allow you to do all sorts of analysis and visualization, but in order to use them you need to be a programmer. On the other hand, there are tools like Spotfire that make it possible to build visualizations, but they tend to be very general and have no, or very minimal, understanding of biology and of the kinds of visualizations that are necessary and useful in biology. So they wanted an environment which allowed users to build their own applications for data visualization and exploratory data analysis by picking from a library of components, constructing a UI, specifying how these components interact with one another, and then pointing it at some data and going crazy with their analysis and exploration of the data. It's obviously a very ambitious idea, and they had funding for a pilot project, which we then implemented; I think I've got that in the next slide here. Right, it is this kind of application that you see here in the slide. We built a few different components that were drawn from the kinds of data that they used, or utilized existing tools that were out there. So we didn't build our own genome browser.
We used a JavaScript genome browser that I think was developed at the Sanger Institute, and on the right-hand side of the slide you also see a visualization that comes from KEGG, where again we weren't generating that; we were just pointing the user to it as an appropriate interaction for a particular kind of drill-down operation on the data. The idea was to take advantage of things that were already out there to the greatest extent possible, but also to make it possible to customize the UI and the interaction to a particular kind of data or project. This is the foundation that we used to build NeuroStemX, because although Expose itself is currently on hold while we look for funding, we think the architecture is very good, and we thought it was a very natural fit for the kind of interface that the Taylor and Iber groups were asking for to visualize their data from the NeuroStemX project. The architecture is taken from Expose, where we have a bunch of objects, components, that are made available to the UI. These are all React components, which allows us to combine components largely arbitrarily and define very general interactions between them. But at the same time, the individual things wrapped in a component can be very specific. They can be custom-built D3 visualizations, like the one you see for the gene ontology histogram on the right-hand side, or they can reuse tools that were developed elsewhere, like the InCHlib interactive heat map, which was developed at a cheminformatics group in the Czech Republic. It also allows us to combine the underlying technologies in a problem-specific way.
InCHlib is built on top of Canvas, which is necessary for the performance that it delivers, and in the D3 visualization we use SVG, because it's a very convenient underlying model for these kinds of visualizations. I think now I can show a brief demo of the app that we built for interacting with the NeuroStemX data. This is the page that you come to when you initially navigate here; it just gives you some general information about the data. As Michał explained, in NeuroStemX they conducted two different types of measurements on mouse stem cells to understand how the neurons develop. One kind of measurement was based on single-cell data, where they did expression analysis for the entire genome using single-cell samples, and on the other hand they did the same sort of analysis with population data. The goal of the project is to compare and try to spot discrepancies between the activity that happens at the single-cell level versus the population level, because the hypothesis is that there's actually a much greater level of differentiation at the single-cell level than you might necessarily see at the population level. That's something they wanted to explore and better understand. This UI is designed to make it possible to interact with both of these modes of data acquisition in a way that combines them and makes it possible to do analysis and navigate from one type of data to another. As you can see here, the way they acquired the data was parallel, regardless of the kind of measurement: in the single-cell data there are four different cell types that they analyzed, because the stem cells have some sub-differentiation that occurs during their development, and in both cases they analyzed 10 different time points in the development series.
And you see here the number of genes that were analyzed; there's data for a slightly higher number of genes at the population level versus the single-cell level, which makes sense because you can probably pick up fainter signals when you're doing analysis on population-level data. So this is just an initial page that allows you to get an overview of the data. There are two ways of entering the actual data exploration. One way is this Explore tab. The idea here is that you don't necessarily have any preconceived notions of what you're looking at; you just want to see what the data shows. For this we do a clustering analysis. Again, this is very generic; it's not necessarily biology-specific, but we do a clustering on both the different cell types and all of the genes. The genes are displayed in the rows, and the different cell types and time points are displayed in the columns. This is the InCHlib interactive heat map; you can click around in here. We compute a gene ontology analysis on the fly so that you can try to understand what genes are in a given row, because each row at this level of the heat map can contain a large number of genes: we've collapsed genes that cluster together into one row to make it possible to navigate the 30,000 genes within just one UI. If we were to display 30,000 rows, that would quickly become unwieldy, so we've collapsed that to a more reasonable number of rows to support better navigation. You can see here the row that I clicked, highlighted in green; the gene ontology analysis gives you a high-level understanding of what kinds of genes are in there. Apparently they have something to do with the cytosolic small ribosomal subunit and ribosomes in general. What that click operation does is select some genes to look at, so you can then navigate over to the Inspect page and look at those genes in greater detail.
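The row-collapsing idea can be sketched with SciPy's hierarchical clustering: cluster the gene expression profiles, cut the tree into a manageable number of flat clusters, and show one aggregated row per cluster. The data here is synthetic, not the real 30,000-gene matrix, and the aggregation by cluster mean is an assumption for illustration.

```python
# Collapse many gene rows into a navigable number of heat-map rows by
# hierarchical clustering of the expression profiles.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
n_genes, n_conditions, n_rows = 300, 8, 20
expr = rng.normal(size=(n_genes, n_conditions))  # genes x (cell type, time point)

# Ward linkage on the gene profiles, then cut into at most n_rows clusters.
tree = linkage(expr, method="ward")
labels = fcluster(tree, t=n_rows, criterion="maxclust")

# One heat-map row per cluster: the mean profile of its member genes.
collapsed = np.vstack([expr[labels == k].mean(axis=0)
                       for k in np.unique(labels)])
print(collapsed.shape)
```

Clicking a collapsed row then amounts to looking up which gene labels fell into that cluster, which is the set fed into the on-the-fly gene ontology analysis.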
It's of course also possible to go directly to this page. In particular, if you have genes that you're interested in, say 20 genes or so that you like or that you study, you can go directly to this page and input those genes right here, without that step of doing the exploration in order to find the genes that you think are interesting. But because I don't know anything about mouse biology, I usually go through the Explore page, because it helps me find things that I can then look at. What you see here is a matrix of check boxes that allows you to focus in on particular cell types and time points if you're interested in doing that. You see the same heat map visualization, but here each row is one gene, so it's down at the gene level. And you see the population data displayed next to the single-cell data, which makes it possible to look for and spot differences. As you can see, the population data is much more homogeneous, whereas in the single-cell data there's a much, much greater variation in the expression profiles of the individual genes at different time points. So if we take something like this TPT1-PS6: in the population data it's highly expressed everywhere, and if you look at it here in the single-cell data, it's still highly expressed at some of the cell type/time point combinations, but at some of them it's not so highly expressed, and that might be something you want to look at in detail. By going further down this page, you can see the expression values in greater detail, shown as interactive box-and-whisker plots. Let's go down to our friend TPT1-PS6. So you see here the box plots of the expression level data in the population measurements versus the single-cell measurements.
You can see immediately that there's much greater variation, but this is also an interactive element, because one of the things they want to do is drill down to these samples that are outliers and try to understand what's going on within those cells. By clicking on one of these outliers, I go to a sample detail page which shows me data for just that sample, zoomed in on the genes that I've selected. It sometimes takes a couple of seconds to compute, but then it comes up. What we see now are box-and-whisker plots of all of the genes that I've selected, with the individual sample drawn as a dot where it appears on each of these genes. One of the things you might ask yourself is: why is this sample an outlier? Is it because of some measurement screw-up, some dirty data? Or is there something really happening at the biological level that caused this sample to be an outlier? For example, the fact that this sample is within the interquartile range, within the range of normal values, on many genes is an indication that it's probably not just a measurement anomaly that makes this sample an outlier. It's clearly an outlier on some of these genes, like Gm27684 and Gm4735, but it's very much within the normal range, very close to the median, for example on Atp5b. We also take all of the genes for which that sample is an outlier and do a gene ontology analysis on them, in order to see if some particular categories of genes pop up that might give the biologists some idea of what's going on: why is that sample an outlier? Is there some particular aspect of the biology of those genes that's causing the sample to be an outlier? This is often a very large plot, but it's an analysis that we do on the fly and then show here.
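A sketch of the outlier logic behind such box-and-whisker drill-downs: a sample is flagged as an outlier for a gene when its value falls outside the whiskers, i.e., more than 1.5 × IQR beyond the quartiles (the usual Tukey convention; the real app may differ in detail, and the expression values below are invented).

```python
# Tukey-style outlier test, as used for box-and-whisker plots.
import numpy as np

def is_outlier(values, x):
    """True if x lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] of `values`."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return bool(x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr)

expression = np.array([4.1, 4.5, 5.0, 5.2, 5.4, 5.8, 6.0, 6.3])
print(is_outlier(expression, 12.7))  # True: far beyond the upper whisker
print(is_outlier(expression, 5.1))   # False: within the interquartile range
```

Running this test per gene for one sample yields exactly the set of "genes where this sample is an outlier" that the gene ontology analysis is then run on.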
It's also possible to go into the detail view for any individual gene and see the values for just that one gene, if you're interested in one gene rather than a whole bunch, and we give you several different views into that data simultaneously. You can compare the population versus the single-cell data, but we also give you box plots at the cell type level: the single-cell data and the population data drawn right next to each other. This makes it easier to compare the data in a bunch of different ways, depending on what is most appropriate for the kind of analysis that you're doing at the moment. So let me get back here. I think I should hand the floor back to Michał; bear with me for a second. Yeah, so we are moving towards concluding our talk. As Sekar has demonstrated with the Expose interface, it is possible to publish your data in a way that lets people look at the data interactively. The raw data still needs to be published in SRA or a similar repository, but such a web user interface can be published in parallel. It has several advantages. The first advantage is that you can give this interface to the reviewers of your paper to play with. These reviewers can then look into your work in a much nicer way, because, as a reviewer of a genomics paper, if I can't reproduce the story which is described, that is a bit disturbing; and here, assuming the interface works smoothly, it makes your reviewer happy. Then, after the paper is published, the same story holds for end users: they can enter their favorite genes, or, as Sekar was showing, a whole genomic signature, a set of genes, and see how they behave in your samples and cells. And this all happens live, fast and interactively. So our group, Scientific IT Services, can customize Expose in several ways.
We have shown these two flavors, for the Agronomics microarrays and for single-cell RNA-seq. We still plan to turn it into an out-of-the-box application template. The goal would be that biologists come with their count data and annotation data, enter them into the system, and get a live web application that can be published. It is not much work now for our development team to adapt to various types of new data, or various flavors of the data after the primary analysis. So the message is that the system can be customized depending on the data set or the biological purpose of the publication. You are welcome to contact us; we are open for discussion. That's our website. The summary of our activities Diana gave in the beginning: our flagship system is openBIS for life science data management; we do trainings and code clinics, we maintain a quite big cluster, and we do various types of training and consulting as well. We are open for discussion on various types of projects. And we would like to conclude by thanking the people who collaborated with us on this project. Within Scientific IT Services, that's our boss, Bernd Rinn; a person who also did a lot of work on this is Sven. The original Expose project, as Sekar said, was developed in collaboration with the Gruissem lab. Expose on single-cell data was done under the guidance of Verdon Taylor and of people from the Iber lab, especially Zara and Marcelo, who were the people driving the development with their requirements, tests, and generally nice collaboration. So there is the email address of Verdon Taylor, because he said that he can also be contacted about the biological and methodological content of this particular experiment. OK, thank you very much; let's conclude the talk.