and she was crucial in building all these educational resources. She should be presenting with me, but she's not well, so I will do it myself. Thanks again, Maria. Then a couple of questions to understand how the audience knows this type of resource. Who considers themselves proficient in the tidyverse — who uses it every day and is quite proficient? That tells me whether I need to explain things. Yeah, almost the majority. I will ask anyway: who has done single-cell analysis before and feels proficient? Half. And who has used tidy transcriptomics before, for bulk or single-cell data or anything? Good, you are in the right place. And we can make this interactive — you don't have to, but just so I understand how many people want to follow along executing the code; there will be a couple of very easy exercises to stimulate your tidy skills later on. Can you raise your hand if you want to execute the code with me? Okay, half, good. All right, let's start. This is the page you can reach from the schedule, and maybe there are other ways too. You can see here the two important links: workshop details and Orchestra are the two keys. If you click on workshop details, you will see the webpage of the workshop repository, with instructions for executing this workshop locally — just install the package, and all dependencies are sorted out for you. Under syllabus there is the webpage with the actual material of the workshop, which you can use at a later time, but I suggest you go through the code with me today. All the material is here, and you're very welcome to look at it later. So I will guide you through Orchestra, then give a very short introductory presentation, and then we'll go to the hands-on part. For those who want to follow along, you can click on Orchestra and search for "tidy".
You will see a bunch of workshops. Confusingly enough, there are two workshops with exactly the same name for some reason, but you can see one has many more views. Launch that one — it's the right one. If you start it now, we have time during the presentation to sort out anything that doesn't work. Yep, then it's ready. All right, I will zoom in. Is this good, or do you want more for the people at the back? Is it good? Okay, you have good eyesight. In the webpage — though I will not follow the webpage today — you see there are some embedded slides; that's the introduction I will give now. Just a couple of points. First of all, tidy transcriptomics is a small ecosystem that we built over the last couple of years that includes tools to perform transcriptomics analyses and manipulate transcriptomic data, both bulk and single cell. We have a blog about it, where we are building up educational material, and if you ever have to refer to this ecosystem, there are two publications about two parts of it. One is tidybulk, which is an analysis framework and vocabulary. The other is about tidyseurat, which I will not go through today, but it shows the same principles that you will see today as well. Okay, most of you use the tidyverse already. Just a reminder: the tidyverse is a philosophy, and an ecosystem associated with it, that rests on four main principles: reuse existing data structures (the main one being the data frame), compose simple functions with the pipe, embrace functional programming, and design for humans. When we developed tidy transcriptomics, we followed these principles as well. So this is a data frame of class tibble, which is a revamp and curation of this type of data structure. We have different columns here, and columns can be of different types.
For example, characters, very simply. But a tibble can also include other data types: here we are including a plot, a ggplot; it can include another tibble itself. If you want to do iterative analysis, you might want to group your data — or you have multiple datasets and you can iterate functionality over the rows. It can include a linear model. It can include anything, including Seurat objects and single-cell objects. So it goes from a simple table more toward, almost, a database. The data manipulation grammar, of course, is quite different. You are familiar with base R: suppose we want to filter data, create a new column, and plot the data. You can see that we have some redundancy — for example, `data_frame = data_frame[data_frame$class == "A", ]` — then we might want to create a new column based on two others, and then use a plot function. The tidyverse is based on piping: one input is piped into a function that has an output, and that output can be piped into another function, and so on. So you don't have to create temporary variables, which is a quite bug-prone process. You can see here that we have our data frame; we pipe it into a `filter` function, pipe into a `mutate`, and pipe into a plotting function. First of all, we have no redundancy: we don't have to name the input variable multiple times. We have not created any variables, and the vocabulary is pretty self-explanatory, so you have to reach for the documentation much less often, which is something I also like very much. This is a schematic of the ecosystem of tidy transcriptomics at the moment. Above, you have the bulk data, and below, the single cell. The idea is that we created data abstractions — data interfaces, basically — for example to SummarizedExperiment and SingleCellExperiment, that allow you to display the data in a tidy manner and interact with, visualize, and integrate the data with tidy principles.
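The contrast between the two styles might be sketched like this; `my_data` and its columns are invented for illustration, not taken from the workshop:

```r
library(dplyr)
library(ggplot2)

# A hypothetical tibble, just for illustration
my_data <- tibble(class = c("A", "A", "B"), x = 1:3, y = 4:6)

# Base-R style: the variable name is repeated, and an intermediate accumulates
my_data_a <- my_data[my_data$class == "A", ]
my_data_a$z <- my_data_a$x + my_data_a$y
plot(my_data_a$x, my_data_a$z)

# Tidyverse style: one pipeline, no temporary variables, self-explanatory verbs
my_data |>
  filter(class == "A") |>
  mutate(z = x + y) |>
  ggplot(aes(x, z)) +
  geom_point()
```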
So you can see that this tidy representation — here is tidy SingleCellExperiment — can interact with Bioconductor through the code that you are used to, but now it can also interact with tidyverse functionality. The same goes for SummarizedExperiment, and the same for Seurat. We also have an analysis framework for bulk called tidybulk, which is very extensive — probably the biggest library in this ecosystem. I will show it briefly, but you can do all the basic operations with this framework: it wraps most of the common software you might use, uses a very tidy, self-explanatory grammar, uses the pipe concept, and so on. Today we will focus mostly on the SingleCellExperiment object, but also touch on SummarizedExperiment with the pseudobulk analysis. Okay, you're all familiar with SingleCellExperiment. It's a quite complex hierarchical object, and when we display it, we get some summary information, as you can see here. For the analysis, the Bioconductor community contributes packages that interface with this object, which is obviously amazing for collaborating indirectly. For data manipulation, we have a grammar that often resembles base R. To extract metadata, we call `colData`; to extract reduced dimensions, we use the `reducedDims` function; to subset, we can write the object, bracket, comma, and the condition on the column we want to subset by. Modifying the metadata in a very simple way — a simple edit — is very easy; the grammar is very simple. But if we want to do more complex operations, as is common in everyday real-world analysis, it can get a bit more cumbersome. Suppose we have our single-cell object and we want to attach some clinical information, and we want to keep just the cells for which this information is available. We have two separate commands here.
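The base-style grammar just described can be sketched as follows; the metadata column name `Phase` is illustrative:

```r
library(SingleCellExperiment)

colData(sce)                        # extract the cell metadata
reducedDims(sce)                    # extract the reduced-dimension matrices
sce_g1 <- sce[, sce$Phase == "G1"]  # subset cells on a metadata column
sce$phase_l <- tolower(sce$Phase)   # a simple metadata edit stays simple
```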
We can bind our data with our table — being careful to match the rows, of course — and then we can use the subset command to keep the cells which have clinical information, in this case. Now, what if we load tidySingleCellExperiment? First of all, the data will be displayed to us in a different manner: this is a table representation rather than a summary, and the data is much more exposed. We can see the metadata and the reduced-dimension columns, as well as some information about the number of features and number of cells, similar to the default display. What about the analysis? The analysis doesn't change at all — this package doesn't do analysis, so you keep doing whatever you were doing before. The object hasn't changed in any way; we're just providing an interface for the user. What about the manipulation? You can see that you can now use tidyverse functionality. If we want to display the metadata, we don't have to do anything — it's displayed by default. We can select columns, and `select` is a quite powerful operator; here, for example, columns that contain "UMAP". We can filter and mutate. For more complex operations, there are often more elegant ways to do them. For example, again, we want to join a table of clinical information and keep just the cells that have this information. We have powerful operators such as `inner_join`: the matching is sorted out for us; we just define the column we want to join by. And just to mention, for Seurat objects the interface is exactly the same, so if you swap Seurat for SingleCellExperiment, your code doesn't have to change at all. The visualization is the same — the only thing that changes is that it tells you it's a Seurat abstraction — and the analysis is Seurat's business; we don't intervene in that. And the manipulation, as you can see, is a copy-and-paste of the code above, and you get the same result.
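A sketch of the two approaches, assuming a hypothetical `clinical` table keyed by a `sample` column that also exists in the cell metadata:

```r
library(tidySingleCellExperiment)
library(dplyr)

# Hypothetical clinical table, for illustration
clinical <- tibble(sample    = c("S1", "S2"),
                   treatment = c("drug", "placebo"))

# Base style: bind, taking care to match rows by hand, then subset
colData(sce) <- cbind(colData(sce),
                      clinical[match(sce$sample, clinical$sample), ])
sce_annotated <- sce[, !is.na(sce$treatment)]

# Tidy style: the matching and the filtering are one join
sce_annotated <- sce |> inner_join(clinical, by = "sample")
```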
Here, just to mention the functionalities we have: we support most of the dplyr and tidyr functions — you can have a look here — and also ggplot and plotly functionality. So it's virtually like a table for almost all purposes. The last bit of this introduction is just to clarify what tidySingleCellExperiment is and what it is not. It is not a data container — the data container is SingleCellExperiment; we don't touch it in any way. And it's not an analysis tool, as I said. What it is, is a data interface: we represent the data in a different manner, and we let you interact with the tidyverse as well, besides all the operations you were doing before — manipulation, integration, and visualization. So the question I sometimes get — how can I go from a tidySingleCellExperiment to a SingleCellExperiment — is not relevant, because we never leave SingleCellExperiment. You can even toggle between views: if you are very nostalgic for the original description, you can toggle it and still be able to operate with the tidyverse. Okay, let's go on to Orchestra. Just to confirm, are we all in Orchestra, or do you want some time? Raise your hand if you are all set. The minority, I hope. Okay, let me know if you have problems — that's the best way. You can go into the vignettes. We have just one vignette for this workshop, the tidy transcriptomics case study, so you can open that. All right, here is all the information that is then rendered on the website; I won't go through everything — this is also for people who want to go through it offline. I just want to specify what you will learn here and what you will not. You will learn about the basic data-manipulation operations that you can now do with SingleCellExperiment, about the representation of this data, and how to interface with tidyverse functionality.
And I will show a couple of real case studies — very small toy examples — that use this technology. What this workshop is not: it is not a reference for analysis, so we don't aim to teach you how to analyze data, and we are not teaching — well, me; Maria is offline — the fundamentals of the tidyverse either. We assume that you more or less know single-cell analysis and the tidyverse; this workshop is focused on how to put them together. First of all, we load a few libraries. You might be familiar with them: some tidyverse packages and SingleCellExperiment. Now we load our SingleCellExperiment object and visualize it. As you know, that's the display: we get the dimensions, some gene names, barcode names, and so on. Okay, now let's load tidySingleCellExperiment. After we load this library, if we print the same object, it now looks different, as I mentioned: we see all the metadata, and also the UMAP dimensions and any other reduced dimensions that might be there. We can see that we have so many features, so many cells, and so many assays. Now, as I said, if you very much like the original display, you just set one option and the display is toggled back to the original one. But I like the tidy representation more, so we can toggle back, and here you go. As I mentioned, the object is untouched, so you can apply every functionality that you were applying before, with no problem. For example, `assays` on this object — which interrogates what assays are included — works fine, and you can imagine that any function operating on SingleCellExperiment will still work. All right, before starting the next section, does anybody have any questions, curiosities, anything at all? Does he need a mic? Okay, I can repeat your question. Yes. "So my question is quite simple: are you overriding any methods, for example the show method of the SingleCellExperiment object? How do you solve this?"
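In code, the loading and toggling looks roughly like this; the option name follows the tidySingleCellExperiment documentation, so treat it as a sketch to check against your installed version:

```r
library(SingleCellExperiment)
library(tidySingleCellExperiment)

sce   # printed as a tibble abstraction once the tidy package is loaded

# Toggle back to the classical summary display; the object itself
# is never modified, only the show method changes behaviour
options("restore_SingleCellExperiment_show" = TRUE)
sce   # classical SingleCellExperiment display

options("restore_SingleCellExperiment_show" = FALSE)

# Any Bioconductor accessor still works unchanged
assays(sce)
```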
Yes, the only method I'm overriding is the show method: basically, instead of showing a description, I show something else, which exposes all the information. So it's an abstraction of the actual object in tidyverse form, but because you can manipulate it exactly as you see it, it works well. It's as if you had a tibble in front of you, for almost all purposes. And when you set the option, that method reads the option, and if you want the old version, it just returns the old visualization. This allowed us to not touch the object. So if you manipulate your SingleCellExperiment with these tidy operators but your colleague doesn't have tidySingleCellExperiment, they will never know — the object will never know tidySingleCellExperiment existed, which is very good; it doesn't leave any trace. Okay, in the next section I will show you some simple commands on this data, and you let me know what you think. There are the common dplyr operators, such as `filter`. In this case, we take our — let's call it a data frame — and we can filter for a specific cell-cycle phase, which is here. We want G1, and you can see that we have fewer cells; so we filtered cells on this column. You can imagine filtering on a much more complex combination of parameters. You can `select` the columns you want to visualize, or want to continue your analysis with — in this case, cell, file, and a few other columns. Here the reduced dimensions pop up because they are view-only columns: if what is returned is a valid SingleCellExperiment, those columns will be displayed as well. As you can see, we are doing these operations, and what is returned is, again, a SingleCellExperiment object. We can use `mutate` to create new columns in very powerful ways, or to modify existing ones. In this case it's very trivial: we are converting phase to lower case.
You can see now that the lower-case phase is returned; and just for display, I'm selecting a few columns for you to see. Yes — and that's the point: if in any way you produce a non-valid SingleCellExperiment, for example if you summarize or omit key columns, there is no problem: a plain table is returned to you, for whatever you want to do next. You might want to summarize and visualize, or summarize and integrate with other things. You simply receive a message that it's now a data frame. Here we just select file; there is no cell ID, so obviously it's not a SingleCellExperiment anymore. In real-world scenarios I do much, much more complex manipulation — frequently integrating information, summarizing, taking data from somewhere else. And the learning curve for that is very gentle with the tidyverse, because the operators are so powerful. For example, we have a file column, as I showed you before — well, it's above, you have to believe me — and let's say the sample name is embedded in this file column; sometimes that happens. We can use `extract` from tidyr to supply a regular expression, and we can extract one or more columns from it. As you can see, we have now created a new column called sample, and — maybe to visualize, let's take off everything else — we have isolated the sample ID from the file, pretty elegantly. And of course, the good thing is that all these operations can be piped together, so you can imagine how much tidier it is and how many fewer variables you have. Especially, you don't have to reassign a variable to itself, which in interactive programming is very bad in my opinion and causes a lot of bugs. Another neat thing tidyr has is `unite`: you can unite columns to create new IDs — sometimes that's necessary — or you can `separate`. As you can imagine, here we are uniting sample and BCB to create a new sample ID.
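Put together, the operations from this section might read as one pipeline; the column names (`Phase`, `file`, `BCB`) follow the talk, and the regular expression is purely illustrative:

```r
library(tidySingleCellExperiment)
library(dplyr)
library(tidyr)

sce |>
  filter(Phase == "G1") |>                # keep cells in one cycle phase
  mutate(phase_l = tolower(Phase)) |>     # derive a new metadata column
  extract(file, "sample",                 # pull a sample ID out of the
          ".*/([a-zA-Z0-9_]+)/outs.*",    # file path with a regex
          remove = FALSE) |>
  unite("sample_id", c(sample, BCB),      # combine columns into a new ID
        remove = FALSE) |>
  select(.cell, sample_id, phase_l)       # dropping key columns returns
                                          # a plain tibble, with a message
```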
And just to show you, we are selecting that column. All right, before the next section, do you have any questions? Would you like to see something more than what I've shown? Any curiosities? Also people online — we welcome any comment or question. All right, let's step into a little case study that happened with one of the datasets we analyzed: the calculation of a transcriptomic signature that we identified in an article. Here are some notes on how this dataset was created, but the dataset comes ready in the workshop material. By default, for SingleCellExperiment, the information we expose to the user is the cell-related information — all the metadata and reduced dimensions — because the rationale is that the cell is the crucial unit of a single-cell experiment, compared with a bulk RNA-sequencing analysis, for example. But we have a function called `join_features` that allows you to add to the data you are seeing, and can manipulate, some transcripts — some genes, in this case. So we can add all the genes that are in the signature we want to interrogate. As you can see, here they are: they are now part of your metadata, columns that you can now operate on. The goal here was this: we had a breast metastasis dataset, and after we had analyzed it for some time, a question came back asking us to provide more evidence and analysis for gamma delta T cells, which are not as well documented in studies as other cell types. So, after the fact, we had to go back: we identified a signature from the literature — Pizzolato proposed a combination of genes to create, basically, a score — with which we scored cells, re-isolated them, and did some analysis. That's the part I'm showing you here. After we have joined these features, we have them in our dataset, and we can simply use the common `mutate` to apply the arithmetic that Pizzolato proposed.
For example, these cells are positive for these genes and negative for CD8, so the proposal was to sum the abundance of the positive genes and subtract the others; we also rescaled them because we saw it gave better results. As you can see, in just one `mutate` call we can apply quite sophisticated arithmetic in an elegant way — we don't have to create any variables here — and we are creating a new column, the signature score. Now, if I show you, we have created this signature-score column from the genes we added to our data. Here I'm repeating code just to show you that in a real analysis you don't need to keep creating new variables to test things on the fly: you can pipe everything together; if there is something you want to keep, you save it, otherwise you go on. In this case, we have manipulated our data, and you can use Bioconductor functions in conjunction with tidyverse functionality — for example, to visualize these cells with their score. So we use `plotUMAP` and colour by this new column. You can see that this cluster is most likely the cells we are looking for, as the signature score is quite high and the cells are quite compactly grouped. So we might want to isolate this cluster and do some more analysis to understand whether these are the cells we think they are. There are many visualization functionalities in Bioconductor and beyond, but when you load tidySingleCellExperiment you also gain the possibility of using, for example, ggplot. Often we want to create custom plots, and this library makes that possible without much hassle — it interoperates directly: what you see as a data frame, you can visualize using ggplot. In this case, we are doing a scatter plot of our dimensions, coloured by the signature. We might want to add shape, transparency, faceting, whatever — you know how powerful ggplot is. This is still a quite simple example.
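The signature step might be sketched like this; the gene list and the rescaling are an approximation of the published signature described in the talk, not a verbatim copy of the workshop code:

```r
library(tidySingleCellExperiment)
library(dplyr)
library(scater)   # provides plotUMAP

sce |>
  # expose the signature genes as columns alongside the cell metadata
  join_features(features = c("CD3D", "TRDC", "TRGC1", "TRGC2",
                             "CD8A", "CD8B"),
                shape = "wide") |>
  # positive genes minus the CD8 genes, each part rescaled to [0, 1]
  mutate(signature_score =
           scales::rescale(CD3D + TRDC + TRGC1 + TRGC2, to = c(0, 1)) -
           scales::rescale(CD8A + CD8B, to = c(0, 1))) |>
  # a Bioconductor plotting function, piped straight after tidyverse verbs
  plotUMAP(colour_by = "signature_score")
```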
As you can see here, we chose the spectral palette and some styling, and again it's pretty clear that this group of cells might be the gamma delta cells we are looking for. Okay, now that we have plotted them, again we take our code, and we might want to isolate these cells and re-integrate (if we have multiple samples), re-analyze, recalculate variable genes, and so on and so forth. And again, it's pretty simple here — no variables created; we are just in the exploratory phase. Obviously this is not the best way to filter, but this is, in the end, a toy example. For simplicity, we just set a threshold; let's suppose this threshold is ideal. Of course, you would want to recluster and select the exact cells, but let's suppose this is a very good way to do it, and we can filter the cells simply on whatever condition we like — in this case, a signature score bigger than 0.7 — and we save a new object called gamma_delta. Let's see: here we are, we have 72 cells. Of course, there are really many more, and thresholding based just on the signature is not ideal, but nonetheless this is one filtering that we might apply, just for demonstration. And so we have our SingleCellExperiment ready to re-analyze. For comparison, here is the base-R way to do it, side by side with our approach. In that case, you would create two different objects with the positive and negative signatures, apply some summary statistics and rescaling, then do the arithmetic here as well, and update and filter the data based on it. As you can see, we have created three different variables that we now have either to remove or to keep. Okay, so again, this is an example. One thing we might want to do is re-analyze the data, as I mentioned. And here is another example of how we can pipe tidyverse commands with Bioconductor commands in a seamless way. In this case, what are we doing? We are re-integrating the samples in the data.
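The isolation step is then a single `filter`; `sce_scored` is a hypothetical name for an object already carrying the signature score from the step above:

```r
library(tidySingleCellExperiment)
library(dplyr)

sce_gamma_delta <-
  sce_scored |>                     # hypothetical: object with signature_score
  filter(signature_score > 0.7)     # crude threshold, for demonstration only

sce_gamma_delta                     # still a valid SingleCellExperiment
```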
These are functionalities from Bioconductor, and, for example, `fastMNN` drops the metadata information. So we can simply `left_join` — another dplyr functionality — the metadata of our dataset; we can simply left-join a table. We can use `as_tibble` to convert, at any time and for any reason, our SingleCellExperiment into a table. So, as you can see, we execute — here we go — this calculates some corrected reduced dimensions that you can see here. However, we don't have our metadata anymore. So, very simply, we left-join our metadata and we are back in the shape where we started. Okay, you didn't see this — maybe some cell IDs have been changed. Okay, it worked before; I'm not sure what's happening now. But yes, I think it's joining the cell IDs by batch. Let me see — is this the problem? Yeah, that was the problem: we can choose the column to join by, just as in the tidyverse, and here we have all the information back. As you can imagine, left-joining on the fly — being able to left-join and pipe forward — is very important: we have to integrate into our SingleCellExperiment, all the time, information that we have calculated or that is external. So I find this very, very good. And again, we can run a UMAP after this, to see — of course, this UMAP will be very uninteresting. Okay, sorry. And then we plot the UMAPs. Yeah, we are just calculating reduced dimensions here; and again, just to show you that the UMAP exists — of course, there are not enough cells to do anything. Any questions so far? I've shown quite a few things; if you have any doubts or anything, feel free; I will repeat. No, that's — yes. So, in the future we will also play with some colouring of the table; it is now possible to communicate these things visually. But, for example, the cell column does not exist in SingleCellExperiment — it's just the row names. So `.cell` is a column you cannot rename or touch; it's a view-only column, for clear reasons.
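A sketch of the re-integration step; `fastMNN` is from the batchelor package, `runUMAP` from scater, and the joined columns (`sample`, `cell_type`) are illustrative names:

```r
library(batchelor)
library(scater)
library(dplyr)
library(tidySingleCellExperiment)

sce_gamma_delta |>
  # integrate across samples; fastMNN returns a new object with a
  # "corrected" reduced dimension but without the original metadata
  fastMNN(batch = sce_gamma_delta$sample) |>
  # recover the metadata by joining the original object, converted
  # to a table, on the cell identifier
  left_join(sce_gamma_delta |>
              as_tibble() |>
              select(.cell, sample, cell_type),
            by = ".cell") |>
  # recompute a UMAP from the corrected dimensions
  runUMAP(dimred = "corrected")
```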
And the standard that we have applied, both in SummarizedExperiment and in SingleCellExperiment, is that those types of columns are `.cell`, `.feature`, `.sample`. The practical reason is that often there might be another column called cell, sample, or feature, so we want to avoid column-name clashes. `.cell` is pretty unique and pretty convenient — yeah, it's just a convention. Other questions, comments? Do you find any of this confusing? Something you would do differently? Feedback is also very welcome. All right. Another good thing that becomes pretty natural with this type of data representation is that we can also interface with plotly. For example — these are pretty big objects, so I will not execute this — instead of two UMAP dimensions, you might want to calculate three, to increase the dimensionality of the data you want to visualize. This is an object that we already have, with three UMAP dimensions, so I'm just plotting it. You can now directly feed your SingleCellExperiment to plotly. I will show you the object first: you can see that you have three UMAP columns here, and then it's pretty intuitive to call whatever dimensions you want to plot. We are colouring by cell type, which is another column we have. And you can get, pretty easily, a three-dimensional representation of your data. Of course, in UMAP, all the variability here is also in the 2D version, mathematically speaking; but I have found it more than once pretty convenient to add a third dimension to get a better feeling for some heterogeneity that maybe was not that clear in two dimensions. And also, because it's plotly, you can hide some cell types to get a better idea of the data. Sometimes this 3D view, even if you use it for a 2D static image, offers a position where, visually, the clusters are quite separate and visible.
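The 3D call might look like this, assuming a hypothetical object `sce_umap3` whose reduced dimensions expose three UMAP columns named `UMAP1`–`UMAP3`, plus a `cell_type` metadata column:

```r
library(tidySingleCellExperiment)
library(plotly)

sce_umap3 |>
  plot_ly(
    x = ~UMAP1, y = ~UMAP2, z = ~UMAP3,   # the three reduced dimensions
    color = ~cell_type,                   # one trace per cell type, so
    type = "scatter3d",                   # types can be hidden from the legend
    mode = "markers",
    marker = list(size = 2)
  )
```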
All right, so for those of you brave enough, we have two very simple exercises that I would be happy for you to try, so we can see together how these things work. These exercises use the SCE object variable, and we have seen that we calculated a signature for gamma delta cells and filtered those cells before. Now the question is: if we use the same threshold, what is the proportion of gamma delta cells compared with all other cells? So: what proportion of all cells are gamma delta T cells, using a signature score bigger than 0.7 to identify them? Okay. What? He's not with us, okay. These are pretty simple; it involves just two functions. You can obviously reuse the code we wrote above for the signature calculation, and then you just need to add a couple of functions to summarize and create the summary statistic. I'll leave you a few minutes for that. What are you doing? Someone is unmuted there. Also the people joining remotely — I'd be very happy if you try the exercise, and at the end you can post whatever solution you got in the chat so we can see it. Okay, anyone finished? Who is trying? I see three, four, five people. Who wants one or two more minutes? I'm happy if you tried and are finished — good. Anybody want some more time? Cool. What's the answer? Part one. What was the first question? Yes — so what's the answer to the first question? The first one? Okay, good. Anybody in the chat? Good, all right. Let's see if I get the same result. Okay, so we had the code of our signature calculation and filtering. We can copy and paste this code. Oops, all right. Instead of filtering, we could use a `mutate`: we create a new column called gamma_delta according to this condition, so we get a new column that is either true or false. Then we can use a summarization function — for example, `count`.
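The first exercise's solution might be condensed into one pipeline; `sce_scored` is a hypothetical name for an object already carrying the signature score from the earlier calculation:

```r
library(tidySingleCellExperiment)
library(dplyr)

sce_scored |>                                      # hypothetical scored object
  mutate(gamma_delta = signature_score > 0.7) |>   # flag instead of filtering
  count(gamma_delta) |>                            # returns a plain tibble
  mutate(proportion = n / sum(n))                  # ~2% TRUE in the talk
```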
And we can count the instances of gamma delta cells that we have in our dataset. Oops. And the answer is the ratio of these two. Roughly — we don't get the 72 back, but to give you an idea, it's about 2%. Good, good team. All right, some of you might have started the second exercise. It's quite similar, but you have to start from a slightly more upstream point. There are some cells that have a low transcriptional output — if you plotted the transcriptional output you would see them — but let's say we set a threshold of 100 total read counts. The question is: of these weird cells that we might want to explore, what is the cell-type composition? So we have some cells that we want to isolate based on some property, and we want to understand the cell-type proportions in this set. I will leave some time, unless everybody has already done this. And of course, if you have a look at the object, this is a column of the metadata; since this object has many columns, as often happens, you can always use the `select` functionality to visualize just that column, to get an idea. You can do that in this case too. And, sorry — obviously the cell_type column, as written here, is the variable you want to understand the composition of. Can you raise your hand — who is trying the exercise? Of the ones who are trying, who wants more time? Nobody wants more time? All right. For the ones who tried, let's make it a bit more specific: if you calculate the composition of the cells, which cell type is the most abundant? It's a precise answer, and we can check. So again: once you have calculated the composition of these low-transcript cells, which cell type is the most abundant? I will try it myself in one minute. Since you have done the exercise, the question is which cell type is most abundant in that composition. In one minute you can shout it all together, so you don't cheat. All right: one, two, three. Wow, how many have done this?
You're shy when I ask you to raise your hand. Come on — yeah, that's true. But good, let's see if I get the same; well, yes, I have to show how to do it. So again, this time we start from our original object, which is here. Let me read the question again. All right, so obviously we want to filter first with this condition — I can literally copy and paste. We just do this and we get a few cells — sorry, not a few cells: we get 2000 cells. And now, as before, there are many summarization functions, but let's say we want to use `count`: we can count the instances of each cell type. In this case, you see that we are summarizing, so we are not getting a SingleCellExperiment back anymore, obviously — we just get this message and a data frame. We might want to plot, do a histogram, do whatever we want with this. A shortcut to my question is that we can `arrange` based on n. So yes, the TCR V-delta-1 cells are the most abundant. Normally they are not — we are enriched for this exercise; gamma delta cells are quite rare, if you are wondering. And, for example, if we want to calculate proportions, we can create a new column: let's say proportion equals n divided by the sum of n. We have our proportions here, and so on and so forth; we can also arrange based on proportion to get the answer. Okay, well done — happy you are getting into it. All right, before the third and last part — which is a bit more intense, brain-wise — does anybody have comments or questions? "Yes, so I didn't know about the count function, so I actually used group_by and then summarize, which I guess does the same thing." For sure. "But already after grouping, it says that a table is returned — which, I mean, grouping is a non-invasive thing for the annotations, so I don't understand why that happens." Well, that's true — after you group, you don't necessarily have to summarize. Yes, it's actually not trivial in the backend to group.
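The second exercise's solution, as walked through above; the low-count column name (`nCount_RNA`) is an assumption, so check what your object actually calls its total-count column:

```r
library(tidySingleCellExperiment)
library(dplyr)

sce |>
  filter(nCount_RNA < 100) |>        # low-output cells; column name assumed
  count(cell_type) |>                # summarising returns a plain tibble,
                                     # with a message, not an SCE
  mutate(proportion = n / sum(n)) |>
  arrange(desc(proportion))          # most abundant cell type on top
```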
I mean, you should be able to group, but in our representation we are integrating reduced dimensions and metadata, and we will include spatial coordinates and so on. So it's not as trivial as applying that to the metadata, for example. Yeah, I mean, you're right, it could be implemented; actually, I will think about that. The only reason I did not is that I thought that if someone wants to group, most likely they want to summarize, and if you summarize, you're already not getting that back. But yeah, thanks, I will think about that. Okay, so the third part is possibly the most interesting, because of course we developed a lot of infrastructure for bulk RNA sequencing, which is still relevant, but it has taken on a new importance: often we want to summarize our cells into samples, or pseudo-samples, to do exploration or to do an actual differential analysis. So I will show you here that from this SingleCellExperiment object we can aggregate cells to get to a SummarizedExperiment that you are familiar with, and then we can apply the many, many functions that tidybulk offers. Again, it's all in a tidy ecosystem, so it's very neat, doesn't require variable creation, and so on and so forth; the concepts remain. So for this last part we again load a few libraries: we load glue, tidyr and purrr, and the very popular patchwork as well for visualization. If you haven't used it, it's very good, incredible. And we load the libraries I just told you about: tidybulk and tidySummarizedExperiment. Okay, for this we have a neat function in the workshop, aggregate_cells, which is quite generic, so it's quite flexible and good. You give it a SingleCellExperiment and you specify by what you want to group your cells. For example, a typical case is that we might want to group cells by sample and by cell type, and then do differential transcript abundance analysis across cell types.
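A rough sketch of the aggregation step just described, assuming the single-cell object is called `sce` and the metadata columns are `sample` and `cell_type`; the exact argument form of `aggregate_cells` may differ slightly from what is shown here.

```r
library(tidySingleCellExperiment)
library(tidySummarizedExperiment)
library(tidybulk)

# Collapse cells into one pseudo-sample per sample / cell-type pair
pseudobulk <- sce |>
  aggregate_cells(c(sample, cell_type))

# The result is a SummarizedExperiment, printed as a tidy tibble
pseudobulk
```

Because tidySummarizedExperiment is loaded, the aggregated object prints and manipulates like a tibble while remaining a standard SummarizedExperiment underneath.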
Of course, that requires multiple samples with two or more conditions. And, I mean, I think I will put this into a gist as well, but this is a function you can also import from the package of the workshop. So the input object is the one you know, and the output object I will show you now: it will be a SummarizedExperiment, but we will still have a tidy representation because we imported tidySummarizedExperiment. Of course, we can toggle to the classical view at any time, even for SummarizedExperiment. You can see here that we have features, samples and our assays; all the metadata is exposed to us. Again, we can filter, integrate, visualize, with no need to interrogate the object internals. In this case we have a feature column, because features are an important part of the bulk world, and a sample column; again, we have these dotted columns, as I mentioned before: they don't actually exist in the object, they are our abstraction of it. And our counts assay. You can see here we have another sample column in the metadata, and you can see here cell types. Since we grouped by sample and cell type, these two columns, we get a sample identifier that of course is unique: it's a combination of the two, for example. Well, this graphic is on the tidybulk GitHub page, but it is just to show that tidybulk can operate directly on tables, on data frames. So you can actually convert this to a tibble, and then it will be an actual data frame, or keep the SummarizedExperiment; again, nothing changes between the two. You can toggle between them for any reason, and you use exactly the same code. And on this SummarizedExperiment you can use tidybulk now, but still use whatever Bioconductor functionality you used before. Again, the object is untouched; it will never know tidySummarizedExperiment exists. All right, so, this is quite important because the next part is quite based on these functionalities: who has used nest before, from the tidyr package? Nobody? One, two, three, maybe four.
Okay, the minority. So nest is probably one of the most powerful functionalities in the tidyverse. R is, at its core, a functional programming language, although, you know, we are used to for loops and while loops and so on; nest is a very functional implementation of this concept. As I showed you, a dataset, a table, can include other tables. And as I will show you now, you might want to group your dataset, produce different plots, produce different analyses, and so on and so forth, and nest allows you to do that. And of course, here nest is abstracted for your SummarizedExperiment object. In this case, we want to do differential analysis across cell types: for example, we have cancer and healthy conditions, and we ask which genes are up- or down-regulated between these two conditions for T cells, B cells and so on. We don't need to split the object, run a loop, save variables and merge them again; it's all possible because of this nesting. For example, we nested by cell type here, and we created a new column with the nested objects, called grouped summarized experiment. You can see here that we have our cell type, and each row of this column includes a SummarizedExperiment object. Now you can iterate methods over these objects very neatly; it's all in one consistent data frame, and because we are not creating variables, it's much harder to create bugs in our code. So, just to show you what is inside the first row: for this SummarizedExperiment, I just use slice one and pull my variable, and this is still a SummarizedExperiment object. You can see here the cell type is, oh no, sorry, the column is outside, but this includes just the CD4 ribosome-rich cells, okay? Does anybody have questions now? I mean, I know this concept is a bit unintuitive sometimes. Has everybody understood exactly what we are doing here? Cool. All right, so let's get to work. Now, let's say we want to do again a differential transcript abundance analysis.
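The nesting step can be sketched like this, assuming a pseudobulk SummarizedExperiment named `pseudobulk` with a `cell_type` column; the column name `grouped_summarized_experiment` mirrors the one used in the workshop.

```r
library(tidySummarizedExperiment)
library(tidyr)
library(dplyr)

# One row per cell type; each row carries its own SummarizedExperiment
nested <- pseudobulk |>
  nest(grouped_summarized_experiment = -cell_type)

# Peek inside the first row: still a plain SummarizedExperiment
nested |>
  slice(1) |>
  pull(grouped_summarized_experiment)
```

The `-cell_type` tidyselect syntax means "nest everything except cell_type", which is what leaves one grouping row per cell type.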
Here, on our nested object, we are using mutate: we are operating on some of the columns to create a new one, or to update an existing one. When you use nest, you often use map in tandem. Map takes some columns as input and performs whatever operation you want, to create a new column or update a column. So in this case we mutate: we create, sorry, we update this column, and in map we have two arguments. One is the input column, which is the same one because we are updating it, and the second is an operation that can be as complex or as simple as you wish. This operation includes, in this case, some tidybulk functionalities to do the whole analysis in a tidy manner. For example, tidybulk includes a function called identify_abundant, which in this case uses the edgeR framework in the backend to separate abundant genes from non-abundant genes, the step you usually call filtering. Here we are not filtering, we're just labelling the genes that we want to keep in our analysis, according to, in this case, a covariate, which is treatment. A SummarizedExperiment is returned, and we pipe into a scale_abundance function with a method. There are many methods for scaling, which might be called normalization, but basically you are calculating a scaling factor, just one number, for each sample. This is used for visualization: usually you want to check your distributions, and when you plot your transcriptional abundance it's good to have it scaled. And then we apply the differential transcript abundance test, where we specify our formula; in this case, treatment is our only covariate. There are many methods, including edgeR, limma and DESeq2, and we give our scaling method. And because this is inside the map, it will be applied to all cell types, and this column will be updated with this information. And now it will take probably 15 seconds or a bit more to go through all cell types. Yes?
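Put together, the mutate-plus-map pattern just described looks roughly like this; `nested` stands for the nested tibble from the previous step, and the method strings ("TMM", "edgeR_quasi_likelihood") are illustrative choices, not the only ones tidybulk accepts.

```r
library(tidybulk)
library(purrr)
library(dplyr)

nested <- nested |>
  mutate(grouped_summarized_experiment = map(
    grouped_summarized_experiment,
    ~ .x |>
      identify_abundant(factor_of_interest = treatment) |>  # label, don't drop, low-abundance genes
      scale_abundance(method = "TMM") |>                    # one scaling factor per sample
      test_differential_abundance(
        ~ treatment,                                        # treatment is the only covariate
        method = "edgeR_quasi_likelihood"
      )
  ))
```

Because the pipeline sits inside `map`, the same three steps run once per cell type, and the nested column is updated in place with the results.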
So I guess I missed something here: what is the difference between group_by and nest? Well, for very simple operations there might not be a difference. But for example, if you group by one column and you summarize, you lose all the other columns. If you use nest, you can create a new column that summarizes something, you might filter based on this new variable, and you might expand the old dataset again, so you don't lose any information. Let's suppose you use group_by: you lose some variables, and you might have to left-join them back afterwards, or something like that. So nest is much, much more powerful; it's a much more powerful abstraction. For example, as I will show you, you can nest your data and create a new column with plots, one visualization for each cell type; with group_by you would not be able to, or maybe yes, I don't know. But the concept is much more powerful. Nested objects will always exist there, and you can interrogate them downstream in your analysis. You never lose them, basically; they're always there, hidden. Yeah? Instead of map, if it's computationally intensive, could you use BiocParallel? Yes, you could. I was actually interested in this, and I read some discussions. Map is neat for one reason: you have map_int, map_chr, and so on. So let's suppose I have these objects, for example. By the way, we have done our analysis and it's still packaged as before; we just updated our object, very neat. With map, because there is map_int, map_chr, or whatever type of object we want to return, we can do something like this: we want to calculate the number of cells for each dataset, or maybe more complex things; we want to filter and count, something like that, creating a new column.
We can create this n column with map_int, so an integer will be returned, and we call nrow on our data; sorry, yes, that's the input column, which of course we have to provide to map. And so we have a new column that is an integer. So it returns whatever type you want. If we just use map, I think lapply does exactly the same thing. And you can apply parallel frameworks: in fact, there is a package called furrr, which uses the future infrastructure, which is, again, pretty powerful, to apply this map functionality in parallel. So basically I could have run all my differential analyses in parallel, or maybe on a cluster or something like that. You can imagine that this could be quite powerful if you have single-cell data and you want to distribute your computation. Okay, so that's the object that is returned. Any other question before we go on? Yeah, this is a question from the chat. It says: not a package-related question, but how do we decide if we require pseudobulking for the analysis or not? Can't we simply compare treated versus untreated per cluster? Is pseudobulking required when comparing scRNA-seq results with bulk RNA-seq results? Okay, so if the question is, do we necessarily need to use pseudobulking, and how do we decide that we need it? I mean, it's a personal choice. There are many methods: some approaches summarize the data by pseudobulking and use well-known methods for testing, and there are also mixture models or other approaches that do not require pseudobulking; it's the researcher's choice. One very useful thing, for example: let's say you have a very big dataset with 100 samples. For the initial exploratory analysis, to understand the biological or technical variability in your dataset, it is much easier to reduce the cells of each sample to a single pseudo-sample and do a PCA, for example.
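Here is a minimal sketch of the typed and parallel variants just mentioned, assuming the nested column is called `grouped_summarized_experiment`; furrr's `future_map_*` functions are drop-in replacements for purrr's `map_*`.

```r
library(purrr)
library(dplyr)

# map_int guarantees the new column is an integer vector
nested <- nested |>
  mutate(n_genes = map_int(grouped_summarized_experiment, nrow))

# The same computation distributed with furrr + future
library(furrr)
plan(multisession, workers = 4)  # choose a backend: multisession, cluster, ...
nested <- nested |>
  mutate(n_genes = future_map_int(grouped_summarized_experiment, nrow))
```

Switching the `plan()` is all it takes to move the same code from a laptop to a cluster backend.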
Then you can decide your covariates, your confounders, and so on and so forth. So for exploratory analysis it never hurts, I think it's very useful; for hypothesis testing it's a complex answer, so it's the researcher's choice. Any other question before we go on? We have the last couple of blocks and then we're done. So, is everybody clear with what we have done here, applying this map function? If there is any doubt, let me know, so we're all on the same page; that would be nice. Okay, here is some output. Again, this is basically becoming a database; it is, again, very clean to understand. I will pull just the first object here to show you: again, the CD4 cells. And you can see that, basically, every time you do an operation, tidybulk adds information to your data frame. For example, we have identified abundant transcripts, so we have an abundant column. Here the statistic is NA because abundant is FALSE: we only test the abundant genes. As you can see here, we don't need to filter our data. Once we identify abundance, only those genes will be tested and the scaling will be done on the abundant ones, but you still have all the information you need for all genes. Sometimes this is very helpful, because you might want to scale based on housekeeping genes but keep all genes for visualization or whatnot. So we break the filter-then-analyze loop here. This is not needed at all, but let's say we want to visualize just the data for which we have results. Normally you would do this inside the map functionality, but I will just take the first element of the list and filter based on abundant, for example. You can see now that we have our statistics: the differential transcript abundance test has added log fold change, FDR and so on. And now you can go on, filter, visualize. Again, I haven't created one variable here, nothing more than that. Now, this is the last block.
Actually, we are almost done with time. Another thing you might want to do, of course, after you have done your test, is to visualize: pick out whatever genes are significant, build box plots, and so on. Here too the nesting is very powerful: we can do this in a very elegant and self-contained way, without creating any external variables, loops and so on. We have three steps if we want to create plots. First of all, we want to filter significant associations. Again, this is a toy example; I'm filtering with a very high threshold, but you can imagine setting your own. So again, we are using map and we are updating this column; we might create another column if we like, and a column with the filtered data will be built. In this case, we are running through cell types and filtering, and some cell types do not have any significant transcripts. Usually this doesn't happen, but this is a toy example, and ggplot2 doesn't like empty datasets, just because we are faceting; otherwise you could even plot an empty dataset, and everything would be very modular and elegant. Anyway, in this case we filter. Here again it's a bit complicated. I show you map_int, producing an integer from one of your columns. Here we are filtering based on this result: now that we have filtered, we want to know which cell types have any transcripts remaining, that is, a count of at least one. So we get, I can show you this, oops, our filtering, and we can see that not all cell types survived. And with the cell types that have significant transcripts, you can see that now we have just five rows with our SummarizedExperiments. And now, again, we can produce a new column with our plots, one for every cell type. Here we are using map2, which doesn't take one column as input but takes two columns and produces one output. In this case, we are creating a column called plot, which is the output.
And we are putting two columns together: our data and the cell type. The cell type is here because I just want to add a title to the plot; that's why I'm keeping it. So, I didn't tell you, .x is the first input for map; this is .x and this is .y in this function. On .x we use ggplot directly. Again, .x is a SummarizedExperiment, but you can forget that it's a SummarizedExperiment; you just crunch through it. We want to build a box plot of the significant genes: we have treatment on the x axis and the scaled counts on the y. We add plus one because we want to log-scale the axis afterwards. Box plot, and so on and so forth. Well, execute this, and we have the title with the cell type. As I showed you, again, I show you the results: very neat, everything we need is included in the data frame. We have our filtered data and our plot. And now, thanks to patchwork; well, first of all, as before, I want to show you one of these plots to see what's inside. I just slice one and pull the column. Here you go, that's our plot; we see some association. Again, you can imagine how many things you can do with ggplot2 and all the ecosystem there. But if we use patchwork, we can neatly wrap all these plots into one big summary plot. Here, well, we have our own style, but basically we pull all the plots, so we have a list of plots, and we pipe into the wrap_plots function. And this is what we get out: five plots for our five cell types, with our significant genes and so on. So this was a toy example to show you that you can do quite complicated things with minimal variable creation. When you manipulate data, you can try different paths that you might not like, and actually those are the majority, and you don't need to clog your environment: you just create piped paths, you keep what you like, and the rest you forget about. That's it. I hope you liked it. Thanks for intervening and contributing; I appreciate that.
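The plotting step can be sketched as below; `signif_data` stands in for the column of filtered SummarizedExperiments, and `counts_scaled` is the assay column tidybulk exposes after scaling (both names are assumptions and may differ in the workshop object).

```r
library(ggplot2)
library(patchwork)
library(purrr)
library(dplyr)

plotted <- nested |>
  mutate(plot = map2(
    signif_data, cell_type,                            # .x = data, .y = cell-type label
    ~ ggplot(.x, aes(treatment, counts_scaled + 1)) +  # +1 so the log axis is safe
      geom_boxplot() +
      scale_y_log10() +
      ggtitle(.y)                                      # title each panel by cell type
  ))

# Combine all per-cell-type panels into one summary figure
plotted |>
  pull(plot) |>
  wrap_plots()
```

`map2` is what lets each panel see both its own data and its own label, so no external loop or intermediate variable is needed.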
Any questions? Right, cool. Yeah, so I have a question. It's really cool, but so far it looks like it only works on colData and gene expression. So how about rowData? Does it work with rowData, or do you plan to make it work with rowData? Yes. Well, at the moment rowData is a bit complicated, because there is no unique format. But if it is not grouped rowData, let's say one transcript has one interval, and sometimes that's the case, this information is in your data frame: just as with reduced dimensions, we attach the rowData there. So again, you can use filtering and other things. Now, I haven't worked much on rowData. If, for example, the rowData is grouped, sometimes one feature has multiple locations, these will just be nested, so you will see a table. So the brief answer is: I would like to expand the rowData support, but it's not obvious to me how to do it now. I welcome contributors. We are a small team, but hopefully, if people like it, it will be possible to work together, so it will happen more in the future. At the moment it's not so powerful for manipulating rowData. But of course, this is still a SummarizedExperiment, so you can apply all the tools you were applying before, you know, GRanges, plyranges or not. Any other questions? Did you like it? Good. Cool, all right. So feel free to contact me; let's keep in touch. If anybody is passionate about this, we can do a lot of work. We also plan to, you know, do a publication on the whole tidy Bioconductor, so if any PhD student wants to come on board, there is also, you know, authorship, I mean, a publication opportunity, of course, to have something concrete back. And the last thing: look for tidytranscriptomics on GitHub; you will see all the packages there, and I'm sure you can find everything you need. All right, thanks a lot.