Okay, hello everybody. This is the workshop on tidy transcriptomics, focused mostly on single cell, but also bulk RNA. I encourage you to participate if you can. Show your face if possible, although I'm not sure whether your camera is allowed to be on; in any case, interacting in the chat is very welcome. It's early in the day, so it's everybody's job to keep me awake. You can ask questions whenever; I will try to read the chat every now and then, although the chat will be mostly hidden, but I will try to have a look. So, I will start. First of all, let me ask the first question: who of you is keen to actively reproduce the code as I speak? There are also a couple of exercises, as we have some time to spend today, so there will be the opportunity to do a couple of very easy ones. Good. As we are not so many, could each of you say yes or no, whether you will follow along with the hands-on coding? I assume from this small sample that roughly 50% will follow along. For the people who want to follow along, it's very easy: I will send you a couple of links. This workshop is online and has a rendered website along with it, so I will paste the GitHub repository in the chat. The workshop is an R package, so you can install it and all dependencies will be dealt with for you. On the home page of the GitHub repository there are the package installation instructions, so it's pretty easy. The exact versions of the software are listed there, the ones we actually have now. Well, let me suggest... one second, I think I'm sharing. Here we go, I wasn't even sharing. Here you go. Yes. So, as you can see here, I suggest you install these four software versions; it should not take long. And this is the installation of the workshop itself. I will do a brief introduction so you have time for this code to be executed on your computer. Okay.
You can browse the vignettes, and I will tell you how to open them yourself so you can execute the R Markdown. So, let me give you an introduction to tidy transcriptomics. If you go to the GitHub page, under Syllabus there is a material webpage that will bring you to this website. If you just want to have a look, or to go back to the material at a later stage, you can access this webpage; all the code will be there for you, and the presentation as well. All right. Let me ask another question: how many of you have ever done single-cell RNA sequencing analysis? You can type yes or no. Okay, the majority have not done single cell. All right. Let me ask another question: how many of you have done RNA sequencing analysis at all? Okay. So the majority, if not all, have some familiarity with RNA sequencing, but not single cell. And let me ask the last question for the time being: how many of you are, let's say, familiar with, or routinely program with, the tidyverse? You can respond either "familiar" or "routinely". Okay, good. This is probably the most important thing if you want to try the exercises, as is explained on the GitHub page. So let me state the goal of this workshop. The goal is to show you how you can perform RNA sequencing analysis, both bulk and single cell, through the tidyverse ecosystem. It is not so much to teach you how to do single-cell analysis, although you will see some examples here, or the perfect way to use the tidyverse. This workshop does not teach you those foundations, although you will see something of them; I'm not claiming to show you the best way to do the analysis, necessarily. That is not the goal. The goal is to show you the interface between the two, between the biology and the methodology, and how you can use some tools we have developed to achieve that.
All right. So, what is the concept of tidy transcriptomics? These slides are available on the webpage I just pasted in the chat. We have a couple of resources for tidy transcriptomics. First of all, we have a lot of workshops in the same GitHub account as this workshop. We also have a blog, on which we have the tidy transcriptomics manifesto, which explains the principles we are following and our goals. We also have a couple of papers out that go into more detail: tidybulk, which is a tidy analysis framework for bulk RNA sequencing, and tidyseurat, which also goes into more technical detail about what I will show you today. Okay, so what are the principles of the tidyverse, or tidy R? Tidy programming is, let's say, a philosophy, a way to program that involves some principles, and also a software ecosystem, as you might know, called the tidyverse, that allows you to follow these principles. The principles are: reuse existing data structures (one of the main data structures the tidyverse uses is the data frame); compose simple functions with the pipe; embrace functional programming; and design for humans. These are also the principles we followed when we developed the tidy transcriptomics ecosystem. To give you an idea, although most of you surely know this: this is a typical data frame, rendered here as a tibble, which is a more powerful and, if you will, prettier version of the data frame, but behaves in much the same way. This data frame has observations as rows and variables as columns. These variables can have different natures: a variable could be a simple character column; it could be a list, for example containing a plot; it could be another table, for example if you want to do iterative analysis you could group
the main data, an operation called nesting, so that each row holds a table and you can apply an operation across all the rows; it could be a linear model; and it can also be a single-cell data container. So it can basically include anything you can imagine. The data frame is becoming more powerful now; it is almost like a database where we have relations between a lot of variables. So, what is the big picture of tidy transcriptomics? Tidy transcriptomics is a software suite, a little bit like the tidyverse, which includes many packages. At the moment it includes three packages, and we are now expanding to, I believe, a total of five, including spatial data, multi-assay data, and so on. At the moment the ecosystem includes tidybulk, an analysis infrastructure for bulk RNA sequencing that follows these principles. Also for bulk, we implemented tidySummarizedExperiment, which allows you to interact with SummarizedExperiment objects as if they were data frames, applying all the manipulation the tidyverse offers; and of course these SummarizedExperiments can now interact with the tidybulk analysis framework. For single cell, we have developed tidySingleCellExperiment and tidyseurat, which allow you to display and manipulate SingleCellExperiment and Seurat objects in the same way, with the same grammar, so there are no container-specific grammars; and, very similarly to tidySummarizedExperiment, they allow SingleCellExperiment and Seurat objects to interact with the tidyverse for manipulation and visualization. Of course, as I will show you later on, these containers are not touched, so they can interact with whatever they were interacting with before, namely Bioconductor and Seurat. I will mostly show you how to perform single-cell analysis using tidy transcriptomics, but I will also touch on tidybulk through a pseudobulk analysis, which I will explain later as well.
So, just an introduction on how these containers are originally displayed, how we can manipulate them, and what changes when we import our libraries. For those not familiar with single cell: single-cell data, like bulk data, is quite complex; it has abundance information, metadata for genes and samples, and so on. When we display this object on the screen we see a summary: we see that this is a SingleCellExperiment with so many cells and so many features. It has assays, the variables containing the transcript abundance, in this case raw counts and logcounts, and so on and so forth; you can get an idea here. Similarly for Seurat, we get a slightly lighter summary: features, assays, and so on. So how can we analyze these containers? For the SingleCellExperiment, which is part of Bioconductor, we have all the software that the Bioconductor community contributes to that repository. This software is mostly designed to interact with the Bioconductor containers, so we can use different packages together, ideally in a seamless way, such as scran and scater, which are common and popular packages. For Seurat, we have the Seurat software itself, plus some Seurat wrappers that are community-contributed packages. How about data manipulation? What if you want to modify this container, filter it, merge containers, and so forth? Each of these containers has grammar-specific manipulation. For example, if we want to see the metadata, we call colData() on one and the double-bracket operator on the other. If we want to extract, so to print on screen and use, some reduced dimensions such as PCA or UMAP, we use reducedDims() or Embeddings() for the two containers. We can subset the data set, so filter based on a metadata column, in a similar way, but the SingleCellExperiment requires the extra comma in the bracket subsetting. We can use the metadata information pretty easily.
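As a rough side-by-side sketch of those container-specific grammars (not run here, since it needs real `sce`/`seu` objects; the `Phase` column and the `"umap"` reduction name are just examples):

```r
## SingleCellExperiment (Bioconductor)       ## Seurat

# Cell metadata:
# colData(sce)                               # seu[[]]

# Reduced dimensions (e.g. PCA, UMAP):
# reducedDims(sce)                           # Embeddings(seu, "umap")

# Subset cells on a metadata column
# (note the extra comma on the left):
# sce[, sce$Phase == "G1"]                   # subset(seu, subset = Phase == "G1")
```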
If we want to do a simple operation, such as adding a column to the metadata, we can do it in a similar way in both. But if we want to do manipulation that starts to be more complex, as in real-world analyses, then the grammar is not consistent anymore. Let's say we have our container and a table of clinical information for some cells; say, a cell comes from a patient with a certain disease. And we want to select the cells for which clinical information is available. In the SingleCellExperiment case, we take the colData, we cbind this table, and we need to match, for example, the sample column in our table and in our data, to make sure the merge is correct; then we can apply the subset function to filter the cells for which the sample ID is present, for example. On the other hand, in Seurat the metadata is a data frame, so we can use the tidyverse on it; however, not seamlessly, as we need a couple of operations to achieve this. For example, we can update the metadata by doing a left join on our clinical data, and in this case we can simply match by sample automatically; but then we still have to subset in a separate operation, for example. So these two, again, are not completely similar. What happens when we import the library tidySingleCellExperiment or tidyseurat? When we import these libraries and print these objects on screen, they appear with a similar structure: they appear as a data frame, where the cell metadata and reduced dimensions are displayed for us. You can see here we have a header with some summary information, such as "this is a SingleCellExperiment", "this is a Seurat object", these features, these assays, and this number of cells. And now we don't only visualize the data as a data frame: we can interact with, modify, filter, merge, and join this data set as if it were a data frame, and the underlying objects will be updated accordingly.
So basically it is an abstraction layer over those objects that allows them to be manipulated with the tidyverse. Now, how does it look for the analysis? For the analysis nothing changes, because these packages are not single-cell analysis packages; they are just for manipulation. So you keep analyzing your data as you did before. And how about the manipulation? In this case not only the display is consistent but also the grammar: as you can see, you can apply the exact same code to the two objects and you will obtain the same effect. For example, to visualize metadata, we can simply print the object on screen, as the metadata is exposed to us to start with. To select some reduced dimensions we can just use the select verb with the powerful tidy evaluation; here, we select all columns containing "UMAP". How about filtering? What before was called subset is now just filter, based on arbitrary combinations of columns. How can we mutate, so add, for example, a column to our data set? We can just use mutate in this way. And in the example I mentioned before, where we have a clinical data frame and our object and we want to merge the two and select the cells for which clinical information is available: we have these powerful join operators, such as inner_join, that do the two steps jointly. We join based on sample, and we implicitly filter the cells that have a match. So, that's a bit of an introduction to what these packages do. They introduce a wide range of functions and operators from dplyr, tidyr, ggplot2, and plotly for interactive plots, so you have quite a wide range of options, some of which will be shown in this workshop. And to conclude this introductory presentation, I want to clarify what these packages are and what they are not. The packages I showed you here, tidyseurat and tidySingleCellExperiment, are not data containers.
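To make those verbs concrete, here is a toy sketch on a plain tibble standing in for the per-cell view (column names and values are made up); the same calls run unchanged on a tidyseurat or tidySingleCellExperiment object:

```r
library(dplyr)

# Toy stand-in for the per-cell tibble view
cells <- tibble(
  .cell  = c("c1", "c2", "c3"),
  Phase  = c("G1", "S", "G1"),
  sample = c("s1", "s2", "s3"),
  UMAP_1 = c(0.1, -0.4, 0.7),
  UMAP_2 = c(1.2, 0.3, -0.9)
)

# Hypothetical clinical table covering only some samples
clinical <- tibble(sample = c("s1", "s3"), disease = c("TNBC", "ER+"))

cells |> select(.cell, contains("UMAP"))   # tidyselect on reduced dimensions

cells |>
  filter(Phase == "G1") |>                 # replaces container-specific subset()
  mutate(phase_lower = tolower(Phase)) |>  # add a column
  inner_join(clinical, by = "sample")      # join + implicit filtering in one step
```

The final pipe keeps only the G1 cells whose sample has a clinical record, in a single chain with no temporary variables.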
The data containers are the underlying Seurat and SingleCellExperiment objects, and these tidy analysis tools are just adapters that allow you to use the tidyverse on those objects. They are just data interfaces, which allow manipulation, integration, and visualization. So sometimes we get the question: can we go from tidyseurat to Seurat and vice versa? The question is not really relevant here, because we never leave Seurat or SingleCellExperiment. There is no such thing as a tidyseurat object: the Seurat object will never know tidyseurat exists. tidyseurat sits on top and interfaces the Seurat object with you and, in this case, with the tidyverse. So, this was the initial introduction, and I will ask: if you have any questions, please put them in the chat. I imagine that if you have never seen these adapters, they use quite a different paradigm from what we are used to, and maybe not all the messages got to you very clearly. If that's the case, please let me know so we can have a brief discussion about the aspects that might not be clear. "Yes, I appreciate it; it's well explained so far." Yeah. I would say that giving a presentation this early is not that easy, but I'm doing my best. All right. Okay, so can the people who want to follow along confirm that the installation was successful? Meanwhile, I will show you briefly: the website is the render of the actual vignettes. I guess you will mainly look at my RStudio session, but just to mention it: you have four parts, and you will be able to navigate them. Part one is the introduction to tidyseurat, in the sense of the coding. Part two is an example of how we can visualize and manipulate a gene signature in single-cell data. Part three is nested analysis; let's see how we go, as the complexity rises as we proceed. And part four is the pseudobulk analysis; I will explain what pseudobulk analyses are if you have never heard of them.
So, even though... well, let me first switch to my session. Okay. Is it big enough? I mean, I guess it can be too big, but is it big enough for you to be comfortable looking at the screen? I'm referring to the fonts here. "Looks good." Okay. So, even though some of you have not had experience with single cell, I assume the concept of single cell is pretty much known here; let me know if it's not. The really basic thing to understand is that we are probing cells, and in this case we are probing mRNA molecules: for each cell we have a list of mRNA molecules with their respective counts. That's pretty easy to grasp. This initial part will show you how you can read a single-cell container; in this workshop we are using a Seurat container, but virtually, if you replace Seurat with, or convert it to, a SingleCellExperiment and execute the tidy code, it will work the same. And I will show you again how to switch seamlessly between the tidyverse and the actual Seurat analysis commands, piping them together to obtain a very elegant pipeline here. Seurat, as I explained, is a single-cell analysis framework; it is developed by the Satija lab, is very popular, is quite well organized and user friendly, and includes a lot of functions that can also be piped together, in a way, in tidy R style. So, first of all, let's load some of the libraries we will use: we are loading some tidyverse libraries, as well as Seurat and some libraries for coloring. Okay, I will execute chunk by chunk. Of course, there is explanatory text here that is more detailed than what I will tell you; it is there so that if you go back, you will be able to fully understand and remember whatever I show you today at a later time.
I will just focus on the code chunks and comment on them. So, let's load a Seurat object that is included in this workshop's package. Well, let me restart my session, because I must have loaded tidyseurat already: this is showing already in the tidy version. Let's see. Okay. Normally, if you load Seurat, as we did before, and you load a Seurat object, as I showed you in the introduction, and you print the Seurat object on screen, you will see a summary of the Seurat object. Not much information is exposed to you: you can see which assays are in there, which reduced dimensions, and so on, but not much more than that. Now, when we load tidyseurat, and this is the difference I showed you before, you will see the same object displayed differently: in this case it is visualized as a data frame. What does this data frame include? It includes the cell-related information, which is the metadata, the reduced dimensions, and so on. Here we have a very small data set: we have just six features, so six genes, with 33,000 cells. One active assay, which is SCT, the typical scaling and normalization from Seurat's sctransform, so it is normalized; and a total of two assays, RNA and SCT, which is also true for the printout above. So nothing has changed here, but we can see that now a lot of information is exposed to us: we have all this metadata, sample, barcode, cell-cycle scores, the cell types we had inferred before, and so on and so forth. There is a description of how this object was created, I think later on; in any case, this object has already been analyzed, so there is information in here.
Now, if you want to use the tidyverse but you really like the original display of the data, you can toggle this option, restore_Seurat_show, and go back to the original; and because the tidyseurat library is loaded, you can still use the tidyverse nonetheless. But we like the tidy representation better, so we toggle this back and see the tidy representation. As I was mentioning, the object is untouched, so everything that worked before on your Seurat object works now. Say we want to use the Assays function from Seurat, which simply lists the assays that this object contains: we obtain what we expect. And this is true for any analysis function we apply. Now, again, let me know if you have any doubts; I prefer any question rather than you having to keep guessing what I'm talking about, as most of you are not very familiar with single-cell analysis, so don't be shy. In the next couple of minutes, I will show you some simple tidyverse functions applied to the Seurat object, to give you an idea of what we can do. If you have used the tidyverse before, they work pretty much as they would on a data frame. So, we have this Seurat object; let's say we want to filter some cells, for example the cells in cell-cycle phase G1. We can pipe our input object into the filter function, and rather than having 30,000 cells we now have 18,000. The underlying object has been updated, in this case filtered, and returned to us. Of course I could have saved this Seurat object in a variable, but the tidyverse is very good with piping, so we can explore the object without saving temporary variables if we don't need to. Let's say I want to select some of the metadata columns, for example to make the display lighter.
"No problem installing the packages, but there is a question: I could not install your workshop material because of GitHub API rate limits." Really? Okay. All right, let's sort out this question. I'm not sure; I wasn't aware. It could also be an artifact of multiple people installing from the same institution, potentially. "So, the R Markdown... I just cloned your GitHub repo, so I have your repo on my machine. And when I try to install it in R, I get an error message that I just popped into the chat. This could be an artifact of being on my VPN, so I'm going to try to get off the VPN." An API rate limit... wasn't there something like that? But let me come to the point: so, you are cloning it and installing it locally, or not? "Correct." So what about if you try to install it this way, with remotes::install_github? "That is exactly what I tried: I did remotes::install_github, and the error message I got was from that. Now, I can clone it using, like, GitHub Desktop or whatever, no problem, so I have the materials; but line 77, where it's actually referring to the R Medicine 2023 tidytranscriptomics package, that's the part that I can't do." Yeah. So, can you install it with... Because you have to create a project: one way would be to create a project from the directory, and then build it with RStudio; you will have a Build tab and you can do Install, and the package will be installed for you. Actually, I believe you don't even need the project if you do devtools::install() while you are inside the directory.
"I know, probably you need a project; but nonetheless, don't worry, I'll try to follow along and figure it out, and see if it's an artifact. I just wanted to raise that flag so that you knew what was going on." Yeah, yeah. I mean, yes, feel free to follow along; but as I said, with these two steps, creating a project in the directory and clicking Install, it should work. I was not aware of any limits. I think these new GitHub setups for, how are they called, non-personal accounts are getting stricter and stricter, but thanks for the heads-up. Okay, good to know. The package can be installed locally for sure, if you clone or download it. All right. So, you will have time during the exercises. Having said that, let me tell you: even if you don't install the package, you can load the Seurat object if you go into the data directory. So, in this case, say I have cloned or downloaded the repository: you have the Seurat object here, so you can load it without executing line 77, and you will be able to do the exercises and so on. "I'm sorry, I was looking away when you were explaining that; can you just say that one more time?" Say you clone the repository; inside the repository you have this data directory. "Yep." This is where the data is; you can actually load it manually. "Perfect." So you can load the Seurat object manually, and then you don't need line 77; it will be there in the session for you. "Excellent, thank you." Okay. So, as I was saying, if we want to select some columns, just for visualization, or for making the object lighter, or for any other reason, we can just use the select verb from the tidyverse.
You can see here that we are selecting the nCount and phase columns, and that's what we get. We also get the reduced dimensions, as these are view-only columns: we cannot touch them, for obvious reasons, since they were calculated, so they are always there. Nonetheless, the metadata has been filtered down to just these two columns, and a functional Seurat object is returned to us. Just to show that whatever we apply on the abstraction is then applied on the object itself: say we select these two columns and save our updated Seurat object, and then explore the metadata. So now we are going inside the object to see what the metadata looks like, and all the columns of the metadata have been stripped away except the two columns we requested. Okay, so, just to say that the object is always updated in the background. Another thing we can do is manipulate the metadata columns. In real-world analysis we often need to change things around for all sorts of reasons, combine columns, and so on. For example, if we want to modify the phase of the cell cycle to lower case: we can use mutate, applying the function tolower to phase, and then, for visualization purposes for now, I select these two columns. You can see here that the capital-letter phase has been transformed to lower case. Okay. Now for a bit more realistic data manipulation. Sometimes in our object we record the path of the file we got the data from. Sometimes this path includes information; let's say the file name includes the sample ID, or the sample treatment; if you download data from GEO, this is often the case. So let's say in this case I built this object from these files; that's where I read the files from. And as you can see here, we have some important information in there, such as the sample ID.
So, in this case it's quite easy to extract the sample ID from the path and isolate it into its own column. We can use the quite powerful tidyverse verb extract, from tidyr: we extract from the file column, creating a sample column with an arbitrary regular expression; in this case we are just selecting this alphanumeric string here. And, as you can see, selecting just the sample column, we have what we expect. "So I did the loading manually, and it's exactly the same; the Build menu install didn't work. Just for your knowledge: I'm all caught up." Okay, yeah, it's fine, perfectly fine. I'm not sure why, but now it doesn't matter: if you load the object from the data directory, it will be fine for sure. It's good that you caught up, so maybe you will be able to try a couple of the exercises we have afterwards; and later on I will ask you a bit more, so I can understand what was going on. Okay, so you can imagine how powerful this manipulation is, as the tidy verbs again allow us to do very complex operations in an elegant way. Another thing we might want to do is unite two different columns. Let's say we want to create a unique ID: we have a patient ID and we have a treatment ID, and we want to create a unique sample ID. In this case we have sample and BCB, which is a patient code. "RStudio with root permission?" Okay. So, again, we have this quite powerful operator called unite, where we state our new column and the columns we want to unite; then I select these three columns, just for display, and you can see we have very simply created this new sample identifier. For this quite complex operation I'm not creating any temporary variable, and this is very convenient; it also avoids making our code bug-prone, with a lot of variables that we have to update and re-update and so on.
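A self-contained sketch of extract and unite on a toy tibble (the paths, the regex, and the BCB patient codes are illustrative, not the workshop's exact ones):

```r
library(dplyr)
library(tidyr)

cells <- tibble(
  .cell = c("c1", "c2"),
  file  = c("data/sample_A12/matrix.mtx", "data/sample_B07/matrix.mtx"),
  BCB   = c("BCB102", "BCB112")
)

cells |>
  # pull the alphanumeric sample ID out of the file path
  extract(file, into = "sample", regex = "sample_([A-Za-z0-9]+)", remove = FALSE) |>
  # combine patient code and sample into one unique identifier
  unite("sample_id", BCB, sample, remove = FALSE) |>
  select(.cell, sample_id)
```

unite pastes the columns with "_" by default; remove = FALSE keeps the originals alongside the new column.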
So far I have created zero variables for all these operations, which I could have piped together. Also, let me know as I go along if you need more time to catch up; just tell me and I will slow down a little. I prefer you to be on top of things rather than trying to run after them. Okay. So, this was the initial introduction to the verbs we can use; part one is done. Now, please have a look: I will give you one minute to scroll up and look over the code. Again, you can go to the website and view the code yourself. Let me know in the chat if you have any questions; actually, I'm sure you have some, so ask them in the chat and we can discuss. Good, a question: does the Seurat object you are using have multiple assays? I mentioned it already: yes, in this case it has multiple assays. In the header here, you can see that we have two assays. One is RNA, which in Seurat usually means raw counts, so integers. The other is SCT, which is the normalized one; and in Seurat you have an active assay, the one you are applying operations on. It is also displayed here: the active assay is SCT. So yes, this object includes two assays. And I will show you that, even though this main display does not include gene information (because in single cell the really central element is the cells), we can use join_features to add the transcript abundance, for any assay, into our data frame; it will be in the next example. All right, part number two. This part starts to get a bit more complicated if you're not familiar with single-cell analysis, but again, ask questions; I prefer that we stop and explain parts multiple times rather than you not really understanding what's going on while I speak for half an hour. Now, some details on the object.
The object we are using next was created in this way; you can read about it, but it is an object with some information already associated with it: we have preprocessed it and so on. The next example is quite realistic; it is actually a real example of a study that I performed. We did this analysis on breast cancer PBMCs, so circulating immune cells. After we had done our analysis and processing, we were asked to analyze an unconventional T cell type called gamma delta T cells, which is of course well known, but not so much in the context of our research, breast cancer metastasis and so on. So we had to go back to the object where we had already done our clustering, to better identify the gamma delta T cells, isolate them, and reanalyze them. Okay, that's the broad context I'm talking about here. So I will show you this object; yes, I know, it is the same one. The object I showed you before is actually the one used in part two; that's why, for completeness, it already has the curated cell types and so on. The first thing we did was go to the literature and see whether there was a transcriptomic signature for gamma delta T cells that we could use to identify the cells. We found a publication from Pizzolato et al. where they did exactly that: they proposed a very small transcriptomic signature, and also how to calculate it, an arithmetic formula to derive a unique score from these six genes. There are T cell genes; there is TRDC, a marker of gamma delta T cells; and there are also negative markers that we will use. As I mentioned, we can use join_features to add features to our data set: we take them from the underlying assays, the matrices, and put them into our data frame. Once we have them, we can do whatever we wish with them: calculations, filtering, visualization.
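Conceptually, join_features takes rows of the assay matrix, transposes them, and joins them onto the per-cell tibble. A toy simulation with plain R objects (gene names and values made up), just to show the shape of the result:

```r
library(dplyr)
library(tibble)

# Tiny fake assay slice: genes x cells, as stored in the container
assay_slice <- matrix(
  c(2.0, 0.1,    # CD3D for cells c1, c2
    1.8, 0.0),   # TRDC for cells c1, c2
  nrow = 2, byrow = TRUE,
  dimnames = list(c("CD3D", "TRDC"), c("c1", "c2"))
)

cells <- tibble(.cell = c("c1", "c2"), Phase = c("G1", "S"))

# Transpose to cells x genes and join: one new column per requested gene
abundance <- as_tibble(t(assay_slice), rownames = ".cell")
cells |> left_join(abundance, by = ".cell")
```

After the join, gene abundance and cell metadata sit side by side as ordinary columns, which is exactly what makes the downstream filtering and arithmetic so direct.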
So in this case, metadata information and gene abundance information are just variables; there is not much difference between them. This is very powerful because we can combine them in tidy queries. So let's call join_features, and I will show you the new columns. Looking at this object, we have six new columns, called CD3D, TRDC, and so on, taken in this case from the SCT assay. Now, say I want to select some genes, just to show you. I pipe this into select; say I want to select the cell and CD3G... sorry, CD3D. You can see here the SCT value of CD3D. And I could filter cells on CD3D against a threshold, so I could filter for values bigger than 1.5, and we would get those cells, and so on and so forth. These columns can be used for anything, as I will also show you later. OK, so we have our genes; now we want to calculate our signature score. We have an indication of how the score should be calculated: we have some positive markers and some negative markers. We can pipe this object, with the genes, into a mutate call and simply create a new column, signature score, with the arithmetic: we add the transcription of the positive markers and scale the sum to zero-one, and from this positive sub-signature we subtract a negative sub-signature, genes that should be absent, the CD8 markers, which we also scale. So again, I am doing a relatively complex operation here, arithmetic on genes, without creating a single temporary variable. Our environment is clean. We can try many different things on the spot and just save whatever we are happy with. And these operations are all self-contained, which is very important.
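A minimal sketch of the two steps just described, join_features followed by a mutate for the score, assuming the tidyseurat interface roughly as used in the workshop; the exact gene set, argument names, and `scales::rescale` for the zero-one scaling are illustrative:

```r
library(tidyseurat)   # lets a Seurat object be piped through dplyr verbs
library(scales)       # rescale() maps a numeric vector onto [0, 1]

seurat_obj |>
  # pull per-cell abundances out of the SCT assay as wide gene columns
  join_features(
    features = c("CD3D", "TRDC", "TRGC1", "TRGC2", "CD8A", "CD8B"),
    shape = "wide", assay = "SCT"
  ) |>
  mutate(
    # positive (gamma delta) minus negative (CD8) sub-signatures,
    # each rescaled to 0-1 across all cells
    signature_score =
      rescale(CD3D + TRDC + TRGC1 + TRGC2) -
      rescale(CD8A + CD8B)
  ) |>
  filter(CD3D > 1.5)   # thresholding on a gene column works the same way
```

The point of the sketch is that after join_features the genes behave exactly like metadata columns, so mutate and filter need no special handling.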
Now I select the signature score first and everything else, and you can see that we have simply added a column, of class double, with our signature score. As before, we can now do whatever we want with the signature score in combination with the other columns, as I will show you. Is everything clear so far? A question from the audience: "I'll come off mute and ask. This is probably my naivete of not being a biologist, but when you're rescaling, you're rescaling relative to your data, right? Not to any reference?" No, no, the rescaling is simple: after you do the sum, you just scale across all cells between zero and one. "Right, but it's just within your sample; you're not getting a risk score out of this, you're just distributing it between zero and one." For sure. "OK, got it." Yeah, it is very simple. Literally, it is as if this were a data frame; we can forget about assays and so on and just treat it as a data frame. If I have a data frame with CD3D, TRDC, and so on, I am summing those columns; here I am creating an unnamed column on the fly, and then I am rescaling it, so all the rows will be rescaled between zero and one. There is nothing beyond that. It is especially simple because we are treating this as a data frame; otherwise we would have to fish out the assays, do the calculation, as I will show you afterwards, and create a couple of variables along the way. OK, so what we have done so far is simply calculate the signature. Now I will show you how easily we can combine Seurat analysis tools with the tidyverse. We have our signature, and now we want to visualize the data. There is a function called FeaturePlot, which simply visualizes our signature score across cells.
In this case using the UMAP dimensions. Again, FeaturePlot is a function from Seurat, while these other verbs are tidyverse functions that we introduced with tidyseurat. Each dot here is a cell; anyone who works with single cell is very familiar with this. It is a bit like principal components, our reduced dimensions. Each dot is a cell, and we color it by the signature, which goes from zero to one. You can see a compact cloud here that might be our gamma delta T cells; we may have found the cells to analyze after all. There are many ad hoc visualizations, as I showed you, but tidyseurat also makes it very easy to build arbitrary plots with ggplot2 with little effort. Sometimes we want custom visualizations for papers, we want to use all sorts of styles, and ggplot2 offers a huge ecosystem of packages that interact with it. So if we want this freedom, how hard is it to use ggplot2 instead? With tidyseurat, not hard at all, because we can pipe our Seurat object into ggplot as if it were a data frame. Here we arrange the cells so that those with the biggest score are plotted on top, and then we simply use ggplot, doing things explicitly: we plot the UMAP 1 and UMAP 2 dimensions, color by the signature, and add our aesthetics and so on. You can see we get a similar visualization, and again this cluster might well be the cells we are looking for. The fact that we can pipe our single-cell data container into ggplot is quite convenient. Look how many things I am doing without creating a single variable. If we do not like what we see, we can change things around, change the signature calculation for example, and try things on the fly. Another question from the audience: "I'd love to ask a question, and this is outside your topic, so feel free to say you don't want to answer. I teach R to researchers."
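The piped ggplot just described might look roughly like this; column names such as `UMAP_1` and `signature_score` are assumptions matching the earlier steps, and the viridis color scale is an illustrative choice:

```r
library(tidyseurat)
library(ggplot2)

seurat_obj_signature |>
  arrange(signature_score) |>   # high-scoring cells drawn last, so they sit on top
  ggplot(aes(UMAP_1, UMAP_2, colour = signature_score)) +
  geom_point(size = 0.2) +
  scale_colour_viridis_c() +
  theme_bw()
```

Because the Seurat object is treated as a data frame, any ggplot2 extension package can be layered on in the same way.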
"Some of whom have never written code, and these are not molecular biologists. I deal a lot with people who really enjoy and find value in creating interim objects that they then have in their environment to manipulate and work with, and I have concerns about reproducibility. You've emphasized several times: look, I'm not creating a lot of extra objects in my environment, this is all on the fly, this is all piped. Why do you prefer that?" Well, let me understand your argument: you are saying that when you teach, you teach people to create temporary objects? "I don't, and I agree with you, but I want to understand, because a lot of people who are new to coding like to say, oh, I made this thing, now I can go and look at it, and there it is. And I have a hard time explaining to researchers why this is a bad idea, or not a bad idea, but why it could come back to them later." Well, one thing to stress is that R is quite powerful because it is a functional language, and so far many of us have never used it as one; we use for loops and while loops as we would in Python or C++. In a way, the habit of constantly saving variables comes from those other languages. And the majority of the variables we save are saved not because we want to look at them, but because we need them to execute the code; we need to update them, in order, for the code to function. The thing I stress now is that you can save variables if you want to look at them, but if you do not want to look at them at a second stage, you do not need to. The other thing I would say is that functional code does not hinge on variables as much. However, whenever we do want to save something, as I am doing here, for example, I am saving this object that includes just the gamma delta cells, because I want to keep it for reproducibility.
That's what I'm doing. But say I work two hours on this, which is actually what happens with me: I wanted to understand whether I should rescale the variable or not, and in which way, and to decide that I have to visualize. So instead of saving hundreds of variables, I can quickly iterate until I am happy and save only the objects that I want to compare or send to colleagues and so on. That is what I would stress: we no longer need variables unless we want them. Of course, for reproducibility, yes, you save variables; I do it all the time. But they are no longer structural to our R programming as such. "Thank you, that's helpful. And I feel like saving objects off can actually be less reproducible, because a lot of times people say, I don't know how I made this, but I'm using it. And I say, well, if you don't know how you made it, you can't use it." Exactly. And one particularly bad habit we are used to is updating the same object multiple times: we write seurat_object <- seurat_object, blah blah blah. That is especially bad for reproducibility, because we are updating our object over a three-hour working session, and then we might go back and not know which commands we executed, and the object ends up being something other than what we think it is. This happens to me all the time, and I am sure it has happened to you: you save an object thinking it was analyzed in one way, but actually you needed to execute one more command, and you send the object off, and so on. So definitely, fewer variables can increase reproducibility. But, as the commenter says, as a new user it is comfortable to save variables, and I agree it is good to have objects there. This changes once you get familiar.
So if you have some years of experience and you want to be very efficient, then you do not need variables anymore unless you want them. It is no longer a comfort to have variables: my environment can be empty, but my work is more efficient. So I agree; this is probably for when you are very familiar and want to speed through your analysis. OK. Having a lighter environment is good for many, many reasons. But again, you can save variables if you want, at every step. The important thing is that you do not need to anymore; in the base R language you do need to save variables to modify the metadata, re-merge the metadata, and do all those sorts of things. Next step. There is another functionality, not extremely important, but a very nice thing for visualization. Let's say we suspect these might be the cells we want, and we want to circle them. One way would be to recluster the cells and do more complex operations, but if we just want a quick look, we can draw gates around them and filter. So we made a package called tidygate. This works for any dataset, not just single cell; if you have a table with a distribution of housing prices, you could do the same. So you take our scoring here and create a column, gate, using these two UMAP dimensions; the gate column will be created, and then we can filter on it. This is an interactive command, so I need to paste it into the console. Let me zoom out a little. Now we have this plot; we can draw gates around the cells we think are gamma delta, click Finish, and the gate column is created for the cells we selected; then we filter, and I will show you the result. Again, all of this:
if you like variables, you can save them at whatever stage, but again, I could do all this saving only the final object. We have a gate column here, which is an integer: this column is 1 if the cell is in the gate and 0 otherwise, and we can draw multiple gates. We filter to just the gamma delta cells, and as you can see, we now have a total of about 1,000 cells rather than 30,000. OK. So this is basically a whole pipeline I am running here, and it is very readable. Each block has its own function: calculating the signature, drawing gates, filtering. In the real world I would comment each block; I am not doing it here to avoid cluttering the code too much. And of course the tidyverse offers this verbose grammar that reads almost like natural language. Now, for the sake of this workshop, I will show you another way to filter the data, which is suboptimal: filtering based on a threshold. Our signature is higher in this cluster, so I say, OK, for a quick analysis I filter cells based on the signature, and I save my object this way. I do this because the gating is interactive, so it cannot be executed automatically, while this can. Nonetheless, let's say I am happy with a signature score bigger than 0.5, and here is my object; I have fewer cells now. I am saving this as a Seurat object, gamma delta, as this object now contains only gamma delta cells. One thing we might want to do after filtering a cell type is rerun the analysis pipeline, which normally includes scaling, variable gene detection, clustering, dimensionality reduction, and so on. And again, in real-world analysis this is an iterative process: you reanalyze these cells until you are happy with what you are visualizing. It is very important, especially once you are more familiar with the analysis, to be able to do this on the fly.
So again, I pipe my analysis together: I filter my cells, and Seurat is quite good with piping, so I can basically pipe the whole analysis pipeline on this sub-cluster, which includes normalization, identification of variable features, integration (removal of technical artifacts), and clustering. This takes some time, but you get the gist. Let me execute it and see what happens; meanwhile you can ask questions if you have any. Yeah, it took a short time because this is a very small dataset. Again, it is very easy to reanalyze a subset of the data to understand what these cells are, and if you are not happy, change things in the pipeline and re-execute it multiple times. The resulting object is again a Seurat object with the updated UMAP, updated integrated dimensions, and so on. Now, what would the code be if we used base R, forgetting about the tidyverse? What did people have to do before tidyseurat? As you can see here, you have to save objects in variables in order to integrate them and put them back. In this case we calculate the positive part of the signature and rescale it, calculate the negative part and rescale it, do the arithmetic to subtract them, add the result as a column of the Seurat metadata, and filter. So there are four different steps to do this. And, taking up the point we made earlier: OK, say I saved these two variables, and now I can visualize them. While before I was doing everything in one go, which would probably be the final code for clarity, elegance, and reproducibility, if I want to visualize these two columns I can be more explicit in my coding: rather than calculating one signature, I can compute the positive signature and the negative signature separately, then select and visualize them.
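The threshold-and-reanalyze step can be sketched as one pipe, assuming the standard Seurat functions; the workshop's actual pipeline also includes an integration step via SeuratWrappers (RunFastMNN), which is omitted here, and the object and column names are illustrative:

```r
library(Seurat)
library(tidyseurat)

seurat_obj_gamma_delta <-
  seurat_obj_signature |>
  filter(signature_score > 0.5) |>   # keep putative gamma delta cells
  NormalizeData() |>                  # renormalize the subset
  FindVariableFeatures() |>
  ScaleData() |>
  RunPCA() |>
  FindNeighbors(dims = 1:20) |>
  FindClusters() |>
  RunUMAP(dims = 1:20)                # recompute reduced dimensions
```

Each Seurat function returns the updated object, which is why the whole reanalysis chains without intermediate variables.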
Again, do I have to create variables to visualize things in this case? These are my positive and negative signatures, and I can then explicitly create the final signature by simply subtracting one from the other. So again, we can find comfort in saving variables, but I do not think it is needed; we can let go of it pretty easily. In this case, I would just save objects that are expensive to reproduce, or variables that I want to write to files, for example, or use in other operations. Another beautiful thing, which you will find interesting if you do single-cell analysis, is 3D visualization. Normally, as I showed you, we visualize the cells in two dimensions. Each cell, like each sample in bulk RNA sequencing, is characterized by thousands of genes, so thousands of dimensions; to visualize them, we reduce this dimensionality to 2. UMAP is a good dimensionality reduction here because it is non-linear and works well for single cell. But sometimes adding a third dimension gives us a better idea of the heterogeneity of the data, as I will show you today. Creating 3D plots with tidyseurat is very easy, as we can pipe our Seurat object directly into plotly, a very powerful interactive plotting package. Because reduced dimensions and metadata are just variables in our abstraction, we can use them together: we just state the column names for our dimensions and the cell type column for our colors, and with simple code we get a three-dimensional plot. We can start to explore the dataset from a new perspective, and in this case we can see that some clusters are a bit better resolved, and we can see some relations that were compressed in 2D. So, the question is: for large datasets, does tidyseurat make any difference to memory allocation and computation? The answer is no, because the object is untouched; the environment never knows tidyseurat existed.
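A hedged sketch of the plotly call, assuming a 3-component UMAP has been computed and that the reduced-dimension and cell-type column names are `UMAP_1`..`UMAP_3` and `curated_cell_type` (both assumptions):

```r
library(plotly)
library(tidyseurat)

seurat_obj_umap3 |>
  plot_ly(
    x = ~UMAP_1, y = ~UMAP_2, z = ~UMAP_3,   # reduced dimensions as plain columns
    color = ~curated_cell_type,               # metadata column drives the colors
    type = "scatter3d", mode = "markers",
    marker = list(size = 2)
  )
```

The resulting widget is interactive: rotating it and toggling cell types in the legend is what lets you see relations that are compressed in 2D.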
If you save the Seurat object, that object never knows tidyseurat applied operations on it. That is the answer. Of course, when you do mutate or select, tidyseurat actually does the computation, because I reimplemented those functions; but in terms of the object's memory footprint, absolutely not: the object is untouched in any way. And with this visualization we can switch cells off and on, just be interactive with the object. Of course, this can be applied not just to Seurat objects but to data frames in general; it is simply very powerful in single-cell analysis. OK, I think this was the second part. We have maybe one hour to go and we have seen quite a lot, especially if you are not used to single cell: signature calculation, visualization, analysis, and so on. So we will stop for five or ten minutes. I would very much appreciate it if those who want to would try the exercises I will give; they are very easy, but just typing a few lines of code will make you remember all of this much, much better, and it is a fun challenge too. Now, I assume the people who want to code have loaded their objects. If you load the objects from the data directory, even without installing the package, you can execute the whole code. We have the Seurat object here, and the Seurat object with the 3D dimensions, which you do not need for the exercise; basically, the only thing you need is the Seurat object. There are two exercises for you. One: we have taken our Seurat object and identified the gamma delta T cells with a threshold. The question is, what is the proportion of gamma delta T cells, with the threshold we just visualized, among all cells? You have some code to define gamma delta; use it, pipe it into some tidyverse operations, and work out the proportion of gamma delta among all cells, say 10% or 5%.
So try this exercise and let me know in the chat your code or your answer. Do that first, and then we will move to the second question. OK, a question about SeuratWrappers, why we need it. Good point. Yes, SeuratWrappers gets installed when you install the package. But first of all, you do not need it: SeuratWrappers is only needed for the reanalysis, only for RunFastMNN, and you do not need that for the question. That chunk was actually set to eval = FALSE; it is there just to show you that you can reanalyze the object. Nonetheless, I will update the guide; you can install SeuratWrappers yourself, it is a GitHub repository, but you definitely do not need it now to run the workshop. You can run every chunk, but some chunks are set to eval = FALSE for a reason: they might take a lot of time in a workshop, with big objects and so on. But 90% of the chunks can be executed with no problem, I believe, even without installing it. While you do the exercise, I will add this to the README, actually: this is how you can install SeuratWrappers so you will be able to execute all chunks. Good, all right. If you got the proportion, type it in the chat, along with the code that you used. Actually, let's wait a few minutes in case someone else wants to try. Question: "Am I able to select the points with tidygate?" Yes; to create a polygon, you have to execute tidygate in the console, because it is interactive. You cannot execute it in the R Markdown document as I did before. When you execute it in the console, you will get a plot on the right, and you need to click at least three times to draw the polygon, then click Finish or press Escape. The polygon will be created for you and tidygate will work.
Having said that, for the exercise you do not need the gate; you can just filter based on the simple threshold of signature score bigger than 0.5. But if you want to use tidygate yourself, for the future or for fun, you need to copy and paste the chunk into the console; I lost the tidygate chunk here... OK, you paste it into the console, like I am doing now. Once you paste it, a plot is displayed; you click once, twice, at least three times, though you can click more, then click Finish or press Escape, and you can see that it works. Good. Joy, do you want to write your answer? Good, so 1%, yes. However, I challenge you to use the tidyverse verbs to do that. Imagine you have your Seurat object and you know what the threshold is. One approach I would like you to reproduce is, rather than filtering on the threshold, to create a column based on the threshold. In the tidyverse there are many verbs you can use; there is a count verb. Basically, that is almost the solution, but it would be good if you implemented it. So again, instead of filtering on the threshold as shown before, you can create a new column, count on that, and get the proportion that way. You count... 349, 349. Yeah, so instead of filtering, that is, eliminating cells from your object, you can mutate: you can create a new column. Yeah, exactly, then make that a summarized count. I do not know if you can count directly on this; you could, or you create a column and then count on that column. That column could be: is the cell gamma delta, true or false? The context here is that sometimes you want to calculate a proportion to decide something about your pipeline, and you are doing hundreds of these checks while you are interactively coding.
So instead of saving a Seurat object, then a Seurat gamma delta object, calculating the proportion, and doing this ten times because you keep changing your threshold, you want just to change, execute, execute, execute, without saving variables that are part of the pipeline. These things might seem like overkill, the fact that you pipe everything together, until you have spent hours and hours churning through data analysis, exploration, visualization, re-analysis, and so on. The cumulative effect of not having to save variables thirty times, copy and paste your code and your variables, and make multiple clicks to execute multiple variables really speeds up your workflow. That is why I am suggesting this. Yeah, that is exactly what you should do; I can give the solution here, but you got it right. Instead of filter, I can mutate, so create a new column; I am showing this in the console. Let's call it is_gamma_delta, defined from the signature. My Zoom environment is covering everything. You can see here that we have created the new column, and then we can just count is_gamma_delta. And you could calculate the proportion from that; what I am doing now is overkill in this case, but you could summarize this, or you could use add_count to add the total number of cells. Let's see if this works; add_count's drop argument is deprecated, but I am not using it, here we go. And you could do mutate(proportion = n / sum(n)), and here we go. Yeah, good. All right, and you can try the second exercise later on, as a further challenge. OK, so I assume all the questions have been asked, so let's step to a more powerful and complex operation. Can I ask the chat, you can reply yes or no: who has ever used nest-map operations for nested analyses in the tidyverse? OK, nobody. It is not so well known, although it is definitely the most powerful functional programming infrastructure that R has. And this is not just true for biology.
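Put together, the exercise solution typed in the console looks roughly like this; `is_gamma_delta`, `signature_score`, and the 0.5 threshold follow the exercise as described, while the object name is an assumption:

```r
library(tidyseurat)

seurat_obj_signature |>
  mutate(is_gamma_delta = signature_score > 0.5) |>  # flag cells instead of filtering
  count(is_gamma_delta) |>                           # cells per flag value
  mutate(proportion = n / sum(n))                    # fraction of all cells
```

Because no cell is dropped, changing the threshold and re-running this one pipe is all the iteration requires.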
This is true for whatever datasets you have in your hands; indeed, I use it all the time, irrespective of whether I am working on biological data or plain data frames. Group by? Yes, it is somewhat similar to group_by, but more general and powerful in a way, and I will show you. For example, you can group by and do some operation, say group by and summarize. If you start with 100 rows, after you summarize you have 10 rows, so you have lost some data. When you do nesting, it is similar to group_by, but you create new nested columns on which you can do operations, including summary operations, without losing any data; and you can unnest later, after you have done your calculations. I will show it here on biological data, but again, this is a data frame abstraction: whatever I show you here, you can do on a data frame. OK, so let's do a summary operation: let's have a look at how many cell types I have in my dataset, and how many cells. You can do this simply with count, which comes from the tidyverse. I have CD4 T cells; these are all immune cells, different types of immune cells; we have NK cells, which are cytotoxic cells, and so on. So you get an idea of the distribution of cell types. Now let's group this data. When you do single-cell analysis, it is often useful or necessary to do operations or analyses per cell type: you want to perform differential expression within these cells, across treatments; within T cells, across treatments. This kind of grouping comes up often in single-cell analysis, and not only there. So here, what I have done is take my Seurat object and apply the nest command from tidyseurat: I am grouping my data into a new column called seurat, according to the curated cell type. As you can see, the result is a data frame with two columns; I could have nested based on multiple columns, but in this case it is just two.
As you have seen, it is nested based on the cell type, and this new column itself contains Seurat objects. Just to show you, let's extract one; in this case I am saving the nested data frame in a variable. Let's extract the first element and see what it is: we take the first row and pull the seurat column. As I was mentioning, this is just a Seurat object, but instead of having 30,000 cells it has just 4,000, the cells of this CD4 cell type. So, for example, what could you do with this? Let me show you a brief example. Say I want to count how many cells I have, filter the cell types that are more abundant, and get back my original dataset. Normally you would take your Seurat object, count to obtain a summarization, maybe save it to a variable, do a left join with your original object, and filter that way. Again: creating a variable, joining, executing two different commands, maybe thirty times, because you are iterating across your workflow. How can we do it in one go? Count is a simple operation, but instead of count I could run an analysis that gives me a score. In this case I have my nested data frame, and I can mutate to create another column. purrr gives us functional tools, in particular map, which works very nicely with nest, because map iterates; it is a bit like lapply, iterating a function over a list. In this case the list is the seurat column I have here. A nice thing about map is that it can output many classes; for example, map_int outputs integers. So I can simply give my column as input and count how many cells I have in each object. Let's see if it works. So I have counted the number of cells. Say I want just the cell types that are very abundant, because I need to do statistical tests and I want a lot of information: I can filter for counts bigger than, say, 4,000, and I will keep just those cell types.
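The nest-count-filter round trip just described can be sketched like this, with the column names used in the session (`curated_cell_type`, a nested `seurat` column); the tidyseurat nest/unnest syntax is assumed to mirror tidyr's:

```r
library(tidyseurat)
library(purrr)

seurat_obj |>
  nest(seurat = -curated_cell_type) |>        # one Seurat object per cell type
  mutate(n_cells = map_int(seurat, ncol)) |>  # cells in each nested object
  filter(n_cells > 4000) |>                   # keep only the abundant cell types
  unnest(seurat)                              # reconstruct a single Seurat object
```

Unlike group_by plus summarize, no cells are lost along the way: the summary lives in a new column next to the data it summarizes.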
And then I unnest the seurat column, and I should get back a Seurat object. In this case it is not working: "must have positive length". Strange; normally this is quite an easy operation. OK, it does not work because I have some scaled data; that is a bit unfortunate. I do this kind of operation often, so you can imagine how it would work: it gives you back something similar to group_by and ungroup. Well, so the question is, how is this different from group_by and ungroup? tidyseurat does not implement group_by in a powerful way; however, on a table you could use group_by. How can I explain this? Let's convert this Seurat object to a tibble so we can group by and so on, and let me answer this first question. So I have converted it; now it is simply a table, and I do not have gene expression anymore. Let's group by this column. Now maybe I want to summarize and work out how many cells there are per cell type, so n equals... how can I do this? It is like the count operation; it would be n(), and here we go. Now I filter, say, n bigger than 4,000, and well, now I cannot ungroup this object anymore, because my object has been summarized. I would have to save it in a variable and join this information back onto my original object, or do other operations. So instead of this simple summarization, suppose I have a complex operation; say I want to run a linear model on these groups. With group_by it is not always easy to do very complex operations. Nest gives you that ability: say instead of this I have lm, and I want to take some columns of my dataset, extract a p-value, and then filter based on the p-value. group_by is limited in these respects, while the nest-and-map paradigm is completely general.
So in a way, instead of applying only summary operations to the groups, I can apply whatever operation I want, produce whatever variable, add it to my dataset, and filter, visualize, and so on, and I never lose data: I can always unnest. It is a bit unfortunate that it is not working here, and I am not sure why; of course I do not have time to investigate now, but normally it is seamless: you nest based on something, you unnest, and your Seurat object is reconstructed from that. So it is very powerful and general, a generalization of group_by plus mutate or summarize. "Is it more like map-reduce?" I am not sure what you mean. "I guess I meant: does nest and map act more like a map-reduce function? It sounds like it does; you're sending things off to a complicated task and then having the results of that phone home." Yeah, yeah, for sure; that is exactly it. I will not go into details, but map can automatically return complex objects such as data frames; you can call map_dfr, and that is effectively the reduce step, unpacking the results. But anyway, that is exactly it. OK, where were we? We nested by cell type, and as I said, we often want to do operations such as differential expression for each cell type, or other operations, visualizations, or dimensionality reductions for each cell type. This is even more powerful because all these iterations are done in a functional way: we are not creating and updating variables; each operation runs in isolation within the function, so they do not affect each other. And this is very helpful for building robust code. For example, we have our nested object, as I showed you before, although something here is not printing it.
So what we are doing now is an operation across the Seurat objects. In this case .x is our Seurat column; that's the syntax of map. And we are piping it through a normalisation and then a differential expression function from Seurat, FindAllMarkers. We are comparing treatments here, because I set the column treatment as the default identity, so Seurat knows that the identity we want to compare is treatment. So again: normalising, doing our hypothesis testing, filtering by p-value, and taking the first ten genes. Now, of course, this is the final code, because I have already explored what this output is; once you know the output, the code is very elegant. So let me walk you through it. For example, let me pull the first Seurat object; I'm not using map, just taking the first element and doing the operation on that. So I pull the Seurat column, which is a list, and take just the first element: a simple Seurat object, as I showed you before. And let's do this hypothesis-testing operation and look at the result — this is what happens inside the map. The output of FindAllMarkers is a data frame with statistics: p-values, fold changes, the average transcription of the genes across cells, and the gene. So what I'm doing here is taking this output, filtering by p-value, taking the first 10 rows, and taking the gene names. In other words, I'm selecting the top significant genes for this comparison within each cell type, and I'm repeating this across cell types. Let me update the object, and I will show you what I got. I'm creating a new column: I have my cell type and Seurat columns, and I'm creating a new column called significant_genes. And the powerful thing we are building with nest is, basically, a database with many column types that we can then operate on, as I will show you.
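The per-cell-type marker step might look roughly like this (a sketch: `seu` is an assumed Seurat object with a `cell_type` metadata column and identities already set to `treatment`; column names are illustrative):

```r
library(tidyseurat)
library(Seurat)
library(purrr)

seu |>
  nest(seurat = -cell_type) |>
  mutate(significant_genes = map(seurat, ~ .x |>
    NormalizeData() |>
    FindAllMarkers() |>                 # compares the `treatment` identities
    dplyr::filter(p_val_adj < 0.05) |>  # keep significant genes
    dplyr::slice_head(n = 10) |>        # top ten rows
    dplyr::pull(gene)                   # extract the gene names
  ))
```

Each element of `significant_genes` is then a character vector of up to ten marker genes, one per cell type.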
This will take probably a minute, because it has to go through 30 or so cell types. If you have any question — as this is quite new to all of you — and you want to understand more, let me know. OK, it's finished. How does our data set look? I'm not sure why it now takes some time to display. Anyway, again we have our columns, cell type and Seurat, and now we have created a new column. This new column contains character vectors with our 10 genes; as you can see, the size is 10. Now, usually we build heatmaps of these markers to visualise transcript abundance across all cells, and as we are doing this analysis per cell type, we want to create a heatmap per cell type. So here I'm creating a new column called heatmap. And here we're using a function called map2: rather than taking one column, it takes two columns, seurat and significant_genes. We take the first input, the Seurat object, scale it, and build the heatmap, and as the gene features argument we give the second column. So we are taking the Seurat object and the genes, doing an operation on them together, and creating a new column, called heatmap, which will contain plots. As you can see, this is much more powerful than group_by, because now that we have nested, we can create an arbitrary number of columns of different kinds. Again, in principle I'm not creating variables here; it's very robust, and I don't need to fill my session with temporary variables just because I'm doing a complex operation. "Does .x reference the data, so the first input column, and .y the genes, the second input column?" Yes. If you swap the two arguments — if you put the Seurat column second — then it will be the other way around; it depends on the order of the arguments. So these are the inputs, and this is the operation: map2 accepts three arguments.
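The map2 step might be sketched like this (assuming a nested tibble `nested` with `seurat` and `significant_genes` columns, as built above):

```r
library(Seurat)
library(purrr)
library(dplyr)

nested |>
  mutate(heatmap = map2(
    seurat, significant_genes,   # .x = Seurat object, .y = gene vector
    ~ .x |>
        ScaleData(features = .y) |>
        DoHeatmap(features = .y)
  ))
```

The `heatmap` column then holds one ggplot object per cell type, alongside the data and gene lists in the same tibble.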
The first input, the second input, and the operation; .x and .y are just whatever these inputs are called. So now you can see that I have a new column called heatmap, and now that I have a list of heatmaps I could, for example, pull the first one, or use patchwork to put them together, or whatever I want. But again, the point I'm making is that I can now use this nested data set almost like a database, in which I have data, visualisations, gene lists. And it's great, because we can ship this: we can save this nested object in an RDS file, and all the relations between plots, analyses, linear models, and data are there in just one package. We don't have to keep many files that we then have to merge together. Moreover, all these operations are self-contained in the function; as I said, this is the functional programming that R was designed for. OK, I think we have 30 minutes to go. This is not to be evaluated, but I'm showing here that I can go from the object to the heatmap all in one go: if I don't want to create variables, I don't need to; if I want, I can. I think Joy was the one doing the exercises, so I'm not sure if anyone else is. In any case, you can challenge yourself later on to make sure you understood this; on the website I gave you, all the material is there and you can reproduce the exercises. OK, so we are at the last part of the workshop. Before I stop for five or ten minutes — to give a little break to you guys and myself, and give you the opportunity to go back, look at the code, and think about what's happening with this tidyverse thing — I will just summarise what we have seen so far. In the first part, I showed you how to apply simple tidyverse verbs to Seurat objects; in a way, a Seurat object behaves like a data frame. In part two, we worked through a case study where we identified a signature and used that signature to analyse the cells it characterised.
In the third part, I showed you how to use powerful tools such as nest and map to do nested analyses. And in this last part, I will show you how to do pseudobulk analysis. Sometimes it is very convenient and informative, even though you have a lot of cells — say 100,000 cells and 10 tissues, so 10 specimens — to collapse all cells from each specimen into a single measure. So instead of a thousand cells, just one sample, summing the expression of those cells, to analyse the variation in a simpler way and to do somewhat more robust analysis, using very well-known tools developed for bulk analysis. You virtually collapse these cells as if they were a tissue; that's why it's called pseudobulk. OK, if you have any question on the material so far, put it in the chat and we will take five minutes for that. You mean the repository of the workshop or the repository of the tools? Well, I sent everybody the workshop link, so again, the workshop is here. You should be able to install the package; hopefully the API limits will be over — I've never seen these limits on package installation. And if you go to the repository, in the README there is the link to the web page where this whole workshop is rendered for you. And since we are here: if you click on the owner of that repository — it's called tidytranscriptomics-workshops — we have many workshops there that we have given, each somewhat different from the others, so you can learn from there. And I will also pass you my GitHub profile, as it includes many of the packages we are talking about today: tidyseurat, tidySingleCellExperiment, tidygate, tidyHeatmap, a lot of tidy stuff, among other things I do in my research. That's the user. One minute, and then we do the last part. This will be very similar to bulk analysis, so if you are more familiar with bulk, this will look very similar.
It will be very familiar to you, sorry. Okay. As I explained before, sometimes it's very hard to understand major technical or biological effects across many tens of thousands of cells. Sometimes it's much more informative to collapse those cells as if they were a bulk tissue and do the analysis of variability in the more standard way we are used to with bulk RNA sequencing. In other words, we sacrifice the variability across cells to better understand the variability across samples, for example. So in this case, I load some libraries: some tidyverse libraries as well as some tidytranscriptomics libraries. tidybulk is the analysis tool that tidytranscriptomics has, just for bulk at the moment, and as I will show you, it includes all the common operations and analyses that you can do with bulk RNA sequencing data, in a very elegant and tidy format. And tidySummarizedExperiment is very similar to tidySingleCellExperiment: it is an adapter of SummarizedExperiment to the tidyverse. Just a second. Why do I not have... let me see, do I have it now? OK, it's quite strange, as I deleted some code to make the GitHub Action work; let me take this code from the actual repository and update my code locally. OK, part four. So we load the libraries and we have the Seurat object we are familiar with. Let's suppose we want to collapse the cells according to sample and cell type: for each sample and each cell type, we will have something like a tissue sample, basically. And we want to compute this from the raw counts. What this operation does is sum the transcription across genes for the cells I'm grouping — each sample and cell type pair — and create a bulk data container. Now let's see what this pseudobulk object is. (This aggregate_cells function will be available in tidyseurat very soon.)
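The aggregation might look like this (a sketch assuming the `aggregate_cells()` helper from tidySingleCellExperiment; the exact signature may differ across versions, and `sce` is an assumed single-cell object with `sample` and `cell_type` metadata columns):

```r
library(tidySingleCellExperiment)

# Sum raw counts over each sample / cell type combination,
# producing one pseudobulk "sample" per pair
pseudo_bulk <- sce |>
  aggregate_cells(c(sample, cell_type), assays = "counts")
```

The result is a SummarizedExperiment with one column per sample-by-cell-type pair, which can then be analysed with standard bulk tools.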
Basically, we went from a Seurat object to a SummarizedExperiment object, which is very convenient: SummarizedExperiment contains bulk RNA sequencing data. Again, because we have loaded tidySummarizedExperiment, we are visualising this as a data frame; internally, it is the same story, the original object is there. And the visualisation is very similar to Seurat's — there are some differences. We have our metadata here. However, for bulk data, because genes are very important, the features are displayed for us in long format. This is very convenient — if you start using it, you will understand why — but basically we have the features as if we had done join_features in the single-cell case, with our RNA assay here. And you can see headers that tell us we have 3,000 genes, 123 pseudo-samples, and one assay, which is RNA. And we can do all the operations from before: filter, all sorts of things, and the underlying object is updated for us, so we can forget we are interacting with this complex object. Just to note, this will be rendered better on the website, but tidybulk offers a wide variety of analyses; some of them are in this list, and they are the most common: scaling, dimensionality reduction, clustering, differential expression, deconvolution, survival analysis, and so on, all done in a tidy fashion. Now, I introduced the nesting operation before. This nesting operation is also useful in your bulk analysis. We have done differential expression before at the single-cell level; let's see how we can perform differential expression at the pseudobulk level. In our case, we want to do this differential expression within cell types, so we use the same syntax that we used before with the nesting, this time applied to this bulk object: we nest based on cell type. A different object, but the same story, so the syntax stays familiar.
We don't need to learn new things. We have our curated cell types, and this column is standard: where before it held a Seurat object, it now holds a SummarizedExperiment. If I show you the first element, that's what it is; we have fewer samples because we are looking at just a subset of the data. Now we can use map as before; we just need to change our pipeline to calculate differential expression. And with tidybulk, this pipeline looks as user-friendly as the Seurat one. We input our SummarizedExperiment, which is now .x; we exclude lowly abundant features, calling keep_abundant — this is from tidybulk; we test differential abundance using DESeq2; and we also scale, producing explicitly scaled counts for visualisation. This will take maybe some time. Let me see... ah, the font bigger — sorry, I wasn't looking at the chat. Where am I? Yeah. So again, now I'm updating the variable. "Are we creating a new column?" Yes and no: we are updating the same column here. We are updating the SummarizedExperiment column by inputting the SummarizedExperiment and doing some operations on it: filtering, testing, and scaling. Now, if we look at our object, it looks exactly as before; however, this column has been updated. How? Well, if we look at the first element, we can see that our SummarizedExperiment now includes not only the columns we were seeing before, but also the TMM scaling multiplier and the statistics of our test: fold change, p-value, adjusted p-value, and so on. This is information related to genes. The beauty of this format is that we merge feature-related metadata, sample-related metadata, and counts all in one data frame, as if each of them was a variable. So we don't need to differentiate anymore; we can do visualisation and filtering based on combinations of all of those together.
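The nested pseudobulk pipeline might be sketched as follows (assuming a nested tibble `pseudo_bulk_nested` with a SummarizedExperiment column `se` and a `treatment` metadata column; the exact method string for DESeq2 may vary across tidybulk versions):

```r
library(tidybulk)
library(purrr)
library(dplyr)

pseudo_bulk_nested |>
  mutate(se = map(se, ~ .x |>
    keep_abundant(factor_of_interest = treatment) |>        # drop lowly abundant genes
    test_differential_abundance(~ treatment,
                                method = "DESeq2") |>       # hypothesis testing
    scale_abundance()                                       # add scaled counts (TMM)
  ))
```

Note that the `se` column is updated in place: each SummarizedExperiment gains the test statistics and scaled counts while keeping all its original data.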
The last bit — we are at the very end. Yes, the scaling helps with the plotting, exactly. This long format is very good for plotting, among other things, as we can assign aesthetics based on whatever we decide and combine different information. Okay, now we have done our differential expression analysis. What I'm doing now is the same thing I was doing before: say we want to plot the top genes for each cell type. We have done many analyses — 20 cell types, so 20 differential expression analyses — and I want to plot the top gene, or top genes, for every cell type. So I take my nested data set and add a new column, same as we've done before, with the significant genes. And because I'm taking just one gene, I'm actually outputting a character, not a list anymore; I can show you. How am I deriving this top gene, this top feature? I take my SummarizedExperiment input, take the transcript information — this is all tidybulk inside here — arrange on p-value, select the first gene, and pull that gene, so extract the character out of it. I can probably show you this operation outside the map, so you can see. Again, as before, I pull this column and select the first element, so I just have this object with our statistics and so on, and let's apply this workflow. When I pivot_transcript, I just select the transcript information: this function understands which information is transcript-related and outputs a table with the transcript as the main unit, plus all my statistics — 1,000 and something transcripts. And then I'm arranging, so ranking on p-value, and taking the first gene; in this case, this one is the top significant gene. I'm doing this for every cell type. So how does this look? I'm creating a new column here with the top gene. I could have the top gene, or the top three, or whatever you want.
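Extracting the top gene per cell type might look roughly like this (a sketch: the statistics column, e.g. `pvalue`, and the feature column `.feature` depend on the testing method and tidybulk version):

```r
library(tidybulk)
library(purrr)
library(dplyr)

pseudo_bulk_nested |>
  mutate(top_gene = map_chr(se, ~ .x |>
    pivot_transcript() |>   # one row per transcript, with test statistics
    arrange(pvalue) |>      # rank by p-value
    slice_head(n = 1) |>    # take the top transcript
    pull(.feature)          # extract the gene name as a character
  ))
```

Because `map_chr` returns a plain character vector rather than a list, `top_gene` is an ordinary column that can be used directly for filtering.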
Now I filter: I use this top gene to filter my original object down to just that gene, so I can plot the expression. Again I use map2 — this is very similar to what we have seen before. I input two columns and update my old one: I update my old column, taking that column and top_gene as inputs, and use one to filter the other, so .y to filter .x. And here I can execute the whole chunk. So I update my object, and now the SummarizedExperiment column will include just one gene. The last thing: as before, I can plot this. I can add a new column, a plot column, applying a ggplot in there; you can explore that. Nonetheless, I will execute this: I'm creating a new column, and in this case I'm pulling all the plots. You can see here my new data set with the plot column, and at the end I pull all the plots, and you can see the top genes for every cell type. Again: very simple, very robust, and I didn't have to create any variables, so the possibility of a bug because you haven't executed one line and updated a variable is very low. OK, a lot of stuff, but that's why I put the material online, so you can go back and read it over the next weeks or so. I hope you enjoyed it. We have seen quite a lot, both for single cell and bulk, and I will leave these last few minutes for questions; after that, that will be it. "This could be good teaching material." Yeah, that's true. I think starting from tidybulk gives the high-level concepts first, and then you can investigate further. It doesn't focus so much on the coding or the nitty-gritty; it just focuses on the workflows and on what you do. "Can you explain the aggregation by sample and cell type, along with treatment? I'm trying to wrap my head around what would be relevant in my work." Well, in this case I have single cells, and I want to simplify these data.
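The filter-and-plot step might be sketched as follows (assuming the nested tibble from the earlier steps, with `counts_scaled` produced by the scaling step; the axis choices are illustrative):

```r
library(purrr)
library(dplyr)
library(ggplot2)

pseudo_bulk_nested |>
  # keep only the top gene in each SummarizedExperiment (.y filters .x)
  mutate(se = map2(se, top_gene, ~ filter(.x, .feature == .y))) |>
  # build one plot per cell type
  mutate(plot = map(se, ~ .x |>
    ggplot(aes(x = treatment, y = counts_scaled)) +
    geom_boxplot()
  )) |>
  pull(plot)
```

Pulling the `plot` column at the end prints each per-cell-type plot in turn; alternatively, they could be combined with patchwork.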
And I want to decide what to simplify across — what my observational unit is. A simple simplification could be: forget about cell types, suppose this is a tissue, and aggregate based on sample only, not on cell type. So for example here, when I create the pseudobulk, instead of creating it according to sample and cell type, I might create it according to sample alone. That is the same thing as bulk RNA sequencing data — probably the simplest thing we can do — and we can then do differential expression across treatments; that's your typical bulk analysis. But because I have a single-cell experiment, I can divide it in a finer way: for example, analysis per cell type, and that's what I've done. Treatment, in this case, is just the covariate I use for testing. But again, I would suggest executing this code, getting a bit more familiar with it, and you will quickly understand what applies to your work. Amazing. OK, thanks a lot, everybody. I hope you enjoyed it and took something home. Thanks a lot for all the questions and the participation. If you have questions, get in touch — my GitHub is there — and if you want to contribute to these things, you can: the community is growing, so participation is welcome. Hopefully, see you very soon.