Hi everyone, and thanks for being here today. My name is Timothy Keyes, and I'm an MD-PhD student at Stanford, here to talk to you about tidytof, a package I've been working on for tidy single-cell data analysis. This talk will have three parts. First, I'll present some background information about single-cell data and my motivations for building tidytof. Next, I'll demonstrate the tidytof package's syntax and analysis capabilities. And finally, I'll conclude with some take-home messages for anyone interested in building or using tidy single-cell analysis tools.

So we'll start with a bit of background. While a lot of the topics in this talk will generalize to multiple kinds of single-cell data, the specific data type that both I and tidytof work with is called CyTOF data. CyTOF stands for cytometry by time-of-flight, and it's a single-cell technology that works by detecting antigens on single cells using antibodies conjugated to heavy metal isotopes. When the cells are then run through a mass spectrometer, you're able to measure the amount of each metal on any given cell, and these metal measurements can act as a proxy for protein expression. So you ultimately end up with 40 to 50 protein expression values for each cell that you run through the machine, and these values are written out into a file format called .fcs files, one for each sample.

While this data type may seem straightforward, analyzing it is actually deceptively complicated. That's because, as I usually describe it, single-cell data sets exist on multiple levels of analysis at the same time. So what exactly does that mean? Well, on one hand, you might be interested in answering questions at the level of individual cells, in which case you can think of the data from each sample as its own data frame in which each row represents a cell and each column represents a protein measurement. At the same time, you may also be interested in how cells organize into communities or clusters based on shared characteristics, regardless of what sample they come from, and those communities or clusters can be of varying sizes with varying composition. And finally, you may also be interested in aggregating data from both the single-cell and community levels into a sample- or patient-level representation so that you can compare samples or patients with one another.

So there really are these different levels of biological scope inherent in all single-cell data sets, all of which are important in their own way, and all of which are associated with their own kinds of biological questions. But maybe most importantly for us as data scientists, the data processing needs within each of these levels are also distinct. For example, you generally want to perform data cleaning and pre-processing at the level of single cells, whereas at the community and sample levels, you're usually more interested in using statistical and modeling tools to associate high-level feature information with a variety of biological or clinical outcomes. So I hope all of this paints the picture that, at least in my mind, the problem at the core of single-cell data analysis is really about how you move seamlessly, intuitively, and reproducibly between these different levels of analysis.
And not only that, but it's about how you move between your internal data representations and the input and output file formats you're working with, as well as between your internal data representations and your reporting devices, like plots and figures, when you want to share what you've done with the scientific community. This schema is really the core problem that tidytof tries to solve. How? By using tidy data.

So tidy data is a concept, or really more of a philosophy, that became popular in the data science world about five years ago and has been growing in popularity since. In R, it's primarily supported by the tidyverse ecosystem of packages, and in Python, it's supported by the pandas ecosystem. I would summarize working with tidy data as having four main principles. First, data frames are your core data structure. Second, each column in your data frame should represent a variable. Third, each row in your data frame should represent an observation. And fourth, each position in your data frame should contain a single value. This matters because formatting your data in a tidy way allows you to simplify your analyses: they'll have a standardized, expected input and output data format. That means your functions can be built with an intuitive and self-consistent set of design principles, which is really powerful because it makes your analyses both easy for users to write and highly reproducible, which is important for scientists.

So given all of that, how does tidytof actually put tidy data into action in the single-cell world? Well, if we return to our schema of what we're trying to accomplish, what tidytof does is implement "verbs," quote-unquote, at each level of single-cell analysis. You can basically think of verbs as function families that each represent a specific data processing task, such that different members of the family do more or less the same thing, but in different ways. The first set of verbs implemented in tidytof is for moving between raw files and each level of analysis that we talked about in the previous section; these include the read, write, cluster, and extract verbs. The second set of verbs is for producing plots from each level of analysis and includes the plot-single-cell, plot-cluster, and plot-sample verbs. And finally, the last set of verbs implemented in tidytof is for computing within each level of analysis. At the single-cell level, you have the preprocess, downsample, dimensionality-reduce, and postprocess verbs. At the community level, you have the differential abundance analysis and differential expression analysis verbs, both of which implement statistical tests for comparing communities or clusters with one another. And at the sample level, you primarily have verbs associated with predictive modeling, like the train, predict, and assess verbs.

So what does this actually look like in code? Well, most of you in the audience will probably be familiar with the dplyr, or tidyverse, syntax for manipulating data frames: you use the pipe operator, which is this kind of goofy thing with percent signs, to pass a data frame into a verb that performs some computation and returns an output data frame in return.
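To make that concrete, here's a minimal sketch of that dplyr-style pattern; the tibble and column names below are made up purely for illustration:

    library(dplyr)

    # A made-up tibble standing in for any tidy data frame
    cells <- tibble(
      cell_id = 1:4,
      cd45    = c(1.2, 3.4, 0.8, 2.1),
      cd19    = c(0.1, 2.2, 0.0, 1.7)
    )

    # Each verb takes a data frame and returns a data frame,
    # so verbs can be chained together with the pipe operator
    cells %>%
      filter(cd19 > 0.5) %>%
      mutate(cd19_bright = cd19 > 2) %>%
      arrange(desc(cd45))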
This can be very powerful for building a reproducible pipeline, because when you chain multiple verbs, you end up with code that's easily readable from left to right and top to bottom, which makes it really clear what you're trying to do. tidytof uses the same framework, but in the context of verbs specific to single-cell analysis. So, for example, you might start with a data frame containing CyTOF data and pipe it into the tof_preprocess function in order to transform each of your protein measurement columns. You can then pipe that into the tof_downsample function if you want to conserve computational power and only analyze a subset of the cells you've collected at any given time. And you can continue to pipe the results of those functions into downstream analyses that allow you to cluster, dimensionality reduce, or plot as you see fit.

So that's the high-level picture, but I think the best way to illustrate why using tidytof is helpful is by example. For the next few slides, what I'd like to show you is how tidytof can be used to compress three years of PhD work into about 10 function calls, or about 30 lines of code. What I mean by that is that I'd like to more or less reproduce the results of a PhD project that two of my colleagues here at Stanford published a few years ago. In their paper, which was published in Nature Medicine in 2018, they used single-cell CyTOF data to build a predictive model of relapse in pediatric leukemia patients, which they called the Developmentally Dependent Predictor of Relapse, or DDPR (pronounced "deeper"), model. They did this using a two-step algorithm. First, they estimated the stage of hematopoietic development at which leukemic cells developed their relapse-associated phenotype by comparing cancer cells to well-studied clusters of healthy bone marrow cells, using a distance-based classifier that operated at the single-cell level. After that, they used the characteristics of the cancer cell clusters they derived to build a predictive model of which kids would relapse and which wouldn't. As a cancer biologist myself, I found this algorithm really compelling. The problem is that while the data are publicly available, most of the code used to perform the analysis was lost at some point since 2018. So what we're going to do here is recreate the core findings of this paper using tidytof.

Let's start with the first step of their algorithm, which is the clustering part. After downloading the input data files from the journal website, we can pipe our data directory into the tof_read_data verb, which is smart enough to read either single files or directories of files in both .fcs and .csv formats and combine them all into one tidy tibble. We can then pipe the result of tof_read_data into the tof_preprocess verb, which accepts as an argument any vector-valued function and applies it to transform all the columns in your data set that correspond to CyTOF protein measurements. Here I've actually written out the default explicitly, which is the arcsinh transformation, a well-characterized, variance-stabilizing transformation in the CyTOF community. Next, we can perform the first step of the DDPR algorithm by identifying the main types of healthy bone marrow cells in our healthy cohort. I do that here by piping our pre-processed tibble from the previous step into dplyr's filter function to keep just the healthy cells.
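To give a flavor of what that looks like, here's a rough sketch of those first few steps. The directory path and the "condition" metadata column marking healthy versus cancer samples are hypothetical, and the argument names follow the verbs as I've described them here, so they may differ slightly from the released package:

    library(dplyr)
    library(tidytof)

    # Read every .fcs/.csv file in the directory into one tidy tibble
    # (one row per cell, one column per protein measurement)
    ddpr_data <- tof_read_data("data-raw/ddpr/")

    # Variance-stabilize each protein channel with the arcsinh transformation
    ddpr_data <- ddpr_data %>%
      tof_preprocess(transform_fun = function(x) asinh(x / 5))

    # Keep only the healthy bone marrow cells for the clustering step
    # ("condition" is a hypothetical metadata column)
    healthy_data <- ddpr_data %>%
      filter(condition == "healthy")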
And then from there, we can pipe that into the tof_cluster verb, which applies the FlowSOM clustering algorithm to all of the healthy cells; that's just one of five clustering algorithms with native support in tidytof. After that, we can pass the healthy clusters we just identified into the tof_cluster_ddpr verb, which will bin each cancer cell into the healthy cluster it's most similar to using something called the Mahalanobis distance. For speed, I'm able to set this to run in parallel on 12 cores of my computer. Then I can bind the resulting cluster assignments to the original input tibble for something that is very tidy and compact.

From here, you might want to visualize your results so far and compare them to the original study. So that's what I did. Using tidytof's plot-single-cell dimensionality reduction verb, we can create a tSNE embedding showing each of our cancer cell clusters. One of these clusters, the second cluster, which I've circled in red, stood out to me because it expressed high levels of three phenotypic markers associated with relapse in the original study: TdT, CD19, and CD24. Using one of our differential abundance analysis verbs, I was then able to perform a statistical analysis showing that that same cluster, cluster two, is heavily expanded in cancer patients relative to healthy patients. This looks pretty similar to the results from the original study, which identified five populations of expanded cells in cancer patients relative to healthy patients. In fact, it looks like what we've done by using a different clustering method is essentially lump all of those expanded clusters into one single group. So, at least to me, this looks like so far, so good.

Next, we can proceed to step two of their algorithm, which is actually building a classifier. To do so, we first extract patient-level features from our single-cell data using the tof_extract_features verb. While I won't walk through every argument here, I do want to point out that tidytof verbs generally use two types of arguments. First come the column specifications, which indicate which columns you'd like to manipulate. After those come the method specifications, which indicate how you want to do the manipulation, according to a variety of previously published methods built into the package's internals. For example, the signaling method argument here is set to the threshold flag, which means that we'll aggregate features in each column corresponding to signaling markers in our input table (which I've specified with this column specification here) by counting how many cells are above a specified threshold of protein expression. Using these extracted features, we can then use the TOF train-classifier verb, with appropriate column and method specifications, to build a tenfold cross-validated glmnet logistic regression model classifying samples into relapse or non-relapse groups, which is exactly what the original study did. Then, with a little help from the yardstick package, we can plot an ROC curve for our model's cross-validated predictions and see that it's not too bad, with an area under the curve of about 0.8. This is almost as good as the ROC curve from the original study, whose area under the curve was 0.88; the difference is mostly attributable to the fact that they had access to some metadata that we don't.
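For readers who want the gist in code, here's a rough, hypothetical sketch of the clustering and modeling steps. The function and argument names follow the verbs as I've described them, and anything not mentioned in the talk (placeholder column names like cluster and relapse, and the exact model-training interface) is an assumption, so check the package documentation for the real signatures:

    # Identify healthy bone marrow populations with FlowSOM, one of the
    # clustering methods supported by the tof_cluster verb
    healthy_clusters <- healthy_data %>%
      tof_cluster(method = "flowsom")

    # Summarize each patient's single-cell data into one row of features,
    # counting, within each cluster, how many cells exceed a signaling threshold
    patient_features <- ddpr_data %>%
      tof_extract_features(
        cluster_col = cluster,
        signaling_method = "threshold"
      )

    # Fit a cross-validated, regularized (glmnet) logistic regression model
    # predicting relapse from those patient-level features
    relapse_model <- patient_features %>%
      tof_train_model(
        response_col = relapse,
        model_type = "two-class"
      )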
Another cool thing is that tidytof did a pretty good job identifying the same features as predictive of relapse as the original study did in its regularized logistic regression model. So, considering that the original work took three years and I was able to do this replication analysis in about 30 minutes start to finish while eating a bowl of spaghetti, I think it's pretty clear that tidytof provides a convenient and fast workflow for this kind of analysis.

With that, I'd like to conclude with some of the lessons I learned while writing tidytof that I think could benefit anyone analyzing single-cell data. For starters, I think every single-cell person who works in R should know that the tidyverse implements a ton of tools for doing most of the tasks we do in the single-cell world, like aggregation and pivoting, and using them is really convenient. In addition, structuring your analyses around tidy data helps them stay simple and easily readable, which means that the pipelines you build will be robust to human error, like copy-and-paste mistakes, and accessible to programming beginners, which most biomedical scientists are.

The second lesson is that design principles matter a lot, whether you're building an R package or just writing code in general. Building something that's easy to use and easy to read makes a huge difference for other scientists who are trying to replicate your work. To illustrate this, on the left I've shown the native function calls for the five clustering algorithms built into tidytof. You can see that their arguments are pretty different from one another and not always super descriptive, that they give their outputs in different formats, and that the functions themselves are named using really different conventions. In fact, two of them had never been implemented in R before, including one that's only available as a Java GUI and the DDPR algorithm, which previously existed only in the minds of the people who originally published it. When you compare these function calls to the tidytof verbs that implement the same procedures, you can appreciate that if you're trying to automate a workflow, or if you're an R beginner just learning to code, having a simplified and self-consistent interface can make your life a lot easier. The key ideas here are that these functions are built with predictable inputs and outputs, they're written to be very modular, and they're written to still maintain flexibility. I think this point is well illustrated by how simple it is to use tidytof to perform three different kinds of dimensionality reduction by changing only a single flag in these otherwise identical pipelines. This abstracts away the need to navigate the original function syntax and allows users to focus on what they really care about, which is their data.

And as a final take-home message, I want to talk about how package development may seem like a huge time investment up front, especially if you've never done it before, but it can save you or your lab a ton of time in the long term for repeated boilerplate analyses. I want to demonstrate this by showing an example of the kind of code that I spent a lot of time reading during my PhD. I didn't write it, but it was published in a recent CyTOF paper, and it's fine: it runs, it's functional, and versions of this code exist in most papers in the CyTOF community. I don't have any problem with that, but if you look at it, at least to me, it's not super obvious what this code is doing.
But when you compare this script to the tidytof version that I've provided on the right, I think the tidy version is a lot easier to read and understand in two seconds, right? You just have a file path, you pass it in, you read your data, you pre-process it, you downsample a bit, and you take out a junk column. It's just so obvious to anyone reading this what type of analysis I'm doing. My point here is that if you write code like the tidytof code on the right and you come back to it in six months, you're a lot more likely to understand what you actually did without having to dive into the guts of all the specific indexing and hard-coded file names, which will save you time and make your analysis honestly just better and more reproducible all around.

So I'm out of time. Thank you for coming to my talk virtually, and thanks for asking questions in the chat. If you want to check out tidytof on GitHub, I've provided the link here. It's currently being beta tested, but I'm hoping for a stable release sometime before the end of the year. Thank you to all of my funders, thank you to everyone who supports me, including my two wonderful PIs, Kara and Garry, and to everyone who's provided input. Thanks so much, and I hope to hear from you at some point when you open up a GitHub issue.