All right, can you all see the slides? Yes, looks good. Great. Thank you for the introduction once again, and thank you all for joining me for this talk, despite its wordy mouthful of a title, which I realized in hindsight. With that said, I'd like to transition to this slide, which has a much simpler title: Simplifying Complex Next-Generation Sequencing, or NGS, Analysis. For the remainder of the talk, I'd just like you to keep in mind how we can create simple, easy-to-use tools.

With that said, let's go over what a typical NGS workflow looks like. Say you want to analyze one particular sample to see its gene expression, let me just make sure I can get my pointer here. Say you have a tumor sample and you want to see which genes are being expressed. You'd isolate that tumor, extract the RNA, fragment it and process it by adding adapters, and then sequence it, typically on an Illumina sequencing platform. That leaves you with raw read files called FASTQ files, which are essentially specialized text files containing information about which transcripts or genes were identified in your sample. Using these read files, you then align, or map, them to a reference genome. For a human sample, we would obviously align to the human genome, and this gives us a table of read counts: genes along the rows, your samples along the columns, and numbers indicating how many times each transcript was detected in each sample. After that, you do differential expression, or DE, analysis, or some other type of abundance analysis, especially if you're comparing across multiple samples, to see how those read counts change from one sample to the next. This gives you the differential gene expression, or DGE, or other features of interest, which inform what to do next, such as which genes or pathways to investigate, or which genes or pathways to target with particular drugs.

This entire process can be very complex. It looks simple here, where I'm depicting one sample, but real-world experiments are typically not this simple, especially when you're trying to find something groundbreaking and exciting, right? And each of these steps can be pretty time-consuming too.
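To make that read-count table and the DE step concrete, here is a minimal sketch in Python. The file name, column names, and the naive fold-change calculation are hypothetical illustrations; a real analysis would use a dedicated differential expression tool with proper normalization and statistics.

```python
# Minimal sketch of the read-count table described above:
# genes in rows, samples in columns. All names are hypothetical.
import numpy as np
import pandas as pd

counts = pd.read_csv("read_counts.csv", index_col=0)  # rows: genes, cols: samples

# Naive log2 fold change of treated vs. control replicates, with a
# pseudocount to avoid division by zero. No normalization or statistics
# here; this only illustrates the shape of the data.
treated = counts[["treated_1", "treated_2", "treated_3"]].mean(axis=1)
control = counts[["control_1", "control_2", "control_3"]].mean(axis=1)
log2_fc = np.log2((treated + 1) / (control + 1))

print(log2_fc.sort_values(ascending=False).head(10))  # top over-expressed genes
```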
So let's consider one of those real-world situations. Say you have three cancer patients, and you want to see how their tumors, it doesn't really matter what type of tumor or what type of cancer, respond to a particular drug. These tumors were obtained through biopsies. Consider this first one over here: we take it and graft it into two cohorts of mice, a control group that receives saline and a drug-treated group that receives your drug of interest. These are, by the way, immunocompromised mice that can accept a human tumor graft or human cell graft. In each cohort there are triplicates, three mice. So we graft pieces of that tumor into these mice and treat each group for six weeks.

At the end of those six weeks, we excise the tumors and prepare them for sequencing. In this particular case, we're doing single-cell RNA sequencing, so we also add synthetic nucleotides to keep track of which sample, which replicate, and which treatment group each of these excised tumors came from. We add those, process everything, and get it ready for sequencing. That's what we do for one tumor; then for this other tumor, we do the same thing, and also for this one here, which is grayed out just to keep the figure easy to read.

So here's a full look at the protocol. Once you process everything for sequencing, do the actual sequencing, and do some analysis magic, you end up with the really cool-looking figures we typically get for single-cell sequencing projects: cluster plots showing how different cells or samples cluster in t-SNE or UMAP plots. But before we get there, consider how much data is actually here. We have three patients. From each patient, two treatment groups. From each treatment group, three replicates, each carrying synthetic or artificial nucleotide barcodes that are also sequenced. From each of these tumors, the leftover tumors we're going to sequence after the drug treatment, we're going to detect roughly five million cells per replicate. And because it's single-cell RNA sequencing, from each of those cells we're going to detect roughly 20,000 protein-coding genes, which is roughly the number of protein-coding genes present in the human genome. That's a ton of data.
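Taking the slide's numbers at face value, here is a quick back-of-the-envelope calculation of that scale:

```python
# Back-of-the-envelope scale of the experiment described above,
# using the numbers stated on the slide.
patients = 3
groups_per_patient = 2        # saline vs. drug-treated
replicates_per_group = 3      # three mice per cohort
cells_per_replicate = 5_000_000
genes_per_cell = 20_000       # ~ protein-coding genes in the human genome

replicates = patients * groups_per_patient * replicates_per_group
cells = replicates * cells_per_replicate
matrix_entries = cells * genes_per_cell

print(f"{replicates} replicates, {cells:,} cells total")
print(f"{matrix_entries:,} entries in the cell-by-gene matrix")
# 18 replicates, 90,000,000 cells total
# 1,800,000,000,000 entries in the cell-by-gene matrix
# (sparse in practice, but still enormous on disk)
```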
So how do we go from all of that data, and consider not just the numerical values but the actual size on disk, to the final, nice-looking, publication-quality figures? There are multiple ways to analyze these types of experiments. I'm not going to go over each of these different components; you can read these steps if you'd like, and I've put the logos of the common tools here on the side. The takeaway is that there's really no single best way to do this. At each of these steps you can swap out different components and run the pipeline in different ways, but more or less you're going to get the same answer. That means if a gene is really over-expressed or under-expressed in a tumor, and it's statistically significant and actually occurring, you're going to find it regardless of which tool or method you use.

The point, though, is that all of these methods are command-line or scripting based, which means the typical wet-lab scientist carrying out this experiment is probably not well versed in how to analyze this data. That's a huge roadblock to analyzing their own data. The other thing to consider is that, as I said, at each of these steps you can use different tools. When you have a pipeline, the output coming from one of those tools can't necessarily be fed straight into the next component. You have to do some modifications or pre-processing so the next component can actually make meaningful sense of that data. If you're not careful and put in something that's not compatible, it may still run, and you'll get a skewed, statistically non-meaningful, or incorrect output, which can really affect your downstream analyses.

So those are the difficulties and barriers I've encountered in NGS pipelines, and not just for single-cell sequencing but for all types of sequencing: CRISPR screening, ATAC-seq, ChIP-seq, whole-genome sequencing. When you're analyzing all these different experiments, the complexity, the number of samples, the number of replicates, and the type of experiment are going to affect how you analyze it. And no two experiments are identical. You could take the single-cell sequencing experiment I showed and run it for the same type of cancer from different patients, and the analysis may not be the same process. You may encounter issues with the samples, issues with the replicates, sequencing errors. So you need to be able to adapt and be dynamic, not rigid, in the way you analyze these.

The other thing, as I said, is that there are multiple tools, languages, and platforms. Alignments are typically done in bash or shell scripting. You could use R or Python for some of the differential expression analysis. Some steps are run using tools written in Perl or C. If you're a bioinformatician, you know there are multiple tools you have to use. Then there are the compute resources involved: CPU, RAM, and disk space. The larger your experiment, and the more samples and replicates it has, the more of these resources you'll require, so sometimes these things can't be run on a local machine. And each of these contributes time, whether it's the time spent learning how to run the pipeline or just the sheer amount of time required to run some of the steps.

So how do you solve for this? Obviously you can automate the pipelines, but crudely automating scripts isn't necessarily going to work that well. That still creates some rigidity, and as I said, you need to be flexible for the uniqueness of different experiments. Instead, you can provide a modular framework, which means different modules can be shuffled in and out depending on the experiment and on the issues encountered during the analysis. You can also abstract the languages and tools away from the user: one overarching programming language, like Python or R, calls all the other languages and tools required, so the user is removed from whatever languages are being used for the analysis, and all of that is handled in the background. For the compute resources, you can make your tool cloud-based, using something like AWS. That makes it really easy for the user, because they don't have to worry about the number of CPU cores, the amount of disk space, or the amount of RAM they have. Everything is done in the cloud, so they can run it on virtually any local machine. And by using parallel processing, we can really cut down the time it takes to analyze these complex experiments.
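As a rough illustration of what "modular" means here, the sketch below registers swappable pipeline steps behind one common interface, so steps can be shuffled in and out per experiment. All names are hypothetical; this is not the platform's actual API.

```python
# A minimal sketch of a modular pipeline: each step is a named, swappable
# module with a uniform signature. Hypothetical illustration only.
import math
from typing import Callable, Dict, List

Step = Callable[[dict], dict]      # a module maps a data dict to a data dict
MODULES: Dict[str, Step] = {}

def register(name: str):
    """Register a module under a name so the framework can swap it in or out."""
    def wrap(fn: Step) -> Step:
        MODULES[name] = fn
        return fn
    return wrap

@register("normalize.log1p")
def log_normalize(data: dict) -> dict:
    data["counts"] = [[math.log1p(x) for x in row] for row in data["counts"]]
    return data

def run_pipeline(data: dict, steps: List[str]) -> dict:
    # The step list is what a GUI would assemble from the user's selections.
    for name in steps:
        data = MODULES[name](data)
    return data

result = run_pipeline({"counts": [[0, 5], [10, 2]]}, ["normalize.log1p"])
```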
So we're building a solution that is a point-and-click solution: a graphical user interface tool to analyze a multitude of NGS experiments, and we're beginning with single-cell sequencing experiments. It's a point-and-click interface, and the front end is built using React. The user can point and click, upload their data, the raw read files that come off the sequencing platform, and customize all the parameters of the experiment: things like the type of experiment, how many samples and replicates they have, any cut-off thresholds, any statistical parameters. Anytime there are several options for a particular step, such as clustering or normalization, all of these can be customized by simply selecting from a list. They don't need to know the command-line arguments to define those parameters. Once that's done, an automated framework is assembled based on the parameters, and this runs the actual analysis. This is really the meat of the tool, if you will.

Our framework uses Python as the overarching language, and then through things like reticulate and basilisk there's interoperability with R, so the two work together. We also respect the Bioconductor framework: we follow Bioconductor principles for analyzing and working with genomic data and the different types of genomic data objects. The back end is handled via Flask, which is what allows us to quickly and efficiently shuttle data from the back-end analysis into the front end. And for the cloud, of course, we use things like EC2 for instances, EFS and S3 for data storage, and ECS for containerizing the different modules. So you can see all the technologies we're using.

Okay, so what does this actually look like? It's easy to put this in words, but what does it look like when we build it? This is when you see how complex it can really be. Here's a more detailed, but still simplified, flowchart of how the platform works. We have the automated framework that's put together after the user defines all the parameters, and an S3 bucket where the user's raw read files are deposited. An EC2 instance is spun up depending on the size and parameters the user defines, and this does the read alignments in parallel. For a single-cell sequencing experiment, this means looking at the unique molecular identifiers from each sample and replicate and at the read counts per gene at the individual-cell level. At the same time, some demultiplexing happens to identify which replicate and treatment group each of those cells belongs to: for example, the saline group or the drug-treated group, which mouse it's from, mouse 1, 2, or 3, and which patient it's from, patient 1, 2, or 3. All of that data is obtained, some quality control is performed automatically, and the result is saved in an intermediate data-storage bucket. Then all of these read counts are integrated and unified into a Seurat object in R, which prepares them for single-cell sequencing analysis. From there, in a modular approach, normalization is performed, some type of filtering is performed, variable features are identified, and different types of clustering can be done, also on an EC2 instance. The results are then stored in an R object and saved into a final results bucket in S3.
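To make the S3 and EC2 plumbing just described concrete, here is a minimal sketch using boto3. The bucket names, AMI ID, instance type, and user-data script are hypothetical placeholders, not the platform's actual configuration.

```python
# Hypothetical sketch: deposit raw reads in S3, then spin up an EC2
# instance sized to the experiment to run the alignment module.
import boto3

s3 = boto3.client("s3")
ec2 = boto3.client("ec2")

# 1. Deposit the user's raw read files into the raw-data bucket.
s3.upload_file("sample1_R1.fastq.gz", "ngs-raw-reads", "run42/sample1_R1.fastq.gz")

# 2. Launch an instance; in practice a user-data script would pull the
#    alignment container and point it at the bucket.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="r5.4xlarge",         # sized from the user's parameters
    MinCount=1,
    MaxCount=1,
    UserData="#!/bin/bash\necho 'launch alignment container here'",
)
```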
Then there's an interactive visualization component, which is also run via EC2 automatically. Here the user can explore the data in real time, customize their outputs, and generate high-quality, publication-ready figures. To give you an idea of the different technologies being used, here are the logos of each. As I said, the overarching tool is built on a Python back end, and that Python code calls different modules, which can themselves be Python or R. In some cases, like here for normalizing, filtering, and clustering, we can use either R or Python; it depends on which is more performant for the type of data and which statistical methods the user wants to run. And the final interactive platform is a mix of Python, R, React, and Flask.

This solution is a work in progress. Our goal is to follow this framework to create a point-and-click analysis tool for a multitude of NGS experiments. We're starting with single-cell sequencing, but this can be expanded to other things like ATAC-seq, ChIP-seq, whole-genome sequencing, and CRISPR screening, and to integrating different modalities together. The homogeneous framework allows the integration of these different things, such as single-cell sequencing and CRISPR screening, for example.

In the interest of time, I'm just going to summarize. I showed you a very high-level look at a point-and-click analysis solution we're building, which leverages different web-based technologies to abstract the user away from all the complexities of analyzing the data. By using a modular approach, it's scalable across different experiments, so it can be tweaked to analyze very unique experiments. That also inherently makes reproducibility easy, because the framework can be saved as a log file, so you can easily go back and reproduce a run later on. It increases efficiency, it scales to the ever-growing sizes of NGS experiments, and the full pipeline is automated. This really reduces the barriers to entry for NGS analysis and opens the door for many people to analyze their own data, even if it's just a cursory look before doing detailed analysis with seasoned bioinformaticians.

I'd just like to acknowledge our team, which is basically myself and Megan. Megan Chang is an R and Python developer by title, but I like to consider her a multi-technology extraordinaire, because I can give her seemingly impossible tasks and she comes back with amazing results. If it weren't for her, this project would still be an idea on a whiteboard somewhere. With that, I'll stop the presentation so I can see the window again.

Thank you so much, Piro. I think we have about two minutes for questions. I don't see any new questions in the Q&A, so I'm going to kick it off. I think the barrier to entry for doing NGS analysis is a completely painful gap in NGS, and omics analysis in general, and it unnecessarily delays a lot of research, because you have to hand things off to your bioinformatics team; you can't do it yourself, or at least it's very difficult. So I think this is very welcome. What I'm wondering about, though, is reproducibility, and whether a point-and-click kind of interface is the right way to do this if you want a reproducible workflow.
So my question is: do you think there should be a sort of opinionated, tidyverse-inspired modular grammar for omics analysis? I'm thinking of something along the lines of tidymodels, where it's modular and wraps a ton of existing functionality in a consistent API, and it seems like this process would be amenable to pipes, right? It would also be great to have outputs like tidy tibbles that are easy to work with in ggpubr, gtsummary, dplyr. So I'm wondering whether that's something you think should be done, whether that's another way you could think about this, and what your thoughts are on having a tidy grammar for this, so you can do this point-and-click stuff programmatically.

Yeah, that's a great question, and I think it's something you need to be cognizant of when developing these kinds of tools. There are several ways to tackle it, and tidymodels is a great example. If we had, let's say, something called tidygenomics, and I'm thinking on the spot here, something that defines how these different data objects are processed, defines the different methods, and acts as an API like you said, that would make it really scalable for all the new NGS modalities coming out. As sequencing gets cheaper, it's going to encourage people to incorporate larger and larger numbers of samples and replicates and to integrate many different modalities, so I think something like that will be important.

The way we're handling it is a somewhat different approach: controlled containers. We're using the Bioconductor framework, and for the reproducibility aspect you touched on, we're working with Docker containers that have locked versions of the different R and Python packages. Based on the log file that's created when you run the tool, if you go back and load that log file, it will spin up the correct container to make the run reproducible. As a biochemist and bioinformatician, these kinds of point-and-click tools are something I've been working on for a while, and packages can update and really break everything, and then it's not so reproducible. So yes, that's something really important you have to consider.
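To illustrate the log-file-driven reproducibility described in that answer, here is a minimal sketch. The log format, image name, and container entrypoint are hypothetical illustrations, not the platform's actual design.

```python
# Hypothetical sketch: a run log pins a container image whose R and Python
# package versions are locked; replaying the log re-runs that exact image.
import json
import os
import subprocess

with open("run_2024-03-01.log.json") as fh:
    run_log = json.load(fh)
# e.g. {"image": "ngs-platform/sc-rnaseq:1.4.2", "params": {...}}

subprocess.run(
    [
        "docker", "run", "--rm",
        "-v", f"{os.path.abspath('data')}:/data",  # mount the original inputs
        run_log["image"],                # pinned image -> locked package versions
        "run-pipeline",                  # hypothetical container entrypoint
        "--params", json.dumps(run_log["params"]),
    ],
    check=True,
)
```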