Thank you very much. This talk is about the bioinformatics pipeline we're developing at ETH Zurich. First, a quick introduction. My research interests include data analysis pipelines, bioinformatics, and machine learning; we are developing new methods to explain certain biological data sets, and I previously worked on recommender systems, so that is also one of my interests. My GitHub, LinkedIn, and Twitter information is on the slide. The outline of the talk: I will first introduce the problem and give you some basic biological background. Then I will talk about so-called DNA mutation trees, our model for representing mutations on the DNA. Finally, I will talk about the pipeline and the bioinformatics work we are doing to address cancer research. So, the biology background. I'm not a biologist; both my master's and bachelor's were in computer science, so don't worry, I can't go into much detail when it comes to biology. We work on cancer research, and our data comes from hospitals: patients go there, get their tissue sequenced, and then we analyze it and provide a report on which the clinician can base treatment decisions. First, let me explain what a cell is. A cell is the smallest living unit of the human body. Cells come together to form tissues; a muscle tissue, for example, consists of different muscle cells. Tissues come together to form organs, organs form systems, and systems form the organism. That's all the high-school biology required for this talk. When we want to analyze cancer tissue, the operation is called sequencing, and the previous technologies could only retrieve information at the tissue level.
The biological sample arrives at the lab and gets sequenced, and as a result we have digital information about it that we can run our analysis on. The previous, tissue-based sequencing technologies were not able to detect the heterogeneity among different cells: in a tissue you just get the average over the cells, so you see the most dominant mutation and ignore all the others. As a result, when you treat the most dominant mutation, sometimes the other ones pop up later. This is why single-cell sequencing is important now: the new technology lets us work at single-cell resolution. Inside each cell there is DNA, which contains the genetic information, and DNA sometimes carries mutations; some of those mutations are known to be associated with certain diseases. In this talk I will describe our efforts to model those mutations. The mutation type we consider comes from the family of structural mutations and is called copy number variation. In the figure, there is a blue segment on one genome, and in the second figure it gets duplicated: a variation in the copy number. On the left there was just one copy of the blue segment; on the right there are two. Copy number variations can be either duplications or deletions, and we analyze them at single-cell resolution. These mutations have a family-like relationship: when a mutation happens, other mutations can descend from the parent one, so there are child mutations, sibling mutations, and ancestors as well. We therefore model them as a tree, not necessarily binary, that represents the mutation information.
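The tree of mutations just described can be sketched in a few lines. This is a toy illustration under my own assumptions, not the actual implementation: each node stores its parent and a dictionary of copy-number changes per region, and a cell attached to a node gets the neutral two copies plus all changes accumulated on the path from the root. All names here are made up.

```python
# Toy mutation tree: each node stores its parent and a dict of
# copy-number changes per genomic region (e.g. {"R1": +1}).
tree = {
    "root": {"parent": None,   "delta": {}},
    "n1":   {"parent": "root", "delta": {"R1": +1}},
    "n2":   {"parent": "n1",   "delta": {"R2": -1}},
}

def copy_number_profile(node, regions=("R1", "R2", "R3")):
    """Accumulate deltas from `node` up to the root; every region
    starts from the two copies of a normal diploid genome."""
    profile = {r: 2 for r in regions}
    while node is not None:
        for region, change in tree[node]["delta"].items():
            profile[region] += change
        node = tree[node]["parent"]
    return profile
```

For example, a cell attached to `n2` inherits the extra copy of R1 from its ancestor `n1` plus its own deletion in R2, giving the profile `{"R1": 3, "R2": 1, "R3": 2}`.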
So the genome is divided into regions, which are meaningful parts of the DNA associated with certain functionality. Here, the root node has no mutations. The other nodes, I'm not sure how readable they are, represent copy number profiles. This one says R1 +1: in region 1 there is one extra copy, and all of the single cells represented by this pink circle will have one extra copy of that region. It goes on like this. So it's a tree where each node holds a dictionary of regions with the extra or missing copy number information, and this tree is what we are trying to learn. We have a machine learning model to learn the best-suited tree for a given cancer sample, and we learn it using a Markov chain Monte Carlo (MCMC) scheme. For each tree we have a means of scoring it, a Dirichlet-multinomial model; I won't bore you with the formulation, but we can discuss it after the talk if you're interested. So for a given tree and the sample data, we have a score telling us how likely the tree is given the data. Using the MCMC scheme, we move from one tree to another and score the next tree as well; if it is significantly better than the first one, we discard the previous tree and update the model with the new one, and we continue like this. It usually takes a lot of iterations; on real data it is millions. The bullet points are the MCMC moves we defined, and I will talk through them. The first one is prune and reattach: we randomly pick one node from the tree, here the brown one, which happens to have two children; we prune it and then reattach it somewhere else in the tree at random.
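The propose-score-accept loop can be sketched with a toy Metropolis scheme. To keep this short I use a flat copy-number profile instead of a tree, a made-up squared-error score instead of the actual Dirichlet-multinomial model, and a made-up nudge move; only the accept/reject mechanics mirror what the talk describes.

```python
import math
import random

random.seed(0)

regions = ["R1", "R2", "R3"]
observed = {"R1": 3, "R2": 2, "R3": 1}   # made-up "data"

def log_score(profile):
    """Toy stand-in for the real tree score: higher when the
    profile matches the observed copy numbers."""
    return -sum((profile[r] - observed[r]) ** 2 for r in regions)

def propose(profile):
    """Toy move: nudge one region's copy number up or down."""
    new = dict(profile)
    r = random.choice(regions)
    new[r] = max(0, new[r] + random.choice([-1, 1]))
    return new

current = {r: 2 for r in regions}        # start from a neutral profile
best = dict(current)

for _ in range(2000):
    candidate = propose(current)
    log_ratio = log_score(candidate) - log_score(current)
    # Metropolis rule: always accept improvements, sometimes accept
    # worse states so the chain can escape local optima.
    if random.random() < math.exp(min(0.0, log_ratio)):
        current = candidate
        if log_score(current) > log_score(best):
            best = dict(current)
```

After enough iterations, `best` converges to the profile that matches the observed data; the real model does the same walk over trees rather than flat profiles.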
We just prune something at random and attach it somewhere else, then score the new tree as well; if it is significantly better than the previous one, we keep it, discard the other, and continue. Another move is called add/remove node: we pick a node, randomly generate a new node as its child, and check whether the resulting tree is better. Another one is condense and split: we pick a node together with its parent and condense them into one, so the child gets swallowed into the parent and they become a single node, and then we check whether this tree represents the data better. In fact, this move could be achieved by a sequence of add and remove node moves, but we have it to help with convergence, because via insertions and deletions alone we would need too many iterations; here it is simpler. Each move in the Markov chain is reversible: from the tree on the right-hand side we are able to go back to the tree on the left, and we explicitly make sure that moving from the left tree to the right is as probable as moving from the right tree to the left. Otherwise the Markov chain would not be in balance and that would introduce biases. This tree here is one we learned from real data, from a mouse brain. We again started with a random tree, and after millions of iterations this is the tree that best explains the evolution of the mouse brain tumor. On the right-hand side, below is the original data matrix we get after the sequencing experiment, and the figure above shows how we can reconstruct it from the evolutionary tree we learned. So this is how we model the heterogeneity in a tumor.
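The prune-and-reattach move has one subtlety worth spelling out: the pruned subtree must not be reattached below itself, or the result is no longer a tree. Here is a minimal sketch under my own assumptions (toy node names, tree stored as a child-to-parent map), not the actual C++ implementation.

```python
import random

random.seed(1)

# Toy tree as a child -> parent map; "root" is the mutation-free root.
parent = {"A": "root", "B": "A", "C": "A", "D": "root"}

def subtree(node):
    """All nodes in the subtree rooted at `node`, including itself."""
    out = {node}
    changed = True
    while changed:
        changed = False
        for child, par in parent.items():
            if par in out and child not in out:
                out.add(child)
                changed = True
    return out

def prune_and_reattach():
    """Pick a random non-root node, detach its subtree, and reattach
    it under a random node outside that subtree (stays acyclic)."""
    node = random.choice(list(parent))
    forbidden = subtree(node)          # can't attach below itself
    candidates = [n for n in list(parent) + ["root"]
                  if n not in forbidden]
    parent[node] = random.choice(candidates)

prune_and_reattach()
```

After the move, every node can still walk up its parent chain to the root, which is exactly the invariant the restriction protects.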
But in real life we have many more requirements on top of the model. The first is reproducibility of the research: we want any other research institute to be able to just read our paper and reproduce the results. The second is scalability, because in genomics the cost of sequencing has been decreasing faster than Moore's law over the past ten years. As a result, more genomic data is produced every day, and the growth rate is exponential; there is simply too much genomic data, so bioinformatics methods have to be able to scale. Another requirement: we often use multiple programming languages. The tree model, the MCMC part, was built in C++ since it requires many iterations and therefore performance; like many machine learning frameworks out there, we also decided on C++ for this. But many other parts are written in Python, and sometimes we even need R, because certain statistical methods are only implemented there. Then multiprocessing: the data is so large that I cannot run the experiments on my local machine, so we use computational clusters, and since we have multiple nodes we try to make heavy use of multiprocessing. Cluster execution is needed for two reasons: first, there is not enough memory on my local machine; second, there is not enough time, because we need to run things in parallel. And resource management: for each bioinformatics task, we need to define the memory, time, and disk space requirements in order to utilize the cluster well, and we often need to look at statistics about resource usage to better tune the cluster execution. To achieve all this we use, among many things, a workflow management system. The one we use is called Snakemake; it follows the paradigm of GNU Make but has a Pythonic syntax.
I like this figure; I took it from one of the previous Snakemake talks somebody gave, I believe. It explains the idea in a small diagram. Snakemake is a workflow management system in Python. A workflow consists of different programs with dependencies between each other: some provide output to others, some run in parallel without depending on each other, and in the end they get merged and connected. Snakemake is provided as a Python package, so you can just install it, and it has exactly the Python syntax with a few extensions on top. It follows the GNU Make paradigm, which is well established. Workflows are defined in rules, and each rule tries to create its output files given its input files; the workflow management system automatically infers the dependencies between different rules. By using Snakemake, we can make use of all the existing Python libraries. With other workflow management systems, when I need some Python functionality inside the workflow, I have to write a Python script and make it executable so that I can access it from the shell; in Snakemake you can use all the functionality of Python as it is, without wrapping it into separate scripts. Then there is automated logging of status. Since workflows consist of multiple programs, sometimes implemented in different languages, when something crashes you need to know which part crashed and why, and, if possible, you may want to continue with the rest of the workflow or stop there. Logging is very important here, and Snakemake automatically logs all the errors, warnings, and the status of each rule.
Snakemake came out of the bioinformatics domain, but it is a general-purpose workflow definition language, so it can be used in any domain; it is not domain-specific. I will show you some example syntax here. A rule is basically a task that needs to be done. A rule may run a shell command or Python code itself, and I believe there is support for R scripts as well. This rule takes two inputs, one called genome.fa and the other a FASTQ file. Once these two inputs are provided, the rule automatically executes the shell command and produces the output. If it fails to produce the output, it crashes; otherwise the rule succeeds and the next rule may begin. In the shell command there are these curly braces: they are a way of passing the input files into the shell command, so that you don't have to do extra work to invoke the same command by hand, and the output placeholder serves the same purpose. And here is one extra feature of Snakemake: wildcards. The {sample} between the curly brackets is a wildcard, and I like this feature very much, because without a workflow management system this is very hard to achieve. Let me explain what it does. The second input line looks at the data/samples directory and finds all of the FASTQ files matching certain criteria; this can be a regular expression or just anything, like A.fastq, B.fastq, C.fastq. Then, for each of those input files, the rule creates the output containing the same wildcard value, and the shell command gets executed once per input. So without changing a line of code we can make the workflow scale just by using those wildcards, and wildcards can even be used across different rules.
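The rule on the slide is presumably similar to the read-mapping example from the standard Snakemake tutorial; a minimal sketch could look like this (the rule name, directory layout, and the bwa/samtools command are illustrative, not taken from the speaker's pipeline):

```python
# Sketch of a Snakemake rule with a wildcard; {sample} is filled in
# once per matching file, and {input}/{output} are substituted by
# Snakemake before the shell command runs.
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
```

With files A.fastq, B.fastq, and C.fastq present, requesting mapped/A.bam, mapped/B.bam, and mapped/C.bam makes Snakemake instantiate this one rule three times, in parallel where possible.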
For example, whatever is created by a Java tool can be consumed under the exact same wildcard in a Python rule, with the output produced and the dependencies handled automatically. Speaking of dependencies: before Snakemake runs, it creates a directed acyclic graph (DAG) of the jobs. This one was from one of our simulation studies. At first, the first few rules are executed in order: the first one finishes, the second one begins, and so on. But at some point there are multiple jobs at once, because they have no dependencies between them and can run in parallel. Likewise, the rules in the last row may also run in parallel, each depending only on its predecessor, and afterwards there is an aggregation step at the end. It is basically similar to the MPI paradigm: you run things distributed, then you aggregate them. This DAG of jobs is created automatically; we don't need to say, first do this rule, then the second, then the third. There is a way of forcing the workflow management system into a fixed order, but from the inputs and outputs alone it automatically detects the DAG of job execution. Now I will show you a more realistic Snakefile, a complete Snakefile for a very basic example, to show you how similar it is to Python syntax; it is basically Python, in fact a Python library. In the first lines, we import some Python modules. They can be built-in modules or custom modules, like the secondary analysis here; this is just regular Python. Later there is the config, the configuration file. A program often has parameters, and a workflow is a set of programs, so there are even more parameters; it is therefore common practice to keep configuration files separate from the main workflow. In Snakemake there is a dictionary called config for this.
It is built in: when you invoke Snakemake, you can just specify a config file and it will be parsed automatically. This way you keep the workflow and the config separate. The rest is just Python: you can have any Python functions, list comprehensions, all the usual syntactic constructs. The difference is the rule. There is a rule with input and output like in the previous example, but instead of shell there is a run block, which accepts Python code. We have some Python code here that goes through the files in one directory, does some work there, and in the end creates this file, which happens to be the output. So this is how simple it is compared to other workflow languages: you stay within the scope of Python and can make full use of it. Yes, please. The question is whether you can mix the two styles, shell commands and inline Python. Yes, you can mix those two, exactly; good question. Personally, I usually use the Make-style syntax and call everything from the shell, even if it is Python: I may have Python classes, but I will write an executable in Python and call it from the shell. That way I can better manage the outputs, the logs, standard error, and the warnings, because if something crashes, the Snakefile will terminate, and this way it is much easier to deal with. And here is an example config file. Snakemake supports two config formats, JSON and YAML. YAML has the advantage of allowing comments, while JSON has the advantage of being easily serializable. I often create the JSON files from Python dictionaries to automate certain tasks, so in this example I use JSON, but you can use any config file and any other Python parser for the config; these are just the two formats supported natively by Snakemake.
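Generating the JSON config from a Python dictionary, as just mentioned, takes only the standard library. The parameter names below are hypothetical placeholders, not the pipeline's actual configuration:

```python
import json
import os
import tempfile

# Hypothetical pipeline parameters; the names are illustrative only.
config = {
    "n_iterations": 1_000_000,
    "input_dir": "data/samples",
    "regions": ["R1", "R2", "R3"],
}

# Write the file that Snakemake can later load as its `config` dict
# (e.g. via `snakemake --configfile config.json`).
path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as fh:
    json.dump(config, fh, indent=2)

# Round trip: JSON maps cleanly back onto the Python dictionary,
# which is the "easily serializable" advantage over YAML.
with open(path) as fh:
    loaded = json.load(fh)
```

The same dictionary could be dumped to YAML instead if comments in the config file are worth more than easy serialization.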
On execution: Snakemake is configurable with the LSF scheduler, you can just pass it along, and given that you define the resources for each job in the config file, this much memory for this job, that much memory for the second job, Snakemake will automatically create sub-jobs on the cluster. That way you can specify how much memory, runtime, or disk space each job gets. I'll be very quick now. Another technology we use is HDF5, the hierarchical data format. The files are binary and easy to manipulate, which comes in very handy in genomics, because we make heavy use of metadata and HDF5 stores the data alongside its metadata. And since the pipeline uses C++, sometimes Python, and sometimes R, we need a common serialization format: we cannot import pickle in C++ or vice versa. This is another use of HDF5 for us: all three languages can load the exact same binary file. HDF5 also allows us to connect to a dataset and perform operations on a subset of it without loading everything into memory, which is also quite handy. This is the last slide, or the one before last. In Python there are two wrappers: PyTables, a very nice high-level wrapper of HDF5 that interacts with pandas, and h5py, which stays very close to the C++ API. So that was the outline: the problem at hand, our statistical model, and the bioinformatics parts, the pipeline and the tools we use to deal with it. As for future work, we will publish the statistical method first, compare it to other methods, run evaluations on simulated data, and demonstrate it on real data.
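The two HDF5 properties mentioned, metadata stored next to the data and partial reads without loading the whole dataset, can be seen in a few lines of h5py. This is a generic sketch with made-up file, dataset, and attribute names, not the pipeline's actual schema:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "demo.h5")

# Write a small dataset plus an attribute: in HDF5 the metadata
# travels inside the same file as the data.
with h5py.File(path, "w") as f:
    dset = f.create_dataset("counts", data=np.arange(20).reshape(4, 5))
    dset.attrs["genome"] = "mm10"   # illustrative metadata

# Read back only a slice; h5py fetches just that subset from disk
# instead of loading the whole "counts" dataset into memory.
with h5py.File(path, "r") as f:
    sub = f["counts"][1:3, :2]      # rows 1-2, first two columns
    genome = f["counts"].attrs["genome"]
```

The same file is readable from the C++ HDF5 API or from R, which is exactly the cross-language role it plays in a mixed C++/Python/R pipeline.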
Later, this pipeline will be wrapped up, we will do all the bindings to other languages, and we will provide it open source on GitHub. That concludes the talk; thank you very much for your attention. For the Q&A, if anyone has any questions, please go to one of the microphones. Yes, please, perhaps that microphone, the closest one to you. Question: if I understand correctly, Snakemake is compatible with Singularity, so is your project using Singularity or not? I'm just wondering. Answer: I don't know what Singularity is. It's a kind of container technology. I don't know, but maybe; it's a bit off topic, so maybe we can discuss it afterwards. Thank you very much. Thanks. Okay, if no one else has questions, lunch will be served soon, and I think everyone is hungry, so let's give a round of applause.