Hello everyone and welcome to today's bytesize talk. I'm very happy that with me is Simon Heumos from QBiC in Tübingen, and he is going to talk about an nf-core pipeline called nf-core/pangenome. Over to you.

Thank you very much for the introduction, Francesca, I am happy to be here. I will talk about cluster-scalable pangenome graph construction with nf-core/pangenome. Thanks to advances in sequencing technologies, very high quality genome assemblies will basically become the default in the near future, and this offers us an opportunity to study genomic variation as never before. However, how can we represent and work with hundreds of genomes at the gigabase scale? One solution here could be a so-called pangenome. It is able to model the entire set of genomic elements in a population, in contrast to reference-based genomic approaches, where we relate all sequences to one linear consensus model of the genome. A pangenome relates each new sequence to all other sequences represented in the pangenome, and in particular this helps us to mitigate reference bias.

Now, what can such a pangenome data structure look like? One way is to encode it in a graphical data structure. As you already know, in the linear genome world we take the reference genome and augment it with variation. A pangenome graph, however, can compress the shared and variant sequences into one graphical representation. In these pangenome graph models, DNA sequences are incorporated as nodes, with edges connecting the nodes in the order in which they occur in the sequences represented in the graph. That is basically the visualization of variation that you can see in this figure. If sequences represent identical regions, such as paralogs or orthologs, then they will share the same nodes, while variants are added as new branches in the graph, like the insertion, single-letter change, or inversion you can see here. We can also visualize such a pangenome graph in a tube-map-like way.

Let's stay here for a moment to get a better idea of a specific pangenome graph model, or implementation, called a variation graph. Here the idea is that all our sequences are paths that walk through the nodes, and these paths can be contigs, haplotypes, reads, or even whole chromosomes. In this example you can see two genomes: they share some sequence, so both visit node number one, but then their sequences diverge and they visit different nodes. This is a good visualization; however, if we have hundreds of genomes it can become hard to read, so the idea is that we can project this into a 1D visualization. What we do is concatenate the nucleotides of all nodes into a so-called pangenome sequence and write it from left to right, which you can see below in this figure. Then, with a binary matrix, we can encode the genomic sequences of both genomes: if the sequence is present within a genome, we draw the color, and if not, nothing is drawn.
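To make the path and binary-matrix idea a bit more concrete, here is a minimal Python sketch; it is purely illustrative, with made-up node sequences and path names, and is not code from the pipeline:

```python
# Toy variation graph: nodes hold sequence, paths (genomes) are ordered walks
# over node IDs. Node contents here are made up purely for illustration.
nodes = {1: "ATG", 2: "C", 3: "G", 4: "TTA"}   # node ID -> sequence
paths = {
    "genome_A": [1, 2, 4],   # visits node 1, then the 'C' branch, then node 4
    "genome_B": [1, 3, 4],   # shares nodes 1 and 4, diverges at node 3
}

# Concatenate all node sequences from left to right into the "pangenome sequence".
offsets, pos = {}, 0
for node_id in sorted(nodes):
    offsets[node_id] = pos
    pos += len(nodes[node_id])
pangenome_length = pos

# Binary matrix: one row per path, one column per pangenome-sequence position.
matrix = {name: [0] * pangenome_length for name in paths}
for name, walk in paths.items():
    for node_id in walk:
        for i in range(len(nodes[node_id])):
            matrix[name][offsets[node_id] + i] = 1   # a colored pixel in the 1D plot

for name, row in matrix.items():
    print(name, "".join(map(str, row)))
# genome_A 11110111
# genome_B 11101111
```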
Now the question is how we can scale this up to potentially gigabase-scale pangenome graphs, and the key idea here is binning. Here you can see a large-scale 1D pangenome graph visualization of the HLA-DRB1 gene from the HLA region in human, and what we are doing here is basically the same: we again arrange the nodes of the graph from left to right to define a pangenome sequence, and if there is coloring, then the path has that sequence, but if there is no color, then the path does not have that sequence. The path names are placed on the left, and what we are now missing are the edges, but we draw them as black lines on the paths to at least hint at the graph topology. The key point is binning: one pixel now represents hundreds or even thousands of nucleotides, so we bin them together, and that's how we can visualize these large graphs. What is not clearly understandable here is the graph topology, so there is also a way to visualize these graphs in two dimensions. This is extremely important to easily grasp large structural variation, but also to take a look at certain kinds of bubbles, which can indicate regions where paths highly diverge, or which may hint at repetitive loci. These are the key concepts needed to understand all the visualizations that are coming in this presentation.

Now the question is how we can build such graphs. What already exists is the so-called PGGB algorithm, the PanGenome Graph Builder pipeline, and it basically comes with three major steps. The first one is the all-versus-all alignment step: as input we have our sequences in FASTA format, and then we do an all-versus-all alignment, which is quadratic, so it can be quite heavy; here you can see a visualization of what you get out of it. Once we have these alignments, we can use seqwish to fold this alignment graph together into a variation graph. The idea is that when the aligned sequences share certain subparts of the pairwise alignments, we can collapse these together into a node of the variation graph. The third major step is a normalization step. Here we take this raw variation graph and sort it in one dimension, so that the 1D layout is guided by the nucleotide path positions given in the graph, and we then normalize it by applying local MSAs across windows of this pangenome graph. What we obtain in the end is a smoothed graph. In PGGB this is usually done around three times, because at the edges of these windows we also need to ensure that things fit together. Then we remove some redundant nodes, and we are also able to call variants against any path in the graph, so you can choose your reference of choice or any other genome you are interested in. Additionally, the tool provides graph statistics and 1D and 2D visualizations, which are all aggregated in a MultiQC report.

The PGGB algorithm was already used to build a draft human pangenome reference graph, which was published in Nature this year, I think, and therefore it is already really well tested. However, it comes with the caveat that currently it is a bash script. There is also one huge Docker container for it, but it is only able to utilize one node at the moment, so it is hard to scale up. And that's why I chose to implement the nf-core/pangenome pipeline. Here in the middle is its core workflow, and it is basically directly derived from what I just showed you, the PGGB workflow.
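To make the second step a bit more tangible, here is a toy Python sketch of the graph-induction idea; it is my own illustration with made-up sequences and match coordinates, not how seqwish is actually implemented:

```python
# Toy sketch: exact matches from the pairwise alignments "glue" positions of
# different sequences together, and each group of glued positions becomes a
# node of the variation graph.
from collections import defaultdict

seqs = {"A": "ATGCGT", "B": "ATGTGT"}                 # two toy input sequences
matches = [("A", 0, "B", 0, 3), ("A", 4, "B", 4, 2)]  # (seq1, start1, seq2, start2, length)

parent = {}
def find(x):                                          # tiny union-find over positions
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

for s1, st1, s2, st2, length in matches:              # glue matched positions together
    for i in range(length):
        a, b = find((s1, st1 + i)), find((s2, st2 + i))
        if a != b:
            parent[a] = b

groups = defaultdict(list)                            # positions that share a root
for name, seq in seqs.items():                        # end up in the same node
    for i in range(len(seq)):
        groups[find((name, i))].append((name, i))

for members in groups.values():
    name, i = members[0]
    print("node", seqs[name][i], "shared by", members)
```

In the real tool, the input sequences then become the paths of the graph, and runs of such single-base nodes are compacted into larger nodes.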
However, I now implemented some new features to make it faster, or rather more scalable, across the cluster. One major bottleneck when you have a huge number of input sequences are the all-versus-all alignments, because they are quadratic. What is possible now is that we first do an all-versus-all approximate mapping, and then we split these mappings into chunks, let's say for example 20. For each of these chunks we can then run the heavy base-level alignment in parallel across nodes of the cluster, and this leads to a much shorter runtime for this step. All the following steps are then as already presented.

Another interesting feature is what we call community detection. Say you have an organism as input and you know it comes with, let's say, eight chromosomes, but you input all the sequences from all chromosomes and you don't know in advance which sequence belongs to which chromosome. The idea is that we can sort this out automatically. What we do is an all-versus-all approximate mapping, then we translate this into a network, and on this network we apply the Leiden clustering algorithm to detect all our communities, so that by expectation we get eight communities when we input eight chromosomes. For each of these communities we can then execute the whole workflow again. Once all of them have run through, we can combine them back into one huge graph, so they all end up in one file, and we can calculate the statistics and visualizations of the final graph. That's how the pipeline works in general.

Let's take a quick look at what comes out in the MultiQC report, because I think it's a little bit more customized compared to what you usually see in your daily pipelines like rnaseq and so forth. I implemented an ODGI module for MultiQC, which basically takes the odgi stats table and then neatly visualizes certain features of the graph, like the name, the length in nucleotides, the number of nodes, edges, paths, and the number of components. Assuming we had given as input an organism with eight chromosomes and the community detection worked really well, then you would expect something like eight components here, because you would have eight distinct graphical components in your full graph. You can also see the number of bases, and optionally you can add something like percent N and/or percent GC content if you're interested.

The report also comes with lots and lots of visualizations, so that you can better understand what your graph looks like, whether what you did actually made sense, and whether you maybe need to tweak some parameter. Here on top is a so-called compressed 1D visualization, and the idea is that we collapse all of our paths together into one row and then indicate by color coding, like a heat map, where a huge number of paths actually have that sequence, like here in blue, whereas here in red means that this is a very unique sequence that is not seen so often in the whole graph. When you have thousands of paths in your graph, this can give you a great overview of where the homologous regions are and where the really unique regions of your graph lie. The default 1D visualization I already told you about.
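Coming back for a moment to the community detection mode described above, a rough sketch of the idea could look like the following; it is purely illustrative, assumes the python-igraph and leidenalg packages, uses a hypothetical file name, and is not the pipeline's actual script:

```python
# Rough sketch of the community detection idea: build a network from the
# all-versus-all approximate mappings (PAF format) and cluster it with the
# Leiden algorithm.
from collections import defaultdict
import igraph as ig
import leidenalg

weights = defaultdict(int)
with open("approximate_mappings.paf") as paf:
    for line in paf:
        cols = line.rstrip("\n").split("\t")
        query, target, matches = cols[0], cols[5], int(cols[9])
        if query != target:
            weights[(query, target)] += matches      # edge weight = matched bases

edges = [(q, t, w) for (q, t), w in weights.items()]
graph = ig.Graph.TupleList(edges, directed=False, edge_attrs=["weight"])
partition = leidenalg.find_partition(
    graph, leidenalg.ModularityVertexPartition, weights="weight"
)

# Ideally one community per chromosome; each one gets its own workflow run.
for i, community in enumerate(partition):
    print(f"community_{i}:", [graph.vs[v]["name"] for v in community])
```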
And here the 1D visualization is colored by path position: light gray is a very low nucleotide position within that path, and black is the highest nucleotide path position. What's interesting here is that this path, compared to all the others, actually goes the other way around, so it's the reverse complement. This is because it apparently was assembled in a different way, maybe from the other chromosome strand compared to all the other sequences. Here the information is basically the same: all of these nodes are traversed in forward orientation, except for this one, which is traversed in reverse, and that's why it is colored in red. Another 1D visualization that is in the MultiQC report by default is the visualization by node depth. Gray means that a node is visited once, but red means that this node is visited twice, and this helps us to detect repeat regions or otherwise complex regions in the graph. As we have seen in the statistics table, we have some Ns in the graph, and we can also highlight them in gray; apparently only this path has some Ns, and they are drawn here on top of the graph topology. We also have a 2D drawing in the MultiQC report.

Now let's take a look at some real-life examples where we actually applied the pipeline. I did that for an organism called Lodderomyces elongisporus. It's a very underestimated yeast pathogen: when old people with a weak immune system somehow get infected with it, they can die within days, so this can be quite an issue. Lodderomyces elongisporus, Lodde for short, comes with 8 chromosomes and some mtDNA. The input of the graph that I will show you are 11 assemblies from this year's winter wet lab school, and they were generated from Nanopore and Illumina data. Two of them were fully assembled, and the other nine are basically still at the contig level.

What I first did is apply the pipeline in its community detection mode. The good news was: hey, we have 8 chromosomes plus the mtDNA, and we also get 9 communities, so this is amazing. Also, most chromosomes are linear. However, some of them have these very long thin tails, like here or here, and this is unmapped sequence, which was somehow strange and not ideal. Also, this chromosome B and especially chromosome H are not very linear, so I was not happy with the result. What we tried next is something called reference-guided community detection. Here we took all the reference sequences, which are one sequence per chromosome, and then we mapped or aligned the contigs of all the other assemblies against that reference, and with that we were able to place each of them into their respective community. For each community I then ran nf-core/pangenome. This is the result, and in general most chromosomes now look much more linear. Even chromosome B looks beautiful now. But chromosome H still looks messy. So the next step was to go back to the assembly team, and they told us: hey, we see some interaction between chromosomes C, H, and G. So we just put all of these sequences together into one graphical component, and voilà, we have our beautiful graph here. We currently think that this is due to some rDNA region which somehow forms this interaction between these different chromosomes.
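A rough sketch of this reference-guided assignment idea could look like the following; it is only an illustration with a hypothetical file name, not the pipeline's actual code: map all contigs against a reference that has one sequence per chromosome, then assign each contig to the chromosome it shares the most matched bases with.

```python
# Reference-guided community assignment: sum matched bases per contig and
# chromosome from a contigs-vs-reference mapping (PAF), then put each contig
# into the community of its best-matching chromosome.
from collections import defaultdict

matched = defaultdict(lambda: defaultdict(int))     # contig -> chromosome -> bases
with open("contigs_vs_reference.paf") as paf:
    for line in paf:
        cols = line.rstrip("\n").split("\t")
        contig, chromosome, matches = cols[0], cols[5], int(cols[9])
        matched[contig][chromosome] += matches

communities = defaultdict(list)                     # chromosome -> assigned contigs
for contig, per_chromosome in matched.items():
    best = max(per_chromosome, key=per_chromosome.get)
    communities[best].append(contig)

for chromosome in sorted(communities):
    print(chromosome, "->", len(communities[chromosome]), "contigs")
```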
I also told you something about cluster scalability, so let's take a look at an E. coli graph of over 2000 sequences. Here the quadratic all-versus-all alignment actually becomes a problem. I ran the wfmash approximate mapping and it takes 1 hour 30 minutes; that's really good. However, when I run the wfmash base-level alignment, it takes 1000 times 20 minutes, and it generates over 600 gigabytes of PAF files. This means that, subsequently and unsurprisingly, seqwish ran out of scratch disk space. So I thought maybe I could use our network storage to build the raw graph, but it was so slow that it wasn't doable. Luckily, there is an option in wfmash that allows us to retain only a subset of all the mappings, and it is configured in such a way that, although we get fewer mappings for the base-level alignment step, we will still get one huge graphical component, so we can still get a really decent graph out of it. After doing this, the wfmash alignment took 100 times 5 minutes, so much faster than before and with much fewer resources required. With the default settings, seqwish then complained about not having enough RAM, so there is another parameter there which I decreased by two orders of magnitude; basically the idea is that it uses less RAM but takes longer. And so after five hours, I had my raw E. coli pangenome graph. Smoothing such a huge graph can really take its time: after 62 hours I had done one round of smoothing, and I couldn't do more rounds because of the seven-day time limit on our cluster. So there is still some potential to improve the smoothing tool, smoothxg, itself to make it more cluster-scalable. There are ideas for this, but it directly involves coding in C++, not just some optimization steps here in the nf-core/pangenome pipeline.

You can see the resulting graph here. Again, we have this compressed mode, and you can see a huge number of links: all of these black lines, each of them is an edge, and because we have so many edges in this graph, basically everything becomes black here. That's because the bacteria exchange genes with each other a huge number of times, and they also put the genes at different positions in their respective genomes, and that's why this is such a mess. That also explains this hairball down here in the 2D visualization: there are so many interactions between different regions of the different E. coli genomes that it's really hard to see what's going on. What you can see for sure, for example, is that the majority of the sequence in this pangenome is covered by basically all of the assemblies, and then there are some unique, or roughly unique, sequences in there. You can also see, here on the left-hand side of this 2D visualization, these tiny graph components: these are sequences that did not align to any other sequence in the set.
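To illustrate the idea of retaining only a subset of the mappings while still ending up with one connected graph component, here is a small Python sketch; it is only a conceptual illustration and not how wfmash actually sparsifies its mappings:

```python
# Conceptual sketch: a mapping that connects two previously unconnected
# sequences is always kept, redundant mappings are kept only with a certain
# probability, so fewer alignments are computed but the graph stays connected.
import random
from networkx.utils import UnionFind

def sparsify(mappings, keep_fraction=0.1, seed=42):
    """mappings: (query, target) pairs from the all-versus-all approximate mapping."""
    rng = random.Random(seed)
    components = UnionFind()
    kept = []
    for query, target in mappings:
        if components[query] != components[target]:    # joins two components
            components.union(query, target)             # -> must keep this mapping
            kept.append((query, target))
        elif rng.random() < keep_fraction:              # redundant -> keep a fraction
            kept.append((query, target))
    return kept

mappings = [("s1", "s2"), ("s2", "s3"), ("s1", "s3"), ("s3", "s4"), ("s1", "s4")]
print(sparsify(mappings, keep_fraction=0.5))
```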
I think that's all I wanted to tell you, and I want to thank all of these people here, they all helped me out a lot. And yeah, thanks for having me and thanks for listening. I'm happy to take questions.

So, I am now allowing people to unmute themselves and also to start their video, so if there are any questions, please ask them now. In the meantime, I was wondering: what happens after this pipeline? Let's say you have your pangenome, what are you going to do with the data, and what is it going to be used for?

This is a good question, and I think it also depends on the use case. For example, with the Lodderomyces data you can take a look at all these SVs and how the chromosomes interact, if you're interested in chromosome interaction. So this can help with large structural variation detection, but you can also take the graph and it will improve your mapping, for example in regions that are really diverse in human. If you have short reads and map them specifically to a pangenome graph which already has lots of variation in it, then that can improve your mapping, for example. And you can do all kinds of downstream analysis with a tool called ODGI. You can detect complex regions, or, for example, construct a phylogenetic tree to get an idea of how close the sequences that you built the pangenome graph from actually are. Yeah. Thank you.

Are there any questions from the audience? It seems everyone is happy. Then I would like to thank you again for this really nice talk. I would like to thank the audience for listening in and, as usual, the Chan Zuckerberg Initiative for funding the bytesize talks. So, thank you everyone. Thank you.