Thank you very much. So my name is Pascal Costanza. What he told me is that this was the best-rated rejected talk, and now I'm the worst-rated accepted talk; I'm not sure which is better. This is a talk about elPrep, a tool that we have developed at imec, in joint work with Charlotte Herzeel. You may not know what imec is: imec is a research center with headquarters in Leuven, which is, depending on the time of day, 20 minutes to two hours away from Brussels, and we have sites all over the world. For more than 30 years we've been working on microelectronics and nanoelectronics research, so we're primarily a hardware research facility. If you have a smartphone or a computer or a TV, there is hardware in there that was invented here in Belgium at imec. We work with the big names in the industry, like Intel, Samsung, TSMC and so on. So that's what imec is. But we're also doing software research, because we realize that it is becoming more and more important to develop software and hardware closely together. Here are some of the names that imec collaborates with, and this particular research has been done together with Intel and Janssen, which you probably can't see because the slide is a bit blurred. Okay, so this is about DNA sequencing, and I'll give a little bit of background on what DNA sequencing is actually about. DNA sequencing means you take a blood sample, a real blood sample, or some other tissue sample, and put it in a DNA sequencing machine like the ones you see in the top right corner. The machine applies some chemical processes, turns the sample into smaller fragments, and spits out those fragments, which are called reads. While doing this it makes mistakes, so the reads are not 100% accurate. That's why you need multiple coverage, so that you can use statistical analysis to make a better guess at what the correct read was.
Out comes a file that gives you these fragments, and then you need software to actually do something with those reads. To give you a sense of the sizes: the DNA of a human is something like 3 billion letters, G, A, T, or C; each letter position is called a base pair. The fragments that come out of the machine are on the order of 150 base pairs each. The raw data that is generated is 50 to 120 gigabytes, compressed, for whole-genome sequencing; for exome sequencing it's 5 to 15 gigabytes. So that's the order of magnitude of the data we're dealing with: something like hundreds of millions of reads out of such a machine for one sample. What we then do in the computer is alignment. We have a reference genome, which you see at the top of the screen, and we take these fragments and try to match them against this reference. We can use a reference because we are all not so different from each other; we only differ by about 1%. Matching the reads against the reference gives us a pretty good idea of where each read actually came from. So that's the first step; it's called alignment, and it aligns the reads to a reference. What we then want to do is variant calling, where we look at each position and find out how it differs from the reference. In the example here we have a position where half the reads have an A and the other half have a T, which means it's a heterozygous SNP, A/T. You don't need to understand what that is, and I don't either, but it roughly means that half of these you got from your father and the other half from your mother. So that's what you typically do in software. These are the sequencing pipelines, which typically consist of several computational phases: I talked about the mapping, which you see at the top left, and about the variant calling, which you see at the top right.
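The heterozygous-SNP idea above can be sketched as a simple pileup count: at one reference position, tally the bases of all reads covering it, and call the site heterozygous when a second allele covers a large fraction of the reads. This is only a toy illustration, not elPrep's code (elPrep prepares data for variant calling rather than calling variants itself); the function name and the one-third threshold are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// callSite tallies the bases observed at one reference position and makes
// a naive genotype call: heterozygous when a second allele covers a large
// fraction of the reads, homozygous otherwise.
func callSite(bases []byte) string {
	counts := map[byte]int{}
	for _, b := range bases {
		counts[b]++
	}
	type allele struct {
		base  byte
		count int
	}
	var top []allele
	for b, c := range counts {
		top = append(top, allele{b, c})
	}
	// most frequent alleles first
	sort.Slice(top, func(i, j int) bool { return top[i].count > top[j].count })
	// arbitrary toy threshold: second allele must cover >= 1/3 of the reads
	if len(top) >= 2 && top[1].count*3 >= len(bases) {
		a, b := top[0].base, top[1].base
		if a > b { // report alleles in alphabetical order
			a, b = b, a
		}
		return fmt.Sprintf("heterozygous %c/%c", a, b)
	}
	return fmt.Sprintf("homozygous %c", top[0].base)
}

func main() {
	fmt.Println(callSite([]byte("AATTATTA"))) // half A, half T at this position
	fmt.Println(callSite([]byte("GGGGGGGA"))) // one likely sequencing error
}
```

Real variant callers of course use base qualities, mapping qualities, and statistical models instead of a fixed count threshold.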
And there's also some processing of BAM files, which are the intermediate files that are passed around between these tools. These sequencing pipelines tend to take very long; we're talking about several days of computation. And there are certain file formats involved, and so on. One of the things we did with elPrep is focus on the middle phase, the BAM processing phase, which also takes a huge amount of time but needs less domain knowledge than the other phases, and that was very important for our customer. So we looked at the BAM processing. For BAM processing you again have several tools that each do one simple step: removing reads that could not be mapped to the reference, sorting the reads in certain ways, removing duplicates, because these machines may produce duplicates, and so on. This is organized in such a way that each tool gets an input file, does its computation, produces an output file, and passes it on to the next tool. So we thought: okay, we can do better than that. Why don't we produce a single tool, which is called elPrep, where you can just tell the tool: here are all the steps we want to execute, please do them in one go. This allows us to do the file input only once for the whole pipeline and the file output also only once, and once the data is inside the tool, we can use multithreading and clever loop-fusion techniques to make this much faster. I'm not going to go into the details of how the software architecture works and how we made this faster, but the result is depicted here. This is for whole-exome sequencing on a standard data set that is well known in the bioinformatics community. If we use Picard and SAMtools, that's the top line: about one hour and 40 minutes for this data set, in five steps, which you see in five different colors.
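The fusion idea above can be sketched in a few lines: represent each pipeline step as a filter over reads, compose the filters into one function, and apply that composite in a single pass, so the data is read and written once instead of once per tool. This is a minimal sketch of the principle, not elPrep's actual architecture; the `filter` type and `compose` helper are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// A filter inspects one read and returns the (possibly modified) read
// plus whether to keep it.
type filter func(read string) (string, bool)

// compose fuses several filters into a single one, so the whole pipeline
// touches each read exactly once instead of writing intermediate files
// between separate tools.
func compose(filters ...filter) filter {
	return func(read string) (string, bool) {
		for _, f := range filters {
			var keep bool
			if read, keep = f(read); !keep {
				return "", false // dropped by this step; later steps never see it
			}
		}
		return read, true
	}
}

func main() {
	// Two toy steps: drop unmapped reads, then normalize the rest.
	removeUnmapped := func(r string) (string, bool) { return r, !strings.HasPrefix(r, "unmapped:") }
	upcase := func(r string) (string, bool) { return strings.ToUpper(r), true }
	pipeline := compose(removeUnmapped, upcase)

	var out []string
	for _, r := range []string{"gattaca", "unmapped:nnnn", "ttagc"} {
		if res, keep := pipeline(r); keep {
			out = append(out, res)
		}
	}
	fmt.Println(out) // [GATTACA TTAGC]
}
```

The single loop over reads is also what makes multithreading easy: independent reads can be handed to the composite filter in parallel.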
We can also invoke elPrep separately for each step, and for some of these phases elPrep is a bit faster, for some a bit slower. But the really interesting part is that if we do all five steps in one single invocation, we can reduce the total down to 20 minutes. And there is a mode where we can tell elPrep: please use all the RAM available to you, and then we can cut it down even further, to 11 minutes, which makes it 10 times faster. And 10 times faster for such a pipeline is a huge win. Janssen, for example, the pharmaceutical company we're working with, is now using this as a standard tool. When they go to Amazon Web Services they need to rent a bigger instance, because we need more RAM and more compute power, but overall they're saving money because this is so much faster. So that's interesting, and that's what elPrep is about. Now a bit about the history of elPrep. Originally it was implemented by Charlotte Herzeel, who is also sitting here in the audience, with help from me. And we did this in Common Lisp, because Common Lisp is an excellent tool for getting very quickly from an initial result, when you don't yet know what you're doing, to a production version. That was a very good experience: six months of development time, with lots of design changes along the way, and we got to a pretty stable result. However, now that we have this in production, we ran into some issues, and one of the major ones is memory management. Memory management is a key performance issue in elPrep, and the problem is that all Common Lisp implementations we are aware of use a stop-the-world garbage collector: when the garbage collector kicks in, all threads running at that moment are stopped, and the garbage collector itself is sequential, it doesn't use multiple threads. So it's extremely slow and costs us a lot of time. Charlotte is smart, though.
She was able to trick the garbage collector into staying out of the way of our parallel phases, so it doesn't hurt us that much in production. But the way to do this is not intuitive: the code becomes uglier, we have to put a lot of effort into making elPrep reuse memory wherever we can, and in the end it's not beautiful code anymore, which is a pity. So the questions we asked ourselves were: did we achieve the best result possible, and is there an easier way to achieve the same thing? And yes, there are easier, well-known approaches. You can use a concurrent, parallel garbage collector, which is a known approach, or you can use reference counting, which is also a known approach, and both promise to work better with multithreaded programs. The problem is that we can't do either in Common Lisp, because they require support from the programming language and its runtime system. So we had to look at other programming languages. What is concurrent, parallel garbage collection? The idea of a concurrent garbage collector is that it doesn't stop your main program, or interferes with it as little as possible, and runs in parallel with the main program. The idea of a parallel garbage collector is that the garbage collection itself runs on multiple threads, which is good for performance because it plays nicely with Amdahl's law. When we started these experiments, the mature languages known to us that employ this were Java and Go; Go introduced it in 2016. We needed something that runs well on Linux, so these were the only options we were aware of that support this well in a mature language. The other approach is reference counting. With reference counting, instead of having a garbage collector, you record in every object how many pointers are pointing to it, as an integer count.
Every time a pointer is reassigned, you increment and decrement counts, and when a count drops to zero, you can deallocate that object. This has a few downsides, but for multithreaded programs the synchronization is spread over the whole program, so it shouldn't interfere much with your threads; it should actually play nicely with multithreading. There are more advanced implementation schemes that try to avoid updating reference counts, but we are not aware of any mature language that does this. The mature languages we were aware of that use reference counting are C++, starting with C++11 and continuing in C++14 and C++17, Objective-C, Swift, and Rust. But we can't use Objective-C or Swift, because their reference count updates are not atomic, so they're not thread-safe and we can't use them in a multithreaded program. And Rust is a tricky case: for one particular module we need atomic compare-and-swap, and atomic compare-and-swap in Rust only works on unsafe pointers, which defeats the purpose of using Rust in the first place. So we didn't use Rust. The experimental setup, without going into details: it's a standard five-step pipeline run on a standard, well-known data set. The hardware is an Intel Xeon Broadwell platform, two sockets with 22 cores each, which gives us 88 hardware threads, and 768 gigabytes of RAM. And here are the results. First, the C++ result. It uses shared pointers for reference counting, so we don't have to manage our memory manually, with g++ 6.3, Intel Threading Building Blocks 4.4 for multithreading, and Google Performance Tools for low-level malloc optimizations. That gives us a runtime of about 13 and a half minutes with 227.4 gigabytes of RAM. So that's the C++ result.
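The reference-counting mechanism described above can be sketched in a few lines. This is an illustration of the mechanism only, written in Go for consistency with the rest of the talk (Go itself is garbage-collected, so you would not actually do this there); the type and method names are made up. The key point is that the count is updated atomically, so threads sharing the object never race on it.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// refCounted carries a count of how many pointers refer to it.
// retain and release adjust the count atomically, so multiple threads
// can share the object; the last release triggers deallocation.
type refCounted struct {
	count    int64
	finalize func() // stands in for deallocation
}

func (o *refCounted) retain() { atomic.AddInt64(&o.count, 1) }

func (o *refCounted) release() {
	if atomic.AddInt64(&o.count, -1) == 0 {
		o.finalize() // count hit zero: no pointer refers to the object anymore
	}
}

func main() {
	freed := false
	obj := &refCounted{count: 1, finalize: func() { freed = true }}

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ { // eight threads each take and drop a reference
		obj.retain()
		wg.Add(1)
		go func() {
			defer wg.Done()
			obj.release()
		}()
	}
	wg.Wait()
	obj.release() // drop the initial reference; this one frees the object
	fmt.Println("freed:", freed)
}
```

Notice there is no stop-the-world pause anywhere: the cost is many small atomic updates spread across the program, which is exactly the trade-off mentioned above.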
Next comes Java; we used JDK 1.8 at that stage. This doesn't look so great. We get run times that are better than the C++ run time, but at a huge cost in RAM. For the parallel garbage collector it's almost 500 gigabytes; for G1GC it's around 360 gigabytes. That is really a lot, and even the collector that uses the least memory still uses a huge amount, so it's not really practical, and that one also takes very long. And by the way, the concurrent mark-sweep garbage collector is deprecated as of Java 9, so we can't even rely on it anymore. And now, drum roll: Go 1.7. This is the result. With the default settings, no tweaks, we get 10 minutes and 20 seconds, which is faster than all the others, at an amount of RAM that's pretty close to optimal, also compared with the Lisp version. That was pretty impressive; we were skeptical in the beginning, but this is a really impressive result. For that reason, we decided that the new version of elPrep, which we released as open source in September 2017, is implemented in Go, under the open-source BSD license. So elPrep is a high-performance tool for preparing SAM/BAM files for variant calling. It's a multithreaded application, runs entirely in RAM, merges multiple steps to avoid repeated file I/O, and can improve performance by a factor of up to 10 compared to the standard tools. There are other interesting things we can do now, and we are adding new features to elPrep. What we also see now: the Common Lisp version we basically only developed ourselves, but we now see the first contributions by other users, which is really nice, and a good sign that this was a good choice of programming language. One other thing I should mention: when you compare the development time and how complicated it was to express these three different versions, C++ did incredibly badly.
It was a pain to get it working in the first place. Java was the easiest to express, because it has a pretty impressive streams library that allows us to use parallel streams, which gives us a lot of the functionality we need to express this kind of framework. Because of that experience, we also decided to abstract away the stream processing we have in elPrep and make it available as a separate open-source library, called Pargo, which you can also find under the BSD license. It gives you constructs for parallel programming and for pipeline processing. It's not quite as elegant as in Java, because Go lacks generic types, but it's quite nice in our opinion. You can use it as well, and it doesn't have anything to do with DNA sequencing; you can use it for any kind of parallel programming. And that's the end of my talk. Thank you very much. I don't have any stickers, but if you have questions, I can answer them. Yes? So the question was: can we give you the benchmarks for the Lisp version? I could give you the benchmark for the best Lisp version, but it's not a fair comparison, because in that version we completely deactivate the garbage collector in Lisp, whereas in the other languages we do automatic memory management. The last runtime in Lisp was something like nine or nine and a half minutes at, I think, 220 or 222 gigabytes of RAM. If you want to know more details, we have a paper about this. It's an open publication in PLOS ONE, for the bioinformaticians; Charlotte had a paper there, just Google for PLOS ONE and elPrep, and you will find the paper with a more detailed breakdown of these benchmarks. If you keep the garbage collector running in the Lisp version, you get pretty awful runtimes, beyond good and evil. Yes?
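To give a flavor of the kind of constructs such a library provides, here is a generic parallel-map sketch: work is fanned out over a fixed pool of worker goroutines and results are collected in input order. This is not Pargo's actual API, just an illustration of the style of parallel programming it supports, written with only the standard library; the `parMap` name is made up.

```go
package main

import (
	"fmt"
	"sync"
)

// parMap applies f to every element of in, using the given number of
// worker goroutines, and returns the results in input order. Each worker
// writes only to its own slots of out, so no extra locking is needed.
func parMap(in []int, workers int, f func(int) int) []int {
	out := make([]int, len(in))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				out[i] = f(in[i])
			}
		}()
	}
	for i := range in {
		jobs <- i // hand out indices, not values, to preserve order
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	squares := parMap([]int{1, 2, 3, 4, 5}, 4, func(x int) int { return x * x })
	fmt.Println(squares) // [1 4 9 16 25]
}
```

The lack of generics mentioned above shows here: this version is fixed to `[]int`, where Java's streams express the same thing once for all element types.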
Yeah, okay. So the question was whether we see some usage in the do-it-yourself bioinformatician community. We do see some usage. I won't claim that we are hugely popular, especially compared to the big bioinformatics centers, but we do see some uses. The GitHub repository has 74 stars, which is not bad, I believe. We have some users at hospitals that are actually using this in production. Janssen uses elPrep worldwide for all its sequencing applications because of the cost reduction they see. And we made very sure in the development of elPrep that it can be used as a drop-in replacement for the more standard tools. There are tools like SAMtools and Picard, which are pretty much regarded as the standard tools, and we put a lot of effort into making sure that our tool produces exactly the same results as those tools. So you can just download elPrep, use it as a drop-in replacement, and save a lot of time. Yes? Okay, the question is: if you don't need a garbage collector, why do you use one? As I tried to convey in the beginning, but apparently not very well: once we modified the Lisp code in such a way that it doesn't need the garbage collector anymore, the code became relatively ugly and relatively hard to follow. And we experienced ourselves, once we started trying to add new features, that it became quite unwieldy to do this well. That's one of the things: the Java version is the easiest to express, but Go is also really, really straightforward, especially if you, as a bioinformatician, just want to add a filter that expresses what should happen to your reads in one particular step; that is very straightforward code.
We had users coming from the Python world who looked at elPrep, wrote their own filters, and gave them to us, and we could include them straight away, and it was still as performant as before. I doubt that would work as well in Common Lisp, even if Common Lisp were a widely popular language. Yes? We tried all of them: the concurrent mark-sweep collector that comes with the JVM, the G1GC that comes with the JVM, and the parallel GC. Those are the three results. Now, we didn't do any tuning, but the users we have in mind are bioinformaticians, who are not trained computer scientists; you can't ask them to tune these kinds of things. And with Go 1.7, with the default settings and no tuning at all, you get much better results, so I don't think it's worth the effort. Any other questions? Yes? We're currently in the process of repeating the tests with a newer Go version, with Java 9, and with GCC 7.2, because we have additional... Okay, I'm terribly sorry. So the question was whether we are currently trying to repeat the experiments with newer Go versions. Yes, and also with newer Java versions and newer C++ versions, because they all add new features that are interesting. In Java it's interesting because they added compact strings, where the characters use only 8 bits instead of 16, which removes the need for a lot of optimizations we had to make in the Java version. And C++ added things like the any type and the variant type, which are also interesting to use in this context. Yes? So did I understand the question correctly that in another talk... Yeah? Okay, so what you're saying is that there is another garbage collector for Java, called ZGC, which is apparently much better and much faster than the current Java collectors. I'm not aware of it. Is it something that's offered with the JVM as an option, or is it some third-party tool?
I'm not aware of it, and we haven't tried it. But I'm not claiming here that Go is generally better no matter what workload you're running. This is workload-specific; for other workloads the situation will probably be different. So I'm not claiming this is a general statement. Okay, thank you very much again.