Okay, so next up is Pascal; he's going to be talking about parallel programming.

Okay, thank you very much, and thanks for the introduction. So I'm going to talk about the Pargo library. What is Pargo? Pargo is a library that we developed at the imec ExaScience Lab here in Belgium. It's a library for parallel programming in Go, based on our experiences with lots of other parallel programming libraries and languages, in C++, Common Lisp, Java, and a few others. We released it under a BSD-style open source license at the URL that you can see right there.

It supports lots of different features for parallel programming. It's based on the notion of divide-and-conquer task-based parallelism, which I'm going to explain in a minute. It supports features like parallel ranges, parallel reductions, parallel Boolean functions, and speculative parallelism; some concrete algorithms, primarily sorting, so parallel quicksort and parallel mergesort; a parallel hash table for performance; and parallel pipelines, functionality that is inspired by the Java parallel streams introduced in JDK 8, but with a distinct Go flavor: we support contexts, we support cancellation, and we support Go-style error handling, of course.

Now, some of you might wonder why we even need a parallel programming library for Go. Don't we already have concurrency mechanisms in the language that give us everything we need? Here it's important to stress again that concurrency and parallelism are two completely different topics.

So what's the difference between concurrency and parallelism? You need concurrency when it's part of the problem domain; that's a really important thing to understand. When do we have concurrency in the problem domain? One typical example is an airplane reservation system: it may happen that several different people want to book the same seat at more or less the same time. Then you have a concurrency problem that comes from the problem domain, which you need to solve somehow, and Go is really good at this. You have Go channels, goroutines, all these lovely features that allow you to really solve these issues from the problem domain. You would have these problems even if you didn't have multiple nodes or multiple cores; even on a single box with a single CPU core, you would still have to face these problems and express a solution to them somehow. And indeed there have been concurrent programming languages in the past that did exactly that. So there's nothing about multi-core or multi-node here with regard to concurrency.

On the other side, when we talk about parallelism, we're only talking about the solution domain. These are problems for which we don't necessarily need any form of concurrency, but for which we may want to use multiple cores or multiple nodes to make them go faster. That's the only reason why you would want to use parallelism; if you don't care about performance, you can go home now. Sometimes it gets confusing, because of course you also want your concurrent programs to be fast, so bringing multiple cores and multiple nodes to concurrent programs actually makes sense. But it's still important to keep these concepts separate in your head.

So let's look at an example of a parallel program, starting from a very simple non-parallel program. It takes a slice of numbers, sums them up, and returns the sum. Nothing in here asks for parallelism or concurrency; it's a purely sequential program, and perfectly fine as such.
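The slide code itself isn't in the transcript; a minimal reconstruction of such a sequential sum might look like this (the function name `sum` is my choice, not necessarily what was on the slide):

```go
// sum adds up the numbers in a slice, one after the other.
func sum(xs []int) int {
	result := 0
	for _, x := range xs {
		result += x
	}
	return result
}
```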
But maybe your slices are really big, and you want to use multiple cores to make this go faster. One way to express that in pure Go, without Pargo, is with goroutines. The first thing we do is look at a particular threshold: we ask whether the length of the slice is below that threshold, and we have to determine that threshold experimentally. If it's below the threshold, it doesn't actually make sense to use multiple cores, because the sequential program will be fast enough. If it's bigger than the threshold, we divide the size of the slice by two, so we split it into two halves. Then we use a WaitGroup, which is a feature that comes with the Go standard library. We tell the WaitGroup that we're going to spawn one goroutine, and inside that goroutine we make sure to tell the WaitGroup that we're done once it has finished. In that goroutine, we build the sum for the left half of the slice with a recursive call. We also build the sum for the right half of the slice, which potentially runs in parallel with the left half. Then we wait for the left half, and finally we just add the sums of the left and right halves. That's one way to express this as a parallel program.
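A sketch of that goroutine version, reusing the sequential `sum` from above; the function name `parallelSum` and the threshold value are placeholders of mine, since the threshold has to be determined experimentally:

```go
import "sync"

// The threshold below which the sequential version is used; it has to be
// determined experimentally for the machine and workload at hand.
const threshold = 100_000

func parallelSum(xs []int) int {
	if len(xs) < threshold {
		return sum(xs) // small enough: sequential is fast enough
	}
	half := len(xs) / 2 // split the slice into two halves
	var left int
	var wg sync.WaitGroup
	wg.Add(1) // we are going to spawn one goroutine
	go func() {
		defer wg.Done()                // tell the WaitGroup we're done
		left = parallelSum(xs[:half]) // recursive call for the left half
	}()
	right := parallelSum(xs[half:]) // right half, potentially in parallel
	wg.Wait()                       // wait for the left half
	return left + right
}
```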
What's important to realize here is that there are two recursive calls to the sum function itself. In every recursive call we do the same thing again: is this below the threshold? Then we just run the sequential version. If it's not below the threshold, we split it into halves again. This creates a task tree, and this is what the notion of divide-and-conquer task parallelism means: we just split our problem into smaller and smaller pieces until we arrive at leaves that we can handle sequentially.

Now, this may look like a bit too much overhead. Why do we create such a task tree? Why don't we just split the work by the number of cores and be done with it? Well, the beautiful thing about these kinds of task trees is that they are very, very flexible to schedule. Assume we are on a system with 16 cores, and the tree has 16 leaves: each core can take care of one of those leaves. Now assume we have four cores: each core can take care of one of the subtrees of the tree. So no matter how many cores we have, the tree flexibly adapts to however much each core can process, and the more cores we have, the finer-grained the decomposition becomes.

What's even more important: so far, for simplicity, we made the assumption that each of the leaves takes more or less the same amount of time. But this is very often not the case; very often the leaves have different running times. This is what's called load imbalance in parallel programming, and it's a big issue. With this kind of task tree, it's very easy to solve: some of the cores will just take care of the heavyweight leaves, which simply take longer, and at the same time some of the other cores will just take care of more tasks, until the whole program is done. This is a very beautiful and elegant solution for dealing with load imbalance, and as I said, load imbalance happens much more often than you might think. So task-based parallelism allows for a flexible distribution of work over CPU cores. What lots of newcomers to parallel programming typically do, dividing the work statically over the CPU cores, typically leads to bad, non-optimal performance.

But what I haven't explained yet is how we actually schedule the task tree. I just said we can schedule it over multiple cores flexibly, but not how. The elegant solution for that is work stealing. Work stealing is a concept that has been known at least since the 80s and was formalized in the 90s; there are two really excellent papers that explain what work stealing does. The idea is that each core basically looks for work and steals whatever it can do. This has been successfully implemented in many programming languages and libraries: in Cilk for C, which is the most famous one; in Threading Building Blocks for C++; in the Java fork/join framework, which comes with the Java standard library; and in Go. The scheduler for goroutines in the standard implementation of Go is actually a work-stealing scheduler, and this was the main reason why we became so excited about using Go for our parallel programming tasks.

So what does it look like? Assume you have four cores, and one of the cores starts working on creating such a task tree. There is one task, which then spawns another task, and at the same time the other cores are just asking each other: do you have any work for me to do? One of the cores will, by chance, pick one of the tasks from one of the other cores and just continue working on that. In the next step, the original core creates another task, and some other core randomly gets that task by just stealing it from the first core, and so on. Once every core is busy, they all just continue to work on their own tasks, creating new tasks and finishing others, until one core runs empty; then it just starts looking again, asking does anybody have work to do, steals work randomly, and continues.

So this is, very roughly, how work stealing works. There are lots of technical details involved in making it really efficient. But what's really beautiful about work stealing is that you don't need to plan anything: these work distributions that deal with load imbalance basically just emerge out of cores randomly looking for work, and it's known, based on the papers I mentioned, that this is actually optimal.
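To make that concrete, here is a deliberately simplified toy of random work stealing; this is emphatically not Go's or Pargo's actual scheduler (real implementations use lock-free deques and many refinements), and all the names in it are mine, purely for illustration. Each worker prefers the newest task in its own queue, and when it runs empty it steals the oldest task, typically the biggest remaining piece of work, from a random victim:

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
)

// A worker owns a queue of tasks; other workers may steal from it.
type worker struct {
	mu    sync.Mutex
	tasks []func()
}

func (w *worker) push(t func()) {
	w.mu.Lock()
	w.tasks = append(w.tasks, t)
	w.mu.Unlock()
}

// take removes a task: the owner pops the newest one (good locality),
// a thief steals the oldest one (typically the biggest piece of work).
func (w *worker) take(oldest bool) func() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if len(w.tasks) == 0 {
		return nil
	}
	var t func()
	if oldest {
		t, w.tasks = w.tasks[0], w.tasks[1:]
	} else {
		t, w.tasks = w.tasks[len(w.tasks)-1], w.tasks[:len(w.tasks)-1]
	}
	return t
}

func main() {
	const nWorkers, nTasks = 4, 16
	workers := make([]*worker, nWorkers)
	for i := range workers {
		workers[i] = &worker{}
	}
	pending := int64(nTasks)
	// All tasks start on worker 0, as if one core had built the task tree.
	for i := 0; i < nTasks; i++ {
		i := i
		workers[0].push(func() {
			fmt.Println("task", i, "done")
			atomic.AddInt64(&pending, -1)
		})
	}
	var wg sync.WaitGroup
	for id := 0; id < nWorkers; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Keep going until every task has finished somewhere.
			for atomic.LoadInt64(&pending) > 0 {
				t := workers[id].take(false) // own work first
				if t == nil {
					t = workers[rand.Intn(nWorkers)].take(true) // else steal
				}
				if t != nil {
					t()
				}
			}
		}(id)
	}
	wg.Wait()
}
```

Note that the even load distribution is never planned; it just emerges from idle workers stealing at random.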
You can't do better than that. That's presumably why Go uses it, and why we were very happy that it's available in Go.

So now, back to our original example. What you see in this code is that there's a lot of management code that just makes sure we can create work, distribute it, spawn it, wait for it, and so on; lots of code that we're not really interested in. This is where Pargo comes in: Pargo uses this notion of divide-and-conquer task parallelism and then gives you a couple of higher-level functions that make it easier to use.

In Pargo, the same program looks like this, and under the hood it has essentially the same implementation. This particular example is an example of a range reduction: we range over a slice and we reduce it. Reduction is a term from parallel programming; we reduce many values to a single value. This doesn't have anything to do with Hadoop MapReduce or the like; it's a parallel programming term. We need a base case, which is a sequential function; it gets called by Pargo, and Pargo tells it which part of the slice to look at. Then you need a reduction function, which just tells the Pargo library: if I have two results, how do I combine them? Well, I just add them. This does exactly what I described before under the hood; it just makes it much easier and much more elegant to express at a higher level. (I'll show a sketch of this below.)

So this is where all the functionality I already sketched comes into play. We have a simple parallel-do, which just spawns a couple of goroutines; we have ranges, which don't produce a result; we have reductions over ints, floats, strings, and a generic interface type; we have range reductions over ints, floats, strings, and a generic interface type; and we have the Boolean functions. We have speculative versions, which are especially interesting with the Boolean functions, because you can make a tree stop executing as soon as you already know the result. We have sequential variants, which are not supposed to be used in production code, but which can be used when you want to debug your programs, especially when you want to do print-line debugging. We have quicksort and mergesort, which are actually quite complex algorithms, so we hide that complexity away from you. And we have parallel hash tables and pipelines.

Now, here again, people may wonder: why do we need parallel pipelines? Go is already really good at pipelines, and that's true. Here's an example of a pipeline in Go, which I took from a tutorial. It's a two-stage pipeline. The first stage gets a slice of numbers and creates a channel over which these numbers are sent, one after the other. The next stage takes this channel of numbers, squares each element, and sends the squared numbers on to the next channel. And then there is a main program that reads the squared numbers and prints them out. For a concurrent program, this is really beautiful: really elegant to read, very easy to work with.
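Here is the promised sketch of the Pargo range reduction from a moment ago. The import path and the exact function name and signature are assumptions of mine, reconstructed from the talk and not verified against the library; check Pargo's API documentation for the real identifiers:

```go
package main

import (
	"fmt"

	"github.com/exascience/pargo/parallel" // assumed import path
)

// sum is the sequential base case from before.
func sum(xs []int) int {
	result := 0
	for _, x := range xs {
		result += x
	}
	return result
}

func parallelSum(xs []int) int {
	// Hypothetical signature: range bounds, a granularity hint, a sequential
	// base case that is told which part of the slice to look at, and a
	// reduction function that combines two partial results.
	return parallel.RangeReduceInt(
		0, len(xs), 0,
		func(low, high int) int { return sum(xs[low:high]) },
		func(x, y int) int { return x + y },
	)
}

func main() {
	fmt.Println(parallelSum([]int{1, 2, 3, 4, 5}))
}
```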
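And the tutorial pipeline just described is very likely the classic two-stage example from the Go blog's "Go Concurrency Patterns: Pipelines" article; a reconstruction:

```go
package main

import "fmt"

// gen sends the given numbers on a channel, then closes it.
func gen(nums ...int) <-chan int {
	out := make(chan int)
	go func() {
		for _, n := range nums {
			out <- n
		}
		close(out)
	}()
	return out
}

// sq squares each number it receives and forwards it on a new channel.
func sq(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		for n := range in {
			out <- n * n
		}
		close(out)
	}()
	return out
}

func main() {
	// Read the squared numbers from the last stage and print them.
	for n := range sq(gen(2, 3)) {
		fmt.Println(n)
	}
}
```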
From a parallel perspective, though, this is not so great, because we just created three goroutines. If you have 16 cores, there are now 13 cores that don't do anything; but you want to keep them busy, because you're interested in taking advantage of them for performance. So you would like to distribute the work a bit differently when you're thinking about a parallel program. I'm not saying this is bad for concurrent programming; there it's really elegant, because you're probably also dealing with many of these pipelines at the same time, so you're already creating a lot of work. But from a parallel perspective, it's quite likely that this is the only pipeline, and then you want to distribute the work differently.

So here is a pipeline in Pargo. In Pargo, a pipeline is a data structure whose zero value can already be used. We give it a source; in this case it's a silly source, a slice of two numbers. Of course you wouldn't use a parallel pipeline for a slice of two numbers; this is just for the example. Then you can add stages. The first stage is a parallel stage: it receives a batch of numbers, which is handed over behind an interface type, because that's the only way in Go to declare something generic; it unpacks that data into a slice of numbers and then modifies it in place to square each number. The next stage is an ordered stage; ordered means it's sequential, so this stage doesn't run in parallel, and it is executed in exactly the same order as the source of the pipeline. Here we can just print out the results.

What's nice here is that, under the hood, we again use this principle of divide-and-conquer task parallelism to split up the inputs into batches, and to create more batches than there are cores available, so that a work-stealing scheduler can actually schedule them optimally. And this is how you can take advantage of all your cores. I almost forgot: in the end, you just run the pipeline.
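A hypothetical sketch of that Pargo pipeline. As with the range reduction above, the package path and all identifiers used here (Pipeline, Source, Add, Par, Ord, Receive, Run) are assumptions reconstructed from the talk; consult the actual Pargo documentation before relying on any of them:

```go
package main

import (
	"fmt"

	"github.com/exascience/pargo/pipeline" // assumed import path
)

func main() {
	var p pipeline.Pipeline                        // assumed: zero value is usable
	p.Source(pipeline.NewSliceSource([]int{2, 3})) // a deliberately silly source
	p.Add(
		// A parallel stage: it receives a batch behind an interface type,
		// unpacks it into a slice, and squares each number in place.
		pipeline.Par(pipeline.Receive(func(_ int, data interface{}) interface{} {
			nums := data.([]int)
			for i, n := range nums {
				nums[i] = n * n
			}
			return nums
		})),
		// An ordered stage: sequential, run in source order; it prints.
		pipeline.Ord(pipeline.Receive(func(_ int, data interface{}) interface{} {
			for _, n := range data.([]int) {
				fmt.Println(n)
			}
			return data
		})),
	)
	p.Run() // and in the end, you just run it
}
```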
So, to summarize what you have in Pargo's parallel pipelines: you have predefined pipeline sources for arrays, slices, strings, channels, and bufio.Scanner, so for scanning text files; you have support for user-defined sources through the source interface; and you have support for several kinds of nodes, or stages: sequential, ordered with a guaranteed order, and parallel. You also have strictly ordered and limited-parallel variants, which give you a way to control how much memory is used. You have skip and limit nodes, where you can skip elements or limit how many elements you want to see over the lifetime of the pipeline. You have support for several kinds of filters: generic receive and finalize filters; Boolean filters, where you can say, only keep running as long as every batch fulfills a certain condition; counting filters, which just count how many elements you see; and slice filters, for producing result slices. We also support contexts with cancellation, Go-style error handling, and fine-tuning of batch sizes, so you can really tweak the performance.

All of these features are not just something we made up; we actually use them ourselves. We have a tool called elPrep, which is a DNA sequencing tool that we have been developing for a couple of years. It's a high-performance tool for performing certain steps in DNA sequencing pipelines. It is a multi-threaded application that typically runs something like 10 times faster than the standard tools, and it runs that much faster not least because we're using the Pargo pipelines and some of the other powerful functionality that I just described. It has been implemented in Go since version 3.0, and it's available as an open source project at this link. So we're really eating our own dog food, and we're making this available so you can also use it in your projects.

So, Pargo is available at this URL. Documentation is also available: the standard API documentation, and a wiki that describes the concepts in a bit more detail than can easily be covered in API documentation. There's also a link for elPrep. And that's the end of my talk. Thank you very much.

Audience: My question is, the code that you showed looks like something that would benefit from generics. What do you think about contracts?

Speaker: I was hoping I wouldn't get that question. I'm not a big fan of generics, and I think Go would be better off without generics; that's my personal conviction. From a user perspective, generics look elegant, but from a library provider perspective, it becomes incredibly hard to write them correctly. And I'm not really happy about the current ideas around contracts, because that was just a hack in C++, a kind of coincidental hack like many things in C++, and I don't think we should imitate that in the Go language. That's my personal opinion.

Audience (the microphone runner): Do you have any particular performance metrics for something using Pargo? Do you have any numbers?

Speaker: Do we have numbers? Well, yes. For the elPrep sequencing tool, we had a paper for the previous version, which was not in Go, and we just got the notification that a paper on the new version, which is in Go, has been accepted; there you can find performance numbers. We also did a study comparing performance between Go, C++, and Java for exactly the tool I presented last year at FOSDEM, where Go actually came out as the winner. So you can find those numbers. Other questions? Okay, thank you very much.