Hello everyone. My name is Ricardo, and I would like to talk about growing a workflow language with GNU Guix. Who has not heard of GNU Guix before? Oh. Oh, oh. So you may want to stay in this room, because there's going to be yet another talk about GNU Guix that goes a bit more into depth. I will explain the rough idea, okay? So you shouldn't get lost.

This is not the story of a product, but the story of an idea. It is not my idea. It is an old idea, an idea as old as organized computing, maybe. And it begins like this.

Once upon a time there was a little process. It produced output, as if to prove its existence. Then it disappeared into the void from whence it came. Now prepare for the tragedy: in its brief life, it never met any other processes.

This all changed when, at long last, the pipe arrived. The concept of a pipeline was born. This is the pipe. This is one process, this is another process, and they meet in the middle. This one produces something; this one consumes what the previous process produced. This was a beautiful concept, because the little process was no longer alone. But like every beautiful thing, it turned into a monstrosity. Pipelines grew ever larger, as did the compute requirements, and no longer was it sufficient to have one process and another process that communicate with one another.

When researchers in the life sciences understood the concept of computing and the value it provides, for biology for example, they tried to scale this very simple idea to the genome scale, right? Terabytes of data, and lots and lots of processes that have to run and compute on lots and lots of data to produce a final answer. Maybe 42. But processing is never as simple as that. It's not just "align the genome and then pass this to some process that analyzes it". No, we need a new spin on this idea of processes: high performance computing.
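To make the pipe idea concrete, here is a minimal Python sketch that is not taken from the talk. It starts two child processes and connects the first one's output to the second one's input. The function name `run_pipeline` and the two tiny child programs are made up for illustration.

```python
import subprocess
import sys


def run_pipeline(text):
    """Connect two processes with a pipe: one produces, the other consumes."""
    # Producer: a child process that writes its argument to standard output.
    producer = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.stdout.write(sys.argv[1])", text],
        stdout=subprocess.PIPE,
    )
    # Consumer: a child process that reads standard input and upper-cases it.
    consumer = subprocess.Popen(
        [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"],
        stdin=producer.stdout,
        stdout=subprocess.PIPE,
    )
    # Close our copy of the pipe's read end so that only the consumer holds it.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output.decode()
```

Calling `run_pipeline("hello")` runs both children on the same machine, which is exactly the limitation that comes up once the processes live on different machines.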
High performance computing is really just lots of low performance computing, but connected. Now, an obvious problem with this is that these are different machines, right? You can't just use a pipe for that. You still have processes, but how do you connect them?

A pipeline is just a process connected with another process, maybe connected to yet another process. The concept of a workflow is an expansion of the idea of a pipeline: rather than having a linear flow of data that goes in from the left and comes out on the right, you have a graph where information can disperse and, yeah, seep through, and at the end you filter out something that you're interested in.

But like any simplified story, this one too is made up of lies. I said that this is a process, right? This is true. This is a process, but much more important than the process itself is the much larger environment in which it runs. We don't speak computer languages. We interact with computers through a very simple string-based language, right? We have names of processes, and the computer invokes them for us. We can't actually control the process. We just give the computer a name and, depending on the environment, it generates the process for us. So the environment is really crucial in determining what effects there will be when we invoke a command.

But the environment is not just this gray blob, this looming shadow in the background. If you zoom a little closer into it, it consists of things. It consists of files, packages, applications. This is just a very simple real-world environment. It's a very simple environment.
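The point that a command name only means something relative to an environment can be sketched in a few lines of Python. This is an illustration under stated assumptions, not something shown in the talk: it assumes a POSIX system, and the command name `greet` and both directories are hypothetical. The same name resolves to a different program depending on the search path.

```python
import os
import shutil
import stat
import tempfile


def make_command(directory, name, body):
    """Create a tiny executable shell script called `name` in `directory`."""
    path = os.path.join(directory, name)
    with open(path, "w") as script:
        script.write("#!/bin/sh\n" + body + "\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def resolve(name, search_path):
    """Return the program that `name` would refer to under `search_path`."""
    return shutil.which(name, path=search_path)


# Two hypothetical environments, each providing its own "greet" command.
env_a = tempfile.mkdtemp()
env_b = tempfile.mkdtemp()
greet_a = make_command(env_a, "greet", "echo 'greeting from environment A'")
greet_b = make_command(env_b, "greet", "echo 'greeting from environment B'")
```

Here `resolve("greet", env_a)` and `resolve("greet", env_b)` name two different files, and in an environment that contains neither directory the name refers to nothing at all.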
What I showed is just the environment for GCC, for example. This is just one compiler. In real workflows we have much larger environments that we can't possibly comprehend by just looking at them. Even if you spend an hour looking at this, you probably won't understand how it behaves, or what implications this environment has for the process running inside an environment like this.

Can containers help? Okay, let's move on. Containers. Containers are the idea that you can wrap up an environment, the binary state of an environment, and instantiate it at a later point, on a different machine maybe. That way you can be sure that the process running inside has a well-known environment to operate in. But containers are weird, you know. We use the term "containers", but it's really an application bundle, right? It's a lot of state thrown in together. And when you think of it that way, they are very much like a smoothie. It's the result of something; it is the output of following a recipe that you may no longer have access to. Containers lack transparency. You don't really know what is inside once you have the result.

Now some of you might say, "Oh, but what about things like Dockerfiles, right? Isn't this a recipe?" It is a time-based recipe. It depends on where you run it and when you run it. Generating a Docker application bundle, this binary blob, depends on the state of the world that was used as an input to the procedure generating the state. So containers are not actually a solution. They are not an input. They are outputs.

So what can we do about this? I'll be honest.
This is a thinly veiled advertisement for GNU Guix. GNU Guix is often called a package manager, but when you hear "package manager", maybe you think of npm, God forbid, or you think of APT for Debian, right? Or you think of pip. There are so many package managers. "Package manager" does not begin to describe what Guix does.

Yes, it allows you to manage packages. It builds packages in a reproducible fashion, by design. Reproducible means that you build a package today according to a recipe and it results in a certain kind of output, a certain binary state, and when you do the same thing tomorrow or a week from now with the same recipe, you get the same output. You get the same binary state. So: same inputs, same output. The idea behind that is called functional package management, which was pioneered by Nix.

But it's more than just packages, right? Multiple packages together form an environment. Guix allows you to manage environments: create environments, create isolated environments, create impure environments that are a mix of the state you have on your system and the state that you want to have. But it also allows you to put those environments into isolated containers. A container is a context of execution where certain aspects of the environment are eliminated. For example, user names: you can create a user namespace in which the user running the process is root. The process thinks, "Oh, I'm running as root", but really, looking at it from the outside:
There is no such thing as root. You can also virtualize the file system and say, "These files don't exist; only those files exist." Containers are a very powerful idea that is orthogonal to the idea of bundling up binary state and shipping it around.

More than that, though, Guix enables you to build complete systems. By system I mean an operating system that runs on, for example, this laptop here, or on an HPC cluster, or in a virtual machine. Because to Guix this is all the same. It's all about building things from recipes in a reproducible fashion, so that in the end we end up with exactly the kind of state that we declared at the beginning. So in short, what Guix provides is reproducible deployment, in a very generic way.

Now, this talk is actually about the workflow language, the Guix Workflow Language, right? You can think of it as an extension to Guix itself. So this is not to scale; the workflow language is actually even smaller. That's what Guix is. This is the minimalist languages track, right? So this is where it's supposed to be. There's Guix, and out of Guix we grow extra features, just enough to allow us to express workflows.

Now back to the original idea. We had processes and we had pipes. The pipes are really just a means of composition. So we have a means of abstraction to describe what a process is, maybe describe its complexity, its resource requirements, or simply its name, so that we have control over it. And the workflow is just a composition of many processes. This is really all there is to it.

Brace yourself: if you're not a Schemer this may look really, really ugly, but bear with me. In two more slides these parentheses will disappear. So this is very, very simple, right? This is a process that has a name, so we can refer to it and invoke it. It has package inputs. This one uses the GNU Hello package, whose purpose is to greet you when you execute it. It says "Hello, world!" And the process has a procedure.
This is how it is supposed to be executed. This is some special syntax; you don't need to know it. It's syntactic sugar that allows us to run a little shell snippet where we execute hello. Right, a process. Very, very simple. The package inputs field is where the magic lies. This is where it connects to Guix. When this is invoked, Guix generates the environment that provides hello and nothing but hello, so that within this context we have a specific version, a specific variant, of the hello package that we can execute and then get the greeting that we want. Because maybe in hello 3.0 the greeting may change to "Hello, Jupiter!" Who knows, right?

Workflows. As I said, workflows are just the means of combining processes. This workflow has the name "flow". This is a common workflow name. And it consists of processes that are connected in a graph, right? So there's a process A that does things, as a process would do, and A depends on the B and C processes, on the outputs of B and C. Maybe B generates something first that A consumes, and A also depends on C. Now B itself depends on the execution of D. Right, so this is a very, very simple description of how these processes are supposed to be plugged in.

Now for those who don't like S-expressions, just a second. This is a different way of expressing the exact same thing, right? This is what Schemers see, or what Lispers see, when they look at S-expressions. They don't see the parentheses.
They see the structure. Okay, so you can simply write that structure if you feel like it. This is called Wisp, and there's going to be a talk later today about this language extension, which is a minimalist language in itself.

There was a question. This is actually a future extension that is currently in the works. There are different ways of expressing the same thing, right? So the question was whether this is a data flow expression: does data flow from A to B and C, or does it go the other way around? This is a "depends on" relation: A depends on whatever B and C provide. You can't express it the other way around, flipping the arrow and saying that data pushed into B and C is later pushed through to A. It's not a communication flow here.

Oh, with Wisp: can you express everything with Wisp that you can express in S-expressions? Yes.

You can also let it figure things out for you, if you don't really want to specify how data flows from one process to the other. By simply declaring the inputs of A and the outputs of A, and the inputs of B and the outputs of B, you can line those up and connect them automatically. For some workflows this is the easier way of describing things. Sometimes you don't want to know the details. You want the system to just figure it out for you.

All right. This is an extension to GNU Guix, so naturally it provides a command, a sub-command of guix, "guix workflow", which allows you to run workflows, or to inspect workflows, to generate a nice representation of the workflow. It is not a separate tool, right? It came budding out of Guix, if you will. The features that it gains by being embedded in Guix are aplenty. I didn't have to write any of this. In fact, I didn't write most of it anyway; this is a project I took over. But reproducible packages, for example, are one thing that Guix provides. The workflow language doesn't have to do anything to gain access to thousands of bit-reproducible packages. This is beautiful, right?
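The dependency graph from the slides (A depends on B and C, B depends on D) and the idea of lining up declared inputs and outputs can be sketched in Python as follows. This is an illustration only, not the workflow language's own syntax or implementation: the process table, the file names, and the functions `connect` and `execution_order` are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical processes: each declares the files it reads and the files it writes.
PROCESSES = {
    "a": {"inputs": ["b.out", "c.out"], "outputs": ["a.out"]},
    "b": {"inputs": ["d.out"], "outputs": ["b.out"]},
    "c": {"inputs": [], "outputs": ["c.out"]},
    "d": {"inputs": [], "outputs": ["d.out"]},
}


def connect(processes):
    """Map each process to the processes whose outputs it consumes."""
    producer_of = {
        output: name
        for name, spec in processes.items()
        for output in spec["outputs"]
    }
    return {
        name: sorted({producer_of[i] for i in spec["inputs"] if i in producer_of})
        for name, spec in processes.items()
    }


def execution_order(processes):
    """Return an order in which every process runs after its dependencies."""
    return list(TopologicalSorter(connect(processes)).static_order())
```

Nothing in the table says "A depends on B"; the edges fall out of matching `b.out` as an output of one process with `b.out` as an input of another. Any order returned by `execution_order(PROCESSES)` runs D before B, and both B and C before A.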
The workflow language also has access to a special form of expression called a G-expression, a Guix expression if you will, that allows you to more conveniently access packages from within what looks like an S-expression. That's a special case; if you know about Guix, this will make sense to you.

Container contexts: as I said, Guix provides the ability to set up containers and mount file systems in the right locations. The workflow language gains this for free, and I don't have to do anything to make it happen. Same with virtual machines, right? Guix can build systems, so the workflow language has this feature at its disposal.

Workflow bundles: having a workflow description locally may not be the most convenient way for other people to run that workflow, right? Maybe they don't have Guix, for whatever reason. So Guix provides a feature with which you can package up the whole environment and create those infamous bundles, these binary blob bundles. We can bundle up workflows just like that.

And data caching. This is actually not directly a feature Guix provides, but Guix provides caching, if you will, for package builds, because they are built reproducibly. So we don't have to rebuild packages when none of the inputs have changed. The same goes for data, right? If you have data files that haven't changed, why would you have to regenerate the output if you already have the output file? If none of the inputs have changed, there's no point in rerunning this. So you gain caching.

Being embedded in Scheme, a language known for being a good language for writing languages, allows us to simply add syntactic sugar, right?
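The data caching argument above (if none of the inputs have changed, don't rerun) can be illustrated with a short Python sketch. It is an assumption-laden toy, not how Guix or the workflow language implement caching: the names `fingerprint` and `CachingRunner` are hypothetical, and inputs are passed as in-memory bytes rather than files.

```python
import hashlib


def fingerprint(name, inputs):
    """Derive a key from the process name and the content of all its inputs."""
    digest = hashlib.sha256(name.encode())
    for input_name in sorted(inputs):
        digest.update(b"\0" + input_name.encode() + b"\0")
        digest.update(hashlib.sha256(inputs[input_name]).digest())
    return digest.hexdigest()


class CachingRunner:
    """Run a process only if this exact combination of inputs is new."""

    def __init__(self):
        self._cache = {}
        self.runs = 0  # how many times a process was actually executed

    def run(self, name, func, inputs):
        key = fingerprint(name, inputs)
        if key not in self._cache:
            self.runs += 1
            self._cache[key] = func(inputs)
        return self._cache[key]
```

Running the same hypothetical process twice on unchanged inputs executes it once; changing a single byte of any input produces a new key and triggers a rerun.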
We gain a big chunk of syntactic sugar, which is probably not good for you, through Wisp. But we can also embed other syntax for expressing graphs and whatnot. It also allows you to execute these workflows on HPC systems, right? There's this Grid Engine execution stuff. You just have to specify that you want to run this workflow on an HPC cluster.

And since the time is up, right, I'll just leave this here for you. You can look this up later; the slides go up somewhere. Yeah, we've talked about all of this. So this is it. We have grown a workflow language, and we will see if this was a good idea. Thank you.

Yes, please? Oh. The question is: the origin of this workflow language is within the context of bioinformatics; are there any workflows that use this language? Yes. Yes, there are. It actually does work on real-world workflows within the context of bioinformatics, but this doesn't have to be limited to bioinformatics. Bioinformatics is just one weird case, one special case, of scientific computing in general.

What is the relationship between Guix and Nix, right? Nix was first, Guix was second. Guix is an implementation of the same ideas that were pioneered in Nix. The implementation of Guix made a couple of different design decisions, and there are different approaches to exactly how this core idea is implemented. Both projects face the same kind of challenges, so there's a lot of collaboration between them. In fact, we just had a short conference, a two-day conference, before FOSDEM, where Nix folks were present, because the problem space that we both occupy is virtually the same. All right.