OK, well, thanks for showing up. My name is Roel Janssen. About a year ago I sent an email to the GNU Guix development mailing list asking whether somebody knew a way for me to do an internship related to GNU Guix or GNU Guile, and I was lucky enough that Pjotr Prins replied; I found a really interesting place at UMC Utrecht. One of the things I worked on during the last weeks of my internship was something called workflow management, and those are the concepts I'm going to try to address. Oh yeah, somebody signalled that I should talk louder, so I'll try that.

What I'm going to show is that I've implemented a couple of record types on top of GNU Guix's infrastructure for package management, and this thing works. To understand what I'm trying to achieve: I'm now working at an institute where we do data analysis. We have some initial data set, the yellow node here, and by running programs on it we try to extract information, filter it, and get useful research results from it; eventually we want some undeniable proof of our theorem, or of whatever we are investigating. If you look more closely at the nodes labelled A to N, you can see that these processes can be really simple, like a grep to pull some information out of a file, or a bit more advanced, like running an R script or a program in some other language foreign to Guile. I know, I know.

So what we're trying to do is make an easy way to run these things, either serially or in parallel. What I'm trying to capture is a workflow, which for me is just the set of processes we should run to get from the initial data set to the undeniable proof. I'm not including the data in the process itself; that's just something you put in and get out. In real-world situations it's actually more common that you don't get a serial process going from A to N in simple steps. More likely you can run a couple of things in parallel, as you can see from node A onwards, where three processes can run at the same time, and at other points a process needs to wait for input from several other processes. If you did this with a shell script it could become pretty complex, and it gets really interesting when you can run things in parallel on multiple computers; that's also hard to do with shell scripts, I guess.

What we have at UMC Utrecht is multiple computers, all connected to a single storage system. This storage is actually quite expensive, so it's a two-tier setup: there is storage that we actively use, connected to the compute nodes, and there is storage used for archival purposes and backups. So when a hard disk crashes, or when several hard disks crash, some data gets lost, but the more valuable data won't be. That's important to note: if we want to run processes, we should store their data outside the Guix store, not inside it, so that we can archive it.

To run things on multiple compute nodes we have a job control system. You basically pass it a shell script with some additional syntax for how much memory you want to allocate and how much time you need, and it then runs the commands in that bash script. Yeah, really bash, not just any shell, bash. That's a bit unfortunate for us Guile users, right? But we'll get to that.
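To make that concrete, here is a sketch of the kind of bash script such a system accepts. The directives shown are Grid Engine style, one common scheduler; the exact syntax depends on the job control system your cluster runs, and the paths are made up for illustration.

    #!/bin/bash
    # Resource requests for the scheduler (Grid Engine style
    # directives; other job control systems use other syntax).
    #$ -l h_vmem=4G        # how much memory to allocate
    #$ -l h_rt=01:00:00    # how much wall-clock time is needed

    # The actual commands to run on the compute node.
    grep -v '^#' /shared/data/sample.vcf > /shared/results/variants.txt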
The job control system can then queue these jobs on the compute nodes, or run them all together.

So what I've created are two record types, and this is the first one. It defines a process, one of the A-to-N nodes, and it looks kind of similar to a package, only a little simpler. It has a name, a version, a synopsis, and a description; the description fell off the slide, too bad. What's more interesting is that a process can have package inputs, which are Guix packages, so that should look familiar, but also data inputs, which can be anything you like: a simple list of samples, just a string or a number, or something more advanced like an association list with multiple items that you later use in the procedure. It also has a run-time complexity specification, in both space and time. With this information we can say something about how long the process will run, and as you can see this is just Scheme code, so you could take the size of a file and say, OK, it needs twice that file size in RAM to compute; something you can test out and try.

That's all pretty basic, but then there's the procedure, which actually produces the commands for the shell script. This procedure is, and I hope this looks a bit familiar because it has been addressed in a couple of talks before mine, a G-expression: basically quasiquoting, but in a Guix-specific way. In a G-expression you can directly reference Guix packages, and the nice thing is that when you run the procedure, that is, when you turn it into a derivation and later output the actual script that rolls out of it, it automatically builds the packages you're using. That's really useful.

Now I want to get to the second type. We have these processes, but how are we going to connect them? We need a different type for that, and again it looks quite familiar, I hope. Two things here are interesting: the processes and the restrictions. The processes field is just a list of all the nodes that you want to use in this workflow; these are the processes you defined earlier. The restrictions should be read as "the first depends on the completion of the second", and that's a really nice way to do it, because from these restrictions you can basically determine the dependency graph.

And that's all there is to it, because with this information we can now construct a dependency graph of the processes we defined using G-expressions. How that works is that you take a process definition and turn it into a derivation, which will end up in the Guix store, right? When you build that derivation, it builds the dependent programs for you and returns a job script, and that job script is something you can pass to the job control system, which will then distribute it to one of the compute nodes in your cluster. That's basically the entire idea. The job script of course uses the programs you built earlier; the derivation did that for you. One thing to note is that for the job script to actually work on a different computer, it needs access to those programs, so it kind of depends on shared storage between all your compute nodes. Maybe I should address that, I don't know; for us it works, but maybe for others it doesn't.
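To give a feel for the first record type, here is a minimal sketch of a process definition. The `process' form and its field names follow the description in this talk rather than a published API, and helpers like `gigabytes' and `minutes' are assumptions for illustration.

    ;; Minimal sketch of a process definition, following the fields
    ;; described in the talk; field names and the `gigabytes' and
    ;; `minutes' helpers are illustrative assumptions, not a fixed API.
    (use-modules (guix gexp)
                 (gnu packages base))   ; for the `grep' package

    (define-public extract-variants
      (process
       (name "extract-variants")
       (version "1.0")
       (synopsis "Extract variant lines from a VCF file")
       (description "Drop the header lines of a VCF file, keeping
    only the actual variant records.")
       ;; Package inputs are ordinary Guix packages.
       (package-inputs (list grep))
       ;; Data inputs can be anything: a string, a number, a list of
       ;; samples, or an association list used later in the procedure.
       (data-inputs "/shared/data/sample.vcf")
       ;; Run-time complexity in space and time.  This is plain Scheme,
       ;; so it could just as well compute twice the input file's size.
       (run-time (complexity
                  (space (gigabytes 2))
                  (time  (minutes 10))))
       ;; The procedure is a G-expression: #$ splices in the store path
       ;; of the grep package, which is built automatically when the
       ;; process is turned into a derivation.
       (procedure
        #~(system* (string-append #$grep "/bin/grep")
                   "-v" "^#" "/shared/data/sample.vcf"))))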
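And here, under the same caveat that the field names are approximations from the talk, is a sketch of the second record type: a workflow connects processes through restrictions, from which the dependency graph is derived.

    ;; Sketch of a workflow tying hypothetical processes A, B, C and D
    ;; together.  Each restriction reads "the first process depends on
    ;; the completion of the ones that follow".
    (define-public my-analysis
      (workflow
       (name "my-analysis")
       (version "1.0")
       (processes (list A B C D))
       (restrictions
        (list (list B A)       ; B depends on A
              (list C A)       ; C depends on A
              (list D B C))))) ; D waits for both B and C

From these restrictions, B and C can be scheduled in parallel once A completes, and building the derivation of each process yields the job script that is handed to the job control system.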
So I probably need to write a tutorial or something, because I think the easiest way to get down to how this works is to just try it out, and I did that: on this web page you can download the source code for my project. I actually forked Guix for this, but I want to include it in the upstream distribution, and I hope that when it's stable enough, one of the Guix maintainers will allow me to introduce the workflow management bits into Guix upstream. And if you have questions after this talk, or when you're at home and want to ask something, you can always email me.

I'd like to acknowledge a few people: Pjotr Prins, first, for letting me do this internship; Joep de Ligt, who was also involved in my internship; and Ricardo Wurmus and Ludovic Courtès for helping me. Ricardo gave me some very useful feedback on what the record types should look like, and Ludovic, well, you're just a great inspiration for almost anything. You'll probably see me around on the GNU Guix development mailing list and on IRC, so thanks to all of those who helped me. So, are there any questions?

Right, yeah, I thought about that; I even tried it. So the question is whether we can integrate the data with the Guix package management system. I suspect what you want is to execute the jobs immediately and put the results into the Guix store, right? My first version did exactly that, but we ran out of space on the Guix store pretty quickly. These are really huge data sets that we're processing, we're talking hundreds of gigabytes per data set, and if you put all of that in the store, you'll probably end up saying "hey, my store is pretty big, let's garbage-collect it", and at that point you could lose the results of your research, and you don't want that. So we're looking into letting some other system manage the data and archive it for us; actually, such a system is already available. Why not let that system handle the data management and let Guix handle the package deployment and the generation of the scripts, so that they can be executed easily with existing job control systems? So yeah, it's a trade-off, and I don't know what's the best way to go here, but this is the only way that works for us.

If I can repeat that question in a somewhat shallower way: would you like to push a container with all the deployed software to some other machine, run your data analysis there, and then get the results back? That's basically what you're suggesting, I guess, to avoid a shared storage system. I think we can do that, because we can spawn containers with Guix; we can pack everything into a container. But I also seem to remember that that's not really the way we want things to go, because then we're kind of giving up on the idea that we can efficiently distribute software. So maybe it's easier to, I don't know, fix your shared storage system? But we can definitely look into it, and I would like to look into making something like an application bundle. I know that's not great, but it would be kind of like a container, except of course we'd do it with Guix, not with Docker or something, and then we could distribute that to a compute node and run it there. Sure, I'd welcome you to write it; I don't know exactly how it would look, but yeah, maybe.

OK, so the existing tool we have is Perl.
We're looking into workflow management systems, and we tried the Common Workflow Language. I was the first to try it, and after the hello world I thought: hmm, I really need a programming language for my workflow definitions to be flexible enough to address every case. That's exactly what the Common Workflow Language is developing into: they included JavaScript in their declarative types, so they are effectively embedding a language in there. And I think the best language to do this with is Guile, together with Guix package management, because that's very important for us. As for the reception: nobody is writing workflows at our institute as of yet, because they're writing Perl code, and it works; we have these huge files of Perl code, where 700 lines is just the basics, and then comes the error handling and everything around it. But we're trying to get people started on writing workflows in this way, or in some other workflow language. First, though, you need to solve the software deployment problem, and Guix is the only thing doing that right, I believe, so that leaves this workflow language as the only language that addresses the whole thing. I'm pretty confident we're going to use it. Any more questions? OK, then I would like to thank you.