OK, so my name is Ludovic Courtès. I'm from Inria, a research institute in France in computer science and math. I'm also the co-maintainer of GNU Guix, the piece of software I'm going to talk about. It's actually pronounced "geeks", which I think is quite unusual to many, but that's the way it is.

Last year I already had the pleasure of talking in the HPC devroom: together with Pjotr Prins, we gave a general overview of Guix, and Pjotr talked about relocatable binaries, precisely for HPC. This year I want to talk about something more prospective that we'd like to work on, that we are already working on, but where we'd like to make more progress. This is about integrating deployment tools within scientific workflows.

Just a quick show of hands before I start: how many of you are already familiar with Guix? That's about half of the room, I would say. OK, so I'll do a quick introduction to Guix so you get a feel for what it's like, what it does, and what makes it different from other tools, at least for those of you who did not attend Kenneth's talk this morning, which I think did a great job of comparing all these different tools.

In a nutshell, Guix is essentially a package manager. I suppose most of you have already used Conda, Spack, APT or yum, that kind of thing, and Guix gives you essentially that kind of interface, like "apt install" or "conda install". So you have a command that allows you to install packages: GCC, Open MPI, GWL, for instance.

Then you have a command that is a little bit particular because it offers module-like functionality. If you're used to "module load", this is kind of similar: guix package --search-paths gives you the list of environment variables that you need to define to actually use the software you installed. You can just add that to your shell and the whole environment is ready to use. I find that pretty convenient; it has always tripped me up before that I forgot to set this or that environment variable, so I really like this.

Then there is another way to install packages: instead of the usual apt-style command-line interface, you can declare precisely what software you want available in your profile. You write a simple file that contains the list of packages you want, and you pass that file to guix package --manifest, which creates a profile with all those pieces of software. That can be quite convenient if you're moving between machines: you have a set of packages on your laptop and you'd like the exact same set on your supercomputer, so you just carry that file over and run guix package --manifest there.

And we have transactional upgrades and rollbacks, so you can try out new packages and new versions, and if something doesn't work and you have a paper due tomorrow, you can still roll back and that's fine: you have the exact same tools you had just before upgrading. That's essentially the main interface to Guix, I would say.

The cool thing about it is that it's reproducible. What do I mean exactly by reproducible? Let's say somebody on their laptop uses Guix to install some set of packages. Then somebody else on a completely different GNU/Linux machine (it's GNU/Linux only), as long as they're using the same commit of Guix, is going to be able to install the exact same programs, the exact same packages.
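To make the commands I just described concrete, here is the kind of session I have in mind; this is a sketch, and the exact package names, such as gcc-toolchain, are illustrative:

    # Install packages into your per-user profile; no root access needed.
    guix package -i gcc-toolchain openmpi

    # Print the environment variables to set in order to use the
    # software you just installed, module-load style.
    guix package --search-paths

    # Upgrades are transactional; if the new generation of the
    # profile breaks something, roll back to the previous one.
    guix package --upgrade
    guix package --roll-back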
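The declarative counterpart would look roughly like this; the manifest below is a hypothetical example, written using specifications->manifest:

    ;; my-packages.scm: the packages I want in my profile.
    (specifications->manifest
     '("gcc-toolchain" "openmpi" "gwl"))

Passing that file with guix package --manifest=my-packages.scm creates a profile containing exactly those packages, and you can carry the same file from your laptop to your supercomputer.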
And by the exact same packages, I mean the same at the bit level. The Guix project is actually involved in the Reproducible Builds effort along with Debian, Arch Linux, FreeBSD, lots of distributions, and we pay a lot of attention to making sure that we have bit-reproducible builds. I think that's a key difference from some of the other tools out there.

So yes, it's reproducible, and it's portable. Again, your cluster, your supercomputer, may run a completely different distribution than your laptop, but that's fine: Guix is self-contained, so it always creates its own set of packages, independent of what the host distribution provides. I should also mention that Guix, like Nix and like Spack is doing now, does mixed source and binary deployment: if binaries are available, which they usually are, then you just end up downloading those binaries, and if they're not, you build from source. It's completely transparent.

So we have this reproducible thing that you can replicate from one machine to another; that's bit reproducibility. But in science in general, I think we want to be able to do more than just replicate a software environment bit by bit. Sometimes you want to experiment, and that's one thing that Spack does really well. If you saw Todd's talk just before, there are lots of options you can add to Spack to specify how you want to build your package and which dependencies you want to use, and that's definitely useful in an HPC context where people want to try things out.

With Guix, we offer some flexibility at the command line. For instance, there is a package for hwloc, the hardware topology library, in Guix proper, and it's a fixed package: a given version, given configuration flags, and so on. But at the command line you can override that. This is a real use case, because some of my colleagues work on hwloc and sometimes want to try out release candidates. You can say: I want to build hwloc, but with a different source tarball, and you provide a file or a URL. Guix is going to use the same hwloc package recipe that's in Guix proper, except with that source. That can be quite convenient.

You can also say: I like that MUMPS solver package in Guix, except that I'd like to use different dependencies; say, the parallel version of Scotch, the graph partitioner. You can ask Guix to take that MUMPS package but rewrite its dependency graph such that any reference to the scotch package is replaced by a reference to pt-scotch, the parallel Scotch. It does that on the fly. You'll probably end up building from source, because the build servers don't have binaries for this particular variant, but you get the flexibility that many people need in HPC.

And one more example, which I guess we owe to Kenneth and the discussion about binaries and performance: for some pieces of software you need to choose at compilation time whether you want to build a generic version or, for instance, an AVX-optimized version.
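Roughly, those transformations look like this at the command line; this is a sketch, and the tarball name and the fftw-avx variant are made up for illustration:

    # Build hwloc from a release-candidate tarball instead of the
    # source that the recipe in Guix normally uses.
    guix build hwloc --with-source=./hwloc-2.0rc1.tar.gz

    # Build MUMPS, rewriting its dependency graph so that every
    # reference to scotch becomes a reference to pt-scotch.
    guix build mumps --with-input=scotch=pt-scotch

    # Same mechanism for the AVX case: swap in a hypothetical
    # AVX-optimized FFTW variant.
    guix build some-package --with-input=fftw=fftw-avx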
That last example is how you could address this kind of situation: Guix itself provides a generic version that works on any x86_64 machine, but if you want the AVX version, you can always rewrite the dependency graph so that the AVX variant of FFTW is used.

Guix started in 2012. It has almost 7,000 packages currently, and it builds on a number of architectures; we got AArch64 support recently. We have a continuous integration build farm that provides binaries at this URL. The last release was in December last year. I think we could be doing a better job of making frequent releases; we're working on it, but we had two releases last year, I think.

The new thing compared to last year is the Guix-HPC effort. Guix was not initially designed for HPC, not specifically at least, but it has good properties for HPC, like the fact that you can install packages as a user without being root, you have a lot of flexibility, and every user can decide which packages they install. That kind of property is typically quite useful on a cluster. We realized that several of us in the Guix community were already doing HPC work and using Guix in that context. So we created the Guix-HPC effort, initially started by Inria, the Max Delbrück Center for Molecular Medicine in Berlin, and the Utrecht Bioinformatics Center, later joined by people from Cray. We're trying to see what else can be done to make Guix a good choice for package deployment in HPC. We've already made a number of improvements in that area, and we hope to do more. So this is an invitation to join us: if you're curious about Guix and would like to see whether it fits your use cases, or if you'd just like to chat, you're welcome to have a look at the website here. Hopefully we can talk and do things together.

Right. I don't know if you had the opportunity to attend the Unix history keynote yesterday; I thought it was pretty interesting. Unix has this well-known motto of one tool per job: a tool that does one job and does it well. I think we all like that as a concept. It's a pretty good idea, because it means you have to create tools that can be composed together, and the general idea is sound; it makes a lot of sense to me. Yet in Unix, composition of programs is a bit limited: essentially you can share text through pipes or through files. So you can compose things like cat and wc, and that works well, but when you try to compose other pieces of software, it sometimes doesn't work so well, because you want to share higher-level concepts, not just streams of text.

I think this is what led to what I call the archipelago of tools that each do one thing. We can see it in HPC and on clusters: we have tools like, say, APT over here, which is used to deploy software on the main distro. Then somewhere else we have, I don't know, EasyBuild or Conda, which users or sysadmins use to deploy more software. Then somewhere around there we have the batch scheduler, which simply ignores the whole issue of software deployment. And then we have CWL or Galaxy, for instance, which are tools to run pipelines and workflows, and again, they mostly ignore the issue of software deployment.
I think that's a problem. I think we could do a better job if all these different tools could be made aware of software deployment. This is what we're trying to achieve with Guix: we're trying to put reproducible deployment at center stage and to provide tools that can be used to build applications around the notion of packages and software deployment. Let me give a couple of examples.

As I said before, Guix started as a generic package management tool: you can install software with it. But since Guix knows the whole dependency graph of packages, we can create other applications around it that make use of that information. For example, as a developer, one thing we often want is to say: OK, I fetch the source of the software I want to work on, let's say PETSc, and now I want to hack on it, so I need to set up the development environment for it. Well, we have a guix environment command where you can just say: I want to hack on PETSc. Guix already knows about PETSc and its dependencies, so "guix environment petsc" spawns a new shell where you have everything available to develop PETSc. I think that's pretty convenient as a developer. It's one case where we're just using information that's already available and building new tools with it.

Container provisioning is another case where there's a disconnect between the mainstream tools used to provision containers and package managers, and I think that's wasteful. If you're using Docker, the general approach is to start from a base image, say a Debian image, and from there you have a Dockerfile that runs a bunch of apt and conda commands, for instance. That does the job, but it's not reproducible. Why is there this disconnect? Why can't Docker be made aware of the package graph? I think we can do a better job. So with the guix pack command, we have a generic tool that, again, uses information already available in Guix, namely the dependency graph, and is able to create bundles. The bundle can be a generic tarball that you extract on a different machine, or it can be a Docker image. Here we end up with a Docker image that contains hwloc and all of its dependencies. From there I can take that image and pass it to someone who, well, maybe hasn't seen the light yet and is not using Guix, and they can still use that software with the tools they're familiar with.

Now, I must say I'm a bit of a programming-language geek, so I'm going to attempt a somewhat risky exercise, which is to talk about functional programming and Scheme in an HPC track. I hope that will work; we'll see. I couldn't help but tell you a little bit about the underpinnings, in terms of programming-language features, that make all of this possible.

Guix itself is written in Scheme, a functional programming language of the Lisp family. In Scheme we have expressions. This is a Scheme expression; as you can guess, it simply invokes the lstopo program. It uses the system* function, which is roughly like the system function in C. Now, if we add a quote right here, then what we get is an abstract syntax tree: an unevaluated expression, so to speak, or a staged expression.
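To make the guix environment and guix pack examples above concrete, here is roughly what they look like; this is a sketch, and the package names are illustrative:

    # Spawn a shell providing everything needed to hack on PETSc:
    # compilers, MPI, and the rest of its dependencies.
    guix environment petsc

    # Build a Docker image containing hwloc and the complete closure
    # of its dependencies; no Dockerfile and no base image involved.
    guix pack -f docker hwloc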
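The Scheme expression in question is along these lines, first evaluated, then quoted:

    (system* "lstopo")     ; evaluated: runs the lstopo program right away
    '(system* "lstopo")    ; quoted: an unevaluated expression, plain data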
So I can take this quoted expression, write it to a file, or send it to a different machine. But of course, if I send it to a different machine and want to evaluate it there, it may well not work, because this "lstopo" thing actually means something: it refers to software. The expression doesn't capture the fact that I'm referring to the lstopo command, the executable, if you like. So we need something more.

In Guix we extend the programming language with an additional staging notation, this hash-tilde thing, where I'm saying: I'm building an expression, but it refers to a specific hwloc package. In other words, we have an expression that is aware of its own deployment needs: it carries information about what software needs to be deployed for it to actually work. This hwloc here is a reference to a variable, and that variable describes a complete graph. That is what's going to be deployed when I evaluate the expression, say, on a remote computer. So we've put deployment within the language, in a way. And once we have that, we can start building applications around it.

One of them is the Guix Workflow Language (GWL), currently being developed by Roel Janssen of the Utrecht Bioinformatics Center; I don't know where Roel is. It basically uses this feature to make deployment a core part of the workflow language. I don't know if there are bioinformaticians in the room. Yeah, OK. I'm not a bioinformatician myself, so let me try to explain the situation: in bioinformatics, people run big pipelines. They have lots of different tools written in lots of different languages, and they compose them to build analyses of DNA sequences and things like that. They basically have a graph of tasks that need to be performed using all these pieces of software: as input you have data, then you have this big graph that uses tons of pieces of software, and as output you have the end result, the computational results. But for each of these boxes, you actually need to deploy software first.

In bioinformatics, the de facto standard is CWL, the Common Workflow Language, which lets you describe workflows. So far so good, but again, it doesn't take care of software deployment by itself: you're not really able to express the fact that you need not just to run software, but also to have it deployed first. That's something GWL solves in a pretty nice way, I think.

To give an example, this is a GWL process definition: essentially, we use a bunch of modules that define package objects, workflows, and so on, and then we define a process with specific dependencies. When we say Python here, we're referring to the Python package defined in this very module, so we're referring to precisely one Python instance; same for the samtools thing. Then we provide, in this case, a Python snippet that does the actual computation. From there, we can run the GWL workflow, and it's going to submit the whole set of tasks using, in this case, the Sun Grid Engine batch scheduler: it submits them on your supercomputer and makes sure that everything is deployed and accessible on the compute nodes. I think it's pretty nice to have this capability.
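To recap the staged notation in code: a G-expression along those lines looks roughly like this, a minimal sketch in which #~ builds the staged expression and #$ splices in the hwloc package object:

    ;; A staged expression that carries its own deployment information:
    ;; `hwloc' is a package object, i.e. a complete dependency graph,
    ;; and #$hwloc becomes its store file name at build time.
    #~(system* (string-append #$hwloc "/bin/lstopo"))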
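And a GWL process definition might look roughly like the following. I am sketching from memory of GWL's early documentation, so module and field names may differ, and the samtools invocation is made up for illustration:

    (use-modules (guix processes)
                 (gnu packages bioinformatics))   ; provides `samtools'

    (define-public count-alignments
      (process
       (name "count-alignments")
       (version "0.1")
       (package-inputs (list samtools))   ; one precise samtools package
       (data-inputs "/data/sample.bam")   ; data is handled outside the store
       (procedure                         ; the code this task runs
        '(system* "samtools" "flagstat" "/data/sample.bam"))))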
OK, I'm going to wrap up. If we take a step back, reproducible science is becoming more and more of a hot topic. This, for instance, is Nature insisting that people should provide code, and not just data; this is, I guess, another Nature piece, and you get the point. The ACM, in computer science, is also insisting that it's not OK to just have a paper claiming you have code that does great things: you need to have the code, and people need a way to reproduce the results. And there are initiatives like the ReScience journal (there are many others, but this is the one I happen to know best) that essentially seek to reproduce experiments described in papers. To do that, you first need to have the source code of the software involved, and you need to be able to build that software, which brings me to my conclusion.

If we look at the whole reproducible-science spectrum, at one end we have initiatives like ReScience: scientists trying to reproduce experiments and to fiddle with them. At the other end, we have initiatives like Software Heritage, which protects software source code from disappearing. If we have both, that's already quite nice for reproducible science. Of course, it takes more than this to make science reproducible, but these are very useful tools. And you see what I'm getting at: we're missing something in between, for deployment. What if Guix, for example, could be used to make reproducible deployment available, so that we could build those ideal reproducible-science workflows? And this is it. If you have any questions...

All right, let me repeat the question. Essentially: why did we create a new workflow language? This is actually not my work, but I think the main motivation for creating the Guix Workflow Language was to incorporate deployment within the workflow language itself. From my understanding, there are two main ways to use CWL: either you do deployment by yourself, and CWL doesn't care at all, or you use Docker images, and in a CWL spec, I think, you can specify a Docker image, but it's still up to you to actually create that image and deploy it on the nodes, and so on. So I can understand your argument, which is that CWL already exists and there are lots of tools around it, and yes, that is great. But I think we have an opportunity here with GWL to make things simpler, to have deployment be part of the whole workflow-management process. I think that's pretty useful, even though in some cases people already have their CWL workflows and they're not going to switch overnight. Yes?

So the question is: how is data managed in GWL? We had a discussion about this, and I'm not an expert on GWL's design, but basically the issue was: how should we treat data? Should we put it in the store, that /gnu/store directory that Guix uses, since it's content-addressed, at least for data? That was one option, but the problem is that data is typically very large for bioinformaticians, and they didn't want the bottleneck of a central store. So data is treated out of band from a Guix perspective. I don't know the details, so maybe we should discuss afterwards, but I think GWL arranges to have data stored outside of the store and just makes sure it's available on the compute nodes.
[Audience question, partly inaudible, about why Guix has its own continuous integration system instead of reusing an existing one.]

So the question is: why do we have specific continuous-integration tooling? Did I get that right? Well, continuous integration is very important to us, because we need to be able to distribute binaries of all those packages. We need servers somewhere building packages in advance, so that when users actually try to install them, the binaries are already available. That's why we do continuous integration, and also, obviously, for QA. I think that's the answer. Thank you.