Our next speaker is Matt Williams. He has worked on the computing infrastructure of CERN, and he's going to talk about how the LHC's computing grid works. So please give a warm round of applause for Matt Williams. Thank you. So I'm Matt Williams. I recently finished my PhD in particle physics; I was working on the LHCb experiment at the LHC for four years, recently graduated, and I'm now at the University of Birmingham working on computing resources for the scientists who are themselves doing the analysis now. And it's as part of that work that I'm helping to develop this tool, Ganga, which is an interface used by scientists to access the huge amount of computing power and storage available to them as part of the LHC computing grid. So a brief little update in case anyone here doesn't know anything about CERN or the LHC. It's the world's largest particle physics experiment, or at least the world's largest man-made one. It's arguably the world's largest man-made structure as well, being a 27-kilometre-long ring underground in a tunnel dug specifically for the purpose. It's a proton collider, so it's accelerating protons to near the speed of light and colliding them together at four locations around the ring, and at each of those there's a detector which studies the outputs of those collisions and analyzes the data that's given to it. Given the huge number of collisions that are happening every second, billions and billions, it's outputting a huge amount of data. I mean, the amount of data it is producing is way beyond what we would actually be able to collect, but the stuff we do collect to date adds up to something like 200 petabytes, though it's probably already a bit higher than that, and it's only going to grow as the accelerator gets more and more powerful in the future. So in order to be able to process that huge amount of data, alongside the design of the LHC was a corresponding project called the grid.
The idea of this was to produce a computing environment which would be able to handle the large amounts of data and processing power that would be required. It works in a tiered system: at CERN there's a central hub, the tier zero grid site, which has a large amount of computing power. From there, things fan out down to a single site in each country that's involved in the LHC. There are about 12 or 13 of those tier one sites spread around the world, one in each country that's involved. And at the level below that are the tier two sites; there are around 160 of those. Each of those is generally something like a university or a research institute. There will be a dozen or so in each country, for example; some countries have more, some have less. And it's at the tier twos and the tier ones where the largest amount of data processing is done. And the sort of data that we study at the LHC, in the sort of analyses that we do, really does lend itself to this sort of distributed nature. If you're doing an analysis, you tend to end up with a list of collision events; maybe you've got 10 million or 100 million events you want to look at. You can very easily take a small chunk of those and process them independently of any other chunk of data. There's no real interaction between the events. So you can very easily chunk it up, send that out to wherever it needs to go, and then collate the results at the end. So as I say, the project evolved alongside the LHC. Even in the early days, well before the LHC actually started, people were looking into building these computing systems to provide the services to the scientists that need them. So in 2001, the LHCb experiment started work on Ganga. This was their in-house interface to this grid infrastructure. Each of the other experiments was also working on its own project in order to interface with the grid.
Since everyone was convinced that they had their own special problem that only they could solve in the way that they needed it done. However, LHCb's Ganga was designed in Python with the explicit goal of being pluggable and extensible, and so it was very easy in the intervening years to take the parts of it that were LHCb-specific, remove them, and allow other experiments on the LHC to plug in the small pieces of experiment-specific logic that are needed. So on the ATLAS experiment, there's a number of scientists who are using Ganga for doing their data analysis. And in fact outside of that whole ecosystem as well, there's the T2K experiment, a neutrino experiment in Japan; some of their scientists are using Ganga for interfacing with the grid resources which are provided to them as well. Of course, all the software that we create at CERN, or as far as I know all of it, is completely open source. Ganga itself is GPL, and the vast majority of software that comes out of CERN is GPL or other more liberal licenses. So how does it actually work? If a scientist has a bit of code they want to run, they can use this tool, Ganga, to interface with the grid system. Or in fact not just the grid system; they can interface with any other system that Ganga has an interface to. So in this case here, you see on that second-to-last line we're setting the backend to be Local. That's telling the Ganga system: don't run this on the grid, just run it here on my machine. That's something that's often done by scientists when testing a bit of code. If you've just written a new piece of analysis software, you don't want to immediately throw it up onto the grid infrastructure, run it 10,000 times, and have it crash within three seconds because of some bug you've put in. So it's a good idea to test it locally on a small set and then later on submit it up to the grid.
So it all centers around this job object at the top. You can set some parameters on it. Here we're setting the name parameter to give us a string which we can use for bookkeeping, keeping track of which jobs we used for what, since all the job information gets stored in a persistent database where you can see it all later. The real workhorse of the job system behind the scenes is the application. The application defines what is actually going to be run, wherever it ends up running. In most cases you just want to run an executable. It can be an executable binary, or it can be a Python script, or in this case it's just a small shell script. So you just say to Ganga: this is the thing I want to run, this is the actual code that's going to execute, and this is where you can find it, in this file here. In this case, the script is just going to create a file called out.txt, and so we're telling Ganga the output files from this job: these are the ones that are going to be made by it, and these are the ones we want to make sure end up back where we are now. We want to make sure we've got a copy of those in our local output directory, wherever the job was actually run and wherever that file was originally created. And so we specify that it's a LocalFile: in output files, LocalFile means copy it back to where I am locally. Once we've set up our job object, we just call submit, and at that point the Ganga subsystems come into play. The monitoring loop comes in and starts submitting the job to the system. In this case it's just going to start up a local shell instance somewhere else on your computer, but if you were accessing the grid it would be uploading it to the grid somewhere. It then keeps track of its status and makes sure it's downloaded any output files at the end of the job. So once it has finished, you can just access the output files directly inside the IPython-based Ganga user interface.
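To make the flow just described concrete, here's a runnable sketch that mimics the objects on the slide. Job, Executable, LocalFile and Local are real Ganga class names, but the implementations below are illustrative stand-ins written for this example, not Ganga itself:

```python
import os
import shutil
import subprocess
import tempfile

class Local:
    """Stand-in for the Local backend: run the executable in a scratch dir."""
    def run(self, exe_path, workdir):
        subprocess.run(["sh", exe_path], cwd=workdir, check=True)

class Executable:
    """The application: the thing that is actually going to be run."""
    def __init__(self, exe):
        self.exe = exe

class LocalFile:
    """Declares an output file to be copied back to the job's output directory."""
    def __init__(self, name):
        self.name = name

class Job:
    def __init__(self):
        self.name = ""
        self.application = None
        self.outputfiles = []
        self.backend = Local()
        self.outputdir = tempfile.mkdtemp()  # persistent output location

    def submit(self):
        workdir = tempfile.mkdtemp()  # "wherever the job actually runs"
        self.backend.run(os.path.abspath(self.application.exe), workdir)
        for f in self.outputfiles:    # copy declared outputs back to us
            shutil.copy(os.path.join(workdir, f.name), self.outputdir)

# Usage mirroring the slide: a tiny shell script that creates out.txt.
with open("myscript.sh", "w") as fh:
    fh.write("echo hello > out.txt\n")

j = Job()
j.name = "test job"
j.application = Executable(exe="myscript.sh")
j.outputfiles = [LocalFile("out.txt")]
j.backend = Local()
j.submit()
print(open(os.path.join(j.outputdir, "out.txt")).read().strip())  # prints "hello"
```

In real Ganga you'd type the last few lines at the IPython prompt; the stand-ins here just make the copy-back behaviour of LocalFile visible.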
So you can just call the peek method on the job object you just had, and it basically does an ls of the output directory. You see it's created a file for the standard out and the standard error, and most importantly the out.txt we asked it to give us. And if you want to peek further than that, you can pass the name of one of those files to peek and it'll open up a pager directly inside IPython, and you can have a scan through and look at the output files to make sure that everything worked the way you wanted it to. Obviously that was just a toy example. There's nothing more going on there than simply running a local script on your local computer and looking at the output file. So it'd be good to be able to leverage the power of the grid. And it's as simple as changing the backend in that last step from Local to LCG, where LCG stands for the LHC Computing Grid; it's the acronym we use for it. So with one small change from one line to the other, you could run exactly the same script, and that code would be uploaded to the grid system. The grid system would take over, distribute it, and run the code wherever it ends up running. You don't even worry. It could be in China, it could be in America, it could be in Amsterdam, it could be here in Berlin. It could be anywhere. And it's completely seamless to the user. At the end the data will be copied back and everything is the same; you don't have to worry about it. But Ganga's more than just that. It's more than just running stuff locally and on the grid. It can interface with anything that you can access via an API, basically. So there's a series of backends; you see here PBS, LSF and SGE. Those are batch systems. Universities have often got a local batch system or a batch farm of some kind which they use for running jobs, which is somewhere between running on your local computer and uploading to the grid.
And again, you could just change the backend to PBS and it would be submitted to your local farm, and you wouldn't even have to worry about any of the details. This last series here is a set of experiment-specific backends. Various experiments have got their own middleware interfaces sitting between Ganga and the grid, making an onion-layer type situation, to provide extra features that maybe that experiment particularly needs. But again, it's all a black box as far as the user is concerned. You don't have to worry about what's going on; it's just going to work. So now that we're using the grid, it'd be good to really make use of the huge amounts of power it provides. Let's say, for example, you have sitting on your local hard disk a directory containing a whole load of files. Maybe you've got 3,000 files or something in there, each of them some number of megabytes, so it's adding up to a gigabyte of data or something like that. So there's a lot of data you're going to want to analyze. You can tell Ganga that these are the input files you want to run your job over. From that point on, Ganga will keep track of those input files. It'll make sure they get copied to wherever the job runs; whether that's locally, on your batch system, or out on the grid, it'll make sure those files end up where they need to be. Of course, if it were left at that, it would be pretty useless, because you'd be taking one huge chunk of files, copying them to one place on the grid, and they would just be run on one single compute node somewhere. It'd be good to be able to distribute it around and make sure we're running things in parallel. And Ganga provides a tool for this called splitters. So again, you define on the job object a splitter parameter, and in this case we can use the SplitByFiles object. This is an object which knows how to split the files up into smaller sets of data. It simply takes one parameter, files per job.
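The splitting behaviour just described can be sketched in a few lines of plain Python. This is an illustration of what SplitByFiles does conceptually, not its actual implementation, and the file names are made up:

```python
def split_by_files(inputfiles, files_per_job):
    """Chunk a flat list of input files into one list per sub-job."""
    return [inputfiles[i:i + files_per_job]
            for i in range(0, len(inputfiles), files_per_job)]

# 3,000 hypothetical input files, 10 per sub-job -> 300 sub-jobs.
files = [f"data_{n}.root" for n in range(3000)]
chunks = split_by_files(files, 10)
print(len(chunks))  # 300
```

Each chunk then gets the analysis script attached and is submitted as its own sub-job; the last chunk may hold fewer files if the list doesn't divide evenly.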
So it's going to take this list of however many thousand files you have, chunk it up into chunks of 10 (or maybe fewer files if there aren't enough to fill a chunk), take each of those chunks, add in the analysis script that you want to run with it, and submit it, and the grid will put it somewhere. It'll take the next 10, that'll go up, and that'll be sent off somewhere. And it'll keep doing that all the way through the list, and you'll end up with some number of hundred sub-jobs that Ganga will keep track of for you. So you won't have to worry about how many sub-jobs are made or doing it manually; it's completely automated. At the end, each of those sub-jobs is going to create a histogram.root. ROOT is a file format we use at CERN that's basically a table of data as far as this stuff is concerned; it can also contain histograms and so on, but it's basically a table of data. Specifying LocalFile here isn't saying that the file is going to be made locally. It's being made wherever the job is run, and you don't care where that is. But you're asking Ganga to copy it back to your local computer so you can have a look at it, open it up in your text editor or analysis software or whatever you are going to use to analyze the data. But even that's not ideal, because you're going to end up with however many hundred copies or variants of this histogram.root. They'll get put in a sub-directory structure, but still, they're going to be separate files that you're going to have to go through manually and look at. So to solve that problem, Ganga provides something called a merger. Well, it provides a whole suite of mergers, but the one in particular here is the RootMerger. This is a little bit of Python code which understands how to concatenate ROOT files together. It knows how to stick them together and combine them into one single file. And this again is completely automated. Once the job's been uploaded, it's been split up, sent out all over the world.
Ganga's downloaded all of the results from each of the single sub-jobs. Once they're all downloaded, Ganga will automatically kick in, combine them together, and turn them into one single file which you can then look at. So from that point of view, you don't even have to worry about the fact it was split. You started off with one single analysis script and one single set of data; Ganga split it and merged it; you ended up with one result. You don't even have to worry about the fact that it was distributed around. It's completely seamless. There is much more than just the RootMerger. You can write any sort of merger you might wish, anything which post-processes the data, basically. There's a class in Ganga which lets you pass in a single function which simply takes the output file directory, and you can do whatever you like in there. You could, for example, look at the log file for each output job, grep through it for a particular string, and find the average of the numbers or something like that. You can do anything you can think of to post-process your data. So once you've been working at CERN for some number of years, you're probably going to have submitted several thousands of these jobs over your lifetime. Many of them you'll have deleted, because maybe they broke, but many of them you're going to want to keep around for the log files, to check that things still work the way they used to and to make sure your results are reproducible. So Ganga provides a persistent database of all the jobs you've ever run through the system. You can see here the three jobs that we've submitted so far. The first one we just ran locally; that's all finished, so it's showing up there as completed. The last two, because they were sent off to the grid, have been distributed around and they're still running. You see here that for each of them, Ganga has created 324 sub-jobs. That's how many it decided to split it into. You don't worry about the number too much.
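A custom post-processor of the kind just described might look like the sketch below. The function name, the log-line format, and the directory layout are all hypothetical; the point is only the pattern of a function that receives an output directory and does whatever it likes with the sub-job results:

```python
import os
import re
import statistics
import tempfile

def average_passed_events(outputdir):
    """Grep each sub-job's stdout for 'events passed: N' and average the Ns."""
    values = []
    for sub in sorted(os.listdir(outputdir)):
        with open(os.path.join(outputdir, sub, "stdout")) as fh:
            for line in fh:
                m = re.search(r"events passed: (\d+)", line)
                if m:
                    values.append(int(m.group(1)))
    return statistics.mean(values)

# Fake three sub-job output directories to run the sketch against.
top = tempfile.mkdtemp()
for i, n in enumerate([10, 20, 30]):
    os.makedirs(os.path.join(top, str(i)))
    with open(os.path.join(top, str(i), "stdout"), "w") as fh:
        fh.write(f"events passed: {n}\n")

print(average_passed_events(top))  # (10 + 20 + 30) / 3 = 20
```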
You just have to know that they're there. We don't have any more detail here about which of the sub-jobs are running, which are finished, or anything like that; this is just a very high-level overview. But it's very possible to get that information, because Ganga provides full API access to everything inside it. So inside the Python interface you can access any of the information. You can access job information, you can resubmit things, you can do anything you want. So, most simply, we call that jobs object again, like we had in the last slide. We give it a parameter: we ask for job number two, which as you see is the bottom one here, the merger job, which as far as we're concerned is overall still running. We ask for its status, and again it tells us it's running; it's the same information. We can delve in a little bit deeper, though. We can ask that job for a list of all its sub-jobs, via the .subjobs parameter. That's going to give us a list of jobs. We can loop through each of those sub-jobs and ask each of them what their status is. We get a list of all the ones that are completed, we find the length of it, and we find that 24 of those 324 sub-jobs have finished so far. If we waited half an hour and ran it again, it would be a higher number, because Ganga is constantly keeping track of how many sub-jobs are finished. But jobs aren't always just running or finished. Quite often you'll get random failures. On the grid, your data will be sent to run at some particular site, and it could fail without any real reason. Maybe there's an out-of-memory error at that particular location, or things like that. So as long as some of your jobs have passed, there's a good chance that for those that failed, it was simply a transient failure.
So you can loop through all the sub-jobs once more, check whether the status is failed on each particular sub-job, and resubmit it, and it will go back into the monitoring loop and keep going around, and eventually it will be re-downloaded once it's finished. And this is the sort of thing you might want to do quite regularly. You might want to have a function defined which loops over a job object, checks all the sub-jobs, and resubmits the failed ones. You can take any bit of Ganga code, stick it inside a function inside a .ganga.py file in your home directory, and all those functions will automatically be available inside the user interface, which is based on IPython; it's a slight fork of IPython to provide this sort of functionality. So the last thing I want to talk about is dealing with very, very large files. In the example I gave at the beginning, I was saying you might have a directory on your computer with something like 1,000 files in it; even if each of those is only some number of megabytes, you're ending up with gigabytes of data. And in fact, quite often when you're doing data analysis at the LHC, you're going to be dealing with at least gigabytes, if not terabytes, of data that you want to run your analysis over. So it's nice not to have to keep those files on your local computer and upload them every single time you want to do an analysis over them. And then at the end, if the output's big, you don't always want to have to download the output. Maybe you just want a summary file; maybe you just want the number of events that pass some sort of criteria. So as well as being a distributed compute network, the grid is also a distributed file system, or at least it provides a number of distributed file systems. The one in particular here is using the Dirac file system, which is again originally an LHCb-specific grid interface.
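The resubmit-the-failures helper described above might look like this. The job and sub-job objects here are stand-ins with just the attributes the pattern needs; in real Ganga you'd put such a function in the .ganga.py file in your home directory and call it on a real job from the prompt:

```python
class SubJob:
    """Stand-in sub-job: just a status and a resubmit method."""
    def __init__(self, status):
        self.status = status

    def resubmit(self):
        # Back into the monitoring loop; Ganga would re-upload it.
        self.status = "submitted"

class JobStub:
    """Stand-in job holding a list of sub-jobs."""
    def __init__(self, subjobs):
        self.subjobs = subjobs

def resubmit_failed(job):
    """Resubmit every failed sub-job of a job."""
    for sj in job.subjobs:
        if sj.status == "failed":
            sj.resubmit()

j = JobStub([SubJob("completed"), SubJob("failed"), SubJob("running")])
resubmit_failed(j)
print([sj.status for sj in j.subjobs])  # the failed one is now "submitted"
```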
But the important point here is that it deals with a remote distributed file system. You don't have to worry about where these files are; they're out there, in a sense, in the cloud. So for the input files here, we tell Ganga that we want as our input a file called input.root, and we're saying Dirac knows where it is. I don't know the exact physical location, but the file catalog knows where to find it. For the output file, my program is going to create a file called histogram.root. That's going to be made locally on the worker node, wherever my job is run. And I don't want that copied back to my computer here; I want you to send it off to the remote storage. That'll keep track of where it is, that'll keep a record. I can access it later if I want to, but for now I don't want to be dealing with all that network traffic coming up and down. And in fact, it can be even a little bit cleverer than that. The Dirac backend, which is basically a layer on top of the LCG backend, has got a bit of extra logic in there to deal with this sort of file system access. One of the clever things it can do: with this exact script here, you submit it, and Dirac will automatically take your analysis program, look around, find the physical location where input.root is stored, send the job to that site, and run it there locally, rather than submitting the analysis script somewhere and copying the files over. It will try to automatically reduce the amount of copying that's going on, in order to make things as efficient as possible and avoid clogging up network bandwidth. In the same way, the output is going to be stored somewhere, and so you could then run a second job. You can chain together jobs. You can say: I want the output of job one to be the input of job two. You just have to pass in input files equal to DiracFile('histogram.root').
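The data-aware scheduling idea just described can be illustrated with a toy broker. The catalog mapping and the site names below are made up, and the real Dirac brokering is far more involved; this sketch only shows the principle of sending the job to where its data already lives:

```python
def choose_site(input_files, catalog, default="any"):
    """Toy data-aware brokering: pick a site that already holds all the
    job's input files, so nothing needs copying; 'catalog' is a
    hypothetical {filename: {sites}} replica map."""
    candidates = None
    for f in input_files:
        sites = catalog.get(f, set())
        candidates = sites if candidates is None else candidates & sites
    return min(candidates) if candidates else default

catalog = {
    "input.root": {"CERN", "Birmingham"},
    "other.root": {"Birmingham"},
}
print(choose_site(["input.root", "other.root"], catalog))  # Birmingham
```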
That job will be submitted to the grid. It will go up, look around, find out where histogram.root was saved to, and again it will be sent and run there. You never have to deal with those files on your local computer at all. You let the experiment's infrastructure deal with all that storage and file management. So using the grid like this, you can deal with hugely large files without ever having to handle them yourself. Of course, for each sub-job you'll get back a standard out file and a standard error file so you can make sure your jobs are running correctly. You can always have some files being sent off to Dirac, some downloaded locally, some sent off to some mass storage place. You can have as many input and output files as you want, coming from whichever source you want, as long as Ganga has an interface for it. And Ganga being extensible, you could very easily write a new plugin which dealt with any other file system type that you might want to use. We do have a file type which uploads things to Google Drive, for example. Quite often people just want to be able to share files via Google Drive, and so you can upload and download files from there. So you can write an interface to basically any infrastructure you might want to be using yourself. You can find out more information at the website, cern.ch/ganga. Like I said, all the code is completely open source, so you can go to the download link and have a poke around the source code. The project was started in 2001, which for reference is about the time that Python 2.0 came out. So some of the code has been around quite a while, but on the whole it's quite readable and you can see what's going on as far as the job flow goes. So take a look at that if you want to have a little poke around, and thank you. Questions? Thanks for the nice talk and the nice tool. I have two questions. The first question is: can you target some schedulers, such as Slurm, using the library?
Yeah, I don't know if there's a Slurm backend yet, but there are ones for Condor and Torque and so on, so there could easily be one for Slurm. I mean, it's a simple case of writing a bit of code that calls the right commands at the backend. So yeah, Slurm could absolutely be interfaced with if it's necessary. Right, and my second question was... I forgot the second question. Okay, I'll ask you in the break. Okay. And we've got another question over here. Thank you. On your merge slide, can you go back there? Yeah. I don't understand the line with the j.inputfiles where there's actually a list comprehension on... so in this case, inputs.txt is an index file. Is open overloaded somehow, or do you get a file handle? In this case, inputs.txt contains a list of file names. It's an index file containing a list of all the file names that you want to include as the input for your job. So you're going to loop over each of the lines in that file, each of which is a string, which is the name of a file. But that's not what open usually does, is it? You just get a file handle. If you have open(...).readlines() or so, you get, of course, the lines or whatever. When it's looped over in this list comprehension, it does produce the list of the files. Each one has a \n at the end, but it does work; I did check this line. Yes. Hello, thank you. I wanted to know how you handle code that runs in parallel and needs to communicate with other processes or on different computers. So like inter-process communication between analysis jobs and things, or inter-network communication. On the whole, there's very little scope for communication. I mean, Ganga is blind to that: if you submit to a supercomputer which has got some inter-process communication that you need to do, or some sort of communication of any kind, it will handle that, because Ganga doesn't care about it.
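The point under discussion is that iterating over an open file handle in Python yields its lines one at a time, newlines included. A minimal illustration with hypothetical file names (adding a strip() to drop the trailing newline the questioner mentions):

```python
# inputs.txt plays the role of the index file on the slide: one input
# file name per line.
with open("inputs.txt", "w") as fh:
    fh.write("a.root\nb.root\nc.root\n")

# Looping over the handle yields "a.root\n", "b.root\n", ...;
# strip() removes the trailing newline.
names = [line.strip() for line in open("inputs.txt")]
print(names)  # ['a.root', 'b.root', 'c.root']
```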
Generally, jobs on the grid don't have any sort of communication between them; each job's siloed, very much so. So I suppose you don't submit jobs that need to run across multiple processes? Mostly not, no. Not in the sort of work that we do. Okay. How does Ganga find files? I mean, you can't just use the name, right? So each of LocalFile and DiracFile, for example, has got a little bit of logic in there. LocalFile, by default, will look in the working directory that the user's in. As for DiracFile, well, obviously you can't just say input.root and have it magically know where it is. By default it's authenticating with the Dirac system, and each person's got a local user area, so it will look in their user area for the file. And likewise, output will be saved to their user area on the file system. Yes, yes. You can overwrite files from a previous job and things like that. You can give multiple output directories and stuff. Yes. Okay, thank you again, Matt.