Many, many petabytes of data is what we're talking about today. Excuse me, terabytes, I'm getting mixed up here. Anyhow, here's James Wrigley, who works for the data analysis team at the European XFEL, that is the latest generation of light sources, and they have really big loads of data to handle. He's going to tell us today how they do it, and actually he wrote that one of the things that warms his heart is htop showing all cores running at 100%. This is what I suppose this is going to be. So here we go, James, it's yours. Alrighty. Thank you very much, Andre. So yeah, my name is James, and I want to talk about how to grok, or understand, an experiment at an XFEL. Which basically means that I'm going to try and walk you through an experiment from the perspective of the facility. So I'll talk about the hardware that we have, the data that it creates, and the systems that we've built to manage all of that data. I'll start off with a little personal background. I am originally from New Zealand. I started working in the data analysis group at the European XFEL this year, and I'm what's called an instrument contact. That means that I'm assigned to one of the six instruments at the European XFEL to help them with their specific analysis needs. And so far, that has mostly involved working on things related to online analysis, which is the real-time analysis that's used to guide an experiment, but we'll get to more of that later. From day to day, the most common languages I use would be Python, along with C++, and with regards to tools, I generally go for Emacs. So some of you may perhaps not be familiar with what an XFEL is, so I'm going to give a quick introduction to that. XFEL stands for X-ray free-electron laser.
It's the latest generation of what are called light sources, and it's basically a massive machine that is designed to create very short and very intense pulses of X-ray radiation, which are used to study various physical processes. There are other kinds of light sources, such as synchrotrons, for example, but there are some things that set XFELs apart from those. For one, the very short pulse length: this is often on the order of femtoseconds or tens of femtoseconds. And every one of these X-ray pulses is really intense, so there are lots of photons in there. The radiation is coherent, which is very important for experiments. And XFELs are also capable of a really high repetition rate, on the order of a megahertz, so that means we're able to create a lot of pulses in a very short amount of time. On the right-hand side, I have a little picture of one of the X-ray detectors at the European XFEL. This is the DSSC, and we'll get back to it in a bit, but here's just an example of the kind of data that you could expect to collect at an XFEL. So that brings me to the European XFEL, which is the world's most powerful X-ray free-electron laser. It's located in Hamburg, and the whole thing is about 3.4 kilometres long. It all starts at the DESY site in Bahrenfeld, where there's a machine called an electron gun. What this does is inject electrons into a linear particle accelerator, which is located in this section. And I have here a little picture showing the inside of the tunnels; the entire facility is underground. The electrons travel through these superconducting RF cavities, which accelerate them to very high velocities. After they've been accelerated, they enter what we call the undulator systems. This is a picture of one of the undulators, and the electrons travel through this little silvery tube here. An undulator is basically an array of magnets.
The magnetic field that these magnets create, which sit on the top and bottom of these metallic slabs, causes the electrons to wiggle, and that motion, that acceleration, creates X-ray radiation. And that's the thing that we really want. After the radiation is generated, the electrons are dumped; we don't need them anymore. But the radiation continues down these tunnels into the experimental hall at the Schenefeld site. And here I have a picture of one of the experiment hutches in the experiment hall. There are six different instruments at the European XFEL which focus on different kinds of science, and each of them has its own experiment hutch with its own detectors and equipment and so on. Here you can see an instrument scientist looking into this little box: this is the sample chamber. So that's where they would place the samples. The radiation comes in through the tunnels from the right-hand side, it hits the sample in here, and then the result of that is recorded by an X-ray detector. So that's the facility overview, a very simplified version. Let's get to some of the technical details. The accelerator that we have is capable of delivering laser pulses at 27 kHz, which means 27,000 pulses of radiation per second. But they don't arrive continuously; they arrive in batches, and these are called trains. In this diagram you can see something about the structure of the trains. The trains arrive at 10 Hz, ten times every second, and the X-ray pulses come packed within these trains. The repetition rate within a train can be as high as 4.5 MHz for the European XFEL, and this is really unique, and it's really important for certain kinds of experiments. The pulse lengths are also very, very short, on the order of tens of femtoseconds or even lower. One thing that you probably noticed is that this whole machine is extremely complicated.
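The train structure described above can be checked with a bit of arithmetic. This is just a back-of-the-envelope sketch using the numbers quoted in the talk (27,000 pulses per second, trains at 10 Hz, up to 4.5 MHz within a train), not facility code:

```python
# Back-of-the-envelope check of the pulse-train structure.
pulses_per_second = 27_000   # quoted overall pulse rate
trains_per_second = 10       # trains arrive at 10 Hz
intra_train_rate_hz = 4.5e6  # maximum repetition rate within a train

pulses_per_train = pulses_per_second // trains_per_second
train_duration_s = pulses_per_train / intra_train_rate_hz

print(pulses_per_train)        # 2700 pulses per train
print(train_duration_s * 1e6)  # 600.0 -> each burst of pulses spans ~600 microseconds
```

So each 100 ms train slot contains only a very short burst of X-ray pulses, which is exactly the batched structure shown in the diagram.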
And there are literally thousands or tens of thousands of different devices that produce data at the European XFEL. These can be either hardware or software devices, but all of them are controlled using our in-house control system. It's called Karabo, and it was developed from scratch within XFEL. On the right-hand side, you can see a little diagram of the devices for just one instrument, the MID instrument. Each of the nodes here represents a single device, and the lines between nodes represent some data that is being shared or sent between devices. There are similar graphs you could generate for the other five instruments. So it's a really complicated system that produces a lot of data, but this is all necessary to carry out the experiments that we do. The experiments usually last about six days, and they can create anything from hundreds of gigabytes to petabytes of data. In fact, recently we actually broke our internal record for the most data collected during a single experiment: some users at MID recorded about 3.3 petabytes of data in just six days, which is really insane. But that does beg the question of what exactly is generating all this data, right? And the answer is that most of it is coming from 2D detectors, so let's talk about that for a second. There are a bunch of X-ray detectors at XFEL, and I've listed some of them up here. On the right-hand side, there's a picture of the DSSC. This detector was actually developed especially for XFEL, as was the AGIPD, but there are some others that are slightly more commercial. The DSSC detector is one megapixel, but in general the resolutions can be anywhere from half a megapixel to four megapixels. And the most common one-megapixel detectors like the DSSC are what we call pulse-resolved, which means that they're able to record data for every single pulse within a train.
And this is where the high repetition rate comes in, because you can imagine if you have lots of pulses within a train, and these trains are coming ten times per second, and you're getting a megapixel of data for each one of those, that's really a lot of data. The DSSC, for example, is capable of holding frames for 800 pulses at a time; it's got 800 memory cells. And if you do the math, this works out to about 16.7 gigabytes of raw data per second. That's really, really a lot. Another way to think about it is that it's about two and a half Bitcoin blockchains per minute. And keep in mind that this data is just coming from a single detector. Usually at XFEL, there will be multiple experiments running simultaneously between the different instruments, so you could very well have situations where there are three detectors that are all spitting out data at gigabytes or tens of gigabytes per second. So a big area of interest for us is data reduction, because this kind of data rate is really difficult to manage, and in some cases there are ways that we can actually reduce it ahead of time, so we don't need to store so much raw data. To give you an idea of how much this is in proportion to the rest of the data, I've got a little table here, which shows a rough grouping of all the data at XFEL. Each row roughly corresponds to a group of some kind of data. So we've got data coming from the AGIPD, the DSSC, the LPD, and the JUNGFRAU; these are all 2D detectors. And then we've got this other source in here called DA. This basically represents all of the non-detector data, so that would be things like motor positions and various other sorts of information like beam energy. In total, we have something like 60 petabytes of data at XFEL, and you can see in this column here, which shows the total storage amount, that the non-detector data only takes up about 3 petabytes, which is really not much.
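The ~16.7 GB/s figure above can be roughly reconstructed. Two assumptions here are not stated in the talk: a 1024×1024-pixel frame and 16-bit (2-byte) raw pixel values; with those, the arithmetic lands close to the quoted number:

```python
# Rough reconstruction of the DSSC raw-data-rate figure (~16.7 GB/s).
pixels_per_frame = 1024 * 1024  # "one megapixel" (assumed 1024x1024)
bytes_per_pixel = 2             # assumed 16-bit raw values
frames_per_train = 800          # the DSSC's 800 memory cells
trains_per_second = 10          # trains arrive at 10 Hz

rate_bytes = pixels_per_frame * bytes_per_pixel * frames_per_train * trains_per_second
print(rate_bytes / 1e9)  # ~16.8 GB/s, in line with the quoted ~16.7 GB/s
```

The exact value depends on the real frame geometry and bit depth, but the order of magnitude is the point: a single pulse-resolved megapixel detector saturates tens of gigabytes of storage bandwidth per second.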
So that's why data reduction is such a big topic for us at the moment. And you can imagine, with these kinds of data rates, like 16 gigabytes per second, there need to be some special systems to handle that. For that, we have a special compute environment called the online cluster. The online cluster is physically located on-site in Schenefeld, and it's used for running all of the systems that we need during an experiment. Of course, that includes the devices, which are run by the control system, Karabo, but it's also things like online calibration and online analysis. There are about 150 nodes in this cluster. Most of them are CPU nodes; there are a couple of GPU nodes, but those are mainly used for the online calibration pipeline. For the sake of security, access to it is quite restricted, so you need special permissions to access it remotely, and there's also no direct access to the outside internet. When it comes to storage, the main file system we have is GPFS, which is a proprietary distributed file system, and this is all running on SSDs for performance, because we need to write a lot of data really quickly. And again for performance, we use InfiniBand, which is a networking technology designed for very high-speed data rates. Here is a grossly over-simplified view of the different systems that we have running on the online cluster; all of these would be active during an experiment. Starting on the left-hand side, we have devices. These are things that are running in the control system, and they produce data. They could be representations of physical hardware, so maybe a digitizer or a camera or a detector, but they could also be purely software devices. Let's say, for example, that you want to do some image processing on data from the camera in real time; you might have a software device to do that.
The instrument scientists who are operating the instrument are able to select which devices they want to save data from, because there can be hundreds of devices per instrument, and it's not usually necessary to store all of that. So they will select a certain subset of devices to save data for, and this will be used by the DAQ, the data acquisition system. One of the DAQ's main responsibilities is aggregating the data from all of those devices and then writing it to the online cluster storage here, which, as I said before, is using GPFS running on top of SSDs. But feedback for the instrument scientists is also really important during an experiment, because getting beam time at the European XFEL is extremely expensive. If you do the math, I think it works out to something like thousands of euros per minute, so it's very important that the instrument scientists know exactly what's happening at any given point. And that's the point of online analysis. But online analysis usually requires some data from a 2D detector, for example, and the raw data that the detectors spit out isn't really usable; it needs to undergo some kind of calibration. That's what the online calibration pipeline is for. This thing takes the raw detector data from the DAQ, it calibrates it, and then it can be fed into whatever online analysis tools we have running. We've actually done a lot of work over the last year in upgrading this for better performance, and the end result is that the instrument scientists are able to get feedback from the instrument on the order of a couple of seconds, which is very useful. Now, the last part that I want to talk about with regards to this diagram is this process that we call migration. The storage on the online cluster is not meant for long-term use; it's really just there for the data that's being saved right now during an experiment. There is a separate compute environment called the offline cluster.
And this is used for long-term data storage and for any kind of rigorous final analysis. So there is a process called migration, which will basically copy all of the data from the online cluster to the offline cluster. After some data has been recorded, it will just be copied, and it will then be accessible on the offline cluster for scientists to analyse. Now I want to give a couple of examples of the online analysis, because I think that can be quite interesting. The first example I want to show is this thing called Bragg peak analysis. Just for background: when you fire photons at some kind of crystalline structure, they will diffract in certain directions, and on the detector that basically shows up as a blob. For certain kinds of experiments, it's very useful to be able to track the features of this blob. By that, I mean things like the position: where is it on the detector? Often you'll do some curve fitting to a Bragg peak; in this particular case, we're fitting a Gaussian to it, and we're getting this metric called the full width at half maximum, which is basically the width of that Bragg peak. And then another thing is the intensity: how bright is this peak? On the right-hand side, you can see an example of some analysis of a Bragg peak that was done during an experiment. What's really cool in this screenshot is that you can see how the position of the Bragg peak changes as the instrument scientists move a particular motor. Being able to view this kind of data in real time is extremely helpful for them to know that they're actually doing the right thing. So that's one kind of analysis. Another kind is this thing called azimuthal integration. In the last slide, I showed you just a single little blob on a detector; that's just one Bragg peak. But some kinds of experiments actually create lots and lots of Bragg peaks, and the data from the detector ends up having radial symmetry.
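The Bragg-peak fitting just described can be sketched in a few lines. This is a minimal illustration with synthetic data, not the facility's actual analysis code; the peak parameters and noise level are invented for the example:

```python
# Fit a 1D Gaussian to a peak profile and report position, intensity and FWHM,
# as in the Bragg peak analysis described in the talk.
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, amplitude, centre, sigma):
    return amplitude * np.exp(-((x - centre) ** 2) / (2 * sigma ** 2))

# Synthetic profile standing in for a slice through a Bragg peak on the detector.
x = np.linspace(-10, 10, 201)
y = gaussian(x, amplitude=5.0, centre=1.5, sigma=2.0)
y += np.random.default_rng(0).normal(scale=0.05, size=x.size)

(amplitude, centre, sigma), _ = curve_fit(gaussian, x, y, p0=(1.0, 0.0, 1.0))
fwhm = 2 * np.sqrt(2 * np.log(2)) * abs(sigma)  # FWHM of a Gaussian from sigma

print(f"position={centre:.2f}, intensity={amplitude:.2f}, FWHM={fwhm:.2f}")
```

The three fitted parameters are exactly the three quantities the instrument scientists watch live: position, intensity, and width.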
These rings are called scattering rings, and they're actually composed of lots and lots of Bragg peaks, all at the same radius from the centre of the beam. For this kind of experiment, what's really useful is to count, for example, how many rings there are, the radial position of the rings (how far is the ring from the centre of the beam?), and also the intensity of the rings. You do need to have good, accurate geometry for this, because it's quite sensitive, but as long as you have that, you're good. So just to explain a little bit more what azimuthal integration is: imagine you take this image on the right-hand side, with some radial symmetry, and you convert it to polar coordinates. One way of thinking about this transformation is to imagine you have a circle, then you make a cut at one point in the circle, and then just unfold that circle into a line. That's the kind of transformation that's happening here. And what you see happen is that the circles that show up in the image in Cartesian coordinates end up as straight lines in polar coordinates. On the bottom axis, you have the radius, the distance from the centre of the beam, and on the vertical axis you have the angle, the azimuthal angle. What we want to do in this case is just integrate along the azimuthal angle to get what's called a 1D scattering curve, which looks something like this. You can see that the bright lines in the polar-coordinate image basically correspond to the peaks that are visible in the scattering curve. And this is really what instrument scientists want to see. They want to know how many peaks there are (roughly speaking, you could say there are maybe three big ones), their position (how far away are they from the centre of the beam?), and the intensity (how tall are the peaks?). And this can all be done online in semi-real time, with a couple of seconds of latency.
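The integration over the azimuthal angle described above can be sketched with plain NumPy: average the image in radial bins around an assumed beam centre. Real pipelines use dedicated libraries that account for detector geometry; this toy version assumes a flat image and a known centre:

```python
# Toy azimuthal integration: collapse a radially symmetric image into a
# 1D scattering curve by averaging over the azimuthal angle in radial bins.
import numpy as np

def azimuthal_integration(image, centre, n_bins=100):
    ny, nx = image.shape
    yy, xx = np.indices((ny, nx))
    r = np.hypot(xx - centre[0], yy - centre[1])  # radius of each pixel
    bins = np.linspace(0, r.max(), n_bins + 1)
    idx = np.clip(np.digitize(r.ravel(), bins) - 1, 0, n_bins - 1)
    sums = np.bincount(idx, weights=image.ravel(), minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    radii = 0.5 * (bins[:-1] + bins[1:])          # bin centres
    return radii, sums / np.maximum(counts, 1)    # mean intensity per radius

# Synthetic detector image with a single scattering ring at radius 30.
yy, xx = np.indices((128, 128))
r = np.hypot(xx - 64, yy - 64)
image = np.exp(-((r - 30) ** 2) / (2 * 2.0 ** 2))

radii, curve = azimuthal_integration(image, centre=(64, 64))
print(radii[np.argmax(curve)])  # the peak of the 1D curve sits near radius 30
```

The peaks of the resulting 1D curve then give exactly the ring count, positions, and intensities the instrument scientists want to monitor.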
I have a very simple example here, using a slightly different data set. In the previous data set you could see many different scattering rings; in this data set, there's only one fairly broad ring, and you can see that showing up quite clearly in the 1D scattering curve. We also have tools for doing peak finding automatically, so they can find the position of the peak, like how far away it is from the centre of the beam, and if they want, they can also do some curve fitting to it. In this example, they've just fitted a Gaussian to this curve, and you can see the parameters that it spits out. So that is online analysis; this is stuff that happens during an experiment while it's running. I want to switch now and talk about the different environment that we have for doing offline analysis, and that is the offline cluster. What I'm calling the offline cluster is actually the Maxwell cluster, which is a compute cluster that DESY operates for its users. XFEL works very closely with DESY, and we share a lot of the infrastructure; that's the reason why we're using the Maxwell cluster here. We use it for long-term storage and for any final analysis, primarily because it's just way bigger and more powerful than the online cluster. The Maxwell cluster has about 800 nodes, compared to about 150 for the online cluster, and about a quarter of those are GPU nodes. It is quite different from the online cluster in that there's a real mixture of storage systems, and also a mixture of the physical media that's used to store the data. So of course you have HDDs, but there is also magnetic tape storage available. And just like the online cluster, it also uses InfiniBand for networking. So this cluster is what we and the users who come here use for offline analysis, and to be specific, this is the analysis that happens after an experiment.
It's usually where you would prepare some analysis for publication in a journal, for example. The exact kind of analysis varies very much from experiment to experiment. There are certain fields with techniques that have existing tools for analysing FEL data. One example would be serial crystallography; I'll get to this in a second, but serial crystallography is an experimental technique, and there are a couple of existing tools, with both command-line and graphical interfaces, that users can use to analyse the data they get here. Another option would be using Jupyter notebooks. The Maxwell cluster operates a JupyterHub instance, and these are very popular. On the right-hand side, I have a little example of this. Typically they're written in Python, and you can see here a quick example of someone analysing some data from a spectrometer, using matplotlib to display it all. So it's using fairly standard tools. Incidentally, the offline calibration system also uses notebooks. Remember I said earlier that we have an online calibration system, which is used for correcting the detector data in real time. But it's not entirely rigorous, and we have another system for doing this more rigorously: the offline calibration system. This actually uses Jupyter notebooks as well, which might seem like a bit of an odd choice, but the advantage is that it's very easy for detector scientists to look at and to play around with different methods of correction, for example. All of the data that we create, both the raw data and the output of the offline calibration system, is stored in HDF5 files, and we have a couple of libraries for working with these files, called EXtra-data and EXtra-geom. EXtra-data is more generic; it's meant for working with data from a run. And EXtra-geom is used for working with data from 2D detectors.
I have a little example here showing the use of EXtra-data. In the first cell up at the top, you can see that we do a bunch of imports; we import this class called RunDirectory from EXtra-data. In the third cell, we open a specific run from an experiment and look at what's in it. In this particular run, there were 4,000 trains, and it lasted for about six and a half minutes. You can also see all of the devices that this run has data for; in this case, it's a spectrometer. And in the final cell, you can see that we're actually getting some data out of this run. In the first line, for example, we're getting a bunch of camera data, and the data that's returned here is actually just a regular NumPy array. We make use of Python and the regular Python ecosystem of number-crunching libraries quite a lot, so there's a lot of use of NumPy and SciPy and xarray and that sort of thing. And yeah, this is just a multi-dimensional NumPy array. Now, the entire point of offline analysis is to do some rigorous analysis that is ready for publication, and just to conclude, I want to give a couple of examples of the kinds of analysis users do. There are two that I want to talk about. The first one is related to a technique called serial femtosecond crystallography. This technique is used to find out the structure of biological samples, like proteins, for example, and it requires a very high repetition rate and lots of photons. What you do, broadly speaking, is you get a bunch of identical proteins that you want to study, you put them in a liquid solution, and then you fire them in a liquid jet right into the path of the X-ray beam. The idea is that hopefully there will be one protein sample per X-ray pulse, you'll get some data on the detector, and then later on, working backwards from all that data, you'll be able to figure out the structure of that protein or whatever biological sample you have.
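In serial crystallography, most pulses miss the jetted sample entirely, so a first step is screening each recorded frame for "hits". The talk doesn't describe the actual hit-finding code, so this is only a toy sketch of the common lit-pixel approach; the threshold values, frame shapes, and function name are all invented for illustration:

```python
# Toy per-pulse "hit" screening: count pixels above a threshold in each
# detector frame and keep only frames with enough lit pixels.
import numpy as np

def find_hits(frames, adu_threshold=100.0, min_lit_pixels=50):
    # frames: (n_pulses, ny, nx) array of detector frames for one train
    lit = (frames > adu_threshold).sum(axis=(1, 2))  # lit pixels per frame
    return np.nonzero(lit >= min_lit_pixels)[0]      # indices of hit frames

rng = np.random.default_rng(1)
frames = rng.normal(scale=10.0, size=(8, 64, 64))  # mostly empty noise frames
frames[3, 10:20, 10:20] += 500.0                   # one frame with a "hit"

print(find_hits(frames))  # → [3]
```

Screening like this, done early, is also one route to the data reduction mentioned before: frames with no hit need not be kept at full rate.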
The problem is that you need a lot of hits, basically. You need a lot of photons scattering off a lot of different proteins for this to work, on the order of tens of thousands, for example. And this is why the high repetition rate is so important for this kind of experiment. And because ideally you've only got one protein per pulse, you need lots of photons, because the probability of a photon actually hitting and scattering off the sample is very low. That's why the intensity of an XFEL is so important. The final example I have is ultrafast dynamics. There was an experiment recently where some scientists looked into the radiation damage of water molecules, and to cut a long story short, what they ended up discovering was some information about how a water molecule is destroyed by X-ray radiation. This process is extremely fast; it happens in femtoseconds, and that's why the short pulse length of an FEL is so important. The pulse length is sort of analogous to the exposure time of a regular camera: you can imagine if you're trying to take a regular photo of a fast process, you need a very short exposure time, and it's the same thing with an FEL. You need a very short pulse to capture the dynamics of these really fast processes. And these are just two examples; there are many others that you can find. So, thank you very much. Well, very charming, thank you very much. I'm flabbergasted. I would lie if I said that I understood all of it, to be quite honest; I studied social science. Sorry. Yeah, we're a bit over time, but never mind; we don't have that many questions, we'll handle it. Okay, they're coming in now. If you want to ask questions, go to the info drop-down menu, excuse me, go to the chat, and you'll find the hashtags and the IRC and web clients where you can post them. We have some here; I'm just going to read them out to you.
If you need a high number of pulses per second, why don't you spin a few plates with holes in front of a continuous light source? So, the problem is that, actually, let me just go back to slide four. This thing uses an electron gun, and the reason we have this bunch structure is because the electron gun needs to be recharged, basically. The thing that you're referring to, having a continuous stream of pulses, is actually possible. I believe there is one facility in the States, I'm not exactly sure which one, maybe it's LCLS, but they actually have a storage ring. I think what's happening is that in front of their electron gun they have a circular storage ring which is storing all of the electrons, and then they're all fired off continuously. But we don't have that, so that's why you have this bunch structure, with all the pulses arriving in trains. Okay, thank you. Here's the next question. Has your lab considered or evaluated Julia as an alternative tool chain? So, Julia is on my personal radar, but I don't think there's been any rigorous study of it just yet. I can certainly see why it would look attractive. I believe there was a brief foray into using Julia; if I remember correctly, for the use cases that they were using it for, I/O was the bottleneck rather than compute, so Julia didn't offer much of an advantage over Python. Having said that, there are certainly other use cases where I think Julia could make a lot of sense. Okay, thank you. This is a question that is sort of fundamental, probably, for your application. Do you consider photons to be waves or particles? I'm just going to pass on the question; I have no idea how to answer it, I'm sorry. What I might suggest is perhaps emailing XFEL; there's probably someone there who would be able to answer that question, but I don't have the background for that. Sorry that I exposed you there, I didn't mean to. All good.
Do you use event-based cameras? I'm not exactly sure what you mean by event-based cameras. Perhaps the question is referring to whether events trigger a camera, and I think the answer to that is no. There is a very elaborate timing system at XFEL which tries to synchronize all of these things at the same time. So I don't think there are any event-based cameras, but perhaps I misunderstood the question. Okay. Currently there's no further question coming up, so we'll just wait a minute or two, maybe. You came a long way from down under, excuse me, New Zealand. How did you get here? So after I graduated from university, I worked for a little while in the Netherlands for a financial technology company. But doing this kind of scientific computing work was always my goal, I suppose, so a couple of years ago, when XFEL published the opening for the position that I got, I applied for it, and I'm very lucky that they picked me. So I didn't exactly move all the way from New Zealand to Hamburg; it was more New Zealand to the Netherlands to Hamburg. Even still, New Zealand to the Netherlands is quite a jump in itself, isn't it? It is, yeah. The flight was terrible; I think it was about 30 hours. It was not great. I would imagine. I hate flying. I loved it in my younger days, it took me to wild places, but it has just sort of become a drag and a pain in the neck. Do you miss New Zealand? I'd say not particularly. It's a great country, to be sure, but I have to say I really enjoy the work that I do here, so there aren't many days that I miss New Zealand. Perhaps I just haven't been away from it for long enough; maybe in a couple of years I'll feel homesick. I would assume the weather is better. It depends, it depends. I used to live in the capital city, which is called Wellington, and it's a harbour city, kind of like Hamburg, and the weather was actually not so different.
I would say that on average it was maybe a couple of degrees warmer, but it's an extremely windy and rainy city, so it doesn't feel too different from Hamburg in that sense. True, point taken. People are still listening to us, because somebody asks here: is the software comparable between the physics and the financial industries? Did it take a great reboot to switch from finance to physics? Good question. Or is it just data? No, I would say it was actually very different. For starters, most of my work now is in Python, whereas previously it was in C++. We do use C++ at XFEL for the very compute-intensive parts of our pipelines, but the majority is really in Python. I'm just trying to think what the other differences are. Financial technology, or at least the stuff that I was working on, was a lot simpler. At XFEL it's really helpful to have some background knowledge of physics, and there was none of that in financial technology. I also find the work a lot more interesting. And does XFEL fulfil these expectations so far? So far it has, yes. I'm very happy to say that it has. Good. The last question is on your HPC cluster: what scheduler do you use? On the offline cluster we use Slurm. I haven't actually used it that much myself, but I think that's the only scheduler we use there. As far as I'm aware, though, we don't use Slurm on the online cluster. Can you hear me now? Yes. I'm sorry, I'm awfully sorry, I should have checked that hardware before. I'm supposed to say hi to you from Andrea Thorn. I'm not sure he remembers me, but I work in the same area; he's also in Hamburg, and I'm around in the rC3 world if he'd like to meet. So here you are, here's the message for you. Cool, thank you. Hey, Andrea. There's another question that's come in. What's the unique feature of GPFS? As far as I remember, others seem to be happy with AFS. This is a good question.
Unfortunately, I'm not sure I have a good answer for you, because I'm not very familiar with those two file systems, so I don't know anything about their performance characteristics. I guess one nice thing about GPFS is that we get support from IBM for it, but I can't really answer that better. Okay, that does it for the questions. I was wondering if it was worthwhile going to the breakout room, but the questions don't seem to be coming that thick, and actually there are no questions around anymore. So if that's okay with you, we can call it; we're right on time, it's perfect. So thank you again, stunning. I'm still flabbergasted by the amount of data, whatever it is. It's a lot, it really is. Thank you very much for having me, it was great. Thank you for coming, thank you for giving us some of your time, and have some happy times in Hamburg. Thank you. And remember, when the fish start swimming by at about the height of your eyes, yes, you may complain about the weather. Sounds good, sounds good.