Welcome to another edition of RCE. Again, I'm your host Brock Palen, and I have with me once again my wonderful, helpful co-host Jeff Squyres from Open MPI and Cisco Systems. That's quite an introduction. Thanks, Brock. Good morning. Good morning, Jeff. We have with us today Shawn McKee. He's a research scientist at the University of Michigan and a part of the Atlas group. Atlas, if I'm understanding correctly, is a detector on the Large Hadron Collider, the big particle accelerator that they've been working on over at CERN in Switzerland. So it should be pretty interesting. Shawn, could you give us a little bit about yourself? Sure. I got my PhD originally at the University of Michigan, and I've been involved in high-energy physics since the days of the Superconducting Super Collider that some people may have heard of; that was a big equivalent experiment to the LHC. It actually was larger in scope, but it was canceled by Congress in 1993. I became involved with Atlas in around 1999. So I've been working on this experiment now for 10 years, and some people have been working on it much longer. Atlas was actually formed as an official collaboration in 1992, and it's a worldwide collaboration doing high-energy physics. Atlas itself is, I believe, now officially not an acronym anymore, but originally it was A Toroidal LHC ApparatuS. And that's how, with that tortured acronym, they got Atlas out of it, by capitalizing the right letters in that statement. So there are about 2,000 physicists all over the world, PhDs, professors, postdocs, and research scientists, working on Atlas. And if you actually include all their groups, it's much larger. So there are many more graduate students or even undergraduates working on various aspects of Atlas. So a primary staff of about 2,000 and a cast of millions. That's right. That's right. And of course, Atlas is, as Brock mentioned, one of the experiments at the LHC. The LHC is this large proton collider, about 27 kilometers around, underground near Geneva, Switzerland. It actually goes under both Switzerland and France. And there are two big colliding experiments at the points where the beams, rotating in opposite directions, collide: either at the Atlas detector or at our sister experiment called CMS, which is sort of our friendly competitor. And it's important in high-energy physics, I might add, that you have two experiments, because you wouldn't necessarily trust that one very complex experiment like this would give a new, interesting, earth-shaking result unless you could get confirmation from another experiment. So we tend to want to do these things with at least two. And they have different detector technologies and different groups working on them. So if you see something new and interesting in both, then you've got some validation that what's going on is real new physics. So what is the purpose of the actual Atlas experiment on the LHC? OK, so in high-energy physics, we've developed over the years something called the standard model, and it's been a very successful model. There have been predictions made out to six decimal places, and experiments have been done that confirmed those predictions to six decimal places. So it's our theory, but it's a well-tested, well-developed theory. One of the interesting pieces, though, that's missing is something called the Higgs boson. And that's the only standard model particle that's yet to be observed.
We have every belief that it exists, but we don't know for sure because it hasn't been observed yet. And the Higgs boson is related to a mechanism that actually gives mass to things called W and Z bosons and other particles, which differentiates them from photons, which are also bosons, the ones observed in regular electricity and magnetism. Photons, light coming from the sun, wherever, those photons are massless. But these other bosons, the Ws and Zs, and in fact the other particles in high-energy physics, have mass because of this Higgs mechanism. And the Higgs boson is sort of a representative piece of that model. So that's one of the primary purposes, but Atlas and, in fact, the other experiments at the LHC are exploring a whole host of different things. There's better understanding of pieces of the standard model. There are things related to the asymmetry between matter and antimatter in our universe. You know, in the beginning, if there were equal amounts of matter and antimatter, we really shouldn't be here. The matter and antimatter would recombine and annihilate into photons and energy, and there wouldn't be anything. So there apparently is a slight asymmetry between matter and antimatter. There's a little bit more matter, like a part in a billion. And that part in a billion is what you see around us. So that's an interesting question, why there's that asymmetry, and Atlas can help shed some light on that. There are other things like dark matter. There are various theoretical models about high-energy physics extensions to the standard model that would give rise to so-called dark matter. If you look out at the universe as a whole, you can see how fast galaxies are rotating. If you add up all the stars and you figure out how fast the galaxies are rotating, it doesn't hold together. It seems like the galaxies should just spin apart. All the stars should be thrown off, like putting a bunch of marbles on a merry-go-round and spinning it: they all fly off. Gravity holds them together. So we can calculate roughly how much gravity is needed to hold these things together, and we come up with a lot more than what you see in the visible matter. So we hypothesize that there's something called dark matter that holds this all together. And Atlas may have some opportunities to discover, actually measure, some dark matter if it exists. There are other things like supersymmetry and technicolor extensions to the standard model that may exist. And one interesting thing that maybe both of you have heard about: there's the possibility we might produce microscopic black holes in Atlas, which is actually not as dangerous as it sounds. And it would be a very interesting thing to observe. The microscopic black holes would evaporate very quickly in sort of a burst of particles, and they would give us a fairly unique signature in the Atlas detector. So that's just some of what we're looking for. But basically it's understanding the fundamental laws of nature and how things work at a basic level that we're after. OK, so that's a nice overview of what we're looking for with the Atlas detector. The Atlas detector itself is physically quite large. How many collisions per second do you expect to have when the detector is up and running? So we'll have approximately 40 million beam crossings per second. The way the LHC works is we load up bunches of protons traveling in each direction, in very dense little packets of protons.
We try and squeeze them together and make them as dense as possible so that when they collide, they have a high probability of really colliding centrally with a lot of energy transfer. So you'll get 40 million beam crossings per second in normal operation of the LHC. And since we try to make the proton density in each bunch as large as we can, sometimes you'll have more than one proton colliding with another proton in each beam crossing. The net result is, at normal so-called luminosity (that's a measure of how dense the protons are and how fast they're colliding), we can get about 25 megabytes of data per event, multiplied by 40 million beam crossings, multiplied by another factor, which is how many events per beam crossing. You can easily exceed a petabyte a second of data coming out of the experiment. So there's absolutely no way we can deal with a petabyte a second. We don't know of any way to do that. So what we end up having to do is use a combination of hardware and firmware and software to reduce that to something much more reasonable. So we anticipate actually writing to disk about three to four hundred megabytes a second during normal operating mode for Atlas. And Atlas itself, as you mentioned, is huge. It's forty-four meters long and twenty-five meters in diameter. So it's a multi-story building, way underground, about a hundred meters underground in a large chamber. It weighs about seven thousand tons. Atlas is physically the largest experiment at the LHC, but it's not the heaviest. Our sister experiment CMS is heavier; they're smaller, denser. So a little bit of a contrast. But it's pretty spectacular to go in and see the detector and get this interesting view of it. It looks like a huge city block of lots of wires and detectors and fancy science-fiction-looking stuff. So it's pretty neat to visit. Very cool. We're computer guys here. This physics stuff is fascinating. But let me ask you some further details about some of the numbers you were throwing out there. You were saying that, you know, you could get a petabyte a second, but you do a variety of techniques to get that down to a couple hundred megabytes a second, I think, if I recall correctly. What do you do? Do you compress, or do you just selectively take some of the data? I mean, how exactly are you whittling this down? Right. That's the real interesting problem. We have this goal of trying to search for new physics, right? So we need to make sure that we get anything that might be new physics and save it. We don't want to throw the baby out with the bathwater, but we have this huge data reduction problem. We have to go from a petabyte plus a second down to something that's like three to four hundred megabytes a second. That's a huge reduction. So the way we do that is, as I mentioned, a combination of hardware, firmware, and software. We have built onto the detectors themselves electronics that makes quick decisions. So, for instance, usually what we're looking for is a large energy deposition moving perpendicular to the beam. That means the event was highly energetic in its collision, right? And so it has the potential to produce new physics. And so individual detectors can say, I have something interesting, or not, very quickly. Basically, if they go over a threshold or have some other characteristic that you can match in hardware, you can quickly say, I have something interesting or not.
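To pin down the scale of the reduction being described here, a quick back-of-the-envelope check in Python, using the approximate figures quoted above (rough, illustrative values, not specifications):

```python
# Rough data-reduction arithmetic for the trigger chain, using the
# approximate numbers quoted in the conversation.
crossings_per_sec = 40e6   # ~40 million beam crossings per second
bytes_per_event   = 25e6   # ~25 MB of detector data per event

raw_rate = crossings_per_sec * bytes_per_event   # ~1e15 bytes/s, i.e. ~1 PB/s
target_rate = 350e6                              # ~300-400 MB/s written to disk

print(f"raw rate:  {raw_rate / 1e15:.1f} PB/s")
print(f"reduction: {raw_rate / target_rate:,.0f}x")  # roughly a factor of 3 million
```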
Then at the next level up, you can take groups of these detectors and, in firmware, compare them. This guy said interesting, this guy said not, this guy said interesting; I think it's interesting. So you start looking at a larger piece of the detector and saying it's interesting or not. And so at that point, there's another place where you can choose to keep the data or not. Finally, when you're done with all of the detector pieces and the higher-level firmware, you're left with a set of events that are potentially interesting, and you have to make a choice, and you're going to need more information. So what we end up doing is we have basically a fire hose pointed at roughly a thousand PCs, actually workstations, multi-core workstations, where the first event you give to the first available core, the next event you give to the next available core, and so on. So you've gotten to the end of the chain, and by that time, the first guy you handed off to should be done reconstructing roughly what happened in the event and making a decision: this really is interesting, write it to disk, or no, we can throw that away. So we have a very complex interface between the hardware of the detector, eventually getting out to formatted event data, so-called raw data, on the disk. And we do that by doing this in parallel. We basically have a very fast switching infrastructure where we can send a stream of data to each core in order, and by the time we're done, the first one should be finished and we can start over again. That's very cool. Let me even narrow down a little further here, because this is quickly going into an aspect of the typical stuff that I do that's very interesting to me, parallel computing kinds of things. So how do you get the data from the detectors to those PCs? And when you say they're workstations, are you reusing desktop workstations, or do you actually have a server farm, a quote-unquote traditional HPC cluster? How does this work? I would imagine that speed is very, very important to you, since there are ginormous amounts of data coming in that, even with 1,000 cores, you need to process quickly: A, get the data there quickly, B, make a decision quickly, and C, write it out to some kind of backing file store very quickly, so that that core can then be used again for the next event that comes along. Is that anywhere close? Correct, yeah, okay. Yeah, that is how it works. So, some details. Most of the onboard electronics on the detectors are uplinked via fiber, usually only running at gigabit speed, but there are many, many, many of these fibers coming into a central electronics data acquisition area, which is just near the detector in a cavern nearby, where we have racks and racks of data acquisition equipment: TDCs, ADCs, data processing that takes that information and passes it on. When you get out to what we call the level three trigger, that is really something that looks like a typical HPC installation. For Atlas, we have chosen to buy a bunch of Dell, I believe we went with PowerEdge 1950, one-U rack-mounted systems that serve as the level three trigger. So there's a fairly customized non-blocking switching infrastructure in front of those, such that you have this big pipe of data coming in and it can be fanned out to each machine in order to give it its job, its task: here's the data you need to look at, make a decision. When the decision is made, there's a back-end switching infrastructure that interfaces to the storage.
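In software terms, the fire-hose dispatch pattern described above looks roughly like the following sketch. It is illustrative only, not Atlas trigger code; the queue sizes, field names, and the selection cut are all invented:

```python
# Illustrative sketch of the level-three-trigger dispatch pattern: events
# stream into a buffer, the first available core takes the next event,
# reconstructs it, and decides whether to keep it.
import queue
import threading

incoming = queue.Queue(maxsize=10_000)  # buffered stream of candidate events
accepted = queue.Queue()                # events destined for storage

def keep(event):
    # Stand-in for full event reconstruction; here we just cut on a large
    # energy deposition transverse to the beam, as described above.
    return event["transverse_energy_gev"] > 100.0

def worker():
    while True:
        event = incoming.get()
        if event is None:               # shutdown sentinel
            break
        if keep(event):
            accepted.put(event)         # hand off to the storage back end
        incoming.task_done()

# The real farm is ~1,000 multi-core workstations; 8 threads stand in here.
pool = [threading.Thread(target=worker, daemon=True) for _ in range(8)]
for t in pool:
    t.start()
```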
And I believe they're using a relatively inexpensive SAN solution to write these out. I'm not so clear on the details of what they eventually ended up with, but I think that was one of the initial design choices being made. And that gets put onto these resilient, more cache-like disk areas. And then there's a separate process that immediately moves that into an HSM system, tape, basically, and also onto disk. And it's kept on disk as long as possible because there are further things that need to take that raw data and start processing it and getting it ready for distribution to the worldwide collaboration. So that's a tremendously interesting statement, that this is basically gigabit stuff. Now, is it gigabit ethernet or gigabit fiber? Because I'm coming back to my point that I would imagine the latency is tremendously important in this situation. And if you're doing something across TCP, your latency is out the window. Are you doing something lower latency than that? So the system that I'm more familiar with is the muon system, because that's what we've worked on at the University of Michigan primarily. And the choice for the fiber readout was, I believe, ethernet at the gigabit level. But there is a protocol that they run on top of that that's not TCP. That's for the data movement, and that was developed within Atlas just to move this data as part of the data acquisition path. Gotcha, gotcha. The back-end part, though, when you start to get out to the disk and the storage elements and the HSM systems, those all tend to use common protocols, TCP over ethernet, as the technology. Yeah, that makes sense. And typically also, in terms of connections to CERN and out to the rest of the world, we're dealing with units of 10 gigabit connections. So multiple 10 gigabit connections going to various locations. So up to this point, when you're getting it down to tape at CERN, this has all been automated. You could literally look at it as, this is still just the acquisition equipment. The boundary is nominally at that level three interface where those PCs are. That's sort of the boundary between the online data acquisition system and what we call offline. Okay, so then you said you have 10 gigabit pipes out. You only had a couple hundred megabytes of data per second. How much processing do you actually need for your final analysis? Because you're not done yet. You're not actually archiving results. You still have a lot more you want to do with that data you've thrown on tape. Sure, and in fact, as I mentioned, there's disk in front of tape, and it's put to tape in parallel while we try to keep it on disk, because it needs to be read out immediately and sent to other centers. And of course the first pass has to be done on this raw data. We have a hierarchy that the data moves through. We have raw data that goes into a new format called ESD, event summary data, which is made by basically processing the event and turning it into higher-level objects: tracks, energy deposited in a calorimeter, et cetera, things that are closer to physics, things that physicists understand. And to do that step, to go from raw to ESD, requires that you really understand all of this complex Atlas detector. So all the groups working on the various different sub-detectors within Atlas have to calibrate, have to tune their systems so that they understand what the response should be; basically, it has to be aligned and calibrated.
And that's gonna be the focus for our initial running: really making sure we understand the detector. The worry is that if you don't understand the detector, what looks like new physics could actually be a misunderstood detector. And so we have a lot of processing we have to do in sort of real time to understand the raw data that came out and keep things calibrated. Things can vary with time: the gas in some of the gas systems can vary with time, the temperature, the pressure. Detectors in the past have, for instance, been sensitive to the tides. The floor of the detector can move a little and cause the detector to flex based on where the moon is. And they've actually shown with their calibration that they can see that effect, tidal effects, on the detector itself. Because these detectors are large enough and have to be very precise, we can see very small things like that. But there's a lot to understand about the detector, and our first year is gonna be really focused on understanding in detail all these different detectors. And so there are calibration centers set up; some are at CERN. In Michigan's case, we're unusual: in addition to being a tier two computing center, we're also a calibration center for the muon system. There are three around the world: one in Rome, one in Germany, and one at the University of Michigan. So can you explain what a tier two site is? I assume there's also a tier one and maybe a tier three? There are. So the tier zero is CERN itself. Basically, that level three trigger system feeds out into the so-called tier zero. That's the first place that we put the data; that's the central location. We then have a distributed, hierarchical computing model that we've focused on in Atlas. Besides the tier zero, there are 10 tier one centers. These are national-scale centers around the world that represent Atlas collaborations in various countries or regions of the world. And below them there are tier two centers. These are regional centers typically associated with a specific tier one, sort of a cloud. The tier one represents the door into that cloud with a set of tier two centers. And then below tier two, we have the idea of a tier three center, which is really whatever an institution or lab can put together. They're not specifically obligated to do certain things within Atlas, whereas anything tier two or above has signed an MOU: there are certain deliverables that they need to provide, certain requirements that they have to meet. And then there's even a concept of a tier four, which is your desktop, your interface into this grid computing world that we're using to support our scientific work. And so there's a whole span of that. And the reason that hierarchy was developed was to help us manage and sort out how the production data would flow out to the rest of the collaboration. But the hierarchy isn't just one direction. Most of the simulation work that we have to do is actually done at tier twos all around the world. The tier zero and the tier ones are primarily focused on the raw real data that's coming out. They also help support users doing their analysis, or groups of users doing their analysis. But the tier twos have sort of a different role. They're responsible for simulating in great detail all the different types of physics events that we expect to happen in Atlas and providing that out to the collaboration. One of the things that we do in high-energy physics is really look for needles in haystacks.
So we have to make sure that we understand exactly how at least known physics would be observed in our detectors. What does it look like when you simulate these events? They could be well-understood physics events, or they could be new physics based on some model. And then you can try to make sure that the simulated data and the real data look as much alike as possible. And what you end up doing in the end is you compare: you subtract what you simulated, what it should look like, from what you observed. If there's something left over, a bump, an interesting feature in the data, that is a potential indication of something interesting, something new going on, right? And then you can look for new physics based on that. So the tier twos have an interesting role in this hierarchical model in that they are the primary simulation engine for all of Atlas. In addition, they also support users and groups doing analysis. This is a massive worldwide hierarchy here. Let me ask you one throwback question. How much data is in a single event? Because when you were giving your descriptions previously, you were saying, well, for each event we do this, and we have an event that's sent out. How much data are we talking about, kilobytes? Right, so each event that we send out is supposed to be roughly 1.6 megabytes. That's the design from our technical design report for our computing. We're a little bit fatter than that right now, but we're working to get down to that level. And it's important that we meet that goal, because we've designed the scale of our computing resources based on the assumption that a raw event is 1.6 megabytes of data. Then this ESD that I mentioned, the event summary data, is about 500 kilobytes per event. So we reduce things by a little more than a factor of three. Below that, there's something called an AOD, analysis object data. And that's a further summarization. Potentially you've added new information, but you've also summarized it to be more like what a physicist would work with. And that's typically about 100 kilobytes per event. And we're quite a bit fatter there right now, because groups are still testing and developing software, and they keep more information around than we can really afford to long term. And even below that, we have something called a DPD, derived physics data. Those are subsets of the AOD, and for one AOD, you might produce four, five, six, seven of these. They're imagined to be around 10 kilobytes each. And then at the very lowest level, we have something called a TAG, and that's really just a lookup for what an event is. It's roughly supposed to be one kilobyte per event. And it's what you would use as, I guess you'd say, your dictionary: I'm interested in all events that have four muons with energy above this. And you can quickly look through the TAG database. You can imagine having the TAG database even out at tier threes, for instance; they would have enough storage to potentially host the TAG database, the set of TAG events. And you can quickly figure out which events you would like to look at. And then you can submit your job to a resource which actually has access to the AODs, or in some cases even the ESD data, extract those events, and look at them with whatever your analysis wants to look for.
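The data tiers and the TAG-style selection just described can be summarized in a small sketch. This is purely illustrative; the real TAG database is a proper catalog, and the record fields here are invented:

```python
# Approximate per-event sizes of the Atlas data tiers, as quoted above.
EVENT_TIERS_BYTES = {
    "RAW": 1_600_000,  # raw detector data, ~1.6 MB (computing TDR design)
    "ESD": 500_000,    # event summary data, ~500 kB
    "AOD": 100_000,    # analysis object data, ~100 kB
    "DPD": 10_000,     # derived physics data, ~10 kB (several per AOD)
    "TAG": 1_000,      # ~1 kB lookup record per event
}

# A TAG-style query: "all events with four muons above some energy".
# Each TAG record is a tiny summary used to locate full events elsewhere.
def select_events(tags, n_muons=4, min_energy_gev=20.0):
    return [t["event_id"] for t in tags
            if t["n_muons"] >= n_muons and t["muon_energy_gev"] >= min_energy_gev]

tags = [
    {"event_id": 1, "n_muons": 4, "muon_energy_gev": 35.0},
    {"event_id": 2, "n_muons": 2, "muon_energy_gev": 50.0},
]
print(select_events(tags))  # -> [1]; then fetch those AODs/ESDs for analysis
```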
So when you put all these numbers together, I mean, you mentioned before that tier zero has multiple 10 gigabit pipes outward and whatnot. How much data are you pumping through from zero to one, and one to two, and two to three, and so on? I mean, I have to wonder, because you hear things about, well, my company and other companies are working on standards for 40 gigabit ethernet and 100 gigabit ethernet and whatnot. And it sounds like, at least at the top layers of your hierarchy, you could really benefit from this kind of stuff. We certainly can. And in fact, one of our focus points, certainly over the last eight or nine years, has been the network itself, making sure that we have a robust network architecture in place to support this global distribution of data. So starting from the top, we have about 10 petabytes of data in Atlas, simulated plus real data, that will be of interest to physicists all over the world. The way we've chosen to divide up the data is that each tier one will have some fraction of the raw data that they're the steward for. So all the raw data goes through the tier zero and is also put on tape there. Approximately one fifth of the raw data will go to each tier one. And I mentioned before that there are 10 tier ones. So that in effect gives us two distributed copies of the raw data. And the individual tier ones are responsible for being stewards indefinitely of their fraction of this raw data, which means they put it on tape and they make sure it's accessible and available for the approximately 20-year lifetime of the Atlas experiment from now forward. So they're gonna have a lot of data that they have to handle. And that's just the raw data, right? Then we produce these ESDs, which are the result of our calibration of the detectors and our algorithms that convert these raw electronic signals into things like tracks of particles and jets of energy and so forth. And when you change your calibration, you improve it, or you change the algorithms that go from raw to summarized, reconstructed data, you have to rerun. So one of the primary roles of these tier ones is, if they come up in three months with a new set of calibrations that are better, they're gonna have to go back over the raw data that they're responsible for and reconstruct it, producing new ESDs. And when they do that, those resulting data products then have to be shared with other tier ones around the world and their tier twos. So there's constantly incoming new data that's being stored and distributed, as well as reconstructed older data that needs to be distributed out. And so our steady state is much less than 10 gigabits a second for any tier one. But we have to worry about bursting, and what happens if a system is down for a few days and has to come back up, restore its cache, and catch back up in terms of the data stream. So we typically try to design the tier ones and the tier zero so they have at least a five-day buffer locally. And in principle, if they have network events or lose connectivity, they can then catch back up by bursting to much higher rates and get back to being current. I have to snicker, you said network event there. I would hope it's a very different type of event than the events we've been talking about so far. That's right, that's right.
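Two bits of arithmetic implicit in that answer, spelled out with the rough figures from the conversation (not official numbers): ten tier ones each holding a fifth of the raw data yields two distributed copies, and a five-day buffer at a few hundred megabytes per second is a substantial chunk of disk.

```python
# Sketch of the raw-data stewardship and buffer-sizing arithmetic above.
N_TIER1 = 10
FRACTION_PER_TIER1 = 1 / 5          # each tier one stewards ~20% of raw data

copies = N_TIER1 * FRACTION_PER_TIER1
print(f"distributed raw-data copies: {copies:.0f}")  # -> 2

rate_bytes_per_sec = 350e6          # ~300-400 MB/s of raw data
buffer_days = 5
buffer_bytes = rate_bytes_per_sec * 86_400 * buffer_days
print(f"five-day buffer: ~{buffer_bytes / 1e12:.0f} TB")  # ~150 TB
```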
And related to this, we have something called the LHCOPN, the LHC Optical Private Network. There is, at least conceptually, a private network that exists between the tier zero and all the tier ones around the world. And that's the primary path for raw data to go from the experiment out to the tier one centers around the world. A secondary purpose is to allow the tier ones to share data with each other. So for example, Brookhaven, which is our Atlas tier one for the United States, may be steward of 20% of the raw data. When they reconstruct their 20%, there might be one or two or three other tier ones who want that updated, reconstructed data. So Brookhaven may directly send that data out as they produce it. In addition, there's the tier two cloud. We have five tier twos in the U.S. Michigan and Michigan State actually represent one of those tier twos. They're gonna need the data as well, and so we will get it from Brookhaven. And this whole system that I've been describing is all set up for the well-understood production flows. We have a good idea about how often we're gonna reconstruct data, how much raw data comes out, how much simulated data we're producing. The thing that we're less certain about, even now, is what the impact will be of all these physicists wanting to get at all this interesting data. And there are many, many physicists, as I mentioned, in Atlas, with large groups. And there's gonna be a somewhat chaotic, dynamic aspect to use of the system. How is it gonna meet the needs of all these physicists wanting to get at all this data? And that's one of the reasons we wanna make sure we have a very robust networking infrastructure in place. We anticipate that the data flows associated with all of these users, each using a relatively small amount of data, but potentially getting it from various tier twos, or even tier threes, or potentially tier ones if you don't have it within your cloud, are gonna cause fairly significant traffic. And we wanna make sure that our infrastructure is set up to support that and won't melt down when we have this very high demand that's not so predictable. Who are the other tier fours in the US? Sorry, I mean, who are the other four tier twos? Right, so as I mentioned, Michigan and Michigan State are the Atlas Great Lakes tier two, AGLT2. There are four others, and the interesting thing is that most of them in the US are shared, just like Michigan and Michigan State share the Atlas Great Lakes tier two. There's the Midwest tier two, that's the University of Chicago and Indiana University. There's the Northeast tier two, which is Boston University and Harvard. There's the Southwest tier two, which is the University of Texas at Arlington and the University of Oklahoma, and Langston has also been associated with that effort. And then an interesting tier two is the Western tier two, which is SLAC. SLAC is a DOE-funded national lab, so it's different from the others. All the other tier twos are university based, and in fact that was by design. The scope of the five tier twos was supposed to be roughly equal in resources to the tier one. So if you add up all five tier twos, they should be roughly equal to the computing power and storage of the tier one at Brookhaven. Because of leveraging money, we were able to actually put together more resources at the tier twos in total than the tier one itself. But it's still close to being relatively equal. But the interesting thing was, DOE funded the tier one, and the National Science Foundation funds the tier twos. So SLAC was an interesting tier two because it's a DOE lab.
So there had to be special agreements made between NSF and DOE about how they would actually fund that. Anyway, it worked out, and SLAC has also been a very powerful tier two because they have the facilities of the national lab behind them. So, opportunistically, they can get a lot of CPUs if they're not busy. So it's worked out well. So what's the total? You've got a massive storage and data management problem, a massive international worldwide network. What's the total scale of the compute resources behind this? You mentioned the tier twos are supposed to be roughly equal to the tier one; you got a little bigger. How many nodes or gigaflops are we talking about across the world? Right, so as I mentioned, there are approximately 10 tier one centers around the world, and those are the national-scale centers. And it's a little misleading: they're tier ones based on their role and purpose. We don't necessarily have a well-defined recipe for what it takes to build a tier one: you have to buy X processors and Y amount of storage, configured this way. It's left up to the individual countries, or whoever's supporting the tier one, to meet the needs of Atlas. So there's an MOU that says here's how much data you have to store, but it doesn't tell them how, other than that you need to provide it as grid-aware storage, which in our case means SRM, storage resource manager. So there has to be an SRM interface to the storage. And that's at the tier zero, the tier ones, and the tier twos. So that's a requirement. And then there's a certain amount of pledged CPU capacity that we agree to. We give them a five-year plan in the case of the tier ones and the tier twos, and we say, here's how much we anticipate being able to provide over the next five years. We try to accommodate Moore's law and our funding in the planning, and then based on that plan, higher-level analysis can be done to determine, do we have enough resources to meet our requirements, given what we assume we'll produce and simulate in Atlas. And if there are shortcomings, we iterate; we try to come up with something. Anyway, the way it's worked out is that Brookhaven, as the US tier one in Atlas, is actually a little bit larger than some of the other worldwide tier ones within Atlas. Also, the Brookhaven tier one, for example, is dedicated really only to Atlas. Some of the other tier ones around the world actually support up to four of the LHC experiments all in one center. So they have to be set up to share resources in terms of how they schedule users, set up priorities, and manage their storage. Which is, fortunately, a complication that Atlas in the US doesn't have. All our sites are pretty much focused on Atlas, and we don't have to worry about having to divide them up. The scale varies. For example, I believe in this year, 2009, we are by MOU delivering something like, well, it's an older unit: kilo-SPECint2000 is the unit in which we originally signed our MOU. We are now in the process of switching over to something called HEP-SPEC06, which is based on the SPEC rate, a combination of the 2006 standards for measuring CPU power. And I should mention that most of our jobs are sensitive to integer performance, not floating-point performance. So that's why we originally used the SPECint standard. But our commitment is, for example, that our site should deliver about 430,000 CPU hours in any given month, at about 965,000 SPECint2000, that is, 965 kilo-SPECint2000. And that works out to, well, a typical old Pentium 3 at a gigahertz, I believe, was around one kilo-SPECint2000. So it would be 965 cores' worth of those older systems. Of course, the newer cores are in fact four or five times faster per core. So you're talking about a moderate amount of CPU that we need to provide at the tier two level, something like a few hundred cores typically.
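A quick illustration of that pledge arithmetic, using the rough conversion factors quoted here (about 1 kSI2K per 1 GHz Pentium 3 core, and modern cores assumed four to five times that; the midpoint below is my assumption):

```python
# Converting a CPU pledge in kilo-SPECint2000 (kSI2K) into rough core counts,
# using the conversion factors quoted in the conversation.
pledge_ksi2k = 965             # AGLT2's pledged capacity, as quoted

KSI2K_PER_P3_CORE = 1.0        # ~1 GHz Pentium 3 core, per the quote above
KSI2K_PER_MODERN_CORE = 4.5    # "four or five times faster" (assumed midpoint)

print(f"old P3 cores: {pledge_ksi2k / KSI2K_PER_P3_CORE:.0f}")      # ~965
print(f"modern cores: {pledge_ksi2k / KSI2K_PER_MODERN_CORE:.0f}")  # ~214, a few hundred
```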
Now, our site, the Atlas Great Lakes Tier Two, has about 1,830 cores available right now. That's a combination of tier two and tier three cores. Our tier two cores are pledged by MOU and sort of centrally controlled in some sense, and our tier three is our local installation for our physicists at the University of Michigan and Michigan State to use. And what we've done is benefit from making one larger purchase and then logically subdividing those machines into tier two and tier three in terms of scheduler priorities and quotas, for example, or the amount of disk space purchased. We just keep track of that internally, but we manage it as one large installation, which makes it easier. Some sites in US Atlas, for example, right now have a pledge of something like 1,386 kilo-SPECint2000, and the smallest one is 665. So there's a range in what we pledge as tier twos within Atlas. Now CMS, our competitor, has seven tier two sites within the US, and they have a uniform pledge amount of 1,000 kilo-SPECint2000, in other words, one million SPECint2000 per tier two. And we have a system that keeps track of how much we do every day in terms of CPU hours, normalized by the type of processor involved. And we also get tested regularly, multiple times a day, for up-ness, for availability. So the grid services are tested to verify that our site is online. And we get a report at the end of the month, which also goes to the funding agencies and to Atlas management, that says how well each site performed. Was it up? Was it accessible? How many hours did it deliver of jobs that completed successfully? And those are metrics that are then used to evaluate all of this. So again, the tier one center is roughly the sum of all of our tier two centers in terms of processing power. And that works out to be something like four million SPECint2000 at the tier one center. So it's roughly five times more than the average tier two center. But there have been some variations, again, because these are our pledged amounts. In reality, our site is something like five million SPECint2000; our pledge is 965 kilo-SPECint2000. And that's just because, when we figured out what we were able to contribute, we ended up getting better pricing, and things worked out in an opportunistic way for us, so we were able to get more resources. So that's been helpful. But I think in the future there will be a significantly increased need for these computing resources once real data starts to flow, and we certainly won't keep that level of oversubscription in place in the long term. Oh, I should also mention something else: the amount of disk that we have. Typically, right now, tier twos are providing something like 400 terabytes of storage per tier two. And the tier one actually has a few petabytes of storage. And that's not the tape system; that's disk. Behind that, there's a tape system that can do many, many petabytes, right? I think Brookhaven's capacity right now, I'm not sure what the exact number is, but it's certainly above eight petabytes, I believe, and could be expanded. That's easily the largest disk farm I've ever heard of.
I'm familiar with a couple of two-petabyte disk farms, but nothing on that scale. Yeah, so they've recently gotten way up there, because they waited, of course, to spend their money. They were behind their MOU pledge for a while, but then, because they purchased late, they got a lot more bang for the buck. They bought Sun systems, and, let's see, what else do they have? They're now actually looking at a DDN (DataDirect Networks) system as a possible storage solution. So they originally got some Sun Thumpers, the X4500 series, and after that they bought a bunch of Thor systems, the successor to the Thumpers. And they got many, many of those systems, each one with something like 36 terabytes per system. And they've been slowly provisioning and bringing that storage online. So what's the software you're using to manage all this? You mentioned it's a grid; these jobs are automatically farmed out from the tier zero, and they move the data, and the jobs kind of go with the data. Is this something you wrote yourself, or are you using Condor or Globus, one of the standardized grid setups? How exactly is this set up? Yes, to all. So Atlas's computing model is really grid based. I mean, that was always plan A. That gets some snickers. So the plan has always been grid computing. There hasn't really been a plan B. So whatever we're doing, we call it grid computing, because that's sort of by definition the way we choose to meet our needs. And the primary goal of grid computing is just to harvest these distributed resources in a controlled, authorized way. So we can give resource owners the ability to control their resources and yet allow this distributed collaboration to manage and access them as well, under well-defined terms. So yes, we depend on Globus. Within the US, we actually use the Open Science Grid distribution, which is partially based on the VDT, the Virtual Data Toolkit. These are a set of packages that have all been tested and put into a nice, easy-to-install package. It's based on something called Pacman, which stands for package manager. That was written at BU, and it's Python based. It's sort of a layer above an RPM, where you can have pre- and post-install scripts that help make sure the software that gets unpacked and deployed is configured correctly. This tries to take as much of the complexity of deploying all these middleware services out of the hands of the installer and put it in a central set of scripts that make sure you end up with a robust, working install. That's the goal, at least. Individual sites have a choice. They do need to provide this middleware stack, and as I mentioned, in the US it's Open Science Grid. In Europe, they use the EGEE stack, the WLCG/EGEE stack, which also uses Globus. It's interoperable with OSG but has components primarily developed and packaged in Europe. And other sites around the world may use either the OSG or EGEE software. Or, in the case of the Nordic countries, they use something called NorduGrid, which is their own package, also interoperable with the other two. So that provides the middleware. And the middleware is responsible for authenticating and authorizing users and providing the tools to submit jobs in a standardized way to gatekeepers, where basically your grid credentials are based on an X.509 identity. You have a certificate assigned to every user in Atlas who wants to use the grid.
Those are signed by regional certificate authorities that, through the IGTF, trust each other. And so a user from CERN who gets a CERN certificate can run at our Atlas site at the University of Michigan, because the DOE grid CA has set up a trust relationship with the CERN CA through these higher-level organizations that coordinate all this. So we have a worldwide identity scheme based on these X.509 certificates. You have a public and a private component. You do something called a grid-proxy-init, and that allows you to get your grid credentials. In the case of Atlas, there's sort of another layer put on top of that. There's something called VOMS, these virtual organization membership systems, and you do something called a voms-proxy-init. The difference between a grid-proxy-init and a voms-proxy-init is that VOMS adds an attribute that says you're a member of Atlas. And in fact, you can choose to act in a different role. You could be a software manager, and if the VOMS system allows you, if you're authorized, you can get that attribute appended to your X.509 identity in a secure, signed way, so that when you submit a job, you may be mapped to a different user at the end site, one who's privileged enough to install software, for instance, within the install space for Atlas. There's also a production role. There's a bunch of different attributes that you can imagine adding on with VOMS. So within Atlas, we have a set of roles and attributes that can modify your identity, and then you can do things based on which role you're acting in. You might start out installing some software. You might then be responsible for some production simulation jobs, so you take on a production role. And then maybe you have some personal jobs that you want to submit, so you get a role just as an Atlas user, and then you submit your jobs. For each one of those, you've had the same distinguished name for your X.509 identity, but with different attributes. Anyway, all of this works together to allow us to share resources. There's this SRM interface that I mentioned, storage resource manager. That's a grid interface to storage. It allows those roles and attributes that I mentioned to also be used to map to storage. Our storage areas have a quota system in place based on the roles that groups have, and so we can manage our storage somewhat with that. One of our primary issues in Atlas has really been the distributed data management problem, making that work well. Let's see. Individual sites have choices. Of course, the back-end storage can be whatever they want. It might be simple local disk that you put some sort of SRM interface in front of. It could be a complex file system: GPFS, Lustre. A lot of sites use something called dCache, which was originally designed as a disk cache in front of tape systems. In fact, many tier twos use dCache even though tier twos don't have HSM systems; they just have disks. And one of the advantages of dCache is that it has this SRM interface as part of the package. If you have your own POSIX-compliant file system, whatever it is, you can put something called BeStMan, for example, which was developed at Berkeley, in front of it as an SRM-compliant interface to your POSIX back-end storage. So if you're running GPFS, Lustre, or whatever POSIX-compliant file system, you can then bring up a BeStMan service that'll provide the grid interface to that software.
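Stepping back to the identity layer for a moment: conceptually, what the site-side mapping described above does is take a proxy's X.509 distinguished name plus its VOMS attributes and choose a local account and privilege set. A toy sketch of that idea follows; the role names and mapping are invented for illustration, and real sites use their grid-mapfile and VOMS configuration rather than code like this:

```python
# Toy illustration of VOMS-attribute-based mapping at a site: the same
# distinguished name (DN) lands in different local accounts depending on
# the role attribute carried by the proxy. Invented mapping, for flavor only.
ROLE_TO_ACCOUNT = {
    "/atlas/Role=lcgadmin":   "atlassoft",  # may install Atlas software
    "/atlas/Role=production": "atlasprod",  # runs managed production jobs
    "/atlas":                 "atlasuser",  # ordinary Atlas member
}

def map_user(dn, voms_attributes):
    """Pick the local account for a proxy based on its VOMS attributes."""
    for attr in voms_attributes:            # attributes ordered by priority
        if attr in ROLE_TO_ACCOUNT:
            return ROLE_TO_ACCOUNT[attr]
    raise PermissionError(f"no mapping for {dn} with {voms_attributes}")

dn = "/DC=org/DC=doegrids/OU=People/CN=Some Physicist"
print(map_user(dn, ["/atlas/Role=production", "/atlas"]))  # -> atlasprod
print(map_user(dn, ["/atlas"]))                            # -> atlasuser
```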
And then schedulers are, again, a choice individual sites make. A lot within the US use Condor. There's also LoadLeveler, there's LSF, there's PBS. Many different schedulers are in use on the back end that take care of scheduling the CPU resources at each site. Now, you mentioned probably several hundred different software packages in there. Some of them, it sounds like, are commodity packages or were developed elsewhere, but a large number of them sounded like you developed them specifically for your needs. How did you do that? How did you coordinate across this worldwide network to develop software that needs to be common throughout your entire hierarchy? It sounds like such a daunting task that you would need almost an entire IT infrastructure and organization just to be able to get done the physics that you want to get done. How, as a project, did you do that? I guess the first statement is that physicists have a long history of having to collaborate and work together, so we tend to have that culture in place. Even having said that, though, as you mentioned, it's a daunting task. I mean, think about the analogy with building the hardware itself. We had groups from all over the world that had to build something that all fits together and works as a whole, right? And so that's an equivalent and parallel task to the software infrastructure. They both require a significant amount of interoperation amongst groups that are all over the world. So one of the things we try to do is focus on standards and interfaces wherever possible. We've all tried to leverage some developing standards from GGF, I guess it's OGF now, and other entities to try and provide a way to interoperate. So as I mentioned, there are sort of three main grid middleware software stacks around the world. The OSG software is developed within the US. A lot of the focus for Open Science Grid has come from the high-energy physics users, but there are other users within there: astronomy, eVLBI, earthquake engineering, bioinformatics, a bunch of other stakeholders in OSG. And so the idea there is that we try to leverage common standards wherever possible, yet we have a bunch of specific needs that are, in some cases, unique. So one of the areas where we've done a lot of work in the US is something called PanDA. PanDA is, in some sense, a meta-scheduler. It is our system for distributing and managing our production and distributed analysis for users. Actually, PanDA stands for Production and Distributed Analysis. And that was originally only in the US. In 2007, it was actually chosen to be the Atlas-wide production and distributed analysis system, and since then, there's been a slow, steady deployment across all of Atlas. So this was a project that started out specifically within US Atlas to manage sending out jobs to all of our sites in a controlled way, and to be able to set priorities and manage workflows in this very data-driven environment that we live in within high-energy physics. And it's now basically the way for Atlas, globally, to do this. And PanDA sits on top of these underlying middleware systems. They could be the NorduGrid, the EGEE, or the Open Science Grid stuff. But it's a combination of interfaces and databases that work together to manage our workflow. And so PanDA is actually a very interesting system and has a lot of built-in monitoring. It's been very effective at managing our work to date. We've easily hit 50,000 jobs a day with PanDA.
Our goal, by the time we're actually taking real data, is that worldwide we should be supporting half a million jobs a day with PanDA. And a job can be multiple CPU hours and have gigabytes of data in and out. So we have a significant amount of workflow to manage with PanDA. So is PanDA deciding which jobs coming into the grid should go to which site? Like, is it probing all these different grid sites to figure out who has available resources and what should go where, when? Exactly. So PanDA has some ability to tap the information systems that these software stacks provide. So it has some idea about how many job slots are available, what kind of scheduler is at each site, how much disk space the site has and in what areas, what kind of software is installed and accessible, and what kind of data is already deployed at the site. So PanDA has to make a decision. It will get a job or a set of jobs that it has to run, and it knows that it has access to the following resources within a given cloud. And it'll know roughly how loaded each site is, what the queue looks like at each site, whether or not the software it needs is available, and whether or not the input data sets it needs are available. In some cases, it will end up in a situation where the data may be available at a site, but the site already has a very high load with a backlog. So it has a choice. It can choose to submit anyway and further increase the backlog of that site, or it can initiate a data movement to another site that's not nearly as loaded, to preload the required input data at that site, and then submit the job to that site as the target. And so it tries to balance these things. And it's been tricky, because there was a trade-off when PanDA was being developed. A lot of the middleware systems that were in place weren't as robust as we would have liked. And if you depend on them and they break, your whole job distribution system breaks. So they made some choices early on to keep their coupling to these standard middleware pieces as minimal as possible, so that they could control the risk and produce a robust system. As the middleware has gotten more robust, they're using more and more of the information and capabilities of the middleware systems to help manage this. But it's still, as you can imagine, a pretty complex job trying to figure out how best to take a huge number of inputs and distribute them in the best way over a bunch of sites that have different capabilities and different queues and different storage. So that's PanDA's primary responsibility.
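The brokering trade-off described here (send the job to the site that already has the data, versus move the data to a less-loaded site) can be sketched as a simple scoring function. This is a made-up toy, not PanDA's actual algorithm; all weights and field names are invented:

```python
# Toy sketch of PanDA-style brokerage: for each candidate site, estimate the
# cost of running a job there, counting both queue backlog and any data
# movement needed.
def site_cost(site, job):
    queue_penalty = site["queued_jobs"] / max(site["job_slots"], 1)
    transfer_penalty = 0.0
    if job["dataset"] not in site["datasets"]:
        # Data would have to be pre-placed at this site first.
        transfer_penalty = job["input_gb"] / site["transfer_gb_per_hour"]
    return queue_penalty + transfer_penalty

def broker(job, sites):
    return min(sites, key=lambda s: site_cost(s, job))

sites = [
    {"name": "site_a", "job_slots": 500, "queued_jobs": 4000,
     "datasets": {"mc09.jetjet"}, "transfer_gb_per_hour": 200},
    {"name": "site_b", "job_slots": 800, "queued_jobs": 100,
     "datasets": set(), "transfer_gb_per_hour": 400},
]
job = {"dataset": "mc09.jetjet", "input_gb": 50}
print(broker(job, sites)["name"])  # site_b: moving 50 GB beats an 8x backlog
```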
So now that you've been working on this project for a couple of years and you're knee-deep into it, and the actual collider and detector should be coming into production pretty quickly here, what would you have liked to have done a little bit differently? You know, this whole organization of auto-farming jobs and data management, what would you have liked to have seen done differently, maybe? I think the best... Let me throw another spin on that question. What technologies would you have liked to have seen more mature, that you could have just used rather than had to develop yourself? So I guess that's not really another spin, it's a follow-up question. Yeah, so I think, you know, hindsight is always great. Given the way things have developed, it would have been nice to start along some of these paths much earlier. So I can give a few examples. A lot of the design for some of the infrastructure, for the middleware, was based on what the developers knew about or had at the time. So ideally, as part of their design, they would have put in hooks or opportunities to utilize future capabilities that didn't exist yet but would have helped solve problems. The hierarchy that I mentioned, which we have in place to help define how data flows and moves, is also there really because we wanted a way to manage those resources to prevent overloading, to prevent situations where individual resources become so overloaded that there's no goodput anymore through those resources. In my opinion, it would have been better to design resources and services to always protect themselves. What they should have anticipated is that there may always be more demand than they can supply, and the resources and services should be configured to gracefully deal with high loads. A lot of the issues we have seen are based on problems where individual pieces or components overload another component, and the system becomes unresponsive or crashes, and that has repercussions in a coupled distributed system. So if we could have done a better job of thinking ahead to what we might want in the future that didn't exist yet (hey, if we could do this in the future, let's put a hook in here in the software, in the architecture, in the design, so it can be taken advantage of when the back-end resources support such a capability), and had a focus on protecting resources, we'd be better off. The net result is that a lot of our design is not as performant or efficient as it could be, because we introduce arbitrary overlays to restrict how services can use other services. So rather than protecting the services themselves, or having them self-protect, we overlay a higher level of control, or attempted control, on the infrastructure so that it won't melt down. But the side effect is that you're not able to move data or use resources as effectively or efficiently, because these constructs aren't really designed to be efficient. They're designed to make sure the system works. And I certainly understand that, and our primary goal has to be having it work, but it would be nice if we had thought more about how to also make it perform well and take advantage of future capabilities that could be developed. It's difficult now, for example, if you have a network where you can create dynamic circuits, for the Atlas production infrastructure to take advantage of that, because that wasn't a concept that was even thought of. The network was this black box where you put data in and it hopefully comes out where you need it on the other end. If you can think about areas where you could co-schedule network and storage and CPU in an effective way, then you have a much more robust system, and one which can perform better. But if those concepts aren't part of the initial design, it's very hard to patch them in afterwards. So I think that, from my perspective, would have been something I would have liked to have seen more effort on quite a while back.
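The self-protection idea being advocated here, services that shed or queue load gracefully instead of falling over, looks something like this in miniature. This is an illustrative sketch, not any actual Atlas middleware:

```python
# Minimal sketch of a self-protecting service: a bounded queue plus an
# explicit "busy" rejection, so overload produces backpressure instead of
# an unresponsive or crashed component downstream.
import queue

class SelfProtectingService:
    def __init__(self, max_pending=100):
        self.pending = queue.Queue(maxsize=max_pending)

    def submit(self, request):
        try:
            self.pending.put_nowait(request)
            return "accepted"
        except queue.Full:
            # Refuse early and cheaply; the caller can retry or back off,
            # rather than this service melting down and taking others with it.
            return "busy, retry later"

svc = SelfProtectingService(max_pending=2)
print([svc.submit(r) for r in ("a", "b", "c")])
# -> ['accepted', 'accepted', 'busy, retry later']
```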
Well, cool, this has been absolutely fascinating for me. Quite a different use of engineering than my normal day-to-day. It's quite impressive, if I do say so myself: a big worldwide network with a huge, complex problem it has to solve just to be able to get the science done. That is just tremendously cool. I wonder if you could cite a couple of websites, and maybe we can put some of these URLs in the notes for this podcast, where a novice could go to have a look and see some more details about Atlas itself, about some of the computing infrastructure that you use, and some of the physics behind it, and things like that. Sure, in fact, what I'll do is provide some links to Atlas and sort of an overview of Atlas from a novice's point of view. I mentioned PanDA; there's a nice TWiki resource that describes PanDA. I'll provide that. And in fact, I'll give a link to our PanDA production interface. If listeners are interested, they can connect and actually see Atlas jobs running all over the world. And there's quite a bit within that page to dig down into and actually get some detail, and some nice graphs built into the page. And I can also provide, for example, our Tier 2 TWiki page. If people are interested, they can poke around there and see what a Tier 2 looks like and get some more information. And of course, people are always free to email me if they have specific questions. One additional question there. I would imagine that most people who want to be involved in this project are already involved in this project. But what if you're a new physicist, you've got your PhD, or you're just starting your research track or something like that? How would someone go about getting involved in the Atlas project? So it really depends on the individual case. But the best way is to join a research group that's already established inside Atlas. There is a way for new groups to join Atlas, but it takes a little while. There's basically an effective negotiation that happens with Atlas, and something called the Institutional Board eventually has to vote to create new members. But since there are so many institutions already participating in Atlas, if you're a new physicist, the best way is to join or participate in one of those existing groups. And most of the major research institutions in the United States certainly are either members of Atlas or members of CMS, our competitor. So that would be the primary way. Okay, Shawn, this has been very interesting. My sister actually worked on Atlas as an undergrad several years back, and so I've been kind of exposed to this, but I knew very little about it. So this is very enlightening for me. This has been great. So thanks a lot for taking some time with us. And I will get all those URLs from you and put them into the show notes. Great, well, thanks for having me. It's been fun. No problem. And this show will be up on the regular website at www.rce-cast.com.