of RCE. I am your host Brock Palen. You can find us online at RCE-cast.com. I also have Jeff Squyres from Cisco Systems, one of the authors of Open MPI. Jeff has once again helped me out, and we've been a little delayed here, Jeff. Yeah, we've been a little slack here in the summer doldrums, I guess, and we haven't gotten all of them out. But we've got a flurry of recordings coming up. I think we're recording three different podcasts within a week. Yes. And then some, yes. Yes, because I am going to be at a conference for a week, the first week of August, about our topic today. I'm going to be at TeraGrid 2010, and we have with us two guests from TeraGrid. But before that, the usual stuff. Jeff, you have some interesting new blog posts out since we haven't had a show. You have a couple out. Yeah, well, I've been a little slack on my blog as well, so I'm trying to put out a couple in a row to keep my stats up, keep my blog entries per month up, because I get a little heat about that from the Cisco PR department. But yeah, actually, the last post I did was a little play on the Old Spice commercials, and about half the people got it and about half the people didn't. They're like, why are you on a horse? What does that have to do with high performance computer networking? But then, you know, the young'uns like Brock here got it and thought it was immensely funny, so I feel my mission was accomplished. Yeah, you can find a link to Jeff's blog off of the RCE website. So today, our two guests are Kay Hunt and Scott Lathrop. They both work with TeraGrid in slightly different ways. I'll let them introduce themselves and talk about what they each individually do. So Scott, could you take a minute and introduce yourself? Sure. I'm with the University of Chicago, and I'm the area director for Education, Outreach and Training for the TeraGrid project. And I'm Kay Hunt.
I'm with Purdue University, and I've been with TeraGrid for about four years now. My main interests and tasks have to do with the Campus Champion program that's under the Education, Outreach and Training arm of TeraGrid that Scott is in charge of. In full disclosure, I should mention that I am the University of Michigan's campus champion, along with Andrew Caird of the University of Michigan. So this is how I know Kay, and this was a little bit of an inside deal here before the TeraGrid conference. So could one of you explain exactly what TeraGrid is, for those who have never heard of it, or maybe heard of it and aren't really sure? Sure, be glad to. Let me do the short version. www.teragrid.org will provide a rich array of content and information. But TeraGrid is funded primarily by the National Science Foundation and provides high performance computing, supercomputing resources if you will, in support of scientific research and education for people across the country. It's primarily for academic research. So researchers from any US institution that want to do computational science can request time and access to resources that are scattered currently across 11 different sites. So there are 11 different organizations that are formally a part of the TeraGrid project that's funded by NSF. Okay, so if a user wants to get started on the TeraGrid, what exactly are they looking at, like contributing to TeraGrid versus getting from the TeraGrid? What kind of services does TeraGrid provide? So anyone that wants to use the facility can make a request for time. The simplest is a startup account or an education account, and that's a request they can put in with a fairly straightforward form they fill out. And normally within two to three weeks they'll have an account and they can start using the facilities. As I mentioned, the facilities span 11 different sites, but there are some 20 different computing systems.
So there's a variety of architectures that they can get access to. And once they have access, the researcher, or the educator that may want to use this in a classroom activity, can upload their code, start running it, and work on optimizing the code. And then for a researcher who really wants to go full bore, they can request additional time. The startup account is really to get them started, to help them figure out which system or architecture may be most appropriate. And then they can request a much larger allocation of time. So when you say an allocation of time, are these CPU hours on a system? Is this time with support people to help you work on a code? What exactly are the services? So an individual can request computer time on any of these systems. And I should say this time is all free, because it's funded by the National Science Foundation. They can request hours of time; for a startup account on the kinds of systems we're talking about, it would be nominally 30,000 hours of computer time. There are projects today requesting millions of hours of computing time to do their research. In addition, they can request a large amount of data storage if they're doing something that's heavily data intensive. They can also request the time of some TeraGrid staff to help them optimize their code to take best advantage of the computing resources they're being provided. And then they can also request access to visualization servers, so they can do remote visualization of the research they're doing and produce visualizations to show others the results of the calculations they're running. So when I first got involved with high performance computing as a student employee, swapping CPUs on old machines that were on their way out, we were involved in something called NPACI, which was also a National Science Foundation computing resource provider to people around the country.
Is there a relationship between TeraGrid and NPACI? Yeah, so this program has actually evolved. It started out with the National Science Foundation funding what they call the Supercomputing Centers back in 1985, 1986. And there have been a number of programs that the National Science Foundation has funded since then. As you're referencing, along the way NPACI was an effort that was led by the San Diego Supercomputer Center, and there was a corresponding Alliance program led by the National Center for Supercomputing Applications, or NCSA. Those centers have continued to be involved, but the NSF program has continued to evolve. The TeraGrid program was the follow-on, if you will, to what NPACI and the Alliance offered to the research community. And the TeraGrid program is just now running into its sixth year of operation with funding from NSF. So in comparison to others, we've just been talking about various partnerships and a little bit of history there. With TeraGrid, what is the sum total of resources that you have? You mentioned a number of organizations, but what kind of horsepower can an individual researcher submit their jobs to? So it's an evolving landscape, and it depends, of course, on the funding from the National Science Foundation for TeraGrid, but today there's over what we call a petaflop of computing power with the various systems combined. And we know there are more systems that NSF will be funding. In fact, sometime probably in the next couple of months we'll hear about some additional systems that NSF will be funding to provide services to the community. Collectively, TeraGrid is providing as much computing power as any other operation in the US for free academic research and education. But that's not to say it's the only activity. The Department of Energy also supports major computing resources. But their support is, if you will, mission driven.
They support research that's ongoing in relation to the mission of the Department of Energy. NSF is really, if you will, the most open academic research computing infrastructure that's available in the US today. So when you say the most open, what exactly do you mean by that? Is this a members-only kind of club? Do you have to provide resources to get resources? Or, you know, what about an average Joe researcher who's got some number crunching they need to do and no machines to run it on? How do they get involved? So again, it's open to any academic researcher or educator at any US institution. The time is free. They have to submit an application. As I mentioned, the startup applications are usually quickly reviewed, and within a couple of weeks they'll have an account. As someone scales up the size of their computing needs (I mentioned earlier, some projects today are getting millions of hours of time on an annual basis), those larger requests are reviewed so that there's a balance of requests from across the country and we can balance the broad range of computing needs. But the research is literally open, and we encourage and support all fields of research, whether we talk about weather studies, climate studies, bioinformatics, or whether we talk about the social sciences or humanities. We encourage research in all fields of study. So you talked about researchers; can regular US citizens, or a corporation, rent time on NSF TeraGrid resources, or do you literally have to be affiliated with an institution? You certainly have to have an affiliation with an institution. Academic research is certainly the predominant base of users. Industry is currently taking advantage of it, but it's a site-by-site agreement between industry and the site running the facilities. And that's because what NSF is supporting is open research.
So the computing that's done in support of the research all has to be published. For proprietary work, such as for industry, since the work is not being published, they have to, if you will, pay for the use of the facilities. So that's negotiated on a site-by-site basis. As far as Joe Public or Sue Public wanting access, that's not likely to occur, because the facilities really are for people that need high-end computing capability. It's not intended to supplant what you can get through your home computing or department computing or campus computing. It's really intended for people that need more resources than they can find on their local campus. So are there any groups out there that are similar to TeraGrid that are bigger? You mentioned the Department of Energy has some; I know of the NNSA and a few other projects. Is TeraGrid the largest organization of its type in terms of compute power and users? I think it's probably the largest in terms of total compute power and user base. But there are other offerings. Open Science Grid is one that probably has more systems available, if you look at the list of organizations that are part of Open Science Grid, but collectively their total compute power is probably not on par with what TeraGrid offers. The Department of Energy is certainly one federal agency where it's sort of a leapfrog effort: as NSF funds more compute facilities, the Department of Energy is also funding more compute facilities. So on an annual basis, who's got the most sort of jumps around. But there are other federal agencies, of course, that are also supporting research using high performance computing. Whether you talk about the Department of Defense or other agencies, there are multiple organizations and agencies that are supporting research.
So all in all this actually sounds like a really great deal, especially for, say, junior professors who are early in their tenure-track race and don't really have access to resources yet, but really need to do some computation at least to get started. For example, you mentioned the startup process and whatnot. Can you run us through a typical scenario? You know, let's say I'm Joe average chemical engineer and I've got some numbers to crunch. Do I write an MPI application and then submit it, and it runs somewhere in the grid and I get the results back in a day? How do the mechanics of actually running on the TeraGrid work? So I think we sort of work from the basis that someone, a researcher or someone teaching a course, already has a code they're running, whether it be on their desktop or some local cluster. So generally they're taking a code that exists and moving it over the network to one of these TeraGrid sites. And then they submit batch jobs. The turnaround depends on the machine and how many other people are running. You know, with debug queues you can probably get jobs in and out relatively quickly. But if you're running a very large job, it could be something that runs overnight, or for some people it might even take a couple of days, just because of the magnitude of the size of the run they've actually submitted. So someone would move their code up to one of these systems, make a few runs, then probably do a fair amount of debugging. And then more than likely they'll want to do some optimization, because what they were running on a smaller cluster may not run as effectively on one of these other systems. So then they probably spend some time there with their graduate students working on optimizing the code. And some of that may involve, you know, questions that they want to submit to the TeraGrid help desk.
And there's a 24-hour hotline where they can call and ask questions if there are problems, and there are consultants available at these sites to help people with questions. As I mentioned earlier, if they're getting into a really intense set of problems, they can request more dedicated support from TeraGrid staff. But so the average user is uploading their job or uploading their data, running a batch job, and then the results come back in, you know, a matter of minutes or hours depending on the size of the task. And then they probably do some visualization locally or some analysis locally and say, whoops, need to modify the parameters, need to modify the code, and make multiple runs. So it's not unlike running on a local cluster; it's just that now you have more compute nodes available to run larger, more complex analyses. When a user gets an allocation, do they pick individual systems to run on? Because I notice there are some shared memory machines, some InfiniBand machines, and some Cray machines that, you know, don't support opening sockets, stuff like that. Or do you just submit your job to some grid meta-scheduler? So nominally, a person requesting time would either request a specific machine, because they're familiar with it or the architecture, or they may request time on multiple systems. And for a startup account, some people may request two or three different architectures just to try their code out. Or because some portion of the code may run well on one architecture and another portion of the code may run better on another architecture. So the startup allows them to do some of that testing and analysis. Then when they go for the big runs and the larger allocation of time, they may select one or a couple of systems. In general, I'd say the majority of people are running on between one and three systems.
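[Editor's note: the "move your code over, submit a batch job, get results back" workflow Scott describes might look roughly like the sketch below. This is illustrative only; the scheduler (PBS/Torque), queue name, host name, node counts, and file names are all assumptions, not actual TeraGrid site configuration.]

```shell
# Editor's sketch of a batch-submission workflow on a hypothetical
# PBS/Torque-scheduled cluster. All names and sizes are illustrative.
cat > job.pbs <<'EOF'
#!/bin/sh
#PBS -N my_simulation
#PBS -l nodes=8:ppn=8
#PBS -l walltime=04:00:00
#PBS -q batch
cd $PBS_O_WORKDIR
mpirun -np 64 ./my_mpi_code input.dat > output.dat
EOF

# Stage the code, input, and script to the remote site, then submit
# (commented out: requires a real account and a real login host):
# scp my_mpi_code input.dat job.pbs user@login.bigsite.teragrid.org:run/
# ssh user@login.bigsite.teragrid.org 'cd run && qsub job.pbs'
echo "wrote job script: job.pbs"
```

For quick turnaround during debugging, the `-q batch` line would be swapped for the site's debug queue, as mentioned above.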
There's probably a small subset that run on more systems. But it varies, depending on how complex the needs are and how much the researcher wants to know about the intricacies of every system they're getting access to. So I notice a lot of the resources on the TeraGrid are really set up for low-latency communication. They all use InfiniBand or NUMAlink or the Cray SeaStar network chip. What if you're a researcher and you need lots of horsepower brought to bear on an embarrassingly parallel parameter sweep, a farm of serial jobs? What kind of resources does TeraGrid provide for that? So I would say that the mix of the community is all over the place. There are some people that have some very large serial jobs, but I think the predominance is people running parallel jobs. And we think about sort of the mix of systems out there, and if someone is really just looking for high-throughput computing for a large sequence of serial jobs, something like Open Science Grid may actually be more applicable to their needs. But for someone who's running a large, many-processor, very highly parallel job, then the TeraGrid systems may be more appropriate. But there are some research groups that actually find that running some of their jobs on Open Science Grid and some of their jobs on TeraGrid helps them to optimize their performance, their throughput, et cetera. And they sort of look at the mix of the types of jobs they're running to decide, again, which architecture, which environment is most appropriate. So there's no one answer. You know, the startup process is a way to sort of try that out, look at the different systems, and figure out what the right mix is. And the consultants can help people think through that and help make the right decisions. Let me ask you a question from the other side of the fence here now. So we've been talking about the researchers and the value that they get and things like that.
What's in it for the providers? So the, you know, the Purdues and the University of Chicagos and others, you provide resources that, basically, people who are not under your control can submit jobs to. So what's the value proposition for you? Well, we got into this business because NSF is trying to support the advancement of science. And so what the sites that are involved in TeraGrid get out of this is the science that's accomplished. Or, you know, I keep mentioning the education, the preparation of the next generation of scientists. But what we really look for are the publications, the research, the science that's accomplished by the different individuals or groups that are using these facilities. So the scientific outcomes are the critical component. And so we're always asking for publications, reports from the people using the allocations. And when someone comes back for another round of allocations and more time, we always say, so what did you accomplish the first time around? Show us the results of the research you've accomplished. And we also put a fair amount of effort into talking with these researchers to document the science that's being accomplished. Because in the end, the National Science Foundation has to report back to Congress and others about what's the impact on science and science productivity. So it's that scientific accomplishment that these different research groups are able to demonstrate that's critical. Or, as I say, to the extent that we're helping faculty help their students learn how to become productive users of these techniques and methodologies to advance science, those are all critical outcomes in our estimation. Okay. Let me ask another slightly different direction question here. So Brock has done a podcast or two on Blue Waters. What is the relationship between TeraGrid and Blue Waters? That's an interesting question.
And it's really been driven in part by the National Science Foundation. They funded the TeraGrid effort to do what we've been talking about. And then they funded the Blue Waters project at NCSA, University of Illinois, to deploy the single largest computing system available through the collection of facilities. So the machine is not yet in production. In fact, the machine is still being designed and built by IBM. The machine is anticipated to go into production in 2011. And when it goes into production, the objective is to support scientific research in which the researchers have sustained petaflop performance on their application codes. So today, what I mentioned was that TeraGrid collectively has over a petaflop of computing power. The Blue Waters machine will be supporting sustained petaflop performance for a variety of scientific applications. So you can extrapolate that to mean that that machine by itself will have more compute power than all of TeraGrid today. So it'll be the largest machine in the NSF-funded spectrum of systems. So is it going to be on the TeraGrid itself, or is it a separate machine, separate entity, separate organization? At this stage, it's not officially or formally part of TeraGrid. It has a separate allocations process. But since NCSA is involved in both Blue Waters and TeraGrid, you can expect that some of the same staff helping to support the TeraGrid community will also be helping to support the community using the Blue Waters system. And we fully expect that a number of the users of the Blue Waters machine will have already been running codes on TeraGrid, and what they're really looking to do is scale up to yet a larger system. And so TeraGrid may be a stepping stone, if you will, for some of those projects. Some of the projects may have been previously running on, you know, DOE machines or other systems.
But it really just suggests that it's the chance to scale up to yet a larger system within the NSF community of facilities. So users working with the TeraGrid are at their institution, and the TeraGrid resources are remote across the country, including the support resources. What can they do if they need to sit down with somebody and get help locally, and have someone actually look at what they're doing instead of talking over the phone? Well, the reality is most of the users are remote from these facilities. So the whole support system is designed to support remote users. If we didn't or couldn't, we'd have a real problem. So the consultants are there to talk with them by phone and to exchange stuff through email. You know, if someone happens to be on the same campus or close to one of these facilities, there's no reason they can't sit down with the staff there. But it's not the optimal solution for most people. Someone who's sitting on the campus at one of these sites, you know, can certainly make an appointment with the support staff there. But the whole objective is to ensure that regardless of where you're sitting, you can get hopefully the same level of support and assistance. So I'm a member of the Campus Champion program, where I can help out people, have access to all the TeraGrid resources, and guide people along locally on my campus as my campus's advocate for TeraGrid and support for that. Kay, you're kind of my supervisor in this role. So can you explain exactly what the Campus Champion program is? Sure. The Campus Champion program is a relatively new program. We enlisted the help of the first campus champions in May of 2008, so we're just a little over two years old now. And this is a program where we try to identify, at an institution, a local representative who will be an outreach person, a user support person, a feedback type of person for their campus.
So we've found that some people, when they go to look for resources, don't even know that the TeraGrid resources exist. So we use our campus champion, the person that's been identified at the institution, to help with outreach, to let others know what the TeraGrid is and how they can get access to it. And we've also found that some people, even though they know about it, think it's maybe hard to get an allocation on the TeraGrid. So we've asked our campus champions to help expedite that process, to help the local researcher get their allocation without a lot of red tape, and to answer questions along the way. We feel like that's helped move the program along a little bit. The other thing we ask the campus champions to do is to feed back to the TeraGrid staff areas that they feel need some upgrading, need some change, need some more definition, so that the resources that we have available in the user services area can be more reliable and more useful to the researchers that are out there. So it's really a three-pronged thing that the champions try to do for us: outreach, some hand-holding, and some feedback. And we've found that this has been fairly successful on most of the campuses where we do have a champion. I see. So it's not so much that Brock can help his local users with, say, the TeraGrid resources at Purdue. He's more of a gateway of information, in at least one of the three prongs here, who can get the user in touch with the relevant help at Purdue or whatever resource they're using on the TeraGrid. Is that right? Exactly. Exactly. We can't expect our campus champions to be know-it-alls with the absolute answer to everything, but we can give the champion direct access to the user support staff and to the other TeraGrid staff, so that any user who needs particular help can get a direct link to the qualified help that they need. Yeah, I've actually done a number of things on our local cluster, learning about TeraGrid to make our local resources integrate better with TeraGrid.
I've set up some of the Globus and MyProxy stuff to be able to make transferring data between our system and TeraGrid systems easier, because we see TeraGrid as yet another resource that enables research computing at Michigan, so we combine all these things together. Exactly. We hope that the campus champion won't just be a TeraGrid person. We hope that a campus champion could sit down with a researcher at, say, the University of Michigan and say, okay, what are you trying to do? Here are the resources that we have locally. Here are the resources that we have nationally. What's the best track for you to take, and how can I help you get there? So it sounds like that's what you've done, Brock, there at the University of Michigan. Yes, that, but I've also made it so that people can prototype their stuff on our local system and then, very easily, using GridFTP between the systems, move data back and forth and log in back and forth; I just try to make our system interoperate with yours a lot better. And that's kind of one of the intrinsic values we have here: the TeraGrid systems are all at least connected in some way. I wonder if somebody could explain that a little bit, that if I run on one system, it's not as difficult, perhaps, to run on a different TeraGrid system. For example, I think we mentioned earlier, if my needs have scaled up, okay, now I know what I'm doing. I've prototyped my code, I've gotten some preliminary numbers, but now I've got a bigger allocation and I can run elsewhere. Is it a difficult or an easy job, and how do people typically migrate to other resources? Yeah, I think you'd have to ask the individual researchers how difficult or easy that is. I think that we've tried to make it easy: you start at your local level, maybe with a small amount of data, and you really have this larger job you want to do, but you can't get that done on your local resource.
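[Editor's note: a data-staging step of the kind Brock describes, using a MyProxy credential and a GridFTP transfer, might look like the sketch below. The host names and paths are made up, and the commented-out commands assume the Globus Toolkit client tools and a valid grid certificate.]

```shell
# Editor's sketch: staging a result file to a TeraGrid site with the
# Globus Toolkit command-line tools. Hosts and paths are hypothetical.

# 1. Obtain a short-lived credential from a MyProxy server (commented
#    out: requires a real MyProxy server, account, and passphrase).
# myproxy-logon -s myproxy.teragrid.org -l myusername

# 2. Build the GridFTP transfer: -vb prints progress, -p 4 uses four
#    parallel TCP streams for better wide-area throughput.
SRC="file:///home/me/results/run01.dat"
DST="gsiftp://gridftp.bigsite.teragrid.org/scratch/me/run01.dat"
CMD="globus-url-copy -vb -p 4 $SRC $DST"
echo "$CMD"
# eval "$CMD"   # uncomment on a machine with the Globus client installed
```

The same `globus-url-copy` invocation works in either direction, which is what makes the "prototype locally, scale up remotely" pattern convenient.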
So you need to transfer data to the national resources and scale up your code, and if you've gotten all the little details worked out at the local level, it's generally pretty easy to scale up. Now, there are differences in compilers, there are differences in some data structures and so forth, so there may be some hurdles for you to get over, but when somebody's done something like Brock has done, he makes that transfer hopefully as easy as possible. The other thing I would add: there's been a lot of work, and there's actually a working group, to address creating a common user environment. So as you move from one system to another in TeraGrid, the objective is to make the file structure look similar across the systems, et cetera, so that it's less painful, if you will, to move from one system to another, because the environment is more consistent across each of them. And I will add to that, I have heard from some of our campus champions who are very excited about the common user environment, and it has helped them a lot to make those transitions easier. So if we have a researcher listening to this show and they want to see if their campus already has a campus champion, how can they find out who their champion is? They can go to teragrid.org, and on that page we have a link to the campus champions. All the campus champions that are currently working are listed, their institution is listed, and their email address is listed, so that they can be contacted by the local person if they want to. And hopefully the champions on those campuses have also done some marketing of their own, some communication of their own, to the research staff at their own institutions. But if they don't know, they can certainly go to teragrid.org and find out who it is. Currently we have 69 institutions that are engaged with the program.
So if they find out their institution doesn't have a campus champion, who qualifies to become the champion for the institution? So if they look for a champion and they don't have one, and maybe they're interested in getting one at their organization, teragrid.org also has all the information about how you can become a champion. And basically that's a simple route to take. What they would do is send a piece of email to the email address that's listed on the website. That comes to me, actually. They express their interest, we talk and ask what it is they want to do and what they want to get out of this. We do have a memorandum of understanding that we get signed and authorized, usually by someone in the upper administration of the university. The reason for that is we'd like the upper administration to, number one, be aware that the program is going on and be supportive of it, so when they hear about it on campus, they can add their support when appropriate. And that memorandum of understanding just says what it is we expect the champion to do and what it is TeraGrid will do for the champion, which is something we haven't talked about yet, but we should. And then we sign that memorandum of understanding and that person becomes a champion. At that point, we try to go through an orientation process with the champion to get them integrated into the program if they're not really familiar with the TeraGrid. We have training available and resources for them to access. We have a monthly conference call. And we have the yearly conference, which, as you've mentioned, Brock, is coming up in a week here, where we get the champions together and try to support their needs.
So the things that TeraGrid tries to do for the champions are to give them direct access to user support, give them access to documentation and training, actually come to a campus and help with an outreach event or a training event where that seems needed and relevant, and provide materials that would be helpful to the champion as well. And we should say there's no cost to a campus to have a campus champion, but in the process of selecting who the champion is on a campus, we like to work with the campus to make sure it's someone who's really reaching out to and supporting the local research and education community, so that, as Kay says, they can help spread the word and be there to help support the researcher, as she said, trying to figure out which of the various national resources available are most optimal for the individual or the group. So moving on from that, what comes after the TeraGrid? I know these projects tend to be on a five-year or three-year or ten-year kind of cycle. What's the future for TeraGrid? So that's an interesting question, because right now there's a solicitation out from the National Science Foundation for what they call TeraGrid Phase III, otherwise called eXtreme Digital. So they're going through the process now. I should actually say there are a couple of proposals that have just recently been submitted; the deadline for submission passed just a week ago. And so NSF is going through the review process now to select activities to extend TeraGrid for the next five years. It's intended to build, of course, on the success that's been accomplished to date, but also to look at ways that the program can improve to better serve the research and education community and campuses moving forward.
So it's a little premature to give you an answer as to what it will look like, because it will depend on what NSF funds, but it's expected that NSF will be making these decisions and a new set of activities will start up in approximately the April 2011 timeframe. I was going to add two things. You have already talked about the TeraGrid conference coming up the week of August 2nd. So I just want to mention that the TeraGrid conference is an annual event; this year it's the week of August 2nd. There's also a very strong TeraGrid presence at the annual supercomputing conference. The SC10 conference is coming up in November in New Orleans, and all the TeraGrid sites participate in that conference each year. And then we try our best to get to as many professional society meetings as possible throughout the year. And as Kay mentioned, any campus that would be interested in having TeraGrid visit to share more information or conduct a local training session, we are always happy to explore those options as well. So there are lots of ways to interact with and learn more about what TeraGrid is doing. And we'd love to hear from your audience about what we can do to help support them. Thank you. That sounds great, and we really appreciate your time today. Like I said earlier in the interview, this sounds like a great deal. So if you're an academic researcher out there and you're not taking advantage of the TeraGrid, it sounds like you should. So I think we covered a lot of ways to get involved. And if your audience is predominantly students, this is a good time for them to go nudge their advisor and say, hey, maybe we ought to get involved. All right, well, I think that about wraps it up. We appreciate your time. Thank you.