Live from Las Vegas, Nevada, it's theCUBE at IBM Edge 2014. Brought to you by IBM.

Hi everybody, we're back at IBM Edge. This is Dave Vellante with Jeff Frick, and we're here in Las Vegas at the Sands Convention Center. We're going two days. This is theCUBE, SiliconANGLE's flagship program. We go out to the events and we extract the signal from the noise. Pamela Gillman is here. She's the manager of the Data Analysis Services Group at the National Center for Atmospheric Research. We're going to geek out on HPC and big data. Pamela, thanks very much for coming on theCUBE. It's great to see you. So tell us a little bit about NCAR, your role there, and what you guys have going on.

Okay, so the National Center for Atmospheric Research is a national center that provides facilities and services to atmospheric researchers. It's run by a university corporation; at this point we have 95 member universities that define the kind of work we do. Our focus is mainly on climate as opposed to weather, and most people don't understand the difference. Weather is the forecast you listen to in the evening: it's short-term, you want to know what it's going to do tomorrow. Can I go to the pool today or not?

Right.

Climate is very long-term research. Some of our researchers run models that look at what happened in the weather over a thousand-year time period. They're looking for the cyclic things, for the correlations from year to year and through time. What the National Center for Atmospheric Research does is provide the resources that allow that type of research to occur.

And you're an independent agency, is that right?

We are federally funded, but we're managed by UCAR, the University Corporation for Atmospheric Research, which is the consortium of universities; that's our governing body.

So what's happening with climate? Thank you for describing climate versus weather; it's something we see all the time, the left and the right arguing about it. What are the facts? What's really happening with climate?

The facts are a little difficult to find sometimes. That's especially true in a science where you can't run an experiment three times, get the same answer, and say, okay, I've proven it. Climate is a little more difficult than that. One of the groups I work with, and love to work with, does paleoclimate: they've gone back and they're running those thousand-year simulations across the paleo era, looking at data from core samples that show what actually occurred. They're asking whether the models we run to look at the future were able to reproduce what actually happened in that time period. So they're comparing what the models say can occur against what really happened. And of course, depending on how you run it, you can get different answers. I believe at this point the majority of scientists believe we're in a period of change; things are changing rapidly. If you look at the news any day, you see the ice sheets are melting, and then you get a warmer day one day and a colder day the next. So mostly this is trying to figure out why the change is occurring, whether it's normal change, and coming to some consensus as to what's happening.
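To make that hunt for cyclic signals and year-to-year correlations concrete, here is a minimal sketch, not anything NCAR actually runs, of how an analyst might probe a thousand-year annual series for a buried cycle using lagged autocorrelation. The series is synthetic and every value in it is an illustrative assumption.

```python
# Minimal sketch: probe a synthetic thousand-year annual series for a
# buried cycle via lagged autocorrelation. Illustrative only; not NCAR code.
import numpy as np

rng = np.random.default_rng(0)
years = 1000

# Stand-in for a model's annual-mean output: a slow 60-year cycle plus noise.
t = np.arange(years)
series = 0.5 * np.sin(2 * np.pi * t / 60) + rng.normal(0.0, 0.3, years)

def autocorr(x: np.ndarray, lag: int) -> float:
    """Correlation between the series and itself shifted by `lag` years."""
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# A peak near lag 60 flags the cycle hidden in the noise.
for lag in (1, 30, 60, 120):
    print(f"lag {lag:>3} years: r = {autocorr(series, lag):+.2f}")
```

Real paleoclimate analysis is far more involved, but the shape of the question, does the signal repeat and on what period, is the same.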
I saw Mount St. Helens blow up about 30 years ago, and when Mother Nature decides to do something, or rips Africa from South America, it's pretty significant. A question on what you guys do: is it the data sets? Is it the computing horsepower? Is it the models? When you say you provide solutions for the universities to test things, what exactly do you provide, and where do you get it?

We do a little bit of all of it, actually. On the scientific side, NCAR is built of multiple labs, five or six of them, and all but one do the science. There are two groups I work with closely on the models: one produces that paleoclimate and future climate data, and another runs a weather model that does hurricane forecasting. So we have the scientists there generating the models they run. The lab I work in is the Computational and Information Systems Laboratory. We produce the resources and we manage them. We have a very large supercomputing center, about 25,000 square feet today, with room to double that capacity; all we have to do is put machinery in it. That's where our flagship systems live today. The current system is an IBM iDataPlex, and that's also where all of the storage systems live. We have a large tape archive, so everything from spinning storage to tape supporting that supercomputer.

And then the universities lease time on it?

They ask through an allocation process, and we have a panel that decides. A researcher will say: this is the science I want to do, this is the research I'm trying to look at, can I have time on the computer?

And what are your data sources? Where does all this data come from?

Most of the data is produced either on our computer in our center or at some of the other national centers. There are several centers that have the large systems, and we bring that data back. So one of the things my group works on is the data transfer protocols into the spinning storage, so that when data we want to have at NCAR is produced elsewhere, we can get it here effectively and efficiently.

How much data are we talking about here?

Currently we have about 33 petabytes in our tape archive. We have about 18 petabytes of spinning storage available, and that's anywhere from 60 to 90 percent used at any given time.

Okay, and any flash in there?

We don't have any flash today; that's actually a technology we're very interested in going forward.

What do you envision using flash for? Metadata, or certain analysis?

We've looked at metadata in the past; right now it's more efficient for us to keep the metadata on the spinning storage with the data itself. The thing we're looking at, with a next-generation machine coming in the next couple of years, is flash as the burst buffer, the term everybody's starting to use now. Our models do a lot of small-file output: because they run through time, they run one time step and output data, run a time step and output data. So having flash close to where they're producing the data, something that can absorb those time-step writes and then let them trickle out to the spinning storage over time, is something we're very interested in looking at.
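The small-file, run-a-step-then-write pattern she describes is easy to picture in code. Below is a toy sketch of that output loop; the model physics is a placeholder, and the BURST_DIR path is a hypothetical flash-backed staging area, since the transcript names no actual paths or software.

```python
# Toy sketch of the time-step output pattern described above: run a step,
# write a small file, repeat. Not NCAR's model code.
import os
import numpy as np

BURST_DIR = "/burst/run001"   # hypothetical flash-backed staging area
os.makedirs(BURST_DIR, exist_ok=True)

state = np.zeros((64, 64))    # placeholder model state

for step in range(100):
    # "Run a time step": placeholder physics.
    state += np.random.default_rng(step).normal(0.0, 1.0, state.shape)

    # "Output data": one small file per step, the many-small-writes
    # pattern that makes a fast burst buffer attractive.
    np.save(os.path.join(BURST_DIR, f"state_{step:06d}.npy"), state)
```

A burst buffer absorbs these frequent small writes at flash speed, then drains them to spinning storage in the background, so the compute never stalls waiting on disk.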
So there's definitely a discussion, we talk about it all the time on theCUBE, around this whole notion of high performance computing and big data coming together and actually finding commercial applications. I wonder if we could talk a little bit about big data meets HPC. What are you seeing there? Where do analytics fit in? How does it affect architectures?

One of the projects I've worked with, and I think a lot of people have heard of it, is the IPCC runs for the Intergovernmental Panel on Climate Change. It's an international panel, and it does these periodic runs every four to five years. The first one I worked with was the IPCC 4 run, and the total amount of data we output from that run was 100 terabytes. That was not difficult to manage; I was able to give them a dedicated storage system. Once produced, that data is made available to the community for about five years, for anybody to come and download for whatever research they're doing. We completed the IPCC 5 run a couple of years ago, and NCAR can't even curate all of the data from that run. We hold about a petabyte to two petabytes of it, and that's not even all of it. So you can see, within a four-year time frame, how much the scale of the same task can jump, and that's a huge challenge. We bring that data in and host it on what we call science gateways. That's how we enable the analytics: the community can come in and say, I want the data from this run, but some of these files have a hundred variables in them and I only want the three I'm interested in. We've coupled that functionality with our computational side, so the group that manages the data can use our computational resources to pull those variables out, package a data set, and deliver it to the customer who asked for it.

So you're obviously a GPFS user.

Yes. I have been for a very long time, actually.

Tried and true file system. You're seeing it now seep into commercial applications, into IBM software-defined storage. What other changes can we expect? Because there's so much data now, so much demand for real time, and huge costs associated with all this stuff. What architectural changes should we expect coming down the pike in the next four to seven years?

Four to seven years? Everything will change in seven years. The challenge is figuring out how you can change with it, I think. One of the things we've already done: we used to have what I always referred to as islands, a very resource-centric way of looking at things versus a data-centric way. In the past, the supercomputer guys did their thing. They produced a lot of data; their job was to have a fast machine that could put out data as quickly as possible, and then it was somebody else's problem. That data moved from resource to resource for each task. What we did with the data center we have now is pull all those resources into a central pool. We're trying to shift to where, as the data is produced, somebody can look at it without having to move it. It's all right there; it can live in one home, and they can get the complete task done.
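The gateway workflow she outlines, pulling three variables out of a file carrying a hundred and packaging them for delivery, maps naturally onto standard tooling. Here is a minimal sketch using xarray; the library choice, the paths, and the variable names are all assumptions, since the transcript names none of them.

```python
# Minimal sketch of gateway-style subsetting: extract a few requested
# variables from a large hosted file and package them for delivery.
# Library, paths, and variable names are hypothetical.
import xarray as xr

SOURCE = "/gateways/ipcc5/run42/output.nc"   # hypothetical hosted dataset
WANTED = ["tas", "pr", "psl"]                # the "three variables" a user asks for

with xr.open_dataset(SOURCE) as ds:
    # Keep only the requested variables (out of perhaps ~100) and write
    # a much smaller file to hand to the requester.
    ds[WANTED].to_netcdf("/scratch/deliver/subset.nc")
```

The payoff is that the subsetting runs next to the data on the center's compute, and only the small packaged result travels to the user.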
And it's what we've referred to in some of our talks lately as an information-centric model: getting the user to move what they're doing to where the data is, because we can't move the data any longer. What I want to see in the future is a tighter coupling of that, and hopefully analytics that can occur during computation. Right now it's still a very separate step, and still different groups of people: you do the compute, they produce the data, and then somebody else looks at it. I'd really like to see us provide an architecture where that can all happen in stream. I think flash plays into that: if we can keep some of the data in memory and do post-processing or analysis while it's still there, before it ever goes to spinning storage, we can speed up that workflow.

Conceptually it sounds very Hadoop-like. Right, ship the code to the data, not the data to the code.

Exactly.

Using Hadoop?

No, we aren't using Hadoop. The way our codes are structured doesn't really work well with that at this point. But we do have an effort underway to look at the codes and see if we can shift them to use some of those newer technologies.

Conceptually it's a similar philosophy.

Yes, very much.

Ship five megabytes of code to petabytes of data, not the reverse.

Exactly.

And so it sounds like there's still a lot of batch activity going on.

Yes.

And you're trying to make it more real-time. I see parallels with the whole Hadoop meme, big data and analytics.

Yeah, exactly. Our challenge is that the climate codes are very, very large, and it's not one piece of code; it's more like six pieces of code that have to talk to each other.

So very fragmented.

Very. The challenge right now for the software developers is figuring out how to use some of the newer techniques. One of my challenges, because I am a GPFS person, is watching the bad things they do to my file systems and my storage, and finding the coding techniques I can pass on to them: here's a better way to do this, and what you're trying to do will run more efficiently and faster if you bring in some of the parallel I/O techniques that GPFS especially likes.

So Pamela, a lot of the data you work with is produced by your own models. What about the inbound side? The other big trend we talk a lot about on theCUBE is the internet of things, the industrial internet, all these sensors out all over the place. I would imagine in climate research that's a great opportunity to bring in lots of new data you maybe didn't have access to before, or not as easily. Are you taking advantage of that? Is it growing the input side of your processes, or is it still not so significant?

To some extent. We are doing more of what we refer to as observational data: instruments go up in airplanes and collect data, and we're bringing more of that in. But that's really more the weather side of things than the climate side.
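As one concrete example of the "parallel I/O techniques that GPFS especially likes," here is a hedged sketch using mpi4py's MPI-IO bindings, an assumption on my part, since the transcript doesn't name the APIs involved. Each rank writes one large, contiguous, non-overlapping slice of a shared file in a single collective call, rather than scattering many small independent writes.

```python
# Hedged sketch of collective parallel I/O: every MPI rank writes one
# large contiguous slice of a shared file in a single coordinated call.
# Run with e.g.: mpiexec -n 4 python parallel_write.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank owns a contiguous, non-overlapping chunk of the output.
chunk = np.full(1_000_000, rank, dtype=np.float64)
offset = rank * chunk.nbytes

fh = MPI.File.Open(comm, "shared_output.bin",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
fh.Write_at_all(offset, chunk)   # collective write: one big aligned I/O per rank
fh.Close()
```

A parallel file system like GPFS can stripe a write like this across many disks at once; the same bytes issued as thousands of tiny uncoordinated writes would serialize on locking and metadata instead.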
So we're predominantly working with data we produce ourselves. But as I referred to earlier, we are looking at the data transfer techniques, and at making sure we have the bandwidth and the tools to handle inbound data. The systems are expensive these days, and a system large enough to do all the work we need costs more than any one site can afford, so the work gets done in more locations. More and more, the challenge is: how do I bring a 10-terabyte or 20-terabyte data set produced at another site back here? Because it's the researchers where we are who need that data.

You know, in the commercial world, Pamela, the focus of the storage people is really: don't lose the data, don't touch my SAN or I'll kill you, right? Make performance consistent; it doesn't have to be the best, but don't lose the data. How do your requirements differ from the commercial world?

We very much have a don't-lose-the-data requirement, and we've just been through an incident where it was please don't lose the data. But we do push performance, so the real challenge for my team is: how do I get the maximum performance, because that's where my money has gone, and still have the stability? One of the things I absolutely love about GPFS is that you can push the performance and remain stable. We've used that software for at least 15 years at this point, and it's been a real highlight. We do not back up a lot of our data. Unlike the commercial world, where the center itself takes that responsibility, dual copies, three copies, tape, our tape archive is there for our users, and it's their responsibility to make copies. A lot of the data could also be reproduced, which I think is different from the business world: because it's model data, you could rerun the model and produce it again. That's expensive to do, so it's a balancing act: how much tape archive do we provide so that the really critical data can be kept, versus saying, maybe it's more efficient for you to reproduce it?

So we're seeing some of the tenets of high performance computing seep into the enterprise: certainly scale-out and low cost, you guys have used commodity components for years and years, and certainly GPFS. How do you back up a petabyte?

You don't.

You don't. So a lot of learnings here. Pamela, thanks very much for coming on theCUBE. It was really a pleasure having you, and good luck with the project.

Thank you very much.

Keep it right there, everybody. We'll be right back with our next guest. We're live from Las Vegas, we're at IBM Edge, and this is theCUBE. We'll be right back.
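As a closing illustration of the transfer challenge Gillman describes, the arithmetic for moving a 10 or 20 terabyte data set between centers is worth a glance. The link speeds and the 80 percent efficiency figure below are illustrative assumptions, not measurements from NCAR.

```python
# Back-of-the-envelope transfer times for multi-terabyte data sets.
# Link speeds and efficiency are illustrative assumptions.
TB = 1e12  # bytes

def transfer_hours(size_bytes: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Hours to move size_bytes over a link_gbps link at the given efficiency."""
    bytes_per_sec = link_gbps * 1e9 / 8 * efficiency
    return size_bytes / bytes_per_sec / 3600

for size_tb in (10, 20):
    for gbps in (1, 10):
        hours = transfer_hours(size_tb * TB, gbps)
        print(f"{size_tb} TB over {gbps} Gb/s: ~{hours:.1f} hours")
```

On a well-used 10 Gb/s link, 20 TB takes roughly six hours; at 1 Gb/s it stretches past two days, which is why the transfer protocols and bandwidth she mentions are a first-class concern.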