 Pleasure to meet you, this is my co-host, Joe Furrier. Welcome to theCUBE. Thank you. We're live on day two in New York City for Hadoop World 2011. We are on the ground. This is our flagship telecast, theCUBE, where we go out and talk to the smartest people in the room who can find and extract the signal from the noise and share that with you. We find the top stories. The entire team is here. The bloggers are here and news is here. We're getting all those stories, Dave, and covering the trends. Obviously, big data and Hadoop, the funding, and we're here to predict the future. So welcome to theCUBE. Jacob Rapp from Cisco Systems, Manager of Technical Marketing, right? Yeah. So welcome. We're going to talk about the intersection of networking and big data. But there was news the other day, right? Didn't Juniper bring down the internet or something? BGP table dump, core dump, and crashed the internet. You've never seen that at Cisco. Sorry, Juniper. Little tongue-in-cheek. Tongue-in-cheek. So Cisco, obviously a networking vendor. This whole monitoring and performance stuff has been around on the network side for many, many years. So data is used a lot in networking for a lot of reasons: configuration management, automation. So you're no stranger to it, but why Hadoop now? I mean, log analysis and all that stuff. What are you guys doing with Hadoop? Share with us Cisco's view of this. Sure, so we at Cisco, I mean, traditionally have been a really large player in big data with our networking for a while, with traditional big data in the Web 2.0 industries. And what we're seeing now is really an influx of enterprise customers coming to us and asking, how do I integrate this into my enterprise system? So it's moving from the Web 2.0 companies to the enterprises now, and what do I care about? So how do I architect my network? What's really behind the scenes in Hadoop? So I mean, there's a lot of material out there saying, okay, Hadoop's this, Hadoop's the other thing for the network. 
But what we really wanted to do is run a series of benchmarks. Because before we make any recommendations at Cisco, we want to do a little bit of due diligence beforehand. And we built this 128-node cluster with about a petabyte worth of data. I think as Claudia mentioned on the first day, the average cluster size is around 120 nodes. So it's a pretty good representation, I think, of what we're seeing out in the enterprise today. And we ran all these benchmarks to come up with what really is happening on the network. If you're running different types of jobs, if you're using different types of compute nodes, what's happening? It's to really provide some value to the ecosystem with some real data. So how are the network requirements in Hadoop and big data different? So I think a goal out of this was to really demystify it, make it less scary, right? Because it's all about integration into the enterprise and how we can integrate really well with the current infrastructures. So there are some things in Hadoop, and it's largely dependent on the data models, whether you're running an ETL-like job or a business intelligence job, what effects you see on the network. One thing that is crucial is your reliability and redundancy mechanisms. Because if you have a rack go out, if you lose a top-of-rack switch or wherever you're connecting into, and say you have 16 servers or 32 servers in that rack, and each server has eight to 24 terabytes worth of data, that's a lot of data to lose. And if you lose that rack, HDFS is redundant, so rebalancing will happen. So all of a sudden all those 16 or 32 servers times eight terabytes or 24 terabytes worth of data need to replicate throughout the network. So this causes a lot of issues. So I think reliability and redundancy are really key pieces in the puzzle. 
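As a rough illustration of the rack-failure arithmetic described above, here is a back-of-envelope sketch. The server counts and per-server capacities are the figures quoted in the interview; the assumption that every disk is full, the 10 Gbps aggregate recovery bandwidth, and the function names are hypothetical and chosen only for illustration. It is not a model of how HDFS actually schedules recovery.

```python
# Back-of-envelope sketch: how much data has to be copied again after a rack is lost.
# Server counts and capacities come from the interview; the 10 Gbps figure is an
# assumed aggregate bandwidth, and full disks are assumed (an upper bound).

def rack_data_tb(servers_per_rack: int, tb_per_server: float) -> float:
    """Data held on one rack, assuming every server is full."""
    return servers_per_rack * tb_per_server


def transfer_hours(data_tb: float, usable_gbps: float) -> float:
    """Hours needed to move data_tb terabytes at usable_gbps gigabits per second."""
    bits = data_tb * 1e12 * 8          # decimal terabytes to bits
    seconds = bits / (usable_gbps * 1e9)
    return seconds / 3600


for servers in (16, 32):
    for tb in (8, 24):
        data = rack_data_tb(servers, tb)
        hours = transfer_hours(data, usable_gbps=10)
        print(f"{servers} servers x {tb} TB = {data} TB, about {hours:.1f} h at 10 Gbps")
```

Even the smallest case (16 servers at 8 TB each) is 128 TB to copy, which is why losing a single top-of-rack switch is painful even though no data is permanently lost.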
So you're saying that the amount of data that you have to accommodate just makes it even more important than in the traditional enterprise to have higher availability? Yeah, absolutely. Because I think HDFS was built with redundancy in mind, but if you lose something like a network component, I mean, you know, we'll recover, but it's gonna be painful. Yeah, it's gonna take a long, long time. Yeah, another thing that we found out of our study, and actually our study's published online now on our website, we did a nice comprehensive white paper with all the results, is buffering. So depending on the workload, take for example an ETL workload, extract, transform, load, you're just transforming a lot of data. You start out with, say, a terabyte worth of data, or 10 terabytes worth of data, and you end up with 10 terabytes worth of data at the end. So with MapReduce, how MapReduce works, in the middle of that, I'm gonna shuffle that data across the network. So I'm taking that one, five, 10 terabytes and shuffling it across the network. And then at the very end, if I'm replicating it for redundancy, that one, five, or 10 turns into two, 10, or 20 terabytes as you're replicating across the network. So these can be really bursty traffic patterns for the top-of-rack switches. And so, I mean, optimized buffering is what you really need. I mean, buffering has been a pretty big debate in the industry: how much do I need, do I need a gig, two gigs, do I need nine megs, do I need two megs? So I think the position we've taken is really optimized buffering. And then your recent announcement, didn't you have a Nexus 3000 announcement that had some expanded buffering that fits, right? Yeah, so there's two platforms actually that we just came out with in a recent launch, the 3048, which is the new 3K, and the 2248TP-E. 
So whether our customers want an architecture with a top-of-rack switch like a 3K, or a fabric extender architecture to really optimize on cost and management, we have both platforms now, really optimized with big data in mind. You know, we talk to a lot of people on theCUBE and in the industry around big data, we've been covering it for two years, like a kid playing, and it's fun. But there's always a conversation around, oh, commodity hardware, and that's the story, right? I didn't get you in. But with the cloud movement, compute really isn't the problem. And the constant thing that we're hearing, and I'd love to get your perspective on this, is it's not so much the compute, and there are some storage issues with Hadoop that kind of go away, the biggest problem everyone's facing is moving the traffic on the network, right? So this is what you guys do, right? This is what it's all about. So that's a big focus here. So there's tons of compute, storage works, people are fine with MapReduce and HDFS, but really the issue is the network. What are you guys seeing, and obviously it's the first step with Hadoop, and what's the vision, and how are you guys tackling that, obviously, you know, that's the bottleneck. Well, I think, depending on the workload, I've actually had customers that come to us and say, well, I've had zero issues on the network. It deployed with no problem whatsoever. And then there are some that are like, I'm moving a ton of data because I'm doing this type of workload versus another type of workload. So if you do a business analytics type workload, where I start out with a ton of data and try to find some analytics out of it, I may start out with 10 terabytes, but end up with a few megs that I have to shuffle. So it's really dependent on the workload what we're gonna see. 
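The contrast between an ETL-style job and a business analytics job can be sketched with a few lines of arithmetic. This is a minimal sketch under stated assumptions, not Cisco's benchmark methodology: the `Job` class, the ratio parameters, and the specific analytics ratios are hypothetical. The 10 TB input comes from the interview, and `extra_copies=1` reproduces the speaker's "10 turns into 20" figure (HDFS's default replication factor of 3 would mean two extra copies).

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of a MapReduce job, for illustration only."""
    name: str
    input_tb: float
    shuffle_ratio: float  # shuffled data as a fraction of the input
    output_ratio: float   # final output as a fraction of the input


def network_traffic_tb(job: Job, extra_copies: int = 1) -> dict:
    """Rough network volume: the shuffle phase plus replication of the output."""
    shuffle = job.input_tb * job.shuffle_ratio
    replication = job.input_tb * job.output_ratio * extra_copies
    return {"shuffle_tb": shuffle, "replication_tb": replication}


# ETL-like job: the data is transformed, so nearly all of it is shuffled and written.
etl = Job("etl", input_tb=10, shuffle_ratio=1.0, output_ratio=1.0)

# Analytics-like job: 10 TB in, only a few megabytes shuffled (assumed ~5 MB here).
analytics = Job("analytics", input_tb=10, shuffle_ratio=5e-7, output_ratio=5e-7)

for job in (etl, analytics):
    print(job.name, network_traffic_tb(job))
```

With the same 10 TB input, the ETL-like job pushes terabytes through the top-of-rack switches in bursts, while the analytics-like job barely touches the network, which is consistent with some customers reporting no network issues at all.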
But a lot of what we're seeing is that they want this to be integrated into their current IT infrastructure, because there are a lot of customers out there with proofs of concept, demo labs, science projects right now, where there's a Hadoop cluster kind of outside of the normal IT infrastructure. So the key is really integrating it back into the IT infrastructure, because they've found tremendous value in it, and now they're like, let's integrate this thing, and they want to make sure that their network is resilient enough for it. So I think the good news is in most cases, it's not an issue because the size of the cluster isn't that big right now, except for those spot data jobs where we gotta move it around. Clusters are gonna grow, absolutely. So I mean, we've deployed clusters of thousands of nodes out there on the network. And it's just solid design principles, right? So a lot of the design principles have been kind of evolving as we go. So for example, the new FEXes that we're coming out with, or the new 3Ks that are coming out with this optimized buffering, more visualization into what's happening, I mean, it's all providing more intelligence to the operators, more visibility, where we can go monitor buffering, we can monitor different statistics. And I think it's more important, like I said, when clusters grow, and then you get a bunch of different business entities that want to use that cluster all at the same time. What if this business entity is doing something really strange that is affecting the network? That multi-tenancy is an issue. So providing the visibility and the configuration options to actually change the way they are doing buffering or QoS, or those types of things, will become more important as these clusters grow and multiple users are using them. Well, it's great to see you guys here, Jacob Rapp from Cisco, you guys have a table, we see you guys over there. 
What's some of the feedback you're hearing at the show? What's your observation of this community and the demand and just overall uptake? Honestly, Cisco's a big player, they're not a startup, so obviously, you see Hadoop, it's validation from Cisco's standpoint. You talked about some of the network concerns you guys are on top of. That being said, what's your view about this whole ecosystem as it's developing, and Hadoop in general? So I think this conference was a really great conference and there's a really good technical core audience here. So I mean, there are some larger conferences out there that are maybe not big data, there are big data themes in them, but the quality of people at this conference, I think, has been really good, and we've had some really good questions and comments on how should we do this, how should we architect. The sessions are really well attended. I mean, the people are actually really there. Yeah, they're there, and the questions that they're asking are the deeper-level questions, and you're really happy to be getting that feedback from them. So I think it was a great conference to be at. Yeah, I mean, some conferences you go to, it's just like, oh man, this is like a pay-all situation. It's like the value's not always there, but it's been a passionate group of people and it's really exciting to see you guys out here, a leader in networking. Thanks for coming on theCUBE. We really appreciate it. Thank you. Great to see you guys here and thanks for your time. Nice to meet you, Jacob. Appreciate it. Thank you.