We're just going to launch our own distribution for Hadoop. Come see us at SiliconANGLE.com, where you can find the link and download it. Thank you for coming on the field. Thank you very much for having me. Thank you. Appreciate it. Great to meet you. Thank you. Sorry for the scheduling. Snapoo. Great to have you.
Dave, great conversation. I love the conversation about the database. The point about RDBMS misuse and design is really right on the money. Stefan is super smart about that. I really appreciate that comment. Good insights. Great insights.
Are you Jacob? Yes. Hi. Pleasure to meet you. Pleasure to meet you.
Welcome to theCUBE. We're live on day two in New York City for Hadoop World 2011. We are on the ground. This is our flagship telecast, theCUBE, where we go out and talk to the smartest people in the room, who can find and extract the signal from the noise, and share that with you. We find the top stories. The entire team is here. The bloggers are here. The news is here. We're getting all those stories, Dave, and covering the trends. Obviously, big data and Hadoop, the funding, and we're here to predict the future. So welcome to theCUBE. Jacob Rapp from Cisco Systems, Manager of Technical Marketing. Right? Yes. So welcome.
We're going to talk about the intersection of networking and big data, but there was news the other day, right? Didn't Juniper bring down the internet or something like that? BGP table dump, core dump, and crash the internet. You've never seen that at Cisco. Sorry, Juniper. A little tongue-in-cheek. Tongue-in-cheek.
Now, Cisco is obviously a networking vendor, and this whole monitoring and performance stuff has been around on the network side for many, many years. Data is used a lot in networking for a lot of reasons: configuration management, automation. So you're no stranger to it, but why Hadoop now? I mean, log analysis and all that stuff. What are you guys doing with Hadoop? Share with us Cisco's view of this.
Sure.
So we at Cisco have traditionally been a really large player in big data with our networking for a while, with the traditional big data customers in the Web 2.0 industries. What we're seeing now is really an influx of enterprise customers coming to us and asking, how do I integrate this into my enterprise system? So it's moving from Web 2.0 to the enterprises now, and they ask: what do I care about? How do I architect my network? What's really behind the scenes in Hadoop? There's a lot of material out there saying Hadoop is this, Hadoop is the other thing for the network. But what we really wanted to do is run a series of benchmarks, because before we make any recommendations at Cisco, we want to do a little bit of due diligence beforehand. So we built this 128-node cluster with about a petabyte worth of data. I think, as Cloudera mentioned on the first day, the average cluster size is around 120 nodes. So it's a pretty good representation, I think, of what we're seeing out in the enterprise today. And we ran all these benchmarks to come up with what is really happening on the network. If you're running different types of jobs, if you're using different types of compute nodes, what's happening? The point is to provide some value to the ecosystem with some real data.
So how are the network requirements in Hadoop and big data different?
So I think a goal out of this was to really demystify it, make it less scary. Right? It's all about integration into the enterprise and how we can integrate really well with the current infrastructures. So there are some things in Hadoop that affect the network, and it's largely dependent on the data models, whether you're running an ETL-like job or a business intelligence job. One thing that is crucial is your reliability and redundancy mechanisms.
Because if you have a rack go out, if you lose a top-of-rack switch or whatever you're connecting into, and say you have 16 servers or 32 servers in that rack, and each server has 8 to 24 terabytes worth of data, that's a lot of data to lose. And if you lose that rack, HDFS is redundant, so rebalancing will happen. So all of a sudden, all those 16 or 32 servers times 8 terabytes or 24 terabytes worth of data need to replicate throughout the network. So this causes a lot of issues. So I think reliability and redundancy are really key pieces in the puzzle.
So you're saying that the amount of data you have to accommodate makes it even more important than in the traditional enterprise to have higher availability?
Yeah, absolutely. Because I think HDFS was built with redundancy in mind, but if you lose something like a network component, I mean, you know, we'll recover, but it's going to be painful.
It's been a long, long time.
Another thing that we found out of our study is buffering. Our study is actually published online now on our website; we did a nice comprehensive white paper with all the results. So depending on the workload, take for example an ETL workload, extract, transform, load: you're just transforming a lot of data. You start out with, say, a terabyte worth of data, or 10 terabytes worth of data, and you end up with 10 terabytes worth of data at the end. So with how MapReduce works, in the middle of that, I'm going to shuffle that data across the network. So I'm taking that one, five, or 10 terabytes, shuffling it across the network, and then at the very end, if I'm replicating it for redundancy, that one, five, or ten turns into two, ten, or 20 terabytes as you're replicating out through the network. So these can be really bursty traffic patterns for the top-of-rack switches, and so optimized buffering is what you really need.
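To put rough numbers on the two scenarios described above, here is a minimal back-of-the-envelope sketch in Python. It is an illustration only, not taken from the Cisco study: the function names are hypothetical, the 10 Gbps aggregate link speed is an assumption that does not appear in the interview, every disk is assumed to be full, and protocol overhead is ignored. The ETL model (one shuffle pass plus one pass per extra replica) is also an assumption, chosen because a single extra replica reproduces the speaker's "one, five, or ten turns into two, ten, or 20" figures; HDFS's default replication factor of 3 would correspond to extra_replicas=2.

```python
def rack_rereplication_tb(servers_per_rack, tb_per_server):
    """Upper bound on the data HDFS must re-replicate when a whole rack is lost.

    Assumes every server in the rack was completely full.
    """
    return servers_per_rack * tb_per_server


def transfer_hours(data_tb, link_gbps):
    """Idealized hours needed to move data_tb terabytes over link_gbps.

    Uses decimal units (1 TB = 1e12 bytes, 1 Gbps = 1e9 bits/s) and
    ignores protocol overhead and competing traffic.
    """
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 3600


def etl_network_tb(input_tb, extra_replicas=1):
    """Approximate network volume of an ETL-style job whose output is
    about the same size as its input (assumed model, see lead-in).

    The shuffle moves the data across the network once; each extra
    replica of the output moves it across the network once more.
    """
    shuffle_tb = input_tb
    replication_tb = input_tb * extra_replicas
    return shuffle_tb + replication_tb


if __name__ == "__main__":
    # Rack sizes and per-server capacities quoted in the interview.
    for servers, tb in [(16, 8), (32, 24)]:
        lost = rack_rereplication_tb(servers, tb)
        # 10 Gbps is an assumed link speed, used only for illustration.
        hours = transfer_hours(lost, link_gbps=10)
        print(f"{servers} servers x {tb} TB = {lost} TB, "
              f"about {hours:.1f} h at 10 Gbps")

    # ETL input sizes quoted in the interview.
    for size in (1, 5, 10):
        print(f"ETL on {size} TB moves about {etl_network_tb(size)} TB")
```

Under these assumptions, losing a small rack (16 servers at 8 terabytes each) means up to 128 terabytes to re-replicate, and a large one (32 servers at 24 terabytes each) up to 768 terabytes, which is why the loss of a single top-of-rack switch is described as painful even though HDFS recovers.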
I mean, buffering has been a pretty big debate in the industry: how much do I need? Do I need a gig, two gigs? Do I need nine megs? Do I need two megs? So the position I think we've taken is really optimized buffering.
Yeah, you had your recent announcement. Didn't you have a Nexus 3000 announcement that had some expanded buffering that fits, right?
Yeah, there are two platforms, actually, that we just came out with in a recent launch: the 3048, which is the new 3K, and the 2248TP-E. So whether our customers want an architecture with a top-of-rack switch, like a 3K, or a fabric extender architecture to really optimize on cost and management, we now have both platforms, really optimized with big data in mind.
You know, we talk to a lot of people on theCUBE and in the industry around big data. We've been covering it for two years, like a blanket, and it's fun. But there's always a conversation around commodity hardware, and that's the story, right? But with the cloud movement, compute really isn't the problem. And the constant thing that we're hearing, and I'd love to get your perspective on this, is that it's not so much the compute, and there are some storage issues with Hadoop that kind of go away. The biggest problem everyone's facing is moving the traffic on the network, right? So this is what you guys do, right? This is what it's all about, so that's a big focus here. So there's tons of compute, storage works with MapReduce and HDFS, but really the issue is the network. What are you guys seeing? Obviously, it's the first step with Hadoop. What's the vision, and how are you guys tackling that? Obviously, that's the bottleneck.
Well, I think it depends on the workload. We actually have had customers that come to us and say, well, I've had zero issues on the network, it's deployed with no problem whatsoever, and then some that are like, I'm moving a ton of data because I'm doing this type of workload versus another type of workload. Because if you do a business analytics type workload, where I start out with a ton of data and try to find some analytics out of it, I may start out with 10 terabytes but end up with a few megs that I have to shuffle. So what we're going to see is really dependent on the workload. But a lot of what we're seeing is that they want this to be integrated into their current IT infrastructure, because there are a lot of customers out there with proofs of concept, demo labs, science projects right now, where there's a Hadoop cluster kind of outside of the normal IT infrastructure. So the key is really integrating it back into the IT infrastructure, because they've found tremendous value in it, and now they're like, let's integrate this thing, and they want to make sure that their network is resilient enough for it. So I think the good news is that in most cases it's not an issue, because the clusters aren't that big right now, except for those spot data jobs where there's data moving around.
Clusters are going to grow.
Absolutely. So I mean, we've deployed clusters of thousands of nodes out there on the network. And it's just solid design principles, right? So a lot of the design principles have been kind of evolving as we go. So for example, the new FEXes that we're coming out with, or the new 3Ks, are coming out with this optimized buffering and more visibility into what's happening. Because, I mean, it's all about providing more intelligence to the operators, more visibility, where we can go monitor buffering, we can monitor different statistics.
And I think it's more important, like I just said, once clusters grow and then you get a bunch of different business entities that want to use that cluster all at the same time. What if one business entity is doing something really strange that is affecting the network? That's where multi-tenancy is an issue. So providing the visibility and the configuration options to actually change the way you're doing buffering or QoS, those types of things will become more important as these clusters grow and multiple users are using them.
Well, it's great to see you guys here. Jacob Rapp from Cisco, you guys have a table, we see you guys over there. What's some of the feedback you're hearing at the show? What's your observation of this community and the demand and just the overall uptake? Honestly, Cisco's a big player, they're not a startup, and obviously you see Hadoop, so it's validation from Cisco's standpoint. You talked about some of the network concerns you guys are on top of. That being said, what's your view of this whole ecosystem as it's developing, and of Hadoop in general?
So I think this was a really great conference, and there's a really good technical core audience here. There are some larger conferences out there that are maybe not big data conferences, where there are big data themes in them, but the quality of people at this conference, I think, has been really good. And we've had some really good questions and comments: how should we do this, how should we architect it?
The sessions are really well attended. I mean, the people are actually there to listen and come in.
Yeah, they're there, and the questions that they're asking are the deeper-level questions that you're really happy to get as feedback. So I think it was a great conference to be at.
Yeah, I mean, some conferences you go to, it's just like, oh man, it's like a payola situation, and the value is not always there. But this has been a passionate group of people, and it's really exciting to see you guys; obviously you're a leader in networking. Thanks for coming on theCUBE, we really appreciate it.
Thank you.
Great to see you guys here, and thanks for your time. Nice to meet you, Jacob.
Appreciate it, thank you.
All right.