Hi everyone, thanks for coming and welcome to the last session. We've got the big room, that's for sure. Part of that is because we are also serving this to the online group this afternoon, so I want to welcome them, and some of them are filling some of the seats here. So my name's Vince Stoffer, I'm a cyber security engineer at Lawrence Berkeley National Lab. The idea for this talk came from a couple of things: there were a lot of themes going through our mailing list and that sort of stuff about people scaling up their network monitoring, you know, either going from one to ten gig or ten gig to forty gig or to a hundred gig or whatever. We were going through the same sort of thing, so we went through a bunch of evaluations of some products and that sort of stuff, and I do most of the tapping and that sort of work in my group. I want to talk about that, just share experiences, and hopefully get some feedback from you guys on what you're doing. I've already had some discussions here at the conference about what people are using to do this sort of thing, scaling up their monitoring. There's a great scene from UHF about drinking from the fire hose; I was going to try to play the video or something, but it didn't work out. It's an awesome scene, if you've never seen it. That's right, I've got to turn this on. So I'll just give a quick overview. I've got a whole bunch of slides because this is a huge topic, right? I didn't really focus in on one thing. I kept the overview pretty general, so I could end up focusing on whatever was appropriate and whatever I actually got through in terms of my evaluations and stuff by the time we got to the conference. So I'll talk about a whole bunch of stuff, a whole bunch of slides, and I'll get through whatever I can.
I may skip some towards the end as we start running out of time, and I'm going to try to leave some time at the end for discussion and questions, because I definitely want to hear what you guys are doing too. So, quick agenda: we'll go over the problem, the monitoring pipeline, talk about some of the devices, especially the ones that I was just doing evaluations of, then output analysis, and then discussion. So the lab itself: we're located in Berkeley. Our motto is bringing science solutions to the world. We do big science research. We're a completely unclassified DOE research facility, unlike most of the DOE research facilities you might think of. So we don't do any weapons, anything like that. We do really great research into big science, especially stuff that ends up as commercial products or published research. So almost everything that comes out of the lab gets published or disseminated in some way. We're not really worried about keeping things closed; we'd like to have things open. And in that sense, we function a lot like a research university, which many of you come from. So to picture where we are: Berkeley is in the Bay Area, in the East Bay right across from San Francisco. Here's a picture of our campus, which sits right above the University of California, Berkeley. We're just intrinsically tied with them. We have all sorts of faculty members who are researchers at the lab, and vice versa; we have grad students, and a constant flow of people between the UC Berkeley campus and our campus doing research. So we're really tied at the hip with those guys. A computing overview, just to give you a sense of size and that sort of thing: we're about 5,000 users, 10,000 hosts on our network at any given time. Very distributed computing resources. So while we have a centralized security group and a lot of IT infrastructure that's centralized, if you're a researcher, you have your own lab, you have a group, you get grants, all of that stuff.
If you guys come from research universities, you're familiar with how that works. If you're big enough, you may have your own support team. So, lots of distributed stuff. That makes for an interesting cybersecurity environment, because we also have tons of guests and visitors coming all the time. We may have people who show up for a day or a couple of days to use one of our facilities, and they're there to do their work and, often, transfer their data back to their home institution. And a closed network is not acceptable for that sort of research to take place. So because of that, we have a very open network, like many of you are familiar with. You know, if I go and talk to the business world or something, it's a whole different thing just to describe: hey, we don't have a border firewall. We have an open network. And you're able to SSH into a workstation from off-site or whatever. So you guys understand that. Just to underscore it: that's what an open network means, and that's why this problem becomes even greater for us. For network monitoring, we have to be able to scale up to the volumes of traffic that we need to perform research and to pass scientific data back and forth between our institutions, and still be able to monitor it. So what's the problem, right? The problem is that these order-of-magnitude changes in network speeds mean that we've still got to monitor the same things we were monitoring before they did one of these massive upgrades, right? It's not just the network that gets upgraded. It's our need to be able to see the traffic for cybersecurity and act on incidents and events and everything else that we always need to. So what's driving the explosion of data? For us, you know, it's this explosion of data from both the scientific and commercial world. The Science DMZ is something that I'm going to talk about a little later, but that's new to us.
So having a super high-speed network connected kind of on the edge of our network that needs to be able to pass full bandwidth, at potentially up to 100 gig, that's new, and we had a network redesign. That changed our monitoring needs and kind of what we had to do. So here are a couple of slides that I borrowed from ESnet, which is the Energy Sciences Network, our ISP, and an ISP that a lot of you guys share for science data. They're also located at the lab. There's a linear graph of science data transferred each month, excuse me, by ESnet. Yeah, you know, but it's a linear scale, so it's a little deceptive. Here's the nonlinear scale of how things have changed just in the last couple of years, and you can see things are going crazy. It's spiking to the point where petabytes of data are being transferred over these networks every month, every day, whatever. So what's driving all of that? For us, it's the ability to transfer data in just huge, massive amounts back and forth between facilities. Before, this wasn't even able to happen over the network, right? Some of this had been happening by FedEx-ing hard disks. That was the fastest way to move science data in bulk for a long, long time. But now the networks are finally catching up to the point where we can actually carry some of this stuff. We can take something that a scientist might be capturing at one of our facilities, like the ALS, the Advanced Light Source, and they want to be able to pass that off to a supercomputing center, maybe NERSC, which is associated with our lab, or one of the many supercomputing institutions around the country that maybe some of you run. So being able to get that data over to somewhere else, do computation, and send it back in real time is kind of becoming a possibility now thanks to high-speed networks. So that's part of what's driving this growth, right?
So, again, this kind of general forcing factor is not specific to us. You know, big data in all its forms, mobility, all of those things, just data everywhere, right, being passed over and around and about for all of us. And the research networks play a part in that, for sure. And also, just outgrowing the capacity of network hardware forces a network upgrade, which then forces us to change our monitoring. So all that stuff changes and creates some of these transitions, right? So even if it's just going from 10 meg or a little more up to 100, that's a paradigm shift for you. If you're operating at 10 meg and you have to go to 100, that means a big change for you, right? So we'll talk a little bit about some of these categories as we go through. But, again, all of these changes mean changing more than the network equipment, right? We've got our taps. We've got to figure out some way to distribute or aggregate all those taps into something. We've got to filter it. We've got to send it out to our analysis or IDS tools or packet capture or whatever you're doing on the output side. And we've got to be able to do all of that at the speeds the new network runs at, or else we're going to be blind to at least some of that traffic. Now, maybe we'll do that intentionally, but maybe we want to try to see everything, right? So, a quick slide on tapping. Where and how do we tap? Well, this is a much-debated question, right? We always like to say at the lab that it's critical for us to have both the inside and the outside visibility. We don't have a border firewall, but we do, of course, do blocking at our border routers, and that makes a dynamic firewall that mostly feeds from Bro data for us. And if we're not able to see what's outside attacking us and what's inside still getting through past our protections, we feel like we don't have the full picture.
So a lot of people say, well, there's just too much data on the outside. I can't handle it all. I can't view my firewall logs or whatever. I guess our point is that it's just better to have the data somewhere, so when you need it, it's there, and you can go back to it. Even if that means only saving it for a shorter period of time than you thought you might be able to, having that data is really critical, and you're not going to be able to have that data from a kind of raw monitoring perspective on the outside unless you've got taps both in and out. So we've always focused on trying to have visibility, or monitoring and taps, on both the inside and outside of our network. Putting taps in some key protected network segments is also very important for us and something we've worked towards. We didn't start with that. We thought, this is good enough, or this is all we have money for, for monitoring. But as pieces of critical infrastructure have moved around in the network, the networks have become more complicated. We've got more load balancers and VPNs and people coming in and going out of the network in all sorts of different ways. We need to figure out how to be able to monitor that as well. So it's not just one place in and one place out. We still want to see this stuff and be able to watch traffic that moves internally across our network, to look for compromised hosts and people that are hopping back and forth and all that sort of stuff, right? So, taps versus span ports. Again, one of those things that security folks and network folks like to discuss and disagree about. I have a preference for passive taps just because, you know, there's no loss and no traffic interruption, generally. You're not relying on anyone. In terms of fiber, you're just getting a little copy of the light that's going through the fiber, and no one should know, right?
Of course there are problems, but usually it just works and you don't have to worry about much. But in our case, like I'm going to show you, we need a whole lot of them, in some cases dozens and dozens of those taps. So that costs a lot of money. If you can consolidate networks and do it with span ports, you know, that's great too. Span ports give you a lot of advantages: you can do filtering and aggregation right there, and maybe you can use multiple span port outputs depending on what kind of network gear you're using. So that could be the right answer for you. It just depends. I just want to throw those both in there and mention it. And then we get to: how do you deal with all of that traffic, all of those links coming in, right? If it's just a single link that you're tapping, well, great. You don't have to worry about this sort of thing. But most of us have highly complex networks, many different places where traffic is going in and out, many different taps or span ports or things that you want to be able to aggregate to your monitoring tools. So you're going to have to do this with some device. And we're going to go into quite a bit of detail on the ones that I've been looking at for 100 gig. But, you know, one of the traditional options was something like a commercial appliance vendor doing high-performance stuff. They were building custom ASICs or engineering these things themselves: very flexible, but very expensive, right? The new hope, which has really delivered, is these commodity network vendors, right? Either using software-defined networking, OpenFlow-type stuff, or just their own kind of custom code base doing similar tap aggregation functions. And I have a slide where we just talk about all the different things that they call that. Much lower cost per port, and pretty massively scalable. So I think even two, three, four years ago, this was just kind of a pipe dream.
There were maybe some ways to do it, right, with some Cisco gear and some other things, but this is really becoming, I think, the way of the future in terms of network monitoring and in terms of cost. There's just no way to do it other than this sort of thing if you want to save money. So that segues into: okay, what if you're doing less than one gig of monitoring, right? I think there's not too much to worry about here. A single tap maybe, a span port, maybe a single analysis machine. If you're in the less-than-one-gig range, you're probably going to be okay with that. Maybe you need to do some aggregation, some simple filtering, but I don't think you're going to have to get into really complex situations where you're worried about the device and what it's doing for you, right? I mean, it's not really this easy, but if it's a simple, smaller network edge, you know, an office kind of thing, like a remote office, it's that sort of setup, right? So it could be as easy as that to do less than a gig. I don't think it's something that people are really having problems with. As we go beyond a gig, we get into some other problems. But again, less than a gig is a solved problem; hardware is very capable of doing this stuff. You could purchase an appliance to do it or kind of roll your own with some of the traditional networking devices. So, going beyond one gig: here's a picture of the first 10 gig devices that we had at the lab. These were manufactured by Apcon, and they were installed at the lab in about 2007. I don't know what they cost at that time, but my guess is quite a lot of money. They provided pretty good functionality. You can see there are basically three units here, one, two, three, so this takes up like two-thirds of a rack or something. Very large, XFP connectors, but they did the job. You had a little web GUI kind of thing.
You could go in there and select ports and, you know, draw lines and kind of make them do the things you wanted for I/O. So it was there, and it was working in 2007, right? Now, I think one to 10 gig is also pretty much a solved problem. You can get an appliance from a network vendor, or you can try to use one of the more modern approaches with the network gear that we're going to talk about. You probably need a cluster or some sort of purpose-built analysis boxes, so maybe separation of duties: saying, we're going to do some mail analysis here, we're going to do DNS analysis on this box, HTTP over here. Maybe you could do it that way, right? Separate out protocols or traffic that you want to look at, as opposed to doing a cluster. But with some careful tuning and filtering and that sort of stuff, a lot of us are doing this right now at one to 10 gig, and this is not a huge problem, right? It's just kind of: what do you get? What do you buy? How do you do it? So let me talk quickly about our approach for this. We have the cPacket cVus, and this is what we went to after those Apcons. These were, again, commercial, pretty expensive products, but very flexible, very nice to work with. You can get in there and make every port do what you want. Very complex and advanced filtering, that sort of stuff. I'll talk a little bit more about that. So we output, you know, 10 gig. We aggregate a bunch of links together on the input side of the cVus, we do a little bit of filtering, and then we output it either to single servers doing things or to 10 gig clusters. We have, again, some purpose-built analysis boxes that do specific things, so those might get filtered off or shunted from a particular network. And we have an internal Bro cluster that handles all of our internal networks, pulls them together, and does analysis on that. I'm not going to talk much about that. And then Time Machine we use for full packet capture.
We'll try to touch on that at the end. There's a really simplified diagram of what our tapping used to look like. We've got that core router there. We've got a tap on the outside, a tap on the inside. Basically two cVus, kind of one for our internal network and one for our external network. And just to expand upon it, the same thing, right: the last slide is like all of this, and then here's maybe one of the cVus, just showing that it's basically splitting out a couple of aggregated links into some tools and clusters and stuff below. And the way we previously did that was with what started as a custom device from cPacket, which was just doing MAC rewriting. So we would put a 10 gig input in and it would have a 10 gig output. But we would be able to go into this device and configure how many worker nodes we wanted to split that traffic into. It would rewrite the MAC addresses, and then we'd just pipe it right to a 10 gig switch, which would then just output to the particular hosts. And it would look something like this in our old Bro cluster, right? It would literally just be a stack of 1Us with that cPacket device on the top and the switch. And this is how it's been working at the lab since about 2007. So we've been doing analysis at 10 gig rates using this same sort of thing. That was working great. Everything was fine. You know, average traffic between one and three gigs, so we weren't really pushing the envelope. We were holding steady and everything was fine. Peaks to six and seven. So at that point, we're starting to see some loss. But there's always going to be some loss. We tried to tune that, see what we were losing, and do the best we could, knowing that we had loss and were willing to accept that in this situation. Then we had a recent network redesign. So we did get 100 gig at our border. We have 100 gig now connected to ESnet. We got a Science DMZ, so we do have a separate Science DMZ router.
I don't think there's a diagram of that here, but the Science DMZ could be a whole hour-long talk on its own. Essentially it's an ESnet idea where you have a separate high-speed router with very high flow rates and latency kept to a minimum, and you connect your data nodes to this Science DMZ router, which is off to the side of your network. You pass information very quickly through, and you avoid the whole rest of the bottleneck of your main network. So that's a quick five-second "what is the Science DMZ." We got redundant border routers, all sorts of new stuff in the internal part of our network, and everything's dual-connected. So we went from that simplified diagram to, again, this is kind of a quick look at it, but it just shows you that now we've got all of these internal networks that we're tapping that before were essentially one tap. This is still simplified a little, but we've gone from maybe two or three connections on the input side to literally dozens of connections on the input side. So that was the big change for us. We all of a sudden had more ports than we could even plug into our existing equipment, and that wasn't going to work. And we had the 100 gig, right? So that wasn't going to work either. We needed something else. So we're moving into kind of new territory, and that meant reassessing: what are we going to do about this in terms of our aggregation appliance? So, again, multiple inputs at various speeds, many different outputs still needed, probably more now that we've got different traffic coming in from different parts of the network. We might want to parcel it out and chop it up a little more, send it out to different places. So, some new output groups. So we started putting together, based mainly on our experience with our Apcons and our cPackets, a wish list for the next round of appliances. What does it have to have? What's our wish list for this sort of thing, right?
And just a couple of slides on what we came up with. These were kind of the bare minimum, or not all of them, I guess, are bare minimum, because there are a couple of things on there I don't think we can actually find yet. Filtering is important on both the ingress and the egress of the ports; some of the devices will do filtering only on the ingress. Well, that's great if you're collecting from one particular group of ports that you want to output to your IDS, but what if you want to see everything on the input and then filter on the way out? I think that's really the way we like to do it, but some devices don't support that, just hardware limitations or whatever, right? We need it to be port-speed agnostic, so we can do one, ten, a hundred, potentially on almost any port, because we want to be able to take multiple one gigs and maybe put them out to a ten gig, or take three ten gigs out to even a one gig and say, yeah, we don't care about oversubscription, we just want this one thing, we're going to filter it, that sort of thing. We need symmetric load balancing and aggregation. Almost everyone supports that sort of thing in these devices, but it depends on the way those hashing algorithms are done. We had to make sure that we at least have this kind of five-tuple load balancing: protocol, source IP, dest IP, source port, dest port, so we can minimally ensure that we're seeing both sides of a connection go to the same worker, right? Because sometimes, if that hashing algorithm isn't done with that five-tuple minimum, one direction of a connection might go to one worker, but the return side goes to a different worker. That's not going to work for us. A lot of the analysis software is not going to like that, so we want to make sure that it's all correctly aggregated and load balanced.
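To make the symmetry requirement concrete, here's a minimal Python sketch (purely illustrative, not any vendor's implementation) of a five-tuple hash that sorts the two endpoints before hashing, so both directions of a connection land on the same worker:

```python
# Illustrative sketch of symmetric five-tuple load balancing: sorting the
# (IP, port) endpoints canonically before hashing makes both directions of
# a connection hash to the same worker node.
import hashlib

def worker_for_flow(proto, src_ip, src_port, dst_ip, dst_port, n_workers):
    # Order the two endpoints so that (A -> B) and (B -> A) produce the
    # identical hash key.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{proto}|{a[0]}:{a[1]}|{b[0]}:{b[1]}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_workers

# Both directions of the same connection map to the same worker:
fwd = worker_for_flow("tcp", "10.0.0.1", 40000, "192.0.2.5", 443, 8)
rev = worker_for_flow("tcp", "192.0.2.5", 443, "10.0.0.1", 40000, 8)
assert fwd == rev
```

Hardware devices do this with much simpler hashes in silicon, of course; the point is only that the hash must be insensitive to the direction of the packet.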
No oversubscription limits was a big one for us because, again, we had all these internal links coming together, and we want to say, we may have 30 10-gig ports coming in and we want to send it out a one gig port. And one of the vendors was like, well, you're never going to be able to do that. You're not going to be able to monitor all this traffic. And I said, no, no, no. We're not going to get all of that traffic at any given time, right? A lot of these are just redundant links, but we need to have them all ready to go and sending to our monitoring boxes, so if they flip over, we're seeing them and we don't have to go reconfigure things. So even just the idea of having that many connections, to some of these guys, was like, that doesn't make sense, until we described to them what we're doing. An API for dynamic filtering or shunting: that's one of those wish-list things. Maybe there actually are some that are doing it, and we're going to talk about that, but not everyone's doing it, that's for sure. Filtering. Again, the filtering varies pretty widely, so this is another part where we found that device vendors varied quite a bit on what we could filter on. Our preference would be just any sort of arbitrary IP header fields or TCP flags. You can't always do that, right? So that was a sticking point for us, to see where some of these vendors ended up. Every port can be I/O, I talked about that, so you don't have to specify, well, these six ports are input, and then you've only got 18 left to do output. We didn't want that, and most of the devices will say, you know, any one can do input or output, you just select it, go, and you're good. Port groups. IPv6 support, another one that's not supported everywhere, but something we really wanted to have. Okay, I'm going to have to speed up. So, commercial appliances, we talked about that, and commodity network vendors, right? Those are the options.
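As an illustration of the control-packet filtering on that wish list (a sketch, not any appliance's filter language), here is the "just SYN, SYN-ACK, FIN, and RST" idea expressed as a TCP flag check in Python, with the roughly equivalent BPF expression in a comment:

```python
# Sketch: keep only TCP control packets (SYN, SYN-ACK, FIN, RST) and shunt
# the data. TCP flag bits, per the TCP header: FIN=0x01, SYN=0x02,
# RST=0x04, ACK=0x10.
FIN, SYN, RST, ACK = 0x01, 0x02, 0x04, 0x10

# The roughly equivalent tcpdump/BPF expression would be:
#   'tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0'

def is_control_packet(flags):
    """True if any of SYN/FIN/RST is set, regardless of ACK."""
    return bool(flags & (SYN | FIN | RST))

assert is_control_packet(SYN)        # SYN
assert is_control_packet(SYN | ACK)  # SYN-ACK
assert is_control_packet(FIN | ACK)  # FIN
assert is_control_packet(RST)        # RST
assert not is_control_packet(ACK)    # plain ACK/data gets shunted away
```

Passing only session setup and teardown like this keeps connection-level visibility while dropping the bulk of the bytes, which is exactly why it matters whether an aggregation device can match on arbitrary TCP flags.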
So I'm going to go through a little bit about each of them. So again, this is my slide of everything that I could come up with; I bet you guys can come up with lots more. This is what all these vendors call their products, right? Load balancer, aggregation switch, network packet broker, splitter, distribution device, visibility device. I mean, every single kind of name. It all seems to me to be the same thing, right? Whatever you call it. Here's a list of some that I put up there. There are definitely others. And again, this isn't the network vendors, right? This is just the kind of commercial side of: yeah, we do this stuff and we can sell you a big appliance for a lot of dollars to do it, right? So, our experience with cPacket: again, very fully featured at 10 gig. We've been using them for several years. There is a 100 gig proof of concept at our supercomputing center and perhaps at some other institutions out there. So we know they can do it. We haven't experimented with the 100 gig version from cPacket, but we think it is a viable option, and those guys are really good. It's just that we think it's going to be a high cost, right? So it's probably in our back pocket. They've got a CLI and a GUI. Excellent filtering. Spiffy is their kind of distributed deep packet inspection, where you can go through and search for something. And again, our reference point is these devices, because this is what we've been using for several years and what we know, and we want whatever device we pick to be able to support at least this same sort of level. So this is a picture of one of our 24-port cPackets. So one of the first evaluations we did was with Endace, who has now been bought by Emulex. I think they're still kind of in the middle of name transitions, but it may be under Emulex now. And they have a device called EndaceAccess, which is 100 gig. So it's got a single 100 gig port in and then twelve 10 gigs out. And it's a nice form factor, a beautiful little box.
One of the first things was, well, you need one device for each direction. Oh, well, that's nice. I guess we have to buy two of them then, huh? So that was one thing. It does do the MAC rewriting and load balancing together, which is pretty cool. So you didn't have to take care of that in another way. You could actually configure load balancing and choose the ports you wanted, and then it would rewrite MAC addresses or not, depending on what you wanted to do. A really limited GUI, though. And I mention the GUI because often, you know, a security group takes care of these tapping appliances for us. And it's fine if I can go in and do the stuff on the CLI, but if I'm gone or sick, or someone from the network group who doesn't know this stuff needs to get in there, a GUI has been real handy for folks to go in and take a quick look. You can see a drawing: this port goes to there, I need to move it over one thing. They can do it, right? They don't have to know all the stuff about some particular language that is Cisco-like but isn't. So I think it's important for the GUI to work. So, limited filtering, if any, on the Endace; that was a sticking point for us. And I think ultimately we decided this isn't going to work for us, because there was no 10 gig in, and it ended up being that we needed a 100 gig and a couple of 10 gig ports to be able to aggregate. So there's a picture of the EndaceAccess. But again, a device probably worth considering if you're looking at 100 gig. And so those were the two commercial ones we looked at so far. I don't think, correct me if I'm wrong, that Gigamon has a 100 gig offering yet. I've asked them, and I think they're close, but they just haven't announced it. So maybe we'll do an eval with that. Another one to throw in the mix there on the commercial side. The network vendors: so let's talk a little bit about each of these guys, Arista, Brocade, and Cisco. Arista is definitely the hot player.
Everyone's talking about them. We were wondering what was up. A bunch of our friends and colleagues and you guys are buying them and testing them out. And we're like, hey, great. So we just went out and got one of these 7150S's. And I'll talk a little bit about this. But it is essentially their network switch, in a couple of different port configurations, and it has this special tap aggregation mode that you buy a separate license for. You can do some of the same load balancing and filtering and stuff that you might be able to do without that license, but this is kind of the secret sauce, and the somewhat pretty GUI and stuff that they're offering. They almost have a 100 gig offering. I think there is an institution that is testing their 100 gig. It's not quite out yet, I think, in terms of production. But it's pretty much there; it's just not, I think, in a single device. They do support OpenFlow and SDN, so if you wanted to go down not just their software route, you've got some other options, right? They're pretty flexible in terms of what they support for SDN. So there's a picture of that line of switches. We just got one of the 24 port ones; there's a 48, and I think a 52 port. One of those supports 40 gig, but these are all 10 gig and 1 gig boxes. So, like I said, we just bought a 24 port one. Most of the features on our wish list were covered. A functional GUI: you know, they're not focusing on that, but it was the same sort of thing. You can draw lines and you can do filters and you can do port groups in there. So, I don't know, it made me happy to see it, and they're probably working towards fixing any problems in it and enhancing it. So it's there. In terms of their software, it's pretty awesome. You can literally drop into a bash shell. You can run Python on it. They have an API. It's probably better than any network switch you've ever worked with. It's just kind of designed from the ground up to be user-friendly and serviceable.
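For a flavor of what that tap aggregation mode looks like, here's a rough sketch of an Arista EOS configuration (from memory, so treat interface names, the group name, and the exact syntax as illustrative; check the EOS tap aggregation documentation for the real thing). A tap port takes the monitored traffic in, a tool port sends it out, and a named group ties them together:

```
! Sketch of Arista EOS tap aggregation: Ethernet1 is a tap (input) port,
! Ethernet24 is a tool (output) port, joined through the group MONITOR.
tap aggregation
   mode exclusive
!
interface Ethernet1
   switchport mode tap
   switchport tap default group MONITOR
!
interface Ethernet24
   switchport mode tool
   switchport tool group set MONITOR
```

In exclusive mode the box stops being a switch entirely and every port is either a tap or a tool, which is the "every port can be I/O" property from the wish list.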
You can get in there and do whatever you want. So it's beautiful in that way. IPv6 filtering is not yet implemented, but that's on their release roadmap. So that was one of the things that we found first. You know, we don't have a lot of IPv6 at the lab, but my first problem was: okay, I want to type in a filter, and I get in there and I say, okay, I want to just filter out this IPv4 host and all its traffic to this port. Well, I get that host and all IPv6 traffic at the lab. And I'm like, that's not great. What can I do about that? They're like: nothing. So that one's waiting, but, you know, they said, yep, it's on the roadmap, we'll tell the product engineer that it's important to you guys and bump it up. So they're there and they're listening to what people want. Flexible I/O and filtering; again, the 100 gig is still emerging. Those optics are just so expensive, and that's part of the problem for 100 gig. The second thing we tried was Brocade. And I should just say quickly that we haven't tried 100 gig with Arista; we got this 24 port 10 gig one to test out the features, see what the software looks like, and kind of prove the concept that, yes, it does what we want. And as long as they support the interfaces at 100 gig, we think we'd be able to do something with them at 100 gig, right? The Brocade was one that we did test. They did have 100 gig support; they call it telemetry, and it's 100 gig LR4 ready, which is the optic that we need. It certainly feels like more of a feature of a switch and not anything purpose-built. Even the Arista feels a little bit like, yeah, it's a switch, but they've got this special piece of software, right? And the Brocade feels like it's a switch that has a couple of commands that do some of this load balancing and aggregation stuff, but it doesn't feel like it's purpose-built for what we want it for. But it mostly works. There is OpenFlow support and some sort of hybrid mode, kind of like Arista.
They both have this mode where you run the vendor's code and then you can run some SDN applications or stuff on top of it. I haven't experimented enough with that to say anything about it. But here's a picture of it in our lab, as operating. You can see the 200 gig split by direction on the top inputs, and then it's just like a big four-port chassis, essentially. So these are 10 gig chunks down here, and we've only got it partially populated, but you could fill that thing up and get, I don't know, maybe 48 ports of 10 gig or something like that at the bottom. So pretty flexible, because you get a lot of density there and you get the 100 gig. So it's definitely an option that we're looking at. Mostly did what we want. There is no GUI. There was just a kind of funny story about the three VLAN tags. It ended up being, to their credit, just a software bug on their side. But we got all set up and all of a sudden started sampling traffic, and the traffic has three VLAN tags. It's like, wait a second, what? I mean, is this even RFC-legal? Possible, I mean, but yep, you can strip them off, and if you're in tcpdump, right, it's like "vlan and vlan and vlan" in your capture filter. So it was just kind of funny to see that. They fixed it; it was just something in the code. So that was just something to mention because it was kind of funny. Filtering limitations: it definitely couldn't do all of what we wanted. We're not able to filter on TCP flags for control packets, which is what we wanted. You know, usually we want just the SYNs and SYN-ACKs and FINs and RSTs, so we can ignore all the data but still see the setup and teardown of the sessions. Can't do that right now with the Brocade.
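As an aside, both the triple-VLAN surprise and the TCP control-flag filtering are easy to reason about at the packet level. Here's a minimal Python sketch, purely illustrative (the function names and sample bytes are mine, not from any vendor), that strips stacked 802.1Q tags the way a tcpdump `vlan and vlan and vlan` filter has to, and tests whether a TCP segment is one of the setup/teardown packets (SYN, FIN, RST) we'd want to keep:

```python
import struct

ETH_8021Q = 0x8100  # 802.1Q VLAN tag ethertype
ETH_IPV4 = 0x0800

def strip_vlan_tags(frame: bytes):
    """Return (inner_ethertype, payload, tag_count) after removing stacked 802.1Q tags."""
    ethertype = struct.unpack("!H", frame[12:14])[0]
    offset, n_tags = 14, 0
    while ethertype == ETH_8021Q:
        # each extra tag adds 4 bytes: 2 bytes TCI, then the next ethertype
        ethertype = struct.unpack("!H", frame[offset + 2:offset + 4])[0]
        offset += 4
        n_tags += 1
    return ethertype, frame[offset:], n_tags

def is_tcp_control(ip_packet: bytes) -> bool:
    """True if the TCP segment has SYN, FIN, or RST set (session setup/teardown)."""
    ihl = (ip_packet[0] & 0x0F) * 4   # IPv4 header length in bytes
    flags = ip_packet[ihl + 13]       # TCP flags byte
    return bool(flags & 0x07)         # FIN=0x01, SYN=0x02, RST=0x04
```

A hardware filter doing the same flag test is exactly what the Brocade couldn't express at the time.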
I don't know if they're going to fix that or if that's a hardware limitation or whatever, but again, it's one of those things where you start getting in there, and you just kind of assume it will do it, and then you start looking at some of the documentation: oh, other Brocade switches do this same thing, so how come this one doesn't support it? I don't know, I didn't get a straight answer out of them about that. But I do like that it's a single box. It's not two pieces or things daisy-chained together. It's all in that one chassis, with redundant power supplies and everything else. So that's kind of nice. The Cisco: I have a little bit less to say about it because I haven't done anything with it, but their SDN and OpenFlow support, I thought they were just pushing further and further away with this thing called OpFlex, but then I found out they had this thing called Monitor Manager. I don't know if anyone is familiar with it. Jeff? [Audience: That is the MLXe, I believe. That's the only one that supports the 100 gig, but they do support that telemetry feature on some of their other devices. I don't know which ones exactly.] So Cisco has this thing called Monitor Manager that appears to be the same sort of thing that Brocade and Arista are doing. I haven't messed around with it enough to know what it's capable of, but I just talked to another person who's thinking about rolling this out. I think it's on Nexus switches, and it doesn't have to be the huge massive chassis one. I mean, I think it can use some of their smaller gear, and the cost could be competitive depending on what your Cisco discount looks like. So that's definitely another one to look at and keep in mind, but it's something I can't speak much to right now. As well as SDN and OpenFlow. Again, we haven't actually tested this.
We've got the Brocade still sitting in our lab, and I'm hoping to find some time to actually put OpenFlow on it and try out some of the controllers. I know there was this IU FlowScale OpenFlow app that was running. I think it's a little deprecated now; I don't know, I haven't talked to one of those guys recently, but I think there are also some newer apps coming out. So there are definitely some possibilities there. I haven't explored them enough to know, but if you're going to be spending that much for one of these devices with 100 gig, I don't know what the advantage necessarily is of using SDN versus the native feature set. Other than, maybe you're using the router for dual purposes: maybe you're using it as input/output for your network for real, and you're doing some monitoring on the side. So I don't know. I mean, definitely some cool stuff's happening in this space, and all of this could be applied to your network vendor. So there's more stuff to look at there. I'd be interested to hear from anyone that's messing around with that stuff. So, for greater than 10 gig, I don't think it's a solved problem at all. Everyone's kind of struggling in figuring out what to do, comparing notes, seeing what's going to work best, and testing it out. The 40 gig gear is more available. We didn't go that way because we've already got this 100 gig link there; we don't really want to deal with 40 gig. But if you're scaling up beyond 10 gig and you're going to trunk 10 gig links together, you're going to do 40 gig, and I think there's a little more available there in the commercial space. I know Gigamon's doing 40 gig. A bunch of the other guys are doing the 40 gig stuff now, so you may be able to just go out and buy something or, again, go to one of the network vendors for that.
To do greater than 10 gig, you're going to need advanced clustering for your IDS and your tools, and you're going to be looking at the latest tools and techniques to do that stuff, right? So our approach is going to be to basically duplicate the setup that we've got on 10 gig and scale it up. So, you know, the 100 gig device can send out to multiple 10 gig outputs. Right now, we're filtering on our side and sending it all to a single 10 gig link. So we're watching the traffic, but the Science DMZ stuff, these big bursts of traffic that are coming through, we're kind of ignoring or trying to filter off some way for now. So we still don't have something in place for this. This is, again, our evaluation of what we've been looking at. So we're kind of excited to see what we end up getting and what other people end up using. But this is a picture of the new clustering for Bro, right? Just one of these boxes is the equivalent of what that whole rack used to do for us, with the amount of cores you can get into the new boxes, and memory and everything else, right? So we can do all of our clustering for 10 gig on a single box with a few 10 gig inputs. Since we're not going to be scaling up to 100 gig immediately, we'll be able to, say, take three or four 10 gig inputs, plug those into one of these, and do all the clustering internally. So it sure makes for a smaller rack footprint. Okay, so let's talk a little bit now about output and what you can send to. I don't think we're going to have time to go through all this stuff, but I'll get through as much as I can. Filtering is obviously something you're going to have to do at almost any speed, right? You don't necessarily want to see all the traffic. You want to see the important traffic, or the traffic you care about, and definitely exclude the traffic you know you don't care about.
So, again, I talked about the control traffic a little bit; then there are the elephant flows, as we call them, for science data, where you might have a multi-hundred-gigabyte or terabyte file that gets transferred and you just really don't care about any of that data. But you want to see the connection. You want to see where it's going to and from. So you don't want to miss that, but if we're able to cut down those elephant flows, we're massively reducing the amount of traffic that we have to monitor with our IDS. So that's important. Anything that can do it dynamically is great, but even static exclusions of IP pairs or ports or netblocks or whatever are something to look at, and we're definitely looking at that stuff. Just take out some of these GridFTP transfers, or, you know, we know that the DTN over at CERN is always transferring to this one computer at the lab, so we'll just ignore both of those, or whatever. So there are some simple ways to just reduce the load: maybe the residence networks, your ResNet or whatever, can be taken out of this big stream of traffic and put off to somewhere else. I think the holy grail is the dynamic filtering, and just talking with Justin Azoff a little bit, they're doing this with Bro in almost near real time via an API to Arista. So the capability has finally arrived: we'll have a device that supports the ability to do this almost in real time, and then you plug it into your IDS and say, every time you see this particular type of connection, every time you see a connection above this size, every time you see a connection with these characteristics, just tell the Arista, I don't want to see it, or I want to send it off to port 10, where I have a specially built capture device, some sort of special filter or whatever. So the ability to do that is totally awesome, and we hope to be doing that soon.
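To make the shunting idea concrete, here's a rough Python sketch of the decision logic. The `push_filter` callback stands in for whatever vendor API would actually program the switch, and the threshold is an arbitrary illustrative value; this is my own toy model, not the Bro-to-Arista integration mentioned above:

```python
from collections import defaultdict

SHUNT_BYTES = 100 * 1024 * 1024  # cut a flow off after ~100 MB (tunable)

class Shunter:
    """Track per-connection byte counts and decide when to shunt an elephant flow."""
    def __init__(self, push_filter):
        self.bytes_seen = defaultdict(int)
        self.shunted = set()
        self.push_filter = push_filter  # callback that would program the switch

    def observe(self, conn, nbytes):
        # conn is a connection tuple; once past the threshold, ask the switch
        # to drop (or redirect) the rest of the flow so the IDS never sees it,
        # while the connection setup/teardown has already been recorded
        if conn in self.shunted:
            return
        self.bytes_seen[conn] += nbytes
        if self.bytes_seen[conn] > SHUNT_BYTES:
            self.shunted.add(conn)
            self.push_filter(conn)
```

The design point is that the IDS keeps the connection record (who talked to whom) but stops paying for the bulk data.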
I couldn't go through a talk without talking about Bro, for sure. If you don't know what Bro is, it is a monitoring platform that is so much more than an IDS. I don't have too much time to spend on it here, but it was actually started at Berkeley Lab by Vern Paxson, who was a researcher at the lab at that time, and it's had a rich history since the mid-90s, all the way up to now, where it's being used more and more by people in the commercial space, all sorts of people in EDU, all over the place, right? It's an amazing product. I've been working with it at the lab. Here's a picture of it as a platform, right? So traffic comes in through a tap or a mirror port or wherever into Bro. The piece that's in yellow there is its internal structure: its processing, language, and library. There's the concept of the Bro core, which you don't really do too much with, and the scripting layer, which you're able to interact with as a user, and that's where all the power of Bro comes from: it is a full programming language that lets you do almost anything you can imagine with your network traffic. So, everything from recording rich semantic logs, to doing traditional-style intrusion detection (not necessarily signature-based, but better than that, right?), file analysis, custom logic, vulnerability management. There's so much that Bro can do that if you aren't using Bro or haven't explored it, you absolutely need to. And clustering is just a built-in part of Bro. The concept of running standalone is perfectly fine and it can do that, but clustering is just a basic thing that Bro does, and so you're able to scale Bro horizontally, essentially, with your traffic flow. So there are several ways you can do that.
You can do that, again, like we did with the hardware load balancer, where you've got 10 pizza boxes and each one of them is getting a network drop, or you can do that cluster-in-a-box that I talked about, where you might have a 10 gig pipe in and then you're using something like PF_RING on Linux, or the Myricom drivers, or netmap, or some other way to split that traffic inside the operating system into different chunks that Bro can operate on as workers. So, here's a quick picture of what the cluster ecosystem looks like for Bro, but essentially you've got that same picture that we started with: the tap, some sort of load balancer at the front end splitting traffic into multiple Bro workers, and then the concept of the manager, which is what controls all these workers. All the logs and everything get aggregated back to the manager, where you're able to control what's going on, push the scripts in there, and do everything you need to do. Snort and Suricata: again, other well-loved and well-used tools in security, definitely capable of handling multi-gigabit loads just like Bro. The network cards matter, right? You can't be doing processing at greater-than-gig speeds using crappy network cards; it's just not going to work. So you've got to pay a little attention to either the cards you're buying or the drivers you're using and all that stuff. Tune the rule set, separate and filter; the same sorts of things we talked about for any kind of output are important when doing Snort stuff, right?
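The load balancer at the front of a cluster like this typically uses a symmetric flow hash, so that both directions of a connection land on the same worker (otherwise the IDS can't reassemble the session). Here's a toy version of that hashing, assuming a simple 4-tuple key; real implementations live in PF_RING, the NIC drivers, or the NIC hardware itself:

```python
import zlib

def worker_for(src_ip, src_port, dst_ip, dst_port, n_workers):
    """Symmetric flow hash: both directions of a connection map to the same worker."""
    # sort the two endpoints so (A -> B) and (B -> A) produce the same key
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{a[0]}:{a[1]}-{b[0]}:{b[1]}".encode()
    return zlib.crc32(key) % n_workers
```

Every packet of a given connection, in either direction, then gets handed to the same Bro worker process.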
A quick look at some network cards. The Intel cards support PF_RING, and there are several ways that works, with DNA and zero-copy and all that stuff, but essentially what it does is provide direct memory access to the network hardware, bypassing the kernel and handing packets directly to user space, so that there's essentially zero latency between the time a packet is received and the time it can be processed by whatever software you're using. So that allows for high throughput. You can support multiple tools using PF_RING, which is pretty nice: you can run Bro and you can run Snort and you can run your packet capture or whatever on the same card, splitting out that traffic on the same box. At the lab we're using Myricom. They have something called Sniffer10G, which is their 10 gig advanced driver, with support for Linux and FreeBSD. It only supports their 10 gig Myricom cards, and right now it only supports one tool, so that is a limitation. Again, you can support multiple workers by splitting that traffic up on a single box, but it all has to go to one particular application, so it's only going to be split once; you're not going to be able to have multiple copies of that same traffic. But that is being fixed; I think it's already in, potentially, a beta release, so multiple tools will be able to use the Myricom drivers. The company did get sold and some people left, but I've heard that things are still going well there, and I think we're encouraged enough to still be using the Myricom cards at the lab. There are certainly other cards, like the Endace cards: highly custom-built, hardware-accelerated cards, very high performance. They can definitely do more than, say, two to four gigs on their 10 gig cards without batting an eye, and they have support for as many tools as you want. But again, these are
very expensive. While the Intel and Myricom 10 gig cards we were talking about are maybe in the $500 range, the Endace cards are thousands of dollars. So we've never messed around with those, but I know there are some people using them with great success, for sure. And just a couple of last things here. Time Machine: this is what we use for our full packet capture. It creates pcap files with indexes, but its killer feature is something called connection cutoff. It is per-port, which is a little lame; it would be nice if it could do some sort of dynamic protocol detection and work at the application level, but you're able to specify buckets per port and then set a cutoff for each bucket. So, for instance, for encrypted traffic, I think we cut down the amount of encrypted traffic that we record to 500K per session. I don't have the slide here of how much disk space we saved using these sorts of techniques, but after some tuning and messing around with Time Machine, we're able to capture almost full packet (by that I mean like 99.9% of almost all protocols) in about 20% of the disk space that we would use if we were doing full packet capture. So it is a really cool product. There are other things that do this, but Time Machine is what we use. For instance, it allowed us to just go back in time with this Heartbleed stuff: we had captures going back several months, so we were able to go back and analyze whether there were Heartbleed attacks prior to the release of the information about Heartbleed to the public. We had all that stuff spinning on disk, and we were able to contribute some research dealing with it. So there's amazing stuff you can do with Time Machine, and you're not going to need all the packets. You don't need full packet capture in our environment; if you're maybe required by some sort of regulation to keep that stuff, finance or whatever, maybe you need it, but in our environment I don't think full
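Time Machine's connection cutoff can be sketched in a few lines. This is my own simplified model (per-destination-port cutoffs and per-connection byte counters), not Time Machine's actual code, but it shows why the disk savings are so large: each session writes at most its cutoff, no matter how big the transfer is.

```python
class CutoffBuffer:
    """Per-connection byte cutoff, Time Machine style: record the first N
    bytes of each session according to its port bucket, then stop writing."""
    def __init__(self, cutoff_per_port, default_cutoff=10 * 1024):
        self.cutoff = cutoff_per_port      # e.g. {443: 500 * 1024} for encrypted traffic
        self.default = default_cutoff
        self.written = {}                  # bytes written so far, per connection

    def should_record(self, conn, dst_port, pkt_len):
        limit = self.cutoff.get(dst_port, self.default)
        seen = self.written.get(conn, 0)
        if seen >= limit:
            return False                   # session already cut off
        self.written[conn] = seen + pkt_len
        return True                        # still within the cutoff, write to disk
```

Since the interesting bytes of most sessions are at the front (handshakes, headers, negotiation), the cutoff keeps nearly everything an analyst needs.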
packet capture is important. But what is important is getting all the stuff you need, and we've never had an incident where we went to Time Machine and didn't get the packets we were looking for. So, one more thing to talk about is Moloch, and I don't have any experience using it. It is also a pcap recorder that does indexing and searching. It's Elasticsearch-based, so it reads in the traffic, writes pcaps into an Elasticsearch database, and then indexes them, so you're able to search and do all sorts of cool things on that traffic that you might be able to do in a SIEM. We haven't messed around with it, but I've heard really good things in terms of its capabilities of writing at high speeds. Since it is using Elasticsearch, though, and we've run into problems trying to use Bro and Elasticsearch at our traffic volume, I don't know how it scales up. But I think it's pretty encouraging, just because Time Machine is a relatively old and somewhat crufty project, and Moloch is definitely being worked on right now; people are interested in it and there's some hype around it. It does not support IPv6; that was one problem that we found, but again, it is actively being developed right now. So it's something to look at, and probably some of you are using Moloch, so I'd love to hear about it. So that is actually it. I got through everything in time, I'm glad. I'd love to open the floor, and to the online folks, for questions and discussions about any of this stuff, because, again, I am learning about all this stuff and evaluating products for our project, and a lot of you are doing the same thing. So any other experience or tools or products or stuff that you've been looking at, especially in terms of the appliance and network vendor type stuff, I'd love to hear about it. Please speak up, come up to the mic if you want, and chip in. So we'll go ahead, you guys, and then we'll begin. [Moderator:] We have a question from our online audience: do you feel clustering solutions to analyze data, such as Hadoop, are
overkill for analyzing massive amounts of security logs from various sources, whether real-time or historical? Are tools such as Splunk and Logstash sufficient, given their high ingesting and indexing rates, fast output engines, and search capabilities? Well, that's a big question. You know, I don't know; I think it really depends on your environment. Those sorts of tools, it depends on what your workflow is, what you're working with, what you need to do. For us, our bread and butter is the Bro logs, and I can tell you we've tried almost all of those technologies, and we still go back to ASCII logs, using GNU parallel with grep and awk and the command-line pipes that we run. We still find that that's faster than using most of those other tools. We've got Splunk, we put a lot of stuff in it, and we use it for some things. We have massive amounts of logs that we process, but again, we don't usually do that in one of those distributed clustering environments. That may be just partially due to the way we run things, but I think you've really got to look and evaluate what's right for your environment and what you do. [Inaudible audience question about deduplication.] Yeah, I mean, that's a big concern for sure. Some of the devices will handle deduplication, and there are issues there, right? You've got to make sure the timestamps are really accurate; you've got to make sure you're getting the data in ways that aren't overlapping or doing something weird, or the device is going to get confused. But in our case, we generally separate everything on the inside of our network from the outside, and we just have separate Bros or analysis running on that traffic. So we essentially tap one place on the inside, make sure there's no duplication in where we're tapping, and filter that all down. We don't have problems with deduplication just because of the way we tap. Someone else I was just watching give a presentation said, you know, we have this number of routers and they're connected in such a way, and we sat down
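For a flavor of that ASCII-log workflow, here's a tiny Python equivalent of a grep/awk pass over Bro's tab-separated logs. The field layout here is abbreviated and hypothetical, since a real conn.log declares its own columns in a `#fields` header line:

```python
# hypothetical, shortened field layout for illustration only
FIELDS = ["ts", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p", "proto"]

def matching_conns(lines, resp_port):
    """Yield parsed records whose responder port matches, skipping comment lines."""
    for line in lines:
        if line.startswith("#"):
            continue  # Bro log headers and metadata start with '#'
        rec = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
        if rec["id.resp_p"] == str(resp_port):
            yield rec
```

The point of the grep/awk (or GNU parallel) approach is that this kind of scan streams, parallelizes trivially across log files, and needs no indexing infrastructure at all.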
and thought about it, and we thought, well, if we only tap the inputs on this side, we should be able to see everything without any duplication. And again, they were able to do it that way without relying on one of the vendors or appliances to do the dedupe, which, if you can structure it that way, you're probably better off: you're just guaranteed there's no duplication, and you don't have to worry about what the product is doing. But yeah, it's a big problem, and depending on how your network is set up, it may not be easy to do. Certainly your network may be set up in such a way that it's still simple enough to tap at the right points and not have duplication of traffic, but when you're doing load balancing across multiple links and that sort of stuff, it gets much more complicated. So it's something you've got to look at at your place, for sure. Jason? [Audience: What do your back-end disk systems look like? How are you storing all this data, and how are you writing all this data out?] Mostly, in the past, we have just had single-purpose machines with direct-attached storage: big single units with 48 disks, direct-connected to our boxes. Now we push most of our logs to a central collector, so all the Bros, the dozens of Bros running, sync their logs nightly back to one of these central collectors, and then we do most of our analysis either on that collector box or we push it off to something with a lot of cores to do analysis. Mostly that's where we've been at. Now we're exploring a SAN that we'll be able to connect on the back end to lots of our devices, and seeing, can we write at the speeds that we need to write at over a SAN? I don't know; we don't really know yet. But at this point, the only way we've been able to handle it is just direct-attached storage on a purpose-built device like that. [Audience: Does your Time Machine box actually write out 10 gig a second?] No. Well, again, we don't have 10 gig a second, so we're
still running maybe between one and four gig on average, with peaks. So the Time Machine writes that out, with some loss for sure, but doing some of the filtering and the bucketizing and that stuff means it doesn't have to write all those packets to disk. So then you're pushing it up the stack a little more: can the card handle the traffic on input, can the OS handle it, can the software handle it? But what you're writing out to disk does become much less than what you're seeing on the input. So, yeah. Anything else? All right, well, thanks. Thanks, everyone.