For the last talk of today, we look at something practical. I think if you were there for Frank's talk, you know that network traffic analytics is quite important and getting more important the more cyber-attacks and intrusions we see. So I welcome Martin and Mirko from Cloudera to talk about how you would do that on Hadoop clusters. Thank you.

Thanks everyone, and thanks for staying for the last talk of this dev room. Let me... yeah, I think it's on the screen, okay. So my colleague is Mirko and I'm Martin, and we both work for one of the Hadoop vendors, namely Cloudera. We both work as customer solutions architects, which basically means that we go on site with the customer and try to help them learn Hadoop, adopt Hadoop, and bring more and more workloads to Hadoop. A very general question we run into with customers asking for help is how to understand the system: they are used to maybe one monolithic application, or used to having a database system, and now there is this brave new world of Hadoop technologies where everything is distributed, your data is floating around in the cloud, and how do you make sense of that? One of the ways to make sense of it is to concentrate on the bottlenecks, and network traffic is usually one of the bottlenecks of a Hadoop processing system. So we came up with a couple of ideas for how to visualize this for clients, and also to make sure that we ourselves understand these workloads better.

I will give you a little bit of the motivation of how Hadoop works and what we would like to visualize here. Then we go into packet capturing and how we capture the data. We do a little bit of analytics with the Cloudera Hadoop stack, CDH for short, and the Gephi toolkit, which is an open graph visualization platform, and produce the figure that you see on the slide.

So let's have a look at the network load of a Hadoop cluster. Because we have hundreds, and in a couple of use cases thousands, of machines in one single cluster, simple operations change. When you put a file onto the file system, that most of the time includes some network operations, because it's a distributed file system, DFS for short. These files can be so large that they are bigger than a hard disk in a single machine, and you also want replication for fault tolerance. So just putting something onto the file system is a network operation, and of course getting something from the file system is also a network operation. Then you have to make sure there is some service discovery, so parts of the system know about each other, and you need some heartbeating going on in the system; there is, I would say, a passive noise in the system, so network traffic is being generated even when nothing is running. And of course data is not just sitting around in a cluster: you actually do some stuff with the data, transactions, analytics, you name it, again generating traffic. One of the not necessarily nice features that we discovered by trying to explain this to customers: if you look at the standard Hadoop monitoring tools today, the metrics that you can see for network utilization are aggregated on the host level, and a lot of the time that doesn't give you the capability to look at individual services.
So instead of "I'm using 80% of the bandwidth", maybe I would like to say: okay, my DFS is using 20%, my heartbeats are using 1%, and my processing jobs of these three workflows are using an additional 50%. That's what I would like to see in a setup like this, and currently that's not what you necessarily get from these systems. Also, in this dev room today you have heard about cyber security, and there is an Apache project that we are familiar with called Apache Spot, which is mostly pushed by Intel and Cloudera but is open source, an Apache project for intrusion detection; we'll mention a little bit how we differ from their approach.

So let's move to the data that we are using. We used standard packet capture data, so this is the pcap library, implemented for both Unix and Windows systems; you are familiar with it if you use the tools built on top of it: tcpdump, Nmap, Wireshark, Snort, you name it, they all use this basic implementation. We used a Python binding for this, but it's the same data. A couple of gotchas in the data capturing: of course we don't want to create additional noise while capturing the data, so we capture it on the local machines and write it to the local file system, and we only gather it after the capturing has finished. We also throw away the payload of the packet itself; we are only capturing the structure of the graph, since this is mostly for clustering and for visualization, not necessarily for learning from the data inside the packets. In contrast, that is what the Apache Spot guys focus on in terms of cyber security: they would like to learn from the data inside the packets themselves, topic analysis being one of their most important approaches, and they would like to build a cyber security platform for Hadoop using Hadoop technologies. Instead we focus on the clustering and the visualization, as I've mentioned.

So we had to come up with a couple of scenarios. We examined five scenarios in detail, but here we only present three of those, because those three were more important in terms of results. We had a heavy batch workload called TeraSort, where we read input from the distributed file system, do a big distributed sort on the keys in this input, and eventually write the output to the distributed file system again. We also captured the idle cluster, to have an idea of this background noise and the heartbeats. And then we had a more interesting scenario where we reached out to the Twitter streaming API with Spark Streaming and collected data from the outside; we wanted to see how that appears in a network like this. We also wrote to HDFS, but we could have written to Kafka or any other sink of your choice.

So the way it currently works is semi-automatic. We collect data in the Avro format, for which you saw the Avro schema two slides before, with this pcapy script. We transform the events into networks using Hive, which is a SQL API on top of Hadoop. Now we are migrating these workloads to Spark, because then it's easier to register UDFs and it makes our life easier, but currently it's in Hive. And we use Gephi, the open graph visualization platform, for the visualization layer. So let's have some initial results to give you some idea of what we have managed to capture.
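As an illustration of the capture step described above, here is a minimal sketch of a header-only, local-disk capture, assuming the pcapy binding; the interface name, snap length, and output path are placeholders, not the speakers' actual script:

```python
# Hypothetical capture sketch: grab only packet headers (small snaplen) on the
# local interface and dump them to a local pcap file, so that the capture
# itself generates no extra network traffic. Usually requires root privileges.
import pcapy

IFACE = "eth0"            # assumption: cluster traffic goes over eth0
SNAPLEN = 96              # headers only -- the payload is thrown away
OUTFILE = "/tmp/capture-eth0.pcap"

reader = pcapy.open_live(IFACE, SNAPLEN, 1, 100)   # promiscuous, 100 ms read timeout
dumper = reader.dump_open(OUTFILE)

def on_packet(header, data):
    dumper.dump(header, data)                      # write header + truncated bytes

reader.loop(-1, on_packet)                         # run until interrupted
```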
Both the sizes of the nodes and the widths of the edges correspond to the workload. In the TeraSort benchmark, what we have seen is that the five nodes of our cluster were of course very prominent, because we were reading data from the distributed file system, and the sorting also involves a big shuffle between the nodes, so that communication had to be really prominent. But some other nodes were also included with just a little bit of traffic, and of course it's important to examine those. In the other case we ran the Twitter collection on just one single node, and of course that's the one at the center of this star-shaped graph, and it's also connected to the other four nodes that are in the cluster. However, it's also connected to a lot of unknown new hosts that we haven't seen in the cluster yet; those are the ones that are actually sending the Twitter data to our cluster. With that I would like to pass the microphone to Mirko, who is going to go deeper into the analysis.

So after looking into these very high-level initial results, we could see what we expected. We see two very typical, different topologies of the communication networks. You see communication networks here on a host level, which is pretty interesting as a starting point, but what can we learn from here? Not much yet. Okay, there's a central node, because one node speaks to the outside world and delivers data as an ingestion tool. We have an internal operation running on five nodes in parallel. Okay, that's cool, but that's not something which helps us a lot. We have to look deeper. We have to increase the resolution. We must look into more details. This means we add port information, not only host information as we did in the very first analysis results. Next, we add a timestamp. We track the time-dependent graph, not a snapshot of the whole period; we look into little time slices and make this whole thing more dynamic. And this allows us to do another thing: it allows us to connect to the time domain. This means we can switch between graph analytics and time series analytics. In order to show something useful, Gephi will be our friend: Gephi can handle data with time resolution, and Gephi plots and visualizes graphs with time resolution. You will see an example later.

I mentioned we increase the resolution, and we did it here: we added more details to the picture. What can we see? Not much at the moment. We see some clusters, some clouds of bubbles with the same color. Every bubble is a port available on a particular host, everything in the same color belongs to the same host, and at these ports some communication happened during our experiments. So we first reconstruct a static graph which only shows which ports and hosts have been involved during a certain data capture period. That's the first step, and now we must group the data a little bit, do a better layout, and do some statistics. First thing: in this cluster, or rather in this graph, we now have about 1,500 nodes and about 3,000 edges, and I realized that Gephi does not behave well anymore once this number goes up to 100,000; capturing half an hour or an hour of data will explode the numbers. So we went to Spark and used Spark tools for the analytics, but we still do the whole visualization in Gephi. Our whole visualization means we start with a static network which represents our infrastructure. The colors here encode, at the moment, the hosts: same color means ports on the same host. The black lines are our real communication channels.
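To make the "port information plus time slices" step concrete, a hedged sketch in PySpark of how packet events could be turned into a time-sliced edge list; the input path, field names (ts, src_ip, src_port, dst_ip, dst_port, length), and 10-second bucket width are illustrative assumptions, and reading Avro assumes the spark-avro package is available:

```python
# Turn decoded packet records into a time-sliced edge list:
# one edge per (source endpoint, destination endpoint, time bucket).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pcap-edges").getOrCreate()

# assumed schema: ts (epoch seconds), src_ip, src_port, dst_ip, dst_port, length
packets = spark.read.format("avro").load("/data/capture/*.avro")

edges = (packets
         .withColumn("bucket", (F.col("ts") / 10).cast("long") * 10)   # 10 s slices
         .groupBy("bucket", "src_ip", "src_port", "dst_ip", "dst_port")
         .agg(F.count("*").alias("packets"),
              F.sum("length").alias("bytes")))

edges.write.mode("overwrite").parquet("/data/edges_10s")
```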
Whenever a packet traveled from one port to another, it leaves us such a trace. Okay, this looks good for the moment, but we can do more. We can now study the topology of this network. This communication network changes over time, so topological properties change over time, which is very interesting. Different nodes become important depending on what measure we apply, and we can tune these things as we need. This is still a field where research has not delivered final answers yet. With Gephi and with Spark we are able to calculate different topological properties; the interpretation is still an open problem, and we don't go deeper here. Our goal is to track the whole system in a more specific way. Our initial question was: how much of the network traffic is related to HDFS? How much is related to Spark, to the shuffle and sort, or to the shuffle services? All these different questions arise if you want to tune the system, and this topology alone does not answer them. But we can see which hosts are very important and which are the dominant ports. Okay, that's an interesting aspect, but let's move on and make this whole dynamic thing visible here.

On the left-hand side you see dynamically changing connectivity links between hosts. Okay, that's what we have seen. On the right-hand side you see a different representation of the same data. It is now not grouped by host anymore. Such a color-spanning area here, such a cluster in the same color, is now not a host, it's a subsystem. It's for example HBase, or it could be ZooKeeper, or it could be MapReduce, or whatever. Depending on what software you deploy on your Hadoop cluster you may end up with a very different set of such clusters, and you usually don't know upfront what they are. We collect data without knowing that such clusters even exist; our analysis procedure highlights them and finds them automatically.

So this is how we do it. We go from a host-centric representation to a layer- or subsystem-centric representation; we turn things around. This means we track the graph over time, then we apply a component or cluster detection algorithm, and this algorithm identifies such clusters, which are usually isolated, because these subsystems are usually well isolated. In a Hadoop system, depending on your real environment, this could be different: such clusters could be interconnected. In our case they are not. So finally, if we have such clusters, where the different ports can be on very different hosts, it makes sense to aggregate along these clusters, not along a host anymore. We now aggregate per subsystem and get time series, and with these time series we can say how strong the correlation is between HDFS and HBase, or MapReduce and HDFS, and so on. This means if we change the perspective and aggregate along these new dimensions, we end up with time series and we can look deeper into the behavior.

A final remark here: this component-centric view is very helpful, it allows us to look again into the topology, but be careful. Absolute numbers in such a graph are really dangerous. These numbers here, this high degree or this huge circle used here and here, are totally misleading. You cannot conclude from this picture that these two are the most important ones; that is an artificial effect of the analysis procedure. Just as a warning: be careful with this, always think about the right normalization, and usually you must look deeper into your data to get it.
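As an illustration of the component-detection step, a sketch that treats each host:port endpoint as a node and uses connected components as candidate subsystems. networkx and pandas are my assumptions here (the talk only says a component or cluster detection algorithm is applied, and that the heavy lifting is done in Spark); file and column names continue the hypothetical ones from the earlier sketch:

```python
# Group host:port nodes into connected components of the static communication
# graph and use each component as one candidate "subsystem" (HDFS, HBase, ...).
import pandas as pd
import networkx as nx

edges = pd.read_parquet("/data/edges_10s")          # output of the edge-list step

g = nx.Graph()
for row in edges.itertuples():
    g.add_edge((row.src_ip, row.src_port), (row.dst_ip, row.dst_port))

# assign every endpoint to its connected component
component_of = {}
for cid, nodes in enumerate(nx.connected_components(g)):
    for node in nodes:
        component_of[node] = cid

# both endpoints of an edge are in the same component by construction,
# so labelling by the source endpoint is enough
edges["subsystem"] = [component_of[(ip, port)]
                      for ip, port in zip(edges.src_ip, edges.src_port)]
edges.to_parquet("/data/edges_labelled")
```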
But such pictures help you to see centralized and decentralized structure by using a different layout algorithm. How these pictures look depends entirely on your layout algorithm, and if you want to cheat, you just have to change the layouting algorithm. Depending on the layout you can make the graph speak; that's powerful but also dangerous, so you must be careful here again.

Now we switch the domain: we go from the graph to time series. What we see here is the overall activity during the MapReduce job and during the Twitter ingest job. Both jobs were running, and you see that what we do inside the cluster does not generate much heavy traffic, but during data ingestion we record a lot of activity. However, we don't know which subsystem is really causing this activity. This is why we group the data by subsystem and come up with one time series per subsystem, representing the number of packets per time interval for each individual subsystem: starting with HDFS, where the NameNode is used here just as an example; the NodeManager, which organizes the workload in the cluster, is in red; and blue, black and yellow are communications related to the job management system. So we can clearly separate these time series and see some effects. For Hadoop behavior analytics you can figure out which kind of application has a lot of internal traffic and which does not. And if you look at this picture here, we see that on average the activity on the NameNode is much smaller, and we have no blue, yellow or black curve here, which means on this layer there is no activity going on. But, what a surprise, this job is of the same type as these jobs: what we see here is a result of system tuning. In the old runs we had a lot of internal traffic; after tuning the system in a specific way we could reduce this intermediate traffic and avoid the bottleneck. That's one result of using this method.

So we are at the very first experiments; we have two at the moment. We have the idle cluster, where only the background noise is visible, and we have heavy ingest activity, and we can clearly see that our method allows us to segregate this ingestion activity from all the background noise. So we can split multiple channels, like you would do in an EEG where you use frequency filtering during data collection. We can observe that we have either one very active component, or multiple competing components, multiple channels. All of this depends on your algorithm and on your workload: a normal ETL cluster would behave differently from a graph processing cluster. But with this kind of measurement approach we can measure these things, and then we have at least some benchmarks and we can start tuning the system.
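Continuing the sketch, a hedged example of the per-subsystem time series view described above, again with hypothetical file and column names: one packets-per-interval series per subsystem, which can then be plotted or correlated (HDFS versus HBase, and so on):

```python
# Aggregate the labelled edge list into one time series per subsystem
# and compare the subsystems over time.
import pandas as pd
import matplotlib.pyplot as plt

edges = pd.read_parquet("/data/edges_labelled")

series = (edges.groupby(["bucket", "subsystem"])["packets"]
          .sum()
          .unstack("subsystem")
          .fillna(0))

print(series.corr())          # pairwise correlation between subsystem activity

series.plot()                 # one curve per subsystem over time
plt.xlabel("time bucket (s)")
plt.ylabel("packets per interval")
plt.show()
```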
So finally we have to think about what's coming next. First of all we must improve the experiments. This means we have to run more experiments and add more realistic workloads, for example Flink streaming; the Twitter Spark Streaming scenario is not done yet; Impala queries, or even heavy HBase use cases, or Solr-based search use cases have not been tested yet. We will add all this to our agenda, to our scenario list. Then we come up with a different kind of visualization, inspired by the talk this morning where the Twitter real-time visualization was shown; we concluded it makes sense to have this for this measurement as well, so we can spot strange behavior or suspicious connections. If we find such things, if we have a classification for our traffic, then we could visualize it using Gephi as well, and Gephi together with Spark works here with our Gephi Hadoop connector. And then, after we have settled this experiment platform a little bit better, we can do much more sophisticated analytics on top of the system. The goal is to learn more about the time series and to model the time series, but for that we must first aggregate the data in the right way, and here we have shown our first experiments towards those results. That's all for today; we have to say thanks to the audience and thanks to a bunch of colleagues who support the project in the background and who are not here. That's all from my side, it's time for questions.

Yes, it is open source. It's not published yet, we have it still in an internal repository and are just cleaning it up, but yes, it is intended to be published soon.

Yeah, because I can imagine it's really useful also for other use cases, like between Mesos or DC/OS nodes or something like that, right, other clusters.

Exactly, it's an experimentation and measurement thing, like an oscilloscope for IT guys.

Yeah, but in general, is this sub-clustering by subsystem done on a port level or a port-range level?

Currently we do it without prior knowledge, and the result overlaps with port ranges, so we can do port-range statistics, and that is a fuzzy matching with known port ranges. If you know the configuration, then you can do this matching; otherwise we can just learn it from the data.

That's really cool.

That's the goal. We have the capturing project, it's a Python script; it still needs a little bit of documentation to make you fast in doing the experiments and not wasting your time, that's our goal. Then we are thinking about contributing the stuff to the Apache Spot project and finding the right structure and the right procedure and all of that; this is going on, we are not that far yet, but that's the next thing. And the analytics part is currently a little scattered; that's why we decided to converge towards Spark, so if you are new to these tools you only need to learn one tool, not three. Why would you learn three?

Yes, please? I could not really hear well, could you please repeat? ... We can do a mapping to known port ranges in order to see which subsystem it is. If you have no information about it, we can make a good guess, and we can look into the patterns shown here, into this one, and with a little bit of knowledge about your algorithm and with some expectations you can say which curve belongs to which subsystem; as a system engineer you would probably be able to guess well. But let's say we have two jobs running in parallel doing the same stuff for different users; such a thing could not be isolated anymore by this algorithm.
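Coming back to the fuzzy matching with known port ranges mentioned a moment ago, a rough sketch of what such a lookup could look like; the port numbers are common Hadoop/CDH defaults as I know them and should be treated as assumptions to adjust to the actual cluster configuration:

```python
# Map observed ports to likely subsystems; anything unmatched falls back to
# an ephemeral-range hint or "unknown". Port numbers are assumed defaults.
KNOWN_PORTS = {
    8020:  "hdfs-namenode-rpc",
    50010: "hdfs-datanode-data",
    50070: "hdfs-namenode-web",
    2181:  "zookeeper",
    8032:  "yarn-resourcemanager",
    8042:  "yarn-nodemanager-web",
    16020: "hbase-regionserver",
}

def guess_subsystem(port):
    """Label a port with a known service, an ephemeral-range hint, or 'unknown'."""
    if port in KNOWN_PORTS:
        return KNOWN_PORTS[port]
    if port >= 32768:                 # typical Linux ephemeral range
        return "ephemeral/client"
    return "unknown"
```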
Here we need a bit more advanced information; maybe we have to combine this approach with log information coming from YARN, and if we have this integration then we could hopefully isolate on a per-application level. That would be the next step: to isolate not per technical subsystem but per application. Right now we cannot do that.

That's a good question about the packets, whether there's enough information, because you have to have a tag on the packet to identify where it belongs, and the raw network packet content is not really that helpful, because it's just partial information. So to reconstruct the application level you need both the pcap data and the YARN ResourceManager logs, because those are going to tell you that this application has these ports on these machines, and then you join that together and you can reconstruct it.

When you have to tune a cluster like that, what do you find most often that you need to tune: the network, the job, or...?

Do you want to explain? For this example here we did one thing: we increased the replication level of the data. Instead of having three replicas we now have five, one per host. In a hundred-node cluster this would be overwhelming, but for our example, if we bring maximum data locality into play, the network traffic drops. It's an academic example here, just to show the effect, or even just to figure out whether the effect is visible, and the answer is yes. But in order to do a better quantitative analysis we need to figure out what resolution is best, and we don't have that yet; depending on the time resolution your results could vary, so that's also a problem we have to deal with. We are also interested in looking at graph processing, to see how it works when the nodes communicate with each other on the graph level and not on the server level; maybe we can also help to tune the graph algorithm itself, because such a graph algorithm is heavily network bound, and then we can say, okay, this could be a helpful tool. We will see how it works out.

So we capture it on the host, and these jobs weren't really CPU bound, so it's captured on each individual host and we are not introducing any new overhead on the network. That was the basic setup, the thing we tried to avoid: of course we shouldn't introduce any overhead if we can help it. And you can adjust how you sample in pcap; currently we try to capture as much data as we can, to capture most of the network, but if you find that the overhead is just too big, then pcapy has an option to only capture one data packet in a hundred or a thousand, so you can go in that direction. You could also have a window-based approach where you say, okay, I just capture how many bytes per second go over this port in a window, and then you have a volume; you get a capture which kind of summarizes, and you choose the time range, per second, per five seconds or something like that, and that would give you enough information about the communication flows. You don't need the raw data, because what you actually want to know is when it started, when it ended, and how much data went over the connection.
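To illustrate the window-based summarization idea from that answer, a hedged sketch that keeps only bytes per destination port per time window instead of raw packets; scapy is used here purely as an example capture library (an assumption, not what the speakers used), and the window size is arbitrary:

```python
# Count bytes per (time window, destination port) instead of storing raw packets.
from collections import defaultdict
from scapy.all import sniff, TCP, UDP

WINDOW = 5                                   # seconds per window
volume = defaultdict(int)                    # (window_start, dst_port) -> bytes

def account(pkt):
    layer = TCP if pkt.haslayer(TCP) else UDP if pkt.haslayer(UDP) else None
    if layer is None:
        return
    window_start = int(pkt.time) // WINDOW * WINDOW
    volume[(window_start, pkt[layer].dport)] += len(pkt)

# capture for 60 seconds, then print the per-window, per-port byte counts
sniff(prn=account, store=False, timeout=60)
for (window_start, port), nbytes in sorted(volume.items()):
    print(window_start, port, nbytes)
```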
Cool, any other questions? Otherwise, thanks a lot for your talk, it was really, really interesting. Before you leave, I want to use the opportunity to thank you all for coming and attending all these talks; I hope you got something out of it, for me it was quite interesting. There's a feedback mechanism on the FOSDEM website, so if you have any feedback on any of the talks, for the presenters or also for us, then please don't hesitate to give feedback. Otherwise, have a good evening and a good day tomorrow.