Well, thanks for coming, everybody. My name's Mike Pittaro. I'm a principal architect at Dell working on big data solutions. I see a few familiar faces through the lights. Some of you may know me as an OpenStack contributor, and I kind of dropped off the scene because for about the past nine months I've been working on big data solutions. So secretly, this little project we're talking about is my way of getting back to OpenStack via Hadoop. You can find me online; I'm pretty easy to find, usually on Twitter or Freenode as mikeyp, or pmikeyp, which is supposed to be the personal Twitter handle so that nobody thinks I'm tweeting on behalf of some company. What we're going to talk about today is Hadoop for OpenStack log analysis. The way I put together this presentation (I've never done it before, so timing is going to be a little tricky) is that I thought I would talk a little bit about the project we've been working on at Dell and go through a high-level overview, then switch over to the demo if I can get everything to work with the projector here, and then we'll go back and drill into more of an interactive discussion and Q&A, and I'll tell you some of the things we've been looking at. Sound good? So before we go there: how many people have heard of Hadoop? Oh, good. How many people actually know what it is? We're doing pretty good there. This is the DevOps track; I needed to start with that. So let's talk about the problem I'm trying to solve. The issue we're dealing with is very simply running OpenStack at scale. It is very difficult, very challenging, to run OpenStack. There are a lot of moving pieces. I love Ken Pepple's diagram because it puts everything on one page, and it looks simpler, but it's still kind of messy. The issue you have in production is that a lot of things are happening, and if you're an operator actually trying to run OpenStack, figuring out what went wrong is a problem.
So this is the focus I have on this project. What we're searching for is really the holy grail of OpenStack operations. The big challenge for an operator is really to ask: what would happen if we could follow a request through the entire system? Starting with a request into the Nova API, follow it through the system, across compute, across storage, across networking, independent of the physical nodes or where things are running, independent of whether I move services around; actually catch timestamps on that activity, and potentially correlate it with the things that were happening outside of OpenStack. To me, this is the holy grail of log analysis. I like solving hard problems. The good news is, you know, some of us were trying to do this before invisible horses were cool. We've been doing this for a long time, and we have some ideas about how to solve these problems. This is actually kind of easy; the good underpants gnomes taught us how to do this. It's really very, very easy: you collect the logs, you analyze them, and you get to sleep at night. The truth is, when you really get into this problem, it's not quite that easy. Having looked at this, talked to a lot of Dell customers, and talked to other people on our team doing QA work, the reality is this becomes a big data problem. People have various definitions of big data across the board in terms of what they think it is. I like this definition; it's a variation of one that comes from O'Reilly. I think it was Roger Magoulas's definition. I like to say big data is when the data itself is part of the problem. The original definition is when the size of the data is part of the problem. That means you're dealing with high volume, data coming along very, very quickly (velocity), or just variety, the different types and structure of the data. Generally, when you get into log analysis, the data is part of the problem.
So I think it fits into a big data problem. The problem is, now that I've talked about big data, that suddenly puts us onto the hype curve. Big data is on that very, very hot hype curve, and it's probably going to outdo the dot-com era in terms of the peak of the hype curve and the trough of disillusionment and all that. But because I mentioned big data, I think it's important to say I don't want to get into all this fancy stuff. What we're trying to focus on is a very narrow focus. Rather than solving all the world's log analysis problems, the focus right now with the work we're doing is, number one, operators. I'm thinking in terms of the DevOps teams, the operators running OpenStack. How can we assist them in running OpenStack? I'm specifically not thinking about tenants, so I'm not trying to work on something like a CloudWatch. We're just thinking about running OpenStack for consumers. The second focus is very much the data: grab as much detail as we can from the log system, extract and index the really important fields, and get it into Hadoop so that we can do future analysis. The third thing we're focused on is the patterns. Since it's the DevOps team, we know there's no perfect solution; we're going to need help figuring this out. So part of our focus is, you know, what works? What's repeatable across multiple installs that we can factor out and turn into a really useful thing for the community? And then finally, how do we get some help? How do we collaborate? What are other people doing? What sort of challenges are you seeing out there? The reason for the LOLcat here is just the distraction, right? Whenever you talk about big data, people tend to go off on tangents about all the wonderful things we can do. It's cool, but we're very much focused on operators, data, patterns, and how we actually deploy this stuff. I thought it might be worth talking a little bit about Hadoop.
I realize not everybody here is a Hadoop expert; this is the OpenStack Summit. In terms of Hadoop, I think the first thing to talk about is, you know, why use Hadoop? I think there are a couple of reasons. For one, it's open source. Number two, it has a huge ecosystem. And number three, it's a really good data processing and storage platform. So there are a lot of good things about Hadoop that fit it towards this problem. In terms of the simplified block diagram: at the lowest level of Hadoop you have something called HDFS, the distributed file system. This is not a POSIX file system, so you're not going to take this thing and map it to block storage in OpenStack anytime soon. Maybe someday somebody will do that. It's very much a distributed file system optimized to support parallel processing in the next layer of Hadoop, which is the MapReduce framework. For example, in HDFS block sizes are typically 64 megs; 128 or 256 megs are not at all unusual in a Hadoop implementation, compared to a traditional disk. Above HDFS, the next level up is something called MapReduce, which is named after the original Google paper that described this parallel processing framework. From a programmer's point of view, MapReduce is interesting. I'm grossly simplifying here, but you write a loop, you submit it into Hadoop, and Hadoop takes this loop that's processing a file and reading through it, parallelizes it across a cluster, runs it in parallel, and lets you merge back the results as if that loop ran on one CPU. It's a gross simplification, but that's really what Hadoop does under the hood. And it's interesting because it makes parallel processing accessible to programmers, particularly enterprise programmers who have never done parallel processing, message passing, async I/O, all of the things that we do in the guts of OpenStack or the guts of other systems.
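To make that "write a loop, Hadoop parallelizes it" model concrete, here is a toy, single-process Python imitation of the map, shuffle, and reduce phases, counting log levels. This is purely illustrative: it is not the Hadoop API, and the sample log lines are made up.

```python
from collections import defaultdict

# Map phase: turn each input line into (key, value) pairs.
def map_phase(lines):
    for line in lines:
        for level in ("DEBUG", "INFO", "WARNING", "ERROR"):
            if level in line:
                yield (level, 1)

# Shuffle: group values by key (Hadoop does this across the cluster).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: merge each key's values back into one result,
# as if the whole loop had run on a single CPU.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

logs = [
    "2012-10-18 12:00:01 INFO nova.api request received",
    "2012-10-18 12:00:02 ERROR nova.compute instance failed",
    "2012-10-18 12:00:03 INFO nova.network allocated",
]
print(reduce_phase(shuffle(map_phase(logs))))  # {'INFO': 2, 'ERROR': 1}
```

In real Hadoop the map and reduce functions run on different machines and the shuffle moves data over the network, but the programmer only writes the two small functions.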
It just makes it easy. So on top of those two layers, HDFS and MapReduce, there are a large number of tools, kind of part of the ecosystem. You have things like Pig, which is a batch processing language. You also have Hive, which is what I would call a SQL-ish query tool; it's kind of the data warehousing component of Hadoop. I say SQL-ish because it looks like SQL, but it's not really SQL, because Hadoop isn't a relational database. So it's not a perfect match, but it's getting there. Then you have other tools: Oozie for workflow management, Hue for the web UI, Mahout for machine learning. These are all interesting tools, and they're part of, I believe, every Hadoop distribution out there. These are like the standard command-line basic utilities to use with Hadoop. There is also a set of APIs. Hadoop is like OpenStack: there are APIs everywhere, and a lot of those APIs are used by both those tools and other utilities. I think the main ones to mention are: there's a JDBC interface, so you can get to Hadoop as a database, or at least get to things like Hive as if it were a database; and there are tools called Sqoop and Flume. Sqoop is a database connectivity tool that lets you do database imports and exports between Hadoop and other systems. Flume is a tool used to stream data into Hadoop if you're doing real-time or semi-real-time collection, and that's something we'd be interested in on the log analysis side. I deliberately skipped HBase. HBase is the Hadoop database. It puts a table-style interface on top of the core HDFS system, and by putting that interface on top of HDFS, you can let people treat Hadoop a little bit like a database. It's actually pretty much like a database: it has indexing and it has queries. So again, a powerful tool. But this is the big picture on Hadoop. Again, I'm grossly simplifying; I just wanted everybody to be on the same page. When you really work with Hadoop, there's a huge ecosystem.
So there are vendors doing both distributions and tools. At Dell, in the solution we work with, we regularly work with Pentaho and Datameer and Kitenga, kind of high-level analysis tools. None of those are in scope for the work I'm doing with log analysis. It's like, look, let's get the data into Hadoop and figure out what we find. Did I lose anybody? Time for a quick check. Anybody putting up their hand saying "I'm lost"? Oh, good. So if you think about the log problem, the real question is: what are the big pieces of this log analysis in OpenStack, or what are we trying to do here? One thing is we need to deal with log collection. If we go back to that initial picture, the complexity, we need to pick the logs up someplace along the line and get them into Hadoop. A second part I think is important is what I would call intelligent log parsing, indexing, and searching. Whatever we build and release here should know something about OpenStack. It's not just looking at blind text files and analyzing them; I think it needs to have some intelligence. It may not be super intelligent. The third part is the storage organization. The dirty little secret in Hadoop is that it's actually a bunch of files. Don't let anybody fool you; there are files scattered everywhere. It helps to put some sort of structure on top of that, in terms of the data formats and the directory organization, so you can actually make sense of it. There are multiple ways to do it.
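As a sketch of one such directory convention, here is a tiny Python helper that partitions log files by service and date. The path layout here is my invention for illustration, not a released standard; tools like Pig and Hive can then prune work to the partitions they need.

```python
from datetime import datetime

# One possible HDFS layout: partition by service, then year/month/day/hour.
# This exact convention is hypothetical; the point is to pick one and be
# consistent so batch jobs can find and prune the data.
def log_path(root, service, ts):
    return "{0}/{1}/{2:%Y/%m/%d/%H}".format(root, service, ts)

ts = datetime(2012, 10, 18, 14, 30)
print(log_path("/data/openstack/logs", "nova-api", ts))
# /data/openstack/logs/nova-api/2012/10/18/14
```

A MapReduce job that only needs yesterday's nova-api logs can then open a single directory subtree instead of scanning the whole cluster.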
You just pick one and say: this is the standard we're using for this implementation. A fourth piece, and this one's going to be tricky: Hadoop is a fairly complex system. It probably rates up there with OpenStack in terms of complexity. I think it's been around a little longer, so there are pretty well-established patterns for running Hadoop. But I don't want to bring something into the DevOps infrastructure that is as complex as OpenStack, because that doesn't solve our problem. So the idea here is to keep this straightforward. Get the data from OpenStack to Hadoop; perhaps let the Hadoop people run Hadoop, let the OpenStack guys run OpenStack, and keep the connections between the two as straightforward as possible. Otherwise, we won't succeed on the log analysis problem. And then the last thing is, as we go through these big pieces, once we're getting the logs there, once they are in Hadoop, we start opening the door to future analysis. By getting the stuff into a good format, people can actually pick it up and do something useful with it. There are data scientists who will look at data that you give them, in white-box fashion; they've never seen it before, and they'll analyze it and tell you what they find. Sometimes they find interesting stuff. This is kind of a big picture of what the whole thing is going to look like. Some of this is in place, so I'll get into the details of what we've actually accomplished so far. But the broad idea is that the core of the system is going to be HDFS, the Hadoop distributed file system. We're going to set up HDFS and let the operators hit this thing with Pig and Hive, the two most popular tools for working with Hadoop. They'll hit HDFS, access the data, and get something out of it. We could put layers and layers of ecosystem tools here in terms of what they use to get it into a dashboard or something, but those are the basic tools. Within HDFS, the goal right now is to store the data in a format called Avro. Avro is a file format.
It's a cross-platform serialization format. It's nice because it has pretty good cross-language support and pretty good cross-platform support. Language becomes important here. I kind of glossed over the detail, but most of Hadoop is written in Java, so inter-language operability and cross-platform concerns become very important when we work here and bring OpenStack into the mix. In terms of the collection model, there is some set of OpenStack nodes, and by nodes I mean physical machines running OpenStack services. I don't care right now which services are running on them; whether it's Nova Network or Nova API or Quantum or Swift or Ceph, it doesn't matter where they fit into the system. They're going to be nodes, and right now I've used the red boxes to indicate that somewhere in OpenStack there is going to be some Python logging code and some syslog logging code. The goal here is to hook off that, use Flume as a utility to pick up syslogs and Python logs, and stream them over to the Hadoop cluster, probably using several Flume agents along the line to do the log collection and stream the stuff into Hadoop. Once we get past that point, and I haven't really dug into this yet, the secondary plan is to pick up some stuff through Sqoop and Flume and collect some of the data from the infrastructure: Nagios, Ganglia, whatever other systems we have, and potentially SNMP from switches. It'd be nice to get that in there, but the initial focus is the log analysis. At the same time, while the data goes into HDFS: the trouble with HDFS and Hadoop is that it tends to be a fairly batch-oriented system, really good for processing big data sets, but terrible for doing fast queries and asking what this thing says right now. So we'll take the data once it comes into HDFS, or maybe in parallel with going into HDFS. The plan is to run MapReduce jobs; that's the way I'm doing it now. We run MapReduce jobs and build Lucene indexes that we can
feed into a SolrCloud. So by running Solr in parallel with the Hadoop cluster, I can use Solr. Does everybody know what Solr is? I probably should have explained. Solr is the search engine that powers an awful lot of public-facing websites. It's a really good, fast text search engine. You can put the data into Solr and do quick searches on the log data to find stuff that you're interested in; think of it as grep in parallel, on steroids. And then within Solr we can store hooks back to the actual data in HDFS. So I get the fast query on some of the data, but I have hooks back to the deeper data if I actually want to run an analysis and go through it. That's kind of the big picture of the various pieces that are here. In terms of the current status, where we are today: it's currently a batch-only system. We haven't quite worked out the details of the Flume collection; the good news is that's a well-established tool, so it wasn't a big problem to solve. I'm focusing on the hard problems. But in batch mode, I'm able to take OpenStack logs, run them through some utilities, and get them into Hadoop. We're converting them to Avro format. Normally most of our logging in OpenStack is text; that data is being converted to Avro, which is this more structured, schema-oriented format, and it's going into Hadoop that way. There's a first cut of the schema; I think I have it here in a later slide. It's not a polished schema, but it's an initial cut, and it's fairly simple, and that data is going into Hadoop. From Hadoop we're currently creating Solr indexes, so we can do a quick search in Solr to look at the data and get a feel for what's there. And we're beginning to look at the data itself, using Pig, using Solr, and asking: how do we really want to structure this going forward? So that's the current development status. It's not releasable yet. It's very hackish.
I will admit that it's really a prototype, but I'm bringing it here to see what people think and try to figure out where we should go with it for the next steps. I wanted to show a demo, but I realized the trouble with these presentations is you don't know which order to cover things in. I thought that before we take a look at some data, it probably made sense to think a little bit about the schema. So at the top here I just took an OpenStack log message, actually a fairly recent one, and dropped it at the top of the screen, and the little red boxes indicate the fields that are being parsed out. The body of text on the right is the Avro schema that I'm mapping it into. The basic idea is that we parsed out the host name, which actually isn't in the log file, but I think it makes sense to track the host name when you collect a log. We got the date, the time, the message level, the module. We got the request IDs; some messages will contain those request IDs, so they're flagged in the schema as string-or-null, which means they're optional fields. The question is: are there any logging standards? I think with Oslo we've kind of moved logging, and somebody correct me if I'm wrong, but I think logging is now a common module, so we have a fairly standard logging method for all the OpenStack services. Swift happens to go through syslog right now, but everything else goes through Python logging. So I think where we're going to end up is that at some point we're going to take the Python logging module and figure out how to get it to log the data right out in Avro format, if that's not easily available already. It may actually be a plug-in that does that. That seems obvious.
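As a sketch of what that field extraction might look like, here is a hypothetical Python parser pulling those schema fields out of a default-format OpenStack log line. The exact line format varies by release and configuration, and the sample line, regex, and field names are illustrative, not the project's actual code.

```python
import json
import re

# Rough shape of a default OpenStack log line:
#   date time LEVEL module [request context] message
# The request-ID bracket is optional, matching the string-or-null
# fields in the Avro schema. This pattern is a sketch, not exhaustive.
LOG_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}(?:\.\d+)?) "
    r"(?P<level>[A-Z]+) (?P<module>[\w.]+) "
    r"(?:\[(?P<request_id>req-[\w-]+)[^\]]*\] )?(?P<message>.*)"
)

def parse_line(host, line):
    m = LOG_RE.match(line)
    if m is None:
        return None  # unparseable line; a real collector would keep it raw
    record = m.groupdict()
    record["host"] = host  # host comes from the collector, not the line
    return record

line = ("2012-10-18 12:00:01 INFO nova.api.openstack "
        "[req-a1b2c3 demo demo] GET /v2/servers")
print(json.dumps(parse_line("compute-01", line), sort_keys=True))
```

The resulting dict maps one-to-one onto the Avro record, with `request_id` left as `None` when the bracketed context is absent.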
I don't know about deeper structure in the logging messages, but I thought this was a good start: we're just parsing out the key fields and getting them into a standard schema. I thought a reasonable demo would be to talk about where we are today, and then there's a whole bunch of other stuff we can talk about. I'm between you and lunch; we started at a quarter past, right? So I think we've got a lot of time to do a demo. I want to shift over here and do a quick demo and show you where we are today, in terms of being able to look at Solr, and also take a look at Pig and some batch analysis. To do that we need to get out of PowerPoint, and I need to connect to the tiny little Hadoop server down here on the table that you guys can't see, but if everything went well, I'll be able to log in over there. Bear with me while I do some hacking here; it's always interesting. Looking at the history, I don't know if I got an IPMI local network; I think it VPN'd me into the Dell corporate network, and it's doing something funny. Looking good, I see an IP address. The demo may not cooperate, but I'm going to give it one more shot here. While I'm waiting for that machine to reboot, let me jump back over to my PowerPoint and see if I can start walking through the next couple of slides, because it was working a little while ago. You can never win. The demo is not going to cooperate right now, so back to the slideshow. Maybe the demo will work later, but what I'm going to do is go through the remaining slides, and then if anybody wants to hang around while I try to show it at the end, I'll do it then. So, I was touching on where we are in the current setup for the project. It's very prototypish. I'm working with a couple of people within Dell, and with a handful of Dell customers and some people in the OpenStack community who are very interested in this.
So, some of the thoughts we had, beyond the schema. When you think about data collection, the general idea here is to go after mainly the OpenStack logs. Go after the core logs first; then syslog, and maybe Nagios, Ganglia, and other infrastructure data I think is relevant in this context; and general infrastructure data from network switches, maybe picking up SNMP data, or maybe proxying it through Nagios and letting it deal with the ugliness while we just pick it up from the database. I've deliberately not included Ceilometer here, although I've been in almost every Ceilometer session at this summit. There is a back end for Ceilometer that currently puts data into HBase, which Genji here developed; that was some guys at Dell already doing that. I think Ceilometer data is very interesting for analysis, but to me it's not quite the same as the operational monitoring aspect. I think eventually it'll come in here. For input formats, we're mostly dealing with semi-structured text. I want to track the subsystem, and I think we need to track the host name, the timestamp, and the severity and error level. When I say host name, I don't know if we should track the host name or the IP address; it's kind of up in the air, but I think we do need to tie each log back to the original machine. For output formats, this is a little tricky. Protocol Buffers are an option, but they're not very portable across languages. They're kind of a C/C++ thing that Google uses. They're interesting, but I don't think they're very portable at this point, and they haven't had a lot of traction outside of Google. The two formats that have had traction are Avro and Thrift. I'm not really an expert on the subtle differences between them; they both solve the same problem, which is a serialization format that's pretty much cross-platform and cross-language.
I chose Avro primarily because it has some nice schema evolution features, where you can put the schema in the file. The beauty of an Avro file is that when you apply a schema, it's embedded in the file. You don't have the XML situation, where you have these external schemas and you have to go validating DTDs; with Avro the schema is just loaded in the file header, and it's also a very compact format. But Avro and Thrift, I think they're, and somebody will call me out on this, more or less equivalent. The reason I'm using Avro is that the Hadoop community seems to be very, very quickly converging on Avro as its most universal file format at the lowest level. So I'm open to suggestions here, but at least that's what everybody tells me. For log collection thoughts: the reason we haven't really dealt with the Flume streaming stuff yet is that it's a well-established problem. There are some nice patterns and best practices in the community, so it's not a research project; we'll be able to pick a path, follow it, and it's likely to work. The tools that are used a lot are Kafka, Scribe, Flume, and Flume NG. I've started focusing on Flume because, again, that seems to be where the Hadoop community is going, towards Flume NG. Somebody could make a strong case that we absolutely should use Kafka, and a couple of guys at the Hacker Dojo in San Jose twisted my arm on that.
They said you really need to use Kafka. You know, it's a possibility, but right now Flume NG seems to be doing it; it's pretty popular with people. The key requirement on the log collection side isn't so much the tool; it's that it supports a distributed infrastructure and that it can do reliable log collection. All of these tools have a mechanism where they can buffer data in memory or buffer it to disk and make sure that they reliably forward it onward, because you don't want to be dropping log messages during a temporary outage, but you also don't want to fill up the collection nodes when they get busy. So: store-and-forward, and some concept of aggregation. This is going to be tricky. The actual topology, how we handle the distributed nodes, how we aggregate the logs from, say, a cell or a data center, and how you break that up: I don't think that's going to be hard-coded in the solution. I think it's going to be something you choose at the time. How do you want to aggregate the logs and stream them back to the master Hadoop cluster? Those are some of the things that are up in the air. When I call these thoughts, I mean I don't know the final answer yet; I've actually thought about these but haven't solved the problem. That's the log collection side. On the organization side, the big things are, as I mentioned, Hadoop being a bunch of files in a file system. It's things like: how do we do file organization within Hadoop? How do we name the files? How do we organize the directories?
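The store-and-forward requirement from the collection side above can be sketched as a toy buffer-and-flush loop. Real tools like Flume NG or Kafka persist the buffer durably to disk; everything here (the class names, the flaky sink) is invented purely for illustration.

```python
from collections import deque

# Toy store-and-forward collector: buffer events locally, flush when the
# downstream sink is reachable, and cap the buffer so a busy node can't
# exhaust memory (oldest events drop once the cap is hit).
class Forwarder:
    def __init__(self, sink, max_buffered=1000):
        self.sink = sink
        self.buffer = deque(maxlen=max_buffered)

    def collect(self, event):
        self.buffer.append(event)
        self.flush()

    def flush(self):
        while self.buffer:
            event = self.buffer[0]
            if not self.sink.send(event):  # sink unreachable: keep buffering
                return
            self.buffer.popleft()          # delivered: safe to discard

# A sink that is down until we flip it up, to exercise the buffering path.
class FlakySink:
    def __init__(self):
        self.up = False
        self.received = []
    def send(self, event):
        if self.up:
            self.received.append(event)
        return self.up

sink = FlakySink()
fwd = Forwarder(sink)
fwd.collect("msg-1")   # sink is down: message stays buffered
sink.up = True
fwd.collect("msg-2")   # sink is back: both messages flush in order
print(sink.received)   # ['msg-1', 'msg-2']
```

The in-order delivery across the outage is exactly the property the talk asks of the collection tool; the production versions just add durable channels and acknowledgements.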
If you were in one of the earlier talks, they were talking about Swift as a back end for HDFS, or for Hadoop, and they're doing things where the files have slashes in the names and they're building a directory hierarchy in Swift. It may not be the best idea, but within HDFS you have to have some sort of structure or you lose track of the files, and you just formalize that; maybe it's a pattern. The data lifecycle is probably more important. As the data comes into Hadoop, you're streaming it in, and you're probably going to want to stage it someplace, either on a local disk before it gets into Hadoop, or in Hadoop itself. Then we'll have MapReduce jobs picking it up and doing something with it. So I talk about hot and cold data. What do you do with the data from the last 15 minutes? Do you try to get it into Solr quickly, have a very, very fast index for analysis, and then move it into MapReduce? Or do you do it the other way around? Then there are things like how we tier it. Big debate. I got really good feedback on this from the team at PayPal, which I should say thank you to. I can't see you guys if you're here, because of the lights, but you are here; they suffered through an early version of this presentation and have been collaborating on what's going on. But the idea of tiered indexes is: the data comes into the system and you're building the indexes in Solr. You can't index everything; it's going to be too much data. So what do you index? Do you index it in 15-minute intervals? Do you do 15-minute, half-hour, and 45-minute windows? There's some discussion around that. I don't know the right answer yet. I don't think it's a fundamental design decision, and again, that may turn out to be a parameter depending on the implementation, but tiering that data is important. Within Hadoop, you know, what do you compress? What do you not compress?
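The windowing question above (15-minute versus half-hour index intervals) ultimately comes down to bucketing log timestamps into fixed windows; here is a minimal sketch with the interval as the tunable parameter, since the talk treats the width as an open choice.

```python
from datetime import datetime

# Snap a log timestamp to the start of its indexing window. The window
# width is the open parameter being debated; 15 minutes is one candidate.
# Widths should divide an hour evenly so windows tile cleanly.
def index_window(ts, minutes=15):
    return ts.replace(minute=(ts.minute // minutes) * minutes,
                      second=0, microsecond=0)

ts = datetime(2012, 10, 18, 12, 37, 44)
print(index_window(ts))              # 2012-10-18 12:30:00
print(index_window(ts, minutes=30))  # 2012-10-18 12:30:00
```

All log records snapping to the same window start would land in the same Solr index tier, which is what makes the width easy to leave as a deployment parameter.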
You probably want to put fresh data in there uncompressed initially and get the indexes built, but at some point you want to take advantage of Hadoop compression so you can keep more log data in the cluster. You may want to throw it away after a day, but I think most people are going to keep more data. And then, at the tail end, once you have compressed the data or not: are you going to keep 15 years of OpenStack logs? Maybe you want to archive to tape; maybe you just want to have them automatically fall off the end. I know some Dell customers that actually take their log data, not just from OpenStack but from other places, and archive it in Swift. They've got compliance restrictions, so they take the original logs and copy them to Swift, and that's their master copy, and then they load them into other systems for analysis. So these are things where I have some ideas; I need to talk to other people and see what you think of this. It's probably a little design session to sketch it out. Which really takes me full circle to, you know, where do we go from here? In terms of what we should do next: this is very much a prototype. I'll show it to somebody if you want to hang around; it is on this box. But should we document what we have now? Does anybody else know of a related effort? I'm plugged into the Ceilometer project. There is a blueprint for the universal request ID, so we can follow a request through the whole OpenStack system; that blueprint is in progress right now. A lot of these things are not hard decisions. I think it's going to take some people who have experience with this; we sit around a table or an IRC channel, we say "let's do this," and we kind of agree. We'll do that until somebody proves it's a bad idea. That's kind of how I do things. We need some collaboration on the schema design. You know, what's the right structure of the schema?
We'll probably look at other implementations of log analysis to see what they do as well. Do we need changes upstream? I'm pretty sure we're going to have to tweak the Python logging. I'm not even sure if that's going to be an OpenStack change as much as a Python change, but at a minimum, it'd be nice to have OpenStack logging go straight to a stream instead of writing to disk. I don't know if that's going to be a plug-in to go straight to Flume, or an Avro translator of some sort, but that's probably going to happen. I don't know if there are other upstream changes; we'll know as we get deeper. And then I think the last thing, something that's come up multiple times, is it would be nice if, while we're doing this, we could get substantial sets of OpenStack log data put together in Avro format and give them to some Hadoopians and let them look at it and say, what do you find? That way they can start analyzing it soon. There are a lot of people who have clusters and are looking for work to do on them; I think we can give them some work to do. So those are the big-picture topics at this stage. We're running a little late for lunch, but I'm at the summit all week, so if I go back to the original slide, you'll find me around here. I don't know if there's interest in having an informal design session around this, unconference style. Anybody interested? Okay, so we do have interest. Good. I don't know how we'll coordinate that; we'll figure it out as we're breaking for lunch, but we should do that and do a little more brainstorming. It's our intent to release this as a pattern, with as much code as we can, so that it becomes something we can apply to multiple OpenStack distros and installations, but it's evolving right now. Okay. Well, thank you for coming. I'm going to spend another two minutes here and see if I can get the demo to behave.