Are we live? All right, sweet. Well, sorry for the delay, there was a little scheduling mix-up here. This is SparkHara, late but ready to go. I'm Michael McCune, a developer with Red Hat, and I'm joined today by Chad Roberts, who is also a developer with Red Hat, and Nikita Konovalov, who is a developer with Mirantis. I'm going to start with a little disclaimer: some of the code and the demonstrations you're going to see are not upstream yet. Just a fair warning. These are in GitHub repos; you can download them and experiment at your own leisure.

So what is SparkHara? We started with an ambition to create a heartbeat monitor for OpenStack based on the logs that are created by the service controllers. This was also a way to dogfood OpenStack: we could use Sahara to deploy a data processing cluster, consume the logs through Zaqar, and display them for someone to watch. It's also a silly name. SparkHara is a portmanteau of Spark and Sahara; I don't know how we put it together, but it just sounded funny to us. We also wanted to show a window into the possibilities of how you could analyze log data using Spark and then visualize it to see what's happening with a stack in real time, so you could see errors bubble up, or perhaps performance issues, as they happen.

And why did we do this? We wanted to create a real-world use case, a user story, that could show how you could consume these services to provide something for operators. We also wanted to use fairly straightforward methodologies: there are a lot of technologies out there that can be used to sort logs and accumulate them, and we wanted to show how you could do it in a simple manner. We also wanted to inspire all of you to take a look at this and say, hey, what can I do, where can I take this? The possibilities are endless.

So what did we use to build it? Of course we used the core services, as you would expect. Then we used Sahara, Zaqar, and Trove, and on top of those we used Spark, Zeppelin, and MongoDB. We're going to get into how we used these components and what they are. At this point I'm going to hand it over to Nikita, and he's going to talk a little bit about Sahara and Spark.

Okay, thank you Michael. So Sahara is an OpenStack service, and its main mission is to provide two things to the user. First of all, it provides a provisioning API, which allows the user to spawn Hadoop clusters in minutes. Then, when the user has their own cluster with Hadoop services running, Sahara provides a data processing API as well. I'll go quickly over what makes up Sahara and what is under the hood of these components, and then we'll go through the demo. Sahara is a truly OpenStack service which exposes all its functionality through a RESTful API. This RESTful API is reachable by a Python client, which is an SDK for other applications if they want to use Sahara clusters.
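As a rough illustration of what "reachable by a Python client" and "RESTful API" mean in practice, here is a minimal sketch of listing clusters by talking to the API directly with the requests library. The port, the v1.1 path layout, and the placeholder project ID and token reflect a typical Sahara deployment, but treat the endpoint details as assumptions rather than a verified recipe:

```python
import requests

# Assumptions: Sahara listening on its usual port 8386 with the v1.1 API,
# and a Keystone token obtained separately. The placeholders are illustrative.
SAHARA_URL = "http://controller:8386/v1.1/<project-id>"
HEADERS = {"X-Auth-Token": "<keystone-token>"}

resp = requests.get(SAHARA_URL + "/clusters", headers=HEADERS)
resp.raise_for_status()
for cluster in resp.json().get("clusters", []):
    print(cluster["name"], cluster["status"])
```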
We also have a CLI, so administrators can operate it and do all the same operations through the command line. From the user perspective, the dashboard in Horizon displays all the templates, all the jobs, and all the clusters that are operable in Sahara, with tenant filtering, name filtering, and all the UX features you might want from a UI. Of course, Sahara is a complex service with a lot of components under the hood, so all of these components are checked in the OpenStack gate jobs, and the provisioned clusters are checked by third-party CIs for non-fake integration testing, which spins up Hadoop and checks that it really works in your OpenStack cloud.

Let's go a bit deeper into Sahara. It has a provisioning engine; this is the component which decides how the Hadoop cluster gets provisioned on the OpenStack cloud. The main goal of this engine is to cover all the underlying operations and make them happen automatically. The engine operates in steps, and all these steps are logged in our recent feature, the event log, so the user can track how the provisioning is going. When the provisioning is finished, the cluster is operable, but that's not its final state: the cluster is scalable, and the user can add or remove nodes depending on their workload, changing the size and the services running on the cluster. Sahara also takes care of cleaning up the resources when the cluster is not used anymore, so deleting a cluster will remove all the VMs, volumes, floating IPs, and whatever else has been created for it, leaving your cloud free and clean. The provisioning engine has been tested to spawn a 200-node VM cluster in a matter of tens of minutes, and we're right now in the process of migrating to Heat provisioning, which will allow more flexibility and more reliability when you provision larger Hadoop clusters.

So what is inside what gets provisioned? Sahara provisions Hadoop in a plugin manner: there are different distributions, and Sahara has a plugin for each of them to provide the Hadoop that suits your use case the most. There is a vanilla plugin, which is the upstream Hadoop built by Apache, and it is the reference implementation of what a plugin should look like. We also have the Cloudera distribution and the Hortonworks Data Platform being provisioned as more complete Hadoop distributions, which have more than just Hadoop under the hood and are usually recommended for production use cases. There are a few solutions other than pure Hadoop builds: there is the Spark platform, and we're going to talk more about that today, but Storm support also appeared recently, which is a good fit for stream processing, and the MapR distribution is also available in Sahara. The versions of all these components are of course different, and we're trying to keep them up to date with the recent releases, so the plugins and the Hadoop versions are almost always up to date with the current upstream of Hadoop.

Then, elastic data processing (EDP) is the component that is responsible for running workloads on an existing cluster. Basically, there are two types of workloads Sahara can run today. The first one is Hadoop MapReduce, and that's big data classics with its Pig and Hive-style queries; but we also allow you to run any generic jar file or shell script, so if your job is compiled and you have tested it in a standalone Hadoop cluster,
you can just submit it through EDP and it will run. The second type of jobs are the streaming ones, and we have the Spark and Storm plugins for those. Of course, the Spark and Storm submission process is a bit different, but it is available through the Sahara CLI, the OpenStack dashboard, and the other interfaces, so you can experiment with that.

Now let's get a bit closer to Spark and today's talk. Apache Spark is a fairly new data processing engine, and one of its main advantages is that it runs in different modes; it's not bound to any specific clustering engine. It has its own standalone mode, which is provisioned by the vanilla Spark plugin, but it can also run on YARN or Mesos clustering. The YARN engine is supported by the Cloudera plugin, and support for it in HDP is in development right now. Spark is a bit different from Hadoop: it does not provide a storage layer, so it operates on whatever storage is provided to it, and it is not bound to the MapReduce concept, which is the core idea in Hadoop. So this is a quick comparison of the main differences, and these are the things you should think about first when you decide which engine you would like to go with, Hadoop or Spark.

In terms of functionality provided, Hadoop is mainly built around the MapReduce concept and it does that very well, while Spark gives you more variety: it has over 80 built-in functions, not only map and reduce but also different filtering, multi-pass operations, and other things. Storage is very important for big data processing. Hadoop tries to store all its data on persistent disk storage, which is as good as you can configure it, and it relies on that storage for data processing, while Spark tries to keep as much data as it can in RAM, run the data processing in RAM, and only write to storage when it's really necessary. As for non-MapReduce and streaming processing, Hadoop has that possibility, but it's very limited to the jar files provided by the Hadoop distributions, while Spark is way more advanced in stream processing because of micro-batching, its transformation functions, and support for almost any data source you can provide it. As for scaling capabilities, both of these frameworks are declared to be infinitely scalable, which is almost true but very hard to test, because they rely on the network, the storage, and other infrastructure; usually the framework itself is not the limit to its scalability, it's the other stuff.

So, the question that gets asked very often when people start comparing their Spark clusters to Hadoop: is Spark always better? Does it have more functions? Is it faster with in-RAM processing?
The answer is that the use case for each data processing job is different, and you should see the benefits in both frameworks. Spark is better when you have a reasonable amount of data and you have to do multiple passes while processing it; when it fits in RAM, your choice is Spark. And Spark is almost always better for streaming support, because it has richer data source support.

As I've said, Spark is supported in Sahara as a standalone implementation in the vanilla Spark plugin, CDH will give you Spark running in containers on the YARN system, and HDP will provide this functionality in a coming version. Right now the standalone mode runs the 1.0.0 or 1.3.1 Spark releases, and Cloudera also has a 1.3 release, which is available from the Cloudera 5.4 distribution. The configuration is available through the Sahara interface, just like the Hadoop configuration, so the core-site and other environment options can be passed through the Sahara API. They can be stored in templates, so you can have configuration that is reusable, and all of it is still provisioned automatically: Sahara will handle everything like provisioning the keys, spawning the services, formatting the storage, and everything else for a Spark cluster, as well as the authentication for Keystone and Swift data sources, which is required to run data processing on OpenStack. The Spark workloads are available through EDP, and Sahara uses the spark-submit script under the hood, but that doesn't mean your cluster is limited to the spark-submit tool, because the cluster is ready for any external tooling that can operate through Spark's APIs. So any Spark Python or compiled Java job, or a shell script, can be executed either through the Spark API, the EDP API, or directly by the user, and it will run in the same fashion. As for the data sources supported by Sahara right now, there are HDFS and Swift, and Manila support is coming soon. To talk more about how we can operate on the data from these different data sources, and about the exact use case we're trying to achieve with Spark and Sahara today, I'd like to hand it back to Michael.

Thanks, Nikita. All right, so we'll just go over the remaining pieces. Zeppelin is a web-based UI that allows you to interact with a Spark cluster. It's kind of like an IPython notebook: you can do visualizations, and you can run code samples directly from your web browser. It also allows you to collaborate, so if I create a notebook that has some particularly nice algorithm, I could share it with Chad, he could take it and run some visualizations and pass it back to me, and in that way we can share in some of the live processing that we're doing.

We also used Trove to set up databases. Trove is database as a service: you can deploy relational or NoSQL databases, and it can do MySQL, MariaDB, Postgres, Mongo, Cassandra, Redis. It really simplifies deploying the database for you. You can use it to create instances or clusters for your databases, and you can also use it to resize database volumes. There are some other operations available with Trove; you just have to explore and see how you can copy databases, back them up, shard them, those kinds of things.

And then we also used Zaqar. Zaqar is OpenStack's messaging service; we used it as a queue between multiple endpoints.
It can work across multiple tenants, it has a RESTful API like the rest of the OpenStack services, and it supports event broadcasting, task distribution, and point-to-point messaging. So you can configure your endpoints so that something pushes messages into the queue and Zaqar notifies the subscribers to let them know a message has arrived for them, or you can pull messages off the queue. Those are all the technologies we used to put this together.

So the next part is what we actually did. The OpenStack service controllers create a lot of log data, and those logs have really useful information in them, but it can be difficult to pick through them all. In the SparkHara workflow, the OpenStack service controllers create the log data, and then we have a process that takes the log data and feeds it into a Zaqar queue. As it goes through the queue, we have another process sitting on the nodes with our Spark processes, and they pull messages off the queue and feed them to Spark. Once Spark has the data, we can do the processing: you can hook in Zeppelin, you can normalize the data and send it to a database (we used Mongo), and you can also send it out to just about anything you want; in this case we used an HTTP endpoint. Spark is pretty flexible in what you can do; you're not limited to just a MapReduce kind of interaction, you can do whatever you want with the data that comes in.

To prepare for this we needed to create some custom images, and this is where the disclaimers and provisos come in. We were using Spark 1.5, which is one of the latest versions out there, and we created some custom images that we could deploy through Sahara to work with it. We also needed to configure DevStack to add in the Zaqar plugin and the Trove plugin so they were available to us, and then we needed to customize the Sahara cluster operations a little bit so they could deal with Spark 1.5, because right now Sahara only goes up to Spark 1.3. So we just had to tweak it a little bit to get these new images in there.

The next part was moving the log data. The first thing we did was configure the controllers so that the logs were formatted in a way that was easy to consume on the endpoints. The next step is reading the log files, because they just exist as files on a server somewhere and we need to move them to Zaqar, so we wrote a small Python application that tailed those log files and pushed the data onto Zaqar. Here's an example of how the controller might be configured; in this case it was the Sahara controller. We basically told it to output to a log file in /opt/logs, and then the next four lines are where things get interesting: that's where we started to format how we want the messages to come across. Really, we just went with a simple approach, using a double colon to separate the various fields of the log: the date-time stamp, what the process ID was, what level came back from the log, what the log message was, et cetera.
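As a rough illustration of that format, a line like the one below is the kind of thing we assume comes out of that configuration, and splitting on the double colons recovers the fields. The sample values and the exact field order are illustrative, not copied from a real deployment:

```python
# Hypothetical example of a "::"-separated log line and how it splits apart.
sample = "2015-10-27 14:03:12.531 :: 2748 :: INFO :: Cluster status changed to Active"

timestamp, pid, level, message = [field.strip() for field in sample.split("::", 3)]
print(level, message)   # INFO Cluster status changed to Active
```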
Now, reading the logs. This is where the Python application comes in; it was sitting on the same node as the controller, but it could sit anywhere as long as it can access the logs. We just ran this command and let it tail the logs as they happened. Instead of making it a multi-threaded approach, we went with a single process per log: start them up, have them feed the logs along, and in those processes we would then hand off to Zaqar. We just wanted to take a simple approach to this: tail the file, and for every line that's created, send that line straight to Zaqar. There are probably better ways you could do this, and you could use more complicated technologies, but we just wanted to keep it simple.

Also, tagging the service type: the approach we took was that, based on which socket was being used, we could identify which service was sending its logs in. There are other ways we could have done this; we could have tagged the service at the point where the raw logs were read and sent them up as something more structured, but we wanted to keep it as simple log lines coming across.

Log line chunking was a problem we ran into: sometimes when you get an error, a service controller configured in debug mode will emit a stack trace, and those stack traces don't come out cleanly formatted. So we needed to make sure those would get compiled into a packet that could be sent across to Zaqar as a whole, so that it didn't appear out of order when it arrived at Spark.
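Putting the sender side together, here is a minimal sketch of the kind of tailer we're describing. The log path is illustrative, and send_to_zaqar is a hypothetical stand-in for the python-zaqarclient call the real script made; the client setup is omitted:

```python
import time

LOG_PATH = "/opt/logs/sahara-all.log"   # path is illustrative

def send_to_zaqar(line):
    # Hypothetical stub: the real application posted each line to a Zaqar
    # queue through python-zaqarclient; that setup is left out of this sketch.
    pass

def tail(path):
    """Yield new lines appended to the file, similar to `tail -f`."""
    with open(path) as log_file:
        log_file.seek(0, 2)              # start at the end of the file
        while True:
            line = log_file.readline()
            if not line:
                time.sleep(0.5)          # no new data yet, wait a bit
                continue
            yield line.rstrip("\n")

for line in tail(LOG_PATH):
    send_to_zaqar(line)
```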
So, receiving and processing: this happens on the Spark nodes. The Spark nodes need to be able to listen to a Zaqar queue, take messages from it, and push them into the socket streaming endpoints for Spark. Then you have choices about what you want to do with the data at this point, because this is another area where things could get split up or sliced in a way that may not be advantageous. We again went with a simple approach: you're getting these log packets across, so take the whole packet and send it to Spark.

For receiving the data, we had another small Python application that we would run once per service we wanted to listen for. In this case you can see we just set up the port we wanted to send to, so that Spark knew what service and what queue it was, based on the service type we wanted to send across. We could have put all the messages into the same queue and tagged the data before we put it into Zaqar, but again, this was the first approach we took; it may not have been the best. This application would move the information from the Zaqar queue onto the same node where the Spark application was living.

Feeding Spark: there are different ways we can get information out of Zaqar. We chose the simple approach of looping and reading the queue to see if new messages were available, and if not, we just sleep and wait for a new message. We could have used the v2 Zaqar API to set up a subscription model, where Zaqar would have sent us the log messages as they occurred.

Now, sending the messages to Spark is where we get into how Spark processes incoming data. There is not a Zaqar socket streaming method for Spark at this moment, so we had to use a text socket stream, which basically reads newline-separated lines that come across the socket and forms them into RDDs, which are the basic primitive of Spark's processing. Based on a time slice, Spark would read some number of lines that came in on a socket and then ship those off into an RDD to be processed. This is where things could possibly slow down, because of the way we're separating the messages and how we're feeding them to Spark; there are a lot of timing issues involved here, and it gets pretty deep in the weeds, but you'll see.
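For reference, here is a minimal sketch of that receiving end in PySpark Streaming, assuming one text socket stream per service. The port number and the batch interval are assumptions, and process_packet is just a placeholder here (a fuller version is sketched in the next section):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def process_packet(rdd):
    # Placeholder handler; the real work (normalize, store, signal) comes later
    print(rdd.count())

sc = SparkContext(appName="sparkhara-log-stream")
ssc = StreamingContext(sc, 5)               # 5-second time slice; the interval is an assumption

# One text socket stream per service; the host and port are illustrative
sahara_lines = ssc.socketTextStream("localhost", 10101)
sahara_lines.foreachRDD(process_packet)     # each time slice arrives as one RDD

ssc.start()
ssc.awaitTermination()
```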
So we've got some data, it's in Spark now; what do we want to do with it? This is part of the Spark application we wrote. As the log lines come in, one of the things we want to do is normalize the data. You saw before that we used the double colons to separate the fields; now we want to break them apart and turn them into a JSON object that we can put into Mongo. So first we normalized the data, then we could store it: you could put it into Mongo, or you could build a more complicated schema if you felt like it and put it into a more structured database.

Then you can also signal an external application. This is the next step after we stored it. How would anyone know that this data was stored in the database if they weren't just looking at the database? So we wanted to signal an outside process, in this case an HTTP endpoint, that new data had arrived. We don't want to send all the log lines to this endpoint; we just want to let it know that new logs are available, that some number of them have arrived, and here are the IDs so they can be grabbed from the database. There are other things we could have done with this too. We could have used other services to initiate actions based on the information coming in; Spark could have been set up to signal an external server that something had just happened and you needed to take action, maybe send an email to an operator to let them know that a bunch of error messages had just arrived and maybe they should look at their stack. The options are really endless here; Spark is very open in what it allows you to do.

So, normalizing: what do we want to store, and how should we store it? We went with a very simple approach again, just packaging up the log lines and pushing them into Mongo. The way we put them into Mongo was: we stamped the time on them, we put the count of the log lines that had come in within that RDD, we marked whether there was an error in that packet, and then the log lines themselves. We just shipped each one of those packets to Mongo, which made it easy to consume on the other end. So: storing, signaling, wherever you want to take this. Getting the data out of Spark is the next point, and like I said, we just packaged it up, pushed it into Mongo, and signaled a listener on an HTTP endpoint to let it know new information had shown up. We could have also pushed this information back onto a Zaqar queue, which probably would have been a better approach because it could have been consumed by more services. Again, this was the first stab we took; next time maybe we'll put it on the queue.
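Here is a minimal sketch of what that handler might look like, filling in the process_packet placeholder from the earlier sketch. The Mongo host, the database and collection names, the notify URL, and the hard-coded service tag are all assumptions, and the error check is a simple keyword match rather than whatever the real application did:

```python
import time

import requests
from pymongo import MongoClient

mongo = MongoClient("mongodb://trove-db-host:27017/")    # Trove-hosted Mongo; host is illustrative
log_packets = mongo["sparkhara"]["log_packets"]          # database/collection names are assumptions
NOTIFY_URL = "http://dashboard-host:5000/notify"         # hypothetical HTTP listener

def process_packet(rdd):
    lines = rdd.collect()
    if not lines:
        return
    doc = {
        "timestamp": time.time(),                          # stamp the time on the packet
        "count": len(lines),                               # how many log lines came in this RDD
        "errors": any("ERROR" in line for line in lines),  # was there an error in the packet?
        "service": "sahara",                               # tag per receiver; hard-coded in this sketch
        "log_lines": lines,
    }
    doc_id = log_packets.insert_one(doc).inserted_id
    # Signal the listener that new data is available: just the id, not the log lines
    requests.post(NOTIFY_URL, json={"ids": [str(doc_id)]})
```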
So now we're going to get into how we visualized the data. At this point Spark has taken the data, normalized it, put some of it into the Mongo database, and sent some to an HTTP endpoint. We've also got Zeppelin set up, and the data is available to us. So at this point I'd like to turn it over to Chad, and he's going to talk a little bit about Zeppelin.

Okay, so one of the tools we used to visualize and explore the data from Mike's app was Zeppelin. Recall from an earlier slide that Zeppelin is our web-based UI, kind of IPython-notebook-like, that lets you fail quickly, which I did a lot of, and iterate based on that. I see two basic use cases that I went with on this. The first is rapid interactive development for data exploration: see what you've got, play with it a little bit, see what you can maybe make out of it; Zeppelin works really well for that. The other part is that it also comes with very cheap, free visualizations: it will give you some charts, and you can do quick, simple reporting with that. When you get a chart in Zeppelin, you can take a link to that paragraph (that's what they're called there) and send it to somebody else, and they get essentially a read-only version, so they can see the chart, and if you update it, their chart updates live for them even while they're looking at it. So it's a quick, nice thing that Zeppelin can give you.

For this SparkHara example, we took the data from Mike's app that was loaded into Mongo hosted by Trove. Then there's the Sahara tie-in: Spark was of course running on Sahara. And then we'll show you a quick, simple chart that was created, which is of course shareable. The first thing in Zeppelin here, and this isn't necessarily SparkHara related, is just a quick hello-world sort of thing. This is showing the SparkPi example; if you're familiar with Spark, I'm sure you've seen it in the Spark documentation. It just calculates pi using the Monte Carlo method, I think it is. The top paragraph is using Scala, and the bottom paragraph is doing the same thing but using PySpark. You can see the bottom paragraph starts with %pyspark; that lets Zeppelin know you're invoking the Python interpreter. Scala, on the top, is the default, so you don't really need to put a percent-anything there, though if you want to you can put %spark.

Then we get into the more SparkHara bit. One of the use cases was to quickly look at our data and maybe play with it a little bit. In the first few lines here you can see where we're grabbing the data from our Mongo database; we're just sucking it in, and we're going to create a DataFrame. If you're not familiar with Spark, just think of a DataFrame as kind of a database table: it has columns and you can query on it, but it's not really a database, so don't tell anybody I told you it's a database. Here we quickly slurped the data in and displayed the columns that were created in our DataFrame, so we can see that our data has a count, some errors, some log messages, a timestamp, and a service recorded along with it. Then we also ran a quick query, the last line before the output starts, and we're doing this on two services. It looks like Sahara had a lot of log messages; hopefully it was a good day, not a bad day, for Sahara. And for Neutron there were a few messages processed as well.
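As a rough sketch of that exploration paragraph, assuming the packets have already been pulled out of Mongo into an RDD of Python dicts (the mongo-hadoop connector boilerplate is omitted), and with column names following what was just described:

```python
%pyspark
# `packets` is assumed to be an RDD of dicts read from the Trove-hosted Mongo
# collection; the connector setup that produces it is left out of this sketch.
df = sqlContext.createDataFrame(packets)
print(df.columns)    # e.g. ['count', 'errors', 'log_lines', 'timestamp', 'service']

# Quick look: how many log messages did each service produce?
df.groupBy("service").sum("count").show()
```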
This next one, hopefully you can read it; we're at the big screen here. A lot of this, again, is boilerplate at the top: getting the data from Mongo again and loading it into an RDD. The bottom part is where I'm actually doing something a little more interesting with the data, grouping it into time buckets. The thought here is that I'm going to be making a chart out of this, so I'm creating some time-series sort of data that you can easily chart. There's a little bit of extra code here that we don't really need; we're not making all the charts that I originally made for a different version of this talk, we're just going to show one of them, so some of this code is a little extraneous for our purposes today. But it's fairly basic PySpark stuff, nothing too magical there, really.

So we've loaded all of our data into this DataFrame, and here is how we make a chart: it's just four lines of code to generate all this magic. %sql lets Zeppelin know we're going to be doing a query here. It's a pretty straightforward select on hourly_errors, which was the table, the DataFrame, that I created. You'll notice there's a drop-down there; Sahara is the currently selected service. There's a little bit of code there, but you can see where it says service equals and then there are some braces: that's how you create drop-downs, and you can also create text boxes, so the user can dynamically enter values and change the query, and the graph updates as soon as you change it. So that's quite a bit of functionality to get in just a few lines, and this whole thing probably took just a few minutes to create, really. You could probably do it quicker, because you're probably a better Spark coder than I am.
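For reference, that chart paragraph boils down to something like the following. The hourly_errors name follows the DataFrame described above; the column names and the drop-down options are assumptions, though the ${...} braces are Zeppelin's standard dynamic-form syntax:

```sql
%sql
SELECT hour, errors
FROM hourly_errors
WHERE service = "${service=sahara,sahara|neutron}"
ORDER BY hour
```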
So what else can Zeppelin do? I didn't want to do a whole demo on Zeppelin, but it has interpreters for Spark, PySpark, SQL, Hive, Flink, Cassandra, and lots of other things. It's open source, so if there's a framework you want it to work with, you can go at it; I'm sure they'd love to have it. Dynamic dependency loading: we actually used that for the drivers to extract the Mongo data. You can load those drivers at runtime; you don't even have to have them on your system, Zeppelin will go out and grab them for you. It can handle streaming jobs. If you're interested in Zeppelin, you can go run their tutorials; there's a really cool Twitter-based example that grabs a bunch of Twitter feeds so you can see what other people are tweeting. Not all of the tweets are safe for work; don't make the mistake I did. You can load other charting libraries if you're not happy with the basic charting they give you: you can use Angular, grab your own charting library, and use that; there are examples for that, I think, on the user list. If you wanted to run your notebooks on a set schedule, you can do that too; there are options in the UI for it. And, like I mentioned earlier, for reporting you can provide links to the individual paragraphs and share them that way.

The last thing, of course, is a little caveat: Zeppelin is not built into the Spark plugin for Sahara by default. There are a few options you have to get it, though, and none of them are that extensive; it's fairly simple to just install and start it manually. So if you started a regular Sahara cluster with Spark, you could just log into that cluster and install it yourself. I have a couple of repos you could use: one is a Sahara image elements change that lets you build the image with Zeppelin already on it, and the other is a slightly tweaked version of the Sahara Spark plugin that will start the Zeppelin process for you. That's not necessary if you want to just do it yourself, though. And there is also a community-contributed plugin for Ambari that will start up Zeppelin for you. So with that, we go back to Mike for some more visualization.

All right, thanks Chad. Okay, so we've seen what Zeppelin can do; what did we do? We rolled our own. We built a small Flask server that, as I mentioned before, was getting hit by Spark telling it that there was updated log information in the MongoDB. We used D3.js to do the visualizations, and we also had some non-visual receivers, learning about information coming in from the Spark server without necessarily visualizing it. If you just want to hear that some errors have arrived in your system, maybe it's time to send an email to an operator to let them know to check out what's going on. Beyond that, you could also script actions off of this; you don't just need to visualize it, you could create your own scripting out of it. Maybe if you knew a certain type of log message was going to come through and you wanted something to happen based on it, you could set your Spark app up to send to a scripted action that does more things: it might tell Nova to shut something down, or it might lock the system down, who knows.

This is just a small example of one of the applications we wrote. In this one, the logs from two services are being collected. The top graph shows you a total count of all the logs coming in, and the bottom graph is broken down by the two services: the orange colored line is Sahara's logs, and most of the time Sahara is pretty quiet, it doesn't do much; it runs a periodic task and you see a little blip there. The top one is Neutron, which is kind of noisy sometimes, so Neutron is kicking out lots of information for us. Then around the 50-second mark is where we started to do an operation with Sahara: we told it to register an image, and what you can see with the red line on the top is that an error has arrived. In the bottom pane here, you can see that we've clicked on that, and now the error has shown up and you've seen it. So this is one example.
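As a sketch of the kind of listener that Flask server implements, here is a minimal version. The route names, the payload shape, and the polling endpoint for the D3.js front end are assumptions about a plausible layout, not the actual application:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
recent_ids = []   # ids of newly stored log packets, kept in memory for this sketch

@app.route("/notify", methods=["POST"])
def notify():
    # Spark posts just the ids of new Mongo documents, not the log lines themselves
    payload = request.get_json(force=True)
    recent_ids.extend(payload.get("ids", []))
    return jsonify({"received": len(payload.get("ids", []))})

@app.route("/updates")
def updates():
    # The D3.js front end polls this to learn which documents it should fetch and draw
    pending, recent_ids[:] = list(recent_ids), []
    return jsonify({"ids": pending})

if __name__ == "__main__":
    app.run(port=5000)
```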
Hopefully the demo gods will be nice to me, and I'm going to show you another example. What I'm going to do here is all going to run on my local machine; I'm not using DevStack, I'm not using an external cloud service, I'm just going to run it here for the purposes of this demonstration. I'm going to start a Keystone server on my machine, and I'm going to visualize login attempts as they happen in real time. So I'm going to start some services here: the first thing I'll do is start up the Flask application that does the visualization, then the application that feeds log information from Keystone to it, and then the Spark application. Now, at this point I'm going to have to drop out of presentation mode. All right.

All I have at this point is just a graph; nothing's happening, you can see that. If I go back over here, Keystone is not creating any logs right now, and hence we don't see anything happening out here; no logs are being generated. But what I'm going to do is start a process that will create login attempts. It's going to look like someone just logging in; normally it's successful, and every once in a while someone misses their password, so once every 10 to 20 seconds there will be a failed login attempt.

All right, at this point we're starting to see some activity come through, and what you're seeing on the blue line is valid logins. And there we go: someone has forgotten their password at this point, or something; someone has failed a login attempt. So I'm going to come over here and attempt to click on that, and if I scroll down I can see the logs have populated here. I'll scroll through and say, okay, warning, there's been a failed login attempt. I look at these logs, we scroll back up, and it doesn't look like it's that bad; there's been very little activity here, so probably someone forgot their password. Now they're good, they log back in, it was no big deal.

But what I'm going to do now is simulate someone trying to brute-force a password. So I'm going to have it create 10, 20, 30, 40 bad login attempts per second over several seconds, and we'll see if we can visualize it. Okay, so now we're seeing some activity: a bunch of invalid login attempts. What's happening is, first of all, Keystone has gotten really noisy because there's a bunch of activity, but we're also seeing, okay, what the heck is all this, and why have all these bad login attempts started to happen? So if I click on one of these, and I can work it like this, you can see that we're looking at the logs and they're being sorted, and we can see that something is happening in real time; these things are being interleaved, and we're seeing bad login attempts. If I needed to, I could come down and click on one of these points, and this will be very challenging on this screen, but there we go: now you're seeing all the errors collated together, based on time. So if we needed to look and see what just happened, we could go back through these logs and look at it that way.

So this is one example of how you could use this to work on the logs that are coming through Spark. In this case, all I did was separate the login attempts and just look for the keywords "warning" and "you've made a request that requires authentication". It was a very simple approach to picking out those log lines and informing that something has happened. And we can hit another one if we want to, just for fun.
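For reference, the filtering in that demo's Spark app amounts to something like this; the match strings follow what was just described, but the functions themselves are a sketch rather than the code that was run:

```python
def is_failed_login(line):
    # Keystone logs a warning containing this phrase when authentication fails;
    # the keywords follow the demo description above
    return "WARNING" in line and "request that requires authentication" in line

def summarize_logins(rdd):
    # Summarize one time slice of Keystone log lines for the visualization
    lines = rdd.collect()
    failed = [line for line in lines if is_failed_login(line)]
    return {"total": len(lines), "failed": len(failed), "failed_lines": failed}
```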
All right, so where does that leave us? If I can get back into presentation mode... all right. So, having said all that and shown you the demo, what did we learn from this? PySpark works really well. It was pretty enjoyable to write the application in PySpark because it was Python, and it was very easy to interact with all the pieces of OpenStack. We could use the OpenStack clients: as long as we had them loaded on the Spark nodes, PySpark could access them. The OpenStack Python clients, arigato gozaimasu to all the OpenStackers who work on those things, because they all worked beautifully; it was very easy to use them. Manipulating the log data was also easy using the Oslo logging engine: it was very easy to separate the log lines, break them up, feed them through to our database, and normalize them in Spark. That all worked great.

Things that could have worked better: image creation is still a really big issue. Whether we're creating images for Sahara or images for Trove, we still have to package these images and then deploy them to the cluster, and in some cases we ran into a lot of difficulty getting the images to build properly. I don't want to name names, so I'm not going to.

So, what next? This was all pretty custom built, and we injected a lot of the routing and the pathways by hand. We could increase the integration between these components. When I say creating OpenStack applications, what I mean is that if we had written this as a Python application that interacted separately with Sahara and Trove and all the different pieces, we could have had it pull the IDs for the different components and put them all together automatically, rather than us having to write the individual pieces and plug all the endpoints in by hand. This could all have been done from an application that sat at a higher layer than even Sahara and interacted with the cloud in the same way, and it could even have injected values into our Spark application.

With all that being said: does it scale? We didn't get to the point where we could test it at a large enough level to say, all right, here's 10,000 logs a second, can it handle it? I'm pretty sure Spark can handle it, especially if we scaled out the Spark cluster, but it's difficult to tell whether that would have happened purely with the solution we were using. So if you're thinking about building something like this and you want to grow it out to scale, you might look at some of these other technologies. Kafka is a queue mechanism you could use in place of Zaqar. Fluentd and Logstash are ways to aggregate logs, transform them, and do work on them. InfluxDB is a way to store log information in a similar manner. Kibana would have been an option to visualize this had we been using Elasticsearch, and Grafana would have been another way to visualize it if we didn't want to write our own custom D3.js application.

So, I think we've come to the end of the road. Any questions? All right, well, thank you all very much. Go visit the Red Hat booth, go visit the Mirantis booth.