So, hi everyone, happy to be here today to talk about how we took the first steps to make OpenStack support real-time data analytics. We have done that by implementing a Storm plugin for Sahara. My name is Andre, I'm from UFCG, Telles works with me, and Michael is from Red Hat. So the first point I want to bring to you today is what we call event stream processing. That's how we like to refer to real-time data processing. Event stream processing is basically a set of tools and techniques that make it possible to process data in real time. So for example, if you want to do data aggregation, then you have to consider time windows. If you want to do machine learning, you have to consider online algorithms. And we need such things for applications where we need to react quickly to some information. For example, if you are monitoring a system and you want to react as soon as possible to bad things that happen in your system, you would like to do this in a real-time fashion. You don't want to store the data, the logs, the events, the metrics, and then later in the day, or ten minutes or half an hour later, run a Hadoop job that processes that log and then triggers some actions. The other kind of applications that may need such an approach are the ones where you have too much data to store. Going to the extreme, you can look at the applications from CERN, where the sensors, when running at full speed, produce one petabyte of data per second. And you cannot store that, right? So there are many applications where you have a lot of raw data that needs to be filtered, aggregated, and summarized, so that you only store the relevant data to be processed or accessed later. And to help us with these jobs, there are a lot of tools available today. If you look at the open source world, we have Apache Storm. You have Spark Streaming.
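The windowed-aggregation idea just mentioned can be sketched in a few lines. This is an illustrative example, not code from the talk; the event data and window size are made up:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Group (timestamp, key) events into fixed, non-overlapping time
    windows and count keys per window -- the kind of aggregation a
    stream processor does continuously, instead of a later batch job
    over stored logs."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % window_size)  # start of the window this event falls in
        windows[window_start][key] += 1
    return {w: dict(c) for w, c in sorted(windows.items())}

# metrics arriving over a 20-second span, aggregated in 10-second windows
events = [(1, "error"), (3, "ok"), (7, "error"), (12, "error")]
print(tumbling_window_counts(events, 10))
# {0: {'error': 2, 'ok': 1}, 10: {'error': 1}}
```

A monitoring pipeline would trigger an alert as each window closes, rather than waiting for a batch job over the whole log.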
You have Samza. And there are also things in the cloud; Amazon has Kinesis. And there are also a lot of open source academic tools that address one aspect or another of the fault tolerance and scalability domains. This is one we are working on too. Storm was started at BackType, which was acquired by Twitter, and then in 2014 Storm became a top-level Apache project. We chose Storm basically because it's a well-known, well-adopted tool. It's open source. It supports scalability. It supports fault tolerance. It integrates well with messaging systems like Kafka and RabbitMQ. It reads from databases, reads from a lot of data sources. And it also enables you to write operations and processing elements in a lot of languages, not only Java, but also Python and others. Once you have your application and you want to deploy a Storm cluster, you need basically three types of nodes. You need what we call the Nimbus node, which is like the coordinator of the cluster. You need supervisors, which are daemons that run on the Storm nodes and control the actual jobs being run. And you have ZooKeeper, which coordinates the communication between the master node, the Nimbus, and the worker nodes, the supervisors, so that you can handle fault tolerance. When you think about the application itself, you have to think of your application as a directed acyclic graph. Here on the left-hand side, you have the data sources. In Storm, you call them spouts. Spouts are the input adapters that get the information from outside the data processing system, like Kafka or RabbitMQ, and put this information, this data, into the data processing system. And once the data is in the data processing system, it flows through the bolts, which are the processing elements, the ones that actually transform, aggregate, and filter the data and produce results.
And the data flows through these bolts until at some point it can be put back into the real world: for example, into a dashboard, into a database, or published back into Kafka or RabbitMQ. Then once you have a sequence of bolts, or even a spout and a bolt, you need to define how these two will communicate. On one hand, you can have a bolt which is doing some kind of processing activity, and this bolt will have several instances, because you need scalability, for example. It produces outputs that will be consumed by another bolt. When you are going to connect these two, you have to think about what kind of communication pattern you want between them. The four approaches I put here are the most common. It can be the case that any instance of the first bolt produces outputs that go to any instance of the second bolt; you are load balancing by using random distribution. Another case is when you want all the tuples produced by instances of the first bolt to go to all the instances of the second bolt; that's the all pattern shown here on the upper right-hand side. The lower right-hand side is a case where all the data output by the first bolt goes to exactly one of the instances. You do that, for example, when you want to aggregate all the data in a single instance of the second bolt. And the last one, on the lower left side, the fields grouping, is when you do something like the reduce side of a MapReduce job. You want all the outputs from the first bolt which have the same key to go to the same instance of the second bolt, so you get this kind of grouping. So this is how you connect the bolts and spouts in your application. Now that we have understood at least a little bit of the concepts of Storm, I have here a simple example that I took from the storm-starter package. It simply shows a data source; it may be hard to see on the screen.
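Two of those grouping patterns can be simulated in plain Python to make the routing behavior concrete. This is a sketch of the concept, not Storm's implementation; the bucket layout and tuple data are made up:

```python
import random
import zlib

def shuffle_grouping(tuples, n_instances, seed=0):
    """Random distribution: each tuple goes to an arbitrary downstream
    instance, which load-balances the work across instances."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(n_instances)]
    for t in tuples:
        buckets[rng.randrange(n_instances)].append(t)
    return buckets

def fields_grouping(tuples, n_instances):
    """Key-based routing: tuples sharing a key always land on the same
    instance, like the reduce side of a MapReduce job. crc32 is used
    here because it is a stable hash across runs."""
    buckets = [[] for _ in range(n_instances)]
    for key, value in tuples:
        buckets[zlib.crc32(key.encode()) % n_instances].append((key, value))
    return buckets

word_counts = [("storm", 1), ("sahara", 1), ("storm", 1)]
grouped = fields_grouping(word_counts, 2)
# both ("storm", 1) tuples are guaranteed to land in the same bucket
```

The "all" and "global" patterns are simpler still: send every tuple to every bucket, or send every tuple to one fixed bucket.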
You have a data source, which in this case is basically generating random words. In practice, it would get the data from an external source, like Kafka or a database. This data source is producing words that go through a bolt. In this case, the application is very simple: it just transforms the string into another string, which is the same word with a couple of exclamation marks at the end. So the processing is very basic. And then, to illustrate several bolts in a chain, I put another bolt that does exactly the same. Here are the links for the examples, and here's the code. The code is very simple. Here you have the spout. The first thing you do is initialize the collector, which is the object responsible for emitting the output, the tuples. Next is a method that is called when Storm is ready to process the next message, the next event. In this case, because we are not getting the information from outside the data processing system, I'm just generating those words in a random order. And the last step is declaring what kind of outputs this spout produces. So the spout is basically this. The next point is to take a look at the bolt. The bolt is not too different. First you have the method that is called when an event, when something, is available for it to process. Our processing here is very simple: we just get the tuple from the parameter, and then we create another output that's basically the string of the original tuple plus a couple of exclamation marks, and we emit it to the next component. After that, what we do is just acknowledge that the tuple was successfully processed. This is useful when you're dealing with the fault tolerance aspect, so you don't need to replay this tuple in the future. Again we declare the output, and here I'm just saying that there is only one field in the output, and this field is named word.
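To make the spout/bolt structure concrete, here is a plain-Python sketch of the same shape. The real storm-starter code is Java and uses Storm's OutputCollector; the class and method names here are illustrative stand-ins:

```python
import random

class WordSpout:
    """Sketch of the storm-starter word spout: next_tuple() emits a
    random word; a real spout would pull from Kafka or a database."""
    WORDS = ["nathan", "mike", "jackson", "golda", "bertels"]

    def __init__(self, seed=7):
        self._rng = random.Random(seed)

    def next_tuple(self):
        return self._rng.choice(self.WORDS)

class ExclamationBolt:
    """Sketch of the exclamation bolt: execute() appends '!!!' to the
    incoming word and emits it; ack() marks the tuple as processed so
    the fault tolerance machinery does not replay it."""
    def __init__(self):
        self.acked = 0

    def execute(self, word):
        result = word + "!!!"
        self.ack()  # acknowledge successful processing of this tuple
        return result

    def ack(self):
        self.acked += 1

spout, bolt = WordSpout(), ExclamationBolt()
print(bolt.execute(spout.next_tuple()))  # a random word with "!!!" appended
```

In real Storm, `execute()` would call `collector.emit(...)` and `collector.ack(tuple)` rather than returning a value.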
Now that we have both a type of spout and a type of bolt, what we do here is connect these components. In this highlighted part of the code, what I'm doing is just creating a spout of the type that I just showed you. Then I create a bolt which consumes data from the spout, and it uses the shuffle grouping there. Then I create another bolt which consumes the data that is output from the first bolt. So now I have a chain with three components connected to each other. Then there is some basic configuration code which ends with submitting the topology to a Storm cluster. The topology now is composed of three components: the spout, the exclamation bolt, and then another one that does exactly the same thing. The three are connected by a shuffle grouping messaging scheme. If you want to run that, all you need is to clone the storm repository, use Maven to build the binaries, and then you are ready to go; you can just execute them. And what I'm going to do now is pass the word to Michael, who is going to talk to you about Sahara. Thank you, Andre. Please. All right. So we learned a little bit about Storm, and now we're going to talk about how you might deploy Storm on OpenStack. Sahara is the OpenStack data processing service. What that means is we can use Sahara to deploy many different data processing frameworks: Storm, Spark, Hadoop. We have vendor-specific as well as generic packages. Like most OpenStack services, it has pretty standard interfaces. What you can use them to do is provision clusters of machines based off of node group templates that we make. You would define these templates based on the type of topology you might want. So depending on what framework you're using, whether it's Hadoop or Spark, you would define what the node group you want to create is. And then you would collect these node groups together to create a cluster.
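The wiring just described, a spout feeding two bolts in a chain, can be pictured with a small plain-Python analogy. This is not the Java TopologyBuilder API, just an illustration of how one tuple flows through the chain:

```python
def run_topology(spout, bolt_chain, n_tuples):
    """Pull n_tuples from the spout and push each one through the chain
    of bolts in order, collecting the final outputs. Storm runs this in
    parallel across many instances; here it is sequential for clarity."""
    results = []
    for _ in range(n_tuples):
        t = spout()               # like nextTuple() emitting into the topology
        for bolt in bolt_chain:   # each hop is one grouping edge in the graph
            t = bolt(t)
        results.append(t)
    return results

exclaim = lambda word: word + "!!!"   # stand-in for the exclamation bolt
words = iter(["nathan", "mike"])
print(run_topology(lambda: next(words), [exclaim, exclaim], 2))
# ['nathan!!!!!!', 'mike!!!!!!']
```

In the real Java code, each edge in `bolt_chain` corresponds to a `shuffleGrouping(...)` call on the TopologyBuilder.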
At the time that you want to provision the cluster, you would tell it: what image should I use to do this? How many machines would you like to have? What networks will they be associated with? And then you can launch the clusters. We can have a combination of static clusters that might exist for many jobs, or you can create clusters on an ephemeral basis. So if you have a single job, you can launch the job and have it create a cluster for the length of the job, and at the end of the job, the cluster will be destroyed. You can also use it to scale clusters; this would be most applicable for static clusters. Once you've deployed a framework and you've created the cluster, if you decide at some point that you need to add elasticity to that cluster, that you want to grow it in response to demand, then you can use the interface that we have to grow the cluster by simply telling it: add this many more of this type of node group, or maybe subtract a few of the other node group. In this way you can grow the clusters. And then we also provide an interface for running jobs on the frameworks. We have support for many different types of jobs: Pig, Oozie, Hive. You can run Java-based jobs. You can even run shell-based jobs. So if you know the shell commands that you want to run within the cluster you've created, you can package those and ship them off as a job. And Sahara will handle packaging the job from the user interface and sending it to the cluster, monitoring the job as it's running, and then returning the results to you in the format that you've declared. Whether those are files that get stored in Swift or perhaps on HDFS, Sahara will take care of this for you. It's fully integrated into the OpenStack dashboard, so if you install Sahara alongside Horizon, you will get an interface in Horizon that allows you to do all these things.
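As a rough illustration of that scaling interface, here is a sketch of what a cluster-scaling request body might look like. The field names follow my understanding of Sahara's v1.1 REST API, and the node group names and template id are made up; treat the exact schema as an assumption, not a verified reference:

```python
import json

# Hypothetical scale request: resize an existing supervisor node group
# to five instances, and add a brand-new node group of two machines.
scale_request = {
    "resize_node_groups": [
        {"name": "storm-supervisor", "count": 5},
    ],
    "add_node_groups": [
        {"name": "extra-workers",
         "node_group_template_id": "<template-uuid>",  # left unresolved on purpose
         "count": 2},
    ],
}
print(json.dumps(scale_request, indent=2))
```

The point is the shape of the operation: scaling is expressed as per-node-group counts, not per-machine requests.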
You'll be able to register images that you would have uploaded through Nova or Glance. You'll be able to take those images and form them into clusters. The way you'll do this is, like I said, you would create node group templates and then you would combine those templates together to form a cluster. Then you could launch the cluster. There are also options to run jobs, to create job templates, to create the binaries that you might want to run your jobs with, and to create the data sources that will be supplied to those jobs. The job binaries and the data sources can be stored in Swift, they can be stored in HDFS, and they can even be stored now on NFS volumes and brought into the job as it's running. So like I said, we have support for a variety of processing frameworks: Hadoop, including the Cloudera, MapR, and Hortonworks specific distributions. There's also vanilla Hadoop, so if you just want a bare Apache implementation, you can use that. There's also Spark and Storm. In the case of Spark, that's integrated with the Cloudera plugin, and in the future there might be support for a basic Apache installation. And I believe Storm right now is just a basic Apache installation. So what we're looking at here is the general Sahara architecture. This is the core of Sahara. It integrates with OpenStack in the way that many services do. You can see up on the left there, you've got Horizon and the Sahara pages that are integrated there. They talk through a standard Python client, similar to many of the other services in OpenStack, which speaks to the REST API that Sahara exposes. Internally, Sahara uses an authentication methodology that passes the credentials through to Keystone to validate all the operations that you're performing. So you can confine operations within projects; you can do almost anything that you would do with policy in any other OpenStack service.
You can limit what actions a user can perform based on their Keystone identity. Then you can see we've got the vendor plugins and the EDP, which is what we call Elastic Data Processing, which is the core of our job engine and how we distribute jobs to the Hadoop, Spark, or Storm clusters. You also see on the bottom right there we have the provisioning engine, the image registry, and the data abstraction layer, which is really an internal component of Sahara. The provisioning engine and the image registry will talk to Heat and Cinder and Glance to automatically provision whatever cluster you've asked for, or to pull images from Glance to associate with the cluster. In the upper right, you'll see, visualized here as a cloud, whatever your cluster is. In this case, we're showing an example of a Hadoop cluster that's made up of various VMs. Those VMs can communicate with Swift using a credentialing system based off the user who's logged in. It can take data from Swift and bring it into the cluster, or store data that's being generated in the cluster back out to Swift. Now, this diagram doesn't show some of the optional components that we've added. We've added support for Manila, which can provide shares. Right now it's just NFS, but in the future we can provide HDFS shares as well. So you can store to and read from Swift right now, but if you have Manila deployed on your cloud, you can also attach Manila shares to jobs that are running, so that files can be stored onto an NFS volume or read from an NFS volume. We're also working towards Barbican integration to improve the security of the passwords that Sahara uses to communicate with the cluster. And we're looking into further uses, like: can we integrate with Trove to bring database support into this?
Can we integrate with other services like Zaqar to help with message passing from the Hadoop cluster back into the control plane where Sahara is? So, just to summarize what we've gone through here: Sahara provides a REST interface that gives you access to all these features. We also have a Python client and a command line client that give you the same access to all the features that we've just talked about. You can run jobs of several different types: Pig, Hive, Java, Shell, MapReduce, and Streaming jobs. You can access data that's stored in Swift, HDFS, or Manila shares. And with elastic data processing, you've got the ability to use a static cluster that you might have generated ahead of time and just left running, or you can make a transient cluster that exists per job. And then the clusters are scalable. Now, this doesn't work for the transient clusters, because those only exist for the job that you create, but for static clusters, you can grow and shrink them as necessary. In terms of expansion, Sahara has a plugin-based structure that allows us to add new data processing frameworks as they become available. So what the dev team does is, as new processing frameworks become available, such as new versions of Storm, we can implement those in a plugin and make them available. Right now we have Storm 0.9.2, but in the future we might implement different versions, and they can live side by side. Then as a user, you can choose, when you generate your cluster, which processing framework or which plugin you'd like to use in that cluster. And it also allows us future expansion for new frameworks that might come along, whatever the future might bring us. And with that, I'm going to hand it over to Telles, and he can tell you a little bit about the application. Thank you, Michael. So we heard from Andre, and he talked about Storm and event stream processing.
Michael talked to us about Sahara, and I'm going to talk to you about how these two parts work together. So Storm became part of Sahara in the Kilo release, about six months ago. And in this release now, Liberty, we actually made it complete: now you can use Storm via the UI, so you can have a complete user experience of Storm through the OpenStack dashboard. I'm going to show you the steps to create a Storm cluster. The first thing you need to do is create the node group templates. For Storm, you need a process running Nimbus, which is the master; you need a process for the supervisor; and you need a ZooKeeper. These could be on the same machine or on different machines; here you create a node group for each. The next step, after you create the node group templates, is to create a cluster template. The cluster template contains the configurations and the information about how many instances of each node group template you're going to need. For this example, we have one ZooKeeper, one master, and three slave nodes. You have all the configuration in this template to launch the cluster. Then you can just press Launch Cluster, and Sahara is going to give you an active cluster in about a couple of minutes; it doesn't really take long. Once the Storm cluster is active, you can actually browse the Storm UI. This is at the master node, at port 8080. You can see information about the cluster: the number of supervisors, slots, and everything. This is provided by Storm; it's not Sahara itself, but you have access to it. The next step is running a job. So you have a cluster, and you want to run a Storm job. The first thing you need is a binary. In this case, we have the storm-starter binary that Andre talked about before, and we have it on Swift, so we're just pointing to our Swift binary there.
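The three node groups and the cluster template just described can be pictured as data. This is only a sketch; the names and field labels are illustrative, not Sahara's exact schema:

```python
# One node group template per Storm role, as described above.
zookeeper_ng  = {"name": "storm-zookeeper", "processes": ["zookeeper"]}
nimbus_ng     = {"name": "storm-master",    "processes": ["nimbus"]}
supervisor_ng = {"name": "storm-worker",    "processes": ["supervisor"]}

# The cluster template says how many instances of each node group the
# cluster needs: one ZooKeeper, one master, three slave (worker) nodes.
cluster_template = {
    "name": "storm-cluster",
    "plugin_name": "storm",
    "node_groups": [
        {"template": zookeeper_ng,  "count": 1},
        {"template": nimbus_ng,     "count": 1},
        {"template": supervisor_ng, "count": 3},
    ],
}

total = sum(ng["count"] for ng in cluster_template["node_groups"])
print(total)  # 5 instances in the whole cluster
```

Launching the cluster then amounts to handing this template plus an image and network choice to Sahara.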
Once the binary is created, we need to create the job template. The job template has a name, which could be anything you want. You need to select the job type (in this case, you have to select Storm), and you need to give the main binary, which is the storm-starter jar we just created. There are actually other options, like libs and interface arguments, which is brand new, but you don't really need to add any libs to run the Storm job if you compiled your jar file with all the dependencies it needs. Once you have the template created, you can just click on the cluster; it could be a new cluster or an existing cluster if you already have one. And the Launch Job window is going to come up. The Launch Job window gives you two tabs. The first one is the Job tab, where you have to select the cluster you're going to use; in this case, we selected the Storm cluster. Then it goes to the Configure tab, where we have to add some configuration to run the job. The most important one is the main class: you have to give the jar's main class for Storm to run on the command line when it submits the topology. In this case, the main class is storm.starter.ExclamationTopology. For this particular example, we don't have any arguments or anything else, so this is enough. But if you had arguments to be passed on the command line, you could click on Add and then give each argument, one at a time. I'm going to show you later how that works. Once you click Launch Job, this window is going to come up, and you can see that the job is running. You can check out the Storm UI again, and you can see in the topology summary that the exclamation topology is running. It shows some status, like whether it's active, the uptime, and the number of workers it's using.
And if you actually click on the exclamation topology, it's going to show a lot more information about latency and everything that's going on in the Storm cluster, which could be useful depending on the application that you're running. So what are the next steps for Storm and Sahara? We're thinking of improving integration with data sources. If you want to use Kafka as a data source right now, you're going to have to create your Kafka cluster by yourself and then make Storm communicate with it. We want to make that automatic, so if you choose to use Kafka, we want to have it all set up for you. Michael has done some good work integrating Zaqar with Spark, and we're thinking of going in the same direction with Storm; we want to make that work too. We want to add the ability to run Python jobs. We're all OpenStack developers, and we like Python better than Java, so we want to make that easier for you. And we also want to bring newer versions of Storm into Sahara. Right now there are three newer versions, 0.9.3, 0.9.4, and 0.9.5; we're still at 0.9.2 in Sahara. So we want to update that in the next release. Now I'm going to explain the demo that I'm going to do. The idea of this demo is that I have a Twitter crawler that's going to crawl for every tweet that contains the word Sahara, and it's going to put it into a Kafka queue. Then we have a spout. This spout is going to read from the Kafka queue and throw the tweets to the bolt. This bolt is going to filter every tweet that has the word Storm in it, and then we're going to throw those back to Kafka. And then I have a web application that's going to print the tweets. So I'm going to show a little bit of the code; it's pretty simple. This is the bolt. Actually, I didn't show the code of the spout, because it's a spout that comes with the Storm package, so it's ready to use. We don't really need to take a look at that; we just use it.
The code that I wrote is really simple. We just get the tweet as the input, and then we check if it contains the word Storm. Then we create a Kafka configuration, connect to it, and send the tweet to Kafka on a specific topic that's going to be read by the web application. And at the end, we ack, in order to tell the spout that the message was already processed. Then in the topology, putting it all together, we have the spout, the Kafka spout here; you're probably not going to see it. So we have the spout here at this line. This is the Kafka spout whose code I didn't show; I didn't write it, it's from Storm. We just need to use the Kafka spout and give it the configuration of the topic it's going to read from. And then we create the bolt that reads from the spout using the shuffle grouping that Andre already explained. The rest is just configuration, and at the end, we submit the topology. So right now, I'm going to try to make this work live. I'm going to ask you: please tweet something like "Storm works great in Sahara" in the next couple of minutes. Right now, I'm starting the Twitter crawler, which takes a little bit to start. And here I have the dashboard from our cloud that I have running back in Brazil. It may be a little slow because we're quite far away. But you can see we have a Storm cluster running here, and I have a job template, which is Sahara Storm tweets, here. I'm going to launch on an existing cluster. I select the Storm cluster. I'm going to configure it with the OpenStack Summit Sahara tweets topology. I don't really need that. I have two arguments here; as I said before, it's a sequence, so the first argument is on the first line, and the second argument is on the second line. And here, I can just launch the application. Let's hope it works. It should be running before long. So it's running. We can check in the Storm UI that it's actually running; you can see that the application here is running.
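The heart of the demo bolt, filtering the incoming tweets that mention Storm, boils down to logic like this. It is a Python sketch of the Java bolt; the real code also publishes each match to a Kafka topic and acks the tuple, which is omitted here:

```python
def filter_storm_tweets(tweets):
    """Keep only the tweets that contain the word 'storm'
    (case-insensitive) -- the filtering step the demo bolt performs
    before forwarding matches to the web application's Kafka topic."""
    return [t for t in tweets if "storm" in t.lower()]

incoming = [
    "Storm works great in Sahara",
    "hello from the summit",
    "loving #storm and #sahara",
]
print(filter_storm_tweets(incoming))
# ['Storm works great in Sahara', 'loving #storm and #sahara']
```

Everything else in the demo topology, the Kafka spout and the producer configuration, is plumbing around this one predicate.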
I'm going to click on it so you guys can see some other information come up. And here I have the web application running that's going to print the tweets. So if you guys don't mind, now is a good time to tweet something using Sahara and Storm; otherwise, this demo is not going to be good. If no one tweets, I'm going to have to tweet myself. Anyone? Oh, there we go: "Storm and Sahara at OpenStack Summit with Telles in the break." Thank you. I think that's actually a pretty simple application; we just wanted to show you that it works, and that Storm is ready to be used via Sahara. And it's quite simple. So I'm going back here. Thank you to everyone who tweeted; if anyone else tweeted, it would show here. So we have to give thanks to HP and CNPq from Brazil, who supported our development and our trip here. And I want to thank you all for coming; I appreciate your time. And if you guys have any questions, we are open to answer them, Andre and Michael also. So thank you all. I have a mic here if anyone wants to ask anything. I can help with the mic. No? Good. Yes. OK, so: not yet for containers, not yet. Well, actually, that would be a word from Sahara itself, not Storm, and I think we're not quite there yet to support containers. Thank you. No, there's something. Actually, it doesn't show everything, because the Twitter API just gives us a sample of tweets; that's why I needed a lot of tweets. But it showed one, which is good. It got at least one, so it's real. We can speak to the container question, because that's something that we're interested in: learning how to deploy our clusters using containers instead of virtual machines. So right now, Sahara uses virtual machines to deploy a cluster, but what we're actively looking at is the possibility of creating clusters that are deployed through containers. It's not there yet, but it's something that we're interested in.
And because we use the common OpenStack infrastructure to deploy the virtual machines, we're looking at other projects that are starting to do this, like Magnum, which would provide container support, to figure out if that's something we can do in the future. Can we deploy a cluster that would be all containers? It's not there in Sahara right now, though. OK, so I thank you all again for coming. Appreciate it. Thank you. Thank you.