I'm Michael, and this is Gil. If you're observant, you'll notice there are three names up there, not just two. Unfortunately, Reed Reynolds, our co-presenter, was unable to make it at the last minute for personal reasons, so we're going to try to cover his material as well. We're going to go over the perfect match: Apache Spark meets OpenStack Swift. What we'll cover is a very quick overview of OpenStack Swift, a quick overview of Apache Spark, and then Gil will talk about how to integrate Apache Spark with OpenStack Swift. We'll show a demo of using Apache Spark on a public Swift object store. We'll then talk about some considerations if you want to deploy this yourself, in terms of how to organize the clusters: Swift obviously runs on a cluster, and Apache Spark also runs on a cluster. And we'll finish with some advanced work we've been experimenting with, showing how we can pull Swift and Spark together in a differentiated and perhaps more efficient way.

Let me start by getting a sense of what people know. How many people here have used OpenStack Swift? And how many people are familiar with Apache Spark?

Swift, an OpenStack project, is a massively scalable object store. It's not a file system; it's not a database. It stores objects, blobs. It's good for storing things like images, videos, logs, and JSON documents. And it's good for hybrid clouds, because there are a bunch of public cloud providers that support it, and it's something you can deploy in your own data center. It's massively scalable, software-defined storage: basically, you download the code for Swift, deploy it on a set of servers, and you have an object store. It exports a RESTful API, meaning you access it through HTTP requests: you do a PUT to create an object, a GET to retrieve an object.

A little bit about the layout of Swift. It is basically a two-tiered architecture. The externally facing tier, the one the client interacts with, is the proxy tier. The proxy tier is responsible for receiving requests and redirecting each one to a node in the storage tier, which actually serves the request. At the next level of detail, Swift has accounts; in accounts are containers, and in containers are objects. And there are different types of storage nodes for accounts, containers, and objects. So far, the way Swift protects data is replication: when you create an object, some number of replicas of that object are created on a subset of the storage nodes, such that each storage node is in a different failure domain. A little footnote, as some of you may have heard: yesterday there was a talk about erasure coding, which is expected to be coming in the Kilo release of Swift.
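For illustration, here is a minimal sketch of that RESTful API in Scala (the language used in the demos later). The storage URL and token are hypothetical placeholders; in practice both would come from authenticating against Keystone or another auth service.

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object SwiftRestSketch {
  // Hypothetical values; in practice obtained from the auth service.
  val storageUrl = "https://swift.example.com/v1/AUTH_account"
  val token      = "AUTH_tk_example"

  // PUT creates (or overwrites) an object inside a container.
  def put(container: String, obj: String, data: Array[Byte]): Int = {
    val conn = new URL(s"$storageUrl/$container/$obj")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("PUT")
    conn.setRequestProperty("X-Auth-Token", token)
    conn.setDoOutput(true)
    conn.getOutputStream.write(data)
    conn.getResponseCode // 201 Created on success
  }

  // GET retrieves the object's contents.
  def get(container: String, obj: String): String = {
    val conn = new URL(s"$storageUrl/$container/$obj")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestProperty("X-Auth-Token", token)
    Source.fromInputStream(conn.getInputStream).mkString
  }
}
```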
Now a little more detail about Apache Spark. I assume everyone here is familiar with Hadoop MapReduce. You can think of Apache Spark as the next generation of that style of computation. It's the fastest growing big data project in Apache. For workloads that fit in memory, it can be two orders of magnitude faster than Hadoop MapReduce; for things that go to disk, it's an order of magnitude faster. One of the key innovations of Spark, a project that started at UC Berkeley, is what's called RDDs, or Resilient Distributed Datasets.

The way something like Hadoop works is that each time a map task or reduce task produces output, it writes it down to disk, and this persistent copy of the data is used for communication between the various stages of the computation. It's also used to protect the computation. With RDDs, everything is kept in memory, and to deal with recovery, Spark keeps track of the lineage: how each piece of data was computed. In the event of a failure (and failures, while they happen, perhaps more often than we'd like to think, are still relatively rare), it redoes the computation based on that lineage, based on the history of how the data was created.

Spark is mostly written in a language called Scala. When you see the demos in a little bit, you'll see some use of Scala; you don't need to be a Scala expert, though, to understand what we'll be showing. One of the key companies involved in Spark is Databricks, Reed's company. It was founded in 2013, it is the largest contributor to Spark, and it has a cloud service for accessing Spark called Databricks Cloud.

Spark, as I said, is the most active big data project. You can see it has had very quick growth in the number of contributors; this is actually reminiscent of what we saw for OpenStack in its early days. It has a single engine that can handle many different types of big data workloads, and it scales from gigabytes up to terabytes and petabytes of data. There's the Spark core engine, which does the processing of the data, and on top of that there is a set of interfaces giving different ways of interacting with the data. For instance, there's GraphX, which is for graph computation, for understanding networks and connections; that's not quite production-level code yet. There's Spark Streaming, for streaming applications. And the one we'll be seeing is Spark SQL, which puts an SQL-like interface on top, enabling queries like SELECT over your data. What we want to go on to now, and Gil will take over, is how we integrate Spark with various data sources, going deeper into that, and then a discussion of how we do it with Swift.
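To make the lineage idea concrete, here is a minimal sketch in the Spark Scala shell (where `sc` is already provided); the input path is hypothetical. Transformations only record how data is derived, and an action triggers the actual computation, so a lost partition can be rebuilt by replaying its lineage.

```scala
// Hypothetical input path.
val logs = sc.textFile("hdfs://namenode/logs/app.log")

// Transformations: each step is recorded in the RDD's lineage graph,
// but nothing is executed yet.
val errors = logs.filter(_.contains("ERROR"))
val counts = errors.map(line => (line.split(" ")(0), 1))
                   .reduceByKey(_ + _)
                   .cache()

// Action: this triggers the computation. If a cached partition is later
// lost to a node failure, Spark recomputes just that partition by
// replaying the recorded lineage, rather than restoring it from disk.
counts.count()
```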
Okay. So Spark can work with many data sources: wherever the data is located, Spark uses that data source for its analytics. Out of the box it works with the local file system, HDFS, Amazon S3, and also Cassandra, MongoDB, and so on. The good thing is that you can configure Spark to work with many data sources at once, so in the same query you can access several data sources and pull data from all of them to perform analytics. You can do this in a variety of ways. First of all, Spark has shells: you can use the Python shell or the Scala shell, which give you an interactive way of doing analytics. Or you can write your code in Python, Java, or Scala and deploy it to Spark. And the results of the analytics, it's up to you what you do with them: you can store them back in the same data source the input came from, you can store them in another data source, or you can print them and look at them. We'll see in the demo what you can do with them.

So what we actually did is let Spark integrate with Swift. The good thing here is that you don't need to modify the code of Swift; Swift is completely unaware that Spark is using data stored there. And you don't need to modify code in Spark either. What you do need to do is configure: you use a certain configuration in Spark to make it use Swift as a data source, then you build Spark the same way you'd build it in any case, and then you use it. We submitted this back to the Spark community; it was merged in version 1.1.0, and you can see on the website how to integrate Spark with Swift.

For usage, I'm not going into details here, but you just use the scheme swift://. Then you can access your container, or a particular object in Swift. Here we access a container of log data on SoftLayer, and we access all the objects in that container; I'll explain it again in the demo. As for configuration, basically we are using the Hadoop driver for Swift, so you configure the Swift driver in the data source definition, and then it's up to you which authentication you want to use. Here is an example of two of them: v1 authentication and v2 authentication with Keystone. And of course there are some other parameters you need to configure to make it all work.
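As a rough illustration, that configuration can also be set at runtime from the Scala shell rather than in a file. This sketch follows the property-name pattern documented for the hadoop-openstack Swift driver; the service name "softlayer" and the endpoint and credential values are just placeholders.

```scala
// A sketch of configuring the Swift driver at runtime; "softlayer" is a
// service name we chose, and the endpoint/credentials are placeholders.
val hc = sc.hadoopConfiguration
hc.set("fs.swift.impl",
       "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem")
hc.set("fs.swift.service.softlayer.auth.url", "https://auth.example.com/v1.0")
hc.set("fs.swift.service.softlayer.username", "demo-user")
hc.set("fs.swift.service.softlayer.apikey",   "demo-key")
hc.set("fs.swift.service.softlayer.public",   "true")

// Objects are then addressed as swift://<container>.<service>/<object>
val logs = sc.textFile("swift://logs.softlayer/*")
println(logs.count())
```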
Okay, so Spark actually gives you a couple of ways to work with a data source. What you see here is the Scala shell, and I'm accessing the container on SoftLayer, all the objects in it; this is the definition of the data source for the SF311 data you'll see in a moment. Once I've declared the data source, I can use Spark as I normally would. Here I just count all the lines in these text files. In the container there are objects, and each object is a text file. What you see here is Spark first of all querying Swift to find out how many objects are in the container, and then just counting the total number of rows. This is part of the configuration: we define "softlayer" and it actually accesses the object store there, and we provide additional configuration, in particular the API key and username. You don't have to store these in the configuration file; that's only for demo purposes. You can also provide them at runtime, as you normally do in Spark.

Spark SQL is another way to access data. The idea is that you can use SQL syntax to query the data inside those objects: Spark still communicates with the object store, but it also understands what's inside the objects, and then you can write your regular SQL against them. What you see here is standard SQL with SELECT, FROM, and GROUP BY; we'll see it in a little bit. This is what used to be called Shark; now it's Spark SQL. And you can access many data sources in the same query, which is also a very nice feature.

Let's see the demo. This is a recorded demo that I did beforehand, but those who are interested can come to me after the talk and I'll show them live how it works. So what I did: I wanted a public data set to give you a taste of this analytics, of this integration. There are public data sets from the 311 service in the United States, and we took the one for San Francisco. You go there and you see all the records, all the incidents that occurred in the city. It's public, and the good thing is that you can download the data in any format you like.

So what I did, I went to this public data set and saved the records of, I think, two years into two CSV files, and I uploaded those files to SoftLayer, to the object store. This is what you see here; this is what the data set looks like. And now I'm showing you in the REST client that there are two objects inside the container, both CSV files. You can see the content of one of the files: just a CSV file with the delimiter I chose. And now I'll show you what you can do in Spark. As I mentioned before, there are a couple of ways to analyze your data. This is a demonstration of how I write an application, deploy it as a JAR file, and submit it to Spark; later I'll show you the interactive shell.

There are a couple of things to see here. First, we define the Swift data source we want to access: the container, all the objects. We map it: we know what the information there is, it's CSV files, and we map it to a table, which we call the table "incident", taking only particular fields from each record. And now we can use standard SQL to analyze it. So this is the data source, and we use SQL; what we actually want to do here is get all the neighborhoods and sort them by open records. After you write this application, you compile it, for instance with sbt (you can also use Maven), and it generates a JAR file. Then you just submit it to Spark, and you have your results on the screen: all the neighborhoods sorted by open records in this data set. From here you can use any other tool to visualize it, build graphs, or continue the analytics if you like. A sketch of roughly what such an application looks like follows.
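Here is a hedged sketch of such a standalone application, in the Spark 1.1-era Scala API. The container name, delimiter, and column positions are illustrative, not the exact ones from the demo.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical mapping of two fields from each SF311 CSV record.
case class Incident(neighborhood: String, status: String)

object Sf311App {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SF311"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicit RDD -> table in Spark 1.1

    // Read every CSV object in the container; the delimiter and column
    // positions here are placeholders.
    val incidents = sc.textFile("swift://sf311.softlayer/*")
      .map(_.split('\t'))
      .map(f => Incident(f(8), f(5)))

    incidents.registerTempTable("incident")

    // Rank neighborhoods by number of open records.
    sqlContext.sql(
      """SELECT neighborhood, COUNT(*) AS open_records
        |FROM incident
        |WHERE status = 'Open'
        |GROUP BY neighborhood
        |ORDER BY open_records DESC""".stripMargin
    ).collect().foreach(println)
  }
}
```

Compiled with sbt into a JAR, this would be submitted with something like `spark-submit --class Sf311App sf311.jar`.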
Okay, so we're back to the slides. A few words about cluster management in Spark. There are a variety of ways you can run it. The simplest is the standalone cluster, which is very easy to set up: you just run a couple of scripts and you have it. You can also use other frameworks to manage your cluster. And there is support in Sahara, in the recent release, that allows you to run Sahara's elastic data processing jobs on Spark.

A bit about cluster layout. What we actually have is Swift as the data storage and Spark as the analytics cluster, and we can integrate those clusters in a variety of ways. One way is to share the same resources: maybe you install Spark on the same cluster as Swift, or maybe they share only certain machines. That actually gives you some data locality, since Spark will be close to the data in Swift. But on the other hand it's costly, and it requires Swift to share its resources with Spark, which is not a good approach. What you can do instead is install them on separate clusters and manage those clusters separately. This is the more standard approach: you have your storage and you have your analytics cluster, and Spark accesses Swift, as we saw before, via the REST API. The problem here is that sometimes it's not efficient, because you use a lot of network. In particular, Spark may need only certain data for its analytics, not everything stored in Swift, so you can end up transferring a lot of information over the network that Spark doesn't really need. There are also certain security considerations: for example, you may have some sensitive data in Swift that you can't move to the analytics side, so you need to somehow filter it before you move it.

So let's look at some use cases where we actually need less information for the analytics than we have in the object store. We may have images in Swift; it's a very good place to keep them. But we may want to analyze only the EXIF metadata. For those familiar with it, EXIF metadata holds a lot of information about an image: which camera was used, and if the user enabled GPS coordinates, you can extract those too and learn a lot. The same analogy works for PDF files: you may want to analyze only the metadata, or some partial text. Logs are another example. You may want to store your log files in Swift as objects, as an archive; you just put them there and it's good. But when you run analytics, you may only be interested in the records marked with "error", because that's what you care about, and assuming you don't have a lot of errors in your logs, that's only a small subset of the data. The problem is that with the standard approach you move the entire data source from Swift to Spark, so you use a lot of network, and it's costly and slow. So the question is: can we move a little less data over the network?

With SQL databases there is an analogy: it's a little easier to push a SQL query down into the data source. Basically you can take a predicate like "status = error" and push it into the database, and the database will run it and return only the matching subset of records to Spark, which then does the analysis. There is work in the community on this, and you can read about it on various websites; Cassandra, for instance, has documentation on how to push queries down. So what about Swift, can we do the same?

Storlets are an IBM solution for running user code inside Swift. The idea is that the user writes a standard application; it can be a standard Java library, and you can use any library dependencies you want. You export it as a JAR file and deploy it to Swift. From Swift's point of view, the storlet JAR and its dependencies are just normal objects, so Swift stores them the way it usually stores objects. The magic comes from the storlet engine inside Swift: when a storlet is triggered, the engine knows how to deploy it to the storage node that contains the data, and how to activate it in a secure way so that it can do no harm. There is a dedicated talk about storlets that you can attend. The idea here is that storlets can now help us achieve better network utilization for analytics. Swift is a good, cheap object store where you can put your data; Spark is good at analytics; and now you can use storlets to extract only the data that's actually needed for the analytics and move just that to Spark.

The example I'll show you is the EXIF metadata I mentioned before. Those images carry a lot of EXIF metadata, and here's an example. You can take one of many packages, I think from Apache Commons, that extracts EXIF metadata from an image; it's actually a few lines of code in Java. This particular image is about 10 megabytes in size, while its EXIF metadata is about one kilobyte. So you can write a very simple application in Java, using some open source package, that extracts this metadata from an image.
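For a flavor of what that extraction code might look like, here is a hedged sketch, written in Scala to keep one language across these examples (the talk describes a Java application). It uses the open-source metadata-extractor library; the talk mentions an Apache Commons package, but any EXIF library would do, and the JSON assembly here is deliberately naive.

```scala
import java.io.File
import com.drew.imaging.ImageMetadataReader
import scala.jdk.CollectionConverters._

// Read EXIF (and other) metadata from an image and flatten it to JSON.
// A ~10 MB image typically yields around a kilobyte of metadata.
def exifAsJson(image: File): String = {
  val metadata = ImageMetadataReader.readMetadata(image)
  val pairs = for {
    dir <- metadata.getDirectories.asScala
    tag <- dir.getTags.asScala
  } yield s""""${tag.getTagName}": "${tag.getDescription}""""
  pairs.mkString("{", ", ", "}") // naive JSON: no escaping of quotes
}
```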
And now let's see how this works; I'm going to play the second video. I have images in my Swift account: 314 images, taken with my camera, that I just uploaded to Swift. Let's see what one of these images looks like; you'll see there are a lot of pixels there, it's a large image. Now, what we want to do: I wrote this storlet, the code that knows how to extract the EXIF metadata from an image and return it as the response. You need to trigger this storlet, and the trigger is actually here; there are a couple of ways to trigger it. One way is to add something to the container name that triggers the storlet. From that moment, when the request arrives at Swift, the storlet engine understands that a storlet needs to be activated. It goes to the storage node where the data, this particular image, is actually located, activates the storlet there, and returns a JSON file. Why do we want a JSON file? Because that's a format Spark knows how to work with efficiently. And now we get better network utilization, because we transfer much less data over the network.

So we enter the Spark shell, and we're going to use SQL to analyze this EXIF metadata. We define the data source here, and in the definition of this data source we also have the trigger for the storlet, so the storlet engine inside Swift will understand that the storlet needs to be activated. From this point on, it's standard Spark: we just define the data set. There is a new feature in Spark that maps JSON to a table, because JSON already contains the schema; you have key-value pairs, so it just uses the keys to generate the table for you. And this is printSchema: you can see that Spark understands what this JSON is about and creates the schema for me. Now I'm using standard SQL, and I'm going to count all the ISO settings my camera used; you can, of course, query whatever you like here. You see it's standard SQL with all the regular syntax. And here is the result: we see that ISO 800 appeared 106 times in this data set. Then we just took this information, copy-pasted it into Excel (I could also have stored it as a text file and loaded it into Excel automatically, whatever you prefer), and created a graph that visualizes this ISO analysis. And the good thing is that we can now have a very small Spark cluster, because we don't need to transfer all those images to Spark and analyze them there; we only transferred JSON files.
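The shell steps just described might look roughly like this in the Spark 1.1 Scala shell; the container path, the trigger naming, and the `iso` field name are all illustrative placeholders.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Each record returned by the storlet is the EXIF metadata of one image,
// as JSON. The path and storlet-trigger naming are placeholders.
val exif = sqlContext.jsonFile("swift://images.storlet.softlayer/*")
exif.printSchema()              // schema is inferred from the JSON keys
exif.registerTempTable("exif")

// Count how often each ISO setting was used, most frequent first.
sqlContext.sql(
  "SELECT iso, COUNT(*) AS uses FROM exif GROUP BY iso ORDER BY uses DESC"
).collect().foreach(println)
```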
And this is the summary. Swift is great: it's a cheap, good, reliable object store; you can put your data there and be confident it's safe. Spark is great for analytics. You can integrate them in a simple way and do your analytics in Spark, and you can also use the advanced features to transfer much less data over the network and still do the same analytics in Spark.

Hi, I work on Sahara. My question: for the Spark and Swift integration in 1.1.0, is that supported through the hadoop-openstack JAR, or is there another mechanism?

The integration itself is through the hadoop-openstack JAR, exactly. You just compile it.

Thank you. So, one of the issues we're looking at in the next version of Sahara is dealing with authentication for Swift and how to get that into the Hadoop configuration under the covers. Do you have any thoughts on that?

So, what you saw before, I just put those credentials inside the configuration and Spark uses that, but you can provide them at runtime. It's the same way Spark accesses Amazon S3: just as you provide the S3 credentials to Spark, you can provide them for Swift in exactly the same way. I have an example, a simple Java application, where you provide your credentials at runtime, so you don't need to store them in a configuration file. You can, but if you want to provide them at runtime, you can do that too.

Okay, thanks. I'm just wondering: can these storlets be part of the Spark script that we push to Swift, or do they have to be in Swift already, with Spark just pointing to them?

Sorry, another storlet question. In the example you have, are your storlets running in the object server, or are they running separately elsewhere and accessing the Swift API?

So, the storlets are not running anywhere by default; the storlets are objects inside Swift. It happens on the fly: you access Swift through the regular API, and you just have this trigger, in the container name, or the trigger can be a header. The storlet engine understands this, activates the storlet on the storage node, and it streams; it just streams whatever it needs.

So the access from the storlet to Swift goes through the proxy server?

No, no, it's already on the storage node. Thank you.

Just to follow up on that question: are the storlets stored on the same object server node as the data you're trying to compute on? And do they run from that object server, or is data transferred? I'm just trying to understand, because you talked about data locality, but is the data also going to be transferred? I think you mentioned in the slide that there's a detailed talk later on storlets; could you expand on that?

Any other questions?