So, this talk is about the fact that people are growing their object stores. I don't know if you saw HB's talk earlier this morning, but they're getting petabytes and petabytes of storage inside their object stores, and now, from what we've seen, there's a strong demand to analyze that data, to make it active, and to be able to run all of your analytics jobs on the object store. So this is about how Swift on File can solve these problems when you're using a Swift object store. This is the team; most of us are on the stage right now, and you can ask us questions afterwards as well, so we can't hide from you. All right, so the first myth that a lot of people have is that to run analytics on HDFS, for example with Hadoop, you have to first create the data, then migrate it into HDFS, run your analytics, and possibly then migrate it out. This is the first one we want to show you is not needed. The second myth we want to dispel is that Swift is only good in cases where you're not putting a high I/O load on the system. In systems like Spark, you basically ingest the data and then run everything in memory, so you're not putting a high I/O load on the store after that. We don't believe that's the case either. Third, Swift semantics and the analytics semantics of HDFS are different. For example, HDFS supports appending to a file. So some people believe the two are incompatible and won't work together, and that's another one we're going to dispel. And finally, there's a belief that, okay, we have connectors into Swift, we can run jobs on top of our object store, but the performance may suffer due to a variety of reasons that we can go over afterwards; there's not a lot of time to go through them now.
But there are generally a lot of reasons, because the Swift architecture and the HDFS architecture, and object architectures and HDFS architectures in general, are just different in how they operate. So knowing that these are the myths we set out to dispel, with the open source Swift on File project we believe we can debunk many of them. So let's start off with a demo. As Dean mentioned, we would like to demonstrate how we can access data and do analytics. So what will our demo application do? It will create recommendations for stock purchases based on the sentiment of Twitter messages. How does that work? Each tweet is labeled either positive, neutral, or negative. And at this point we'd already like to say that this is really a demo application, and you will also see demo data. So what will we actually show? By using a unified file and object big data solution, we will demonstrate that running analytics on data that was ingested into an object store does not mean that you have to move or even copy any data. In a first step, we upload the data using the Swift object API. In a second step, we execute analytics on the uploaded object, without moving the data and without copying it, using native file access. And the third step is that we immediately access the results, again through the Swift object API. The demo was prepared by Ruri, who's also in the audience today. So let's have a short look at our demo environment. We are using a set of Swift proxy nodes, which have Swift on File configured to use the scale-out file system. Additionally, we have a Hadoop environment set up, and it is also configured to use the same scale-out file system as our Swift environment. So let's go to the first step: we upload data using the Swift object API.
And we are going to do that using the Horizon dashboard. You can see the containers we created. We use the Twitter analytics folder and we upload an object. To make it easier for the demonstration, we pre-computed an object file; in a production environment, of course, that would be done automatically, with the data ingested automatically by, let's say, an RSS feed of Twitter messages. So the upload has finished. And as we mentioned before, we configured Swift on File to access the scale-out file system. That means we can see our data directly, without any calculations on how to find it on the file system. You can see in the path that SOF is our base path, followed by the AUTH prefix and the tenant ID, and then our Twitter analytics folder. And if you look into it, here's the object file we just uploaded. So the next step is to actually run our analytics application on the just-uploaded data. We're going to use BigInsights, IBM's distribution of Hadoop. On the applications tab, we select the Hive query. I'm going to use a query that I used in the past by selecting it down there. That's what the query looks like. So what does it do? It selects the stocks with the most positive sentiment and groups them into a portfolio with four members. Down there, you can see it's running. Let's have a detailed look at the jobs that are created. I sped this up a bit to save some time for the demo. And now, finally, all the jobs are done. So what actually happened in the background? We ingested the data using the object interface. Our Hadoop environment was able to access the data on the scale-out file system directly. It ran the analytics on it, and it stored the results directly back into our scale-out file system. That means we should be able to see it here. Let's change into the Twitter output folder. This is the content. And here it is: the portfolio.html file is the result file that was created.
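To make the demo's query logic concrete, here is a minimal Python sketch of what it computes: score each stock by its labeled tweets and pick the top four for the portfolio. The function name, the sample tickers, and the scoring rule are our own illustration, not the actual Hive query from the demo.

```python
from collections import Counter

def build_portfolio(tweets, size=4):
    """Pick the `size` tickers with the most net-positive sentiment.

    `tweets` is an iterable of (ticker, sentiment) pairs, where
    sentiment is 'positive', 'neutral', or 'negative', mirroring
    the labels used on the demo data.
    """
    score = Counter()
    for ticker, sentiment in tweets:
        if sentiment == "positive":
            score[ticker] += 1
        elif sentiment == "negative":
            score[ticker] -= 1
        # neutral tweets do not move the score
    return [ticker for ticker, _ in score.most_common(size)]

# Hypothetical sample data, analogous to the demo's pre-computed object file.
sample = [
    ("ACME", "positive"), ("ACME", "positive"), ("ACME", "negative"),
    ("BETA", "positive"), ("GAMA", "neutral"), ("DLTA", "positive"),
    ("EPSN", "negative"), ("BETA", "positive"), ("DLTA", "positive"),
]
print(build_portfolio(sample))
```

As in the talk: sample data only, not investment advice.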
But what we actually want is to see it over our Swift object API. So we go back into our Horizon dashboard. We select the Twitter output folder, as we just did on the command line. And here it is: you see our portfolio.html file. Let's download it and view it. Here are the results. And again, don't say IBM or Red Hat asked you to buy these; this is sample data, so don't use it as a basis for any investment decisions. That's the demo. So what actually happened? What made the difference? Dean will take over. Thanks, Simon. So I wanted to re-emphasize that this was standard open source Swift and standard Hadoop. Although we do have packaging around them, there's nothing fancy inside the distributions. So what happened? We thought some people in this crowd might not be as familiar with Hadoop and how the architecture works, so let me take you through how we've changed the architecture compared to what you might normally think of as the standard way of using Hadoop and all the other Apache projects. What we have is, again, a series of HDFS servers with all of the data, and then all of your analytics applications at the top, which all leverage the Hadoop file system API down below. In this environment, to reiterate what I said before, a lot of the time you're generating the data in one system, ingesting it into your analytics system, so creating another copy of the data, analyzing the data, creating yet another copy, now of the result set, and then having to copy that data out of the cluster into another system. So you're creating up to several copies of your data along the way, and the goal is to simplify that process. So the first thing we did was replace the bottom layer with a scale-out file system, where instead of having standard HDFS, we're installing connectors.
And in this case, either the GlusterFS connector or the IBM Spectrum Scale connector, to be able to run your analytics jobs against the file system. So the key point is that, initially, we changed the lower-level part of the architecture. Second, we added the Swift code stack on top of the scale-out file system. So now we have the Hadoop stack as well as the Swift stack sitting on top of the scale-out file system. The key element here is that you have what's called a PACO deployment, with the proxy and object servers sitting on the same nodes and accessing the lower-level scale-out file system. And the last change to the system, to make everything Simon showed work, was the Swift on File policy. Luis is going to get more into how we use the Swift on File policy inside standard Swift to enable access to the scale-out file system. Thank you. So you're probably wondering, what is Swift on File, then? Swift on File is a storage policy that has been available for OpenStack Swift since the Juno release. What it does is allow objects to be stored on a scale-out file system, as long as that file system supports the POSIX interface. That allows files to be accessed as objects and objects to be accessed as files, and vice versa. Concretely, it takes the URL that is used to place an object into Swift and maps it directly onto the file system. If you're not familiar with storage policies, they are a technology that has been available in OpenStack Swift since the Juno release, and they allow administrators to have different types of policies for their data. For example, they could have a policy for containers that want two-times replication and another for three-times replication, or, for example, higher performance for some containers on SSDs and lower performance for others on, say, SAS or SATA drives.
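The storage policy idea just described can be pictured with a toy model. The policy names, fields, and selection logic here are invented for illustration; this is not Swift's actual implementation, just the shape of it: each container is tagged with a policy at creation time, and the policy decides how its objects are treated.

```python
# Toy model of Swift storage policies. Names and fields are invented;
# in real Swift, policies are defined in swift.conf and a container's
# policy is chosen with the X-Storage-Policy header at creation time.
POLICIES = {
    "gold":        {"replicas": 3, "devices": "ssd"},
    "silver":      {"replicas": 2, "devices": "sata"},
    "swiftonfile": {"replicas": 1, "devices": "scale-out-fs"},
}

containers = {}

def create_container(name, policy="silver"):
    """Tag a new container with one of the configured policies."""
    if policy not in POLICIES:
        raise ValueError("unknown storage policy: %s" % policy)
    containers[name] = policy

def policy_for(container):
    """Look up how objects in `container` should be placed."""
    return POLICIES[containers[container]]

create_container("hot-data", policy="gold")
create_container("twitter-analytics", policy="swiftonfile")
print(policy_for("twitter-analytics"))
```

Note the single replica on the Swift on File policy in this sketch: as the talk explains, data protection is delegated to the clustered file system underneath, so Swift itself does not need to keep extra copies.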
So they're able to take their data and segregate it the way they want. It also allows different types of storage systems to plug in to Swift. In this example, what we have is Swift on File plugged into an existing Swift cluster, bringing in a clustered file system. Normally in Swift, when you ingest an object, you ingest it with a very nice-looking URL that everybody can read. Swift then mangles that into a hash and uses a timestamp to place it on the local file system. So that's kind of hard to read if you're going to go through the file system looking for your data. What Swift on File does instead is take that object path, copy it essentially verbatim, create the appropriate directories, and place the data on your file system. Now you're probably thinking: okay, so how does Swift on File do replication and keep my data safe and things like that? Well, Swift on File doesn't do any of that. It passes that on to the clustered file system and leverages the technology of those file systems to keep your data safe. Okay, thanks, Luis. All right, so let's go back and see how we did against the myths. The first myth was that data must migrate in order to run analytics: that we must migrate data from the object store to HDFS. The reality is that with this architecture, we analyze data in place. The next myth is that object stores should only be used with in-memory analytics, things like Spark. Again, busted: we support the entire Apache analytics ecosystem with high performance with this architecture. Third, object stores can't efficiently support all of the Hadoop analytics ecosystem because of the way you would access data through the HTTP REST interface. Again, busted: we support all POSIX operations, including operations like append.
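The two placement schemes Luis contrasted can be sketched side by side. The stock-Swift function below is deliberately simplified (real Swift also mixes per-cluster hash prefix/suffix values into the hash and stores a timestamped file inside the hash directory); the point is only that one path is opaque and the other is the URL laid down verbatim. All paths and parameter names here are illustrative.

```python
import hashlib
import os

def plain_swift_path(device_root, account, container, obj, part_power=8):
    """Simplified sketch of stock Swift placement: the object's URL path
    is hashed, a partition is derived from the hash, and the data lands
    under an opaque, hash-derived directory."""
    name_hash = hashlib.md5(
        f"/{account}/{container}/{obj}".encode()).hexdigest()
    partition = int(name_hash, 16) >> (128 - part_power)
    return os.path.join(device_root, "objects", str(partition),
                        name_hash[-3:], name_hash)

def swift_on_file_path(mount, account, container, obj):
    """Swift on File placement: the URL path is mapped onto the file
    system verbatim, so objects are files and files are objects."""
    return os.path.join(mount, account, container, obj)

print(plain_swift_path("/srv/node/d1", "AUTH_test", "twitter", "tweets.json"))
print(swift_on_file_path("/mnt/sof", "AUTH_test", "twitter", "tweets.json"))
```

The first path is unreadable without re-computing the hash; the second is exactly what Simon browsed to on the command line in the demo.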
Finally, the myth is that object stores are slow for analytics. We trimmed some time out in the demo, but the reality is that with this architecture we get much better performance, and a lot of that is just by virtue of not having to move the data to do the analytics. By not copying, by analyzing the data in place, we can significantly speed things up. Okay, so that's settled. Now, the next question is: with this architecture, are there other use cases that are interesting, use cases that make sense? I'll present a couple here. The first one is a scientific collaboration, scientific analysis use case. The idea is that there are customers who have invested a lot in file-based analytics applications, and they're generating lots of data, petabytes of data, and computing on that data. When they're finished, they want to be able to publish it. So again, with this architecture using Swift on File, they can run their applications, publish the results into the scale-out file system, and then, with the object interface, selectively say which of that data, all of it or parts of it, they want to publish and make available through the Swift interface. This is nice because it instantly becomes available worldwide. I don't have to worry about giving people NFS shares or something like that. I've got the Swift interface for getting to the data, and I've got Keystone or other authentication mechanisms for making sure the people I want to get the data are the ones who actually do. So it's a very nice combination. The next use case is really allowing editing and processing in place, again with file-based applications. In this case, video transcoding is an example Luis has talked about in the past. You've got images, whether it's a video you took on your phone, or pictures, or, like they presented this morning in the keynote, the video from the TV show you're producing. You've got the data in Swift.
You want to make it available for processing without moving it. With the Swift on File architecture, you can have that data be available to your editing applications, make your edits, and then, when you're done, say "I'm ready to publish," and the data is again published and available through the Swift interface. So that wraps up a couple of other use cases where this applies. As for some of the future things we're looking at doing: first of all, the Swift on File project is a Stackforge project that Red Hat, IBM, and a few other vendors have been participating in, and we're putting together plans for what we want to do over the next year. There are a couple of different items that are Swift on File based, and also items that are part of the core Swift community. First is the single proxy-object process: trying to optimize the communication when we have the proxy server and the object server on the same node. If we can optimize that by having them run in the same process, without doing HTTP communication between the two processes, we feel that will really speed things up. So that's a big part of the next step forward. Also, within core Swift, there are a number of cases where data is copied; with the Swift on File architecture and an underlying clustered file system, we can do things like just moving the data from one directory to another, as opposed to transporting it from an object server on one node to an object server on another node. So getting as much optimization as possible into that architecture is the next thing. The last two items are really about getting equivalence with traditional Swift. There are features like multi-region that Swift on File just doesn't support right now. What we want to look at is what it would take to provide that same functionality in the Swift on File architecture.
So in summary, the architecture we demonstrated today provides you with a way to get insights into your data, to use the data you have in your Swift object store more quickly. And the way we do that is by no longer copying data unnecessarily, by providing high-performance analytics, and by allowing you to leverage the entire Apache analytics ecosystem. Any questions? Yes, can you speak into the mic please? Hi, this is Vishnu from NetApp. Just some context on my question first. Generally, Swift is eventually consistent, and that kind of gets in the way of Hadoop, and also in terms of how Swift drives low cost: we generally tend to use JBODs, trying to get erasure coding and policy management. If I understand what you're saying here, you take a clustered file system and you get all the benefits of that through Hadoop and everything else, the data protection and all of that standard stuff, and then you overlay object access via Swift. So you give up the erasure coding of Swift, you give up the policy management of Swift. You get the GET and the PUT on the object side, and then all the clustered file system value and all the apps and the ecosystem that run with the clustered file system. Just to make sure I understand what you're saying: that's what you're proposing? Yes, and it's not just the GET and the PUT. You've got to remember that it's still Swift underneath, so you also get the entire middleware ecosystem that Swift provides through the pipeline. So I get the WSGI pipeline, which means I can do, say, a Cinder backup or something like that? If you wrote the middleware for that, yeah. Right, okay. So are you guys seeing a lot of this use case? I mean, are you seeing a lot of people saying they want a clustered file system and then they want a PACO instance that's just providing object access? Yes.
It all started, if we can talk about Swift on File history a little bit, about three years ago, when the project was actually called Gluster-Swift. GlusterFS, if you don't know it, is a scale-out file system. It has many methods of accessing data, and they wanted an object method to access the data. So we went ahead and started creating our own method of plugging into Swift, but it was not a really community-friendly way to do it. So instead, we stopped working on Gluster-Swift and called it Swift on File, because it doesn't really do anything GlusterFS-specific; it only talks to a POSIX file system. And then we started working with the community on methods of extending that technology into Swift itself, so we can start making Swift a pluggable architecture. A follow-up question: have you compared this, especially with regard to Amazon's EMRFS? There are really three ways to do it: you could do EMRFS, S3A, or S3N as traditional file systems on Swift and do analytics on that, right? So you could take Swift and run Swift3, then S3N, S3A, or EMRFS on top. Have you considered running that on Swift? You mean POSIX semantics on top of Swift? Yes. No, we have not considered that, because it brings a lot of other questions. Okay, all right, thanks. I'm with Red Hat too; Christian Schwede is my name. I have a question that goes in the same direction. Are there actually any plans to add storage policy migrations? Because, especially when you're working in the scientific area, your data is hot for some time while you do computation on it, but weeks or even months later it becomes colder and colder.
And maybe it makes sense to add storage policy migrations, so that you can migrate data out from the scale-out file system to some erasure-coded storage policy, or even lower, say to tape. Back to Swift, to native or upstream Swift. Are there any plans for storage policy migrations? I was going to say that with Swift on File, all of the data protection and data management is delegated to the scale-out file system. In many ways, the capabilities of the scale-out file system now determine how you would achieve such things. For example, with Spectrum Scale we can tier between different types of storage and even push out to tape if needed. So it's delegated down to that level. Now, it's an interesting point how you would use Swift policies at that level to do the same. So you would use something like HSM on the scale-out file system? You could. If you wanted to do migration of data between policies, that's something we would work on with the Swift community, to figure out how to do it from the Swift level or the proxy level. All right, thanks. So, some questions on the implementation side of Swift on File. How do you take care of the semantic differences between Swift and POSIX? For example, how do you implement things like marker and prefix when you do a listing? And when a file is written by Hadoop, how do you maintain the ETags and the metadata? Okay, so listings are something you have to play around with, because when you do a find command on the file system, it can take a long time. Yeah, exactly. So we have to be very careful when we do listings on large containers on the file system, and we try to work out different methods of doing that with the customer. So do you not use the container databases? On our side, we use the standard container mechanisms.
So then you can use those for objects. I think the default, correct me if I'm wrong, Bill, is that as you access files via the Swift API, they get added into the container on a dynamic basis. But it's interesting to think about how you mix the two and what type of semantics you want. The other part of that, and this is what we did for the demo, is that there can be an updater process, similar to the other updaters in Swift, that looks at the containers and, as it finds new entries, automatically updates the container database and generates the object metadata at the same time. And there's a person right over there, Prashanth, he can raise his hand, who's working on that. Some questions regarding performance: after replacing HDFS with the scale-out file system, have you measured the performance compared to the native HDFS architecture? I have not. In our system, what we've seen is that the performance tends to relate basically to the hardware you're using. So if you look at standard HDFS, however the hardware is configured, then in our case the file system can achieve the same level of performance given the same hard drives, the same level of hardware, because effectively the hardware becomes the limiting factor in these types of deployments. So you're saying that if it's the same hardware, then the performance should be relatively the same? Right, exactly. You're basically saturating the disks, saturating the CPU; wherever the bottlenecks are, they continue to be the same in the other architecture. The difference, I would say, is that many scale-out file systems provide different types of architectures than what HDFS provides. HDFS provides storage-rich servers that scale out; we can provide that as well, but in addition we can provide much more of a client-server architecture, where you're pairing a scale-out storage system with a scale-out set of clients.
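The updater and listing mechanics discussed in the Q&A can be sketched in a few lines. This is a toy model, not the SwiftOnFile implementation: an updater pass scans a container directory for files that appeared through the file interface, registers them with Swift-style metadata (ETag as MD5 of the contents, which is what Swift reports for whole objects), and a listing function applies Swift-style prefix and marker semantics. All function names are ours.

```python
import hashlib
import os
import tempfile

def scan_container(container_dir, listing):
    """Toy updater pass: find files that appeared on the file system
    (e.g. written by a Hadoop job) but are not yet in the container
    listing, and register them with object metadata."""
    for name in os.listdir(container_dir):
        path = os.path.join(container_dir, name)
        if name in listing or not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            etag = hashlib.md5(f.read()).hexdigest()
        listing[name] = {"bytes": os.path.getsize(path), "etag": etag}

def list_objects(listing, prefix="", marker="", limit=10000):
    """Swift-style container listing: lexicographic order, filtered by
    `prefix`, returning names strictly after `marker`."""
    names = sorted(n for n in listing if n.startswith(prefix) and n > marker)
    return names[:limit]

# Usage: a job "writes" result files straight to the file system...
container = tempfile.mkdtemp()
for name in ["part-00000", "part-00001", "_SUCCESS"]:
    with open(os.path.join(container, name), "w") as f:
        f.write(name)

listing = {}
scan_container(container, listing)   # ...and the updater pass picks them up.
print(list_objects(listing, prefix="part-"))
print(list_objects(listing, prefix="part-", marker="part-00000"))
```

The marker/limit pattern is also why listings over very large containers are paged rather than walked with a single find, as the answer above notes.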
So it's more of a two-tier architecture in how you can access your data, and the performance relates a lot to how you configure the system. Thank you. Any other questions? How do you model your data ingestion now? How do you get more data into the scale-out file system if you're going to have HDFS in that model? Sorry, are you asking how the application generates the data, or, if it's sitting somewhere else, how you migrate it into the system? I'm not sure. Maybe we can take that offline and talk more about it. Any other questions? All right. Thank you very much.