Hi, everyone. Thank you for coming to our talk. This talk is Hadoop on OpenStack: Scaling Hadoop and SwiftFS for Big Data. Myself, Drew Lieman, and my colleague Chris Power are very excited to be here today and to share our story with you. This is part of the user story track, so we will be sharing our journey. We are not part of the team within Comcast that operates OpenStack; we are actually one of the largest, if not the largest, tenants on Comcast's OpenStack infrastructure doing work in big data. So we will be sharing some of how we've been leveraging that infrastructure. I'm going to talk first and stay fairly high level, covering what we're trying to accomplish and why, and then Chris is going to go a little bit deeper into the actual implementation of what we're doing.

First, since we are at an international event, I'm not going to assume that everybody knows Comcast, so I'll give a little bit of background about the company. You can find this sentence on our website; it's kind of our line: we bring together the best in media and technology, and we drive innovation to create the world's best entertainment and online experiences. We have many different lines of business, as you can see here, from high-speed Internet, to an Emmy-winning video service called our X1 platform, to our IP telephony network. We offer home security and automation services. We also own and operate the Universal theme parks, one of which is here in Japan, and we have several media properties as well. I share this with you to let you know that we are certainly in the big data space. We have tens of millions of customers and hundreds of millions of devices on our network, so it is not a challenge for us to get to data sizes of scale by any stretch of the imagination.

So how does our team fit in at Comcast? Our team is called Engineering Analysis. We use OpenStack as kind of the foundational layer for what we do, and we have a big data platform built on top of OpenStack. Those blue boxes in the middle represent a bunch of the different ways that we interact with that system, from basic reporting and visualization of data, to feature engineering, machine learning, exploratory data analysis, and ad hoc analysis. One of the more sophisticated things we do is simulations of our back-end systems, like content delivery networks, cloud DVR systems, and so on. And the reason we do all of this is basically to support the business. We do all of this analysis to provide the business with financial guidance, and to provide engineering teams with design guidance for how to design these systems intelligently, instead of just overbuilding all of the time and hoping our systems will support future demand. So we help the business intelligently allocate its capital to provide the best customer experience that we possibly can.

Now I'll give a little bit of a Hadoop overview. If you aren't familiar with Hadoop at all, there was a talk yesterday called Hadoop on OpenStack 101. I'm not going to rehash that content, but it was a very good presentation, so go find it online and watch it as well. I will explain the basics of MapReduce here a little bit, though, to drive home one point. MapReduce came out of the paper that Google published back in 2004.
And it's this basic concept that you have files, right? You have data in files on a distributed file system. You then perform a map operation, which reads that content from disk. The map basically maps the data to key-value pairs, which are then emitted. Then there's this shuffle-sort phase, which involves some combination of disk IO and network IO to get the data to the appropriate place for the reduce function. In the reduce phase, the system pulls together all of the values that have the same key, and the reduce function takes all of the associated values for that key and performs some reduce logic on them. In the end, it also emits key-value pairs, in the form of output files. So throughout this system, the design was predicated on the idea that there was disk IO taking place at the input, in the middle, and at the output as well. That's the basics of MapReduce. We've obviously evolved a bit from that over the past decade or so, as I'll get into.

So, talking a little bit about how the space has evolved, and this may be well understood for some of you, but I remember back when I was in school studying computer science, one of the things that was explained to us, as far as performance goes when you're writing systems, is that memory is really fast, disk is slow, and network is slower. The thing that's changed since I was in school is that now, according to this chart from the Ethernet Alliance, network speeds have been growing faster than disk IO speeds, and networks now are actually fundamentally faster than disk IO. We see network speeds doubling every 18 months versus disk IO every 24 months. And of course, we're considering commodity hardware here.

The other thing that we've seen, and I wish I could have found a more up-to-date chart than this one, which only goes to 2009, courtesy of Centip, the URL down there, is that the availability of main memory has been increasing like crazy over the last decade plus. Servers now have fundamentally more memory than they did back in 2004 when Google wrote that MapReduce paper. If you look at around 2003 and 2004, memory was at a bit of a plateau. 2005 is when Hadoop, the open source project, was born. Then in 2012, which isn't even on this chart, a new project was released called Apache Spark, which started saying, hey, let's leverage main memory instead of relying on disk IO all the time. If we can load the data in memory and do the shuffle sort in memory, leveraging some network IO, that's going to go a long way for us versus having to write and read to disk every time. And then in 2014 the Apache Tez project came out as well, which takes a similar approach of using directed acyclic graphs for processing data. So the main message here is that main memory is far more abundant than it was when that original MapReduce concept came into play.

This is a chart courtesy of Cisco. The other thing you see is that not only the availability but the performance of a lot of hardware components has been increasing over time. Everything is getting more performant, with one glaringly obvious exception here, which is hard disk drives. And this is commodity hardware again. So more and more, the approach in the big data space is to avoid disk IO, avoid it at all costs, because everything else is more performant.
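Just to tie that back to the MapReduce picture from a minute ago: in the classic word-count example, the map and reduce functions themselves are only a few lines each. The weight is in the reads before the map, the shuffle-sort spills in the middle, and the writes after the reduce. Here is a rough sketch using the standard Hadoop Java API; it's purely illustrative, with made-up class names, not something we actually showed.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: for each input line read off the distributed file system,
    // emit (word, 1) key/value pairs into the shuffle/sort phase.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: all values for the same key arrive together; sum them and emit
    // the final (word, count) pairs, which Hadoop writes out as output files.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable count : counts) {
                total += count.get();
            }
            context.write(word, new IntWritable(total));
        }
    }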
So what are we doing? We're trying to do big data on the cloud, and we're trying to use the Swift object store, which Chris will describe in more detail; our Swift implementation is basically a Ceph implementation. A lot of the factors that I just shared are converging to make this possible. There's no way to avoid disk IO entirely; the data has to be persisted at some point, and that's the long pole. When it comes down to it, network traffic is additive on top of that. You're always going to have to read from or write to disk eventually, but the network traffic is proportional, and in most cases it's actually smaller than the disk IO. Many of our workloads are actually becoming CPU bound as well. There's been the introduction of columnar formats, like Parquet and ORC, which compress your data down so it takes up less space. That makes the network hit smaller and the disk IO hit smaller, and you spend some CPU cycles to decompress and decode that data set, which makes the network hit even less impactful to what we're doing. Also, as servers have more memory, we can keep more of the data just in memory. So in our processes we try to read only once and write only once, and try not to spill to disk whenever possible. With all of this, data locality actually becomes less important, right? And the availability of all these massively parallel, memory-resident systems is making this more effective.

Historically, when people think big data and they think Hadoop, they think about it in bare metal terms, right? And we've talked to a lot of people who are doing big data with OpenStack and Ironic. But when you're dealing with bare metal, you're making one fundamental assumption: that these vertical boxes here are servers, and every server comes packaged with a certain amount of compute and a certain amount of storage. So you're assuming that when you scale this infrastructure, your demands for both compute and storage are going to scale proportionally, and within Comcast we've found that that's not the case. We have bare metal clusters with extreme amounts of compute utilization, and as we scale them out, we don't necessarily need to scale the storage, but we need more compute. By using an object store, you can decouple the storage and the compute from each other, and this gives you the ability to scale them independently. Your demand for compute goes up? You can add compute without adding storage, because they're decoupled. Or you can add storage without adding compute. You can also do some cool things like scale your compute based on your utilization: you've got some really big workloads going on, so you add more compute to take the load until it's done, and then in the middle of the night, when maybe you don't have that much going on, you can scale down. And we can do other things, like potentially even run multiple clusters against the same data sets. Nothing is holding the data hostage; it's in an object store, providing greater access. We could have our cluster completely spun down and ETL jobs could still be putting data into the object store. You don't even need to have Hadoop running to be able to store data. So we find that there are a lot of advantages to structuring your big data strategy around an object store. So this is kind of our landscape of tools.
We historically have been a Vertica shop, and we have a lot of our data in Vertica. We've been making this evolution to big data on OpenStack over the last couple of years. We have, as I said before, Ceph-backed Swift with OpenStack, and we have Hadoop and Spark. Presto is a newer product that we're currently working with as well, to give us performant SQL access to data sets in Swift. Hive is another way of accessing data via SQL. Pig is more of a data scripting language for accessing it. H2O provides machine learning capabilities on top of this platform. Datameer is a solution for self-service access to data; it effectively takes data sets that are accessible through Hadoop and makes them available in a spreadsheet-like interface. The combination of Datameer and Tableau is actually our self-service analytics story, so we're enabling other teams, like engineers and subject matter experts, to get in and do big data work without even realizing that's what they're doing, just by training them on these tools. And at this point, I'll hand over to Chris.

Thanks, Drew. So obviously we've got a lot of tools here that we need to make work with a number of different backend storage technologies, but I'll point out one thing, to elaborate on one of Drew's points about scaling these clusters up and down: the Presto cluster is actually running on a set of VMs beside the Hadoop cluster, so we can scale them up and down independently. The reason you might want to do that is something like this: during the day, we've got a lot of analysts working on the data who want to run Presto queries and get responses back in a few seconds, so we scale that up. Then in the evening, we might run a really big simulation job on our content delivery network. The folks using Presto or Hive or Datameer, for example, go home, so we'll scale that cluster down, and then we can scale up either the Spark piece or the Hadoop piece; we actually run both of those on top of YARN in the same cluster.

So what does OpenStack look like at Comcast? Just a few boilerplate pieces here: vanilla distribution, multiple data centers, multi-tenant, multi-region, all the typical pieces that you would expect to find. One thing I wanted to point out is that we use Ceph, of course, which provides both our Cinder storage for block devices and the object storage behind Swift. So keep that in mind as we go through the explanation of how we have architected the system. We also use Ceilometer for some metrics and Heat for some orchestration.

So what does Hadoop look like on the cloud? There are lots of OpenStack folks here, so I don't need to explain the benefits of the cloud, but when we think about deploying Hadoop into the cloud, we want to design for it. Assume things are going to fail; try to distribute the load out across the physical hosts, both for performance and for fault tolerance; and use persistent storage, like Cinder block storage, where it's appropriate. Think elastically: scale things horizontally, try to scale things to meet demand like I just described a second ago, and then return the resources when they're not in use. Be a good citizen of the cloud.
As one of the larger tenants, we could certainly take up all of the resources in the entire region if we wanted, but that's probably not a good thing at all times. At night it might make sense; during the day it may not. And leverage automation: when you're running things at scale, you need to automate everything. It increases efficiency, but it also makes things repeatable. We can scale the clusters up and down and do it in a way that we feel comfortable will work in most cases.

So what do we mean by performance and fault tolerance? We take advantage of the affinity and anti-affinity features of the Nova scheduler. We use anti-affinity to schedule the master nodes, so things like name nodes, resource managers, those sorts of things, out across physical hosts. That's both for fault tolerance, so if we lose a compute node we don't lose both name nodes or both resource managers, and also for performance. On the compute side, if you've got a whole lot of Hadoop nodes running, you want to spread those out across the physical compute nodes, both from a performance perspective, so you can distribute the load on the CPU and on the network, since everybody has to go through the same NIC, and also for fault tolerance, so if you lose a physical node here or there, you don't lose a lot of your compute nodes. So that's our strategy there.

So how do we actually architect the storage on the cluster nodes themselves, be it a master node or a compute node? Obviously you've got a VM, and you need some ephemeral root disk for your OS and that sort of thing. What we add to that is a Cinder volume. That Cinder volume serves a couple of purposes. One of those is persistent storage: Ambari might want to store its database somewhere, the name node needs to store its indexes somewhere, so we use the Cinder volumes for that. Data nodes, right? We try not to use too many data nodes. We keep a small subset of our compute nodes as data nodes, and then we can add what we call elastic nodes, which have no Cinder volume at all and no HDFS at all, to the cluster without needing those things. But we do use some HDFS. The Cinder volume can also act as local disk for the node manager where that makes sense, but we think there's a better option there, and that's ephemeral volumes. So we have access to the root ephemeral volume, and, thanks to some discussions with the nice folks on our cloud team, a flavor that gives us some much bigger ephemeral direct-attached disk as well. We typically use that for the local disk for the node managers. Certainly you need enough of it: depending on how many jobs you're running on the cluster at one time, or how big your workloads are, you may or may not have enough direct-attached ephemeral disk there to make that effective, so we can sometimes fall back to Cinder as well. And lastly, we use Swift as the data lake, a unified central point of storage for all of the data. As Drew said, we store our data in columnar formats, so most of the data in there is actually ORC with Zlib compression. We use that because we're on HDP, but you can use Parquet or any sort of columnar format you might like.
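To give a feel for what that looks like from the application side, here is roughly how a Spark job might write a data set back into Swift as ORC with Zlib. This is just a sketch: the container, provider, and path names are made up, and it assumes the Swift filesystem driver is on the classpath and configured.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class WriteOrcToSwift {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("write-orc-to-swift")
                    .getOrCreate();

            // Read some raw input. The swift://<container>.<provider>/<path> form is how
            // the Hadoop Swift driver addresses a container; "raw", "curated" and
            // "sahara" are hypothetical names here.
            Dataset<Row> events = spark.read().json("swift://raw.sahara/events/2015-10-27/");

            // Write it back out in a columnar format with compression, so both the
            // network hit and the disk IO hit are smaller.
            events.write()
                  .format("orc")
                  .option("compression", "zlib")
                  .save("swift://curated.sahara/events_orc/2015-10-27/");

            spark.stop();
        }
    }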
So how important is the ephemeral storage for big data workloads? Does it actually make a difference? We set out to figure this out and thought, all right, let's run some tests here and see what happens. Traditional Hadoop jobs, as Drew said, are somewhat write-intensive, especially in the intermediate pieces, where you do the shuffle sort, spill to disk, and write logs and things like that. What we can do is use the YARN node manager's local directory configuration to specify where you want the local disk to be. So you can say, okay, I want my compute node to use the direct-attached storage locally, and run the test that way; and then we'll run a scenario where we actually use the Cinder volume, which of course is going over the network to Ceph, and try it that way. Those were the two scenarios we came up with, and we decided we'd run TeraSort, which is a fairly common Hadoop benchmark, as well as DFSIO. We thought we'd see a difference with TeraSort, and maybe not so much with DFSIO. The reason we did this at all was that we ran a quick test on the ephemeral volumes from an IO perspective, and we found that the direct-attached storage was something like 15 to 20 times faster, give or take, than the Cinder volumes in terms of general read performance, somewhere in that neighborhood, and your mileage may vary. So we said, all right, this is worth actually trying. When we tried it, we found that, yes, in fact, the jobs run faster in terms of wall clock time. TeraSort at one terabyte ran about 30% faster using ephemeral storage for the local disk on the compute nodes than it did with Cinder as the local storage. For DFSIO, it didn't really make a difference. (Is it SSD? Yes, local SSD.) So for DFSIO it didn't make a difference; that test is really just trying to read as much data out of HDFS and put it back into HDFS as quickly as it can, so it's not using a lot of intermediate storage, and you wouldn't expect it to.

So how does Hadoop work on top of Swift? Some of you may know this, but essentially there's a driver layer that makes the Swift REST API look like an implementation of the Hadoop file system. Of course, object stores are not file systems, and we'll talk about some of the challenges there. There are actually two different branches of code that have this driver: one is in the Apache Hadoop distribution, and one is the Hadoop Swift file system that's part of the sahara-extra OpenStack repo. We're actually using the sahara-extra one, and that's the one we've made our changes to. So this all sounds great, but what happens when you actually scale it up to a fairly reasonable size? You run into a few challenges, and we did. When we attempted this, we ran into some issues. We saw a large number of splits where we wouldn't normally expect to see them. The jobs took a really long time to submit to the cluster. We would only ever get back 10,000 objects. And when we would write the actual output to Swift, we noticed that it would write it and then actually copy it. These are all fairly common issues that you run into when you're running Hadoop or Spark on top of an object store.
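Before going through those one by one, just for reference, the wiring itself is small: you give the driver a named Swift service with Keystone credentials in the Hadoop configuration, and then address containers as swift://container.service/path. A rough sketch looks like this; all the values are placeholders, and you should check the property names against the driver version you're actually running.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListSwiftContainer {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // A Swift "service" named sahara, authenticated through Keystone.
            // Endpoint, tenant and credentials here are made-up placeholders.
            conf.set("fs.swift.service.sahara.auth.url", "https://keystone.example.com:5000/v2.0/tokens");
            conf.set("fs.swift.service.sahara.tenant", "engineering-analysis");
            conf.set("fs.swift.service.sahara.username", "bigdata");
            conf.set("fs.swift.service.sahara.password", "not-a-real-password");
            conf.set("fs.swift.service.sahara.public", "true");

            // The driver makes a Swift container look like a Hadoop FileSystem.
            FileSystem fs = FileSystem.get(URI.create("swift://curated.sahara/"), conf);
            for (FileStatus status : fs.listStatus(new Path("swift://curated.sahara/events/"))) {
                System.out.println(status.getPath() + "\t" + status.getLen());
            }
        }
    }

With that in place, anything that speaks the Hadoop FileSystem API, such as Hive, Pig, Spark, or Presto, can read and write swift:// paths through the same driver jar.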
Folks at other companies that are doing this have run into some of these same things. And just digging around, we noticed, with the help of some of the folks on the cloud side of the house, that Ceph needed a little bit of tuning as well.

So, the large number of input splits: this one was easy. There is no concept of blocks in a REST-API-based object store, not in the truest sense of the word, the way there is in the HDFS world. The Swift driver's default block size is 32 megabytes, and most files in big data are hundreds of megabytes or gigabytes apiece, so we just raised the block size to a point where it made sense, something like 128 megabytes, which is typically what you would run a Hadoop system at, somewhere between 64 and maybe 512 depending on the jobs you're running. That fixed the problem of having way too many input splits, which is bad because you end up with a very inefficient system that spins up way too many map tasks, or tasks in Spark.

The next challenge we ran into was slow-launching jobs. We would submit a job to the cluster and it would take minutes before it actually got submitted and started running. We figured out that what Hadoop was doing was first listing all the stuff in the container to figure out what was supposed to be part of the input set, and then actually making calls to every single object, either to get metadata about the object or to get its block location, which again doesn't make a lot of sense in the case of these REST-based object stores. This resulted in O(N) behavior: if you've got 10 or 20 thousand objects in a container, and every request is an HTTP call, that's a little bit expensive. It's not like an HDFS name node call. So it took a while, minutes in some cases, to submit jobs. What are some of the approaches we thought about to solve this? There are some configurations in Hadoop you can use to spin up multiple threads to read from the file system. That doesn't help in this case, because it works at the directory level, which would be a container in Swift; if you were using multiple containers, that would work fine. Some folks have overridden getSplits, which is the method that's actually doing all of this work. That works well and also solves some other problems, but in our case it wasn't a great fit, because, as you saw from the ecosystem of tools, if we had to override getSplits for all of those different scenarios, we'd end up with lots of custom code lying around. So we tried to avoid that. So what did we do to fix this? We went into the Hadoop SwiftFS code and essentially extended something that was already there. There's already the notion of a file system being location aware or not; of course, by default it's not in this case. So we extended that to some of the other methods that were there, and it eliminated all of these unnecessary calls for information that the driver didn't really need.
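The flavor of the change, very roughly, is this: for a store that isn't location aware, there is no point making an HTTP call per object just to ask where its blocks live, so you can hand back a single synthetic location instead. The sketch below is only an illustration of that idea, not the actual patch we deployed.

    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;

    public final class ObjectStoreBlockLocations {

        private ObjectStoreBlockLocations() {}

        // When the underlying store is not location aware, fabricate one block
        // location covering the whole object instead of issuing a metadata or
        // location request per object during job submission.
        public static BlockLocation[] forObject(FileStatus file) {
            if (file == null || file.getLen() == 0) {
                return new BlockLocation[0];
            }
            String[] names = {"localhost:50010"};   // placeholder "datanode" address
            String[] hosts = {"localhost"};
            return new BlockLocation[] {
                new BlockLocation(names, hosts, 0, file.getLen())
            };
        }
    }

Once a location comes back without a round trip per object, the submit time stops growing with the number of objects in the container.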
We localized the changes to just that driver, that one jar file, which meant we could deploy it out to the cluster and it becomes available to Hadoop, Spark, Presto, Datameer, Hive, Pig; all of those tools can then take advantage of it. Jobs launch faster, and the load on the object store is reduced, because we're not making thousands of calls. It works across the tool ecosystem, and it improves the interactive query experience, because you don't have to wait for the job to get submitted. You can see the order of magnitude difference here; it's basically a flat line at this point. So that was that.

Then we noticed that Swift only returns 10,000 objects per listing. If you know the Swift API, you know that's the case. So we implemented essentially basic pagination. It's part of the Swift API; it just wasn't supported in this particular code we were working in. If the call to list the objects returns 10,000, it will just make another call. The Python Swift client already does this, so we essentially mimicked that behavior. And we made it configurable, so if you want to ask for fewer, you can, although I don't know why you would want to do that.

Lastly, job output write-and-rename. Traditionally, Hadoop writes all the data from a job out to a temporary location and, when it finishes, renames the files. That's fine in a file system, not so fine in an object store. In Swift it results in essentially a copy and then a delete, and if you're writing out a lot of data, gigabytes, hundreds of gigabytes, maybe terabytes, that can take a long time. There are two approaches you can use to fix this. The basic approach is to just not do the temporary output: override that class and have it write directly to Swift. A more sophisticated approach is to use some local ephemeral storage, have your tasks write out there, and then, when they're done, push that into Swift. We're actually in the process of working through that approach as well.

And finally, the Ceph architecture and tuning. These are some things that we came across as we scaled the jobs up, with the aid of the Comcast cloud team. We worked really closely with the folks on that team; they've been a great help to us through this process, and of course that's going to continue. These are basic things that make sense if you know anything about Ceph. I wasn't an expert in Ceph when this started, but I think I have a better understanding now. Scale things out horizontally: we use about a two-and-a-half-to-one ratio of Ceph OSDs to RADOS gateways in this particular case. Enable container index sharding and increase the placement groups, so the indexes on the RADOS gateways are spread out across more disks. And then we found that the merge threshold and split multiple settings were set very small, so we were ending up with a ton of directory splits on the actual disks in the way Ceph was writing files out, which led to poor performance. So we increased those, and we turned off some logging, which of course always speeds things up.

So what lessons have we learned? Get to know your OpenStack architecture. If you're going to be a big tenant and do things that are outside the box of a traditional cloud workload, you should understand how things work. You don't need to know exactly what kind of servers you're running on, but you at least need to understand the topology.
So we spent time doing that. Understand the impacts of the design of your cluster, whether you're using ephemeral disk versus Ceph-backed Cinder versus something direct-attached, and how that impacts not only your own workloads but the neighbors in the environment with you. Use ephemeral disk on the node manager if possible. That's probably more true today than it will be a year or two from now, when we get much bigger compute nodes with much more memory and frameworks that use less intermediate disk, but for now it seems to make quite a bit of difference. Understand how you're representing pseudo-directories in Swift, so that everybody's doing it the same way; Swift actually represents them as zero-byte objects, and everybody needs to understand that. Think about how you're going to organize the data in your containers. You might want to age data off, or you may have a case where a year's worth of data is just too many objects to be storing in a single container for performance reasons, so you might want to break it out into a container per month or something like that. And use file formats that reduce IO; we're trying to get away from all that IO, and Parquet and ORC do that.

And lastly, what are the next steps, the future work we're looking at? We want to upstream the changes we made here back into the community. As I said, we tried to make everything generalized and configurable, nothing specific to what we were doing, and we think they're reasonable enhancements for folks that want to try this. We have also noticed, as we've gone through this journey, that every single map task makes a request to Keystone when it spins up, to get authenticated before it tries to get the object. So what ends up happening is that if you spin up a really big job with, let's say, a few hundred maps running simultaneously, those all go out to Keystone and authenticate. You might have 10 or 20 thousand maps in your job, so you end up with all of these requests against Keystone, which is basically DoSing your whole system, and there are other tenants there. So what we plan to do is make the request to get the token once, when you authenticate the first time in the driver of whatever program is running, Hadoop or Spark or whatever, and then put it into the job configuration and hand it out to the map tasks so they can use it. Then we essentially do one Keystone request instead of a bunch, the maps hit Swift directly with the token, and it speeds things up dramatically. Handle a large number of partitions: if you've got a container with partitioned data, that is, pseudo-directories in there by year, month, and day, every single one of those partitions can result in a list status call when Hadoop tries to run a job. That's the next order of magnitude up from what we were just talking about with the per-object get calls. There are a couple of approaches you can use to solve that, so we'll be looking at that as well. And then we've noticed that the maps actually try to get the metadata again and then ask for the objects, so we're going to look at streamlining that a bit also. And I think with that, we've got some time for questions. If you want to ask a question, please use the microphone so it's captured for posterity.

So, I assume at the Hadoop level you do the replicas, and then Ceph does the replicas as well?

Yeah, so that's a good question. That's why we try to limit the use of HDFS, right?
Because your HDFS is going to replicate, let's say, three times, and your underlying Swift maybe three times, maybe two, depending on how you have it configured, and you end up with what some folks like to call a replication amplification engine, right? So ideally what we'd like to do is move all of the data out of HDFS entirely and keep it in Swift, where we have just the two or three times replication and Hadoop doesn't do any replication at that point. You end up with a much more cost-effective solution as well. You might take a slight performance hit there, but in our testing we haven't found it to be all that significant, because in our case the Cinder volumes are on Ceph as well, so we're making a network call either way, whether it's to Swift or to Cinder. And it solves the replication problem.

So the next question is: the Hadoop workload is mainly sequential IO, right? For the larger tracks, do you have any performance data to share on that?

I don't have anything I can share right now in terms of actual absolute numbers around TeraSort or those sorts of things. Typically what we'll share is relative results, like we did here.

Okay, but it does meet all your requirements?

It does today. I think what we found is that we did try to run one of the simulations. We put probably a couple hundred gigabytes of data into HDFS, then we put the same thing into Swift and ran the exact same job, which is a simulation that sucks all of that data in, runs the simulation, and then writes out a whole bunch of results. And we found that the job times were within a few percentage points of one another. That's comparing cloud to cloud, not cloud to bare metal, so take that for what it's worth.

So, Ceph supports an S3 interface too, and I know I've seen some work on the S3 and S3A connectors in HDFS. Did you look at that, turn it on, and just decide it wasn't worth it?

No, we did have them turn it on, and for a couple of different reasons we ended up sticking with Swift. A lot of the code in Hadoop expects you to be talking to s3.amazonaws.com, or whatever the URL is for S3, so first you have to figure out how to force it to talk to your own endpoint. And then we happened to have it deployed on a port that was not port 80. I'm looking at the guy who did that; it's about 5,000 cuts, it's sounding like. So yes, it was more trouble than it was worth. That said, I actually use the AWS S3 CLI interchangeably with the Python Swift client, because that is a tool that does play well with our configuration. It is in some ways, I don't want to say more mature, but it has a larger set of functionality perhaps, and does some things differently than the Python Swift client might, so we use it interchangeably in some cases. In Swift it's a container with objects, and in S3 it's a bucket with objects, and you can access them interchangeably with either of those tools: you put data into Swift, and you can see it using the S3 command-line interface under the same bucket name.

So, when working against Swift, but not on top of Ceph, there's a list endpoints middleware feature that allows the SwiftFS driver to talk directly to the storage node. Is there a similar construct in Ceph?
And if not, does it add any latency that you have to go through the RADOS gateway instead of bypassing that proxy layer?

I actually don't know; I haven't gotten to that level yet. I could definitely find out, and we can follow up with you. That's a good question, though.

Simple question: does the Sahara project have any involvement in this architecture?

They do, in the sense that they're the ones that produced the driver layer that we're using. We've talked with them and said, hey, this is what we're doing, this is what we found, and they're very receptive to working with us. They didn't have any direct input into what we were doing here from that perspective, but in terms of the code and upstreaming and that sort of thing, they're very welcoming.

I'm not really aware of the Sahara architecture, but is it similar to this?

It will work with this. We've gone into a great amount of detail around how we structured the actual nodes in this case, and Sahara gives you some control over that. It lets you pull data out of HDFS, it lets you pull data out of Swift, and now, with some of the Manila features that have been talked about here, you can do things out of NFS as well. So in some respects it will play fairly nicely with this. We don't have any experience yet with Sahara internally within Comcast, though. Thank you. Sure.

You spoke about testing the S3 client as well, and I believe in the Swift ecosystem Hortonworks also has a Swift driver; the SwiftFS was the Rackspace driver. Did you try other Swift drivers, and how did you narrow it down to using SwiftFS?

Yes. The Apache Hadoop distribution has a version of the code that, when I looked at it a couple of years ago, appeared to have been forked from around the Icehouse release timeframe. We actually tried that one as well, and it suffered from essentially the same problems. The changes that I've made, I've actually made in both, and they both seem to function better in that respect. There's then a fairly significant difference between that Icehouse-era SwiftFS driver and what's in the master branch today. The newer one is a little more sophisticated, but it suffers from somewhat more severe performance issues, in that when it lists the directory it tries to run a get-object-metadata call on every single thing, and then when it hands that over to Hadoop, Hadoop calls again to get all the block locations. So it was actually a little bit worse. Most of the changes I made are compatible with both. We focused our effort a little bit on the sahara-extra one, but as I said, the changes should work in both, and I have made the same changes to the Apache one as well. Thank you.

Sure. If there are no more questions, thank you, everyone. Thank you for coming.