Thank you for coming today. My name is Andy Robb. I am the technical product manager for the Big Fast Data team at Walmart e-commerce, the e-commerce side of Walmart. Our presentation is in two parts today: I'm going to go through some introductory material and then hand it off to Mingming, who will cover several things in more depth. I will try to get through my portion relatively quickly so that Mingming has as much time as possible and we can get to any questions. I believe today's session is being recorded, so if you do have questions, please make your way to one of the microphones.

All right. Mingming, Rae, and I are on the Big Fast Data team. We are one of two teams at Walmart that run large multi-tenant Hadoop infrastructure within Walmart globally. Our team has been running shared infrastructure since about 2012. Our clusters are in the dozens-of-petabytes and tens-of-thousands-of-cores range. Those are shared amongst all of our users, generally for research purposes. Because the clusters are research targeted, somebody will come on, run a research job, come up with something they really like running, and then decide they want to run it all the time. Of course what happens then is that it needs to run against a deadline, and it's a research cluster: somebody else can come on and run any job they want at any time, and it's very hard to help someone get their job to finish at the right time. That is one of the leading reasons we started a project about a year ago to deploy single-tenant clusters in our OpenStack environment. These clusters might be YARN clusters, Spark standalone clusters, or Facebook Presto clusters.

In addition, there were a lot of teams with varying software version dependencies: somebody either wants to try out something new or stay on something older. We also don't like running streaming applications on our shared infrastructure. Those jobs essentially never end; once you start one, you're going to use those 24, 60, or 100 cores until the end of time, or until your job changes and you're told to do something else. In a shared environment that's not really fair: you are essentially taking cores away from other people's use cases. The idea with shared infrastructure is that you use it for a little while and then you give it back.

We also wanted to be able to independently scale CPU and data storage, a very traditional concern, for isolated use cases. In some cases we did in fact build dedicated Hadoop clusters for teams that had very demanding or very revenue-impacting jobs, but if those jobs were weighted too heavily towards CPU or towards storage, you could end up with a bunch of idle cores or a bunch of idle disk. We realized that we could cover many of these use cases by using our OpenStack infrastructure. A different team started building both the OpenStack infrastructure itself, that's not us, as well as the Ceph storage infrastructure that really enabled us to do this work; they started that back in 2016.

For this talk, we will be talking at a very low level, so I apologize if we go over a couple of people's heads at one point or another. We are hoping to speak to the contributors and operators of Ceph, Swift, and OpenStack installations.
That means anybody who runs Hadoop ecosystem technologies that talk to the Swift API, community members from the Hadoop ecosystem who deal with the filesystem interfaces, and potential operators and highly technical users of the Hadoop ecosystem components, which is anything that can talk to the HDFS interface: MapReduce, Tez, Spark jobs, even Presto. Anything where you can implement something underneath to talk to another system.

So what specifically are we talking about? This is the layer that provides that interface. In Hadoop filesystem land there is an object interface that lets you implement essentially an arbitrary storage mechanism underneath the HDFS API, and that is the software layer we're working with here: the thing that actually enables a Hadoop component to talk directly to a Swift API. Thank you to Comcast for providing the diagram for this, from their talk at the OpenStack Summit in Tokyo in 2015.

This is built on a lot of existing and very good work. First and foremost from the OpenStack Sahara team, which has a subproject called Sahara Extra. In that subproject is essentially the canonical implementation of this driver, which we just refer to generally as Sahara Extra, although it's really a small piece of that overall project. There is also a Swift driver built into Hadoop itself, which we believe is actually derived from Sahara Extra. And Comcast has done some really great work patching the Sahara Extra implementation, and they helped guide us in our early work with this, so thank you to them.

The general architecture we're using for all of this is that persistent data is stored in object storage. The clusters on our OpenStack compute infrastructure are totally ephemeral. Even when you have large ephemeral disks attached to your worker nodes, the idea is that those clusters can be blown away at any time without losing anything in the process. We accomplish that both through the use of object storage and through a shared Hive metastore that all of the ephemeral clusters can talk to. So if somebody loads a table, the data is stored in object storage and the metadata is stored in the shared metastore; you spin up a new cluster, whether that's YARN, Spark, or Presto, and you can immediately query data from it. There's no waiting and no loading data into the new ephemeral cluster; it's just there for you.

The other thing that was really nice about the way the Sahara team implemented the API is that it allows you to plug it into an existing cluster. So we were actually able to add this, invisibly to our users, onto our existing persistent infrastructure so that we can load data into object storage. We only do it within the same data center, but we are able to distcp from one of our persistent clusters out to object storage, so jobs that may still run primarily on a persistent cluster, because we haven't migrated them to ephemeral clusters yet, can copy data over that way. That does not include the metastore, so we would need some job on the ephemeral side to add, say, partitions to a table in the shared metastore, but the bulk of the work can be done by that large infrastructure.

In Ceph, you have the option to use Swift or S3; both APIs are supported by Ceph's object storage implementation.
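To make that filesystem-layer plumbing a bit more concrete, here is a minimal sketch of pointing the Hadoop FileSystem API at a Swift endpoint. This is illustrative rather than our exact setup: the property names follow the hadoop-openstack / Sahara Extra style of Swift driver, and the service name, endpoint, and credentials below are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SwiftFsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Bind the swift:// scheme to a driver class; the class name differs between
    // the Hadoop-bundled driver, Sahara Extra builds, and a fork such as Swift A.
    conf.set("fs.swift.impl",
        "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem");
    // One "service" definition per object store endpoint; "myceph" is made up here,
    // and the auth URL would be your Keystone or RGW auth endpoint.
    conf.set("fs.swift.service.myceph.auth.url",
        "https://keystone.example.com:5000/v2.0/tokens");
    conf.set("fs.swift.service.myceph.tenant", "analytics");
    conf.set("fs.swift.service.myceph.username", "hadoop");
    conf.set("fs.swift.service.myceph.password", "secret");
    conf.set("fs.swift.service.myceph.public", "false");

    // Paths look like swift://<container>.<service>/<object prefix>
    FileSystem fs = FileSystem.get(URI.create("swift://warehouse.myceph/"), conf);
    for (FileStatus status : fs.listStatus(new Path("/tables/orders"))) {
      System.out.println(status.getPath() + "  " + status.getLen());
    }
  }
}
```

Anything that speaks the Hadoop FileSystem API, MapReduce, Tez, Spark, Hive, and so on, can then address swift:// paths the same way it addresses hdfs:// paths, which is what makes the distcp-to-object-storage pattern described above possible.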
On the question of S3 versus Swift, I should start with the fact that we actually spent several weeks at the beginning of this project trying to figure out which one to use. We were really interested in supporting both. There are lots of projects that support S3 and we wanted to take advantage of those, and Swift was also interesting to us because some projects primarily use Swift. So why can't we do both? The software supports it, right? If you are loading basic objects, you can use both: at the binary level, you can load a file with Swift into a Ceph-backed storage system and pull it out with S3 and it's fine. The problem is the pseudo-directories; they're incompatible. As soon as you start doing anything with complexity, especially a Hadoop workload, you're not going to be able to use the two of them together. At that point we realized we had to pick one and stick with it.

On the S3 side, there's broad client-side support, which is awesome. Unfortunately, when we tried using some S3 clients, we realized that a lot of them are built only to talk to the canonical Amazon implementation; actually pointing them at a URL in your own infrastructure is sometimes impossible without a patch, because they just don't support the notion of a URL that doesn't end in amazon.com. Then there's the general concern around a closed standard: the community doesn't own the S3 server-side specification, so if we need changes made to it, we don't really have that option. On the Swift side, client support is not universal, and unfortunately that won't get better without adoption. But if we start using it and ask that the tools we want to run on top of our system support Swift, we can influence the community to support Swift more broadly. And in theory, tweaks and changes can be made more quickly with Swift because the community does own that spec. In the end, we decided to go with Swift; hopefully that's relatively obvious by now.

All right, so we started from a relatively old version of Sahara Extra. There are a couple of different branches of it; we were using one called Icehouse, with patches that we had some help with. Some of the issues we ran into immediately: Hive queries on ORC-stored data essentially fail if you're doing anything with even moderately sized data. There was an uncontrolled number of HTTP connections; we'd run a query and might see 10,000 connections per node, which will overwhelm a couple of RGWs really quickly. There were really slow metadata operations, so deletes, renames, and copies, especially with high object counts, took absolutely forever. When we ran larger jobs, the lists of objects being returned from a list operation were truncated: we'd load some data into a table, query it back, and be missing half the table. As far as Hive was concerned, that data didn't exist, because the objects storing it simply weren't being returned; they weren't being made available to the system. And we ran some tests with long-running processes, in this case Presto, where the application stays resident (this would probably also apply to Impala). We would run some queries one day, go home, come back the next morning, run the same query again, up arrow, enter, and the query would fail, and we had to restart the cluster to fix that.
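As an aside on the truncated listings: the Swift listing API caps how many names come back from a single GET on a container, and a client has to page through the results itself using the marker parameter. Here is a minimal sketch of that paging loop; SwiftRestClient and SwiftObject are hypothetical stand-ins, not types from Sahara Extra or Swift A.

```java
import java.util.ArrayList;
import java.util.List;

public class PagedListing {
  // Hypothetical low-level client: one call maps to
  // GET /v1/<account>/<container>?prefix=...&limit=...&marker=...
  interface SwiftObject { String getName(); long getLength(); }
  interface SwiftRestClient {
    List<SwiftObject> list(String container, String prefix, int limit, String marker);
  }

  static final int PAGE_SIZE = 1000;   // stay well under the server-side cap

  public static List<SwiftObject> listAll(SwiftRestClient client,
                                          String container, String prefix) {
    List<SwiftObject> all = new ArrayList<>();
    String marker = null;              // no marker on the first request
    while (true) {
      List<SwiftObject> page = client.list(container, prefix, PAGE_SIZE, marker);
      if (page.isEmpty()) {
        break;                         // nothing left to fetch
      }
      all.addAll(page);
      // Listings come back in name order; the next page starts after the last
      // name we have seen, so that name becomes the marker for the next request.
      marker = page.get(page.size() - 1).getName();
      if (page.size() < PAGE_SIZE) {
        break;                         // a short page means we reached the end
      }
    }
    return all;
  }
}
```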
The API also wasn't re-authenticating against Keystone properly. And finally, we couldn't get large object support, which was at least partially implemented in that branch, working for us; we couldn't get it to correctly break up a file larger than five gigabytes.

Okay, so why did we go off on our own a little bit? We spent several months patching the existing Sahara Extra code base in a way that would let us return those patches to the community, and we realized it was taking a really long time, and that there were some fairly dramatic changes we needed to make that we couldn't do just with pull requests. So we ran an experimental side project where we said, all right, let's just try making a bunch of changes and not worry about making them look nice for a pull request. That ended up being very successful for us, and we were able to add performance features and fix a bunch of the issues we had been running into very quickly. We also changed the name: Swift A is intentionally slightly different, a bit of a nod to S3A, and it made testing easier. Even the class name we use is Swift A instead of just Swift, so we can load both jars into an application and switch between the implementations purely through configuration.

Specific features we implemented: bounded thread pools for things like listing, copying, deleting, and renaming; in some cases that adds parallelism, in some cases it limits it. Multiple write policies that adjust how the driver uses local storage on the worker you're running on, as well as the upload behavior. Redesigned ranged seek support so that we could run Hive queries against ORC files. Pagination, so that we could get more than 10,000 objects back from the server side. An LRU cache to limit HEAD and stat calls against the API, which, depending on whether you're reusing a particular set of files, can themselves be overwhelming and slow things down. And something we call lazy seek, which adjusts when we actually issue a REST call to the server side; that really sped up our Presto queries. Along with all of this, we added a small patch to Ceph's implementation that addresses a performance penalty we ran into with large objects, which we'll talk about shortly.

On large object support: originally we couldn't get the client-side support to work, so we built on it until it worked correctly, and we can now split at an essentially arbitrary size. Once that was working, we ran into another issue. Run hadoop fs -ls on a directory that has subdirectories, large objects, and files in it; it doesn't really matter what else is in there, as long as it has large objects, this happens. The directories are returned correctly and look like directories. The files come back correctly and look like files. The large objects, unfortunately, also look like directories. That's how they're implemented and that's how the system works, but there is no indication in the objects you get back from a list call that those subdirectories are actually large objects. Unfortunately, that's what a lot of work is based on, so there were cases where jobs would fail or user scripts wouldn't work properly. That was itself problematic.
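One of the features mentioned above, the LRU cache for HEAD and stat results, is conceptually simple. Here is a rough sketch of the idea, not the actual Swift A code, with ObjectMetadata standing in for whatever the HEAD response actually carries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StatCache {
  // Illustrative holder for a cached HEAD/stat result.
  public static class ObjectMetadata {
    public final long length;
    public final boolean directory;
    public ObjectMetadata(long length, boolean directory) {
      this.length = length;
      this.directory = directory;
    }
  }

  private final int capacity;
  private final Map<String, ObjectMetadata> cache;

  public StatCache(int capacity) {
    this.capacity = capacity;
    // accessOrder = true keeps the least recently used entry first, so evicting
    // the eldest entry when we grow past capacity gives LRU behavior.
    this.cache = new LinkedHashMap<String, ObjectMetadata>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, ObjectMetadata> eldest) {
        return size() > StatCache.this.capacity;
      }
    };
  }

  public synchronized ObjectMetadata get(String path) {
    return cache.get(path);            // null means we still have to issue a HEAD
  }

  public synchronized void put(String path, ObjectMetadata metadata) {
    cache.put(path, metadata);
  }
}
```

The point is simply that repeated metadata lookups for the same objects, which Hive and Presto issue constantly, stop turning into repeated round trips to the RGWs.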
The problem is that to fix the large-object ambiguity, you have to issue a stat call for each of those subdirectories to check whether they are, or are not, in fact directories. Doing that for a few is fine. Doing that for 10 or 20 thousand subdirectories is really a drag, and can seriously degrade performance on something like a Hive query that does a recursive walk down a directory tree. So we added a patch to Ceph, and a little bit of code in Swift A, to adjust that behavior so that we get back a little extra information telling us that those subdirectories are actually large objects, and that dramatically improved the performance of queries and the general behavior of the system. We are hoping that what is essentially a hack can be implemented properly in the community as part of the Swift standard, which is something S3 already supports, so that this particular use case can run effectively.

All right, caveats to what we've done: we have not tested against a Swift-proper cluster, only against a Ceph cluster with the Swift API on it, and the inefficient list mechanism is the one we just talked about. The patch that we applied to Ceph you can see in pull request 14592; basically all we populate is a header with the total size of the object, and that's what gets returned to us in the list call, so we know how big the whole object is.

So, performance results; do you want to take over? Sure. Mingming Lu is now going to talk about the performance we measured for the system.

Yes, so we did a couple of performance evaluations to compare Swift A against Sahara Extra on our Ceph storage clusters. One thing we wanted to test is how well the bounded thread pools perform, so we evaluated several filesystem operations: deletion, renaming, and uploading. We also evaluated the differences between the several write policies we implemented in Swift A, looking at both filesystem operations and MapReduce benchmarks such as HiBench, from which we picked the WordCount job. Here is the spec for our experiments: we ran on a number of OpenStack VMs, each with local SSD storage, and tested against Ceph HDD storage clusters, with RGWs and HAProxy on top.

The first result compares Swift A and Sahara Extra on a single hadoop fs -rm operation, deleting a large directory generated by HiBench WordCount at its largest scale, bigdata: 1.6 terabytes across 6,600 objects, split into 256-megabyte chunks. Sahara Extra is inherently single-threaded for this operation, while Swift A is able to use multiple threads to do the deletion, and you can see the performance gains there. The next operation is renaming, with exactly the same setup as the deletion; a rename is essentially a copy plus a delete. Again comparing Swift A and Sahara Extra: because Swift A can use multiple threads for the renaming, with only three threads we already see a 3x performance gain, and with more threads we bring the rename down from roughly an hour to a few seconds.
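The bounded-thread-pool pattern behind those delete and rename numbers is also straightforward to sketch. This is a minimal illustration, not the Swift A implementation; ObjectStoreClient is a hypothetical stand-in for the low-level Swift calls, and a rename would do a copy-then-delete per object with the same kind of pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BoundedDelete {
  // Hypothetical low-level client: one call maps to one DELETE request.
  interface ObjectStoreClient { void delete(String container, String name); }

  public static void deleteAll(ObjectStoreClient client, String container,
                               List<String> names, int maxThreads) throws Exception {
    // A fixed-size pool caps the number of in-flight requests, so a big delete or
    // rename gets parallelism without flooding the RGWs with connections.
    ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
    try {
      List<Future<?>> pending = new ArrayList<>();
      for (String name : names) {
        pending.add(pool.submit(() -> client.delete(container, name)));
      }
      for (Future<?> f : pending) {
        f.get();                       // surface any failure instead of dropping it
      }
    } finally {
      pool.shutdown();
    }
  }
}
```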
Now, here are the three write policies we implemented in Swift A. The first is called multipart single-thread: we divide very large files into small splits (this is the multipart split), and in this single-threaded implementation we only need a very small amount of local disk, because we write one split at a time to local storage and upload one split at a time, sequentially. The second policy is called multipart no-split: we save the whole file to local storage, assuming we have enough of it, and then upload it via byte ranges in multiple threads in parallel, which makes the upload faster. The third policy is really a combination: we buffer only a few splits in local storage, subject to the number of threads enabled, and upload them asynchronously while the local writes continue.

Here is the result of uploading a single 100-gigabyte file, split into several chunks, with a single hadoop fs -put on one SSD compute node. Single-thread, one split at a time, is the slowest of the three, but it requires the least local storage. No-split requires the whole 100 gigabytes of local storage to do the upload. The multipart split policy has the best performance regardless of the size of the split, which tells us that asynchronous, multi-threaded uploading really brings a lot of performance gain.

We also compared the three policies while running a MapReduce job, HiBench 6.0 WordCount, at three scales: huge, gigantic, and bigdata. Note that the bigdata scale generates 1.6 terabytes of data with 60 mappers and 60 reducers. We set four gigabytes per mapper and reducer, ran on 10 SSD compute nodes, each with 52 gigabytes of memory, and used the default settings for the Swift A thread parameters. Here, too, the split policy has the best performance of the three.

Lazy seek is also an important feature in Swift A. It seeks only when it actually needs to read data, and it removes a huge amount of connection overhead on the input streams, which is common in Presto queries. Note that a similar feature has been implemented for S3A as well, in a Hadoop JIRA ticket.

As future work, we plan to open source this after internal workload validation, most likely within the Walmart Labs GitHub repository. We are also looking to investigate using local tiered storage, that is, memory as well as local disk, for buffering before multipart upload. We are also planning to look at multiple read policies to improve download speed, that is, fetching objects from the object storage. And we are very interested in supporting both the Swift and S3 protocols at the same time, meaning that a Swift client could read data from objects generated by an S3 client. That requires a rewrite of how the two protocols generate pseudo-directories, because currently they put a zero-byte marker file in different places to indicate whether something is a directory or a file.

All right, I'm handing it back to Andy to conclude the talk. So, first of all, we were able to get Swift A to scale for us, and we were able to run some very large workloads with it internally for our testing.
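The lazy seek behavior Mingming described can also be sketched in a few lines. This is a simplified illustration of the idea, not the Swift A code: seek() only records the target offset, and the ranged GET is issued the first time a read actually needs bytes, so a pattern like open, seek, seek, read issues one request instead of several. RangedObjectReader is a hypothetical stand-in for however the driver opens a byte-range request against the object store.

```java
import java.io.IOException;
import java.io.InputStream;

public class LazySeekStream extends InputStream {
  // Hypothetical: opens GET <object> with a Range header starting at startOffset.
  interface RangedObjectReader {
    InputStream open(long startOffset) throws IOException;
  }

  private final RangedObjectReader reader;
  private InputStream wrapped;   // null until a read actually happens
  private long position;         // where the next read should start

  public LazySeekStream(RangedObjectReader reader) {
    this.reader = reader;
  }

  // Cheap: just remember the target offset and drop any stale connection.
  public void seek(long newPosition) throws IOException {
    if (wrapped != null) {
      wrapped.close();
      wrapped = null;
    }
    position = newPosition;
  }

  @Override
  public int read() throws IOException {
    if (wrapped == null) {
      wrapped = reader.open(position);   // the REST call happens here, not in seek()
    }
    int b = wrapped.read();
    if (b >= 0) {
      position++;
    }
    return b;
  }
}
```

A real driver would also avoid reopening the connection when a seek lands just ahead of the current position, but the deferral itself is what cuts the connection count for Presto-style access patterns.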
For that testing, we actually used production workloads that we just happened to have snapshots of. We would like to merge this work; we don't really want to maintain a filesystem driver for all time. So we'd really like to get this work merged back into the community, and to get some of the changes we've discovered are necessary for really performant operation made to the standards at some point. We would love folks' help with that merging and with making the code better in general. And again, for the large object support when it comes to pseudo-directories, there's still a little bit of work to be done in the community. You may have noticed a couple of oddities in some of our testing data; we noticed them too. We haven't isolated the exact reasons why some of those numbers were off, a little in some cases and a lot in others; figuring out those outliers is part of the work we have left to do. So with that, are there any questions?

Hi, did you test Swift A exclusively against the Swift API provided by Ceph, or do you have any knowledge of how it would perform against Swift proper, so to speak? We really only have Ceph clusters available to us, so we weren't able to test against Swift proper.

Okay, a follow-up: does Swift on Ceph implement an atomic rename? I've used distcp and noticed that, of course, distcp keeps temporary files and then moves them to a final commit or staging location, which causes massive shuffles because it's essentially changing keys. I'm wondering if Ceph has a similar problem there. Kyle Bader from Red Hat, on the Ceph team, just said that no, it's essentially an atomic operation. Thank you. Yeah.

Hi, I work upstream on Sahara and we're definitely interested in your work, so let's talk about getting this upstream into something like Sahara Extra. That would be awesome, yes, thank you. Thanks, and we appreciate you coming to the talk.

Good talk and excellent performance improvements. A quick question, though. You talked about three issues, right? I can understand the performance improvements from the thread pool, but you also talked about a large number of HTTP connections with the Swift filesystem driver and, I think, slow performance with large object counts. Can you talk about any improvements in those areas? Yeah, so specifically on that: we found that Sahara Extra in some cases enables thread pooling, but it's uncontrolled, so you cannot actually bound it. For instance, in Hive you have the MSCK REPAIR TABLE statement, where for a very large table you essentially read all of the directories and filesystem metadata and try to load them into the Hive metastore. In that call we would create an uncontrolled number of HTTP requests to the RGW side, the RGWs would essentially be overwhelmed, and you could not get any results back. The other question was about large directories. The large-directory performance issue was directly related to the ability to get back metadata about whether a subdirectory you are listing is in fact a directory or actually a large object, and that required a tweak to both Ceph's Swift server-side implementation and Swift A.

Yes, good talk, thank you. In real cases, for example for your workloads, are large objects more important, or massive numbers of small objects, or both? It depends.
So in our case, we've noticed that if we break even moderately sized objects into smaller objects, it's a little easier to parallelize the reads for them. In cases where we keep large data sets in single containers, purely for logical organization purposes, it's necessary to tailor the object size to get as close to the limit as possible without going over, so that the total number of objects in that container stays minimized. So it totally depends on the workload you're targeting or the particular data set you're talking about, at least in our case.

Yeah, the second question is: when you bring in a lot of threads on the client, I guess in the client or in the Swift APIs, does it help? So when you issue a call, for instance for a MapReduce job, those threads are not only on the client but also on all of the mappers, all of the worker nodes. Yeah, so the question is, is there any overhead brought by this design? Because now you bring in a lot of extra threads, and they may take CPU cycles away from the job. In my understanding, since object storage only allows HTTP requests to fetch an object, those threads are inevitable, and the only way we can make it better is to have better control over them, so that we know how many threads are initiated from the client and from each of the Hadoop worker nodes toward the server. So you don't see any real problems when you run your applications? We've seen a lot of problems, yes. One of the pieces of work we've spent time on is figuring out what the right number of threads to use is. Even in some of the testing you can see highly diminished returns when you start really cranking up the number of threads. There are a number of reasons for that, but yes, at some point you get to a state where it doesn't matter how many more you add; you may actually degrade your performance. Okay, thank you.

You talked about using dynamic large objects for the large object support. Did you investigate using static large objects, and why did you choose DLOs instead of SLOs? I think we went with DLOs mostly because they work a little bit better with our use cases, which are always within a single container. So it's a little more straightforward: nothing crosses containers, and reading the manifest isn't really necessary for a dynamic large object, so you don't need to figure out where you're supposed to go to grab things.

I had a question about the impedance mismatch you mentioned between S3 and Swift in the way they do the fake directory listings. Can you expand on that? In a lot of libraries, for example jclouds, you'd often use a similar approach of creating a zero-size object with a trailing slash, which works on both providers. So what specifically did you run into? I'm just curious. There's a zero-byte file created on the Swift side along with the directory, at the level of the directory, right? You create a zero-byte file to indicate this is a directory, not a file. That's the Swift convention. On the S3 side there's something similar, but that zero-byte file is placed at a subdirectory level. So without code changes, I think these two clients are incompatible. In a lot of cases we create data using one client, right? We create data using the Swift client, and then those directories are arranged that way.
But then the S3 client cannot read the data directly. Sorry, you may need to explain that a little more clearly. I guess I was getting at the point where, say, I'm trying to create a directory foo with a subdirectory bar; in both places, I'd imagine creating a foo slash and a foo slash bar slash. And these are just objects, right, because of the flat namespace, so there's no actual notion of a directory necessarily. So could you explain one more time? I must have been missing something here. I mean, it's essentially hacked in, right? If you're using just the Swift CLI, that's all you have to do, for whatever reason, and I'm not totally clear on the details. But in order for Hive to see those as directories, you actually need to add a zero-byte file that doesn't include the slash. So you end up with two objects: one that is bar slash foo, no trailing slash, and then another one that is bar slash foo slash. Without that, Hive doesn't see the subdirectory as a subdirectory and you end up with a bunch of very strange data. On the S3 side, there's a similar convention, but it's just slightly different; I think there's actually a string you're supposed to append, like the literal word "directory" or something, that normally gets hidden by the API. And this is all from memory, from Google searches a year ago when we were trying to do this. So it's not that they're totally incompatible; they're only incompatible in very superficial ways, but it's bad enough that it's frustrating to work with. Yeah. And I have to say, these are implementations in the Swift and S3A driver code; it's not something created by the Swift or Ceph backend, it looks like. If you have any more questions, we'll be around afterwards. Again, we are really looking forward to working with the community on this, and thank you very much for coming.