Welcome. This is Building Large-Scale Private Clouds with OpenStack and Ceph. Checking you're in the right room. Should we kick off? Yeah. Cool.

Hello, everybody. I'm Andrew Mitry, and this is Anton Thaker. We're from Walmart. Since we're here in Spain, maybe we'll tell you a little bit about who, or what, Walmart is. Walmart is a retail store. It was started in 1962 by Sam Walton. Today it is actually Fortune One, the world's largest company, with 11,500 stores under 72 banners in 28 countries, and with e-commerce websites in 11 countries. For those of you in the EU, Walmart owns ASDA in England, which might be a familiar brand. We have over 100 distribution centers worldwide and over 2 million employees. We are the largest private employer in the world, and we have about 80 million monthly Walmart.com visitors.

We wanted to share with you a little bit about where we are today with OpenStack and with Ceph. I'm going to kick off the OpenStack part, and then I'm going to hand off to Anton to go deep on what we're doing with Ceph.

Our current OpenStack footprint is over 170k CPU cores, and it's growing. We're in about six data centers, and we have about 20 to 30 regions across those six data centers. We use OpenStack-Ansible for deploying and managing our OpenStack, and we actually have two core contributors on OpenStack-Ansible, Jimmy McCrory and Michael Gugino. Our production is currently running on OpenStack Liberty. Besides contributing to OpenStack, we also contribute to Ceph. We're actually the number 16 contributor for the next Ceph release, which is really exciting; I think we have about four or five developers actually contributing code to Ceph right now. We also have a core on OpenStack Watcher, and we contribute to quite a few other OpenStack projects with code, documentation, and bug fixes, and we participate in various working groups as well.

As part of the cloud team at Walmart, we have a platform, an application lifecycle management tool, from a company we bought a few years ago, called OneOps. We open sourced that tool about a year ago, and it is how our developers interact with the cloud infrastructure at Walmart. To give you a couple of stats: our developers, through OneOps, are leveraging those 170k cores. They do about 100,000 of what we call auto repairs, where OneOps allows our applications to self-heal: if a VM has an issue, OneOps can automatically replace or repair it. We do about 40,000 deployments a month, and we have about 3,000 applications running through OneOps on OpenStack. In OneOps we have a concept of a pack, which makes it easier for our developers to deploy various technologies, and we support about 60 different open source technologies through those packs. You can see some of them listed on the screen.

When we started off doing OpenStack, we did it only with ephemeral storage, local disk. The idea being: hey, it's a cloud app, it should be able to run with ephemeral only. In theory that sounds good, and we did move a lot of our e-commerce workload over that way. But we quickly found a few non-ideal states of being in an ephemeral-only world. One is that our CPU and memory needs scale differently than local disk, so we ended up with stranded storage, both in capacity and performance. It also increased our recovery time for some of our apps when we had to rebuild on local disk. And it didn't give a lot of avenues for applications that weren't fully cloud-native.
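To make that stranded-storage problem concrete, here is a toy back-of-the-envelope in Python; the hypervisor specs and flavor shape are invented for illustration, not our actual fleet numbers.

```python
# Toy illustration of stranded ephemeral storage: VMs exhaust hypervisor
# CPU long before they exhaust local disk. All numbers are invented.

HV_CORES, HV_DISK_GB = 40, 8_000          # hypothetical hypervisor
VM_CORES, VM_DISK_GB = 4, 100             # hypothetical flavor shape

vms_by_cpu = HV_CORES // VM_CORES         # 10 VMs fit by CPU
disk_used = vms_by_cpu * VM_DISK_GB       # 1,000 GB actually consumed
stranded = HV_DISK_GB - disk_used

print(f"{vms_by_cpu} VMs fill the CPU, using {disk_used:,} GB of disk")
print(f"{stranded:,} GB ({stranded / HV_DISK_GB:.0%}) of local disk is stranded")
```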
So let's talk about why we would need persistent storage. One of our big use cases is traditional applications, RDBMS-type applications. When I think about persistent storage in the cloud, in my mind I split it up into three groups, and I'll try to talk about all three in this presentation. First, traditional block storage for virtual machines; if you already run Ceph, I think that one is easy to understand. Second, what I call traditional object storage, for various object-type applications; I think that's also self-explanatory. The third type, which is a big use case for us right now, is large-scale object storage specifically for big data workloads. We'll try to cover all of those in this presentation.

One of the things we've done in the past with Ceph is go with very dense storage nodes. I'm not going to name specific vendors, but you can probably guess who it is. These were boxes with 72 large-form-factor spinning drives in them. When you have that many disks in a node, you end up having what I would call challenges. One specific one for Ceph is that we ended up with something like 130,000 or 140,000 threads on the box, and when you have that many threads, all kinds of interesting things start to show up in the Linux kernel: NUMA issues, scheduling issues with the Linux scheduler, all kinds of stuff. And if you're using very dense disks, you end up with a box that has maybe half a petabyte of storage in one server. You can probably still do this if you have literally thousands of servers like that, but we're not quite at that scale; even though we're pretty big, we're not that big for dense storage. So unless you have over 200 boxes in a cluster, or maybe even more, I wouldn't recommend going with something like this.

Something we're also not doing anymore is spinning drives for block storage. SSD prices have come down quite dramatically, so we actually started doing 2x replication instead of 3x replication. And because we're no longer buying very expensive 10K or 15K RPM disks, our price per gig has not really gone up; in fact, it has actually gone down going from 3x replication on spinning disks to 2x replication on solid state drives. So we've been able to transition to solid state drives without really setting off a lot of alarm bells in the finance department.

For block storage? This is only for block storage, yes. For large-scale object storage, the disk access patterns are such that you can actually get away with spinning disks, and of course it is expensive to buy all-flash in large quantities.

The other thing is that unfortunately we sometimes do have to build small clusters, but I try to avoid that at all costs. Ceph really performs better at scale. We've had clusters in the past with maybe 6 or 10 nodes, and that's not a lot of fun if you're doing interesting things.
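Going back to that replication math for a second, here is a rough sketch of the cost argument in Python; the per-gigabyte prices are made-up placeholders, not our actual procurement numbers.

```python
# Rough cost-per-usable-GB comparison: 3x replication on HDD vs 2x on SSD.
# Prices are illustrative placeholders only, not actual procurement numbers.

def usable_cost_per_gb(raw_price_per_gb: float, replication: int) -> float:
    """Raw media price multiplied by the replication factor gives the
    effective price per gigabyte of *usable* capacity."""
    return raw_price_per_gb * replication

hdd_10k = usable_cost_per_gb(0.30, 3)   # 10K/15K RPM enterprise HDD, 3x replicas
sata_ssd = usable_cost_per_gb(0.40, 2)  # commodity SATA SSD, 2x replicas

print(f"HDD 3x: ${hdd_10k:.2f}/usable GB")   # $0.90
print(f"SSD 2x: ${sata_ssd:.2f}/usable GB")  # $0.80
```

With placeholder prices like these, the usable-capacity cost of 2x SSD can come in at or below 3x enterprise HDD, which is the effect described above.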
What is the minimum size cluster you'd recommend? Right now I think our minimum size in production is 16 nodes, and I would shy away from anything smaller than that, in my opinion, because if you have a single node failure with something like 10 nodes, you're affecting more than 10% of your cluster capacity. You lose actual capacity in terms of space, you lose performance, and of course you're going to be moving and shuffling data back and forth, which is going to impact your cluster.

This is actually our current all-SSD Ceph cluster configuration. Unfortunately I can't give you more specifics than that, but you can probably guess; this isn't very cutting edge and there's nothing secret here. Many server manufacturers could build it? Yeah, this is traditional Tier 1 server manufacturer hardware. We do try to limit the RAM; maybe somebody will say that 128 GB is a lot, but we've actually scaled that down from larger sizes. Right now we're doing 24 physical cores, and unfortunately right now that's two sockets, so these are 12-core CPUs, which with hyperthreading effectively gives you 48 threads. Because of our existing data center footprint for the current production environment, if you ever walk into a Walmart store or decide to buy something online from walmart.com, a lot of those applications are going to be hitting something that's on this hardware right now.

Unfortunately we're limited to 10 gigabit Ethernet, so we're doing dual 10 GbE: one network for front-end connectivity for all the Ceph clients, and the second NIC for the Ceph back-end network for replication. Right now we're doing 10 SATA SSD drives per node. Primarily that's because we couldn't get a better chassis in time when we needed to order, and we're trying to change that. In addition to the 10 SSD drives, we're using an additional NVMe device for journaling, so we're not doing journaling on the bulk SSD storage. The primary reason for the NVMe device is lower latency. If you're familiar with Ceph, I'm sure you know this, but Ceph doesn't acknowledge a write until the data has been placed into the Ceph journal on all replicas in the cluster. So if you're doing a lot of low-queue-depth write workload, you're going to see higher latency until that data is flushed to the non-volatile memory device. The other thing a separate NVMe journaling device allows is that we can try to buy cheaper SSDs with lower endurance. That's still a work in progress, and I'll talk about it a little more later. That's basically our current config. Not really anything cutting edge, but it provides the performance we need, and it has enough capacity that we can scale the cluster horizontally rather than vertically.

What's the ballpark performance you're getting out of the cluster? My benchmark is 2,000-3,000 IOPS per SSD; that's the performance you should get out of the cluster. Some people will say that sucks, that it's not a lot. But if you actually look at dollars per gig and dollars per IOP, I think we have a pretty compelling solution compared to traditional enterprise storage. Again, I don't have specific pricing, but you can probably guess how much this is.
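As a rough sanity check on those numbers, here is a hedged back-of-the-envelope in Python for a hypothetical 16-node cluster of the configuration above; the per-SSD figure is the 2,000-3,000 IOPS range just quoted, and the write penalty reflects 2x replication.

```python
# Back-of-the-envelope Ceph cluster IOPS estimate for the config described:
# 16 nodes x 10 SATA SSDs, 2,000-3,000 IOPS per SSD, 2x replication.
# Illustrative only; real results depend heavily on CPU, network and workload.

NODES = 16
SSDS_PER_NODE = 10
IOPS_PER_SSD = (2_000, 3_000)   # observed per-SSD range through Ceph
REPLICATION = 2                  # each client write lands on 2 OSDs

raw_low, raw_high = (NODES * SSDS_PER_NODE * i for i in IOPS_PER_SSD)
print(f"Read IOPS (aggregate): {raw_low:,} - {raw_high:,}")
print(f"Write IOPS (client):   {raw_low // REPLICATION:,} - {raw_high // REPLICATION:,}")
```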
So, some of the issues that we currently have and want to solve. Like I said before, we're using dual CPU sockets, and that creates a lot of NUMA issues. NUMA stands for Non-Uniform Memory Access, and basically what it means is that with traditional Intel architectures, you can't access everything from anywhere in the system at the same cost. If you have a process running on Linux, that process is typically tied to a specific CPU socket, and if that process needs some resource within the same server that sits somewhere else in the system, for example on the other CPU socket, you're actually going to be traversing the interconnect between the two sockets. Depending on how far you push your Linux server, there are a lot of interesting issues you can start having because of that. Samsung published an interesting white paper on this recently, I think this summer. They had a reference architecture box with 24 NVMe drives and two CPUs, but they designed the box so that everything was balanced between the first socket and the second socket. They did a lot of tuning, and if I remember correctly they were able to gain an additional 20 to 30% in performance just by properly tuning all of the NUMA-related things on the box.

Even if you're doing very small IO-size workloads, we found that 10 gigabit Ethernet can potentially be a bottleneck. It certainly is a bottleneck for Ceph recovery: if you have a node fail, all of a sudden you have to shuffle all this data around. So we're looking at going to something faster. In the past, with the very dense 72-drive chassis, we did 40 gigabit Ethernet, but we're looking at 25 gigabit for our next build-out. 25 gigabit is much easier on the networking guys, because you can take a 100 gigabit port and split it into four 25 gigabit ports.

The other thing is that the current solution is unfortunately using two different disk controllers. We couldn't get a controller that could handle all 10 drives, so we have a weird split; I think it's eight disks on one controller and two disks on another, which isn't ideal. Because we scale more horizontally, we're not actually running into controller performance issues, so we want to go with a single controller, and potentially more disks. That brings me to my last point on the slide: traditional server vendors, in my opinion, are not innovative enough, so we're looking at some less traditional server vendors, some ODMs that can provide these solutions. Some of them are on the floor at this conference.

Some of the things we want to try in the future, and we've already started doing this in the lab (I didn't want to present something that wasn't true, so this isn't actually in production right now): we want to use a single socket with the v4 Xeon processors. You can actually get enough CPU cores now, enough CPU processing power for Ceph, with just a single socket. We're not doing the highest-end CPU bins. I don't think we need that much performance, and also, as you get up to the highest end of the CPU spectrum from Intel, you end up paying a lot more per core than for something in the middle. So we're somewhere in the middle. You can ask me later what it is.
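On that NUMA point, here is a small hedged sketch, my illustration rather than anything from the talk, that reads Linux sysfs to report which NUMA node each NIC and NVMe device hangs off; the devices found will obviously differ per system.

```python
# Report the NUMA node of NICs and NVMe controllers via Linux sysfs.
# A device on NUMA node 1 being served by a process pinned to node 0
# means every IO crosses the socket interconnect.
import glob

def numa_node(path: str) -> str:
    """sysfs exposes the NUMA node of a PCI device; -1 means unknown/UMA."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "n/a"

for nic in glob.glob("/sys/class/net/*/device/numa_node"):
    print(f"{nic.split('/')[4]:<12} NUMA node {numa_node(nic)}")

for ctrl in glob.glob("/sys/class/nvme/*/device/numa_node"):
    print(f"{ctrl.split('/')[4]:<12} NUMA node {numa_node(ctrl)}")
```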
So we are looking at all-NVMe chassis, and we're also waiting for the solutions coming out from AMD. AMD has the Zen architecture, with a product supposed to come out next year that will have more PCIe lanes per socket than what Intel has to offer. So we're looking at that, maybe not for next year but the year after, to see if it will make more sense. But NVMe prices have come down already; at this point you can buy NVMe drives for essentially the same price as a SATA drive, and NVMe gives you a lot more bandwidth. The protocol you use to talk to the actual flash media is completely different and a lot more efficient; I'm sure you're aware of this.

Another thing we want to try is lower-endurance flash. In traditional storage, you're probably aware of storage tiering or caching, used to serve hotter data versus colder data, because your entire cluster is not going to be hot all the time, right? I want to take that approach but apply it to flash endurance, and we're testing this with some of the caching technologies. So maybe next summit we'll be able to present something compelling.

We're looking at things like single power supplies, just simpler servers, because dual power supplies just don't make sense for us. If we're going to have a massive data center failure, we'll just let the whole data center die, so it doesn't really make sense to pay extra for dual power supplies. I don't know if we've settled on this design or not, and I think it also depends on which data center it is, and of course on the scale of your Ceph cluster. We do want to take fail-in-place into the design of the architecture. Right now we're sort of doing that already, just because we don't have time to go and fix every failed drive in the cluster, so we just let things fail in place. But we want to take that to the next level and actually properly plan for it, and maybe have a once-a-year drive replacement party or something like that. And also, as Ceph matures and BlueStore comes into production, maybe next release, we're looking at whether we can lower the CPU count to save on power and, of course, cost.

Now, the current challenges we have with block storage on Ceph, serving OpenStack specifically. First, low latency for small queue-depth workloads: if you have a traditional RDBMS-type application on network storage, you're going to be a bit more sensitive to higher latency from your storage. We're also having some challenges right now with automation. We've standardized on ceph-ansible for the build-out of our Ceph clusters. Raise of hands: who is using ceph-ansible right now? Anybody? A few people. We're still working on the tooling for things like replacement of a node, replacement of a disk, and rolling upgrades of a cluster. In my opinion, what's in the project right now is not quite what we need, so we're still working on that, and hopefully we'll contribute back to the community.
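Coming back to the flash-endurance idea for a moment, the arithmetic behind it looks roughly like this; the drive size, DWPD rating, and write rate below are invented placeholders, not our fleet data.

```python
# Will a lower-endurance SSD survive its warranty period under our write load?
# All numbers are illustrative placeholders, not actual fleet data.

def rated_tbw(capacity_tb: float, dwpd: float, warranty_years: float = 5) -> float:
    """Terabytes written a drive is rated for: DWPD x capacity x days."""
    return dwpd * capacity_tb * 365 * warranty_years

def projected_tbw(write_mb_per_sec: float, years: float = 5) -> float:
    """Terabytes we would actually write at a sustained average rate."""
    return write_mb_per_sec * 86_400 * 365 * years / 1_000_000

drive = rated_tbw(capacity_tb=1.92, dwpd=0.5)      # "read-intensive" class SSD
load = projected_tbw(write_mb_per_sec=8.0)         # assumed per-OSD write rate

print(f"Rated:     {drive:,.0f} TBW")
print(f"Projected: {load:,.0f} TBW -> {'OK' if load < drive else 'too hot'}")
```

If the offloaded journal keeps the sustained write rate on the bulk SSDs modest, even a low-DWPD drive can have comfortable headroom over its warranty life, which is the bet being described.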
Some of the things we've hit very recently: bugs. We're using Ubuntu 16.04 right now, and there was a very nasty Xenial kernel bug, a divide-by-zero bug, that was affecting specifically what the Ceph OSDs were doing. So we kept having these crazy crashes where an OSD dies, and that's not a big deal by itself, but then that box slowly becomes more and more unusable, so we were forced to reboot servers all the time. Thankfully that's been fixed if you're running the latest kernel. There's also a current bug, and I think it's been confirmed, at least we've confirmed it in our environment and we have a developer trying to figure it out right now, in the RBD cache. If you're familiar with the writethrough-until-first-flush option (rbd_cache_writethrough_until_flush): right now, for us at least, it doesn't work, and that's causing all kinds of performance issues. So we've actually set that flag to false, because our guest VMs are new enough that we have a reasonable expectation that they send flushes properly. So that's kind of weird.

Now I wanted to talk about the third use case: serving large sets of data using Ceph. This is a slide for traditional big data. If you're familiar with MapReduce or Spark or traditional big data applications, you traditionally build things as a monolithic cluster, and our current big data clusters are exactly that: identical machines, and we literally have clusters with more than a thousand machines. What that means is it's the same machine with the same disk configuration running the same exact applications on every machine. As you can imagine, Walmart has a lot of big data, so there are all kinds of big data problems we're trying to solve, all kinds of teams working on those problems, and different teams can have different requirements. In an environment where everyone works on one large cluster, if some team needs a specific version of some application installed on the cluster, all of a sudden it affects everyone else using that massive cluster. So there's a balance between moving slowly enough and fast enough, and that creates a lot of challenges.

We're trying to move away from this approach, toward something more like what you would see in a typical cloud environment. So we're moving to something that looks like this: we take individual applications and move them into virtual machines, but the actual bulk data that used to sit on local disk is moving to object storage. Currently we're using both the Swift API and the S3 API, but I think we're going to standardize on one; I'm not sure if we've decided yet. We're currently carrying a bunch of patches on top of SwiftFS, and we're trying to upstream those to the community. We're using things like MapReduce, YARN, Spark, and Presto, as you can see on the slide, and those packs you see are the OneOps packs Andrew was talking about before. Basically this allows us, depending on the data, to have certain large-scale big data sets that are shared across multiple teams. So it's a single place, stored in Ceph. On the object store. On the object store, yeah. On Swift? The Swift API, yes, obviously going through the RADOS gateways.
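To give a feel for what that access path looks like, here is a minimal hedged sketch using python-swiftclient against a RADOS gateway; the auth URL, credentials, container and object names are placeholders I've made up, not our actual setup.

```python
# Read a shared dataset object through the Swift API of a Ceph RADOS gateway.
# Auth URL, credentials, container and object names are all placeholders.
from swiftclient.client import Connection

conn = Connection(
    authurl="http://rgw.example.internal:7480/auth/v1.0",  # hypothetical RGW
    user="bigdata:team1",
    key="SECRET_KEY",
)

# List what's in the shared container, then fetch one object to local disk.
headers, objects = conn.get_container("shared-datasets")
for obj in objects:
    print(obj["name"], obj["bytes"])

headers, body = conn.get_object("shared-datasets", "clickstream/2016-10.parquet")
with open("/tmp/clickstream.parquet", "wb") as f:
    f.write(body)
```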
And if a team needs to process that data, they take it from the object store, download it onto the local machine, and do the processing locally. Once they've finished and packaged the result, they upload it back to either their own object storage container or some kind of shared container in object storage. We're already doing this in production in small, limited sets. I call it production; it's still being tested, but it is essentially production, and we're looking to grow this substantially next year. I think this will help us a lot.

For this type of object storage, this is roughly what we're doing right now. It's the same thing: 128 GB of RAM and the same CPU cores, just to standardize. We are still using 10 gigabit Ethernet in some of our older environments, and we're trying to go with 25 gigabit Ethernet for the newer stuff we're building out right now. Currently we have, depending on the server, roughly 12 SATA drives; these are dense spinning drives, obviously. And we have a single NVMe device, which we're using both as a typical traditional Ceph journal and also for testing LVM cache right now. We were testing both bcache and LVM cache. Unfortunately, we feel that bcache isn't really something that's actively supported by the community right now, so we're shying away from it. The tooling around LVM cache (dm-cache) is a lot more mature, and it's actually been working pretty well, almost out of the box. We see things like bucket indexes going into the cache. We've even tested running the journals through LVM cache, and they get promoted into the cache. That's something we're still testing, and we hope to share more specifics with the community, maybe at the next summit.

Some other challenges in serving the big data use cases specifically. I don't know if you're paying attention to Ceph releases, but there were actually 48 RGW bugs fixed going from 10.2.2 to 10.2.3. And we hit, I'm not going to say every single one of them, but it felt like we hit every single one. We really hit, I want to say, five or six bugs that were really brutal, and at one point we were actually running nightly Ceph builds in production. So that's been challenging and interesting for us.

In my past experience, I've run clusters with lots of Ceph nodes and very few RGW nodes. Now we're actually running into limitations on the number of concurrent connections to the RADOS gateways, so we're scaling horizontally and running RGWs co-located with the OSD nodes; we're almost scaling the RGWs horizontally as big as the cluster. We're also scaling for bandwidth: when you have a massive Ceph cluster, shoving everything through a pair of RGW nodes doesn't make any sense. Maybe somebody will laugh, but for us it was a big deal: we were hitting more than 30 gigabits of sustained throughput going through the RADOS gateways, and there are some challenges with that. I think I mentioned the SwiftFS bugs before. We've hit lots of those, and we have a separate team that works on big data in terms of development, and they've been trying to upstream a lot of those fixes to the upstream project. I think that's about it.
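The same round trip through the S3 API, the other interface we mentioned using, might look roughly like this with boto3; again, the endpoint, buckets, and credentials are invented placeholders rather than our real tooling.

```python
# Minimal download -> process locally -> upload pattern against a Ceph RGW
# S3 endpoint. Endpoint URL, buckets, keys and credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.internal:7480",  # hypothetical RGW VIP
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Pull a shared dataset object down to local disk for processing.
s3.download_file("shared-datasets", "clickstream/2016-10.parquet",
                 "/tmp/clickstream.parquet")

# ... run the local MapReduce/Spark/Presto job against the downloaded file ...

# Push the packaged result back to the team's own container.
s3.upload_file("/tmp/result.parquet",
               "team-results", "clickstream/2016-10-aggregates.parquet")
```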
Just looking into the future: we are getting more and more requests for file-based storage, something like Manila. We're not doing that today. Andrew mentioned that we're running Liberty in production, so we're a little bit behind, I guess. As soon as we can run it as part of our OpenStack we would like to, but right now we're not doing that in production or anywhere. Also, as Andrew mentioned, we run Walmart.com on OpenStack, and we're trying to move more traditional back-end applications into cloud environments. That also means we could potentially take things that today run in a store and run them in a cloud environment. Maybe that doesn't make any sense, but we're sort of looking into that, we're investigating it, so we're potentially trying to figure out storage for a very, very small footprint. Maybe next year I'll present Ceph on Raspberry Pis, I don't know. And obviously containers are a big deal. We don't really have anything in production right now, but persistent storage for containers is a big deal; I think that's something the community needs, and we don't have it right now. Am I forgetting anything? I think that's it. Thanks for listening, and please ask questions if you have any.

I'll repeat the question. The question is: OpenStack has Swift as a project, so how come we didn't use Swift as the object storage for our deployment? I think some of that is historical. At least on our team, we started using Ceph a while ago, and we started at a smaller scale, so we didn't want separate storage systems for block and object. Yes, the converged story of block and object, especially across a lot of distributed locations, was an attractive story for us: not having to run two separate clusters, and giving us not just convergence but a shared storage platform, because we didn't know exactly what the growth patterns were going to look like, and we didn't want to have to move servers or reallocate between object and block.

I forgot to mention, and this is kind of the opposite of what I just said, but the way we're building it out right now, we have smaller Ceph clusters that are used just for block, and those are individual per OpenStack environment, so the blast radius is limited to just that environment. But we're also running larger clusters just for object storage, and those span multiple OpenStack environments: they're in the same data center, but they serve multiple OpenStack environments.

Part of it today, too, is that we now have expertise, monitoring, and automation around Ceph, and we feel it's easier for us to maintain one storage platform than two. And so far, Ceph object storage is working for us as well. We like the immediate consistency of Ceph. In the past we had some issues with Swift in terms of troubleshooting and maintenance, but that was four-plus years ago, a long time ago, and I'm by no means a Swift expert. So it's a good question.

The next question is: do we have data compression or dedupe in our Ceph? We do not. We do use erasure coding for object storage. Right now we use 8+3 with the very safe jerasure plugin, so our storage overhead goes from 3x or 2x down to 1.375x, I think.
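That 1.375x figure follows directly from the erasure-coding parameters; here is the arithmetic as a quick Python check.

```python
# Storage overhead of k+m erasure coding: (k + m) / k raw bytes per stored byte.
# With the 8+3 profile mentioned above:
k, m = 8, 3
overhead = (k + m) / k
print(f"{k}+{m} erasure coding -> {overhead:.3f}x raw overhead")  # 1.375x
# versus 3.0x for 3-way replication and 2.0x for 2-way replication,
# while still tolerating the loss of any m = 3 chunks.
```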
Some applications I know do their own compression, like... man, I'm spacing on the name of the logging one... Elasticsearch. Elasticsearch, yes. Okay. Yeah. Please, go ahead.

The first question is: what are some of the tooling challenges we're having in the operational environment around Ceph? To give you a simple example: I have a disk failure, and someone has to log into the server, which means you're already doing it wrong. If you have to SSH into the server, you have to figure out: is it an issue with my RAID card or controller? Did the disk really fail? Is it a hardware error? Is it a software error? So just troubleshooting failures. And that doesn't take that long, right? But then I have to deal with a server vendor to get a replacement drive in. Usually that's pretty good, but sometimes it feels like I can spend almost a day just doing that. So why bother? Why do I need to do that at all? That's one challenge. As you scale, you have thousands of servers, and it should be a push-button thing: okay, just redo the server, something like that.

I think ceph-ansible itself is really good at the actual initial deploy. Where we want to invest, and hopefully contribute back, is the maintenance of the cluster: we lose a server, we bring the server back in, it goes through all the motions to rejoin the cluster. Those kinds of operational features are manual or kicked off by smaller Ansible scripts right now; they're not all unified into one set of operational tooling. And maybe something like chat ops; I've wanted to play around with that, I just haven't had time. Maybe a Slack bot tells me, hey, this cluster is having an issue, you respond to the bot with some command and it gives you status, instead of logging into some interface. I could do it from my cell phone if I wanted to. Simple things like that.

That's why we do fail-in-place: because Ceph can just keep running. In my past life, I had a cluster that was six nodes, and it failed from six to five to four to three nodes, and it was still running in production. You can do that with Ceph very easily.

And the other question was: what kind of innovations are we looking for from the hardware vendors? Well, finding single-socket solutions was hard. Just more density. If I'm going to let things fail in place, I don't ever need to swap drives out of the chassis, so I don't necessarily need that serviceability. I mean, there are lots of vendors now doing top-loading drives. We're looking at smaller chassis so that we can scale horizontally, so things like 1U servers with 12 drives or more. There is a vendor doing, I think, 30 drives per 1U. We're not ready to buy into that, because we want to scale in smaller blocks, but stuff like that.

Okay, one more question, and then... Anyone else? Yes, in the back. Let me repeat the question: what is the typical latency a VM experiences? Go ahead. Sorry. So our goal was to do less than 1 millisecond, I think. Depending on the cluster and depending on the connectivity, we can easily do less than half a millisecond, I think. Ceph is pretty bad with tail latencies, though.
As you start having issues in the cluster, your tail latencies will go up, so that's kind of challenging. But if it's a properly sized cluster, we can pretty safely do less than 1 millisecond. I don't know if that's compelling for your use case, but it's good enough for us. All right. Thank you, everybody. If you have any questions, you can hit us up here afterwards or at the booth. Thanks. Thank you all.