Okay, so we're from Comcast, and we're presenting architectural considerations for big data workloads on OpenStack. My name is James St. Rossi, I'm a principal engineer for Comcast, and this is Jonathan Chang; introduce yourself, sir. Hi, thank you for having us, my name is Jonathan, I'm a cloud architect at Comcast, and I'm very excited to be here. Very tired, I'm sure all of you are, but we'll try to get through this. Tired, and a little under the weather, so if I'm a little monotone, that's why. I still love you guys, but... I want to give some credit to the third person on this list, who was actually supposed to give this talk but did not make it out, so Chris Power is missing in action. So for video purposes, thank you Chris for putting us in this position. So, a little bit about what this talk is and what it isn't. It is going to be a somewhat prescriptive way of thinking about how we at Comcast, who've been doing OpenStack for quite some time, think about doing big data workloads. It's a bit of a shift for us: we've been very much a general-purpose cloud for a few years, and modern big data workloads have really transformed the way we think about doing OpenStack, so this is a little bit of prescriptiveness. What it isn't is a broad paintbrush that tells you exactly what you need to do in your infrastructure, because there is, excuse my French, a shitload of "it depends" that we've been weeding through for the past couple of months thinking about this.
So we're going to do a little bit of a humblebrag about Comcast, we're going to talk specifically about our application profiles, which are not going to be different from what you guys are running in your data centers, and we're going to focus on disaggregated storage versus hyper-converged compute nodes, because I think doing both is where the magic is. So my punchline, we're going to wait until the middle of the slide deck for that. I heard somewhere you need to tell them what you're going to say, then tell them, and then recap it. And I mean, it's up there too. So we're going to give you our prescriptions, and we hope they're useful, and if not, it depends. And James is going to dive a little deeper into disaggregated storage, specifically with S3 and Swift. I'm the gardener, I'm going to get into the weeds. Yeah. So we're Comcast NBCUniversal. We are one of the nation's largest video, high-speed internet, and phone providers, and with the acquisition of NBCUniversal, we are a bunch of cable channels, a couple of studios, a few theme parks, and a sports team or two. This is our current OpenStack deployment: over a petabyte of memory, more than a million CPU cores, and multi-petabyte Ceph, which is one of our primary storage back ends, plus multiple terabytes of SSD back end. We're deployed across 34 national and regional data centers. Today we're actually still running on Icehouse, but we are moving to Mitaka; that's another conversation. We're very proud of our community contributions. As of maybe a week ago, we're at about 95,000 lines of code and 1,200 commits. We have core people on projects like OpenStack-Ansible, and we're very active in the community as well; in 2015 we won the Superuser Award. We also contribute outside the Big Tent, to Ansible, with OSA and ceph-ansible as well.
So fundamentally, we're not going to cover all the use cases we have for big data; we're going to try to cover the ones that are pretty common across the user base. So obviously Kafka: we have 30 million plus set-top boxes, all of which we're collecting network telemetry from, all of which we want to make real-time decisions with, to tell us if the network is performing well or poorly. And obviously we need some place to store that data and run MapReduce functions against it. And this thing here on Pulsar, where the font didn't really come through: Pulsar is our own NoSQL database. Think of it as Cassandra. And so James is going to deep-dive a little into the application profiles while I drive the slides. All right. So our first application profile is Kafka. We have 100% sequential writes, a very high write workload. And that's typical of some of our workloads: we do get very heavy write workloads, and a lot of storage systems really aren't designed with that in mind. They assume there's going to be a balance, like a 70/30 split or something like that. For Kafka, latency is also a very important consideration, and that factors into a lot of our decisions regarding the storage aspects of it. We do use separate ingest and consumption clusters, and I'll get to that in a later slide. All right. Now we have Pulsar. Pulsar is a highly scalable NoSQL database service for products that need eventual consistency and interactive latencies at scale. I'm not so involved in the Pulsar project, so I'm kind of reading through my notes here. It's a database as a service with geo-replication between national data centers and intra-DC replication for local high availability.
Pulsar is an API-compliant clone of AWS's DynamoDB service running in Comcast's private cloud infrastructure. So obviously, we're trying to find ways of cutting costs and not giving so much of our budget to AWS. You're probably not going to run this, but, as he said, think Cassandra. All right, Hadoop. We do a lot of Hadoop; that's the one I've been mainly focusing on, working closely with Chris Power. I'll get into this in more detail later on. You guys probably know Hadoop pretty well. In our use case: low IOPS, very large blocks, YARN as the node manager, high-performance admin nodes. We use ZooKeeper, and the NameNodes; we'll get to the NameNodes. All right, so the key objectives for modern workloads: performance. You have to ask some questions, right? High performance at what cost? Is it worth the million dollars to get that high performance? You're going to have to decide on that, obviously. How many aspects of your latency do you control? So great, your group controls the deployment of the servers, but are you at the mercy of another group for your network deployments? Do you have any control over using something like RDMA to try to get your latency down? It's a holistic approach: you have to look at it through all aspects, or know where your limitations are. The other thing about performance is you really have to do thorough testing. Measure performance in your test clusters: IOPS, throughput, latency. Those three are the key factors we look at. A lot of times people think only IOPS and throughput. We'll have engineers benchmarking systems, and one will say, I got this terrific number of IOPS and we're doing this amount of throughput. And I'm like, great, what's your latency? And he'll say, one second. I say no; one second is way too high a latency. So that's performance. Availability, reliability, resiliency.
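One cheap way to keep that latency conversation honest in benchmark reviews is to report percentiles, not just IOPS and throughput. Here is a minimal sketch; the sample numbers are invented for illustration, not Comcast measurements:

```python
import math
import random

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p percent of the data."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100.0 * len(ordered)) - 1)
    return ordered[k]

# Hypothetical run: 98% of I/Os finish in under 2 ms, 2% are very slow.
random.seed(1)
latencies_ms = [random.uniform(0.5, 2.0) for _ in range(980)] \
             + [random.uniform(200.0, 1000.0) for _ in range(20)]

# The mean blends the tail away; p99 exposes it.
print("mean %.2f ms" % (sum(latencies_ms) / len(latencies_ms)))
print("p50  %.2f ms" % percentile(latencies_ms, 50))
print("p99  %.2f ms" % percentile(latencies_ms, 99))
```

An engineer quoting only IOPS and throughput can still be shipping a near-one-second tail; asking for the percentile is what catches it.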
Examine the fault tolerance and reliability of the storage system during component failures and software upgrades. What happens when your NameNode goes down? Does your HDFS just go away? Do you have a backup? How hard is it to make the secondary the primary? Is it acceptable for you to lose that availability? Software upgrades: how much of a pain in the rear is it to do your software upgrades? We see this from multiple vendors, where they have a very nice product, and it looks really shiny and everything works great, but the software upgrades are just really painful for your operations team. I might talk a little more about that later. So these are some of the things that trade off between availability and performance; there's a compromise between those two, right? Does one size fit all in your environment? Or do you need to start considering tiering, or different clusters for different tasks? As Jonathan mentioned earlier, we're starting to shift from a generic OpenStack system to more specialized OpenStack systems that can deal with the individual use cases, possibly changing the computes to better handle that workload, with larger or more expensive ephemeral storage for a subset of the computes. Manageability: evaluate the management interface, programmatic API, and integration capabilities. Can you use OpenStack auth to access the API, or do you need to set up a parallel auth system? I've definitely seen systems where they kind of support OpenStack authentication, but you have to set up a parallel account on that system and it syncs the passwords. That was definitely a game-ender for that one. Workload isolation: the noisy neighbor. This is one of the big reasons we're considering going to more specialized clusters. Again, we have 34 regions right now, so we're popping these things out once a month.
It just makes sense for us to start specializing them a little bit. So we'll have one large tenant and then we'll squeeze smaller tenants into the gaps, instead of what we have been trying to do, which is set up huge regions that can handle multiple heavy-duty tenants. I'm going to jump in real quick and add a little bit of color: how did we come about these objectives? Well, these are very clear functional requirements that our teams actually developed by deploying in Amazon first. There are a lot of optimization constraints when deploying in Amazon: you have to understand the instance sizes, you have to understand things like volume IOPS, and you have to optimize for cost. So having our teams run in Amazon first, architecting their applications around really clear functional requirements, then bringing those back to us and letting us essentially compete for a lower TCO, better value for that workload, has really helped us drive a lot of these key objectives. Data-intensive applications: you're going to see a bit of a bias in these slides toward the storage aspects of the architecture. We found that is usually the largest question we're trying to answer. Compute, CPU, memory, those are fairly easy; they're the bill of materials in your compute node, and they're easy to predict. Storage tends to be much more of a moving target. So we'll definitely talk a lot about storage, and I think that's one of the more interesting aspects of architecting these systems. So, disaggregated versus hyper-converged, what? This obviously isn't what you want to do; everybody should just get along, right? When I first started working on this project, I thought, well, I'm either going to try to replace HDFS completely with S3, or can I make HDFS do everything?
And as with most things that seem black and white, it's never black and white; it's gray, right? So here's just a quick example of what we mean, just to be very clear about hyper-converged versus disaggregated. Very clear except for the magic, right? I like the magic. In a hyper-converged model, compute and storage are co-located on the same node. This is standard HDFS, right? You probably have a NameNode sitting somewhere, maybe on one of the nodes. Disaggregated: this slide was a real challenge, and that's why the magic is here. I wanted to communicate that you've got your big data nodes separated from your storage and they're all completely interconnected. I was sitting there drawing millions of lines; instead, I used magic. So I hope you guys like the magic. Examples: HDFS is hyper-converged, and S3 in its many varieties and flavors, however you're implementing S3, would be an example of disaggregated. Sorry, I'm coughing in your ears, by the way. All right, so the recommended approach for Kafka. As John alluded to, these are the amalgamation of our research and some of our trials and tribulations; it's not a definitive guide. People will always tell you, well, it depends on your specific use case. And yes, it depends on your specific use case, but hopefully we can shed some light on the things that affect that use case, instead of just leaving it at some vague statement like that. For Kafka, we use a divide-and-conquer approach. We use HDDs for the collectors. Sorry, I'm reading my slide here, and I was like, I don't remember it saying it that way. So it's all stream data processing; it's all sequential writes. It's just flowing onto the hard drive as quickly as it can get on there, and I believe all the small writes get aggregated into larger blocks before they get dumped to disk.
So, for cost, there's no reason to go with SSDs on your collectors, unless your ingest rate is just ridiculously fast and somehow the SSDs would benefit you there. But chances are you can save some money if you go with HDDs, because the writes are aggregated into larger blocks. And then we've got the consumption side: the MirrorMaker and the aggregate clusters, and those are actually running SSDs. At first I thought this seemed counterintuitive, because you're taking the contents of two collectors and putting them on one aggregate, so you're going to need more storage space there. But because you're querying that aggregate, you want it to be fast, and it's all going to be random I/O. You're just going to kill a hard drive with all that random I/O, because you've got these large blocks sitting there and you're going to be doing a whole bunch of seeks into those blocks to try to get your individual records out. Okay, all right, the recommended approach for Pulsar. This is again one of those cases where I can't really say definitely use this method or definitely use that method; this is one of the it-depends items. It can handle a high number of IOPS and a lot of capacity, and network latency issues can be mitigated; these are some of the advantages of disaggregated. Some others to consider are cost: if you're going disaggregated, then you've got a specialized cluster doing your storage, so you've got an economy of scale there, hopefully. Also, one would assume you've probably got more capacity on that side versus the hyper-converged, where it's really, well, how many drives can I fit in a 1U chassis for my compute? You've also got flexibility: you can point it at another storage system. Let's say you want to use Ceph's RADOS Gateway versus SwiftStack's S3 implementation.
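To put rough numbers on that seek problem: a spindle is limited to a couple hundred random IOPS, while even a modest SSD does tens of thousands. These are generic order-of-magnitude figures assumed for illustration, not our measurements:

```python
# Generic ballpark device figures, assumed for illustration only.
HDD_RANDOM_READ_IOPS = 150     # 7.2k RPM spindle, seek-bound
SSD_RANDOM_READ_IOPS = 50_000  # conservative SATA SSD

def seconds_for_random_reads(n_reads, iops):
    """Time to satisfy n_reads individual record lookups at a given IOPS."""
    return n_reads / iops

lookups = 1_000_000  # hypothetical query load against the aggregate tier
hdd_s = seconds_for_random_reads(lookups, HDD_RANDOM_READ_IOPS)
ssd_s = seconds_for_random_reads(lookups, SSD_RANDOM_READ_IOPS)
print("HDD: about %.1f hours" % (hdd_s / 3600))  # seek-bound: hours of work
print("SSD: about %.0f seconds" % ssd_s)         # the same work in seconds
```

Which is why HDDs are fine for the sequential collector tier but the randomly queried aggregate tier wants flash.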
You've got that flexibility, versus the compute bill of materials that was bought and installed in your data center, which is going to be much harder to change until five years down the line when that equipment is excessed. Operations: again, these are advantages of the disaggregated approach. You don't have to manage the decentralized storage; you're not going to have to write a million Ansible scripts to deal with it. This was an interesting one: VM scheduling flexibility. I had a back and forth with Chris Power, like, what do you mean, how does this work? This is basically referring to: say you have a 200-node cluster, and you've decided that 85 of those computes are going to have specialized storage subsystems. Maybe most of your computes have hard disk drives, and those 85 have SSDs or NVMe drives or something similar. That means that, in terms of running your MapReduce jobs, you will have to schedule those instances onto those computes, versus a disaggregated model where you can run them across your whole compute infrastructure for that region. All right, so now let's talk about hyper-converged, and the advantages are fairly obvious. Obviously the storage is going to be much closer to, hopefully, the jobs you're running. And this depends; there's a little bit of a gotcha with HDFS in this. But for the most part you would expect low latency, because it shouldn't be going over a network. So you've got SSDs and NVMe; go that way if you have enough capacity. Again, you can only put so many disk drives or SSDs into that smaller chassis, versus a disaggregated system where you can put in thousands, whatever you want to scale that centralized system to. Advantages: simpler setup, right? It's just disks in a compute. Lowest latency, and possibly easier to scale.
And by that I mean: if you're just tacking computes onto your infrastructure, that's a fairly easy task, whereas scaling out your disaggregated storage system can involve rebalancing and all the headaches that come with that. All right, so HDFS advantages for hyper-converged storage. It's native to Hadoop; obviously that's good in the sense that you have full support for any project that uses the Hadoop infrastructure. Everything's designed to run on HDFS, right? It works with all the Hadoop storage formats, including Parquet and ORC, so you're not going to have any incompatibilities, and this is a big difference from disaggregated, where it won't work as well with the formats I just mentioned. Fast: it can be. You can design it in such a way that the data will be co-located with the MapReduce daemons. When you're writing into HDFS, whatever point you choose to write to, that's where it's going to put the local copy. So if your ingest is separated from your MapReduce daemons, you could have a problem there: it could wind up having to go over the network to actually access the data, and then you lose all the benefits. So you have to be very careful when you're designing that, to make sure you're writing to the place where you're going to read, and then you'll get that benefit out of HDFS. And that's not as easy as it sounds. Large file support: meh. There's no 5 GB file size limit anymore; that was true of the older S3 protocol for Hadoop, but S3A now supports large files. So that's academic; both will support large files. Okay, so the advantages of S3. You have the option for erasure coding, depending on your implementation, and there is a huge savings there. HDFS is going to use triple replication, that's kind of the standard, so right there you're only able to use one third of your storage. You can choose between various vendors: AWS, Ceph, SwiftStack, probably some others.
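The capacity math behind that savings is simple: with k data chunks and m coding chunks per stripe, the usable fraction is k / (k + m). The 8+3 profile and 3 PB figure below are invented examples, not our deployment:

```python
def usable_replication(copies):
    """Fraction of raw capacity usable with n-way replication."""
    return 1.0 / copies

def usable_erasure_coded(k, m):
    """Fraction of raw capacity usable with k data + m coding chunks."""
    return k / (k + m)

raw_pb = 3.0  # hypothetical raw petabytes
print("3x replication: %.2f PB usable" % (raw_pb * usable_replication(3)))
print("EC 8+3:         %.2f PB usable" % (raw_pb * usable_erasure_coded(8, 3)))
```

Same hardware, roughly double the usable space; the trade is reconstruction cost on reads during failures.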
Eventual consistency: that's usually the case, though there are systems that are strongly consistent. It's probably not a super big deal, assuming you're not running your MapReduce jobs instantly after you do your ingest, and even then, it's probably okay. Most of our clients, when I've asked, are you okay with eventual consistency, they're like, we're fine, it doesn't matter. S3 normally has very robust availability. And I forgot to mention this about HDFS: you've got your NameNode, which is where the metadata is kept for HDFS. If your NameNode goes away, your HDFS ceases to exist until that NameNode comes back up. It still has all the data, but you have no idea which block is where or any of that; it's the index that directs how to retrieve it. You can set up a secondary and do a sync with that; five, ten years ago, that was the way you did this, but nowadays, in my opinion, it's very janky. It's a warm backup, and you have to manually fail it over. Most of the S3 systems you're going to use have automatic redundancy. You'll barely even notice when a node goes down; they'll automatically pick another source for the data, and you can lose two nodes if you're using triple replication, that kind of stuff. So that's definitely an advantage S3 has: you're not going to have to worry as much about either the accessibility of your data or losing your data. And the data is not tied to an individual node; you don't have to worry about a NameNode going down. All right, together, right? So this is really, again, not a question of HDFS versus S3, fight, fight. Thank you. All right, I'm just here for that. There are places where HDFS makes much more sense than S3, and there are places where S3, I think, is much more beneficial. In our case: using S3 for the data ingest and then the results.
So it comes into S3 and it winds up in S3, but when you're chaining your MapReduce jobs together, that's where you use HDFS, right? Because you get the locality, so you're getting a performance benefit; you don't have to worry as much about the scale of the storage that HDFS needs; and it'll support the alternative storage formats. So here they are, all happy. And this is a little bit of a joke: he used to work at NASA, so I got to put this slide in here. And then I kind of stole my own thunder. So here's an example of what we're talking about, where it goes from the S3 source, goes bing, bing, bing, bing through HDFS, and then comes back to S3. This is nice as well because if your NameNode has a bad day, hey, you've still got the results and you've still got the ingest, right? Maybe you have a way of pointing other resources at that same data source, and you can provide some redundancy that way. Right, testing and validation. This was a challenge: I was trying very hard not to say we use internal testing tools that aren't available to the public, because, you know, what good is that going to do? We still do. For Hadoop testing, we use a combination of standard benchmarking tools and application-specific testing. TestDFSIO is the tool we use to test the performance of HDFS on the local ephemeral disk. TeraSort is a general Hadoop benchmarking tool that tests the overall performance of the cluster. Application testing, this is the your-results-may-vary part: you've got to test your application, and I can't really help you with that. And then for Kafka, nmon and the TICK stack are used for performance monitoring. So I'm trying to give you the open source tools that you guys can actually access. And then we use ZooKeeper for config and sync and quorum and all that fun stuff. All right.
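As a side note, that S3-in, HDFS-in-the-middle, S3-out pattern can be sketched as a simple stage list. The bucket and path names here are invented for illustration, not our actual layout:

```python
# Hypothetical chained-job layout: only the first input and the final
# output live in S3; every intermediate hand-off stays on local HDFS.
stages = [
    ("ingest",    "s3a://telemetry/raw/",  "hdfs:///jobs/stage1/"),
    ("enrich",    "hdfs:///jobs/stage1/",  "hdfs:///jobs/stage2/"),
    ("aggregate", "hdfs:///jobs/stage2/",  "s3a://telemetry/results/"),
]

for name, src, dst in stages:
    print("%-9s %s -> %s" % (name, src, dst))

# If the NameNode has a bad day, the raw data and final results survive,
# because they sit on the durable S3 endpoints, not in HDFS.
durable = [p for _, src, dst in stages for p in (src, dst)
           if p.startswith("s3a://")]
print("durable endpoints:", durable)
```

The intermediate stages get HDFS locality for speed, while the endpoints get S3 durability, which is exactly the "together" slide.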
I think it's maybe worth mentioning that when we started off with this concept of a hyper-converged node, we wanted to pin a specific drive, whether an SSD or an HDD, to a virtual machine, and in OpenStack we discovered that's not really possible. Well, it is possible, but it takes a ton of extra work, and that's one of the operational considerations you have to make: do you want to do a lot of hand-jamming of every single instance type when you do an upgrade, in order to pin those disks to specific workloads? You want the HDFS workloads to run on an HDD and the Kafka workloads to run on an SSD. That was a very big consideration for us, and we decided we didn't want to do that. So we went the opposite direction: one compute node type filled with SSDs and another compute node type filled with HDDs. That's one of the operational considerations we wanted to bring up before he jumps into the rest. Right, operational considerations. Always worth considering. That's a special one, isn't it? It is. More hand-jamming; got it in again there. So, operations and support at scale. Noisy neighbor: we've had a lot of trials and tribulations with this. This is one thing where I feel there needs to be more work in OpenStack to deal with these situations. The QoS tools, especially when you're using something like multi-backend Cinder; I think there are some pull requests to fix it, but multi-backend Cinder and the QoS defaults tend not to work as well as they could. Monitoring: I'm sure we've all experienced what a challenge monitoring can be. It is vital, vital to have very good monitoring. Right now we're testing a system that has one-second updates and monitors the whole OpenStack cluster for CPU, memory, and disk I/O.
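Once you split the fleet into SSD computes and HDD computes, one way you would typically steer instances onto the right node type in Nova is host aggregates plus flavor extra specs, assuming the AggregateInstanceExtraSpecsFilter is enabled in the scheduler. This is a sketch of that mapping as plain data, with made-up names, not our actual config:

```python
# Hypothetical SSD host aggregate and a flavor whose extra spec requires it.
ssd_aggregate = {
    "name": "bigdata-ssd",
    "hosts": ["compute%03d" % n for n in range(1, 86)],  # imagined SSD nodes
    "metadata": {"disk": "ssd"},
}
kafka_flavor = {
    "name": "kafka.xlarge",
    "extra_specs": {"aggregate_instance_extra_specs:disk": "ssd"},
}

def schedulable(flavor, aggregate):
    """Toy version of the filter: every required spec must match the
    aggregate's metadata for a host in it to be a candidate."""
    want = {k.split(":", 1)[1]: v
            for k, v in flavor["extra_specs"].items()
            if k.startswith("aggregate_instance_extra_specs:")}
    return all(aggregate["metadata"].get(k) == v for k, v in want.items())

print(schedulable(kafka_flavor, ssd_aggregate))  # True
```

The point is that the pinning moves from per-disk, per-instance hand-jamming up to a per-node-type label the scheduler enforces for you.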
And that's so important, because you'll have a customer who has figured out that maybe their IOPS aren't capped, or who has realized they're on multi-backend Cinder. If you have one storage type that is SSD and one storage type that is HDD, the default quota has no way of delineating between those two different types. Say you set your default at one terabyte of quota: you can only have one terabyte, but your user can then say, well, I'll take one terabyte of SSD. Now obviously you can fix this after the fact by setting individual quotas for the individual storage types, but that's another step, and it can get left off. In our case, it was left off, so we had a lot of pain migrating tenants out of what they thought was one storage type back onto another. And say you have a tenant who's asked for quota, and another tenant who's asked for quota, and they both sit there for six months waiting for that intern to come back or whatever, and then they both fire up all their instances at the same time. This is where all these automation tools that are currently being developed can be a real headache, right? Because if you give these tenants, say, 100 instances apiece, and they're using automation tools to fire those up and have them all do something at the exact same time, well, you can exhaust your storage backend. And then there are definitely several storage backends where you have no idea; they just don't have the resolution. For the S3 systems, it's all super distributed, right? The writes are going all over the place, so how do you actually pull those metrics back out? That was a really tough problem. The way we solved it was to monitor on the OpenStack side: we have listeners on each and every client looking at the metrics each hypervisor produces, and then we aggregate that completely outside of our storage systems.
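The quota gap is easy to model: a single global gigabytes quota cannot tell backends apart, while per-volume-type quotas (which Cinder does support, if you remember to set them) can. The numbers here are invented:

```python
# A 1 TB global quota is blind to volume type: the tenant can put all of
# it on the expensive SSD backend and still pass the check.
GLOBAL_QUOTA_GB = 1024
request = {"ssd": 1024, "hdd": 0}
print(sum(request.values()) <= GLOBAL_QUOTA_GB)  # True: global check passes

# Per-type quotas close the hole, at the cost of one more step to forget.
PER_TYPE_QUOTA_GB = {"ssd": 256, "hdd": 768}

def within_quota(request_gb):
    return all(gb <= PER_TYPE_QUOTA_GB[t] for t, gb in request_gb.items())

print(within_quota({"ssd": 1024, "hdd": 0}))   # False: SSD tier is now capped
print(within_quota({"ssd": 256, "hdd": 512}))  # True: fits both tiers
```

The operational lesson in the talk is the second half: the per-type step exists, but nothing forces you to do it, and skipping it is what caused the painful tenant migrations.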
That also has the advantage that, say you have a block storage system and an S3 storage system, you can look at both types, and you can look at ephemeral as well. Versus if a vendor provides you very nice graphics for their product; well, okay, that covers that one use case. Do you have DevOps? Again, you've deployed your system, it's now a year down the line: how do you keep these things running well? Having operators who are more well-rounded and aren't afraid to go further than the standard MOP, I think that's a great benefit. Having them tied closely to the engineers: one thing I've really appreciated about the way we do things at Comcast is we sit next to our operations group. That guy is on the other side of my cube wall, so when he has a question, he can just ask me. He's not in a far-off land or anything like that. When things start to break, well, that's pretty standard. What do you do? This involves, you know, how good was your testing? Did you pull the drives out of either the compute or the S3 cluster and see what happens? A funny one is: you've got a drive that's failed, and Linux is telling you it's /dev/sdd. Now, which one of those slots is /dev/sdd? Because your remote-hands guy says, hey, I'm here to pull the drive, and there are four drives; which one do you want me to pull? You can't tell him /dev/sdd; he's not going to be able to work with that. So you have to blink the locator light to answer that question. That's always a fun one: how do I find the drive, how do I pull it? And I was very impressed recently because a vendor had an automatic tool so that when a drive goes bad, it starts blinking that locator light. And that's great; I mean, I'm a thousand miles away from my data center.
I'm going to call the guy who pushes the buttons, and the level of his knowledge is going to be pulling a drive, putting a new drive back in, and maybe making sure the serial number is right. So, synthetic workloads; that is so important, obviously. When I go into meetings with my internal customers, because we're mainly internal-customer focused, I have all these synthetic workloads we've run and pre-qualified on a system. And I ask, well, what block size do you use? And if they tell me the block size, I have a little list I can go down: okay, we can handle that block size, and we can do this many IOPS. I did it for every single block size from 4K to 8 MB, right? So right in that same meeting, I can answer that question, and I know if I'm out of my depth. And that's true of your compute as well, in terms of memory; well, you probably don't worry too much about the speed of your memory. I worry about the speed of my storage, and about going over a network, but that's another story. All right, so to recap: we've given you our application profiles and our solutions, or at least some general recommendations; storage recommendations, that's my forte; and then some operational considerations. And we essentially told you we have to do everything right now, right? There's no magic single storage vendor or hyper-converged infrastructure provider. As we evolve, we've realized that in order to meet these new applications, we have to get away from that general-purpose model we had for many years and start thinking about almost bespoke infrastructure. And that's a bit chaotic, you know, managing multiple SKUs for server builds and dealing with multiple storage vendors, which is pretty much everyone at this conference this week. I tried to save you guys. But so, yeah, we hope that was helpful.
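One last aside on that pre-qualification sweep: doubling the block size from 4 KiB through 8 MiB gives twelve sizes to benchmark per storage tier. A sketch of generating the sweep (this is illustrative, not the internal tool):

```python
def block_size_sweep(start=4 * 1024, stop=8 * 1024 * 1024):
    """Yield doubling block sizes from 4 KiB through 8 MiB inclusive."""
    size = start
    while size <= stop:
        yield size
        size *= 2

sizes = list(block_size_sweep())
print(len(sizes), "block sizes")              # 12 block sizes
print([s // 1024 for s in sizes], "KiB")
# Each size would get an IOPS / throughput / latency row in the
# pre-qualified list you bring to the customer meeting.
```

Twelve rows per tier is a small table to carry into a meeting, which is the whole point of pre-qualifying.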
We appreciate the time that you guys have spent here. And I think we are... It's perfect. Well, we're one minute over, but that's not bad. We didn't actually time this ahead of time, so that's pretty good, pretty good. High five. All right, here are our slides.