My name is Allen Samuels. I used to be at SanDisk, but now I'm at Western Digital, and I'm here to talk about Ceph and flash. When I submitted the abstract for this to the LinuxCon people, I was originally put on the waiting list, and then a couple of weeks ago I was informed that I was accepted. So that was great; I made my plans and came here. Just a few minutes ago I said, okay, my talk is in a few minutes, I'd better go find out where I am. I looked it up: oh, I'm in the Potsdam room. That didn't mean anything to me. I went and looked on the map, walked in, and, oh my God, this is the big room. Normally when I talk about Ceph it's in a much smaller room, and you can see it's not a very large audience. So I'm scratching my head and asking, basically, who was booked in here before I got pulled off the waiting list? I'm going to have to go find out, because I'm sure it would have been really good.

Anyway, I'm going to talk about the experience I've had at SanDisk with Ceph and flash. The talk is really in two halves. In the first half I'll talk about the current state of Ceph and some current best practices, and in the second half I'll talk about the development work going on in the Ceph project and some of the results I think we're going to see from it. The first thing I have to do, since I now work for a big company, is tell you that absolutely everything I'm going to tell you could be wrong. And we know that's a lie, which apparently means I'm now qualified to run for president of the United States. So in November, please vote for me.

For those of you who aren't very familiar with Ceph, here's a 30-second intro to catch you up. Ceph is a scale-out storage system. I like to think of it as three tiers. At the top tier you have your clients, which speak whatever protocol they like; Ceph supports block storage, file systems, and object storage. All of those protocols are converted through something I call a gateway — that's just my name for it, it's not standard by any means — into an internal protocol called RADOS, which is common across them. And in the bottom layer you have the object storage devices, the OSDs, which are the things that implement the RADOS protocol. What that gives you is the ability to intermix all of the basic storage types — block, file, object, there's even an HDFS connector — in the same storage system. You can freely intermix those across drives, you can segregate them; there's all sorts of capability in there. But the key is that it's scale-out, which means that if you don't have enough capacity or enough IOPS, you just add a few more nodes and you get more, and you can keep doing that. There are Ceph clusters in deployment today in excess of 40 petabytes, something like 10^4 nodes, so it's been proven to scale out reasonably well to those kinds of sizes.
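To make the client side of that picture concrete, here's a minimal sketch using the python-rados bindings to talk RADOS directly; the pool name 'mypool' and the /etc/ceph/ceph.conf path are assumptions, and the point is that the client never has to know which OSDs end up holding the object.

```python
import rados

# Connect to the cluster described in a (hypothetical) local ceph.conf.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Open an IO context on an assumed pool and store/read one RADOS object.
ioctx = cluster.open_ioctx('mypool')
ioctx.write_full('greeting', b'hello ceph')
print(ioctx.read('greeting'))      # b'hello ceph'

ioctx.close()
cluster.shutdown()
```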
The key thing to take away from this picture is that the most important thing a storage management system does in the clustered world is give you availability and durability, both of them in the face of varying failures in your data center, your storage media, your network, or whatever. The RADOS protocol is considered an available, durable protocol, but the individual devices down below, the OSDs, are not available or durable; they're individual fault domains. It's the job of the protocol to assemble something available and durable out of those disparate parts. There's a whole lot more technology in Ceph for controlling the availability and durability of your data — you have the ability to physically place it across the various nodes in your network to satisfy whatever your availability rules are — but I'm not going to go into that today. What I am going to talk about is how you map all of this onto flash.

So if you look under the hood — and this is the OSD as it's built today — the things on the top... I wish I had a pointer here, I'm sorry. And I apologize for the colors; they didn't convert over very well, they look better on my screen than they do up there, but we'll just have to live with it. The boxes at the top are the parts of the cluster management system that ensure the availability and durability of the data. They're responsible for keeping track of which nodes are up; if a node fails, they go rebuild the data. That's all done in those upper layers. For the rest of the talk, I'm going to talk about what's underneath that. What you see here is a box called ObjectStore, and inside of it, this thing called FileStore. Now, ObjectStore is a C++ class, for those of you into object-oriented programming, but basically it provides a contract to the rest of the system for how to store data. The ObjectStore contract is fairly sophisticated: it has transactions, it has objects, it has attributes, and all sorts of interesting things. There's actually a fairly good description of it in the source code; if you're interested, you can go read it. The key is that that's what everything else is built on top of. ObjectStore provides essentially local storage with a certain set of semantics.

To implement those semantics today, you have something called FileStore, which builds the semantics of ObjectStore out of other building blocks: in this case a file system, XFS, and a key-value database, which used to be LevelDB but is now RocksDB. There is also a write-ahead log, which is sometimes called a journal. I tend to flip back and forth between log and journal; if you hear log, think journal, and if you hear journal, think log.
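As a rough sketch of that contract — in Python, purely conceptual, not Ceph's actual C++ classes or method names — a transaction accumulates a batch of mutations (writes, attribute updates, removals), and the backend promises to apply the whole batch atomically. FileStore and BlueStore are two different ways of keeping that promise on real media.

```python
# Conceptual model of the ObjectStore contract; all names are illustrative only.

class Transaction:
    """Accumulates operations that must be applied all-or-nothing."""
    def __init__(self):
        self.ops = []

    def write(self, obj, offset, data):
        self.ops.append(("write", obj, offset, data))

    def setattr(self, obj, key, value):
        self.ops.append(("setattr", obj, key, value))

    def remove(self, obj):
        self.ops.append(("remove", obj))


class ToyObjectStore:
    """Toy in-memory backend; a real one must also make the batch durable."""
    def __init__(self):
        self.data = {}    # object name -> bytearray
        self.attrs = {}   # object name -> dict of attributes

    def queue_transaction(self, txn):
        for op in txn.ops:
            if op[0] == "write":
                _, obj, off, buf = op
                blob = self.data.setdefault(obj, bytearray())
                if len(blob) < off:
                    blob.extend(b"\x00" * (off - len(blob)))
                blob[off:off + len(buf)] = buf
            elif op[0] == "setattr":
                _, obj, key, val = op
                self.attrs.setdefault(obj, {})[key] = val
            elif op[0] == "remove":
                _, obj = op
                self.data.pop(obj, None)
                self.attrs.pop(obj, None)


store = ToyObjectStore()
t = Transaction()
t.write("rbd_data.123", 0, b"new block contents")
t.setattr("rbd_data.123", "version", b"2")
store.queue_transaction(t)   # the whole batch lands together
```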
Those pieces can be deployed on flash in a couple of different ways, and I want to talk about a couple of those. The most common way people use flash with Ceph today is what you'd probably think of as a hybrid configuration: you've got a relatively small amount of flash in front of a larger amount of rotating storage, and you put the journal, the Ceph journal, onto the flash. That turns out to speed Ceph up significantly. Let me go back to that picture for a second. The hard part of implementing ObjectStore as FileStore is that operations at the ObjectStore level are transactional, and its notion of a transaction is fairly general: you can modify objects, you can modify attributes, you can remove things, you can create things. Those are the building blocks that support the upper-level protocols. But down in a file system, the semantics simply don't match: there's nothing transactional in a file system. It just wasn't built that way; that wasn't the primary driving requirement for file systems 60 or 70 years ago when they got implemented.

So the way Ceph provides object integrity — and this will seem fairly familiar to storage developers — is that data that comes in gets written into a write-ahead log, and as far as the rest of the world is concerned, that object, or that transaction, is now durable. The acknowledgement goes back and everybody moves on. After that, in the background, Ceph has to dump the write-ahead log into the persistent store, XFS and LevelDB. That's basically a background activity. So when the system is operating, your write transactions come in, they go into the write-ahead log, and later on they're flushed in the background. If you look at the mechanics of this, it looks a lot like the way an intent log in ZFS works: you suck up three or four or five seconds of data and then you dump it behind you. And this cyclical behavior, which I call binging and purging, shows up in the overall performance of the system.

One of the ways you get around that is to put the journal on flash. It significantly improves your write latencies, because the first thing Ceph is going to do is take your transaction and put it into the journal. By putting the journal on flash, you get the high performance and short latency you'd expect from having the data on flash, and the background activity of dumping it into the rotating store can continue whenever it wants. Now, if you start looking at the demands on the flash, you realize that this journal is going to consume a really tiny fraction of any SSD you'd consider putting in your data center. It would probably work fairly well on the USB key in your pocket, but for anything you'd actually stick into an enterprise data center, you'll find the journal is really only using a tiny fraction of it, both in terms of space and in terms of bandwidth. So the way it's typically deployed in a Ceph cluster is that you have six, eight, twelve spindles of rotating storage sitting inside a 1U or 2U server, you stick one SSD in there, and you partition that SSD so it looks like six or twelve journals, however many drives you've got, one for each of the rotating stores. Now what you have is an aggregation of journal bandwidth across all the rotating storage, and you're getting enough IOPS and enough bandwidth that one SSD is basically consumed by it. That's a fairly straightforward and common deployment today.

When you start thinking about what kind of SSDs you need to buy, it's a pretty simple computation. I've given you some of the things to think through here, but typical numbers of eight to ten drives per SSD are pretty easy to obtain. If you buy some of the newer NVMe SSDs that are significantly faster, you'll see that number increase correspondingly.
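Here's a back-of-the-envelope version of that computation; the drive and SSD rates below are made-up placeholders, not measurements.

```python
# Rough journal-on-flash provisioning math; replace the rates with your own.

hdd_count = 12            # spindles in the server
hdd_write_mb_s = 120      # sustained write rate per rotating drive
ssd_write_mb_s = 1500     # sustained write rate of the journal SSD (e.g. NVMe)

# Every write hits its journal partition first, so in the worst case the one
# SSD must absorb the combined ingest of every spindle it fronts.
needed_mb_s = hdd_count * hdd_write_mb_s
print(f"journal bandwidth needed : {needed_mb_s} MB/s")
print(f"SSD headroom             : {ssd_write_mb_s / needed_mb_s:.1f}x")
print(f"max spindles per SSD     : {ssd_write_mb_s // hdd_write_mb_s}")
```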
Probably the biggest issue to think about is that the durability computations are now affected by the change in the failure domain: instead of having a bunch of rotating disks that are all separate and could be moved around if you wanted to, you've now got them coupled through this SSD. If the SSD dies, the Ceph journal dies, and you're basically going to corrupt the data on the rotating store behind it. It's not much of a change, though, because it's usually all sitting inside one server, and realistically, when you plan your fault domains, most people consider the server itself to be a fault domain, or even a portion of one. So at the end of the day, just putting the journal on flash generally gets you a big improvement without much effect on the rest of what's going on.

I've given some other thoughts here about how to compute the bandwidth the SSD will consume. If you can characterize the IO size in your cluster — some people can, some people can't, it depends on your use case — then it's a fairly easy computation. The overhead for the Ceph log is pretty small. From a space perspective, you're typically looking at storing only ten seconds or so of data at the maximum ingest rate, which is limited by the ability to dump the data. So again, anywhere from ten to fifteen rotating drives per SSD is really a good match. And if you want to get down into the weeds, the only thing you have to be careful about is that you actually don't want to make your Ceph journal too large. For those of you who are into the mechanics of SSDs, that's because with a really large journal you lose the implicit trim you get by overwriting the same data over and over again. It's a really fine technical point; if you want to know more about it, grab me afterwards.

So from the previous picture, we have our XFS file system, we have our key-value store — LevelDB or RocksDB — and we have the Ceph journal. What I just talked about is putting the journal on flash. The next click up is to put the key-value store on flash as well. If you're running block storage, RBD, on Ceph, that's hardly going to help you at all: Ceph's block path makes very little use of the LevelDB attributes, so you're basically not going to see anything from it. If you're running the object interface, that's a different story, because the bucket indexes are stored in LevelDB or RocksDB, so putting that on flash is going to significantly improve your bucket indexing and the overhead associated with it. But your normal gets and puts are still going to push their data through XFS, and they're still going to have to do seeks for the XFS indexes and the XFS directories, as well as for the data itself. Moving the bucket metadata out certainly improves things, but it's really hard to estimate the benefit you'll get, because if you're putting really small objects, the metadata is a pretty big chunk of the work and you're moving somewhere between a third and a half of the metadata IO onto flash; but if you're putting really large objects, the metadata is kind of noise and you're hardly going to see any benefit at all.
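A tiny illustration of that last point; the per-operation metadata cost below is an assumed constant, just to show how the metadata share shrinks as objects get bigger.

```python
# How much of each PUT is metadata traffic vs payload, for a few object sizes.

metadata_bytes_per_put = 4096   # assumed bucket-index + fs-metadata IO per PUT

for payload in (4 * 1024, 64 * 1024, 4 * 1024 * 1024):
    share = metadata_bytes_per_put / (payload + metadata_bytes_per_put)
    print(f"{payload >> 10:6d} KiB object: metadata is {share:5.1%} of the IO")
```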
It's harder to estimate the provisioning, because now you're including the size of the bucket index itself, and the bucket data can be sharded to distribute it across your OSDs, which really helps regularize the space allocation for you.

The last thing you can do is put it all on flash, which is really what I've been spending the last couple of years of my life at SanDisk on, and that brings us to the second half of the talk, the more forward-looking part. We've been working on Ceph a little over three years now — three and a half years — and when we got started we basically said: we're going to tinker under the hood, no changes to the wire protocol, no changes to the storage format, let's see what we can get. And from when we got started, which was the Dumpling release, to the latest release, which is Jewel, for block we've achieved, together with the community, about a 15x performance boost. If you actually measure those numbers, the read IOPS are pretty good; it's getting to be pretty competitive with the commercial offerings out there. If you normalize for CPU and various other resources, you're starting to get into the ballpark. But the write IOPS are still really poor compared to best-in-class commercial systems, and we've reached the point where, to get any better, we basically have to break compatibility.

The basic architecture of FileStore is the problem. Remember from the picture of FileStore before: we have XFS, we have LevelDB/RocksDB, and we have a journal, and there's a fairly complicated dance between them, and that's really where all the overhead is going. You're putting the data into the Ceph journal, then pulling it back out and writing it again through XFS; or you're putting it into the journal, pulling it out, and writing it into LevelDB. You've really got a journal on top of a journal on top of a journal, or a journal on top of a file system which also has its own journal. In technical terms, what you have is a large amount of write amplification: if I write a blob of data, on the back end it gets written a couple of times, along with a lot of churning in the metadata.

There are a couple of other issues with FileStore. Ceph needs to deal with its objects in a strict ordering sense. What I mean by that is, when Ceph does a scrub, a background scrub where different nodes compare their copies of the data with other nodes just to ensure integrity — or when you lose a node and need to rebuild it, which is functionally the same operation — Ceph's approach is fairly straightforward: let's just enumerate the objects from A to Z, I'll see what you have and you see what I have, and we'll walk through the listing together, exchanging as we go. It's basically the way rsync works. That works conceptually pretty well, but when you map it onto a file system, what you find is that Ceph really wants to put all 10 million, 100 million, or a billion objects into one directory so it can enumerate them sequentially. And if you've dealt with directories in file systems, you know that's really not going to work. The reason is that directories aren't stored in order; they're typically stored in hash order, which makes your opens and closes fast but makes directory enumeration expensive, because you end up sorting the directory as you go.
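A toy version of that sorted-walk comparison, with made-up object names. The point is that it only stays cheap if both sides can enumerate their objects in the same stable order — which a hash-ordered POSIX directory doesn't give you without sorting.

```python
def diff_sorted(mine, theirs):
    """Walk two sorted object listings in lockstep, rsync-style."""
    missing_here, missing_there = [], []
    i = j = 0
    while i < len(mine) or j < len(theirs):
        if j >= len(theirs) or (i < len(mine) and mine[i] < theirs[j]):
            missing_there.append(mine[i]); i += 1
        elif i >= len(mine) or theirs[j] < mine[i]:
            missing_here.append(theirs[j]); j += 1
        else:
            i += 1; j += 1
    return missing_here, missing_there

osd_a = sorted(["obj.0001", "obj.0002", "obj.0004"])
osd_b = sorted(["obj.0001", "obj.0003", "obj.0004"])
print(diff_sorted(osd_a, osd_b))   # (['obj.0003'], ['obj.0002'])
```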
That turns out to be a real problem, and we are still finding bugs today in the FileStore code that works around these inherent limitations of POSIX directories. They're all obscure — I don't think anybody really runs into them — but if you go look through the changelogs you'll still see them. So, anyway, that pretty much covers this. Oh, the other thing is that the designers of the file system had a different use case in mind for the way certain operations are done. For example, syncfs makes perfect sense: you just find all the dirty pages for a file system and write them. But when you're trying to synthesize transactional behavior out of a hierarchical directory tree in file systems like XFS or Btrfs, it's actually hard to find all the things that need to be synced. Syncing a file is easy. Syncing a directory is conceptually easy, but syncing the directory's parents, and predicting everything that gets modified, turns out to be not particularly easy. The brute-force approach is to do a syncfs, and if you look at that code in the kernel, it's going to index every block in the block cache, whether it's dirty or not, and that ends up being a pretty large consumer of CPU time.
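Roughly the distinction being drawn, sketched with plain POSIX calls from Python; the path is a placeholder, and the directory fsync is Linux-specific.

```python
import os

path = "/var/lib/ceph/osd/osd.0/current/obj_12345"   # hypothetical object file

# Targeted durability: flush the file's data, then the directory that names it.
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"payload")
os.fsync(fd)                                          # file data + inode
os.close(fd)

dfd = os.open(os.path.dirname(path), os.O_RDONLY | os.O_DIRECTORY)
os.fsync(dfd)                                         # the directory entry
os.close(dfd)

# With renames, links, and nested directories it gets hard to know how far up
# the tree you have to go, which is why the fallback is the sledgehammer:
# flush everything dirty, everywhere (sync(2) / syncfs-style).
os.sync()
```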
Anyway, there's a whole long list of issues with FileStore, but what I really wanted to talk about is BlueStore, which is a complete rethink of the ObjectStore implementation for Ceph. The original implementation was written by Sage Weil, the lead developer and instigator of Ceph. That was done at the end of last year, and there's actually a tech preview available in the Jewel release, which is the current release. The important thing to realize is that it maintains wire compatibility: other nodes on the network can run FileStore while the nodes we're talking about run BlueStore, and they can be freely intermixed in the cluster without any problem, because the data communicated across the wire is completely compatible. But the storage format of BlueStore is completely different. So if you update your software to the latest version and expect it to suddenly run better, you'll be a little disappointed: the on-disk data is still in the FileStore format, and the old code will continue to run. FileStore is still in the code base and will be for quite some time. But BlueStore, which is under development right now, will let you either start with a new node or actually destroy your data and recreate it, because it will be rebuilt in the new format. We expect — and I'll show you some numbers here — that BlueStore will be about twice as fast as FileStore, or better, for write operations, and I actually think that at the end of the day it's going to outperform FileStore for reads as well. Right now we're at about parity there; this is development, it's a moving target, and in fact it's moved quite a bit in the last few weeks since I did these slides. The target hardware for BlueStore is really the kinds of products we see in the market today and expect to see over the next few years — in particular flash, and PMR and SMR hard drives, and combinations of them.

Okay. The biggest performance boost out of BlueStore is that we get rid of the double write — I'll show you how that works — and we get much better CPU utilization, primarily because the data structures are tailored for Ceph rather than being a hybrid over a POSIX file system; these are data structures that are specifically targeted. And we're already seeing much better code stability. So, I've covered that.

Some additional functionality that will be coming with BlueStore: checksums on reads. This is a really important item. If you look at the current Ceph code, your data integrity is essentially expected to be provided by the hardware; the only integrity checking going on is whatever your native hardware does for you. Today Ceph implicitly assumes that if the hardware doesn't generate an error, the data is good. Those of you who have worked on large clusters and gone through the math know that this is a dangerous assumption at scale. It wasn't too bad ten years ago when Ceph got started, but as drive sizes have continued to increase, all the bit error rate numbers you build into your models need to be adjusted, and in the larger clusters we're really reaching a situation where, unless you provide additional data integrity, the bit error rates start to get a little worrisome. So BlueStore has the ability to provide software checksums on every read operation. When you turn that on you're obviously going to see some CPU consumption, but the integrity of your data should be very, very high. There are knobs: you can dial in however big a checksum you want over whatever size you want. I've done some computations on this; you can hit 10^-30 bit error rates at scale, even less than that, without any real difficulty. So that issue is basically solved completely in BlueStore.

It will also do inline compression. The code already has Snappy and zlib, but it's pluggable; if you have a compression algorithm you particularly like, it's pretty easy to stick it in. I know there are some hardware assists for compression in the pipeline, and we'll be looking at converting over to use those as well.

Something else: the additional operations you get in a storage management system today — the ability to make virtual clones, and the ability to move data around virtually, in other words to make it appear that data has moved without actually moving it. Those are things you do by editing indexes. You really can't do them on a standard file system, because standard file systems don't have those operations, and there isn't an easy way to synthesize them from outside. BlueStore has them built in. That means that when you do things like RBD clone operations, you'll get a virtual clone, not an actual copy of the data the way it works today. The virtual move is a fundamental building block for another feature coming in Ceph, which is a full implementation of erasure coding across all the pools. Today Ceph has some erasure coding support, but the pools that are erasure coded can't be used with all the protocols; in particular, it only works with the object protocol. So if you want to do block storage with erasure coding today, you're basically out of luck — there's a way to cobble it together, but the performance is so bad you'll be sorry you did.
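Going back to those checksum knobs for a second, one crude way to model the claim: assume a corruption slips past the drive at its quoted error rate, and Ceph only misses it when a c-bit checksum happens to collide. Both numbers below are illustrative assumptions.

```python
uber = 1e-15        # assumed undetected bit error rate of the device
for csum_bits in (32, 64, 128):
    # Probability a random corruption also passes a c-bit checksum: ~2^-c.
    effective = uber * 2.0 ** -csum_bits
    print(f"{csum_bits:3d}-bit checksum -> effective UBER ~ {effective:.1e}")
# 32 bits already lands around 1e-25; 64 bits is far below the 1e-30 target.
```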
With BlueStore as the back end, the virtual move capability is the fundamental building block for making erasure coding work for large stripes with the block storage interface, without creating the classic RAID-5 write hole.

So, a picture of BlueStore. The outer part is the same, completely unchanged; the only thing that's changed is the part inside ObjectStore, which is now BlueStore. Interestingly enough, you have the same three pieces — the data, the metadata, and the journal — but they're plugged together a little differently. In particular, the journal here is really the journal for the key-value database, not for Ceph itself. In fact, and I'll talk about it in a minute, what's happened in BlueStore is that all of the metadata anywhere in the system has been brought together into one place, the key-value store. So the key-value store not only has all the Ceph metadata, it has all of what you'd consider the file system metadata too. If you look inside that key-value store, you'll find the Ceph equivalent of an inode, with disk addresses in it; that's all part of the metadata. The reason we did that is that modern key-value stores have transactional semantics, and those map very well onto the transactional semantics of ObjectStore.

So the fundamental operation is pretty straightforward now. When a write operation comes in, the data is sent directly to the data store; it's written directly to wherever it's going to end up. It's essentially a copy-on-write system: allocate new space, write the data to the new place, then construct a single transaction that modifies all of the metadata and commit that transaction to the key-value database. I'll talk a little more about that in a minute. You can take those three pieces and put them on different devices if you want. You can take the existing hardware plant people are deploying today — a relatively small amount of flash over rotating drives — and put just the key-value database's journal, its log file, on the flash. You can put all of your metadata on flash. Or you can put all three of them together: all on rotating, or all on flash. You can mix and match, and you don't have to do any fixed partitioning; in other words, if I take two of these and put them on the same device, I don't have to carve it up, Ceph does that for me automatically.

Let's see — yeah, that covers the rest of that. I'm not going to walk through these; I've basically talked through the write path options. There are really two ways of doing writes. There's something that looks like the old system, where the data is written into a temporary place and then moved in the background, and there's the direct write, where the data is written directly where it's going to land and then you modify the metadata. You use the temporary path for certain hardware combinations — if you really have something like NVRAM, it might make sense to write all of your data into that first and then dump it in the background, and BlueStore will support that much more efficiently than FileStore does. But for the expected use cases we're looking at, we think writing the data directly and then modifying the metadata is the preferred path, especially for larger transactions.
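Here's a conceptual sketch of that write path — not the real BlueStore code, just the shape of it: data goes copy-on-write into freshly allocated space on the raw device, then a single key-value transaction flips the object's metadata to point at the new extent.

```python
class ToyBlueStore:
    """Toy model: raw device + bump allocator + KV 'inode' table."""
    def __init__(self):
        self.device = bytearray(1 << 20)   # pretend raw block device
        self.next_free = 0                 # trivial allocator
        self.kv = {}                       # stand-in for RocksDB: obj -> (off, len)

    def write(self, obj, data):
        # 1. Allocate new space and write the data there (never in place).
        off = self.next_free
        self.next_free += len(data)
        self.device[off:off + len(data)] = data
        # 2. One atomic KV commit updates the metadata; in the real thing this
        #    commit (plus the KV store's own log) is what makes it durable,
        #    and the old extent becomes garbage to reclaim later.
        self.kv[obj] = (off, len(data))

    def read(self, obj):
        off, length = self.kv[obj]
        return bytes(self.device[off:off + length])

store = ToyBlueStore()
store.write("rbd_data.42", b"v1 contents")
store.write("rbd_data.42", b"v2 contents")   # rewrite lands in a new extent
print(store.read("rbd_data.42"))             # b'v2 contents'
```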
So now I can talk a little bit about performance. These are old, these are from the July slides; it's a typical development project — we put it all together and see what we've got, then we look at it, and there are things about it we don't like, so we tear a bunch of it up, and we're just putting it back together again now. Unfortunately I don't have the latest slides, but I can tell you a little bit verbally about where we are. Interestingly enough, the person who did these slides is even more colorblind than I am, because BlueStore is not in blue; I don't know what to say about that, but he can help decorate my house. What we have here are three things: FileStore as it exists in Jewel, the BlueStore that was in Jewel, and the BlueStore that was in master as of the end of July. So really it's the yellow line and the blue line that matter, and generally what we see is that for the larger block sizes, we're seeing the 2x performance on writes that we would expect. That part is good.

On the sequential throughput, the sequential reads are kind of disappointing, but if you dig under the hood, it's pretty easy to explain: BlueStore isn't doing readahead. FileStore, being based on XFS, gets readahead because the kernel does it automatically. But interestingly, we can't think of a use case where readahead would matter and where there isn't already a client in front of us doing readahead. In particular, if you think about using the block mode of Ceph, you'd say, gee, readahead is really important — well, yes, but almost everybody who does basic block IO already does readahead. So that's an interesting issue the community is grappling with right now. Naturally, because it's a community, we've divided up into the pro-readahead and anti-readahead forces. The reality is that it's just not above the line to worry about right now. When all the other major issues are out of the way and readahead is the only thing left to argue about, that will be a great day.

On the small object sizes, what we're seeing is some design choices in the code that lead to fairly heavy CPU consumption. These runs are actually all on flash, so the performance for small blocks really has nothing to do with the storage; it's all about the CPU paths. A lot of that has been restructured — I talked about how we put it all together and make it work, then tear it apart, and it takes a while to put it back together again. What we did in the latest go-around was specifically target the CPU consumption for small objects, and putting it back together over the last couple of weeks, we're seeing somewhere around a 30 percent boost in small object performance over what you see here. I actually think that's smaller than what we'll end up with; there are more things in the pipeline that will improve it, and I'm expecting us to be about 50 percent better than the numbers you're seeing here for small objects, which at the end of the day is going to be a pretty nice boost over the current FileStore level of performance.

Another piece of the puzzle I talked about before is the key-value store, the basic store for metadata. One of the best key-value stores out there today is RocksDB, an open source project from Facebook, and that's what's in the codebase. SanDisk also has a somewhat different technology product called ZetaScale, with a very similar interface, and we've open sourced that, and it is also being integrated into Ceph.
It's sort of interesting: in the world of databases and key-value stores you again have two camps that are divided up and warring with each other, the log-structured merge people and the B-tree people, and we're actually going to have both here. RocksDB is a log-structured merge tree; it has some very interesting characteristics, especially when you run it on flash. ZetaScale is B-tree based and was architected completely for flash. Actually, RocksDB works pretty well on rotating storage too, which is interesting. So both of them are going to be in the product; it's a simple switch to flip. What we believe is that when your metadata is on flash, you're going to prefer ZetaScale, mostly because there's no garbage collection going on. Systems that have garbage collection are really hard to control from a performance standpoint — it's just a fact of life, you've got this background thing that turns on and off, your front-end performance is affected by it, and it's very hard to minimize. The simplest solution is not to have background operations, so when you're running with ZetaScale there are no background operations, at least not coming from Ceph; you'll still have garbage collection in your flash devices, but that's done in the hardware. In the case of RocksDB, the log-structured merge really reduces the garbage collection the hardware has to do, but that's because a log-structured merge is basically garbage collection being done by the host. So I think it'll be interesting, when it's all done, to see which workloads are better with RocksDB, with host-based garbage collection, and which are better with ZetaScale, with controller-based garbage collection.

Here are some of the early performance numbers we got for ZetaScale: this is BlueStore with RocksDB and BlueStore with ZetaScale doing random 4K IOs, per OSD. Put your phones down, we'll post this. You can see that for read operations ZetaScale significantly outperforms RocksDB, and that's because of the way the data is stored in RocksDB; this is a case where the B-tree really helps you relative to the LSM. So when we get this fully integrated into the codebase, we expect these kinds of ratios in performance to have a big impact on the graphs you saw earlier.

So, just to wrap up. We talked about BlueStore. The Jewel release has a tech preview — I think it was good for the community to have it out there — but don't confuse that with the BlueStore that's in master today; they're completely incompatible, there's only a notional relationship between the two. I talked about some of the CPU and memory optimizations that are going on. There's also support for SMR drives being worked on. The hope is to get BlueStore done enough to be in Kraken, which should be later this year — and there's not a lot of this year left, so that might be a bit optimistic, but hopefully not. The hope is that we'll make it the default back end in Luminous next year. And that's really all I have. I've still got a few more minutes, and I wouldn't want to stand in the way of anybody getting food, but I'll be happy to take some questions.

Excuse me — encryption? No, there hasn't been any encryption work done in the code. The data path part of encryption is pretty simple; it's really all about the key management, and I'm not a key management wizard.
So I think that's something that will come, but it doesn't seem to be above the line for anybody just yet.

Deduplication? There isn't any. Well, there are a couple of people talking about building deduplication as a layer on top of what's there, and conceptually that's doable. Ceph has this notion of tiers, a hierarchy of pools of data, and I think being able to deduplicate as you move data from a hot tier to a colder tier is probably something that makes sense and probably not too hard to build. But I wouldn't expect to see it on your top tiers, because of the way Ceph is structured it won't be very efficient; it's going to be fairly resource intensive. So you probably only really want to do it when you archive the data, and I don't think it's imminent.

The upgrade path between FileStore and BlueStore? Take your node, shoot it in the head, and let it rebuild itself. Well, yeah, basically: if you have a running cluster, you can upgrade to the latest release, whatever that is, then pick some number of OSDs — hopefully not too many of them — shoot them in the head, and let them rebuild themselves. Then, when they're done, find a few more victims and shoot them. That's the in-place upgrade story. It's not a great story, but it's the only story we have right now.

Okay, let's all be first in line for the food. Thank you.