Hi, everybody. I think it's about time to get started. Before we begin, I want to get a show of hands: how many of you have used Ceph before? All right, that's a pretty good number. So I want to go over a bit of background about Ceph first, for those of you who haven't used it before. Ceph was originally created by Sage Weil. Sage is just a brilliant guy; he's been doing interesting things all his life. When he was in high school, he created WebRing — maybe you've heard of it — which was bought by GeoCities. Then when he went to college, he founded DreamHost. While he was at DreamHost, he encountered these large SANs used to store all these files, had lots of interesting problems managing them, and became very interested in storage. So he decided to pursue a PhD, and the result of that PhD was Ceph. Ceph is a distributed storage system designed to scale as wide as possible — to petabytes and exabytes and beyond. Sage's whole philosophy in designing Ceph was to make it very easy to manage. Instead of having one large box with many things inside it that might fail and that you can't get at, you want a system that you can fully look inside of — and that meant making it all software-based. So Ceph is fully software-based. It runs on commodity hardware, and it tries to make everything much easier for you by automatically keeping multiple copies of data, keeping everything consistent, and generally trying to make the life of an admin and a user much easier. So it really helps bring down your costs in time and money. Ceph has also tried to be very flexible. Although it originally started as a project to build a POSIX-compliant file system, it was architected so that it could be a base layer for developing many other kinds of things. So today, we have block storage and object storage built directly on top of Ceph, and the Ceph file system is, in fact, the less developed part so far.
So Ceph is all open source, licensed under the LGPL. There's nothing secret about it at all; everything's up on GitHub, and you can go download it today. So I wanted to go over a bit of Ceph architecture and how Ceph works internally, then go over some recent changes in Ceph in the last six months, what's going to happen in the next release, and maybe what features are possible in the future. Like I said before, Ceph tries to be a very generalized layer at the lowest level, and that lowest level is a strongly consistent, reliable object store called RADOS. This basically means you're storing objects in the object store, but it supplies more than just simple reads and writes: it has transactions that you can add arbitrary operations to. So it's a very rich environment in which to develop other applications. You can see there are several different layers above the basic RADOS level. There's the RADOS Gateway, an HTTP REST API that provides a Swift and S3 interface to the underlying RADOS storage. There's the Ceph block device, or RBD, which is a virtual block device striped over objects in the cluster. And finally, there's the Ceph file system, a POSIX-compliant file system which has been upstream in the Linux kernel since 2.6.32 or .34, and which is also available as a FUSE client if you don't want to install a new kernel module. So how does Ceph work internally? Well, at a low level, all Ceph needs is a disk and a file system. In Ceph, there are three kinds of storage servers. The most basic is the OSD, or object storage daemon. This sits on top of any regular Linux file system — it could be Btrfs, XFS, ext4, or even ZFS more recently. All it needs is extended attribute support. These OSDs form the basis of the cluster; they store all the data themselves. So from a user's point of view, they're accessing the cluster through some mechanism here.
And this whole giant set of servers presents just a basic object store where clients can put objects and get them back — without having to worry about when disks fail or when servers fail, because the OSDs will automatically re-replicate anything that's missing or anything that needs more copies made of it. The three Ms you see there are the monitor servers. The monitors basically provide a record of cluster state. They run the Paxos algorithm, which makes sure they're consistent with each other and can't get into a split-brain scenario. They keep track of, for example, which OSDs are up and which are down. And they're not in the data path; they just provide this map of which OSDs are up and down right now, or which monitors are up and down right now, to the clients and to the OSDs, so they can all tell where they need to go. Generally, an OSD runs on top of one disk. You could potentially run one on top of a RAID if you have a very large storage array — say a very large server with 72 disks, where maybe you don't have enough memory or processor power to support 72 OSDs, so you might want to RAID the disks together. But usually, we recommend running one OSD per disk. The OSDs are responsible for managing all the storage, all the objects. Nothing is stored on the monitors; everything is stored on the OSDs. And the OSDs all talk to each other in a peer-to-peer fashion, so they notice when other OSDs have failed and report that up to the monitors. The monitors then update the cluster map, which gets propagated via a gossip protocol to the rest of the cluster, so the cluster knows what's happening all the time. So that's basically the state of the cluster. But how do you actually access the cluster? How do you find out where your data is? Well, there are several different approaches you can take to finding where data is stored among a group of servers.
The most basic is to write your data to some set of servers and then record, on some other servers, which set of servers that was — a mapping from, say, the object name to the set of servers. And that's okay, but it doesn't really scale. It adds another layer to the data path where you have to keep looking up where your data is, which isn't really necessary. A common solution is to use something like hashing or sharding, which lets the client just hash the object name and figure out exactly which server the data lives on. But with simple hashing, when a server goes down or when you add more capacity, a whole bunch of data needs to be reshuffled. So Ceph uses a slightly smarter algorithm called CRUSH. Basically, Ceph locates data by taking the hash of the object name and mapping it into one of a number of placement groups, or PGs, which are basically just shards of data that make managing cluster state easier. CRUSH then takes that placement group, the current cluster state maintained by the monitors — which OSDs are up, which are down — and a set of rules describing how your cluster is physically laid out, like which racks or rows or hosts the devices live in. The output of the CRUSH algorithm is the set of servers that store the data. So basically, you have a bunch of objects, and these are chunked up into different placement groups for management by the OSDs. Each OSD is responsible for a pseudo-random subset of these placement groups, based on CRUSH. If a single OSD goes down, for example, the yellow blocks and the green block here that were previously located on it are re-replicated to other OSDs automatically. But since CRUSH is a stable algorithm, if the same OSD comes back up again, the data will be stored in the same place when it returns.
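To make that concrete, here's a toy Python sketch of the two properties the scheme relies on: object names hash deterministically into placement groups, and each PG maps stably onto a set of OSDs. I'm using rendezvous hashing as a stand-in for the real CRUSH algorithm, and all the names here are made up.

```python
import hashlib

def pg_for_object(name: str, pg_num: int) -> int:
    # An object's name hashes deterministically into a placement group.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % pg_num

def osds_for_pg(pg: int, osds: list, replicas: int = 3) -> list:
    # Stand-in for CRUSH: rendezvous (highest-random-weight) hashing.
    # Every (pg, osd) pair gets a stable score; the top scorers hold
    # the replicas, so any client can compute placement on its own.
    def score(osd):
        return hashlib.md5(f"{pg}:{osd}".encode()).hexdigest()
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
pg = pg_for_object("my-object", pg_num=128)
before = osds_for_pg(pg, osds)

# If the primary fails, the PG is mapped onto the survivors...
survivors = [o for o in osds if o != before[0]]
after = osds_for_pg(pg, survivors)
# ...and when it comes back, the original placement is recomputed
# identically -- the algorithm is stable, so no data moves needlessly.
restored = osds_for_pg(pg, osds)
```

Real CRUSH also walks the cluster topology and device weights, but the core idea is the same: placement is computed, not stored.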
Another nice thing about CRUSH is that because you can model the topology of your storage, you can have custom rules that define exactly what placement policy you want for your data. You could have, say, a set of fast servers and a set of slow servers, and have the first replica be on the fast servers — since reads are always served from the first replica — and the rest of the replicas be on the slow servers, where you don't care so much about, say, write performance. You could also do things like separating replicas across rows, or across racks, or simply across hosts, to make sure you separate your failure domains. CRUSH also handles separate weights for each device, so you can have different-sized disks with no issues there. So that's basically how a client can talk to the cluster, and there are several different ways it can do this. The lowest-level API is called librados, a userspace library which has bindings in C, C++, and Java. Basically, librados talks to the monitors, gets the cluster state, and then can talk directly to all of the OSDs in parallel if it wants. This is the lowest layer, which everything else is built on top of, so there's no extra overhead here — it's the most direct access to the cluster you can have. A client can access all the OSDs and get the benefit of all that disk performance at once, if it has enough parallelism. The layer above that is the RADOS Gateway, which I mentioned earlier; it provides an S3 and Swift API on top of the object store. It uses librados to talk to the object store, along with some custom object classes which allow custom operations and transactions on a single object — for example, atomically updating an attribute on an object while adding data to it.
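Going back to those placement rules for a second — a rule like "primary on a fast host, remaining replicas on slow hosts, never two replicas on one host" can be sketched like this. This is hypothetical Python, not the real CRUSH rule language (real rules live in the CRUSH map), and the host and device names are invented:

```python
import hashlib

def score(pg: int, dev: str) -> int:
    # Stable pseudo-random score per (PG, device) pair -- a stand-in
    # for CRUSH's deterministic hashing.
    return int(hashlib.md5(f"{pg}:{dev}".encode()).hexdigest(), 16)

def place(pg: int, fast: list, slow: list, replicas: int = 3) -> list:
    """Sketch of a rule: first replica on a fast host, the remaining
    replicas on slow hosts, and never two replicas on the same host."""
    primary = max(fast, key=lambda d: score(pg, d))
    used_hosts = {primary.split(":")[0]}
    rest = []
    for dev in sorted(slow, key=lambda d: score(pg, d), reverse=True):
        host = dev.split(":")[0]
        if host in used_hosts:
            continue  # keep failure domains (hosts) separate
        rest.append(dev)
        used_hosts.add(host)
        if len(rest) == replicas - 1:
            break
    return [primary] + rest

# Hypothetical topology: two fast (SSD) hosts, three slow (HDD) hosts.
fast = ["hostA:ssd0", "hostB:ssd1"]
slow = ["hostC:hdd0", "hostC:hdd1", "hostD:hdd2", "hostE:hdd3"]
placement = place(17, fast, slow)
```

Because reads go to the first replica, this gives you SSD-speed reads while the slow hosts just absorb replica writes.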
The RADOS Gateways are just stateless servers that coordinate among themselves when they're caching authentication information. They also provide multi-tenancy, since native RADOS doesn't have multi-tenancy directly: the RADOS Gateway provides it through the HTTP APIs' notions of access keys, tokens, and users, and it also tracks all the usage information for all of that. So the RADOS Gateway is basically just a proxy that translates Swift and S3 into RADOS. Another layer up is RBD, the Ceph block device. This is the layer most often used in OpenStack, as a volume in Cinder. A block device is basically striped over a bunch of objects in the cluster, and it has the same benefits as librados: it can access all those servers in parallel and do all kinds of reads and writes in parallel, without any serialization or any central authority, because thanks to CRUSH it knows exactly where the data needs to go — it's calculated on the fly. Since this is all shared storage, a virtual machine booted off a RADOS block device doesn't depend on the local server. So you can have diskless compute hosts, or you can even migrate a virtual machine from one node to another while it's running off of RBD. There's also a kernel module, upstream since 2.6.37, which lets you map a RADOS block device directly to a regular Linux block device — so you can have, say, /dev/rbd0 show up, mount it, and put a file system on top of it directly if you want; you're not doing virtualization. The RADOS block device also has a few other nice features. It supports thin provisioning — everything is thin provisioned by default, so when you create one, it doesn't really use any extra space; it just stores the name and the size. It also supports efficient snapshots and cloning, so you can take a snapshot at any time.
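Before getting to snapshots, here's what that striping looks like concretely: with 4 MiB objects (RBD's default object size), a byte offset in the image maps to an object index plus an offset inside that object, so I/O to different regions of the device hits different objects and therefore different OSDs. The naming scheme below is illustrative, not the exact on-disk RBD object format:

```python
OBJECT_SIZE = 4 * 1024 * 1024  # RBD's default object size: 4 MiB

def rbd_locate(image_prefix: str, byte_offset: int):
    """Map a byte offset in a block device image to (object, offset).

    Sketch of how a virtual block device is striped over objects:
    block N of the image lives in its own object, named by index.
    """
    index = byte_offset // OBJECT_SIZE
    return f"{image_prefix}.{index:016x}", byte_offset % OBJECT_SIZE

# A write at 10 MiB into the image lands 2 MiB into the third object.
name, off = rbd_locate("img", 10 * 1024 * 1024)
```

Since any client can compute this mapping on its own, parallel I/O needs no central lookup service at all.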
The snapshot will be consistent with everything that's there at that time. Even if you're running a virtual machine, as long as you make sure your application is in a safe state, you can take a snapshot while it's still running. So if you have multiple virtual machines running, they're all accessing the cluster through different paths — they're all going to different OSDs, because the data is placed pseudo-randomly across the cluster. There's no single bottleneck, and you can scale out to arbitrary size. But how do you spin up many virtual machines when you have a sudden spike in demand and want to create, say, a hundred new web servers? Well, if the web servers are backed by the RADOS block device, you can just create clones. If you've taken a snapshot of an existing web server that's ready to run, a clone uses no extra space and is an instantaneous copy of that block device. It's a copy-on-write clone, so when a virtual machine starts writing to the new clone, the data is copied from the parent object to the child — the clone object you just created. But reads still fall back to the parent if the object doesn't exist in the child. So, one of the things that used to be very slow in Nova is creating new virtual machines: copying all the data out of Glance onto a local disk and then creating a new copy of that file to boot a virtual machine off of. In the Folsom release, we changed this a bit to allow Glance to be backed by RBD, and also Cinder to be backed by RBD. So you can have your virtual machine templates in Ceph as well as your volumes in Ceph. And there's an API call that was added to create a volume from an image — if Glance and Cinder are both backed by Ceph, this new volume can just be a clone.
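That copy-on-write read/write behavior can be modeled in a few lines. This is a toy in-memory sketch — real RBD works on whole 4 MiB objects and copies the parent object up before applying a partial write; here, writes simply replace the whole object:

```python
class ToyImage:
    """Toy model of RBD copy-on-write cloning, at object granularity."""

    def __init__(self, parent=None):
        self.objects = {}      # object index -> bytes
        self.parent = parent   # the snapshot this clone was made from

    def clone(self):
        # Instantaneous and space-free: the clone stores nothing yet.
        return ToyImage(parent=self)

    def read(self, idx):
        # Reads fall back to the parent while the child object
        # hasn't been written yet.
        if idx in self.objects:
            return self.objects[idx]
        return self.parent.read(idx) if self.parent else b""

    def write(self, idx, data):
        # First write materializes the object in the child; later
        # reads of this index no longer touch the parent.
        self.objects[idx] = data

template = ToyImage()
template.write(0, b"bootloader")
template.write(1, b"rootfs")

vm = template.clone()          # 100 web servers = 100 instant clones
data = vm.read(1)              # served from the parent template
vm.write(1, b"rootfs+config")  # copy-on-write: only this object diverges
```

The hundred web servers each store only the objects they've actually changed; everything else is read straight from the shared template.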
So it'd be an instantaneous copy with no extra space used, whereas in the previous model, you had to copy all the data over to get anything out of the template. That's basically an overview of how the block device works. But what's happened in the last six months? In January, we had our second stable release, the Bobtail release. A number of things improved there. At the OSD level — how it interacts with the file system and with its journal, which can be on a separate device for faster writes — performance was improved quite a bit by rearranging how the locks worked, making them more fine-grained and less coarse. The IOPS a single OSD could get out of a single underlying file system increased from about 6,000 to 22,000. We also restructured how the OSD interprets updates to the state of the cluster. The placement group concept I mentioned earlier is the unit of recovery, so whenever the cluster state changes, a placement group may need to be re-replicated to a different OSD. But many updates to the map only affect a certain subset of placement groups. So in Bobtail, we made map handling happen on a per-placement-group basis: it's no longer the entire OSD going through the list of all the placement groups at once; each one can go through and update itself to the new map independently. That improves general quality of service when you have lots of events happening in your cluster, like nodes dying or new nodes being added. In addition to the map handling, adding a priority system to the way messages are processed in the OSD increased the ability of clients to continue I/O while recovery is happening — while more data is being replicated throughout the cluster. Basically, the old system could allow clients to be starved by high-priority recovery operations.
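The fix is weight-based queueing: drain several queues in proportion to their weights, so a low-weight recovery queue still makes progress but can't crowd out clients. Here's an illustrative sketch of that weighted fair queueing idea — not Ceph's actual scheduler, and the queue names and weights are invented:

```python
def weighted_schedule(queues: dict, weights: dict, n: int) -> list:
    """Serve n ops from several queues in proportion to their weights.

    Each round, every queue earns credit equal to its weight; the
    non-empty queue with the most credit is served and pays the total
    weight back. Neither queue can starve the other.
    """
    served = []
    credit = {q: 0.0 for q in queues}
    while len(served) < n and any(queues.values()):
        for q in queues:
            credit[q] += weights[q]
        # Serve the non-empty queue with the most accumulated credit.
        q = max((q for q in queues if queues[q]), key=lambda q: credit[q])
        served.append(queues[q].pop(0))
        credit[q] -= sum(weights.values())
    return served

queues = {"client": [f"c{i}" for i in range(20)],
          "recovery": [f"r{i}" for i in range(20)]}
# Client I/O weighted 3:1 over recovery: both make steady progress.
order = weighted_schedule(queues, {"client": 3, "recovery": 1}, 12)
```

With a 3:1 weighting, roughly three client ops are served for every recovery op, and changing the weights at runtime shifts the ratio dynamically.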
But now the recovery system has been reworked to be based on weights, and recovery operations can be prioritized lower than client operations without being totally starved. So clients can keep sending requests and keep being served, without any interruption in service, while recovery is still happening in the background. Because it's weight-based, it can also be changed dynamically. So if there's a certain subset of objects you want recovered more quickly — for example, if you try to access a certain object that hasn't been recovered yet — it'll be queued for recovery next. The block-device cloning I was talking about is also a new feature in Bobtail. It's integrated with Cinder, and in Grizzly, an image from Glance that's non-raw — QCOW2 or some other format — can now be converted into a raw format image, which is what's suitable for actually booting a virtual machine off of RBD. On the RADOS Gateway side, the gateway learned how to talk to Keystone to authenticate Swift API requests, so it can automatically create users based on the tokens it sees from Keystone. That makes it easier to manage your users and generally use it with OpenStack. And if you look at the Ceph Juju charm, you can just install Keystone and Ceph and have it all working together with very little effort. So, what's coming up in the next stable release, Cuttlefish? Well, we're on a three-month release cycle now, so this release is coming up in a couple of weeks. A couple of major features are in it. The main one for the block device is incremental backup — it's kind of like the ZFS concept of send and receive snapshots. You can send and receive RBD snapshots now.
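The send/receive idea boils down to shipping only what changed between two snapshots. A toy sketch, with dicts standing in for an image's objects (the actual tooling is along the lines of `rbd export-diff` and `rbd import-diff`):

```python
def export_diff(snap_from: dict, snap_to: dict) -> dict:
    """Compute the changes between two snapshots of an image.

    Toy model of incremental backup: instead of shipping the whole
    image, ship only objects that were added, changed, or deleted
    between the two snapshots (deletions marked with None).
    """
    diff = {}
    for idx, data in snap_to.items():
        if snap_from.get(idx) != data:
            diff[idx] = data
    for idx in snap_from:
        if idx not in snap_to:
            diff[idx] = None  # object removed since the base snapshot
    return diff

def import_diff(base: dict, diff: dict) -> dict:
    # Apply a diff at the backup site on top of the base snapshot.
    out = dict(base)
    for idx, data in diff.items():
        if data is None:
            out.pop(idx, None)
        else:
            out[idx] = data
    return out

snap1 = {0: b"boot", 1: b"rootfs", 2: b"logs"}
snap2 = {0: b"boot", 1: b"rootfs-v2", 3: b"data"}
diff = export_diff(snap1, snap2)  # only the delta crosses the wire
```

The unchanged object 0 never leaves the primary site, which is where the transfer-time and storage savings come from.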
You can export the difference between one snapshot and another snapshot of a block device, and save tons of transfer time and storage space when, say, you're backing up these block devices to another site for disaster recovery. On the OSD side, there's support for encryption of data at rest: each OSD can have its underlying file system on top of a dm-crypt encrypted block device. The keys are currently managed by the monitors, because that's the simplest place to put them; that might change in the future if other key management services come up. For the RADOS Gateway, there's a new REST API for managing the gateway, so you don't have to go through the command line all the time, which makes it a lot easier to use for a lot of folks. And there are more performance improvements, especially for small I/O, and writes in particular. A number of those performance improvements came from moving some of the metadata that Ceph stores about objects out of the underlying local file system and into a LevelDB instance — there's one LevelDB instance per OSD. There's plenty more there too, but those are the main things that are important for OpenStack users. So what's next after that? Well, the next release is called Dumpling. It'll be out in August, and there are a couple of features proposed so far. There's geo-replication for the RADOS Gateway, so it can be aware of multiple sites — they don't necessarily have to always be in sync, but can be synchronized asynchronously. And there's also a general REST management API for the Ceph cluster — not just the RADOS Gateway, but the whole Ceph system — so you don't have to go through the command line; you can just use the HTTP API. We're having a virtual Ceph Developer Summit on May 6th.
If anyone wants to propose blueprints, they can go to ceph.com — there's a new wiki there. Anyone can add blueprints and come to the Ceph Developer Summit to discuss any new features they're interested in or interested in working on. We welcome any contributions anyone has; that'd be great. And I'd like to take any questions anyone has. Yes — the question is, when you're using block storage, is it possible to control where the data goes, so you can move it closer to where the virtual machine is actually running? That is possible in certain setups. Ceph has a concept of a storage pool. For example, DreamHost does this in their DreamCompute architecture, which is described online: they have different pods of compute clusters, with an associated storage pool for each compute cluster. So they just put a block device that's going to a certain compute cluster in that same storage pod. Yeah — so the question is, can you have block storage and object storage in the same Ceph cluster at the same time? You absolutely can. That's one of the big benefits of Ceph: you can use the same hardware in the same cluster to serve many needs. So you can have block storage and object storage in the same cluster. They could use separate pools if you want to differentiate them in some way — say, more SSDs for one and SATA for the other. But they can all be in the same cluster, no problem. Yeah — so the question is, have we been working with Red Hat to get the kernel module for RBD into the RHEL 6 release? We've certainly been trying to work with them to get more of Ceph into the RHEL releases — I mean, RHEL 7 is the next one. It's not clear whether we'll be able to get the kernel module in there, or which parts of Ceph we'll be able to get in, but we're certainly looking at that. Anyway, we now have Ceph in EPEL, at least.
The question is, do we recommend the userspace client or the kernel client for the file system? At the moment, it depends on your needs more than anything — the performance difference between them isn't that great right now, because the Ceph file system is still relatively unstable; it still needs more QA. But the FUSE client is much easier to get started with, because you don't need the kernel module and you don't need special support for it. So — RBD does support discard and trim at the QEMU level. The OpenStack integration right now doesn't turn that on at all, and I don't believe you can configure block devices in OpenStack to use that feature on the back end yet. But it's certainly an easy thing to add in the future; RBD itself does support it. Well, there's lots of work ongoing on the file system. It's certainly important to many people, and lots of people are interested in it, and we're working on it quite hard, but it's a much more complex piece than the block device or the object store. So the question is, do we have any plans to enable static website hosting for Swift in addition to S3? I'm not sure of any plans right now; I think we wouldn't want to try to add our own extension to the API unilaterally. So — what performance measuring or tuning tools are available for looking at Ceph? There are a couple of different ways you can look at this. At each level of Ceph — at the Ceph client and at the OSD level — there's something called an admin socket, which you can use to query information about latency or other performance statistics of the last N requests. In general, you can look at lots of things like block device utilization and general I/O stats on the OSDs themselves. On the client side, using the admin socket again, you can look at how efficiently the cache is being used and whether it's actually benefiting your workload or not.
So there are a number of things there already, although they could certainly be improved in the future — there's always room for improvement. On tuning — I guess you were asking about how to best tune partition sizes in particular. Generally, the partition sizes don't affect performance too much; it's really about the memory usage on the OSDs. We recommend about 100 placement groups per OSD per pool. What are the main differences between the Swift API and what the RADOS Gateway supports? I don't remember all the differences off the top of my head. I think one of them might be versioning, but I'm not sure if that's supported now or not. I think there's a chart in our documentation somewhere that lists exactly which calls are supported. So — what's the upgrade process like, since we're on this short release cycle? Generally, we're very careful about backwards compatibility and rolling upgrades. We test pretty much all upgrades at Inktank, particularly between the stable releases, pretty extensively, looking for possible causes of problems. And we're very careful to make sure that rolling upgrades are always possible. Is the RADOS Gateway or the block device more reliable, or does one have a longer history? They were started at similar times, and I'd say they're both at a similar level of reliability, because they both really depend on the core RADOS object store for all the fancy functionality. RADOS itself handles all the replication and all the consistency, so the block device and the gateway are really much thinner wrappers around that. When would I expect the file system to be at a similar level? I can't really give any exact dates — it's done when it's done. The question is about where reads come from. Reads come from only the primary copy, and this is especially useful for cache efficiency, because on the OSD, objects are stored as files.
So when you're doing a bunch of reads that hit the same object, that file is going to be in the page cache on the OSD. If reads were spread across multiple replicas, with reads hitting each of those replicas, your cache efficiency would be divided by the number of replicas you have. The question is about network traffic with reads — and it's true that reads of a single object all go to the same OSD. So if everyone's trying to read a single object, you're all going to the same OSD, and that could be a bottleneck if you have a very small pipe to that OSD. But in general, we don't find that reading the same object from many places happens very often. Usually there's some striping going on, so different clients will be reading different objects at the same time, or there's some cache layer at a higher level which makes that redundant. Yeah — so the question is about performance of iSCSI compared to the block device. I don't think we've done any specific comparisons recently, but about a year ago, someone did some comparisons themselves, and they were very close — almost exactly the same performance, even with Ceph providing much stronger guarantees about reliability and correctness. The question is whether OpenStack is interested in replacing iSCSI with Ceph. I think some people certainly are interested in that. I don't have a great gauge on that right now, but I think it might not be possible until there's more support in all the distributions — RHEL 6, for example. Are we testing against the ZFS on Linux that was released recently? We're not testing against it currently. Some folks are trying that out, and actually just yesterday a bug in ZFS on Linux was found because of that. It's certainly something to look at in the future, but not right now. What are the plans for integration with Ceilometer?
So right now, Ceph doesn't have any direct integration with Ceilometer, but there's a lot of potential — for example, making the usage data from the RADOS Gateway flow through to Ceilometer, or sending the internal cluster statistics about performance or general service to Ceilometer. I'm not sure we have any concrete plans at the moment, but those are all certainly possible. So — why do people want to use Btrfs instead of XFS? There are two reasons, in general. On the OSDs, there's the data device, which is usually just a regular disk, and you can also have a separate disk as a journal device, which aggregates writes before they're flushed to the data disk. With Btrfs, because it has consistent snapshots and transactions, we don't have to do as many fsyncs to the data disk to get the same consistency guarantees that we need on XFS. So it can provide a theoretical performance boost in that case. Btrfs also provides some snapshotting internally, so Ceph's snapshots can use the Btrfs snapshots if Btrfs is used, which saves a bit more space. Any other questions? Is Btrfs considered stable? It's not the recommended choice right now. XFS is recommended because it's generally more stable, and there are still some performance problems with Btrfs over time as it gets more fragmented. So XFS is still recommended as being more stable and less likely to fragment. Sorry, what was that? The comment was that Btrfs destroys disks — yeah, Btrfs could certainly be a problem on SSDs, perhaps, because it wears them out faster. How does Ceph treat different tenants in OpenStack? Right now, at the pool level. In Ceph, the pool is the level of authentication, but pools also map directly to placement groups, which consume memory on the OSDs. So creating a pool for every single tenant becomes unscalable, because you have to use up more and more memory to manage all those placement groups.
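That memory cost is why PG counts are sized deliberately. The rule of thumb I mentioned earlier — about 100 placement groups per OSD per pool — works out like this (the round-up to a power of two is a common convention, not a hard requirement):

```python
def suggest_pg_num(num_osds: int, replicas: int = 3,
                   per_osd: int = 100) -> int:
    """Rule-of-thumb PG count for one pool.

    Aim for roughly `per_osd` placement groups on each OSD; since
    every PG is stored on `replicas` OSDs, divide by the replica
    count, then round up to a power of two.
    """
    target = num_osds * per_osd // replicas
    pg_num = 1
    while pg_num < target:
        pg_num *= 2
    return pg_num

# A 30-OSD cluster with 3x replication: 30 * 100 / 3 = 1000 -> 1024.
```

Every pool adds its own PGs on top of this, which is exactly why a pool per tenant doesn't scale.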
So right now, OpenStack uses a single pool for all tenants. In the future, it's certainly possible to add another layer of multi-tenancy on top of pools, one that would be separate from placement groups. Thanks very much.