Welcome to another edition of RCE. Again, this is Brock Palen. You can find us online at RCE-cast.com, where you can find our old shows and subscribe by RSS, iTunes, etc. You can also follow me on Twitter, where I've actually been kind of active recently, mentioning some things, and we have a lot of upcoming events. So you can follow me on Twitter at brockpalen, all one word. And again, I have Jeff Squyres from Cisco Systems, one of the authors of Open MPI. Jeff, you just got back from another MPI Forum meeting, so I assume the next revision of MPI is cooking along. It's coming. There's a BOF at Supercomputing about MPI 3.0. We're looking to be done with MPI 3.0 by next Supercomputing. That date is tentative, but that's the goal we're shooting for. And there's a bunch of interesting things coming in there. The biggest fight right now is about C++, which is really interesting, and which I feel responsible for, because I created the C++ bindings back in 1996. But yes, anyway, the machinations of the MPI Forum. Come to the BOF at Supercomputing and hear all about that. And speaking of BOFs, I have my own BOF as well with George Bosilca from the University of Tennessee, a State of the Union for Open MPI. And thankfully, the Supercomputing organizers did not put us opposite the MPICH BOF this year. Last year they did, which was kind of disappointing because you couldn't go to both, and I like to hear what those guys have to say too. So I think they're on Tuesday and we're on Wednesday, something like that. Yeah, I'll also be at SC. I'll be floating around. I'm not doing any speaking at SC, but I'll be there the entire week. Also coming up at my home institution, the University of Michigan right here in Ann Arbor, we have Cyberinfrastructure Days, just a couple of days, the 29th of November through the 1st of December. I'll actually be speaking there on XSEDE, which is the follow-on to TeraGrid, with Phil Blood from Pittsburgh Supercomputing Center. Cool. I'll also be giving some other tutorials on compiler tips and stuff like that for the average researcher. Yeah, so you advertised something about an MPI tutorial you did recently? A very, very basic one. It's the one I asked if you wanted to come and do, but it's very basic, and then I found out they only gave me 90 minutes to do it in. Like, really? Really? Yeah. I couldn't justify the travel for 90 minutes. For 90 minutes, yeah. I'll be speaking on that too, and other random stuff focused towards people who are actually getting their hands dirty. Hey, one throwback to Supercomputing too: we feel like we need to mention the Student Cluster Competition, because that was just so awesome last year when we were a part of it. The energy and the excitement around that — you need to go stop by and see them. I think they're on the floor this year, or right adjacent; I'm not sure exactly where they are, but you need to go see them and talk to them, because it was really, really very cool last year. Yeah, that was a great thing last year. And I think, you know, Doug and those guys have been putting a lot of neat stuff together with the teams this year. So I'm excited to see — actually, I need to look up what the challenges are this year, what the applications are. I'm curious.
All right, one last thing: also my blog and my Twitter. Brock's been more active on Twitter than I have recently, but I've been answering a bunch of MPI-related questions on my blog. So if you have any questions about how MPI works, or why we do things the way we do, or anything about the Forum or the standard, please feel free to let me know, either by email or Twitter, and I'll write up a blog post about it. So with that, I think our pre-show extravaganza is done, Brock. You want to introduce our guest for today? Yes, yes, our guest today comes from DreamHost. His name is Sage Weil, and he's actually, I believe, the person who started the Ceph distributed file system project. So Sage, why don't you take a moment to introduce yourself? Sure, my name is Sage Weil. I did my graduate work at the University of California, Santa Cruz, where my thesis was on distributed storage, and out of that grew the Ceph distributed file system. And since finishing, I've continued working on that project to make it a viable open source solution to the scalability and reliability issues that people have for HPC-type storage, and enterprise for that matter. So Ceph basically grew out of this work you did before. Was there encouragement from somebody else, or is it something you decided on your own to go do? It grew out of the Tri-Labs — that's Sandia, Livermore, and Los Alamos — a series of grants they made to Santa Cruz to look at scalable, petascale, object-based storage systems. And so there was some initial research there dealing with low-level object file systems and placement algorithms. But when I joined the project, I was focusing on distributed metadata and sort of how to deal with that issue. At the time, Livermore in particular was just starting to use Lustre, and they were having a lot of pain with the lack of scalability in the metadata server, and so that was sort of the key motivation for that work. But out of that whole effort — I know I started with metadata, but as a result we sort of ended up building the rest of the system, including a scalable, reliable object storage layer and improved placement policies and so forth. So what exactly is object storage, as compared to, say, quote-unquote normal storage? Traditional systems typically talk about storage in terms of blocks. So on a single disk it's block number some-large-number, or in, I guess, a SAN file system, you'd talk about block offsets within a LUN or something like that. In contrast, the idea with object-based storage is that, in the same way that you name a block by numbering it, you name the object, but it isn't necessarily a number, and it doesn't have a fixed size, so the object can be small or large, and it can have some metadata associated with it as well. Essentially, the key idea is that while traditional file systems have to pay attention to data layout and placement and block allocation and which sectors on the disk are storing what data, using an object-based interface lets you push all of that complexity into the lowest levels of the system, where it's more or less hidden, and all the distributed, clustered, whatever, higher levels of the file system don't really have to worry about those details, and it simplifies things greatly.
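As a concrete illustration of the object abstraction Sage describes — named, variable-sized objects with attached metadata — here is a minimal sketch using Ceph's Python librados bindings. The pool name, object name, and attribute are made up for the example, and it assumes a running cluster with a standard ceph.conf; this is just what the client-side view roughly looks like, not anything shown in the interview.

    # Minimal sketch of the object interface, via Ceph's Python rados bindings.
    # Pool ('data'), object name, and xattr are placeholders for illustration.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('data')          # a pool of objects

    # Objects are named, variable-sized, and can carry metadata (xattrs).
    ioctx.write_full('checkpoint.0042', b'...payload bytes...')
    ioctx.set_xattr('checkpoint.0042', 'owner', b'rank-42')

    print(ioctx.read('checkpoint.0042'))

    ioctx.close()
    cluster.shutdown()

The point is simply that the client names objects and attaches metadata; where those bytes land on disk is entirely the storage layer's problem.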
So does that mean that Ceph, as much as it's a file system and a driver to be able to read and write to it, actually lives on top of something else that deals with the actual 4K blocks going on disk? Exactly, yeah. So normally we put our Ceph OSDs, as we call them — although it has nothing to do with the T10 OSD; it's sort of an unfortunate name — but basically the Ceph storage servers that manage the objects sit on top of, normally, a Btrfs file system, and where the objects actually show up is just as files. So that low-level file system on each of those nodes handles all those details. You can also run it on XFS or ext4 or whatever else, but that's the basic idea. And then Ceph itself only has to worry about what's in the objects and where in the cluster they're located, and it doesn't have to care about rewriting a B-tree implementation to track free space or whatever. Now, you threw out a bunch of alphabet soup there — can you decrypt some of those things? You said things like T10 and whatnot. What are some of these things, for people who aren't familiar with file systems? Well, I hesitate a little bit to talk about T10 because it's sort of a red herring, but essentially there was a push maybe five or ten years ago for this idea of an object disk that basically encapsulates this idea of pushing the details of block allocation into the device. And the original vision was that you would actually buy a hard drive that you would store objects on. In the protocol, you wouldn't say store this data in this fixed-size block; you'd actually name the objects and so forth. There's a spec that came out of that, and it never really took off. And so now when people talk about OSDs or object disks, usually that's what they think about, but in reality they don't actually exist in the real world, for the most part. There are some people that sort of approximate that specification in their products, but nobody actually sells the devices. So in contrast to that, Ceph is something completely different. Basically, you take that same idea of pushing the details of block allocation into the storage nodes, into the lowest layers of the system possible, and then you can focus on the hard problems of making it scale and distributing it and making it reliable and so forth. So I think you already kind of answered this question a little bit, but let's get it explicitly. There are already a couple of free-licensed parallel file systems out there, you know, Lustre, PVFS2. Why go in the direction of building a completely new implementation, rather than, say, contributing distributed metadata to Lustre? I think the real difference is that those systems aren't really designed with fault tolerance in mind, whereas when we were at the drawing board with Ceph, we realized that in order to build something that's going to scale to hundreds or thousands of nodes or more, you really have to design for failure from the beginning. Those systems tend to be constructed on the idea that your storage nodes are reliable in some way, so they rely on a SAN back end and failover and dual paths and so forth, which means you have to have expensive hardware. Our hope is that you can do the hard work of dealing with failure in software, and then you can run on commodity hardware for, I guess, more cloudy-type environments.
I mean, the real thing is that when we were at the drawing board designing the architecture, the goal was to avoid any sort of built-in limits on future scalability. It's not clear exactly where the scalability wall is going to be — it really depends on your hardware and so forth — but we try not to design any into the system. We've been able to test on hundreds of nodes, not thousands yet, because it's hard to get access to those kinds of clusters; we can only really simulate the workloads that we would expect in those environments. But the goal is that, you know, it scales as big as you can build it. So does that mean I don't have to worry about... So on Lustre, for each of my OSTs I'm, you know, making a RAID 5 or something like that to deal with hard drive failure. With Ceph, can I just throw every disk into Ceph and... Precisely. So if you want to, you can run RAID underneath each of the file systems that has a particular OSD on it. That just means that that individual node will be more reliable than it would otherwise be, and when you do the math to figure out your mean time to data loss or whatever, you can factor that in. But you don't have to. Ceph will replicate every object in the system across multiple object storage devices — or daemons, targets, whatever you want to call them — and so, in general, you don't have to worry about node failure. Interesting. So you really are going after commodity hardware, like you said before. So even crappy low-end disks won't matter, hypothetically, because you've got replication. So if a disk fails or even a whole node fails, it's not too much of a tragedy. Is that what you're saying? Exactly. I mean, the problem is that at scale, even if you have the best hardware, once you start throwing thousands of separate devices into the mix, something is always going to fail. Failure is a given; it's just a question of when, not whether. And so you have to deal with it one way or another. So you can run Ceph on high-end hardware and it'll just be super reliable, and if you run it on low-end hardware, the math will just work out a little bit differently and it'll be more probable that you'll have the double or triple failures that cause data loss. But there really is no single point of failure in the system. I think we've already been seeing this in other systems, where you could almost say the redundancy lives a layer above where you would normally handle reliability failures, because dual power supplies, RAID, all these things only last so long. You look at something like Hadoop's HDFS, which serves a very specific purpose: it replicates, doesn't use RAID, does all these things. Right. So why did you choose replication? Why not do some sort of parity calculation? Does it not actually save you any space or performance? Why go that route? Replication is simple, and we wanted to make it work first. And also, the cost of disk space is ever plummeting, and so at some level it's not worth it, at least not yet, from our perspective. So then do you do any type of intelligence — say I have two Ceph clients requesting the same file, can I use the closest replica? Do you get any performance benefit from doing replication? There is some logic like that in the system right now, but it's not currently enabled.
And we added that specifically for the Hadoop-type workloads, where the whole goal is to run the computation on the node that has the data, so you want that locality calculation. But in general it actually isn't necessarily a win — in fact, it rarely is — to read from multiple replicas of data, because it means that your caches don't perform as well. So let me ask what might be an ignorant question from a non-file-system guy. Does Ceph embody a file system in itself? Or is there an upper-layer file system that understands your object store? It's kind of all things at once. At the lowest level of the system, the basic abstraction is a reliable, distributed, and scalable object storage system. So you have this series of daemons across a bunch of nodes serving up their local disks, and it gives you a nice, clean abstraction: when you store an object, you know it's replicated, it's reliable, it's going to move around as nodes are added to the system or failures happen and so forth, and that's all managed for you. That basic object storage is the lowest-level interface. We also have something called the RADOS Block Device, which is a block device that's essentially just striped over objects, and that gives you simple, reliable, shared storage for block devices. And then on top of that — on top of the object layer, I should say — we have a clustered metadata server that gives you a POSIX namespace, and that gives you the scalable distributed file system that HPC people typically care about. Yeah. So are there other ways to access Ceph? Or is RADOS pretty much the Ceph file system driver, you could say? Well, you can access it any of those ways. You can talk directly to the RADOS object storage layer using the native APIs, you can talk to the block devices that are striped on top of those objects, or you can talk to the POSIX file system. There is actually one other component — we call it the RADOS gateway — that gives you a RESTful object storage proxy: it speaks an S3-compatible or Swift-compatible RESTful object storage interface, and it talks directly to the object store to serve the objects. S3 being the Amazon storage? Right. So you're basically saying that you can emulate what the Amazon storage APIs look like. Right, or the Swift APIs, which is what, I guess, the OpenStack folks are using. So can you actually use all of those at the same time? Could I have a big pile of disks with Ceph on top of it, accessed from my cluster with MPI on one side, and have some S3 clients and other things? Absolutely, yes. The RADOS object storage layer lets you separate objects into pools. Each pool has a set level of replication and whatever the current placement policy is, but you can have multiple pools with independent sets of objects. So you might have one of them that's full of block devices, one that's used for the file system, one that's used for whatever else. Brock just mentioned my favorite word there, MPI. So I have to ask, is there any thought of having an MPI-specific driver for parallel I/O with Ceph? I haven't been involved in anything MPI for some time now, so I don't actually know what that would entail. But presumably you could write an MPI-IO driver that takes advantage of something in Ceph that gives you more than just having a local mount would. I'm not exactly sure what that would be, honestly. Okay, fair enough.
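Since the RADOS gateway speaks an S3-compatible dialect, a generic S3 client library can talk to it directly. Below is a rough sketch using boto; the gateway hostname, access keys, and bucket name are placeholders, not anything mentioned in the show.

    # Sketch: talking to a RADOS gateway through a generic S3 client (boto).
    # Endpoint and credentials below are placeholders.
    import boto
    import boto.s3.connection

    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='radosgw.example.com',            # the RADOS gateway node
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.create_bucket('results')
    key = bucket.new_key('run-001/output.dat')
    key.set_contents_from_string('hello from an ordinary S3 client')
    print(key.get_contents_as_string())

Meanwhile the POSIX file system and RBD clients can be using other pools in the same cluster, which is the "big pile of disks" scenario Brock asked about.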
Is there any relationship between Ceph and some of the other parallel file systems out there, like NFS 4.1 or something like that? Not really. I mean, at a high level there are some architectural similarities between Ceph, Lustre, and the pNFS idea, where you have separate metadata management for the namespace and direct client access to data. But the similarity kind of ends there. We've actually looked at pNFS several times, and people have asked why we don't use pNFS for the metadata protocol. The answer really is that a lot of the smart things the Ceph metadata server does to make it scale properly and deal with HPC-type workloads — where you have tens of thousands or hundreds of thousands of clients accessing the same file or directory and so forth — you can't really do within the confines of the pNFS metadata protocol. And so we have our own protocol: we wrote our own Linux kernel driver and pushed it into the mainline kernel, so that you can take advantage of the full capabilities of the system. So early on — and I've been following it, more or less, for a while — it looked like the underlying file system of choice is Btrfs. Why go with Btrfs rather than ext4 or some other file system? Well, originally we actually wrote our own file system in user space. It was called EBOFS. The idea was that if you're dealing just with this object workload, then you don't need a lot of the complexity in the file system dealing with hierarchies, the namespace hierarchies and so forth. But as things progressed, we found that we were essentially reinventing the wheel: we were re-implementing B-trees, we were doing copy-on-write, we were adding checksumming and all this stuff to manage the underlying devices, and all that work was being done in the file system space as well. And when the Btrfs project started, it had, I guess, the best of breed of all the things that we really wanted. It has pervasive checksumming throughout the system. It has a very generalized and efficient copy-on-write B-tree infrastructure that the whole file system is based on. It gives you great metadata locality, and it handles multiple devices — it'll do replication across, you know, different parts of the same disk or across multiple disks. Eventually it's going to have RAID codes in there. It's going to have a way to take advantage of SSDs to, you know, manage the placement and make metadata performance faster, all that stuff within the file system. And so we really didn't want to maintain our own low-level file system when Btrfs was doing all the things that we wanted, and so we chose that route. We could have used ext4 or XFS or something like that — and in fact, you can run Ceph on top of those file systems. The key thing is that Ceph includes support for snapshots, and in order to make those efficient, we hook into the low-level Btrfs copy-on-write capabilities so that we don't actually copy data, whereas with those other systems you actually do have to literally copy objects when they're snapshotted and then modified, and so they're less efficient for that reason. Okay, I guess that actually answers part of my next question, which was: does it help you much to have special functionality in the underlying file system, so you're using those hooks for the snapshotting capability?
Is there anything else that you would like to see, maybe in Btrfs or a different file system, that would help Ceph out? The other thing we use in Btrfs is its snapshotting mechanism, to make consistent point-in-time snapshots of the low-level file system. That helps the Ceph OSDs — the storage nodes — keep their data fully consistent on disk, so that when they restart and do recovery, they don't have to do a lot of validation and checking. And it's hard to graft those features onto other file systems that already exist, so we struggle with that. We would really like to be able to run Ceph on top of any back-end file system, but it's a challenge to get the minimum set of features into those systems that lets us have that basic functionality without changing their architecture and losing the stability that they offer. ext3 or ext4 is an extremely mature file system because it's been largely unchanged and used for so long, and so it's hard to add new things to it without breaking that. So, a little bit of a tangent here. At the beginning of the show you talked about wanting to go into the enterprise as well. For that, you would probably need either OS X and/or Windows clients. Are those kinds of things in the works? Sort of indirectly. I mentioned that there's an upstream Linux kernel client for the file system. There's also a user-space implementation of that client library, and there are a couple of projects that link that user-space library into other products. So there's one that glues it into Samba, which is the open-source CIFS server for Linux that everybody uses — it glues it into Samba's internal file system abstraction. So you can run a Samba server that talks directly to the Ceph cluster and re-exports Ceph as CIFS. And there's a similar project to do the same with NFS-Ganesha, which is sort of a promising user-land implementation of an NFS server. That'll eventually also support things like pNFS, I believe. And so those give you the interoperability with existing standard protocols that people are looking for. Now, on an architectural level, though, if you talk to a Samba server, Samba is not distributed. So do you lose some of the benefits of doing Ceph that way? Some. You do have to have gateway nodes that re-export the system if you don't have a native client. On the other hand, Samba's support for having an array of Samba servers that re-export the same back-end cluster file system has been improving recently, so that interoperability has really grown up a bit lately. So, depending on whatever your throughput demands are, you just add however many servers you need to support that load. I mean, it does become a bottleneck in the same way that, you know, if you're coming from an NFS background, an NFS server is a scaling bottleneck because all traffic has to go through it. You have the same problem with the Samba servers. The nice thing, though, is that you can just add N of them in order to support your load. But you don't take full advantage of being able to talk directly to the back-end Ceph storage servers the way you would with the native protocol. So, continuing on this thread of scalability and performance, let's look at two different use cases. I have an HPC cluster and I want to write from multiple clients to a single file — a massive quantity of data. How does Ceph scale that?
The other one is: I'm on one of these new machines with, you know, 200,000 cores, and I'm going to do file-per-process checkpoints — I'm going to create 200,000 files all at the same time. How does Ceph handle those two different cases? They're actually handled at completely different levels. In the first case, where you have lots of clients writing to the same file, basically the file's byte sequence is striped over objects, and those objects are randomly distributed across your back-end storage nodes. Because all the different cores are writing to different offsets within the same file, they're actually writing to different objects that are scattered across the cluster, and they're all talking directly to those back-end servers in parallel, so you have essentially perfect scaling in that case. And that's one of the more complex consistency issues: POSIX requires that a write is atomic with respect to other writes — so if you have two overlapping writes, one of them happens before the other — and that sort of thing is very difficult to do in a parallel or distributed file system. Normally, what Ceph will do when you have multiple writers to the same file is switch to a synchronous I/O mode, so every write actually goes all the way through to the OSDs and then comes back, and the same thing with reads. That may or may not work well, depending on whether you're doing lots of small I/Os or a few large I/Os. But what we do provide is the ability to relax consistency electively for that single file — similar to the O_LAZY POSIX extensions that were suggested several years back — so that clients, if they know they're not going to step on each other's toes, can relax consistency for that one file, take advantage of the buffering and so forth, and more carefully control how data is written out. And in the other case, where you have hundreds of thousands of cores, or millions of cores, writing to separate files, say in the same directory, that really becomes a metadata problem: you have all those nodes trying to talk to the server to create a file, and then they write out the data to the OSDs, and then they have to update the file size and mtime when they're done. And in that case, that's really where the distributed metadata cluster comes into play. For a single directory, Ceph has the ability to detect that it has become hot, and it'll take that directory namespace, break it into separate fragments, and then distribute those fragments across different metadata servers in the cluster. And so that workload of creating all these files is distributed across all of your available metadata servers, and you're able to scale it that way. So does Ceph actually have dedicated metadata servers and dedicated storage servers? Yes. So say I'm running out of inodes in my file system but still have a lot of space — do I just stack on another metadata server? Well, you would never run out of inodes, because it's a 64-bit inode namespace, so that's probably not going to happen in the foreseeable future — in several centuries. But yes, what you would really do is, when you notice that your metadata performance is slow, you add more metadata servers, and then the internal cluster load balancer will notice that some of the metadata servers are busy and others aren't, and it'll shift responsibility for parts of the file system namespace across different nodes in order to maximally utilize your available server resources given the current workload.
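To make the striping point above concrete, here is a back-of-envelope sketch of how a byte offset in a file maps onto objects, assuming the simplest layout where a file is cut into fixed-size objects (4 MB is a common default; real Ceph layouts are configurable and can stripe more finely).

    # Back-of-envelope sketch: which object does a given file offset land in,
    # assuming a simple layout with fixed-size 4 MB objects (an assumption for
    # illustration; actual file layouts are tunable).
    OBJECT_SIZE = 4 * 1024 * 1024

    def object_for_offset(offset):
        """Return (object index, offset within that object)."""
        return offset // OBJECT_SIZE, offset % OBJECT_SIZE

    # Two ranks writing different regions of the same checkpoint file end up
    # touching different objects, and therefore usually different servers.
    for rank, offset in [(0, 0), (1, 64 * 1024 * 1024)]:
        obj, off = object_for_offset(offset)
        print("rank %d writes at object %d, offset %d" % (rank, obj, off))

Because those objects are spread across different storage servers, writers at different offsets mostly end up talking to different servers in parallel, which is where the "perfect scaling" claim above comes from.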
So give us a list of some of the other features that you've specifically developed for Ceph — things people have asked for, or that originally came out of the Tri-Labs grants, that are different. You mentioned replication, for one, and you mentioned a couple of different kinds of scalability. What else? Let's see — one that I also mentioned was the O_LAZY POSIX I/O extensions that people suggested, so you can selectively relax consistency on a file if you're an HPC workload and you know the nodes aren't going to step on each other's toes. Stepping back and looking at the overall architecture, one key differentiator is the use of the CRUSH placement algorithm. The idea there is that if you're randomly distributing data across the cluster, you can't take a naive approach and make it completely random, because you want to avoid correlated failures taking out multiple replicas of the same data. So CRUSH lets you describe your storage cluster in terms of the physical infrastructure: you have a hierarchy that includes knowledge about the OSDs or disks, the hosts that contain those OSDs, the racks that contain those hosts, and so forth. And so you can have a data placement policy that says something like: I want three replicas of every object, but I want them to be in separate racks, so that if a single power circuit goes out or I lose a network switch, I only lose at most one replica of any given object. Other things — in the metadata server, we have the ability to track recursive metadata, so for every directory we have, for example, a recursive size that's defined as the summation of all the file sizes nested beneath that point in the hierarchy. And so if you do an ls -al, for example, by default that's what's shown for the directory size, and you can see how much data is contained within a part of the directory tree without having to do, say, a du that will run for hours and give you back the same number. We've added a few things to support Hadoop workloads, like the ability, in certain circumstances, to read from replicas if it makes sense given the workload. But the biggest differentiator, in the object storage layer in particular, is that the driving architectural motivation is to push as much intelligence as possible into the storage daemons running on each individual node, in order to make the system scale. So, in contrast to other systems that don't need to grow as large — where, if there's a failure, for example, there's normally some controller node or something that notices the failure and then tells other nodes to read data off one node and write it out onto another, and sort of manages the migration of data and rebalancing and so forth — in Ceph we try to do that on all the individual storage nodes themselves. We give every node in the system full knowledge of the current state of the available hardware and of what the placement algorithm specifies about where data should go, and then all the nodes use peer-to-peer-type algorithms to move data to the right locations with very little central coordination. And that's sort of the key to making the system self-managing, easy to maintain, and scalable anywhere from one node to hundreds or thousands or whatever. So here's one thing that we run into with really large file systems: what does an fsck of a multi-petabyte file system look like?
That is a good question. I think the key is that you don't want to take the system offline in order to do an fsck, and you have to make sure that the fsck is implemented in a way that doesn't build up all these in-memory data structures, because that simply isn't possible when you're talking about petabytes or exabytes of data. So I think what it really needs to be is something that, first, the system is doing on its own all the time anyway: as you use the system and traverse the hierarchy, it should be doing whatever validation checks it can. And then, when it comes time to do a more hardcore consistency check — where we have to make sure that the namespace isn't linked in some insane way, with loops or something like that — you minimize the amount of data that you have to track in memory, so that it's actually effective and efficient. Given the way the Ceph architecture works, we actually separate the consistency check into two distinct pieces. One is making sure that the object storage layer is consistent — that's the piece that everything else rests on top of — and to do that, what we currently do is have each node periodically look at the set of objects it's currently storing, and the metadata associated with them, and then compare that with the replicas that are supposed to exist and make sure the replicas are actually storing the same thing. That's a background scrubbing operation that we do. Doing a check on the metadata layer, on the metadata servers — the file system's POSIX namespace — is more difficult, because you have essentially this huge tree structure, and you have to make sure the graph has no cycles and is actually, you know, a sane tree. And honestly, that's a challenge that we haven't fully — a goal that we haven't fully reached yet. For that fsck we have a number of designs in the works, but we don't have a working implementation yet. So actually, that's a perfect lead-in to: what is the current status of Ceph? What's available, what's the current stable version, how many people are using it, those kinds of things? Well, we were very ambitious. We spent a lot of time implementing all the pieces that you're going to need: you have object storage, you have a block device, you have the metadata server, you have a security system that looks very much like Kerberos so that there's real authentication in the system. So we bit off a lot to chew, and as a result not all those pieces are really ready yet. For the most part you can look at the system as a series of layers. The low-level object storage layer gives you a distributed object storage system that is very close to maturity and to being usable in production — and in fact DreamHost will be launching a product based on it in the very near future. As you go up the stack, things get progressively less stable. If you talk about the RBD layer, which gives you these block devices, that's also a fairly simple thing and it works quite well, both on the client side and on the server side. Once you talk about the metadata server, that's a huge chunk of code and a lot of complexity, because it does so many crazy, exciting things. If you use a single metadata server configuration, it works quite well and it's certainly usable — I wouldn't put production data on it just yet, but if you're looking to try it out, then don't expect it to
crash. And then, once you start talking about clustering the metadata servers, that's when you start having to be careful. So our focus these days is really on improving our QA infrastructure and automated testing, to make sure we don't introduce new bugs as we fix old ones, and on expanding the pool of people who are really testing the whole system. So — this is another perfect lead-in here — is there a community surrounding this, or is there corporate backing? What is the framework you're doing all this work in? There's both. My goal from the outset was to build an open source system that would do all the things that the enterprise systems that cost all this money would do. So from the beginning the system was open source — it's LGPL licensed — and I spent a lot of time working with the Linux kernel community to get a driver upstream, so that you could use this in mainline Linux and you wouldn't be tied to any proprietary or out-of-tree stuff. And so there's very much a community there. There's a very active mailing list and an IRC channel; people from all over the place have been testing it out, from the finance industry to people interested in Hadoop to research groups to service providers. The cloud community most recently has been extremely interested, particularly in RBD, as projects like OpenStack — the open cloud environment — take off and everybody wants to deploy their own private clouds; there's no real credible open source storage solution for that, and so a lot of people are looking at Ceph for it. So we've been very happy with the community that's growing around the project. At the same time, we also have corporate support. My company, DreamHost, has been incubating it as a research project for some time, and we're in the process now of creating an actual independent storage company that's focused on supporting it, much like Red Hat supports Linux, with a similar sort of support model, so that organizations wishing to deploy this can actually pay somebody to support it for them — they can lean on them when they run into problems and ask them for help. With all the work that's being done, are you the main guy, or are you recruiting? There's so much to be done — who does it? I am recruiting. So I am the main guy — I guess you'd call me chief architect or something. Internally here we have a development team of six, seven, I guess eight people. We've had a couple of people start recently, we have a couple of business people who are working on building an organization around it, and then we have contributors from other companies as well — several OEM providers, or I guess they wouldn't be OEMs, enterprise appliance vendors, I guess, that want to put Ceph in their own products are invested, and a lot of people working on products as well. So if somebody was interested in joining, would they throw a resume your way? Is there contact info on your website? So, yes, we're aggressively looking for experienced developers, both people with experience in storage and in the HPC area as well, and also for people working on QA. So people interested in the project should definitely go to the Ceph home page at ceph.com or ceph.newdream.net, or go to dreamhost.com and look at the jobs page — the jobs are also listed there — and we would be very interested in hearing from them. So is anybody representing Ceph or DreamHost going to be at SC? I will be there. I'll be there on Tuesday; there's a panel on open source file systems that Galen
Shipman organized. I actually can't remember who else is on the panel — I think there's somebody from PVFS, from Lustre I suspect, I'm not sure who else — so I'll be participating in that, and I'll be there for a couple of days surrounding it as well. Okay, so then what's coming for the future of Ceph? The short answer is: a lot is coming. We're in the process of putting together a business organization around the project to support it, really turning this from what has been a research project into a real system that people can deploy in real organizations, in real production environments. That means we're hiring, we're looking for business partners, we're working on supporting the community, and we're working on the product roadmap to make sure we're offering people what they need, and that we can do so as quickly as possible — whether it's the developer who wants to get involved, the cloud storage provider who wants to set up their private cloud and run virtual machines, or an HPC environment that wants something more efficient and cheaper to maintain to run their ginormous workloads on. Okay, Sage, well, thanks a lot for your time. You already mentioned Ceph at ceph.com, and there's a mailing list there. So thanks a lot for your time. Appreciate it. Thank you. Have a good one. All right.