Okay, I think I'll start. I'm Dan from CERN. I'm actually super happy to be here because I'm from Vancouver, so it's a hometown thing for me. I'm from the CERN IT Data and Storage Services group. In this talk I'll give a very quick bit of background about CERN and why we have so much data, then I'll talk about the block storage service that we built with OpenStack. I'll share some operations experience and some tips about tuning the thing to make it work, and then hint at the future and how we can scale for our big data problems.

So CERN, as you probably know, is in Geneva, Switzerland, and we have this thing called the Large Hadron Collider, the biggest machine in the world. At four points around this big particle accelerator there are particle detectors, what we call experiments: ATLAS, CMS, LHCb, and ALICE. We do fundamental physics research on things like the early universe, dark matter, and the Higgs boson, which you've probably heard about.

Very quickly, I'll try to explain how it is that we do this work. It's like Physics Data Analysis 101, probably 001. It's basically two steps. First, we simulate what should happen inside these particle detectors. We know the Standard Model of physics, we know the way particles collide and what happens, so we simulate it, and that's how you eventually get the dotted line through the middle of a plot like this one. And you know it to within some significance, which is what these green and yellow bars are: the error on what we expect according to current knowledge. Then we smash particles together, we smash protons, and see what really happens. What you get then is something that's a little bit different, and if it's different, it means there's some new physics there, some new particle or something we didn't understand. That's the basic workflow of experimental particle physics. Getting significance in these two steps requires a lot of data, both in the simulation step, with CPUs churning out petabytes and petabytes of Monte Carlo data, and in the real data from the collisions, which is also at the petabyte scale.

As I said, I'm in the Data and Storage Services group. We do all kinds of things, including basic IT like home directory services, where we use OpenAFS. I've put our different services here, ordered by the number of files or objects that we have. So we use OpenAFS, which is like NFS if you've ever used it, and we have almost three billion files there. Then we have CASTOR and EOS, which are custom, locally written storage systems for data archival and also analysis; those are our really big data file systems. Then we also have things like some NFS filers, some NetApp filers and other appliances as well. We have Ceph, which I'll talk about later, and we have CERNBox, which is actually ownCloud, a file synchronization tool. Our requirement is to grow by 20 to 30 petabytes per year; that's the Monte Carlo and physics data coming in. And we also now have a second data center 22 milliseconds away, so there are a lot of opportunities there.

Now, this is the OpenStack conference, so: we moved all of our IT infrastructure to OpenStack and to virtual machines last year, and we've actually been in production since the summer of 2013. We have a blog about it called OpenStack in Production, on Blogspot.
Actually, the current status is that basically all of the IT core services are on OpenStack and Ceph, and most of the research services are on this platform as well (excuse me, I'm going to move my phone, it's vibrating). The only things that are not virtualized are the storage itself, though I'm asked repeatedly by some managers why I don't put Ceph in virtual machines; some people don't understand how it works. Some databases and our big batch farm are also not virtual; in some cases it doesn't really make sense.

Here are some numbers about our OpenStack installation. We have close to 5,000 hypervisors, 11,000 instances, and 1,800 tenants, and roughly that number of users of the system as well. These are plots for the last month, and you can see it hasn't actually changed that much. For CPUs, we have close to 100,000 CPUs in production; the light blue is the quota assigned, the dark blue is the actual number in use. We have 150 terabytes of RAM and two petabytes of local ephemeral disk on the hypervisors.

So this is really cool, but what about the block storage? That's what you came to hear about, right? In 2013 we evaluated Ceph and then deployed a three petabyte cluster. We've talked about that at previous events and the slides and presentations are out there, so I'll try not to repeat too much. We picked Ceph because it had a good design on paper: it looked correct, and it looked like the best option for building block storage for OpenStack in terms of reliability, scalability, and future growth. We called Ceph our organic storage platform, because you get to add and remove servers with basically no downtime, forever. That was our idea of why it would be good. We did a 150 terabyte test, it was okay, and we deployed it.

We deploy it with Puppet. We initially used the Puppet manifests from eNovance, but today we've changed them quite a bit, I would say; we now use the ceph-disk deployment tool that's written upstream, with our own custom Puppet modules. For our hardware, the main goal was to be homogeneous with our existing scale-out storage systems so that we could easily move hardware back and forth between the different storage services that we have. The initial cluster was 48 servers with 24 OSDs each, just attached with a simple HBA, 64 gigs of RAM and dual Xeons. The mons were five of what we call batch nodes or CPU servers, just distributed randomly around our data center. This is still used in production, though we did add some SSDs, which I'll get to a bit later.

This is actually our ceph osd tree right now, as of yesterday. I wish there were a tool that turned a ceph osd tree into a nice diagram, because it's kind of ugly, but I can walk you through it quickly. We have basically three roots. There's the default root, which is where all the main data goes. Within it we have two rooms: our main room, which is UPS battery backed, and a diesel-backed room where we put what we call our critical projects. This second room is fairly new, and now that we've learned a bit more about CRUSH we also replicate across IP services, which basically means across routers as well, to add some additional reliability. We also have a separate root for our couple of object storage use cases, which we call the OS root. And we have a draining root, so we can move servers into it just to be drained of whatever data they might have.
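To make that room layout a bit more concrete, here is a minimal sketch of how a room-level CRUSH hierarchy and rule can be built with the standard CLI. The bucket names, rule name, pool name and ruleset id are made up for illustration, and our real map is managed by Puppet; also note that the asymmetric two-plus-one critical-power placement needs a hand-written rule, which create-simple can't express.

```python
#!/usr/bin/env python
# Illustrative sketch: room buckets under the default root, plus a simple
# rule that spreads replicas across rooms. Names and ids are hypothetical.
import subprocess

def ceph(*args):
    """Run a ceph CLI command and raise if it fails."""
    subprocess.check_call(['ceph'] + list(args))

# Two rooms under the default root: battery-backed and diesel-backed.
for room in ['room-battery', 'room-diesel']:
    ceph('osd', 'crush', 'add-bucket', room, 'room')
    ceph('osd', 'crush', 'move', room, 'root=default')

# Move a host into its room (repeat for every OSD server).
ceph('osd', 'crush', 'move', 'cephserver001', 'room=room-battery')

# A simple rule that chooses replicas across rooms instead of hosts.
ceph('osd', 'crush', 'rule', 'create-simple', 'replicate-rooms',
     'default', 'room')

# Point a pool at the new rule; look up the ruleset id with
# `ceph osd crush rule dump` (Firefly/Hammer still call it crush_ruleset).
ceph('osd', 'pool', 'set', 'volumes', 'crush_ruleset', '1')
```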
On the OpenStack side, on the Cinder side, we currently have four different volume types. We have the standard volume type, called standard because it's basically supposed to behave like a spinning disk. We use QoS specs to throttle the volumes, so standard gives you 80 plus 80 megabytes per second in bandwidth and 100 plus 100 IOPS, and we do three replicas across our battery-backed room. Then a few users complained about performance, so we added something we call IO1. It's the same replication as the standard type, but we give more IOPS and more bandwidth, 500 plus 500. As I said, we added the diesel room, so we have the critical power volume type, CP1. For that one, because of the symmetry of the setup, we do two replicas on diesel power and a third replica in the main battery-backed room; this way, if the battery room goes down, you still have two replicas up. And then unfortunately we also have something we call CP2, which is a NetApp volume service. That's because our Windows VMs are on Hyper-V, and there's unfortunately no Ceph driver for Hyper-V yet.

A couple of comments about volume types. The usual workflow is that users get a standard volume, the performance is not adequate, and they ask for it to be faster. For that you need something in Cinder called retype, and I think that just came in Kilo; we contributed the patch for it. The second thing is about those IOPS numbers. I said it's supposed to be like a spinning disk, but it's actually not quite, because on a spinning disk you get probably thousands of sequential IOPS, while the QEMU throttling also limits you to 100 sequential IOPS. The performance could be better if we could get the burst IOPS feature, which I don't think is possible via Cinder yet.

Here's our actual usage today. The cluster is three petabytes, but this is what we've actually allocated out to users: around 500 terabytes of Cinder volumes. The vast majority of those are the standard volume type. For images we have around 1,100 images, plus some snapshots on top, but you can see images are only 18 terabytes, so they're small; normally they're four gigabyte images. One challenge we have is that, for a research lab, everything is effectively free from the user's point of view, so we have no objective way to decide if a user gets more IOPS. Every user wants more IOPS, but what we do now is make them prove, and I'll get into this a bit later, that they have a performance problem, and then we'll give them more IOPS. As a result, you can see only seven volumes have the higher IOPS right now.

Here's the actual df of our Ceph cluster. It's 3.6 petabytes, actually, and 520 terabytes are used. Now, if you think about those numbers, I said that 500 terabytes were allocated out in quota, and we have roughly the same number actually consumed. This is something I like to mention to my management all the time, because it means we've allocated out 500 terabytes of triply redundant volume data but it's only consuming roughly 500 terabytes of space on the disks. That's thin provisioning at work. Normally, in the past, we would have given each of these users their own two or three terabyte disk, and that would have cost a lot more. Of course, we do have a lot of empty space in the cluster as well, so maybe it balances out.
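For illustration, here's a minimal sketch of how a throttled volume type like the standard one described above can be wired up in Cinder. The QoS keys are the standard front-end (QEMU-enforced) throttling specs, but the names, IDs and exact numbers here are assumptions rather than our production configuration.

```python
#!/usr/bin/env python
# Sketch of a Cinder QoS spec plus volume type, roughly matching the
# "standard" type described above: 80+80 MB/s and 100+100 IOPS.
import subprocess

def cinder(*args):
    return subprocess.check_output(['cinder'] + list(args))

# consumer=front-end means QEMU enforces the limits on the hypervisor.
cinder('qos-create', 'standard-qos',
       'consumer=front-end',
       'read_bytes_sec=%d' % (80 * 1024 * 1024),
       'write_bytes_sec=%d' % (80 * 1024 * 1024),
       'read_iops_sec=100',
       'write_iops_sec=100')

# Create the volume type and associate the QoS spec with it.
cinder('type-create', 'standard')
# qos-associate takes IDs, not names -- look them up first with
# `cinder qos-list` and `cinder type-list`.
cinder('qos-associate', '<qos-spec-id>', '<volume-type-id>')
```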
Now you can see the used amount for each of the different pools. cinder-critical is that CP volume type I mentioned, and you can see we do have a small amount of S3 buckets there as well; it's not big in size, but we have quite a few objects, I would say, a lot of sub-1K objects.

We monitor the IOPS and latency of the cluster very closely. We have probes running that feed back into an Elasticsearch instance: we monitor four kilobyte object reads and four kilobyte object writes with rados bench, and we also monitor the IOPS reported by ceph -w. At the moment we have roughly three millisecond reads, six to eight millisecond writes, and around 7,000 IOPS in the cluster. The write latency was not always that acceptable, I would say.

So, small writes. Why do I monitor small writes? Maybe it's obvious, but we also have evidence that it's important. On one particular day last year we bumped up our logging and did some analysis of the OSD FileStore logs. We found there were close to 500 million I/Os that day, around 300 million writes and 170 million reads, and 75% of the writes were 4K writes. This is probably because there are a lot of atime updates or something like that, something out of our control, I guess. But anyway, the tiny writes really dominate the I/O going on in the cluster. The largest read size was 512K, which is probably block device readahead, and the next most common read size was again 4K. With our 24-disk, no-SSD servers, that four kilobyte small write latency was 30 to 40 milliseconds with three replicas.

So we looked into SSD journals. We actually benchmarked this on two identical servers, just to see what would happen if we swapped out the first four OSDs of our 24-bay chassis and replaced them with SSDs. We found that the IOPS you could get out of one of these servers increased by five to ten times, and it decreased the latency down to five milliseconds. Then we asked which SSDs to buy; there are discussions about this on the mailing list all the time. Basically you need high endurance, the highest endurance SSDs you can get, and also stable write performance. But I want to point out that you need stable random write performance. Because it's a journal, you might be tempted to think it's sequential writes to the journal, but in fact you normally partition the journal device and put many journals on one device, so in the end every write ends up being a random write. This is shown in the blktrace plot here: the top plot is the write offset within the device over time, and you see around 500 seeks per second on this block device. This was an SSD used as a journal. So we use the 200 gigabyte Intel DC S3700, the most commonly recommended SSD, with five journals per SSD.

The main reason we did this in the end was not to improve performance; it was because we would have run out of IOPS capacity on this three petabyte cluster well before we ran out of actual volume capacity. So yes, we decreased the raw capacity of the cluster by 20% by pulling out spinning disks and putting in SSDs, but we wouldn't have been able to use that 20% anyway; we probably wouldn't have been able to use even half of the three petabytes, because it would have just become too slow.
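As a rough illustration of the latency probes mentioned above, here's a minimal sketch of a 4 KB rados bench write probe. Our real probes are more involved and ship their numbers to Elasticsearch; the pool name and run length here are just examples.

```python
#!/usr/bin/env python
# Minimal 4 KB write-latency probe: run a short rados bench and scrape
# the average latency out of its output.
import re
import subprocess

def write_latency_ms(pool='test', seconds=10):
    out = subprocess.check_output(
        ['rados', 'bench', '-p', pool, str(seconds), 'write',
         '-b', '4096', '-t', '1'],
        universal_newlines=True)
    # rados bench prints a line like "Average Latency:  0.0061"
    match = re.search(r'Average Latency.*?([\d.]+)', out)
    return float(match.group(1)) * 1000.0

if __name__ == '__main__':
    print('4k write latency: %.1f ms' % write_latency_ms())
```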
Of course, for our really big data use cases, for object storage, I still say that we don't need SSDs; we simply won't be able to afford SSD journals at that scale, and throughput matters more than latency there anyway. Or maybe some future developments will make this more feasible for really big data and huge clusters. There's more info in a small white paper about this SSD test, if you're interested.

So this is what it looks like when you put SSD journals into production. This is one of those cool plots that you're happy about. The latency was 40 milliseconds when we started; then we started slowly pulling disks, putting SSDs in, and moving more and more journals onto them. You can see some spikes while we did a lot of backfilling, but in the end you go from a very unstable 30 to 40 milliseconds of latency down to five milliseconds flat. This was very cool.

Now, how does this look to users? When a user is suffering from a Ceph performance problem, what they see is iowait in their VM, and you can see this very clearly: on the left plot, the light green at the top is the iowait. It means the VM needs a better attached volume. We don't actively look for this at the moment; we wait for users to complain, and when they do, we just retype them to an IO1 volume. If it helps, they keep it; if it doesn't, we flip them back. On the right is the load average on a machine when it switched from 100 plus 100 to 500 plus 500 IOPS, and you can see it's like night and day.

One thing I call the tip of the week: if you do decide to throttle the IOPS on your volumes, there are a couple of catches. One is max_sectors_kb, which is a property of a block device in Linux, and normally by default it's 512 kilobytes. So if you throttle a block device to 100 IOPS and the Linux kernel is splitting your I/Os into 512 kilobyte pieces, then you're also artificially limiting that block device to 50 megabytes per second. We observed this, and it took us a while to find out it was because of max_sectors_kb. I think newer kernels have changed the way max_sectors_kb is detected, but if you run RHEL 6, or I think even RHEL 7, it's still like this.

Here are a few more Ceph tuning tips. We found very early on that the LevelDB used in the ceph-mon is quite sensitive to the synchronous write latency of wherever that LevelDB is stored, so I preach that you need to put your mon's LevelDB on an SSD. One SSD is okay; I don't even bother RAIDing them.

Also, scrubbing is one of my favorite topics. It's very I/O intensive, and one of the unfortunate things about the way scrubbing works in Ceph is that if you create a pool on a Tuesday, then every week, on the weekly birthday of that pool, it will try to scrub those PGs. The OSDs limit the number of concurrent PG scrubs, so over time it does randomize out a bit. But at the beginning, if you're doing a test where you create a pool and load a bunch of objects into it, wait a week and you'll see your OSDs get into trouble. In the plots here, I have the number of PGs scrubbing in the cluster on top and the write latency below, and you can definitely see there's a correlation between the number of PGs scrubbing and the latency of the cluster. And this is after all the tuning we've done; it used to be much worse, with hundreds of PGs scrubbing at the same time.
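If you want to see this birthday clustering on your own cluster, one rough way is to histogram the deep-scrub timestamps out of pg dump. A minimal sketch, assuming the JSON field names used by Firefly/Hammer-era pg dump; check them against your version.

```python
#!/usr/bin/env python
# Histogram the last deep-scrub timestamps of all PGs by weekday, to see
# how much the scrub schedule is clustered. Field names may vary by release.
import json
import subprocess
from collections import Counter
from datetime import datetime

dump = json.loads(subprocess.check_output(
    ['ceph', 'pg', 'dump', '--format=json'], universal_newlines=True))

# Older releases put pg_stats at the top level; newer ones nest it.
stats = dump.get('pg_stats') or dump['pg_map']['pg_stats']

weekdays = Counter()
for pg in stats:
    stamp_str = pg['last_deep_scrub_stamp']
    if stamp_str.startswith('0.'):     # never deep-scrubbed
        continue
    # timestamps look like "2015-05-18 03:20:01.123456"
    stamp = datetime.strptime(stamp_str.split('.')[0], '%Y-%m-%d %H:%M:%S')
    weekdays[stamp.strftime('%A')] += 1

for day, count in weekdays.most_common():
    print('%-10s %d' % (day, count))
```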
To solve that, we wrote some scripts to smear the scrubs across the week; you can preemptively scrub a PG. Also, with the help of the developers, we now have options to use the CFQ I/O scheduler on the OSDs and to schedule the disk thread, which is the scrubbing thread, at idle priority. And this really helps.

There's a lot of Linux tuning that we've found is needed. On our hardware in particular, because it's dual Xeon, you really need to disable NUMA zone reclaim. I think it's a disaster, and you can find a lot of blogs, especially MongoDB-related blogs, discussing this feature as actually being a bug. I think very recent kernels also disable it by default now. Ceph also uses many, many threads and many, many sockets, so just start with these: you have to increase pid_max to the maximum value, which is above four million, and if you have a big cluster your clients need to allow a lot of open files, so you need to increase the open files ulimit, which also covers sockets. And don't get tricked by updatedb: it will use all the IOPS on your disks if you don't add /var/lib/ceph to the PRUNEPATHS in /etc/updatedb.conf.

The latest interesting topic is tcmalloc. This is a pretty detailed thing, but we find that on our OSDs, when we run perf top, the top function is tcmalloc's ReleaseToCentralCache. tcmalloc is Google's malloc implementation, which is quite optimized and keeps a per-thread cache of available memory, but the default size used with Ceph seems to be too small. Our evidence for this is that whenever we restart our test cluster, which we can reboot all at once, and there aren't many IOPS going on, the latency sits at around three milliseconds for a couple of hours instead of the usual six milliseconds. Then, as we run a couple of rados bench tests, it gradually comes back up to six milliseconds. So we blame this on tcmalloc. Increasing the thread cache to 128 megabytes seems to help, but there's probably a better way to solve this.

Here are some different issues and incidents we've had. We used to really suffer from the growing LevelDB of the ceph-mon; it used to grow by more than 10 gigabytes a day, and we used to manually compact it every day or two with a cron job. But since Firefly it seems fixed, yay. This is a plot of our five mons and their growing LevelDBs, and then we upgrade and it goes flat. Excellent.

Disk and host failures: disks fail something like weekly to monthly, and it's completely transparent, totally unnoticed by us or the users. The SSD journals have been very reliable so far; we haven't had any failures, but I'm knocking on wood on that one, because when an SSD fails, it will take down five OSDs and trigger a lot more backfilling than a single disk failing. We have twice lost a whole server. In one case they were replacing a memory module and it just didn't reboot after they replaced the module; in the other case there was a network problem.
But the point is that in both cases the server was recoverable within a few hours, so we shouldn't have done all that backfilling. So now we actually limit a particular config option, mon osd down out subtree limit. By default it's equal to rack, but we've now set it to host. What this means is that if a whole host goes down at once, no backfilling will be triggered automatically; by default, Ceph only holds off on automatic backfilling when a whole rack goes down at once.

Some other incidents; these are really just proof that Ceph works. We had a power cut on the 16th of October last year. The OSDs stayed up, but three out of our five mons, unfortunately just because of the way the power cut happened, went down, which meant the quorum was lost. Generally the VM clients, the OpenStack clients, were down as well; their network was down or the machines themselves were down. This lasted 18 minutes. On the client machines that stayed up, we just observed that the block device was frozen for those 18 minutes. In the kernel, in /var/log/messages or wherever, you see that a task like jbd2 is blocked for more than 120 seconds, again and again. But the nice thing is that when the cluster came back, those I/Os just continued as normal, like nothing had happened. I learned later that this relies on virtio-blk; virtio-scsi would have timed things out. I'm not an expert on this particular point, but the configuration we have, which seems to be the default, is pretty good. We didn't see any corruption or lose a single bit in this power outage.

We had a router failure on the 13th of March this year, which showed up as random packet loss. It was pretty frustrating, pretty mysterious what was going on. And in this outage, which lasted something like four hours, some OSDs decided to commit suicide. This was the scariest message we've ever seen in our cluster: the OSDs are down, PGs are stale, and you look in the OSD log and it just says heartbeat timeout reached, committed suicide. It must be a kind of safety mechanism. So we had to manually restart those OSDs. In general, throughout this incident there was random network badness, random packet loss, and even after the network came back we were getting a lot of slow requests until we restarted things. Some of those slow requests lasted 600 seconds. So we think there's actually a bug; it might be fixed in Hammer, we haven't upgraded yet. The OSDs seem to keep talking through sockets which have gone dead, and they don't notice that the socket is no longer alive, so they wait for some kind of 10 minute TCP timeout or something, I'm not sure yet. We're still working on this bug. Again, even though there was downtime in this incident, there was not a single data corruption reported, not a single scrub inconsistency. So no bit flips anywhere.

This last issue, I think it's the last issue, was kind of an interesting one. Just a couple of weeks ago, we had some VMs which decided not to boot. Now, there were two things that happened. One was that, to stop this power cut problem from happening again, we were moving three of our Ceph mons onto the diesel-backed generators, so we moved some mons into the diesel room.
At the same time, the mons we moved in used to be test cluster mons, and the previous production mons became the test cluster's mons; we basically took the mons of two Ceph clusters and swapped them. In Ceph itself, that's okay. The procedure we followed was fine: there was no downtime, no outage, and ceph health was OK throughout the whole intervention. But a few days after this, we started getting reports that VMs weren't booting, and when people looked at the log files, they saw an authentication error. It took us a while to correlate this incident with that intervention, but eventually we realized what was happening. Long story short, the Cinder driver, and eventually the libvirt XML for a VM, hard-codes the list of Ceph mon IP addresses at the time you attach a volume to the VM. I can sort of see why it's done that way, but I think it's a kind of design flaw, and this behavior should change; we filed a ticket in the Cinder Launchpad. The workaround we had to enact was to stop those mons in our test cluster from listening on the Ceph mon port, so that when the production VMs start up, they go through the five hard-coded IP addresses, skipping the ones where nothing is listening, until they reach one that still has a production mon on it.

Now for some more general topics, like backups and disaster recovery. Despite all of this greatness about CRUSH and replication and no single bit flips, there could still be some software bug or something that takes down all of the data, so you need backups. Backing up 500 terabytes of data, or eventually three petabytes, is pretty hard, and we actually don't do it yet. We haven't tested extensively, and we're not sure that the rbd snap and export workflow works if you don't also coordinate inside the VM to flush the block device and freeze it with xfs_freeze and things like that. So we're still working on this; others probably have better experience. What we do instead is tell our users to back up with TSM. We have a big TSM service, so they run a TSM client inside the VM, and I hope our users know this. They probably don't. But we're looking forward to the RBD mirroring feature that's coming.

Then there's a long tail of other things, paper cuts or little annoyances. In Cinder it's kind of annoying that a volume is associated with a specific controller, a specific Cinder host. It's also kind of annoying when we delete volumes: we have this problem pretty regularly that there's only one thread on a Cinder controller deleting volumes, so if one user decides to delete their 10 terabyte volume, that blocks up deletions for 30 minutes or an hour, something like that. I know deletion of block devices is faster now in Hammer, but we still run Firefly.

We've put various scripts that we use for operations on this GitHub page, things like the hourly Ceph health cron job that we run. We've ported the reweight-by-utilization function to Python so that we can tweak things better and get smoother data distributions. We have functions to gently drain disks and to gently split PGs as well. And we have a script that finds orphaned OSD CephX keys, because sometimes when you're swapping disks you forget to delete a CephX key, and then when you reuse that OSD ID, badness happens.
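The idea behind that orphaned-key check is simple enough to sketch. This is a minimal illustration of the approach, not the actual script from that GitHub page.

```python
#!/usr/bin/env python
# Compare the osd.* entities that cephx knows about against the OSD ids
# that actually exist, and report the leftovers.
import subprocess

def ceph_out(*args):
    return subprocess.check_output(['ceph'] + list(args),
                                   universal_newlines=True)

# OSD ids that exist in the cluster, e.g. {0, 1, 2, ...}
existing = set(int(i) for i in ceph_out('osd', 'ls').split())

# osd.* entities known to cephx.
keyed = set()
for line in ceph_out('auth', 'list').splitlines():
    line = line.strip()
    if line.startswith('osd.'):
        keyed.add(int(line.split('.')[1]))

for osd_id in sorted(keyed - existing):
    print('orphaned cephx key: osd.%d' % osd_id)
    # to clean up:  ceph auth del osd.<id>
```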
Okay, coming to the future now. What kind of use cases do we have? This whole model really enables some cool applications. First of all, you can just run anything on it, and we've found that we can replace the previously expensive custom solutions we had, expensive boxes with big RAID arrays, with this very simple, trivial model. If performance sucks, we give them 500 IOPS and they're happy. We have something like 1,000 VMs with attached Ceph volumes, so we don't even know what users are doing, but they're doing everything, right? Except running Ceph on there.

We're also now starting to move some of our storage services onto VMs with attached Ceph block devices. We're virtualizing our AFS service: right now we also have three petabytes of AFS, and we're moving that onto OpenStack plus Ceph. We're also pulling users off expensive filer appliances onto very simple, thin, not highly available NFS servers. What we do is run Linux with ZFS on Linux, and we use ZFS send/receive to a second machine to make sure we have a kind of backup for the NFS clients. There are more details in a paper that you can probably find on Google.

But we have some use cases beyond OpenStack, beyond this VM model, and that's the physics data I mentioned at the very beginning. CASTOR is our big data archiving system, and the CASTOR developers added a feature to Ceph called the RADOS striper, so you can stripe files across many objects without a file system. We use that in CASTOR now as a caching layer between our tape archive and our disk. There's also ongoing development in the EOS project, our big 100 petabyte analysis storage facility, to put data, or maybe initially just metadata, into RADOS. We also run the RADOS gateway for some corner cases where S3 is useful. But if we do eventually put physics data on Ceph, this is going to increase our storage requirements to the tens of petabytes scale.

So, can Ceph scale to tens of petabytes per cluster? We don't really have the liberty of doing what Flickr has done. Yahoo and Flickr had a press release or something a couple of weeks ago where they described having many three petabyte Ceph clusters and hashing across them, so they can scale that way. That scales for sure, but we can't really do that. So we actually decided to try running a 30 petabyte Ceph cluster (and yes, I've got a typo on the slide, damn). This was just the yearly delivery of hardware for our EOS and CASTOR storage systems, and I got the chance to borrow it for a couple of weeks or a month, install Ceph on it, and see what happens.

In general it worked; we got it working. There was some tuning required, and the performance was pretty good. I have two plots showing throughput with different replication factors and different types of erasure coding, and also replication and erasure coding with different kinds of backfilling ongoing. We could do up to 52 gigabytes per second with a single replica, and it went down proportionally as we increased the number of replicas. I think we were actually just limited by our network here, because with 150 servers of 200 terabytes each, we should have been able to do something like 1.1 gigabytes per second per server. We also found in this test, and this was an interesting result for some of our developers who have been involved in erasure coding, that the ISA erasure coding backend gave basically the same performance as jerasure. We didn't check the CPU usage, though; maybe the ISA library had lower CPU usage, I'm not sure.
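For reference, here's a minimal sketch of how an erasure-coded pool with a chosen backend is created, the kind of thing being compared here. The profile name, k/m values and PG count are illustrative, not what was deployed in the 30 petabyte test.

```python
#!/usr/bin/env python
# Create an erasure-code profile using the Intel ISA-L plugin and a pool
# that uses it; drop plugin=isa to get the default jerasure backend.
import subprocess

def ceph(*args):
    subprocess.check_call(['ceph'] + list(args))

# Profile: k data chunks, m coding chunks, ISA-L implementation.
ceph('osd', 'erasure-code-profile', 'set', 'ec-isa-4-2',
     'k=4', 'm=2', 'plugin=isa')

# Create an erasure-coded pool with that profile.
ceph('osd', 'pool', 'create', 'ecpool', '4096', '4096',
     'erasure', 'ec-isa-4-2')
```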
In the different failure scenarios, blue is with no failure, red is backfilling with one OSD down, and orange is backfilling with one whole host down. There's not a big performance hit from backfilling when you have a cluster this big. We were happy to see that, because if you have a huge cluster you'll basically always have backfilling ongoing.

However, I don't think we can scale much further than this with what we have today, at least with Firefly and Hammer. We found that what limited us was something in Ceph called the OSD map. The OSD map contains the structure of the cluster, the list of OSDs and the CRUSH map and some other things. With our cluster we had 7,200 OSDs, and this OSD map was four megabytes in size, which sounds pretty small. By the way, the OSD map size is a function of the maximum OSD ID that you have, so if you do a test like this and then erase all of your OSDs, but you still have an ID of 7,201, then you'll still have a huge OSD map; it's basically an array, not a linked list or a map. Okay, so it's only four megabytes, but each OSD process wants to cache 500 of these. There is some deduplication, but we found it's not that effective. So we found that when we just naively deployed all of this hardware with our Puppet manifests, each OSD was consuming three to four gigabytes of memory. Admittedly, the hardware we had did not in any way follow the recommended guidelines: it's 48 OSDs per server with 64 gigs of RAM. But this is the reality of what we can deploy and what we actually operate with today. And by the way, if you have these 500 previous OSD maps cached and replicated 48 times in your RAM, that's around 200 gigs of memory in the end that you've spent on this; I think that's pretty wasteful. We found a workaround to get it working, basically just caching fewer maps; Ceph is so highly configurable that you can configure pretty much everything. But hopefully there's some potential to make this a bit smarter. Maybe one idea would be to not cache the maps at all and just use the LevelDB on the OSD for this, because LevelDB can take care of caching by itself.

That was the big issue that I think limits scalability, but we also found some other issues in this test, for example related to deleting a pool. When you have a cluster this big, you need something like 200,000 placement groups to get an even data distribution. Creating those pools was actually fine, it worked perfectly well, but deleting those pools froze the ceph-mon leader for 10 minutes. So there's probably some queue there that's getting filled with too many operations. Then there's the PG stats messaging: if you have that many OSDs, they're all sending their updates to the mon with statistics about how many I/Os they're doing, and I just don't have a good feeling for how many mons you need. Should you have five mons with a cluster like this, or 51? I have no idea. There's just a lot of messaging going around as you increase the size of the cluster. Also, ceph-disk had some problems, though that wasn't really related to the size of the cluster, but more to the size of these machines we had with 48 disks.
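A sketch of the "cache fewer maps" workaround: the main knob is osd map cache size, and the value here is illustrative rather than what we actually settled on.

```python
#!/usr/bin/env python
# Shrink the per-OSD historical map cache on all running OSDs. The value
# 20 is just an example; the change should also go into ceph.conf so it
# survives restarts, and it may need an OSD restart to fully take effect.
import subprocess

subprocess.check_call(['ceph', 'tell', 'osd.*', 'injectargs',
                       '--osd_map_cache_size 20'])

# ...and persist it, e.g. in ceph.conf:
#   [osd]
#   osd map cache size = 20
```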
We also have a white paper there with much more detail, a lot more performance benchmarking and details about the issues.

So that's the end of the talk. We love data at CERN; it's our business in the end, with physics as a side project. We add 20 to 30 petabytes per year. But for OpenStack Cinder and Glance, we run Ceph and it's been really successful. I'd say that for clusters of less than 10 petabytes, it just works; you don't have to do much. There are some acknowledgements to my colleagues there, and I'll be happy to answer questions if you have some. Thanks.

Okay, apparently there's not much time. (laughter) Sorry. I'll stand outside the room if anybody has questions.