Okay, let's get started then. So first of all, thank you for coming over here. It's the last session of the day, so let's try to keep it interesting. So let's talk a little bit about Ceph. Since you're all here, well, probably some of you have tried in the past to deploy Ceph in a high-performance cluster, and with a bit of luck you might have succeeded. But most of you either have tried to deploy a high-performance Ceph cluster and have failed, or are only thinking about deploying a Ceph cluster, and you probably don't know it yet, but you might run into some problems with that. So during this session, what I will try to do is go through some of the most typical pitfalls people run into when deploying their first or second Ceph cluster. Specifically, we'll focus on the hardware for the OSD nodes. What OSD nodes are we'll cover in a little bit, but overall we will focus on hardware: networking, CPU, memory, and obviously storage for the OSD nodes for Ceph. My name is Piotr, I'm here with Bright Computing, where I look after integrating our cluster management software with various cloud platforms, including OpenStack, and in my downtime I also help out with managing our own Ceph cluster.

So, Ceph. As you probably have heard, nobody really wants to have a Ceph cluster. People want to have a storage solution which can be consumed by their application, and we are no different. So typically, when you combine Ceph with OpenStack, you would want to deploy it at least for Glance, probably also for Cinder and Nova, to provide a full storage solution for all of the OpenStack services. And for object storage you might go with the RADOS Gateway, or you might go with Swift. But either way, during this session we will focus on Ceph in the context of OpenStack. So, when deploying a small-scale OpenStack cloud, many people simply go with the reference drivers for Nova, Cinder, and Glance, so that's typically either NFS or just local storage. Not that it's a good practice, but it's simply a simple thing to configure just to get started. However, for larger-scale deployments and for any production deployment, you would typically want to have distributed and redundant network-based storage. With that, you typically have two options. Either you go with a proprietary storage appliance, or several of those, just hook them up, plug them in, power them on, and consume the storage, or you try to build such a distributed and resilient storage yourself. So, if you have a look at the user survey conducted among the OpenStack community, it's clear that when it comes to providing storage for Cinder, and also for other services, Ceph is by far the most heavily used driver. So, why is that? I mean, is it because it's free? Is it because it's simple? Is it because it's flexible? Is it all of those things? In order to try to answer this question, on the next slide we will have a look at a very brief case study following a typical Ceph admin throughout their day of work. So, let's have a look at that. Just as a reference, the color orange will refer to Ceph itself. Okay? Yeah. Yeah, so let's have a look one more time. So, yeah, I guess you could say that Ceph is a tough nut to crack. In other words, it's not so simple. So, why is that? Well, Ceph is complex. I'm not saying it's a bad solution.
It's actually pretty good, and we've been using Ceph for a few years now. We have two, maybe three, depending on how you count, Ceph clusters ourselves, and we're very happy with it. But the fact is that it's not easy to configure, it's not easy to deploy, and most of all, it's not easy to select the best type of hardware for your OSDs when you're just starting out with your Ceph cluster. So, the typical problem you will face is you'll have to answer how many OSDs, or how many OSD daemons, or in other words, how many disks you would want to have in a single Ceph OSD node: be it one, be it four, be it maybe more than that. With that, you typically talk about the concept of thin OSD nodes, which typically have fewer than 10 disks in them, or fat OSD nodes, which typically have more than 20 or so storage disks inside them. Further on, we'll try to compare those and see where the sweet spot is. So, yeah, in other words, that's what we will cover today: how many OSD daemons per node should you have.

So, I should probably point out that, no, that's actually our agenda, so maybe let's have a look at that first. I'll give you a little bit of background, just to give you an idea where we come from in terms of how we consume Ceph and why it's important for us to not spend too much time managing Ceph. We simply want to have a Ceph deployment which works properly for us, so that we can consume the storage, but we don't want to have a dedicated team of people managing it throughout the day. We simply want it to run and behave reasonably fast. So, we'll go through a little bit of background for that. After that, we will have a brief introduction to Ceph for those of you who might be new to the topic, and after that we will jump to the main part of the session, where we'll discuss all those individual pieces of hardware which you have to consider when creating your own Ceph cluster. After that, we will compare the fat and the thin OSD nodes. Hopefully, there are also some conclusions, and at the end, hopefully, we'll have some time for questions.

So, a little bit of background. As I said, I'm here with Bright Computing. We are basically a software product company. We have a single product which allows administrators to easily turn a pile of hardware into a fully functional cluster. Now, this cluster can be any type of cluster. It can be an HPC cluster, an OpenStack private cloud, a Ceph cluster, a Kubernetes cluster, Hadoop, Spark, whatever you want. Well, that's all we do, and we don't really want to spend time managing our Ceph deployment. Now, that's a typical cluster, for those of you who might not be familiar with the concept. We have a head node in the middle, which stores the configuration of the entire deployment, and then we have slave nodes, or compute nodes, on the right-hand side. In the case of this diagram, they might be running some OpenStack services on them. But what's important here is that, unlike some other vendors, we try to provide our customers with more than one reference architecture. Ideally, we want to provide our customers with full flexibility with regard to how they want to structure their cluster. Maybe they want to combine Ceph and OpenStack and Kubernetes and Hadoop on the same cluster; we want to make that easy. What that means in practice is that our developers, basically our entire engineering staff, have a lot of different scenarios to consider when developing new features, when testing them, and so on and so forth.
So, with that, we need to create a large number of clusters internally just to be able to develop our product. And that's actually where OpenStack and Ceph come in. We have our own OpenStack private cloud which is based on Ceph. We also have something called cluster-on-demand, which is a small set of scripts built on top of OpenStack that effectively allows our engineers, and also some of our customers, to very easily deploy virtualized versions of our products, or virtualized clusters, within our OpenStack cloud. And what's cool about it is that it's possible to do in under two minutes. Developers don't really want to wait two hours to deploy a cluster only to test a feature or test a bug fix, right? They want to be able to provision an entire cluster very rapidly. But I won't be talking about cluster-on-demand over here, so if you want to know more about it, just approach me after the talk and we can discuss it. But basically, that's a bit of background in terms of why we don't want to care about Ceph all that much.

Just a quick diagram of how cluster-on-demand is structured. On the bottom part, what you see is the physical layer of the cloud; that's our physical hardware. In the top right corner you can see individual clusters provisioned by our engineers. One of the clusters over here is running HPC and Hadoop, another is running some other resources, and then yet another is running a virtualized version of OpenStack. All of those are pretty easy to deploy thanks to Ceph and copy-on-write. So, yeah, that's our private cloud. We call it Crusty the Cloud. It's not too big: about 17 hypervisor nodes and, give or take, 10 Ceph OSD nodes. We are planning to extend it shortly by about 10 OSD nodes. But it's been working out pretty well for us. At any given time, we have about 400 VMs running there, which boils down to approximately 100 pretty small clusters, which is enough for development work. That's our management interface. Incidentally, that's something which both our cluster administrators and the end users who provision clusters see, since we use the same piece of software to provision the physical layer and to manage the physical cluster as well as the virtualized clusters. That's the management interface which is available for managing and monitoring both of those layers. I won't be going into too much detail on this one; I only posted it over here in case somebody's interested in having a look at how, approximately, our private cloud is structured in terms of networking and the different control and data planes. The slides are online; there will be a link at the very end, on the last slide. So you do not have to take photos right now; you can just download the slides later on and have a look if you're interested.

Okay, so in a nutshell, we create many VMs, we make heavy use of copy-on-write to create pre-installed head nodes and customize them towards users' requirements, and the bottom line at the very end is that we don't really want to care about managing Ceph itself. Okay, so before we start with a brief introduction to Ceph, let's just do a quick show of hands. How many of you guys and girls over here are running your own Ceph cluster? Okay, good. How many of you are not running your own Ceph cluster but are thinking about running one? Okay, great.
Okay, so I guess I can skip that part, since the majority of you, it seems, are pretty familiar with Ceph. So, obviously, a software-defined storage solution; object, block, file storage, yada, yada, yada; popular with OpenStack. Probably most of you have seen this RADOS slide, so I might as well skip it as well. RADOS: it's reliable, we have multiple replicas of the data, so every piece of data is stored on multiple nodes, ideally at least three of them. It's self-healing, so if one of the nodes goes down, Ceph will try to restore that replica from the remaining replicas onto one of the other remaining nodes. It's obviously distributed, so the failure domain can be spread across hosts, across racks, data centers, even entire regions. Scalable: as most of you have probably heard, one of the biggest Ceph deployments is over at CERN. Yeah, clients: one of the unique features of Ceph is that clients can access the data directly, rather than always having to go through a monitor node or through some kind of a point which will tell them where the data is, and that also makes Ceph quite efficient. And yeah, I mentioned copy-on-write. In our case, we use copy-on-write to create head nodes from images stored in Glance without having to actually copy all those gigabytes of data; we simply do a reference copy inside of Ceph. That's been working out pretty well for us so far.

So, when it comes to types of Ceph nodes, as you know, we have Ceph OSD nodes and Ceph monitor nodes, and if you're running CephFS, you also typically talk about Ceph metadata server nodes, or metadata nodes. During this talk, we'll be focusing purely on the OSD nodes, so the nodes which actually store Ceph data. So that's the same slide which we've seen earlier: pretty much, how many OSDs do you want to have. The way you typically deploy Ceph is you want to have a single OSD daemon managing a single spinning disk or SSD, a single disk storing the data. So in this case, in the main part of the screen, you can see a close-up of a single node running five OSD daemons, each managing data which is being stored on a specific disk.

So, fat nodes versus thin nodes; let's have a look at that. So, whether it's better to have a few more, less dense nodes, or maybe fewer nodes which are really dense, as in they have many OSDs with many disks in them. Fat nodes first. Typically, when you talk about fat nodes, it depends, but you typically talk about nodes which have more than 20 HDDs and probably a few SSD journals to speed up data writes. Well, they are typically cheapest per petabyte, because you save money by not having to buy that many CPUs, motherboards, and so on, but on the other hand they are a bit more difficult to set up and more difficult to maintain. Also, because you have fewer nodes, if one of them goes down, Ceph will take more time to restore the data to the other nodes, basically to recover from the failed state. So that's also something you would have to consider. Also, more HDDs, more disks, often means more CPU cores, which often means more CPU sockets, which means that you probably want to start thinking about NUMA and how that impacts the performance of your system. What NUMA is we'll cover in a bit. And yeah, the overall bottom line is that with dense nodes there are more services running within the operating system, there's much more data going through the network, through NICs, through disks, and simply there's much more potential for bottlenecks in various places.
So there are many more things which you have to worry about if you go with fatter nodes. Thin nodes, on the other hand, well, they are typically a bit more expensive, because you obviously have to purchase a few more CPUs, but it depends on what hardware you purchase. Recovery is faster: if one of the nodes goes down, it typically constitutes a smaller overall percentage of your cluster, and that in turn means it will take less time for Ceph to recover to a consistent state. But on the other hand, you will need more space in your racks, and also more power to actually feed that cluster. Still, if you're just starting out with Ceph, from our experience at least, going with thin nodes is a good place to start experimenting. Okay, so what's the sweet spot over here? We will try to explore that in the next few slides, and hopefully there are some conclusions at the end of the session. Okay, so that covers the introduction to Ceph; the next few slides will go through networking, and then disks, CPUs, and memory recommended for your Ceph OSD nodes.

So let's start with networking. Again, we're talking about Ceph in the context of OpenStack. With that, you will typically need three networks to actually power such a solution. The first network you will need is some kind of networking fabric for the communication, for the traffic, between your VMs, essentially, right? VMs have to talk to each other one way or the other, be it via a flat network, or via overlay networking, or VLANs; you have to have some kind of fabric to carry that traffic. Another fabric which you will need is the fabric for OpenStack, for the VMs and for services like the Glance API, to actually access Ceph and manage it. So that's another network you need, and then, ideally, yet another network which is used by Ceph to replicate the data. As I've mentioned, if one of the Ceph OSD nodes goes down, Ceph will autonomously try to replicate the data to other nodes from the remaining copies, and that data has to go through a network.

So how to approach that? Well, there are several solutions. The simplest one would be to go with a single networking fabric. What do I mean by that? Think of a single switch, and maybe some VLANs on top of that, which would carry all of those three networks mentioned on the previous slide. That's by far the simplest solution, but it has some problems. First of all, if you're talking about Ethernet, you have a single broadcast domain, which under high load might increase the overall latencies within the network. There are also bandwidth limitations: there's only so much data you can push through a single fabric. So, for example, if Ceph is replicating data in the backend, that might impact the bandwidth available for communication between the VMs, and vice versa, of course. So that's something to consider. A different solution would be to have a dedicated networking fabric: think of it as a dedicated networking switch, and a dedicated NIC, of course, on each of the nodes, for each of those three networks. So you'd have a dedicated network for traffic between the VMs, a dedicated network for accessing Ceph, and then a dedicated network for Ceph replication. With that, ideally, you would also want all of those networks to be at least, well, the equivalent of 10 gigabits per second.
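Just to make that split concrete, here is a minimal sketch of how the two Ceph-facing networks are usually expressed on the Ceph side, using the standard public_network and cluster_network options from ceph.conf; the subnets here are made up purely for illustration.

# Sketch: the two Ceph networks described above, rendered as the standard
# ceph.conf options. The subnets are placeholders, not our actual layout.
PUBLIC_NET = "10.141.0.0/16"    # clients / OpenStack services -> Ceph ("Ceph public network")
CLUSTER_NET = "10.149.0.0/16"   # OSD <-> OSD replication traffic ("Ceph cluster network")

print(f"""[global]
public_network  = {PUBLIC_NET}
cluster_network = {CLUSTER_NET}""")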
I mean, you could go with 1 gigabit, but by modern standards that's actually pretty slow. But then again, 10 gigabit cards are expensive, and with three NICs, or three 10 gigabit ports, it might be a bit tricky to come up with such hardware. However, if you do not go for a converged solution, so if you decide to put your hypervisors on a separate set of nodes and your Ceph OSDs on another separate set of nodes, then pretty much all you need is a single dual-port 10 gigabit NIC for each of those nodes. You'd go with one 10 gigabit NIC for your hypervisor nodes, with one of those ports used for carrying, say, the overlay traffic, so either VXLANs or VLANs, and the other port used by the hypervisor nodes to access Ceph itself; that's the so-called Ceph public network. And then for the Ceph OSD nodes, again a single dual-port 10 gigabit NIC: one port for data being pushed to Ceph, and the other port used for the replication network. Also, probably most of you know this, but obviously when you write data to Ceph, it is being replicated as it's being written. So in this scenario, data would come in through the Ceph public network, and before the write returns to the client, Ceph would use the Ceph cluster network to replicate the data to the additional Ceph OSD nodes which have to store the replicas for that particular object. So yeah, that would be my recommendation if you have the flexibility to design your own networking solution underpinning your Ceph and OpenStack clusters. Yeah, many people will tell you to go with a 9K MTU rather than 1500. We did some benchmarks on that; we actually haven't seen all that much difference, but that probably points out that we are under-utilizing Ceph somewhere else. But yeah, the overall consensus is that a 9K MTU is definitely better. And like I said, 10 gigabit as a minimum for those fabrics.

So we've covered networking; let's talk a little bit about which disks to choose for your OSDs. With a 10 gigabit fabric, you typically want to go with SSDs for storing the journals on your Ceph OSD nodes. Basically, what SSD-based journals do is coalesce a large number of small writes into a smaller number of bigger writes, which then later on get pushed to your HDDs. So effectively, I/O is a bit faster when clients are writing data to Ceph. One thing to keep in mind when selecting an SSD for your journal: you do want it to be robust. You probably don't want to go with a consumer-grade SSD, because those only have a fairly finite amount of data they can accept throughout their life cycle before actually burning out and dying. So you probably want to go with a bit more solid, more robust SSD over here. When it comes to how fast your SSD should be, well, those are the two last bullet points on this slide. If you happen to go with 1 gigabit networking, that effectively means that your clients can only push almost 128 megabytes per second to your Ceph OSD nodes, whereas regular SATA SSDs are capable of easily accepting up to 400 megabytes per second. So that's another reason why you wouldn't want to go with 1 gigabit networking, but should instead consider 10 gigabit networking.
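A rough back-of-the-envelope version of that bandwidth argument, using the approximate figures quoted above (these are assumptions for illustration, not benchmark results):

# Rough bandwidth arithmetic: NIC speed vs. journal SSD write speed.
# Figures are the approximate ones quoted here, not measurements.
GBIT_MBPS = 1000 / 8          # ~125 MB/s of payload per 1 Gbit/s of link speed

nic_1g = 1 * GBIT_MBPS        # ~125 MB/s
nic_10g = 10 * GBIT_MBPS      # ~1250 MB/s
sata_ssd = 400                # MB/s, a decent data-center SATA SSD
nvme_ssd = 2000               # MB/s, a PCIe/NVMe SSD

print(nic_1g / sata_ssd)      # ~0.3 -> one SATA SSD already outruns a 1 GbE link
print(nic_10g / sata_ssd)     # ~3.1 -> roughly three SATA SSD journals saturate 10 GbE
print(nic_10g / nvme_ssd)     # ~0.6 -> a single NVMe journal can outrun a 10 GbE link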
So with a 10 gigabit network, you can probably fit as many as three regular SSDs behind a single 10 gigabit NIC, or a single PCI Express, NVMe-based SSD; those are capable of much higher write speeds, up to around 2,000 megabytes per second. So if you go with those, you might even consider a bit thicker pipe than just 10 gigabit. Okay, so here is an example that's pretty close to what we use for our own OpenStack and Ceph deployment. In our case, we use Intel DC S3700 series SSDs, pretty small ones, 200 gigabytes; that's enough for an SSD journal. A colleague of mine did quite a bit of research into which SSD is the best, and that's basically the consensus from all the various sources he found. Many people recommend them, many people use them, and many people are very happy with them, and so are we, in fact. Yeah, so there's a little bit of math over here. Assuming that you have a single journal SSD of this class, which is capable of approximately 375 megabytes per second, how many HDDs can you put behind it? As you can see from the bullet over here, it's pretty much equivalent to about five regular data-storing spinning drives which can sit behind a single SSD, which is actually in line with the rule-of-thumb guideline which you might have heard already: that you typically want to have somewhere between four and six HDDs behind a single SSD. But keep in mind that if your SSDs are really fast, say you're using PCIe NVMe drives, which are capable of much faster sequential write speeds, then you would be able to put many more individual HDDs behind them. So in our case, and assuming 10 gigabit networking, we can have as many as 15 HDDs behind our SSDs. With 10 gigabit networking, we can easily have three SSDs with five HDDs behind each one of those, and that's pretty much the theoretical maximum number of SSDs which we can usefully have in our nodes.

Okay, so those are the disks. Let's have a look at how many CPU cores we should aim for over here. But first, how many sockets do we need? As I mentioned earlier, if you go with multi-socket architectures for your motherboards, you will probably have to start thinking about something called NUMA, which stands for Non-Uniform Memory Access. I won't be going into too much detail about it. By the way, there's an excellent talk about NUMA over here, so if you want to know more about it, be sure to check it out; it's from the previous OpenStack Summit, I think by the guys from Comcast, if I recall correctly. Basically, what it boils down to is that if you have two CPU sockets, one of the sockets processes interrupts from some of the devices in the box and the other socket processes interrupts from the other devices. One example is what we have over here in the third bullet: say one of your CPU sockets processes the data coming in from your NIC, whereas the other CPU socket processes the data which goes to and from your SSD. What happens is, as the data comes in, it goes into one of those CPUs and has to cross the so-called QPI bus between the sockets before it can end up on the SSD. It might not seem like a big deal, but according to the guys whose talk I quoted over here, it is actually a pretty big deal when you start measuring it under high load on very thick nodes.
Basically, what it means is there's actually quite a bit of traffic going back and forth over the QPI bus, and it might very negatively impact your overall Ceph performance. So with that, if you don't want to have to think about that, and about things like pinning your Ceph OSD processes to specific CPU cores or CPU sockets, then a safe bet would be to go with a Ceph OSD node which only has a single socket, so you don't have to worry about all of those things. Again, a more pragmatic approach. How many cores? Well, a rule of thumb is to have one CPU core, or half a CPU core, per single OSD daemon. So half a CPU core under normal operation, maybe one full CPU core under failure, when it starts data recovery and data replication and the CPU load goes up a bit. So with, say, 12 OSDs, you would probably need around 12 CPU cores. Ideally, you'd want to pin your Ceph OSD processes to specific CPU cores so that they don't jump back and forth between them. Hyper-threading enabled or disabled? So again, the overall consensus seems to be that when you talk about cores for the Ceph OSD processes, you want to think about hyper-threaded cores, so you definitely want to have hyper-threading enabled. Although I must say that I haven't done any actual benchmarks on that, so over here I'm just conveying what I've heard from other members of the community. How much memory? Again, a rule of thumb would be about one gigabyte of RAM per one terabyte of storage on your Ceph OSD nodes. Obviously, more is better, since with more RAM Linux is better able to use the filesystem caching mechanism to speed up reads when some of this data is being read by the clients. Let's see. Yeah, so here are some examples.

Okay, so let's quickly summarize what we've discussed so far: what to consider when selecting the hardware for your Ceph OSD nodes. First of all, it's the networking fabric, because it pretty much determines the overall performance of the cluster and the number of SSDs per single node. Once you know the number and type of SSDs you want per node, that will also tell you how many regular spinning drives you would want to have per single node. In the example we covered earlier, we said that with a typical decent-class SSD, that would be about five disks per single SSD. And then the number of disks itself will basically tell you how many CPU cores you would want to have, so typically one CPU core per disk. And yeah, the number and overall size of the disks will in turn tell you how much memory you will need for your Ceph OSD nodes.

Yeah, so we're pretty much almost done. This is just a quick comparison of different types of nodes and the different theoretical performance characteristics which one could expect of them. From left to right, we go from an unreasonably thin node all the way to a very, extremely fat, morbidly fat node, you could say. So in the unreasonably thin case, what we have is an unreasonably thin cluster. It consists of very thin nodes which basically have a single SSD and a single HDD in them. A very unreasonable solution in practice, but just for comparison. So we have 96 of those nodes, 96 HDDs total.
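To make the comparison a bit more concrete, here is a small sketch that spreads the same 96 HDDs over the node counts from this slide and applies the rules of thumb from the previous few slides (HDDs per journal SSD, CPU cores per OSD, RAM per terabyte). All the constants are the approximate figures quoted so far, and the 4 TB drive size is just the example size used here; adjust them for your own hardware.

# Sketch of the node-size comparison: 96 HDDs spread over clusters of
# different node counts, plus the sizing rules of thumb from earlier slides.
TOTAL_HDDS = 96
HDD_TB = 4                    # example drive size used on the slide
HDDS_PER_JOURNAL_SSD = 6      # upper end of the 4-6 spinners-per-journal-SSD rule of thumb

for nodes in (96, 16, 8, 6, 2):               # unreasonably thin ... morbidly fat
    hdds = TOTAL_HDDS // nodes                # OSDs (and data disks) per node
    ssds = -(-hdds // HDDS_PER_JOURNAL_SSD)   # journal SSDs per node (ceiling)
    cores = hdds                              # ~1 CPU core per OSD daemon (recovery case)
    ram_gb = hdds * HDD_TB                    # ~1 GB of RAM per 1 TB of OSD storage
    lost_pct = 100.0 / nodes                  # % of cluster data to re-replicate if a node dies
    print(nodes, hdds, ssds, cores, ram_gb, round(lost_pct, 1))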
And actually, you can see that all of those individual clusters, that's the first row, all of them have 96 HDDs in total, only spread out across a different number of nodes. So, like I said, with the very thin nodes, what we have is one HDD per node and one SSD per node. That's something you wouldn't run in production; obviously, with only one HDD, you're heavily under-utilizing your SSD. But yeah, like I said, you wouldn't want to run that in production. Then you have the thin nodes. Over here we have approximately 16 individual nodes, each node with six HDDs and a single SSD. That's a more reasonable scenario, but again, over here, if we're talking about 10 gigabit networking, we're actually under-utilizing the network, because a single regular SSD will only be capable of accepting that much traffic. Then we have a more pragmatic approach: over here we're talking about eight nodes with 10 or 12 HDDs each and approximately two SSDs per node. That's a bit better utilization of the network overall. But yeah, with that many HDDs, depending on your hardware vendor, you will probably need to go with a bit taller chassis; so if you're talking about three-and-a-half-inch drives, you will probably want to go with a 2U chassis for the nodes. Then you have a regular node: over here we're talking about six OSD nodes with 16 HDDs each, with three SSDs in front of those 16 HDDs. And over here we can see that with three SSDs, we are actually very close to perfectly utilizing our networking infrastructure, assuming that we are on 10 gigabit. With three SSDs, you can actually reach a theoretical sequential maximum of almost 10 gigabits per second, so that's actually a pretty good solution for this particular networking fabric. But again, over here you would have to go with either a 2U or 3U chassis. And then the last scenario: over here, with the same number of HDDs, so also 96 HDDs total, we are talking about only two nodes, but very thick ones, quite obviously, each one having 48 HDDs, so effectively 48 OSDs. And with that many OSDs, that also means you will probably have to go into NUMA territory, so probably that's not the place where you want to go if you're not familiar with that. And well, for those last two columns, the obvious drawback is that if one of the nodes goes down, you have problems with recovery. In the case of the fat nodes, where you have only two nodes in your entire cluster, you will not have any recovery, quite obviously. With the cluster with only six nodes, it should recover, but because you're storing such a huge proportion of your data on each node, if one of your nodes goes down, then you're talking about really long recovery times overall.

Yeah, so that's pretty much it. So, just to summarize some key takeaways. When designing your Ceph cluster, you'll probably want to go with a 2U chassis for your Ceph OSD nodes, with 10, maybe 12, three-and-a-half-inch HDDs and two SSDs. Or alternatively, if you can afford a bit bigger, taller chassis, you could go with 16 HDDs and three SSDs, or one PCI Express NVMe drive instead of those three SSDs; that should also be sufficient. You should probably prefer slimmer nodes if you can afford the additional cost and the additional space in your rack; that simply makes it a bit simpler to manage when getting started.
If you can handle it, if you have experience with Ceph, then you might want to go with thicker nodes, which are definitely better for optimizing the overall cost of your storage, at the expense of complexity. You probably want to avoid multi-socket motherboards for your Ceph OSD nodes. Start out, ideally, with at least 10 gigabit networking; unless it's a small proof of concept, you probably don't want to go with 1 gigabit. Avoid small clusters: obviously, like I mentioned two slides earlier, with only, say, six OSD nodes, if one of them goes down, that means a pretty long recovery. Yeah. Also some other random tips, just to keep in mind. Be mindful of and read up on deep scrubbing, because it will kick in by default every week, and every week you will be wondering why the performance of your cluster starts going down; it's deep scrubbing, obviously. Read up on object and data striping; that also has the potential to increase the performance of your cluster, but it's not something I'm going to get into over here. One other thing: if you're using Ceph for your Cinder volumes, you definitely want to enable QoS. Before we enabled QoS on our cluster, every now and then we would have a runaway VM which would simply generate a lot of log output to its disk, which would effectively saturate our entire Ceph cluster. By throttling the amount of sequential writes and the number of IOPS each VM can generate, we basically got rid of that problem. So again, definitely something you might want to check out. The same applies for Nova, obviously, if you're using Ceph for your Nova VMs. Yeah, enable the RBD cache on the client side, in libvirt, so that it can aggregate small I/O before sending it out to your Ceph cluster; it's a feature called RBD cache. There are tons of very interesting Ceph videos out there on YouTube, so if you're just starting out with Ceph, again, be sure to go through them. One which I highly recommend is from one of the previous summits, from the guys from CERN: "Ceph at CERN". That's the last bullet point, so yeah, definitely check that one out. And that's pretty much it. So thank you. Any questions?

A quick comment regarding the talk before this one, about QEMU optimizations: they had some test hardware from Intel, so probably the fastest PCIe SSDs available, and they decided to run four OSDs per PCI device, because otherwise they would not be able to saturate the device. So we're already going in a direction where even the rule of thumb that everybody knows, one OSD per device, is, with SSDs, going to be thrown out again.

All right, that's good to know, thanks.

Hey. Do you have any experience running OSDs on SSDs? And in that case, do you have any recommendations on where to put the journal? Keep it on there, or does it make sense to put it on the...

We don't have any experience with that. If I were to guess, and again, it's just a guess, you would probably want to keep it there, unless you can afford the additional cost of a maybe even faster SSD, right? So you could, for example, have two classes of SSDs: one very fast, ultra-low-latency one only for the journal, and then simply several slower SSDs for the regular data. So that would also be a solution.

Hi, one of the things that caught my attention in your presentation was the relationship between the network bandwidth and not only the type of the SSDs, but also their number. Every paper that you read always recommends that you should not stay with 10 gig.
But as you said, at the regular size of the nodes, you're only going to have up to three SSDs, and these SATA SSDs cannot talk faster than 10 gig, which gives me the impression that you're saying: don't waste money on going beyond 10 gig on the Ceph network. Is that it, or?

No, I'm not saying that. Definitely, the faster your networking is, the higher the overall throughput of your Ceph cluster. So if you can afford 40 gigabit or more than that, then yeah, by all means, go for it.

But the SSDs will cap your throughput, or not? Because...

That's my understanding. I'm not saying that will definitely happen; it might depend on the type of SSDs, but that's the feeling I get from experimenting with Ceph. And also, I mean, you can always consider putting in faster SSDs, right? So indeed, you will probably be able to reach 10 gigabit with three data-center-class SSDs like these, but if you go with several PCI Express NVMe drives, then it might not be the cap anymore.

All right, thank you.

Yep. Hard drives are getting bigger, of course. We're looking at possibly using eight terabytes, and 10s and 12s are coming out by the end of the year. Any recommendations on how big the hard drives in your Ceph cluster should be?

I didn't get the word, what's that?

Hard disk drives, you know; at the moment, I think you showed four terabyte drives in your example.

Yes, yes.

Six, eight, 10.

Yeah, that's just an example, of course, so yeah.

So what are the risks of going bigger? Obviously there's the longer rebuild time when a hard drive fails, but do you see any other risks in going with bigger drives?

Other risks? Reliability, I would say. The bigger the drive, the smaller the size of the individual bit on it, so obviously the higher the chance of bit rot, for example. But that's something you have to consider on your own, right? Depending on what type of HDDs you want to go with and how many of them you want to have, so.

Well, you know, obviously if you're looking at the dollar-per-terabyte price, and you use eight terabyte drives, you get a much cheaper solution. Some of the reference architectures are coming out now, so the Dell solution is an EXT with 16 eight-terabyte drives and a couple of fast PCIe cards in front, so.

Yeah, yeah.

Hi, have you considered running Ceph OSDs right alongside Nova compute on the same hardware?

Yes, we have, and in fact we're doing it right now, although we will be moving away from that. It's a bit hard to gauge how big of an impact a Ceph OSD process can have on the performance of the VMs. In our case, it's not a big deal, because our VMs are, most of the time, fairly idle, so they do not generate that much load on our Ceph cluster. But one of the problems which you might face is, well, the complexity of the management. I mean, say you want to restart your hypervisor node: you end up restarting the hypervisor and the Ceph OSDs on it, which, unless you take the OSDs out properly in your Ceph cluster first, will generate quite a bit of traffic. So that complicates management a lot. But in other cases, if your VMs do generate a lot of CPU load, then you probably want to keep them separate from your Ceph OSDs. But again, there's something to be said for both approaches.

OK, thanks.

What's your experience comparing SAS devices with SSDs as journals and... sorry, SATA devices with SSD journals, versus regular SAS devices?

Well, I didn't do any comparisons on that, so yeah.
Because our experience has been that SAS with collocated journals usually compares almost the same in terms of performance. So if you do the maths, it's kind of one-to-one, and you get better cost effectiveness with the SAS.

For block storage?

Yes. So for RBD, not for the RADOS Gateway. So I was interested whether anybody in the room has been comparing the two scenarios.

OK, but wouldn't the latency on the SSDs still be much better, right?

The latency is the only thing that's a bit better. But if you go to a scenario where you're destaging to disk, latency goes to hell with SATA.

Yeah. Yeah? OK. But yeah, I think there's one more question.

I just wanted to make a comment to this gentleman over here about NVMe-based journals. If you have the money, it does make sense to put the journals on NVMe, because your latencies for writes will go down. And also, you can go with mixed endurance on your data SSDs.

Yeah, thanks. OK, so one more thing: we have some t-shirts over here, so if somebody wants a t-shirt, yeah, just come by. Yeah, thanks.