Would it work better on the white side? I guess. Okay, yeah. I don't have a microphone, but I think the people in the back should be able to hear me. I hope all of you have prepared at least a VM or something similar, which was the request in the presentation. If not, you can watch how others do it or follow what I show on the screen. So, I work on Gluster and I am one of the core maintainers, working on many different parts of Gluster. I'll show you a bit about what we're going to do today. The agenda is not really complete, but I'll give an introduction about what Gluster is, so that everyone understands how Gluster works. From there on, I'll show you how you actually set it up so that you can use it in your VMs, and we'll cover a bit of the technical parts there as well. Because the network is not the fastest, we'll do it in bits and pieces, so while things are installing you might be waiting for the packages to download. It's not very big, only 12 megabytes or so, depending on what dependencies you already have. I'll also explain a bit more about upcoming features and the recent additions that we did in last year's release. Any questions that pop up, just let me know and I'll answer them. So, Gluster itself: who is familiar with Gluster? Okay, a few. Who has actually used Gluster before and not only read about it? Okay. So, Gluster basically provides you with a network file system. That's the whole purpose. It provides the file system and you can store any normal POSIX-compliant files on there, whatever you like. People use it for object storage, or archival, or media streaming, which is one of the things Gluster does very well. Storing backups is something that a lot of people have an interest in: you write once, possibly read from many clients, and backups hopefully are mostly stale, but when needed you can recover them again. We provide access through a FUSE mount.
So, you can mount it over FUSE as a POSIX file system. It starts a process in the background, and this process runs in user space and actually does all the work for you. We do SwiftOnFile, which ties in with Gluster, and SwiftOnFile on the other hand ties in with OpenStack, so you can have your OpenStack Swift run on a Gluster back end. Other components include libgfapi. libgfapi is a user-space library and you can use this library in any of your applications. QEMU is one of the applications that uses libgfapi and therefore speaks the Gluster protocol natively. It can immediately use the files stored on a Gluster volume without the need to go through a Linux file system mount on the client side, which is much more efficient. We don't use any metadata server at the moment. It's completely distributed, and there's no metadata server that would cause a bottleneck. Metadata servers are normally pretty unique in your environment: they either have a bandwidth limitation, or if you have only one it's a single point of failure, and failover might not be easy or fast enough. So we don't use a metadata server for this. If a server goes down, the client knows the logic, the clients actually do all the work, and they can fail over automatically. This makes it possible to scale out really, really well. Now the basics of Gluster. Gluster uses bricks. A brick is more or less the storage unit or the storage back end that Gluster provides. A brick basically is a directory on a storage server with a mounted file system. We suggest to use LVM with thin provisioning below this file system so that we can do nice snapshots, but it basically is just a mounted file system anywhere. The file system needs to support extended attributes, so we don't support bricks on NFS, for example, because NFS doesn't do extended attributes yet. The bricks are the lowest level that Gluster has.
On top of the bricks, the file system of the bricks gets used by the Gluster processes. The Gluster processes are a stack of translators. Each translator implements a particular functionality, so we have one translator called the POSIX translator which speaks actual POSIX semantics, like an open, a close, a read and a write. All the other translators build on top of this POSIX translator, and so on. The translators are very flexible. We've got a contribution that, instead of the POSIX translator, doesn't speak to a file system below it but speaks to LVM below it. If you create a file, this particular translator creates a logical volume. So if you have a heavy virtualization workload, you could use this translator instead of the POSIX translator, and suddenly all the images that you create are actually logical volumes in the back end instead of files in a file system. All the translators are flexible like this: you can mostly enable and disable them at will, and anything that you implement in Gluster is basically a translator. All of these translators combine over multiple servers. So you have multiple servers, and all of them have a stack of translators. These servers together, these bricks together, combine into one volume. So a volume consists of multiple bricks. Those bricks can be located on a single server, but that's mostly not what you want if you want to distribute your environment. So you have multiple servers, and a volume combines all these bricks, all these servers, into one and makes a user-facing volume, a user-facing file system, that spans all of these bricks. In Gluster we call the servers peers. Because if you say to a Gluster developer, my server doesn't work, it's not clear if this is a storage server or a client actually using the services that a Gluster environment provides. It can well be a web server that hosts websites on a Gluster environment. But if you say my server doesn't work, it's extremely unclear.
So Gluster has the notion of peers, and any of the storage servers are peers. We don't like to say Gluster cluster a lot, it's really awkward to pronounce. We call them Trusted Storage Pools, which is basically a cluster of Gluster servers. Who is not familiar with scale-out and scale-up? Well, we try to explain it in this diagram. We have scale-up here on the vertical line. Scale-up basically means you have a very powerful server. You start with maybe two disks in a server that might have a tray for six disks. You want to scale up, you add more disks. Every time you add more disks to this particular server, that's the scale-up process. If you want to do scale-out, you want more distribution, you want more servers. Gluster mostly facilitates this scale-out way. So if you run out of storage, what you do is you add more servers, relatively cheap servers compared to one very powerful scale-up server. Scale-up servers probably have multiple CPUs, and you might not have every socket filled with a CPU; if you need more power, you add more CPUs to this particular high-performing server, which is extremely expensive and is a single point of failure. It also introduces other limitations. For example, bandwidth to one server is limited at some point. You can scale up your bandwidth by adding more PCI cards and more network cards and everything, but that doesn't always work very well. What you want is most likely not only the scale-up process, but to scale out. Scale-out is in general cheaper: you add more hardware, and it adds to the performance because more clients can distribute themselves over all your network links, because you have more servers. Think about the distribution of your clients: assume that you have several hundred clients, maybe thousands of clients, ten thousand clients. If all of these clients hit a particular small number of servers, your resources get limited.
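To make the brick-and-volume idea concrete, here is a hedged sketch of creating a plain distributed volume from two bricks. The hostnames (`server1`, `server2`), the volume name `myvol` and the brick paths are made up for illustration; it assumes glusterd is already running on both peers and that `/bricks/brick1` is a mounted file system on each:

```shell
# One brick per server; a brick is just a directory on a mounted file system.
gluster volume create myvol \
    server1:/bricks/brick1/data \
    server2:/bricks/brick1/data

# A volume must be started before clients can mount it.
gluster volume start myvol
gluster volume info myvol
```

These commands only work against a running trusted storage pool; they are shown here as a sketch of the shape of the CLI, not something to paste blindly.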
If you add more servers, all of these clients evenly distribute themselves over all of the servers. So you effectively have more bandwidth, and you have more storage. If a server dies, it's not such a big deal, because the server only contains a small dataset compared to the huge dataset on an extremely big server. So it improves your recovery times and everything. Can I scale out from two servers to twenty? Yes, so the question is, can I scale out a particular volume from two servers to twenty servers? You start your volume with two servers and you notice, well, my Dropbox-like facility is really popular and everyone starts to use it, everyone wants to upload files to this volume. What you can do is add another eighteen servers to this volume. The Gluster clients automatically notice, and they will be able to use the whole twenty servers for storing files through this particular volume. You don't have to add additional volumes. You could create multiple volumes for different purposes if you want to, but you can have one volume and add additional servers two at a time, three at a time, depending on your environment. So, when a Gluster client connects to a Gluster volume, it uses a hostname to connect to a server. In this case, we have three servers. If your Gluster client needs to mount this volume, it for example says, okay, I want to mount from this particular hostname. If you add two other servers, the client doesn't know about their hostnames, not for mounting. But it knows the whole volume layout when it mounts: upon mounting, the first request is, what does this volume look like? What servers do we have and which bricks do we have? So the client knows immediately how the volume looks and which servers participate in the volume.
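Growing an existing volume is a single CLI call. A hedged sketch, with the same hypothetical names as before; note that new bricks only receive new files once the directory layout has been updated, which the talk comes back to under rebalancing:

```shell
# Add two more bricks (on two new peers) to the existing volume.
# The clients pick up the new volume layout automatically.
gluster volume add-brick myvol \
    server3:/bricks/brick1/data \
    server4:/bricks/brick1/data
```

For a replicated volume you have to add bricks in multiples of the replica count, so "two at a time" for replica 2.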
So, the only place where you would manually like to specify multiple server names is at the time of mounting. You can do that: you can pass multiple servers in case the first server isn't available. You can pass an option for backup volfile servers, and the client will just go through the whole list of servers and figure out which server is the first one that's available. From that available server it fetches the whole volume layout. After that, the client knows all of the other servers and it can access those servers directly. Do you then load-balance between those servers, or only use that one entry point? Yeah, I'll come back to that a bit later when I explain the distribution. So, on the point of a single point of failure, because that's what you want to prevent while mounting: say at mount time this particular server number one is down but these two are still up. If you don't specify multiple servers, your mount will fail, because it says, well, this server is not reachable, or there's no daemon running so you can't connect to the port. So you get an error message and it will fail. You can pass multiple servers on the mount command line and it will then iterate through the servers. You can also use DNS and provide multiple IP addresses for this particular entry point. Most of the time the client shouldn't really care which server it mounts from, and if you use DNS it gets a whole list of all the servers that are available for mounting and it will just iterate through all those IP addresses, one after the other. So if you are scaling out and adding servers, it makes most sense to put new servers in a DNS entry and make sure that your Gluster environment is always reachable over one single hostname, where this DNS entry resolves to multiple IP addresses.
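The fallback at mount time can be sketched like this; hostnames and the mount point are hypothetical, and the `backup-volfile-servers` option is the one the FUSE mount helper provides for exactly this case:

```shell
# server1 is only the entry point used to fetch the volume layout.
# If it is down at mount time, the client tries server2, then server3.
mount -t glusterfs server1:/myvol /mnt/myvol \
    -o backup-volfile-servers=server2:server3
```

After a successful mount the client talks to all bricks directly, so the entry-point server is not special anymore.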
When you add more servers, you add more IP addresses to this DNS entry, and the clients will know on mounting which servers can be used and will try each of them. Does that help? Yeah? Okay, so that's basically scale-out. Gluster is really a scale-out file system. We can do scale-up as well, but there are very few people that actually want to use Gluster in a scale-up fashion. Okay, so distribution of files. Files are basically randomly distributed, at least for people that see this distribution on the back end. Gluster stores files just in the file system, so we have the same directory structure and file names on the back end as what you see through a mount point. This distribution is basically random; it is based on a hashing algorithm. Here we have two servers, both servers have one brick, and we use the file name for hashing. We calculate the hash from the file name, and this hash always falls into a range. Say the result of the hash function is always between 0 and 16. What we do is assign each brick a hash range, so 0 to 8 is on the first server and 8 to 16 is on the second server. Clients, because it's a file system, tend to use file names to access contents. Any file system without file names is basically not a file system, so clients know the file names. The client hashes the file name; the result is either between 0 and 8 or between 8 and 16, and depending on that the client selects server 1 or server 2. This is the whole distribution logic. The real hash ranges are a bit bigger and we have a little twist in it, but this is basically how we distribute files. Does that mean that a simple rename could actually become a pretty expensive copy operation? Well, a rename should be atomic, people expect this, and it's not expensive: a rename only actually changes the name.
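The hash-range placement described above can be sketched with a toy shell function. This is only an illustration of the idea: the range 0..16 and the two bricks come from the talk's example, the function name is made up, and `cksum` stands in for Gluster's real hashing algorithm, which is different:

```shell
# Toy DHT-style placement: hash the file NAME into 0..15 and pick the
# brick whose hash range covers it. NOT Gluster's actual hash function.
hash_brick() {
    local name="$1"
    # cksum gives a CRC of the file name; reduce it to the range 0..15
    local h=$(( $(printf '%s' "$name" | cksum | cut -d' ' -f1) % 16 ))
    if [ "$h" -lt 8 ]; then
        echo "server1:/bricks/brick1/data"   # covers hash range 0-7
    else
        echo "server2:/bricks/brick1/data"   # covers hash range 8-15
    fi
}

hash_brick "file1.txt"
```

Because the hash depends only on the file name, every client computes the same placement without asking a metadata server, which is the point of the design.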
Yes. But we have something called a link file: we just point to the other server. This matters because when you change the name of the file, you hash the new file name, and if the hash falls in a different range, a rename would be a really expensive operation; you would need to move the file and everything. That's not what we do, we use a trick. We have something similar to hard links, but a brick-wide or volume-wide kind of hard-link functionality. What we do is take the new name, and on the brick that should have the file according to the hash, we place a small link file that points to where the data actually is. That makes it much, much faster. Yeah, because otherwise you would have to be careful with that. Yes, exactly, it's not very easy. Okay, any other questions about how the distribution logic works? So the distribution is all done on the client side, right? Yes: the client, when mounting, gets the volume layout, the client calculates the file name hash, and then the client connects to the particular servers. It's not server-side distribution; it is all done client-side. The client contains the logic. What about the names of the directories? The directories are located on all of the bricks. The directories are not files, and we use the layout of the directories over all the bricks. So if you create a directory, well, even directories have a file name, but we create the directory on all of these bricks. We only do distribution of contents, not of the directories. Only the leaves in the file system, like files and device nodes, get distributed. The directories are created on all of the bricks, and they actually contain the hash ranges. So if you create a directory tree, the hash ranges are per directory in the end.
If you create a file in a directory, in one distribution part the directory is empty, and wherever the file gets located, that copy of the directory contains the file. So if I create a directory, it exists on many distribution bricks, maybe we have 20, so we have 20 copies of the same directory? Yes, and some of them are empty, depending on how the files get distributed. So it's not completely atomic. We have some issues with concurrent directory creation: if multiple clients run a whole directory tree creation at the same time, there is currently no complete locking for that. Well, I think one of the guys posted patches this week that actually fix this. But that has been a pain point for us, because if one client creates the whole directory structure, and another client creates the same directory structure at exactly the same time, one client might create the directory first in this part, and the other client creates the directory first in that part, and they assign different metadata to the directory. That gives you a conflict and a bit of trouble. But this is extremely rare; you really have to try very hard to get that to fail. I thought about a common scenario, for example, where you want to move a directory out of the way, which we expect to be fast and atomic, and then delete it later. So we expect this operation to be very fast and reliable. Yes, so at the moment it's not atomic, and atomic doesn't mean that it's fast, right? Atomic just means that it's safe to do, so there's a difference there. But yes, we're working hard to get that fixed. We want to address this, and we also want to backport those fixes to our stable releases. At the moment it's not atomic, but it's really difficult to hit this in a real-world example. You actually have to run loops of creating directory structures and try really hard to reproduce it.
So it's not trivial to hit this problem. You had questions? Yeah. What about big files, are they split between the bricks? We have an option to do so, and I'll come back to that later. Can I somehow enforce a file to be on a specific brick? And if I lose a brick, can I still at least mount and read the files that are still there? Yes, so the splitting of files, we come back to that whole topic later. But at the moment, if you just use this distribution logic, the files that were created on server one, you see them on its back-end file system, and the files that were created on server two, you see them on their back-end file system. So if you want, you can back them up there, and if your whole Gluster infrastructure is down, you don't have any network or whatever, you can still access your files there directly. If you split files in pieces, that's more difficult. And if I lose a brick, do I lose just some files, or will the whole Gluster volume disappear? In this case, well, that's actually the next step, the replicated case. If the volume is replicated and you lose one server, that's okay, you still have a copy of the file. In the distributed volume, if you lose server one, these two files will just not be available; file number three will still be there, but these two files won't. And can I force a file to be on some specific brick? If you really want to, you can. That's not how cloud computing and distribution normally tend to work, but if you absolutely want to do that, you could. Does it mean that when I add a new brick to the cluster, the whole directory structure will be cloned to that brick? Exactly. We call that rebalancing the environment, or just fixing the layout of the directory structure.
Creating the directory structure is important because each directory has those hash ranges. Yes, if you add a new brick, the whole directory tree needs to get created there and the hash ranges need to get updated to get an even distribution. Otherwise, if you add a new server or a new brick, it will not effectively be used yet. That's a relatively expensive thing to do, and it's something that Venky is addressing with DHT v2, the distribute translator version two. We want to improve these kinds of operations and make them much, much faster. So if I add a brick, there is some kind of operation to actually rehash and redistribute? Yes. And if I want to remove a brick, I can tell it and it rehashes as well? Yes, so what you do is say, I want to remove this brick; no new files will get created there, and the old files need to get moved off of it. You probably want to plan this operation, because even though we use relatively small servers, these relatively small servers tend to have like 30, 40 disks, right? That's a lot of data, and if you want to move the contents of these disks to other servers, that takes a lot of network bandwidth. So most of the time people only do these kinds of operations, like the rebalancing and everything, in some kind of maintenance window, or at least in a period of the day where users don't get influenced too much by the network traffic. Oh yes, it's possible, yeah. Do you want to ask something? No. Okay. So, replication. I mentioned that we have translators, right? The distribution layer is a translator. The replication is just another translator. They just do different things: one says you need to store your file there, and the other one says, well, you need to store your file in both places. In this case it is a very simple logic. We have two servers, we want to have two copies of the data, and you have two bricks. So the client says, okay, the following layout contains the replication.
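The layout fix, the full rebalance, and the brick removal just described each map to one CLI operation. A hedged sketch with the hypothetical volume and brick names from earlier:

```shell
# Only recreate directories and update hash ranges on the new bricks:
gluster volume rebalance myvol fix-layout start

# ...or also migrate existing files to their new locations:
gluster volume rebalance myvol start
gluster volume rebalance myvol status

# Drain a brick before removing it: no new files land there,
# existing files get moved off, then the removal is committed.
gluster volume remove-brick myvol server4:/bricks/brick1/data start
gluster volume remove-brick myvol server4:/bricks/brick1/data status
gluster volume remove-brick myvol server4:/bricks/brick1/data commit
```

Both rebalance and remove-brick move data over the network, which is why the talk suggests running them in a maintenance window.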
These two bricks should be copies of each other. So whenever I need to write a file, I'm going to write it twice. The effective bandwidth that you have for writing becomes only half of the bandwidth that you have physically available, because your writes actually have to go from the client to both bricks. It's client-side. So the client has to send the file to both servers? Exactly. For writing, the client actively sends the data to both sides, because all these features are client-side: the distribution is on the client side, and the replication is on the client side as well. For reading, this is not the case. For reading, we accept the data from the brick that first responded to the initial lookup call. You do a lookup when you open a file, and the first brick that responded is used to read the data from. What about different versions of the data on different bricks? So the question is, is it possible to have different data or different versions of the data on different bricks? That's only possible if you run into a scenario where one side of the replica misses updates. For example, server 2 is not available for a while. One of the clients updates a file, and this update lands only on server 1. Then server 2 comes back online. This is not really split-brain, because we know that server 1 was up the whole time, and we know that server 2 has an older version. When we do a lookup, when we do this open of a file, we actually check if the files are in sync. We have a change log for the replication mechanism that says, okay, you missed so many operations, and then we can re-sync the data. So we know that this copy is valid and that copy is outdated and needs to be repaired.
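Creating the two-copy layout described here is the same `volume create` call with a replica count. Hostnames and names are again hypothetical:

```shell
# Two bricks, two copies: every file is written to both bricks
# by the client, so client write bandwidth is effectively halved.
gluster volume create myvol replica 2 \
    server1:/bricks/brick1/data \
    server2:/bricks/brick1/data
gluster volume start myvol
```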
If, on the other hand, we have another client on the other side, we could get a real split-brain. Say the network between these two servers dies for whatever reason. This client would be able to update file number one on this server, and that client would be able to update file number one on that server. That can be problematic, because they don't need to write the same data to the file. Take a sparse file: the file was created on both bricks, one client on the left updates the beginning of the file, the other client on the right updates the end of the file. Gluster cannot decide for you which content to keep, and Gluster will prevent access to the file when it detects this kind of inconsistency. There are different ways to solve this, including adding a third copy, or making sure that one server is always seen as the main node. So if server two dies, you can still write to server one. But if server two notices that there's no connection to server one anymore, let's say the network splits and server one is the main source, then server two is still up and running but cannot reach server one, and we can say, well, server two doesn't accept any writes. We can do other things with quorum and everything. So you can prevent these kinds of things from happening. Split-brains are most annoying to have, so we need to figure out ways to prevent them and to repair them automatically, and those options are all available. We can do many different kinds of solutions for that. Is there enforcement in the software, a smart algorithm detecting the split-brain: server two disappeared for such and such an amount of time, and now I'm going to enforce a new replica? So we do the enforcing of not accepting anything from server two, and clients get informed that server two is not there. But we don't do automatic migration of data; we don't do automatic creation of replicas.
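The quorum behaviour mentioned here is configured per volume. A hedged sketch using two real volume options (the volume name is hypothetical; exact defaults vary per Gluster release):

```shell
# Client-side quorum: clients refuse writes unless a majority
# of the replica bricks is reachable.
gluster volume set myvol cluster.quorum-type auto

# Server-side quorum: glusterd stops bricks on peers that have
# lost contact with the majority of the trusted storage pool.
gluster volume set myvol cluster.server-quorum-type server
```

With two replicas a strict majority effectively makes one side read-only during a split, which trades availability for avoiding split-brain.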
That is something we would like to have, but it's very difficult to come up with a procedure that is usable by everyone. We have a lot of users that have already built their own Puppet scripts and monitoring jobs that trigger suitable tasks whenever they notice that a brick is down or something, and they all do it in different ways. So we have to figure out what is the best way for most users, and I'm not sure if we have already found a really good solution for that. The logistics differ too: one class of users might use one policy and another class a different policy, and so on. I was just afraid because if this is commodity hardware and you only have these two replicas, you are not resilient anymore. If one disappears, through a transient failure or otherwise, the other commodity hardware may just happen to decide I'm going out of business as well. It's cheap hardware. Yes, so with the cheap hardware that we tend to use for Gluster, a lot of users buy hardware from the same batch, so they get a whole series of, I don't know, a particular server line, and those often contain, for example, the same disks. So if one server fails, it is quite likely that another server fails shortly after. If you then have only one remaining copy of your data, that's a really dangerous scenario. So we suggest that you use three copies, depending on how you set your quorum, or in general we advise you to use a RAID environment below the bricks, so we don't use single disks. We suggest you use RAID unless you do more fancy erasure-coding kinds of things. So at least you have some resilience there. So you actually recommend RAID? We recommend hardware RAID, yes. Most users tend to put 12 disks in a RAID set and use this RAID set as a brick.
With that, an actual disk failure, either with RAID 6 or with RAID 10, isn't too difficult to repair: someone runs through the data center, looks at the little lights, and replaces disks. That's what a lot of users tend to do, but it costs a lot of storage, it needs a lot of disks. So we also have a feature called erasure coding that splits up the data, the files, into chunks. These chunks get encoded, and you don't use RAID below that but use the disks directly (you would still use LVM below). You put these chunks on the disks, and if a disk fails, a chunk is missing, but you can actually reconstruct the data, the whole file, even with some of these chunks missing. So that's an option if you want to be more space-efficient. You can use erasure coding; it costs more CPU to calculate all this, but it's more efficient, especially for archiving use cases and inactive data, so write once and read many times, or write rarely, read many times. Those are very good candidates for the erasure-coded volumes. The normal replicated volumes are similar to RAID 1. You can have two replicas, you can have more replicas if you want to, but each copy that you have costs physical disks in the end. If a disk is failing, do you know how Gluster behaves? So the question is: if I don't use RAID, the disk sits directly below the brick and starts to fail, you get these read errors from the drive, how will Gluster handle that? Will the file just be taken from the other side? Yes, so how do you detect disk failures if you have a single disk, and how will Gluster behave if my disk starts failing? Will it still run and just read everything from the second disk when I get I/O errors on the first one?
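The erasure-coded layout described above is a dispersed volume in the Gluster CLI. A hedged sketch, with hypothetical hostnames: six bricks where any two may be lost, so each file is encoded into four data chunks plus two redundancy chunks:

```shell
# 6 bricks, redundancy 2: the volume survives losing any 2 bricks,
# at a storage overhead of 2/6 instead of the 1/2 of replica 2.
gluster volume create archive disperse 6 redundancy 2 \
    server1:/bricks/brick1/data server2:/bricks/brick1/data \
    server3:/bricks/brick1/data server4:/bricks/brick1/data \
    server5:/bricks/brick1/data server6:/bricks/brick1/data
gluster volume start archive
```

This matches the archival use case from the talk: more CPU on encode/decode, less raw disk per stored byte.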
So if you get such errors on your disk, the file system most likely will become read-only; that's what the kernel does if it really gets disk errors. In that case we notice that something really weird happened and we stop the brick process for this particular disk. So it is very clear which content is still valid. Okay, thank you. Back to the translators. We will see them later on in the volume file, if you have your laptop with you and everything. The translators are really flexible: we have the distribution logic, we have the replication logic, and you can combine the two. You can create a distributed-replicated volume, which means that at the bottom of the stack you create two replicated volumes. Replicated volume number zero has two servers and two bricks, replicated volume number one has two servers and two bricks. Each of these acts as something like a sub-volume. The distribution logic can use sub-volumes, and it really doesn't care what's below it, so it says, okay, we distribute this particular file to sub-volume number zero and that particular file to sub-volume number one. In the end it's pretty simple, especially if you understand how the distribution logic works and how the replication logic works. You just stack them on top of each other and all the calls get passed on, the data gets passed on. The client says, okay, this is the distribution logic, so I need to write to this particular sub-volume, and the client also does the replication, so it says, we have to write this particular file to these two bricks. Would it be possible to have a scenario where you have a replicated volume with one brick on a high-speed link and, let's say, a slower one that would be over a WAN link? Is this a supported scenario?
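The distributed-replicated stack just described comes from the brick ordering on the command line. A hedged sketch with hypothetical names; with `replica 2`, consecutive pairs of bricks form the replica sets, and distribution then hashes files across the two replicated sub-volumes:

```shell
# Bricks 1+2 form replicated sub-volume 0, bricks 3+4 form sub-volume 1;
# the distribute translator hashes each file onto one of the two pairs.
gluster volume create myvol replica 2 \
    server1:/bricks/brick1/data server2:/bricks/brick1/data \
    server3:/bricks/brick1/data server4:/bricks/brick1/data
gluster volume start myvol
```

The brick order matters: listing two bricks of the same server next to each other would put both copies of a replica set on one machine.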
Well, it's open source, right? Supported is relative. It would work just fine, but a write needs acknowledgement from both sides. A read will always be served from the fastest side, so most of the time that's the local environment that you have, but a write needs to be acknowledged by both sides, and that will hit the performance. The same goes for file creation or, for example, the checks that get done when you open a file: they need acknowledgement from both sides to see, okay, are the files in sync. So the open call, or the stat call actually, does check for inconsistencies, and if you have a remote side, that normally gets delayed. So Gluster expects all the nodes to be in sync the whole time, and I was trying to cheat on that, let's say, by disconnecting a node and then syncing it with the other nodes later, making it asynchronous? Yeah, so the replication logic here is really synchronous. We do have a way of doing asynchronous replication, where you can put a replicated environment in a different part of the world; it's called geo-replication. So that's one of the options you would have. There are users that say, well, performance is not the most important part for us, we want this synchronous replication but spread over multiple data centers. Some users do it, but you have to take into account that performance most likely will suffer, and if you have a disconnect between those data centers, performance will suffer even more until the disconnect really gets noticed. So it depends on your needs and what performance you accept. For all of this we have different ways of mounting: the client over FUSE; an NFS server that we ship with Gluster, which provides NFS version 3; NFS-Ganesha, a fully featured NFS server; and Samba is an option.
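The asynchronous alternative mentioned here, geo-replication, is driven from the master side. A hedged sketch: `myvol`, `remotehost` and `backupvol` are made-up names, and it assumes the usual prerequisites (a started volume on both sides and passwordless SSH for the geo-replication session) are already in place:

```shell
# Create, start and watch an asynchronous geo-replication session
# from the local volume "myvol" to "backupvol" on the remote site.
gluster volume geo-replication myvol remotehost::backupvol create push-pem
gluster volume geo-replication myvol remotehost::backupvol start
gluster volume geo-replication myvol remotehost::backupvol status
```

Unlike the synchronous replication translator, writes return as soon as the local side has them; the remote site catches up in the background.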
So there's a native module for Samba that uses libgfapi and therefore speaks the native Gluster protocol. Swift-on-File, mentioned earlier, and QEMU, Bareos and other projects use libgfapi or tie in with Gluster in other ways. We provide packages for many different distributions. It's part of Fedora, it's part of Debian, it's part of NetBSD. The CentOS Storage SIG provides packages for Gluster 3.6 and Gluster 3.7, for both CentOS 7 and CentOS 6. Some other distributions are available from download.gluster.org. We have a group of maintainers that tries to push packages for different distributions, but most of the packages — well, I guess all of the packages — are maintained by volunteers from the Gluster community, so packages for certain distributions tend to take a bit more time to get produced and made available than others. We have different quick start guides as well. Now, this is where you should get active; I hope you have your VM prepared. I like to use CentOS for my testing. It's a little bit more stable than Fedora; Fedora changes package locations and I would have to adapt my scripts, so CentOS makes it a bit easier to run. So what you would do: you install the glusterfs-server package, and that's basically it. On CentOS 7 you enable the service with systemctl; on CentOS 6 you would use the service and chkconfig scripts, and you start it. This is basically how you start your first trusted storage pool. I'm not sure who's trying this out now and who needs to see the commands — I can show you, maybe, if we have a little bit of network. It's a very little bit, yeah, exactly. We almost got USB sticks with everything on them, but unfortunately that didn't work out, so people have to download on their own systems. Audience: are there any USB sticks around with just the packages on them? If we have a USB stick, we could, yes. Who needs CentOS? I have CentOS packages, yes.
Okay, so during the previous talk I already downloaded some of the bits, at least I hope. Well, let's see. On CentOS 6 or CentOS 7 there's an additional step compared to Fedora: you enable the CentOS Storage SIG repository first. On CentOS you get a yum repo file — it's a very small thing, so it works pretty well. This creates a file that contains the yum repository details and everything. For Fedora and many other distributions, the packages are just available in the standard repositories. Okay, so: yum install glusterfs-server, pulling in all the dependencies it needs. The packages are signed with the CentOS Storage SIG key, in this case. And actually you need to do this on two servers. Going back to the slides: it's installing, and then systemctl enable glusterd — glusterd is the management daemon — and systemctl start glusterd. After that you can peer probe. So, just to be sure, make sure that glusterd is running. After peer probing, you can check which systems are available: localhost and the remote system. So these are the commands, in case anyone still needs them. Sorry? Question from the audience: what does Gluster use the peer probe for — what does it verify? So glusterd connects to the other glusterd servers and exchanges version information and other details about the environment. After peer probing, a glusterd server is part of the trusted storage pool. No? Yes, you need to peer probe a server first. Peer probing really says: okay, now we're together in one Gluster environment, and only after that can we actually use the services on that server. So that's peer probing. So now you've created your trusted storage pool. The next step is to create a brick. As I said, a brick basically is a mounted file system somewhere. So yeah, you have to create a brick and... yeah, thank you.
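The installation and peer-probe steps just described can be sketched as follows. This assumes CentOS 7 with the Storage SIG; the repository package name and the peer hostname `server1` are placeholders — adjust them for your distribution and environment.

```shell
# Enable the CentOS Storage SIG repository and install the server package.
yum install -y centos-release-gluster
yum install -y glusterfs-server

# Enable and start the management daemon (CentOS 6 would use
# "chkconfig glusterd on" and "service glusterd start" instead).
systemctl enable glusterd
systemctl start glusterd

# From the first node, add the second node to the trusted storage pool,
# then verify that both peers are connected.
gluster peer probe server1
gluster peer status
```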
Audience: you could have done that in one go, by the way, by saying vgcreate with the VG name and the list of physical volumes you want — presumably the spare ones, right? Yeah, so that really depends on your environment. A lot of users tend to create everything on demand, so you don't really add bricks in advance or prepare logical volumes in advance. I'm just pointing it out. Okay, that's good. So I don't remember how big my second disk is, but... vgs? Oh, yeah. I just created a 512-megabyte logical volume, created an XFS file system on it, and created a devconf_0 directory to mount it on. I'll also add it to /etc/fstab so that it automatically gets mounted on boot. So that's basically the preparation of a brick, which is step number four. Obviously you have to do all these things on multiple servers. Normally you would use different tools for this: some people prefer things like SSH or cluster SSH; I mostly put the commands in a script and run that script, or use Ansible to do all this. But instead of adding all of these additional tools to a presentation, I prefer to keep it really simple and make very clear what you're doing, so that you can build a lot of automation around it. That should be really obvious, and hopefully everyone would do this. You can automate everything you like in your own way; I'm not going to prescribe or suggest Ansible or something else — that's really completely up to you, and I don't want to confuse anyone that's not using it, for whatever reason. So we created these bricks. The gluster volume create command has different options; we'll just create a sample volume, it doesn't even matter which. Would you like to see a replicated volume or rather a distributed volume? Distribute, okay.
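The brick preparation just described — thin-provisioned LVM, an XFS file system, a mount point in /etc/fstab — might look roughly like this on each server. The device `/dev/sdb`, the VG name `vg_bricks` and the sizes are placeholders; thin provisioning is what enables Gluster snapshots later on.

```shell
# Turn the spare disk into a thin pool and carve out one thin LV per brick.
pvcreate /dev/sdb
vgcreate vg_bricks /dev/sdb
lvcreate -L 512M -T vg_bricks/thinpool
lvcreate -V 512M -T vg_bricks/thinpool -n devconf_0

# File system, mount point, and a persistent mount across reboots.
mkfs.xfs /dev/vg_bricks/devconf_0
mkdir -p /bricks/devconf_0
echo '/dev/vg_bricks/devconf_0 /bricks/devconf_0 xfs defaults 0 0' >> /etc/fstab
mount /bricks/devconf_0
```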
I like to add another directory behind the actual brick mount point, so that when the server boots and mounting fails for whatever reason — or someone fat-fingers something, which a lot of users do — Gluster sees that the directory inside the brick is missing and refuses to write into the empty mount point. So I like to add another directory inside the brick, to make sure that everything's there when we need it. Not everyone might be aware that you can do interesting shell expansions: the two host names are put there between braces, and it actually expands to this whole command. So if you add multiple servers, you can make the command a bit shorter to execute. Okay, so we created the volume, and this is how the volume looks. It says created, and it lists the bricks. The next step is to start the volume, and the only difference afterwards is that it says started. The people who are executing this on their laptops can see that many processes have been started now. So different processes are now running, and, well, I'm not going to explain what they all do. Some of them provide configuration files, log files and, well, the different port numbers that are used, and you can see a lot. So there's a volume status command, and here it says the NFS server is not running: rpcbind wasn't started, so the NFS server couldn't come up. That's just a minor thing in the old packages that are currently in the CentOS Storage SIG; the next update, in a couple of days, will fix it. The bricks: each brick is its own user space process, and each brick listens on its own port. So the brick on this server is on this port, and the same port is used on the other server — you have to read the whole line to make sense of it. Process IDs are listed as well, so in case something is not working as you think it should, you can check the process ID and see if the process is stuck in the kernel, things like that. You can use helpful tools like pstack: if the process is completely blocked, you can see what's happening there.
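The create/start/status sequence above, written out. The volume name `devconf` and hosts `server0`/`server1` are placeholders; the brace expansion produces one brick per server, and the extra `data` directory inside each brick mount is the safety net just mentioned — if the brick file system isn't mounted, the directory is missing and Gluster refuses to use the bare mount point.

```shell
# Distributed volume over two servers; server{0,1}:/bricks/devconf_0/data
# expands to two brick specifications.
gluster volume create devconf server{0,1}:/bricks/devconf_0/data

# For a replicated variant instead, consecutive bricks form replica pairs:
#   gluster volume create devconf replica 2 server{0,1}:/bricks/devconf_0/data

gluster volume start devconf
gluster volume info devconf
gluster volume status devconf    # per-brick ports and PIDs, NFS/self-heal state
```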
And for the people debugging things, that's often very useful. So this is how we created and started the volume; let's see that we can actually use it. It really doesn't matter from which host name you mount. I mount from localhost, because the client runs on this particular storage server, but I could pick any of the servers. And the first line says: okay, we've mounted localhost on /mnt, and it's there. The traffic will not actually all go over localhost, though. On mounting, the localhost server returns the volume layout, and the volume layout contains the host names and the brick ports and everything the client needs to know. So the FUSE client, in this case, connects to the servers by host name, resolves the host names, connects to the ports, and then actually goes and does the I/O. You create a file, /mnt/readme, and the file is just there. It's just a normal file system — nothing funky, no object-access kinds of protocols. You can do anything with the file that you can do on a normal file system. I mentioned that the bricks contain the data. We have a distributed file system, so this server doesn't have the readme file; it is distributed to the other server. Well, hopefully — if it's not there, then you probably should all just walk out and... yeah. So it's distributed over the servers: the readme file is on the other server and not on the local one. Audience: do the user names need to be synchronized between all of the clients? Well, yes — for permission checking, the UIDs and GIDs and everything need to match between the clients. But no, they don't need to exist on the server side, because we just store the files on the server, and the server doesn't really care what username is attached to a UID or a group. We transfer UIDs and GIDs one-to-one to the server side, so the server doesn't care about the username or anything.
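The mount-and-use demonstration, sketched with the same placeholder names as before (`devconf` volume, bricks under /bricks/devconf_0/data):

```shell
# Mount over FUSE; any server in the pool can act as the volfile server,
# it only hands out the volume layout at mount time.
mount -t glusterfs localhost:/devconf /mnt

# Use it like any normal POSIX file system.
echo hello > /mnt/readme
ls -l /mnt

# On the servers, DHT placed the file on exactly one brick of the
# two-brick distributed volume — check both to see where it landed:
ls /bricks/devconf_0/data/
```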
Audience: what's the use case behind this cross-wise type of distribution? If I understand it right, you create, update and write files on server one, and they land on server two, and vice versa. So of course, if the writing machine dies, the data is still accessible on the other one — but what's the use case? Well, it depends really on your workload. In general there's no need to mount on the Gluster servers themselves: you would have a storage environment and you would have clients, and then you don't have this distinction. In this case there's no locality of... well, there is locality of data, but not on the create path. There is locality of data on the read path, because the storage server that has the data will reply fastest when you open a file. We have, for example, extensions for Hadoop — Hadoop as a MapReduce, big-data kind of tool. What Hadoop does is move the actual procedures — the calculations, the data reading, whatever it does — to the servers that have the data, so that the data does not need to get transported over the network continuously. That is much more efficient with big data than transferring all of it over the network would be. So you want your big-data framework to move these particular jobs to the servers that have the data and execute them locally on those systems. Writing the data in Gluster is not local, but you can figure out which servers hold it: we have special extended attributes that the Hadoop plugin uses. It asks for this particular extended attribute on a file, the result describes which servers contain the data, and then the Hadoop scheduler knows: okay, we are going to move this particular piece of the job to the servers that have the data, and the data is local to the actual processing. Okay, so this is how you use it. So I was going to say: you can now check, for example, details about the volume.
So, I explained about the translators. Let's go. This is the translator stack that gets sent to the client process. The client process has a translator called debug/io-stats, a translator called performance/md-cache — the metadata cache — a translator performance/open-behind, quick-read, io-cache... a lot of translators are listed on the client side. Most interesting for now is the distribution logic. DHT is our distribution translator, called cluster/distribute, and as I mentioned before, it uses bricks or sub-volumes, which here are basically the same thing; we call them sub-volumes internally. DHT, because it's client side, needs to communicate with servers, and it has two sub-volumes. These sub-volumes are listed just above it. The order doesn't really matter, because they always reference each other by name. All of these mention a sub-volume, except for the protocol/client translator. The protocol/client translator doesn't mention a sub-volume; it says: this brick is located on this remote host, and this is the actual path of the brick. The ports of the bricks are more or less dynamic, so the client actually asks the server which port this brick is running on, and the reply tells it: this particular brick is running on this particular port. Distribution uses, in this case, two servers, or two bricks. The brick processes use a different translator stack, and these are two distinct processes running on two distinct servers. So if we take, for example, this particular sub-volume — this was the directory, and it's running on this server — in the other window we see the ID, the hostname part of the brick, and the path of the brick, where the slashes are replaced with dashes. So that's basically how Gluster sets up all of these details. You have log files for everything.
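The distribution idea can be sketched as a toy model. This is not Gluster's actual algorithm (DHT uses a Davies-Meyer hash against per-directory hash ranges stored in extended attributes), and `pick_subvolume` is an invented helper — but it shows why no metadata server is needed: every client hashes the file name and deterministically arrives at the same sub-volume.

```shell
# Toy model of DHT-style placement: hash the file name deterministically,
# then map the hash onto one of N sub-volumes. All clients compute the
# same answer, so file lookup needs no central metadata server.
pick_subvolume() {
    name=$1
    subvol_count=$2
    # cksum produces a deterministic CRC for the name
    crc=$(printf '%s' "$name" | cksum | cut -d ' ' -f 1)
    echo $((crc % subvol_count))
}

pick_subvolume readme 2
```

Whichever index this prints, every client that evaluates the same name against the same layout gets the same result — that is the property the real DHT relies on.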
So this is the log file. You have one binary, and this binary receives the volume file — the volume file contains the layout, the stack of translators, for this particular process. So it's one binary, and it doesn't matter whether it's a brick process or a client process: the binary loads all of these translators in this particular order, and brick processes just get a different stack of translators than clients do. At the bottom of the brick process is the POSIX translator, which speaks to the file system: open calls, write calls, read calls, everything. You have different translators on the bricks as well, and each piece of functionality is basically its own translator. The top translator of the brick process handles the network I/O: it's the protocol/server translator, which receives the client protocol data over the network, and that's how they talk to each other. There are different options and different things you can pass on. Audience: the communication between the nodes, is it unencrypted, and what's the authentication? Yes, so we don't do encryption by default. If you want, you can configure it to use SSL encryption. We are planning to support Kerberos; the protocol is very similar to NFS, so adopting Kerberos is a pretty natural evolution of our protocol. SSL is not per user — it's set up per client, so all users on a client using SSL encryption share the same credentials. It's all hidden from the users, but in the end one user's data gets encrypted exactly the same way as another user's, which is not always what you want. Sometimes you have multiple users that really are not allowed to read each other's data, even if they are able to capture the network traffic, so we need a per-user encryption mechanism, and Kerberos would offer that.
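Enabling the SSL encryption just mentioned is a matter of volume options — a hedged sketch, again with the placeholder volume `devconf`; it assumes the TLS certificate and key (e.g. /etc/ssl/glusterfs.pem) are already deployed on every node and client.

```shell
# Encrypt the brick I/O path in both directions.
gluster volume set devconf server.ssl on
gluster volume set devconf client.ssl on

# Optionally restrict which certificate common names may connect.
gluster volume set devconf auth.ssl-allow 'server0,server1,client0'

# Encrypting the glusterd management path is toggled by a marker file:
touch /var/lib/glusterd/secure-access
```

Note that this is per-connection, not per-user encryption — exactly the limitation discussed above that Kerberos support is meant to address.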
Also, SSL isn't always the easiest to maintain. A lot of companies already have a Kerberos infrastructure available, and if we can use that same infrastructure, that's beneficial for us, but also for a lot of companies. Audience: how well does it cope with applications which write data in a way where an interruption can hurt the consistency of the files? There are transactional servers — databases, for example — where taking a snapshot is a bit of a lost cause and can cause data loss. How does that work? I'm not sure if I understood, but Gluster offers a snapshot capability, yes, and the question is about applications which do not really work well with that. The replication is client side and per file, but for snapshots we use thin-provisioned LVM on the back end, and we orchestrate the snapshotting of all of the bricks through the Gluster command line. So I would think that if your application works on LVM snapshots, it should work with Gluster on LVM snapshots. Obviously Gluster adds a bit more delay, because it goes over the network and orchestrates between a lot of pieces, but your application should just work with that. For applications that write like that, there's fsfreeze and tools like that, which should prevent those inconsistencies. If you do hit inconsistencies and have a very clear example of how to get to them, we surely would like to know, and would like to see how we can fix that — that's why we hook this into our Gluster tools, so Gluster takes care of it for you. It's surely not easy to do, that's true, but that's why we have developers working on those things. Sorry? It's already available, and it's even user-serviceable. So, I'm not really sure — how much time do I have left?
Ten minutes? Oh, 40 minutes, actually. 13 minutes. 13. I can explain a little bit more about the release schedule, but if you really want to know, you can ask me later; that's not the most important thing. So we have three different stable releases at the moment. 3.5 introduced several features — for example, the brick failure detection I mentioned earlier: what happens when a disk fails? Earlier versions tended to hang or become non-responsive, but the brick process would still be there, and that confused clients. We fixed that. Improved SSL support: SSL has been available for a while already, but it was really difficult to configure correctly, and we improved that a bit. Snapshots were introduced in 3.6. Monitoring got improved with better logging, and erasure coding got added as a volume creation option. 3.7 added yet another set of features, even more than the releases before it. Tiering — hot and cold content — is one of the features that people really would like to see. Windows users, and maybe Mac users, are used to deleting a file and then figuring out where it went, because they actually didn't want to delete it, so we have a trash option. The Linux people seem to know what they're doing — all these requests come in from the Samba folks and Mac OS users.
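The snapshot feature mentioned above is driven entirely from the Gluster CLI, which orchestrates the thin-LVM snapshots on every brick. A hedged sketch with the placeholder names used earlier (`devconf` volume, `snap1` snapshot); it assumes the bricks sit on thin-provisioned LVM as prepared before.

```shell
# Create and inspect a volume-wide snapshot.
gluster snapshot create snap1 devconf
gluster snapshot list devconf
gluster snapshot info snap1

# Restoring rolls the whole volume back; the volume must be stopped first.
gluster volume stop devconf
gluster snapshot restore snap1
gluster volume start devconf
```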
BitRot detection is a very interesting feature: we need to be able to detect whether data gets rotten. Disks get bigger and bigger every year, and if the read-failure rate per megabyte stays the same, then with bigger disks you get more megabytes per disk, and the chance of read errors or silent errors on a disk increases. So we need a way to detect this earlier and better. BitRot detection is one of the things that can do that: you sign the data that gets written, and periodically, or on read, you check whether the signature is still correct. Not all disk failures get propagated up through the layers of the kernel or the file system. Some of those errors are bit flips; they can happen in the cables — it doesn't have to be the disk itself, it can be in the firmware, it can be practically anywhere in the stack. But we need to detect that as well as we can, and BitRot detection is one of the ways to do it. Some people have an interest there: Facebook is one of our biggest users, and they have Gluster clusters running, I think, on ext4, but they have shadow clusters running on Btrfs, and that's how the Btrfs developers test their file system — Gluster seems to exercise lots of Btrfs corner cases very well. Btrfs is one of the options: if you want to use Btrfs checksumming instead of the Gluster-provided BitRot detection, you already can, and you can use Gluster's BitRot detection on top of Btrfs as well. But in the end you probably want to be able to check for failures on every layer. I don't know what kind of interface Btrfs offers to applications for detecting these errors, but you want to detect errors at a low level and have those error messages propagated up the stack: hardware, disks, cables if possible, chipsets, then somewhere the kernel driver comes in — every layer
would need to propagate error detection, so that you can actually follow up where an error happened and say: this particular piece of hardware is broken, or this particular piece of the file system is broken, or actually this is a user space error. So I'm not aware of any current plans, but it surely is something we want to think about; one file system alone is maybe not sufficient to rely on. Currently we advise XFS, because XFS is extremely well tested and proven to work very well for large disks. So that's where we are now. Audience: just to understand it right — you're thinking about doing bit-rot detection all the way up the local server stack into the daemon, with the transport and all the rest of it out of the picture? So our current bit-rot detection lives only in our own processes: we sign the data, store the signature in an extended attribute, and verify that it is all still correct. Anything below that — I don't know if, for example, LVM would be able to provide us with an API that says: we wrote something to disk at one point, but now we're reading something else, or we have particular errors that are not necessarily disk-initiated. We do have dm-verity, which provides that kind of "this is the content I put on and it's still valid", but I'm not sure if that will help here. Well, so thank you — and you two should talk together; he is the main developer of the BitRot feature.
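Turning the BitRot detection described above on is, again, a volume-option affair — a hedged sketch for the placeholder volume `devconf`. Enabling it starts the signer and scrubber daemons; the scrub frequency and its I/O throttling can be tuned.

```shell
# Sign written data and periodically verify the signatures.
gluster volume bitrot devconf enable

# How often to scrub, and how aggressively to consume disk bandwidth.
gluster volume bitrot devconf scrub-frequency weekly
gluster volume bitrot devconf scrub-throttle lazy

# Report files whose on-disk content no longer matches its signature.
gluster volume bitrot devconf scrub status
```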
So, one of the heavy features we added is tiering. I was planning to show it a little bit, but we don't have that much time left; if someone wants to see it, I can show it afterwards. You take your existing volume, you create another brick — possibly on an SSD — and you attach this brick to the volume; you can do replication and other things for the tier as well. Sharding — splitting up files into little pieces — is very useful for big files like VM images. Audience: isn't the client the problem here with performance? Because I think the main problem is on the client side: you can scale out the whole cluster, but when you have, like, a thousand nodes, you have the bottleneck of the client's single network link. So, the clients talk to all of those servers, right? The load gets distributed, so it's not really a single point in the network. But if you have many clients, you definitely want your storage to be fast enough. Clients tend to have maybe 100-megabit or maybe gigabit network connections; servers tend to be connected to switches with at least 10 gigabits, or maybe even InfiniBand, or multiple 10-gigabit connections, so they can serve many more clients, and all of those clients together surely are able to saturate that. Okay, so, sharding: splitting up files. We improved virtual machine workloads with it really a lot, because one single virtual machine actually addresses multiple servers — the shards, the pieces of the virtual machine image, get distributed over many servers. So one hypervisor actively talks to many servers for one virtual machine's storage. It also helps a lot with recovery time and other things, because we're healing smaller files, without the need to recover one huge file at once. Enabling it is as simple as enabling the feature. By default it's 4 megabytes; for virtual machines, I think 512
was one of the sweet spots where people tested it. NFS-Ganesha support has improved: we integrated NFS-Ganesha natively, and we want to make NFS-Ganesha the default NFS server, because it's a full-featured NFS server, instead of Gluster/NFS, which only does NFS version 3. It's also pretty simple. 3.8 gets released, hopefully, in the May-June timeframe. Tiering gets more flexible; that's one of the highlights. There was a talk about Heketi yesterday, which manages Gluster and adds more automation — automating the replacement of failed bricks is made possible with Heketi. I think we'll support SEEK_DATA/SEEK_HOLE, which is a virtual machine use case, so we'll send patches to QEMU to actually use it. And some more plans. Sorry? Very good question: each file that you open, or each file on which you execute the stat system call, gets checked to see whether the file contents are in sync on the other bricks, and that's a huge overhead. So doing build environments on Gluster should just work fine; it's just not fast. We are working on improving it: with our DHT version 2 we are going to improve that really a lot, hopefully. Are you using FUSE or NFS? NFS was even worse; FUSE has become a lot better, and you can also tune the lookup-unhashed option — that helps. So, it's a pretty good question: doing a build on Gluster is normally not very smart; the result of your build you should probably put on Gluster. If you have any other questions, at least my colleague and I, and maybe some of the other Gluster developers, will be around the room to answer them. You can also happily send emails to our lists, we are on Freenode on IRC, and any questions can be answered there. Yep. Okay, well, thanks for joining. Audience: I was hoping that you would show the monitoring part of this. Okay, yes — if we have a little time left, I can show you that. And you want me to copy the slides to you — you want the PDF
or the ODP? Both, probably — PDF. I can do both on Sunday. Okay. Thank you all.