That's also where the OpenGFS branch came from. There are a lot of good people working on that, and since it's a fork of the same code, most of what I say about GFS kind of applies there too. Okay, so I say GFS, but there are really a lot of different parts; basically there's a whole list of them. There are the lock modules: you can have a very simple way of doing the locking, or you can plug in a different one. There's a configuration system for your cluster, which is pretty simple. And there are a couple of different cluster managers and lock servers. The centralized ones we wrote earlier are kind of old; we have a lot of customers on them, and that's the product that's out there now. There's also a newer symmetric cluster manager and a distributed lock manager that lets the locking scale better. We have a clustered volume manager, which we're going to talk about in a minute or two, and there's clustered snapshot and mirroring work getting integrated with all of that. So, a bit about GFS, just to get everybody up to speed on what it is and how it differs from what people are used to. Here's the picture of a single big NFS server, where one server owns the storage and everybody else has to talk to that server. The disks belong to that one server and nobody else can touch them; if you want to read or write something, you have to go through it. This is the classic NFS setup, what a lot of NAS appliances and things like that are. It's a single point of failure: if that one server goes down, everybody is disconnected and nobody can read or write. And if you want to upgrade it, you either replace it with a bigger server, which costs a lot more money, or you add another server and split the namespace between them somehow. Then you have a data replication problem if you want the same content on both of them, or you partition your namespace by some scheme. The thing that's really changed all this is fast shared interconnects. With Fibre Channel you can have disk drives and disk arrays out on the network, so the disks aren't owned by any one particular machine anymore; they're owned by a whole cluster of machines. Fibre Channel is kind of the leading one of those right now; at the lower end you can use FireWire; iSCSI, which is basically SCSI over IP, is coming along as well, and there are some others. So what happens is you get a system where you have some sort of storage area network and a bunch of machines talking to it. They talk to the storage with a standard block protocol, and they can talk to each other over IP, of course. What you need on top of that is a file system that lets multiple machines mount the same storage. It looks a lot like a local file system, except one machine wants to see something the other one has just written, so there has to be messaging between them to flush the data that one machine has cached back to disk so the other one can read it. What GFS does is the locking between the different machines so that they see a coherent view of the data on disk, and so that when two machines write to the same disk at the same time, in the same place, you don't corrupt your data.
So what do people actually use this for? Things like web-serving clusters. You can think of it as a bunch of machines sitting around a big data set, each one running a copy of the web server, all serving the same data. And part of GFS is that the machines watch out for each other, so when one fails, another one will replay whatever journaled data the dead machine was writing at the time. So you can have this highly available, big web-serving setup; you put something in front of it, either round-robin DNS or some kind of load-balancing director, to spread the requests out, and GFS sits underneath it. Shared root is another interesting one, a really interesting way of managing a cluster. You can make it so that all the machines boot off the same disk, all of them running GFS as the root file system, so that when you install a package or a program, all the machines see it immediately. It makes your configuration a lot simpler, because if you have a cluster of ten machines you don't have to install things ten times; you install once and all the machines see it immediately. Then there are scientific clusters, which is kind of where the GFS community started. It started as a big compute project at a university, where there were just massive data sets that needed to be churned through, genome data and things like that, the kinds of workloads people wanted to run on clusters. Or you have a big source tree that needs to be built and you run a parallel make across all of the machines; they all have access to the source files, and if that's your workload it works very, very well. And then parallel databases, like Oracle: the database instances coordinate among themselves, and you can run the database on GFS so you don't have to manage raw devices. So, some of the GFS basics. It's symmetric. There are a number of cluster file systems, for example, where there's one machine that does all the metadata work and the other machines just do data. That's an easier way of going about it, but the problem is you still have a single point of failure and a bottleneck at the metadata server. What GFS does is have every machine do both metadata and data operations, so you can't point at one machine and say it matters more than any other. It's journaled: each machine has its own journal in the file system, and if one machine dies, another machine knows which journal belonged to it, replays that journal, and things go on. It's 64-bit: we started from the ground up in a big-data environment and wanted to make sure that scale was possible. And it works as a local file system: you can run GFS as a local file system if you want. There are pluggable lock modules you can plug into GFS, and there's one that basically makes GFS behave as if the clustering isn't there. So you can run GFS, and I've seen a few people do it, as the root file system on your laptop. Some of our goals were flexibility in terms of locking: making the locking pluggable means swapping out how locking works underneath GFS is a lot easier. We were very focused on one particular kind of locking at the beginning, the community told us that was a bad idea, and they were right. So, like I said, these modules plug in so you can do different kinds of locking across your infrastructure.
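Just to make the pluggable-locking point concrete, here's a minimal sketch of mounting GFS in single-node mode. The device and mount point are made up; lockproto is the mount option that historically selects the lock module, with lock_nolock being the one that gives local, no-cluster behavior.

    #include <stdio.h>
    #include <sys/mount.h>

    /* Illustrative only: mount a GFS volume in single-node mode by
     * selecting the "no-op" lock module instead of a cluster lock
     * manager.  Device and mount point are placeholders. */
    int main(void)
    {
        if (mount("/dev/vg0/gfs_lv", "/mnt/gfs", "gfs", 0,
                  "lockproto=lock_nolock") < 0) {
            perror("mount");
            return 1;
        }
        printf("GFS mounted locally, no cluster locking\n");
        return 0;
    }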
And also flexibility in terms of storage transport. As long as it looks like a block device to GFS, and it doesn't mind multiple machines reading and writing to it, we can use it. So there's a big list of things that work there. One of the big things about GFS is that it can run as a local file system, but it's not a straight local file system port. Taking a local file system and trying to make it into a cluster file system is very hard, because some of the assumptions a local file system can make don't really work well in a cluster environment. A local file system is all about locality, trying to pack things in as closely together as possible. A good example is inodes: you pack as many inodes as you can into one block. That's a big problem for GFS, because you want to be able to lock the different files individually, and the smallest unit you can lock is a block; if you have more than one inode in a block, you can't lock them independently. So you need to space things out to get good performance across a bunch of files. The other thing is that locking is everywhere, so you have to worry about deadlock ordering and things like that, which makes things more complicated. The journaling we do is a little more complicated too, because you have multiple journals, and those journals can hold multiple different versions of the same block. One machine modifies a block, the next machine modifies it, a third machine modifies it; say the first machine crashes and you replay its journal. You do not want to play that old data over top of the newer data the other machines wrote, so you have to make sure you only replay things that are actually newer than what's on disk. And locking is simply more overhead: you basically get a lock per file, and when you're churning through lots of files you have lots of locks to acquire, and since the locality I was talking about is gone, they're not packed together anymore either. Another thing is data structures. For example, converting an inode number to a disk address can't be done with a centralized table; you have to do it in a way that doesn't create a lot of network traffic. If you're doing a df on a local file system, the superblock has all the df information in it. That can't work the same way with GFS, because a centralized data structure that every machine has to update on every allocation becomes a bottleneck that kills your performance. So that kind of thing got pushed out of GFS; we try to keep things as distributed as possible. There are some places where you really do need one value that everybody agrees on, and you have to be careful about how you change it. And we do a little bit of relaxing with quotas. If you think about quotas, there's basically one current value of how many blocks a user has allocated, and that can be a bottleneck. So we loosen it a little: quotas are fuzzy, but there are mechanisms to make sure the overrun is bounded and doesn't get out of hand. Things like atime are kind of hard to do as well, so atime is fuzzy too; it's kept accurate within a certain tunable amount rather than exactly. The internal layout is pretty straightforward. You have some volume down here, CLVM or whatever, with GFS on top of it, and GFS has this glock layer that basically lets it talk to the different lock modules and cluster managers. GULM is the centralized one of those, and then there's the DLM, which sits on the cluster manager infrastructure and does the locking in a distributed way.
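Going back to the journal-replay point for a second, the idea is roughly this. The structure and the generation field here are just an illustration, not GFS's actual on-disk format.

    #include <stdint.h>
    #include <string.h>

    /* Illustrative sketch: during journal replay, only write a journaled
     * block back if it is at least as new as what is already on disk.
     * The "generation" field stands in for whatever per-block version
     * information the file system keeps. */
    struct block_copy {
        uint64_t generation;    /* bumped each time the block is journaled */
        char     data[4096];
    };

    static void replay_block(struct block_copy *journaled,
                             struct block_copy *on_disk)
    {
        if (journaled->generation >= on_disk->generation)
            memcpy(on_disk->data, journaled->data, sizeof(on_disk->data));
        /* otherwise another node already wrote something newer: skip it */
    }

    int main(void)
    {
        struct block_copy disk = { .generation = 7 };
        struct block_copy jrnl = { .generation = 5 };   /* stale copy */
        replay_block(&jrnl, &disk);                     /* nothing overwritten */
        return 0;
    }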
So GFS, the other interesting thing is that GFS only knows how to talk to a block device. GFS doesn't care what kind of data transport happens underneath it at all; whether the blocks are moving over Fibre Channel or over IP down there doesn't matter, it's all very abstract. So I have these big lists of features that we can kind of breeze through. ACL support. Journaling, which we already talked about. Extensible hashing for directories, which basically means directory lookups stay fast no matter how big the directory gets. You can add space and journals as you go, so the file system can grow online. There's a lot of caching in there now, so you aren't going back to the lock manager or to disk for every little thing. And multiple blocks get grouped into transactions so you can make sure things hit the disk in the right order. Okay, so now the newer stuff in GFS. These are some of the things we've added recently that people can get now. Asynchronous locking: before, when you had to acquire a bunch of locks, you would ask for one, wait for it to come back, then ask for the next one and wait for that to come back. Now you can fire off a whole slew of lock requests and they complete as they complete, and that's a good performance increase. Quotas, which everybody expects to be there; in a cluster file system there are a lot of little things that go into making them work. Shared writable mmap, which I'll get to in a minute. We do direct I/O, and we've set it up so that if you're doing non-allocating writes you can do the writes in parallel; so for a database where the file is already laid out on disk, you can write to it from as many nodes in parallel as you want. Lots of improvements there. Journaled data support. Cluster quiescing support: if you're doing snapshots in the volume manager or in the hardware, you can run one command on one machine that tells the whole cluster to quiesce the file system so it's clean, then you do your hardware snapshot, hit another switch, and it lets the cluster go on. So you can do that kind of snapshot for backups and things like that. Then mmap improvements, which is the shared mmap thing I mentioned a minute ago. You can do a shared memory map of a file on multiple machines, start accessing that memory, and it's coherent across the cluster. Writing a byte on one machine means the next read of that byte on another machine sees the new value, which is kind of neat; it's basically distributed shared memory. We also have context-dependent path names. There are tags you can put into your symlink names, and when GFS sees that tag it replaces it with the hostname or the machine type. So you can have stuff in the cluster that's specific to a certain node or to a certain type of architecture. If you're running a mixed cluster with machines of different architectures together, you can make that transparent by pointing them at different directories.
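Going back to the shared mmap for a second, from a program's point of view it looks something like this. This is just a sketch; the path is made up, and in practice each node would run equivalent code against the same GFS file.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Illustrative: map the same GFS-backed file on several nodes with
     * MAP_SHARED.  A byte stored by one node is seen by the next read
     * on another node, because the file system keeps the pages coherent. */
    int main(void)
    {
        int fd = open("/mnt/gfs/shared-flag", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        unsigned char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = 42;              /* node A writes...                      */
        printf("%u\n", p[0]);   /* ...node B running the same code sees  */
                                /* 42 on its next read of that byte      */
        munmap(p, 4096);
        close(fd);
        return 0;
    }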
Now, a little more detail on some of these features. The asynchronous locking: the lock modules and the glock layer interface actually didn't have to change much. There's a callback the lock module can send to the file system, so you send out a request, remember that you made it, and it calls you back when it completes. There are two new prefetches built on that. When you're running an ls on a directory, in order to do the ls it has to stat every file it finds. GFS now prefetches the locks for all those files so that you're not waiting synchronously, sequentially, for each stat. That's one thing. You can also fire off a whole set of requests and then wait on them, and statfs works that way. Each chunk of the file system, each resource group, has its own statistics that tell how many blocks are allocated, how many are free, how many inodes there are. A statfs has a bunch of slots and starts firing out requests for each of those allocation groups, so basically you gather the statistics from all of them at once instead of asking one at a time. Unlock is another big thing. Synchronous unlocks were a bit of a burden in the code: whatever process happened to be the last one to touch something would end up having to sit there and wait for the unlock. Now we just do an asynchronous unlock: you tell the lock module to put the lock on the unlock list and you forget about it, and when the callback comes in it handles the rest of the unlock. Then the quotas. Like I said, there's this one value, the count of how many blocks a user has allocated. If you think about the case where one user is running a job on 100 computers and they're all writing blocks, then on every allocation, every time you go to write, you would have to check the current value of the number of blocks that person has and also change it, and if that were kept in one central place, every process would be constantly reading and updating that one value. That would be a huge bottleneck. So what happens is we take the amount of space that's left between the user's quota limit and what they currently have, and divide it up among the machines that are doing the allocating. Each one gets a certain portion of that, and they can independently chew through that amount of space before they have to go back to the centralized quota file and resync. Basically it spreads things out so you aren't constantly hammering that one value. The problem with that is you can get overruns in certain situations, where people allocate up to their share on some machines and then stop while others keep allocating. But you can sync your local changes to the quota file more often as you get closer to the limit, so there are a couple of knobs there: you can limit the maximum amount of overrun. In theory the overrun is bounded at twice the limit, and you can make that smaller if you want, but in practice you don't see anything more than 5 or 10 percent, and it's there and it works.
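The arithmetic behind that is basically just dividing up the slack. A toy sketch, with made-up numbers and names, not GFS's actual code:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative arithmetic only: give each node an independent slice
     * of the user's remaining quota so nodes don't have to touch the
     * shared quota file on every single allocation. */
    int main(void)
    {
        uint64_t limit     = 1000000;  /* user's quota limit, in blocks       */
        uint64_t allocated =  400000;  /* blocks the user already holds       */
        uint64_t nodes     = 100;      /* machines allocating on their behalf */

        uint64_t slack    = limit - allocated;
        uint64_t per_node = slack / nodes;   /* local allowance per node */

        printf("each node may allocate %llu blocks before resyncing\n",
               (unsigned long long)per_node);
        return 0;
    }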
Withdraw is another new feature. There are problems where you hit errors and you want one machine to be able to leave the cluster in a way that doesn't hurt everybody else and doesn't take the whole node down. If a machine has an I/O error, whether it's a bad cable or some sort of consistency error, the idea is that you want to preserve the integrity of the cluster over that single node. The previous way we handled that was panicking, which of course users didn't like, so we recently added this withdraw mechanism. GFS has its own little block-device layer that sits on top of the regular block device, so whenever we get into a situation where we have to leave the cluster, we can stop any new I/O and wait for any outstanding I/O to complete. Once that happens, we can guarantee GFS won't touch that device again. It can then leave the cluster for that file system, call the lock module, and ask it to do what is basically the equivalent of all the recovery steps, except that the node that had the problem stays up. So if you're running another, non-GFS application on that node, it keeps running, but you still preserve the cluster's integrity. That's just another improvement we've added.
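Just to spell out the order of the withdraw steps, here's a rough sketch. Every function here is a stand-in stub, not a real GFS symbol; the point is the ordering, not the API.

    #include <stdio.h>

    static void block_new_io(void)            { puts("refusing new I/O"); }
    static void drain_outstanding_io(void)    { puts("waiting for in-flight I/O"); }
    static void leave_lockspace(void)         { puts("leaving the lockspace for this fs"); }
    static void trigger_remote_recovery(void) { puts("asking lock module to run recovery"); }

    int main(void)
    {
        block_new_io();            /* 1. GFS stops issuing I/O to the device  */
        drain_outstanding_io();    /* 2. wait until nothing is in flight      */
        leave_lockspace();         /* 3. drop out of the cluster for this fs  */
        trigger_remote_recovery(); /* 4. recovery runs as if the node died,   */
                                   /*    but the node itself stays up         */
        return 0;
    }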
Okay, so a bit about the lock modules. Like I was saying, the locking is pluggable, and if something does locking you can plug in pretty much any one you want. There's a fair amount of contract involved, so it's a bit of work to do, but you can do it. The lock module interface is very lock-centered; it isn't really a cluster-manager interface, because most of what GFS needs from it is lock-specific. There's only a very simple cluster piece to it: if you can do a basic set of operations with locks and lock value blocks, you can build what GFS needs on top of that. So, some of the pieces I was talking about earlier. There's GULM, the centralized lock and cluster manager we ship today: you can run redundant lock servers so that if one dies another takes over, and it comes in a couple of different forms. The locking there is not GFS-specific, but it is centralized, and it's the older one. One reason we're keeping it around is that we haven't done the benchmarking yet to figure out when it's better to do centralized locking and when it's better to do distributed locking. You can make a very good argument that for a very large cluster you do not want to do a fully distributed membership transition: if you have a thousand nodes all participating, those transitions scale with the number of nodes in the time they take, so you could get better performance out of a centralized membership server. That's one of the things we're keeping it around for; we want to be able to decide which one is better where. Okay, so there's that, and then there's the new infrastructure that's intended to be part of the RHEL4 product. The 2.6 version of the GFS cluster infrastructure will come out with RHEL4, and it's what we'd like people to be using going forward. There's CMAN, the cluster manager, which is a kernel-based cluster manager, and it exposes membership to both GFS and CLVM. We'd like it to work with other applications as well; we've been talking about integrating with the HA stuff that Alan has been working on, and things like that. We're trying to keep things as modular as possible: there are multiple pieces, but they work together as well as possible, and you can also use them independently, so you can run CLVM without GFS and so on. It's newer, though; we're still testing it, it's in QA now, and it should be ready around RHEL4 update one. Okay, so the cluster manager is heavily influenced by VAX clusters; the guy who's writing it worked with VAX clusters for a long time, so a lot of the interfaces are modeled on that. It handles membership events: you can ask it who the cluster members are right now, and you can ask it to tell you when somebody shows up or when somebody goes away. It also handles starting and stopping the core cluster services: when you have a cluster transition, you have to stop GFS and the DLM for a while until the new membership is set up; once you have that, you can tell the DLM to wake up and do lock recovery, and once lock recovery is done, GFS can do its recovery and things go on from there. There's also a level of starting and stopping services in user space: a separate user-space piece handles things like starting and stopping a service or moving an IP alias around. You can get at these commands from both the kernel and user space; GFS was one big reason for the kernel interface, and there's a daemon for the user-space side. The cluster manager is currently in the kernel. We've gotten a lot of feedback from the community saying that membership is really a lot of policy, how do you want to decide cluster membership, and when we first started we put it in the kernel; the feedback has been good, so we're working on moving more of it out. Each cluster has a unique name, and you can have multiple clusters on the same network; each node knows which cluster it belongs to. Nodes broadcast to join, and when the other nodes detect that a node has come up there's a transition, where all the nodes agree on the cluster membership. Then there's quorum. Quorum is very important in clusters. You want to make sure you don't have a split-brain kind of deal, where a network partition or something puts part of the cluster on one side and part on the other, and neither side knows about the other, so you really have two clusters both thinking they own the storage. Quorum lets you count votes: each node gets a certain number of votes, and the side of the split that keeps running has to have more than half of the total votes.
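The vote counting itself is about as simple as it sounds. A little sketch, not CMAN's actual code:

    #include <stdio.h>

    /* Illustrative vote counting only: a partition keeps running only if
     * it holds more than half of the expected votes, so two halves of a
     * split cluster can never both think they are the cluster. */
    static int has_quorum(int votes_present, int expected_votes)
    {
        return votes_present > expected_votes / 2;
    }

    int main(void)
    {
        int expected = 5;   /* e.g. five one-vote nodes */
        printf("3 of 5 votes: %s\n", has_quorum(3, expected) ? "quorate" : "not quorate");
        printf("2 of 5 votes: %s\n", has_quorum(2, expected) ? "quorate" : "not quorate");
        return 0;
    }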
On top of that the cluster manager provides services. There's the HA part of the equation, fencing: when a node goes down, it gets fenced, so you make sure a node you think has failed can't come back later and cause trouble. And, as I was saying, GFS and the DLM need a kind of layered recovery. As soon as a cluster transition happens, the whole stack is involved: you might have the fence daemon, the DLM, CLVM, and GFS all running at the same time. When one of these transitions happens, the cluster manager needs to stop everything, then figure out, at the lowest level, who's in the cluster. Once that happens, you need to wake up the fencing, which makes sure the nodes you think are dead are actually dead; then you can do the DLM recovery, then the CLVM recovery, and then the GFS recovery. We have a service manager, part of the cluster manager, that does that. All these different components register at different levels, and it basically does this layered recovery. It's also symmetric: whenever you mount GFS or start a lockspace, the service manager knows what's happening and knows which nodes are participating, so if a node goes down and you know it was running GFS, you know what recovery has to happen. Like I said, the membership manager tells the service manager what happened, and the service manager manages the layered recovery. The DLM looks very similar to other DLMs people may have seen. It's based on lockspaces, and you acquire locks within a lockspace; when you mount GFS, it joins a lockspace. The DLM currently runs in the kernel, and we're thinking we'll probably keep it there. Latency is the big factor: GFS uses a lot of locks, there's one per inode for a start, so you can easily have hundreds of thousands of locks held at a time and thousands of lock operations a second, so latency is very important. That's why we're thinking it will stay there, and it depends on the cluster manager for membership. Then fencing. We have pluggable fence agents for different types of hardware. There are network power switches: you have a bunch of nodes plugged into a power switch, and you can tell the switch to cut the power to a node's port; once the power is off, you know that node is not going to come back with its old state. What you're really trying to avoid is the situation where a node goes away for a while, gets stuck in an interrupt or something and sits there for an hour, and then something happens and it comes back to life exactly the way it was before, thinking it still holds all the locks it had, and finds its way to the disk. Then you have two different machines that think they hold the same lock, both writing to the file system, and you can very quickly trash your file system that way. Fencing prevents that: it makes sure the node you think is dead can't talk to the storage. Power fencing is one way to do it; I/O fencing is another. You can fence in the connection layer, at the Fibre Channel switch: basically you tell the switch not to let that node's port talk to the disk anymore, and that prevents the situation where a node wakes up and tries to write. You can also do fencing in the I/O device itself. SCSI has persistent reservations, which are kind of built for this: you can tell the disk to only talk to these five machines, or these five HBAs, and when a machine dies you go back to the storage device and take it off that list. And GNBD, the network block device driver that we have, lets you do fencing as well.
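To give a feel for what a fence agent is, here's a hypothetical skeleton. The parameter names and the way it reads its options are assumptions for illustration, not a spec; the real agents each drive a specific piece of hardware and only report success once the node really is cut off.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical fence agent skeleton: read simple name=value options,
     * act on the target node, and signal success or failure through the
     * exit status, which is what the cluster trusts. */
    int main(void)
    {
        char line[256], target[128] = "", action[128] = "reboot";

        while (fgets(line, sizeof(line), stdin)) {
            if (sscanf(line, "target=%127s", target) == 1) continue;
            if (sscanf(line, "action=%127s", action) == 1) continue;
        }
        if (!target[0]) {
            fprintf(stderr, "no target given\n");
            return 1;
        }

        /* A real agent would talk to a power switch, an FC switch, or
         * the storage itself here. */
        fprintf(stderr, "would %s node %s\n", action, target);
        return 0;   /* 0 = fencing succeeded, nonzero = failed */
    }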
So basically there are lots of different places you can do fencing, and different people have different preferences. Some people are nervous about having a power switch connected, because they don't want their machine forcibly power-cycled, since that kills processes that have nothing to do with the file system. Some people don't want software messing with their Fibre Channel switch; the switch configuration is complicated and they don't want a program going in and changing things. We have something like twenty fence agents you can pick from, and most of them are scripts or small C programs, so it's pretty easy to write another one. The reason they aren't better known is that we haven't documented them as well as we should have, so there's kind of a learning problem right now, but it's something we'll document soon. Fencing is also one of the places where GULM and the new cluster manager differ. With GULM there's a centralized server that knows what the cluster transitions are, and it's the one that does the fencing: the cluster configuration system has the fencing information for each node, the GULM server looks up the node that needs to be fenced, pulls the method out of CCS, and issues the commands. That's a very centralized way of doing it. With the new cluster manager we're trying to keep things as symmetric and distributed as possible, so it works a little differently: every machine has the fencing configuration, and whenever the cluster manager sees a cluster transition, it decides who is going to do the fencing, and that node issues the commands. Then CCS, the cluster configuration system. We have XML-based configuration files that define the cluster. There are a number of settings in there; the biggest things are the cluster name, the nodes, and the fencing method for each node. Each node has to know which cluster it belongs to and what its own configuration is, and the hard part is making sure every node sees the same configuration; that's basically what CCS is for. In our older products there are two shared devices: there's a cluster configuration device that holds this data on a little shared block device, and then you have the file system on its own shared device. That's worked in the past, but it's been kind of annoying because you have to have this extra shared device. The new CCS basically replicates the configuration out to every node and keeps it in local files on the local file system, so it doesn't need shared storage anymore. Then CLVM. It's basically a user-space daemon on top of LVM2 and device-mapper, and it manages volume events across the cluster. When you extend a clustered logical volume, it basically suspends the volume across the whole cluster, does the metadata update, and then lets everybody re-read the metadata. So basically it lets you do LVM2 across a cluster.
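The ordering there is the important part. Here's a stub sketch of what extending a clustered LV amounts to at a high level; these functions are stand-ins, not the real CLVM or device-mapper interfaces.

    #include <stdio.h>

    static void suspend_lv_on_all_nodes(void)  { puts("suspend LV cluster-wide"); }
    static void update_lvm_metadata(void)      { puts("write new LVM metadata");  }
    static void reload_mapping_everywhere(void){ puts("every node re-reads and resumes"); }

    int main(void)
    {
        suspend_lv_on_all_nodes();   /* nobody may issue I/O mid-change   */
        update_lvm_metadata();       /* one node makes the change on disk */
        reload_mapping_everywhere(); /* all nodes pick up the new layout  */
        return 0;
    }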
We're also working on clustered mirroring and snapshot targets, which is a huge thing right now. GFS requires shared storage, a block device that all the nodes can see. What some people have asked for is this: say you have three nodes that each have internal disks you're not really using. It would be really nice if you could export those internal disks over the network, do cluster mirroring on top of that so the data is replicated across all three disks, and then run GFS on top of that. Then you've built a cluster out of commodity hardware without an extra storage server, and you can survive a node failure. Clustered snapshots are in the same boat as clustered mirroring, and it's coming along pretty well. So, future work. A big item is small-file performance, which has been an issue in GFS and is something we're working on. Small-file performance is really hard, like I said earlier, because you're churning through a large number of locks and there's a lot of per-file overhead; it's also hard because you're churning through a lot of metadata. GFS has had some problems with that in the past, and we're working on fixing it with better read-ahead of blocks and some reworking of how we do things so the read-ahead works better. Our fsck has been slow in the past; the one being built for the new product is a lot faster, and it will probably get backported as well. The shared-root setup I was talking about, where all the machines use the same root file system: people have gone out and done it in the field, but it's not nearly as pretty as it should be, so packaging it nicely and writing good instructions is something we're working on. Then there's metropolitan or wide-area clustering, where you have a data center in one place and a data center in another and you mirror between them, so that if one building catches fire things can automatically fail over to the other. That means being smarter about where you do allocation and where you read from, things like that. Backup becomes a huge problem on any large file system; just copying the data out can take days, so if you have to churn through the whole thing for a backup you have a real problem. If the file system can keep track of which blocks have changed, it can speed that up a lot by only handing you the blocks that need to go out. These are more the farther-out future items. Being smarter about locality of locking and allocation is another big one. We're pretty good about allocation: different nodes allocate out of different resource groups, so if one node is allocating in one group, another node goes and uses a different one. The place where there's still a problem is when a node that isn't mastering a lock wants that lock a lot, because it has to keep going back to the node that masters it. What we're talking about now is making it so the lock mastery can migrate to the node that's actually using the lock, and figuring out when it's worth moving it one way or the other. So, I'll take questions now.