All right, hello everybody, I am Prasanna Kumar Kalever. How are you doing? Good? So, I am here to introduce Gluster and the improvements we have made in Gluster for the virtualization use case. Here we go. I will give a basic introduction to Gluster; my colleague Ramesh has already covered most of it, but I will recap, since some of you came into the room late. Then I will introduce hyperconvergence, show why libgfapi performs better than FUSE, and cover the QEMU-libgfapi integration that gives high availability of servers, the Unix domain sockets for I/O that we brought in recently, the sharding feature that pushed performance further, and finally dynamic authentication. I have a few demos, so I am going to go a little fast; please interrupt me if you feel anything is missing.

Okay, this is GlusterFS. The basic building block of Gluster is the brick. We have a few nodes, each node has some bricks in it, and we combine the bricks across these nodes to form a volume. A volume can be replicate or distribute; previously there was also stripe, but we are not encouraging that anymore. Gluster can be accessed with FUSE or with libgfapi. That is the introduction to Gluster.

This is hyperconvergence: hyperconvergence is a box that has the compute, storage, network, and virtualization pieces all put together, with some management layer on top with which we can manage the box.

Okay, libgfapi and FUSE. With FUSE, my application, here QEMU, talks to the kernel FUSE layer; the request goes through /dev/fuse to the FUSE library, from there the Gluster client side sends it to the brick side, and from the brick side it finally reaches the backend filesystem, XFS or whatever. libgfapi is an API that applications can use directly; QEMU can use it to talk straight to the bricks. We have seen the performance numbers: libgfapi performs almost 60% better than FUSE; here is a graph with the numbers.

Okay, now I will tell you about the multiple volfile servers problem that we recently solved. oVirt is the manager on top of everything, a GUI kind of thing. Information like which volume to select, what kind of volume, how many bricks, which nodes, all of that is passed to VDSM. VDSM converts it into an XML file and gives it to libvirt; libvirt converts the XML it got from VDSM into command-line arguments for QEMU; and QEMU uses those arguments, the volfile server and volume information, to access Gluster.

So, multiple volfile servers. Say my QEMU command uses the URI syntax with a single IP, 10.70.something. I have an n-node Gluster trusted pool, but I am picking just this one IP and trying to load my VM with it. If this node goes down, the VM cannot start, even if we have a hundred nodes in the pool, so what is the point of having a hundred nodes? The solution is that we need to pass many IPs, the information of several nodes, to QEMU, so that if one node goes down it can access another node. So we have started fixing it.
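To make the problem concrete, the old single-server invocation looks roughly like this; the IP, volume name, and image name are made up for illustration:

```sh
# Old URI syntax: exactly ONE volfile server can be named.
# If 10.70.1.1 is unreachable when the VM starts, the boot fails,
# even though the other replicas hold the same data.
qemu-system-x86_64 -m 2048 -enable-kvm \
    -drive file=gluster://10.70.1.1/sample/fedora23.qcow2,format=qcow2,if=virtio
```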
As I told you, we started from the libgfapi side: on the GlusterFS side we fixed how libgfapi is initialized with multiple servers, and then we came to QEMU, where we pass those servers as arguments. I am going to show a small demo of how to create a trusted pool and a replica 3 volume, and then the QEMU setup with GlusterFS and the libvirt setup.

So I am taking three Gluster nodes and creating a trusted pool out of them: IP1, IP2, IP3, and there we go, it is a Gluster pool. Now I create a Gluster volume named sample with a replica 3 configuration, I start the volume, and I mount it. [Audience: so of the four nodes, one is just the client in this case, right? Three nodes with bricks and one that just mounts?] Yes, yes. The command sequence is roughly the sketch below.
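For reference, this is approximately what that demo runs; the IPs and brick paths are hypothetical:

```sh
# On node 1 (10.70.1.1): probe the other two nodes to form the trusted pool
gluster peer probe 10.70.1.2
gluster peer probe 10.70.1.3
gluster peer status

# Create a replica 3 volume named "sample", one brick per node, and start it
gluster volume create sample replica 3 \
    10.70.1.1:/bricks/b1 10.70.1.2:/bricks/b1 10.70.1.3:/bricks/b1
gluster volume start sample

# On the fourth, client-only node: FUSE-mount the volume
mount -t glusterfs 10.70.1.1:/sample /mnt
```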
Okay, so previously we set up a Gluster pool with the replica 3 volume; now I am configuring QEMU. Here are the patches sent to QEMU for the availability of multiple servers; you can find them easily. I configure QEMU for x86_64, enable GlusterFS, and switch on some debug options. That is all about configuring QEMU. On the libvirt side we have a few patches as well; I am going to tell you what they do, but before that, let us configure libvirt with the GlusterFS storage driver and install it.

Now I am going to show you a small demo of how useful it is to have multiple volfile servers. Are you able to see? I recorded it in high resolution, sorry for that. Basically, what I am doing here is invoking QEMU with the URI syntax, and since the server is available, it boots properly; there, it has booted successfully. Here I kill one of the nodes in the Gluster volume, and I execute the same URI-syntax command we used previously. This node is now disconnected, and I get "transport endpoint is not connected". So even though we have a big replica 3 setup, QEMU is unable to find a server.

Now, this is the new JSON syntax introduced in QEMU; the patches are not yet merged. It carries all the information, and the next slide shows all the options it is using in detail. Here the first IP was not connected, so it got "transport endpoint is not connected", it went on to the second node address, and then the VM booted.

Okay, so what we have seen is the old URL syntax and its demerits, and then why we went for a new JSON syntax. Here is the JSON: actually, libvirt converts the XML information given to it by VDSM into this kind of command. In the JSON syntax given as input to QEMU we have all the information: where your image is, how many servers we are using, and we can put in as many servers as we want; you can choose which transport type to use, and give the image path, the volume name, and all of that. I will sketch it below.

[Audience question] So what you have is better availability of the servers; does that answer your question? One important point: once the machine has started and a node goes down, libgfapi will just switch the machine over through its reconnect logic anyway; it is only when you want to start the machine that one of the given IPs has to be available. I will answer the rest of your question on the next slide; the demo will answer it too.

So, in libvirt, the recent patches give you a formatter and a parser: the formatter formats exactly this kind of information into the command line, and the parser is for the reverse direction, which matters when we take snapshots and the backing file information has to be stored. It is like: you have a Gluster volume with so-and-so name, and you have addresses for it; whether each one is an IP or a hostname does not matter. And from the VDSM side, this kind of XML will be generated, with the larger number of available servers in it.
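The JSON syntax looks roughly like this, per the QEMU patches at the time; the exact key names may differ in the version that eventually merged, and the pool, volume, and image names are again made up:

```sh
# The json: pseudo-protocol filename carries the volume, the image path,
# and an ARRAY of volfile servers; QEMU tries them until one connects.
qemu-system-x86_64 -m 2048 -enable-kvm \
    'json:{"driver": "qcow2",
           "file": {"driver": "gluster",
                    "volume": "sample", "path": "/fedora23.qcow2",
                    "server": [{"type": "tcp", "host": "10.70.1.1", "port": "24007"},
                               {"type": "tcp", "host": "10.70.1.2", "port": "24007"},
                               {"type": "tcp", "host": "10.70.1.3", "port": "24007"}]}}'
```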
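And the disk section of the domain XML that VDSM and libvirt would generate for this looks roughly as follows; the multiple <host> elements are what the libvirt patches add, and all the names here are illustrative:

```sh
# Hypothetical <disk> fragment of the domain XML, written out for the demo
cat > gluster-disk.xml <<'EOF'
<disk type='network' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source protocol='gluster' name='sample/fedora23.qcow2'>
    <host name='10.70.1.1' port='24007' transport='tcp'/>
    <host name='10.70.1.2' port='24007' transport='tcp'/>
    <host name='10.70.1.3' port='24007' transport='tcp'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
EOF
```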
Okay, I am going to show a demo which demonstrates using virsh to start a VM, understanding the domain XML, and how the backing store information is recorded when we create a snapshot. Sorry for the interruption; we have a slow network here. We configured libvirt previously, the one installed on the local machine with the patches, and we have absolutely no VMs running. I am going to use this XML with all the backend volfile servers in it, with the transport and the volume information. Okay, let us define this XML and start the VM. Now I dump the XML that actually got loaded: here is the same server information, so we are accessing the image the Gluster libgfapi way, and this is the node information with the IPs. You can also set the transport type, which may be RDMA, Unix, or TCP; I put all of them on TCP.

Okay, now I am trying to create a backing file. This is my original image, and I am creating a new image named newsnap.qcow2 on top of it, the JSON way, so that if one of these servers goes down... in fact we have killed glusterd on this one, so it tries the next server, and it still creates the backing file. There we are, the new file is created: an image whose backing file is the fedora23 qcow2, and here is the same fedora23 server information.

Sorry, can you repeat the question louder? [Audience: does it check all of them?] Yes, it goes and checks whichever node is available: it tries the first server, and if that fails it goes to the next. That is something we need to drive from the oVirt side, actually: this information was created from the XML, and if the XML has that information, the patches are able to create the JSON syntax with the addresses of all the nodes, and it will try them in turn. Yes, it is a batch of three here.

[Audience: isn't it easier to just patch libgfapi itself, so you hand multiple IP addresses to libgfapi and it deals with all of them?] So, about that: QEMU initializes libgfapi, the glfs init, with all the servers that we have given here. If you do not provide the information to QEMU, if you do not input it, libgfapi will not be initialized with the addresses of those nodes. [Audience: but is QEMU the right place to convert and create this? If the addresses were handled inside libgfapi they would be transparent, and all consumers could benefit.] I am not sure I fully get your point, but from my observation libgfapi does not have the intelligence to discover all the servers on its own, so we need to provide them externally, for example with the QEMU command, and then it gets initialized with those IPs.

Okay, so now I am going to create a snapshot, with these three IPs as the information for taking the external snapshot. Here is my snapshot XML. This is how I create the snapshot: I use the XML file I showed you before, and there we go, we have a snapshot. All the backing chain information gets loaded into the domain XML: if you dump the XML, we have all of it here; this is our new image, and the backing store for that image is our old fedora23, with all the server information for each.

So: we have seen how to create and start a VM using virsh, we understood the parts of the domain XML where the Gluster information goes, and we created a snapshot whose backing store is the VM image. What are the improvements we found? Since we use libgfapi, which runs through the list of servers and connects to the first available one, we have fewer context switches and less copying of information from user space to the kernel, to /dev/fuse, and onward to the brick; and now we also have better availability. I am going to cover the third point next.

So, the third point: we have introduced Unix domain sockets. It goes like this. This is the client side of GlusterFS, and there we have glusterd running; it is a brick process... sorry, glusterd is the management process, and glusterfsd is the brick process. All the volume-related management is taken care of by glusterd: glusterd runs on all the nodes, and glusterfsd runs on the nodes which have bricks. So the management traffic goes over the glusterd communication bridge, while all the data, if you have a mount here and write something, goes over the client-to-glusterfsd bridge.

Now, in the hyperconverged case everything is in the same box: the client and the brick are on the same machine, which basically means your VM image and the QEMU application that runs it reside on the same host. So instead of the usual TCP connection between these two, over which the real data gets transferred, it is better to have a Unix domain socket, which boosts performance. We have seen an average improvement of about 13% with this change; you can see more detail here, for example the write performance at a 128 KB block size, which is a normal case, improves by almost 13%, and you have improvements in other areas like reads as well. Let us see a demo which shows the Unix domain sockets in action: we start a volume, we mount the volume on a node where a brick is local, and we observe the UDS in action.
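Stepping back, the virsh and qemu-img sequence behind the snapshot part above was roughly the following. This is only a sketch: the domain name fedora23, the file names, and the IPs are hypothetical, and whether your libvirt accepts a network-type source in the snapshot XML depends on the patches discussed here:

```sh
# Define and start the VM from the domain XML, then inspect it
virsh define fedora23.xml
virsh start fedora23
virsh dumpxml fedora23        # the disk <source> lists all three hosts

# Creating an overlay image by hand, giving the backing file in the JSON
# syntax so any listed server can satisfy the open (here glusterd on the
# first node was killed, and the create still succeeds via the second):
qemu-img create -f qcow2 -b 'json:{"driver":"qcow2",
    "file":{"driver":"gluster","volume":"sample","path":"/fedora23.qcow2",
            "server":[{"type":"tcp","host":"10.70.1.1","port":"24007"},
                      {"type":"tcp","host":"10.70.1.2","port":"24007"}]}}' \
    gluster://10.70.1.2/sample/newsnap.qcow2

# External disk-only snapshot through libvirt, with a snapshot XML that
# again names all the volfile servers
cat > snap.xml <<'EOF'
<domainsnapshot>
  <name>newsnap</name>
  <disks>
    <disk name='vda' snapshot='external' type='network'>
      <source protocol='gluster' name='sample/newsnap.qcow2'>
        <host name='10.70.1.1' port='24007'/>
        <host name='10.70.1.2' port='24007'/>
        <host name='10.70.1.3' port='24007'/>
      </source>
    </disk>
  </disks>
</domainsnapshot>
EOF
virsh snapshot-create fedora23 snap.xml --disk-only
virsh dumpxml fedora23        # the disk now points at newsnap.qcow2
```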
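And the UDS demo that follows runs roughly these commands; the ss check at the end is my assumption about how to observe the sockets, and the socket paths vary by version:

```sh
# Two-node pool; a plain 2-brick volume named "uds", one brick per node
gluster volume create uds 10.70.1.1:/bricks/b1 10.70.1.2:/bricks/b1
gluster volume start uds
gluster volume info uds

# Mount on node 1 itself, so brick 1 is local to the client
mount -t glusterfs 10.70.1.1:/uds /mnt

# With the patches, the connection to the local brick is a Unix domain
# socket instead of TCP
ss -x | grep gluster
```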
So we have a Gluster pool of two nodes and no volumes in the cluster. I create a volume named uds with bricks from the two nodes and start it, and you can see the volume information. The second screen is node 2; the one I was driving previously is node 1, and on node 2 I see the same information. Switching to node 1, you can see glusterfsd running; it runs on both nodes because I am using two bricks, one from each. This node, the .186 one, is node 1 and sees that information, and the other one has brick 2. I am checking the Unix domain socket connections continuously, and now, on the same node, I mount the volume so that the brick is local to it, and there we go, it got connected. So what we have seen is: we mounted a volume on a node where the brick is local, and we observed the UDS in action.

Okay, so the third part was the improvements done with Unix domain sockets. And that is not all: we are also trying to use the same Unix domain sockets for DHT; it has some heuristics with which it can automatically find that a brick is local, and it will switch to UDS there too.

And here is sharding. Sharding is like this: consider we have a file of 84 MB; it will get split into pieces of whatever block size we have configured. Say 16 MB is the block size; I am going to show a demo of it. The 84 MB file is split into 16 MB pieces: 16 plus 16 plus 16 plus 16 plus 16 plus 4. The first piece is the main file; it is still 16 MB, and that is the one you can see from the mount point. The other pieces will be in the .shard directory, hidden files named with the GFID and an index. The GFID is unique for every file, and we have indexes 1, 2, 3, 4, 5, the last piece being the 4 MB one. So sharding splits the file, and each split piece is treated as a normal file by DHT, which spreads the pieces everywhere, so that we can use all the bricks in an efficient way. This solves a lot of problems, which I am going to explain next.

So here is the demo of it. I am going to create a volume with only one brick. By default this feature is off, so we need to turn it on; by default the shard size is 4 MB, and I am setting it to 16 MB. We have not started the volume yet. Now, from the mount side we create a file: this file is of size 84 MB, named file1. From the mount side we still see only file1, there are no shards visible, and if you check the size of the file it is 84 MB. From the brick side, the main file is 16 MB. The GFID is here, so file1's GFID is this, and with that I can see the shards that were created, with their respective indexes: I see five shards over here, the last one 4 MB. Now I change the shard size to 4 MB and try to append some more data into the same file, up to 120 MB, say; we still see the shard size as 16 MB, because it is configured for a file only once and cannot be changed. If I create a different file now, since the shard size is now 4 MB, the new file is created with its own GFID; this is its GFID, the base file will be in the root of the volume, and these are the shards created with it. The commands and option names are sketched below.

Sorry, which? Okay: sharding provides better utilization of the disks. How does it do that? We distribute all the shards across the different bricks, so the disks are used in a more efficient way, and the size of a file is not restricted to one brick anymore. Suppose a VM image is 2 GB when we create it and it grows to 20 GB, but the brick it got placed into has only 10 GB of space; with sharding, since we distribute the blocks across bricks, this problem is solved. The data blocks, the shard blocks, are distributed by DHT in the normal way, and healing happens at the granularity of shards: yes, all the files are now sharded, and they can be healed at the granular level of a shard.
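For reference, the sharding demo boils down to roughly this; the volume name, brick path, and mount point are hypothetical, and the getfattr call is one way I assume the GFID was read:

```sh
# One-brick volume; sharding is off by default
gluster volume create shardvol 10.70.1.1:/bricks/b1
gluster volume set shardvol features.shard on
gluster volume set shardvol features.shard-block-size 16MB
gluster volume start shardvol
mount -t glusterfs 10.70.1.1:/shardvol /mnt

# An 84 MB file looks like one file from the mount point...
dd if=/dev/urandom of=/mnt/file1 bs=1M count=84
ls -lh /mnt/file1                             # 84M

# ...but on the brick, the base file is 16 MB and the remaining pieces
# sit in the hidden .shard directory, named <GFID>.<index>
getfattr -n glusterfs.gfid.string /mnt/file1  # file1's GFID
ls -lh /bricks/b1/file1                       # 16M base file
ls -lh /bricks/b1/.shard/                     # <GFID>.1 ... <GFID>.5, last one 4M
```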
The next one is more of a Manila use case, and I think we do not have time for the demo, but I will tell you, it is very simple. Suppose we have a Gluster volume being accessed from some IP; I have not used any SSL scheme here. Previously, if I tried to reject some IP on a volume, say with auth.reject, Gluster was not going to take care of it right away: it was lazy until it got a remount, so you could see which IPs got rejected, yet that customer would keep enjoying the service until we remounted. With dynamic authentication the rejection takes effect without waiting for a remount. I think we do not have time for this demo.

These are the people who helped me get here: Raghavendra Talur, who takes care of libgfapi; Kismapoli; and most of the sharding work was done by Krithika. You are welcome to ask questions.

[Audience: with, say, three bricks, does it cover all of them, or does it just write over one shard?] That is a good question; actually, I think I missed that point there. The shard feature sits on top of DHT. Say a 64 MB file gets split into 16 MB shards: the first 16 MB of the file, or whatever block size you have configured, stays as a normal file, and when extra data gets appended, it is split into pieces of the configured block size; each shard is handed to DHT in the normal way, and DHT can distribute it to any of the bricks. Yes, any of the bricks. Any other questions?

[Audience: what about a huge number of shards?] Actually, each file has a unique GFID; the base file is stored in the root of the volume, and each shard is named with that GFID plus its respective index. Since the GFID is unique, even if you have created a million files in a volume, you will not have any conflict. And the sharding translator itself has some features to address those kinds of things.

Any more questions? No? Okay, it is nice to see questions. Thanks to the people for the questions.