Thanks for coming. My name is Svartaj Ranek. I work at Red Hat as a developer, and in this talk I'd like to talk about some of the ideas sanlock is built upon, to give you hopefully a good starting point to use sanlock, understand how it works, and eventually tune and configure it for your application. Before we start, how many of you know what sanlock is, or even use it? Okay, a couple of hands, but not everyone, so let me first briefly start with the motivation for why we actually need a piece of software like sanlock.

I guess all of you have heard about the setup where we have some storage area, basically a disk array, and we access these disks from compute nodes over the network. In this setup, when a couple of machines access the disk array, you sooner or later come into the situation where two or even more nodes try to access and write the same disk area. This will of course result in data corruption, and this is probably what you don't want to happen.

Probably the first idea that comes to mind for how to protect against this is to introduce some external lock manager which will control the traffic and allow various nodes to access different locations on the disk. This of course works, but it has a couple of drawbacks. Besides the additional round trip to this lock manager, what would happen if this component dies? Your application gets stuck, or at least won't be able to write any data, because you won't be able to get any locks from the lock manager which died.

So maybe we can do better, and maybe you can come up with the idea to somehow colocate the locks directly with the data the locks are trying to protect. And this is actually what sanlock does. It's a lock manager which uses shared storage for its processing and for keeping the locks. Of course you don't have to use the locks provided by sanlock to protect only the data. You can use these locks for whatever you want.
But using these locks to protect data on the shared storage is probably the most typical usage.

sanlock is built upon two algorithms: Disk Paxos and delta leases. In the following slides I will try to briefly outline these algorithms, because sanlock is a little bit hard to configure, or at least to configure right. So before you touch any sanlock configuration, you should really read the documentation carefully. Maybe you should also read pieces of the source code to be really sure what it does. This explanation will hopefully give you a good start, or at least make it easier to understand what's going on there.

So let's first start with the classical Paxos algorithm, because Disk Paxos is just a modification of classical Paxos. Are you all familiar with the Paxos algorithm? Maybe not, so I will start with a short introduction. Paxos is a consensus algorithm: a couple of machines try to agree on some single value. While it may sound pretty trivial, it's actually a very hard problem, one of the hardest problems in the distributed systems world, for example because of latency on a network which is unreliable. Leslie Lamport, who came up with the Paxos algorithm, was one of the first to successfully solve this problem.

So how does classical Paxos work? Each node which joins the consensus procedure tries to propose its value. In the papers this is usually called a ballot, and each proposal has its own number N, usually called the ballot number. In the first phase, called the prepare phase, each node proposes this ballot number, and other nodes, upon receiving this prepare message, respond with their minimum proposal number, which is the highest number they have obtained from other nodes.
So basically, if I send number two and one node has, for example, already accepted some proposal with ballot number one, it will accept my number, but it will also respond with the value it has already accepted. When I have received responses from a majority of nodes, I search through the replies. If there is already some accepted value, which means that there is a node in the cluster which already accepted some value, I can't propose my own value and have to use this value instead. But I can continue to the next phase, where I broadcast another message to ask the other nodes whether they accept this value. I still keep my number N but use this value. If my number is still the largest one, the nodes will accept this value and return OK. If not, they will tell me that they already have some higher number N and return this number to me, and then I have to abort and start from the beginning. If a majority of nodes respond that they accept my value, I won: this value is accepted, and I can broadcast that we agreed on it.

This may sound like two-phase or three-phase commit from the database world, and it is a little bit similar to three-phase commit. Do you at least roughly understand this? If yes, understanding Disk Paxos will be very easy, because it's just a very slight modification of this algorithm.

Again, it was done by Lamport and Gafni. Now we assume that we have N processors and M disks. Each processor gets a block on each disk where it writes a data structure. This data structure contains the ballot number and, again, the value that was already accepted, if there was any, together with the highest ballot number so far.
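
The classical Paxos round described above can be illustrated with a minimal sketch. It assumes in-memory acceptors and plain method calls in place of network messages, and it decides a single value only. The names Acceptor, prepare, accept and propose are chosen for illustration and do not come from sanlock or any library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Acceptor:
    promised: int = 0                     # highest ballot number seen so far
    accepted_ballot: int = 0              # ballot of the accepted value, 0 if none
    accepted_value: Optional[str] = None  # value accepted so far, if any

    def prepare(self, ballot: int):
        """Phase 1: promise to ignore lower ballots and report any accepted value."""
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted_ballot, self.accepted_value
        return False, self.promised, None

    def accept(self, ballot: int, value: str) -> bool:
        """Phase 2: accept unless a higher ballot number was seen in the meantime."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted_ballot = ballot
            self.accepted_value = value
            return True
        return False


def propose(acceptors: List[Acceptor], ballot: int, my_value: str) -> Optional[str]:
    """Run one round; return the agreed value, or None if the round must be retried."""
    majority = len(acceptors) // 2 + 1

    replies = [a.prepare(ballot) for a in acceptors]
    promises = [(b, v) for ok, b, v in replies if ok]
    if len(promises) < majority:
        return None  # somebody has a higher ballot number: abort and start again

    # If some node already accepted a value, we must use it instead of our own.
    accepted = [(b, v) for b, v in promises if v is not None]
    value = max(accepted, key=lambda bv: bv[0])[1] if accepted else my_value

    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks >= majority else None


nodes = [Acceptor() for _ in range(3)]
first = propose(nodes, ballot=1, my_value="node A holds the lock")
second = propose(nodes, ballot=2, my_value="node B holds the lock")
stale = propose(nodes, ballot=1, my_value="node C holds the lock")
```

In this run the second proposer has the larger ballot number but still ends up with node A's value, because that value was already accepted, while the proposer that reuses ballot number 1 has to abort.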
So basically, instead of exchanging messages over the network between the nodes, the nodes write this record directly to the shared storage, to a dedicated place for each node. We assume of course that the write operation is atomic, so we typically write exactly one block of the disk.

We start as before: for each disk I write my record, and at the same time I also read the records of all other processors which wrote their data structure on the disk. From what I read, I check the ballot numbers, and if I find out that there is some higher ballot number, I have to abort this process and start again with a higher ballot number. If not, I can continue and choose the value. Again, if there is already any value written, I have to choose the value written with the highest ballot number, and from now on I have to propose this value.

Then I continue to the second phase. I again write the value that I found in the previous step, again read all the records from other processors, and again check whether all ballot numbers are still lower than mine. If not, I have to abort and start again. And that's basically it. If I won the process, I can again broadcast the value which was selected and I am done; I have, for example, obtained the lock.

So far so good, but there are some unanswered questions. For example, how do I add more nodes? Maybe you didn't notice, and I didn't mention it, but if you want to know what the majority is, you have to know how many nodes are joining this consensus procedure. If you are only writing to and reading from the disk, how do you find out? And maybe an even harder problem is how to establish the mapping between the disk sector or block on the disk and the node which tries to write there. The nodes first need somehow to agree which node will write to which space. So that's why we need something
more than Disk Paxos, and this is solved by the delta lease algorithm introduced by Chockler and Malkhi. It's actually a pretty simple algorithm. Here is what the lease provides and here is the lease life cycle: you try to obtain the lease, if you are successful you hold it, and eventually you release it. The important thing is that if you obtain the lease, you are allowed to keep it only for some amount of time, which is called big delta. The algorithm also assumes that all operations are bounded by some amount of time, which is called small delta. It can be generalized to an unknown-delta delay model, but sanlock uses this known-delta model.

It is used in this way: the process tries to write a value to some location on disk and waits for some time. If the lock is not held, it waits two small deltas, reads this location again, and if the value is the same, it obtains the lock. If the lock is already held, it waits big delta plus small delta and repeats until it obtains the lock. It's quite a trivial algorithm, but the main drawback is that it may take quite a long time to obtain the lock.

So how are these two algorithms used in sanlock? The delta lease algorithm is used for acquiring a unique host ID for each host, which is a number from 1 to 2000 that tells each node which block on the shared storage it should use for writing Paxos leases. It is also used by sanlock to determine whether the machine is alive, because you have to renew your lease within this big delta. If you don't refresh your lease, sanlock will conclude that this machine is dead and will try to kill it. Then, when you have obtained your ID, which determines the place on the shared storage where you should write your Paxos leases, you can use Disk Paxos to compete for the leases with other processes.

In sanlock terminology, if you read the documentation, the delta lease area is called a lockspace and, as I
said, it prevents hosts from having the same ID. A sanlock limitation is that it's limited to 2000 hosts. This is because until not long ago only a size of one megabyte was allowed for the lockspace, and with a block size of 512 bytes the result is that it's limited to 2000 hosts. Recently this has changed and you can also use a different size, not to support a higher number of hosts, but to support disk drives which have a block size of 4 kilobytes. You can configure these parameters, but 2000 hosts is still a limitation hard-coded in sanlock. If you want, you can very easily change it in the source code, but you have to recompile sanlock.

sanlock internally sets this big delta to 20 seconds, so each process has to renew its delta lease within 20 seconds. So if your application uses sanlock, it's good to keep in mind that the lease will time out after 20 seconds. For example, if you have something built on top of it and you configure a timeout of one minute, that's nice, but sanlock will expire the lock after 20 seconds. The small delta mentioned in the algorithm before is 10 seconds in sanlock.

As I said, these leases are also used for checking whether a host is alive. sanlock is more robust and has other components, like a watchdog. After 10 seconds, if the lock is not renewed, sanlock will start the procedure of fencing your node, and the watchdog will, after another timeout, try to kill your machine.

Paxos leases are very fast to acquire, so they are suitable for real locks in an application. In sanlock terminology they are called resources. I said that when you won the ballot and your value has won, you commit the value somehow. In sanlock this is implemented in such a way that the record is written to the first block of the disk, so this is how you propagate the value in sanlock. It is also combined with delta leases: when you renew the delta lease, it also renews all your Paxos leases, so it's
combined in sanlock in this way as well.

So in summary, if you want to use sanlock in your application, you first need to prepare the directory structure where all these processes will write. This is called initialization of the lockspace in the sanlock documentation, and it has to be done only once. Then, whenever your application starts or a node joins the cluster, it needs to join the lockspace, which means it needs to acquire the delta lease to get the host ID, so the host knows where to write. When you acquire a lock for a resource, it means that you obtain the Paxos lease I spoke about before. And when your application stops, it leaves the lockspace, which means that it releases the delta lease.

As I said, sanlock is more complex and there are more pieces, like the watchdog, so this presentation could be much longer to explain all of it, but I have no time for that, so it's up to you. I would really encourage you: before you change anything in sanlock, please be sure that you know what you are doing and read the documentation carefully. I hope that this presentation gives you at least a first step and makes understanding the documentation easier for you. So that's all, thanks. Are there any questions?

Who is using sanlock?

I know at least one famous project which uses sanlock, and it's called oVirt. It's probably the best example, but I think it's also used by clustered LVM and probably other projects.

[inaudible] You have to renew the lock within 20 seconds, so does it mean it's my responsibility in my application, or is the library handling that, so I just obtain the lock from the library and the library handles it?

Typically it runs as a daemon on your host which does all the work for you, but it also provides a client library, so you can access it from your own code, from your application.

Can we leave disk caches enabled with sanlock?
Disk caches, I'm not sure. I would definitely not do that. You probably have to do direct I/O, because otherwise a value can get stuck in the cache and the lock would be useless. To be honest, I'm not completely sure, but it has to be implemented in sanlock using direct I/O.

I'm talking about hardware caches on the storage device itself. That was the question. Disabling them would kill my performance to the same device in the normal I/O path, but would it be required? You know, my answer is no, but anyway, that was the question: caches on the hardware device. An HDD has built-in management of its internal cache.

I'm not an expert on hardware, so I'm not really sure, but I probably wouldn't do that, because otherwise some value can get stuck in the cache and in the meantime two applications can think that they can access the resource.

[inaudible] which leads to the other question: does sanlock provide a recovery mechanism for that kind of situation?

No, at least not that I'm aware of.

Okay, so if there are no other questions: you are encouraged to read the papers, and if you read the Disk Paxos paper, there is homework for you. To be honest, I didn't notice any myself, but Lamport has a statement on his page that there are about a dozen errors in his paper, that he is not going to fix them, and that finding them is homework for the readers. So you are encouraged to find all the errors there. Thank you.
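
As a starting point for that homework, here is a minimal sketch of the Disk Paxos round as it was described in the talk. It assumes a single disk simulated as a Python list, one block per host, hosts that run their rounds one after another, and ballot numbers that are never shared between hosts. The names Block, mbal, bal, inp and run_round are chosen for illustration; this is neither sanlock's on-disk format nor the full algorithm from the paper, which also handles several disks and concurrent writers.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Block:
    mbal: int = 0              # ballot number this host is currently trying
    bal: int = 0               # ballot number under which inp was written, 0 if none
    inp: Optional[str] = None  # value written in the second phase, if any


def run_round(disk: List[Block], me: int, ballot: int, my_value: str) -> Optional[str]:
    """One round for host `me`; returns the chosen value, or None if it must retry."""

    def higher_ballot_seen() -> bool:
        # Read the blocks of all other hosts and compare ballot numbers with ours.
        return any(b.mbal > ballot for i, b in enumerate(disk) if i != me)

    # Phase 1: write my ballot number into my own block, then read the others.
    disk[me] = replace(disk[me], mbal=ballot)
    if higher_ballot_seen():
        return None  # abort and start again with a higher ballot number

    # If any value is already written, take the one with the highest ballot number.
    written = [b for b in disk if b.inp is not None]
    value = max(written, key=lambda b: b.bal).inp if written else my_value

    # Phase 2: write the value into my block, then read the others again.
    disk[me] = Block(mbal=ballot, bal=ballot, inp=value)
    if higher_ballot_seen():
        return None
    return value


disk = [Block() for _ in range(3)]
owner = run_round(disk, me=0, ballot=1, my_value="host 1 owns the lease")
late = run_round(disk, me=1, ballot=2, my_value="host 2 owns the lease")
low = run_round(disk, me=2, ballot=1, my_value="host 3 owns the lease")
```

The second host finishes its round but has to keep the value written by the first host, and the third host aborts because it reads a ballot number higher than its own.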