Hello. Our next speaker is Free Ekanayaka. He is a member of the LXD team at Canonical, and he will present Dqlite, a distributed embeddable database. Please welcome him.

Thank you very much. I guess you can hear me. So yeah, my name is Free, and that's my real name, not a nick, as weird as it may sound. As was mentioned, I am a member of the LXD team at Canonical. LXD is a container management system for system containers, and as part of it we started a sub-project called Dqlite when we had to implement clustering in LXD. So instead of having just one node running your containers, you can now have a cluster of them, and it behaves coherently, in a consistent way. We were using SQLite before, and we thought it would be a good idea to keep using that model.

Before I start, how many of you do not know about SQLite? I don't see many hands raised, actually none, which is expected, because SQLite is probably the most widely used database in the world, thanks to embedded systems, Android, and similar things. There are billions of deployments, and it's a very solid code base, very high quality, with a core team of maintainers that has been at it for more than 20 years now. There are many use cases for it. The SQLite author advertises it as a replacement for fopen, in the sense that it's as easy to use as a system call to open a file, but in return you get a structured way to manage your data. So it's used in embedded devices, in agents such as LXD, and in desktop applications: Firefox and Chrome, for example, use it to organise bookmarks and things like that, as you know.

So what do you get with Dqlite that you don't get with SQLite? High availability. The main problem with SQLite is that everything is local to your node. That's a good thing, because it simplifies everything and makes things faster, but it's also a single point of failure: in LXD, if you put everything in one node and you lose that node, everything is gone. With Dqlite we distribute the data across all the nodes of the cluster; the data is replicated. So in case a node fails there are automatic mechanisms for failover, and as long as your client can connect to another node in the cluster, you're good.

This is a very simple diagram of the classic way of using SQLite. You have your application process. SQLite is a C library that you link; depending on your application language you will use bindings, and the original API is C. And if you have an external client, either through an API or even the command line, it can drive your application.

With Dqlite the model is similar, except that you have more than one node. You have a cluster of processes, and your application runs on different nodes; each node typically runs one instance of your application process. Instead of talking to the SQLite API, you talk to the Dqlite API. It's still embedded in the process: Dqlite is a C library as well, which gets linked into your process. Inside Dqlite there is an engine which uses Raft to replicate the data, and SQLite itself to process the data, so it offers you the SQLite engine. There's an important difference: with SQLite's C API you talk locally to your node. With Dqlite, since it's based on Raft (we can talk about this later), you have to connect to the leader of the cluster. Is there a pointer somewhere? I didn't ask.
Anyway, you have to connect to the leader of the cluster, which can be any of those processes, and that leader will be the one actually serving your queries. So Dqlite retains the embedded part of SQLite, but there's no way to keep the 100% local part: you need to go through the network, and this is one important difference. Should any of your processes die, the data in your database is replicated, as I said, and your processes or your client, depending on where you put the logic, can discover the new leader, connect to it, and go on with the computation.

OK. "Talk is cheap. Show me the code." That's a quote from Linus Torvalds. So I will try to show you a very simple application. The goal is to demonstrate that you can write a Go program with 132 lines of code, using exclusively the standard library and the Go bindings for Dqlite, and with that program you actually get a distributed application, with a distributed database and fault tolerance.

The case study, just to make things a bit more interesting, is a medical device. One of the use cases I think Dqlite fits is devices, IoT: you might want redundancy in your devices. In this case I picked a pulse oximeter, which I didn't know about before. For those of you who don't know it either, it's a device to measure the oxygen saturation in your blood, so how much oxygen there is in your blood. I don't know much about the details, almost anything, but for the curious, it works with an LED. It emits light through your body at different wavelengths, and since the hemoglobin in your blood absorbs light differently depending on whether it's bound to oxygen or not, the light is absorbed differently, and there is another component at the other end, a diode, that essentially reads the amount of light that passed through. From that you can deduce the amount of oxygen bound to hemoglobin.

The algorithm will be very simple. We want to perform this measurement at regular intervals, so that we keep a history of the percentage of oxygen in the blood across time, and we want to expose an API so that we can inspect the average over the last five minutes, or one hour, or whatever.

The function to measure oxygen saturation is just a mock-up. We imagine we have this fancy hardware that does what I described, so it just returns a random value between 95 and 100, which is more or less the actual range you would find in your body. This is the function we will use in the program.

Then we initialize the database. For those of you familiar with Go (but it's the same with other bindings), this just creates a table where we will store the measurements along with the timestamp at which each measurement was made. And that's it: as I said, SQLite is very simple, you just have to open the database as you would open a file. So this part is probably very familiar to the SQLite users among you. Then we have, of course, to save our measurements: we have this persist-saturation function, which just performs a simple insert of the value that we read, and since the schema has DEFAULT CURRENT_TIMESTAMP, the timestamp is saved automatically. And finally, the last piece of the model in this little toy application is to retrieve the average saturation: we just make a SELECT query taking the average over the last number of minutes or seconds or hours, which will be part of the user's query. And that's basically it. You can put it all together.
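A minimal sketch of what this single-node data layer might look like in Go, using the standard database/sql package with the mattn/go-sqlite3 bindings; the function, table, and column names here are illustrative guesses, not the code from the slides:

```go
package main

import (
	"database/sql"
	"fmt"
	"math/rand"
	"time"

	_ "github.com/mattn/go-sqlite3" // registers the "sqlite3" driver
)

// measureSaturation mocks the oximeter hardware, returning a
// plausible oxygen saturation percentage between 95 and 100.
func measureSaturation() float64 {
	return 95 + rand.Float64()*5
}

// initDB opens (or creates) the database file, as easily as opening
// a file, and creates the table holding one row per measurement.
func initDB(path string) (*sql.DB, error) {
	db, err := sql.Open("sqlite3", path)
	if err != nil {
		return nil, err
	}
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS saturation (
		value REAL,
		at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
	)`)
	return db, err
}

// persistSaturation inserts one reading; the timestamp column is
// filled in automatically by its DEFAULT CURRENT_TIMESTAMP clause.
func persistSaturation(db *sql.DB, value float64) error {
	_, err := db.Exec("INSERT INTO saturation(value) VALUES (?)", value)
	return err
}

// averageSaturation averages the readings in the trailing window,
// comparing against datetime('now') so both sides are UTC strings.
func averageSaturation(db *sql.DB, window time.Duration) (float64, error) {
	var avg sql.NullFloat64 // NULL when there are no rows yet
	modifier := fmt.Sprintf("-%d seconds", int(window.Seconds()))
	err := db.QueryRow(
		"SELECT AVG(value) FROM saturation WHERE at >= datetime('now', ?)",
		modifier).Scan(&avg)
	return avg.Float64, err
}
```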
Your main function initializes the database and starts a goroutine. For those of you who are not familiar with Go, you can think of goroutines as threads of computation that go on in parallel with the rest of the program. This one is an infinite for loop that sleeps for some seconds, then makes a new measurement and saves it to the local database. Then we expose the service over the web with an HTTP handler: we take the tail parameter, which is how many minutes or seconds you want to average over, and return the average in that window of time. That's everything you need to do for this toy implementation.

We can see it in action. I can build the program. The -tags libsqlite3 flag, for those of you who know Go, is a way to tell the bindings to use the system shared library instead of compiling the whole thing into your executable, so it's a bit faster. I can then start the program, and if I perform a query asking, for example, for the last five minutes of values, I get the average back. This refreshes every 30 seconds or so, and all the data is saved locally, as you might expect: we have the history of measurements locally on the node.

As I mentioned before, this is simple and nice, but imagine a real-world situation. What happens if the underlying storage medium fails, so you don't have access to your data anymore, and maybe it's critical for assessing the health of the patient? Or you need to do some maintenance on your device and replace it, but you don't want to interrupt the monitoring. Or the hardware that performs the measurement has some issues: perhaps it's inaccurate, maybe even in small ways, so you might want to perform the measurement several times using several instances of the hardware and average the results, to reduce the error in case one of them is off. In these cases you want a cluster of devices, and that's a good use case for Dqlite. You would run this Dqlite application on, for example, three instances of your device, three pulse oximeters. Each oximeter would independently make the measurement and save it in the distributed database, and even if one of them dies it's still all good, because the others have the replicated data and can still serve your queries.

The changes to the code I just showed you are minimal. One thing you need to do in your main function is assign an identity to each of the instances of your application running in the cluster: a numerical ID which needs to be unique for each node. Very simply put, this is used by the underlying Raft engine to track which nodes have replicated which data. So we just pass an ID to the main function, and it gets passed on to the function that initializes the database. In turn, that function will not just connect to your local SQLite; instead it needs to start the Dqlite engine. As I mentioned before, there is some networking going on, and the Dqlite engine running in your process is the one in charge of it, completely transparently to your application. The start-engine function, which I will show you in the next slide, does that.
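Putting those changes together, a hedged sketch of what the clustered main might look like, continuing the sketch above (additional imports: context, log, net/http, os, strconv). The startEngine, joinCluster, registerDriver, and newNodeStore helpers are hypothetical names, sketched a bit further below; the ports and interval are illustrative:

```go
// Clustered main: the same toy program, plus a unique node ID and
// the three Dqlite setup steps just described.
func main() {
	id, err := strconv.ParseUint(os.Args[1], 10, 64) // unique per node, used by Raft
	if err != nil {
		log.Fatal(err)
	}

	node, addr, err := startEngine(id) // hypothetical helper, see below
	if err != nil {
		log.Fatal(err)
	}
	defer node.Close()

	store := newNodeStore()                            // hard-coded cluster addresses
	joinCluster(context.Background(), id, addr, store) // no-op for node 1
	if err := registerDriver(store); err != nil {
		log.Fatal(err)
	}

	// From here on, everything is as in the single-node version,
	// except the driver name passed to database/sql.
	db, err := sql.Open("dqlite", "oximeter.db")
	if err != nil {
		log.Fatal(err)
	}

	go func() { // the measurement loop, unchanged
		for {
			time.Sleep(5 * time.Second)
			persistSaturation(db, measureSaturation())
		}
	}()

	// GET /?tail=300 returns the average over the last 300 seconds.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		secs, _ := strconv.Atoi(r.URL.Query().Get("tail")) // error handling elided, as in the demo
		avg, err := averageSaturation(db, time.Duration(secs)*time.Second)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Fprintf(w, "%.2f\n", avg)
	})
	log.Fatal(http.ListenAndServe(fmt.Sprintf(":808%d", id), nil))
}
```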
There is another function, join-cluster: you need to tell Dqlite which nodes are part of your cluster. This is just the easiest way to do it; we can talk about membership management later on, but in this case each node simply tries to join, and if it has already joined in the past, it's just a no-op. And the third thing is to register the driver. This is a language-specific detail: Go, for those of you familiar with it, has a standard library package called database/sql, and you can register different drivers for different database backends. In this case we want to register the Dqlite driver, and we need to do it explicitly, because some other libraries just register their driver transparently. For example, with SQLite there is a single instance of the driver object which can be shared, but in our case you need to instantiate the Dqlite driver yourself, so you need to register it. This is mostly an implementation detail of Go.

To start the Dqlite engine you need a couple of things. As I mentioned, you need the ID that you will use for the node. You need a network address; in this case we just use one derived from the ID, by appending a number to the port. And you need the directory where the data will be saved. The data will not be saved in a regular SQLite database file; it will be saved in the Raft log. More about this later. Then there are a couple of options that you probably want to set. One is the bind address, which tells the Dqlite engine to start listening on that network address; all the traffic between nodes will go through that socket. You might also want to specify the network latency (we can delve into this later as well), which should be, essentially, the one-way latency between your nodes. For example, I specify 10 milliseconds here because I will be running on localhost, which is fast; depending on your network, you should specify more or less how long it takes to send one packet from one node to another. This is used internally by Raft to decide when it should consider the current leader dead, so that another leader will step in.

And that's it. As I mentioned, you have to join the cluster, and you can use the Dqlite Go client to tell the engine to join a node. We don't join if the ID is 1, because node 1 will be the first node of the cluster: the cluster has just one node at the beginning, so there is nothing to join. The other nodes just try to join, and if they fail they move on, assuming they joined before. We leave out error handling and such for this demo. Then we register the driver. This is an important part: the Go Dqlite client has to know the addresses of the nodes in your cluster, so in case one node, the current leader, dies, it can connect to another node. Here we just hard-code the list, but of course you can organize your code to discover the current list of nodes and wire it in when you build the client.
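And here is a hedged sketch of those helpers against the go-dqlite bindings (github.com/canonical/go-dqlite). The dqlite.New, WithBindAddress, WithNetworkLatency, client.FindLeader, client.NewInmemNodeStore, and driver.New calls exist in that package, but the ports, directory layout, and minimal error handling are simplifications of my own, not the talk's actual code:

```go
// Assumed imports, in addition to the earlier ones:
//
//   dqlite "github.com/canonical/go-dqlite"
//   "github.com/canonical/go-dqlite/client"
//   "github.com/canonical/go-dqlite/driver"

// startEngine starts the embedded Dqlite engine for this node; the
// address and data directory are derived from the node ID.
func startEngine(id uint64) (*dqlite.Node, string, error) {
	addr := fmt.Sprintf("127.0.0.1:918%d", id)
	dir := fmt.Sprintf("/tmp/oximeter-%d", id) // holds the Raft log, not a plain .db file
	if err := os.MkdirAll(dir, 0755); err != nil {
		return nil, "", err
	}
	node, err := dqlite.New(id, addr, dir,
		dqlite.WithBindAddress(addr),                   // socket carrying all inter-node traffic
		dqlite.WithNetworkLatency(10*time.Millisecond)) // one-way latency hint for Raft timeouts
	if err != nil {
		return nil, "", err
	}
	return node, addr, node.Start()
}

// joinCluster asks the current leader to add this node. Node 1
// bootstraps the cluster, so it joins nothing; as in the demo,
// errors are ignored, assuming the node already joined earlier.
func joinCluster(ctx context.Context, id uint64, addr string, store client.NodeStore) {
	if id == 1 {
		return
	}
	if leader, err := client.FindLeader(ctx, store); err == nil {
		defer leader.Close()
		leader.Add(ctx, client.NodeInfo{ID: id, Address: addr})
	}
}

// newNodeStore returns an in-memory store seeded with the
// hard-coded addresses of the three demo nodes.
func newNodeStore() client.NodeStore {
	store := client.NewInmemNodeStore()
	store.Set(context.Background(), []client.NodeInfo{
		{Address: "127.0.0.1:9181"},
		{Address: "127.0.0.1:9182"},
		{Address: "127.0.0.1:9183"},
	})
	return store
}

// registerDriver registers Dqlite with database/sql, wiring in the
// node store so the client can reconnect to whoever the leader is.
func registerDriver(store client.NodeStore) error {
	drv, err := driver.New(store)
	if err != nil {
		return err
	}
	sql.Register("dqlite", drv)
	return nil
}
```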
The whole program, as I mentioned, is 132 lines of Go code, and it has everything. I will show you how it works. So instead of compiling the first program I will compile this other one, and as you see, the final binary is less than 10 megabytes.

These are the shared libraries it's linked against. You can notice SQLite, of course, and Dqlite, and there are a few other libraries which are dependencies of Dqlite. There is libuv, which is an event loop, the same event loop used by Node.js: internally, Dqlite is a single-threaded asynchronous event loop, and it uses libuv for portability. It's also a very good library, a very thin wrapper around epoll on Linux. Then there is libco, which is a coroutine library; it's a one-file-long library adding support for coroutines, and if you're curious why we use it I can explain, but it's not particularly important for now. And there is libraft, which is the Raft library that we wrote as part of Dqlite to implement Raft.

OK. We can start the first node, and, as we did with the other one, it will return the current average of the oxygen saturation. Then we can start another node, which, as I showed you before, will automatically join the cluster that was created by the first node, and a third node as well. Now all the nodes are online and I can query them, and all of them return the same value, because the database is shared: all the nodes see the exact same view of the database. From their point of view it's as if they were talking to a local database; all the consistency properties that you have with a local database are maintained.

Now, the leader of the cluster will be the first node, because nothing disruptive has happened so far: the first node noticed it was the only one, so it became the leader, and when the other two nodes joined there was already a stable leader, so they didn't need to do anything. So if I kill the current leader... sorry... OK, anyway, it's eliminated. So what did I do? Say it again? Ah, I didn't kill the first one; OK, thank you. Now you can see that the first node doesn't reply anymore, but the other two do, since they elected a new leader; they already had a copy of the whole thing, so it's fine. If the old node comes back, it will immediately rejoin the cluster: it will connect to the current leader, whichever one was elected, it will have immediate access to the current state of the database, and in the background it will start to replicate what it was missing. We can do the same exercise with another node, and of course it will fail while it's down, but if we start it again it will be good. Yeah, that's it for the demo, so we'll just pause for some time before I delve into more details; if you have questions already, we can decide in which direction to go.

Say it again? Reads... ah, so you're asking what the consistency model is, essentially. OK. The CAP theorem, I'm not sure how many of you are familiar with it, but put simply, in any distributed system you have to make some choices around consistency. Since Dqlite is based on Raft, the choices made are, essentially, consistency and tolerance to partitions. So consistency is there: in Raft, reads are consistent. If you make a read, it will see the value of the last write. In terms of SQL, if you use transactions like we all know them, things will work as you expect, so there are ACID semantics, if you will, because the reads are served by the leader. This is probably the point that should make you understand why: the reads are served by the leader, so they always have the most recent value.
It is possible to have reads served by non-leader nodes, so-called follower nodes, for scaling. There is a way to do it without sacrificing consistency, retaining exactly the same consistency properties, but it's not implemented yet in Dqlite, so for now all reads are served by the leader. Does that answer the question? Go ahead.

Can you dynamically change the number of nodes in the cluster? Yes. Let's see... I don't have a slide for this, but the membership management lately got quite sophisticated. As you saw, the membership management in the demo was absolutely a toy, but you can change membership dynamically. One example that we use in LXD: say I have 20 nodes in my cluster, and I want three of them to be voting nodes. Those are the ones that are critical for your data, and you will survive the loss of one of them; or you can have five voting nodes if you want to survive the loss of two. Then you can have stand-by nodes, which do replicate data, do write data to disk, but don't participate in voting, so they are not a bottleneck as far as latency is concerned: you don't have to wait for them to complete their writes before reaching the quorum you need to say a transaction is committed. And you can also have spare nodes, which are just part of the cluster: they sit there doing nothing, they don't replicate data, they don't participate in voting. There is an API to assign roles dynamically to all these nodes. For example, in LXD we have code so that when a node leaves gracefully, we transfer leadership to another node if the leaving node was the leader, and we promote a stand-by node, which is a fast follower, to be a voter, to replace the lost voter, things like that. So there are advanced membership management options that you can use.
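As a small hedged illustration of that API, promoting a stand-by after losing a voter might look roughly like this with the go-dqlite client package; the Assign method and the role constants exist there, but the scenario wiring is illustrative:

```go
// promoteStandBy turns a stand-by (a fast follower that already
// replicates data but doesn't vote) into a voter, e.g. to replace
// a voting node that was lost.
func promoteStandBy(ctx context.Context, store client.NodeStore, id uint64) error {
	leader, err := client.FindLeader(ctx, store)
	if err != nil {
		return err
	}
	defer leader.Close()
	// The other roles are client.StandBy (replicates, doesn't vote)
	// and client.Spare (neither replicates nor votes).
	return leader.Assign(ctx, id, client.Voter)
}
```

The same client package also exposes the leadership transfer mentioned above for graceful shutdowns.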
OK, so the question is: since the Raft logs themselves need to be transactional, why didn't I use SQLite itself to persist them? The answer is that, first, the logs in Raft have a somewhat simpler data model than SQLite, and also a somewhat different one. The second reason is performance: the two bottlenecks of Raft are the network and the disk, of course, so everything you can do to shorten those hot paths is good. I probably have a slide for this: Dqlite, as I mentioned, is a single-threaded engine, so it never blocks to do disk I/O. It uses a not very well known kernel interface, the asynchronous I/O interface, whose main system call is io_submit, for those who know it, and that doesn't block. SQLite, by default, blocks: if it has to write a commit to disk, it blocks and does nothing else in the meantime, while there's networking going on, et cetera. As for the choice of going single-threaded, there are many reasons; we can go through them separately, but I think this at least answers your question. Another question? OK, so I have a few more details. One important thing is...

Excuse me, I have a question. I'm a bit curious about the need to use an external library like libco; could you elaborate a bit on it? Sorry, can you keep the microphone? Could you elaborate on the need for libco? Not libuv, libco? Ah, libco, sorry, OK. So, libco: how many of you are familiar with coroutines? For those of you who are not: in C there are no coroutines. A coroutine is, essentially, a way to do concurrent computing in a single thread, to keep it simple. You have some non-blocking piece of code, and it does its job as long as it can, because at some point it hits some blocking job to do, network or disk. Then that function, that coroutine, pauses; it's still executing on the same CPU, the same thread, and it passes control to another coroutine, same thread, same CPU, and the computation goes on. Coroutines are lightweight threads: you don't have to context-switch, you don't pay that overhead, and they are really, really good for networking; famous web engines are based on this model, like Nginx and HAProxy. But in C there is no support for coroutines: you can't just stop in the middle of a function and start another function on another stack, out of the blue. libco has some black magic: depending on the architecture, it will save the current state of all the registers and that kind of thing, and it will transfer the program flow to a point in another function.

Why do I need this? This is related to the other question, because SQLite is blocking. When I reach the point inside the SQLite code where I need to commit the transaction, instead of writing to disk I have to ask Raft to do a commit, essentially, and that part is asynchronous: the Raft engine is callback-based, asynchronous. So there wasn't a way to stop SQLite right there, just hanging, without using threads, and I had to use the coroutine trick to do that. I hope that answers the question.

Speaking of which, the SQLite version that Dqlite needs is not the upstream one; it's a patched version, for the reason I just mentioned. I'm not sure how many of you are familiar with the internals of SQLite, but when it comes to committing the transaction to disk, there is an API for writing to the write-ahead log. I put some hooks in there, essentially to call a custom interface that you can specify; in my case it's the Dqlite interface, which does all the magic. The patch is very small, not a big one, for now, and I've been maintaining it for a couple of years, updating it at every release. So far I just rebase the code; I occasionally get trivial conflicts, but that's it. It's a very stable part of the SQLite code base and the patch is minimal for now. I would like to grow the patch a bit, we can talk about it if you want, to add new ways of supporting replication, but that's it.

You probably want to know whether this is going to be upstream. Yes, eventually, hopefully. SQLite is a particular project; quoting from them, it's not open contribution: the project does not accept patches. This is funny: only 27 individuals have ever contributed any code to SQLite, of those only 16 to the latest release, and only three developers have contributed non-comment changes in the last five years. So it's not something where you push a pull request on GitHub and it gets merged; it will be very difficult to get any changes in. So I will wait, first until I'm absolutely certain that the patch will not change for some time in the future, and perhaps for Dqlite adoption to rise, to have some more arguments. OK, we have a few minutes left if there are questions, or... I think we can do five minutes more.

For now the only language which is really supported by Dqlite is Go, because LXD is written in Go. As I mentioned, when your application talks to the database, it doesn't talk to anything local; it needs to connect to the leader,
and when they speak, they speak a wire protocol, similar to Postgres or MySQL. I didn't pick the MySQL or Postgres protocol, for reasons that are another topic, but there is a wire protocol that your client needs to speak. There is a Go client that speaks this protocol, but that's the only supported language; if you want to support your own language, you will have to write the client, as of today. The wire protocol is documented, though, and you can take a look at the Go one, for example, to get an idea of how it works; it should be doable. We would love to implement more clients in other languages, but for now we don't have the time.

One last thing. As I mentioned, for now the way data is replicated is by transferring pages of the SQLite write-ahead log. That's good and robust, but it might be too much: if your INSERT query changes just a few cells in the B-tree, you change just a few records, and shipping whole pages might be overkill. I would like to implement B-tree-based replication: within the transaction, I just look at which cells were modified and replicate a logical representation of the change, pretty much like Postgres does, more or less, or even statement-based replication. Those are way smaller payloads to transfer over the wire and to write to disk. It depends, and this needs to be benchmarked, because often it doesn't matter much whether you transfer just a few bytes or a full page. But that's it, so, yeah, thank you. Time for more questions.

Are there constraints in terms of hardware or architecture to run it? Architecture, no. The only constraint for now is that the operating system is Linux, because of that asynchronous disk I/O, but that could be ported to other operating systems as well. No architecture constraints: whatever you want. We run it on ARM, we run it on s390, whatever Linux supports.

Have you benchmarked the performance hit of the Raft protocol, and what kind of extra latency does it add to queries? Yeah, as I mentioned, your bottlenecks will be the disk and the network. Compared to SQLite: SQLite needs to do a disk write, and Dqlite needs to do a disk write on every node, but that happens in parallel, so for that part you can expect more or less the same performance as SQLite. The overhead is one network round trip, because you need to propagate the change and get an answer, to decide that there is a quorum. So it's SQLite plus one network round trip; that's more or less what we have observed.

First, thank you for the talk. I have a question about asynchronous I/O: I'm not used to asynchronous I/O in C. How do you manage to... can you hear me? Barely. Can you hear me?
Yeah, now, yes, better. How do you manage to perform blocking and unblocking operations? Do you have, like, worker threads where you perform only the synchronous, blocking operations? So the question is how you perform blocking calls using libco. The thing is, you don't: Dqlite does not use any blocking system call. That's the reason why we use libco in the first place. When you enter the commit path of the SQLite transaction, the SQLite logic will pause there, the registers will be swapped, and the Raft engine will run, so to speak; that code will sleep there, and nothing else will happen. Well, the readers actually go on in parallel, but other writers will be blocked, like in SQLite, so SQLite will be frozen at that particular spot in the SQLite code. From the point of view of SQLite that's blocking, but from the point of view of the CPU, I'm just swapping the registers and going on with something else. OK, time is up. Thank you again.