Hello and warm wishes to everyone joining the session by Ulf, titled "Mnesia Rocks: the Mnesia backend plugin on steroids". We're glad that you could join us today, and without any further ado, over to you.

Yes, hello. My name is Ulf Wiger. I've been around in the Erlang community for quite some time now. I joined Ericsson in '96; at that time I had already been playing with Erlang for about four years, and I've been doing a lot of work over the years in and around Erlang. Since 2017 I work with blockchain, at the Aeternity Foundation. We have been using Mnesia in that blockchain project and have made some developments that I would like to share with you now.

I guess most people are acquainted with Mnesia, or at least know of it, since it comes with OTP and has since the beginning. It was initially intended to be like an extension of the Erlang language, the core language essentially giving you processes. Actually, when I came across Erlang for the first time in '92, there was no standard library; there wasn't even distribution. If you wanted database functionality, you had to build it yourself, which I spent some time doing. It was fun, but not very productive.

The Computer Science Lab team started playing with how to do database programming in Erlang. One of the things they added very early was Erlang Term Storage (ETS), which you could see as a form of database in Erlang. You've all used ETS tables. They're sort of ad hoc hash tables or trees (initially only hash tables) that you can just create; you can have thousands of them, and when the owning process dies, its ETS tables are cleaned away. One of their core features is that the data in an ETS table is not garbage collected. This was initially for performance reasons: in the mid '90s we didn't have very fast computers, and we didn't have generational garbage collection either, so keeping your data in the process heap ended up being very expensive, not least in terms of garbage collection.

Mnesia was initially called Amnesia, because it was just an in-memory database. It was kind of cute: when you stopped it, it lost its memory, thereby Amnesia. Some boss at Ericsson didn't think that was an acceptable name for a database, so they dropped the A and called it Mnesia, which essentially means memory in Greek. The disk storage was more or less an afterthought, because for telecoms applications you did also need to be able to back up memory to disk. So you could say that the core way to do persistence in Mnesia is to have the data in ETS tables, which are then streamed to disk using disk_log. I'll mention that a little bit more later. It's actually quite robust and efficient.

Now, Mnesia was a distributed database management system long before there really were any commercial options for that. I think Claes Wikström and Hans Nilsson wrote a poster about it back in 1995, which was actually before the first version of OTP. There are some amazing aspects of Mnesia, not least that it was also made so that you could subvert it. It has transaction support, so you can get ACID properties. But if you really want to get to the data as quickly as possible, you have dirty reads and dirty writes, and you can go even more lightweight than dirty reads and do essentially ETS reads, which have very minimal overhead compared to using the ETS API directly.
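To make those access levels concrete, here is a minimal sketch of my own (not from the talk), assuming a local ram_copies table person with records of the shape {person, Id, Name}. mnesia:ets/1 is the documented way to run a fun directly against the underlying ETS table of a local ram_copies table.

```erlang
-module(access_levels).
-export([read_tx/1, read_dirty/1, read_ets/1]).

%% Assumes a local ram_copies table 'person' with records {person, Id, Name}.

%% 1. Transactional access: full ACID properties, locking, possible retries.
read_tx(Id) ->
    mnesia:transaction(fun() -> mnesia:read(person, Id) end).

%% 2. Dirty access: no locks, no transaction log, much cheaper.
read_dirty(Id) ->
    mnesia:dirty_read(person, Id).

%% 3. ETS-level access: mnesia:ets/1 runs the fun directly against the
%%    underlying ETS table (local ram_copies tables only), with minimal
%%    overhead on top of the raw ETS API.
read_ets(Id) ->
    mnesia:ets(fun() -> mnesia:read(person, Id) end).
```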
This was intentional, and in line with Mnesia being at the fingertips of the programmer all the time. I mentioned that the impedance mismatch is low. Impedance mismatch is essentially what you get if, say, you have an SQL database: they are wonderful in their own right, but you essentially have a different language and different data representations, so you have to work in two different environments and possibly also do heavy context switching in the OS in order to get to the data. This is what is typically called an impedance mismatch, and Mnesia was designed to minimize it. Essentially, you're just coding Erlang and the database keeps Erlang terms, which is typically the way you like it.

Now, there are some bad things as well. One is that the persistent storage was essentially an afterthought. The disc_only_copies, as they're called in Mnesia, use a library called dets, and dets essentially has the same semantics as ETS except it lives on disk. But this really is not a very good idea. Since ETS runs in the same memory space and is under full control of the VM, you can pretty much guarantee that an insert actually works; if it doesn't, you're probably running out of memory and worse things are going to happen anyway. But when you try to write something to disk, a lot of stuff can go wrong, and Mnesia really is not designed to handle errors in dets accesses. That's one problem. The other problem is that dets still uses a 32-bit bucket scheme and can only handle two gigabytes of data. In the mid '90s maybe that felt like a lot, but nowadays it feels very silly, because if you only have two gigabytes of data in a table, you might as well keep it in RAM.

So what people usually do for persistence in Mnesia is use disc_copies, which are essentially RAM tables logged and checkpointed to disk. As I said, the big problem with them is that they're limited by the amount of RAM, because all your data resides in RAM and is logged to disk. It's fast, it's robust, it's very reliable, but it limits the size of your database, or at least of individual tables.

Another problem with Mnesia is that the table sync is pretty naive. I think part of that is that it was intended to be used in cluster applications with all the data in memory, so the tables wouldn't be that large. Essentially, Mnesia figures out which node has the freshest copy and then copies the entire table over when it syncs. For very large data sets, this will not be acceptable.

Another thing is distributed databases. This is my interpretation, and I don't know if I'm alone in it, but I have seen a lot of complaints about how badly Mnesia handles certain error situations in distributed database applications. One observation I have often made is that it's so easy to set up a distributed database using Mnesia that people do it without really thinking of it as hard, and there are some very hard failure scenarios. In some cases people have run an Erlang cluster with several replicas of every Mnesia table, then they run into split brain or something like that and things go badly. They get annoyed, they throw out Mnesia, and they switch to, especially some years ago, a single Postgres instance that was not replicated at all. If that was good enough, then you didn't have to replicate using Mnesia either; but Mnesia doesn't really give you much guidance on how to avoid nasty failure scenarios with distributed processing.
If you work at it, and you look at presentations and such, you can find pointers on how to fix this, and it is possible to achieve pretty good robustness, but a lot of it is undocumented and relies on third-party projects. So I would say this is a little bit bad, at least in that Mnesia allows you to do it but pretends it's a lot easier than it actually is.

The really ugly part is disc_only_copies. They are essentially unusable; you should stay away from them. The biggest problem is that if you approach the two-gigabyte limit, not only may performance suffer (it does when the dets cache is small, because dets is very sensitive to caching), but if you grow beyond the addressable limit in dets, dets will basically tell Mnesia "I couldn't write that", and Mnesia doesn't care: it actually ignores the error. So you get silent data loss, and that is evil. The advice is: stay away from disc_only_copies. But of course, this means that for practical purposes you're bounded by the amount of available RAM on your node, or possibly in your cluster, if you want to get fancy with the distribution aspects.

One thing I've been trying to address for many years is this: while Mnesia is great for getting started and doing prototyping, and you can very easily create a database that's extremely easy to work with, you don't want there to be hard boundaries where you simply cannot go any further, or where you go past some line and suddenly everything becomes extremely difficult, because users who believe they may need to go there will forego Mnesia entirely and use something else, just so that they avoid this trauma later on.

So back in 2016 I was working on a backend plugin interface, and Klarna got interested in it. I'll talk a little bit more about Klarna. They hired me (I was at Erlang Solutions at the time) to make this into product quality, and it actually became part of Mnesia as of OTP 19 in 2016, but it's pretty much undocumented, so you may have missed it; I wouldn't fault you if you have. There are a couple of plugins out there on GitHub. There is a LevelDB plugin, which Klarna uses. It has the drawback that it relies on Basho's LevelDB binding (eleveldb), and of course Basho is no longer around, so I don't think that is very well maintained. Then there is leveled, a key-value store that is used in Riak; it's all Erlang, which is nice, and there is a backend plugin for using leveled. We use RocksDB: Benoit Chesneau maintains a RocksDB interface for Erlang, and we're using that in the RocksDB plugin, mnesia_rocksdb. There is also an experimental Postgres plugin. That was something I did while working with Klarna, mainly to see how well Postgres performed as a backend, which essentially means using Postgres as a key-value store. I guess it did surprisingly well, but not as well as LevelDB.

Something else that was introduced at this time, which I'm going to mention because it's also undocumented but I think it's pretty cool, is index plugins. It is described in the documentation for mnesia_rocksdb, where I write some more about this and the plugin support, since it's undocumented in OTP. An index plugin allows you to register a callback function that you can then use in indexes: the callback is handed the full object and is expected to return index values. So essentially any index values that you can derive from the object can be used as an index in Mnesia.
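As a sketch of how that might look: the names here (mnesia_schema:add_index_plugin/3, the {words} plugin name, and referring to the plugin in the index specification and in index reads) follow my reading of the undocumented API and of the ix_prefixes example that ships with mnesia_rocksdb, so treat them as assumptions and check that documentation before relying on the exact syntax.

```erlang
-module(note_ix).
-export([setup/0, words/3, lookup/1]).

%% Index plugin callback: it is handed the whole object and returns a list
%% of index values. Here we index a 'note' record {note, Id, Text} on every
%% word in its text.
words(_Tab, _Pos, {note, _Id, Text}) ->
    string:lexemes(string:lowercase(Text), " \t\n").

setup() ->
    %% Register the plugin under the name {words} (plugin names are 1-tuples).
    mnesia_schema:add_index_plugin({words}, ?MODULE, words),
    %% Refer to the plugin name in the index specification.
    mnesia:create_table(note, [{attributes, [id, text]},
                               {index, [{words}]}]).

lookup(Word) ->
    %% Read via the derived index, like a normal secondary index.
    mnesia:dirty_index_read(note, Word, {words}).
```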
And this, I think, is a very powerful feature that is a well-kept secret.

So, the way you would use a backend plugin, and I have a short example here: you create a Mnesia schema, you start Mnesia, and then you can add a backend type, where you provide an alias and a callback module. An alias is at the level of, for example, disc_copies or ram_copies; here we call it rocksdb_copies, and the callback module would be mnesia_rocksdb. Then you create a table where you provide the name and table type, and in this case you use the rocksdb_copies alias on the local node. After that you can start using it just like any other Mnesia table, and all the Mnesia functionality works just as it normally does. One caveat: you wouldn't want to use RocksDB or LevelDB, or really most plugins I think, for bag tables, because bags are quite difficult to implement in most key-value stores. So if you want bag tables, use ram_copies or disc_copies.
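Concretely, the flow just described might look roughly like this as a shell sketch. I'm assuming the mnesia_rocksdb application is in the code path; mnesia:add_backend_type/2 is the undocumented OTP 19+ call, and exact return values may differ from what I show here.

```erlang
%% Assumes the mnesia_rocksdb application is available in the code path.
ok = mnesia:create_schema([node()]),
ok = mnesia:start(),

%% Register the alias 'rocksdb_copies' with its callback module.
%% (mnesia_rocksdb also provides a register/0 convenience wrapper.)
mnesia:add_backend_type(rocksdb_copies, mnesia_rocksdb),

%% Create a table that lives in RocksDB on the local node.
mnesia:create_table(t, [{attributes, [key, value]},
                        {rocksdb_copies, [node()]}]),

%% From here on it behaves like any other Mnesia table.
ok = mnesia:dirty_write({t, 1, <<"hello">>}),
[{t, 1, <<"hello">>}] = mnesia:dirty_read(t, 1).
```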
Klarna, for example, is using the LevelDB backend. It's probably a lot bigger now, but some years ago, around 2016 when they switched to LevelDB, they had more than 600 gigabytes in their Mnesia database. They were using mostly disc_copies, so they had to keep it all in RAM. They were using dets, disc_only_copies, for some things, but trying to get away from it. To make this work they had monster machines with, I think, two terabytes of RAM or something. One of the things they were concerned with was that when you have 600 gigabytes of data that is essentially RAM-resident, even if it is also on disk, it has to be loaded into RAM every time you restart. So restarting took half an hour, and just loading the database was like 20 minutes of that. So one of the things they were after when moving to LevelDB was a much faster startup time, which they also got. And of course, over time, much lower RAM usage, which meant they didn't have to buy those ridiculously expensive Dell blades with two terabytes of RAM, which is also a good thing. Now, for these very large data sets they did not use the naive Mnesia replication protocol; they use their own replication protocol. I'm not going to talk much about this, but there are some callbacks in the plugin interface that would allow you to write a custom replication protocol as part of a backend plugin. It has only been tested in toy examples, so if any of you want to try it, hey, it would be interesting to get feedback on that.

So, the Aeternity blockchain. The thing about blockchains is that they are certainly distributed systems; they're essentially untrusted peer-to-peer networks. But at the node level they're not distributed, so we're not using Erlang distribution at all. We have a database that's something like 140 gigabytes, but it all syncs: you can start an empty node and get the entire database from the network. It's faster to download a snapshot and start from that, but that's the nature of the application. So essentially we don't care about replication. We do care about persistence, but in this particular database application we don't worry about replication at all at the database level. One problem that blockchains tend to have is write pressure, especially blockchains in the style of Aeternity or Ethereum where you allow smart contracts, because you have a very sizable blockchain state and you can easily end up saturating the I/O system. But the access patterns are simple: we don't have any relational queries or anything like that, and we have one index, I think.

So we've been using RocksDB, and it's been working very well. We also have configuration support so you can use other plugins, for example leveled; we actually test that as well, and it's possible to use it on your blockchain node if you want to. We can also run RAM-only for testing, to speed things up.

The thing about write pressure ended up being a nasty corner case for us, because especially during sync we write data as fast as we possibly can to the database. Then in some cases, for example when running a Docker image on a virtual machine in the cloud, the I/O system can get saturated and you can get write stalls in RocksDB. This can also happen in LevelDB and, typically, in mostly any key-value store. You can configure how it handles this. The default is that it just returns an error, but just like with the dets problem we talked about before, Mnesia doesn't care, so you get database inconsistency. You can also have it block, which we've tried; that doesn't work in our system. It also has support for pushback, so that you can slow down writes, but for various reasons that was also difficult to deal with in a backend plugin scenario. Part of this is that the whole backend plugin system was an afterthought in Mnesia, and there are limits to what you can do without a major redesign of the system.

The point of no return, just to clarify this: when you commit a transaction, Mnesia writes to the transaction log, which in database terms is a write-ahead log, or WAL. Once it has done that, it knows it can recover the transaction, and then it writes to the data stores (or, for a dirty write, it goes directly to the data stores). But past that point, the API it uses just assumes that everything will work. Again, this is the ETS heritage, and it's very hard to deal with.

We also have a file descriptor problem, because these log-structured merge key-value stores use a lot of file descriptors. In the first versions of the LevelDB and RocksDB plugins we had one database instance per table, and we could end up with thousands of file descriptors open, which probably makes the write pressure problem worse. Now, RocksDB has added something called column families, which are essentially logical tables that you can keep within one database instance. So we wanted to use that, map Mnesia tables to column families, and keep just one RocksDB instance per alias in the Mnesia world. We hoped this could lessen the write pressure problem and the file descriptor problem. We also wanted to add more tables, but we wanted to do the column family work first, so we've been working on this.

RocksDB has other interesting features as well. For example, it has its own transactions. They're a bit quirky, but they're essentially optimistic transactions and they're quite fast. You can also do batch updates, where you have a list of updates that are either written atomically or not at all, and it's much faster to use batches than to do multiple individual writes. You can also take a snapshot: since it's essentially a log-structured system, it works just like immutable data structures, where if you keep a reference to the old version of a data structure, that reference is essentially a consistent view of the past.
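As a rough sketch of those raw operations, here is how a batch write and a snapshot read look in Benoit Chesneau's rocksdb binding, as far as I remember its API (keys and values are binaries); treat the exact option names as assumptions.

```erlang
{ok, Db} = rocksdb:open("/tmp/rocks_demo", [{create_if_missing, true}]),

%% Batch write: the list of operations is applied atomically, and one
%% batch is much cheaper than many individual puts.
ok = rocksdb:write(Db, [{put, <<"a">>, <<"1">>},
                        {put, <<"b">>, <<"2">>},
                        {delete, <<"old_key">>}], []),

%% Snapshot: a consistent view of the database as of this point in time.
{ok, Snap} = rocksdb:snapshot(Db),
ok = rocksdb:put(Db, <<"a">>, <<"999">>, []),

{ok, <<"999">>} = rocksdb:get(Db, <<"a">>, []),                 %% live view
{ok, <<"1">>}   = rocksdb:get(Db, <<"a">>, [{snapshot, Snap}]), %% view of the past

ok = rocksdb:release_snapshot(Snap),
ok = rocksdb:close(Db).
```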
In RocksDB and LevelDB you can use such a snapshot to efficiently iterate over a consistent view of the database, and we're using that, for example, in selects. And if we're using column families, then we can actually achieve atomicity in transactions and batch updates across tables. When you look at this set of features, you're getting fairly close to what Mnesia gives you overall, but RocksDB is much more low-level and not quite as user-friendly as Mnesia is.

The idea we wanted to explore was: okay, let's keep using Mnesia, we create the tables in Mnesia, and we still maintain compatibility. So if you use the Mnesia APIs, you have everything: transaction support, replication, and so on. But if you want to step down to a lower level, you should be able to use the RocksDB API directly, in a way that is slightly adapted to the way you're used to working in Mnesia.

So, an example. We start Mnesia; I had created a table t before. There is now an mrdb API, which can take a Mnesia table name, quickly find the metadata, and read the object using the direct RocksDB API. mrdb:insert works like a Mnesia dirty write, and mrdb:select works like mnesia:select, essentially. There is also mrdb:activity, where I can start a transaction. I need to name the alias, because that is my scope within which I can commit atomically. Then I provide a fun, just like when I do a Mnesia transaction, and inside it I can use the mrdb API for reading and writing. And this works, and it works very similarly to Mnesia. The idea is that if you get into the performance realm where Mnesia might start acting funny with the backend plugin (the problems I mentioned before), this could be a way to take control of it.

Batches can be used similarly. In the activity function you can specify a batch context and then a fun; the fun will not run in a transaction context, but what it writes will be committed atomically, as a whole, to the database. In the as_batch function you provide a table name, and the fun takes a reference, which is a map with metadata for the table, in this case an annotated map that also carries a batch reference. The other mrdb functions will sense this and use the appropriate functions in RocksDB. mrdb also detects Mnesia indexes and will update them accordingly, so even if you use the direct RocksDB interface, Mnesia indexes are consistently maintained. The indexes are actually RocksDB tables here.

I'm using the function get_ref for the table i, which I created without telling you and which has an index; it has its metadata in a map. I can see what the alias is, and I have the RocksDB handles for the column family and the database. There is also the encoding, which I can tweak. For example, if I know that I have binary keys and binary values, then I don't have to do Erlang term encoding, or sext encoding, which is essentially Erlang term encoding that maintains the sorting properties and is what you would use for ordered_set tables. So you can choose the encoding to optimize the space utilization in your database. The properties sub-map is all the Mnesia metadata, so you can actually fetch the indexes. And then there is a naming convention that allows you to access the indexes as Mnesia tables.
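Pulling the mrdb calls from that example together, a rough sketch looks like this, assuming the table t from the earlier example was created with the rocksdb_copies alias. The function names follow what I just described; exact signatures and return values may differ slightly in the released mnesia_rocksdb.

```erlang
%% Dirty-style access, going straight to RocksDB via the table metadata:
ok = mrdb:insert(t, {t, 1, <<"hello">>}),
[{t, 1, <<"hello">>}] = mrdb:read(t, 1),
All = mrdb:select(t, [{{t, '_', '_'}, [], ['$_']}]),

%% An optimistic RocksDB transaction, scoped to the rocksdb_copies alias:
mrdb:activity(tx, rocksdb_copies,
              fun() ->
                      [{t, 1, V}] = mrdb:read(t, 1),
                      mrdb:insert(t, {t, 1, <<V/binary, "!">>})
              end),

%% A batch: everything written inside the fun goes into one RocksDB write
%% batch and is committed atomically (no transaction semantics).
mrdb:as_batch(t, fun(Ref) ->
                         mrdb:insert(Ref, {t, 2, <<"two">>}),
                         mrdb:insert(Ref, {t, 3, <<"three">>})
                 end),

%% get_ref returns the metadata map: alias, database and column family
%% handles, encoding, and the Mnesia properties.
Meta = mrdb:get_ref(t).
```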
The index naming convention can open up some interesting functionality: you can fold, you can iterate over a table in index order, or reverse index order, perhaps. So it allows you a little more flexibility than plain Mnesia does.

Now, one thing we're making use of is persistent terms, which were introduced in OTP 21. They allow us to access lots of table metadata extremely quickly. Persistent terms are expensive to update but lightning fast to read, and creating tables or modifying table metadata happens very seldom, so it's actually perfect for this scenario.

Looking at performance, because if this weren't fast, there wouldn't really be a point to it: I ran some silly benchmarks where I created a table for each interesting category, filled it with 5000 objects, iterated over the objects, and then ran an iteration where, in a transaction, I read an object, modified it, and read it back.

To the left we see ram_copies. Of course, they are extremely fast to update and even faster to read. Transaction support does locking and all kinds of stuff, so it's a bit slow, even for ram_copies. disc_copies are only slightly slower than ram_copies for filling, or doing lots of writes; as fast as ram_copies for reads; and slightly more expensive for transactions, because the transaction actually has to be logged to disk at commit. disc_only_copies, in the middle there, suck overall, as we can see. Not only do they have the hard size limit: they are slow on writes, slow on reads, and very slow on transactions. The rocksdb_copies column is using the Mnesia API on a RocksDB table, as I showed you initially. It's faster than disc_only_copies on writes, much faster on reads, and also faster on transactions, which is kind of funny, but I think it has to do with the access patterns being more efficient (I was actually doing two reads and one write inside the transaction). The mrdb column does the same thing, but calling only the mrdb API and using the optimistic RocksDB transactions. It's actually approaching ram_copies and disc_copies; in fact it's slightly better than disc_copies on writes, and better than all of them on transactions.

So this does achieve the goal of boosting performance if we have to, while keeping all the data out of RAM. Well, it uses memory-mapped files, but essentially it's disk-only, plus memory caching of course. So you can have hundreds of gigabytes of data that you can access very quickly, and this also gives us some pretty extreme performance on very large data sets, which was previously not available in Mnesia. Hopefully this will extend the reach of Mnesia, the number of problems and configurations where you can keep using Mnesia, without sacrificing the initial goodness of it.

Some things to think about, though this is in line with the Mnesia mentality: if you're using the mrdb transactions, they won't play that nicely with Mnesia transactions. Or mostly the other way around: Mnesia transactions will think that they offer atomicity, but once committed, they are not necessarily as atomic on the actual update side. I think I'm fixing that, but it used to be difficult since we had multiple instances, one instance per table. Once we have everything as column families, this will be better, but mnesia_rocksdb can't guarantee it, for backward-compatibility reasons. Could this become a big problem for us?
I'm not going to go into that. The way RocksDB optimistic transactions work is that they essentially take a pre-image. They allow you to update data, and then at commit time they compare the commit set with the pre-image; if the data in the database has moved, if anything in the commit set has shifted since the pre-image, the transaction is aborted. This is less fine-grained than Mnesia transactions, but if you have a carefully considered access pattern, which you can often have in Erlang because you have such good control over the concurrency, then this may not be an issue. We think it's not an issue for us, so we'd rather take the performance.

Now, one thing I found to my absolute horror was that if you have concurrent RocksDB transactions, and one transaction writes some data, that data actually pops up in the other transactions, which I consider a violation of the isolation property. But there is a way to fix that by marrying RocksDB transactions with RocksDB snapshots, which are very lightweight. So in mrdb I actually use both, so that when you're in a transaction you have a consistent view of the data you read: if you read a piece of data once, you'll see it again (or possibly the data you wrote in your own transaction to modify it), not something that some other transaction is writing. Now, if some other transaction is writing to the same data you are, chances are one of you will abort. mrdb then takes a mutex and reruns the transaction, refreshing the pre-image and possibly then passing, and that seems to work pretty well. So the transactions in mrdb are a lot nicer to deal with than the raw RocksDB transactions, but if you want the raw RocksDB transaction behaviour, you can use that instead of the mrdb API.

So, I'm pretty much done. That is the repository. I would have liked to have a release tag ready for you, but I'm still working on a PR; hopefully I will merge that this week. But if you want to dive into this right away, you can contact me and we'll make sure that you know where to find the latest stuff and what the exact status is.

Okay, I see there is a question: "When would we choose mnesia_rocksdb over various other external DBs? Does mnesia_rocksdb support other domains than Erlang, or is it just for building fast telecommunications applications?" Well, it's not just for telecom, obviously. It's possible that, using RocksDB this way and using RocksDB transactions, you could coexist with other applications using the same RocksDB instance. I haven't tried that, and I don't know what the complexities would be, since RocksDB runs essentially as a NIF in Erlang. But generally, Mnesia is Erlang-local and supports your local application. So I think this mainly allows you to reach further with Mnesia than you could previously, but perhaps not address fundamentally different use cases. If you want other applications to access data in Mnesia, you probably still have to provide an access API that you maintain.

I think we've come to the end of our time here.