Hi everyone, my name is Ivan, and this talk is about choosing a modern database management system in 2024. I'd like to start with why I'm giving this talk. I've worked in the data infrastructure industry for well over a decade, and I really enjoy talking with people about their data infrastructure: how they ended up with their particular solution, what works for them, what doesn't. This talk is a set of highlights from those years of experience, both hands-on and from those kinds of discussions. I've worked on a handful of database products in that time, two of them open source: I spent seven years at ClickHouse — you heard all about it if you were here before lunch — and I'm now Vice President of Product and Open Source at YDB.

Unlike most other talks in this room today, mine will be more of an overview of our little database market. We'll start by trying to understand how people approach choosing a DBMS in general. Then we'll switch gears a bit and look at how the database world interacts with the open source world and why that matters. And finally we'll dive a little deeper into what changes in this process of choosing DBMS technologies as your project or company grows in terms of data volumes and query rates. Let's jump into it.

I like to use the analogy that choosing a database management system is quite similar to choosing a car. What do I mean by that? At a very high level, all products in both of these markets serve the same purpose: similarly to how cars move things from point A to point B, databases take your data, store it for a while, and return it back to you, with some modifications along the way. But if we take a closer look, you'll see several major categories of these products. Just as there are different kinds of cars, there are different kinds of DBMS for different kinds of workloads, and hence they're suited for different use cases. Inside each category there are usually hundreds of particular technologies: similarly to the hundreds of car manufacturers, there are around 400 DBMS products currently on the market, each with its own unique blend of trade-offs, advantages, and disadvantages. And even within one particular technology there are tons of nuances in how you can implement it, a lot of options, a lot of deployment choices, and so on. It's a deep industry to dive into. On the other hand, just as most people drive cars without really understanding how they actually work, you can use a DBMS with just a general idea of how it operates, and that's what I'm trying to cover today.

The most fundamental way to characterize a DBMS comes from the logical concept of a table. Most databases operate on the concept of a table, and even though some of them provide higher-level abstractions like documents or graphs, most of these trade-offs still apply. The tricky part is how you map the two-dimensional space of a table, with rows and columns, onto the one-dimensional space that computers provide, both in random-access memory and on disk. The two classic approaches are: you can go row by row, storing all columns of each row consecutively, or you can go column by column, storing all values of one column, then all values of the next column. As we'll see on the next few slides, this choice matters a lot.
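To make the two layouts concrete, here is a minimal sketch in SQL. The cars table and its columns are hypothetical, chosen to match the website example coming up next.

```sql
-- A hypothetical catalog table; a real one would have dozens of columns.
CREATE TABLE cars (
    id    INTEGER,        -- identifier the website takes from the URL
    model TEXT,
    price DECIMAL(10, 2)
);

-- OLTP-style point lookup: a row store answers this with roughly one read,
-- because all columns of the row sit next to each other on disk.
SELECT * FROM cars WHERE id = 42;

-- OLAP-style aggregate: a column store answers this with roughly one
-- sequential read of the price column, ignoring everything else.
SELECT AVG(price) FROM cars;
```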
These two layouts dramatically change the complexity of running the same query against the same dataset, depending on how the data is stored. The row-oriented approach usually works best for OLTP, or online transactional processing, workloads, where you change small subsets of rows frequently; the column-oriented approach works best for analytical workloads, where you analyze lots of data to produce smaller reports out of it.

Let's imagine an example website, also about cars, that has a page showing all the information there is about one particular car model. To render such a page, the website will likely use a query that fetches one single row by the identifier it took from the URL. As you can see on the right, if all columns of this row are stored consecutively on disk or in memory, rendering the page basically translates into one read. But if the data were stored the transposed way, rendering the same page would require a number of reads proportional to the number of columns, reassembling the row on the fly from cells fetched from different column files, or however you store things on disk or in memory. The same website could have another page that shows some aggregate over all the models it knows about. This is a slightly oversimplified example that uses only one column; in real life it's usually a few columns, and usually there is some sort of date involved to show a graph or something. But you can see that in this case it's really beneficial that you don't need the model names or any other data you have about these cars to produce the report. It's beneficial that all values of the price column are stored together on disk, so again you can do just one sequential read to build the report; such a layout is also really friendly to modern CPUs running aggregates over these datasets.

As you can see, even in such simple examples there are use cases for both kinds of databases, and the question is what you are going to do about it. The simplest approach, which most people start with, is to choose one DBMS — usually a transactional one — and throw all kinds of workloads you have at it. But as we've just seen, the mismatched workloads will behave like this: slow and inefficient. It will work only while your dataset is small or before you have many users; at some point you'll have to resort to option number two or number three, which we'll explore in a moment.

When you try to integrate multiple DBMS together, you have to build some sort of pipeline that moves data from wherever it originates into your databases. There's a whole profession dedicated to building these pipelines: it's called data engineering. Here are a few examples of what these pipelines could look like. The ones on the top treat both DBMS as equals and try to concurrently put data into both of them, while the options on the bottom treat the transactional database as the main one and copy data into the analytical database for further processing. The options on the left try to be as close to real time as possible by introducing an extra component, a persistent queue, that buffers incoming data before it gets put into its destination; the options on the right skip that, making the pipelines simpler but introducing extra latency by doing everything in batches.

The alternative is to try to find an all-in-one solution, and the idea is quite simple.
You build a database that can store data in both ways and automatically help you decide how to process it for both kinds of workloads. The idea is simple, but the implementation is not, so most current implementations on the market have huge drawbacks. Quite a lot of them are commercial and prohibitively expensive — most companies can't really afford them — and others often focus on only one kind of workload and implement the other kind as an afterthought, just to tick the checkbox and use it in marketing. That's also what we are building at YDB: we aspire to become a leading HTAP database, and we're making quite a lot of progress. As an extra step, we are also adding a persistent queue component to the same platform, to simplify moving data between these different kinds of tables and in and out of the database, in a transactional manner across both queues and tables.

Let's change gears a bit and talk about why open source matters in databases. The first thing is that data is so crucial to most businesses that you need to think about how much you want to rely on your database vendor to store it. People often think that one of these huge companies that everyone knows — Google, Microsoft, Amazon, whatever — will be around forever; they're so huge that they're not going anywhere, so I can trust my data to them and I'll be fine. But the real world shows that even if the company itself stays around, it might decide to leave your market, or it might jack up prices tenfold and become prohibitively expensive. Hence, for a critical piece of infrastructure like databases, people start considering a move towards open source. Either you use existing products — maybe leveraging someone else's help to implement them while building internal expertise over time, which is actually how communities are formed, by people implementing the same technology in different companies for different use cases — or, if you can afford it, you build your own database management system, which is actually how both ClickHouse and YDB started. And if you do build a technology so cool that the world can benefit from it, please consider open sourcing it as well, because together we can move faster.

Another important topic at this intersection is databases and open source licensing. Historically, many open source projects had so-called viral open source licenses, which basically means that all modifications to a given technology must be distributed under the same terms. That's good for sustainability, but on the other hand it limits the usage of such technologies inside large commercial companies. That's why my favorite licenses are permissive ones; both YDB and ClickHouse are under Apache 2.0, one of the most permissive licenses out there. Then there is a separate story around source-available licenses, which became popular during the last few years as database vendors tried to protect themselves from larger competitors. These licenses look open source, because the source code is publicly visible, but they limit how the source code can be used, which abandons open source philosophy and principles. I guess everyone has to decide for themselves whether they're okay with such a compromise.

Let's get into the last part: what changes in this database choice process as data volumes grow.
We'll consider three stages. When you're just getting started, as an open source project or as a user of open source databases, the key point is: don't invest too much time into choosing the perfect DBMS that will serve you forever. It's easy to be optimistic and think that whatever you're building will be super popular, but that's a pitfall: in reality, most projects get thrown away after a while. So it's important to get the business side right first and make sure you're building something useful that there's demand for. At a small scale, any DBMS will work fine, and as long as you don't miss the moment to replace it before it becomes a problem, you'll be fine migrating to the proper technologies that will support you long term.

If you're just getting started and want specific recommendations, here are the ones I'm really familiar with. The typical approach is to take a standalone database server: for OLTP, everyone knows PostgreSQL and MySQL; I still believe ClickHouse is the one that will get you the furthest in terms of single-node analytical capabilities; and if you're interested in the HTAP story, try YDB. But what I really want to highlight here is the left column: embedded DBMS. The difference is that, unlike database servers, these are libraries that compile into your application or get imported in some other way, which lowers the number of components you need to manage at this early stage. That's beneficial: it reduces the load on your DevOps team if you have one — and often you don't even have a DevOps team that early — so your engineers won't need to maintain a database server alongside your application. Check those out; they're really good for getting started.

The growth stage is when you've confirmed that whatever you're building makes sense and there is demand for it; if you're a startup, you might get investment and start accelerating your development. At this stage everyone starts running fast towards the goal, and the easiest way to screw it up is to start running in different directions. That's why it's really important to take a moment to consider your requirements, discuss them with your team, and ideally put them in writing: which datasets you're going to have, and what kinds of workloads you'll run against them. This will help you figure out which technologies might be useful for your use cases. Once you have your requirements written down, the next step is more straightforward: you likely have infrastructure left from the prototyping stage, and you evaluate it against these requirements. Either everything is fine and you can plan a migration to other technologies later — with, say, a one- or two-year plan, or half a year at least — or there are weak links you need to replace immediately, and you need to figure out urgent actions to take to avoid getting overwhelmed by growing demand or whatever is happening in your situation.

There are a couple more things to consider at this growth stage. One is the kind of DBMS you are choosing. Historically, DBMS that have been around for 30-plus years were developed when most databases were single-node, and they're designed for that use case: they're basically bound by what a single server can do. But over those decades, data volumes grew much faster than
the capabilities of a single server, and hence most DBMS developed in 2010 and later are distributed DBMS, designed to grow beyond what a single server can handle. There is a trade-off here: there is less information about these technologies out there, simply because less time has passed; on the other hand, it's much easier to grow with a distributed DBMS than to try to build something on top of single-node, older technologies.

Another interesting topic — I have no time to even try to cover it here — is called consistency phenomena. It's easy to imagine how a database should behave: it should be similar to a diary or a physical notebook. You open it, write some data, close it, and expect the data to still be there when you open it again. But in reality even single-node DBMS don't always give such guarantees and often allow some weird behaviors: you might see your data disappear temporarily, you might see duplicates of it, you might even see data that you only attempted to write without actually committing it, and so on. The idea here is: if you're using a database, check what it actually guarantees. It's possible to work around all these weird behaviors on the application side, but you need to be aware of them to do it, and you need to be aware of these behaviors in whatever product you're providing. And if you're a student, this is also a cool research topic — consider it for your graduate work, because there's a lot still to be uncovered here.

At the large-scale stage there are also tons of interesting things, but I want to highlight one that I call exponential complexity growth. Choosing databases for one team might already be complex, but imagine one team chose some technologies, and now another team independently chooses some other technologies, and likely they're different. Over time, in large companies, there are use cases for integrating data from different places — it might be direct queries or some sort of APIs between them. For example, say an e-commerce site wants a report on where the people who spend the most money come from. For this, you need data from the team responsible for the front end, to know where people are coming from, and from the back-end team that handles purchases. That's how these arrows arise. But which large company has only two teams? In reality it quickly becomes something like this. If you pay attention, you'll notice the growth isn't literally exponential, but it still feels like it, because nobody understands how the data infrastructure of the whole company works or how to add one more arrow, and the people who developed some of these arrows are gone and work elsewhere now. To avoid this mess, I'd recommend investing in your own best practices for how you run data infrastructure: choose a few technologies that work, evangelize them internally, stick with them, and help other teams adopt the same tech stack most of you have already adopted.

So let's sum it up: how to choose a modern DBMS. Assess your requirements — the datasets and workloads you have or are going to have in the future. With those requirements in hand, identify a few technologies that seem to fit them according to what the vendors claim, and then try to reproduce those results on your own workloads to verify that the claims are actually correct and that you can indeed achieve what you're supposed to,
and ultimately, once you've settled on a few key technologies that satisfy your needs, try to stick with them and make sure the rest of the company, as much as possible, follows along. So that's it. If you have any questions, connect with me on LinkedIn. If you're interested in YDB, we have a booth downstairs — stop by if you haven't yet. Unfortunately I'm leaving soon after this, but I live nearby, so if you want to chat more I can come back. Thank you for your attention, and thank you to the organizers.

Thank you so much. We do have a couple of minutes for questions, but before we do that: is the next speaker in the room? That would be Peter Farkas. Peter, could you please set up while we do some Q&A? Questions for Ivan? Excellent, I have one — I might have missed a detail, we had some organizational things. Part of the problem with doing transactional and analytical workloads together is the tendency to store data in two different ways, row-oriented and column-oriented. Did you explain how YDB is dealing with that as a hybrid? — Well, this talk wasn't supposed to be about YDB, but basically there are multiple ways to do it, and we're exploring most of them. One is to store the data in multiple copies and smartly decide where each one goes; currently we're only building the blocks for this, but that's the ultimate goal. And those two approaches are actually not the only ones: for example, it's possible to store a few columns together, and a few other columns together, if they're frequently used together — that's usually referred to as column families, and we have that too if you need it. So there are many ways to look at it. — Sorry, I only grasped the four major schemes, but even if you weren't answering only about YDB, that explains it. Any other questions? Going once... all right, thank you so much. The next speaker is on in about five minutes, at 2pm. Thank you.

Spoiler alert: I think there is... All right, well, welcome everyone, and thank you so much for joining this lightning talk. I guess we have 25 minutes to go over a lot of topics. First of all, let me ask you some questions. Who here uses MongoDB? Raise your hand — MongoDB users. Postgres users? Come on, you don't use Postgres? Okay. Who knows what the difference is between open source and open standards? Is it the same thing? Raise your hand if you think it's the same thing. Okay, well, at least there's that. I'm going to talk about open source, open standards, and the history of SQL, and I promise this is going to make sense in the end.

About me: my name is Peter Farkas, and I'm the co-founder and CEO of a company called FerretDB. We are a MongoDB alternative built on Postgres — we turn Postgres into MongoDB — but this is not going to be a sales plug, so let's say this much about FerretDB. I've worked at Cloudera, Percona, and various other open source companies; I've been working in open source for 15 years. I'm Hungarian and I live in Spain. We'll do a very quick recap on open source — I'll skip a lot of slides, because I'm sure you're familiar with what open source is — and then we'll talk about open standards and some examples related to them.
So why does open matter? I think Ivan just gave a very good presentation about the importance of openness and open source, especially if you're considering a database for your use case. Very shortly: it promotes innovation, open source projects can work together, it's transparent, it's more secure because you know what's in it, it's about collaboration, and it's also sustainable, because you don't have to redo a lot of work which others have already done — you can just use it. And it helps avoid vendor lock-in.

Now, there were not a lot of hands in the air when I asked you about the difference between open source and open standards, so here is a very simple way to put it: open standards are the blueprint, and open source is the implementation of the blueprint. What this means is that you can use open standards to build something which can interact with other solutions out there on the market; the standard tells you exactly how to interface with other applications. And it doesn't mean that open standards must always be implemented under an open source license — you can implement them in a proprietary way as well.

I'm going to skip these slides, because this is the last day of an open source conference and I would be surprised if any of this were new to you. Open source is software which is licensed under an open source license. It's not just anything you claim to be open source: it needs to comply with the open source definition, which says that the license your software comes with must comply with these 10 points — for example, you can't discriminate between users when it comes to location or use case, and some other things as well. I'm sure you're also familiar with these examples of open source software and hardware; there are a lot of familiar logos on this slide, and I'm sure you've used VLC media player, or Android, or Mozilla Firefox. What's interesting about this slide is that some of the logos here would claim that they are open source, but they are not — Elastic and MongoDB would be two examples, and there are many more. I feel bad for calling them out, but hey, that's life. So why do you think these two are not open source software, even though they say they are? Any ideas? Well, they violate the "no discrimination against persons or fields of endeavor" requirement by limiting who may use the software and for what; basically, they violate the 10 points of the OSI definition that a license must meet to be considered open source. So even though they call themselves open source, they don't come with the same freedoms as the other software.

Open standards: so we've established that open source is mainly software, licensed under an open source license approved by the OSI — that's what we can claim as open source. So what are open standards, and how are they born? If you remember, the standard is the blueprint which you use to create software that can interface with the other solutions out there. There is a list of definitions for open standards as well, and it originally came from the same person as the OSI definition of open source — the author is Bruce Perens in this case too — and it's pretty much about the same thing. Open standards should be available for anyone to implement: they should be open, you should be able to read and implement them for free, and no one is allowed to have any claims on your work if it's based on the open standard. That's very important, because open standards are really there for all developers, to make it possible for others to implement and innovate on top of something which they came up with.
Examples of open standards — this is where the difference becomes more visible. You see some logos here and some file extensions. If you look back at the previous slide with the logos of open source software, we saw implementations: Mozilla Firefox is an implementation of — what? It uses the HTTP standard very heavily, and the CSS and HTML5 standards, in order to show you web pages on the internet. Without these standards, it would not be possible to create a web browser. I don't know what the age group of the audience is, but some of us still remember how painful it was to create a website around 2000 or 2005, because all the browsers followed a different standard when it came to HTML, CSS, and the rest, and you could not create a website that looked the same on all screens. Open standards helped with that.

One other example is comma-separated values. The power of open standards here is that if you create a tool which sends out newsletters to a contact list, and you have competitors — other tools on the market that do the same thing — then with comma-separated values your user can actually download their list of contacts, maybe 20,000 subscribers, and import it into your tool. Open standards foster innovation and make competition stronger, which is great for users, and this way the market expands, because users are more confident that if they choose one product, they will still be able to move away from it if it turns out to be bad. So for you developers this is great, because you have a chance to compete on the market; and for you as users this is also great, because you can make sure you're always using the right tool for the problem you have.

There's a very interesting example: the Portable Document Format, or PDF, which is also an open standard — but it was only standardized in 2008. Before 2008, you could only open a PDF with software licensed or developed by Adobe, and as you can see on this graph, up until 2008 the popularity of PDF was dwindling; it was basically dying out, because you could not be sure, as a user or as a software developer, that if you implemented PDF and paid a license to Adobe, they wouldn't raise the license fee a year from now, or change the file format in a way that was not going to be good for your use case. In 2008 it became an open standard, and from that point anybody could implement PDF, and anybody could read, follow, and build on the standard. That is why today you can just book an airline ticket or save your file as PDF and open it with software which has nothing to do with Adobe. Did Adobe lose by letting PDF go and making it available as an open standard? No — the popularity of the file format increased a lot, and the market basically grew to multiples of its pre-2008 size.

How are open standards created? It's different from open source in the sense that usually standards organizations are involved. Standards organizations create a space for multiple vendors in a given area to collaborate and figure out how a standard should look — how the protocol or anything else should behave and interface with other solutions. Some of these organizations are related to governments, and I'm sure you have heard of at least one of them in your life.
Standardization bodies are special because when they discuss and create a standard, everyone has a say in how it should look; it's an equal discussion, rather than Adobe saying "hey, this is how it's going to look, and that's it." So it creates an environment where innovation can be detached from a particular vendor, and that's how it prevents vendor lock-in — I'm sure I've made this point multiple times already: if anybody can build on it, there's less chance for some entity to grab the market for itself.

Very quickly, the story of SQL as an open standard. This is by far not going to be a history lesson, because we have 10 minutes to go, but I felt it important to mention SQL here, because I'm sure all of you are familiar with it, yet a lot of people are not aware of how it came to be. The short history of SQL: it was created at IBM, by a person named Edgar Codd, who worked at the IBM Research Center in California and came up with the relational model for databases. This is the model that pretty much all relational databases are built on; he came up with the way we think about databases to this day. The problem with his invention — this was in the 1970s — is that he also came up with a query language for it, and that query language was not as understandable as SQL is: you pretty much had to be a software developer, or a machine, to query data from this new relational database he envisioned. So two other individuals, Don Chamberlin and Ray Boyce, came up with a query language which was a lot simpler and could work with relational databases. This was also in the 70s, and together, the relational model and SQL gave us what we simply called databases — for decades, "database" equaled a relational database, which probably used SQL. But how that happened is very interesting. A very sad addition: Ray Boyce was, I think, 27 years old when he published the paper on SQL as a query language, and about a year later he suddenly passed away, so he never learned of the huge impact he had on the database world, or the world of IT.

After IBM essentially came out with the relational model and SQL, they implemented it — the product was called IBM DB2 — and it only ran on those huge, washing-machine-sized computers. So if you had to use SQL, if you liked the thing, you had to buy one of those, which is not great. This is the essence of vendor lock-in: you could only get SQL from IBM. Then, since other vendors liked the idea as well, they started copying IBM DB2 and made their own implementations of SQL — Informix, Oracle, Sybase — they all came out with database products which used SQL, but with very different ways of implementing it. You couldn't be sure that a simple SQL command would run across all of these different products, which means that without an open standard, the user still pays the price of vendor lock-in: if you started with IBM DB2, then even though Informix or Sybase also used SQL, you couldn't just migrate to them, because none of it was based on an open standard. Many different dialects, and no conformity between these products. And then SQL became an open standard: in 1986 and 1987 it was adopted by ANSI and ISO, becoming a worldwide standard recognized by many standardization bodies. Anyone could implement it, and that was the start of the rise of relational databases and SQL, because open source came along, and open source database projects adopted SQL.
Postgres, I think, was not even compatible with SQL before 1995 — you couldn't run SQL queries on it — but then they implemented it; MySQL came, SQLite came, and then hundreds of other derivatives appeared. And that's the world we live in today, where we take SQL for granted: we can run a query anywhere we want, and it's going to work similarly across many different products.

So what is open source without a standard like? This is the reason I chose to talk about MongoDB today. MongoDB has a story very similar to SQL and IBM DB2, in the sense that they came up with a query language, and they came up with a new way of storing data in a database, called the document database, or the document model. Their tagline was that you don't need to be a database administrator to run a database: it's supposed to be even simpler, easier to query, easier to use, easier to learn compared to relational databases. We can argue whether that is true or not — that's not the point of this presentation — but it was open source. Everyone was able to use it, but only MongoDB implemented the query language they came up with, and it went proprietary in 2018. After they had captured most of the document database market, they shut the door and said this is no longer truly open source, and they actively work against companies who implement the wire protocol they came up with.

The big problem — and the point of this presentation, and the reason I talked about SQL — is that I wanted to show what it's like when you build on something with no open standard behind it, and that would be MongoDB, because nowadays you can only get MongoDB from MongoDB. This is a familiar slide, because the relational database slide was very similar: when MongoDB went proprietary, a lot of alternatives appeared on the market. Oracle, Microsoft, AWS, Huawei — they all came up with implementations of MongoDB, because they wanted to use the MongoDB query language and the document database concept, but because of the license they couldn't provide the original anymore. They came up with their own interpretations of the MongoDB query language and the MongoDB wire protocol, and this led to exactly the same situation we had with SQL in the 70s. History totally repeats itself, and users of MongoDB found themselves in a situation very similar to that of relational database users in the 70s — well, we haven't seen that ourselves, but we know how it happened. The issue is the same: different feature sets, different degrees of compatibility between these alternatives, meaning there's no easy way to migrate between them.

So we think — not just we at FerretDB, but many others — that we are in need of a new open standard for MongoDB and the MongoDB query language: a standardized core feature set based on the MongoDB query language; a standard, similar to SQL, which describes how MongoDB queries should look and how such a database should behave. It would ensure the same degree of compatibility and portability between document database products as we have with SQL. That is the point I attempted to make today: we believe this would be very good for users, the same way it has been good for relational database users. So, FerretDB — I promised this is not a sales pitch... while something's wrong with this slide — I just wanted to mention that our solution to this is that we turn Postgres into a MongoDB-compatible database.
FerretDB runs as a proxy on top of Postgres, and we hope — we believe — that this implementation (okay, there we go) is going to be the basis of a new standard across many different products. We release it under an open source license, Apache 2.0, and we hope that MongoDB users will be able to move from MongoDB to Postgres this way, using FerretDB. In fact, many did: we have lots of users and over 8,500 stars on GitHub. So please find us, in case you're using MongoDB and in case you also got the point that something needs to be done. Thank you so much; I'm happy to take any questions.

Thank you, Peter. Before we take questions: is Konstantin from Percona in the room? Could you please set up while we take questions? Thank you. Okay. — Thank you for your talk. Actually, I have a question for the room first: has anyone actually read the SQL standard? Like, the whole thousand-something pages? My question is: what do you think about the fact that you have to pay a few hundred bucks to legally read the SQL standard? Is it an okay practice, should it exist long term, and should this new standard work the same way — or are there better ways to do open standards? — There's not much we can do about the way open standards and standards organizations work. I suppose the reason you have to pay quite a high fee to download the SQL standard is that they fund their operations through these fees. I don't think it's fair, because of course the whole idea should be that it's available to everyone and free — debatable. I would hope that if we launch an MQL standard, nobody would need to pay for it. In fact, we're planning to start the project on GitHub, and you're certainly not going to have to pay for that. — I'm a participant in ISO, ITU, and IETF, and I understand these processes; I have a very particular view — ISO's position is unfortunate. — All right, thanks everyone. Time for one more, if anybody else wants to ask a question. In that case, I'll offer an observation. I hadn't noticed the parallel between the current state of document query languages and the situation half a century ago with query languages, but it's even closer than perhaps you thought, because IBM's principal mode of business at that time was the service bureau: most of their customers didn't own their own machines — they ran in IBM data centers. We now call that cloud. Eerie, isn't it? — It's eerie. We're right back where we started half a century ago. — About two minutes before we go on.

So next up is a talk about key management for data at rest encryption. — Hello, my name is Pausman, and I work on Percona Server for MongoDB; the topic of my talk is data at rest encryption and key management in Percona Server for MongoDB. Technically speaking, Percona Server for MongoDB is a fork of MongoDB Community Edition and maintains backward compatibility with the latter, while implementing more features than its upstream. Many of those features, such as authentication mechanisms and backup, can be especially useful in production environments. But today we are going to talk about securing user data with data at rest encryption.
Data at rest encryption ensures that if someone gets unauthorized access to the storage media, they still can't read the user data. It encrypts database files on disk with AES working in one of two chaining modes. Enabling data at rest encryption requires a clean data directory, since already existing database files can't be encrypted in place; the perfect time to do it is when you add a new node to your database cluster. On such a node, whenever Percona Server for MongoDB creates a new database, it generates a dedicated symmetric encryption key for it. On disk, a database maps to a number of files, one per collection and index, and Percona Server for MongoDB encrypts all the data it writes to those files with the generated per-database key. The key itself then goes into the encryption key database, which in turn is saved on disk encrypted with the master encryption key. This two-level approach enables efficient master key rotation, as we will see later. On subsequent restarts, Percona Server for MongoDB first decrypts the encryption key database with the master encryption key, and then uses the database keys to decrypt the actual user data.

The main byproduct of setting up data at rest encryption is the master encryption key, which we need to store externally somehow. On the right you can see what a configuration file for Percona Server for MongoDB can look like with data at rest encryption enabled: first we set the enableEncryption field, and then configure one of the three facilities for storing the master encryption key. The simplest option is to keep the key in a local file; needless to say, this is not a secure approach and is suitable only for testing. We will concentrate on the two other approaches: KMIP and Vault.

The Key Management Interoperability Protocol, also known as KMIP, defines a number of object types it can manage — for instance, different types of keys, certificates, and so on. Each managed object in KMIP has a unique identifier, which is the primary means of locating that object on a KMIP server. Among the other attributes objects have in KMIP, we are most interested in the state attribute, which is the foundation of a feature we will discuss later. A user can issue a number of operations against a KMIP server: for instance, Register to upload an object, Get to fetch it, Get Attributes, and so on.

As an alternative to a KMIP server, one can use HashiCorp's Vault to store the master encryption key. Vault is a server with many different capabilities implemented as pluggable engines; we are interested in the versioned key-value secrets engine, which essentially stores sets of key-value pairs. Percona Server for MongoDB constructs a single-entry secret in which the master encryption key, encoded in base64, serves as the value for an implementation-defined key. Before an engine can be used, it should be mounted — that is, initialized at a particular path on Vault. Within an engine, a secret is located by its relative path and version, so the secret's absolute path is the concatenation of the engine's mount point, the fixed "data" segment, and the relative path. We can think of the ordered tuple of absolute path and version as a compound identifier of a secret on Vault. When configuring Percona Server for MongoDB to work with Vault, we recommend configuring each node in your database cluster to use a unique relative secret path; for instance, we can derive the relative path from the host name of the node. By default, Vault keeps 10 versions of each secret.
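As a rough sketch of what such a configuration file could look like — the option names below follow the Percona Server for MongoDB documentation as I recall it, and the host names and paths are made up, so treat the exact spelling and values as assumptions to verify:

```yaml
# Illustrative mongod.conf fragment for data at rest encryption.
security:
  enableEncryption: true
  encryptionCipherMode: AES256-CBC      # one of the two chaining modes

  # Option 1 (testing only): master key in a local file.
  # encryptionKeyFile: /data/keys/mongodb.key

  # Option 2: master key stored on a KMIP server.
  # kmip:
  #   serverName: kmip.example.internal
  #   port: 5696
  #   clientCertificateFile: /data/keys/client.pem
  #   serverCAFile: /data/keys/ca.pem

  # Option 3: master key stored in HashiCorp Vault.
  vault:
    serverName: vault.example.internal
    port: 8200
    tokenFile: /data/keys/vault-token
    secret: secret/data/psmdb/node-1    # unique relative path per node
    serverCAFile: /data/keys/ca.pem
```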
Since only a limited number of versions is kept, uploading a new version beyond that limit evicts the currently oldest one — and uploading happens every time we create or rotate the master encryption key. To prevent losing a key this way, we should increase the number of kept versions by setting the max versions parameter on the secrets engine, ideally right when initializing the engine.

To configure Percona Server for MongoDB to work with KMIP, we should provide a server name, a port, and a pair of certificates for mutual authentication. When Percona Server for MongoDB first starts with this configuration, it internally generates a master encryption key and registers it on the KMIP server. In response to the Register request, the KMIP server returns the key's identifier, which goes into the metadata in the data directory. As a result, there is no need to specify the identifier on the next restart of Percona Server for MongoDB: it will read it from the metadata and get the key from the KMIP server. Configuring Percona Server for MongoDB to work with Vault follows the same general principles as with KMIP. For Vault we use an authentication token instead of a client certificate, but more importantly, we should specify the absolute secret path in the configuration. On the first start, Percona Server for MongoDB uploads the generated master encryption key to Vault under the configured path and gets a secret version in response, which goes into the metadata together with the path; on subsequent restarts, the path and version from the metadata are enough to fetch the master encryption key from Vault.

We all periodically change the passwords on our personal computers, and in the same way we should periodically change the master encryption keys in our database deployments. To do so with KMIP, we stop Percona Server for MongoDB and start it again after setting the rotate master key parameter. In this mode, Percona Server for MongoDB doesn't serve any user queries; it gets the master encryption key from the KMIP server and decrypts the encryption key database, then generates a new master encryption key and registers it on the KMIP server. The new key is used to re-encrypt the encryption key database, and its identifier replaces the old one in the metadata. Then Percona Server for MongoDB exits, concluding the rotation. Now we just need to unset the rotate master key parameter and start the service again; this time it is already using the new key and serving client queries. Master key rotation usually doesn't take much time, because thanks to the two-level design we saw earlier, it only needs to re-encrypt the encryption key database, which is usually small compared to the whole dataset. Rotating the master encryption key with Vault is carried out the same way as with KMIP: when Percona Server for MongoDB uploads a new master encryption key to Vault, it basically creates a new secret version under the specified path, potentially evicting the currently oldest secret — that is why it is important to set the max versions parameter to a high enough value.
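On the Vault side, raising the number of kept versions and inspecting a node's secret could look roughly like this — the mount and secret paths are the hypothetical ones from the config sketch above:

```shell
# Enable a KV v2 (versioned key-value) secrets engine at the "secret" mount.
vault secrets enable -path=secret kv-v2

# Keep more versions than the default 10, so rotations can't evict a live key.
vault write secret/config max_versions=30

# Inspect the node's secret; the kv CLI adds the "data/" segment for you.
vault kv get -format=json secret/psmdb/node-1
```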
Even if we have all security practices in place, we should still think about the measures we would take in the case of a security incident. For instance, if our database deployment has several hundred nodes and one of the master encryption keys gets compromised, we should be able to quickly identify and isolate the affected node and replace the key. Fortunately, Percona Server for MongoDB can help us here with its soon-to-be-released feature called state polling. The feature relies on the state attribute that each managed object has in KMIP. When a secret key is first registered on a KMIP server, it gets the Pre-Active state, and if the feature is enabled, Percona Server for MongoDB sends an activation request to the KMIP server right after registering the key, thus transitioning the key into the Active state. It then periodically polls the key's state to check whether the key is still active. If one of the keys gets compromised, a security engineer just needs to put it into any state other than Active; the affected instance of Percona Server for MongoDB will detect that no later than the configurable polling interval and stop the service, signaling that action has to be taken. You then only need to rotate the master encryption key on that particular node to bring the whole database cluster back to safety.

We have briefly reviewed the key management systems supported by Percona Server for MongoDB, and you may be wondering which one you should use. As usual, it depends, but there are some things to consider. KMIP is an open standard with many implementations available; however, the majority of those implementations are commercial, so some licensing cost is involved. The only free and open source implementation I know of is PyKMIP, under the Apache 2.0 license, but it is more a tool for demonstration and testing than software for production. Vault, on the other hand, is a specific piece of software, not a standard, so you have only a single vendor; on the flip side, its Community Edition can be freely used in production in many cases. It is also worth noting that configuring Percona Server for MongoDB to work with Vault involves slightly more hassle than doing so for KMIP. And, interestingly, Vault's Enterprise Edition implements KMIP as well.

It's probably time to draw some conclusions. As with many other security features, we recommend considering data at rest encryption when you are just starting to plan your future database deployment, because encrypting an already existing cluster is feasible but can be a tedious task. It is also important to do regular backups of the key management system, be it KMIP or Vault, and those backups should capture all the master keys without omissions. And when choosing between these two options, you should consider both the operational complexity and the financial cost of running each of them. Finally, I'd like to point to some resources: the KMIP specification version 1.0, the PyKMIP repository on GitHub, and Vault's documentation. You can find much more information about Percona Server for MongoDB on the Percona blog and in the documentation, and if any questions arise later, you can ask them on the community forum. That's probably everything I wanted to tell you today. If you have any questions, I will be glad to answer them. Thank you.

Thank you so much. We have lots of time for questions, but as usual, if the next speaker is in the room, I'll invite you to come down. Igor, are you here? — He will arrive in time. — Okay, well, in the meantime the floor is yours. Any questions on this topic? In that case, I have one, because I feel like I missed it. The licensing status is of course something that concerns me personally, but you talked about a cost-complexity trade-off between using a KMIP implementation that's not Vault, and using Vault. Was the complexity just the need to poll for compromise, or were there other complexities that Vault solves? — I mean that, on the one hand, KMIP seems to be simpler to configure and operate, but the implementations cost some money.
On the other hand, Vault requires more discipline in operation and configuration, but its Community Edition can be freely used unless you provide Vault-hosted services or otherwise compete with HashiCorp. — I'm aware of HashiCorp's approach, and it's quite good, but it's not open source, so I was intrigued to understand what the trade-off is. — Neither of them can be considered open source: Vault is under the Business Source License, and KMIP is an open standard that is published, but its implementations are not open in most cases. There aren't any open... well, the only one I named was PyKMIP, and it's more of a demonstration tool, because from my point of view the security of that implementation isn't of the highest standard. — That's a legitimate beef. If the only pure-FOSS implementations available are compromise choices, then that's an important fact. — Of course, we would all be happy if we had a truly open source KMIP implementation. — Something to look forward to. Thank you so much. — Thank you. We'll resume at 3pm with maintenance for MongoDB replica sets.

We have Igor to talk to us about maintaining replica sets in MongoDB. Please go ahead. — Thank you, everyone, for joining this session. We will be talking about maintenance for MongoDB replica sets. A few words about me: my name is Igor, I'm based out of Skopje, Macedonia, and I work as a principal consultant supporting both MySQL and MongoDB technologies. I've picked up some certifications over the years along with the experience — I just wanted to see what it's like to be challenged to do a certification. I mainly do automation with Bash and Python. I'm also a husband and a father, I like to travel, and I occasionally speak at conferences. Some of my social links are here, mainly LinkedIn and Twitter.

On our agenda today: we'll explain MongoDB replication 101, beginner friendly; then jump into replica set configuration and some of the deployment topologies; we'll talk about replica set maintenance — how to reconfigure, how to make hardware changes or do OS patching without any downtime; and we'll have some Q&A at the end.

In essence, replication is maintaining the same dataset across multiple machines, and with MongoDB it's no different: processes that maintain the same dataset. Replication gives us redundancy and high availability: in case one of the instances goes down, there is another instance to pick up. With MongoDB, replication works in such a way that an odd number of nodes participate in elections, and failover is automatic. If the primary goes down, the two other nodes — in this case the secondary nodes — hold an election, and one of the secondaries is promoted to primary. If you have an odd number of nodes, like 3 or 5, the majority needed to win an election is 2 or 3 respectively. If you have 6 nodes instead of 5, you are not adding anything for redundancy, because you still need a majority of 4 nodes out of 6; that's why it's important to have an odd number of nodes. With that, you have fault tolerance.

The replication concept in MongoDB is that all the writes from the application layer go through a driver, which connects to the primary. The primary stores the writes in its operations log, the oplog, and the oplog is asynchronously copied to the secondaries, where the secondaries apply the operations. The secondaries also keep a local oplog, and that oplog can be used to sync another secondary.
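As a quick aside, the oplog is an ordinary capped collection you can inspect yourself. A small sketch in the mongosh shell — the collection and helper below are standard MongoDB, shown here just for illustration:

```javascript
// The oplog lives in the "local" database as the capped collection oplog.rs.
const local = db.getSiblingDB("local");
// Look at the most recent operation that was replicated.
local.getCollection("oplog.rs").find().sort({ $natural: -1 }).limit(1)
// Summarize the oplog size and the time window it currently covers.
rs.printReplicationInfo()
```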
Chaining, which is allowed by default, means the replica set members don't have to ask the primary for sync operations: they can also sync from another secondary, which is a good improvement because the primary will not get overloaded. By default, the size of the oplog — the log where all the operations are saved — differs between storage engines: the lower bound is 50 megabytes for in-memory engines and about 1 gigabyte for data on storage, and the upper bound is 50 gigabytes. But it's configurable, and starting with MongoDB version 4.4, not only can you set the oplog size, there is also a configuration variable called oplogMinRetentionHours. So you can say: I want 72 hours of oplog retention, which means that when you write to the database, that write is retained in the oplog for 72 hours before it can be overwritten — the oplog is basically a circular log that gets overwritten.

Deploying replica sets: let's assume we're deploying a basic 3-node replica set where each node is a VM in a virtual private cloud, or physical hardware in an isolated network. We need to create a configuration file with some basic options; in the configuration file there is a section that specifies the replica set name. Using that replica set name, we go to one of the 3 nodes and initiate the replica set. Once the replica set is initiated, we confirm that its configuration looks exactly how we want it to look, and then there is a command to add the remaining nodes to the replica set, rs.add — all of this is done from the MongoDB shell connected to the primary node. At the end we finish with a running 3-node replica set. rs.status is the command to check the current status of the replica set; it shows the status of each node, which can be primary or secondary, and for failed or unhealthy nodes there may be other statuses.

Some of the configuration options and limits for MongoDB replica sets: you can have up to 50 members, and 7 of those 50 members can be voting. There is also the option to add an arbiter node; the arbiter node is different in that it doesn't hold any data — its role is only to participate in elections. And you can have different kinds of secondary nodes by changing the priority, or by making a node hidden or delayed; I'll be covering those right after. The arbiter node is just a process in a replica set that participates in elections. Say you have a 3-node replica set but you don't actually want 3 copies of the data: you have a primary and a secondary, and the arbiter is just a single process on a very light machine. It watches the primary and the secondary, and if it loses the connection to the primary, it asks the secondary: do you see the primary? If the secondary doesn't see the primary either, the arbiter votes and says: okay, now you will become the new primary. It's just for elections, and it's useful for high availability: if you have 2 data centers, you can survive the loss of a single data center, and if the arbiter is deployed in a 3rd data center, it's very unlikely to lose connectivity between all 3 data centers, so data center failure is covered as well.

A priority 0 node is basically a node that cannot be elected as primary. If a primary fails, there is an election among the rest of the nodes, and any node configured with priority 0 will not stand in the election to become primary. A hidden node is also a secondary node, but it's priority 0 and is not visible to the application, because the driver excludes hidden nodes from the application layer.
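Putting the deployment and member options above together, a sketch in mongosh — the host names are hypothetical, and secondaryDelaySecs is the current name of the setting older versions called slaveDelay:

```javascript
// Initiate a 3-node replica set (run once, on one of the nodes).
rs.initiate({
  _id: "rs0",   // must match the replSetName from the configuration file
  members: [
    { _id: 0, host: "db1.example.internal:27017" },
    { _id: 1, host: "db2.example.internal:27017" },
    { _id: 2, host: "db3.example.internal:27017" }
  ]
})

// Later, turn member 2 into a hidden, delayed, non-electable secondary.
cfg = rs.conf()
cfg.members[2].priority = 0               // can never become primary
cfg.members[2].hidden = true              // invisible to the driver
cfg.members[2].secondaryDelaySecs = 3600  // replicate one hour behind
rs.reconfig(cfg)
```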
Hidden nodes are usually useful if you do some reporting or want to run backups: you connect directly to that node to do analytics or run backups, but the driver will not see it, so the applications are not impacted. A delayed node is also a priority 0, hidden node the driver will not see, where you set a custom delay so that replication to that node lags by the number of seconds you specify. If you want a node for disaster recovery — say, 6 hours behind — you configure a secondary as delayed, and then in the case of a human error or something going bad, you can extract data from that secondary hidden delayed node; it works very well for disaster recovery. You can even stop the node and apply operations up to the second just before the human error happened — say somebody deleted user records and you want to restore them.

So let's jump into some of the maintenance operations you may end up doing. There are different schools of thought — some think maintenance should be tested in production — but I believe that if you do maintenance, it's better to test the plan in a pre-production environment like QA or staging first. Before jumping into any maintenance, the first thing to do with any database system is to monitor its status: if you don't monitor the status, there might be nodes down, and if you start patching a healthy node and stop it while other nodes are down, that may cause downtime. There are tools out there, open source and proprietary, that you can use; in this case, this is a dashboard from Percona Monitoring and Management, which is an open source tool. Here I have some red flags you should be careful about with MongoDB replica sets. The worst thing that can happen in a replica set is that it doesn't have any primary node — running a replica set with no primary is definitely something to trigger alerts on. Other than that, some important things to monitor, maybe not as critical: any host in your replica set being down, a secondary stuck in the recovering state that doesn't replicate from a primary, or a huge replication lag — all those are maybe not as critical, but they're close. And there are some other, more informational monitoring alerts — a host suddenly restarting or rebooting, or your backup status — those are not critical, but you want to pay attention to them.
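For a quick manual check of the same signals — who is primary, member health, replication lag — a sketch in mongosh; the field names follow the standard rs.status() output:

```javascript
// Compare each member's last applied operation time with the primary's.
const status = rs.status();
const primary = status.members.find(m => m.stateStr === "PRIMARY");
status.members.forEach(m => {
  const lagSecs = primary
    ? (primary.optimeDate - m.optimeDate) / 1000
    : null;  // no primary at all: the worst case, alert immediately
  print(`${m.name} ${m.stateStr} health=${m.health} lag=${lagSecs}s`);
});
```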
One of the first things that might come up with a replica set is realizing you have a very low replication window: your secondary replicates from the primary, but if a node fails, you have, say, three hours of replication window to bring the failed node back up, and that's usually very low. So the advice is to watch the oplog size and the replication window, and to increase them when needed. In older versions it was more difficult to increase the oplog size; in newer versions that's improved, and you can dynamically resize the oplog. You will need to do some math: what your current oplog size is, how much free disk you have, and what the desired oplog size is. In some of the newer versions there is also the oplogMinRetentionHours parameter: you can set it to 48 hours, and you always have at least 48 hours of oplog. The one caveat is that you will need to pay attention to how much of your disk space the oplog consumes, because it can eat a lot of disk just because you set a huge retention. (A sketch of the resize commands appears at the end of this passage.)

Other replica set maintenance changes you may end up doing come from adding nodes to the replica set, or reconfiguring a node: changing its priority, putting it into hidden mode, or delaying its replication. Every replica set has its configuration in the output of rs.conf. You store that output into a local variable, because the MongoDB shell is more or less JavaScript, and on that JavaScript object you can make whatever changes are required: change the priority of certain nodes, make a node hidden, or set a replication delay of, say, one hour. Once you've made all the required changes, the next step is to apply that config back to the replica set, and there is a replica set helper that does exactly that: rs.reconfig, to which you pass the configuration object you just modified. Also, if there is no primary, maybe because the primary failed and the replica set is running with secondaries only, there is the option to use force: true, which forces the replica set to accept your modified configuration.

Each replica set, from the driver-level perspective, when the application connects, has options for read preference, and there are multiple read preferences in MongoDB. The default is primary: all reads from the connected client application go to the primary. There is primaryPreferred, which means the primary is preferred, but if there is none, reads go to a secondary. SecondaryPreferred is the other way around: you prefer a secondary for reads, but if none is available, reads go to the primary. The secondary read preference always goes to a secondary. And nearest is based on network latency: the driver connects to whichever node is fastest from its point of view. With MongoDB 3.4 and later, a secondary serving secondaryPreferred reads needs to be within 90 seconds of replication lag (via the max-staleness option); if it lags more than 90 seconds, reads go to the primary. And this is how it looks if you have a global replica set across data centers or availability zones around the world: South and North America, Europe, Asia, Australia. You have one primary and several secondaries, but only 7 of the nodes will be voting: if the primary fails, out of the total nodes, which is 50 max, only 7 will vote.
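Coming back to the oplog window from the start of this passage, the dynamic resize looks like this; a minimal sketch with illustrative numbers:

```js
// mongosh sketch: replSetResizeOplog takes a size in megabytes and, in
// MongoDB 4.4+, an optional time-based retention floor. Run it on each
// member whose oplog you want to resize.
db.adminCommand({ replSetResizeOplog: 1, size: 16000 });           // ~16 GB
db.adminCommand({ replSetResizeOplog: 1, minRetentionHours: 48 }); // keep >= 48h

// Check the resulting oplog size and replication window:
db.printReplicationInfo();
```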
And this is where tags are useful. With MongoDB you can have tags, and tags are something applied per node. Say you have data centers in Europe, North America, Africa, Asia-Pacific: you can tag each member of the replica set with its data center and its usage. For example, in the Europe data center one node can be usage production and another usage reporting, so when the application connects and queries for a document, it can specify the tags, data center and usage. But you need to be careful: if you configure your members with data center and usage, and later query from your driver with a data center and usage combination that doesn't exist in your replica set configuration, meaning no node carries those tags, you will get an error in your application. So it's important to keep an eye on which tags are configured and how you query from the application. (A sketch follows at the end of this passage.)

Another thing that might happen in replica set maintenance: a secondary goes down, stays down for a long time, and when it rejoins the replica set it has lost its oplog position, so now you need to re-sync that member from the beginning. MongoDB handles that automatically: you just remove everything from the data directory, rejoin the node, and it starts an initial sync. Or, if you have a very large dataset to copy, you can lock another secondary, take a snapshot, restore it onto the failed node, and it rejoins from that position.

Similarly, if you are doing hardware or OS upgrades, network changes, anything that touches the primary/secondary/secondary topology, you start any maintenance on a secondary node. If you have a hidden, delayed, or priority 0 node, that is the preferred node for the initial upgrades and patches, because those nodes have the least impact on the application. So whatever you are doing, do it on those nodes first. Then, once all the secondaries are patched in a round-robin fashion, you step down the primary; the primary is the last node. There is a command for the step-down, an election happens, one of the secondaries gets promoted, and all the steps you did on the secondaries you now do on the former primary.

Database version upgrades are another annual maintenance item for a replica set: you need to upgrade to the latest version. With MongoDB, for the latest major upgrade, which is 7, you need to be sure your current version is 6; you can't jump from version 5 to version 7, you go one version at a time. And once you are on version 7, you can't go back below 6: you can only downgrade to version 6, not version 5. Previously it was possible to downgrade multiple versions, but now it's not. It's the same process: you upgrade the secondaries, and step down the primary at the end. With every new version there is also a feature compatibility version that you need to enable manually after the upgrade.

Index builds: same as all other maintenance, you can build an index as a non-blocking operation, or you can take one node out, build the index there, rejoin it to the replica set, and repeat the process. Backups: you lock the database on a secondary, preferably a delayed node, take a snapshot, and unlock the database. Especially in the cloud, that is very convenient.
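Returning to the tags example from the start of this passage, a minimal sketch of tagging members and targeting them from the driver; tag names are hypothetical:

```js
// mongosh sketch: tag members by data center and usage.
cfg = rs.conf();
cfg.members[0].tags = { dc: "europe", usage: "production" };
cfg.members[1].tags = { dc: "europe", usage: "reporting" };
rs.reconfig(cfg);

// Driver side: target reporting nodes in Europe. Remember the tags must
// actually exist in the configuration, or reads will error out.
// mongodb://node1,node2,node3/?replicaSet=rs0
//   &readPreference=secondary
//   &readPreferenceTags=dc:europe,usage:reporting
//   &maxStalenessSeconds=90
```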
I have a slide showing how to implement TLS, but here's a quick summary. A MongoDB replica set should always run with an odd number of voting members. Hidden or delayed members can be used for reporting and backups. Don't run a setup where a single node is always preferred as primary because every other node has lower priority. Do any maintenance on the secondaries before applying changes to the primary. Have monitoring and alerting, especially for the replication window and for nodes going down. Take backups regularly and do restore tests. And don't run your database open: enable authentication. And I'll take questions before you go into the break.

Any questions? I have one little one: you mentioned the option, when updating a replica set configuration, to force it; what are the circumstances under which you would or wouldn't want to force? Force exists to really push your configuration and ignore whatever state the replica set is in. The replica set can sometimes be in a bad state: the primary failed and there is no primary to elect because there aren't enough voting nodes, and in that case you can force your configuration; the forced configuration gets a version bump, so it becomes the latest version. It shouldn't be used regularly. It's for those "the building is on fire, I really want to fix it this way, please do it this way because I'm certain" scenarios. Thank you so much. We now have a break, sponsored by our good friends at Penpot, until 4pm (I forgot to check), and we have two more speakers before the wrap-up.

Thanks again. Our second-to-last session is from King Butcher: from first principles. My name is King Butcher. Before we get into it, let me pose a question that you don't have to answer, just think about it in your head: if you were to write a system to track money using a database, what's the simplest solution that comes to mind? You'd probably do something like this: a table of amounts affecting different users, where the amounts are of different types, like what's withdrawn or deposited. Something as simple as this works for most cases, but it doesn't work if you want to make anything complex or scalable. Say you wanted to compute a balance really quickly, like how much money user 2 has right now: you can't do that efficiently with this, you have to scan all the rows. So instead you'd probably make a completely separate table to track the current balances. But now you have two tables that must be updated for every order, and it's starting to get a bit awkward. Then say you wanted to add another feature, like tracking reservations or tips: suddenly orders are multiple steps that have to be linked together into one final order on your invoice. So basically we went from one table to, like, three, and now we have to manage those and figure out how to do transactions and rollbacks, and with three tables to update, how do you optimize the queries? You have to optimize, one, how the SQL is executed, and two, how many statements you execute at once. So the problem becomes fuzzier and more awkward: we started off with one table trying to track transactions, now we have three, and we have to worry about problems we didn't really have before.
When I was doing this myself, to learn about financial transactions, I was wondering: why is this so awkward? Why do we have to do all these multiple steps? Isn't SQL pretty much made for banking? I looked online and figured maybe I was just doing something wrong, but the SQL examples online are pretty much the same thing. So, is this normal? There has to be a way to solve this that is at least somewhat easier and somewhat more practical. And this is why I made TigerBeetle.

TigerBeetle is pretty much a distributed financial transactions database; it's all in the name. TigerBeetle is open source under the Apache license, and it's written in an open source language called Zig, which is under the MIT license. TigerBeetle is designed to be fast, like really, really fast: normal SQL databases from before, at least their inefficient setups, can handle probably tens of thousands of transactions a second; TigerBeetle at the moment can handle, I think, hundreds of thousands, and our goal is eventually one million transactions per second. TigerBeetle is also designed to be safe: it doesn't matter how fast you can handle money if you start losing it or getting data corruption; that would be pretty bad. And finally, TigerBeetle is meant to be easy to use and a nice experience, not just for the people deploying TigerBeetle but also for the devs working on it. It would be pretty nice to have a good system that doesn't have you fiddling with settings to set up, or juggling multiple build languages, and stuff like that.

TigerBeetle is designed from first principles, which sounds like a buzzword, but it's something you probably do already, and it only comes down to two steps: first, break the problem down into its fundamentals to actually know what you're trying to solve, and then start solving the problem from there. It's not complicated at all. So let's go back to the previous example of tracking money. One thing we noticed is that any time you try to build a financial system on SQL, you eventually converge onto something that looks like double-entry accounting. In double-entry accounting, each account has two columns it operates on, debits and credits: debits are just money going out of the account, and credits are money coming in. The big thing about double-entry accounting is that for every action in the books there is always an opposite reaction: any time money comes in through credits, it must have come from somewhere else. Being able to track that "somewhere else" for every transaction in the system, one, makes it easier to follow when you later want to check the books and make sure everything is right, and two, allows computing balances really quickly: in a double-entry system, a balance is just the sum of debits and the sum of credits for a specific book. And the other invariant of a double-entry accounting system is that the sums of debits and credits across all books must cancel out to zero, which guarantees that no money gets lost along the way when you're tracking it. Now, we are programmers here, not accountants, so instead of books we just say accounts, and a paired debit and credit between accounts we call a transfer. These are the only two primitives that TigerBeetle exposes to the user. In code, that might look like this.
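Since the slide isn't reproduced here, a toy JavaScript sketch of the account/transfer model just described; this is illustrative only, not TigerBeetle's actual API or field layout:

```js
// Toy double-entry model: accounts track debit and credit sums, and a
// transfer is a paired debit and credit of the same amount.
const accounts = new Map(); // id -> { debitsPosted, creditsPosted }
const transfers = [];       // append-only: history is never edited

function createAccount(id) {
  accounts.set(id, { debitsPosted: 0n, creditsPosted: 0n });
}

function createTransfer(debitAccountId, creditAccountId, amount) {
  accounts.get(debitAccountId).debitsPosted += amount;
  accounts.get(creditAccountId).creditsPosted += amount;
  transfers.push({ debitAccountId, creditAccountId, amount });
}

// A balance is just the two sums combined...
const balance = (id) => {
  const { debitsPosted, creditsPosted } = accounts.get(id);
  return creditsPosted - debitsPosted;
};

// ...and across all accounts, total debits always equal total credits.
createAccount(1); createAccount(2);
createTransfer(1, 2, 100n); // debit account 1, credit account 2
console.log(balance(2));    // 100n
```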
The reason we don't store a single balance on the account, and instead keep separate debit and credit sums, is that certain accounts are allowed to have negative debits or negative credits, and we want the user to decide whether their balance is correct based on that. So in code, a double-entry accounting system is one where you have a set of accounts that get operated on by a list of transfers. We just add an API to interact with that, to create accounts and transfers and then look them up. Then we persist them to disk so they don't get lost over time, and so we don't need a huge working memory to hold everything. And finally we make it distributed, so that if the database node dies, at least there is some other computer in the background making sure the data stays alive, in the face of corruption and so on. And this is really all TigerBeetle is: a distributed financial transactions database. The whole design started from figuring out what the problem is, tracking money, and solving just that list: tracking money in a way that is reliable, fast, and pretty easy to understand, and I'll try to convince you it's pretty easy to use as well.

Before that, let's figure out how to make TigerBeetle fast, because that was one of the big questions when we were trying to optimize the SQL from earlier. Instead of SQL, let's look at its problems and why it didn't fit our use case in the first place. In SQL we had three tables to update for every order, and I don't think you can get rid of that; in TigerBeetle there will always be multiple tables to update if you want an extensive transaction system. But one thing you can do is have multiple money transfers share pretty much the same set of table updates: instead of updating multiple tables per single transfer, you update multiple transfers per set of table updates. You could do this in SQL too, but you would have to re-implement rollbacks and transactions yourself, because SQL normally does that for you on a per-insert basis, and that gets kind of complicated. Another thing we can do, since our focus is money and not general-purpose data querying, is remove the query language entirely. Since we're dealing with just accounts and transfers, we can make them plain structs instead of parsing an entirely separate language that may not fit our programming model; we're dealing with a ledger here, not a custom data store. And finally, since TigerBeetle is a transaction system, we're not doing a bunch of processing besides making sure the ledger is correct, so we don't have to optimize for compute nearly as much as SQL does. That means we can make our code single-threaded: no locks, no synchronizing between multiple threads, none of that. It also makes the code easier to follow, and it makes it deterministic, which helps down the line.

With these constraints, we basically can't use an SQL database to build TigerBeetle if we want to be as fast as we claim, so we kind of have to write our own database. So how do these databases actually work? The easiest example to understand, and the most widely deployed one, is something like a B-tree: you have the sorted blocks at the bottom, and intermediary nodes above them that let you reach a block without doing as many comparisons as there are sorted blocks below.
Say you wanted to reach node number four: you might only do two comparisons instead of walking six nodes at the bottom. That works well for single-key access. But in double-entry accounting, most of the system is just a list of transfers, and an immutable log at that: you never go back and edit history, you only keep appending more of it. If you do that in a B-tree, you keep traversing the tree to figure out where to add the next node at the bottom, and that traversal adds up to a lot of overhead. So we can avoid the B-tree here: there is this thing called a log-structured merge tree instead, or LSM tree for short. In an LSM tree you have all the data as before, but sorted into small blocks that get added at the top. The tree is structured into parts called levels, and each level is bigger than the last so it can hold more blocks. When one level gets full (or by other criteria), you run a step called compaction, which looks at blocks that overlap between levels, removes the duplicates, preferring the newer values, and compacts them into the next level. That's how you maintain a set of updated blocks while keeping appends fast: adding a new block at the top is cheap, and compaction happens later.

One issue, though: if you put everything into the same LSM tree, compaction has to churn through a lot of unnecessary data. Say you only updated the balance of a specific account; during compaction you're copying all the other fields of the account, or of the transfer, into the new block, which wastes disk and compute. Another issue is that certain fields of an account get updated much more frequently than others: the account balance changes on every order, while something like the account's creation time never changes at all, which again contributes wasted compute during compaction. So in TigerBeetle, what we do instead is keep an LSM tree for pretty much every field of accounts and transfers, as well as for accounts and transfers themselves. This removes most of the compaction overhead, and it also lets us query on those fields later when we expose the lookup functionality. So you could say the TigerBeetle storage engine is a forest of LSM trees stitched together to make the whole system work.
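Before we get to safety, a toy sketch of the flush-and-compaction mechanics just described; this is illustrative JavaScript only, nothing like TigerBeetle's actual Zig engine:

```js
// Toy LSM mechanics: an in-memory table absorbs writes, full tables are
// flushed as sorted runs into level 0, and compaction merges a full
// level into the next, keeping the newest value per key.
const memtable = new Map();
const levels = [[], [], []]; // each level holds sorted runs of [key, value]

function put(key, value) {
  memtable.set(key, value);
  if (memtable.size >= 4) flush();
}

function flush() {
  const run = [...memtable.entries()].sort(([a], [b]) => a - b); // numeric keys
  levels[0].push(run);
  memtable.clear();
  if (levels[0].length >= 4) compact(0);
}

function compact(level) {
  // Merge every run of this level; later runs win on duplicate keys,
  // because they are newer. The merged run moves one level down.
  const merged = new Map();
  for (const run of levels[level]) {
    for (const [k, v] of run) merged.set(k, v);
  }
  levels[level].length = 0;
  levels[level + 1].push([...merged.entries()].sort(([a], [b]) => a - b));
}

for (let i = 0; i < 32; i++) put(i % 10, `balance-${i}`); // overlapping updates
```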
But it's not enough to be fast; you also have to be safe, because no one wants to use a system that loses money on you, and I've had that happen myself. And you can't talk about safety without talking about memory safety; it's a pretty big topic now. It sounds unintuitive, but TigerBeetle tries to verify memory safety as much as it can, which is kind of hard in a language like Zig, with manual memory management, so we have a few tricks up our sleeves. If you look at what memory safety actually is, it splits into two categories, temporal and spatial: temporal memory safety is about making sure you access memory during the time it's alive, not after, and spatial memory safety is about making sure you access memory within the bounds it's allowed to be accessed in. We solve these in a few different ways.

For temporal safety, one of the issues is that you can access something after you free it, or after it becomes invalid to access. We solve that with static allocation: TigerBeetle simply never allocates while it's running; we just allocate everything we need up front and use that for the entire duration of the program. That's a pretty odd tactic, because something like querying means you send a small request and could get a large amount of data back, so how do you handle that with static allocation? What we do is stream the data back, like pagination: if a small query produces a large result, we split it into small blocks that you have to request multiple times, so everything stays bounded. But that only covers the double-free style of problem. There is also a kind of use-after-free that isn't really about memory, but about the logical validity of the thing you're accessing: say this integer is valid for this duration, and after that it's pretty much garbage, and if you try to use it, it's going to mess up your program. The way we avoid that is asserts, everywhere in the code: half our functions are asserts, half the system is pretty much asserts; we write it like we're writing drivers. This has a few benefits: we can encode invariants that would normally live in comments directly in the source itself, and keep them checked even in release builds, and it's not as expensive as you might think, given TigerBeetle is single-threaded. The other aspect of memory safety is bounds checks, which we handle by deferring to the Zig language as well as the asserts: Zig does automatic bounds checks for all arrays and slices, which makes sure that even if we got the asserts wrong, the language will still cover us.
The language also covers us for null pointers by having optionals in the type system, so we can't accidentally do a null-pointer dereference: it always makes us check before using the value. The main thing we do for safety, though, is make sure the system is durable, that your money is durable. The way we do that in a distributed fashion is replicate the database to other computers, using a protocol called Viewstamped Replication. It's similar to Raft and Paxos, if you've heard of those: a way to send data across the network and make sure all the data stays synchronized. VSR gives you the synchronization aspect, and it handles the network failing on you, or a node going down, coming back up, and re-syncing everything. But it doesn't by itself save your data from being destroyed. What we can do is rope a storage fault model into VSR: instead of just saying this node is replicated, we can also say that if this node's storage goes bad, we can fix it from the replicated nodes. What allows us to do that is something called storage determinism: because TigerBeetle is single-threaded and deterministic, we know that if any other node ran the same operations, it should have ended up with exactly the same storage in its copy. Using that, if our copy gets destroyed, we know everyone else's should be correct, so we can ask them for the correct value. With this we can recover from storage faults, and even from storage misdirection, where the disk messes up and tells you a write landed in the right place when it's wrong; we can still correct that by using the other nodes. To do that, we had to mess up the nice, fully abstracted scheme we had before: the VSR protocol has to reach into the storage to make sure everything is correct, so we added a grid there, just to be able to fix the storage if it misbehaves.

The final thing we do for safety is a simulator. Remember that TigerBeetle is deterministic, which means we can plug in pretty much any component we want, for any part of the system, and figure out whether it breaks the whole. So we do lots and lots of fuzzing and lots and lots of fault injection: the deterministic aspect means we can drop in our own abstraction for the disk or for the network and simulate faults, like dropping packets arbitrarily or corrupting the disk at random, and make sure the system still holds up. We're even so confident about it that we have a GitHub bot that continuously fuzzes the TigerBeetle repo, and once it finds a bug, it posts a GitHub issue with a little stack trace to help us figure out how to reproduce it. And we can always reproduce, since it's deterministic: we just plug in the random seed and it will always generate the same outcome, and then we get to debugging. If you're not aware, we also made the simulator into a game that you can interact with in a browser, using the link below. It's literally TigerBeetle source code compiled into wasm, running in your browser and visualized with graphics. This is the actual database running, in case you're interested and want to poke around and see if it's actually as safe as it claims.
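As an illustration of the seed-replay idea, a toy sketch in JavaScript; this is emphatically not TigerBeetle's simulator, just the principle that one seed drives all randomness:

```js
// Toy deterministic simulation: all randomness flows from one seeded
// PRNG (mulberry32), so any failing run replays exactly from its seed.
function mulberry32(a) {
  return function () {
    let t = (a += 0x6d2b79f5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const seed = Number(process.argv[2] ?? Date.now());
const rand = mulberry32(seed);

// Inject faults: deliver each of 1000 messages to two replicas,
// dropping roughly 10% of them at random.
const replicas = [[], []];
for (let i = 0; i < 1000; i++) {
  for (const log of replicas) {
    if (rand() < 0.1) continue; // injected fault: dropped message
    log.push(i);
  }
}

// Check an invariant; on failure, re-run with the printed seed to get
// the exact same fault sequence (a real system would repair the gap).
console.log("seed:", seed,
  "replicas in sync:", replicas[0].length === replicas[1].length);
```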
Finally, TigerBeetle is meant to be nice to use, and one of the ways we do that is by making it a single binary. When you compile TigerBeetle, it's just one binary you have to deploy, and that's it: there are no extra configuration files, no extra read replicas to set up. It's a single binary you create the database file with, and then you just start it on that same database file; that's pretty much it. This makes the deployment process super easy: you can use Docker, or even a plain native setup, without too much hassle. We also added a little REPL, so you can connect to the database and pretty much play with it without having to set up a custom Python client or a custom Node.js client or anything like that; just a quality-of-life improvement we noticed was lacking from complicated systems in the past.

Another thing that helps is the use of the Zig language itself. In Zig there is no hidden control flow: there's no operator overloading, so you can't make equals do something else, or a comma do something else, or plus surprise you; there are also no hidden exceptions, so all control flow is handled by an explicit keyword of some sort. This makes code easier to read and also easier to review. Another good feature of Zig is comptime. I was trying to think of a way to explain how good this is, but I don't think I can accurately convey the utility of comptime without you actually writing code, so if you're interested in Zig, just check it out and see what you can do with comptime. All I will say is that comptime makes making things pluggable, which is how we do fault testing, way easier than in any language I personally have experience with, and that's quite a few of them. And finally, Zig is pretty much the only dependency TigerBeetle has. Zig is a C compiler, so we can compile C code: we don't need node-gyp for our Node client, or GCC for a Python client, or anything like that; we just use Zig for all of it. Zig is also a cross-compiler, so with the Zig executable we can compile from any target to any target. And the Zig build system lets us use Zig the way we would use CMake or Gradle. The TigerBeetle experience is nice enough that people have taken our components and made them into their own libraries: someone took the I/O stack of TigerBeetle and made it into a standalone Zig library, and people unironically use it to make their own services. We're kind of thinking of doing the same for our VSR protocol, but that's a thing for down the road.

So to recap: all we did was take a well-established financial system, double-entry accounting, and make a distributed financial transactions database from it. That's pretty much all TigerBeetle is: we had a problem, and we solved it. If you'd like to learn more about TigerBeetle, you can go to the QR code down below. Thanks for listening.

Thank you so much. Before we take Q&A: Richard, could you please set up now? We'll take one question while you're setting up, since we don't have a 5-minute break between talks. Do we have questions for King on TigerBeetle, or has he covered the subject so completely that there is nothing left to ask? I have a question. The usual accounting structure is that you've got accounts and entries, and the entries are grouped into a transaction, because there's no assumption that a transaction only affects two accounts: it can affect an arbitrary number, so long as it's in balance. Does that work, or does it require redesigning the definition of transfers?
Not necessarily: that can be fit into double-entry accounting by using multiple accounts, or you can use our batching system from before; remember, TigerBeetle has a way to link transfers. (Follow-up: if you're selling a thing, there's an entry for accounts receivable or cash paid, an entry for goods sold, but also shipping expense, GST, and all of those have to happen together or not at all.) Yes: our batching system lets you link transfers to each other and evaluate them as completed or not completed together, and that lets you represent multiple ways of doing multi-entry transactional systems. It's a workable approach. Alright, thank you, King.

Finally, last but by no means least, we have Richard Gao from Microsoft to talk to us about Cosmos DB, assuming the technology gods are with us. We'll look at how you can go from onboarding to production on Azure Cosmos DB, using a process divided into three parts: data modeling and estimating the capacity to provision, optimizations that can be applied for application performance, and finally moving to production. Cosmos DB is a managed service, which means all the infrastructure is taken care of for you. Our example is a fleet application for this company that needs to be highly available, because we don't want to lose the status of the vehicles in our fleet; it looks like Azure Cosmos DB can cover that use case, so let's quickly set it up.

You start by choosing the API you want according to your use case; we will continue with the NoSQL API. All you need to do is provide the account name, the region in which you want to create the account, and the capacity mode: whether you want a serverless account or a provisioned one. The resource model is pretty simple: we have databases, and within them containers; think of containers as tables in the relational world, and the data is stored in JSON format, which gives you schema flexibility. Finally, to create a container, you provide the database name and container name, which is easy; then the partition key (what exactly that is, we'll talk about shortly); and then the throughput, which is basically how you set up the capacity for the container. That's all it takes to create a container and start using the Cosmos DB account.

Talking of capacity: request units (RUs) are how you configure capacity in Cosmos DB. An RU is nothing but an abstraction, a currency over the resources a request needs in order to execute, like memory, IOPS, and CPU. It's an abstraction we provide from our side so that you only have to think in terms of one resource, not several. And it can be provisioned both at the database level and the container level, meaning you can have dedicated throughput for a container, or shared throughput across multiple containers when you configure it on the database. Let's quickly see how you can find RU usage using the SDK. This is what a sample telemetry document looks like: a JSON document with 15 or so properties. And this is a snippet from my IntelliJ IDE; I'm using the Cosmos DB Java v4 SDK, so I'm writing the code in Java, but we have SDKs available in all the major languages, like Python, Node.js, Go, Java, and .NET.
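The talk shows the Java v4 SDK on slides; since the code isn't reproduced here, a rough equivalent of the setup with the JavaScript SDK (@azure/cosmos) might look like this. Database, container, and partition key names are hypothetical:

```js
const { CosmosClient } = require("@azure/cosmos");

async function setup() {
  const client = new CosmosClient({
    endpoint: process.env.COSMOS_ENDPOINT,
    key: process.env.COSMOS_KEY,
  });
  const { database } = await client.databases.createIfNotExists({ id: "fleet" });
  const { container } = await database.containers.createIfNotExists(
    { id: "telemetry", partitionKey: { paths: ["/vehicleId"] } },
    { offerThroughput: 10000 } // RU/s, dedicated to this container
  );
  return container;
}
```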
All I'm doing is writing the document into the container, getting the request charge from the response, and similarly reading the same document back and getting the request charge from that. When I run this code, you can see the request charge to write one item is approximately 10 RUs, while it took only 1 RU to read the same document. (A sketch of reading the charges back appears at the end of this passage.)

For our use case, let's try to estimate the capacity we need for the container. Assuming our Contoso Logistics application has 1,000 vehicles in the fleet, and each sends one document every second, we need the capacity to write 1,000 documents per second, and each write consumes 10 RUs; that means we need 10,000 RUs per second. That much capacity should be enough to ingest the data from all the vehicles in our fleet. We also have a capacity calculator, which is just a fancy layer on top of these calculations: it lets you estimate the RU capacity and also shows the associated cost for the different provisioning models.

Now that we've seen how to estimate capacity and what request units are, let's look at the provisioning models Cosmos DB offers for different use cases. If you don't want to deal with capacity estimation at all, or you're just starting to develop your application and don't know how many resources it is going to consume, you can provision a serverless account: there is no need to estimate any capacity, and it's a consumption-based pricing model where you only pay for the throughput you actually use. It's quite good for customers who are just starting out on Cosmos DB, who are testing their workloads, or whose database access is so infrequent that it's mostly idle; all you need to do is choose serverless while creating the account. Coming to provisioned capacity, we have two models. One is autoscale: a provisioning model where you configure a maximum capacity, you can scale between 10% and 100% of that capacity instantaneously, and you only pay for the throughput you are using at that particular moment, not for all of it all the time. For example, if you are using only 40% of capacity at some moment, you only pay for 40%, and the rest of the time it can stay at just 10%. It's quite good for use cases where the traffic is unpredictable and changes many times a day, where it's difficult to estimate the capacity or wasteful to always pay for maximum resources. Finally, we have the manual provisioning model, which is good for predictable workloads and is quite cost-friendly and easy to configure: a fixed capacity that you set, which is available, and payable, at all times.
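Returning to the write/read request-charge example from the start of this passage, reading the charges back looks like this in the JavaScript SDK (a sketch; the talk shows the Java equivalent, and the document shape is hypothetical):

```js
// Runs inside an async function; `container` comes from the setup
// sketch above. Numbers match the ones quoted in the talk.
const doc = { id: "v42-1700000000", vehicleId: "v42", speedKmh: 73 };

const writeRes = await container.items.create(doc);
console.log("write RU:", writeRes.requestCharge); // ~10 RU

const readRes = await container.item(doc.id, doc.vehicleId).read();
console.log("read RU:", readRes.requestCharge); // ~1 RU
```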
Moving on to partitioning: partitioning is a core concept in horizontal scaling. Do you remember that when we were creating the container, it asked for the partition key in the list of things you needed to provide? The reason is that partitioning is the key to scaling: we ask for a partition key because that's how we distribute data across multiple nodes, so the container can scale horizontally, as opposed to vertically, where you use only one large machine to store all the data. As you add more data, the container keeps scaling out, adding more and more nodes to accommodate that capacity. Think of it as working in a team, where the workload is divided equally among all the team members. This is how the data distribution model looks: whatever partition key you define, documents belonging to the same partition key always go to the same logical and physical partition, and different keys spread across different physical partitions. So it's very important to choose the right partition key, one that ensures the data and the requests are equally distributed among all the partitions, to maximize the throughput you can get out of the database. The general best practice from our side: for write-heavy workloads, use a partition key with very high cardinality, so you can concurrently write to multiple physical partitions at the same time; for read-heavy workloads, always do key-value lookups, querying with both id and partition key, so the exact document is fetched from the container instead of going through the query engine, which always consumes 1 RU. And for queries, always include the partition key in the filter, so the query is directed to one specific physical partition: the query is faster and fewer resources are consumed to execute it.

Let me show you a quick demo of how a good versus a bad partition key makes a difference to the application's performance. I have one container, v1, which has id as the partition key, and a second container, v2, which has username as the partition key. Both containers have the exact same data, the exact same number of documents; the only difference is the partition key. What I want to do is find the documents for a username called Tori Bahringer. When I run this query against the v1 container, you can see the request charge is 82 RUs, while in the second case, against v2, where username is both the partition key and the filter in our query, it took only 62. This difference can become huge as your data size grows, so it's important to choose a partition key that is efficient for your queries, so you end up doing single-partition queries instead of fanning out to multiple partitions.

Now that we have understood the core concepts and set up Cosmos DB, let's move on to development from the client side. The first thing to keep in mind: whenever you want to do a key-value lookup in Cosmos DB, you should always use point reads, which is an API in the SDK. Here we're using the read-item API, where you supply the id, which is unique, plus the partition key; this always consumes 1 RU and takes very little time to execute. Compare that to finding the same document with a query: it not only consumes more resources, it also takes more time. So it is advisable to always use the point-read API for key-value lookups.
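A sketch of point read versus query with the JavaScript SDK (ids and property names hypothetical):

```js
// Inside an async function.
// Point read: id plus partition key, a fixed ~1 RU:
const { resource: byKey } = await container.item("v42-1700000000", "v42").read();

// Equivalent query: goes through the query engine and costs more RUs:
const { resources: byQuery, requestCharge } = await container.items
  .query({
    query: "SELECT * FROM c WHERE c.id = @id AND c.vehicleId = @pk",
    parameters: [
      { name: "@id", value: "v42-1700000000" },
      { name: "@pk", value: "v42" },
    ],
  })
  .fetchAll();
console.log("query RU:", requestCharge);

// For unavoidable cross-partition queries, let the SDK pick the fan-out:
// container.items.query(querySpec, { maxDegreeOfParallelism: -1 });
```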
Although we have emphasized that you should use single-partition queries and always put the partition key in the filter, there can be situations where you need cross-partition queries, and in those scenarios you should take full advantage of the parallelization our SDK can offer. For that, set the max degree of parallelism to -1, which lets the SDK optimize the number of threads required to perform the query based on the client configuration, so it can execute the query against the physical partitions in the back end in parallel. You get as many threads as is optimal for your client configuration, querying all the physical partitions concurrently, which gives you better latency and performance. In the second case, where you use a fixed, finite degree of parallelism, you limit the number of threads, and in some cases you can choke the client-side resources doing parallel queries and lose the parallelization benefits. So let the SDK make that optimization call by setting the degree of parallelism to -1 for concurrent cross-partition queries.

Finally, moving on to writes. Cosmos DB by default indexes all the fields in the JSON document. While it's very convenient that you don't have to think about indexing at all when you change the schema, it can become an overhead when you write a document, because the indexes associated with all those fields always have to be updated. This is what the default indexing policy looks like: every path is indexed. What you can do from your side to optimize writes is tune the indexing policy so you only index the fields that are required for the queries your application is going to execute. Here we index just 3 fields and exclude the rest of the fields in the JSON. (A sketch of such a policy follows at the end of this passage.) Let's see how this makes a difference to write performance: when I write a telemetry document with the default indexing policy, it takes around 15 RUs, while writing the same document with the tuned indexing policy takes only 7 RUs, roughly half. That lets you cut your provisioned throughput in half and gives you better performance in terms of both latency and the resources required to execute a write.
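A sketch of a tuned indexing policy with the JavaScript SDK; the included paths are hypothetical stand-ins for the three fields in the demo:

```js
// Index only the paths the application's queries filter on; exclude
// everything else to cut the per-write indexing cost.
const { container } = await database.containers.createIfNotExists({
  id: "telemetry",
  partitionKey: { paths: ["/vehicleId"] },
  indexingPolicy: {
    indexingMode: "consistent",
    includedPaths: [
      { path: "/vehicleId/?" },
      { path: "/timestamp/?" },
      { path: "/status/?" },
    ],
    excludedPaths: [{ path: "/*" }],
  },
});
```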
Finally, moving on to monitoring performance: Cosmos DB provides a set of default metrics from our side that customers can take advantage of. There are metrics related to requests, which can be filtered by status code, operation type, container name, database name, anything you want; metrics related to throughput, which let you monitor the resources consumed by a particular request, again broken down by status code, operation type, and so on; and other metrics related to storage, availability, latency, and more. All of these metrics are available in the Azure portal, and we also have APIs to access them from the SDK itself, so you can build your own dashboards and custom alerts on top of them without actually writing code to pull the metrics out of individual responses, which is pretty convenient. These are aggregated metrics, at a one-minute or five-minute granularity; we also have metrics for each individual request, which can be enabled separately using Log Analytics.

So finally we are ready for production: our application development is complete and we can monitor it. But before moving to production, we need to make sure our database is highly available, and for that we add a second region. With a single region you get four nines of availability, which is 52.6 minutes of downtime in a year; adding a second region gives you five nines, which is just around 5 minutes in a year, which is pretty good. All you need to do is add another region from the globe view, and you have five nines of availability for your database service.

So, to summarize: you set the service up right by choosing the right API, designing the data model appropriately for your application, tuning your queries, choosing a partition key that is efficient for the majority of the queries in your application, and estimating capacity. During development, take care of all the client-side optimizations the SDK provides, and for testing you can use the local emulator, which is just a desktop version of Cosmos DB you can download for every operating system and test with locally. It is exactly the same: you get the exact same responses and the exact same request units that would be consumed, so you can do your testing locally. And finally, before moving to production, make sure your database performance is as expected through the metrics, and set up another region for high availability. That's pretty much it from my side; thank you, folks.

Thanks very much. We have no further speakers, so no more awkward setup, but questions, anyone? "This question might be too simple for the talk, but I'm a bit confused about the concept of an indexing policy; could you explain a bit more what it is?" Sure: it's pretty similar to what indexing is in a SQL database. In SQL you create primary keys and then you create different indexes; in Cosmos DB the indexes are by default created for you. When you ingest a document, all the data is indexed by default, all the fields have an index, and that is what the default indexing policy is. It gives you the flexibility of adding new fields to the JSON schema without thinking about indexing or creating the indexes. But you can also optimize the indexing policy according to your needs, so that your writes are more optimized: just include the fields you need to index and exclude the rest, or do it the other way around, excluding certain fields and including everything else. "So an important difference is that the default index set is not empty: most relational databases index nothing by default, while here it's everything by default." Right, and there is a reason for it: to optimize the read cost at the price of some write cost. "I have a question. You have a sort of unified way of dealing with either provisioned or serverless deployments. Having created a database, can you transition between serverless and provisioned, and if so, how long does it take and what does it cost?" Yeah, that's a pretty good question. We are just rolling out this feature: you can start with serverless and then move to provisioned once you have better control of your capacity requirements. It's an in-place change; serverless-to-provisioned is rolling out in May, and both directions will be available around September. Fantastic. Any other questions? Thank you, folks. We also have a virtual conference called Cosmos DB Conf, happening on the 16th of April, in case you are interested in Cosmos DB or any specific topic in general; you can watch it on YouTube, live or on demand, and you can probably register for it as well.
Thank you. Very cool. That concludes the final talk on this track of the conference, and in fact the final session of the conference starts downstairs in five minutes: it's the final commit. Yes, this has actually been a distributed version control problem. Please join us in the main hall downstairs in a few minutes' time for the wrap-up session for the conference. Cheers.