Morning. Thank you for getting up and joining me. It is just super cool to be here at EMF and to be telling you something about open-source databases. So hi, my name is Lorna. I run developer relations at a company called Aiven. We're also sponsors of this event. So if you want to tweet at us, that would make everyone very happy.
If you do this stuff all the time, you will know there's a lot to keep up with. If you don't do this stuff all the time, I have no idea how you're supposed to keep up with this. So I'm here to give you a quick snapshot of the things that are going on: the things that I like the best, in case you haven't seen them before, and the things that have changed recently. I just want to give you a little insight into my world of open-source databases. Aiven's a cloud database provider, so I do this all the time.
I am using very scientific data sources for today's talk. The first one is my own opinion. I'm so opinionated that that needs a capital O in this context. I'm also using some data from the DB-Engines website. They have a ranking of database popularity. Now, you know that you're not exposing what database your applications actually use. So DB-Engines use a sort of secondary set of data. They index what people are searching for on Google or other search engines. They index what's being talked about on social media, on Stack Overflow and in job adverts. If a technology is in wide use, you would see it in all of those places. So I think the absolute numbers make no sense, but in terms of which technologies we're using, getting stuck on and talking about, I think it's a pretty good metric. So I've used that as the basis.
So, there are lots of different types of database. But if you've only used one type, then you will have used a relational database. Relational databases are the traditional databases. You define them as tables with columns and you put your data in each row, a bit like a spreadsheet, except cleverer.
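To make the tables-with-columns idea concrete, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table names, column names and rows are invented for illustration and are not from the talk's slides; the second table and the join preview the books-and-authors link that comes up next.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
# All table names, column names and rows here are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE books (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES authors(id)
    );
""")

# The author is stored once; each book row only carries the author's ID.
conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Example Author')")
conn.executemany(
    "INSERT INTO books (title, author_id) VALUES (?, ?)",
    [("First Book", 1), ("Second Book", 1)],
)

# Relate the two tables at query time instead of repeating the author's name.
rows = conn.execute("""
    SELECT books.title, authors.name
    FROM books
    JOIN authors ON authors.id = books.author_id
    ORDER BY books.title
""").fetchall()
print(rows)
```

The same SQL should look very similar on MySQL, MariaDB or PostgreSQL, although the exact column types would differ.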
And then the relation is where we link one table to another. So if you have a table of books, then every book will probably have an author ID, and the author ID will link to the authors table. We relate the data together instead of repeating it for each book.
The world's most used open source database is MySQL. It has a long and rich history as part of the LAMP stack: Linux, Apache, MySQL, pick a programming language beginning with P. LAMP. So it is really, really widely used, and it's still very widely used. I think we think of the LAMP stack as being something that isn't super popular today. We're probably not using Apache's web server. But even as these things change, MySQL is still very, very popular. It's open source, it's GPL licensed, but it does have a proprietary enterprise server version available.
MySQL is super interesting. There was a long gap between releases. Version 8 has some really beautiful features. It's always performed well and it still performs well. Version 8 has the document store and much better JSON support. So if you were leaning on Postgres for the JSON support, a lot of that is in MySQL now. Not all of it, but I would say all the bits I understand are now available on both platforms, and Postgres has a bit more. It also has some improved metadata and cataloging features. What's interesting about MySQL is how other things, which are not really open source but I'm going to mention them anyway, are building on it. Some of the cool new proprietary database startups like PlanetScale are built very much on a sort of serverless version of MySQL in the cloud.
I can't talk about MySQL without talking about MariaDB. Also GPL licensed. It's a fork of MySQL and it's intended to be a drop-in replacement. If you use MySQL, you can choose to use MariaDB instead. Now, it has a few extra capabilities, particularly the support for extra storage engines.
The big thing about MariaDB is that this is a MySQL fork intended to keep the database open source. Because MySQL ended up being owned by Oracle and we felt unsure about its future, its original creator forked the original project to create MariaDB. You can get MariaDB as a cloud service as well. It's called SkySQL. But you'll even find that some of the Linux distros package something called MySQL, and if you look super close, it is MariaDB. They're that interchangeable in most cases.
Now to the second most popular open source database. Finally, we're at PostgreSQL. Now, I am going to go through a bunch of databases in this talk, and it would be terrible if I admitted to having favourites. PostgreSQL is my favourite. If you only need one database, it is PostgreSQL. Interestingly, it has its own license. I don't know why, but it is OSI compatible and it's sort of MIT-ish, is how I understood the summary of that.
If you haven't used PostgreSQL before, it is a powerful and performant relational database. It's brilliant, and its performance has improved enormously over the years. Version 14, released last year, brought us extra, extra, extra, extra JSON functionality. It's basically a document database at this point. And the logical replication makes it incredibly, incredibly valuable.
PostgreSQL really has been around for a long time. There's a lot of seriously hot new shiny technology in the database space. Postgres is the best boring database technology I know. And it's an incredibly healthy community. It has a lot of contributors, it has a great atmosphere, and lots of different people get involved in Postgres. They all work at different employers and they all bring different things. That depth and strength of the community, as well as of the project, is something that you really, really don't see everywhere. So it might not be shiny new technology, but it kind of is with every release.
The other thing that's amazing about Postgres is that it has a lot of extensions. It does a bunch of things out of the box. Not too many: it's not heavy, and it performs very well. But then you can go along and add things in. Typically this will be extra data type support or extra functionality. Some of these extensions almost make this a new database in its own right.
One of these is PostGIS, which is a spatial database. It's an extension to PostgreSQL, but it gives you everything you need to handle spatial data, and you can use it with all of your existing Postgres tools. You get support for data types that represent geographical objects, so you can store and reason about and manipulate those things. It understands area and distance. It has the functionality to do the complicated globe-shaped math, or maths, that I kind of don't want to do myself. And it does it inside the database, meaning that you get it there for indexes as well. You're not pulling out data and doing all of that in your application. It's built right into the database. It's incredible. If you work with geographical data, I'm preaching to the choir. If you're thinking of working with geographical data, this might be a good place to start.
Another of the extensions that make almost a new database is TimescaleDB. Timescale is mostly Apache 2 licensed and therefore made it into the talk. But some of the features, including the clustering, use their own license, which is not open source. I think Timescale is pretty cool. It's an extension. It gives you extra table types. You get these hypertables, which can take lots and lots of time series data in a sane way, and it gives you extra functions for dealing with that many data points.
So, time series data, because I've just gone on a tangent. Here's the slide to support the tangent. Time series data is a very specific type of data. There's a timestamp and there's a value.
There are usually some other things telling you which server this came from, which metric it is or what you were measuring. If you're lucky, you'll get units as well. But then I'm an engineer and I'm opinionated about that. And you get something that basically makes you want to represent it as a graph like this. You get points in time and you don't want to read that in a table, but you have amazing built-in visual processing. So we graph the values over time. What you get with this is lots and lots of very skinny rows, so the data behaves in a way that needs its own handling. It could be the temperature of your fish tank, the amount of memory left on the server, all of those types of applications. It's time series data and you'll need a time series database.
I mentioned Timescale already, but another open source time series option is InfluxDB, which is MIT licensed. It's a specialist time series database. Again, all of your internet of things, metrics, energy graphs, all of that is available. Influx is open source in single node. It's proprietary for the clustered version. You can get more in one node than you would think, so don't panic when you see this. If you're operating in open source, then, yeah, still have a look at this. I think it has options.
One interesting thing that I'm starting to see around the Influx space, and also around the Postgres space, is the tendency of new technology to reuse the old protocols. Now, I think this is really smart and it's a real trend. It's been around, but I'm seeing it almost everywhere now. You invent an amazing new database, but you don't invent an amazing new wire protocol. You use the existing ways that we speak to the databases so that all of the existing libraries work and the clients work. You don't need to re-engineer that. You don't need to document and build all of those integrations. We have those already. One example I'm seeing is CrateDB, which has support for the Postgres protocol.
Also, the big time series databases, VictoriaMetrics and M3 in particular, both use the Influx wire protocol or the Prometheus one, which makes sense, right? Because they are big time series storage. They're ideal backends for Prometheus, so they've implemented that protocol. This makes sense to me. If you have a database that looks like PostgreSQL and talks like PostgreSQL, do you really care if it is or not? I think this is really smart, and I think we'll see more of this: reusing the best bits of what's already here, especially in the open source tools, and moving those forward.
One more relational database before I move on, and it's SQLite. SQLite is a really interesting open source project. If you haven't come across it before, it's file-based. You don't need a server. You often see it in embedded applications or for local development or testing. It does really most of what a relational database does. It isn't going to do it in production or at scale, but it's embeddable. I am also seeing it showing up in the cloud now. Cloudflare just launched this product, D1, like your little database that sits by your little serverless function. It's SQLite. It's a perfect edge database. And I think that's really interesting.
It is public domain and describes itself as open source, but not open contribution. If you think about the definition of open source, you need to have access to the code, but you also need to have the ability to use and change that code in any way that you want. You can't have an open source licence that outlaws certain fields of endeavour or certain types of people. Either you make it available for everyone to use and to change as they wish, or it's not technically open source. It's just source available. Or sometimes it's open core, if quite a lot of it's open source but not the vital bit that makes it perform well. Those sorts of things.
SQLite is very open about being open source, but not open contribution. They don't want you to fix it, but you're welcome to do what you like with it, and at least they're upfront about it. I certainly have run into other open source projects that are not open to contribution and should probably say so. So I kind of admire this about SQLite.
SQLite is a small and mighty database. I shall use that fact to make a tenuous connection to the next one, which is Redis. Also a small and mighty database. Redis is entirely an in-memory key-value store. It is speedy. It is amazing. And it has some real superpowers for such a tiny piece of software. It's in-memory, so it is not going to take your big data, and you probably don't want to store a lot, a lot, a lot, a lot of data in it, but it's brilliant for not enormous data that you need for not ages. So it's often used for caching or queuing. It's often used as a secondary data store, where you have your main data somewhere and you're using Redis for quickly accessing things or keeping intermediate running totals, stuff like that.
It has support for different data types, so you can have different data structures stored in Redis. It understands lists, sets and hashes. This leads to fun features. My favorite is the sorted set. In Redis you can have a sorted set. You throw data into the set. You can increment or otherwise set those values. It stores it sorted, so you can get it back in the right order at the speed of light. I'm exaggerating. Not a scientific measure. And you can do that whenever you need it, because it understands the data types and it stores the data in that way.
Redis is absolutely everywhere, and it should be. I think this is one of our most under-celebrated database technologies. It's the third most popular open source database by the DB-Engines measure. Last year's Stack Overflow survey had it as the most loved database.
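To illustrate the sorted set behaviour described above, here is a toy model in plain Python. It is not Redis and does not talk to a Redis server; the class name ToySortedSet and its methods are hypothetical, invented for this sketch. The comments point to the real Redis commands (ZADD, ZINCRBY, ZREVRANGE) that each method loosely imitates.

```python
import bisect


class ToySortedSet:
    """Toy stand-in for a sorted set: members are kept ordered by score."""

    def __init__(self):
        self._scores = {}   # member -> current score
        self._ordered = []  # list of (score, member), always kept sorted

    def add(self, member, score):
        # Loosely what ZADD does: insert a member or overwrite its score.
        if member in self._scores:
            self._ordered.remove((self._scores[member], member))
        self._scores[member] = score
        bisect.insort(self._ordered, (score, member))

    def increment(self, member, amount=1):
        # Loosely what ZINCRBY does: bump a score, starting from 0 if new.
        new_score = self._scores.get(member, 0) + amount
        self.add(member, new_score)
        return new_score

    def top(self, n):
        # Highest scores first, in the spirit of ZREVRANGE with scores.
        return [(member, score) for score, member in self._ordered[::-1][:n]]


# Hypothetical leaderboard data.
board = ToySortedSet()
board.add("alice", 10)
board.add("bob", 7)
board.increment("bob", 5)   # bob moves from 7 to 12
board.increment("carol")    # carol starts at 1
top2 = board.top(2)
print(top2)
```

The point of the sketch is that the ordering work happens when data goes in, so reading the top entries back is just a slice.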
Like I say, often it's secondary, where you've got a primary data store and you're caching or queuing or something in Redis. But it is well worth a look.
There's a bunch of other key-value stores around. I mean, realistically this talk could have taken all day. I tried to keep the list small. Memcached has been around for a while. I would describe it as less fully featured than Redis but still super, super valuable and speedy. etcd is often in the background. It's depended on by other tools. This is true for RocksDB as well. I have ArangoDB here. This is up and coming. I just think it's interesting. It's seeing quite a lot of adoption. It's a key-value store, but it's also a document database and a graph store. So there's a whole bunch of technologies in this space and it's changing all the time.
We talked about key-value stores. We talked about Redis and I said it wasn't really for huge data. So let's talk about where you put your key-value data for things that are too big for Redis. I'd like you to meet Apache Cassandra. Apache Cassandra is, it's amazing, right? It's a key-value store but for big, big, big, big data. It's a distributed database and it's designed to run on commodity hardware. So you take whatever hardware you have, throw in a bunch of servers, and Cassandra is totally happy to just work around what you have. It abstracts the complication of dealing with the multiple nodes for you.
It is designed for very large volumes of data, but you are going to put the data in the way that you want it back, right? This is not "chuck it in and we'll run some queries later". This is "I have modelled and understand my data needs very well. I know exactly what I need to get out of this database, and at scale, and therefore I have designed my database to support that." So you're going to denormalize the data.
You're going to throw it in at whatever speed you want, more or less, and Cassandra will allow you to run those analytical queries, but you need to know before you start what questions you're going to ask at the end. So it's not something that iterates really quickly, but we have it on the Aiven platform and we see this a lot. It originally came out of Facebook. Technically it's a wide-column store, if that's your thing, if you like the terminology. And it's designed to play nicely with a bunch of the other open source tools for processing big distributed data sets. So Apache Hadoop, Hive, Pig: it belongs in that space.
Cassandra is also a very actively developed project. Version 4 was released last, I want to say, October-ish, and that had some really nice additions. Better auditing, kind of important in an application like this one. And also some metadata features, which I think is really, really valuable as we get more into data lineage, data catalogs, those sorts of things. So if you are outgrowing your big databases, have a look at Cassandra. I think it's super, super interesting.
It's one of a family of distributed databases. They're designed to make use of many nodes and to allow you to scale horizontally even for writes. Compare the more traditional databases, and I'll pick on Postgres again. Postgres will have one main primary node that you're going to write to. It can be big. You'll have a lot of secondary replicas that can take a lot of the read scaling. The distributed databases are intended to scale horizontally for writes as well. So how you design that depends on how your read-write ratios look.
The way they work is that they structure the data into what are called shards or partitions, depending on the product. Normally I talk about this in relation to Apache Kafka. It's not a database, so it can't be in this talk. But it also has partitions and multi-node support and sharding.
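As a very rough sketch of the shards-or-partitions idea, the toy code below hashes each key to one of a fixed number of partitions and then places every partition on several nodes. The node names, the partition count and the replication factor are all assumptions made up for this example, and the round-robin placement is a deliberate simplification rather than the scheme Cassandra, Kafka or OpenSearch actually uses.

```python
import hashlib

# Hypothetical cluster settings, invented for this sketch.
NODES = ["node-a", "node-b", "node-c", "node-d"]
NUM_PARTITIONS = 8
REPLICATION_FACTOR = 3


def partition_for(key):
    """Map a key to a partition number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def replicas_for(partition):
    """Pick REPLICATION_FACTOR distinct nodes for a partition, round-robin."""
    start = partition % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]


def nodes_for(key):
    """All nodes that would hold a copy of the data for this key."""
    return replicas_for(partition_for(key))


# Writes for different keys land on different partitions, and so on
# different nodes, which is what lets writes scale horizontally.
for example_key in ["user:1", "user:2", "user:3"]:
    print(example_key, partition_for(example_key), nodes_for(example_key))
```

Raising REPLICATION_FACTOR keeps more copies of each partition, at the cost of more storage and more nodes involved in every write.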
So each bucket of data, let's call it a partition, is then replicated across multiple nodes. Things are organised into the partitions, and then each partition exists in two or three or however many places you configured. This depends on how important it is that you don't lose things and how much you want to pay for that. There's a risk-benefit calculation that takes place here. So it's usually replicated for that reason. This sounds complicated, and it kind of is, but the database just takes that away from you. When you're working in an application, you feel like you're talking to a single node and it does the rest for you.
There are quite a few different projects in this space. Another famous one (I hope that Cassandra is also famous, and I mentioned Kafka) would be Elasticsearch. However, this talk is about open source databases, and Elasticsearch is no longer open source. But luckily, we have an open source fork of Elasticsearch. It's called OpenSearch. It's Apache 2 licensed. It's compatible with Elasticsearch. And it is also a distributed database.
This one's really specialist. It builds on the Apache Lucene project and it provides really, really awesome features, particularly search. It does amazing aggregation as well, but the search blows me away. Every single time I eventually go, no, no, let's stop doing full-text search. Let's just put it in OpenSearch. It's so good.
The interesting thing about the search databases is that they don't have a defined schema, so they're more like the document databases. You can store whichever records you want to in there. Whatever the structure, they're JSON, so they can have nested fields. It's really, really flexible. But to get the performance on the search, you need to define your indexes so it knows which fields to keep track of and can give you the search performance back at the end. Again, they make use of multiple nodes, so they scale just beautifully.
You can put a lot, a lot, a lot in these databases. If you were using Elasticsearch with Kibana, then check out OpenSearch; the new Kibana is called OpenSearch Dashboards. It does what it says on the tin. It's the graphical interface to your OpenSearch database. I'm okay with it. The OpenSearch 2.0 release went out last week. Maybe the week before. No, I think last week. It has adopted a newer version of Lucene, so you've got some cool search upgrades there. And it's improved the notifications, so you can get some alerts off that. That is better. So again, another healthy project that is moving forwards.
I'm completely biased, having worked in open source all of my career and also loved data for most of it. I think open source databases really give you something that you can build on, use in lots of different contexts and scale as you need to. And I think this is some of the best technology around. I've tried to bring you a quick snapshot of what's going on and what's happened recently. Hopefully it was helpful.
I will put some quick resources on the screen. Check out Aiven. They're sponsoring and they let me be here. Check out our event Uptime, which is about open source data, in September. That's my website, and if you want to learn more about this then I'll recommend the book Seven Databases in Seven Weeks. I'll be around for questions if you have any, and with that I will say thank you very much for your time.