Thank you very much, and thank you for coming. I hope you enjoy the talk. I have lots and lots of slides; I'm going to put the slides up online, and I'm not going to read all the text that's on them. So let's get going.

Who am I? I'm Marc-André Lemburg. I've been with Python for a very long time. I'm the former chair of the EuroPython Society and also a Python Software Foundation Fellow. I have my own consulting company, and if you want to connect, this is the LinkedIn QR code you can scan, or you can go to the website down there (malemburg).

Right, so: motivation. Let's assume you have a great idea for a startup, and you want to build the next big thing. If you want to build something that's really big, and you don't know how big it's going to get, you need to make a good choice about which database to use underneath. Now, if you've been in the business for a long time, you probably know that there are quite a few standard options. But in recent years there have been huge developments in the database space. If you go to the website dbdb.io, the Database of Databases, which is quite a useful site because it compares the features of different databases and makes it easy to look around and shop around, you'll find that there are more than 800 databases available for you to use now. So it's not as easy, not as straightforward, to choose a database anymore.

The reason is that lots and lots of companies, especially what used to be the small startups and are now the big, huge internet companies, all had an issue when they started their business: they found that the databases on the market at the time did not really fulfill the needs they had. They weren't scalable enough, for example, and they didn't really focus on the
specifics that they needed for their business. So what they did is they took available open source projects, customized and modified them, and made them more scalable. So now you have lots and lots of different databases for the different types of workloads you want to handle.

Of course, you can always just pick the standard ones like MySQL, MariaDB or Postgres, which is actually my favorite. How many of you know these databases? Everyone, excellent. So those are the obvious choices, right? But let's assume you want something that scales maybe a bit better than these databases. You're going to market, and you don't want to lose the competitive edge that you have with your startup, so you probably have to approach the whole thing in a bit more systematic way. You have to answer a few questions about what you want to do with the database, find out which features you need, and then go out and shop on the market to see what's available and pick the best fit for your particular use case. For larger projects, I would always recommend doing a PoC as well, because sometimes, even though the marketing slides of the different databases look nice, they may not actually perform the way they are advertised. So it's always a good thing.

The first question is: what's the typical data load you're going to have? What kind of workload are you going to put into the database? There are basically two major types. One is OLTP, online transaction processing, and the other is OLAP, online analytical processing. How many of you know these terms?
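To make the two terms concrete, here is a toy sketch of my own (not from the talk; the table and data are invented) that runs both workload styles against SQLite, used purely as a stand-in:

```python
import sqlite3

# Invented example table: an "orders" table serving both workload styles.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

# OLTP-style workload: many small writes and point updates.
con.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 10.0), ("bob", 25.0), ("alice", 5.0)],
)
con.execute("UPDATE orders SET amount = 12.0 WHERE id = 1")

# OLAP-style workload: one large read that aggregates over many rows.
rows = con.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 17.0), ('bob', 25.0)]
```

The point of the sketch is only the shape of the queries: OLTP traffic is dominated by the INSERT/UPDATE statements, OLAP traffic by the aggregating SELECT.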
Okay, about half. The first one is basically what you typically do with a general-purpose database: you put stuff in, you edit stuff, you delete stuff; the data in your database changes a lot. The second one, the OLAP case, is more about analytics: you put data into your database and you want to get reports out of it. This is also what Mara talked about: you first collect all the data, then you put it into an OLAP storage to do the analytics and then the business reporting. A typical use case for that would be a data warehouse solution.

Once you have figured out what you need, and it may actually be that you need two databases because you want both kinds of load handled, the typical case is that a database is only good at one of these. Of course, the database vendors and the companies working on these things try to make both work nicely, but there's always a trade-off, so you have to be careful when choosing.

Next, you have to figure out what feature set you need from the database, because there are a lot of things that you might need. There are also, unfortunately, a lot of things that you may need in the future but don't have any clue about yet; you only figure them out when you start developing. So you have to plan for both the knowns and the unknowns when doing the selection.

Now, when it comes to the feature set, I put together a few slides explaining the different complexity dimensions you have to consider when choosing a database. I'm just going to go through all of these. This is not necessarily complete; you might have additional ones. It basically depends on what you're looking for. The first thing you have to figure out is whether you're mostly going to have simple schemas, like a web server or a logging server, for example, or whether you're going to have complex schemas. Let's say you're
building a big, huge data warehouse that's supposed to handle all the data in your particular business; then it's more likely that you're going to have a complex schema. Versus if you just want to do some quick processing, say you have a web shop to handle, then a simple schema would be more important for you.

Now, choosing the schema is probably fairly straightforward for you. What's not so straightforward is making changes to those schemas later. So you should have a database that allows you to implement migrations efficiently if you have a need for changing the schema afterwards, which you typically do. Even if you know exactly what you want to do, at some point you're going to find that you need more fields or more tables, and you have to run these migrations. So the database should probably have features for these things.

Next is, of course, cardinality: how many rows are you going to store? Are you going to have tables with lots and lots of columns, what's called a wide table? Or are you going to have both lots of rows and lots of columns? And of course you need to figure out how fast the data you're putting into storage is going to grow: how much new data are you going to get every single month? That's very important, because databases often hit a certain wall after some time; they're not necessarily elastic in terms of scalability. So you have to be careful about that.

Another point is temporal complexity. Mara already mentioned this: for example, whether you want to time travel in your data. So you want to know: what did my database look like on January 1st,
2020? That is a question that some databases can answer and others cannot. Sometimes you can build something on top to make it happen, but this is something you would have to figure out before choosing a database.

Next is query complexity, which is fairly obvious. Typically you want everything to be as performant as possible, so performance matters a lot, but performance comes in two different categories. For OLTP it would be transactions per second. For OLAP, it's more important to know how quickly the database is going to answer my question, and this is especially important for BI reports, where the user doesn't want to wait half a minute or so for the answer.

Joins are very important as well. If you have lots and lots of tables and you denormalize, sorry, you normalize everything, so you basically use the DRY principle for databases, then you typically have to do lots and lots of joins. Now, if you're working in the OLAP space, it may be better to not normalize everything but instead to denormalize certain things. Denormalization means that you duplicate data across multiple tables, which of course is not DRY, but it makes the queries run faster because the database doesn't have to do all these joins on huge tables anymore. So that's something you need to consider as well.

Something that has become very attractive in recent years is to use views for these things, especially if you're working in a larger organization. You don't want every single developer to write his or her own SQL SELECT statements; instead, you prepare views on the data, manage the views centrally, and make those available to developers. Now, views are basically just hints for the database about how to run the SELECT statement, but databases now have materialized views as well, so they can dynamically
turn the view over your data into an actual table, and then that table is managed by the database. Of course, you have to tell the database when to update that table every now and then. You can also do this using special indexes. Again, this is something to consider.

Next is deployment complexity. You want to know where to run your database: on a single server, maybe in a cluster; maybe you want something that works in Kubernetes. You pretty much have all these options nowadays with databases. Storage is a different question. You have to know how much data you're going to put into the database. If you want all the data local to the database engine, then it's usually better to have something in memory or maybe on an SSD. But the trend goes towards separating compute from storage, so that you store everything on S3, for example, because that's a whole lot more scalable. If you're doing a data lake, for example, S3 is probably the right choice. It depends, again, on what kind of data you're looking at.

Resilience and disaster recovery are very important. If you're working with a database that is basically the basis for your business, you want to make sure that everything continues to work even if the database goes down, or if you have a disaster, say a data center goes on fire or something like that. You need to figure out how that is handled by your database.

Then there's operational complexity. You have to decide whether you want something managed, so someone does the work for you, or whether you want everything self-hosted. There are pros and cons to both. Nowadays most people tend to go to managed services. So let's say you go to Amazon and use Redshift, for example; then you don't have to worry about how the database is actually managed or how the hardware is composed. Or you use the self-hosted version, and then you have to
figure out what the sysadmins and the DevOps teams need to know, and whether the integrations for these things are available for your database. And then, of course, every now and then you get a new database version, so you have to handle upgrades, and you have to know whether you can do everything with zero downtime, or whether you can afford having a bit of an outage. You basically have to make everything work in that kind of scenario as well.

Right, those were a couple of dimensions to consider; like I said, there might be additional ones. I also put up this slide with some additional decision factors. As Mara already talked about, it's very important to figure out whether the projects you're looking at are mature enough and whether there's going to be support for the next 10 years or so. Those questions are not necessarily easy to answer, especially for open source projects. If you get a managed solution, from Amazon for example, you can just ask Amazon, but if you use something like, say, Apache Iceberg, the answer may not be straightforward. I've seen lots of these projects get hyped up very much for, say, five years or so, and then people lose interest and walk away, and the projects basically die, even though they may not necessarily be bad projects; people just go on to other things. And of course, because you want to use Python, you should always look for good Python interfaces for your database.

Enough talking about the theory behind it.
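One lightweight way to organize that shopping-around step is a simple weighted score sheet over the decision dimensions. This sketch is not from the talk; the dimension names, candidate names and ratings are all invented for illustration:

```python
# Invented weights: how much each dimension matters for *your* project.
weights = {"oltp": 3, "olap": 1, "managed": 2, "python_api": 3}

# Invented candidates with 1-5 ratings per dimension (e.g. from a PoC).
candidates = {
    "db_a": {"oltp": 5, "olap": 2, "managed": 4, "python_api": 5},
    "db_b": {"oltp": 2, "olap": 5, "managed": 5, "python_api": 3},
}

def score(features):
    # Weighted sum over the dimensions you decided matter.
    return sum(weights[dim] * rating for dim, rating in features.items())

ranked = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
print(ranked)  # ['db_a', 'db_b']
```

A score sheet like this doesn't replace a PoC, but it forces you to write down the knowns and weigh them explicitly before the marketing slides do it for you.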
Let's have an actual look at these databases. I'm going to show you a few selections of different kinds of databases to consider; let's see how far we can get in the bit of time that we have.

First, general-purpose OLTP databases, the standard ones. I'm not going to talk through these. You can get all of them to install on your own servers, and you can get cloud versions of these as well.

More interesting in some use cases are in-memory databases, where you basically run the whole database inside memory, provided you have enough memory. Of course, you can use SQLite for that, you can use DuckDB for that. SQLite is more OLTP, and DuckDB is more OLAP. How many of you know DuckDB? Just a few. It's definitely something to have a look at. It's a fairly new database, and it comes from academia, so they actually put some really nice techniques in there.

Then there's a commercial one called Exasol. How many of you know that one? Oh, a few, that's nice. This one is a bit different from SQLite and DuckDB: it's an in-memory database that works in a cluster. The engine itself only works in memory, so it loads data from persistent storage and then works in memory, and it's extremely fast. Lots of companies use Exasol for caching, for example.

If you want to go even faster, and you have a good budget and a good link to NVIDIA to actually get those GPUs, then you might want to use one of these GPU-based databases. I can't really say much about these, because unfortunately I don't have the huge budget to try them out, but they're supposed to be extremely fast and also extremely scalable. HeavyDB just got renamed,
so it had two other names before; I don't remember them. It's a very fast one, and they're putting lots of energy into it. If you want something that works in the context of the RAPIDS stack, or you want to work with Dask, then BlazingSQL is something to look at; that's actually written in Python, which is interesting. And there are extensions for Postgres, so if you're using Postgres already, you might want to have a look at that extension. Or you can use SQream if you're more into ad-hoc analytics, where you have to load lots of data in very fast, do the analytics, and then basically throw everything away again.

Next are data warehouse OLAP databases. Again, all the big names: Snowflake, BigQuery; Azure has something called Synapse Analytics; Amazon has Redshift. Greenplum is a nice development: they took Postgres and basically extended it to work in a cluster. It does MPP, massively parallel processing, and it's coming on strong. Teradata is one of the old-timers in this field; it's probably one of the first data warehouse systems you could get. Nowadays it's basically just a cloud platform; it used to be a system where you bought the hardware from them as appliances and put those into your data center. All of these are column-oriented, which is something that most data warehouse applications do, because for OLAP, if you store your data in a column-oriented way, you can get at it much faster and do analytics much faster.

The same goes for data lakes. As Mara already mentioned, a data lake is basically: you take all the data that you have, or all the data you can get your hands on, and throw it into a system called the data lake. The storage is typically on something like S3, because you want everything to be scalable. You often have to deal with unstructured data.
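The column-oriented point above can be illustrated in plain Python. This is my own toy sketch with invented data: an OLAP aggregate over one column only has to touch that column's values, not every full record.

```python
# Row-oriented layout: one record per row (the OLTP-friendly shape).
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 20.0},
    {"id": 3, "region": "EU", "amount": 30.0},
]

# Column-oriented layout: one list per column (the OLAP-friendly shape).
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [10.0, 20.0, 30.0],
}

row_total = sum(r["amount"] for r in rows)  # must walk every full record
col_total = sum(columns["amount"])          # reads just the one column
print(row_total, col_total)  # 60.0 60.0
```

In a real column store the win comes from reading far fewer bytes from disk and from compressing each column well; the layout difference above is the core of it.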
So you have to be able to put that into your database as well; you can use Delta Lake, for example, to do this. Apache has a nice database engine which is actually just the engine, and you plug in the storage underneath, so you can use Delta Lake with that as well, for example. Apache has Presto and Trino; those are actually two projects which forked at some point, so I can't really tell you which of the two is the better one. I've heard lots of good things about Trino, so perhaps that's something you might want to look at. Amazon has Athena for this, so you can basically throw everything into S3, put Athena on top of it, and do the querying.

Now for distributed databases, which you can use for data warehouses or data lakes as well: if you want to put your database on lots and lots of cluster machines and you want horizontal scalability, then these are nice databases to look at. YugabyteDB is a very hot database at the moment, and CockroachDB is basically a direct competitor to them. The two are Postgres-compatible, so if you have an application that already has a Postgres interface, those are good choices, and like I said, they're easy to scale: if you need more power, you just add more VMs or more containers. If you have a need for MySQL compatibility, then SingleStore is probably a good choice. ClickHouse is a nice database; it's very fast, but it doesn't have all the SQL features. Some of these databases focus more on specific things; ClickHouse, for example, is good for log analysis. And then again Exasol: extremely fast and good if you have lots of RAM.

Right, that was it in terms of the classical databases. Then of course you have document databases. Let's say you have unstructured data, or you have data in the form of JSON documents; then you can use any of these, like Elasticsearch.
Elasticsearch you probably know. Who knows OpenSearch? Just a few. OpenSearch is basically a fork of Elasticsearch by Amazon, and that was done because Elasticsearch changed its license. Elasticsearch used to be Apache-licensed, and OpenSearch still is, so depending on what you want to do, you can use either one of those. Cassandra is very nice; it's a very hot database for documents as well, it's high performance, and you can use SQL to query it. MongoDB you all know, right? CouchDB is more specific: if you have, say, a mobile application, then CouchDB is excellent for doing replication across devices. Say your mobile goes offline; you store everything in CouchDB, and the next time it comes online again, it does the synchronization with your backend. So that's good to use.

Time series databases are always very important for event data. InfluxDB you probably all know; Apache Druid you might not. How many of you know Druid? Just a few. It's kind of new, and it's extremely good for fast intake, similar to InfluxDB. It was originally written for ad systems, where you have lots and lots of data coming in and you have to make very fast queries to decide who gets to show an ad on a website. CrateDB is interesting because it combines Elasticsearch for the document side of things and uses Presto or Trino for SQL, so that's something very nice. If you use Postgres, you can use TimescaleDB, which is an extension to Postgres; lots of people are working on it, and I've heard lots of good things about it, so it's definitely something to look at.

If you're more in a space where immutability is important, or you have to prove that something happened in your database, say for auditing purposes, let's say you have regulators
Let's say you have regulator Come to your company and then you know check your records Then you should definitely have a look at for example emo DB Which is which also has some crypto stuff built in so they actually sign the changes that you do in the database And then store that Or you can have a look at Apache Pino, which is it's append only so you can just throw data at it and it will store it very fast and then finally machine learning and Databases so how many of you are doing machine learning inside the database? No one If you think about it actually makes a lot of sense because the database already has all your data, right? So why move it out of the database to wherever and then do your panda stuff on it or do your Machine learning models on it and then put everything back into the database again So why not just you know have everything happen directly in the database now with postgres and with a green plum This is possible using the the MAD lip so you can actually use all the the Libraries that you have for Python, but they will run directly inside the database server So it's a it's a lot faster. 
It's a lot more efficient. Other databases have support for this as well, but there you always have to use UDFs: you can write your UDFs in Python, for example, and then interface to these libraries, whereas MADlib is already optimized for these kinds of things.

There's a special database which is kind of interesting because it works as a proxy to your actual backend database, and what you can do there is the machine learning inside SQL. This is nice for people who don't know Python but know SQL very well, as you have with data engineers, for example; many of them know SQL very well, or feel much more comfortable with SQL than going to Python to do the machine learning. What they have done is parse the SQL and add extra elements to it to enable machine learning, so people who don't know Python can then use SQL to work in the machine learning space. That's very interesting.

Of course, there are lots of other things that I can't fit into the talk. Key-value stores I haven't covered; there are lots of those. Geo databases, again lots of those. Graph databases, Neo4j for example. Vector databases: if you're, for example, storing machine learning models, then vector databases are a good thing, if you have lots of those and you want a data catalog of all your models. There are plenty of other, very specific use cases, so you might want to shop around a bit.

This is everything I mentioned in this talk, and it's just a small selection. Like I said, there are more than 800 databases out there, so I could only cover some of them. I hope this was useful. If you have questions, please come to the mic. And I guess the main takeaway from the talk is: you should never stop learning and should always try out new things, right?
Okay, so we have lots of time for questions. If there are any in the room, please come to the mic, and you'll give me a signal if there are any from remote. No questions? Excellent, I covered everything. Okay, very good, then once more... Okay.

Q: This was a really good talk, thank you. I just want to hear your comment on the modern serverless databases, like PlanetScale and those types, which are getting pretty hot right now.

A: Right. Have I used them? No, I haven't. But serverless is coming on strong. The problem with serverless is data locality. Even though you can scale up your compute, and because it's serverless you can even do that on demand, you always have to get the data to the engine, and getting the data to the engine is the bottleneck in those cases. If you have queries that don't need a lot of data at the compute node, then I guess it works out fine, but otherwise you introduce lots of latency, because you have to get the data from, say, S3 to that particular node, run the query, and then get rid of the data again.

Q: [inaudible, about graph databases]

A: Graph? Yeah, I would probably use Neo4j for that. That's basically the standard go-to database for graphs, right?

Q: Thanks for the talk. I just wanted to get your opinion on using a database outside of its intended use. For example, using Elasticsearch as maybe a reporting database. Is that generally recommended, or are there a lot of pitfalls?

A: Specifically for Elasticsearch, I think there are certain use cases which it covers really well. For example, if you do lots of text-oriented queries, then Elasticsearch is perfect for that, even though other databases, Postgres for example, have extensions for doing text-based search as well. Because Lucene is directly built into Elasticsearch,
it has very, very good features for that particular space. I think the original use case for Elasticsearch was to store log data from various different systems, where you wanted to query your log data based on the text written in the log messages, because log messages aren't always very structured; they can have various different formats and syntaxes. So that made a lot of sense. If you have a specific use case in that area, then you should definitely use Elasticsearch for that, or OpenSearch, one of these document databases. I wouldn't say there's much of a pitfall there. Elasticsearch is certainly scalable to huge data sets, so I think it's a good choice.

Q: Yeah, we were using Postgres, but the performance wasn't quite there for the volume.

A: Right, performance would be a bit better on Elasticsearch. But yeah, definitely: if you have lots and lots of data, like a terabyte of log data for example, then I would definitely put that into Elasticsearch and query it there, rather than putting everything into Postgres and having one of these indexes built on top.

Q: Thank you very much.

Moderator: Do we have a remote question? No? Okay, I have one question before we go. We are using Snowflake now at my company; there seems to be a trend towards these kinds of scalable warehouse setups. Do you see this as a sustainable trend, and are the other big monoliths pivoting in that direction?

A: Well, yeah. Snowflake is like the go-to solution if you're in a company and you want to do a data warehouse, right? Everyone picks Snowflake. I'm not sure whether it's a good choice. I mean, it's certainly a very capable database, and it's a new database.
It was written from scratch specifically for data warehouse kinds of workloads. But it's just a single company, and you never know where that company is going to be in, say, five or ten years. So it may be good to also look at other projects. Plus, if you have more specific needs in terms of performance, for example, there are some newer Apache projects out there, like Druid, which are actually a lot faster. So it's definitely worth having a look at those things as well.

Moderator: Okay, thanks. So if there are no other questions, we'll end it there. Thanks, Marc-André.