Hello everyone, my name is Adam Furmanek and I'm Head of DevRel at Metis, and today we are going to talk a little bit about the missing part of our CI/CD pipelines, namely database guardrails. We are going to talk about how to safely deploy to production, how to prevent bad code from reaching our production environments, and how to make sure we have proper database observability, monitoring, and troubleshooting in place. Let's begin.

In today's world we just can't wait for errors to pop up in production; by then it is way too late to notice that something is wrong. We want to make sure things are correct before they are deployed, and we need to push these checks to the left as much as possible, meaning we verify our applications as early as we can. We can't wait for the checks to happen during expensive load tests, which run very late in our CI/CD pipeline. We need to test everything as early as possible, ideally while we are still implementing the changes in our development environment. That's the first thing.

Second, when something goes wrong, we need to be able to pinpoint the issue exactly, and quickly. We need to be able to show what is going on: whether it is because of recent changes, because of the code we executed, or because the data changed around us. We may now have a different traffic distribution or different data flowing into the application. We need to be able to tell exactly which line of code causes the issue, or whether it is a misconfiguration, or whether the external world simply changed. This is what we need to be able to answer about our applications. And it's a must-have no matter whether we are working with small applications handling one transaction per second or with big, scaled-out applications handling thousands of transactions every second. Every single application needs this: deploy the proper code to production, and when something goes wrong, have the full context at hand to root-cause the issue quickly.

So how do we actually know whether the code we deploy is going to work well in production? How do we know whether we can safely deploy on a Friday afternoon and things will keep working over the weekend while we are resting at home? What even can go wrong? There are various areas that may break. One of them is the deployment itself. We have multiple technologies and platforms: Windows and Linux, various cloud providers like AWS and Microsoft Azure, and other environments that differ in what they run and how they run it. Take permissions, for instance; they are a source of many deployment problems. We tested the application in the development environment, but when we deployed to production we forgot to update the IAM role in AWS, or we forgot to grant permissions to access some services or repositories. Another thing is connection strings: we forgot to change the connection string, or we didn't replicate it properly.
We didn't configure environment variables, we didn't configure connection pooling or scalability, or something as simple as line endings bit us, because Windows and Linux differ even there. Many things can break in the deployment phase alone.

The next thing that may go wrong is that the code may simply not work. We may have edge cases, unexpected data distributions, different locales, or plain bugs, generally stuff we did not foresee. It may work very well on our machines, but when we deploy to production it fails badly. Those things we can test to some extent; we can figure them out if we are aware of them. But we won't be able to foresee all of them, so not all of these changes can be predicted, tested, and verified before going to production.

Finally, another source of issues is the different load on the application. Locally we typically just try things out, testing happy paths or critical paths. In production we need to deal with peak times and with activity that differs in the morning, in the evening, during the night, over the weekend, or on workdays. Traffic distribution may depend on the country or continent we run in; the data available in the US may be completely different from the data in Europe. Various things may break that are strictly related to how we deal with databases: whether we run efficient queries, whether we test things properly, whether they scale out.

So how can we make sure these things are going to work well? The typical answer is: use CI/CD and it will solve all of our problems. Just to set things straight: CI stands for continuous integration, which means we take all the changes, compile them together, prepare the application package, and run unit tests, end-to-end tests, integration tests, and so on, to make sure things run properly in our smaller, synthetic environments. CD, on the other hand, stands for two different things. One of them is continuous delivery: we take the package built by CI and deploy it to all non-production environments, our dev environment, team environment, pre-production, QA, staging, whatever else. We deploy to all these places first to make sure the package can be deployed at all, and second to keep all the environments in sync, so that everyone working in them sees exactly the same data, the same logic, the same behavior. Finally, CD also stands for continuous deployment: pushing the application to actual production automatically. We take the package that was previously deployed to all the non-production stages and push it to production, automatically, right after we commit. That's the idea.
And they tell us: take CI, take CD, and it solves all your issues; your application is well tested and you can be sure you can push it to production. But is that the case? As we are going to see, that is not the entire truth, and we are going to understand why. We'll cover various areas: databases, object-relational mapper libraries, and the lack of context around the application.

Let's start with databases. There are various things that may go wrong here. First, the code may change. I mean application code: you change the way you talk to the database, you change what data you extract, you generally modify the application because business requirements changed. Changing the code results in sending different queries to the database, no matter whether it's a SQL database, a NoSQL database, a key-value store, or a document DB. The second thing that changes over time is the schema: we need to add a new column, remove a column, change a column type, and these changes may break the way we interact with the database. Those two groups we can typically test, at least to some extent, with CI/CD and unit tests.

However, there is another group of issues that is very difficult to verify and that we typically miss in unit tests: query changes. I'm not talking about changing which query we send, but about how the query is executed by the database. This depends on the data volume, on how many rows you have and how many partitions your table has; it depends on the indexes you configured; it depends on the time of day, other transactions running around, lock contention, and other factors. We send exactly the same query as we used to, but over time its behavior changes: the database engine executes it differently, which leads to performance degradation and ultimately to a deteriorated user experience.

Let's see exactly what may go wrong with issues of this kind. The very first thing is slow queries. Imagine an application endpoint where we want to get the user for a particular user ID. The user is an aggregate root, so we also extract related details: pages, questions, reports, and so on, and ultimately we want to return this one aggregate root from the database. However, because we want to extract everything related, we join multiple tables. This is an actual problem I observed in one of my production applications: it results in a SQL query like select star from the users table, joining multiple tables along the way.
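Here is a minimal sketch of that query shape; the table and column names are illustrative, not the actual production schema:

```sql
-- One "get user by ID" call expands into a multi-join query:
SELECT *
FROM users u
LEFT JOIN details   d ON d.user_id = u.id
LEFT JOIN pages     p ON p.user_id = u.id
LEFT JOIN questions q ON q.user_id = u.id
LEFT JOIN reports   r ON r.user_id = u.id
WHERE u.id = :user_id;

-- Each one-to-many join multiplies the result set: with 10 pages,
-- 50 questions, and 60 reports for one user, this single lookup
-- already returns 10 * 50 * 60 = 30,000 rows.
```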
However, when you take a look at that, it is possible, and this is what actually happened, that this one aggregate root resulted in extracting 300,000 rows from the database and returning them to the ORM, to the application. Why? Because of those joins: if you have multiple rows in each joined table, you multiply the number of rows in the output of the query. You get all of those rows, most of the columns are duplicates, and your ORM knows how to deduplicate and handle all of that, but the query takes 25 seconds to complete. These are actual statistics from the production server: 300,000 rows and 25 seconds for this one query.

So where is the problem, exactly? The bigger problem is not that we are extracting 300,000 rows; it's that we do not notice it before we go to production, because in a developer or local environment we probably don't even have enough data to generate 300,000 rows and see the performance issue. How did we fix it? In this particular case the solution was to split this user aggregate root into multiple independent fetches from the database. That resulted in running multiple SQL statements, but because every statement is very small and returns only a couple of rows, all of them together executed in a split second and returned far fewer rows overall. But in order to know we can solve it this way, we first need to identify the problem; we need full-blown observability telling us that this query is going to be slow in production.

What else can go wrong? Depending on how your SQL engine or ORM library works, you may run queries that are logically equivalent and yet have completely different performance. Take this handcrafted synthetic example. We have a table called boarding_passes with around 8 million rows, and we send two different queries that are logically equivalent; they extract exactly the same data. The first query uses a common table expression, CTE for short, which you can think of as an alias for a temporary table, a part of the query that we are going to reuse later. We take rows from boarding_passes and calculate an MD5 hash of each ticket number twice. Then we take this CTE and join it with itself three times, matching rows by ticket number, flight ID, and boarding number, and we filter on the hash, the same hash in every predicate. Or we can send an equivalent query that does not use a CTE: we select from boarding_passes three times and calculate the hash three times, once in each branch. Ultimately we run select count star, so we just get the number of rows. Both forms are sketched below. The interesting thing is that even though the two queries are equivalent, the first runs in about 13 seconds while the second executes in about 8 seconds, almost 40% less time than the first.
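A minimal sketch of the two equivalent shapes, assuming illustrative column names on the boarding_passes table and a placeholder hash literal:

```sql
-- Form 1: CTE computed once, then self-joined three times.
WITH cte_performance AS (
    SELECT ticket_no, flight_id, boarding_no,
           md5(md5(ticket_no)) AS double_hash
    FROM boarding_passes
)
SELECT COUNT(*)
FROM cte_performance c1
JOIN cte_performance c2 ON c2.ticket_no = c1.ticket_no
                       AND c2.flight_id = c1.flight_id
                       AND c2.boarding_no = c1.boarding_no
JOIN cte_performance c3 ON c3.ticket_no = c2.ticket_no
                       AND c3.flight_id = c2.flight_id
                       AND c3.boarding_no = c2.boarding_no
WHERE c1.double_hash = '<some hash>'   -- the same literal hash
  AND c2.double_hash = '<some hash>'   -- appears in every predicate
  AND c3.double_hash = '<some hash>';

-- Form 2: no CTE; the hash is recomputed inline in each of the three scans.
SELECT COUNT(*)
FROM boarding_passes b1
JOIN boarding_passes b2 ON b2.ticket_no = b1.ticket_no
                       AND b2.flight_id = b1.flight_id
                       AND b2.boarding_no = b1.boarding_no
JOIN boarding_passes b3 ON b3.ticket_no = b2.ticket_no
                       AND b3.flight_id = b2.flight_id
                       AND b3.boarding_no = b2.boarding_no
WHERE md5(md5(b1.ticket_no)) = '<some hash>'
  AND md5(md5(b2.ticket_no)) = '<some hash>'
  AND md5(md5(b3.ticket_no)) = '<some hash>';
```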
So what we can notice here is that these queries are equivalent, and if you were writing the query by hand you would probably go with the CTE version at the top. That would result in degraded performance, and you most likely would not notice it before going to production, because you do not have proper observability around your databases and you just don't see what's happening behind the scenes. That's another thing that may break.

Yet another thing that may go wrong is incompatible schema changes. Adding a column to a table may seem innocuous, something very safe that we can do anytime. That's not necessarily the case. The problem with adding a column is that if something uses select star, that query will start extracting the new column from the table, because select star gets all the columns. You may end up getting much more data from the database, much more network traffic, and your ORM library may need to work much harder to process data just to ignore it.

Dropping a column may break things too. If you drop a column, most of your code needs to be aware of that. The problem shows up when your application is scaled out, or when heterogeneous applications run against the same database. There are multiple nodes running some code, you deploy a change that drops a column, and the change is not backwards compatible. Now imagine half of your fleet runs the new code base, so it knows the column is gone and handles it properly, while the other half didn't get the latest code yet and still assumes the column is there and relies on it. Because you introduced this change in a non-backwards-compatible manner, you end up with what we call a split brain: half of your fleet is trying to access a column that doesn't exist anymore. In a heterogeneous ecosystem, with multiple applications touching the same database, you need to coordinate how you deploy changes to these applications so you can safely delete the column. And there is the question of who, or which component, runs migrations against your database, and whether you control those migrations technically: do you know when they are executed, what executes them, how to roll them back, how to push them forward? Those are the things you need to be aware of and control when we are talking about the database.

Even an innocent column type change may be terrible for performance. Why? Because it may require a so-called table rewrite. When you try to change a column type, your SQL engine may need to copy the whole table on the side, create the table from scratch with the new column type, copy the data back, and only then can your application safely carry on with its work. The problem with that: do you know how long it is going to take to migrate your tables? A backwards-compatible way to stage such changes is sketched below.
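Here is a minimal sketch of the expand/contract pattern for backwards-compatible schema changes, assuming a hypothetical users table; the steps and names are illustrative:

```sql
-- Step 1 (expand): add the new column. Old code keeps working; it simply
-- never reads or writes the new column.
ALTER TABLE users ADD COLUMN email_normalized text;

-- Step 2: backfill while both columns coexist, and deploy code that
-- writes to both columns.
UPDATE users
SET email_normalized = lower(email)
WHERE email_normalized IS NULL;

-- Step 3 (contract): only after every node in the fleet runs code that no
-- longer touches the old column, drop it.
ALTER TABLE users DROP COLUMN email;

-- Contrast: an in-place type change may force a full table rewrite and hold
-- a heavy lock for its entire duration:
-- ALTER TABLE users ALTER COLUMN id TYPE bigint;
```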
In a developer or testing environment, a migration like this is probably not a big deal: you have a couple hundred rows, so it runs in a split second. But when you run it in production, the migration may take seconds, minutes, or even hours. Can you afford to take your system down for hours? Especially with continuous deployment, you may be deploying during peak hours or shortly before them. Can you do that? Do you control all of that? This is why we need full-blown observability around the databases and around all the database interactions our applications run.

The next thing that may go wrong is missing indexes. A missing index may cause your query to scan the whole table instead of finding the particular rows directly. An index is basically a data structure that lets you find rows by their ID or by some other condition much faster, essentially via binary search: the data is stored in order, so the engine knows how to navigate it and locate the rows easily. Without an index, we need to scan every single row, and scanning a full table takes tons of time; it is very slow and something we really want to avoid. The problem, again, is that you may not notice you are scanning the table in your developer environment, because with tens or hundreds of rows it finishes in a split second. Once you deploy to production, you immediately see how slow it is. How can you protect yourself? You can run load tests just before pushing to production, but load tests are very late in the pipeline: they happen only after you are done with your tests, your implementation, your changes, probably even your pull request. Then you run the load tests, and that's the moment you learn that performance will degrade and you just can't push to production. Way too late.

There is a flip side, though: indexes are often treated as a silver bullet solving every problem. Ask a junior developer what to do when the database is slow, and the answer is "add an index." What do we end up with? Indexes everywhere: we index every single thing, providing an index for every query the application sends. That works really well for reading, but for data modification, adding rows, removing rows, modifying rows, it is terrible for performance. Why? Because now we not only need to update the particular row, we also need to update all the indexes on the table. Maybe the index you added six months back did improve performance, but over time you added more and more indexes to optimize particular use cases, and as a result you degraded the performance of other scenarios. So having too many indexes is not a good solution either. A sketch of how an index changes a query plan, and what it costs on writes, follows.
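A minimal sketch in PostgreSQL, assuming a hypothetical orders table; the plan lines in the comments show representative shapes, not exact output:

```sql
-- Without an index the planner has to read every row:
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
--   Seq Scan on orders  (cost=0.00..18334.00 rows=97 width=64)
--     Filter: (customer_id = 42)

CREATE INDEX idx_orders_customer_id ON orders (customer_id);

-- With the index, the same query can jump straight to the matching rows:
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
--   Index Scan using idx_orders_customer_id on orders  (cost=0.42..8.44 ...)

-- The trade-off: every INSERT, UPDATE, or DELETE on orders must now also
-- maintain idx_orders_customer_id (and every other index on the table).
```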
There are obviously other problems that may happen, for instance plain bugs. Sure, all of us have been there: we run a SQL query against the database, we forget the WHERE clause, bang, there is a bug. It's fine if we identify this bug in CI, in unit tests, and so on. However, some bugs manifest themselves only in a very particular context. For instance the Halloween problem, a pretty interesting one, where the engine may process an update incorrectly, revisiting rows it has already modified, and produce wrong results. Behavior like this may depend on your license, on whether you scale out your database, on sharding and partitioning, on whether you go with high availability, log shipping, active-active or active-standby replicas, and various other things that affect how your queries are executed and how bugs expose themselves. The same goes for lock contention: with different transactions happening in parallel you may get different results and different bugs.

Let's name a few. Isolation levels, for instance: under the read committed isolation level, according to the SQL-92 standard, you are very well allowed to get duplicate rows, or to miss a row, even when scanning a full table. The standard allows it; obviously it is something of an implementation detail, but it is something you probably won't notice when you run the query in your developer environment, where there is only one transaction happening and nothing else going on. Once you deploy to production, you immediately see that things don't work properly, and you can't easily reproduce it locally, because it's a completely different context, a completely different situation. Those bugs are very hard to track, but we should still be able to identify them before going to production.

Moving on, another area of bugs relates to the libraries we use: ORMs, object-relational mappers. These libraries make our lives much easier; they let us interact with SQL databases, or databases in general, by writing plain application code, without necessarily understanding everything that happens behind the scenes or how the data is structured in SQL. However, they can cause various problems as well. The typical issue is N+1 selects. Imagine a table aircrafts in a one-to-many relationship with seats: every single aircraft has multiple seats. Now imagine application code that gets all the aircrafts and then, for every single aircraft, reads its seats, seat count, and so on; we just iterate in code. Depending on how we configured the application, this may result in the so-called N+1 selects problem. Why? Because we first send one query to get all the aircrafts from the aircrafts table, and then for every particular aircraft we send another query to get its seats. That is one query for the aircrafts plus N more queries, one per aircraft, for the seats. How can we fix that? In this ORM case it is actually easy: we can reconfigure the ORM to fetch the data in the so-called eager fashion instead of lazy mode, as sketched below.
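A minimal sketch of the SQL an ORM typically emits in each mode; table and column names are illustrative:

```sql
-- Lazy mode: N+1 round trips.
SELECT * FROM aircrafts;                      -- 1 query for the aircrafts
SELECT * FROM seats WHERE aircraft_id = 1;    -- then 1 query per aircraft...
SELECT * FROM seats WHERE aircraft_id = 2;
-- ...N queries in total for the seats: N + 1 round trips overall.

-- Eager mode: everything in one query.
SELECT *
FROM aircrafts a
JOIN seats s ON s.aircraft_id = a.aircraft_id;
```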
With eager mode we get all the data with the related entities in that one query, select star from aircrafts joined with seats, and the issue is fixed: we get all the seats, everything, in one simple query. However, this is nothing we can see in the application code. It's a configuration key, something we declare somewhere, configure through environment variables, or maybe even define in the schema; something we can't clearly see when looking at the application code. Meaning: if one of your coworkers submits a pull request with this application code, can you tell whether it's going to run N+1 queries or just one? And how can you verify that in a pull request?

Okay, so how do we fix this properly? We mentioned we can reconfigure the ORM to run in eager mode. But if we do that for everything, we end up in the situation we have already seen, with one query joining everything together. So in some cases eager fetching is not a good idea and we should stick to lazy mode. What I'm trying to say is that there is no silver bullet telling you exactly what to do, because it strictly depends on your situation, your code, and the business you are effectively running. These things need to be tuned to your actual use case.

Moving on, ORMs introduce more challenges as well. For instance, the object-relational impedance mismatch: how do you represent the things you see in your object-oriented language in the database? There are various approaches, table per hierarchy, table per type, table per concrete class, and so on, but these may not translate one-to-one between your database model and your object-oriented model; they generally need specific handling. And with heterogeneous applications, one database layout may be easy to represent in one particular library, while another library has a very hard time representing the same one-to-many relationship, many-to-many relationship, join tables, or whatever else. This often reverses the priority: you design your database schema the way your ORM dictates. That is typically not a good idea, because you become a prisoner of your ORM and you structure your data not in the best way you can think of, but in the way your ORM can handle.

Other issues come from data types. It's super easy to think that all the data types in the database can be easily represented in your programming language. But is that the case? What about spatial data, geographical coordinates used around the globe: how do we represent them in an object-oriented programming language, as a pair of numbers? What about binary data? We typically say "array of bytes," but things get trickier. What about VARCHAR? How do you represent different string literals in your object-oriented language when it comes to string ordering?
You can configure your SQL database to ignore casing, so lowercase and uppercase are the same to the database, or you can change the ordering of characters, the lexicographical order. How do you do that in your OOP language? Can you configure strings that way in your programming language? What about numbers? SQL databases have decimals of various precision. How do you represent those in your language? Do you go with floats, with doubles, with some kind of big decimal? These are things we typically ignore.

On top of that, ORM libraries impose more restrictions of their own. Your ORM may not support spatial data at all, so maybe you have to represent it as a string, parse the string, split it by a comma, and do some magic. Or take binary data: when you serialize a string to bytes, do you go with UTF-8, with UTF-32, or something else? What about Java, which stores characters in 16 bits? These are things we typically ignore, but they can lead to multiple subtle errors that are hard to identify. A few of the database-side settings hiding behind all this are sketched after this list of questions.

Continuing with ORMs, there are more things I'll just briefly touch upon. ORMs have configuration, configuration that may be hidden from you, that you may not easily see and may not even control. Transaction isolation level: do you know what your default transaction isolation level is? Do you know whether your ORM configures it, or whether it relies on the database default? What about transactions in general: who starts the transaction, who commits it, can we nest transactions, and what happens if a nested transaction needs to roll back? How do we handle errors? Caching: if you have multiple nodes running your application because you scaled out, and you use a cache, how do you make sure your ORM synchronizes this cache across all the nodes? Does it do that at all? Can you configure it? Can you see how many entries are in the cache, what your cache hit ratio is, and so on? Pooling: do you pool your connections? If you run in AWS Lambda, do you open a connection every single time a new Lambda container starts as your application scales out, or do you have some pooling in between? Will it scale well when peak traffic brings 10 or 100 times more incoming connections? And what about query hints: do you know whether your indexes are used? Can you configure index usage in the ORM? Can you configure the join strategy, table hints, locking hints? Do you take care of all of that? Can you even?

And ORMs impose yet more issues. Migrations, for instance: what if you have multiple heterogeneous applications? Is your ORM happy if the database is not in sync with its logical data model? Can your ORM deal with that, or will it insist on modifying the schema in the database? How do you test your migrations in unit tests? How do you fix the errors you spot?
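A minimal PostgreSQL sketch of a few of these database-side knobs; the table names and the collation shown are illustrative:

```sql
-- Ordering and case sensitivity are collation settings in the database,
-- not properties of your language's string type:
SELECT name FROM users ORDER BY name;                -- default collation
SELECT name FROM users ORDER BY name COLLATE "C";    -- byte-wise ordering

-- Decimal precision is declared in the schema; mapping it to a float in
-- application code can silently lose precision:
CREATE TABLE prices (amount numeric(12, 4));

-- The default isolation level is often decided outside your code:
SHOW default_transaction_isolation;                  -- e.g. "read committed"
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;     -- per-transaction override
COMMIT;
```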
And more and more problems come with the libraries and tools we use. Generally, the biggest challenge with ORMs is that the changes are scattered across various places, and there are many moving pieces: model definitions, migrations, declarative configuration, and even the translation of application code into SQL queries. Triggers, stored procedures, functions, all of that may be hidden from us. It's often not clear how this machinery works or how to configure it later on, especially during pull requests, when we review just the application code without seeing everything around it. So ORMs are cool and give us tons of features, but they hide so many moving pieces that they often lead to subtle errors we don't know how to deal with.

And this is how we arrive at the true problem behind everything I'm talking about today: we need context to find the root cause. We need to know what's going on, to pinpoint specific issues and track them down as early as possible, and to have all the necessary details on our plate the moment we start debugging and figuring out what's happening. So let's see how we can make it better. To make it better we want context, and to get context we need to understand what we can get from the database, starting with how the SQL engine executes a query.

Every query we execute in SQL goes through several phases. The first phase is parsing: the SQL engine parses the query into something called an abstract syntax tree, AST for short, which is a graph-like representation of the query showing all the operations. Then the engine rewrites the query to make it standardized, unified, and easier to process; for instance, aliases may be resolved so the engine knows that two things aliased differently actually point to the same entity. Then we enter the world of the planner, a very important step. The planner decides how to read the data, how to join tables, how to filter rows; it creates the actual plan of how the query is going to be executed, in what order the data will be read and how things will be joined. Planning is the crucial part of executing a SQL query, because the plan dictates exactly how the query will be executed by the last phase, the executor.

So what is a query plan? In PostgreSQL, for instance, we can always prepend the EXPLAIN keyword to a query: when we have a select star, we can use EXPLAIN to get the query plan, a textual representation of how this query is going to be executed physically and how the data is going to be extracted, as shown in the sketch below. Every plan consists of nodes; more or less every row of the output is a node, and each node carries various pieces of important information.
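A minimal sketch of asking PostgreSQL for a plan; the table is illustrative and the commented output shows a representative shape, not exact numbers:

```sql
EXPLAIN SELECT * FROM bookings WHERE total_amount > 1000;
-- A representative (abbreviated) plan, one node per line:
--   Seq Scan on bookings  (cost=0.00..25609.62 rows=104938 width=32)
--     Filter: (total_amount > '1000'::numeric)

-- EXPLAIN ANALYZE additionally executes the query and prints actual
-- timing and row counts next to the planner's estimates:
EXPLAIN ANALYZE SELECT * FROM bookings WHERE total_amount > 1000;
```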
The first thing is the node type: what exactly is being executed, whether it's a nested loop, a sequential scan, an index scan, ordering, joining, filtering, sorting, caching, and so on. The second thing every node has is the cost. The cost is an arbitrary number, an arbitrary measure of how hard it is to execute the given operation, most of the time in terms of I/O: how much data we need to extract, how expensive it is to physically go and grab the data and execute this particular step. It is an arbitrary number, meaning we can't conclude anything from a cost of 6 or 100 in isolation, whether it will take this many seconds or that many minutes. What we can do is compare plans: given plans with different costs, we choose the cheapest one, the least expensive plan. And that is what SQL engines do: they generate multiple query plans and then choose the cheapest one to execute the operation, the whole query, as fast as possible.

Knowing that, we can take our interaction with the database and, by looking at the execution plan, figure out how it's going to behave when we increase the data volume, whether it will scale well when we run it in production or whether it will be too slow. We can reason about that just by looking at the query plan and applying some logical analysis on top of it. But how do we get those query plans, and all the rest? We need observability, and to get observability we need logs, traces, and metrics. And because we now live in a world of multiple heterogeneous applications, with queries coming from various sources and triggered against various services, we need a modern solution to collect all of that. The solution is CNCF OpenTelemetry. OpenTelemetry is basically a set of SDKs for instrumentation, supported by the CNCF, the Cloud Native Computing Foundation. It's a standard plus a set of libraries for various technologies and languages that you can just plug into your application to get true observability based on signals: traces, metrics, and logs. With this standardized way of extracting signals, we can uniformly get them from any application we are dealing with.

How does it work in practice? OpenTelemetry, OTel for short, gives us traces and spans. Imagine an application that kicks off some workflow by calling service A; say it's your checkout application: you click the checkout button, and service A is called to trigger the operation. Service A may in turn call services B, C, D, and E, a queue, a database engine, and so on, to get the data. What we would like to capture is everything that happened within this particular workflow, and that is exactly what we get; it is called a trace. A trace is basically a snapshot of a given workflow and all the subcomponents that were called because we triggered it. We have a time axis, and
everything on the chart represents how things interacted with each other. We can see how long it took service A to handle this particular workflow; we can see that service A called service B, which in turn called services C and D, and that service E was called as well. All those horizontal bars are spans, and the spans carry various attributes: a timestamp, a friendly name, start and end times, a parent ID linking spans into traces, and so on, so we understand exactly what is happening behind the scenes. This is how we can use OpenTelemetry to instrument our applications, and because it is standardized and supports various technologies, we can use it for whatever we have: web servers, SQL engines, application code, queueing systems, and more.

Now, having these two things, query plans and OpenTelemetry, we can finally get observability, monitoring, and root-causing. How do we do that? We need to verify all the interactions with the database. We could use load tests: run them, observe how the application talks to the database, and based on that figure out whether it will work well in production. But load tests have problems. First, they are super expensive. They take hours to complete, because we need to fill caches, deal with tiered compilation, cover various edge cases, and so on; and they cost real money, because you need to pay for the additional machines generating the traffic, handling the traffic, and running the whole thing. Second, there are data distribution and cardinality issues: you can't test your European stack with data from the US, because it's different data, different data storage, different content in your application; you can't just blindly replay queries coming from a different continent. And what about smaller countries, the ones that don't generate a significant part of your load but may still break when you deploy changes to production? Third, hardware: are you going to provision GPU instances if you are load-testing machine learning workloads? What about edge computing, or custom hardware if you run things on IoT devices? Fourth, data anonymity: can you even reproduce the traffic? Can you take traffic from production and run it in a developer environment? Is it safe, is it GDPR-compliant, can you do it technically at all? And finally, load tests happen at the very end of our CI/CD pipelines. Once we are done implementing the changes, designing the solution, and running the pull requests, we run the load tests, and that's when we realize the thing is not going to work well. So yes, load tests work in theory and can be automated in our CI/CD pipelines,
and automating them is easy enough, but it is not what we ultimately want. What we want is to be proactive and push the checks to the left. We can't wait for problems to appear in production, and we can't wait for load tests, because they are too slow, too late, and too expensive. Issues need to be identified as early as possible, and they need to be identified automatically. We need proper database guardrails very early in our pipeline, ideally right at our hands while we are implementing the application.

And how can we do all of that? We can use Metis and never go blind again. Metis is a solution built on CNCF open-source frameworks and open standards that provides all the observability and monitoring we need. First, there is source-code integration: by dropping one package into your application, Metis integrates with your app and uses OpenTelemetry, together with SQL query plans, to monitor everything that happens between your application and your database and to tell you immediately when something is not going to work well. Second, there is pull request analysis, covering the CI/CD process: whenever you submit a pull request, Metis automatically provides insights. Third, there is database observability, covering your production databases and environments so you can see what's going on.

So let's see it in action with a short demo of the Metis platform. Once we register, we can create a project that consists of three important parts. The first part is staging, the non-production environments. When we click into it, we see the actual interactions between the applications and the databases: for example, a REST call to a particular endpoint and URL that returned HTTP code 200, together with the queries it executed. When we open it, we get the full history of what went on behind the scenes: the OpenTelemetry traces and spans, showing which queries, and how many of them, each interaction generated. For every single query that went to the database, Metis provides automated insights: how many rows were actually read from the database, how many rows were ultimately returned to the application, how many tables were read, and so on. The automated insights also flag problems based on analysis of the queries: for instance, a critical issue that a query reads too many rows or uses a table scan that is not going to work well in production. We get a cost prediction, the number of rows returned, insight details, the impact, and a remediation plan telling you exactly what to fix to get better performance or correct behavior. Based on this one screen we can immediately answer two questions: first, is it going to work well in production, thanks to the automated insights; and second, how do we fix it. But if we were to ask, okay, I see there is a problem, but why is it actually happening,
then we enter the world of root-causing and explaining what's going on, and we can get into more sophisticated details. We can see the actual query that was executed, in an interactive view: you can hover over it and see which table it used, check whether the query is what you expected, and whether it worked properly. We get metrics: the cost, the estimated number of rows, how many rows the table actually holds in the database, the planning time. We get a visualization of the query plan, showing what the query actually did: a parallel sequential scan, its outputs, the workers, the costs, the I/O buffers, full clarity about everything happening inside. We can even export the raw execution plan from the database and feed it into other tools for further analysis, and we get a summary of all the tables that were accessed. So not only do we answer whether it is going to work well, we also give you all the details of why it isn't.

But that's just the beginning. This part integrates with your application by adding one package, one npm package or one pip install, and enabling the instrumentation with one line of code, assuming you use OpenTelemetry. But what about other cases? Take CI/CD: we can go to pull requests. Metis integrates with GitHub Actions and other CI/CD systems, so all the tests triggered in your pipeline, and the database interactions they caused, get the same clarity, generated automatically for you. For instance, a comment is posted on the pull request by the GitHub Action summarizing the analysis that was performed and how things behaved in your case. This covers the full CI/CD pipeline: you get insights immediately while writing application code, and the CI/CD code reviews improve too; you can, for instance, automatically block a pull request from being merged if there are critical issues.

Moving on, say you finally deployed to production: how can you make sure things keep working well there? Here we enter the third part, database observability. You drop in a sidecar container, a Docker container that you run in AWS or whatever cloud provider you use, or even locally. It connects to your database and extracts statistics to enrich all the data we have. For instance, there is schema analysis: you immediately see your tables together with insights, say, that a table has no index configured on it, or no primary key. Those are things we can verify just by looking at the schema. There are also observability reports: the tables with their row counts and sizes, the indexes you have and the history of how they were used over the last weeks,
the queries that were executed in your production database and how they performed over time, how the query performance changed and what you should look into, plus configuration details such as extensions and database settings. All of that is provided to you automatically, just by dropping in one Docker container and giving it a connection string to your database. Another thing we can do is enrich your ad-hoc analysis: there is a query analyzer where you provide a SQL statement together with its query plan and submit it for analysis, getting all the details of what is going to happen. You can do the same with the CLI tool, from your developer environment, integrating it with whatever offline CLI tooling you have, all to give you full observability and a full understanding of what's going on in your database.

So let's wrap up. Metis provides source-code integration using OpenTelemetry and CNCF open standards. It can capture traces and REST calls from your HTTP server, your SQL server, whatever else; it integrates with your ORMs and SQL drivers, in whatever context and language you have. It supports you during your CI/CD pipeline, in pull requests, and in whatever automated testing suite you have, giving you insights into what is going on and how it is going to work. It gives you full-blown observability, whether ad hoc, through the CLI tool, or through the platform itself. And you need all of that because your databases may break: because of bugs, ORM libraries, database inefficiencies, misconfigured indexes, or hidden configuration of lazy loading, transaction isolation levels, and whatnot. All these things may go wrong, and you need to be proactive in checking and making sure they work properly. You can't wait for load tests; they are way too late, too expensive, and they struggle to show you exactly what's going on. You need constant monitoring, and even after you deploy to production you need a way to verify that things do not degrade over time and do not break because of different peak hours, or because you deployed on Wednesday but the real traffic arrives on Saturday. Metis covers all of that: application integration, pull requests, observability, safety; all your bases covered, all to make sure your application works well, to push the checks to the left, and to provide proper database guardrails in your CI/CD pipeline. This is the novel approach we all need to take: we can't wait for issues to pop up in production, and we can't fly blind; we need to verify these things as early as possible.

With all that said, I'd like to thank you for listening to this talk. I hope you found it interesting. Feel free to visit our metisdata.io page, see the product, start using it, drop us feedback on Twitter or by email, and join our Discord channel to learn more and generally stay in touch. Thank you for being here.