Hello, and welcome to Fix Your Data Bottlenecks with Trino, Hive, and S3. My name is Andrew Berglant. I'm a principal software engineer and the architect of the cost management project at Red Hat. To give a summary: first I'd like to talk about some of the data problems we had, which is what wasn't working for us, then I want to cover what Trino is, and then finally go into what we did to fix our problems.

So let's start with those data problems. To kick it off, the data I'm specifically talking about today, our biggest problem, was the Amazon Web Services Cost and Usage Report. In the form that we were interested in, that's hourly line item data, both cost and usage of really everything that you can run in a cloud account, and that covers everything from the payer account down through the hierarchy of child accounts. So depending on the size of your AWS footprint, this can be quite large.

Now this little blurb here is about the line item ID, and you might say, oh, that's great, this is the ID that identifies each line item. Well, the important thing here is that the line item ID is consistent between all of the reports on one day, but it will not match those same resources in the next day's publication. And the way this report works, every day they're publishing the entire month-to-date's worth of data. There's no append; it's just a republish every day. So the problem this creates is that it's really difficult to get a unique key on the data that we're putting into our database.

The next problem is that it's a pretty large file, both in terms of how many columns it has and how many rows. And at the end of the day, with what we're showing, cost, usage, maybe a few things like the account that's actually running the resource, or the region, or the service, there are really only a few columns of interest. So right off the bat, just the shape of the data, it's not ideal for our relational database.

Next, we have no control over data publication time. Probably not a unique problem, but it's not streaming data; we have to wake up on a schedule and basically poll every customer's cloud account that we have and check for new data. And the way it gets published, we just end up with spikes of processing, and then lulls later on.

The next problem is that today's data often comes tomorrow. If that report's published at 10 a.m., there's a good portion of the day that hasn't happened yet, so we're not going to see that data till tomorrow. So it's never as simple as just processing today's data, right? There's at least a rolling window of several days that we have to be concerned with.

So what does that end up looking like? Well, because we don't have a reliable way of updating data or a neat key around it, we end up doing a lot of deletes. Maybe that raw data, that hourly table, we have to delete a bunch of data from it before we reprocess. Well, we also do a daily roll-up, so before we reprocess that, we need to delete from those tables as well. And then we do some aggregation, some pre-calculation, to make our APIs' job a little bit easier. Finally, the summarized data that we show in the UI, we have to delete from that too.

If you are unfamiliar with the way deletes work in a relational database, it isn't an immediate delete. The row is basically marked for deletion until a vacuum operation comes along later and actually frees that space. And that freeing, well, depending on how the vacuum works, sometimes it can leave large blocks of data in different spots.
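To make that pattern concrete, here's a rough sketch in Python, using psycopg2, of what that delete-before-reprocess step looks like. The table names, columns, and connection details are made up for illustration; this isn't our actual schema, just the shape of the problem.

```python
import psycopg2

# Hypothetical table names -- the real schema is different, but the pattern
# is the same: every reprocessing run starts by clearing the rolling window
# from each layer before the fresh data is inserted again.
TABLES = [
    "aws_line_items_hourly",   # raw hourly line items
    "aws_line_items_daily",    # daily roll-up
    "aws_cost_summary",        # pre-aggregated rows served to the API/UI
]

def clear_window(conn, source_uuid, start_date, end_date):
    """Delete the rolling reprocessing window from every layer."""
    with conn.cursor() as cur:
        for table in TABLES:
            cur.execute(
                f"DELETE FROM {table}"
                " WHERE source_uuid = %s"
                " AND usage_date BETWEEN %s AND %s",
                (source_uuid, start_date, end_date),
            )
    conn.commit()
    # Postgres only marks these rows dead; autovacuum has to reclaim the
    # space later, and that work competes with the API's own queries.

conn = psycopg2.connect("dbname=cost user=cost")  # illustrative connection
clear_window(conn, source_uuid="...", start_date="2021-08-01", end_date="2021-08-05")
```

Multiply that by every customer, every day, and the database spends a lot of its time deleting and vacuuming.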
So it's just not ideal. And you might end up seeing a lot of these: maybe 500s, 502s, 504s. The database is so busy doing that data processing or vacuuming that it's just not free to efficiently serve API requests. And this is going to be the first thing your customers see, so it's just a bad situation to be in. It can kind of feel a little bit like this.

Now let's switch gears and take a look at what Trino is. Using their own phrase, Trino is a distributed SQL query engine for big data. Let's break that down and first look at distributed. In the architecture of Trino as it's deployed, you start with a coordinator, and then you can deploy any given number of worker nodes, and those are really what do the meat of the work in Trino. The coordinator manages and sends work off to the workers to do. Now, without a connector, those workers can't really do anything. Trino on its own requires a connector to access data in various locations. The two connectors I've shown here are one for Postgres and one for Hive; these happen to be the two we're using.

Now, it is also SQL. So as you see here, if you're used to writing SQL for a relational database, you will not have a problem writing SQL for Trino. It has its own set of functions, and there might be some translation required here or there, but at the end of the day, it is still SQL.

Next, it is a query engine. So back to our architecture here with the coordinator: we'll take that SQL query and send it off to the coordinator, and the coordinator is going to come up with a distributed query plan and divide that into a set of stages. Then accomplishing the work of those stages is done by doling out tasks and distributing those to the available workers. And those tasks in turn deal with splits, which you can think of as almost like a slice of data, the bit of data required from the connector to accomplish that part of the task. There are a few more concepts at an even lower level than that, but we'll stop there for now.

Finally, it's for big data. Using a quote again from Trino, it's good for running interactive analytics queries against data sources of all sizes, ranging from gigabytes to petabytes. So you might say, well, I don't really have big data, it's really not that big. It still performs quite well in those cases.

And finally, putting it all together, let's talk about what we did. The data that we have coming in, we can think of as a CSV file. The very first thing we do is take that CSV file in its raw form and send it to S3 in AWS. Maybe you use your favorite object store instead, maybe you prefer Red Hat Ceph Storage; you could use that interchangeably here. Next, using Python, specifically pandas and PyArrow, we take the data from the CSV file, put it in a pandas DataFrame, run a little bit of processing on it, and then convert it to a Parquet file. So we do things like specify the data type of each column for the Parquet file, and pandas and PyArrow take care of the rest (there's a rough sketch of this step just below). We generate that file and it too goes to object storage.

Now we have Trino over here. Right now it doesn't look like it's connected to anything, so how does it come into play? Well, that's where our Hive connector comes in. To access the data in S3, we're using the Hive connector, and Hive here, specifically the Hive metastore, is really a way of keeping track of table definitions and where they're located in object storage. Now, there's also some data remaining in Postgres that is useful to us, so we have a connector there as well.
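Before we get to what Trino can do with those connectors, here's that rough sketch of the CSV-to-Parquet step I mentioned a moment ago, using pandas and PyArrow. The column names and dtypes are purely illustrative; the real Cost and Usage Report has far more columns than this.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# An illustrative subset of report columns and the dtypes we force for them.
DTYPES = {
    "lineItem/UsageAccountId": "string",
    "product/region": "string",
    "lineItem/UsageAmount": "float64",
    "lineItem/UnblendedCost": "float64",
}

def csv_to_parquet(csv_path, parquet_path):
    """Read one CSV report, coerce the column types, and write Parquet."""
    frame = pd.read_csv(csv_path, dtype=DTYPES, usecols=list(DTYPES))
    table = pa.Table.from_pandas(frame, preserve_index=False)
    pq.write_table(table, parquet_path)

csv_to_parquet("report.csv", "report.parquet")
# The Parquet file then gets uploaded to the bucket and prefix that the Hive
# table's partition layout expects, for example with boto3's upload_file().
```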
One very neat thing with Trino is that in the same query we can gather data from two different connectors, or multiple connectors, and use them together. And finally, once we've done that query processing in Trino, we send the results of that query back to Postgres, where they will be ready for our APIs and UI to show off. And really, what we've accomplished here is that we've removed all of this file processing from Postgres. Now it's free to really be our API database, to run the queries we need to show data to our customers.

Okay, talking specifics, let's look a little bit at Parquet. Really the most crucial thing to know about Parquet is that it is a columnar data storage format. If you're used to a relational database, that's going to be row based: if you need information in a single column, you must fetch the entire row. Not so with Parquet. Here, if you need information from a specific column, you can fetch just that information. There are more specifics that I'm not going to get into in heavy detail here, around the file format and the way it uses compression and grouping of data within a file to make that even more efficient. But again, if you take home nothing else, it's columnar.

All right, let's also look at partitioning, which is an important part of our strategy here. This is a create table statement executed in Trino using the Hive connector. We have our columns. The important thing here is this WITH clause. We've got an external location, and that's our bucket in S3 with the path to our data. We specify that the format is Parquet, and then the crucial bit here is this partitioned by: we have three different partition columns, source, year, and month. Going back to the table definition, it's critical that those are the last three columns specified. So you see source, year, and month, all varchars, here.

Let's take a look at how that would look in S3. Okay, so I'm using MinIO as a stand-in for S3, but it's going to look exactly like our data would look in S3. I'm looking at our Parquet data. This next portion here, this 10001, is another important point with Trino: if you use schemas in Postgres, you can use schemas in Trino too. We're using those for tenant isolation. You can pretend that this is a customer's account number, and all data in this schema is only accessible when you are in that schema. So let's take a look at the structure of their AWS data. Again, I showed those three partition columns. For the first one, you see source= and then the string specific to this partition; you can see we have two source partitions here. Then a year partition for 2021, and finally month 08. So we've got a three-level partition, and finally that Parquet file is here.
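To make the partitioning piece concrete, here's roughly what a table definition like that looks like when you create it through Trino's Python client. The connection details, schema name, column names, and bucket path are all illustrative, not our actual definitions.

```python
import trino

# Each customer gets their own schema, which is where the tenant isolation
# mentioned above comes from; "acct10001" is a made-up example.
conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="cost",
    catalog="hive", schema="acct10001",
)

create_table = """
CREATE TABLE IF NOT EXISTS aws_line_items (
    usage_account_id varchar,
    product_region   varchar,
    usage_start      timestamp,
    usage_amount     double,
    unblended_cost   double,
    source varchar,   -- the partition columns must be the last columns,
    year   varchar,   -- in the same order as the partitioned_by array
    month  varchar
)
WITH (
    external_location = 's3a://cost-data/acct10001/aws',
    format = 'PARQUET',
    partitioned_by = ARRAY['source', 'year', 'month']
)
"""
cur = conn.cursor()
cur.execute(create_table)
cur.fetchall()  # DDL returns no rows; fetching just waits for it to complete

# The Parquet files then live under Hive-style partition directories, e.g.
#   cost-data/acct10001/aws/source=<uuid>/year=2021/month=08/data.parquet
# and new partitions are picked up with CALL system.sync_partition_metadata().
```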
One more element that was important in our analysis was cost. A lot of what we've talked about so far has been performance and compute; let's talk about the financial side of it too. This is a very rough picture, and I'm sure you could come up with different numbers or different configurations, but take two machines in AWS that have the same CPU and memory footprint, one in EC2 and one in RDS. As you can see, there's a 2x premium on the RDS instance. And that makes sense: you've got a managed relational database, and that premium is one of the ways you're paying for it. But in this case, running machines for Trino, we don't need that. So just spinning up an equivalent EC2 instance, we're already at a 50% discount there.

Now, looking at storage, it's actually much more significant here. Storing things on an SSD for RDS is expensive relative to storing data in S3, so this is a huge cost savings for us. We also have that beneficial thing about S3 that it is effectively limitless. We don't have to continue to boost the amount of storage we have; we just put more files in the bucket.

So now let's talk about scale, because that's really what this is all about. We had a scale issue, right? And it's a good problem to have: as you get more customers, well, they bring their data, don't they? One way to handle that, and this is certainly something we've still had to do some of, is that you have your relational database, your Postgres instance, and you grow it and you grow it. This is vertical scaling, right? You're just throwing more resources at the same instance. You might say, well, you can horizontally scale a relational database. That's true; my experience with that has been that it's not easy, but I know it is possible. Looking back at Trino, scaling here really becomes taking the number of workers and multiplying it out. So this is horizontal scaling, right? And I know that there are limits to this as well, but for our workloads, I don't think we're going to hit those. Also for scale, thinking about storage again: when we were storing raw data in a relational database, there was just a limit to the amount we could have there and still effectively process it. Whereas with object storage, it's just putting files, deleting files, putting files. And this is really giving us the ability to offer our customers more access to their data.

Let's review. What did we do? Well, we moved our data pipeline to S3, Hive, and Trino. Of course, we leveraged horizontal scaling for file processing and Trino queries. The file processing bit, that's when we're using pandas to create those Parquet files and send them up to S3. There's nothing about that that needs to be serial, so when we spin up our own code to do that processing, that part can be horizontally scaled to deal with whatever load we have coming in. And that data is going to S3, so there's no bottleneck loading up to S3 either. Horizontal scaling on the Trino query side, that's just spinning up more workers for Trino so we can handle more queries at a time, or larger queries.

We freed our database to serve the API. It's no longer managing the data pipeline, so it's there to do the queries we need at game time. We reduced query time for large data sets: because of the distributed nature of Trino, and because of Parquet and its columnar storage, we're only hitting the data we need in those queries, and the end result for really large data sets is faster query time. Not to mention all of the time saved by not continually deleting data in our Postgres database. To delete data in S3, you delete a file, or you can overwrite it; that takes considerably less time than it did in Postgres. We saved money: our storage is cheaper, the compute is cheaper, and that's a good thing. And finally, we can keep growing and customers can see their data.

So this all stemmed from an outage and a bottleneck, and that bottleneck has been solved for us. That's not to say we're not still on a journey. There isn't a silver bullet in moving to Trino; it comes with its own learning curve and its own configuration. It's not perfect, and it would be wrong of me to say it was. But we got past a blocking point, and thanks to the horizontal scaling we have here, we can keep growing.
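Before I wrap up, here's one more rough sketch to tie the pieces together: the kind of federated query this setup enables, reading the Parquet-backed Hive table, joining reference data that still lives in Postgres, and writing a daily summary straight back into Postgres, all in one Trino statement. The catalog, schema, and table names are again illustrative.

```python
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="cost",
    catalog="hive", schema="acct10001",
)
cur = conn.cursor()

# One statement: scan Parquet in S3 through the Hive connector, join a lookup
# table through the Postgres connector, and land the summary back in Postgres.
cur.execute("""
INSERT INTO postgresql.public.aws_cost_summary
SELECT
    li.usage_account_id,
    acct.account_alias,
    date_trunc('day', li.usage_start)  AS usage_date,
    sum(li.unblended_cost)             AS unblended_cost
FROM hive.acct10001.aws_line_items AS li
JOIN postgresql.public.account_aliases AS acct
  ON acct.account_id = li.usage_account_id
WHERE li.year = '2021' AND li.month = '08'
GROUP BY 1, 2, 3
""")
cur.fetchall()  # the INSERT reports a row count; fetching waits for it to finish
```

The Postgres side only ever sees the small, already-summarized result, which is exactly what frees it up to serve the API.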
That's all I have. Thank you so much for listening and giving me your time today. Yeah, so thank you, everyone. I was just going to say, it wasn't in the Q&A, but in the chat, Eric had a few questions; I think he's going to reach out to me offline, so we can have that conversation there. Thank you again so much.