Welcome to "Query Against Multiple PostgreSQL Instances as if They're One Server Using Starburst Presto." I'm joined by Randy Church-Calm, Solutions Architect at Starburst Data, who will discuss using the Presto abstraction layer to connect your PostgreSQL servers together so you can query them as one unit, create a scalable architecture where multiple users can have access at once, and add diverse data store types. My name is Lindsay Hooper. I'm one of the PostgreSQL conference organizers, and I will be your host for this webinar.

A little bit about your speaker. Randy has over 27 years of IT experience and spent the first 20 years of his career as an infrastructure architect at large enterprises such as Abbott Laboratories. He has a master's in computer science with a concentration in data communications and artificial intelligence. He has spoken at technical groups and meetups all over the country about database topics and has been a regular speaker at PostgreSQL meetups. Besides his extensive technical background, Randy is a musician on the side and has written four books with major publishers about the music business. So welcome to you, Randy. With that, I'm going to hand it off to you. Take it away.

Thanks, Lindsay. Thanks so much for having me. Can you hear my audio and see my screen before I get started?

Absolutely. Thank you.

Excellent. Thank you for making sure about that. Thanks, everybody, for joining. I actually come from doing a lot of PostgreSQL meetups across the country. I used to be at a company called Cockroach Labs, which made a highly resilient PostgreSQL-compatible database. I was always really fascinated by the people who would show up, because they always asked extremely detailed questions. At one point, I remember bringing up something related to high-availability Postgres, and an attendee was asking questions that were super deep in the distributed systems area. At the end, I said, "Your questions were really detailed. Why is that?" And he goes, "Oh, I wrote the book on high availability Postgres." So I know the crowd here can be really high caliber. I'm looking forward to your questions; feel free to dig in on whatever you need. That said, I will also be giving an overview depending on what your role is, since I realize not everybody has the same background as it relates to this.

If you think about what we deal with as Postgres users and admins, connecting to a single database is not really a problem. It's really straightforward. As soon as you have a second database, you have to figure out how you're going to get data from it and do anything useful with it. One way, of course, is to just make another separate connection to it, and then merge the data together using a spreadsheet, a script, or any number of other methods. Or you could use foreign tables within Postgres (the postgres_fdw extension), which work well enough until you get yet another server to deal with. In fact, as I have found dealing with especially large companies, which is what I always end up consulting for, there can be many, many different databases whose data sometimes needs to be merged, and you wind up dealing with cross-database queries: running queries across multiple databases and selectively moving tables.
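For reference, here is a minimal sketch of that foreign-table approach using the stock postgres_fdw extension; the server name, credentials, and table layout are hypothetical:

    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    -- Register the remote Postgres server and map the local user to remote credentials
    CREATE SERVER inventory_db FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'inventory.example.com', port '5432', dbname 'inventory');
    CREATE USER MAPPING FOR CURRENT_USER SERVER inventory_db
      OPTIONS (user 'analyst', password 'secret');

    -- Expose one remote table locally, then query or copy it like any other table
    CREATE FOREIGN TABLE remote_orders (order_id integer, sku text, qty integer)
      SERVER inventory_db OPTIONS (schema_name 'public', table_name 'orders');
    CREATE TABLE orders_snapshot AS SELECT * FROM remote_orders;

This works fine for one or two extra servers; the pain described next starts when the server count keeps growing.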
Sometimes you're going to use CREATE TABLE AS SELECT to do this kind of thing, which you can do if you have a foreign table. And needing to do analytics tends to be a really big reason for all of this. Of course, the analytics tools are not really aware of the underlying systems. Something like Tableau has the ability to pull data in from multiple places and merge it together, but it does that merging in memory, which means it's limited in the scale and scope of what it can handle. You could use data science tools as well, which take a slightly different view of all the pieces of data. Really, there are a lot of ways to solve this. But the problem you often have is that the analysts, the people who are trying to do something with the data, are not necessarily the administrators, and those users need something easy to handle it.

You might find this next part interesting. It's probably one of my favorite things from my transition out of being an infrastructure architect at very large companies. We had about 16,000 servers at Abbott Laboratories, for example, and I saw how we did it there with our own systems. But once I broke out, I moved to the pre-sales side, which was great: I wanted to do more consulting-style work and see how many different shops operate. As it relates to data, what I found is that there are really three personas that have to deal with the large amount of data you might have available. My experience from the Postgres meetups I've been to is that many of you are DBAs or data team people. You're very close to the data and the data sources. If you had another data source, you would have the skill set to pull it in: find the connection string and bring the data together through whatever your favorite method is. The data analysts tend to be one step away; in particular, they don't always know where the data sources are in the first place, and sometimes you have to handle that for them. The business is the furthest away, but they're usually asking a question that really needs to get answered, like how much inventory to buy, or anything that requires merging multiple data sources together.

I was talking to one company the other day in the marketing area; they also had a team that acted as the data team underneath a number of other areas of the company, because they became the experts. I'm under a lot of NDAs, so I can't say names or details, but keeping it at a high level: they're dealing with inventory for a product they ship out in volume, as well as marketing needs. Very often they need real-time access to data, and it lives in multiple types of data sources, not just one. So we often see this pattern: the business asks a question. The data analysts, who at larger companies are separate people from the DBAs, need to access those data sources to answer it. You usually end up creating an ETL job, which was another one of my jobs at Abbott Laboratories, and I'll talk more about that in a bit. Finally, you can provide the reports, and of course, they've usually forgotten the question by the time you've finally come up with an answer.
I was talking to another company the other day, and the executive level was saying, and you as a group will really appreciate this: why does it take a month or six weeks to add fields to some of my data sources? "It's just one field. It's just one field" is the opinion at that level. And the truth is, it's not simple, because there are a lot of complications, as you know, to dealing with schemas and figuring out how to put everything together.

Further, what I often find is that there are four methods people use to get access. For the people closest to the data, there are scripts and query tools. This was my poison of choice at Abbott Laboratories. I was mostly using scripts, directly connecting to various data sources of all flavors, turning the output into columnar data, and eventually pulling it into whatever DBMS was nearby so I could query it. Sometimes I even wrote the scripts to do the filtering and get the data we needed. Nowadays you can use tools, as we talked about a little earlier: BI tools, or Spark if you're going more into the data science area, which can handle multiple data sources, but there are some issues with that. ETL is one of the most common ways, and as mentioned, I ran that at Abbott Laboratories. When I left, after 10 years there, our data warehouse had 25 different data sources by my count, and I had written something like 1,000 scripts over that time. It was surprising to me how few people knew how to script in the first place, let alone how to write even a simple SQL query, whereas I had a lot of executives asking questions about the environment that we could only answer by pulling all the data together. If I had had method number four, an abstraction layer, I would have been thrilled. It would have made a huge difference, because across all those data sources you can grant a single point of access and query everything as if it were a data warehouse, as if it were just a single database. I'm going to talk about how to put that together as it relates to Postgres today.

What we usually find in companies that try to do this purely with tools is that security becomes a real challenge. This might not be a problem in a smaller shop, but I've dealt with many that have PII. If you're dealing with healthcare and life sciences, which is where I came from, you have PHI (protected health information) and HIPAA rules to handle, and that becomes harder when you're decentralized across multiple data stores. You want to put that in a single place if you can. You also don't necessarily handle concurrency very well with these tools, since each of them is trying to manage connections to multiple data sources. If you take something like Spark and try to run, say, 10 or more users against it, you usually find it grinds to a halt; it doesn't do well in that scenario. From a standards standpoint, direct access methods can't always merge data from different types of data sources, and I'll get to that in a bit. And finally, every analyst has to get the source connection strings, and if those change, you run into a problem.
When we move on to the data warehouse, which, as I said, is probably the most common way to deal with this, you tend to just move the data from one place to another. I used to write these all the time, and I ran into these types of problems. Yes, I used a little alliteration to make this more fun. It tends to be difficult: not everybody has the skills to do it, those who do are highly paid and in real demand, and maybe they can turn it around quickly. At the company I mentioned that ships a lot of inventory, they're sometimes asked to create these within 24 hours. It also tends to be delayed, not only because the jobs take time to write, but because they run on a schedule, say at 2 AM, so the most recent data you wind up with is really out of date. If you're dealing with inventory, you might want a dashboard that shows what's going on in real time. Naturally, the data is duplicated, which can be a problem because it costs money in infrastructure. And the warehouse is usually missing data sources; if we look back here, I drew this picture on purpose showing some data sources that were missing. In fact, it seemed like half the time when I was managing one of these, someone would ask a business question involving some data source I'd never heard of before that we then had to pull together.

So this leaves us with another way to handle this type of thing that's really advantageous and really kind of cool. I think you're going to enjoy exploring it and seeing how it can work in your own environment, partly because you can do this with just a container if you want. This containerized method can connect to multiple servers, so you can use it, at least initially, as a utility server, and I'll show you what kinds of features you get out of this little utility server. By the way, it's open source, which means it's free; you can just load it up and run it. That is going to be the topic of our next session, where I will show you how to do this. For now, picture this: you have one database connection string and get a catalog of all the data sources in your environment. You can query it with ANSI SQL, which is not a big deal for those of us who are just using Postgres servers, since Postgres does an excellent job of implementing SQL, but once you start to deal with other data sources alongside it, it becomes an issue, like the one I had to deal with in my environment. I really wish I could have done this, because I had MS SQL servers, Oracle servers, and a lot of other flavors, not just one type. You also want to be able to add data sources quickly, and that's really straightforward with a tool like this, because all you need is a connection string, basically what you would normally give a user to connect. It also translates data types on the fly, which makes a chunk of the ETL or ELT unnecessary. In case you haven't run into those terms, ETL stands for extract, transform, and load; ELT is the same steps with the order changed slightly, because sometimes you want to load the data before you run transforms against it.
The reason you need that type translation when you have more than just Postgres is that, as you've probably realized if you've ever played with other data sources, date formats are not stored the same way, nor are things like doubles or floats. They're sometimes stored differently, but you want to be able to treat them all as the same type. A good federated access platform like Starburst Presto handles a lot of that automatically for you. The other thing is you sometimes just want to do a CREATE TABLE AS SELECT and move a table between your servers. You can always back up and restore, but what if you just want to move one table, or do an insert, update, or delete based on some criteria, querying between servers? Once you have this connection, that becomes really easy and straightforward; there's a sketch of it after this paragraph. Finally, there's merging PostgreSQL data with other data sources; once you break out of that little world, you might want to do that.

So with federated access through an open source tool like this, you can add data sources quickly and access dissimilar types. In particular, the center column of this slide is the real story. Some of you may have ventured into Kafka, which lets you query it with a kind of SQL. Same with Hive, where we have a lot of orphaned data; I've seen so much of this lately. A lot of people are exiting the whole Hadoop area, and they've got a huge pile of data that's queryable using something called Hive, if you haven't run into it before. Hive has SQL, but it has a lot of limitations and doesn't work quite the same as the others. Spark has its own Spark SQL. And NoSQL isn't SQL in the first place, though Cassandra is one of the NoSQL stores that finally created a query language of its own. Interestingly, you can normalize all of this if you can abstract it, where it makes sense. So think about it from that standpoint: once you do, you can get highly concurrent access and grant centralized security access. You can think of Presto as a distributed SQL engine that provides a real-time, scalable, single point of access to your data. It ends up looking like this.

The nice thing, of course, is that from the point of view of the users, there's just a single connection string. No matter how many data sources you add, via just a connection file (I'll show you what one looks like in the second part of the presentation, when I do some double clicks; I want to keep my promise of staying at a high level in this first part and then diving in for those of you who like the bits and bytes), all you have to do is add the file, and suddenly users see another item in their catalog and don't really have to worry about it. Again, if you're the administrator at a small or even mid-size company doing all the work yourself, the current approach might work out fine; but as soon as you have a few users, this simplifies your life quite a bit. And I often think about things from the points of view of the different types of users who want to get something out of whatever data stores you have.
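Before moving on to those user personas, here is a concrete illustration of the cross-server operations just described, sketched in Presto SQL; the catalog names (pg_inventory, pg_reporting) and tables are hypothetical, with each catalog pointing at a different Postgres server:

    -- Copy a table from one Postgres server to another in a single statement
    CREATE TABLE pg_reporting.public.orders_copy AS
    SELECT * FROM pg_inventory.public.orders;

    -- Cross-server DML based on some criteria
    INSERT INTO pg_reporting.public.low_stock
    SELECT sku, qty FROM pg_inventory.public.orders WHERE qty < 10;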
It's probably the number one thing I refined after dealing with a certain set of customers at my very large company. There are a lot of internal customers that various people have to deal with, and all the business wants is quicker, more precise answers. One company told me, "Every time somebody comes to me with a response, it looks different based on where they sit, and I want a more consistent answer to some of the questions we have." This was a very, very large insurance company, and I'm not surprised. Those of you who have dealt with statistical methods know the famous saying that there are lies, damned lies, and statistics, and that's absolutely the case: you can always manipulate data to show people what you want based on your point of view. So they wanted a better, clearer view from their data analysts. And what we find from the data analysts is that they're always trying to get more data sources into their dashboards. From the point of view of those of us who mostly live at the infrastructure level and are just trying to provide access, it simplifies things: you can add more data sources easily and provide scalable, controlled, and secure access.

Really briefly, from the point of view of a data analyst: a single connection string, a platform that can scale up to handle the load, and access to new data sources as they're needed. Most of these folks are drag-and-drop, not even point-and-click; they may not write SQL very well, in my experience, and unfortunately the kinds of SQL they can write can sometimes bring a system to its knees. Interestingly, when you have an abstraction layer, it absorbs some of that; it keeps that from driving your underlying systems to unacceptable levels of load. And new access is very easy for them. You can see it from the point of view of a data scientist as well, who is often trying to do things like Spark; if you offload that to a platform like this, it becomes much easier.

You can also do things like data masking. If you have a social security number, you can show just the last few digits, or you can show a hashed version. The key with a platform like this is that the underlying query runs without the masking; only the result is shown in masked form, so the user can't see the details. That's exactly how the system works, and there's a small sketch of the idea below. The key for a lot of you in this zone is that you have a much simpler time administering the whole platform. We often get questions about scalability: for those of you who worry about having just the right amount of infrastructure and no more, you can actually scale it down, and since it's stateless, scaling down is very easy. Security becomes a big issue, and that's one final note before the next slides, where I'm going to get into the architecture and how this actually works for those of you who like the bits and bytes.
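As a plain-SQL illustration of that masking idea (this is a hedged sketch, not the product's built-in masking feature; all names are hypothetical), a view can expose only the masked form while the query underneath runs against the real column:

    CREATE VIEW analytics.public.customers_masked AS
    SELECT customer_id,
           -- hide the first five digits of the SSN, keep the last four visible
           regexp_replace(ssn, '^\d{3}-\d{2}', 'XXX-XX') AS ssn_masked
    FROM pg_crm.public.customers;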
But for some of you at the policy level, and others at financial institutions and health and life sciences institutions, the ability to do auditing, to handle security, and to do fine-grained access, to the point where you can say this particular user can only see these rows in this particular table, and even choose which columns, is really important. You would rather do that in a centralized place than in every one of the various data sources you might have, and part of what this is built to do is make that very simple. I'm not going to do a lot of demonstration of this; I might talk a little about it in the next session, scheduled in two weeks. The great thing is you can use something like Ranger, for those of you who have run into it, to handle that from a centralized standpoint.

So that's the overview. The idea, if I were to summarize it, is to simplify by abstracting everything away and then providing a platform that's really robust and secure. We already took a look at this architecture diagram and got the idea that we have workers, we can add more workers as we need to, and we have our coordinator node. Let's go under the hood a little, because I know some of you really like to do this.

Oh, and one question really quickly, to make sure I answer it. The most-asked architecture question from those in the know is: what's the difference between this and data virtualization? If you haven't run into data virtualization, it's a similar idea of providing an abstraction. The key difference is best answered through the architecture. With data virtualization, you're actually taking the data, moving it to another system, say a SQL Server or another Postgres server, and making that system do the computation. With this abstraction, you connect to the coordinator, and it splits the query up so the worker nodes you have can act on it in parallel. This is what speeds things up tremendously compared to many data virtualization platforms, which copy all the data into one data source so you can run the query. Everything we're talking about happens in memory: the query is split into pieces, processed on the workers, reassembled into a result, and finally given back to the analyst. Because the load on each individual worker is very light, the system reacts very quickly. This is why Presto, which by the way was created at Facebook in 2012 and open sourced in 2013, was made to handle hundreds of users at a time and petabytes of data with a reasonably small cluster. Interestingly, it's a much smaller cluster than Hadoop: for those of you who have dealt with Hadoop, if you were to size a Starburst cluster, you could do it with approximately one-quarter to one-third of the nodes and still meet the same capacity, the same number of users, and the same query load.
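If you want to see that query splitting for yourself, Presto's EXPLAIN can print the distributed plan, showing the fragments the coordinator hands to the workers; the catalog and table names here are hypothetical:

    EXPLAIN (TYPE DISTRIBUTED)
    SELECT c.region, sum(o.total) AS revenue
    FROM pg_crm.public.customers c
    JOIN pg_sales.public.orders o ON o.customer_id = c.customer_id
    GROUP BY c.region;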
And it usually acts a little more quickly than Hadoop, because it does all of its work in memory, whereas Hadoop has many MapReduce steps that keep hitting the disk, which slows things down a lot. So I want to get into a little of the architecture. I don't do this for every group, but I've traveled around the country and been to enough Postgres meetups to know that most of you will want to dig in. Let's go into the terminology and how it works, and if you want to ask questions about how all these pieces fit together, I'm happy to talk about it later.

We have a coordinator, which is the master node controlling the cluster. It handles the user connections and coordinates the worker nodes, as the name suggests. It can be made highly available, but the HA is basically another coordinator standing by: if one goes down, the standby hasn't been watching the in-flight queries and can't just pick up where they left off. Because the system is primarily stateless and mostly built for analytics jobs rather than transactional reads and writes, that's not a big deal in practice. It does get used for some very straightforward CTAS work (create table as select, which I'll be demonstrating in two weeks), and for inserts and updates, but it's mostly doing reads, and users will mostly just rerun the query. As long as a coordinator node is available, it tends to work out really well for people. I was an infrastructure architect for a long time, so I always look for the flaws and like to point them out. That is one, but a small one, and it hasn't tended to be a big deal for the users I've talked to, because their systems aren't really geared toward writes; they're read-oriented.

The worker nodes are rather disposable. They handle the queries, they connect to the data sources, and you can add and remove them statelessly. This is what makes the system so flexible. I was talking to somebody the other day who used the containerized version I'm about to show you. He's a very advanced, long-time Postgres user, and at one point he asked me, "I was running it and it got really slow. Can you tell me what I need to do to speed it up?" I said, "Well, let's take a look at what you have," and I saw that he had compiled everything into just one container acting as both coordinator and worker. You can do that for tiny labs, but the poor thing didn't have enough juice to handle his workload. Once you add a few more worker nodes, the whole system comes together nicely. And from my point of view as an architect, I love that I can flexibly add more nodes as I need them and turn them off when I don't.

Here are a few other pieces of terminology. They're a bit overloaded across the database industry, but they have specific meanings in the way Presto works, and you're going to hear these terms, so I'm going to describe how a federated system like Presto uses them. A connector is the code that handles connecting to a data source; the easiest way to think about it is as a driver. A catalog is a connector configured for one particular data source, with the connector name, the database connection string, and the login credentials. Catalog files are very tiny.
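Since the on-screen file doesn't survive in a transcript, here is a representative PostgreSQL catalog file with the fields just listed; the filename, host, and credentials are placeholders:

    # etc/catalog/pg_inventory.properties
    connector.name=postgresql
    connection-url=jdbc:postgresql://inventory.example.com:5432/inventory
    connection-user=presto_reader
    connection-password=secret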
They're very easy to create, which is one of the reasons it's so easy to add one. You're looking at an actual catalog file right now; this one just has an example connection string. As you can see, we really just have the connector name, which is postgresql in this case; the URL, which is the standard connection string you would use in any of your tools anyway; and the username and password, which, yes, can be squirreled away in a separate vault if you wish. I just have them in the open here because it's really straightforward. That's what a connection file looks like, and I'll talk about how to actually activate one in a couple of slides from now.

Presto has a pluggable architecture, which means you can add new connectors, and people do all the time. There's actually a very large number of data sources; Postgres is just one of a universe, which is great for all of us because it lets us connect to more things. You can add types, user-defined functions (UDFs), and system access controls, so the pluggable architecture lets you extend it nicely. I would give a lot of caution to those of you who are experimental and want to write your own connector, because it really runs at the core code level, and you could cause real problems if your connector starts snarling everything up. There are a lot of connectors out there already, though, so it's very flexible. UDFs are one of the most common extensions we see, and that's totally fine; we have good documentation on how to add them. And you can do an awful lot with UDFs here, because you're operating at a cross-data-store level, not on just a single database, so you can do some very powerful things.

So the technical definition, at a higher level, is that it's a federated MPP (massively parallel processing) distributed SQL engine. It's not data virtualization, because the processing is done in Presto rather than at the data sources, as we just explained, and it's much faster and more scalable than that. It's ANSI SQL compliant, and most major SQL features are covered, so things like CTEs (common table expressions) and window functions are available, even against data sources that didn't originally have them. Of course we have them in Postgres, that's not a problem, but having them across this and other sources is a really big advantage. It supports all TPC-H and TPC-DS queries and supports UDFs, as we just discussed, and additional functions are always being added by the open source community, because it's an open source platform. Interestingly, something like 93% of the commits come from Starburst Data; that's the company I work for, and that's who's talking to you right now, basically. So we're the ones who have the most stake in making sure things are good, and we hired the people who invented it in the first place, as well as many of the biggest contributors. There's quite a community behind it.

The great thing about how it can be put together is that you can run it on premises, and you can run it in Kubernetes; many people do, because that makes extending the cluster and adding worker nodes trivially easy.
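For the lab-style container setup mentioned throughout, a minimal sketch looks something like the following; the image name and the catalog path inside the container are assumptions, so check the image's documentation for the current values:

    # Single-node Presto with your catalog files mounted in (image and path assumed)
    docker run -d --name presto -p 8080:8080 \
      -v "$(pwd)/catalog:/usr/lib/presto/etc/catalog" \
      starburstdata/presto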
The AWS Marketplace has it if you want to go that route, and there's a CloudFormation template you can use as well. The other form of it, which you don't see here because it's too small for the kind of slides we would normally show customers, but which is perfect for you, is the thing we're going to do in two weeks: I'm going to give you a containerized version you can use to connect to multiple Postgres servers. As I said, I dealt with a person the other day; he actually runs one of the user groups. I'm not sure if I should use his name, so I'll leave it out for now, but he's super sharp, and I gave this to him as just something to play with in his lab. The next thing I knew, he was using it as a utility server to copy tables across multiple Postgres servers when he needed to, and to do little utility tasks that required work across multiple servers. It became so useful for him that it gave me the idea to develop something I hope to put up on GitHub, so you can use it for the session two weeks from now.

Connecting to a data source is something I promised I would talk about, and we already looked at the file. It's easy, trivial, takes a minute. Once you've created that file, you put it in the catalog directory on all the cluster servers, reinitialize the nodes, and all of a sudden users immediately have access to the new data source. Very straightforward compared to, say, an ETL, like the company I talked about earlier that was sometimes given 24 hours to write an entire ETL job just to pull the data together. Once you've done that, the new source appears as just another catalog. Let me show you; I have a live system right now, running in AWS. It's a Presto instance connecting to many, many different flavors of data sources. We have Glue, we have Hive, we have JMX. We even have a Kafka queue. Naturally, there's Postgres in the list. This particular instance is a live one connected to a diversity of data sources so we can show how it works. And you can run queries like this one, which joins three different data sources and four different tables: Glue, MS SQL, and Postgres in a single query. The neat thing is that the Glue data would normally be hard to merge, but we've analyzed it, and the statistics feed our cost-based optimizer, so it happens lightning fast, pulling data together from multiple diverse sources. That's just a quick taste of the kind of thing we're going to be doing in two weeks.

You can connect to object stores with it, too. I know this is a Postgres group, but for those of you who are interested, if you have ORC or Parquet files and a Hive metastore, it becomes really straightforward, and you don't need to run Hadoop. In fact, people are starting to use these federated abstraction layers to replace their Hadoop systems; as mentioned, it runs much faster because everything runs in memory. It's very cool. It supports many different flavors, and you can use partitions and bucketing. This is a full list of data sources you can go after, and it tends to be a very big advantage to people.
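To give a feel for the demo query described above (the live system isn't reproducible here, so the catalog, schema, and table names are hypothetical), a federated join across Glue, MS SQL Server, and Postgres looks like this:

    -- One ANSI SQL query spanning three catalogs, submitted over a single
    -- JDBC connection such as jdbc:presto://coordinator.example.com:8080
    SELECT c.region, i.warehouse, sum(o.total) AS revenue
    FROM glue.sales.orders o
    JOIN sqlserver.dbo.inventory i ON i.sku = o.sku
    JOIN postgresql.public.customers c ON c.customer_id = o.customer_id
    GROUP BY c.region, i.warehouse
    ORDER BY revenue DESC;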
Now, of course, the one we all love is Postgres, and you can put multiple Postgres servers on the Starburst Presto cluster and be done at that point. But if somebody ever asks you for a piece of data that happens to be in MySQL or MariaDB, or MS SQL, Oracle, Teradata, even Snowflake, which tries to eat the world and pull data from everything, even Snowflake can be connected alongside other flavors of data sources, including a Kafka queue, to answer the queries you might need.

And the way you connect is with a single connection string. Let me show you what it looks like with the instance we have right here. We don't have a DNS entry for it, so it's a little raw, but I know that's the way this group tends to like it. The connection settings are just a JDBC connection to Presto on port 8080, and that's all. Once you have that, the web port of the same server also serves your console. For finished queries, like the one I ran just a minute ago, you can see how it executed, how it was broken into five different stages, and how the work was sent to different worker nodes. We can go into that at another time, because we only have a few minutes left, so I want to get through the rest of the architecture really briefly.

Many different business intelligence tools already have a Presto connection, which makes it really easy for anyone to put this together, and there are plenty of data science tools as well. Once you connect, you combine things using a three-part dot notation. We're used to two-part notation in Postgres, of course, but the three parts here are the catalog, the schema, and the table, and with that you can combine data sources in one query, as I just demonstrated a minute ago. I can go deeper on this in the Q&A session, since we're headed toward the end of the time period. Scaling in and out just means adding nodes when you need them and shutting them off when they're done; as mentioned, Kubernetes makes this particularly easy if you're familiar with it, and I can explain how that works in the Q&A for those who aren't; it's pretty neat. Operationally, it's stateless, so you just back up those little config files. They're tiny; you can even keep them in your code repositories, GitHub or whatever you use. And as mentioned, you can run HA warm-standby coordinators if you want. Kubernetes is particularly good here, because if your coordinator dies, it will just restart it, no problem, and sometimes people create multiple clusters to deal with it.

Finally, the last little piece is pretty neat: there's a cost-based optimizer that works across data sources. If you want to see a bit of how that works, there's actually a book out on Presto that goes into it. And the enterprise version does some neat things, in that it grabs extra statistics to make the cost-based optimizer even more effective. We're at the 45-minute mark, and I wanted to make sure to leave enough time to answer questions, so feel free to ask away, and I can touch on any topics people want to cover.
That's it for the overview for now. In case you need to drop off: in two weeks you're going to see this in action against just Postgres servers, in a little setup you can run in your own lab. The third session, two weeks after that, will be about connecting it to multiple disparate types of data sources, to open your Postgres servers up to a broader world.

Okay, that was fantastic. Thank you so much, Randy. I've had two questions come in, but anyone else, please send yours over now. The first is: at a high level, what's the difference between Presto open source and the enterprise version, Starburst?

That's a very good question. Like many enterprise companies with an open source project behind them, we like to make it very easy for people to run it in their labs as necessary. But once you start to actually operationalize it and run it the way I used to run things back at Abbott Laboratories, you need a lot more. First of all, support, more than anything else; at a high level, the support alone is really worth having. Besides that, we provide connectivity and connectors that are not available in the open source version, because these are enterprise data sources where we've worked with the vendors at hand, like Oracle and others. The Teradata connector, for example, is something you can only get as an enterprise user of the product. The other part is that the enterprise versions of the connectors that do exist in open source are much more robust: they grab table statistics and metadata so that the cost-based optimizer, the CBO, becomes much, much faster. So the connectivity feeds back into performance; on average, it's about seven times faster than the open source version, which can be a really huge help.

Probably the biggest cornerstone, though, is security. As mentioned before, if you're a financial institution, a life sciences company, or any kind of organization that needs, say, data masking because the data is identifiable, you need all of these extra features, and that's something we've poured a lot of effort into. The open source community, while somewhat interested in this, doesn't find it as, let's say, sexy to work on, but it's the kind of thing enterprises require, so we've poured a lot of effort into making it robust enough to pass your security audits and meet all of your regulatory requirements. And of course, we've enhanced the management functions, but I'll leave that at a very high level. I have details on each of these if anybody wants to dive in: how the cost-based optimizer works, how the parallel connectors pull data much faster, or, on the security side, how we handle Kerberos, LDAP, and fine-grained access control. But I'll leave those as double clicks for later.

And our second question: with all this information, what are the top three takeaways from this presentation?

Oh, what a great question. I wish I had prepared an answer to that ahead of time, because it deserves a very tight, very straightforward answer.
I will do my best, though. I think the first takeaway is that if you ever need to connect to multiple data sources, multiple Postgres servers first of all, but certainly multiple data sources of any type, a federated access layer can make your life much easier and more straightforward.

Second, if you ever have a data analysis problem, very often the issues you face are not just about the main data servers you have, but about trying to create a coherent view across all of them; otherwise you wind up with the partial view I had to deal with so often. I really, really wish I had had this tool back then. Once you move to this type of architecture, you can add new data sources easily, and, as mentioned before with the company that sells a lot of product on a regular basis, it gives you real-time access, and man, does that simplify things and make your life better.

And third, if you're going to implement something like this, you'll find there are a lot of potential approaches in this space for getting at multiple data sources. That's not surprising, because very often you need to talk to multiple data sources to answer even the simplest questions. With Starburst Presto in particular, considering its history, it grew up at Facebook, where they dealt with petabytes of data and sometimes hundreds and hundreds of users. This is a system that can run as a little container on your laptop, which is what we're going to do next session, and then scale all the way up to hundreds of thousands of users. There's a company I talked to the other day that handles reporting; their end users can request reports, and they claim that at their peak period they get as many as one million report requests per day. At that kind of scale, you need a platform that can offload the work properly and respond quickly; I think the name Presto is actually really well chosen. So the idea behind a good federated platform is one that can start from a small seed and grow as large as your particular IT needs are at the time. What we found with the other solutions out there is that some of them hit a scaling limit that's far too low. Part of it, as in the data virtualization case we talked about earlier, is that they try to do the work in the end data sources, and all the workers do is copy data between them; eventually you strain your end data sources, when what you'd prefer is to have the work done at the workers, as we discussed. So those are the three things I'd pass on to anybody trying to walk away with something useful. And as mentioned, I'll make sure you have very practical ways to use this in your own labs as part of this talk series we've put together for you.

That was great. I haven't seen any other questions come in, so with that, I think we'll give everyone a few minutes back in their day. First of all, thank you, Randy and Starburst; this was a great presentation, and I learned a lot. And second, thank you so much to all of our attendees for joining us. Whether it is morning, afternoon, or evening, I hope you have a great one.
And I hope to see you back here in a few weeks for the next Starburst webinar, and next week for the next Postgres Conference webinar. Cheers.

Cheers. Thank you.