Yeah, I'm so sorry for being late. It was Andy's fault, he sent me an invitation for the wrong hour. All right, so yeah, hi, everybody. My name is Ippo. I work for Amazon. I have spent a lot of my time working on analytics, and one of the services I'm very close with is Amazon Redshift. This talk is going to cover the evolution of this service over the past 10 years. If you are taking Greg Ganger's class, I did pretty much the same talk last week for Greg's class, so some of you may want to just skip this and do other stuff. Yeah, I have three students. You're good. All right, so this is probably the only marketing slide I will have. Amazon Redshift is Amazon's cloud data warehouse. It is a very large service. We're very careful when we make public statements, and we have gone publicly and said that daily we have tens of thousands of customers using Redshift to process exabytes, plural exabytes, of data on AWS's global infrastructure of 30 regions and 96 availability zones. So this is a massive system. Since it's already 2 PM on the East Coast, you can safely assume that for the day our systems have processed at least one exabyte of data, if not more. So this is a very, very big system. And our goal is to be easy, secure, and reliable. We want customers to be able to analyze all of their data using Redshift: data that are either scalar or complex and nested, data stored in operational systems, data lakes, warehouses, and whatnot. And we want to do that while offering the best price performance at any scale. But how is the sausage made? If you look at any large service or system like Redshift, you will see that there are some thematic areas, some areas of focus for the service. For Redshift, as you see on the screen, there are six basic areas where we are putting most of our energy and investment. The first one, and there is a reason I put them in that order, the first one in terms of priority is security and availability. Usually when I give this kind of talk in a class in person, I ask at that point: between security and availability, which one is the most important? So can somebody answer this question? Hello? I'm sorry. Yeah, I'm asking the class: between security and availability, which one is the most important? "It just depends on your use case." No, no, it does not depend. Security trumps everything. The most secure system is the one that is down. Security is the highest priority, and availability comes right after you are secure. We have tens of thousands of customers giving us exabytes of data to process and keep available, and that is our primary obligation. When it comes to security, we keep on pushing the envelope of the things that customers can do and can control. There are very interesting challenges there: for example, being able to offer high-performance analytics on top of encrypted data, to offer dynamic data masking, to do identity management through our systems. There are a bunch of complicated problems in doing that. And, for example, in Redshift's case, all this processing happens within the customer's VPC network and the data does not go out of those boundaries.
When it comes to availability now, once you have all these security features, we try to make our systems very available. One of the biggest architectural changes that Redshift made in recent years, since the inception of the service, was the separation of compute and storage. When we launched Redshift, the place where we persisted the data was the locally attached SSD storage of the individual Redshift environments. We went through a very big project and separated storage and compute. By separating storage and compute, we have the ability to guarantee that the recovery point objective, RPO, is zero. Meaning that whenever a customer sends a transaction to commit some data and we say we committed your data, we can guarantee there will be no loss of data from that point on. And once you separate storage and compute, you can do very interesting things to offer improved availability. For example, whenever there is an environment running in one availability zone and something happens to this availability zone, we can easily, at the press of a button, remove this environment, spin it up in a separate availability zone, and start consuming the managed data from the point of the last commit. Now we're taking this a step further and we are offering multi-AZ Redshift. What we do is run two separate, independent environments in two availability zones. The data gets committed and written to Redshift managed storage, which is essentially Amazon S3. Then we monitor the health of the systems, and if something goes wrong, we very quickly fail over to the second environment and continue being available. During the healthy state, when the systems are healthy, the customers get the throughput and performance of having two environments, and when something goes wrong we do this very quick failover. So we offer the best of both worlds: availability and good performance.

You said it's essentially S3, but is there something special above it? Has S3 prioritized Redshift, or, if I tried to build my own version of Redshift, could I do the same thing, or do you have an extra layer there doing the next step?

So, Redshift managed storage is data blocks stored in a proprietary columnar format in S3 buckets, right? Now, what constitutes a format is a bit of a philosophical discussion. Take Iceberg: what is it? Iceberg is Parquet at the core of it, but there is something more on top of that, a protocol of manifests. In a similar fashion, Redshift managed storage is the data, the bulk of it being one-megabyte columnar blocks in S3, plus obviously some protocols, transaction processing protocols and whatnot, that make this very efficient and very performant. So there is more to it. I'll take it. Okay, but the bulk of it, like, physically, the data is stored in S3? Yeah, yeah. Okay.

All right, so security and availability are very important things for customers when they want to run a database. The second area of importance for us is performance. Offering very, very good performance is kind of the DNA of Redshift.
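Before moving on to performance, here is a minimal sketch in C++ of the RPO = 0 commit idea described above, under my own assumptions. All names here (ManagedStorage, durable_put, commit_superblock) are illustrative stand-ins, not Redshift's actual API; the point is only the ordering of the durable write and the acknowledgment.

```cpp
// Minimal sketch of why commit-to-shared-storage gives RPO = 0.
#include <string>
#include <vector>

struct Block { std::string key; std::vector<char> bytes; };

// Stand-in for Redshift managed storage backed by S3.
struct ManagedStorage {
    // Returns only once the block is durably stored (the S3 PUT succeeded).
    void durable_put(const Block&) { /* stub */ }
    // Persists the commit record that makes the new blocks visible.
    void commit_superblock(long txn_id) { (void)txn_id; /* stub */ }
};

// The ack goes back to the client only after all dirty blocks and the commit
// record are durable in shared storage. Local SSDs act purely as a cache, so
// losing a node -- or failing over to another AZ -- loses no committed data.
void commit(ManagedStorage& rms, long txn_id, const std::vector<Block>& dirty) {
    for (const Block& b : dirty) rms.durable_put(b);  // flush dirty blocks
    rms.commit_superblock(txn_id);                    // the commit point
    // Only now: tell the client "committed". RPO = 0 from this point on.
}
```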
In order to understand why Redshift achieves this good performance, we need to take a look under the hood and start understanding how Redshift is implemented internally. So in the next couple of slides I'm going to get into the details of how Redshift is implemented. In Redshift, we have two layers, the compute layer and the storage layer. Within the compute layer, we have two types of nodes. The first one is the one in the green box there, called the leader node. And then there are a bunch of compute nodes that are the workhorses; they are the ones doing most of the work, whereas the leader is the brain. It is the point of connection from the outside to the system, it is where the customers submit their requests, their statements, and it is where the catalog lives. So whenever a query comes in — you are taking Andy's class, so you know all these things by heart — we go through the typical query execution lifecycle. We take the query, we parse it, we do semantic rewritings: we try to rewrite it into an equivalent SQL query that is more efficient. Then we do cost-based query optimization, based on statistics that we maintain, and we generate a distributed query execution plan that may look like the one I have on the right-hand side. There are some interesting properties in Redshift. For example, we give the customer the option for the tables in the database to have distribution keys and sort keys, so, physical properties. For example, if you have a customers table and an orders table, both of them distributed by customer ID, and every query you have joins customers and orders on the customer ID, then we don't have to shuffle data across nodes; we do what we call co-located joins. The query optimizer obviously takes all these physical properties into consideration to generate an efficient distributed query execution plan. After we are done with this cost-based query optimization process, an interesting thing in Redshift is that for every query fragment — a query fragment being a pipeline that ends in a stop-and-go operator — we go and generate C++ code. We almost literally open a C++ file and printf code into that file. We take this generated code, we compile it, and we send the executables down to the compute nodes, which are the workhorses. The compute nodes run on EC2's Nitro hardware, so we obviously have a lot of hardware optimizations in there. They start executing this tight loop of the individual query fragments: they are assigned some data partitions from the managed storage and they start processing these partitions, doing whatever the query wants them to do. And we do all sorts of tricks that you can read about in the modern literature on efficient query execution. We do min-max pruning. We use vectorized AVX2 SIMD processing on the data. We use vectorization-friendly encodings and compressions on the data blocks so that we can do a lot of processing directly on encoded data. We do late materialization. We do all sorts of stuff to make this processing very efficient. Sorry, may I ask: are you doing push- or pull-based processing? It's literally a push-based model. Okay. So what we do is we take a query fragment and we generate the code.
We generate one executable for all the operators within this query fragment. We put them together, so it's one piece of code; it's not a staged execution. And then this piece of code takes data blocks from the storage layer and pushes them up. Got it, thank you. Okay. So, in one slide, by now you are experts in the design of Redshift; it is a very, very simple design.

Are you using GCC or some proprietary hand-tuned compiler? We use GCC. Obviously, we test our code with a bunch of compilers and we keep on updating that, and that is always an area of optimization for us.

Do we have a question? Yes, sir. How do you distribute the table data? Do you have shared storage, and how do you decide where to put the table data — are you using something like consistent hashing to ensure balance? Yeah, so the way we do it is we have a fixed number of data buckets, and then we do a deterministic assignment of these partitions to compute nodes. And I'm going to talk about it later: when we change the compute size, we move these things in a consistent fashion, so we maintain the ability to perform co-located joins. Thank you.

All right. So let's dive a little deeper into this code generation, since this is something that is kind of unique to Redshift. Let's take a very simple example of a query like the one I have up there, which applies a filter on one table, joins two tables together, and calculates an aggregation on top of that. What Redshift ends up doing is generating a piece of C++ code that looks a lot like the one I have on the left-hand side, which you can actually read, right? It opens a while loop, it starts pulling data from the scan, it applies a filter, this predicate less than 50, it calculates a hash value, and then it goes and probes the hash table. And if there is a match, it applies the aggregation on the match. It's very, very simple; you can actually read it. Of course, we do a bunch of optimizations, a bunch of tricks in there. For example, we have a FIFO queue where we push and pull from the end of this queue in order to improve the L1 cache hit rate, and we minimize cache miss latencies. This version of the code is the scalar version; what we run in production is actually an AVX2 vectorized version of this code, which is a little more complex than that. This version also does not show the late materialization and min-max pruning; we do all these things. But in a nutshell, you can get the gist of it: what Redshift ends up doing is generating this very, very efficient piece of code that runs in a tight loop. Okay. Whenever I'm giving this kind of talk, at this point somebody raises their hand and says, hey, this is cool, but don't you worry about the GCC compilation cost that may be incurred at run time? And the answer is absolutely yes, we do worry about the cost of compilation. I'm proud of you. The reason the class isn't asking is that we already talked about this. Okay. So to answer this: traditionally, in workloads where the individual queries would run for minutes or for hours, adding some seconds of query compilation may not have been a problem.
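To make the generated code concrete, here is a hand-written approximation of the scalar fragment just described. The helper types (ColumnScan, HashTable) are illustrative stand-ins of my own, not Redshift's internals; the shape of the loop is the point.

```cpp
// Approximation of the scalar fragment Redshift might generate for something like:
//   SELECT sum(o.amount) FROM orders o JOIN customers c ON o.cid = c.cid
//   WHERE o.amount < 50;
#include <cstdint>
#include <cstdio>
#include <unordered_set>
#include <vector>

struct Row { int32_t cid; int32_t amount; };

// Illustrative stand-in for a scan over decoded columnar blocks.
struct ColumnScan {
    const std::vector<Row>* rows; size_t pos = 0;
    bool next(Row& out) {
        if (pos == rows->size()) return false;
        out = (*rows)[pos++]; return true;
    }
};

// Illustrative stand-in for the hash table built from the customers table.
struct HashTable {
    std::unordered_set<int32_t> keys;
    bool probe(int32_t key) const { return keys.count(key) != 0; }
};

// One tight loop for the whole fragment: scan -> filter -> probe -> aggregate.
// No operator-at-a-time interpretation; rows are pushed up through plain code.
int64_t run_fragment(ColumnScan& scan, const HashTable& ht) {
    int64_t sum = 0;
    Row r;
    while (scan.next(r)) {             // scan
        if (r.amount < 50)             // filter: the "< 50" predicate
            if (ht.probe(r.cid))       // hash join probe
                sum += r.amount;       // aggregate on match
    }
    return sum;
}

int main() {
    std::vector<Row> orders = {{1, 10}, {2, 99}, {1, 40}, {3, 5}};
    ColumnScan scan{&orders};
    HashTable customers{{1, 2}};       // customer ids on the build side
    std::printf("%lld\n", (long long)run_fragment(scan, customers));  // prints 50
}
```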
Coming back to the compilation cost: what we see, especially in recent years, is a shift towards more real-time, lower-latency analytical workloads, where the potential cost of GCC compilation on the critical path would be a problem. In order to minimize this effect, we do a bunch of tricks. For example, just by putting a cache of the compiled snippets on the leader node of every Redshift computing environment — and because there is a lot of repeatability in the queries customers run within an environment — we were able to achieve up to a 99.5% compiled-code cache hit rate. But we took it to the next level, now that we are in the cloud. Any time we see a query fragment that we have not seen before, we push this query fragment to S3, and then we have a farm of machines that compiles this code with the maximum compilation optimizations. We put the compiled code in a global cache and make it available to other clusters in the fleet. By doing this very simple trick, we were able to improve the cache hit rate from 99.5% to 99.96%; in other words, we reduced the miss rate by an order of magnitude, by taking advantage of the fact that we run in the cloud as a service. And the reason this works is the pigeonhole principle, right? There are only so many tables in the world that have five integer columns, and there are only so many queries that apply a simple integer predicate on one column of a table. In a large service such as Redshift, where we have tens of thousands of customers, hundreds of thousands of computing environments, and billions of queries running daily, the probability of repetition is very, very high. Obviously we do all sorts of tricks there to try to have mostly stable generated code when we do this query processing, and there are a lot of details I'm omitting here.

Did you explore using LLVM for this? Yes, okay. So that is deterministically the second question I get on this slide. The first one usually is "don't you worry about the compilation cost," and the second question is "why don't you use LLVM for that?" Honestly, we started building Redshift more than a decade ago, and that actually preceded the LLVM-based approaches. So we have a huge code base, a very stable system, that has this kind of C++-based generation, and that's why we stuck with it. The HyPer paper is 2011, right? You guys started building this before that. So we were already developed; it predates the HyPer paper. Yes.

Can you please repeat the comment? It's a little difficult to hear. Sorry, what is your comment? The comment was that since you generate C++, you have readable, debuggable code, whereas in the HyPer case, where you compile into an LLVM IR, you're looking at something separate, and the fact that you can debug this matters.
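Stepping back to the compilation cache for a moment, here is a minimal sketch of the two-level compiled-object cache with invented helper names (s3_get, s3_put, compile_with_gcc are stubs, not real APIs). The key idea the talk describes: the cache key is derived from the generated code, not the SQL text, so different customers whose queries generate the same fragment can share one compiled object.

```cpp
// Minimal sketch of a local + fleet-wide compiled-object cache.
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>

using ObjectFile = std::string;  // compiled object bytes, simplified

std::optional<ObjectFile> s3_get(const std::string&) { return std::nullopt; }  // stub: global cache lookup
void s3_put(const std::string&, const ObjectFile&) {}                          // stub: publish to the fleet
ObjectFile compile_with_gcc(const std::string& src) { return "obj:" + src; }   // stub: slow path
std::string hash_fragment(const std::string& src) {                            // stable key for a fragment
    return std::to_string(std::hash<std::string>{}(src));
}

class CompileCache {
    std::unordered_map<std::string, ObjectFile> local_;  // on the leader node
public:
    ObjectFile get(const std::string& cpp_source) {
        const std::string key = hash_fragment(cpp_source);
        if (auto it = local_.find(key); it != local_.end())  // ~99.5% hit here
            return it->second;
        if (auto obj = s3_get(key))                          // fleet-wide cache
            return local_[key] = *obj;
        ObjectFile obj = compile_with_gcc(cpp_source);       // rare slow path
        s3_put(key, obj);                                    // share with the fleet
        return local_[key] = obj;
    }
};
```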
Yeah, and to be honest, on that debuggability point: with C++ code generation, the developers first need to learn the process, but after that it is quite easy to develop and maintain, because the code they generate is C++ code. You can read it and reason about what the query did, as we did on the previous slide: you were able to read the query. All right.

For the AVX2 stuff, how much do you rely on GCC to auto-vectorize, and how much of it is hand-written? We write it all ourselves. Okay. Yes, we have a world expert sitting there, a Greek guy, something like that.

All right. So, as a service, we are very much focused on improving performance. The good thing when you are running a service in the cloud is that you can put in a lot of telemetry, and you can understand what your customers are doing, where they are spending their time, and what you need to work on to improve their experience. And that's what we do, in a repeated fashion. For example, in this graph I'm showing what we did back in 2018, where we looked at the fleet and said, okay, let's keep on improving performance, playing a whack-a-mole game of improving performance in various places. And we were able to improve the performance of Redshift on the TPC-DS benchmark by 3.5 times within a year. There was not a single silver bullet that did that; it was very much "focus on where the time goes, fix that, go to the next one." And we keep on doing interesting stuff there. For example, strings are one of the most important data types in a warehouse, and we are now becoming really, really smart about how we do efficient processing of strings. We just announced that we can get up to 63 times speedups on processing of strings with low cardinality. And we keep on innovating in this area.

Can you talk about what you are doing there? Are you scanning along and saying, this is a known value, I know how to do a quick lookup on it? It's not obvious. Yeah, I mean, to be honest, for the low-cardinality strings: first of all, there are very, very many string columns with low cardinality in the world; you would be surprised how many there are. And if you know that you have a domain of strings with very, very low cardinality and you take advantage of that, you can do a bunch of tricks. For example, you can do something like dictionary encoding, or you can even encode them as bytes, which is what we do: we use a byte dictionary to encode the data, and then we use AVX execution to process them very efficiently in compressed form.

But you're only doing this for Redshift managed storage, right? If someone gives you a Parquet file that doesn't have this, you can't do anything. So the answer is yes: the encodings I'm talking about here are on the managed storage side, not on the external one. So, very much focused on performance.
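Here is a sketch of the low-cardinality string trick under my own assumptions (Redshift's actual kernels are certainly more sophisticated): each string is encoded as a one-byte dictionary code, and an equality predicate is then evaluated 32 rows at a time with AVX2, never touching the string bytes themselves. Compile with -mavx2.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Suppose the column "country" is byte-dictionary encoded, so that the
// predicate country = 'DE' becomes code == target (e.g., 7 in the dictionary).
size_t count_matches(const std::vector<uint8_t>& codes, uint8_t target) {
    size_t hits = 0, i = 0;
    const __m256i needle = _mm256_set1_epi8((char)target);   // broadcast the code
    for (; i + 32 <= codes.size(); i += 32) {
        __m256i v  = _mm256_loadu_si256((const __m256i*)(codes.data() + i));
        __m256i eq = _mm256_cmpeq_epi8(v, needle);           // 32 byte compares at once
        hits += __builtin_popcount((unsigned)_mm256_movemask_epi8(eq));
    }
    for (; i < codes.size(); ++i) hits += (codes[i] == target);  // scalar tail
    return hits;
}
```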
And you can see this in the public domain. For example, I'm showing a graph here that we showed at re:Invent last year, where we went and said, okay, Redshift provides up to five times better price performance than other popular cloud data warehouses. And this is an out-of-the-box benchmark. Out of the box means we don't spend any time tuning any of the warehouses: we just define the schemas, load the data, and start running the benchmark. As a database practitioner, one universal rule, a secret maybe, that I can share with you is that the highest-performing data warehouse in the market is the one that is consistently second in all the competitors' benchmarks. You can go and check which one that currently is, and I would claim that it is Redshift. So we feel very good about the performance we are getting. A type of workload that is very, very interesting and becoming increasingly common in warehouses right now is near-real-time, operational analytical workloads. What we see is customers starting to run workloads that have higher concurrency, more concurrent requests, where the response times of the individual requests are smaller. We're spending a lot of our energy trying to improve this particular type of workload. And because Redshift generates this C++ code that is very, very tight and very frugal in its consumption of resources, Redshift does very well on this type of workload. In this graph, we are comparing Redshift against other popular cloud data warehouses on a workload with an increasing number of small, short-running requests, and you can see that Redshift achieves up to seven times higher throughput than the competition. So, Redshift achieves pretty good performance. The next thing our customers told us is: okay, performance is good, but we would like to put more data into the system, and we would also like to have more concurrent users on the system; can you please do something about it? And this is where the bigger architectural changes in Redshift happened over the recent years. By the way, Andy, I have 30 minutes left, right? Yeah, you're fine on time, keep going. Okay. So, the biggest architectural change, as I told you, was the separation of compute and storage in Redshift. When we launched the service, the highest network bandwidth you could get out of EC2 was 10 gigabits on a full box, and that was a challenge if you were trying to offer high performance in a disaggregated compute and storage environment. But EC2 now commonly offers instances with hundreds of gigabits of network bandwidth, and this evolution in network speeds is what enabled the disaggregation of storage and compute. And that's what we did. We separated storage and compute, and you can see that EC2 has announced instances that have 200 gigabits in a box, 400 gigabits, 800, and I think at the last re:Invent we even announced instances that have 1.6 terabits of network within a single EC2 box. So we separated storage and compute, and now we are able to operate on more data within a single database. In this graph, one of my favorite ones, what we did is run the TPC-DS benchmark and scale the dataset size from 10 terabytes all the way up to one petabyte.
So we increased the dataset size of the database by two orders of magnitude, from 10 terabytes to one petabyte, and we proportionally increased the hardware we were using, the size of the computing environment running the benchmark. What you can see is that the time it took to run this benchmark stayed pretty much the same, around two to two and a half hours. Even though this is the expected result — you say, okay, duh, you used proportionally more hardware as the dataset grew, this thing should scale — it is a very hard engineering challenge, and we were able to actually achieve it at the petabyte scale. This is a very nice property: customers can use this kind of linearity between scale and performance to forecast their expenses and do their planning. They can say, my data growth is X, it will cost me that much, and do proper planning and capacity management. So by doing that, we were able to solve the data scalability challenges in Redshift. The next area we looked at was scaling on the compute side. Scaling compute was a challenge in Redshift, and we addressed it with two different techniques. The first one: we implemented the ability to change the size of any individual Redshift compute environment by up to 4x, either 4x up or 4x down. The way we do it — as you asked me earlier whether it is consistent hashing — is we bring some nodes in and, at the metadata level, we reassign some of the data buckets in managed storage from the old nodes to the new nodes, and then we resume operations. By doing so, we are able to resize the system up to 4x up or 4x down. Customers use that to tune an environment to the SLA of an individual workload: they can fine-grain tune the size of the environment to meet their needs, their SLAs. But once they have fine-tuned the environment, there will still be the Monday mornings when all the employees of a company come in and start running workloads, and at some point the system will not be able to process more concurrent requests. There is a finite set of resources being used there, so at some point it will say: I cannot admit any more requests, can you please wait? What we did to solve this problem of bursts of concurrent requests was build an auto-scaling capability, which we call concurrency scaling. The requests still come into the main Redshift computing environment, but if queuing is happening and the customer has elected to auto-scale, we bring in equally sized computing environments and start overflowing queries to them, with those environments accessing the data from managed storage, to keep improving the throughput of the system. The customer does not need to make a single change in their application, and they get auto-scaling capabilities. There was a question. But the scaling is done within a timeframe of "I need this for five minutes to run the queries and then it goes away"? Like, we discussed Snowflake has their flexible compute, and presumably that's done on a temporary basis.
Like, when you grab some nodes and resume stuff — it sounds like what you're doing here is bringing up a whole other compute cluster, so it's not something you want to do on a temporary basis; you're doing it for some period of time. Yeah, and it is very economical, because — to take a simple example — say you can run five concurrent queries in one environment and you have ten concurrent queries: we bring up the other concurrency scaling environment and you start running another five there. So you can amortize and have a more efficient process. You are getting this auto-scaling capability, but at the same time the cost remains low, right? It does not scale with the number of queries; it scales with the number of environments running, and you can multiplex multiple queries within each environment. Yeah, thank you. And the cool thing is, customers do not need to make a single change in their application. In this graph, what we're plotting is an experiment we did in the lab in 2019. At the beginning of the year we did not have the ability to auto-scale; at the end of the year we were done implementing this capability. You can see that at the beginning of the year, the maximum throughput one could get was from five concurrent users submitting queries from this TPC-DS benchmark, achieving approximately 200 queries per hour on this workload. By the end of the year, for the same experiment, we were able to achieve over 12,000 queries per hour from 210 concurrent users submitting queries with zero think time. This emulates a realistic environment where you have thousands of concurrent users using this Redshift environment. So we were able to achieve a 60x improvement in concurrency without a single change in the application; the only thing the customer had to do was check a box to enable concurrency scaling in their environment. This is an impressive improvement in the throughput of a system, which was unheard of in the traditional on-prem days: to have this type of elasticity, and to contract back when the burst of activity goes down. By building both the ability to online and elastically resize any computing environment and the ability to auto-scale to improve throughput, we were able to address the needs of our customers for elasticity on the compute side within a single Redshift environment. The next thing our customers said after that was: we like this elasticity, but now we would like to have separate computing environments to run our analytics — say, because we want to charge the individual environments to different business groups, or because we want isolation, because we have data scientists who want to run on their own environment, and whatnot. And that's what we did. We built the ability to share live data in a transactionally consistent fashion across Redshift environments. So you can have a producer environment, a producer cluster, that writes the data — we enforce snapshot isolation in Redshift, and the data gets committed into Redshift managed storage — and then you can have other consumer clusters or serverless environments that read this data in a read-only fashion and in a transactionally consistent way. Those environments can fail independently, you get isolation, you can do chargeback, and whatnot.
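To make the concurrency scaling mechanics above concrete, here is an illustrative sketch with my own naming, not Redshift's internals: the main environment admits up to its slot limit; once queuing starts and the customer has opted in, overflow queries are routed to attached burst environments that read the same managed storage.

```cpp
#include <deque>
#include <vector>

struct Query { int id; };

struct Environment {
    int slots, running = 0;
    bool admit() { if (running < slots) { ++running; return true; } return false; }
};

class ConcurrencyScaler {
    Environment main_{5};              // say, 5 concurrent slots in the main environment
    std::vector<Environment> burst_;   // equally sized environments, spun up on demand
    std::deque<Query> queue_;
    bool opted_in_;
public:
    explicit ConcurrencyScaler(bool opt_in) : opted_in_(opt_in) {}

    void submit(Query q) {
        if (main_.admit()) return;                        // fast path: no queuing
        if (opted_in_) {
            for (auto& e : burst_) if (e.admit()) return; // reuse existing burst capacity
            burst_.push_back(Environment{5});             // attach a new environment;
            burst_.back().admit();                        // it reads the same RMS data
            return;
        }
        queue_.push_back(q);                              // otherwise: wait in line
    }
};
```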
And these producer and consumer environments can also auto-scale or online-resize independently. You can do this within one account, or even across accounts, or even across regions. We do have customers that have business in, say, the US and Europe, with separate environments in different regions, and they are able to run analytics across regions without having to move data and build weird pipelines to get the data into one place. So we built a transactionally consistent, global data sharing layer in Redshift, and customers really like this capability. By doing all these things, we were able to address the needs of our customers in terms of storage elasticity, compute elasticity, compute isolation, and the ability to share data across accounts, across regions, and whatnot. So we did all these things and our customers were quite happy. They were saying: okay, performance is good, elasticity is good, but Redshift is still a little difficult to use. I still have to have a dedicated database administrator to tune my environments; can you please do something about that? And that's what we did. Since in the cloud our customers are essentially renting compute time from us, what we do is take advantage of any idle cycles in the systems, and we run machine-learning-based optimizations to take all these mundane operations that are usually the responsibility of database administrators and make them the service's problem. So we monitor the workload, we see what the customers are running, and then we have a list of tasks that we would like to execute on behalf of the customer in the background. We rank these operations in terms of priority and impact: the biggest bang for the buck is prioritized first. So we may decide to analyze some tables, to update the statistics of those tables and get better query execution plans; or we may decide to vacuum a table; or we may even decide to change the physical schema of a database, if we decide that is the right thing to do. In order to make these kinds of decisions, you need very strong, high-confidence signals, and we keep improving the algorithms we use to make such recommendations. There is, for example, a VLDB paper from 2020 where we describe the algorithm we use to make distribution key recommendations for the tables in Redshift. When we started these capabilities, we were just surfacing the recommendations in the console. We were saying to the customer: hey, you may want to change the distribution key of your tables, because we believe this will improve the performance of your system. As we gained confidence, by getting feedback that those recommendations were good, we took it to the next level and started applying them on the fly, on behalf of the customer, without requiring the customer to take an action. In particular, we introduced a new type of table, which we call an auto table. When the customer defines a table as auto, it gives Redshift the responsibility to go and perform all these optimizations.
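Here is a sketch, with invented names, of the background loop just described: candidate maintenance tasks are ranked by estimated impact and executed during idle cycles, biggest bang for the buck first.

```cpp
#include <queue>
#include <string>
#include <vector>

struct Task {
    std::string action;       // e.g. "ANALYZE lineitem", "ALTER TABLE orders ALTER DISTKEY"
    double estimated_impact;  // predicted workload improvement, from the ML models
    bool operator<(const Task& o) const { return estimated_impact < o.estimated_impact; }
};

bool cluster_is_idle() { return true; }        // stub: real signal comes from telemetry
void execute_in_background(const Task&) {}     // stub: runs the maintenance action

void autonomics_loop(std::vector<Task> candidates) {
    // Max-heap: the highest-impact task is always on top.
    std::priority_queue<Task> ranked(candidates.begin(), candidates.end());
    while (!ranked.empty() && cluster_is_idle()) {  // only spend idle cycles
        execute_in_background(ranked.top());        // highest impact first
        ranked.pop();
    }
}
```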
And in this graph, one of my favorite graphs, what I'm showing is an experiment where we loaded the tables of TPC-H, one of the benchmarks, with 30 terabytes of data and started running the workload without doing anything else. What you can see is that within some hours, the performance of the system improved by almost 2x: the workload went from running in 100 minutes down to running in 60 minutes, without the customer having to do anything. The way this happened is that the system changed distribution keys, sort keys, encodings, and whatnot in order to improve the performance of the system. And now we are taking it to the next level: we not only go and change the physical properties of some of the tables in the database, but we may even decide to create materialized views if we believe that is the right thing to do. Think about it: if there is a specific report that customers run every morning at 9 a.m., we may decide to create a materialized view for this report and keep refreshing it, so that the system does not have to execute this query on the fly. If the customers stop using this particular materialized view, we may decide to drop it. So this very active monitoring and continuous enhancement of performance is, again, kind of a theme in the cloud: auto-tuning capabilities are now the expectation in the cloud database world.

So what isolation level do you provide for the customers? If you insert something into the table, do they see the update immediately from the materialized view, or only after a refresh? So, what we do is we have auto-refresh. With the materialized view technology in Redshift, we have a very wide set of operations for which we do incremental maintenance of the views, and for the ones where we do incremental maintenance, we also have the ability to auto-refresh them. The combination of the two allows us to keep the materialized views fresh.

So the question is: if I have a transaction and I do an update on something, do you refresh the materialized view on commit, or inside the transaction? No, no, we don't refresh it on the commit; we refresh it afterwards. It's like a cron job that triggers the refresh? Yes, it's best effort. But then you might see some stale data if it's taken from the materialized view. Right: a customer might see old data, stale data, if they make a change that they expect to get applied to the materialized view, commit, and then immediately query something that hits the materialized view; they may not get back the result they expect. So what we do is try to immediately refresh the materialized views, and if customers always want to see transactionally consistent data, they can simply run a refresh of the view prior to running the query. The refresh becomes a no-op if there is nothing to refresh. Again, it's not transactional. We're done here, sorry. Yeah, keep going. Okay. So another nice area for using machine learning in the database has been workload management.
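Before moving on, here is a sketch (all names invented) of the consistency pattern just discussed: the view is refreshed incrementally off the commit path, and an explicit refresh before a query becomes a no-op when there is nothing new to apply.

```cpp
#include <cstdint>

struct MaterializedView {
    int64_t applied_upto = 0;                        // last base-table txn applied
    void apply_delta(int64_t, int64_t) { /* stub: incremental maintenance */ }
};

int64_t latest_committed_txn() { return 0; }         // stub: catalog lookup

// Called periodically by the auto-refresher, or explicitly by a customer who
// wants read-your-writes semantics before querying the view.
void refresh(MaterializedView& mv) {
    const int64_t head = latest_committed_txn();
    if (mv.applied_upto >= head) return;             // no-op: already fresh
    mv.apply_delta(mv.applied_upto, head);           // apply only the new changes
    mv.applied_upto = head;
}
```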
Like, if you have taken a class on operations research: when you have big jobs and small jobs in a queuing system, the best way to improve the throughput of the system is to take the small jobs, admit them as fast as you can, get them in and out of the system quickly, and then do some kind of queuing for the large jobs. This is something Redshift has been doing for several years: we have a very simple classifier there that tells us whether a job is big or small, and if it is small, we get it in and out of the system as fast as we can. We took this technology further, and now we have been building the ability to predict not only the runtime of a job, but also its resource consumption. So we can predict the CPU, I/O, and memory needs of a particular query, and by using this information we are able to bin-pack more queries into the system and get better, more efficient utilization of the hardware the customers are running on. So we improve the workload admission control algorithm by having this knowledge of the resource consumption of the queries.

Can you say anything about what that prediction model looks like? Does it look like Ryan Marcus's QPPNet, or is it something else? It's something else, but Ryan Marcus is working with us a lot.

All right. So we did all these things, the autonomics. And the last thing the customers said was: okay, we like that, but there are still some decisions we need to make in order to size our Redshift environments. We need to decide how many nodes we are going to use, what kind of instance types, and whatnot; can you please make our lives easier? And so the culmination of our autonomics efforts was the introduction of Redshift Serverless. Redshift Serverless is a new experience we introduced this past summer, which takes all this machine-learning-based workload monitoring and management, the auto-scaling, the auto-tuning, the auto-maintenance, and wraps it in an intelligent compute management layer. We make the decisions about sizing, auto-scaling, and whatnot; we don't give the customers options to decide about instance types and such, and we can change the hardware underneath as EC2 hardware improves. So we are offering the whole SQL capability of Redshift in a serverless experience, and we have gotten very, very positive feedback so far, with thousands of users using it. And you still get all the capabilities that Redshift has: data sharing, managed storage, plus integration with streams, operational databases, SageMaker, Lambda, everything. So with all these things, I have tried to give you an overview of how we changed the architecture, the guts of the system, and we are now able to offer a system that has separation of compute and storage, can auto-scale, has autonomics, and is offered in a serverless experience. That is kind of the core of the system. The last area where we are putting a lot of our energy is integrations. Once you have this very strong core in the warehouse, what we try to do is integrate with the rest of the ecosystem. Analytics is not a single thing: you don't buy a warehouse and you have solved analytics. The data lifecycle is complex.
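Going back to the prediction-driven admission control described a moment ago, here is a sketch with invented types: a model predicts each query's runtime and resource needs; short queries are let in and out fast, and the rest are bin-packed against the node's remaining capacity.

```cpp
#include <cstdio>

struct Prediction { double runtime_s, cpu_fraction, memory_gb; };  // model output

class AdmissionController {
    double free_cpu_ = 1.0, free_mem_gb_ = 128.0;    // assumed node capacity
public:
    bool try_admit(const Prediction& p) {
        if (p.runtime_s < 1.0) return true;          // small jobs: in and out fast
        if (p.cpu_fraction <= free_cpu_ && p.memory_gb <= free_mem_gb_) {
            free_cpu_ -= p.cpu_fraction;             // bin-pack against capacity
            free_mem_gb_ -= p.memory_gb;
            return true;
        }
        return false;                                // otherwise queue the query
    }
    void release(const Prediction& p) {              // called when a query finishes
        free_cpu_ += p.cpu_fraction; free_mem_gb_ += p.memory_gb;
    }
};

int main() {
    AdmissionController ac;
    Prediction big{120.0, 0.6, 96.0}, small{0.2, 0.05, 1.0};
    std::printf("%d %d %d\n", ac.try_admit(big), ac.try_admit(small),
                ac.try_admit(big));                  // 1 1 0: the second big query queues
}
```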
So, on integrations: customers have data in their operational stores, or they put data on a bus, in a stream, or they just have files, and they want to use machine learning on them. So there is more to it than a basic cloud data warehouse. What have we done? For example, we have long had the ability to query data in open file formats in S3. We also have the ability to query data sitting in operational stores: Aurora and RDS MySQL and PostgreSQL databases. But now we're taking it to the next level. We have streaming ingestion from Kinesis and Kafka streams, and we also introduced the zero-ETL capability, where we integrated with Aurora's storage layer: we can move data from the Aurora storage layer to the Redshift managed storage layer, and you get this zero-ETL capability between the Aurora operational store and Redshift. That is on the side of making it easy to ingest, move, and process data in Redshift. Then, on the consumption side, we have not only been offering SQL capabilities, but we have also integrated with SageMaker, which is the suite of machine learning tools at AWS. You can create models from within your SQL, based on data managed by Redshift, and those models can then be consumed for inference as SQL functions within Redshift. So you can write SQL queries that apply the inference or scoring function to columns of a table. So there is a lot of integration happening on that side as well. And at the core of it all is the data warehousing technology based on Redshift managed storage, separation of compute and storage, data sharing, and serverless. We are quite excited about where we are, and we're getting very positive feedback on how the service has changed and how it can address the needs of customers as they move to more modern data architectures. And that was it. Over the past 45 minutes, I gave you an overview of how we have evolved Amazon Redshift over the past 10 years. We have tens of thousands of customers processing exabytes of data daily. We have a strong security posture. We are very focused on performance and scalability. We use a lot of machine learning to make Redshift easy to use, to build autonomics, and to build a serverless experience out of it. We have very tight integration with the broad set of AWS services, and we enable modern data architectures on top of that. We are excited for the future, and as we like to say at AWS, it is still day one. We believe we are still scratching the surface of what customers can do in the cloud for analytics and data management in general. And that concludes my talk. Thank you. All right, any questions? Yes, go ahead.

Does Redshift support UDFs, and does Redshift do UDF compilation? Redshift supports UDFs, yes. We have two kinds of UDFs: SQL UDFs and Python UDFs. On the SQL side, yes, of course we compile them. For Python, we run the Python code in a container. Do you support PL/pgSQL? Yeah. Oh, okay. All right, but you don't compile that? We do compile PL/pgSQL. Oh, okay. Yes, yes, absolutely. All right, does that answer your question? Anybody else?
Yeah, and by the way, something very interesting related to that: when you train a model in SageMaker — for many of the models in SageMaker, which you could have trained on Redshift data from SQL — what we do for the inference function is use SageMaker Neo, which is a compiler. For these first-party models, we generate C++ code for the model and link it as part of the executable. So we run filtering, predicates, and inference within one C++ executable, which makes it very, very efficient. Yes.

I have two questions. My first one is: is there anything you've seen in another kind of database or data warehouse that we've covered in this class — we've covered Dremel, BigQuery, Photon, Snowflake; last class was Velox; we didn't cover anything from Microsoft — is there any feature or anything about those particular systems where you thought, that's a cool idea, we'd like to do that in Redshift one day? If you need the recording off, I can do that too. Yeah. I will tell you that, as I said in my last sentence, we are very excited with what you can do in the cloud. When you and I were learning about databases using the cow book, there were specific rules: you couldn't auto-scale, there was a finite set of resources, you couldn't think about storage and compute separation, and all that stuff. So we are very excited; I think we are still scratching the surface of what customers can do in the cloud, with all these interconnected services and things that can scale on demand. So there is nothing in particular where I can say, oh, this excites me; but I'm very excited by the innovation that is happening in the cloud in general and the AWS cloud in particular.

I see. I have a third question. Do you guys use kernel bypass or anything like DPDK or SPDK? I mean, obviously we are very close to the hardware, right? We do all sorts of stuff to make efficient usage of the underlying hardware resources, and we work very closely with the EC2 teams on that. Yeah, okay.

All right, and this one we can cut from the video. A lot of students are worried about the big layoffs in the tech companies; Amazon is not new to this. And I keep telling them that database companies, or database divisions in these cloud companies, are still making a ton of money. Is Redshift still hiring? That's my favorite question. Yes.
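As a closing annotation, here is a sketch of the compiled-inference idea from the SageMaker Neo answer above, under my own assumptions: the trained model is compiled down to a plain C++ function and linked into the same executable as the query fragment, so filtering and inference run in one tight loop with no round trip to a model server. The model body below is a stand-in, not generated output.

```cpp
#include <cstddef>
#include <vector>

// Imagine this body was generated from the trained model by the compiler.
inline double predict_churn(double tenure, double spend) {
    return (0.8 * spend - 0.3 * tenure > 10.0) ? 1.0 : 0.0;  // stand-in model
}

// Roughly: SELECT count(*) FROM customers
//          WHERE spend < 50 AND predict_churn(tenure, spend) = 1;
size_t scan_filter_infer(const std::vector<double>& tenure,
                         const std::vector<double>& spend) {
    size_t hits = 0;
    for (size_t i = 0; i < tenure.size(); ++i)
        if (spend[i] < 50.0 &&                          // ordinary SQL predicate
            predict_churn(tenure[i], spend[i]) == 1.0)  // inlined, compiled inference
            ++hits;
    return hits;
}
```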