Announcer: From around the globe, it's theCUBE, with digital coverage of AWS re:Invent 2020, sponsored by Intel and AWS.

Dave: Welcome back to theCUBE's ongoing coverage of AWS re:Invent virtual. theCUBE has gone virtual, along with most events these days, or all events, and continues to bring our digital coverage of re:Invent. With me is Rahul Pathak, who is the vice president of analytics at AWS. Rahul, it's great to see you again. Welcome, and thanks for joining the program.

Rahul: Dave, great to see you too, and always a pleasure. Thanks for having me on.

Dave: You're very welcome. Before we get into your leadership discussion, I want to talk about some of the things that AWS has announced in the early parts of re:Invent. I want to start with Glue Elastic Views, a very notable announcement, allowing people to essentially share data across different data stores. Maybe you can tell us a little bit more about Glue Elastic Views, where the name came from, and what the implications are.

Rahul: Sure. We're really excited about Glue Elastic Views. As you mentioned, the idea is to make it easy for customers to combine and use data from a variety of different sources and pull it together into one or many targets. The reason for it is that we're really seeing customers adopt what we're calling a lake house architecture, which is, at its core, a data lake for making sense of data and integrating it across different silos, typically integrated with a data warehouse, and not just that, but also a range of other purpose-built stores like Aurora for relational workloads or DynamoDB for non-relational ones. And while customers typically get a lot of benefit from purpose-built stores, because you get the best possible functionality, performance, and scale for a given use case, you often want to combine data across them to get a holistic view of what's happening in your business and with your customers. Before Glue Elastic Views, customers would have to either use ETL or data integration software, or they'd have to write custom code that could be complex to manage, error prone, and tough to change. With Elastic Views, you can now use SQL to define a view across multiple data sources, pick one or many targets, and then the system will monitor the sources for changes and propagate them into the targets in near real time. It manages the end-to-end pipeline and can notify operators if anything changes. So the components of the name are pretty straightforward: Glue is our serverless ETL and data integration service, and Glue Elastic Views is about data integration. They're views because you define these virtual tables using SQL, and elastic because the service is serverless and scales up and down to deal with the propagation of changes. So we're really excited about it, and customers are as well.

Dave: Okay, great. So my understanding is I'm going to be able to take what's called, in the parlance, materialized views, which in my layperson's terms means I'm going to run a query in a database and take that subset, then copy it and move it to another data store, and then you're going to automatically keep track of the changes and keep everything up to date. Is that right?

Rahul: Yes, that's exactly right. So you can imagine, say you had a product catalog, for example, that's being updated in DynamoDB, and you can create a view that will move that to Amazon Elasticsearch Service.
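To make the catalog example concrete: Glue Elastic Views was announced in preview, so its final syntax and client interface may differ, but the SQL-defined view Rahul describes might look something like the sketch below. The table names and source mapping are hypothetical.

```python
# Conceptual sketch only: Glue Elastic Views was in preview at announcement,
# so the view syntax and source mapping shown here are hypothetical.

# The view itself is plain SQL, a virtual table over a DynamoDB source.
VIEW_SQL = """
CREATE VIEW catalog_search AS
SELECT product_id, name, price, category   -- the subset you care about
FROM dynamodb.product_catalog              -- hypothetical source mapping
"""

# Once the view is defined, the service (not the user) is responsible for:
#   1. materializing the view into the chosen target (e.g. Elasticsearch),
#   2. monitoring the DynamoDB source for inserts, updates, and deletes,
#   3. propagating changes to the target in near real time,
#   4. scaling resources elastically and alerting if the schema changes.
```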
Rahul: You could search through a current version of your catalog, and we will monitor your DynamoDB tables for any changes and make sure those are all propagated in near real time. All of that is taken care of for our customers as soon as they define the view, and that data will be kept in sync as long as the view is in effect.

Dave: And I see this as being really valuable for a person who's building, I like to think in terms of, data services or data products that are going to help me monetize my business. Maybe it's as simple as a dashboard, but maybe it's actually a product, maybe some content that I want to develop. And I've got transaction systems, I've got unstructured data, maybe in a NoSQL database, and I want to combine those to build new products, and I want to do that quickly. So take me through what I would have to do. You sort of alluded to it, with a lot of ETL, but take me through in a little bit more detail how I would have done that before this innovation, and maybe you can give us a sense as to what the possibilities are with Glue Elastic Views.

Rahul: Sure. Before we announced Elastic Views, a customer would typically have to think about using ETL software. They'd have to write an ETL pipeline that would extract data periodically from a range of sources. They'd then have to write transformation code that would do things like match up the types and make sure you didn't have any invalid values, and then combine it and periodically write it into a target. And once you've got that pipeline set up, you've got to monitor it. If you see an unusual spike in data volume, you might have to add more resources to the pipeline to make it complete on time. And if anything changed in either the source or the destination that prevented that data from flowing the way you expected, you'd have to figure that out manually and have data quality checks in place to make sure everything kept working. But with Elastic Views, it gets much simpler. Instead of writing custom transformation code, you write a view using SQL, and SQL is widely popular with data analysts and folks that work with data, as you know. So you define that view in SQL, the view can look across multiple sources, and then you pick your destination. Glue Elastic Views then monitors both the source for changes, as well as the source and the destination for any issues: for example, did the schema change, did the shape of the data change, is something briefly unavailable? It can monitor all of that and handle any errors it can recover from automatically. Or if it can't, say someone dropped an important table in the source that was part of your view, you get alerted and notified to take some action, to prevent bad data from getting through your system or to prevent your pipeline from breaking without your knowledge. And then the final piece is the elasticity of it. It will automatically deal with adding more resources. If, for example, you had a spiky day in the markets, maybe you're building a financial services application, and you needed to add more resources to process those changes into your targets more quickly, the system handles that for you. And then if you're monetizing data services on the back end, you've got a range of options for folks subscribing to those targets. So we've got capabilities like AWS Data Exchange, where people can exchange and monetize data sets.
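For contrast, here is a minimal sketch of the kind of hand-rolled change propagation Rahul says customers had to build before Elastic Views: an AWS Lambda handler that reacts to DynamoDB Streams events and indexes them into Elasticsearch. The endpoint, index, and key names are hypothetical, request signing is omitted, and the retry, schema-check, and alerting logic he mentions is exactly what you would still have to add yourself.

```python
import json
import urllib.request

ES_ENDPOINT = "https://search-mydomain.example.com"  # hypothetical Elasticsearch endpoint
INDEX = "product-catalog"

def handler(event, context):
    """Hand-rolled propagation: DynamoDB Streams -> Elasticsearch.

    This is the custom pipeline code Glue Elastic Views is meant to
    replace. Note what is missing: retries, backoff, schema checks,
    alerting, and request signing all still fall on you."""
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"]["NewImage"]
        # Naive unwrap of DynamoDB's typed attributes ({"S": "x"} -> "x").
        doc = {k: next(iter(v.values())) for k, v in image.items()}
        key = record["dynamodb"]["Keys"]["product_id"]["S"]  # hypothetical key name
        req = urllib.request.Request(
            f"{ES_ENDPOINT}/{INDEX}/_doc/{key}",
            data=json.dumps(doc).encode(),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        urllib.request.urlopen(req)  # no retry or error handling here
```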
Rahul: So it allows this end-to-end flow in a much more straightforward way than was possible before.

Dave: Awesome, so a lot of automation, especially if something goes wrong. So if something goes wrong, you're going to automatically recover, and if for whatever reason you can't, what happens? Do you quiesce the system and let the operator know, hey, there's an issue, you've got to go fix it? How does that work?

Rahul: Yes, exactly right. If we can recover, say, for example, for a short period of time you can't reach the target database, the system will keep trying until it can get through. But say someone dropped a column from your source that was a key part of your ultimate view and destination; you just can't proceed at that point. So the pipeline stops, and then we notify using APIs or an SNS alert so that programmatic action can be taken. This effectively provides a really great way to enforce the integrity of data that's moving between the sources and the targets.

Dave: All right, make it kindergarten-proof, love it. So let's talk about another innovation you guys announced, QuickSight Q, kind of speaking to the machine in my natural language. Give us some more detail there. What is QuickSight Q, how do I interact with it, and what kinds of questions can I ask it?

Rahul: QuickSight Q is essentially a deep-learning-based semantic model of your data that allows you to ask natural language questions in your dashboard. So you get a search bar in your QuickSight dashboard, and QuickSight is our serverless BI service that makes it really easy to provide rich dashboards to whoever needs them in the organization. What Q does is automatically develop relationships between the entities in your data, and it's able to reason about the questions you ask. So unlike earlier natural language systems, where you had to pre-define your models and all the calculations you might ask the system to do on your behalf, Q can actually figure it out. You can say, "Show me the top five categories for sales in California," and it will look in your data and figure out what that is. It presents you with how it parsed the question, and within seconds it pops up a dashboard of what you asked and automatically tries to pick a chart or visualization that makes sense for that data. You can then refine it further and say, "How does this compare to what happened in New York?" And it will figure out that you're trying to overlay those two data sets, and it will add them. Unlike other systems, it doesn't need to have all of those things pre-defined. It's able to reason about your question because it's building a model of what your data means on the fly, and we pre-trained it across a variety of different domains, so you can ask a question about sales or HR or any of that. Another great part of Q is that when it presents what it's parsed, you're able to correct it if it needs it and provide feedback to the system. For example, if it got something slightly off, you can select from a dropdown, and it will remember your selection for the next time, so it gets better as you use it.

Dave: I saw a demo in Swami's keynote on December 8th. You were able to ask QuickSight Q the same question but in different ways: compare California to New York and the data comes up, or give me the top five, and then California and New York, the same exact data.
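Stepping back to the failure path for a moment: when a pipeline can't recover, Rahul says it stops and notifies via APIs or an SNS alert. The notification side of that is straightforward with the standard boto3 SNS client; a minimal sketch, with a hypothetical topic ARN and view name:

```python
import boto3

sns = boto3.client("sns")

def alert_pipeline_stopped(view_name: str, reason: str) -> None:
    """Notify operators (or automation subscribed to the topic) that a
    view's pipeline has halted, before bad data reaches the target."""
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:pipeline-alerts",  # hypothetical
        Subject=f"Elastic view '{view_name}' stopped",
        Message=f"Pipeline halted; manual action required: {reason}",
    )

# Example: alert_pipeline_stopped("catalog_search", "source column 'price' dropped")
```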
Dave: So is that how I can check and see if the answer I'm getting back is correct, by asking different questions? I don't have to know the schema, is what you're saying; I don't have to have knowledge of that as a user. I can triangulate from different angles and then look and see if that's correct. Is that how you verify, or are there other ways?

Rahul: That's one way to verify. You could definitely ask the same question a couple of different ways and ensure you're seeing the same results. Another option would be to click and drill and filter down into that data through the dashboard. And then the other step would be at data ingestion time; typically, data pipelines will have some quality controls. But when you're interacting with Q, I think the ability to ask the question multiple ways and make sure you're getting the same result is a perfectly reasonable way to validate.

Dave: You know, what I like about the answer you just gave, and I wonder if I could get your opinion on this, because you've been in this business for a while and you work with a lot of customers, is this: if you think about our operational systems, things like sales or ERP systems, we've contextualized them. In other words, the business lines have injected context into those systems. They kind of own the data, if you will. I don't want to put "own" in quotes, but they do feel like they're responsible for it, and there's not this constant argument, because it's their data. It seems to me that if you look back over the last ten years, a lot of the data architecture has been genericized. In other words, the experts, whether it's the data engineer or the quality engineer, don't really have the business context. But the example you just gave, with the drill-down to verify that the answer is correct, suggests to me, just listening again to Swami's keynote the other day, that you're really trying to put data in the hands of business users who have the context and the domain knowledge. And that seems to me to be a change in mindset that we're going to see evolve over the next decade. I wonder if you could give me your thoughts on that change in data architecture, or data mindset.

Rahul: Dave, I think you're absolutely right. We see this across all the customers we speak with: there's an increasing desire to get data broadly distributed into the hands of the organization in a well-governed and controlled way. Customers want to give data to the folks who know what it means and know how they can act on it to do something for the business, whether that's finding a new opportunity or looking for efficiencies. And I think we're seeing that increasingly, especially given the unpredictability we've all gone through in 2020. Customers are realizing they need to get a lot more agile, and they need a lot more data about their business and their customers, because you've got to find ways to adapt quickly. And that's not going to change anytime in the future.

Dave: And I've said many times on theCUBE, our industry, the technology industry, used to be all about the products. In the last decade, it was really platforms, whether SaaS platforms or the AWS cloud platform. And it seems like innovation in the coming years, in many respects, is going to come from the ecosystem and the ability to share data. We've had some examples today. But you hit on one of the key challenges, of course: security and governance.
And can you automate that, if you will, and protect users from doing things they shouldn't, whether it's data access or corporate edicts for governance and compliance? How are you handling that challenge?

Rahul: Yeah, it's a great question, and it's something that I really emphasized in my leadership session. The notion of what customers are doing, and what we're seeing, is the lake house architecture concept: you've got a data lake and purpose-built stores, and customers are looking for easy data movement across those. So we have things like Glue Elastic Views and some of the other Glue features we announced. But they're also looking for unified governance, and that's why we built AWS Lake Formation. The idea is that it can quickly discover and catalog customers' data assets and then allow customers to define granular access policies centrally around that data. You can tag columns as private so nobody can see them. And then we announced a couple of new capabilities where you can provide row-based controls, so one set of users can see certain rows in the data while a different set of users can only see a different set. By creating this fine-grained but unified governance model, customers are actually set free to give broader access to the data, because they've put the guardrails and protections in place and they know their policies and compliance requirements are being met. And it gets governance out of the way of the analyst, or whoever can actually use the data to drive some value for the business.

Dave: Right, they can really focus on driving value. And I always talk about monetization; however, monetization could just be a generic term. It could be saving lives, the mission of the business or the organization. I meant to ask you about Q: can customers embed QuickSight Q into their own apps?

Rahul: Yes, absolutely. One of QuickSight's key strengths is its embeddability, and it's also serverless, so you can embed it at really massive scale. We see customers, for example, like Blackboard, embedding QuickSight dashboards into information they're providing to thousands of educators, to provide data on the effectiveness of online learning, and you could embed Q into that capability. So it's a real way to give a broad set of people the ability to ask questions of data without requiring them to be fluent in things like SQL.

Dave: If I can ask you a question about data movement: I think at last year's re:Invent you guys announced RA3, and I think you've made it generally available this year. I remember Andy speaking about it and talking about the importance of having big enough pipes when you're moving data around, and of course you're doing tiering. You also announced AQUA, the Advanced Query Accelerator, which brings the compute to the data, I guess is how I would think about it, reducing that movement. And then we're talking about Glue Elastic Views, where you're copying and moving data. How are you maintaining maximum performance for your customers? I know it's an architectural question, but as an analytics professional, you have to be comfortable that the infrastructure is there. So what's AWS's general philosophy in that regard?

Rahul: So there are a few ways that we think about this.
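Before getting to performance, the column-level guardrails Rahul describes map onto Lake Formation's existing permissions API. A sketch using boto3's grant_permissions, with hypothetical database, table, column, and role names; the row-based controls he mentions were announced in preview, so they are not shown here.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant an analyst role SELECT on a table while keeping a sensitive
# column hidden. Database, table, column, and role names are hypothetical.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales",
            "Name": "orders",
            # The excluded column stays invisible to this principal.
            "ColumnWildcard": {"ExcludedColumnNames": ["customer_ssn"]},
        }
    },
    Permissions=["SELECT"],
)
```

With the policy defined centrally, any integrated engine (Athena, Redshift Spectrum, EMR) enforces it, which is what lets teams grant broad access without per-tool controls.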
And you're absolutely right. As data volumes go up, we're seeing customers going from terabytes to petabytes, and even heading into the exabyte range, so there's a real need to deliver performance at scale. The reality of customer architectures is that customers will use purpose-built systems for different best-in-class use cases, and if you're trying to do a one-size-fits-all thing, you're inevitably going to end up compromising somewhere. So the reality is that customers will have more data, they're going to want to get it to more people, and they're going to want their analytics to be fast and cost-effective. And so we look at strategies to enable all of this. For example, Glue Elastic Views is about moving data, but it's about moving data efficiently. What we do is allow customers to define a view that represents the subset of their data they care about, and then we only move changes, as efficiently as possible. So you're reducing the amount of data that needs to be moved and making sure it's focused on the essentials. Similarly with AQUA, what we've done, as you mentioned, is take the compute down to the storage layer. We're using our Nitro chips to help with things like compression and encryption, and then we have FPGAs in line to allow filtering and aggregation operations. So again, you're trying to get through as much data as you can, quickly and effectively, so that you're only sending back what's relevant to the query being processed. And that again leads to more performance. If you can avoid reading a byte, you're going to speed up your queries, and that's what AQUA is trying to do: push those operations down so that you're reducing data as close to its origin as possible and focusing on what's essential. And that's what we're applying across our analytics portfolio. I'd say one other piece we're focused on with performance is innovating across the stack. You mentioned network performance; we've got 100 gigabits per second of throughput now with the next-gen instances. And then with things like Graviton2, you're able to drive better price performance for customers for general-purpose workloads. So it's really innovating at all layers.

Dave: It's amazing to watch. I mean, it's an incredible engineering challenge as you build this hyper-distributed system that's now, of course, going to the edge. I want to come back to something you mentioned, and I do want to hit on your leadership session as well, but you mentioned the one-size-fits-all system, and I've asked Andy Jassy about this. I've had this discussion with many folks at AWS. Of course, you mentioned the challenge: you're going to have to make trade-offs if it's one-size-fits-all. The flip side of that is, okay, it's simple; it's the Swiss Army knife of databases, for example. But your philosophy at Amazon is that you want to have fine-grained access to the primitives, in case the market changes, so you can move quickly. So that puts more pressure on you to then simplify. You're not going to build some big hairball abstraction layer; that's not what you're going to do. And I think about layers and layers of paint: I live in a very old house, so that's not your approach. So it puts greater pressure on you to constantly listen to your customers, and they're always saying, hey, simplify, simplify, simplify.
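AQUA itself is transparent to queries, but the push-down idea Rahul describes, filtering at the storage layer so you avoid reading and shipping irrelevant bytes, is visible in an existing API like S3 Select. A sketch, with a hypothetical bucket, key, and column layout:

```python
import boto3

s3 = boto3.client("s3")

# Same principle as AQUA's push-down: the filter runs at the storage
# layer, so only the relevant rows travel back over the network.
# Bucket, key, and column positions are hypothetical.
resp = s3.select_object_content(
    Bucket="my-data-lake",
    Key="trades/2020-12-08.csv",
    ExpressionType="SQL",
    Expression="SELECT s._1, s._3 FROM s3object s WHERE s._3 > '1000000'",
    InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; only filtered bytes arrive.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```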
Dave: We certainly heard that again in Swami's presentation the other day, all about minimizing complexity. So that really is your trade-off: it puts pressure on Amazon engineering to continue to raise the bar on simplification. Is that a fair statement?

Rahul: Yeah, I think so. I mean, anytime we can do work so our customers don't have to, I think that's a win for both of us, because we're delivering more value and making it easier for our customers to get value from their data. We absolutely believe in using the right tool for the right job. And you talked about an old house; you're not going to build or renovate a house with a Swiss Army knife. It's just the wrong tool. It might work for small projects, but you're going to need something more specialized to handle things that matter at scale. And that's really what we see with this set of capabilities. We want to provide customers with the best of both worlds: purpose-built tools, so they don't have to compromise on performance, scale, or functionality, and then ways to use these together easily, whether it's data movement or things like federated query, so you can reach into each of them through a single query, under a unified governance model. So it's all about stitching those together.

Dave: Yeah, well, so far you've been on the right side of history, and I think it's served you and your customers well. I want to come back to your leadership session. What else can you tell us about what you covered there?

Rahul: So we've had a bunch of innovations across the analytics stack. Some of the highlights: in EMR, which is our managed Spark and Hadoop service, we've been able to achieve 1.7x better performance than open source with our Spark runtime, so we've invested heavily in performance. EMR is now also available for customers running in containerized environments, so we announced EMR on EKS, along with an integrated development environment for EMR called EMR Studio. That makes it easier both for people at the infrastructure layer to run EMR in their EKS environments and make it available within their organizations, and it simplifies life for data analysts and folks working with data, who can operate in the studio and not have to mess with the details of the clusters underneath. And then a bunch of innovation in Redshift. We talked about AQUA already, but we also announced data sharing for Redshift. This makes it easy for Redshift clusters to share data with other clusters without putting any load on the central producer cluster. And this also speaks to the theme of simplifying getting data from point A to point B. You could have central producer environments publishing data that represents a source of truth into other departments within the organization, or to partners, and they can query the data and use it. It's always up to date, but it doesn't put any load on the producer. So it enables these really powerful data sharing and downstream data monetization capabilities, like you mentioned. In addition, as Swami mentioned in his keynote, there's Redshift ML, so you can now essentially train and run models, built and optimized in SageMaker, from within your Redshift clusters. And then we've also automated all of the performance tuning that's possible in Redshift, so we've really invested heavily in price performance.
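Rahul's description of data sharing translates roughly into the sketch below: a producer cluster publishes a datashare, and a consumer cluster queries it live with no load on the producer. Redshift data sharing was in preview at announcement, so the exact SQL may differ; the cluster names, share name, and namespace GUIDs are hypothetical, and the boto3 redshift-data calls are fire-and-forget for brevity.

```python
import boto3

rsd = boto3.client("redshift-data")

def run(cluster: str, sql: str) -> None:
    # Fire-and-forget for brevity; a real script would poll describe_statement.
    rsd.execute_statement(ClusterIdentifier=cluster, Database="dev",
                          DbUser="admin", Sql=sql)

# Producer side: publish tables without taking on consumer query load.
for stmt in [
    "CREATE DATASHARE sales_share",
    "ALTER DATASHARE sales_share ADD SCHEMA public",
    "ALTER DATASHARE sales_share ADD TABLE public.orders",
    # Grant access to the consumer cluster's namespace (hypothetical GUID).
    "GRANT USAGE ON DATASHARE sales_share TO NAMESPACE "
    "'11111111-2222-3333-4444-555555555555'",
]:
    run("producer-cluster", stmt)

# Consumer side: mount the share as a database and query it; the data is
# always current and puts no load on the producer (hypothetical GUID).
run("consumer-cluster",
    "CREATE DATABASE sales_db FROM DATASHARE sales_share OF NAMESPACE "
    "'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee'")
```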
Rahul: And now we've automated all of the things that make Redshift the best-in-class data warehouse service from a price performance perspective, up to 3x better than others. Customers can just set Redshift to auto, and it will handle workload management, data compression, and data distribution, making it easier to access all of that performance. And then the other big one was in Lake Formation. We announced three new capabilities. One is transactions, enabling consistent ACID transactions on data lakes, so you can do things like inserts, updates, and deletes. We announced row-based filtering for fine-grained access control within that unified governance model. And then automated storage optimization for data lakes. Customers are dealing with unoptimized small files coming off streaming systems, for example, and Lake Formation can auto-compact those under the covers, and you can get a 7x to 8x performance boost. So it's been a busy year for analytics.

Dave: I'll say. Is that it? No, great job, Rahul. Thanks so much for coming back on theCUBE and sharing the innovations. Great to see you again, and good luck in the coming year. Be well.

Rahul: Okay, thank you very much. It's great to be here. Great to see you, and I hope we get to see each other in person again soon.

Dave: I hope so. All right, and thank you for watching, everybody. This is Dave Vellante for theCUBE. We'll be right back after this short break.