Hello everyone, welcome to theCUBE's presentation of the AWS Startup Showcase. The theme this episode is Data as Code, and this is season two, episode two of the ongoing series covering the exciting startups from the ecosystem in cloud and the future of data analytics. I'm your host, John Furrier. Got a great featured panel here with AWS heroes: Lynn Langit, CEO of Lynn Langit Consulting; Peter Hanssens, founder of Cloud Shuttle; and Alex DeBrie, principal of DeBrie Advisory. Great to see all of you here remotely, and I look forward to seeing you in person at the next re:Invent or other event. Thanks for having us. So Lynn, you're doing a lot of work in healthcare. Peter, you're in the middle of all the action on data as code. Alex, you're deep on databases. We've got a good roundup of topics here, ranging from healthcare to getting under the hood on databases. So Alex, we'll start with you. What are you working on right now? What trends do you see in the database space? Yeah, sure. So I do a lot of consulting work with different people, often around DynamoDB or just general serverless technology. If you want to talk about trends I'm seeing right now, the big one is more serverless-native or cloud-native databases, where you're seeing these cool databases come out that really take advantage of this new cloud environment. You have the scalability and elasticity of the cloud. You're not in instance-based environments anymore; you're paying for capacity, you're paying for throughput, you're able to scale up and down, and you're not managing individual instances. So there's a lot of cool stuff we're seeing with this new generation of infrastructure, and databases in particular taking advantage of this new cloud world. And you're really deep on the database side: in terms of cloud-native impact, the diversity of database types, when to use certain databases, is that also a big deal? Yeah, absolutely.
I totally agree. I love seeing the different types of databases. AWS has this whole purpose-built database strategy, and I think that makes a lot of sense. I wouldn't go too far with it; I would think more about purpose-built categories and things like that. Standardize on an OLTP database within your organization, whether that's DynamoDB or DocumentDB or a relational database or something like that. But then also choose some sort of analytics database, whether it's Druid or Redshift or Athena. And then if you have some specialized needs: you want to show some real-time stuff to your users, check out Rockset; you want to do some graph analytics, fraud detection, check out TigerGraph. A lot of cool stuff that we're seeing from the startup showcase here. Looking forward to unpacking that. Lynn, you've been involved in the healthcare action with cloud, and the pandemic pushed this hardcore on everybody. What are you working on? Yeah, it's all COVID data, all the time. Before the pandemic I was supporting research groups for cancer genomics, which I still do. But what's impactful is the explosive data volumes. There's big data, and then there's genomic data. I've worked with clients that have broken data centers, broken public cloud provider data centers, because of the daily volume they're putting in. So there's this volume aspect, and then there's the collaboration, particularly around COVID research, because of the pandemic. And so you have this explosive volume, you have this need for computational complexity, and that means cloud. The challenge is, it put the pedal to the metal. So you've got all these bioinformatics researchers that are used to a single machine who suddenly have to deal with distributed compute. So it's a wild time to be in this space. What was the big change that you've seen with the pandemic, in cloud genomics specifically? What's the big change that's happened? The amount of data that is being put into the public cloud.
Previously, people would have their data on their local capacity, and then they would publish their paper, and the data may or may not become available for reproducing the research, to accelerate drug discovery and even variant identification. Now the data sets are being pushed to public cloud repositories, which is a whole new set of concerns. You're not only dealing with the volume and cost, but security. The federated security is non-trivial and not well understood by this domain. So there's so much work available here. Awesome. Peter, you're doing a lot with the data platform kind of view, and platform engineering. Data as code is something that's being kicked around. What are you working on, and how does platform engineering change as data becomes so much more prevalent in its value proposition? Yeah, so I'm the founder of Cloud Shuttle, and we built this consultancy out all around the challenges that a lot of companies have got with getting their data sorted: getting it organized, getting it ready for other use cases such as analytics and machine learning, AI workloads and the like. So typically a platform engineering team will look after the organization of company infrastructure, making sure that it's coherent across the company, and a data platform engineering team is doing something similar in that sense, where they're making sure that data teams have a solid foundation to build upon and that everything's quite predictable. And what that enables is a faster velocity and the ability to use data as code as a way of specifying and onboarding data, building it, translating it, transforming it out into its specific domains and then onto data products. I have to ask you while you're here: there's a big trend around data meshes right now, and we've had a lot of stuff about it on theCUBE. What are practical ways that people are using data meshes? First of all, is it relevant, and how are people looking at this data mesh conversation?
I think it becomes more and more relevant the bigger the organization that you're dealing with. So, you know, oftentimes in the enterprise you've got projects with timelines of five to 10 years, often outlasting technology life cycles; the technology that you're building on is probably irrelevant by the time that you complete it. And what we're seeing is that data engineering teams, and data teams more broadly, are this organizational bottleneck. And data mesh is all about breaking down that bottleneck and decentralizing the work, shifting that work back onto development teams, who oftentimes have got more of the context than perhaps a centralized data engineering team. And we're seeing a lot of velocity increases as a result of that. It's interesting, there are so many different aspects of how data is changing the world. Lynn talks about the volume with the cloud and genomics, Peter about data engineering at the platform level, and Alex, you're talking about slicing and dicing real-time information; you mentioned Rockset. So I'd like to ask each of you to answer this next question, which is: how have team dynamics changed with data engineering? Because every single company is impacted. So if you're in research, Lynn, you're pumping more data into the cloud; that's got a little bit of data engineering to it. Do they even understand that? Is that impacting them? So how has data changed the responsibilities or roles in this new emerging area of data engineering, or whatever you want to call it? Lynn, we'll start with you. What do you see as the impact? Well, DevOps becomes DataOps and MLOps. And this is a whole emergent area of work. And it starts with an understanding of container technologies, which in different verticals like fintech, that's a given, right?
But in bioinformatics, building an appropriately optimized Docker container is something I'm still working with customers on now, because they have the concept of a Docker container as just a virtual machine, which obviously it isn't, or shouldn't be. So you have, again, as I mentioned previously, this humongous skill gap. Concepts like CI/CD, which are prevalent in ad tech and fintech, are not available yet for most of my customers. So those are the things that I'm building. So the whole ops space is this wide open area. And really it's a question of practicality. I have a lot of experience with data lakes and containerizing and using the data lake platform, but a lot of my customers are going to move to an interim PaaS-based solution. If they're using Spark, for example, they might use a managed Spark solution as an interim step up into the cloud before they build their own containers, because the amount of knowledge to do that effectively is non-trivial. Peter, you mentioned data lakes; onboarding data into a lake house architecture, for instance, is something that you're familiar with. This is not obvious to some verticals, and obvious to others. What do you see as this data engineering impact from a personnel standpoint, and then ultimately in how things get built? Are you directing that to me? Yes, Peter. Yeah, so I think first and foremost, the workload that data engineering teams are dealing with is ever increasing. Usually there's a 10x ratio of software engineers to data engineers within a business, and usually double the amount of analysts to data engineers again. And so they're fighting an ever-increasing backlog of tasks to do and tickets to churn through. And so what we're seeing is that data engineering teams are becoming data platform engineering teams, where they're building capability instead of constantly hamster wheel spinning, if you will.
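The "capability instead of tickets" idea Peter describes (one governed onboarding path with baseline categorization and tagging, rather than a bespoke script per request) might be sketched like this. All names, fields, and checks here are hypothetical illustrations, not any specific platform's API:

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    """Hypothetical dataset descriptor a team submits for onboarding."""
    name: str
    domain: str
    columns: dict                       # column name -> type
    contains_pii: bool = False
    tags: set = field(default_factory=set)

class DataPlatform:
    """Toy registry standing in for a governed catalog/lake entry point."""
    def __init__(self):
        self.catalog = {}

    def onboard(self, ds: Dataset):
        # Baseline checks every dataset gets for free on the paved path.
        if not ds.domain:
            raise ValueError("every dataset must belong to a domain")
        if ds.contains_pii:
            ds.tags.add("pii")          # downstream access control keys off this
        ds.tags.add(f"domain:{ds.domain}")
        self.catalog[ds.name] = ds
        return ds.tags

platform = DataPlatform()
tags = platform.onboard(Dataset(
    name="customer_events",
    domain="marketing",
    columns={"user_id": "string", "email": "string", "ts": "timestamp"},
    contains_pii=True,
))
print(sorted(tags))   # → ['domain:marketing', 'pii']
```

The point of the sketch is that the platform team builds the `onboard` path once, and every producing team reuses it, which is the velocity win over churning through one-off tickets.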
And so with that in mind, with onboarding data into a lake house architecture or a data lake, where data engineering teams are getting wins is in developing a very good baseline of structure, where they're getting the categorization and the data tagging right: whether this data is of a particular domain, whether it contains some PII data, for instance, and then the security aspects, and also the mechanisms on which to do the data transformations. Alex, on the database side, those are known personas in enterprises: hey, I'm the database team. But now the scale is so big and there's so much going on in databases. How does data engineering impact organizations from your standpoint? Yeah, absolutely. I think definitely gone are the days where you have a single relational database that is serving operational queries for your users and can also serve analytics queries for your internal teams. It's now split up into those purpose-built databases like we've said, but now you've got two different teams managing it, and they're designing their data model for different things. So OLTP might have a more denormalized model, something that works for very fast operations and is optimized for that. But now you need to suck that data out and get it elsewhere so that your PM or your business analyst or whoever can crunch through some of it, and now it needs to be in a more normalized format. How do you sort of bridge that gap? That's a tough one. I think you need to build empathy on each side for what each side is doing, and build the tools to say, hey, this is going to help you, OLTP team, if we know what users are actually doing, and if you can get it to us in the right format so that we can analyze it on the backend. So I think building empathy across those teams is helpful. Lynn, I'd like to come back to you. You mentioned health informatics coming back, but it's interesting. I look at the database world, and you look at the solutions that are out there.
A lot of companies that build data solutions don't have a data problem. They're not swimming in a lot of data. But then you look at the field that you're working in right now, with genomics and health and quantum; they're dealing with data all the time. So the people who deal with a lot of data all the time are breaking new ground, and people who don't have that experience are now becoming data-full, right? So for people now, either it's a first-time problem, or they've always been swimming in a ton of data. So it's more of, what's the new playbook, and then, wow, I've never had to deal with a lot of data before. What's your take? It's interesting because, you know, bioinformatics hires grad students. So grad students use their R scripts with their files on their laptops. And so to get those folks to understand distributed, container-based computing is, like I said, a non-trivial problem. What's been really interesting with the money pouring into COVID research is, when I first started, some of the workflows would take, you know, literally 500 hours, and that was just okay. And coming out of FinTech, I was blown away. Like, FinTech is like, could that please take a millisecond rather than a second, right? And so what has now happened, which makes it, like I said, even more fun to work in this domain, is that the research dollars have really gone up because of the pandemic. And so there's this blending of people like me, with more of a big data background, coming into bioinformatics and working side by side. So it's this interesting sort of translation, because you have the whole taxonomy of bioinformatics, with genomics and sequencers and all the weird file types that you get, and then you have the whole taxonomy of DevOps, DataOps, you know, containers and Kubernetes and all that, and trying to get that into pipelines that can actually, you know, be efficient given the constraints.
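The jump Lynn describes, from a single-machine script to distributed compute, is at its core about restructuring per-sample work so it can fan out across workers. A minimal sketch, with invented sample names and a toy "base count" standing in for the real containerized tool invocation:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a per-sample bioinformatics step; in a real pipeline this
# would invoke a containerized tool against a file in object storage.
def process(sample):
    name, reads = sample
    return name, sum(len(r) for r in reads)   # toy "base count"

samples = [
    ("sample_a", ["ACGT", "GGCATA"]),
    ("sample_b", ["TTAC"]),
    ("sample_c", ["A", "CGTACGT"]),
]

# The laptop version is a plain for-loop over samples. The distributed
# version keeps process() unchanged and swaps the executor: threads here,
# a cluster or workflow scheduler in production.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(process, samples))

print(results)   # → {'sample_a': 10, 'sample_b': 4, 'sample_c': 8}
```

The design point is that once the per-sample step is a pure function of its inputs, the same code scales from a laptop loop to hundreds of containers without rewriting the science.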
Of course, we on the tech side, we always want to make it super optimized. I had a customer where we got it down from 500 hours to minutes, but they wanted to stay with the PaaS solution because it was easier for them, and to go from 500 hours to five hours was good enough. But you know, the techies want to get it down to five minutes. We've seen this movie before: DevOps, edge and operations, you know, the IoT world, scenes of convergence of cultures. Now you have data, and then old school operations kind of coming in. So this kind of supports the thesis that data as code is the next infrastructure as code. What's the reaction there for you guys? What do you think about that? What does data as code mean? If infrastructure as code was cloud and DevOps, what does data as code mean? I could take it if you like. I think data teams within organizations have long been this bottleneck, and there's like this dark matter of untapped energy and potential waiting to be unleashed. Data teams, with the advent of open source projects like dbt, have slowly been embracing software development life cycle practices, and this is really driving a big, steep increase in their velocity. And this is only going to increase and improve as we're seeing data teams embrace data as code. I think the future's bright for data, so I'm very excited. Lynn, Alex, reaction? I mean, agility, data as code as a developer concept, CI/CD pipelines. You mentioned these new operational workflows coming into traditional operations. Reaction? Yeah, I mean, I think Peter's right on there. I would say, you know, some of those tools we're seeing come in from software, like dbt, basically give you that infrastructure-as-code idea but applied to the data realm. Also, there have been a few Git-for-data type things; Pachyderm, I believe, is one, and a few other ones where you bring that in.
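The core idea behind those Git-for-data tools can be shown in a few lines: a dataset version is identified by a hash of its canonicalized contents, so identical data always yields the same version id and any change yields a new one. This is a toy sketch of the concept, not the API of Pachyderm or any other tool; all names are made up:

```python
import hashlib
import json

def version_id(rows):
    """Content-address a dataset: same rows (in any order) -> same id."""
    canonical = json.dumps(sorted(rows, key=json.dumps), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

history = {}   # version id -> immutable snapshot

def commit(rows):
    vid = version_id(rows)
    history.setdefault(vid, list(rows))   # re-committing identical data is a no-op
    return vid

v1 = commit([{"id": 1, "value": 10}, {"id": 2, "value": 20}])
v2 = commit([{"id": 1, "value": 10}, {"id": 2, "value": 20}])  # identical data
v3 = commit([{"id": 1, "value": 10}, {"id": 2, "value": 99}])  # one changed row

print(v1 == v2)   # → True  (same content, same version)
print(v1 == v3)   # → False (any change produces a new version)
```

Content addressing is what makes the snapshots immutable and reproducible: a version id pins exactly what was in the data set, which connects directly to the immutability point that follows.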
You also see a lot of immutability concepts flowing into the data realm. So I think just seeing some of those software engineering concepts come over to the data world has been pretty interesting. Well, literally just versioning data sets, and the identification of what's in a data set and what's not in the data set. Some of this is around ethical AI as well, which is a whole area that has come out of research groups, mostly AI research groups, but is being applied to medical data, and needs to be, obviously. So this metadata and versioning around data sets is really, I think, a very of-the-moment area. Yeah, I think you guys are bringing up a really good direction that's happening in data, and it's something that you've seen on the software side with open source, and now DevOps, and now coming to data: the supply chain challenges. We've been talking about it here on theCUBE, and in this episode, you know, we've seen the Ukraine war with some open source, you know, malware hitting data sets. Is data secure? What is that going to look like? So you're starting to get into this: what's the supply chain? Is it verified data sets? If data sets have to be managed, a whole other level of data supply chain comes up. What do you guys think about that? I think it is. Sorry, I'll jump in again. I think that some of the compliance requirements around financial data are going to be applied to other types of data, probably health data. So immutability, reproducibility, that is legally required. Also, some of the privacy requirements that originated in Europe with GDPR are going to be replicated across more and more types of data. And again, I'm always going to speak for health, but there are other types as well, coming out of personal devices and that kind of stuff. So I think, you know, this idea of data as code goes down to versioning and controlling, and that's sort of a real succinct way to say it, that we didn't used to think about that.
We just put it in our relational database and we were good to go. But versioning and controlling in the global ecosystem is kind of where I'm focusing my efforts. That brings up a good question: if data is going to be part of the development process, it has to be addressable, which means horizontally scalable. That means it has to be accessible and open. How do you make that work and not foreclose it with a lot of restrictions? I think the use of data catalogs and appropriate tagging and categorization. You know, everyone's heard of the term data swamp. And I think that just came about because everyone saw, like, oh, wow, S3, you know, infinite storage, we'll just throw whatever in there for as long as we want. And I think at times, you know, with the proliferation of S3 buckets and the like, we've just seen perhaps security not maintained as well as it could have been. And I think that's where data platform engineering teams have really come to the fore, you know, creating a governed set of buckets, with Lake Formation on top. But I think that's where we need to see a lot more work, with appropriate tags and also the automatic publishing of metadata into data catalogs, so that folks can easily search and address particular data sets and also control the access. You know, for instance, if you've got some PII data, perhaps really only your marketing folks should be, you know, looking at email addresses and the like, not perhaps your finance folks. So I think, you know, there's a lot to be leveraged there in Lake Formation and other solutions. Alex, let's back up and talk about what's in it for the customer, right? Let's zoom back, and the reality is: I just got to get my data and make sure it's secure, it's always on and not going to be hackable. And I just got to get my data available, and remember performance. So then I got to start thinking about, okay, how do I intersect this?
So what should teams be thinking about right now as they look at all their data options, or databases, across their enterprise? Yeah, it's a good question. You know, I think Peter made some good points there, and you can think of history as sort of ebbing and flowing between centralization and decentralization a lot of the time. And you know, when storage was expensive, data was going to be sort of centralized and maintained, you know, by the people that are in charge of it. But then when S3 comes along, it really decreases the cost of storage. Now we can do a lot more experiments on it. We can store a lot more of our data, keep it around and do different things with it. You know, now we've got regulations. Again, we've got to be more realistic about keeping that data secure and making sure we're doing the right things with it. So we're probably going to go through a period of centralization as we work out some of this tooling around, you know, tagging and ethical AI that both Peter and Lynn were talking about here, and maybe get to that next world of decentralization again. But I think that ebb and flow is going to be natural in response to, you know, the problems of the other extreme. Where are we in the market right now from a progress standpoint? Because data lakes don't want to be data swamps. You're seeing Lake Formation as a data architecture, as an example. Where are we with customers? What are they doing right now? Where would you put them on the progress bar of evolution towards the nirvana of having this data sovereignty in this data-as-code environment? Are they just now in the data lake, storing everything, real-time and historical? I can jump in there. SQL on files is the driver. And so, you know, when Amazon came out with Athena, that really drove a lot of the customers to realistically look at data lake technologies. But data warehouses are not going away, and the integration between the two is not seamless.
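The "SQL on files" pattern Lynn points to is easy to see in miniature: records get exported out of the operational store, land as files, and analysts run familiar SQL over them without touching the low-latency path. As a local stand-in, this sketch uses Python's built-in sqlite3 in place of Athena over S3; the table and rows are invented for illustration:

```python
import sqlite3

# Imagine these rows were exported from the OLTP store (e.g. a DynamoDB
# export to S3); here they're just an in-memory list.
exported = [
    ("order-1", "alice", 120.0),
    ("order-2", "bob",    40.0),
    ("order-3", "alice",  35.5),
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id TEXT, customer TEXT, total REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)", exported)

# The analytics query runs against the exported copy, so it never competes
# with the operational workload for capacity.
rows = db.execute(
    "SELECT customer, SUM(total) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)   # → [('alice', 155.5), ('bob', 40.0)]
```

That separation (serve operations from the OLTP store, serve aggregates from SQL over the exported files) is exactly the "best of both worlds" question the discussion turns to next.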
You know, we are partners with AWS, but we don't work for them, so we can tell you the truth here. There's work to it. But for my customers, it really upped the ante around data lakes, because Athena and technologies like that, the serverless SQL queries with the familiar query libraries, really drove movement away from either OLTP or OLAP, more expensive, more cumbersome structures. But they still need that OLTP. Like, if they have high latency issues and they want to be low latency, can they have the best of both worlds? That's the question. I mean, I would say we're getting closer. That technology is going to keep moving forward, and then we'll just move the goalposts again in terms of what we're asking from it. But with the technology that's out there, you can do really well. And just, I work in the DynamoDB world; you can get really great low latency, single-digit-millisecond OLTP response times. I think some of the analytics stuff has been a problem with that, and there are different solutions out there, where you can export Dynamo to S3 and then you can be doing SQL on your files with Athena, like Lynn's talking about. Or now you see Rockset, a partner here, that'll just ingest your DynamoDB data and take all those changes; so if you're making a lot of changes to your data in Dynamo, it's going to reflect in Rockset, and then you can do analytics queries, complex filters, different things like that. So I think we continue to push the envelope, and then we move the goalposts again. But I think we're in a lot better place than we were a few years ago, for sure. Where do you guys see this going relative to the next level? If data as code becomes that next Agile, software-defined environment, with open source as well, all these new tools, with serverless things happening, with data lakes built with nice architectures, with data warehouses, where does it go next? What happens next if this becomes an Agile environment?
What's the impact? Well, I don't want to be so dominant, but I feel strongly, so I'm going to jump in here. So I feel like now, for my most computationally intensive workloads, I'm using GPUs; I'm bursting to GPU for TensorFlow neural networks. So I've been doing quite a bit of exploration around Amazon Braket for QPUs, and it's early and it's specialty; it's not for everybody, and the learning curve, again, is pretty daunting, but there are some use cases out there. I mean, I got a hold of a paper where some people did a QCNN, a quantum convolutional neural network, for lung cancer images from COVID patients, and the QPU algorithm pipeline performed more accurately and faster. So I think bursting to quantum is something to pay attention to. Awesome. Peter, what's your take on what's next? Well, I think that was absolutely fascinating from Lynn, but I think also there's some more sort of low-level, low-hanging fruit available in the data stack. I think there are still a lot of challenges around the transformation layer, getting our data from raw landed data into business domains, and that speaks to a lot of what data mesh is all about. I think we should somehow make that a little more frictionless, because that's really where the labor-intensive work is that's dominating data engineering teams, and where we're trying to push that workload back onto software engineering teams. Alex, we'll give you the final word. What's the impact? What's the next step? What's it look like in the future? Yeah, for sure. I've never had the breaking-a-data-center problem that Lynn's had, or the bursting-to-quantum problem, for sure. But if you're in the pool I swim in, of terabytes of data and below, and things like that, I think it's a good time. And just like we were talking about with DevOps, pushing and allowing software engineers to handle more of the operations stuff.
I think the same thing with data can happen, where software engineering teams can handle not just their code, not just deploying and operating it, but also thinking about the data around their code. And that doesn't mean you won't have people assist you within your organization; you'll still have some specialists in there. But I think pushing more stuff onto the individual development teams, where they have ownership of it and they're thinking about it through the whole life cycle, I mean, I'm pretty bullish on that. And I think that's an exciting development. So is that shift left, like we saw with security? What does that shift to? We've shifted so much stuff left that now the things that were at the end are back at the end again. But at least we can think about that stuff early in the process, which is good. Great conversation, very provocative, very realistic, and great impact on the future. Data as code is real. The developers, I do believe, will have a great operational role in the data stack concept, impacting things like quantum. It's all kind of lining up nicely. And it's a great opportunity to be in this field from a science and policy standpoint. Data engineering is legit. It's going to continue to grow. And thanks for unpacking that here on theCUBE. Appreciate it. Thanks for having us. Okay, great panel, AWS heroes. They work with AWS and the ecosystem independently out there. They're in the trenches, they're on the front lines, cracking the code here with data as code. Season two, episode two of the ongoing AWS Startup Showcase series. I'm John Furrier, host. Thanks for watching.