Hello, and welcome to this Analyst Angle on theCUBE. Today we're going to continue to dig into the sixth data platform and how it could evolve based on global metadata management and governance needs across data platforms, because these platforms are growing organically in many organizations. We see organizations evaluating what their next data platform will be, and how multi-vendor and how modular it will be, because those are real concerns about how these platforms actually get built and how composable they are. We believe that metadata is the glue that binds it all together, so identifying and solving the metadata question is key to how you build the next data platform. And organizations are taking notice as they look for solutions from their existing vendors, such as Informatica, whose cloud subscription annual recurring revenue, or ARR, grew 37% year over year to $617 million in the fourth quarter of 2023. We also see emerging platforms from Starburst, with Trino and their emerging governance platform, and from dbt Labs, which is becoming the authoring and execution platform for turning data into business-level models. According to our partner ETR, these are the companies organizations are really looking to, with growth in both citations and net sentiment.

I'm your host and analyst, Rob Strechay, joined by principal analyst George Gilbert of theCUBE Research. Today we'll be talking to Drew Banin, co-founder of dbt Labs, who joins us now. Welcome on, Drew.

Hey, thanks so much for having me on. It's a pleasure to be here.

Yeah, it's exciting. I was out at your Coalesce 2023 conference, which was a fun, energetic, really packed event, and you seem to be one of the people right at the center of what's going on. So let's jump into where dbt Labs and dbt Cloud are going in supporting a composable data platform, and the challenges and opportunities we see out there.

Okay, so, Drew, let me kick it off. When we start thinking about assembling a data platform rather than getting it all from one vendor, the center of gravity starts to move from the DBMS to the metadata, because that's where the source of truth is. And so with dbt and its DAG, the directed acyclic graph, you can capture all the enterprise lineage and turn that into a graph. Help us understand how data engineers will transform all data from source-conformed to business-conformed, and how you see that evolving over time.

Sure. I love that framing of source-conformed to business-conformed. The reality is that folks will load data into their data platform or data lake, and then there's a series of transformations that need to be applied to this source data to convert it into the language of the business. Frequently this is a multi-step process: it involves filtering data, joining datasets together, aggregations. And dbt can be used to define those transformations, define the logic as SQL or as Python models, and then push the execution down to the data platform. So the benefit is that dbt has a map of every single data transformation and how they relate to each other. This is what we call the dbt DAG, the directed acyclic graph, as you say. And with this, you get a set of assets, SQL files or Python files, that you can version control, code review, and manage changes to over time.
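For readers following along, here is a minimal sketch of the kind of transformation Drew is describing: a business-conformed dbt model built on source-conformed staging models. The table and column names are hypothetical.

```sql
-- models/marts/fct_orders.sql
-- A business-conformed model built from source-conformed staging models.
with orders as (
    select * from {{ ref('stg_orders') }}       -- staging model over raw orders
),

payments as (
    select * from {{ ref('stg_payments') }}     -- staging model over raw payments
)

select
    orders.order_id,
    orders.ordered_at,
    sum(payments.amount) as order_total         -- aggregate into a business term
from orders
join payments
    on orders.order_id = payments.order_id
where orders.status != 'cancelled'              -- filter out non-business rows
group by 1, 2
```

Because each model references its upstreams through ref(), dbt can assemble the full DAG and push the compiled SQL down to the warehouse for execution.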
Yeah, I think that's really one of the keys: dbt was built on composability. How do you bring things together? How will this evolve over time? How do things like data contracts come into play as we move forward?

Great question. We unveiled a new paradigm for building dbt projects at scale last year that we call dbt Mesh. It's very much a nod to the problem statement outlined in the data mesh literature, and it's our take on how to solve these problems of governing a data practice in a federated data environment. Part of the dbt Mesh paradigm is the ability to define data contracts, to contract the datasets that are exposed by one team and received by, or built upon by, another team. This is one of the ways we can ensure governance over the entire data function without requiring all the data work to happen on one team, by a select few people. You can actually spread that out to the domain experts within the organization. So we see this paradigm being really relevant and interesting to folks with big, decentralized data functions, typically larger companies with very many data engineers in different pockets around the organization.

Yeah, that gets to almost the next question I had, because the data contracts concept is not new. People have been using contracts for APIs and things of that nature in modern cloud-native applications. But what's really driving these organizations to go to a federated topology in that way?

Sure, and I want to let you in on a little secret about dbt and what it's all about. Every good idea we've ever had about the product came from looking at how software engineers scale complexity and scale the number of people who can participate in building applications, and then trying to bring those same workflows into the data universe. So when we talk about version controlling your code, or automated testing, or CI/CD with dbt Cloud, data contracts fit right in there, right? You mentioned APIs. APIs have contracts. It's how different teams, potentially across organizations or within one organization, can all collaborate and build on top of each other's work. When you mentioned it, I thought about the famous Jeff Bezos email that everything needs to be an API. I don't have it front and center, but famously he wrote, you know, there are folks direct-linking to libraries or compiling each other's code, and he said, you can't do that anymore; you have to depend on each other's APIs. It's really that same thought process, but we're trying to bring it to datasets.

So the question you asked was, why are we seeing this grow? The reality is that data is complicated, and you need to be an expert in the business domain, the organizational domain, to understand what that data means, what you should do with it, what it's appropriate to use for, even just how to use it in the first place. We see these decentralized data functions coming out of the fact that, if you take a sufficiently sophisticated, big, long-lived organization, well, we often talk to customers who have acquired a number of different companies, and those companies bring with them different datasets from different domains. And the idea of one team managing all of that data across all these potentially very disparate domains is very challenging. So instead, what you want to do is push that work out to the people who understand the data best, the people closest to the nature of the data, how it was produced, and what they want to use it for, and empower those people to actually use the data effectively.
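To make the contracts idea concrete, here is a hedged sketch of what a contracted, publicly accessible model can look like in dbt Mesh; the model and column names are hypothetical.

```yaml
# models/marts/_models.yml -- model and column names are hypothetical
models:
  - name: dim_customers
    access: public            # downstream teams and projects may ref() this model
    config:
      contract:
        enforced: true        # the build fails if the model's shape drifts
    columns:
      - name: customer_id
        data_type: int
        constraints:
          - type: not_null
      - name: customer_name
        data_type: string
```

A consuming team in another dbt project can then build on it with a cross-project reference like `{{ ref('platform_project', 'dim_customers') }}`, depending on the contract rather than on the producing team's internals.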
I mean, that's where the governance comes into play: making sure that even though it's federated, you understand how your data is being used and where it came from, and making sure it's being used the right way. Totally makes sense.

So, Drew, let's build on that thought. You've got these decentralized teams, and they're all trying to be part of this coherent data engineering process. Now, what happens when these decentralized teams at some point need different execution engines to process the data? They might have the final products in a SQL DBMS for serving dashboards, but they might want to do the transformation work in another engine that's lower cost. Can dbt accommodate that eventually?

So, that's a very good question. I will tell you that dbt is not an execution engine. We're not a database. What we do is manage transformation logic defined in SQL or Python. You can define tests on top of these data models, we will assert the quality of your data, trace lineage, things like that, but we operate purely in the logical realm. So this question of how you apply transformations to datasets across different execution environments: I would say that dbt is really well suited to solve this problem, because we do operate in this logical domain and we push 100% of the execution down to the data platforms. In practice, what I can tell you is that it's a pretty hard problem, because depending on exactly which data platforms you're using, or where the data lives, you're subject to a number of different constraints. One of them could be egress costs, for example, if you need to get data out of one region or cloud and into another. Another challenge could be privacy laws. You can imagine that you have data that resides in the European Union, but you want to combine it with data in the US. Well, you might not be allowed to do that. I just want to illustrate that it's partly a technology challenge and partly a governance challenge. And I would bring it back to the idea that we have not solved this problem with dbt Cloud today, but it's something we're hearing a lot about from our customers, and I think we are well positioned to solve it in the right kind of way once we work through these technical and compliance challenges, and make sure we give people tools to solve their problems in a way that is well governed.

Yeah, that makes sense, and I like your example, because it's not even just about the European Union and, oh my God, I was gonna say GDPR. But when you start to look at it, even in France you're not able to take PII data out of the country, and things of that nature. Which leads to a really interesting topic, because there's so much you could do there. Like you were saying, we look at dbt as flying at the control plane level across these different execution environments, and when you do that and you have the DAG, there are probably a lot of different things you can do.
What's the roadmap that you see for the metadata you intend to generate and capture from the DAG as you go forward?

It's a great question, and I would say there are two types of data that are germane to the DAG. The first one is dbt things. You write SQL select statements in dbt, and we want to expose the information about how these transformations relate to each other. We want to expose information about data quality or freshness. We do all of this today. We have what's called the Discovery API in dbt Cloud, so you can make an API request and get back your entire dbt project's lineage, plus the source freshness and data quality of every single model therein. You can do that today. We also have a user experience that we call dbt Explorer: if you want to click through a website that shows you all your models, which ones are up to date, performance information, recommendations, all of that is surfaced inside of dbt Cloud. So we're always working to put more and more information into that metadata dataset and API. And just a couple of days ago, we launched, in a beta state right now, column-level lineage. We'll actually dig into your SQL select statements and understand how columns are transformed across these different models. That's really helpful for impact analysis, to understand the impact of changes to source datasets, or for root cause analysis when something breaks: what changed upstream that resulted in this change?

But all of those things are within the logical domain of dbt. One really interesting area for us, which we're working on currently and planning to do a lot more of in the future, is expanding what the dbt DAG knows about to things that are upstream of dbt, like data loading, or data ingestion, call it. Where did this data come from? Did that process succeed? When was it last loaded? And also things that are downstream of dbt: operational analytics, what in the past was called reverse ETL, or maybe it's still called reverse ETL, or just dashboarding, understanding which dashboards depend on these datasets. Say you're making an intentional change: you want to go and update those dashboards to account for your new business logic. Or maybe you didn't know a table was being used. Well, we could surface to you: hey, that data model you're about to change is loaded into a dashboard that was viewed 10,000 times yesterday, and you're removing a column that's part of those queries. Are you sure you want to do that? So, the DAG encompasses so much information about how data flows and how it's used that if we can expand what's represented inside of it, we have more information to put at users' fingertips so they can make informed and intelligent decisions.
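As a rough illustration of programmatic access to that metadata: the Discovery API Drew mentions is a GraphQL API, so a lineage-and-freshness query might be sketched like the following. The field names here are illustrative assumptions, not the exact Discovery API schema.

```graphql
# Illustrative sketch only -- field names are assumptions, not the
# exact dbt Cloud Discovery API schema.
query ProjectLineage($environmentId: BigInt!) {
  environment(id: $environmentId) {
    applied {
      models(first: 100) {
        edges {
          node {
            name
            ancestors { name }        # upstream lineage for each model
            executionInfo {
              lastRunStatus           # quality / recency signals
              executeCompletedAt
            }
          }
        }
      }
    }
  }
}
```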
So, Drew, let me build on that answer with another question. You're capturing more and more metadata built off the DAG: as you said, the flow of the data and what it's used for. But there's more that you're adding in terms of the semantics of the data, ultimately, whether it's metric and dimension definitions or business rules embodied in the quality metadata. So how should customers think about the ever-richer semantics embedded in that metadata? How might they use it?

So this was a big evolution of the dbt Cloud platform for us, starting a couple of years ago but getting really serious exactly a year ago, when we acquired a company called Transform. The co-founders of this company were kind of the original metrics layer folks; they previously worked at Airbnb. They brought with them the core intellectual property that has become the dbt Semantic Layer, and they also brought decades of cumulative experience in how to operationalize metrics at scale, which is a hard problem.

What I would point to is that dbt is all about data modeling and getting the right datasets in the right format to speak the language of the business. The idea of a semantic layer takes that one step further: not just from a dataset perspective, like a table in a data platform, but, as you say, the actual metric query. Take revenue over time. It seems like it's easy to calculate, but if I think about our organization at dbt Labs, we sell software and we do professional services, and other companies that are much bigger and more established have lots and lots of different revenue streams. So pinning down the answer to a question like "how much money did we make last month?" is very non-trivial in a non-trivial business, and most businesses are pretty complicated under the hood.

So we think about this very much as a metadata layer, defining the definition of a metric as metadata. And I'll expand on this for just one second, because I think it's so interesting and it really highlights the dbt value proposition. As a data analyst, you can rarely sit down by yourself and define a metric for the organization. If you, as a data analyst, define a metric that, say, the marketing organization cares a lot about, they're gonna be pretty mad at you if you didn't consult them about exactly how you want to pin down the idea of a marketing qualified lead. Maybe everyone thinks they know that already, but in my experience it can be pretty tricky to pin down these metrics precisely. Frequently there are different definitions of a metric that make sense in different contexts, or that are used by different parts of the organization. The dbt Semantic Layer creates this opportunity where you can define the metric in exactly one place, and it's version controlled, so you have change management and you can see who changed the metric, and why, and when.

That's the vision of the semantic layer. There's all the physical execution of how to create this time series showing marketing qualified leads over time, but really the powerful thing is a data person and a person aligned to the business, say a marketer, getting together and pinning down exactly what we mean by the metric, where the data comes from, and what it means if it goes up or down, and then broadcasting that to the entire organization. So it's very much a metadata endeavor to define these things: the logical definition of the metric. And then, of course, dbt has different APIs for querying that metric. We have a JDBC interface where you can query MQLs by month and get back the time series you would expect, based on this version-controlled metric definition.

Yeah, that totally makes sense. And to your point, especially with MQLs, I've been at multiple companies, and it's not only defined differently at each company, it's defined differently at each company over time. So being able to version control that is pretty important, so that you know where you've been and where you're going.
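Here is a hedged sketch of what such a version-controlled metric definition can look like in the dbt Semantic Layer's YAML spec; the model, entity, and measure names are invented for illustration.

```yaml
# Hypothetical example -- model, column, and measure names are invented.
semantic_models:
  - name: leads
    model: ref('fct_leads')
    defaults:
      agg_time_dimension: created_at
    entities:
      - name: lead_id
        type: primary
    dimensions:
      - name: created_at
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: mql_count
        agg: sum
        expr: case when is_mql then 1 else 0 end   # the agreed-upon MQL rule

metrics:
  - name: mqls
    label: Marketing Qualified Leads
    description: Leads meeting the marketing-agreed MQL criteria.
    type: simple
    type_params:
      measure: mql_count
```

Because the file lives in version control, the git history is the change management: who changed the MQL definition, when, and why.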
That actually leads pretty nicely into the next set of questions, which is: let's turn to governance tools and the teams that build and use that data. What are the different ways customers should structure a governance function to manage this federated data organization? Because, like you said, organizations are still figuring this out as they encounter these new ways of working, multiple execution engines, and things of that nature.

So maybe I'll start with the definition of governance. I'm not gonna attempt to define it for everyone, but I'll tell you how I think about this capital-G governance word, because it can mean a lot of different things in different contexts. Just to be specific, I think about governance in terms of visibility into where data came from and how it's used. I think about it in terms of change management: if you reported a number one way last month, is it the same or different from how you report it this month? And I also think about governance in terms of data quality, recency, and viability: can you use this dataset for the problem you're trying to solve? I'm sure there's a textbook definition of governance, but one of the things that comes up a lot when I speak to our customers and partners is that different people look at governance in different ways.

What I would point to is really this idea of visibility. I think that's the starting point. If there are things happening with your organizational data and you don't know about them, it's impossible to govern that use of data. So the very first thing is seeing the lineage of how data flows from ingestion, through transformation, to application, or maybe we'd call them exposures in dbt land. Once you have that, you can start to understand the different ways the data flows and what specific data is flowing. When I think about our column-level lineage release from just a couple of days ago, it actually gives you the ability to see how PII flows from staging a new dataset through to the tables that power reports. It totally depends on the organization, the context, and the data, but you can imagine that you might want to mask PII before you create a table that is then used for reporting and disseminated to the organization. So I think it starts with visibility. There's a whole lot wrapped up in there, but that's maybe my starting point for this conversation.

No, I think that's perfectly good. And it's a good place to ask, because there's metadata that dbt generates, and there's metadata that partners are going to build on top of it. What is it that you're looking for from partners who come on top of the dbt DAG? What does that look like in your view?

So, I learned about differential privacy and the concept of k-anonymity a couple of years ago from another company in the data ecosystem, and I was blown away by both the power and the complexity of solving those kinds of problems. I would say that, to the extent possible today, we use the data platforms to apply things like dynamic data masking.
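As one concrete example of the platform-applied masking Drew mentions, here is a hedged sketch in Snowflake-style SQL; the policy, role, and table names are hypothetical.

```sql
-- Snowflake-style dynamic data masking; policy, role, and table names
-- are hypothetical.
create masking policy pii_email_mask as (val string) returns string ->
  case
    when current_role() in ('PII_ANALYST') then val   -- privileged roles see raw PII
    else '***MASKED***'                               -- everyone else sees a mask
  end;

alter table analytics.dim_customers
  modify column email set masking policy pii_email_mask;
```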
But if you want to take that to another level and do things like k-anonymity or differential privacy, that's a place where we really do look to partners to implement these true privacy and masking techniques at scale.

Well, that makes sense. With the time left, I think we'll jump to a question George was thinking about around SQL and Python.

Yeah, true. We chatted about this a few days ago, but in the spirit of adding more and more semantics to the data, you've got this trade-off between Python being more expressive and SQL being declarative, so SQL lends itself to optimization and easier analysis. And at the same time, SQL is about strings, not things. How do you think about the trade-off between those two in terms of adding semantics to the data?

I like that framing: strings, not things. The first thing I would say is SQL is currently the lingua franca of analytics. It has been for decades, and I think we believe it will continue to be for decades. I don't think we should expect SQL to go anywhere. We do also recognize that there are things you can do in a language like Python that you can't do in SQL. And what I'd say is that it actually ends up being less about the language itself and more about the capabilities of the underlying data platforms. As the years have gone by, we've seen data platforms enrich their capabilities to serve use cases that previously you needed to grab Python for. I'll give you some examples. There's a machine learning capability in the GCP world called BQML, BigQuery machine learning. Snowflake has Cortex. And of course, Databricks has a number of data science and AI/machine learning capabilities, accessible via both SQL and PySpark. So one thing I really want to claim here is that the data platforms are expanding their capabilities to make more of these use cases addressable in SQL.

Why are they doing that? Well, I would posit that SQL is the most broadly understood programming language in the world, just given how many people can look at a SQL query and get a rough idea of what it's doing. You don't have to be a data scientist to look at a 50-line SQL query and get a sense of: I see some column names, that kind of tracks, that makes sense. If you were to put equivalent Python code in front of someone, well, I would argue Python code can be arbitrarily complex and confusing, and at least the way that I write it, it's deranged, unhinged, you name it. Mostly kidding. SQL gives you this ability to bring more people into the process, right? You can have folks like product managers, sufficiently technical folks, participating in the data practice because they can look at SQL and know what it means, and a lot of them learn SQL themselves and author it too. That's not the case for Python. So from a collaboration perspective, it's hard to beat SQL for analytics. And again, I just don't see it going anywhere in the future. I think the data platforms will do even more to make these sophisticated, powerful capabilities addressable via the interface of SQL, which is all it is, right? It's an interface.
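To ground the BQML example Drew raises: training and scoring a model there really is just SQL. A hedged sketch, with dataset and column names invented for illustration:

```sql
-- BigQuery ML sketch -- dataset and column names are hypothetical.
-- Train a churn classifier with a single SQL statement:
create or replace model analytics.churn_model
options (model_type = 'logistic_reg', input_label_cols = ['churned']) as
select tenure_months, monthly_spend, support_tickets, churned
from analytics.customer_features;

-- Scoring is just another SELECT:
select *
from ml.predict(model analytics.churn_model,
                (select * from analytics.new_customers));
```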
No, I think we both agree. It's one of those things where, pulling back, and being a recovering DBA myself, I look at SQL and it's fairly easy when you're trying to explain it to people, whereas my Python coding really is, I can't remember exactly how you said it, "unhinged," I believe, was the word, and that would be my coding style as well. But I think this is a good place for us to park it for today, because this gives people a pretty good idea of where things are headed. So I definitely want to, first and foremost, thank George for joining me here. And Drew, thank you for coming on board and helping us dig into this, because metadata is the glue. We're gonna look to have you and the folks over at dbt Labs back, because you're right in the thick of it.

Great, well, thanks so much for having me on. It was a true pleasure, and I hope to talk again sometime.

Awesome. And thank you for watching this Analyst Angle examining the sixth data platform on theCUBE, the leader in high-tech enterprise analysis and coverage. Stay tuned.