The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real: find out how best to keep it real at stevenmoyfoundation.org.

Thanks for coming, and welcome to another talk in the Vaccination Database Talks summer series. Folks, we're delighted to have Drew Banin today. He is a co-founder and chief product officer of dbt Labs, the main company backing the dbt data transformation tool. Drew co-founded the company in 2016, he has a bachelor's degree in computer science from Drexel, and prior to that he spent some time working at other analytics companies before coming up with the idea of doing dbt. So with that, we'll get started. But again, for the audience: if you have any questions while he's giving his talk, please unmute yourself, say who you are, and ask your question. Feel free to do this at any time; we want this to be a conversation and not have Drew sitting in his office talking to himself for an hour. Okay, Drew, the floor is yours. Thank you so much for being here.

Yeah, thanks so much for having me. I'm excited to be here. Today we're going to talk about how to count things, and I do want to argue that it's harder than you think it is. I really appreciate the introduction. My name is Drew Banin, I'm based out of Philadelphia, on the other side of the state. I'm one of the co-founders of dbt Labs, and I run our product team. My background is in computer science. Just between us friends, I applied to CMU but was not accepted, so I'm excited for the honorary doctorate you're going to be sending my way after this talk. Is that right?

You've got to write a big check for that first, right?

Well, yeah, fair enough. So I spent a lot of the early days of the company building dbt open source, and since then I've grown and built out a product and design team, and I think a lot about the ecosystem in this space and how the product works. But today I want to talk to you less about the implementation of dbt, and more about what happens with databases out in the world and how people use them.

Something funny has happened to me over the past five years of working on dbt: every time I see a news headline, I think of the poor, understaffed, underappreciated team of data analysts who are just doing their best. We saw this recently with Facebook and Instagram. Here's a headline from March. I think at this point these analysts probably went out and gathered requirements from stakeholders about the KPIs they needed to track. Then in July, Facebook says it's moving forward with Instagram for kids despite the backlash, so at this point hopefully the data analysts are out there implementing all their new data models. And of course, as we know, at the end of September it's on ice, and it's time for all those analysts to go revert their open PRs as the winds in the business change. Data is challenging. You need to be out in front of what the company is doing. People want to make decisions with data; at the same time, they want to know what's happening on the ground with the product and how people are using it. And there are real ramifications for what happens with data.

So I promise you I'm not picking on Facebook, but I thought this was a perfect example. I don't know if everyone's familiar, but Facebook had a reporting issue that overstated the average duration of video views on the platform.
And this was one of the nails in the coffin of journalism over the past, I don't know, five years: the "pivot to video," as they call it. The actual issue, and I didn't know this until I recently looked it up, is that Facebook only counted views of videos that lasted longer than three seconds in its average. So Facebook ended up telling publishers that videos were massively more engaging than they actually were, and it was because of this reporting issue. Now, I don't know this for a fact, but my guess is that this was not a simple typo that went undiscovered. My guess is that there was a product manager who defined a KPI like average duration of video viewed, or video views at all, and there was a table in a Presto cluster somewhere that an analyst queried, and they averaged a view-time column in that table. But they didn't know that it excluded views that were less than three seconds long. Maybe it's a tracking quirk or a reporting quirk, but the result is that it had a really big impact on all of publishing. This is what happens when data provenance, the understanding of where the data came from and how it's modeled, is done incorrectly.

So business logic is really hard. Questions like: what counts as a video view? That's a hard question to answer if you're Facebook. What's a weekly active user? Or even, what's a user? Are unauthenticated visitors to a website users? That's probably a thing you could define, but it's a hard thing to broadcast and communicate out to an entire organization of people. Somewhat comically, there's the debate over whether burritos are sandwiches or not. This matters for tax purposes: say you want to count revenue, is a burrito taxed as a sandwich? Sandwiches are taxed differently than other kinds of food. It's kind of funny. And this last one is a lived experience for me: is a dog bed a mattress? We worked with a company that sold mattresses online, and two of the VPs had different numbers for mattress sales revenue. It turned out one of them was operating under the assumption that dog beds were mattresses and the other was not. So even answering a question like "how much money are we making on this core product we sell?": if you don't know what the thing you're selling is, or how it's categorized, you're going to come up with different numbers. And ultimately people build a distrust of data.

So that's all framing for what we're talking about here. Getting data into databases is largely a solved problem. Querying data at scale in databases is largely a solved problem for 99% of companies on the planet. FAANG companies have their own set of challenges at that scale, but if you are a company that is, for the most part, selling products online or providing software as a service, data ingestion and query performance are not the challenge. The challenge is helping everyone in the organization have a shared understanding of what these core entities are, how they relate to each other, and what they really mean.

So we're going to shift gears a little bit, from that problem statement into dbt and how dbt helps address this problem. To do that, we're going to talk a little bit about a change that has happened in the data warehousing space over the past five or ten years; sort of first-principles stuff here. If anyone has any questions at this point, please do feel free to speak up. I'd be happy to take them. So historically... yeah, go ahead.
I guess the gist of the question is: this transition from ETL to ELT, is it because of technology changes, S3 being one of them? Or is it something more fundamental, because of the rise of the big data movement or data science?

Yeah, amazing question. Let's run through what these things are at a high level real quick, and then I will absolutely answer that. So historically, I guess the first thing to say is that when we're talking about ETL and ELT, these are just initialisms for extract, transform, and load, and the order in which you do those things is what dictates ETL versus ELT. Old school, the transformation step happened in the movement of data. You'd pull data out of one source, you would apply transformations to it, and you would load it into a data warehouse. Historically, that data warehouse might be something like an Oracle database, or Vertica, or Teradata, almost an appliance of a database. The reason you wanted to do it that way is that these databases were very expensive and very rigid and inflexible. You couldn't just throw enormous amounts of data at them and have them auto-scale; auto-scaling for databases is a relatively new concept, at least in the data warehousing world. So what you wanted to do was capacity planning: understand up front how much data you could put into the warehouse and which data points you were actually able to load into that Oracle database, with long and slow sales cycles to upgrade your warehouse for more storage. It was very expensive. So it motivated this ETL world where you pre-process data before you put it in the warehouse, and everything in the warehouse is pristine. That's the true source of everything for your business, and it all lives in the data warehouse.

The new school approach is a little bit inverted. Instead, you load data directly into the warehouse, and then you can apply transformations to it with the data already living inside the warehouse. The big reason for that shift, that paradigm shift, is that modern data warehouses came into existence, and they're significantly more powerful and accessible and flexible and cost-effective than the previous generation. Whereas storage would have been mind-bogglingly expensive on one of these appliance databases 10 or 15 years ago, on Snowflake I think the cost for storage is something like 20 bucks per terabyte per month. So you can throw a terabyte at Snowflake and pay about $240 for a year, which is a very small amount of money for a business that has a need to run a data warehouse. So you just throw data at the data warehouse. It can be raw, it can be messy; we'll talk about what it means to transform or enrich that data a little later in the presentation, but probably what you want to do is just get that data into the warehouse in its raw state. Then you have the ability to apply transformations to it and clean the data up within the warehouse itself. The benefits are really flexibility and accessibility: if you have access to the warehouse, you can look at the raw data, you can transform the data, you can query it, and it's all in one place. The big thing here is bringing the compute to the data.
So in the old school example, you're effectively bringing the data to the compute: you're doing these data transformations in memory on some server somewhere. In the new school ELT approach, you're letting the warehouse do the heavy lifting. Is that okay, Zivar? So this is the big paradigm shift: you throw source data at a data warehouse, and then you ask the data warehouse to transform it for you. And that, fundamentally, is the thing dbt helps with.

Yeah, I have a question, if you can hear me.

Yeah, of course.

To what extent do you think this was motivated by the fact that once you load your raw data into the database and then apply the transformations, if something goes wrong, you won't lose it? Is that a big motivator, or just another facet?

Yeah, you're on the money. It's equal parts "if something goes wrong," but also, and I guess I'll go all the way back, it kind of comes down to this part, right? Businesses change. The business requirements change and the business logic changes. When your transformation happens in flight, in the process of extracting and loading data, then if you want to change your business logic you have to change those transformations, and now you're replaying data from the source whenever you want to apply new logic to it. You have significantly more flexibility if you've just landed that raw data, in its unenriched form, in the warehouse, because then you can transform the data however you want. It's already where you need it to be, and these data warehouses are data processing machines; they're the best place to transform data, because what they fundamentally do is move data around or otherwise reshape it. We'll talk a little bit about how dbt actually does that later in the presentation, but yeah, flexibility is a big benefit of this approach.

And I will just say, there's no such thing as a free lunch. What's worse about ELT? Well, it's more expensive, all things considered. You are storing more data, you are running more compute in the warehouse, and there are probably more efficient ways to load a subset of data. But our experience, and I think the reason this is taking off, is that the costs of storage and processing are significantly smaller than the benefits of flexibility and accessibility. You don't need access to the data source itself to replay ETL processes; you can just process the source data in the warehouse as is.

All right. So there are a couple of data warehouses on this list. I saw that there's a Firebolt talk a little bit later in the series, which I'm definitely going to tune in for. Really exciting stuff they're doing.

Firebolt has been out for more than a year. I was surprised to see them listed there. Do you come across them in the wild a lot?

Well, we're working with them on building a dbt integration for Firebolt, which I'm now wondering whether that's public information or not. I think it is; I can't think of a reason why it should be a secret. It doesn't sound bad, it sounds good. Yeah. And this list is not exhaustive.

Do you come across Yellowbrick a lot?

I haven't heard too much about them.

Okay.

But yeah, there are others that I omitted from this list, because there are different names for these things.
And I don't want this to become a conversation about semantics, but you could put Presto or Spark, and what Databricks is doing with Delta, on this list. Lots of folks use dbt in the Microsoft world too, whether that's SQL Server or otherwise. There's really a long list, and honestly dbt is pretty agnostic about which of those tools you use, as we'll talk about later.

So okay, we're about 12 slides in and I haven't actually told you what dbt is yet. Let's fix that. dbt is a framework for building data models in a data warehouse. When we talk about data models, we're talking about transformations applied to source data. These data models are testable and documentable, and they can form a DAG by depending on other models. And just to be clear about what dbt is not: dbt is not a database. It works with databases, and it very much relies on them to do the transformation of data. dbt goes in and runs SQL queries on a database, asking it to transform data; dbt is not going to extract data and transform it itself in memory or anything like that. It's all done via SQL orchestration. And this is worth calling out: dbt is Apache 2.0 licensed open source software. There's a cloud component, which we're not going to talk much about today, that's proprietary, but everything we're talking about in this realm is open source.

So okay, I said that dbt is DAG aware. This is a screenshot from the documentation that dbt can produce about your dbt project. What we're looking at here is models, in blue, which can depend on each other, and these green nodes in the DAG are data sources: things loaded by the extract-and-load part of the process. In this case we're looking at transactions, so finance reporting, and this actually comes from our own internal analytics project. I'm going to show you how we calculate revenue at dbt Labs with dbt.

Let's zoom in on one narrow slice of it, here in the highlighted box. These green nodes are the data sources, so these are tables loaded raw into the data warehouse. You can see there are two versions of invoices and customers, which is funny. We started out as an LLC and converted to become a C corp, which is fascinating, and I would love to tell you all about it. From a data perspective, what it means is that we have two different QuickBooks accounts. So even to look at our revenue as a five-year-old startup, and we are not encumbered by decades of growth and acquisitions and all the things you might see at larger companies, you have to look at, actually, Stripe plus two different QuickBooks accounts, but let's just say here two different QuickBooks accounts, to understand how much revenue we've made historically. If you want to paint a line chart of revenue over time, you can't just look at the QuickBooks invoices table. You actually have to look at two of them, and you have to combine them together. This is just one example of business logic, right? This is how companies end up with people coming to meetings with two different revenue numbers, and asking, how can we not know how much money we made? Because business logic like this is hard to pin down and hard to make sure everyone is universally aware of.
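As a concrete illustration of what Drew describes next, here is a minimal, hypothetical sketch of a single revenue model that unions the two QuickBooks invoice sources so nobody downstream has to know about the LLC-to-C-corp split. The source and column names are invented for illustration and are not dbt Labs' actual project.

```sql
-- Hypothetical dbt model, e.g. models/fct_invoices.sql.
-- Union the pre- and post-conversion QuickBooks accounts into one table
-- so downstream analyses only ever query a single invoices model.

with llc_invoices as (
    select invoice_id, customer_id, amount, issued_at
    from {{ source('quickbooks_llc', 'invoices') }}
),

c_corp_invoices as (
    select invoice_id, customer_id, amount, issued_at
    from {{ source('quickbooks_c_corp', 'invoices') }}
)

select 'llc' as source_account, * from llc_invoices
union all
select 'c_corp' as source_account, * from c_corp_invoices
```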
So the dbt solution, right, is: don't make it so that everybody needs to know about our conversion from an LLC to a C corp in 2018. Instead, build these abstractions inside of the database. Here we're looking at fact transactions, the node all the way on the right. This is the table that represents all of our revenue historically, and if you want to understand any aspect of how much money we've made as a business, which is a pretty important thing we use data for, there's one table you can look at. It abstracts over the business logic of how we merge Stripe and QuickBooks data, the two different QuickBooks accounts, et cetera. There's actually more to this diagram that I'm not going to show you, about amortization. But the point is that you can encode this logic with dbt, and then if somebody downstream wants to run an analysis, they don't have to know all of the business logic for calculating every single stage here. They've got an abstraction as a table in the database.

This is what that looks like zoomed in on the database. You have multiple data loader processes that load raw data into, and this is how I think about it, the left side of the database. In practice these tend to be either logical databases or maybe schemas within a database, if anyone's familiar with those. So you can separate your raw data from your modeled data, and the idea is that any downstream use case should exclusively be pointing at the modeled data. It's not going to point directly at the source data loaded by your data loader process; instead, it exclusively points at modeled, tested, documented data. That's how you can have BI and analytics and data science applications all operating on the same core abstractions, the same entities. The example here is: when we talk about "is a dog bed a mattress," you don't want to encode that particular bit of logic in every single analysis you ever do. You just want to represent it in your modeled data and then plug all of your BI and analytics and data science applications into the table that encodes that abstraction.

Maybe you'll get to this in a second, but can you talk about why materializing things the way you're doing here is better than a view-based approach?

Oh, fantastic question. I actually wouldn't say that it's better. dbt does both, and we're going to look at that in a couple of slides.

Okay, awesome. Thanks.

But yeah, you're right on: these objects on the model side could be tables, they could be views, and they could also be more esoteric objects like materialized views, things like that.

Okay. So, at a fundamental level, why did we build dbt? It really comes down to workflows. My background is computer science, many of your backgrounds are in computer science, and we're accustomed to things like version control, code review, automated testing, CI/CD. But at the very least five years ago, when we were starting to work on dbt, analytics was the wild west. There'd be tables in databases and nobody knew where they came from. The code was not version controlled. Nobody really thought about staging environments for data in a data warehouse, so you just did stuff in prod.
There was very little concept of documentation, because there was kind of nowhere to put this stuff, maybe a wiki, so a lot of it came down to just asking around to see if people knew where things came from, or what a user is, or what a mattress is. And automated testing: it's really challenging to do automated testing without a framework for testing the changes you're making. So testing looked like a lot of spot queries: does this look right? Good enough, ship it. But we really believe that this analytics workflow should look a lot more like how software engineers build applications, and we've tried to encode a lot of that into the dbt workflow.

So the dbt workflow is about version controlling SQL transformations with tools like git. It's about creating sandboxed development, staging, and production environments that are each independent. And, we'll talk about this a little later, dbt supports data testing out of the box: uniqueness, not-nullness, foreign key integrity, and enumerations (accepted values), I think, is the other one, as well as custom data tests you can apply to your data. We'll look at an example a little later. And last, and we won't touch on this too much, there is a way to do CI and CD with dbt. So you make a change, you want to merge it, you want to deploy it out to production; this helps avoid human beings needing to run queries to promote stuff to production, which is often pretty fraught.

Why bother with SQL? A lot of folks write Scala. Why did we choose to make dbt support SQL exclusively? Well, we think SQL is the lingua franca of analytics. It's been around for decades and we think it'll keep being around for decades. Every analyst worth their salt either already knows SQL or could learn it pretty quickly. Yeah, I see some claps in the chat. I'm glad people get excited about this; I definitely do. The realization for us is that it's a lot easier to help train a data analyst to write really good SQL than it is to take someone who's, and I'm not calling anyone out here, a really competent Scala developer and say, okay, now learn everything about marketing use cases for data. It's a misalignment of skill sets and areas of expertise. So it's these analysts who can bridge the two worlds of deep domain context and technical capability. If you do both of those things, we would call you an analytics engineer, and you're the core persona we're trying to address with dbt. And we've seen a lot of people learn SQL this way. This is a real thing; it isn't make-believe.

You're not going to get very much pushback about choosing SQL from this audience.

I love to hear it. Okay, great. So yeah, dbt doesn't support Scala yet. I guess we'll see what the future holds, but so far, so good with SQL. If y'all are sold on SQL, I don't know that I'd say much more than what's on the slide. SQL is very capable, especially on modern warehouses. It makes things way more accessible: it's just a client on your machine, right? And, yeah, this last point: SQL is definitely version controllable. This is why dbt isn't a visual drag-and-drop tool. Those kinds of files are very hard to version control, hard to diff, and hard to code review, whereas with SQL statements you can do all of those things. Okay. How does dbt work?
Awesome. I just want to check how we're doing on time. We have until 5:30? 5:30, perfect, so we're cruising. Okay. Oops, I went too far. All right.

So, how does dbt work? Well, you write SQL select statements. You can see one of these at the bottom of the screen. The astute observers among you will note that this isn't a valid SQL statement: there are curly brackets in here. This is a templating language called Jinja. What dbt actually does is interpolate, or, as we say in a slightly grandiose way, compile, that Jinja into actual SQL that can be executed against the database. And finally, dbt wraps your select statement in DDL and DML. So the user has written a select statement like this, but when dbt runs that model, and this really is model code here, dbt actually goes off and creates a table or a view in the data warehouse.

There are two steps here. One is the interpolation of this ref function; ref is for reference. By interpolating the reference, you're not hard-coding a dependency on a particular table, and that is what gives dbt its ability to operate in different environments. Via configuration, you could say: build all of this into my personal development schema, or build all of this into the production analytics schema. That's how dbt creates different environments. The same thing happens when we wrap the select in DDL, or maybe DML, as we'll see: you can create objects in different places dynamically, because the user isn't tying their logic to a particular physical location in the database. It's more like a logical representation of what they want to build, and where it lives is all via config.

So here it is side by side, and the really important thing to call out is that this ref function is how dbt actually builds the DAG we looked at a little earlier. It's a very intuitive workflow for building up this DAG, because you're not consciously building an adjacency list of edges or anything like that. You're just saying, in every single select statement, here's where I want to get my data from, and dbt can implicitly build the DAG from your select queries. Cool, any questions at this point? Not very controversial. Great.

So here's what it looks like to run dbt on the command line. At the top is a select statement like the ones we looked at, and when you do a dbt run, that's actually the command, dbt will go off and build all of those objects in the database. In this case, we're saying these are all going to be views; that's a configuration thing we'll talk about. And here's the schema: ecom is for ecommerce, but you can name it whatever you want, it's all based on configuration, so you can have production, staging, development, what have you. The other thing to say is that dbt will intelligently parallelize these models based on the shape of the DAG. It can be pretty efficient: we've seen projects with 5,000 models and who knows how many edges, and dbt will parallelize that to the best of its ability. In this case it ran with four threads, which is what that line says.
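As a rough illustration of the ref() interpolation and DDL wrapping described above, here is a hypothetical model and approximately what dbt executes for it. The model and schema names are assumptions, and the exact generated DDL varies by database, materialization, and configuration.

```sql
-- models/customer_orders.sql (what the analyst writes)
select
    customer_id,
    count(*) as order_count
from {{ ref('stg_orders') }}
group by 1

-- Roughly what dbt runs after compiling the Jinja and wrapping the select
-- in DDL; the target schema (a dev sandbox vs. production analytics) comes
-- from configuration rather than from the model code itself.
create view analytics.customer_orders as (
    select
        customer_id,
        count(*) as order_count
    from analytics.stg_orders
    group by 1
);
```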
So let's talk about how dbt actually goes off and creates these things. We did show create table as, but that's a little bit simplistic. Materializations are how dbt actually creates objects in the database. The objects are typically views or tables, but zooming out from the physical object in the database, we can think about the logical strategy for creating it. View or table: okay, that's pretty easy, you just replace the view or the table in the warehouse. But incremental is where things start to get fun. In this case, you actually write SQL that defines a patch of new data to upsert into a table. You do this when you have a really large data volume: you don't want to recalculate the whole table, you just want to insert the new data from yesterday, for example. You can do that with the incremental strategy. And there's a really fun one, ephemeral, which is sort of not a real materialization; it just compiles queries into other models. I don't have an example of that, but we can talk about it if it's interesting.

So for the incremental one, the user has to write the code to integrate the new data?

Yeah, there are various ways you can do this. There are simple versions that are not very efficient, and there are sophisticated versions that can be really efficient. In the most naive version, you tell dbt a primary key for each row, and dbt will automatically figure out what to do, to the best of its ability. What you do need to do is give dbt a way to filter for only new records, for whatever "new" means for you. That can be new since the last time the model ran, or new since yesterday, or new since yesterday minus seven days, for what we call late-arriving facts. But yeah, users have full control over how they define the new data to upsert.

So you're hand-rolling a materialized view through SQL?

Yeah, exactly.

Without the benefit of whatever optimizations the database would give you.

Right. So, I am the number one desirer on the planet of really good materialized views. Unfortunately, materialized views have pretty draconian limitations, at least in terms of their implementations on things like Redshift and BigQuery. You can't do joins, you can't do count distincts, and there's a lot of window function stuff you can't do, like frame clauses. And these are exactly the things people rely on to model data. We'll actually talk about this, it's just one slide at the end, so I guess I'll spoil it: there are really cool new database technologies that aim to help with this. One of them is called Materialize, and it's like a streaming data warehouse where everything is incremental in nature. That's a very, very good thing for dbt. We don't want users to have to re-implement materialized views in user space, but the physical realities of how materialized views are implemented on Snowflake, BigQuery, and Redshift today, at the very least, motivate the need for users to define this sort of incremental logic.

Yeah, Materialize gave a talk here last year. I mean, materialized views are strictly hard; people have been working on them for 35 years now, I guess. Their effort is pretty good.

It is, but the big vendors have to catch up. The cloud vendors have to catch up. Yeah. So there's actually... okay, Steven, I'm seeing your comments. Yeah.
I'll tell you, you're on the money. I want databases to be able to use all the information they have: both the underlying data in the database, for optimizing queries, plus the deltas to upstream tables. dbt manages this DAG and is responsible for making the data flow through all the models, right? You're creating tables that cascade into incremental upserts that cascade into views, and so on. The database is really well positioned to do that; it just can't be done that way today, at least on these cloud data warehouses. So that's one of the things I'm most excited about in the data warehouse space. I would ideally love to delete thousands of lines of code from dbt and make it not need to know how to do this. But until the cloud databases can implement materialized views with less draconian constraints, it's a real benefit that dbt can help you implement this logic incrementally.

Cool. So dbt has to manage materializations, at least for the time being. These are some of the attributes we look to enforce, or ensure, in the materialization logic. We want these to be atomic. We want them to be transparent: external queries shouldn't know that the transformations are happening. You should be able to run them many, many times over without funny things happening; that's idempotence, I think, is how folks say it. Of course, they should be correct: it's very important that the logic you write translates into correct data in the data models, otherwise we haven't solved the problem. And then, of course, efficient and environment aware. We want to build a programming model that people can really throw a lot of their business logic at, one that adheres to the dbt paradigm. So that becomes the big challenge: how do you write DDL and DML for different databases that does all of these things? To me, that's the really fun part of dbt.

So here's an example, actually the same model we looked at earlier, but here you can see this configuration construct. This is how you tell dbt to materialize this select statement as a table in the data warehouse; by default, dbt would create a view. So look at the table materialization. This is the naive approach, and I wish it were this easy: within a transaction, drop the table if it exists and create the new table using this create table as construct, which is supported on every one of the databases listed above. You give the database a select statement, you say create table as select, and the database goes off and creates that table for you. That would be great, if this were how it worked. Unfortunately, BigQuery doesn't have transactions, so that's out the window. Okay, they do, but they're pre-alpha and only supported in scripting, so we can't use them yet. Snowflake has transactions, but I think they auto-commit, and so, very strangely and counterintuitively to me, and if anyone knows more about this I would love to pick your brain on it, I think the transaction gets committed when you run the drop table statement on Snowflake, which defeats the purpose of the transaction in a way that I have not personally internalized in a way I feel good about yet.
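For reference, here is a sketch of the naive table materialization just described, written as generic SQL and assuming a database where DDL participates in transactions, which, as Drew notes, is exactly the property you cannot count on everywhere. The schema and table names are hypothetical.

```sql
-- Naive "table" materialization: rebuild the table and swap it in,
-- relying on the transaction to keep the swap atomic.
begin;
drop table if exists analytics.dim_customers;
create table analytics.dim_customers as
    select customer_id, first_name, last_name, created_at
    from analytics.stg_customers;
commit;
```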
And on Redshift, and those are kind of the big three with dbt, if you try to do this, anyone running a query against that table at the moment you drop it gets an error saying the table was dropped by a concurrent transaction. You can't really drop tables inside a transaction on Redshift. So it's fun, it's a puzzle: how do we make this table materialization work on each of these databases? Well, we can create different implementations for different databases. We're going to breeze through this part because it's a lot of code, and I'm happy to share the slides too, but basically you do the swapping inside of the transaction and you drop the old table after the transaction has committed, and that fixes your problems on Redshift. Great. On Snowflake and BigQuery it's actually very easy: there's a create or replace table construct, which is an atomic and idempotent operation that will just wholesale replace a table.

Is this a temp table? I didn't catch that.

This would actually not be the temporary table; it would be the final table. There's, of course, nuance. Again, this is what makes it fun: what if the thing already in the database under that name is a view and not a table? That actually fails. So there's a lot of logic in dbt for dealing with all those possible cases. That part is less interesting.

But let's talk about the incremental case, because this is what we were touching on before. You want to upsert new data into a destination table, and you do this because it's more efficient to conceptually replace a partition or a subset of the table than to rebuild the entire thing. Here's what that looks like at a high level; this would be the Redshift-compatible syntax. You create a temporary table for your transformation, and this only includes the data to upsert, not the entire contents of the table; say, data from the past week. You delete everything in the destination whose primary key matches the data in that temporary table, and then you insert from that incremental temporary table into the destination table. This is actually how dbt does it. It's a little bit funny: it would be more intuitive to do an update and an insert, but as I understand it, most databases implement an update as a delete plus an insert, and when we tested it, a delete plus insert was faster than an update plus an insert. So that's the pseudocode for how dbt incrementally builds the table.

Recurring theme: BigQuery and Snowflake are really very good at this, and it's supported on other databases too. Here's merge syntax, which is really ergonomic; it does exactly the thing dbt needs to do. You merge into a destination table from a temporary table on some primary key, and then there's this set of clauses: when not matched, meaning there's no matching ID, you insert; and when matched, you update. You can actually get really carried away here and delete things that should be deleted if there are upstream hard deletes, things like that. But yeah, merge is really cool and ergonomic for this. That's materializations in a nutshell.
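Here is a hedged sketch of that incremental pattern: a model-side filter using dbt's is_incremental() macro and {{ this }}, plus the rough shape of the merge a warehouse like Snowflake or BigQuery can run. Table and column names are hypothetical, and the date arithmetic syntax varies by warehouse.

```sql
-- models/fct_page_views.sql, configured as an incremental model
{{ config(materialized='incremental', unique_key='view_id') }}

select view_id, user_id, viewed_at, duration_seconds
from {{ ref('stg_page_views') }}
{% if is_incremental() %}
  -- on incremental runs, only select new rows, with a lookback window
  -- for late-arriving facts
  where viewed_at > (select max(viewed_at) from {{ this }}) - interval '7 days'
{% endif %}
```

```sql
-- Roughly the shape of the merge statement for the upsert, using a
-- temporary table that holds just the new rows:
merge into analytics.fct_page_views as dest
using analytics.fct_page_views__tmp as src
    on dest.view_id = src.view_id
when matched then update set
    user_id = src.user_id,
    viewed_at = src.viewed_at,
    duration_seconds = src.duration_seconds
when not matched then insert (view_id, user_id, viewed_at, duration_seconds)
    values (src.view_id, src.user_id, src.viewed_at, src.duration_seconds);
```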
The really interesting thing to say here is that we made a very intentional decision with dbt to implement materializations in user space. You, as a user of dbt, can actually create your own materialization in your own project. You don't need to fork dbt or install anything into your Python environment; it's sort of like a macro, and we'll look at macros in a second. But you get to define your own materialization, so if you don't like the way dbt builds these things, you can go off and write your own logic for creating or updating objects in the database. It would be really interesting to talk about some of the longer-tail bits of dbt, but I'm wondering if anyone has any thoughts or questions at this point.

I think you're good.

Cool. Okay. So let's just remember that all of this is in service not just of creating objects in the database; it's about building these abstractions, and specifically about empowering your data analysts to go and build them. That's why all of this exists. Certainly this SQL existed without dbt, and certainly you could write Python code that would run the same queries, but not with the workflow around it. So I just want to touch on some of these pretty quickly.

Testing. dbt supports what we call a schema test. This is a YAML file that defines models that have to do with GitHub data; we like to report internally on how many community contributions there are in dbt releases. One of the tests we want to apply is that a commit SHA in the staged GitHub commits table should be unique and not null. If a SHA is ever not unique, or if it's null in this table, then something bad is going to happen. That could be, within our modeling, trying to join on a field that's null when we don't expect it to be null, and then we lose records. If it's duplicated, you can get what's called a fan-out, where you've multiplied records in a join when you didn't intend to.

This kind of testing becomes really important for dbt because there are two things happening at the same time. One is that you're changing your business logic, writing new models, refactoring code, whatever it might be: you're actually changing your logic. The second is that you're constantly receiving new data through the extraction and loading process, and upstream services change their behavior. It's probably unlikely that GitHub would change in such a way that commit SHAs become duplicated, but there's a data loading process that could break or have a regression, something like that. So the goal is to be able to notify the people responsible for these data models that an assumption you made no longer holds. You're stating your assumption that commit SHAs should always be unique and not null, and if that ever stops being true, you can get notified about it. In the same way that you can do a dbt run and build all your models, you can run dbt test, which will actually go and run queries to assert that all of these things are true.

This is really important because data warehouses typically don't enforce column constraints. It's actually very interesting: you can put a constraint on a table in Redshift, like a unique constraint, and Redshift will use it for query planning, but it will not enforce it. And if it's not true, you'll actually get incorrect query results: it will make assumptions about your data, build a query plan, and get the wrong answer. So column constraints are not the answer in data warehouses in the way they might be on transactional databases like Postgres. You've actually got to run the queries and test it. Cool.
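The YAML for a test like that is only a few lines, so the more illustrative piece may be the query it boils down to. A uniqueness test on the commit SHA effectively runs something like the following and fails if any rows come back; the exact compiled SQL differs across dbt versions, and the table name here is an assumption.

```sql
-- Roughly what a unique test executes against the staged commits table
select
    commit_sha,
    count(*) as occurrences
from analytics.stg_github_commits
group by commit_sha
having count(*) > 1;

-- And the not_null test is essentially:
select *
from analytics.stg_github_commits
where commit_sha is null;
```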
So these tests codify assumptions about the data. Those assumptions could be broken if the code changes in a way that regresses logic, or if your upstream data changes in some way you don't expect, like your company starting to sell dog beds, which you hadn't expected when you wrote the code. And yeah, it's really important to test for uniqueness and null-ness in your data, because if you don't, well, I can picture the analyst at Facebook who took an average over view lengths that had nulls in it when they weren't expecting that, and it leads to the collapse of print journalism. It happens.

The next thing to talk about is macros. Software engineers don't tend to copy and paste code, at least within a code repository; surely we copy and paste from Stack Overflow all the time, but analysts and data scientists are very accustomed to doing this. In fact, I've seen it hundreds of times over: there are these 500-line-long SQL queries that get shopped around across the organization, and people change out specific lines of the query to change date ranges, or to focus on particular categories, or whatever. The point is you have this really meaningful asset that gets duplicated dozens of times over, and it's just an opportunity for things to become inconsistent. So macros in dbt are for code reuse, and they help keep logic DRY, as we say: don't repeat yourself. This helps enforce consistency and keeps all of your logic in one place, not copied and pasted a hundred times over.

So here's what it looks like, a pretty simple example. It's a macro called cents to dollars. It takes the argument column name and, I guess, an optional argument, precision. It does this little bit of SQL to divide the value of the column by 100 and format it as dollars. Is this particular macro changing the world of analytics? No, not really, but something like this is helpful if you want to avoid copying and pasting this exact string 40 times over. This is a pretty straightforward example of what you could do in a macro, but we've seen things like: there's a mapping company that uses dbt, and they care about the haversine distance, which is this very long equation for the distance between two points on a sphere, like the Earth. Rather than write out that equation in ten different SQL queries, they put it in a macro, and then you can call that macro like you would call a function. It interpolates; it compiles out to valid SQL that can be run against the database.

Why would that not be better executed as a UDF?

Yeah, great question. Something like this would probably be better executed as a UDF. One of the drawbacks of UDFs is that it's a little bit hard to, I'm going to say, version them and make them environment-aware. If I want to change the haversine distance logic, well, I need a special version I use in dev and I have to promote it to prod, and that becomes a bit of a challenge. It's certainly tractable. Maybe this isn't the most useful example.

Oh yeah, I'm not talking about this one specifically. From your perspective, as the people building dbt, could you recognize that something would actually be better executed as a UDF?
Now, UDFs become a black box for the optimizer, which is problematic on its own. SQL Server can handle it; nobody else can. But someone could still define a macro and you'd say, hey, this should probably be a UDF, let me make it a UDF. Completely possible. That actually gets at some of the most interesting things we're thinking about.

So one thing I'd say is that Jinja is a full templating language, built, I think, primarily for web development, so for creating HTML on a website. You can do for loops and if statements, and we didn't really show that here, but you can dynamically build SQL queries. Here's a very common example: you iterate over a list of categories and pivot out a column. So you can build a for loop that does ten different counts, a sum of "case when product equals dog bed then 1 else 0": tell me how many dog beds we sold. That's how you would do it. I kind of regret not putting an example in. But that's more akin to the very appropriate uses of macros: when you actually want to build SQL statements that you couldn't otherwise build with a UDF, if that makes sense.

Okay, so in your example, dbt is interpreting that for loop and then spitting out the expanded SQL. It's not that you're somehow magically converting the for loop into some kind of dynamic query that does something on the server side.

Yeah, right, dbt is templating out the SQL, so I'd say yes. Now, this is where it gets kind of wacky. You can hard-code a list of categories and then loop over them and build aggregates, but you can also just query the database and ask for all the distinct values in a column, and dynamically pivot out a table based on the values that are actually in, say, a category column. And maybe, I don't know, what's our appetite for going off script? I guess I could pull up an example if that's interesting.

So basically you're saying that these macros can actually make invocations to the database to fill in certain things.

Yeah. Let me pull up dbt-utils. We're not even going to talk about this other part, but, okay, pivot. So this is the macro; it's namespaced under the package, called dbt_utils. It will pivot out the values in a column. This is your input, like size and color, and it can pivot that out into one column for each distinct value, in this case with a count. I can share a link to this; it's probably worth staring at for a little bit longer. But this is an example of pairing it with the get_column_values macro, which fetches the distinct values in the column. It's a little bit hokey, and probably not super worth digging into the source code together here, but it's a cool example of how to chain these things together and dynamically do things that databases aren't actually able to do themselves. You can generate the SQL to ask the database to do it, which is kind of funny, but you can pull it off.

Are you doing any type checking as you do the interpolation?

No. But we should be. That's the next frontier.

So let me check; we're actually doing okay on time. dbt has documentation features that are really powerful, especially when we want people to have shared understandings of data models. Here's what that looks like. This is all open source, so I encourage you to poke around with it.
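Going back to macros for a moment, here is a hedged sketch of both ideas from this part of the talk: a small cents-to-dollars style macro, and a Jinja for loop that pivots out a hard-coded list of categories. The names are hypothetical, and dbt_utils ships more general versions of pivot and get_column_values than this.

```sql
-- macros/cents_to_dollars.sql (hypothetical)
-- Usage in a model: {{ cents_to_dollars('amount_in_cents') }}
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}

-- models/product_counts.sql: loop over categories to build one count per column
{% set categories = ['mattress', 'dog bed', 'pillow'] %}

select
    order_date,
    {% for category in categories %}
    sum(case when product = '{{ category }}' then 1 else 0 end)
        as {{ category | replace(' ', '_') }}_count
        {% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_order_items') }}
group by 1
```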
But let's not dig into that here. The next frontier: this is what we were touching on just a moment ago. I have a lot of thoughts; most of them are, I don't know, either good ideas or bad ideas. But I'm very interested in the idea of typing for macros, and specifically the ability to do the kind of validation and linting you would get if dbt, or a tool like it, could analyze a query that had Jinja plus SQL and tell you, yes, this is valid, or no, it's not. That's a real pain point today, because Jinja is such an analysis fence. It's very challenging to parse both the Jinja and the SQL when the Jinja is untyped and can generate fragments of a select statement. But I think that if we could type Jinja and macros, we could actually pull this off and build really, really good tooling. I think that would be very exciting.

I'll just say there's lots happening in the data ecosystem. Materialize and Firebolt are both very, very cool, and certainly Snowflake and BigQuery and Redshift are all doing exciting things as well, not even to speak of Spark and Presto and beyond. I think we're asking ourselves a question like: how does dbt work in a streaming world? You'd run dbt run once, when your code changes, and the database does the rest. That's kind of fun. But how do tests fit in that environment? We have to figure it out. And we didn't really talk much about metadata, but that's very much the next frontier for us, I guess we'll say, as a business. dbt produces a ton of metadata when it does dbt runs: statistics about a table, even simply the number of rows in a table and how that's changing over time. But we can certainly make dbt gather more information and see how skews in the data might be changing, or cardinalities, or averages, things like that. So how do we collect this data and then serve it back to users in ways that aid with discovering new data sets, or create a little more observability into not only what dbt is doing, but what's happening in the whole warehouse? Since you have data loaders and dbt running and BI and data science queries, it's really hard to understand the whole system, and emitting metadata is a way to help with that.

Hey, it almost sounds like you're trying to build a database.

You know, it's funny. A catalog, maybe. I don't know; it's a good point. We definitely are not building a database. But it's a great thing: the innovations happening in the data warehousing space are just so good for dbt, because they mean we get to make dbt a little bit simpler and leverage what the data warehouse is really good at. dbt really becomes about the workflow for scaling this out to hundreds of people in your organization. I could talk more about what's on the docket for us, but really, we want to nail the workflow. We want to do as little replicating of database behavior as possible, and as much as possible around helping folks version control and collaborate and otherwise build data models. That's as much a social thing as it is a technical thing, and maybe if there's a point to this whole talk, it's that one: there's a real social side to databases and data, and we're trying to help make it a little more seamless and collaborative.

Is that the last slide?

Yeah. So that's all I've got for you today, if anyone has any questions at this point.
Okay, so I will clap on behalf of everyone else to note that this is the end of the talk. Okay. So we have a few minutes for questions for Drew. If anybody has them, unmute yourself and fire away. Otherwise, I'll be selfish and take all the time.

Hi, Drew. My name's Doug. I have a question. I'm a huge dbt user, and I'm interested in testing more than just uniqueness and not-nulls. For example, I want to have a set of sample rows and columns, and at the end of all my transformations, I want to make sure the result equals 10, or some fixed value. Then if someone changes the SQL and now it equals nine, I want to alert on that and do something about it. Is there anything you know of that has higher-level tests than just not-null? I know you have custom test queries, but those are more designed around the data in the database after it runs, like, oh, this value is negative, that's bad.

Great question. This is a real limitation of dbt today. I think the data team at, I want to say Shopify, did some work around unit testing with dbt. I think it's a totally integrated system that works in their organization and is less applicable to just dropping into dbt as is, but I think they got big parts of it right. It's about standing up fixture data, running it through your transformations, and codifying what the output should be. I totally think that's a thing we'll teach dbt how to do in the future, but today it's not yet implemented.

Okay, cool. Thanks. Great job. Really happy with the product.

Yeah, I have a question. How do people generally go about choosing how to cut off a transformation? Like, what is too big of a transformation? What is too small?

Oh, what a good question. It's part and parcel of: what's the right materialization? When do you need to reach for incremental versus table versus view? The thing that's so interesting is that if you take a data warehouse like Snowflake, you can just pay more money to run the query on a bigger machine and it'll run faster. So there's no hard and fast answer. In general, dbt only runs in a batch capacity, and if the thing takes a few minutes to build, it's not a huge deal. It becomes more of a challenge when you're actively developing the thing and your iteration loop is: change code, do a dbt run, wait five minutes. That's kind of tough. And so, we talked about environment awareness, you can write logic, very much using Jinja, that says: if my environment is dev, limit this to the last three days of data, otherwise do all time. That way you get a tighter development feedback loop, but in production you still get full historical data. So there are tricks like that. But generally, speaking just for our own business, I'd be annoyed if there was a trend line we cared about and it stopped six months ago, and the answer was, well, there's too much data. I'd be pretty happy to pay Snowflake to just go and crunch those numbers. That's my take, but it obviously depends on volume and cost and things like that.

Cool, yeah, that's helpful. Thank you.

Yeah, good question, thanks.
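A minimal sketch of the environment-aware trick from that answer, using dbt's built-in target variable; the three-day window, the model name, and the dateadd syntax (which varies by warehouse) are just for illustration.

```sql
-- models/fct_events.sql: full history in prod, a small window in dev
select event_id, user_id, occurred_at
from {{ ref('stg_events') }}
{% if target.name == 'dev' %}
  -- keep the development iteration loop fast
  where occurred_at > dateadd(day, -3, current_date)
{% endif %}
```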
Okay, I have a couple of questions. Since you have the whole DAG, you know what the pipeline looks like going from one stage to the next. Have you thought about doing any kind of optimizations, on either the SQL rewriting side or the DAG itself? You have the high-level view of what they're trying to do, whereas the data warehouse tries to optimize one query at a time. Does that make sense?

Yeah, it's a very good question. I'm trying to think about ways we could leverage that information. I can tell you that I don't think we do much of that today, with the exception of parallelizing the builds while ensuring DAG order.

An obvious one would be: for these dbt DAGs, which I'm assuming run in production, if someone defines a table that is only slightly different from this other table over here, and they're both being materialized, then instead you could materialize it once and put two views on top of it to do whatever additional processing you wanted.

Yeah. That's cool; dbt does very, very little of that today. I think our angle is probably going to look more like linting, or something like that: ideally telling you, hey, you've got two models that share similar code, meaningfully similar logic, so why don't you consider pulling that out into a parent model for both of them? Rather than dbt dynamically rewriting things, it's the workflow again: don't duplicate the same logic in two different places, that's a liability. You're better off having a single model that can be tested and documented and can feed two downstream use cases. So the design philosophy of the tool is to push things back to the user, so you fix it, rather than having the optimizer try to rewrite stuff itself.

That's fair. Okay.

Exactly. And I was thinking about this when you said, hey, you're pretty much building a database. I can just tell you that my mental model for dbt is that it's doing judo. These data warehouses are big and heavy; "opponent" is the wrong framing, say "partner," but how can you use their weight to do the thing?

Exactly. And that's a little bit of judo with users too. The trick is they have to do all the hard work; you just run some SQL.

Yeah, but it's all about creating the workflow where they can do it. Cool.

So, other than materialized views and more consistent transactional semantics, is there one feature you wish all of these database systems you have to support had, and had in the same way, to make your life easier? Besides those two obvious things, transactions and materialized views.

Yeah, those are definitely good ones. If I had to pick one thing, it's actually kind of funny, I would point to publishing their grammars. You can find the BigQuery grammar under the ZetaSQL library on GitHub, but there's really no good way to see Snowflake's grammar, and it's one of the things limiting our ability to build really good tooling around validation. There's more work for us to do on our end, certainly, but it becomes...

Have you asked them for it?

I guess we could.

Yeah, I'm sure you have a direct line to those guys. Okay, so my last question is, and we can bleep this, or cut it out of the video, if you don't want to say it publicly: is there one database system,
of all the ones you have to support, that has the most one-off specialized handling, because they do something that's different from everyone else that's a pain in your ass?

Yeah, so, we're partners with all of the databases listed here, but...

We can bleep it if you say it.

No, you don't even have to do that. Part of our brand is just this: we started as a consultancy. For three and a half or four years we did analytics consulting, built dbt open source on the side, and wrote a lot about the work we did. So I've personally used all of these databases under a deadline to deliver client work, and you really get to know a tool when you're on a deadline to produce something with it, maybe in some ways using it in anger, as they say. Of the databases on this list, Redshift is by far the one with the most quirks. I think it has come a long, long way in the past four years, but I can tell you there were times when I was opening bug reports because we would run queries that caused Redshift clusters to restart, or, I'd have to go way back, but dropping a table inside of a transaction; there's no reason that should have been the case. There were weird behaviors: their views were early binding, as it was called, so if you dropped a table and there was a view that selected from it, you'd have to cascade the drop and actually drop the downstream view as well. That would be very upsetting for users, because the view they were trying to query would disappear, and we couldn't build both of those things atomically. Redshift eventually introduced late binding views; that took them quite a while, and they had a lot of problems along the way. I think that's where we got the database restarts from. So they've come a long, long way. dbt started with Redshift as the primary database it supported, and you can see in all of our abstractions that they were built for a world of Redshift. It's why I showed it first: this is how you do it on Redshift, and then that's how you do it on Snowflake and BigQuery. So I'm excited to see Redshift keep innovating. I think they've got a good opportunity.