We have the much-coveted 2 o'clock slump spot, so hopefully we're all feeling pretty energized. And if not, we can get you there by the end of this talk. So I'm going to talk a little bit about how we can implement DataOps using GitLab today. But first, let me introduce myself. My name is Emilie Schario. I've been at GitLab for about a year and a half. I was hired as our first data analyst, then I moved into a data engineering role, and now I'm in a strategy role on the chief of staff team. I've kind of built a career out of being the first data person that your organization hires when you decide you want to build out a data team. Between jobs I've held and consulting projects I've worked on, I've helped about a dozen companies set up their data stacks. Apparently you're supposed to have a fun fact on these slides, so if you want to start a conversation with me later, ask me what my one-rep max squat is. I'm kind of a serious person, so that's the closest I come to a fun fact.

Before we move on, I just want to give folks a heads up: I'm going to tell you how we're doing DataOps at GitLab. As I mentioned, I was on the data team for a long time, but what you're going to see is a lot of similarities to DevOps, which is something most folks don't expect. We'll have time for questions at the end, but as we go through the slides, if things come up or things are unclear, please do write them down so we can talk about them.

So let's talk about spreadsheets. I'm a bit biased, but I think data teams get a bit of a bad reputation. A lot of people have probably seen this, or if you haven't seen it, you've felt it: situations where two people have two different Excel files or Google Sheets, and they both have some metric. This metric might be revenue, or it might be users, new users, new signups, new orders, whatever it is. But two folks, two spreadsheets, and two different numbers for the same metric. I had a previous role where I probably spent half of my time trying to explain to people why their two spreadsheets had different numbers for the same thing. Whether it's two charts, two spreadsheets, two hieroglyphics on the wall, whatever it is, it can be really hard for your data team to be successful when they spend all their time explaining to you why your spreadsheets are different.

Another piece of baggage data teams often get saddled with: you've got a nifty chart that you use to do your job on a regular basis. You come into work Monday morning, energized after that wonderful cup of coffee, you open your chart, you look at the numbers, and you just know they're wrong. You don't know what happened, you don't know what's going on, you just know something's wrong with the data. So data teams get into this tough situation when they spend all their time solving these kinds of problems. And really, there are three parts to this: data integrity, data quality, and data reliability. It creates a vicious cycle where people don't trust the data team's numbers, so they don't use the data team's numbers. The data team spends all their time putting out fires and fixing numbers that people don't trust, because the numbers look wrong when people check them, so people ping the data team to check the numbers again, and you see the problem here, right? It's really hard for data teams to do things that add value when they spend all their time putting out fires for other teams.
But if they never add value, people don't want to work with your data team. So data integrity, data quality, and data reliability problems are things that need to be solved before analyses make it out to your end user, and that's usually a member of your own team.

Riding in on a white horse to save the day, much like in the Disney fairy tales you've all watched on Disney Plus, is DataOps. Before we dig into that, it's important to understand: what is DataOps? DataOps is an automated, process-oriented methodology used by analytical and data teams to improve the quality and reduce the cycle time of data analytics. This comes from the DataOps Manifesto, and if you're familiar with DevOps, I bet certain parts of this sound familiar, right? Reducing cycle time, improving quality. DataOps has a lot of different parts to it, but there are 18 principles of DataOps. I'm gonna give you a second to read these. There's a lot of information there, right? 18 principles, that's a lot. I'm gonna summarize them for you in one sentence: analytics is a subfield of software engineering. Take a second to think about that. Analytics is a subfield of software engineering. More is being expected from data teams than ever before. Data roles are more technical. Being a data person is no longer about knowing how to use a spreadsheet really well. Data folks are writing code just like the rest of your engineers.

So what does this look like in practice? First, like I said, analytics is code. Data teams use a variety of tools to access, integrate, model, and visualize data, and fundamentally, each of these tools generates code and configuration which describe the actions taken upon the data to deliver it. When you make your work, your data analysis, reproducible, you create an environment in which everything is versioned, from low-level hardware and software configs to the code and configuration specific to each tool in the toolchain. So let's think about what this means when applied. When you have your definition of, say, revenue, or your definition of a new user, stored in code, then any time you want to change the definition, you create an MR; I'll show a quick sketch of what that can look like in a moment. It's a really different way to work from the way most teams work today, where, I don't know, you update an Excel spreadsheet tab, or you create a new tab, or one that says New New Final, or New New Final January 2020.

When you're putting these principles into practice, you also work in disposable environments. This is important because you minimize the cost for team members to experiment when they're working in environments that simulate production. When they're free to work in isolated, safe, disposable environments that mimic production, they know that things don't just work on their machine. And finally, the beginning-to-end orchestration of data, tools, code, environments, and analytic teams sets these teams up for success. So I've given you the high level: this is what DataOps is, it sounds really fancy, we work like code. But what does this look like?
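To make that "definitions live in code" idea concrete, here's a minimal sketch, not our actual model, of what a metric definition stored as code might look like as a dbt-style SQL file. The table and column names here are invented for illustration:

```sql
-- models/monthly_revenue.sql (hypothetical example, not a real GitLab model)
-- The definition of "revenue" lives in version control, so changing it
-- means opening a merge request and getting review, not editing a
-- spreadsheet tab called "New New Final January 2020".

select
    date_trunc('month', invoice_date)          as revenue_month,
    sum(amount_usd)                            as gross_revenue,
    sum(amount_usd - coalesce(refund_usd, 0))  as net_revenue
from {{ ref('invoice_line_items') }}  -- upstream model; name assumed
where status = 'posted'
group by 1
```

Anyone proposing a new definition of revenue edits this one file, and the MR diff shows exactly what changed and who approved it.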
Before we dive into the specifics of how we apply this, it's important to think about what a data stack is. For those of you who may not know, I'm gonna walk you through it using the analogy of baking a cake, where what comes out at the end is your cake, and what goes in is your data. Your data is like the eggs and flour that go into a cake, although usually I just get mine from the box on the shelf at the grocery store. So that's what goes in, your data, and you can use a number of tools to move that data into what's called a data warehouse. There are a couple of different data warehouses on the market, but the general gist of a data warehouse, and what makes it different from a database, is that it's optimized for analytical purposes. Instead of being focused on row-based storage, like an OLTP database, it's optimized for column-based storage, or OLAP, because once your data is in your data warehouse, it can be used for whatever purposes you need, such as business intelligence, building machine learning models, whatever your use case is. Now, it's important to note here that the process by which we move this data goes by the acronym ELT. The E, the extraction piece, is where you pull data from one place to another; think of this like bringing your ingredients home from the grocery store. The L is where you load data into your data warehouse; this is where you open all your packages and pour them into your mixer. And the T is where you transform it; that's when you turn on the KitchenAid and let it do its magic. Or when you bake the cake, because somehow you're supposed to have a cake at the end; otherwise you'd just be eating the raw ingredients.

Now that we have an understanding of what a data stack is, we can talk about how DataOps fits into it. It starts with version control, and for those of you who may have been doing DevOps for a long time, this seems really simple. But like I mentioned, so many data teams around the world are operating in this spreadsheet universe where their version of version control is "final January 2020 edited, I mean final this time for real," right? What you're looking to do is create a single source of truth. When you start working in version control, you establish exactly that: one place where people are working together.

Once you've gotten over the version control hump, the next piece is working with merge requests. I'll mention here that this is the actual GitLab data team project, which is public. One of the cool things about working at GitLab is that everything we do is in public. This will be posted on a slide at the end, but if you go to gitlab.com/gitlab-data/analytics, you can poke around and find exactly this MR that's up on the screen. When you're working with version control, you create the opportunity to work in MRs, and when you're working in MRs, you help people understand why you're making changes, you establish code reviews; all the things that apply to DevOps are things data teams benefit from when they implement DataOps. One of the perks of working with MRs is that you've now got a change management system in place. So when your definition of what a customer or a new user is changes, you're going through an approval workflow. This is, like I said, an actual screenshot from that same MR where some changes are being made. And if you're working in MRs, you can have pipelines that make sure what you're releasing isn't introducing any new regressions or data quality problems into your workflow, much like a software development workflow.

If you're working in automated pipelines, then you need disposable environments to make sure that you're testing in environments that look just like production. There are two ways that we've seen this be done effectively. At GitLab, we use something called Snowflake zero-copy clone to implement the first approach that you see on the slide: we clone the data warehouse in its entirety for every MR.
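As a rough sketch of that idea, assuming a Snowflake warehouse with a database called `analytics` (the names here are illustrative, not our literal setup), the cloning step boils down to something like:

```sql
-- Hypothetical per-MR clone using Snowflake zero-copy cloning.
-- CLONE is a metadata operation, so it's fast and doesn't duplicate
-- the underlying storage.
create or replace database analytics_mr_1234 clone analytics;

-- The MR's pipeline then runs and tests its transformations against
-- analytics_mr_1234 instead of production.

-- When the pipeline finishes, the disposable environment is dropped.
drop database if exists analytics_mr_1234;
```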
That way, we're working in an environment that is a clone of production, and we're testing against that clone to see if anything is being introduced that shouldn't be. Depending on the size of your data warehouse, though, that approach may not scale. So what we've seen other teams do is generate a data warehouse that's a statistically representative sample of their production data and use that for their disposable environment. Like I said, we do the former, and we've seen teams do the latter effectively. It's just a trade-off of compute versus engineering resources.

So: you're working in version control, you're using MRs as your change management system, you're using automated disposable environments to make sure that you've got testing going on in your merge request process, and you're making sure that people are working in environments that simulate production. What's left is the question of how you do testing. And that's where DBT comes in. DBT is an open source tool; it stands for data build tool. I just realized that the letters DBT are not actually on this slide, but it's DBT. So I'll tell you a little bit about what DBT is first. DBT is the T in the ELT process that I talked about earlier. You bring your data into your warehouse raw. We strongly encourage folks to follow the ELT paradigm. The advantage to this is that when your business logic changes, when the definition of revenue or revenue recognition or whatever it might be changes, you don't need to reload your data from its source. You've brought it in in its raw form, and you just re-transform it. Like I mentioned, DBT is the T in ELT. It's a command line tool that lets you write select-based SQL queries and turn them into an analytics engineering workflow. It also allows you to snapshot, transform, test, deploy, and document your data, all while leveraging version control, alerting, and logging in your cloud-native data warehouse. We'll talk about what that looks like in just a second. DBT supports the most popular data warehouses on the market today.

Here's an example of what a DBT model might look like. You'll see that there's some funky Jinja configuration going on at the top, and then, if you're familiar with SQL, the bottom portion there, lines eight through 17, probably looks familiar, except again, you're seeing this pattern of Jinja. For those who are unfamiliar, Jinja is a templating language, and it allows you to write DRY SQL. This is really powerful because you might utilize a metric in multiple places, and just like writing DRY code, you define things once and you get to leverage them in multiple places. So here's an example of an incremental DBT model, where is_incremental, what you see on line 14, is defined once; we call that in a macro, and we get to use it every place we build data incrementally. Here's another example of a macro being leveraged. This macro is calculating the churn type. It takes two inputs, original MRR and new MRR, and then it's a case-when statement, a pretty straightforward piece of SQL. But by defining this in one place, we can call this churn type macro anywhere in our DBT project.

DBT also ships with testing, and that's what this is showing. There are two kinds of testing that DBT ships with. The first is kind of rudimentary data quality and data integrity checks. What you see here are not-null and unique tests.
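In DBT, those built-in tests are declared in YAML alongside your models. A minimal sketch, with the model and column names assumed for illustration:

```yaml
# Hypothetical schema.yml snippet declaring dbt's built-in tests.
version: 2

models:
  - name: dim_customers        # model name assumed
    columns:
      - name: customer_id
        tests:
          - unique             # no duplicate identifiers post-transformation
          - not_null           # no missing identifiers
```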
A lot of data warehouses don't enforce constraints on identifiers the way we see in production databases, so we want to make sure we're testing for this in our data, and we especially want to test for it post-transformation, because we want to make sure we're not introducing anything we shouldn't be. So this is an example of that low-level data quality testing, pretty straightforward: we're making sure that we've got unique keys and that columns that shouldn't have nulls don't have nulls.

But you can also do much more complex testing, like this one. Here what we're doing is writing a select-based SQL statement; again, we can leverage Jinja within this. What we want is for it to return no results, and if this SQL query returns a result, the test fails. This is a really amazing way to write tests that make sure your data quality holds and your assumptions about your data are correct.
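As a hedged sketch of that pattern, reusing the hypothetical monthly_revenue model from earlier (the business rule here is invented, not one of our real tests): the query encodes an assumption, and any rows it returns are failures.

```sql
-- tests/assert_no_negative_net_revenue.sql (hypothetical dbt data test)
-- Assumption under test: net revenue should never be negative.
-- dbt runs this query; if it returns any rows, the test fails.
select
    revenue_month,
    net_revenue
from {{ ref('monthly_revenue') }}
where net_revenue < 0
```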
When we put these pieces together, what you have is a pipeline that looks something like this. What I've included here is a screenshot of part of our actual GitLab CI file, where you can see how we call our pipeline. When you create an MR, here's the process it goes through. First we clone the data warehouse. Then, skipping to the middle, we run the DBT job. What the DBT job is doing is: we run a full refresh, we test the data, then we run it again to test for any changes to incremental models, and then finally we test it one more time. If this runs successfully, then we know we're not introducing any failures, because we've written tests that check our assumptions, and we know that things are performing the way we expect them to. Kind of like test-driven development, but applied to your data.

Oh, I went backwards, sorry. Along with everything else that comes with DBT, all these perks of testing, alerting, and logging, we also get this really amazing piece: documentation. What you see up on the screen is our DBT docs, which, like I said, ship with DBT as an open source tool. This is where we document each of our transformations: what's going on here? Why is it happening? How did we figure out that this transformation needed to happen? We have a step-by-step process in which we develop, and every merge request we merge has to have documentation in order to be merged, so we know that any changes are being documented. This has been a really easy way for non-data folks to introduce themselves to the work of the data team. When team members want to look at subscription data or engineering productivity or Salesforce information or usage ping information, they can pull up these DBT docs and just start typing the words they're interested in, and usually they can help themselves, or at least get very close to asking a really great question of the data, because they know what we have.

So, putting all these pieces together. I went backwards again. We start with version control. Version control lets us use merge requests. Merge requests force us to create disposable environments, in which we run automated jobs. And then we can add testing. With our testing in these automated jobs that we're running in our merge requests, we know that what we're outputting is better than what we had before. And if you're documenting as well, then you're just in much better shape. Take a second to take a picture of this or, I don't know, write it down in your notebook if you're still writing things down on paper. That's what I do.
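I can't paste our whole CI file onto a slide, but a stripped-down sketch of that shape of pipeline might look like the following. The stage names, job names, and helper script are assumptions for illustration, not our actual configuration:

```yaml
# Hypothetical .gitlab-ci.yml sketch of a DataOps merge request pipeline.
stages:
  - clone
  - dbt
  - teardown

clone_warehouse:
  stage: clone
  script:
    # Zero-copy clone per MR; "manage.py" is a hypothetical helper script.
    - python manage.py clone-database "analytics_mr_$CI_MERGE_REQUEST_IID"

dbt_run_and_test:
  stage: dbt
  script:
    - dbt run --full-refresh   # build everything from scratch
    - dbt test                 # check assumptions after the full build
    - dbt run                  # run again to exercise incremental models
    - dbt test                 # re-check assumptions after the incremental run

teardown_clone:
  stage: teardown
  when: always                 # drop the disposable environment even on failure
  script:
    - python manage.py drop-database "analytics_mr_$CI_MERGE_REQUEST_IID"
```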
I'm happy to answer questions, but before I do, I just want to emphasize the importance of this. If your data team spends all their time putting out fires, they're never going to add value to your business. I recently wrote a blog post on the three levels of data team maturity. The first thing data teams do is what I call reporting: getting you numbers that you already have elsewhere, so that you can trust the numbers your data team is producing. The next level is called insights, and this is where your data team actually produces something new. But if you spend all your time on reporting, you never move to insights. Moving your data team to a DataOps system, in which they're developing and treating what they're working on as code, will really level up your data team, and it'll help them move from reporting into insights.

So, what questions do you have? We can take a few questions, and I have a mic, so just raise your hand. We have a question right over here.

"So, DAGs and tools like Airflow are becoming common for..." I'm sorry, can you be a little louder? "Sure. For a lot of data delivery jobs, tools that use directed acyclic graphs, like Airflow, are becoming more common. What is your opinion on integrating a DAG tool that's used within an organization with GitLab and GitLab CI? Like integrating Airflow into GitLab itself? Basically, how would you delineate where GitLab stands and where you rely on an existing DAG tool to perform this data delivery?"

Yeah, so we do use Airflow at GitLab, and I think one of the coolest parts of Airflow is that we store all our DAGs in our project. If I had a laptop here and were to click this link and actually go to the GitLab data team project, we could look at the DAGs that the data team is using. The other thing I'll mention is that because we work so heavily in DBT and leverage those DBT docs, we actually use GitLab Pages to deploy our DBT docs, so they're public; you can go poke around in them if you'd like. That also has a DAG: our DBT models are also directed, so we can see how transformations happen there. So much more of our processing and our transformations happen in DBT than in Airflow, so deploying that somewhere public is much more important to us. But like I said, we do use Airflow. We have it deployed, we use GitLab to deploy it, and all team members have access to it, so we can see the DAGs there too. I think the principles are really similar, right? We're storing our infrastructure, our schedules, our pipelines, all those kinds of things as code, and Airflow really overlaps with the DataOps philosophy. "Yeah, that's awesome. Thank you."

And we have another question up here. "Can you maybe elaborate on why you chose DBT?" Yeah, why did I choose DBT? Great question. I'm gonna go back to that previous slide on what it is. In full disclosure, it's an open source tool and I'm a contributor to it, and I've been using it for a couple of years. So why are we using DBT? For one, I knew it before I started at GitLab, and GitLab already used it when I joined. So it's important to mention that it was both what was already here and a technology I was familiar with. But also, I don't really know that any other tool has made it so data analysts can do the transformation piece, right? So if we go back to what a data stack looks like, nope, that one. I talked about ELT instead of ETL. If we think about how data engineers worked not that long ago, even when I first got into data maybe four or five years ago, ETL was the traditional paradigm. Any time you needed to make a transformation change, when some definition changed and your transformation logic needed to change, a data engineer had to do it, because they had to re-extract and reload all the data; the transformation step happened in the middle. Now it's so cheap to store your data, that's not where expenses around data come from, that it just makes more sense to do your transformation once the data is already in the warehouse. So you extract your data once and you load it once.

If we're on the same page there, then the question is: who needs to be doing your transformation? If you use a tool like DBT, you make it so people who can write really great SQL can do your transformation, right? When you make it easier for more people to contribute to your transformation, it's like changing the size of your dev team. So many more people know how to write performant SQL than know how to write ETL. DBT pushes that work into SQL, and we still have to level people up, but for a long time you had data analysts who knew SQL and maybe not Git, or who knew SQL but didn't really think of themselves as writing code, because they needed SQL to access data that was stored in data warehouses or databases. So it's about bringing the technology to the people who are already doing the job, which is data analysts, and speaking their language instead of teaching them how to speak the language of ETL. And it's really only gotten better over the last couple of years.

Any other questions? "I think I might have one, Emilie." Please. "I am familiar with directed acyclic graphs generally, and this is a bit of a longer question. When I think of a DAG, we have a feature in GitLab CI, and the use case for it is when you have a pipeline that has a web app, iOS, and Android. Normally you'd have to wait for all of them to complete to move to the next stage, but with the DAG feature in GitLab, if the web test completes first, it'll just go ahead and deploy, even if the Android and iOS tests are taking longer. So that's my paradigm or use case when I think of a directed acyclic graph. What the heck is a DAG when it comes to data?"

Yeah, great question. So, when was the last time you played with Play-Doh? It's probably been a long time, right? But Play-Doh is this really cool thing, because it comes in one shape, and then you do some things to it, and suddenly you went from a ball to a dinosaur, and then you smush it up again and go from a dinosaur to a sandwich. That's kind of what data transformation is. You have this data that comes in rows with all this information, and you run some analysis over it that transforms that information into a different format. And then sometimes you want to do additional analyses on top of that, so you run another transformation on top of it, right? Or, to use a bad analogy: you take water, it comes out of your fridge when you put a cup in the door, then you freeze it and it turns into ice, then you put it on to boil to make pasta. You transform that water in different ways, but the order in which you did it was important, because you're running some sort of experiment. That's what a DAG does: it allows you to make sure that your transformations happen in exactly the steps and the order you're interested in. So A has to happen before B, which has to happen before C.
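In DBT, for instance, that ordering falls out of the ref() calls between models. A small sketch, reusing the hypothetical monthly_revenue model from earlier:

```sql
-- models/revenue_growth.sql (hypothetical)
-- Because this model ref()s monthly_revenue, dbt's DAG guarantees the
-- ordering: raw load -> monthly_revenue -> revenue_growth.
select
    revenue_month,
    net_revenue,
    net_revenue - lag(net_revenue) over (order by revenue_month) as net_revenue_change
from {{ ref('monthly_revenue') }}
```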
And if we look at, not that one, almost there, folks, this slide, we can see what that looks like: the data warehouse needs to clone first, then the extraction can run into the clone, then we can run our DBT jobs, and then we terminate the clone. Those orderings are really important.

I think I'm right about at time, but if you catch me at happy hour later, don't forget you can always ask me what my one-rep max squat is. Otherwise, I'm really grateful for your attention, folks. I hope this was useful, and please corner me and ask me questions later. Thank you.