Okay, hi everyone, I'm Chen Yang, a BI engineer with Amazon Web Services. We support the sales analysts, building dashboards for the salespeople who are out there selling cloud services. What I'm going to share today are some of the methods we use to build trust in our data, why that's so important, and why it's so often forgotten. The reason I'm sharing this is that I've had at least six years of experience working across different types of data: geospatial data, crowdsourced data, and just before Amazon, user behavior data. We work with terabytes of data every day, and data quality is real. You hear people talking about the buzzwords, AI/ML, ChatGPT. Without data quality, without good, trustworthy data, all of that is nothing. So that's what we're going to cover. The biggest takeaway I want you to leave with today is how you start building that trust from day one. You don't say, hey, I'll start earning money, earning business, getting customers first, and then build trust. No, you have to do it from day one. And I'll show how our team obsessed over and adapted to our customers when we built platforms and automated frameworks like these. If you're Amazon fans, you'll notice I tried to squeeze the leadership principles in here: customer obsession and earning trust.

Before I begin, a quick personal story. There's a particular personal anecdote from my previous company that's very related to data quality. We were working on a project a lot of you will be familiar with. Actually, just curious, how many here are fellow data engineers? Show of hands. Data analysts, people who work with data, who build dashboards? None? Okay, those are data engineers' enemies. We always hate it when data analysts come to us, because it always means there are issues and problems. CTOs, CDOs, anyone here? So I guess the rest of us are builders. Hopefully there's a takeaway here, something you can build for your data pipelines or the data people on your team down the road. I wouldn't say ten years, even just one or two years down the road, because maintaining that quality is going to be a huge thing.

Okay, back to my story. A few years back, I was working on a project on a standard topic across every company: registration churn. What's that? Basically, a hundred customers come to your website. How many of them actually sign up? And what are the variables impacting that? There's been a lot of study, a lot of papers out there, a lot of models already exploring it. So the plan was simple: collect the data, throw it into a bunch of models, look at the variables, talk to your business people, understand what those variables mean, go back, tweak them, and eventually present to management. And hey, you've got something, you've got a project for your resume. That was what I was aiming for. A month-and-a-half project. We'd already spoken to the business people. We'd figured out the variables. We were like, hey, we have that data. This was a user behavior team, so we had a bunch of data. We collected click tracking.
So anything you did, we tracked it. We had that data. We knew what you were doing before you registered and after you registered. Why didn't you register? Why did you click off the website? What drove you away? All of that, nice and done.

Two days before the management meeting, disaster struck. The engineers came to us and said, hey, the registration flow is broken. The data you've been collecting, the data you've been building your models on, is not accurate. I won't go into details on why, but basically we were looking at skewed data, and we weren't privy to it because registration churn is naturally skewed. Unless you're giving away $100 for every registration, you wouldn't expect a balanced 50-50, where someone comes to your website and there's a 50-50 chance they sign up. No. In actual fact it's very low, 1%, sometimes even less than 1%. So when we looked at the data, it was natural that the problem wouldn't be caught early in the project. It was only caught because one of the sales account managers was out speaking to customers, and a customer said, hey, I've been trying to register, but I can't. Oh, shucks, our flow is broken. So it came down to the analysts, to myself, to the project, two days before the meeting. If you want, I can share afterwards how we managed to salvage my career; if we hadn't, I wouldn't be here. We didn't manage to save the project, though. We didn't go through with the management meeting, because the findings weren't actionable.

One thing to take note of: clean data is important, and it impacts not only the users themselves but the business owners. Especially in the team I'm in now, for the business users relying on our data, it impacts their daily life, the sales they make, the money they make, their compensation. It's all based on the data you're providing them. So trust is not just about yourself; it's about other people's livelihoods.

So yes, in that story we're talking about accuracy and consistency. But what people forget, even we data engineers ourselves, is that we have a lot on our plate. You can't expect us to also go and figure out what your business is. Well, we do, somewhat; if you're in the job for two or three years, you know the business context. What I mean is this: if there's a 1% variance between yesterday's number and today's number, can you tell whether that 1% is actually significant, or whether it's something you can just ignore, something expected that shouldn't alert? All this contextually relevant information isn't available to data engineers. We can tell you if the schema changed. We can tell you if you're expecting integers and strings come in. But we can't tell you the business context that only the business people know. And these are exactly the expectations people have when they talk about trusting the data.
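To sketch roughly what I mean, in Python, with made-up column names and pandas assumed, the split looks something like this. The structural half is mechanical; the tolerance in the second half is exactly the business context only an analyst can supply.

```python
import pandas as pd

# Structural checks: a data engineer can own these with zero business
# context. The expected columns here are hypothetical.
EXPECTED_SCHEMA = {
    "user_id": "int64",
    "registered": "bool",
    "visit_date": "datetime64[ns]",
}

def check_schema(df: pd.DataFrame) -> list:
    """Flag missing columns and type drift (strings arriving where ints are expected)."""
    issues = []
    for col, expected in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            issues.append(f"missing column: {col}")
        elif str(df[col].dtype) != expected:
            issues.append(f"{col}: expected {expected}, got {df[col].dtype}")
    return issues

# Business-context check: only the analyst knows whether a 1% day-on-day
# swing is seasonal noise or an incident, so the tolerance is their call.
def within_tolerance(today: float, yesterday: float, tolerance: float) -> bool:
    """True if the day-on-day variance sits inside the analyst-chosen tolerance."""
    if yesterday == 0:
        return today == 0
    return abs(today - yesterday) / abs(yesterday) <= tolerance
```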
And every data engineer, okay, I won't say every data engineer, every engineer, all of us here as builders, knows this: it's very easy with our customers when we build things and they work fine. People are happy. But once something breaks, once I supply incorrect data for one single day, then for the rest of the month my analysts and my data scientists will be coming back hounding me: are you sure this data is correct? Are you sure this data is correct? Every day, every single time. So we know that trust is very easily broken but very hard to build. And we can't possibly build it manually, right? Yet that's what we'd been doing, and talk to any data engineer, anyone working on data pipelines, any team starting out, and you'll hear the same.

So what did our 1.0 look like? Of course we had more than this, but I want to flag some of the key pieces.

We had on-calls. Every week, someone on duty checked through the dashboards, making sure they were showing the correct numbers, the right numbers for people. If you have 20 dashboards, the on-call goes through all 20 to make sure everything works. Then you have to prioritize: out of the 20, which are important and which aren't? If you have 100 dashboards, you go through 100 of them, because like I said, these 100 dashboards impact people's livelihoods.

Then you have SQL, and everyone's familiar with SQL. Stored procedures, basically. A procedure breaks, your ETL fails, and an alert says, hey, there's an issue with your data pipeline, the data you're producing is potentially wrong, go fix it. You have procedures like this in between your ETL steps, and once one fails, it sends you a Chime message. It pages you. Even if you're asleep in the middle of the night, your pipelines have to run, so you boot up your laptop and fix it. That's what we did, amidst a lot of other things.

But then, the challenges. Some of the things we faced. Here's an actual chat message, the kind of thing we ran into over and over. You have a group chat, and your manager suddenly pings you: hey, this dashboard doesn't look like it's working, can someone look at it? If you're lucky, it was sent at 8:30 while you're still having breakfast; you look at your handphone and go, okay, I can boot up my laptop and fix it. Now imagine it's the middle of the night and you're supporting a team around the globe. You need to wake up in the middle of the night, and if you can't, your stakeholders will be living with incorrect data for maybe six hours. And a lot of the time pipelines overlap; pipelines are complex, they layer on top of each other. That six-hour delay doesn't just delay one dashboard. It could mean the source data is incorrect and other pipelines are building on top of incorrect data, so you've got to rerun the entire six-hour chain. Certain things still needed manual resolution the way we were doing them.

Then, overwhelming alerts. You constantly get alerts. You start with one audit, then two. Then you have more datasets, then more engineers coming in. Each engineer builds their own datasets, and each engineer builds their own alerts. Some engineers don't fix them. Some engineers deprioritize them, because not all alerts are equally important: some you can fix, and some are just known exceptions. Then it becomes 20 alerts in a single day, which is what happens now. When I open my laptop every day, I open my chat and see 20 alerts. I don't go through them, because I know for a fact that whether or not I go through them, the day goes on as usual.
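Just to give a flavor of those 1.0-era checks, here's roughly what one looked like in spirit; this is a sketch with a made-up webhook and a generic database connection, not our actual procedures. One hard-coded assertion, and a page to whoever is on call when it fails. Multiply this by a hundred datasets and a dozen engineers, and you can see where the 20 alerts a day come from.

```python
import requests

# Hypothetical on-call webhook; in our case the page arrived over Chime.
ONCALL_WEBHOOK = "https://hooks.example.com/oncall-room"

def check_not_empty(conn, table: str) -> None:
    """One hard-coded assertion per dataset: halt the pipeline and page the on-call."""
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if count == 0:
        # Halting the ETL is the easy part; a human still has to wake up and fix it.
        requests.post(ONCALL_WEBHOOK, json={"Content": f"{table} came back empty, ETL halted"})
        raise RuntimeError(f"data quality check failed: {table} is empty")
```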
But then one day you get an issue, and out of those 20 alerts, one of them was actually key. That could break my career. That could break someone's career.

Lastly, hard-coded checks. Take tolerance, for example. We can set a 0.1% tolerance. But how important that 0.1% is depends on the level you're looking at. We cover the Asia-Pacific region, so at that level 0.1% is fine, right? 0.1% across a billion, across 10 billion, that's close to nothing. But drill down to the smaller segments, the smaller areas, and even 0.1% matters to every one of them. That's where issues arise: the tolerance has to change accordingly, and I have to keep adapting these hard-coded checks by hand.

So that's where we went further. Now, you might be thinking, there are already libraries out there. There's Great Expectations, there's Deequ, there are libraries available to help you. But one key thing we did differently was that we democratized the checks. On our team, the data engineers support the analysts, and we needed them. The analysts were the ones producing the datasets. They were the business owners. They were the ones who knew the context, who knew whether a 0.1% day-on-day change was seasonal or something that should be flagged, or whether a simple boolean flag actually made sense for a particular role, for a particular column, inside the data itself. That was something data engineers didn't know. We never had the context, but the analysts did. So we democratized the checks.

Any Greek geeks out here, anyone who knows which Greek goddess this is and wants to take a guess? We call our project Project Soteria. Soteria is the goddess of salvation. Salvage our careers, salvage our data: that's what the project was meant to do. What we came up with is the automated data quality framework, the very long name you see in the talk title. Basically, instead of the manual checks you just saw, we help automate them. And these automated checks are powered not by the data engineers but by the dataset owners, by the analysts themselves: the people who know the business, who know their data, who know what's going to go wrong. That leaves the data engineers with the simple but important role of maintaining the platform, making sure the cluster runs when everyone wants it to run. That's it. It's a shared responsibility: the data analysts are responsible for their data, for making sure their data is correct. And it's powered by Amazon Athena.

We can go through the details after the talk if you want, but the reason we chose Athena was very simple: technical considerations. The project was done in 2022; we started in late 2021. Our cluster was primarily Redshift, and Redshift wasn't serverless back then. We just wanted something serverless, and Athena was available. We were like, hey, we can just boot something up; we don't need to provision it, we don't need to scale it. Plus, we were AWS employees anyway. Although that's not the whole truth, we still needed to do some cost optimization. But at least we had that: we could just spin it up and run it.
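To sketch that serverless pattern, here's the shape of the idea rather than Soteria's actual code; the table, database, and bucket names are made up. An analyst-authored assertion gets fired straight at Athena with boto3, and there's no cluster for the engineers to babysit.

```python
import boto3

athena = boto3.client("athena")

def submit_check(sql: str, database: str, output_s3: str) -> str:
    """Submit one analyst-authored assertion to Athena and return the execution id."""
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]

# The SQL body belongs to the analyst: they are the ones who know that
# negative revenue in this (hypothetical) table is impossible, not us.
query_id = submit_check(
    "SELECT COUNT(*) AS bad_rows FROM sales_daily WHERE revenue < 0",
    database="soteria_checks",
    output_s3="s3://example-dq-results/",
)
```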
So here's how it looks. We have our ETL here. Most of you have heard of dbt, I guess. I forget what dbt stands for, but if you've heard of it, Selfish is our version of dbt. It's a dumbed-down version built in-house, and you come to realize in-house actually matters, because at a lot of companies like Amazon there's friction in adopting outside libraries; there can be licensing issues, a lot of things. So we tend to build a lot of things in-house, including the ETL drivers we use.

Soteria is essentially a separate cluster, and it's separate for a few reasons. The main one is so that the data quality checks don't impact the production cluster. The other is to maintain a clean-room environment. When an issue happens, you want whoever is fixing the data, whoever is looking at it, to be looking at the data as it was at the point it failed. So we copy the data into the Soteria cluster, we run the checks, and if anything fails, you use the snapshotted data to go and find where the error is. That lets you capture all these random data issues, the kind that only show up on, say, the 13th or 15th of the month and always get overwritten, because by the time you look at them, new data is already in. This environment lets you actually look at them.

On top of that, this is where customer obsession comes in. Back then, our analysts' workflow was: come into the office, open the laptop, open Jupyter Notebook. So we thought, hey, our folks are familiar with notebooks, why not build something on notebooks? So Soteria has a JupyterHub built on top of it. A new user goes in and automatically gets their own repo created, separate from everyone else's, so you can't see other people's notebooks. Once you're done with your notebook in your own hub, you copy it into the repo, and that notebook repo folder is just a symlink to the actual folder we pull the data quality checks from. That ensured security: someone working on confidential data and someone working on non-confidential data can still exist in the same environment. And everything is stored in Postgres. Whatever checks, whatever results and metrics those checks generate, we store in Postgres so they can be analyzed, and Postgres also drives the regular automated runs.

So what's next? We're always evolving. What I've described is what we started last year, when our analysts all worked in notebooks. Late last year, we had a whole bunch of new hires come in who don't use notebooks. For them this was a real barrier to entry: they weren't familiar with notebooks, they needed to pick up new Python libraries, they needed to pick up the Soteria libraries, and they didn't have time, because we were onboarding them very quickly. The older analysts took to it fast because it was what they already knew, but the new analysts were familiar with SQL, and Python was foreign to them. So we needed to come up with something different for them.
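One way to lower that barrier, sketched here with a hypothetical table layout rather than what we actually shipped: let the analyst register a plain SQL assertion plus some metadata in that same Postgres store, and have the framework run it on a schedule. No Python required on the analyst's side.

```python
import json
import psycopg2  # the same Postgres that already stores check results

# A SQL-only check definition: the analyst contributes SQL and a severity,
# nothing else. Names and fields here are hypothetical.
check = {
    "name": "apac_revenue_non_negative",
    "owner": "some-analyst@example.com",
    "severity": "page",  # vs. "log", so known exceptions stop flooding the chat
    "assertion_sql": "SELECT COUNT(*) FROM sales_daily WHERE revenue < 0",  # must be 0
}

def register_check(check: dict) -> None:
    """Persist an analyst-authored check for the scheduler to pick up and run."""
    with psycopg2.connect("dbname=soteria") as conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO dq_checks (name, owner, severity, definition)"
                " VALUES (%s, %s, %s, %s)",
                (check["name"], check["owner"], check["severity"], json.dumps(check)),
            )
```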
And of course, after we identify issues, we also want to be able to fix them, which is where self-healing comes in. When you talk about self-healing, you're talking about going back to your older datasets and fixing them so that your trends still match. This comes up a lot now, because we're three or four years into the business as part of the entire commercial team, and definitions change. Based on the economic situation, based on new information, you realize that a metric you were using before doesn't apply anymore, but you still want to look at trends. So you go back into your data, look at the source, look at the raw data, and fix the generated metrics so your business people can compare apples to apples, not pears to apples.

We're also working on more advanced statistical techniques. Before this we had the standard ones; if you're familiar with standard data quality checks, we mentioned we have 13 of them, including Z-score tests, where we looked at the population and the sample to make sure you weren't deviating too much from your previous 30 days or previous five days, depending on what you're looking at. Now we're looking at more advanced versions of those. And those matter, given the current economic environment. If you're an analyst, you know that for the past three years during COVID, your business people were always asking to analyze something pre-COVID for comparison, or to normalize the data so that pre-COVID data could be used during COVID. Now we're exiting COVID, and they want another fresh set of data, a fresh set of analysis, a fresh set of handling, so that they can once again do apples-to-apples comparisons, because that's what matters to them. That's where we're headed. It's an interesting time.

So yeah, I'll end here. My name is Chen Yang, and I'll just roughly run through my introduction again. It's been a year at Amazon, and it's been a crazy year. Before that I was with an e-commerce company, and prior to that a geospatial data company. And, unrelated to data, before all that I was teaching kids programming. So that's me. If you want to learn more, feel free to reach out. Thank you.