coming to our experiment. I'm in a block of time today, so I'm anticipating about 30 minutes for content and 15 minutes for questions, so get those brains thinking about what you want to ask. Disclaimer: I come from a business intelligence perspective, so when you hear me say BI, that's what I mean, and this talk could be titled How to Implement Business Intelligence, but that's not necessarily what people want to hear about their data lake. A data lake, to me, is any big store of data that is not organized and probably not useful. Does anybody not want to hear about how to implement a BI solution? Because I'd be really interested in talking to you.

So who am I, and why should you listen to me? My name is Allison Stanton. My pronouns are she, her, hers, and I am the founder and chief problem solver of Stanton Ventures, where most of my clients are software-as-a-service companies, though I do have some in healthcare and some in fintech. My main languages these days are SQL, Python, and LookML, and I've been doing data full-time for the last seven years, even longer if you count part-time hobby stuff.

To begin with, we're going to have a reality check. You do not have a data lake. You have data lakes. So what would make these data lakes valuable to you? Well, you could make money with them, you could lower costs with them, or you could improve people's quality of life. You'll notice that all three of those things can be accomplished by giving useful data to the managers and employees of your company, and that's where the business intelligence side comes in. I specifically say employees, not just managers, because it empowers them and it values them. Data is its own kind of power, and we should be using it for good.

So what has to happen for your data to become valuable? Or, in other words, what requirements are there for it to have value? It has to be accessible: if people can't get to it, it can't do anything for them. It has to be accurate, so that people aren't second-guessing it. It has to be on time, because if you have to make a decision on Tuesday, it really doesn't help you on Friday. It has to be understandable, which means it can't be hieroglyphics to people, and there can't be room for interpretation; it has to be very clear-cut. And it has to be segmentable. Your data science team, if you have one (which most of you don't, because most of you probably don't need one), might want a random sample from that lake. How do you know it's random? You have to know things about your data. You might want all the water in the lake that has a specific fish; we'll call her Melinda. She is a client. You have multiple clients, and you want that specific client universe when you're talking to that particular client. All the water that has a swordfish in it, or a goldfish, or a whale, or a shark: those are segments of customers. You could want all the water in the lake, which is your entire client universe. If you lose any one of those things, your data loses its value and cannot provide the maximum impact.

So how are you going to get all of those things to happen? How are you going to answer those questions? I'll give you a hint: that's the business intelligence solution. Now you know you have data lakes, and you know what you want to get to. Your data types do matter, because they impact the decisions you make on your infrastructure. I'm going to start philosophically before I get to practically. I'm going to list these really fast, but don't worry, I will go back through them.
Do ELT, not ETL. You can't keep the data in the schema it's currently in; you're probably going to have to move from a normalized schema to a star schema. Start with the easiest, fastest use case, not the one people are asking for most, not the glitziest, not what's most needed: the easiest and the fastest. And you must have written definitions of your terms. This is a data dictionary. It's your proverbial Bible, a living document and a go-to resource for technical and business staff.

So when I say ELT, I mean extract, load, transform. You want to extract your data and load it, and then transform it once it's in a data warehouse. You don't want to transform it on the way there, because you're potentially changing or losing data. My favorite method for transforms is to do them in the SQL itself. It allows for version controlling of your scripts, it prevents a lot of I/O, and it uses a language that's been optimized for aggregates and manipulations. To put it bluntly, I can do more in 1,000 lines of SQL than I can do in 5,000 lines of Python. That said, software developers are generally not good SQL writers at the level of expertise I'm talking about, so please involve someone who knows what they're doing; SQL is not for the untrained.

You can't keep the schema that you have, right? First of all, for those of you who don't know, a schema is essentially the format your data is in in the data warehouse. The schema you start with is optimized for app performance. Whether it's your app or someone else's app that you're pulling in, I promise 98% of the time they have not given any thought to anything besides the CRUD of the application: the create, the read, the update, and the delete of a single record. Therefore, for performance optimization, you do have to change your schema. Even once you have a different schema, or a consolidated schema that just brings all of the apps together, as you scale you're still going to run into performance problems, and that's where a star schema will come in.

Start with the fastest, easiest use case, and here's why: you are learning. You probably don't do this for a living. If you take on the most complicated thing first, you will fail, you're going to be delayed, and you won't have built a positive track record with your coworkers. Once they get negative and suspicious, you get into reconciliation hell, and we'll talk about that a little more later.

You must also have definitions of what terms mean. I know it sounds boring, but this is so important. Let me give you an example. Pretend that you are your favorite online retailer for a moment, and someone asks you for a count of total orders. That sounds really easy, right? That should be easy. Wrong. What is in an order? Are you asking for a gross or a net number? Net of what? Canceled orders? Chargebacks? If someone adds an item to an order after it's already placed, is that a new order? That's an Amazon use case, because of how they take orders and group them and ship them. So what is the total of your order count? Here's what a term definition should sound like: net total orders is the total number of orders placed and given a unique order number. It is calculated by taking the total number of orders placed and subtracting canceled orders. It does not account for chargebacks, because that categorization is post-shipping. Look at what I did there. I gave a definition. I gave how it's calculated. I told you what specifically is not included in it and why it is not included. This definition is narrow, precise, understandable by business users, and repeatable.
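To make that definition concrete, here is a minimal SQL sketch of the metric, assuming a hypothetical orders table with one row per unique order number and a status column whose values include 'canceled'; the table and column names are illustrative, not from the talk.

```sql
-- Net total orders per the written definition: orders placed minus canceled
-- orders; chargebacks are deliberately ignored because that categorization
-- happens post-shipping.
SELECT
    COUNT(*)                                                 AS gross_total_orders,
    SUM(CASE WHEN status = 'canceled' THEN 1 ELSE 0 END)     AS canceled_orders,
    COUNT(*)
      - SUM(CASE WHEN status = 'canceled' THEN 1 ELSE 0 END) AS net_total_orders
FROM orders;
```

The point is that a narrow, written definition translates directly into one repeatable calculation that everyone can check.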
So where do I start practically? Now that I've covered my philosophical beliefs on how to make a data lake useful, let's cover the practicality. Again, I'm going to go through each of these in more depth later. You're going to start with a survey of what you have and where you're at. You're going to identify your easiest, fastest use case. You're going to extract and load those data sources, and at the same time make sure you have working written definitions that the organization agrees on across departments. You're going to do the T's, those transforms. You're going to build the report and launch it. You're going to start collecting data you need but don't have. You're going to do that cycle 10 or 20 times, and then iterate on those 10 or 20 use cases. If you don't have all the data sources in one place, that's probably the right time to add them. If you don't have a BI tool, get it now. Work on the highest-impact reporting: the company-wide reporting, including financials. Then you iterate on existing reporting and stay up to date on new product features and company initiatives.

So if I do all that, what do I end up with? A data warehouse, which will have schemas for each incoming data source, individual developer schemas, and a master data warehouse schema. You're going to have an extract, load, transform data infrastructure that is version-controlled. You're going to have BI infrastructure. You're going to have reports used by the business that are accurate, timely, and useful for decision-making. You're going to have that data dictionary I talked about. You're going to have ELT environment and script documentation, and you're going to have your conventions in one place.

How do I get there? I have to go through my survey. So what kind of data do you have, and why does it matter? It matters because, as I said earlier, it determines the technical decisions you're making and how to pipe it all into one place. You need to know what infrastructure you already have, so you aren't either double-purchasing or making decisions that don't include a main data source. For example, if you buy some kind of ETL solution that doesn't cover a data source that is critical to your use cases, you're going to be in trouble. So you have to know what you have. You also have to know what audiences you're serving, both internal and external. Are you only responsible for the marketing department? Are you responsible company-wide? Are you responsible for clients? Because once you're doing client-facing reporting, your requirements change as well.

Now you're going to identify the easiest, fastest use case. This, by the way, is a worksheet that I put together for people to think through and survey what they have internally in their organization. You'll see that you have inputs; you have timing, because there's a very big difference between batch and real-time; you have how you're going to get information to people; and you have what types of transformations you're going to be doing. I see somebody taking pictures; I'll get out of the way. One example of an easy, fast use case that I like to use is a help desk ticket dashboard.
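For a sense of how small a first version of that dashboard can be, here is a minimal sketch assuming a hypothetical tickets table with created_at, closed_at, and category columns, and Postgres-style interval arithmetic; the names are illustrative.

```sql
-- Hypothetical first cut at a help desk dashboard: ticket volume and
-- average resolution time by category over the last 30 days.
-- (AVG ignores NULL closed_at values, so the average covers closed tickets.)
SELECT
    category,
    COUNT(*)                    AS tickets_opened,
    COUNT(closed_at)            AS tickets_closed,
    AVG(closed_at - created_at) AS avg_time_to_close
FROM tickets
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY category
ORDER BY tickets_opened DESC;
```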
People look at me like I'm crazy when I say that, but it doesn't just help your help desk team with the management of that team. It also helps your product team with prioritization: if you have a help desk issue that comes in from some large percentage of your clients, or it's just your top ticket issue, and it can be fixed in your product, fix it in your product and you don't have to hire as many heads for the help desk. And it also helps finance with product pricing: if there's a certain categorization of client that takes up a lot of help desk time, they should be paying more for it.

So you're going to extract and load the data sources needed for that first use case into one place, and you're going to have somebody working on your written definitions of those terms, dimensions, and measures. A dimension is a fact, and a measure is generally anything you would use a formula in Excel for. Again, you're going to do the extract and load without the transform, because the transform can skip or alter data that you might need in the future, and it's easiest to just have it all in hand and manipulate it later. The exception is JSON. Do not put JSON in SQL; parsing it in SQL is not fun. If you do need to have JSON in SQL, then make sure you have a second import of the parsed version. The biggest thing people do wrong is not have those definitions. It's one of those take-your-medicine things that people don't want to do, but it will save you a lot of time in the long run. Make sure you're getting them now, because you have to have them before you're writing your report.

Now you do the transforms, you build the report, and you launch it to the team that's using it. You start collecting the data you need that you aren't collecting yet. Everybody thinks they're collecting all the data in the world, or that they're collecting so much that they must have everything, and the answer is that they don't. Specifically, what happens most often is that there's a categorization used in the internal vocabulary that isn't being codified as a piece of data at the time it needs to be codified. If you're always talking about X group of help desk tickets, then those tickets need to be tagged with that categorization in order to be helpful for reporting, not tagged later, because then you're just rereading the ticket and involving another person in spending more time on it. This is the point at which, if you identified something you want in your report that you haven't been collecting, you add it to your collection, so that by the time you get back to iterating on that report, it's there for you.

You're going to go through that cycle 10 or 20 more times. This is a learning experience, and there are always low-hanging-fruit data use cases you can work on during this time. Now, if you're someone like me and you do this for a living, this is going to be a very condensed period of time. If you don't do it for a living, it may take weeks or months, especially if you're always getting pulled into firefighting. Then you're going to iterate on those 10 or 20 cases, and by that I mean going back, making another version, and launching it, taking feedback into account. You had an original scope; you're now doing updates to that.
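Circling back to the JSON caveat above: one way to handle the "second import of the parsed version" is to land the raw payload once and keep a flattened copy next to it. This is a minimal sketch assuming Postgres and a hypothetical raw_events staging table with a jsonb payload column; the field names are illustrative.

```sql
-- Keep the raw payload untouched in a staging table...
CREATE TABLE IF NOT EXISTS raw_events (
    event_id  bigint PRIMARY KEY,
    loaded_at timestamptz DEFAULT now(),
    payload   jsonb
);

-- ...and maintain a second, parsed copy with real columns so nobody has to
-- dig through JSON in their reporting SQL.
CREATE TABLE IF NOT EXISTS events_parsed AS
SELECT
    event_id,
    payload ->> 'event_type'                 AS event_type,
    (payload ->> 'client_id')::bigint        AS client_id,
    (payload ->> 'occurred_at')::timestamptz AS occurred_at
FROM raw_events;
```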
That is where one of the questions I put in my abstract comes into play, which is: why is interdisciplinary work important? After you've done those 10 or 20 cases, you're going to find things that are used across departments. The sales department and the marketing department need the same kind of information; they just have a different lens on it. The product team needs information the marketing team would be interested in; they just have a different lens on it. That's where you're going to get into a unified schema. You can get into data marts later on, as you get into more complicated or more specific use cases, but I have yet to find a company that doesn't need a unified schema for the majority of its reporting.

If you don't have all of your data sources in one place at this point, consider adding them. This is also the point where you can start considering a BI tool, so that you aren't doing one-off Excel or one-off graphs with whatever happened to be on the shelf or whatever happened to be cheap. You now have a better understanding of how you're going to have to present things; again, you have to know whether clients are going to be involved or not.

Now it's time to completely shift your focus to the highest-impact reporting: the company-wide reporting with financials. These have very high visibility and very high importance, but you're now ready for it, and you won't be able to hold your stakeholders off at this point anyway. Then you enter maintenance mode. You're going to iterate on your existing reporting; you'll have a ticket queue at that point, and people will know how to request changes or new reporting; and you're going to stay up to date on new products and features and company initiatives. There's more about how to do that, and how to do it well, in a talk I gave to PyData Chicago last year called What Data Engineers Wish Developers Knew, which is available on YouTube.

So, moving on: what's going to come back to bite me if I don't account for it up front, and that is totally and completely avoidable? Reconciliation hell, which I mentioned before. This especially happens in organizations that have never had numbers, or in organizations that don't like the numbers they're being given. They're constantly going to question them and make you reconcile them to whatever they happen to think is the right answer. No one likes doing this work. It's very painful, it is tedious, it is time-consuming, it is awful, and once you get into that cycle, it's almost impossible to get out of. So avoid it as much as humanly possible by doing reconciliation up front for yourself, before someone asks for it, and make sure you have quality control measures in place. Sales commissions are a great example of this: there is nothing salespeople have more time for than nitpicking a report that determines their salary, and then there are multiple salespeople, and it just gets exponentially more difficult.

In terms of the data architecture of that data warehouse, the things that are going to be painful to fix later on are: not having enough hierarchy levels in your categorization system, objects that aren't correctly delimited from each other, and, mostly, relationships. There are three types of relationships in a relational database: one-to-one, one-to-many, and many-to-many. If you use a one-to-many relationship just to avoid doing the extra work to make it a many-to-many, that's going to bite you later on, so do the work up front.
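Here is a minimal sketch of what doing that work up front can look like, reusing the hypothetical help desk tickets table from earlier (assumed to have a ticket_id primary key): a junction table is what lets one ticket carry many tags and one tag apply to many tickets.

```sql
-- Hypothetical many-to-many modeled properly, instead of forcing tags into
-- a single column on the ticket.
CREATE TABLE tags (
    tag_id   bigint PRIMARY KEY,
    tag_name text NOT NULL UNIQUE
);

CREATE TABLE ticket_tags (
    ticket_id bigint NOT NULL REFERENCES tickets (ticket_id),
    tag_id    bigint NOT NULL REFERENCES tags (tag_id),
    PRIMARY KEY (ticket_id, tag_id)   -- one row per ticket/tag pair
);
```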
I like to tell people that if you do the infrastructure work up front, yes, it takes more time before anybody gets anything, but it means that instead of taking two days to get a new report later on, it can take as little as five minutes. Then they generally back off a little.

The other thing is to be careful with binaries. If you're going to do a binary, make sure it's actually true/false, not something like gender, which is not binary. And make sure that true/false is very specific and not a combined group: you'd rather have more true/false columns that you can calculate later into a new true/false than try to combine too much at one time. At worst, that's just adding a column, so it's not as bad as some things I've seen people do.

In terms of a tech stack, do not do something new and cool. Do not build a data warehouse in MySQL; it does not have window functions. You're going to want to use something that is Postgres or similar. My personal recommendation is Amazon Redshift, mostly because it's been around and been battle-tested at scale. You're also going to want to think about your caching: where is it feasible to have caching, and where are you going to want caching? It doesn't have to be very complicated; it can be as simple as a pre-calculated table. Don't over-engineer it if you don't have to.

All that said, you're going to run into problems. Data problems are going to be things like incoming data being crap, especially if you're getting it from a vendor, or if your own internal team hasn't validated it as it's been coming in the door. The other problem we've already talked about is not having enough data. Tech problems could be: a cron job doesn't run, there's an incomplete file load, a file is skipped altogether, a vendor goes down. Make sure you have the testing in place for that. And you're going to get people problems, which are really just complaints from your coworkers. They generally cluster as: things aren't fast enough (either they don't get the report fast enough, or they click the button and the report doesn't generate as fast as they want it to); it's not the right thing (they expected X and you gave them Y, and then you can go back to, hopefully, your written documentation of the request and say, no, you actually asked for this); and "why haven't you been doing this thing I've been asking for for months?", at which point you get to go have a prioritization discussion.

So what strategies are useful for troubleshooting those? Use an iterative approach: making sure everybody knows there's going to be another version is a lot easier conversation to have than with somebody who expects everything to be perfect the first time. We know that from software development, and the same is true for reporting. Have data-related test suites. You want to make sure that any failures you do have are not silent; a silent failure in data is the worst failure you can have. A simple one is just making sure that a file was imported before you generate a month-to-date total. And communication: focus on common vocabulary, and especially on expectations, making sure that people know how something was prioritized and how long it will take.

So how do you automate it? Well, that's a tools discussion. My main point is that taking data out of Salesforce is a pain in the ass, okay, and it's been done by very many people, so don't reinvent the wheel. My favorite implementation right now is actually Stitch data integration, which does a full sync to Postgres or whatever database just by OAuth-ing and checking off some boxes. It's beautiful; it's like a breath of fresh air.
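On the "no silent failures" point above, here is a minimal sketch of that kind of guard, assuming a hypothetical file_load_log table that the loader writes to; whatever runs your pipeline (a cron wrapper, a dbt test, an Airflow sensor) would abort the month-to-date run if this returns no rows.

```sql
-- Hypothetical guard: before generating a month-to-date total, confirm that
-- today's file actually landed with data in it. Zero rows back means stop
-- loudly instead of publishing a quietly wrong number.
SELECT source_name, loaded_on, row_count
FROM file_load_log
WHERE source_name = 'orders_feed'
  AND loaded_on   = CURRENT_DATE
  AND row_count   > 0;
```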
But given the time that we have at the moment, if you want to talk more about tools, please feel free to ask about them in the question section.

What does collaboration that rocks look like? This is not just the documentation, and it's not just training. It's also doing ad hoc office hours and making yourself available to people when they have questions, in whatever communication channel they are comfortable with. It's making sure that you have a team and not just one person, and that there is some skill overlap. It's project tracking, feedback loops, and iteration. And there is a fabulous Stitch Fix article called Engineers Shouldn't Write ETL: A Guide to Building a High Functioning Data Science Department that I recommend to everyone. You can go into advanced lessons like a data layer, but honestly, if you want to talk about that, you should be talking to Ria Gauche, who spoke yesterday; she's actually implemented one and had some great success with it. So with that, are there any questions?

Audience question: I work for a health organization, so whenever we talk about any kind of data accessibility, we get into protected health information. We throw that PHI word around to say no to a lot of things. Any advice for how to deal with that?

Answer: Yeah. Get agreement on whatever your anonymization rules are, keep a set of data that way, and build reporting off of that. Also, don't have multiple copies of the PHI: there's one version protected behind whatever wall, and then anybody with permission can see the anonymized stuff. Be careful about re-identification, the Netflix data set problem, where with something like three data points you could find a person. But really, just having that wall and having written standards around it is the best thing that I've found.

Audience question: You said that you suggest doing your transforms once you have your data in the data lake. What about data enrichment? Do you recommend doing that before you put it in the data lake, or also just putting all of that data in your lake and then doing the enrichment as a view once it's in the lake?

Answer: An enrichment is a transform, and transforms come after you have a clean copy imported. But you also bring up a really good point, which is the difference between a data lake and a data warehouse. To me, a data lake is just this big old pile of data sitting in a corner that no one's done anything with, and a data warehouse is that data lake transformed into something that's useful and organized. So make sure your import is going into the data warehouse clean, and then transform it; you'll save yourself a lot of time.
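As a minimal sketch of that "load clean, then enrich as a transform" pattern, assuming the hypothetical events_parsed table from the earlier sketch and a hypothetical vendor_firmographics table that was also loaded as-is: the enrichment lives in a view built inside the warehouse, not in the loading step.

```sql
-- Hypothetical enrichment done as a transform inside the warehouse:
-- both sources were loaded untouched, and the join happens afterwards.
CREATE VIEW events_enriched AS
SELECT
    e.event_id,
    e.event_type,
    e.client_id,
    e.occurred_at,
    f.industry,        -- enrichment attributes joined on after the load
    f.employee_count
FROM events_parsed e
LEFT JOIN vendor_firmographics f
       ON f.client_id = e.client_id;
```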
Audience question: Somebody told me something that resonated with me: one of the biggest blessings of data is also its biggest curse, which is that it can be copied. I don't know how many of you deal with stakeholders who ask for a report so that they can load it into their own Excel, parse it, and create their own report, and maybe that happens multiple times over. This is a big problem for us. Our vision is a model where we can offer almost anybody the ability to go, on an ad hoc basis, and create what they need. That's our nirvana, and the data stays at the source and doesn't get copied. So the question to you, considering this kind of aspiration: after you invest in your infrastructure, you've created the lake, you've put everything there, what would be the next logical step, or series of steps, to create this culture of "do not copy the data; go to the source, get what you need, and leave it there"?

Answer: As you're talking, you are describing a perfect use case for a product called Looker, which is a business intelligence solution that goes on top of your data warehouse and allows people to point and click to make their own little charts and graphs and their own dashboards. You can turn off downloading the full data set by permissions, and that is the source of truth: you have a DRY LookML model. If you want to talk more about that, I'm happy to get into details; I'm probably one of the top experts in the country on that program.

All right, well, thank you so much for the great talk. Let's give it up one more time.