care about data or dev ops, or you just wandered in off the street for free food and you're looking for somewhere to take a nap. Well, too bad, because now you're part of the audience for my one-man show. But in all seriousness, as Adam mentioned, I run tech ops for MIT Open Learning. I am the one-man show at Boundless Notions, and I host a couple of weekly podcasts on Python and data engineering, which is a big part of what brought me here today. In the process of talking to a lot of different people in various areas of the data stack about data engineering concerns and operationalizing their data to drive value through some sort of advanced analytics or machine learning or AI, a lot of themes came up: difficulty getting access to the data that they need, being able to deliver it in a timely fashion to fulfill requests from people higher up in the business, and the same issues of data infrastructure, data access, and conflicts with the data scientists that they work with. And this sounded very familiar to the work that I do in operations, trying to deliver reliable platforms for applications and to provide value to the organization. So if you're interested in seeing me tweet automated things about my podcasts, you can follow me on Twitter. I also put the link to the slides there if you want to follow along.

And if you're coming into this wondering what data engineering even is: it's a new term for an old role. It used to be business intelligence, or business analyst, or database administrator. As more types of data started coming in faster to organizations and began playing an even more critical role in the success of different businesses or groups, it started to morph into its own distinct concern, separate from data science, because the primary role of data scientists is to analyze data, provide insights, and get something actionable to the business.
But they were spending all their time just trying to find the data to get those insights and figure out how to get it into a shape where they could take advantage of it. And so, gradually, this idea of data engineering as a distinct concern from data science has taken root and has been growing over the past few years. The role of data engineering encapsulates things like collecting data, cleaning data, keeping track of what data you even have and where it is, integrating it and enriching it from multiple data sources, and managing data governance to make sure that you're compliant, that the people who should have access to the data do, and that the people who shouldn't, don't. And when done right, data engineering is largely invisible to somebody outside of the organization.

And as everybody in here I'm sure can relate, it is a role that is very broad in scope. It's constantly changing. It's hard to keep up to date. And there are a lot of parallels between the conflicts that exist between the roles of data engineer and data scientist and the conflicts between systems administrators and developers, where we all want to provide value and unique, interesting new capabilities to the end user at the end of the day. But on the data engineer and systems administration side, we want stable systems, we want reliable and repeatable systems, and we want to make sure that we can get a good night's sleep without our phone exploding at three in the morning. At the same time, data scientists and developers want to bring in all the shiny new toys, add new features, test out new machine learning models, play with the new frameworks. And they want to do it all yesterday, without having to wait for data engineers, technical operations, or systems administrators to give them the green light.
So that causes a lot of workarounds, with people just trying to put their own things into production without having the background or the time to consider all the ramifications that brings. We all have the same goals at the end of the day; we just have different thoughts about how to go about it. And from the technology side, in data engineering, some of the conflicts within the role come from people in the business, whether it's a VP or the CEO, who say, "I want to be able to answer a question about what's happening to customer acquisitions as of two months ago, because of this new feature that we shipped." And they want to answer that now, because they have a critical decision to make for the board meeting tomorrow. But you just heard about this, and the data to answer it is actually in three different places.

This can often be because of data silos, where you have application databases with one view of reality, your CRM system with another view, your Google Analytics or web tracking with yet another, and you need to pull all of those things together and hand it to your data scientists in an easy-to-access fashion so that they can go work their magic and provide the insights your CEO needs to answer the question he asked you. And compounding the issue of the data being in five different places, it's all in different formats, with different schemas and maybe different timestamps, and you need to figure out a way to reconcile those and keep your sanity at the same time, which isn't easy. And as I mentioned before, there is constant change and innovation in the tools, platforms, and techniques, and also in the regulatory environment, for implementing all of the changes, requests, and systems necessary to build the platforms that enable self-service for the people trying to answer these questions.
So this is a high-level snapshot of some of the tools and platforms that data engineers are trying to grapple with today. This isn't even everything, and it changes every year, maybe even every week. But in operations, we're no stranger to that; these are all the tools we have to deal with and try to keep up with every day, or that we want to be able to use. So, as I mentioned, we all want to bridge that gap between having stable, reliable, repeatable systems that we don't have to lose sleep over, where we can take vacation and not have to worry about what kind of fire we're going to come back to at the end of it, and, at the same time, enabling our application developers and our data scientists to build exciting new features, answer valuable questions, and interact with our platforms and our products.

A common refrain that has been one of the tenets of DevOps for a while, and that data engineering teams are beginning to embrace, is the idea of internal product teams. On the outward-facing side, for application developers and data scientists, their customer is someone either in the C-suite or external to the business entirely, whereas our customers are the people within our own organizations. They're the application developers, for whom we're building reliable platforms so that they can ship features and access all of the system metrics and logs that they need to debug things when everything breaks. They're the data scientists or business analysts who want to access the different data sources, combine them, and gain insights from them, or the people trying to build business reports so that they can answer the questions they need to run the business day to day. And the work that we all do, as people fairly low down in the stack, is mostly invisible, because honestly, it should be. The only time that we're visible is when everything breaks. But that can be a lonely place to be sometimes.
And there's the constant need to stay current, because of the constant change in regulations and data sources and data formats and new business ideas. That requires a broad range of skills to be effective within these roles. But nobody's ever going to be an expert in everything, because it's impossible. So we have to settle for being adequate at a lot of things, and able to pick them up quickly when we actually need to start using those different tools or techniques.

And on the technical side, a lot of data engineers' work has to do with distributed systems, because you can't fit all of your data on one box anymore. So we're dealing with complex systems, lots of failure cases, lots of complexity. And deploying them is hard. A lot of times the people building these distributed systems are coming from academia, where they're happy when they get it to work, and operationalizing it is a secondary concern. So you might have platforms where you can get all the nodes up, but then you have to go point and click in a GUI, or run a command on every machine, to get everything to join together in a cluster and talk to each other nicely. Or it might not be designed for cloud environments, where failure rates are high, and if a machine goes down all of a sudden the entire cluster is dead. So that's challenging; we all know that repeatability story.

The consistency of the data that they're dealing with can also be problematic, because you might be using the Twitter firehose as a data source, and that's messy. They might change their APIs. You might be dealing with web analytics, where people have different plugins or ad blockers, or somebody might be trying to spam your system. So there are a lot of edge cases to deal with there. And being able to effectively test and monitor data systems is hard, because testing stateless things is easy: you can just blow it away and start over.
But when data is actually the primary concern, it adds a lot of weight and gravity to the systems that you're trying to deal with. So it's not always obvious if it's even possible to test some of these things except in production. There are a lot of inherent difficulties there. And then there are also issues, because of data volumes and data gravity, with reliably standing up a non-production environment to get any real sense of whether the changes that you're making are going to work, because they might be fine when you're dealing with one or 10 or 15 gigabytes of data, but then when you go to production and you're dealing with a terabyte every week, it's a different world.

Whereas on the application side, we have issues of our own with data systems. We're collecting large and growing volumes of log data, metrics, and tracing, and we need to figure out how to store and integrate those across multiple systems and platforms and actually make some sense out of it at the end of the day, other than to say, "I have logs. Yay." And managing data across environments is hard for us too. We might not necessarily be dealing with the same scale of data, but we might. And that's a challenge that, as far as I'm aware, is still sort of yet to be solved, though people are tackling it in their own ways.

So, some techniques that we can borrow from each other: on the data side, people are starting to take up configuration management to try to wrangle some of those distributed systems problems, get their data infrastructure set up reliably, and get some approximation of a production environment set up, without having to actually touch production, so that they can make some of these changes and test things before they go live.
They're using CI and CD practices to iterate quickly and have fast feedback cycles on new ETL pipelines, new streaming logic, or any of the changes that they're trying to make to the system. They're implementing feature flags because, given that you do have to test a lot of things in production, you want an easy kill switch for when everything starts breaking. So start considering adding feature flagging to your ETL systems or your streaming systems, whether it's Spark or Flink, and have the option of doing canary deployments to a small set of early adopters within your organization: "Hey, I built this new feature. It's not fully baked yet. Can you test it out? You just need to set this flag in your browser or in your console," or whatever they would be accessing the data with, to see those changes and give you that feedback cycle. That way you know whether the changes you're making are effective, without having to wait all the way from data source to the hand-off to your CTO or your CEO, who then wonders why all of the numbers look wrong all of a sudden.

Containers are also increasingly coming into vogue in the data engineering space. A lot of things like Spark and Airflow are starting to become Kubernetes native, which is starting to alleviate some of those issues with managing distributed systems. But Kubernetes is a distributed system too, and it's hard. Still, we're starting to converge on some of the same problems. And then there's agile, incremental system development. Say you need to build a fancy report.
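As a rough illustration of that feature-flagging idea applied to a pipeline, here is a minimal Python sketch. The flag store, flag name, and record fields are all invented for illustration; in a real deployment the flags would come from a config service or environment variables rather than a module-level dict.

```python
# Sketch of a feature flag guarding a new transformation in an ETL step.
# FLAGS is a stand-in for a real flag store; flipping a value to False
# is the kill switch that rolls everyone back to the old code path.

FLAGS = {"use_new_dedup": False}  # illustrative flag name

def flag_enabled(name: str, default: bool = False) -> bool:
    """Look up a feature flag, falling back to a safe default."""
    return FLAGS.get(name, default)

def transform(records):
    """One pipeline stage with old and new behavior behind a flag."""
    if flag_enabled("use_new_dedup"):
        # New, not-fully-baked logic: deduplicate events per user.
        seen = set()
        deduped = []
        for record in records:
            key = (record["user_id"], record["event"])
            if key not in seen:
                seen.add(key)
                deduped.append(record)
        return deduped
    # Old, battle-tested path stays the default for everyone else.
    return records
```

Enabling the flag only for a small set of early adopters gives you the canary-deployment feedback cycle without waiting for the numbers to reach the CEO.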
It has 15 different data sources, and traditionally you may have tried to do it all in one big bang delivery, but it's probably easier to start with one data source, then another, and incrementally add those different systems together until you get to the final product, so that you can verify whether it's even useful. Because you might spend six months building out an entirely new data platform, new ETL logic, new streaming systems, and by the time you're done, nobody cares about the answer to that question anymore. So you want to get easy wins, get fast wins, and make sure that it's worth continuing down that path. And another idea that has gained a lot of popularity recently in systems is chaos engineering: kill an instance, see what happens. Just don't do it in production unless you're really sure that it will work.

And then on the operations side, some of the practices that are coming over from data engineering and data science platforms are things like append-only log structures, for consistency and durability, and for having one central flow of information that you can then pull off of for your own uses. Kafka has gained a lot of popularity in infrastructure systems for this reason. Distributed file systems are useful for scaling across multiple instances and handling failure cases gracefully; things like Ceph have gained a lot of popularity, and Ceph is actually one of the main underpinnings of storage for OpenStack. And there's using data routing, enrichment, and integration for operations concerns: things like Fluentd and Logstash are valuable for consuming data from multiple systems, maybe merging it together in flight, and distributing it to other systems to increase the visibility that we have across our multiple platforms.
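To make the append-only log idea concrete, here is a toy Python sketch of the pattern that systems like Kafka implement at scale: producers only ever append, and each consumer keeps its own offset into one shared history. This is purely illustrative and not how you would talk to Kafka itself.

```python
class AppendOnlyLog:
    """Toy version of the append-only log pattern: history is never
    mutated, only appended to, so any number of independent consumers
    can read (and re-read) the same central flow of information."""

    def __init__(self):
        self._entries = []  # immutable history; never updated in place

    def append(self, event: dict) -> int:
        """Producers only append; returns the offset of the new entry."""
        self._entries.append(event)
        return len(self._entries) - 1

    def read_from(self, offset: int):
        """Replay everything from a given offset, which is what lets a
        metrics consumer and an alerting consumer each track their own
        position, and lets you reprocess history after fixing a bug."""
        return self._entries[offset:]
```

Because the log is the single source of truth, adding a new downstream system is just adding another consumer that starts reading from offset zero.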
There's change data capture, for replicating production data into non-production environments, so that you can have the QA data you need, an approximation of production, and be confident that the changes you're making make sense, and that you can do that database modification without everything dying while you're rolling over the application code. There's starting to use ETL (extract, transform, load) techniques, or alternately ELT, which is sort of the new evolution of that, for concerns such as data backups or migrating data across environments, where before we might have just thrown together a quick bash script and said, "It works today; if it breaks tomorrow, I'll just tweak it." With visible systems, you and other people across the organization can see that these backups actually run. This can also be used for anonymization of production data into QA environments, where you have that transform step before you load it into QA.

And then there's the idea of tiered data storage. In things like Elasticsearch, it's great to have access to all of our logs super fast, but do you actually need the logs from three months ago in 10 milliseconds? Or can you age them out to S3 or some other slower storage system, where you can then run a batch job across them to get the insights you need over longer time horizons? And there's metadata management. It's absolutely critical in data engineering systems, because you need to know: what is the data, where is it, and why is it there. And that's similar in infrastructure, where we want to keep track of what our instances are, how they're aging out, what assets they have, and, you know, when our SSL certificates are expiring.

So, in data engineering: ship early, ship often, and make sure that the changes that you're making are going to be useful.
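As a sketch of that anonymize-in-the-transform-step idea, here is a minimal Python example. The field names are invented, and a real pipeline would also need to handle nested records, referential integrity across tables, and so on; hashing sensitive values rather than dropping them is one way to keep join keys stable in QA.

```python
import hashlib

# Illustrative list of fields we never want to land in QA as-is.
SENSITIVE_FIELDS = {"email", "name", "ssn"}

def anonymize(record: dict) -> dict:
    """Transform step of an ETL job that copies production rows into a
    QA environment: sensitive values are replaced with a stable hash,
    so joins on those columns still work, but no real personal data
    leaves production."""
    out = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS and value is not None:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            out[key] = digest[:12]  # short, deterministic pseudonym
        else:
            out[key] = value
    return out
```

Because the hash is deterministic, the same production email maps to the same pseudonym in every table, so QA queries that join on it behave like production.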
Don't operate in isolation. Talk to your customers, whether that's the data scientist in the cubicle across the hall or your VP of Engineering or your CEO: "What is it that you're actually trying to do? What do you actually need from me to make sure that you're going to be successful?" DevOps is about unifying the entire business along a single goal, and that goes for your data teams as well. And make sure that you build in feedback cycles everywhere you can, whether that's early unit testing or schema validation before you store data in your data lake, to make sure: is this data I'm actually going to be able to use for anything?

On the operations side: at the end of the day, everything is just data, whether it's your code, your logs, or your metrics. Treat it as data, and that might give you some ideas for how you can use some of these new platforms and techniques being developed in data engineering and data science to make our jobs easier. Use things like event sourcing, change data capture, maybe command query responsibility segregation (CQRS), to make our systems more resilient to failure, so that we can have higher reliability in our monitoring and our logs. And I'm going to say it again: distributed systems are hard, so we should know the trade-offs. If you haven't ever heard of the CAP theorem, you should look it up, because it's a very useful way to think about the systems that we run every day, and from everyone I talk to, it holds true no matter what. So it's worth having an idea of what it is.

So I want to thank you for listening. Up here are the podcast episodes that sparked some of these ideas, and other resources that are useful for learning more about data engineering and some of the operational concerns that folks on that side of the fence are dealing with. So with that, I'll open it up to questions.
If you raise your hand, somebody will run up and bring a microphone, so that we can get it on the video feed for people watching on YouTube afterwards.

So, what are your thoughts on some of the new technologies around deploying AI that can take a bunch of unstructured data and do a lot, where before there was a need to have everything, I guess, normalized and structured in such a way that it could be analyzed?

It's definitely very useful. It can save a lot of time on our end for gaining insights into the systems without having to have somebody go through and visually comb through all the logs to see what some of the anomalies are. But, as with everything, it's garbage in, garbage out. You can't just dump everything in and say, "AI is magic, it's just going to work." You do need to have some consideration of what information you're feeding it. Is it all a consistent schema, or is it consistently schema-less? Is it all text data that you can maybe run some natural language processing against? Something like that. So it's definitely useful, with caveats.

Yes: how have you organized yourselves? Do you have a DevOps organization or team, a data engineering team within that, a data science team within that? How do you do that?
So the group that we're in is pretty small. We actually have one data scientist on staff right now, and so, along the lines of "ship early, get easy wins," right now we are relying on a tool called Redash for hooking up to all of our different systems, mostly our application databases, and building reports off of that, so that business users can get the answers that they need and have self-service access to some of those reports. Another platform similar to that, if you're in the early stages, is Metabase. And as we start to plan out a more comprehensive platform, we're starting to implement a data lake approach, where we're using our Fluentd log aggregator to ship information to S3, so that we can then run things like AWS Athena or other systems against it to get some insights from that information.

Hi, I have a question about processes for data management. In a big corporation we probably have multiple sources, and we'd like to put them into one data source as a source of truth, I guess. And then, as you mentioned earlier, there are multiple teams, like data scientists, software engineers, and systems admins, and they all have different ways of looking at things. So do you recommend that they all tap into the same source of truth, or should we have different data lakes and things like that?

So, that somewhat depends on the access patterns and the level of sophistication within the teams. The data scientist might be perfectly happy going and combing through all the raw logs, running some sort of algorithm against them, whereas your operations manager just wants to go to a dashboard and get an answer right away. So having different tiered systems, as far as the maturity of the data and how processed it is, can be useful. And a valuable approach to make sure that people aren't all getting different answers about different things is to look into something called master data management, where you do a sort of entity extraction to make sure that you can say this AMI ID equals this instance name equals this business process, so that you can run the same query against multiple sources but all be asking about the same entity, and not have confusion come up about what everybody is talking about.

So I've got a question. You create a lot of content, not one but two podcasts, and you bring a lot of value back to MIT. Can you speak to something specific that you've learned through the creation of the data engineering podcast that has had value at MIT?

Yeah. So actually, in an episode I just published yesterday, I was talking to somebody who has a lot of history and context in enterprise data management and data curation, about what approach you can take to go from "I'm starting to architect a data platform" to ultimately having a unified way for non-technical people in the organization to ask questions. And he discussed that sort of tiering approach of aging your data through different levels of maturity, where you might land it in a data lake, largely schema-less and only slightly processed, and then expose that to an analyst or a data scientist to start building some reports off of it and figure out what data is actually useful, then encapsulate that in the form of different tables and reports in a data warehouse, so that they're faster to query and easier to analyze but might have some reduced level of granularity. So, taking a very life-cycle-oriented approach to how you store, access, and process the data.

Hi, I saw you mentioned CI/CD for ETL. Do you have any recommended tools for that?
So, I mean, it's still an emerging space. There's one tool, whose creators I talked to a little while ago, called Great Expectations, that's focused more on the pandas and Python space, where, as you're going through this exploratory analysis of the data and processing the data frames, you create these different expectations, which, after you're done with your exploratory analysis, you can then export as a set of unit tests to run, so that when you're actually doing these operations in production you can validate at each stage that the data is the shape that you're expecting, the volume that you're expecting, things like that. There's another tool, whose creators I also talked to recently, called QuerySurge, which is for running unit tests against your SQL schemas and your SQL queries, to make sure that you're building the reports you think you're building and that they're returning the shapes of data you're anticipating.

Those are some of the ones that come top of mind, but then there's also just, you know, doing quick schema validation checks, and doing checks to make sure that the data volume is what you expect, so that if you traditionally get, say, five gigabytes a day from Twitter, or whatever it is, and all of a sudden you're getting one or 50, you raise the red flag: look, something's probably wrong there. And making sure that the data is actually making it through all the different stages of your pipeline, just doing quick checks of, "I haven't seen anything from you in five hours; what's going on?" So just having some of those fast checks, easy validation that everything is wired together properly. Yeah, and come find me afterwards if you want to ask more questions.
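Those volume and freshness checks are simple enough to sketch. Here is one hedged example in Python; the thresholds, units, and function names are chosen purely for illustration, and a real pipeline would wire these into its scheduler or monitoring system.

```python
import time

def volume_ok(observed_bytes: int, expected_bytes: int,
              tolerance: float = 0.5) -> bool:
    """Flag a feed whose size deviates wildly from the usual volume,
    e.g. one gigabyte or 50 gigabytes arriving when roughly five
    gigabytes a day is the norm. Tolerance is a fraction of expected."""
    low = expected_bytes * (1 - tolerance)
    high = expected_bytes * (1 + tolerance)
    return low <= observed_bytes <= high

def is_stale(last_seen_epoch: float, max_silence_seconds: float,
             now: float = None) -> bool:
    """The 'I haven't seen anything from you in five hours' check:
    True if a pipeline stage has been silent longer than allowed."""
    now = time.time() if now is None else now
    return (now - last_seen_epoch) > max_silence_seconds
```

Running checks like these at each stage boundary gives you the fast, cheap validation that everything is wired together, long before the numbers reach a report.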