Thanks, everybody, for coming in here today. I know we're reaching the end of our first day and people are quite tired, maybe a little bit jet-lagged, so I appreciate everyone who's come in and is interested. A little bit about me: my name is Lisa. I'm a product manager at Datastrato, quite newly minted in this role, but really enjoying it. I previously spent a few years as a data engineer, and as a data analyst before that, across a few different industries, mostly startups and mid-size companies, and it's been really fun to see the different data architectures and the different approaches to DataOps. I'm also a Google Women Techmakers ambassador, and I've been part of that community for a little while now.

A little bit about what I hope you'll leave this talk with. This is a huge talk in terms of the scope we're looking at, and I'll give you a quick warning that we're going to cover a lot of topics but we're not really going to get into the meat of anything, and that might be a little unsatisfying. But that's sort of the point of this talk: that you get a true end-to-end perspective of DataOps and MLOps and the different aspects around them. Few of us get to really touch all of the things I'm going to talk about, so how can we gain that sort of perspective? I also want to provide some relevant frameworks, especially within the open-source space; all the technologies I'm talking about today will be exclusively open source. And I want to cover where you can concentrate your efforts. Teams can be stretched really thin, especially when it comes to DataOps, so we want to make sure our effort and energy are being put in the right places. I also want to go over some general definitions. DataOps, I think, can be a bit of a buzzword.
I think a lot of people don't necessarily agree on a consensus definition of this term. I'm not going to claim that this is the definition the entire field agrees upon, but for the purposes of this talk I want to consider DataOps as covering all of your data automation and operation needs, especially from a DevOps perspective, with regard to your source systems and pipelines, and securing those as well. If there were such a thing as a data supply chain, this is where I would be tempted to use that terminology. And MLOps I see as the end state of it. I don't want to consider it a tag-on, but I do think MLOps is so specific that it really deserves its own sort of culture around it. So that's how I'm going to be working in terms of definitions here.

Some of the DataOps components I'm going to discuss are the orchestration and abstraction of your pipelines, and how you can implement governance and data contracts.
We'll also look at metadata management, and at using all of these tools for the benefit of your data quality, your monitoring, and your observability, and at how we can bring core DevOps concepts, such as CI/CD, containerization, deployment, and documentation, into your data platforms and data teams. So again, this is a huge amount of ground to cover, but I think it's really useful for us to be able to look at it all.

Some of the increased needs for DataOps that have emerged over the last few years: I think the big push has been the move to data platforms and data mesh, really moving into non-linear ways for us to serve data, and trying to bring reliability into these systems. Migration into cloud data systems has been really important; a lot of data teams work almost natively in the cloud now, or maybe hybrid with on-prem, but it's quite rare to see solely on-prem, and with that comes a lot of considerations. There's increased complexity in our data pipelines and data systems as well, just in terms of the number of sources we have to manage, the number of formats, what we process for, and, now that machine learning is in full swing, how we manage data quality specifically for machine learning. Data teams have also become siloed and spread out: instead of an organization having one data team, you might have an embedded data analyst on each team. How can we make sure all of these analysts or machine learning scientists are equally supported across their organizations, and especially that they're getting the best data quality possible without constantly feeling like they have to feed back into another team? So again, we're seeing all of these different architectures now, and we have to ask how we can best support them. Some of the benefits
that we're looking at for DataOps: I don't think I need to sell this too hard, but visibility is really at its core. How can we track our data lineage? How can we introduce observability into our systems? How can we assess data quality at a really high level when we're dealing with such gigantic data systems? Practicing data governance is really important these days, especially when it comes to legal compliance and security, and the uptime and guaranteed availability of our data are something to really consider. How can we streamline so much of our data infrastructure and architecture, and abstract it out in a way that makes sense but doesn't oversimplify things? We also want to introduce core software engineering principles such as versioning, iterative development, and configurability into our work. A lot of this might seem really intuitive, but if you've been part of a data team made up of just data scientists and data analysts, you'll know that these skills have to be learned, and that has to be really intentional.

So, implementation challenges when it comes to DataOps. I'm going to be honest: the biggest one here, in most organizations, is going to be stakeholder buy-in. All of this is infrastructure, all of this is debt, all of this is stuff that needs to be continuously maintained. If you're a data-driven organization, where machine learning or data is the core of your product, you might have an easier time diverting resources into this sort of thing, but if you're like 90% of the other companies out there, this is something you kind of have to finagle your way into. So again, think about interoperability between our different technologies. How do we standardize our formats and our practices?
How do we avoid vendor lock-in, which in terms of scalability can be very costly down the line, and all of these other things that make us wary of adopting new technologies and new practices? We might think of this process as very straightforward, but it's really quite iterative, because we want to go back and improve processes as holes pop up; we're not going to do it perfectly the first time around.

So I'd like to propose four different strategy domains for organizations to focus on. They're not all equal; the pressure, and the ways we run into these things, tend to vary depending on what your business goals are. I think a lot of the push for DataOps and increasing quality really comes down to "hey, there's something wrong with the numbers, let's go back and see where it went wrong." That's our first domain: accuracy and data quality. How are we affecting our downstream reporting and metrics? Are our models outputting the things we want them to, and how are they performing? This is going to be your biggest initial push, because all of your stakeholder trust comes from this, and it's the thing that affects the core business the most.

Next, I want to go into visibility and transparency. Once you have a complex enough system, having data lineage, and being able to track the different stages of your data and its source systems, is really important. This can also take the form of metadata collection, observability, and documentation: overall, making sure this isn't just a black hole that data comes out of without us knowing where it came from, and that when something changes, we can track that over time.

Reliability and latency is the next domain we really have to look at. At a certain scale, you're going to want to introduce some
sort of monitoring into your data. Are these values looking the way we want them to? Are there giant holes in the data that's coming in? And how do we optimize our data systems to be less costly? I think FinOps is a really good application of this as well. How do we introduce fault tolerance into our systems, and just generally good principles?

Security is going to be our final domain. Not that it isn't extremely important, but if you're coming from a data background it does tend to fall by the wayside, so it's something that needs to be very intentionally brought in and discussed.

Again, I'm going to go over this data pipeline, though I don't think many people here need to be told what an ETL pipeline looks like. The whole point of me bringing this up is to show that there are different hands on all of these stages; there are different people who touch, and are in charge of, processing and moving the data, and at every single stage we need to consider the underlying currents that might play into that. We might think of building this pipeline as a one-and-done: okay, it's there, it's up, and it's moving. But really, we want to make sure it's maintainable over time, that we can iterate on it, and that we have visibility and control over its different aspects. That is what DataOps is meant to be, and it's why I consider it a little different from normal DevOps, and also different from MLOps.
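Those monitoring questions, are the values what we expect and are there giant holes in what's coming in, reduce to checks simple enough to hand-roll before reaching for a tool. Here's a minimal sketch in plain Python; the `(timestamp, value)` row shape and the thresholds are invented for illustration, and a real team would tune them per dataset:

```python
from datetime import datetime, timedelta, timezone

def check_batch(rows, expected_range=(0.0, 100.0), min_volume=3, max_age_hours=24):
    """Return a dict naming each monitoring check that failed for this batch.

    `rows` is a list of (event_time, value) tuples. All thresholds here
    are made-up defaults, not recommendations.
    """
    issues = {}

    # Volume: are there giant holes in the data coming in?
    if len(rows) < min_volume:
        issues["volume"] = f"only {len(rows)} rows, expected at least {min_volume}"

    # Freshness: is the newest record recent enough?
    newest = max((ts for ts, _ in rows), default=None)
    if newest is None or datetime.now(timezone.utc) - newest > timedelta(hours=max_age_hours):
        issues["freshness"] = f"newest record: {newest}"

    # Distribution: are the values inside the range domain experts expect?
    lo, hi = expected_range
    bad = sum(1 for _, v in rows if not lo <= v <= hi)
    if bad:
        issues["distribution"] = f"{bad} values outside [{lo}, {hi}]"

    return issues
```

A healthy batch returns an empty dict; anything else is a signal you can alert on, which is the whole idea behind wiring these checks into a pipeline rather than running them by hand.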
So, some of the tooling you'll see around this. There's a ton of tools around data pipelines, and we'll get into DataOps tooling later, but I want this slide to point out that a lot of the open-source concentration has been around just manipulating data, and I think that makes sense considering where we are. But really, we want to consider the cost-optimization undercurrents of this, what our actual business requirements are, and the drivers behind them.

But again, bringing it back to data quality. When we go back to this particular image, it's very straightforward and oversimplified, but at every single one of these stages your data can get wonky. Any time somebody touches something, any time there's a process or a step or some code in place, you're putting your data at risk. So data quality issues can be unavoidable, especially at scale, when you're dealing with so many complex pipelines. Not all the metrics we measure data quality by are equal, and, more importantly, not all of them are going to be worth your time to pursue. That's really important to take note of: when we talk about ops, everything is a nice-to-have. If you pitch it to somebody, they'll say "yeah, that sounds great," but then you ask them for the resources, the time, and the money, and they say "oh, maybe not, we have more important things to do." So the way we pitch a lot of this within an organization is going to be centered around our business needs: we want to gain consumer trust, and we want to have trust in our own data systems. One of the early data engineering examples of this is the Write-Audit-Publish pattern, which you can implement with many different open-source tools on the market, and I think that's a really great example of how we can build these processes innately into our
systems and pipelines. So when we talk about data quality, there are really these five pillars of data observability. You've probably seen these before, but how do we actually measure the shape of our data? How do we make sure that what's coming in is what we expect, and how do we track these things over time? These are the key metrics for observability; if you're going to track something over time, it might as well be these five. But again, they're not all equal: the bubbles are different sizes depending on the data you're working with and on your organization.

The first is freshness: is this data recent, is there a hole in when our data is being pulled, and is it up to date? Second, the distribution of the data: is it within the values we're expecting? This requires a bit of domain knowledge, it depends on what your data actually is, and it requires a cross-functional aspect as well. As a data engineer, am I working with the analysts who actually understand what this data is supposed to look like? Can they tell me whether these ranges look okay now, or will they tell me six months down the line, right before reports are due, when things are already broken? That's a really disturbing thing to have to deal with. Third, lineage: can I track the lineage of my data? This becomes more important as these iterative models come out, and we want to make it really easy to track what data we're feeding into them. Fourth, schema: is the schema of our data actually stable? Did somebody push an engineering change that wasn't caught, is that going to break something, and how can we track that?
I think data contracts are a really great way to tackle that particular issue. And fifth, of course, the volume of the data, which I think is our first instinct when we run a SQL check against our incoming databases.

There are a couple of different tools we have around this as well. I'm not saying you need to use all of these, and I don't think you should, because some of the technology probably isn't relevant to your stack. But because data quality can break at every single stage of your pipeline, you want to be checking it at every single stage as well. That's kind of unrealistic, but check at least at ingestion, maybe at the source-system level, and right when it reaches an analyst. I think Great Expectations is a great example of a tool a data scientist can use to test a data frame that's already been modeled, and earlier on you might want to use dbt, if you have an analytics engineering team, and do some unit testing there as well: unit-testing our data at various stages to make sure it's being shaped the way we want.
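That idea of gating data at ingestion is exactly what the Write-Audit-Publish pattern from a few slides back formalizes: land a batch in a staging table, audit it, and only then promote it to production. A minimal sketch with SQLite; the table and column names are invented, and real implementations typically use tools like Iceberg branches or lakehouse staging areas rather than a second table:

```python
import sqlite3

def write_audit_publish(conn, batch):
    """Write a batch to staging, audit it, and only then publish it."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS staging_orders (id INTEGER, amount REAL)")
    cur.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)")

    # Write: land the raw batch in staging, never directly in production.
    cur.execute("DELETE FROM staging_orders")
    cur.executemany("INSERT INTO staging_orders VALUES (?, ?)", batch)

    # Audit: run quality checks against the staging table only.
    nulls, negatives = cur.execute(
        "SELECT SUM(id IS NULL), SUM(amount < 0) FROM staging_orders"
    ).fetchone()
    if nulls or negatives:
        raise ValueError("audit failed: bad batch kept out of production")

    # Publish: promote the audited batch into the production table.
    cur.execute("INSERT INTO orders SELECT * FROM staging_orders")
    conn.commit()
```

The point is that a failed audit leaves the production table untouched, so consumers downstream never see the bad batch.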
I also added Apache Airflow here. It's not necessarily supposed to be used this way, but when we bring in our orchestration tools there are a couple of ways we can break the circuit when something doesn't look right, or add things like SQL check operators, and it depends on the cloud environment you're working in as well.

Metadata can also be a really powerful tool. Moving to this bigger plane, beyond just data quality: we want a really good understanding of our data without having to invest in all of this infrastructure just to test hard-coded things, and having dynamic, active metadata can enable so much more in terms of your infrastructure. Some examples are the descriptive, structural, and administrative context you can attach to your data, and attach easily. It's useful for data discovery and lineage, and being able to virtualize your data in general can be very powerful, especially if you're on something like a data platform team. There are a couple of really cool tools out there for this; Apache Iceberg, I think, is really popular these days.
There's also Amundsen and DataHub, and Gravitino, which is where I'm coming from today. The company I work with is called Datastrato, and I do product with them, and we're open-sourcing our metadata lake today, which is called Gravitino. We have a little QR code if you want to be taken directly to our GitHub repo, and we're still actively working on this. Essentially, what we're trying to do is create metadata virtualization layers on top of your existing databases. So if you're working in a multi-cloud environment, say GCP and AWS, or Azure and AWS, how can we make sure that metadata is consistent across all systems, so you have a really technology-agnostic way of accessing all the different pieces of data in your system without having to do really complex joins across them? Sometimes that's impossible as well. What we're really trying to do is create a single source of truth for all of your data needs, a layer people can put a lot of trust in.

So, to review the framework I've brought in: there are a few different domains, but also a few different places you want to concentrate on. Again, basic version control seems really intuitive to those of us who come from programming backgrounds, but having these principles heavily ingrained in your machine learning engineers and data scientists is really important, as is bringing in specialized tools to version your data, like DVC, or Debezium if you really want to version the actual database, as well as implementing data contracts, similar to API contracts, where you make sure your schemas are enforced in some way as data is being ingested.
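A data contract in that API-contract sense can start as nothing more than an agreed schema checked at ingestion. A minimal sketch follows; the field names and types are invented for illustration, and real deployments usually express this with schema registries or tools like Avro or JSON Schema rather than hand-written checks:

```python
# A data contract as a plain mapping from field name to required type.
# These field names are hypothetical, not from any real system.
ORDER_CONTRACT = {"order_id": int, "amount": float, "currency": str}

def enforce_contract(record, contract):
    """Reject a record whose schema drifts from the agreed contract."""
    missing = contract.keys() - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    bad_types = [
        name for name, typ in contract.items()
        if not isinstance(record[name], typ)
    ]
    if bad_types:
        raise ValueError(f"wrong types for: {bad_types}")
    return record
```

Enforcing this at the ingestion boundary turns a silent schema change upstream into a loud, attributable failure, which is exactly the conversation a data contract is meant to force.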
Metadata can also be really important for letting other applications query your data and build up an image, really a shape, of what your data is, gaining a lot of information without doing all of this heavy processing; that's what we're trying to do at Gravitino. We also want to be able to monitor our data systems, create alerting for when they're down, and deploy and document them as needed.

Another aspect of DataOps I want to discuss is data pipeline management through incident management. I think a lot of folks from DevOps backgrounds understand the idea of being on call or doing a root cause analysis, but how can we implement those principles for data as well? If a data system is down, or a pipeline is broken, who owns that, and who is the person on call to fix it? A lot of this ownership is really just documentation, but it's really important for us to take that responsibility and to understand how it benefits our organization down the line. We also want standardized processes for doing root cause analysis on our data: if our numbers are wonky, who is going to do what, and what steps are they going to take?

I also want to discuss MLOps a little. MLOps has been a really big part of this session and this track, so I'm not going to claim that my MLOps portion is going to be any better; in fact, you'll probably get a little more detail elsewhere. But I want to give an overview of what it might look like, because we've discussed pipelines so much up to this point.
So really, we're looking at the testing, validation, and reporting of these different systems. We want to make sure our models are deployed safely, and deployed in a way that we can manage really effectively using large-scale infrastructure. We also want all of the different experiments we're running to have metadata that can tell us which version of the validation or test set was used, which version of the model, and so on. And we want to be able to detect drift over time: these models are deployed over years and years, so how can we automate a lot of these processes?

Here's a not-super-complicated graph, just an image to show what the components around this might be and how we might look at them. It differs a lot from the first image I showed earlier, and that's part of why I consider DataOps and MLOps to be so different.

Our roadmap here is pretty straightforward. We want to version control everything as a first step, especially when things are very experimental and we're just trying to get them to work. How can we use Git tags and GitOps to make sure we can backtrace the lineage of our data and the models themselves? How can we include continuous deployment when we're deploying onto massive clusters? How can we package this in a way that's reproducible, especially if we're deploying onto devices natively? How can we then monitor all of that in a way that's scalable and manageable? And of course,
I think documentation is really important, especially documentation for non-data folks. There are a lot of people in your organization who would benefit from just having layman's-terms documentation around the data system, and I think that's really important to emphasize.

So again, some of the frameworks we might use to develop our models. These are pretty standard tools that I think most people here will be very familiar with, so I won't go into them too much. But we want to test at this end state as well: as we develop the models, how do we test the models themselves, and how do we test the data? How can we then serve all of this in a way we feel confident in, and build out the tooling around it that makes sense to us and fits our needs? We also want to make sure we're tracking the long-term effects. A lot of this lives within the Python ecosystem currently, whereas in DataOps you have so many different tools spanning a lot of different languages, and I think that's where a lot of the trickiness resides.

So, yeah. We also have another talk if you're interested in learning about Gravitino; that's going to be in L20d, hosted by Jinping, right after this session, and I'll be speaking there as well. We're also hiring, so if you're interested in working with us and interested in metadata, please let me know or visit our website here. Thank you so much. Do we have time for questions? Any questions at all? Well, thank you very much.