Hi, everyone. I'm Bernice Herman, a data scientist at WhyLabs. I'm here to talk to you today about why static datasets aren't enough, and where deployed systems differ from research. Just to note: this work does not represent the views of the University of Washington, where I am also a data scientist. Why do I feel that static datasets aren't enough? I'd say my varied experiences in both industry and academia have informed these views. Currently, I'm a data scientist at WhyLabs, the AI observability company. We created and maintain whylogs, an open source data logging library that uses statistical profiling for an efficient logging solution that scales and works in real time, along with many more features. I'm also a research scientist at the University of Washington eScience Institute, where I conduct research on evaluation metrics and model interpretability. Along with others at the eScience Institute, I also run programs like the summer Data Science for Social Good program and academic hack weeks. Before that I was a software engineer at Amazon, and before that I did research at Morgan Stanley. All of those experiences feed into my views on how we think about static datasets versus real-time datasets, because there are some real differences across industry and academia. All right. So much of the machine learning and data science learning you do, the resources and tutorials, is built around static datasets. Some example datasets are up here on the slide.
These are datasets that have been collected, maybe from the internet, maybe curated more deliberately in some way, but they're represented as just one large CSV file, HDF5 file, or other single source, and treated as a whole. But realistic datasets, especially when you deploy systems in the real world, often change over time. Here's an example CSV, which I think you have to include for this conference, from Pronto, a bike-sharing company that operated in Seattle, containing all of the bike trips over some period of time. This is a classic example of a dataset with several time columns, and when we analyze datasets like this, we should keep that time in mind: we're indexing by time, and our analysis shouldn't assume we have future data. We should only ever look at the past, never forward. Here's one graph of that data from the whylogs open source library; we'll talk a little more about it later. In deployed systems, the static approach to data leads to these periodic dataset patches and model retraining. You start with a static dataset, do your experiments, training, and evaluation, and then deploy. At some point, who knows when, maybe you've run into issues, maybe enough time has passed, or maybe you've accumulated a significant amount of inference data, you take the data collected while deployed and fold it back into that static training dataset to retrain or update. I consider that a patch to the dataset: it creates a single updated dataset. These are still static datasets, despite the fact that we live in a world of time. Maybe we should instead be logging and storing our data in a dynamic way.
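The "never look forward" rule can be made concrete with a minimal sketch in plain Python. The records and column names below are invented for illustration, not the actual Pronto schema:

```python
from datetime import datetime

# Hypothetical bike-trip records, indexed by start time
# (field names are illustrative, not the real Pronto columns).
trips = [
    {"start": datetime(2015, 10, 1, 8, 5), "duration_min": 12},
    {"start": datetime(2015, 10, 1, 17, 40), "duration_min": 25},
    {"start": datetime(2015, 10, 2, 9, 15), "duration_min": 8},
    {"start": datetime(2015, 10, 3, 7, 55), "duration_min": 31},
]

def trips_known_at(records, as_of):
    """Return only the records observed strictly before `as_of`.

    Any analysis run "as of" a moment in time should see past
    data only, never rows whose timestamp lies in the future.
    """
    return [r for r in records if r["start"] < as_of]

past = trips_known_at(trips, datetime(2015, 10, 2))
print(len(past))  # 2: only the October 1st trips are visible
```

Every downstream statistic (averages, counts, model features) would then be computed from `trips_known_at(...)` rather than from the full table.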
So, just to get your head around the concept: we have a static training dataset, and we may do our experimentation, training, and evaluation on it. But after we deploy the model, we will continue to see more data that gets added to what we have at our disposal. We should think of these as separate datasets that can be combined, rather than as one fixed dataset that will never change. The fixed view doesn't let us dynamically add to our data overall, over which we can later do training. And this is why we created our open source library, whylogs, at WhyLabs. This library is all about statistically profiling and monitoring your machine learning data from end to end, using these concepts of time and batches, so that we continue to think about our data in this real-time way. Going forward, as you deploy real systems in the world, this really is how you should be thinking about your data. My goal here is to start a conversation about time-batched data and other skills that might be missing from data science learning pathways. If you transition back and forth between academia and industry, as I have, you find these skills lacking in some places in academia and in some places in industry as well, and this realistic time element is a huge one. So we shouldn't be ignoring the deployment stage. The preparation of data and the building and training of models are generally common to all of the data scientists I know, but deployment and prediction are not. Some people, especially academics, really don't think about deployment.
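To sketch the idea of separate, combinable batches rather than one frozen file: below is a toy per-batch profile that can be merged later. This is only an illustration of the concept; whylogs itself uses far richer statistical sketches than this count/sum/min/max example:

```python
def profile(batch):
    """Summarize one time batch of values into a small profile."""
    return {"count": len(batch), "sum": sum(batch),
            "min": min(batch), "max": max(batch)}

def merge(p1, p2):
    """Profiles are mergeable: combining two batch profiles gives
    the same summary as profiling the concatenated raw data,
    without ever storing the raw rows together."""
    return {"count": p1["count"] + p2["count"],
            "sum": p1["sum"] + p2["sum"],
            "min": min(p1["min"], p2["min"]),
            "max": max(p1["max"], p2["max"])}

monday = profile([12, 25, 8])     # one day's batch
tuesday = profile([31, 4])        # the next day's batch
week = merge(monday, tuesday)
print(week)  # {'count': 5, 'sum': 80, 'min': 4, 'max': 31}
```

Because merging is associative, you can roll daily profiles up into weekly or monthly views after the fact, which is exactly what a fixed CSV cannot do.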
And you don't think about the effects that deployment has on the way you think about your data, the way you model it, and the way you evaluate it. One thing that's really cool, which I won't go too far into but really suggest you look into, is progressive validation and delayed progressive validation. This is a way of validating that builds on cross-validation but is meant for the online use case, where you have continuous data coming in. The main idea is that we shouldn't evaluate on data from the future. And perhaps you need some delay: perhaps it takes a while to get the ground truth for new data. So when you evaluate your model, you shouldn't assume you already had that data, and that's how we should check our models, because that's how they'll be deployed in the real world. All right. So all data scientists need these skills for time-batched data. Yesterday I went on Kaggle and looked by hand through the 40 most-voted datasets, and 14 of those 40 had a date or time index column with at least daily resolution. These are examples of data that was forced into a CSV, into a static form, but really wants to live as dynamic data. And here are some papers on the relationship between academia and industry, showing how much these two fields are converging and working together, and hopefully motivating academics to also think about time-batched data, even if you previously thought of it as a deployment thing. One reason I am a data scientist, and one thing I'm really motivated by, is the impact of data science applications on the world, and I think a lot of that impact comes from industrial machine learning.
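Delayed progressive (prequential) validation can be sketched in a few lines. The running-mean "model" below is my own stand-in for any online learner; the point is the test-then-train loop with late-arriving labels:

```python
def progressive_validation(labels, delay=1):
    """Prequential evaluation with delayed ground truth.

    Each point is scored using only examples whose label had
    already arrived `delay` steps earlier, then absorbed into
    the history.  The "model" is just a running mean of past
    labels, standing in for any online learner.
    """
    history, errors = [], []
    for t, y in enumerate(labels):
        # Only labels observed at step t - delay or earlier are usable.
        seen = [yy for tt, yy in history if tt <= t - delay]
        prediction = sum(seen) / len(seen) if seen else 0.0
        errors.append(abs(prediction - y))  # score first ("test")...
        history.append((t, y))              # ...then record the example ("train")
    return sum(errors) / len(errors)

print(progressive_validation([1.0, 1.0, 1.0, 1.0]))  # 0.25: only the first
# prediction, made before any label had arrived, is wrong
```

Increasing `delay` models slower ground truth: with `delay=2` the first two predictions are made blind, and the mean error rises accordingly.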
And so building skills in this area, and in other areas that align with how these systems are actually deployed at large scale, is incredibly important. Here are a number of very common issues that have come up within machine learning and data science of late, showing why knowledge of the industrial machine learning space is important for everyone. All right. To log your data with whylogs, and I'll go through this very quickly, you need just a few lines of code. This example is from PySpark, but we have pure Python, Java, and Spark integrations for whylogs right now. You can explore trends in a few lines of code as well: this is our visualizer looking at a specific feature. And it's very storage- and computation-efficient, and data-analysis friendly. So that's it for me. Sorry this slide text is so big, but do check out whylogs: you can go to bit.ly/whylogs and reach the GitHub repository from there. You can also email me at bernice@whylabs.ai. Thank you very much. I'm happy to answer any questions we have.

Thank you for that. We did hear in the chat that there's a little bit of distortion on the display, but if you can post the slides to Zenodo afterwards, people can definitely take a look.

Yes.

If you have any questions, feel free to put them in the "ask a question" box. One of the questions that I have is really about the idea of, quote unquote, versioning. The timestamps, when you look at data being recorded in places like Kaggle, the way it's stored is just tons and tons more rows. Versions become rows, and then it's just completely unmanageable.
So is there an aspect of this where people who want to better manage that data for reuse should jump in in a different way? Or is it a training problem people have, or just a simple export problem? Why does that become the default?

Yeah, I think there are a number of reasons why that becomes the default. In machine learning we have this assumption that our data is independent and identically distributed, so we're already assuming each row of data can stand on its own; we don't build time assumptions into our models. There's certainly work in that space, but very little, and I think that's one thing that influences us to lean toward static data. With respect to logging, one big thing is the difference between large scale and smaller scale. A static dataset, lots of rows in a CSV or some other store, really doesn't work when we think about lots of deployed systems. So we need to completely reimagine this and think about approximate statistics and other techniques to get away from "new data equals new rows in the dataset." Some people in industry do sampling, but sampling has a number of problems: you lose the exact min and max, your mean becomes an estimate, and you lose lots of other things.

Any other questions from the community? And we apologize for the late start.

No worries. Thank you.

And the WhyLabs link that you put in there, is that specifically about how to use this, to the website?

Oh, it goes straight to GitHub, I believe.

Oh, okay. And then there is the website as well for additional...

Yes. If you go to whylabs.ai, you can see more information about both whylogs, the open source library, and also the platform that we have on top of that.
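The sampling-versus-summarizing point can be illustrated with a toy example. This constant-memory summary is my own assumption-laden sketch of the idea, not the whylogs implementation, which uses much richer sketch data structures:

```python
import random

random.seed(0)
# One day of latency measurements containing a single rare spike.
values = [10.0] * 9999 + [5000.0]

# Option 1: keep a 1% random sample.  The spike lands in the sample
# only about 1% of the time, so the sampled min/max are usually wrong.
sample = random.sample(values, 100)

# Option 2: a constant-memory summary computed over the full stream.
# This toy version tracks only count/sum/min/max, yet it sees every row.
summary = {"count": 0, "sum": 0.0,
           "min": float("inf"), "max": float("-inf")}
for v in values:
    summary["count"] += 1
    summary["sum"] += v
    summary["min"] = min(summary["min"], v)
    summary["max"] = max(summary["max"], v)

print(summary["max"])                     # 5000.0: the spike is never missed
print(summary["sum"] / summary["count"])  # exact mean: 10.499
```

The summary costs a handful of numbers per feature regardless of row count, which is what makes "new data equals new rows" avoidable at scale.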
Makes sense. Very cool. Awesome. And let me look real quick... it looks like we have one question that just came in from Bohain, which asks: where do people stumble in picking up time-batched data skills?

Yeah, that's a great question. One issue is that we don't talk about these skills; we don't really provide many tutorials or similar resources for them. So a major place people stumble, I think, is that they're really not used to evaluating data that has this time component. When you're evaluating data, you really need to ask: have I seen this data before? There's a lot of target poisoning, or leakage, which basically asks: I have both the inputs and the outputs for my training data, but does that poison my evaluation, my test dataset? And that can happen in a number of ways, especially when the data is time-based. So I think people really struggle with evaluation in particular, but there are lots of other places too.

Thank you.
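The time-based leakage being described can be made concrete with a small sketch comparing a shuffled split to a temporal one (the day indices and the particular shuffled order are invented for illustration):

```python
# Ten daily observations, indexed 0..9 by day.
days = list(range(10))

# A shuffled split, as a random train/test splitter might produce it.
shuffled = [3, 9, 0, 6, 2, 8, 5, 1, 7, 4]
rand_train, rand_test = shuffled[:7], shuffled[7:]
# Day 1 lands in the test set while day 9 is in training: the model
# was trained on data from after some of its evaluation points,
# so the evaluation is poisoned by future information.
print(any(t < max(rand_train) for t in rand_test))  # True: leakage

# A temporal split trains strictly on the past and tests on the future,
# matching how the model would actually be used once deployed.
temp_train, temp_test = days[:7], days[7:]
print(any(t < max(temp_train) for t in temp_test))  # False: no leakage
```

The check is deliberately simple: if any test day precedes the latest training day, the evaluation saw the future.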