Okay, hey everyone. Good morning, good evening, good night. I am Sophie. I'm a data scientist here at Red Hat, and today I want to talk to you about five things you don't know about data science. Edson implied that you're a mainly developer-heavy audience, and certainly this talk is primarily targeted at developers. So perhaps you're interested in data science, and you want to know how to get started or how to take it to the next level. Or perhaps you're part of a team who has to work with data scientists, and you want to know how best to support them and integrate the work that they're doing into your applications. Well, if either of those ring true, then you're in the right place. However, if you're a data scientist in the audience, don't fear, don't hang up the phone yet. If you stick around, then you'll probably be retold some things that you might already know, which I think is always reassuring. But also you'll learn about how your skills, motivations and goals differ from those of software engineers and application developers. And we'll talk about how you can all work together to get your jobs done better. So before we dive into the meat of this talk, I want to set some expectations about what this talk is not going to teach you. First up, this is not going to be a 30-minute deep dive into neural net architectures. I don't want to spoil the big reveal in my talk, but I do want to let you know already that you probably don't need a neural net in order to solve your problems. This talk is also not going to be a PyTorch tutorial. If you Google "PyTorch tutorial", you can see there are 3,430,000 of those online already. And what you absolutely don't need is another one from me. Similarly, we won't dive into any of the other tools and technologies that data scientists use, since there's so much content out there online already. If you do want a chance to get hands-on with some TensorFlow, then come to the lab that Chris and I are doing tomorrow.
Or just Google "TensorFlow tutorial" and you'll be good to go. Another thing that we're not going to do today is a lecture on maths and statistics. Now, I'm sure that if we were all in the same room and I said that, you'd sigh and look very sad about it. And that's because you know, like me, that math is so important. It's certainly fundamental when we're doing data science and machine learning, making sure that you're not making dangerous assumptions in your models or your code. And understanding the math and those assumptions that you're making, and that are happening in the background, is absolutely vital. But certainly I can't teach you all of that in 30 minutes. And finally, what we won't be doing today is talking about tools and tips to get your machine learning models running faster. We all know that machine learning and data science can be computationally expensive, and there are so many ways that we can speed it up, but that's not what I want to focus on today. So instead, what we are going to do is talk about the processes and principles that data scientists use in their daily work. And with a better understanding of those, I hope that you, as developers, will be able to go forth and help out your data scientists, and think about getting machine learning and intelligence into the work that you're actually doing and the applications that you're building. So with that, let's dive into the first of our five things that you don't know about data science. Number one: it's not all about the model. When I talk to practitioners in other fields, I often get the sense that they think that data scientists just deal with models all day. We're training models, we're retraining models, we're creating new models, we're designing things by hand to create a better model. In practice, data scientists do much more than just dealing with models. What we're looking at here is the machine learning workflow. It starts on the left with codifying your problem and metrics.
So that's essentially stopping and thinking about what's the problem that you want to solve and how are you going to know if you've done a good job of solving it? From there, we're on to exploring our data, visualizing the data, collecting it. Maybe it's streaming in from Kafka and we have to aggregate it. We want to integrate that with some static databases we might have and look for patterns in that data. After that, we're on to feature engineering, so extracting the important bits of information from all of our data and any other transformations that we need to do to our data. And so all of that is before we get anywhere near the model. And actually, I would argue that these stages, these early stages in the workflow are way more important and certainly way more time consuming than the model training and tuning. And you might say, okay, sure, but we're still ending up with a machine learning model at the end of that process. Well, in some cases, yes. But in those cases, you've still got to think about how we treated and transformed that data prior to passing it into the model. If we're running our model in production as part of a larger intelligent application, any data we've got streaming in needs to be transformed in exactly the same way as we treated our training data. So we might still have a model, but in some cases, you do the exploration, the visualization, the initial data processing, and then you stop. You can think of this as more traditional analytics. So often it's enough to return a business report or findings back to your management and stakeholders. And they can act on the insights that were found in the data and use those to guide strategy or decisions. So, sorry, time for a tea break. So data scientists do way more than just models. That must be tough, right? It must be complicated. All of these things to do, all of these things to know and learn, all of this data to deal with. Well, actually, not really. 
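That requirement, treating incoming production data in exactly the same way as the training data was treated, can be sketched in a few lines of Python. This is my own minimal illustration, not code from the talk; the standardization step and all the names here are hypothetical:

```python
import statistics

def fit_scaler(training_values):
    """Learn the mean and standard deviation from the training data."""
    return {
        "mean": statistics.fmean(training_values),
        "std": statistics.stdev(training_values),
    }

def transform(value, scaler):
    """Standardize one incoming value using the *training* statistics."""
    return (value - scaler["mean"]) / scaler["std"]

# At training time we fit the scaler once on the training data...
scaler = fit_scaler([10.0, 12.0, 11.0, 13.0, 14.0])

# ...and at serving time, every value streaming in must go through the
# exact same transformation, with the exact same learned parameters.
incoming = 12.5
features = transform(incoming, scaler)
```

The key point is that the scaler's parameters are fitted once and then frozen; if the serving path recomputed them, or skipped the transformation, the model would receive data on a different scale from what it was trained on.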
For data scientists, data science is the easy bit. If you give a data scientist some data, they'll likely dive straight in. Maybe go straight ahead and create a Jupyter notebook. Start to load that data, explore it, understand it, plot it, transform it. And to me, as a data scientist, that's the fun bit. That's the stuff that we're well-versed in, we're trained in. That's where we thrive. So what's the tricky stuff? Well, the tricky bit is making all the work we've done shareable and usable by other people and other applications. Data scientists find this hard for a few reasons. We're often not well-versed in skills like version control and Git. That's changing. I think over the last three years or so, we've seen the use of Git become more prevalent in data science, but it's still not something that we're experts in, so it makes collaboration difficult. The environments in which we do our work certainly look far from production artifacts. Jupyter notebooks do not look like code that is ready to be integrated directly into an application, and in practice, they're usually not. A lot of work has to be done to go from code in a notebook to code running as part of an application. You can really think of a Jupyter notebook just like a standard paper notebook that we might have on our desk. It's full of scribbles, some ideas, some workings out, some crossings out, and you've got to do some archaeology to uncover what happened in the notebook and figure out how to recreate it and reproduce it. Another thing that data scientists are not experts at is containerizing our applications. Turning applications into well-formed, well-wrapped code that can run anywhere is certainly well outside of our comfort zone. These patterns are things that we've seen across many of our customers at Red Hat. I think over the last five years or so, lots of companies hired loads of data scientists to explore data, find insights, and maybe train models to put into production.
But what they're now seeing is that they just haven't seen the payoff from that, and that's because they're missing the cross-functional teams that are needed to piece all of these parts together. So the hard stuff is the collaboration. It's getting the workflows and pipelines integrated into applications, and that's where you as developers come into the picture. By understanding enough of our data science workflows and helping us to impose structure and processes on our work, as well as guiding us through the things that we're not experts in, you can really make it easy for us to integrate that intelligence, machine learning, and data science into applications. If you want to see some concrete methods for doing this, then head back to the talk that Chris Chase and I gave earlier this morning. It features Chris's dog, which is always an absolute reason for watching a talk, but it talks through some of the structures that Chris imposed on my work, which meant that we could collaborate efficiently and rapidly develop an application, and then change that application without me as a data scientist having to do a load of work that's outside of my comfort zone. So this brings us on to the third thing that you might not know about data science. In data science, so many things can go wrong, but the real question is, will you even notice? We all know how our normal software breaks. You get an error message that looks something like this. If you've got something wrong in your code and it won't compile, you get a message. If you don't have all of the things you need packaged in your container image, you're going to get an error. And usually, hopefully, it's a sensible error that you can understand, so you can debug and rebuild. Similarly, if you've got a spelling mistake or a human error in your code, maybe you find yourself stuck in an infinite loop. That can be frustrating, but it's the kind of thing that you debug.
You figure it out, you know something's gone wrong, and you can fix it. So I think of these as noticeable errors. They all happen, and we know when they happen. But data science and machine learning is plagued by another class of errors that I like to call invisible errors. If we represent our model by this blue box and think about it running in a system, our model is really just acting like a function. We pass data into the function and the function spits out a prediction. And our model will continue giving us predictions as long as we pass data to it. So what could possibly go wrong? Well, in data science, there are so many things to worry about and so many things that can go wrong that we can't fit them all into a quick 30-minute talk this morning. But I want to focus on changes to the data. So let's think about a specific example. Suppose we've got data streaming in from a weather station: we're collecting atmospheric pressure, weather station location, precipitation and temperature. And you're trying to predict the occurrence of a tropical storm. So we've trained this model, this blue box in the middle, on the past 30 years of weather data from the USA. And we've got this model up and running all over the US. It's working really well. It's predicted the last two tropical storms. So somebody thinks, fantastic, I'm going to take this model and roll it out to predict weather events in Canada. What could possibly go wrong? Well, it's a great idea, except weather systems in Canada record temperature in centigrade rather than Fahrenheit. So your model will still make predictions. It's not going to throw any errors. But the data you're passing in is on a completely different scale. That temperature is on a different scale, a different magnitude, from the data that the model was trained on. And your model doesn't know this. The predictions just keep on coming. They're just going to be wrong. So really, this example is arguably down to human error.
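To make the "invisible" part concrete, here's a toy sketch of that weather example. This is my own illustration, not code from the talk: the "model" is just a hypothetical learned threshold, but it shows how a unit mismatch produces no error at all, only silently wrong predictions:

```python
def storm_risk(temperature):
    """Toy 'model' trained on US data, so it expects Fahrenheit.
    Returns True when conditions look storm-prone."""
    return temperature > 80.0  # a learned threshold, in Fahrenheit

# US weather station: 86 degrees Fahrenheit on a hot, storm-prone day.
print(storm_risk(86.0))   # True, as expected

# A Canadian station reports the very same conditions as 30 degrees Celsius.
# The model runs happily, no exception, no warning, but the prediction
# is wrong: the storm risk is silently missed.
print(storm_risk(30.0))   # False
```

Nothing in the function signature knows about units, so nothing can complain; the only defence is checking and monitoring the data you feed in.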
It's us not checking or understanding what the model is expecting and how it's running. But the notion of data drift is more general. In many cases, data changes over time. Certainly, I think over the last year and a half, many of our personal behavioural habits have changed. I know that my spending habits have changed because I'm not commuting to work. I'm not buying lunch out. The amount of air travel happening has changed due to a global pandemic. And certainly, the climate is changing over time. If the data that you trained your model on is no longer representative of the data you're trying to make predictions for, you're likely going to end up with incorrect predictions. So what can we do about this? Well, one thing we can do is monitor the data coming in. We can look at the distribution of data over time, and if that changes, we know that something is awry, and you're going to have to do something about it. So suppose your data looked like this on day one, but by day 101, it looked like this. That indicates maybe something is wrong. Maybe my model isn't going to be giving sensible predictions anymore. Similarly, we can monitor the predictions that your model is making. So suppose we've got a model that's predicting fraud, and we know that we expect about 3% of credit card transactions to be fraudulent. If you monitor your model's output and see what proportion of fraud it is predicting every day, perhaps that starts out at 3%. And then one day that jumps up to 6%. That indicates that something somewhere has gone wrong. The model isn't making the predictions that we expect. And that indicates maybe something's gone wrong in your data. Maybe the date format changed and that's messed everything else up. Or maybe somebody messed up the conversion from Canadian dollars to US dollars. And now your model is predicting things that aren't actually happening.
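The prediction-monitoring idea can be implemented very simply. Here's a hedged sketch in plain Python; the 3% baseline comes from the fraud example above, while the alert threshold and all the names are mine, chosen purely for illustration:

```python
EXPECTED_FRAUD_RATE = 0.03   # roughly 3% of transactions, per the example
ALERT_FACTOR = 1.5           # flag if the daily rate drifts 50% above baseline

def daily_fraud_rate(predictions):
    """predictions: a day's worth of booleans, True = model flagged fraud."""
    return sum(predictions) / len(predictions)

def check_for_drift(predictions):
    """Compare today's predicted fraud rate against the expected baseline."""
    rate = daily_fraud_rate(predictions)
    if rate > EXPECTED_FRAUD_RATE * ALERT_FACTOR:
        return f"ALERT: fraud rate {rate:.1%} is well above the {EXPECTED_FRAUD_RATE:.0%} baseline"
    return f"OK: fraud rate {rate:.1%}"

# A normal day: 3 flagged out of 100 transactions.
print(check_for_drift([True] * 3 + [False] * 97))

# A suspicious day: 6 flagged out of 100, double the baseline.
print(check_for_drift([True] * 6 + [False] * 94))
```

The same pattern works on the input side too: track summary statistics of the incoming data over time and alert when they move away from what the training data looked like.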
So given that everything can go wrong, and you might not ever actually realize it, I think it's time for some happier news. Most people don't realize that for machine learning, simple is usually good enough. You can often get more mileage from a simple model and from standard statistical methods and rules that allow you to understand what you're doing, without going all in with a complicated model. And the simplicity allows you to understand the assumptions that you're making and draw clear conclusions. So suppose I've got some data that looks like this, and I want to estimate the relationship between these two variables on these axes, so that I can make predictions in the future for any data we might see. Well, there's a few ways I could do this, but perhaps the simplest would just be to model this as a linear regression. So there's a linear relationship between these parameters. On the other hand, perhaps the most contrasting thing I could do would be to model it with something that looks like this. So which line is best? Which model do we prefer, the red or the blue? Anyone want to shout out in chat? I'm watching it, and now I can see whether or not you agree with me. I can tell you that even though that red line is giving the best accuracy on this data, I agree with Edson: I would much rather use the blue line for future predictions. So why is that? I can see some answers coming in on the chat. Aside from the fact that I think that blue line is capturing the underlying trend here, while the red line is certainly overfitting to that data, I think it's much simpler to explain that blue line to stakeholders. It's also easier to understand and it's probably easier to compute. Even if you're going to end up using a more complicated model like this model in red, you still probably want to train that blue model and use that as a baseline. You can compare your complicated model to this more basic one and check that the complicated one is performing as expected.
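A simple linear baseline like the blue line takes only a few lines to fit, using the closed-form least-squares formulas. This is a sketch in plain Python under my own assumptions (the data points are invented for illustration); any fancier model you try later can be compared against it:

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Noisy data with an underlying linear trend (made up for this sketch).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)

def predict(x):
    """Baseline prediction from the fitted line."""
    return slope * x + intercept
```

If your complicated model can't clearly beat `predict` on held-out data, the extra complexity, compute, and opacity probably aren't buying you anything.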
You can also check whether the trade-off was worth it, and whether it's worth using that more complicated model. In practice, the more complicated your model gets, the more computationally expensive the model becomes; it might take longer to actually make predictions, and this might not be acceptable for your particular use case. If you want to block a transaction at the point somebody puts their card in, you've got to be able to make a quick decision about whether or not you're going to block that transaction. Thank you for all of the motivational comments about the graph. Exactly. So Ivan's pointing out that it depends on the precision needed. In some cases you need to be really precise and really accurate. If you're predicting something like a serious illness, and the treatment for it is very painful and debilitating, you don't want to accidentally tell people they've got the illness. You'd rather spend a longer amount of time predicting correctly, checking, and then going forth with treatment. Now, even if your use case does require things to get complex, there are still ways to keep things simple. So I opened this talk by telling you that you probably do not need a neural net, and actually that's not entirely true. For some cases you might: maybe you're doing object detection, trying to determine whether it's a dog or a groundhog digging up your yard. But what you might not need to do is train your own neural net from scratch. I've worked in data science for about ten years now and I have never trained a neural net, and that's because there are tons of open source models out there, pre-trained on data sets that are larger than you could imagine or ever get your hands on or process. There's always something out there in the open source world that you can already use. So take a look around, adapt it, and don't start from scratch.
So we've talked about data science, we've talked about some things that can go wrong and some things to keep in mind, and I just want to end by telling you that data science and machine learning can solve many problems, but it can't solve all of the problems. What we can't do is just throw data at a model and get fantastic results. The quality of those results is fundamentally determined by the quality of your data. If we go back to this idea that we can pass data into a model and get a prediction, it kind of seems like magic, but really baked into that model are all of the assumptions that were held inside your training data and your modelling. So when we fitted that model before, we assumed there was a linear relationship between those variables. Are we going to assume that that holds even if we extrapolate out into space? I don't know; we've got to think about whether or not we're happy to make that assumption. And similarly, anything that is in your training data, any errors or any behaviours and patterns, is going to be perpetuated by the model and into the predictions that it makes. Most importantly, if you've got any bias in your training data, that's going to be passed through into all of your predictions. If you don't have enough data, you might never be able to make sensible predictions either. So your model might be doing a good job, it might not, but the question is: will you notice? Will you notice if bias is perpetuated all the way through your chain, and what are you going to do about it? So thanks for sticking with me for the last 25 minutes or so. We've seen that data science is about much more than just the model, that it might not save the world, and that it's difficult to integrate into systems because all of the things can go wrong and you might not even notice. It sounds like a sad story, but what can we do about it to make it better?
Well, a few things. First, let's always start simple. Try to make things as straightforward as possible, make sure that things work as you expect, and use your intuition: the machine learning model is only as clever as the data that you put in and the assumptions that you pass into it, so make sure that things line up and make sense to you. Secondly, we need to monitor our data science and machine learning systems. Whenever you can, monitor data coming in, monitor predictions coming out, monitor everything that it is possible to monitor, so that we're more likely to catch those invisible errors as they arise. And thirdly, let's communicate. Focus on communication: make sure that your work is documented and that you're communicating with your teammates, even if they've got a different role to you. Developers, talk to your data scientists before they start a project, so that they understand what's actually required to pick up their work and put it into production. Give them guardrails and help them out with the things that they don't know, so that they can focus on the things that they do. So with that, I'm going to pass back to Edson. Thanks so much for having me along today, and I hope you enjoy the rest of the sessions.