Welcome to the first talk after lunch; I hope you had a nice lunch. We have Ian Ozsvald here, who is going to talk about a standard data science process for software engineers. Please give him a nice welcome.

Hello, everyone. Welcome back after lunch; hope you had a good lunch. I'll be talking about a standard data science process for software engineers. Before I begin, I want to understand who my audience is, so I'm going to ask you to self-identify in one of two ways: you're either more of a software engineer or more of a data scientist. Put up your hands if you'd self-identify as a software engineer. Predictable, given the conference. And how many data scientists are in the room? OK, so hopefully you will also learn a little bit, but just a little bit. The rest of you, I hope, will learn rather more.

So who am I? My name is Ian Ozsvald. I style myself as an interim chief data scientist. I'm a practising data scientist: I write tools and intellectual property for clients, and increasingly over the last few years I've stepped in to help teams and companies that lack senior leadership on the data science side, helping to shape the teams, figure out what is and isn't working, and work out how to make a project succeed. I've been doing this for nearly 20 years, and I'm very driven by it. It turns out you cannot learn enough; there's just so much to learn, and it keeps changing all the time, even to the point that in Sylvain's wonderful keynote this morning I got a bunch of new ideas, so I carried on typing for the next two hours, and there are more bits in the live demo later on, because there were just too many cool new things going on in that keynote. So, in short, I help teams build strategic data science plans and get data products shipped.
And then I'll work with them to make sure things get delivered. I really enjoy shaping up data science projects to make them work. In my 20 years I've had lots of experience of things not working and going horribly wrong, typically with me as the only data scientist in the room, or the only AI person, back before we had the term data scientist. It was kind of terrifying and really depressing at times, and I try to change that conversation now.

At the top there is the PyData logo. I'm one of the co-founders of PyData London, the largest Pythonic data science meetup on the planet. We have nearly 10,000 members now, and I'm super proud of what we've built there over the last six years. We have our sixth annual conference in the middle of July. Who's coming to PyData London's conference? A handful of you. The rest of you should come as well; you will learn so much, and there are still some tickets available.

I work with big companies like the ones you see up there, Hotels.com, the insurance company QBE, Channel 4, and also lots of small and mid-sized companies, so I've got a nice view of how data science works across many different verticals. I'm also one of the authors of High Performance Python, and O'Reilly have kindly sent me a box of my books today, so if you want a free copy of the first edition, I'll be doing a signing outside in the coffee break. I'm working right now on the second edition. The first edition is still correct, it's not wrong; it's just Python 2.7 and lacking a few things like pandas and Dask, which will be added for the second edition. And I'm also now on Pluralsight doing video training.

So when does data science show value? I like it when I work with a client where the management are numerate and asking good data-driven questions. I've worked in organizations where the brief was: we need magic, we need to make more money.
Just go and look at the data and give us the magic, and it'll all work. That's rubbish; it does not work at all. I need numerate management who can ask sensible questions and understand the quality and the problems in their data and their processes, so that we can do some kind of change management to get a product shipped and doing things in the real world.

In particular, we need suitable data. Many, many organizations have unsuitable data and hope that data science magic will make it all work anyway, and it doesn't: crap in, crap out. That's the way this works; we need good data. We also need well-defined, achievable outcomes, sensible outcomes that let us bite off a sensible problem, deliver something that works, and then incrementally improve it. And change has to be enabled: we need permission not just to hear that change is wanted, but to actually deliver it, to change an organization and its workflows so that things get accepted and put out into production.

We need to check the business need. As an engineer approaching this problem you might think: I want to do some data science, what am I going to do? Well, you need to do something that delivers value inside the organization; there has to be a change driver behind it. It's not good enough to do it because it's cool and it's deep learning with the big data. What matters is that you can deliver change inside the organization. So: is there a burning desire in the company to change a thing, or is it somebody's aspiration to have a deep learning project because it sounds cool? The latter won't deliver value; it'll be a waste of your time, and we don't want to be doing that.

I had the privilege of giving a keynote out in Lithuania a month ago, at PyCon Lithuania. One of my colleagues out there, Jonathan, gave a really nice talk about the data science process, with a really nice example.
He works in ID matching, where you take a photograph of yourself holding up an ID card and a photograph of yourself holding up a passport, and humans check that these match; they wanted to automate the process. Do we do big data and deep learning and try the whole computer vision thing? He said no, of course not, that would be a silly way to approach the problem. Rather than having humans check every single image to see if they can clearly read the name, we do a bit of OCR. If the OCR recognizes a name on one document and a name on the other, and those names match, then we say: that works. For all the other documents, the other 50% or so where it doesn't work, we give the process to a human. So the machine automates half the problem easily, with some off-the-shelf tooling and an afternoon's coding, and we identify the parts of the problem that a human is required to work on. Then we can incrementally improve on the harder cases and get the machine working on those too. That's incremental value delivery. I'd encourage you to look at the talk Jonathan gave; the video is online, I believe. It's also important that the solution is automatable. If it's not automatable, then no matter how much deep learning or clever data science you do, you can't put the thing into production.

I really think we all need more project specifications. This is a slightly unpopular opinion, because it means we don't get to go straight to the cool machine learning; instead we write some things down and get agreement from our colleagues. But in 15 to 20 years of doing this, I've realized you really need to write these things down: the shared, agreed problems, solutions, milestones, and the unknowns that might torpedo the project, and get agreement among all of the stakeholders.
Otherwise, everyone assumes different things will occur. Everyone assumes: I know what the problems in the data are, so you must know too; you know it's happening next week; you know it's happening in a month's time. Then when things don't go to plan there's a lot of confusion, and the poor data scientist, or the poor software engineer learning to be a data scientist, is left carrying the can, feeling really awful because things went weird for the third time running. Why did that go wrong? Write a project specification.

I gave a talk at PyData Cambridge a few weeks ago; that's the title in the bottom right-hand corner. It's based on two training courses that I run and lots of other talks I give, where I've boiled down both this notion of delivering successfully and software engineering for data scientists into a half-hour talk. It's up on my blog at ianozsvald.com if you want to go and have a look.

You also need a good definition of done, and here's a story from 10 years back, a slightly terrifying incident. I had a client, a very sensible client, an e-commerce company doing price comparison shopping. The problem was: we scrape websites, we get the same product from different websites, and we want to find out which website has the cheapest version of that product. Very sensible. If we get the same product description from both sites, we can just do a string comparison; they'd already solved that. Can we use machine learning to do human-like reasoning when the product descriptions are different but clearly describe the same product? Also reasonable. So we agreed: I would work on a data set that represented the real data set, we'd come up with a prototype solution, which took three months, and then we would apply it to a larger data set. The data set I was working on would be representative of the real data set. We all agreed that verbally. We didn't write it down.
That was silly. After three months we said: look, the prototype works, right? This is kind of cool. It's doing a human-like comparison of these strings and coming to a good agreement most of the time. They said: great, now you can look at the other data set. I said: what do you mean, the other data set? We agreed the first data set would be representative of the true data set. They said: yes, but we didn't want to scare you up front, so we gave you the easier one first. Now look at this one; we don't know how to do it ourselves. Oh, crikey. I looked at it and thought: as a human, I can't differentiate these, so there's no way the machine is going to. At that point we had to kill the project. I'd thought that, as a consultant, I had an idea of how all of this worked, but it turned out I didn't, because I wasn't writing things down. So get shared agreement: write down the problem you're solving, why it's valuable, and come up with agreed milestones around your project. That's the first practical piece of a process for software engineers thinking about delivering value.

Now we're going to do a short live demo, which I've written in a Jupyter notebook, and, as I mentioned, given Sylvain's lovely keynote, I've added some bits to it. Here is my pretend example: we want to automate miles-per-gallon estimates from some engine statistics. I'm using this because there's a convenient standard data set called the Auto MPG problem, and I've co-opted it to give you this setup. We have a colleague, an engineer, who goes through these data sets looking at characteristics of engines and at miles-per-gallon estimates or recordings.
Their task is to find a good engine that matches some constraints provided by a client. They've got a big pile of books of all these different engine configurations and miles-per-gallon statistics, but the miles-per-gallon figures aren't always there; sometimes they're missing. So we want to fill them in with some machine learning. The goal is: if you can fill in these missing miles-per-gallon estimates, then rather than the human having to run physical tests to measure the miles per gallon, which is an expensive human process, maybe we can come up with a prioritized ranking of engines fitting the client's constraints and miles-per-gallon requirements, so the human has a prioritized process for looking through them. It's a simple way to use miles-per-gallon estimates to prioritize a workload and make it more efficient. It's a fictitious problem, but it maps onto lots of real-world problems. The catch: the team is suspicious of machine learning. So how do we make them less suspicious?

Here we've got a Jupyter notebook. First of all, can you see it at the back? You don't have to read the text; just follow the diagrams as they come past. It could just as easily have been JupyterLab rendering the notebook. I've loaded in some of the standard tooling: NumPy, pandas, scikit-learn, and some of the scikit-learn metrics. Who's used scikit-learn? Some of you. Who's used the Yellowbrick visualizer? One of you. Right, good, so some of you will learn new things here. I'm also using Altair. Who's used the Altair data visualization tool? One. Not enough at all, because it's brilliant. So we load in this Auto MPG data set, and it looks like this. In fact, can I make that slightly bigger? Is that slightly bigger?
So we have some characteristics: miles per gallon, cylinders (the number of cylinders in the engine), displacement (its size), horsepower, and a few others. Can we use those characteristics to learn miles per gallon? One of the first questions to ask with a new data set is: what correlates with the thing we want to predict? Which columns should I investigate first to understand what I've got? That correlation I've got there at the bottom is the first sensible thing to do. You run a correlation, sorted for miles per gallon: it correlates most strongly with weight, then displacement, then horsepower, and then a few other characteristics. Once we know what correlates, we've got an idea of what might help predict the target, so we could use weight and displacement to predict our miles per gallon. That gives us a way into building our predictive system.

But before we build any predictive system, we want to draw some graphs and investigate our data. We know what correlates, so we might say: I want a scatterplot of miles per gallon versus weight. Here I've used Altair. Altair is a bit like matplotlib, but it's not static: it uses JavaScript and it's interactive, so we can zoom in and out, pan around, all that jazz, and you get tooltips. Very quickly, in just those four lines above, I get a nice interactive visualization of the data set. That's really, really nice; Altair has become quite mature over the last couple of years. One of the nice things when you're looking at your data set, and as software engineers you need to be doing this, is that you look at your points and you think: oh, that's kind of interesting.
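The correlation step above can be sketched like this. Note this is a minimal reconstruction: the tiny DataFrame is a hypothetical stand-in for the real Auto MPG data (the file isn't bundled here), so the exact numbers are illustrative only.

```python
# A few synthetic rows shaped like the Auto MPG columns discussed above.
import pandas as pd

df = pd.DataFrame({
    "mpg":          [18.0, 15.0, 36.0, 26.0, 14.0, 31.0],
    "weight":       [3504, 3693, 1835, 2130, 4425, 1985],
    "displacement": [307, 350, 79, 97, 455, 85],
    "horsepower":   [130, 165, 58, 46, 225, 65],
})

# Correlate every column with every other, then pull out the target
# column to see what might help predict miles per gallon.
corr_with_mpg = df.corr()["mpg"].sort_values()
print(corr_with_mpg)
```

On the real data set this is where weight, displacement, and horsepower show up as the strongest (negative) correlates of miles per gallon.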
I've got this relationship between weight and miles per gallon: as the car gets lighter, the efficiency increases. That feels sensible. It's not a linear relationship, it's kind of banana-shaped, slightly nonlinear, but still linear-ish. And then I look at this and think: why is it that for this particular weight I've got a band of engines with a certain miles per gallon, but for the same weight a couple of outliers? Because I've got interactivity and tooltips, I hover over those outliers, look at the car name, and see "diesel" in brackets. And then I realize: oh, there's a subtlety in my data. I've got petrol engines and diesel engines, and that's not represented in any of the features I've got; it only shows up in the car name. So when I sit down with my colleague I can ask: what's the deal with diesel versus petrol? Is that something important I should know about?

As I run my mouse over the data I can also see different years and different territories. So what if I colour it by year? I rerun it and think: oh, that's interesting. For early years I get a lower miles per gallon at a given weight, and for later years, the darker points, a higher one. So vehicles have become more efficient over the years within the same weight category. I didn't know that, but now I can talk to my colleague about it. And as I hover around, I notice at the top I'm seeing territories like Japan, USA, and Europe, while down at the bottom it's mostly just USA. So what if I colour by territory instead? Then we see it exposed very quickly: those red dots at the bottom show that the US cars are less efficient than the European and Japanese ones.
Well, that's pretty cool; I can talk to my colleague about this. So I do, and they say: great, but I've got a heuristic. When I estimate miles per gallon, I just take 1% of the car's weight with a negative relationship, so as the car's weight increases, my miles per gallon decreases; I call it 1%. So we can draw a straight line through our data: take the weight, multiply it by -0.01, and that gives a miles-per-gallon estimate. We draw it and get that green line going through the data. We've come up with a human estimate and drawn it. It seems pretty reasonable; it doesn't fit terribly well, but there's some relationship going on.

Then our colleague says: that's great, you've got this machine learning stuff with lots of clever visualizers, but I don't really care about that. I just want to know: how well am I fitting right now? I could invent my own scoring mechanism, or I could wrap this estimator up in the scikit-learn framework. I do that with MyEstimator here. Most people would never do this, but as engineers you might be interested in what goes on behind the scenes in scikit-learn. I make an estimator with a fit method and a predict method. The fit method does nothing; I'm not fitting the data, because the human assumption is hard-wired into the predict method, which just applies the negative 1% to the incoming weight and gives a prediction: that straight line. Then I use one of scikit-learn's standard scores, the R² score: zero means the model does no better than predicting the mean, one means a perfect fit, very predictive. So I can use the standard scoring methods on my own hand-written estimator, and I find I get a score of 0.53. It's OK, it's fine, it's not super special.
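The hand-rolled estimator could look something like this. The class name and the heuristic (a 55 mpg offset minus 1% of weight, which the talk later contrasts with the fitted 46 and 0.8%) follow the talk; the four data rows are synthetic, so the R² printed here won't match the 0.53 from the real data.

```python
# Wrapping a fixed human heuristic in a scikit-learn-style interface.
import numpy as np
from sklearn.metrics import r2_score

class MyEstimator:
    """Hard-wired human heuristic: mpg ~= 55 - 1% of weight."""

    def fit(self, X, y=None):
        # Nothing to learn - the heuristic is fixed - but returning
        # self keeps the scikit-learn fit/predict convention.
        return self

    def predict(self, X):
        weight = np.asarray(X)[:, 0]
        return 55.0 - 0.01 * weight

# Synthetic (weight, mpg) pairs roughly shaped like the real data.
X = np.array([[3504.0], [1835.0], [2130.0], [4425.0]])
y = np.array([18.0, 36.0, 26.0, 14.0])

est = MyEstimator().fit(X, y)
print(r2_score(y, est.predict(X)))  # standard scoring on a custom estimator
```

Because it follows the fit/predict convention, this heuristic can be scored, plotted, and compared with the same code as any real scikit-learn model.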
But my colleague now says: that's kind of cool, and I understand some linear algebra. How does ordinary least squares do on the same data set? Well, we're using the same interface as scikit-learn's ordinary least squares, so we can take LinearRegression, train it on our data, ask it to make a prediction using the same inputs and outputs, and we get a better score: 0.66. Very nice. We can print out the coefficients: the coefficient that comes out is not 1% but 0.8%, a little smaller than the human's estimate, and the offset is not 55 miles per gallon but 46. So pretty close to what the human came up with, but this one scores better. We've used ordinary least squares to fit better, and we're showing a progression from human estimates through to a machine-learned solution.

Then maybe I say: look, colleague, now you're on board. We're using some scikit-learn, we've got a model going, we've improved on the human model, and we're using standard scoring. What if we used a more advanced model, a random forest: a nonlinear, more sophisticated fitter? We give it more columns of data, it learns the nonlinearities in the data set, and it should give a better result. Indeed, we plot it and get a score of 0.83, and we see the green dots interspersed with the blue dots, no longer running along a straight line. This is looking pretty sensible; we've come up with an estimator doing quite a nice job.

Our colleague says: this is great, I kind of understand, I'm kind of with you, I think this is working. How do I use it? And I say: well, how about that thing that Sylvain, our colleague, told us about, the IPython widgets?
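The OLS-then-random-forest progression can be sketched as follows. The synthetic data here has a known, noiseless linear relationship built from the talk's fitted coefficients (46 and 0.8%), so the real scores of 0.66 and 0.83 won't be reproduced; the point is the identical train/score interface across both models.

```python
# From ordinary least squares to a random forest with one shared API.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
weight = rng.uniform(1600, 4500, size=200)
mpg = 46.0 - 0.008 * weight        # assumed linear ground truth

X = weight.reshape(-1, 1)

lr = LinearRegression().fit(X, mpg)
print(lr.coef_, lr.intercept_)      # recovers the slope and offset

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, mpg)
print(rf.score(X, mpg))             # R^2, same scoring convention as before
```

On real, nonlinear data the forest can also be given extra feature columns, which is where its advantage over the straight line comes from.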
What if I gave you a nice little widget interface where we expose, say, weight as an interactive item? Every time we change the weight, keeping the other inputs fixed, we predict the miles per gallon. So I get a nice interactive slider here: as weight increases, the predicted miles per gallon decreases, and as weight decreases, it increases. You can imagine building up this user interface, making it more complicated, adding more features and graphical outputs. But even just an interactive slider that we could serve up feels pretty powerful.

Then our colleague says: yes, but Ian, if I take this, I'm going to break it. I'm going to delete files on your machine by accident, because this is being served up inside the notebook. And I say: hang on a minute, what about Voila, which our colleague showed us? So you run Voila, and now I'm serving up a server-side rendered version of exactly the same notebook. I just ran Voila on my notebook, it pops up all the output, and I get that interactive slider down at the bottom: as I increase the weight, the predicted miles per gallon decreases. We've got an interactive user interface being developed here. That's kind of nice. And if it rendered in time... oh, there we go. Did it render? I failed. Earlier I also had a Binder public interface running, so I'm going to run over for a couple of minutes and then take questions. Earlier I had a Binder version of this running off my public repo, mounted on the web, so you could have accessed it from the web on your laptop and interacted with the data along with me. I was really impressed with the Binder component and Voila in this morning's talk. As far as I can tell, it just works.
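The slider idea reduces to a plain prediction function that ipywidgets can drive. This sketch hard-codes the talk's fitted linear relationship (46 minus 0.8% of weight) as an assumption so it's self-contained; the notebook filename in the final comment is made up.

```python
# A plain function that a widget slider can call on every change.

def predict_mpg(weight: float) -> float:
    """Predict miles per gallon from vehicle weight, other inputs fixed."""
    return 46.0 - 0.008 * weight

print(predict_mpg(2500))  # -> 26.0

# In a Jupyter notebook you would wire this to a slider:
#   from ipywidgets import interact
#   interact(predict_mpg, weight=(1500, 5000, 50))
# and then serve the whole notebook read-only from the command line with:
#   voila my_notebook.ipynb
```

Keeping the model behind a simple function like this is also what makes the later step, serving it via Voila without exposing the notebook itself, straightforward.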
I just did it in the last couple of hours; you should try it too, it's very cool.

So, back to finish off the talk, a couple of resources. I mentioned that I do some training, and I've got a course coming up in a couple of weeks. That's probably not for you, it's aimed at data scientists, but you might know colleagues who'd be interested in it in the future. The two books on the right, on the other hand, will interest you. The Python Data Science Handbook by Jake VanderPlas is a really good software-engineering-focused introduction to the world of data science, not just machine learning but data science and data analysis in general; that's a really good book to get. And for your managers, Data Science for Business. It's a formula-heavy book, but the formulas are really easily explained, so it's easy to follow along and understand what's going on, with lots of nice diagrams and really intuitive explanations. I enjoy reading that book; it makes everything really obvious. So I recommend both of those books for different parts of your team.

I'm going to wrap up here in just a moment. I've been speaking at events for a long time, and one thing I've realized is that many attendees don't appreciate how much of this is volunteer-run: the organizers, the room co-chairs, the speakers, everyone around, are volunteers. So please go and thank your volunteers; find an organizer or a speaker or a room co-chair and thank them. It makes the world a better place. And if you'd like a free copy of my book, there'll be 20 or so of them out here at 3:30 and I'll be signing them, sent over courtesy of O'Reilly and their outreach program.

In summary: automate parts of a high-value problem. You don't have to automate all of it up front; automate the high-value parts, then iterate and keep building out your solution.
Keep building that value incrementally, communicate your results early and often, and come up with ways of communicating that let people interact with the system. Using, say, Voila and those widgets to make an interactive front end is much nicer than just having a notebook, and the notebook beats having, say, a CSV file as an output. Come up with ways of getting it into the hands of your users so they actually get value from it. I've got a couple of mailing lists on my website, one for training announcements and one for jobs I know about that are coming up, plus a bunch of talks and books and other things that I recommend; you can get all that from my blog. There are lots of other talks I've given over the years linked from the blog as well that fill in around this; I've been talking about this area for about five years, so there's plenty of background material if that interests you. I'll finish there. Thank you very much.

We'll have time for one, maybe two questions. Any volunteers?

Thanks a lot, and thank you for that; it goes without saying it was an extremely compelling and interesting talk. I just wanted to ask a question about automated testing. I had my hand up as a software engineer; I've dabbled with data science, and I have a junior data scientist, who may be in this room, working at a certain place. I like automated tests, and given the nature of data science, you're obviously dealing with large data sets: you might not want to do an end-to-end test with your full data set, and if you knew what the outcome was, you probably wouldn't even need to run it. So I just wonder if you could talk at all about automated testing in the context of data science.

So, this is something I'm covering on my course in two weeks' time, because it turns out testing is a complicated subject and it's not covered in any of the data science books.
And yet we need testing. Most data scientists don't write tests, which means we can't trust the output of a data scientist's work, and that's awful; that's not the right thing. My working style, and what I teach and talk about, is: develop your code in a notebook, find the code that you want to trust, and extract it into modules. Once you've got a module, you can put unit tests around it; that lets you test your code. Then you've got your data as well, and there you use a fixed data set as a fixture, pushing it into your pipeline, which could be a notebook, or a notebook exported as a Python file, importing those modules you've already tested, pushing your data through, and then checking the quality of the output. If I know that certain rows of input should give a certain kind of output, I can set up a test that says: with this big black box I've built, the data transforms, the visualization, the model, the predictive output, when I feed in this fixed row of input, check that it gives me the right kind of miles-per-gallon output within some small range, some wiggle room perhaps. And check that none of the miles-per-gallon values go negative, or go up into the thousands, or whatever a wrong result would look like. Find some limits and test those. That mixture of sanity checks along with your regular code unit tests I've found to be a pretty strong way to go.

You can do the same at deployment time, with checks running on your live system, which is ingesting live data that could be changing in shape: the mean or the variance of that data set could be drifting. Put some checks around that at runtime, so that in the background it's always verifying that things are within the right bounds and the system is behaving appropriately, and if not, it flags it. Those strategies do a reasonable job. That's all the time we have.
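The sanity-check style of test described in that answer could look like this sketch. Here `predict_mpg` is a hypothetical stand-in for whatever trained pipeline your tested module exports, and the expected values and tolerances are illustrative.

```python
# Fixture-driven sanity checks around a predictive pipeline.

def predict_mpg(weight: float) -> float:
    # Stand-in model; in practice you'd import this from your
    # already-unit-tested module.
    return 46.0 - 0.008 * weight

def test_known_row_within_wiggle_room():
    # A fixture row whose rough answer we know, with some tolerance.
    assert abs(predict_mpg(3500) - 18.0) < 5.0

def test_predictions_stay_in_sane_bounds():
    # No negative mpg, nothing absurdly large - find limits and test them.
    for weight in (1500, 2500, 3500, 4500):
        mpg = predict_mpg(weight)
        assert 0 < mpg < 1000

# pytest would collect these automatically; they also run as a script.
test_known_row_within_wiggle_room()
test_predictions_stay_in_sane_bounds()
```

The same bounds checks, lifted out of the test suite and run against live predictions, give you the runtime monitoring mentioned at the end of the answer.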
So please give a nice round of applause. Thank you very much.