from San Jose in the heart of Silicon Valley. It's theCUBE, covering Big Data SV 2016.

Welcome back to theCUBE. I'm Peter Burris, with my co-host Jeff Frick, and we are in the last lap here at Strata + Hadoop in San Jose as part of our overall Big Data Silicon Valley conversation on theCUBE. We've been having some absolutely phenomenal interviews over the course of the past couple of days that have generated an enormous amount of content, Jeff. And this next session I think is gonna be especially exciting, because it's not focused just on the technology, but rather starts asking serious questions about the real business problems that some of these technologies are going to create, or not, if we do it right. So our guest today is Mike Williams of Fast Forward Labs. Mike, thanks for coming.

Thank you.

So Mike, let's just jump right into it. One of the things that Fast Forward Labs thinks about is machine learning and the impacts of machine learning. Let's unpack it. What is machine learning?

Okay, so there are two kinds of machine learning, supervised machine learning and unsupervised machine learning, and perhaps for the purposes of this conversation we should focus on supervised machine learning. In supervised machine learning there's a whole zoo of algorithms, a whole zoo of ways of doing this, but fundamentally, if you blow away all the smoke and mirrors, what they do is learn rules of thumb from data. You show them examples of historical data, historical trends, or the kinds of patterns you're interested in, and the algorithm will learn which patterns work. And those patterns, we call them rules of thumb. So for example, you might know that a certain kind of employee is likely to succeed in a job given their background, their educational background, their demographic attributes, for example. Or you might know that a certain kind of customer is more or less likely to churn. You know that based on the historical data. And that historical data in supervised machine learning has a fancy name: it's called training data. You train the machine on that historical data, it learns those rules of thumb, and it applies them. The problem is that we have a name for rules of thumb that we apply to human beings. They're called stereotypes. And stereotypes can be true, they can be false, they can be positive stereotypes or they can be negative stereotypes. But I use the word stereotype deliberately, because it should set alarm bells ringing for you. Stereotypes raise the possibility of doing harm: ethical harm, legal harm in the sense that you become liable for some sort of civil suit, or commercial harm, where you simply build a bad product because you built it based on stereotypes. So that's the kind of machine learning I'm interested in, and those are the kinds of patterns we can discover when we think about humans.

So in certain respects, you're looking at developing rules of thumb that guide the development of other rules of thumb, so that the rules of thumb being developed don't go off the deep end and start generating behaviors like we saw this week from some of the Microsoft machine learning experiments. So I'm gonna draw another analogy, and you tell me if this is right. In the legal system, we talk about statutory law, which is the law that governs day-to-day activities, and constitutional law, which is the set of rules for how you go about changing the statutory laws. Are we talking about a structure that's like that?
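To make the "rules of thumb from data" idea concrete, here is a minimal sketch of supervised learning in Python with scikit-learn: a classifier is fit to historical customer records and judged by precision and recall, the technical measures Mike mentions later. The file name and column names are hypothetical illustrations, not anything from the interview.

```python
# Minimal sketch of supervised learning: fit a classifier on historical examples
# (the "training data") and apply the learned rules of thumb to new ones.
# The CSV path and column names below are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

history = pd.read_csv("customer_history.csv")       # historical records (hypothetical file)
X = history[["tenure_months", "monthly_spend", "support_tickets"]]
y = history["churned"]                               # the pattern we want to learn

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)   # training data in, rules of thumb out
predictions = model.predict(X_test)

# Precision and recall are the technical measures of how well those rules of thumb work.
print("precision:", precision_score(y_test, predictions))
print("recall:", recall_score(y_test, predictions))
```

Note that nothing in this loop asks whether the learned rules are fair, only whether they reproduce the historical patterns accurately, which is exactly the concern raised in the conversation.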
Well, as you may be able to tell from my accent, I'm not from around these parts, and I'm also not a lawyer, but there is a history of law going back to, for example, the Civil Rights Act. We go right back to 1964, and that act was updated in 1991. Both of those years were really before the commercial application of machine learning. So the legal framework, I think it's fair to say, has not kept up with developments, and that places on those of us who deploy machine learning models in the real world a responsibility to think creatively about the legal framework. And I don't mean that in the sense of what can we get away with. I mean, if the law had caught up with us, what would it be telling us not to do? So an example of that is you might want to build a model that decides whether to admit students to your university or not. And you have a tremendous amount of data on applicants. They give you their name, their zip code, their grades of course, all these things. And let's say you build a supervised machine learning model. And remember what a supervised machine learning model is: it's one that depends on historical data. So you train your machine learning model on historical data, and if you do a good job, and by that I mean a good job in the technical sense, your model is accurate, it has high precision, high recall, all these technical terms, what that means is you are simply going to recapitulate and reinforce the historical biases that are present in that data. So let's say you built the model using data from 1870. That means you would set in stone, or in code, the biases that existed in 1870, and reinforce them and recapitulate them. And the problem with setting it in code is that not many people can scrutinize your code. Either it's private, and in that situation it's very difficult to prove from the outside that someone has broken the law, or, for most people, machine learning models are black boxes, and what goes on inside that box is obscure. And that means that those of us who are technical and work with these models have a particular duty of care to think about what's inside the black box. Because it's not a black box to us: we explained it to the computer, and the computer is stupid. You have to explain things to the computer in that very pedantic, explicit way we call programming. So, yeah.

But what you're describing sounds more like machine execution versus machine learning. So where does the learning from the machine come in? How does the machine take the updated data and start to morph the algorithms to reflect not the historical, but the new, looking forward?

That's a great point, and that's one of the places in which we have an opportunity to mitigate the harm we do. If we simply set in stone a historical data set and say, right, we're gonna act on this for all time going forward, then clearly we've got a problem, especially if we're employing it in the real world in a sensitive area like college admissions or advancing mortgage credit, for example. So one of the ways in which we can mitigate that harm is precisely the way you're talking about. It's keeping your training data up to date, throwing away old training data, or perhaps even subjectively coming in as a human who has opinions about what's right and wrong, either because you have them internally or because the law tells you you'd better have opinions about what's right or wrong, otherwise you're in trouble.
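One way to read the "keep your training data up to date, throw away old training data" mitigation is simply to retrain on a recent window of history rather than on everything ever collected. A rough sketch, again with a hypothetical admissions data set and column names:

```python
# Sketch of one mitigation discussed above: retrain only on recent records so the
# model does not keep recapitulating decades-old patterns.
# The file name, column names, and five-year cutoff are hypothetical choices.
import pandas as pd
from sklearn.linear_model import LogisticRegression

applications = pd.read_csv("admissions_history.csv", parse_dates=["applied_on"])

# Keep only applications from the last five years; discard older, staler history.
cutoff = applications["applied_on"].max() - pd.DateOffset(years=5)
recent = applications[applications["applied_on"] >= cutoff]

X = recent[["gpa", "test_score", "extracurricular_count"]]
y = recent["admitted"]

model = LogisticRegression().fit(X, y)  # trained on fresh data, not on 1870's
```

The window size is itself a judgment call a human has to make; a narrower window tracks current practice more closely but gives the model less data to learn from.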
And you can, for example, censor your data or adjust it. So you might censor it. For example, you might have race in your data set. That might be a piece of information you collected for whatever reason. Maybe you don't want that in the model, and there are good reasons you wouldn't. And in fact, there are good reasons you would as well. But censoring it out does at least reduce the chance that the machine is simply going to act on race as a predictive feature of something like, for example, is this person a good bet for me to lend money to? So we have a duty to censor the data the model gets, and to keep it up to date, keep it fresh.

So there might be, as you said, circumstances where it would make some sense to put race in the model, for example healthcare, because genomically people are slightly different, and that has impacts on the nature of their health. So again, I wanna come back to this notion; it sounds like we're looking for a couple of things. One is we're looking for new methods, whether that be testing or other, to periodically assess our machine learning models. But there are also some rules, and I'll use, say, Asimov's laws, some rules about what constitutes good behavior in machine learning that we have to sustain over a period of time as we learn more about what can go wrong, to ensure that these machines do not create problems. Have we got that right?

Yeah, you've hit on exactly where the situation is right now. We're lacking those rules. We are lacking a set of rules you could write down on a postcard to say, am I deploying a safe machine learning model? And one of the reasons that hasn't happened yet is that measuring ethics is hard. You can't assign a number to how ethical a thing is. I mean, a judge can assign a number once you're in trouble. It's $150 million.

I recognize it when I see it, right?

Exactly, yeah. And hopefully you don't get there; you wanna avoid that point. You want your data scientist to be able to say, right, there is a problem with this model, and if I do this, then that problem becomes x percent less severe. And here are the red flags, like including race or not including race, for example, that I should be looking out for. That index card with those rules does not yet exist. So we have a lot of work to do in this space.

We're throwing around machine learning as though it's this killer application for big data. Maybe it's not, but it will certainly be a very valuable technology, and we have to find ways to ensure that it doesn't do harm. Mike Williams, Fast Forward Labs, thank you very much for being here. Great conversation. I'm Peter Burris, this is Jeff Frick, we're here at Strata + Hadoop, in the last few moments of theCUBE at Big Data Silicon Valley. We'll be back after this short break.
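The censoring step Mike describes above can be as simple as dropping the sensitive column before training, paired with a crude after-the-fact check on whether predictions still differ sharply by group, since proxies such as zip code can reintroduce what was censored. A sketch under hypothetical file and column names, not a complete fairness audit:

```python
# Sketch of "censoring" a sensitive attribute before training, plus a simple
# red-flag check: compare predicted approval rates across groups afterwards.
# File and column names are hypothetical; remaining columns are assumed numeric.
import pandas as pd
from sklearn.linear_model import LogisticRegression

loans = pd.read_csv("loan_history.csv")

sensitive = loans["race"]                       # held out of the model entirely
X = loans.drop(columns=["race", "approved"])    # censor the sensitive column
y = loans["approved"]

model = LogisticRegression().fit(X, y)
loans["predicted_approval"] = model.predict(X)

# Even with race censored, proxies (like zip code) can leak it back in,
# so check whether predicted approval rates differ sharply by group.
print(loans.groupby(sensitive)["predicted_approval"].mean())
```

A large gap in those group-level approval rates is exactly the kind of red flag the conversation says a data scientist should be able to report and reduce, even though no single number settles whether the model is ethical.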