All right, so there have been amazing advances in the past few years in what machines can do, as you read about in news headlines like this. Machines can recognize people in images, transcribe speech, translate languages, drive a car, diagnose diseases. This just came out last week: they can even tell you you're depressed before you know it, based on how you type and scroll. But why now? AI and machine learning have been around for a while, so why suddenly all these big advances? I want to argue that it's not about the algorithms; it's actually all about the data.

If you think about it, among the top 20 contributors to open-source projects on GitHub are Google, Facebook, Amazon, Microsoft, IBM, Uber, Alibaba. Most of the biggest players in the AI space are open sourcing their AI pipelines. But what are they not open sourcing? Their data, because the data is their number one asset.

So while in 2017 a lot of the headlines looked like this, a really big year for AI, now we start to see a lot more headlines that look like this: biased AI leading to racial discrimination in policing and criminal convictions, gender discrimination in job ads, racial discrimination in image recognition and film recommendations. These are all data problems, not algorithm problems. Cathy O'Neil has made a very good argument that we need more transparency in unpacking the algorithms, especially deep neural nets, but again, I want to argue today that many of the problems we're seeing, and the solutions to them, are really about data and not algorithms. And not just training data for models, but the entire data ecosystem.

So why do I care about this? A healthy AI depends on a healthy ecosystem, and I'm an ecologist; that's my background, so I like to think about ecosystems. I've been studying complexity in nature for about 25 years. I used to run an institute in Yosemite National Park where we did work on alpine ecosystems, and it was very data heavy: we compiled 30 years of satellite imagery, blended that with a lot of hard-earned data on the ground, and used it to build, train, and test models to help predict where endangered species are living in the park and how to best prioritize decisions for saving them. Those were problems where we were very aware of, and concerned about, potential unintended consequences, because if something is about to disappear off the face of the earth, we don't want that to happen.

Now I work at Rakuten Intelligence, part of Rakuten, where I work more in economic ecosystems, on e-commerce data, and I'm becoming much more aware of the potential unintended consequences of how our personal data is used by algorithms.

So if we want to understand this, let's use an example of an unhealthy data ecosystem to see what goes wrong. This is the Russia Today news channel on YouTube. I don't know if you've seen it, but it turns out it's the most watched news channel on YouTube in the United States. How is it that a Russian news channel is the most watched news channel on YouTube in the United States? Similarly, if I type "Syria news" into YouTube, three of the four top results are Russia Today or other Russian news channels, one of them actually in Russian, even though Google knows I'm searching from a computer in San Francisco, California, in English. How does that happen? Well, the algorithms are doing just fine; they're doing exactly what they're supposed to do. The problem is the data, not the algorithms, and I want to unpack the data behind this.

So let's think about a traditional, simple machine learning pipeline. First we need training data to fit a model and predict something; that's how it starts. Then we need additional validation data to tune the model, or to pick among many models which is the
best one, with the goal of minimizing error and maximizing prediction success. Then we have independent test data to evaluate the model's accuracy. Once we have our tested model, we put it out in the wild, and it goes wild.

So what happened with RT news? If you go back into the history of their YouTube channel, they actively seeded the training, validation, and test data with clickbait viral videos. If you zoom in on those, remember this is a news channel, yet these are videos like "horror footage," "shocking video of a plane crash," "kid dies falling from Ferris wheel," "whale hits and smashes yacht," "surfer fights off shark attack live on TV." These are just salacious clickbait videos to drive high engagement on the site, because they know the models are trying to predict engagement and are optimizing for that. So the algorithms flag the RT news channel as a high-engagement channel, and lo and behold, their news shows up in the search results and all the recommendations, which then leads to higher engagement than any other news channel on YouTube.

This is a classic case of biased data leading to biased outcomes: a classic data problem, where RT news actively created a biased training data set to game the models. And it's a virtuous cycle for RT news, because the outcome results are then used for more training and testing, and off we go around the cycle again. Just because it's big data doesn't mean it's not biased, and it's a big problem. Again, the algorithms are doing a great job; they're doing a great job at engagement. It's just not the right thing.

There's a second data problem in this ecosystem that I'd like to look at too. Let's look not at the input data but at the outcome. The models are trying to predict engagement; that's the metric they minimize error on. And when the model is deployed, if people get recommended viral videos, of course they click them, they watch through the whole thing, they comment on them, and so on. So it looks like the system is doing great, the ad revenue is rolling in, and nobody pays attention, because it seems like it must be working. But the goal was actually to recommend relevant, good-quality content; engagement is just a proxy for that, one we can measure easily and optimize for. Since there's a mismatch between our proxy for success and our actual goal, we end up recommending propaganda instead of high-quality content. Again, there's nothing wrong with the algorithms; they're working great for engagement and ad revenue, just not for predicting high-quality content. The problem on this side is a mismatch between our outcome metric and the goal: the algorithm is doing a great job at predicting the wrong thing.

And the key problem here is, again, a data problem: we're often missing outcome data, data on real outcomes that's independent from that one engagement metric. For example, we rarely have good big-picture data on who got recommended what, and whether that's what we intended; we don't often get to see that big picture. Without alternative outcome data, we'll never know if we actually optimized for the right thing until it's too late. This is not an algorithm problem; it's a data problem.

It really boils down to two very simple data challenges. One is biased data in, biased outcome out. The other is the wrong success metric, the wrong outcome. They're pretty trivial in retrospect, but we miss them a lot. So the questions are: how do we better detect and correct bias? How do we get real outcome data so we can check our results and gut-check things? And how do we optimize for the right thing, or prevent accurately predicting the wrong thing? There are two core additions to this data ecosystem
that I think can help a lot, and that are emerging right now: one is data marketplaces, and the other is data interfaces. I'm going to go through each of these.

Let's start with data marketplaces. There's a lot of really interesting movement happening in this space to improve the health of our data ecosystem. Just a couple of weeks ago, Tim Berners-Lee, the creator of the World Wide Web, launched a decentralized personal data marketplace where you can have your own "pods," personal online data stores, and own and control all your personal data; it's a really interesting project. The Linux Foundation last year launched the Community Data License Agreement to facilitate sharing of open data, not just open code, because remember, it's not the algorithms, it's the data that's really important here. Oasis Labs, out of UC Berkeley, just raised $45 million, including from Andreessen Horowitz, to make it easier for us to share data without losing control of it, to prevent another Cambridge Analytica type of scandal from happening again. Ocean Protocol also raised about $45 million to kickstart a decentralized data marketplace focused specifically on data for AI and machine learning. And here's an example: Nebula Genomics was recently funded to take on 23andMe by building a decentralized personal genomics data marketplace, which lets you own your genomic data, control who gets to use it and for what, and even get directly rewarded for sharing it.

Last summer, Martin Giles made a case that we need to rein in the data barons to restrain their market power, but I think that with all these emerging open data marketplaces, it's possible we'll just make them irrelevant; that's sort of the goal. So decentralized data marketplaces are coming, and that's really great news. They're trying to solve some really important challenges around ownership: how do you set licensing, how do you control who uses your data and for what purpose, how do you have secure systems for sharing data without losing control of it, how do you get transparency to see who's using your data for what, and how do you align incentives between the data provider and the data user? It's all really exciting.

But even if we have better and more diverse data, we also need better tools for thinking about how to gut-check what we're doing, and this is the human side of the data ecosystem, the one I'm really interested in. Let me give you some examples. In response to all the recent press about AI bias, there's been some interesting activity: Facebook, Google, IBM, and Microsoft all announced in the last few months that they're launching tools to detect bias in algorithms. What's really interesting to me about some of these, like Google's What-If Tool, billed as "code-free probing of machine learning models," is two things. One, it's not a data visualization tool, even though it's visual; it's a visual interface for probing and discovering. Two, it's code free, which means it facilitates spontaneous, one-off exploration rather than intentional, scripted visualization. Both of these are really important for discovery interfaces.

My company, Vibrant Data, which I co-founded, created one of these, and we were recently acquired by Rakuten. I'm really excited to announce that we just open sourced it at openmapper.org. It focuses on exploration of complex networks, and I want to give you a quick example of how some researchers at National Geographic used it recently for some of their machine learning processes.
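At their core, the bias probes these tools automate come down to a simple kind of check: does the model perform differently across subgroups of the data? Here is a minimal sketch of that idea in Python. To be clear, this is not how any of the named products work internally; the helper names, toy labels, and group attribute are all invented for illustration.

```python
# Sketch of a basic bias probe: compare a model's error rate across
# subgroups of a sensitive attribute. All data here is made up.

def error_rate(y_true, y_pred):
    """Fraction of examples the model got wrong."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def subgroup_error_gap(y_true, y_pred, groups):
    """Per-group error rates, plus the largest gap between any two groups."""
    rates = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        rates[g] = error_rate([y_true[i] for i in idx],
                              [y_pred[i] for i in idx])
    return max(rates.values()) - min(rates.values()), rates

# Toy predictions: the model is perfect on group "a" and useless on "b".
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

gap, rates = subgroup_error_gap(y_true, y_pred, groups)
print(gap)  # 1.0 -- a large gap flags the model for closer inspection
```

The point of a code-free interface is that a non-programmer can run exactly this kind of comparison by clicking on a group column, rather than writing the loop themselves.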
So this is an image of one of my study sites in the Sierra Nevada, which is a pretty nice place to be, and I'm really thankful I get to go to places like this. It turns out that some of the researchers at National Geographic are learning that even just imagery of nature can have calming effects. Maybe not quite as good as being there, but images do have real effects on the brain, and they're wondering whether images could be useful in environments where people have no access to nature and no hope of getting it. Could images help, and if so, which ones are best? So they launched a research project to find out. They're analyzing thousands of images from the National Geographic Instagram feed. In this preliminary round they gathered a few months of data, measured attributes of every image, and tagged each one: is there a person in the image or not, is it urban or nature, is there a mountain in the picture, is it a distant view, is it from above or below, is it a close-up or far away, and so on. Then they're using that to build models and predict which types of images get the best responses.

To explore that preliminary data set, they just uploaded it to OpenMapper, which is a cloud-based tool right now, as a CSV. Once you upload the data, OpenMapper generates a similarity network to find structure in it immediately; it automatically detects high-dimensional patterns. Every node here is an image, and images are linked to one another if they share similar attributes. Similar images then automatically cluster together into groups that are auto-labeled based on the features most common in the group.
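The similarity-network step can be sketched roughly like this. This is not OpenMapper's actual implementation, just a toy illustration of the idea: link any two images whose attribute vectors are similar enough, then treat the connected components of that graph as the clusters. The attribute names and threshold are invented for the example.

```python
# Toy version of building a similarity network from tabular image attributes.
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two attribute vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def similarity_network(rows, threshold=0.9):
    """Link rows whose similarity exceeds the threshold, then return the
    connected components of the resulting graph as clusters of row indices."""
    edges = [(i, j) for i, j in combinations(range(len(rows)), 2)
             if cosine(rows[i], rows[j]) >= threshold]
    # Union-find over the edges to extract connected components.
    parent = list(range(len(rows)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i, j in edges:
        parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(rows)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Invented attribute vectors: [has_person, is_urban, has_mountain, is_closeup]
images = [
    [0, 0, 1, 0],  # distant mountain view
    [0, 0, 1, 0],  # another mountain view
    [1, 1, 0, 1],  # urban close-up with a person
    [1, 1, 0, 1],  # similar urban close-up
]
print(similarity_network(images))  # [[0, 1], [2, 3]]
```

A real tool would add layout, automatic labeling of each cluster by its most distinctive features, and rendering the images inside the nodes, but the clustering intuition is the same: similar rows end up connected, dissimilar ones don't.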
So this is a group of images that tend to be distant, unobstructed views of snow-capped mountains, and if you look at the upper left, it turns out that's the biggest group of images. Right away we can see in this high-dimensional space that there's already a bias towards these kinds of nature images in the data set. Not that that's bad, but now we know it's there. Here's another cluster, the next biggest one, focused on oceans: pictures of water and of animals in oceans, and they all have consistently low hue. Here's another cluster of images that are close-ups of animal faces with high edge density and high fractal dimension. The cool thing is that just by having the images render in the nodes, you get instant gut-checking on what these images actually are; you can zoom out and then zoom in to individual data points really quickly.

We can also zoom out and sort things in a scatter plot, say, to see which images get high engagement. Here they're sorted vertically by the emergent themes and horizontally by engagement, and off to the left we have distributions of all the attributes. As an example, on the urban-natural gradient we can see a really heavy bias in this data set towards nature images, which is not surprising, but at least it sparked an idea: there are all these other images with humans and urban scenes in them; what do they look like, and are there high-engagement ones among them as well? You can just quickly apply the filters to see which urban images have relatively high engagement, subset those, and dive in to look at them. There, of course, we have some really nice unobstructed distant views, a lot of sunset images that people liked, with cities in the foreground, but also some really interesting cultural landscapes.

Then, of course, people seem to like the panda on the lab bench. And this last one struck my attention: a tiger in a cage, which is kind of scary. It's a high-engagement image, but probably not great to show in a prison if the algorithm automatically recommended it. Of course that's obvious now, but if it were just some random image ID number, I wouldn't have known.

So now we step back, and this encourages us to go back to the data and find other metrics of responses. Instead of engagement, they also measured valence, which is a sense of mood (do you have a positive or negative emotional response to the image?), and arousal, the intensity of your emotional response. With those, we can look at all the images with negative, high-intensity emotional responses, dive into them, and see where they land; there are high-engagement ones there as well. And suddenly we found all the predators, all the shark attacks, even a bloody killer baby polar bear; there are plenty of these in the data set. So we could go back and look at the examples of high mood and high arousal and see what they look like; these might be a little better to show in a prison. Sure enough, they're the warm and fuzzy creatures you'd expect. What was interesting, too, is that some interesting human scenes were part of that group as well, which led us to think that maybe we should include more human and urban images in the data set, since they might also elicit positive emotional responses.

So that's just a rough view of the data, but for them, this view really helped them just
exploring the data: it helped them find and fix errors, it helped them identify unintentional biases in the data towards certain types of nature imagery, it helped them discover the unexpected value of urban and human scenes they hadn't thought about before, and it helped them think creatively about alternative outcome metrics that might better match their goals. They couldn't have done that just by staring at the raw data; I don't think it's possible.

Okay, so data interfaces are not data visualizations, and they're not data dashboards. They are tools that facilitate critical thinking: they let you easily go back and forth between a big-picture view and point-level details, and because they're code free, they let us follow our curiosity with one-off, spontaneous questions. So while a lot of artificial intelligence is focused on training machines to be smarter, to spit out predictions, data interfaces use machines to make us smarter, so we can increase our understanding of the data.

Okay, I'm just going to wrap up here and summarize how this all fits together in our data ecosystem. In our current system, real outcomes are generally a black box. We rarely have the data to see the big picture of who got served what ad, or who got recommended what video or what product. That means it's really hard to track whether we're actually achieving our goal, or whether we're optimizing for the right thing. Meanwhile, a lot of the training, tuning, and test data are biased and hidden behind closed doors. This situation makes it really easy to have a model that is very accurate without knowing whether it's actually achieving the goal we wanted.

The emerging decentralized data marketplaces are critical for providing better data and for increasing the accessibility and transparency of that data. They're also going to be critical for providing independent data that brings transparency to the outcomes of other companies' machine learning models. Data interfaces are critical for helping us gut-check whether the outcomes match our goals, and they can help us be more creative about the success metrics we use to evaluate models, so those metrics are more likely to match our goals. Finally, data interfaces can help us detect bias early, and help us reward people specifically for providing data that eliminates that bias or fills the gaps.

This, to me, is a healthy data ecosystem, and it's what the ecosystem should look like. We really need to invest in these things, because we need healthy data ecosystems if we want future AI headlines to look more like this, and not like this. Thank you very much.