First, thank you all for being here. The lecture we're giving today is the result of the ODI's innovation programme, whose first year ended in March this year. We're presenting one of the projects we've worked on, on AI, part of a broader programme looking at emerging technology. This was a team effort from the whole ODI team, but I want to single out two colleagues, Jamie Fawcett and Jared Keller, who co-authored the report this talk is based on. I'm giving the floor to Lucia.

OK, cool. Thank you, Olivier. Recently there has been a lot of excitement around artificial intelligence, and large companies have made huge investments in the technology. IBM, for example, has invested 15 billion in its cognitive system Watson alone, and PwC estimated that artificial intelligence could add $15.7 trillion to global GDP by 2030. All these big numbers and bold pronouncements suggest that AI has huge potential to improve our lives.

But first of all, why AI, and what is it? More and more businesses use AI to help them make decisions: where to invest, what route to take, even what movie to watch. That shows how present AI is becoming in our daily lives. But how does it really work? Think of it as a black box: you put lots of data into it, such as pictures, your tastes and your social interactions, and it returns decisions, recommendations and pattern recognition. There are many types of AI system, as you can see, and it's very confusing to pin down what AI actually is. So here's what we think it is: artificial intelligence systems are a combination of clever statistical and mathematical techniques, an understanding of the world, and lots of data.

The current trend among AI businesses is to take large sets of data, feed them into AI systems to train them, and then silo that data rather than publishing or sharing it. If that continues, it will lead to negative effects such as oligopoly, which would stifle innovation: AI systems use data to train their algorithms, so if startups can't access that data, fewer businesses can build new AI systems of their own. Researchers have noticed that this is a natural tendency, partly because this is a self-feeding data ecosystem: those who have the most data can train the best AI systems, and those with the best AI systems collect the most data, and so it goes round and round. One of the most famous examples is the giant Facebook and its famous algorithm, which I'll hand over to Olivier for.

Thank you. It's worth pausing for a moment. We've been talking about artificial intelligence systems and data, data, data. In our collective narrative about AI we hear the word 'algorithm' a lot, and I wanted to spend a few minutes unpicking what that word means in the context of AI, and why we still think 'data' is a word that needs to come into the conversation as an equal. Now, a caveat: AI is a moving target, AI is wrapped in a lot of buzzwords, and AI is really badly defined.
One of the experts we talked to as part of our research said something I really like: we need different words for AI. For the time being we're stuck with the ones we've got, but know that it is a moving target and that the definitions and words we use are going to change. For one thing, one of the best definitions of AI I've heard is that AI is basically the field of computer science covering stuff that doesn't work yet; once it works, we just call it computer science. Keep that in mind.

But let's look a little at what we mean by 'algorithms' when we talk about artificial intelligence. As we said, a typical analogy for AI systems is a black box. If we open the black box (and 'typically' matters here, because there is a really, really broad set of techniques and applications that we lump under the term AI), what we typically find is a lot of data, usually training data. It's worth noting that in some cases, such as self-play systems like AlphaZero that generate their own training data, you don't actually need external training data, but in the overwhelming majority of cases an artificial intelligence system will have been trained on a lot of data. This icon here is my feeble attempt at conveying that there is a lot of statistics happening, a lot of methods to deal with this data. You are modelling the world: you're using statistical techniques to optimise something, and that something is in some cases a neural network, in some cases other things. But let's think of it as a model of the world that has been trained on training data with statistical techniques.

So far so good: you take some data and use it to train a model using statistical methods. Then you take the model you've just trained, give it input data, and you get stuff out: insights, recommendations, and so on. I talked earlier about taking our tastes: our history of watching and reading things, plus some statistics, gives us a model of what we like. Then we say 'today I want something' and it gives us recommendations.

What's interesting is that when we look at what is typically called 'the algorithm' (and there's a lot of talk of 'the algorithm' when talking about Facebook), we mean this output side: the algorithm as the result, the thing shown to us in response to all our input. But when you talk to the data scientists who work on that black box, what they call algorithms is the set of techniques they use to train the artificial intelligence system. So 'algorithm' is already a bit of a complex word. What's interesting, however, is that whichever definition you use, a lot of the time these algorithms are open. The techniques used to train artificial intelligence systems, and the software used to run the models once trained, tend to be open source, and they're largely based on mathematics published in, say, the 70s and 80s. There is of course cutting-edge AI research being published right now, but the overwhelming majority of the algorithms used in AI are open.
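To make that train-then-use pipeline concrete, here is a minimal sketch in Python using scikit-learn. The viewing-history features, labels and candidate films are all invented for illustration; they stand in for the 'tastes plus statistics gives a model of what we like' example above.

```python
# Minimal sketch of the train-then-recommend pipeline described above.
# The viewing-history data below is invented purely for illustration.
from sklearn.linear_model import LogisticRegression

# Training data: each row describes a film someone watched
# [runtime_minutes, is_comedy, is_documentary]; label: did they like it?
X_train = [
    [90, 1, 0],
    [150, 0, 1],
    [95, 1, 0],
    [160, 0, 1],
    [100, 1, 0],
    [140, 0, 0],
]
y_train = [1, 0, 1, 0, 1, 0]  # 1 = liked, 0 = did not like

# "Statistics happening": fit a model of this person's taste
model = LogisticRegression().fit(X_train, y_train)

# Inference: "today I want something" -- score new candidate films
candidates = [[85, 1, 0], [155, 0, 1]]
for film, score in zip(candidates, model.predict_proba(candidates)[:, 1]):
    print(film, f"predicted probability of liking: {score:.2f}")
```

Note how both senses of 'algorithm' appear here: LogisticRegression is the training technique that data scientists mean by the word, while the fitted model producing the recommendations is 'the algorithm' in the everyday sense.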
And the reason behind that is that we now have a relatively mature open source movement that has convincingly made the case that open sourcing those algorithms creates better transparency (you get more scrutiny over the code used to train and then power those models) and collaboration: you don't have to reinvent the wheel, because you can reuse software built elsewhere, collaborate on it and make it better. What's interesting, then, is to look at the perception and behaviour of organisations that use AI today with regards to data. I'm going to give the floor back to Lucia with this quote: much of the new hype around AI is based on better hardware to run algorithms that are not particularly new, and a lot, lot, lot of data. So let's look at how those organisations deal with data.

Thank you, Olivier. So let's have a closer look at data. Providing access to data and to algorithms is not a binary choice; it's not just yes or no. It's about whether you want to share, what you want to share, with whom, and how. To describe access to data, we at the ODI like to use the ODI Data Spectrum. At the very left-hand side, organisations choose not to share anything at all with any external organisation; at the very right is the other extreme, publishing as open data under a licence that enables anyone to access, use and share the data. In between there are different ways of sharing: with someone in particular, with a selected group, or publicly but under a restrictive licence.

In our research we found that access to algorithms works pretty much the same way. That's why we came up with this quadrant, in which we broadly define archetypes and trends in how we think businesses using AI approach access to data and access to algorithms. At the very top is 'open algorithms' and at the bottom 'closed algorithms'; at the very left is 'closed data' and at the right 'open data'. This is by no means a categorisation of particular products or services; it just gives a broad idea of how businesses approach access to data and algorithms in order to create a competitive advantage.

We have seen a trend towards the top left-hand side of this quadrant: open algorithms and closed data. One reason businesses open up their algorithms is so that other companies can use them and build on them, and so that potential or existing issues can be identified along with potential benefits. A second reason is to speed up the adoption of new methods and new ways of operating; in many of the interviews we led, businesses expressed a desire to help the AI community progress. But then, why keep data closed? I'm going to hand this over to Olivier.

So, the first reason we heard, when exploring with the roughly 20 organisations we interviewed in this research why they were relatively convinced by open source, is not very good news to those of us who advocate for open data.
We'll look in a little bit at why they don't typically want to open the data they do hold, but there was actually some hostility to using open data at all in their AI systems. This is exemplified by this quote from Anvali Bak, from a pan-European VC funder, who says that the organisations she talks to barely use open data, because they think it might not have the relevance and the quality they really need to train their algorithms. That's bad news, because if you work in data science you know that 80-90% of your work is dealing with data that is messy, dirty and not 100% relevant. So in the open data movement we have work to do to counter the expectation that open data should be better than this: either we get the message out that there is high-quality open data available to train AI systems, or we actually fix datasets so that they are better suited to AI.

But it's not all a negative reaction to open data. Many of the organisations we talked to were not arguing 'we don't like open data, we don't use it', but rather 'we hold data and we want to keep it closed', for a number of reasons. I'm going to let Lucia talk you through some of the reasons we heard.

Thank you. We talked to a number of businesses and asked them why they were reluctant to share or open up their data, and they gave two reasons. The first is personal data. Data itself is already a sensitive topic, and personal data adds one more layer of difficulty, because of privacy. Roma from Frosha expressed it quite succinctly when we asked whether they would benefit from data that is currently siloed, whether it would help them create or innovate. This is what he said: 'Of course we would be able to develop more business models out of that, but as a citizen I would probably object to that. With good right, there are some silos.' This approach is quite common when businesses talk about data about people: access to some data should be restricted for personal or privacy reasons, unless people have given explicit consent. Besides privacy issues, there are also business reasons for keeping data limited or closed: retaining the trust of customers and avoiding potential regulatory issues. These business reasons are justifiable, but many companies use them as an excuse to keep the data they hold proprietary.

Proprietary data is the second reason businesses are reluctant to share their data. Companies see proprietary data as a form of intellectual property, their market advantage, and they don't want to disclose it. That is also a valid reason, but only if the data genuinely bears on their profit or loss; in that case it is understandable that they don't want to share it. It does mean, though, that if less data is shared there will be less innovation. And knowing that AI systems need loads of data to train on, it also means the available training data will be limited.
This limited data can lead to a lot of bias. As Sandra says: if you talk to machine learning people, they will tell you that if you don't have a rich data set, you will actually start discriminating against people, because there is bias in the data set. I'll hand over to Olivier to give you two interesting examples.

I'll stay with this quote for a moment, because there's a double whammy here. One part is that silos lead to a lack of diversity in training data sets, which indeed means that bias can creep in. But it's also worth remembering the black box analogy we've used a few times. Most artificial intelligence systems at the moment are black boxes: AI systems are very bad at explaining why they came to the decisions they came to, partly because of the way they're built. You've got a set of weights in a neural network, and it doesn't really tell you 'I did this because of that'. There's a lot of research going on to find ways for AI systems to be self-explanatory, but at the moment the only scrutiny we have over an AI system is whatever is available to us to query it: we can use the system and go 'yep, there's a problem here'. Mostly what we've got is the code, which as we said earlier tends to be open. The code used to train AI tends to be open, but the data is closed, and so there is no scrutiny of the data used to train an AI, and of whatever bias might be in that data, because it is typically held in silos, as proprietary IP or as personal data. You can't go and examine it for bias. That leads to some pretty egregious discriminatory systems.

Here are a couple of examples. They're a few years old, and there are more recent, subtler ones, but for making my case these are good ones. The first is what happened when Google, holder of quite a lot of data and quite a lot of resources to make sure its systems are not massively racist, released a feature in its photo software that very conveniently sorted photos by topic. Probably because it had been trained mostly with photos of white people as the exemplar of what a person is, it classified photos of one user's friends with dark skin as 'gorillas' (Jacky Alciné was the person who raised the issue with Google). You can be pretty sure Google did not intend that. Sometimes you do have a view of the world and you try to have it reflected in your AI system; in this case, that was not the case. It is largely down to the fact that there was no way to test for that kind of bias until someone put their own photos into the system and saw the result.

Another example, tragically funny, is from around 2010. Nikon, a multinational but Japan-based company, released a camera with a really nice feature to help you take better photos: trained on facial features, it tells you when someone is blinking. Joz Wang, the person in the picture here, kept taking photos of herself and her family, and the camera kept asking 'did someone blink?', to the point that her brother had to test how artificially wide he had to open his eyes for the camera to stop saying it. That, again, is a pretty egregious case of a training set with inherent bias: the blink detector had been trained mostly on features from Caucasian faces, and therefore did not take into account that across the world there are many more types of faces and shapes of eyes, which it mistook for blinking.
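Outsiders could not run this kind of test, but internally it is a simple one. Here is a minimal sketch of a per-subgroup error audit in Python; the predictions and group labels are invented for illustration.

```python
# Minimal sketch of a per-subgroup error audit -- the kind of check that
# can surface the biases described above before a system ships.
# All data here is invented for illustration.
from collections import defaultdict

# (true_label, predicted_label, subgroup) for a held-out test set
predictions = [
    ("not_blinking", "not_blinking", "group_a"),
    ("not_blinking", "not_blinking", "group_a"),
    ("not_blinking", "blinking",     "group_b"),  # false "blink" alarm
    ("not_blinking", "blinking",     "group_b"),
    ("blinking",     "blinking",     "group_a"),
    ("blinking",     "blinking",     "group_b"),
]

errors = defaultdict(int)
totals = defaultdict(int)
for true, pred, group in predictions:
    totals[group] += 1
    errors[group] += (true != pred)

for group in sorted(totals):
    rate = errors[group] / totals[group]
    print(f"{group}: error rate {rate:.0%} ({errors[group]}/{totals[group]})")
```

A large gap in error rates between groups is exactly the signal both examples above would have shown: it usually means the training data under-represents the group with the higher error rate.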
So, what then? We've talked you through the fact that we spoke to quite a lot of organisations, and told you that most of them tend to sit in the top left of the quadrant: using open source software to build and power their AI systems, but sticking with closed data. I could be mean and say: if you want to know what we recommend as next steps, read the report, it's good. But let's very quickly go through some of the things we think are the right next steps.

The first one is a note of optimism. While a lot of the popular imagination around AI revolves around the tech giants, the Googles, the Facebooks, the Apples of this world, most of the unlocking of value through AI is probably not going to be done by those organisations, but by organisations that sit on a lot of data and don't know what to do with it. Those are the organisations that we think have the opportunity to test models other than 'closed data as personal or proprietary IP'. There are ways for a company holding a lot of data to do something other than 'we're going to build an AI on that ourselves': maybe they partner with others; maybe they work with experts in the field to create systems with the data they hold.

But that means a few things. The optimistic bit is that much of the data is yet to be unlocked. It also means we need better data sharing and access while retaining trust. If you look at the recent scandals and problems around AI, both the Facebook and Cambridge Analytica scandal and, a little earlier, the issues DeepMind had with its data sharing agreement with the Royal Free Hospital, those were well-meaning data sharing agreements that backfired spectacularly. We think the problem is not with the concept of sharing or giving access to data; the problem is that we are at the infancy of figuring out data sharing agreements and data sharing models. So a few months ago we started a project to categorise, classify, understand and explain data sharing and data access models, so that companies and organisations can pick and choose the ones that work for them while retaining trust, rather than 'the lawyers say it's OK' and then, two years down the line, a massive scandal and really bad publicity.

And to unlock data that people keep closed because 'there might be some personal data in there and we don't want to take the risk': we acknowledge that the risk is real, especially now with the UK's implementation of GDPR in the Data Protection Bill, which is going to make wilful re-identification of people from data a crime (a felony, a misdemeanour... I'm definitely not a lawyer, but a bad thing). We need to help organisations deal with that risk, rather than go binary: 'no, we're not going to do this, because there might be a risk of re-identification, and I don't want to end up in jail because someone somewhere re-identified data I shared because there was some personally identifiable data in there'. So we've started a second project to help organisations better manage risk when dealing with data sharing, data access and, in some cases, opening data.

That's it from us, thank you. Hannah, I'm supposed to give you the microphone, so I'm going to do that right now. Are there any questions, any comments, any concerns, anything that wasn't clear?
Hi, thank you, that was really great. My question is: at the start, especially with that matrix of open algorithms and open data, you were talking about open data, but by the end you were talking about risk management and access to personal data. That makes sense to me, because most applications of AI involve personal data. But I'm wondering if you have examples where open data truly works, regardless of whether a venture capitalist thinks it's relevant: examples where algorithms can and should be trained on open data? Because I feel the narrative muddies personal data and open data.

There are a few examples out there of successful open datasets being used to train AI; I'll mention just two. One is around OpenCV and computer vision: categorising things in computer vision is really, really hard, but it has been done, it's close to a solved problem, and there are pretty big, good open data sets for that. There are also good data sets for computational linguistics. One of the examples we have in the report is a project by Mozilla to open the data set they used to train their voice recognition. So it can be done, but notice that these are data sets specifically made to be used to train AI, not just random open data sets that happen to be lying around somewhere.

Yes, you start at the beginning with, let's say, some kind of data box... What do you mean by 'data box'? My question is: you have the data and you are running this AI, but of course it has to run on some kind of infrastructure. In this box you have big data, you have AI, so where do they run? In other words, I picture that box as a data cube where you can hold a lot of data and analyse it. This means AI is not just analytics and data; it's also infrastructure, all together. That's how I see AI; that's my observation.

Wait, we can share this one; I'm rather proud of it. So, the black box is a metaphor to explain why it's so hard, at the moment, to understand what AI typically does. The way to explain the metaphor is this: if you've got a dog, you kind of understand why the dog does what it does (there's food there, so it's salivating), but you can't really understand it by opening its head and looking inside. You can put in electrodes and perhaps find some patterns that relate to seeing food and being hungry, but in the same way that neuroscience doesn't really have a grasp of what is happening and why, we create systems in AI that we can't really understand. We know what goes in (training data) and we know what comes out (decisions, pattern recognition and so on), but there's an aura of magic, because it's so hard to explain the why. It's very, very hard to create a system that says 'yes, I gave you this response because of that'; the answer typically is 'I gave you this response; you trained me with some data; I don't know'. But the second part of your question was about infrastructure. Do you want to answer that?
Thank you very much for mentioning infrastructure. From our research, the infrastructure point is basically the fact that we have access to more and more data nowadays, in the 21st century, compared to the past. The AI systems themselves are nothing new; it's just that right now we have much more data to feed in, and that's why we're able to develop better-trained AI systems. I hope that answers your question.

You're right. I think this black box will be eliminated once you are able to combine it with infrastructure. For example, IBM: you mentioned 15 billion, and they spent that 15 billion mainly to link infrastructure with AI, because you have different types of teams, going from infrastructure to the data scientists to the developers to monitoring and back again. Cognos already does the same sort of thing, so in a way it's the same idea as Cognos, just done differently and with more complexity. That's where I see AI and the utility of AI.

From my perspective, the challenges of getting the companies who are sitting on this data to open it up are probably almost insurmountable. Firstly, the competitive advantage of this data: for most of these companies, the data is the main thing they have. As you said, the algorithms are known, and they're not going to have engineers as good as Google, Facebook and the others, so the data is all they have as a way of competing, and it's also a barrier that stops other people competing with them. I work for a search engine company: without some initial data to start with, you cannot even start, and you're so far behind Google that you're never going to catch up. So there's a massive disincentive to open this data. Then there's the other aspect, risk, which is also extremely high. I work on the privacy side of this, and even data sets that companies think are anonymised often are not. A classic example is the Netflix Prize challenge, where Netflix published a data set they thought they had anonymised, to see if people could come up with better recommendations, and researchers proceeded to de-anonymise the people in it. And these are people who know this stuff best. The more public data sets there are, the easier de-anonymisation gets, because data sets can be cross-referenced. So I wonder if you have any insights on whether this is at all possible?
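To make the cross-referencing the questioner describes concrete, here is a minimal sketch of a linkage attack in Python. All the records are invented; the real Netflix de-anonymisation matched rating patterns and dates against IMDb reviews, whereas this sketch uses simpler demographic quasi-identifiers to show the mechanism.

```python
# Minimal sketch of a linkage (cross-referencing) attack, the mechanism
# behind the Netflix Prize de-anonymisation. All records are invented.

# "Anonymised" release: direct identifiers removed, ratings kept
anonymised = [
    {"zip": "10001", "birth_year": 1980, "rating": "film_x: 5"},
    {"zip": "94105", "birth_year": 1975, "rating": "film_y: 1"},
]

# Public auxiliary data set with names and the same quasi-identifiers
public = [
    {"name": "Alice Example", "zip": "10001", "birth_year": 1980},
    {"name": "Bob Example",   "zip": "94105", "birth_year": 1975},
]

# Join on the quasi-identifiers (zip code + birth year) to re-identify
for record in anonymised:
    matches = [p["name"] for p in public
               if p["zip"] == record["zip"]
               and p["birth_year"] == record["birth_year"]]
    if len(matches) == 1:  # a unique match re-identifies the record
        print(f"{matches[0]} -> {record['rating']}")
```

The point the questioner makes follows directly: every additional public data set adds columns an attacker can join on, so removing names alone is rarely enough.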
Lucia can talk in more detail about the organisations we've seen that do actually manage to find models other than closed data with open algorithms. But to begin our answer: again, it's better to think of this not as a binary choice of open versus closed, but as a spectrum, and a lot of the answer is going to be 'it's not binary'. Foolproof, total, forever anonymisation is nearly impossible unless you mangle the data to the point that it's unusable; we get that. But managing risk is not a binary of no risk versus full risk: you understand the risk you're taking, and you make your choices accordingly. Likewise with opening: we're not saying that all AI should be based on completely open data and that all organisations should open their data entirely. We're saying that the model of closed data and open algorithms that is currently preferred might not be the best for everyone. Lucia, did you have examples of successful organisations outside this top-left quadrant?

Well, there's one we've mentioned a couple of times: a Dutch startup based in Amsterdam. They collect personal data, and they say the new GDPR law has actually been beneficial to them. What they do is collect personal data, anonymise it, and then sell it to companies. So they have an open algorithm and, sort of, shared data. That's why at some point in the quadrant we have a shared model, right in the middle, which is where we see a lot of companies. They tend to move towards this shared model because, through collaboration and partnerships, it's actually easier for them to open their data: there's more trust built during the whole process.

I've noticed a trend of enriching web pages with structured data to enable natural language processing in search. How do you see this trend developing, and where can people find more information on enhancing their web pages in that way?

I'm going to try to answer, but I think your neighbour behind you, who works on search engines, might have more knowledge, so don't throw a bottle at me if my answer is wrong. I'll answer somewhat broadly, in the sense that if you count the web as data that is publicly available, most of the web is dirty and messy, and any effort to increase its structure will, not automatically but very likely, help people then use that structured information.

Is that essentially going beyond the semantic web, with everything perfectly marked up, and enhancing it with structured data? This process is quite complex and complicated for many organisations. We've got 99.7% SMEs in this country, and they have absolutely no clue about this. How can they approach the topic? Otherwise they won't appear in search soon.

We've got quite a lot of it working already. Natural language processing techniques are, again, not entirely new; what's important is to realise that they are useful only to some extent. In my previous role, before joining the ODI, my team was building natural language processing to understand the web better, and the joke was that we failed, time and time again, to build algorithms that could detect sarcasm. Not sure why. But we were still able to automate quite a lot of the understanding of content out there with fairly effective machine learning techniques. So the key is to understand what it is usable for and what it is not, where the efficacy and the accuracy of those methods lie.
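As an illustration of the kind of structured-data markup being discussed, here is a minimal sketch generating schema.org JSON-LD, one common way of enriching web pages for search engines. The article details are invented for illustration.

```python
# Minimal sketch: generating schema.org JSON-LD markup, one common form
# of the structured data discussed above. The article details are invented.
import json

article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Example: the role of data in AI",
    "author": {"@type": "Person", "name": "A. N. Example"},
    "datePublished": "2018-05-01",
}

# Embed this <script> block in the page's HTML so search engines and
# NLP pipelines can read the page's structure without parsing free text
snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(article, indent=2)
    + "\n</script>"
)
print(snippet)
```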
OK, I think there was another question here? Yes, thanks for the lecture, it was very interesting. Something that may also be interesting here: when you're talking about data, and about having to be sure of the kind of data you plug in, then apart from good-quality data, I think what's also at stake is the bias of the people actually plugging in that data. A big part of this topic is having more diverse teams working on the data. Open source is a very good thing for helping to shape an unbiased rather than a biased future, but I think something very important to add is working on creating diverse teams to avoid that bias. I don't know if you can answer that other than by nodding. Thank you.

Thanks for the questions. As you can see on the screen here, the R&D project, funded by the UK government, is still ongoing; research is still happening. So if you'd like to collaborate with the team or find out more about what we're doing, please do get in touch: info@theodi.org. Can we please give Olivier and Lucia a big round of applause?