Please, everybody, and welcome to this session. My name is Alan Smeaton, I'm a Professor of Computing at Dublin City University and one of the founding directors of the Insight Centre for Data Analytics, and you're at the session which is the Royal Irish Academy sponsored talk on show me your data and I'll tell you who you are. Please take note of the sign that's up, and I'll read through it just so that we make sure that you know what it is. So during the talk, what will happen is that the public data which is traditionally and always broadcast by your phones will be collected. As part of the talk, there's a demonstration of this; if you prefer not to have your data collected then put your phone into flight mode, just as you would in an airplane, and afterwards all of the data that will be collected from your phones will be deleted. So, as I said, my name's Alan Smeaton and I'm chairing this session sponsored by the Royal Irish Academy, which is Ireland's leading body of experts in the sciences and humanities, and has been for a number of centuries. As part of the Royal Irish Academy, one of the structures that we have is the domain committees, and the committee that I chair is called the engineering and computer sciences committee. We do many things for society and for science and for engineering, but one of the things we started is a lecture series, as you do in an academy, but we decided to make the lecture series slightly different to the typical lecture series that happens.
We advertised for applicants to submit a lecture and a bio in an open and well advertised competition. The shortlisted applicants (and I'm hearing a lot of echo on the sound here), the shortlisted applicants were then interviewed and gave mock presentations, so it was a two-stage evaluation process. The winner of that selection process was the lecture that we're going to hear this morning, and that lecture is entitled show me your data and I'll tell you who you are. It's given by Dr Brian Mac Namee, who's a lecturer in computer science at University College Dublin and a funded investigator in the Insight Centre for Data Analytics, sponsored by Science Foundation Ireland. Brian gave that lecture as a kick-off in Academy House on Dawson Street late last year, and we used Eventbrite to track participants, just to keep a hold of numbers, and that event was sold out. Part of the lecture series was that not only would Brian give the lecture in Academy House, but he would then take it on tour, so he took that same lecture and he gave it in Derry and it was sold out, and he gave it in Galway and it was sold out, and he gave it in Cork and it was sold out, and he gave it in Limerick and it was sold out; and when I say sold out, it means Eventbrite reached its capacity, the room capacity. So, when we saw that the data summit was coming up and the theme of the data summit resonated very much with Brian's talk, we proposed to the organisers that he give the homecoming event back in Dublin, which is why we have managed to get this slot in the agenda. So, Happy Bloomsday everybody, and I'd like Brian to give his talk on show me your data and I'll tell you who you are, and I'd like you to welcome him up to the podium. So, thanks very much for that introduction, Alan, and thanks to everybody for coming along.
So, as Alan said, I've delivered this talk over the last year or so around the country, and the goal of it, I suppose, on the back of the RIA, has been to try and sort of illuminate a little bit and show people what data we create and then what people do with that data and can do with that data. So, I'll go through this here and I'll start with a question, just to have hands in the air: does anyone in the room not have a smartphone with them in their pocket? I see no hands in the air. That's kind of interesting; as I've gone to more technical audiences, the likelihood of someone saying yes, I find, is higher. So, in the very technical audiences, some people will have reverted back to their old Nokia handset for one reason or another. I don't know if that's a hipster trend or whether it's because they want to opt out from data. But the fascinating thing is we're now all kind of carrying these smartphone devices with us, and those smartphone devices that we carry around are phenomenal data generators. So, I understand there's about 700 people at the data summit; since yesterday morning when people came along, up till about now, just by carrying those phones around in our pockets and interacting with those phones, we've generated about a gigabyte of data. Now, if anyone was at David Bray's talk yesterday, he was talking about petabytes and exabytes and yottabytes, and a gigabyte sounds a bit measly compared to that. But if you were to write down a gigabyte's worth of text, that would amount to not just one set of the George R. R. Martin books that Game of Thrones is based on, but about 30 sets of the full cycle of books that the series is based on. I used to do that example with phone books, and I realised nobody understood what a phone book was or what that meant anymore. And that's kind of what we're thinking about. So we're generating this, and that's just from the people at the summit over yesterday and this morning.
We're generating this phenomenal amount of data and it's worth stopping to think a little bit about what that is. And it's really an amalgamation of all the interactions we have on our device and from carrying those devices around with us. So we still use phones to make phone calls and to send text messages. And while the content, let's say of the phone calls in particular, isn't necessarily stored anywhere, metadata about that is. So metadata is a thing that people got very excited about when the Edward Snowden releases came out. And it's basically just data about the fact that something happened. So in this case, whenever you place a call on your mobile, a little piece of data arrives into a database somewhere to say that you, given your phone number, made a call to somebody, given their phone number, at a particular time, maybe in a particular place, maybe how long that call was, maybe how much it cost you, and various other bits and pieces that you might be able to record about the fact that that event happened. One of the speakers mentioned yesterday, I think, the 50 million emails per hour or per minute, or something phenomenal like that, that get generated. We all send lots and lots of emails on our phones and that generates a phenomenal amount of data. The more media-savvy amongst us maybe took videos or Instagrams or Snapchats as they moved around the summit, and all of that media data, so video and images, contributes in a big way to the overall collection of data that we get. Obviously then we're moving away from maybe making calls on our phones to using all these platforms like Twitter and Facebook, Snapchat, WhatsApp. There's probably a couple of people who maybe still use Foursquare for check-ins. I think the world of Foursquare check-ins is kind of diminishing a little bit; as I've gone around with the talks, we've seen less and less Foursquare.
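The kind of call metadata described above can be sketched in a few lines. This is only an illustration of the idea: the field names and values below are invented for the example, not any operator's actual schema.

```python
from dataclasses import dataclass

# A hypothetical call-detail record (CDR): metadata *about* a call,
# never its content. All field names and values here are illustrative.
@dataclass
class CallRecord:
    caller: str        # calling number
    callee: str        # called number
    started_at: str    # when the call was placed (ISO 8601)
    duration_s: int    # how long the call lasted, in seconds
    cell_id: str       # network cell that handled it (an approximate place)
    cost_cents: int    # what the call cost

# One event: a single call generates one small row like this somewhere.
record = CallRecord("+353851234567", "+353861111111",
                    "2017-06-16T09:30:00Z", 142, "DUB-CCD-014", 12)
```

Each individual record looks harmless; it is the accumulation of millions of them that builds the picture the talk describes.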
Maybe a couple of fit people in the room went for a run and ran up and down the Liffey on their way in here and maybe used an app like Runkeeper or the Garmin app. So there's lots of similar apps like that that would generate lots and lots of interesting data about where you went, how fast you went, maybe your heart rate and various other bits and pieces. So all of those are kind of interesting active data generation that we do. Referring back to metadata, also if you do a Google search or you visit a web page, all of those contribute little bits and pieces of metadata at the same time. And then there's a whole series of implicit data collection that goes on. So maybe it's not that surprising to know that if I send a tweet or I send an email, a little data trail gets generated off the back of that. But just by carrying around a smart device like your phone, there's huge potential to capture other data. So for example, as you move around this building here, there's lots of Wi-Fi routers, and we'll talk a little bit more about those in a few minutes. Your phone is in a constant conversation with those, and you connect to some, you don't connect to others. There's an opportunity for data collection there. As you move around the city, the mobile phone network that you connect to, the various masts there: if you connect to a mast and then you move to a different mast, that's all opportunity for data collection again. And we tend to see that arise sometimes with the Guards when they're investigating various crimes and they try to find out where somebody was; they often refer back to that data. And then obviously we can just collect that data ourselves, so your phone has the capacity to collect location breadcrumbs. Some apps do that, and you kind of wonder sometimes why they're doing that, or you can actually collect that yourself. So I run a little service on my phone, and I've been doing it for about the last seven years, called OpenPaths.
That records little breadcrumbs of everywhere I go, and then I can see that data and I can never quite figure out what to do with it. But I have seven years of everywhere I've been, and someday I'll come up with some amazing plan for that. So they're a set, and maybe it's just a subset, of the different things that contribute to that one gigabyte of data that we've all generated by moving around the summit carrying these smartphones in our pockets. And the interesting thing about all that is that all of those little bits and pieces of data contribute to this digital you, essentially. So the trails that we see from those little bits of data, the little kind of indicators, little clues to who you are and what you do, all surround you in these databases that are spread out all around the world to give this digital picture of what you like, what you do, who you are and where you go. And one thing that's really interesting is that's just the phone, right? The data that I'm talking about is really just data that we can generate from that one device that we carry in our pockets. The Internet of Things and wearable devices and smart devices is exploding that again. So I have a Fitbit on my wrist, and there's a selection of all the other activity trackers that are out there. We have a running group; so I'm a lecturer in UCD, we have a running group from the computer science department, and there's enough computing power strapped to various parts of that running group to launch, I'd say, a dozen spaceships, all tracking different bits and pieces of what we do. The extreme of that: I have one particular colleague, and some people will know who it is, who has these socks. These socks are embedded with sensors on the soles of the feet. Those sensors record pressure information for every single foot strike as he runs around UCD. And then he can sit and analyse that data, I guess a little bit like my location data.
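The location-breadcrumb idea mentioned above (in the spirit of OpenPaths) can be sketched like this. The haversine distance formula is standard; the logger, the coordinates and everything else here are made up purely for illustration, not how any real app stores its data.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(h))

# A toy breadcrumb log: one (timestamp, lat, lon) tuple per sample.
breadcrumbs = []

def record_breadcrumb(timestamp, lat, lon):
    breadcrumbs.append((timestamp, lat, lon))

# Two approximate points in Dublin: the Convention Centre, then UCD.
record_breadcrumb("2017-06-16T09:00:00Z", 53.3478, -6.2397)
record_breadcrumb("2017-06-16T10:30:00Z", 53.3067, -6.2210)

# How far apart were the last two breadcrumbs?
distance = haversine_km(breadcrumbs[0][1:], breadcrumbs[1][1:])
```

Seven years of samples like these is all a full location history amounts to: a long list of timestamped points.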
I don't really know what he does with this, but maybe it tells him interesting things about his running form and his gait and maybe his likelihood to get injured. And so fitness has seen a big kind of boom in these kinds of wearable devices and potential data generators, but it's spreading out more and more and more. So I really like this: this is an Irish company who make this thing, the Drop smart scale. The Drop smart scale is a weighing scale, but it's an internet-connected weighing scale and there's an associated app. And it delivers recipes to you and then you follow those recipes along, but every time you weigh something on the scale, that generates a little bit of data. So you followed a step in the recipe, you've put some flour in a bowl, you've weighed that with the Drop scale, and Drop knows this happened. And they're building up a big interesting picture about what you eat, what you cook, how you do it, how often you do it, and that all contributes to this overall picture of you. This is one, so I had a little girl about two years ago and I bought this, and my wife doesn't let me use it. It's an internet-connected soother, so this is a soother with a little thermometer in the end of the soother, and when you give that soother to the baby, the baby sucks on the soother, the thermometer reads the baby's temperature, and I get a live, steady, real-time stream of the baby's temperature. Again, who knows what I would do with that. Neither my wife nor my baby will use the soother, so it's consigned to the bin at this stage. But it's interesting, we see more and more of that. So the kind of fitness-type devices, they were kind of the early wearable Internet of Things-type devices. Now we're seeing that creep into more and more areas; health obviously is an obvious place to go after fitness, and I think we're just going to see more and more and more of that because it's so cheap to build these things now.
And so not only do we have all that data that our mobile devices or our smartphones that we carry create, we also have this other data surrounding us from new and interesting smart devices like the Fitbits and internet-connected scales, but also all those things we've been doing for years, like our online banking, other social networks we might use, maybe smart home devices we have, like the Climote smart heating controller that I've got up in the top left there. And what I want to do now is just give you a demo; I've talked all about that data, so now let's see just how easy it is to collect a rich data set, and this refers to the demo that Alan mentioned at the beginning there. So, while everybody was filing into the room, I ran this demo, and this demo is based on the fact that as you carry your smartphone around in your pocket, that phone is constantly in a conversation with the network infrastructure in this building and any other building that you visit. So basically your phone or laptop or any other device that you have is constantly chatting away with the infrastructure to say, hello, I'm here, this is what I am, are you something that I can connect to? And in particular your phone is looking for Wi-Fi routers. So your phone wants to be always connected to Wi-Fi, so it's constantly looking out for Wi-Fi routers. And part of the protocol that underpins that infrastructure is that your phone, or other devices that you carry, sends out what are called probe requests to announce itself. And they're constantly just whizzing around in the room beside us, and as Alan mentioned when he discussed the demo, the interesting thing about those probe requests is they tell a little bit about you. What I've done is I've run a little probe request sniffer. All of that data is open, it's unencrypted, there's nothing clever, I'm not a clever hacker or anything like that; this is really easy to do.
I've run a little probe request sniffer and gathered up a set of those probe requests that are whizzing around in the room. So now I'm going to try and do a live demo of this. I felt good when Vint Cerf was saying no software ever works. So let's see how we get on with this. So I've put the code up just to show that this is a little bit of code, cobbled together; again, this is very, very easy to do. So the collection that we've done has found 586 devices nearby. So in a big modern building like this, there are all kinds of interesting devices that are announcing themselves on the network, but a large set of those devices are probably the smartphones that you are all carrying in your pockets. And what those devices do is announce their MAC address, which, although it doesn't tell me who you are or anything like that, is a relatively unique identifier for the smart device that you're carrying. So if we do this again and you're here tomorrow, I should be able to find the same number again. There are a few nuances around that, about some anonymisation that sometimes happens. But that's kind of interesting. So without anything particularly clever, I can collect a relatively interesting data set about the people who are here. So maybe if I wanted to count the number of people who are moving through the convention centre, I could use this data to do it. But the data also tells me a few extra things. And again, I think this is nice to see some of the richness that we can get from data. So from the MAC addresses, I can see some manufacturers. So if I scroll down here, this is a Samsung device. And I know it's a Samsung device because the way MAC addresses are allocated is regulated, and Samsung own a particular band of those MAC addresses. If I scroll down a little more, somewhere we'll see an Apple logo, I'm sure. Somewhere. Sometimes this works better than others. This is the live software bit. Lots of Samsungs. Somewhere in there there's an Apple device.
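The processing step described here can be sketched as follows. Actually capturing probe requests needs a Wi-Fi card in monitor mode (for example with a tool like scapy or tcpdump); this sketch instead starts from already-captured (MAC, SSID) pairs. The OUI table, MAC addresses and network names below are all invented for illustration; in reality the first three bytes of a MAC are mapped to a manufacturer via the IEEE registry.

```python
# Tiny stand-in for the IEEE OUI registry: the first three bytes of a
# MAC address identify the manufacturer. These entries are illustrative
# examples, not real registry data.
OUI_PREFIXES = {
    "F0:25:B7": "Samsung",
    "A4:5E:60": "Apple",
}

# Invented examples of captured probe requests: (device MAC, announced SSID).
captured = [
    ("F0:25:B7:12:34:56", "CCD guest"),
    ("A4:5E:60:AB:CD:EF", "Brian's home network"),
    ("F0:25:B7:12:34:56", "eduroam"),
]

def manufacturer(mac):
    """Look up the OUI prefix (first three bytes) of a MAC address."""
    return OUI_PREFIXES.get(mac.upper()[:8], "unknown")

# Counting unique MACs gives a rough count of nearby devices,
# exactly the "count people moving through the building" idea.
unique_devices = {mac for mac, _ in captured}
```

The same few lines, run against the real capture, are what produce the "586 devices nearby" figure and the manufacturer logos in the demo.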
Right, I'll give up. It's in there somewhere, but I can see the manufacturers of phones. The other thing that's really interesting, and this is the one that surprises people, is just a quirk of the protocol that underpins the Wi-Fi infrastructure. And it was great to hear Vint Cerf talking about developing the original internet and the protocols around that. Under the protocol that runs the Wi-Fi network, not only do your phones announce themselves, but in order to be a little bit more efficient, they announce the networks they usually connect to. So they say, hello, I'm here, I'd like to connect to a network. Are you Brian's home network? Because I usually connect to that, and if you were, that would make our lives very easy. So they're constantly announcing these. So again, in that public open data set that your phones are announcing, I can see little bits like this. So you can see here, CCD guest. So that device has connected before to the CCD guest network here in the convention centre. And it's announcing, hello, are you this? Because you're something I know how to connect to. And if we scroll down through the list again a little bit, we'll see a few I saw as people passed by. Here's a few that are very prolific. So this is one device announcing all of those networks that it's previously connected to. If I scroll down, you'll see one boxed in red, and you can find this, which is me. So I know this is my phone, and I've kind of collected up a few of these. So this isn't just what it's announcing today, but these are some of the networks that my phone has announced about me. And that's kind of surprising, because people don't expect that this kind of public data that's whizzing around about us is giving that kind of rich picture of what's going on, because sometimes we see pretty interesting things in this. So from my set of data here, you know, well, I've been to UCD, I've been to a Radisson Hotel, I've been to a VIP lounge.
I don't know what I was doing there. I've been to the RIA and I've been to the Porterhouse in Dublin, potentially, or I've been near those places long enough for my phone at some point to connect to those Wi-Fi networks. And we can go a little step further with this. So now I know that my phone has connected to all these things at one point or another. There's a peculiar hobby that some computer scientists have called war driving. And what war driving involves is basically opening up your laptop, driving around in your car and scanning for wireless networks. So just looking out for wireless networks, and if you find one, you record the latitude and longitude of where you found it. It's a bit like a kind of modern orienteering. And when people find those things, what they do is upload the locations of those networks to central repositories. So there's a central database online called WiGLE, and WiGLE is basically a global collection of locations of Wi-Fi networks. And what I can do with this set, and any other Wi-Fi network names that I find in this data set, is push those up to WiGLE and say, do you know where this is? So if I jump on to this tab, this is me recently, and the different Wi-Fi networks that turned up in those probe requests that my phone is constantly just broadcasting, without me asking it to or knowing anything about it, pushed up to WiGLE to say, do you know where these things are? And for a reasonable subset of them, they are included in that data set that WiGLE has. So you can see here, while I'm in Dublin a good bit, the convention centre is probably in there. I grew up in Naas in Kildare, so here's me, that's me obviously in the VIP lounge down in Kildare, which is great. And I was in Galway recently enough, so there are some networks over in Galway.
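The lookup step described here can be sketched as follows. The real version queries the WiGLE service over its web API with each network name; in this sketch a small local dictionary stands in for that service, and the SSIDs and coordinates are invented for illustration.

```python
# Local stand-in for a WiGLE-style lookup: SSID -> (lat, lon).
# The entries and coordinates are made up for the example.
WIGLE_STANDIN = {
    "CCD guest": (53.3478, -6.2397),       # roughly the Convention Centre Dublin
    "Radisson-Guest": (53.3441, -6.2675),  # a hypothetical city-centre hotel
}

def locate(ssids):
    """Return the subset of announced SSIDs that we can place on a map."""
    return {s: WIGLE_STANDIN[s] for s in ssids if s in WIGLE_STANDIN}

# SSIDs harvested from one device's probe requests (invented examples).
found = locate(["CCD guest", "Brian's home network"])
```

As in the demo, only a subset of the announced networks resolves to a location, but that subset is enough to start sketching where a device has been.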
So just from that public data that my phone is broadcasting, we get sort of an interesting picture of whoever the person who owns this device is. So I don't know who it is, but from looking at maybe some interesting things about the places they've been, so I see some hotels and other bits and pieces in there, and actual locations, I can start to build up a pretty interesting picture of who that person might be. And I just throw that up; I wanted to do that demo just as an illustration of how easy it is to collect the kind of data that we're talking about, and then some of the small easy steps that you can take to enrich the picture that we see. And if I jump back across here, let's skip over the slides in case it didn't work. What we can do with all of this data that we collect, so both the kind of data that I showed in the demo and then all those other little bits and pieces of data that we're generating that I mentioned, is follow those digital footprints. So we're leaving these little digital footprints everywhere we go; we can start to follow them and we can start to do some interesting things with them. So for example, I was recently in Schiphol Airport, and they had this sign on the door as you walked in, which I've kind of blown up in text over here so you can see it properly, and basically what they're saying is they're doing exactly what I just showed in the demo, and they're doing it to see how people move around the airport. So, here's a person coming in; now can we see that same MAC address at check-in, can we see it at security, can we see it in the shops, to understand how people move around the airport and how kind of traffic and volume flows work in there. That's a really simple example of following those kinds of physical digital trails that we're leaving around the place.
Another one that we see all the time in our kind of online experiences arises from the notion of cookie pools. So people are probably familiar with cookies; we all see those warnings now on web pages to say, you know, this web page collects cookies. So cookies are just little files that web pages place on your computer when you visit a particular page, so that if you come back to that same page they can recognise that this is still you. And the thing that's become more common recently is this idea of cookie pools, where rather than just one page managing all of their own cookies, those cookies start to get shared in pools. And the upshot of this is that we see this kind of thing happen. So here's me again trying to torment my poor little girl with things that I'm going to buy online for her, so I'm going to buy her some sunglasses, and I have a look at these and I say, well, she's never going to wear those, I'll leave it. And then I'm reading an article online and bam, here is an ad again for those sunglasses. So I'm not on Amazon anymore, I'm somewhere else, but these glasses are starting to follow me around, and then I go to a different site and look, there's the glasses again, and then I go somewhere else and look, there they are, and eventually I'm going to give in. I'm going to say, right, I should buy these; something in the world is saying I should buy these, so I should give in and do it. And the reason for that is this notion of cookie pools. So although these sites that I'm visiting, in this case wired.com, don't have access to that data, what they are doing is saying, in this advertising spot, some third party can put whatever they like in there. So there's a third-party advertising network that does have access to the Amazon data, and Wired, in this case, has said, you just look after the ads, pop whatever you like in there. So now we see that these digital trails are widening out across the internet when we visit different places.
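The mechanism described above can be sketched as a toy simulation: two publisher sites embed ads from the same third-party ad network, so the network's cookie follows the visitor across both. All the class and site names here are invented for illustration; real ad networks are of course far more elaborate.

```python
import uuid

class AdNetwork:
    """Toy third-party ad network shared by several publisher sites."""

    def __init__(self):
        # cookie id -> list of sites where this visitor was seen
        self.profiles = {}

    def serve_ad(self, cookie, site):
        # First visit anywhere in the pool: set a fresh third-party cookie.
        if cookie is None:
            cookie = str(uuid.uuid4())
            self.profiles[cookie] = []
        # Every ad impression adds to the cross-site profile.
        self.profiles[cookie].append(site)
        return cookie, f"ad on {site}; profile now spans {len(self.profiles[cookie])} visits"

network = AdNetwork()
# Browse sunglasses on one site...
cookie, _ = network.serve_ad(None, "shop.example")
# ...then read an article elsewhere: the same cookie, and profile, follows.
cookie, _ = network.serve_ad(cookie, "news.example")
```

Neither publisher sees the other's data; only the ad network in the middle accumulates the cross-site trail, which is exactly why the sunglasses keep reappearing.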
So we're seeing the results of those trails in more and more different places. So they're kind of the simplest things that we can do: we can recognise, okay, there are these digital trails, and we can start to follow them and we can start to do some interesting little things with that. The more interesting things that we can do with this data, though, come from recognising patterns. And that's why I throw up this picture: we're really good at this. So here's a picture that maybe at first looks like a fairly random collection of black and white blobs, but if you think about dogs, and in particular dalmatians, at some point you might realise, okay, there's a dalmatian here in the middle. So it's maybe a dalmatian walking along. When I've done this, people see all kinds of strange things in this corner; I think that might be like one of those inkblot tests that says more about you than the picture. But we're really good at this. So we can take this relatively random collection of black and white blobs, and we can recognise the pattern that this is a picture of a dalmatian. Computers are still really bad at this, but what they are really good at is taking great big data sets, and we're not meant to see any great sense in this, but taking a great big data set like this and finding patterns in those data sets. And in particular for this, we use machine learning, and this is a bit of a shameless plug for my textbook in machine learning, which I wanted to put up there in particular because it recently got translated into Korean. So I'm told that says the same thing in Korean, but if there's anyone who speaks Korean, they can confirm that. And what machine learning does is analyse great big data sets like this and find patterns inside those data sets.
What I want to talk about is two particular things that get done with those patterns, and are done all the time with these patterns by lots and lots of different services, which are to recognise your demographics, interests and preferences, and then try to predict what you might do next. So if we look at the first one first, Twitter is really interesting in that if you sign up for a Twitter account, you give Twitter very little information about yourself. So this is the sign-up screen for Twitter. You need to give them a phone number or an email address, you need to choose a password, and you may give them a full name. You can type whatever you like in this box, it doesn't matter; you may give them your actual name. And that makes it very easy to sign up for a Twitter account and, I guess, is one of the reasons why Twitter gets so many people, but Twitter's business is based on selling ads, and selling very targeted, specific ads to specific groups of people. So here are some screenshots of going through the process of setting up an ad campaign on Twitter. And you can see here I start my campaign and then I can start to choose things like who I would like to see my ads. So what gender am I interested in advertising to, males or females? And it gets very, very detailed: what kind of interests would I like the people that I'm advertising to to have? And you can see, if we look, these are the categories that I get to choose from, and it gets very detailed, from cars to luxury cars to performance cars and lots and lots of other bits and pieces. So the interesting question then is, how do Twitter go from just your email address and phone number and maybe your name to this really rich picture of you that they can use to drive targeted advertising? Well, the way they do it is by looking for patterns in your data. So Twitter is kind of interesting. So that's the data that I provide signing up. The other data that they get are my tweets.
So this is my Twitter account, and these are my tweets. And don't forget we mentioned metadata previously: so when I'm tweeting, where I'm tweeting, maybe what devices I'm tweeting on. They also have the timeline, so for all the people that I follow, whatever those people are tweeting, that data is all available as well. That has some relation to me; I must be interested in this in some way. And then, more directly, there's the network of people who follow me and people who I follow. That's all kind of interesting; that might tell me some kind of more detailed information about who I might be. So what Twitter and lots of other people try to do is take this data and from this data infer all those other characteristics, so we can go from just my email address and my phone number out to a rich picture of who I am. And this is the second demo that I want to show you, where we'll look at this. Some people might have seen I tweeted this out, and it didn't start too well; I don't think I'm cut out for viral marketing. But as the morning went on I got more and more responses to this from people who would be involved in this demo. So basically this is a set of the Twitter handles for people who responded that they'd be interested in this demo. And what we're going to do is show how we can infer and predict the gender of these people and their interests. I just want to show you how this works. So starting with gender: if we try to look at all this data, I'm a machine learning researcher, so I would try to attack this with machine learning models and say, well, I'm going to take in the tweet content and the follower graph, that should all tell me rich stuff about somebody's gender, and I'll build a predictive model. And we tried to do that and it didn't work very well. So we have a big data set with Twitter handles and marked-up genders of who they are.
I was about to give up on it until one of my clever PhD students said, well, you know the name that most people fill in? If I was trying to guess somebody's gender, that's what I'd look at. I wouldn't go and look at their tweets and look at their network, I'd look at their name. And I said, that's a great idea, I'm going to steal that. Once we have a name, pretty much all of the central statistics offices around the world publish baby name lists. So every year the Irish Times and the newspapers always get a couple of nice articles out of this, to say, well, what are the popular baby names this year and what does that tell us. Well, we can get this data, and you can get this data going back a long way, and you can get it for pretty much any country in the world. And this data gives you lists of boys' names and girls' names. If you take the first part of the name that someone fills in on Twitter, assume that must be their first name, and then compare it to these lists, you can say: is the number higher for this name as a boys' name, or higher for this name as a girls' name, or does it not appear on either of these lists? And that turns out to be an incredibly reliable way to predict gender. So you don't have to work that hard to predict gender. What you do need to do is augment the original data set that we have with this other interesting data set that gives me richer information. So we can do that, and here are the Twitter names that I got, and if I divide them up into male and female, this is where they land. So we did okay in this, so hopefully some people will see themselves up here and they can tell whether they believe what's happened here or whether they don't. These two are left in the middle, and they're left in the middle because they didn't fill in their full names, or didn't quite fill in their full names. So JKTool99: that person filled in their name as JKTool, and obviously nobody else in Ireland has ever been christened JK.
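The baby-names trick can be sketched in a few lines. The tiny name lists and counts below are made up, standing in for the real central statistics office data; the logic, including the "unknown" fall-through when a name appears on neither list, is the point.

```python
# Toy stand-ins for CSO baby-name counts. Values are invented;
# note a name can legitimately appear on both lists.
BOYS = {"Ken": 5200, "Brian": 8900, "Alan": 6100}
GIRLS = {"Aoife": 7400, "Mary": 9800, "Brian": 12}

def predict_gender(display_name):
    """Guess gender from the first token of a Twitter display name."""
    first = display_name.split()[0] if display_name.strip() else ""
    b, g = BOYS.get(first, 0), GIRLS.get(first, 0)
    if b == g == 0:
        # Name on neither list (e.g. "JKTool"): fall back to a content model.
        return "unknown"
    return "male" if b > g else "female"
```

No machine learning is needed for the common case; the hard work was already done by whoever compiled the name statistics, which is exactly the augmentation idea described above.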
So in my CSO data JK doesn't appear: it doesn't appear in the boys' names, and it doesn't appear in the girls' names, so we don't know who that is. And KenB65 filled in their full name as KenB65, so they didn't give us any information beyond that, and the system doesn't work. We can fall back on the content-based version, where we actually look at the text, and it guesses that these go into the male category. I don't really know what the right answer is for these, but that's the guess that we make, and if KenB or JKTool are in the room they can tell us whether we're right or wrong. But that's pretty easy to do, and you can do it fairly well; it works most of the time. Predicting interests is a bit more interesting and a little bit more tricky, and here we do bring in all the data. If I want to build a model that recognises from Twitter data what kind of thing somebody is interested in, the things I can look at that might be really useful are their tweets, the timeline of all the tweets that they're reading, and then some information about their followers. So what we do in this example is take people with these Twitter handles, and again that's all public information; it's quite easy to do, it's a few lines of code to gather up that information. We suck it all in and then we train up models to recognise the topic of different tweets. Again, we've got great big data sets that are marked up as: this is a tweet about sports, this is a tweet about politics, this is a tweet about entertainment. We can train up machine learning models to take any piece of text and predict that this one is about sport. We take all the tweets for a person, put them through that model, get all the answers out, and then look at the frequency of the different categories. So if all of the tweets that you're tweeting are about sport, well, you must be really interested in sport. If they're about half and half between sport and politics, well, they're your two big interests.
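The classify-then-count step can be sketched very simply. The keyword "classifier" below is a toy stand-in for the trained machine learning model described here, and the topics and keywords are invented for illustration:

```python
from collections import Counter

# Toy stand-in for a tweet-topic classifier. In the set-up described
# above this would be a machine learning model trained on a large
# labelled corpus; a few invented keywords per topic show the idea.
KEYWORDS = {
    "sport": {"match", "goal", "team", "league"},
    "politics": {"vote", "election", "government"},
    "entertainment": {"film", "album", "series"},
}

def classify(tweet: str) -> str:
    """Assign the topic whose keywords overlap the tweet most (ties and
    keyword-free tweets fall arbitrarily on the first topic)."""
    words = set(tweet.lower().split())
    return max(KEYWORDS, key=lambda topic: len(words & KEYWORDS[topic]))

def interest_profile(tweets):
    """Classify every tweet, then report each topic's share."""
    counts = Counter(classify(t) for t in tweets)
    return {topic: n / len(tweets) for topic, n in counts.items()}

profile = interest_profile([
    "great goal by the home team",
    "the league table after that match",
    "who will win the election vote",
    "new film out this weekend",
])
print(profile)  # {'sport': 0.5, 'politics': 0.25, 'entertainment': 0.25}
```

Half the toy tweets land in sport, so sport comes out as the dominant interest, which is exactly the frequency argument made above.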
And here's a couple of examples. Here's a person, and this is how their interests break down into other, smaller interests. We've done that for everybody in this data set: we've gone through everybody there, we've sucked in all their Twitter data, we've run that through our prediction models to predict the topics of all those tweets and all that content, both the tweets that they're tweeting themselves and the tweets that they're reading from the people that they follow, and then we've built that interest profile for them. If we divide them up, they can say yes or they can say no, but the interesting thing about it is that our model, maybe it's not as good as the one that Twitter uses, but they're doing exactly the same thing. Whatever that model says you're interested in, as far as Twitter are concerned, that's what you're interested in, and as far as their generating ads and putting ads towards you goes, you're going to see ads based on these interests that they believe you have. And that's great: if I'm interested in the subset of my data set of men who are interested in society and politics, well, I can pull out that set from my data set, and this is exactly what people like Twitter are doing in order to drive that targeted advertising towards you. It's interesting, and it works, and it works pretty well. They go much beyond simply doing gender and age; actually there's a nice article about Facebook which described the 98 different things that they try to infer about Facebook users. Obviously there's a version of this that became very buzzy and interesting in the news around the company Cambridge Analytica. What Cambridge Analytica proposed to do, using very similar techniques, is essentially predict your personality. They use the OCEAN model, which describes personality along five dimensions: openness, conscientiousness, extraversion, agreeableness and neuroticism. What they believe, and what they claim, is that they can do a pretty good job of predicting your personality from Facebook and other data that they're able to gather, and the thing that they've supposedly been doing is using that to drive targeted advertising around elections. It's kind of interesting, and there's a lot of confusion about exactly what might be going on and what they might not be doing, but in some ways, for me, it's not terribly surprising: this is just the same targeted advertising that everybody else is doing, maybe with a little added interest in things like likelihood to do something in the future. So that brings me on to the last thing that I want to look at, and this is the other big area where people use machine learning and analysis of data to understand people: understanding propensities, the likelihood of somebody to do something in the future. If you've ever had a phone call from your mobile phone provider to offer you some free credit, or an upgrade, or a new model, they have flagged you as a churn risk. There's a model there that says this person is about to leave us and go to one of our competitors, so what we should do is jump in quick and offer them something nice, and that will make them stay. Mobile phone companies do this a lot; lots of other companies do it too, but mobile phone companies are particularly good at it because they have very rich data about their customers, all that data about the way that you use your phone. The way they do it is really simple. They take historical data, a data set about their customers; they pick some point in time and they say, okay, let's look at who left after that point in time, within some little time horizon, and who stayed. In this picture the blue people are the people who stayed and the orange people are the people who left. They gather up a great big data set, probably thousands and thousands of customers, and describe them
according to these two categories. Then they extract some interesting descriptions of those customers. In the mobile phone scenario you might be interested in somebody's age, their job, whether their bill has changed recently, what kind of handset they have (do they have a right-up-to-the-minute handset, or do they still have that old Nokia phone?), and the balance between making calls and using data. If we can describe people according to that kind of data, we can use a machine learning algorithm to recognise the pattern that identifies someone who's about to churn versus somebody who isn't, and we can do this pretty accurately. This is an example of the kind of simple model we might build using decision trees, one of many interesting machine learning algorithms. If we zoom into a bit of it, it's kind of obvious the sort of stuff it might pull out: if you're going outside your package you might churn; depending on how expensive your handset is you might be more or less of a churn risk. It's basically extracting patterns based on that data, and once we have those patterns we can take all of our current customers, apply this model to them, and it will tell us: these are the people who are likely to churn next month, and these are the people who aren't. And then you do or don't get a phone call on the back of that. So this is a simple example of propensity modelling. Propensity modelling has been around for a long time, the machine learning techniques for it are relatively advanced, and they work pretty well. What we're seeing now is those techniques moving out to more and more things. We can do churn; we can do the prediction of the likelihood of somebody to vote; we can do the likelihood of somebody to buy something. One of the more interesting examples of this, which again got a lot of press and was mentioned in a couple of the talks yesterday, is some of the ways this is being used in the law. There's a particular system, used in a lot of states in the US, that makes predictions about recidivism risk: if somebody's up for parole, are they likely to re-offend in the future or not? And the picture is exactly the same as it was a second ago, because these people use exactly the same techniques. They take a great big data set of historical cases and they see who are the people who were given parole and didn't re-offend, or behaved, and who are the people who did re-offend. Given that great big data set, they extract the features that might tell the difference between those two groups, and they build a model that predicts the difference between those two groups. So this is a particular system, and like I say, it's used in a number of states in the US. Some details are filled in, and then a judge at a parole hearing essentially has a laptop or an iPad with a screen very like this that they have a look at, and it says: the likelihood of future criminal activity from this person is high, and that should feed into the decision that they're making. It's very interesting: it's using exactly the same techniques that we were talking about in that churn analysis, in a more narrow way, to predict the likelihood that somebody is about to do something. Now, the way the judge is meant to use this is as one piece of information that feeds into their overall decision, but it's one example of how, because of our ability to collect and store data from lots and lots of different areas, we're seeing these types of predictive modelling approaches being used in more and more areas and being used to help more and more decisions. We see this in farming, we see it in sports, we see it in education, we see it in finance, we see it in energy; in anything you can imagine, somebody is using a predictive model, based on great big data sets that they're collecting, to help people make some kind of decision. Just to start to wrap up: one response to all that is, well, who cares, we've been doing this forever. This is a screenshot of Edmond Halley's life table from 1693, and this, to a large extent, is the first example of someone coming up with calculations of risk; it's essentially life assurance risk being calculated, and people have been doing this since 1693, a little over 300 years ago. So is anything different? To a large extent, no. We're collecting data, we're building models using that data to make predictions about what might happen in the future, and then we're using those predictions to try to help people make better decisions. The thing I think is interesting is that the data is now richer than maybe it's ever been before: with all these things that we do online, with the phones that we carry, with the wearable devices that we're putting on, there are richer data sets that allow us to do these kinds of jobs in more and more areas of our lives. I'll leave you with one example, though, just to show this isn't perfect by any stretch. I get this email from LinkedIn, and I get a version of it every six months or so; it's basically the groups that LinkedIn think I might be interested in, and I think it's a nice example of how these algorithms are good, but they're not perfect by any stretch. What's happening here is that LinkedIn's analysis algorithm is looking at all the people in that group, looking at their interests and their characteristics: they're interested in machine learning, they have jobs like I do, so I look like someone who should join this group, except for the fact that I'm not a woman. But the data doesn't really pull that out; I'm mostly just like these people, so the algorithm says, yes, you should probably join that group. So there's lots more work to do. Thanks very much for your attention; hopefully that was an interesting tour through some of the things that people are doing with data.
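The churn recipe described earlier, labelling historical customers as churned or stayed and then learning the pattern that separates the two groups, can be sketched minimally as follows. The feature names and data are invented for illustration, and the learner is deliberately reduced to a one-question "decision stump"; a real model would be a full decision tree or ensemble trained on thousands of customers.

```python
# A deliberately tiny version of the churn set-up: label historical
# customers as "churn" or "stay", then learn the single yes/no question
# (a one-node "decision stump") that best separates the two groups.
# Feature names and data are invented; a real model would be a full
# decision tree or ensemble trained on thousands of customers.

def learn_stump(customers, labels):
    """Return the boolean feature that best predicts churn, plus its
    accuracy on the training data when we predict churn == feature."""
    best_feature, best_acc = None, 0.0
    for feature in customers[0]:
        correct = sum(
            1 for row, label in zip(customers, labels)
            if (label == "churn") == row[feature]
        )
        acc = correct / len(customers)
        if acc > best_acc:
            best_feature, best_acc = feature, acc
    return best_feature, best_acc

history = [
    {"over_package": True,  "old_handset": True},
    {"over_package": True,  "old_handset": False},
    {"over_package": False, "old_handset": True},
    {"over_package": False, "old_handset": False},
]
labels = ["churn", "churn", "stay", "stay"]

feature, acc = learn_stump(history, labels)
print(feature, acc)  # over_package 1.0
```

In this toy data "going over your package" perfectly separates churners from stayers, which is the same kind of pattern the decision-tree example above pulls out; applying the learned rule to current customers is then just a lookup.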
I'll finish up there. I'd just like to thank the RIA for allowing me to do this series of talks; I've had a great time doing it, and hopefully other people have enjoyed them. I know there are flyers around, and Alan is going to talk more about this: people should apply for this for next year, it was a great time. So thanks very much for that. Do we have any Slido questions that have come in yet? No, not yet. OK, well that's fine, let's make it informal then. Yes. Hi, that was a great talk, thank you very much. I was just thinking about the example you gave about the law, the recidivism scoring mechanism. How do we prevent ourselves from over-relying on that? You kind of alluded to it: the judge sits there, and he's supposed to use it as just one piece of information, but it's very easy to trust data and say, well look, I don't want to use my own personal opinion, there's data here that tells me what I've got to do, I'll take that decision because it's easy. How do we prevent ourselves from doing that?
I don't know the answer to that, and I think it's a massive issue. One of the things I think is interesting around that same system: one of the states passed a law relatively recently, after a particular case, saying that the judge must not just use this score, they must use other things as well, and they need to go away and think for a little while, or something like that. So I think it's really hard, and there are lots of examples of where exactly what you describe has happened. One in particular: credit risk modelling in banks. People have been doing this for years; it's ahead of the other uses of modelling around data that we're doing, and credit risk modelling in banks is very regulated. One of the regulations is that if you apply for a loan and a model says this is your likely risk, someone looks at that output, weighs it against all the other things that they, as a banking professional, know, and makes the ultimate decision. So they're not meant to just use the output of the model. But the thing that's happened is that if you agree with the output of the model, you don't have to do anything; if you disagree with the output, there's a form you have to fill in, you've got to convince somebody, and if you turn out to be wrong, the model doesn't get blamed, you get blamed instead. The impact of that is that nobody disagrees with the model. It's a huge problem we have, because people put so much confidence into these systems. So the kind of thing people are trying to do, and you'll hear the phrases explainable AI and explainable modelling a lot, is to move away from that pretty blunt prediction of 0 to 10, this person is a 10, and try to enrich the outputs from the models to say: this person is a 10, here's why this person is a 10, and here are the things that, had they been a bit different, might have made this person a 2 or a 3, or whatever it might be. So it's a huge problem, and I think we're going to see it more and more. I think the credit scoring example is a good example of where people have tried to fix it by putting some regulation around it, but then the other structures that build up around that just make it easier to agree with the model, so that's what people do. There's massive work to be done there, and we're not close to having figured it out yet. One of the things that we saw this morning, in the large auditorium in the panel discussion, was Mark Little and Vincent at one point talking about bias in predictive models, whether it's churn or judges or whatever. Mark was making the point that the algorithm which is used to run this is coded by humans, and therefore there's the danger that there might be human biases built into the algorithms, and I was almost jumping up and down and saying no, it's not just the algorithms that introduce the bias; more and more it's the data that has the bias. The classic example, which has come up repeatedly in the last two days and been mentioned on stage many times, is the FaceApp application, which sort of beautifies your face by making you more white. So, picking up on your point: how much of this inherent bias is not algorithmic, because the algorithm is public, it's open, you can procure it and read it and understand how it works, but is in the data that has those inherent biases? Completely, exactly. The model, like you said, is basically a dumb thing, a pattern recogniser that looks through the data set and says, what are the indicators that go from this to this? Whatever biases are in those data sets are going to turn up in the models that we learn, certainly at the moment, unless we do better things when we build them. The judge example is a good one: some studies of those models have pulled out pretty serious biases that seem to be baked into them, and those biases come from the fact that it's
particularly around African American people, who are much more likely to be predicted as re-offenders by that model. The explanation given in some of the analysis is that, in the data sets they're training it on, because of biases within the policing systems, those people are more likely to have been in parole hearings in the first place, and that's what ends up being baked in; it's baked in because there's more of them in the data. So how much of explainable AI is not just about explaining the factors which influenced whether you got that credit rating, or whether you got parole or not, but about explaining the data that drove that? So, beyond giving the data that drove it, the thing that I think has real potential is saying: if we changed this input from this to this, it would have given you a different answer. So trying to let people explore. Take the parole example: if this person had been a little bit younger, or a little bit older, what would have made the difference? Pulling out the things that would have pushed this over the line from re-offend to non-re-offend, and illustrating those to people, to help them understand: well, these are some of the biases that are maybe in there. So I think that's a big thing. Definitely look at the data sets, but it becomes very difficult to communicate those big data sets to people, because the whole point of using these models, I guess, is that you take big data sets, data sets with lots and lots of columns essentially, and it does become tricky to pull out, you know, well, this is what the data that this model was trained on looks like. One part of it that has maybe become more and more important, and again the recidivism prediction model is a good example of this, is making those data sets public in some ways. Obviously you can't make them completely public, but you can make them public in some ways, so people are aware of those things. The big criticism that's been made of that particular model is that there's a private company who have built it; they tell a little bit about how it works, but it's a closed box, and you can't get at either the data or the model to any large extent. So, another deployment of that? Yes, and I'll get to you for the second one. One last point: that judge's dashboard isn't just a yes or no on a scale; it could be some sliders, so he or she could play with, well, what if that person was younger, and slide it down, and each time you played with this you'd slice and dice the data set that's used to make that prediction differently. Yeah, I think that, and even making suggestions, saying, well, you know, if this slider was moved down to make the person younger, that would have made a huge difference, so you should think about that a little bit. Hi Brian, thanks a million for a great talk. What I wanted to ask you about is the responses that you got from the public when you were out giving this talk. I suppose two things: one, I'm just interested in the key concerns and issues that came up when you were around the country giving the talk, and secondly, did any feedback that you got when you were out giving the public talk come back to the research group and change anything that you were doing there? So, to do the last bit first: that bit about the explanations and things, we're doing a little bit on that now, and that's come directly from discussions around this talk in particular. I think one bit of feedback that's really telling is, 'I didn't know that was happening', 'I didn't know this data was being collected', and not in a good or a bad way, but just, you know, it's good to know. I always give the example, and I should know better, but I got caught out on this: when I first got my Netflix subscription I didn't realise it was connected to my Facebook account, and my brother rang me up about something of mine that he could see, and that kind of
thing is a good example. I'm a computer scientist, I do all this, I should know better than to do that. But I think what's been really interesting is that people just aren't aware of how many little bits of data are being generated, and how some of that data can be connected together to do interesting things. Again, it's not a good or a bad thing; it's just a thing that people don't know is going on. So it's been really great when people say, oh wow, that's great, you can do that. And you can go and look: people like Google and Facebook and Twitter are doing a better and better job of providing tools for you to look around at the data that they collect about you, and that can be a really interesting thing to do. With the Google one in particular, you can look at your search timeline and see, basically for as long as you've been using Google and have been signed in with an account, every search term you've ever typed into Google. That can be a fun thing to look at over time; you can say, oh look, here's where I changed job. I've gone and looked around in some of those tools, and I think I had some of them on the last slide; the links are pretty easy to find. I guess the trickier question, and the one where I run out of steam a little bit, is this: I'm a computer scientist, I do machine learning, and, if anyone remembers the line in Jurassic Park about asking whether we could do it rather than whether we should do it, I land on the 'could' side rather than the 'should' side, because you see good problems: we could do that, we could do that, and that's great. Lots of people do end up in the 'well, is this okay, should we be doing it, should we not be doing it' questions, and I think there are loads of interesting questions there. I think there are as many good ones as bad ones, as many positive potentials as negative potentials, and that's really interesting stuff. A key thing with that is that it's not just me who should be asking a lot of those questions. We're scientists and engineers, and we're going to want to build things, and that's good, and we're seeing that the research centres around are putting the effort into looking at those other questions as well. I suppose the thing that I've been trying to get away from is that maybe the first time I gave this talk we ended up at a 'this is terrible and it shouldn't be happening' kind of end point. I think that's an easy place to end up, where you say, oh, this is terrible, just turn everything off, and I think that's the wrong answer, because there are as many good things we can do with all this data as there are bad things that might happen with it. Anybody else? No? Okay, well, we're actually pretty close to time. So, finishing up, once again I'd like to thank Brian for his presentation. Oh, one quick one. Oh yes, sorry, I missed that; we were right on time. I'm Laurens Cerulus, I'm a reporter at Politico in Brussels. I'm sorry I came in late, so I missed a lot of this, but I was wondering if you had any recommendations. For instance, in the European Parliament there's a discussion about to what extent algorithms can be put into some kind of tool that looks at how they generate results: basically keeping trade secrets in algorithms for companies who want to make money out of them, while at the same time having a certain degree of transparency for the public good. Is this feasible, do you think? Can you scrutinise what an algorithm is doing without going open source? So I think there are two parts to that. Breaking it down: if you have trained up a model, can you understand what it is that that model has learned? And then the second part, I guess, is, if you could do that, can you still protect trade secrets and commercial advantage? The first part is actually really tricky. Again, we do work on this, and there's lots of work going on on it. I showed an example of a tree model, and that's
kind of the simplest thing you can do. People still use lots of decision trees, and one of the reasons they use them is because they make it easy to interrogate the model and say, okay, the model gave us this answer, and here's why: because you were young. Those models can get a bit big and a bit complicated, but they're still relatively interpretable. The problem we see on that side of the question is that as you want to make more and more accurate models that deal with bigger, messier data sets, you move away from that. One simple step you take is to go from one tree to a thousand trees, all slightly different, which all vote together on what the answer should be, and that becomes your final prediction. Even that small step that we understand, moving from one tree to a thousand of them, makes understanding what the model has learned quite difficult. And we keep moving to more and more complicated models: there's that step from trees to ensembles, or groups of trees, and then the big thing that everybody does now is deep learning, neural networks, which are doing the same kind of job but whose internals are very hard to interrogate. So that first part of the question is still really hard. There are lots of people doing good work on it, lots of people trying, and there are nice techniques coming up, like the sliders we talked about here to explore what might happen, or nice visualisations of what the model looks like it's learning, but that first part of the question is really, really tricky, and there are lots of interesting things left to do on it. The second part of the question I don't know a whole lot about. I know there are people working on techniques that protect models while still enabling people to use them to build models and learn patterns; I guess when you can do that well, then maybe you're able to interrogate the model but still keep some kind of trade secrets. Maybe one thing that we do a fair amount of, which does work, and which maybe answers a little of both questions: if we train, let's say, a very complicated model, like my thousand trees, what I can then do is make a simple version of it. That simple version won't capture exactly what the more complicated one does, but maybe it gives me a decent, an okay, approximation. I do this a lot if I'm trying to explain a model, if I'm trying to convince someone my model works. I'll convince them by saying, look, we tested it on some test data and it's 99% accurate, but then I also want to give them 'and here's the kind of thing that it's doing', because people need to believe it as well as it being as accurate as possible. So from the complicated model we boil it back down to a simpler version that won't be nearly as accurate, but hopefully captures an amount of what the bigger, more complicated model does, and that allows us to explain the kind of thing that's going on in there. Maybe that's an interesting step for the trade secrets piece. I guess secret recipes would be the equivalent: maybe Coca-Cola is made of roughly these five ingredients; maybe there's an equivalent there. But those are definitely two big, interesting questions. If I go back to the banking scenario: one of the things that's common in banking, going back to that idea of predicting whether you're going to pay back a loan or not, which people have been doing for years and which is very regulated, is that one of the artifacts of that regulation is that people in that scenario don't use these, let's say, more accurate modelling techniques that give us bigger or more complicated models. They use decision trees and logistic regression models that are easy to interrogate. So that's one answer, and that solution, don't use the more accurate models, I guess doesn't work in lots of places. Thanks very much, we have to leave it at that; let's thank Brian for his presentation. Don't
forget the competition. As I said at the top of the session, this was the homecoming event for Brian; he's given this lecture around the country, literally around the country. The lecture series has been successful, and we're going to run it again for 2018, so submissions for the Royal Irish Academy Engineering and Computer Sciences lecture, of which Brian has just given this year's, are now open. If you think, or if you know somebody who thinks, that they can give a lecture like this, then please do consider applying. There are yellow flyers on the tables, and there's also information on the Royal Irish Academy website, ria.ie. The topic doesn't have to be about data science; it can be any topic of your choosing within the engineering and computing domain. The Data Summit event suited the topic of Brian's talk, which is why we've been able to bring it back here, and it's a Dublin lecture followed by a tour of the country, ending up in Dublin again. The 2018 series, which is described in the yellow flyer there, is sponsored by Hewlett Packard Enterprise in Galway, and we're very grateful to them for their sponsorship. We now have tea and coffee on the ground floor, and we resume with the session of your choice at 11.15. Thank you.