So thank you very much. I'm delighted to be back here in New Zealand. I was first in New Zealand in 2014, when I was a Canterbury Fellow down at the University of Canterbury in Christchurch for about two months, and I suspect my getting to know James Smith, who is down there, was part of the reason I've been invited here today, because I understand that he's been part of this community for quite a long time. So, a little bit about what I'll be talking about today. I've got about an hour of your time, and I want to talk about this idea of knowledge machines, and hopefully it will all become a little clearer as we go along what that means, since it's a bit of an opaque term, possibly. Before I really get started, let me give a bit of background on myself and establish my street cred for being at an organization like this. As Andy pointed out, the Oxford Internet Institute is an interesting multidisciplinary place. Let me get my glass of water; I am definitely a little bit dried out and jet-lagged from traveling so far. We're a multidisciplinary department. Don't let the name Institute throw you: we're actually a department in the Social Sciences Division at the University of Oxford. Every time our director says how many faculty we have, the number seems to grow, and so I don't know what the actual number is; it's somewhere between 45 and 50 these days, at various stages of their careers. We've also got about 50 master's students and about 25 doctoral students at any given time, trying to do this interesting thing of understanding life online. And if you know Oxford at all, Oxford's a very slow-moving beast of an institution in some ways. It's been around for over 800 years. It doesn't make rash decisions. And so the fact that we've got an Internet Institute at the University of Oxford is quite a forward-thinking thing in some ways. We've actually been there for 15 years now.
We were established in 2001, and I've been there for 10 of those years. As you can see, I'm Professor of Social Informatics, and you will be forgiven for not knowing what social informatics is; on my next slide I will explain it to you, hopefully in a way that you'll remember. I'm also Director of Graduate Studies there at the OII. I'm also a Faculty Fellow at the Alan Turing Institute. I won't have much time to talk about the Alan Turing Institute today, but if you want to catch me in one of the breaks or something, I can talk about what's going on with data science in the UK, because that's what the Alan Turing Institute is all about: advancing data science not only in the UK but around the world. So, social informatics. What is this? You will be forgiven if you haven't run across this term that much before; there's not that many of us in the world. My own degree is in social informatics and information science. I got my PhD at the School of Library and Information Science at Indiana in Bloomington and studied with the person who was considered the father of social informatics, Rob Kling. He unfortunately died a number of years ago, but he really influenced the way I think about the world, and this word is what I use to explain to people in a very simple way what social informatics means. Do I have a pointer on here? There. So, socio-technical, and I spell it with a hyphen. This actually caused quite a few fights with my editor for the book that was just mentioned, because they didn't like spelling it with a hyphen and they kept trying to take it out. I said no, it's important.
It matters, and you'll see why. Think of the word socio-technical: the left side is obviously socio, the social, the people, the things that are in the world around us, and technical is the kinds of objects and machines and data and computers that we think about. And I have argued in this article, if you'd like to look it up, that what social informatics does is examine the hyphen in socio-technical, so how the people are connected to the technology. I'm really interested in that connection between the two, and this differs from some other areas that try to understand socio-technical systems or the relationships between people and technology. Any of you who have an academic background in this area called science and technology studies: they tend to focus very heavily on the socio and then bring in the technical later. Some of you who might be from computer science backgrounds tend to start with the technical and then bring in people at the end to make sure that things work and do user interface design. But when I go into a new project, I try to understand it without predetermining whether the social or the technical is more important in any given situation. So I look at that hyphen. Now, one way to think about how one looks at this hyphen, and to think about the complications of putting people and technologies together, is a picture like this. So this is Los Angeles, a picture borrowed from Flickr Creative Commons, so it's legal to put it up here. If you think about the affordances, affordances being the kinds of things that technology lets you do, what are the affordances of cars and roads? Well, they let you do certain kinds of things and build certain kinds of configurations. In the case of Los Angeles, for any of you who have been there, they've got giant freeways with lots of cars rushing around all the time, and people drive every place. I used to live in California. You drive absolutely everywhere.
There's a movie from many, many years ago called L.A. Story, and one of the things that made me laugh when I saw it, this was back in the 80s, Steve Martin's in it, is that he walks out of his house and he gets in his car and he drives about 20 feet forward and gets out of the car and walks down to his neighbor's house, because in Los Angeles you drive absolutely everywhere, even if it's just going over to the neighbors. So this set of socio-technical choices that has been set up in Los Angeles builds this big, massive ecosystem of Angelenos getting around. This is a picture right outside my office in Oxford. For those of you who are library aficionados, the Bodleian Libraries are right down here. My office is just over there to the left of this. This is Balliol College. Now, this is all the same technical stuff, right? You've got a road, you've got some cars, you've got people trying to get places. But the socio-technical configuration is actually different in a place like Oxford. This college has been here for over 800 years. The University of Oxford has been here for over 800 years, and we have made a different set of choices in how we get people from place to place, and that results in a different socio-technical configuration. Even though the technology is essentially identical, we don't have the same way of going about things in a place like Oxford. So when you look at technical systems, you need to think about the kinds of choices people are making. Now, one of the things I'm always encouraging people not to do is engage in this sort of naive technological determinism. If you read Wired magazine, they're really good at being technological determinists in a naive way, right?
Technology X causes Y: you know, the internet makes people stupid, or the internet makes people connect to each other, whatever you want to say that technology causes. Instead you need to think of it in a more nuanced way, which is that technology allows people to make certain choices, and that we need to understand those choices. And they're not unlimited choices. The technology has certain limits on what it's able to do, but you can also ask new things of that technology. And what I've been trying to do over the last decade, 15 years or longer, is to try to understand, as in my book that I'll be talking about quite a lot today, Knowledge Machines, how the choices that researchers have been making about using technologies are influencing the kinds of research that gets done, the kinds of questions that can be asked, and the kinds of things we can know about the world. So this is a picture from one of my articles from a number of years ago with my colleague Ralph Schroeder. I'll mention Ralph a couple of times; he and I write all kinds of stuff together. The interesting thing about Ralph: you won't get to meet him here. He'll be in Christchurch later this spring. He's a visitor there in the spring.
So if you happen to be down in Christchurch, you might want to meet Ralph. If he were here, it would be quite entertaining, because when we speak together we don't agree on anything. We disagree on practically everything we ever talk about, but we work really well together because of that, because we have to hone our arguments to be able to convince the other one to get a word down on the page. We have to really think through every single word until we've got a sentence that we are both willing to put on the page. So our articles, and our book particularly, are well thought through in terms of what we've actually said, rather than being lazy about it. So we disagree on things. But here's one of our images from this 2009 paper, which is part of a bigger image that you'll see in just a second, and which says: here's a naive view of traditional research. This is very simplified and oversimplistic. You've got researchers down here doing traditional research. They publish stuff. Libraries come into this because they interact with publishers through the journals that they're collecting. Academics, who are also these people, come back into it when they read the journals, and that feeds back into the research results. You get this very simple little feedback loop. And then there's also this path out here into the larger public understanding of science, but it's very tightly controlled, right? It's through science media, popular science outlets that you're familiar with, some of the media programs, and so forth. And this is an old view and overly simplistic, but it's something that we relied on to build this bigger picture, which is a more modern view of what the information ecosystem looks like. And I'll talk about a couple of different areas of this diagram today. So I'll talk about this top left here in just a few minutes, and then later on in the talk I'll talk about this area up on the top right. But essentially we've got a much more complex space, and maybe it was always more
complex than we were making it out to be, but it's certainly more complex today, when you've got all these different players involved in terms of different ways of accessing scholarly communication and talking with each other, whether it's blogs or whether it's Twitter; some people might be tweeting about this right now. And certainly it's a more complex area in terms of who's able to find and see this stuff that we're writing about as academics. So this is the book, Knowledge Machines: Digital Transformations in the Sciences and Humanities. It's been out since 2015, and many of the things I'll talk about today, not everything, but many of the things, are in that book. So if you'd like to pick up a copy, you can. I got very lucky on the day I grabbed the screenshot: we were the number one new release on Amazon, which was quite exciting. It's not true anymore; that was quite a long time ago. So what about this top left corner of this diagram, this area that we call e-research? How many of you have even heard the word e-research?
Quite a few people. Pretty good, actually, better than some audiences I do this with. E-research used to be quite common parlance, particularly in the UK. There was a lot of investment in this area of e-science and e-research, and what they called cyberinfrastructure in the US, in the first decade of the 21st century, and we were heavily engaged in this, Ralph and I and my colleague Bill Dutton and some others at Oxford and around the UK, in the thing called the UK e-science and social science program. What was happening during the e-science era was that they were trying to embed computational approaches to doing science, social science, and later the humanities, and some of the ways they were talking about doing that were grid computing, federated data sets, cloud computing, and so forth, and building these tools that would let distributed teams of researchers share tools and data to be able to do new kinds of research. So this is our definition from the book: e-research is research using digital tools and data, the things that many of you here are interested in, for the distributed and collaborative, so it's spread out and it's about working together, not just working on your own, production of knowledge, so making new claims about the world and making them in a way that can be shared and maintained over many years. Now again, this is from the book, so the data ends in 2012, because that's when we were done writing the book. Publishers are so slow: it took three years from when we finished the book for it actually to appear in print. Quite frustrating, but anyway. You can see that there are a number of different terms here. It's not just these terms, it's a compilation of terms; you can see the whole search in the book if you're interested in that. It covers a bunch of different collaborative computing topics, and you can see there's this inflection point starting here around 2003, where a lot of stuff really started to take
off, and this is not accidental. This is because of the funding efforts that were going on in the UK, in the US, in Australia, there was a bit in New Zealand, and also some other places around the world that were spending new money on building these e-research infrastructures and getting people to use grid computing and cloud computing and everything else. And so we saw this big take-off in 2003, and then it peaks up here around 2010. Now, this slight drop-off actually is when all the funding stopped, and people didn't necessarily stop doing these things, but they stopped talking about them in the ways that they'd been talking about them for a number of years. Now, I've left a little bit of space over here because of this bottom line here, big data, which looks pretty small up through 2011, with this slight tick in 2010. Nobody knew about big data. Nobody talked about big data. Nobody cared about big data. I've looked at the later years for big data references in academic publications, and it takes off like mad. It's even bigger if you look in the popular media, and so I'll talk a little bit about big data today and what that means for research. Now, data is an interesting thing to some people, but if you had a conference on data up till 2011, you got a room like this: a couple of bored people checking their mobile phones, not really caring that much. I was at meetings about data, and this is what they looked like. Suddenly, in 2011, and I can't tell you exactly what happened, everybody got interested in big data. It had to do with things like the Snowden revelations; it had to do with the increasing awareness of the kinds of things that were being collected about us online. But all of a sudden big data became sexy, and when you have conferences now about big data, you get packed-out rooms.
This was at the ICA in 2012, a session I ran on big data, and people were out in the hallway trying to fight their way into the room, because suddenly, when it was big data, everybody cared. And this gives us a new opportunity to think about what we can do with data in different disciplines. Now, what I'm going to do for the next half an hour or so is give you a couple of examples that come from our book and our other work, and some of our other projects that aren't in the book, that speak to the fact that, in an increasingly interdisciplinary world (as I said at the beginning, I live in an interdisciplinary world at the OII), like it or not, disciplines still matter. Most people are trained into thinking about the world in disciplinary ways. They learn what the rules of the world are very early on. I remember once we did a project where we worked with some second-year undergraduates at a college in London, and these history students had taken on not only the language but the biases of their faculty members by their second year as undergraduates. They had somehow gotten it into their heads that digital things were dirty, that you should hide the fact that you've done anything digital. We were asking these students, and we said, well, you know, do you use Wikipedia? And they sort of, you know, look around: yeah, but don't tell anybody. And when we asked about how they would build their lists of references at the end of their papers, we said, well, do you indicate that things came from a digital resource? And they said, well, you know, I get all my stuff through digital resources, but I would never do that. If I've built my list of references and it seems like there are too many URLs,
I just delete some until it looks like I've done real work. Because somehow sitting at the computer isn't real work. So these kinds of disciplinary norms are trained into us quite early. We see this again and again when we look at some of these examples that I'll talk about today. So I've got about two examples from each of these areas: sciences, social sciences, humanities, and arts. These are very brief versions of all of these. I could talk for an hour about each of them, but I won't have a Fidel Castro sort of afternoon here and keep you until midnight talking about the different cases. I'll give you just a brief overview of some of these things. So, two quick examples from the sciences. Now, before I start on these, I want to talk a little bit about this scientific styles work that we rely on in the book, which is based on Ian Hacking, who's a historian of science, and he based this on an earlier philosopher of science called Crombie. He argues that there are essentially six, possibly more, but you know, it's debatable, scientific styles that you can break most science down into. This doesn't necessarily include everything in the humanities, but we'll argue that certainly some things in the humanities can also fit into this. And I won't go into these in great detail; there are plenty of sources that talk about these. But you have some things like taxonomic, which just means sorting things out. This is what a lot of people in libraries and archives do: sort things out and build the taxonomies and keep track of things. And you'll see a couple of examples of that in science today. Also probabilistic, which is pretty self-explanatory, and modeling, and so forth.
So we'll come back to a couple of these as we go. So my first example is from a project I did with marine biologists, and I brought this example today even though the project now goes back over a decade, and I think it is of interest in New Zealand in particular, because actually for some of the scientists we worked with, their research field stations were in Kaikoura, working with dolphins. Now, any of you who are familiar with marine biology might know that one of the ways that marine mammals are identified is through photographs of various identifying features. So for whales it's the tail, well, for humpback whales at least it's the tail; for blue whales it's patterns on the side; for dolphins it's the nicks and notches in the dorsal fin. And you can take a picture of a humpback whale, for instance; it's like a fingerprint if you get the tail at the right angle, and then you can identify that whale in the future. So here's an example of a humpback whale, and this one you can see is the same animal, obviously, right? This one actually is pretty obvious. Anyone see the obvious feature on this one? These little dots there are really a dead giveaway on this particular one. But you also look at the shape, the coloration, the nicks and notches on the trailing edge. This is the same animal. It was viewed in two different locations, in 2004 and 2005, by two different people in different places around the Pacific Ocean, and this photo identification lets you say with certainty that this is the same animal over a period of time. Now, the humpback whales won't ever get fewer features. They might get more, because, you know, they'll get bitten by stuff.
They'll get hit by boats and so forth. But they won't get fewer, so you can do this quite effectively. Now, one of the things I was interested in in this project was understanding the practices of photo identification as they switched from film to digital. So they had been using film cameras up until 2002 to do all this work, and between 2002 and 2003 essentially the entire discipline switched to digital cameras in very short order, as digital cameras reached a level of capability that was consistent with what they were doing. And one of the questions I was asking was, well, what is the meaning of this switch that seems on the surface quite simple? Right, a digital SLR and a film SLR look identical. They essentially work almost identically. You take the expensive lens off of one and you put it on the other one and you're good to go. Right, it shouldn't really cause that much disruption, but of course it does. Now, here's a matching technique on screen. So this is some dolphins, obviously, not whales, and they've got some algorithms that sort them out by different characteristics, and they can say, okay, I've got a new dolphin fin; let me bring up the ones that might be matches and then visually match this myself. Now, this is one of the things that computers still largely aren't all that great at, doing this kind of pattern recognition, and human brains are better at it. This is tedious work. So this is one of the matchers at a place in Washington State in the US, who was part of the SPLASH project that I was talking about, on the population levels and abundance of humpbacks, and this is her full-time job, 40 hours a week: matching whale tails to pictures of whale tails. It's quite difficult work, as you can imagine. There were four women.
It just happened to be women at the time doing this, no structural reason that had to be, but there are more women in marine biology than men. And they were four relatively recent graduates who were spending 40 hours a week doing this. They all had different strategies. They all did things like, one, I know, only parked at a two-hour meter, so she had to get up and move her car every two hours and get a bit of blood flowing. But the other interesting thing is they all used a different mental algorithm for how to match a whale tail. There wasn't a single algorithm they all used. So one would look at the trailing edge first, and one would look at the coloration patterns first, and one would look at other features first. And when I talked to the manager of this group, they said, well, yeah, it's very difficult to predict. I just have to throw people in and see if they're any good at it. Some people are terrible and they never do anything and they can't match anything. Others just take to it like that, and they can match stuff really quickly because they've got some sort of built-in ability to see these patterns. Now, these are all printed out, but they're digital images.
They've taken digital images with digital cameras, they've printed them out in these books, they've got bunches of these things, and what they're doing here, and this is back to our styles of science, is taxonomy. They're taking all the pictures from individual whale biologists, they're sorting the photographs, building this database, and then you can do science with it. Then you can ask questions. So this is the first step in being able to do science, and one of the basic scientific questions was: how many humpbacks live in the Pacific Ocean? Because up until this time they had no idea. The SPLASH project included pictures from 500 different groups of people around the Pacific Rim, and they were sending photographs to four central locations, which then sent them to this one central location to do matching across the Pacific Ocean. Before, you could have matched individual animals that came to your same spot, where you took pictures of them year on year, but you couldn't know where they were the rest of the time. So this was printed in National Geographic magazine in 2007, and they were finally able to answer this very basic question of science, which is about how many humpbacks live in the Pacific Ocean. The answer is about 15,000 to 20,000, based on this work. And now they've got some sort of baseline going forward. Now, the reason I bring this up is not just because it's a really interesting case, and I really had tons of fun doing this work. I got to hang out on boats and look at whales and dolphins and so forth. Quite a fun way of doing research, I must say. But it also raises interesting questions about preservation of data. So how many people in this room know how long a humpback whale lives?
Good, nobody's a liar, because nobody knows how long a humpback whale lives. The answer to how long a humpback whale lives is, so far, longer than scientists have been studying humpback whales. They found a humpback whale washed up on a beach in about 2007, I believe, that had a harpoon point in it that was used by a company that went out of business in 1891. So they know that that whale had been swimming around since at least 1890, got harpooned at some point, and then kept swimming for the next 120 years. So they live a long time. They live much longer than the life of an individual scientist. So the data that I'm collecting as a scientist today is not only of value to me; it's of value to generations after me. But one of the things that they're struggling with in this field is: how do I make that data available to the generations after me when I don't have a joined-up information system, when I don't have a way of sharing this data that really works? I've been mostly working at a small scale in individual groups. So this is something they've been struggling with and continue to struggle with: how to make sure that data that could be valuable for centuries is of use when that time comes. We certainly hope that the whales are still around in a hundred-plus years. It also raises an interesting issue for archivists: what constitutes digital data? This is digital data. These are digital photographs of blue whales. They're stored in a box.
We've got our box down here. We had our box lover, Richard, earlier. This is a very, very simple Staples box and lid that they've taped together. And these pictures with the rubber bands, each is an individual sighting of an animal. When they find a new individual sighting, they'll put it in a rubber band and store it in a little packet and keep track of it in a separate database, so they've got a record of these things. But the reason that these are in a box, and you can see that they've categorized it as mostly dark here, is that this was quite a small project for this organization. Their bigger project was the humpback one, and they had an elaborate database and they had everything being kept digitally for that one, but they didn't have much money for the blue whale project. And so they had to resort to what they could in order to manage and store this data. But then, if you're going to make this available long term, what is the archival strategy for making these data available to blue whale researchers down the road? It's not immediately clear. So let me go on to another couple of examples.
I spent a little bit of time on that one just because it's one of my favorites; it always is with the whales. But another example of scientific research has to do with some of the genetic work. I used to work in genetics, bipolar disorder genetics, many years ago when I was in Indiana, and some of the scientific structures that are being set up here have to do with collecting data from various research teams, sharing these blood and phenotypic databases, building federated databases, and then being able to share these around the world. And my example for this has to do with some work we did with the OECD on big data for advancing dementia research. Now, it's interesting, back to my point about big data suddenly, somehow, being sexy. I'll give you a little insider tip: this project really had nothing to do with big data. The data is not that big, but the funders insisted that we include the words big data in the title. So there you have it. I mean, one of the databases that we were working with here had fewer than a thousand subjects. Not that big a data set, actually. But because big data was seen as sexy, we had to put it in, so there it is. But you can read this as data for advancing dementia research. Now, those of you who know about dementia, and I'm sure there are some people in the audience who sadly know about it from personal experience with relatives and so forth, know that dementia is a devastating disease that causes lots of problems, and this certainly came up in 2013, when there was a meeting of the then G8, now G7, about dementia. Some people may remember this guy who used to be Prime Minister in the UK. He's since slunk off to do whatever he's doing these days. But one of the things that he did do that was interesting and good was that he got the G8 thinking about dementia research. He got these people together for this dementia summit and was interested in a number of things,
particularly this third point, which was this awareness of a lack of collaboration and openness, with different scientists around the world using different data and trying different approaches, but frankly not really working together enough. One of the huge challenges in a lot of medical research is the fact that the sample sizes need to be bigger in order to be able to detect any differences. In the bipolar studies I used to be involved in, we kept scaling up and up and up. We started with, you know, hundreds of families, then went to thousands of families, then went to tens of thousands of individuals to try to find signals that would indicate which genes were responsible for bipolar disorder. The same is true for dementia. Essentially, we don't know exactly what causes dementia. We certainly don't know exactly how to treat dementia, and if you're going to come up with answers to this, you've got to start sharing globally rather than being based in just a small area, which is how scientists are trained. Back to this question I raised earlier about the disciplinary differences: when you're in medical science, you're trained that your data is very valuable. You don't want to get scooped, and you want to be able to wrest as much value out of your data as you can before you give it off to other people. So we worked with the OECD on this, and we came up with this diagram, and all credit goes to my research assistant Ulrike Deetjen, who came up with this, which was trying to understand some of the structural challenges to data sharing for dementia researchers. And you can see there are some things in the ice above the water here on our little iceberg, which are the technical challenges, which are sort of fixable, not actually that difficult, quite frankly. There's this consent challenge, which has to do with the fact that if I've given you some biological material or given you an interview here in New Zealand, and the researcher has followed whatever protocols are in place in New Zealand, that doesn't
necessarily allow that researcher to then share it with a researcher in another country where the laws might be different, where the rules about consent can be different. And so there are a lot of barriers to sharing data outside country settings, and that is just true around the world. But then there are these other underlying structural challenges about people and processes. And this is one of the social informatics questions: how do we change the incentives in medical science to make sure that people have an incentive not just to create data, but to create data in a way that's reusable? How many of you have worked with a scientist who comes to you, possibly at the end of the project, and says, oh, I've got all this data and I'm supposed to put it in a repository somewhere, and I haven't really thought about how to do that. I don't have any money left; the project's over. Could you help me shove this together and stick it someplace so that I fulfill the requirements of my funder? I hear enough laughs that this has happened to some of you. Of course, then it's very late in the game, and we need to change people's thinking so that they start to think about these things early on, so that they make the choices when they're gathering data and working with data so that it can be shared. There are a lot of other things in here; I can refer you to the report. But I think just this idea that we need to change our way of thinking collaboratively is important. We have a tendency to idolize the sciences and think that they've got collaboration all sorted out. You know, physicists and astronomers, they've got sharing and all this kind of thing sorted out. They've worked it out when they build things like CERN. But there are still a lot of structural impediments to sharing data, even when the stakes are as high as they are with dementia. Now, what is big data, or data, in the dementia space?
Well, these are the kinds of data that researchers are used to: purposely collected data. This is what they've always done. They go out and they get samples, they work with people, they interview them. And increasingly they're using routine data from medical systems, so this is administrative data, data about admissions. I was talking with an insurance company in the US that's one of these companies you've never heard of, but they own 20 companies that you have heard of. And they were working with the data from all these 20 companies to try to say, okay, if we've got someone who's got a diagnosis of dementia at age 85, can we look back at their medical claims 10, 20, 30, 40 years earlier and see if we can detect patterns of what's happened in their life that would predict whether they're going to come down with dementia? So we're increasingly using routine data. But there's a whole new area that's unexplored and really raises challenges in terms of how one preserves these data, which is non-medical data. Because you can do interesting things with these, potentially, though we're largely not doing so yet. Loyalty cards, mobile data. I mean, you've got Countdown cards. These are examples from the UK because I'm from the UK. Or online engagement data: the things people do online, the kinds of stuff we can scrape from people's activities online. I was talking with Clive Humby of the company dunnhumby, which made this thing called the Tesco Clubcard, which is right there, and he was saying, you know, we can actually do some interesting things with people's shopping habits. We can start to detect that people are doing the things that we know dementia patients start to do, which is that they make a narrow range of choices in their life. They buy fewer kinds of things. They'll go to the store every day and buy the same thing over and over, even though we know they couldn't have eaten all of those in the time that intervened. So they'll go to the store, they'll remember, oh, I like
this kind of biscuit, and they'll buy one, and they'll get home and they'll find that there are a hundred of them in their cupboard. So you can start to potentially detect these things in people's shopping habits. But then how, ethically, does Tesco or Countdown or anybody else share that data with a medical researcher in a way that doesn't violate people's privacy but does help us advance our understanding, and potentially our ability to help the people who are suffering from this? It's a completely uncharted space, but it opens a large potential for being able to do interesting new things.

So big data is about these traces we leave everywhere, and this is where I'll segue in just a second into the social sciences. Certainly many of you, probably all of you in this room, have some kind of mobile phone on you that knows exactly where you are at all times. Increasingly, some of us will be driving self-driving cars that will know exactly where they're taking us and what our habits are. Whenever you spend anything, retailers know all sorts of things about you. How many people recognize that thing there? Yeah, what is it? It's an RFID tag. Do our badges have these on the back? Are you tracking us now? I've been to conferences before that have these RFID badges, and I'll usually go to the conference organizer and say, so what are you tracking? What are you doing with this data? And if they don't know, I just tear it off and throw it away. But increasingly these kinds of tags and ways of tracking us are out there and able to know things about us, and the question is: can we use these in ways that advance science without infringing upon people's personal liberties and their right to have their privacy respected?

Some of you probably saw this example that scares people. Target is an American retailer.
How many people saw this example a couple of years ago? About a third of the room, so I can tell the rest. This happened in 2012, and it was reported in Forbes and the New York Times and other places, which said Target figured out when someone is pregnant. They do this by looking at the purchases you make: you buy unscented hand creams, you buy other combinations of stuff. So by the second trimester, an organization like Target can know that you're probably pregnant, if you're female and you're making these kinds of purchases. They know this from mining everybody's activities. You've all had this happen to you, when you're offered coupons in order to save money and they're quite accurate, or when you've gone to Amazon and it suggests certain kinds of purchases based on your past purchases. Now, what freaks everybody out, of course, is this part of the title: that they figured out a teen girl was pregnant before her father knew. The father said, Target, why are you sending my 15-year-old or 16-year-old daughter, however old she was, coupons for nappies? And it turns out Target knew a lot more than the father did: the teen daughter was pregnant and hadn't told her parents yet. And people then start to worry: wait a minute, how is this big data shaping what we are able to know about our own families and the world, when a retailer knows more than I do about my own family? So it raises some interesting questions. When we combine the National Health Service in the UK with something like Clubcard, what does that lead to in terms of research, and what does it mean for the organizations that are meant to do something with this data to make it reusable in the future? It's one thing if that researcher I mentioned earlier, who comes to you and says, yeah, I've got this data, could you please sort it out and store it someplace, also says,
oh yeah, and by the way, a whole pile of the data is a proprietary data set from a private retailer that has this huge non-disclosure agreement attached to it that they made us sign before they'd give us anything. Can you please sort that out for me? You know, that adds a whole layer of complexity to being able to do anything with these data down the road, and it really has some profound implications for how science is done if we can't share these sorts of proprietary data.

Okay, so two quick examples from the social sciences about the ways that we're using things like the internet. This first example I'll go through very quickly. This is from one of our articles that was just published this year, on the occasion of the 25th anniversary of the web, and we were looking at how the internet has become embedded in different kinds of research. For those of you who took my workshop yesterday morning, you'll recognize these diagrams; for the rest of you, I'll explain it quite quickly. In these diagrams, the sort of gray dots, or whatever color they are, represent all of science as represented in the database Scopus. These were created by a colleague of mine named Loet Leydesdorff, who's in the Netherlands, and it's based on 50 years of journal citations to each other. So if journals cite each other a lot, they're probably pretty close; they're probably in the same areas. And then you overlay a new research topic onto these to see how it spreads across the different disciplines. So this is the internet as an area of research, or a contributor to research, in different disciplines. You can see in 1990 it's sort of scattered all over the place. By 1995, the social sciences are down here, the humanities are down in this little tail, medicine's over here, and this is sort of science and engineering and math up there.
It's starting to really spread across all disciplines, until 2015, when the internet is showing up in all kinds of publications across all disciplines. It's really pervasive in the way knowledge is being generated, and this is obviously of interest to someone like me, who's interested in how knowledge changes and grows. But it's also interesting to anybody in these disciplines who stops to think about the fact that the internet really has become a central part of what happens across all disciplines, in a way that some of us were able to predict 25 years ago, but certainly not everybody was able to predict. Sir Tim Berners-Lee, who was mentioned in the previous talk, has recently taken up an appointment at Oxford. And one of the interesting things about the World Wide Web is, when you meet Sir Tim, you have to wonder. I mean, he's got genius, certainly, but you wonder whether the insight into creating the World Wide Web was a result of his genius at seeing how the world should work, or just the fact that that's actually how his brain works. When you hear him talk, it's very non-linear. He sort of bounces off in different directions as he follows links in his own mind as he's speaking. And so in some ways what we've created in the World Wide Web is the internal workings of Tim Berners-Lee's brain, which is quite liberating and exciting, because he is a genius. But it has also caused us to start to think in these ways, as we start to think about these connections around the world, rather than in linear ways of going to a library or a museum and asking someone to give us the answer to something. We're taking off in all these various directions.

And then one other quick example from the social sciences has to do with web archives, and I brought this example just because some of you in this room may have dealt with web archives. Has anybody here worked with web archives at all? One or two, a few, okay, a few scattered around. Web archives are deeply frustrating to
me. I've been working with them for a number of years. I had a project in 2009 with the Internet Archive in San Francisco, who some of you may have heard of; they make this Wayback Machine. Brewster Kahle is the person who started the Internet Archive. He made money in the tech industry and used some of that money to set up the Internet Archive, which was quite prescient: in 1996 he said, oh look, the internet's a thing, the stuff disappears quite quickly, maybe somebody should be grabbing it and storing it in case we ever want to refer to this stuff in the future. They've done this on a relatively tight budget. So it goes out across the web periodically and it gets these examples. You can go to the Internet Archive Wayback Machine (presumably some of you have used the Wayback Machine), and if you know an old web page, you can find the different snapshots that have been taken of it. So you can see, and I guess I should have done this for today's talk, what Te Papa's website looked like at different times in the past. Now, that's great if you want to look at things individually, but if you want to look at things in any sort of bigger way, the interface doesn't allow that. It just allows you to put in a single URL. So we've been working with the British Library and others in the UK Web Archive to try to do some things like build search interfaces. The Internet Archive doesn't have a search interface, largely because they can't index it as quickly as the data comes in, so there's no way to keep the search index up to date. Or some of these advanced search tools: this was from a project, and the slide is courtesy of one of my research assistants on that. We were working with the British Library to come up with more ways to search within the resources being held in the web archives. Now there's a huge problem with web archives in the UK, and in some other countries, in that the legislation that allows the British Library and the other depository libraries to actually
gather web archive data in the UK only allows there to be one copy on one machine, usable in the library. Okay, well, that's good if you want to go look up individual pages, but if you want to do anything programmatically with that, it's extremely difficult. So we've been working with them at the Alan Turing Institute, which I mentioned at the beginning, on how to use computation in a way that doesn't break the law but still lets us answer interesting questions of this data that now spans 25 years, because historians of the future will want to know what was going on on the web today, because this is where so much of what we're doing in the world is happening.

So this was one of the first projects using any sort of computational approach to the web archives. This was with my colleagues: Scott Hale, who was one of my former doctoral students and is now a researcher at the OII, Taha Yasseri, Josh Cowls, and other colleagues. We just tried to map the UK web space using the links between pages on the web. So you can see this is very simple data, but we were the first ones who ever did this, and this was only 2014, which says something. So this is the growth of different domains, you know, co, or ac, which is the academic domain, and gov in the UK over time. And this is obviously a logarithmic scale, so it, you know, grows logarithmically. But essentially by 2003, everybody who was going to be on the web was on the web. So the web was saturated with organizations and so forth by 2003. And you can look at the relative sector size. The academic sector had a much larger proportion, even though com, or rather the commercial sector, was always the biggest. So even though academic grew a little bit, it more or less disappeared in terms of the overall volume of web pages. And this was a little interesting bit.
We did this out of this paper, which asked, okay, how do these sectors link to each other? This is normalized, otherwise co linking to co just ends up dwarfing everything else. But which sectors link? The outside color shows you the curve going from that domain to something else. So you can see that co sends a lot of things over here to gov; they send links over to gov. Gov doesn't send lots of links to too many other people, other than other government institutions. Academics quite like to link to ourselves, so this is academics linking right back to themselves. And they link out to government, but government, let's see, doesn't send much of anything back over here to academics. There's government sending things back to academics; it's a little tiny one. So this lets us understand some basic questions about who's connected to whom on the web. Very simple stuff. We're now trying to extend this and answer these bigger questions about how it's grown, and there are pictures of this for the various years of the sample.

Okay, so those are quick examples from the social sciences. Moving on to the humanities, which some of you are probably interested in, I'll give you this one quick example that comes from my colleague Ralph's work about the Pynchon Wiki. I don't know if any of you are fans of the American novelist Thomas Pynchon. If you are, you know that he writes these very dense novels that have lots of hidden meaning in them; there are lots of layers of meaning. And when he was about to publish his book Against the Day, a chap in California, whose name is now eluding me, thought, well, we should annotate this. There'd been a previous annotation of Gravity's Rainbow, and this took about 10 years for Weisenburger, with 22 contributors, to write. And the Pynchon Wiki was set up to say, could we do this?
Maybe a bit quicker than 10 years, so we can use this annotation more quickly? So they did. They set it up, they got a bunch of contributors, and they were able to annotate all of Against the Day in about three months. And through a number of spot checks using different kinds of quality measures, essentially, it was a really high-quality annotation. They've since gone back and annotated a lot of Pynchon's other novels, both previous and subsequent. And this raises an interesting point about a new way of doing a humanities task: annotating a complex novel can be done by a crowd, and can be done quite effectively by a crowd. But this wiki was done by a non-academic; he was just an enthusiast. The machine running it is sitting under his desk, a sort of crappy old machine that he had lying around that could run a piece of wiki software. There was no plan at all for what the lifespan of this might be, what would happen if it were to disappear, if he lost his ability to run it or the machine broke down. And I think it raises interesting questions. If you've got a valuable resource like this, well, with Weisenburger's Gravity's Rainbow annotation, libraries purchased copies. They had them on their shelves.
They stored them; people could access them down the road. I can't make you any guarantee that this resource will still be available in 20 years' time if someone wants to go back and consult it, except through some archive someplace. So again, one of our pictures here: the Pynchon Wiki is also showing this taxonomic behavior, where you've got an author who writes a book, different annotators are sorting that into different buckets, adding additional information, and then making it available for readers and researchers to use.

And then one other example from the humanities space is this whole area of discovery and how we're finding out about information. This comes from a couple of different projects that we did with this thing called the Research Information Network. We were looking at humanities information practices, and I'll briefly talk about scientific information practices as well. We were interested in a number of things in these studies. There's lots of stuff in the report that you can look at, but one of the questions, and this was 2011, was: was there becoming a Googleopoly? You know, did Google just own all search, and could the rest of us just pack up our bags and go home and leave it all to Google, because they would sort it all out for us? When we did a survey of some of these people, we found that 79% considered Google to be an important resource and 66% considered Google Scholar to be an important resource.
It's pretty big. But there were also a lot of other things that people considered to be important resources: visiting libraries, browsing library materials online, following citations, and this bottom one, consulting peers and experts. People still rely on each other to know what's important. If they've got a new topic in their mind and they want to find out more about it, they go and talk to someone they trust, down the hallway or around the world via email or Skype. Now, this isn't true of just the humanities. Let me skip past this and go to the physical scientists. We asked the same questions of physical scientists. They use Google a bit more, but they also rely heavily on peers and experts. The one thing the physical scientists don't think they do is use libraries. So browsing library materials in person, only 14%; searches of library materials, 16%. We asked a lot of them questions about this. They said, oh yeah, I don't use libraries. I haven't been in a library in decades. And then we said, well, how do you get the journals you get now? They just appear.
I don't know where they come from. And we said, well, you know, it might be your library providing that. Well, I guess so; you know, when I travel I can't get access to some of these things, so maybe somebody's providing that, but I don't know who it is. But they didn't think that libraries were even a remote part of their lives. They were completely divorced from what libraries were doing, even though libraries were providing a lot of this access. And it raises an interesting question for organizations like libraries and other memory institutions: over the last 20 years, many libraries have gotten too good at becoming invisible. We've made our services so transparent that people forget that they're even using them, which is good on one level, because if you're on a network that's got access to something, you just click and it gives you an article or a result or a subscription-based resource. But if you're making that too invisible, then people don't think that they're using these resources at all.

And then to wrap up, before I give my couple of final slides, two quick examples from the arts that I think are interesting. This is work that I largely do for fun, just because I like the arts. I've got a background in the arts, and so I often do these projects quite cheaply, because I'm willing to participate for practically nothing, just to have some fun. So this is something some of you might have heard of: Bitcoin, right?
This is the cover of a rather weirdly designed magazine talking about Bitcoin and this whole notion of a bitcoin. I won't go into any great detail about how Bitcoin works; for those of you who are interested, you can talk to me during a break. But essentially it uses this thing called the blockchain, and you're to be forgiven if you don't understand much about it. Up to a year ago, I didn't understand anything about it, but my colleague Vili Lehdonvirta has done a lot of training to teach me what I'm talking about. Essentially, what the blockchain lets you do is store records in a public space, where anybody can see that a record has been changed, updated, happened, owned, whatever. They can't necessarily see the content of that, but they can see that the transaction happened. And this is all done in public and visibly, rather than a database being a private thing that only one person can see and control. Anybody can see it, and they can see that the transactions have happened.

Now, this is interesting in the art world, as it turns out. I work with this organization called DACS. This is a London-based organization, and it's one of these organizations where the name doesn't technically mean what it used to mean. It used to mean Design and Artists Copyright Society, but now it's just: DACS is DACS.
It doesn't mean that anymore. What they do is help artists get the payments that they deserve because of their art, and so they're designed by artists, for artists. I've been working with them for a number of years, and the CEO of DACS called me up a number of months ago and said, I need some of your time, Eric. I need to understand this blockchain stuff. And at the time I didn't understand it. I said, okay, Gilane, well, I'll do what I can. And we've been working with them since to think about how blockchain might be used in the art world to solve some of their problems, because some of these problems are things that I hadn't even thought about. So they've got some things in the UK (again, apologies for having mostly UK examples, but that's where I work) like this Artist's Resale Right. Has your work been resold on the art market for a thousand dollars or more, in the UK a thousand pounds or more, or euros, whatever it is? You've got a right as an artist to get some of that income if someone resells your painting on, and that's owed to you or to your estate. But actually tracking that down and making these payments available to artists is quite a tricky thing. And so the question being asked is: can blockchain make this more transparent, and make it easier to track over time that these sorts of things are happening? Because if we've got a blockchain record attached to a piece of artwork when it sells, and the blockchain records that that transaction has happened and that piece of artwork has been sold, as a matter of practice for the provenance of that piece of art, then the payment that's owed to the artist can automatically flow to the artist through the blockchain.
It's quite enticing in some ways, but of course there's a lot between here and that actually happening, in terms of changing the behaviors of artists, changing the behaviors of art markets, and so forth. Likewise with estates, again something I hadn't even remotely thought about until Gilane was teaching me about it. When an artist is alive, getting the money to them is a relatively straightforward thing, if they know money's owed to them. But the minute they die, life becomes much more complicated for an organization like DACS, because many artists have very complex estates, and money that comes to the estate might be distributed amongst many people in different fractional amounts. So it could be, you know, a child gets this much, but some nephew gets another fraction of a percent. Can they use blockchain to start to sort out these fractional payments that are moving out across a large group of people or organizations? This is something blockchain's actually very good at, but it's just moving into this space. So we're working with this organization that makes this thing called ascribe (they're called BigchainDB now), and also another organization called Verisart, that are trying to build these technologies. And we're working with them to ask, will this actually work for artists themselves?

And then my last example, which I bring up just because, again, it's fun, is about how young people learn to collaborate with digital tools. This was a project I did with a couple of colleagues down in Swindon who are filmmakers, and we worked with a group of schoolkids to make films using nothing but an iPad. Keith Phillips, who was one of the co-authors on this little report, is a filmmaker, and he's been going to schools for years teaching students how to make films with, you know, big digital cameras, big video cameras and things. And we said, what if we just gave them iPads, nothing else, iPads with a bit of software on them, and said, you know, here are
the tools for storytelling. Can you make a film doing this in a few weeks' time? And so we worked with these students, and it was amazing what happened, which was that it really unleashed a lot of creativity amongst these students. Previously, when Keith had worked with students, there would be someone who was the camera person, and they would look through the viewfinder and they would see what was going on. Well, suddenly, if you're using an iPad to film, everybody behind that person can also see what's going on. They can start to see the choices that the filmmaker is making and, when the shot's over, give them feedback and discuss these sorts of things. They were doing the editing back in the room on the big screen with the iPads. They were seeing the editing choices that were being made, and they were making a lot of really interesting collaborative choices as a group of young artists, or just young people, trying to understand how the world worked using this new digital technology. One of the films was actually invited to an international competition.

Okay, so, my last couple of slides. What's the point of all this? It used to be the case that finding stuff out was a huge skill needed in the world. So when I was an 11-year-old boy in rural Ohio (that's the town where I grew up, and the Bank of Elmore, in the little town I grew up in), we had a teacher who gave us this thing called Question of the Day. The question of the day was usually something you couldn't go find in our little inadequate school library. It was something you had to sort of ferret out an answer to. And one of the questions was: whose portrait is on the hundred-dollar bill? In 1977 in small-town America, there weren't a lot of hundred-dollar bills lying around. I'd never seen one in my life.
Most people I knew had never seen one. And so I actually had to go to this bank and ask a teller, could I see a hundred-dollar bill, because I'd like to know what it looked like? And she referred me to the vice president of the bank, and I went and sat at his desk, and he took me into the vault and we got out a hundred-dollar bill, and he showed it to me, and I found out my answer. Now of course, if you ask 11-year-old kids today that question, it's a matter of seconds before they know the answer. It doesn't take any skill to find out that Benjamin Franklin is on the hundred-dollar bill. It's so simple to find information. The question is, what can we do with that information that's out there? What can we do to analyze this data, this information? And I think this is the new skill that we need to be thinking about.

Now, one of the questions it raises is whether the easy availability of certain data will really bias the kinds of research we do. A lot of my colleagues do things with Flickr and Twitter and YouTube and so forth, and I talked a little bit about this in my workshop. But will this mean we only study things that are on Flickr and YouTube and Twitter, for instance, even though that might not be where the most interesting questions lie? And we've also got a big challenge, in an area that I don't have time to really focus on, which is whether in science we rely on each other and whether we're certain what we ought to do. So this is physicists: they rely on each other a lot to build the Large Hadron Collider in Switzerland, and they know what they're doing. Social science and humanities are down here.
We don't depend on each other, largely, and we really don't agree on what we should be doing. There's nothing wrong with that; it's just a fact of the way the world works in the sciences, social sciences, and humanities. But as we move into the big data space and sharing, we're being pushed into this area where we depend on each other and we have to agree more on what we should be doing. And that's not a comfortable space for many of us to be in as we're starting to ask these new kinds of questions.

So back to my hyphen to wrap up, my connection between the social and the technical. I think that memory institutions have a big role to play here. That's the Radcliffe Camera of the Bodleian Libraries. These are all in Oxford. That's the Ashmolean Museum, one of the oldest museums in the world, and that's the Natural History Museum, both within a block of my office. I'm working with all these organizations to think about how to use digital strategies to work with the kinds of data I've been talking about today, in order to make sure that research can happen and that these kinds of questions will be moving science forward, and the social sciences and the humanities forward, in interesting ways in the future. And I think these are the same challenges; I'm probably preaching to the choir for everyone in this room.

So that's all I've got. I've taken up just about all my time, maybe two minutes over. I'm happy during any of the breaks to answer any of your questions and so forth. I'll be around for the rest of the conference. Also, for anybody down in Christchurch, I'm going to be giving a similar talk and a similar workshop down in Christchurch on Thursday and Friday. And then I'll be up at the Auckland Museum on December 12th, if anybody's based up there. I'm happy to talk to people while I'm here in New Zealand. So thank you very much for your time and attention. Hopefully you've gotten something out of the talk today. Thanks.