Hi everyone. I was expecting to do a talk on this in maybe a month or so, and perhaps a blog post later; I'm literally in the middle of the project. Then I was asked to give the talk now, and I wasn't entirely crazy about the timing, but I thought: I'll do it live. So this is the middle of a project, but it's really cool stuff. Sorry for all the bullet points; it's easier than writing loads of notes for myself, and I wrote all of this last night. So, I'll start with Crossref. We were set up in 2000. We provide persistent identifiers for scholarly publishing; most journal articles have one. If you don't know, a DOI is a persistent identifier, which means the identifier itself won't change. It identifies a scholarly work, such as an article. The idea is that the DOI is a link: you can click on it and get to the landing page of the article, and if the landing page of the article changes, the DOI can be updated to point at the new location. So they're the best way to cite articles. We have great metadata for scholarly works; we have 80 million of them, and about 5,000 members, most of the big publishers, who put the data in. DataCite is a similar organisation; they assign DOIs to datasets. So, this is Wikipedia: the reference section of a Wikipedia article. You can see a very traditional citation, and you also see some DOIs. Wikipedia cites things from all over the web, and there are lots of citations to scholarly articles. There's a big push within Wikipedia to use DOIs in those citations, and it's really cool to see them there.
At Crossref, we knew DOIs were being used on Wikipedia, but we wanted to find out a bit more. Crossref was founded in 2000 and Wikipedia in 2001, so they're about the same age, but some publishers who are Crossref members have been around for hundreds of years; the oldest article with a DOI was published in 1672. It's our job to keep track of all the traditional citations between papers. But things have changed a lot in publishing since 1672. This is Isaac Newton, and this is the DOI for his article. There's a drive to track non-traditional forms of scholarship and how citation works there. The field is called altmetrics, or alternative metrics, or article-level metrics (ALMs), and the idea is to create metrics from non-traditional citations and to look at how articles are used in the real world. Crossref isn't in the business of making metrics, but it is interesting to know exactly what is going on in the wild. So we've got this project called Crossref Event Data. It's a partnership with DataCite. It's at an early, minimum viable product stage, and we're just feeling our way. We're basically concerned with tracking things that happen to our DOIs out in the wild, because no one ever tells us anything; our DOIs never call home. If people use them on Wikipedia, or in a blog, or on Twitter or Facebook, no one tells us. We want to change that one way or another. So, what happens on Wikipedia is that people make citations, including ones with DOIs. But what happens on Wikipedia doesn't always stay on Wikipedia. Here is an edit to an existing article where a DOI has been added, so the article changed. Here is some spam: "I've been using Minoxidil hair loss shampoo for two months. Now I can clearly see how it works." All of that was removed in this diff when it was reviewed, so the DOI was removed in this edit. Oh, and another one; this one is about urination.
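Edits like these, where DOIs appear and disappear, are what the pipeline watches for. As a rough illustration, here is how diffing the DOIs between two revisions might look; the regex and function names are mine, not Crossref's, and real-world DOI matching needs considerably more care:

```python
import re

# A simplified DOI pattern; real matching needs more care than this.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s|}\]"<>]+')

def extract_dois(wikitext):
    """Return the set of DOIs found in one revision's wikitext."""
    return set(DOI_RE.findall(wikitext))

def diff_dois(old_text, new_text):
    """Compare two revisions and report which DOIs were added or removed."""
    old, new = extract_dois(old_text), extract_dois(new_text)
    return {"added": sorted(new - old), "removed": sorted(old - new)}
```

Given the wikitext of an edit's old and new revisions, `diff_dois` yields exactly the added/removed events described above.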
Here the DOI was changed, possibly because there was a typo. So we decided to investigate what happens on Wikipedia. Wikipedia publishes a recent changes stream: every single edit that happens anywhere on Wikipedia comes down this pipe, and we subscribe to it. There are up to 100 changes per second sometimes. For every edit to any article, we go to the Wikipedia RESTBase API, fetch the old and the new version, look for DOIs in both, and work out which DOIs were added and which were removed. Then we pass this on for whatever processing you want to do. This is the live events screen I put up earlier: it monitors every single change, and we get an event concerning a DOI every couple of minutes. So there's fairly constant DOI citation activity happening on Wikipedia. From this experiment we knew that Wikipedia is important for DOI citations, and that we need to pay really close attention to what's happening there, because this is a brand new way of citing; Wikipedia might be 15 years old, but it's kind of coming into its own right now. We also know that citations can change, unlike in traditional publishing: traditionally a citation, once made, would never change, but a Wikipedia article can have references added and removed. So we know that DOIs are being used on Wikipedia. But do people actually use them? Do people actually click on the DOIs? This is an interesting problem. So: big data. I think everyone in this room would like to have a big data problem; it's a really cool thing, and everybody would jump at the opportunity to use big data. And we had some data: the DOI resolution logs. So, when you click on a DOI... sorry: we have the DOI resolution logs, which are about 1.5 terabytes covering about five years, which is a large-ish amount of data. A DOI gets clicked by a human, or by a machine, or by something else, however that happens.
Each time, we get an entry in a log file. So it's not necessarily human activity, but it is some kind of use of the DOI, and that tells us how DOIs are being used, by machines or humans or whatever. I've dealt with a lot of log files. Here's an example of the log lines we get: the IP address, some status codes, the date, a couple more status codes, the DOI, another status code, and then the referrer string. This one is from a university; this is from the NIH; OECD, and so on. The referrer field works like this: when you click a link in a browser, the browser sends the previous URL to the place you're going to. That's how Google's tracking works, for example. So we know where people were when they clicked the DOI, if the browser sends that information; it's obviously not always sent, and robots may not send it. So I thought we could do some analysis on this: which DOIs were used, on which dates, whether the referrer was present, aggregated over the years. Crucially, I was going to remove the IP address, the precise referrer URL, and the precise time, because those are personal information. Zara gave a great talk yesterday about how important it is to think really carefully about how you handle this kind of data. So, enter Apache Spark. I thought: I've got a lot of data, 1.5 terabytes. Spark is like MapReduce, but a lot more flexible. You specify a graph of transformations; your input data goes through all these processes and comes out the other end. It has some very clever algorithms for partitioning the data: you put your data into a cluster and it splits it into different partitions on different servers, and it tries to keep the data local to each node. So, here's an example of what I was doing.
You have log files coming in at the top, and there's a map stage which parses each line into a triple of DOI, referrer domain and date. That might then get mapped down to DOI and date, and then counted, so in the end you get, for every DOI and date, a count. That tells us how many times each DOI is visited each day. And the cool thing is that Spark caches between stages; there are a few pipelines, and Spark figures out the most efficient way to get your data through and into some database at the end. Here's a more realistic example of the kind of thing I was doing: a few parallel pipelines that share some of the half-processed data. And Spark is really brilliant; it feels like magic, because I can write my code in Scala or Clojure and run it on my laptop over a very small dataset, and it works, and I can iterate and refine as I write. Then I can magically scale it up to a multi-node cluster and suddenly I'm doing big data. It's very exciting. And it would run for 12 hours, and sometimes it would fail during that time. Or the heap would explode after a few hours and I'd get up in the morning to find something had gone wrong. Or, one by one, the nodes in the cluster would die, so the last node ended up doing all the processing. And getting the data from S3 into EC2 is still quite slow; shifting that much data is still quite painful. EC2 is really cheap, but if you fail after 12 hours with a cluster of 10 machines, it gets expensive. But I got the results; I got the numbers out and I was really pleased. So, a year ago, I had the stats for the DOIs and their referring domains. It showed that Wikipedia was roughly the fifth largest non-traditional referrer of DOI traffic. And this is the DOI Chronograph as of a year ago; it's a way of exploring the data I derived from our resolution logs.
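The shape of that parse-and-count pipeline can be mimicked in plain Python. This is only a sketch: the field layout assumed in `parse_log_line` is mine, for illustration, and is not the real Handle log format.

```python
from collections import Counter
from urllib.parse import urlparse

def parse_log_line(line):
    """Parse one resolution-log line into a (doi, referrer_domain, date)
    triple, keeping only the referrer's domain and the date, and dropping
    the IP address, precise time and full referrer URL (personal data).
    Assumed layout: ip status date time status doi status referrer."""
    fields = line.split()
    doi, date, referrer = fields[5], fields[2], fields[7]
    return doi, urlparse(referrer).netloc, date

def count_per_doi_per_day(triples):
    """The 'map to (doi, date), then count' stages folded into one step."""
    return Counter((doi, date) for doi, _domain, date in triples)
```

In the real pipeline, the parallel branches (per-DOI counts, per-domain counts) share the parsed triples, which is what Spark's caching between stages buys you.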
So for Wikipedia, for example, this is per day, and you can see weekly spikes; around Christmas it dips a bit. But you see healthy activity: people, or machines, or something, are clicking the DOIs, and they're probably humans, because for a referrer to be present there's probably a browser. We see healthy activity, and that's really cool: we knew people were clicking on DOIs from Wikipedia, something like 18,000 per day, a healthy amount of traffic. This is the English Wikipedia, with similar patterns. But looking at the English mobile Wikipedia, you can see that over time its use kind of rockets up. That's quite interesting: first, how people are using the mobile site; second, that people are doing scholarship from the mobile site. And this data is really useful to us, because it tells us we should concentrate on Wikipedia: we know people are using it to reach scholarship. Everybody's interested in this data, and it's very cool. And then something interesting happened. I talked to Dario Taraborelli, who is Head of Research at Wikimedia; we talked through a few things and realised that Wikipedia was trying to move to HTTPS, and that there were some ramifications of this. So there's an interesting twist. We had this discussion online, and it's fascinating having a discussion with Wikipedia, because everything is public: you don't send a private email; the discussion of everything is done in the open, and it's quite interesting to participate in that kind of conversation. So stuff happened. The Russian government started getting interested in blocking things; everything was kicking off in Ukraine; the Russian Wikipedia enabled HTTPS-only for a time. Then came the Snowden revelations, and eventually a wholesale change across Wikimedia to HTTPS only. And the catch is: if you go from an HTTPS site to an HTTP site, the browser does not send the referrer header, so that means we lose all of our data.
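A protocol-relative link lets the browser reuse whatever scheme the current page uses, which preserves referrers for HTTPS-to-HTTPS navigation. A minimal sketch of rewriting an absolute DOI link; the function name is mine:

```python
def make_protocol_relative(url):
    """Strip the scheme from an absolute URL, keeping the leading '//',
    so that the embedding page's own scheme (http or https) is reused."""
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            return url[len(prefix) - 2:]  # e.g. '//dx.doi.org/10.1000/x'
    return url  # already relative, or some other scheme: leave untouched
```

So `make_protocol_relative("http://dx.doi.org/10.1000/182")` gives `"//dx.doi.org/10.1000/182"`, and a page served over HTTPS will then follow it over HTTPS.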
Suddenly it evaporates: the referral data we used to see from Wikipedia just goes away. doi.org does support HTTPS, but only if the link itself is HTTPS. So there was a change on Wikipedia to make all DOI links protocol-relative, which means if the page is served over HTTP the link is HTTP, and if it's HTTPS the link is HTTPS. We collaborated with them to get that done and to check that everything worked, because those links matter to everyone. So they made this change and phased it in, and now it's a year later. The questions I wanted to answer were: did the change in referral policy change the referral data that we see? Are we even seeing any data? Can we actually see the change being phased in? Did all the URLs actually change, or were some missed? Are people and machines still following the links? All questions I wanted to answer. So I was looking at this data and I thought, I need a moment of honest reflection here: am I doing real big data? Big data means data sets so large or complex that traditional data processing applications are inadequate. 1.5 terabytes is a lot of data to crunch on a laptop, so I had reached for Apache Spark, and it did the job; it was fine. But I thought I'd take some lessons from Spark's architecture and try to do it myself a bit more simply. I wrote this thing in plain old Java. It's a process that runs overnight over about a year's worth of data and handles a few billion lines. Is this big data?
No: it ran on my laptop. The approach I'm taking, if you can see the colours, is to partition the data. I take a hash of the thing I'm interested in, like the referrer domain, and put it into one of this many buckets. Then I run over the data 20 times, and each time I only look at one bucket's worth of stuff. So I partition, then I have a map stage over the partitions, one after the other, and then I merge the results at the end. If this looks a bit like MapReduce, it is; it's basically the same thing, except I'm processing the partitions one at a time rather than spreading them out massively in parallel: one partition, then the next, then the next. Because it runs overnight on my laptop, there's no need to parallelise. The partitions don't need to be meaningful; they just need to make each pass fit in memory, around 4 GB at a time. The expensive parsing, which takes all night, happens only once. I'm using plain CSV files for the intermediate data, reading and writing GZIP, so I can store all the data like this, and I don't mind it running for a few hours once a month. Here's an intermediate file: it's CSV, with the DOI, domain and date. So on my laptop I parsed and normalised fifteen months' worth of data, and it was able to handle about a million lines every few seconds. Fifteen months was about 100 GB compressed, which is about 500 GB of uncompressed data, and the output was about 43 GB. The input was about 4,000 million lines, which is 4 billion if you're American; if you're British you can choose what to call it. So here is the data for all DOIs. This ran last night, so the data is fresh. I don't know what's going on here; maybe web spiders, who knows. This yellow line is all DOI resolution activity, and it's quite spiky; you can tell there's obviously something going on, who knows what. The red line is resolutions where we know there was a referral from an HTTP page, and again you see the weekly spikes.
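Going back to the bucketed multi-pass approach for a moment, it can be sketched roughly like this; the bucket count and field names are mine, and the real thing streams each pass from gzipped CSV rather than holding records in memory:

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 20  # one pass per bucket; each pass touches ~1/20th of the data

def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Deterministic hash bucket for a partition key. (Python's built-in
    hash() is randomised per process, so use a stable digest instead.)"""
    return hashlib.md5(key.encode("utf-8")).digest()[0] % num_buckets

def count_referrals_by_domain(records):
    """Make one pass per bucket over (doi, domain, date) records, counting
    only that bucket's domains each time, then merge the partial counts:
    a serial MapReduce with bounded memory per pass."""
    merged = Counter()
    for bucket in range(NUM_BUCKETS):
        merged.update(
            domain
            for _doi, domain, _date in records
            if bucket_of(domain) == bucket
        )
    return merged
```

The partitioning doesn't have to be meaningful, exactly as in the talk: it only has to cap how much distinct state any single pass accumulates.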
HTTP referrals tick along nicely, and this blue line at the bottom is HTTPS. I'm going to remove the other line so we can compare them: you get a fair number of referrals over HTTP, and a small but growing amount of HTTPS activity over the course of 2015. So HTTPS referrals are happening, and it's really good to see that there are HTTPS DOI links out there, because people are clicking on them. And now for the big reveal, where we can answer the questions about Wikipedia: yes. It was amazing to see this last night when I got the numbers through. At the start of 2015 we see referrals from Wikipedia via HTTP, which is normal, and then towards the end of 2015 they tail off as Wikipedia enables HTTPS only, and the HTTPS referrals come up. So we can see the switchover working; the change to HTTPS did work. And it appears to have caught most, if not all, of the DOI links, because we get about the same volume of data as before. The lesson I learned: it's not all big data, even when it doesn't fit in RAM; it's a good idea to try normal techniques before reaching for the big ones. I'm going to blog all of this in a few weeks on the Crossref blog, so check it out. There's the URL for Event Data; it's a new project, and you should go and take a look. Follow me on Twitter. And I'm one minute ahead of schedule, thank you. Question: you're just getting the changes from the stream, so you don't know, for example, how many DOIs there are on Wikipedia in total? That's an interesting question. Each edit event gives a snapshot of the old and the new version, and what I'm doing is computing a diff. If I wanted to, I could build a complete list of every single DOI on Wikipedia; that's something we could do. For this particular project we were interested in how they were added and removed. So yes, I could store every DOI I've ever seen, and you'd probably see the same ones again and again, because most of the data is already there. And in future I could
try to correlate the DOIs we see being added to Wikipedia against the referrals they get. There's a two-month lag between the DOI logs being generated and us receiving them, so that would be a longer-term project. Question: why do you have that lag? Because they're traditional server logs in the system. Question: so why not run some sort of logging service that does all this for you in real time, something like Kafka, given that the thing you most want to know is whether people are actually clicking through to your DOIs? The answer is that DOIs are a subset of the Handle system's functionality; Handle is a server technology for resolving links, and the DOI system runs on top of it. There's a standard piece of software called the Handle resolver, and CNRI run that; we don't run it, we pay them to run it. So there's a Handle server resolving the DOIs, and it doesn't belong to us, and it resolves more than just Crossref DOIs, so we can't suddenly change it. But one of the things we're trying to do is make the case to CNRI that they should modify the Handle server to support that kind of reporting, because it's becoming increasingly important. The other thing is that the logs also contain DataCite's DOIs, so the files have to be split up, and they're delivered to us once a month. But yes, it would be better to get this live, and I hope that by doing this kind of work and showing that it's really valuable, we can make the case that they'll make those changes. Question: can you see which are the most popular citations?
Yes: in the previous version of the program I was recording referrals from each domain to each DOI. I've not ranked those yet, but I could; like I said, I'm halfway through building the new one. It also highlights something interesting: the fact that something is cited often doesn't necessarily mean it's clicked on often, and one of the things we're able to do is show how often a thing is followed as well as how often it's cited. Question: is there something like an author identifier that would let you track all the papers someone has written, collect them, and check where they're cited? You're talking about ORCID. ORCID is an author identifier system which lets you connect authors to their works. Crossref metadata is provided by publishers, so when you write an article, the publisher collects ORCID iDs and deposits them in the Crossref metadata. So there is a link, if the publisher provides it, and there are 5,000 organisations putting data into Crossref, so it is possible to make that link. In the ORCID system it's also possible to say in your profile, "I wrote these papers." As for a broader connection between these things: Crossref metadata comes only from publishers, but this Event Data project is an initiative within Crossref to collect as much data like this as possible and then make it available for free, so people can build exciting applications. When it's up and running, someone, maybe someone at this conference, could take the stream of event data from Wikipedia, then go to our metadata API for those DOIs and to the ORCID API, get all the data, combine it, and do what you're describing. For Event Data we're interested in collecting the raw data and making it stable. We're not interested in, for example, what altmetric.com does, creating that donut which says the paper
with this DOI has this much activity on Facebook and Twitter and whatever. We're not interested in creating a metric or some kind of analysis; we're just interested in collecting the raw data. So hopefully, when that data is broadly available to everyone, somebody can come along and build that tool. Question: what happened in those two Januarys where there was actually no data at all? I think there was an error in the logs, or they were missing. Question: when did you notice? Good question. The answer is that I started this last year, 2015. I asked my colleague for all the log files, and he said, oh, I've not kept them all; we didn't realise how important they were. We were doing some analysis for publishers, to tell them "your DOIs have been used this much," but beyond that we didn't really think about the logs until last year. So he went back and got the files from tape storage, and from CNRI, going back to 2010, but I think maybe some log files were missing. Are we done? Thank you.