Okay, cool. Welcome, and thank you for being here. I'm going to present this a little bit.

There is a problem, and that problem is MediaWiki data. It's scattered across multiple databases, something like 800 or 900 of them, and it's also divided into many tables, in a huge schema that is not well suited for analytics purposes. We have to do many joins and operations, and long queries.

This is a query that tries to extract one metric. What was the metric? New editors. A new editor is an account that makes at least one edit within its first day; that's what we call a new editor. Let me explain the query a little so you can see where the complexities are. For example, here we're using a CTE, a common table expression, to extract from both the revision table and the archive table all the revisions and archived revisions that belong to a given wiki. After that we join this with the logging table, which contains historical information encoded in different ways, and where we have to select the editors who created their own account, so we can count them as new editors. So it's a pretty complex query, and this is just one example.
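For illustration, here is a minimal sketch of the kind of per-wiki query being described. This is a hedged reconstruction, not the exact query shown on screen; it assumes the MariaDB replicas and the MediaWiki column names of that era (rev_user, ar_user, and log_type = 'newusers' with log_action = 'create' for self-created accounts):

```sql
-- New editors on ONE wiki: accounts self-created via the logging table
-- that made at least one (live or archived) edit within 24 hours.
WITH first_edits AS (
    SELECT rev_user AS user_id, rev_timestamp AS ts FROM revision
    UNION ALL
    SELECT ar_user AS user_id, ar_timestamp AS ts FROM archive
)
SELECT COUNT(DISTINCT l.log_user) AS new_editors
FROM logging l
JOIN first_edits e ON e.user_id = l.log_user
WHERE l.log_type   = 'newusers'
  AND l.log_action = 'create'      -- account created by the user themselves
  AND TIMESTAMPDIFF(SECOND,
        STR_TO_DATE(l.log_timestamp, '%Y%m%d%H%i%s'),
        STR_TO_DATE(e.ts,            '%Y%m%d%H%i%s')) BETWEEN 0 AND 86400;
```

And even this covers only a single wiki; the same operation has to be repeated over all 800 databases.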
Many other metrics involve similarly complex operations, for example the ones that depend on deciding whether a user is a bot or not. Those are based on the bot flag, or the bot group, which in the MediaWiki databases doesn't carry historical information: we only know whether a user is a bot right now, not whether the user was a bot in the past, or vice versa. So a user that was a bot in the past, with many edits as a bot, might today not be flagged as a bot, and we want to calculate historical metrics that need this flag. For example Active Editor: Active Editor filters bots out. The way the MediaWiki databases are designed, we fail to identify edits performed by bots in the past.

What else? Also, as there are 800 wikis, 800 databases, we have to repeat that operation for each one of the wikis. Manually that's impossible; there are ways of doing it with scripts and specific tools, but it's a complication that gets in the way.

Right now these databases are on labsdb, is that what you're querying? These are replicas of production, and the same data is exposed for people doing analytics, but in other ways.

So we worked on this to be able to use a query like this instead, which is a lot simpler, and mediawiki history spans all wikis, so you don't have to do the trick of multiple queries over all the wikis: you have everything in one single table. There is also, conveniently, a field saying whether a user was created by self, which saves you from having to join with another table and do complex stuff. For example, if we were going to calculate a metric that depends on the bot flag, we would be able to do that with a simple field, as opposed to having to join with another table and go through problems. That's the problem we had and the reason we worked on this.

Yes, and in the process of building this new dataset we also decorated the information that is hidden in the wiki databases with convenient fields, like revert information and others that we will see when we discuss the schema.

So, for that new editor metric: instead of joining the logging table to find when the person created their account and getting that date, you just have the account creation date right there with the revision, and you check whether they made that first revision within 24 hours of that date. It's right there.
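As a hedged sketch of what that looks like on the denormalized table (field names follow the schema as presented in this talk, for example event_user_is_created_by_self; check the table comments or the Wikitech documentation for the exact current names):

```sql
-- New editors across the denormalized history table (Hive):
-- self-created accounts whose edit came within 24h of registration.
-- Assumes 'yyyy-MM-dd HH:mm:ss' timestamp strings.
SELECT COUNT(DISTINCT event_user_id) AS new_editors
FROM mediawiki_history
WHERE wiki_db      = 'enwiki'
  AND event_entity = 'revision'
  AND event_type   = 'create'
  AND event_user_is_created_by_self        -- no join with logging needed
  AND unix_timestamp(event_timestamp)
      - unix_timestamp(event_user_creation_timestamp) <= 86400;
```

The CTE, the join with the logging table, and the per-wiki loop all collapse into a single filtered scan.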
Hello, I'm Joseph. I also work in the Analytics team, with Marcel and Dan. To continue on what Marcel has explained: with the relevant fields spread across tables, it is indeed difficult to make a query at the end that joins on all of those fields. So we built a fairly complex job that reads the tables for you. Basically it takes all the tables we've been talking about, page, user, user groups, the logging table, the block tables, and mingles everything together to provide, at the end, a single table where the information is denormalized and consolidated so it can be queried in a simpler way. Is everything clear for everybody so far on the change we made, or do you have questions, or should I go deeper into how we made it? Sorry? You were showing earlier that in a single table you had the whole schema? Yes, a MediaWiki history. So this is all just a single table?

That's a very good question, and a description of the schema will help with it. Since we took all of those tables and grouped them together to provide a single table, there were a few things to do around denormalization. Every event is related to either a revision, a page, or a user; those are the three main categories of events we work on. For revisions there is only a single type of event, revision create: there are no changes to revisions, only creation. On the user side there are multiple types of events: there is creation; there is the rename, which can happen when a user changes their name; there is the change of groups; and the change of blocks, for when a user does some vandalism, for instance, and gets blocked. Those are the various events that get in there. And for pages you have the page creation and the page rename, okay? Those live in the event entity and event type fields over there, so you can filter. And it is actually a necessity, if you want correct information, to use those two filters to only treat the type of event you want: in event entity you have revision, user, or page, and depending on the entity you choose, you choose the event type accordingly. After I'm done explaining how we did this, we'll show you all the fun you can have just filtering on those fields, okay?

The next information we have is page information, user information, and revision information. Obviously, when we have revision information, all the user-event fields are null, because the event is not about a user, it's about a revision. But in that case the page fields are there, because the revision is related to a page: we extracted, along with the revision, the page information that goes with it, so that when you access a revision row you also have information about the page it was related to at that point in time. Same thing for the user that made the action: when a user created a revision, we store that in the fields called event user, and all this information is accessible on the same row as the revision or other actions, allowing you to access it without joining other tables, okay? You're with me? Want me to go over that again?
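To make that concrete, here is a minimal sketch of pulling revision rows with their page and performer information inline; field names follow the schema as described here and are illustrative:

```sql
-- One row per revision, with page and acting-user info already attached.
SELECT revision_id,
       event_timestamp,
       page_title,          -- page info carried on the revision row
       event_user_text      -- performer info, no join with the user table
FROM mediawiki_history
WHERE wiki_db      = 'enwiki'
  AND event_entity = 'revision'   -- the mandatory entity filter
  AND event_type   = 'create'     -- and the matching type filter
LIMIT 10;
```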
We can explain that a bit. For instance here, user groups, or page title: page title is a value that changes over time. When a page gets renamed, you have the title of the page at a point in time, and then it changes. But it's also interesting, almost always, to know the current version of the page title. So we store two pieces of information: page title, which is the value at that point in time, the one referred to by the event timestamp we have up there; and page title latest, which is always the current version of the title. That's especially interesting for page events.

When thinking about MediaWiki history, we often think mostly about revisions. The fact of having the user information and the page information added makes it really more complicated, particularly because that data is not historified, meaning that for user and page information the databases only keep the current state, not the state users and pages were in in the past. By chance there is a logging table in the DB; this table stores events for everything that happened. But, as Marcel explained in one of our documents, it is both a blessing and a curse: yes, it stores every event, which allows us to reconstruct the history, but it lacks very crucial information, and you'll see it very straightforwardly. For instance, on a page rename we know the old title and we know the new title, but we don't know the page ID; the page ID was not stored in that logging information. That makes the work of rebuilding history way more complicated, particularly because a page can be renamed to the same name as another page, at a different point in time. Taking that into account makes things really complicated; we can dig into that field later on if you want, but there are very complex examples.

To solve this problem we wrote an algorithm in Scala. First we extract the information from the DB into Hadoop. We have jobs that do that using a technology called Sqoop: it reads all the tables, five or six tables for each of the 800 databases, and copies everything into HDFS on Hadoop, in Avro format. Then we process it using Spark. Spark is a parallel processing framework that can be coded in Java, Scala, or Python; we used Scala for this job, and we rebuilt history.

The first two steps are to rebuild the history for users and for pages, and by history rebuilding I basically mean reconstructing the chain of names through time, because, as said, the issue is that we have the names, old and new, but we don't know the page ID. The exact same problem exists for users: we have the old and new user names but not the user ID they relate to. So the user history builder is fed the data from the various tables and recomputes history using an algorithm we built, and pages work the same way. Then we take those two historified datasets and join them with the revision dataset into the denormalized table. The first thing we do with the revisions is augment them with some information. As Marcel said, there is information we need, like is bot, is created by self, has been reverted, that is very interesting to have for each revision but is not present in the original DB; it needs to be computed. So the first step is to enhance the revision rows with that computed information, and then we join back with the historified users and pages, using time as the link, to be sure that the page information we get for a revision at a given point in time was the correct state of the page at that point in time.
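The real job is a Spark job written in Scala, but the point-in-time join at its heart can be sketched in SQL. Table and field names here are illustrative assumptions: page_history is imagined as one row per page state, valid from start_ts inclusive to end_ts exclusive, with end_ts null for the current state:

```sql
-- Attach to each revision the page state that was valid when it happened.
SELECT r.rev_id,
       r.rev_timestamp,
       ph.page_title,            -- title as it was at that moment
       ph.page_title_latest      -- current title, for convenience
FROM enhanced_revision r
JOIN page_history ph
  ON  r.page_id       =  ph.page_id
  AND r.rev_timestamp >= ph.start_ts
  AND (ph.end_ts IS NULL OR r.rev_timestamp < ph.end_ts);
```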
If a page was named A and then B, we have the page named A associated with all the revisions that occurred while the page was named A; then the page is renamed to B, and all the revisions that happen while the page is named B are linked, similarly, to the correct version of the page title. And everything can still be joined on page ID, which is the actual stable identifier, and on page title latest as well. That's pretty much it. The format we use to store this information on Hadoop is Parquet, because the main purpose of the dataset we have released is to be really well suited to analytics, and you can query this information on Hadoop using Hive or Spark. You have a question?

I have a question: have you considered actually just amending the logs to backfill this information? Because we've done that before, several times, and it's not that controversial to bring this data back in.

We are just at the end of the process, but this can definitely be done. This week I talked to people here and they were excited about the idea. These logging tables, compared to the revision tables, are relatively small; these are not six-month jobs. In some cases it's weird, though: for example, the page ID on a rename used to be the ID of the page that was left redirecting to the newly named page, and then it switched at some point, in like 2012 or something; I think now it's okay. Yeah, and there is a field, log_params I think, that has a blob format that only PHP is able to parse and write, which makes querying really complex unless you have a parser. One of the things we did was write a parser that unfolds this blob and transforms it into a map, so that we were able to process it, and now it's normalized and easy to query. But yes: I don't know if the logging table has all the fields necessary to store all the information that we have now reconstructed, but we could totally populate back some of the fields, if that doesn't break anything, because I'm not sure whether the MediaWiki ecosystem expects older rows to be formatted the old way. That reasoning went into the log migration script, and it doesn't say anything about that. I'm raising it because, in our case, when parsing the logging table we had to try different things. Say I'm trying to parse a rename log: I look in this field to see if it contains the old name and the new name; if it doesn't, I try another field with another format, and then another format again. It was a trial-and-error process, and I'm not sure whether the MediaWiki code looks at the timestamp of the event and decides, given the timestamp, which format to use, or whether it also does this trial-and-error thing. So I'm not sure we can populate everything back, but it's something to think about. It's in there somewhere.

Let's talk about Pivot. You've seen that we have the data in Parquet format on Hadoop; we also loaded it into one of the tools we use, called Druid, which has an interface called Pivot that allows us, and you, to play with it in a graphical way, and we'll give you some examples. This is a quick way that I like to explain it: this is pageview data, which is a little bit simpler, so I like to do this demo, it's real fast. In the summer we had this bug where we saw a weird pattern in pageviews: you see there are kind of normal humps, and then
there's this weird middle part that looks funny. If you look at just Chrome it's more pronounced, and then you look at the browser version and you see this pattern. What this is: the blue line is Chrome 51, and you can see it upgrading to Chrome 52, so that's a very clear pattern. But then the purple line is Chrome 41: why is Chrome 41 all of a sudden getting tons of requests? Brandon Black figured out it was a TLS handshake error in a Windows 7 update or something that broke the way Chrome 41 works, and it caused all these requests to come in that looked like requests to our main page but were not. There's the bump, and it's super fast to find it.

The idea with Pivot is that you have dimensions, which are things that have multiple values, like page title or browser version and things like that, and you can split or filter by those; and then you have measures, what you're counting, in this case the number of views. Applied to MediaWiki history, we have revisions. As Joseph and Marcel were explaining, the event entity is one of the dimensions: it can be revision, user, or page. In this case we're looking at revisions, about a year or so worth of data, and the measure here is the event count.

You can now start splitting by the things we distilled from the data. One of the things Joseph worked on is figuring out which revisions were reverted. Looking at the SHA-1 checksum of the revision text: if you see one SHA, then a bunch of revisions, and then the same SHA again later, that means someone undid back to that revision, right? If you do that on today's databases, you have to join the revision table to itself, huge table to huge table, and figure out where those SHAs match up and do all the math. Joseph's process did it for us, and we now just have a column called revision is identity reverted. If you add it here you can see, of all the revisions that have happened, the yellow line at the bottom is the reverted revisions and the blue line is the ones that were not reverted. So this gives you an idea of how much of the work gets reverted.

Now, there's an interesting thing that Aaron Halfaker looked into. He said these are not all real revert cases: maybe some content made sense for a while and then went away. So there's another way to think about meaningful edits, and he came up with this idea of a productive edit. He did some measurements and found that if something is not reverted within 24 hours, it's generally considered to have mattered to the people working on that page, for enough time that it stuck. So we have that as well, a revision-is-productive-edit field. If you wanted to compute that today, you'd have to do the same SHA work and then also check the timestamp to verify that the revert came no more than 24 hours later. In our case we can just drag this in and split by it.

An interesting question that I thought I could show as an example: there are fewer non-productive revisions than there are reverted revisions, right? So you can filter by reverted equals true and split by productive, and what that gives you is the revisions that are productive but were also reverted. You can see that here, and you can get this kind of insight, mixing things for which you would otherwise have to join a billion tables, which never worked.
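A hedged sketch of the self-join being described, simplified to show the shape of the problem rather than the exact production logic (the real job handles ordering and edge cases much more carefully):

```sql
-- A revision `mid` is identity-reverted if some later revision `late`
-- of the same page restores the sha1 of an earlier revision `early`.
SELECT DISTINCT mid.rev_id AS reverted_rev_id
FROM revision early
JOIN revision mid  ON mid.rev_page       = early.rev_page
                  AND mid.rev_timestamp  > early.rev_timestamp
JOIN revision late ON late.rev_page      = early.rev_page
                  AND late.rev_timestamp > mid.rev_timestamp
                  AND late.rev_sha1      = early.rev_sha1;

-- On the denormalized table it is just a pre-computed column:
SELECT COUNT(*)
FROM mediawiki_history
WHERE event_entity = 'revision'
  AND revision_is_identity_reverted;
```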
That's one example. One of the interesting things here is that we are not experts in looking at this information: we try to provide useful data for people to use, but we are not the real users, not the real power users. Still, that information is computed in a matter of seconds, compared to what it took before. And note that we're not even filtering by project here, so this is all projects; to be more relevant we should add another filter saying which wiki we actually want to work on. What we expect from the work we've done is really a change of usage, in the capacity for analysis, for analytics. Any questions? Any thoughts?

This is what I was going to go over next. This is January 1st, 2001 to October 30, 2016, and you can see there are 3 billion revision events and 1.1 trillion text bytes of diff. I wanted to go over this field real quick and show you the graph of it over time; that might take a little longer because it's the whole time span for all wikis. Revision text bytes diff means: take the revision, take its rev parent ID, and compute the difference in bytes, because every revision stores its byte length at that time. So this is the diff, the number of bytes. Note that this diff is pre-computed: the link with the parent is resolved beforehand, and as part of the processing we pre-compute all these metrics. That's why the pageview data source is easier to demo, because that's just view counts.
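For reference, a hedged sketch of what this diff costs on the raw tables versus the pre-computed field (rev_len and rev_parent_id are real MediaWiki columns; the cast guards against negative results on unsigned integers):

```sql
-- Raw tables: self-join each revision to its parent and diff the lengths.
SELECT r.rev_id,
       CAST(r.rev_len AS SIGNED)
         - CAST(COALESCE(p.rev_len, 0) AS SIGNED) AS text_bytes_diff
FROM revision r
LEFT JOIN revision p ON p.rev_id = r.rev_parent_id;

-- Denormalized table: the diff is already materialized per event.
SELECT SUM(revision_text_bytes_diff)
FROM mediawiki_history
WHERE event_entity = 'revision';
```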
In the measures selection here we have lots more measures, like the byte diff. If we want to graph this over time we can add a split on time, and this will take longer; let me pick a smaller wiki. No, it's okay, I can talk while it works. The graph mirrors, if you know it, the editor decline: there's this big rise in content and then kind of a slump, but then you see this crazy stuff on the right. Maybe bots? We can add bots to the split, I have no idea; I'll just kill this thing. You can split by bots and see if that changes later in time. Even for the whole history, you can get a giant graph that looks meaningless, but you can see general trends and zoom in if you want to look at a more detailed time span. I can show the other stuff and come back to check on it.

This is another example, looking at page events instead of revisions. These are page events for just a couple of months, but look at the bot split: these are page moves, and before the bot split you see four humps, four kinds of spikes, and looking at the bots you see that one of them is not really explained by bots but the other three are. That shows you the importance of splitting and really digging into the data. Same for users: you can look at whether the account was created by the user themselves, split by that as well, and see that users not created by themselves have wild variation, while created-by-self is more normal. It's CentralAuth creating them, or another user adding them. That's basically the CentralAuth thing: the way it currently works, if you have an account that you created yourself to edit English Wikipedia, and you're logged in and visit another language wiki, by default CentralAuth will create that user on the other wiki for you, even if you don't edit, just by visiting.

This thing seems to have finished in the meantime, and maybe it makes sense to look at what's going on with bots lately, because they definitely have much more spiky behavior there. And then there's also this negative spike, where apparently content was deleted. If we wanted to review it, what do you want to zoom into? The drop? Which years? Okay, this is difficult; we can zoom in like this, and you can see this is around 2012 to 2013 here, and these spikes are around 2014-2015. Yeah, that's what I would guess as well. 2005-2007? It's still the same; we still don't know. You can only split by one thing at a time in the line graph, but there are tables and such; they're not as fun to show, but you can split by more things and get the breakdown, kind of like Wikistats does it.

I need to spend a few minutes, if you want, answering the question: how is it so fast? Okay, I'll split by wiki; you can talk over that. So this is Pivot, an interface over a backend system called Druid. Druid is an open source, analytics-oriented datastore that allows very fast querying of time-oriented information. You can't put just any type of data in Druid: it needs to be time-oriented, time series kinds of events. And it uses the latest knowledge we have on how to store data efficiently when you want to retrieve it efficiently and pre-compute things: columnar storage, indexing, and so on and so forth, to make this kind of computation very, very fast. It shards: if you have more requests or more data, you add nodes, and the data and the computation get spread over the nodes naturally. For the examples we have here, the data is about 1TB globally: the denormalized data stored in Druid is 1TB, we have 3 machines, and it's fast. That is thanks to the format the data is stored in inside Druid, and to the fact that Druid is smart about caching: it caches intermediary information so it can be reused by other queries. It's a great, great tool. We introduced it for pageviews at the beginning, and now we put every interesting dataset we have in there, as long as it fits.

The right side is a different metric; try this one. Yeah, it's splitting by wiki, and there are 800 of them, across all of history. What it does is a top 5, but it kind of computes the rest too, because it orders them; I think it computes a top 50 and shows the top 5. The requests are done in ways that are much faster: the first thing it does is an approximation of the top wikis, and then it gets the exact numbers for each of those top wikis in order to display them faster.

So, the next steps we want for this dataset. This one is extracted from the analytics store DB, which is not labs, which means it's not sanitized, so it contains data that is potentially private, and we can't expose it to the public yet. Now, yesterday there was a presentation from Jaime explaining that we have brand new servers on labs, so we'll be able to take the information, process it, and push it back without having to rebuild the sanitization process over the dataset we have, because it's a very complex one. That's one of the next steps: do the same exact thing with sanitized production data. And then we also have another idea: use this information to provide the regular metrics that are needed and wanted by our community, because the Wikistats project is getting deprecated, I don't know if you all know that, but the
idea behind this dataset is to provide information for users to be self-served, kind of, with the raw data, and we are investigating that as well. This is an example of metrics that have been computed with the dataset: we have monthly new editors by project, and there is a full set of metrics. We haven't shown you all the SQL queries that compute those metrics, but we have seen a big reduction in the size of the queries with the new dataset compared to the previous one, and the fact that computing for all wikis is now just a group by wiki makes it really, really simple, instead of having to go over all the databases.

And, as I was saying, another thing we are investigating is playing with another backend, called ClickHouse, that allows very, very fast SQL queries. One downside of Druid is that the way you query it is not the usual way: it's not SQL, and it's a bit cumbersome when you don't really know how it works. It is JSON: you send it JSON and it replies with JSON. And there are things you won't be able to do in Druid: of all the SQL queries you can think of, Druid can't do all of them. For roll-up analytics like we were showing, slicing and dicing, it's great, but for very specific analytics SQL is better, so we would also like to have a fast SQL query engine to be able to do analytics fast in SQL. In the meantime we have Hive, we have Hive and Spark. It's a bit slower, not as fast as Druid, obviously not, but it's reasonably okay: you get results in a matter of one or two minutes over the full history.

Real quick about Hive, if you're interested: these are the tables that are Sqooped in, and one nice thing about them is that they are joined across all wikis, with the wiki DB as a column in here, so if you want to query across all wikis you can do it even in the old structure. And then there is mediawiki history, the new table. Am I blind, am I not seeing it? Oh no, it's there at the top. So yeah, you can check out the schema, it has comments for everything, and this is also documented on Wikitech, so you can take a closer look. Can you do a select of some of the stuff from mediawiki history?
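For instance, something along these lines, a sketch that also shows the point above about all wikis being just a GROUP BY (field names as presented in this talk; check the schema comments for the exact names):

```sql
-- Monthly new editors, for every wiki at once.
SELECT wiki_db,
       substr(event_timestamp, 1, 7) AS month,   -- 'yyyy-MM'
       COUNT(DISTINCT event_user_id) AS new_editors
FROM mediawiki_history
WHERE event_entity = 'revision'
  AND event_type   = 'create'
  AND event_user_is_created_by_self
  AND unix_timestamp(event_timestamp)
      - unix_timestamp(event_user_creation_timestamp) <= 86400
GROUP BY wiki_db, substr(event_timestamp, 1, 7);
```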
Just to show that it works. And this is Hive, so this is a computing environment parallelized over Hadoop. Oh look, nope, it's wrong: it's an entity, not a type. Yes, there it is; you can control it. Hive takes a long time to do things; it's slower than Druid, and this is why we prefer to demo in Druid: it's way more fun, plus the fact that it's visual. Do you prefer the Hive command line? I prefer Hive. I don't know if we should let it run; it depends on how far the job got. If you let it go further, interrupting wouldn't work, because once it's launched in YARN I don't think you can cancel it that way: the first interrupt tries to kill it, and that wouldn't work, but then if you interrupt again it kills the JVM and just takes the whole thing down.

About how the data is stored in Hadoop: the dataset we have in Hadoop is about 380 gigabytes for the denormalized part, 1,165 files stored in Parquet. The reason is that Parquet is very good; it's similar to what Druid does, a columnar way of storing data that is very fast to read. It's not the point for me to explain how Parquet works now; if you want, we can spend time on Parquet later. I would resist the temptation.

2,967,000,000: this is the latest count, the 3 billion in Pivot. It's not just creation? It's everything about revisions; for revisions we only have creation, nothing can happen to a revision. Well, we do have the field is deleted: when a revision is deleted, the field is present in the denormalized table, but we preferred not to add an event type revision-delete, just to store it as a field. I don't think it's a real event entity; it's complicated, actually: revisions can be archived with the deletion of a page, and that's a different thing from deleting or hiding or redacting revisions, and we're not considering redacting or deleting revisions right now, but that's totally something we could consider. The main thing that I think we're excited about is the process that takes lots of disparate, scattered data and basically shapes it for people who want to ask questions about it. If people need fields to answer their questions, we can add fields. Is deleted, yeah, it should probably be something like is-belonging-to-a-page-that-was-deleted-and-archived, or something; let's explain it in the comment.

Have you got any plans to extend this? I'm not sure how much information we have for it, but things like file storage and file usage, for images for instance, and then getting this more toward the content type of contributions, rather than the higher-level entities. Well, the main driver of this was Wikistats, so the metrics that are computed there informed what we did here, and they're not very much related to content, though in some cases they are. We do want to bring in the content, and other interesting meta tables like pagelinks and templatelinks and things like that, and do the same kind of thing with them. It's just going to be more work, like what we did so far. Yeah, many of these things we traditionally didn't do, also because we simply didn't have any options for it, other than raw consumption, I guess. But consumption and creation of the content is a very difficult part to define, and it's really important to what we do. It's very tempting to stop at "this is how often things are being viewed" or "this is how often things are being created in a revision", but what is actually inside a revision is also very important, and maybe much more important,
and it's much more difficult. But basically, by not having those statistics, we are sort of fooling ourselves about what's important. Sometimes we think something is important, like page creation, because it's something we can actually show, we can find that information, and that is what makes it important, but that's backwards: who says page creation is actually the important thing? Are you thinking of things like the number of references? I'm thinking about the complexity of a page, about things like the number of images on a page, or of different types of objects; the amount of internal linking and outbound linking; these kinds of things that we could collect that create a more detailed picture of the structure that is our content, instead of the high-level definition of "there's content there somewhere". It's on the plan, but it's very complicated, and the reason is really not so much the analysis: as you said, we can analyze wikitext and extract the information. It's scalability issues, and this is where the core of the thing is. The concern with scalability issues is that you only hit them when you run the full job, and that takes a long time to run, so the iteration process is kind of slow. It's on the plan; we'll get there.

You could at least make recommendations about what we should be changing in core to further enable this. The fact that we track the number of bytes added to or removed from a revision came out of requests, if I remember correctly. We'd heard from teams doing analysis about what makes a damaging change, and it's very easy if you know how many bytes have been removed or added; that tells you something about the likelihood that something is bad. It was a very simple metric; that's basically why it got added, there's no other reason. But we could do that in multiple places, collect more than one type of information; if there's no request, though, I guess it will never happen, so this is also a chicken-and-egg problem in that regard. Totally; one of my personal goals was to break the chicken-and-egg situation.

There's a good example of this: Wikistats tracks deletion drift. The active editor metric for August 2007 changes today, because a page that one of the people who was an active editor back then edited has been deleted today, and therefore that person is no longer an active editor, so the metric drops by one. That's called deletion drift, and it happens because Wikistats gets its data from dumps, which only have the current situation and cannot compute around that without something super hard. Doing this makes it less easy to have an excuse for not computing certain things.

And also, on the chicken-and-egg problem: not having access to this information yet prevents us from saying, look, this is what we need, because it's expensive to add fields to the MediaWiki schema and so on, even for an analytics purpose. By trying to do backend analysis like this, we are actually trying to provide a way for our users to use this information; we will get feedback on the information that is used, and at some point we'll be able to say: you know what, this is the information we need to add to the schema. At some point we will know it, we can actually know it. Yeah, and you said we should look at putting this back into core, and I would love to do that, but I feel like we failed somehow in getting people
interested in this session, at least core developers, and that's what I think we're looking for guidance on: how do we strap everyone into chairs and force them to listen to this presentation? Not only that, but re-merging that information back into the database should be no more than: we've been playing with this and this and this, and look, we did all this work to extract all this information out of it, and we now think we understand the data inside the database well enough to say we can reconstruct this data, this is how we should be doing it, are there any objections if we write a PHP script to basically insert all that data? I don't think anyone would very much object to that; it's just that they won't come asking for it. Would they ask what people are going to do if it isn't there? No, I think the biggest thing is there has to be a use case, and I think there's clearly a use case here. Yes, because it's a more sustainable, long-term, canonical version of the data, right? Right, but I think in some cases we'd have to change the code that writes the data in the first place, and then backfill what we have so far. Yeah, but I mean, we've manipulated the logs often enough; there are like five or six maintenance scripts in the MediaWiki maintenance directory manipulating old formats on the live tables. Yeah, I think there's probably an unfounded fear on both sides: we're scared to get into core and try to merge patches, and people are probably scared to get into the data. And it's mostly a thing of someone needing to do it, which is true, and someone needing to make sure that nothing breaks and whatever. But you've already built a path forward in figuring out what is where and what it means, and that's usually the biggest problem. Everybody agrees we should definitely add the page ID to the logging data, but nobody really understands what the consequences will be, what it all means, and where the good data is versus where the bad data is; and especially that last part you've already figured out completely, so the chances of making mistakes when writing back this amount of data are significantly lower now. Also, to be fair, there is a percentage of error in the reconstruction we do. Obviously we consider it small enough; the number of events we don't match is small. We might have a shorter lineage; what I call lineage is the chain of names of a page through time. We never have too long a lineage, but sometimes a shorter one, because we didn't manage to link everything back. Everything that exists today is included, and of the logging events we applied, I think, 99.5% of them in all cases; some get discarded because they're just nonsense.

And this, just as a user, is why this kind of work would be very useful. One thing that I find very frustrating about logging, and I was talking about this earlier, is, for example, page move logs. Say I'm looking at a page named X today, and I want to see the history of this title. There are pages that are extremely interesting; the one I was looking at just now was the page on Hillary Clinton, because there has been a years-long fight over which page gets to be Hillary Clinton, and all that kind of thing. So if you want to understand this history, you say: I want to look at the log for Hillary Clinton, and that just shows you the page moves into that title.
It doesn't show that the page was moved from that title to something else; it only shows moves in. Plus, it only shows the history for the time when the page Hillary Clinton was at that actual name: imagine it was vandalized by being renamed to something completely different; you don't even have a trace of that in your log. And this is why we said the easiest way to do it, in the logging table, if we add the page ID information, would be: you get the ID of the page, and you ask for the log based on the page ID. That would be the correct way to have it, and normally we should be able to do that.
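With the denormalized dataset, that lineage query already works today. A sketch (the page_id value is hypothetical, and field names follow the schema as presented here):

```sql
-- Full rename lineage of one page, keyed on its stable page ID.
SELECT event_timestamp,
       page_title,          -- title at that point in time
       page_title_latest    -- current title
FROM mediawiki_history
WHERE wiki_db      = 'enwiki'
  AND event_entity = 'page'
  AND event_type IN ('create', 'move')
  AND page_id = 12345              -- hypothetical ID, for illustration
ORDER BY event_timestamp;
```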
I think if you created a nice, complete summary of what this would bring, what we'll have in there, what can be done, and a picture that shows what you're trying to do, and you communicate it well enough, a lot of people would be interested in getting that. It would require development, and it's harder to develop, especially because you're backfilling: you're not just saying "let's make this change so it's better in the future"; there's the consistency between how it worked and how it works now. We did that many times before, I mean, the user creation log was backfilled from 2005 back to 2003, and before that it was simply impossible to reconstruct. Tim Starling was telling me there are apparently dumps that he did of the early days, when, before MediaWiki, it was UseModWiki, some Perl thing, and there's like a year of data that he put in XML format that's not imported into the production database, so we could import that. And I think there's been a task for a very long time about reconstructing the early user creation dates based on first edits, even for people who didn't edit: you can say that a user ID was around between the user IDs of other people who did edit. So when we don't know the user creation from an event, we use the first edit of that user as the creation date. And this is exactly what we wanted to see; I think a lot of English Wikipedia editors would want this. Okay, so the first and next step would then be to use the labs DB and provide that information. Yes. I guess we wanted to do the tech talk and explain, but maybe it's better to just show examples and point them all at the same thing: look at all the cool stuff that we found, we want this. So if you have ideas, just keep shooting them: you can do a page history, a user history, whatever you want to do. If I turn this into a blog post, then enough contributors will come complaining that it might actually just happen. Okay, yes?

Yeah, first of all, congratulations, it's amazing. Thanks. Okay, first question: there's also amazing work being done with the other dashboarding tool, the visualization tools, and also the graph extension, so there's amazing work on several visualization tools. Is there any thought about convergence? I mean, I know you have to do some sanitization first, but beyond that goal, is there also the goal of making data from this accessible to the graph extension on wiki, or to the other dashboard tool? For example, some sanitized version of this, or a version of a query API that the graph tool could access? Yeah, the thought behind how we're building Wikistats is to put a query API on top of the data, with canned queries that respond with what Wikistats needs to render its graphs, and that of course would be available for the graph extension or anything else that can issue a simple client-side query and get data. The graph extension does that well; Dashiki does that well, Dashiki being the other dashboarding tool, so Dashiki metrics work. Complex arbitrary queries, those are pretty complicated. And then for the slicing and dicing we have Pivot, which is third-party software we're not really maintaining; we think it's feature-complete enough to do what people need to do, and there are a couple of bugs we might fix. So yeah, the first step toward providing the data we have is to pre-compute the metrics that we know are useful for people and provide them through graphs; this will be the first thing, and it already exists. Then providing an external version queryable by people, like in Quarry, we also have as a goal; we want it, but it's a bit more complex.

Certain things are really common, like the number of edits on an article, or a time series for a certain page; is that something you can provide? Every time we have to deal with per-page statistics, that dimension has tens of millions of entries. If it's by time overall, fine, those numbers work; but if it's daily numbers by page, we get into trouble. We had the same problem with pageviews: it's just a lot of data, terabytes of data, and for this kind of information you definitely go to Hadoop or a fast SQL query service, because the engines that allow pre-computing, like Druid, don't deal well with huge dimensions. I guess you could have some sort of higher-level bucketed version on top of it for certain specific use cases, but in a general sense the data is just too much, in a generic way. And it depends: if we want to put like 10 machines behind this dataset, with a system like Druid or ClickHouse, it will be super fast, everybody can use it, almost cool; but it's also a lot of money. Yeah, I think we should be very transparent about how much money these things take, and when people have questions, just respond with a price tag: yes, we want to provide that to you.

So, one thing I was really wondering about, and this is something Wikistats did a little of: how are you going to deal with changes of definitions, or changes in the data, like finding problems inside the data and rectifying them, like "we missed this part"? How are you going to log this and make it transparent to outside users, researchers, so they can take it into account in their work? Have you thought about this? Yeah, I mean, right now we just recompute everything all the time, because we're Sqooping the full data,
so any problems that we find, bugs that we fix, just go into the new numbers, and we'll simply say these numbers change over time. If we significantly change something, like the definition of a metric, then we'll do a blog post, we'll try to evangelize the change, or whatever the case may be. But yeah, small bug fixes... Maybe you should just have a log, like an "analytics of Wikipedia" log, for some of these things. For the moment, take the example of pageviews, because it has a longer time span of existence: we log the changes we make and the problems we have on a wiki page, and it's accessible to anybody who wants to see the various changes, and obviously the same process will happen for this dataset. It's actually annotated, too, I don't know if that's just for the web request data. And also, the way we track pageviews, there is a version tag on the table revisions, so when there is a significant change the version tag changes; we should probably have that on this dataset as well. Yes, of course.

I just want to real quick check with the folks who haven't spoken, no pressure or anything, but are you looking for anything out of this? We are almost over. Any questions or thoughts? Okay. The annotations we were talking about at one point: is that exposed at all? No; there's a Phabricator ticket where we talked about putting this in AQS itself and then being able to query it. Yeah, that sounds like a good idea to me; we just never got around to it. We have a ticket for that, but the number of tickets we have for things to do is just huge, so it'll get done at some point. Often, something like these annotations has to do with reliability and the promises that we make toward researchers, for instance; that's very important for the credibility of the dataset. So I think something like that, even though it might look not so important, is actually very important for credibility, and I think for that reason it would be a much higher priority in my mind. That makes sense to us. This is where you can interact with us: there's always Phabricator, the Analytics project, throw a task in there and put a little more pressure on.

Sorry, I don't understand. Yes, we're talking about the blog post, about the Hillary Clinton example and every move or rename a page goes through. The API currently gives you the current title; maybe now, with the dataset that you have here, you could utilize that. And I think this would take a quarter to have in an API? Yes, it would take a quarter. I was just going to query the table now to get that, of course, but yeah: completely doable, but a very big query. In terms of reconstruction, no: we'd have to feed that back into the database; we can't query it one by one. So what we would do is try to figure out how it would fit in the schema and make it accessible to MediaWiki itself. And the pageview API, are the plans such that I could find the pageviews for all of these titles as well? Right, so I think the path there would be: put this data back into MediaWiki, make an API call, like an action API call, that gives you the full title history, and then use that together with the pageview API or something, because it's of general use, the history of a page title; and querying directly in the analytics store doesn't make sense for things like that, which need production timing. This is actually another good point, another good benefit of
merging back the data. So that's another thing that could be mentioned in this email, in the blog post we talked about, which would basically trigger the community to ask for it and maybe make it happen. That's exactly what's been happening with pageview analytics: they've been asking for it for quite some time.

What about the renames? What about them? It is tricky, though, because renamed pages are sometimes deleted and then maybe created again later, so it's not the same page; the history of a title might pass through a redirect to a different page. Theoretically it's possible: if you query all the pages for the previous titles, you can theoretically fill in the gaps, unless there's a situation where there was a redirect, that redirect page got deleted, and then a new page, a totally separate page, got renamed to that name and now has that name in its history. It's pointing there, but it's not the same page. That's not the same page, but maybe, if it's broken down as a response in the API, people can use their own judgment to figure out the correct way to look at it. From the pageview standpoint, we'd need to have the pageviews per page ID, and we don't have that yet; that's the thing we're missing, pageviews per page ID, for mobile and for desktop. Once we have them for all the pages, it will be much easier to get pageviews not only per page title but per page. The reason it will still be problematic is historical calls, because we won't have per-ID pageviews for history; we'd have to rebuild the history of pageviews. Doable, like what we have here, but that's another big set of big queries.

We're over time, so if you want to go to the next session, thanks so much, thank you for being here; let's follow up. I just have a question, basically the two questions I had. One was: is it currently possible to use the same technology and infrastructure you have set up for this for other topics, for example, for other kinds of questions, for other applications? The technique is about as generic as you get; it's data warehousing, or whatever you want to call it. Yeah, but I'm asking in terms of seeing how this is set up and using the same services. Yeah, right, I guess I'll punt that to the boss right there. We're working on it; it's been eight months, and it's pretty good: it's well organized, and it's there. Thank you.