Hello, we're here from the Analytics team to talk about what we've been calling the dashboarding problem, which is the set of issues our users have with dashboards: dashboards are hard to find, hard to use, and hard to set up, and it's complicated to show large volumes of data with the dashboards we have, because they're well suited to displaying timelines but not to large volumes of data. We had been gathering data for a project called Editor Vital Signs, and we decided, in the scope of that project, to start working on fixing the many dashboard issues we have. What Editor Vital Signs tries to do is come up with metrics that compare apples to apples when it comes to editing, so you can compare editor data on, say, the Tamil Wikipedia with editor data on the Catalan Wikipedia. The project had to work for all our projects, which is over 800 of them, and we were going to backfill the data four years back. So it was a lot of data to explore, and we decided that, as a side goal, we would take some time to explore dashboarding technologies.

The data we want to show is really very simple; it's only the volume that makes it complicated to visualize. It's all time series, like the ones here: one data point per project, per metric. Say our metric is pages created: for our wiki, 50 pages were created on January 1st, 2014; for another wiki it was 100. We have about 50 metrics like this one, times 800 projects, times four or five years. So it's a large volume of data, but not complicated data; it's all time series.

The first problem we had to solve was access to the data. We wanted access to be easy, and not only that: everyone should have access to this data, so we wanted the data to be public. That was the first problem we identified: we had to have a public data set to serve the dashboards from.

Problem number two was visualization. Visualization is what people focus on a lot when they talk about dashboards; they think, why don't you use this library or that other one? But visualization is not such a hard problem to solve, because you have this guy working for you. For whoever doesn't know who this is: he's Mike Bostock, the creator of D3, which is an amazing visualization engine. Many, many projects built on top of D3 already solve the visualization issues really well in the open source space. For example, there's dygraphs, which has a very restricted set of graphs but amazing interactive capabilities: you can zoom in and out, and it's very rich when it comes to interactions. There's Rickshaw; everything you make with Rickshaw is beautiful, but it also has a very small set of graphs to work with. And the one we decided to go with was Vega, which has fewer bells and whistles than the other visualization projects, but it has one important thing: it's a grammar on top of D3. If you saw what Eric presented at the metrics meeting, there's an extension for MediaWiki that lets you create graphs with that grammar, a project Dan worked on at the hackathon in Zürich and has kept working on since.
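To give a flavor of that grammar, here is a hand-written sketch of a small Vega spec. It is for illustration only, not one of our actual specs, and the exact field syntax differs between Vega versions:

```javascript
// Illustrative Vega-style spec, sketched by hand; invented data,
// and the exact grammar details depend on the Vega version you target.
var spec = {
    "width": 400,
    "height": 200,
    "data": [{
        "name": "pagesCreated",
        "values": [
            {"day": "Jan 1", "count": 50},
            {"day": "Jan 2", "count": 65},
            {"day": "Jan 3", "count": 40}
        ]
    }],
    "scales": [
        {"name": "x", "type": "ordinal", "range": "width",
         "domain": {"data": "pagesCreated", "field": "day"}},
        {"name": "y", "range": "height", "nice": true,
         "domain": {"data": "pagesCreated", "field": "count"}}
    ],
    "axes": [
        {"type": "x", "scale": "x"},
        {"type": "y", "scale": "y"}
    ],
    "marks": [{
        "type": "rect",                      // a simple bar per data point
        "from": {"data": "pagesCreated"},
        "properties": {"enter": {
            "x": {"scale": "x", "field": "day"},
            "width": {"scale": "x", "band": true, "offset": -1},
            "y": {"scale": "y", "field": "count"},
            "y2": {"scale": "y", "value": 0}
        }}
    }]
};
```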
So you could edit this very readable JSON and have the graph on the left change, which makes it a very powerful tool for visualization, because it's very intuitive. Another important reason we went with Vega: if you remember, Eric talked about how we wanted to support users without JavaScript on the wikis; we want people who don't have JavaScript enabled to still be able to see the plots. Vega can render server-side on Node, so you could run the same code on the server, come up with an image as a PNG, and send it to the client. And then you have support for people without JavaScript.

Problem number three, which is something that is not thought about enough when it comes to dashboarding, is information architecture. If you have to show data for a large number of projects, it gets complicated: how are you going to browse and access the data? We had 800 projects over five years, and you could also ask how you're going to visualize timeline data versus funnel data, for example the registration pipeline on Android. One problem we have, and the way it's handled right now in our report card, is the forever-scrolling problem. I'm not going to click on the link, because I think everyone has seen it at the metrics meetings. Limn gets a lot of bad press, but when it comes to visualization it actually does really well; it does complicated things quite well, like the rescaling of the axes here, which is not easy to do. The problem is that there is a lot of data that matters and you have no way to know it's there: you need to scroll one page, a second page, and there's actually a third page to scroll.

So we realized that problem number one and problem number two, access to data and visualization, are engineering problems that our team can solve, the second one with the help of Mike Bostock, right? But information architecture is actually a design problem. So to solve that part, we teamed up with the Design team, who can probably solve it better than we can: while they design the browsing interface for this project, we work on the other two problems.

This is the tool that Pau designed for us, and we're going to demo it a little bit in case you haven't seen it; Dan will drive on my computer. We can try to add a project and remove a project. You can browse the projects we have, and languages; Wikidata is already there at the bottom. So you can browse Wikipedia, say, and add a language: say the Esperanto Wikipedia, maybe at the end there. I always look for Esperanto. Through the browsing you have access to all the projects we have, all several hundred of them. You can browse projects by language, by database name if you know it, or by the project itself. And every project has a number of data points backfilled. So far we don't have that much data, because we have about a couple of months of data. Yeah, not even a couple of months yet.
But the idea is that we avoid the forever-scrolling problem and give users who are already familiar with our projects an easy and intuitive way to browse the data set.

And how do you look at a different time series, or a different time window? Right now, because you just started backfilling, you only have data going back to the end of August, so you have editor data and new registrations for all projects showing the whole time span. If I just want to look at the last week, how do I do that?

Yes. Once we have enough data that you'd want to look at just last week, we will have a time-browsing component, which we can actually show on Pau's prototype. This is not the real tool; it's the prototype, which we're working to complete. For time it has a slider, and it has dates. Obviously, since we don't have that much data right now, it doesn't make sense to have it yet, but we will work on that one next. And we can go back to the presentation, I think. Where's the clicker? I'll just move with this one.

When we were thinking about how we would access the data: one problem we have right now with Limn, which is the current dashboarding tool, is that the data needs to be fetched on one machine and moved to another machine, and only after that is it available over HTTP. So we decided that the outcome of our query tool has to be data that is queryable over HTTP. All the data should be on an HTTP endpoint, which means that not only can we plot it with the tools we have, but anyone with access to that data can fetch it and plot it themselves. All the data the dashboard shows, with all the backfilling, is available on this endpoint. Right now our tooling is all on Labs, and the data is on an NFS mount that is backed up. So it's all flat files on Labs, but anyone has access to the data set.

The data is available in JSON format. We debated this for a little bit. JSON is easy for the browser to parse because it's native to the browser; the browser can interpret it with very low-level code. Having JSON also allows you to mix metadata with the data, and although the metadata you see there is quite cryptic for now, it would be easy to extend it so that you can get this file and it describes what is in it.

After thinking about the problem for a little bit, since the application is heavy on the client side, Dan was thinking outside the box and we asked: why couldn't we make it serverless? That simplifies the stack greatly, because you don't have a middle tier at all. We wanted the data to be public, and a middle tier is often used for cleaning the data up, reprocessing it, and adding authentication. We didn't need authentication, and our data was already easily digestible, so we decided to go without a middle tier. The application is just a set of JavaScript and HTML files that anyone can download; if you set it up on a port, it already works. And one advantage of being serverless is that you must have an API: you don't have a server, which forces you to decouple the visualization piece from the data-fetching piece, because you don't have any other option. You have to code that way, because otherwise things will not work.
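As a sketch of what that looks like in practice, here is roughly how a client might consume one of those endpoint files. The URL, file layout, and field names are invented for illustration; the real schema may differ, and this assumes jQuery is loaded:

```javascript
// Illustration only: invented schema and URL.
// A metric file served from the public endpoint might look like:
//
//   {
//     "metadata": { "metric": "pages-created", "project": "tawiki" },
//     "data": [
//       { "date": "2014-01-01", "value": 50 },
//       { "date": "2014-01-02", "value": 65 }
//     ]
//   }
//
// Because it is static JSON behind plain HTTP, any client can fetch
// and plot it without going through our application at all:
$.getJSON('http://datasets.example.org/pages-created/tawiki.json')
    .done(function (file) {
        console.log('Fetched', file.metadata.metric,
                    'with', file.data.length, 'points');
    });
```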
So I'll describe the stack a little bit. The request comes in, and the web server is Apache. Behind Apache we just have a set of JavaScript and HTML files that were built, and those files access the data store: not directly, but with HTTP requests. For the configuration of the dashboard, like the defaults (which metrics show up initially) and the metadata that pertains to this dashboard or any other, we're thinking of using MediaWiki storage, so that we can change the default configuration of the dashboard without having to touch the code. We haven't built this part yet; we've built the rest.

Something important to note: because this is just a static JavaScript application without a server, it's very performant. It sits behind Apache, and if the caching is set up right it can be deployed to a CDN, so it can take as many users as we want. There are no performance issues on the server side, because there's no server.

The bootstrap state of the dashboard is also JSON, and this is what we would like to store on MediaWiki. Just like EventLogging schemas are JSON stored on MediaWiki, we think it would be good to store the default state for this dashboard, and for other dashboards built on top of this technology stack, on MediaWiki. That way management or the product managers can edit the dashboards without us having to intervene. And the fact that the data comes in over HTTP means we can swap out the data storage. We could actually serve and visualize, for example, our pageview files: if we make them JSON, which is easier to interpret, we can visualize any type of file we can reach via HTTP. And I think Dan is going to talk about the rest.

Thanks, Nuria. Since Nuria took care of the hard part, I just want to dive a little bit into the choices we made for different packages, give you my experience, and let you hammer me with questions if any of you are interested in this stuff. Some of these tools are really cool. We wanted a good package manager, dependency loading to shuttle code and artifacts back and forth, a good data-binding and DOM-manipulation library, and a good testing framework. We like libraries that do what it says on the box well, and hopefully the box says something very small. It's the same kind of concept behind the serverless idea: separate the roles and each piece can focus better.

Bower is a cool package manager. If you haven't used it, it takes care of browser-facing projects better: it lets you pull in CSS and the like, whereas with npm you have to jump through hoops, Browserify for example, to reference artifacts in the browser. Require.js is one side of the Require.js versus Browserify war over shuttling dependencies to the browser. We looked at this and found a funny comment war on GitHub between the Require folks and the Browserify folks, and there's actually a spin-off project that's better than both of them, called webpack. But Require has gotten really mature, and I especially like the optimizer. What it does is look at your chain of dependencies and bundle them exactly the way you want. In what you can see here, we have a scripts.js that includes most of the stuff we're loading; it's got project-selector.js further down, and all the different configuration JSONs.
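For reference, a RequireJS optimizer build profile in that spirit looks roughly like the sketch below. Only the two module names echo the slide; the paths and options are invented:

```javascript
// build.js: a sketch of an r.js build profile in the spirit described.
// Run with: node r.js -o build.js
({
    baseUrl: 'src',
    dir: 'dist',                       // built artifacts land here
    mainConfigFile: 'src/config.js',   // shared paths/shim configuration
    modules: [
        // the main bundle, with most of what we load up front
        { name: 'scripts' },
        // the project selector pulls in a lot, so it gets its own
        // bundle, fetched asynchronously only when needed
        { name: 'components/project-selector', exclude: ['scripts'] }
    ]
})
```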
So we can bundle whatever we want to load later, the project selector being one example here: it loads a lot of stuff, so it made sense to load it asynchronously. This is just a tiny exploratory project right now, but as it grows and you're adding lots and lots of components, this makes a lot of sense.

Knockout handles our DOM manipulation and dependency tracking; all the data flows through the app with it. I love this tool, it's gotten cooler in the last year, and I think it's worth everyone having a look at it. It's developed by Steve Sanderson, who for the past couple of years has been working at Microsoft on the Azure cloud management platform, which is over a megabyte of JavaScript loaded asynchronously. If you do that without properly architecting it into components and thinking through how you're delivering it to your users, it just fails pretty miserably. Out of that experience, he developed Knockout components, which are a really lightweight way to make custom HTML elements and combine them with the logic that drives them, their view model.

The other thing in this space right now, and this is all heading toward the HTML Web Components spec, which is not ready and won't be ready for a while, is Polymer. If you use Polymer now, you end up having to polyfill like 95% of the browsers out there; I think barely the latest Chrome supports it, and it's really slow even where it's supported. As an example, Google developed it and they don't even use it for their other projects. It makes sense for them, though; it's a Google-centric idea, because with Polymer components you'll be able to just drop in a custom Google Map: a little bit of HTML, reference your API key, and you're up and running. So cool stuff is coming, but Knockout does it in a way that's backwards compatible all the way to IE6, and that just speaks for itself: if you can write something that works in IE6, it's done really simply.

What components do for us is allow us to break everything up into reusable, modular pieces, and that's what it looks like for our project right now. The project selector is over there, with a little typeahead. The metric selector is at the top, and in the middle there are actually two components layered on top of each other: a visualization component, which just takes data, translates it into Vega grammar, and makes a Vega graph out of it, and a coordinator on top, which looks at what you've selected, translates it into what the visualizer should visualize, and fetches the data.

And this is what that whole thing looks like. This is all the HTML you have to write. You can see these are basically the interfaces, the contracts the components have with the rest of the world: selectedProjects goes into the project selector, selectedMetric goes into the metric selector, and they both come out and go to the Wikimetrics visualizer. Those are observables that change dynamically as people interact with the site, and the Wikimetrics visualizer fetches the data and sends it over to the Vega time-series visualization component. I can dive into any of that stuff if anyone's interested, or I can keep going.
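To make those component contracts concrete, here is a minimal sketch of how a selector like that could be wired up in Knockout (3.2 or later). The names and markup are invented for illustration, not our actual source:

```javascript
// Minimal sketch of a Knockout component; names are invented.
ko.components.register('metric-selector', {
    viewModel: function (params) {
        // The contract with the outside world: a shared observable
        // that the coordinating visualizer component also receives.
        this.selectedMetric = params.selectedMetric;
        this.metrics = ['Pages Created', 'New Registrations'];
    },
    template:
        '<select data-bind="options: metrics, value: selectedMetric">' +
        '</select>'
});

// The page then only wires observables between components, e.g.:
//   <metric-selector params="selectedMetric: selectedMetric">
//   </metric-selector>
//   <wikimetrics-visualizer params="metric: selectedMetric">
//   </wikimetrics-visualizer>
ko.applyBindings({ selectedMetric: ko.observable('Pages Created') });
```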
A side question: I know the Ops team uses Graphite as a back end for storing a lot of JSON data. Have you tried pointing this front end at the Graphite back end? Graphite's visualizations look a lot like the design you have implemented here, so I'm wondering if you've tried for compatibility between the two systems.

No. In theory, there's no obstacle to that. That's one of the reasons we wanted to go totally serverless: I don't have to pipe stuff through any server and convert it somewhere in some random way to display it. If the data sources are made publicly available, I just point at them and write a data adapter that transforms them. The key thing about this, though, is that it's solving the information architecture problem for this shape of data: projects and metrics.

That's fine, you answered it really well. The other follow-on is: is there a simple way to share a dashboard with another user, through some intermediary or just through the URL string?

Yeah, so Nuria just wrote this; it's not the latest version of the code, so you might have seen some weird design glitches, and this is actively being developed. But yes, there's a URL. The URL now changes as you navigate, and you can just copy-paste it and it'll load. That's pretty trivial, because the thing is loading client-side anyway.

Isn't there some other way to, like, include your favorites or vote on them? I'm thinking from a social perspective: hey, I've spent a lot of time making this really great dashboard and I want to share it with as many people as possible.

No, we haven't thought that through. But in theory you'd have some server that stores that kind of information. Yeah, or the MediaWiki API, and then you would just call that.

Here's a question from IRC about SQL access: how long until we can plot stats from queries?

Stats from queries? I mean, that's what this is doing; it's feeding through the custom Wikimetrics metrics that Aaron and the research team developed. There's no barrier, basically: whatever output format people have, we'll make a data adapter for it and you can plot it. But right now it's early days. We've talked a bit with YuviPanda about integrating this with Quarry in some way, so it's definitely something we've thought about, but right now, like Dan said, we're focusing on visualizing the metrics that the research team came up with. That would be a particularly simple data adapter, because Quarry outputs JSON that's very reasonable in its format. Oh, and for those who don't know what Quarry is: it's a really cool project that allows you to do ad hoc SQL querying against the MediaWiki databases on Labs; check it out. The main idea of it is to socialize knowledge about research and probing into the analytics world.
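As a sketch of what one of those data adapters could look like, here is one for Graphite's JSON render output. Graphite's output shape is real; the "internal" time-series shape the function returns is invented for illustration:

```javascript
// Graphite's render API returns JSON shaped like:
//   [ { "target": "series-name", "datapoints": [[value, unixTime], ...] } ]
// This sketch normalizes that into an invented internal time-series shape.
function adaptGraphite(graphiteJson) {
    return graphiteJson.map(function (series) {
        return {
            label: series.target,
            points: series.datapoints
                // Graphite emits null for missing samples; drop them
                .filter(function (dp) { return dp[0] !== null; })
                .map(function (dp) {
                    return { date: new Date(dp[1] * 1000), value: dp[0] };
                })
        };
    });
}
```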
And also, to expand on that: probably the most painful part of the current setup is that Limn is basically agnostic about what type of data you want to visualize, right? And currently we have this problem of important data being scattered across multiple instances of Limn, not applying a consistent definition. So I think Limn solves the problem of generating a custom graph from an arbitrary data source pretty well; with this project we're trying to solve something slightly different, and I think that's one of the reasons why we're focusing on this use case first.

In addition to time series, can you add events? Say a code deploy that may have an impact, as standalone little bubbles?

Right. Annotations are at the top of the feature request list, for sure. Vega was a really cool choice for that reason, because it basically translates data into visuals, so an annotation is just another data source. Is Vega adding annotations? It doesn't need to; annotations work out of the box. You can just load a data source and plot it, so that's just another thing you can do with its grammar. What they are adding that's really cool is interactivity, declarative interactivity. The way Vega works, there's a research team at Stanford, and they cycle through publishing a research paper, talking about it, and then implementing it. They should be releasing code in October that adds interactivity, and that's particularly exciting for the graph extension on MediaWiki and anyone trying to do that kind of stuff.

Cool. Sorry, I'm full of questions. What about the ability to embed a graph object that you've created inside another page?

That's another one of the great pluses of being serverless, because there's no work required for that. You could just do it; it's all HTML. You can take the build, there's a dist folder, just copy it, and you've got it.

So I guess I'll talk about testing real quick and then take more questions if there are any.

One more. I'm from IT, so I have an interest in graphing data sets which probably shouldn't be public, and just having it on a private back-end system. That's not what you designed it for; I'm just curious.

I'm going to sound like a broken record, but that's another plus of having it be serverless: you just put up a firewall wherever you want and then you deploy it there. Actually, Cheryl from Fundraising is going to do just this, and we're going to support her as the first use case of someone else using this kind of technology and work with her. Her data is also private, so it's going to be behind a password-protected server that she's building custom. And, sorry, I actually forgot to mention: the idea with Bower is that we're going to publish everything that makes sense to be reusable on Bower, and people can just import it and use it that way.

So Karma is a really cool testing framework, a test runner. It'll run QUnit, Jasmine, whatever testing framework you're used to, but it'll do it really, really fast. And I love it; Nuria knows, every day I'd say, I love this testing framework. You basically open up a little shell under your code editor and you just start Karma; you save a file and it reruns all your tests. That sounds like not such a big deal, but it does it so quickly that it becomes something fun to do, and fun to write tests. I can show you some of the testing code if you're interested; Knockout components make those really easy to write as well. So the whole testing experience here was really fun, mostly because of how quick Karma is. And it was the AngularJS team that developed it.
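For anyone who wants to try that workflow, a minimal Karma setup looks something like this sketch, assuming the karma-qunit and karma-phantomjs-launcher plugins are installed; the file paths are invented:

```javascript
// karma.conf.js: a minimal sketch, not our actual config.
module.exports = function (config) {
    config.set({
        frameworks: ['qunit'],      // needs the karma-qunit plugin
        files: [
            'src/**/*.js',
            'test/**/*.test.js'
        ],
        browsers: ['PhantomJS'],    // needs karma-phantomjs-launcher
        autoWatch: true,            // rerun all tests on every save
        singleRun: false
    });
};
```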
Another question from IRC: is there a way to embed with an iframe-like approach?

Yeah, but you wouldn't really need that; it'd be overkill, because you can just copy the HTML and JS and host it yourself wherever you want, internally or externally.

So, to summarize real quick: we wanted to be very, very lightweight. We wanted to build a thin wrapper on top of open-source technology, so we factored out the work that really great people are doing, like Bostock and everyone else. The whole thing is about 900 lines of code, and we're really happy about that. The Limn approach is about 10,000 lines of CoffeeScript, which compiles into about 20,000 lines of JavaScript, so we're happy we went away from that. And the code is here, mirrored from Gerrit.

The lessons: serverless is fun, as I've said amply. It's super easy to deploy; we just git pull, and we have a symlink to the dist directory. This whole thing is predicated on CORS for our data, so you can access it publicly: we just set it up on our data sources, and we only support browsers that support CORS, the latest stuff, which is no problem for us in our world. Things like error logs aren't available, since we don't have a server to log to, so stuff like that is lagging, but we can add it. The default state we're going to move to MediaWiki, as Nuria was saying, and basically the vision for this is: if it takes off and people are interested, we can build a dashboarding tool that's controlled from MediaWiki, from metadata that anyone can edit, so you no longer have to go through the Limn workflow of pushing code, making sure that gets merged, and deploying it to the server and everything like that to get your graphs up.

So another feature request you maybe haven't thought of: sure, we have Icinga and all these other tools for monitoring alerts, but maybe I would like an in-browser alert for when something goes beyond a threshold; have it at least ping me in my browser.

Sure, yeah. Real-time is possible, but more complicated, for sure.

Oh, so these graphs are not being updated in real time?

Not while the browser is sitting there; they're not re-querying, though I guess you could refresh. You'd have to go back to your data sources, and right now in our world we just cache those for a day. They get updated by Wikimetrics as it generates the data, and then we pick them up. To do real time you would have to set up a different visualizer that does that kind of thing, and that's totally possible; Vega updates smoothly, so the client doesn't get a reload or anything. And yeah, I think that's most of the stuff, so: questions?

I think it's important to understand why we did this project in this way. It wasn't to build another dashboard, because there are lots and lots of dashboards; it was to solve a very specific problem. And I think the way Dan and Nuria and the team solved that problem, leveraging a lot of open source and writing a very thin layer on top of it to do the information architecture, was really right. It's not clear how far we want to take the dashboarding functionality. They also solved the data problem, which, as anyone who's set up a Limn dashboard knows, is the hard part: it's not necessarily the dashboarding, it's getting the data into a place where you can graph it. That's the hard problem, and that's one thing that we did solve. I guess that's pretty much it. We need to get these metrics up so the community can see them; that was our number one goal. It's not clear how far we want to go.
I personally love the idea of the in-wiki graph, because the ability to talk about it, discuss it, and link to it, if it's on MediaWiki, we get a lot of that out of the box. So, you know, that's all; I'm excited about Dan and Nuria's extension that does that.

I just wanted to show you some of the code to give you an idea of the components and how they work. So, basically, this is the Wikimetrics visualizer, the one that's coordinating the fetching and displaying. It takes what's passed in as metrics and projects, and this is basically all the code that does anything: it sets up a datasets object, which is computed on top of the metrics and projects, and when all the promises for that data are met, it just concatenates all of them together and puts them in mergedData, which is what gets passed to the visualizer. And that's it. It's very single-purpose, and you can see how this can be reusable and easy to test: you just pass different things in and it's very predictable and clean.

A question, or maybe a feature request. You showed the project selector; I think it's very powerful and solves that problem beautifully. What if I don't know how you spell my language in English? Say I speak Italian or French?

Sure. I mean, there's opportunity to do localization, and I don't think anything we did prevents that. You can have files for every language. But of course, in that component, given that there are like 200 languages, it could take a long time to load, because you would want to have every language name in every language, maybe; you may be Italian and want to look at something else.

Yeah, so I was thinking not the fully localized interface, but at least having the ability to know that "French" and "Français" are the same project. So maybe having two strings associated with each project, basically the English name and the localized language name in the original script, would allow you to type and find it directly, I don't know.

That's actually there; it's just not deployed yet.

Yeah, the two-letter code?

Sorry, I meant not the two-letter code but the full, friendly name.

Yes, there is no technical restriction on doing that. But all those UX-based decisions, as we were saying before, we'd rather defer to Design, so the design stays consistent; they decide, and we implement their advice.

And I think this is the end of the talk. If there are no more questions, I think we are done here. So thank you everyone for coming.