analytics team. Not working? Now? Hi everybody. This is a tech talk giving mostly an overview of the analytics infrastructure that Wikimedia uses to process data. It's mostly going to cover a few different technologies, how we've planned and set this up and how we'll have it set up in the future, a little bit of history about where we came from, and how to use it.

So, the data sources that Wikimedia analysts need to access in order to do analysis include, as far as I know, mostly the MediaWiki databases and the web request logs. The databases are great: there are a bunch of slaves set up for analysts to run queries on, and as far as I know that mostly works great. People use them all the time to do a lot of analysis. The web request logs are a very large data set. These are a...

You can't see the slides? Okay. Nuria just posted, or somebody just posted, that they're not sure the slides are being shared. Shared? I can post the slide deck; I think I can post it to the public channel if it's not there. Should I continue or should I wait? You can continue. So basically, we need to select the window with the slides. I think everybody is seeing the slides now. Okay. Oh, because my mic is selecting, I see, my window. Probably. Because I'm using this mic. Right.

So, web request logs. This is a log of every single HTTP request that hits a Wikimedia Foundation website, which is a lot. You can see there that it sometimes maxes out at over 200,000 requests a second. That little spike in the graph there I like a lot; it's very appropriate for the timing of this talk, because that was during the World Cup final. We had a huge spike right there.

Historically, web request logs have been collected via a piece of custom software that the Foundation wrote called udp2log. This software actually works fairly well, but it doesn't scale very well. As Wikimedia has grown, there have been many cases where the udp2log daemons running out there on servers collecting all the web request logs drop packets, or people change the configuration slightly for a new use case to collect different types of data, and things just won't work. One of the big problems with scaling the way this system is designed is that every single instance that's collecting this data has to process every message: every packet of all this data has to come through each box's network interface and then be processed. It's also UDP, so delivery over the network is unreliable, and it's hard to tell whether or not all the data is actually making it through.

As for stats.wikimedia.org, many, probably most, of the stats on there are actually generated from this udp2log data. There's lots of page view data in there. Erik Zachte has worked a long time on getting this up and it's been around for a while, but the data that's there is generated from udp2log-collected data, and that data is sampled. One of the hard parts is that, because there's so much traffic, we don't have the storage space or the capacity to actually process unsampled logs. For a lot of statistics that's fine, but for smaller wikis and for other statistics that require unsampled data, this doesn't work well, and it kind of limits the types of analysis we can do. So we would like to have a system in which all web request logs are saved for easy and fast analysis. That's what we're trying to go for here.
So I'm going to describe the Analytics Cluster as it is. We use Hadoop for batch processing of logs; I'll explain what that is for those of you who don't know. And our current direction is that we're mostly using Hive to make it easier for analysts to get at this data inside of Hadoop.

Here's a bunch of buzzwords. There are a lot of buzzwords, actually a lot more than this; I limited it to a few here, but I'm going to talk about each one of the words in this little cloud. This diagram will pop up a few times, and I'll try to describe each bit of it.

First, we'll talk about Hadoop. Hadoop is really just two parts. It kind of refers to a whole ecosystem of software frameworks that allow for distributed analysis of data, but the main two pieces are a distributed file system (you have a bunch of nodes and there's a file system mapped on top of all of them, so it looks like one file system even though it's distributed across the network) and a framework for doing computation on that distributed file system. The idea is that you write code, and rather than putting all the data in one place and having the code process it on one node, the code is farmed out to the nodes where the data lives. So you write a job, the system knows where the data you want to query is, it sends out the code, runs whatever analysis the code is supposed to do on the data, and then brings you back the results.

In this diagram, this little bit over here is the Hadoop cluster we're talking about. We just got a bunch of new hardware; it was actually installed last week, and as of Friday and/or yesterday it's coming back online with a later version of the software. We currently have 22 data nodes with about 770 terabytes of storage and a total of almost a terabyte of memory across the whole cluster. Which is great! A lot more storage capacity, which will allow us to keep data for longer and do more analysis.

So a hypothetical question that some analyst might want to answer is how to get the top referrers for a particular article. A great way to do that is with Hive. We use Hive to map SQL-like tables onto the web request data in Hadoop. That allows people to write SQL queries that then get translated into MapReduce jobs (MapReduce is the framework that runs in Hadoop to do distributed analysis), so it's a really handy interface. It looks like any SQL you might see if you're writing a MySQL query, or any kind of SQL in general. There are some limitations to what Hive can do here and there, but for the most part a lot of the simpler use cases are supported.

This is a query that Oliver actually wrote; I found it on a Wikitech page that he made to compile a bunch of useful queries he was writing. You can see that this query will do what we're trying to do: get the top referrers for a particular article. This query is looking at the London article on English Wikipedia, collecting the top referrers there and counting them. What's cool about this is that it actually lets you run SQL on top of text data. The log data that we're importing is JSON, and there's configuration that tells Hive how to read the JSON into a particular format so that you can write SQL and query it.
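The query itself isn't reproduced in this transcript, but a Hive query along these lines might look roughly like the sketch below; the webrequest table name, field names, and partition columns are assumptions made for illustration, not necessarily the schema shown on the slide.

```sql
-- Sketch only: table, field, and partition names are assumptions.
-- Top referrers for the London article on English Wikipedia for one day.
SELECT
  referer,
  COUNT(*) AS hits
FROM webrequest
WHERE year = 2014 AND month = 7 AND day = 15
  AND uri_host = 'en.wikipedia.org'
  AND uri_path = '/wiki/London'
GROUP BY referer
ORDER BY hits DESC
LIMIT 25;
```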
This middle part of the cluster here is called Kafka. This is what we're using as our udp2log replacement. If we're going to have all this data inside of Hadoop, we also want to be sure that it gets there reliably and in a timely manner.

So here's a bunch of buzzwords about Kafka. Kafka is a reliable, performant, pub/sub, distributed log buffer. There's a bunch of brokers and they act as peers, and the front-end cache servers, which we'll talk about in a second, all know how to talk to these brokers and send their data to them. What's nice is that the Kafka broker nodes all talk amongst themselves to figure out who needs to get what data. It's very horizontally scalable; you can just add more brokers to it and remove them. And so far it's working really great for us.

So yeah, we've got four brokers up right now and approximately 80 producers. The producers we're talking about here are the Varnish front-end cache servers, so these are all the servers that actually get the web requests from the internet. A web request comes in from the internet, the server produces data to Kafka, and it's stored in Kafka for a while for us to send to HDFS. The stream of data we're sending maxes out at around 200,000 messages per second, which amounts to about 30 megabytes a second. We're consuming this data out of Kafka into HDFS about every 10 minutes, and that data is then available for querying about once an hour via this other piece of software called Camus.

Camus is just a little piece on here because it's really just a job that runs on Hadoop. It's a piece of software written by LinkedIn to consume from Kafka in a distributed fashion and write the data to HDFS. All it does is tell Hadoop to run this code; it knows where Kafka is and what data it's supposed to read out of Kafka, and then it writes it to HDFS. The most convenient part is that it's configurable so that you can tell it to inspect content-based time data in the messages you're importing. So we can tell all the jobs that are importing from Kafka to look at the data as it's being imported and see what request timestamp is actually in each message. That allows us to store the data in properly bucketed directories based on the timestamp in the actual content, not the timestamp of when the job runs.

One of the disadvantages of the udp2log system we've been using is that logs just come in whenever they come in and they're archived and rotated via logrotate. So a lot of times, especially on the borders of a rotate, the file will be named, like the example I have here in the slide, July 15th at midnight, but it will actually contain data from a different hour if it's rotated hourly. With content-based bucketing that isn't a problem, and that's great, because then we can be sure that if people are running queries or looking at this data, all they have to do is specify a particular directory and they will be sure to have all of the data for that particular hour. So if someone is querying by hour, their counts are going to be correct.

Another buzzword that was up there before is Oozie. I'm going to go through the next couple of slides a little bit fast; you can ask more questions afterwards about how they work. Oozie is just a Hadoop job scheduler. We could use cron for this, but one of the nice things about Oozie is that it allows us to launch jobs based on the existence of data rather than just on the passage of time.
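To make that hourly bucketing concrete, here is a small sketch of how one fully imported hour might be exposed to Hive and then queried; the partition columns and the HDFS path are assumptions made for illustration, not the exact layout that Camus writes.

```sql
-- Sketch only: the path and partition columns are illustrative assumptions.
-- Map one fully imported hourly directory onto a Hive partition.
ALTER TABLE webrequest ADD IF NOT EXISTS
  PARTITION (year = 2014, month = 7, day = 15, hour = 0)
  LOCATION '/wmf/data/raw/webrequest/hourly/2014/07/15/00';

-- Because the partition only holds requests whose own timestamps fall in that
-- hour, an hourly count like this should be complete and correct.
SELECT COUNT(*) AS requests
FROM webrequest
WHERE year = 2014 AND month = 7 AND day = 15 AND hour = 0;
```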
With cron, you can say, "run this every hour," and it will just go. In Oozie, you can say, "check for this data set to exist, and then launch this complicated workflow of jobs" that will do things we want to do, like aggregate and anonymize or geocode or whatever it is. We can trigger all of that once one of these Camus jobs finishes importing all the data for a particular hour.

Hue will be a nice thing when all of it actually works really well. Like I said, we recently reinstalled the whole cluster and there's a new version of Hue; it's been pretty good, but I haven't had time to play with it since the reinstall. It's just a web GUI for interacting with all these technologies. It's got a Hive query interface, you can build Oozie jobs in it, you can write Pig scripts in it, and you can browse the HDFS file system. It's got a lot of features and there are a lot of add-ons you can add to it too. We envision this as a way that analysts and others can interact with Hadoop and Hive through an easy web interface, but the command-line interface is available as well.

That's the whole cluster there; I think I got to every part of it. I just wanted to point out that we're importing the data into the cluster with Camus, but on the right side of this diagram you can see all these loop-back arrows with things like geocoding and sanitizing and page views. These are all envisioned to be batch jobs that work on the data once it's in HDFS. The plan is to collect all this raw data from the Varnish front-end caches via the Kafka cluster and have it in HDFS for only a short period of time. This would be the raw logs as they come from the Varnish cluster; we would keep maybe 30 to 60 days, something around that, depending on how much space it takes. Then, probably every hour, there would be more regular jobs that take that data and transform it into a format that is a little bit safer for people to just get at, with personally identifiable information removed.

I'm sure there will also be some pretty regular aggregations that we'll do. Aside from ad hoc analyst queries, we'll standardize some aggregations of data and store them in HDFS as well. We'd then also like to export the smaller aggregated data sets into something that's a little bit easier to get to; whether that's MySQL or Redis or something else is something to think about down the line. It would be really nice if the Wikimetrics project could also access some of this. If we stored some of the aggregated data sets in a MySQL database somewhere that Wikimetrics could access, there would be a really nice way for people to associate the cohorts that people are using Wikimetrics for with page view and readership data.
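As a rough illustration of what one of those hourly aggregation and anonymization jobs might run (say, a Hive statement coordinated by Oozie once an hour's raw data exists), here is a minimal sketch; the pageview_hourly target table, the geocoded_country column, and the other names are assumptions, not the actual production jobs.

```sql
-- Minimal sketch: aggregate an hour of raw requests into counts, dropping
-- personally identifiable fields such as IPs and user agents.
-- Table and column names are illustrative assumptions.
INSERT OVERWRITE TABLE pageview_hourly
  PARTITION (year = 2014, month = 7, day = 15, hour = 0)
SELECT
  uri_host,
  uri_path,
  geocoded_country,
  COUNT(*) AS view_count
FROM webrequest
WHERE year = 2014 AND month = 7 AND day = 15 AND hour = 0
GROUP BY uri_host, uri_path, geocoded_country;
```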
Right, so all of these technologies are actually pretty thoroughly puppetized. They work easily in Labs and in Vagrant. Labs I'm not going to go over right now, because it's a little bit more complicated: if you've used Labs before, there's the managed instances pane where you can check boxes and have certain classes applied, and that all works, but those classes aren't out there for any project; they only work in the analytics project right now. So while making the slide deck I put a little to-do on there for me to make that easier, to get the Puppet groups formalized and the full managed instances interface working.

But for Vagrant, all you have to do is enable a role. It's good to have extra RAM; I've tried to limit the amount of memory these processes need in Vagrant, but it's still good to have a little extra. I'm not going to go through the full instructions right now; they're in the analytics role class in the Vagrant Puppet code if you want to look them up. But all you have to do is up the RAM, enable the analytics role, and provision your instance, and they should all come up. You should have Hive and Hadoop and Oozie and Hue, all the stuff, available to you on a Vagrant instance.

I'd like to put a to-do on here as well for myself to make some of that a little bit better, so that a sample web request table, the stuff I just went over, is also populated. Right now that's not the case; you'd have to figure that out on your own. But it'd be nice if there was a table there for you.
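In the meantime, anyone setting that up by hand could start from a table definition along these lines; the field names, the SerDe class, and the HDFS path below are assumptions made for illustration, not the exact definition used in production or shipped with the Vagrant role.

```sql
-- Illustrative sketch of an external Hive table over JSON web request logs.
-- Field names, SerDe class, and location are assumptions.
CREATE EXTERNAL TABLE IF NOT EXISTS webrequest (
  hostname    STRING,
  dt          STRING,
  ip          STRING,
  uri_host    STRING,
  uri_path    STRING,
  referer     STRING,
  user_agent  STRING,
  http_status STRING
)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/wmf/data/raw/webrequest';
```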
I expect a lot of questions, so the talk is a little bit short, but that's the end of my slide deck. I know that was very fast, but yeah, questions?

So, as a developer or product owner or a manager, how can I expect to get access to all this data that's being stored and processed in Hadoop?

It depends on which of the data you're speaking of. Most of the data I've been talking about here is the Hive tables that will be mapped on top of the raw web request logs. Eventually we would like that data to be a little bit more restricted, maybe only accessible by ops people, but I think there will be more sanitized data sets that are more accessible. In both the short term and the long term, we would like stat1002 to be used to access this data. There will be Hadoop and Hive clients on that machine, and you can access the cluster and run Hive queries from there. That, in conjunction with the Hue web interface I mentioned before, those would be the two ways. If you want access now, the way to do it is to submit an RT request and get approved, just like shell access to anything else in the cluster.

So it's stat1002 that should be used to access this data. And note that when you actually log into stat1002, it's very similar to the reason you would use one of these stat boxes to connect to one of the MySQL slaves: it's more of a bastion than anything else. You can save local results there for further post-processing, but really all that's installed on that machine are clients that interact with the cluster elsewhere. So when you issue a Hive query, that query is sent off to the Hive server, which translates it into one of these MapReduce jobs, which is submitted to the Hadoop cluster, which knows how to get at all the data and make it queryable; it gets processed and then sent back to you.

Andrew, a question from IRC, to add to Toby's question: what about random members of the community? I might actually let Toby answer that question. You're doing a great job, Andrew.

So there are two concerns with community members, both of which we're thinking about. The first is respecting the privacy of our readers. We're going to be addressing this publicly very soon, but the concern is that these are basically the web logs, and we need to anonymize and aggregate them before we really let anybody outside of the operations and analytics teams use them. That's one big concern.

And then the other concern is that maintaining a public Hadoop cluster is kind of a hassle, or hassle is probably not the right word, but it's a challenge, because Hadoop is a great processing system but not a great multi-user system. So what we anticipate is that these aggregated logs will be made available, probably in Labs, in MySQL, for the community to access. We think this is the right balance between making our data open and actually being able to support reliable access to that data. But giving the community access to this data is super important.

So, I've read that Hadoop is very good for text processing as well. Do we have any plans to put the dumps on Hadoop?

Yeah, actually, they're not there right now because, like I said, I just reinstalled this cluster, but before the reinstall we were regularly importing the pagecount dumps, and those were going onto Hadoop. I think they were there sort of as a proof of concept, but those could be used to generate data sets too.

I was actually looking at the corpus, the actual encyclopedia.

I don't know if we have plans, but if that's useful then certainly, yeah, we can use Hadoop for processing really any type of large data that people need processed. The web request logs are the main use case because they're one of the biggest data sets the Foundation has, but it would be very convenient to be able to join that data with other data sets. So whether it's editor data sets from the SQL databases, which we can import into Hadoop and use in conjunction, or, I suppose, even the XML dumps of the whole encyclopedia, those could go there too.

Those are edit logs too.

Yeah, I'm not sure which way is better, but yes is the answer: if that's a useful thing to do, we can import other types of data sets and use them as well. It would also be interesting to do sentiment analysis of talk pages and things like that. Anyway, yeah.

Awesome. Does anybody else have any questions before we end the podcast? Not podcast, Tech Talk. Okay, great, thanks for coming everybody. Thanks everybody. Thanks Andrew. Thank you.