in this extremely crowded Esszimmer. I'm Jakob, I'm your herald for tonight until 10, and I'm here to welcome you and to welcome these wonderful three guys on the stage. They're going to talk about the infrastructure of Wikipedia, and yeah, they are Lukas, Amir and Daniel, and I hope you have fun. Hello, my name is Amir. I'm a software engineer at Wikimedia Deutschland, which is the German Wikimedia chapter. The Wikimedia Foundation runs Wikipedia. Here is Lukas; Lukas is also a software engineer at Wikimedia Deutschland, and Daniel here is a software architect at the Wikimedia Foundation. We are all based in Germany: Daniel in Leipzig, we are in Berlin. Today we want to talk about how we run Wikipedia using donors' money, without lots of advertisement and without collecting data. In this talk, we are first going to take an inside-out approach, so we first talk about the application layer and then the outside layers, and then we switch to an outside-in approach and talk about what happens when you hit Wikipedia from the outside. So first of all, let me give you some information. All of the Wikipedia infrastructure is run by the Wikimedia Foundation, an American non-profit charitable organization. We don't run any ads and we are only 370 people. If you count Wikimedia Deutschland and the other chapters, it's around 500 people in total. That's nothing compared to the companies out there. But all of the content is managed by volunteers; even our staff doesn't edit content on Wikipedia. And we support 300 languages, which is a very large number. Wikipedia is 18 years old, so it can vote now. And Wikipedia also hosts some really, really weird articles. I want to ask you: have you encountered any really weird article in Wikipedia? My favorite is the list of people who died on the toilet. But if you know anything, raise your hand. Do you know any weird articles in Wikipedia? Do you know some? The classic one? You need to unmute yourself. I need to? Oh, okay. This is technology, I don't know anything about technology. Okay. Now, my favorite example is the list of people killed by their own invention. That's a lot of fun, look it up, it's amazing. A list of prison escapes using helicopters. I almost said helicopter escapes using prisons, which doesn't make any sense. But that was also a very interesting list. I think we also have a category of lists of lists of lists. Yes, right. Yeah, okay. And every few months, someone thinks it's funny to redirect it to Russell's paradox. Oh, yeah. But besides that, people cannot read Wikipedia in Turkey or China. Three days ago, the block in Turkey was actually ruled unconstitutional, but it's not lifted yet. Hopefully it will be lifted soon. And the Wikipedia project is not just Wikipedia; there are lots and lots of projects. Some of them are not as successful as Wikipedia, which is the most successful one. And there's another one, Wikidata. It's being developed by Wikimedia Deutschland, by the Wikidata team with Lukas, and it holds the data that Wikipedia, the Google Knowledge Graph, Siri or Alexa use. It's basically a sort of backbone for data across the whole internet. So, our infrastructure, let me see. First of all, our infrastructure is all open source. By principle, we never use any commercial software.
We could have used lots of things, which were sometimes even offered to us for free, but we refused to use them. The second thing is that we have two primary data centers for failover, for when, for example, a whole data center goes offline, so we can fail over to the other one. We have three caching points of presence, or CDNs, and our CDN locations are all over the world. We also run our own CDN; we don't use Cloudflare, because we care about the privacy of our users, and it's very important, for example, for people from countries where it might be dangerous for them to edit Wikipedia, so we really care to keep the data as protected as possible. We have 17 billion page views per month, which goes up and down based on the season and everything. We have around 100,000 to 200,000 requests per second; that's different from the page views, because requests can be requests for objects, API calls, lots of things. We have about 300,000 new articles per month, and we run all of this with 1,300 bare metal servers. So now, Daniel is going to talk about the application layer and the inside of the infrastructure. Thanks, Amir. Oh, the clicky thing, thank you. So the application layer is basically the software that actually does what a wiki does, right? It lets you edit pages, create and update pages, and then serve the page views. The challenge for Wikipedia, of course, is serving all the many page views that Amir just described. The core of the application is a classic LAMP application. I have to stop moving? Yes, is that it? It's a classic LAMP stack application, so it's written in PHP, it runs on Apache servers, and it uses MySQL as the database in the backend. We used to use HHVM instead of the... yeah, we... hello. We used to use HHVM as the PHP engine, but we just switched back to mainstream PHP, using PHP 7.2 now, because Facebook decided that HHVM was going to be incompatible with the standard and they were basically just developing it for themselves. Right, so we have separate clusters of servers for serving different requests: page views on the one hand, and also handling edits. Then we have a cluster for handling API calls, and then we have a bunch of servers set up to handle asynchronous jobs, things that happen in the background, the job runners. Video scaling is a very obvious example of that: it just takes too long to do it on the fly, but we use it for many other things as well. MediaWiki is kind of an amazing thing, because you can just install it on your own cheap shared hosting, ten bucks a month of web space, and it will run, but you can also use it to serve half the world. So it's a very powerful and versatile system, and this wide span of different applications also creates problems. That's something that I will talk about tomorrow, but for now let's look at the fun things. So if you want to serve a lot of page views, you have to do a lot of caching, and so we have a whole set of different caching systems. The most important one is probably the parser cache. As you probably know, wiki pages are written in a markup language, wikitext, and they need to be parsed and turned into HTML, and the result of that parsing is of course cached. That cache is semi-persistent; nothing really ever drops out of it. It's a huge thing and it lives in a dedicated MySQL database system. And yeah, we use memcached a lot.
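Both the parser cache and memcached are used in the classic cache-aside way: look the rendered result up by key, and only do the expensive work on a miss. Here is a minimal, hedged sketch of that pattern in Python; the key scheme, function names and in-memory dict are illustrative assumptions, not MediaWiki's actual implementation.

```python
import hashlib

# Toy stand-in for the real store; in production this would be memcached
# or the dedicated MySQL-backed parser cache, not a Python dict.
parser_cache = {}

def parser_cache_key(page_title: str, revision_id: int, options: str) -> str:
    # Entries are keyed by revision *and* parser options (language, skin,
    # and so on), otherwise readers could get the wrong variant of a page.
    raw = f"{page_title}:{revision_id}:{options}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

def render_wikitext(wikitext: str) -> str:
    # Placeholder for the expensive wikitext-to-HTML parse.
    return "<p>" + wikitext.replace("'''", "") + "</p>"

def get_html(page_title: str, revision_id: int, wikitext: str,
             options: str = "default") -> str:
    """Cache-aside lookup: return cached HTML, or parse once and store it."""
    key = parser_cache_key(page_title, revision_id, options)
    cached = parser_cache.get(key)
    if cached is not None:
        return cached                  # cache hit, no parsing needed
    html = render_wikitext(wikitext)   # cache miss: do the expensive work
    parser_cache[key] = html           # keep it around for the next reader
    return html

print(get_html("Helicopter prison escapes", 12345, "'''Escapes''' by helicopter"))
```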
Memcached is used for all kinds of miscellaneous things, anything that we need to keep around and share between server instances. And we have been using Redis for a while for anything that we want to have available not just between different servers but also between different data centers, because Redis is a bit better at synchronizing things between different systems. We still use it for session storage especially, though we are about to move away from that and will be using Cassandra for session storage. We have a bunch of additional services running for specialized purposes, like scaling images or rendering math formulas. ORES is pretty interesting: ORES is a system for automatically detecting vandalism, or rather for rating edits. It's a machine learning based system for detecting problems and highlighting edits that may not be great and need more attention. We have some additional services that process our content for consumption on mobile devices, chopping pages up into bits and pieces that can then be consumed individually, and many more. In the background, we also have to manage events. We use Kafka for message queuing, and we use that to notify different parts of the system about changes. On the one hand, we use that to feed the job runners that I just mentioned, but we also use it, for instance, to purge the entries in the CDN when pages get updated, and things like that. Okay, the next section is going to be about the databases. But very quickly: we will have quite a bit of time for discussion afterwards, but are there any questions right now about what we said so far? Everything extremely crystal clear? Okay, no clarity is left, I see. Oh, one question in the back: can you maybe turn the volume up a little bit? Thank you. Yeah, I think this is your section, right? Oh, it's Amir again, sorry. So, I want to talk about my favorite topic, the dungeons of every production system: databases. The database setup of Wikipedia is really interesting and complicated on its own. We use MariaDB; we switched from MySQL in 2013 for lots of complicated reasons. As I said, because we are really open source, you can not just go and check our database tree, which shows how it looks and what the replicas and masters are; you can actually even query Wikipedia's database live. You can just go to that address, log in with your Wikipedia account, and you can do whatever you want. It was a funny thing: a couple of months ago, someone sent me a message like, oh, I found a security issue, you can just query Wikipedia's database. I was like, no, no, we actually let this happen. It's sanitized, we removed the password hashes and everything, but still, you can use this. And if you want to know how the database clusters work: because it gets too big, they first started sharding, but now we have sections, which are basically different clusters. Really large wikis have their own section. For example, English Wikipedia is s1, German Wikipedia with two or three other small wikis is in s5, Wikidata is on s8, and so on. Each section has a master and several replicas, but one of the replicas is actually a master in the other data center, because of the failover that I told you about. So basically two layers of replication exist.
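To make the section layout just described a bit more concrete, here is a hedged sketch of how routing a query to the right shard and server could look. Only the s1/s5/s8 assignments were mentioned in the talk; the host names and the replica-selection logic are invented for illustration and are not the real topology.

```python
import random

# Which wiki lives in which section (shard); only the examples from the talk.
SECTIONS = {
    "enwiki": "s1",        # English Wikipedia has a section of its own
    "dewiki": "s5",        # German Wikipedia shares s5 with a few small wikis
    "wikidatawiki": "s8",
}

# Per section: one master plus replicas (host names made up for this sketch).
TOPOLOGY = {
    "s1": {"master": "db1100", "replicas": ["db1101", "db1102", "db1103"]},
    "s5": {"master": "db1110", "replicas": ["db1111", "db1112"]},
    "s8": {"master": "db1120", "replicas": ["db1121", "db1122", "db1123"]},
}

def pick_db(wiki: str, write: bool) -> str:
    """Writes go to the section master, reads go to one of the replicas."""
    section = SECTIONS[wiki]
    servers = TOPOLOGY[section]
    if write:
        return servers["master"]
    return random.choice(servers["replicas"])

print(pick_db("enwiki", write=False))        # some replica, e.g. db1102
print(pick_db("wikidatawiki", write=True))   # always the s8 master
```

One of the replicas in each section would additionally act as the master for the second data center, which is the second replication layer mentioned above.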
What I'm telling you here is about the metadata, but for the wikitext we also need a completely different set of databases. There we use consistent hashing to scale it horizontally, so we can just add more databases for that. I don't know if you know it, but Wikipedia stores every edit, so you have the wikitext of every edit in the whole history in the database. We also have the parser cache that Daniel explained, and the parser cache also uses consistent hashing, so we can just scale it horizontally as well. But for the metadata it's slightly more complicated; the metadata is what is used to render the page. So, this is for example a very short version of the database tree that I showed you; you can even go and look it up for the other ones. This is s1 in eqiad, the main data center. The master is this number, and it replicates to some of these, and then this one, the second one, whose number starts with 2 because it's in the second data center, is the master of the other data center and has its own replicas. There is cross-data-center replication, because the main data center is in Ashburn, Virginia, and the second data center is in Dallas, Texas, so they need cross-DC replication, and that happens over TLS to make sure that no one can listen in between those two. And we have snapshots and even dumps of the whole history of Wikipedia. You can go to dumps.wikimedia.org and download the whole history of every wiki you want, except the parts we had to remove for privacy reasons. And there are lots and lots of backups; I recently realized we have lots of backups. In total it's 570 terabytes of data and around 150 database servers, the queries that go to them are around 350,000 queries per second, and in total it requires 70 terabytes of RAM. We also have another storage section called Elasticsearch, which, as you can guess, is used for the search box at the top right, if you're using desktop; it's different on mobile, I think, and it also depends on whether you're on a right-to-left language. It is run by a team called Search Platform, and because none of us is from Search Platform, we cannot explain it in that much detail; we don't know much about how it works. That's a lie. We also have media storage for all of the free pictures that are uploaded to Wikimedia. For example, we have a category in Commons, Commons is our wiki that holds all of the free media, called "cats looking at left", and we have a category "cats looking at right". So we have lots and lots of images: it's 390 terabytes of media, one billion objects, and it uses Swift. Swift is the object storage component of OpenStack. It has several layers of caching, front end and back end, and yeah, that's mostly it. And now we want to talk about traffic. So this picture is from when Sweden in 1967 switched from driving on the left to driving on the right; this is basically what happens in the Wikimedia infrastructure as well. We have five caching data centers. The most recent one is eqsin, which is in Singapore. Three of them, ulsfo, esams and eqsin, are just CDNs. We also have two network points of presence, one in Chicago and the other one in Amsterdam, but we won't get into that. So, as I said, we have our own content delivery network, handled by our Traffic team. Geolocation is done by GeoDNS, which is actually written and maintained by one of the traffic people, and we can pool and depool data centers. The DNS has a time to live of 10 minutes.
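Roughly, the GeoDNS plus pooling mechanism works like this. This is only a toy sketch under heavy assumptions: the region table, data-center preference lists and fallback rule are invented, and the real gdnsd setup is declarative configuration rather than Python.

```python
# Pooled state of the caching data centers. Depooling a DC stops new DNS
# answers from pointing at it, but clients keep using their cached answer
# until the 10-minute TTL expires.
POOLED = {"eqiad": True, "codfw": True, "esams": True, "ulsfo": True, "eqsin": True}

# Very rough "closest data centers per region" table, for illustration only.
NEAREST = {
    "europe":        ["esams", "eqiad"],
    "north-america": ["eqiad", "codfw"],
    "asia":          ["eqsin", "ulsfo"],
}

DNS_TTL_SECONDS = 600  # 10 minutes

def resolve(region: str) -> str:
    """Return the closest pooled caching data center for a client region."""
    for dc in NEAREST[region]:
        if POOLED[dc]:
            return dc
    return "eqiad"  # everything nearby depooled: fall back to the primary DC

print(resolve("europe"))   # esams, e.g. for visitors here at Congress
POOLED["esams"] = False    # depool Amsterdam, e.g. for maintenance
print(resolve("europe"))   # eqiad, once the cached DNS answer has expired
```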
So if a data center goes down, it takes up to 10 minutes for the depooling to actually propagate. And we use LVS as the transport-layer load balancer, so layer 3 and layer 4, the Linux load balancer, and it supports consistent hashing. We also grew so big that we needed something to manage the load balancers, so we wrote our own system, which is called PyBal. Also, lots of companies actually peer with us; for example, we connect directly at the AMS-IX in Amsterdam. So this is how the caching works, and anyway, there are lots of reasons for this, let's get started. We use TLS, we support TLS 1.2, and as the first layer we have "nginx minus". Does anyone know what nginx minus means? So, that's related, but not correct. There is nginx, which is the free version, and there is nginx plus, which is the commercial version. We use nginx, but we don't use it to do load balancing or anything; we stripped out everything from it and we just use it for TLS termination. So we call it nginx minus. It's an internal joke. Then we have the Varnish front end. Varnish is also a caching layer, and the front end is in memory, which is very, very fast, and we have the back end, which is on storage, on the hard disk, but that one is slower. The fun thing is that the CDN caching layer alone handles 90% of our requests: 90% of requests just reach Varnish and get an answer straight back. Only if that doesn't work does the request go through to the application layer. The Varnish cache has a TTL of 24 hours, and if you change an article it also gets invalidated by the application, so if someone edits, the CDN actually purges that result. And the thing is, the front end is sharded by request: you come in, the load balancer just randomly sends your request to one of the front ends. But if the front end can't find it, it sends it to the back end, and the back end is, how is it called, hashed by the request. So, for example, the article on Barack Obama is only ever served from one back-end node per caching data center. If none of this works, it actually hits the other data center. So yeah, I actually explained all of this already. We have two caching clusters: one is called text and the other one is called upload, which is not confusing at all. And if you want to see this, you can just run mtr en.wikipedia.org and your end node is text-lb.wikimedia.org, which is our text cluster, but if you trace upload.wikimedia.org, you hit the upload cluster. Yeah, so this is the current setup, and it has lots of problems, because Varnish is open core. The version that we use is open source, we don't use the commercial one, but the open core one doesn't support TLS. What? What happened? Okay, no. No, no, you're not supposed to see this. Okay, sorry for that. Huh? Okay. Okay, sorry. So Varnish has lots of problems. Varnish is open core, and it doesn't support TLS termination, which forces us to have this nginx minus system just to do the TLS termination, and that makes our setup complicated. It also doesn't work very well with Swift, which forces us to have a cron job that restarts every Varnish node twice a week. So we have a cron job that just restarts every Varnish node, which is embarrassing. And on the other hand, when the Varnish back end wants to talk to the application layer, it also doesn't support TLS, so we use IPsec, which is even more embarrassing.
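To make the front-end/back-end split concrete, here is a hedged sketch of how a request could be mapped onto cache nodes: the front end is picked more or less at random, the back end by hashing the URL, similar in spirit to the consistent hashing mentioned earlier for the external storage. The node names are invented, and real consistent hashing additionally keeps most keys in place when nodes are added or removed, which plain modulo hashing does not.

```python
import hashlib
import random

FRONTENDS = ["cp-fe1", "cp-fe2", "cp-fe3", "cp-fe4"]   # in-memory caches
BACKENDS  = ["cp-be1", "cp-be2", "cp-be3"]             # on-disk caches

def pick_frontend() -> str:
    # The load balancer spreads requests over all front ends, so every
    # front end ends up holding the hot pages in memory.
    return random.choice(FRONTENDS)

def pick_backend(url: str) -> str:
    # The back end is chosen by hashing the request URL, so a given article
    # lives on exactly one back-end node per caching data center and the
    # disk caches do not duplicate each other.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]

url = "https://en.wikipedia.org/wiki/Barack_Obama"
print(pick_frontend())    # varies from request to request
print(pick_backend(url))  # always the same node for this URL
```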
But we are changing that. We are moving to Apache Traffic Server, which is very, very nice, and it's also fully open source, under the Apache Foundation. ATS does the TLS termination. For now, we still have a Varnish front end that continues to exist, but the back end is also going to change to ATS, so we call this the ATS sandwich: two ATS layers with a Varnish in the middle. The good thing is that once the TLS termination moves to ATS, we can actually use TLS 1.3, which is more modern, more secure and even considerably faster. It basically drops 100 milliseconds from every request that goes to Wikipedia, which translates to centuries of our users' time every month. The ATS work is ongoing and hopefully it will go live soon. And once this is done, this is the new version, and as I said, we can then also use TLS, which is more secure, instead of IPsec to talk between the data centers. Yes. And now it's time for Lukas to talk about what happens when you type in en.wikipedia.org. Do you want...? Yes, this makes sense. Thank you. So first of all, the image you see on the slide here doesn't really have anything to do with what happens when you type en.wikipedia.org, because it's an offline Wikipedia reader, but it's just a nice image. So this is basically a summary of everything they already said. If, which is the most common case, you are lucky and request a URL which is cached, then first your computer asks for the IP address of en.wikipedia.org. That request reaches this GeoDNS server, and because we're at Congress here, it tells you the closest data center is the one in Amsterdam, so esams. Your request is going to hit the edge, what's it called, load balancer slash router there, then go through TLS termination in nginx minus, and then it's going to hit the Varnish caching servers, either front end or back end. Then you get a response, and that's it; nothing else is ever bothered again. It doesn't even reach any other data center, which is very nice. And that's, as you said, around 90% of the requests we get. If you're unlucky and the URL you requested is not in the Varnish in the Amsterdam data center, then it gets forwarded to the eqiad data center, which is the primary one. There it still has a chance to hit the cache, and perhaps this time it's there, and then the response is going to get cached in the front end, no, in the Amsterdam Varnish, and you also get a response, and we still don't have to run any application stuff. If we do have to hit the application layer, then Varnish forwards the request: if it's upload.wikimedia.org, it goes to the media storage, Swift; if it's any other domain, it goes to MediaWiki. And then MediaWiki does a ton of work: it connects to the database, in this case the s1 shard for English Wikipedia, gets the wikitext from there, gets the wikitext of all the related pages and templates... no, wait, I forgot something. First, it checks whether the HTML for this page is available in the parser cache, so that's another caching layer there, and this parser cache might either be memcached or the database cache behind it. And if it's not there, then it has to go and get the wikitext, get all the related things, and render that into HTML, which takes a long time and goes through some pretty ancient code. And if you are doing an edit or an upload, it's even worse, because then it always has to go to MediaWiki.
And then it not only has to store this new edit, either in the media backend or in the database, it also has to update a bunch of stuff. First of all, it has to purge the cache: it has to tell all the Varnish servers that there's a new version of this URL available, so that it doesn't take a full day until the time to live expires. It also has to update a bunch of other things. For example, if you edit a template, it might be used in a million pages, and the next time anyone requests one of those million pages, they should actually be rendered again using the new version of the template, so it has to invalidate the cache for all of those. And all of that is deferred through the job queue. And it might have to calculate thumbnails if you uploaded a file, or transcode media files, because maybe you uploaded, what do we support, you upload WebM and the browser only supports some other media codec or something; we transcode that, and also encode it down to different resolutions. So then it goes through that whole dance. And that was already those slides. Amir, you're going to talk again, about how we manage all of this. Can you hear me? OK, yeah. I quickly come back, just for a short break, to talk about how we manage all of this, because managing 1,300 bare metal servers plus a Kubernetes cluster is not easy. So what we do is that we use Puppet for configuration management on our bare metal systems. It's 50,000 lines of Puppet code. I mean, lines of code is not a great indicator, but you can roughly get an estimate of how things work. And we have 100,000 lines of Ruby. And we have our own CI and CD cluster. We don't store anything on GitHub or GitLab; we have our own system, which is based on Gerrit, and for that we have our own Jenkins, and Jenkins does all of these kinds of things. Also, because we have a Kubernetes cluster for some of our services, if you merge a change in Gerrit, it also builds the Dockerfiles and containers and pushes them up to production. And also, to run commands over SSH across the fleet, we have Cumin, which is our in-house automation tool that we built for our systems. For example, you go there and say, OK, depool this node, or run this command on all of the Varnish nodes that I told you about, like when you want to restart them. And with this, I hand back to Lukas. So I am going to talk a bit more about Wikimedia Cloud Services, which is a bit different in that it's not really our production stuff, but it's where you people, the volunteers of the Wikimedia movement, can run their own code. You can request a project, which is kind of a group of users, and then you get assigned a pool of resources, this much CPU and this much RAM, and you can create virtual machines with those resources and then do stuff there, run basically whatever you want, and create and boot and shut down the VMs and so on. We use OpenStack, and there's a Horizon front end for that, which you use through the browser. It logs you out all the time, but otherwise it works pretty well. Internally, ideally you manage the VMs using Puppet, but a lot of people just SSH in and then do whatever they need to set up the VM manually, and that happens. And there are a few big projects, like Toolforge, where you can run your own web-based tools, or the beta cluster, which is basically a copy of some of the biggest wikis.
Like, there's a beta English Wikipedia, a beta Wikidata, a beta Wikimedia Commons, using mostly the same configuration as production, but using the current master version of the software instead of whatever we deploy once a week. So if there is a bug, we hopefully see it earlier, even if we didn't catch it locally, because the beta cluster is more similar to the production environment. And the continuous integration services run in Wikimedia Cloud Services as well. And also, you have to have Kubernetes somewhere on these slides, right? So you can use that to distribute work between the tools in Toolforge, or you can use Grid Engine, which does a similar thing, but it's like three decades old and has been through five forks now; I think the current fork we use is Son of Grid Engine, and I don't know what it was called before. But that's Cloud Services. So, in a nutshell, this is all of our systems. We have 1,300 bare metal servers with lots and lots of layers of caching, because we mostly serve reads, and we can just keep serving cached versions. And all of this is open source. You can contribute to it if you want to, and all of the configuration is also open. This is the way I got hired: I started contributing to the system, and people were like, yeah, come and work for us. That's actually how all of us got hired. So yeah, this is the whole thing that happens in Wikimedia, and if you want to help us, we are hiring. You can just go to jobs.wikimedia.org if you want to work for the Wikimedia Foundation. If you want to work for Wikimedia Deutschland, you can go to wikimedia.de, and at the bottom there's a link for jobs, because the full link is too long. If you want to contribute to us, there are so many ways to contribute. As I said, there are so many bugs. We have our own Grafana instance, you can just look at the monitoring, and Phabricator is our bug tracker, you can just go there, find a bug and fix things. Actually, we have one repository that is private, but it only holds the TLS certificates and things that are really, really private; we cannot publish those. The documentation for the infrastructure is at wikitech.wikimedia.org, and the documentation for the configuration is at noc.wikimedia.org, plus the documentation of our code base. The documentation for MediaWiki itself is at mediawiki.org. And we also have our own URL shortener: you can go to w.wiki and shorten any URL in the Wikimedia infrastructure. We reserved the dollar sign for the donate site. And yeah, if you have any questions, please. We have quite a bit of time for questions, so if anything wasn't clear or you are curious about anything, please, please ask. One question about something that is not in the presentation: do you have to deal with hacking attacks? So, the first rule of security issues is that we don't talk about security issues, but let's put it this way: we have all sorts of attacks happening. We usually have DDoS attacks; one happened a couple of months ago that was very successful, I don't know if you read the news about that. But we have an infrastructure to handle this, and we have a security team that handles these cases, yes. Hi, hello. How do you manage access to your infrastructure for your employees? So we have an LDAP group, and LDAP is used for the web-based systems. But access to the servers is via SSH, and for SSH we have strict protocols, and then you get a private key, and people usually protect the private key using YubiKeys.
And then you can access the systems, basically. Yeah. Well, there's some firewalling set up so that there is only one bastion server per data center that you can actually reach through SSH, and then you have to tunnel through that to get to any other server. And also, we have an internal firewall, and basically, if you are inside production, you cannot talk to the outside. Even if you, for example, do a git clone, github.com doesn't work; you can only access hosts that are inside the Wikimedia Foundation infrastructure. OK. Hi. You said you do TLS termination through nginx. Do you still allow non-HTTPS, so plain HTTP access? No, we dropped that a really long time ago. 2013 or so? Yeah, 2015. In 2013 we started serving most of the traffic over HTTPS, but in 2015 we dropped all of the non-HTTPS protocols. And recently we even went further, and we are not serving any SSL requests anymore, and TLS 1.1 is also being phased out. So we are sending warnings to the users, like: you are using TLS 1.1, please migrate to these new things that came out around 10 years ago. So yeah. Yeah, I think the deadline for that is February 2020 or something; then we'll only have TLS 1.2, and we are going to support TLS 1.3. Yeah. Are there any other questions? So, does read-only traffic from logged-in users hit all the way through to the parser cache, or is there another layer of caching for that? Yes, logged-in users bypass all of that CDN caching. We need one more microphone. Yes, it actually does hit the parser cache, and this is a pretty big problem and something we want to look into, but it requires quite a bit of re-architecting. If you are interested in this kind of thing, maybe come to my talk tomorrow at noon. Yeah, one thing we are planning to do is active-active, so we have two primaries, and read requests from logged-in users can hit the secondary data center instead of the main one. I think there was a question way in the back there for some time already. Hi, I've got a question. I read on Wikitech that you're using Ganeti as a virtualization platform for some parts. Can you tell us something about this, or what parts of Wikipedia or Wikimedia are hosted on this platform? I'm sorry, I don't know this for very sure, so take it with a grain of salt, but as far as I know, Ganeti is used to run a few very small VMs in production that we need for very, very small microsites that we serve to the users. So we build just one or two VMs; we don't use it very often, I think. Do you also think about open hardware? Not for servers. I think for the offline reader project, but that is not actually run by the foundation; it's supported, but it's not something that the foundation does. There was a lot of thinking about open hardware, but really, open hardware in practice usually means, you know, if you really want to go down to the chip design, it's pretty tough. So yeah, it's usually not practical, sadly. And one thing I can say about this is that we have some machines that are really powerful that we give to the researchers to run analysis on the data itself, and we needed to have GPUs for those. But the problem was that there wasn't any open source driver for them, so we migrated and used AMD, I think. But the AMD card didn't fit in the rack; it was quite an endeavor to get GPUs working for our researchers. I'm still impressed that you answer 90% out of the cache. Do all people access the same pages, or is the cache that huge?
So what percentage of the whole database is in the cache then? I don't have the exact numbers, to be honest, but a large percentage of the whole database is in the cache. I mean, it expires after 24 hours, so really obscure stuff isn't there. But it's a Pareto distribution: you have a few pages that are accessed a lot, and you have many, many, many pages that are not actually accessed at all for a week or so, except maybe by a crawler. So I don't know the number. My guess would be that it's less than 50% that is actually cached, but that still covers 90% of requests; it's probably the top 10% of pages that cover 90% of the page views. But I don't know, I should actually look this up, it would be an interesting number to have, yes. Do you know if this is 90% of the page views or 90% of the GET requests? Because requests for the JavaScript would also be cached more often, I assume. I would expect that for non-page-views it's even higher, because all the icons and JavaScript modules and CSS and stuff doesn't ever change. Unless that's already inflating the 90%. But there's a question back there. Hey, do your data centers run on green energy? Very good question. So the Amsterdam CDN one is fully green, but the other ones are partially green, partially coal and gas. As far as I know, there are some plans to move away from that. On the other hand, we realized that we don't produce that much carbon emission, because we don't have that many servers and we don't use that much data. There was an estimation that our carbon emissions are basically the same as about 200 households for the data centers, and with all of the travel that the staff does and all of the events, it's about 250 households in total. It's very, very small. I think it's a thousandth of the comparable traffic with Facebook, even if you cut Facebook down to the same traffic, because Facebook collects data and runs very sophisticated machine learning algorithms, which is really complicated. For Wikimedia, we don't do this, so we don't need much energy. Does that answer your question? Do we have any other questions left? Yeah, sorry. Hi. How many developers do you need to maintain the whole infrastructure, and how many developers, or let's say developer hours, did you need to build the whole infrastructure? The question is, because what I find very interesting about the talk is that you are a nonprofit, so as an example for other nonprofits: how much money are we talking about in order to build something like this as a digital common? So if this is just about actually running all this, just operations, it's less than 20 people, I think. Which means, if you basically divide the requests per second by people, you get to something like 8,000 requests per second per operations engineer, which I think is a pretty impressive number, probably a lot higher than elsewhere. I would really like to know if there's any organization that tops that. I don't actually know the operations budget. I don't know; it's two-digit millions annually. Total hours for building this over the last 18 years, I have no idea. For the first five or so years, the people doing it were actually volunteers. We still had volunteer database administrators and stuff until maybe ten years ago, eight years ago. So yeah, really nobody did any accounting of this, I can only guess. Hello, a tools question. A few years back, I saw some interesting examples of SaltStack use at Wikimedia, but right now I see only Puppet and Cumin mentioned. So what happened with that? I think we ditched SaltStack.
I can't really say, because none of us is on the Cloud Services team, so I don't think I can answer that. But if you look at Wikitech, the Wikimedia wiki for the infrastructure, last time I checked it says it's deprecated and obsolete. We don't use it anymore. Do you use the background jobs, like the job runners, to fill spare capacity on the web-serving servers, or do you have dedicated servers for those roles? I think they're dedicated, right? The job runners. If you're asking whether the job runners are dedicated, yes, they are. I think five per primary data center. Yeah, and, I mean, do we actually have any spare capacity on anything? We don't have that much hardware; everything is pretty much at 100%. I think we still have some server that is just called MISC1111 or something, which runs five different things at once. You can look for those on Wikitech. But sorry, it's not five, it's 20 per data center, 20 per primary data center, that's our job runners, and they run 700 jobs per second. And I think that does not include the video scalers, so those are separate again. No, they merged them like two months ago. Okay, cool. Maybe a little bit off topic, but can you tell us a little bit about the decision-making process for technical decisions, architecture decisions? How does it work in an organization like this? The decision-making process for architecture decisions, for example. Yeah, so Wikimedia has a committee for making high-level technical decisions, called the Wikimedia Technical Committee, TechCom, and we run an RFC process, so any decision that is cross-cutting, strategic or especially hard to undo should go through this process. It's pretty informal: basically, you file a ticket and start this process, it gets announced on the mailing list, hopefully you get input and feedback, and at some point it's approved for implementation. We're currently looking into improving this process. Sometimes it works pretty well, sometimes things don't get that much feedback, but it makes sure that people are aware of these high-level decisions. Daniel is the chair of that committee. Yeah, if you want to complain about the process, please do. Yes, regarding CI/CD pipelines: of course, with that much traffic, you want to keep everything consistent, right? So are there any testing strategies that you have internally? Of course unit tests, integration tests, but do you do something like continuous end-to-end testing on beta instances? So we have the beta cluster, but we also deploy, we call it the train, once a week. All of the changes get merged into one branch, and the branch gets cut every Tuesday, and it first goes to the test wikis, and then it goes to all of the wikis that are not Wikipedia, except Catalan and Hebrew Wikipedia. So basically, the Hebrew and Catalan Wikipedias volunteered to be the guinea pigs for the rest of the wikis. And if everything works fine, usually it goes there and there's like, oh, something in the fatal monitor, we have logging, and then it's like, okay, we need to fix this, and we fix it immediately, and then it goes live to all wikis. This is one way of looking at it. So, our test coverage is not as great as it should be, and so we kind of abuse our users for this. We are, of course, working to improve this, and one thing that we started recently is a program for creating end-to-end tests for all the API modules we have, in the hope that we can thereby cover pretty much all of the application logic, bypassing the user interface.
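For illustration, an end-to-end test against the public MediaWiki action API could look roughly like the sketch below. The api.php endpoint and parameters are the standard public ones, but the test framing is just an assumption, not the actual test suite being built.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def test_siteinfo_reports_wiki_name():
    """The API should answer and identify the wiki, without touching any UI."""
    resp = requests.get(API, params={
        "action": "query",
        "meta": "siteinfo",
        "format": "json",
    }, timeout=10)
    resp.raise_for_status()
    assert resp.json()["query"]["general"]["sitename"] == "Wikipedia"

def test_parse_returns_html():
    """Parsing a known page should produce non-empty HTML."""
    resp = requests.get(API, params={
        "action": "parse",
        "page": "Earth",
        "prop": "text",
        "format": "json",
    }, timeout=10)
    resp.raise_for_status()
    assert "<p" in resp.json()["parse"]["text"]["*"]

if __name__ == "__main__":
    test_siteinfo_reports_wiki_name()
    test_parse_returns_html()
    print("API end-to-end checks passed")
```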
I mean, the full end-to-end testing should of course include the user interface, but user interface tests are pretty brittle and often test where things are on the screen, and it just seems to us that it makes a lot of sense to have tests that actually test the application logic, what the system should actually be doing rather than what it should look like. So yeah, basically this has been a proof of concept so far, and we are currently working to actually integrate it into CI; that patch should land once everyone is back from the vacations, and then we have to write the thousand or so tests, I guess. I think there's also a plan to move to a system where we actually deploy basically after every commit and can immediately roll back if something goes wrong, but that's more mid-term stuff and I'm not sure what the current status of that proposal is. And that will be a completely different setup. That would be amazing. But right now we are on this weekly basis, and if something goes wrong, we roll back to last week's version of the code. Are there any questions left? Sorry. Yeah. Okay, I don't think so, so yeah, thank you for this wonderful talk, thank you for all your questions. Yeah, I hope you liked it. See you around at the next talk, yeah, we'll pause.