An introduction to MediaWiki. In this three-part series I'll be walking you through some of the general concepts of MediaWiki. In parts one and two we'll be focusing on the core components as well as some of the extensions that are installed on Wikipedia. We'll be answering questions such as: what does this component represent to an end user of the software? How is it used by the community, and what role does it play in meeting Wikipedia's needs? Where parts one and two approach these components from a high-level perspective, part three will take a small selection of these components, translate them to how they materialize in the code base day-to-day, and use them to write your first patch. In part three we'll also cover how to make a change to an extension and how to run the tests locally.

So with that, let's get started with the MediaWiki core concepts. For this introduction today, I'll be using the database schema as our map and letting it guide us through the various components that are in core. I will not be covering every single one of these. The idea is rather to give you exposure to a selection of these concepts and to give you a solid base: when you then learn about other components later on, they'll be similar to ones you already know and involve familiar concepts and ideas. You can find this diagram at mediawiki.org/wiki/DB; click "explore database schema" to follow along.

The first concept I'll cover is the user. The user represents a registered account. Such an account can be created through the "Create account" special page. It's worth noting here that you don't actually need to create an account to edit a wiki. Wikipedia allows anyone to edit, and that includes those who haven't created an account. This lowers the barrier to entry even further. Having said that, there are reasons to create an account. There are certain benefits, and there's actually a page on Wikipedia itself about why you might want to do that. The one item I'll highlight from there is the ability to communicate more effectively with other users: you'll have a stable talk page, and things like that. There are other features such as being able to set preferences, being able to follow articles on your watchlist, and lots more. While account creation is public and open by default, it's not required to be. This is very much configurable in the software, and in fact the Foundation's own internal office wiki is an example of exactly that, where account creation is closed, creating an invite-only or "viral" group of editors.

Next up is the preferences system. When you're logged in, there's a link to your preferences page. Preferences include things like beta features that you can choose to opt into, as well as community gadgets with additional features that you can enable. You can change your notification settings for how you want to be contacted. In the back end, preferences are known as user properties, so you might sometimes see that term used instead. The preferences also include a setting for your interface language. This is an interesting one, because the content and the software can be set to different languages. This can be especially helpful when you're assisting people on another wiki, where you might not speak the language of the content that is being written there but can still engage with other aspects of that wiki.
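As a quick aside, preferences are also readable over the Action API. Here's a minimal sketch in Python, assuming the `requests` library and using the English Wikipedia endpoint purely as an example; it reads the preferences ("user properties") visible to the current session:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"  # example endpoint; use your own wiki's api.php

resp = requests.get(API, params={
    "action": "query",
    "meta": "userinfo",
    "uiprop": "options",  # include the preferences ("user properties")
    "format": "json",
}).json()

# An anonymous request gets the site defaults; a logged-in session
# would get that account's saved preferences instead.
for name, value in sorted(resp["query"]["userinfo"]["options"].items()):
    print(f"{name} = {value}")
```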
So for example, when I was a volunteer on the Dutch Wikipedia, I would sometimes help out with how some of the configuration settings are set on the German or Japanese Wikipedia, and I was able to do that easily by setting the language to English or to Dutch. So for example, if we set this wiki to Japanese, the content is still in English. Notice that the setting only applies to my account; I've not changed the configuration for the wiki itself. To give an example of how far-reaching localization settings can be in MediaWiki, I'll set my language to Hebrew instead. Hebrew is a right-to-left language, and you see here that one of the side effects is that the entire layout is now in a different direction. Notice that the content is still written left to right. This is powered by a feature known as CSSJanus, and this is one of the ways in which MediaWiki provides a platform where new features can be developed in a way that essentially works across languages from day one. Most products and extensions that are written for MediaWiki do not need much awareness of individual languages. As long as they use the core concepts, the rest sort of comes for free, automatically. It's really quite powerful.

Next up is the permission system. In MediaWiki there are two terms here that you need to be familiar with: user rights and user groups. User rights are part of the software and are stable over time. They identify specific capabilities that you can have as a user. The special page ListGroupRights, which you can visit on any wiki, will show you an overview of all the different rights that exist in the system. Some of these may come from extensions, where the user rights are likewise a stable part of the software. The user groups, on the other hand, are highly configurable and mutable over time. Individual wiki communities can decide what kinds of groups they would like to have and what combination of rights is assigned to them. This can change over time as well.

One of the default groups that ships with the software is the administrator group. And while the specific rights assigned to it may vary from wiki to wiki, the general idea is that these are elected community members that are able to engage in user management, for example the blocking and unblocking of an account. Administrator rights typically also include the ability to protect a page. This is where, rather than the page being editable by anyone, it is restricted to members of certain groups. For example, you might say the page is only editable by logged-in users, or editable only by administrators. The way you can get added to such a group is through the user rights special page, and you can click on the help there to learn more about how that works. Another user group that ships with the software by default is the bureaucrat user group. If you are a member of the bureaucrat user group, you're able to add and remove members from a particular user group. The ability to grant user rights does not have to be an all-or-nothing thing. The default bureaucrat group is configured that way, but there's also the idea of a viral user group, where you can grant someone the ability to add members only to a particular group. On your local wiki, you'll find that the admin account that was created for you during the installation is placed in both the administrator and bureaucrat groups, which gives you basically all the abilities that you might need.
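If you want to inspect groups and rights from code rather than through Special:ListGroupRights, the same information is exposed via the Action API. A minimal Python sketch, again assuming `requests` and an example endpoint:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"  # example endpoint

resp = requests.get(API, params={
    "action": "query",
    "meta": "siteinfo",
    "siprop": "usergroups",  # the same data that Special:ListGroupRights renders
    "format": "json",
}).json()

for group in resp["query"]["usergroups"]:
    rights = ", ".join(group.get("rights", []))
    print(f"{group['name']}: {rights}")
```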
And what that shows you, of course, is that you can be in multiple groups at the same time, in which case your account will have the combined rights of all of these groups. I'll give you one more example from production. My account on mediawiki.org is an administrator, but not a bureaucrat, and that means that most of these checkboxes are disabled. But I do have the ability to add people to these particular user groups, because on mediawiki.org the community has decided that an admin can actually grant these groups as well. In addition to the local user groups that we see here in production, there's also a concept of global user groups. This is implemented by the CentralAuth extension, which allows a single account to be used across all of the wikis. We'll talk more about that in part two when we talk about the Wikipedia extensions.

Before we move on to logging, I'll cover one more aspect of permissions, and that is the bot passwords feature. Bot passwords allow you to generate an additional password that can be used to log into your account, specifically a password for which only certain user rights are granted. This way you don't have to give full access to your account when you're running some kind of script or bot that you've developed. The way it works is that you pick a name to log in with, which will be appended to your regular username, and you choose which rights you want to grant it. This matters especially if you have an account with additional user rights: right now I'm using the admin account, which would otherwise give maximum rights to the bot that I'm developing. But if the bot only needs to make basic edits, you can give it access to just those capabilities. If you're developing a bot or script that is meant to be used by other people, such as through a web application, then we generally don't recommend using this, but rather to let the user give permission through OAuth. But when you're developing a script just for your own personal use, this is a great way to get started without the complexities of OAuth, as you'll be able to log in simply with a username and password combination, which keeps your scripts a lot easier to get started with.

Next up is the logging feature. The logging feature represents a public record of actions that have been performed on the wiki, specifically actions that are not edits. This can be viewed through the Special:Log page. And while you could go to this URL directly, the way you would normally find this is either on someone's user page, where you can find access to the logs, or from the history of a particular article, where you can view the logs that are about that page. What you would find there are actions, admin actions for the most part, but there are some other entries as well, such as what an admin has done with their user rights: whether they've blocked a particular account and for how long and why, whether a page has been deleted from public view, whether protections have been applied that restrict editing, and things like that. This is particularly important for transparency and openness. It allows the community to hold the administrators it has chosen accountable, and allows people to independently verify. I would say that a big part of the ethos of the wiki approach is that you shouldn't have to trust what someone says when it comes to bringing in new information.
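To make the bot password flow concrete, here's a minimal Python sketch of the two-step login, assuming `requests`; the endpoint, the username, the bot name, and the password are all hypothetical placeholders for the values Special:BotPasswords gives you:

```python
import requests

API = "https://example.org/w/api.php"  # placeholder endpoint

session = requests.Session()  # the session keeps the login cookies

# Step 1: fetch a login token.
token = session.get(API, params={
    "action": "query", "meta": "tokens", "type": "login", "format": "json",
}).json()["query"]["tokens"]["logintoken"]

# Step 2: log in. The name is your username plus the bot name you picked
# on Special:BotPasswords, joined with an "@" (both hypothetical here).
resp = session.post(API, data={
    "action": "login",
    "lgname": "ExampleUser@mybot",
    "lgpassword": "the-generated-password",
    "lgtoken": token,
    "format": "json",
}).json()

print(resp["login"]["result"])  # "Success" when the credentials are accepted
```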
On the content side, this is manifested through the "no original research" policy, which says that things should be verifiable through an independent source, but the same applies to social communication and administrator actions as well. If someone says they did a certain thing, you should be able to know that they have done that for a fact.

Lastly on the first row, we'll cover the comment table. The comment store, as it is known in the code base, stores small pieces of text that are associated with either an edit or a log action. For example, when you're making an edit on the wiki, you're asked to enter an edit summary, and that is stored as a comment. When you perform an administrator action, such as granting someone certain user rights, you're asked to enter a reason as well, and both the log reason and the edit summary are internally referred to as comments.

Next up is recent changes. Whenever an edit is saved on the wiki, in addition to it being saved in its primary place in the database, which we'll cover shortly, an entry is also added to the recent changes feed. The built-in interface to this feed is the Special:RecentChanges page, which looks something like this. From here you can see a real-time feed of edits as they are made on the wiki. There are various filters you can apply, and the larger the wiki, the noisier this generally gets, so there are quite a few filters for specific interests that you may want to use. The most personal of these is the watchlist. The watchlist shows you essentially a subset of the recent changes: those that intersect with the pages you are following. However, the underlying feed for recent changes is also accessible to tools, and that's really quite important for Wikipedia, because it powers all of the various monitoring tools; I'll show you some examples in a moment.

In addition to monitoring and counter-vandalism tools, it can also be used for some pretty cool visualizations. One such tool is Listen to Wikipedia. It shows you the edits to the site as they're happening in real time, with different notes being generated in the background, which is quite pleasing to listen to. Another example is this CodePen that I put together a while ago, which shows you how the traffic for new edits is distributed across the different wikis. We can see here that on average there are about 30 edits per second, or one to two thousand per minute. In terms of more interactive applications, we have both automated and manual tools. The automated tool that I'll mention as an example is ClueBot NG. ClueBot NG is a real-time counter-vandalism bot that subscribes to the recent changes feed and in real time assesses whether an edit is likely to be vandalism; if it reaches a certain threshold, it will automatically revert that edit. You can see the examples here: it's performed more than six million edits since 2010, and I would say it's quite critical to the quality control of Wikipedia to have a tool like this. And that's not just through automation but through human effort as well. We have tools such as Huggle that subscribe to that same feed and make it super efficient to go through a large number of edits in a short period of time, also allowing the various patrollers, as we call them, to collaborate.

Next up is the page. The page is central to almost everything in MediaWiki, and this is a concept that we'll be revisiting a few times throughout the hour as we progress.
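To give a sense of how tools consume the feed, here's a minimal Python sketch that reads the recent changes list through the Action API. (Wikimedia wikis also offer a dedicated real-time EventStreams service, which is what many of these tools use in practice; the API list below is the generic MediaWiki interface. Endpoint and `requests` usage are assumptions, as before.)

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

resp = requests.get(API, params={
    "action": "query",
    "list": "recentchanges",
    "rcprop": "title|user|timestamp|comment",
    "rclimit": 10,  # the ten most recent changes
    "format": "json",
}).json()

for rc in resp["query"]["recentchanges"]:
    print(rc["timestamp"], rc["title"], "by", rc.get("user", "?"))
```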
Every article on Wikipedia is a page, but lots of other things are pages as well, and they're usually identified by a particular namespace. We can look at the list of namespaces that we have. There's a number of built-in namespaces in MediaWiki core, and extensions can add additional namespaces to this list. As covered, the main namespace, the one that has no prefix in its name, is reserved for articles, so what we consider content. Depending on what the wiki is about, you might not call it an article; you might call it something else. For example, on Wiktionary it's called an entry, and on Wikibooks it would be called a book. The user namespace is reserved for pages about individual users, so that's typically where you might write a short bio about yourself, or where you can host your profile page, basically. For every page we also reserve a discussion page, and we call this the talk namespace. In order to make sure that every page has a reserved talk page, we use the odd-numbered indexes for this, so whatever namespace identifier there might be, there's always a talk namespace right next to it.

Speaking of namespaces, let's talk about the special namespace. The special namespace is reserved for the software's own interfaces, and the pages in this namespace always exist and cannot be created by editing a page. You've already seen several special pages today. For example, creating an account works through a special page. Setting your preferences is a special page. Adding or removing user rights is a special page. And one more example: the search results page is also a special page. Almost any kind of user-facing feature tends to be implemented as a special page. You can find a list of all special pages that exist on a particular wiki through the special page called SpecialPages. This is a page automatically generated by the software that lists the special pages that currently exist in the system.

Next, we'll move from pages over to revisions. To understand how revisions work, let's start by actually saving an edit. I'll add the word "hello" to this page and press "Save changes". If we go to "View history", we can see the past revisions to this page. Each entry here represents a revision, and you can compare two revisions to see the diff between them.

When we create or edit a page, the actions occur bottom-up. The first thing we do is store the source text, in other words the wikitext, in the text table; that's over here. After adding the text, we then add the revision. The revision represents the metadata of the edit: which page was modified, who modified the page, and when. The revision points to the text. And lastly, we update the page record to indicate what is now the latest revision of the page. In production, the setup is slightly different from this. Rather than the text being stored in the text table, it is stored in a separate cluster known as the external store. The reason for this is that the text table is by far the largest table of a given wiki: it contains the entire wikitext of every single revision, whereas everything else is essentially just metadata. To learn more about how this works, you can follow the link to the text table, and in particular the page about external storage.

Now that we know the basic elements of editing a page, let's think about viewing a page. When you view a page, we essentially perform the same three steps but in reverse.
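To make those three bottom-up steps concrete, here's a toy sketch in Python using SQLite and a heavily simplified version of the schema. This is an illustration only: the real tables have many more columns, and recent MediaWiki versions route revisions through slot and content tables rather than a direct text pointer.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE text     (old_id  INTEGER PRIMARY KEY, old_text TEXT);
CREATE TABLE revision (rev_id  INTEGER PRIMARY KEY, rev_page INTEGER,
                       rev_text_id INTEGER, rev_timestamp TEXT);
CREATE TABLE page     (page_id INTEGER PRIMARY KEY, page_title TEXT,
                       page_latest INTEGER);
""")
db.execute("INSERT INTO page VALUES (1, 'Sandbox', 0)")  # a pre-existing page

# Step 1: store the wikitext in the text table.
text_id = db.execute(
    "INSERT INTO text (old_text) VALUES (?)", ("Hello",)
).lastrowid

# Step 2: add the revision metadata, pointing at the text row.
rev_id = db.execute(
    "INSERT INTO revision (rev_page, rev_text_id, rev_timestamp) "
    "VALUES (?, ?, ?)", (1, text_id, "20240101000000")
).lastrowid

# Step 3: update the page record to point at the new latest revision.
db.execute("UPDATE page SET page_latest = ? WHERE page_id = ?", (rev_id, 1))
```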
For example, if you were to navigate to the talk page about the banana article, step one is to parse the title from the URL; we see here a namespace, a separator, and a title. This is then used to find the page in the database. Having found the page, we can then use the page_latest pointer to find the current revision. The revision in turn allows us to fetch the corresponding wikitext for that revision. The wikitext is then processed by the parser, or by Parsoid, to render it into an HTML web page.

Next up, let's talk about link tables. A number of different features are powered by link tables, but the main one I'll be showing today is categories. When you view an article on Wikipedia, at the bottom of the page you can find one or more categories that are essentially tagged onto this page. For example, for the WordPress article, there is the category "Free software programmed in PHP". Each category can be represented by a page, and like any other page it can be edited, it can have a history, it can have a talk page. What's interesting about categories in particular is that the editable portion is only the description text. The rest of this page is automatically generated based on which articles are currently tagged with this category. On the edit page, we then see that the edit window only contains the description and some other metadata. The fact that categories are pages does not take away from the fact that article categorization is a first-class feature in MediaWiki, and indeed it has its own database table as well as its own link table. However, when it comes to deciding where we store the text for the description, we connect it to a page. This is great for interoperability, because it allows reuse of many other capabilities. For example, the monitoring systems that we looked at earlier, for recent changes, for history, for watchlists and all of these things, apply to category descriptions automatically, without these systems having to know what categories are or what their limitations or features might be. To learn more about this philosophy, you can check out the page on mediawiki.org aptly named "Everything is a wiki page".

Closing the loop with the link tables: these represent the outgoing links from any given page. Shortly after an edit is saved, the metadata that was extracted by the parser is saved in the link tables. For example, if we look at the WordPress article one more time, each of these category associations is stored as a category link. And when you then view a category page, we simply list the articles that have an outgoing link to this category. In this list we find WordPress, as well as other familiar software such as MediaWiki and Phabricator. Given that the category itself is represented as a page, this also gives rise to the category tree. The category page itself has categories, and this allows you to create a tree of various categories, essentially parent categories or super categories.
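Circling back to the page-view steps for a moment: you can exercise most of that pipeline through the Action API as well. A minimal Python sketch (assuming `requests` and the English Wikipedia endpoint) that fetches the latest revision's wikitext for a title, which is steps one through three; the final parse to HTML would be a separate `action=parse` call:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

resp = requests.get(API, params={
    "action": "query",
    "titles": "Talk:Banana",  # namespace prefix, separator, and title
    "prop": "revisions",
    "rvprop": "content",
    "rvslots": "main",
    "format": "json",
}).json()

page = next(iter(resp["query"]["pages"].values()))
wikitext = page["revisions"][0]["slots"]["main"]["*"]
print(wikitext[:200])
```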
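And to see a category link table in action from the outside: listing the members of a category is a single API call. Another small sketch under the same assumptions:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

resp = requests.get(API, params={
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Category:Free software programmed in PHP",
    "cmlimit": 20,
    "format": "json",
}).json()

for member in resp["query"]["categorymembers"]:
    print(member["title"])
```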
Briefly covering the other link tables: the pagelinks table, for example, contains all of the blue and red outgoing links from the article text. The templatelinks table keeps track of which templates are embedded in a given article. For example, on the WordPress article there is a navigation template. This template is embedded from a separate page, which can be edited in one place. Each of these articles embeds the same navigation template, and when we make edits to this template, it will be reflected on each of those articles. The way that works is that after an edit is saved on the template page, we perform a so-called refreshLinks job, which finds out which articles embed this template through the templatelinks table and then re-renders each of those articles, thus applying the latest version of the template. This is powered by the job queue, and we'll cover more about that later.

The imagelinks table keeps track of which images have been embedded in a given article. For example, this article about a Dutch drink contains a photo, and if we click through to the Commons page about that photo, we can find where it is used. This is powered by the imagelinks table; specifically for Commons, it's powered by an extension known as GlobalUsage. This can be used to measure the impact of a particular photo and can be quite engaging for people who contribute photos in this way. They might initially upload a photo with the intention to use it in one article, but then later find that it's been used in many other articles as well. And that applies not just to individuals but to institutions as well. There's an acronym that we use in the movement, GLAM, which stands for galleries, libraries, archives and museums. Many of these institutions contribute large collections of photos to Wikimedia Commons, and by listing where a photo is used, we can measure how impactful a particular contribution was. For example, you could enter the name of an institution, or indeed an individual, and see how many times one of their photos or diagrams has been used across the different wikis. This is something that institutions often ask for when they're considering donating large collections to Wikimedia Commons.

The last link table I'll mention is the externallinks table, which stores links to other websites. This is quite a powerful feature, because it allows you to find, across the entire wiki, where links are made to a particular website. For example, I've entered a search here for the BBC website, and you can find each of the articles that has a BBC reference in it. In addition to measuring this for websites that you might consider useful, where it might be used as a reference or a citation, it can also be used as a way to combat spam. For example, if as a content reviewer you've found a link to a spam website in an article, you can use this feature to see where else links to that same domain were made. I'll show a small sketch of querying this below.

And that brings us to the statistics feature, quite a small table. It usually contains only a single row, with various counts that we increment after relevant events. It can be seen through the Special:Statistics page. The counts are incremented through deferred updates or job queue jobs after relevant events, including after every single edit.
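Here's that external links search as a minimal Python sketch against the Action API, with the usual assumptions about the endpoint and `requests`:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

resp = requests.get(API, params={
    "action": "query",
    "list": "exturlusage",
    "euquery": "bbc.co.uk",  # the domain to search for
    "eulimit": 10,
    "format": "json",
}).json()

for entry in resp["query"]["exturlusage"]:
    print(entry["title"], "->", entry["url"])
```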
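And the statistics row is just as easy to read programmatically; Special:Statistics and this API call expose the same counters:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

stats = requests.get(API, params={
    "action": "query",
    "meta": "siteinfo",
    "siprop": "statistics",  # the counters behind Special:Statistics
    "format": "json",
}).json()["query"]["statistics"]

print(stats)  # includes counts such as pages, articles, edits, users
```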
Next is search. Search is one of several components in MediaWiki that is fully pluggable. That is to say, there is a default implementation that has no dependencies, but it can also be fully replaced for a large-scale production use case such as Wikipedia. The default implementation that comes with MediaWiki actually provides full-text search through an SQLite database file; it doesn't even depend on MySQL, in fact. In production we use the CirrusSearch extension to index the site through an Elasticsearch cluster instead. The key takeaway here is that there is a stable interface between the search component and the rest of MediaWiki, such that the interface doesn't depend on a specific implementation. This has benefits for the Foundation, in that it allows us to evolve our infrastructure and replace dependencies as we go. And it means that for a small install, or for local development, you're not required to install additional services.

Next is the job queue. The job queue allows you to perform work in the background, separate from a user-facing request or response. This makes it suitable for batch processing tasks, as well as for shorter tasks that you can carve out from a web response in order to make it faster. You can read this page on mediawiki.org to find several examples. It starts with the one that we've already mentioned, about template changes, and there are several other examples listed there as well. You can learn more about the general principles we use to speed up web responses on Wikitech, in the backend performance practices guide. Indeed, one of the general principles we use is: whenever a task can be deferred, we typically defer it. And that is not just tasks that take several seconds to run, but even, for example, a task that might only take a few hundred milliseconds. If it's something we can carve out from the web response, it's usually worth deferring.

The job queue system, similar to the search component, is also a pluggable service, and it's actually two different services: there is the job queue and there is the job runner. The job queue is what stores the jobs, and the default implementation is to simply append the job description to the job table. The default job runner is actually part of the page view cycle. After the web server has finished sending any given page view response to the browser, MediaWiki will instruct the server to do some additional work in the background. This happens fully server-side and asynchronously, and does not delay the page view itself.

Now, whereas the Foundation has been using an alternate implementation for search since at least 2008, starting with Lucene and later Elasticsearch, the default job queue actually served the Foundation for quite a while. It was used from 2009 all the way up to 2013, storing jobs in a single table, appending them there and removing them when a job was executed. Alternate backends for the job queue include Redis and Kafka; since 2017 the Foundation uses Kafka. As for the job runner, the way it is typically scaled up is by disabling the default one and instead setting up a cron job: for example, every one to five minutes, a cron job might automatically execute all the pending jobs.
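To illustrate the shape of that default, database-backed design, here's a toy Python sketch: jobs are appended to a table during the request, and a runner later pops and executes them. This is an illustration of the concept only, not MediaWiki's actual code, and the job name and payload are hypothetical.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE job (job_id INTEGER PRIMARY KEY, job_cmd TEXT, job_params TEXT)")

def push_job(cmd, params):
    # Called during the web request; appending a row is cheap,
    # so it doesn't slow down the response.
    db.execute("INSERT INTO job (job_cmd, job_params) VALUES (?, ?)",
               (cmd, json.dumps(params)))

def run_pending_jobs():
    # Called later: by the post-response hook or a cron'd runner.
    rows = db.execute("SELECT job_id, job_cmd, job_params FROM job").fetchall()
    for job_id, cmd, params in rows:
        print("running", cmd, json.loads(params))  # a real runner dispatches to a handler
        db.execute("DELETE FROM job WHERE job_id = ?", (job_id,))

push_job("refreshLinks", {"page": "Template:WordPress"})  # hypothetical payload
run_pending_jobs()
```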
The last section for today is multimedia. User-uploaded media files are a first-class concept in MediaWiki, and they too are represented by a page. For example, here's the file page of yesterday's picture of the day. In addition to being a wiki page that is used to describe the photo in question, there's also various additional metadata that is automatically calculated and added below the page. This is similar to how category pages have an editable portion as well as an automatically generated portion; in the case of files, the generated portion is the photo at the top as well as the metadata at the bottom. The multimedia component, and specifically the so-called file backend, is also a pluggable service class, similar to search and the job queue. By default, uploaded files are stored in a local directory, as might be the case on your local development wiki. At scale, this can be replaced with an alternate backend, either a networked server or a cluster that you run; in production, we use a Swift cluster to store files.

That concludes the overview for today. I'll see you in part two.