Hello everyone, I'm your host Sarah R. Welcome to the new season of Wikimedia Tech Talks, which are back by popular demand. Tech Talks are an opportunity for individuals from the technical community to share what they know. Going forward, we'll host in-depth talks on technical topics about once a month. There will also be other opportunities for sharing in the future, including shorter lightning talks and demos; we'll share more information soon. Tech Talks are open to everyone, so if you have an idea for one, please feel free to contact me either by email or on IRC (both of these are on the slide in front of you), or follow the instructions on the Meta-Wiki link that's also on this slide. Today's speaker is Subbu Sastry. He's a principal software engineer on the Contributors team at WMF, and he will be speaking on the long and winding road to making Parsoid the default MediaWiki parser. There will be some time for questions at the end; you can ask these in IRC, on the YouTube stream, or, if you're in the San Francisco office, please come up to the mic and ask your question. And without further ado, I will let Subbu take it.

Thank you, Sarah. Let's see. Okay, do you see this? Yes. All right. Good morning, everyone. My name is Subbu, and I'm going to talk about making Parsoid the default parser for MediaWiki. On the parsing team, our mission is three-fold: it's about wikitext, about the output of the parser, which is HTML, and about the parser itself. Today I'm going to focus primarily on the last piece, that is, how do we use the same parser for reads as well as edits? And that begs the question: what exactly are the two parsers? That's going to be part of the background in the first part of this talk. I'm going to tell you what the two parsers are, why we have two parsers, how they are different, and what the core reason is for having two parsers today. As part of that, I'm going to talk a little bit about HTML, the DOM, wikitext and how wikitext behaves, and how Parsoid tackles all this. In the second part, once we have that background, it's easier to look at the roadmap: where we started, where we are today, why we are porting Parsoid to PHP, and how we get to the end of this road.

So first things first: what are the two parsers? Right now, if you look at MediaWiki, there is the default parser, which takes wikitext and outputs HTML. And there's also Parsoid, which not only can take wikitext and generate HTML, but can also go the other direction: it can take the HTML produced by clients like the visual editor and produce wikitext. The default PHP parser has been in MediaWiki since about 2003. It's currently used for all desktop page views, it's used by mobile web, it's used by the iOS app, and if you're a client that uses the Action API, that's also backed by this default PHP parser. So when did Parsoid come into the picture? Parsoid came in around 2012. It started as a project to support visual editing, but since then it has come to support a lot of other products and tools. It's used by Flow and Content Translation, the Android app is backed by Parsoid output, it's used by the 2017 wikitext editor, the Linter extension uses Parsoid, and Google and the Kiwix offline reader consume Parsoid output via the REST API. The thing to note is that right now Parsoid is written in Node.js and runs as a separate service. So what does it mean to unify the two parsers?
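As a quick aside, here is a rough sketch of what the two entry points look like from a client's point of view today: the legacy parser behind the Action API, and Parsoid behind the REST API. The URLs and the use of file_get_contents are illustrative only; check the current API documentation before relying on them.

```php
<?php
// Fetch the same page from both parsers (illustrative endpoints, no error handling).
$title = 'Badminton';

// Legacy PHP parser, via the Action API (action=parse).
$legacy = json_decode(file_get_contents(
    'https://en.wikipedia.org/w/api.php?action=parse&format=json&prop=text&page=' . urlencode($title)
), true);
$legacyHtml = $legacy['parse']['text']['*'];

// Parsoid, via the REST API (served through RESTBase at the time of this talk).
$parsoidHtml = file_get_contents(
    'https://en.wikipedia.org/api/rest_v1/page/html/' . rawurlencode($title)
);

echo strlen($legacyHtml), " bytes of legacy-parser HTML\n";
echo strlen($parsoidHtml), " bytes of Parsoid HTML (annotated and round-trippable)\n";
```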
As it turns out, for a bunch of reasons, all the routes essentially lead to making Parsoid that single parser for MediaWiki. First of all, the current legacy PHP parser cannot support Parsoid's clients. And, also importantly, the output from Parsoid, the HTML that we produce, is annotated with information about the source page, that is, about the wikitext, and this information is useful for clients for extracting information about the page that was parsed. And long-term, it's really not tenable to have two parsers; it really hamstrings future work. For example, if we want to evolve wikitext, if we want to fix templates, then with two parsers we have to make those changes in both, and that's really not efficient and really slows things down. This is something we've been moving towards since about 2015, which is why I'm calling it a long and winding road; it actually goes back all the way to 2011. Here I would have played you the Beatles song, but I think it may be the case that if I played it, YouTube would take the tech talk down for copyright violation, and we don't want that. But you can imagine the song in your head and how it applies to Parsoid and MediaWiki.

Okay, so on to the next slide. Here are some pretty pictures of some long and winding roads, and they are metaphors for the work that has gone on and that we still need to do. Occasionally it can look treacherous; you go through tunnels, like the one we are in right now, porting Parsoid to PHP, and you hope you come out soon and get on to other things. And occasionally it's a nice and clean path. Sometimes it looks like it's the end of the road, but you don't know what's beyond the horizon, and it may even be a pretty picture. Occasionally it's uphill. Those are all metaphors for the Parsoid project and where we are going, broadly.

So let's set the stage for the parser unification with Aladdin's genie. You might know the story: Aladdin was stuck in a cave, he found a lamp, he rubbed it, out came a genie, and the genie gave him wishes. So let's say we have the genie with us and we could make a wish. We could ask the genie: please port MediaWiki to Node.js and integrate Parsoid with that; no services, it's going to be a monolithic codebase. The question is, now that we have all the code for MediaWiki and Parsoid in Node.js, how many parsers will we need today to support our products? Well, you may think we would just need one parser, but you would be wrong: we would still need two parsers today. How about we go the other direction? Port Parsoid to PHP, once again integrate it with core, no services, a single codebase, all code in PHP. How many parsers will we need today? Once again, you'll need two parsers. All right, so let's make a different wish. We ask the genie: port Parsoid to PHP, but also fix Parsoid's output and feature differences, and fix the extensions that need to be fixed. Then, if you ask how many parsers we need, it's easy to answer: we just need one, and that would be Parsoid written in PHP. Or you could go the other direction: you could ask the genie to port some subset of MediaWiki core and some subset of extensions to Node.js, and once again fix Parsoid's output and feature differences, including extensions.
And if you did that, once again you would probably only need a single parser, in this case Parsoid written in JavaScript. So this might be a little confusing, right? The point of those slides is to get to the core of why we have two parsers today. It's not because one is written in PHP and the other in Node.js. It's not because of the service architecture, whether we have services or a monolithic codebase. Of course those things do matter, but the core and primary reason why we have two parsers today is that the way they process wikitext is different. One is based on processing strings and generating strings, and the other, that's Parsoid, uses a structured representation internally: it uses tokens, it uses a DOM. And as it turns out, this difference in processing and in the pipeline introduces a bunch of differences. So yes, language and service architecture do matter, but more as constraints on the parser. They affect the twistiness and length of the road, but they are not really the primary reason why we have two parsers today.

To make a little more sense of what I said there, let's understand a little bit about HTML and the DOM. HTML and the DOM are web standards. An HTML document has structure and semantics, and this is defined by the web standards. The DOM is the Document Object Model: when you parse a page, the written HTML, and build the DOM, you have a structure, and the objects in the DOM have relationships with each other, all defined by the spec. The reason this is important is that you can use HTML and DOM libraries written in any language, Ruby, PHP, JavaScript, Java, and they all behave the same way. So if you're a client that is looking at the output of a Wikipedia page, you can analyze the page and extract information from it. You don't have to deal with wikitext, potentially, which is where we'll come to in a little while.

As an example, let's look at this piece of HTML, and this is how its DOM might look. A DOM is a tree structure; unlike regular trees, this is an inverted tree where the root is at the top. There's the html element, which has two children, the head and the body. The head has a single child, in this case the title of the document. In this example, the body has two children, the h1 and one other element. The web standard and the semantics of the spec define what this all means, and if there were more children of the body, there would be more elements in the DOM. Given this DOM structure, you can query and manipulate the document. I can, for example, ask: give me all the headings in this document. Or you can ask: give me the second section. Or: give me the first column of the table with the ID "badminton tournaments". Or, if you are interested in modifying the DOM, you can replace the first list item in the list with the ID "greatest bands" with "The Beatles". You can do this with any DOM library in any language.
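To make the query-and-manipulate point concrete, here is a minimal sketch using PHP's built-in DOM classes. The markup and the element IDs are invented for illustration.

```php
<?php
// Build a DOM from a small HTML document and query it, the same way you could
// query the HTML output of a parser with a DOM library in any language.
$html = <<<HTML
<html>
  <head><title>Example</title></head>
  <body>
    <h1>Badminton</h1>
    <h2>Tournaments</h2>
    <table id="badminton-tournaments">
      <tr><td>All England Open</td><td>1899</td></tr>
      <tr><td>Thomas Cup</td><td>1949</td></tr>
    </table>
  </body>
</html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// "Give me all the headings in this document."
foreach ($xpath->query('//h1 | //h2 | //h3') as $heading) {
    echo $heading->textContent, "\n";
}

// "Give me the first column of the table with the ID 'badminton-tournaments'."
foreach ($xpath->query('//table[@id="badminton-tournaments"]//tr/td[1]') as $cell) {
    echo $cell->textContent, "\n";
}
```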
The reason this matters is that if you're trying to answer questions about a wiki page, or you want to manipulate the wiki page, it's important to note that HTML is not the canonical representation of a wiki page; wikitext is. Wikitext is the authoritative representation currently. So you cannot really answer queries or manipulate the page by looking at the HTML with HTML libraries, without having a mapping between the input and the output. For example, if you wanted to ask "give me the infobox content and all its parameters", you won't be able to do that. If you ask "give me all the citations that come from templates", no, you can't do that. Or if you want to replace this infobox with something else, you could, if you parsed and manipulated wikitext. But that means every single product, every single client that needs to do this would have to deal with wikitext, and that's not really a scalable and good solution overall.

So the question, again, going back to what I asked earlier, is: what kind of structure does wikitext have? Do wikitext constructs behave like DOM trees? No, they don't. Wikitext does not have a spec or formal semantics. The behavior of wikitext emerges out of what the legacy PHP parser does, and that parser does not really think in terms of DOM concepts at all. It's primarily concerned with very efficiently constructing the output HTML string. And this makes sense: it was written in 2003, and at that time there was no Foundation, it was all volunteers, and, more importantly, the servers were not as powerful as they are now. It was very important to do this very efficiently, so that particular design made good sense. But that's not the case today. Given this, templates pretty much inherited the same behavior from wikitext, so a lot of templates are, once again, primarily about constructing the output HTML string. Any meaning that you might think templates have emerges from conventions and practices on wikis, not because there's a spec or something well defined there.

To make this a little more concrete, let's look at an example. Here is a piece of wikitext. You have these triple quotes, which signify bolding; inside them you have foo and bar and a template, and outside you have baz. So foo will clearly be bolded; the question is, is bar going to be bolded? That's a trick question: it depends. It will be bolded if your template is one of the three on the first line. The 1x template is just an echo template, it just outputs its parameter, there's nothing special. But if you had the templates on the second line, bar will not be bolded. So already it's clear that you cannot look at a piece of wikitext in isolation; it depends on what templates might do and what they expand to. Let's go further. Let's say your template is this one, 1x with the string A which has bold tags around it. You would think that A would be bolded, right? But the question is, will it be bolded when it's included in the page, in that wikitext above? No, it won't. Foo will be bold, bar will be bold, but A will not be bold. This particular example I've given here is not at all hypothetical. Recently, when we were replacing Tidy, editors had to fix wikitext pages; they had to go change wikitext to make sure that when Tidy was replaced, the rendering would not break. And editors on multiple wikis encountered something very similar to this.
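Here is a minimal, self-contained simulation of why this happens under string-based processing: templates are expanded as plain text first, and the quote markup is then parsed over the whole concatenated string. This is not MediaWiki's actual code; the helper functions and the simplified quote handling are invented for illustration.

```php
<?php
// Hypothetical stand-in for template expansion: {{1x|...}} just echoes its argument,
// as a plain string substitution, with no notion of DOM scope.
function expandTemplates(string $wikitext): string {
    return preg_replace('/\{\{1x\|(.*?)\}\}/', '$1', $wikitext);
}

// Hypothetical, drastically simplified quote parser: occurrences of ''' alternately
// open and close <b>. (MediaWiki's real quote handling is far more involved.)
function parseBold(string $text): string {
    $open = true;
    return preg_replace_callback("/'''/", function () use (&$open) {
        $tag = $open ? '<b>' : '</b>';
        $open = !$open;
        return $tag;
    }, $text);
}

$page = "'''foo {{1x|'''A'''}} bar''' baz";
echo parseBold(expandTemplates($page)), "\n";
// Output: <b>foo </b>A<b> bar</b> baz
// The quotes inside the template pair up with the quotes outside it, so "foo" and
// "bar" end up bold while "A" does not: a non-local effect of string expansion.
```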
The behavior of italics would change depending on the kind of template they were using. Going back: looking at the template, you would expect that both bar and A would be bold, but in reality that's not true. This would only be possible if wikitext had independent parsing, that is, if you parse the top-level page separately from the template and then plug the output into the hole where the template is. That requires independent parsing; that requires DOM scopes without non-local effects. So that's the state of affairs with wikitext today: there are non-local effects in wikitext, and that has a bunch of implications. It has implications for usability: for editors and humans, it makes wikitext hard to reason about consistently. You just can't take a look at a string of wikitext and say what it's going to do. It makes things hard for tooling, because if you're trying to manipulate a page, you have to deal with all these non-local effects. Thirdly, it makes it hard to get really high performance, because you cannot break the page into independent chunks, parse them independently, and put the results together. You cannot do that, as you just saw.

So the question is, how does Parsoid deal with this? Clearly the output has DOM structure. What Parsoid tries to do is assume DOM structure wherever it's required to make this whole thing work, and, given this faux input DOM and the real DOM in the output, it tries to compute a mapping between the two. For example, it treats extension output as an independent document, technically a DOM fragment. It treats link content as an independent DOM fragment, it treats figure captions as DOM fragments, and there are a bunch of others like that. But for templates you cannot do that, and where templates don't really map to a DOM tree, Parsoid tries to expand the range of the template so that it fits a DOM tree. For example, normally, as we just saw, this particular template invocation with A surrounded by bold tags would map to a DOM tree, but in that example Parsoid expands the range of the template to include foo and bar, which are not really part of the template. This is an exercise you can try later: take this piece of wikitext, add it to a page, and try editing it in the visual editor; you'll see the editing behavior. Basically, Parsoid tries to scope and bound the behavior of a template so you can edit it as a single unit. The other place where Parsoid deals with DOM semantics, or DOM structures, is when it adds section wrappers around wikitext sections, where it has to handle scenarios in which a wikitext section doesn't map to a DOM tree, and there are a lot of cases like this on actual wiki pages.
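A minimal sketch of the independent-DOM-fragment idea, assuming we simply parse a piece of content as its own document and then splice its nodes into the page DOM. This is an illustration of the concept using PHP's DOM classes, not Parsoid's actual implementation.

```php
<?php
// Top-level page DOM, with a placeholder where some extension output belongs.
$page = new DOMDocument();
$page->loadHTML('<html><body><p id="slot">extension output goes here</p></body></html>');

// The extension content is parsed as its own, independent document first...
$fragmentDoc = new DOMDocument();
$fragmentDoc->loadHTML('<html><body><pre class="ext-output">self-contained content</pre></body></html>');
$extNode = $fragmentDoc->getElementsByTagName('pre')->item(0);

// ...and only then imported and spliced into the page DOM, so it cannot interact
// with the surrounding markup the way raw string concatenation would.
$xpath = new DOMXPath($page);
$slot = $xpath->query('//p[@id="slot"]')->item(0);
$imported = $page->importNode($extNode, true);
$slot->parentNode->replaceChild($imported, $slot);

echo $page->saveHTML();
```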
So here is a sales pitch for Parsoid. Broadly, Parsoid has demonstrated two big ideas. One is that it is practical to go back and forth between wikitext and HTML; that's how visual editing is possible, and that's how content translation is possible, by doing HTML editing in those products. And secondly, it can actually map an output DOM node to the input wikitext string. As I said, it assumes DOM structure as required to make this work, and the reason this works is that, for the most part, wikitext on pages does the right thing, and there are a whole bunch of templates which do the right thing. But there is also, of course, wikitext with broken markup, and there are templates which don't fit this model, and Parsoid has a lot of hacks and fallbacks to handle scenarios where DOM semantics don't apply.

The other part of the sales pitch is that Parsoid deals with wikitext so you don't have to, and the implication is twofold. If you are a client that is concerned with reading the page, analyzing the page, and extracting information from it, then you can use the spec that Parsoid publishes to query the output DOM and extract information about the input document. For example, Google used to query MediaWiki and extract information itself; it doesn't do that anymore, it uses Parsoid's output, because it's much simpler to do that. And if you're a client like the visual editor or content translation, and you're concerned with editing the document, then you can do that entirely on the DOM; you don't have to deal with wikitext at all, and you can rely on Parsoid to map it back to wikitext faithfully.

So that is the background: we talked about what the two parsers are and why we have two parsers, and I went through some details of HTML, the DOM, wikitext, and how Parsoid approaches all this. In the next part of this talk, let's look at the roadmap: what is the history of Parsoid and where we are today, the evolution of Parsoid itself, why we are porting Parsoid to PHP now (you might have gotten a glimpse of it early on, but I'm going to talk a little bit more about it), and finally what is left to actually get to the end of the road. Here is the 10,000-foot view, again in three pieces. In 2011 and 2012, which is when Parsoid was conceived, feasibility was entirely unknown. For example, when I joined the Foundation in May 2012, a couple of people I know talked to me, and were pretty sympathetic, and told me that I was joining a project which probably didn't have much chance of success. It was in fact unclear where this was going and whether we could actually pull it off. The period from 2013 to 2015 is when Parsoid got established and it was clear it was here to stay. And in 2015 we had this realization: oh no, we have two parsers now, and also two of a bunch of other things. So from 2015 onwards we started looking at unification, and I'm going to talk about what that all meant.

So let's rewind to 2011. This is from a page on wiki, where it says Wikimedia Engineering's key future-facing priority for 2011 and 2012 is to create a rich text editing environment backed by a revamped, normalized, and more consistent wikitext parser. That's pretty ambitious. In May 2011, during the Berlin hackathon, a new parser was planned, along with the next version of wikitext, which I guess we are now calling wikitext 2.0, and the visual editor. Work on this new parser started around November 2011 as part of the visual editor project, and sometime in February 2012 it was renamed to Parsoid.
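As an aside, here is roughly what that spec'd, annotated output looks like to a client, and how a plain DOM library can read it. The snippet is hand-written to approximate the shape of Parsoid's markup (wikilink rel values, transclusion markers, data-mw); consult the published Parsoid DOM spec for the authoritative format.

```php
<?php
// Hand-written HTML approximating Parsoid's annotated output (not actual Parsoid output).
$html = <<<HTML
<p>
  <a rel="mw:WikiLink" href="./Badminton">Badminton</a>
  <span about="#mwt1" typeof="mw:Transclusion"
        data-mw='{"parts":[{"template":{"target":{"wt":"1x"},"params":{"1":{"wt":"hello"}},"i":0}}]}'>hello</span>
</p>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Which links are wiki links (as opposed to external or interwiki links)?
foreach ($xpath->query('//a[@rel="mw:WikiLink"]') as $link) {
    echo "wikilink -> ", $link->getAttribute('href'), "\n";
}

// Which bits of output came from templates, and with what name and parameters?
foreach ($xpath->query('//*[contains(@typeof, "mw:Transclusion")]') as $node) {
    $info = json_decode($node->getAttribute('data-mw'), true);
    $tpl  = $info['parts'][0]['template'];
    echo "template {", $tpl['target']['wt'], "} with params ", json_encode($tpl['params']), "\n";
}
```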
The thing to note here is that even back then, a structured representation was at the core; the plan was to maybe build an abstract syntax tree of wikitext and also maybe use the DOM. By early 2012 the core design of Parsoid was in place, and this is still the design we use in 2019: you take wikitext, parse it to tokens, transform the tokens, build a DOM, and then transform the DOM. The output also has annotations to expose information about the wikitext: for example, what kind of link something is (a wiki link, an external link, a language link, an interwiki link), it demarcates extension output, and it provides information about templates: the name of the template, the boundary of the template, the parameters of the template, and other meta information.

The plan in early 2012 was to prototype this in Node.js and then port it to C++, making it a PHP extension in core. It was meant to be self-contained, with fallbacks to core functionality only for extensions. But by the end of 2012 this had changed a little bit. We had to do a lot of rapid development for the first release in December 2012; by then we had dropped the idea of a C++ port and we deployed the Node.js implementation, and that's how we got a parsing service. We also dropped the preprocessor that was in Parsoid: it had a whole bunch of edge cases because the processing model was different, and it was not clear how it was going to perform, so we relied on the MediaWiki API for expanding templates in wikitext. Overall, this particular design was working fairly well, and we still had a lot of work to do; we had a bunch of compatibility things to resolve before we could even call Parsoid feasible, so it didn't make any sense to put in a whole bunch of additional work before that was done. So we just stuck with this architecture. Of course, this led to all the subsequent debates about services, third-party support, and so on, but back at that time it made perfect sense.

In 2013 to 2015, support for VE had stabilized and new clients were appearing: Flow, Content Translation, the Mobile Content Service which is behind the Android app; Google was using this, Kiwix is using this. With a GSoC project we also built a prototype of a wikitext linting tool, which we fleshed out later during the Tidy replacement, and we will talk about that later. In this period we also had to start dealing with the ramifications of a different wikitext pipeline and a different processing model, and we also had to deal with the fact that wikitext as a syntax was not really designed for going from HTML to wikitext. So we had to deal with a whole bunch of issues around escaping wikitext, and editors becoming unhappy about all the nowiki annotations, and we had to introduce a lot of heuristics to reduce those. And of course, on different wikis, editors were used to different kinds of template formatting, so we had to deal with template formatting issues, and we had to deal with dirty diffs. The reason all these things matter (dirty diffs, escaping, formatting) is that, once again, the canonical representation of a page that editors are used to is wikitext, and a lot of the workflows that editors and admins, really advanced editors, have are entirely based on wikitext, and on wikitext source diffs. So if you add dirty diffs, you make them work really hard. Anyway, all this makes the work of Parsoid harder. In this period we also started helping third-party users with their Parsoid installations and provided a Debian package for Parsoid.
It was also in this time that the debates around services heated up, and Parsoid became the poster child for that debate. I'm going to talk a little bit more about what I meant by the ramifications of a different wikitext model. We looked at this a little bit before: we have templates that produce strings that have no DOM equivalent, templates that produce just an opening tag of a div. There's no way to represent that in a DOM; you either have text or you have the entire content of a div, the opening tag, the closing tag, and the content. You cannot just take an opening div tag and represent that in a DOM. For example, there are infobox styling templates that produce the style of the infobox, part of an attribute on a tag; once again, you cannot represent that in a DOM. You have table cell styling templates which emit something else, and there are templates which produce table start tags and end tags. This means we have to deal with template boundaries while respecting DOM structures. We looked at this earlier with that example, and the rectangles in this case show the boundaries that Parsoid creates for templates. We also started realizing that we need Parsoid-specific versions of extensions, so we have Cite, Gallery, Poem, and a bunch of others. And we had to deal with bad markup, and we had to round-trip it back without dirty diffs; I mentioned this earlier, because editors would be unhappy if we introduced them. I'm not going to go into the details here, but we can if this comes up in Q&A. One of the most prominent markup errors is what is called fostered content; this is when content is embedded in tables at the wrong place, and fostering is an HTML5 spec thing. And as you know, on wikis templates are very common, sorry, tables are very common, so fostered content errors are also fairly common, and this has been a source of a fair bit of complexity.

So, now that Parsoid had been established, we started thinking about what it means to go to a single parser: how do we use Parsoid output for both reads and edits? Also in this time we had to deal with the fact that Parsoid is a service, and there is interaction between Parsoid and MediaWiki; Parsoid had been making a lot of API requests to MediaWiki, so we built an extension, the Parsoid batch API extension, to reduce the request volume. We also had to deal with timeout issues because of cascades between services and retries on timeouts. This was the time when we were stabilizing Parsoid in terms of performance, deployment, and QA: we had about eight Parsoid incidents between 2013 and 2015, four of them in 2015, and with the work we did in 2015 we could stabilize it sufficiently that there was only one after that. This brought us to start thinking about the future: how do we start using Parsoid for reads as well as edits? We started charting options for how to unify the parsers, and we started cataloging all the ways in which Parsoid was incompatible with the PHP parser. The first step here that we identified was to upgrade to HTML5 by replacing the HTML4-based Tidy, because Parsoid's output was based on HTML5, and this was the source of a bunch of differences in the output. We also started speculatively planning around wikitext 2.0: we started writing RFCs and proposals for fixing templates and fixing wikitext semantics. In retrospect, this was probably a bit premature, given that we had two parsers and we really wouldn't be doing this on both parsers, but at least we are well prepared to do it.
Okay, so what does it mean, more specifically, to unify the two parsers? We have taken a two-pronged approach: in some cases we are going to move the PHP parser towards Parsoid, and in other cases we are going to move Parsoid towards the PHP parser. For example, as I mentioned earlier, the PHP parser's output went through Tidy, which was HTML4-based, so we started migrating that to HTML5. We also have a project, not completed yet, to migrate the legacy parser's media output to Parsoid-style semantic markup; this is work in progress which will be completed later. In other cases, we started bridging gaps in Parsoid's output: language variant support was missing, which is once again work in progress and mostly done at this point; red linking was missing, and we fixed that. I'm not going to talk about all these differences; I'm going to focus primarily on the one big project we did, which is replacing Tidy.

This was meant to be a one-year project when we started it in mid-2015, but it turned from one year into three years. There was a blog post we published last year; you can take a look at that for details. We had to do a whole bunch of things to replace Tidy. First of all, after a bunch of different attempts, we finally settled on RemexHtml, which is a pure-PHP HTML5 parser. We knew, going into this project, that it was going to impact pages on wikis, but we really did not know what kind of impact we were talking about, so we built a lot of custom QA tools to figure out the impact on wikis. First of all, we had to spin up two virtual machines in labs: one would run Tidy, the other would run Remex, and we populated both of these VMs with 60,000 pages from about 40 wikis. We revamped some of the QA tools we had for Parsoid, and we had to render a lot of pages on these two VMs, take screenshots, and generate a diff. But the problem we ran into is that these diffs had a lot of noise: for example, if there was even a one-pixel vertical shift, it would show a lot of noisy diffs, which was pretty much useless from a QA point of view. So a new tool called uprightdiff was written, which is based on video motion detection techniques, and this was used for doing the diffs. It would discount vertical motion, which was very important in letting us get quantitative numbers, which we could then use to figure out the impact of the differences between Tidy and Remex.

Once we had all these tools, we did a bunch of testing, and this revealed that replacing Tidy would potentially cause a lot of disruption on wikis, and, more importantly, we realized that editors might have to fix pages and templates to actually mitigate this impact. The reason is that wikis had, over time, come to depend on Tidy-specific behavior. This was inadvertent; it's not something deliberate. Editors wrote wikitext, it behaved right, and they just went on with their work. To deal with this, we initially started adding some compatibility code around Tidy to handle some issues, but it was also going to require a whole bunch of fixes to wikitext and templates, so we had to build tools to assist editors. First of all, we had to figure out what pages were going to be impacted, number one, and, given a page that was going to be impacted, precisely what piece of wikitext on the page would have to be fixed; that's something we wanted to help editors identify.
So we built the linting tool that I talked about earlier, which could precisely identify what was going to need fixing, and we built the Linter extension to expose this to editors on all wikis. We also built the ParserMigration extension so that editors could actually make a fix and verify that the change to the wikitext would actually fix the page. We then undertook 18 months of community engagement, 6 months of which were actually preparation for the engagement. As part of this, we had a bunch of information on wiki, published help around Linter pages, made regular announcements, and gave wikis a one-year window. We did this progressively: it was clear that we couldn't make the switch all in one go, so we started doing the switch progressively, based on monitoring how wikis were doing with their fixes. We managed to get some early adopter wikis, primarily Italian, and I think German and British wikis, to switch over from Tidy to Remex, and based on that we discovered that there were more Tidy-related issues that had to be fixed. So we again fixed code in Parsoid and the PHP parser, made the final switch around July 2018, and finally removed all Tidy support from MediaWiki in version 1.33.

One thing that was very clear from this effort is that we have a lot of legacy constraints. Here are some very prescient thoughts from the 2011 hackathon. Tim Starling writes that there are a lot of features that people count on that rely on the current regime; indeed. And Neil K, another person at the hackathon, says a simpler parser seems possible, but it becomes impossibly distant when you insist on not breaking anything. Of course, it's clear that breaking a lot of pages in the process of upgrading would really not be acceptable, and as we've seen, a lot of work has gone into Parsoid and into replacing Tidy so that we can slowly make progress in evolving our technical stack without breaking the pages that we already have. I think if we want to do more of this in the future, that is, fix wikitext, we can use what we did for replacing Tidy as a template: we can use Parsoid's wikitext linting abilities, use the extensions we have, Linter and ParserMigration, to assist editors in fixing pages where required, and it will of course require community engagement.

This is the only slide on the services debate. I mentioned earlier the question of whether we would have services or no services; until about 2016 there was really no clarity around where MediaWiki's architecture was going to go. More practically, for us, what that meant is that the constraints around the unification of the parsers were very unclear. In 2017 all this came to a head: there were more debates, there were position papers, there were conversations, working groups, and summits in 2018, and parser unification was one of the components of this platform evolution, and one of the first steps was porting Parsoid to PHP.

So let's talk about why porting Parsoid to PHP actually makes sense, because it's going to add more work: we are trying to unify the parsers, and now we are going off and porting. I think it makes sense if we look at what we talked about earlier, so let's bring back the genie. There were two wishes we made to the genie where we ended up with a single parser: wish one involved porting Parsoid to PHP, and wish two involved porting a subset of MediaWiki core and extensions to Node.js. In both cases we still have to fix Parsoid's output and feature differences, whether in PHP or in Node.js. And here is the reason why wish one makes sense.
The underlying assumption here is that we have to fix the service architecture of Parsoid. First of all, it makes sense because it's a limited scope: all you need to do is port the Parsoid codebase to PHP, versus an unknown scope where we say port a subset of MediaWiki core and extensions to Node.js, because we don't know what else that pulls in. The second important reason it makes sense is that when we identify feature gaps, we can actually leverage core code to bridge those gaps. For example, internationalization and localization support for extensions is missing in Parsoid right now, and we can just use core code once we port to PHP. There are some other side-effect benefits from this: non-Wikimedia wikis get a simpler install in some cases, and we're going to get a potentially simpler codebase in Parsoid, since there is no async code and no code to talk to MediaWiki. And, as we are realizing, we are expanding the exposure of the Parsoid codebase to others during this port, and we hope there will eventually be a little bit more clarity about services once Parsoid is taken out of the services picture.

Okay, so once we port to PHP, how do we get to the end of the road? Here are some milestones. The first thing is to finish the port and take Parsoid out of service, no pun intended; we think it's probably going to take four to six calendar months, and we'll see what happens there. Then finish some of the other projects: fix the media output, fix the already-known bugs, finish implementing language variant support, and identify any other Parsoid feature gaps and bridge them with core code, like the internationalization support. Most importantly, we have to fix Parsoid's performance. In order to actually map the input and output and enable clients to work on the DOM, Parsoid does a lot more work than the current PHP parser and is going to be slower; there's really no way around that right now. So this will require some serious performance work, potentially even throwing hardware at it. Then we have to focus on making Parsoid the default on Wikimedia wikis, which means we have to establish regular QA runs to identify any correctness issues. Just like we compared Tidy and Remex, here we're going to compare Parsoid's output and the PHP parser's output, maybe run it once a week and see what we find, analyze the results, file Parsoid bugs, and identify if there is anything that editors need to do on wikis. And we have to finalize a new parser hooks API and migrate over the Wikimedia extensions. So I think the end of the road is probably 18 to 24 months from now, but it could be sooner depending on how things progress and how many people get involved. Since I think we are probably close to the end of time, I'm going to skip this slide, but broadly, the reason we need to change things is that we need to change the types of parser hooks we currently have, which are based on PHP parser internals. And finally, once we are done with this, we can focus on MediaWiki more broadly by publishing the new parser hooks API, publishing docs around fixing extensions, and following the deprecation process. Hopefully that will bring us to the end of the road, and that is about the end of my time. Here are all the people who are currently involved in the porting effort: the names in blue are the parsing team, and the names in green, Brion, Gabriel, and Trevor, were the ones who were originally involved in getting this project off the ground, and the others have been involved with it along the way.

Thank you, Subbu. So feel free to ask questions on IRC.
Subbu, if you put that slide up again, it has the information on IRC and which channel, and then also you can ask on the YouTube stream, or if anybody in the room has questions, let me know. Do you see the slide? Yes. If you have any questions... and in seven more seconds I will assume that we do not have any. If you don't have a question, or if you have a question later after thinking about it, do feel free to either ping Subbu, or you can also ask me and I can send the question along to him as well. Okay. We'll give everyone on the stream a minute to catch up. Subbu, there's a question on the IRC channel about whether the slides will be made available, and yes, they will. They'll be made available after this talk and I'll send out an announcement about where you can find them.

And then we also have a question for you, Subbu: is the rollout timeline 24 months? I mean, I anticipate it's probably going to be 24 months; after the actual port is when we can actually get on with the rest of the unification project. Yeah. So I think it'll probably be 24 months from now.

And then we have another question for you: would you recommend that editing tools or bots start using Parsoid instead of wikitext? I definitely recommend looking at and checking out Parsoid's spec, because it's going to be much simpler; you can just use HTML manipulation. So that's definitely my recommendation, to check it out, and we can provide you more information and help if you need it to switch to Parsoid. Just to give you a sense: the library that bots use is called mwparserfromhell, and I don't know why it was named that; I assume it is probably because it's hard to parse wikitext. So I think it's probably going to be simpler if you just used HTML in some ways. But again, it depends on what you're trying to do.

Okay, any other questions from IRC or in the room? Oh, one more question: will there be a formal spec of wikitext? Well, one of the three goals of the parsing team was to improve wikitext and make it easier to use, parse, and not make errors with. So we would like to publish some spec at some point, but it would be hard while we have two parsers. That will require making Parsoid the default, deprecating certain behavior, and declaring some corner cases as undefined. We would like to, and we may have one at some point, but it's not right away, and definitely, if we move towards wikitext 2.0, it will have a spec in terms of behavior and all of that.

Questions? I don't see any other questions. I do see a comment that there is a formal spec of the HTML and DOM, more or less; I think that's from cscott in the chat on the office channel. Thank you. Yeah, so I mentioned earlier that if you're a tool, you can just use the output of Parsoid, and that output is spec'd and versioned. So if you're only dealing with HTML, that should probably be sufficient for you as a tool writer. And because it's versioned, we also have a protocol for how to upgrade the versions, so it won't just break on you: there is a content negotiation protocol, so you can pass the version header and you can continue to use the old version until you get a chance to upgrade. So yes, there is a spec for the HTML output from Parsoid.
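To illustrate the content negotiation idea just mentioned: a client can pin the Parsoid HTML spec version it understands via an Accept header profile, and keep using it until it is ready to upgrade. The profile URL and version below are illustrative; take the exact value from the published spec.

```php
<?php
// Request Parsoid HTML while pinning a specific (illustrative) spec version.
$context = stream_context_create([
    'http' => [
        'header' => 'Accept: text/html; charset=utf-8; ' .
                    'profile="https://www.mediawiki.org/wiki/Specs/HTML/2.0.0"',
    ],
]);

$html = file_get_contents(
    'https://en.wikipedia.org/api/rest_v1/page/html/Badminton',
    false,
    $context
);

// The response's Content-Type echoes back the spec version that was actually served.
foreach ($http_response_header as $header) {
    if (stripos($header, 'content-type:') === 0) {
        echo $header, "\n";
    }
}
```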
Okay, we'll give everybody just another minute to ask any questions if they have them. I'm sad I didn't get to play the Beatles. You'll have to play it for us the next time you visit. Yeah, you can take it up with Sony. Okay. Well, I think that's it for questions. If you think of something later and you want to reach out, please do; we'll be happy to pass those along to Subbu. Once again, thank you, Subbu, so much for doing this first tech talk of the new season. We're so excited to get started, and I'm really grateful to you for being willing to step up and do the first one and really help out with these. Thank you for organizing this.