 Let me tell you about what goes into curing a critical security bug So at the start of the start of the year wordpress 4.1.2 was released With a fix for what was one of wordpress's most severe security bugs ever What we affectionately call the Trojan emoji bug The name actually works on two levels First that we used emoji as a cover for a committee that fixed wordpress But also that emoji can be used to trigger it now Andrew Nason gave a talk earlier this year about About how this bug works If you haven't seen it, I strongly encourage you to check it out It's called anatomy of a critical security bug and it's up on wordpress TV For everyone who hasn't seen it or doesn't recall the salient details allow me to give you a brief refresher So my SQL supports a feature called character sets Character sets in my SQL are simply a standard way of storing text on disk whether that be letters numbers or emoji That's all there is to it So by the far by far the most common Character set in my SQL is utf8. It's certainly the most common in wordpress prior to wordpress 4.2 About 95 percent of all wordpress sites used utf8 Now utf8 is kind of interestingly named because it's not actually utf8 The second utf8 is the actual standard and was only implemented in my SQL 5.5 as the utf8 mb4 character set As of wordpress 4.2 we default to using utf8 mb4 when we can But the utf8 character set on the other hand was only a partial implantation of the utf8 standard The utf8 character set was actually an implementation of what's called the basic multilingual plane or BMP The BMP covers most European languages and most of the And the most common East Asian characters But it's really only about 6% of the entire utf8 standard any character outside of that for example again emoji Is not supported by the utf8 character set But you can easily type an emoji character. So how does wordpress handle it when you try to save an emoji character? See what happens when you type in a mode If you type an emoji character into a comment box on a wordpress 4.1 site before this security fix and submit it Take this slightly contrived, but not particularly unusual comment. It's a link to a site It has a title. It has some slightly unsavory unsavory styling, but apart from that it all looks fine The entire comment goes through several levels of sanity checks before making its way to WP cases cases is a function that Is a function for checking that the HTML is valid and removing any invalid tags or attributes In this case style attributes aren't allowed on on link tags So cases will remove that and we end up with a validated comment This comment goes through several more levels of sanity checking before it finally arrives at WPDB Now WPDB is the interface between all of wordpress and MySQL The job is its job is to insert modify and retrieve data from the database It also makes two assumptions about data with regards to sanity checking Number one it assumes that all that any section of wordpress sending it data has appropriately sanity check that data Within the scope of that section. So the comment code ensures that it's a valid comment. That assumption is a hundred percent correct It also assumes that the data it sends to MySQL will be written exactly Will be written to the table exactly as it sent this turns out to be not correct Now remember what I said earlier about MySQL not being able to store emoji in the UTF-8 character set Here's to where the problem comes in what MySQL does is it truncates the string immediately before the character It doesn't recognize So that gives us this which is very clearly wrong. It's totally invalid HTML Cases would have removed it earlier, but when cases was looking at the string It was valid HTML and so we've written unverified HTML to that to the database And it can be used as a script as a vector for a cross-site scripting vulnerability This vulnerability didn't just apply to comments, of course it applied to anything that could write to the database comments posts Options just to name the vectors in core Any plugins that wrote to the database were vulnerable your feedback form your e-commerce site Your booking form and anything could be used to carry out this attack. And so that was the problem And we needed to figure out how to fix it. We knew from the beginning that we had two options The first was called strict mode MySQL has a handful of settings which change how it behaves as an SQL database called modes One of those modes is called strict mode Strict mode fixes this problem by rejecting invalid strings instead of truncating the comment It would just refuse to insert it and return an error the this behavior by itself is perfect Unfortunately, that's not all that strict mode does it also adds a bunch of extra restrictions to Which changes how some queries work with MySQL and this would have ended up breaking the vast majority of WordPress sites Our second option is what we call pre-flight checks This takes the string being inserted and determines if my SQL will be able to insert it correctly This is the trickier option, but in order to not break many many sites. It's really the only one we had There are a few requirements for we have for how this patch had to work It needed to be a hundred percent accurate We couldn't afford to leave additional holes open as announcing this bug wind would inevitably cause extra security To this area of code it needed to be fast Ideally, it shouldn't add any extra queries or if it did they should be as few as possible And it needed to be invisible It should do it should just do its job and silently protect the database the average WordPress site should never see the difference With that in mind, I present to you the our first pass at fixing this bug This little block of code would have been added to WPDB's prepare method. You don't need to worry about reading it It's notable for two reasons one that it's the only version of the fat of the patch that fits on one slide It's it's about 1% of the size of the final of the final patch we committed and To it's completely wrong. It's fast. It's invisible and it doesn't work The the regular expression we used is totally wrong for UTF-8 it matches valid valid characters in the big five character set And there are probably other bugs in it that we that we didn't find before it was abandoned And so from those inauspicious beginnings, we began experimenting Our next experiment went through several iterations and several different names But the idea of strip invalid text remains Remained the same given a string and a character set it would try and remove invalid characters locally if it could or By asking my SQL to handle the string if it couldn't now this experiment Unfortunately introduced two of the bugs that ended up making it one of the two bugs that ended up making it into the final patch It used PHP's MB converting coding function to try and remove invalid characters So MB PHP's multi byte functions are a great under very specific circumstances Specifically you need to be a hundred percent certain of the character set of the string you're converting You can't rely on the character set of that PHP is configured to for example because it might not match the input string If you don't know the character set of the string of the input string it means you have to guess And so while we got pretty good at guessing it just isn't a hundred percent accurate which can introduce bugs We iterated on this concept for a while We even ran into the into some bugs with the UTF-8 character set in PHP So we moved to using this regular expression that you can see here and which did make it into the final patch The good news is that this regular expression does work Because it's based on the UTF-8 spec In hindsight it should have been a hint that there were probably bugs with other with other character sets as well The next problem we ran into is that you can do write queries directly to Directly to the database using WPDB's query method instead of using the helper insert or update functions So we had two options here. We could write an SQL parser in PHP to extract the data fields and figure out if we could insert them or not or Just to check if it's probably a write query and run our tests over the entire query string Obviously we went with the latter The use write query function was comfortably running in hyperDB for years So we borrowed that and use it as a test for whether an entire query string needs checking I mentioned earlier that earlier that there were two bugs that ended up making it into the final patch It was around this time that we introduced the second bug We'd been starting to test this patch on WordPress.org which was great for finding edge cases There's a lot of legacy code running on WordPress.org During that testing we discovered an interesting case that we hadn't considered How do we do pre-flight checks on tables that have different character sets on different columns? So we went back and added some more architecture The pre-flight checks were already happily running at this stage They'd connect to my SQL determine if the data could be inserted and then they they returned So for this to work We had to switch to doing pre-flight checks for every different character set in the set of columns referenced in the query The issue here is that we overthought the problem It seems fairly logical different character sets need to be checked differently But it was an incorrect conclusion. You can't actually change the connection character set mid-query. So this extra flexibility We added was at best useless. The worst case is it could have been a new vector for attacks The good news is that we ended up removing it but not before it managed to be the source of quite a few headaches We also tried enabling strict mode in my SQLs in WordPress's nightly builds Suffice to say the reaction was swift and overwhelming There were many different plugins and use cases that were broken by strict mode. So we quickly rolled it back Strict mode is certainly something that I'd like to revisit in the future though I do have some theories about how it could work and with recent versions of my SQL moving away from non strict behavior We'll will need to adapt in much the same way that we needed to adapt to some of the changes in PHP 7 So if you happen to have thoughts or opinions on strict mode, I'd certainly be happy to chat about them Come and find me today or tomorrow here or on Sunday at the hack day As part of this bug fix we added functions to automatically remove invalid text what when we were comfortable doing so For example, we'd prefer to remove invalid characters from a post title rather than blocking a post from saving and risk losing that post One of those cases we tried handling were serialized objects or arrays It's not possible to remove invalid characters from the serialized string So as that causes it to no longer be a valid serialized string instead We needed to un-serialize it Recurs through all of the data to check to check each member individually then serialize it again This was inspired by the WP JSON encode function, which does a similar thing before JSON encoding data Luckily, we came to the conclusion that this was way more complex than we actually needed and it was cut from the final patch So far we'd only focused on right queries as that's where the danger lay We didn't want invalid data being inserted in the database We did begin to wonder if this bug could be modified to exploit read queries as well As it turns out it was possible It didn't directly affect WordPress core that we could discover but it could very easily affect plugins or themes in weird and subtle ways For example, let's take a fictional user table and we'll insert an admin user Sometime later an attacker registers a new user with a slightly different name as you can see it has an emoji character at the end of it It's obviously different to the admin user we inserted, but what does this query actually return? So that's a problem returning incorrect results can lead to Information exposure or even privilege escalation attacks. So we needed to close off this hole as well The good news is that thanks to the way the entire fix had been architected It was only an extra few few lines of code to check or read queries whenever they needed to be checked Naturally, we reported this bug to Oracle, MariaDB and several other MySQL fork maintainers who've taken action to fix it And so this brings us to April 21st of this year WordPress 4.2 4.1.2 was released with this bug fix the release itself went fine Millions of updates rolled rolled out thanks to the auto update system Unfortunately, there were some bugs that made it into the into the final release a day or two after release Someone noted that individual characters weren't the only thing that would cause MySQL to trunk to truncate a string it would also truncate a string that was too long for the field It was being stored in in the case of comments. It's a 64 kilobyte field So everything after the sixty five thousand five hundred and thirty six character was removed It's it's pretty easy to generate a string of that length and submit it as a comment to any WordPress site So we quickly released a new version to deal with this bug The good news here was that The way we architect the original fix again made it pretty easy to add in this extra check We're already getting column info to from the database for the character set So we just needed to test against the column length as well We also had some problems with some of the less common character sets CP 1251 which is used in several Eastern European countries and EUC JPMS which is used in Japan most of these are legacy character sets, but they still have Substantial enough user user bases that this fix cause significant problems The primary issue here as I mentioned earlier was that we assume that PHP's multi-byte character Multi-byte libraries had the same behavior as my as my SQL's character sets, which totally wasn't the case We removed those checks now Everything goes either through the utf-8 regular expression we saw earlier or it goes directly to my SQL for checking So looking back, I think there are a couple of important things we learned from this process Allowing us to improve the development process within WordPress core unless there's lessons that you can incorporate into your own development practices First of all, if you've assumed something is true test it The hardest part is figuring out what your assumptions are But when you create something you take the time to step back and see what it relies on Does it rely on those things correctly? Does it take different conditions into account? You can make it easier to figure out your assumptions by simplifying your code Don't be afraid to throw everything out and start again In the initial ticket alone for this bug there are over 50 patches including five complete rewrites On this on the subsequent follow-up tickets There are another 50 or so patches several of which made significant changes to the code or completely remove the parts that didn't work No matter how clever a piece of code is if it doesn't solve the problem correctly, it should be removed Don't try to build the ultimate solution I've worked on a done on a bunch of different projects across WordPress core on Jetpack on WordPress comm and it occurred to me recently that large chunks of the code I've written Large chunks of the code I've written over the years have either been removed or replaced not because it was bad or wrong Well, some of it was bad, but now it's not the time to go into that Because it was bad or wrong But because it's the nature of software development that the code you write will be replaced over time Instead of focusing on making it on making the ultimate solution Make something simple so that the person who comes after you can easily incorporate it into a bigger vision As they say we all stand on the shoulders of giants So when it comes time for someone to stand on your shoulders make it as easy as possible for them to reach higher So before I'd finish I'd like to give props to my co-conspirators on this voyage Andrew Nason WordPress lead developer Nikolai Bacchewski WordPress securities securities are and Mike Adams magical coding wizard and Of course the entire WordPress security team for acting as sounding board tester idea source and sometimes personal counselor And that is the story of how we fix the Trojan emoji bug. Thank you I'd also like to offer a warning this fairly innocuous conversation was how I was introduced to this bug So if Nason ever tells you that he has something in for interesting for you, you may be getting more than you bargained for Thank you Morning hi It's on alright It sounds like a lot of the complexity from this came from supporting many different configurations lots of different legacy stuff Are there plans to force migrations to a consistent table or call our characters that formats? No It's it's one of word presses strengths I think is its flexibility and its ability to be used for on many different configurations and for many different use cases So I think it's it's important that we continue to support as many different character sets as possible And if people if people want to use it on crazy configurations, then we let them do that Hey, Gary. Hi. I think there were concerns when this was being fixed that other open source Packages might be affected by the same bug. So since it's been released. Has that come to light any other? things being affected No, not that I'm aware of I think a lot of a lot of other open source CMS is to give an obvious example Drupal is as quite it's quite well used and Drupal takes a different tack to what we to what we do is that they Basically rewrite on a regular basis and break backwards compatibility which isn't an option for For word presses we want to maintain backwards compatibility as much as possible, which does mean that We do have to write some Some of these things together to keep things working, but it's the way we view it is it's it's important that we it takes a few of us to write a few of us a year to write this patch is better than breaking half of all WordPress sites and Take meaning that hundreds of thousands or millions of developers have to spend Days or weeks fixing their code. I was just curious what What was the reasoning for including emoji into WordPress? I mean, obviously there's a the fixing the bugs in the database of traction, but Why actually include the emoji script on every site? I know there's a lot of contention from a lot of people out there about that so what Your site already supports emoji so what we were what we really did with the with the emoji script is fill-in compatibility like what we do for example with with adding in hacks to to To fix internet Explorer bugs the same thing we do to make sure that your browser Displays emoji correctly emoji by itself is is a language So it's important that we allow people to communicate in whatever language they choose Hi Gary You mentioned the WordPress security team and what is Can you introduce people to Responsible disclosure if they happen to find a security bug and how to reach your security team to properly Handle the situation without making a mess of course So security team is made up of about 30 different people from various parts of the WordPress community If you do find something that you think might be a security bug then email security at wordpress.org Please don't submit it on the public bug tracker security tickets do need to be handled privately and Even if it's not a bug then we'll get back to you and let you know If it is a bug then it's much better that it's that it's handled privately I'm just wondering how serious of a bug this actually is if by for whatever reason you choose not to upgrade Or can't upgrade for whatever reason very serious it affected 90-something percent of sites any site that had the where comments were turned on for example and it's Excuse me it It allowed an attacker to completely take control of your site So it is very important that you upgrade Smiley face thumbs up smiling pile of poop Thanks I've found working with various character sets when doing code development sometimes a bit of a trial Do you have any tips that you learned when working through this on how to work with various character sets efficiently? I think So for modern development and if backwards compatibility isn't a concern then you use utf-8 or in my scroll utf-8 mb4 That's the based on unicode and it's the idea of unicode is that it covers It's a replacement for all character sets if you do need to handle handle legacy systems, it's I don't think I've found a particularly good way. It's mostly just a case of Knowing the differences between them and Kind of working from there. I was just curious. I don't know if this is even really Possible, but when it comes to major security issues like this and stuff like that once they get reported Is there any way for you guys as a security team to kind of track? when sites fall victim to these kind of security issues and so I mean so you can kind of keep track of You know how serious this was before it got patched and things like that or I mean again Don't know if it's possible, but it just be interesting to see that kind of data We don't have data on across all sites, but we do work with the various security companies Security and voltpress for example to kind of get an idea for what the level of attack is or If we if we see it see something that's being attacked through comments Then we work with the Accusement team to to get an idea of how it's being exploited Hi, I was wondering if you could speak to the communications that came about after you fix the bug How do you roll it out to the world and get everybody to upgrade your site their sites? We auto upgrade It's this this was always the idea of the auto upgrade system was that when wordpresses now over a quarter of the internet and When there's a security bug and when it's announced People work very attackers work very quickly to try and exploit it So if we're able to roll out a fix for it to every wordpress site to tens of millions of sites in half an hour Then it keeps it secure so The the auto updates is the main thing and then obviously the actual information about what What the exploit was is spread through the wordpress news news blog and that gets shared around Well, I think that's it. Thank you very much I'll see you around