Hi everybody. Today I will tell you about the relationship between Mercurial and Python. I'm Pierre-Yves; I've been working on Mercurial for about four years, and I work at Facebook. This talk was originally prepared with Alexis Métaireau, who works at Mozilla. So, Mercurial is a version control tool. If you don't know what a version control tool is, you should probably go learn what version control is. Does anyone here not know what version control is? Okay, I know you lie. Mercurial is written in Python. It's ten years old. It's very similar to Git, because it was created by the same kind of people, at the same time, for the same reason, so you can basically think of it as Git written in Python, with some differences. So why did the person who created Mercurial pick Python instead of something else? He was a kernel developer, which makes the choice a bit surprising. The reasons are the usual ones: it's really easy to make a proof of concept; it's really easy to support multiple platforms without writing extra code for that; it's fast enough to get things done; and it's really easy to make multiple versions of something to try different approaches to a problem. So you get very fast development. There is an IRC quote on this topic; mpm is Matt Mackall, the creator of Mercurial. A user said: "Mercurial is great, but I wish it were written in C." And Matt Mackall replied that if it had been written in C, it would have taken him six months instead of two weeks to get a prototype running, and he had 20 years of C behind him. So Python was a great choice to get something working. As a good example of that: the initial Mercurial announcement was in 2005; one month later there was an HTTP server to pull and push, and one month after that we had Windows support.
But before talking too much about Python inside Mercurial, since you're all Python developers, I'm going to talk a bit about how having a tool written in Python helps its users get something out of it. Something written in Python is easy to hack on, and in the Mercurial case, easy to extend. But before talking about how easy it is to extend, let me first talk a bit about how pleasant it is to use Mercurial. The official API for Mercurial is not in Python: it is the command line. The thing that is very stable, very powerful, and usable from anywhere is our command-line API. We also have various bindings, so you can talk to Mercurial from Python, Java, C#, things like that, and what they actually do is talk to Mercurial through the command line, to make sure it stays stable. For example, if you want to know something simple like who made the most commits in your project, you can just ask Mercurial to display only the author of every commit, and then use your usual bash knowledge to get the result. You can also go faster using a small query syntax to select revisions, so you can do all the sorting and filtering directly. The feature we just saw on this screen is called revsets: you select changesets using functions. For example: all the changesets done by Alexis at Mozilla; or all the changesets from the last week that touched a Python file. And you can do much more, like everything that is in the 2.8 release and not in 2.7.3, or everything from the last two weeks that fixes something and is not a merge, that kind of thing. You can also combine queries with subtraction, addition, and so on. So that's the command line; now let's go down one level, to Python. In Python, we have a high-level API to do things in a very pythonic way. This code is almost working; I just removed four imports at the beginning to fit it on the slide.
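To give a feel for the idea behind revsets, here is a rough pure-Python sketch: a revset is essentially a composable function from a repository to a set of revisions, and queries combine with set operations. This is not Mercurial's implementation, and the changeset data here is made up for illustration.

```python
# Toy illustration of the revset concept (NOT Mercurial's real code).
from datetime import date

# Hypothetical changeset records: (rev, author, date, files touched)
CHANGESETS = [
    (0, "alexis", date(2014, 6, 1),  ["README"]),
    (1, "matt",   date(2014, 6, 20), ["mercurial/revlog.c"]),
    (2, "alexis", date(2014, 6, 25), ["hgext/churn.py"]),
]

def author(name):
    """Like the revset function author(name): revisions by that author."""
    return {r for (r, a, d, fs) in CHANGESETS if name in a}

def touching(suffix):
    """Revisions that touched a file with the given suffix."""
    return {r for (r, a, d, fs) in CHANGESETS
            if any(f.endswith(suffix) for f in fs)}

def since(day):
    """Revisions made on or after the given date."""
    return {r for (r, a, d, fs) in CHANGESETS if d >= day}

# "changesets by alexis that touch a Python file",
# in the spirit of: hg log -r "author(alexis) and file('**.py')"
result = author("alexis") & touching(".py")

# Subtraction works too: recent changesets not made by matt.
result2 = since(date(2014, 6, 15)) - author("matt")
```

Real revsets are parsed from a query string and evaluated lazily against the changelog, but the set-algebra flavor is the same.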
You just create a repo object, and from this repo you can get the changesets in it. You get the same thing we did with the command line, but as a Python script. You can turn that into an extension by just adding a few things at the beginning, and then you have a new hg command that does the same thing. You don't actually have to write this one, because we already have it, but it's a good example of how easy it is to extend Mercurial in Python. It may seem kind of useless, but actually most of the big companies that move to Mercurial from something else have some kind of crazy workflow, or some tool they want to integrate with, and you can write a few hundred lines of Python in a file and, magic, the version control system is integrated with something else. And actually, right now the Python core developers are writing a small extension to fetch and upload patches between the bug tracker and the version control system. Of course, staying inside Python also gets you more performance: the command-line version is about twice as slow as the in-process Python version, because of all the overhead along the way. The timing on the slide is probably a bit off, but anyway. So let's stop talking about Mercurial and get back to Python. Because, you know, Python is a slow language, right? You can't write anything serious in Python, of course. Well, actually, the most important thing is your algorithms. If you have a stupid algorithm, you can rewrite it in assembly and it's going to stay stupid. Without changing language or anything, we got something like a 10x improvement just from changing the way we add and remove files when you have made a lot of changes and want to record them. We got a 40x improvement just from using a dedicated data structure to store all the hashes for chunks, instead of a basic dictionary. So focus on your algorithms first, and then maybe you will have to use a different language.
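The "who committed most" computation that such a script or extension performs boils down to counting authors. As a plain-Python sketch, with a made-up author list standing in for what you would actually read from a repository object:

```python
# Counting commits per author, the in-process equivalent of the bash
# pipeline: hg log --template '{author}\n' | sort | uniq -c | sort -rn
# The author list is sample data, not read from a real repo.
from collections import Counter

authors = ["alexis", "matt", "alexis", "augie", "matt", "matt"]

def top_committers(authors, n=3):
    """Return the n most frequent committers, most active first."""
    return Counter(authors).most_common(n)
```

The real extension would iterate over the changelog instead of a list, but the logic is this small.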
Talking about data structures: getting them right is really important to get something fast. Mercurial organizes its data in multiple files. The first and maybe most important file is the changelog. The changelog contains all the changeset data: who made the commit, why they did it, when they did it, and which other commits the changeset is based on — the core information about commits. There is a second file, called the manifest, that contains, for every version of the manifest, the list of every file under version control. So a changeset says: someone made some change, and that change corresponds to that version of the manifest. And there is another set of files that contains the actual revisions of every file. Because everything lives in different files, if you need information about a changeset, you only ask the changelog. If you need to know what changed at a high level between two changesets, you only have to look at the manifest. And if you want to run blame, or get the log of a specific file, you only need to access that specific file's history, where all the information will be. Each of these files uses a format called a revlog, which is a mix of full snapshots and binary diffs. When you first add a file, you store a full snapshot of it, because you have no prior information about that file whatsoever. From there, when a new version of the file comes in, you store a binary diff, keeping only what changed between the first version and the second version. And you keep going like that until the chain of diffs starts to get big: when reading all the deltas between the last full snapshot and the new revision would actually cost more than storing a new full snapshot, you store a new full snapshot.
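A minimal sketch of that snapshot-plus-deltas idea, assuming a very naive cost heuristic; real revlogs store binary deltas with indexes, compression, and hashes, none of which appears here:

```python
# Toy revlog: full snapshots plus delta chains. A new snapshot is started
# when replaying the delta chain would cost more than storing the text whole.
import difflib

def make_delta(old, new):
    """Delta = list of (start, end, replacement) edits against `old`."""
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new).get_opcodes():
        if tag != "equal":
            ops.append((i1, i2, new[j1:j2]))
    return ops

def apply_delta(old, ops):
    text = old
    for i1, i2, repl in reversed(ops):   # apply from the end: offsets stay valid
        text = text[:i1] + repl + text[i2:]
    return text

def delta_size(ops):
    """Rough storage cost: stored bytes plus a little overhead per edit."""
    return sum(len(r) for _, _, r in ops) + 8 * len(ops)

class MiniRevlog:
    def __init__(self):
        self.entries = []    # ("snap", text) or ("delta", ops)
        self.chaincost = 0   # accumulated delta cost since the last snapshot

    def add(self, text):
        if self.entries:
            prev = self.get(len(self.entries) - 1)
            ops = make_delta(prev, text)
            if self.chaincost + delta_size(ops) <= len(text):
                self.entries.append(("delta", ops))
                self.chaincost += delta_size(ops)
                return
        self.entries.append(("snap", text))   # chain too long: fresh snapshot
        self.chaincost = 0

    def get(self, rev):
        base = rev
        while self.entries[base][0] != "snap":   # walk back to a snapshot
            base -= 1
        text = self.entries[base][1]
        for i in range(base + 1, rev + 1):       # replay deltas forward
            text = apply_delta(text, self.entries[i][1])
        return text
```

The trade-off the talk describes is the `chaincost` check: mostly deltas for space, but never an unbounded chain to replay.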
This is similar to what video encoding does, where you have a full image, then deltas against it, then a full image again, and so on. It means the format is space-efficient, because you mostly store diffs, but still has good access time, because you never have to apply thousands of diffs to get anywhere. Another good property of this format is that you only ever add data at the end: it's append-only, which makes transactions very easy. You write at the end, and if the transaction is aborted, you remove everything you added at the end. But it also brings constraints: for getting file content, there is one direction in which things are easier to read than the other. So, we talked about Python and how data structures matter; now let's look at another advantage of Python: being able to use C where it actually matters, because C is much more efficient for several kinds of things. About five percent of the Mercurial code is in C. It's used for all the low-level operations: reading from disk, writing to disk, scanning all the directories to find what changed, that kind of thing. Computing diffs, and applying diffs to a file. All the data structures that are very heavily used, like the indexes for hashes. And also most of the common graph algorithms, like finding all the descendants of something, or what has been merged — those go much faster in C, because they are just walking big arrays of integers. Around that, we have a big Python codebase that can be properly organized with objects and clean APIs. All the C parts are just small pieces, implemented both in Python, for compatibility and readability, and in a C version that does one thing and just one thing. So the craziness of C can be easily contained. So Python is great, but sometimes it's not that great.
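The "C version alongside a Python version" pattern can be sketched like this; `_speedups` is a hypothetical C extension module name, not Mercurial's actual module layout:

```python
# Pattern: keep a pure-Python reference implementation, and let an optional
# C extension module override it when available. `_speedups` is hypothetical.

def ancestors_py(parents, rev):
    """Pure-Python version: the set of all ancestors of `rev`, given a
    mapping rev -> list of parent revs (a simple graph walk)."""
    seen, stack = set(), [rev]
    while stack:
        r = stack.pop()
        for p in parents.get(r, []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

try:
    from _speedups import ancestors   # C version: does one thing, fast
except ImportError:
    ancestors = ancestors_py          # fallback keeps behaviour identical
```

Because both implementations share one signature and one test suite, the C side stays a small, contained optimization rather than a fork of the logic.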
You have to know what the downsides are, what constraints you're going to have, and when not to lean on certain features too much. Function calls are slow: a call costs on the order of 60 nanoseconds. That's not much, but if you have a repository with one million changesets — which is about what Facebook has — it means you spend an extra 60 milliseconds every time you add one function call inside something that goes through every changeset. That's pretty expensive for just one function call. So you end up thinking twice, you use functions less, and at some point you get more code duplication, because calls are not going to be inlined — you cannot use them everywhere. In the same way, object creation is slow. In the example we saw before, we had something that got a username by creating a changeset object. If you don't create the object at all and just read the data from the raw data structures, it goes much faster, because creating objects is also slow in Python. So again, in code that is going to run a lot, you cannot really use objects freely; you have to be careful about what you do. You don't have good multicore support because of the GIL. If you really want to use all your CPUs, you either have to write really hardcore C code, or fork and then communicate between your processes — which means you do it less, because it's complicated and expensive. And on some platforms like Windows, where starting a new process is quite expensive, it's not an option for most operations. Python is also slow to start compared to about every other programming language, except Python 3 — compared to that, we're doing pretty well.
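The object-creation point can be made concrete with a small sketch: both versions below compute the same answer, but the second avoids allocating one object per revision, which is what matters in a loop over a million changesets. The data is made up; `Changeset` is a hypothetical wrapper class, not Mercurial's.

```python
# "Avoid per-item object creation in hot loops": same result, less churn.
from collections import Counter

# Raw (author, description) tuples standing in for low-level repo data.
RAW = [("alexis", "fix revset parser"),
       ("matt", "speed up dirstate"),
       ("alexis", "add churn extension")] * 1000

class Changeset:
    """Hypothetical convenience wrapper around a raw record."""
    def __init__(self, raw):
        self.author, self.description = raw

def count_with_objects(raw):
    # Allocates one Changeset per revision: pleasant API, slow in bulk.
    return Counter(Changeset(r).author for r in raw)

def count_raw(raw):
    # Reads straight from the tuples: same answer, no per-item allocation.
    return Counter(r[0] for r in raw)
```

On a real repository the difference compounds with everything else the wrapper object would initialize, which is why Mercurial's hot paths read raw structures directly.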
In addition to that — something already mentioned in a previous talk — imports are super slow, because Python is going to look for each module in every single directory on the path, multiple times. Importing some specific modules can take up to a second, and I'm not kidding. So we have a lazy import system: when you import a module, you don't actually import it; the first time you try to use it is when it actually gets imported. So we can have the full, complicated import tree available for every command, but unless you actually need a module for that particular command, you're not going to pay for it. That lets us shave off about half of the time of a basic Mercurial invocation — but we still take 60 milliseconds just to print the version, which is basically doing nothing but writing a string to standard output. Things like setuptools and eggs used to make sys.path terrible; it has gotten much better since the recent efforts in Python packaging. Garbage collection is slow too: if you have something that creates a lot of objects and then plays with them, even if they are very basic types, it's going to trigger the garbage collector, and that can have a significant impact. It doesn't seem like much, but if your command is supposed to run in a quarter of a second, it's a big impact. And something which is less about performance: typing. It's eventually going to be solved, now that type annotations are coming to Python. Mercurial is a pretty good codebase, but it's still 10 years old, and there is stuff going in every direction. Sometimes when you do a refactoring, you may have a good test suite that catches mostly everything, but you're still not too sure what this c variable is, or what this l variable is, and having actual type checking and type annotations would be great.
So in conclusion, Python is really great as long as you don't want to call functions, don't want to create objects, don't have multicore hardware, don't write command-line tools — which is more than common — and don't need garbage collection. This is a bit of a trolling slide, of course. What it really means is that you can't use all of Python in the very performance-intensive parts of your application. Python is really great for all the glue — in Mercurial, all the logic of what the commands do and how they interact with each other — but in the core logic that actually reads data and computes on it, you have to use a more limited subset of Python. So how are the other DVCSes doing? When we first wrote this talk, Bazaar was just dead. Now I don't know who still remembers what Bazaar was, but it had good ideas in it. There was a full talk giving a post-mortem analysis of why Bazaar didn't work out, and it's really interesting, but two things matter here. First, they didn't have a very good on-disk format on the first try, and still didn't have a very good one on the second try, so it took them a long time to converge on something performant enough — whereas Mercurial got good performance from its format right from the beginning, and Bazaar struggled with that. Second, they had an official internal API in Python. They exposed much more than us, so refactoring and changing big concepts was much harder than in Mercurial, where the only official API is the command line. To change big things, they had to keep strange concepts around, things that kept them slow and slowed down their work. And then there is another DVCS, written in C, that is quite successful.
Because it's written in C, it's much faster at all the operations that are very short, because it doesn't have the startup time, it doesn't have the import time, and it doesn't have any of that overhead. So exporting a patch with git show is very fast; the first result of git log comes much faster than the first result of hg log; status is going to be snappier too. But there are also things that depend on the on-disk format rather than the language. We are about the same in repository size. The time to get a diff is about the same in Mercurial, because we have a more isolated data structure, so we get the result as fast as Git does, and then reading from disk is your bottleneck. Things like rebase are also very format-intensive, and Git is even a bit slower on things like clone, pull, and update. For update it's mostly because we use multiple cores there, but for clone and pull it's because of the way the format works. And on stupidly huge repos, like my employer has, Mercurial is also much, much faster, because since it's so extensible, it's easy to replace the pieces that don't scale at massively huge scale. We have something like Watchman, which replaces the whole logic that scans the disk to see which files changed, and instead asks the kernel to do that job for us. It offers the same Python API as the old object did, and it can give you a status in about a second for a few hundred thousand files without any issue. We can also change the compression algorithm to something less space-efficient but more speed-efficient.
We can also change core concepts. For example, when you pull, you don't actually get all the file history, because you don't care about most of it: people made 3,000 changesets since your last work, and you're not going to need all those intermediate versions. You just get the metadata — the changesets and the manifests — and you fetch the file contents on demand when you need them. And on the server side, you can do things like intercepting every write to the repo to build a journal of it, and then use an SQL database to synchronize multiple servers. So you have multiple masters that are always up to date with each other, and multiple writable servers, which means you scale much better, because more people can be pulling at the same time, and if one server fails, you're still running. That's it.
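The "metadata now, file contents on demand" idea can be sketched as a tiny cache in front of a remote store. Everything here is hypothetical sample plumbing (the class names and data are invented), just to show the shape of a shallow pull:

```python
# Sketch of a shallow clone: changeset metadata is pulled eagerly,
# file contents are fetched from the server only when actually needed.

class RemoteStore:
    """Stands in for the server; `blobs` is made-up sample data."""
    def __init__(self, blobs):
        self.blobs = blobs
        self.fetches = 0          # count round-trips, to show laziness

    def fetch(self, key):
        self.fetches += 1
        return self.blobs[key]

class ShallowRepo:
    def __init__(self, remote, changesets):
        self.remote = remote
        self.changesets = changesets   # cheap metadata, pulled up front
        self.cache = {}                # file contents arrive lazily

    def filecontent(self, key):
        if key not in self.cache:
            self.cache[key] = self.remote.fetch(key)
        return self.cache[key]
```

A pull of 3,000 changesets then transfers 3,000 small metadata records and zero file bodies; only the files you actually check out or diff cost a round-trip, and each only once.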