Hello everyone. What I'm going to talk about is one of the attempts to get NetBSD to a modern version control system. NetBSD is a very old project. Historically, it's the oldest of the new BSDs, after UCB basically got sued and went out of the software development business. We are also one of the biggest and oldest CVS repositories still around, and, well, it would be nice to move to something else. Most of this talk is not really about NetBSD, though; most of it applies to every other big project as well. It would work just as well for OpenBSD, and it would, with some changes, apply to FreeBSD too, if they want to get rid of Subversion.

So I'm going to talk about what happened in the last 25 years since we started with CVS. I'm going to talk a bit about what we are currently doing in NetBSD to get it to Mercurial. I'm going to look at a couple of issues we have with Mercurial and what we are doing to fix them in collaboration with the Mercurial developers. And I'm also going to take a look at what's hopefully going to happen in the rest of the year, or maybe the start of next year.

A VCS migration is a very involved project. A couple of people started to aggressively complain about CVS around 2005, and one of the results of those unstructured complaints was the creation of a mailing list, so people wouldn't bother all the other people actually trying to get things done, like writing software. So they had their own little corner where they could be productive, or not. Lots of talk happened; no one really did anything. And this continued for quite a while.

Sometime in 2009 or 2010, we had basically the situation that a couple of tools existed that could do a conversion from CVS to Subversion or from CVS to Git, and they sucked, all of them. We had, for example, cvs2svn and its child projects for converting to Git. They were hopelessly slow. We are talking about converting the NetBSD source tree in something like two days or more. That's okay if you want to do the conversion once, but if you actually want to prepare a migration project, you want to have something like a live mirror, or a nearly live mirror. Having a lag of a day is normally okay, but if you have to do a full conversion every two days, that doesn't work so well. Some other tools were much faster; fromcvs, for example, was decently fast at the time. The problem is it didn't really support CVS that well. Keyword expansion was completely broken, branch handling was quite broken, imports didn't really work that well. So this was depressing.

What's the open-source approach if you have a problem with the existing things? Well, you invent your own. That's basically what I did in 2010: I sat down and wrote a conversion tool from CVS to Fossil. The primary reason for going with Fossil was, well, it was new at the time, it was BSD licensed, and it's designed around a database, which actually makes it quite attractive for querying and reporting. If you want to analyze what's going on in your tree and maybe fix up things, it's very handy to have an actual SQL database and not just a couple of very special-purpose tools you have to deal with. The conversion tool I wrote was also based on using the database, because with 15 years of history in CVS at the time, you're going to run into lots of historical garbage: bugs in CVS that created strange things, people that misused tools, things like that. And again, this is an ongoing project.
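To make the querying-and-reporting point concrete: finding that kind of historical garbage becomes a short SQL query. A minimal sketch follows; the database path and the table and column names are made up for illustration, not the real schema of the conversion tool.

```python
import sqlite3

# Hypothetical schema: one row per converted changeset, with author and
# timestamp. The point is that "find the historical garbage" becomes a
# one-liner instead of yet another special-purpose tool.
db = sqlite3.connect("conversion.db")  # placeholder path
query = """
    SELECT author, COUNT(*) AS commits
      FROM changeset
     WHERE timestamp < '1993-01-01'  -- e.g. commits dated before the project existed
     GROUP BY author
     ORDER BY commits DESC
"""
for author, commits in db.execute(query):
    print(author, commits)
```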
You want to fix up those things once, so you don't have to do it again and again and again.

One of the requirements for me was that the conversion tool should be as faithful to the original repository as possible. That means if I do a CVS checkout and get the equivalent state in the new version control system, they should be exactly the same, no differences at all. It doesn't exactly work: there are a few special cases where keyword expansion can be tricky, like whether you do a checkout of a specific revision or a checkout of a branch, which can sometimes make a difference. But for every case I have, there's a clear definition of the behavior, and you could get the same output with the right flags from RCS or CVS. So that's good enough.

I also want to deal with things like vendor branches in a way that makes sense historically. It's quite tricky with CVS to get the state of a vendor branch a month or two months ago, especially the local changes on top of it, because CVS doesn't really provide a way to extract that data. But what we can do is pretend we are in a time machine and provide what you would have gotten if you had checked out the tree two months ago. That works very well, but it's also quite tricky.

I also want to have quasi-incremental output. It's not really incremental, because I'm actually doing a full conversion, but the output is stable. What you get is the full repository conversion, but all the old stuff gets the exact same Git revisions, or Fossil revisions, whatever you want to call them, so you are basically only adding a couple of new things on top. There are a couple of optimizations to make this faster, but that's just what it is, an optimization. At the moment, on modern hardware, we have a conversion cycle for the source tree of about two hours; for pkgsrc it's about one hour. There's some latency added at the beginning to make sure that the tree is in a consistent state (no one is currently committing to 100,000 files), so you avoid getting into the middle of a commit. But beyond that, you basically get 12 updates of the source tree per day.

As I said, we have an update cycle of about every two hours. This includes updating the mirror on GitHub and updating the mirror on Bitbucket, so that's actually quite a bit more work than just converting the repository to one system; it covers all three of them. The resource consumption is also quite low. The tools are written to work in less than four gigabytes of RAM on the official NetBSD machine that's doing the conversion. We also have a 32-gigabyte memory file system for the things that get rewritten all the time, to keep the wear on the SSDs low. Compare that, for example, with reposurgeon from Eric Raymond: he complained recently that he's running out of swap space on machines with 64 gigabytes of RAM and more, for projects that are essentially comparable in size. So that's quite a difference.

The first Mercurial experiments were based on cvs2svn, and they were not very encouraging. Asking for the log entry of the latest revision took something like 30 seconds, because what Mercurial was trying to do was figure out which tags exist for this revision, and this involved parsing the tags file for every single branch in the system. And, well, at the moment we have 399 branches.
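To see why that hurts, here is a schematic model of the problem, not Mercurial's actual code: asking for the tags of a revision meant a full parse of the tags file per branch head, and a simple cache makes the repeated work go away. The parse_tags_file helper is a stand-in for the real parsing.

```python
# Schematic model of the old behavior: one full parse of the tags file
# per branch head, on every single query.
def tags_slow(branch_heads, parse_tags_file):
    tags = {}
    for head in branch_heads:               # ~399 branches in NetBSD today
        tags.update(parse_tags_file(head))  # full parse, every time
    return tags

# With a cache, each head's tags file is parsed at most once, and the
# result is reused across later queries.
_tag_cache = {}

def tags_cached(branch_heads, parse_tags_file):
    tags = {}
    for head in branch_heads:
        if head not in _tag_cache:
            _tag_cache[head] = parse_tags_file(head)
        tags.update(_tag_cache[head])
    return tags
```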
It was a bit less at the time, but doing that parsing 300 times in a row was still very slow, and there was no tag cache at the time, so it would have been done over and over and over again. Well, turns out this doesn't work so well.

A couple of years later, in 2014, Alistair did another try based on my conversion tool, basically taking the git fast-import output from Fossil and importing that into Mercurial. It helped on the Mercurial side to get the generaldelta support documented properly, and this reduced the size of the Mercurial repository from something like 25 gigabytes to 2.5 gigabytes, which is okay. But at the time we didn't really get branches, because we were going via the Git format, and the Git format means, well, you take the heads of each branch, so things were turned into bookmarks, which is not really what we wanted. And this was also quite slow. It also took two days or so, because the tools didn't exactly scale, and things stopped at that.

What happened then is there was another group that wanted to discuss how we could finally get to Git, because, well, of course, we only have the choice of taking Git, because Git won and everyone is using Git. The same happened as before: nothing. Lots of talk, nothing that actually produced something. And, well, I got annoyed by that. I had kept those Git mirrors running for years, and they work. But people still continued: well, we must use Git, but we aren't willing to actually do anything. So I sat down and thought: well, I don't really want Git. I'm quite okay with Mercurial, but last time it didn't work so well. So let's see what changed on the Mercurial side of things, and, well, let's take a look.

There was this famous Microsoft guy with "developers, developers, developers". So what do we actually need if we want to switch? What are our requirements? The first one always was that we should have the version control system in the base system. You know, screw that. I don't care. I don't see a good point in having any version control system in the base system. It's not really required; it shouldn't be required for doing sensible things like updating your system. So ignore that. Which also sidesteps the whole discussion about, oh, we need to import Python, or we need to import Perl, or we need to import Bash, or whatever dependency the version control system has. Let's ignore that.

We do want to have proper branches, simply because doing any bigger operating system project without branches, especially for release management and so on, is just insane. So we need to check whether that works or not, and if it doesn't, what is necessary. Of course, we still want to have the incremental conversion during the test phase. And it's also quite important that it works on older or smaller machines. A Raspberry Pi only has so much memory, and it's also not necessarily the fastest when it comes to disk performance, and a lot of other older machines have quite similar restrictions. It doesn't mean I care about doing development on an Amiga with maybe 32 or 64 megabytes of RAM, but it should still be possible to get the tree onto a decently equipped machine.

So what's the current state? There are two Python projects. One is for handling the fast-import stream from Git, parsing it, and providing an interface for interacting with it. And there's a Mercurial extension on top of that, which basically provides a convert source.
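For a feel of what those two projects consume, here's a toy fast-import stream and a trivial scanner for it. This is a sketch of the stream format only, not the API of the actual parsing library.

```python
# A minimal git fast-import stream: a 'blob' carries file contents, a
# 'commit' creates a revision on a branch and references the blob by its
# mark (:1). The 'data <n>' lines give exact byte counts of the payload.
STREAM = b"""blob
mark :1
data 13
hello, world

commit refs/heads/trunk
mark :2
committer Joe Hacker <joe@example.org> 1500000000 +0000
data 14
initial import
M 100644 :1 hello.c
"""

# Trivial scanner: just list the top-level commands in the stream.
for line in STREAM.splitlines():
    if line.split(b" ", 1)[0] in (b"blob", b"commit"):
        print(line.decode())
# -> blob
# -> commit refs/heads/trunk
```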
Turns out, well, it actually works for branches just like I want: it basically turns the Git branch into a real Mercurial branch by delaying some stuff. Perfect.

The Mercurial fastimport extension was not very fast, though. And more importantly, whenever it saw a blob command in the Git fast-import stream, it would create a file in a temporary directory. That's okay if you have like 50,000 changes in the repository. It doesn't scale if you have a million changes and want to create a million files inside one directory. The first attempt was to basically create a hash tree, and later I just went with an SQLite database again; it's just so much easier to have only a single file. And once you have that, it's also quite easy to fix the remaining problems to get incremental operation working. The biggest problem there was that originally the extension preserved the mark identifiers from the fast-import stream for each revision, on the assumption that they are something stable on the source side. But Fossil actually creates them as it goes, based on the current database state, so if you recreate the original repository, you get different identifiers. These different identifiers would leak onto the Mercurial side as metadata of the commit, and therefore the commit IDs would end up being different.

Turns out Bitbucket is not very happy if you have more than a couple of thousand head revisions and branches, because HTTP header sizes are sometimes limited by proxy components and things like that, and if you have too many heads, they blow up the HTTP header when you try to fetch something. But, well, we are now at the point where all 399 branches of NetBSD can be found on Bitbucket. You can actually look at them, which is something that didn't work for a couple of years on GitHub: you would always get a timeout when you opened the branch page. They fixed it in the meantime by actually offering pagination, but that took surprisingly long. I still think they don't have pagination for directory listings, which is a problem for pkgsrc in some cases, but I digress.

The repository we have now is comparable in size and complexity to what Mozilla has, which is the biggest mainstream open-source repository in Mercurial right now. So we're hitting pretty similar issues in some cases, and they're also quite happy when we fix things. There are a couple of larger repositories in Mercurial, Facebook for example, but Facebook also has its own server implementation, so things are a bit different. But let's see what we can do with this tree. How does it work?

Let's take a look at cloning. Well, not really: the interesting part about the Mercurial design is that there are a couple of things that share the same code and just have a bit different plumbing, and clone is one of them. Basically, cloning and pulling go through almost the same code. And for large changes, you can basically simulate what the network-based communication is doing by just creating a bundle and asking the system to apply it locally with unbundle. The only difference is that clone or pull would create the bundle dynamically, whereas we can actually create it offline. And this has the very nice advantage of easy reproducibility, which is very important if you want to benchmark things.
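A minimal sketch of that offline cycle, with placeholder paths; the exact bundle type name depends on the Mercurial version, but a zstd bundlespec like this exists in recent releases.

```python
import subprocess

SRC = "/path/to/netbsd-src"     # hypothetical local repository
BUNDLE = "/tmp/netbsd-full.hg"  # the fixed artifact we benchmark against

# Create a full bundle offline: the same data a clone or pull would
# generate dynamically, but now a stable, reproducible file.
subprocess.run(["hg", "-R", SRC, "bundle", "--all",
                "--type", "zstd-v2", BUNDLE], check=True)

# Applying it with unbundle exercises almost the same code path
# that a network pull would.
subprocess.run(["hg", "init", "/tmp/clone-test"], check=True)
subprocess.run(["hg", "-R", "/tmp/clone-test", "unbundle", BUNDLE],
               check=True)
```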
There's one exception: if you do a streaming clone, it will not create an actual bundle in the normal way, but basically tar up the repository on the server side, send it to the client, and the client just extracts it, which has quite a number of implications. We'll come back to that later.

The primary thing of interest for me is the memory used on the client, because that's the one resource we can't fix. CPU time you can always wait out a bit longer, but if it needs too much RAM and goes into heavy swapping, you don't want to wait that long. The other important part, of course, is network bandwidth: we are not Facebook, we are not Google, and we have quite a few constraints on how much our servers can actually send out. Of only secondary interest is CPU load on client and server. On the client, well, get a faster machine if you don't want to wait. On the server, right now, we hope that it simply won't matter enough.

As a baseline, I'm creating a full bundle of the repository and compressing it with Facebook's zstd at level 22, so pretty much the strongest possible, using as large a window as you can. This compresses pretty well: we get a bundle size of 840 megabytes, which is quite good, especially if you compare it to simply compressing the RCS files, which are actually larger. Trying to apply that bundle with the old release version of Mercurial required something like 827 megabytes of memory on AMD64 and needed something like eight and a half minutes. Okay, so what's going on here? Let's take a look. Turns out there was an index being kept of offsets in the bundle for every commit, every file and so on, and nothing really used that index. So that was almost 180 megabytes of completely useless data. That was fixed, and since then we need just a bit over 600 megabytes of peak memory; it even got a bit faster. So that was very good. One of the concerns was that the large window of the compression algorithm was also going to increase memory use, but it turns out it has almost no impact. Perfect.

So let's take a more detailed look into what happened. I saw a linear increase of the process size over time, so it's definitely the bookkeeping of things going on. And you find out very soon that Python has a surprisingly large overhead for objects: any string basically requires at least 32 bytes of memory, and integers sometimes go up to 24 bytes. That's still a much higher overhead than you would expect in C.

The main memory hog that remains is the transaction object itself, where Mercurial basically keeps track of the changes it's currently trying to apply, so if it has to abort, it can reset everything. Various extensions also use this transaction object to figure out what changed, so you can, for example, send an email on incoming changes, things like that; they need to know what happened as well. And this transaction object uses something like 200 megabytes out of the 600, so there's quite a lot of potential for fixing that. The phase transitions take about 40 megabytes; I'm looking at almost completely eliminating those. The truncation maps, where it basically keeps track of which repository file got new data and where the end was before, need something like 70 megabytes. Pack files, where Mercurial would stop using one file for every file in the repository, would help to remove most of that. Going directly to the on-disk journal for the files would also help, but it's tricky, and all of these have implications that run very deep, so none of it is an easy fix. It's something we are looking at, though.
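The per-object overhead mentioned above is easy to check for yourself; exact numbers vary by Python version and platform.

```python
import sys

# CPython object headers dominate small values. On a 64-bit build you
# typically see numbers in this ballpark (they differ across versions):
print(sys.getsizeof(12345))  # a plain integer: roughly 24-28 bytes
print(sys.getsizeof(b"x"))   # a one-byte string: well over 30 bytes
print(sys.getsizeof([]))     # even an empty list costs tens of bytes
```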
One thing I was testing is what happens if we don't apply the whole history in one go, but basically slice it up into yearly chunks. This adds some overhead, of course, since there's less redundancy for the compression to remove. But it also helps a lot to reduce the memory use: with that, we are a bit slower, but take only about 360 megabytes of memory. This is mostly because the transactions themselves are smaller.

I mentioned earlier there's a special option for cloning, the streaming clone. For testing, this was done over the local interface. It has to transfer all the files without removing redundancy between the files, so it's going to transfer a lot more. At the same time, it's also quite dumb, so the client needs only 160 megabytes, which is much better, even compared to Git when you ask Git to ignore all the history. Normally this is only attractive if you have a very fast network connection to the server, but it turns out it also helps if you want to reduce resources on the client. It has all-or-nothing selectivity, though: you can't say I want only one branch, which is a restriction. And the server and the client have to be pretty much compatible in the repository format they are using, so this would also mean keeping them mostly in sync.

Diff, now, was actually a bit annoying when I was doing the testing and preparation for this talk. We are talking about the fully cached case, and we are looking at the differences between two NetBSD branches. The actual diff is something like 800 megabytes, so it's expected to take a bit, but it turns out Mercurial was surprisingly slow here: it took three times as long as CVS, and even worse compared to Git. Git primarily wins here because it has to do very few operations on the VFS layer; it only has to open a handful of files and can then use memory mapping for accessing them. Mercurial doesn't have that benefit at the moment. We are looking at how that could be done, but it's not there yet. CVS has one advantage here: the RCS format basically keeps the diffs line-based, so it can sidestep quite a few computations.

So let's profile it. Oh, wow. Turns out Mercurial spent about 70% of the execution time of the diff command writing output. One of the fixes was to cluster writes more aggressively: instead of doing one write per line of output, it batches them, on the order of 500,000 lines at a time. This matters because we are doing output buffering, but the output buffering is line-based; if you are sending one line per write, it's effectively not buffering at all. This basically cuts the time almost in half. It also helps in Python to do a join of the buffers you want to write and then write the result in one go, so you are doing one allocation and one write instead of many small writes. We also did a couple of optimizations to the algorithm that's creating the diffs. So we're not exactly at the CVS time yet, but it's getting better. Now the time is spent on things like parsing the repository, which again is not that easy to fix. The low-hanging fruit is gone, but we know what to work on. The revlog handling is basically the new hotspot; I'm looking at doing proper indexing of the files, but it's difficult.
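The write-clustering fix mentioned above is easy to sketch; this shows the general technique, not Mercurial's actual code.

```python
import sys

def diff_output_naive(lines, out=sys.stdout.buffer):
    # One write per line: with line-based buffering this is effectively
    # one flush per line of an 800-megabyte diff.
    for line in lines:
        out.write(line + b"\n")

def diff_output_batched(lines, out=sys.stdout.buffer, batch_bytes=1 << 20):
    # Collect lines, then join and write in one go: one allocation and
    # one write per batch instead of many small ones.
    buf, size = [], 0
    for line in lines:
        buf.append(line)
        size += len(line) + 1
        if size >= batch_bytes:
            out.write(b"\n".join(buf) + b"\n")
            buf, size = [], 0
    if buf:
        out.write(b"\n".join(buf) + b"\n")
```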
And of course, it's a bit annoying that the profiling output will still show a couple of functions that are no longer performance-relevant, simply because the instrumentation overhead changes the performance.

The last big item: Mercurial has this feature called clone bundles. When you do a clone, the server basically tells you: you know, my administrator has prepared these nice bundles for you in advance, you can find them at the following locations. And you can put them, for example, on a CDN instead of the main server. The problem is that all the clone bundle support at the moment assumes you have a second server that you are connecting to out of band. There is, as part of the largefiles extension, some support for in-stream clone bundles now, but it's not what we really want to have. Part of the problem for NetBSD is that cdn.NetBSD.org is hosted by a third party. We don't really believe that Fastly wants to hack us, but we don't put too much faith into them either. So it's a bit difficult to verify that the client has actually gotten the data it should have gotten. It's somewhat possible to do that with the incoming and outgoing commands, to basically see whether we got more or less than we should have, but it's not perfect, and it doesn't work well with obsolescence markers and things like that. So we'll see.

One idea to fix that is that the getbundle operation in the wire protocol of Mercurial could basically just use a precomputed file, which is essentially what clone bundles do. The server already knows what the client wants; the server already knows what the client has; so it could select a matching bundle. The only problem is the client might get less than it asked for, and at the moment the client is not very happy with that state. So the client has to be fixed to deal with that, and patches for that are under review. The performance is actually very promising, the overhead is very low, and I'm hoping to get that in soon.

So what's going to happen in the near future? On the Mercurial side, primarily reducing the memory overhead of transactions; that's going to help everyone. Looking at implementing pack files is definitely also very useful. And, of course, do more profiling. There are a lot of other commands I haven't looked at yet because they are not as crucial, but I expect there's still quite a bit of low-hanging fruit there, so if someone wants to work on it, for example for Summer of Code, be my guest, come to me.

On the NetBSD side, it's basically finishing the documentation for a good workflow: how are developers supposed to use the system, and how are they supposed to do the things they are doing in CVS now? Things are, of course, different, so it needs a couple of adjustments. We have some tooling, like the build cluster and the automatic regression testing, and they need to be adjusted to not use CVS anymore but Mercurial, or we need to look at new tools that serve the same purpose but can use Mercurial. And, well, the final point is, of course, to convince the other developers that they no longer want to have CVS, and to decide which color the bikeshed should have.

So, do you have any questions for me? In that case, thanks for your attention.