Well, hello everyone. Thank you for coming. I'm G2, and this talk is going to be about fast interpreter startup on Python 3. So startup time is a concern everywhere. I've seen native developers worry a lot about making their apps load faster, especially the mobile people. And despite the differences, there are some interesting parallels between how dynamic linking of shared libraries works in Unix and how importing modules works in a bytecode-interpreted language like Python. The first common thread is that they're both relatively slow and expensive. So here's a random slide with a curses progress bar. And usually when you see a progress bar, it's a mental cue that things are going to take a bit of time. But with command line apps and REPLs, on the other hand, when you type something out, you expect things to be snappy. And sometimes it takes an eternity to respond, the upper bound for eternity being a couple of seconds. And I have aliased out the name of the command line tool to avoid naming any specific tools. This is essentially the names of the two most popular version control systems in existence spliced together. Also, one of them happens to be written in Python, albeit Python 2 and not 3. So yeah, we tried doing some profiling to see what's the bottleneck here. And there were a lot of things, but one thing which really stood out was that about 20% of the startup time was spent on just importing modules. And most of it was spent on disk I/O, essentially just loading .pyc files, and also on allocating lots and lots of small objects. CPython incidentally has a lot of optimizations in place for making small object allocation really fast and cheap. But we won't settle for anything but completely free. And I kind of love the sound of the phrase "zero cost abstractions". This is something which was popular in the C++ community at one point, and these days the Rust community keeps talking a lot about it.
So how do we go about solving this problem, or maybe making it less of a problem, perhaps? So one possible solution is to just go for the latest, hippest ahead-of-time compiled language and rewrite your whole code base in it. I've seen quite a lot of projects do this, both open source and closed source. It isn't the easiest thing in the world, because hipness is a very temporal thing: what's hip today might not be very hip tomorrow. So you probably might want to go to Hacker News and find out what the static language of the month or the week is. And you probably have some fun learning the new language, and then you actually have to get down to the hard work of actually rewriting your Python code in this new language and dealing with all the missing Python idioms or Pythonisms. So here's another contrived example. I just have a Python script with a single line, import sys. So the question is, how many .pyc files need to be loaded for just running import sys? And the sys module, incidentally, is a very special module in Python, so special that it's a part of the interpreter core, and it's implemented in C. So I'd have expected it to take zero, or maybe not more than a handful of, .pyc files to load. So let's see. OK, we get a big list. And I'm excluding things starting with _frozen, because, as we'll see later, that's baked into the binary. By the way, I should have used a list comprehension here instead of a list constructor with a generator expression. Should have gotten my slides code reviewed, sorry. Yeah, so essentially I'm just importing sys, and I'm printing sys.modules, excluding anything starting with _frozen there. So what's surprising about this is, I mean, it's not very surprising, but I still find it very fascinating to see how much of CPython is actually implemented in Python. OK, so let's see how big the list is.
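The experiment described above can be reproduced in a few lines (using the list comprehension the speaker wished he had used):

```python
import sys

# Everything that importing "sys" dragged in, minus the _frozen_importlib
# machinery that is baked into the interpreter binary.
loaded = [name for name in sys.modules if not name.startswith("_frozen")]
print(len(loaded))   # around 44 on the speaker's Mac; varies by platform and version
print(sorted(loaded))
```

The exact count depends on the platform and Python version, but the point stands: a surprising amount of the interpreter's bootstrap is pure Python.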
And I don't know if you can see it, but the number is 44. At least that's the number on a Mac; I think it's a bit smaller on Linux. So the question is, how do we go from loading 44 .pyc files to, say, four .pyc files, or zero .pyc files, if we could do that? And I guess in some ways native languages also suffer from the same problem. I mean, you have a lot of native applications loading hundreds of shared objects, and it tends to slow down their startup. And the traditional approach for solving this in native languages is static linking. Go is quite famous for this: you just statically link everything together into a big fat binary, and you just trade one problem for another. And it's an OK trade-off, and sometimes it's a great trade-off. So the question that I wanted to answer here was: is there a way to possibly steal some ideas from static languages to maybe make Python startup faster? So now for the agenda. We're going to start with an overview of the module loading system. It's going to be a very short, high-level introduction; we won't go into the details of how importlib works, because it's quite low-level and relatively complex. Also, this talk is going to be about Python 3 and not Python 2. There are some significant differences in how the import machinery works between Python 3 and 2, and needless to say, it's a lot nicer and simpler in Python 3 than it was in Python 2. OK. Next, I'll talk about what we did to improve the startup performance, and spend some time on the prior art from other dynamic language runtimes. And there's a long history of this: anything you can think of in terms of dynamic language optimizations, chances are that Lispers have done it decades ago. And finally, some future work on how we could go about making this better. So OK, let's get started then. Module loading in CPython.
I'm going to be glossing over a lot of details, but for those of you who are interested in the details, I believe the next talk, which is scheduled right after mine in this room, goes into more detail on the lower-level aspects of loading code objects, things like that. So Python is an interpreted language, and it compiles to bytecode, which the interpreter runs. There's a .pyc file for every .py file, somewhere in the __pycache__ directory. So now the interpreter loop essentially takes a PyCodeObject and runs it. So let's look at how we get from .pyc files to code objects in memory. The .pyc file format is one of the simplest file formats ever. There's just a 12-byte header, and everything else is a marshalled code object. The header used to be 8 bytes in Python 2, if I recollect correctly, and now there's an extra field, so it's 12 bytes. So the simplest working .pyc loader would be just a couple of lines of Python: you just open the file, skip the first 12 bytes, and hand the rest off to marshal.load, and you get a code object. There's a bit more work involved in turning these code objects into module objects; it involves running them, and things like that. But fortunately for us, CPython's PyImport C API functions have us covered. You just have to call one of those functions, and it does everything for you. So now let's dig into some details about the marshal module. People are usually encouraged to use a higher-level library like pickle or something else for serializing and deserializing objects. But marshal is lower-level, faster, and unsafe, and that's what .pyc files use. And it only supports a limited set of types. It's quite a long list; let me just read it out: booleans, integers, floats, complex numbers, strings, bytes, byte arrays, tuples, lists, sets, frozen sets, dictionaries, and code objects. And also None, Ellipsis, and StopIteration. OK. So this is probably the most important point in the whole talk.
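The couple-of-lines loader described above can be sketched like this. One caveat: the header grew again to 16 bytes in Python 3.7 (after this talk), so the sketch picks the size from the running interpreter's version:

```python
import marshal
import os
import py_compile
import sys
import tempfile
import types

# Header size: 12 bytes on Python 3.3-3.6 (as in the talk), 16 from 3.7 on.
HEADER_SIZE = 16 if sys.version_info >= (3, 7) else 12

def load_pyc(path):
    """Open a .pyc file, skip the header, and unmarshal the code object."""
    with open(path, "rb") as f:
        f.seek(HEADER_SIZE)
        return marshal.load(f)

# Demo: byte-compile a tiny module, then load its code object straight back.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "hello.py")
    with open(src, "w") as f:
        f.write("GREETING = 'hi'\n")
    pyc = py_compile.compile(src, cfile=os.path.join(d, "hello.pyc"))
    code = load_pyc(pyc)
    # Roughly what the PyImport C API does for us: run the code in a module.
    mod = types.ModuleType("hello")
    exec(code, mod.__dict__)
    print(mod.GREETING)  # prints: hi
```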
The marshalled object graphs in .pyc files are made up of a subset of the types which marshal supports, and this subset is completely immutable. And it's a very tiny subset: it's just booleans, integers, floats, complex numbers, strings, bytes, frozen sets, and code objects. OK. So here's the plan for improving startup performance: we want to bake frequently used modules into the data segment of the compiled Python binary. And this isn't entirely unprecedented; CPython already does this for two modules from importlib. Parts of importlib are written in Python, and that presents an interesting chicken-and-egg problem: how do you import parts of importlib before importlib is ready at startup? CPython's solution to this problem is something called frozen modules. They just serialize the .pyc contents into a C array of bytes and bake it into the binary with a header. And at startup, you essentially just pass this byte array to marshal, and you get the module back. So you can load these modules without having to invoke the importlib machinery at all. OK. So this approach that CPython takes only addresses one part of the problem, which is disk I/O. But we want things to be absolutely free, as free as it can get, and we don't want to pay the price of creating lots of tiny objects in memory. So we cheat. We use a sneaky little trick. There's a neat feature in C99 called designated initializers. It's a bit like what list or set literals are in Python, except for the fact that we're talking about C, which is lower level, and all you have are arrays, structures, and unions. Incidentally, this is one of the two C features that I know of which haven't made it to C++ yet, as of C++11; the other feature is variable-length arrays. OK. So sys.modules is a dictionary which maps from module names to module objects.
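CPython's real frozen modules are C byte arrays compiled into the binary, but the mechanism can be mimicked in pure Python: serialize a code object with marshal, pretend the bytes were baked into the binary, and turn them back into a module without touching the import machinery. A sketch of the idea, not CPython's actual C implementation:

```python
import marshal
import types

# "Freeze": compile and marshal a module's code, as if into a C byte array.
frozen_bytes = marshal.dumps(compile("ANSWER = 42", "<frozen demo>", "exec"))

# At "startup": unmarshal and execute, bypassing importlib entirely.
mod = types.ModuleType("demo")
exec(marshal.loads(frozen_bytes), mod.__dict__)
print(mod.ANSWER)  # prints: 42
```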
So we can call PyImport_ExecCodeModuleObject, which is a mouthful, on our statically frozen code object pointers in C, and put the resulting module into sys.modules. importlib is required to look into sys.modules before searching anywhere else, and importlib also adds every module it successfully imports into sys.modules. So sys.modules is like a cache for importing. So we cheat by injecting modules directly into sys.modules. It's a bit of a hack, and perhaps there's a cleaner way to do this, but this seems to work for now. And finally, I'm missing a bullet point titled "profit". OK. So let's look at some benchmarks. This is a very simple microbenchmark: we count the number of open and stat calls that the Python binary makes at startup. And you see that the number of stat and open calls is reduced by half. I hope the slides are readable. Some of you might be wondering, why haven't the numbers gone down to zero? And that is because of dynamically linked shared objects, the C equivalent of imports, roughly speaking, which also end up calling open and stat to get loaded. So libpython, libc, any native extensions, and a lot of indirectly loaded shared objects also end up calling open and stat. So we cannot get it down to zero entirely, but we've verified that we're not loading any .pyc files; this is just native stuff. OK. So here's a very similar benchmark: we try benchmarking importing difflib, which is around 2,000 lines of Python code, and we achieve nearly a 2x reduction in the number of stat and open calls. And we chose difflib for benchmarking startup because it's a relatively large Python module which also happens to be a part of the Python standard library. OK. So in terms of performance, the improvement is rather modest compared to the reduction in the number of open and stat calls; in this case, it's approximately a 21% improvement.
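The sys.modules cheat is easy to demonstrate from Python itself: a module planted in sys.modules is handed back by a later import statement without any file system search. The module name here is made up for illustration:

```python
import sys
import types

# Build a module object by hand and inject it into the import cache.
fake = types.ModuleType("prewarmed")
fake.VALUE = "loaded for free"
sys.modules["prewarmed"] = fake

# importlib consults sys.modules first, so this import does no disk I/O.
import prewarmed
print(prewarmed.VALUE)
assert prewarmed is fake
```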
And incidentally, bench is quite an interesting benchmarking tool. It can run any command line app until it finds a fixed point for the startup time, sort of like a generic version of Python's timeit module. But it happens to be written in Haskell, and I tried to get it building on my Linux server, but I ran out of patience while trying to install Haskell Stack, which is why these benchmarks were run on a Mac instead. So, the same benchmark again, with difflib: approximately a 22% improvement. So you might notice that we're reducing things by approximately a constant factor, in both the number of open and stat calls as well as the startup time. And that's pretty much what's going on here. I mean, we could bake in more modules and get fatter binaries; it's a trade-off. So now we'll look at some ugly, generated C code. I hope it's readable. OK, so this is the simplest possible case. Well, actually, Nones are simpler, but let's just go with ints for now. So essentially, we just have static structures with a bunch of macros. If you cast this into a PyObject, it behaves like a PyObject, except for the fact that you can't free it. If you try to free it, you'll get a hard crash. And the way we prevent these objects from being freed is by cheating with a refcount of 2 instead of 1, so the interpreter never frees them. But this sort of cheating is only done on top-level objects; the inner objects in the module graph actually have accurate refcounts. And incidentally, another optimization that we perform is that if the same object is repeated across multiple modules, we just emit a single serialized C instance of it, increment its refcount, and share it everywhere. So immutability wins. And strings, also known as PyUnicode objects, are a little more complicated: there are about four different internal representations for Unicode objects in CPython.
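The "single shared instance for repeated immutable objects" idea mirrors something CPython already does internally, for example with its small-integer cache. This is a CPython implementation detail, not a language guarantee:

```python
import sys

a = 256
b = int("256")   # built at runtime, not a shared compile-time constant
print(a is b)    # True on CPython: ints from -5 to 256 are cached singletons
print(sys.getrefcount(256) > 2)  # many references, one shared immutable object
```

Because the object can never change, any number of modules can point at the same instance safely, which is exactly why the talk's trick is restricted to the immutable marshal subset.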
And we had to implement support for serializing all of them. Tuples are a bit more complicated, because they're containers for other objects, and this means there are some interactions between these objects and the garbage collector. We just tell the garbage collector to not track our objects. I don't think the slides are readable, but yeah: we just mark them as GC-untracked, set the previous and next pointers to null, and it just works. So the tooling I wrote to do all of this was about 800 lines of Python, 700 lines of C code, and some 200 lines of tests. And the patch to CPython was about 100 lines or so. And that was it. So I think once you figure out all the hairy details, it's incredibly simple to implement. Incidentally, the generated C code for all of this was about 75,000 lines of C, but it's machine-generated, so I'm not really that concerned about it. Yeah, there's one more thing I should mention, which is that we had a minor problem with frozen set objects. So hashes for strings and bytes are randomized at startup to prevent hash collision denial-of-service attacks. We can get around this for the cached hash field on Unicode objects by setting it to -1, and the interpreter just populates it on demand; not so for the buckets in set objects. So the comment you see there, "cached hash code of the key", is a bit misleading: setting it to -1 leads to a hard crash when the interpreter does a set lookup. So the workaround was to have a predicate which checks every set in the code to see if hash randomization is applicable to any of its items, and if that's the case, we just tag it and populate the hashes at runtime. So this is probably the only bit which is not zero cost; everything else is absolutely free. OK, so let's talk about prior art. There's a lot of prior art, dating all the way back to the early '70s.
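That predicate could look roughly like the following sketch. This is a hypothetical reconstruction for illustration, not the actual tooling, which walks the marshalled object graph; the idea is just to flag any frozen set whose items have startup-randomized hashes (str or bytes):

```python
def needs_hash_fixup(obj):
    """True if obj is (or contains) a frozenset holding items whose hashes
    are randomized at interpreter startup, i.e. str or bytes."""
    if isinstance(obj, frozenset):
        return any(isinstance(i, (str, bytes)) or needs_hash_fixup(i)
                   for i in obj)
    if isinstance(obj, tuple):  # tuples can nest frozensets
        return any(needs_hash_fixup(i) for i in obj)
    return False

print(needs_hash_fixup(frozenset({1, 2.0, 3j})))  # False: numeric hashes are stable
print(needs_hash_fixup(frozenset({"a", "b"})))    # True: str hashes are randomized
```

Sets that pass the check can be serialized with their buckets precomputed; only the flagged ones pay the runtime cost.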
So things that we take for granted today, like linkers and loaders and shared objects and things like that, did not exist in the early days of computing. And the only way to go about doing things was to have a little REPL: you just run things, and you dump the memory image onto disk, and the next time you come back, you load the memory image again. And so there were a lot of image-based languages around. So let's start with modern times. V8 has this feature of startup snapshots. JavaScript, despite being syntactically rather different from Python, shares very similar problems when it comes to importing modules, or rather, requiring modules in JavaScript. And this problem is particularly pronounced for the now fashionable trend of writing desktop GUIs in JavaScript with the Electron framework. And Atom, one of these editors written in JavaScript, uses a V8 API called CreateSnapshotDataBlob to essentially create an image, a pre-warmed image with all the required modules pre-loaded, and that seems to make startup faster for them. Going a little further back, Emacs, which is mostly written in Elisp, again a dynamic, interpreted language, had a very similar problem. And the solution to it was quite interesting: they implemented a function called unexec. It's one of the more bizarre function names I've come across, but it's quite logical, actually: exec, which we all know, takes an executable file and turns it into your process; unexec takes a process and gives you back an executable file. And I believe this feature has lately been deprecated by the Emacs people because, apparently, it's a lot of work to support it across all platforms. And finally, image-based languages; these are the two that I know of. Incidentally, there's also Factor, which is a modern Forth dialect, which also happens to be image-based. And I remember that the function to save images in SBCL is called save-lisp-and-die, which is quite interesting. OK, so future work.
There are probably some things we could do to make this better. The first thing is that we could remove the need for a C compiler. There's nothing preventing us from writing out a binary image with one or more modules, with some headers, and performing the traditional rebasing and binding fixups which an operating system's dynamic loader performs while loading native shared objects. And if we can do this, we don't need a C compiler, and we can do it at runtime. So we could have something like Python's compileall module, and you'd just get a big, fat binary image for your whole Python application, and the next time, it starts much faster. And I think that would make it a lot more accessible to everyone; you don't need to have a C compiler around. And finally, I did mention that injecting modules into sys.modules is a bit of a hack. So I think it makes sense to do it the proper way, with custom finders and loaders for importlib, a bit like zipimport in Python, only a lot faster. So that's it for this talk. Thank you very much. And please feel free to ask me any questions. OK, any questions, please? Hi, I may have missed this at the beginning, I came in a little bit late. I was wondering what the use case for this is, because you showed command line applications, but typically they're going to run against the system Python install, which typically wouldn't be customized for that application. So is it for faster server startup time? Actually, it doesn't matter as much for server startup time, because server processes are usually long running, right? Yeah, of course. So even if it is the system Python, you have a set of modules which are always imported, and you could shave a couple of tens of milliseconds off of the startup time for system Python. But obviously, if you're bundling your application with something like, say, py2exe or whatever, that could probably stand to benefit a lot more from something like this. Hello.
You showed us the syscalls, and the number decreased, but there was a column that said that the calls without errors were the same number. So let's just go back to it; the previous one. Right, the previous one, even. Oh, OK. Yeah, no, these are errors; that "without" belongs to the column header on the line above. OK, sorry. Yeah. So I think importlib tries to look for a whole lot of things, and I think we did not lobotomize importlib sufficiently for this test. Sorry. So, oh, sorry, I'll stand up. So this patched Python of yours: do you want to use it for yourself, or do you want to push it upstream at some point? Well, I don't know. I think upstream wouldn't really like a dirty hack like this. But I suppose we could probably clean it up and have it as some sort of a third-party module, where you just import stuff and things work. I mean, this is something I mentioned in the future work section: if you can get rid of the dependency on a C compiler, you could just have something like, say, import startup_speed, which just makes things faster. But are you actually using it right now? No, no, it's just an experiment. OK. OK, I think we don't have time for more questions. Let's thank Gidu. Thank you very much.