Hello, everybody. There's this famous interview question: you type python.org into the web browser and press Enter, and what happens? This talk is a bit similar. It's about what happens when you try to import some random module. Whew, lots of stuff happens. A little bit after I submitted the talk, I learned that David Beazley did a three-hour tutorial on this at this year's PyCon, so I'll try to look at it from a different angle. But if this talk is not enough for you, there's lots more material you can use to learn. More than a deep dive, this will be a guided tour through what happens when you import something. But hopefully, when we're finished with the talk, you can take deep dives through the source code yourself. So what happens when you execute this statement? Under the covers, there's a global __import__ function that gets called, and the result from that is assigned to a variable. That's pretty much all that happens. Now, the import statement is a little more powerful than that; it has evolved over the years. You can do sub-package imports with these dots, and you can import stuff from modules, and the mapping from this to the __import__ function is not always trivial. But it's documented pretty well in the docs, so if you want to do that, read it. All the __import__ function is, though, is an interface to the import machinery, which nowadays is all written in Python; it lives in the importlib module. So if you want to import something programmatically, there is also a convenience function called importlib.import_module that is much better to use. If you have a string with a module name, just use that; it's also just an interface to the import machinery. The other thing you can do with __import__ is replace it with your own function, but that is not very useful, because then you have to re-implement most of the machinery yourself. So it's not useful to call __import__, and it's not useful to replace it.
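As a rough sketch, using the random module as in the talk, the equivalence between the import statement, __import__, and importlib.import_module looks like this:

```python
import importlib

# "import random" is roughly sugar for calling the global
# __import__ function and binding its result to a name:
random = __import__('random')

# For programmatic imports, the convenience function
# importlib.import_module is the recommended interface:
also_random = importlib.import_module('random')

# Both go through the same machinery and the same cache,
# so they return the very same module object:
assert random is also_random
```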
So it's probably better if you just forget about it. The import statement calls the import machinery, and I will talk about what the import machinery does. I'll skip all the locking and caching, error handling, and support for older Pythons, all the stuff that takes up most of the library but isn't really necessary for understanding what's going on. So the basic algorithm for what happens when you import something is actually pretty simple. It looks like this. The first thing that gets checked is the cache: the sys.modules dictionary. If you import a module that has already been imported, it's stored in this cache, so when you re-import a module, you get exactly the same object back. There's a catch to this. If you delete something from the dictionary and then re-import the same module, it's gone from sys.modules, so the import machinery thinks it's never been imported, and it imports the module again. You get a brand new module object, and every function and every class in there will be brand new, which most of the time is not something you want, because other code doesn't expect this. But you can do it. The other thing you can do is poison the cache. You can assign anything to sys.modules, so you can put a string in there, and when you then import it, you get a string back as the "module" and can use all the string operations on it. Some modules actually use this trick to make modules that are callable or subscriptable or have arbitrary attributes. There's some limited use to this, but maybe you shouldn't do it in production. So that's the first step. The second step is this find_spec function that takes the name of the module and the path. In most cases, the path will be sys.path, which is just a list of all the locations that modules can be imported from on my system; it's usually much longer than this. For details on how it's constructed, see David Beazley's talk; he talks about it at length.
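The cache behavior described here can be demonstrated in a few lines; the cache-poisoning part is exactly the kind of thing the talk warns against using in production:

```python
import sys
import json

original = json
del sys.modules['json']   # the machinery now thinks json was never imported
import json               # so it builds a brand-new module object
assert json is not original

# Poisoning the cache: an import returns whatever sys.modules holds,
# even if it is not a module at all.
sys.modules['fake_module'] = 'just a string'
import fake_module
assert fake_module.upper() == 'JUST A STRING'
```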
So with these two, I call find_spec, and that gives me a spec object, the ModuleSpec object. That is just a description of how the module will be loaded and where it will be loaded from. There's actually a utility function that you can call to get the spec without importing the module, like this. The module spec gives you the name; the loader, which is the strategy for how it'll be loaded; and the origin, which is where the module will be loaded from. So you can inspect all that without importing the module, which might be useful at times. Also, the module spec becomes a permanent record of how the module was loaded: on any module, you can look at the __spec__ attribute and see where it got loaded from. The next step is the actual loading. We'll look at it in a little more detail later, but what happens here is that an empty module object is put into sys.modules, and after that, it's initialized. It's important that it's done in this exact order: first it's put into sys.modules, and after that it's initialized and all the functions and classes get assigned to it. And after that, the machinery looks into sys.modules and returns whatever it finds there. This is a simplification, of course, but you can already use it to solve real-world problems. For example, import cycles, everybody's favorite thing when it comes to importing, as I've learned. So we have two modules here, and one imports the other, and the other imports the first one again. This is a very bad thing to do; it usually results in errors that are not so nice, but if you know this algorithm, you can reason your way through what is happening. So if I import foo, the machinery checks sys.modules, doesn't find foo there, so it finds the source code for foo and starts loading it. First it puts an empty foo into sys.modules and then starts going through the source, line by line.
The first thing it finds is "import bar", so it goes to import bar, doesn't find it in sys.modules, puts an empty bar module object into sys.modules, and starts going through that. The first thing it finds there is another import, so it tries to import foo. It looks into sys.modules and finds foo in there, because it already put it there, but foo hasn't gone through all of its initialization yet. So we get a half-initialized foo object here, and then we try to call this function, which Python hasn't seen yet in the initialization, so this fails with an AttributeError. The whole thing bails out, you get an ImportError, and you start looking for where the error is; it's not so obvious. There are some tools that can detect these import cycles and warn you, which you should use, and the best way to solve this problem is probably to take the functions that both modules need, move them to a third module, and import that. But if you ever run into this situation, you now know how to reason about it. Okay, so here we go. You can see I left some space here, because there's obviously something more, and that something more has to do with sub-modules and packages. So let's go through a little bit of vocabulary. Our random module was a top-level module; you can import it directly. So is urllib, for example. But urllib also has some other modules below it. So urllib is a package: it's the parent of urllib.parse, urllib.request, and urllib.response, and those are sub-modules of urllib. Everybody clear on that? I hope you knew that already. So what happens when I try to import a sub-module? First, the path is different. For sub-modules, the path is not sys.path; the path is taken from the parent. The parent has this __path__ attribute, and that says where all the sub-modules are loaded from. And the second thing that's different is these two parts: for sub-modules, the parent is always loaded first.
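Both differences are easy to observe from the interpreter; here's a small sketch using urllib:

```python
import sys
import urllib.parse

# A package carries a __path__ attribute: the list of directories
# its sub-modules are searched in, instead of sys.path:
print(urllib.__path__)

# The parent is always loaded before the sub-module, and both end
# up in sys.modules under their full dotted names:
assert 'urllib' in sys.modules
assert 'urllib.parse' in sys.modules

# The sub-module also becomes an attribute of the parent:
assert getattr(urllib, 'parse') is urllib.parse

# A plain top-level module like random has no __path__:
import random
assert not hasattr(random, '__path__')
```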
There's no way to load urllib.parse without loading urllib; it's always done first. And if loading the parent somehow causes urllib.parse to also be loaded, at that point it's just returned from the cache; otherwise it's imported normally. And at the end, after everything's done, the sub-module is set as an attribute on the parent. So if you import urllib.parse, the object you actually get is urllib, but it has a parse attribute on it that you can get at with the dot, because it's set as an attribute at the very end of importing. So there's the more complete algorithm, which you can use to reason about more complex situations that involve sub-module loading. For example, say I have this simple package, with an __init__.py with two imports, a sub-module with some constant value, and a sub-module with some code that uses it, and I try to import that. What happens? The parent module is always loaded first, no matter which one of these you import. So first it looks in sys.modules for foo, doesn't find it, goes to find_spec, looks up the source, and executes it line by line. Here we get to the import statement, which invokes the machinery again. It looks in sys.modules for foo.main, doesn't find it, and because it's a sub-module, it goes to load the parent foo first. That was already loaded, it's already in sys.modules, so it returns early, and we go on to executing foo.main's code. That gets to its own import statement and tries to import foo again, looks in sys.modules, the module is there, so it returns that, and then we try to use it. At this point we have the foo module, but it doesn't have the constant attribute yet, because that only gets assigned at the end of the import that we haven't finished yet, so once again you get an AttributeError. And this is kind of complex, and you have to understand this algorithm, which arguably is not that hard, but if you have bigger modules then it gets a bit complicated, so I've prepared a set of little rules that you should follow to be okay.
First of all, your __init__.py should be a public interface to your package: it should just import stuff from the sub-modules, maybe set __all__, and do nothing else. Then your sub-modules should not use that public interface; they should import directly from the sub-modules they need, because you know about the internal structure of your own package. And obviously you shouldn't have import cycles among the sub-modules themselves. If you follow these rules, you should be okay; otherwise, understand the algorithm and you can reason your way through. Okay, so that's that, and maybe you're wondering what exactly this find_spec does, so let's look at that. Let's look first at the result: where do you actually load a module from? If I import my random module, I can print it out and see it's loaded from some location on my system. I can look at its __file__ attribute and get the same thing back as a string. But if I import another module, say sys, and print it out, I see it's built-in, and it doesn't have a __file__ attribute at all. Does anybody know where the sys module is actually located on your system? No? sys is actually built into the executable itself. In my case, it's inside the python3 binary; it's built into the actual program. But all the other modules are in this place. So these are two different types of modules. If we have a look at the whole menagerie of module types, we can see we have the built-in modules, which are written in C and compiled into Python itself. We have source modules, which are written in Python and loaded from files. And we have some other types of modules as well. We can have extension modules, which are written in C or some other compiled language and loaded from a file, a shared library; on my system, that's math, for example, and some NumPy core modules can be extension modules as well. And the fourth type is frozen modules, which are written in Python and compiled into the executable itself.
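You can poke at these categories on your own system; module names and file locations will differ between builds, so treat this as a sketch:

```python
import sys

# Built-in modules are compiled into the interpreter executable:
assert 'sys' in sys.builtin_module_names
print(len(sys.builtin_module_names), 'built-in modules')

# A source module records the file it was loaded from:
import random
print(random.__spec__.origin)   # a path ending in random.py

# sys, being built-in, has no file at all:
print(sys.__spec__.origin)      # 'built-in'
assert not hasattr(sys, '__file__')
```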
One example that everybody uses is _frozen_importlib, which is a copy of the import machinery that's built into Python for bootstrapping the real import machinery, because you'd otherwise have to use the import machinery to read it from files. And tools like py2app or py2exe actually compile Python modules into the resulting executable to make a one-file executable, so that's another use case for freezing. So how do we load all these different kinds of modules? Well, there's a list of strategies to use in sys.meta_path, and the algorithm is quite simple: we just ask each of these finders in turn if they can load our module. So if I'm loading the sys module, I ask the built-in importer: hey, do you have a sys module? And the built-in importer looks at the list of built-in modules and says: yep, here it is, here's the information, and it gives me a spec for it. If I'm importing random, I ask the built-in importer, and it doesn't find random among the built-in modules, so I ask the frozen importer; it doesn't find random in the list of frozen modules, so I ask the path finder. And the path finder is a bit more complicated. This is the thing that looks at sys.path. It goes through every entry in sys.path in order, and for every path entry, it has what is called a path hook. The algorithm looks like this: it goes to the current directory and constructs a finder for it using the path hooks. The zipimporter can't handle a directory, so it's skipped, but there's a FileFinder which can handle directories, so that one is used for the current directory, and there we look for these files, and we probably don't find any of them there. So we go to the next entry, which is a zip file. We ask the zipimporter for this zip file if it can find any of these, and it can't. So we go to the next entry and ask the FileFinder if it can find any of these in there, and it can: random.py is actually in this directory, and since the file is there, a spec is returned.
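The finder loop just described can be sketched in a few lines; this is a simplification of what the real machinery does, not the actual implementation:

```python
import sys

def find_spec(name, path=None):
    """Ask each finder on sys.meta_path in turn; the first spec wins."""
    for finder in sys.meta_path:
        spec = finder.find_spec(name, path)
        if spec is not None:
            return spec
    return None

# The default finders, in order: built-in, frozen, then the
# path-based finder that walks sys.path:
print([getattr(f, '__name__', type(f).__name__) for f in sys.meta_path])

print(find_spec('sys').origin)      # 'built-in'
print(find_spec('random').origin)   # a path to random.py
```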
Now at this point, we have a spec, and once we have a spec, we don't look any further. When the file exists, the spec is returned for it, and the machinery doesn't look any further along the path, so the first match wins. And what's in the module spec again? We have the name; we have the origin, which is where the source code is loaded from; we have a location for the cache file, which may or may not exist; we have the loader, which is the strategy used to load the source; and some other loader-specific information. You can read all about this in the PEP that I will link later. So that's how you get the spec. And we have a bit more time left, so I can talk about how to actually load a module. Once we have the spec, the loading is kind of simple. First, we create a module object, and a module object is nothing special; it's just an object that has a __name__ attribute. Either the loader can create one, or if it doesn't want to, a default one is created. After that, we set the initial module attributes, which are actually just copied from the spec: the spec gets copied to __spec__, the name gets copied to __name__. So now we have two places for each of these bits of information, which is kind of redundant, and you can change each one of them individually, so it's a bit of a mess. But one or the other is always used later. And after that, we put the module into sys.modules and execute whatever source code we find. The global variables are actually just attributes on the module object, which is kind of fun to play with if you import the __main__ module: you can assign a global variable and get it back as an attribute, or vice versa. This is also where __name__ comes from; it's assigned very early in the loading phase, so by the time you get to executing your code, it's already there and you can check what it is. So yeah, that's executing the module.
And one more thing I have is how to actually get the code for a source module. In the module spec, we have both the origin, the .py file, and the cache location. If the cache file exists, and it was compiled from a matching .py file, meaning it records the same size and the same modification time, then the bytecode is read from the cache file and executed. If it doesn't match, then the code is read from the origin file, compiled, and potentially stored in the cache. If you're familiar with how Python 2 did this, the origin and cache were in the same directory, which had the problem that if you deleted the .py file, the .pyc still got executed. So you had this zombie that for some reason was still there and did the same thing as a deleted file, which used to throw off a lot of beginners, and not only them. In Python 3, we have the __pycache__ directory, which no longer has this problem. In __pycache__, we have the .pyc, but if the .py is not there, the cache isn't even looked at. What you can do, if you really want to load things from .pyc files, is copy the .pyc over to the old location and delete the .py, and it'll actually work. And this is all the code; it's just a screenful that you have to understand. If you want any more details, importlib is installed on your computer, so you can just look at it now and see what's going on. Thank you. Thanks, Peter. Do we have any questions? Thank you for your talk. I would like to know what the use case is for being able to load source code from a zip file, because... Excuse me? What is the use case for loading code from zip files? From pyc files? No, no, zip. Oh, C files? Zip files. Okay, when I showed you the different kinds of modules, I wasn't really complete. It looks like this. You can load from native code, for example written in C, and you can load from Python code or from bytecode.
You have built-in, frozen, extension, source, and sourceless modules, where sourceless are the .pyc files, and you can also load source or sourceless files from zips. And this is done to ease packaging. For example, some Windows users don't like deep directory structures where you have lots of files in lots of directories, so you just zip those all up. Nowadays it usually has the .pyz extension, and you can import directly from that. You can actually run those, too: the .pyz extension is associated with Python, and if you have a __main__ module in there, it'll actually run it. Also on Linux, if it has a shebang, you can run those. So it's just an easier way to package things: you download one zip file, and everything's in there. Any more questions? There are no questions. I can find something else to talk about. Do we have time? A few minutes. All right, so one thing I forgot is this create_module and exec_module. That split is for Python modules. For C modules, like the extension or built-in ones, everything happens in create_module: there's a PyInit hook function that creates the module and also initializes it in one step, and then the exec step is just a no-op; it does nothing. That is the current situation in Python 3.4. For Python 3.5, there is a new mechanism that does something similar for extension modules as for Python modules: the create step creates an empty module object, and then there's a separate exec step that you can do your work in. That is better, because at the time exec is run, the module object is already in sys.modules. What could happen before is, if you ran some user code, some Python code, and it tried to import your module again, you would get into an infinite loop, because the module is not in the cache yet, so it would try to re-import your module again. And the loading is a bit more declarative now. It's in PEP 489, and you can go read that if you're interested. So there's work going on in this area still, and I hope the talk won't be obsolete in a few years. Hi.
I was just wondering what would happen if you loaded a module with a class in it? Once again? You load a module, it's got a class in it, you instantiate that class, and then you do that trick that you said you shouldn't do at the start, where you re-initialize that module. You reload it. Yes, so what the re-initialization, or just reload, does is create a new module object and new class objects, but every instance of an existing class keeps a reference to the original class. So all the old instances would use the old class, and all the new instances would use the new class, which creates some problems. For example, if you check for equality, and it's implemented by looking at the class, then the classes obviously don't match and you have a problem, because you think they're the same, and the string representation is the same, but the classes are actually different objects if you look at their ids. So there are some use cases for this, but it's usually better to stay well away from it. Okay, thanks very much, Peter. Thank you. Thank you.