Okay, so this is Antonio Cuni, and I'm Maciej Fijalkowski, and we're going to talk a bit about Python and PyPy performance today. Who are we? We are PyPy core developers, and we also work on a bunch of other projects; maybe the most well-known one is CFFI, which is a way to call C. We also do consulting and run a company, Baroque Software.

So, let's start with the usual quote: premature optimization is the root of all evil, and you usually end up spending 80% of your time in 20% of your code. But it's important to remember that 20% of one million lines is still 200,000 lines; the 20% tends to scale with the size too, and that can cause trouble if your program is not 10 lines of code. We're going to talk a bit about how to identify the slow parts and how to optimize them. In the first part, I'll talk about profiling tools and how you go about finding what's wrong, and in the other part, Antonio will talk about how to address those problems.

So yes, the first step is identifying the slow parts. What is performance? Who here has ever tried to optimize something? Most people, good. Well, it's a valid question. Usually it's the time we spend doing task X, where task X might be serving one HTTP request, or computing one protein, or one of those things. But sometimes it's the number of requests per second, and sometimes it's the latency. And the interesting question here is: what sort of statistical property are you actually interested in? Do you care about the average? Do you care about the worst-case scenario? If you're developing a car brake system, the metric you're optimizing is not the average time it takes to brake; it's the worst-case time it takes to brake, and that is the metric you work with. Sometimes you don't care: if you're serving HTTP requests and one in 10,000 requests takes 50 milliseconds more, well, too bad, somebody just has to press Ctrl-R. So it's important to know what you are trying to measure before you actually start measuring.

Once you know that, it's good to have some means of measuring stuff. Benchmarks are very good. And if you don't have benchmarks, you might want to just check things in production, see how they actually work on real data, and find out whether Python is your problem, or whether your problem is waiting forever for I/O, or the fact that you have 700 microservices, each of them talking HTTP to each other all the time. It's important to be able to quickly determine whether something you changed actually changed things or not. It's the same as debugging: if it takes you a week to go around and try the next thing, then the chances of optimizing anything are really, really low.

So, we've already got to the point where we know Python is our problem. Looking at top or whatever, we see that the Python processes just consume CPU time. One important thing: systems these days are way too complicated to guess. You cannot just stare at the code, rewrite it differently, and hope, "but I know how Python works". Do you really? I don't, for one. You have to actually measure. You have to see: OK, I run this, it gives me five seconds; I run that, it gives me four seconds; and for that I need a tool to measure with. Again, remember that the bottleneck can be small, but it can also be really large and spread all over the place.
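To make the "measure, don't guess" point concrete, here is a minimal sketch using the standard timeit module to compare two candidate implementations; the function names and workload here are made up for illustration:

```python
import timeit

def sum_loop(xs):
    # Hand-written loop, one candidate implementation.
    total = 0
    for x in xs:
        total += x
    return total

data = list(range(10000))

# Time each candidate over many runs; compare totals, not a single run.
print("loop:   ", timeit.timeit(lambda: sum_loop(data), number=1000))
print("builtin:", timeit.timeit(lambda: sum(data), number=1000))
```

The exact numbers matter less than having a fast, repeatable way to tell whether a change helped at all.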
Profilers: who has used cProfile here? Almost everybody. Who has used Plop? No one. So, Plop is a tool done by the Dropbox guys, and I think they run it all the time on Dropbox. The difference is that cProfile is event-based: each function is instrumented to report "now I'm entering the function, start a timer; now I'm exiting the function, stop the timer". It's one way to do profiling: you get time spent in functions, you get call counts, stuff like that. The problem is that it has relatively high overhead. On CPython I think it's about two times, roughly; on PyPy it's worse. And the overhead is not evenly spread, because it's a roughly constant cost per function call: if your function does a lot, the overhead is small; if your function does little, the overhead is large. That's very bad, and especially bad on PyPy, because it skews the results towards putting more emphasis on small functions than on large functions.

VMProf is a tool that we, me and Antonio, have been working on this year; this year, last year, something like that. It's a statistical profiler: it samples your code as it runs. OK, now I'm in this function; wait a few milliseconds, where am I now? Wait a few milliseconds, where am I now? It tries to capture those stacks and give you statistical information. So it won't tell you how many times a function got called, but it will tell you how much time you statistically spend in it.

VMProf was inspired by gperftools, which is a similar tool for C. It's based on an interrupt, so on a Linux machine it fires about 300 times a second, which is not granular enough if your program runs for 10 milliseconds; but if you run a big server that runs for seconds or minutes, you're usually fine. It samples the C stack, and it works on CPython and PyPy, and possibly on different virtual machines in the future. The reason we didn't just use gperftools is that the C stack usually does not contain much useful information. If you look at the C stack of CPython anywhere, you will see that 90% of the time is spent in PyEval_EvalFrameEx, the interpreter loop. Thank you very much; that is not a useful piece of information, so we want Python-level functions. The situation is even worse if you have a just-in-time compiler, which I'll talk about later. So we want to be able to reconstruct the Python stack from the C-stack snapshot.

Demo. So we have, say, a small program; let's pick pystone, because everybody loves pystone. Where is pystone? In lib-python/test, here. Pystone is a benchmark that was written for Python 1.1. Who remembers Python 1.1? It was a while ago. It was translated from C to Python by Guido van Rossum, and it does a bunch of things in an old style, like FALSE = 0, for example, because it predates booleans in Python. So it's a very dated benchmark, but fine, let's run it. I run python -m vmprof; it comes as a module, you just run it like this. Well, that probably was not enough; let's run it slightly longer. More passes. No, even more. OK, now it's running long enough to actually get useful information. This displays statistical information about which function we spent how much time in, so Proc1 was 37%; but this alone is a little useless, because we don't see who actually called Proc1, what Proc1 called, and so on. So we have a little web tool for that.
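As an aside, besides the python -m vmprof command-line form shown in the demo, vmprof can also be enabled programmatically. This is a minimal sketch from memory of the early API, so treat the exact signature as an assumption; run_workload is a placeholder for your own code:

```python
import vmprof

def run_workload():
    # Placeholder: the code you actually want to profile.
    sum(i * i for i in range(10 ** 7))

with open("output.prof", "w+b") as f:
    vmprof.enable(f.fileno())   # start sampling, writing into the file
    run_workload()
    vmprof.disable()            # stop sampling and flush
```

The resulting file is what the command-line tool and the web viewer consume.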
I'm running this on my local machine because, well, the network here... port 8000, I think. We're going to wait a second and rerun the benchmark. OK. So, the URL, and let's hope it renders properly. Thinking, thinking, OK. Here we see the main code: main called pystones, which called Proc0, which called Proc1, Func2 and Proc8; those are directly called by Proc0. If we do the same thing on PyPy, it's relatively similar, except that, first of all, the split is slightly different. You can see that PyPy has optimized things. Remember, those are percentages: all of them go faster, but some of them got much faster than the others. PyPy does not optimize all code equally. And here we might see just a number; that's JIT-compiled code, simply a piece that was compiled to assembler and run. I'm actively working on taking this information and making sure it ends up here, so you don't see random numbers; it will just say, OK, this was 100% JITted. So this is VMProf right now. It's something that we fixed as of yesterday, so maybe you should be slightly careful, but it's good to try it and see whether it actually works these days.

[Audience question about the overhead.] Yes, yes: the runtime overhead of running VMProf is really, really low, roughly 5%. And yes, that's what I have on the next slide: we want a profiler that you can deploy and run in production all the time. It's not a thing that you can only run on benchmarks somewhere in an isolated corner, like cProfile; you can actually deploy it and have the data coming to you. Which brings me to another point: we want real-time monitoring of performance, so you can look at the last hour, or filter by request, and so on. Something I should probably mention is that we are thinking about running this as a service, where you upload the data somewhere and get some sort of advanced analysis: what happened? Did the code get compiled correctly? Are there unusual patterns to look for? Well, that's work for the future. Multi-threading right now is at best untested, which probably means it doesn't work, but the goal is to make it work; the signal handlers and everything are designed to work with multi-threading, we just need to look at how it behaves from the Python side. It might work, who knows, but it probably doesn't.

Yes, so as Maciej said earlier, when you want to optimize something, you have to spot the parts of the code that are slow, and then you have to try to make them fast. There are many ways to make Python code faster, and there is of course no time to explore all of them. You can write a C extension, you can use Cython, you can use Numba, or you can rewrite your Python code using tricks you find on the internet; most of the web pages you find claim these tricks work, but most of the time they just don't make your code faster. That's another topic, though. Or you can use PyPy. PyPy is an alternative Python implementation. How many of you know what PyPy is? Yes, almost everyone, good. Things have changed since ten EuroPythons ago, when nobody knew what PyPy was. That's a good thing, I think.
And yes, PyPy is a Python implementation with a JIT compiler. I'm going to concentrate on this tool in this part of the talk because, yes, we wrote it, we are biased, and it's an interesting tool. The nice thing about it, compared to the other tools I mentioned, is that in theory PyPy gives you most of the wins for free: you don't have to rewrite your code, you don't have to use another tool, you just run your Python code and it goes faster. Currently we have released PyPy 2.6, whose version number has nothing to do with the version of the Python language: PyPy 2.6 implements Python 2.7.9. There is also a release for Python 3, and if you are interested in knowing more, there are other EuroPython talks during the week. And if you go to speed.pypy.org, you see nice graphs saying that it's seven times faster than CPython, which by itself doesn't mean anything. Of course PyPy is seven times faster than CPython on the benchmarks that we selected. But we are not interested in benchmarks on which we happen to be really fast; we try to select benchmarks that reflect real-world problems, and on those benchmarks the average is seven times faster.

I said that PyPy contains a JIT, which is the part that makes your code faster, and I'm briefly going to explain what a JIT is. Suppose you have a piece of Python code, which contains function calls and loops and so on. If you interpret your program with CPython, or with PyPy without the JIT, you spend some time in the green part, some time in the blue part, a bigger part of your time in the red and orange parts, et cetera. The idea behind the JIT is that we can optimize the hot spots by compiling them to assembler, so that they execute much faster, and then the total time spent running your program is lower. I cannot go into detail because there is no time, we are a bit in a rush, but how does it work? The key idea is that we compile only the hot spots, the parts where most of the execution time goes. And how do we make them fast? By specializing the code, basically. If we see that a certain loop or function runs with integers, say an addition of two integers, we produce a specialized version of assembler which knows that these variables are integers. If later, by chance, we see floats or strings or lists or whatever, we produce another specialized version of your Python code which is fast for those new types. The idea is to pre-compute as much information as possible during the JIT compilation phase, so that once it is finished, your assembler code does only the interesting work, the part that is really needed by your code.

For example, suppose you have this line of code, which happens very often in Python programs: obj.foo(), a method call. This is a very simplified version of what happens. First, we look up foo in the dictionary of the object, of the instance. If it's not found, we look it up in the class. If it's not found in the class, we start looking it up in the base class, and the base of the base class, et cetera. And finally, when we have found it, we execute it. If you are interpreting, you have to do these steps again and again: if that obj.foo() is in a loop that runs one million times, you do the lookup one million times.
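A rough sketch of that lookup sequence in plain Python. This is deliberately simplified (the real rules also involve descriptors, __getattr__ and more), but it shows the work an interpreter repeats on every call:

```python
def lookup_method(obj, name):
    # 1. Look in the instance dictionary.
    if name in obj.__dict__:
        return obj.__dict__[name]
    # 2. Walk the class and its base classes, in method resolution order.
    for klass in type(obj).__mro__:
        if name in klass.__dict__:
            return klass.__dict__[name]
    raise AttributeError(name)

# Without a JIT, something like this runs for every single obj.foo():
#     lookup_method(obj, "foo")(...)
```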
And so the idea is that in PyPy we pre-compute the lookup, so that we know which function foo corresponds to and can jump directly to it. But of course Python is dynamic, so things can change: I could change the class of the object, I could add and remove attributes either on the object or on the class, and I can do all sorts of tricks. And these tricks are done in real-world programs. So the idea is that we compile the code pretending that obj.__class__ is constant and that the class hierarchy is constant, and then we can do inlining, constant propagation, et cetera. To make sure the code still behaves correctly, we insert a guard, which is a quick runtime check that our assumptions are still true. If the guard fails, it means the assembler we are executing is no longer valid, or not valid for this case, so we bail out and start interpreting again. Yes, that's going to be slower, but it's better to be slower than to be incorrect, of course. And eventually we compile a new version of the assembler for the new assumptions, et cetera. So in the end we reach a situation in which all the parts of the code that are executed often run fast, because we have compiled everything.

The hard part is deciding what to specialize on. As I said, we specialize on the class of the object, but we could specialize on the number of attributes, or on whether the class name starts with O, or something like that. Basically we have to use heuristics, and the PyPy code is written assuming that some things are more constant than others. We assume that the classes of objects don't change very often, but the values of attributes can change; so we promote the class but not the values. We assume that modules are roughly constant: it's not like we add and remove functions from modules at runtime. We can, and in that case, yes, we specialize twice or three times, but it's a fairly safe assumption. And sometimes we just have a constant in the bytecode: if you write a + 1, the 1 is a constant, and we assume that the constant is constant. This is usually true.

Of course, specializing is a trade-off. If we specialize too much, we spend most of our time compiling new code instead of reusing what we already have; we consume memory and spend all our time compiling. If we don't specialize enough, we produce inefficient code, because, for example, if you don't specialize on the class of the object, you have to do the lookup again and again. So yes, it's a trade-off, and it's our job to find the best one.
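To make the guard idea concrete, here is a toy, runnable illustration in plain Python of what a specialized trace conceptually does; all names here are invented for the example, and the real thing is machine code, not Python:

```python
class Point(object):
    def foo(self):
        return 42

def specialized_call(obj, fallback):
    # Guard: check the assumption this code was "compiled" under,
    # namely that obj is an instance of exactly Point. In the real
    # JIT this is a cheap pointer comparison.
    if type(obj) is not Point:
        # Guard failed: bail out to the generic, interpreted path.
        return fallback(obj)
    # Guard held: the method lookup was done once, "at compile time",
    # so we can call (or inline) Point.foo directly, with no dict lookups.
    return Point.foo(obj)

# The fallback stands in for the interpreter's slow generic lookup.
print(specialized_call(Point(), fallback=lambda o: o.foo()))  # 42
```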
This brings us to the next point, which is how to write your code in a way that is friendly to PyPy. Unfortunately we cannot spend much time on this, because it's a topic that could fill a week, not half an hour. But one point of view is that you should stop doing the things you have done until now. Usually, to optimize pure Python code without external tools, you did things like trying to avoid method lookups: maybe you save the bound method in a variable and then call that repeatedly. This is something that Guido wrote a couple of years ago: be suspicious of function and method calls, because creating a stack frame is expensive. That is completely untrue in PyPy, because functions are inlined. So basically, if you follow this kind of advice, you write worse Python code because you are trying to optimize it manually, when if you just write nice Python code, the PyPy JIT compiler can do it for you.

So, a couple of points of general advice on how to write Python code. Simple is better than complicated: if you write really plain Python code that is self-explanatory and that you can understand well, then probably the JIT compiler can understand it as well; it has a better clue about what's going on and better chances to optimize it. You should avoid string concatenation in a loop: on CPython it's usually fast only because of one optimization that works when the string has a reference count of one, and PyPy can't do that optimization, so you should avoid it on both CPython and PyPy (there's a sketch of the alternative after this paragraph). You should try to avoid itertools monsters. Sometimes I see pieces of code that are an itertools call on another iterable which calls another itertools function and so on, and I have no idea what it's doing; if you do, good for you. But this kind of thing confuses the PyPy JIT. If you just write your nice Python loop, with nested loops, and say what you are doing, chances are the JIT can optimize it as well as itertools, or even better.

The other usual advice is to write stuff in C. Well, no. That is good advice for CPython, because pure Python on CPython is very slow compared to C. But if you write stuff in C, the PyPy JIT cannot know what is happening, so it has to stop optimizing at some point and call into C. If you write everything in pure Python, the PyPy JIT has better knowledge of what's going on and better chances to optimize your code up to the best performance it can reach. So if you want to interface with external C code, the best thing to do is to use CFFI, which works on both CPython and PyPy and is fast and optimized, et cetera (see the sketch below). You should avoid C extensions that use the CPython C API. We have a compatibility layer, but it's really an emulation of reference counting and other things, and it's very slow on PyPy. Yes, it works; it's useful if you want to try your software. But if you use a C extension built on the CPython C API in a part of the code that matters from a performance point of view, it's going to kill all your performance.

And then you should avoid things that confuse the PyPy JIT in the ways I was explaining earlier. For example, we assume that the class is a constant and that classes are essentially fixed from some point on, so you should avoid creating classes at runtime. If you have a function, and inside the function you create a class and then return an object instantiating that class: yes, this is valid Python, and it works on PyPy, but the JIT will specialize on the new class again and again, without ever reaching a fixed point. It's not that it's not allowed. You can create new classes, for example at import time, because you want to do some metaprogramming; that's perfectly valid. But if you create one million classes, well, the JIT will create one million pieces of assembler, basically. (The last sketch below shows the pattern.)
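Here is what the string-concatenation advice looks like in code; a small sketch, with made-up function names:

```python
def build_slow(items):
    s = ""
    for item in items:
        s += str(item)   # may copy the whole string on every iteration
    return s

def build_fast(items):
    # join allocates the result once; portable across CPython and PyPy.
    return "".join(str(item) for item in items)
```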
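And a minimal CFFI sketch, using the ABI mode and a libc function just to keep it self-contained; real bindings would usually declare their own library and often use the API mode instead:

```python
from cffi import FFI

ffi = FFI()
ffi.cdef("int atoi(const char *s);")  # declare the C function we call
libc = ffi.dlopen(None)               # load the C standard library (Unix)

print(libc.atoi(b"42"))               # -> 42
```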
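Finally, the class-created-at-runtime anti-pattern, with a hypothetical Point class for illustration:

```python
# Bad: a brand-new class object is created on every call, so the JIT
# specializes again and again and never reaches a fixed point.
def make_point_bad(x, y):
    class Point(object):
        def __init__(self, x, y):
            self.x, self.y = x, y
    return Point(x, y)

# Fine: the class is created once, at import time.
class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

def make_point_good(x, y):
    return Point(x, y)
```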
And this, for example, is what you should not do to optimize your code. If you write this monster that applies operator.attrgetter('x') and maps it over the elements of the list, et cetera: well, it's much, much easier to just write the list comprehension. And this is the kind of advice you will find on the web. If you measure it, on CPython they are exactly the same speed, so the advice isn't true even on CPython; and on PyPy the attrgetter version is actually a bit slower. So please, just write your nice, understandable Python code, and the JIT will remove the overhead for you. If you want to know more about PyPy, we are around for the whole week, so just come and talk to us. Tomorrow there is an open space in the A4 room at 18:00; it will probably be a Q&A session, so just come and ask.

And yes, before the end: Maciej wants to show you a better demo of VMProf. Yes, since we have some time. Yesterday we put together a Django example; it's a small Django example, and it's obviously rigged to show slow parts in the profile. So this is the Django app, and the index view makes some spurious pickle calls. That's the thing I wrote, and I wanted to see: can I find this stuff using VMProf and make sure things work nicely?

So I'm going to run it first on CPython, uploading the profile to localhost, and I run manage.py runserver. Now, because it's Django, it starts multiple processes, and if you just run it like this, you end up profiling the watchdog process, which does nothing, thank you very much. So: no reload. And I'll disable threading, just in case. OK, I'm running the server on port 8001. Let me look at it. Good, fine. So I'm going to run a simple ab, and I'm not going to listen to you saying that ab is a terrible tool; I just want to send some requests. So: much, much thinking. Well, it's CPython, so I can probably just stop it; 4,000 requests is fine, because there's no JIT warm-up time or anything like that. And then I upload the profiling data. Hop; pretty sure it's trying to load fonts or something. OK, so we have Django doing its job, with a call stack a billion functions deep: handle, run, call, call, get_response, index. I click on index, and index actually spent only 47% doing the actual job. And you can see that dumps itself took like 66% of the total. So if you kill that spurious pickle call that I put there, it should be something like 50% faster. OK, fine.

Now I'm going to run the same thing on PyPy. Hop, hop: it starts slow, then warms up, and the requests get faster. This is how a JIT works; it takes time to warm up. It really depends on your workload, but usually, if your workload runs for five seconds or more, you're fine. This was around 600 requests a second, so about 3,000 requests here was fine. We did measure: yes, the longest request takes 46 milliseconds, and the 50th percentile is below one millisecond. So you see, the warm-up really shows. Is that bad? I think so; but then, if you try, say, Java, the warm-up is really, really slow there too. So it's not that bad, but it's still relatively slow. The overhead of profiling here: we checked the other day, it was around 660 requests a second without the profiler on, so that's below 10%, more like 8%. If I look here... OK, I can't scroll to the right because it doesn't render, maybe. I don't know.
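The demo's rigged view isn't shown on the slides, but a view along these lines would reproduce what is being described; this is a hypothetical reconstruction, not the actual demo code:

```python
# views.py (hypothetical): an index view doing spurious pickle work.
import pickle

from django.http import HttpResponse

def index(request):
    # Pointless serialization on every request, purely to create
    # an obvious hot spot for the profiler to find.
    pickle.dumps(list(range(10000)))
    return HttpResponse("ok")
```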
So, what I was trying to show is that here we have the normal Django stack, which is five billion levels deep, and it goes run, call, call, get_response, index, somewhere down there; but index itself takes far less of the time than on CPython. If you click on index, it's like 32% spent doing index, and there's one guy in there, make_style. And what's the name of this one? color_style. That made me wonder, because my little program that returns "ok" does not do much coloring, as far as I can tell. So we looked at what's going on there: django, core, management, color. One of them is called color_style, the other make_style. So make_style does something, and as you can see, it defines a class within the function body, for no actual reason whatsoever as far as I can tell, and this thing alone makes this simple benchmark about 30% slower. If you remove this and do that: hop, we made Django 30% faster, bye. On this absolutely idiotic benchmark, I agree, but still, on the full request path. This browser thing is not really working. Anyway: questions, I suppose.

Hi, thank you for the presentation, it was very interesting and fun. I have two questions, both about the profiler. The first one: do you plan to support Windows? Because, as far as I understood, you use signals, and there are no signals on Windows.

So, yes, we plan to support Windows. We haven't made a precise plan yet for how to do it, but probably what you can do is run a separate C-level thread and sample from that. You have to be very, very careful, but maybe it's possible; I don't know. We support Linux right now, 64-bit Linux. OS X support is in progress, and for Windows we'll look at how to support it, but I don't know yet; if there's high demand, we'll support it. Oh, and for PyPy itself: PyPy works on Linux, OS X and Windows, on 32-bit and 64-bit Intel, except that on Windows it doesn't work in 64-bit.

[What about other platforms?] You mean for the profiler or for PyPy itself? So, FreeBSD actually works. It's not officially supported, but it works; people port it. Generally speaking, porting PyPy is the same effort as porting CPython: most of the stuff you are porting is the CPython standard library and all the OS calls there. We use a few extra calls, to say "compile assembler", "allocate this", but as long as you're not targeting an awkward architecture, like MIPS or something, it's relatively easy.

Yeah, the second question, about the profiler. I didn't quite understand: is it possible to do cross-profiling between C and Python? And if it is possible, how do you do that?

So, what happens is that you capture the entire C stack, and that C stack includes special entries for the Python functions. I can show you later. The idea is that you have all the C calls and then you just throw them away, so keeping the extra C calls in there is very easy. The only problem is that you then need your DWARF data to be present; you need to be able to look up the symbols. But the support is already there. Thank you.

Hi. Is it fully compatible with Python 2.7? Can I just replace CPython with PyPy? So, for the most part, yes. The only thing is that you might want to look into C extensions; not all C extensions work. But if your code is pure Python, it generally works. Is there a way to check, for example for a Django project, whether all the requirements are compatible?
Well, you create a virtualenv and then try to install it. Typically what you need to do is something like replacing the MySQL bindings with MySQL CFFI bindings. For most things that are popular, there are equivalent libraries that do the same job but use CFFI instead of being a C extension. So I can't replace my CPython directly; I have to do some additional work? It depends. If it's a Django project, then you usually need to change your requirements slightly. OK, thank you.

So, I have a question. If you want to write a function and you have multiple ways to implement it, and you compare the times, the times get different over time, because the garbage collector is always present; you cannot disable it, right, in PyPy? Yes. Yes, but then you just do more statistics: average over time, run it more times. Also, the garbage collector in PyPy is incremental, so if you exclude the JIT warm-up time, which is slow at the beginning, the GC work will be spread almost evenly across your run. OK, but assuming you do a warm-up, you will still get some spikes. Not anymore; we fixed that. Sorry? Not anymore: we have an incremental garbage collector, so it does a little bit of work at a time. OK, small spikes, I mean, not that big. Good luck trying to measure them. You run very quickly into the resolution of your clocks: below a millisecond, which is where most garbage-collection spikes live, the resolution is really bad. OK, thank you.

Does the JITting work in a separate thread? Is it one thread, or how does it work? Does it just stop and compile? There is a talk by Armin about how to remove the global interpreter lock in Python, but right now PyPy, as you download it, does everything in one thread. Part of the deal is that the JIT is a tracing JIT, so it does some of its work by running your program bit by bit. You can't do that in a separate thread, because you really are running the program bit by bit. Optimizing and emitting the assembler, that you could in theory do in a background thread, but we never implemented it. OK, so the function approaches the threshold, gets compiled, and then the compiled version runs? Yes, for the most part. It's slightly more complicated than that, but yes.

More questions? Over here? No? If I have a project with a Cython extension, will it work with PyPy, or do I need to change it? So, if you have a new enough Cython, it usually works, but it's slow: the extension will be slow, because it goes through the CPython C API emulation. There's some effort to, A, make that faster, and, B, maybe make Cython compile stuff to CFFI bindings instead of compiling it to C API calls; but that work has not materialized just yet.

I want to add something, because, for example, a couple of days ago I tried to speed up something that runs on both CPython and PyPy, and I used the pure Python mode of Cython, where you have the type declarations in a separate file. On CPython I compile it with Cython, and it's fast; on PyPy I just ignore the Cython part, and it's pure Python, and it runs fast. So I think this is the best way to use Cython. Of course, it doesn't work if you want to interface with a C library; then you have to use CFFI. But yes.
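For readers unfamiliar with that trick, here is a rough sketch of Cython's pure Python mode as described; the file layout and the exact .pxd declaration syntax are my reading of the Cython docs, so double-check them there:

```python
# compute.py: plain Python, runs unmodified (and JIT-friendly) on PyPy.
def norm2(xs):
    total = 0.0
    for x in xs:
        total += x * x
    return total

# On CPython, an "augmenting" compute.pxd placed next to this file
# gives Cython the static types when you compile, roughly:
#
#     cpdef double norm2(xs)
#
# PyPy never sees the .pxd; it just imports compute.py as-is.
```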
Did you ever measure the amount of time you spend in PyPy itself? In PyPy, I mean: you have a process, and this process spends some time doing the work that needs to be done for your Python code, and then you have the JIT and all the other machinery in PyPy. Did you ever measure that amount?

Yeah, I think you are asking about the time we spend in the JIT compiler, for example. Yes. Well, let me go back to VMProf; the browser, yes, here. Well, here, in this case, it doesn't work, because it shows 100% interpreting; but normally VMProf shows you how much time you spent warming up, which means JIT compiling, how much time you spent in the garbage collector, how much time you spent in the interpreter, and the green boxes are JITted code. I think that answers your question? Yes, this display is wrong; I don't know why it didn't detect the JITted code. Can you check out the sample? Yes, well, OK. Well, that's just me: PyPy has zero overhead.

I'm not sure I understand the question. What do you mean, time spent in PyPy? All of this time is spent in PyPy. In PyPy you do your own work, for example JIT compilation or garbage collection, and then there is the user code. Yes, so you mean how much time we spend in the runtime and how much time we spend in user code. So yes, we can measure that; examples of the runtime are ourselves, say dict lookups, because dict lookups are not JIT-compiled. But it's usually not very useful for users to know that. Unless you really want to know how much time you spend in bignum calculations or something like that, it's usually not very interesting to know how much time you spend in those C functions versus in the user code. Do you care whether your code calls a dict lookup and then spends some time in a little helper or not? That doesn't really matter.

So you mean the Python overhead compared to what, the same program written in Java? Yes, or maybe in C++. Well, then the answer is that there is no really good answer to that question, because you wouldn't write it the same way. You would use different libraries; you would do different things. In some places you would not use a list, you would use a dictionary, things like that. The speed comparison between Python and Java for programs written exactly the same way does not have a good answer as a question; it really depends on the program a lot. Our JIT compiler is quite good: the best-case scenario is roughly as fast as Java. But you don't always hit the best case. If you write Python code like the example where the style class was defined inside a function, well, you wouldn't write that code in Java, right? Because you can't. Well, you can write it in Java, but it would not be the same after compilation, of course. Yes. If you take Java code and write exactly that code in Python, then PyPy will be at roughly the same speed, with not much difference. If you write code that does tuple(s.upper() for s in kwargs.keys()), which you can't really write that easily in Java, then it will probably be slower than the equivalent Java, where you actually have to iterate by hand. So it really depends on what sort of style you follow.
Python lets you do things that are expensive yet easy to write, and we try to optimize them, but we don't always do as good a job. If you write C, you never allocate memory casually, because it's a real pain to deal with: you have to remember where to free it, how to exit the function. In Python it's very easy to allocate memory all the time. I've routinely seen comparisons of PyPy and C where on the PyPy side they used a list and append, and on the C side they used a pre-allocated buffer of 1000 elements, because allocating a list that you have to resize is just too hard in C. But this is apples and oranges, again. Well, I hope that answers your question. Yeah, thanks.

Any other questions? Thank you for the talk. My question is: will type hints help PyPy? No. The short answer is no. The thing is, if you do JIT compilation, you know all the types anyway, because the types are whatever is actually there at runtime. The longer answer is that type hints are not precise enough for the sort of things PyPy does. PyPy does not only specialize on the class; it specializes on the shape of the class, for example on which attributes are present on this precise object. And that is something you cannot express with type hints. So type hints are not just unhelpful: you can't do with them the same things we do in the JIT. Whatever type hints would allow us to do, we already do, and we do more.

I have a question about VMProf, that's very interesting. Are you using it to determine where PyPy is relatively not that much better than CPython? You did a lot of switching back and forth between the two cases in your web page. Do you have an integrated view that tells you, in this pystone benchmark, for example, PyPy is particularly good in Proc1 but relatively bad in Proc3? Do you have that? Do you intend to have that? Because I would presume it helps you find test cases where PyPy can be improved for a given program. That's a very good point. I hadn't thought about it before, but it's probably something very, very useful: being able to compare not just two interpreters, but also, say, two different libraries, or two different setups for a function, and ask, OK, if I do this, what actually happens to the profile? That sounds like a good idea.

I think I checked about six months ago, and PyPy was incompatible with gevent. Has that changed, and is it going to change? Yeah, I think the new release of gevent will support PyPy; I mean, the trunk version already does.

Hello, just a quick question about VMProf: is it possible to customize the sampling rate? So yes, there is an option to customize the sampling rate. The problem is that Linux signals won't give you more than the system clock, which is around 300 Hz; I don't remember exactly, but it's around that number. You can't go higher without changing strategy completely, to something like threads, which we might need to do for Windows anyway. And does it increase the overhead linearly? Yes, obviously. If you sample at around 300 Hz, you have roughly 3 milliseconds between samples, and in that time you're doing the actual job as well as the sampling. Yeah, OK, thanks.

Hello, and again, thanks for the talk. I have a question about PyPy: you showed the example that making a class inside a function is bad.
It obviously is pretty bad on CPython as well, because of the overhead, but my question is: let's say there's some testing library that mocks classes. How do you deal with monkey-patched stuff? It depends on the setup a lot, but if you just mock things, do you mock them for each function? I mean, we are talking about code that's called millions of times. If you mock it for each call of a test function, how many test functions do you have? 500? That's definitely not a problem at all. Those test functions won't be JITted anyway, because of how rarely each one runs. No, what I mean is: you were saying your assumption is that the class is a constant, so obviously when I monkey-patch, I change this constant at runtime. Yes, but do you do it a million times, or do you do it a hundred times? I would do it several times per run. Then that's completely fine. It's a soft assumption, a likely assumption rather than a hard rule. It only hurts if you really have this sort of thing happening all the time, like in the example where make_style was called for every request and I was doing 10,000 of those. Thanks, that clarifies it for me. One more note: it's not that the code becomes incorrect. The Python semantics are always preserved. If you use this kind of overly dynamic style, maybe you get bad performance, but it still works.

Hi, I was just wondering: is there a linter or something that I can run that would tell me that the class inside the function is a bad idea? No. It would be nice to have, but no. Sorry? Yes. Yes, but... yes, let's do it. [laughter] This file is from Django, right? Django, yes, this is from Django. Django has code in it that says: if we're running on Pocket PC, don't do something. Do people try to run Django on their Pocket PC? That file is dealing with reading the terminal color palette or something that I'm not sure Django should be doing at all. So maybe it shouldn't do it on Pocket PC, I don't know, but in my opinion it shouldn't be doing it at all.

Are you planning to add support for Python 3.4 and 3.5 in PyPy any time soon? I didn't understand the question. Are you planning to have PyPy support Python 3.4 and 3.5? Yes, eventually. The problem is that there are not many PyPy developers working on it, so development is slow, but yes. More questions? No? Then thank you very much.