despite heat and the long lines. So I work for MongoDB, just kind of down the street near Times Square, and I work on a number of MongoDB database client libraries written in various programming languages, including Python. And a couple of years ago, I got an email that really frightened me and made me a little bit anxious about whether I was going to continue working for MongoDB if I wasn't able to come up with a really good answer to it. And that is the story for today.

So it all began when a blogger whom I don't know, named Vlad, published an article on a popular website called DZone. Vlad had run some benchmarks with both the Python driver for MongoDB, which is called PyMongo, and also the Node.js driver for MongoDB. And there was a lot of good news for us in Vlad's article. For one thing, he felt that he got pretty impressive performance: using an ordinary Windows desktop computer, he got almost 90,000 inserts per second. He thought this was pretty cool, so this is a positive article about our product. And the other good news was that on DZone, he got more than 20,000 views on this positive article.

So like I said, he had benchmarked the Node.js driver for MongoDB, and that was the headline figure that he got. And then he also benchmarked PyMongo. And this is where things start to take a turn for the worse. As an aside in his article, Vlad noted that the code that I work on is about a third as fast as my colleague's code in JavaScript.

I didn't actually know about this article at the time. The first I heard of it was when I got the following email from my boss. Steve, at the time, was my boss, and he's writing to me, and he's cc'ing Elliot, the CTO of MongoDB, one of the most frighteningly intelligent people, or just frightening, full stop, whom I've ever worked for. In retrospect, I was overly frightened of him.
Elliot's actually a really nice guy, but this terrified me. And especially the subject line, "Why is your thing slower than your colleague's thing?", really got my pulse racing. So I walk around the office, and I read the email, and it says, hey, why is the Node driver three times faster than yours? And here's the link to the article. I get up and take another walk around the office, and then my pulse goes down again, and I start to formulate a plan. First things first, I'm going to download Vlad's benchmark and run it myself, and I'm going to read through it and see if maybe I can spot some mistake that Vlad is making.

So let's do that. Let's take a look at Vlad's code. All right, straightforward: he imports PyMongo, the client library. He connects to his server. He gets a handle to a database within the server. He gets a handle to a collection within the database, and he drops it. All right, so he's getting set up for his benchmark. Nothing to see here. Let's move on.

Next, Vlad declares some variables. He says he's going to insert 80,000 documents, and he wants to insert them 5,000 at a time. So far, so good. That's a good way to use MongoDB; I would expect that to perform well. He writes down the start time, and he's going to stick documents into this batch list, which is empty right now.

So what's next? All right, what are we doing here? He imports datetime. Then he declares two dates, the first days of 2012 and 2013, and he calculates the number of seconds between them. So that's the number of seconds in 2012. I don't know why he's doing that; let's see what it's used for in a minute. Moving right along, he imports random for some reason. And now we're into the main loop: he's going to go for i from 0 to 80,000 and do something. What does he do? Ooh. OK, so this is a complicated statement. Let's read it from the bottom up.
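The date arithmetic described above boils down to something like this. This is a sketch with my own variable names, not Vlad's exact code:

```python
from datetime import datetime

# The two dates Vlad declares: the first days of 2012 and 2013.
min_date = datetime(2012, 1, 1)
max_date = datetime(2013, 1, 1)

# The number of seconds between them, i.e. the number of seconds in 2012.
delta = (max_date - min_date).total_seconds()

# 2012 was a leap year, so this is 366 days' worth of seconds.
print(delta)  # 31622400.0
```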
That bottom line there: he takes a random number and multiplies it by delta, which is the number of seconds in 2012. So he's got a random second within the span of 2012. Then he takes min date, the first day of 2012, transforms it into a Unix timestamp, and adds that random second to it. So now he's got a random timestamp somewhere within the year 2012. And he converts that back into a datetime, which, even more confusingly, he then calls date. All right, so we've got a datetime, it's called date, and it's a random second in 2012. Let's keep going.

OK, so he makes another random value; this is just some random floating-point number. He sticks them both into a dictionary. And what does he do with the dictionary? He appends it to his batch. And every time his batch reaches 5,000 documents, he inserts it into MongoDB and clears it. All right, so it's a little confusingly written, but I don't really see any red flags in this code that convince me it's his fault and not mine that this code runs at a third the speed of the Node code. In his final lines, he correctly calculates the duration of his benchmark and prints out what his throughput was.

I run this on this laptop, and I get about the same number that he did on his Windows desktop: 29k inserts per second in his case, 30k in mine. So far, so good. I have reproduced the thing that I am trying to diagnose. Now, another thought that I have is that maybe the Python workload is really different from whatever he's doing in Node. Maybe Node is faster because it's doing less, or it's doing something different. So let's take a look at his Node.js code. Actually, this is a Python conference; let's skip it. I read his Node.js code and verified that it appeared to be doing the same thing. So why is it three times faster? All right, how am I going to answer this question?
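Read bottom-up, the loop works out to roughly this. It's my reconstruction and naming, and the insert call is injected as a parameter so the sketch runs without a MongoDB server; in the real benchmark it would be PyMongo's bulk insert on the collection:

```python
import calendar
import random
from datetime import datetime, timezone

min_date = datetime(2012, 1, 1)
max_date = datetime(2013, 1, 1)
delta = (max_date - min_date).total_seconds()  # seconds in 2012

def run_benchmark_loop(insert, n_docs=80000, batch_size=5000):
    batch = []
    for _ in range(n_docs):
        # A random second within the span of 2012...
        offset = random.random() * delta
        # ...added to min_date expressed as a Unix timestamp...
        ts = calendar.timegm(min_date.timetuple()) + offset
        # ...and converted back into a datetime, confusingly named "date".
        date = datetime.fromtimestamp(ts, tz=timezone.utc)
        # Another random floating-point value, and both go into a dictionary.
        batch.append({"date": date, "value": random.random()})
        # Every 5,000 documents, insert the batch and clear it.
        if len(batch) == batch_size:
            insert(batch)
            batch = []

# Stand-in for the PyMongo bulk insert, so the sketch is self-contained.
inserted = []
run_benchmark_loop(inserted.extend, n_docs=10000)
```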
And I need to answer it well, because not only Steve but also Elliot is going to be reading my response. I really feel like my posterior is on the line here. The tool that I pull out now is a profiler. And why do I choose a profiler? Is it to tell me why this benchmark is slow? That's not really a rigorous question. I think that you really need to apply the scientific method to questions of optimization.

I think that optimization is a lot like debugging. Debugging is applying the scientific method to code: you generate hypotheses about why it's behaving in some unexpected way, and then you test your hypotheses by changing your code and seeing if it's fixed, or at least if it has a different bug. Optimization is also an application of the scientific method. You come up with a hypothesis, and when you're optimizing, your hypothesis is: I think that if I change my code in the following way, it's going to yield a worthwhile improvement. Now, profiling, well, we'll get to that in a second. You come up with a hypothesis. You look at something and think, well, if I cached this, or if I improved that algorithm, then my code would be faster. And it would be substantially faster; it would be worth the effort and probably the extra complexity of whatever optimization I'm going to attempt.

The way that you test the hypothesis is with a benchmark. You run the benchmark before and after the change, and you see if it's substantially faster. The benchmark, of course, needs to be in some way representative of a workload that you care about. People frequently run their unit test suites, change their code, and see if the unit test suite goes faster. That's a terrible idea, because unit test suites are typically the opposite of the use cases you care about. Your unit test suite is testing a ton of little edge cases. Your benchmark needs to hit the main functionality of your code, and that's it.
So you write a benchmark that actually represents something that you think you care about, or your users care about. You make your change, and you see if it's faster. How long do you keep doing this? You do it until your code is fast enough. So if at all possible, have a target in mind. Like: this program is the bottleneck in my overall system; I need to make it fast enough that it's no longer the bottleneck, and then I'm done, because after that I should optimize some other part. Otherwise it won't do any good.

Around 2008 or 2009, Amazon published a really incredible fact: for every 100 milliseconds of latency in their website, they lost 1% of sales. And even back then, can you imagine 1% of all Amazon online sales? Millions and millions of dollars. So they could put a value in dollars on performance, and they could decide: if this optimization costs us this much time or money or hardware, is it worth it or not?

Unfortunately for me, as an open-source library developer, it's not easy to put dollar values on performance, because PyMongo is incredibly popular. It's among the top 20 PyPI packages. There are tens of thousands of projects that use it, and I don't know what all those projects really care about. It's safe to say I can guess that certain functions, like insert, are widely used, that lots of people care about their performance, and that I should make them as fast as possible. But unfortunately, I can't say anything more rigorous than that.

So what's missing from this slide? I'm still not talking about profiling. The point of profiling is to generate hypotheses: to generate ideas for changes you can make to your code that might make it faster. People often mistakenly think that profiling will tell you what changes to make, or that it accurately measures the performance of different parts of your code. Unfortunately, this is not true, because profilers come with enormous amounts of overhead. Your code might run 20 or 100 times slower with the profiler on.
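The before-and-after loop described above can be sketched as a tiny harness. The workload function here is just a stand-in for whatever operation you actually care about:

```python
import time

def benchmark(workload, n_ops):
    """Run the workload once and return throughput in operations per second."""
    start = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - start
    return n_ops / elapsed

N = 100_000

def workload():
    # Stand-in workload: replace with something representative of real use.
    total = 0
    for i in range(N):
        total += i
    return total

# Run this before a change and again after, and compare the two numbers,
# always with the profiler turned off.
ops_per_sec = benchmark(workload, N)
print(f"{ops_per_sec:,.0f} ops/sec")
```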
And unfortunately, it isn't slowed down evenly. That means a change that might actually be worthwhile can make your code run even slower with the profiler on, yet be worthwhile once you run your benchmark without the profiler. So the profiler can tell you likely changes, hot functions that are good targets for optimization, but it doesn't tell you whether an optimization has worked. You have to test your hypotheses by re-running your benchmark.

So which profiler did I pull out? At this point, everybody expects cProfile, because it's in the standard library. But I pulled out yappi, which is Yet Another Python Profiler, something ending with an "i." It's by a very smart Turkish programmer named Sümer Cip, who is very bad at naming projects. Compared to cProfile, yappi has about the same amount of overhead, maybe even a little less. cProfile and yappi both give you function run times; they're not line by line. There's not yet, to my knowledge, a line profiler with low enough overhead to really be practical. yappi can tell you how much CPU time you're spending, as opposed to wall-clock time, which is an advantage over cProfile. And yappi can also measure all your threads concurrently, including threads that were spawned after you began profiling. So you can profile much more complex concurrent applications than with cProfile, which only profiles the thread that started it.

And the killer feature, which is actually the reason I turned to yappi under these circumstances, is that it can output to a profiling file format called Callgrind. You can visualize Callgrind output with a Linux program called KCachegrind. And the nice thing is that even though I was running my code on a Mac, I could output to Callgrind and then visualize it anywhere. So I started up a Linux VM, installed KCachegrind, and looked at my output there.
So we can contemplate this for a moment while I take a sip of water. What's not really surprising is that two thirds of the time is spent doing the insert. So two thirds of the time is in my code, either sending to or receiving from the database, or just waiting for the database. And that's sort of unfortunate: it means it might be my fault, at least to some extent, that this is slower.

But there's also a lot of other junk in here that is surprising. It may be hard to read, but what we're spending time in here is fromtimestamp, mktime, timetuple, and also round. I was pretty sure that none of these functions is called from PyMongo, so I searched through my code, and indeed, PyMongo doesn't call them. So it's all right there in that weird opening statement of the benchmark's for loop: one third of the time is spent doing this intricate juggling with random datetimes.

So what do I have now? I'm actually speaking very imprecisely when I say one third of the time is spent here. I don't really know that. I know that the profiler thinks that, but the profiler is extremely inaccurate, and unevenly inaccurate. All it's giving me is ideas about where my code is spending its time. So I've come up with a hypothesis, which is that if I change this code, the benchmark will run faster. But I haven't proven that by looking at the profiler output. In order to do that, I have to run an experiment.

So what's an easy experiment that I could do here? Maybe I could just set date equal to 1. Or I could set it to datetime.now(). I take a quick look at the standard library: now() is a pure C function, and it doesn't do very much. So I think this is going to be much, much faster than all the time we spent mucking around with datetimes and time tuples. Now I've got my hypothesis and I've got a test, so let's run it. I run Vlad's benchmark. And woo-hoo, this is looking pretty good. Before the change, I had my 30k inserts per second. After, I have 50k.
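As a rough illustration of why that substitution is such an easy experiment, here is a micro-comparison of the two ways of producing the date field. This is a sketch with my own naming; the exact speedup will vary by machine:

```python
import calendar
import random
import timeit
from datetime import datetime, timezone

min_date = datetime(2012, 1, 1)
delta = 366 * 24 * 3600  # seconds in 2012 (a leap year)

def original():
    # The benchmark's juggling: random second -> Unix timestamp -> datetime.
    offset = random.random() * delta
    ts = calendar.timegm(min_date.timetuple()) + offset
    return datetime.fromtimestamp(ts, tz=timezone.utc)

def simplified():
    # The experiment: now() is a small C function that does very little.
    return datetime.now()

slow = timeit.timeit(original, number=100_000)
fast = timeit.timeit(simplified, number=100_000)
print(f"date juggling: {slow:.3f}s  datetime.now(): {fast:.3f}s")
```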
So we're already starting to really close the gap between my code and my JavaScript colleague's code, just by fixing up an inefficiency in Vlad's benchmark. And no insult to Vlad: he said right at the top, I'm not an experienced Python programmer. He was doing something that looked totally reasonable, just not in Python. Plus, he was using an outdated version of PyMongo, so I also decided to upgrade. At the time, PyMongo 3.0 was the latest, and with PyMongo 3 in place, we're closing the gap further: Node was at 88k, and now we're at 59k inserts per second.

So I'm starting to feel pretty good at this point. I'm starting to compose a reply to Steve and Elliot saying it's not my fault. And anticipating their likely final question, I also try running the whole thing through PyPy, which is a JIT compiler, and therefore somewhat more comparable in its execution model to Node itself. With PyPy in place, we're at 73k inserts per second on my laptop. And if we think that my laptop performs about the same as Vlad's desktop, then we've really covered about 80% of the ground. It's not my fault; I get to keep my job. I wrote this up to Elliot and Steve, it felt really good, and I turned back to trying to remember whatever the hell I was working on right before I got this scary email.

So I'm performing much better than I expected; I really optimized this talk. This is the whole point: use a profiler not as evidence, but as a way to generate hypotheses, a way to look for places to think up likely optimizations. But once you've got those ideas, you need to come up with a rigorous test and apply that test with the profiler turned off. In this case, I also used the profiler a little more deeply: I used it to estimate the likely improvement. So it's not simply a hypothesis of "I think this is a target for substantial improvement"; it's "I think this is a target for a 30% improvement."
And as you can see in this case, that was actually incorrect. The improvement that I got was much better than the profiler had guessed I would get. And that's all I got. So I'm happy to talk to you more about profiling. I'll be around the conference all day. And enjoy the rest of the conference. Thank you very much.