Shall I start off? We're going to be talking about processing data faster using Python. Not that Python is particularly slow, but there are a number of things one can do to make it a bit faster, and what I'm going to be talking about is a few of those techniques, which will hopefully marginally improve your code. The reason we discovered these techniques was a piece of work we did fairly recently on the elections side.

Before I go ahead: you will find the code for this entire talk, which is a bunch of IPython notebook slides, on GitHub under my account, sanand0. I've also scheduled a tweet with this link for 12:50 on my Twitter account, so either way you should be getting the contents of it.

The story actually started when a huge touchscreen monitor, a 92-inch monitor, was installed at the CNN-IBN office. You would probably have caught this when Bhupendra Chaubey and Rajdeep Sardesai put together the CNN-IBN and Microsoft Elections Analytics Center. We created the software behind it that was doing all of these visualizations. For a period of one month, people were walking through the history of all previous elections, explaining what exactly was happening in each constituency, and during the live elections, how the results were moving and so on. Some of you might be able to see a bearded figure in a red t-shirt at the top right. That's me, making my first television appearance. I was obviously pretending to watch the monitors very carefully, but what I was really doing was calling home and saying: turn to CNN-IBN and watch me. It was a pretty interesting month.

During that month we were playing around with the data, for example what was happening in each assembly election. One of the things we looked at was which election had the largest number of candidates contesting. Our guess was that there are probably 100 candidates at most in an election, but even 100 candidates is a fairly large list, because when you look at the ballot sheet you'd have to flip through a few pages to find your candidate.

This, for instance, shows the assembly constituencies of Tamil Nadu in 1967. Each circle is one constituency: this, for example, is Vandavasi, this is Nilipakkam, and so on. The size of each circle shows how many candidates stood for election there. A large circle like Perambalur means there were as many as 10 candidates standing for election, as opposed to somewhere smaller like Tindivanam, where there were only two. The circles are coloured by which party won: yellow is DMK, red is CPM, and so on. In 1967 there weren't a large number of candidates in any constituency. In 1971 the picture hasn't changed much. In 1977 there's a sudden spurt of people standing for election. That's mainly because of the introduction of a new party, the ADMK, which came in, and therefore in practically every constituency you had at least one more candidate. A lot of independents came in as well.
A few elections later, in 1984, there's one constituency, Madurantakam, where there were as many as 90 candidates standing for election. That's a pretty big number. In 1989 there's a sudden spurt across the entire state, but nothing stands out. In 1991, however, two places stand out: there's Pallipet with 264 candidates and Avarakrichi with 249. At this point you have a ballot booklet, not a ballot sheet; you have to flip through it to find your candidate. All of this pales in comparison, though, with Modakurichi, which had 1,033 candidates standing for election in 1996. So this is a ballot book; in fact, this is probably the telephone directory of Modakurichi. If you look at the details, it's mostly independents, obviously, since a party cannot field more than one candidate. But look at the names here: there's Palanisami D, Palanisami K, Palanisami K, Palanisami K... how do you figure out which Palanisami K you are if you want to vote for yourself? Which seems to have been a problem, because there were as many as 88 candidates that did not vote even for themselves. They got 0 votes, probably because they couldn't find their own names.

Now, when we're generating this, what we're doing as you click on each constituency is quickly extracting the list of people from that constituency, sorting it in descending order of votes, and displaying it. Let's walk through how one goes about optimizing that. What I'm going to do is walk through various problems we faced during the creation of this visual and a few others, taking different examples, and talk about how we went about optimizing them.

Let's take this simple problem: I want to see all the people that contested in Bangalore South, sorted in descending order of votes. The data looks like this. I'm sorry if some of you at the back can't see it, but it's a simple CSV file that has the state name, the year, the constituency, the candidate's name, the number of votes and so on. It's one massive file with about 4 lakh rows, which has the details of every assembly election that has been contested, and I'll come to how we scraped it briefly later.

Now, how does one go about extracting the top candidates, the ones that got the largest number of votes, in Bangalore South, historically? This is a standard interview question for us, and the standard interview answer we get looks like this. (I'll make this a little bigger so that hopefully you'll be able to see it; the code is also in the GitHub repository.) What people do is loop through each line: they open the assembly results file and loop through each line, skipping the first row, which has the header. Then they split each line on commas and check whether Bangalore South, or whatever constituency we're looking for, is in the right column. If so, they take the name and the number of votes and put them into a list of tuples, then import operator and sort by column number one, which is effectively the second column. I'm putting this piece of code together because it's sort of a composite of the most frequent answers we get, and I have a very strong suspicion that the import operator is the result of a Stack Overflow search for something like "how do I sort by the nth column", copy-pasted from the answer. But copy-paste works beautifully: this gives us the right result.
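A minimal sketch of that composite answer, for reference. The column positions and file name here are assumptions for illustration, not the talk's exact code:

```python
import operator

def most_votes(ac_name, filename='assembly_results.csv'):
    results = []
    for row, line in enumerate(open(filename)):
        if row < 1:
            continue                      # skip the header row: checked on every line
        cells = line.split(',')           # split every line, whether it matches or not
        if cells[2] == ac_name:           # constituency column (position assumed)
            results.append((cells[3], int(cells[4])))
    # the Stack-Overflow-style sort: by element 1 of each (name, votes) tuple
    return sorted(results, key=operator.itemgetter(1), reverse=True)
```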
The question, though, is how long does it take? The trick to optimization is to first figure out whether you really want to do it, and the first step of that is to time it and see if it's acceptable. All of this presentation is an IPython notebook, and in IPython you just say %timeit and run the function you want to time. In this particular case I've created a function called most_votes, which is exactly what you saw on the previous page, and it's taking 472 milliseconds. Half a second. Is that good enough? Quite possibly. That is a question you want to ask every time. I love quoting Calvin on this: when Calvin has homework that is quite tough, he says, I always like to break it up into pieces. So Hobbes asks, does that mean you're going to start with the first chapter? No, first I ask myself: do I even care?

So, do I even care about optimizing it? At half a second it may be fast enough; in many cases it is. In our particular use case it wasn't, so we ended up making it a bit faster. But never forget that first step: is it even worth optimizing? Premature optimization is a big evil. Try not to do that.

Let's see how we can go about optimizing this particular function. The first step is to figure out what is slow: let's find the slowest part in this piece of code. There is this lovely module called line_profiler, which you can pip install, and it tells you how long each line in a function takes. Specifically, in the IPython notebook, you say %load_ext line_profiler. It takes a bit of setup; it's not trivial, but it's not complicated either, and I've got a link where you can see how to set it up. Once you do that, you say %lprun, which is an IPython notebook magic. Let's see what it does. It runs for a while, and finally at the bottom it gives the result of the computation, which is a little difficult to see here, so I'm going to switch back to the presentation. At the bottom you see the results: each line of the function is listed, and against each line you have the total number of hits, which is the number of times that line was called, the total time taken, the per-hit time, and the percentage of total time.

I usually only look at the percentage of total time; I'm simply interested in optimizing the slowest step. And there are a bunch of lines here, five of them, which seem to be taking anywhere between about 12 percent and 30 percent. So somewhere here is what I need to optimize. I will not bother optimizing anything that already takes zero time, obviously. Which means we have already cut down the zone of search to what specifically needs to be optimized.

Now, there's a fairly simple procedure by which one goes about optimizing, which is basically about picking your battles. First, if you see something that is obviously redundant, get rid of it; there's no point having it. Second, look at what takes the most time. To reduce something that takes a lot of time, there are two ways: one, reduce the number of times it is called; or two, reduce the amount of time each call takes.
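In notebook form, the timing-and-profiling workflow above looks roughly like this (the magics are real; the numbers shown are the ones quoted above):

```python
%timeit most_votes('Bangalore South')
# -> ~472 ms per loop

# pip install line_profiler, then:
%load_ext line_profiler
%lprun -f most_votes most_votes('Bangalore South')
# prints Hits, Time, Per Hit and % Time for every line of most_votes;
# focus only on the lines with the highest % Time
```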
Now we're going to apply these principles in sequence to this piece of code and see if we can get it maybe 20-30 percent faster, or however much faster we can.

Let's start by eliminating what is obviously redundant. I'll go back to the code and zoom in a bit. One of the things we're doing is checking, on every line, whether it is the first line or not, and if it's not the first line, moving on. This particular check, if row is greater than one, is being called about four lakh times, which is why you can see here that it's taking approximately 15 percent of the time. Let's get rid of it, and that's fairly easy: as soon as we open the file, skip the first line, and thereafter there's no need for the check. It just moves outside the loop. If something doesn't really need to be done multiple times, do it once and be done with it. That has given us a 9.5 percent increase in speed, which is small, nothing great to start with. Let's move on.

Now let's see, having done this, what takes the most time. We run %lprun again, and there are three lines that take a reasonable bit of time. The bulk of the time is being taken up by line.split: we're splitting on commas, and that takes up about 45 percent of the time. So that's worth optimizing. There are two ways of optimizing, like I said: reduce the number of times it gets called, or reduce the amount of time it takes. Let's start by reducing the number of times it's called. It's being called four lakh times. Now, what are we doing here? We only want to extract the people from Bangalore South; I'm not interested in anything else. So if I can check whether a line belongs to Bangalore South without splitting on commas, that does my job. I only need to split on commas when the line is for Bangalore South, to get the number of votes. So why don't we move the splitting after the check? What we do is first check if there is a match: if you find the assembly constituency name anywhere in the line, only then do you split. This makes the code 83 percent faster. Now we're talking. The reason this has had such an effect is that the split was already taking about 45 percent of the time, and we've drastically reduced the number of times it gets called, from approximately 4 lakh times to just 36 times, because only 36 people ever contested in the Bangalore South constituency elections.

So where is the bulk of the time going now? Into "if line.find(ac_name) >= 0": in other words, checking whether the string exists in the line. Can we call this any fewer times? Probably not. There may be smart ways of getting around it, but nothing obvious strikes me. So let's go for the second technique: is there a way of making something as simple as a string find faster?
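For reference, here is roughly what the loop looks like after both changes (a sketch; column positions are assumed as before):

```python
def most_votes_faster(ac_name, filename='assembly_results.csv'):
    results = []
    lines = open(filename)
    next(lines)                           # skip the header once, not 4 lakh times
    for line in lines:
        if line.find(ac_name) >= 0:       # cheap substring test on every line
            cells = line.split(',')       # expensive split on ~36 matching lines only
            results.append((cells[3], int(cells[4])))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```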
Let's do a bit of disassembly and see exactly what goes on behind this function. There is an interesting module called dis, a disassembler, and when you run it on any piece of code it tells you the exact Python bytecode that runs behind it. So we created a function called check. What this check function does is simply take a line, find the assembly constituency name in it, check whether the result is greater than or equal to 0, and return that value. You import dis and disassemble the function's func_code, and you get the following result.

Before I show you the result, I think this func_code is worth spending a few seconds on. You would be surprised at the level of access you have to a function's internals. In fact, let me show you what this looks like. Let's take a function, say most_votes, and look at its attributes. Every function has a func_closure, func_code, func_defaults and so on, and this func_code is somewhat interesting. But leaving that aside: you can get the function's name from the function object itself, via func_name. You can figure out which variables it uses as globals, what context it is being evaluated in, how many arguments it takes, and so on. Which lends itself to a bunch of interesting use cases. Suppose your function is passed another function, and you don't know whether that function takes one argument or two or three: you can figure that out. Depending on the type of function that comes in, you can inspect whether it returns a value, whether it is part of some module, whether it has access to a certain global variable, and perform different operations based on that, which gives you fairly powerful polymorphism. But this talk is not about that. I just want to mention that func_code is funky, and move on.

Now, what this bytecode does is: first it loads a variable called line, then gets the attribute find of that line, which is a string, then loads a variable called ac_name, and then calls a function, which is effectively calling line.find on ac_name. Then it loads the constant zero, compares the result against zero, and returns the value. That's a reasonable number of steps. Can we reduce the number of steps happening here? Well, yes, we can. All we have to do is say "ac_name in line", which does exactly the same thing. It is effectively the same operation as line.find, but when you disassemble it, there's a lot less bytecode. And if you look at how long it takes, this ends up being about 61% faster than the previous piece of code.

So the point is, there's probably a way of getting stuff to run faster in almost every case; it's largely a matter of exploring. But we've learned at least a couple of principles here. One, find what takes the most time and focus your effort on that; there's no point focusing your effort on anything else. A reasonably good way of doing that is the line profiler; there are many other profilers, and they're worth exploring. And secondly, when you're stuck optimizing a relatively small piece of code, the disassembler can come in handy, if only to let you see the internals of that function and perhaps think of alternatives.
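The comparison looks like this. The bytecode shown in the comments is CPython 2.x-era; newer versions differ in detail, but the point stands:

```python
import dis

def check_find(line, ac_name):
    return line.find(ac_name) >= 0

def check_in(line, ac_name):
    return ac_name in line

dis.dis(check_find)
# LOAD_FAST line; LOAD_ATTR find; LOAD_FAST ac_name;
# CALL_FUNCTION 1; LOAD_CONST 0; COMPARE_OP >=; RETURN_VALUE

dis.dis(check_in)
# LOAD_FAST ac_name; LOAD_FAST line; COMPARE_OP in; RETURN_VALUE
```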
In general, a rule of thumb is that operators are faster than functions in Python; not necessarily in many other languages. Just keep that in mind. So overall, we started off with something that took about half a second, and by removing a redundant line, reducing the number of times the split function is called, and then using an operator instead of a function, we've gotten to a 3.2-times speedup, which is about 320 percent. Not great, but not bad either.

Now, the other question that will come to mind is: look, you're doing all of this in Python. If you were to do this in C, it would be so much faster. Why don't you use a library like pandas or NumPy, which is blazing fast? Let's do that. This is the code in pandas: we read the CSV file, check the assembly name column to see if it contains the constituency name, and sort by descending number of votes. This takes 1.1 seconds; in other words, several times slower than the piece of code we had. What's going wrong here? It's not that the pandas code is slow; pandas is inherently very fast, and it's not that this has been implemented badly; this is about as fast as you can write it. The thing is, a library like pandas encapsulates a lot of things. What we did was never actually load the CSV file into memory: we just looped through the lines, and we did not even split a line on commas unless absolutely required. With that kind of optimization, even a library written in C is not going to be able to compete. Remember: libraries are good; your brain is better. If you focus on optimizing, you can get to something faster than any library. Reject dogmas; measure. If it's faster, use it.

Which raises the question: if I do want the fastest format, what should it be? I've been playing with a lot of structured and semi-structured data, at least in file formats, and here's the summary of the results. If you're loading CSV files using csv.DictReader, it's a little slow. pickle.load is a bit faster if you store in pickle format, but that's not necessarily as portable as you might like. JSON is a bit faster than CSV, depending on how you store it: you can store tabular data either as an array of dictionaries or as an array of arrays, and as an array of arrays it's quite a bit faster. A CSV file can be read faster with plain csv.reader; csv.DictReader imposes a fairly heavy penalty, even though it makes reading the data easier later. So think carefully about it. But if you're reading CSV files, there's no doubt that pandas is the fastest; it's fairly heavily optimized. So CSV is a pretty good format. The format that beats them all, though, is HDF5. It was built for speed. If you have large-scale tabular data and you do want to optimize, don't bother looking further than HDF5. If you want portability, your tradeoffs are largely between CSV, JSON and, heaven forbid, XML, and the choice there is reasonably obvious: go for CSV. It's faster than most other formats. Have I tried using csv.reader with a namedtuple? Yes, I have, and that is faster than csv.DictReader.
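The pandas version, for comparison. Column names here are assumptions; also, the sort method of that era was .sort(), spelled sort_values() in current pandas:

```python
import pandas as pd

def most_votes_pandas(ac_name, filename='assembly_results.csv'):
    data = pd.read_csv(filename)                        # parses all 4 lakh rows up front
    rows = data[data['AC_NAME'].str.contains(ac_name)]  # filter to the constituency
    return rows.sort_values('VOTES', ascending=False)   # descending votes

# For repeated loads, HDF5 beats re-parsing the CSV every time:
# data.to_hdf('results.h5', 'results')
# data = pd.read_hdf('results.h5', 'results')
```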
Having said that, I've always gone for pandas, because it's just faster than anything else anyway, and I'm switching over to HDF5 quite actively.

Let's go on to a different topic, which is scraping the data. We were pulling this data from the ECI website, and the main problem was not that the calculations were running slow; we were just parsing the HTML. It was not even that the network was slow. The main problem was that the network was unreliable: it would fail in the middle, and we'd have to start the whole thing again. Which means that making a program restartable is highly relevant, and not just for scraping. In any case where you have computations that take a long time, if for any reason the computation or the program itself has to stop, or you want to split it across multiple systems, you want it to be restartable. Let's take one simple example of how to make something restartable.

Typically, in scraping, you have a list of URLs; you loop through all of them and parse each one. That's broadly it, and it's fetching each URL that takes the bulk of the time: not because the parsing computation is slow, but because the network request is slow. So if I wanted to cache this, the way we'd do it is write a function, say get(url), which stores the contents in a file. If the file does not exist, it retrieves the URL and saves it there; if the file does exist, it just opens it and returns the handle. Which means that the first time I run this piece of code, it takes about four and a half seconds, but the second time, only about 339 microseconds. That's more than a thousand times faster. And you say: yeah, obviously, the network is slow and you're not doing much computation anyway. Fair point. But given that we're caching this, we can't cache every URL into the same file. Each URL needs its own file, so I need a mapping between URLs and file names. One way is to say: let the URL be the file name. That runs into a problem with special characters: question marks, colons, all of those. So you say, fine, let me strip out the question marks and colons, which works fine until you realize that sometimes you actually need them: ?x=1 is different from /x1, and by stripping those characters you lose the distinction. So then you say, fine, let me use a hashing function, and Python is filled with cryptographic hashes; the hashlib library has a whole bunch of them. Fine, let's evaluate them for speed. If you go through the algorithms in the hashlib module, here's how long each takes to hash 1 million occurrences of a string: md5 takes about 1.9 seconds, all the way up to 2.83 seconds for sha512. Notice, firstly, that it's a relatively narrow range; it doesn't really matter too much which hashing function you use. On the other hand, it's not too slow either: about 2 seconds for 1 million hashes is kind of fine.
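A sketch of the caching fetch just described, with a hash-based file name. The cache layout is an assumption, not the talk's exact code; md5 is used here, and the cheaper built-in hash() comes next:

```python
import os
import hashlib
try:
    from urllib.request import urlretrieve    # Python 3
except ImportError:
    from urllib import urlretrieve            # Python 2

def get(url, cache_dir='.cache'):
    """Fetch url once; serve the cached local copy on every later call."""
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)
    # Name the cache file after a hash of the URL, dodging special characters
    filename = os.path.join(cache_dir, hashlib.md5(url.encode()).hexdigest())
    if not os.path.exists(filename):
        urlretrieve(url, filename)             # slow: one network round trip
    return open(filename)                      # fast: a local disk read
```

Because already-fetched pages are skipped, re-running the scraper after a network failure resumes where it left off, which is the restartability point above.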
Normally, I'd just leave it at that. But then I noticed: why are we using a cryptographic hash to generate a unique file name at all? The purpose of a cryptographic hash is to slow attackers down. You don't want it to be fast; you want it to be as slow as possible, and that's how they're designed. We want the exact opposite. We just want a regular, non-cryptographic hash, and it turns out that Python has a built-in function called hash that does exactly that. It takes any object that can be a dictionary key, and the result is a long which you can straight away use as a file name, and it is over 10 times as fast as any of these. So my standard workflow for saving a file is to simply compute its hash, convert from a signed long to an unsigned long, and use that as the file name, or some variant thereof. Not because the hashing was a slow step, but because there's no point using a cryptographic hash when you don't need one.

Moving on to another topic. One of the visuals we created had the full history of India's assembly results. What you see here is, for every state, the party that was in power and for how long. For example, Congress was in power in Himachal Pradesh for a fairly long time, then it switched over to the Janata Party, then Congress again, then the BJP, and so on. To get this data, we had to parse the dates of the elections, and the dates were in every row of the data. This turned out to be a surprisingly slow step.

To give you an example, I'm going to create some sample random dates here. Here's a file we've created that contains dates, potentially in the formats year-month-day, day-month-year (where the month could be alphabetical or numeric), day space month space year, and so on. Typically you have multiple date formats. Why? I'll tell you why: somebody in the US creates the file in the month-day format and sends it to someone in India; when they open it in Excel in India, the date format switches, so half the dates get converted to day-month-year while the other half remain month-day-year, and it's a huge mess. If the day is less than or equal to 12, you're in trouble: you never really know whether it's the day-month-year format or the month-day-year format. You just open it and pray for the best. Which is why it's good to stick to an alphabetical month if possible, or go for something like year-month-day. But really, the data you get is never clean, so you have to handle multiple date formats.

Now let's see how long it takes to parse one lakh dates like this. On my system, it takes about 6 seconds to go through 100,000 dates and parse them. That's slow; and this is only a quarter of the file, so it's 24 seconds just to load the full file. No way. What do you think the problem is? Parsing seems the logical suspect; let's test how big a problem it is. If you run %lprun on this, 99.3% of the time is spent in parsing, so there is absolutely no doubt that that's the only thing to optimize. But do not forget this step: no matter how reasonably sure you are, you could be wrong. In this particular case we're fine, so we'll take up the optimization of parse. Let's look at other ways of parsing.
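As a concrete sketch of the kind of baseline being measured here; the format list and helper are assumptions, not the talk's exact code:

```python
from datetime import datetime

FORMATS = ('%Y-%m-%d', '%d %b %Y', '%m/%d/%Y')    # assumed mix of formats

def parse_date(text):
    """Try each known format until one fits: flexible, but slow per call."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            pass
    raise ValueError('unparseable date: %r' % text)

def load_dates(filename='dates.txt'):
    return [parse_date(line.strip()) for line in open(filename)]

# %timeit load_dates()               -> ~6 s for one lakh rows
# %lprun -f load_dates load_dates()  -> ~99% of the time inside the parse
```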
One possibility is to say: maybe the parsing is slow because I'm calling the function so many times; maybe the function call is the slow part. Let's go for pandas or NumPy: treat it as a vector, call the function once, and it does the job. pandas has a to_datetime function, and you load this as a series. In this case, what I've done is taken just one date, the 31st of January 2012, repeated it one lakh times, and said: convert it to datetime. That takes 6 seconds. That's even slower than before; we were at 5.4 seconds. Though I keep giving examples where pandas is slow, I by no means believe that's the norm. It's actually a bloody fast library, and I'll show you more examples later on.

Let's try something else: dateutil. dateutil is a pretty nifty library that parses various kinds of date formats, so you don't have to worry about whether it's day-month-year or month-day-year or whatever, and you can loop through with a list comprehension and get all the results. That's about 5.25 seconds. Slightly faster than before, but not good enough.

What if I say: look, I know which format this is, say month-day-year. Why don't I just split the string myself, or better, use strptime? datetime.strptime is written in C, so it ought to be pretty fast, and it is: it takes only 1.78 seconds. You can go even further and say: I won't take the overhead of strptime either. I know the format, so I'll take the year, which is characters 6 to 10, the month, which is characters 0 to 2, and the day, which is characters 3 to 5. That brings it down to a whopping 236 milliseconds. Which means we have traded off flexibility for speed: from 6 seconds down to about 200 milliseconds. Great, but I can no longer parse all of those mixed formats. Is there a way to break that compromise?

Well, let's recognize that the only reason this is a problem is that the parse is being called so many times. What if I could cache it, meaning once I've parsed a date, I never need to re-parse it? What if we have a lookup function like this? Sorry if you can't see it from the back; I'll read it out. You take all the unique dates, loop through only the unique dates, and build a dictionary: for each unique date, parse it and store the result. Then you go through all of the dates and look them up in this dictionary. Dictionary lookups are far faster than date parsing. This takes 13.9 milliseconds, down from 6 seconds, on the same data set, and even when the dates are more varied it's still dramatically fast. The thing is, you also need to understand the nature of your data: if there's a lot of repetition, there's a fair bit of potential for caching. You cannot blindly apply what is normally a fast function to data with completely different characteristics and expect it to still be fast, or assume that it cannot be made faster.
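The lookup function described above, as a sketch; dateutil is what handles the mixed formats here:

```python
from dateutil.parser import parse

def lookup_dates(dates):
    """Parse each distinct date string once; reuse it via dictionary lookup."""
    parsed = {date: parse(date) for date in set(dates)}
    return [parsed[date] for date in dates]

# And when the format is known and fixed, slicing beats everything:
# year, month, day = int(s[6:10]), int(s[0:2]), int(s[3:5])   # 'MM/DD/YYYY'
```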
Speaking of which: I started using list comprehensions in the last couple of examples. Let's look at how fast loops are, and whether they can be replaced by list comprehensions. Take the simple task of squaring all the values in an array. I've defined a function square that simply squares a value, and here we loop through all the values in a data set, compute the square, and append it to a list, which we return. This takes 28 milliseconds.

Now, there are two things we're doing here: calling a function, square, and looping with a for loop. Let's see what happens if you inline the function, basically removing the function call. All we've done is take value times value, which was originally in a function, and put it inline. That gives us a 70% improvement. In Python, functions are slow. Custom functions, that is, not necessarily C functions. Calling a function incurs an overhead, and calling a function many times, like inside a for loop, is slow. Try to inline functions if you can. That comes at a loss of flexibility, and we'll talk about how to break that compromise too. But if you're programming in pure Python, avoid pure Python functions in hot code if you can.

Now let's take loops. If instead of a for loop I use a list comprehension, value times value for value in data, then firstly the code is smaller, and potentially more readable; all those benefits exist. But it's also 90% faster. For loops incur an overhead too. Just as functions are ideally replaced with operators, loops are ideally replaced with list comprehensions, and this applies to most programs.

There's another way of applying this, which is thinking in vectors. A library like NumPy, which is really what pandas is built on, lets you do the following: define the data as a NumPy array or a pandas Series, and then square it by just saying data times data. One straight operation. This takes 420 microseconds. That's about 16 times (not 16 percent, 16 times) faster than the code we had. So clearly there is a huge benefit to using these libraries. They're statically typed, and they use C internally. If possible, push your data into a NumPy array and do the processing there, rather than ever writing a loop. Loops are slow; functions are slow. This can be applied in a wide variety of cases.

I'm going to skip the next example and take another one. One problem we often face is: I have an array of numbers, and I want to see how many of them lie between some two values a and b, where a and b are arbitrary. One way is to loop through all the values in the list, and if a value is in the range a to b, increment a count. That takes 284 milliseconds, so reasonably slow. Let's use vector operations. If the values are in a vector and I take all the values greater than or equal to 0.25 and all the values less than or equal to 0.75, or whatever my range is, and sum the resulting mask, that one-line expression is 7629% faster; in other words, 76 times faster. A total no-brainer. When you get a 76-times speedup from something like this, you'd say that's crazy, and now we're at 3.67 milliseconds per loop, which is perfectly acceptable. You'd normally stop there, and I would advise that as well. But do not ever make the mistake of assuming that is the fastest.
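The variants from this section, side by side, as a sketch (data size assumed):

```python
import numpy as np

data = list(np.random.rand(1000000))

def square_all(data):
    result = []
    for value in data:
        result.append(value * value)    # inlined: no per-element function call
    return result

squares = [value * value for value in data]   # list comprehension: faster still

arr = np.array(data)
squares_vec = arr * arr                       # one vectorized operation: ~16x faster

# Counting values in a range with a boolean mask, summed:
count = ((arr >= 0.25) & (arr <= 0.75)).sum()
```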
And remember: if I sorted the array, it becomes very easy to find how many numbers lie between a and b. You do a binary search for where a occurs, a binary search for where b occurs, and take the difference, and this can be done in a vectorized way using a function called numpy.searchsorted. So you sort the values, and on the sorted result you just search for where 0.25 would occur and where 0.75 would occur, and you have the answer. That takes 5 microseconds; in other words, 73,332 percent faster than the previous one. What was already 76 times faster has been made faster by a further factor of 733. I can't do the math that fast, but we're a few tens of thousands of times faster than where we started. The best optimizer is not the library; it is what resides between your ears. A better algorithm invariably wins over a more optimized execution.

Having said that, it's useful to see how one can go about writing these things in C. Now, if you're like me, you would rather not have to learn, or relearn, C. Python is fairly easy to learn, and you'd rather keep things that way. So the question is: can I write Python code and somehow get the benefits of C? The answer is yes, and there are many ways of doing it. Cython is one such project: it lets you, in effect, compile to C. You have to install Cython, and there are many ways of doing that, but the easiest is to pick a scientific Python distribution like Anaconda, Enthought Canopy or Python(x,y); that way you automatically get a C compiler on any platform, including Windows, which is what I'm running it on, and I have my reasons for that. Then, in an IPython notebook, if you load the cythonmagic extension, any cell that has %%cython gets compiled into C directly.

What I've done here is take a function that adds up the numbers from 0 to n-1: you loop from 0 to n-1, increment a counter, and return the value. A simple sum function, defined in plain Python. Then I've taken the same thing, changed only the function name, put it in a %%cython cell, and said a is defined as an integer, i is defined as an integer: just static typing on top. Let's compare the two: 6.6 milliseconds versus 68.4 microseconds. Now we're talking: 100 times faster. Same code, same algorithm; we've just added static typing. And this is something you could do to every single one of your functions, sort of like free optimization. You do need to know what the types are, though, and it does not work universally. Take the range-count function from earlier, where in an array of values you're looking for how many lie between a and b. Here we define this as an integer and that as a float, and unfortunately the function only gets approximately twice as fast: you come down from about 279 to 147 milliseconds. Twice as fast is still good, but not the 100x we saw earlier.

Let's take another project: Numba. I don't know how to pronounce it; I'm going to say Numba. This is better for a couple of reasons. The first is that you don't need to worry about the types; it infers them automatically. You just install it and say from numba.decorators import jit, where jit stands for just-in-time compilation, and then put @jit on any function. So I'm taking the same total function, identical to the Python code, not modified in any way other than the function name. Now let's time it. Python is in milliseconds. Cython is in microseconds. Numba is in nanoseconds. I'm not even going to bother doing the computation, but it's more than a few tens of thousands of times faster than the original Python code.
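Sketches of the three pieces just discussed: the searchsorted count, the statically typed Cython cell, and the Numba version. The magics and imports are as in that era's releases; newer versions spell some of them differently:

```python
import numpy as np

# 1. The binary-search count: sort once, then two O(log n) searches
values = np.sort(np.random.rand(1000000))
count = (np.searchsorted(values, 0.75, side='right')
         - np.searchsorted(values, 0.25, side='left'))

# 2. Cython: load the magic, then compile a notebook cell to C
#    %load_ext cythonmagic        (current IPython: %load_ext Cython)
#    %%cython
#    def total_cy(int n):
#        cdef int a = 0, i        # static types are the only change
#        for i in range(n):
#            a += i
#        return a

# 3. Numba: no type declarations needed at all
from numba import jit    # older releases: from numba.decorators import jit

@jit
def total(n):
    a = 0
    for i in range(n):   # Numba infers the types and compiles this to machine code
        a += i
    return a
```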
With Cython you would normally have stopped; this is no longer the slowest piece of my code, you'd move on. But this is sort of a free lunch. Take it. Let's look at how Numba does on the other piece of code, the one Cython struggled with. Here we have the count function, exactly the same code as in Python but with @jit prefixed, and it's gone from 283 milliseconds to 2.91 milliseconds, almost a factor of 100. So Numba doesn't need you to know the types; it figures them out by itself, and it's faster as well. Increasingly, it's replacing Cython as the way to optimize code.

What have we seen so far? Here's a summary. Firstly, check whether you really need to optimize something. If it's fast enough, don't bother; stop there. If it's not fast enough, find out what the slowest step is and focus only on that. At any point, optimize only one piece of code. And in order to do this, you need solid unit tests, because without them no refactoring is safe: you never know when you've broken the functionality.

There are two ways of reducing the total time something takes. One is to reduce the number of times a piece of code executes: reduce the number of hits. This is the better way of optimizing, because if you get it down to zero calls you're done with that code entirely. There are many ways of doing it apart from algorithmic changes. One, if you can cache the result and memory is not a constraint, that gives you one boost of speed. Two, make sure operations execute only when required, like when we moved the split inside the if condition: we split only if the constituency was Bangalore South. Minimize the number of computations you do.

The other way is to make the slow operations themselves faster. Again, many options: see if you can replace functions with operators; see if you can replace for loops with list comprehensions; try Numba or Cython if you want something C-based. Most importantly, see if you can change the algorithm. That has the second biggest impact of any method. The biggest impact, obviously, comes from eliminating the code completely. Remember, code is a liability; it's the functionality that is the asset. If you can reduce the lines of code, or eliminate the code entirely, and still achieve the same functionality, that's great: effectively infinite productivity, or something like that. Get rid of as much of the code as you can. That in itself reduces the number of opcodes, and therefore the total time.

Hope that was helpful. Open to questions. Thank you.

Why were you really trying to optimize the parsing part? Parsing is typically done once, when you pre-process the data into some form that you then reuse, right? True. There are two reasons why parsing had to be optimized. In a lot of cases, we take the data, process it once and move on to the next step, effectively batch processing. Secondly, in this particular case, 20 seconds was not too bad, because it's a one-time setup.
But when debugging, we run it so many times that a start-up overhead of 20 seconds every time is way too much. Okay. How does Numba compare with PyPy? How does Numba compare with PyPy: I haven't done the benchmarks yet. My problem is that I cannot go to PyPy, because I use pandas. So this is a constraint: I have never evaluated PyPy. Sorry, for those who are not aware, PyPy is Python implemented in Python, and it's pretty fast. I have never been able to use it because I need NumPy too much, and NumPy's C extensions are a problem there. The thing I find interesting is that this has been my barrier with PyPy as well: there's always some random library that hasn't been properly tested and doesn't work. And what this sounds like is a free ticket, PyPy kind of speed without using PyPy, and that sounds a little too good to be true, so I'm wondering what the quirks are. I'm also a little scared to use it in production. Give me a few months; we'll share notes. The reason I'm scared to use it in production is simply that we haven't tried it enough. Numba has become this fast fairly recently, and I haven't found a downside yet; in any case, it's easy enough to benchmark and decide which functions to apply Numba to. From an environment and distribution perspective, given that Anaconda and other distributions exist, it doesn't seem to be a problem. So I'm going to start using this in production pretty much a couple of weeks from now and see where it goes. But it's got to the point where I'm comfortable enough talking about it and deploying it at a client's site.

Yes: when you were optimizing the parsing, could there have been a step where you use enumerate for the counts? A counter is a line that always comes up when you're looping through a file, so does enumerate help? Well, firstly, without taking my answer as given: try it out. The thing is, what does enumerate entail? It creates a generator, and a generator has a number of additional useful properties: in the middle of the generation you can mutate things, you can stop the generation, you can throw an exception, all kinds of things. In a C loop you can't throw exceptions, you can't mutate; you are fairly constrained. Python provides a lot of flexibility with iterators and generators, and with that flexibility comes a loss of speed. As a rule of thumb, I avoid generators, and wherever I have a generator I replace it with a NumPy array. But that's only when processing large-scale data; there are enormous benefits to generators in other cases.

Okay. Suppose you have a large number of elements, say a million or 10 million. How would you advise speeding up a lookup operation? If you have 10 million elements and they fit in memory, dictionary lookups are reasonably fast. If they don't fit in memory, an external dictionary lookup is fairly fast. What does an external dictionary lookup mean? Any key-value store: Riak if it doesn't fit in memory, Redis if it fits in memory, memcached if it's dynamic. Any of these would be pretty fast. And how does a dictionary lookup compare against a set lookup? A dictionary lookup and a set lookup have pretty much the same speed. A set lookup tells you whether the element is present or not; a dictionary lookup also gives you an associated value. If you're only checking for presence, a set lookup is always faster.
However, if you only need to be approximately right, you can do a lot better by using something called a Bloom filter. Do check that out. It tells you, with a good probability, whether the item is there. So you use a very fast operation to test, with, say, 99% certainty, whether it's there or not, and then reconfirm with a slower operation, which overall speeds up your process a lot. Also, how much space do a dictionary and a set take; which is better on space? A dictionary takes more space than a set, because you have to store the values as well. A set takes more space than a Bloom filter, because a Bloom filter is approximate. Thanks.

Hi. We use NumPy arrays, and sometimes I find that operations on a plain list are faster, and sometimes the NumPy array is faster. Can you resolve this confusion: when is it better to use a NumPy array rather than a list? If we take a specific example, I'll be able to explain better, but let me give you my general observations. Whenever a NumPy array is used like a list, it tends to be a little slower. If you create a NumPy array and run a for loop over it, it is invariably slower. When you have a NumPy array and a for loop in the same piece of code, something is really going wrong: not necessarily a for loop, any kind of loop. The whole purpose, or one of the purposes, of using vector operations is to avoid iteration, so that's a clear giveaway that it will be slower. Beyond that, there are very, very few operations where NumPy would actually be slower than a list. Let me give you one such case: calling a function. Suppose you apply a lambda like x times x over the array. In order to square each value, you're calling a lambda function, and functions are slow. NumPy loops through fast, but carries the overhead of calling the function. The same thing written as a list comprehension has no for-loop overhead and no function-call overhead, because x times x is just an operator, and that can potentially be faster than NumPy. So a function call or an explicit loop are the giveaways that the code isn't structured well for NumPy.

Functions are slower, right? In that case, is it possible in Python to generate a local scope for a function, or have a pre-compiled function object, and use that? How far does that get us? If I've understood your question, and I may not have: is the compilation the slow aspect, meaning can I just pre-compile it? Function objects are already pre-compiled. The problem is not that it isn't converted to C; it is converted into bytecode, which is pretty fast. The problem is that Python offers a hell of a lot of flexibility. It allows you, for example, to raise an exception inside a function. To support that, exceptions have to be trapped and propagated in a certain way, and there is a lot of type conversion that happens automatically. As you enter a function, the first thing it does is look at all the arguments coming in and check whether they need to be converted, and likewise, at every step in every line it does all of these things. So the reason Python functions are slow is by design.
Unlike JavaScript, which was sort of not quite designed, and where people nonetheless managed to optimize their way around it dramatically, that is less easy in Python, because it guarantees a certain amount of flexibility that one cannot step away from. So it is compiled. You're saying: let us create a function object dynamically and use that. Try it; you will find that compilation is not the main problem. The problem really is the flexibility that comes with a Python function object. What Numba does is say: I'm not going to give you that flexibility. If the function raises an exception at the tenth step, I'm just going to give up completely and not let you handle it the way you normally would. It works by stripping down functionality, and that trade-off can be made on a case-to-case basis. If you're, say, scraping, you do need that exception in the middle of the for loop. But do try it out. I suspect you'll find that no matter how a function is created, it is slower than not having a function.

Hi. Regarding your suggestion to use hash instead of a cryptographic algorithm: isn't hash calculated at runtime, so you won't get the same hash if you restart the program? It is the same algorithm, even if you restart and run again. I thought hash was based on the memory location of the object. For an arbitrary object it may be, but if you give it a string, which is all I've ever given it, it is always the same. Everything I know tells me it should be the same, but I'll recheck once again.

Regarding the other example, the election data you were going through, browsing through different constituencies: let's assume the file is rather large, and processing is always going to take a few seconds to read the file and the data. A user would not want to wait minutes to see the update. Is there a way to progressively update the display and keep processing the data in the background? Absolutely. That gets into the realm of asynchronous processing, and you don't necessarily need threading for it. The question is: supposing the user makes a search, can you display something in the middle? A number of libraries support that. If you take Tornado, for example, you can have an async operation and, as it feeds you results, display them one by one. But even better, you don't have to do that: you can put in any random junk. The good part about users is that no matter what you show them, they will try to read it. So put up some dummy results or something interesting while they're waiting, as a distraction, and as the real results start coming in, you don't need to wait for the end; put them up as they arrive. You can also recognize that people often search for the same thing, so once you see a certain result often enough, you start caching it. Further, you can start partitioning: like a dictionary lookup, why should I search through the whole thing? If the query starts with B, maybe I split the data into things that start with A, things that start with B, and so on, and only search within B. Take that concept further and you're creating an index; take it to its logical end and you have a database index. So it depends on how far you want to go to speed it up. And the ability to serve something in parallel is independent of the data processing.
It's a way around speeding up the data processing itself, which is perfectly valid and useful. I think we're out of time, so we'll take more questions offline. Thank you so much, Anand, for the talk. Thank you.