Hello, welcome. I would like to talk to you about profiling Ruby. The full title is "From Profiling Ruby to Frankenstein Programming, or: How I Learned to Stop Worrying and Love Using Ruby for Scientific Computations" — which not many people do, which is a pity, and I hope to address this today somehow. The talk will consist of a short, humble half-hour introduction and then some wrap-up about how to profile stuff. My name is Piotr Szotkowski; I'm also known as chastell. I have a bit of a dual personality: during the daytime I'm Mr. Hyde, at night I turn into Dr. Jekyll. Basically, I work on CiviCRM, which is a lovely open-source CRM web application for NGOs and nonprofits. I've been working there for six years now; it's a wonderful product and a great tool, and it's written using these two technologies, which is why I usually look like this when I write. I also work as an assistant professor at Warsaw University of Technology in Poland, and this is my other life: this is me at my desk with some proper beverages and some unfortunate student's remains. At the university, as you know, we write everything in Fortran. When we don't write in Fortran, maybe we use Lisp. (Sorry, Lisp.) And those of us who are not really that into performance are allowed to use C, or maybe C++. Some of us who are really lazy just use Java. You can see there's a certain distribution pattern here. But I'm not there yet — I mean, I'm obviously getting there, but not yet. So I'm much more into scripting languages, and when you think about scripting languages for scientific computations, obviously: Python. But how do you introduce Python to the scientific world, which is usually very conservative, so to speak, when it comes to programming? Well, you usually get a proper book. You walk around with this book displayed, and when you're tired of walking around with this book, you put it on your t-shirt. And, well, this is how I learned to use Ruby and love it.
So basically, if this talk were limited to one slide only, it would probably be this one, because this is the general perception of how performant Ruby is: we usually believe that Ruby is a slow language — okay for us, but not really that good for doing anything performance-critical. So how slow could it be? Can we somehow benchmark it? It so happens there is a bunch of great nerds somewhere in a basement, somewhere in Europe, who run this wonderful website I showed earlier — the language shootout, the Computer Language Benchmarks Game — a website that compares the performance of different languages. You can go there and ask: please, please, please tell me the comparison between Ruby's speed and, for example, C, which is a good benchmark language. And they will say: well, on one hand Ruby is up to 200 times slower than C; but on the flip side, it uses up to 500 times more memory. So the real reason why we use Ruby — and why I ended up using Ruby — is this: we all know that Ruby is much more descriptive. It's much shorter to write the same program in Ruby than in other languages; it's much easier to write what you want to write. The general point is that developer time is usually much more precious than computer time, even when you have long-running problems. And there's a lovely quote that questions, upfront, whether we should really consider benchmarks of short programs to be telling us anything: "Synthetic benchmarks say sweet FA about real-world performance of code, architecture being a much more significant consideration than the proportion of raw MIPS a given language will deliver on a given platform. The average netbook could happily run all of Teller's fusion bomb models along with the full telemetry analysis of all the Apollo missions in the pauses between loading XKCD comics and binning junk mail, without the user being any the wiser."
This is a quote from Eleanor McHugh, a wonderful Ruby hacker currently writing a new Ruby implementation in Go, GoLightly. And this is the gist of this talk: coming up with good algorithms and good architectures is so much more important than using the faster language — although there are exceptions, which I will address. So, how does one profile Ruby? This person you might have heard about — a young Python programmer from Switzerland called Leonhard Euler. He's best known for wearing underwear on his head, obviously, but other than that he has this lovely, humbly named Project Euler website, which you can go to and ask: pretty please, can I have a problem to solve, because I'm all out of problems, apparently, for some reason. So I went there and I got this problem: starting in the top left corner of a 2×2 grid, there are six routes (without backtracking) to the bottom right corner; how many such routes are there through a 20×20 grid? It looks like this: if you consider the upper left corner your home and the lower right corner the coffee place you want to get to, there are six ways to get there, so after a week you'll have to start repeating yourself. A typical everyday problem we all face. And if you think about it, one approach — definitely not the best one, but one that could come to mind first, or at least came to my mind first — is that there are only right and down turns, so we could represent them with zeros and ones. This is how a programmer's mind usually works. The six paths are 0011, 0101 and so forth. So, for a 2×2 grid, we basically want to find all binary numbers with the same number of ones and zeros; these will be one representation of the different paths. If we wanted to program this in Ruby, we could say grid equals two, take all the numbers from zero to 15, and turn them into their binary representations.
This (to_s(2)) is the quickest to write — definitely not the most performant — way to turn them into binary representations. We pad them with leading zeros to be able to compare the numbers of zeros and ones, then we select the ones that have the same number of ones as zeros, and we count them. The answer is six, so this algorithm works, at least for a 2×2 grid. If we turn it into a script, you can say ruby grid.rb and a number, and the number will be read into the grid variable. So let's run it. It prints out six — very cool. We run it for 20, for the 20×20 grid, and we wait. We wait, we wait; days pass. You do some other things, your computer crunches on, you press Ctrl-C. So you try to figure out what exactly takes so much time — where is the problem? So we time it. For a 2×2 grid it's really fast. For an 8×8 grid it's also rather fast. For a 10×10 grid it's already an order of magnitude slower than for 8×8, and for 11×11, again, we get another order of magnitude of computation time. So basically, we have to figure out how to crack this nut — what exactly is not performant in that algorithm. And there is a way of profiling Ruby that is distributed with every Ruby installation: the profile library. If we first time it — for a 9×9 grid it runs in under a second — we can do this, which is the simplest way to profile anything: just require the profile library. We get the output; the main problem is that it took us 40 seconds to get a profile of a program that runs in under a second. This is what the servers at the university would look like if I were to profile a computation that runs for a day, with profiling two orders of magnitude slower. Way too slow. Okay, so we Google a bit. We learn about ruby-prof. We do the same — we just run ruby-prof as an executable instead of ruby.
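Assembled from the steps above, the brute-force script might look like this — a sketch; the file name and variable names are my reconstruction:

```ruby
# grid.rb -- brute force for the path-counting problem: paths are bit
# strings of length 2*grid with exactly `grid` ones (right-steps) and
# exactly `grid` zeros (down-steps)
grid = (ARGV.first || '2').to_i
bits = 2 * grid

paths = (0...2**bits)
  .map { |n| n.to_s(2).rjust(bits, '0') }       # binary, zero-padded
  .select { |s| s.count('1') == s.count('0') }  # same number of 1s and 0s
  .size

puts paths
```

Run as `ruby grid.rb 2` and it prints 6, matching the problem statement.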
We also get a nice profile, which I'll get to in a minute — but profiling still takes about three times as long as the execution of the same algorithm. The main advantage of ruby-prof is that you can call it like this, and there's this nice graph that says, for example, that String#rjust was called some 200,000 times, or that Array#select was called half a million times, and which called what. But we'll get to this in a moment; it's unfortunately way too slow — if we have a three-times overhead on profiling, it's often not really that useful. And then we learn about perftools.rb. perftools.rb is a very nice, thin Ruby wrapper around perftools, Google's library for profiling C code. This is a very interesting approach, because it doesn't track function calls; it doesn't get into your program and keep track of everything you do. It actually lets your program run and samples it — by default, I think, a hundred times a second — it samples the call stack and sees which function you're executing at that moment, and the whole call stack. So it does a statistical analysis rather than exact profiling; but if your program runs for long enough, the result is usually very close to what you would get with real profiling. So what you do: you export a variable, CPUPROFILE, set to the name of a file; then you run your program, just requiring perftools instead of whatever else. And you can see there's almost no overhead — it runs in under a second, for the same program that ran in under a second without profiling. And you get a profile that has the added benefit that — thanks to sampling your program and seeing where it is at any given moment — it also catches the garbage collector, which in this case accounts for 20% of what happens. That's information you wouldn't get with the other profiling tools.
Let's try to do this for a 10×10 grid, which is a bit larger. Again we profile, and again there's almost no overhead — 3.1 versus 3.2 seconds, more or less. And for the 10×10 grid you can see the profile is different: it's no longer the garbage collector that's the most expensive part, it's the Fixnum#to_s calls. But the coolest thing about perftools.rb is that you can get a PDF graph of how your program runs. It shows you, by the size of the boxes, that we spent 23% of the time in to_s, 27% in Array#select, 25% in Array#map, and so on, and 15% in the garbage collector. We can do the same for 10×10, 11×11, 12×12 grids; the runs get longer and longer, but you can see the profiles look quite similar. It would be great if we could get rid of the to_s calls, but let's first try to get rid of those Array#map and Array#select calls, which are pretty big, and then see whether it helps. This was our original program — but notice that we don't really have to compare the number of ones with the number of zeros; we just have to check whether the number of ones equals the size of the grid, because the ones are the steps to the right, and if there are grid of them, that's enough. Given that, we can get rid of the map and just move the to_s call into the select. And if we know Ruby well enough: all this time we've been told that you don't select and then take the size — you just call count, which takes a block and counts how many times it's true for a given enumerable. So this was the previous profile, and this is the new one. It's a much nicer profile; unfortunately, the elephant in the room is just two times bigger now — 50%. But the program, which used to run in under a minute, now runs in under half a minute. A nice gain, but not really getting us anywhere.
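With the map folded away and the select-then-size replaced by count, the whole loop body shrinks to a single comparison. A sketch for a small grid (the variable names are mine):

```ruby
grid = 5
bits = 2 * grid

# A path is valid iff it has exactly `grid` ones (right-steps);
# no need to pad the string or also count the zeros
paths = (0...2**bits).count { |n| n.to_s(2).count('1') == grid }
```

For a 5×5 grid this counts 252 paths, which matches the binomial coefficient C(10, 5).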
This is where we start to think that maybe we don't necessarily need to code upfront — because this is how developers usually approach problems: oh, there's a problem, I should code a solution. But the way you should approach mathematical problems is to first look for whether they have actually been solved before. If we read up a bit, what we are trying to do is a population count: counting the number of ones in a given integer. Our implementation of the population count was just to to_s(2) the number and count the '1' substrings — and if we run it, we get, as I said, under half a minute. But there is a more proper way to implement a population count: bit shifting. You don't have to read very much into it, but this is one of the simplest algorithms for counting the number of ones in a binary value. The problem is that when you use it, it's actually slower than to_s and counting the ones — because counting substrings is really fast in Ruby, for some reason. So what you usually do is say: well, Ruby is slow. We now obviously spend 90% of the time in this popcount-bitshift function, which is obviously the slow part — let's rewrite it in C and see. I'm jumping ahead a bit, but you can actually rewrite parts of your Ruby code in C, in place. This does exactly the same thing, and instead of half a minute it now takes four seconds for this example. And if you are really into programming C, you can say: well, I'm using GCC; there is a built-in function, __builtin_popcountl, and I can use that. The difference is that it's not that much faster — 4.6 versus 4.5 seconds. But you can see we no longer spend 90% of the time there: we spend most of it — well, technically 86% — iterating over the range, and the remaining part mostly in here. And this is when you think: well, maybe this approach is not really that good. I went all the way down to C.
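For reference, here is one classic bit-twiddling population count in Ruby — this is Kernighan's lowest-set-bit trick, a stand-in for whichever bit-shifting variant the slide showed:

```ruby
# Count set bits by repeatedly clearing the lowest one;
# loops once per set bit rather than once per bit position
def popcount(n)
  count = 0
  while n > 0
    n &= n - 1  # clears the lowest set bit
    count += 1
  end
  count
end

popcount(0b010110) # => 3
```

As the talk notes, in MRI this kind of loop tends to lose to `n.to_s(2).count('1')`, because the string work happens in fast C code while the loop runs in the interpreter.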
I still don't have a solution that scales, because if a 12×12 grid takes four seconds, you still won't get to 20×20 in any reasonable time. Maybe I should really read up a bit. And when you fetch your favourite statistics or mathematics book — whatever you keep on your nightstand — you realise that we basically want to make 2×grid steps by combining grid right-steps and grid down-steps. We want to pick grid elements from a set of 2×grid elements — and the number of those combinations is a known quantity: the binomial coefficient of 2×grid over grid. Given a certain grid size and a certain number of steps to take, it's always known. For a grid of size 2, it's the factorial of 4 divided by the factorial of 2 times the factorial of 4 minus 2; for grid 20, it's simply the factorial of 40 divided by, and so on. And this is a computation that even Ruby can do quite fast — that's like 40 multiplications and one division, which is really doable. This is the number of ways you can get from your home to your coffee shop, if you live in a large enough area and there are 20 blocks each way. So basically, we didn't need a fast language to solve this at all. We could have solved it with a very slow language, if only we had used the right approach. But sometimes you do need to do some coding to solve a problem — this happens, I think, to every one of us. So again, you can go to Project Euler. This time I got: the sum of the primes below 10 is 2 + 3 + 5 + 7 = 17; find the sum of all the primes below two million. I won't get into the details of why this is a very common problem that every one of us faces, perhaps without quite realising it. But basically, we could solve it if we had a way to say whether a number is prime.
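The closed-form count described above fits in a few lines of Ruby (method names are mine):

```ruby
# Number of lattice paths on an n x n grid: C(2n, n) = (2n)! / (n! * n!)
def factorial(n)
  (1..n).reduce(1, :*)
end

def grid_paths(n)
  factorial(2 * n) / (factorial(n)**2)
end

grid_paths(2)  # => 6
grid_paths(20) # => 137846528820
```

Ruby's arbitrary-precision integers make this trivial — no overflow to worry about, and the 20×20 answer arrives instantly.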
If we had a prime predicate, we could just take the range from 2 to 2 million, select the numbers for which the predicate returns true, and sum them. So we can do it like this. Here's a very stupid implementation of a prime predicate: it assumes the number is prime, then goes through all the numbers smaller than itself, and if the number is divisible by any of them, it is definitely not prime, so we set prime to false. After we go through all these numbers, we return prime. The stupidest way you can implement a prime predicate, but it works. So what's the main problem with this? Well, if you do 2_000_000.prime?, you go from 2 to 1,999,999 and check divisibility by all of those numbers — even though you're pretty sure two million is divisible by 2, because it's an even number. So maybe there's a simpler way, one that returns as soon as it reaches the first number that actually divides ours. Ruby is a very nice language: it has this lovely all? method for enumerables, which checks whether the block is true for every element and returns false as soon as it isn't. So now, if we check whether two million is prime, we only check whether it's divisible by 2: two million modulo 2 is zero, so the block returns false, all? short-circuits, and we're done after a single division. But if you think about it — and if you had some mathematics somewhere in your life — you know that checking all the numbers smaller than oneself is not necessary. You only have to check up to the square root of the given number, and that is enough: if a number has a divisor greater than its square root, it must also have a matching divisor smaller than (or equal to) the square root, so the smaller one would have been found first.
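The progression described above, as three predicates on Integer (the method names are my invention):

```ruby
class Integer
  # Check every smaller candidate, never returning early
  def prime_stupid?
    prime = self > 1
    (2...self).each { |d| prime = false if (self % d).zero? }
    prime
  end

  # all? returns false as soon as the first divisor is found
  def prime_tedious?
    self > 1 && (2...self).all? { |d| self % d != 0 }
  end

  # Checking candidate divisors up to the square root is enough
  def prime_simple?
    self > 1 && (2..Math.sqrt(self)).all? { |d| self % d != 0 }
  end
end

2_000_000.prime_tedious? # => false, after a single division by 2
```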
Every single one of these changes is an order-of-magnitude change in performance, as we'll see in a moment. Now, you could be saying: very interesting talk, very interesting story, but if you do require 'prime' in Ruby, you get Integer#prime? already — it's all there in the standard library. Well, that was my recollection as well, but just for this very simple example I tried to benchmark it. So this is again our simple prime solution: go through all the numbers up to the square root of ourselves and check whether any of them is a factor. And we benchmark five methods — the stupid one, the tedious one, the simple one, the clever one (which I will get to in a moment), and the standard library one — checking, for every number from 2 to 10,000, whether it's prime. If you run this benchmark, you see that the stupid one is really not that performant; it takes 13 seconds. The tedious one is an order of magnitude faster, because it returns as soon as it knows a number is not prime. The simple one is again over an order of magnitude faster. The clever one is twice as fast again — but the standard library one is surprisingly slow. The standard library predicate for checking whether a number is prime is actually two times slower than our simple solution. I mean — why? Well, just for the sake of it, let's drop the two slowest ones and compare up to 100,000, because maybe there's a warm-up phase or something like that. If you run it for 100,000, again the simple approach takes 1.2 seconds, the clever one is twice as fast, and the standard library one is twice as slow. That's interesting, because you would assume the standard Ruby library prime predicate would be by far the fastest. Well, it's not.
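A minimal version of such a benchmark, with the 'simple' predicate inlined as a lambda so the snippet stands alone (the reported times will of course differ per machine):

```ruby
require 'benchmark'
require 'prime' # provides the standard library's Integer#prime?

# The square-root-bounded predicate as a lambda
simple = ->(n) { n > 1 && (2..Math.sqrt(n)).all? { |d| n % d != 0 } }

Benchmark.bm(10) do |bm|
  bm.report('simple:') { (2..10_000).each { |n| simple.call(n) } }
  bm.report('stdlib:') { (2..10_000).each { |n| n.prime? } }
end
```

Both loops do identical work per iteration apart from the predicate itself, so the two rows are directly comparable.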
About the clever solution — it's not that complicated. If you check whether 173 is prime, you go from 2 up to 13 (the square root of 169 is 13), so along the way you check divisibility by 2, 4, 6, 8, 10 and 12 — which is not that efficient, because if a number is not divisible by 2, it can't be divisible by any of the other even numbers either. So the clever prime predicate is this very simple solution: if we are the number 2, we are prime; if we are even, we are not prime (except 2, which we checked explicitly); and then we just go from 3 up to the square root of self in steps of 2, so through every odd number. Again, a very simply written predicate that is 3.5 times faster than the standard library one. But okay — let's assume we're not all, I don't know, assistant professors at a university for some reason, so let's talk about caching a bit; that's the usual answer. How would you profile a web application, which is probably what some of you are interested in? Well, I will give you a simple example; Rails is Rack-based, so this should work for Rails too, I think. When I started learning web development in Ruby, I did what everybody does: wrote myself a tiny blogging engine. It's a Sinatra app. We wrap it in a Rack server, and we can ApacheBench it with many items installed. And my problem was that the more items I added to my blogging engine, the longer it took to render any of the pages — like this, on a local machine — which was kind of slow. Profiling it is really simple: again, you export the profile file name, and you add the require to your rackup call. For this we'll use our lovely, trusty WEBrick — maybe not the fastest, but it will do — and we just visit the local website of our blogging engine ten times. This is the simplest example I could come up with. Every single request took six and a half seconds, which was kind of long. Why was that?
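The clever variant, sketched (again, the method name is mine):

```ruby
class Integer
  def prime_clever?
    return true  if self == 2           # the only even prime
    return false if self < 2 || even?   # 0, 1, negatives and other evens
    # Only odd candidates up to the square root remain to be checked
    3.step(Math.sqrt(self), 2).all? { |d| self % d != 0 }
  end
end

173.prime_clever? # => true
```

Halving the candidate pool roughly halves the work, which matches the "twice as fast as the simple one" result from the benchmark.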
So if we require perftools here and specify the profile file here, we can again draw a very nice graph of what actually happens in the web application. The graph looks like this — a bit more complicated than the graphs from before, but a much more real-world graph. What's important is that there are some large boxes here. If we zoom in, it shows — maybe not that visibly — that we spend 60% of the time checking the date on every item. In my blogging engine, that's basically this: it uses Time.parse, which is by now known to be slow in Ruby; I didn't know it back then. The simplest solution — because profiling told us this accounts for 60% of the time — is to cache it, and basically do this. This is a one-line change that gets us from six seconds to under two seconds, for this not-that-contrived example. So again, profiling is very nice. We can profile again with this one caching change in place. The graph still looks quite scary, but again there is one large part — and fortunately, if one part of a profiling graph is large, that's lovely, because that's your bottleneck. You zoom in and you see that it's now item_path that takes 45% of the time in this very simple implementation; it turns out that calling File.basename in Ruby and then splitting the result is, again, not that fast. If we cache it, we go from two seconds to half a second. Very nice: we got from six seconds to half a second with two one-line changes, just because we knew which parts were slow. But let's profile it again. This profile is not as nice. What we see is how the engine works: it creates a stream of items, and if you requested a single page it finds it in the stream; otherwise it goes through the whole stream. The details are not really important — but what if we cached the whole stream?
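The one-line caching change described above is ordinary memoization. A hypothetical reduction of it — Item and raw_date are stand-ins for the engine's real code:

```ruby
require 'time' # Time.parse lives here

class Item
  def initialize(raw_date)
    @raw_date = raw_date
  end

  # Before: `def date; Time.parse(@raw_date); end` -- reparsed on every call.
  # After: parse once and remember the result.
  def date
    @date ||= Time.parse(@raw_date)
  end
end

item = Item.new('2012-03-10 12:00:00')
item.date.equal?(item.date) # => true -- the second call hits the cache
```

The `||=` idiom is safe here because a successfully parsed Time is never nil or false.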
So: the first time your blogging engine runs, it loads the whole stream and then keeps it in memory. We do that, we ApacheBench it, and it turns out that — again, a one-line change, paid for with a bit of memory — we go from 400 milliseconds to 40 milliseconds. So three one-line changes that introduce caching in the right places, because we knew where to put it, got us from six and a half seconds to 45 milliseconds. And we can actually reclaim some of the memory, because now we know that caching the whole stream is just the ticket: if we un-cache the date and the path, then (as these are ten calls) the first call takes six seconds, but all the subsequent calls, because the stream is cached, take 45 milliseconds — and we didn't have to cache all those other things. But sometimes you can't figure out a faster algorithm and you want to find a faster Ruby implementation. How can you speed Ruby up? One obvious choice is to try some alternative implementations: JRuby is Ruby on the JVM; MacRuby is Ruby on top of Objective-C, I think; Rubinius is Ruby written in C++ and Ruby. This is a bit dated, but I couldn't find a newer shootout between those implementations. What's important is this: compared with standard Ruby, these three are all faster in different contexts, so you should actually try all of them and figure out which might be the fastest in your case. You could also use fast libraries — there are a lot of libraries written in C that have very nice Ruby wrappers. There's NArray, which is a very fast way of doing any kind of matrix calculations in Ruby. There are Ruby bindings for the GNU Scientific Library; if you're doing any statistics, this is one way to do it.
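Caching the whole stream works the same way, one level up. A hypothetical sketch — ItemStream, build_stream and the $loads counter are stand-ins that only exist to show the stream is built once and reused:

```ruby
$loads = 0 # counts how many times the expensive build actually runs

class ItemStream
  # Built on the first access, then reused by every subsequent request --
  # trading a bit of memory for a lot of time
  def self.stream
    @stream ||= build_stream
  end

  def self.build_stream
    $loads += 1
    (1..3).map { |i| "item-#{i}" } # stand-in for slowly loading the posts
  end
end

10.times { ItemStream.stream }
$loads # => 1
```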
There are Ruby bindings for Boost regexes, a very nice way of having very fast regular expressions — you should look into this. And there's google_hash, a hash implemented by Google for doing very fast hash lookups: if you have only integer or float keys in your hash, you should try it out, because it's a very nice wrapper that gives you a Ruby-like hash that is fast, for a change. The next thing I would love to cover is Strangelove programming — Strangelove, if you didn't recognise the quote in the title. There was a quip about Edward Teller, the father of the hydrogen bomb: in the Dr. Strangelove movie by Kubrick, Peter Sellers plays Dr. Strangelove, who is a personification of Edward Teller. What I call Strangelove programming is calling into C libraries from Ruby — libraries that do not already have nice Ruby wrappers. You could use Ruby/DL; or, if you are writing your own wrapper now, you should probably use Ruby FFI, the foreign function interface, because then it will run on JRuby and on, say, Rubinius, and you can plug into different libraries. There's a very nice gem called levenshtein. It's a gem for computing distances between two strings, arrays, and so on. You want to know which strings are similar when, for example, you have a database of first and last names and a new name comes in, and you want to gauge whether this is a new name or one you already have in the database, just spelt differently — the Levenshtein distance between differently spelt versions of a name will be small.
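For context, here is a minimal pure-Ruby Levenshtein distance — the classic dynamic-programming algorithm, of the sort such a gem falls back to when its optimized C extension is unavailable (the gem's own API and internals differ; this is just an illustrative sketch):

```ruby
# Edit distance between two strings: minimum number of single-character
# insertions, deletions and substitutions to turn `a` into `b`
def levenshtein(a, b)
  prev = (0..b.size).to_a # row for the empty prefix of a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [curr[j - 1] + 1,        # insertion
               prev[j] + 1,            # deletion
               prev[j - 1] + cost].min # substitution
    end
    prev = curr
  end
  prev.last
end

levenshtein('kitten', 'sitting') # => 3
```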
It's not the best way, but it is one of the ways to do this, and the gem is written very nicely: for strings, arrays, and arrays of strings it uses optimized C code, so it's very fast; beyond that, it works for any objects that implement the each method, so if you have anything with an each method, you can compute the Levenshtein distance between two objects of that class using this gem. And if C is not available, it falls back to a Ruby implementation of the same algorithm. It's a very small gem; if you have performance problems and want to rewrite parts of your code in C, do check out how the levenshtein gem is written — it's a really nice example of a very simple algorithm written in all those versions, with logic that checks whether C is available and uses C if so, Ruby if not. So, there are bridges to other languages. You can use C if you're on the canonical Ruby implementation, obviously. You can use the D language via its Ruby bridge. You can use Haskell code via Hubris, a library for accessing Haskell from Ruby. You can use any Java libraries — if you already have a Java implementation of a solution to your problem, you can just fire up JRuby and access the Java classes directly from there. And you can use the R language, a very nice language for statistics, via Rserve — again, a very nice way of bridging from Ruby to probably the right tool for the job, especially if a large library already exists.
The last thing I want to cover is Frankenstein programming, by which I mean the problem of embedding foreign languages. We can embed assembly in C if we want really fast programs — can we embed other languages in Ruby? We can: with RubyInline, as you saw before and as I'll show you in a second; with java_inline you can embed Java right into your Ruby code; and you can embed Haskell into your Ruby code if you want. So, an example: we had this simple prime implementation — a nice Ruby one-liner. If we want to make it faster, we can rewrite it in C, in place. We basically require 'inline', call inline with a builder, call builder.c and pass it a string; the string is C code, and RubyInline will expose the C function as a method of Integer objects. It's a very handy approach: this becomes simply a method on integers, it is compiled the first time it's run, and from then on it serves as a very fast implementation of a primality check. We can also implement the clever version, which in C is not that much longer: again, if we are the number two, this is true; if we are even, we are not prime; and otherwise we step by two from three, through all the odd numbers. And if we compare the simple one in Ruby, the clever one in Ruby, the standard library one, and the one we wrote in C, we can see that the C one is again an order of magnitude faster than the Ruby one. So if — and only if — you know you have come up with the fastest algorithm you could have, then consider rewriting; otherwise, there's probably much more to be gained by coming up with a better algorithm. We started two orders of magnitude slower than this, ended up with something 3.5 times faster than the standard library version, and then made it ten times faster still by rewriting it in C — but only because we had ended up with a quick
algorithm. So, was this the right solution? Of course not — the right solution is to go and read some documentation. There's a very nice prime generator in the standard library: you can take all the primes it yields that are lower than two million and just sum them, and this is as fast as using the C primality check. So, what should your world-domination plans be after this talk? First: profile. Always profile; always know what your bottleneck is. Then optimize the algorithm; cache; parallelize if you want, if you need, if you can. Benchmark which approach to your problem is fastest. Rewrite in a supposedly — or actually — faster language as the last resort. And actually do search around sometimes, because surprisingly many common problems are already solved; that's what happens to common problems. If you want to check out some other stuff: there are automatic Ruby-to-C translators — you put in Ruby code, they give you the equivalent C code, which you can compile and which will be faster. There is a Ruby optimization mailing list, which you should check out; it's a bit dead now, but now and then somebody posts something and there are replies. You can try utilizing multiple processes; you can use DRb, distributed Ruby, to distribute and parallelize your problem — stay in Ruby, just use, I don't know, 16 other machines — or BrB, which is DRb on EventMachine. There is a lovely library for profiling memory, if that is your bottleneck. And there's CUDA, a platform for doing vector manipulations and all kinds of mathematics on your graphics card, and there are Ruby libraries, I think, for plugging into the Nvidia libraries — Thrust and the like; look them up, they are very nice. The last question I want to address: why not optimize code right from the start? Why not start by writing optimized code, just in C? Well, because that's premature optimization. If you know the quote, it goes like this: we
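The documentation-reading solution, sketched with the standard library's bounded Prime.each (the talk described the equivalent take-all-primes-below-the-limit form):

```ruby
require 'prime'

# Sum all primes below a limit using the stdlib prime generator,
# which yields primes up to (and including) the given upper bound
def prime_sum_below(limit)
  sum = 0
  Prime.each(limit - 1) { |p| sum += p }
  sum
end

prime_sum_below(10) # => 17
# prime_sum_below(2_000_000) gives the Project Euler answer
```

The generator uses a sieve internally, so this runs in roughly the same time as the hand-rolled C primality check from the previous slide.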
should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. This is a quote from Donald Knuth, a very famous computer scientist — I believe his title is Professor Emeritus of The Art of Computer Programming. I love this picture, because it was turned into a t-shirt that says "Knuth is my homeboy", and if you ever considered wearing this t-shirt: Donald Knuth is still alive, so you could bump into him while wearing it. That is what actually happened to Jacob Appelbaum, who is known for his work on Tor, the onion router, and for his work on WikiLeaks: he made himself this t-shirt and met Donald Knuth. The thing they make together is nested parentheses, obviously — and the only funnier thing you could possibly do is take this picture and put it on a t-shirt, and then you'd have nested nested parentheses; it would say "I'm actually Knuth's homeboy". So, thank you very much. There are links at the end — everything I covered you can find under this address — and if there are any questions, I'm open to answering them. [Questions?] All right, everything was so obvious? Okay, thank you very much. If you're curious, this is actually how the standard library prime predicate is implemented — it uses the prime generator, which is why I was surprised: the prime predicate is implemented in Ruby in the standard library rather than in C, and this is how. If it were implemented in C, it would probably be so much faster; that's something you can dive into if you're into this kind of stuff. And again, if you come up with a question later, you can mail me and I'll be happy to answer it. Thank you.