My name's Davy Stevenson. I work for Esri, one of the fine sponsors. Thank you, Esri, for sponsoring this conference and allowing me and some of my co-workers to be here. I'm here to talk to you today about science. But first I'm going to tell you a little bit about me.

I grew up in a very small town in the Columbia River Gorge. Locals may recognize this view: this is White Salmon, across from Hood River. It's beautiful scenery, but it was a very rural area. Growing up, I didn't get internet until I was in high school, though we did have computers before then, thank goodness. As a kid I always really loved science. I wanted to be an inventor, to discover things, to create things. When I found out that the era of sailing around the world and discovering new continents and islands was over, that was kind of a depressing thought. My dream was to become an astronaut so I could explore brand new areas of space.

I decided to leave the West Coast for college and attend a school in Massachusetts. It has a beautiful campus and a great curriculum, and I entered college planning to be an astrophysics major. I didn't realize that it was going to look like this for six months out of the year, and not only was the weather a challenge, but the classes were a lot more challenging as well. I didn't enjoy a lot of the classes as much as I thought I would, or should. I could follow along with the science, and I was learning a lot of really cool things, but part of me just didn't think I was going to be the type of person who could create new science. I was just going to be following along. I realized I'd been a big fish in a small pond in the tiny town where I grew up. That can feel really, really great. It pumps up your self-esteem: man, you are just really awesome. But the feeling that you're struggling is also a good feeling to have, because it means that you're pushing yourself to your maximum potential. If you're the smartest person in the room, you can sit there feeling really great about yourself, but maybe it's about time to find a different room.

Because Williams is a liberal arts school, I was required to take a lot of classes outside of my major. I took Chinese literature, mathematics, psychology, philosophy, English, and computer science. I started taking CS classes relatively late in my college career. My first CS class was my sophomore spring, and that first computer science class probably changed my life. The class was taught using Java, and it was incredibly rewarding because I was learning in detail how to manipulate and control computers. I found out that this was something I was very, very good at, and for once I was able to create things, to build brand new things that never existed, which I never really felt I could do in my physics classes. So even as the physics became more and more challenging, my CS classes became more and more fun. I worried about switching majors so late in the game, and everyone told me not to do my astrophysics thesis and all my other upper-level CS classes in my last semester, but I did it anyway, and that was my favorite semester of school.

As graduation neared, I knew that I wanted to be a software engineer and not a physicist. But I wondered if I was giving up my dream of being a scientist. Being a scientist was something I had really wanted for my entire childhood.
Was I giving that up to become an engineer, a software engineer? Was everything I had learned in physics going to go to waste?

Algorithms are analyzed based on overall complexity, which is also known as big O notation. What we're trying to determine is how algorithms scale. Running an algorithm on a set of 10 items is almost always fast; what happens when you scale the data set up to 10,000, or 10 million? So this is a graph that might be familiar to a lot of you: it shows the growth curves for a number of common big O complexity designations. When we're analyzing an algorithm for big O notation, what we're really looking for is the highest-order term. We ignore all constants and lower-order terms when we make this analysis.

O(log N) algorithms, here at the bottom in blue, grow very, very slowly. These are the sweet-spot algorithms. They almost always assume that your data is sorted in some way, which is why database indexes, caching, and pre-sorting our data are so important for our jobs (I'll show a quick sketch of the difference below). O(N) algorithms require a distinct analysis for each point in the data set. O(N log N) encapsulates most of the best sorting algorithms, so this is pretty much the best you can do if you have a random pile of data and you need to sort it, and sorting, of course, is very important for letting the O(log N) algorithms work at their best. The quadratic algorithms, O(N²), are where we start to get into trouble. This is the realm of rapid growth as the number of points in the data set increases, and it can really bring your site or your database to its knees. Exponential: if you're running an exponential algorithm on anything but a very small set of data points, you're going to have a bad time pretty much instantaneously. The exact solution to the traveling salesman problem is an exponential algorithm. And then we have the factorial ones, to which you should just say no entirely. The brute-force algorithm for the traveling salesman problem is factorial, so you'd better be an expert if you're trying to do anything in that realm.

So let's talk about some algorithms. We deal with algorithms every single day. An algorithm is a set of instructions that will solve a particular problem. A problem might have many different algorithms that describe correct solutions, and possibly many more that describe good-enough solutions. Depending on your use case, sometimes good enough is good enough, and you don't have to find the exact solution; in other cases, the exact solution is really what you're relying on. I'm going to describe two different algorithmic solutions to the same problem to give you an idea of how such algorithms can differ.

The convex hull is a mathematical term for the envelope of a set of points. The vertices of the polygon representing the convex hull always consist of points from the data set itself, which is something we can rely on. To envision this, imagine wrapping a string around a bunch of nails hammered into a board. The convex hull has many real-world use cases in computer graphics, pattern recognition, image processing, and GIS. Finding the convex hull also allows you to make further calculations, such as finding the centroid (the center of gravity) or calculating intersections with other polygons, which you might need for computer graphics or games.
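Here's that sketch: a minimal, plain-Ruby illustration, mine rather than the talk's, of why sorted data unlocks O(log N). A linear scan may have to plow through the whole array, while a binary search halves the remaining range at every step:

```ruby
sorted = (0...1_000_000).to_a

# O(N): a linear scan may examine every element before it can answer.
def linear_search(array, target)
  array.each_with_index { |value, i| return i if value == target }
  nil
end

# O(log N): halves the search space at every step, but it only works
# because the array is sorted, the same reason database indexes,
# caching, and pre-sorted data matter so much.
def binary_search(array, target)
  lo, hi = 0, array.length - 1
  while lo <= hi
    mid = (lo + hi) / 2
    case array[mid] <=> target
    when 0  then return mid
    when -1 then lo = mid + 1
    else         hi = mid - 1
    end
  end
  nil
end

linear_search(sorted, 999_999) # ~1,000,000 comparisons
binary_search(sorted, 999_999) # ~20 comparisons
```

Now, back to convex hulls.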
One very common algorithm for calculating a convex hull is called the Jarvis march, which is a variant of the gift wrapping algorithm. First, you need to find a point that you know is on the hull. This can be accomplished by scanning through all of the points and finding the leftmost point. Next, we scan through all the remaining points, looking for the point that is rightmost with respect to the hull point we just found. Once we've found it, we shift over and run the same process again: scan through all the other points, find the rightmost one, and continue around the path. Now you can see why it's described as a gift wrapping algorithm: basically, we're wrapping around the outside, finding each edge of the hull as we go, until we arrive back at the original point. Yay! Now we can be assured that we've found the convex hull.

So what's the complexity designation of the Jarvis march? First you have to find that one edge point, which is O(N), since you scan through all the points. Then you have to find the rightmost point among all the remaining points, which is also O(N). You repeat that process for each point on the hull, and we'll call the number of hull points H. That gives us O(N + N×H), and since we only keep the highest-order term, it's O(N×H). In the worst-case scenario, where every single point in the data set ends up on the convex hull, that means O(N²).

The second algorithm I'm going to talk about is the monotone chain. It is in some ways more complex than the Jarvis march, but it provides a performance boost under most conditions. In this case, we start very differently, by sorting all the points by their position along the X axis. We begin by placing the first three points on our list of potential hull points. What we're looking for here is the turn between each pair of segments. You can see that the first set of three points describes a clockwise turn. That's good; that's what we're looking for, so we move on to the next point. Here, however, we have a counterclockwise turn, and that means the middle point of that angle is not part of the proposed hull. We remove points from the list in this way until we're back to only clockwise turns. So we keep going along. Now we have two clockwise turns. Great. You might be thinking right now, this doesn't look like it's doing a very good job; this is nowhere near the convex hull. But bear with me, and I'll show you how it works out in the end. We move down again, and here we have a counterclockwise turn, so we remove that point. Still bad; let's remove that one too. Yay! We continue in this manner, checking the turns between the points and removing the ones that turn counterclockwise, until we've made it all the way through. And now we have half of a convex hull. Yay! Having found half the hull, we can repeat the same exact process with the points sorted in the opposite direction and find the other half. So here we go. It's all there. Great.
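Since the slides only show pictures, here's a minimal sketch of both techniques in plain Ruby, operating on arrays of [x, y] pairs. This is my own illustration rather than Terraformer's implementation, and the helper names are made up. The cross method is the turn test from the monotone chain walkthrough: its sign says whether the path o -> a -> b turns counterclockwise (positive) or clockwise (negative):

```ruby
# Sign of the cross product of (o -> a) and (o -> b):
# > 0 for a counterclockwise turn, < 0 for clockwise, 0 for collinear.
def cross(o, a, b)
  (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
end

# Jarvis march: start at the leftmost point, then repeatedly take the
# point that every other point lies to the left of, edge by edge.
def jarvis_march(points)
  hull = []
  start = points.min_by { |p| p[0] }
  current = start
  loop do
    hull << current
    candidate = points.first == current ? points.last : points.first
    points.each do |p|
      next if p == current
      # Replace the candidate whenever p lies to its right.
      candidate = p if cross(current, candidate, p) < 0
    end
    current = candidate
    break if current == start
  end
  hull
end

# Monotone chain: sort once, then build each half hull by dropping
# any point that introduces a counterclockwise turn.
def monotone_chain(points)
  sorted = points.sort # lexicographic: by x, then y
  half_hull = lambda do |pts|
    chain = []
    pts.each do |p|
      chain.pop while chain.size >= 2 && cross(chain[-2], chain[-1], p) >= 0
      chain << p
    end
    chain
  end
  one = half_hull.call(sorted)          # one half of the hull
  two = half_hull.call(sorted.reverse)  # same pass, opposite direction
  one[0...-1] + two[0...-1]             # the endpoints are shared
end
```

You can see the cost model in the shape of the code: the Jarvis march pays a full O(N) scan for every hull vertex, while the monotone chain pays for one sort and then touches each point a constant number of times per pass.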
So let's analyze this algorithm. First, we sort all the points, and as we discussed earlier, that's an O(N log N) operation. Then each of the two half-hull constructions is O(N), since we go through all of the points only once. So that's O(N log N + N + N), which collapses down to an O(N log N) algorithm. And the cool thing is that if the data is already sorted, we're at an O(N) algorithm, which is great.

So the most appropriate algorithm for any particular problem depends a lot on the data set and on expectations about where the algorithm is going to be used. The Jarvis march might be good enough in many cases. It's a pretty easy algorithm to implement, and if your data sets are small, or you aren't expecting many worst-case scenarios, it might be perfectly fine. Other times, if speed is an issue, or for other reasons that you know best, you might need a more complex algorithm, one that takes more work to build but provides a faster solution.

So how do we learn more about how these algorithms perform? Benchmarking tells us how fast things are and encourages us to test our assumptions. I think our community does a lot of testing of our code, but I don't see people talking about benchmarking their code very much, and that's something we could spend more time on. Use benchmarking to prove or disprove your assumptions about the performance of your code.

Hopefully everyone has already been introduced to the Benchmark utility in the standard library; if not, here's a very quick overview. We're going to do a simple benchmark of Array. To start, we create an example array: just the integers from zero up to size, so the array is sorted. Then, inside the benchmark utility, we create reports, which describe the things we actually want to benchmark. First we benchmark the at method on Array: how fast is an index lookup on this array? We benchmark that against index, where we ask the array to find an object and return its index. Then we also benchmark an index miss, where we purposely ask for an object that is not in the array, so we know the search will always fail. This is what you get back from the standard Benchmark, and it's really the last column you want to look at to see how the code performs.

But if we look at this benchmark code, we can see a lot of repeated code that doesn't add much value. It can also be really, really hard to find the right value of n. When you're creating these benchmarks, you might keep going back and changing that value, trying to find something that runs long enough to give meaningful results but doesn't take 15 or 20 minutes to finish. So there's another benchmarking gem that I highly recommend, called benchmark-ips, which gets rid of a lot of this extra cruft. Instead of reporting the raw time a code block takes to execute, it tries to find the iterations per second of that code block. As you can see, it removes a lot of that extra code; pretty much everything else is the same. The other cool thing it provides is a way to run a comparison across the different reports. So here's the same exact code running through IPS.
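Reconstructed, the two versions probably looked something like this sketch. The array size, the label width, and the magic n are my guesses; only the three reports come from the talk:

```ruby
require 'benchmark'
require 'benchmark/ips' # the benchmark-ips gem

size  = 10_000
array = (0...size).to_a # a sorted array of integers, as in the talk

# Standard-library version: we have to pick a magic n ourselves.
n = 10_000
Benchmark.bm(12) do |x|
  x.report('at')         { n.times { array.at(rand(size)) } }
  x.report('index')      { n.times { array.index(rand(size)) } }
  x.report('index miss') { n.times { array.index(-1) } } # -1 is never present
end

# benchmark-ips version: no n to tune, and a comparison comes for free.
Benchmark.ips do |x|
  x.report('at')         { array.at(rand(size)) }
  x.report('index')      { array.index(rand(size)) }
  x.report('index miss') { array.index(-1) }
  x.compare!
end
```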
It has a warm-up phase, where it figures out roughly how many iterations run in 100 milliseconds, so that it can more accurately calculate the iterations per second over the five-second measurement blocks. This is, of course, configurable if you need to change it. So here's the example output, where it compares the iterations per second between these different methods. Again, this is where you really want to start looking at the raw data that comes out. And then here is the comparison, and the problem is that it can be a little bit misleading. Is index really 75 times slower than at? Well, no, because these methods have different complexity designations. You're comparing the wrong thing, and comparing only a single slice of the size of the array.

I've talked about complexity and big O notation a lot in previous talks, and a lot of people have mentioned that this is an area where they still feel nervous. Big O notation shouldn't be as scary as we think it is. So I decided to build a tool to help benchmark algorithms: benchmark-bigo builds on top of benchmark-ips and provides big O notation benchmarking for Ruby.

So here's the IPS code again. When converting this code to benchmark the algorithm's complexity designation, we need to make a few adjustments. At the top, we were setting the size of the array that we're testing and then creating that array object. Within benchmark-bigo, the variable that we adjust to test this behavior is size, and the gem still needs to know how to create the objects that we're benchmarking. Here's the equivalent benchmark-bigo code; you can see that not a lot has changed. There are two main differences. First, we need to tell benchmark-bigo what type of object we're generating. I have arrays built in, and hopefully we'll build in a couple of other common objects as well; I'll also show you how to benchmark other types of objects. (By the way, all of this code is on GitHub. I tweeted out a link to the repo containing all of these examples, so if you want to look at it, it's there.) We also need to indicate how we want these objects to grow. You can grow linearly with a specific increment, so the arrays would grow 2,000, 4,000, 6,000, or you can grow the array exponentially, and there are a bunch of additional options to help you configure exactly the set of sizes you want to test. The next big change is in the blocks that generate the reports. We don't have the array and size objects directly available to us anymore; the gem generates those values for us. Since these values are still important in defining the behavior of our block, we now need to have them passed in: instead of taking no arguments, each block now receives these two variables, which you can then use. And finally, we want to chart this out, since numbers on screens are great and all, but charts are even better. So we can output this in visual form, and we need to say where to put those charts.
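Based on the talk's description, the benchmark-bigo version looks roughly like this. I'm paraphrasing the slides here, so treat the configuration calls (generate, linear, chart) as assumed names and check the gem's README for the real API:

```ruby
require 'benchmark/bigo'

Benchmark.bigo do |x|
  x.generate :array # the built-in array generator named in the talk
  x.linear 2_000    # assumed name: sizes 2,000, 4,000, 6,000, ...
  # x.exponential   # or grow the sizes exponentially instead

  # The gem builds each (array, size) pair and passes both into the block.
  x.report('#at')         { |array, size| array.at(rand(size)) }
  x.report('#index')      { |array, size| array.index(rand(size)) }
  x.report('#index miss') { |array, size| array.index(-1) }

  x.chart 'array_growth.html' # assumed name: where to write the charts
end
```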
It still provides you with a bunch of fun data if you want to dig through it, but the more important part is the graphs. Now we can see in much more detail the growth of these different operations as the size of the array increases. Just as we'd expect, at performs consistently, since it's just a basic index lookup into the array, and the index miss takes longer than index, because it has to search through every single element of the array, whereas index bails out early once it has found the object. Those two grow linearly with the size of the array.

So let's talk about convex hulls again, because I love convex hulls; they're pretty cool. One of the open-source projects that the Esri Portland office works on is called Terraformer. This is a library that we provide in a bunch of different languages, and Ruby is one of them. It's a geometric toolkit that helps you convert geographic data between different formats and perform some simple operations. One of the operations it performs is calculating the convex hull. We initially implemented the Jarvis march, because it's pretty easy: there was great pseudocode out on the internet, and it was simple to plug in. What we found was that it performed pretty slowly. So we looked around for other algorithms, found the monotone chain, plugged that in, and got an incredible speedup. So why is that? Can we get more information about how these two algorithms perform in actual Ruby code? We've talked about it hypothetically; let's see what's actually happening under the covers. It's benchmarking time.

First, I wrote some code to help with the benchmarking process. There's a method here that lets you easily calculate the convex hull with a particular implementation: you can pass in whether you want it to run the Jarvis march or the monotone chain. Next, I added a very simple manipulation on Float. This is mainly just for my benchmarking, so that I could create some random numbers. Don't do this in production code. We also need to be able to build the data sets that we're testing on, so I created another helper method that generates a random walk of a particular size. Here is the main code generating the random walk: we start out here in Portland, and size times over, we make a small permutation of the last item in the walk and add it to the list. It's pretty simple. Then we just create the line string in the proper format using Terraformer. Here's an example of such a random walk that I generated, and the convex hull that goes with it. It can walk pretty far.

Here is the benchmark code itself. First we have to tell benchmark-bigo how to generate this object. This is a little bit different, since we don't have random-walk line strings built in, so we give it a block that generates the object. It's pretty simple: the block is passed the size, and we just use the helper method I defined earlier. Then we have a bunch of configuration options, and the two main points to look at are down here, in the two reports for random walk jarvis and random walk monotone, where we use the other helper method I defined earlier to calculate the convex hull of the object being passed in, with the particular implementation that we want.
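Pulling the pieces together, here's my hedged reconstruction of that setup. The helper names (jitter, random_walk, convex_hull_with) are my own stand-ins for what was on the slides, the Terraformer::LineString constructor and coordinates accessor are assumptions about the gem's API, and the hull routines are the plain-Ruby sketches from earlier:

```ruby
require 'terraformer'

PORTLAND = [-122.6764, 45.5165] # rough [lon, lat] for Portland; my guess

# Benchmarking-only monkey patch, as the talk warns:
# don't do this in production code.
class Float
  def jitter(scale = 0.1)
    self + (rand * 2 - 1) * scale
  end
end

# A random walk of `size` points, each a small permutation of the
# previous point, wrapped up as a Terraformer line string.
def random_walk(size)
  points = [PORTLAND]
  size.times do
    last = points.last
    points << [last[0].jitter, last[1].jitter]
  end
  Terraformer::LineString.new(points)
end

# Stand-in for the talk's implementation-picking helper, dispatching
# to the two hull sketches from earlier in this writeup.
def convex_hull_with(geometry, implementation)
  points = geometry.coordinates
  implementation == :jarvis ? jarvis_march(points) : monotone_chain(points)
end

Benchmark.bigo do |x|
  x.generate { |size| random_walk(size) } # custom generator block
  x.report('random walk jarvis')   { |walk, _| convex_hull_with(walk, :jarvis) }
  x.report('random walk monotone') { |walk, _| convex_hull_with(walk, :monotone) }
end
```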
So what does this look like? Here we can see that the monotone chain, in purple, is well outpacing the Jarvis march, in orange. Since I added the compare option, it also gives me some other cool charts, where, given the data set, it highlights where log N, N log N, and N² would fall, and maps those next to the actual data. Here the orange line is the actual data calculated by the benchmarking code, and all the others are example lines. We can see that in this case Jarvis is performing at just about N log N. The last data point, I think, is a little bit spurious; there's some wiggle in the calculation itself, plus or minus 10% or so. So it's pretty much on N log N. Then we look at monotone, and it's performing better, but it's still about an N log N algorithm. So you might say, well, it's not really that much better; why is it performing so much faster?

That was a random walk, and what I really wanted to do was see if I could completely destroy the Jarvis march. We talked a little bit before about how the Jarvis march performs particularly terribly when every single point in the data set is on the hull. Could I benchmark that and see how it worked? Yes, yes I can. So I created a different helper method that just generates a circle. A circle, conveniently enough, is exactly the kind of object where every single point is on the hull, and it's really easy to create in Terraformer. There's even a method where I give it the lat/long and the diameter of the circle I want, and I can tell it, via the size, how many vertices I want in that circle. Perfect. And this is an example of such a circle.

In the benchmark code, there was very little I actually had to change besides some of the names, which I didn't change on this slide: it should say circle jarvis and circle monotone instead of random walk. You can see up there that I just swapped out the generator to generate these circles instead of the random walk (I'll sketch the swap below). And here is the growth chart. Now we can really see how terribly Jarvis performs in this sort of environment. In this case, monotone almost looks like a completely straight line at the bottom, even though it is not. Again, on the comparison charts, exactly as expected, we see that the Jarvis march in this worst-case scenario tracks N². And monotone performs pretty much exactly the same as before; in fact, you can compare the two charts, and it's pretty much exactly where it was. It's very stable in how it performs, and stability in algorithms can be very, very important. You want to know that your algorithm is going to perform equally well on all data sets, instead of fast on some and slow on others. So this is a much more stable and reliable algorithm for us to be using. Yay, benchmarking! We learned some things.
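And here's the sketched circle swap, with the same caveats. The talk says Terraformer has a circle constructor that takes a center, a size, and a vertex count, but I haven't verified the exact Ruby signature, so treat this as pseudocode made runnable:

```ruby
# Hypothetical circle generator mirroring the talk's helper: every
# vertex of the circle ends up on the convex hull, Jarvis's worst case.
def circle(size)
  Terraformer::Circle.new(PORTLAND, 1_000, size).to_polygon # assumed signature
end

Benchmark.bigo do |x|
  x.generate { |size| circle(size) } # only the generator changes
  x.report('circle jarvis')   { |c, _| convex_hull_with(c, :jarvis) }
  x.report('circle monotone') { |c, _| convex_hull_with(c, :monotone) }
end
```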
As I said before, all the code I went through is up on GitHub, in the Benchmark Terraformer repo. It contains all the helper code, the benchmark code, the little map viewer I used to take screenshots of the maps, all the example charts, the GeoJSON that was generated, and all the raw data. It also includes all of the array benchmark code that I showed at the very beginning, comparing how the regular Benchmark utility, benchmark-ips, and benchmark-bigo each work.

Benchmarking is as much an art as it is a science. It may have sounded like I just went down this path and was easily able to create these charts, but the actual process is a lot longer than that. You come up with an idea and try to benchmark something, and maybe it doesn't give you the results that you need. Maybe you're not benchmarking the right data, or maybe you're not using the right inputs to give you a good view. A lot of this process is really figuring out what you're trying to investigate.

The other thing that's important about science is peer review. I would really love it if other people who are interested in this sort of thing would take a look and make things better, because there's a lot that could be made better. One thing I wanted to do from the very start, but haven't had time for: instead of just drawing the charts with the lines of the different big O complexity designations, what if we did some curve-fitting calculations and the tool just popped out the designation describing the growth of the data set? That would be awesome. That would be amazing. Prettier graphs: I just chucked the data into Chartkick, and that could be way better; other people are far better at that than me. And I'd really like people to verify my math. As you may have seen, I only show the comparison charts for the linear growth cases; I don't yet trust the exponential ones to generate really useful output, so take a look and see if there are better ways to do that sort of thing.

We've talked about some basic CS concepts. What can we learn from science in a broader sense? What if we tried to solve our problems using the scientific method? What might we learn from that process? The first step in the scientific method is, of course, to formulate a question: why is something slow, or broken? Or the opposite: how can I tell that something is successful? Next, we use our observations and knowledge to make an educated guess, a hypothesis. This involves narrowing the scope down ("it's slow because the index is missing"), isolating components ("it's broken because the input data is incorrect"), and rigorously defining our expectations. Then comes the prediction: how will we verify that the hypothesis is actually correct? It's tempting to jump right in and test things at this point, but that's the wrong choice. Putting together a prediction helps you determine whether you're on the right track. Sometimes it might seem easier to just plug in the fix for your hypothesis and test it, but this is still a good step to take. It's about outlining the expectations on the input or output that you're going to write the test against, to verify that the other components are working properly. And you can have more than one prediction: if your prediction is long and rambly, it is not testable. It's better to have a few extremely simple predictions than one very large one. Now that we've isolated some very specific results that we expect to see, we can run our tests and see if we're correct. Predictions should be very small and explicit to test; if yours are not, go back and fix them: narrow them down further, or split them up into a few different predictions, and let your predictions guide your testing. The analysis is where you get to check: do your results match your predictions or not? And finally, finally, you can actually fix what's wrong and write tests to lock in your predictions.
But by this point, you should already know what the problem is, where it is, and why it exists. Or, conversely, you know that things are working, and that you don't have to worry about this area in the future, if you write the tests to lock in your predictions.

So do I feel that I gave up my dream of becoming a scientist? You know, I don't think so. We have a lot of science that can help lead our development: studying complexity and measuring our code, inventing algorithms, creating new things, new databases, many other things. In fact, in many ways I feel I have more opportunity to create new things, to invent things, to study things as a software engineer than I did in physics. And I hope to inspire a little bit of the scientist in all of us. So thank you very much for listening to me talk about science. If you want more, there's my Twitter handle; you can contact me on Twitter. Here is a list of resources for all the things that I used, including Leaflet and Esri Leaflet for the maps, and a cool Leaflet AJAX plugin that helped load in the GeoJSON. That was awesome. And then here's a bunch of other attributions: thank you, Noun Project, and Flickr Creative Commons. Any questions? Links to all of the things; here they are. Any other questions? I can't see. There you go.

So the question is a really, really good one: for the applications of this library, do we know more about what kind of data is going to be coming in? That's exactly why we need to benchmark some of these things. If you know what data is coming in, and it's never going to be the worst case, then maybe you don't have to worry about it. In this particular case, this is a library that's used by a lot of other people, so we don't fully know how other people are going to use it. But the thing about a lot of geo work is that if people are creating circles, or creating other polygons, that's probably a common use case. If you think about the things that get drawn on a map, a lot of those incoming objects are going to be exactly these worst-case scenarios. That's something we expect to receive commonly as input, and that's exactly why the Jarvis march would perform really poorly. There are probably a lot of other applications where the incoming data looks much more like the random walk, with lots of points inside the convex hull rather than on it; that's also still very, very common. I spent a little bit of time trying to think about what the worst case for the monotone chain would be, to see if I could map that, but there wasn't anything as obvious as the circle. And I think the circle is a really good case, because people create circles out of everything, and in the mapping world a circle is created as a polygon rather than as a true center and radius, so those polygons may have a lot of vertices to get that nice circular look. That's something that's going to be commonly encountered. Exactly. And now we have the math to back up the decision-making process there. Any other questions? Thank you.