What is the best algorithm? How do we know if one algorithm is better than another? People say that a binary search is better than a brute-force linear search, but in my hash table video I showed that a binary search was actually slower than a linear search. What gives? Today I'm going to show you software performance, big O notation, and algorithms.

So here we have a Rust program, but inside the compiler it's represented as a string. We want the compiler to tell the user about this syntax error right here, where we use a single quote instead of a double quote. The problem is that the compiler sees this is index 23, but it doesn't know what line number that is. So we want to convert the number 23 into line number two. But how do we do that? Well, it's actually pretty simple. All we have to do is count how many newline characters there are before our error point. In this case, there's only one newline character, which means we're on the second line.

So what code do we need to make this work? Well, all we need is a basic for loop. Our algorithm starts as if we're on line one. We loop through every character before the offset, which is that single quote we talked about. If we see a newline character, we increment our current line; otherwise we keep looping. Finally, when we reach the offending character, we return its line number. Now, Rust programmers love iterators, so let's write it in iterator style instead.

So we tested this code, it works, we ship our compiler. Great, fantastic, users are happy, they have line numbers. So the first lesson today is: just try it. I did the simplest thing that could possibly work, and of course it worked. Even if you can't think of a super elegant algorithm, just try the first thing that comes to mind; it's probably good enough.

But is our algorithm good enough? Let's measure it. So I benchmarked this algorithm with files of different sizes, and here were the results. Hey, wait a minute: whether a file has five lines or a thousand lines, it takes the same amount of time. What's going on here? The problem is that our algorithm's performance does not depend on how big the file is. It actually depends on where the error is in the file. An error at the beginning of the file will take less time than an error at the end of the file, because we have to do fewer loop iterations. So that graph I showed you was actually the best case, where we find an error right at the beginning of the file.

But let's measure the worst case, where the error is at the end of the file. As you would expect, in the worst case our algorithm takes longer. Notice that the worst-case line is pretty much a straight line. If you remember your grade school math, you can write an equation for a line as y = mx + b. In this case, x is the file size, y is how long the algorithm takes, m is the slope, and b is some constant. When we talk about big O notation, we care about the shape of this chart, not the specific numbers. Therefore, we drop the constant factor. It doesn't matter what the slope of the line is either; we just care that it's a straight line. So we drop the slope from our equation, and we end up with y = x, which in big O notation we write as O(x). Now, we don't normally use the variable x, we use n, but it's the same thing. So we say that O(n) is linear time complexity: linear because it makes a straight line. But that's the worst case. What about the best case?
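Before we get to the best case, here's a minimal sketch of what that naive algorithm might look like in Rust, in both for-loop and iterator style. The function names are mine, not necessarily what's on screen in the video:

```rust
/// Convert a byte offset into a 1-based line number by counting the
/// newline characters that appear before the offset.
fn line_number_naive(text: &str, offset: usize) -> usize {
    let mut line = 1; // we start as if we're on line one
    for &byte in &text.as_bytes()[..offset] {
        if byte == b'\n' {
            line += 1; // each newline before the error bumps the line
        }
    }
    line
}

/// The same algorithm in iterator style: count the newlines, add one.
fn line_number_iter(text: &str, offset: usize) -> usize {
    text.bytes().take(offset).filter(|&b| b == b'\n').count() + 1
}
```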
In the best case, we also have a straight line, so we can use our straight-line formula. However, in this case our slope is zero, and because the slope is zero, the x term disappears and we're just left with b. But as I said, in big O notation we don't care about the constant factor, we care about the shape, and we treat all constants as if they were one. So in the best case, our algorithm is O(1), which means constant time complexity: the algorithm's time is constant regardless of the input size. But that's only the best case, which again is when the error happens right at the beginning of the file.

So lesson number two is to check your loop bounds. When you're analyzing your algorithm, don't just see "oh, there's a loop." You have to ask: what is it looping to? Is it looping to the size of the input, or to something else? That affects the time complexity of your algorithm. In our case, the performance is determined by where the error is in the file, not by the total file size.

Lesson number three: stress test your algorithms. Don't just feed your algorithm the best possible case. Try random inputs, try realistic inputs, and try to break your algorithm by giving it malicious inputs. In this case, I tested the very beginning of the file and the very end of the file, and later we'll see tests with random inputs in the middle and such.

All right, here's a table of the algorithms we've looked at so far. We have our naive algorithm, and as we saw, its time complexity in the worst case is O(n), where n is the number of bytes in the file. Now, what's our real-world performance? Well, we don't have a frame of reference yet, so we can't evaluate this. Let's try a different algorithm. But first, a pop quiz: what is the time complexity of the naive algorithm in the average case? We looked at the best case and the worst case, but what is it on average?

Okay, let's talk about another algorithm. So we have our string. What if we precompute some data? What if we find where each line starts? The first line starts here at offset zero, the second line starts here at offset 12, and the third line at offset 40. We put all these numbers into a vector, and then when we need the line number for a given offset, we search the vector for the line that offset would be on. In this case, offset 23 is between offsets 12 and 40, and because it's between those two offsets, it must be on the second line. Well, how do we know it's line number two? We can use the index into the vector. Rust vectors are zero-based and line numbers are one-based, so all we have to do is add one to the vector index and we get our line number.

So how does the search work, exactly? Let's look at a more complicated example. We've made our table already, and we're looking for the line number at offset 80. We start at the left of the vector and compare the number in the vector with our offset, 80. 80 is not less than zero, so we move on to the next one. 80 is not less than 12, so we keep going, and we keep going until we find some number that is greater than our offset, in this case 99. Then all we do is look at the slot to the left of that, and that's where our line starts.

Okay, so let's compare our naive solution with our new table-based solution. The data is clear: the new line table solution is way faster. Let me zoom in on the line table's line along the y-axis, and you'll see that they're both basically straight lines, which means they both have O(n) time complexity.
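Here's a sketch of that line table approach in code: one function builds the table of line-start offsets, and another does the left-to-right scan we just walked through. Names are illustrative:

```rust
/// Build a table of the byte offset where each line starts.
fn build_line_table(text: &str) -> Vec<usize> {
    let mut table = vec![0]; // line 1 always starts at offset 0
    for (i, byte) in text.bytes().enumerate() {
        if byte == b'\n' {
            table.push(i + 1); // the next line starts right after '\n'
        }
    }
    table
}

/// Linear search: scan until we find a line that starts *after* the
/// offset; the answer is the slot to its left. Vector indices are
/// zero-based and line numbers are one-based, so index + 1 is the line.
fn line_number_table(table: &[usize], offset: usize) -> usize {
    for (i, &line_start) in table.iter().enumerate() {
        if offset < line_start {
            return i; // the slot to the left is i - 1, plus one for 1-based
        }
    }
    table.len() // the offset is on the last line
}
```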
So even though both are O(n), one is clearly faster than the other. So lesson number four: big O is not the full story. Remember that when you talk about time complexity, you're excluding constant factors like the slope. In this case, the performance of the line table is a straight line with a much lower slope than the naive solution. Because of this, time complexity is not the same as time. Back on our chart, we record that the line table is also O(n), and that its real-world performance is much faster than the naive solution's.

So let's look at a chart of our line table solution again, but this time, instead of time, we'll plot the number of comparisons, just so the data is less noisy. And let's zoom in on this little section of 500 bytes and less. Now, this doesn't look like a straight line. Counting comparisons should give us exact numbers, so why is it jittery here? The reason this line is jittery and not straight is that our x-axis is file size, but our algorithm's work actually depends on the number of lines, not the file size, and in real-world text files, different lines have different lengths. If we change our x-axis from file size to number of lines, we end up with a straight line, which is exactly what we want. So really, what we want to say is that our algorithm's time complexity is O(number of lines), not O(number of bytes). We need a different variable than n to represent the number of lines. Typically in algorithms you would use the letter m, but I don't like m because it sounds very similar to n, so I'm going to pick capital L instead. So we'll say the time complexity of our line table solution is O(L).

So lesson number five: be careful with the variables you use in your big O notation. Different variables have different meanings. The number of lines in a file is different from the number of bytes in a file, and in fact you have fewer lines than you have bytes, which affects real-world performance. Also note that if you see n, it doesn't automatically mean the number of bytes; it could be something else. So when you write big O notation, make sure you document what your variables mean so other people can understand.

Now, you may have thought: hey, wait a minute, our line table solution has this preprocessing step, but we're not measuring it; we're only measuring the lookup time. That's not really a fair comparison between the naive solution, which needs no prep time, and the line table solution, which does. So let's do a fairer comparison. If you only do one lookup, the naive solution is clearly faster, because the naive solution only needs to scan the text once, while the line table needs to scan it twice and keep some extra bookkeeping. But if we need to do three lookups, the line table saves a lot of work, so it ends up being worth it.

So let's include the time it takes to generate the table in the chart. Because the line table needs to scan the entire input, the table generation's time complexity is O(n). And because the naive solution doesn't need to do any work to prepare a table (it doesn't have one), we say its preparation has constant time complexity, O(1): it doesn't depend on the input size.
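To make that amortization concrete, here's a hypothetical usage sketch that reuses the functions from the earlier snippets: the naive version re-scans the text on every lookup, while the table version pays its O(n) build cost once and then does cheap lookups:

```rust
fn main() {
    // Hypothetical input; any multi-line text works.
    let text = "fn main() {\n    println!(\"hello\");\n    println!('oops');\n}\n";
    let offsets = [23, 40, 57]; // several error offsets to look up

    // Naive: every lookup scans the text from the start again.
    for &off in &offsets {
        println!("naive: line {}", line_number_naive(text, off));
    }

    // Line table: scan the text once up front; each lookup then only
    // searches the much smaller table of line starts.
    let table = build_line_table(text);
    for &off in &offsets {
        println!("table: line {}", line_number_table(&table, off));
    }
}
```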
But that's still not the full story. What about memory usage? A line table consumes memory, right? It's not free. So let's compare how much memory is needed for the line table solution versus the naive solution. As you would expect, the naive solution uses basically no memory, and our line table solution consumes memory proportional to the number of lines in the file, because we store one entry per line.

So lesson number six is to pre-process your data. You can get massive performance wins if you do some work ahead of time, instead of deferring the work to every lookup. However, pre-computation comes at a cost. If you're only going to look up one line number, building the table isn't worth it; but if you're going to do many lookups, the pre-computation pays off. You also have to keep your memory usage in mind: it may be faster, but maybe you don't have enough memory for it. Pop quiz: what's a different way to pre-process the data that can make lookups even faster?

Let's look at our line table again. This time we're going to use a different algorithm to search through it. Instead of going left to right, we're going to use a binary search. What a binary search does is keep track of a range of possible answers. When the algorithm starts, every answer is possible, so our range is the full vector. In a binary search, we check the middle element, so we ask: is 80 greater than, less than, or equal to 69? In this case it's greater than, and greater than means nothing to the left can be the answer, so we set the left end of our search range to the middle element. Then we pick the middle of the new range, which is 110, and compare it to 80. 80 is less than 110, so we set the right end of the search range to just before that element. Now our range only has two items, so we pick the middle (in this case, rounding up), do the comparison, determine that it's less than, and update our range again. Now our range spans only one element, so that must be the answer. That's the binary search algorithm.

Now let's analyze it. Compared to the linear search over the line table, the binary search over the line table does far, far fewer comparisons. If we blow up this line, we'll see it's actually different from the linear algorithm: instead of a straight line, it's more of a curve. A jagged curve, but a curve. The line jumps at 16, 32, 64, 128, and 256 lines. When we go from 64 to 128 lines, which is a doubling, we do one more comparison, and when we go from 128 to 256 lines, another doubling, we also do one more comparison. That's different from a straight line, where we'd go over 10 and up 10 (or over 10 and up 8) every time; here, the jumps come at each multiplication of the x-axis.

Now, what is the time complexity of the binary search? It looks kind of like a logarithm, and that's actually the answer. But our binary search is all jagged; it's not a smooth curve. What gives? Well, the problem is you can't do half of a comparison or a quarter of a comparison. Where the log curve would say 7.3, you have to round up to 8 comparisons. So if I round the log curve up to the nearest integer, you'll see it matches our binary search line exactly. So we would say the time complexity is O(log₂ L), where L is the number of lines. Typically we don't include the base of the log in time complexity, we just say it's a logarithm, so let's omit the 2: O(log L).

So lesson number seven: big O is approximate. We had a bunch of jaggedness; it didn't follow the exact curve of a logarithm. When we talk about time complexity, we're looking at the trend, the shape of the curve.
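Before moving on, here's a sketch of that binary search over the line table, matching the range-narrowing walkthrough above. (In real Rust you could also reach for the standard library's partition_point, which performs the same search.)

```rust
/// Binary search over the line table: keep narrowing a range of
/// candidate lines until only one remains. Returns a 1-based line number.
fn line_number_binary(table: &[usize], offset: usize) -> usize {
    let mut lo = 0;               // leftmost candidate index
    let mut hi = table.len() - 1; // rightmost candidate index
    while lo < hi {
        let mid = (lo + hi + 1) / 2; // middle, rounding up as in the example
        if table[mid] <= offset {
            lo = mid; // nothing left of mid can be the answer
        } else {
            hi = mid - 1; // the answer is strictly left of mid
        }
    }
    lo + 1 // zero-based index to one-based line number
}
```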
Again: big O is not precise. Pop quiz: in the binary search, when we doubled the number of lines, we got one more comparison. But why double, and not triple, or maybe 50% more? Why double specifically?

So let's include the binary search solution in our chart. As we said, the time complexity for the lookup is O(log L). What about the table generation? Well, we're using the same table we made for the line table solution, so it's the same time complexity. But what about real-world performance? We looked at the number of comparisons, but what about time? Let's see how this binary search solution performs on real hardware.

Here's a chart where I compared the two algorithms using random offsets; instead of best case or worst case, this roughly represents the average case. And we can see that the binary search performed way better than the line table solution. Pretty good; our binary search is leaving the line table solution in the dust. But hold on a minute, I notice something here. Let me zoom in. For very small files, the line table actually seems to be a bit faster than the binary search. Binary search is clearly better for big files, but for very small files, the line table is a little bit faster. And while looking at the linear approach, I noticed some optimizations I could do, and with those optimizations it's way faster than binary search. So what's going on? I thought binary search was supposed to be faster. Well, if we zoom out again, we can see that the binary search does beat the line table, even our optimized line table. But for small files, even up to 150 lines, our optimized linear line table solution is competitive.

So lesson number eight: big O is for big inputs. Big O notation and time complexity tell you how your algorithm will scale given huge inputs, and that matters for big data applications and distributed systems. If you expect your files or input data sets to be really big, you really care about big O. But those constant factors we talked about, like the slope of the line, definitely matter at smaller input sizes. Pop quiz: what is the time complexity of the SIMD approach? We know the unoptimized line table lookup is O(L). So what is the optimized version?

Well, let's look at the implementation of the optimized version. Here's the unoptimized version, the line table solution that does a linear scan. You can see that we're looping over the line offsets and doing a comparison: if the offset we're looking for is less than the current line's offset, we return. Here we're using a for loop with enumerate, but I find it a lot easier to work with a plain while loop, so let's refactor it. First we make our own counter variable, and then we turn the for into a while. Now we have this temporary line offsets variable; let's inline that, move the whitespace around a little, and cram this all onto one line. Then what we're going to do is some copy-paste: we copy-paste this line eight times, so now we're processing eight items per iteration instead of one. We've basically unrolled the loop. So let's see how this performs. Here's the chart we showed before, and, well, it's not super fast, but it's definitely faster than the original. So we got some speedup just by copy-pasting code. Probably not a good thing to copy-paste code, but you gotta do what you gotta do. So let's look at this code again; just for the slides, I'm going to condense it a little bit.
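For reference, here's roughly what that eight-way unrolled scan looks like; a sketch of the copy-paste version, with my own handling of the leftover entries at the end:

```rust
/// Linear scan over the line table, manually unrolled to check eight
/// entries per loop iteration.
fn line_number_unrolled(table: &[usize], offset: usize) -> usize {
    let mut i = 0;
    while i + 8 <= table.len() {
        if table[i] > offset { return i; }
        if table[i + 1] > offset { return i + 1; }
        if table[i + 2] > offset { return i + 2; }
        if table[i + 3] > offset { return i + 3; }
        if table[i + 4] > offset { return i + 4; }
        if table[i + 5] > offset { return i + 5; }
        if table[i + 6] > offset { return i + 6; }
        if table[i + 7] > offset { return i + 7; }
        i += 8;
    }
    // Handle the last few entries that don't fill a block of eight.
    while i < table.len() {
        if table[i] > offset { return i; }
        i += 1;
    }
    table.len()
}
```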
And what we want to do is use SIMD. SIMD stands for single instruction, multiple data, which means we can do things like a whole batch of less-than comparisons all at once. But SIMD does not let us do multiple returns at once, so we need to separate the returns from the less-than comparisons. We're going to move all the returns below all of our ifs. This code is functionally identical. Oh, I forgot the commas. So let's measure it. It looks like on average this is a little bit slower, but don't worry, we'll make it faster soon.

Let's condense this code a little for the slide and move some whitespace around. Now we start actually using SIMD. First we need to load the data from our vector, and to do that we use Simd::from_slice to load eight items at a time. Next we have this less-than comparison, but we want to do it with SIMD, so we use simd_lt. But simd_lt works on two SIMD values, so we need to convert our offset number into a SIMD vector containing that offset in every lane.

So let's look at the performance of the SIMD version. It's... still kind of bad. What's going on? Well, the problem is that our bottleneck is not the less-than comparisons. We just refactored the less-than comparisons to use SIMD, but the real problem is all these ifs down at the bottom. Right now, when we do the less-than operation, we immediately convert the result into an array. Let's split that into two separate steps, so now we have access to this mask variable. The mask has a super handy function called any, so let's use it to say: if any of the lanes matched, do all the if statements; if none matched, skip all of them. Let's see how that performs. Now we're talking. We're still not faster than the binary search, but we're much faster than the unoptimized line table solution.

As I said, it's the if statements that are the problem. So how do we get rid of all eight of them? What we're trying to do is find the first true in our mask and return its index, and we can do that without converting the mask into an array: to_bitmask followed by trailing_zeros. What this does is count how many falses there were before the first true. We use that answer instead of a bunch of if statements, and when we measure performance, now we're beating binary search, at least up to a hundred lines.

So lesson number nine: SIMD is awesome. You can get major performance speedups with SIMD. We all know that threads unlock a large part of your CPU's processing power: on an eight-core system, you're wasting over 80% of your CPU if you're only using one thread. SIMD is similar: you're wasting a comparable fraction of your CPU if you're only using scalar values. Now, compilers can sometimes automatically vectorize your code to use SIMD, but it's pretty rare, so it's worth checking whether you can apply SIMD to your own algorithms.

So let's look at our table of algorithms again. Our optimized linear line table solution is faster than the unoptimized version, but for big file sizes the binary search still wins out. The optimized version uses the same line table, so the time complexity for generating it is the same, and the memory usage is the same. We'll get to its lookup time complexity in a bit.
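Putting those steps together, here's a sketch of the final SIMD lookup. It assumes Rust's nightly-only portable SIMD module (std::simd); the details are my reconstruction of the steps described above, not the video's exact code:

```rust
#![feature(portable_simd)] // nightly only; goes at the crate root
use std::simd::prelude::*;

/// SIMD linear scan: compare eight table entries against the offset at
/// once, then use the mask's bitmask to find the first match without a
/// chain of if statements.
fn line_number_simd(table: &[usize], offset: usize) -> usize {
    let needle = Simd::splat(offset); // the offset in every SIMD lane
    let mut i = 0;
    while i + 8 <= table.len() {
        // Load eight line-start offsets into a SIMD vector.
        let starts = Simd::<usize, 8>::from_slice(&table[i..i + 8]);
        // One comparison instruction: which starts exceed the offset?
        let mask = needle.simd_lt(starts);
        if mask.any() {
            // trailing_zeros counts the falses before the first true.
            return i + mask.to_bitmask().trailing_zeros() as usize;
        }
        i += 8;
    }
    // Scalar loop for the remaining entries.
    while i < table.len() {
        if table[i] > offset { return i; }
        i += 1;
    }
    table.len()
}
```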
Now for the pop quiz answers. Pop quiz number one was: what is the time complexity of our naive algorithm in the average case? If we look at the naive algorithm again, we see that we loop up to the offset, and on average we'll be looping up to about half the length of the text; with a uniform distribution, the average offset is right in the middle. So what's the time complexity of this loop? Half the text length is still proportional to the text length, so the time complexity in the average case is still O(n), matching our worst case.

Pop quiz number two was: what's another way to pre-process the data that could give us a more efficient algorithm? With our line table solution, we took the start of each line and put it in a vector. But what if, instead, we just precompute all of the answers? Then looking up an answer is a simple vector lookup, and if you remember, a vector lookup (in other words, an array lookup) takes a constant amount of time; there are no loops involved. So the time complexity of the precomputed solution is O(1), and compared to all of our other algorithms, an array lookup beats the pants off any kind of loop. But because we need to create a table, we still pay the cost of generating it, which takes O(n) time. And unfortunately, the memory required for the table is O(n), which is much bigger than O(L): the table has to contain one entry per byte in the input, not just one per line, so it ends up being maybe eight times bigger than the entire file. This is the classic memory-time tradeoff people talk about: you can make your algorithm super fast by precomputing all the answers, but precomputing all the answers can take a lot of memory.

The third pop quiz was: in the binary search algorithm, why does the number of lines need to double for the number of comparisons to increase by one? If you remember the explanation of the binary search, once we make a comparison, the rest of the algorithm only looks at half the data; after another comparison, half of that data, and so on. Looking at this chart: after one comparison, we're down to half the number of lines; after another, half again. Since each comparison halves the remaining input, one extra comparison handles a doubled input; doubling is the inverse of halving, which is why it's a factor of two and not, say, 50%.

The fourth pop quiz was: what is the time complexity of the SIMD line table solution? As a reminder, here's what the chart looks like. We can draw a straight line, but it doesn't look like a great fit. If we scale the data up to 50,000 lines and then draw a trend line, though, it's much more clearly a straight line. Remember, big O notation is about how the algorithm scales, not how it performs at small data sizes. So the time complexity of our SIMD line table solution is also O(L), linear in the number of lines.

I hope you learned something about big O notation or SIMD or how compilers work or something. My last video was really popular, so I decided to make these Rust stickers. You can buy them on my girlfriend's shop. They come in the thick and the extra thick variants. Class dismissed.