So I'm a software engineer at Airbnb (did not mean to do that), and I'm going to talk to you today about Ruby. In particular, I'm going to talk about a time that Ruby lied to me. Probably many of you have had this kind of experience before; Ruby is a fickle language to have fallen in love with. But I'm going to tell you a story about what happened to me.
So a coworker and I were arguing about an algorithm, as you do. This is an artist's depiction of people arguing about an algorithm: right there is me and that's him. I'm the fashionable one. It started with us talking about a pretty classic problem; some of you might have had it as an interview problem or something like it. How do you generate all the substrings of a string? Give it a string like "hello". All the substrings of "hello": you start with the string itself, then all the substrings of length 4, the substrings of length 3, length 2, and length 1, and all of these together in a set are the substrings.
So how do you generate those? Well, it's pretty straightforward, like I said. If you just look at the substrings of length 4, you can see that one starts at i equals 0 and ends at j equals 3, and the other is i equals 1 and j equals 4. The insight you get here is that each substring is uniquely defined by a start index and an end index. As long as you have those two, you have a unique substring, and as long as you generate all unique pairs of start and end indices, you have all the substrings.
You can turn that into code using Ruby pretty easily. We're going to write a method called substrings, and we're going to start with each_with_object so we can collect all the substrings inside this nested loop. We go from 0, being i, all the way to the end of the string. Inside there we do another loop for j, which is the end index, and that starts at i and goes to the end of the string, and
then we just shovel into our array of substrings the substring from i to j. So, pretty straightforward. What you can see just from the structure of this code is that we have these two loops, this i and this j nested inside each other, so there are going to be quadratically many such pairs of indices, and therefore the inner loop runs O(n^2) many times. For any of you who are remotely familiar with big O notation, this should be pretty straightforward.
So I say this algorithm is O(n^2). But what about what's inside the loop? We kind of glossed over that for a second. We looked at the structure of the loops, but look at this part right here. We know that shoveling into an array is O(1); hopefully you know that, and if you don't, it is, amortized. But how long does it actually take to build this substring, to take this string from i to j?
For simplicity, we're going to assume fixed-width strings. Even if they're not fixed width, this all still holds true; it's just going to make it simpler. Also, Ruby treats strings shorter than 24 characters differently, but for large n we can ignore that. So with these caveats in mind, let's look at how strings are represented in Ruby.
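For reference, the substrings method described above can be sketched as follows. This is a reconstruction from the spoken description (each_with_object, an outer loop over i, an inner loop over j, and a shovel); the exact code on the slide is not part of the transcript.

```ruby
# Collect every substring of `string`, one per (start, end) index pair.
def substrings(string)
  (0...string.length).each_with_object([]) do |i, subs|
    (i...string.length).each do |j|
      subs << string[i..j] # the slice whose cost the talk goes on to examine
    end
  end
end

p substrings("hello")
```

The nested loops visit n(n+1)/2 index pairs, which is where the O(n^2) count of substrings comes from.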
So Ruby obviously gives a lot of abstraction over what's really going on at the machine layer, but in reality what a string really is, is just an array of characters. If you imagine it's an ASCII string, it's an array of ASCII-encoded characters, each one byte, and they just sit next to each other in memory, which is essentially what it means to be an array. So it's just adjacent characters somewhere in memory. To take the substring ELL and copy it somewhere else into a new string, we have to go to a whole new place in memory and, one by one, copy each character into these new cells in memory, and that's where string 2 now lives.
So, obviously, copying all the characters of a substring one by one to a new place in memory must take linear time in the number of characters. But that's linear in the length of each substring. It's one thing to say, yes, copying a substring takes linear time, but that's linear relative to that substring, and if we don't know how large the average substring is, then we don't really know how long it'll take to collect all the substrings in an array.
So how long is the average substring? It's actually not obvious, just looking at this pyramid, how long the average substring really is. It could be O(1): maybe the substrings are dominated by these really small ones (I don't know why this keeps moving back and forth), the ones at the bottom of the pyramid, and there are so many of those that on average it's O(1). Maybe it's logarithmic for some reason, because that's what most people assume is the only runtime between O(1) and O(n). Or maybe it's O(n); maybe it just scales linearly. It's not obvious. So let's figure it out. Let's use Ruby and actually compute it.
So I'm going to grab the substrings code that I wrote earlier, and I'm just going to have this global function that I can call, called substrings. And I'm going to write this method called average substring ratio. Given an original string length, it's going to compute all the substrings and see what the average substring length was, and what the ratio of that is to the original string length. I essentially want to know how this ratio changes.
The string I'm going to be building is just a series of a's, as long as the size you give me. To get the substring lengths, I'm going to call substrings, the method we wrote earlier, and map them all to their lengths. Finally, to get the average substring length, pretty straightforward: add them all together and divide by the count. The return value for this method will just be the average substring length over the original length. I want to know that ratio.
So what I'm going to do now is count from 1 to 150, stepping by five. So, you know, 1, 6, 11, whatever. And I want to see what the ratio of the average substring length to the original string length is. So let's do it. Here we go. Okay. Let's get down there. All right: .343, .339, okay, .338. It's getting lower and lower, and what you can see is that it's converging on one third. You can also prove this mathematically, which I'm not going to do because it takes some time, but you can pretty easily see just by induction here that this number is getting really close to one third. So the average substring length actually grows linearly with the original string. Maybe somewhat counterintuitive, but true. And therefore this copy right here that happens inside this loop is O(n).
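The experiment described above can be sketched roughly like this. The method name average_substring_ratio follows the spoken description; the body and the output formatting are assumptions, since the slide code is not in the transcript.

```ruby
# Same substrings method as described earlier in the talk.
def substrings(string)
  (0...string.length).each_with_object([]) do |i, subs|
    (i...string.length).each { |j| subs << string[i..j] }
  end
end

# Ratio of the average substring length to the original string length.
def average_substring_ratio(n)
  lengths = substrings("a" * n).map(&:length)
  average = lengths.sum.to_f / lengths.count
  average / n
end

# Lengths 1, 6, 11, ... up to 146, as in the talk.
1.step(150, 5) do |n|
  puts format("%3d  %.4f", n, average_substring_ratio(n))
end
```

A short calculation that is not in the talk agrees with the trend: there are n(n+1)/2 substrings with total length n(n+1)(n+2)/6, so the average length is (n+2)/3 and the ratio is (n+2)/(3n), which tends to 1/3 as n grows.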
And if that copy is O(n), then the whole thing is O(n^3), is what my colleague said. That is actually what my colleague looks like. So basically I was wrong. But not so fast. I knew something he didn't know. What I knew is COW, which is copy-on-write. That's what COW stands for. I guess there's like an animal theme in this talk.
So what is copy-on-write? Copy-on-write is a kind of structural sharing. If you've seen functional languages, they use a lot of structural sharing. But I'm going to show you what copy-on-write looks like in the context of strings. Let's say we have this string, "hello", and we want the same substring as before, from 1 to 3: we want the string ELL. We're still going to say the same thing, string 2 equals string 1 from 1 to 3. But what copy-on-write will do, instead of copying everything into new cells in memory, is (whoops) create a shallow object. All the shallow object has is a pointer to the first index. (Oh man, all right. I'm going to stop using that clicker. I'm just going to use manual clicking.) Yeah, so this is a shallow object: it holds a pointer to the index at which it starts, and just the length. (Okay, I think it's auto-advancing. I think that's the reason why it's doing that.) Cool. So that makes sense. We all followed that, so did I. Great.
So here's a proof. I want to show you proof that it's actually doing this. I'm going to grab this library called display string, credit to Pat Shaughnessy for writing it. It's a C binding that basically lets me inspect the struct that makes up a string in Ruby. So I'm going to grab this debug object, which is exposed by this little C extension that he wrote, and I'm going to create the string that is the alphabet. So it's just a to z joined together. And now I'm going to dup the string.
So if you're familiar with dup, it's just cloning, right? It's creating another copy, the exact same string, and that should go in a new place in memory. So now I'm going to display the first string. It's going to have this RString, which is basically a pointer to where the object itself lives, the object that has all the pointers to other stuff in memory. Those are all going to be unique; that's essentially what defines an object ID. Then there's a pointer to an actual string somewhere in memory. This is where the array lives that holds the corresponding characters. Printing that out, it corresponds to this string here: a, b, c, d, e, f, g, whatever. And it's got a length. Cool.
Now I'm going to look at string 2. String 2 has a different RString. You can see that the addresses are different for the actual objects, meaning they're different objects. They have different object IDs. But if you look at where the pointer points, it's actually the same. They're pointing to the same string in memory, the same array of characters. No actual copy to a different place in memory occurred when I called dup on this string.
Now let me show you that the same thing is true for substrings. I'm going to take the substring from 1 to negative 1, meaning I lop off the first character. And I'm going to display the strings again. First string looks the same. Second string, still a different object ID. But what you can see is that it's actually the same string in memory, offset by one, because the second string is b, c, d, and so on. (What just happened? Okay. Cool. Let's click through that again. This is going very well. All right.)
So what happens if either string gets mutated? Clearly this is not going to work if one of the strings changes and they're both pointing to the same place in memory. So let's try that. Let's take string 1 and assign into it. Let's throw an ampersand in there where the B is supposed to be.
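The setup for this experiment can be tried with core Ruby alone. The sketch below only shows what each string contains after the write; it cannot show where the bytes live in memory, which is what the C extension in the demo reveals. The method name mutate_after_slicing is made up for illustration.

```ruby
# Build the alphabet, slice off the first character, then overwrite the "b"
# in the original. Returns both strings so they can be compared afterwards.
def mutate_after_slicing
  str  = ("a".."z").to_a.join
  str2 = str[1..-1] # the tail of str, starting at "b"
  str[1] = "&"      # write to the original after the slice was taken
  [str, str2]
end

original, slice = mutate_after_slicing
puts original
puts slice
```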
And of course string 2 starts with a B, so that's going to throw a wrench in all of our plans. So let's watch Ruby break. First string looks right. Okay, cool. And then the second string: actually, what you can see is that the second string looks fine. It starts with a B, and it's in a totally different place in memory. So the write forced a copy. By writing to the first string, the second string had to go and copy all the characters one by one into a new place in memory so it could live in its own place and not get screwed over by string 1. So cool. Ruby's doing its job. It knows what to do. That's encouraging.
So again, looking at this shallow struct here, this thing that points to the old string. One thing to note is that string 1 must hold onto callbacks. Meaning, as string 1, I need to know who depends on the integrity of my data, so that if I change, I can let these other people know: hey, you guys need to go in and copy what I own. Otherwise, you know, stuff's going to break. So when I go to mutate string 1, string 1, before it makes the mutation, is going to block, and then it's going to tell string 2, or all the strings that are waiting on it: hey, copy my stuff, I'm about to break. And string 2 goes in and says: okay, ELL, there we go, I've got my stuff, you can go ahead and change. And string 1 is now written to. So that's the copy-on-write optimization.
So this is a shallow copy, meaning this is actually O(1). It's just creating a shallow object that points to the original object. It's not copying everything one by one. And therefore this whole thing takes O(n^2) time. Therefore, you can do the substrings problem in O(n^2) time. I was right. Case closed.
So a couple days later my colleague got back to me and he sent me this. We're requiring substrings. We're using the benchmark module here, or the benchmark library. We're going to create a big string.
So this is eight characters multiplied by 128. So that's a lot of characters: 1024. And then another string that's twice as long, so 2048. And we're going to use Benchmark.bmbm. If you're familiar with Benchmark, bmbm is actually really, really nice. If you use Benchmark.bm, which is the standard benchmarking method, it'll just run your things and count the time. But if you use bmbm, it'll actually run twice: it'll do a rehearsal and then it'll run the actual things that you're benchmarking. Which is great, because it warms up Ruby, so you don't get any false information because of caching or garbage collection or whatever. It's a fair trial.
So using bmbm, the first thing we're going to check is how long it takes to generate all the substrings of string 1. Then we're going to check how long it takes to generate all the substrings of string 2. And we're going to run this here. (I'm sorry, I just missed a very important part.) Finally, what we're going to do is check how much longer it takes to run substrings on string 2 compared to string 1. What's that ratio? And remember, string 2 is twice as long as string 1. So with the string twice as long, how much longer does it take to run this algorithm?
So let's run this. We get 0.33, and then for the second one we get 2.6. So the growth ratio (whoops, I am not good at this), the growth ratio is 7.689. If you look closely at that, it's interesting, because when the input doubled, the time it took to run this algorithm grew by a factor of about 8. A quadratic algorithm should have grown by a factor of 4; a factor of 8 is what you'd expect from a cubic one. So the algorithm doesn't look very quadratic. It looks decidedly not quadratic. So what's going on? Why did this happen? Why has Ruby betrayed me? Let's go and see. Here's what we're going to do. We're going to run a benchmark. We're going to run it 100,000 times.
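A rough reconstruction of the doubling benchmark described above is sketched here. The helper name growth_ratio, the use of wall-clock (real) time for the ratio, and the base string "abcdefgh" are assumptions. The multiplier is scaled down from the talk's 128 to 32 so the sketch runs quickly and with modest memory; the talk's sizes were 1024 and 2048 characters.

```ruby
require "benchmark"

# Same substrings method as described earlier in the talk.
def substrings(string)
  (0...string.length).each_with_object([]) do |i, subs|
    (i...string.length).each { |j| subs << string[i..j] }
  end
end

# Time substrings on a string and on one twice as long, and return how many
# times longer the second run took (assumption: measured by wall-clock time).
def growth_ratio(base, multiplier)
  str1 = base * multiplier
  str2 = base * (multiplier * 2)
  t1, t2 = Benchmark.bmbm do |x|
    x.report("n = #{str1.length}") { substrings(str1) }
    x.report("n = #{str2.length}") { substrings(str2) }
  end
  t2.real / t1.real
end

puts "growth ratio: #{growth_ratio('abcdefgh', 32).round(3)}"
```

Doubling n should multiply the time by about 4 for an O(n^2) algorithm and by about 8 for an O(n^3) one. The number printed will vary by machine and by Ruby version.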
We're going to do the same thing: get some huge string. This is a really big string, I don't remember how long. This is a string that's twice as long. And now what I'm going to do is benchmark just substringing. Just that. Like, I thought, I was pretty sure Ruby did the copy-on-write optimization. Maybe I'm crazy. Let's check it out. So we're going to take a substring that should be copy-on-write for both of the strings. We're going to do it a crap load of times and see what happens. And what we get is that even for the string that was 10,000 characters versus 5,000, they took around the same amount of time. In fact, implausibly, the second one was actually slightly faster. So, you know, that really looks like the copy-on-write optimization.
So what's going on? After a little fiddling around, I tried this: string from 1 to negative 2, instead of to negative 1. If you try this on both strings, what you get is this. That's very strange. It turns out that only substrings that include the last character are copy-on-write. And of course, the vast majority of substrings do not include the last character. So on average, this substring will take linear time, because on average j won't be the last character in the string. And so this whole thing will be O(n^3). It was all a lie. So Ruby has once again betrayed me.
But I'm not done. So I go to this repository, which you may be familiar with. Inside this repository there's a file called string.c, which naturally is 10,000 lines of C code. Delving in here, I was really trying to figure out what exactly Ruby is doing such that it only does the copy-on-write optimization for strings that include the last character. And after trawling through the code for a while, I came across this line here. If you're not familiar with C, this is a macro that basically says: define SHARABLE_MIDDLE_SUBSTRING as false. Okay.
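The slicing benchmark described above can be sketched as follows. The 5,000 and 10,000 character sizes and the 100,000 iterations come from the talk; the helper name slice_benchmark, the labels, and the use of bmbm for this second benchmark are assumptions.

```ruby
require "benchmark"

# Time 1..-1 slices (which include the last character) against 1..-2 slices
# (which do not) on a string and on one twice as long.
def slice_benchmark(iterations = 100_000)
  huge  = "a" * 5_000
  huger = "a" * 10_000
  Benchmark.bmbm do |x|
    x.report("5k   [1..-1]") { iterations.times { huge[1..-1] } }
    x.report("10k  [1..-1]") { iterations.times { huger[1..-1] } }
    x.report("5k   [1..-2]") { iterations.times { huge[1..-2] } }
    x.report("10k  [1..-2]") { iterations.times { huger[1..-2] } }
  end
end

slice_benchmark
```

If slicing is a constant-time shallow copy, the 5k and 10k rows should take about the same time; if it copies characters, the 10k row should take roughly twice as long. The talk reports the first pattern for 1..-1 and a different result for 1..-2 on the MRI build used at the time; other Ruby versions may behave differently.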
All right. Why not? Okay. Now let's compile Ruby, I guess. Okay. Well, I mean, I've got to be right. So let's try this. I'm going to run make install on this new version of Ruby. It does a bunch of stuff. This is all the stuff it does; I kind of cut it off, there's more. But once that's done, I have a custom version of Ruby in my /usr/local/bin. You can see here it was made by me on October 23rd. And if you look at the version, it's dev, off the trunk. It's my version of Ruby that has this new feature just for me, so I can prove other people wrong.
So let's now go back to this benchmark. We're going to run the same thing, huge string from 1 to negative 2, and see if it does what we expect. Running the benchmark again, we get some rehearsal, then 0.0204, 0.0203. Boom. Ruby is now doing the copy-on-write optimization on all substrings. And this bad boy finally takes O(n^2) time. Okay. Thank you.
Okay, but you have to wonder: why was it doing that? Why was that the default behavior? That's kind of weird, other than to just try to trick me. So what's actually going on, if you look a layer deeper, is that Ruby, or at least MRI, is of course implemented in C. All of what I'm talking about applies to MRI; it doesn't necessarily apply to other implementations of Ruby. In C, all strings end with a null terminator, or a null byte, which sits at the end of the array of characters. This is how C knows that it's reached the end of the string. If you ask for the length of a string in C, it will just read until it finds a null terminator. So if you do this, string 2 equals string 1 from 1 to 3, you'll grab this ELL string, right?
But if I pass a substring that did not include a null terminator, like this ELL string, into a library or an extension written in C, and it tried to get the length or tried to read the string, it might keep reading bytes until it finds a null terminator, which means it's going to go all the way to the right and I'm going to get ELLO instead of ELL. So essentially, this behavior guarantees that any C extensions are going to treat Ruby strings correctly, because creating a new string also creates a new null terminator that sits in the spot where the string should end if it were a C string. So that's it. Mystery solved. Finally done. We now have an O(n^2) algorithm for substrings. So that's pretty cool.
Except for one problem, which might have already entered some of your minds; if it hasn't, then consider this one thing. Remember where we started? We started by talking about this classic algorithm of generating all the substrings of a string. So we have to actually generate all these substrings. But did we actually generate them? Imagine we do this: we puts substrings of "hello". Now, of course, it takes linear time to print each substring. We have O(n^2) many substrings, and each of them is linear in the length of the original string. So printing all the substrings will still take cubic time, right? In fact, if you wanted to serialize that down a wire and send it to somebody who's asking us what the answer is, that would also take cubic time. So in what sense is this algorithm actually O(n^2)?
If you think about it, the whole idea of copy-on-write is a kind of laziness, in that it doesn't do the write until, sorry, it doesn't do the copy until it has to. So what we've created are lazy strings. Instead of making these, which is what we were originally looking at, H, HE, HEL, HELL, HELLO, all these substrings, what we've instead created are these.
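The slide at this point is not in the transcript. One way to picture the idea is a list of index pairs that is only turned into real strings when someone asks for the characters. The method names index_pairs and materialize below are hypothetical, and this models the idea rather than MRI's internal representation.

```ruby
# All (start, end) index pairs for a string of length n: O(n^2) small arrays.
def index_pairs(n)
  (0...n).flat_map { |i| (i...n).map { |j| [i, j] } }
end

# Turn the pairs into actual substrings. This is where the characters get
# copied, at a cost proportional to each substring's length.
def materialize(string, pairs)
  pairs.map { |i, j| string[i..j] }
end

pairs = index_pairs("hello".length)
p pairs.first(5)
p materialize("hello", pairs).first(5)
```

Building the pairs is the quadratic part; printing or serializing the materialized strings is what brings the cubic cost back.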
All we've really done is reduce the problem to building each pair of indices. The Ruby array that gets returned by substrings, even though all of its elements will tell me they're strings if I call .class on them, doesn't actually contain the substrings. It's just a clever and lazy way of expressing the substrings. It's lies all the way down.
That's it. Thanks for listening. You can follow me on Twitter at adhaseeb. And if you want to find the code for some of this, it's up on GitHub. Thanks to Ned, David, and Pat for helping me out with the talk. And thanks for bearing with me. Sorry about all the clicker troubles. I didn't realize it was going to do that. So, any questions?
So let me repeat that back, and you can correct me if I'm wrong. You're saying, if I have string 1 equals some string, string 2 equals string 1, string 3 equals string 1? In that case, what you're describing, they're all literally pointing to the same object in memory. So none of those are different objects. If you use string 1 dot dup, is that what you mean? Okay, okay, cool. So if string 2 and string 3 are both string 1 dot dup, and you then mutate string 1, then string 2 and string 3 are going to... oh, so I guess the question you're asking is: are they each going to copy their own versions into memory, or are they going to be smart and share one copy? I don't know. I would guess they're each going to copy their own version, just because I'd imagine it's a pretty rare situation to optimize for and potentially complex. But I don't know for sure. So that's an interesting question. We could probably find that out.
All right. Thanks everyone for listening. I really appreciate it.