Did I get that right? I've been practicing all morning. Come on, guys. Debug hard.

Let's write a test with the following assertion. I should have said welcome to the first RubyConf in Thailand, but this was a long time ago. We run this test. Clearly it blows up a few times until we have this piece of code. We run the test again. Life is good. My name is Vishal Chandani. I wrote this test at Def Method in New York City. How's that for a test-driven introduction? Honestly, I had this idea long before the talk itself, on a long train ride, a commuter train to the city.

Enough about me. I'd like to talk about my family. Right in the center is my best friend of 14 years, and we have two little gems of our own. Quick disclaimer, I'd like to get that out of the way.

Every talk needs a story. When I first started learning Ruby, I was fascinated that it's written almost entirely in C. A few years into working in the language, I couldn't help noticing the implementations of the methods, the "click to toggle source" or "view to toggle source" button that some of you have seen in the documentation. I clicked on a few to study their implementations, and with time started making simple comparisons between Ruby and C for simple operations like string reversal. As I explored the topic some more, I learned that certain strings, in particular Unicode strings, don't play well with certain string methods, like the reverse method. So I started exploring, and I found a lot of articles and blogs online that seemed to accept this as a problem and suggested unicode_normalize as a solution. Typically, when you're working in a language that's built on top of, or uses, another language underneath, you tend to suspect the implementation. You're like, no, the bug is down at the C level. That being said, I put on my debugger hat expecting to find my first Ruby bug. I went down the rabbit hole that is reverse in C and honestly couldn't find anything wrong with the way C implemented the solution.
So I stepped back, took a closer look at the inputs being passed into the function, and realized that Ruby was incorrectly, at least in my mind, representing certain Unicode characters. That whole experience changed the way I approach such problems. This talk is aimed at providing a more logical approach to debugging in several languages. Let's get started.

This is a picture of the moth found trapped in a relay of the Mark II computer at Harvard University way back in 1947. I see nodding. That's good. Meet Admiral Grace Hopper. By her own admission, she didn't coin the term debugging, but she used it so often that it became popular. So we have her to thank.

We're going to explore Ruby's String library, in particular its reverse method. Coming to my first programming language ever: C was developed in the early 70s at Bell Labs, which coincidentally is also where I started my career. In the early 70s, it was intended for writing utilities to run on Unix. By the late 70s, it had become so popular that it was used to rewrite the kernel of the Unix operating system. Imagine a language so powerful that it not only writes other languages, but also operating systems. How cool is that?

I needed an environment where I could build Ruby from source, because in order to debug programs you have to modify them. Clearly, I wasn't going to use this pristine Mac and clutter it with different versions of Ruby if the compilation failed. I chose VirtualBox. It was free at the time and, well, worked right out of the box. I chose Ruby 2.5.1 for this exercise, which is available here. Given that Ruby and C are my top two favorite programming languages, I'm still fascinated to see both languages come together in this actual file. Pretty amazing.

The first tool we are going to explore today is grep, a Unix command used to find patterns in files. The reverse function implemented in C is actually called rb_str_reverse.
So the first thing I learned was to grep for that string in the entire directory where I had the Ruby source code. Turns out it lives in this file: string.c, very aptly named. The first version of grep was written overnight by a gentleman named Ken Thompson to help a friend analyze the contents of the Federalist Papers. Speaking of overnight and great ideas on a train, around that same time came this 1974 classic, Murder on the Orient Express. If you haven't seen it, I will not give it away.

Given my belief in test-driven development, I went looking for tests and found a few in this file. We see that they attempt to reverse a few strings like beta, and there's a palindrome in there, and that's fine, but there is nothing more complex, nothing for Unicode, so I didn't really find much.

At a high level, the rb_str_reverse function uses pointers to reverse a string by copying characters from the beginning and the end and switching them. In doing so, it needs to calculate the length of each character that it intends to copy.

The "build Ruby from source" link for Ruby 2.5.1 basically says you have to do three things: configure, make, install. That's all you have to do, right? I wrestled all weekend to get a basic version of Ruby built from source in VirtualBox. Perhaps it was because of the environment I had, but I took notes and I have the commands for later if people are interested.

This is the file I started writing. It basically contains a simple string. We call the reverse method and watch its result. In Raphaël, the little e with two dots, as I called it before setting out on this quest, is really the Latin lowercase e with diaeresis. The two dots are called a diaeresis, and the way it works is that you emphasize the letter it's on even though that letter is preceded by a vowel. So in this case, we explicitly say Raphaël. Other examples include Chloë, the name, or reëntry. So you're emphasizing the letter twice.
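The reversal strategy described above (walk the string, measure each character's length in bytes, copy that many bytes to the mirrored position) can be sketched in Ruby. This is only an illustration, not the C code in string.c: the method name reverse_by_char_length is hypothetical, and the sample string assumes the ë is stored as a plain e followed by a combining diaeresis (U+0308).

```ruby
# Hypothetical helper that mimics the approach described in the talk:
# reverse a string by copying each character's bytes as one unit.
def reverse_by_char_length(str)
  out = "\0".b * str.bytesize       # destination buffer with the same byte size
  write_pos = str.bytesize          # fill the buffer from the end towards the start

  str.each_char do |ch|
    clen = ch.bytesize              # length of this character in bytes
    write_pos -= clen
    out[write_pos, clen] = ch.b     # copy exactly clen bytes, in the spirit of memcpy
  end

  out.force_encoding(str.encoding)
end

# Assumption: the name is in decomposed form (e followed by U+0308).
sample = "Raphae\u0308l"
puts reverse_by_char_length(sample) == sample.reverse
```

The sketch treats the combining diaeresis as its own two-byte character, so it moves independently of the e it was attached to.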
Very few people in this world actually use the diaeresis in their name. Meet Raphaël Xavier Varane. He is a footballer who plays for the French national team and the Spanish club Real Madrid. This part of the talk is brought to you by Emirates.

So we reverse Raphaël. What's wrong with this picture? Right, the diaeresis is on the wrong letter. Turns out this is a known problem. People have accepted this. So let's find out what happened.

The second tool in our tool belt for today is a simple method called chars. Applied to a string, Raphaël in this case, it provides an array of all the characters in that string. Please note, I'm saying characters, not bytes.

At this point, we need to learn a little bit more about Unicode. Unicode is basically a standard that lets you encode, represent, and handle text and symbols from different parts of the world. In other words, characters that you typically cannot type on a keyboard. The little e with diaeresis is one example. Here's another one. This is a fancy version of the Devanagari Om, which also exists in the Unicode standard. Fun fact, I have the Sanskrit version tattooed on my left arm. Wrong shirt for today. The Unicode standard was proposed by a gentleman named Joe Becker from Xerox in 1988. He proposed Unicode as a unique, unified, universal encoding scheme. Around that same time came one of my favorite movies of all time and the inspiration for the title of this talk.

codepoints. Code points basically provide a numerical representation of characters. We call codepoints on the string and use a simple iterator to print the value of each character as well as its hex value. So we see here that the e with diaeresis was split into two characters, the little e followed by the diaeresis. The little e shows up as hex 65, which sounds about right. The combining diaeresis has a UTF representation of hex 308.
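As a rough reconstruction of the chars and codepoints steps described above, here is a minimal sketch. The helper name codepoint_table and the constant DECOMPOSED_NAME are made up for illustration, and the string literal assumes the decomposed form of ë (e followed by U+0308), which is what the output described in the talk implies.

```ruby
# Hypothetical helper: pair every character with its code point in hex.
def codepoint_table(str)
  str.each_char.map { |ch| [ch, format("%X", ch.ord)] }
end

# Assumption: e + combining diaeresis, written with an explicit escape.
DECOMPOSED_NAME = "Raphae\u0308l"

# chars returns characters, not bytes.
puts "characters: #{DECOMPOSED_NAME.chars.length}"

codepoint_table(DECOMPOSED_NAME).each do |ch, hex|
  puts "#{ch.inspect} => hex #{hex}"
end
```

Writing the string with an explicit escape avoids depending on how an editor happens to store ë.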
This one is actually my favorite because it gets us closer to the byte level. Previously we were talking about characters, and if there's one takeaway you get from this talk, it's that a character is not necessarily a byte. Applying the same approach, iterating over each byte, we see that the e stays as hex 65, which is one byte. The combining diaeresis has a two-byte representation of hex CC and hex 88. Looking closer at the UTF representation online, it turns out hex 308 is the UTF-16 version. It's not even UTF-8, so it's confusing on many levels.

Before diving into the implementation, we need to understand a little bit more about this concept of pointers. As the name suggests, pointers are variables that point to other variables' memory locations. They come in different shapes and sizes. A popular picture among many preschools back home. Here's a simple example. We have a string of 25 characters holding hello world. char *ptr denotes a pointer named ptr. The star basically says it's pointing to a variable of type char. We use the for construct to loop through that string and print each character. A simple use of pointers.

printf is a lot like puts. It stands for print formatted. It's a great way to get started and to know where you are in your program. So you can basically pepper your code with printf like you do with puts. Here's a simple example. I used it to determine the length of a character in bytes and to know where I was in the execution. Using some very high-level printfs saying block one, block two, block three, block four, and so on, I was able to zoom in on this piece of code. We see here that the code calculates the length using rb_enc_fast_mbclen and then uses memcpy to copy each character into place using that length. The length is important. That's how it knows how many bytes to copy over. printf has its origins in BCPL, which had a function called writef, in 1966.
The Thephasadin Stadium right here in Bangkok was constructed around that same time, just in time for the 1966 Asian Games. Fun fact.

So printf is great for a high-level view of where you are in the execution of your program. But given the complexity of this function, I found myself in a rabbit hole very quickly, so I needed something a little more powerful. This is where the GNU debugger, gdb, comes in. You invoke it as follows: gdb, space, ruby. Then you set breakpoints, meaning stop at line 5575 of string.c. Then you run debugger.rb and use s to step through. So you're basically debugging by setting breakpoints in the implementation. You do this a few times and you realize that you have to set multiple breakpoints. So I found the next one, which is regenc.c, line 62. And it turns out that that's the file that contains the function used to calculate the length in bytes. gdb was written in 1986 by Richard Stallman. Around that same time, Sumet Jumsai designed the Robot Building. Actually, my dad pointed me to this one. It was built for the Bank of Asia at the time to signify computers in the world of banking. How many of you saw this in the video they showed us on the first day? I saw it sneak by on the bottom left. Thank you, guys.

So we zoom in on mbc_enc_len, which lives in a file called utf_8.c. We've got our debugging code in place. We have a debugger.rb file with a string. We call reverse on the string. And debugger.rb is running against our compiled version of the C code, so you're going to see all your debugging statements come out. Like this. Honestly, this is the most dense slide I have. I apologize. We see here that the lowercase e has a length of one, which is correct. The combining diaeresis actually doesn't even print properly using printf, so it looks garbled, but it's the one that says clen 2 towards the bottom of the slide. Here we realize that C is doing the right thing. We passed in e, and it calculated a length of one.
We passed in a combining diaeresis, and it calculated a length of two. It's doing the right thing. There's no apparent bug in the C code, at least to me.

So I decided to try this hack. And this is where the knowledge of pointers comes in handy. I believed the correct representation, according to the specs, for the e with the diaeresis intact should be hex C3 AB, two bytes. So I decided to look for the lowercase e and overwrite it. Use a pointer to point to s. Set its contents to hex C3. That's what the *vptr does. Increment the pointer. Stuff a hex AB in there. Put the lowercase l back in for good luck. And don't forget the backslash zero to terminate the string and prevent it from going out of bounds. I call this the hack. I also put a printf with "hack" in there to show you what it did. And we see that it actually reported the clen as two bytes, which seems more correct to me. We run our reverse program and this is what we get: a correctly reversed string with the diaeresis, or two dots, whatever you want to call them, in the right place.

Ruby versus C. This is the dilemma I faced at this point in the investigation. Remember this famous arm wrestle scene? Name that movie. Any movie fans out there? Or am I the only one? Yes. No. You guys want a hint? The guy on the right is Arnold Schwarzenegger. Come on, you can do it, Charlie. Yes. Awesome. That's right. Let's talk afterwards. Thank you.

So it turns out there is a known solution, supposedly, for this problem. Everyone says, oh, just use unicode_normalize. It'll take care of all your problems. The anomalies you see with string functions will all go away. Well, let's understand how unicode_normalize actually works before accepting it as a solution. It turns out that Unicode has this concept of equivalence, where some sequences essentially represent the same character. The little e with two dots, sorry, I still call them dots, but the e with diaeresis can be represented in two different ways.
There's the e with diaeresis kept intact, which has a representation of hex C3 AB, like we saw. It can also be represented the way Ruby did it, as the lowercase e followed by a combining diaeresis. But really, they look the same. That's what Unicode terms equivalence. Normalization helps you replace equivalent sequences of characters so they end up with the same code points. Composition, which is the default normalization form for unicode_normalize, basically tends to combine characters. So the lowercase e followed by U+0308, which is the combining diaeresis, gives you one character with a size of one. Decomposition works the other way around. We take the e produced by unicode_normalize, I pass in the nfd option, which means decomposed, and you see that it splits the e into the lowercase e followed by the diaeresis, with the correct size of two.

Here's a side-by-side comparison of the same methods we learned earlier, before unicode_normalize as well as after unicode_normalize. In this case, chars correctly shows us that the little e with diaeresis stays intact. codepoints: on the left, we had 101 and 776. On the right, it stays intact at decimal 235. And with our favorite method, at the byte level: on the left, we have hex 65, hex CC, hex 88. On the right, we have it correctly represented as hex C3 and AB. At this point in the exercise, I had seen hex C3 and AB so much that it was, like, etched into my brain. I think that's going to be my next tattoo, probably on the right arm.

So let's see how unicode_normalize works. We've come to accept universally that it's a solution. How does it really work under the hood? We know that the default normalization form is nfc, so I was able to zoom in on just that specific block of code, where it does a gsub with two arguments. The first is a regex object, and the second is a hash. I tried printing the values of the hash, but it's an eyesore, so I think that's an exercise left as homework for the reader.
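The before-and-after comparison described above can be reproduced with a short script. This is a sketch under the assumption that the original string holds the decomposed form (e followed by U+0308); the helper name profile is made up for illustration, while unicode_normalize, chars, codepoints, and bytes are standard String methods.

```ruby
# Hypothetical helper: summarize a string at the three levels used in the talk.
def profile(str)
  {
    chars: str.chars.length,                        # number of characters
    codepoints: str.codepoints,                     # decimal code points
    bytes: str.bytes.map { |b| format("%02X", b) }  # UTF-8 bytes in hex
  }
end

decomposed = "e\u0308"                       # assumption: e + combining diaeresis
composed   = decomposed.unicode_normalize    # NFC is the default form

puts "before: #{profile(decomposed)}"
puts "after:  #{profile(composed)}"

# :nfd goes the other way and splits the composed character again.
puts composed.unicode_normalize(:nfd) == decomposed

# Normalizing first keeps the diaeresis on the e when reversing.
puts "Raphae\u0308l".unicode_normalize.reverse
```

Normalizing before reversing works here because NFC folds the pair into a single code point, so reverse has nothing to separate.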
I used two variables, m1 and m2, and this is where the C developer in me tends to use shorter variable names. C developers tend to use variables like i, j, k, and so on, so that's the m1 and m2. On the left, we have m1 to signify the match based on the regex, and given that we called unicode_normalize, it looked up and found a match for the e with diaeresis. We see chars, codepoints, each_byte. We've seen that before, right? So far, so good. On the right, m2 is basically looking up that hash with m1 as the key, and it gets back a value that looks exactly the same, but this is the correct e with diaeresis that we want. chars, codepoints, and each_byte now represent this value the way we expect. So that's really how it works under the hood. It's basically looking for certain characters and just replacing them, overwriting them.

So we started with a string called Raphaël, with a diaeresis, and some debugging tools. We ended up with a correctly reversed string. Along the way, we learned some very basic yet powerful debugging tools in Unix, C, and Ruby. Another important takeaway here is to not make any assumptions. When I set out on this quest, I assumed, incorrectly, that the bug was going to be in C. So it's important to not make assumptions and to use the tools to the best of your abilities. Hopefully, this talk inspires you to debug hard. You've been a fantastic audience. Namaste.
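For readers following along, the gsub-with-a-hash mechanism described above can be sketched in a few lines. This is not Ruby's actual normalization code: COMPOSE_TABLE, COMPOSE_PATTERN, and compose_sketch are hypothetical names, and the table holds a single hand-written entry where the real tables cover far more characters.

```ruby
# Hypothetical one-entry lookup table: decomposed sequence => composed character.
COMPOSE_TABLE = { "e\u0308" => "\u00EB" }.freeze

# A regex matching any key of the table (here, just one).
COMPOSE_PATTERN = Regexp.union(COMPOSE_TABLE.keys)

# gsub with a hash replaces each match with the hash value stored under it.
def compose_sketch(str)
  str.gsub(COMPOSE_PATTERN, COMPOSE_TABLE)
end

# m1 is the text matched by the regex, m2 is what the hash returns for it.
m1 = "Raphae\u0308l"[COMPOSE_PATTERN]
m2 = COMPOSE_TABLE[m1]
puts "m1 code points: #{m1.codepoints.inspect}"
puts "m2 code points: #{m2.codepoints.inspect}"

puts compose_sketch("Raphae\u0308l").reverse
```

The two values print identically on screen, which is why inspecting code points or bytes is the only reliable way to tell them apart.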