Hi, everyone, welcome. Thanks for being here today. My name is James Saryerwinnie. I'm a software development engineer at AWS, and I work on Python tooling there. Some of the things I work on include Boto3, which is the AWS SDK for Python; the AWS CLI; and AWS Chalice, which is a serverless microframework for Python. We did a tutorial on Chalice earlier this week at EuroPython, and there's a link we tweeted out if you're interested. I'm going to use some of those projects as examples throughout the talk, and I'm going to show you some techniques for debugging when the more traditional techniques, like print statements and log statements, don't really cut it. At first you might think: is debugging really that important? How much time do we actually spend debugging? I tried to find an answer to this, and it really depends on who you ask and which study you look at. I found one that put it as high as 49.9% of our time, which seems pretty high. Another study put it at 30% to 40%. What I would really say is that I think we spend more time debugging than we expect; that's certainly the case for me. And there was an interesting study that looked at novice computer science students, comparing how well they did in a course, which is what they considered being a good programmer, with how well they did at debugging. The interesting result was that the two weren't necessarily the same thing. Just because you were good at programming and got a good grade in the class didn't necessarily mean you did well when specifically tested on debugging, although there was a much stronger correlation the other way.
So my takeaway from this study is that debugging, while related to programming, is a separate skill that's worth studying and practicing, and worth surveying the field for other techniques that can make us more effective debuggers, because that will ultimately make us more efficient developers. In other words, I think debugging matters. When you look at the traditional forms of debugging, things we've probably all done before, there's print- and trace-style debugging, where you annotate your source code with print statements, print out variables, and print out which code paths you take. That's probably the thing we do the most; I do it a lot. There's also postmortem debugging, which we don't really do a whole lot of in Python. And then there's interactive debugging: if you're familiar with some area of the code and you know that's where your problem is, you can set a breakpoint, using the breakpoint() built-in added in Python 3.7, or the PyCharm debugger, or whatever your IDE provides, and step through your code to see exactly what's going on. In this talk, what we're going to look at is some techniques for when print and trace debugging aren't enough. We'll look at visualizing what our program is doing to help us build a better understanding of our code. In that same study, one of the things they cited as preventing people from being good debuggers is the difficulty of forming a mental model of what the code is doing. A lot of the time, you either don't have a mental model of how the code is supposed to work, or your idea of how it works is actually wrong, and that's where a lot of bugs come from. I think data visualization helps with that: it gives you a quick overview of what your code is doing and helps you troubleshoot faster. Now I'm going to show you three visualizations.
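As a quick aside on the interactive style mentioned above: the built-in referred to here is `breakpoint()`, added in Python 3.7. A minimal sketch of dropping into the debugger at a suspect spot (the function itself is made up for illustration):

```python
def sum_second_column(rows):
    # Sum the second element of each row. If this were misbehaving,
    # uncommenting the breakpoint() call (Python 3.7+) would pause
    # execution here in pdb so you can inspect `row` and `total`.
    total = 0
    for row in rows:
        # breakpoint()  # uncomment to step through each iteration
        total += row[1]
    return total
```

With the call uncommented, running the function drops you into pdb, where commands like `p row`, `n`, and `c` let you inspect state and resume.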
We're gonna start from a fairly easy visualization and go in order of complexity, and for each one I'm gonna give you a real-world example of where I used the technique to troubleshoot a bug I was working on. One thing to keep in mind: just like the log or print statements you add to your code, none of this is meant for production code. When you're done with it, you throw away the changes you've made, the same way you hopefully remove your print statements before you commit and push your code. The visualizations aren't meant to be perfect, either. The intent is just to help us figure out what's going on, because the real goal is to fix the bug, or verify your design, or something else other than the visualization itself. With that in mind, there's a lot I wanna show, so we're just gonna jump into the diagrams. The first and simplest one is a sequence diagram, and if you haven't seen one before, this is an example of what it looks like. You read it left to right, top to bottom. In this example there are two requests being made from a browser: you can see we send a request from one object to another, and then in the second POST we send a request from the browser to the web server, the web server sends a request to the database, and you get a response back. You can draw these by hand, in Keynote or PowerPoint or Visio or whatever you use for your diagrams, but you can also generate them programmatically, or write them out like this. This is a DSL for creating a sequence diagram, and the way it works, you can see it's the same diagram as before: you have the browser with an arrow to the web server and an optional label, and then the same thing in reverse.
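As a rough sketch of what that DSL looks like, here's the two-request browser example written out from Python (the message labels are illustrative, and the actual rendering step assumes the `seqdiag` package is installed, so it's left commented out):

```python
# The seqdiag DSL for the two-request browser/webserver/database example.
DSL = """seqdiag {
  browser -> webserver [label = "POST /blog"];
  browser <- webserver;
  browser -> webserver [label = "POST /comment"];
  webserver -> database [label = "INSERT comment"];
  webserver <- database;
  browser <- webserver;
}
"""

def write_diag(path, text=DSL):
    # Write the DSL to a .diag file. Rendering to PNG then requires
    # `pip install seqdiag`, which installs a command-line tool:
    #   seqdiag <path>   produces a PNG next to the .diag file.
    # import subprocess; subprocess.run(["seqdiag", path], check=True)
    with open(path, "w") as f:
        f.write(text)
    return path
```

The arrow direction encodes the message flow: `->` is a request from left to right, and `<-` draws the response edge back.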
And then the way you render this: the tool I like using is called seqdiag. You can pip install it and you get a command-line utility; you point it at this file and it generates a PNG file you can open up to see the visualization, which is this thing. Okay, so that's the idea behind a sequence diagram. Let me give you an example of where it's helped out before. Who here is familiar with Expect: 100-continue in HTTP? Okay, I certainly wasn't until I started working on the SDK. The idea behind Expect: 100-continue is that instead of sending an entire HTTP request and getting a response, you break the request up into the headers and the body. You take your headers, add an Expect: 100-continue header, send that to the server, and effectively ask: what do you think about this request? The server takes it, looks it over, and decides: is this the right host? Do you have access? Is the URI correct? It can check the auth header if it wants. If everything looks good, it sends a 100 Continue back, and then you can send the request body and get your normal response. The server can also send a 400 and say you're not authorized to access this URI, or it can say the request looks good but you have to send it somewhere else, and give you a redirect. The reason this is helpful is that if you have a large request body, you can avoid sending a large payload to a server that's just going to end up rejecting the request in the first place. So Expect: 100-continue is really useful, and we use it when we do uploads to Amazon S3. The problem we had was that httplib, requests, and urllib3 don't support Expect: 100-continue, and there's not a clean way to add support to those libraries directly. The last time I checked, I don't think requests or urllib3 supported it, and it didn't seem like a great idea to add it directly to those libraries in the first place.
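To make the handshake concrete, here's a small sketch of the client side of that exchange: building the header block that's sent first, and checking the server's interim status line before the body goes out. The helper names are made up for illustration; this isn't the SDK's actual code:

```python
def build_expect_headers(method, path, host, content_length):
    # First leg of the handshake: send only the headers, including
    # "Expect: 100-continue", and hold back the request body.
    lines = [
        f"{method} {path} HTTP/1.1",
        f"Host: {host}",
        f"Content-Length: {content_length}",
        "Expect: 100-continue",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

def server_says_continue(status_line):
    # Second leg: only send the body if the interim response is
    # "HTTP/1.1 100 Continue"; a 4xx (or a redirect) means stop here,
    # before the large payload ever goes over the wire.
    parts = status_line.split()
    return len(parts) >= 2 and parts[1] == "100"
```

The payoff described above falls out of `server_says_continue`: on a 403 or a redirect, the client never sends the (possibly multi-gigabyte) body at all.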
So we decided to add it to our own code, because we thought it was useful enough functionality and it prevents a lot of issues. It's good for us to have, but there was a bug. Occasionally the Expect: 100-continue flow would return a 500, and that's normally fine; it goes through our retry loop and eventually succeeds. But some of the error responses weren't actually errors, and of course, like all hard bugs, it happened very rarely. There wasn't a clear set of steps that reproduced the issue, so we were trying to figure out what we could do. The plan I had, after trying print statements and looking at log messages and not being able to figure it out, was: maybe we could generate a sequence diagram and see if there's any additional insight we can gain. The idea is that we annotate our source code where interesting messages are sent, generate the .diag file we saw on the previous slides, generate an image from the diagram, and see if anything stands out. I'm going to show some code here. There's a lot of it, and I wouldn't expect you to follow all of it; it's mostly just to show what we annotate, so you can see the changes before and after. This is one of the classes we have, an AWS HTTP connection. We're setting some state for the Expect header, setting a flag, calling the parent class, and then resetting the state. The point here is that when you first start debugging, you add print statements. So you might start with something like this: if the Expect header is set, we add a print statement saying so, and otherwise we don't. But let's try to create a diagram from this instead. We want to generate a sequence diagram rather than using print statements, and you might think at first that's a lot of work, a lot of effort, but it's really pretty straightforward.
What I did was create a debug.py module, and you just drop it right in the root directory of your project; we're going to throw this away anyway. It has just two functions, add_sequence and render_sequence, and all it's really doing is storing the strings from the DSL we saw before: which object the message is coming from and going to, an optional label, and which direction the arrow goes. Just to show you how you can use this in a REPL, there's no magic to it. Say we import debug, which works because it's just in the root directory of the repository. If I create two objects, A and B, I can then add a sequence from A to B, saying "sending request", and then from B to A, "receiving response". And if I call render_sequence with /dev/stdout instead of a file name, you can see it generates a sequence diagram. So far pretty straightforward. The way we add this to our code is that instead of the print statements we saw before, we just change them to these message calls. You'll notice the first line is import debug, and then we say message = debug.add_sequence so I don't have to keep typing that. Instead of print statements it's just a little bit more structured: we're sending a message from ourselves to self.host saying we're starting Expect: 100-continue. I annotated that all throughout the source code. You're going to see a bunch of source code fly by here; the point, though, is that the green lines being added are almost the same as what you'd do with a print statement, with just a little bit more structure. So I annotated the HTTP connection class, went into requests and urllib3, and for the most part we didn't really change any code.
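The throwaway module might look something like this. This is a minimal sketch reconstructed from the description above, not the exact code from the talk:

```python
# debug.py -- throwaway helper dropped in the project root while debugging.
_messages = []

def add_sequence(frm, to, label=None, direction="->"):
    # Record one message as a line of seqdiag DSL, e.g.
    #   a -> b [label = "sending request"];
    # direction "<-" draws the return edge (a <- b).
    line = f"{frm} {direction} {to}"
    if label:
        line += f' [label = "{label}"]'
    _messages.append(line + ";")

def render_sequence(filename):
    # Write every recorded message out as one seqdiag document.
    # Pointing the seqdiag CLI at this file then produces the PNG.
    doc = "seqdiag {\n" + "".join(f"  {m}\n" for m in _messages) + "}\n"
    with open(filename, "w") as f:
        f.write(doc)
    return doc
```

In a REPL that would be used roughly as `debug.add_sequence('a', 'b', 'sending request')` followed by `debug.render_sequence('/dev/stdout')`, matching the demo described above.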
There's maybe one section, I think right here, where requests was trying to be fancy with how it got the connection, and we just broke up the lines so we could grab the connection object and call add_sequence on it, but it's still the same logic. For the most part we're just doing a little more work than print statements. That was add_sequence; to call render_sequence we just have a driver script that shows the issue. If you haven't seen code that works with the AWS SDK before, all we're doing is creating a client for S3 and then calling put_object, so we're uploading something to S3. I'm skipping ahead here to save time, but if we sleep for 60 seconds, it's an interesting question what should happen, because you'll hit an idle connection timeout, so the server may close the connection because you're not doing anything. Then we make another put_object request; what should happen? We let that run, call debug.render_sequence, and generate a sequence diagram. This is what we actually get, and we'll zoom in in a second; I just like seeing the shape of the diagram first. I want to stress this is from real data, not drawn by hand or anything; this is just running our code with that script and looking at the resulting sequence diagram. Zooming in here: if you've used requests before, you know that under the hood it uses urllib3, which has a connection pool. So we see that the first thing we do is check out a connection from the connection pool, and since it's the first time, we create this _socket object, which is just a socket, and it's returned back. Then, just like we saw on the slide, we start the Expect: 100-continue flow; that was the first message statement we saw on the slides.
So it goes through this, and you can see the diagram's not perfect, but you get the idea: it's sending headers, we wait for the 100 Continue response, and you can see it being sent to my bucket name, ending in .s3.us-west-2.amazonaws.com. After that it sends the body once we get a 100 Continue, we get a 200 OK, and then we put the connection back. At this point we're sleeping for 60 seconds, and then we want to see what happens when we make a new request. We do the same thing again: we call get connection, we create a connection here, and then we go through the same flow, which we'll skip because it's the same as before: send the headers, wait for the response, send the body. But the key insight, the thing that stood out once this diagram was generated, was a difference in how I thought the connection pooling worked: you have a single connection instance but multiple socket objects, so their lifetimes aren't the same. If a connection closes, instead of creating a brand-new HTTP connection object, we create a new socket object and associate it with the existing connection instance. Compare that to reusing the same socket: if we just continually make requests and don't sleep, this is the diagram we see. Now that we knew the lifetimes were different, after all this analysis the fix was pretty straightforward: when you close the connection, you have to reset the extra state we added. That's a little different from, say, the file API; if you close a file, there's no expectation that you'll reopen that exact same file object later, you'd just create a new file object. So it was a little different, but once we had the insight that we're reusing the same connection instance even after it's closed, it was much easier to track down the bug.
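The shape of that fix can be sketched like this. The class and attribute names here are made up for illustration; the real class lives in the SDK's HTTP layer:

```python
class ExpectContinueConnection:
    # Sketch: a pooled connection object that outlives its socket.
    # The bug was that extra expect-100-continue state survived close();
    # the fix is to reset it there, because the pool reuses this same
    # instance with a fresh socket rather than building a new object.
    def __init__(self):
        self.sock = None
        self._reset_expect_state()

    def _reset_expect_state(self):
        self._expect_header_set = False
        self._received_100_continue = False

    def connect(self):
        self.sock = object()  # stand-in for creating a real socket

    def close(self):
        self.sock = None
        # The fix: clear per-request handshake state on close so the
        # next request on this reused instance starts clean.
        self._reset_expect_state()
```

Before the fix, `close()` dropped the socket but left `_expect_header_set` and friends as-is, so the next request on the same reused instance started in a stale state.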
We won't have time to dig into the details of exactly how that worked, but essentially the connection was in a bad state: we assumed it was a terminal failure state, and we didn't reset our state appropriately. So that was the first visualization. It was really straightforward to add, because seqdiag did all the heavy lifting of generating the diagram. Now we're going to look at the second one, visualizing internal state. We do this a lot; it's probably the technique I use the most. The idea is that as things happen in your code, you log whatever the internal state of some object is, and you can generate some interesting graphs from that. To again give a use case where this was really, really helpful: we're going to try to download a file, which seems like a pretty straightforward thing to do at first. Say you wanted to download a video file or something, and assume it's big enough that you can't just hold the whole thing in memory; say it's 100 gigabytes or something, right?
So the easiest way to do it: you take the first chunk of the remote file, download it, write it to your local file, then take the next chunk, download it, write it, and keep going until eventually you have the entire file downloaded. That works, and it's pretty straightforward. But what we do in the CLI is take the file and conceptually break it into chunks, and then we have a thread pool whose threads are assigned individual chunks of the file and download them in parallel. Because they're downloading in parallel, the writes can come out of order, so we just seek to the chunk's offset in the file and write out the data. At some point in time your file could look like this: whatever data we've downloaded is written out at its location, but eventually you get your entire file written, the same as if we'd done it sequentially. And if we actually look at data here that plots the I/O writes, this is using matplotlib, this is actual data showing where the S3 writes are happening within the file. We took the file and oriented it vertically, so the beginning of the file is y equals zero and the y maximum is the end of the file, and you can see each thread writing at the same time to different parts of the file. You can see how this kind of parallelism can give a nice boost in performance sometimes. That in itself is pretty non-trivial to write, but there was an interesting use case we had where we wanted to download to a non-seekable stream. I should ask: has anyone used the AWS CLI before?
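The parallel, seek-and-write pattern described above can be sketched like this. Here `fetch_range` is a stand-in for an HTTP ranged GET against S3, and the chunk and pool sizes are illustrative defaults, not the CLI's actual values:

```python
import concurrent.futures

def parallel_download(fetch_range, total_size, filename,
                      chunk_size=5 * 1024 * 1024, max_workers=4):
    # Pre-size the local file so every worker can seek to its own offset.
    with open(filename, "wb") as f:
        f.truncate(total_size)

    def fetch_and_write(offset):
        end = min(offset + chunk_size, total_size)
        data = fetch_range(offset, end)  # stand-in for a ranged GET
        # Each worker opens its own handle so seek positions don't race.
        with open(filename, "r+b") as f:
            f.seek(offset)
            f.write(data)

    offsets = range(0, total_size, chunk_size)
    with concurrent.futures.ThreadPoolExecutor(max_workers) as pool:
        # Writes land out of order, but each goes to its own byte range,
        # so the finished file is identical to a sequential download.
        list(pool.map(fetch_and_write, offsets))
```

The key property is the one the talk relies on: completion order doesn't matter, because every chunk writes to a disjoint offset range of the pre-sized file.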
Okay, cool. So this is a cp command, where we're downloading from a bucket to standard out, so that we can pipe the data to, say, gunzip and then pipe it somewhere else, without needing a local buffer. If you have a large file, say a 500 gigabyte file or something, you don't need 500 gigabytes of temporary storage to download it to; you can stream it directly to a pipe. The problem we had was that here you can't do the parallel approach, because you have to write the data sequentially; there's no seek on a pipe. One thing you could do is just stream it sequentially like before, and very early versions of the AWS CLI did that, but we wanted to see if there was something better we could do. Let's still assume that downloading is the bottleneck, that writing to disk locally is much faster than downloading remotely. The initial idea we implemented was to just have a buffer: you do the same thing with your thread pools, but if a chunk isn't the next contiguous block, you hold on to it in memory and keep buffering, and eventually, once the missing chunk comes along and you have a contiguous range, you flush it to disk. There was a problem with this. We implemented it, it worked, and during development we noticed a pretty big issue: in the worst case, you buffer the whole file. If you're waiting on one chunk of the file, and networks are lossy, networks have congestion, things sometimes come out of order, then while you wait on that one part you could potentially buffer a 500 gigabyte file in memory. We wanted to fix this, and we tried to figure out the best way to do it without a huge performance hit. What I ended up implementing was a sliding window algorithm, very similar to what TCP does, if you're familiar with that: you just limit the range of the file that can be downloaded at any one time.

The way this works is that while parts are downloading, you only allow them to download from within that range, and it's not until something on the left side of the window comes in that the contiguous range is unlocked. You take that and flush it out to disk, or whatever your pipe is, and then the window slides over, exposing the next part that can be downloaded; that's your new maximum. So that was the sliding window. We're not going to look at the code; it gets a little more involved because there are threads and you need semaphores and the appropriate locks, but essentially there's an acquire, which is on the right-hand side of the window, and there's a release, and if the release is on the left side of the window, the window slides. Again, the point here is the green highlights, which show how we annotated the source code to verify the algorithm was working how we wanted. We'll look at what the t function is in a second, but for the acquire there was only one trace event we were interested in, and for the release there were the three cases we talked about. If I release something in the middle of the window, it just goes into a buffer; if I release something on the left side of the window, I can start flushing the buffer; and then there's potentially a chain reaction, because once I unlock something on the left-hand side, any other contiguous range can now be released too. So there are three events there. Again, all this function is doing is printing some internal state: the sequence number, so where in the window we are; the time; and whether it was an immediate release or a chain release. Basically just internal state. And all of it uses the logging module: we have this tag tracer that's just a quick shorthand so we can dump structured data as JSON, and then we just have this t function, again because we wanted something really quick to add to our code.

Ultimately what you get is log messages with the time, the thread, and a JSON payload, and we can take that, parse it, and generate plots of our internal state over time to see what's happening. What we want to see, because we have to write sequentially, in order, is something that starts from the bottom left and goes up to the right, unlike the earlier diagram where we had multiple S3 writers going at the same time. This is what the actual plot looks like, and we'll zoom in on the details in a second, but visually, at least, this is a good confirmation that our algorithm is working how we want. I mean, we had unit tests and functional tests and integration tests, so we knew it was working, but in terms of the behavior, this was more or less what we expected: a sort of linear, up-and-to-the-right shape. If we zoom in a little, these are the four trace events we added that we saw on the previous slides: the acquire, which is the blue one, and the three releases. Let's pick a specific event and walk through it. The release always happens sequentially, but say something wants to be released; we'll pick this data point here. The release is requested, but it's not ready yet. We aren't able to release it until, tracing along the same y value, the point where the release actually occurs. This is the delay, and it only happened because that black dot there was what we were waiting for: the left side of the window. Immediately above it, the red dots are the chain reaction, the other chunks that could now be released because they were just pending, and immediately after that you can see the blue dots, where the next parts of the file can immediately be acquired. That's the sliding window in action.

So this was a good visualization for us to verify that yes, this is in fact the correct mental model; it does seem to be working how we expect. We could also look at different window sizes. The previous graph was a window size of 10; this is a window size of 100, and really the only trade-off you're making is that you can potentially stall and need a bigger buffer. Here, I think, the window size is 100, and it looks like the stall is maybe 50 slots, so as long as you're okay with a stall of up to 100, this is a decent window size. As we look at bigger window sizes, you can see the total time doesn't really change; you're just buffering more memory, so you don't gain a whole lot from a bigger window size. The takeaway here is that by generating log statements that just dump internal state, you can parse them and create really interesting plots to verify the behavior of your algorithms, to make sure they're doing what you expect and that you have the correct mental model.

The third technique we're going to look at is creating a custom diagram. So far we've been leveraging other tools to do a lot of the heavy lifting: matplotlib, and before that seqdiag. But sometimes you have some visualization in mind and no existing tool really creates exactly what you want, so for this we're going to generate our own custom diagram. To keep with the CLI, without having to explain new concepts, we're going to look at uploads now. There was a problem very, very early on, I think in version 1.0, five years ago, so it's not really that interesting anymore since it's long been fixed, but it's interesting in terms of the visualization. There was a case of high CPU usage: the thing would work, but it seemed to be consuming unnecessary CPU, and we were trying to figure out what caused it. The way uploads work is similar to downloads: there's a producer-consumer pattern, and a worker thread pulls a task from a queue. In some interesting cases, the worker thread would look at the task, decide it couldn't do anything with it, and put it back on the queue. And there were times when we thought it was just cycling: it would take something from the queue, look at it, decide it couldn't do anything, put it back, take another task, decide it couldn't do that either, put it back, and just keep spinning like that. Here's what the actual worker thread looks like. It looks like a lot of code, but it just grabs something from the queue, which is going to be a reference to a function; if we can call it, we invoke it and put it on the result queue, and if not, we put it back. All I'm doing here is adding these log.debug statements, and instead of internal state we're just logging when certain events happen: when we get an event from the queue, when we start to invoke it, when it's finished, and when we put it back; that's the queue put-back in the middle there. And really, again, this is the main takeaway: the highlighted parts are maybe five or six lines of code, and we're just emitting when certain events happen.

The way we parse this: we have a log file here with some number of events, but even looking at it, it's kind of hard to tell what's going on, because there are a bunch of threads, thread two through thread seven, each calling get at certain points, and it's really hard to see the state transitions, the state diagram. So we wanted some sort of visualization to help confirm some of the suspicions we had. What we do is take the timestamp, the thread, and the message and put them in a namedtuple; we go through each line and parse those out just by splitting on the hyphens, and from there we group them by thread, because we want a visualization of each thread over time. So we have things like what time an event started and what time it ended, and based on whether it's an invoke start or a multipart invoke start, we pick the event type. Once we have that, we're ready to generate our image. I know that went really fast, because the thing I want to focus on is how to use Pillow, which is what I used to generate the image. If you haven't heard of it, it's a fork of the Python Imaging Library. When I'd used it before, I really only thought it could do image resizing, generating thumbnails, rotations, cropping, that kind of stuff. But there's an almost hidden module, ImageDraw, which has primitives for drawing lines, rectangles, circles, and basic shapes, so you can generate an image from scratch, and that's what I did here. There's a lot of code here, and the slides will be available online, but there are really just four steps. We go through each thread, and we're just drawing a bunch of rectangles. We figure out what color to make each rectangle based on the event type: if it's the put-back case, where a worker thread takes something from the queue and immediately puts it back, we color it red so it stands out; otherwise we draw a rectangle whose width is based on end minus start timestamps, so we know how long each event took. At the end we just call image.save, which saves the image to a file, and that generates the visualization for us.

If we look at it, this is what we get. The way we read it is left to right for time, and each row represents a thread, so we can see what each thread is doing at each point in time; there's the legend on the right. Gray is the non-multipart upload, black is a multipart upload, and red is a put-back. If that's a little hard to follow, the main thing we're interested in is the red parts, because those are where we think the issue might be. Zooming in here: this is a multipart upload, the black part; the gray part is an upload part; and the red is the put-back. What's nice about this diagram is that if you take a vertical line and trace it, you can see what each thread is doing at any given point in time, so you can see how many multipart uploads are happening at once. For historical reasons we had a limit of three, which we can't get into, but it was a nice visual confirmation that any vertical line only ever touches three black rectangles. And if we zoom in here, you can see these small red squares, the put-backs. That was where we could clearly see that at one point there were one, two, three, four, five, six threads just spinning on CPU: taking a task, putting it back, and continually doing that. This diagram is pretty short, but there were other diagrams, which I don't have anymore, that showed much longer spin cycles, so it was very easy to visually verify what all the threads were doing at a given point in time and that they were potentially using a lot of CPU. The main takeaway, and I know we went pretty fast here, is that we can use Pillow, based on the state events we emit, to generate our own custom diagrams, and it's maybe 10 or 15 lines of code to do that.

Okay, so wrapping up, there were three diagrams we looked at. We looked at the sequence diagram, which showed us the flow of messages between objects, and we can use the seqdiag tool to generate those for us. We also used matplotlib to graph internal state; that was where, periodically over time, we flush a JSON document with all the internal state of the object we're interested in, and then we can generate visualizations and time-series plots from that. And finally, when none of those work, we can use Pillow and its basic primitive shapes to draw visualizations, in my example based on events as they happen. So hopefully the next time you're debugging and you can't figure out why your code isn't working, and all the traditional means haven't worked, maybe give one of these a try, and maybe it'll work for you. I know this stuff has been really helpful for me throughout the years, and I hope it's equally helpful to you. Thanks, everyone.

Thank you, James. So we have time for a few questions. Does anybody want to come here to the microphone? No? Then I have a question. Have you considered using vector representations, with the last type of diagram for example, to be able to integrate more text and zoom in?

Oh yeah, I have not considered it, but that's definitely something. I don't know how much more involved that would make the generation, but I think that's kind of why, with a lot of these, it's whatever the first thing is that will generate a visualization that gets us pretty close. But yeah, if we were going to put this in, you know, a blog post or documentation or something, we'd look at creating an actual vector image and then cleaning it up, definitely.

Great, thanks. Just a small question: when you test this code, do you test it in production or in a test environment? Because you don't want to push this code to the repository, you said, so what do you do for the actual testing part?
Yep. So for the testing part, it's usually not in production; it's usually in some sort of test environment. And a lot of the time with this code, especially the runs that take a long time, we'll have to leave it running, sometimes for a day or overnight, because it takes a lot for the bug to reproduce, so you generate a lot of data. The other thing that's a little different here is that because these are client-side libraries, it's not quite analogous to, say, a web application where you have a prod stage and a dev stage.

I guess you could have a server that is designated only for tests?

Yeah, you can do that. If this were for a web app, you could also have a small percentage of your traffic redirected to those test machines, or to machines that have a bit more intensive monitoring on them.

Other questions, maybe? Then I will ask a general question, because this was really interesting and I want a takeaway to apply to my own debugging. What's your suggestion? Because, as you mentioned, you don't necessarily want to check these changes into the repo, right? So what can we do, or what's your suggestion?
So I think one option, if you do have a test environment or the ability to shift a percentage of traffic, is to do that. The other thing, though, is that a lot of the stuff I showed doesn't have much overhead; mostly we just didn't want the cognitive overhead of these trace calls sitting in the code. But I think it's just like refactoring code: if you do something once, and then again, and then definitely if you do it a third time, you think, you know, maybe if we keep doing this it's worth putting proper code in there to do these traces. For a lot of these cases we fixed the issue, verified it, and haven't had to come back to them, but if we kept doing that, then yes, we'd try to clean it up and keep it in the production code base. And with the logging module, you can set the log level high enough that you're not actually emitting debug messages; it's only when you turn on the appropriate log level that these traces get emitted.

Thank you again, James. Thanks.