Hello, everyone, to this week's Bite Size Talk. I'm super excited that Phil is here with me today to talk about the art of the minimal example.

Thank you. Thanks, Pam. Yeah, a quick little Bite Size for you all today. My favorite type, really: not quite vague, not very well prepared, on quite a loose topic. We were thinking about different talks we could fit into a slot, and I thought this is one that comes up quite a lot. This is going to be a fairly beginner-focused talk. As a maintainer of open source software, or a maintainer of any software, this is the golden target, the thing that we all want: for anyone with a problem with their code who's looking for help or support to have a nice, small, minimal reproducible example. But if you're new to the world of bioinformatics, new to the world of software development, this is a bit of an abstract term and concept. So what do I mean by that? How do you go about it? What does a good one look like? What is it that you need to do, and what do we want?

I started making slides earlier this morning while I was in another meeting, and I got a bit carried away with the GIFs, so apologies in advance. This is a bit of a silly talk, so let's go in. I also based these slides on an earlier slide deck from an old Bite Size talk I gave, back when we were still putting numbers in, and I toyed with the idea of putting my Bite Size number in here. By my count, this is the 86th Bite Size talk, which is insane. So just a little shout out to how amazing that is, and thank you to everyone who's given a Bite Size talk over the past few years, because it's quite an amazing body of work now. Anyway, November 14th, 2023, no numbers.

An MRE: a minimal reproducible example. You can see why I went to GIFs pretty quickly, because I started off dry. This is just a screenshot from Wikipedia about what a reproducible example is. What do I mean here? It's often called an MRE.
Basically what it comes down to is trying to make something as small and simple as possible which still demonstrates the feature or the problem that you've come across and are trying to describe to somebody else. So instead of your actual production Nextflow pipeline run with 100 gigabytes of data, which has been running for eight days when you hit a problem, instead of trying to tell someone about that, which is very difficult for them to reproduce because they'd need to get loads of data and run it for days, you try and strip it back to the core essentials. When you've got it as small as you possibly can but it still demonstrates the same behavior, that's your minimal reproducible example, and that's what you share.

So why? Why go to the effort of doing this? Obviously it's easier just to take a screenshot of your log output, dump it into Slack and ask for help. Why would you go to the effort of working up one of these reproducible examples before taking that step? What's in it for you?

One of the first things: we've got the icon here for the nf-core maintainers group, which is a rubber duck, and this comes from the term rubber duck debugging. It's the idea that when you have a problem and you're at your wit's end, you've tried everything, there can't be anything else, it surely can't be your fault, you start explaining the problem to someone in the hope that they can help you. And in the process of explaining it, you figure out what the problem is. It's called rubber duck debugging because you can have a rubber duck on your desk and talk to it; you don't need a real person. Minimal reproducible examples are like the ultimate rubber duck debugging, in my case anyway. I sit there and I try and strip away at this thing and make it as small as possible.
And in the process, I usually, or very often, figure out what the problem actually is, what's underlying it, and understand it better just by trying to cut it down. So it's a great way to actually do debugging, let alone to share the problem with anyone else, because often, in the process of making the MRE, you figure out what the issue is, the problem is solved, and you can move on. Very selfishly, it's a great thing to do just because it's a good way to solve your problem quickly by yourself, which is the best way to solve any problem.

Okay, and sort of related to that (this is getting more obscure; you can see I started to lose it a bit as I was making the slides), I've got a red herring here. The second stage of troubleshooting is understanding what the problem is. You've got to cut through this massive picture of a big pipeline run, with all this stuff going on, where it could be a million different things. Shaping your minimal reproducible example forces you to hone in on what the problem actually is and ignore all the other things that could be red herrings: a log warning message about Singularity which has got nothing to do with the actual problem, or whatever else. You're discarding all of those by chopping away and getting to the minimal reproducible example, and actually identifying the underlying problem. Very often, like I say, same as rubber duck debugging, that solves your problem. In the process of trying to make your reproducible example, you realize that just one of your samples has no data in it, or is dodgy, and you're like, ah, okay, I'll just exclude that sample and everything's fine. You've solved your mystery again, through the process of making the reproducible example. So that's reason number one for doing this.
It helps you fix your own problems through the act of trying to share your problem with someone else. Okay, this doesn't always work. I'm sure many of you have sat there going, yeah, but I can't always solve my problems by myself, and you'd be right.

So what's tempting sometimes is to take your screenshot and dump it on Slack: this pipeline is broken, I can't make this work, what's going on, I'm frustrated. You take your huge pipeline run, take a screenshot of the error, and go, can someone help me? And it's kind of the equivalent of doing this. Anyone who's sat on the other side of this GIF, watching someone come in with problems, can tell they've obviously been struggling, maybe for a long time. But there's so much there that it's very difficult to really jump in and help. So reason number two is to help other people to help you. That's really the driving force of a minimal reproducible example: helping other people to help you.

So how do you make a minimal reproducible example? First off (I won't talk about this so much), reproducible is a key word there. If I'm trying to fix a bug, be it in Nextflow, be it in MultiQC, be it in a pipeline, whatever, the very first thing I try and do before I touch any code, nine times out of ten anyway, is to reproduce the same error on my machine. If I can reproduce the error, then (a) I know it's real, and (b) I'll know when I've fixed it. MultiQC is a classic example of this. Sometimes, if people don't give enough information for me to reproduce it myself, I can make a guess about what caused the error and I might fix that bit. But there might be a second bug hidden later in the code, and if I can't reproduce the bug myself, I don't know that I've actually, conclusively fixed it.
And this has happened to me lots of times: I fix one bug and the person says, yeah, that's great, but now I get this different error. If I could just reproduce the problem in the first place, I'd be able to iterate much more quickly. I can reproduce it, fix it, work through the whole process until I know it's running, and then put the whole fix in in one go. It's much more efficient for everybody. So as a maintainer, being able to reproduce errors is key.

So: you've got to reproduce the error. Now, knowing what information to provide so that other people can reproduce those errors is somewhat of a skill in itself. Those of us who have spent a lot of time in the nf-core Slack and the Nextflow Slack will be very familiar with this, to the extent that we actually have an automated Slack bot responder. I'm sure many people have been on the receiving end of this: if anyone types "moreinfo", without a space, a little bot responds saying, okay, this is the information we need to be able to help you. We need to know what the command was, we need to see any config files, we need to see the full error message and the log file and everything. Now, this is not necessarily enough to actually be able to reproduce the error, but it's on the way there, and it's all part of the same logic.

Okay, minimal example. What does that mean? We've talked a little bit about reproducible examples. Basically, when it comes to making a minimal example, you can do one of two things. One: you can take your big example and you can chop, and cut, cut, cut, cut, cut. First off, you have a hundred samples; chop it down to 50 and see if you still get the same error. Down to 20, down to 10. Try and narrow it down to as small a subset of data as possible.
Even subset beyond one sample: start chopping that BAM file down, or that FASTQ file down, making it as small as possible, just deleting stuff until the error goes away. When it goes away, you know you've gone too far. And of course, you can also do it with code. You can delete code, especially if you're a pipeline developer writing a new script which was working the last time you tested it and now isn't. You chop away code to try and narrow down the bit where the problem was introduced. You can be really fancy, if you're a developer, and use things like git bisect to do this in a semi-automated way; that's a really nice git command that many people don't know exists.

So option one is: cut away as much as possible while still reproducing the error. Or, option two, you start fresh. This is more appropriate if you have a bit of an idea about what might be going wrong. You start from as small as possible, which is a new file, add in a few lines of code, and see if you can get back to the same error. That's my preferred way of doing it, personally. It's slightly more difficult, because it requires you to have some idea about what might be going wrong, but it ends up with much smaller, cleaner reproducible examples in the end. So I prefer it as a maintainer, and I also prefer it when I'm making examples for other people. If you can, I'd say do this one.

Okay, what would be a Bite Size talk without a live demo? Let's have a quick look at something like that. I thought I'd show an example or two of where I've done this and the kind of thing that I do. So, one that I did a little while ago. This is super small for us to see. Yeah, I will make everything a bit bigger. Can you read that?

We see your desktop, so no.

Can you see my GitHub window? Can you read the writing?
Yeah, that's true, but we still see a lot of your desktop.

Okay, I'll just share my browser window and I'll switch between them a bit. Thank you for telling me. Okay, can you read that?

Yeah, that's better. Thanks.

So this is an issue that I made back in May, on the Nextflow repo. You can see I've described it here and put in a minimal example. Actually, I haven't put in a minimal example, I've just put in the error message, but I got to it by making a minimal example, and I'll show you what that minimal example looked like and how I did it. So if I hop over... I'm going to be switching between windows a lot here, this is going to be a pain. I wonder if I can just share a portion of the screen. There we go. Can you still see stuff there? Is that visible?

We see things. Can you see a terminal window? Yes, I can read it.

Okay. I'm going to make it a little bit bigger. I wonder if I can just resize this as I go. No, okay. I've got an empty folder here, basically. Actually, I'm cheating: I've got a file which I made earlier when I was testing, just in case I couldn't remember what to write. I could open up VS Code, but one of the things I quite like doing is using the Nextflow console. In fact, I'm going to create a new file: `touch main.nf` for an empty file, then `nextflow console main.nf`. What this does is open up this little plugin feature of Nextflow called the console, a REPL console. The Nextflow code editor is up here, and I can run the code directly down here. So I can say (I can't make it any bigger... larger font, there we go) `params.foo = 'bar'`, then `println params.foo`, right? And if I press this magic button just here, which is a picture of a script with a green button, it runs the code underneath. Great, very nice.
So in this particular example, I said that version 23.04 still returns a warning about DSL2 syntax and suggests using `-dsl1`, and that doesn't exist anymore in this version of Nextflow, so we should probably remove it. The way I saw this was in a DSL1 pipeline where I had a process block without a workflow block. So, what's the absolute simplest process block I can think of? It's probably something like this, right? That's about as simple as you get. I hit save there, it's saved to main.nf, and I can hit run here and it will run. It's a bit too small, but you can see I'm getting an error message here from running Nextflow. So using the console, the REPL, is a really quick way to iterate on very, very small examples with just a few lines of code, and this is an ideal minimal reproducible example, because anyone can look at it in a few seconds and see basically what's going on and what's missing. And if I do `nextflow run main.nf`, I get the same warning here. So this is what I pasted into the GitHub issue after making it.

A little reminder that you can also prepend `NXF_VER=` in front of a command, and that will specify the version of Nextflow that you're running and download it if necessary. So again, for the reproducible part of the example, this is very nice, to show that you're using a specific version of Nextflow. Especially if you think it's a regression in Nextflow behavior between versions, you can take the same little script and run it twice with two different `NXF_VER` values, and show how the behavior changes between versions of Nextflow. So you can see how you can make that minimal reproducible example to demonstrate behavior, then paste it into an issue, and everyone understands exactly what's going on. Okay, what else did I have?
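To make the demo concrete, here's a rough sketch of the same idea as shell commands. This is illustrative only: the script body and the `23.04.0` version pin are stand-ins, not the exact code from the demo. `NXF_VER` is Nextflow's standard environment variable for pinning the runtime version.

```shell
# Recreate the tiny test case in an empty folder: a bare DSL1-style
# process block with no workflow block, about the smallest script
# imaginable that could trigger the deprecation warning.
cat > main.nf <<'EOF'
process foo {
    script:
    "echo hello"
}
EOF

# Pinning the version makes the example reproducible; Nextflow will
# download that version if it isn't already installed. Running the
# same script twice with two different NXF_VER values is a neat way
# to demonstrate a regression between versions.
# (Commented out here in case Nextflow isn't installed locally.)
# NXF_VER=23.04.0 nextflow run main.nf

# Sanity check: the script file exists and contains the process block
grep -c "process" main.nf
```

The whole thing fits in a small code block, which is exactly the point: anyone reading the issue can recreate it in seconds.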
So that was one example of an issue I made, which I thought was a relatively good one because it's fairly clear about what's going wrong, though I realized I didn't actually put a reproducible example in it, which completely shoots myself in the foot. And here's another one, which I did on the Nextflow repo. Again, you can see here that I've put in a very, very minimal example of how to reproduce this error, which is four lines of code. By doing this kind of thing, other people can understand exactly what the usage is that I'm talking about. This is much better than saying: when I run, I don't know, the rnaseq pipeline, something goes wrong and it exits with an error code. It's much harder to reproduce that, but if I have four lines of code, that's easy.

One other thing to note: when you're creating an issue on a repo, firstly try and read the documentation, and when you go to the issues page, most of the repos that we have on nf-core (and it's the same for MultiQC and Nextflow) have these issue templates. Try and use the one that says bug report, fill in what it says and follow the instructions, because this again is to do with reproducible examples: pasting in the full error log, and with MultiQC, dragging and dropping the file which triggers the error. That means anyone else working on MultiQC can run MultiQC at the same version with the same file and try and replicate the same error, so that we can reproduce it locally, fix it, and be confident about it. So again, it's all part of the minimal reproducible example. Nextflow's template looks basically the same: expected behavior and steps to reproduce, right.

Let me go back to Keynote, which I've minimized. So, that's roughly how I make a minimal reproducible example. It's a bit vague, because it's a bit different for every type of bug or problem, but hopefully the concepts and the things I aim for are clear.
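The two chopping strategies from earlier, cutting the data down and narrowing down the code with git bisect, can be sketched like this. Everything here is a toy stand-in: tiny fabricated FASTQ data and a throwaway git repo take the place of a real pipeline and its history.

```shell
set -eu
tmp=$(mktemp -d)
cd "$tmp"

# --- Chopping the data down ---
# FASTQ records are 4 lines each, so 'head -n 400' would keep 100 reads.
# big.fastq here is a tiny fabricated stand-in for real data.
printf '@read%s\nACGT\n+\nIIII\n' 1 2 3 4 5 > big.fastq
head -n 8 big.fastq > subset.fastq   # keep just the first 2 reads
kept=$(( $(wc -l < subset.fastq) / 4 ))
echo "reads kept: $kept"

# --- Chopping the code down with git bisect ---
# Build a toy repo: two good commits, then one that breaks check.txt.
git init -q repo && cd repo
git -c user.email=a@b -c user.name=demo commit -qm "good" --allow-empty
echo ok > check.txt && git add check.txt
git -c user.email=a@b -c user.name=demo commit -qm "still good"
echo broken > check.txt
git -c user.email=a@b -c user.name=demo commit -qam "oops"

# Tell bisect the current commit is bad and HEAD~2 was good, then let
# 'git bisect run' test each candidate commit automatically: the test
# command's exit code marks each commit good (0) or bad (non-zero).
git bisect start HEAD HEAD~2 >/dev/null
result=$(git bisect run grep -q ok check.txt | grep -c "first bad commit")
git bisect reset >/dev/null 2>&1
echo "first-bad-commit lines reported: $result"
```

The same pattern scales up: swap the `grep` test command for anything that exits non-zero when your bug is present, and bisect does the detective work across hundreds of commits.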
As an overview slash recap, I view the process of solving problems and bugs as basically a chronology, where you start off by discovering the bug, when you first run something that doesn't work, and you finish up with fixing the bug, where everything's fine and it's all solved. The temptation is to go looking for help as soon as you discover the bug. But by making a minimal reproducible example, you start going along this track a little bit yourself, as far as you possibly can. These two may be close together, or even the other way around. Making a reproducible example just takes us all a little bit closer towards fixing the bug and helps the maintainers. It makes it faster for you, makes it faster and less work for the maintainers, and makes everyone's lives better. And really, the key to a reproducible example, the reason it works, is that you're controlling all the variables you can. You're really using it as a kind of fair test, where you're flipping just one thing, maybe, or making it as small and simple as possible. And hopefully, at the end of all this: problem solved. Right, happy to take any questions. Otherwise, thanks very much for listening.

Thank you so much. That was very interesting. Now, if there are any questions, anyone is now able to unmute themselves. So just ask away. It seems there are no questions.

Everyone's mesmerized by the Spider-Man GIF.

Possibly. So, how do you know that you're actually at the minimal possible example? I mean, that nothing more can be cut away?

Yeah, I guess minimal is probably too absolute a term. I usually feel very smug when I get to the level I showed earlier, where it's like six lines of code and you can fit it all in one little tiny code block. But obviously it's a continuum. It's not absolute, and anything that's smaller than what you started with is an improvement. So the further you can get, the better,
depending on your frustration level, your experience level, your understanding level, and the complexity of the issue.

Right. Let's say you cut something away and the error goes away. Then you've gone too far and you need to step back. But maybe it was just something stupid you've done. Not you, obviously, but...

I do this stuff all the time. No, but that's why it helps you to understand what the problem is. Because if you chop something away and the problem goes away, that's highly indicative that that chunk of code had something to do with the problem. And this is why I showed it as a process: if you don't do this, one of the first things that the maintainer might do is try to do it themselves. Part of understanding the bug, understanding what the cause is and figuring it all out, is zooming in and doing bug hunting, detective work, to actually isolate the problem. So by doing it yourself, you speed the process up, and it also helps your experience.

Maxime?

Yes. I know that on my side, whenever I have an error, especially with anything Nextflow-related, whether it's Nextflow or nf-test or things like that, I create a repo for it. That way it's much easier to reproduce, because you don't even have to type the code yourself, you can just clone the repo and execute the stuff. Do you recommend something like that as well, or is that maybe going a bit too far?

It depends on what the problem is. If you can reproduce it in four lines of code, it's not difficult to paste that into a file and run it locally. But yeah, if the minimal case is a slightly more fully-fledged pipeline, then absolutely: the more you can do to make the life of the maintainer easier, the faster you'll get a response on the whole, and the more likely it is that the maintainer will want to help you.
So yeah, I think what you do with your test repos is a great idea, and you could also do it with a gist.

And I like to do that especially when there are multiple files involved; it's much easier than having to recreate three files on your side to debug. So yes, thank you.

GitHub repos are free, so, you know, there's no limit. No one's going to look at your GitHub profile and say, oh, he's reported a lot of bugs. That's a good thing. It shows that you do things the right way.

Great. Thank you. Are there any more questions from the audience?

There's something I was meant to do at the start and forgot: I was meant to point to one of the other Bite Size talks which I did, about troubleshooting, which is also related. That's usually the first step of all this: you do the troubleshooting first, as part of understanding.

Thank you. Well, I forgot to mention that you're from Seqera Labs. So, Seqera, sorry. Anyway, if there are no more questions, then thank you so much, Phil. And I thank the audience for listening in on the talk, and the Chan Zuckerberg Initiative for funding Bite Size talks. Thank you, everyone. Thank you.