in case you haven't met us. Martin is an R Core and R Foundation member. He has been on R Core since its inception in 1997 and is on the board of the R Foundation. He's also the co-creator, with Doug Bates, and maintainer of the Matrix package, and of a number of other widely used packages. He's an adjunct professor in mathematical statistics at ETH Zurich, and he's also the project lead of ESS, or Emacs Speaks Statistics. I, on the other hand, am a frequent collaborator with R Core, arguably the most prolific in recent times in terms of feature additions to the R language. We'll go through some of the things I've done over the course of a different section of this tutorial; I'm not going to belabor them here. This is just to say that I've interacted with a number of different R Core members quite a bit over the years, so I'm bringing the perspective of the external collaborator to this. The goal here is to talk about and impart how to contribute to R, so first we should talk about what that actually means. Our goal is to help you learn how to help R, by helping R Core maintain R, and thus benefiting yourself, us, and the larger R community. The goal is not professional advancement, although depending on your specific situation it's possible that will come. And it's not personal recognition or fame, although acknowledgement will always occur for any contributions you help R with. The reason I say that is that there are much easier ways to become a recognizable figure in the R community than doing this. If that's your goal, that's perfectly fine, but this isn't really a good way to achieve it. This is about helping the R language and the R community, largely from behind the scenes.
So what we will focus on is what kinds of efforts are actually helpful to R Core and to the R language, and how we can perform those; and what kinds of well-meaning actions are not helpful, and how we can either avoid or improve those so that they will ultimately help R. This is not the only effort in this space that has happened or is happening in terms of outreach from the R Core team and the R Foundation to the larger community, fostering these types of engagements and contributions. There have been a number of blog posts by Tomas Kalibera and Luke Tierney, the last of which was also by Kurt Hornik of the R Core team; all three of them are on the R Core team. They talk about how you can help R, and in fact thank people for the large and helpful response to the first blog post. Those are all really good reads, and they're all relatively short, so I encourage you to go and read them at your leisure. They complement the things we're going to be talking about here. The other major effort, which is ongoing, is the R Contribution Working Group under the R Foundation's Forwards taskforce, organized by Heather Turner. It has the involvement of multiple R Core members, including Luke Tierney, Michael Lawrence, and Martyn Plummer, and other R Foundation members (of which Heather Turner is also one) such as Jenny Bryan and Di Cook. And it also involves a number of larger R community and R-Ladies members that you might have heard of before, such as Kara Woo, Amelia McNamara, Toby Hocking, Michael Chirico, Brodie Gaslam, and Sebastian Meyer, and a number of other people as well, some of whom are actually in the audience here, and whose names you will likely have heard before too long. There are a few different outputs so far of this working group. One of them is actually this tutorial. But in addition to that, there's an R-devel Slack, which currently has about 110 members and is not super active.
But it is a place where you can go and talk to other people who are interested in this type of contribution and collaborate with them, and we'll talk more about how exactly that might be useful over the course of this tutorial. There's also work by Saranjeet Kaur, who is in the audience here. She's developing an R development guide to document more permanently how to do a number of these things, and she's working with Heather Turner and Michael Lawrence on that. It's being developed on GitHub; it's relatively early days now, but the work is ongoing and it will develop over time. So before we actually get our hands on some bugs and what to do about them, I'm going to run you through some of the larger parts of my history as a collaborator with R Core and my contributions to R, and the lessons I have learned, sometimes quite painfully, over many years of doing this, so that you can have those lessons up front and don't have to learn them the same way yourself. It began when I was a graduate student, and I decided that I wanted to be able to put on my CV that I had successfully patched R, which, as I said before, is not a good reason to do this, but it is, to be honest, why I did the very first one. So in 2015 or so, Simon Anders posted a reproducible example of a problem: if you have a graphics device open, you call the identify function, which identifies which points you're clicking on in the open graphics device, and then you close the graphics device before clicking, it would hang R. I used a then-current version of R and confirmed that I saw the same behavior; it did hang for me when I did that. I then diagnosed the problem and found that it actually wasn't a problem in identify itself but a problem in the locator C function.
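To make that concrete, here is a hedged sketch of the kind of reproducible example involved (the exact code Simon Anders posted may have differed); it has to be run interactively with a screen graphics device:

```r
## Sketch of the reported problem (run interactively).
## In the affected R versions, closing the device before clicking
## would hang the session instead of returning.
plot(1:10)            # open a graphics device showing some points
identify(1:10, 1:10)  # R now waits for mouse clicks on those points
## ...close the graphics device window *before* clicking anywhere:
## buggy versions of R would hang at this point.
```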
I had the benefit of already knowing the C API, because I was working with Duncan Temple Lang, who is also on R Core, as my thesis advisor. So I did have a leg up in that sense, and I was able to fix this bug and a related bug in the locator function, and I submitted a patch. At that particular time there was actually no response on Bugzilla, which is not generally the case, but I could see that the changes I had submitted were applied in trunk, so the patch had been accepted. That made me feel good, and I enjoyed that. Another one around the same time was that Mark Bravington wanted bitwise operations on raw vectors, and I saw his post. I immediately started digging around in the C code that underlies bitwAnd, bitwOr, and the other bitw* functions in R, and I patched it to add support for raw vectors, because that wasn't in there before. And the reason it wasn't in there before was that the normal `&` and `|` operators already operate bitwise on raw vectors; that was already the way you were supposed to do it, so it didn't need to be in the bitw* functions in the first place. But the documentation for bitwAnd, bitwOr, and the rest didn't say that, which is why Mark didn't realize that was the case, and I didn't either. Ultimately that patch was not accepted: the bug was closed after a fix to the documentation that simply referred people to `&` and `|` from the help for the bitwise operators. The takeaway points from this section are that working patches are not always going to be accepted, even if nothing is wrong with the patch. And it's notable that patches that aren't accepted don't take any less work on your part for the initial submission. So what I'm trying to get at here is: really consider, before you start writing a patch, whether it is worth doing at all.
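For reference, this is the documented way to do bitwise work on raw vectors: the logical operators `&`, `|`, `xor`, and `!` operate bitwise on raws, which is why no change to the `bitw*` functions was needed:

```r
x <- as.raw(0x0c)  # 00001100
y <- as.raw(0x0a)  # 00001010
x & y      # bitwise AND -> 08 (00001000)
x | y      # bitwise OR  -> 0e (00001110)
xor(x, y)  # bitwise XOR -> 06 (00000110)
!x         # bitwise NOT -> f3 (11110011)
```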
Is the approach I'm taking here the right one, the one that's going to help R the most, and most efficiently, with my time and with R Core's time? The other major takeaway is that sometimes a patch that changes behavior is not the correct fix. Oftentimes a change to documentation is the correct fix for a particular problem, even if the issue is real. Here the issue was real: the help for bitwAnd and the other bitw* functions didn't say how you're actually supposed to do bitwise operations on raw vectors. Now it does, and so that issue is fixed without any changes to R or C code. Okay. So at this point I had something like a 50% success rate, but I was still feeling pretty good about myself; I had code that was in R, and that was exciting. And then we came to another bug, and this one is actually pretty funny. It used to be the case that if you called a binary logical operator with only one argument, it would behave as negation. So if you ran the code you see here, with backticks around `&` and then TRUE, it would negate TRUE and give you FALSE. That is obviously a real bug; that's not what it should do. So I decided to develop a patch. The details of this are not too important, but I ran the code that was buggy, it was no longer buggy, it seemed to work, and so I submitted the patch. And then I heard from Martin. I had said that I tested my patch, because I ran the buggy code and it wasn't buggy anymore. And Martin came on and said, well, you didn't really test the patch, because you didn't run R's tests, and some of them are now failing. So obviously the patch was not in a state where it could be accepted at that point, because it failed the tests. That was not great. And the takeaway here is: always test your patches.
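For reference, the buggy behavior looked roughly like this (a sketch; the exact error wording in fixed versions may differ):

```r
## Before the fix, calling a binary logical operator with only one
## argument silently acted as negation:
`&`(TRUE)   # buggy versions returned FALSE, as if you had written !TRUE
## After the fix, this signals an error about the missing second argument.
```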
Every change is a change. Any time you touch anything, even if it's just documentation, you really need to run the checks and run the tests before submission. That is something I'm going to say a number of times over the course of this tutorial, because it's very important to remember: things can seem small, but you still run the tests every time. And if you do `make check-devel`, that will actually run R CMD check on all of the base packages, which will also check the documentation, which is why I said you should be running these tests even for a documentation change. So, yeah, go ahead.

There was a question to me in the chat about how to ask questions, and I forgot to mention that. Originally we said that people should most conveniently just open their mic themselves and ask, because that always works; you have to be a bit brave. Otherwise, feel free to interrupt at any time to ask questions if anything's not clear. That is the most lively thing for everybody, rather than typing somewhere that we then have to watch, the Slack and the chat and so on. We try to watch both, me mostly the chat, because that's integrated into Zoom and I'm used to it. But the best is if you open your mic and just ask, and please do that, because it's boring anyway to sit in front of a screen instead of being among other people; if we hear other voices, it's much better. So, is there an actual question now, other than how to ask questions? A meta question. Okay, well, if there is, again, please interrupt. You can also try the raise-hand button in Zoom, though I don't actually know what that will look like.

Yeah, questions are very welcome. We're here to teach you, for you to learn this stuff; we're not here just to listen to me talk.
So whenever any question does come up, please feel free, as Martin said, to interrupt, and we will get it answered. Barring there being another actual question now, we'll get to the next stage, which is my very first feature addition, which roughly zero people know about or ever use, but it's still in there, so that's pretty cool. This is a change to how you can debug things that are S3 or S4 methods. I added a signature argument to debug, so you can actually debug methods with debug instead of having to use trace. And there's also this debugcall function, which we'll maybe look at a little more later. This is something I'm pretty proud of, the first feature addition. But there are some things about it you may not know, even if you had known about it in the first place, which again you probably didn't; there's no shame in that, because basically no one does. Some things to keep in mind: I worked with Michael Lawrence at Genentech, that was our day job at the time, and he had just been elected to R Core. There is literally a 20-plus-comment conversation on Bugzilla about this patch, and there are four distinct versions of the patch that I developed over the course of those discussions and iterations. Something more to know is that I actually disagreed with Michael and Martin about the design of the API for debugcall. I wanted it to actually run the call, whereas Michael and Martin preferred that it not run the call, just set up the debugging, so you can run the call later if you want to. But of those four patches, the second, third, and fourth, all of the ones after the first, implemented their preferred API rather than mine. And that's because they are the ones on R Core; they're the ones who are actually going to own this code once it's been accepted.
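A minimal sketch of what the feature enables; the generic and class here are made up for illustration:

```r
## Debug an S4 method directly via debug()'s signature argument,
## rather than going through trace().
library(methods)

setGeneric("area", function(shape) standardGeneric("area"))
setClass("Square", slots = c(side = "numeric"))
setMethod("area", "Square", function(shape) shape@side^2)

debug(area, signature = "Square")       # flag only the Square method
isdebugged(area, signature = "Square")  # TRUE: the method is flagged
undebug(area, signature = "Square")     # un-flag it again
```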
So they always get the final say on what goes in. And it was accepted, but unfortunately not until after feature freeze for that release. So it actually remained in R-devel for basically an entire year before it was in any released version of R. It is in released versions now, because this was a number of years ago. And the final thing is that the patch was still further refactored by Michael in the process of putting it in. These are all things that are going to happen; there's nothing wrong with this. This is the process working correctly. Sorry, yep?

What is feature freeze?

Right. Feature freeze is something software projects typically do when they're about to release a major new version. Some time before the release, they declare what's called a feature freeze, which means no new features are going in; only bug fixes to changes that have already been made will go in between then and the actual release. The reason we do that is that any time you add a new feature or make major changes to code, there's a much higher probability of introducing bugs, so you want a testing period where you're not making any disruptive changes. You're only testing the changes you've already made, to make sure there are as few bugs in there as possible. That's what a feature freeze is. Martin, what's the actual timeline for R, like two weeks or a month before release?

Yeah, there is this once-a-year minor release; in some sense it's major, because it only happens once a year. Those have a feature freeze period of even three or four weeks, whereas the patch releases have a very small, like one-week, feature freeze period. And by the way, if I may mention, Gabe mentioned the three posts on the R blog.
The third one, the last one, was just earlier this spring, where Tomas Kalibera wrote: please test R. And that was exactly during the feature freeze period, where we say, okay, in a month there will be the new version, R 4.1.0; that was in March or April. At that point R-devel is basically the next version of R, because there will not be any new features; we are in feature freeze. So please test now: if there is any bug we could fix before release, we would be really grateful. That was the whole topic of that third blog post that Gabe mentioned ten minutes ago.

And in fact I should have had slides to this effect, I don't, but that is a major way you can help R without ever touching code: not very many people use R-devel or the release candidates, which is what the builds between feature freeze and release are called. Not very many people use these, because most people use R in production to actually do their jobs, but any code you can run in these release candidate versions of R, to see if things are working, is extremely helpful to the R Core team and very appreciated. So I encourage you, to whatever extent you can, to test R-devel generally, but at the very least these release candidates, because that is a very impactful, helpful thing you can do for the R project. That was a good question; there's no reason people would know that if they're not already involved in software releases. So that is what a feature freeze is and why we do them. Any other questions so far? Okay, we'll keep going. Again, any time you don't understand something I say, a term I use, just pop in and ask; we're happy to clear those things up.
So the next thing is, if you have heard of me, this is probably one of the things you've heard of me for, which is the ALTREP framework. It's the single largest thing I have helped contribute. I don't want to say I put it in, because it was very much a collaboration, and Luke took the lead, because it was such a big, deep change. But I did make the proposal: I proposed ALTREP at the DSC, the Directions in Statistical Computing conference, which is an invite-only research conference slash R Core meeting. It's typically attached to useR! when useR! is in person: many of the R Core members go to useR!, so they're around, and they hold their meeting there, and they run it as a conference, with talks and so on. I gave a talk when it was at Stanford in 2016, where I proposed what I didn't yet call ALTREP; I called it a framework for custom vectors. But what eventually became ALTREP, I proposed in 2016. I had already been in contact with Luke before the meeting, so there was informal interest in something like this, if I could make the right proposal. And this was such a big change that R Core actually voted on whether it was a good idea, as far as I know, although I think Luke thinking it was a good idea affected many people's votes; we'll talk more about that later, but that's basically how things generally go. It was accepted as a plan, as a way to move forward. This was an enormous internal change. And something notable, which I'll mention a couple of times: the very first two words in the title of the talk were "backwards compatible". The fact that it was backwards compatible was, I think, the only reason they could even possibly consider a change like this.
Over the course of 2017 and 2018 or so, ALTREP was implemented, and it is in R now. As R users, you wouldn't necessarily ever have interacted with it directly, although if you use, for example, the vroom package by Jim Hester, it uses ALTREP under the hood to get some of the speedups it gets. I had a proof of concept: a vector that was a window into part of the data of another vector, without copying it. And it worked, I was able to do that, but it didn't respect R's copy-on-write semantics, which Luke pointed out as soon as I showed it to him. Luke implemented the ALTREP framework. I contributed code: some of it is still there; some of it had bugs, and so was either not accepted or was taken back out. After that happened a couple of times, Luke ultimately asked Michael, who I was still working with at the time, to review my code internally before I submitted it to him. That, of course, was extremely difficult and painful for me to hear at the time, but ultimately it was understandable. Luke is very busy, and these were very big, very deep, very important changes being made; any bug that got in would be quite damaging to R and to R's users, and we just couldn't have that happen. So he asked that that extra step be taken. But that wasn't me being kicked off the project. I continued to be involved, and I have since submitted several patches implementing further ALTREP support in the R internals, some directly to Luke, some to Bugzilla, where they are often reviewed by Luke, because he is still the point person for that kind of thing. So I have continued to be involved and to contribute meaningfully even after that extra step was asked for. And the takeaway here is that code submitted to an R Core member really should be very mature.
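You can catch a user-visible glimpse of ALTREP with R's internal inspect utility; a sketch (output details vary by R version):

```r
## In ALTREP-enabled R (>= 3.5.0), 1:n is a compact sequence:
## the values are represented as (start, length) and only
## materialized into a full integer vector when something needs them.
x <- 1:1e8
.Internal(inspect(x))  # shows a "compact integer seq" ALTREP node
```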
And this is difficult to do, right? If you're early in your career, or even late in your career but early in your career of this type of work, it's hard. Code review can be painful, and needing to be reviewed can be painful, but it's a great tool for getting better. So I encourage you, any time you're writing software, whether it's patches to R or your own packages or whatever, to view code review as a tool to make yourself better and to make your code better. The other thing to keep in mind is that R Core collectively has very little time, spread across their own work improving R, their day-job duties, and all of the collaborations and all of the patches they're considering. That's something we as external contributors need to be cognizant of and keep in mind. So near the end of the ALTREP times, there was another post, by Michael Chirico on R-devel, pointing out that if you have a matrix with thousands of rows and thousands of columns and you call head on it, you get six rows and all thousands of columns, which was not what he really wanted in that case. He had actually asked that it just not do that, but that would have been an extremely non-backwards-compatible change. I responded on the mailing list with a proposal, essentially, for a backwards-compatible version, where you could optionally slice a rectangle rather than a strip of rows, but where the existing behavior would be preserved if you didn't do anything special. Martin responded quite favorably to that post on the mailing list, and so, over the course of time, I designed and proposed a patch. It passed all of R's tests, because I had learned that this is something you need to do, and so that was good.
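In released R (4.0.0 and later, if I recall correctly) you can see both the preserved and the new behavior; a quick sketch:

```r
m <- matrix(1:100, nrow = 10)

dim(head(m))               # 6 10 -- old behavior preserved: 6 rows, all columns
dim(head(m, n = c(4, 3)))  # 4 3  -- the new option: a 4 x 3 corner rectangle
```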
I had done what I could do to test it. And then Martin responded on Bugzilla, because we were in Bugzilla at this point, that there was, quote, quite a bit of breakage on CRAN, end quote.

Yep, I can tell you about that. Because I took your patch and put it into the R sources, it was my responsibility. And of course I did not run the 16,000 CRAN packages to see if they still pass all their checks. But the CRAN team somehow did, after it was in R-devel, and they got back to me: well, there are these 20 CRAN packages that no longer run their checks correctly because of the change that you, Martin, put in. That's why I came back to you. That's how it works. You know that, but the audience probably wouldn't.

Right, yes, thanks. A question following Martin's comments: what kind of breakage was it, Martin? Was it that some people were using head in code where they expected all the columns, or something else?

Honestly, I'm getting older and I forget. Well, actually, I did look at it again in the process; Bugzilla is really a good archive of things that happened, by the way, and that's interesting in itself.

So one of the things that broke, and we'll get into this a little bit, was something I hadn't considered at all: people were calling head on expression objects. I had no idea anyone was going to be doing that, and my new code didn't work in its initial form on expression objects. So there were things like that that broke.

And not only expressions: basically, people used head just to say, I want the first six elements, and that would work on any R object that allows subsetting.
So expressions were just one of them. And I think your code basically assumed atomic objects, or something like that; I forget.

Yeah, I don't remember exactly what the initial version did, but it didn't work for that. It does now; the version that's in there now works for expressions and everything else. But with the initial one, I hadn't even thought to test whether it would work on expression objects, because I don't generally call head on expression objects. But some people were, and luckily they had tests that would break if it didn't work.

So when you come across something like that, you've worked out that some people are using head on expressions and it's not doing what they expected. What did you do then? Did you just say, right, I need to make it work? Or did you go back to R Core and say, should the CRAN package maintainers fix their packages? Who has to change, and how do you decide that?

That's a good question, Heather. Typically that's a discussion, either between me and the CRAN team, if I put the change in, or even within R Core. Sometimes we say, okay, this actually should change. But here, I basically decided on my own: no, this shouldn't have happened. Because the old help page for head and tail (there is also the tail function) basically said: this is just a more convenient way of saying `x[1:6]`, with the default argument n = 6. And `x[1:6]` should continue to work; this should continue to be the same functionality, even with the extension. Also, the NEWS entry we put in for the change said this is backwards compatible, and backwards compatible really does mean it should keep working everywhere that bracket indexing works. And in general, Heather, you're right.
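The expression case looks like this; after the breakage was fixed, it behaves like plain bracket subsetting, as the help page promised:

```r
e <- expression(a + b, sin(x), log(y))

head(e, 2)                     # same as e[1:2]: expression(a + b, sin(x))
identical(head(e, 2), e[1:2])  # TRUE
```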
Sometimes we come back and say, well, there are a few packages, and they use bad programming style in this part of their code anyway, and then we rather help them change their code, because we think the incompatible change is much more important than backwards compatibility. But that is rare; as Gabe mentioned, backwards compatibility is a big thing for a project such as R.

Yeah. And the other thing I will say from my side is that I had intended the change to be backwards compatible. So it was a bug in my code that it was not fully backwards compatible, and I had not sufficiently tested it, because I hadn't tested it on things I didn't consider people would be using head and tail on. Another aspect of the answer to your question is that Martin at the time requested that I not refine the patch and submit a new version, because he was iterating on a local copy to track down and fix these breakages. When you give code to R Core and R Core decides it's going in, you're giving them the code; it's their code now. Often, if they think it's a good enough idea or a good enough fix but it doesn't quite work, they'll come back and say, please fix this, please change it in this way. But sometimes they'll say, it's just more efficient for me to iterate; this is a good base for me to start from, and now I'm going to get it into a state where it's actually ready to go into the sources. Martin did that in this particular case, at least at this point in the story: he said he was going to make some changes and continue the work so that it would be ready to go in. And he did; he got the breakages fixed. And then another person joined the discussion on Bugzilla, named Suharto Anggono.
He suggested numerous changes to the patch, which was quite frustrating to me at the time. But he correctly identified multiple flaws in the initial version of the patch, and he also suggested a number of implementation-approach changes, some of which Martin agreed with, and some of which Martin didn't, sticking with my original instead. The takeaway points here: patches are not your baby. That's not what's happening here. When you are contributing something like a patch, the goal is to make the patch as good as possible, regardless of who touches the code or who ends up with the final design of a particular piece of it. And critiques can feel like attacks, especially if the language being spoken isn't the first language of someone you're interacting with; that can affect the communication, and you can feel like you're being attacked when they're really just trying to help. But it's important to remember that better patches make R better, and learning from this type of critique will make your patches better. Sometimes you don't get it right the first time, so when you're iterating, being open to this type of input and review is important. The other thing to know is what we talked about: it is often easier for R Core members to make changes themselves than to accept or iterate on patches, and that's something we should keep in mind as well. I have found R Core to be quite open to collaboration, and you've seen that I've been pretty successful getting both bug fixes and even new features in. But it is often essentially the case that R Core, or whichever R Core member you're interacting with, is doing you a favor, in a sense, by taking your patch and vetting it rather than writing the code themselves. And I've found them pretty happy to do it and to engage me in this way.
But it is something to keep in mind that a lot of people don't realize. I've seen open source developers from other big projects talk about this too: it's basically a service they're performing by interacting with patch submitters rather than just fixing things themselves once the problems have been identified. And it's a really valuable service, because it helps the rest of us up our skills so that we can contribute more, and better, in the future. But it is something they're doing that a lot of people don't realize. So the final big thing I have done, quite recently, and actually still only in R-devel: I made a significant, quite large speedup to duplicated, unique, and anyDuplicated for ALTREP vectors that know they are sorted, on the order of tens of times, 10x to 60x depending on the situation. This is a very large chunk of code, hundreds of lines of C. I partially or fully implemented the entire thing multiple times. I tested it and fixed bugs in my implementation, and I wrote an extremely exhaustive testing script covering every corner case I could possibly think of. This was all before the very first submission to Bugzilla for consideration by an R Core member. So this was a large amount of work for me to do. In the past, that kind of work had happened during the iteration process with the R Core member I was working with, but for this one I really wanted to try to save them that time, and to save myself the pain of giving them something bad and then trying to improve it. So this took me months, not of working on it full time, but of coming back to it and looking at it again, before it was even submitted for the first time.
But then once it was passing all of my tests and all of its own tests, and it seemed to be pretty good, and I had refactored it into a cleaner implementation of my initial design, I submitted it. And then I waited. Luke, who was the person who would normally consider this type of thing, was focused on the native pipe, which I'm sure you're all familiar with, which was just added to R recently. And so he didn't have time to closely vet hundreds of lines of unsolicited C code, which is essentially what this was. So the patch sat in Bugzilla for a long time. I pinged Luke, I think, once or twice, and he said that he had seen it but didn't know when he was going to be able to get to it, because he was focused on other things at the time. And that was nothing personal. Luke is a professor; he has teaching duties and things like that, and he was working on another pretty big change, this native pipe. He just didn't have the bandwidth. So then Michael Lawrence contacted me, and he and I collaborated on getting the patch in. He asked me to formalize the test script into unit tests, which, in retrospect, I should have done from the start. But ultimately we got it in. We had hoped it would land in time before the feature freeze, but it didn't, and so it's in R-devel now. So if you check out R-devel and you give duplicated() a sorted vector, it will be much faster than you expect. But again, we're in a waiting game: from the time it went in, it will be a year before it's actually in a normal release version of R. And so the takeaway points here are that sometimes, many times, a suggestion or patch not being taken up has nothing to do with the suggestion itself, especially if it's unsolicited. R Core are very senior people.
They have a lot of demands on their time, and so sometimes there just isn't the time or the bandwidth. Another thing: don't submit significant code changes without regression tests. This is true of everything. You shouldn't really be submitting significant changes to R unless you've heard from R Core that you should anyway, but any time you submit significant changes to any piece of software you're working on, you should add regression tests to make sure the changes actually work and do what you wanted them to do. And the last point is just to be patient. If you've heard that they're going to get to something, generally they will get to it, but it can take a long time depending on what the thing is and what else is going on. So occasional reminders are okay, but nagging is really not going to help you; you're just going to annoy them. It's not going to magically give them more time than they had before, right? They're not ignoring you out of any sort of malice; they just have other priorities. Luke, I think, has a list of priorities of what to work on in R when he has time to do that, and this patch was somewhere on the list, but it wasn't at the top, so he was working on the stuff at the top of the list. So, the overall takeaways. That's been a run-through of my history of recent external feature submissions to R. I obviously have not added every single feature that was added by an external contributor, but a lot of the recent ones were mine. So the takeaways: no one's perfect. You don't have to be Luke Tierney or Martin Mächler to contribute to R, as you saw from the numerous mistakes of mine that I just walked you through. But when we're collaborating with R Core, it's our job to make as little work as possible for the R Core member who's collaborating with us.
And we need to always be respectful and not demanding of their time, because again, remember, it's somewhat counterintuitive, but they are in a sense doing us a favor, doing us a service, by engaging with us on these types of things, especially given how busy they are. I mean, they have things they're interested in working on, things they'd like to do in R themselves, and they're taking time away from that to work with us. So we should keep that in mind and be respectful. Another thing to keep in mind, which is difficult for some people to accept, is that sometimes the answer is no, even if you still think something's a good idea. And even if something actually is a good idea, which is different from you thinking it's a good idea, the answer can still be no. And if that is not something you're going to be able to accept, this isn't the right game for you; you're not going to have a pleasant time of it if you aren't able to accept a no from R Core, because you will get them. I get them; everyone gets them. Sometimes the ideas are bad, sometimes the ideas are good but the bandwidth isn't there; there are a lot of different reasons. And never shoot from the hip, right? Every change needs to be vetted, which we talked about before. So always test the things that you're submitting. Run make check-devel before you submit any patch to Bugzilla, and submit some form of testing or confirmation, whether it's a test script that the R Core member considering it can run themselves and design tests from, or formal tests. But formal tests are actually not trivial to design, and so you have to be confident in your ability to design the right tests before you do that.
But at the very least, submit the script you were running while developing the change to confirm to yourself that it was doing the right thing. And a note here: if you don't have a script like that, your patch is not ready to submit. That's something to keep in mind. So, contributing to R. Now we're going to actually talk about the contributions that people like us can make. So first off, confirming bugs. This is one of the things that was asked for by R Core in the first of those blog posts that I mentioned. Basically: is the bug real? Can you still see it? So start up a development version of R (R-devel), run the code that's reported to be buggy, and see whether it actually gives the reported behavior. Can you confirm that this is a real thing? Particularly if you're on a different operating system than the initial reporter, this is actually quite useful. The next step after that is generating reproducible examples that only use base packages. A lot of times, inexperienced bug reporters will just take whatever code they were running that gave them what they think is a bug and paste that into the bug report. And almost no one uses only base R in their actual work, so generally that code is going to involve a bunch of external packages. The functions in those packages will call other functions in those packages, or in other packages, and eventually something in base R gets called, but there will be multiple layers on top of that. None of those layers are helpful when we're actually trying to diagnose or fix a bug in R itself. And so taking that report and translating it into an actual reproducible example that has only base R code in it is an extremely valuable thing to do. And if you can't do that, then it's probably not a bug in R at all; it's probably a bug in one of the external packages that was being used.
So the next step above that is careful bug analysis, and this is actually extremely, extremely useful; in some senses, often more useful to R Core members than patches when they're trying to fix something. Because remember, the goal is not our achievement as individuals; the goal is to make R better and to fix problems that are in R. And a lot of times this is something that is really, really useful. So basically, once you have this small example, you look at it and you debug it, essentially. This is debugging in the same sense that debugging your script or package code is debugging; you're just debugging R. So start from the thing that happens that is a bug, or that shouldn't be happening, and figure out as much as you can about it. What functions are in play? How are they interacting with each other in a way that isn't useful or isn't by design? And here are some of the tools. You've got traceback() and options(error = recover), which is my personal favorite. But you've also got debug() and debugonce(), because not every bug is going to cause an error. Actually, the worst bugs are often the ones that just give you back an answer that's wrong. Silently wrong is the worst thing that software can be. And for those, traceback() and options(error = recover) are not going to be helpful, because there's no error; you'll be using debug(), debugonce(), or trace() instead. Another thing that we didn't put on the slide is options(warn = 2), which converts warnings into errors, because sometimes there's a warning but no error, and you can force the warning to be an error so you can then debug it as if it were one. So that's another useful thing. And Martin added a hint here, which is ls.str(), which is often quite useful. So now I'm going to stop talking for a little bit, and we're actually going to get our hands on a bug.
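As a small illustration of the warning-to-error trick just mentioned (the function f here is a made-up example, not code from the bug we discuss):

```r
# Hypothetical example: promoting a warning to an error so the usual
# error-debugging tools (traceback(), options(error = recover)) apply.
f <- function(x) {
  if (x < 0) warning("negative input")  # a warning, not an error
  sqrt(max(x, 0))
}

f(-1)  # with default settings this just warns and returns 0

old <- options(warn = 2)  # warnings now become errors
msg <- tryCatch(f(-1), error = function(e) conditionMessage(e))
options(old)              # restore the previous warning level
print(msg)                # the warning text now surfaces as an error message
```

Once the warning is an error, you can also set options(error = recover) and land directly in the frame that raised it.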
This is a real bug, from the version of R that we asked you to come prepared with in the Docker container. So please start your Docker containers, which have version 3.3.2 or somewhere around there. It looks like, from what Martin added here, it was fixed in 3.4.1, so anything between 3.3.2 and 3.4.1 is fine. But the Docker image that we gave you instructions for has 3.3.2 in it. So start that up and then run the code that you see on the screen there, that hist() call. And we are going to go ahead and put you into breakout rooms now so that you can actually do a code analysis. Okay. So welcome back. Hopefully that was fun for you. This is a real bug. This really happened and was really fixed in R. And so now we're going to talk a little bit about what went on and what you found, and anything that didn't get found, we'll walk through. I will also tell you now that the initial report for this bug was extremely high quality. It was extremely well done. And Martin has the bug number, so afterwards, if you're interested, you can look at the report, which did a really exhaustive bug analysis, without a patch, but which was still extremely helpful when Martin went in and fixed it. So this is what you should have seen. There's a warning. And then Martin did the additional step of running traceback(), which gives you the call stack for the last error, in reverse order. So I guess we can start with what people found. We asked each breakout room to nominate one or more speakers to talk about what they found, so this is the more interactive part. If someone wants to step up and just talk about their approach and what they found? We can go through rooms two, three, four, five, the four rooms maybe; I don't remember the number. I can start then. It was a lot of fun, so thanks a lot for providing this very good exercise as an introduction.
It was quite a rabbit hole, I would say. We started by looking at the pretty() function, then looked up nclass.FD, and ended up looking at IQR(). And I think we finished looking at quantile.default in the stats package, trying to figure out where this rounding, or floating-point, or whatever kind of error might have happened. One line looked a little bit suspicious, and we did not really grasp why the fuzz variable was 4 * .Machine$double.eps. And that's when time was up, so we did not manage to debug this further. Okay, that's fine. Again, this is not something that happens in the span of half an hour in real life, which is another way of saying: don't submit a bug report if you've only thought about something for half an hour. It should take longer than that. But I did hear a number of the things that were ultimately right and ultimately related, so thanks for that. I think we'll go to the other rooms and see what they found, and then I have some slides that will walk us through what really happened. Yes. So we went down basically the same rabbit hole. But first of all, we read the documentation for hist() and tried to see whether a change in parameters could help isolate the source of the error. So we tried the same vector with different binning methods, and the same vector without this tiny rounding offset. After that, we ran down into nclass.FD, which gave us a gigantic number for our vector. And I think we concluded that it wasn't a problem with the interquartile-range methodology itself; we thought the methodology was reasonable. Correct. And then we started discussing what should happen if the breaks parameter receives such a large number. Because since it's such a large number, the as.integer() call in the pretty() function was where it broke. So we dove into the source code, the C code, all the way up to finding that the problem was indeed that the number was far too large.
And then, understanding that as the true problem, we started a very interesting discussion of where it should be fixed, because we imagined that we shouldn't change it at the C level, for example by making it tolerant of numbers larger than the integer maximum, or what happens if R switches to 64-bit integers in the future. So where should this fix be addressed? That, I think, was the most interesting part. Okay, well done. And the next room, in turn, has Jonathan Keane, Lambda Moses, and Pantelis on this. Do you have a speaker? Yeah, I can do it. So initially we realized that you have this vector where one entry is an integer value with a very small epsilon added. And if you coerced it to integer, there wasn't a problem running the function, but with that epsilon, the problem appeared. And then we did some debugging and realized that this was causing the number of breaks to be very large. And so we dug down into the nclass.FD function to see why we were getting this large number of breaks, and we saw it was because the interquartile range was very close to zero, but not exactly zero. And we noticed that if it were exactly zero, then there was a block of code to handle that particular case. But it wasn't exactly zero, so it was skipping that and going to the next block of code, where the division by this near-zero number was causing it to return a huge number of breaks. So we think that's the root of the problem. And we discussed possibly changing the condition that the interquartile range be exactly equal to zero to a condition that it be less than some tolerance, as a possible solution. And then we took a sneaky look at the real code, the current code of today, but I won't talk about that because I think you're going to. Yeah. Great. And then the last room; some people left the room, but they're still here.
I think Laura was there, because she was sharing the screen. When I joined, there were Sanjit Kaur and Fungi, and someone else. Yeah, hello. Maybe I can talk for a second; Sanjit, let me know if you want to. No, please go ahead. Yeah, I think in the end we did the same kind of debugging as the others. We did a traceback and then we realized that the issue was coming from pretty.default, and actually, a bit further up, that the number of breaks was probably not being computed correctly. So we also went to the nclass.FD function, and there we saw that the interquartile range was very small, though to us that also made some kind of sense. And in the end we did find out that if we tried to reproduce this example, but without the small epsilon, there was indeed a workaround implemented within nclass.FD. And then the question remained: where should this issue be fixed? Should pretty.default be able to handle a big number, or should the fix be in nclass.FD, setting a different number of breaks for these cases? Great. So yeah, really good job, everyone. If I can add something: something interesting that we discussed is that nclass.FD is implemented based on a paper from, if I remember correctly, 1981. So if this works on a theoretical level, but not inside the computer, should you fix the function? Should you still keep the model from the paper? It's a broader question, I think, but it was interesting to discuss a little bit. Yeah, that's a very good observation, and Martin will talk about that in just a moment. But before that, we'll go over it; I think across all of the groups, collectively, all of the major pieces were hit. There we go. I'll go through this pretty quickly. So the issue, the proximal cause, is that pretty.default complains that it got an invalid n, right? It received a value that it didn't like.
So what does that mean? What is n doing? Can anyone tell me what n in pretty.default actually is for? What does that parameter do? And the reason I'm asking is that when you're talking about the behavior of a function being buggy, it's very important to know what the function is supposed to be doing at any given time, because a bug is only a bug when the function is not doing what it should be doing. So does anyone know that off the top of their head? If not, that's fine; I can answer it. But that is one of the types of questions we want to ask ourselves when we're debugging. It's easy and fun to dive straight into code, but it's also a good idea to make sure you know what should be happening at any given stage, so that you can check that it is what happens. So I'll go ahead and answer. n is the number of breaks that you want to be made pretty, because that's what pretty() does: it finds break points that print nicely, that aren't in scientific notation with tiny epsilons added to them, and such. So it's a number of breaks, and ultimately pretty() got a number of breaks it didn't like. Okay. Sorry, I've spent many days of my life with pretty(), improving it over the history of R, so: n is an approximate suggestion for the number. It cannot be exact; the final number is often different from the one you give it. Because the really important thing is that you get decimal break points with very few digits and so on. That makes the whole pretty() functionality itself very interesting and very challenging. But the interesting thing, as you said, is that in the end it's not the pretty function that has any problem; it just exposes the problem. Yeah, so thanks Martin. Things like that are also important, right? The difference between "this is the number of breaks" and "this is the approximate number of breaks" is an important one.
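A quick illustration of that point (a toy input range, not the bug's vector): pretty() treats n only as a suggestion and returns nicely rounded break points covering the range.

```r
# n is only a suggestion: pretty() returns round breakpoints that cover the
# input range; the actual count of breaks can differ from n.
b <- pretty(c(0, 9.7), n = 5)
print(b)          # round numbers covering [0, 9.7]
print(length(b))  # not necessarily n + 1
```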
And in some cases, not in this particular case, but in general, that distinction is very important. So in what sense was the n invalid? Well, we have a warning that helps us understand that. If you look at the output that was actually given, there's an error, which is the thing we're ultimately trying to debug. But underneath the error, you'll see that a warning was also raised, and the warning mentions NAs being introduced by coercion. And so that gives us a hint. It doesn't tell us the answer, but it hints at the type of invalidity that n had. There are a number of different things that, when values are coerced, can cause NAs. And at this point in the slides, we don't quite know which kind, but we know one of those things happened at some point in this process. And often, when that happens and it raises a warning, it means the code didn't expect it to happen, and that can often be involved in the ultimate problem, which it was in this case as well. So then, as many, maybe all, of the groups found, what was ultimately happening is that n is just enormous, extremely large. And it's being passed in: if you look at the traceback, it says n = breaks. So now we have to figure out where this was called from, which I think everyone found was the hist.default function, a method of the generic function hist(), which was what was ultimately called. So hist.default has a breaks argument, and that breaks is being used somehow, but we passed breaks = "fd", a character string, to hist(), which then passed it to hist.default. But by the time it gets into pretty(), it's not a character any more; a character would also have caused an "NAs introduced by coercion" warning, but that's not what's happening. If we debug pretty() and look inside, we see a really big n, not the string "fd".
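That coercion hint can be seen in isolation (the number here is made up for illustration, not the one from the report): any double beyond the 32-bit integer range coerces to NA with exactly that warning.

```r
# A double larger than .Machine$integer.max cannot be represented as an R
# integer, so as.integer() yields NA with an "NAs introduced by coercion
# to integer range" warning.
big <- 7.7e14                      # far beyond 2^31 - 1
print(big > .Machine$integer.max)  # TRUE
n_int <- suppressWarnings(as.integer(big))
print(n_int)                       # NA
```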
And so what's happening there: there's a switch statement inside of hist.default that switches between the different algorithms for automatically deciding how many bins there are. And one of them is "fd", which is the case we're in. I saw that one of the groups experimented with changing breaks to the other algorithms and saw that they did not have problems, but "fd" did. And so that causes a call to nclass.FD, which, at the time of the bug report, and in the R versions you were using, looks like this; this is the full code. So what's the problem here? The problem, as was mentioned by at least two or three of the groups, is this: h is the interquartile range, and h being identically zero, exactly zero, is handled by this special case, and then it's handled again. So if h were zero, it would take the mad, the median absolute deviation, and if that is greater than zero, then it does something with it; if that is also zero, then it returns 1, so you get an n of 1. But as Heather mentioned, and as the other groups found as well, if h is really close to zero but not zero, these special cases, these protective, defensive-programming measures, are not hit. And so what's happening with this data is that you get a very-close-to-zero but nonzero h, and then you're dividing by that number inside the call to ceiling(). And when you divide by something that's really close to zero, you get something that's really far from zero: very big, right? And that's what's happening here. So this is ultimately where the bug was: if you have an interquartile range that is exceedingly small, but not exactly zero, the implementation of Freedman-Diaconis explodes the n passed to pretty(), which explodes the number of bins the histogram attempts to make. And pretty() says: no, I can't do that.
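The core arithmetic can be reproduced directly with the formula from the code just shown (the vector here is a toy stand-in, not the report's original data):

```r
# Reproducing the failure mode: an IQR that is tiny but nonzero skips the
# h == 0 guard, and the Freedman-Diaconis bin count blows up.
x <- c(1, 1, 1, 1 + 1e-15, 2)  # toy vector; IQR is ~1e-15, not exactly 0
h <- IQR(x)
n <- ceiling(diff(range(x)) / (2 * h * length(x)^(-1/3)))
print(h)                        # tiny but > 0, so the guard is skipped
print(n)                        # astronomically large
print(n > .Machine$integer.max) # TRUE: coercing this to integer gives NA
```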
Now, another thing that you could have done: if you look at the bug report, at the actual vector that we're using, you see that one of those ones has an epsilon added, as Heather said. You can make that epsilon bigger and see what happens. If you change the exponent to, I think, negative five instead of negative fifteen, what will end up happening is that it will not throw an error, but you will get a really ugly histogram, right? And so that is another way of arriving at the fact that this little epsilon is what's causing the problem. Because of the well-constructed example vector the original bug report had, that epsilon is actually what's driving this, and that epsilon drives the interquartile range. Essentially, whatever that epsilon is, that is the interquartile range of this vector. So you can actually wiggle the epsilon around and see what happens with the different interquartile ranges. And you'll get some really funny things: you'll get a histogram that has something like 80,000 bins in it, and all of the data is either in the first one or the last one, and things like that. And so that's another indication that even when the code doesn't throw an error, this is probably not what the person calling the histogram function wanted to get. So yeah, as we found, it's dividing by an essentially-zero but not actually zero number, and that is driving the bug. So what now? A number of the groups arrived at that point and then started thinking about what to do next, and realized that that's actually pretty complicated. So I think I'll leave that to Martin as the resident expert on these types of things. He knows much more about them than me, so I will let him talk about exactly why that is complicated. Yeah.
Anyway, many things have been said already. The goal of the nclass functions is still to find an optimal number of histogram bins. The mathematical statistics behind it, also in the Freedman-Diaconis paper, is that the number of histogram bins should be such that the histogram is as close as possible to the underlying density function, assuming there is a smooth density function. So that's the mathematics behind it. And by the way, that's where this power of minus one-third comes from: n to the minus one-third, where n is the number of observations. That's basically what the Freedman-Diaconis paper derives: n to the minus one-third, and the factor that you need with it. Well, anyway, there is one philosophical remark that I want to make here. I'm an extremist, because I always find, well, this time somebody else found, such problems, because of this sentence here: published algorithms almost never take into account the most extreme boundary cases. I've learned this over the years; I didn't know it twenty years ago, but it's a fact. And here it's absolutely something you have to worry about. And the other fact, which you'll see if you ever take the time to go to the bug report and the analysis, where I talk about this also: the interquartile range is a nice, easy-to-explain, robust scale measure, right? The standard deviation is not at all robust. And for histograms, for histogram rules, it's important that one outlier in your data does not determine the histogram bins for everything. So you want a robust scale, and the IQR is a robust scale. And so continuing here: the fact is that a software implementation of an algorithm and the publication of an algorithm are really not the same thing, even if the publication is peer reviewed and so on. And actually, the Freedman-Diaconis paper was not publishing an algorithm.
It was publishing a quite mathematical paper on how to optimally choose the number of bins. But even when algorithms are published, people typically forget things such as the IQR being able to be exactly zero. That's actually on the next slide, which I can show now. The original function, and this was introduced into R in 2001, actually from the MASS package, where it was written in the last century, just looked like that. So here they didn't use IQR(); they did it in two steps: quantile() for the 25% and 75% quantiles, and then the difference of the two values. They used as.vector() to get rid of the names. And the rest, of course, you see, is the same, ending in ceiling(). So the case h == 0, as I write here, wasn't taken into account at all. If the interquartile range is exactly zero, h is exactly zero, you divide by exactly zero, and you get Inf. And so that's what I say on the next slide: not dealing with the case where h is exactly zero was how it originally was in R. So I actually changed that in 2007, which is when the body was changed. By that time we had the IQR() function, or at least I decided to use it, because it's easier to read: if h is zero, then I try the mad, and if the mad is still zero, then we return 1, as somebody already mentioned. So that was the version before the one that you dealt with. And here, by the way, is the link to the bug report as well. So the problem is still there: I had already, at that time, handled the case of h being exactly zero, but I forgot that, of course, if it's very close to zero, you get an explosion here. And that's why we got this report later. Yeah, so that was the story. I mean, the error message pointed to pretty() being at fault, and it's natural to think that pretty() should be able to deal with any n in some sense. But that is not the case.
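The pre-2007 behaviour Martin describes can be mimicked with the two-step quantile difference (a toy vector with a zero IQR but nonzero range; this is a sketch, not the MASS source):

```r
# With no h == 0 guard, an exactly-zero IQR makes the bin-count formula Inf.
x <- c(0, rep(1, 8), 2)                           # IQR is exactly 0, range is 2
h <- diff(as.vector(quantile(x, c(0.25, 0.75))))  # the original two-step IQR
print(h)                                          # 0
nbins <- diff(range(x)) / (2 * h * length(x)^(-1/3))
print(nbins)                                      # Inf: division by exact zero
```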
I mean, really, pretty() was not touched at all in the solution to that bug. Because if people ask for a number of pretty points that is larger than the number of atoms in whatever, then forget about it. Even computers in ten years, I think, probably won't want this large a number of pretty values. Any questions about any of that? Either the code-analysis part or the mathematical strategies that Martin was talking about. Yeah, Lambda. So actually, another thing we noticed: we're the group that talked about the epsilon, and we also tried plotting just the ones-and-a-two vector, and with breaks = "fd" we got just one giant bar in the histogram. And that's for R 3.3.2. And then I also tried to plot the same thing, with and without the one-plus-epsilon element, in R 4.1, and then I got two bars. It's so different. Also, we looked at the new version of the nclass.FD function in R 4.1; it's quite different, a lot more than just a tolerance change. Yeah, maybe I'll say something about this, because I made all the changes that happened since. So it's still the case that we need a robust scale measure, as the statistician says. So IQR is in principle a good choice, but we have to deal with the case where it's zero. And when it's very close to zero, as Heather mentioned, it's basically the same as being zero. And so the first solution was actually one that I had learned a few years earlier in the context of smoothing splines, which was my PhD topic, well, not quite, but related: the idea of just rounding the numbers so that the epsilon becomes zero, but rounding in a reasonable way. And so the first fix, in the body of the function, was: instead of x, use signif(x, d) for some number of digits d, and in the end we used five.
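That rounding step can be seen on a toy vector (again a stand-in, not the report's data):

```r
# Rounding to 5 significant digits collapses a tiny epsilon to exact zero,
# so the existing h == 0 guard applies, while genuine spread is untouched.
x <- c(1, 1, 1, 1 + 1e-15, 2)
h_raw     <- IQR(x)             # tiny but nonzero -> the dangerous case
h_rounded <- IQR(signif(x, 5))  # exactly 0 -> handled by the zero-IQR branch
print(h_raw)
print(h_rounded)
```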
So that makes all the small epsilons zero, while the big epsilons are still kept. But that was not the only problem to solve, and that's why Lambda saw that there were more changes. Because, as I said, the use of the mad instead of the IQR was actually not such a good choice, even though it was also me who made that choice seven years earlier, as I then found when working on the bug report. And you can read all of that in Bugzilla; it's all there. Because the mad is very often also zero, and then you get to the last case, where you just return an n of one. And so that was not so good, and it was improved just a week later, after formally fixing the bug. I added an extra case where I replace the IQR by the difference not of the first and third quartiles, but of octiles, the first and seventh octiles, or the first and fifteenth sixteenth-quantiles, and so on. And I went all the way to around 500 or so in the denominator and only gave up after that. So it gets less and less robust, but it's still robust: one outlier still won't determine the result, even for a large range. Yeah, so in the end this was really interesting. It was about how to get a scale that is robust but does not easily become zero; robust in a different sense. And that's why, by the way, Lambda, I disagree that it's completely different. It's just that in this special case it makes an extra effort; in all good cases it gives the identical result. But of course, if you round the numbers, then you can get the difference between one and two bars, I think, because of the other change. Yeah, I think we should go on, because there are many more things that Gabe and I wanted to talk about, but there are more hands raised. Yeah, so I can't see your name, because your camera is showing: Gavin Lee and Sergio. What was the first one? Gavin. Yeah, that's me. Great. Yeah, so just more of a zoomed-out view: how would you rate this bug in the scheme of bugs that have come up in the last couple of years?
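A simplified sketch of that widening idea (an illustration of the scheme Martin describes, not the actual nclass.FD source; robust_scale is a made-up name):

```r
# Widen the quantile pair (quartiles -> octiles -> ...) until a nonzero,
# still-fairly-robust scale estimate is found.
robust_scale <- function(x) {
  for (k in c(4, 8, 16, 32, 64)) {  # 1/4, 1/8, ... tail quantiles
    h <- diff(as.vector(quantile(x, c(1 / k, 1 - 1 / k))))
    if (h > 0) return(h)
  }
  0  # completely constant data
}

x <- c(0, rep(1, 30), 2)  # quartile IQR is 0, but the data is not constant
print(IQR(x))             # 0
print(robust_scale(x))    # positive: a wider quantile pair finds the spread
```

Each widening step trades a little robustness for a smaller chance of a zero scale, which is the balance Martin describes.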
Like, would you class this as a relatively minor one, or is this, you know, a really big rabbit hole? In terms of the effort to solve it. Yeah, in terms of the impact. Well, Martin may be able to speak from the R-Core side. I chose this particular bug because it was complex enough that the solution isn't going to be immediately apparent. And another really important takeaway is that determining what is causing the problem doesn't always immediately tell you what the solution is. This bug is an example of that. As Martin just mentioned, you can have a really good code analysis and know exactly what's going on in the code to cause the breakage, and still you have to do something else in order to figure out how to fix it. And that's something that's very important for us to keep in mind when we're interacting with bugs. For a lot of bugs, those two things are very close together — figuring out what the problem is and knowing how to solve it are very close together. But sometimes they're not. And keeping that in mind as you're doing these code analyses — knowing that the code analysis is enough to be very helpful, without necessarily having the additional knowledge required to know what the right thing to do instead is — is still really valuable. So that's why we chose this bug. It's non-trivial, in that there were a few steps that we had to follow to get to where we needed to go. I wouldn't say it's incredibly subtle — you could go down into C code to see what's happening; there was a C call involved — but ultimately the issue wasn't in C code, which is more complex to debug and would have taken longer than the 30 minutes that we were going to give you. But yeah, I chose this particular bug for its teaching value.
But I would say there are many bugs that are less complex than this, and there are some that are quite a bit more. Martin, did you have anything to add to that? Yeah, I agree. And about the seriousness or importance: in some sense, hist is a function that even beginners of R use, right? It's one-dimensional data visualization, so it's among the very easiest and very first things — for some, it's the first graphics they do. And so if the hist function sometimes gives an error instead of producing your plot, that's one thing. On the other hand, you have to choose FD, which is not the default choice. And in that sense, that's why it took so many years before it surfaced. If FD had been the default choice, then it would have surfaced much earlier. Okay, perfect. Thanks. So there was another question, by Saranjeet Kaur. Kaur? Yes. So we tried giving different values to breaks. We tried 3, 5, 10, 100. And we were getting the histogram with two bars getting farther and farther away. So what I wanted to ask is: in such extreme-value cases, how is the software programmed? Like, how do you arrive at the number after which this is going to break? There was that seven-point-something number raised to some really big number. So how do you arrive at that number? Well, in this particular case, the breakage was caused by an NA. And so it's controlled by what can and can't be coerced into an integer. If the number that you end up with is larger than the maximum integer — the 32-bit integer — then it is coerced to NA, and that causes pretty to throw an error. Anything smaller than that, you get a histogram that just looks really weird. In the worst case, you can get something that takes a very long time, and then you get a very ugly plot.
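A sketch of that failure mode in R — as.integer() here is a stand-in for the internal coercion, and exact messages vary by R version:

```r
## The integer limit that controls where the error appears:
.Machine$integer.max                  # 2147483647, i.e. 2^31 - 1

## Anything beyond that coerces to NA (with a warning in interactive use):
suppressWarnings(as.integer(2^31))    # NA: beyond the 32-bit integer range

## And pretty() with an NA (or otherwise invalid) number of intervals
## signals an error instead of returning break points:
## pretty(c(0, 1), n = NA)            # would stop with an error
```

Below the limit you get a valid but enormous request, which is why the just-under-the-boundary cases run for a very long time instead of failing fast.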
So the error — as Gabe mentioned earlier, getting an error message is actually much better than getting no error message while you're just below the limit. The maximal integer is 2^31 - 1, which is about 2 times 10^9, and that is very large for plotting. And so it takes a long time, I think, if you get just below that boundary. Yeah, so we are, I think, going to move on. But after the session, if you want to spin up the Docker again and do the thing that I mentioned — where you still have the epsilon, but it's just smaller — you can actually see what Martin was describing. So we're going to try to keep going, because we have some slides to get through, and then we have a special guest coming for the last half hour: Michael Lawrence will be joining us, and then we will have a round table where you can ask us any questions that you have. So hopefully we can still answer anything that you're still wondering about more generally at that time. So, right: submitting bugs. We're going to go through this pretty quickly. It's basically: you should do everything that we just told you to do to other people's bugs to your own bug before you submit it. That's the extremely short version of this section. With the addition of: confirm it's present in R-devel. So if you're going to submit a bug, you need to be running a recent version of the development version of R to ensure that the bug is still there — because even if it's a real bug, if it's not in R-devel, that means it's been fixed in the time since the release that you are currently using, and so we generally don't need a bug report for that. Then confirm that it's in R itself, like we talked about before. Isolate it to the smallest example you can — again, like we talked about before.
And one thing I will mention: I encourage you to look at the report for the bug we just explored in Bugzilla, because that is an extremely high-quality, extremely helpful bug report by the initial reporter, with a very good code analysis — the same types of things that you were just doing, written up really well, in a way that was really helpful for Martin when he was looking to solve it. Then search Bugzilla to make sure that the bug you're talking about isn't already reported. And here's how you can do that. You can search for the function name — substring, for example, if you were talking about the substring function. And then you get a list of a number of bugs, closed or open, that relate to it somehow or mention it in comments. And basically, even if your exact bug isn't in there, if there's a bug in the same function that sounds similar, it's likely related. And so you may just have another case of a bug that's already reported, in which case you can add it in a comment on the existing bug report. It takes some practice to know when two things are related versus not. Like I said, with the first thing that I did, I actually fixed two bugs with one fix, because they were secretly the same — but they were, I believe, two separate bug reports, because it wasn't clear until you got down into the C code that that's what was happening. So if you don't already have a Bugzilla account, you will need to get one. Basically — no, you contact the official address. I'm one of two volunteers currently who handle that. You contact the address, which will go to Martin and to Deepayan Sarkar, and one of them will get you a Bugzilla account, and then you can submit. So next, we're going to talk about patching — maybe you're thinking about making a patch for a bug, either one that you're reporting or one that you have seen. So this is what I understand.
After numerous interactions with them and working closely with them, this is what I take R-Core's engineering philosophy — the philosophy of the people you will be interacting with — to be. Okay. So first off, backwards compatibility is very important. That doesn't mean they will never, ever change behavior, even non-buggy behavior, but it's extremely rare. And most of the time, if it's not obviously buggy, the behavior is not going to change — even if they would agree with you, even in the case where they agree that, if they were writing it now, they would do it differently. The fact that it does what it does now is, in the Bayesian sense, a very strong prior on what it should continue to do. And so that's something to keep in mind, particularly when you're thinking: oh, I wish this R function did this instead, maybe I'll write a patch to do that. If the "instead" changes what other people's code is going to get — code that they already wrote, that they wrote 10 years ago — it's probably not going to happen. And the reason for that is that there are so many people who use R for so many different things. There are lots of R scripts that are run repeatedly and haven't been touched in years, because they just continue to run, and breaking those would be extremely costly for R users. So they're not going to do that without a really good reason. Another thing — which took me a long time to really understand why they felt this way — is: anything that can go in a package should go in a package. This is fundamentally how R-Core feels R should evolve. At the very least as a testing or maturation stage, but almost always actually just permanently: it should just stay in a package. And an important detail here is that popularity and widespread use in the R community is not a counter to this feeling on R-Core's part.
If there were something that every single person who ever used R loaded — if they all loaded the same package — that still wouldn't necessarily be enough for it to actually go into R. And there's no package that every single person uses. There are packages that are extremely widely used, but they're working fine as packages. So really, changes to R need to be such that they can't happen elsewhere. Another thing is that R-Core operates on a sort of individual-initiative-plus-lack-of-opposition model, which basically means there are different aspects of the R code base that are shepherded and, in a sense, owned by different R-Core members. And if that person wants to make a change, or likes an idea — coming from themselves or from outside — and agrees with it, they'll often pitch it to R-Core. But basically, unless someone else gets really upset and pushes back strongly, they are left to their own devices to do the things they think are good: in terms of accepting patches, in terms of adding features, in terms of any of that. So generally, that means that convincing one R-Core member is usually enough when you're talking about a patch or a feature addition — unless other members feel really strongly — because that one person is going to be the one who would ultimately do it. Just a few more things. There won't ever be any more recommended packages, most likely. I have heard that directly from Luke. And the reason is that there's really not very much upside, because recommended packages are not part of the R source code — they're not tracked in the same way — but they are bundled with R. And so that dichotomy is just logistically unpleasant, and it doesn't really bring that much benefit. And so again, popularity of a particular package isn't going to change that. The packages that we like to use as users, we can just get from CRAN. It's not difficult to get them from CRAN.
And that's how it's going to be — they're not going to be added as recommended packages. Proposing new features creates work for them. This is something we need to keep in mind. Even if you submit a patch, and even if the patch is good, they still have to vet it, and that scales with the size of the patch — I would argue it probably doesn't even scale linearly in the size of the patch. So this is still work for them. Helping squash bugs saves them work, even if you don't have a patch. With all this code analysis stuff we've been talking about — sometimes, especially if there's no patch, just a really good code analysis can be extremely helpful for them. So, for your own sake, don't start with feature additions, because you're going to be disappointed — they probably won't go in. Don't submit patches that change existing non-bug behavior without hearing directly that they are interested in you doing that. And not liking the documented behavior is totally allowed — disagreeing with it is totally allowed — but that doesn't make it a bug. So it's not going to be treated as a bug; it's going to be treated as a feature addition, and it's probably going to run afoul of backwards compatibility and then not be considered. Don't expect a quick turnaround on wishlist items. Wishlist items are really useful to R-Core — they're a way of collecting ideas for where R can go in the future — but they're also essentially unfunded mandates. And they basically behave like every unfunded mandate does, which means they'll get to it when they can get to it. And this one I just put in here for completeness — I've never heard of anyone doing this — but if you're talking with the R-Core member who is the relevant person for whatever you're doing and they say no, don't try to find another R-Core member to say yes and overrule them, because it's not going to work. It doesn't work out that way.
This is different from engaging with, and even disagreeing with, R-Core members in an ongoing discussion on R-devel. That is very valuable, provided it's done respectfully. And I do encourage all of you to subscribe to the R-devel mailing list and engage in the discussions that happen there. So, we're running quite a bit behind — Michael will be joining us shortly — so we're going to try to go through this pretty quickly. But these slides are available; they're in the same repository as the instructions, so you can look at them at your leisure. So: typo fixes are always welcome and appreciated by R-Core. Usually you don't even have to write a patch — if you just send mail to R-devel pointing out where there's a typo and what it should say, you will almost always very quickly get a response back saying, essentially, thanks, fixed in R-devel. And that's true of the manuals, that's true of the help pages — any documentation. Larger changes to documentation are somewhere we need to be careful. You really only want to do this when it's necessary, or when R-Core has solicited such a patch. And it's important to keep in mind that it must be at least as technically correct as the old documentation, which means we can't trade correctness away for clarity or approachability. Things have to be fully technically correct. And this does mean that you have to deeply understand whatever function you're trying to write new documentation for, and that can sometimes be difficult, depending on which function we're talking about. Which is not to say clarity and approachability aren't good — I'm not saying those are bad — but they need to come in addition to correctness rather than at its expense. So, code patches: always view the actual diff file that you're going to submit before submitting it, and never submit anything that has whitespace-only changes.
I say this because I have done it, and Martin was not happy with me when I did. So just avoid that — and it's easy to see that it's happened if you actually look at the diff file you're submitting. Consider updating the documentation to reflect any change that actually warrants a documentation update. Always test the exact diff file that you're submitting. This is another one of those every-change-is-a-change situations: even if you make just one tiny edit — you're like, oh, I'm just adding a comma to the documentation that I wrote, I don't care — run make check-devel again. Because you can accidentally make the Rd invalid, which will make it fail, which will cause work for people. It takes a little bit of time to run make check, but it's not that bad, and it gives you protection against doing that. And always provide a test script, or code that tests your patch, that R-Core can run themselves and look at as they consider the patch. Avoid bundling enhancements with bug fixes. Even if the enhancement is related to the bug fix, those should be two separate patches. Similarly, avoid bundling multiple separable bug fixes. Those should also be separate because, again, vetting these patches doesn't scale linearly in their size, right? Reviewing two separated bug fixes is much easier than reviewing them smashed together, because then they have to look at a larger piece of code and make sure they understand how each of the pieces interacts with the others. And avoid breaking backwards compatibility in essentially any way other than fixing what is obviously buggy, incorrect behavior. So, feature additions: wishlist items are great. If you have an idea for a change in R's behavior that you think would be beneficial, filing a wishlist item doesn't cost you very much. You should describe it well, so that R-Core can understand it.
That's appreciated, so that they have all these ideas collected in one place where they can look them over. There's absolutely no guarantee of it happening — sometimes it won't, often it won't — but sometimes it does. And that's a really good way to get your ideas into R, even if you're not yet at the level where you're able to make a patch. Unsolicited feature additions: generally, don't do this. For your own sake and theirs — again, there's a good chance it won't be adopted, and we don't want you to waste your time, we don't want you to waste R-Core's time, we don't want anybody wasting their time. So if you have an idea, bring it up on the R-devel mailing list, or in Bugzilla as a wishlist item, and see if there's engagement from R-Core, see if there's interest. Because if there is, that's a sign that you could collaborate with that person on actually getting it in. For solicited, or confirmed-interest, behavior additions: it's great to collaborate and voice your opinion, but the R-Core member is going to have the final say — like I talked about with the debugcall function: I disagreed with their design choice, and theirs is the one that's in there, and there's a reason for that. Be prepared to refactor your code, possibly multiple times, before submitting it. This is just good practice for software generally. It takes longer, but your first pass at anything isn't going to be your best one. I've often heard engineers who are solving a new problem talk about this: you solve the problem, then you throw that code away entirely, and then you start again with the understanding you've gained over the course of that first implementation — and then your second implementation will actually be good and actually be ready to go in. I'm not saying that's a hard requirement, but it is very useful, so you should consider doing it. Test it to within an inch of its life, and then keep testing it after that.
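Testing "to within an inch of its life" — and the test-script advice from a moment ago — can start as small as a stopifnot() script in the style of the files under tests/ in the R sources. This one is hypothetical, loosely modeled on the hist()/"FD" bug from the practicum, not the actual regression test R-Core committed:

```r
## Hypothetical regression check: near-constant data with breaks = "FD"
## should produce a sane, finite set of breaks instead of an error.
x <- c(rep(1, 100), 1 + .Machine$double.eps)
h <- hist(x, breaks = "FD", plot = FALSE)
stopifnot(
  is.finite(h$breaks),          # no NA/Inf break points
  length(h$breaks) < 1000       # and not an absurd number of bins
)
```

A script like this costs a few lines, runs silently when everything is fine, and gives R-Core something concrete to run while vetting the patch — and to keep running afterwards.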
And if possible, have someone else technically skilled at R review it before submission, or before each of the iteration steps. And that Slack I mentioned, which the R Contribution Working Group has created, is a good place to do that. I hang out in there; sometimes there are other people in there. There's not much activity yet, but it's a good place to meet people. And Martin also mentioned something: even pair programming, when you're dealing with bugs, can be helpful and useful, and I encourage you to do that. So you can collaborate amongst each other, and with other people who are interested, in addition to collaborating with the R-Core member that you'll be working with. And finally, in the last couple of minutes before we open the floor to questions: purely speed-up patches — generally, avoid them. And the reason is that often, with a speed-up patch, you're not going to have the impact that it might seem, in isolation, that you would. You really need to be confident that you're speeding things up without slowing other things down. And that can often happen: you can speed up a special case, but then something else has changed, and you're actually slowing down a more common use case, or something like that. Also keep in mind that speeding up code usually makes it more complex, and thus less maintainable. There are places in R where you could make things faster, but they're fast enough already, and they're more maintainable and more understandable in the form they are in now. And that's a conscious choice that R-Core sometimes makes — and one that, in your own code, even outside of R, you should consider making as well. So: premature optimization is bad. You really only want to speed things up if it's going to make a realistic, real difference.
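A quick back-of-envelope sketch in R of what a "realistic, real difference" means — the nanosecond and second figures here are hypothetical illustration numbers, not measurements of anything in R:

```r
## A micro-level win: 2 ns -> 0.5 ns per call is a 4x speedup,
## but how many calls before it saves a human-noticeable 3 seconds?
per_call_saving <- 2e-9 - 0.5e-9      # 1.5 nanoseconds saved per call
calls_needed <- 3 / per_call_saving
calls_needed                           # 2e9 calls before anyone notices

## A macro-level win: 10 s -> 7 s is "only" ~1.43x,
## yet it hands back 3 seconds on every single run.
c(micro = 2 / 0.5, macro = 10 / 7)     # 4x vs ~1.43x speedup factors

## system.time() measures what users actually feel — whole-call
## elapsed time — which is the number that matters in the end:
system.time(Sys.sleep(0.1))
```

This is the arithmetic behind the two examples that follow: a huge speedup factor on a tiny operation can still be worth less than a modest factor on a slow one.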
So, two examples here. If I make something that took two nanoseconds take 0.5 nanoseconds instead, that's a four-times speedup, which is very large. But it only matters if you're doing it hundreds of millions of times, right? Otherwise — congratulations, you went from an undetectable amount of time to an undetectable amount of time, in terms of actual people. On the other hand, if I take something that took 10 seconds and make it take seven seconds, that's only a 1.4-times speedup, but you're going to save people actual time — more of their time. You've saved three seconds. Think about how many times the thing in the top example would have to run for it to even take three seconds total. It's a lot, and most things aren't run that many times. So microbenchmarking is a useful tool, but it's also extremely easy to misuse, because really, at the end of the day, what matters is how long the entire script takes to run. The speedups that are valuable — the speedups that they will consider, and I have had clearly speed-up-only patches accepted — are the ones in this bottom category. They save actual time, and preferably they don't make the code too much more complex than it is now. Then it's really valuable, and it's appreciated. But it's not something that should be at the top of your list unless you're really comfortable with the code and know what you're doing. And I think that is where we will stop with the slides. There were some more slides about navigating a checkout, and we did have another practicum scheduled, but we had more questions, which was great. So I will leave that as homework. The bugs that you can look at — I've added them to the README on the repo we linked you to. There are a number of them; you can do as many as you're interested in. Some are old and fixed.
Some are actually still live, including one in debugcall, which Sebastian Meyer found and then left unsolved so that you all could get a look at it first. And — I don't know, did Michael... Okay, so Michael is here as well. So we're going to go ahead and open the floor to any questions that you have for Michael and Martin, who are R-Core members, or for me, from the side of collaborating frequently with R-Core from the outside. Are there any questions about...? I forgot — it was one of your suggestions also not to change the indentation, because I remember that's really important. Sorry, not to change which? Change indentation when you submit patches. Oh, yes. That is a good point. That's related to this whole no-whitespace-changes thing. One type of whitespace change is an indentation change. That will sometimes even happen automatically if you're using RStudio or Emacs and you have certain configurations: you hit tab and it does something — it indents your code somehow. But any time you're submitting a patch, you need to be using the same indentation scheme that the R sources use, and any time you're modifying code, you can't be changing the indentation scheme. So that's a good point that Naras is raising. That's one of the things you should be looking at when you look at the diff: does the indentation — and the coding style more generally — fit with what's already there or not? And if it doesn't, then you need to make some more changes so that it does before you submit. So that is a very good point. Yeah. And may I add something, Naras: there are a few places in both the R sources and the C sources where the current indentation is wrong according to the scheme, because a previous patch — or even a change by R-Core without a patch — tried to make a minimal change. So let's say you have indentation by four, as we often have, and then an extra clause — another level of braces, with another if.
And then sometimes people — or I, in the past — used an indentation of two for this intermediate clause, so that all the other indentation would not change, so that the change really was only, say, four lines instead of 15 lines where 11 of the 15 were just indentation changes. And some rare times we do rework and fix the indentation to our own style, and then have a whitespace-only commit. That's sometimes possible, if the indentation is really bad, and then one change changes nothing but indentation. That's kind of okay, because the commit message says: I am only re-indenting, I am doing nothing else. But if you fix a bug — where the really important changes are in four lines — and then change 20 other lines just by indentation and other whitespace, then that's really bad, because when you look at the change, you don't see where things really happen; you are diverted by the whitespace changes in the diff. I don't know — I hope at least Naras understood what I meant, and some other people too. Okay, so I have just one question. How relevant is it for us to review old bugs and say, you know, I can still reproduce this? Or, if we say this is no longer a bug, maybe that helps. Is it helpful if we go over old bugs and say, yeah, I can still reproduce it, five or seven years later? To what extent is it useful to review old bug reports? It is useful. Initially — remember, Gabe showed you the three R blog posts, and two of them said, please help us with Bugzilla. And one person I've never met — she's a professor somewhere on the east coast of the US, if I remember correctly, Elin Waring — that's basically what she does. She goes over Bugzilla bugs and reviews them: sometimes just saying this is still active, often doing some real analysis in addition, like looking at more cases, or showing cases where the bug doesn't show versus where it shows.
So that's what we mentioned with code analysis: trying more than just reproducing the bug — also seeing when it happens, when it doesn't happen, or going even further down. Now, that's very useful. And as Gabe mentioned — I had no time to say this — sometimes a wishlist item is really just put on the wishlist because it's really not a bug and nobody has time. But actually, it may be good if somebody reviews it five years later and says: well, actually, in the meantime I've also found I would want this functionality, and it's kind of easy to add, or whatnot. So reviewing old bugs that are just forgotten, because we don't have time, can be useful. As Gabe mentioned, it's of course good to find a balance between a reminder with some extra information — namely, this is still current — as opposed to nagging: why has this never been addressed? And so on. Of course, you can't know that. But maybe Michael can add something; he has a different perspective. I think I have a very similar perspective, actually. Yeah, I think any help is good help. And I think the way we can make that most effective is if we ensure that we clean up those old bugs and make sure that whatever bugs are in there are relevant. And that's something that can actually also be contributed, right? I mean, just going in there and helping to find obvious cruft, or things that may be duplicates of previous bugs, and things like that. Because I think doing a first pass of that would ensure that everyone's time is well spent. And I suppose it would be a good first step before implementing features, right? Just contributing to that. Thanks. And I mean, I don't remember how many bugs are open, but there are bugs open, right?
Like, there are things that are in there, and either they're not bugs — in which case figuring out that they're not bugs is really valuable, because it saves R-Core time trying to track them down — or they are bugs, and making them easier to tackle is really valuable. And, you know, a lot of R users probably think: well, R doesn't have any bugs — it's been around for so-and-so many years, and I use it every day at my work to do the same thing. Which is the part they don't think about as much. But really, all software has bugs, right? And the way that we can improve R, and help R, and help the R community, is by helping find new bugs and helping get rid of old bugs — because it has bugs. So there are things in there that are open, and some subset of them are even real bugs that just haven't been fixed yet. And the fact that they haven't been fixed is a sign that some help on those might help R, by getting them closer to being fixed. There's a question in the chat. Okay: if the R-Core team is so busy, then why not get more people to join the R-Core team? I can take a first stab at that. I think that's what this seminar, this tutorial, is all about. R-Core is simply defined as the people who have commit rights to the source code, right? But that's actually a very small part of the overall scope of R development. And so what we'd like to do is have the R development team be much broader and much more inclusive, and include people at various levels of contribution, right? And so I think what we're encouraging everyone to do today is actually to join that team, right? And collaborate with R-Core in making R better. So that's my perspective. Yeah.
And I can say, having worked from the outside of R-Core — just to add on to that — I have been able to contribute to R. I've fixed bugs, and, over the course of a long period of time, developing these relationships with R-Core members, even feature additions eventually — although, again, that's not where we start. That's not where I started; that's not where anyone starts. But I can tell you that you can help R, and you can have an impact, even outside of R-Core — whether or not your goal is eventually to be on R-Core, or whether you just don't want that sort of extra expectation and extra responsibility. You don't need to be on R-Core in order to help R grow and improve, and to fix the things that are problems with it. So you can help without being in R-Core, and we're encouraging you, if you're interested, to do that. And hopefully this has been helpful for learning some of the types of things you would need to know in order to be effective at doing that — to be effective collaborators with R-Core, in what Michael is calling a larger R development team that helps R, improves R, and makes things better for R users everywhere. So we do have a few more minutes, if there are any questions — and they don't even have to be related to collaboration, right? This is an open round table. If you have questions that you've wondered about and never had a chance to ask an R-Core member before, this is a chance to do that. And if you have any questions for me, as a non-R-Core member, I'm happy to answer questions about basically anything as well. So please feel free — there are no stupid questions. If there's something you don't know, you can just ask. I do have a question.
Well, it's not highly related to the topics we covered today, but now that we have Michael on the call I think it's okay to ask, and it's something that came up yesterday. I think one of the lowest barriers to entry for contributing to R might be translations. And there, I think it would be much easier to make those contributions via a web service, so that you do not even have to check out from Subversion, do text edits, run the tests, email the patch to someone, and then have it checked into Subversion. I know that a long time ago there was this Pootle server where anyone could register, log in, and provide translations; others could review them, and after, say, one or two reviews, those could be committed to the actual code base. I'm not sure if there are any related plans.

Yeah, that's right, thanks for bringing that up again. We just learned about this at the translation tutorial, thanks to Gergely: this is something Brian Ripley had put into place about a decade ago to facilitate submission of translations into the R code base, and it has somehow fallen off the radar. As part of this overarching effort towards improving the translation process, building up the translation community, and making it one of these on-ramps to R contribution, we have a working group, sponsored by the R Consortium, where Michael Chirico and I, and others we will recruit into the effort, are going to pursue that very topic: how can we make this easier, from both the technical perspective and the community perspective? So it's really great to learn about that Pootle server; I think it will be a useful piece of technology for us. And certainly anybody here, including you, Gergely, who is interested in joining that working group, just let Michael Chirico or me know; we'd be happy to have you. Thank you very much.

Okay, so we have another question in the chat.
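To make the question concrete, here is a minimal sketch of the manual workflow being described: edit a message catalog, then produce a unified diff to email to a maintainer. The file names and the translated string are purely illustrative; in the real workflow you would check out the R sources with `svn checkout https://svn.r-project.org/R/trunk`, edit the `.po` files under `src/library/*/po/`, and use `svn diff` rather than plain `diff`.

```shell
mkdir -p demo

# Pretend this is a message catalog as checked out from Subversion,
# with an untranslated entry (empty msgstr):
cat > demo/R-demo.po.orig <<'EOF'
msgid "invalid argument"
msgstr ""
EOF

# Step 1: make an edited copy with the (hypothetical) translation filled in.
sed 's/msgstr ""/msgstr "argument invalide"/' demo/R-demo.po.orig > demo/R-demo.po

# Step 2: produce a unified diff to email to the maintainers.
# (diff exits with status 1 when the files differ, so guard it.)
diff -u demo/R-demo.po.orig demo/R-demo.po > demo/translation.patch || true

cat demo/translation.patch
```

The friction the question points at is exactly these manual steps; a review-based web service like Pootle replaces the edit-diff-email loop with an in-browser form.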
So Luis is asking: Michael mentioned that there are a lot more tasks that R Core members perform besides changing the source code of R. As a small point of clarification, R Core doesn't maintain CRAN; CRAN is a separate but related entity. But there's also the website, Bugzilla, and the servers, and those are under the purview of R Core. So he's asking how we can contribute to those other types of tasks beyond just the source code. I have heard other people lament that the website isn't more modern and wonder how a modernization of the website might be contributed; I've heard that in the wild as well. So if Martin or Michael wants to talk about that, that would be great.

Yeah. A general statement I would make is that we'd like to collaborate with whoever has the energy and interest in contributing these things. So approach us as collaborators: if you have a new idea for the website, then, as Gabe was saying in a lot of these slides, come to it with the attitude and disposition of "I want to start small and work with R Core towards some common end." I think we can build up that relationship, that collaboration, and work towards a better thing. We're totally open in principle to improvements to all of those things. One practical thing we have done: Saranjeet Kaur has written up the R Development Guide along with Heather, and I think that's one practical place to start if you're looking to see how to get involved. But my general point is just to approach this as a collaboration, and I think it will be fruitful. Thank you, Martin. You had something to add?

I notice that I'm almost losing my Zoom connection, maybe because I'm on the same network as most of the conference Zoom rooms; the bandwidth is very limited, even though it shouldn't be at my home.
So my connection is sketchy; I never see the chat, for instance. Yes, the website. What people are not aware of is that there are two websites. There is the R project website, which is all based on Markdown, which we really like, and we don't want to change anything about that part of the interface. And then there's CRAN. CRAN is mirrored to about 30 different places, or even 50, I don't remember. And the CRAN website is much more old-fashioned than the R one, because www.r-project.org does work nicely on smartphones and so on. It doesn't have to look like a website that has a team of people working full-time on its maintenance, because we don't want that, and we don't want to go to completely commercial hosting. Also, we want to use free software and host our own stuff as much as is reasonable, of course. So there are some constraints, but the main point is that we have two websites instead of one: CRAN, which is mirrored, and the R project website, which lives in just one place, uses relatively modern Bootstrap technology for responsiveness, and uses Markdown as its source, which makes the maintenance reasonable for us, because in the end it's us who would change it. But as Michael said, we've often talked about having other people do this in collaboration with us, and efforts have started and often have gone nicely, but not so many have really landed in this part of the webpage. It's not that we would want to move to some commercial web host that adds some extra bells and whistles. I think people sometimes don't understand that free software is a very important thing: R, and Python too, would never have happened if there was not free software, and open source is just one part of free software, as you hopefully know.
And that's why some of us, and I'm one of them, very much emphasize that we don't want to commercialize our complete infrastructure and become dependent on people who need to make money instead of advancing a free thing, as R is.

Yeah, and one thing to add on top of that: the website, even in its current state, is a lot better than it once was, and one of the reasons for that is that it was actually a collaboration. Hadley and a few other people came together to refactor the website about five or six years ago, and now it's template-based and so on. So those types of improvements can happen and have happened.

I think that's the time that we had; Heather also just mentioned that in the chat, and we are at half past. So thanks everyone for attending and for sticking with it to the end. Hopefully this has been useful and fun. Like I said, there's the Development Guide and the Slack, the R Contribution Working Group Slack, that you can join, and I encourage you to join and to do all the things that we talked about. Hopefully I will be seeing you around in those types of places as things move forward. So thanks again, thanks to Martin for working with me on this, and thanks to Michael for attending this last bit, the open question section. And with that, I think we can stop the recording, and I hope everyone has a good day or night.