Hi, I'm Bethany Griggs, and it's great to be here in New York. I'm based in the UK and I work at Red Hat, which is part of IBM, where a lot of my time goes into helping teams maintain and upgrade their Node.js applications. I'm also a contributor to the Node.js project, and this talk is about upgrading Node.js versions.

We've had two big new versions recently. Node.js 18 was released back in April and has just been promoted to long-term support. And then there's Node.js 19. That was released in October and that's currently our Current release. It has all the new features and the latest updates that are landing in the project. So, two big new versions.

The exciting thing most people gravitate to is: yay, new features. So, yep, great new features. Everyone's excited for new features. In Node 18 we introduced the global Fetch API. This was a very anticipated API; discussions go back years. We finally landed it and shipped it. It's still experimental, but it's good because, conceptually, people using fetch in the browser now have something similar in Node that they can use. There are some performance tweaks we need to make, but people are working on it and it's getting lots of use already. And then we also have the built-in test runner. This was also in Node 18. It means you don't have to go off and install a module to test your applications anymore, because you can use the built-in one. The purpose of this isn't to displace all of the great test frameworks out there, like Mocha, Jest and those. It's just to provide something small and lightweight that covers a useful set of use cases. So if you've just got a small app, you maybe won't have to go and install one of the modules. The way I see it being used is: I start off using this, and then once my app gets complex enough that I need to explore some more advanced features, I go and pick up Jest or one of the other modules. Lots of great work, particularly from Colin, on that one. And we also, even more recently, landed watch mode. You can now watch files for changes and Node will reload and rerun your app. Think nodemon: if you've used nodemon, we now have a similar implementation built into Node. So that's lots of cool new features, and they're the carrot to encourage people to upgrade.
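As a quick taste of those three features, here's a minimal sketch. The file name and URL are just illustrative, and fetch still prints an experimental warning on Node 18:

```js
// math.test.mjs — a sketch using the built-in test runner and global fetch.
// Run it directly with `node math.test.mjs`, via the runner CLI with `node --test`,
// or re-run it on every change with the experimental `node --watch math.test.mjs`.
import test from 'node:test';
import assert from 'node:assert/strict';

test('adds two numbers', () => {
  assert.equal(1 + 2, 3);
});

test('global fetch is available without installing anything', async () => {
  const res = await fetch('https://example.com'); // illustrative URL
  assert.equal(res.ok, true);
});
```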
But I know this is not the reality. Upgrading versions takes time, it takes effort, and it takes money. I work with a lot of enterprises where development still follows the traditional lifecycle. You're not deploying every single day like many of the cool companies are. You've still got long development processes, you've got code freezes. It takes time to upgrade, and you really have to find a window among all of your other priorities to do the update. That means it inevitably gets pushed to the last minute, until you're forced to upgrade because you're approaching end of life. And that's why I'm here today: I want to share some techniques I've learned from helping teams at Red Hat and IBM upgrade Node versions. The flow of the talk is that I'm going to start with what may seem obvious and then go into some more involved techniques. My aim is that hopefully you'll learn at least one technique you didn't know before.

So, the first good piece of context to have is to understand how Node.js releases work. Fortunately, Danielle gave a great talk this morning that covered that, so if you caught it, perfect. But if not, I'll recap the very, very high-level basics that are necessary for this talk. First of all, the Node project follows semantic versioning, the semantic versioning contract. I'm sure many folks are familiar: the first number is incremented for a breaking change, the second for a feature, and the last for a fix. But that comes with some caveats. You can argue all day about the difference between a bug fix and a breaking change. We've actually found cases where there are genuine bugs in the Node runtime, but people are relying on the behavior, so we can't ship the fix as a traditional bug fix. We have to delay it and ship it in the next major, simply because it would have too much impact. The same applies to spec incompatibilities. In some cases you might think that making an API line up with a spec is just a bug fix, but it would have such wide impact that we elevate it to a breaking change. We also have a blanket exception for security fixes. Because we maintain multiple release lines, we have to ship security fixes even if they are technically breaking changes. So, in our policy, if something is security related, we will ship it in an existing major release line. We often try to give you workarounds, such as a command-line flag to revert back to the old behavior, just so you can still upgrade, but this is what we do. And then, last of all, genuinely unexpected side effects. We try to adhere to this contract, and we label PRs as breaking or not, but to be honest we sometimes make mistakes and things have unexpected impact.

And that leads me on to the release schedule. Danielle did a good recap of this. The main takeaway for this talk is that we have Current and long-term support (LTS) releases. Generally, because of the length of time they're supported, most businesses will stick to the LTS releases: the even-numbered release lines, Node 14, 16, 18 and, soon, 20. So the main context here is that Current releases are kind of like betas for trying out the cutting edge, while the LTS releases, the even-numbered ones, are the stable, longer-supported ones. What this means when you come to upgrade, if you're in an enterprise setting, is that you inevitably end up doing a double jump. You typically go from, say, 14 to 16, or soon 16 to 18. And that means you've picked up a lot of changes in the meantime as the code has churned. As Danielle mentioned earlier, the main branch keeps evolving as these releases are cut, so you've probably got a good year's worth of changes to pick up if you're going from 14 to 16.
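To make that double jump concrete, here's a small sketch of my own (not from the talk) that lists which per-major changelog files in the nodejs/node repository a given jump covers:

```js
// changelogs-to-read.mjs — a sketch of working out which changelogs cover an
// upgrade. The file naming matches the nodejs/node repository, where each
// major line has doc/changelogs/CHANGELOG_V<major>.md.
function changelogsToRead(fromMajor, toMajor) {
  const files = [];
  for (let major = fromMajor + 1; major <= toMajor; major++) {
    files.push(`doc/changelogs/CHANGELOG_V${major}.md`);
  }
  return files;
}

// A 14 -> 16 double jump picks up everything in 15.x and 16.x:
console.log(changelogsToRead(14, 16));
// [ 'doc/changelogs/CHANGELOG_V15.md', 'doc/changelogs/CHANGELOG_V16.md' ]
```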
And all of that churn is relevant because it leads me into the first port of call when you're debugging a breaking change: the changelogs. But you need to know which ones to look at. If you were going from, say, Node 14 to 16, you'd actually need to look at both the 15.0.0 changelog and the 16.0.0 changelog, because they're cumulative: you're picking up all the breaking changes between those two versions. So, if you've got your app and it suddenly breaks with the upgrade, the first thing I would do is check those two changelogs. In theory, everything that's identified as a breaking change will be in them. I've got some good news here, though. This is the number of breaking (semver-major) PRs per release line, and as you can see, it peaked around Node 10. After that the numbers head back down, to just under 50 breaking-change PRs. Stability. This is because Node is a mature project at this point and stability is key, so over time we're being more conscious about making breaking changes.

Once you're in the changelog, another useful tool is our use of subsystems. A subsystem is roughly the area of the project that the code touches. As contributors to the Node project, we try to label all of our commits with these subsystems, and this is useful for end users reading a changelog because, say your HTTP server is broken, you can use the subsystem as a shortcut to identify which changes are likely to be the problematic ones. This works well when there are intentional breaking changes with known impact. One I think we did really well was the OpenSSL 3.0 upgrade. Due to the support lifetimes of OpenSSL, we had to make this breaking change, but we could explain why we needed to make it, and we gave people options to revert and work around the new behaviors in OpenSSL 3. For context, when OpenSSL does new majors, it tends to block some older algorithms and say you can't use MD4 anymore, so your application would actually fail. We provided the --openssl-legacy-provider command-line option so you can temporarily revert back.

But things are not always this smooth, and when you're debugging an upgrade you can't always infer the cause from the changelog. So now I'd like to segue into one of my own work experiences. At the start of the year, I was tasked with helping a team update the JavaScript client for their product. It's a Java-based team with a lot of Java engineers; they just happen to have a JavaScript client, so they didn't really have expertise in the Node runtime. The project is called Infinispan. It's an open-source data grid, roughly a Redis equivalent, and it backs the Red Hat Data Grid product if you're a Red Hat customer. They needed to support Node 16, so they reached out to some folks on our team who maintain the runtime, because we had the context to help. I started with some very, very obvious steps that I'm sure most folks would do. You run the test suite with one version and it passes. You run it with the next version and it hangs; Node 15 also hangs. And to do this, I just use nvm: switch between versions, run the test suite, and see what happens.
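That version-switching loop is easy to script. Here's a rough sketch; it assumes nvm is loaded into the shell, and `npm test` stands in for whatever the project's real test command is:

```bash
#!/usr/bin/env bash
# Sketch: run the same test suite against several Node.js versions via nvm.
# nvm is a shell function, so source it first when running non-interactively.
source "$NVM_DIR/nvm.sh"

for version in 14 15 16; do
  nvm install "$version" > /dev/null
  nvm use "$version"
  node --version
  # A hanging suite won't exit on its own, so cap it (here at 5 minutes).
  timeout 300 npm test || echo "Tests failed or hung on Node $version"
done
```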
Another typical debugging step is to isolate the individual test. In this case the whole test suite was hanging, so I needed to go in and figure out exactly which test it was, and this does have some relevance later. The failing test was a cross-site test. Essentially, it starts two clients and expects to be able to read and write to both. It then destroys one of them and tries to write to the one that was destroyed. What should happen, in theory, is that under the covers the data is still successfully persisted via the client that's still living. So that was the context of the test. And then I went to the very, very typical, familiar debugging step of sticking console logs everywhere to see what would happen. Very familiar. This was good because I could roughly see which part was hanging. Great. And then I thought, actually, I should be a bit more intentional about my debugging here. Let's use a better tool for this. So I opened up the Chrome DevTools inspector and started stepping through the code to see what was happening. And what I hit at this point is that this was an events problem: stepping through the code just meant stepping through the event loop until a timeout ran out, and that was it. This wasn't helping me diagnose the problem. All I knew was that it was waiting for something to happen that never happened.

So at this point I was really trying to think: hmm, what next, what do I do? And I resorted to git bisect. This is a binary search that you can use to find where a problematic commit was introduced in your git history. As Danielle mentioned, the release group does this quite often, because when we're building releases, if a test fails we want to check which commit is causing that failure and pull it out before shipping the release. But in this case I used the technique to find out which breaking change was impacting our client. You give it a good and a bad commit and then you start the git bisect process.

And now I have a demo-ish. If you give me one second to bring it onto the screen. Or not. I will get it. It was a lot harder than I made it look. There we go. It's really hard to type when what you're looking at is like that. I don't trust the demo gods, so I always record all of my demos with asciinema, which just records and plays back my commands. So this is the exact bisect I used to debug this problem. You start by calling git bisect start. You supply a good commit, in this case the Node 13 release commit, because I knew the change was in that window, and the bad commit, which was the Node 15 release commit. This is all on the main branch. At each step here I've simulated it, because if I actually built and tested Node at every step, we'd be here for at least an hour. What I would do at each step is run make to build, and then run the individual test with my custom binary for that particular commit. You can see it starts running through the steps. git bisect can actually take a script instead, a shell script, and it uses the exit code of that script, so I could just type git bisect run test.sh and walk away. But I find stepping through it manually more intuitive to demonstrate. So, great: I found the problematic commit, which was the net autoDestroy socket change. And based on the test, at this point I was getting a feel that this was probably a socket- or streams-type error. Now I'll quickly go back to my talk. This is good. OK. So, yay, I found a problematic commit: net autoDestroy socket.
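Here's roughly what that bisect session looks like as commands: a sketch of the workflow just described, where the v13.0.0 and v15.0.0 tags stand in for the release commits and test.sh is a hypothetical script that builds Node and runs the single failing test:

```bash
# In a clone of nodejs/node, on the main branch.
git bisect start
git bisect bad v15.0.0     # release where the client test hangs
git bisect good v13.0.0    # release where the client test passes

# git now checks out a commit roughly halfway through the range. Build it,
# run the single failing test against the freshly built binary, then mark it:
make -j"$(nproc)"
./node /path/to/client/failing-test.js && git bisect good || git bisect bad
# ...repeat until git reports the first bad commit.

# Or hand the loop to git entirely: test.sh builds, runs the test, and exits
# non-zero on failure, and git uses that exit code to mark each step.
git bisect run ./test.sh
```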
OK, but bear in mind the context: the people I'm working with are Java engineers who maintain a small JavaScript client. It may be great to find a commit, but you need a lot of context to be able to go into the core runtime and ask, hey, why is this commit impacting my small client? Because this is low-level stuff. Even for me, looking at it, it was: I can kind of see what's happening here, but I can't quite articulate why this is causing the breakage in my client. So again, I had to sit back and think: hmm, what next?

And so I tried NODE_DEBUG. I don't know if anyone's aware of the NODE_DEBUG environment variable, but if you supply it to your Node process, you get lower-level runtime information in the debugging output. And going back to what I said about the changelogs, we use subsystems here as well. So if you want to see all of the debug-level logs for http, net or stream, you can set NODE_DEBUG to one of those and you'll get the low-level output. You can also enable everything with a wildcard but, honestly, don't do that, because you'll mostly just see timers counting down and it's not useful. And then I got to the point where I could run with this low-level output on Node 14 and Node 16 and compare the stream and net events. I chose stream and net because I knew from the commit that it likely touched net or stream. And again, going through it I could kind of see what was happening: one side was just waiting for something to happen, and a certain event wasn't being emitted. This was a couple of weeks of work of just thinking: I don't know what's going on. I'm getting closer, but I still have no idea what's going on here.

So then: ask the experts for help. I reached out to one expert and just asked, how do you debug streams? I have no idea how to do this. How do you do it? One suggestion was to have a few beers and try it after a few drinks in the evening, to avoid the despair of looking at streams code. So that was one suggestion, but it wasn't their only one. The next suggestion is what really helped me put all the pieces together, which was to create a minimal Node core reproduction. At this point I'd spent literally days, almost weeks, stepping through this code trying to figure out what was going on, and from all of that I'd gained context of roughly what it was doing at a high level. So, taking all of the specifics of the client and of Infinispan out of it, what was happening here? It was really quite simple: it was creating a socket, destroying it, and then trying to write to it after it was destroyed. And what I knew is that there's some behavior difference between Node 14 and 16 here. Oh, wow. Apparently it hadn't. Okay.

So I created a minimal reproduction. In here, we create a socket and log "connected" when it's available. I just used timers because I thought that was quite easy to demonstrate. You send a destroy to the socket, so it destroys after three seconds, and then, after it's had time to be destroyed, I send "hello world" to it. And I found that these few lines of code emit an error in Node 14, but they actually don't in Node 16. So that was the key difference. And then I could eventually work out what the code needed to be to still surface the error, and as you can see here, it was a case of just emitting it via the write callback. This was the most minimally invasive way of adjusting the code in the project's client to get similar behavior on Node 14 and 16. What was actually happening is that the test suite was hanging because it was waiting for that error event. It was never happening, and therefore we needed to manually surface it and expose it.
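Reconstructed from that description, the reproduction looks something like the sketch below. The port and timings are just illustrative, and the exact error code may vary; the point is that the late write surfaces as an 'error' event on Node 14 but not on Node 16, while the write callback reports it on both. Running it with NODE_DEBUG=net,stream also shows the underlying net and stream activity on each version.

```js
// repro.mjs — a sketch of the minimal reproduction described above.
import net from 'node:net';

// Any local TCP server will do; 12345 is an arbitrary port.
const server = net.createServer().listen(12345);

const socket = net.connect(12345, () => console.log('connected'));
socket.on('error', (err) => console.log('error event:', err.code));

// Destroy the socket after three seconds...
setTimeout(() => socket.destroy(), 3000);

// ...then try to write to it once it has had time to be destroyed.
setTimeout(() => {
  socket.write('hello world', (err) => {
    // Passing a callback was the minimally invasive fix: the failure is
    // reported here on both Node 14 and Node 16, so the client can react.
    if (err) console.log('write callback error:', err.code);
    server.close();
  });
}, 6000);
```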
And that's what enabled the fallback behavior. It tried to write to client B, that error wasn't coming back, and it was only once I forced the error to be surfaced that the client knew to write to client A instead and do the fallback cross-site behavior. You could argue it was a bit of a flaky client written on some potentially shaky assumptions, but in the context of just trying to get this client working with the least amount of change and the least amount of risk, that was the situation.

So, just to summarize my techniques, and I'd probably use this playbook again going forward. First, isolate the test failure and identify the specific version that introduced the breakage. Then look at the changelogs, guided by the subsystems. Obviously, use the traditional debugging tools, such as console logs, the Chrome DevTools inspector, et cetera. The more involved steps are runtime-level debugging with the NODE_DEBUG variable, and you can resort to a git bisect of the Node runtime. You may ask: why didn't you just do that straight away? One, it takes a long time unless you have access to a powerful machine. I think it was something like 1,800 commits I was bisecting across, and a build of Node can take, depending on your machine, 20 minutes or maybe a bit quicker. So it would take a long time. And also, had I not spent so many hours stepping through that test and really understanding what was going on, finding the commit probably wouldn't have helped that much without the additional context. Also, ask the experts for help. We have teams on the Node project who are responsible for certain areas, so you can reach out to the group responsible for streams, or the group responsible for the console, or whatever it is. Knowing who to ask for help counts for a lot. And finally, a minimal Node core reproduction. I give that the gold medal: it's the easiest way to get from "what changed in the runtime" to "how is that impacting my application". And what I will say is: if you do have a minimal Node core reproduction and you're still unsure whether it's an expected or unexpected change, that's perfect. That's the perfect thing to hand to a maintainer, because they've got all the code, they know it's not an issue in your personal code or your company's code, they've got a minimal case, and they can confirm whether it's an issue or not.

So, that's the process I would follow. How can you prepare for an upgrade? The key tip I always give is this: lots of companies only use the even-numbered releases, the long-term support releases. I would suggest that, even if you don't intend to support or use the odd-numbered release lines internally, you add them to your test matrix when they're released, because then you're getting some early feedback. You don't need to fail your build completely on them, but it saves you from that double jump: in this case, if the team had added Node 15 to their matrix, they would have learnt when it was released that they were going to be hit by the change; the window is smaller and it would have been easier to debug. So that's my normal first suggestion.
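As a rough sketch of that idea, written by hand here rather than with the GitHub Action mentioned in a moment, a CI matrix covering the LTS lines plus the current odd-numbered release might look like this, with the odd line allowed to fail so it only provides early feedback:

```yaml
# .github/workflows/test.yml — a hand-written sketch of a Node version matrix.
# Versions shown reflect late 2022: LTS lines 14/16/18 plus Current 19.
name: test
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [14.x, 16.x, 18.x, 19.x]
    # Don't let the Current (odd) line block the build; it's early feedback only.
    continue-on-error: ${{ matrix.node-version == '19.x' }}
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
      - run: npm ci
      - run: npm test
```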
I also want to share that there's a GitHub Actions config in the pkgjs GitHub org, which is a Node.js project-owned GitHub org, and what this GitHub Action does is automate which Node versions are in your CI flow. So instead of putting up a PR to add 16 and another PR to add 18, you can give it a value of lts, at which point it will generate a matrix and run on all LTS versions. And the reason I mention it is that there's also a setting where it will cover all of your LTS versions plus Current. So at the moment it would do 14, 16 and 18, and it would also do 19, as your early feedback on how much you're going to be impacted by the upcoming release. I'd also like to give credit to Dominykas, who did a lot of the work on this.

The other thing you can do is try running your application with the pending deprecation flag. When we're looking to deprecate or remove some APIs, we will sometimes register this within the code itself. So if you run node --pending-deprecation myApp.js and you're using an API that may be removed in the future, it will warn you: hey, you're using an API that's due to be deprecated in a future release. Again, you wouldn't do this all the time; you'd do it to get some early feedback on what you're going to be impacted by.

And what I would say is: if you are hitting a problem upgrading, please do open an issue or a discussion. Mostly because it could be an undocumented change. As I said, things can have unexpected side effects; what we think is a purely additive feature may actually cause breakage that we just weren't aware of from your use case. It could also be that, if you're having to ask, we probably haven't documented it, or haven't documented it well enough. And if we can get that feedback and improve our documentation, it helps the upgrade experience for every user. It also, and I think this is the most important one, really helps us assess the ecosystem impact of similar changes in the future. Say, for example, all our tests looked fine for this change, but it causes real breakage in real-world applications. Then we know, next time there's a similar commit that touches similar areas, that there's a consequence that may not be covered by our low-level unit tests. And you get bonus points if, when you are hitting a problem upgrading, you can come to us with a small reproduction, because, given how much time maintainers have, a small reproduction is really what helps us help you. And with that, I will close out. I'm happy to chat one-to-one about any debugging problems, version upgrades or issues. But yeah, thanks for your time, and I hope you learned at least one new way of debugging an upgrade.