Let's start with our panel discussion. The moderator for the panel is William Bartholomew, who works at GitHub. On our panel we have Jeff McAffer, also from GitHub; Mirko Böhm from Endocode; and Diomidis Spinellis from the Athens University of Economics and Business. William has the microphone ready. Okay, thank you.

So the purpose of this panel is really to discuss some of the major issues affecting package managers. We'll have time to take questions from the audience as well, so a group of people will get an opportunity to ask our panelists some questions as we go along. But I thought I would start off with everyone's favorite topic, which is naming and versioning. Major and minor versioning has been used for decades as a way of communicating breaking changes to consumers, and this has been codified through SemVer, which has become de facto in a number of package managers. There's an increasing belief that SemVer isn't meeting the needs of the community, that it's lulling people into a false sense of security, and that we'd be better off just doing chronological versioning. Who would like to comment on this?

So yeah, we did SemVer for a long time. I was in the Eclipse community for quite a while, which had quite rigorous semantic versioning practices, I'll say. And, you know, we often failed. It's hard, because of one person's notion of an API — you have to be very, very clear about what an API is. But I'm in favor of keeping it and pushing down that path, even in the face of some failures, because it is a signal that an API is a contract between you, the producer, and the others, the consumers. And it's some level of communication of my intent: I'm intending that this not break the API, I'm intending this to be an incremental change, et cetera. It communicates that to you as a consumer. I might be wrong, in which case that's like a bug, and you can report that bug and submit a pull request and all that sort of stuff. But without that, we're just kind of, hey, this is Thursday's build — take it, maybe it's good, who knows, right? So that's my thought.

I can say I'm typically a technology optimist, a pathological optimist, but when it comes to versioning systems, it feels to me that we're always trying to add semantic information to something that's completely arbitrary. And therefore I'm actually in favor of something like chronological releases, because you communicate which version you're looking at without adding artificial semantic information. But in the end, what works, works. So let's stay pragmatic.

I'm in favor of it. It's a coarse mechanism; we would perhaps want to have interface versioning at the level of specific API endpoints, but anything beyond that would be too complicated. And it could be complemented by things like long-term support versions, so that people know they can pick a specific version and have it supported for a longer time, and by better communication of end-of-life policies.

I had a comment on that. One thing we did to help support that as producers was having tools that tell you when your APIs change — we have much better technology these days for analyzing your code and telling you when a change you're making now might affect your API. So we implemented tools a long time ago that would, literally as you typed — say you had an interface that was supposed to be implemented by consumers and you added a new function to it — tell you: hey, you're changing the API, you're going to have to bump your version number appropriately. That's stuff we can do. It seems to have fallen by the wayside — at least I've not seen a lot of it — but it's certainly there to help.

I think the closest I've seen to that is a number of tools that will take commit messages with additional metadata in them, such as "this is a breaking change". That's capturing the developer's intent — not quite as automated, but again, better than nothing.
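To make that contract concrete, here is a minimal sketch, in Python, of the convention SemVer encodes and of the kind of "bump advisor" the tooling described above provides. The function names, inputs, and the simplified compatibility rule are illustrative, not any particular tool's API.

```python
# Minimal sketch: what a SemVer bump communicates, and a simplified
# compatibility check. Illustrative only.

def required_bump(removed_or_changed_api, added_api):
    """Suggest which part of MAJOR.MINOR.PATCH must change."""
    if removed_or_changed_api:
        return "major"   # breaking change for consumers
    if added_api:
        return "minor"   # backwards-compatible addition
    return "patch"       # bug fix only, no API change

def is_compatible(installed, candidate):
    """Simplified caret-style rule: same major version means assumed compatible.
    (Real SemVer ranges treat 0.x versions specially; omitted here.)"""
    return installed.split(".")[0] == candidate.split(".")[0]

print(required_bump(removed_or_changed_api=False, added_api=True))  # minor
print(is_compatible("1.3.9", "1.4.0"))  # True: the producer claims no breakage
print(is_compatible("1.3.9", "2.0.0"))  # False: the major bump signals breakage
```

The version number only carries the producer's claim; the point of the tooling above is to keep that claim honest.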
So, moving to the naming side a little bit: there are two big problems in package naming — well, at least two; the third is coming up with a name. One of them is that there's an increasing amount of typosquatting, where people with malicious intent are creating purely malicious packages that have names close to the original. And a similar threat is that, because people will often consume packages from multiple feeds, we have no guarantee that package foo version one on your internal feed is the same as package foo version one on a public feed, and so there can be opportunities for people to hijack the package that you're intending to use. What are your thoughts about how consumers and package managers can help protect against these cases and help protect the community?

I'll say a careful thing, because this actually goes back to something I ran into myself: I accidentally swapped letters in a name, referred to a different package, and everything broke. And it was so hard to find, because you look at everything and it looks totally sensible. And yeah, we found it. I think this goes back to the idea that you're adding artificial information on top of what you're versioning in the code. In a sense, every versioning scheme that gives code an arbitrary name adds a layer on top of the actual repository. And depending on the languages you use, my favorite solution is actually to refer to submodules in this case, and to have a SHA-1 pointing to the code I'm using. Then you don't have that problem. But that means bypassing package management outright. So it's a conundrum.

One thing that can help here is basic hygiene principles. Let's be conservative in what we consume and what feeds we accept. If I go and invite 10,000 people to my home, something bad will happen. If I'm more conservative in the modules that I use, and in the transitive dependencies, that's going to be better. Regarding the accidental changes, a basic signature on the packages — one that fails to verify when a package with the same name appears in a different feed — should be enough.

One interesting thing is that I think a lot of these problems are endemic to ecosystems, and so the package management people — the folks who actually create the package management systems — bear a lot of responsibility for the things that happen in their communities. So I think a lot of the package manager tooling should be written with these sorts of problems in mind. I'll give just one example; I won't name which community this is. But there exists out there today a package management system where, if you have multiple feeds enumerated, it non-deterministically picks which feed it's going to pull the package from. It basically sprays requests for the package versions and takes the first one that comes back. Right — I mean, we laugh, and there are reasons for it. But it does lead to actual problems where my feed has the package and I'm happily using it, and then somebody creates a package with the same name. It's not typosquatting — they actually create a different package with the same name in a different repo — and suddenly, due to non-determinism, a hiccup in the network or whatever, I'm getting this other thing from outside. But only every fifth time. But only every fifth time, yeah, exactly. So I think that the package management infrastructure folks also need to be designing for these sorts of situations. I love the idea of some level of signatures — whether it's just a hash keeping track of "this version is this thing", or actual signatures with trust chains and so forth — I think that can do a lot to help.
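As a concrete illustration of the typosquatting problem, here is a minimal client-side check, sketched in Python, under the assumption that you maintain a list of well-known package names to compare incoming names against. The package names and similarity threshold are made up for the example.

```python
# Minimal sketch: warn when a requested package name is suspiciously close to,
# but not the same as, a well-known name. Names and threshold are illustrative.
import difflib

WELL_KNOWN = {"requests", "urllib3", "numpy", "cryptography"}

def suspicious(name):
    """Return well-known names that 'name' nearly matches (but doesn't equal)."""
    if name in WELL_KNOWN:
        return []
    return difflib.get_close_matches(name, WELL_KNOWN, n=3, cutoff=0.85)

# "requets" is one transposition away from "requests": warn before installing.
print(suspicious("requets"))   # ['requests']
print(suspicious("leftpad"))   # [] -- nothing close enough to flag
```

A registry can run the same kind of check at publish time; a resolver can run it at install time.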
So, building on that idea of trust chains: one way of addressing this problem is using things like package popularity and the health of the package — that kind of metadata — to give you a hint when you're going in the wrong direction. What do you think about that as a way of helping consumers realize that they may have made a mistake?

I think it's a good idea. The more data you have as a consumer, the more informed decisions you can make. Being told when there's an anomaly — like suddenly you're taking a dependency on a package that's not as popular as it was previously. Say you ran a resolution and all the popularity scores were above 50 — I'm making up arbitrary numbers — and you run the resolution again and something crops up that's at 20, much lower on the scale. That can be a thing. And also, as you're actually picking packages, picking ones that are known to be more useful. There's a danger in that, though — it's the herd mentality, right, where people only use things that are popular, so anything that's new doesn't get any daylight. So that's a challenge.

I'm a bit skeptical; it needs to be done very carefully. It's a good idea, but if you get too many messages, people will just ignore them, and if you raise the bar too high, then maybe you won't get warned when something bad happens. It just needs to be done very carefully.

I think we also need to look at the scenarios where you would apply such a process. When you choose a dependency for a certain piece of functionality that you would like to import into your project, what I usually do is look at community health: the number of contributors, the number of commits in the repository, ranking, things like that. But this is a one-time choice you make. At a later point in time, when this is about managing dependencies as an ongoing process, looking at popularity is not the kind of diligence I would expect there, right? Because at that point nothing beats diligence, which means people with eyes on your configuration, sorting this out. But maybe we're mixing two perspectives: choosing what you use, and later integrating it into build processes, et cetera.

A quick thing to add — you reminded me of something. One of the most popular markdown processors on Node: if you went and looked at it, it was stale; it hadn't had any commits for a year or something like that. And so if you just followed popularity, you'd go and use marked, I think it was. But it was known to be stale, known to have problems, and finally somebody put a notice on the GitHub page saying: don't use this, it's old and stale. So just following the numbers doesn't always answer the problem. Yeah — and specifically in that case, someone had created a fork, which obviously would have started off with low popularity, which means people may not have been selecting it, and that's why they were selecting the other one to begin with.
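Here is a minimal sketch of the anomaly check described above: compare the popularity scores seen in the previous resolution with the current one and flag anything new or unusually low. The scores, the threshold of 50, and the data shapes are the arbitrary numbers from the discussion, not a real scoring system.

```python
# Minimal sketch: flag new or suddenly-unpopular dependencies between two
# dependency resolutions. Scores and threshold are illustrative.

def popularity_warnings(previous, current, floor=50):
    warnings = []
    for pkg, score in current.items():
        if score < floor:
            if pkg not in previous:
                warnings.append(f"{pkg}: new dependency with low popularity ({score})")
            elif previous[pkg] >= floor:
                warnings.append(f"{pkg}: popularity dropped from {previous[pkg]} to {score}")
    return warnings

prev = {"foo": 80, "bar": 72}
curr = {"foo": 80, "bar": 20, "baz": 15}   # 'bar' dropped, 'baz' is new and unpopular
for w in popularity_warnings(prev, curr):
    print(w)
```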
So, moving on to how people express the dependencies they want to consume. A number of package managers have a concept of lock files, which are intended as a way to let you pin which versions you're consuming, often for both direct and indirect dependencies. Can these be fully trusted, and are they a substitute for a software bill of materials or another form of inventory?

I guess they can be trusted if the tools are doing what they say they're doing. It's interesting — there are two kinds of lock files in my view. There's the input lock file, which essentially gives advice to the resolver: hey, if you need foo, pick version 1.3.9. Then there's the output lock file, which says: this is what I did — I'm the resolver, I actually picked foo version 1.3.9, and it's actually in the configuration. I think we sometimes mix those two and treat them as the same thing. As a developer I'm much, much more interested in the input one; as a compliance sort of person trying to manage large software systems, I'm much more interested in the output one, because I want to know what the resolver did, not just what people thought it should do. Actually, neither is universally available, and I'd like to see both be available in all systems.

Well, you asked: can these be trusted? Of course not — no system can be trusted. They serve a purpose. For the outgoing definition, they serve the purpose that whenever you build your software and deploy your application, you're using a version that you know. But it puts another onus on you, because it means you're pinning the version, and you'll find that in many real-life projects the idea of saying "this version or newer" is not popular with the people deploying the applications, because they're pulling in bugs with every new build. That puts the onus back on the team managing the deployment to make sure they stay up to date with updates. So you're deliberately blocking the process of automatic updates, and that puts work and responsibility on you to stay on top of it. So I don't think the question can be answered that way. And in no way does it replace things like bills of materials, which are of questionable usefulness anyway — it just serves a different purpose. It's apples and pears — oh, that's the German version. Apples and oranges, I think.

Pinning is a risk, especially when there are security updates. That's one problem, and that's why I'm skeptical about them. They do have a very nice advantage in that the producer of a component can scan and see which versions are used, what breaking changes might be introduced, and how many users will be affected. This is something that might be useful.
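To make the input/output distinction concrete, here is a minimal sketch in Python. Both file layouts are hypothetical, not any specific package manager's format; the point is that one records advice to the resolver, while the other records what the resolver actually did — including an integrity hash, which also addresses the same-name-on-a-different-feed problem discussed earlier.

```python
# Minimal sketch: manifest vs. input lock vs. output lock. All layouts,
# URLs and hashes are hypothetical placeholders.
import hashlib

manifest = {                      # what the developer asked for
    "foo": ">=1.3",
}

input_lock = {                    # advice to the resolver: "if you need foo, take 1.3.9"
    "foo": "1.3.9",
}

output_lock = {                   # what the resolver actually resolved and fetched
    "foo": {
        "version": "1.3.9",
        "resolved": "https://feed.example.org/foo-1.3.9.tar.gz",
        "sha256": "3f2a...c9",    # placeholder hash
    },
    "bar": {                      # transitive dependency pulled in by foo
        "version": "2.1.0",
        "resolved": "https://feed.example.org/bar-2.1.0.tar.gz",
        "sha256": "91bd...e0",    # placeholder hash
    },
}

def verify(path, expected_sha256):
    """Check a downloaded artifact against the hash pinned in the output lock;
    a same-named package from another feed will fail this check."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == expected_sha256
```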
I have a question — I mean, there's a whole bunch of opposing goals involved; you've heard security versus features, that kind of thing. There's more that you could do in the resolution mechanism. Right now you've got — I usually call it a manifest. You could take a manifest as a profile and try to produce a new lock file that's as close to the existing one as possible — the solvers can do that. It's not something that I think anyone's written, but you could actually say: take the critical security updates these projects need, but don't update anything else, because I know that configuration works. I think there's a lot more information you could put into that process to tell the solver why to prefer one version or another, and it doesn't get put into any of the tools today — they're all pretty simple in that regard.

So that was kind of my point about the differentiation between the input and output lock file. I actually think there are three things. There's the manifest: what did the developer of this component think they wanted. Then there's the consumer of the component, who takes this component, puts it in their system and says: that's all cool — you want foo version 1.3 or greater, you're going to use 1.3.9 — or maybe they give it a more constrained range. And then there's what the resolver actually did, and that's an output. Maybe you wouldn't call that a lock file; maybe it's more of a dump file or something like that. Anyway.

So, does anyone think that lock files are inherently bad and should not be used? I'm guessing no, based on... anyone?

I think the question is rather: what else would you use? We can't say it's bad if we don't have a better way. On a breaking change, you change the name — the name is there to give you a backward-compatible future. That moves the semantic information from the version number to the name: it's identity, where identity is name plus version, and you just move the version over into the name part. Yes, that's cool — technically. But psychologically, or humanly, I would be annoyed. Is that a feature or a bug? I consider it a bug.

What else can we do, if we don't use lock files for versions? My perspective is primarily that of a software engineer, and I only get into package management at the end of the work. I use lock files. I pin the dependencies of my builds so that I know which version I use, and I use a commit that changes the versions as the trigger for a build, to see: it worked before, does it still work — okay, now we can use this version. That's a certain purpose that this serves for me. Of course we can say that's inherently bad, you can call it bad practice, but I don't know a better practice for what I'm trying to do there. I'd really like to put the question to the room: do we have a better way to do that? There doesn't seem to be a better way — people, don't all look at me.

The lock file right now is a file that contains some versions. It doesn't say why it contains those versions. What's typically assumed is that that's what the developer tested with — that's your best-case lock file: what the developer tested with. In our world, we probably have several lock files for different projects or different machines; it even gets combinatorial at that level. What I would much rather have than a lock file is — well, why is the developer special? I would like to know that someone tested with these versions. I could have the lock files from all my users, for all the versions that they got working. I could have the lock files from all the deployers, or distro managers, or anyone who ever tested that configuration, and report that somewhere. Then at least you have a better way of assessing confidence in whether the package is going to work by some criteria. You probably need to go further than that and assign more meaning to it: they ran the full test suite, they smoke-tested it, it compiled. There are lots of different things that a particular configuration could mean, so I think you need more meaning.

Related to that, something that happens in GitHub: there's this feature called Dependabot that, when it sees there's a vulnerability in one of your dependencies, tries to propose a PR that updates it. One of the things they do is look at all of the other users of the version they're updating to, where tests have run, and see whether the tests pass; they try to develop a compatibility score. It's by no means perfect. The reason I bring it up is that it's down this path of crowdsourcing the notion of things working together. We've got all of these millions of people — or, depending on how big your ecosystem is, lots of people — using these packages, and we can actually tell from real experience whether or not version 1.3 is compatible with version 1.4.
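Here is a minimal sketch of that crowdsourcing idea: aggregate observed upgrade outcomes from projects that already took the same version bump. This illustrates the concept only — the observation records and the scoring are hypothetical, not Dependabot's actual logic.

```python
# Minimal sketch: a crowdsourced compatibility score from observed CI results.
# Hypothetical observations: (from_version, to_version, tests_passed)
observations = [
    ("1.3.0", "1.4.0", True),
    ("1.3.0", "1.4.0", True),
    ("1.3.0", "1.4.0", False),
    ("1.3.0", "2.0.0", False),
]

def compatibility_score(upgrades, from_v, to_v):
    """Fraction of observed upgrades from_v -> to_v whose tests passed."""
    results = [ok for (f, t, ok) in upgrades if f == from_v and t == to_v]
    return sum(results) / len(results) if results else None

print(compatibility_score(observations, "1.3.0", "1.4.0"))  # ~0.67: mostly works
print(compatibility_score(observations, "1.3.0", "2.0.0"))  # 0.0: likely breaking
```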
A program can process that file, so if clients were able to do an improved, human review of a lock file, you could set up a system so that Git presents it in a human-reviewable way — or maybe just gives a plus or minus, like "this is a correct lock file", something like that.

One other thing is that we can be more careful about breaking changes — we can consider breaking changes to be an antisocial activity. We see things like the Unix system calls, which have worked for decades with small changes: where changes were needed, the names were changed, so we have wait and wait3 and wait4, but read, for example, has stayed the same for 50 years.

I'd like to respond to the statement that lock file changes aren't human-reviewable. It really depends. It's a matter of engineering practice, I think, more than tools. I usually ask that if there is a change to the lock file, it's a separate change — the configuration change is its own PR — which means the only change you have is in the lock file, which is either plain text or JSON, and I think you can review it. It's work, but reviewing a sizable PR is also work, so it can be done.
The question really is whether it's the right tool for the job. And following on from what you said, I think one thing we lump together is that we think of it as one-size-fits-all: there is this one lock file and it says, these are the dependencies, basically. But if you take that code and deploy it into different places, that's the wrong tool for the job, because then you need configurations for the different deployments you're making. We usually use vendoring repositories for that, where the configuration lives with the deployment and you pull in the code you're using for it, and we don't use the generic dependency setup for the deployment-specific configuration. But that's just a basic workaround for different deployments.

So I want to go to the other end of the spectrum. Lock files give us a lot of flexibility and pin things in place, but a number of package managers express their dependencies — their manifest — as executable code. That has the power of being dynamic, but with it come certain challenges. Can you talk about those challenges and the problems they cause for consumers — or is it just: no, don't do that?

So, I mean, okay, it's cool, you can do it — not everything you can do, you should do — but fine. At the very least, if you do that in your ecosystem, provide a mechanism for dumping what the resolver did. Because the challenge with all these systems — and this is a problem we have at GitHub with things like Dependabot — is that if you've got arbitrary code running, the only way you can understand what the resolver is going to do is by running the code. But if it's arbitrary code, then it's not trusted, so now I have to figure out how to run your code in a sandbox, in a trusted or untrusted environment, just to figure out what set of dependencies you're going to have. If you at least ended up with a lock file, or a dump file of what the system did, I could then reason over it. The reason declarative stuff is useful and powerful is because it's declarative: I can simply reason over the face value of what's there, as opposed to running arbitrary code.

I just want to say: nobody wants that. For more than two decades now, the idea of using a Turing-complete programming language to do configuration management has come up again and again, and none of the systems that came up have reached dominance in the market. I think there may be a reason for that. Software engineering is already complicated, and if you add another layer that makes it exponentially more complicated, maybe that's just not the right way. I think the systems that are successful are typically simple almost to the extreme: they use a plain text file to tell you what the dependencies are; maybe they use a little setup file to install the package, but it's like ten lines and you can read it if you know the technology. So maybe simplicity is key, and that's why this doesn't get adopted — it works, but it doesn't get adopted.
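A minimal sketch of why that matters in practice: a declarative manifest is just data, so its dependency set can be read without executing anything, whereas a manifest written as code has to be run — in a sandbox, if it's untrusted — before you even know what it depends on. The file contents and schema below are hypothetical.

```python
# Minimal sketch: a declarative manifest can be analyzed statically.
import json

MANIFEST = """
{
  "name": "my-app",
  "dependencies": { "foo": ">=1.3", "bar": "2.1.0" }
}
"""

# Static analysis: the dependency set is just data; no sandbox needed.
deps = json.loads(MANIFEST)["dependencies"]
print(sorted(deps))   # ['bar', 'foo']

# By contrast, a manifest that computes its dependency list at run time can only
# be understood by executing it, which means running untrusted code just to
# learn what it depends on.
```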
So I want to move on and talk about the topic of reproducibility, from a couple of different angles. One is — there seems to be a mix, and I'll actually pose this question to the audience: how many people work in environments where you rebuild packages from source rather than consuming binary packages? Let's say roughly 50%. The same question was asked in the legal devroom earlier today, and it was roughly 50% there as well. This poses a number of challenges — there are difficulties to it and it can cause problems. What do you think: first, what's your opinion as to whether this is a best practice, and if so, what are some things the industry can do to make it easier and better?

It should certainly be an option in any packaging system to offer this ability. We heard good reasons why it can help — with performance, for example — and certainly for security and for compliance, and that's why I believe it should be available. Whether people prefer to use it or not should be left as an option.

We used to say: don't take binaries from strangers, right? But the question is, who is a stranger? A lot of this is about trust. We have a lot of discussions like this where we seem to think that technical solutions can replace trust. Imagine you have a completely trustworthy source — say GitHub is running a build service, it's completely transparent, you can inspect everything, they're producing binary versions of packages and offering them in a repository, you can completely assess how they're doing it, and we assume it's completely trustworthy. Would it really be useful to compile on your own all the time? Think of the trees — think of how much electricity we're wasting.

Well, so there we go — what do I care about the megawatts we're using, right? So in the end, there's a time for it. I come from a background as a C++ developer; of course we build everything from source code every time. But really it's a matter of trust and of good technology, right? vcpkg solved a lot of these issues for compiled languages, where binary versions weren't used at all — not accepted, "I don't trust binaries from strangers" — and all of a sudden people say, well, this is good, you can use it. So I think it's an aspect of getting into the more modern world — and of saving the environment. And you disagree with me.

It's interesting — it's long been observed that there are two camps: those who won't use it if they didn't build it, and those who won't use it if they did build it. And there's no right or wrong. The interesting thing about building it yourself: my ideal is a really rigorous, reproducible build system that can be strongly trusted, so I can get the code, push "build", and it's well understood what "build" means — it's not some massive long command-line incantation I can get wrong. I just run the build, and it runs, and there are no warnings, there are no errors, it just builds and says "good". But imagine this: you're a typical open source consumer or developer, you're using hundreds of components, you're trying to build them all, you type make or whatever your build system is, and all this spew of orange and red goes by on your terminal. Are you going to trust what you just built? Are you going to ship that to your users or your customers? No. If it comes up all green, I might start having warm fuzzies, but if there's anything non-green on there, I'm not shipping it — I now have to spend hours digging through other people's builds trying to figure out what's going on. So that build system needs to be trustable — it's a two-way street here, right? I have to be able to trust the reproducibility of it. I would love for that to be true, but we're not even at that point.
To set some context here — I mentioned this maybe in an earlier talk — we did a little survey of about 200,000 packages, and 42% of them don't have any reasonable way of going from the binary version, like foo version 1.3, back to the git commit. That means you can't even identify the source for 42% of the packages, let alone build it. Sorry — that was across a bunch of ecosystems; it's not meant as a broad-brush generalization, but there are over 200,000 packages across a bunch of different ecosystems, and that's the kind of statistic you hit. Whether it's 30% or 70% doesn't matter — it's not 2%, right? It's a good point.

Yeah, I don't disagree with the people who said they prefer binary packaging; it's just that the reason we don't use binary packaging is that it doesn't work for us. In fact, Spack itself is a built-from-source package manager, but we do binary packages, like Nix or Guix or some of these other systems, and we're trying right now to provide enough provenance with the binary to understand where we can use it. The reason people don't distribute optimized binaries is that there isn't a good naming scheme for microarchitectures — I'll come to that in my talk tomorrow — but if you can provide a binary with sufficient provenance, enough to satisfy the user, then you can totally distribute binaries everywhere, especially if you can make them trustworthy, if you can sign them or whatever it is. So I don't think it's a preference thing, it's a utility thing: which one do they trust more, and which ones are even possible. In a more native-code environment, where you've got lots of compiler options that vary by architecture or whatever, those two binaries are just not compatible — I can't run them in the same place — so you giving me a binary doesn't help me at all. In JavaScript it's a different story. So you decide where you can use it, and that's what we're trying to do.
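Here is a minimal sketch of the traceability check behind that statistic: given a package's published metadata, can you get back to a source repository and an exact commit or tag? The metadata fields shown are hypothetical, not any registry's actual schema.

```python
# Minimal sketch: can this package's metadata be traced back to its source?
def source_provenance(metadata):
    """Return a 'repo@revision' reference if the metadata allows it, else None."""
    repo = metadata.get("repository")
    rev = metadata.get("commit") or metadata.get("tag")
    if repo and rev:
        return f"{repo}@{rev}"
    return None

print(source_provenance({
    "name": "foo", "version": "1.3.0",
    "repository": "https://example.org/foo.git", "commit": "9fceb02",
}))                                                            # traceable
print(source_provenance({"name": "bar", "version": "2.1.0"}))  # None: can't trace it
```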
So we've got five minutes left, and I'd like to open it up to any questions people have for the panelists. Okay — on dependency management and package managers, please.

I think this would be useful. So the question is whether central package registries should have a feature to mark some components as deprecated. In fact, I would go further and say I would have a dead man's switch: somebody is pinged to ask whether he or she is still alive to maintain the package, and if they don't respond a couple of times, the package is marked as risky. Yeah — distrust by default.

I just need to plug that there is actually an ongoing project — do we have materials? Yeah, we have some survey questions on your table. And I think it's not just about deprecation; it's about metadata associated with the packages you're using, so that when you pull in dependencies it's able to tell you that there was a binary breakage, like an API breakage, or that there's a security advisory — deprecation warnings are basically just one more use case. You can also include compliance information, so that when you're building code as a developer you get warnings if you pull in something that has outstanding CVEs, et cetera. So I'd say this is totally useful, and we're not the only people who think this way, because it even got funded by the EU and part of it is being developed.

My little hobby horse here is getting back to that connection between the binary and the source, because if you can get back to the repo — whether it's on GitHub or wherever — to the community that produced that binary, you can start understanding much more: are there thousands of open issues, are there no developers, when was the last commit — all those sorts of things that might help you make an informed decision. So, more information, yes. I'm a little bit leery of the social aspects of random registries running a poll saying this is good or this is bad — that gives me pause. But yeah, I think we need more information, crowdsourced, to help people understand what's going on.

So how would that work if the package is depended on by others — do they take it over automatically? It could be a community that takes over to make sure it still works. The other thing about CRAN, though, is that they archive versions of things so often, and they update so often, that any checksums you put in the package manager are likely wrong, so there's no way for us to securely fetch from them the version that we trust — and I don't think that's helpful. There are big issues there.

Well, we're officially at time, so thank you everyone, and thank you to our panelists.