Hi there, thank you. Audio okay? Great. I'm going to get started. I'm very happy to see so many of you here interested in the technical details of Python package repositories. But first I'm going to give you some context as to where I'm coming from and the organization I work for. CERN is the European Laboratory for Particle Physics, and its mission is to perform world-class research in fundamental particle physics, trying to answer some of the fundamental questions about the fabric of our universe, if you like. I'm going to show you a quick video of how we might go about solving some of these problems.

CERN is based in Geneva, on the border between Switzerland and France, and we have an incredible complex of accelerators, plus experiments to measure the outcomes of those accelerators. It really starts very simply, with a hydrogen bottle: out comes a beam, which we accelerate through our linear accelerator and then inject into the Proton Synchrotron Booster, increasing the energy of the beam through four beam lines. Once the protons have reached a certain energy, we transfer them over to our Proton Synchrotron: about 600 metres of incredible equipment, again accelerating this beam to ever higher energies. Next into the SPS, the Super Proton Synchrotron, which is about seven kilometres around. And then, once we're at the right energy, we transfer to the LHC, the Large Hadron Collider: 27 kilometres of incredible engineering, magnets, vacuum, cryogenics, incredible stuff. You saw briefly there at the end the ultimate goal: we focus these highly energetic beams and essentially make them collide. And this is a schematic of one of our experiments at CERN that measures these collisions. We're colliding incredible, very tightly focused energies, and from Einstein's equation there's this correlation between energy and mass; essentially all we're doing is converting this insane energy into particles and then studying those particles. Now, there are probably many physicists in here who are far more capable of explaining this than I am; I'm a software engineer. But this is what's going on here. There's plenty of Python happening in the experiments, in the reconstruction and data analysis work.

What I'm going to talk to you about today is not the experiments, but rather the hardware I just showed in the video: the accelerators that go into producing all of this. There are some amazing accelerators at CERN. I've shown you the LHC, but there are other equally impressive machines. We've got an Antiproton Decelerator, so that we can essentially bottle an antiproton, measure its properties and explore the fundamental research that goes with that. It's mind-blowing. And the Large Hadron Collider, as I said: 27 kilometres, the largest and arguably the most complex machine ever built, and it sits here in Europe. As you can imagine, we have security and safety at our core, multi-tiered, starting at the hardware level. We've got hundreds of thousands of devices, basically C++ real-time device drivers, and at the high level, Java and, increasingly, Python. So Python is increasingly at the heart of the Large Hadron Collider and its operations. As for our specific use cases for Python: we have access to all those hundreds of thousands of devices, and we can control them with high-level device control APIs in Python.
We provide operator and equipment-expert graphical user interfaces based on PyQt. We do big-data offline machine analysis, for fault tracking or optimization of configuration, typically with PySpark. We do physics simulations, and we're increasingly doing online optimization: real-time machine learning and deep learning to really get the most out of these beams, to maximize the number of collisions so that the experiments can maximize the science data they get out of that.

So, some context. I'm going to switch now, away from the physics and towards how Python is actually used, and the topic of today's talk. Python, and indeed everything operating on these accelerators, runs within a closed network: no internet. So we ship a Python distribution, a bunch of Python-based distributions in fact, which are intended to be extended by our users using virtual environments. We've got around 500 users, and there's a centralized support and software engineering technologies team, in which I sit. In terms of development, we encourage our developers to produce Python packages, not Python scripts, because that gives us all the benefits of dependency management and makes things easier to test; we hope the result is maintained, and it should be released to our software repository. In terms of deployment, once a package is released, we just use virtual environments, and we put them on a read-only, network-mounted disk. This means that every machine in the network can instantly access the same installation, so we don't have to worry about whether something is installed over here or over there: it is installed on every machine. So: without internet, released using virtual environments.
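To make that deployment model concrete, here is a minimal sketch of what a release step could look like; the paths, index URL and package name are hypothetical illustrations, not our actual setup. A virtual environment is built per release on the network-mounted directory, and every machine then sees the same installation.

```python
# Hedged sketch: deploy a released package as a venv on a shared,
# network-mounted directory. All paths and names here are hypothetical.
import subprocess
import venv
from pathlib import Path

DEPLOY_ROOT = Path("/nfs/deploy")  # hypothetical network mount (read-only for clients)
INDEX_URL = "https://packages.example.cern/simple/"  # hypothetical internal index


def deploy(project: str, version: str) -> Path:
    """Create a venv for one release and install the project into it."""
    target = DEPLOY_ROOT / project / version
    venv.EnvBuilder(with_pip=True).create(target)
    subprocess.run(
        [
            str(target / "bin" / "python"), "-m", "pip", "install",
            "--index-url", INDEX_URL,
            f"{project}=={version}",
        ],
        check=True,
    )
    return target


if __name__ == "__main__":
    deploy("my-operational-app", "1.2.0")
```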
So of course, you need a package repository for this. I'm going to call ours Acc-Py. Our package repository needs are quite simple, really. We have a bunch of our own internally developed packages, and we want to use those together with the ones on pypi.org, through the same index. Obviously, we're going to need some kind of gateway to bring the internet into our repository, so that we can access it from a secure network. We need to protect ourselves from PyPI downtime, so we want some kind of proxy caching; it's good enough if we can only access projects which have recently been installed. We want some visibility of what's being used in our environment, and the ability, if there are known bad packages, to cut them out and prevent them from being accidentally installed.

Some other nice-to-haves: we'd like per-project upload permissions, just as you'd expect with PyPI. We want to be able to benefit from the newer index standards: PEP 592, the yank capability (in fact, that's probably more of a requirement now, it's kind of a security function), and also PEP 658, metadata extraction, which is an optimization on the pip side. And we want some kind of gateway where people can find out about the projects in our environment: discover them, get the README, get links to things, and generally have the experience that you have with PyPI.

These are not novel requirements. There are lots of tools out there, very capable tools, in fact. Probably the most well-known ones in our community are devpi, and maybe Artifactory and Nexus, but there are many other good ones too. A little bit of history: back in 2018, some colleagues of mine installed Sonatype Nexus, and that was a very easy experience. I think it went from zero to first pip install very, very quickly. It was also advantageous that Nexus is developed in Java, because my colleagues have extensive in-house Java operations experience, so the two things aligned nicely. Ultimately, this tool met all of the essential requirements I listed previously. It had an additional advantage in that it's applicable to other types of package repositories too: Conda, containers, or npm.

So Nexus was installed, serving the key endpoints nicely: the ability to list the available files for a project, the ability to download them, some UIs to browse them, and the ability to publish using twine. What's actually going on here is that when you configure Nexus, or many of these services, you're building up a configuration. In our case, we wanted to proxy pypi.org, we wanted our own internal packages, and we wanted to combine those together in a merged view. I've indicated here, in a grey colour, that this is an internal call, not going through HTTP, while these other things are HTTP.

One thing that we did want, which wasn't particularly easy to get with Nexus, was the ability to publish from our GitLab CI pipelines, using our GitLab tokens. We decided to implement a very lightweight proxy, whereby you communicate with our Python service using the token, we authenticate the token, and then we forward the upload to our internal packages repository using a service account.

We were happy operating in this mode for quite some time, and then, around 2021, dependency confusion became a popular thing. Dependency confusion is a pretty simple concept: you've got a package of a certain name on your internal repository, and someone uploads the same name to pypi.org with a greater version. Suddenly, in your merged view, you're installing the package from PyPI and not the one you expected. Now, that could be malicious, or it could not be, but either way it's not very nice; it's not what you want, so you have to do something. At this point it was clear that Nexus wasn't going to be able to solve this problem, and it still can't today, as it happens. We could have said, okay, Nexus isn't really doing it for us, and we should be using another tool like devpi, for example. But we wanted some flexibility that we didn't think we were getting with either Nexus or devpi: we wanted a more modular approach, and we wanted to minimize the disruption to our users, mainly on the UI side, actually. So instead, we went for building some small components that would solve this problem for us.

It's a pretty simple design. PyPI comes in via HTTP into our component here, and similarly the internal packages via HTTP. Once we've got this internal representation, we merge the two together, through function calls, using a priority merge algorithm, a very simple implementation, and then expose that through REST back to Nexus. Nexus continues to serve our PEP 503 index and our user interface requirements. It was a really effective solution: we kept all of the benefits that Nexus was previously giving us, including offline caching, we still had the internal index and the ability to upload, and we still had the browser. A very effective incision to fix this problem.

Focusing a little bit on this priority merge view: it works well. The idea is basically that if I've got a package on my internal index and another one of the same name on pypi.org, we only ever take the one from my internal index. Pretty simple.
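To make that concrete, here is a minimal sketch of a priority merge, assuming illustrative names and data structures rather than our actual implementation. The key property is that the first source that knows a project wins outright, so a same-named upload to pypi.org can never shadow an internal project.

```python
# Hedged sketch of a priority merge across simple indexes; names and
# structures here are illustrative, not the production implementation.
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectPage:
    name: str
    files: tuple[str, ...]  # distribution filenames, as on a PEP 503 project page


def priority_merge(name: str, sources: list[dict[str, ProjectPage]]) -> ProjectPage:
    """Return the project page from the highest-priority source that has it."""
    for source in sources:  # ordered: internal index first, then the PyPI proxy
        if name in source:
            return source[name]
    raise LookupError(name)


internal = {"my-tool": ProjectPage("my-tool", ("my_tool-2.0-py3-none-any.whl",))}
external = {
    # An attacker uploads "my-tool" 99.0 to pypi.org: it is never consulted,
    # because the internal index already has a project of that name.
    "my-tool": ProjectPage("my-tool", ("my_tool-99.0-py3-none-any.whl",)),
    "numpy": ProjectPage("numpy", ("numpy-1.25.1-cp311-cp311-manylinux2014_x86_64.whl",)),
}

assert priority_merge("my-tool", [internal, external]).files == ("my_tool-2.0-py3-none-any.whl",)
assert priority_merge("numpy", [internal, external]).name == "numpy"
```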
Even so, this priority merge is still vulnerable to something which I call internal dependency confusion, which is kind of the opposite of dependency confusion: if a package exists on pypi.org and I then upload a package of the same name, with a greater version, to my internal index, the priority merge is going to give me the internal one, not the external one. So imagine someone uploaded numpy internally; you've still got some problems there, right? It's pretty uncomfortable. For that, we need a proper mechanism.

Okay, the text here is a bit small, so I'll try to explain what's going on. First off, we fixed internal dependency confusion in our upload service: we prohibited people from uploading packages which already exist on pypi.org. The way to properly solve this, though, is through a mechanism such as PEP 708, which standardizes a fix for dependency confusion. Although, in my opinion, some namespacing mechanism would be preferable to that approach. PEP 708 is still in draft.

I want to talk about the interface, this grey line I was showing you. It's pretty simple, nothing novel really: an async API to get the project page and the other PEP 503 definitions, using the standardized data classes that you get from PEP 691. In some more detail: you can pass cache context through, and you're allowed to do ETag invalidation. We've also got a dedicated endpoint for resource preparation, so that we can do on-demand caching kinds of things.

So we've got this repository component with async APIs, and what's nice is that we can also compose these repositories. I've already shown you that they were chained in some way; well, we can actually do that, it's quite easy, and it makes for very tight definitions of what each repository component should be doing. On top of that, on the other side, once you've got this async API, we've also got a FastAPI-based router which uses it and exposes the standard PEP 503 endpoints. So we can compose repositories and then expose them through REST, over HTTP.

We're beginning to build a little catalogue here: we've got the ability to access HTTP repositories, we can cache repositories, we can merge repositories, we've got an async API, we've got a router in FastAPI, and we've got the ability to serialize into HTML and JSON. Next, we wanted to build some kind of backup for our Nexus instance, and we actually used our async client API to do that. It's about ten lines of code, nothing special, not complex, but it backs everything up into a PEP 503-compatible directory structure. Then we were thinking, well, wouldn't it be great just to serve from that directory rather than through Nexus? So we added a local repository component. Again, very simple: it just looks at the directory and exposes the packages that are there through the standard interface.
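Pulling those pieces together, here is a minimal sketch of how such composable components might look behind one small async interface. The class and method names are assumptions for illustration; the actual simple-repository interface is richer and uses the standardized PEP 691 data classes.

```python
# Hedged sketch of composable repository components; names are illustrative,
# not the actual simple-repository API.
from abc import ABC, abstractmethod
from pathlib import Path


class SimpleRepository(ABC):
    """The small async interface that every component implements (simplified)."""

    @abstractmethod
    async def get_project_page(self, name: str) -> list[str]:
        """Return the distribution filenames available for one project."""


class LocalRepository(SimpleRepository):
    """Serve a PEP 503-style directory structure straight from disk."""

    def __init__(self, root: Path) -> None:
        self._root = root

    async def get_project_page(self, name: str) -> list[str]:
        project_dir = self._root / name
        if not project_dir.is_dir():
            raise LookupError(name)
        return sorted(path.name for path in project_dir.iterdir())


class PriorityMergedRepository(SimpleRepository):
    """Compose components: the first source that has the project wins."""

    def __init__(self, *sources: SimpleRepository) -> None:
        self._sources = sources

    async def get_project_page(self, name: str) -> list[str]:
        for source in self._sources:
            try:
                return await source.get_project_page(name)
            except LookupError:
                continue
        raise LookupError(name)
```

A FastAPI-based router can then expose any component, or composition of components, over HTTP. Again a sketch, returning simplified JSON rather than the full PEP 503/691 serialization:

```python
# Hedged sketch of a router exposing any component over HTTP; paths are hypothetical.
from fastapi import FastAPI, HTTPException

app = FastAPI()
repository: SimpleRepository = PriorityMergedRepository(
    LocalRepository(Path("/srv/packages/internal")),
    LocalRepository(Path("/srv/packages/pypi-backup")),
)


@app.get("/simple/{name}/")
async def project_page(name: str) -> dict:
    try:
        return {"name": name, "files": await repository.get_project_page(name)}
    except LookupError:
        raise HTTPException(status_code=404, detail=f"unknown project {name!r}")
```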
So you might be wondering: okay, there's not much left of our Nexus instance; we've replaced lots of the different parts. The big remaining piece is the browser UI that it offers. We were wondering whether we could do something better. Our needs, as I say: we wanted a gateway for people to find packages, an entry point for our community to get the latest release information, to get the README, to get links, to be able to search for things. We start with a PEP 503 index and expose it through one of our components. We developed a new component for PEP 658 metadata extraction, so that we can get the metadata and display it in the browser. We add some FastAPI and templating with Jinja, and we get something which looks remarkably like PyPI. But it's not PyPI: it doesn't allow you to log in, and it doesn't allow you to do any uploading (none of that's standardized anyway). It does allow you to extend it, with Jinja and by updating the endpoint definitions.

Now, I didn't say anything about what type of index was on the left there, and in fact it's any PEP 503 index exposed through HTTP, coming from any of these tools. That's kind of cool, right? You can just point this service at any of your existing indexes, and you'll be away with a browser. We're using SQLite under the hood to help us do the search.

But maybe there's a little improvement to be made to the package repository definitions. The project list endpoint that exists today doesn't really do anything; pip doesn't use it. It would be nice if it could be more useful. Perhaps in the future it could be extended to do, for example, pagination, search, filtering, or sorting; we could probably list the most recently updated projects, for instance. Just an idea for how we could extend the repository definitions.

Okay, repository components. I'm trying to convince you that repository components are a good idea. This PEP 658 metadata extraction component is very simple: when you try to get a resource, it downloads the wheel, extracts the metadata, and makes it available through this API. We were able to just lift this component and put it on top of our repository, and suddenly we've got this browser capability, but we've also got a repository which serves PEP 658 metadata, and you get all the benefits of being able to use that. Again, sorry, the text is a bit small, but a very important takeaway here is that both repository consumers and repository producers can reuse these components, and you can start to build assumptions about the kind of repository you've got hold of if you've got these components below you. A very powerful concept, I think.

What's also interesting is that as soon as we did this, we exposed a few bugs in pip. They've been fixed this weekend in 23.2, so do update. This was caused by the fact that pip had no repository to test against, so they didn't quite implement the spec right, and PyPI didn't implement the spec right either; if either of them had done, pip would have exploded, and it would have been a big disaster for everyone. Actually, the only way to solve this was to introduce a new PEP, to mitigate this and find a new standard where both could be happy. So another big point I want to make is that our standards development process could be enhanced if we used reference implementations. And it's relevant to this conversation because this PEP 658 metadata component that I'm talking about is less than a hundred lines of code. So maybe that's something we could do in the future.
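As an illustration of why that component can stay so small, here is a minimal sketch of the core of the metadata extraction, with a hypothetical function name: a wheel is just a zip archive, and the core metadata lives in its .dist-info directory.

```python
# Hedged sketch: pull the core metadata out of a wheel on disk. A wheel is a
# zip archive containing a <name>-<version>.dist-info/METADATA file.
import zipfile


def extract_wheel_metadata(wheel_path: str) -> str:
    """Return the contents of the METADATA file inside a wheel."""
    with zipfile.ZipFile(wheel_path) as wheel:
        for entry in wheel.namelist():
            if entry.endswith(".dist-info/METADATA"):
                return wheel.read(entry).decode("utf-8")
    raise ValueError(f"no METADATA found in {wheel_path}")
```

Per PEP 658, the repository then serves that file alongside the distribution itself, under the distribution's URL plus a .metadata suffix, so that resolvers such as pip can read a project's dependencies without downloading whole wheels.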
Okay, coming towards the end. We released this just last night, hopefully all working: simple-repository, so you can actually run a repository, and simple-repository-browser, so you can look at it through a kind of PyPI-like interface. There's a big caveat there: we've released it as is. We're not ready for open development yet, and the code isn't, you know, as smooth as I've made it look, let's say. The reason we've not moved to open development just yet is that we can see this ecosystem is crowded, and we're really keen to get feedback and understand the level of interest before we invest in making this more general and capable of long-term maintenance. In particular, we've got some applications, a repository and a browser, and we're especially interested in whether we can get this into the core, for example into some kind of common standards development process. I'm thinking of how successful the packaging project has been at unifying version handling, that kind of thing.

So, to wrap up: in CERN's accelerator sector, Python use is growing massively. We've got a bespoke interface for browsing any PEP 503 repository exposed through HTTP. We've got a highly capable repository service itself, which supports all the modern PEPs and allows you to safely combine internal and external packages. Be aware that priority merging is still vulnerable to internal dependency confusion, and we need some standards to improve that; they're still in draft. And I want to sell this idea of repository components, because it's really interesting that both producers and consumers of repositories can benefit from them. So wouldn't it be great if the standards development process also produced these components? As I say, we're not ready for open development yet, and we want your feedback. There's a repository up there, called simple-repository, in the simple-repository organization. Open an issue; I'm on Twitter; find some way to communicate with me, in the questions as well. I'd like to recognize my colleagues who've also made major contributions here: Ivan, Francesco, Wouter and Christian, and of course the core team who are driving forward the standardization, which I'm not part of: Donald and Paul, and many more, some in this room. And I'll just take this opportunity to highlight that CERN is an amazing place, not just for physicists: software engineers, electronics engineers, take a look. There are opportunities at all levels: students, graduates, staff positions. So please find me, ask me lots of questions, and thank you.

Thank you for an amazing talk. Now we have a couple of minutes for questions, so please come to the microphones in the middle.

Thank you for your talk. I have a question about the namespaces you mentioned. There are Conda channels; did you think about introducing something similar, namespaces? Would it fit the concept, or would it be something that doesn't work at all?

Conda doesn't have namespaces either. You have a channel, you can make a channel, and then you can say: use this channel, then this channel, and you can avoid some of the confusion. Sure, but we have the same in pip, right? We can use multiple indexes with pip. The problem is that the names are flat, and when you've got these indexes, you need some way to say: I want it from that index, or that index.
And, okay, maybe in Conda, actually, if I remember, there is a mechanism, I should say.

It's called a channel: you can make your own private channel and give it a name, and then it's similar to a namespace, in my understanding.

Sure. Thanks.

So I had a question about namespaces, but that's covered, and I have another question. You mentioned that there was some issue with testing, but PyPI has a testing server, so did you not consider it?

I didn't mention testing, I don't think. The testability of deployed packages, maybe? What I meant in that respect was simply that when you've got a package, it's much easier to test than a script, because typically you would write a package such that it can be imported with no runtime behaviour, for example, which makes it much easier to test. It wasn't about repositories per se.

Okay, but do you also have some testing server, like PyPI has, for testing what your package would look like?

We don't, actually. No. At the moment we allow people to upload dev releases as well as actual releases, and we don't differentiate those things. We could easily run an extra index (as you've seen, it's super easy to run an extra index), but we typically haven't had much of a need for it, to be honest.

Okay, can I ask one more question?

Go to the back of the queue, and then if there's time, yes. How does that sound?

You said that you have some experience with conda-forge, and I was asking myself, because in my company we also have some problems with binary dependencies, and you have chosen a wheel approach: how do you handle binary conflicts and the compilation of things that aren't only Python?

Sure. I mean, wheels have done an amazing job. They've taken good ideas from conda-forge, and they've shipped them into wheels. Now, there are some ugly hacks, as far as I'm concerned, that make wheels work: the vendoring, the namespace renaming. It's all a bit, okay, it works, but it's not pleasant. In our case, actually, it's good enough. We ship NumPy, it's fine. We ship TensorFlow, it's fine. Mostly, things seem to work fine using wheels, and we haven't actually needed conda-forge at this point.

Okay, so you also have Qt and other stuff?

Yep. Qt ships as a wheel now.

You mentioned authentication once, in the context of GitLab tokens. Did you do anything more with authentication and authorization, and what kind of challenges did you face with that?

Great question. On the authentication side, we have GitLab, obviously, but we also have an internal authentication mechanism that gives us role-based access on our devices, and we also have the ability to use one of those tokens to upload. It's a bit like an OAuth token, but not quite. That's authentication. On the authorization side, we are very, very keen, although we haven't got it yet, to do per-project authorization. I don't foresee any major hurdles there; we just need to get on and do it. I'm curious about your question now. Do you have some experience here that you'd like to share?

Oh, yeah. Well, I mean, I've faced exactly these problems. I started hacking on something myself, and the authorization story for pip and twine seems to be immature, so...

Yeah, and non-standard as well, right?

Yeah. Thanks.

Thanks. Well, I'm sure we'll reach out offline and have a good conversation.
So, PyPI has terrible search. Did you improve on that?

No. No, I did not. So, the search, basically... that's a good question. The search has an index on the project names, which is easy, because we've got that project list API; then we call the repository for the summary, and we can search on that. But, as I say, it's SQLite-based; it's not using some advanced search engine. So it's okay: it's fast, and it gets you the project names quite quickly. But, yeah, I would happily take any experience of building a search engine.

So, thank you again for the talk. If someone has more questions, you can continue in the open space; I believe the speaker will be there. Okay, so, thank you again.

Thank you.