Oh, well, we're going to let Ernest close us out talking about PyPI. Awesome. Let's make him feel welcome. All right. Can everybody hear me all right? Excellent. So today, I'm going to be closing things out with a talk about running vintage software, specifically PyPI's aging code base. But first, I'd like to say thank you to the organizers, volunteers, sponsors, and you, the attendees of the very first North Bay Python, for sticking around to the very end for me. So I'm Ernest. I'm a software person living and working in Cleveland, Ohio. Since 2012, I've been contributing to the Python Software Foundation's infrastructure team. I'm a founding member of the PSF Packaging Working Group. And I'm excited to be your PyCon chair for 2018 and 2019 in Cleveland, Ohio, May 9th through 17th. The CFP is open till January 3rd, anywhere on Earth. Aside from that, I'm also an avid taco enthusiast. But most recently, many hours and hours of my free time have led me to believe I may be a masochist of a mechanic. And that is because I've made every attempt possible to make this car my daily driver in fair weather. That is my first car. And I still have it, which is kind of cool. It's a 1967 Saab 96, one of the very last two-stroke powered vehicles sold in the United States. I've been servicing, repairing, and improving it since I started driving it in 2003. Driving a vintage car everywhere can be a pain. Things will break. Cash will generally be the easiest solution. Experts and parts may not be readily available. And engineering assumptions from 1960 may not be valid anymore. So is this a lost cause? Should we just scrap the car and give up on it? I don't think so. It's just time to get to work. Oftentimes, you'll just be pulling things apart and cleaning them up. Sometimes it'll be simple telemetry that needs repairing. And once in a while, there's a need for more invasive repairs. So during a recent conversation on Twitter about my Saab, Hynek saw the need to interject.
He posited that perhaps my being wrapped up in vintage Saabs was indicative of why I might be wrapped up in one of my favorite ways to contribute to the Python community. I took the time to consider this carefully and decided that Hynek is not wrong. So let's talk about PyPI. PyPI: it's the Python Package Index, also known lovingly as the Cheese Shop. I've been one of PyPI's caretakers since sometime in 2012. And during that time, I've worked with Donald Stufft, Richard Jones, and a handful of other volunteers to improve PyPI's availability and keep the lights on. Let's briefly talk about a little bit of the history of where PyPI came from and how it got to where it is nowadays. This is sort of the original thing that acted like PyPI in the Python universe. It's called the Vaults of Parnassus, and it's the predecessor. It was a hand-curated page full of links to Python stuff anywhere on the internet. Users could submit links for inclusion, and they'd be added eventually and occasionally fixed whenever a page moved or whatever. But that's it. It even includes a lost and broken links section somewhere on there. There it is. Because this was the early days of the web: hypermedia, glorious Web 1.0. If you're interested, there are still copies of this floating around, including in the Wayback Machine at the Internet Archive. I missed my bullet points. That's OK. So there are some key PEPs in PyPI's history if you're interested in learning more about where it came from. PEP 243, which ultimately wasn't accepted, was effectively a distutils extension that provided a way to use metadata to determine where to go and install packages. PEP 301, which eventually was accepted, was the first iteration of PyPI, which also put in place upload metadata and Trove classifiers for a central repository. So that's pretty much what became of it. Eventually, a PEP was added which set hosting modes.
So initially, PyPI was really just a hypermedia place where you could find links to go install packages, and all was great. Eventually, there became a need to actually host packages to help with that broken link problem. PEP 438 added explicit hosting modes. PEP 470 removed external hosting completely. And now, if you want to install from PyPI, packages have to be hosted there. So PyPI has 120,000 packages, a little more than that, as of a few hours ago. More than 840,000 releases, over a million release files; it accepts 200 million requests per day, serves 40 trillion bytes per day, and has three working tests. So apologies to the Testing Goat and to Julia. Really thinking about it, we probably should have spent more time just adding more tests instead of some of the other things that got built. So to summarize, that's some of the stuff. And of course, pretty charts. It's seen incredible growth over the past four and a half years. These are all the stats that are available from our CDN provider; as of now it's the only way we can really go back and determine how much we've done. But 121 billion requests and 15.1 petabytes; we're pretty much passing about a petabyte a month now. So where did this code base come from? The initial commit to version control was on November 1, 2002, and the first released package was just a few days later, on November 6, 2002. This is vintage software. Remember how I said that initially PyPI was just a metadata server with a basic distutils plugin? Since then, it's sprouted a plethora of features. Interestingly, though, most of the basic parts of the code base have lived on and been extended or modified to build these new features. On top of that, it's built on a one-off WSGI implementation, which might sound really bizarre to us nowadays. But in 2002, the choices were incredibly limited if you wanted to do a web server with Python. OK, so trying to operate vintage software every day can be a pain. Things will break.
Caching, honestly, will generally be the easiest solution. Experts and context may not be readily available. Engineering assumptions from 2002 may not be valid anymore. So briefly, we'll walk over a couple of the places where I draw these allegories. First, things breaking. This is a GIF of all the tweets containing "PyPI down" since 2009. Things broke a lot, regularly break, and will continue to break. This is the nature of the web. Fortunately, though, that is trending down over time, which is really good and exciting. But tools were built that expected PyPI to be available, and they serve crucial functions, like continuous integration testing, deployments, and developer workflows. So I guess it's OK to get a little grumpy on Twitter occasionally when that service goes away. So, some things that were initially done. Mirrors were added; they were community hosted and user initiated, and client support was added to pip to say: oh, well, if PyPI is down, I'll install from one of the mirrors. There are some downsides to this, like handing out a subdomain on python.org to more or less anybody who offers to host a mirror, which can have some security implications. So mirrors were added. This is now not really as recommended, but you can still use mirrors for all sorts of good purposes, and pip still supports them. Caching is generally the easiest solution. One of the single biggest impacts on PyPI's availability and stability has been the addition of Fastly as a CDN. That has added tons of speed for installs, increased reliability, and reduced client complexity. If the cache is serving most of what you need anyway, you don't necessarily have to put in that magic mirror URL whenever things go awry with the main PyPI. I love Fastly. Again, it's the only thing that makes PyPI possible. So, round of applause for Fastly.
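To make the CDN idea concrete, here's a minimal sketch of how an application can mark its responses as cacheable at the edge. The function name and parameters are hypothetical, not PyPI's actual code; Cache-Control is the standard HTTP caching header, and Surrogate-Key is a Fastly-specific header used to purge groups of cached pages at once.

```python
def cacheable_headers(max_age=86400, surrogate_keys=()):
    """Build response headers that let a CDN such as Fastly cache a page.

    max_age: seconds the edge (and browsers) may serve the cached copy.
    surrogate_keys: Fastly-specific tags, so a deploy or data change can
    purge every page tagged with a given key in one API call.
    """
    headers = [("Cache-Control", f"max-age={max_age}, public")]
    if surrogate_keys:
        # Fastly strips this header before responses reach clients.
        headers.append(("Surrogate-Key", " ".join(surrogate_keys)))
    return headers

# e.g. a project detail page, purgeable whenever the project changes:
headers = cacheable_headers(max_age=60, surrogate_keys=("project/pip",))
```

The appeal of this approach is exactly what the talk describes: the caching logic is bolt-on, lives in standard HTTP headers, and the origin servers never see the repeat traffic.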
So there have been some pretty major outages, but this is the one that stands out to me the most. This was sort of the turning point when I think everybody who was involved in keeping PyPI online realized that things definitely needed to change. There was one VM, manually managed on some bespoke infrastructure, and 100% volunteer supported. And DRBD, which is a block replication protocol, got really out of sync. I think PyPI was down for a pretty good while while we tried to sort out precisely how to put things back together. Eventually, we had to say: we're going to take a more deliberate approach to making this highly available. But that basically required collecting all of the context that had been manually put into that server before we could reimplement it in a more reliable setup. That included end-to-end configuration management, highly available storage for the packages and documentation hosting, a highly available database, multiple web servers, which is kind of nice, and a hosting provider with an SLA. Engineering assumptions from 2002 may not be valid anymore. Luckily, most of what I'm about to talk about has been removed at this point. But most of it came in with really good intentions, working around issues with what was available at the time. Again, the custom WSGI framework: if you're not familiar with WSGI, it's the Web Server Gateway Interface. It's a protocol, defined in a PEP, that basically says: this is how your web server can talk to Python. Again, back then there weren't a whole lot of options. SSH uploads: we're kind of in a golden era of TLS and HTTPS nowadays, with Let's Encrypt and other progress since that era, but SSH uploads were a way to work around the fact that HTTPS wasn't everywhere yet. Those have also been removed.
Because giving somebody a shell, no matter how locked down, is still pretty risky just to upload a package. Real-time download counts, which were less truthful than you might have imagined, and also pretty fragile; any real-time infrastructure is costly. An OpenID provider, an OAuth provider, and some relatively broken cryptography, some DSA signing, which wasn't broken when it was added, but is today. So is it a lost cause? Do we just shut PyPI down and install packages from, I don't know, I guess a thumb drive or something? We should definitely shut down the old PyPI. But we also have to continue to support the community in having a place to serve and share their packages. One of the biggest drawbacks to the current PyPI code base is that onboarding new developers and contributors can be a huge pain. The dependencies, database schema, and custom web framework cause more problems than they solve for new work. So that's where Warehouse comes in. Warehouse is an effort to fully rewrite PyPI in modern Python, on modern frameworks, with extensive test coverage, documentation, and modern best practices. It's built to use the caching that's available to us to the best of its ability. There's been a tireless effort by Donald Stufft and others to drive the development of this new code base, but things don't happen overnight. As a matter of fact, Warehouse has been in progress for more than two years. Quick sidebar on Warehouse, which is really exciting. I'm very excited to share with you that the PSF Packaging Working Group has secured a grant from Mozilla's Open Source Support grant program, and also to announce that as part of that, myself and four others are going to be starting this week on pushing Warehouse across the finish line to deployment and shutting down the old PyPI, which is pretty exciting.
But in the interim, there's remained a need to keep PyPI available for the community. So I'm going to talk really briefly about some of the techniques that have made that possible. We're going to get through as many of these as we can before Chris comes and pushes me off the stage. But a quick warning: think concepts, not specific implementations. There's going to be a lot of code on screen. You don't really need to dig into it and try to read it; I'll highlight it as we go. And the goals here are slightly different than your average production system. The assumption was always that we're going to be throwing this away, so some decisions were made that were maybe a little more ad hoc than they needed to be. For some details, I'm going to have, oh, they're cut off, bummer, some reference material to other talks you can go check out that have, in some cases, even driven me to make some of these changes. Overall, the capacity and stability patterns talk from Brian Pitzipi in 2017 covers some of these approaches more in depth. The single biggest enabler of effective change for PyPI over the past few years, especially given limited time and resources, has been metrics. Again, there's a great talk from Hynek at PyCon 2016 on getting instrumented, talking about the ways that metrics can provide value in your own software. Metrics are all about answering questions. Often you'll see a focus initially on system metrics like CPU, memory, and disk, which are fantastic. But there's been a welcome trend over the last few years in tech towards application-level instrumentation. System metrics are great for answering questions like: what's slowing down or breaking, and did my last change help or hurt performance? Application-level metrics can answer questions like: where are my efforts best spent, and what features may no longer be worth supporting? So, a quick example.
Here we have a pretty straightforward function called search. It does something pretty simple: it takes some parameters and hands them to another module that gives us our search results. So we're going to instrument it. It's pretty straightforward. We can import our library of choice, and most libraries for this sort of thing have decorators, which is really helpful. And with two lines of code, this gets us a bunch of metrics that we can use to measure the performance of that function call. We can get the raw count per bucket of time. We can get it as a rate. And we can also get timing stats. And with a little bit of math, we can combine those to get the overall time consumed by a given function. What's helpful here is we can look at this and say: right now, the overall time consumed on our backends is definitely dominated by this one. So that would be a good place to work. If we can reduce the overall amount of time that it takes, or reduce the number of calls, we can improve how hard our backends are working. Another example; this is another search. Searching is expensive, it turns out. So we're going to instrument this by adding a couple of things. We could have done it with a context manager, but for simplicity's sake, we just measured the start time and end time and sent off a timing metric, with how long it took, to our metrics endpoint. Again, we get similar metrics out of that. And you'll also note that along the way, we incremented a counter for how often this request timed out, which can help us tune that timeout. The advantage here is that we can look at this, the number of timeouts over the past day, I guess, and we can see there's a big spike here, but overall we're not timing out too many requests relative to what's going on.
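The decorator pattern described above can be sketched in a few lines. This is a self-contained illustration, not PyPI's actual code: the `Metrics` class is a stand-in for a statsd-style client (a real deployment would use a library such as statsd or the Datadog client), and the `search` body is a placeholder for the real query.

```python
import functools
import time
from collections import defaultdict

class Metrics:
    """Minimal stand-in for a statsd-style metrics client (hypothetical)."""

    def __init__(self):
        self.counts = defaultdict(int)      # counter name -> count
        self.timings = defaultdict(list)    # timer name -> list of ms

    def increment(self, name):
        self.counts[name] += 1

    def timing(self, name, ms):
        self.timings[name].append(ms)

    def timed(self, name):
        """Decorator: count calls and record wall-clock time per call."""
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return fn(*args, **kwargs)
                finally:
                    # Record even when the call raises, so error paths count too.
                    self.increment(name + ".count")
                    self.timing(name, (time.perf_counter() - start) * 1000.0)
            return wrapper
        return decorator

metrics = Metrics()

@metrics.timed("web.search")
def search(terms, operator="and"):
    # Placeholder: imagine this hitting the database or search backend.
    return list(terms)
```

With just the decorator line added to an existing function, you get call counts (and therefore rates) plus per-call timings, which is exactly the raw material for the "overall time consumed" charts mentioned above.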
So that can help us tune the timeout and make sure that users aren't affected by our guarding ourselves. A quick side note on errors: we use Sentry for collecting errors. There are a handful of other services that do this, but it's super helpful, particularly in a low-test-coverage environment, to have a stack trace and pretty much all of your context in one place. You can determine pretty quickly when something bizarre happened, and optimally figure out a fix without having to go into too much craziness. That slide didn't turn out very well, but you get your stack trace, and down here you'll get context on it. The next thing is making tactical updates. In general, we're serving a known service, and changes to the way that it functions can affect users in a very negative fashion. So when you're going to change something particularly large, you want to be as tactical as possible and think before you jump. This is the search method; don't focus too much on what's going on here, just get a feeling for how big it is. This is the thing that was powering search on the website, whenever you'd search on pypi.python.org or from the XML-RPC API, and this is the SQL that it actually generated. That's one, two, three, four, five, six different selects over tables that just keep getting longer and longer and longer, remember all those releases and projects, and so eventually the backend Postgres servers start to look something like that. So, enter Elasticsearch. You know, for search. All of these are pretty naive LIKE queries, effectively saying: just look for any column that looks like this string.
And this is the Elasticsearch code that replaced it. In a pretty similar amount of code, we were able to move to Elasticsearch rather than using those naive LIKE statements in Postgres, and that offered us a huge speedup in how long it takes to render a search. We went from a second to about 250 milliseconds, and that adds up very quickly given how many searches are coming into PyPI on a regular basis. And here's how many 503s we were serving; when we deployed it, it went from a lot to very few, which is nice. It was mostly timeouts. So, caching. Caching is cheating, and cheating is winning, and Matt Romano gives a really great talk on cheating your way to web scale with HTTP caching. If you're not using it, it's good to look into for anything taking lots of HTTP requests. It's built into HTTP with the Cache-Control header. There are tons of options, and it's generally bolt-on: get a CDN, maybe one like Fastly. For caching internal stuff, though, you can't cheat as well. You're generally going to have your own custom logic and your own custom things; you can't just lean on the HTTP caching protocol. The changes I'm about to talk about were directly influenced by hearing a talk from Lindsay Rockman at PyOhio in 2014 called "Making API Calls Wicked Fast with Redis," so that's what I did. Here is some Puppet source code that calls pypi.python.org using XML-RPC. If you're not familiar with XML-RPC, it's an old-school RPC protocol that uses XML, and here it's being used to get the available packages for a given Python package. This is a valid XML-RPC request against PyPI, and there's only one problem here, and it's not the XML. XML is kind of rad. It's this: POST.
If you've ever done caching, you'll know that POSTs are generally more difficult to introspect, and particularly given this is XML we'd have to parse to determine what the request wants, it just adds a lot of complexity to try to cache this at the edge. Instead, we'll cache it internally. Also, there's nothing really wrong with what's happening here, except that when this was first written there might have been five or six or seven servers making that call, and now we're in the days where the number of servers calling it might look like this. So it tended to cause lots of problems for PyPI. So, for caching internal stuff, if you will: this is the package_releases method that was being called, and I literally did it with a decorator, which was kind of fun. Sometimes it can be as simple as using another library that's available and applying a decorator that's tested and vetted and well described on its own. And this is the cache efficiency of that. It's not super great, but you can see that when you get really big spikes, like a cluster spinning up and all requesting that same data at once, these are hits. Most of those got absorbed by the cache rather than going through and causing a slow request to the database. Next up is rate limiting. For a lot of great information on rate limiting, Stripe wrote a blog post about it that you can check out, with all sorts of different patterns for managing this. There are two different types that I'm going to talk about: traffic spikes from one client, and misbehaving or malicious scripts. Malicious ones, oh, I'm not gonna talk about those. So, spikes from one client. Cron is kind of evil. We have had a lot of scenarios where the only thing causing a nightly five-minute outage of PyPI is some huge data center that's got cron jobs all set to run at the same exact second and make an expensive request.
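The internal-caching decorator mentioned above can be sketched like this. This is an illustration, not PyPI's actual code: a plain dict stands in for Redis (a real deployment would use redis-py with SETEX/GET and serialized values), and the function and key names are hypothetical.

```python
import functools
import time

def cached(store, ttl, key_prefix):
    """Memoize a function's result for `ttl` seconds.

    `store` is a plain dict here, standing in for Redis; entries map a
    key tuple to (value, expiry). Expired entries are simply recomputed
    and overwritten on the next call.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            key = (key_prefix,) + args
            hit = store.get(key)
            if hit is not None:
                value, expires = hit
                if expires > time.monotonic():
                    return value  # cache hit: skip the slow call entirely
            value = fn(*args)
            store[key] = (value, time.monotonic() + ttl)
            return value
        return wrapper
    return decorator

cache = {}
backend_calls = []  # tracks how often we actually hit the "database"

@cached(cache, ttl=60, key_prefix="package_releases")
def package_releases(name):
    backend_calls.append(name)  # stands in for a slow database query
    return ["1.0", "1.1"]
```

When a cluster spins up and a thousand machines ask for the same package's releases within the TTL window, only the first call reaches the database; the rest are absorbed by the cache, which is exactly the hit pattern in the efficiency chart.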
So, spikes from one client: they generally come from a small number of public IP addresses, so you're able to rate limit them based on just where the traffic is coming from, via IP. Misbehaving scripts are also a common case where tons and tons of requests come at you at once. It's somebody who's trying to do something interesting, like pull down all of the package metadata and make pretty charts, but occasionally it hurts us as much as it hurts their poor CPU. Rate limiting for the single client is actually pretty straightforward, and if you're using Nginx, and I think other edge HTTP servers generally have something like this, Nginx makes it really straightforward to rate limit; it doesn't actually say here what this is keyed on, but I'd guess it's by IP by default. Rate limiting within your application can also be very beneficial. This is throttling concurrent requests to those expensive XML-RPC endpoints, and that's done with just a handful of lines scattered throughout here. The big one being: we build a context manager that we can wrap around a function, and it yields whether or not the request is throttled. We have another method that basically takes the remote address and determines the answer for us. We catch some exceptions, because who wants to deal with those, and then we also emit lots of metrics so that we can determine when people are being rate limited and when we might need to look a little closer. So this is what rate limiting looks like from the metrics side. We have lots of people who are invoking it, but very infrequently are we actually enforcing rate limiting.
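The throttling context manager described above can be sketched as follows. Again, this is a hypothetical illustration, not PyPI's actual code: a dict tracks in-flight requests per client address where a real deployment would use Redis counters, and the class and method names are invented for the example.

```python
import contextlib
from collections import defaultdict

class Throttle:
    """Concurrency throttle for expensive endpoints (illustrative sketch).

    Yields whether the current request is over the per-client limit; the
    caller decides what to do (e.g. return a 429). Errors talking to the
    counter backend fail open, so a Redis blip never rejects every request.
    """

    def __init__(self, limit):
        self.limit = limit
        self.in_flight = defaultdict(int)   # remote addr -> concurrent requests
        self.metrics = defaultdict(int)     # stand-in for a statsd client

    @contextlib.contextmanager
    def acquire(self, remote_addr):
        throttled = False
        try:
            try:
                self.in_flight[remote_addr] += 1
                throttled = self.in_flight[remote_addr] > self.limit
            except Exception:
                # Fail open: if the backend errors, let the request through
                # and count the error so we can look closer later.
                self.metrics["throttle.error"] += 1
                throttled = False
            key = "throttle.enforced" if throttled else "throttle.allowed"
            self.metrics[key] += 1
            yield throttled
        finally:
            self.in_flight[remote_addr] = max(0, self.in_flight[remote_addr] - 1)

# Usage inside a request handler might look like:
# with throttle.acquire(request.remote_addr) as throttled:
#     if throttled:
#         return error_response(429)
#     return do_expensive_query()
```

The metrics counters here are what produce the charts mentioned above: lots of "allowed" invocations, a small number of "enforced" ones, and a separate error count that confirms the fail-open path is rarely hit.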
And again, the reason we had those exceptions: these are the number of exceptions that were raised during that rate limiting, and you want to fail open, or fail safely. So if Redis has a blip or a problem, it's better to just say, OK, go ahead and get through, than to cause every request to start failing. And, nope, we're gonna skip that. Oh, this is a good one: removing features. If you have an old legacy code base, delete as much stuff as you can. Deleting code is fun, it reduces your security footprint, and it also reduces the scope for the new software that's replacing it; migration strategies are always easier when you have less code and fewer features. And I'm always mad at Alex because he seems to find these giant blobs of code to delete. Thank you, Alex Gaynor. Simplifying architecture: this is one that also helps with migration strategy. Working on PyPI can occasionally feel like I'm running through a crazy murder-mystery mansion, trying to figure out what's going on and how it all talks to each other. Sometimes you don't need all of that stuff that you've had for all these years, and you can simplify and go to a more modest solution that will suffice. Some of the stuff we've done has been to move file storage from the clustered storage file system that we were running before to Amazon S3. This has the added benefit of being accessible wherever we want to run the new code base, in whatever way we want, without having to run it in the same infrastructure. Taking down the live download counts and relying on Google BigQuery has allowed us to dramatically reduce the complexity and the real-time nature of the code base.
Moving from self-hosted Postgres to Heroku Postgres has been hugely great, because I'm not a world-class DBA and they seem to know what they're doing. And moving from home-rolled StatsD and Graphite to Datadog has been an improvement in just how quickly we can get feedback on what's going on. So, to review: metrics, metrics, metrics; tactical updates; caching; removing features; and simplifying architecture are some of the things you can do to make this a little bit easier. But that's not really why I gave the talk. I gave the talk because I want to talk to you about contributing and taking part in this effort. PyPI is a pretty big deal, and those of us who work professionally with Python use it on a daily basis. For our newest community members, one of the first things they'll do is pip install something. So having it be available, and having it be something its community supports, is a great thing. If you're not already, you can start using pypi.org. It's not done in totality, so you can't use it for everything you're used to, but you can use it for searching and viewing packages. You can check out Warehouse on GitHub. This is where most active development has been going on for some time and will be ramping up soon. There are issues there that are tagged, generally with front-end or back-end, and optimally with the level of experience required. So hopefully there's something you can find to contribute right up front. warehouse.readthedocs.io has documentation on working on Warehouse. And from there, you can also file issues back on the GitHub repository if you find some documentation that's confusing or alarming.
When you're using the Python packaging ecosystem and you run into a bizarre problem, or something that's confusing or doesn't make sense, you can contribute by just filing that issue at the packaging-problems meta tracker; that'll give us a better idea of the way things are being used and how we can improve them moving forward. And finally, if you have more money than time, you can also donate to the Packaging Working Group directly at donate.pypi.io. And last but not least, thank you. And if you'd like my slides, they're there. Thanks, Ernest. Thank you. Thank you very much.