Well, actually, it's just after lunchtime, so hopefully we'll all be eating soon. This talk is titled "API Performance: A Journey Down the Rabbit Hole," because that's what this felt like as I was digging in, looking, finding, and going, oh, wow, because honestly I learned a lot.

So who am I? My name is Julia Kreger. I'm a senior principal software engineer at Red Hat. I've been contributing to OpenStack since 2014. Three of those years I served as the project team leader for the Ironic project, and this talk is largely Ironic-specific, but a lot of the models and interactions and layers that we've built in OpenStack apply across all the projects. So hopefully we'll spread some knowledge, spread some context, and ultimately maybe fix some of these issues that exist in some of these APIs. I'm also one of the elected board members and the chair of the foundation right now, so if you have something to discuss, I'm happy to talk.

So what will we be covering? First, what did we notice? How did we get there? Why API performance can suffer the way it did, what we did in Ironic to make this better, and how to get the most out of your queries.

So what did we notice? One day on internet relay chat, I just ran a quick test: I'm getting 96 bare metal nodes per second. I guess that's not horrible. An operator chimed in: we're getting 250 to 300 on this release. My response was "ruh-roh," and I felt like my corgi.

So how did we get there? And really I'm talking about that horrible performance of 96 bare metal nodes a second being returned. Initially, our table started with only 20 columns and had only two indexes. As the project grew, we ended up with 56 columns and still only five indexes, until we really took the time to dig into the performance issues. And none of those indexes were intentional; they were artifacts of the relational schema rather than deliberate tuning. As of the Yoga release, we now have 58 columns, and 11 of those columns have specific indexes to match query patterns that are commonly used.

One of the very first things we thought must be the problem was cross-version object compatibility. The way this generally works is it provides a framework for you to say, give me this version of the object, and the object model should have logic to go, oh, 1.36 drops this field and has this field that was transformed this way, and I can support this upgrade pattern. It turns out not many people do this. But that compatibility comes at a cost; it's yet another layer of additional checks and transformations on your data. And it looks like this: you start with a database response. That turns into a database client object response, which gets you an object in Python, a SQLAlchemy result set. Then we copy that data into a versioned object, which is bolted together with your versioned-object logic. Then, for lack of a better term, you copy this into an RPC object at times. And then if someone says, give me 1.34's object, well, you hopefully have the logic there to make the necessary transformations, otherwise the fields will be empty or whatever. The key point I'm trying to get at is that we're making many copies, many times.
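To make that copy chain concrete, here is a minimal, self-contained sketch of the pattern. This is illustrative only, not Ironic's actual code; Ironic uses oslo.versionedobjects for this, and the class, field names, and version numbers below are invented for the example:

```python
# Illustrative sketch of cross-version object compatibility and the copy
# chain it implies. NOT Ironic's actual code; Ironic uses
# oslo.versionedobjects, which follows this same general shape.

class Node:
    VERSION = (1, 35)

    def __init__(self, db_row):
        # Copy #1: SQLAlchemy result row -> versioned object fields.
        self.fields = dict(db_row)

    def obj_make_compatible(self, target_version):
        # Copy #2: transform the fields to match an older object version.
        data = dict(self.fields)
        if target_version < (1, 35):
            # Hypothetical example: say 1.35 added 'lessee'; drop it so
            # older consumers see a schema they understand.
            data.pop('lessee', None)
        return data


db_row = {'uuid': 'abc', 'power_state': 'power on', 'lessee': 'project-x'}
node = Node(db_row)                          # one copy
older = node.obj_make_compatible((1, 34))    # another copy
print(older)  # {'uuid': 'abc', 'power_state': 'power on'}
```

Every layer in the real stack does a copy along these lines, which is how the field count and the row count multiply into real time.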
But it wasn't all just the in-memory copies. I thought it was. I was totally wrong. We ended up improving security with role-based access control, and with this, we added a huge amount of greater granularity on the API queries. We allowed people to filter the data, and we also filtered the results based upon the actual RBAC checks. But the RBAC checks themselves have quite a bit of complexity. And currently, because we are supporting a transition period between new rules and old rules, we're actually doing double the work: we're looking at both sides of the equation, the old and the new rules. In Ironic 17, we had seven checks per node returned by the API, which meant any time you asked for a thousand nodes from the API, you ended up with 7,002 RBAC policy checks. That actually gets even worse, to 14,004, because every single check gets executed twice until we remove the deprecated policy checks. We had to front-load our policy checking, cache the decision, and pass it further down, so that we did all of our logic up front, figured out our result state, and then, as we were working with the data, made decisions from there.

But of course, I thought, eureka, it's the RBAC policies. I found the culprit, everything's better. But I only got 242.922 nodes a second. Not quite what the operator was getting on an older release. I thought, okay, let's keep looking.

And it turns out we were doing a ton of unnecessary extra work: those in-memory copies, deep inside of loops, and for all the fields. When we look at an actual query response, we go from a database response to an object, in this case a node object, to an API object, which is basically a JSON dictionary. Then we make a copy of those contents and start stripping out all the data we don't need, and then send that to the user. So if you say, I only want five fields, we're taking all 58 fields and, at the very end of the process, after all the loops are done at the lower levels, pulling out only the five you asked for.

We had to teach ourselves to be selective with what we queried. Specifically, we had to pass the RBAC context all the way down, so that we understand the context of the query, and teach our query logic so that we don't need to populate every single field every time. So we changed it to be a lighter-weight database response, to a lighter-weight object, to another lighter-weight object, to a smaller API object, to a much smaller response that ships to the user. And if you think about it, if you're paginating through data, this starts adding a lot of time to your query process.

To give you an idea of how much time this was taking when I was benchmarking: it was taking 1.9 seconds just to get that first database object out of the database client, 4.1 seconds for that to be copied into the node object, 7.9 seconds for the API object, 9.9 to get to that JSON dictionary where we start stripping everything out, and then 11 seconds to get to the final result that shipped to the user, for 1,000 nodes. When we stripped it down and taught the lower levels what to query by, we ended up with 0.3 seconds to get the 1,000-node database result, 0.5 seconds to get our actual list of node objects, 0.9 to get to the API object, 1.76 to the JSON dictionary, and ultimately 1.96 to get it out the door. We ended up with a staggering, mind-blowing number of 540 nodes per second. We started at 96. And I'm not kidding.
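To make the "teach the lower levels what to query" idea concrete, here is a minimal SQLAlchemy sketch; again, this is illustrative rather than Ironic's actual code, and the model and fields are invented:

```python
# Minimal sketch (not Ironic's code) of pushing field selection down to
# the database layer instead of fetching full rows and stripping later.
from sqlalchemy import Column, Integer, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base, load_only

Base = declarative_base()

class Node(Base):
    __tablename__ = 'nodes'
    id = Column(Integer, primary_key=True)
    uuid = Column(String(36), index=True)   # indexed: cheap to filter on
    power_state = Column(String(16))
    driver_info = Column(String)            # large blob we often don't need

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Node(uuid='abc', power_state='power on', driver_info='{...}'))
    session.commit()

    # Before: SELECT every column, then throw most of it away in the API.
    full = session.execute(select(Node)).scalars().all()

    # After: only ask the database for the fields the user requested.
    wanted = ['uuid', 'power_state']
    query = select(Node).options(
        load_only(*[getattr(Node, f) for f in wanted]))
    slim = session.execute(query).scalars().all()
    print(slim[0].uuid, slim[0].power_state)
```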
We saw a 560% performance gain in CI with our database interactions, with our queries. Just purely staggering. And this is an image provided by CERN; Arne, one of our Ironic contributors, tweeted it. When you look at it, you can see on the left that most of their queries were taking about 2.5 to 5 seconds, with some lower-level queries taking much less time. And on the right, you can see the same basic bands, but now we're far under 500 milliseconds. And if you think about working with 1,000, 10,000, 100,000 entries, this really starts to add up as overhead, because you're just waiting. The end user, the requester, is just waiting.

But the real question we then had to answer is: how did we really get to this point? Because this was quite the effort to dig and research and find these issues. And the long story short is, we failed to measure. We didn't look at the performance impact of changes. We didn't look at the performance impact as time progressed. And we also never sat down and asked, what would happen if we put 10,000 or 100,000 entries in and then tried to run the same queries? That experience changes dramatically when you do.

So I'm sure some of you are operators, and there are things you can do to improve your experience. If you're writing a script or interacting with one of the OpenStack APIs, reuse your authentication token. Don't use Bash shell scripting to invoke the OpenStack command-line client over and over and over again; you're adding the overhead of authentication to that entire interaction, because every single time you're re-authenticating. The token that you generate once can be used many times until it expires. If you can, you should consider limiting your fields. And if you are doing queries, you want to query against something that has a database index, and you want to order your query terms by how much the index narrows the results. I know that's probably a difficult concept to wrap your head around, but if you know a field is indexed, power state or something along those lines, and you know that filter alone will quickly narrow you down to 30% of your result set, then make that the first term in your query. The database will be much happier that way.

And if you can't limit your fields, I highly recommend filing bugs against projects. Currently, Neutron and Ironic can both do user-selectable fields. Nova, Glance, and Cinder never taught their APIs how to do field-level filtering on queries. The other thing you need to be mindful of is pagination. There are huge amounts of latency with repetition, because you start adding in not just the time to prepare your response, but the time in transit and the time for your client to process it. And when you send the next query back, it's not like you have a database cursor sitting and waiting; you hit another thread or another API service, and your query gets executed again.

And if you have specific concerns with API performance, you really need to go to the developers with numbers, with statistics. They do not read minds. They don't quite understand the entire picture unless you lay out the operational picture and the performance profile you're seeing. The more information you can provide, the better. And if you think there's a better way to interact with the service, propose it, but prepare for a dialogue, because we've built these APIs to be stable.
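As a sketch of what that operator advice looks like in practice: with openstacksdk you authenticate once and reuse the connection, instead of paying the authentication round trip on every CLI invocation. The cloud name here is an assumption about your local clouds.yaml, and the field filtering shown is the Ironic-style fields support mentioned above:

```python
# Hedged sketch: authenticate once, reuse the session, and limit fields.
# Assumes a 'mycloud' entry exists in your clouds.yaml.
import openstack

conn = openstack.connect(cloud='mycloud')  # one authentication, one token

# Every call below reuses the same token until it expires, unlike a shell
# loop that re-authenticates per `openstack` CLI invocation. Asking only
# for the fields we need keeps work off both the API and the wire.
for node in conn.baremetal.nodes(fields=['uuid', 'power_state']):
    print(node.id, node.power_state)
```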
And if you're a developer in the room, please sit down and identify your most common interactions. Measure the performance. Look for inefficiency. Is there a database index I can add that will make this even faster? A tiny increase in performance might not seem like a huge impact to the user, but if that user is executing that query 100 times an hour, or 1,000 times an hour, it starts adding up pretty quickly. Do your testing with thousands of entries, preferably tens of thousands. You will see your loops take longer and longer the more entries you have in your database. Time each step going from the database outward. And then do this in a CI job. It doesn't have to be voting, and you don't really need to save the data; the logs themselves give you the reference. So if you know that two weeks ago, or a month ago, or last year, you were doing 500 records a second and you're now doing 400, there might be a problem there, and you might need to go investigate. In Ironic, we actually have this as a non-voting job. We just have the logs, plus an Etherpad where we occasionally save the results so that we know where we are.

And if your API doesn't support the syntax of fields=x,y,z or whatever, add the support. And please wire it all the way down to your database API, so that you are not requesting an entire object from the database. The more work you push down to the database, the faster the overall interaction and the better the experience for your end user. And if you have spare time? Anyone? Spare time? Okay. The CLI tools that are out there could use some help.

So, are there any questions from anyone? Sure. So the question is, what tools did I use to measure the time between each step? Because of the way our code is built, I actually ended up starting with adding print statements in the code, executing basic tasks, and looking at the logs going, oh, that's what's happening. Partially I was doing that because I was trying to figure out where the bottlenecks were and get a feeling for the layer they were in. After that I went, okay, I'm writing something really simple that calls our internal code, says give me this, give me this, and just times it. And that's it.

So the question is whether we use Rally or something else. We do not use Rally. With bare metal systems specifically, we've seen huge performance variations across different cloud providers. Some run in operating modes that don't have nested virtualization enabled, which drastically impacts our testing performance, and we don't really want to have to deal with, oh, the job failed because we ran on the cloud that's a little slower this week. So it's something we've kind of avoided. For the CI job that we have, we built a tiny little script to just print the data out, so it's in the log and we can go look and see what it was. Nothing fancy, incredibly simple. It took all of a couple of hours to wire together.

Yes. The question was, is this in the Ironic repo? Yes, there's a benchmark folder under our tools directory.
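The actual script lives in that tools directory; as a rough illustration of the approach, a per-layer timing harness can be as small as this (the layer functions are stand-ins, not Ironic's internals):

```python
# Minimal sketch of a per-layer timing harness that prints numbers a CI
# log can be compared against over time. Stand-in functions, not Ironic's.
import json
import time

def timed(label, func, *args):
    """Run func, print its wall-clock duration, and return its result."""
    start = time.monotonic()
    result = func(*args)
    print(f'{label}: {time.monotonic() - start:.3f}s')
    return result

def fetch_rows(limit):
    # Stand-in for the database query layer.
    return [{'uuid': str(i), 'power_state': 'power off'} for i in range(limit)]

def to_objects(rows):
    # Stand-in for the node-object copy layer.
    return [dict(row) for row in rows]

rows = timed('db query', fetch_rows, 10_000)  # test with thousands of entries
objs = timed('object layer', to_objects, rows)
body = timed('serialize', json.dumps, objs)
```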
Yes. Can you state that one more time? So the question was, how do we avoid the root of all evil, over-optimizing our queries up front, versus actually addressing the issues? Hopefully I'm paraphrasing correctly. We actually started with our operator reports, going, oh, that's what's happening, wrapped our heads around it, and kept digging. And we did find that, well, this join is inefficient; maybe if we tweak it a little bit, it might be better. And there we tried to over-optimize and over-think it. We actually had to go get SQLAlchemy experts, who said, ah, you actually want to do this, because SQLAlchemy knows how to do this better than you do. It was not something we thought of, and it's really hard to find in the documentation. So I would recommend finding a SQLAlchemy expert if you're having to dig into things like joins.
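As one example of the kind of thing such an expert points out, again as an invented sketch rather than Ironic's actual schema: declaring the relationship and choosing a loader strategy can beat a hand-written join, because SQLAlchemy can batch the related rows into a single IN query:

```python
# Hedged sketch (not Ironic's code): let SQLAlchemy pick the join
# strategy. selectinload() emits one query for the parent rows and one
# IN-clause query for all the children, often cheaper than a wide join.
from sqlalchemy import (Column, ForeignKey, Integer, String, create_engine,
                        select)
from sqlalchemy.orm import (Session, declarative_base, relationship,
                            selectinload)

Base = declarative_base()

class Node(Base):
    __tablename__ = 'nodes'
    id = Column(Integer, primary_key=True)
    uuid = Column(String(36))
    traits = relationship('Trait')

class Trait(Base):
    __tablename__ = 'traits'
    id = Column(Integer, primary_key=True)
    node_id = Column(Integer, ForeignKey('nodes.id'), index=True)
    name = Column(String(255))

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Node(uuid='abc', traits=[Trait(name='CUSTOM_GPU')]))
    session.commit()

    # Load nodes and their traits without a hand-written join.
    nodes = session.execute(
        select(Node).options(selectinload(Node.traits))).scalars().all()
    print(nodes[0].uuid, [t.name for t in nodes[0].traits])
```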
So the question is, and I'm going to paraphrase, what can apply to other projects, and how can that be raised up in the stack and identified? Yes. So, as a user, send it to your vendor or the community. The way this ends up working at the lowest level, and this is across many OpenStack projects, is that we have a common model of database interaction: we use SQLAlchemy through oslo.db, and we build queries on top of that. And it's very easy for us to forget indexes. Every operator is going to be a little different, and it might be that you are the 1% executing this query this particular way, and it has a huge impact for you versus everyone else. It might make a lot of sense for you to add the index yourself if you see that. But if you still see performance issues beyond that, outside of your slow query logs, then you probably need to start looking at how long it takes to get the end result. Because we've repeated this pattern so much that, with the knowledge I've hopefully conveyed to everyone in this room, we should at least be able to go, oh, we know where to look now. It's all the layers we've built to abstract and protect and enable that ended up impacting us negatively. We just didn't think about it, because we weren't really measuring it. So at the end of the day, if you could create or already have 1,000 entries, which is the common maximum pagination size, time the query, and maybe provide insight into what an average record looks like, I would provide that to the community or your vendor, because they should be able to take that and move forward from there.

There was a hand up front at one point. No? Okay. Yes?

Yes, most likely yes. The problem is that a lot of the additional checks, for us, were in our API layer right before we were shipping the response out. So I don't know how much that would apply to other projects, which is kind of why I wanted to present this.

At the end, I think it was a speaker from Switzerland who mentioned that database performance was so poor with OpenStack, and I think it's just getting worse. In recent years, I've observed that with Kubernetes, Gardener, and so on, there is constant API traffic running all the time. I think it would take a new project, or new standards adopted by every project; I don't think any single project can solve it on its own.

So the question is, do I think we could solve this as a single project or single entity, or do we need to spread this into a community context and make it a wider effort? My personal opinion is that every single project is going to have specific challenges, and if we were to actually try to move this forward, it would need to be a TC goal, something every project looks for. The problem is that when you start talking about pain points or user experience issues, everyone has their own greatest pain point, and they usually don't want to talk or listen about someone else's most painful point.

But if we do have patterns where we're constantly hitting the database, we need to optimize for that and be aware of that use case. And I think one thing operators can definitely do is go: here's the query that I'm running, here's what I'm seeing in the database, here's how long this is taking, and this is way too slow. From there, the project should be able to look at it, abstract it, and try to improve performance. Because all of the projects have gone in somewhat different directions, based upon the specific problems they've encountered, it's not like there's one way we can solve this across the board.

I believe we have time for maybe one or two more questions. The question is, do I think I've reached the end goal, or is there more room for optimization? We think there might be room for more optimization; we're just not sure if there's actually value in it. Specifically, the conversion from the API object to the JSON object is far slower than we would expect, but we haven't really been able to wrap our heads around why, because we do the same thing at lower levels repeatedly. So we might be able to shave some more time off that. I'm just not sure. We hit 541. Wow, okay.

Any other questions? Yes. Scripting I wrote. The question was, what tools did I use, and the answer is scripting I wrote to time it all. Any other questions? Well, thank you, everyone.