All right. Oh, that's loud. OK. Levels are good. Welcome, everyone. My name's Sean Dague. I've been involved in OpenStack for the last five years, at IBM at the moment. I'm going to talk today about large-scale changes in OpenStack and some patterns for being successful with them. A recurring topic that's come up time and time again at various summits over the last few years is: I've sort of figured out how to get a thing changed in one project and get engaged with that project, but the moment we start talking about doing something across a couple of projects, or broader in general, people just run out of steam and fall over. It doesn't work, we're not sure why, and the process for proposing these things is not entirely clear. How could we do that better? And the thing is, we have actually done a bunch of these in the past. There are patterns of success. I'm going to talk about those, and I'm going to get us there by starting with one of these problems that we solved over the last couple of years. It always feels better to start with a story, a story of success, where I banged my head into a thing and then we figured out how to go forward. So a year ago, this was my problem. Three years ago, we implemented this thing called API microversions within Nova, and it quickly spread to a bunch of other projects, which replicated what we were doing. That's great. But it had been in the field for two years, and there was literally no documentation about how to use it. In order to use it, you had to get inside the Nova source code and find this document, which is kind of buried in an odd place, and which explains in pretty reasonable detail the changes that we had made. But that was not part of the API reference, which was this nice, pretty page that tells you what you can do with the various OpenStack APIs. And this is what it looked like at the time. Thank you, Wayback Machine. OK, so that's cool. This looks actually pretty good.
And these blue things are collapsible sections, because it's actually a lot of information and it expands out. How do I get the details in there about the fact that we added a couple of new fields into this document? I go through this, which is SGML. Yeah. So we had this documentation system that was built because we had a lot of technical editors who understood DocBook. DocBook's cool. DocBook's super structured if you're really familiar with it, but it's really a specific language that documentation editors know. And then they figured out there was this emerging standard called WADL, which Sun proposed in 2009 but then kind of let die on the vine, which was an XML markup of REST interfaces: method calls and craziness. And then this was auto-converted into compatible SGML that was included in the SGML documents. It was sad. Fortunately, I was not the only person who ran into the fact that this was massively inhibiting us from having accurate documentation, because just making changes was crazy. So there was already an existing effort that had started a year before I even ran into my problem, around this thing called Swagger, which is now called OpenAPI. This is a new emerging standard for how we document REST APIs. Except it's not actually about documenting REST APIs; it's a way to describe and design them. It looks like this. OK, it's JSON. We like JSON. Not a bunch of random brackets everywhere. But it's very prescriptive, and it prescribes an API in the way a machine would deal with it. It has certain features within it, and when you get to its edges (any format can only describe a certain amount of the problem space), it says: oh, well, you can add extensions to it. And that was the problem when we started to look at the OpenStack APIs and do real worked examples. There were two things where it completely fell down. One: microversions, which was the thing I was actually trying to solve for.
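For a sense of what that prescriptive, machine-first shape looks like, here is a minimal Swagger 2.0 style definition of a single GET endpoint. This is a hypothetical sketch for illustration, not taken from any real OpenStack service:

```json
{
  "swagger": "2.0",
  "info": {"title": "Example API", "version": "1.0"},
  "paths": {
    "/servers": {
      "get": {
        "summary": "List servers",
        "responses": {
          "200": {
            "description": "A list of servers",
            "schema": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "id": {"type": "string"},
                  "name": {"type": "string"}
                }
              }
            }
          }
        }
      }
    }
  }
}
```

Everything is keyed by path and method, and each response gets exactly one schema, which is precisely where the trouble starts for an API whose responses vary.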
But even without that, there's this actions interface that exists in a bunch of OpenStack that meant we had to add these really crazy extension points. The net of which meant that of the twelve services documented at the time, only seven of them could pass those two barriers. That doesn't actually mean Swagger would work for those seven, for other reasons; but five of twelve already couldn't work within well-defined Swagger. Seven of twelve is not a success story, especially when the ones that don't fit are the most complicated ones, the ones where we really need to describe interesting things. And think about it another way, right? When you buy into a community or a standard like Swagger, if you actually buy into it and stay within its box, you gain an ecosystem: an ecosystem of authoring tools, stuff to build websites, automatic clients. The moment you start breaking out of that box, you lose the ecosystem, all the tooling and all of the great things that you got from being part of it. We would have effectively forked, which means not only did we sign up for this format, we signed up for writing all the tooling ourselves. And given this was an already understaffed effort, that was not a path forward. So there are kind of three things you need for a successful effort in OpenStack. You need some shared understanding of the problem you're solving, a clear indication of who your stakeholders are and that they're all bought into it, and then a plausible promise: everyone believes that with these people signed on to this thing, we're going to get it done, which means I will put in effort, I will help and not get in the way. Shared understanding, right? No matter how smart you are and how much you think you understand about what any part of OpenStack does, the answer is you don't have it all in your head for any non-trivial piece of code.
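To make the microversion problem concrete: the client asks for a version with a header, and the same URL returns a different document shape depending on what it asked for, which a single static Swagger schema cannot express. A toy sketch of the server side; the header format matches what Nova ended up with, but the field names and version cutoffs here are invented for illustration:

```python
def parse_microversion(header_value, default=(2, 1)):
    """Parse an 'OpenStack-API-Version: compute X.Y' style header value."""
    if not header_value:
        return default
    _service, _, version = header_value.partition(" ")
    major, minor = version.split(".")
    return (int(major), int(minor))


def show_server(requested):
    """One URL, many schemas: fields only exist at certain microversions.

    The cutoffs and field names below are made up for illustration.
    """
    server = {"id": "abc123", "name": "vm1"}
    if requested >= (2, 3):
        server["reservation_id"] = "r-0001"
    if requested >= (2, 9):
        server["locked"] = False
    return server
```

A client pinned at `compute 2.1` never sees `locked`; one asking for `compute 2.10` does. Documenting that in one fixed response schema is exactly where Swagger fell down.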
And, you know, great, I want to go solve a problem, that's cool, this is the bit I understand. Maybe there are other people who understand pieces that are the same as mine; we might be talking in a similar language. But realistically, there are a lot of other people who see this thing from a completely different perspective, and in doing so, this communication gap becomes very real. People often describe this as seeing different parts of the elephant, right? Some of it looks like a tree and some of it looks like a rope and some of it looks like a snake. When words don't mean the same thing, there's no way you can make forward progress. And as you're getting your stakeholders, the other thing people kind of forget sometimes: things only get done when people do them. The abstract concept of Nova doesn't get something done. Things get done because Riedemann gets grumpy and says this has to be done by Friday, otherwise, you know, whatever, right? And there has to be support all along the way, because in OpenStack nothing gets done without two or three people: someone writing patches and a couple of people reviewing them. So again, every effort needs clear stakeholders. Who's doing the work? Who needs to approve that work? It might just be reviewers. Who's going to be impacted by this? Because it might be people unrelated to doing the work and approving the work. And even more importantly, who hates this idea? Because there are various reasons why people come up and become the opposition, and most of them are not because they're terrible people. Most of them are because they have a different understanding of the thing, or they have a piece of information you don't have about what this might hurt down the road. And it's important not to fear your resistance but to actually dive straight into it.
Like, OK, I get that you don't like this. I want to make sure I at least understand where that's coming from and really get that feedback back in; otherwise I don't think we can build a real solution for you. Plausible promise. This is a thing Clay Shirky talks about in a bunch of his books about building online communities: at the end of the day, in a community like Wikipedia, for instance, there are a few people who are deeply invested, spending a ton of their time on it, but there are also lots of little efforts that help or hinder along the way. And whether or not people are going to spend those little chunks of time comes down to whether they believe you have a plausible promise: the thing that you want to do has general agreement, you've figured out who the stakeholders are, and we have a clear idea of where we would get to eventually. We have a clear idea of a next step which is useful on its own, because if this thing doesn't become useful until seventeen steps down the road, there are lots and lots of reasons why we will never get there. But if every step is incrementally useful, that adds to the plausibility of how we move forward. And do you have some way of even knowing, given this goal we're trying to get to, how to get there, or whether we're getting any closer? So if we look at the effort at hand, the one I ran into: I have this problem, and someone's already tried to solve this problem, except it had been kind of circling for a year. Why was it circling for a year? Was there a shared understanding of the problem we were trying to solve? Because from my perspective, the real issue was that the consumer of API documentation is not machines, it's people. Swagger was really good at making it consumable for machines, which was not actually the problem we had.
Also it didn't really fit with the stuff we had. The worked examples kind of avoided the hard problems to get something working, and didn't really try to test where we were going to break down, which happened really quickly. It was happening in a little bit of a corner, and right there is the plausibility problem. Now granted, when they got started, microversions got started at about the same time, so that was a blind spot where we just didn't see each other, and that's fine, no fault of anyone there. But the actions thing had been there a long time, and the mapping definitely didn't raise quite enough red flags. OK. So maybe we've been circling for a year, we're not making progress, what do we need to do? Let's step back and think of a new effort. What are we actually trying to accomplish? What do we need? What are our big inhibitors? Our doc format: this SGML craziness is a new thing for contributors to learn, it's too hard, it has impeded people coming into the process. There was a thing I didn't realize until I started getting on Google Hangouts with Anne Gentle and asking: what do you think your concerns are? And one of them was that the current look and feel was something the API docs teams were really invested in. OK, that's cool. That's a thing I didn't realize we had within our set of constraints: how do we make sure we replicate that with whatever we put behind it? Microversion support, which I needed. And who were our actual contributors, right? There was a docs team, and there was an API sub-team of that; it was a very small group of people, and they were definitely not keeping up on the API front. So in order for us to make forward progress, we also had to get all the project teams: what would we do that would get you bought in, so that you would maintain your documents?
What does the docs team need so that it merges well with the rest of what they've got? And remembering that we're communicating with humans is really important, which means we need to be able to put in long-form prose explaining what the concepts are; buried inside of triply nested JSON objects, it's a little bit funny to figure out how that comes out. And then it was, OK, maybe we've got one and two, and we've got to come up with our plausible promise, our path forward. This whole process was a few weeks of, oh, starting with: I will do the Swagger thing. This is great. OK, show me some examples. I started going through and I'm like, wait, we can't do this and we can't do that. And I get on a hangout with Anne. I'm like, I think I found some problems with the path of record. And I kind of walked them through with her, and she's like, yeah, but we've been doing this for a while, we really can't just throw this under the bus now. I'm like, OK. And then a couple of days go by, we get back on the phone: so I found these other problems. And back and forth; I think over the course of two weeks we were on video four times, and over the course of it I got her feedback, understood what her concerns were, and made sure they got integrated. And it's like, OK, how about I go and try something, and I will come back and see if you hate it. So, I'm going to build a custom Sphinx extension. Why was that ever the solution, right? This is code I've never written. This is an area I've never touched. I hunt around; Doug Hellmann has written like 10% of all the custom Sphinx extension code in the world, and I start asking him questions, and he's like, yeah, I don't know how that part works, I don't know how that part works. OK, right. So, deep in the bowels of craziness.
However, this is a human-readable, editable format that turns into this. And so me and a couple of other people got burdened with figuring out how to build all the little bits that turn this into that. There's actually a little more structure in here than is maybe obvious, right? This is a structured element, and it means a real thing. And because there's a certain amount of repetitiveness in, you know, the definition of what an ID is, we have a lookup table where a lot more information can be provided. And examples. Examples are huge, right? I want to know what's on the wire. So that's cool. We got the look and feel. I had a first pass on this. This is live, what's out there today. And then mugsie, who knew how to do a bunch of things in JavaScript that I didn't know how to do, built on what I had. It's like, oh, we do this and this and this and this, and bam, OK, it looks surprisingly like the old site. And our testing of whether we were going to get to our end goal was basically: OK, we're going to do this thing inside the Nova tree, so that if we screw it up, we haven't thrown anyone else under the bus yet; that's just my wasted time. And if it's successful, we'll extract this as a pattern and let other people do it. And the reality was, it got a little more successful a little quicker than we were hoping for, wherein everyone started copying all the code out of the Nova tree and putting it in their tree because they wanted to get here faster. And so, once we had about twelve projects already doing the new system, it was like, all right, let me get all this bundled up and clean everything up so that we don't completely fork the problem. So this was not, in some ways... yeah, you want to throw a question? It's RST, yeah, yeah, yeah.
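That tooling was eventually extracted as the os-api-ref Sphinx extension. As best I can reconstruct it, a source file looks roughly like this: a `rest_method` stanza for the structured element, and parameter rows that point into a shared `parameters.yaml` lookup table (the specific names and entries here are illustrative, not from a real service):

```rst
.. rest_method:: GET /servers/{server_id}

Show details for a server.

.. rest_parameters:: parameters.yaml

   - server_id: server_id_path
   - name: server_name
```

with the lookup table carrying the repetitive detail once:

```yaml
# parameters.yaml: shared definitions referenced by every method stanza
server_id_path:
  in: path
  required: true
  type: string
  description: |
    The UUID of the server.
```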
So, the details of this could be a whole other talk about the craziness involved there. Sphinx, the thing that converts RST into many different formats (man pages, web pages, straight-up text), has a mechanism by which you plug in parser extensions and stanzas, and you can hook and change many interesting parts of the entire rendering pipeline, and that's what this is. Not exactly; you have to have this thing registered, but we do in all our documentation trees. So, yeah. I'll look at a few other examples of efforts I've been involved in that have sometimes been successful and sometimes not. One of which was policy in code. We wanted to do a big change across OpenStack. We have these policy.json files; they're everywhere. They have lots of crypticness in them; they're their own defined DSL. And when we make API changes, you need a new one, otherwise you might open security holes on yourself during upgrade, which is just craziness, right? It's a bunch of state that doesn't need to be there. So, in the Nova team, we got this idea: you know what, we should totally do this in code. All the defaults are in code. Then we could run with an empty policy file, and the only thing you would change is the stuff you're overriding, which means if we bring in a new interface, by default you're getting a sane level of protection. So, great. Who do we need to get a shared vision of this? Who are the stakeholders? We need the Nova team. We actually had a bunch of operator buy-in: oh yeah, that would totally make life easier. We also needed the Keystone team bought in. But this is one of those places where the Keystone team was in the middle of trying to do a much more complicated thing with policy from a completely different angle.
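The shape of "policy in code" (which eventually landed as oslo.policy's RuleDefault and register_defaults machinery) is simple enough to sketch with a toy model: defaults live in the source tree, and the operator's file only carries the deviations. The rule names and check strings here are invented for illustration, not real Nova policy:

```python
# Toy model of policy-defaults-in-code; not the real oslo.policy API.
DEFAULT_RULES = {
    "compute:servers:index": "rule:admin_or_owner",
    "compute:servers:delete": "rule:admin_or_owner",
}


def effective_policy(operator_overrides):
    """Merge in-code defaults with a (possibly empty) operator policy file.

    A brand-new API shipped with a sane default is protected even if the
    operator never touches their policy file; the file only needs to list
    what they deliberately changed.
    """
    merged = dict(DEFAULT_RULES)
    merged.update(operator_overrides)
    return merged
```

With an empty override file you get exactly the defaults; overriding one rule leaves every other default intact, which is the upgrade-safety win the Nova team was after.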
And I started with, OK, I'm sure that I can have some high-level conversations with the Keystone folks and they will be able to help me translate this into what it means there. And it turns out our understandings of the problem space that is policy were so radically different that we were never communicating. I ended up having to go read a lot more of the Keystone policy code before I could even start speaking a common language. And they had a different set of concerns, coming from a different direction, trying to do a different big project. And so the whole thing (I probably should have included the mailing list thread) just blew up into nothingness, and everyone walked away. OK, that's no good. But even in failures you can have elements of success, because part of the problem we ran into was the shared understanding: we weren't on the same page. The way to work towards getting on the same page was, for me, coming from a very different perspective, to go in and read a lot more about what their thing was, digest it, and re-explain it, to start building a common basis to move forward. That by itself moved the ball forward, even though the entire effort collapsed and everyone went away angry. It was still a good idea. And because the ball got moved forward, a year later Andrew Laski was like, you know, we really should do that. I'm like, go for it. We passed the baton; he decided to take charge of it. And that's super cool. We'd set a new baseline. We had a new conversation with all the Keystone folks in the room, there was a lot more cross-understanding of concerns, and we just stamped it: yeah, we'd totally just get that done. That was in Austin. And then it was off to the races. All our stakeholders were there, and there was a plausible path: here's a set of oslo.policy changes and Nova changes.
But also, really importantly, and the thing that we forget a lot: this took a lot of project management to get through these multiple teams. So when this went forward, I was like, I am too burned out to actually help on the tech side of this particular problem, but I will help you with project management. You tell me the things that you need to get landed at all these various points, and I will just keep chasing reviewers for you. And that's what I did: I would check in with Laski at least once a week, if not more often. OK, we've got these oslo.policy bits that aren't in there yet. Let's go chase reviewers. Who do I need? Who are all the right people to line up on this? You know, pin down Dims and Doug and Steve Martinelli: you guys have got to do this. And it got through; it was in. And now Keystone's actually putting the same thing back into their defaults, which is super nice. This is the title of an OpenStack spec. You never want to see the title of a spec end with a question mark. That leads to the point that we clearly don't have an understanding of this, right? There was a big push a couple of years ago that eventlet is terrible. And OK, that's fine, eventlet is terrible, and so we should replace it. OK, but with what? And why is it terrible, and why is every other solution not equally as terrible? Just because there's a great unknown which doesn't have all the piles of garbage that you've got in front of you does not mean it's actually a better place to be. And even if it were, how long is it going to take to get there, and what is the other cost along the way? An effort of this scale: I applaud people for being enthusiastic about big efforts, but this is literally everything has to change all at once. And that just can't work, right? It just doesn't match up. So things like that: you didn't have any of the things lined up.
Glance. Glance is kind of an interesting one. Glance has been trying to get rid of the v1 API for a long time, and again, this was an effort that circled for a very long time. I remember starting to have these conversations back in 2014. And eventually you just kind of wonder: for four cycles we keep talking about this; even just on the Nova path, why hasn't this code landed yet? What's missing? And there was a total breakdown in the shared understanding. From the Glance team's side, it's like, oh, you just have to use this other interface. Well, do you understand our concerns about the fact that we currently proxy your interface out, and that this kind of breaks the experience? There are new things in the way the new interface works; we can't honor the old one. And what does the upgrade path look like for an operator going across this? The Nova team was pretty deep in: we need to have a deprecation and upgrade cycle that is sensible for everyone. And Xen. If anyone's ever looked at the XenServer virt driver, it does very interesting things, and "interesting" with many asterisks after it. The moment you show up and say, I've got this idea, it's: well, have you looked at the Xen case yet? And if people haven't, it's: OK, go back and do your homework first. You know, I get that they're a small percentage of our users, but it's supported in-tree code; we can't just not do it. So the Glance team was doing a bunch of good work, but they had not gotten the stakeholders aligned on this, because of which there was no path that made any sense here. So after about four cycles of this, there was a big push at the end of one release, and it was going in, and some crazy patches were being thrown together. OK, let's just stop, reevaluate, answer all these questions, and then we'll move forward. And let's not rush the release on it.
So basically we stopped it for the release. One of the Glance team members went back and wrote a beautiful doc that actually addressed every single one of these things: what is the answer to that? And that was perfect. All of a sudden you've painted the picture of where we're headed, all the hard problems have been pseudo-worked, and now we are all exactly on the same page about how this happens. And so over the course of four weeks we landed all the patches and cut Nova over. It was only four weeks of work once we got on the same page, and that's the important thing. So I will wind down with my hopeful one; this is kind of the end of it. This is one that's happening right now. Out of the Atlanta PTG we got this: hierarchical quotas have been a thing people have been talking about for at least three years, and there's Keystone support for building the hierarchy of projects, but that doesn't mean any of the projects actually do anything with it. And it was very clear in this Atlanta PTG session that a big part of the problem was we just didn't have a shared understanding. You had a whole bunch of people saying the word "overbooking" and meaning slightly different things. And then, the moment we started having a conversation about what hierarchical quotas mean, everyone wanted to discuss exactly the algorithm by which such-and-such would happen, and it's like, no, it turns out we can dive into a rabbit hole really quickly and talk about, well, if you did this, then the following things would happen. Can we at least get a shared scope of the problem? And there were some threads of that there. So we built a concept spec, which has landed in the Keystone project: this is the general scope of the problem, this is the general class of things that we want to do.
This is the concrete set of things we're going to do: we're going to move limits definitions into Keystone, which has the project hierarchy, because it turns out that validating what the project hierarchy says is actually one of the hard problems. And then here are some pseudo-steps on the path forward. And that's landed. We had buy-in from the CERN guys: this is great, this totally works for a bunch of things. We had Dean representing the client experience: yeah, exactly; running around to seventeen different projects to figure out how to bump up someone's quota so they can actually boot a server is not really an effective way to do that. And in landing that concept spec, it was: OK, for the Keystone folks the first changes are there, so we're going to go there first. But we had to at least get buy-in from all the core infrastructure: the Nova, Neutron, Cinder, and Glance PTLs all had to sign on to this before we moved it forward, and we actually engaged them early. And the only thing that got even kind of a twinge was: all right, well, just remember the volume types case. But I think it fits in this model. And we have part of a path going forward. We have some more detailed work to be done on this, but for the first time in three years, we seem to be making progress on a thing which is going to touch a whole bunch of OpenStack. And a big part of it is fitting into this model, right? Build your shared understanding and build that forward. Make sure you've got the right stakeholders, and that you've got some plausible promise that you're going to get there. With that, I'm going to end. We have seven minutes for questions, and if anyone would like to ask one, please jump to the mic. If we have no questions, that's OK too. But, yeah. Can you articulate a case where things did not work, or something went so horribly wrong that it stands out?
I mean, the eventlet one. If you go look at the openstack-specs repository, that's about 90% of what's in there. When specs as a concept moved forward, it was this really great idea that projects would have a way that people could propose new ideas and we could work out the details in Gerrit. And then someone said, well, there are certain things we want to do all across OpenStack, so we'll have an openstack-specs repository, which at the time seemed like a good idea. But many things got proposed there, like: define distributed lock management across all of OpenStack, change eventlet to something else, re-architect the world. They were these big, giant-scope problems that went step one, write an openstack-spec; step two; step three, profit. There was just no progression. And there was an assumption by the people pushing these things in there that the openstack-specs repository was the magic button by which everyone important signs on to a thing and work gets done. But you have to make a distinction between who is on the approver list in Gerrit and who are the important people who have to sign off on a thing to get it done. They are not necessarily the same set. So we've still got eventlet; we're going to have eventlet for a long time. That being said, all the conversations we had around the eventlet thing had a piece of productivity come out of them, which was: our API servers were running as eventlet WSGI stacks, which is kind of terrible and doesn't really fit into people's production deployments. That bit we could solve. And you see, two or three years later, we've got community-wide goals that everyone's API services need to be served off of a real web server, uWSGI or Apache, over the next cycle. And it's actually shaking out some bugs in the process. So yeah, that's an instance where things went wrong and some lessons we learned out of it.
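The reason that migration was tractable is that WSGI is just a calling convention: the same application callable can sit behind eventlet's WSGI server, uWSGI, or Apache mod_wsgi, so services could swap the server underneath without rewriting the API code. A minimal sketch of such a callable:

```python
def application(environ, start_response):
    """A minimal WSGI application.

    eventlet.wsgi, uWSGI, and mod_wsgi all speak this same interface:
    they call the app with a request environment and a start_response
    callback, and the app returns an iterable of body bytes.
    """
    body = b"OK"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Under uWSGI this would be served with something like `uwsgi --http :8080 --wsgi-file app.py` (flags from memory; check the uWSGI docs for your version).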
How do you manage teams that are silos, and when you propose these new changes, there's total aversion to it? It seemed like, from your experience with the policy change in Keystone, it was kind of by sheer luck that a year later someone else picked up the baton and got it through. What are some tips, besides the ones you've shared with us today, for those kinds of edge cases? Sure, yeah. People from the outside don't quite think about how distinctive regionalized cultures grow up inside of every project, as well as distinct understandings. OpenStack is enormous; there's so much to learn there. So the biggest thing that I've found to be actually super useful, when you have some high-level idea: the mistake I made on the policy thing the first time, and I kind of knew I was making it at the time, but it was just a factor of time, was not doing my homework deep in whatever it is first. There was a set of words, there was a set of concerns, and I didn't understand why the things they were really concerned about were concerns. And part of that was that I did not understand enough of their project; I was just missing a whole bunch of things. So if you are going to go and dive hard at a thing in somebody's community, you have to go do your homework; you have to go and read a bunch of their source code and understand: hmm, this isn't the way I thought it worked. And in doing so, and also saying, hey, I'm just going to ask some stupid questions, can you guys help me out? That helps build a relationship with all those folks. I mean, I had a very decent relationship with a bunch of the Keystone team, but we just weren't talking the same language. It took a while for them to envision why we would ever want to do the thing we were doing.
And part of it is that their API doesn't change as much as ours does, so their concerns were different. But yeah, in general: go do your homework. Right now I'm working on reviving global request ID chaining, so that the request ID will be the same from Nova to Glance to Neutron when you make the call-outs. The first attempt at that was four years ago, and there were a few reasons why it fell down at the time. But in reviving it, it's also: here is exactly the flow that's happening. We have to make changes in four projects, but they're actually really small, and I've unpacked it all now, right? I spent a lot of time reading middleware over the last four to eight hours, and it's like, oh, that didn't work how I thought it worked. OK, but the solution's actually not that far away. So yeah, do your homework. Anything else? All right, well, thank you all. We are at lunch. Enjoy the rest of your conference.