Welcome everyone. We'll have a talk by Stefan Jensch from Yelp about how they use OpenAPI to bind services, microservices, together, and then we'll have five minutes for questions and answers at the end. Please welcome Stefan. Hello everyone. I'm a technical lead at Yelp. A short bit about Yelp: we're in the business of connecting people with great local businesses, emphasis on the word great. We have over 100 million business reviews written by our users on the website, which should help you make sure you find the business you're looking for. We are live in 32 countries and have over 90 million monthly active users. So what is this talk going to be about? It's about OpenAPI, or Swagger. Swagger is a specification and a set of tools for dealing with a service-oriented infrastructure, for making calls to a service. I'm going to give you a short introduction to some of the available libraries, and then I want to focus on the things tutorials typically don't talk about: issues you will face in production, surprising behavior, especially when migrating from an older version of Swagger (1.2) to the current one (2.0), and the things that can go wrong. I call them war stories. A bit more about Swagger: I'm going to use the term Swagger, since that's what I'm used to; OpenAPI is basically a rebranding that was done recently. It specifies the contract of your service. That is the main thing it does, and it helps you a lot because it's human- and machine-friendly. It helps you with testing, with describing your service and defining the contract, and then with making sure that contract is enforced; the tools actually help you with that. It does automatic documentation for you. So if you run several services, and we have hundreds in production at Yelp, the question also becomes: which services even exist? What APIs do they offer? What do they do? Swagger helps you with that. And, as I said, it ensures conformance to the spec.
It can actually make sure your endpoints behave the way you defined them. And it has a community-driven set of tools which are language-agnostic. I'm obviously going to focus on the Python part of it, but there are tools for many other languages, and there are tools written for Swagger, the specification, which you can use and take advantage of even though they are not written in Python. So, as I said, it solves several problems you face when building your service-oriented architecture. The main library I'm going to talk about is Bravado. It's the client library you can use to make calls to services that have a Swagger spec. In addition to helping you with that and enabling you to write way less code, doing these calls in a nice, Pythonic way since you get automatically generated object stubs, it also helps you enforce or verify the contract of your services by doing request and response validation. There's another library, SwaggerPy, which is for the old version of Swagger. We're not going to focus on that; I just want to mention that it's available. Bravado was born out of SwaggerPy, but it actually has quite a few surprising differences in behavior. So for those of you who are migrating from SwaggerPy to Bravado, I've got some tips on issues you might encounter and might want to avoid when migrating. Then, since at Yelp we use Pyramid as part of our standard service stack, I'm going to talk a bit about Pyramid Swagger. Of course, there are libraries for other web frameworks, like Connexion for Flask or Django REST Swagger for Django. That last one (that's why it's only in light gray on the slide) only supports Swagger 1.2, so if you want to do Swagger 2.0 with Django, I fear you are on your own. The good thing is that most of these libraries, at least Bravado, SwaggerPy, Pyramid Swagger, and all of the associated Python libraries you need for them, are Python 3 ready. So Python 2 and Python 3 both work.
Now let's take a quick look at an example of a Swagger spec. This one is written in YAML; you can also write them in JSON. This is the heart of everything. This is how you define your service contract. First you say: yes, this is a Swagger 2.0 spec. Then some information that is used for documentation and nothing else, and some information about the service itself: where it lives, what scheme you use to talk to it, typically HTTP or HTTPS. It consumes and produces JSON; that can of course also be something else. Now let's look at one of the endpoints we define. In this case we define a /users endpoint. It's an HTTP GET operation, and it returns, unsurprisingly, a list of users by user IDs. So we define a parameter, user_ids, that has to be provided in the query and that is required. It's of type array, an array of integers. All of this is defined here. You can see the operation ID further up; we're going to use that in a bit. This is what identifies your endpoint when using Bravado. Now, what does this endpoint return? In the standard success case, HTTP status code 200, we return a list of users, which is an array of user objects. We're referencing the user definition here; you can see this is a reference to another section in the file, but you can also reference external files. This is how you would split up your Swagger spec if it's getting too big. There's a default response I'm defining here, which applies in all other cases, so for the non-200 response status codes. It references an error definition object, which I left out so as not to make this too long. So let's get to the definitions section real quick. We're defining a user object here, the user object we referenced just a bit above. It has two required fields, which are defined among the properties: an id of type integer and a username of type string. And we have an optional field, business_id, which is also of type integer.
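Pieced together from the walkthrough above, the spec might look roughly like this. This is a reconstruction, not the slide itself: names like `User`, the host, and the error stub are assumptions, and the error definition the talk leaves out is filled in with a minimal placeholder:

```yaml
swagger: "2.0"
info:
  title: User Service
  version: "1.0"
host: userservice.example.com
schemes:
  - http
consumes:
  - application/json
produces:
  - application/json
paths:
  /users:
    get:
      operationId: list_users
      tags:
        - user
      parameters:
        - name: user_ids
          in: query
          required: true
          type: array
          items:
            type: integer
      responses:
        "200":
          description: The requested users
          schema:
            type: array
            items:
              $ref: "#/definitions/User"
        default:
          description: An error occurred
          schema:
            $ref: "#/definitions/Error"
definitions:
  User:
    type: object
    required:
      - id
      - username
    properties:
      id:
        type: integer
      username:
        type: string
      business_id:
        type: integer
  Error:
    # Minimal stub; the talk omits the real error definition.
    type: object
    properties:
      message:
        type: string
```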
That optional field can be present, but it doesn't have to be. So, specs like these you can write in your favorite editor or development environment. There's also a tool available, one of the tools I've mentioned: the Swagger Editor. It's a hosted solution which you can just use to write your spec. It has nice syntax highlighting, it will immediately tell you if the spec is not correct, if it's not valid, and it will point out what errors there are. It can also immediately show you a nice HTML rendering of your spec as documentation. And all of this is, of course, free and open source. We can also use a tool called Swagger UI to generate documentation for our spec. This is what it looks like; this is for a service I co-develop at Yelp, the business owner app backend. You can see here how some things are turned into documentation: account and business are tags we gave to several endpoints, so the endpoints are grouped by these tags. We can see a list of endpoints, and for each the URL and the HTTP operation. In the brackets you can see the operation ID, as I mentioned before; this is what we're going to use when calling the endpoint with Bravado. You can also see that there's an endpoint that is struck through: this is a deprecated endpoint. Typically, one of the problems you have in a service-oriented architecture, when you do services or microservices, is that you don't exactly know who is using your endpoints. You can't really remove them without being very careful, but you can deprecate them. And when you have to make backwards-incompatible changes to your endpoint, when you want to add or modify behavior, you create a new version. Unfortunately, Swagger doesn't have built-in support for versions, so what we do is just append the version at the end, which works just as well. So let's see how to use all of that.
This is a small example of using Bravado to make a call to the service as we have defined it in the Swagger spec. First, we create the client, the Bravado SwaggerClient. We just tell it where our service lives; typically this will be the host and port of your load balancer, I guess. At Yelp we use SmartStack to do service discovery; you might use some other form. You just point it to wherever your service lives, to its swagger.yaml or swagger.json. In this case, we're using the fido client. I haven't mentioned this yet: fido is another library you can use. Instead of the default synchronous client for making requests, which is based on the great Python requests library, you can use fido. Asynchronous communication is a huge topic here at EuroPython; I think it was last year as well. A colleague of mine is actually giving a talk on that tomorrow. And if you have visited one of those talks, you know that it's quite a shift in how you program: you basically have to rewrite all of your application. Fido won't enable full asynchronous communication, but it will do all of your service calls asynchronously, and it will hide all of the complexity for you, so no event loop, no Tornado or Twisted or whatever. It will just work, and you won't even notice. The only thing you need to do is work with futures. So in this case, we're calling the list_users endpoint that we defined in the spec. We provide the user IDs; that's the parameter, so we don't have to build any URLs or do any URL encoding. It's all handled for us. But this call doesn't return the result immediately: it returns a future that we store in a variable. And just as an example, we're doing a second service call here. We haven't defined that one in the spec; just imagine we did. So we have two futures here, and then we call result() on these future objects, which will block until the response is there. So why are we doing this? What does this give us? Let's take a look.
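Assembled from the description above, the slide's code might look roughly like this. This is a hedged sketch: the spec URL, port, and the second endpoint (`list_reviews`) are made up for illustration, and it needs a live service plus the bravado and fido packages to actually run.

```python
from bravado.client import SwaggerClient
from bravado.fido_client import FidoClient

# Point the client at wherever the service serves its spec
# (this URL is hypothetical; at Yelp it would go through SmartStack).
client = SwaggerClient.from_url(
    'http://userservice.example.com:8080/swagger.json',
    http_client=FidoClient(),  # asynchronous client instead of the default
)

# With fido, these calls return futures immediately; the network
# requests are already running in the background.
users_future = client.user.list_users(user_ids=[1, 2, 3])
reviews_future = client.review.list_reviews(user_id=1)  # hypothetical endpoint

# result() blocks only until the corresponding response has arrived.
users = users_future.result(timeout=2.0)
reviews = reviews_future.result(timeout=2.0)
```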
There's a tool called Zipkin, which you might have heard of, that will help you see what service calls you make and how they behave, how long they take. I hope you can see this in the back; the names and numbers are not important. Our service is a pretty heavy service, since it's the service that returns data to our mobile apps, so it does a lot of data collection, and the endpoints take quite a bit of time until they are done. This is an example of doing many service calls with Bravado one after another, with the synchronous client. We see the first request is actually very fast, 8 milliseconds. We have to wait until it returns, then we do the second service call, and so on and so forth. And when all of these are done, we return the response. Now, you can gain huge performance improvements by just creating all of the futures as early as you can, and only then calling result() on them. Obviously, this is not always possible, since some of the service calls you make depend on the results of previous service calls. That is what we see here: we still need to do the first two calls serially, one after another. But then we have all the information we need to talk to all of these other endpoints and services, so we fetch all of the data at once, and the time it takes is the time of the longest of those calls. Now you can obviously see the difference. This is the power of fido and Bravado: very few changes to your application, and you gain these benefits. But let's now focus on the main part of the talk, my so-called war stories. All of this is nice, and that's why I've gone through it pretty quickly: there are tutorials online that will tell you this, and that's typically where they stop. But when you use this in production, you will encounter several issues that you might not have thought of beforehand. So let's talk about them.
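The gain shown in the Zipkin traces, kicking off all the calls first and only then blocking on results, can be sketched independently of Bravado with the standard library. The 0.1-second "service calls" here are stand-ins:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_service_call(name, seconds=0.1):
    """Stand-in for a network call that takes `seconds` to respond."""
    time.sleep(seconds)
    return name

with ThreadPoolExecutor(max_workers=4) as executor:
    start = time.monotonic()
    # Kick off all four calls first; submit() returns a future immediately.
    futures = [executor.submit(fake_service_call, name)
               for name in ('a', 'b', 'c', 'd')]
    # Only now block on the results: total time is roughly the longest
    # single call (~0.1s), not the sum of all of them (~0.4s).
    results = [f.result() for f in futures]
    elapsed = time.monotonic() - start

print(results)  # ['a', 'b', 'c', 'd']
```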
The first of these war stories is not really specific to Bravado, but you have to think of it any time you talk over the network: you have to deal with network issues. One thing we have encountered, besides connection issues, which you have to deal with as well, is timeout issues. The load changes over the day, and not every request will be as fast as you want it to be. You have hopefully set timeouts. Please always set timeouts: Bravado's default timeout is no timeout. So please set one, because otherwise something in your stack will time out after some ungodly number of seconds. Please don't forget that. So let's say a call does time out. What we've done here is we created a future just like we did before, and then we wrapped the call to result() in a retry decorator that will just call result() multiple times in case the HTTP timeout exception happens. This works, unfortunately, or fortunately, only if you use the default synchronous HTTP client. Why? Because in the synchronous case, nothing happens until you call result(); that's the whole point of it. So when you call result(), it initiates the network request to your service, waits for the response, and returns the result. In the asynchronous case, the network request is initiated once you create the future. When you call result(), it just blocks if the response hasn't already arrived in the background. This also means that when you retry like this, you will just add additional delay, in this case two seconds, two seconds, two seconds, without ever redoing the network call and getting the result you want. So with this programming model, what you need to do is wrap both the future creation and the call to result() in your retry decorator. That will work. Another issue.
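The retry pitfall just described can be sketched with a made-up stand-in for an asynchronous client (FlakyClient, FakeFuture, and Timeout are all invented for illustration; with fido you would catch its timeout exception instead):

```python
class Timeout(Exception):
    pass

class FakeFuture:
    def __init__(self, resolve):
        self._resolve = resolve
    def result(self, timeout=None):
        return self._resolve()

class FlakyClient:
    """Async-style stand-in: the request is 'sent' the moment the
    future is created, and the first `failures` requests time out."""
    def __init__(self, failures):
        self.failures = failures
        self.attempts = 0

    def call(self):
        self.attempts += 1
        failed = self.attempts <= self.failures
        def resolve():
            if failed:
                raise Timeout()
            return 'ok'
        return FakeFuture(resolve)

# Broken retry: calling result() again on the SAME future never
# re-issues the request, so every attempt hits the same timeout.
broken = FlakyClient(failures=2)
future = broken.call()
for _ in range(3):
    try:
        future.result(timeout=2.0)
    except Timeout:
        pass
print(broken.attempts)  # 1: the request was never actually retried

# Correct retry: wrap BOTH the future creation and result().
def call_with_retries(make_future, retries=3):
    for attempt in range(retries):
        try:
            return make_future().result(timeout=2.0)
        except Timeout:
            if attempt == retries - 1:
                raise

print(call_with_retries(FlakyClient(failures=2).call))  # ok
```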
This next issue has unfortunately, I might say, bitten people a lot, both when upgrading from Swagger 1.2, so from the SwaggerPy library, as well as newcomers to Bravado who already have a service in production, create a spec for it, and want to use Bravado. What we see here is a JSON representation of the user object we defined in the spec. And as we can see, there's a business_id field with the value null. Unfortunately, the Swagger tools don't agree that this is correct. What they say is: that field is optional, so either it's present and the value has the type integer, or it's not there at all. But in this case, what you will get is a validation error, because you returned the Python type None, the JSON type null, instead of type integer. In SwaggerPy land, so in the old Swagger version, what you could do was pass a parameter to your result call telling it to ignore this case. Unfortunately, that doesn't work anymore. There are two solutions. The one I prefer is to change your services to just filter out these fields and not send them at all. It will make the response shorter; everything is great. Sometimes you can't do that. In that case, pass the validate_responses parameter as a config option to your client. Now, what I would suggest is that you create a separate client for the endpoint calls that need this option. But please don't set it as the default for every call you make, because I consider the contract validation, making sure everything behaves according to the spec, one of the really good features of the Swagger ecosystem. Now, another surprising issue: creating the client is a dynamic operation, and it can take time. It will probably take time if you have a non-trivial Swagger spec for your service. You can see here an internal example where I hit one of our bigger internal services and just measure the time it takes to instantiate the client.
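Back to the null issue for a moment: the preferred fix, filtering optional fields whose value is None out of the response before sending it, might look like this minimal sketch:

```python
def strip_none_fields(payload):
    """Recursively drop keys whose value is None, so optional fields
    are absent from the JSON rather than serialized as null."""
    if isinstance(payload, dict):
        return {key: strip_none_fields(value)
                for key, value in payload.items() if value is not None}
    if isinstance(payload, list):
        return [strip_none_fields(item) for item in payload]
    return payload

user = {'id': 42, 'username': 'alice', 'business_id': None}
print(strip_none_fields(user))  # {'id': 42, 'username': 'alice'}
```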
The thing is, creating the client will do one or potentially multiple network requests in the background to fetch the spec. If the spec references external files, it will fetch those as well, and all of this takes time. When you migrate from SwaggerPy, you won't even think of that, because SwaggerPy had a built-in mechanism for caching this: if you handled multiple requests, it would just hit an internal cache and recreate the client from the cache without you even noticing. So when we upgraded an endpoint to Swagger 2.0, we forgot about that, and then you get graphs like these. Unfortunately, I don't have the graph from when the timings went up; this is the graph from when we saw the issue, fixed it by caching the client ourselves, and deployed that to production. You can see our P50s, though those you can probably not see, went down a little bit, but especially the P95s and P99s, so the slowest part of the requests if you're not familiar with percentiles, became way faster just by caching the client. And this is something you need to do yourself; there's no built-in solution for it in Bravado. Now let's talk about some of the issues of deployment at scale. I'm typically not a huge fan of talking about "at scale", especially without saying at what scale. I'm talking here about not just running your one or two service instances with three services in total, but having maybe tens or hundreds of instances of a service. When you deploy them to production, you have some sort of deployment strategy that ensures you're not dropping requests or returning errors, so that basically nobody notices you're deploying a new version of that service to production. There are tools that do that, like the open-source platform-as-a-service offering that we have.
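Caching the client yourself, as described above, can be as simple as a TTL-wrapped factory. This is a sketch: the TTL and the counting stand-in are illustrative, and in real code create_client would be something like SwaggerClient.from_url.

```python
import time

def make_cached_factory(create_client, ttl_seconds=300):
    """Wrap an expensive client factory so repeated calls reuse the
    instance instead of refetching and reparsing the spec every time."""
    cache = {}

    def get_client(spec_url):
        entry = cache.get(spec_url)
        now = time.monotonic()
        if entry is None or now - entry[1] > ttl_seconds:
            cache[spec_url] = (create_client(spec_url), now)
        return cache[spec_url][0]

    return get_client

# Demonstration with a counting stand-in for the real factory:
created = []
get_client = make_cached_factory(lambda url: created.append(url) or object())
first = get_client('http://svc/swagger.json')
second = get_client('http://svc/swagger.json')
print(first is second, len(created))  # True 1
```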
The default strategy there is crossover: it will spin up the number of service instances you have defined you want in production for the new version, so for a period of time you basically have both versions running in production, and then, after it verifies that everything works well, it will shut down the old instances one by one until only the new version is in production. Another deployment strategy, which we use for our old monolith at Yelp, is to take running instances out of the load balancer, switch the code version, bring them up again, and put them back into the load balancer. So one by one, or slowly in batches, you flip them until they are all running the new version of the code. How does this affect you as a service developer? You're not an ops person; you write code, you deploy it. Does it affect you? It actually does. Let's get back to versioning for a bit. I said you have to version to make backwards-incompatible changes. And as a developer you know: multiple versions mean more maintenance, and you want to minimize the cost of maintenance. So you try to cut corners wherever possible; you try not to create too many versions of the same endpoint. Now, what if you want to add data to the response of an endpoint? Typically this is fine. Bravado does enforce the contract of your service, but it just checks that the data you send conforms to the spec. If you send additional data, it doesn't care; it only complains if the type of the data is not correct or if data that according to the spec should be sent is missing. So you might be thinking: hey, I'm just adding data, that's not a problem. The point here is: the new field is non-optional. You're changing the spec to say that in this response it is required that you send this field. Of course, we've done this. The fact is, this doesn't really work, because of how you deploy your service. So let's say we have this service.
We added the field to the spec and we did the implementation. We're deploying this to production, and during the deployment we have both versions running. So a client wants to call our service. It fetches the spec first and, by chance, hits a new instance of our service, so it gets the updated spec that says this field has to be present in the response. All good. Now it does the service call and fetches the data, and, as luck would have it, it hits an old instance of our code running in production, which knows nothing about that new field and doesn't return it. So instead of a nice response, all you get is an exception, and if you're unlucky, depending on how you programmed it, this might result in a user-facing error. So how do you do this without creating an additional version? Well, as with many things, you have to split it up into multiple compatible steps. The first step is to add the field to the spec as optional, so you don't say that it's required, and to add the implementation, so you're already returning the field. You ship this, you do a code push and deploy it to production. Then, in the second step, you do a small change to the spec and mark the field as required, and all is well. If you don't do this, you get something like this example, a screenshot out of a postmortem that was sent around at Yelp. This is the error spike we saw when we did not split such a change up into these two steps when deploying to production. Now, another thing: of course, as I said, you cannot just remove data from the spec, from the response; that is obviously not a backwards-compatible change. So you think: but we can use the pattern we just used for the other change to do this in a backwards-compatible way, right? So what we do first is remove the field from the spec, so the spec knows nothing about this field anymore, but we don't change the implementation.
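Back to the first pattern, adding a required field: why the one-step change breaks during a crossover deploy can be sketched with a toy validator. The 'locale' field is hypothetical, and this is not Bravado's actual validation code:

```python
def validate(required_fields, response):
    """Toy response validator: every required field must be present."""
    missing = [field for field in required_fields if field not in response]
    if missing:
        raise ValueError('missing required fields: %s' % missing)
    return response

old_response = {'id': 42, 'username': 'alice'}                     # old instance
new_response = {'id': 42, 'username': 'alice', 'locale': 'en_US'}  # new instance

# One-step deploy: the new spec marks 'locale' required, but a request
# can still hit an old instance, so validation blows up.
try:
    validate(['id', 'username', 'locale'], old_response)
except ValueError as error:
    print(error)  # missing required fields: ['locale']

# Two-step deploy: step 1 ships the field as optional, so responses
# from both old and new instances validate...
validate(['id', 'username'], old_response)
validate(['id', 'username'], new_response)
# ...and only step 2, once no old instances remain, marks it required.
validate(['id', 'username', 'locale'], new_response)
```

The same deployment race is what makes removing a field tricky, as the talk goes on to explain.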
We still return the field, and only in the second step do we do a second code push and ship the implementation, and all is fine, right? No, it isn't. Don't do that. This will also cause, or might cause, issues in production, and I've got the postmortem to prove it. Because, as I said, if you want to run this efficiently in production, you need to cache your client, which means it doesn't refetch the Swagger spec and instantiate the Swagger client and everything on every request. So you can't be sure that the client will fetch an updated version of the spec in between the two code pushes. You don't know, and it doesn't, as you can see here. So what you would need to do to make this backwards compatible is: do the first step as I said, change the spec; then, for all of the clients, all of the callers of this service, restart them so their caches are cleared and invalidated; and only then can you do the second code push and remove the field for real. A similar issue arises when adding a reference to a new spec file with $ref. This should probably be a little bit obvious by now, so I'm going through it a bit quicker. If you add a reference to a new file and you do it all in one step, so you add the reference and the file itself together, you may hit the same issue: a client gets an updated spec that references the new file, but then, when trying to fetch that file, it hits an instance that still runs the old version of your code and doesn't have it. Same solution: add the file first, without modifying the spec, so you are sure every instance has the file, and then in the second step add the reference to it, and all is well. In this case there's also something else you can do: you can let Pyramid Swagger combine the spec for you, if you're using Pyramid and Pyramid Swagger. There's an option that was added recently that reads the spec, combines it into one big spec for you, and then serves this one big spec to your callers, eliminating this issue.
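The $ref race can be sketched as a toy model of a crossover deploy, where consecutive fetches can hit different instances (the file names and contents here are invented):

```python
def fetch(instance, filename):
    """Stand-in for an HTTP GET of a spec file from one service instance."""
    if filename not in instance:
        raise FileNotFoundError(filename)
    return instance[filename]

# During a crossover deploy, both versions serve traffic at once:
old_instance = {'swagger.yaml': 'spec without the $ref'}
new_instance = {'swagger.yaml': 'spec with $ref to user.yaml',
                'user.yaml': 'user definitions'}

# The client fetches the main spec and happens to hit a NEW instance...
spec = fetch(new_instance, 'swagger.yaml')
# ...but the follow-up fetch of the referenced file hits an OLD one:
try:
    fetch(old_instance, 'user.yaml')
except FileNotFoundError as error:
    print('referenced file missing:', error)

# Safe first step: ship the new file unreferenced, so every instance
# has it before any served spec mentions it.
step_one = {'swagger.yaml': 'spec without the $ref',
            'user.yaml': 'user definitions'}
print(fetch(step_one, 'user.yaml'))  # user definitions
```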
The one thing this spec-combining feature doesn't support is recursive references: if your user object is self-referencing, referencing itself, this won't work, and you need to keep multiple spec files in that case. A colleague of mine is actually working on solving that, but right now this is the state of things. There's a pull request out; you can check it out if you're interested. Now, one of the last things I want to mention, because I find it so surprising, is changing the tag of an endpoint. I kind of glossed over this previously, but when you look at how we used the client, we say client.user.list_users. The .user is the tag we gave, actually the first tag in the spec we gave to this endpoint, and list_users is the operation ID. Now, operation ID has "ID" in the name, so obviously you know you don't change an ID or things will break. But a tag change will also break clients. That is unfortunate, I might say. Here's the postmortem; all of this has happened in production at Yelp. The thing I find unfortunate here is that tags are typically used for documentation. At Yelp this change was done to improve documentation: as you build more and more endpoints, you probably want to use finer-grained tags so that endpoints are easier to find in your documentation. The developer just changed the tag, and all of a sudden all the clients broke. So just be aware of this.
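Why a tag rename breaks callers follows from how Bravado-style clients expose endpoints as client.<first tag>.<operationId>. A simplified model, not Bravado's actual implementation:

```python
class Namespace:
    def __init__(self, entries):
        self.__dict__.update(entries)

def build_client(operations):
    """operations: (first_tag, operation_id, callable) triples from a spec."""
    tags = {}
    for tag, operation_id, func in operations:
        tags.setdefault(tag, {})[operation_id] = func
    return Namespace({tag: Namespace(ops) for tag, ops in tags.items()})

client = build_client([('user', 'list_users', lambda **kwargs: ['alice'])])
print(client.user.list_users())  # ['alice']

# Someone retags the endpoint "to improve the documentation":
retagged = build_client([('account', 'list_users', lambda **kwargs: ['alice'])])
print(retagged.account.list_users())  # ['alice'] under the new tag
try:
    retagged.user.list_users()
except AttributeError:
    print('every caller written against client.user just broke')
```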
In conclusion, I think the main takeaway is: when in doubt, version. Several of the examples could simply have been prevented if additional versions of the endpoints had been created. As developers we like to be efficient, we like to reduce maintenance cost as much as possible; just be aware that there is also a cost, a potential cost, when you don't version. So be very careful, and I can tell you from experience there are not a lot of automated tools available, scripts that will check for these changes. If you try to do this anyway, you need to rely on things like code review and testing to catch these issues, and the thing is, most of them are very hard to catch in testing, whether you're writing unit tests or integration tests or testing manually on stage. So: when in doubt, version. Second, and this is more of a general services thing: you have to deal with the network. There was a really great talk at PyCon this year about some of the things you might want, or probably need, to do when talking over the network. There's also a talk by my colleague that was given yesterday; I'm going to reference it in a bit. Also, and this is something I don't see mentioned that often: we always talk about whether a change is roll-forward or roll-backward compatible, whether we can go from state A to B and everything will work in state B, and whether, if we have to go back to state A, anything will break. But there's also a transition period; none of this is instantaneous. Think about what happens during that time, because you will have both code versions running in production for a non-trivial amount of time. Then, just for those migrating: be mindful of the differences between SwaggerPy and Bravado. I mention this because, as you've heard, the difference in API is actually not that big, so migrating your code is done really quickly; it's the behavioral changes that will potentially bite you, so pay attention to those. And I just wanted to slip this
in, because there's still some kind of a services hype, or even a microservices hype. We haven't had great success with microservices; we're more of a services company. I just wanted to mention something that Martin Fowler, and a lot of people who hopefully know what they're talking about, will also tell you: services are not something you do because they're great. Services are something you do because you have to, to solve other problems, to solve problems at scale. And scale doesn't mean traffic. If you're not able to reliably run your monolith, your one application, in production, you're not going to reliably run dozens of services in production; services don't help you with that. If you're having issues with deployment, if you're not able to reliably deploy your monolith to production, you're not going to solve that by deploying multiple services with dependencies to production. What services do is help you scale the number of developers. So if it's about that, about reducing the amount of blockage and the issues you face when developing with multiple teams on the same code base, then maybe services make sense. I just wanted to mention this. Other talks: as I mentioned, yesterday we already saw "Protect Your Users with Circuit Breakers". That is another talk that deals with one of the best practices you should follow when developing with services. The whole premise, as Scott talked about yesterday: if the call you make is a simple Python function call, you're good, it will succeed every time; but if you go over the network, then you should do a bunch of additional things to get less-bad behavior in case of failure. I urge you to check it out on YouTube once it becomes available. And if you want to know more about asynchronous networking, go see Loris's talk tomorrow. He tells you all about the nitty-gritty details: how to use it, how to do it in Python 3, how to do it with Python 2, and
gives you a lot more detail, and some insight into what you have to do if you don't use something like fido to make your network requests. Go check us out: we have a pretty nice engineering blog, and you can find us at the booth as well. I want to mention that we're doing a raffle right now; we're giving away a pretty nice drone, so if you want that, come find us. We're giving it away on Friday, so you should be around on Friday. And that's it, thank you very much. Thank you, Stefan. We'll have a round of questions and answers; please wait for the microphone so that we can hear you on the recording. Thank you, this was a really great talk. Does the fido client support Twisted in any way? Yes, internally it actually uses Twisted and crochet. As I said, it's not doing any magic, but it hides a lot of the complexity. If you do things like talking to your data store and you want that to be asynchronous as well, you have to do it yourself, manually; fido will not help you with that. But if most of what you do is network requests, I think fido is a pretty great tool for doing that without having to deal with all of the peculiarities. I'm asking because I'm interested in whether, if we are using Twisted, we can just yield the requests and it will work with all the Twisted machinery already present in our code. Sure, absolutely; if you want to do that yourself, you have absolutely more power, more flexibility. Thank you. Yeah, thanks a lot, it was a great talk. Actually, I'm writing a client for an API right now, and I'm realizing that I'm kind of reinventing the wheel, reinventing the Bravado library in a sense, because I didn't know about it. And I also wanted to know whether Bravado generates all the mock requests, for example, that you would need if you're writing a client, at the same time as generating all those nice request functions for you. Yeah, that's a great question. I didn't talk about testing at all. Unfortunately, right now there is no open source solution for that.
We have something that is being developed internally; I hope, though I cannot promise anything, that we can open source it. We also have something written by my colleague Loris, a mock server. You still write your mocks manually for testing right now, but it will then serve them: it reads the Swagger spec, finds the appropriate mock, and serves it, and all of it is actually validated to make sure the mocks conform to the Swagger spec. I hope we can open source something that will look at the spec and then generate data of the correct form and type automatically, but sorry, nothing is available right now. Just to understand the picture correctly: Swagger doesn't help you write the backend, it only helps the client to validate data sent and received? Yes, that is true; it doesn't really help you there, you still need to write the backend yourself. It does provide help, with Pyramid Swagger for example, for serving the spec and for making sure your endpoints conform to the spec, but the implementation you have to write yourself. And exposing the Swagger spec, the files with the description, is that also up to you, do you have to expose it via HTTP somehow? If you use Pyramid Swagger, which is what I'm most familiar with, this is one of the core features: you just point it to the file on the file system and it will do everything for you, not only serve it but potentially, as I mentioned, also read multiple files and combine them into one file to increase efficiency, and things like that. So for Pyramid, it takes care of that. I think Connexion does the same thing for Flask, but I haven't used it, so better read up on that. About the hypothetical mocks: are you going to do them in-process, for example like the responses library works, or is it going to be out
of process, as a separate server like WireMock or mountebank? Both are possible; it depends on what type of testing you're talking about. What we have is, obviously, the real service implementation, and then we have this so-called mock server, which you can spin up just like a real service, and instead of doing real work it will serve mock responses. This is really great not only for integration testing: as I said, we are a backend for apps, Android and iOS apps, and they can use this for their integration testing without having to spin up everything at Yelp. That is one thing. Then, for the other part, we do have something where, in a testing environment, you basically mock out the service call on the client side and it just generates a mock response for you. This is for when you want to unit test your client: we have an acceptance testing setup that spins up a bunch of Docker containers, and it's sometimes a little bit brittle and takes a lot of time, so you probably want to unit test as well, and this helps with that. The alternative, and most of our code still does this, is to just write the response manually: you mock it out and write the response by hand. It works, but you should preferably have some form of acceptance or integration testing on top, because we've had several issues where the mocks at some point just don't conform anymore, are not what the service really returns anymore, and that's when you have issues. Okay, thanks. Just a quick question: how does Pyramid Swagger relate to a library like Cornice, which also gives you convenience features to create a REST API? Are they related at all? Cornice, the Mozilla one? I'm sorry, I'm not familiar with that. Then, does Pyramid Swagger provide, other than serving the spec to the Bravado client, route generation features? Yeah, all right, yeah,
basically you do that; we can go over the details over there afterwards. But actually, Scott, I'm not sure if he's here, there he is, he's the main author of Pyramid Swagger, so he'll probably be the best person to talk to about all of the details. Thanks. About Swagger: you can have JSON input and output, and in the spec file you have a description that looks really like JSON Schema. Is it exactly the same thing or something different? Yeah, great question. The Swagger specification is actually based on JSON Schema. It's not a strict superset: it adds some things, but it also takes things away. Regarding the issue I talked about with the null value: JSON Schema is actually able to deal with that, because you're allowed to give a list of allowed types for a field; unfortunately, in Swagger this is not allowed. So most of the time it behaves like a superset, but not everything in JSON Schema is supported in Swagger. Any more questions? So, thank you for your attention. If you want more information, you can contact him. And also, thank you.