Thank you, and sorry for the flicker. I hope you can manage to see the slides anyway. Who here knows what Yelp is? Oh, that's actually quite a few people, so I'll be quick.

Yelp's mission is to connect people with great local businesses. We have a website, an app, and a mobile website. We have 142 million unique active visitors monthly, we have 77 million reviews by those users, and we are available in over 30 countries. We most recently launched in the Philippines. And we're not only for finding great restaurants and bars, but also great doctors, great shops, and any other kind of local business.

What is probably less known is that we also have Yelp for business owners. If you're a business owner, you can come to Yelp, claim your business, and mark it as your own. You can then measure visitor activity on your Yelp page and interact with customers: if customers leave a review for your business, you can reply to it in public or in private. You can also upload photos for your business and do a bunch of other things.

So who am I? I'm a back-end developer for the biz owner app. I worked on the main Yelp app back end before that. I've been a Python user since 2008. I did a lot of Django work back then, before switching to application and now mobile development.

So let's take a look at why we're going to... okay, that's already not good. There's supposed to be an image there.

About Yelp: we were founded in 2004, and one of the co-founders, Jeremy Stoppelman, is still running the company as our CEO. And all of our code was in a central repository.
We called it Yelp main, which means the code for the website, including templates, the mobile web, the mobile back end, and the business owner site, was all in one repository. That means we had a lot of homegrown code. As people worked on it, they introduced new abstractions but didn't remove the old ones, and it became hard to reason about the code or to do big refactorings. As Yelp grew, and we are still growing and still hiring people, this started to become a bottleneck. At one point we had three different ways to execute SQL statements in Yelp main. That was not nice. So we cannot really refactor all the code, and I want to dig deeper into another area that really highlights our bottleneck.

I'm really sorry about all of this. Finally, we see the images. This is what Yelp looked like back then. So, we had a lot of homegrown code and a lot of abstractions; I talked about that.

Let's talk about the push process, which is what we call it when we deploy Yelp code. We deploy Yelp code several times a day. This is done by a push master, an engineer who has production system access. People take their code changes, their code-reviewed changes that they want to push to production, and join a push, and then the push master runs this push. We have several tools to assist us in doing this. You see a screenshot of push manager, which is actually open source, where we manage the pushes and where people can say, "Hey, I want to join this push and push my changes to production." As you can hopefully see, it's not that clear.
There's a small red bar next to a push that I ran, and it's red, which means this push didn't make it to production. We had to abandon it, and I'm going to talk a bit about why this might happen.

When we run a push, at first there are some automatic checks that take all the changes and build a deployment branch where all the changes are merged. That deployment branch is then deployed to a stage system. Then comes manual verification: all the people who joined this push need to be present, and they need to verify that their changes work on the stage system. If everything is okay and we are happy with our test suite runs, we send this branch to production. We do the same thing there: we watch production for a certain amount of time, and if we are happy, the push gets certified and the changes get merged into master. From then on the changes are live and we're done. When people start to work on a new change, they will branch off of these changes.

This is a two-hour process with really no upper bound. Why no upper bound? Because if we find problems, let's say on the stage system, we need to take out the problematic change, rebuild the deployment branch, put it on the stage system again, run our test suites, go to production again, and so on, until we have a new code version that is good and that we can leave on production.

You see here a screenshot of another tool that helps us during the push, where you can see which hosts in our data centers are already running the new version of the code (the green bars), which hosts are running the old version of the code (the red bars), and which hosts are in the process of bouncing, so switching to the new code version (the yellow bars).

I have to say, our release engineering team is hard at work optimizing this process.
They're really making it better and better, and more automatic. But still, it was obvious that this doesn't scale. You can run only so many pushes a day, and the more people join Yelp, and we are still growing, the harder it becomes to have a push without issues.

So some intelligent minds sat together at Yelp, thought about a solution, and found one. I don't think we are the first company that came up with this solution, and it's kind of obvious: we need to modularize. We are at a size where you cannot work with a single code base. You can only run so many pushes a day, as I just said, and even if you increase the number of pushes, the number of people who develop at Yelp also increases, so you will run into problems eventually. So let's build services.

How do services solve this problem? Well, each service is developed and deployed independently, so you don't need to know about this huge code base; you just need to know about your service. Service pushes are very easy and very quick, and people can do them themselves after a short training. Services usually cover only one aspect or one set of features, which also makes it very easy to introduce new technologies, refactor code, reduce technical debt, all that kind of thing. It might even bring some performance benefits, since when you have this big monolithic code base in Python, it's not that easy to parallelize things, right? When you switch to a service-oriented architecture, you might think at first, "Hey, I'm doing network requests instead of function calls; this should be slower." But it might actually be faster than before if you do those requests asynchronously at the same time and just wait for the slowest one to come back.

We also wrote down the Yelp service principles. It's a list of dos and don'ts for services, or ways of reasoning about services. Go check it out; it's on GitHub.

So why might we not want to do services?
Because, well, chaos might ensue. I actually stole this from Fred Hatfull, a colleague of mine. He also gave a talk about services. You can find it online; it's really good.

So why not services? Consistency is really hard; it's actually non-existent. There's no such thing as a transaction over several service calls. You don't have a clear dependency or usage graph, which means you need to maintain your interface, your API, forever, since you don't know who is going to use it and for how long. It also affects testing. Testing one huge self-contained code base is easy. It might not be simple, but it's easy. But how do you test your loosely coupled services that are out there? This is the chaos I was talking about.

How do you make sure stuff doesn't break? Unit tests. Everybody loves unit tests, or at least many people do. In my opinion they are great, but they are not enough, since in a world of loosely coupled services a breakage often occurs at the interface level. Some service you call subtly changes its interface, its API, maybe without the developers even intending to, and your call breaks. The answer is not what you expected it to be, and this is a huge problem.

What's our solution to that? Our solution is acceptance tests. Instead of mocking, either at the function level as with unit tests or further out, we don't do any mocking. We run all the code, from the request as it comes in, through all the services we might call, the whole workflow, and then we test the response we get and make sure it's what we want it to be. It's as close to production as possible without setting up your own dedicated stage environment, and it's what we do at Yelp. We put all the components, the services, and anything else you need in Docker images. We spin up those Docker containers, and we use production code for these containers. And we use Docker Compose.
It was previously called Fig, and we use it to manage this infrastructure. It's a bit heavyweight, so it takes a long time to run. Not the tests themselves, those are actually fairly quick, but spinning up all those Docker containers and setting everything up takes quite some time. And it grows with the number of services you have: you call more services, so your acceptance testing setup has to grow accordingly. So it's a bit heavyweight, but we're really happy with the results, since it gives you a certain amount of confidence in your changes. You can say, "Yeah, this is going to work in production."

Just an example of what this might look like: this is part of the acceptance testing setup for the biz owner app back end. We have some configs. We have the main biz app definition, where you can see under links all the direct dependencies we have. Those are several different services within Yelp. Internal API is actually a service front end to Yelp main, our legacy code base, and you can see on the right that it has a bunch of dependencies itself. So that's how your acceptance testing setup grows.

As I said, it can be a bit cumbersome, also when setting up test data, because some services have their own data store. When you create your test fixtures, you need to make sure all the services are in sync and have the same data so your tests work. But overall we're pretty happy with it.

So now that we know why we do services and how we make sure they don't break randomly, what's our service stack?
We originally started with Tornado, but that didn't work out quite as well as we hoped. Our current stack is Pyramid, just the latest version of Pyramid, with uWSGI and SQLAlchemy, and it works out quite well. We use HTTP, obviously, as the transport protocol and JSON for the data format. One very important building block of our service stack is Swagger, which is an API framework. With Swagger you specify your API; you actually write JSON to specify it. There's a bunch of tools included. One of them is Swagger UI, which helps you visualize the API you just defined. This is what it looks like for a random service at Yelp; I think this is business media. You see your methods, GET and POST, you see the endpoint, you see the request parameters that endpoint expects, and the data model you will get as a response. So you can browse and find all the endpoints you might need for your work.

Swagger does more. It also does request and response validation. That's optional, but I would encourage you to activate it, since it makes sure your request parameters are there and of the type given in the spec, and that your response actually fits what we saw here in the data model. It does data structure and basic type checking at the individual field level.

And it works dynamically by reading a service's spec. There's a library called Bravado; it used to be called swagger-py. It's open source, and it's on our GitHub account. It dynamically reads the spec and generates your stubs, so you can do function calls or method calls in Python that actually do the HTTP requests.

We used to do that with client libraries, which was quite painful. Say you wanted to develop a new endpoint for your service. You would do that, then you would check out the client library.
You would generate the stub code for that new endpoint and commit it. After it went through code review, you would bump the client library version number, and only then, when people upgraded to the new client library, could they use your new endpoint. swagger-py, or Bravado, takes care of all of this for us, and that makes it really nice to work with.

So let's talk a bit about a specific service, the biz app service, which is the service that powers our biz owner app clients, the Android and iOS clients. It's a bit of a special snowflake, since it's one of the very few services at Yelp that you can reach from the outside; usually you can only reach them from the internal network. Also, unlike other services, it's not constrained to one set of features or one area; it contains the whole API for our app clients. And it has no local data store. Many services have their own data store; we don't. So oftentimes we are just a proxy: we call other services, we call Yelp main, we aggregate data, format it, enrich it, and return it to our clients.

So what does our mobile API look like? Well, it's a RESTful API, right? One resource per endpoint, multiple calls to fetch related resources, blah blah blah. You probably already know all of this, and this is how we develop services at Yelp, but it's not how you do a mobile API. Because you are on a cellular network, you want to be as efficient as possible and do as few calls as possible. So what we do is we have one endpoint per client app page.
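As a rough illustration of the one-endpoint-per-page idea, here is a minimal Python sketch. It is an assumption for illustration only, not Yelp's actual code: the three fetch functions are hypothetical stand-ins for HTTP requests to internal services, and `business_home_page` is a hypothetical page handler that runs them in parallel and returns one combined payload.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for calls to internal services. In a real back end,
# each of these would be an HTTP request to a separate service.
def fetch_business(business_id):
    return {"id": business_id, "name": "Example Cafe"}


def fetch_review_count(business_id):
    return 42


def fetch_unread_messages(business_id):
    return 3


def business_home_page(business_id):
    """Build the whole payload for one (hypothetical) app page."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        # Start all backend calls at once instead of one after another.
        business = pool.submit(fetch_business, business_id)
        review_count = pool.submit(fetch_review_count, business_id)
        unread = pool.submit(fetch_unread_messages, business_id)
        # Collect the results into a single response for the client.
        return {
            "business": business.result(),
            "review_count": review_count.result(),
            "unread_messages": unread.result(),
        }


if __name__ == "__main__":
    print(business_home_page("biz-123"))
```

Because the three calls run concurrently, the handler waits roughly as long as the slowest backend call rather than the sum of all three, and the client needs only one request over the cellular network.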
So for every page your app displays, we want it to do, if possible, just one network request, and we send it all the data it needs. For POST endpoints, whenever you want to save something from the client, we not only acknowledge that the write happened successfully, but we also send the client back all the data it might need to display follow-up pages. This is quite different from our lower-level service APIs, which really are more RESTful. So you can say that we aggregate: we typically do many service calls for one client request we get, aggregate the data, send it back, and act as a proxy.

So what does it mean to develop a mobile app back end? I come from web development, and I imagine some of you do as well. It turns out mobile apps have releases. In our case, for the biz owner app, they are synchronized, so we release Android and iOS at the same time with the same set of features. And iOS apps need to be reviewed, as you might have heard, and this takes quite some time. I remember it used to be about five days. Our longest review time recently was, I think, 11 days, and I think it's back to about nine days now. That's quite a long time, and you probably want to test the whole thing before you release it, and in this case releasing means submitting it to Apple for review. So our API needs to be done sooner than the client implementation, which means it needs to be done way sooner than when the app is released, that is, when you can download it on your phone.

How else is it different from web development? You cannot upgrade the client whenever you upgrade the server. In fact, some clients never upgrade. We still have a tiny portion of users on the 1.0 release of the biz owner app, which we released late last year. So unless you want to drop support for those users, you need to support your API forever, which means you cannot make backwards-incompatible changes. How do we deal with that?
We do a multi-version API. We have the same endpoint with a different version; in this case we append the version at the end. And we maintain and test all versions to make sure they still work. Obviously this costs something. Maintaining multiple versions costs effort, and we don't want to do that needlessly. So what ways can we think of to make sure we don't have to do multiple versions all the time?

It turns out that if you just add data to the response, that's backwards compatible. Our legacy clients, the old clients, will just ignore it. And our response validation, which Swagger does for you, is also smart enough to ignore additional data; it just makes sure that the data defined in the spec is there. And once we develop on the server and add that new field, we also add it to the spec, so our response will validate.

This is what it looks like. This is an example out of a JSON spec for an API, and you see the green part: we just added one field, time zone, of type string, with a description. We could do that without having to make any other change to that file. So we didn't introduce a new versioned endpoint; it would just work.

So how do we make sure it actually does work in production? Well, we do some monitoring. We monitor the number of requests, the server errors, and the push notifications we send. Here are some examples. This is an older tool, you can almost not see it, where we look at the types of errors that happened and at the rate of errors. We now have a bunch of nice Kibana 4 dashboards, where you can do almost anything you want. We send a bunch of our metrics to SignalFx, so we can build nice dashboards, visualize the metrics, and analyze them. And we use ElastAlert, which is a really nice open-source tool. We open-sourced it, I think, last year. You should really check it out.
It's very easy to set up your own alerts, so you know whenever something is wrong. For app crashes, so whenever our client apps crash, we use Crashlytics, both on Android and iOS. And as soon as you reach a certain size, you probably need an on-call rotation, so you need to wake people up whenever things break. We use PagerDuty for that. We have integrated ElastAlert into PagerDuty, so for severe errors we get paged. We have integrated Crashlytics as well, so if our app crashes spike, we get paged.

That's basically already it. I just want to mention another talk, if you're interested in services. Scott Triglia is going to give another talk about services: "Arrested Development: Surviving the Awkward Adolescence of a Microservices-Based Application." That was hard to say. It's on Friday at 11 a.m. in the PythonAnywhere room. Go check it out; it's really a great talk.

Also, some other shameless plugs. We are hiring, so if that sounded interesting to you, check out yelp.com/careers. Even if you don't find your ideal job opening, contact me, or contact us at our booth, and we will figure something out. We are always looking for talented people. We have offices in Hamburg, Germany, which is where I work, in London, and obviously in San Francisco. We also have an engineering hub where we aggregate our blog posts and document our open-source efforts, and more. We are on Twitter.

And last but not least, this is a fun one that I urge you to check out: the Yelp Dataset Challenge. If you ever wanted to do some data analysis but you didn't have data, or you didn't know what to do, go look that up. The last one just ended, but the new one will probably start before the end of summer. Go check the website; it will be announced there, and the deadline will be sometime around the end of the year. It's a lot of fun, and you might even win some money.

So that's all. Thank you very much.
Sorry for the technical difficulties, and if you have any questions, just ask.

Can you talk a bit more about your aggregator? Is it something open source? Do you plan to open-source it? Is it a process on its own, or is it a web server module? How does it work?

Aggregating what?

For your APIs. You say you have an aggregator for making only one request per page and not ten requests.

That's basically what our biz app service does. When a client app makes an HTTPS request to our servers, it hits our service, and our service does everything it needs to satisfy that request. It will do multiple service calls, aggregate the data collected from different services and from Yelp main, put it together, fetch related data, everything the client needs, and then send that back to the client as JSON over HTTPS. So our service, what we develop, is the aggregator, and it does all of this. And sometimes when we want to aggregate data, there is, as I mentioned, this internal API interface into Yelp main, and sometimes there is no interface for the data we need, so we will also develop that interface and then use it to fetch and aggregate the data.

I actually have two questions. The first one is: what was the problem with Tornado? Just curious.

Yeah, I was fearing that question might come up. I was actually not at Yelp when this was tried out. You might try to ask Scott about it; I don't know if he knows. Or you can come to our booth, where there are other people we can ask. Honestly, I cannot tell you. I just know that it didn't work out well, and now we are really happy with Pyramid.

Okay, and the second question is: how do you handle logging in the Docker containers?

Yeah, that is an issue. We do logging. It's done inside the Docker containers.
We expose the logging folders as volumes, so you can start another Docker container, mount those volumes from the individual Docker containers, and look at the logs.

Hi, thanks for the talk. Can you comment more on your development process? I can imagine that testing is a huge overhead. Do you have a separate team that does the testing, or is every developer able to set up the whole infrastructure with Docker? How does it work?

Thanks, that's a great question. Yes, every developer is supposed to not only write the code but also write the tests. Typically the development process is that you create a branch in Git, right? You do your development, and once you're ready, you post your changes for code review. Other developers review the code, and hopefully, if they pay attention and you made a change without adding tests for it, they will say, "Hey, you should write a test for that." The developer then decides: is a unit test enough, or do I need an acceptance test? But yes, every developer is able to run the whole test suite on their local machine. When I say local machine, we have something called a developer playground, where you log into a machine that you do your actual work on, and it has everything ready for you to run the tests. Docker is installed, everything is there, and we have our own local Docker registry. So it will just run the tests, and you can write them and run them by yourself.

Hi, thank you very much for the talk. I just have a basic question. How do your services actually communicate? I didn't get this.

You mean with us in San Francisco?

No, your internal services, because you said you have a modular structure.

Yes, so it's just HTTP calls with JSON data exchange. That's basically how they all communicate, unless it's something very special.

Thank you.

You mentioned the acceptance tests when you do deploys. Do you do any locking to avoid
two deploys of services that are required in the acceptance tests? So if one service depends on service B, and both this service and B want to be deployed, what do you do then?

Ah, yeah, okay. Since it's so loosely coupled, either you have to pay really good attention or you cannot do that. Each service is considered independent. So if you have a deployment of your service and it needs some changes in other services before you can do that deployment, you as a developer, as the owner of that service, need to make sure you don't deploy too early. Usually when we do a deploy, the whole test suite is run as the first thing, even before we go to stage. So hopefully, if you have good test coverage, you would notice then that things are not there. But generally, since we have loose coupling, it's your responsibility that you don't deploy changes that break other services, that you remain backwards compatible, and that, if you depend on changes from another service, you do the deployments in the right order.

Okay, thank you.

Hi, thank you for the talk. Can you please clarify what the problem with testing services is? You said that each service has a specification, and there is some tooling around it, like Swagger, which verifies that a server response matches the specification. So say there is a service A which depends on service B. Why can't service A just expect that service B always produces valid responses? It looks like you don't really need to run requests through all the services, with the production version of service B, for that.

Yes, great question. It's mainly about the human factor. We don't have the tooling in place to check for this automatically. Basically, our acceptance tests are that tooling, to make sure a new service deployment doesn't break anything.
It has happened multiple times in the past. I remember once a service got deployed which then broke something inside Yelp main, because Yelp main called that service and the developers just didn't think of the fact that this small change was actually not backwards compatible. Since we don't have something rigid that makes sure you cannot deploy a change that's not backwards compatible, we have to write tests for that, and that's basically our check.

Hi, thanks for the good presentation. If I understood correctly, you have a mobile API, an API specific to mobile applications. Have you ever considered returning a slightly different structure or different information to different types of devices, for example different for Android and iOS?

Yes, we actually do that for what we call our consumer apps, so basically the apps you download on your device. There it might depend not only on the device type but also on the version of the app you run and on other factors. So yes, we do that. For the biz owner app, up until now we have been able to get away with just differently versioned endpoints and sending generally the same data to both Android and iOS. But we are in the process of developing something similar for the biz owner app as well.

Doesn't it make testing your API more complicated?

Oh, much more complicated. Basically any of these checks is another branch in your code. So yes, that's why we are trying to avoid that as much as possible. Completely agree.

Thanks.

I think, one more? Okay.

Thanks for the talk. So the question is: you had Yelp main, and what were the reasons, how did you recognize that you needed to decouple this main into services?
I mean, how do you recognize what becomes a service first? What's the definition of the first service?

Well, any new code we write, if it's reasonable, we try to put inside a service and then use that from Yelp main or wherever, to decouple our code. There are also efforts going on to take code which is already in Yelp main, extract it, and put it into services, just so our development speed can increase. Ramping up new developers actually becomes much faster, since the code is simply less complicated and less huge. That's difficult to answer in general, but if you look at our Yelp service principles, there is actually good reasoning about that. I cannot mention all of it, but I encourage you to check it out; it really covers that topic. And if you want to talk more about this, I will be at the booth now. There's also a bunch of other awesome Yelp engineers there, so just come talk to us. We're happy to nerd out about this.

Thank you again.