service-oriented design in practice. Hopefully a good complement to the material that Chris just presented very well. So, tough act to follow. Like I said, this is how you can get a hold of me on Twitter. I work at a company called Efficiency 2.0, and a lot of this talk is going to share the experiences we've had over the last year in implementing and, perhaps more interestingly, maintaining a service-oriented design through the course of building our application. Briefly, what our application does is it pulls in a bunch of data about how people use energy, and it makes targeted recommendations for how a given homeowner might best reduce their energy use. So, the name of this talk has service-oriented design in it. Chris's talk had the name service-oriented architecture, so what is service-oriented design? Paul Dix wrote a book, Service-Oriented Design with Ruby and Rails, and in the book he had this quote, which I really liked. He says: service-oriented architecture has become a loaded term. To many, it implies the use of tools such as SOAP, WSDL, WS-*, or XML-RPC. This is why we use the word design as opposed to architecture. So, this isn't a paradigm shift. Paul was just trying to free the concept of service orientation from the baggage of those enterprise protocols and legacy technologies, to give people a fresh approach and let them consider it for their own problem domains. So, what's the point of this talk? It all comes down to this. This is a quote by a guy named Louis Brandy. He wrote this on his blog about a year ago, and I remember reading it and it stuck with me. He says: never trust a programmer who says he knows C++. I know this graph is hard to read, but I'll walk you through it. The graph is of a programmer's self-confidence in C++ over time.
So, in the upper left, the programmer has decided that they know C++. C++ is just like C with classes, which is awesome, so they have full confidence in C++ and they're using it for everything. And then a funny thing starts to happen. The programmer starts running into some issues with C++. The template error messages are pretty confusing. Reference types might be a bit too magical. They find themselves asking, WTF is a virtual destructor? Exception specifiers are worse than Java's. Static object initialization segfaults. And then, finally, the programmer ends up in the trough in the middle of the graph and says: we need some rules. And by thinking through some rules to apply to their use of C++, they're able to regain their confidence in C++ and actually use it without running into so many of those problems. So, the thesis I'm presenting here today is that service-oriented design, like C++ — and, I believe, like many other difficult programming concepts — is a two-peak concept. This talk will walk through my trajectory in thinking about service orientation over the past couple of years, both at Efficiency 2.0 and before that at Guild, which uses a lot of services. So, I reached a point where, kind of like Chris alluded to, service-oriented design was going to be the solution to everything. It's like object-oriented programming: the single responsibility principle applied at the system level will encapsulate everything. You can read blog posts out there about Amazon; I think they're the prime example of this, and everyone has heard that Amazon uses, I don't know, 50 or 100 services just to render their homepage. So you think about that, and I thought that every one of our problems could be cleanly divided into a nice set of services that would all coordinate and communicate together in a very structured way.
Now, service-oriented design does solve a lot of the problems that maintaining a large Rails app can present. But we also ran into some issues along the way. So, descending down the curve, we were running into issues like: our designer can't run the app anymore because script/server doesn't work. We've copied and pasted deploy scripts all over the place. We've got this bug and we don't know where it is — it's not just one data store and one set of code, it's three data stores and three sets of code. Our tests are green, but production just broke; what happened there? And now we have tests that will fail if there's a bug, but they're such a pain to run that nobody runs them locally; they only run on CI, and the CI builds keep getting longer. That was a big problem for us. So, a little background about our particular use case. Efficiency 2.0 has a team size of about five. The code has been around in various forms for between one and two years now. And this is roughly what it looked like when I dropped in, about a year ago from now. We had a front-end user-facing application, pretty standard, and then we had three services supporting it, connected in some cases with RabbitMQ. We had a bill collector that pulls in a bunch of data from our clients, who are actually electric utilities. We had a calculator, which is this really crazy energy-science algorithm store that churns on all that data. And we had a weather service, which takes in hourly weather data from a few sources and smooths it all out, making it readily accessible to the calculator, as you can see. So, what does our architecture look like now, about a year later? We've simplified things generally. We still use services, and we're very happy with the services we're using, but the bill collector, as you can see, is gone. We folded that into the user-facing application when we re-evaluated some of our core assumptions around it.
And, you know, we've still got the calculator, still the weather service. We've been reducing the number of dependencies between services, though, trying to get one-way dependencies instead of two-way dependencies. We've gotten rid of RabbitMQ entirely. In the end, we found RabbitMQ was a very good tool for what it was intended for, but it was a little more than we needed. We needed more visibility, where RabbitMQ was providing more speed, so we optimized for what we need now. We might return to RabbitMQ as our scalability and performance needs increase, but for now we're going with something a little simpler. So, what sort of benefits can you get from service-oriented design? This is a lot of stuff that Chris touched on, so I'll go through it pretty quickly. First off, there's isolation. This is one of the core benefits of services, and all the other benefits in some way flow from it. Just like you might expect, using service-oriented design you can take components of your application and break them down — break down their data needs, break down their tests — into smaller units, and as you all know, smaller units are generally, by default, easier to maintain, unless there are other factors that make them more difficult. Robustness: this is kind of like the encapsulation principle. If you have a well-defined API — probably HTTP, REST, JSON, all those good things, whatever the Ruby programmer in you likes — you can change out anything going on under the hood without having to modify any of the clients. So, in that way, your service-oriented design might be considered robust to change. Scalability: you can optimize different sections of your application for different operational characteristics — for example, write-heavy data storage versus read-heavy data storage. You can use two different data stores, wrapped up behind HTTP.
Agility is kind of an interesting one. People don't think of service-oriented architecture as something that increases your agility; most people think it goes the other way. But there are some interesting ways it can increase your agility. For example, in maintaining our services, we wanted to start upgrading our Ruby VMs to Ruby 1.9. We're able to do that in smaller increments, and that lets us get some code onto Ruby 1.9 faster, and probably finish the entire transition faster, because it's not so daunting. So, in that way, it allows us to be more agile. Interoperability: taking disparate systems and connecting them together. Twitter's a good example — they have a bunch of Scala services that provide HTTP endpoints that the Ruby code uses. Everything can speak a common language. And then reuse: you can take a service that maybe backs a public API and start using it internally, or vice versa — you've got a component, and you can start thinking about how to apply it to other problems. So, why did we use services originally? We were looking for reuse, we were looking for isolation, and generally we just wanted smaller components to work with. But the rest of this talk is primarily about everything after the tip of the curve. So, I mentioned five problems we ran into earlier, and I'll walk through them with you: describe how they manifested for us and what we did about them. Hopefully, if you're maintaining a service-oriented design now, this can provide some tips on ways you might make your day-to-day maintenance of that system better. And if you're not maintaining a service-oriented design, one of my goals is to give you a whole other set of considerations, which might be a little less obvious, to weigh in your decision of whether or not you want to apply these concepts. So, I listed the five problems before. They really span four different areas.
These are bread-and-butter areas of any application that's being maintained: local development; deployment; operations — what happens once that code goes out to the servers and starts processing requests in production; and testing. So, the first problem we ran into as soon as we introduced our first service was that our designer couldn't run the app anymore. We have a guy who can find his way around a Rails repository; he writes HTML and Sass for us, in addition to being awesome at Photoshop, that sort of thing. But with services, suddenly cd-ing into a directory and running script/server didn't work. I mean, it's a very quick way to get a very nice big 500 page, but it wasn't going to cut it. And if our designer can't run the app, that might mean we're the ones fixing all the IE compatibility issues he would otherwise take care of for us. So, this was a problem that needed to be solved immediately. What did we do? We created a single command. We have an e2o command, and it has sub-commands. So e2o start will boot up the Efficiency 2.0 app. The whole idea here is pretty straightforward: whatever you need to do to be able to start the app at this point, do it, and end with the app working in a local browser. That means cloning repositories, running Bundler, running migrations, booting Unicorn processes in the proper order — all of that is encapsulated here. And if this thing doesn't give you a working app, that's a bug. We put that into a package called e2o-cli. This is a private gem that we maintain; it's hosted on our private RubyGems server. And here's a sampling of the sub-commands it has. e2o start is pretty obvious. One of the more interesting ones — one of the big ones, which I added and really like — is e2o inventory. If I run e2o inventory, across all of our repositories it will list out all the code which hasn't been deployed to production.
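The real e2o tool isn't public; as a rough sketch of the sub-command idea, here's a minimal dispatcher in plain Ruby. The command names follow the talk, but the bodies are illustrative placeholders, not the actual implementation.

```ruby
#!/usr/bin/env ruby
# Minimal sketch of a team CLI with sub-commands, in the spirit of
# `e2o start` / `e2o inventory`. The command bodies here just narrate
# the steps; the real tool would clone repos, run Bundler, etc.

class TeamCLI
  COMMANDS = {
    "start"     => :start,
    "inventory" => :inventory,
  }

  # Dispatch the first ARGV word to a sub-command method.
  def run(argv)
    meth = COMMANDS[argv.first]
    return usage unless meth
    send(meth, argv.drop(1))
  end

  def start(_args)
    # Placeholder for: clone repositories, bundle, migrate, boot app
    # servers in the proper order, then open the app in a browser.
    ["cloning repositories", "running bundler",
     "running migrations", "booting app servers"].each do |step|
      puts "start: #{step}"
    end
  end

  def inventory(_args)
    # Placeholder for: list commits not yet deployed, per repository.
    puts "inventory: listing undeployed commits per repository"
  end

  def usage
    puts "usage: e2o [#{COMMANDS.keys.join('|')}]"
  end
end

TeamCLI.new.run(ARGV) if $PROGRAM_NAME == __FILE__
```

A tool like this earns its keep simply by being the one documented way to boot the system, no matter how many processes that involves.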
So even something as simple as asking, well, what have we worked on that we haven't actually shipped yet, gets a lot harder with more servers in your architecture — you have to do it N times, and doing things O(N) is a lot less efficient than doing things O(1). We built the tool on Thor. I shamelessly ripped code from the Engine Yard CLI to make this work. We got it going in about a day, and we've been adding to it since. So that's one problem we faced, and how we addressed it. The second problem we ran into is that we just copied and pasted our deploy code all around. Nobody likes maintaining deploy code. Nobody tests deploy code. It just sort of starts to accumulate, like a lint ball underneath your couch. And when you add your first service, the first thing you do is basically cp -r the Capistrano deploy directory into your new application. And that's bad, because it's going to diverge in very subtle and potentially destructive ways. So cap deploy:restart might do something over here in this directory, but something subtly different over there. That can be a big problem when you're looking at the part of your codebase that actually makes changes on your running servers. So what did we do about this? We packaged our deploy scripts into a gem. We have another private Ruby gem called e2o-cap. Here's a directory listing, and you can see there's nothing special about it: it's got our recipes directory under there, and then for each concern we deal with in deployment, we have a Ruby file. This is very similar to the ey-cap gem, which many people are familiar with. You can do this yourself if you have multiple applications that you maintain and you want to get your deploy scripts just right. The nice thing about this is that we've extracted away all of the logic which is customized for us — it's exactly how we want it.
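A Capfile built on such a gem might look something like this. This is a hypothetical sketch in Capistrano 2-era syntax; the gem name, require path, version, and variable names are all assumptions for illustration, not the actual files.

```ruby
# Hypothetical Capfile for one service, in the spirit described here.
# Everything shared (restart behavior, symlinks, etc.) comes from the
# private recipes gem; only what is special about this app lives here.
gem "e2o_cap", "~> 0.10"
require "e2o_cap/recipes"

set :application,     "calculator"
set :repository,      "git@example.com:e2o/calculator.git"
set :hoptoad_api_key, "REPLACE_ME"  # per-app key so errors aggregate with the right project
```

The design point is that anything that must differ between applications becomes a declared configuration variable, and everything else is owned by the gem.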
But now, in our Capfile — this is the entire Capfile for one of our services — we just use the old-fashioned gem command to require the gem. Right now we're using version 0.10. We load the recipes, and then we just declaratively lay out what is special about this specific application relative to the other applications we deploy. So, for example, the Hoptoad API key differs per application, so that errors get reported to Hoptoad and aggregated with the correct project. That's a configuration variable. When we find that something needs to differ across projects, we create a configuration variable and bump the gem. So, on to some problems that have a little more meat on them. What's causing this bug? You've got a wrong number showing up on some screen, and there are a lot of different places that issue could be coming from: different data stores, different code bases. Eventually we decided we needed more introspection into what's going on in production, to be able to see what's happening. So we created a gem. This gem is actually public — it's up on GitHub, it's on RubyGems — and it does a few things that we found very useful in maintaining our service-oriented architecture. If you run curl against one of our services — this is our calculator service in staging, going through the load balancer — it adds a few headers which are kind of interesting. First, it stamps everything with a transaction ID. This is basically just a GUID. It's not useful in and of itself, but it's useful when you start to look at it in the perspective of the whole system, which we'll see in a second. Also, we add a served-by header; this gives the host name, in case there's something misconfigured on the one host that served the response.
That way we don't have to go fishing around trying to recreate the issue by checking each server one by one; you know exactly which server the response was generated by. And third — this is a really useful tip, I think — we put a revision header in our response. It's easy. We cache it in a class variable so it doesn't have to be looked up on disk every time, because that slows things down. But it's really useful when you've just run a request and you're trying to figure out: did that deploy finish while this was running, or was it the old code? This header will always be right. Like I said, it's open source, and everyone's able to use it, fork it, report bugs, send pull requests, all that good stuff. So, the other side of the coin is that we wanted more detailed logging. This is a condensed version of a single request that was processed through our front-end user-facing application. You can see that the user does a GET for a specific page, and then there are some lines in here which are not normal for a typical browser application, which I'll walk through. The first is the transaction ID. That's the same transaction ID that goes into the header. It also gets logged, in both the services and the application — whichever code served that request reports it in its log. And then it gets more interesting: for every request we make to services, we actually log information about the request and response. Generally we don't have logging turned up to where we log every DB query, because there's a lot of volume there, but for services we found it important enough that we do want to record every request and response. So you can see the HTTP method and the URL, and then this is the response. We record the HTTP status code that came back — so if that's not 200, we're already going to be a little anxious — and the server it came back from, in case that might have an impact on the response we saw.
We record the revision, so that when we look at this later we know what code served it up, even after we've deployed other code since then. And then — this is a little bit tricky — we record the transaction ID that was generated by the service. So if the calculator included a transaction header in its response, that's what's recorded here, while the transaction ID you saw on the second line is the transaction of the user-facing request. It's essentially a tree structure, where one request can generate other requests; but by keeping these cross-references, if I want to go look in the calculator logs and ask what happened when that request ran, I just copy and paste the highlighted transaction ID and grep the calculator production log on the app server to find the output for it. So, the next problem is a pretty big one. What we found was that, by moving to services, we had more code bases that each had their own test suites, but every once in a while we would have a bug in production while our tests were green. And we care a lot about the way we test things. Generally, whenever we feel like we have a bug that didn't cause a failing test, we view that as a failure in our testing approach or in the tests themselves, so we try to dive in and look at what's going on. So what would happen? In a simplified diagram of a call to a service, you have an application, and you have a client — which is maybe just some classes in the middle; ActiveResource could be a client — that just knows how to make HTTP calls to your service. And when those interfaces drift apart, things break even though each side's tests pass. So we thought about what we could do about this.
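Before moving on: the response-stamping described above can be sketched as a small Rack middleware. This is a minimal illustration, not the actual gem; the header names and the Capistrano-style REVISION-file convention are assumptions.

```ruby
require "securerandom"
require "socket"

# Sketch of a Rack middleware that stamps every response with a
# transaction ID, the serving host, and the deployed code revision.
class StampedResponses
  # Read the revision once and memoize it at the class level, so we
  # don't hit the disk on every request (assumes a REVISION file like
  # the one Capistrano writes; "unknown" if it isn't there).
  def self.revision
    @revision ||= File.exist?("REVISION") ? File.read("REVISION").strip : "unknown"
  end

  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)
    headers["X-Transaction-Id"] = SecureRandom.hex(8)  # GUID-ish stamp for cross-referencing logs
    headers["X-Served-By"]      = Socket.gethostname   # which host answered
    headers["X-Revision"]       = self.class.revision  # which code answered
    [status, headers, body]
  end
end
```

In a Rack app you would wire this up in config.ru with `use StampedResponses`; the same transaction ID would then also be written to the service's own log so the two can be grepped together.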
There are some obvious solutions. One is to do all of your testing through the entire stack: you boot everything up, just like you do on a staging server, you make requests to it with Selenium, you test every piece of your app's functionality like that, and you can be sure it all works together. That is an option. It has drawbacks. We took a little bit different tack, where we layer testing strategies together in order to get that confidence without taking such a large hit on our maintenance, so I'm going to talk about that a little bit. The first thing we did was make sure that we were versioning the client with the service. This is the tree of our weather service; you can see at the top of it we have a client directory, and it's just laid out as a gem. This was really important, because for a while we had one client which was not versioned with the service, and that required us to make changes to multiple repositories which had to be kind of synced up to make sure everything worked together. Moving things into one repository, so you can make the change to the server and the client in the same commit, makes it a lot easier. So the first thing we do is test the client against the service. Back to that diagram: we sort of draw a circle around the right side of it, and we make requests from the client to the server to make sure the responses are what we expect. So, how many of you have heard of Artifice? A few?
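For those who haven't: Artifice routes Net::HTTP calls to an in-process Rack app, so the client exercises the service without booting it. As a warm-up, here is the essence of that loopback idea in miniature, with the transport injected by hand instead of by swapping Net::HTTP; all of the names here are illustrative.

```ruby
require "json"

# A stand-in "service": a Rack-style app that returns weather readings
# as JSON. In real life this would be the service's Rails app itself.
WEATHER_APP = lambda do |_env|
  readings = [{ "temp" => 51 }, { "temp" => 54 }, { "temp" => 49 }]
  [200, { "Content-Type" => "application/json" }, [JSON.generate(readings)]]
end

# The client only knows how to GET a path and parse JSON. The transport
# it is handed decides whether that becomes real HTTP or, as here, a
# loopback call into the Rack app in the same process.
class WeatherClient
  def initialize(transport)
    @transport = transport
  end

  def hourly_readings
    status, _headers, body = @transport.call(
      "REQUEST_METHOD" => "GET",
      "PATH_INFO"      => "/readings"
    )
    raise "unexpected status #{status}" unless status == 200
    JSON.parse(body.join)
  end
end
```

Usage: `WeatherClient.new(WEATHER_APP).hourly_readings` returns the three parsed readings, with no server process booted. Artifice achieves the same effect transparently, which is why real Net::HTTP-based clients can be tested this way without modification.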
Okay, Artifice is really cool. It's a little bit time-bending when you first think about it. Artifice is a library by wycats. It's pretty simple — it's about 70 lines, but they're wycats lines of code, so that's like 700 lines of mortal-programmer code. When you call activate_with, it replaces the Net::HTTP constant with a totally different thing, and what it does is take calls to Net::HTTP, serialize them down into Rack calls, and send them to whatever you passed to activate_with. In this case, I'm activating Artifice to replace all calls to Net::HTTP with the Rails application itself, so that everything happens in the same process — it's kind of like a loopback connection. What that means is we don't have to boot any extra process to test the client, effectively over HTTP, against the service itself. So that's pretty cool; it's worth checking out for anything like this. We just put some records in the database, exercise the client class by instantiating one of our objects, and check that it has an association with the three rows that come out the other side. So that's testing the client with the service. Next, we test the app — it could also be another service, whatever depends on the service — with test doubles. This is pretty straightforward: at this point we're only testing the left side. We might have a test on the application side where we need some information, say electricity usage, from our service, and we just use a simple RSpec mock. There's nothing special about that. So at this point we've tested the app, and we've tested the other side, but the whole thing can still break down. For that, we test the whole stack — we circle the whole chart. We might use Cucumber for this: we say that, given this data, we make a call through the user-facing application and we get the results back. This does actually run with booted processes and makes the calls all the way down, putting it all
together: you can kind of see in the diagram how these testing layers stack, and by adding that layer across the top of the smaller testing units, you start to be a lot more confident, without having the difficulty of going through the entire stack for every behavior you want to test. We avoid FakeWeb. We used FakeWeb for a while, but maintaining those strings of HTTP responses became cumbersome, so we much prefer Artifice — or, wherever possible, this sort of vanilla test-double strategy — supplemented with the full-stack tests. So, the last problem: now we've fixed things so that if there's a bug, a test fails. That's really good — it prevents a bunch of bugs — but now the tests take forever to run. That's usually the biggest complaint. Interestingly, this is a complaint about monolithic applications and also a complaint about service-oriented architectures; it seems the test suite always takes forever. So I want to introduce this concept of the testing pyramid. I didn't come up with this — a lot of people have talked about it — but I think it's a really important idea that often gets overlooked, especially when it comes to Ruby testing in the era of Cucumber and other high-level testing tools like Webrat. The idea of the testing pyramid is that it's kind of like the food pyramid: the things at the bottom you might have a lot of, and as you go further up the pyramid you have less and less, but you need a little bit of everything in order to have a balanced diet. At the bottom are unit tests; they run really fast, and they tell you that your objects are doing the right thing. Above that you have integration tests; they test clusters of objects. And higher up you've got acceptance tests; those might run the browser, they might not. Think about it like this: you might have 100 unit tests for every 10 integration tests, for every single user-interface test. Those 111 tests will run much faster than 111 acceptance tests would, and
you can get the same level of confidence in your code by layering these strategies together in a more strategic way. So I caution everyone against inverting the testing pyramid: whether your architecture is monolithic or service-oriented, if you're writing tons of Cucumber tests and you're not writing unit tests, because it just doesn't seem natural, you are headed for pain. You can talk to me about that later. The underlying concept here is continuous improvement. I really credit continuous improvement with our ability to work out the kinks as we evolved toward a service-oriented architecture; I don't think we would have been able to do it without this approach. We do a lot of root-cause analysis. Any time anything breaks in production, we stop everything; right after stand-up we have a 30-minute discussion, trying to peel back the layers of the onion: what went wrong here, and how can we change our processes and tools to better address that next time? One term we use a lot in those conversations is proportional investment. Just because something went wrong once doesn't mean you should go spend the next three days building the ultimate solution. We didn't spend two days, or even a full day, working out the first iteration of the CLI that boots the app. What we did was create a very basic version which worked once or twice, and then we said: as a team, any time our designer runs e2o start and it doesn't start, we're going to invest an hour at that time. In the beginning we did that pretty frequently; over time it became less and less frequent, and it would be something ever more esoteric causing it to fail. We fixed those bugs as they came in, rather than trying to solve them all up front. So it was a proportional investment based on the frequency of the problem. A problem might occur only once, but if it's a really big deal when it happens in production, it might be worth investing a few days in resolving it — just try to be proportional. The other concept that
we try to apply is: treat development like production. Don't let your development tools slack — and I'm including testing in that. A lot of people are constantly tweaking their production environment; try to apply the same rigor to development, so that it's gradually getting better, rather than one big step where you wait until everything is so terrible — the tests now take 20 minutes and nobody can get anything done — before you make your first attempt at speeding them up. Try to start speeding things up when the suite goes above 30 seconds, because it's on its way to being 20 minutes. Make progress in small increments. So, our key takeaways related to applying service-oriented design to our architecture. First off, there are many hidden costs. None of the solutions that I described were things we planned on when we were deciding how we were going to build these systems. They can be okay, but they're not necessarily obvious when you're making these decisions. On the other hand, it can still be worth it. We're pretty happy with the way the calculator service and our weather service have worked out, and that's why we maintain them like that today. We haven't had to make many changes to the weather service, because different parts of a system change at different rates, and it has allowed us to wall off a whole portion of our architecture and not worry about it while we're making fast changes to other areas of the stack. Like everything else, service-oriented design can be very good, but moderation is key. Don't lay out your entire application in terms of how many services can be used to solve the problem; that's probably the wrong way to look at it. You'll end up with way too many services, and you'll be skiing down that curve from the beginning of the talk — try to get to the other side of it as quickly as possible. Finally, address frictions as they occur. Don't let those problems build up. Iterate with your team on
your process and how you solve problems. Your mileage may vary. I know teams that have no services at all — Etsy is an interesting example of this. They try to avoid services, because they decided that the costs associated with them are not worth it; instead, they invest a lot in making maintenance of their monolithic architecture easier. So they've gone one way. I've also seen teams that have a lot more services than we do, and they deal with that very well. So it's different for every team; just try to keep a close eye on what's working best for yours. Thank you.