architecture. It's probably the most technical talk of the four I'm giving this week. I started by sort of apologizing, because this is actually probably more about me, and some of the pain I've gone through since about 2005. The reason I tell the story this way is that I'm trying to make sure the pain you go through getting to services is not as severe as the pain I went through. So hopefully, as I go through some of the things I did, and the decisions that turned out to be bad ones, maybe you'll avoid those same mistakes yourself.

So the story starts in the beginning, before services, with basically a layered architecture. As you can see, we have some functional towers, we have some cross-functional services; of course, this is a picture of Stonehenge. And before you ask: no, I'm not old enough to be the architect of Stonehenge. But I find the analogy quite powerful, because this is how I describe building big applications. It's like trying to build another Stonehenge, and that's not necessarily a good idea anymore.

So this story starts with me living here in Bangalore, working on a million-line-of-code application. The application was quite well established at the time; in fact, at one point in time this application was the largest Sun-certified J2EE application in the world. And it started out its life very nicely. There was some very elegant design associated with this application. In fact, Martin Fowler came as a consultant to this particular project. He looked at the architecture and influenced it heavily, even to the point where he actually joined ThoughtWorks afterwards because he liked the company. A lot of the patterns in this book come from this application. So in the history and evolution of computer science, this application was actually quite important.

So what did I find when I got here to Bangalore in 2004 to work on this project? Well, we found a million lines of code. It had a couple thousand tests, which sounds like a lot but really wasn't. They had a 70% success rate on running the acceptance tests, and that was considered okay. And it was a different 70% every day; that was still okay. I was a little bit terrified of that. We had a bug database with over a thousand entries in it: a thousand things we'd found wrong with it that we hadn't yet fixed over the previous, you know, five to eight years. It was still being changed, in fact very actively changed, and almost no new tests were being written for all the changed code. So this was a bit scary, especially for me, with my agile leanings, thinking about TDD and all these other things; here was this monster that almost didn't want to be tamed.

One of the things I tracked very carefully was unit test counts, because I like to sit down and write code with colleagues; it's one of the things I enjoy doing. And a nice metric for whether development is actually progressing is to count the unit tests: how many new unit tests did you generate this week? I started tracking that, and I got this graph, which goes up and down and up and down. Of course, any systems engineer will tell you that the fact that it's oscillating like this is really not a good sign. There's something really horrible going wrong with your system.
And in fact, these are the weekly counts for new tests being written. You see we got a high-water mark there of about 180; when it's really bad, it's down to 40 or so. I had 50 programmers working on this, and this is all the new tests I was getting in a week. So some weeks I would get less than one new test per programmer. So what was going on? Because they were busily writing code. Actually, I can tell the story here, because I have the principal involved in the room. Badri, here, stand up, wave your hand. If Badri, who was working on the project, was actually writing code that week, I got tests. If Badri was doing something else, I got no tests. That's how much it was influenced by the one programmer in the group who was actually doing this. Nevertheless, we took it from 2,000 tests up to 4,000 tests; we doubled test coverage at some level, even through this oscillation.

So what's going on? Why was it so ugly? I thought I'd run an experiment: I'd just go in there and try to make an assertion. There was a nice concept in this system of a loan, and a loan had a value associated with the amount you were loaning. A great concept, a nice encapsulation. So I just wanted to write a test: let me build a loan object and make sure I built one. And in order to make this work effectively, because the programmers were complaining about how hard this was, I decided I'd come in on a Saturday so I wouldn't get any interruptions. It was about as easy a test as I could possibly write. Five hours later, it passed. There was that much setup.

Now, what is going on here? It turned out that somebody, in their wisdom, at some point in time, decided this loan object was a nice encapsulated concept: let's get the number out of it and give it to somebody else, because I need that number. That's a really bad idea. You should have kept the number there and put the code there. No, no, I'll take the number out. And once you'd opened up Pandora's box, it turned out to be about 37 other classes pulling that number out in 75 places and manipulating it. So in order to set this test up, I had to make sure I allocated all those other objects, and that they were all set up appropriately, and they had other objects, and those had other objects. The setup for this test was incredibly massive. Badly, badly broken stuff.
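To make that concrete, here's a minimal sketch in Ruby (the original system was Java, and every name here is invented rather than taken from the real codebase) of the difference between pulling the number out and keeping the code with the data:

```ruby
# A sketch of the anti-pattern; names invented, not the real codebase.
# The loan exposed its raw number, and 37 other classes pulled it out:
class LeakyLoan
  attr_reader :amount          # Pandora's box: the raw value escapes

  def initialize(amount)
    @amount = amount
  end
end

class InterestCalculator       # one of the 37 classes manipulating the number
  def annual_interest(loan)
    loan.amount * 0.07         # behavior lives far away from the data
  end
end

# Keeping the code next to the number means nothing leaks, and a test can
# build a Loan without first allocating a web of collaborators:
class Loan
  def initialize(amount)
    @amount = amount
  end

  def annual_interest(rate)
    @amount * rate             # "tell, don't ask": the Loan does the work
  end
end

Loan.new(100_000).annual_interest(0.07)   # => 7000.0, and no five-hour setup
```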
So yes, we had one million lines of code at this point. But we had brought a colleague on board, Jeff Bay, a really brilliant programmer, and he said something about this application that I found very profound. He said: in this one million lines of code, there was really 100,000 trying to get out. The application should have been only 100,000 lines of code if you'd really done it right. And that was my takeaway from this. It's very similar to Craig Larman's keynote at the beginning of the Wednesday session, talking about a giant system that was being built, where the guy said, many years later: I should have had 10 programmers. If I'd taken the 10 best programmers, I could have rewritten the entire system. I didn't need all those other hundreds and hundreds of programmers. And I had that same feeling looking at this myself. Jeff Bay said it very eloquently.

All right, so let's roll the clock forward. Well, first, let's talk about what happened here. Who destroyed this application? What happened? This application was largely being developed at that time by an Indian contingent. Were these guys lazy? Are they the ones who destroyed it? Was it because they were lazy? No, actually, they're not lazy. I was here working with them; they worked very hard. Were they being sloppy? Well, maybe a little bit, if you count some of what they were doing as sloppy. Inexperienced? Absolutely. They hadn't had the training necessary to undertake a large, well-structured, object-oriented application and build it; they didn't have that background. But they also didn't feel they had the power to say: no, I can't change this application until I get the training I need in order to do that. They didn't feel they could have that conversation with the sponsors in the U.S.

We have a name for all of these things when they happen: we call it technical debt. I am not a fan of this term. I think it's an excuse for one of these other things going on, and these other things should not be happening. Yes, I know technical debt is a term that I think came from Ward Cunningham, and I know there are a lot of people who talk about it; I think Neal Ford talked about it earlier today. But as a valid concept, I do not believe in it. It should not be happening. We'll come back to that toward the end.

All right, so evolution at this point. That's my story about big applications. Now I'm actually in China, working with a Chinese bank, and they want a service-oriented architecture. They had the opportunity and privilege to work with Jim Webber, who's a published author in this space and one of the experts of that generation, on doing SOA. So we were putting together a system, because the bank had all of these things that banks like to do. Chinese banks do the same things everyone else does: they loan money, they try to give you a mortgage, they try to get you a credit card, they hold your money, give you loans and the like. And they wanted to do that not just from their tellers, which is where most of the U.S. was at that point, but to run every one of these sorts of transactions at every point of contact with the client. The client could walk up to the ATMs; and the ATMs in China, I went up to them, you could do amazing things that I could not do in the U.S. at the ATMs. With the first generation of smartphones, you could actually transfer money and do things in China that you couldn't do other places. As well as just sitting at your PC.

So we proposed that you should really have some sort of service-oriented bus, a pub-sub sort of thing. If you have access to a client, you're one of the guys at the bottom of the pyramid here, and you have a client in front of you, you announce: I have a client here. Who wants to talk to the client? And it may be, yes, we want to pull some bank balances for him. But the loan guy may also say: oh, he's a good client, and historically, if you've got a client in front of you, make him a new offer. Or perhaps you want him to swap credit cards, make him an offer about that. Or maybe his mortgage payment is overdue: can you ask him about that? So the services wanted to wake up and say: oh, you have a client in front of you. Now, the way you presented that to the client depended upon the device. The device itself would get a message back saying: if you have a client in front of you, ask him about a loan.
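We never got to build it for that bank, but the shape of the idea, as a minimal in-memory sketch in Ruby (all the names and the overdue-loan rule are invented for illustration), is roughly this:

```ruby
# Minimal in-memory pub-sub sketch of the idea; every name here is invented.
class Bus
  def initialize
    @subscribers = Hash.new { |h, k| h[k] = [] }
  end

  def subscribe(topic, &handler)
    @subscribers[topic] << handler
  end

  def publish(topic, event)
    @subscribers[topic].each { |h| h.call(event) }
  end
end

# Stub standing in for a real core-banking lookup:
def overdue_loan?(client_id)
  client_id == 42
end

bus = Bus.new

# The loan service wakes up whenever any channel reports a client present:
bus.subscribe("client.present") do |event|
  if overdue_loan?(event[:client_id])
    bus.publish("prompt.#{event[:device]}",
                client_id: event[:client_id], ask_about: :loan_payment)
  end
end

# An ATM, a teller terminal, or a PC all publish the same event;
# only the rendering of the reply differs per device.
bus.publish("client.present", client_id: 42, device: "atm-7")
```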
And if you're a PC, you pop up a big red window saying: pay your loan off, please. If you're the clerk, you say: pardon me, sir, I need to ask you about your loan, because your loan payment is late. So the interaction style is different, but the concept of what I'm trying to accomplish is the same. And I can keep building these services independent of the vehicle for delivery.

So we talked about that to the client. I was brought in basically to help sell this concept to the client and get them to move, because the client was looking at going to Citibank in the U.S. and just buying their software. They said: well, Citibank's a popular big U.S. bank, so they must have great software. Of course it wasn't true, but we couldn't convince them otherwise. So in the end, they rejected our idea. But I walked away with a lot of new knowledge, and some dangerous knowledge, to some degree. Going back to what Jeff Bay said about 100,000 lines of code, I now felt that wasn't quite right either. If I had a chance to do that system here in India over again, I would probably write it as maybe 25,000-line services interacting with each other. These services would be largely independent, and that reaching through and grabbing stuff all over the place, we would inhibit some of that. So I was getting anxious to try some of these ideas out.

As fate would have it, I'm now sitting back in the U.S. I'd been sent back because we had some needs there, and I was asked to look at a medical system that was being developed. This company builds medical equipment, but they know that if you build high-end medical equipment, one of the best ways to sell it is to provide the hospital software. If a hospital is running your software, it's easy to sell your devices, because they all work together, and it's a big sales cycle. So they were redoing their medical software. They had these various instruments in laboratories, and teams working on diagnostic information; basically they were trying to collect information about various things. And there were a whole bunch of parties interested in this information: the patient; the patient's doctor; perhaps the emergency room staff or the critical care unit staff; and of course the guy in the corner, the accountant, who's always interested any time you use a service, so he can charge for it.

So we decided this concept of pub-sub would work really nicely. We'd have these little information nuggets, and we wanted to architect these nuggets very carefully, because I didn't want to just dump the MRI images onto the net and pass them along; that's not very useful to the other people. I wanted to publish conclusions. These services should analyze the data. Don't give me your raw data: do something with it, tell me about it. So in this case, perhaps you're running a CAT scan. Jane Doe has a CAT scan being done; it was done at a certain point in time; and there's a conclusion being published. In this case, there's some concern on the CAT scan. It's not critical right now, but there's concern. So I've classified the information, and that's the interesting part. We also had the concept that this information is only good for a little while. Your blood type probably doesn't change, but if I'm recording your heart rate or your blood pressure, that information is probably only good for a couple of hours before I should take it again. So I didn't want the nugget lying around forever; it should live for a relatively short period of time, based on the type of information. And the publisher of that information is in the best position to know how good it is.
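That system wasn't built either, but one of those nuggets, a conclusion with a classification and an expiry rather than raw data, might have looked something like this (a sketch; the field names are my own invention):

```ruby
require "json"
require "time"

# Hypothetical "information nugget": a published conclusion, not raw data.
# The publisher sets the expiry, because it knows how long the reading stays good.
def publish_conclusion(patient:, procedure:, conclusion:, severity:, good_for:)
  {
    patient:    patient,
    procedure:  procedure,
    conclusion: conclusion,                        # what we learned, not the raw images
    severity:   severity,                          # classified: :routine, :concern, :critical
    taken_at:   Time.now.utc.iso8601,
    expires_at: (Time.now.utc + good_for).iso8601  # blood type: long; blood pressure: hours
  }.to_json                                        # would go onto the bus for subscribers
end

publish_conclusion(patient: "Jane Doe", procedure: "CAT scan",
                   conclusion: "anomaly noted, review advised",
                   severity: :concern, good_for: 2 * 60 * 60)
```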
Of course, the idea is that you publish this and somebody else picks it up. Again, they rejected the idea. In fact, what they decided to do: they currently had a C++ system with a web front end, and their plan was, we're going to rewrite the back end of the system in Java, because that will make it so much better, and we'll still use exactly the same code on the front end. It was like: do you understand that the architecture is not going to be different, and the code is going to be the same size? Oh no, no, you don't understand. Okay, fine. I don't know if the project ever shipped. But they rejected my idea. You begin to get a sense here that I am a failure in my career, at some level. Actually, when I put this presentation together, I didn't realize how bad my career had been. It gets worse.

So what did I pick up out of it? Well, next we went to another company, this one in the financial services industry; they run various hedge funds and the like. They wanted to look at something like this, and they basically wanted us to produce a new architecture for them. To some degree, I didn't get into this engagement myself, because I didn't feel it was a very productive environment for us. But one of my colleagues, in fact Jeff Bay, did engage on this project, as did Rebecca Parsons, who was on the stage right before me; she was part of the same group working on this. And Jeff Bay came up with some really interesting ideas. I call these the Bayesian principles, after Jeff Bay; not the same Bayes. He said two interesting things that kind of blew my mind.

He said, first of all, it's okay to put more than one version of a service out there and leave them running at the same time. And I'm like: really? You've got a better service; aren't they going to use it? No, no, he says: put out version one, version two, keep version three. Really? And then he said something else that seemed equally wrong, which was: you only deploy one service at a time. That's the deployment. No two-service deployments. Certainly no 20-service deployments. You deploy one at a time. I'm like: dude, how can that possibly work? He says: if you need the service and the client, put out the service, then put out the client in the next deployment. But put it out that way. Well, what about all the other people using the service? He said: that's why I have multiple versions. I'm not trying to turn the old one off. In fact, why should I turn the old one off? It's working fine. Why should I go rip up all those other clients, get those clients to change, and put that code at risk? Why should I do that? The code's working fine against the old service. Until they're complaining about what they're getting, until they need to be upgraded, why would you turn version one off? I couldn't figure out how to argue with him. And it turns out, I think he's right.
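The talk doesn't say how Jeff exposed the versions, but with HTTP services the simplest shape is version-scoped routes that stay mounted side by side. A sketch using Sinatra, with invented endpoints:

```ruby
require "sinatra"
require "json"

# Version one stays up, untouched; existing clients keep calling it.
get "/v1/quote/:client_id" do
  { rate: 7.0 }.to_json
end

# Version two ships alongside it with the richer answer.
# Old clients migrate only when they actually want the new fields.
get "/v2/quote/:client_id" do
  { rate: 7.0, currency: "USD", valid_until: "2025-01-31" }.to_json
end
```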
Now, what happened, though, is that this establishment was putting out a new release four times a year, and they thought they were really hot stuff. So what did Jeff wind up doing? He was deploying three times a day. Blew their minds. And this had a ripple effect through the organization for many years after we finished the project. So, a success at some level, although it was just a prototype. But some new learnings about how to do services, how deployment works, and that these beasts are different from this Stonehenge I worked on in Bangalore.

So time rolls forward. I'll take the question, but I may defer the answer to later. I think I understand the question: basically, when we have multiple versions of services, and they need different data structures. I'd have to ask Jeff how he chose to handle that. In later releases, we actually started using key-value stores, and it turns out that as long as you're doing that sort of thing, ignoring the new fields, it's pretty much the same as the old XML tricks. So it hasn't been a practical issue so far in my experience. But it would be a little scary otherwise.

All right, so now I'm on another project. I'm sitting in Detroit, working with a manufacturing company, and they need to do a parts replacement. Basically, there are some parts in cars that need to be replaced, and it's business-critical that the parts in these cars get replaced. It's not a safety issue, but they're using a technology that's running out of date, and it needs to be replaced with newer technology. And it was very critical that it happen. The problem with this, of course, is that I don't know where these cars are. I might know an owner by address; maybe the car is brought in for some other service and I recognize it there; it could be the customer calls up and mentions it. The sources for finding these cars were myriad: there were many different ways I might discover there's a car that needs to be repaired. So that was part of the challenge with this. The existing vendor basically said: I can modify your legacy systems and have this ready in 15 to 18 months. Which was well outside the window of feasibility for the replacement; it had to be done much, much faster. But that's how long it would take to get it done by the existing vendor.

So again, a desperate situation. A desperate client calls for new opportunities to play games, and I love these opportunities to go into a desperate situation, because I get to play by my own rules. And I wanted to do this. So how did we attack it? I came up with an architecture I would call the pinball architecture. The idea was that I was going to take the information packet, the information I knew, and refine this packet by bouncing it from service to service to service, trying to add more information, to the point where it's a complete replacement order. So it may start out like this: I may have the address where the car is located. I might have the date of purchase, or the name of the person who bought the car. If I'm really lucky, I have the vehicle identification number. Of course, knowing who originally bought the car doesn't tell me who owns it now, because it could have been sold. But it's a starting point. And I may have any of these things to start with, again depending on the source of the information.
And what I need to do is fill in all the question marks in order to make it a whole order. So what we did was define a whole set of small services that would refine the information available in the packet; in other words, try to fill in the blanks based on what you know. There might be one that, if it knew the VIN and something else, could inject a new piece of information and solve that part of the problem. Or perhaps, if you needed a name, and I had a couple of other things, I could figure out the VIN for you. But I couldn't get a VIN unless I had something else.

So basically, we played pinball. We dropped this little packet like a pinball into the little bumpers here, and it would bounce around. Yes, I have an address, but I need a VIN in order to actually do the thing. Well, I haven't got a VIN. But pass it on, because the next thing I need is a VIN. Well, I can get a VIN if I have the name. Oh, darn, I haven't got the name either. But I send it over to this guy, and given an address, I can get a name. Oh, I see a solution here.

So we took the database (there's a massive database available at the car manufacturer) and we carved it up into sets of small tables, and a service would only access certain tables. A service would not access tables that weren't, quote, its own tables. Yes, you could have written SQL in the wrong place and pulled the data, but we weren't going to do that. We allocated the tables to specific services. And the most interesting services were around the join tables, where I'm taking two pieces of information and figuring out the relationship. It turns out that's where the interesting behavior was. So in this case, I can get all the way over here, and given an address, I can now fill the name in. I would fill it into the packet. There wasn't any persistence of this information; it lived in the packet itself. And then we'd take that packet and drop it right back at the top of the pinball machine, all over again. It would bounce around to the VIN service. Haven't got the VIN. Run down, need the name. Oh, I've got the name, cool. Have I got this other information? Darn, haven't. Off to some other service. And we'd go around and around like this.

So this was our design. Again, very radical thinking, leveraging a lot of the stuff we'd picked up originally doing SOA with Jim Webber, plus a lot of refinements from Jeff Bay; by the way, Jeff Bay was on this project as well. We delivered the project in nine weeks. Thank you. There's more to the story. Yeah, we delivered it in nine weeks, and the customer was just blown away, which is always nice. It was a radical change for them, because they were running nine-week development cycles and nine-week test cycles: nine weeks development, nine weeks test. We dropped this code into their nine-week test cycle. Of course, they found one bug; it took two hours to fix. So we spent an entire nine weeks writing the next release, the additional function they wanted, while they tested. Nobody else was doing such a thing. Again, it blew their minds. Nine weeks of testing, that's for building Stonehenges, and we were not building Stonehenges anymore. So at this point, the client was just going bonkers. They wanted agile, they wanted to pull this stuff in. If anything, we had to slow them down, because they were just going crazy over this stuff.
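Before moving on, here's roughly what that pinball loop looks like as code: a minimal sketch where each bumper is a resolver that fills in a blank if it can, and the packet keeps going around until the order is complete or nothing more can be added. The lookup rules are invented stand-ins for the real join-table services, and the visited log anticipates the tracing I'll mention in a minute:

```ruby
# Hypothetical pinball loop. Each resolver is a tiny service owning a few
# tables; the rules here are invented stand-ins for real join-table lookups.
ADDRESS_TO_NAME = { "12 Elm St, Detroit" => "J. Doe" }
NAME_TO_VIN     = { "J. Doe" => "1FAHP0000XY000000" }

RESOLVERS = {
  name_from_address: ->(p) { p[:name] ||= ADDRESS_TO_NAME[p[:address]] if p[:address] },
  vin_from_name:     ->(p) { p[:vin]  ||= NAME_TO_VIN[p[:name]]        if p[:name] }
}

def refine(packet, required: %i[address name vin])
  packet[:visited] = []                     # the trace that catches looping packets
  loop do
    filled = required.count { |k| packet[k] }
    RESOLVERS.each do |service, resolver|
      resolver.call(packet)
      packet[:visited] << service           # log every bumper the packet hits
    end
    return packet if required.all? { |k| packet[k] }            # complete order
    return packet if required.count { |k| packet[k] } == filled # stuck: no progress
  end
end

refine({ address: "12 Elm St, Detroit" })
# => packet now carries name and VIN, plus the trail of services it visited
```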
Now, we picked up some new lessons from all this. More learnings for me that year. I'd always been trying to figure out, even with Jim Webber back in China: what is a service, Jim? What's the boundary around a service? And I never really got satisfactory answers; it was always kind of a hand-waving thing. Having played with it for a year or two, I was getting to the point of feeling like services were like traditional classes and objects. Every service has a job. You try to make this job as small as possible, yet still do something interesting. I don't want a service that just publishes raw data; I want it to publish conclusions. I want it to do something. Filling in a blank in the packet, that is doing something interesting. And it turned out, in this case, they got really, really tiny. Sometimes they were as small as a hundred lines of code: if you're missing your data, run this database query, put the result in the packet, done. Whoa, where's the rest of the code? Well, it's in all the other services. The application is being scattered to the four winds, very much like a great object design, where there's lots of behavior scattered across the objects. And I have some experience in object design, because I came from a Smalltalk background, which I put up here. Smalltalk is all about sending messages and getting replies; that's how Smalltalk talks about itself, and it fits this model very nicely. And there's a concept of encapsulation here. I'm thinking: oh, encapsulation, just like classes. Because these tables are owned by a service, and nobody else gets to go around that service; and you're only getting conclusions out of it, not the raw data. So encapsulation felt pretty good. A lot of new revelations.

But I ran across some problems as well, things I hadn't run into yet. One was: where are the packets? Sometimes it would look like I'd drop a packet into the pinball machine and it would go round and round and round. Or sometimes I'd drop a packet in, and it's like: where'd it go? You look around, and it just doesn't pop out the other side. It's not going around anywhere. It's just gone. So there were bugs, obviously, and the question is: how do you debug such a system? We had to start putting some tracing in, and we actually logged in the packet which services it had already visited, in case it got into one of these cycles. So we started doing the logging sorts of things; things that we actually understand much better now.

The other problem, actually even more profound, was that thinking about the problem as a set of small services was not natural to the programming staff. I had some really bright programmers on this team, some very strong guys. Yet when they tried to decompose a problem, they would still try to build a big service, because that's the way they think about a problem. They had trouble wrestling with this carving-up into very small pieces, where every piece does its own thing and passes more information along. It wasn't a natural way of thinking, and they wrestled hard with it. And when they got to release two (because after release one, I was sent off to London to work on some other projects), they struggled a little bit.
In fact, the services started to get a little bigger, because the granularity discipline wasn't there; it was easier to write them that way. In theory, they're easier to write. In fact, it actually won an award at a ThoughtWorks technical conference, but it won the award for being a really bad idea. So maybe you shouldn't have clapped so early. To some degree it was delivered, but maybe it wasn't the best implementation. I would argue the ideas weren't followed through; they could argue other things.

But I'm getting some more ideas here. Instead of thinking about it as 25,000-line services, my idea now felt much more like 200 small services. These services are going to be much smaller than I'd ever conceived of before. I could replace this giant Stonehenge with a set of very tiny services.

All right, so like I said, now I'm off to London. Let's see if that's true. So now I'm in London, and I've transitioned into a little company called Forward. It was 35 people when I joined it; when I left it four years later, we had 470 people, and they were in every sort of business. I'm not going to talk much about that, but it was a very interesting tech environment, supporting a lot of businesses. And the first thing I walked into is that one of the things we were doing was a lot of Google ads. That was the core business at the time. Google ads has a lot of reports coming out, and you pull this information down, and you basically want to push the results of how yesterday's advertising campaigns performed to the marketing guys. In fact, you get feedback from Google sometimes as often as every 20 minutes. You want to be able to take this feedback and feed it back to your marketing team, so they can tune their campaigns. And it just felt like a great environment for this pub-sub architecture: a service pulls these reports down, does a quick analysis of the report, and publishes something to the marketing guys saying, there's something interesting going on here. Perhaps you want to change this, or take advantage of this opportunity, or maybe you should stop this campaign, because you're losing a lot of money very quickly.

So I started trying to automate this. We defined these agent services, running on behalf of each of our marketing guys, to collect these things. We had a nice architecture for it. We went offshore for the implementation; actually, it came here to India for the implementation. But I ran into the same problem I'd had in Detroit: it was very difficult for the developers here to understand the architecture, the concept of small services, and how to make that work. It wasn't natural to the developers. The CTO, a former Oracle executive, of course said: all you need is a giant database, you run SQL queries against it, and your problem goes away. It was like: yes, but then it can't be changed; you need the service stuff. Nevertheless, I lost the argument, and it was replaced with a more traditional SQL approach. So again: rejection. My batting average is quite poor by now.

But an interesting thing happened. The organization learned about this to some degree, and the next year they actually started to put together card walls: basically, literally emulating a card wall, with columns for defined, in progress, and all these other states. But we also put a messaging system in the background for this, which turned out to be the real core of the system.
So basically, our users were able to get into the system, and we started pulling these Google reports and posting messages to these agents. A person could log in and look at their account. Yes, they have stories they were playing in order to run this marketing campaign. Note that this card wall was not tracking programming stories; it was tracking marketing stories, things a client would like us to do for them. So this was a card wall for interacting with our clients using agile processes, but they were a marketing organization working with their clients, not writing software. And agile works very well in that environment as well. So we actually did an automated card wall, and the back end grew this concept of services. We built on the idea that one service does one thing, and we posted alerts with these recommendations. So we wound up actually implementing the idea after all. In fact, this grew so large that we brought up Hadoop clusters to run it, and currently we're running about 5,000 jobs a day on behalf of our clients, with about 500 different services doing this. So we finally had a success using this sort of architecture, and it is working extraordinarily well. In fact, in 2008 we had grown to a company of 55 people, and our revenue for the year was 55 million pounds. We were generating one million pounds of revenue per employee. These tools were killers. We were making tons of money with this stuff. So: success.

So, new observations, now that we finally had some successes under our belts. One interesting thing is that these services themselves (hundreds of them) became almost disposable. They're so small, only 100 lines of code, that rather than try to fix one, you just write a new one. Why bother understanding it? Rewrite it. It's 100 lines of code. If you struggle too much with that, you should find another programming field, because this is not the one you want for yourself. We coupled them as loosely as possible. We used RESTful interfaces; one service would drop data into a NoSQL database, and another would pull it out of the NoSQL database, some sort of key-value store. So we tried to make sure they were loosely coupled. There was some flow coupling; in other words, this service couldn't run until that other service had run. That sort of coupling did exist.

But to deal with that, we basically had the services become self-monitoring. We killed the concept of unit tests, because it's 100 lines of code. What can go wrong here? You can desk-check it and run it once at your desk. Why would I write a unit test for it? But these services should at least, when they can't run, raise a little hand and say: excuse me, where's my data? You might be the reporting service, analyzing a Google report. You wake up, you look around: where's my Google report? It's not here. Oh, problem. Now, there's nothing wrong with you; it's probably somebody else. But you raise your hand, and a programmer will notice, because we have big monitoring walls, and he'll go chase down what the problem might be. Maybe the Google API has changed. Maybe Google is down right now. Lots of things could go wrong. But it's important for a service that can't run to at least say: I can't run. And that became a bit of a substitute for the concept of unit tests. At least when it's not working, it's raising its hand.
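The talk doesn't show what raising your hand looked like in code, but the shape is simple: check your preconditions before doing any work, and publish an alert instead of failing silently. A sketch, with an invented report path and alert endpoint:

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical self-monitoring reporting service: if its input isn't there,
# it raises its hand on the monitoring wall rather than failing silently.
REPORT_PATH = "/data/google/adwords-report.json"        # invented path
ALERT_URI   = URI("http://monitoring.internal/alerts")  # invented endpoint

def raise_hand(message)
  Net::HTTP.post(ALERT_URI,
                 { service: "report-analyzer", problem: message,
                   at: Time.now.utc }.to_json,
                 "Content-Type" => "application/json")
end

unless File.exist?(REPORT_PATH)
  raise_hand("Google report missing; upstream fetch may have failed")
  exit 1   # nothing wrong with us; somebody will chase down the real cause
end

report = JSON.parse(File.read(REPORT_PATH))
# ...the hundred-or-so lines of actual analysis would go here...
```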
Notice we're beginning to trade off defect prevention, which is what unit tests are all about, for fast failure. If you can raise your hand quickly, it's a fast-failure situation. We also began to replace the whole concept of acceptance tests, because an acceptance test in this environment: I wasn't sure what it meant anymore. It's a hundred lines of code, deployed independently. What does an acceptance test mean? Well, you could run some preliminary tests against a suite of these services, to make sure they work in concert. But in fact, the real problem is the Google report isn't there, or Google is down. There are a lot of external APIs we depend upon. So rather than running acceptance tests, we started writing active monitoring of the overall system behavior. In fact, programmers would almost write that first. There was a talk given, I think last year or the year before, at the GOTO conference, where the speaker was advocating that the only real acceptance test is business metrics. If you want a real black-box test, just measure the business metrics, because you can change whatever you want inside, as long as the business metrics hold. And he has some nice tools that lean in that direction. We had actually already done that; we were already building those tools. In fact, the first thing a programmer wants to build is the business metrics, because they want to make these changes and slight improvements to the services and see: did it go up, or did it go down? Was it successful or not? And we're actively monitoring these things all the time. So maybe something happens: we just did a deployment, and all of a sudden a metric is going down. Well, maybe it's something we did; we're suspicious. Or it could be, you know, a tsunami hit Japan. We can't tell. But you can certainly roll back the change, and if the metric keeps going down, it's probably the tsunami; if it goes back up, it's probably us. At best, it raises suspicion. But active monitoring becomes almost a core value at that point. So we still write some code associated with the overall system, but it's active monitoring code, running all the time.

We also don't dictate what you write a service in. Programmers love this. I can write the service in Ruby, because I like Ruby. You can write it in C#, because you like C#. Or Java. They also started writing them in Node, using JavaScript or CoffeeScript. Then they started writing some in Clojure. The nice thing is, you can write whatever you want; it's all just loosely coupled with RESTful interfaces. And if you picked one up and had to change it, and it's written in Clojure, and you're like: I don't know what that means, look at all these parentheses, oh my god. Fine, I rewrite it in Ruby. My new version can be Ruby. I don't care. This turned out to be very exciting for the programmers, because they got a chance to try things, and it was a big motivation for the staff, a huge morale boost. And they tried a lot of interesting experiments. We changed a lot of technology based on those experiments. And again, we made even more money with this.
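Going back to metrics-as-acceptance-test for a moment: the flavor of that active monitoring is roughly this. Poll the business metric, compare it against a recent baseline, and raise suspicion on a drop. Everything here (the metric source, the thresholds) is invented for illustration:

```ruby
# Hypothetical active monitor: the business metric is the acceptance test.
def current_sales_per_minute
  rand(90..110)               # stub standing in for a query against the live funnel
end

baseline = Array.new(10) { current_sales_per_minute }.sum / 10.0

loop do
  now = current_sales_per_minute
  if now < baseline * 0.8
    # Could be our last deploy, could be a tsunami; either way, raise suspicion.
    puts "ALERT: sales at #{now}/min, 20%+ below baseline #{baseline.round(1)}/min"
  end
  baseline = (baseline * 9 + now) / 10.0   # slow-moving rolling baseline
  sleep 60
end
```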
So, the current state; and actually, even this is getting to be about six or eight months old at this point. Still at Forward. They have a website called Uswitch. Uswitch lets you switch energy providers in the UK (in the UK, you can switch energy providers) and you can switch cell phone plans, like in a lot of other places in the world. And we sell some other products as well, through the Uswitch site. We had to redo this software, because it was written in .NET, it was about 13 or 14 years old, and it was almost impossible to change. It was another Stonehenge. So we undertook a large rewrite of this application, and we were guided by a couple of key things here, two big influential ideas that we brought in. And by the way, we took people from that other team that had done these very tiny services, and we seeded the Uswitch development team with those people and that sort of knowledge, that sort of thinking. We wanted to bring the new ideas to them, and have those guys drive the new project.

One of the influences was that we wanted an event-oriented structure, not an entity structure. We wanted to record little events as they occur, rather than have a transactional database representing the current state of the system, which was the existing design. That was influenced by the book In the Plex, which talks about how Google does things. Every time you do a search on Google, there are events being triggered by these various services. Google has a philosophy that they will trigger events on anything interesting you do, even though they have no concept of how they might use them in the future. If you do a search, they're going to record that. If you do a search and they suggest an alternative spelling, and you click on it, they're going to record that. If you ignore the alternative spelling, they're going to remember that. You go to page two, they remember that. And they stream all these events out to a large event queue, and later on they go back and run algorithms against it. They're not trying to build a relational database that represents everything you're doing; they just put streams of events out there, and go back and analyze them at some point in the future. We liked that concept. We wanted an event-oriented architecture, rather than an entity-oriented architecture.

The other influence was this concept of current state versus history. This was influenced by the introduction of Clojure as a programming language in our environment. Clojure has an interesting philosophy: fundamentally, all data is unchangeable; it's all immutable. Clojure believes that if this is where I'm standing right now, and I move over here: yes, this is my current state, but where I was standing a minute ago is a perfectly valid piece of data too. It's perfectly good data. And our legacy system, of course, was a relational database keeping track of just the current state; it had lost all the history of what was going on. We wanted to retain everything. So again, an influence on our design.
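As a tiny sketch of that event-oriented, history-keeping idea (invented event names, nothing from the real Uswitch code): everything interesting becomes an appended record, nothing is ever updated in place, and current state is just a question you ask of the history:

```ruby
require "json"
require "time"

EVENT_LOG = []   # stands in for a durable, append-only stream

# Record anything interesting; never mutate an old record.
def emit(type, payload)
  EVENT_LOG << { type: type, at: Time.now.utc.iso8601, payload: payload }.to_json
end

emit("postcode.entered", postcode: "SW1A 1AA", session: "s-123")
emit("email.captured",   email: "jane@example.com", session: "s-123")
emit("plan.viewed",      plan: "fixed-12mo", session: "s-123")

# Current state is a fold over history; the history itself survives:
events = EVENT_LOG.map { |e| JSON.parse(e) }
events.reverse.find { |e| e["type"] == "postcode.entered" }
# => the most recent postcode event, with every earlier one still on record
```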
So with that in mind: the legacy system was what you might expect, a nice, traditional, layered architecture. We had one of these towers for the energy switching; we had a different tower for credit cards; we had a nice transactional database we were using; and we had a reporting database. Don't let that little line on the diagram fool you. That line was not anywhere near straight. It convoluted through about 18 layers, and on any given day there was only about a 30% chance that line was running successfully. We were always chasing down why the data didn't get through to the reporting database today. It was incredibly convoluted old code. In fact, there were a lot of things going on in the current system that we wouldn't even bother recording, because we didn't have the bandwidth to get them through our system into the reporting database. We were just ignoring critical user events that we should have been tracking. But that's where we started.

So what did we build instead? We basically went back to In the Plex, to life in the Plex, and we just collect signals. Oh look, a customer logged in, and he gave us this postcode, because he wants to look up the energy rates in his area. Or maybe he gave us his name. Or maybe there's something going on with his address; we've got his address now, so, more information. Maybe he's given us his email address, because maybe we sent him an email in a marketing campaign and he clicked on it. Record the event. Now, notice these events may have a relationship, or may not; the idea is a whole cloud of events, and you can begin to cull elements out of this cloud and see interesting things.

So, for example, this information seems to refer to somebody who might want to use our service. One of the things we recognize is that the combination of a postal address and an email address represents, to us, an opportunity. I have a way to reach somebody, and I know where he lives, so I can offer him an energy plan. That's an opportunity. But what if the same email address shows up with a different postal address? He probably moved. Let me contact him again right now, because here's an opportunity to sell him the service again. And what if the postal address is the same, but the email address changes? Oh, I bet somebody else moved into the house. Again, another opportunity. We did not anticipate this when we started collecting the events, but as we started looking, we found marketing opportunities popping out at us, letting us market energy switching more aggressively. And we made, again, lots of money off being able to do this. Just like Google: collect the events, and figure out how to use them in the future. A very powerful concept.
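A sketch of that opportunity-spotting, assuming the events have been reduced to email-and-postcode pairs (the three rules mirror the ones just described; everything else is invented):

```ruby
# Hypothetical opportunity detector over the event cloud.
# `known` maps email => postcode, built up from earlier events.
def classify(known, email:, postcode:)
  if known.key?(email)
    known[email] == postcode ? :already_known : :probably_moved  # same email, new address
  elsif known.value?(postcode)
    :new_occupant      # same address, different email: somebody moved in
  else
    :new_prospect      # first time we can both reach and locate this person
  end
end

known = { "jane@example.com" => "SW1A 1AA" }
classify(known, email: "jane@example.com", postcode: "EC1A 1BB")  # => :probably_moved
classify(known, email: "new@example.com",  postcode: "SW1A 1AA")  # => :new_occupant
```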
So at this point, everything we have is producing events into this architecture. It could be the apps themselves; the services that are running; the web servers themselves report statistics about what's going on, how many hits they have, who's coming through them, what the load on the servers is. Push all of that in. Plus, we actually pull in a lot of data from third parties: weather information, changes in energy plans, all those sorts of things get pushed into the system as events. And we're building a lot of consumers of this. A lot of our consumers are basically almost just a Perl script, or a smallish Ruby script, maybe using the language R to do some analysis of the data; anything we actually want to do. And we're playing a lot in this space, because now we have this rich set of data. We're in the big-data world, and we're thinking of new things to do almost all the time. I'll get to some examples of that in a minute.

For the message bus, we wanted something very, very lightweight that could still handle huge traffic loads. So we picked Kafka. Kafka is an open-source Apache project; it's basically the messaging bus from LinkedIn. LinkedIn took their messaging bus, and they're getting tons of events: relationships, what's being talked about, and yes, you've been endorsed on LinkedIn. All of those are events in the LinkedIn world, and they're all dropped into a Kafka message stream. We've done some analysis on the stream: you can probably put about 12 million events a second through it. Yeah, the numbers are huge. And you can read the events almost that fast, because they stream to disk and you're reading the disk stream. If you look at some of the Kafka information, it's quite spectacular, some of the things you can do with Kafka. So we use Kafka as our base messaging architecture. You can easily hold almost a week or so of data in this system, and after that we start archiving, just in case we want it later, into Hive. Probably not even the best place to archive it, but it seems to be working fine for us. And if a service goes down and comes back up, it's easy with Kafka to pick the stream back up from where you left off. Just remember where you left off, and read the stream from that point, very efficiently.
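That pick-up-where-you-left-off behavior is Kafka's consumer-offset mechanism. As an illustration using the ruby-kafka client (which is newer than this talk; the broker, topic, and group names are invented), a restarted consumer resumes from its group's last committed offset:

```ruby
require "json"
require "kafka"   # the ruby-kafka gem; illustrative, not what Forward used

kafka    = Kafka.new(["kafka1.internal:9092"])
consumer = kafka.consumer(group_id: "energy-analysis")  # offsets remembered per group
consumer.subscribe("user-events")

# If this process dies and restarts, it resumes from its last committed offset
# rather than re-reading the whole week of retained events.
consumer.each_message do |message|
  event = JSON.parse(message.value)
  # ...analyze the event, publish conclusions, raise a hand on trouble...
end
```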
So, some of the things we've been doing with this. We've created this concept of how we cross-sell, because we're selling energy switching, we're selling phone plans, we sell credit card switching, we sell car insurance. We wanted to understand what sorts of things people are buying, and what they also buy. So we created this map, where the color codes represent the various product families and the bands represent how we get our traffic. So the blue area, which I think represents credit cards, going to the energy area is actually quite wide. If people are there for credit cards, there's a good chance you can get them to look at their energy plan as well. And the energy area feeding into yellow (I think yellow was the insurance) has some feed there, not necessarily a great one. But now we're beginning to understand that if you're on our credit card site, we should put more links to get you into that other part of the business, because now we have the data that shows it's effective. And we can make that change and watch how this graph changes. We're beginning to run lots and lots of very powerful marketing experiments because of this architecture.

We're also able to actively monitor tons of things happening on our site. We can actively monitor every user journey through the site, and we're getting lots and lots of user journeys, so we can see that you're living in the blue zone (I think that's the credit card area) and people move to the red zone; these are the pages they're moving to. We can see where the flows are, and that helps pin down which page should have which links on it. This was drawn with R; I think the script that pulled this data out and handed it to R was around 20 lines of code. These are not big things. These are experiments a programmer can take and run with very quickly.

All right. The interesting thing (this is a chart from my anarchy presentation) is that this architecture, these small services, has now influenced our development processes. That was a surprise result to me. I didn't think an architecture would necessarily influence your processes. But those practices, like unit tests, acceptance tests, constant refactoring, and patterns, are all things you do when you're building a Stonehenge. They're not appropriate when you're building a 100-line-of-code service. I don't need to refactor 100 lines of code to make it look like 80 lines of code; I just get it out there. I don't need a running unit test to make sure I can make a change to it; I'm just going to throw it away and write it again. It's not a big deal. So those sorts of things: when we talk about anarchy throwing practices away, a lot of these things went away because we're not building Stonehenges. We're building small services.

And the very aggressive deployment: we do continuous deployment, not continuous integration. We have no integration servers. We have no development servers. We have no test servers. The developer deploys from his machine into production. And on average, deployments are happening in this environment every three and a half minutes, to production. I'm updating a small service, I'm watching the metrics; I'm updating another small service, watching the metrics. Very aggressive stuff.

Quick question? Message loss? Why would we have message loss? Kafka is a disk-backed store; it's not an in-memory store. And what type of message do you care about losing? You might lose a user journey. I mean, in this environment of web sales, if you lose something, the earth does not come to an end. We might lose a sale. I'm sorry, I can't hear you; let's get the mic to you after I finish, shortly. I'm almost done.

So again, processes were influenced by our architecture. I think that was one of the big surprises to me. All right, so what about technical debt in this world? Again, reviewing: technical debt is the result of sometimes lazy programmers, sometimes sloppy practices, and, more often than not, inexperience. To some degree, in this environment, technical debt has shrunk. Because what's the worst a programmer can do to you? Write a hundred-line-of-code service that doesn't work? I just write it again; it's not going to take me more than a few hours. So the damage any individual programmer can do has been minimized. Can this programmer reach through and grab the data out of the middle of a service? He can't. The services aren't publishing their raw data; they're only publishing conclusions. So that temptation that existed, to reach through and grab that number out of that loan object and do some manipulation on it, is not possible in this architecture. It's a safer architecture for change. It's more resilient to bad programmers getting in there and doing bad things, whether accidentally or intentionally. And that was a surprise result as well. Our system has been extremely stable. Even though almost no service lives more than six months before it gets replaced, the system has been up now for four years. It's almost like the human body, to some degree. The cells in your body right now are not the ones you were born with, yet you are still you. And our system is very much like that.

All right. So, some conclusions. I say "so far," because I'm now working at Mail Online, as I've talked about before, and we're implementing a microservice architecture there, plus anarchy; we're doing the whole thing, and I'll probably have new conclusions very shortly. But the conclusions so far: these principles have now been added to the sort of things Jeff Bay understood. First of all, I now believe services are very, very tiny. People who talk about a service and mean 10,000 lines of code: that's not where my head is at. These things are very, very tiny. Oops, too fast. Go back, go back. Yes. Oh, go back further. Second: loosely coupled.
I really don't want these things coupled. Now, there's a little bit of flow coupling, and in the Uswitch case, they're now starting to do pub-sub, because they realize, yes, there's a bit too much flow coupling: this service knows that service knows that other service. They want to break that, so we need to put pub-sub in there as well. Multiple versions: fine. In fact, we're deploying into the cloud; you deploy versions into the cloud with the right sort of virtual machine. We tend to use Amazon, and we tend to use the Amazon small-size virtual machines: not the tiniest ones, not the micros, but the smalls. If we need a lot more capacity, we buy lots of smalls and put a load balancer in front of them. But multiple versions are encouraged. To some degree, a version that's not being used costs almost nothing, because we're charged by the CPU, and the memory footprint is tiny. It's almost not worth looking for the dead ones, because one is probably only costing us a dollar or two a month. Why would I spend time looking for these things? Now, you get some nice Amazon reports, and if something hasn't had any CPU usage for the last three months, I could probably turn it off. But I'm not spending a lot of time looking for those. So there are probably some dead cells running around the system. We don't care.

It's very important for these services to monitor themselves, so they're raising their hand when they fail. When I can't get the information I need to do my job: raise your hand. If you have some sort of calculation error: raise your hand, so that somebody can aggressively look at it. And there's a team always ready to attack these as soon as they happen. We're in a fast-failure mode, not defect prevention. Publish your interesting stuff: if you compute something interesting, publish it. Sometimes we'll go back to a service and rewrite it to publish more things, because we think it's doing more interesting things. But don't worry about who's going to use it. If you think you did something interesting, publish it. It's easy.

I don't know what an application is anymore. What is an application, with these little things running around this way? There's some sort of system there, obviously. But an application boundary? I don't think it's there anymore. And the software is living; again, the analogy with the human body. It's very long-lived, while we're doing continuous deployments. I say five to ten minutes, but the last time I checked, it was three and a half minutes between deployments. This system is complicated. But the complication is what I think Neal Ford earlier called essential complexity: it's not any more complicated than the domain we're working in. And if you think this makes it too complicated to test and maintain, then think back to the one million lines of code, which only had 70% of the acceptance tests passing. That was complicated also, and it didn't help us any. The system is inherently complicated. This doesn't make it any worse, or any easier. In fact, I would argue that most of the complexity we have is essential complexity, not the accidental complexity that comes from rewriting bad code. So we radically impacted the architecture; in fact, programmer anarchy was kind of born out of this team doing these services. And there will be a learning curve. When we hire new guys and sit them down at our table, they look at us writing these 100 lines of code.
Okay, I understand 100 lines of code, but how does this fit with the rest of it? It takes a little while to understand that. It takes a little while to get used to writing 100 lines of code, and not 200 or 500. Oh look, I can put some code here and do a second type of analysis of the report. No, no, no: copy the code, change the analysis, run it as a separate service. But I could just change it with an if... No, no. Copy the code. It takes a little while to get used to that, especially if you're coming out of a traditional TDD shop and you've been following the agile practices where duplicate code is evil. It turns out: not so much, in this environment.

So that is the summary of microservice architecture, my journey, with a little luck. I hope you don't go through as much pain as I did. I hope you have more success than I had. But I'm actually quite happy with where it stands right now. Just as a side note: after I'd given this presentation a few times, microservices showed up on the latest Tech Radar from ThoughtWorks. And if you don't subscribe to the Tech Radar, you absolutely should. It's in the second circle in from the middle, and I forget the official name for that circle; I think it's the "things you should seriously be looking at" circle. What? Trial; a fancy name for trial. Yeah, you should be playing with this thing. So I'm feeling pretty good, because we're definitely playing with this stuff, and it's our production system. We're building production systems out of all of this.

I think I have a little time for questions. Actually, no, we've got about two minutes. Obviously, I'll be hanging around outside as usual. Yes, sir. But the programmer figures that out on his own. This is anarchy. There is no business analyst or anything else. There are no people; there's no person to write a contract with, saying here's the business need. He builds what he needs to build. Watch the anarchy video; it's even stranger than that story. Yes, sir. Every service: programmers do their own deployments into the cloud, and they make sure they write their deployment scripts. In most cases, we use Capistrano for our deployment, which is a Ruby-based system, but Capistrano will deploy anything. We tend to use either Chef or Puppet to make sure the virtual machine is up to date, so there's always a Chef or Puppet script that's been developed to support that virtual machine. We have some standardized virtual machines. We have a couple of standard scripts for deploying Ruby to our virtual machines; we have some for Python, for Clojure; we have some for Node.js; we might even have one for Java. I hope not. But basically, deployments go into individual virtual machines. There is no concept of anybody else doing the operations side of that. I think I answered the question. Okay, good.
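For flavor, here is what such a per-service recipe could look like, written in the style of modern Capistrano 3 (which is newer than the setup described here; none of this is Forward's actual script, and all names are invented):

```ruby
# config/deploy.rb -- hypothetical per-service Capistrano recipe
set :application, "report-analyzer"
set :repo_url,    "git@git.internal:forward/report-analyzer.git"
set :deploy_to,   "/srv/report-analyzer"

# One small VM per service; more capacity means more smalls behind a balancer.
server "10.0.1.17", user: "deploy", roles: %w[app]

namespace :deploy do
  desc "Restart the service after each push"
  task :restart do
    on roles(:app) { execute :sudo, "systemctl restart report-analyzer" }
  end
  after :publishing, :restart
end
```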
Anybody else have comments about whether it's being done other places? I can't really say; I haven't really looked at Netflix and some of the others. I would definitely say this is how Google is running. I would say this is how a lot of the event-based stuff is being used by companies that publish streams. I'd be very surprised if LinkedIn doesn't have this sort of architecture internally; maybe not for some of the core stuff, but for all the little things running on the side, the services and the searching, I'd be surprised if they're not running this sort of architecture as well. I think the tendency is that services are getting smaller and smaller. We used to think one service for each division of a company, like at Credit Suisse: the accounting guys would have theirs and the loan guys would have theirs. I don't believe anybody is anywhere near that size anymore.

Yes, ma'am? Can I extract the business metrics? Oh yes. For example, one of the things we do is sell parrot cages, so I can look at these statistics and see how many parrot cages I sold; in fact, that's what I'm tracking. In the case of Google ads, I'm looking at where people are clicking on our ads. Google is feeding us back information about how many people are clicking on our ads; we're pulling those reports and graphing them on monitors in our development rooms. So we can watch: here's what's happening in Japan, here's what's happening in India, in the Middle East, this is what's going on in the UK. These are data we see constantly. And in particular, we have alerts that are raised when these data go the wrong way, and messages get sent out to people based on that. Those are the first services we write: the monitoring services (there's a small sketch of one after the next answer). Because as the programmers are deploying several times a day, they're watching this as their ultimate measure of success. The domains we work in are quite complicated, so it's hard to predict what effect any given change is going to have, regardless of what a product manager or a business analyst might say. So they're kind of on their own: they try various things, they look at the feedback, they try other things. This is how they live.

Yes, sir? And I think I'd better wrap it up after this. Yeah. With this type of implementation, where there are only 100 lines of code for each service, will it not lead to code duplication? Absolutely. And we don't care. Yeah. That was a strange thing for me; a lot of this was a surprise to me. When we got into small services, we stopped caring about code duplication; in fact, we encourage code duplication. Now, if everybody is pulling Google reports, we'll probably tease that out, maybe turn it into a Ruby gem so that everybody can pull it in more easily the next time. So we do look for opportunities to pull out large chunks of code to make these services even smaller. But the drive is to make the service as tiny as possible. Will it not affect the maintainability of the application, because you would have so much code duplication as the application evolves? It hasn't been an issue.
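As promised above, here is a minimal sketch of the kind of self-monitoring service described in the metrics answer: it pulls a business number and "raises its hand" when that number goes the wrong way. All the figures, regions, and the threshold are made up, and the report fetch is faked so the sketch runs on its own; a real version would pull the actual ads report and publish an alert or message the team instead of printing.

```ruby
# Sketch only: a tiny monitoring service that raises its hand when a
# business metric drops. The numbers and threshold below are invented.
require "json"

# Stand-in for pulling a report; a real service would call the ads/reporting API.
def clicks_last_hour(region)
  { "UK" => 420, "India" => 15, "Japan" => 380 }.fetch(region, 0)
end

ALERT_THRESHOLD = 50 # clicks per hour per region (hypothetical)

%w[UK India Japan].each do |region|
  clicks = clicks_last_hour(region)
  if clicks < ALERT_THRESHOLD
    # "Raise your hand": in production this might publish an alert event or
    # page somebody; here it just prints a JSON alert.
    puts({ alert: "low click volume", region: region, clicks: clicks }.to_json)
  else
    puts({ ok: true, region: region, clicks: clicks }.to_json)
  end
end
```

Run on a schedule, a handful of services like this become the graphs and alerts on the development-room monitors.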
On the maintainability question: in fact, I thought the same thing as you. When we got to 40 services, I had this conversation, I think last year, with some senior ThoughtWorks guys, Erik Dörnenburg and Martin Fowler. I ran across them at a conference and said there's this really strange thing going on: we're writing lots of these services, like 40, 50, 100, all these little services running around. And they said, but that can't work. I said, I agree, it can't work, but it's working. They said, you should talk about this. Oh, okay, and hence here I am. I'm not sure why it's working so well, but we're up to 500 services and nobody has the problem you're referring to. I think it's because each service does one thing, and therefore what it does is very clear to everybody else, and when it doesn't work, you know who to chase down. That hasn't been an issue. Again, we could have dead services everywhere; we don't care about those, they're not costing us much money. But when the one that's running breaks, it's pretty easy to find who broke it. Sorry, I need to cut it off at this point, but you can certainly find me outside, because I think we're due for our break and then we come back in for 4:30. All right, thank you, sir.