So, first, thanks, Johnny, for throwing another conference. Last year was a lot of fun, and like he said, part of the appeal of the conference is that after it's over, you get to hang out and meet a lot of people. I hope to have that experience again this year. Also, this is a really interesting venue that Johnny just told me the history of, so if you're curious, you should check it out. It's pretty cool.

So this is Imperfect Architecture, or trying to accept impermanence. I'm the architect at Bleacher Report. We've been using Elixir in production for about four years. And while the speaker who was here earlier mentioned that small teams don't hire people, it turns out that we are hiring people. One of the reasons we're hiring is that we've had such success with Elixir, and with the way we've rewritten our platform. So we're hiring for back end, which is Elixir; front end, which is mainly React; and iOS and Android.

From that, you might think that everything is great with Elixir, the way people talk it up. And Elixir, and OTP more specifically, really have changed the way we develop our applications, of course, but also the way we think about architecture, and even the way we think about how we organize our teams and how they report to each other. Other talks have covered this, and José, Bruce, and I wrote a book, Adopting Elixir, about how to adopt the language successfully at your company and how you can reap the benefits. But that being said, it's not all roses. Bad things happen. Bad decisions yield bad results.

So we'll start with a complete system failure. This was perhaps the worst night of my time at Bleacher Report, and it came, ironically, about a month after our best night in terms of traffic. Last year, we broke our traffic record on NFL Draft night. We broke it by quite a lot, and everything performed really admirably. In previous years, we'd all been there on call in case something broke, because something inevitably did. This year, we just sat around, ate tacos, and chatted. It was really nice, and it gave us a real sense of confidence that what we were doing was correct, and that we were on a trajectory to handle ever-higher traffic records and do all these great things with Elixir.

So it was especially frustrating that only a month later, on a night that was maybe two to four times our average traffic, everything fell apart. How did we screw things up that badly in a month? Largely this was my fault, because I had taken a fairly hubristic stance: we're on the right track, we're doing the right things. We have these legacy bottlenecks that we know about, we have a couple of other dependencies that we know about, but everything's been fine so far, so let's move forward. It was a really static way of looking at a problem.

Part of taking our old infrastructure and moving to the new infrastructure was asking: how do we refocus everything? We had a monolith and some smaller service apps, and now we have a service-oriented architecture. How do we do that so that we can maintain the health of the system and expand it for years to come? So we decided: let's use an API gateway.
So an API gateway is exactly what it sounds like. One of the advantages of an API gateway is that it's simple: from a client's perspective, it's essentially a monolith. You have one host that you call, and you get all your information from there. Clients don't care what changes behind the scenes or how things get reordered; they just call this host, ask for information, and it comes back. It's also more secure, because you have one endpoint: you can close everything off except that one host and funnel all traffic through it. In the past, we had all sorts of ways you could call into the system, which made things really static, and we had legacy clients that we couldn't sunset, so we had to do all these weird things to wedge requests in and out. The gateway was really appealing for those reasons. However, there's a big disadvantage with an API gateway, as you can probably imagine: it's a single point of failure. If your gateway goes down, you're screwed.

So here's what happened, more or less. This is a simplified version of our architecture: the server icon on the bottom is the gateway, and servers A and B on top are just two services. We had two types of requests that would come in. One was a single-service request: I need this type of data, and it comes from only this one service. The second was a multi-service request, where you need to pull data from multiple sources to return the full response. This worked pretty well for the most part. With the single-service request, you had the convenience of going through the gateway, but you also had added latency and extra load on the gateway itself.

So that night, this is basically what happened. We had these couple of dependencies, and we also have a lot of third-party providers for scores and other information that we can't control. And I was looking at this from a very hubristic point of view: everything has been successful, we're doing the right thing, let's keep going, let it crash and all that. And we did let it crash, but the problem was that it crashed over and over and over, to the point where nothing could happen. It was frustrating, too, because we have different levels of data: primary data, secondary and tertiary data, and so on. We obviously want to deliver the primary data above all, but the problem here was that tertiary data was what was causing the outage.

I'll show you graphically what happened. On the top, that's the response time, which normally hovers around 100 milliseconds. This is a 24-hour period, so you can see that the spike was quite an increase, and it was sustained for about three hours. On the bottom is the number of 500s, which obviously correlates with what you see above. From the bottom graph it would appear that we don't have any 500s except for that period of time, which is absolutely not true. We always have 500s, and 500s are important because they tell you what's going on: we get a lot of junk requests, a lot of phishing requests, and those kinds of things, and it's important to understand that they're there. But this was such an order-of-magnitude problem that everything else looks flat by comparison.
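To make the two request shapes concrete, here's a minimal, hypothetical sketch of a gateway fan-out in Elixir. The module and service names are made up, not Bleacher Report's actual code. Note the bare `Task.await/1` in the multi-service path; that detail becomes important later.

```elixir
defmodule Gateway do
  # Single-service request: the gateway is just a pass-through,
  # adding a hop (latency and load) for no aggregation benefit.
  def single_service(service, path) do
    service.get(path)
  end

  # Multi-service request: fan out to several services concurrently
  # and merge the pieces into one response for the client.
  def multi_service(services, path) do
    services
    |> Enum.map(fn service -> Task.async(fn -> service.get(path) end) end)
    # Task.await/1 defaults to a 5-second timeout and crashes the caller
    # when a service is slow or down -- the failure mode described next.
    |> Enum.map(&Task.await/1)
    |> Enum.reduce(%{}, &Map.merge(&2, &1))
  end
end
```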
And as it happened, this was right before my vacation. It was a very strange month for me: we had celebrated this great victory with the Elixir system, and then there was this night where no one was even on call. I was literally walking out of the office when the alarm went off, and it just sort of continued into the night. It left me in this place of self-doubt and confusion. I was on my vacation, my wife and I were celebrating our first anniversary, and she's like, why are you moody all the time? We're in this beautiful place and you're moody, thinking about stuff that doesn't really matter. But it matters to me, because I take pride in my work. Like Rob was saying, you're a scientist, you're a craftsperson; you do all these things and you want to feel good about them.

So I started thinking: what did I do wrong? Beyond making the wrong technical decisions, how did we get to this point? I think it can best be summed up as falling into the perfection trap. We want to have the perfect architecture. We want to have the perfect design. We want to have the best. Basically, we want a superlative, fixed iteration. And that's not really the way things work. And nobody really writes about "good enough" design; well, I guess some people do. But in my mind, I was trying to plan a target architecture, thinking: this is what we need to do, and then we can tick a checkbox and move on to the next thing.

I've always been interested in etymology, since I was a kid, for whatever reason. In high school, where other people took German, Spanish, or French, the public high school I went to in Lexington, Kentucky offered Latin, for whatever reason. So I took Latin for four years, and it really changed the way I look at the world. When you think of the word "perfect", you think: without flaw, without error. But in Latin, the perfect actually means finished: the perfect tense is "I ate", "I ran", and so on. And the opposite of perfect, the imperfect, is the unfinished: the continual, the changing, the never-ending. And I thought this was a really interesting idea, this idea of mutating, of changing, of accepting change.

So I looked online, thinking surely someone has come up with this idea of imperfect architecture in software before. There were a couple of blog posts talking about how software is ever-changing, and that's sort of the idea I was going for, but they didn't quite capture the entirety of it. So I looked up imperfect architecture without the software, and I found an article on experiencing the architecture of the incomplete, imperfect, and impermanent, which is quite a title, by a professor, Rumiko Handa, at the University of Nebraska-Lincoln. In the article, Handa talks about how the moment a building is complete, or perfect, is not the end of the building's life cycle; in fact, it's just one stage of it. And I thought, yeah, this is really cool. There were examples throughout the piece of different kinds of irregular or imperfect architecture, but one quote really appealed to me.
And it says: "In sixteenth-century Japan, we find an artist, Sen no Rikyū, who relied on the properties of the imperfect, which in Japanese are called wabi, in order to create physical objects that induced participatory interpretation in the viewer." And to me, maybe through a somewhat distorted lens, that sounds like software development. The physical objects would be the architecture, and the viewer doing the participatory interpretation would be the programmer. And it all relies on this idea of the imperfect, this idea of wabi.

Now, I had never heard of this term wabi. But as it so happens, my wife, Aoi, is Japanese, and she's a Japanese calligrapher, so she's very familiar with this concept. Usually when I talk to her about tech stuff, she's like, oh, that's interesting, but that's what you do during your day, and this is what I do. But at this, her eyes lit up. She said, oh, this is great. This is wabi-sabi. This is what I try to practice every day, with every stroke of the brush, with every piece that I complete.

So wabi-sabi is these two characters, wabi on the left and sabi on the right, and they have somewhat contradictory meanings. Wabi is the noun, and wabiru is the dictionary verb form. It's this idea of feeling sad or troubled, of apologizing, but also of finding fulfillment in something simple and beautiful. And sabi, with the dictionary form sabiru, is about sensing beauty and tranquility, quietude, loneliness, and the way things age with time. The wabi-sabi aesthetic in Japan is akin to the classical Greek aesthetic in the West: an ideal you aspire to, a continual struggle. Beyond the words themselves, the aesthetic means asymmetry and roughness, but also simplicity, economy, austerity, intimacy, and an appreciation of the natural order and relationship of things.

And I thought, that's really nice. Maybe it's a bit too romantic as an idea for software, but I really like it, because it's contrary to moving fast and breaking things. It's mindfulness: you accept the fact that you're going to fail at times. And hence the loneliness: programming, like any real endeavor, I guess, can be very lonely when you're working by yourself, because you struggle with a problem and try to find your way through it. But it's also something beautiful when you do find that solution, or rather, a solution. And that phrasing implies that different solutions are possible. It doesn't mean there's only one solution to the problem; it means there's a solution at this point in time, and going forward you can think of things not as right or wrong, but as correct at this point in time or not.

To use Aoi as an example again: she does performance pieces and commission pieces, and for the commission pieces she'll write the same character literally tens or hundreds of times. I go into her studio and the walls and floors are covered with all this washi paper, and each sheet is a discrete solution. And sometimes, after she's done all the others, she'll come back to the first one she did and decide that it's the right one.
And I think this is a really nice perspective for software development, because it eliminates rigidity and allows for more solutions. So indulge me a little while we try to apply this to architecture.

Architecture has two components: systems and people. Most of the time when we talk about architecture, we talk about the system, service-oriented architecture and so on, but we don't talk about the people who are building it. How do you build a team to support this architecture? What happens when people leave? How do you retain people, and how does it all fit together? Of course, you can't have one without the other.

So let's start with systems, since that's what we usually think about when we think about architecture. Here was my mistake in how I thought about the system lifecycle. In the example of the API gateway: the old architecture had these crazy circular dependencies that had burned us over and over and over, so let's just pick an end goal, the API gateway, and go forward with it. We coded it, we QA'd it, though part of the reason we had this problem was that we didn't have full coverage, so we didn't expect this failure, and then we deployed, and we were done. A discrete thing: done, check the box, move on to the next thing. And of course life doesn't work that way, and software development doesn't work that way either.

It's much more like this. The system lifecycle is that you introduce something, a new concept, a new feature, a new piece of technology; you evaluate it; you adjust it; and then you're back at the beginning again, because at that point in time you have a new system to evaluate and adjust, and so on. That makes it much easier to think of production not as the endpoint, but as just another stage of a never-ending cycle.

So, going back to the API gateway: we weren't ready to give up on this idea quite yet. People have done this successfully. Netflix has done this successfully, and their scale and complexity are much greater than ours, so we should be able to do it as well. So we revisited the points of failure. What had we done wrong? What had we overlooked? And, like I said, we came to see the solution in a different light. We didn't, or I didn't, rather, see this as a final solution; this is just a solution. Also, going back to this idea of simplicity, we decided that instead of changing a bunch of things at once, we would change things in isolation and see how the system responded. That had been another problem: we were all working toward the same goal but not at the same time, so the left hand didn't know what the right hand was doing. We were introducing libraries that we hadn't fully tested, and since we run at such high scale, a lot of the libraries we were using, especially the newer Elixir libraries, had never been put under this kind of pressure. So we had to go back and re-evaluate those things.

And we identified something that should have been obvious, especially from an Erlang and Elixir OTP point of view. We had a couple of service dependencies, and, going back to primary, secondary, and tertiary data: why were we assuming that all of this is equally important to the user? The most important thing to the user is the content; the speed at which the content is delivered is secondary.
If we deliver content slowly, they'll go somewhere else; if we don't deliver content at all, they're also going to go somewhere else. So for instance, when you open the app, one of the views shows the content, which comes from one service, and the scores data, which comes from another. Content we can control; we have full control over that. Scores data we don't: we depend on a third party to deliver it, and that third party, like everything else, has outages. But for whatever reason I didn't consider this, which was a grave oversight on my part.

So we looked at the app from the front end: how do we degrade this app gracefully? How do we think about this app in terms of the way things actually behave? Things fall apart. I had been viewing this through the lens of a product person who wants the app to behave perfectly at all times, and that's simply not realistic: computers fail, networks fail, bugs exist. So we worked with the product team to go through each view in the app and decide which pieces are important and which are not. This was a group effort, with the understanding that architecture is not a solo activity, and there was a lot of back and forth with product, because they wanted to deliver the best experience, but also a realistic experience.

One small change we made on the back end got us a lot of the win. Again, this was something we had done in the early days when we started with Elixir, and we'd never had a problem with it, so we never revisited it. When a request comes in, we call the content service, which is the primary data source, and then we make an asynchronous call, using Task.async, to the secondary or tertiary service for its information. In the beginning we were collecting that with Task.await. Task.await has a default timeout of 5 seconds, and if the call exceeds those 5 seconds, it crashes the caller. That was the problem we had that night: the tag service and the gateway were locked in this cycle of crashing over and over, never able to recover, and because those two kept crashing, they took the content service down as well. As the graph earlier showed, it took about three hours to get past that.

James Fish, on the Elixir core team, he and I give an OTP workshop every now and again, and I think he put it really nicely: when should you use Task.await versus Task.yield? With Task.await, you should always expect the task to return, and if it doesn't return, that means there's a serious problem and it should crash, because the crash notifies you that something is wrong with the system. Task.yield, on the other hand, is for unpredictable, uncertain, or maybe trivial data. So this is the change we made. On the top slide was the original; on the bottom we used Task.yield, and as you can see, you can just match on the return value: you either get an ok result or a nil. That was really all the change we had to make across the various services, and now the response looks the same as before, except the content still comes through even when scores don't.

It was a small change, but it also changed the way we think about architecting our system. Even though Erlang's tagline is "let it crash", and it certainly did crash, we weren't anticipating this failure mode and didn't think to be defensive around it. This small change greatly improved the reliability of our content, which is the most important thing we serve to our users.
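Reconstructing those before-and-after slides from the description, a sketch only, with hypothetical names (ScoresService and game_id are not from the talk):

```elixir
# Before: the original pattern. Task.await/1 uses a 5_000 ms default
# timeout and exits the caller when the scores service is too slow --
# the crash loop described above.
task = Task.async(fn -> ScoresService.fetch(game_id) end)
scores = Task.await(task)

# After: scores are optional. If no reply arrives in time, shut the
# task down and render the view without scores instead of crashing.
task = Task.async(fn -> ScoresService.fetch(game_id) end)

scores =
  case Task.yield(task, 5_000) || Task.shutdown(task) do
    {:ok, result} -> result
    _ -> nil # no reply in time: degrade gracefully, content still ships
  end

# (Task.async/1 still links the processes, so an outright crash inside
# the task propagates to the caller; Task.Supervisor.async_nolink/2 is
# the fully isolated variant.)
```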
We also add metadata to responses, so we can see how often these upstream services fail and adjust accordingly.

And the final thing we did: we really wanted to keep this gateway idea, but it's kind of silly to have a request for a single service go through the gateway service and then on to the other service. So we use Fastly as our CDN to intercept requests; it's just a bit of Varnish VCL config. Now when a request for a single service comes in, it still goes to the gateway server URL, but Fastly picks it up and redirects it straight to the service. This greatly reduces the load on our gateway service, and it also reduces the latency, not by much, but by some. So this made for a much more resilient platform, and it was a handful of simple changes. Had we been thinking with this mindset of continual integration and continual change, we would have done this up front, instead of thinking of it as a fixed thing: okay, API gateway, done.

And these are the response times since then; I think this is for this year. This is not our gateway service, but the service that fell over during that night. There's still a strong dependency on some of the legacy stuff that we're moving over now, and we should have that done soon, but this is much closer to what we want: around 100 milliseconds, or sub-100 milliseconds, as our response time, because apparently, according to Google, a human can't perceive the difference between 20 milliseconds and 80 milliseconds when receiving content. The 500s are also more accurate to what we expect. Bleacher Report is owned by Turner, and they do some pen testing and these kinds of things, so this is well within our expected 500 level. And here's availability: on the left is that night, which is pretty terrible, along with how we measure it, and on the right is what it's been since then. It's not five nines, but it's well within the acceptable range of what we're looking for.

So now we've talked about how to improve upon a good idea that had a bad result and move on from it. But there's also the fact that technology isn't static. We've certainly made a lot of changes at Bleacher Report over the last few years, and, like most people here who have adopted Elixir or are considering Elixir, that's why you're here: you want to add new technology. So let me give an example of a seemingly suboptimal solution for incorporating new technology that, when considered on a longer continuum, makes more sense.

Because we've had this good success with Elixir, and because our platform is stable, we're reaching out to do more experimental or more greenfield projects; that's one of the reasons we're hiring. And for one of these projects we decided to try a new technology, Kafka. It makes a lot of sense: we use RabbitMQ for a lot of messaging, but here we wanted what is essentially a durable log, and Kafka fits that well. There are a few drivers: some Erlang drivers, a C driver with an Erlang wrapper around it, and then an Elixir driver. And of course we defaulted to the Elixir driver, which might have been a mistake, because the Erlang ones, like most older libraries, have been around a lot longer and have been through battle-tested cases. So we tried this, and it turns out that the producer works fine; it sends out the messages, no problem.
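For context, the producer side of an Elixir Kafka client is only a few lines, which is roughly why it "just worked". A hedged sketch, assuming the kafka_ex driver (the talk doesn't name which Elixir driver was used) and a hypothetical topic:

```elixir
# Minimal kafka_ex producer sketch (hypothetical topic and payload).
# mix.exs: {:kafka_ex, "~> 0.8"}, with brokers configured in config.exs.
{:ok, _pid} = KafkaEx.create_worker(:producer)

# Fire a message at partition 0 of the topic.
:ok =
  KafkaEx.produce("bleacher.events", 0, ~s({"type":"score_update"}),
    worker_name: :producer)
```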
But the consumer, using the Elixir driver, kept crashing, kept falling over. And this is a real problem: what are we going to do here? We've committed to this technology. Do we go back to Rabbit? Do we do something else? How do we evaluate this? Now, using Rob's Dribbledore spectrum, which I've probably mispronounced, my inclination was: let's fix the driver. This is the stuff that we live for. It's a problem; we can fix it. But if I look at it from the business side, the answer is: you don't know anything about Kafka, you've only just started evaluating it. How are you going to fix the driver for a complex application like Kafka?

So what we decided to do might seem counterintuitive at first: we used a Ruby driver, and we made a Ruby service whose only job is to consume messages. And in my mind I was like, this is bad. We just moved off of Ruby; why are we going back to Ruby? Are we going to fall back into that trap? What does this mean for the long-term health of our platform, or the long-term decisions about it? But I think it was a good decision, because it works at this point in time. We had to choose the least worst option, and this was a good option. Also, a lot of the issues we had with the legacy Ruby applications weren't Ruby's fault; they were our fault, because we didn't maintain them and we didn't upgrade them. So this is a suboptimal solution, but it works for now, and every three or six months we can re-evaluate it. The problem comes when we decide to integrate more thoroughly with Kafka: what does that mean for our system? We'll address that when we get there. But for now this works out really well for us, and I think it's a nice compromise between what we would actually like to have and what's available.

And finally, the next bit is maintaining the system. Of course, if everything is on a continuum, then everything is always being maintained, or it's not being maintained and it falls apart. This is where the people aspect comes in: people are the ones who maintain the system. We write the code, we maintain it, and so on. And the people lifecycle is very similar to the system lifecycle. You hire people; you try each other out; the employee and the employer evaluate each other; you decide maybe this person is better over here, or maybe they leave and go somewhere else. That's just the natural cycle of things. As we expand and grow our team, we want to make sure we keep the close-knit group that we have, and we also want to make sure there are opportunities for people to advance, to get what they want out of the job.

So we have this idea of trusted autonomy at Bleacher Report. We try to hire people who are independent, competent, and genuinely engaged and excited about this kind of work. These people don't like to be micromanaged. Competent people in general don't like to be micromanaged; and if someone needs to be, maybe you shouldn't have hired them in the first place, but that's another story entirely. The autonomy part is that we trust you to do the thing you say you're going to do. The trusted part is that we have standards in place; there's another talk about how we implemented these service reviews, and all of our services are, within reason, essentially identical to each other in terms of standards, documentation, tests, and these kinds of things.
So we already know that everyone maintains these standards, which means the code you're going to write is probably pretty good, or at least within the realm of acceptable code. This is sort of what it looks like, and it also looks a bit like a supervision tree. Each service we have has two service owners, and the service owners are essentially the architects of that service. They're responsible for keeping dependencies up to date, and for raising things like "this part of the app needs to be refactored" or "this part of the app is lagging", and so forth. They're also the guardians of the code: we use GitHub's CODEOWNERS file to approve pull requests, so when a code review comes through, at least one of the service owners has to sign off on it. What that also means is that, since they're responsible for it, if something blows up, it's at least partially their fault, because they allowed the code to go through, and we have these standards in place, so code that doesn't meet those standards shouldn't get through. This isn't a way to punish people; it's just part of responsibility. Everyone writes bad code from time to time; bugs slip through. The idea isn't to punish. It's a way for us to distribute the work across people, because if we hire four or five or six or seven more people, how is one person, or a handful of people, going to maintain the entire architecture in their head, or stay responsible for all of these applications? So this has worked out pretty well for us so far.

And now that all the apps are up to date, we've implemented this idea of application reviews. Application reviews take the service review and flip it around; they're bottom-up. Since the owners are responsible for the service, a few of us meet about each service two or three times a year, depending on the frequency of development of the service, and it's a way for the owners to say: this is what changed. It frames the application as this never-ending, changing thing, and asks how we maintain the health of this application. Again, this is working out pretty well for us so far, and we'll have to see, when we scale and add more people, whether we can replicate it.

A nice benefit of this is bi-directional ideas. At a lot of companies where I've worked, it was: here's a ticket, here's how you do it, and then you do it. With service owners and trusted autonomy, the people who own the service can say: I know this service better than you do, this is a problem, and this is a way we can fix it. For example, we have a push notification system. It was originally in Ruby; we rewrote it in Elixir, and it was significantly faster. Everyone was very happy: people on Twitter, internal people, upper management, all really pleased, because it was a visible benefit that Elixir brought us. But since it was our third Elixir application, it had aged a bit. So one of the people responsible for the app said: we need to address some of these things. It was using the legacy APNS and GCM HTTP protocols, so we needed to move to HTTP/2. This developer went through everything and said, this is my proposal, this is what we're going to do, and we went through it together. He had some issues with the somewhat untested Elixir libraries, but this turned out to be the optimal solution, because he worked with the developer of the library to fix the problems.
No one at our scale had used that library before, so we went through it and he fixed all the problems. And so Mike was able to take this idea that he had and, of his own volition, carry it all the way through: propose it, build it, and now we're rolling it out. We'll have an even faster push notification system coming, like, next month. So this is a really nice avenue for people to express their ideas, and also to introduce new technology; it's the same way we introduce new technology generally: if you're working on a project, here's a proposal, we want to try this out.

And now that we have our system structured so that we can isolate failure, we can also isolate experimentation, because processes cost nothing to spin up. We have this thing we call ghosting, but it's essentially duplexing: whenever a request comes in, we fire a Task.start that duplicates it to the staging environment, or to an experimental stress-testing environment. So we get to see real production traffic and how it affects the new system.

This also speaks to the nature of everything ebbing and flowing. A close friend of mine, a couple of years ago when I was having some frustrations with work, I was trying to explain to him why things were the way they were, and I was coming from a somewhat arrogant position: I've done all this stuff, why are the outcomes not in my favor? And he said to me: no one is irreplaceable. At first I thought, why are you saying that to me? Do you want me to leave? But looking at it from the other side, it's a much more freeing thing. No one is irreplaceable, which means you're free to leave your job whenever you want. If I were to leave tomorrow, should I feel guilty about leaving Bleacher Report? No, absolutely not. You work hard at your job, and when you decide to move on, you move on. And by expecting and allowing for this idea of people coming and going, it becomes much easier to predict how things will go. When someone leaves, we pull someone else in and add them to another service review, or we hire someone and bring them onto another service. So we have this continual shifting of our service owners, the knowledge stays spread out pretty evenly, and people have a good idea of what's going on. If someone left, nothing would fall over; we'd have an adjustment period and pick up the slack.

Another nice thing about this is that we like to hire junior developers who are just starting out with Elixir, or people who are passionate about the language, whether they currently use Ruby, Python, Scala, whatever; if they show aptitude, we'll bring them on. And by embracing this ebb and flow and this service-owner concept, it's really easy to get them up to speed, because they essentially get to work with a mentor, a service mentor, who brings them up to speed, and then they start adding new services to their responsibilities as well.

And finally, bringing it back to this circular idea of imperfect, or impermanent, architecture: this really helped me from a planning perspective as well. What are the consequences of my decision? It doesn't have to be the right answer for the long term; it has to be the right answer right now, chosen from a number of possible answers.
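Going back to the ghosting/duplexing idea for a moment, here's a minimal sketch of what request mirroring can look like as a Plug. The module name, staging host, and the choice of HTTPoison as the HTTP client are all hypothetical; this isn't Bleacher Report's actual implementation:

```elixir
defmodule GhostPlug do
  @moduledoc "Fire-and-forget duplication of incoming requests to staging."
  @behaviour Plug

  # Hypothetical staging / stress-testing host.
  @mirror_host "https://staging.example.com"

  def init(opts), do: opts

  def call(conn, _opts) do
    # Mirror the path only (query string omitted for brevity).
    url = @mirror_host <> conn.request_path

    # Task.start/1 is unlinked: if the mirrored request is slow or fails,
    # it can never crash or delay the real production request.
    Task.start(fn ->
      HTTPoison.get(url, conn.req_headers)
    end)

    conn
  end
end
```

Because the mirror task is unlinked and fire-and-forget, production latency is unaffected even when the staging environment is down.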
One of the problems when we first started out with Elixir was that we kept trying to optimize things: we didn't want to deploy to production until it was perfect, until it had no errors. Thinking in these terms instead, what you deploy to production is simply what you deploy to production right now, and because it's very inexpensive to deploy new things, you don't have to over-optimize. All of these statements are grounded in the context of high test coverage, QA, and the standards we have in place: what goes to production is something that, by all of our metrics, we believe should perform well.

Some advantages of this type of thinking, beyond that: first, it's non-dogmatic. When I first started out as a programmer, and probably a lot of people have had similar experiences, you would read "this is the way to write a controller in this language" or "this is the way to do X", and instead of treating it as one possible approach, it was: okay, I just need to copy this down, it works, move on. By treating these best practices as a guide rather than as rote or scripture, it becomes much more of a learning experience. There isn't a single path to get where you want to go; there's an infinite number of solutions.

Building on that, there's this idea of creative solutions. Some of the problems we've solved have been solved in orthodox ways and have proven successful for a long period of time; some of our solutions have been unorthodox and haven't been successful, so we've re-evaluated them. Again, this means we both iterate in small chunks and acknowledge that we have a number of competing solutions, and just because one works out, it doesn't mean the others are failures. On the contrary, it just means that this is the best solution we have for this point in time.

What I think is also really nice is that it's shared responsibility. Architecture is not one person's responsibility. It doesn't matter what your title is: if you work at Bleacher Report and you're on the development team, it's your responsibility, whether it's just one service or a number of services. We all work together, and moving to this model of these sort of supervision-ish trees has really enabled us to catch things we missed before. We wouldn't have had the problem that night, when everything fell over, had we been talking to each other better. And how do you talk to each other better? You have these ownership trees, these application reviews, and the other check-ins we do to make sure everything is working together the way we want it to.

There's also this idea of growth, decay, growth as the natural order of things. In Handa's article there's a sentence about the perfect building and how things degrade over time, and in the wabi-sabi concept you embrace this: you embrace asymmetry and dissonance, and you understand that whatever you do now is going to decay in a month, or five months, or a year, and so on. I think the trick is to have these practices in place, standards, tests, docs, etc., so that your growth periods are strong and your decay periods aren't as bad. If we compare the Elixir apps we have today with the Ruby apps we had before: we had growth in terms of features, but the decay was so massive that every growth period after it was so greatly offset
by the decay that we never fully recovered and never got back to that initial growth period. The hope is that with these standards in place, and with this idea of looking at everything as a continuum, we won't have those deep valleys of decay anymore, and hopefully that will let us come up with more exciting things and build out these stable platforms with exciting features. And again, we're hiring, so if you're interested in working on this, come see me afterwards.

Audience: I was curious, when you were talking about working with product to decide how the system can function in a degraded state in an acceptable fashion: did you have specific monitoring around those states, and if so, how did you implement it?

So we use Exometer for all of our monitoring, and we use the StatsD module to send it to Datadog. We already knew where the problems were; we even had the graphical representation and the data to show where they were. We just never realized the extent to which this could bring down the whole system. Something I didn't really talk about before, and this is probably a holdover from a monolithic way of thinking, is that in the past our API responses were all or nothing. Play-by-play data changes pretty frequently: in a basketball game it can change multiple times a minute. So why would you send the entire response, with the play-by-play data and also the content and any other metadata that might be involved, every time? We're moving our APIs toward being incremental APIs, or progressive APIs, so that we don't have that problem, and those lines of demarcation are what helped us work with product: what are your top five things that you always want to happen? That's the way we design our apps now. And of course, any third-party library or third-party data we receive, we always have to treat as unreliable, because we have no control over it, and that's bitten us a few times as well. It's a struggle, a back-and-forth thing with product, because they're like, well, that's not good enough. But if we don't have control over the upstream API, you just have to accept that failure is a possibility, and once they came to accept that, that friction was gone. And because of the success we've had since then, we broke our traffic record two more times last year, and those were without problems, so it seemed like this idea had some merit, and we were able to convince them that, given these successes, this was a reasonable way to go forward.

Audience: I'm noticing the phrase "progressive API design". Are there any resources where someone could look up what that means?

I think we have a loose understanding of it; I don't know if it's actually an established term or not. We were sort of regrouping after this incident and decided to split up the APIs, and going forward, to design things like this, this is what we came up with. Surely there has to be something written about it, maybe under another term, but if not, maybe we can write a blog post about it, if that would be helpful.

All right then, thanks, Ben.