Thank you. I have the unenviable task of standing between everyone and the distillery tour, so I'll move at a brisk pace. This talk is called "Bringing Down Phoenix," which is a bit of a tongue-in-cheek title. Everyone so far has been talking about the advantages of Elixir and Erlang, with processes being a big part of them, and how great they are. But processes can also be used for nefarious purposes: if you can do a hundred things at once in a good way, you can do a hundred things at once in a bad way.

I'm the lead developer at Bleacher Report. We've been using Elixir in production for about two and a half years. There's been a lot of experimentation and growth as we've developed our relationship with the language, and then with Phoenix, the framework, and it's been an overall great success. We have a few articles out about the things we've done, the savings we've had, and what we can do with that. Part of our adoption story is that we adopted it before Phoenix 1.0, so a lot of the "let it crash" was us unintentionally breaking things and recovering, and that instilled a habit of refactoring. At the time, new releases of Phoenix were coming out every couple of weeks before 1.0, so we decided we would refactor often, which made it really great for us to adopt the language, learn new things, try out experiments, and test things in production as well.

One of the things we worked on is load testing. We were formerly a Rails shop.
Our main app is about eight years old and started out with Rails 1. Load testing with Rails is quite different than with Phoenix: with Rails, you see the CPUs go up, you add another box, and so on and so forth. When we started using Exometer, we were seeing CPU usage around 3%, and we were shocked. I was sure that I was doing something wrong, measuring the metrics wrong, and so on. But it turns out that's just not a very good way to measure how Elixir performs. We've gotten a lot better; we now have about 60% CPU usage, which means we're getting closer to how we should be doing, with the right number of servers, used correctly.

To give you some idea of the scale of Bleacher Report, which is pretty important when I talk about load testing: we have about 15 million app downloads, and most of those are pretty regular users. We have about 1.5 billion page views per month, and we send over three billion push notifications a month as well. On the last NFL draft, which is one of our biggest nights of the year, we sent 225 million push notifications in the first three hours.

Here's a graph of last week's traffic. It's pretty standard; these fluctuations are normal. We have users all over the world, and we have an office in London as well, so when our day ends, their day starts. That means an influx of news and traffic that's more European-football-related, but there's still a lot of traffic on the European side of the world and the rest of the world as well.

This is our 95th-percentile latency graph for the app that powers the client apps: the Android, iOS, and web clients. On the left you see the 95th percentile, in milliseconds. It hovers around 100 milliseconds; the highest peak there is just under 200 milliseconds.
So it's quite good. Stefan Wintermeyer gave a talk about optimizations, and he said that as long as your response times are under one second, it's not noticeable by the human eye, and that sub-100 milliseconds is optimal, because over time people apparently notice anything above that. We're still working on getting it under 100 milliseconds, but for the most part I think we're pretty good with the way we're handling our traffic.

Another thing we don't have to do anymore: before events like the NFL draft, and especially the Super Bowl, the NBA Finals, and so on, we would have scale-up meetings. How many servers are we going to need? Because, you know, we were running Rails and we had a lot of tech debt, so the way we solved the problem was by throwing more money at it: we need to scale up, let's add another twenty servers, and so on. That's not a lot of fun. As a developer, there's no sense of pride when you have to scale up because your system is so clunky that the only way you can serve the traffic is by adding more servers. So now we don't do that anymore.
This could be any night, basically; this could be the NFL draft and so on. It's a testament to Elixir and Phoenix and how well they perform under the load we have.

So we break our old Rails app out into different services, and we try to do it by functionality. Part of the problem of having eight years of tech debt, especially in San Francisco, is that people move on. They write TODOs in the code saying "this is temporary, remove this later," and no one has any idea what that means six months or a year after those people have left. I'm sure everyone here is familiar with how temporary things are in code.

So when we started to roll things out, we would ask the people who knew the part of the code we were going to pull out how it worked — is this part important? Is that part important? — and we'd try to get a reasonable guess. When we rolled out the biggest chunk of our new Elixir services, it turned out that a lot of the little things people thought weren't important
were actually quite important. We ended up rolling back and forth a few times until we fixed all the kinks. What became obvious to us was, one, we shouldn't make the same mistake twice. If you roll something out and you didn't know what would happen, fine; but doing it more than once or twice is unacceptable.

So we started to duplex traffic from our production apps to our staging apps. The obvious benefit is that we get to see how the app handles under load. We get to see if we're missing anything or getting any errors downstream, and whether things are working as expected. This is good for the day-to-day use case, and this is basically how we do it. We call it "ghosting" traffic: you take the host you want to send the traffic to, you join that with the request path from the Phoenix conn, and then you just use Task.start and send the request, and off it goes. We use Task.start because it's fire-and-forget: it's not blocking, it's done. And we have metrics in place that we can monitor to see whether the traffic is actually going where it should be going, from production to staging.

That's worked out really well for us. It's given us a lot of confidence that we can deploy a new service and not have to roll it back, and with each service we roll out, we find it gets better and better. In fact, for the last service we rolled out, we really didn't have to make any changes after it went out. It's really nice to be validated by the work that you've done.

What about multiplying traffic?
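The ghosting step just described — join a staging host with the conn's request path, then fire the copy inside Task.start — can be sketched as a plug. The module name, the staging host, and the use of HTTPoison here are illustrative guesses; the talk only names `Task.start` and the host-plus-request-path join.

```elixir
defmodule GhostTraffic do
  @moduledoc """
  Fire-and-forget traffic duplexing, shaped like a plug: every request
  that passes through is also sent to a staging host. Names and the
  HTTP client are assumptions, not Bleacher Report's actual code.
  """

  @ghost_host "https://staging.example.com"  # hypothetical staging host

  def init(opts), do: opts

  def call(conn, _opts) do
    url = ghost_url(@ghost_host, conn.request_path)

    # Task.start/1 spawns an unlinked process, so the ghost request
    # neither blocks the real response nor crashes it on failure.
    Task.start(fn -> HTTPoison.get(url) end)

    conn
  end

  # Pure helper: join the target host with the original request path.
  def ghost_url(host, request_path), do: host <> request_path
end
```

Because the copy runs in its own unlinked process, a slow or failing staging environment can never affect the production response.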
Especially for a sports media company like Bleacher Report — sports fans are passionate, to say the least. There are any number of sports apps; ESPN is the biggest, and we're the second biggest behind them in terms of traffic, but first by far in terms of engagement with our users through our social media posts: Instagram, Facebook, Twitter, and so on. Our users want to get notifications first. If they start getting notifications faster from ESPN or some other competitor, they're going to go to that app. So we need to see how our system is going to respond in these situations. I mentioned the NFL draft; when Kevin Durant went to the Warriors last year, our traffic quadrupled in about ten minutes. You can't really plan for these kinds of things. We can't have a scale-up meeting that says, "Kevin Durant is going to be traded, how do we handle this?" We just have to have a system in place that handles it.

This is similar to what we did with ghosting. Instead of showing the code, I'll just walk through the steps: it fires off a task; we have an environment variable for the multiplier; it spawns that number of processes and then fires off those requests. And again, we can use the monitoring we have in place to verify that those requests are actually being received. And what's really interesting is what we found.
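The steps just listed — read a multiplier from an environment variable, spawn that many processes, fire one copy of the request from each — might look roughly like this. The variable name MULTIPLIER, the module name, and the use of HTTPoison are assumptions, not the production code:

```elixir
defmodule TrafficMultiplier do
  @moduledoc """
  Sketch of the traffic-multiplying step as narrated in the talk.
  """

  # Fire one unlinked, fire-and-forget copy of the request per
  # multiplier step, mirroring the ghosting approach.
  def multiply(url) do
    for _ <- 1..multiplier(), do: Task.start(fn -> HTTPoison.get(url) end)
    :ok
  end

  # Read the multiplier from the environment, defaulting to 1
  # (no amplification) when the variable is unset.
  def multiplier do
    case System.get_env("MULTIPLIER") do
      nil -> 1
      value -> String.to_integer(value)
    end
  end
end
```

Keeping the multiplier in the environment means the amplification can be changed per deployment without touching code.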
We were able to handle eight times our normal daily traffic, with no changes to our server configuration and no additional servers — just the configuration we already had in place. It turned out that at that point the database was the bottleneck; that's when things started to fall over, because its CPUs were getting pegged at a hundred percent. And this is a good problem to have. The product and business people at Bleacher Report would be thrilled to have eight times the traffic we have, and it also gives us a sense that the system we're building is future-proof, that we can handle fluctuations in load.

But this doesn't mean we've done everything correctly. We're not resting on our laurels; we have a lot of things to improve in the way we use Elixir and Phoenix. Part of that is that we developed our understanding of Elixir and Phoenix as they themselves grew, and we continue that — it's worked out really well for us.

But what about volatile traffic? With a multiplier you can just multiply traffic and send it, but what if you want fluctuations? How does that affect your app? Can your app go from five times to three times and so on? Does that work out?
Well, I wrote a library. As Jesse mentioned earlier, "The Zen of Erlang" by Fred Hébert is great. It's a really nice overview, especially if you're starting out with Elixir, of Erlang principles and the philosophy behind them. He compares processes to bees: they have a queen, they do some things, they come back. So I wrote this library called Locust, which is more of a chaotic kind of thing. Locusts swarm; they behave erratically from a distance, but there is swarm behavior, and that's sort of what this is mimicking.

I actually wrote this about a year and a half ago. I started it because we were using a third-party service for this kind of load testing, and I wondered whether you could do it with Elixir — and it turns out you can. I'd kind of forgotten about it until a month or two ago, when I decided to do this talk and thought it would be a fun one to give. So I looked at the code, and I realized how much my understanding of Elixir had changed: before, I was passing parameters around from function to function, and now I use data structures to pass things around. So, here are the locusts.
This is the chaos when it attacks your server. Processes themselves are very interesting. As was mentioned earlier, they're lightweight, they're independent, and they don't share state. To give you an idea of the cost of spinning up a process: if you do IO.puts "hello", the response time ranges from about 20 to 60 microseconds; if you spin up a process, it ranges from about 40 to 70 microseconds. So there's not much overhead to spinning up processes.

One of the nice things in the Elixir 1.4 release is the Calendar module, which means that instead of dealing with the Erlang tuple for timestamps, you can work with DateTime directly. That makes it significantly easier to deal with dates and times, and you can specify seconds, microseconds, and I think nanoseconds as well.

So let's see how this library, Locust, works. Some assumptions — this is why I don't have so much time. Obviously no CDN; this is just a vanilla app. There are only going to be reads, so we'll see how many of those it can handle, and it's just the app, with no background processing or anything like that.

When I was a kid I came to Florida, to Sanibel Island — the last time I was in Florida was twenty-some years ago. There are alligators everywhere on Sanibel Island, and it was great as a kid. I loved animals, I loved being outdoors, and hundreds of alligators all over the place was fascinating to see. So what kind of demo app could I use to show this in action? I came up with Alligator Radar. Basically it just rates alligators. It's topical, I guess, and it's completely nonsensical, totally artificial — you would never use this in production; it's just a demo. It uses Ecto for persistence, with no JavaScript or anything like that.
No Brunch, either. It has only one endpoint, GET /ratings, which returns a list of alligators and their ratings. For reference, one request takes about 197 milliseconds. I artificially seeded the database with a thousand records, and there's no pagination — these are all just ways to simplify the app.

As Omid was saying about the Observer earlier, this is what the Observer looks like when I run the Locust app against it. You can see that scheduler utilization is at a hundred percent, and those dips are when one set of processes finishes and the next one begins. You can also see that the processes are taking up the majority of the memory usage in the app. Here I was running it with 600 requests.

With 200 requests, it averages about 2.7 seconds; the lowest is about 28 milliseconds and the highest around 5.2 seconds, with about 105 errors. The errors are interesting. This is with Ecto, so a lot of the errors were database timeouts or pool timeouts connecting to the database.
With 400 requests it's about the same; the average is a little bit higher. One of the drawbacks of the library is that it doesn't pull the errors out of the average, so this average of 3.5 seconds also includes the requests that returned an error; if you pulled those out, the average would probably be much higher.

One of the other tricky things I came across while writing this: I'm using HTTPoison, with hackney underneath, for the HTTP calls, and when I spun up that many processes I was always seeing an extra second or two or three that I couldn't account for. The problem was that the hackney pool didn't have enough workers. I think it has 10 workers by default, if I'm correct, so it would send those 10 requests, wait for them to return, send 10 again, and so on until it finished. That was quite vexing, because I couldn't figure out what I was doing wrong, and it was driving me crazy. But you can use the Application.put_env function to set hackney's max connections; I raised that to about a thousand and it seemed to work fine. And then, if you go above 400 requests, it just returns all errors. So that's about the limit of what I could get with Phoenix on my local machine.

So how can we counteract that? What's a good way to get some more use out of this? Since it's only reads, it's fairly easy. You would think the database would be one of the larger chunks of time spent between request and response, so ETS is a good solution. ETS is Erlang Term Storage: in-memory persistence, so if your app restarts you lose the ETS table, but it's very handy. Some advantages of using ETS: constant lookup time, so whether you have one or a thousand records in your table, it's the same response time.
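That constant-lookup-time claim is easy to check in a few lines: reads from an ETS set are keyed lookups, so the table's size doesn't change the cost. The table and key names here are just for the demo:

```elixir
# Create an anonymous set-type table (options explained later in the talk).
table = :ets.new(:demo_ratings, [:set, :public])

# Seed a thousand rows, like the demo app's ratings table.
for i <- 1..1_000, do: :ets.insert(table, {i, "alligator #{i}"})

# A hit returns the list of matching tuples...
[{500, "alligator 500"}] = :ets.lookup(table, 500)

# ...and a miss is just an empty list, equally cheap.
[] = :ets.lookup(table, 1_000_001)
```

Wrapping either lookup in `:timer.tc/1` shows roughly the same microsecond cost whether the table holds one row or a thousand.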
You also get relative database independence, in the sense that you don't have to make a call to the database; you just return the cache. And it's really simple to set up and use, which makes it a very appealing option if you need to heavily cache your app.

Some downsides, however. The syntax is clunky: if you have to do a select or a match, that makes life a bit harder, because of the way the Erlang search functions are written. And it's also on a single node. This is another issue we're working on at Bleacher Report. When we first started with Elixir about two and a half years ago, there weren't really any guides for how to deploy Elixir or Phoenix apps, so basically we just put the application in Docker and threw that up on Elastic Beanstalk. It works fine, but the deploy time is a bit slow, there are some other things to work out, and you have the overhead of having to be competent with Docker. The limitation here is that if you have, say, five instances of your application deployed and you want to use ETS tables, you really can't, because it's non-deterministic: depending on which server you hit, you might get a different cache response. It's a bit unfortunate that we can't do that.

But now that we have a stable, performant, responsive system, as I showed in the earlier slides, we can focus on these secondary tasks, like improving the way we deploy and using more of the advantages of distributed Elixir and Erlang. This is also part of understanding that it's an iterative process — we can't get everything right the first time, especially when just starting with Elixir and Phoenix. I think almost anyone does this when they learn something new: they want to do it all correctly, and
they want to get it all done the right way. But if we'd done everything the right way, we'd still be using Rails — that's just the way it goes. You have to take time to improve and move on to things.

So let's look at how to set up ETS within your application. It's fairly simple. Assuming a Phoenix app, you go to your application file and add this call, AlligatorRadar.RatingsCache.create_ets_table, in the start function. It does what you'd expect: it creates the ETS table. You put it before any of the children start up, in case you get a call to the repo or the endpoint in the milliseconds between the app starting and the first request being made — so just put it at the top.

This is the create_ets_table function. It passes in a name for the table, along with :named_table, :public, and :set; :public is, I believe, so that other processes can access it. If you're not familiar with the @spec annotation at the top, those are Elixir typespecs. I think people are mixed on whether they're useful; personally, I think they're really nice to have. We also use the @doc attribute as well, but I left that out here because it's pretty self-explanatory. The spec is nice because it gives you objective documentation of the function: you know that create_ets_table takes no parameters and returns a boolean. You can also use dialyxir, the wrapper for Dialyzer, to validate your typespecs. I believe it now works really well with Elixir; before, we were getting some errors, but I think they've mostly been sorted out. So I'd recommend using it, even though the error messages are kind of opaque and mysterious, because it's nice to have something to validate.
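Put together, the cache module described in this section — the create function with its @spec, plus the lookup and insert functions discussed next — might look like this. The module name matches the talk, but the function bodies are reconstructed from the narration, not the actual slides:

```elixir
defmodule AlligatorRadar.RatingsCache do
  @moduledoc """
  Sketch of the ratings cache. create_ets_table/0 is called at the top
  of the application's start function, before any children boot, so the
  table exists before the first request arrives.
  """

  @table :ratings

  # As narrated: takes no parameters and returns a boolean.
  @spec create_ets_table() :: boolean
  def create_ets_table do
    # :named_table - refer to the table by atom rather than a reference
    # :public      - processes other than the creator may read/write
    # :set         - at most one object per key
    :ets.new(@table, [:named_table, :public, :set])
    true
  end

  # Cache the entire ratings list as the second field of one tuple;
  # :ets.insert/2 returns true on success.
  @spec insert(list) :: boolean
  def insert(ratings), do: :ets.insert(@table, {:ratings, ratings})

  # Returns {table_name, contents}; an empty list is the "cache miss"
  # signal the app uses to refill from the database after a restart.
  @spec lookup() :: {:ratings, list}
  def lookup do
    case :ets.lookup(@table, :ratings) do
      [{:ratings, ratings}] -> {:ratings, ratings}
      [] -> {:ratings, []}
    end
  end
end
```

Because everything lives under a single key, a restart costs exactly one empty-cache response before the refill logic repopulates the table.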
It's more intentional development: you say, this is what I'm doing with the function, this is what I expect, and if something else is returned, it means I'm doing something wrong.

Here's the lookup function. If this were a real application it would likely take some parameters telling you which table to look up and which item in the table to retrieve, but since this just returns every rating in the ratings cache, it takes no parameters. It returns a tuple with the name of the table and the contents of the table. In the event that your app crashes and restarts, there's logic in the code that says: the cache is empty, so after this request, refill the cache and move on. So if the app crashes, you only get one response where you'd return an empty cache. The insert function takes the entire list of ratings and returns a boolean; it just inserts the list into the ratings table as the second field.

With the ETS version there are about a thousand records, same as before, and as I mentioned, there's no cache warmed up at startup, so the first request is the same as it would be without a cache. After that it's about 16 to 90 milliseconds with the ETS table — roughly half the response time with Ecto — so it's an immediately obvious gain.

Here are the responses from the ETS table. Using ETS, the average response is about 3.7 seconds. The lowest is 23 milliseconds, which is quite nice.
The highest is around five seconds, and you get 72 errors, as opposed to, I think, around 150 before. So while there's still a number of errors, at least the error rate is much lower. The errors in this case tend to be "connection closed," because the number of available connections is too few. And again, when you double it to 400 requests, the average is about the same, the highest is slightly higher, and the errors go up much higher — which is what you'd expect: if some number of errors are due to connections being unavailable, you're going to have significantly more of them when you double the requests. Going any higher beyond that, it just falls off and you get all errors.

There are some things I'd like to improve with this library. One, as I mentioned earlier, is ensuring that you get more useful errors: saying these are this many errors of this type, and so on. The average response shouldn't include the connection-refused ones, because those bring the average way down; that average of 3.7 seconds is skewed by the requests that errored out. Another thing I'd like to add is time-based attacks. When we were using the third-party service, one of the things you could do was send this many connections over a certain amount of time; the way I have it now, you send a maximum number of requests and divide that into intervals, and that's how it sends them.

The way I've been testing this is with a library called Bypass. For those of you who aren't familiar with it, Bypass basically lets you simulate integration tests, which in this case is incredibly helpful: you don't have to make the calls out, you can just make a
fake request to a simulated response and get everything you need back from it. You can also change what the request and response are within Bypass, which is really helpful, so I'd recommend using it.

Now I can show you a quick demo of how this works. Let's see. First, the develop branch, which is the Ecto version. Let's send a thousand requests. With Locust you can also simulate POST requests, so it's getting there in terms of features. Basically what this does is send a thousand requests, incrementing by 200, so it runs 200, 400, 600, 800, and so on. You start that, and then you can immediately come over here and see — probably not very easily — that the response times are increasing with the number of processes, and then you start to see the timeouts happen.

Let me try to bring up the Observer as well. You can immediately see that the scheduler goes to a hundred percent, and watch the memory climb. As Omid was saying, this is a really incredibly useful tool for seeing what's going on in your application; you can see the memory allocation and so on as well.

Then if we stop that, we can check out the ETS branch. If we do one request, just to fill the cache, you can see that it's read by Ecto; and if we try it again, you can see there's no query in Ecto. So let's try it.
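As an aside, the wave sizes the demo uses — a maximum divided into even increments — are easy to compute. This helper and its name are made up to mirror the demo's behavior; Locust's internals may differ:

```elixir
defmodule Waves do
  @moduledoc """
  Split a maximum request count into evenly incremented wave sizes,
  as in the demo: 1,000 requests stepped by 200 gives waves of
  200, 400, 600, 800, and 1,000.
  """

  # Guard keeps the step an even divisor of the maximum.
  def sizes(max, step) when rem(max, step) == 0 do
    Enum.map(1..div(max, step), &(&1 * step))
  end
end
```

So `Waves.sizes(1000, 200)` yields `[200, 400, 600, 800, 1000]`, and the next run in the demo, 500 stepped by 100, yields five waves from 100 up to 500.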
Let's do 500, incrementing by 100. You can see that the times are significantly better, but they're still well beyond the range of acceptable response times, considering that responses should be sub one second and all of these are five, six, seven seconds and so on. And eventually it'll start timing out because of the connection errors.

So I went through this a little bit quickly, because we're running a little behind. But, I guess: use your processes responsibly, and thanks for listening.