Hello everybody, I'm Paola Moreto. I'm the co-founder of a company called Nouvola. You can find me on Twitter at @paolamoreto3. A little bit about me: I'm a developer turned entrepreneur. I've been in the high-tech industry for a long time, and I love solving hard technical problems. I come originally from Italy, but I've been in the US for 20 years. And if you don't find me writing code, I'm usually outdoors hiking.

So this talk is about performance. We've heard it loud and clear here at RailsConf: faster is better. We all know what performance is, but it's good to understand the real impact of low performance. When I talk about performance here, I really mean speed and responsiveness: the speed and responsiveness that your application delivers to your users.

There is a famous quote from Larry Page that says speed is product feature number one. So you need to focus not only on your functional requirements but also on your non-functional requirements, and speed is paramount for any web application today. There is a lot of research and data that backs this up and shows the impact of low performance. It impacts your visibility: it definitely affects your SEO ranking. It impacts your conversion rate. It impacts your brand and the perception people have of it, your brand loyalty, your brand advocacy. It impacts your costs and resources, because the tendency with low performance is usually to over-provision, and that's usually not the right answer.

And if you have a DevOps model, if you've moved to a fully combined engineering model where development, QA, and ops are combined, and you've adopted continuous delivery and agile methodology, which is the standard today for web development, then it becomes even more critical.
So performance today in the cloud, where you have a fully programmable and elastic infrastructure and you're adopting continuous delivery, becomes even more critical. You need to be able to bless every build and make sure that not only does it work, but it works at the right speed.

So then what do you do? How do we tackle this problem? Well, the first thing is you need data. This is a quote that I stole with pride from a talk yesterday, and I love it: in God we trust; everybody else, bring data. Is it a good model to deploy, hope for the best, and have your users essentially be your QA department? It's not. I know of a company with an e-commerce application that says, oh, we know when we have a slowdown because our users complain on Facebook. That's not usually the best way to do it. So you need data, and you need a lot of data.

So let's get started. There are different types of data. On the right-hand side you have your production deployments, where you have your live traffic, and that usually goes under the big umbrella of monitoring. There you have all sorts of monitoring data and techniques. On the left-hand side is your testing environment; usually people have a pre-production or staging environment, and sometimes you can also test in production. There you have your synthetic traffic: you're simulating, you're creating your users, and you're doing performance testing. These are the two most typical sources of data today.

So let's start with monitoring. You have many types of monitoring. You monitor your stack. You monitor your infrastructure. You do some sort of log aggregation. You monitor what users are doing with your application and what the user behavior is: what are the most typical behaviors, and what are the corner cases.
And then you have what is called today streaming analytics, or high-frequency metrics, where there are solutions that pump data out of the platform at speed. These are some examples of the solutions that exist today. We're not associated in any way with any of them; it's just to give you an idea of the wide spectrum of monitoring and data instrumentation solutions you can find. All of these complement each other; there is no one piece that fits all, and it all depends on your application. And there is an interesting problem today: you get all of these nice dashboards, and how do you correlate all of this data and figure out exactly what's happening? But the first step is definitely monitoring. As they say: first instrument, then ask questions.

However, monitoring is not enough. And why? First of all, your live traffic is noisy. You have all sorts of users doing all sorts of things, so it's very hard to troubleshoot. If you have a scenario you're interested in, and it's perhaps problematic, at the same time you have other users doing other things that make the system respond in unexpected ways. The other problem with monitoring is that it's after the fact. Monitoring doesn't help you predict, and it doesn't help you prevent problems that might occur with your application. As a friend of mine says, monitoring is like calling AAA after the accident. It's useful, but usually you want to prevent the accident instead.

That being said, monitoring is the first line of defense, the first thing you've got to do. So then what are you going to do? We're going to pair up performance testing with monitoring. The two complement each other really, really well, and here's why. We're going to look at the left-hand side of our data sources. Here we're looking at synthetic traffic, not your live traffic: you have the ability to create your traffic, and you're going to do some performance testing.
It could be on a pre-production environment or a staging environment. Usually you don't want to mix your synthetic traffic with your live traffic, and you don't want the synthetic traffic to have an impact on your real users; that's why you test on pre-production. But you could also test in production for specific applications or at specific times of day, et cetera.

With performance testing, the users are not real, but the traffic is absolutely real. You have total control over the amount of traffic and over the user scenarios, the workflows, because that's how you designed your tests. So troubleshooting is simplified here, because, number one, you have an easy way to reproduce specific scenarios that you thought were problematic. And number two, in terms of peeling the onion, which is a typical troubleshooting approach, you have already controlled two variables: the amount of traffic and what the users are doing.

The other advantage of performance testing is that you get end-to-end user metrics. You're measuring exactly what your users are experiencing. This is not about server metrics or database metrics or application metrics or Ruby metrics; it's the true end-to-end view. We've seen cases where there was a factor of seven between the end-to-end user metrics under traffic and the server metrics: the server appeared not to be suffering, but the users were not getting good performance at all. So in order to have a good, complete view, you really need the end-to-end user metrics.

And if you can, test with realistic scenarios, as close as possible to what your users are actually going to do. The goal here is to figure out problems in advance, before they happen. Again, one of the problems with monitoring is that it's after the fact; here we're coming in before monitoring. We're catching things before they happen, so that you have time to optimize.
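As a concrete illustration (my own sketch, not a tool from the talk), here is a minimal Ruby example of a linear-ramp performance test that measures end-to-end response times. To keep it self-contained, the backend is simulated with a sleep whose latency grows with concurrency; in a real test each virtual user would issue Net::HTTP requests against a staging URL instead.

```ruby
# Sketch of a linear-ramp load test. The "backend" is simulated so the
# example is self-contained; in practice each virtual user would issue
# real HTTP requests against a staging endpoint.

# Stand-in for a real request: latency degrades as concurrency grows.
def simulated_request(concurrency)
  latency = 0.005 * (1 + concurrency / 4.0)
  sleep(latency)
  latency
end

results = {}
mutex = Mutex.new

[1, 2, 4, 8].each do |concurrency| # the linear ramp steps
  times = []
  threads = Array.new(concurrency) do
    Thread.new do
      t = simulated_request(concurrency)
      mutex.synchronize { times << t }
    end
  end
  threads.each(&:join)
  results[concurrency] = times.sum / times.size # average response time
end

results.each do |users, avg|
  puts format("%2d users: avg response %.4fs", users, avg)
end
```

The point of the sketch is the shape of the test: you control both variables (concurrency and the scenario), and you measure from the client side, which is exactly the end-to-end view discussed above.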
And you can't optimize unless and until you measure. You want realistic scenarios: if you have mobile applications paired up with your web application, it's absolutely critical that you test your mobile traffic as well. If you're around the world, if it's a global application, you need to test from different geos. And then, end to end, measure the KPIs for the end-to-end user experience.

As for the types of metrics, a lot of this is around time. Time goes by a variety of names, response time or latency, but essentially it's time to complete transactions, time to complete specific requests, averages, distributions. You can also get throughput, the number of successful requests per test or per specific time interval. And then you can get error rates: if there is some suffering on the server side, you can start seeing errors. Again, the goal is to resolve issues before you deploy.

And then, when to test? Software changes all the time, and as such it's important to understand whether a specific change is going to impact how your users interact with your platform. It's not just important that the software does what it's expected to do; it also has to do it at the right speed. The other point is that even if you don't change anything, things change around you. Applications today are spidery: they have hundreds of possible optimization points, they pull in plug-ins, you're sitting on a cloud infrastructure. This is a complex problem, and the only way around it is to test often. So test for every change. Test if you're going into a traffic peak; you don't want to go into that blind. Test if you have any kind of infrastructure change or change to your deployment.
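To make those KPIs concrete, here is a small Ruby sketch (the sample values and the 10-second measurement window are made up for illustration) that computes average and p95 response time, throughput, and error rate from raw (time, status) samples:

```ruby
# Each sample is [response_time_seconds, http_status]; the values and
# the 10-second measurement window are invented for illustration.
samples = [[0.42, 200], [0.38, 200], [0.95, 200], [2.50, 500], [0.40, 200]]
window_seconds = 10.0

times = samples.map(&:first).sort
avg   = times.sum / times.size
p95   = times[(0.95 * (times.size - 1)).round] # nearest-rank percentile
throughput = samples.count { |_, code| code < 400 } / window_seconds
error_rate = samples.count { |_, code| code >= 400 }.fdiv(samples.size)

puts format("avg %.2fs  p95 %.2fs  %.1f req/s  %.0f%% errors",
            avg, p95, throughput, error_rate * 100)
```

Note how the percentile tells a different story than the average: one slow, failing request dominates the p95 while the average still looks tolerable, which is why distributions matter and not just means.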
There is a very good example from a while ago, several years back, where Heroku changed something in their routing system. That change was not publicized, or at least not openly publicized, and it only impacted a specific set of applications, but it impacted them greatly. People realized because they started taking measurements and saw a big difference. The applications did not change, but in that specific case the cloud provider made a big change, and the only way to identify this kind of thing is to measure.

But guess what? This is still not enough. And why? Well, you can get results like this, where you say: wow, I have a lot of errors. Under traffic, as I apply a linear ramp (that's the green bars), I get a ton of errors, my response time increases dramatically, and then at some point it decreases, because the server isn't even responding to requests anymore. Or you could get a result like: my tests are telling me that with 10,000 concurrent users, my response time deteriorates from 400 milliseconds to 2.5 seconds. So okay, your tests are telling you that your system is slow, or will be slow, under specific traffic and scenarios, but it's still not actionable. You still don't know what to do; you just know that you're going to have a problem. It's almost like telling you: when you have 50,000 users on your platform, you're going to have a fever, but there is no medicine. So what if we could extract some more information from this data and find a medicine?

So stay with me. If you look at the typical performance troubleshooting process, ironically, the majority of the time is spent, number one, in reproducing the issue with the right data, and number two, in isolating the issue. Once you've done that, the actual fixing of the problem is relatively straightforward.
On reproducing, I have a very good example. There is a company I know whose client was a big bank in India, and they had performance problems with their applications. It took two weeks, with the time differences and the engineers on two different continents, two weeks with a whole team in a room and constant conference calls, before they were able to just reproduce the problem and have the data. So reproducing is partially addressed by performance testing. But then you're left with the issue of isolating the problem, and isolating a problem usually takes a lot of time and effort: developers are left doing a lot of correlation across data, and it turns out to be a manual and highly time-consuming process. Once you're done with isolating, though, the fixing becomes relatively straightforward.

So what we want is the ability to progress. On the far left, before testing, you're oblivious: you don't even know that you're going to have a problem. Then, once you test, you know: yes, we're going to have a problem, I've found out that I will have a fever at 50,000 users. Then we want some help in localizing the bottlenecks, because we know that localizing is going to take a long time. After that we can fix, and that leads to happiness.

So now we're going to add the third step. We talked about monitoring and all the data instrumentation you can use to extract data from your application under live traffic. We talked about performance testing, and how you can use synthetic traffic, creating the traffic you want, to see how the application responds. And now we're going to extract and add a layer of information from our data to help us localize the problem.
So how? What we want are leading indicators of performance issues. Again, we don't want after-the-fact; we want to figure out these problems beforehand, so that you have the time to fix and optimize and deliver the performance you want. We have found that if we can localize, if we are able to pinpoint where in these sprawling applications the problem resides, then we can accelerate the troubleshooting process, which is otherwise quite painful. And we want actionable data.

In order to do that, we're going to add something else. In the middle we have our monitoring: you have your live traffic, all the monitoring data, and your data instrumentation. We already talked about how it pairs up really well with performance testing; the two go together. Now we're adding another layer: some data mining and machine learning to extract another layer of information from this data and help us localize.

This is how we do it; this is an example from a prototype that we built. You apply a linear ramp of traffic, so you're doing synthetic testing, and at the same time you use the data instrumentation that is usually used for your live traffic, but in this case over your synthetic traffic, so it can be in your test environment. Then we mix it all together; if there are historical data for that application and that test, we use those too. And then there is a data analysis step that attempts to cluster and identify statistically meaningful variations in all of these timings, and to determine whether those variations are clustered around a specific component of the application. So this is essentially how it works.
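As a very simplified illustration of that idea (my own sketch, not Nouvola's actual algorithm), here is how flagging metrics whose timings deviate under load, and then grouping the flagged metrics by component, might look in Ruby. The metric names, timings, and the 3x threshold are all invented.

```ruby
# Per-metric timing series (milliseconds, made up) across the ramp steps.
# Metric names are "sector/metric", mirroring the sector grouping above.
series = {
  "browser/render"  => [10, 11, 30, 80],
  "browser/asset"   => [5, 6, 18, 40],
  "app_stack/query" => [20, 21, 22, 21],
  "server/io_wait"  => [2, 2, 2, 3],
}

# A metric is "suspicious" if any later sample exceeds 3x the mean of
# its first two (low-load) samples. Real systems would use proper
# statistical tests; this threshold is an illustrative assumption.
def suspicious?(timings)
  baseline = timings.first(2)
  mean = baseline.sum.fdiv(baseline.size)
  timings.drop(2).any? { |t| t > 3 * mean }
end

flagged = series.select { |_, t| suspicious?(t) }.keys
by_sector = flagged.group_by { |name| name.split("/").first }
culprit = by_sector.max_by { |_, metrics| metrics.size }&.first

puts "suspicious metrics: #{flagged.inspect}"
puts "variations cluster in sector: #{culprit}"
```

With these invented numbers, both browser metrics degrade as the ramp grows while the app stack and server stay flat, so the variations cluster in the browser sector, which is the kind of localization the talk is describing.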
So first you run a test, a performance test. If your response time is good and you don't have any slowdown, then there is no problem at all. But if you have a slowdown, and we go back to the example where you had all of those red slowdowns and errors, then you're left with the problem of figuring out how to fix it. The first thing we do is remove what we call network and external effects. We want to see if there is any correlation with data such as network time, DNS time, SSL time, and other data that are external to our stack. If we don't find any correlation with those, they are excluded from the data analysis.

Then, assuming there is no correlation there, we look into the data set. The data analysis identifies statistically meaningful differences using clustering and longitudinal analysis, determines whether those variations cluster around a specific sector, and then the results are displayed. The whole point is that, across the thousands of available metrics, we look at variations in real time and attempt to cluster them across specific "sectors," which are components of the application. This all uses specific data analysis techniques. What we use is a mix of techniques, not just one; they all go under the umbrella of machine learning, unsupervised machine learning, or data mining. Again, it's not just one technique, but we definitely use a lot of clustering and longitudinal analysis.

So, ready to see some real data and a real-life example? I'll give you a couple of examples. This is a typical web application, a real application, not a test application.
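Before moving to the examples, the external-effects filtering step described above, checking whether a slowdown correlates with data like network or DNS time, can be sketched with a plain Pearson correlation. The numbers and the 0.8 threshold are my own illustrative assumptions, not Nouvola's.

```ruby
# Pearson correlation between end-to-end response time and an external
# metric. If |r| is low, the external metric is ruled out as the cause
# and excluded from further analysis.
def pearson(xs, ys)
  n  = xs.size.to_f
  mx = xs.sum / n
  my = ys.sum / n
  cov = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  sx  = Math.sqrt(xs.sum { |x| (x - mx)**2 })
  sy  = Math.sqrt(ys.sum { |y| (y - my)**2 })
  cov / (sx * sy)
end

response_times = [0.40, 0.45, 0.60, 1.10, 2.40]      # degrades under load
dns_times      = [0.021, 0.019, 0.020, 0.022, 0.018] # roughly flat

r = pearson(response_times, dns_times)
if r.abs > 0.8 # illustrative threshold
  puts "slowdown correlates with DNS time -> investigate the network"
else
  puts "no correlation -> exclude DNS time, look inside the stack"
end
```

Here the response time keeps growing while DNS time stays flat, so DNS is excluded and the analysis proceeds to the application's own components.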
First we run a performance test with a linear ramp up to a thousand concurrent users. As a rule of thumb there is roughly a factor of a thousand, so that corresponds to about a million monthly visits; that's the type of peak you could expect with that traffic. As we apply the linear ramp, the response time deteriorates: it's actually three times as high under traffic as it is without traffic. So this is definitely a case worth investigating.

Then we add the data instrumentation. The beauty of this model is that you can apply the method to pretty much any data instrumentation you have or want to use; it's not married to one specific method or approach. In this case we used a specific data source, but again, you could use anything. The way we look at the data is that it's categorized under sectors, the various components. For each sector you have categories, then classes, then methods, so a lot of data comes up for each sector. While the test is running, an agent pumps this data constantly into our algorithm, and the algorithm works in real time to do the clustering analysis. The cluster result is kind of an eyesore, but basically you identify the methods that show variations in timing at the same time as the response time starts increasing, so they correlate well with the performance testing results and with the end-to-end user metrics.

And this is the end result. As a reminder, what you see on the left are the sectors; the sectors are large groups of data, and you can dig down into them and see exactly which component of the group created the problem. What we see here is that, for example, although this test ran successfully, without errors, with a load of a thousand concurrent users at the end, the browser, everything that goes under the browser component, starts suffering right before 200 users. It starts suffering at the very beginning, then it enters the yellow zone, what we call the T-zone, the transition zone, where it's deteriorating but not too badly, and then it enters the red zone, which is way, way over where it's expected to be. The next one is the app stack, which is essentially what's happening with your Ruby; that starts deteriorating right around 300 concurrent users and enters the red zone later. So even though at a thousand users you see a tripled response time, things start deteriorating a lot sooner.

Another very critical data point here is which component goes first, because sometimes you have a chain-reaction effect: if one piece slows down, the others slow down as well. So what is the first component that starts slowing down, and slowing down the system? In this specific example it's the browser. Now, the browser again is a group of data, represented here, and underneath it there are another hundred data points, so from here you can dig down and see exactly which components within the browser caused this slowdown. Again, the objective is to identify proactively, before you actually have a thousand live users on your platform, what is going to happen under a specific workflow or scenario, and which components of your application are the root cause of the problem.

I'll give you another one. This is another application; the categories are the same just because we look at the same data. I don't have the raw data here, but you could dig down into all the methods that caused this. Here you have an interesting perspective: you still have the browser, with the app stack closely following, but then you have what we call server and software, which goes from green to red without even entering the T-zone. There is almost a step function where the metrics go from really good to really bad.

So, in summary, what we covered today: speed is product feature number one; performance is paramount; faster is better. How do we tackle that as developers? With data. We start with monitoring: monitoring is a good start, the first line of defense, but not enough. We add performance testing, which complements the monitoring techniques well. That's still not enough, because what you want is some help in localizing the problem. So we have performance testing plus data instrumentation plus machine learning: another layer that we can extract from our data, which we have called predictive performance analytics. And we got to see it in action in a couple of examples.

So thank you. I think I can take some questions now. You can find me on Twitter at @paolamoreto3, and I'm happy to hear your questions and feedback.