I present to you Kubilai Karechi, who works for Bloomberg; he's going to talk to you about load testing with a Python open source project. Kubilai?

Thanks for coming, everyone. My name is Kubilai, I work for Bloomberg in London, and today I'm going to talk about load testing, and specifically about a Python-based open source tool called Locust.

So this is a product we own at Bloomberg. It's like a spreadsheet application specialized for the financial markets: you can have a list of securities you want to follow, like stocks and currencies, with real-time market data in the columns, so you can build a market monitor with it. We also support real-time collaborative editing and sharing. It's a cool new product, but we are still in the process of releasing it. It will eventually be rolled out to our whole user base, but we're not there yet, which is why we regularly hold meetings to discuss the release: should we speed things up, or should we slow down and focus on bugs? In one of those meetings we asked ourselves: what is our capacity? And we noticed that we didn't have a concrete answer, which we really should have. That's why we decided to invest some time in finding out what our capacity is.

So how can you find out your system's capacity? You do it by exploring your system's qualities. What can those qualities be? Capacity in terms of your infrastructure: is the infrastructure adequately sized for your needs? Performance: does your system respond quickly enough? Scalability: can your application grow to handle future volumes? And stability: does your system behave correctly under load?

How do you assess those qualities? There are a few methods; I will mainly mention performance testing, load testing and stress testing, because they overlap but are also a bit different from each other, so I just want to briefly distinguish them. In performance testing, you evaluate the performance of a component against a benchmark. You don't have to create a high load; instead, you tune your application and your testing to establish a benchmark behavior, and your aim is not to find defects. In load testing, on the other hand, you feed the system the largest load it can possibly handle, then gradually increase the load until things start breaking. You do this by simulating virtual users, trying to replicate your real users. Here your goal is to find the defects that might stay hidden under a regular load but get exposed under a high load, things like memory management issues or buffer overflows, and to determine the upper limit of every component in your system: your application may handle the load while your database is hitting its limit, or your network is causing problems, so you want to find the bottleneck. Stress testing, lastly, is similar in the sense that you still attempt to break the system down, but instead of creating a high load you take resources away from the system. You can take a few machines down, or turn off a third-party service you depend on, and observe the behavior of your application after the failure. Ideally, you would expect your system to fail gracefully and recover.
I will continue with load testing specifically, but before you invest in any of these methods there are a few points I think you should consider. First of all, you must have monitoring tools in place; otherwise you won't benefit from any of these methods, because you won't be able to see the end result. Specifically for load testing, you need to identify the usage patterns of your application, because you want your simulation to be representative of your real users, so you need to know their workflows. You also need to define your success criteria in measurable terms: do you want to handle a thousand requests per second, or per minute? Do you want to respond in less than 50 milliseconds? And last but definitely not least, you should always isolate your testing environment; I even have a slide dedicated to that, because by isolating the testing environment I mean two things. First, you need to isolate the testing environment from your production environment, so that the load you create doesn't actually affect your users. Second, you should isolate the cluster you generate the load from, away from the system under test, because load testing requires a lot of CPU power and resources; if you run the test on the same machine where your service runs, you will be taking resources away from your service.

So what is Locust, and how does it help with load testing? It's an open source project, it's Python-based, and it allows you to define your user behavior in code, which is super powerful; I will come back to that. It's based on coroutines, an async approach, which makes it very easy to scale and distribute the load. It's also a battle-tested product: we use it at Bloomberg, and I have heard that the makers of the video game Battlefield use it before releases, creating millions of virtual users with Locust to test the game.
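To make "user behavior in code" concrete, here is a minimal sketch of a Locust file along the lines of the one walked through below, using the pre-1.0 HttpLocust/TaskSet API that was current at the time; the paths, credentials and weights are just placeholders:

```python
from locust import HttpLocust, TaskSet, task

class WebsiteTasks(TaskSet):
    def on_start(self):
        # called once when the simulated user is spawned: a good place
        # for preliminary work such as authentication
        self.client.post("/login", {"username": "test", "password": "secret"})

    @task(2)
    def index(self):
        # the integer passed to @task is the weight: this action will be
        # picked twice as often as the profile page below
        self.client.get("/")

    @task(1)
    def profile(self):
        self.client.get("/profile")

class WebsiteUser(HttpLocust):
    task_set = WebsiteTasks
    min_wait = 5000   # wait between 5 and 15 seconds between actions,
    max_wait = 15000  # specified in milliseconds
```

With this saved as locustfile.py, the command-line invocation described next would look something like `locust -f locustfile.py --no-web -c 100 -r 10 -n 1000`, with the flag spellings of those same pre-1.0 releases.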
So Locust provides two different interfaces to run a test. The first is a command-line tool: you pass the --no-web option and a Locust file, which is where your user behavior lives in code (I will show some examples of Locust files), then the number of users you want to simulate, a hundred in this case, and a hatch rate, which is the rate at which Locust will spawn those users. You can also provide an upper limit, 1000 in this case, meaning stop the test once you hit 1000 requests, and it will print the stats to the console. Or you can use the web UI: you provide similar parameters for your test, and once you hit "Start swarming" you see this nice dashboard. I don't know if you can read it, but it shows the request types and request names, along with a bunch of statistics about those requests: number of failures, average response time, maximum response time and so on. There are a few tabs at the top; in the Failures and Exceptions tabs you get a categorization of your failures, so you can see which requests are failing with which error code. If you have good error codes, for example, you can notice that your database is failing before something else. And that button is for stopping the test, which is important.

This is a very simple Locust file. You need to provide two classes: the one at the bottom, a Locust class, is the entry point, and WebsiteTasks is the task set you want to execute. Inside the task set you use the task decorator to define your user's actions; in this case it's a simple website with just two pages, home and profile. As you can see, the task decorator accepts an integer argument, which is the weight you assign to that action; here the home page is more popular than the profile page, so you assign it a higher weight. This is where you encode your usage patterns. Locust also gives you a special hook, on_start, which is called only once, when the user is spawned. So if you need to do any preliminary work, like authentication, or fetching random users from somewhere, you can do it there and avoid repeating that work in the actions themselves. In the Locust class you point to the task set you just implemented, and you can also define the minimum and maximum wait time: the time Locust will wait before sending the next request for that user, somewhere between 5 and 15 seconds in this case.

One thing I forgot to mention: by default Locust comes with an HTTP client, so if your service communicates over HTTP you don't have to implement anything more than this. But if your service speaks some exotic protocol that Locust doesn't know about, which was the case for us, you can implement your own custom client and hand it to Locust. Here we assume you already have a Python client that can communicate with your service, and we just wrap its send_request method in a class. Locust exposes events: as long as an action fires a success or a failure event, that action will be recorded in the statistics. So in this case we try to send the request using our custom client; if it fails with an exception we fire the failure event, and if it succeeds we fire the success event.
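A minimal sketch of what such a custom client could look like, assuming the pre-1.0 Locust events API described here; InHouseClient is a hypothetical stand-in for the Python client that already speaks the service's protocol, and the sketch includes the entry-point wiring covered next:

```python
import time

from locust import Locust, TaskSet, task, events

class InHouseClient(object):
    """Stand-in for the real (hypothetical) Python client that already
    speaks the service's custom protocol."""
    def send_request(self, payload):
        time.sleep(0.01)  # pretend to talk to the backend
        return "ok"

class CustomClient(object):
    """Wraps the in-house client and fires Locust's success/failure
    events so every call shows up in the statistics."""
    def __init__(self, wrapped):
        self._wrapped = wrapped

    def send_request(self, name, payload):
        start = time.time()
        try:
            result = self._wrapped.send_request(payload)
        except Exception as e:
            # the response time has to be measured by hand, in milliseconds
            total_ms = int((time.time() - start) * 1000)
            events.request_failure.fire(request_type="inhouse", name=name,
                                        response_time=total_ms, exception=e)
        else:
            total_ms = int((time.time() - start) * 1000)
            events.request_success.fire(request_type="inhouse", name=name,
                                        response_time=total_ms,
                                        response_length=0)
            return result

class BackendTasks(TaskSet):
    @task
    def create_resource(self):
        # TaskSet.client resolves to the Locust instance's client below
        self.client.send_request("create_resource", {"some": "payload"})

class BackendUser(Locust):
    task_set = BackendTasks
    min_wait = 1000
    max_wait = 5000

    def __init__(self, *args, **kwargs):
        super(BackendUser, self).__init__(*args, **kwargs)
        # the entry point: hand the wrapped client to the Locust class
        self.client = CustomClient(InHouseClient())
```

The wrapping is the whole trick: anything that fires the success or failure event is recorded in the dashboard, whatever protocol sits underneath.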
For these events you need to provide some parameters: the request name, the request type and the response time, so you have to measure the response time yourself, and you can also pass the exception type to get a categorization of your errors in the dashboard. So that's how you write a custom client for Locust. Once you have it, back in the entry point, you just initialize the custom client and assign it to the client attribute of your Locust class.

So you've written your tests; how do you run and deploy them? We use containers, mainly because they are well suited to such tasks: these are singular, lightweight jobs, and they are very easy to deploy. The most important thing is that Locust works in a master and slave fashion, so you can bring up as many slaves as you want to create a higher load. With a single container per slave, you just bring up as many containers as you want, register them with the master Locust instance, and you can generate a very high load. In our case we use Docker Compose, and it's just a simple argument to the run command: say we want a Locust setup with 20 slaves, it brings up 20 plus one containers, and all of them start swarming.

This is a simple diagram of the architecture. On the left is the isolated cluster where our tests run: a single master and a bunch of slaves. We also use a Redis instance to share data across the slaves; before running the tests we prepopulate Redis with some data from our database, so we have user IDs and some other things in there, and each slave can pick a user ID from Redis and start sending requests for that user to our alpha cluster, which is also isolated; our services run on alpha in this case. (There is a rough code sketch of this Redis hand-off below.)

So we did some test runs, and they were quite embarrassing, actually. While testing with a load we were expecting to have soon, we noticed many dropped requests: a single request in our service was taking too long and blocking the others, the queues were filling up pretty quickly, and so we were dropping requests. To find the exact problem in that request we had to add more instrumentation around our database queries and our third-party service calls. After that we ran the tests again and noticed that we had actually introduced a regression in a database query very recently; there were some very obvious optimizations, so we just fixed the query and shipped the code. And this is what we got as the results: the average response time for that specific request. The yellow line is our development environment, the pink line is production, and the purple one is our beta environment. As you can see, the day we fixed the issue on development it went down from about three seconds to just a few milliseconds, and then you can see the staged rollout: we have a few stages in beta and production, so it took a few days for the fix to reach production, and then they all went down to a few milliseconds. So this was a bit of a success story for us using Locust.

And that's all I have today; thanks for coming. Are there any questions?
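Here is that rough sketch of the Redis hand-off, assuming redis-py; the key name, connection details and prepopulation step are all hypothetical:

```python
import redis

from locust import TaskSet, task

# assumed: a Redis instance reachable from the master and every slave
r = redis.StrictRedis(host="redis", port=6379)

def prepopulate(user_ids):
    # run once before the test: push one entry per simulated user,
    # e.g. user IDs exported from the real database
    for uid in user_ids:
        r.rpush("load_test:user_ids", uid)

class UserTasks(TaskSet):
    def on_start(self):
        # LPOP returns *and removes* the entry, so no two simulated
        # users, on any slave, ever claim the same user ID
        self.user_id = r.lpop("load_test:user_ids")

    @task
    def send_for_user(self):
        # send requests on behalf of the claimed user, e.g. through
        # the custom client shown earlier
        self.client.send_request("send_for_user", {"user": self.user_id})
```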
Be sure to restate the questions so that the recording captures them. So the question is whether you can use your regular desktop computers to generate the load. That's actually possible, and it's a good solution, because your computers are pretty powerful compared to what you get on the cloud. Some of these tests I was just running on my own computer, which has eight cores or so, and yes, you can use them; if you can build a small network of computers and distribute the load among them, that's possible and a good idea, I guess.

Is it possible to share data between users? Let's say a user created data and another user, say an admin or super user, will use this data; is this possible in Locust? So the question is whether it's possible to share data between slave instances, or any instances of Locust. You can communicate with the master, but that's really why we were using Redis: in our case, for example, when a user created a resource in our database, it also inserted the ID of that resource into Redis, so that some other user, on some other slave, could do operations based on it. So you need a shared instance of something that both can use.

Am I familiar with JMeter? So the question is whether I'm familiar with JMeter and how the two compare. I've heard about it and read about it, but I haven't used it. I think it's not possible in JMeter to provide custom clients for protocols that aren't HTTP; maybe you can, I don't know. And I don't think you can define your user behavior in code either, can you? Yeah, okay. We compared some solutions, but I haven't actually tried running a load test with JMeter.

So the question is how we analyze the results, and whether we can integrate with CI to fail the build. Right now we haven't completely automated the process, so it's not running on CI; it's just a single command, but we have to run it manually. What we are thinking about is this: Locust gives you a CSV file with the response times and so on, so we want to have that job running on CI, on Jenkins, maybe not for every pull request but regularly during the day, and just compare the differences for specific requests; if one is up by, say, 10%, we could just fail the job. You can do that. (There is a rough sketch of such a check at the end of this transcript.)

Is Locust able to import curl archive scenarios, to fill things in automatically? So the question is whether Locust is able to import... curl archive files? Curl archive files: no, I haven't seen anything like that, but I don't know, to be honest.

Are there cases where you discovered a problem in production that Locust didn't detect, and if so, can you tell us what you learned from it? So the question is whether we noticed something in production that Locust failed to detect. No, we haven't so far, but maybe those problems are still there and we just haven't noticed them yet.

Yes, please, at the back. [inaudible] What's the question? Have you ever tried Locust with gRPC, Google's RPC, instead of the HTTP protocol? So the question was whether you can use the gRPC protocol. As I said, we are using a totally custom in-house protocol, so if you have a gRPC client in Python, yes, you can do anything.

Is there any support in Locust for REST, at a higher level than just talking HTTP and implementing it myself?
So the question is whether there's support in Locust for REST at a higher level than HTTP. No, I think it just gives you the HTTP client; on top of that, you may want to build some abstractions yourself.

I'm thinking about a use case for Locust where the load injectors are not on the same machine, or even the same network, but on different networks and different clouds, to have some kind of swarm of connections. Is it possible to do this given those constraints? So the question is, when you want to create a higher load with a bunch of different clusters that are not on the same network, how do you share data between them? Redis is just our solution; you can still have a common place, or you can split your data. We use Redis to prepopulate some data, but you could partition it: for some specific users go to this machine, for others use that one, so the clusters don't need to communicate among themselves.

So the question is what we use for our test reporting. Right now we are just manually skimming through the CSV files, and we have a dashboard with our instrumentation, so we go through those and analyze the results. But when we move to CI we'll need an automated process to compare the results between different runs. No, they are not automated.

So the question is what the next step is for this project. For us there are two goals: we definitely want to automate the process and have it on CI, and we also want to test another backend service that speaks a completely different protocol, a PubSub mechanism, so we would like to write a client for that and test it as well.

Yeah, so the question is how we prevent different Locust slaves from getting the same data from Redis. We just remove the data: once one user has used it, it's removed from Redis. Yeah, it's in the code, but what we do is this: you can provide up front the number of users you want to generate, so we take that number and prepopulate the data accordingly. If it's 1000 users, we'll have 1000 different user IDs in our Redis instance, so each one can take just one.

Thank you. Thanks, you're welcome.
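As for the CI check described verbally in the Q&A, a rough sketch might look like this; it assumes the requests CSV written by a pre-1.0 `locust --csv` run, and the column names may differ between Locust versions:

```python
import csv
import sys

def load_medians(path):
    # map (Method, Name) -> median response time in milliseconds
    with open(path) as f:
        return {(row["Method"], row["Name"]): float(row["Median response time"])
                for row in csv.DictReader(f)}

def main(baseline_csv, current_csv):
    base = load_medians(baseline_csv)
    curr = load_medians(current_csv)
    failed = False
    for key, median in sorted(curr.items()):
        # flag any request whose median got more than 10% slower
        if key in base and median > base[key] * 1.10:
            print("regression in {} {}: {:.0f} ms -> {:.0f} ms".format(
                key[0], key[1], base[key], median))
            failed = True
    # a non-zero exit code is what fails the CI job
    sys.exit(1 if failed else 0)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```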