Okay, cool, so we'll be starting. Hi, my name is Tomek and today I will talk to you about tips for Marathon performance. I'm working at Allegro, which is one of the biggest e-commerce platforms in Central Europe; we are located in Poland. We started using Marathon in fall 2014, and this is actually my first pull request to Marathon, from when we started using it in production with production load. So over a couple of years we've gathered some insights on performance issues, and I want to share them with you. Allegro was built on top of a big monolith, and we have switched to microservices; currently in production we have about 500 microservices and, depending on the environment, around 3,000 tasks running.

First thing: when we talk about performance, we should talk about metrics. Without metrics you are blind, and you don't know what is happening in your cluster or in your product. So the first tip is to enable metrics in Marathon. They are not enabled by default, because Marathon doesn't know where you want to store them. Metrics appeared in Marathon in version 0.13, and from the beginning they supported all the major players on the market: Graphite, Datadog, and StatsD. There is also a port for Prometheus; there is a GitHub repo with a Prometheus adapter, because Marathon only exposes the metrics as JSON and Prometheus needs a different format to read them.

To enable metrics, we need to add something like this to our configuration. By default, metrics are gathered every 10 seconds. What does that mean? On a big installation, metrics gathering can take about 20 percent of your CPU time. This is a flame graph that was gathered with a profiling tool for Java. If you are not familiar with flame graphs: the bottom bar is 100% of the CPU time the application is using, each bar on the flame graph is a function call, and each call is stacked on the one below it.
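The "something like this" from the slide is a reporter flag on the Marathon command line. As a sketch (the hosts, port, and prefix are placeholders for your environment):

```shell
# Report Marathon metrics to Graphite; the interval below is the
# 10-second default discussed in the talk.
marathon \
  --master zk://zk-1:2181,zk-2:2181,zk-3:2181/mesos \
  --zk zk://zk-1:2181,zk-2:2181,zk-3:2181/marathon \
  --reporter_graphite "tcp://graphite.example.com:2003?prefix=marathon&interval=10"
```

The same interval parameter is what we later raised to reduce CPU usage.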
So here is the whole stack trace of metrics gathering. Because Marathon is written in Scala, the flame graph does not look like a flame; it's more like a wall of fire with many small flames. So how can we fix this problem with the CPU time that metrics gathering takes? We need to increase the metrics gathering interval. After increasing it to 55 seconds, which is more sensible in our infrastructure because the resolution of the metrics store we are currently using, Graphite, is about one minute, we reduced the metrics gathering time to two percent of the CPU. So it is a big deal to have a sensible configuration and not rely on defaults, because they may not work for you. If you are interested, this small block is the Mesos native library. You should monitor that too, because when there are problems with libmesos, which is the native Mesos binding used by Marathon, you can see increased CPU time or memory leaks here.

Marathon at the beginning used Codahale (Dropwizard) metrics, but in 1.4 it switched to Kamon, which is another metrics solution for Scala, designed for applications written with actors in a more reactive way. Why is this important? Because the rule of thumb is that you should know the dependencies of your system. In our case, once we changed the configuration of the load balancer in front of our Graphite, it turned out that we lost every second metric. That was because the Dropwizard version used in Marathon had a bug. It was fixed in the next minor release, so we needed to update Marathon and deploy it, and everything worked again.

Next thing: as I said, Marathon runs on the JVM, so the rule of thumb is to monitor your dependencies, and the JVM is a big deal in terms of performance.
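The flame graphs shown above can be reproduced with a sampling profiler. The talk doesn't name the tool used, so async-profiler here is an assumption:

```shell
# Sample the Marathon JVM for 60 seconds and write a flame graph.
# Assumes async-profiler is unpacked in the current directory.
./profiler.sh -d 60 -f /tmp/marathon-flame.svg "$(pgrep -f marathon | head -1)"
```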
There are many talks about Java performance, and I will only give you a quick, slide-sized tip on how to approach JVM issues. First thing in the JVM: when you hear about Java, you should think about garbage collection. To monitor garbage collection in Marathon, you need to add this piece of configuration, which will create a GC log file containing information about what happened during garbage collection: when it happened, what the garbage collector did, how much memory was moved or freed, and everything else that happened during GC time.

Once you have the GC log, you should parse it. If you have some experience, maybe you can see what happened just by looking at it, but I recommend using a tool like Censum. Censum will print you pretty nice information about what happened and what should be done in order to get better performance. This is an example from our Marathon instance that I took in the summer: we got the information that there is a problem with a too-small heap and premature promotion, so objects were promoted on the heap too early. [Question from the audience.] Yes, I will provide the slides, and there is a clickable link here.

So there is a "heap too small" indicator; it means you may need to increase the heap or change the configuration for promoting objects on the heap. GC is a big topic; there are dedicated conferences for it.
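The GC-logging configuration mentioned above boils down to a handful of JVM flags. A sketch for the Java 8 era Marathon ran on (the log path is a placeholder; on Java 9+ this became the unified -Xlog:gc* syntax):

```shell
# Write a rotated GC log with timestamps and per-collection detail.
export JAVA_OPTS="$JAVA_OPTS \
  -Xloggc:/var/log/marathon/gc.log \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+UseGCLogFileRotation \
  -XX:NumberOfGCLogFiles=5 \
  -XX:GCLogFileSize=20M"
```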
So I will leave it there. When we are talking about the heap, everything depends on your usage and on the hardware you can provide. Here is a visualization of the garbage collection times from our servers. As you can see, this is pretty big time wasted on garbage collection, and it is young GC, which means this could be a stop-the-world situation. Young garbage collection means that objects from the young space in the heap are moved to the old space, and it's generally not a good thing to move objects from young space to old space and then clean them up with a full GC of the old space.

The next thing when we are talking about the JVM is Akka. Akka is an actor model implementation for Scala (I think it can be used with Java too) that is strongly influenced by Erlang. The inspiration is so strong that when you read the Akka documentation, you often find a note saying "this is copied from Erlang". When we talk about Akka, there are a couple of config values that can be changed to get different behavior. For example, there is the throughput, which is the number of messages that can be read and processed by an actor in one batch. By default it's set to five; if we set it to one, each actor will be spawned, take one message, process it, and be released, and then the next actor can proceed. To change this value, you need to add this configuration line to your Java options. In our production we increased it four times, to 20. That was just manual tuning; we saw it performed better, but it really depends on your use case.
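The configuration line in question is a standard Akka dispatcher setting; as a JVM system property it might look like this (20 being the value we settled on):

```shell
# Raise the Akka default dispatcher batch size from 5 to 20.
export JAVA_OPTS="$JAVA_OPTS -Dakka.actor.default-dispatcher.throughput=20"
```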
We are heavily using the events, and that might have been our issue: the events were not dispatched in time because the batches were too small. So you can change this value, but check with the metrics whether it really helped.

The third tip is ZooKeeper. ZooKeeper is the state store for Marathon, and it is a strongly consistent key-value database. ZooKeeper is a great store, but it has a problem when you try to store too-big elements. Sometimes in Marathon you can get an error like this one; it means that when you want to deploy a really important app to production, there is a problem with storing an object. Marathon prior to 1.4 used to store the whole state of an application group in one element; this means that if you have hundreds of applications, the node stored in ZooKeeper holds the whole state. And a deployment element was generated by holding both the previous and the new configuration, so if your whole state takes half a megabyte, the deploy element will take about one megabyte. This is pretty big. It was fixed in 1.4, which completely changed the layout of the ZooKeeper data, so this should not happen anymore. But if you get that error on releases prior to 1.4, you need to do what is advised here: increase the maximal node size. In our production we doubled it. But what's the problem?
Changing the Marathon configuration is not enough. You should also change the configuration of ZooKeeper, because if you take a look at the ZooKeeper documentation, jute.maxbuffer, which is the buffer for incoming data, must be set on all servers and clients. The second thing to keep in mind is that ZooKeeper is designed to store data on the order of kilobytes in size, so if you are changing this setting, you probably have something wrong in your infrastructure. We did this, and what happened to us was that ZooKeeper writes took more and more time, because when ZooKeeper needs to negotiate the state, bigger elements take longer to checksum and so on, and they also need to be passed through the network. So smaller elements are generally better.

As I said, Marathon 1.4 stores a group only with references, so this looks like the better solution. But the problem with storing only references in a key-value store is that you are creating transactions in a NoSQL database. What does that mean? When there is some problem with ZooKeeper, for example one node could not be written, you get an outage of the whole Marathon cluster. But don't worry, this also happened in 1.3.12, just less often.

The next thing when we are talking about ZooKeeper is latency. This is pretty easy: if the nodes are closer to each other, the network latency is smaller.
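To recap the node-size change on both sides, a sketch (we simply doubled the limits; the exact byte values here are placeholders):

```shell
# Marathon side: raise the maximum znode size (bytes).
marathon --zk_max_node_size 2097152 ...

# ZooKeeper side: jute.maxbuffer must be raised on ALL servers and
# clients, passed as a JVM system property via zkEnv.
export SERVER_JVMFLAGS="-Djute.maxbuffer=2097152"
export CLIENT_JVMFLAGS="-Djute.maxbuffer=2097152"
```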
Closer nodes perform faster, and transactions happen quicker, so you should keep the ZooKeeper nodes as close to each other as possible, remembering that they should probably still be in different availability zones, in case one zone goes down. They should also be on different machines than Marathon and Mesos: otherwise the latency between the Marathon master and the ZooKeeper master will be small when they are on the same machine, but they will compete for the resources of that one machine.

The fourth tip is to update to at least 1.3.13, and here there's a little star; I will describe what I mean later. First thing: threads. The Marathon code has two places where it configures the thread pools Marathon uses. The first configures the Scala concurrent execution context, the pool used by the Akka actors; there you can see that the maximum is 64 threads for the whole actor pool. The second is the number of threads for I/O operations, and it's set to 100. So, a little bit of math: 64 plus 100 means that Marathon could spawn up to 2,000 threads. Why does that happen? In fact, it's O(number of tasks). It happens because when you have an actor and you call a blocking request in that actor, Akka will spawn a new thread for you, and it cannot be configured to behave differently, because that's how the actor model is designed: to never block. That's why I advise you to update to 1.3.13, with a star, because there is a patch for this; in our case I think it was 1.3.7 that we patched.
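A quick way to watch this effect on your own machines (the pgrep pattern is an assumption about how the Marathon process is named):

```shell
# Print the live thread count of the Marathon JVM every 2 seconds.
watch 'ls "/proc/$(pgrep -f marathon | head -1)/task" | wc -l'
```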
So when we deployed it: this is before we deployed the patched 1.3.7, and here is after. We still get more threads than configured, because the midpoint is nearly 200 threads, but we are no longer killing our machines with thousands of threads.

Next thing: there is something you could call the no-op optimization. The fastest code you can possibly write is no code at all, and we can rephrase that: the fastest operation you can possibly perform is no operation at all. So what does that mean? There are health checks. Health checks in Marathon were there from the early beginning, and at first they were performed by a single Marathon instance. That worked for us until some point. Why? Because the single Marathon instance was querying every task it has to detect whether it's healthy or unhealthy, storing this value, and presenting it to the user in a nice UI. But what happens when we scale? There is no way to perform health checks from a single instance without hitting timeouts or other issues.

When I said that we should not perform the operation at all, I meant that we can move the operation we want to do somewhere else: in this case, we can ask Mesos to perform the health checks for Marathon. The tasks are not running in a void; they are grouped on Mesos agents, and every task is run by an executor, which can perform health checks for us. When the executor performs a health check, it detects whether the task is healthy and informs Marathon, so Marathon gets a notification only when the health check state changes, and an unhealthy task is killed not by Marathon but locally, because of the health check.
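Switching an app to Mesos-native health checks is a change in the app definition. A sketch against the Marathon REST API using the MESOS_HTTP protocol that came with 1.4 (the host, app id, and endpoint are made up):

```shell
curl -X PUT http://marathon.example.com:8080/v2/apps/my-service \
  -H 'Content-Type: application/json' \
  -d '{
        "healthChecks": [{
          "protocol": "MESOS_HTTP",
          "path": "/health",
          "portIndex": 0,
          "intervalSeconds": 10,
          "timeoutSeconds": 5,
          "maxConsecutiveFailures": 3
        }]
      }'
```

With this, the check runs on the agent next to the task instead of on the Marathon master.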
There will also be probes, probably in Mesos 1.4 I think; they are currently not supported by Marathon, but I think they will be supported in 1.5 or 1.6. If you read Gastón Kleiman's blog post, he did a test, and it turns out that a single Marathon instance (he didn't say what kind of instance he spawned) could handle up to nearly 4,000 tasks with TCP health checks and up to 2,000 tasks with HTTP health checks. And we were hit by this: we had over 2,000 tasks on our production cluster, and tasks got randomly killed by Marathon because the health checks timed out. So what did we do? We ported the Mesos health checks from 1.4 to 1.3, which we are using, and switched every task to Mesos health checks. This solved the problem of tasks being killed even though they were healthy, and it also helped with the Marathon issues.

Here's a comparison, because we had Marathon 1.3.10 and we had a problem with health checks. We saw that 1.4 introduced Mesos native health checks, but after upgrading it was not working for us: there were constant timeouts and people could not deploy anything. Then we decided to port the Mesos health checks from 1.4 to 1.3.10, and everything got more stable, even more than before, and the health checks started passing again. What you see here is a problem that started in Marathon 1.4; there was an issue for it, and it finally got resolved in 1.4.7. Actually, we still haven't updated, because we moved on to different topics that I will cover later.

The fifth tip is: do not use the event bus callbacks. Marathon provides two ways of getting notifications. You can subscribe for a callback, and Marathon will send you a POST with the event as JSON. This is the deprecated way, and the problem with it is that it sends every event, even though some events are pretty big, over 10 megabytes. It's deprecated, there is a weird retry policy, and it was totally resource-consuming.
So you should not use the callbacks. Instead you should use Server-Sent Events, which work in quite the opposite way: Marathon no longer pushes events to a callback URL; instead you connect to Marathon with a persistent connection, and Marathon sends the data over that one connection to that one client. What's more, Marathon supports event filtering since 1.3.7, which means you can filter out the big events, so your application does not have to filter them on its side.

These are the metrics from our solution for registering Marathon tasks in Consul, called marathon-consul; it's a fork of the CiscoCloud marathon-consul. We used to have a configuration that relied on callbacks, and with callbacks we got seven minutes of delay between the timestamp at the source of the event and the time on the machine where the event was received. In our case this meant something like a huge outage, because an application could be killed and still appear live for seven minutes, and for those seven minutes the state in our service discovery solution, Consul, was totally invalid. After switching to Server-Sent Events with filtering, we reduced our delay to less than 30 seconds, usually down to milliseconds. But still, this means that sometimes we have an invalid state in the configuration service, and this is probably not the best way of doing registration in a distributed world.

Another thing: I mentioned before that some events are pretty big, over 10 megabytes. Usually these are the deployment events, and there is a pull request, I think from today, so it's still open, that will introduce lightweight SSE flags, so the events will be smaller, without the unnecessary information about what has been deployed and so on.

The sixth tip is using a custom executor.
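As an aside, consuming the filtered event stream looks roughly like this (the event types are examples of the smaller events; the filtering query parameter is as in the Marathon event bus API):

```shell
# -N disables curl's buffering so events appear as they arrive.
curl -N -H 'Accept: text/event-stream' \
  'http://marathon.example.com:8080/v2/events?event_type=status_update_event&event_type=health_status_changed_event'
```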
I will talk about it more tomorrow. Why do I think it's important? Because we still see a delay between the event saying that an application was created (or killed) and actually registering that in Consul. With an executor, we can register our application in the service discovery solution at the moment it becomes healthy, for example, because the executor is the thing that actually takes care of the whole application or instance lifecycle. This is how Aurora works, for example.

The seventh tip is to prefer batched deployments instead of many single ones. Without batching, each of your users creates a request to Marathon and deploys one application, and Marathon doesn't like many deployments at once, because each deployment needs to be stored in ZooKeeper, and this increases the ZooKeeper usage. So we can use a facade that gathers the deployments into bigger ones and deploys them in a batch. This is something we haven't introduced, but we were considering it; instead we started sharding our Marathon installation. That's also why I haven't tested everything I showed you against the latest Marathon version: right now we have a sharded Marathon, so the load on each instance is lower than what we had before. And how does it look? We just put a facade in front of Marathon, and in that facade we take care of authentication and authorization of the users deploying to Marathon. In fact, this is a better solution than plugins, because plugins run in a single thread, and, for example, when a user wants to query Marathon to see all instances, the plugin needs to check for each particular application whether that user is authorized to see it. So, with thousands of applications...
This starts to be a problem. With sharding, we just keep the simple facade, and, without users noticing it, we move their instances between different Marathon installations using consistent hashing over the application name.

That would be all. Here's a summary of what we should do. Enable metrics to see what is happening. Tune the JVM, and remember to tune the JVM for ZooKeeper too, using the same tools you use for Marathon; remember that ZooKeeper has quite different responsibilities and works in a different way, so it needs special care. Update Marathon to 1.4 and use Mesos health checks if you can; the best thing you can do right now is to use Marathon 1.4.7, or 1.5, but I haven't used that one and can't really recommend it. Do not use the event bus callbacks; instead use an executor that will take care of the application lifecycle. Prefer batching when talking to Marathon, and shard it if you need to: if you hit the wall and Marathon is no longer working for you, think about sharding. That's all, and if you have any questions, I'll be happy to answer.

[Question from the audience.] Yes, we are no longer using the Marathon UI; we keep it only as a fallback. The Marathon UI is no longer supported by Mesosphere, there are no commits on their side, and although there was some movement to update the Marathon UI, the initiative died. So we have our custom solution, it's called appengine, where users can see their own applications and the databases they are using, and configure load balancers, caches, and things like that. And yes, in our case this facade also provides an API, so the UI is pretty simple: we just ask the facade, and all the magic and routing is done there.

Okay, so I think there are no more questions. Thank you. If you have any more questions, I will be here today and tomorrow, and even at the town hall. So, thank you.