Hello, microphone testing, one, two, three, four. One, two, four — three seems to be leaking somewhere. I have been asked to walk around with this mic; I don't know how that will work out. Are you guys able to hear me? Is the sound engineer happy with me walking around? How much more should I walk, Abhishek? Please respond.

A very good morning, and welcome to day two of Rootconf 2018. I'm Shakthi Kannan, one of the MCs for the day. Before we begin, a few announcements. The HasGeek app is available on the Google Play Store, so you can download it on your Android phone. You have a unique QR code on your badge; scan it and you'll be able to see your registration details, food menus and other information. We have a privacy policy: if you do not wish to be photographed, you can pick the red lanyard. The yellow lanyards are for the HasGeek crew, team and volunteers; the audience gets the Gojek sponsor lanyards. The BOF sessions will be held today as well. No photographs or videos are taken there and there is no live streaming, so you can talk candidly in those sessions. Wi-Fi is available: you can connect to the HasGeek network, password geeksrs. If you have any questions regarding your registration invoice, please talk to the folks at the registration desk — as you enter the venue, it's on the left-hand side. We also have a childcare facility, so if you brought your kids, you can talk to the folks at the registration desk to use it. We will be having flash talks this afternoon, so if you want to give a five-minute lightning talk, maybe on some open-source project or some tips and tricks on DevOps, just give me or Mehul Ved your name, your phone number and the title of the talk. We also have office hours today: a few speakers have come forward to give their time for one-on-one interaction with you. We'll announce them, and it's also on the schedule. For Q&A, we have a separate fifteen-minute slot once three speakers finish their talks; but if a talk gets over early and there is some time, the speaker can take questions during the talk.

So with that, let me start. Today's track is on architecture. Yesterday we covered security: IPv6, DNS, system security, how to protect your infrastructure against attacks, and so on. Today's focus will be on infrastructure architecture and DevOps. The first speaker for the day is Sobik Bhattacharya, who is going to be talking on dealing with a failing dependency. Sobik loves to build systems that just work, and is passionate about highly available identity services at Intuit that make the user sign-in experience frictionless yet secure. When he is not in front of the computer, he likes to spend time with his family and his daughter's pet stuffed dog. Sobik Bhattacharya. Hi.
Am I audible at the back? All good? Yeah, thank you for the nice introduction. I'm Sobik — that's my Twitter handle too. I'm not super active, but if you have any feedback, do drop me a message. I'm from Intuit. I'm a noise maker by day and a coder by night, so here I am to make some noise for you.

Today my talk is about a few stories. These stories all start with a problem; then a beautiful resiliency pattern comes along and solves the problem, but in the end there is some twist as well. We did learn a few lessons — not textbook-type things, but a few unusual lessons — and I'll be taking you through that journey. Hopefully you enjoy it and it turns out helpful to you as well.

A little bit of context. Intuit is a financial software company: we make products for accounting, tax, personal finance, payroll, and stuff like that. Most of our products do things that we don't like to do. We don't wake up in the morning thinking, today is the day I am going to file my income tax — wow. Most of us don't do that. What that means is that we tend to delay things to the point where the deadline is just around the corner, and that means there are days right before those deadlines when our traffic surges. We see huge, huge surges of traffic — we are talking about tens of millions of authentication requests as people try to sign in to our products during those days. We call that tax season, and the impact of a one-minute outage during the season is some thousands of transactions. So yes, in this context resiliency does matter to us. We have tried everything I'm going to talk about in production, and we have successfully handled outages of dependencies.

So this is the focus for today. I have a microservice — let's call it A — and A calls another microservice over the network — let's call that B — maybe using REST or something else; that's fine. This is simple enough, and I hope a lot of us can relate to this picture, especially if you are using microservices in production. There's a famous line from Werner Vogels, the CTO of Amazon: indeed, failures are a fact of life. At the same time, I also want us to be aware that the scenarios I'm going to talk about, the failure scenarios, are not the most common case. If we have built our system reasonably well, it's not going to fail 50% of the time. So it's very important that we have an availability goal in mind, and that's where it all starts — this is really the beginning of the beginning. Let's say my service gets one million requests in a day. If I am targeting 99.9% availability, that means I'm willing to tolerate 1,000 failures every day. Is that acceptable? It depends on the business, but that's really where it starts. Also, if I am trying to achieve 99.9% availability, and there are faults that can happen maybe 1% of the time — 1% is rare, a corner case, and it's easy to say, no, I don't care, it's really very, very rare — but if we ignore the 1% cases, then we are actually not going to achieve 99.9. That's something to keep in mind.

So faults — what types of faults can we have? Back to the same picture that we started with: we can have a network fault, or the dependency itself may be down; the faults could be permanent, or they could be short-lived; they may not impact all the transactions — maybe some percentage of your requests fail — and so on, right?
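To spell out the error-budget arithmetic behind those numbers (this just restates what was said above; it is not from the slides):

```latex
% Error budget at 99.9% availability on 10^6 requests/day:
(1 - 0.999) \times 10^{6} = 1000 \ \text{tolerated failures per day}

% If a fault mode hits 1% of requests and we ignore it (all of them fail):
\text{availability} \le 1 - 0.01 = 99.0\% < 99.9\%
```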
And this is our problem statement for the day: as a person who wants to build a resilient microservice, how do I still deliver the best possible user experience, given that the network is unreliable and the dependencies are also unreliable? Here is the outline of the talk. I'll talk about four different patterns — bulkhead, timeouts, circuit breaker, and self-healing — and again, the emphasis is going to be on the real-life lessons we have learned while using them in production, not so much on the textbook.

Okay, so the first one: bulkhead. The term bulkhead comes from the shipping industry. The hull of a ship is typically divided into partitions, and these partitions are called bulkheads. The idea is that if the hull is breached, it floods one partition, but it doesn't flood the entire hull. We'll see how to build on that. Let's look at a simple fragment of code — this is Java, but hopefully the syntax is familiar; it looks very similar to other languages (a minimal reconstruction of it is sketched after this passage). I have two APIs. The first one is called myHealth, and it does a shallow health check: it just returns OK. This is used by the load balancer simply to know that this node is responsive — that the web server, the Tomcat, is responding — and that's it. The second one makes a remote call over the network to a dependency. Now, what is not right here? Look at the code carefully... okay, fine, hold that thought.

Let's assume we have done a reasonable job of designing our system, and the way we have provisioned our resources is like this. Say the 99th percentile latency of the dependency API call, partnerHealth, is five seconds, which means 99% of the calls to the dependency finish in less than five seconds, but 1% can take more. The maximum TPS we can expect on this API is 20, which means we are talking about supporting 20 × 5 = 100 concurrent requests. Then look at the shallow health check: it returns immediately, the 99th percentile latency is 0.5 seconds, and we need to support 10 TPS — so that's 10 × 0.5 = 5 concurrent requests. Together we are supposed to support 105 concurrent requests. Then I give a little headroom, say another 50%, so let's say we should be able to support 150 concurrent requests: I should have 150 threads in my web server, and we are good, right? A reasonably well-designed system.

Now let's see what happens when the dependency is down — meaning this particular REST API call experiences very, very high latency. What is very, very high latency? An average latency of 20 seconds. That's what we are talking about, and this actually happened in a production system; I'm not making these numbers up. And let's say we have a TPS of 10. That means we would need to support 20 × 10 = 200 concurrent requests, and I just have 150 threads. In my reasonably well-designed system, I only have 150 threads — I don't have enough. I'm going to see thread pool rejections, and calls are going to fail. And not only the calls to the dependency: the shallow health check, the myHealth API, is also going to fail, because they both share the same thread pool. And that's where bulkhead helps.
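The slide code isn't captured in the transcript, so here is a minimal, hypothetical reconstruction of the controller being described, assuming Spring MVC (the endpoint names and dependency URL are placeholders). Both handlers run on the same servlet thread pool, which is exactly the flaw under discussion:

```java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

@RestController
public class HealthController {

    private final RestTemplate rest = new RestTemplate();

    // Shallow health check: used by the load balancer; returns immediately.
    @GetMapping("/myHealth")
    public String myHealth() {
        return "OK";
    }

    // Deep check: a remote call to a dependency over the network.
    // It runs on the SAME servlet thread pool as /myHealth -- so when the
    // dependency slows down, these threads pile up and /myHealth starves too.
    @GetMapping("/partnerHealth")
    public String partnerHealth() {
        return rest.getForObject("https://partner.example.com/health", String.class);
    }
}
```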
So what really happens when the shallow health check fails? The load balancer is going to think this particular node is unhealthy — the Tomcat is not responding — so it takes it out of rotation. And if I have autoscaling, a new instance is launched, which is going to have the exact same problem. It's not going to help us; it's just going to be more confusing.

Okay, so here is bulkhead. The key idea is that I create a pool of threads for making calls to the dependency — a thread pool dedicated to the dependency, separate from the web server's. When a call to this dependency API comes in from outside, the web server thread pool asynchronously hands the request over to the dependency thread pool; a thread from that pool picks up the request and makes the call to the dependency, and the thread in the web server pool is now free. So however much time the dependency takes, it does not impact the web server thread pool, and the shallow health check API stays responsive. I have been able to isolate the impact of my fault. I obviously cannot fix the dependency, but at least when the dependency is down, the impact is limited to a certain thread pool and a certain API: the partnerHealth API is down, myHealth is up. And you can generalize — you can have several APIs and several functions, and so on. So this is what we learned: we can do bulkheading by creating pools of resources, and the resource could be threads, connection pools, or so many other things.

Now, what is still not good? When the dependency is down, the partnerHealth API experiences very long latency, which means our users — the callers of this API — are going to wait. That is the problem, and it leads us to the next pattern: timeouts. Again, some Java code using the Spring framework; the way you configure timeouts depends on the language and the library you use, but typically most libraries support configuring two types of timeouts, the read timeout and the connection timeout. This is what we call a network timeout; they apply when the call is being made to the dependency from the dependency thread pool. This is the call we want to put a limit on — we want to ensure it always finishes within a certain time — and we have configured two timeouts for it, read and connect. (A combined sketch of the bulkhead pool and these timeout settings follows.)
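A minimal sketch of both ideas together — a dedicated dependency pool (the bulkhead) plus read/connect timeouts on the HTTP client. This assumes Spring's RestTemplate; the pool size and timeout values are illustrative, not the speaker's:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

public class PartnerClient {

    // Bulkhead: a pool dedicated to this dependency, sized from the capacity
    // math above (~100 concurrent calls), separate from the servlet pool.
    private final ExecutorService partnerPool = Executors.newFixedThreadPool(100);

    private final RestTemplate rest;

    public PartnerClient() {
        SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
        factory.setConnectTimeout(300); // network timeouts, in milliseconds
        factory.setReadTimeout(1000);
        this.rest = new RestTemplate(factory);
    }

    // The servlet thread only submits the work and is freed immediately;
    // a slow dependency can no longer exhaust the web server's threads.
    public Future<String> partnerHealthAsync() {
        return partnerPool.submit(() ->
                rest.getForObject("https://partner.example.com/health", String.class));
    }
}
```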
Now, let's try to understand how much time this call takes and how we can ensure it finishes within my configured timeout. Let's slice the dependency call. What happens first: recall that the web server thread pool hands the request over asynchronously to the dependency thread pool. The dependency thread pool may be saturated — there may not be any free thread — in which case the request has to wait in a queue. That is the first part of the latency we are trying to limit. Once a thread in the dependency pool picks up the request, it does a DNS resolution, mapping the host name to an IP address, then establishes a TCP connection to that IP address, does an SSL handshake, and finally the data transfer begins.

The read timeout I have configured applies only to the last phase, the data transfer. I also have a connect timeout, which applies to the phase where we are establishing the TCP connection — but this phase can actually take K times the connect timeout. What is K? Let's take a step back. A service may publish multiple IP addresses; for example, if the service is behind an AWS Elastic Load Balancer, it does publish multiple IP addresses. When a service advertises multiple IP addresses, a REST client tries to connect to the first IP address; if that is unsuccessful, it tries the second one, and so on. So if a web service publishes K IP addresses, there are up to K connection attempts, and this stage can take up to K times the connect timeout. The point is this: I have some configuration to control certain aspects of this overall latency, but not the complete thing. I have zero control over the DNS resolution time, zero control over how long the request waits in the queue, and so on. It is very, very difficult to guarantee anything with the standard read and connect network timeouts alone.

That leads us to the other type of timeout, the application timeout. The application timeout is typically configured in addition to the network timeouts, and it is driven by the business logic or the SLA we have defined for our service. This is how you do it in Java — different languages put it in different ways — and a sketch follows below. To see it in a picture: the network timeouts apply where we are actually calling the dependency, and the application timeout applies between the web server thread pool and the dependency thread pool. As soon as a web server thread hands a request over to the dependency pool, the clock starts ticking, and if a response does not come back within the configured timeout, an error is returned to the caller. A quick example of how we typically derive it: network timeout = 2 × connect timeout + the median read latency. That means I am allowing for two connection attempts to the service, plus a read that should finish in median time. Define the network timeout like that, add some headroom, and that is a reasonable application-layer timeout.

So, a recap. With bulkhead we learned that we can isolate the impact of a fault, so the entire service is not down — only the part of the service with a hard dependency on that third party is affected. And now we also know how to put a hard upper limit on the latency of an API.
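The slide's Java isn't in the transcript; a sketch of an application-layer timeout on top of the bulkhead pool from the previous snippet might look like this (the budget formula is the one just described; the numbers are illustrative):

```java
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PartnerService {

    private final PartnerClient client = new PartnerClient(); // from the bulkhead sketch

    // Application timeout = 2 x connect timeout + median read latency + headroom.
    // With connect = 300 ms and median read = 400 ms: 600 + 400 + 200 = 1200 ms.
    private static final long APP_TIMEOUT_MS = 2 * 300 + 400 + 200;

    public String partnerHealth() throws Exception {
        Future<String> pending = client.partnerHealthAsync();
        try {
            // The clock starts as soon as the servlet thread hands over the work --
            // queue wait, DNS, connect attempts, SSL, and read all count against it.
            return pending.get(APP_TIMEOUT_MS, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            pending.cancel(true); // don't let the call linger in the dependency pool
            throw new RuntimeException("partnerHealth timed out", e);
        }
    }
}
```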
So what is next? We have a user who makes an API call, and that call times out. But what does the user do — what does it mean for the user? We just said: hey, sorry, we have timed out, there is a problem. Can we provide an alternative to the user quickly? That leads us to the next pattern, the circuit breaker.

Same picture again. A circuit breaker is a library which wraps all the calls to the dependency, and it is either open or closed. If it is closed, it allows all the calls to the dependency to go through; if it is open, it returns immediately without making a call to the dependency. Then we can have some logic that checks whether the circuit is open or closed, and if it is open, we can route the call to an alternative provider and give our user an alternative, maybe degraded, experience. This is what a circuit breaker looks like: it has three components. There is the circuit itself, of course, which is either open or closed. There is a health sensor, which gathers data about the dependency — how many calls were made in the last five minutes, say, and how many of them were successful; it's a data collector. And there is a finite state machine which uses that data and decides whether the circuit should remain closed or whether we should open it — whether we should break. The benefits, as I already said: it helps us provide an alternative. In some cases there may be no alternative; in that case, we can disable the feature that we already know is going to fail anyway. For developers, it allows us to configure a fallback. And of course we don't want to overwhelm a dependency which is already in trouble: without a breaker, we keep retrying against a service that is already under duress and can easily overwhelm it, and a circuit breaker is one way of avoiding that.

Hystrix is a Java library, originally developed and open-sourced by Netflix, and it is very, very easy to use. It actually implements all three patterns we have discussed so far — the bulkheads, the timeouts and the circuit breakers. As per the Netflix documentation, tens of billions of requests are processed through Hystrix every day; that's an amazing scale, and it's definitely a battle-hardened library. But at the same time, as you will see, it is really one piece of a puzzle. The puzzle is the solution we are trying to build; Hystrix is a tool, and we have to ensure that the overall solution works — there are a few more things to think about when we use Hystrix, or a circuit breaker in general. In code, Hystrix is again simple Java, and I just want to show you how easy it is: you just add an annotation in Spring and you provide a fallback method, and that's it — you are on (a sketch follows below). Hystrix, as I said, supports all three patterns: the first group of configuration properties decides when to open or close the circuit, the second is the application-layer timeout, and the third helps us size the dependency thread pool for bulkheading. I will not spend too much time on this.
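Roughly what that Spring/javanica usage looks like — the property names are real Hystrix configuration keys, but the values here are illustrative, not the speaker's:

```java
import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class PartnerFacade {

    @HystrixCommand(
        fallbackMethod = "partnerHealthFallback",
        commandProperties = {
            // Group 1: when to open or close the circuit.
            @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "100"),
            @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50"),
            // Group 2: the application-layer timeout.
            @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "1200")
        },
        // Group 3: the bulkhead -- sizing the dependency thread pool.
        threadPoolProperties = {
            @HystrixProperty(name = "coreSize", value = "100")
        })
    public String partnerHealth() {
        // The guarded remote call (placeholder URL).
        return new RestTemplate().getForObject("https://partner.example.com/health", String.class);
    }

    // Invoked when the call fails, times out, or the circuit is open.
    public String partnerHealthFallback() {
        return "degraded-experience";
    }
}
```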
Now, we have spent some time playing with Hystrix — so what have we learned? If your service is a cluster of nodes, there is a Hystrix instance in each one of them, and they work independently: there is absolutely no effort made by the Hystrix instances to talk to or coordinate with each other. So it's possible that the view of the data one node has is slightly different from the view another node has, and as a result two nodes — two Hystrix instances — will behave differently: one will have the circuit closed and the other will have it open. It's normal; it happens.

One reason we see this happening is that a lot of load balancers, including Amazon ELB, use a round-robin algorithm in which, when one node is much faster than the others, that node tends to get most of the traffic. When a node opens its circuit, all the requests that go to that node come back quickly — their latency is very small — so that node becomes the fastest node and attracts most of the requests. The other nodes, which are actually still calling the dependency — which is slow or down — have very high latency. Because of this asymmetry, the node that opens the circuit first gets the bulk of the requests, while the other nodes starve and don't even have enough data to decide whether to open their circuits. That is what leads to this kind of inconsistency.

So what happens to the user? Say the user makes an API call and the load balancer sends it to one of the nodes; the circuit there is closed, the call goes to the regular dependency, and the user gets one type of experience. The next call from the same user may land on a different node where the circuit is open, the call goes to the alternative provider, and the user gets a different type of experience. So it's possible to have inconsistency at the user level. How do we deal with that? There are two ways, at least that I know of. One is to use user or session stickiness, so that all calls from a given user always go to the same node — but that has its own set of problems, like what happens if that node goes out of service, and all that. That's not what we used, by the way; we ended up doing the second one, which is building a more centralized circuit breaker that looks at the data from all the nodes and decides. We'll talk a little more about that when we get to self-healing — just to give you an idea of where we are headed.

As with any technique we use in a production system, it is super important to know which metrics tell us whether it is actually working for us. For the circuit breaker, the metrics we look at are response times. Say a dependency goes down at time t and the circuit opens at time t1; then t1 − t is the delta — our time to break. Since it's not a perfect system and we opened late, there has been some user impact in that window: that is the impact of not opening the circuit soon enough, and it is one metric to track. The other is on recovery: once the dependency recovers, the circuit breaker has to figure out that it has recovered and can close, and for as long as we don't close it, we are presenting users with a degraded experience. That is the other metric.
So: the user impact of not opening the circuit soon enough, and the user impact of not closing it soon enough — these are the two key metrics we track for a circuit breaker, and we will see how those metrics lead us to additional interesting scenarios, because we are talking about end-to-end behaviors here. I have service A, which calls another service B via a circuit breaker; B calls C, again via a circuit breaker. In a big company it's possible that these two services are owned by different teams and their Hystrix is tuned differently — the circuit breaker in B may be more sensitive to failures than the one in A. So when C, the final dependency, goes down, B detects it fast and opens its circuit, while A is still waiting for more data. For example, B may decide: if I see 10 requests and 8 of them fail, I open. A, on the other hand, may be configured to see 100 requests, of which 80 must fail, before it opens. This kind of thing is possible. The other wrinkle in this scenario: A has an alternative experience — when A's circuit is open, it routes requests to the alternative provider — but B doesn't have an alternative. So in this case the errors that B throws propagate all the way up to the user. That's not what we want; what we ideally wanted is for users to see the alternative experience, and that's not what is happening. This is really the problem of not opening the circuit soon enough: the circuit in A did not open soon enough, and as a result users were impacted — they actually saw errors instead of the alternative.

The second case is about closing the circuit. The circuit breaker, as I explained, needs data to decide the health of the dependency, and if there is no data, it cannot make a decision — it remains where it is. Same picture, but now both circuits, A's and B's, are open. All the calls that land on A go to the alternative provider, and B gets no traffic at all. Now say C recovers at some point. How will B ever know that C has recovered? It's not getting any traffic — it's not getting any data at all about the health of service C. I think the key is that we understand the metrics that matter to us and we do end-to-end testing: it's not enough to test my microservice in isolation, especially with these patterns in place; you have to test the entire system. In this specific case we could also use a synthetic traffic generator — again, something I will touch upon when I talk about the management plane.

The next one is about request volume versus error volume. Hystrix decides to open a circuit based on two configuration parameters: the request volume threshold and the error threshold percentage. For example, a request volume threshold of 100 means that within a window — say 5 minutes — there should be at least 100 requests before Hystrix will consider opening the circuit, and an error threshold percentage of 50 means the error percentage should be greater than 50% for Hystrix to open it. Now look at this scenario: in a window I have 5 requests, and 3 of them failed, so the error percentage is 60%. But intuitively, do I really want to open the circuit? Do I know enough about the dependency to say it is down or unhealthy? No — these could just be three random failures, and we don't want to open and close the circuit too often and confuse everybody.
And indeed, since I have only 5 requests and my threshold is 100, the circuit does not open. That is desirable — that is the whole idea behind having two thresholds, one on percentage and one on volume. But take the other scenario, where I have 50 requests and all 50 of them have failed — 100%. This is a very drastic scenario, and intuitively, at this point I actually know my dependency is unhealthy: looking at this data, I would probably bet my money on the dependency being down. But since my threshold is on request volume and I am expecting 100 requests, I am not going to open the circuit — and that is not what I desire; we are not failing fast enough. Think about the two response-time metrics we defined: here we are not opening the circuit soon enough, and as a result we are impacting our users.

The intuition is this: when the error rate is very high, we need a smaller sample to be confident that the dependency is unhealthy. When it is 100%, I do not need to see 100 requests; I can take that decision at a smaller sample size. So the idea is that instead of thresholding on the request volume, maybe we should threshold on the error volume. That also has a nice side property: we get direct control over the user impact — the number of users we are going to impact before we decide to act.

Okay, so let's take two sample configurations and do a quick A/B test. The first, config A, is the traditional Hystrix one: a request volume threshold of 100 and an error threshold percentage of 50. The second, config B, uses an error volume threshold of 50, with the error threshold percentage exactly the same, 50%. Now a quick simulation. First scenario: my error rate is 50%, so half of all requests fail. As you can see, both strategies work exactly the same way: I have to see 100 requests, 50 of them fail, and both strategies decide at the same time that it's time to break. Now the more drastic scenario, where my failure rate is 100% — every call to the dependency is timing out or failing. The request volume threshold unfortunately still needs to see 100 requests, because that is its threshold, but the error volume threshold reacts much faster: it sees 50 requests, all of them fail, and it breaks the circuit. We fail faster. Unfortunately, Hystrix does not implement an error volume threshold — there is an issue about it on Hystrix's GitHub issue tracker, and at some point I hope it will be implemented — but till then we can send a pull request, maintain our own fork that implements it, or use our own circuit breaker (a minimal sketch of the decision rule follows).

So, to recap, we have learned three patterns: bulkhead, which lets us isolate the impact of a failure; timeouts, with which we can put a hard upper limit on latency; and the circuit breaker, with which we can provide an alternative.
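Since Hystrix doesn't ship this, here is a minimal, hypothetical sketch of error-volume thresholding — just the decision rule, with the rolling-window bookkeeping elided:

```java
public class ErrorVolumeBreaker {

    private final long errorVolumeThreshold;    // e.g. 50 errors in the window
    private final double errorPercentThreshold; // e.g. 50.0

    public ErrorVolumeBreaker(long errorVolumeThreshold, double errorPercentThreshold) {
        this.errorVolumeThreshold = errorVolumeThreshold;
        this.errorPercentThreshold = errorPercentThreshold;
    }

    // Decide based on the counts observed in the current rolling window.
    public boolean shouldOpen(long requests, long errors) {
        if (errors < errorVolumeThreshold) {
            return false; // not enough evidence (and bounded user impact) yet
        }
        double errorPercent = 100.0 * errors / requests;
        return errorPercent >= errorPercentThreshold;
    }

    public static void main(String[] args) {
        ErrorVolumeBreaker breaker = new ErrorVolumeBreaker(50, 50.0);
        // 50% failure rate: opens once 50 errors accumulate (~100 requests),
        // the same point at which a request-volume threshold of 100 opens.
        System.out.println(breaker.shouldOpen(100, 50)); // true
        // 100% failure rate: opens after only 50 requests instead of 100.
        System.out.println(breaker.shouldOpen(50, 50));  // true
        // 3 random failures out of 5: stays closed.
        System.out.println(breaker.shouldOpen(5, 3));    // false
    }
}
```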
So what is next? The next one is about self-healing: the dependency has recovered — how do we know that we can recover now? Here I introduce the term management plane. The management plane is actually not a new idea as such; it's just that the term is not used often and the idea is not studied or talked about a lot, so let me quickly introduce it. By data plane, I mean the actual service — all the subsystems of the service that implement my functional requirements. Service A calls service B, there are users: all of this is the data plane. If I run this service in production, I will have a monitoring system, I will have alarms configured, and there will be an on-call engineer taking corrective actions in response. Everything else around my service — whatever is required to operate or test it — forms what is called the management plane. As I said, it is not a new idea: Kubernetes has a similar concept, and I think in the Avi Networks session yesterday they also described a controller, which is really a management plane that decides, for instance, whether to scale out, and so on. So we have nicely separated the functional requirements and the operational requirements into different sets of subsystems. And notice that we don't have that on-call engineer any more: we have a bunch of scripts or programs that trigger when the alerts fire and automatically take the corrective actions. There is still a human, of course, but now this human is a DevOps engineer who writes those programs. That is how an ideal management plane looks.

Corrective actions are where the meat is. Those could be REST API calls back to the data plane, or simple remote command execution over SSH. What can you do with the management plane? We could implement our circuit breaker there — one that looks at cluster-level data before deciding. We could have a synthetic traffic generator to decide whether the dependency has recovered. We could use the management plane to reload some data, or SSH to a node and restart some process, like Tomcat. Those are my corrective actions.

So let's see how we implement a circuit breaker using this. To recap, a circuit breaker has three main components: the actual circuit, a sensor, and the finite state machine — the decision-making component. With a management plane, the nodes of the service still have the sensor and the circuit, but the decision maker, the FSM, now lives in the management plane. All the sensor data flows into the monitoring system as before; when certain thresholds are breached, that raises an alarm, the alarm triggers the finite state machine, and based on the data it decides to open or close the circuit. There are REST APIs available on the service that allow this component to open or close the circuit from outside. We could also use this to implement the synthetic traffic generator: when the FSM decides to open the circuit, it also triggers the traffic generator — another program here — which pumps in synthetic traffic from time to time to decide whether the dependency, service B, has recovered.

Even though this is intuitive, there are certain pitfalls and key considerations to keep in mind to ensure it actually works. One of them is to start with operability in mind: when designing my service or system, operability is one of my requirements, which means the service itself should define a set of REST APIs to simplify operational tasks like opening or closing the circuit. And it is very desirable that those APIs are idempotent: if a component in the management plane calls such an API and the call fails, it is perfectly safe to retry. A quick example — we can define an API to open a circuit, and no matter how many times you call it, the end result is the same: the circuit is open. That is good. What would be bad is defining an API that toggles the state of the circuit, because the outcome depends on the current state: toggle once and it opens, toggle again and it closes — the result differs depending on how many times you call it. (A sketch follows.)
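A minimal sketch of that distinction, assuming Spring MVC and hypothetical admin paths:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import org.springframework.web.bind.annotation.PutMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class CircuitAdminController {

    private final AtomicBoolean open = new AtomicBoolean(false);

    // Idempotent: the end state is "open" no matter how many times the
    // management plane calls (or retries) this endpoint.
    @PutMapping("/admin/circuit/open")
    public String openCircuit() {
        open.set(true);
        return "OPEN";
    }

    // Idempotent for the same reason: the end state is always "closed".
    @PutMapping("/admin/circuit/close")
    public String closeCircuit() {
        open.set(false);
        return "CLOSED";
    }

    // NOT idempotent -- the outcome depends on the current state, so a blind
    // retry after a failed call can leave the circuit in the wrong state.
    @PutMapping("/admin/circuit/toggle")
    public String toggleCircuit() {
        boolean nowOpen = !open.get();
        open.set(nowOpen);
        return nowOpen ? "OPEN" : "CLOSED";
    }
}
```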
The other consideration is the dependency flow. The whole management plane is new code, and there is a network involved: the new code may have bugs, the network may go down, there may be GC pauses, there may be transient network issues like congestion, and so on. What happens if any of that happens? It should not impact the overall availability numbers of my service. So the data plane should not depend on the management plane; it should be the other way around — the management plane depends on the data plane. Another quick, easy pitfall: say I design my service so that when a node in the data plane comes up, it makes a call to a node in the management plane to fetch some data, say some configuration. Now, what if the network is broken while this node is booting? It cannot make that call, it fails, it cannot boot. That's a really bad place to begin. The better way is for the node to come up with some default values, maybe the last known values, and have a refresh call made from the management plane to the data plane to push the latest version of the data.

To summarize, these are the four patterns we talked about: bulkhead for fault isolation, timeouts, a circuit breaker to provide an alternative, and a separate management plane for self-healing and other possible operational requirements. Dependencies do fail — all of us who have actually worked in production and operated systems know that — and there are techniques that help us, but the key thing is that every technique is really just one piece of the overall puzzle. We have to understand the end-to-end behavior, and we should know what to look for — what our metrics are. And lastly: awareness, observation, and refinement. These are the three key phases we go through in this journey; I call it meta-learning — the learnings that help me learn. The journey we have been through may not be exactly the same as the journey you will go through, and your lessons may also be different, but it is important that you learn, and these are the principles that help us learn. If I am using a technique, a pattern or a library in my system, I should understand what the key metrics to track are, and define a target SLA for each of them. If I have a database that replicates between two data centers, then perhaps replication lag is one metric to keep in mind. If I call another API over the network, maybe it's the latency of that call. If I am using a circuit breaker, maybe it's the response times I talked about. Then, once the system is deployed in production or in a test setup, we gather data, we see what the actual values are, and if they deviate from the target, we come back and refine.
So this is what I will leave you all with, and hopefully it was helpful. Thank you — I am happy to take questions if you have any. We have five minutes of time; any questions, please raise your hands. There don't seem to be any questions — or the audience is too shy. Yeah, we have one question here.

Audience: From the production side, I have a question on load balancers. From your experience — since you are also dealing with high-volume transactions — what would be a situation where you have two load balancers and you would not keep them hot-hot, but only hot-cold? Could you think of some scenario where that would be the right configuration instead of hot-hot?

Sobik: There are different load balancer architectures. The one that is popular these days is where you have neither hot-hot nor hot-cold; it's really a distributed set of nodes doing the load balancing, and if one of them goes down, the others automatically compensate. If you really want hot-cold, I don't know why you would design your load balancer that way: you then have to worry about the time it takes for the system to decide that a load balancer is unhealthy and switch over, and during that failover time you are impacting your users, because you are still sending traffic to the unhealthy node, which cannot forward requests anyway. Nowadays, if I look at ELB for example, it is completely distributed: there are multiple nodes, and DNS does some magic — it returns the IP addresses of different nodes on different calls — so requests go to different nodes, which then forward them on. That's how it works.

Audience (at the back): Hello — the thresholds you mentioned, are they dynamic, or do you go and tune them? And if you tune them, how do you make the services aware of it? How do you manage that manual intervention?

Sobik: You mean the circuit breaker thresholds?

Audience: Yes — the thresholds you have configured, are they static throughout? I think they should be dynamic, so that you can be resilient.

Sobik: Why do you think they should be dynamic? Because that's not what we do, by the way: this is configuration, the thresholds are configured, and when the service comes up, they don't change.

Audience: You have never come across a situation where you need to go and change them on the fly?

Sobik: So, there is a way to externalize configuration: you could have a configuration service where you maintain the configuration and change it on the fly. That could be one way. But then, how do you test it? You made the change in production, and you don't know if it's going to work. I think the more conservative approach is to make the change and take it through the regular testing and CI/CD, so that you know that whatever config change is coming has been through the same set of testing as everything else. It depends on what type of configuration it is, though. If it's something really safe — say you want to disable a feature — that is probably fine, and you can use a feature flag to do it on the fly. But if it is something more complex, like tuning the circuit breaker, it typically takes time, because you really want to be sure that the metrics I talked about — the time to open and the time to close — are really within our SLA limits. So usually we don't do it on the fly.
One question on this side.

Audience: You talked about how, despite doing all these things, a dependency can still fail. Are you also recommending alternative designs? For example, in some cases a synchronous call is needed to complete the transaction — but can we segregate it and think of a scenario where, even if the dependency is off, my system is still on and I can do the processing later? Do you recommend that?

Sobik: We have synchronous services, and we also use asynchronous services, where we have a JMS-based queue: the caller receives a request, puts it in the queue, and then a consumer picks it up and calls the dependency. The consumer is still synchronous, so everything I talked about still applies there, but from the end-to-end point of view I have quickly responded to the caller of my service — not with a 200 but with a 202, Accepted: I'm going to work on it later — and then I can do retries and so on. And actually, since I mentioned retries: retries are a big topic in themselves, but when the dependency goes down and we retry, we have to be a little careful to avoid the thundering herd problem, where lots of retries suddenly pile up and bring the service down again. It can become a flip-flop: the service comes up and goes down again because of the huge surge of retries.

We'll take more questions for Sobik during the next Q&A session, together with the other speakers — thank you, Sobik, thank you so much. If you want to tweet, you can use the #rootconf hashtag; the Twitter handle is @rootconf. We have office hours with Pukhraj Singh on the first floor starting at 10:30 — he spoke about enterprise security yesterday and also participated in the debate, so if you have any questions for him, you can head to the first floor.

The next speaker we have in the main auditorium is Vishnu Gajendra. He's going to be talking on building a reliable and scalable metrics aggregation and monitoring system. Vishnu is passionate about building distributed systems that are reliable and scalable, and has many years of experience in designing and implementing large-scale data stream processing pipelines. He also enjoys teaching college students computer science topics and conducts many seminars and workshops. When not in front of the computer, he enjoys listening to music, and he plays the bass guitar. Vishnu Gajendra.

Good morning, guys. I'm Vishnu Gajendra, a developer at Exotel. Today's talk is about the metrics aggregation and monitoring system that we have built at Exotel. Briefly about Exotel: we provide voice and SMS APIs with which you can make phone calls and send SMSes. Some of the classic use cases are IVR — interactive voice response systems — and last-mile delivery for e-commerce and other online services, for example a Swiggy delivery agent calling you before they deliver your order. We also recently launched nOTP, a call-based OTP verification service, and we offer other enterprise products, tailor-made... sorry for the interruption. So, let's move on to the topic. I'm going to cover two different systems: the first is the metrics aggregation system, the second is the monitoring system. Before we discuss the design and implementation, we need to understand the need for a metrics aggregation system.
We run 25-plus microservices — more than the number of software developers we have — on hundreds of servers in the cloud. We want to collect metrics from these microservices, and also metrics about the servers where the applications run: CPU, memory, and so on. Once we collect and aggregate them, we want to visualize the metrics and run monitoring on top of them — for example, if the average latency of one of my APIs is greater than X, send an alert. That's a typical use case. So this is the requirement: the goal is to reliably and scalably aggregate metrics, so that we can query them and run monitoring on top of them. I'll discuss the goals in the later slides.

Before that, here is a sample metric that we collect from one of our microservices. The metric name is make call API — it's a metric about the make-call API request, with which a customer can initiate a phone call. Every metric has a set of tags and a set of fields (a hypothetical reconstruction of such a document, and of the client that ships it, is sketched below). The tags capture the dimensions of the metric: in this one we capture two pieces of information, the tenant ID and the HTTP response code returned to the customer. Using tags, you can search metrics — for example, find all the metrics that belong to a specific customer — filter metrics, and group metrics. The fields capture the actual data we are interested in: here we capture the latency of the make-call API request in milliseconds. You typically run aggregations on field values — for example, find the average latency of make-call API requests for a specific tenant. So that's the difference between tags and fields: tags are what you search on; fields are the actual data you aggregate — averages, percentiles, and so on.

These are the functional requirements of the metrics aggregation system. The metric document should be JSON, as I discussed in the earlier slide. The user should be able to add any number of arbitrary tags with high cardinality to a metric data point. To understand what cardinality means, take the same example: we have two tags on this data point. One is the HTTP code; the number of unique values for the HTTP code — 2xx, 4xx, 5xx — is at most 20 or 30, so the cardinality of the HTTP code tag is low. When it comes to tenant ID, you can have millions of tenants on your platform, which means millions of unique values for the tenant ID, so its cardinality is high. The implication is that the metrics data store you use for aggregating and persisting data points must support tags with high cardinality, so that you can perform searches efficiently. The second functional requirement is rich query capabilities: you should be able to perform various types of aggregations, like averages and percentiles, and to bucketize metrics along various dimensions like tenant ID, response code, and so on.
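The transcript doesn't show the metric document itself. A hypothetical reconstruction of one data point, and of the kind of fire-and-forget UDP telemetry client described later in the talk, might look like this — the field names, JSON shape, and port number are assumptions:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class TelemetryClient {

    public static void main(String[] args) throws Exception {
        // One metric data point: tags are the searchable dimensions,
        // fields carry the values you aggregate on.
        String metric = "{"
                + "\"name\":\"make_call_api\","
                + "\"tags\":{\"tenant_id\":\"tenant-42\",\"http_code\":\"200\"},"
                + "\"fields\":{\"latency_ms\":183}"
                + "}";

        // Fire-and-forget: UDP to a local port where the log shipper listens,
        // so the application never blocks on the metrics pipeline.
        byte[] payload = metric.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getLoopbackAddress(), 5140)); // assumed port
        }
    }
}
```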
Moving on to the non-functional requirements. The whole system should be reliable — up 24x7; a single node crash should not bring down the whole system. The system should scale: today I am ingesting 1,000 metrics per second, tomorrow I might be handling millions of records per second, and the system should scale just by adding more machines. The end-to-end latency should be low. By end-to-end latency I mean the time between the point where a metric is collected from our microservice and the point where it is actually persisted in the metrics data store; that overall latency should be minimal. And the whole pipeline should be flexible. To give you an example: a couple of years back many companies were using Graphite as their metrics data store, but today, I gather, many people are using Prometheus. Technologies keep changing, so you should be able to adopt new technologies as your requirements evolve — the pipeline should be flexible enough that you can plug and play different systems.

Before we move to our design and implementation, some of the options we explored before implementing our own pipeline: InfluxDB, which is a metrics data store; Elasticsearch with an X-Pack license — many people will be familiar with Elasticsearch, and X-Pack is a product built on top of it that provides metrics aggregation and monitoring capabilities; and Prometheus, which is a monitoring system. The difference between InfluxDB and Prometheus: InfluxDB is more of a data store, while Prometheus is more of a monitoring system that people typically also use as a data store, although it is not suggested for long-term storage. These are the options we explored, and some of the downsides: it's very expensive — getting an X-Pack license or an InfluxDB license proved very costly for us. Some of these systems do not support tags with high cardinality: for example, the InfluxDB free version supports a maximum of 10 million series, which means we can't have tags with high cardinality (InfluxDB also has a clustered version, which might support more time series). And there is a single point of failure with these systems: Prometheus and the InfluxDB free version run on a single node, which means that if the node crashes, your monitoring system is down — and for the same reason, they don't scale.

So this is the design we implemented at Exotel. First I'll cover the overall design; in later slides I'll go into the individual components and discuss them in more detail. At the left we have our microservice running on a cloud server, and along with the microservice we also run other third-party services, like the Apache httpd server. We want to collect metrics about our microservice, and at the same time metrics about the Apache httpd server and the physical server where the application runs — CPU, memory, stats, and so on. Using a telemetry client that we implemented, we push metrics from our microservice to a local UDP port where rsyslogd listens. rsyslog is a lightweight logger and shipper service: it can collect data from multiple input sources and write it to multiple output destinations. In our case rsyslog listens on a localhost UDP port to which our microservice sends metrics; it collects all the metrics the microservice sends, batches them, compresses them, and writes them to Kafka. Similarly, to collect server metrics and metrics from third-party services like the Apache httpd server, we use Telegraf. Telegraf is a metrics collector agent which can collect server stats, and it has various input plugins for collecting metrics about specific services — MySQL, Aerospike, even Elasticsearch. In our pipeline, Telegraf collects the metrics, batches them, compresses them, and sends them to Kafka.
Moving on to the next component: the metrics now land in Kafka. Kafka is a message broker where multiple producers can produce metrics and multiple consumers can consume them. In simple terms, you can think of it as a message queue where producers enqueue messages and consumers dequeue them; in our case, the producers are rsyslog and Telegraf. Once a metric is produced, we have a bunch of Logstash instances. Logstash is again a shipper service that can read data from multiple input sources and write it to multiple output destinations; in our case, we read from Kafka and write to Elasticsearch. Now the metrics are aggregated and persisted in Elasticsearch, ready to be queried. We use Grafana for visualization: we query data from Elasticsearch and visualize the metrics in Grafana. For alerting we use an application called ElastAlert, which again queries metrics from Elasticsearch and, based on the rules you define, sends alerts to your alerting system; we use VictorOps as our alerting system. So if you look at the overall design, these components are producing metrics and these are consuming them.

The first component is rsyslog. As I said, rsyslog is a logger and shipper service that comes pre-installed with most Linux flavors like Ubuntu and CentOS. It's very robust, and in our experience it consumes very little memory and CPU compared to other shipper services. Metrics are batched and compressed at localhost before being sent to Kafka. If you look at the design, we ship the metrics to a localhost UDP port, so the latency of sending metrics from our microservice is very minimal — it's localhost, and UDP is non-blocking. And because we batch and compress the metrics, we also save network bandwidth, which matters if you have a bandwidth limitation in your systems; you can use various compression techniques like LZO or gzip based on your requirements. Among the cons: configuring rsyslog is a bit painful — it will take some time for you to get used to it.

Moving on to the next component, Telegraf, the metrics collector agent. We use Telegraf to collect server metrics like CPU and memory, and metrics about specific services like MySQL and the httpd servers. It's very robust, it also consumes very little memory and CPU, it can collect metrics from 80-plus systems through predefined input plugins you can use out of the box, and it's easy to configure.

Moving on to the next component: Kafka, a highly reliable, scalable message broker. It's very reliable because it supports clustering and data replication: data written to Kafka is replicated on multiple nodes, so even if one machine goes down you will not lose data, and the cluster stays up and running. This also implies that the metrics aggregation system itself is reliable, because we are using Kafka. It can be scaled to handle millions of writes per second, which also implies that our metrics aggregation system can handle millions of metric data points per second. And it decouples the producers and the consumers: a producer doesn't need to know who consumes the data, because Kafka sits in the middle.
The advantage of that decoupling is that in the future you might want to replace Elasticsearch with some other metrics data store that offers additional functionality. In that case, all you have to do is configure Logstash to read from Kafka and write to the different data store — you don't have to touch the producers at all. This makes the whole pipeline flexible, so you can plug and play different components; for example, you could read data from Kafka and ingest it into Prometheus if you want to use Prometheus for monitoring. It also enables fault-tolerant processing. Say Elasticsearch crashes for some reason, so Logstash cannot write to it: since we are using Kafka, the metrics will queue up, and once Elasticsearch comes back, Logstash reads the buffered messages from Kafka and writes them to Elasticsearch. Some of the cons: Kafka is not trivial to operate — you need to understand some of its internals and configure your cluster appropriately before you push it to production. Among the Kafka tuning we did: Kafka supports partitions, so a topic can be split across multiple nodes, and based on your read and write throughput requirements you should configure the number of partitions appropriately. We also define multiple topics to categorize and prioritize metrics: for example, application metrics go to one Kafka topic and server metrics to another, because we want to process them separately. That is how you can prioritize metrics in Kafka.

Moving on to the next component, Logstash — again a shipper service that can read data from various sources, process it, and write it to various destinations. The advantage is that it's a stateless service, so you can scale the processes up or down based on your metrics ingestion rate: if you're ingesting 1,000 metrics per second, you might need just one Logstash instance; a couple of hours later you might be processing 10,000 metrics per second, and you can scale up the Logstash instances to increase the read and write throughput of the pipeline. The cons: it's very heavy on memory and CPU in our experience, and it sometimes crashed with out-of-memory errors — so we are not very happy with Logstash, and in fact we are looking at replacing it with something else, which I'll discuss in the later slides.

The next component is Elasticsearch, which we use as the metrics data store where metrics get aggregated and persisted. It's a JSON store that supports rich query capabilities. The advantages: it's very reliable — it supports clustering and data replication — and it can be scaled to handle millions of writes per second, which again means our metrics aggregation system can scale to millions of metric data points per second. It supports tags with high cardinality, which is one of our requirements. And it supports rich queries: metrics aggregations out of the box, like averages and percentiles, and bucketing along various dimensions, as I mentioned earlier (a sample query is sketched below).
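For a feel of those query capabilities, here is a hypothetical sketch of the kind of aggregation a dashboard panel issues — average latency, filtered by a tag, bucketed into 5-minute windows. The index pattern, field names, and tag value are assumptions, and the `interval` form of `date_histogram` matches Elasticsearch of that era; plain `HttpURLConnection` is used to keep it dependency-free:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class AvgLatencyQuery {

    public static void main(String[] args) throws Exception {
        // size:0 -> we only want the aggregation buckets, not raw documents.
        String query = "{"
                + "\"size\":0,"
                + "\"query\":{\"bool\":{\"filter\":["
                +   "{\"term\":{\"service\":\"call_es_worker\"}}]}},"
                + "\"aggs\":{\"over_time\":{"
                +   "\"date_histogram\":{\"field\":\"@timestamp\",\"interval\":\"5m\"},"
                +   "\"aggs\":{\"avg_latency\":{\"avg\":{\"field\":\"latency_ms\"}}}}}"
                + "}";

        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:9200/metrics-*/_search").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(query.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```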
This also means that you'll be able to visualize or monitor the metric at a per-second level. Typically, for historical data you don't have to maintain data at the per-second level; instead you want to aggregate the metric at a per-minute level. What this means is that instead of 60 metric data points you'll only need one metric data point per minute. The advantage is that it reduces disk space utilization and it might also speed up the queries that you run. Some systems like Prometheus and InfluxDB support this out of the box; there's a term for it, called downsampling. But Elasticsearch does not support downsampling, primarily because Elasticsearch is not implemented specifically for the metrics aggregation use case; as you know, many people use Elasticsearch for log aggregation as well, and it doesn't support features specific to metrics aggregation. And like Kafka, it's not very trivial to operate; before you push it to production you need to understand the internals and configure it appropriately. Some of the ES optimizations that we've done: by default, Elasticsearch stores the raw JSON document that you write, but for the metrics aggregation use case we don't need the raw JSON document, we just need the aggregated values, so we disable the raw JSON document storage, and that significantly saves disk space. And we enable indexing and searching only on tags, not on fields. As I said, tags are the dimensions of the metrics using which you can search, and fields are the actual data points on which you want to do aggregations. By enabling indexing only on specific tags we increase the write throughput, because the number of indexes that Elasticsearch maintains stays minimal: we only index tags, not fields. Also, in Elasticsearch there's a process called compaction, which basically merges all the data segments that it internally writes to disk. The reason compaction happens is that it optimizes your searches in Elasticsearch, but compaction is a very CPU-heavy process, so it will significantly increase your CPU load when it runs. To reduce the impact on query traffic, we forcefully merge the segments using an API in Elasticsearch called force merge, with which you can trigger the compaction process manually; we trigger it manually during off-peak hours so that it doesn't affect the query traffic. So these are some of the ES optimizations that we did, and that's all about the metrics aggregation system that we have built. Now the metrics are aggregated in Elasticsearch. We use Grafana for visualizing the metrics; Grafana supports Elasticsearch as a data store, but unfortunately Grafana does not support alerting on the Elasticsearch data store, although it supports alerting on other databases like InfluxDB or Graphite. So this is how we use Grafana for visualizing metrics persisted in Elasticsearch. If you see the query, we are visualizing a metric called ES client latency. As you can see, Elasticsearch also supports wildcard searches, and we are filtering the metrics based on a tag value: here we are using a service tag to filter the metrics, so we are visualizing only metrics that are ingested from the call-es-worker application. The metric that we are visualizing here is the latency in milliseconds, and we are computing the average of it. We are also grouping the metric based on the timestamp: every 5 minutes or every 5 seconds, you can bucketize the metrics and compute the aggregation. So, moving on to the next topic, which is monitoring.
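The index-level optimizations described here could look roughly like the following elasticsearch-py sketch; the index name, tag and field names are hypothetical, and the exact mapping options should be verified against your Elasticsearch version (older versions need a mapping type name):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-node-1:9200"])  # hypothetical cluster endpoint

# Daily metrics index: drop the raw JSON (_source) and index only the tags,
# keeping the numeric field un-indexed but still aggregatable via doc_values.
es.indices.create(
    index="metrics-2018.05.11",
    body={
        "mappings": {
            "_source": {"enabled": False},       # don't store raw documents
            "properties": {
                "service":    {"type": "keyword"},                  # tag: indexed
                "host":       {"type": "keyword"},                  # tag: indexed
                "latency_ms": {"type": "float", "index": False},    # field: aggregation only
            },
        }
    },
)

# During off-peak hours, trigger segment merging ("compaction") manually
# so it doesn't compete with query traffic at peak.
es.indices.forcemerge(index="metrics-2018.05.10", max_num_segments=1)
```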
So first we need to understand the requirements for monitoring. The requirement is: we want to query Elasticsearch at a predefined time interval and send alerts if the metric reaches a threshold. To give you an example, every 5 minutes we want to evaluate the following rule: if the average CPU load on server X is greater than or equal to 5, send an alert. This is a typical use case for many of us. And like the metrics aggregation system, the monitoring system should also be reliable and scalable. For example, if your monitoring system is down, you will not be aware of any downtime, which will result in a bad customer experience, so you want your monitoring system to be up and running all the time. It should also scale: today I am running 10 rules every 5 minutes; tomorrow I might have 10,000 rules or even 1 million rules, and it should automatically scale so that it can evaluate all those rules in real time. For monitoring we use the ElastAlert application. It's an open-source monitoring tool implemented by Yelp. All it does is query Elasticsearch based on a predefined rule that you configure, and if the metric reaches the threshold it sends alerts to various alerting systems; it can also send an email, or even send the information to a custom HTTP endpoint if you have one. The ElastAlert tool was originally implemented to query logs and send alerts. As many of you might know, Elasticsearch is used for log aggregation as well, and the typical use case is: if there are too many errors in your logs, send an alert. That's the original use case for ElastAlert, but we extended ElastAlert to support metrics aggregation as well. Whatever we implemented is still closed source, but I am planning to put it in our public repo by the end of next week, so if you want to make use of it, you can. So now we have this ElastAlert application which runs as expected, but the problem at hand is that we want to reliably and scalably schedule and run ElastAlert for the thousands of rules that we have. For that we use AWS Lambda. AWS Lambda is a managed AWS service with which you can run your application in a lightweight, Docker-like container. To give you an example, I have one rule that I want to execute every 5 minutes; for that, I am going to launch a Lambda function that runs every 5 minutes. You can trigger a Lambda function using a variety of events; one of the events can be a cron event, which triggers your Lambda function every x time interval, where x is something you can configure. So I am going to launch a Lambda function for each of the rules that I want to execute in real time. The advantage of using AWS Lambda is that it makes the whole scheduling and execution more reliable and scalable, which also implies that your monitoring system is reliable and scalable. AWS Lambda also stores logs for each Lambda function invocation, so it will be easy for you to debug any issues with ElastAlert, if any. Some of the advantages of ElastAlert: it is a stateless service, and it stores all the alert information in Elasticsearch, so it doesn't maintain state by itself. For example, the last known alert state for a specific rule is stored in Elasticsearch itself, so ElastAlert doesn't have to maintain any state about the alerts. Some of the cons: deploying new rules to AWS Lambda is not straightforward, so we have built our own deployment system around deploying new rules to ElastAlert in AWS Lambda. If you want to use ElastAlert in AWS Lambda, you might also have to build your own deployment system around it.
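For a rough idea of what one such scheduled rule evaluation might look like, here is a minimal Lambda-style handler in Python; this is not ElastAlert itself, just a hand-rolled sketch of the same pattern, with a hypothetical rule, index pattern, field names, and alerting webhook:

```python
import json
import urllib.request
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-node-1:9200"])       # hypothetical endpoint
ALERT_WEBHOOK = "https://alerts.example.com/hook"   # hypothetical alert receiver

def handler(event, context):
    """Triggered by a CloudWatch cron event every 5 minutes.

    Rule: alert if the average CPU load on server X over the
    last 5 minutes is greater than or equal to 5.
    """
    result = es.search(
        index="metrics-*",
        body={
            "size": 0,
            "query": {"bool": {"filter": [
                {"term": {"host": "server-x"}},
                {"range": {"@timestamp": {"gte": "now-5m"}}},
            ]}},
            "aggs": {"avg_load": {"avg": {"field": "cpu_load"}}},
        },
    )
    avg_load = result["aggregations"]["avg_load"]["value"]
    if avg_load is not None and avg_load >= 5:
        payload = json.dumps({"rule": "cpu_load_server_x", "value": avg_load}).encode()
        urllib.request.urlopen(urllib.request.Request(
            ALERT_WEBHOOK, data=payload, headers={"Content-Type": "application/json"}
        ))
    return {"avg_load": avg_load}
```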
So that's about the monitoring system. Now we have this metrics aggregation and monitoring system that's reliable and scalable, but we also want to monitor the metrics aggregation system itself, so that, for example, if a Kafka node goes down we get an alert about it. For that we installed Telegraf, the metrics collector agent, on each of the components like Kafka, Logstash and Elasticsearch, so that it can collect metrics about these systems. Of course, we don't want to send these metrics to the same metrics aggregation system, as that would introduce a cyclic dependency, so for this we use InfluxDB as the metric data store where these metrics get aggregated. Once the metrics are aggregated, we use Grafana for visualization and monitoring; fortunately Grafana supports alerting on the InfluxDB data store, so we made use of that to monitor our metrics pipeline. Some stats about our production setup: we run a three-node Kafka cluster on machines with instance-store SSDs, that is, local SSDs attached; we have two Logstash instances, again two-core machines; and for Elasticsearch it's a four-node cluster, again two-core machines with instance-store SSDs. This is our current traffic: the metrics ingestion rate is 150,000 per minute, which is roughly around 22 GB per day; the number of search queries that we run at the shard level is 15,000 per minute; and we retain metrics for the last 70 days. The older metrics we snapshot and store in S3: there's a snapshot-and-restore API in Elasticsearch with which you can snapshot the metrics to various data stores like S3 and other remote storage, and if you want to restore older metrics, you can restore them as well. Currently the disk utilization is 4.3 TB across the four Elasticsearch nodes, and at any point in time we store around 7 billion metric data points in our Elasticsearch cluster. For deployment we use Terraform for managing resources in AWS (we are hosted in AWS), and for deploying services we use Ansible playbooks; you can download Ansible playbooks for Kafka, Elasticsearch and Logstash from Ansible Galaxy, which is a repository for Ansible playbooks. Some of the future improvements we are working on: we want to replace Logstash with rsyslog. With recent versions, rsyslog supports reading data from Kafka and writing to Elasticsearch, and that's all we need; in our experience rsyslog is much more reliable compared to Logstash, so we want to use rsyslog in place of Logstash. Another of our requirements: for example, we make phone calls and send SMSes, and in some of the metric data points we add the phone number as a tag. But we want to collect more metadata about a phone number, for example the operator details, the circle the phone number belongs to, etc. Of course, we don't want to collect all this data at the microservice which is processing the call, because that would add significantly more load to the microservice itself. So instead of adding it at the microservice, we want to add this metadata to every metric data point at a later stage in the pipeline: for example, Logstash, after reading from Kafka, can take the metric data point, fetch the metadata from multiple external data sources, decorate the metric data point, and then ingest it into ES. We are working on a prototype for this. The third requirement is that we want to have anomaly detection support in ElastAlert, and we are also working on this. That's it, guys, thanks. Questions?
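The enrichment stage described above might look something like this minimal sketch, sitting between the Kafka consumer and the Elasticsearch writer; the lookup function and metadata fields are hypothetical, and a real version would cache lookups rather than hit external sources per point:

```python
def lookup_phone_metadata(phone_number):
    """Hypothetical call to an external data source (e.g. an operator DB).

    A production version would cache these results aggressively, since
    a remote lookup per metric data point would be far too slow.
    """
    return {"operator": "ExampleTel", "circle": "Karnataka"}

def enrich(point):
    """Decorate a metric data point with phone metadata, if present."""
    phone = point.get("tags", {}).get("phone_number")
    if phone:
        point["tags"].update(lookup_phone_metadata(phone))
    return point

point = {"name": "call.duration_s", "value": 34.0,
         "tags": {"phone_number": "+910000000000"}}
print(enrich(point))  # tags now include operator and circle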
That was a nice talk. I have two questions; one is about Logstash. We also have a similar pipeline, and why did you choose Logstash? Was it because it came along with the ELK stack? Or, because I noticed you have structured logging, I mean structured logs, so you don't need to transform any logs? That's correct. We chose Logstash because its integration with Elasticsearch was good compared to other shipper services, and we were not aware of the pitfalls that we experienced after pushing to production; that's why we chose Logstash in the first place. But as I said, we are planning to replace Logstash with rsyslog: since there's no processing involved and it's already a structured JSON document, we can replace Logstash with rsyslog, which can read data from Kafka and write to Elasticsearch. Okay, one more question: how do you maintain your indexes in Elasticsearch? The speaker will be available offline, so if there are other questions, if there are no more questions we'll come back to you, or you can catch him after the talk. Hi, I have three questions actually; I will combine all of them. One is about ES fields versus tags, about performance. You said right now you have 25-plus services; that can increase to, let's say, 40, 50, 100 services, and each team may want to create their own tags. Wouldn't that increase the indexing load on Elasticsearch, performance-wise? It's going to increase your indexing load, but you can scale up your entire pipeline as per your requirements. I think the load is going to be high on Elasticsearch, given that you will have many different tags for different data points, but you can shard your data in Elasticsearch in such a way that, based on your read and write throughput requirements, you distribute the data across different nodes. Did I answer your question? Hi, I was looking to understand a little better this anomaly detection that you mentioned. I'm not an expert in anomaly detection, but the thing that we want to implement is this: there are different algorithms that have been proposed for anomaly detection on metrics; there's something called the A-star algorithm, and there's someone from EB who proposed a statistical anomaly detection algorithm with which you can detect anomalies in a metric. For example, say you're monitoring the latency metric for a specific API and you want to find anomalies in the latency: only at a specific period of time it gets kind of spiky, and that's an anomaly compared to your average latency, right? So that's an anomaly in your system, and you want to understand why there's spikiness during that particular period of the day. You want to detect it first, and based on the anomaly you might want to dig deeper by viewing subsystem metrics, like what the latency of each subsystem is. That's what anomaly detection means, and we are planning to implement the algorithms already proposed by many data scientists; that's what we are working on. Hello, hi. For the backend databases, have you explored something like OpenTSDB? Sorry, I can't hear you. For the backend databases, you mentioned you're using ES, InfluxDB and a bunch of other stuff you were testing, right? So have you tried something like OpenTSDB? We explored OpenTSDB as well before we chose Elasticsearch. Again, when we compared OpenTSDB with Elasticsearch, the query capability is much richer in Elasticsearch. For example, I think OpenTSDB normalizes all your tags into a single metadata point, so if you run a search based on different tags, it's going to be quite complex with OpenTSDB.
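As a toy illustration of the kind of spike detection described in the anomaly-detection answer above (not the specific algorithms named there), here is a minimal sketch flagging latency points that sit far above a rolling average; the window size and the 3-sigma threshold are arbitrary assumptions:

```python
import statistics
from collections import deque

def detect_spikes(latencies, window=60, sigmas=3.0):
    """Yield (index, value) for points far above the trailing average.

    A naive static-threshold detector: flag a point if it exceeds the
    rolling mean of the previous `window` points by `sigmas` standard
    deviations. Real detectors also handle seasonality, trends, etc.
    """
    history = deque(maxlen=window)
    for i, value in enumerate(latencies):
        if len(history) >= 10:  # need some history before judging
            mean = statistics.mean(history)
            stdev = statistics.pstdev(history) or 1e-9
            if value > mean + sigmas * stdev:
                yield i, value
        history.append(value)

series = [20.0, 20.5, 19.5] * 17 + [95.0, 20.5]  # one obvious spike
print(list(detect_spikes(series)))  # -> [(51, 95.0)]
```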
That's how OpenTSDB stores the information in your data store. With Elasticsearch, on the other hand, it supports full-text search, which is not really needed for the metrics aggregation use case, but you can slice and dice metrics based on various dimensions, and that's what Elasticsearch provides; that's why we chose Elasticsearch. In your slide you have mentioned sending metrics: is that a part of the API, or through some other medium are you getting that feedback of metrics? My question is, is your API sending metrics as a call, or how is it maintained? So, we send metrics from our microservices to a localhost UDP port, and we have implemented a telemetry library in the various languages that we use. All the library does is take the metrics from the microservice and ship them to a localhost UDP port where rsyslog listens. Yeah, it's local; the reason being we don't want to add any delay from sending telemetry to the actual API request. For example, if you directly ship the metrics to Kafka, that's a remote call to Kafka, and that's going to add the same kind of latency to your API request. Since we are using a UDP port, which is non-blocking, the latency will be very minimal, and even if there are failures in the upstream services, the microservices will not be affected. I have a couple of questions. One: is there any replication for the Elasticsearch nodes you have configured, the four nodes? Because that becomes a single point of... It's not very audible. Is it better now? Yeah, okay. So, is there any replication for the Elasticsearch nodes you have configured? If one of the nodes goes down, the data is lost, right? So is there any replication configured, that's the first question. And the second question: when the primary persistence goes down, the logging or the alerts will be flooded, right? Is there any alert folding configured in your systems? There will be thousands of alerts coming, and we won't know where the actual error is. Is the alert folding concept configured? Alert folding, like, basically one failure can trigger multiple alerts, right? So, the answer to the first question: yeah, we configured data replication in Elasticsearch, and Elasticsearch supports it out of the box; it supports clustering. For example, in our case the replication factor is three, so even if you lose one machine the data will not be lost and the cluster will be up and running. The second question is whether ElastAlert supports alert folding: I don't think it supports it right now, so if you need that feature we might have to implement it. We can also plug in other systems that support these features; Prometheus might support alert folding, so you can plug in Prometheus to read data from Kafka and visualize or monitor your metrics in Prometheus. The same pipeline can be used for other use cases as well: we are using it for metrics aggregation, but you can use it for your logs, even for your event streams. Thank you, guys. Thank you, Vishnu. So Vishnu will be here; we have a separate 15-minute Q&A session with Vishnu, Sobik and the next speaker, Raghu, so please hold on to your questions. We will head to the morning break; before that, we have given you feedback forms, so please do rate the talks and give us any feedback or suggestions to help us improve the event. And we will be having flash talks from the audience this afternoon, so if you are interested do let me know. We will be back at 11:30. If you have not got the feedback forms, do let us know; we will give you copies. Thank you. So, again, with
that graph, if you see the changes for each of the backends, the one that receives most of the traffic is the quickest one, something Swami also briefly mentioned in his talk. So this causes load imbalance. What is the impact of this? Performance degradation: as that single server starts receiving more and more traffic, the users whose requests are sent to that server will have degraded performance. And also server underutilization: the rest of the servers, which do not receive traffic, are not utilized. HAProxy has an agent-check feature where an agent on the backend servers can send responses to HAProxy, which can be a simple string, say "75%", which means "change my weight to 75%", or you can also send a command to change the state of the backend, which can be maintenance, down or up. Now, Harald is the agent that we wrote which implements this protocol and knows how to respond to HAProxy agent-check requests: it can respond with "maint", which puts the server into maintenance, or "up", which causes the server to come out of maintenance, or it can be a string like "75%", which means "reduce my weight to 75%". HAProxy by default does these checks every 2 seconds: every 2 seconds it will create a TCP connection to the backend, on port 555 in our case, and wait a short while for a string command. This, in essence, is load feedback. It is constructed from a combination of these four components: HAProxy, weights, the agent-check protocol, and finally the agent, which is Harald. Harald, as I mentioned, sits alongside your application. It will query your application (it can query other resources as well) and get the current load of the application, and then, when HAProxy sends the agent-check request, it responds to HAProxy with the weight. As the traffic pattern changes, the load of the server will change, and Harald will again get the current load, calculate the weight response, and feed that back to HAProxy. So Harald has two responsibilities: one, collect the load metrics from an application and calculate the response based on those load metrics; and two, respond to the HAProxy agent-check requests that come every 2 seconds. Now, the application load metrics: this is something the application can expose through an HTTP interface, a file interface, JMX, it can be anything, and in that interface you provide the metric that your application cares about. RPS is the simplest one, but it can be currently active connections, or a backend load metric; it's all up to the application. This is an example Harald configuration; it's a YAML file. Here you specify where the load metrics come from, that the response is a JSON response, the key in that response that holds the load, and this keyword specifies the threshold, 9000. What this means is that this service can do at peak 9,000 requests per second, and what Harald will do is: for example, let's say the application returns 4,500, then the response it calculates to send to HAProxy would be 50%. This is a graph that basically shows Harald in action, where you have a bunch of backends; we can see the response time is constant, but as per the input traffic the weights change. It's really built for production; we've been using it for the past two years, and there are a lot of features we've added to make it production-worthy. Some of them: first of all, it's written with gevent, so it's single-threaded and uses minimal resources, as it has to sit alongside your application. It uses async polling for metrics, so the HAProxy requests and the polling it does for the application metrics run independently. Responses are also cached, which saves CPU.
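To make the agent-check exchange concrete, here is a minimal Python sketch of an agent in the same spirit (not Harald itself): a tiny TCP server that replies to each HAProxy connection with a weight computed from a load metric. The port, threshold, and the get_current_rps stub are assumptions:

```python
import socketserver

PEAK_RPS = 9000          # benchmark-derived peak capacity, as in the example

def get_current_rps():
    """Stub: a real agent would poll the application's HTTP/file/JMX
    load-metric interface asynchronously and cache the result."""
    return 4500

class AgentCheckHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # HAProxy connects (by default every 2 seconds) and expects a
        # short string reply such as "50%\n", "maint\n" or "up\n".
        load = get_current_rps()
        spare = max(0, PEAK_RPS - load)
        weight = int(100 * spare / PEAK_RPS)  # 4500/9000 load -> "50%"
        self.request.sendall(f"{weight}%\n".encode("ascii"))

if __name__ == "__main__":
    # The talk mentions port 555; an unprivileged port is used here instead.
    with socketserver.TCPServer(("0.0.0.0", 5555), AgentCheckHandler) as server:
        server.serve_forever()
```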
You could also query external metrics systems like Graphite, Prometheus and so on. So what's the impact? It's been used at Helpshift for the last two years: consistent response times for our users, optimum utilization of the servers that we use, and therefore cost savings. And this is a production graph that just shows the Harald weight changes over time; this is for the last two days, this period here. Hey, I have two questions. The first thing you told us is that Harald, based on some value like 9000, tells HAProxy how much load to pass. In a dynamic environment you always know that your load is increasing as your business grows, so how do you keep track of this 9000 value? It should be dynamic. So, 9000 is based on the benchmark that you do for your application: 9000 is the most your application can do; beyond 9000 it can't handle the load anymore. That's where the number comes from. So when it reaches 9000, what Harald sends to HAProxy is 0%: don't send me any load, because I'm already at peak load. That is not my question: how do you come up with this number 9000? Because if you do a benchmark every day, that benchmark is actually going to change based on your traffic pattern, so is there a dynamic way of doing it? Why will it change? It will not change, because, just to rephrase, the benchmark is definitely synthetic, but you will try to match it as closely as possible to what you see in production. And be conservative: if it can do 10,000, use a lower number. It's always a difficult number to come up with. Another scenario is you let it run in production for a while, you see how much it can do in production, and then you get a real-world number and pick a conservative number from there. The other question is: Harald is essentially an agent that has to be installed on every server or container, so is it very lightweight? How does it impact things, because you're health-checking, say, every 2 seconds or something; isn't that hampering the application too much? It's very lightweight, because it's a very simple TCP server; HAProxy just creates a single TCP connection, and it's purely I/O-bound because it's completely network-based: HAProxy is on the network, and your application load metrics are also pretty light. Most of the work is in just writing the data to the clients. We've been using it, and we have never seen Harald causing a problem. Any more questions? So, don't you think it could lead to a cyclic effect? Basically, let's consider a scenario: let's say you have 15 instances, and one of the instances says the weight needs to be 75; gradually all 15 instances also say the weight needs to be 75, so HAProxy sets the weight to 75 across all 15, which means your weights have no effect, so you still have the same problem, which means more of the traffic goes to each of the instances. So now the feedback kicks in, and all 15 instances say the weight should be 45, so then again all 15 instances get 45. So it could have an oscillating cyclic effect. Yeah, right, so there are two things. One is that that can happen if the changes are too frequent: HAProxy by default checks every two seconds, and if you change the weight every two seconds, then that kind of feedback can happen. One way to solve that is to set the application load-metric interval to 30 seconds, which means the weight can change at most every 30 seconds, so it kind of smoothens that feedback loop. But that also means that Harald will take some time to kick in; your traffic
will gradually adjust and balance, right? So that is something that can help, to an extent. The second point there is that you also have to have capacity in your cluster, right? I mean, if you have more requests than what your backends can do in total, then obviously Harald can't help; Harald only allows you to redistribute the traffic across the backend servers evenly. Does that answer your question? Let's also invite Sobik and Vishnu here to the auditorium; please join Raghu so we can start the Q&A on building reliable systems. Yeah, let's continue with the questions. Hi, this question is for Vishnu. You use UDP to collect the application metrics, right? That's what you mentioned. But as you said, with UDP there will be some losses, so do you have a sense of how much you lose, or anything like that? I agree some amount of loss is acceptable, but do you check how much you are losing in the application metrics, or whether all the application metrics are getting there? Sorry, can you please repeat? There will be some packet losses, right? So that means we will be losing some application metrics. Does your monitoring system take that into account? Correct, yeah, that's the disadvantage of using UDP. In most cases we will not lose metrics, but in some cases we might, and it's okay to lose metrics, because the application running well on your server is the most important thing compared to sending telemetry, right? That I agree with, but the thing is, how do you get a sense of how much metrics you are losing? To monitor that, we also add instrumentation in rsyslog: rsyslog captures how many metric data points it collects every minute, every second, so we get those metrics and push them to InfluxDB, the secondary metrics data store that we use to monitor the metrics pipeline. We in fact collect these metrics from all the rsyslog instances that run on every single physical server, and then we monitor the metric ingestion rate from each specific physical server. A question for Swami: did you evaluate any commercial packages for doing this circuit breaking, etc.? Are there any commercial packages which give the same functionality as the open-source ones, or something more? You're talking about Hystrix, which is open source; I was wondering if there are any commercial packages which have the same features or better features than Hystrix. Commercial, I don't know, but yes, there are other circuit breaker packages as well, Hystrix being the most popular; commercial, I don't know. Are there any more questions? A question for Raghu, and it's kind of a follow-up to the other question. As the load numbers go 45, 30, 20, when does Harald take care of adding more nodes behind HAProxy? Or do you guys monitor it and then operations adds them? How do you handle adding more nodes to HAProxy? Harald does not handle that, but we do have auto-scaling, and that usually depends mostly on the same metric that Harald uses for its calculation, because that's what determines that you need to add more nodes to the cluster. At Helpshift we have our own auto-scaling architecture, so that's what takes care of expanding or contracting the cluster. Harald's only responsibility as of now is load feedback. Guys, the internet is temporarily down; we are aware of the issue and working on it. This is about Hystrix: what is your opinion on Envoy? Because I see the problem with Hystrix is that you need to make sure all the
developers use it, right? And it's only in the Java ecosystem. Right now everybody is using microservices, where polyglot programming is common; how do you think about that? When it's about Hystrix, is there any roadmap, I mean, for supporting multiple languages, non-JVM languages? How do we handle things like rate limiting or throttling there? I have not used circuit breakers in other languages, so unfortunately I don't know, but eventually we also did not use Hystrix in some places and we wrote our own, and it wasn't too difficult, in the sense that building a general-purpose library like Hystrix is difficult, but solving it for a given service is not all that difficult, so you may just want to write your own. Oh, hi, this is for Mr Vishnu: can you tell me something about metrics collection on the client side, how we can integrate the client side into metrics aggregation in this architecture? Sorry, can you please repeat? Can you tell us something about aggregation of metrics from the frontend or client side? Unfortunately we are not doing that right now, so I can't speak much about it, but I guess you can make use of the same infrastructure in the backend; you just have to think about the endpoint that you want to expose for collecting metrics from your frontend. But yeah, unfortunately I don't have experience collecting metrics from the user end. Yeah, hi, this question is for Vishnu, regarding the elastic alerts. As you said, it generates alerts based on some threshold being breached, right? So what kind of rule engine do you use to generate these alerts? There's no special rule engine; it's just a Python application called ElastAlert, and all it does is, based on the rule that you define, construct the Elasticsearch query, query the metrics, and if the metric response is greater than the threshold that you define, it sends an alert to your alerting system. It will also check the previous alerting state of that specific rule: if it is already in an alerting state, it will not send an alert again; if it is in a resolved state and goes back to an alerting state, it will send another. Those states are maintained in Elasticsearch, which means ElastAlert itself doesn't maintain any of these states; it's completely stateless, so you can scale it. So, Vishnu, how do you manage your indices in Elasticsearch? You have limited memory on all the nodes; do you do some kind of rolling index based on date? So, we maintain daily indexes in Elasticsearch; that means we create a new index every day for metrics. We also maintain separate indexes for application metrics versus server metrics, because, for example, server metrics we only want to retain for the last two weeks or so, while application metrics we might want to retain for a longer period, a month or two. So based on that, you can maintain separate indexes for metrics. Coming to upgrades: I think Elasticsearch supports rolling upgrades, but we haven't upgraded Elasticsearch till now, so I don't have much experience upgrading it; as far as I know it supports rolling upgrades, so there won't be any downtime while upgrading Elasticsearch. All you have to do is add new nodes, install the newer version of Elasticsearch, and bring down the old nodes. Does that answer your question? So, we retain metrics for the last 70 days, but if you want metrics older than 70 days, you will have to restore them: we take snapshots of the daily indexes at the end of every day and back them up in S3.
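A rough sketch of the daily-index housekeeping just described, again with elasticsearch-py; the index name patterns, retention windows, and the snapshot repository name are assumptions:

```python
from datetime import date, timedelta
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-node-1:9200"])  # hypothetical endpoint

today = date.today()

# One index per day and per category, e.g. app-metrics-2018.05.11,
# so each category can have its own retention window.
RETENTION = {"app-metrics": 70, "server-metrics": 14}  # days

for prefix, keep_days in RETENTION.items():
    expired = today - timedelta(days=keep_days)
    old_index = f"{prefix}-{expired:%Y.%m.%d}"

    # Snapshot the expiring daily index to an S3-backed repository
    # (registered beforehand), then drop it from the cluster.
    es.snapshot.create(
        repository="s3-metrics-backup",
        snapshot=f"snap-{old_index}",
        body={"indices": old_index},
    )
    es.indices.delete(index=old_index, ignore_unavailable=True)
```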
There's the Elasticsearch snapshot-and-restore API, which you can make use of yourself: you can snapshot to multiple remote data stores like S3, and it might also support Azure Blob Storage. So we snapshot to S3, and if you want to restore a specific day's index, all you have to do is use the restore API to restore the index, and then you can query the metrics. A question for Vishnu: how did you decide that you would build your own metrics collection and monitoring solution? Did you evaluate any readily available solutions, either open source or closed source? I just want to understand at what threshold implementing your own solution makes sense, and when a commercial solution or other ready-made solutions would make sense. There are many open-source solutions, and there are also paid versions of these softwares for which you have to buy a license. One of the main downsides that we saw with the open-source, free versions is high availability: most of them run on a single node, which means they will not be as reliable as the pipeline that we designed, and they won't scale after a certain point. For example, we had been using Graphite before we moved to this pipeline, and it didn't scale for us. That's why we decided to build a pipeline which is reliable and scalable, and as you can see, we have used all open-source systems to build it, like Kafka and Elasticsearch; we haven't built any system specifically for metrics aggregation, we have made use of the open-source systems that are available. One of the other requirements which made us build our own pipeline is flexibility. For example, as I said in the presentation, many people were using Graphite, but people have moved to Prometheus now because it offers more functionality compared to Graphite, and a couple of years later you might switch to a different monitoring pipeline. So we want a flexible pipeline where you can plug and play various components without touching the other components, even in the pipeline that I described; also for monitoring, if you need features that Prometheus provides. Thank you Sobik, Vishnu and Raghu. In the interest of time we would like to move to the next talk; the speakers will be around, so feel free to interact with them. Thank you. So the next speaker is Aditya Patawari. He's going to be talking on throttling requests before they hit your application. He's been working on systems engineering and operations for many years, and he's built and managed infrastructure to handle millions of requests; he's experienced bottlenecks and has learned optimizations that can help build good infrastructure. When he's not in front of his laptop, he is usually traveling, just to unwind or to find a better place to go back and sit in front of the laptop. Aditya. Hello everyone, my name is Aditya. I'm going to talk about throttling APIs before they reach your application. A little bit about me: I've been a systems engineer and a DevOps engineer for quite some years now. I consult at DevOps Nexus, and I've contributed to some open-source projects like Kubernetes and the Fedora project. I've authored a couple of tech books, and I'm a regular speaker at various conferences. So what are we doing today?
We're going to try to save the world from us. Now, this is obviously an exaggeration, because nothing can save the world from us. So let's talk about APIs. What is an API? Any ideas about what an API is? And I'm not talking about the standard definition that we learned in our graduation days; I'm more interested in a very simplistic, very narrowed-down version of what an API is. Since I don't see any hands raised, I'm just going to say that an API is something where we send a valid request and we receive a valid response. Today we are just going to focus on those two things: we send a valid request to receive a valid response; that's all we are going to do. And wherever there are APIs, in whatever companies, there is a problem of API abuse. Now, what does that mean? Not all API abuse is malicious in intent: sometimes some enthusiastic developer just writes an infinite loop to check the status of the API, like "is my job done, is my job executed?". There are times when some people are not happy with your API, so they might intentionally abuse it. There are times when your API is too open in nature, so there are bots that might abuse it. I mean, the abuse list goes on, and it's a very standard problem that a lot of orgs face: how to mitigate, or how to work with, API abusers. Now, there are some conventional, popular methods which are almost always used by a lot of companies to handle this issue. A very standard way is to use middleware like, say, Rack::Attack or ratelimit; Rack::Attack is a Ruby library, ratelimit is a Python library, and there are a bunch of libraries like these in almost all the programming languages out there. My personal problem with these sorts of libraries is that the bad request still hits the application. It's the application that decides, "you know, I'm not going to serve this", but the request still hits the application, and because it still hits the application it ends up eating a lot of resources. I mean, if you are getting bombarded with a lot of requests, even if you are denying each request you still have to serve it in some way, and that's not always a good thing to do. Now, to mitigate this standard problem, the next solution is to use off-the-shelf WAFs, or web application firewalls. My problem with web application firewalls is that sometimes they are expensive: if you are going for a proprietary solution, almost always it will cost you thousands of dollars. And sometimes they are less flexible, and I'm going to come to an example of that inflexibility in one of the most popular off-the-shelf WAFs. When we talk about infrastructure and we talk about WAFs, the most common WAF that comes to my mind is AWS. AWS has a web application firewall which you can put in front of an application load balancer, and the good news is that you really can rate limit using it, so that is something available to you out of the box, which is awesome. My problem with it is that it has a hard window: you can only rate limit in intervals of five minutes, which is slightly weird to me, because usually when people sell APIs they do per-minute rate limiting or per-hour rate limiting, but somehow AWS found it a good idea to do per-five-minute rate limiting. So a number of requests will go through, and then AWS sees that the threshold for five minutes has been exceeded; and five minutes is, I think, a slightly long time interval in certain cases. And I mean, I can still live with five minutes, maybe, if I am a very small company and I
don't have the time or money to invest in something better. But the worse news is that it can only rate limit on the basis of IP addresses. Now, if you are, let's say, a large organization where you have multiple teams, and one team ends up abusing somebody's API, and that somebody is using this WAF solution, basically all the other teams are now locked out too. And imagine a worse scenario: within the same company you can actually go to the team and say, hey, stop abusing the API, because we are also using it; but let's say you are in a sort of co-working space and another company abused it, then you have pretty much no say. That company can effectively get you blocked out of somebody else's API, and they won't even know. So that's too weird for me. Now, one solution that I figured out to mitigate this particular issue is to use standard Nginx. This is something I have not seen a lot of people using, and I don't know why; maybe it is not very well documented, maybe not very popular. But you can actually use Nginx to rate limit, and that is awesome, because it's not only based on IP addresses: Nginx can rate limit on the basis of a variety of parameters, like basic-auth usernames, HTTP user agent, HTTP version, and IP address of course is one of them, and so on and so forth; you can block on a variety of parameters. Now, my problem with this is that it doesn't handle customer tiers. When we build APIs, when we are in the business of selling APIs, what we basically want is to sell various plans: maybe there's a personal plan, there's a business plan, there's an enterprise plan, and all those plans have different allowances. Somebody might be paying you $20 for a personal plan and you might allow them to call, say, 100 times, but there are companies who will pay you tens of thousands of dollars, and you want them to be able to use your APIs a lot more. If you are using the standard Nginx method, standard Nginx will not be able to differentiate. What you can still use this solution for is to weed out huge spikes, so that maybe you can get rid of some sort of small DDoSes: you can say, my biggest enterprise customer does 10,000 requests per minute, so if I see requests from a single source in the range of 20,000 per minute, then I should drop them. Something like that: you can weed out huge spikes, but it still might not be what we want it to be. So my solution to this entire mess is to use Nginx plus Lua plus Redis. Nginx has a Lua plugin which is very awesome. Basically you create a pipeline where Nginx receives and responds to the requests, the Lua code maintains the logic for rate limiting, where you can say "based on this key, do something", and Redis keeps track of the current state, like how many requests have been processed for this particular client, how many are left, and so on. The big plan is that in the beginning we assign a fixed number of tokens to each user and store them in the Redis "bank" (I'm choosing to call it a bank; you can call it anything), and each request costs a certain number of tokens. For simplicity we can say that each request costs one token, and as people keep bombarding me with requests I keep deducting from their token balance, and when the balance is zero I will not respond to their API calls until the token balance is reset or restored, after a minute or whatever window you choose. So I've prepared a quick demo for the same.
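The core of that token-bank logic might look like the following Python sketch (the production version described in the talk lives in Lua inside Nginx via the OpenResty plugin, but the flow is the same); the key names, the per-minute window, and the default allowance are assumptions:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

WINDOW_SECONDS = 60      # token balance resets every minute (assumed window)
DEFAULT_TOKENS = 100     # allowance for a hypothetical personal-plan user

def allow_request(username):
    """Return True if the user still has tokens; each request costs one.

    The key carries a TTL, so the bank refills implicitly: once it
    expires, the next request re-creates it with a full balance.
    """
    key = f"tokens:{username}"
    # Initialize the bank only if absent (nx=True), with the reset TTL.
    r.set(key, DEFAULT_TOKENS, ex=WINDOW_SECONDS, nx=True)
    remaining = r.decrby(key, 1)
    return remaining >= 0

# Inside the proxy, a denied request would be answered (e.g. with 429)
# without ever reaching the upstream application.
print(allow_request("alice"))
```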
Okay, so I have a setup wherein I have basically two Nginx instances running. If you see, the first Nginx at the top has the Lua code; the second Nginx is just serving a static page, and that's basically where I'm proxy-passing to. So the second Nginx plays the part of your application that is supposed to serve the APIs, sort of. What I'm now going to do is use Apache Benchmark and bombard it with around 10,000 requests; nginx.lua is basically my own localhost, I've just created an entry for it. Right, so it just made 10,000 requests, and if you see the non-2xx responses, yeah, non-2xx responses: basically that many requests were denied, out of the 10,000 requests I made. Now let me try to do the requests again quickly. Since I did the requests within the same minute, all 10,000 requests got denied, and the best part is that they didn't actually reach my application at all, so my application's resources were not wasted. Now I'm just going to talk gibberish for like 30 more seconds, primarily because I want to show you that it gets reset every minute and we can do it again. So, how many of you had tea this morning? I think the 30 seconds are done, okay. So if I do it again, I think it should serve... all the 10,000 requests? 10,000? Sorry, yeah: it denied approximately 3,000 requests and served the other 7,000. So basically this is just standard Nginx plus Lua, wherein I'm checking the Redis bank again and again and figuring out whether I should serve the request or not. I have some benchmarks to talk about what really happened here. I'm sending 10,000 requests with a concurrency of 50 requests together; I've created 5,000 users with random numbers of tokens, and the test user that I'm using has 7,000 tokens. So whatever number of requests I send, only approximately 7,000 go through; now, this is not very accurate, by the way: because of the concurrency level, when I say 7,000, approximately half a percent or so extra will go through. This is what we figured out when we benchmarked it: it does add a bit of latency to your overall response time. For example, if I was not using this Nginx at all it was 3 milliseconds, but with Lua and Redis, 2 milliseconds were added for 50 percent of the requests, and if I talk about 99 percent of the requests, then around 6 milliseconds were added. But I think it's not too bad, because if I look at the mean time, only about 2 milliseconds were added, and that is something I think we can live with. It's not too much; if you're doing video streaming this might cause you a problem, but for standard HTTP requests that's not a big deal. It has fairly decent accuracy: this is basically an average of about 15 to 20 tries, and the accuracy is fairly good, but it's not 100 percent accurate. I think, given the cost it presents and the savings it can give in terms of infrastructure, this inaccuracy should be acceptable. The longest request that we had a problem with also added just a latency of 6 milliseconds. So that was all. There are some corner cases that I might want to talk about. I'm using Redis; you can plug in any backend you want: Memcached, MySQL, Postgres, SQLite (maybe not SQLite), but whatever. If there is too much data in Redis, I've seen that latency increases a little bit, like if you insert a million users or something like that. And there are a couple of cases that you might need to consider while implementing this. The first one is: what about requests that come without a throttling parameter? Right now we use
username as the throttling parameter; that's why, when running Apache Benchmark, I passed a flag, "-u 1". But what if a request comes without the throttling parameter? Now, that is less of a technical question and more of a business question: how does your business want to handle things when there is no throttling parameter? We implemented this for BrowserStack (it was awesome, by the way), and according to their business plan, if there is no throttling parameter you just allow it, because then the application will take care of accepting or rejecting it based on what kind of API request it is. And similarly, there was one use case which we were worried about: what if Redis crashes? You can actually write error handling in Lua which will allow you to bypass the entire Redis workflow in case Redis is not available, so your customers will not see any impact at all, and that will give your ops time to figure out the root cause and fix the problem. So, that's all I have. Questions? You have built a very good system there, congrats on that. My question is: have you considered any other open-source platforms while you were building this, or AWS's own API Gateway? Which open source, I mean? We tried out a bunch of... Yeah, I just want to know what your experience with those was, and why you then built your own. My experience was pretty much summed up in the slide where I said: either they were too complicated and inflexible, or they cost us a lot of money. This was actually a reasonably good mix of not costing us much and being quite straightforward to implement; I mean, it didn't take a lot of time to implement, and since it was fully programmable and very easy to program, we could embed as many business cases as we wanted, versus an off-the-shelf solution. Where is the code? Out here, straight ahead: where is the code? I want to see how it works. The code is unfortunately not open source, but it's standard: it comes from the Cloudflare folks, OpenResty is the name of the plugin, and then it's a standard if-else plus a Redis plugin; I mean, it's probably just 20 lines, there's nothing much to it. Hi, my question is: how does the reset work? Like, who does the reset here? Yeah, that's actually a good question; I was hoping somebody would ask that. Reset can be taken care of in multiple ways. If you are, let's say, on Rails, you can do a rake task; that's one way. Another way, and what we ended up doing, was that we set each parameter with a TTL, and when the TTL expires, the first request that hits renews the token bank. But there are a variety of ways to do it. Like I said, the main thing to take from this talk is that Nginx and Lua can be combined very easily using Cloudflare's OpenResty plugin: simple if-elses, error handling, and a bunch of those things. Hi, so you talked about sniffing some attributes in headers and everything; does it have the capability to sniff out certain attributes in the POST or request body? I mean, generally the client details, like the plans that you were talking about, are available as part of the body of the request. That part you need to take care of when you build the token bank. Let's say, here, most likely your users will come with an API token or something like that, and that API token can be the token-bank identifier. Now, depending upon their plan, you basically update your Redis bank,
and it is taken care of like that. For BrowserStack we had multiple groups: we had a giant company group, then we had subgroups, and then we had individual users, and like I said, it's really straightforward, just a bunch of if-else statements. Hi, I have two questions. One: you said Redis can go down and you have a try-and-catch for it, but a lot of times when it goes down it can actually accept the connection but stop responding; does Lua support some kind of timeout? Yes, yes: when you make the connection to Redis, you can specify a timeout. Okay, cool. And the second one: you said you need to put in some kind of tokens and then keep taking them out and all; that looks like a complicated setup. Sure, yeah, I will catch you offline. All right, thanks, Aditya. All right, thank you, guys; you can catch me outside if you want. Aditya is around, so you can ask him your questions. So the next speaker is Leena S N. She is going to be talking on the expand and contract pattern for continuous delivery of databases. Leena has used the expand/contract pattern to make significant database schema changes safer and reversible; she is here to share her experience using the pattern to apply continuous delivery to databases. When she's not in front of the computer, she likes to spend time with her daughters, and she likes to listen to music, especially Carnatic music; she also attends concerts whenever time permits (she is just listening, not performing yet). Leena S N. Hi everyone, am I audible at the back? So, good afternoon. I think people are getting ready for lunch; we are left with two more sessions, so I will try to make it faster so that I don't bore you with my talk. A few years back we were working on a product, an e-commerce product, and we were building a module for designing things online. The idea was that you can design things using your browser and print them on your goodies; it's very common. So the users can come to the site, design things, and say "this is what I want printed on the goodies"; then they pay, the design is delivered to the printing vendors, and the printed things get delivered to the customer. And they can do multiple things: they can upload their own images and resize them, add their own text, add different kinds of shapes, and so forth. It was working well. That is when we realized that we had a problem: we wanted to introduce functionality for certain text transformations, text features such as rotating the text or adding shadows to the text, stuff like that. And that is when we realized that the library we were using for converting the design that you see on the UI into the PDF was not very precise. It has to be very precise: the colors and the positions have to be exact, so that the printers can print it well and it looks exactly like what you see on the site when it gets printed on the goodies. So we hit a bottleneck: certain transformations were not working as we expected, and that's a huge change. So we had two problems here. One is that the current library doesn't support the feature that we want to introduce, and we want to release that feature as soon as possible; the other is that we already have a lot of features built using the existing library, and we also need to migrate that functionality to the new library. So what we did was, since the immediate requirement, what the business wants, is to deliver the new functionality,
we approached it using a technique called parallel change. What it does is: we have both libraries sitting in the codebase; the existing functionality uses the existing library, and for the new features that we are adding, where we had the issue, we use the new library. This is the parallel change, and over a period of time we migrated from the old library to the new one. That gave us enough time to test the new library, rather than going into a branch, doing the entire migration, coming back and releasing; and it gave us enough time to evaluate the new library so that we don't hit another surprise down the line. This is not a very new technique; this is the standard branch by abstraction: don't branch the repository, instead branch your code, and over a period, as you gain confidence, you migrate. How long should the parallel implementation be there? It depends on a lot of factors. Things moved on, and now I work on a product called Good Karma. It's a B2B product delivered to yoga studios to manage their business. We are still a startup; we started 12 to 18 months back, so the product is evolving. We started with a certain idea, and then, as we onboard new customers and learn from the market, we realize there are changes, and that means the changes need to happen even to our database model. As the product grows, the understanding has to be reflected in the product as well as in the modeling that you have done in the database. And you know that the database, even if you are a startup and you don't have millions of customers, even then, the slowest-moving part of your product is the database, and it is considered a very high-risk thing, because if something goes wrong it is very scary, especially on the database side; migrations and things like that are complicated, and there is a high tendency to become very nervous when the database doesn't work as expected. So the thought process is: can we actually make it less risky? That's the old premise of continuous delivery: if there is something that is painful, do it all the time so that the pain gets reduced. The solution is not to avoid doing it; the solution is to do it all the time, so that you take the pain forward and handle it continuously rather than once in a while. And that is where the concept of database refactoring comes into the picture. The key word here is refactoring. As Martin Fowler has explained, the beauty of refactoring is that you do it in small steps: you do a series of small refactorings which give you a bigger result, and every refactoring that you do has to be split into tasks as small as possible; that is the value of refactoring, if you do it the proper way. Similar things can be applied to databases too, and that is what I learnt during the last few months, close to a year even. So there are techniques, and none of these techniques are mine: there is a book called Database Refactoring, and it talks about how you structure each migration depending on what kind of migration you want to do. Are you adding a new validation, or changing the structure of the database, or adding new indexes? How you approach each of these is very well written up in that book. I will take up two examples that I have used recently while building the product.
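As a minimal illustration of the parallel-change / branch-by-abstraction idea (the class and function names here are hypothetical sketches, not the product's actual code): both renderers live behind one interface, and a toggle decides which paths use the new one while the old one keeps serving everything else:

```python
class PdfRenderer:
    """Abstraction seam: callers depend on this, not on either library."""
    def render(self, design):
        raise NotImplementedError

class LegacyPdfRenderer(PdfRenderer):
    def render(self, design):
        return f"legacy-pdf({design})"   # wraps the old library

class NewPdfRenderer(PdfRenderer):
    def render(self, design):
        return f"new-pdf({design})"      # wraps the new, more precise library

def renderer_for(design, use_new_features=False):
    # During the transition, only designs needing the new text
    # transformations go through the new library; everything else
    # stays on the proven path until confidence is gained.
    return NewPdfRenderer() if use_new_features else LegacyPdfRenderer()

print(renderer_for("mug-art", use_new_features=True).render("mug-art"))
```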
The first is called the split column refactoring. The product has a module for online booking of trial classes: customers get an SMS with a link, then, via an IVR system, they go to that site and can book a trial class with any yoga studio, completely automated. And what we need to know is whether the customer has attended the class, but that is not where the funnel ends, right? The funnel ends when they become a member of the yoga studio. So first they get the SMS, that is the first step; then they book a class; then they attend the trial; and then they become a member: that is the success criterion of the entire funnel. We had a status field in our trial bookings table, and the status field said whether they had attended or not, along with whether they had registered. So it can be that they attended but are not interested in registering now, or it can be "I do not want to register now, I want to register a few months down the line", in which case there has to be a follow-up mechanism. Both these pieces of data were stored together in this one field, and we realized over a period of time that we needed to split it, so that we could run better analysis on it and give more insights to our customers. So we had to split it, and we called the new fields attendance status and membership status. Attendance status says whether they attended or not; if they have not attended, the membership status may not make much sense, but attending is not mandatory: even without attending, they can still become a member. So what we did was introduce the new fields while still keeping the old field, and apart from introducing the fields, we also added certain triggers (or callbacks, since every ORM supports callbacks) so that any time you save data to status, it gets split and stored into these two. This is the first step: you migrate to the new implementation. Then there are so many places, in the codebase or in other dependencies you have, that depend on this field, so it takes time to actually change everything across: how you display it on the UI, or even other downstream dependencies that use the same field. So you can leave it like that for some time, until all the downstream dependencies have moved to the new implementation, and then you contract to the new implementation. This is the entire refactoring process: you introduce the migration, that is the transition period, and at the end of it you contract to the new implementation; rather than doing it as one big step, you are doing it in multiple steps. Another example is the split table refactoring, where in this case, instead of splitting into multiple columns, we are splitting into multiple tables. We had a billing cycles table which had all the payment details, and we wanted to segregate it, because later we realized that people can pay in installments: when they are paying, they can say, "okay, I will pay this now, and after a month I will pay the rest of the amount". So it has to be tracked in multiple tables: one billing cycle can have multiple payments. So we had to segregate that. The same technique: we introduced triggers to save to the new table, we changed the application across to use the new structure, and then you completely move to the new one. The same thing: you have a migration period, and then you completely switch over.
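A minimal sketch of the expand step for the split-column example, using an ORM-style save callback (the model, status encodings, and values are hypothetical; a database trigger could do the same job):

```python
class TrialBooking:
    """Expand phase: the old `status` field stays, and the two new fields
    are populated automatically on every save, so old readers and new
    readers both keep working during the transition."""

    def __init__(self, status):
        self.status = status              # legacy combined field
        self.attendance_status = None     # new field
        self.membership_status = None     # new field

    def save(self):
        self._split_status()              # the "callback"/trigger step
        # ... persist all three fields to the database here ...

    def _split_status(self):
        # Hypothetical legacy encoding like "attended_registered".
        attended, _, registered = self.status.partition("_")
        self.attendance_status = attended            # e.g. "attended" / "no-show"
        self.membership_status = registered or None  # e.g. "registered" / None

b = TrialBooking("attended_registered")
b.save()
print(b.attendance_status, b.membership_status)  # attended registered
```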
That is where the entire process gets its name, expand and contract: you expand to the new way, and then you contract, completely get rid of the old way of doing it. So what is required for you to implement it, what are the basics you need so that you can approach this? One is versioning: you need to have versioning for all your database migrations; every migration has to be uniquely identified. And as far as I know, I might be wrong in this case, but I think almost all the ORM frameworks, especially those that follow the active record model, support migrations this way, so that you can quickly roll forward, or roll back, to a certain state of a migration. That is the basic need. Then, enough automated tests as safety measures, to make sure that none of the changes are breaking, especially when you are transitioning and contracting; this gives you enough confidence. And then the last is: how long can it take for a team to actually contract? That completely depends on various factors. I have been a consultant, so: it depends. It depends on how long you need the parallel change to be in place, how big the team is, how critical the current migration is, how many downstream dependencies you have, how much time you need to keep it so that you gain enough confidence; lots and lots of factors that you need to think of. And this is not just restricted to the expand-contract pattern; I think it is about how you want to deploy database migrations. Do you want to have downtimes? Then how do you plan your downtimes? Or if you don't want a downtime, what is the other mechanism of doing it? That is the strategy that you have to build. The expand-contract pattern is about lowering the risk of every release; that is the whole idea. How can you bring down the risk? Every deployment, every release is risky, but that doesn't mean we can't release; you can only reduce the risk by taking care of certain things. And what are those principles? One is doing it incrementally; taking it as small steps is itself the first step. Then making sure that any change you are doing is actually ready to deploy, making sure that every commit carries enough confidence that if it goes to production it doesn't break anything; automated tests and other things help you in this case. And every deployment doesn't have to result in a release. For example, in the case of the migration, we had actually migrated the database, but it was not visible to the users that new changes had been built into the system; that is the idea of dark launching. When you are just deploying a migration, that is dark launching, or even feature toggles help you in this case: you can turn off that feature in certain environments because, for various reasons, you are not ready for it. So having enough mechanisms to decouple every deploy from release helps you to lower the risk of your deployments. And doing it in batches as small as possible: that is very, very important; I think without that the rest of the things will not work. As a reference I would highly recommend this book; it has a lot of techniques on how to approach database migrations, and some more references.
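A minimal sketch of the versioning idea above: uniquely identified migrations applied in order, with the applied version recorded in the database itself. The table and column names are hypothetical, and real ORM migration frameworks do considerably more, but the mechanism is this:

```python
import sqlite3

# Every migration is uniquely identified and ordered, as the talk requires;
# the statements themselves are illustrative.
MIGRATIONS = [
    (1, "CREATE TABLE trial_bookings (id INTEGER PRIMARY KEY, status TEXT)"),
    (2, "ALTER TABLE trial_bookings ADD COLUMN attendance_status TEXT"),
    (3, "ALTER TABLE trial_bookings ADD COLUMN membership_status TEXT"),
]

def migrate(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, sql in MIGRATIONS:
        if version > current:          # roll forward only what is missing
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    conn.commit()

migrate(sqlite3.connect(":memory:"))
```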
And that's it; we have time for questions, I think. So, I can see the parallels between having feature flags and also doing this kind of step where you expand and then contract. What I see happening is that over time you have a lot of code where you have things which are really not used: you have expanded but not really contracted. So the issue is you have to remember to keep track of it. How do you do that as part of your planning process, how do you do the issue tracking for those kinds of things? I think it is very important, because if you don't do it right it becomes really, really messy, and I emphasize the strategy. Okay, so what is the approach that you want to take, what is the guideline that as a team you want to follow? The same goes with feature toggles also: you can say that no feature toggle can exist in the codebase beyond two weeks. If the entire team agrees that that's a good time frame, then you may need to bring in a system so that it is automatically detected, or you may have to bring in some kind of system to know that there are certain toggles which exist beyond this time. So we had used our own backlog kind of thing for feature toggles and this kind of changes, to make sure it is tracked separately. But interestingly, I have heard that there are automated tools that certain teams built, which I have never tried, where, say for feature toggles, the tool checks how many toggles there are, and if there are certain toggles beyond their expiry time, then the build fails. So you can go to that level: there will be some kind of guideline that every team follows, for either expand-contract or feature toggles, and you can look for those patterns and then have automated systems to fail it. That is very great, but I have never tried that. But having a backlog yourself and then keeping monitoring it, bringing it up in the normal team meetings or the daily stand-ups every once in a while: what I have seen is that it settles down in the team, right? If you bring it up very consistently, then I think over a period of time it becomes a habit. More questions?
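One way the automated detection mentioned in that answer could look: a sketch of a test that fails the build once a toggle outlives its agreed lifetime. The toggle registry and the dates are invented for illustration; this is not a tool from the talk.

```python
import datetime
import unittest

# Hypothetical toggle registry: every toggle declares an expiry date when
# it is introduced, and CI fails once a toggle outlives it.
TOGGLES = {
    "new_sms_library": datetime.date(2018, 6, 1),
    "split_status_columns": datetime.date(2018, 7, 15),
}

class ToggleExpiryTest(unittest.TestCase):
    def test_no_expired_toggles(self):
        today = datetime.date.today()
        expired = [name for name, expiry in TOGGLES.items() if expiry < today]
        # The failure message tells the team exactly what to clean up.
        self.assertEqual(expired, [], f"remove or renew these toggles: {expired}")

if __name__ == "__main__":
    unittest.main()
```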
Ok, thanks Reena. The next speaker is Vivek Sridhar. He is going to be talking on distributed tracing with Jaeger at scale. Vivek is a tech enthusiast with 11 years of experience in the software industry; at present he is a developer advocate with DigitalOcean, and he is a big time foodie. Vivek Sridhar. Let me try it once again: good afternoon. Wow, not bad. How many of you are thinking about this? Be honest, come on. Even I am thinking about that; it is time for lunch. So I know I am standing between you and your lunch, so let us all focus for the next 30 minutes on this particular session. So what is this diagram? This is microservices; this is the microservice dependency graph of Uber. Okay, so this is how complicated it can get when you try to build microservices at scale, and if you do not have the right set of tools to handle these microservices, it will be very difficult for you guys to figure out what your code is doing. So I have added some disclaimers, because Jaeger, or whatever we are going to discuss today, has a lot of intricacies, because it has both dev and ops involved in it. Okay, so my first disclaimer is that I do not want to talk about any of the architectures, because the architecture is available over the internet and you can go through it, and I will also be giving that reference at the end of my slides. Tracing is actually implemented by developers; how many developers are here? So tracing is implemented by developers, but my talk is not going to be from the developer perspective; my talk is going to be from the ops perspective. The developers implement it, and ops and dev both consume this particular tool. Okay, another disclaimer is that we should have an open mind while listening to this discussion, because I am not talking about 5 services; have an open mind that there are more than 50 services interacting, and we are building tracing for that. Okay, and another disclaimer I want to add is that Anand, do not ask questions for me. Okay, this is me: I am a developer advocate at DigitalOcean; I was heading DevOps at BlackBuck; I was working with HCL as well, as a DevOps solution architect for big data projects; and I was part of IBM India Software Labs for over 8 years. So let us discuss what microservices are, the most important thing about distributed systems. Let us take an example: we are building a marketplace and we have certain services; let me show you: product, order, cart, messaging, account. I have only taken 5 services as an example, but there could be 30 more services involved at the back end: there could be inventory, there could be catalog, there could be a merchant management system at the back end. So all these services are decoupled. Why do I say they are decoupled? Because the databases are separated, so any code push to your product service will not affect the order system, and any code push to order will not affect the cart system. That is how it is decoupled: unless and until these guys are interacting, there is no interaction and there is no failure; and we also had a talk in the morning on how to handle those failures. So of course these microservices are interacting with each other, with various sets of queues in between, message brokers and other stuff; I do not want to get into the building of those architectures. So just assume that you have 30 different services interacting with each other, and let us take an example: if there is interaction, then let us say your user is trying to access his billing; he
is accessing his billing, or a cart, or something from your application, the e-commerce site which you have built. So when he is trying to access it, there is a latency, there is a performance issue for that user. When there is a latency issue, what happens to your application? The user will leave your application; he will not try to use it anymore. He will say things like, it is pretty slow, let us move on. That is the user mentality. So there is a latency issue in your application; now what are the solutions, how do you solve this problem today? The first option: you will go and figure it out from logs. You take the logs and you check the logs, but there are more than 30 services, there are a lot of logs, and looking into the logs is obviously painful. When you are looking into many logs across different services, it is pretty difficult to correlate; sometimes it does not make sense, and most of the critical parts are not logged at all. I do not blame developers, but they do not usually log most of the things. What about metrics? You have metrics, but metrics will only tell you that something happened at a given time; there is some issue at that time, and there are 20 different services in between, so there is obviously an issue with metrics again: you will not be able to find the right set of information, you cannot debug the request which you are getting. This has been the problem when you are thinking at large scale, where each request matters and the developer is not able to debug. So the more the number of services, the more the complexity: service A talks to service B, and service B talks to service D, and it also creates another child called service E, and there is again X, Y, Z. It is pretty complicated when all these services are creating logs, all these services are interacting with each other, and there is absolutely no way to figure out what happened to your request, where it all touched. So: microservices and new problems. These problems were already there, but microservices complicate them. So what is the problem? Root cause analysis. Let's take one more example: you have a merchant who goes and updates his product information and product pricing. Once he updates that, what happens? It actually publishes that particular information to the catalog, and the catalog again publishes to Elasticsearch so that the user can search based on the price, or the product, etc. How do you figure out the touch points between different services, service A, service B, service C, service D, and how do you follow the path of that particular request? A single request, I am talking about. And again, transaction monitoring: microservices have different databases, and there are transactions on these databases, there is a commit which is happening, and the commit has to happen from the back: the last commit has to say, okay, this commit is fine, and then it goes to the previous one and says, I got the last information and I need to commit this one. So transaction monitoring of a particular request is not available in logs. And service dependency: the dependencies between different services are not available. I showed you the first slide with the dependency graph; it is pretty difficult to figure out what the dependencies between the microservices are. And of course, to improve the performance of your application, to optimize the latency, you need to know what is happening in a particular request. This is the most
important thing about microservices: in what context a request is talking to different microservices. The context is the most important thing. If I am accessing buying or selling, and there is a transaction happening, this context is to transact selling, that context is to transact buying; and this information is not available in logs unless the developer has actually coded it, and also there is no correlation between the service touch points. So to solve this problem, what happened was people came up with tracing. Log analysis we still do, we do not do away with it, and we do not do away with monitoring either, but tracing is something they brought in to manage a particular request, to see what happens in a particular request. So all these people started designing their own tracing, and Zipkin became more popular, but it had problems. All these tools were based on three core steps: they instrumented the code, they stored the data (basically they were storing what was happening in the code, what your code was doing when the request came in), and they had a UI to query the requests, what to query and what not to query. But that did not solve the problem. With microservices, again, it's not just you; you're not just talking to your own services, right? You're talking to services which are external: a BookMyShow talks to a PVR. So everyone started designing their own, but it did not solve it. And also, you can see, logging, metrics and tracing had overlapping parts, because even tracing was doing logging, even metrics was logging something, and logging anyway logs; but you had different systems and there was no interoperability. Then, obviously, when there were so many vendors involved, they were building different services, and one service was on Zipkin, one service was on something else; they were not able to share data, right? They cannot share data between them, and there is obviously no logging standard when they are not able to share data. So what happened? The CNCF got together and brought in something called OpenTracing, and this is nothing but standardizing the APIs: basically everybody can talk to each other if they implement these APIs. Okay, one of the key pieces of information OpenTracing brings in is, for a particular request, where it is coming from, where it is going, some references, and the type of request; that is nothing but the context, right? I just want to bring in some kind of context-related stuff. So you have your service, you have the API, you have designed and implemented OpenTracing, and you have shared libraries, so you are sharing data in a particular context, and not only within your own service: even if it is not your service, within your organization as well there will be a lot of different services which want to share data. So using the OpenTracing standards you can actually share the data and also define context; we will see in the demo the things we can do with context. So what is Jaeger? Jaeger is a distributed tracing tool built on the OpenTracing API standards. Okay, what it does is it will actually give you information on what your code is doing, and patterns: basically it is there to figure out in what context the interaction is happening, where there is a bottleneck, where your MySQL query is held up, where there are lock issues.
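As an aside, here is a minimal sketch of what coding against the OpenTracing API described above looks like, assuming the opentracing Python package (2.x); the operation names, tags and baggage are illustrative, not from the talk.

```python
import opentracing

# Without a concrete tracer registered this is a no-op tracer, which is
# exactly the point of the standard: code is written once against the API,
# and any compliant backend (Jaeger, Zipkin, ...) can be plugged in.
tracer = opentracing.global_tracer()

with tracer.start_active_span("get-bills") as scope:
    scope.span.set_tag("user.id", "42")                        # searchable key-value pair
    scope.span.set_baggage_item("request.context", "billing")  # propagates with the request
    with tracer.start_active_span("authorize") as child:       # implicit child-of reference
        child.span.log_kv({"event": "token.validated"})
```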
So basically it gives you a complete pattern, from one component to another component; basically you can trace the complete interaction of a particular request from service A onwards. Let's take an example here. Okay, so this is a bad picture. This is an e-commerce site, and you browse or do some stuff; it actually goes to service A, service B, service D, and then there is a commit happening in the database. When it comes to service A, it is storing the information of the transaction, and for service B, service C, all the information is stored. This is where Jaeger comes into the picture: it has a UI and a database, you can actually query, and you can actually investigate the transaction which is happening. So, some of the concepts. A trace is a collection of spans; this is a concept from the OpenTracing open standards. So a trace is a collection of spans. What is a span? Basically, you have a request; that request will create a span for a particular microservice A, that is one span; then there is a service B, there is another span; then there is service C, service F, and a span gets created per service for a particular request. And some of the properties a particular span has are: an operation name, start and end time, tags (tags are nothing but searchable key-value pairs), logs, baggage, and references. Baggage is a kind of metadata and all that stuff, and references are child-of and follows-from, etc. So let's take an example: you have a request to check your bill. The user comes and says, I want to access all the bills, and when that happens a span is created. So these are the spans, and this is the authorization microservice. Once authorization provides information to billing, how does it provide that information? It provides information via tags, and it also logs, of course. So it provides information to billing, and billing calls various other spans (it can call a number of spans), and then a script could run and information is provided to the user. But notice that it's over a period of time, and everything is collected in the time series database, and it is available via Prometheus; you can use any time series database. So let me do a live demo; I hope it works. You can see there is a command which I am running, which is starting the all-in-one. I told you that I am not going to talk about the architecture; it has a lot of architecture, like Elasticsearch, or you can use Cassandra for storing this data and all that, so you should check out the website for that. Let me do this. Also notice the command: there is a UDP port. Basically what Jaeger does is it just sends the information; the code which is packaged with the Jaeger client just sends the information but will not wait for a response from the Jaeger backend, it is fire-and-forget, right? So basically this is how it has been built: you have a web client where Redis is there, it is giving you a session ID, and there are a few sets of microservices at the back end; to do this demo there is some kind of a simulation and all that stuff. So this is something like: we are all at Rootconf and we have to get back home, we need to book a cab. So let us do this. So when you click on this... ok, I told you, pretty brave, cool. So this now posts to the Jaeger backend. Okay, so the Jaeger UI is this; I hope you guys can see this. Okay,
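A minimal sketch of wiring a service to the all-in-one backend described above, assuming the jaeger-client Python package and the jaegertracing/all-in-one Docker image; the service and span names are invented for illustration.

```python
# Backend first (UI on 16686, agent listening on UDP 6831):
#   docker run -d -p 16686:16686 -p 6831:6831/udp jaegertracing/all-in-one
import time
from jaeger_client import Config

config = Config(
    config={
        "sampler": {"type": "const", "param": 1},   # sample everything (fine for a demo)
        "local_agent": {"reporting_host": "localhost",
                        "reporting_port": 6831},    # spans leave over UDP, fire-and-forget
        "logging": True,
    },
    service_name="hotrod-frontend",                 # hypothetical service name
    validate=True,
)
tracer = config.initialize_tracer()

with tracer.start_span("dispatch-car") as span:
    span.set_tag("customer.id", "123")

time.sleep(2)      # let the UDP reporter flush before the process exits
tracer.close()
```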
some information on the Jaeger UI. This is the graph we were actually discussing when I showed you the first slide of Uber; since we have very few microservices here, it is small. We will see the directed graph: I have a front end, it is making so many calls to Redis, and it is making some calls to customer. Some failed, you know, we did not get a car at all, so some failed as well. So we go back to this. Here are all the services which are available, you can check it out, and there were 9 traces created; basically I clicked 9 times, and only one got through, the rest of them failed. So if I click this, this is how your trace looks: the front end called, it called customer, again customer, and there was some issue in MySQL; if you go there, you know, it was waiting for acquiring a lock, right? And there are some tags which you can create; it also shows you what SQL query it is running, and you can see a lot of stuff; it is also showing what the execution time is and all that. Let me go back. You can see the failures are here, right? If you see this part, you can see the failures here, and there are some requests which went through; this one went through a few of the services but it failed to provide the required information. So only one got through, which is here, right? So let me show you how it will look. This is basically the trace; you see it created a lot of spans one after the other, so you can actually figure out from here that there is Redis, something happened in Redis, there was a session timeout and all this for the particular driver, so it could not provide the driver at all. You can actually figure this out for a particular request; that is the magic of tracing, you can go to the level of a single request. With one request you can actually trace the touch points of so many microservices, like I showed you in the first slide, right? So coming back to this first one: this first one will actually show you the complete history. You can see the complete history here, what is actually happening: there is an algorithm which is running in the back end, it is actually calling the route service to figure out the route, it is seeing which driver is available and all that stuff. It is simple; I will share the repository as well, so you guys can go back and check it out. So there was only one car which was provided. Let me go back to search: you can use "driver" here, this is how you can query for a particular request, and you can actually write a lot of log-search queries here and all that stuff, and you can also see an R-back and all that. Let me stop here and go back and give you guys some references so that you can play around with it and learn a lot of stuff: the repo, architecture, framework, and concepts and terminology are available here, so you guys can go back and learn. That's me; we can discuss. I think we have a question. Okay, any questions? Hi Vivek, so if I have an existing application consisting of microservices, what would it take for me to add this distributed tracing tool into it, and what is the minimum thing that I need to add so I can enable this? Yep, I did not touch upon that because it was more developer-oriented and I wanted to keep it more ops. But if you want to do that, it's a standard: basically, first you will have to initialize your code for these libraries to be used, and then you will have to use that
standard and create those spans. But, as I said, microservices are complicated; when you try to implement this, it will take some time for you to actually implement it, because it is from the developer perspective, right? A developer has to sit and refactor his code when he is trying to retrofit tracing onto a distributed system which already exists. If you are building from the ground up, or if you are building something new, then this is pretty easy to implement. If you go and check out the repos and other stuff, the code is pretty simple: you can actually initialize and just use those libraries and start using it. Again, if you are using microservices you will have a small team as well, a small team of developers, and if it is a pretty small application, if you have made sure that the microservices are not too heavy, then you can easily implement it, because it is not going to take much code if it is very small. Can you just hold on to your questions? We have a Q&A session at 3.20, a Q&A on dealing with legacy systems, so we will be able to answer your questions then. In the interest of time we will move ahead to the lunch break. Just a few announcements: the merchandise counter is at the venue, on the left as you enter. If you have already paid for your t-shirts, make sure you collect them; if you want to buy them, you are welcome to do so. We have given you feedback forms; there were a lot of talks this morning, so please do rate the talks and give your feedback and suggestions. At 3.30 we have flash talks from the audience, so if you want to give a talk, anything open source, any tips and tricks with DevOps, we have Mehul here, he will be here after lunch; you can just give your name, your mobile number and the talk title. So we will see you back at 2.15. Thank you. Welcome back after the lunch break; I hope you all had your delicious food. Before we start the talk, an announcement: on the first floor we have the BOF and office hours going on, so if you are interested in attending those, you can move to the first floor. Coming back to the talk: now we have Kashif Radhakir. He is an eternal father figure to his child and also to the engineering teams that he manages. When he is not in front of the computers, he digs into the technology landscape and the history of the Delhi city. He loves his kebabs, so don't pick a beef with him. Let's have Kashif talk about his topic now. Hi, thanks. So I think, referencing Pukraj, I have the post-prandial talk; I think everybody is asleep, which is fine, you can continue to sleep. I just wanted to understand the audience that I am talking with: how many of you have heard of immutable infrastructure? That's about a quarter of the population. Good, that will make it easier. Cool, I will begin right away. So, you know, when this talk was originally made I worked at a company called Kayako; since then the company has gotten sold and I am moving on, but this is an experience from Kayako that we will talk about. So just quickly, to remind ourselves of the context and the problems that immutable infrastructure solves, here is a slide that talks about our typical infrastructure. We think of our servers as pets: we give them names, you can call them Naruto or Bankai or whatever, and you know those servers personally, you kind of build them up yourself, you know that on this server I have done something special. All the changes are done in place on the server itself, automation is completely optional, and testing is very, very important
because each server is different. You know, your infrastructure is susceptible to drift, because it's likely that somebody made a change on one server but forgot to, or wants to, make that change on another server later. It's difficult to revert state in our typical setups, and it's slow to recover from disaster. Those, I think, are the sort of problems that we face today when we do infrastructure, so I know a lot of you will probably relate to this. I think this GIF captures our lives so well, and I hope it will play; a bunch of you may have already seen this, the life of a DevOps guy: just when you think you've got this shit working, there comes another challenge. I think that while it's very exciting to see in a GIF, it's not at all exciting at 2 in the night when you end up, you know, solving these problems. So, quickly, what is immutable infrastructure? Immutable infrastructure is us treating our servers as cattle and not treating them as pets. You know, farmers who have cattle don't name each and every one of them, because they don't want to develop a relationship with the cattle; one day they'll be responsible for killing it for its meat or other resources, so the decisions are not hard, it makes the decisions much easier. And that's the paradigm shift that happens with immutable infrastructure: every time you want to make a change, you recreate the whole infrastructure on change; automation becomes compulsory, you can't do anything without automation; and testing becomes optional, because really what you're doing is not different from what was originally tested. There are a bunch of advantages that come with this. One, you can't drift: there is no changing of the servers manually, so there is no question of a drift arising in just one server. You have fast recovery from disaster, because it's completely automated. It's easy to revert, it's easy to scale. Your staging environment can be made to look exactly like the production environment, maybe with fewer resources. And it's obviously self-documenting, because it uses infrastructure as code; it uses code to write out infrastructure. So that's just setting the context for this. The key idea here is that you don't repair things: if a server is screwed up in how it's provisioned, or its hardware has a problem or whatever, you replace the server, you don't try to repair it. Very much like Lego: nobody's, I think, ever tried to repair a Lego; they just throw that Lego piece out and use a new Lego piece if one doesn't work. So what I'm trying to do here is give you a case study to relate this with, broadly, so that this is not just theoretical. That case study is from the company: we lived the mutable infrastructure life long enough, and then we decided to make this transition, and we did successfully make that transition. So this is just me sharing how you could do something similar if you choose to. So what is Kayako?
Kayako is a help desk; it's very much like Freshdesk or Zendesk. Kayako used to run on SoftLayer, which, you know, is not the cloud, and every time we needed a server we needed to raise a ticket. So there was usually a two or three day cycle time, and if there was nothing available on the rack that we wanted the server on, then it would likely take even longer. So hardware provisioning was done via tickets. We used Chef; I don't know why we used Chef, but we used Chef for automation. And obviously we had SSH access to make changes, because nothing was ever shipped out perfectly; there were parts of things that had to be done manually because they were not automated yet. And we had, you know, a running business, so we had different types of servers that we provisioned: workloads had become specialized, so you had different types of servers, different components that you had provisioned for. The whole thing was obviously fragile, like, I think, most mutable infrastructure is, but we'd just become used to it and we don't think of it as particularly fragile. The example that I'm trying to present to you is not a small toy example; it's a real business, real servers. Behind that is actually the image of the servers as they were in the mutable infrastructure; that's an actual image built using Cloudcraft by putting in actual server data, essentially. So it was a large scale operation. The key question we had when we set out to do this was: how do you do this with immutability, how do you build infrastructure with immutability? These are the key questions that our team had: how do we provision a machine; how do we connect various parts; how do we do releases in this world; how can we keep our costs in control; how will we find out where all the services have gone, because the services will be spawned at different places than we would have spawned them manually; and who's going to have access. These are a bunch of key questions; there are lots of questions, but I think these will set the tone for understanding the space well. So, in order to pull this off, we had to change our service provider from SoftLayer to Amazon, because immutable infrastructure requires you to do everything programmatically; you can't do things by hand, and so hardware provisioning needs an API, and you need to get some kind of response to that request. So we moved to AWS, which obviously everyone knows has APIs for all kinds of things, including hardware provisioning.
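What "hardware provisioning needs an API" looks like in practice: a sketch using boto3, AWS's Python SDK. The region and AMI ID are placeholders, not values from the talk.

```python
import boto3  # AWS SDK for Python

ec2 = boto3.client("ec2", region_name="ap-south-1")

# One API call instead of a two-to-three-day ticket cycle: this is the
# property that immutable infrastructure depends on.
resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",      # a Packer-built image would go here
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
)
print(resp["Instances"][0]["InstanceId"])
```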
Then AWS helps you, you know, to a degree, but there are some parts which it doesn't automate completely if you want to do immutability, and so we chose the HashiCorp stack. HashiCorp, for those who don't know, is a company which is helping with distributed computing and building out a lot of tools for it, and I'll talk about them in a little bit. So these are the two key technology providers that we used to implement immutability, and they might also be relevant to you if you choose to do this, so you should look them up. I'll spend two minutes on what HashiCorp is, so that everybody gets what I'm talking about later. HashiCorp has taken the infrastructure, or the operational process, and broken it down into five parts, and they have a tool for each of these parts. There's building: that's building of your images, the software that you want to deploy on your machine. There's provisioning the hardware and software that you might require. There's making sure that everything is secure. Then there's the process of actually running those servers, and making sure that they are able to talk to other servers to accomplish tasks in cohesion. So there are five different tools that they have for this: Packer for building; Terraform for provisioning hardware, largely (Terraform is very much like CloudFormation, if you're familiar with that, only much better and less painful); then there's Vault for managing all kinds of authorization and access, which even offers encryption as a service; then there's the actual orchestration or running of machines and workloads, that's Nomad (Kubernetes also does this, for example); and Consul is the service registry, which you'll use for service discovery. This is just to give you brief context; the purpose of the talk is not to teach you HashiCorp. You can learn it, with a lot of documentation out there and other more proficient people talking about it, but it's important that you know that these are the various components that we'll end up using, because I'll end up referencing them later in the talk. So, again, just to get back: these are the questions that we want to answer, and we'll pick each question up and say how you do this in an immutable infrastructure world. This is what provisioning a machine is like in a mutable infrastructure: starts out fine, never ends well, because you feel like this time I'm going to get it right, and then you realize that somebody has changed the Java or PHP version in the repo; and that happens. So how do you do it with immutable infrastructure? One, as I mentioned, we had to move to AWS, and we used Terraform to go ahead and describe our complete infrastructure, and then Terraform would use AWS APIs to provision it on the fly. So we would build this out with Terraform, somewhat like specifying JSON, declaratively, and then there's a Terraform command line tool that you would use to deploy. Software was provisioned with Packer: every release would eventually end up in a software image, which would get provisioned as an AMI on our EC2 instances. And obviously the key point to all of this is, because you want complete automation, you want to make it infrastructure as code, you can't have any manual parts; everything has to be codified, and Terraform plays a pivotal role in that process in this stack of tools. Ok, so here's how we connect various parts. Here's an example of infrastructure as code in Terraform: that example on the right is basically helping you provision a load balancer, and when you run this it'll provision an EC2 instance, a t1.micro, with a certain AMI that you specified, and set up that configuration on it. So that's like setting up a load balancer without doing anything by hand, and once this is done, everybody in your team can just run this whenever they need to provision a load balancer; they don't have to keep doing that activity through the GUI console or by hand. So Terraform describes infrastructure as code. It integrates very well with the other tools, whether it's Packer or anything else in the HashiCorp stack; it actually even works with Packer-built AMIs. It works across cloud providers, so you're not bound to use only AWS: you could use AWS, or Google, or AWS and Google, and Terraform would still work for you. It's modular, and that modular aspect is, I think, extremely powerful, and we'll spend a slide on it. And you can preview changes before you apply, so it's not a case of you fired the command
and you don't know what the outcome is going to be: you can see what changes Terraform is going to make before you actually execute them. And, obviously, very importantly, it's collaborative. Typically, in the cycle of absorbing a new DevOps guy, there's a lot he doesn't know in context; if you give him a task, he has no way to collaborate with somebody who does know, because there is no common workspace that they're working on, and lots of things are being done by hand. Because this by-hand work is eliminated, he's going to commit to the Terraform repository, and it can go through a pull request, and somebody experienced can actually see that he's doing the right things before you push that code out; otherwise you'll find that out by trying to cross the waterfall and falling down. This is, I think, the fundamental change that happens when you use immutable infrastructure: you'll actually be using Lego bricks. And see, Lego bricks are awesome, because you replace and not repair, and you can use them to craft any kind of architecture. This visual is just to help you understand that Lego can be used in any way, in case you haven't used Lego in many ways, but that's the advantage that you have with Terraform as well, and with this whole stack, actually: you can use them to build any kind of architecture; there's no kind of architecture which is not open to Terraform. An example of that modularization is this: if you go on their site, they have a registry where people can submit modules that they have made, and there's a quality process around that submission, a review process, but nonetheless, here's a module in Terraform for Kubernetes. If you use this (if you see the section on the right, this section here), this is all you need to do to add it to your Terraform code base and get it to provision a Kubernetes infrastructure. If you told someone to do it, especially if they've never used Kubernetes before, even if they've used Kubernetes before, there's a very big chance that the provisioning will take hours and you will get it wrong. And here they've solved the problem by bringing it down to 5 variables. Can you see that? 5 required variables, that's all you do: insert the snippet of code, and that is all it is. And at the end of this, when you run this, 10 or 15 minutes later, on your infrastructure there will be a Kubernetes cluster up and running, and that's extremely powerful. And the thing is, if everyone in this audience does that, it will show up in exactly the same way; that standardization is extremely powerful as well. And so there are modules for everything; they've truly been able to deliver the Lego-like promise. There are modules for AWS, there are modules for Google Cloud Platform, there are modules for Azure, a couple of modules for Oracle, the Alibaba Cloud, and even GitHub. So now you can pick these modules; you don't have to reinvent the wheel each time around. This was only possible earlier for software engineers; it was extremely difficult to do for DevOps or SRE folks, and some of this was solved by tools like Chef and Puppet and all, but they didn't really get the Lego part right, and these guys kind of allow you to do that. This is how one typically releases code: with a prayer to the almighty, who seems to be looking away anyway nowadays. Yeah, this I just wanted to put in because I thought it explains blue-green deployment better than anything on the planet, with a little bit of a sense of humor. We'll talk about why blue-green deployment is important, but that's blue-green: one is blue and
the other is green. That's an existing infrastructure, and he's replacing it with a new infrastructure, and that's what we're going to do to release code. So we're talking about releasing code, and how we release code is with blue-green deployment. So how does that process work? I'll quickly tell you what blue-green deployment is; there might be a few people who haven't heard about it. Blue-green deployment is this idea that you have an existing infrastructure; we'll code-name it blue. Whenever you want to make a change, any kind of change, you don't touch that; you make a new version of it, in both hardware and software, and call it green, and then you just switch the traffic from blue to green. That's a blue-green deployment. And these are the steps we used to do it with the tools that we used. So what would happen is, every time somebody committed, a bunch of good commits would go out as a release. Whenever a release was built, GitHub would hit our Packer setup, and that would build out an AMI. That AMI had everything that the software needed to run. For example, we had an app server; this was a PHP-based organization, so every time there was a release we needed PHP to be installed as well. That AMI had PHP, it had all the libraries that we needed, any other components, and it had the latest piece of code. So when you did the release, say 15, 20, 30 minutes later, an AMI would become available in the AWS repository. Terraform would then pick up that AMI, essentially, and build the green cluster out; the blue already exists, and it would build the green cluster with the new code. Then two other components would come into play, Consul and Nomad, and they'd basically be used to switch traffic from the blue cluster to the green cluster. And it doesn't need to be a 100% switch; you can choose to switch some weight, that's fine, that's supported. You can also choose to just do a canary; the rest of your system needs to support that, your software needs to support that, but you can do that as well, it's supported here. So Consul and Nomad, which I mentioned as part of that toolkit (Consul is used for service discovery and Nomad is used for orchestration of running resources), would together manage that. Then, once you felt it was fine, you would tell Terraform, via an event or however else you would like (there are many ways to do this), to take down the green infrastructure, sorry, the blue infrastructure. So now your green becomes your new blue, and your new release is out. Also, rollback is quite easy, because if you decide, while you're releasing, that there's something wrong with green, you can just switch the traffic back to blue and be back up and running. So that's basically the process we followed for release. The next big question that we had is: how do we keep costs in control? Because now anybody can just spin stuff up automatically, and he or she may not realize what they're creating. So there's no easy answer to that: if you automatically choose to create a hundred servers, they'll get created. But there's another aspect of costs, which is: how do you know you're using those resources well? And that, Nomad helps you accomplish. Nomad will help you optimize workload for utilization and efficiency, so you can say, you know, only spawn new boxes when the CPU is over x percent, otherwise try to use the existing resources, don't spawn new boxes. It's also easily possible to create infrastructure only on demand.
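A hypothetical sketch of the blue-green release flow just described, driving the Packer and Terraform CLIs from Python. The variable names and the template file are assumptions, not Kayako's actual setup; the shape of the pipeline is what matters.

```python
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)   # abort the release if any step fails

def release(version: str) -> None:
    # 1. Bake an AMI containing the code plus PHP and libraries for this release.
    run("packer", "build", "-var", f"app_version={version}", "app.json")
    # 2. Preview, then stand up the green cluster next to the running blue one.
    run("terraform", "plan", "-var", f"cluster=green-{version}")
    run("terraform", "apply", "-auto-approve", "-var", f"cluster=green-{version}")
    # 3. ...Consul health checks pass, Nomad shifts traffic over to green...
    # 4. Once green is serving happily, tear blue down; rollback before this
    #    point is just switching traffic back to blue.
    run("terraform", "destroy", "-auto-approve", "-var", "cluster=blue")
```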
Creating infrastructure only on demand is another way to manage costs. So what happens is, a lot of businesses have very cyclical volume: in the day they'll receive a lot of volume, and in the night that volume will dip significantly, but we don't change the number of EC2 instances we have running in response to this; we're always paying a large fat amount, sized for the day. You can actually bring down 30-40% of your cost, maybe more, if you just start following the cycle: in the evening, after a certain time, you can just set up an event which will trigger a new deployment with fewer servers, or with different-sized servers, and you'll save cost. And obviously, you know, Docker has helped improve utilization and reduce cost, and this works with Docker, so you can use that as well. So, in the earlier world, when we provisioned a service we knew what IP addresses it was available at, and we would then tell the developers that, somehow, usually through Confluence or something else, and then they would read and write to that IP. Now what's happened is that service discovery has become essential for us in the immutable world, because the IPs will change: it won't be the same IP every time you deploy something, it'll show up somewhere else. So Consul, from the HashiCorp toolkit, is the service discovery, or service registry, essentially, and it supports both DNS and HTTP. So instead of binding to IPs, you could bind to DNS records, which you could do in the mutable world as well. It has the idea of service health, so you are able to check whether a service is healthy or not before you switch traffic to it whenever you deploy: you can write custom tests, and if those tests turn green, only then will Terraform think that, oh, this service is healthy and I should move traffic to it. It also provides a key-value store, the primary use case of which is configuration, runtime configuration, and you can also use it for orchestration, because you can put watchers on it and locks on those keys and use that to coordinate all your moving parts. So you can even store configuration. We moved our configuration from disk. We had per-tenant configuration; you know, Kayako is used by businesses, and each tenant has their own configuration, their own API limits, the number of users they are allowed, etc., and so we moved that from disk to Consul. And Consul is based on Raft, which is a distributed consensus technology, essentially, that will allow you to make this, you know, not central but fully distributed. Our configuration was available all the time: even if a partition in the network occurred, both the partitions would have a copy of the configuration and be able to serve whatever traffic they were receiving, if they were receiving any. And it will allow you to do some advanced networking as well; not critical. The next question, obviously, we had is: what about access, right? So, the toolchain access: the right people to deploy stuff. Our SRE team obviously will have access to that, and they will have their own pull-request-based mechanism through which they will be able to release services, which will cause services to change. But the big thing that we were able to achieve is that we live with zero SSH access. And what that means is that nobody can get in and run an accidental command from history which will screw things up; nobody can leave a screen running, or not run something in the background. All of those errors disappeared completely, because just nobody has access.
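Going back to the configuration store for a moment, here is a minimal sketch of the per-tenant configuration in Consul's key-value store described above, assuming the python-consul package; the key layout and values are illustrative, not Kayako's.

```python
import consul  # python-consul; assumes a local Consul agent on the default port

c = consul.Consul(host="127.0.0.1", port=8500)

# Per-tenant runtime configuration lives in the KV store instead of on disk,
# so every freshly baked instance reads the same source of truth.
c.kv.put("tenants/acme/api_rate_limit", "1000")

index, entry = c.kv.get("tenants/acme/api_rate_limit")
print(entry["Value"].decode())   # -> '1000'
```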
And each time a new machine is deployed, you know, if you wanted to maintain SSH access you would have to make sure you gave permissions there as well; but all that need is gone. And then, obviously, we used Vault to keep all our secrets and deliver them to the right applications at the right time; Vault is another tool by HashiCorp that helps you do exactly that. Okay, so that's it; it's a short talk, and that's what we used to achieve our immutable infrastructure. I wanted to keep a large portion of the talk open for questions, and we have the time to do that. Just to quickly summarize it for you: to do immutable infrastructure, we built images using Packer; Terraform picked up these images, along with GitHub post-commit hooks, and deployed them; we managed access through Vault; we had Nomad worry about whether the resources are being used well; and we had Consul tell the application layer about configuration as well as services, where they exist and, if they have moved, where they have moved to. And all of this is possible only if you're using a cloud provider, which I think has now become the default for everybody; you can't do this with SoftLayer. And we are as happy as that baby is; things have been working well. That's it; open for questions, and you can always reach out to me in case you think of a question later that I could answer. Hi, you talked about blue-green deployments; one of the problems with blue-green deployments is when you have, say, 50 servers and 50 more are coming in... I can't hear you, this volume is a little too low for me. Hello, can you hear me now? Yeah, partially. Yeah, so with blue-green deployments you face a problem where you have 50 servers and then you have 50 more servers that come in; all of them, let's say, are connecting to a database, and you have too many connections on the database, so your database goes down. That sort of problem has come up, so how do you suggest we solve these problems? Yeah, so that's a very real problem, right? The first time you do it, that's the first iteration of what you do: you take your database server down. The second iteration, obviously, is you move traffic slowly. So the key thing to understand here is that traffic is actually not changing; it's not like your load is doubling; anything that is not going to blue is now going to green. So as long as your application is not maintaining too many connections when the load goes down, which is usually not the default behavior, you will not have that problem. So the way we solve it is essentially to do it piecemeal: you can move 25% of your traffic, let this establish its connection pool, so on and so forth, and then move the rest of the traffic. And the threat is only because of your connection pools: now there are two sets of connections. Hi, a couple of questions here: one, where do you store your state, first point; and second point, what are your backup and recovery strategies on that? Sorry, I couldn't hear the second part. Backup and recovery strategy: let's say, for example, you store your state in Consul or S3 or somewhere locally, and let's say you lost the data; then your infrastructure is nowhere, so how exactly have you implemented the backup and restore strategies? Okay, so, yes, sure. By state store I'll assume something; I don't know if you mean something different. You mean application state, Terraform state? Okay. So Terraform is largely stateless; what it does is
it will query and figure out what the current setup is, and then it will match that against the new setup and tell you what differences there are going to be. Okay, so if you go and change something manually, that might pose a problem when you're doing this, because Terraform will read that and maybe just delete that resource, because it's not in its plan, so on and so forth. So Terraform itself is stateless. In terms of other pieces of state: Packer will make an AMI; AMIs you can version and keep, Docker images you can version and keep; they basically end up on S3 buckets, which is where they end up, so your state ends up there. You're not asking about application state, but application state is obviously the database, and is unaffected by all of this. If you have a follow-up question, please catch me offline; there are other questions. Sure, maybe I didn't answer it well enough. Hi, so I have two questions. I have to guess where you are; how can I find you? Because the voice is coming from there and you're not there. Ah, the hand. So I have two questions: one on the blue-green deployment, one on the slide where you talked about cost effectiveness. So, one: while doing a blue-green deployment, the IPs change and get added to one of the DNS endpoints, so how do you handle or manage the TTL issue on the client side? Right, it's possible a few of the legacy clients, let's say Java 1.7, or other clients, might cache the IPs of the infrastructure, let's say the blue infrastructure, and when they change, they might still refer to the old IPs; in that case they might be having issues. That's one question, on the DNS. The second is: we talked about Nomad, which we can use for cost effectiveness; so the question is, why Nomad, why not use native functionality of the cloud solution, like auto scaling groups? Sorry, why not use what? Auto scaling groups of, let's say, AWS. Right. Cool, so I'll just say your questions back, in case others didn't hear, and just to make sure I've heard the right thing. So the first is: how do you solve the problem of services, or other clients, caching IPs that they're using? So that's why you have DNS available on Consul: you don't have to resort to IPs, they continue to use DNS, and you put a short TTL on that DNS, so when you switch the traffic, shortly thereafter they will query again and receive those changes, right? And you have to sometimes change your application services to do that: developers did not write them thinking you were going to do immutable infrastructure; they may have bound to IPs; they need to make a change. The second question was: why use Nomad and not auto scaling? You could use auto scaling, and you could have asked me the question why not use Nomad; there is no one good reason. One, it works very well with Consul and the other toolkits, right, so it makes it easier for me to do things. Two, it has richer control: it has three kinds of jobs it runs, service, system and batch jobs, which auto scaling does not give me, that fidelity, and auto scaling will obviously not orchestrate my stuff; auto scaling will only add hardware, which is one aspect of Nomad, it will not orchestrate it. And third, I need blue-green to work, so I need events to be generated (and they are sent out by, say, Consul using a service health check) and I want to react to that, and auto scaling is not by default set up like that. You could use CloudWatch and write a lot of your own code and achieve that in the end, but
it was just easier to use. So the short answer is: you could use auto scaling with a little bit of work, or you can just use Nomad; if you're using the rest of the HashiCorp stack, it fits very well. Obviously auto scaling does not do orchestration. I have a question here: what was the storage back end that you used for Vault; was it Consul or was it the file system? We used S3, S3 as a storage back end for Vault, for all the secrets. And my second question was: what were the other use cases that you found for Vault besides secret storage and encryption as a service? So we only used it for that; we didn't use it for encryption as a service, it was on the roadmap but it didn't play out. We only used it for shipping authentication secrets to the applications that need them to perform their tasks well. But you can use it for other things; I mean, there are use cases, I'm not an expert on that. Hi, yeah, so my question is: was Terraform your go-to choice in the first place, or did you look at other options before going to Terraform? I've been burnt, prior to this experience, with CloudFormation, and if you think CloudFormation is good for you, do it, and then you'll realize it isn't. It's very verbose, very fragile, and doesn't offer any of the guarantees, collaboration or planning that Terraform has built in. For example, Terraform will very clearly, in a consumable way, tell you what it's going to change before it changes it, and then you can approve that; they've built it with all of that in mind. Secondly, they've used a toned-down version of JSON for their markup, which is much more usable than JSON. You know, when JSON came out, everybody said, see, much better than XML, human friendly; but JSON is not human friendly at all, and then YAML and all of these other things come out which try to bridge that gap. So Terraform uses something that is human friendly, and that makes the job easier, because you have to live in that code all the time. Excuse me, yeah, like you just answered why you chose Nomad: was the reason only the HashiCorp tools you were using, or, I mean, why didn't you use Kubernetes? Yeah, so Kubernetes is cool because it's from Google, but it doesn't follow a philosophy that I personally like, which is the UNIX philosophy of one tool doing one thing. Terraform does that; Kubernetes is like Emacs: you know, you can get married on Kubernetes just like you could do on Emacs. That's my inherent dislike for it. But despite that, it's a very competent tool (like Emacs also is, right?) and people are able to use it very well; you're free to use that. This is being made by an organization which is trying to get customers, so their documentation and support systems are excellent; they have professional plans available; these kinds of things matter immensely when you run it as a business. Unfortunately, we on the DevOps side do not think of that; we think of whatever is the coolest thing that somebody is blogging about, and Kubernetes usually shows up much higher on that rating, right? And so people use it, and I believe people have found it competent; the only thing I've heard is that you have to do too much work to achieve too little. I would still prefer using this over Kubernetes. Hello sir... hello sir... whose mic is actually active? The two guys competing; this is a race condition. Sorry, somebody was going to win it. I think I'll start. So this is the first time I'm hearing about immutable infrastructure, and that's really cool to know. I'm just trying
Hello sir. Hello sir. Whose mic is actually active? The two guys are competing; this is a race condition. Sorry, somebody was going to win it. I think I'll start.
So, this is the first time I'm hearing about immutable infrastructure, and that's really cool to know. I'm just trying to convince myself: since I've already trained my developers to use Docker images, why do I need to train myself or them to learn a new layer on top of it? They already know, this is the Java version I want, this is the Ruby version I want; they can write it down in their Dockerfiles, and they've learned and trained themselves, right? Why do we need this extra layer on top of it?
I'll answer that in two parts. So I think this is how your developer feels, and this is how you feel, and you're discounting that. Don't discount it; this will end badly. That's one reason. The second is: you're not still using Java 5. You moved on, because you found technologies change and they offer new possibilities. So those are two really basic reasons why you should do this: it's a new way to do stuff which solves some problems, and you should do that. Now, as far as retraining your developers goes: today they have to remember what knife is in Chef, which has no semantic meaning to anything. Knife means nothing; you can either use it to cut bread or kill somebody. It's some of the worst semantic naming I've ever come across. You've taught them all that, and they've tried to remember all of it when they didn't need to. So when you train them again, you can drop all of these, what I find rather ridiculous, concepts and do something much more simply. Finally, it also offers a career path for your developers. All of your developers are now able to code, or you hire people who are able to code, but they're now able to grow, to use code to express what they're doing. Their default way of doing stuff, our default way of doing stuff, is now to write code to do it, and not to do anything manually, because we can't do it manually; if we do, the whole thing will come down. So I think there are some very progressive reasons, and then you have to acknowledge this pain that you're so used to you don't acknowledge it: what you're doing today is painful, right? That's always a good reason to change.
Sorry, the other thread. So I have a couple of questions. Let's say when you are running this Consul...
Maybe only one; afterwards you can ask him.
We have a lot of time; my counter says 6 minutes, why not, I have not even run out of my time.
Sure, sure. So when you are running these Consul members and all in multiple DCs, spanning across multiple regions, how do you take care of it when, let's say, one of the members goes down? That was my first concern. And the second was: if you are spanning across multiple DCs and multiple regions in AWS, how do you take care of the integration of Nomad clustering along with Consul across multiple regions?
Those are very good questions. So, he has used Consul but he hasn't studied all the documentation; this is a very good question, actually. There are things available that allow you to replicate stuff. Let me just put his question in context: he's saying, when you run this multi-region, how do you deploy Consul? Because the latency between regions can be very high, and because it's over the internet there will be a lot more partitions, so the Consul quorum will keep re-electing masters; it becomes a bit of a problem. That's the technical version of what he asked. The way you do that is you choose an architecture where you have separate quorums for each data center, which is also what the documentation advises you to do. That leaves you with the problem of coordinated state, or shared state: it's the same company, so you have a tenant in one region and another tenant in another region, and you still need to know that somewhere. So it's very possible: Consul allows you to replicate the key-value store from one data center to another, and at the receiving data center it can be read-only or read-write. Whenever there's a change here, it gets propagated quickly over the WAN protocol to the other data center, where it gets used. So the short answer is: think about this, it is indeed a problem, and architect your solution to have multiple quorums, one per region. I think there are lots of other aspects that you will come up with, but I can answer them offline.
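To make the replication answer concrete: Consul's HTTP API takes a dc parameter, and the local agent forwards the query over the WAN pool to the named datacenter. A small sketch, assuming the requests library; the agent address, key, and datacenter names are hypothetical.

```python
import base64
import requests

AGENT = "http://127.0.0.1:8500"

def kv_get(key, dc):
    # The ?dc= parameter asks the local agent to forward the query to the
    # named datacenter over the WAN gossip pool.
    resp = requests.get(f"{AGENT}/v1/kv/{key}", params={"dc": dc})
    resp.raise_for_status()                    # 404 if the key is absent
    entry = resp.json()[0]                     # KV reads return a list
    return base64.b64decode(entry["Value"])    # values are base64-encoded

# Read the same (replicated) key from two datacenters.
print(kv_get("config/feature-flag", dc="dc-primary"))
print(kv_get("config/feature-flag", dc="dc-secondary"))
```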
We have had enough questions, and we have run out of time for this session. I'm sure there are more; you can continue afterwards. We have another joint Q&A session, or you can meet him afterwards. We'll go on with the next talk. Before that, we still have two more BOFs going on upstairs: one is on whether we are concentrating on DevOps, and the second is on how you plan your container strategy for your organization. Both BOFs start at 3:10 on the first floor, so those interested can move there in 10 minutes. So we'll start with the next talk. The next talk here is by Pooja. Pooja is an open source enthusiast and an automation nerd with a DevOps mindset, trying to bridge the gap between the various teams. She blogs on open source, creates YouTube videos and tutorials, and gives talks at conferences. She's also rumored to be an MC at Rootconf. She and her colleagues have hired an intern called Alice, a talking bot, so she is free to spend time playing Pokémon with her friends and family and let the bot do all the work for her. She'll be speaking a little about the bot and how she uses it to get more work done. So let's listen to her.
Relate this image to our continuous integration pipeline. I'm assuming every one of us uses a continuous integration pipeline in some way or the other. Can we relate this? Any guesses? No guesses? When I asked this of my friend, she said, it's your life. No, let's relate it to continuous integration. Do we see now what is missing? That is our sorry state of test cases before a release: we know there is a bug, still we release; or we know we don't have enough test coverage, still we release. It's one less tire, but the truck still runs, and it's fun, for now. But what happens when you remove one more tire? Is it still going to be fun? No. 3 AM calls, right? So that's what this talk is about today: we are going to understand how, at my workplace, we worked on bringing in continuous integration and making test coverage the minimum criterion for code to be pushed to the next stage of the pipeline. Before we move on, a brief about me, which is not brief, so I'll just sum it up. I consider myself an explorer; I have explored a wide range of things in professional and personal life. I've been a developer, tester and automation engineer, and I try to learn and build things that fill the gaps between all of these teams, which helps us build and ship a healthy product. Personally, I like to share what I know and learn in return, so I blog and record videos about these things. I'm an automation nerd as well as an open source enthusiast, so whichever project I use for my work, I make sure to contribute back to it in some way.
My latest contributions can be found in Jenkins and automation-related plugins. I'm a big-time foodie and sleep lover, but I do yoga so that the Hulk inside me doesn't come out; the same secret I told Dr. Banner, which is why you could not see the Hulk come out. Alright, so that's about me. In today's talk we are going to start with why I am talking about this topic: the pain points we face, which I suspect you also face, and the things we did to solve them. That comes in the "what" section: what it is and how it works. Then the "how" is about you: how you can also integrate this in your systems. And Q&A at the end. Sounds fair? Yes? It's the after-lunch talk, yes. Alright, so here we go. This is simple data, another pie chart, a bar graph, but I'm not going to bore you; I'm going to talk about a very interesting thing here. This chart represents the code commits in our system; I pulled it straight from our code base. Back then, starting off, we had a monolithic repository where we were adding everything in one place. Sounds familiar? Everybody starts with that, because it gives us the speed and pace to add code faster and serve our customers faster; the start-up story. So that's what we had. But do you see any pattern in this graph? The green bar shows the lines added, red shows the lines deleted, orange shows the team size at the time, and then you have year on year. Do we see any correlation or pattern here? We have grown exponentially: we were 15, and now we are 50 code contributors. Any pattern you see? The code is growing fast, a lot of features are shipping fast, but there is a reverse pattern: the additions are getting smaller. What can be the reason? This is the interesting part. One reason is that, since we understood the monolith was not going to help us serve customers faster, we started segregating our services into different pieces. That's one reason you see fewer lines added to the monolithic code base. But at the same time the deletions are approaching the number of lines added. What is that? We are adding a lot of features, we are serving customer needs very fast, but the deletions keep growing. The reason is that we are re-architecting as well: we are bringing in better coding practices and a better services architecture, so that we can solve the same piece faster. And there is one more point: we already have the code there, what you would call legacy code. When you add a small feature, the addition is small, because the code is already there. So now the problem started. The code was transforming, deletions here, additions there, and a point came where even a single line a developer has to add becomes a big pain point, because a single line can break a service at any point, and it has to be tested a lot. I'll give you a summary of how it goes. My tech lead comes to me, let's say I'm a core developer, and asks, what is the timeline? Practically, I think it takes only one line to change, but in reality I see that if I change this one line, I have to test it against ten more features; it means all ten features have to be tested. I am afraid to give a timeline, and my tech lead asks, why does the code frighten you so much?
And I'm like: I'd be a fool if this big, giant code did not frighten me. One mistake and I'm dead; why risk it? So this is a common situation when you have a code base that is not test-covered, like our monolithic code: we did not have enough test coverage, you could say zero coverage, and when we actually wanted to change one line, we had to test in a lot of places. That's one thing; the second is that even writing test cases for such code is a difficult choice. I'll come back to that point in the next slides. So we wanted to solve this problem, and we were brainstorming what to do. One ideal approach is, okay, you go and write end-to-end tests. But are they sufficient? Are they good enough? Can that approach scale? That's where we went to understand the best practices. We have all seen this slide, the test pyramid, at a lot of conferences; people talk about it, everybody sees it, everybody goes and pitches it in their company. And does it work, just like that? Any successful example where you just said "we should do it" and it happened? No; the smiling faces say it all. So that's what happened to us too. We wanted to achieve unit test coverage more than the other layers, because unit tests are less expensive, they're faster, the developer gets feedback faster, and the entire quality automation works more smoothly than with only end-to-end test cases. So we decided to do this. And what we did was, on instinct, we just told everybody: okay, from now on, whatever code we add, we write unit test cases for it and have coverage for it. That's the approach we took. And then real life happens: good intentions work only when you build a mechanism to make them happen. After some time we realized that just saying it doesn't work; we still did not have enough test coverage. The developers had started writing test cases: some wrote, some did not, and even those who wrote did not produce enough coverage. So we had to dig into why there was not enough coverage. There can be multiple reasons; maybe the person doesn't even notice, she is deep in development and is not thinking about test cases at that moment. So those are the kinds of challenges, and we wanted to mitigate this problem. So, like the continuous integration pipeline itself, we thought, okay, let's automate this too. To give some context: we use GitHub for our version control, we create pull requests, and we have a pipeline structure where each stage must pass before things move ahead. It looks like this. We named it Shield; Shield's task is to check everything and then let things move ahead. It goes like this: the developer creates a pull request (the first column) and it goes into the pending state (the second column). If any of the checks fails, the pull request is marked as blocked. We have checks like syntax validation and unit test validation, where either a test failure or a coverage failure counts, and we have a benchmark: if the coverage is below the bar, it should not go through. This is how it looks in real life: you create the pull request, it goes into waiting, and, if you see here, the coverage is missing, which is why the pull request gets blocked. So this is how we thought: yes, now we can have coverage. The pipeline is set, it works automatically; you create a pull request, the checks run behind the scenes and keep reporting back to GitHub. And we found, okay, this setup works completely, it's awesome, and we can go ahead. And that's where reality hits again.
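As an aside, the blocking behaviour described here maps naturally onto GitHub's commit status API: a CI check posts a status against the head commit, and a branch protection rule is what actually holds the pull request in the blocked state. A minimal sketch of the reporting side, with a hypothetical repo, token, and context name:

```python
import requests

def report_status(sha, state, description):
    # state is one of: pending, success, failure, error
    url = f"https://api.github.com/repos/acme/monolith/statuses/{sha}"
    resp = requests.post(
        url,
        headers={"Authorization": "token <personal-access-token>"},
        json={
            "state": state,
            "context": "shield/diff-coverage",
            "description": description,
        },
    )
    resp.raise_for_status()

# The check flips the PR to pending while it runs, then success/failure.
report_status("abc123", "pending", "running unit tests and coverage")
```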
One more challenge: can anybody guess what that challenge can be? I'll give you a hint: we talked about legacy code initially, and we are talking about adding test coverage for new code only. So what happens is this. A developer adds test coverage for the new code; let's say I have a pull request with two lines of change, and imagine there is already existing code of 100 lines. You added two more lines; what will your test coverage say? Even if you wrote test cases for both of them, can it say 100%? No, because coverage is counted against the existing code base. So we have a legacy code base for which test cases are not there, and we have new code for which test cases are there, but the pull request still goes into the blocked state. All the pull requests get blocked, and again we hit the same point: either we bypass the system we created, and if that happens we lose the whole point of adding test coverage. So we wanted to solve this again: how can we improve the Shield process so that we encourage developers to write test cases for newly added code? Why newly added code only? Because, as we saw in the first slide, our code is transforming fast: we are removing a lot of code, we are re-architecting a lot of code, which means every piece, every feature becomes new someday. So if we encourage people to add tests only for new code, rather than fighting over "this is legacy code, this is not my code, why should I write tests for it", we can concentrate on one thing: write tests for what you add. And that's exactly what we did. We saw that we run checks and we have a test coverage report, so we added one simple tweak; the question itself is the answer. The question is: I want to write test cases only for what I added to the system, and your system should detect that and tell me if I miss only that portion. So that's what we do: before running the checks we save the code diff, and at the end we parse that diff and the coverage report, find the overlapping lines, and generate another report which talks only about the missing coverage in your added code. It looks like this; it's not beautiful, it's just plain HTML behind that check, where the missing coverage shown is only in your added code. This gives us the benefit that developers now feel encouraged and feel more ownership, because they added that code and they should be writing tests for it. And this approach works for us because we know that at some point all the features will turn into new code, which means new code will keep coming in, and eventually, at a certain interval, we will have everything covered. We have been using the system for almost a year now, and with this approach we went from roughly 20% coverage to 56%, and eventually, as services get segregated and code gets re-architected, we will have coverage for the complete code base. That is all I have on the "what". On the "how", how you can build this if you want to: we have open APIs for everything, we just need to connect the dots: GitHub APIs, and Jenkins APIs for running the checks behind the scenes. And we have one more system called Alice, which we created last year and open-sourced as well, which helps us automate the talking between these APIs, so Shield does only the parsing task and we do not need to build everything ourselves.
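A condensed sketch of the diff-coverage idea just described: collect the lines added in the pull request from git diff -U0, then report only those added lines the coverage run never executed. The covered mapping would come from your coverage tool's report; here it is assumed to be a {filename: set of covered line numbers} dict, and the base branch name is hypothetical.

```python
import re
import subprocess
from collections import defaultdict

# Hunk headers look like "@@ -10,2 +11,3 @@"; we want the "+start,count" part.
HUNK = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")

def added_lines(base="origin/master"):
    """Map each changed file to the set of line numbers added in the diff."""
    diff = subprocess.check_output(["git", "diff", "-U0", base, "--"], text=True)
    added, current = defaultdict(set), None
    for line in diff.splitlines():
        if line.startswith("+++ b/"):
            current = line[6:]          # path of the file on the new side
        else:
            m = HUNK.match(line)
            if m and current:
                start = int(m.group(1))
                count = int(m.group(2) or "1")
                added[current].update(range(start, start + count))
    return added

def missing_coverage(covered):
    """Return only the added lines that the tests did not execute."""
    return {
        path: sorted(lines - covered.get(path, set()))
        for path, lines in added_lines().items()
        if lines - covered.get(path, set())
    }
```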
So the continuous integration system could be built on that tool which we had already written last year. And there are some credits. I always give credit to the problems first, because problems are why we solve anything. The idea of focusing only on the diff coverage originated with multiple people in my company, so I give credit to them; then the GitHub APIs, Jenkins APIs, and photo credits. That's all. So this is what I want to end with: 100% coverage is a myth, but at least 60% is better than nothing. That's all, thank you. I'm open for questions and answers: any points you want to raise, whether you want to try it out, or if you also have this problem in a different context, we can have a discussion on that.
I can pretty much relate to what you are saying; we are in a similar phase right now. But I want to hear your perspective in terms of functional test coverage: what is the coverage at your company, and how is it helping to have 100% functional test coverage?
So for us the problem set is different. When you say functional test coverage, is it developer-functional or end-to-end? We are a SaaS company, and the endpoints are not in our context; they belong to the customers, basically, so we cannot hit the complete endpoint. That is why 100% end-to-end is not feasible, and that is why we have broken it down into different pieces. This is a big problem, but since this problem exists, we are pushed to follow the good practices; otherwise we would have ended up writing end-to-end tests again and then struggling to maintain them. So we are focusing piece by piece on individual services: unit tests should go broader, and then we segregate tests into different chunks, so this goes into the API test context and this goes into end-to-end automation. End-to-end automation we started and we stopped; we are going to start it again, but the complete end we will not touch, because that's not in our context. Am I making sense? We can have further discussion on this. Any more questions?
If you don't have one, I have a question: how many of you have more than 80% coverage? Wow, I should meet you; what's your name, how are you? Is that unit test coverage or end-to-end?
We are planning to start writing end-to-end test cases.
I don't have questions on that, but you said you started it and stopped doing it, so that could be an interesting story. Why did you stop? Was it creating more hurdles than it solved problems?
Yes, yes. For that I need to give a brief about what we do, so this is probably an offline thing; we should get in touch. That is all, thank you.
So we'll have a joint Q&A session with Pooja and Kashif, and we'll also have Vivek Shreeder joining us, so three of them. Before we start the Q&A, the social media announcement: the hashtag for the event is #rootconf on Twitter and Instagram; the Twitter handle is @rootconf, and HasGeek is on Instagram at @hasgeek. You can use the hashtag on Twitter and Instagram, and you can also tag us on both. So, yeah, we'll start with the Q&A. Kashif, if you are in the room, please come over. All the talk videos will be available on hasgeek.tv, so you can watch them after the conference. Any questions? We'll start with the Q&A and get Kashif to join us soon. Find your nearest mic runner. How many of you here are building applications using microservices or distributed systems? If there are no questions, we would like to hear from you: your stories of what you are doing, how you monitor it, and what issues you are facing.
Yeah, the first challenge is, like you said, tracing, which makes debugging a real nightmare most of the time, even with the tracing mechanisms we have in place as of now. Although the services interact with each other smoothly, the problem comes with the database. The trace IDs that we have, the kind of keys we had seen in Jaeger, I don't have the luxury of attaching at that level; for example, if I'm doing an SQL query, I don't have the luxury of adding an arbitrary ID or unique identifier in the MySQL database. So our tracing ends at the service level, and when we have to go down to the database, we fall back to manual correlation. That is one major problem. And with respect to monitoring and alerting in microservices, we have a lot to monitor, and we monitor everything; nowadays most tools by default give enough metrics and enough thresholds. The problem is we don't know what to alert upon, and it's challenging to decide: should I wake up at 3 a.m. for this alert, or can it wait until 9 a.m.?
So that is one major challenge when it comes to microservices for us. I would like to hear how people are solving the relation between database and service when tracing or building observability; that's my main question.
Yeah, I mean, that's a classic example, right: the tracing ends at the service level, and database debugging is a different story altogether. I would say just check out Debezium, see whether that solves your problem. I am not sure whether it would or not, but I strongly believe you should go and check it out.
Can I ask you a question, just to clarify? You said you are using microservices, and those microservices have a shared database?
Some of them, yes.
Okay, then you shouldn't be using microservices, and that really is your answer. If you are not able to separate your databases, then those services are not independent: they depend on each other, coupled at the database, not at the service level. So this is actually service-oriented architecture.
No, I'm sorry, I didn't follow you. Every microservice has an independent database; we don't share a database.
You should not share a database.
No, they don't have shared databases.
Then what is the question that you have?
So imagine there is a data-layer service which is retrieving data from Elasticsearch or MySQL, maybe, and it is maybe a Kafka consumer or something like that. Now, when we are debugging that a certain message got lost somewhere in transit, we are able to find from the application logs that, okay, the request has been sent to the database. But what happened in the database, at the database layer, those logs we have to correlate by timestamps and by our query language.
But the client making the query can log the request ID for you when it makes the query, and tell you what the results were, correlated with that.
So the application logs will have my query ID, which is correct, and that can go all the way down through the services, definitely. But in the database, if the query failed for any reason, especially in MongoDB, we get the error IDs, and that's where we have to resort to manual correlation to find out that this database error happened when this particular query was made. The thing is, when the query is running, there is a lot of database interaction happening, so there is a failure at some level, and when something fails, you have to see which database query actually failed.
Of course, yeah. So basically, that problem might be solved with Debezium; that's the hint I'm giving you, but I am also not sure, so I just want you to go back, try it out, and probably publish it for us.
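One hedged suggestion the panel did not mention: a common workaround for tracing that stops at the service level is to tag every SQL statement with the request's trace ID in a comment, so the ID surfaces in slow query logs, error logs, and SHOW PROCESSLIST and can be correlated without guessing by timestamps. This is the idea behind tools like sqlcommenter; the sketch below is hand-rolled, and the cursor and trace-ID plumbing are hypothetical.

```python
def with_trace(sql, trace_id):
    # Comments survive into the server's logs in most MySQL/Postgres setups.
    return f"/* trace_id={trace_id} */ {sql}"

def run_query(cursor, sql, params, trace_id):
    cursor.execute(with_trace(sql, trace_id), params)
    return cursor.fetchall()

# e.g. run_query(cur, "SELECT * FROM orders WHERE id = %s", (42,), "req-8f3a")
```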
Hello Pooja, this question is for you. I want to know what kind of tool you used to achieve the line-level coverage, in the sense that you are able to identify which lines the developer changed and what the coverage is on top of them. Are there any off-the-shelf components or plugins available, or how was it achieved?
I will just repeat the question to make sure I understood correctly: you mean the tools we used to identify the lines added? So basically, everything is available; we did not need to think about a tool. The GitHub API itself gives you the git diff: when you create a pull request, the git diff is shown to you; that's the diff, if you are able to fetch it. You just need to build a parser. Alright, and when I say "just": it actually took us two to three weeks to get it right, because in your context it can be different. We focused only on added lines, but there is a case where even deleted lines can affect the existing code. To give you an analogy: say a deleted line removes a variable usage, but in the existing lines the now-unreferenced variable is still there. That's a simple example of a syntax issue; it can be even worse, like the declaration no longer existing. So we had to work out what, in our code base, we should parse this diff for and what we should neglect. Also, one repository will not have just one language; it can have multiple languages, so you have to build your parser in such a way that your checks encode those sorts of rules: what is okay and what is not okay. For example, for running tests in Python we use nosetests and the unittest library, which run the tests and generate the coverage, and all we need to do is map that coverage onto the GitHub diff. In the same way, for Java a different set of tools exists; we use the libraries which are best for each language. There are certain tools, I can't recall the names now, which give the exact test coverage, and we need to find a way to parse them.
Kashif, this question is for you. You said that every time you go for a deployment you use an AMI, which means you build those AMIs. Let's say I have to deploy a simple Java system: I need to install Java and all the other dependencies. Don't you think it is costly, doing that every time? How do you reduce the build time at that point? That's my question.
Valid question. You can have pre-baked AMIs, and use Packer to add further on top of that.
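For readers who have not seen it, "pre-baking" an AMI amounts to the loop below, which tools like Packer automate. This is a sketch using boto3 directly; the AMI ID, instance type, region, and the provisioning step are all hypothetical.

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-south-1")

# 1. Launch a builder instance from the pre-baked base image (OS, JVM,
#    language runtimes already installed).
run = ec2.run_instances(ImageId="ami-0basexxxx", InstanceType="t3.small",
                        MinCount=1, MaxCount=1)
instance_id = run["Instances"][0]["InstanceId"]
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

# 2. Provision here: copy the new build artifact onto the instance, e.g.
#    over SSH. This is the only step that changes per release.

# 3. Snapshot the instance into a new release AMI.
image = ec2.create_image(InstanceId=instance_id, Name="myapp-2018-05-11")
ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

# 4. Throw the builder away; old release AMIs stay around for rollback.
ec2.terminate_instances(InstanceIds=[instance_id])
```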
Is my voice audible? Yeah, now it's audible. So, when you build an image, you don't think in terms of what languages: you build a composite base image once, on which you are going to add your code, and you can use Packer to take the pre-baked AMI, add the new code to it, and ship it out. As for the notion of it being expensive: I don't think it's expensive. I feel there is no real cost or time involved, so I don't know where that notion is coming from.
Can you say that again, with the mic?
Maybe I am coming from containers, where it's basically just one line: you can literally compile it on your machine and then push it off.
We are also using containers, actually, not just EC2 instances, but that was not relevant to the core point. Again, there is no addition in cost; I don't see any added cost or time being spent which affects your service delivery adversely. I think the notion that it is expensive exists because somehow we imagine humans are doing this task somewhere, but it's not like that. It's not expensive; it works fairly fast. We deploy multiple times a day from a Slack command; people just do it whenever they want, there is no prep, and the deployment finishes very shortly. It's always asynchronous: whether you use immutable infrastructure or not, you don't sit and wait for the thing to deploy; whenever the task is done you get to know on Slack, and that's it. Nobody notices any difference, and it doesn't cost any more. The only real cost is if you have to roll back and you don't have an AMI ready; then any time you waste is expensive. But those AMIs are lying around, because they were previously released. So no, I don't think it's costly at all. And correct me if I'm wrong, the cost is only the diff, right? Only the diff is actually the cost; it's not much.
Pretty much, yes: only the diff is the cost.
This is for Kashif: we are also thinking of migrating to Terraform, to infrastructure as code, but the problem as of now is that there are probably 600 EC2 instances, a fairly large infrastructure. If we are starting today on legacy infrastructure, how do we go about transitioning? Do we go service-wise, or do we take, say, S3 as a module, or...?
Again a good question, and there is no good answer, actually; any answer that will satisfy you is fairly long, but I will address it shortly. One thing to note is that most architectures are not homogeneous; they are heterogeneous, compositions of parts; it is very rarely not the case. So you can take a part out and move it to Terraform while the rest of it stays the way it is. You will be faced with some problems around service discovery and things like that, but you can have hacks around them while you do the rest. So the short answer is: decompose your architecture into parts and do it part by part, and do justice to the problem, because there are a lot of problems when you do something like that. If it is possible, you should do it all in one shot; it will save you time overall, and it is very testable. It is not like rewriting code, the two are not equivalent, so a complete rewrite in Terraform, if your architecture is not massive, is usually the better answer.
We will end the Q&A session here; we have run out of time for this. We will start with the flash talks now. So far we have two flash talks; in case somebody is interested, we still have a couple of slots available. You will just get a mic
and you will have 5 minutes to deliver the talk. If you want to do a flash talk right now, just come over and let me know. So we will start off with the flash talks. Chandan Kumar, are you here? Can you go on for the first talk? We have Chandan Kumar for the talk on the DGPLUG summer training.
Hello everybody, my name is Chandan Kumar, and I'm going to talk about the DGPLUG summer training. Am I audible at the back? Am I audible? Yeah, good. So I will start with a question: do you guys contribute to free software or open source software? Cool. And for those who did not raise their hands, I assume you have not contributed to open source or free software yet, but you are willing to. So, before coming to this topic: why contribute to open source? People contribute to open source to get a better job with a good salary, because the salary part is complicated for everybody, or to enhance their skills and learn new technologies. Here, your problem gets solved by DGPLUG. DGPLUG is the Linux Users' Group of Durgapur, which does not belong only to Durgapur: we are a group of people who contribute a lot to free and open source software, from the kernel to OpenStack to Fedora, everywhere, and we are a set of friends, not just friends in the virtual world but real friends, and we help people become open source contributors. So how did DGPLUG start? It started with the motto "first learn, then teach others". It was started by Kushal Das, who is a CPython core developer, back in 2005-06. He and his friends were struggling with how to contribute to open source and how they could learn and share the languages and technologies with others, so they started this group. It is a group based on IRC, where lots of people hang out and share their knowledge, and we run a three-month summer training program so that anybody can join it, learn the kick-starter things, get started, and become a contributor to the open source world. The biggest hurdle for everybody getting started in open source is whom to contact and where to start; DGPLUG fixes that problem, and the fee for this training is nothing. You just require a laptop with GNU/Linux installed; second, you need a decent internet connection, and I think in India an internet connection is available everywhere; and you must have the attitude to learn. The training is free. So, by the end of the training, what will you have learned? The training starts with how to communicate properly on IRC and in the free and open source world, how to use your box (when I say box, I mean your computer), how to use an editor effectively, how to use a version control system, as well as how to learn a language to solve your real-world problems. Apart from that, during the course of the summer training we invite guest speakers who are already contributors to various projects. Last year Guido, the creator of Python, showed up on IRC and shared his journey: how he wrote Python and how he is contributing to open source to change the world. So if you want to become a part of this journey, just save the date: 17 June 2018 at 7 p.m.
on the IRC channel. How many of you know about IRC? Good. It is on the Freenode server, where lots of open source projects, like VLC, are developed. So just join the channel and say hi on 17 June 2018. You can find all the information at dgplug.org under summer training '18, and we will be happy to help you out. Thanks. Any questions?
We will have the next talk by Rahul Bajaj, on configuration management at its peak with Foreman.
Hi everyone, let's start with the talk. The talk is about configuration management at its peak with Foreman. About me: I am Rahul Bajaj, I work at Red Hat as an associate software engineer, and I have been working on the Foreman project for about a year now. Let's see what we are going to look at today: the Foreman project. Does anybody already know about Foreman? Anybody heard of this project? Anybody who has never heard of it? And the other people, who have heard of it but think it is crap? So, we are going to see what Foreman is, how it adds up to the picture, and the key features and functionality of Foreman. Before looking at Foreman, let's look at its history a bit. How many of us already know what Puppet is? Ever heard of Puppet? We have heard of Puppet. Puppet is a configuration management tool, and what configuration management tools are used for is configuring your machines, your nodes. Puppet solves a great and huge part of the problem by itself, but if you look at a system's entire life cycle, it consists of installation, then actually configuring the machine, and then monitoring the machine. Puppet covers the initial configuration, but what about installation, updates and drift management? Foreman is a complete life cycle management tool, which looks after the installation, the initial configuration, and also the monitoring of your nodes. Okay, so the three major functionalities of Foreman are provisioning, configuration and monitoring. Let's look at each of them; I'll go through very fast because we don't have a lot of time. Starting with provisioning: Foreman provides all types of provisioning, on bare metal or on the cloud; we started with virtualization, and now we also support containers. So basically it provides provisioning for all sorts of machines. Provisioning in the sense that we do both PXE and PXE-less provisioning; PXE-less provisioning we do in terms of image-based provisioning; and we also provision virtual machines, as you've seen. To provide DHCP, DNS and TFTP services we have something called smart proxies, which we'll see in a minute. So that's provisioning. With Foreman you can also use any kind of configuration management tool after you provision your machine: once you've installed the operating system on your machine, on your server, you can use any configuration management tool to configure your nodes; it supports Ansible, Puppet, Chef, whatever you want to use. Next is monitoring: we support monitoring, and we recently added Prometheus, so we are supporting end-to-end monitoring of your systems. This is the Foreman architecture. I spoke about smart proxies; so what are smart proxies? You can have one centralized Foreman server wherever you are right now, and say, I want to serve some
machines in the US. What you can do is place a smart proxy there; the smart proxy will provide you services like DHCP, DNS and a TFTP server. You can place a smart proxy in any location and serve or provision your machines in that particular location, while having one single instance of Foreman wherever you have it. So that's how a smart proxy works, which makes Foreman a distributed architecture altogether, and which couples with Puppet: provisioning in the first half, configuration management in the second half. We also have a web UI and an API: Foreman has an API, and you can manage your systems the way you want by writing your own tooling against this API. Customization: Foreman basically has a plugin-style architecture, so some people like to use Foreman only for, say, reporting; you can do that, and switch off all the other features like provisioning and configuration management. If you want to use Foreman only for provisioning machines, you can switch off all the others. So yeah, it's a plug-in, plug-off architecture. API and CLI: we have an API, we have a CLI, and we have a UI; whatever you can do in the UI, there is also a CLI for it, so you can write your own custom scripts and use Foreman the way you want. Lastly: you can check the tool out at theforeman.org. It's an open source tool; the entire repository is on GitHub and you have access to it. Feel free to ask questions on the Foreman and Foreman-dev channels, and you can also post your questions on community.theforeman.org. That's it, thank you.
We have our last talk coming up: it's on MySQL 8, the best gets even better.
Right, so for people who don't know me, I'm Balasubramanian; I work in the MySQL release engineering team. This deck was originally presented by my director, and I just want to present something about MySQL 8: the best gets even better. That's a safe statement, and we have a few things to highlight. These are some of the customers who run on MySQL: Google, Uber and GitHub, and there are a few others as well. We published MySQL 8.0 on 19th April, so that's very recent, and I just thought of highlighting a few things so that you are aware of what's happening in MySQL on the release front. So MySQL 8.0 is GA, and we have got a few features. There is the NoSQL document store, which means we now say MySQL 8.0 has SQL plus NoSQL, and we have got JSON support in it; so for people who are looking for JSON support, 8.0 is the solution. We have got CTEs, we have got window functions, very nice improvements to InnoDB, and very nice replication improvements; we are going to see those in the slides. We have got roles introduced, and we have got Unicode, so we are going to see how we handle emojis. And the last thing is GIS; nice improvements on GIS as well. I will go through quickly, just to save a few minutes. MySQL 8.0 is SQL plus NoSQL: having said that, we have got full JSON support through our new X DevAPI NoSQL interface. This is a very neat diagram: we have connectors on top, and we have the MySQL Shell; it connects to the SQL API, and the MySQL Shell interacts over the X Protocol, and internally, if you see, we
have got both NoSQL and SQL. The document store is the main thing for the NoSQL support; it has got full Node.js integration and auto-completion, and the MySQL 8.0 Shell has nice prototyping capabilities: you get full SQL and X DevAPI support with built-in auto-completion, which is a very important feature. The best part is that you can set up InnoDB Cluster support within a few minutes; it is a nice DevOps tool for someone who wants to try it out. You can try this out; it is available on dev.mysql.com. Let me not go into it in depth; I will just walk over the important points. Through the JSON_TABLE function, which is available in MySQL 8.0, we get full JSON support; that is very important, and these are examples of how we get it. As I said, nested arrays support is in labs; if you want, you can try it out, and these are the list of functions. Very quickly, on efficient replication: this is the graph which clearly shows how things have improved with 8.0. We have got the transactional data dictionary, which means that all the system tables have moved into the data dictionary, and with that we have got a significant improvement: MySQL 8.0 is twice as fast as 5.7. Nice support for CTEs and window functions, and better handling of hot-row contention; this is very important, because we have heard from the community that this is a very good feature to have, and we have got it. Invisible indexes: this is something for a DBA. In case I want to hide an index, I can just mark it invisible, keep it that way for some time, and see whether things are okay; if nothing is using it, I can either drop it or make it visible again. SQL roles, which are very important, are again implemented in 8.0. Performance is very important, and you can use all these variables to get the performance right. The world of emojis: we get the support through UTF-8. Yes, MySQL 8 defaults to utf8mb4, completely built in and ready to use, so there is no extra configuration or installation. GIS is very powerful: you get full geography support, rather than only flat, two-dimensional projected coordinates. Persisted variables: in a world of DevOps and cloud, yes, we have got it. Through persisted variables you say SET PERSIST max_connections = 100, and you are set. InnoDB: nice improvements in InnoDB; we have got the document store and JSON, and we have got atomic DDL, and as I said everything is in InnoDB now. And this is what the community has been asking for: instant ADD COLUMN. What we have done is we have got it; it will be available very soon, in the next release.
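A few of the 8.0 features named in this talk, exercised from Python with mysql-connector-python. The connection details, table and index names are hypothetical; the SQL statements themselves (SET PERSIST, roles, invisible indexes, the JSON operator) are standard MySQL 8.0 syntax.

```python
import mysql.connector

cnx = mysql.connector.connect(host="127.0.0.1", user="root", password="...")
cur = cnx.cursor()

# SET PERSIST: change a variable so it survives a restart, no my.cnf edits.
cur.execute("SET PERSIST max_connections = 100")

# Roles: grant privileges once to a role, then grant the role to users.
cur.execute("CREATE ROLE IF NOT EXISTS 'app_read'")
cur.execute("GRANT SELECT ON app.* TO 'app_read'")

# Invisible indexes: hide an index from the optimizer, watch the workload,
# then drop it for real or make it visible again.
cur.execute("ALTER TABLE app.orders ALTER INDEX idx_legacy INVISIBLE")

# JSON support: documents and SQL in the same engine (doc is a JSON column).
cur.execute("SELECT doc->>'$.customer.name' FROM app.orders_json LIMIT 5")
print(cur.fetchall())

cnx.close()
```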
Replication: there are a lot of nice improvements on replication, and a very good improvement in group replication. This is very important on the InnoDB Cluster side as well; we achieve high availability through this in MySQL. And these are a few graphs which show benchmarks on read-write sysbench. There are a lot more features; I can't cover everything in 5 minutes. Lastly, there is the upgrade checker, which can help you in upgrading from 5.7 to 8.0. And coming soon: we are going to have an Oracle Cloud data center in India, and MySQL 8 will be on the cloud by May or June. Yep, a few things for you to go over for more learning; any questions, you can reach out at our booth and we'll be happy to assist you. Thank you.
We are done with the flash talks. Before you head out, just a reminder that we have left feedback forms on your seats; after all the sessions are done, whenever you are leaving, just fill up the feedback form and submit it in the box outside. Thank you. You guys can head for your break now; it is 30 to 40 minutes.
So this here is an example of an overshooting job: at 12:00, Quartzite is supposed to schedule a job, but it doesn't, because it finds that the previous run is still going. So the next run is scheduled at 12:30, and this results in a high chance of an SLA breach if it happens very often. Besides this, we faced some other problems and limitations in general: there is a continuously running process even when no jobs are running; it's very difficult to scale this horizontally in this form; there is no way to run on-demand jobs; there is a lack of visibility; the logs are all interspersed, because multiple threads write to the same file; and it's specific to Clojure and Java, which is quite important, actually. At this point I just need to clarify, since I said Quartzite and then immediately moved on to problems, that it's not about Quartzite. Quartzite is an advanced scheduler, and it is capable of solving these problems, but that would mean building all those things ourselves, whereas Jenkins has them already, as we will see in the course of this talk. That's all I wanted to clarify on this slide. So let's move to the problem statement. As I said, we were looking at building a platform for all our job scheduling needs, and we started with these goals. We obviously wanted to target the problems we were facing: prevention of SLA breaches; distributed execution of jobs, which would enable horizontal scalability; job pipelines, which would allow us to compose bigger jobs out of small jobs; a UI for running on-demand jobs; common functionality, meaning we want the platform to provide all the common functionality; jobs that are easy to write and onboard; and finally, not being limited to Clojure or Java. So why did we choose Jenkins? This is a very valid question. If you go to the Jenkins website, it says that it's an automation platform, but a lot of people only end up using it for CI/CD or for running tests, because CI/CD is its most highlighted capability. But it is an automation platform: it comes with a built-in job scheduler, it has a very active community which is highly responsive to security vulnerabilities, and it has a much better plugin ecosystem. We had prior experience with the platform: at the time we adopted it for this solution, we were already running another cluster for CI/CD, which had evolved over 4 years to support 500 jobs and around 20 slaves. And it aligns with our philosophy: we want to build on top of what we are familiar with, because we invest in technologies that work for us,
then we reuse them and standardize on them, so it helps us reuse code. So now let's look at the Jenkins-based approach. Over the next few slides I will be speaking about the various building blocks of this solution, beginning with Jenkins, of course. We run the Jenkins cluster in a master-slave configuration like this, where we have one master and multiple slaves. The master is the node that takes care of scheduling jobs: it has a queue, it stores job definitions, and it provides a web UI. The slaves are the nodes that connect to it, and jobs get executed on those nodes. The second building block is the job wrapper script. This is essentially a Python script that is installed on all the slaves. You can think of it as a layer between the scheduler and the code written by the developers, and it is the place where we can pack a lot of common functionality, because the scheduler executes the code through this wrapper. Retries, timeouts and monitoring are handled by this wrapper, which means that our jobs are very lean and all the reusable things have moved into the platform. The third is the code itself, which encapsulates the business logic, and it can now be written in any language; the only requirement it has to satisfy is that it runs like a command-line script, indicating success or failure via its exit code. The fourth part is job definitions. Now that we have a scheduler and the code, we want to make the scheduler aware of this code, right? To do that we need to create jobs on Jenkins. It can be done manually, but why do it manually when we have the pipeline DSL plugin in Jenkins, which allows you to describe your pipelines as code? It provides a Groovy DSL for that. So we have developers write these as Groovy scripts, and we keep them along with the source code in the repository. And then we have the release integration, which ties all these things together. You can imagine the release process as being made up of four stages: in the build stage we package the source code and Groovy scripts into a tarball; in the pre-deploy stage we prepare the nodes for two things, to be able to run the code and to be able to join the master as a slave; in the deploy stage we just copy the artifact to the nodes and extract it; and in the post-deploy step, and this is important, we trigger a special job on the master called the seed job. This job translates all the DSL scripts into Jenkins jobs. To build all this we used a lot of plugins; we didn't write much code, everything was there, we just had to tie it together. So let's look at the first plugin here, which is Pipelines. It allows us to write multi-stage jobs, composing complex jobs out of smaller ones. Here is a pipeline with four stages: A and B run in the first two stages, C and D run in parallel in the third stage, and E runs in the fourth. The beauty of this is that these jobs can run on different slaves, they can be written in different languages, and they can be maintained by different teams. An auxiliary plugin to the Pipeline plugin is the Pipeline DSL plugin, which provides a Groovy-based DSL to describe the pipeline. I won't go into much detail here in the interest of time, but essentially we are just telling it which label, that is, which slave, to use, and then we list down all the stages.
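Returning to the job wrapper described a moment ago: the speaker's wrapper is not shown in the talk, but a stripped-down sketch of the same idea looks like this. The scheduler never calls developer code directly; it calls the wrapper, which layers retries, a timeout, and a success/failure signal on top of any command-line job. The retry counts, timeout, and metrics hook below are hypothetical.

```python
import subprocess
import sys
import time

def run_job(cmd, retries=3, timeout=3600):
    for attempt in range(1, retries + 1):
        try:
            # The only contract a job must satisfy: run as a CLI command and
            # signal success or failure through its exit code.
            subprocess.run(cmd, timeout=timeout, check=True)
            report_metric("job.success", cmd)
            return 0
        except subprocess.TimeoutExpired:
            report_metric("job.timeout", cmd)
        except subprocess.CalledProcessError:
            report_metric("job.failure", cmd)
        time.sleep(30 * attempt)   # back off before retrying
    return 1

def report_metric(name, cmd):
    # Placeholder: a real wrapper would push to the monitoring system.
    print(f"[wrapper] {name}: {' '.join(cmd)}", file=sys.stderr)

if __name__ == "__main__":
    sys.exit(run_job(sys.argv[1:]))
```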
Then we have the Job DSL. This is similar to the Pipeline DSL in the sense that it provides a Groovy-based DSL, but for describing how to create a job. We use this for the seed job: the seed job, which is the last step of the release flow, extracts the artifact, loops through all the Groovy scripts, and creates jobs for them. The third important plugin is the Jenkins Swarm plugin. This is what gives the platform its distributed nature, and it enables auto scaling. Without this plugin, the way the Jenkins master connects to a slave is by opening an SSH connection to it; this plugin flips that around, allowing the slaves to initiate the connection to the master, so the master doesn't need to know about the slaves anymore. This is an important property of a distributed system where your cluster size is not known in advance, and it helps with auto scaling: slaves can now come and go as the load increases or decreases. It also helps with HA, which we will see later. And finally we have the Metrics plugin, which helps with monitoring: it provides a Dropwizard Metrics API, basically an HTTP API exposing health checks, and this is consumed by a Sensu plugin, which then sends alerts. These are not all the plugins we are using; I am just highlighting the important ones. So what did we gain from all this? First of all, we solved the problem of restarts: releases no longer affect the jobs, because there is no restart happening. When a deployment takes place, the job keeps running on some slave while the job configs get updated on the master; there is no restart, and no SLA breach because of this problem anymore. The problem of the overshooting job is solved because Jenkins queues the jobs: in the previous example we saw the third job run overshooting, and in that case Quartzite skipped it; in this case Jenkins queues it, and the job starts immediately after the previous one finishes, at 12:04. So the interval between job runs has been reduced from 56 minutes to 4 minutes. And it is horizontally scalable, because jobs can be distributed across slaves. It's also possible to have HA for slaves, if you need a given slave type to always be available, and swarm slaves make auto scaling possible. These are some of the other benefits: we get a UI for free, from which we can just run the jobs; all the common functionality is now in the platform, thanks to the job wrapper; jobs are easy to write because of this, and easy to onboard because of our release integration; better visibility into logs, again because of the UI; and it also provides a RESTful API, ACLs and all those things which are usually required of any UI. So let's look at how it's doing in production. We have been running it in production for a few months; it's a new project. So far we have onboarded around 32 jobs, we have 13 slaves, and we have had more than 15k job runs. Just to give you an idea of the load, we are running around 100 hours of jobs per day across all the slaves. When we talk about running anything in production, high availability is very important. Unfortunately, Jenkins doesn't have built-in high availability for the master, and here the most important component is the master, as it does the scheduling. So we have an active-passive setup: besides the master that's actively scheduling jobs, we have a passive one, configured identically to the master and continuously syncing files from the master using this
tool called Unison. Now, let's say something goes wrong on the active master: we switch over, that is, we make the passive the new active, and since these are swarm slaves, they re-resolve DNS and connect to the new master. Monitoring is also important in production. For jobs, it's handled by the wrapper script; the wrapper does everything, so developers don't have to worry about monitoring at all. For the master, we have the process checks and the health checks exposed by the Metrics plugin and consumed, as I said, by the Sensu plugin; and for slaves we have process and health checks for the swarm client process. This is still in development, and we know there are a few issues; these are the most glaring ones. The HA for the master is not real HA, and at present the switchover is not automated; this is something we know and can live with for now, but it has to be fixed soon. Second, auto scaling is not implemented yet, which limits our use of this platform to predictable load, and going forward we also want to use it for other use cases. Future plans, as I said: better HA with automated switchover; auto scaling of swarm slaves; we want to implement state passing between pipeline stages, something that is missing in a lot of platforms, which would make the Jenkins pipeline behave more like a Unix pipeline, where data can flow through the stages; and finally, maybe we will rewrite this Python wrapper in Java, so that it can be packaged as a plugin. That would also mean we are tightly coupled to Jenkins, so we are not sure; we are still thinking about this. And that is it, thank you.
No questions, guys? Okay, must have been a good break.
No, a question: you mentioned a Unison tool that you are using for failover; is that open source, and where can I find it?
It is open source.
Which company developed it?
Helpshift itself developed that.
Okay, thank you. Any more questions?
So the jobs get queued up, right? Is there any flag where you can check that not too many jobs are queued at a single point in time?
Yes: the Metrics plugin I just mentioned exposes those things over the HTTP API, and we can write our own jobs which poll that API and trigger alerts.
How different is Jenkins from Rundeck?
Rundeck... so, to be honest, we didn't evaluate Rundeck when we started building this. Later on I did take a look at it, and it's quite similar, but, and this is coming from someone who has not worked with Rundeck at all, the most noticeable thing was that Jenkins has a far more mature plugin ecosystem, and the plugins we found very closely matched the use cases we had. That was my evaluation after I came to know about Rundeck, and by then we had already implemented this.
Any more questions?
In the example that you showed, jobs 3 and 4 were running in parallel; does it also handle synchronization, so that it will only run job 5 after 3 and 4? And is that provided by the plugin out of the box?
Yeah, it's provided out of the box; it's part of the Pipeline plugin.
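"Write our own jobs which poll that API and trigger alerts," from the queue question above, can be as small as the sketch below, using the python-jenkins client. The URL, credentials, and threshold are hypothetical.

```python
import jenkins

server = jenkins.Jenkins("https://jenkins.example.com",
                         username="monitor", password="api-token")

QUEUE_THRESHOLD = 10

# get_queue_info() returns one dict per queued (not yet running) item.
queue = server.get_queue_info()
if len(queue) > QUEUE_THRESHOLD:
    stuck = [item["task"]["name"] for item in queue]
    # Hand this off to the alerting system (Sensu, in this setup).
    print(f"ALERT: {len(queue)} jobs queued: {stuck}")
```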
This is related to the earlier question about Rundeck and Jenkins. I had evaluated Rundeck last year. Rundeck is mainly for operations: with whatever tools we currently have, we can build self-services and automations on it. Jenkins is more for day-to-day development, the deployment and feedback we want from the code we check in daily. That is the main difference I see, so it is not purely an apples-to-apples comparison between Jenkins and Rundeck, but there is a lot of overlapping functionality, and I don't see why you can't use Jenkins. I didn't see one advantage we would gain by using Rundeck; rather, I could see the things we would lose by not using Jenkins in this case. But we can talk afterwards if you have more points. OK, thank you. So the next speaker is Swapnil Dubey, who is going to talk on compute-intensive applications on DC/OS. Swapnil is passionate about designing and building scalable big data systems; these days his expertise is in developing infrastructure for data science use cases. When not in front of the computer, he likes to spend time with his daughter; he also participates in cleanup drives and contributes to child education. Swapnil. Welcome to this talk on compute-intensive data applications. A bit of history about me: I have more than nine years of experience, and I am currently working with XRATOM Software Services as their big data architect. I have experience in domains like BFSI, e-commerce, and ad serving, with companies like Snapdeal, PubMatic, and Schlumberger. For the past one and a half years my technical journey has shifted towards developing infrastructure, primarily for data science use cases: for example, how to run TensorFlow, how to get GPU-enabled TensorFlow running, dynamic scheduling, and so on. The agenda for the talk: I am going to take you through a journey. I will introduce a use case we are handling right now, a mixed data engineering and data science use case; I will talk about how we used to do it eight or nine months back and the issues with that strategy; then how we slowly re-strategized and moved towards running things on DC/OS, with a few positives and negatives, the improvements we are considering, and the aim we want to achieve. I also have a few demos for you, just to convince you that what I am describing really is that simple. Our use case, as I said, is primarily data engineering plus data science. On the data engineering side we have Hadoop jobs, Spark jobs, and Kafka for ingestion, a pretty simple and standard pipeline. On the data science side we had R scripts, written mostly by people in a BA role, and a few data scientists who write TensorFlow jobs. The way we tried to make both systems run together was to think about running big data applications in containers. Now, running big data applications in containers carries a kind of overhead, as a few of my colleagues have pointed out, especially for systems like Hadoop or Spark that lean on horizontal scalability, where you add nodes and run the applications; but after the next slide I think it will make more sense when I talk about containerizing them as well. Our client is an e-commerce giant.
The data size we are handling right now is approximately 55 GB per hour, and traditionally this is a company with both data engineering and data science teams. Earlier, as an e-commerce giant, our data science use cases were collaborative filtering and logistic regression, finding out how sales move as the price of a product changes. But with the size increasing every day and complex user patterns coming into the picture, our data engineering pipelines are becoming more and more complex, and alongside that we wanted to look into deep learning use cases as well, so we wanted to leverage TensorFlow. This is the picture from about six months back. We have various sources, and we had written a simple Java Spring-based REST client; the sources hit this REST layer with data, primarily in JSON format, and the data goes through Kafka or through Flume and gets ingested into HDFS. Kafka is for the kinds of events where the SLA is not the primary concern, things like user clicks and product views; Flume carries the data where an SLA matters, for example placing an order or identifying fraud detection cases, so for that couple of small use cases we were using Flume. We ingest all of it into HDFS, and from there, as I said, Hadoop and Spark jobs process the data, doing filtering and sampling and so on, and finally put it into GCS. From there we have a basic Kubernetes cluster running, with Docker images created for the different environments: if a BA wants to run an R script, we can spawn a container suited to that R script. And once an execution finishes, a cron-style script we wrote keeps checking periodically and simply kills whatever pod has finished, so the cluster grows in size and comes back down. Mind you, this whole infrastructure is on Google Cloud Platform. With Kubernetes on GCP there is the concept of preemptible nodes: the condition is that they go away after 24 hours, but the price is just one third of regular nodes. Such nodes are not appropriate for running Spark applications, but for bursts, say people firing 100 R scripts together that run for 15 to 20 minutes and go away, that kind of infrastructure works well. A bit more on the size of the infrastructure we were using. We ingest close to 1,500 events per second which, at roughly 600 to 700 KB, comes out to somewhere around 55 GB per hour. The REST layer has 10 nodes right now, and there are 25 nodes for Kafka. The way we work is that if we are processing, say, a task from stage 1 to stage 2 and anything fails, we save the state into Kafka so we can go back and reschedule that task; that is why we have so many Kafka brokers running. As I said, Flume covers just a couple of small use cases where an SLA needs to be met, so it stays pretty small.
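The housekeeping script mentioned above is not shown in the talk; a minimal sketch of the same idea with the official kubernetes Python client could look like this (the namespace and the polling interval are assumptions):

```python
# Periodically delete pods that have finished so the cluster can
# shrink back down. Namespace and interval are assumptions.
import time
from kubernetes import client, config

def reap_finished_pods(namespace="jobs"):
    v1 = client.CoreV1Api()
    for phase in ("Succeeded", "Failed"):
        pods = v1.list_namespaced_pod(namespace, field_selector=f"status.phase={phase}")
        for pod in pods.items:
            print(f"deleting {phase.lower()} pod {pod.metadata.name}")
            v1.delete_namespaced_pod(pod.metadata.name, namespace)

if __name__ == "__main__":
    config.load_kube_config()   # or load_incluster_config() inside the cluster
    while True:
        reap_finished_pods()
        time.sleep(300)         # check every five minutes
```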
For running our Spark and Hadoop jobs, as I said, we were on GCP, and in GCP the managed Spark and Hadoop environment is Dataproc; we were at close to 100 nodes there. For Kubernetes we had two bare-minimum nodes where the cron-style script and all the housekeeping containers ran, and apart from that, everything was scalable via preemptible machines. Given the way we were firing jobs (it was a big team, counting the BA side as well as the data scientists), on average it ran around 100 to 120 nodes at peak, to be frank, because at that point we were not sizing the containers properly, so there was a bit of mismatch; but going by the numbers it was close to 100 to 120 in peak hours. Here I want to mention a paradigm shift. My journey with Hadoop started four years back, and at that time we were genuinely thrilled by a system where, if you want more capacity, you just go and add a couple of nodes and it scales up. Along the same line Spark arrived, brought the in-memory piece, and things became fast. But in both of those scenarios you are still bound by the infrastructure. Say there is a requirement for 500 nodes, not for the full 24 hours, maybe just for one hour. In the old world, where horizontal scaling means manually increasing the number of nodes, someone has to go and add them; we cannot expect the data scientists or the people running the jobs to go and provision five nodes for themselves before doing their processing. On top of that, the Spark and Hadoop clusters were planned for the maximum size: if I have an SLA of completing the processing in one hour and at peak I need 100 nodes, we would schedule 100 nodes and leave them there even in the non-peak hours, paying for infrastructure we were not fully utilizing. The DevOps team five to six months back was four people, and we had a lot of software components: Kafka, Spark, Azkaban for scheduling our Spark jobs, Hadoop, Flume, Kubernetes, Jenkins, a lot of things happening. Approximately 125 Kafka topics and 93 Spark jobs, to be frank, with different teams sharing the infrastructure as well. The basic problem was that we needed to meet the SLAs while also optimizing cost. This is the kind of infrastructure we used to have: three nodes for Flume, Kafka on 25, Jenkins sitting on maybe one machine, 100 for Spark, two nodes initially for Kubernetes, each provisioned for its peak. The Spark cluster in particular is provisioned to meet the peak SLA, so even in the non-peak hours the resources of the Spark cluster sat underutilized.
And this is true for all of them. I agree that with Flume the money involved is pretty small, but if we can optimize there too, good for us. Ideally, instead of treating every component as a separate resource group, separate applications requiring separate infrastructure, we wanted something where applications are nothing but applications: within one shared infrastructure, if my Spark job needs capacity equivalent to 100 nodes, it should have the capability to get it. And as I said, we have a data science use case as well: in the non-peak hours, when the Spark cluster and everything else is mostly free, those are the hours in which we schedule the training of our data science models. So with the aim of fully utilizing the infrastructure, and of something like infinite scalability whenever it is required, our first choice was to move completely to Kubernetes. Why was it the first choice? We already had decent experience with Kubernetes, and we were already creating the images in Google Container Registry. We could have gone ahead and containerized every application via Kubernetes, but a few points were missing; there were a couple more things we wanted this ship to cover. First, deployment should not be tied to a particular cloud provider. How many of you have used Kubernetes, may I see hands, please? Is your Kubernetes running on Google Cloud Platform or in some other environment? Anybody running Kubernetes outside GCP? And that is exactly the pain point. Kubernetes being Google's baby, they have decorated it beautifully there: the preemptible nodes at one third the money and all of that. But those facilities are only available on GCP. If you try to install Kubernetes on-premise, the pain points are big, and if you search for it you will find only a few blog posts about it; people are simply not doing it much. So we did not want to tie ourselves to Kubernetes on Google Cloud Platform. To be frank, our aim is that once the applications stabilize, we actually move on-premise, especially the applications that do not need sudden bursts of resources. Along with that, we did not want to compromise on adding new components: easy addition of new components was also a target, and it becomes very important when hosting infrastructure on-premise. It is simple to use Amazon and its products, but it is difficult to stay cross-platform. And frankly, if you search for running Hadoop on Kubernetes, you will hardly find any examples, maybe a very small one here and there, nothing substantial, and we wanted our production Hadoop pipelines to run there.
So that is another point against Kubernetes in our case. Kubernetes looked promising initially, and as I said it looks very easy on GCP: it takes maybe a minute to spawn, the cluster grows flawlessly, and it comes back down nicely. But what if I don't want to use GCP? In short, installing big data components was hard; they are just not the applications that are a perfect fit for containerization. When applications have to transfer a lot of data across containers, they do not scale that well: run a Spark job entirely in containers and it will not give you the performance you get on actual machines. Whether you talk about Kafka, Spark, or Hadoop, they all share that nature, so installing those big data components, especially on the data engineering side, was difficult, and not enough support was available. Look at Spark 2.3.1, released just last month: they give a single example on Kubernetes. The version before that talked about Kubernetes, and then for the next one and a half years there was complete silence. It was not happening properly; Spark is not inherently built to run that way. As for TensorFlow, which we were targeting: we had started using TensorFlow for deep learning use cases, and five or six months back, when we started developing this infrastructure, GPU support in Kubernetes was in alpha. They had explicitly said not to use it in production, and that was exactly what we wanted to do. In the current scenario there are some very good libraries supporting data parallelism and model parallelism in deep learning, such as Kubeflow, which is picking up these days, but five or six months back they were not there, so it was very difficult to run and train models properly with those parallel approaches. Logging too: each pod has its own logs, but it is difficult to pull all the logs into one place and process them. Again, I am not advocating that these components cannot be installed on Kubernetes; they can. The point is that the amount of effort with Kubernetes minus GCP is huge. How we solved it: we use Marathon, rather than Kubernetes, as the orchestration engine, and underneath we started using Mesosphere DC/OS which, as the next slide shows, is essentially an operating system. The operating system on my machine takes various processes and gives you the feeling that they are one: different applications run on different cores and produce their results. With DC/OS the only difference is that instead of processes you have multiple nodes, pooled into one big pool of resources. This machine has four cores; with multiple machines, you just add all the cores together, and that is your DC/OS capacity.
Why did we start looking into it? Because big data ecosystem components like Kafka, Hadoop, and Spark are effectively one-click deployments: one click and you are good to go. It has a very good logging framework. We had done a lot of work on Kubernetes, and thankfully DC/OS also supports Kubernetes as an orchestration manager, so our whole pipeline could have moved to DC/OS as-is and the work we had done could be reused. And deploying a distributed TensorFlow application was not that difficult here. This is DC/OS, a three-layered architecture. The top layer is the services and applications: you deploy them, and they pass through the middle layer, which is the operating system. That operating system adds up all the resources lying in the underlying nodes: take all the nodes, add their resources together, and that pool becomes the complete infrastructure available to your one operating system. Now the demo. My aim is to show what steps, at a basic level, are needed to get a DC/OS cluster up and running. The infrastructure I am going to use is Amazon; we have infrastructure as code written in Ansible, and we will use CloudFormation here. First I will show how simple it is to spin one up for your POCs, and then I will use that same infrastructure for the rest of the use cases. CloudFormation templates are already available: I pick North Virginia, select the template, and the complete script for a sample deployment is ready. Looking into that script, this is the infrastructure it maintains: load balancers up front, one node acting as master, the complete configuration done for you, with an Elastic Load Balancer spreading the load. If you want a very basic setup you do not need to do anything more; it is one click. I click Next and put in the stack name. I am using m4.xlarge for this deployment; the key name is dcos, a key pair I created earlier; one m4.xlarge as my master and five as my slaves, which is what will appear on our DC/OS console. That is the configuration, and we just create it. It starts appearing here (frankly, I changed the name this morning, so it shows up as DCOS), the create is in progress, and these are all the things happening behind the scenes, all managed by the script I used. This will take a while, not a couple of minutes to be frank, more like 12 to 13 minutes, so I will switch off and fast-forward those 10 minutes. Now our DC/OS installation is complete, it shows complete, and we access the DC/OS web UI using the DNS address of the master. The web UI is all set. This is the dashboard page: CPU allocation (nothing yet, since we started fresh), memory allocation, disk allocation. It says there are six connected nodes right now and 38 components; in short, the script we used has some 38 services running behind the scenes.
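The same stack creation the console wizard performs can be scripted. A sketch with boto3; the template URL and parameter names follow the public DC/OS CloudFormation templates, but treat them as assumptions and check against the template you actually use:

```python
# Spin up an evaluation DC/OS cluster from a CloudFormation template.
# Template URL and parameter names are assumptions based on the public
# single-master DC/OS templates.
import boto3

cf = boto3.client("cloudformation", region_name="us-east-1")  # North Virginia

cf.create_stack(
    StackName="dcos-demo",
    TemplateURL="https://s3.amazonaws.com/downloads.dcos.io/dcos/stable/"
                "cloudformation/single-master.cloudformation.json",
    Parameters=[
        {"ParameterKey": "KeyName", "ParameterValue": "dcos"},          # EC2 key pair
        {"ParameterKey": "SlaveInstanceCount", "ParameterValue": "5"},  # private agents
    ],
    Capabilities=["CAPABILITY_IAM"],  # the template creates IAM resources
)
```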
No services are configured yet. And this is pretty much one of the major reasons we chose it: the certified distributions. Cassandra, for example, is in certified mode, ready for production, and there are users like Uber, Yelp, and Cisco running on these. These are all the certified packages; I see Elasticsearch and almost all the components our use case required. Configuring them is easy: just click, set the properties, and run the service. You can use the UI, or simply supply your JSON configuration. I will do the same with Kubernetes: one click and we are done. Of course you can configure it to your requirements, but the basic setup is done. The metrics have started coming up; you can see them rising. There are two tasks: Kubernetes is coming up, and Elasticsearch is already up. Remember this Elasticsearch component; I have a very interesting fact about it which I will share at the very end. So that was deployment, but how do we access this infrastructure? We have it up and running within the nodes. Here I am using a simple Linux machine on my GCP account and doing the basic steps to make it the bootstrap machine. Frankly speaking, this is parallel to an OAuth mechanism: it is OAuth being configured. I get the OAuth token, put it in, and now that machine can talk to my cluster. Pretty straightforward; there is not much more to it if you just want the basic setup. Adding nodes is supported anyway: you can grow to 20, 35, 45, 50 nodes. The complexity only starts when you want an altogether different flavor of things behind the scenes, a different flavor of load balancer and so on; then you have to go and adapt the script. A bit of information about adoption and the success stories of DC/OS. In the first six months after launch, 31,000 open DC/OS clusters were created, about 3x more than the community expected, and the best part is that 60% of that was on-premise. If you can do something on-premise, it will more or less follow on cloud as well. A few companies have said they were able to reduce their infrastructure by 70%, from 100 nodes to just 30 if they needed 100 initially; that has been said by Netflix and Yelp, to be precise. Uber uses it too: the GPU support I was talking about, Uber's data science team heavily uses that. So what are the heuristics for designing a system on this kind of DC/OS setup? The heuristic is that an m4.xlarge has 4 cores, so whatever containers I spawn on it should add up to a factor of 4. If all my containers use 1 core each, the complete machine gets utilized.
But if I create a Docker container of just 3 cores, then that remaining core is simply wasted. So with that approach in mind, that at any given point my machines should be 100% utilized, we have to design the compute as well as the memory usage of our containers. This is true for Kubernetes as well, and in both cases the only trick is how neatly you place your containers so that each machine is completely used. Now another use case, with the same UI: I will start applications in front of you with just one click, show how simple it is to run, say, Kafka, and run a TensorFlow job as well. Here I removed Kubernetes, because my cluster is not that big relative to what I am targeting to deploy. I go to the catalog, click on Spark, and just install it; I will be using 2.3, and with just a couple of clicks we are done. Performing the same thing on Kubernetes needs a fair bit of setup; it is not that straightforward. I am using the Docker image provided by the DC/OS containerizer, which runs Hadoop 2.6 and Spark 2.3, and Spark starts appearing. How hard is the command line interface? Pretty simple: the default commands are in the documentation, and it just runs. There are ways to modify the configurations, but I am using the defaults to show you the ease of DC/OS and on-premise deployments. Now, what if I want to grow what we configured? I increase the number of cores, and the new size is incorporated: Kafka starts utilizing the bigger infrastructure. I increase the broker count and raise the CPUs per broker from one to two, so in total I now have four cores of infrastructure; it starts adding, and the utilization shows up. From there I open the Spark UI, already configured. These were a few of the applications I tried to execute; one launched a couple of driver processes, which means Spark is accepting Spark applications and the configuration is fine, otherwise the daemon would simply reject them. Nothing much changes in how you work: the command is simply dcos spark run, submitting the application with its arguments. Nothing changed in the commands or the approach, and no code changes were required in my Spark application. Now I will deploy a custom container and run a TensorFlow job, creating a complete configuration for it. I will delete Spark and Kafka together, since that work is over, because of what I am going to start next. The strategy here allows single containers as well as multiple containers; right now I am going to use a single container. This is the configuration required: it says my app ID is tensorflow, the number of CPUs is two, the number of GPUs is 0, and memory is 2048.
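That single-container configuration can also be posted straight to Marathon's REST API instead of through the UI. A sketch mirroring the values from the demo; the Marathon address and the Docker image name are placeholders:

```python
# Deploy a single-container TensorFlow app via Marathon's /v2/apps
# endpoint, mirroring the demo's config. Host and image are placeholders.
import json
import urllib.request

app = {
    "id": "/tensorflow",
    "cpus": 2,        # CPUs, as in the demo
    "gpus": 0,
    "mem": 2048,      # MB
    "instances": 1,
    "container": {
        "type": "DOCKER",
        "docker": {"image": "example/tensorflow:latest"},
    },
}

req = urllib.request.Request(
    "http://marathon.example.com:8080/v2/apps",
    data=json.dumps(app).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```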
So what exactly do we need to do if we want to leverage a GPU? All we have to do is use the Amazon AMIs where GPUs are available, instead of m4.xlarge, and then we can utilize the GPUs. From the CLI I configured, it is straightforward Docker commands: I go into that same container and do a bit of updating, because the Docker image I am using is a bit old and does not carry the newer versions I want, and then I take a sample example available on the internet and run it. That is the exercise; my TensorFlow job gets executed. Again, nothing big is being done, but you can see that within a few minutes you have your infrastructure ready, with Spark, Kafka, and TensorFlow on it. I run this convolutional example, pretty simple, nothing like real data science, just a simple TensorFlow job. It gets accepted, and it normally takes about 11 minutes to finish on this infrastructure, so it will take a bit of time. So what did we get? For that infrastructure, with the heuristic of placing containers properly: initially, if you summed up all the cores across all the machines, we were using around 900 cores at any given point; that came down to 650, which on 4-core machines means just about 160 nodes. With a very basic setup and the basics done correctly, we got a gain of around 28% without breaching the SLAs. Also, since deployment is so simple, the number of people required to manage the complete infrastructure and run POCs went down, and the number of infrastructure tickets raised by the dev teams is about the same as before and quick to handle. In terms of code changes: none. There is no code change in Spark to run it on DC/OS; it is all external. You configure the jars properly, fire the proper command, and it executes. For Kafka, we had scripts: to run a console producer on Kafka you have kafka-console-producer.sh, so we created a custom console-producer script for our use case, and whatever command changes were required stayed hidden behind it; for developers it looked more or less the same. Building this took a team of around four people about 2.5 months to deploy the first Spark job successfully; as the very first step we targeted only Spark jobs and converted them onto this infrastructure. What we are doing now: as I said, we currently use Marathon, the orchestration engine built into DC/OS, but we aim to utilize Kubernetes as well, so we are studying that. Along with it we are analyzing Kubeflow, the library that brings model parallelism and data parallelism to GPU jobs, and TensorFlow is now available out of the box, again in beta, so we are evaluating that too. That is pretty much it from my side; I am open to questions now.
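The custom producer script is not shown in the talk; as a hedged sketch, such a wrapper can be as thin as this, with the script path and broker list as placeholders:

```python
# Thin wrapper around the stock kafka-console-producer.sh: developers
# keep the familiar interface while the new cluster details stay
# hidden. Path and broker list are placeholders.
import subprocess
import sys

KAFKA_BIN = "/opt/kafka/bin/kafka-console-producer.sh"
BROKERS = "broker-1.example:9092,broker-2.example:9092"

def main():
    topic = sys.argv[1]
    cmd = [KAFKA_BIN, "--broker-list", BROKERS, "--topic", topic]
    # Pass stdin straight through, exactly like the original tool.
    return subprocess.call(cmd)

if __name__ == "__main__":
    sys.exit(main())
```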
Hi, I was just interested to know how you are managing the storage part; I see the compute part in the slides, and around 55 GB of data per hour. Do you have any decoupling in place between compute and storage? Frankly speaking, the storage concept stays the same. The on-premise-style work we primarily did on AWS, using S3 in the same region: we started the m4.xlarge machines in the North Virginia region and used buckets within that same region to get near-real-time reading and writing. Besides that, for machines that are truly on-premise, you mount more disks the way you would on any normal system, or you go with a NAS configuration; that also works here. Storage-wise there is no particular improvement from this move: initially we were using S3 or GCS for our jobs, and we converted that over as-is, so the data size is what it is and the same storage concepts apply. If I want an on-premise cluster, all that has changed is the interface on top, DC/OS; whatever operations you run, if your hardware is good the cluster behaves well and fast, and if it is on the lower side, even within the commodity-hardware range, the cluster will not behave well. Those conditions apply. Swapnil, I think you have shown us a great demo, it is really amazing. I am assuming you have an underlying OS on the nodes? Yes. But no virtualization; it is the bare OS you are using? Yes, bare machines. Although, talking about Google Compute: on GCP we do not have physical machines, it is VMs, and a VM also works; after all, a machine is a pool of resources, whether it is a VM or a bare-metal machine. And you have shown it without GCP, right? Yes, this was without GCP. So you can use the bare OS and put DC/OS on top of it? Exactly; there is no restriction on the type of operating system. Is there any optimization you visualize between DC/OS and the OS? No, nothing as such. It is more that, as we said, there is a master and five slaves, and they have to work in a closed environment; all the basic rules apply, because that was the Ansible script: you define your private network, you hide it behind the load balancers, and so on. The basic requirement is a cluster where the DC/OS master daemon runs on one of the nodes and the slave daemons run on the others; they have basic connectivity requirements, and if we make sure those hold, we are good. You mentioned that the nodes you created are all of the same type. Is it possible to create different types of instances and have certain applications run only on certain types of nodes, some kind of application affinity to nodes that can be defined? The type of machine going into your infrastructure is part of your Ansible script; DC/OS has no role to play in that.
The CloudFormation template I used is as it comes, but if you want multiple types of machines added to the cluster, you can have that; it is that level of configuration. Now, steering a workload to a machine: if I fire a normal Java application, it can execute on any machine available in the cluster, and we cannot predict which. But if I have, say, just five machines with GPUs, we can guide those applications onto them, because in the configuration file you specify CPU, GPU, and memory. I was wondering if there is node affinity that can be added, a resource allocation policy where you specify which node or which class of nodes to run on. I am not very sure about that, frankly, because we had tested mostly the GPU side, and we had moved Spark; once I have the experience of utilizing that effectively (we are using TensorFlow for it), I will be in a better state to answer. Thank you. There is one more question. OK, you can take the question offline; thank you. The next speaker is Devi A S L, who is going to talk on growing with Elasticsearch. Devi worked with Solr and then fell in love with Elasticsearch, its architecture and APIs. She enjoys being close to nature and likes taking long walks on beaches; she also likes to create things out of clay and is now trying her hand at painting. Devi. Good evening everyone, I am Devi, here to share with you all how we have grown along with Elasticsearch. A bit about me before going into the topic: I have over a decade of experience building software, and currently I work as a lead developer and architect at PowerToFly. PowerToFly is a small startup based in New York, a recruiting platform for women. Its core mission is to connect women with the roles they deserve; that is the core of our platform and of our business, and that is where the search for search began: which search engine should we use, and so on. The agenda of today's talk: I will take you through our journey with Elasticsearch, how it started, why we started with Elasticsearch, and how we leveraged the Elastic Stack over these years to fit our needs. We launched in 2014 with a very small team of two developers, and we launched our search on Postgres full-text search. That was only for admins to search: all the matching of jobs and candidates was being done manually by our admins on a Postgres full-text search interface that was not open to users. Later, in 2015, we released faceted search built on Elasticsearch version 1.4, which was current then. The next year Logstash was prominent, and Elastic was known for the ELK stack, so we used it for our log monitoring system. Then, when Kibana became more prominent and had gotten really powerful dashboards, we utilized it for our analytics pipeline; that was with 5.5, which we currently use. I will take you through this journey, the performance issues we have faced so far, how we got through them, and how we made the best use of Elasticsearch. As you can see here, we have grown our user base from 2015 to over a million users on our platform, candidates looking for jobs, and as we rolled out features, the user base increased along with the number of documents in Elasticsearch: in 2015 we were nowhere, a few thousand documents, and now we have more than 20 million.
Back to 2014, when we were looking for a search engine that we could put in front of users: results within a few clicks, nothing taking too long. What did we consider? We already had Postgres, as I said, which is why we went looking at search engines at all. We considered three of them, all well known at that point. Sphinx, known then for very fast indexing rates, and fast search too. Solr, which had been around for over a decade by then, was built on Lucene, and was very popular as a search engine. And Elasticsearch, at version 1.4, which was slowly coming up; Logstash and Kibana had no prominence at that point. We took these three solutions and compared what we needed against what was there. Our users are candidates: they upload their resumes, PDFs and docs, to our website, and for matching candidates with jobs we needed the search engine to be able to search inside the PDFs and documents the users submit, rather than us re-parsing every document; resumes come in all kinds of formats, and parsing is itself a big problem. Elasticsearch and Solr can search inside PDFs and documents directly without you parsing them; Sphinx did not have that functionality. So the contenders were Solr and Elasticsearch, and we finally chose Elasticsearch for its cluster readiness: Elasticsearch is a distributed search engine by design, whereas SolrCloud was emerging at that point but not really cluster-ready. Elasticsearch also won because it has a powerful and flexible query DSL: you can combine any queries to build your search algorithms. It supports very powerful aggregations: nested aggregations one, two, three levels deep, and statistical aggregations; along with the max, min, count and so on that are common in other search engines, it also provides moving averages and standard deviations. And the best thing about it: it has REST APIs for everything. You can open your command line, put in a curl, and get things done, from searches to monitoring your cluster and managing your indexes, all through the APIs; you do not need to log in anywhere special. It has very good support for nested documents and parent-child relationships: Lucene, the underlying engine, does not support nested documents on its own, but Elasticsearch does. And it has a suitable ecosystem for data pipelines. All of that goodness was crucial for our needs. This is how search on PowerToFly looks today: a client can type in any complex query (you can see "programming" or "test driven development" there), put in a location, and then refine the talent results through the facets, which are powered by Elasticsearch aggregations. Clients were very happy, and we were happy too. That is about the search we rolled out in 2015.
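To make the facets concrete: a sketch of such a query with the Python client, combining a full-text match with terms and stats aggregations. The index and field names are invented for illustration, not PowerToFly's actual schema:

```python
# A faceted search: a full-text query plus aggregations that power the
# filters in the UI. Index and field names are invented.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

resp = es.search(index="talents", body={
    "query": {
        "bool": {
            "must": {"match": {"resume_text": "test driven development"}},
            "filter": {"term": {"location": "new york"}},
        }
    },
    "aggs": {
        "skills": {"terms": {"field": "skills", "size": 10}},      # facet buckets
        "experience": {"stats": {"field": "years_experience"}},    # statistical agg
    },
    "size": 20,
})

for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_source"].get("name"))
for bucket in resp["aggregations"]["skills"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```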
Now that we were using Elasticsearch primarily for search, how does the data flow? We use a Postgres DB for our core data: all the jobs and candidates, all the basic data, live in Postgres, and we have an indexing job which pulls data from the DB every minute and pushes it to the Elasticsearch cluster. Requests from users coming in over the internet are sent to a search microservice; we built a microservice just for search and the recommendation algorithms, and it talks to the Elasticsearch cluster. The search service parses the requests, converts them into Elasticsearch queries, gets the response back, and returns it to the user. That is how data flows from Elasticsearch to the user and from the database into Elasticsearch. Once Elasticsearch was in our tech stack, Logstash entered the stack in 2015, and Kibana was just emerging in 2016; this is the typical ELK stack, which has become very common. How it works: you have the servers, web and worker nodes (we have a bunch of microservices and the main web application, plus Celery worker nodes), and Filebeat runs on all of these servers. Filebeat pushes the logs to Logstash, and Logstash preprocesses the data and adds metadata (for example, it derives geo information from the user's IP in the logs), takes a backup onto AWS S3, and pushes daily indices to the Elasticsearch cluster. So the log data collected from all those servers reaches the Elasticsearch cluster, and we developers can view the logs on Kibana dashboards. Before we had this solution, we had to log into each of the machines, find the logs, and grep or tail for a particular thing. This is how the Kibana Discover tab looks: you can search by the source of the log, whether it is from Celery or nginx or the web application or a microservice, pick the index, give a time frame to search in, and find the log. It is much, much easier than logging into n servers and hunting for where a log came from in order to debug or trace something; this saved lots of developer time and boosted productivity. That was 2016, and while we were working on all this, Elastic had improved Kibana so much that the dashboarding capability grew; Kibana supports any kind of graph on any kind of data. Now, what do I mean by an analytics pipeline? I mean capturing the user activity. Say a user comes in and views a job: I want to see which jobs are getting more views, which jobs are getting more applications, or what search keywords clients are using, the general analytics stuff. We do have Google Analytics and Heap analytics on the frontend, but getting that data into the backend, to use directly or to pump into the algorithms, is not an easy job. So how did we build the analytics pipeline? Users interact with the web application: open a job page, view a page, click a button, search for a job. These are all activities, and we capture them as logs on the web server nodes where Filebeat runs.
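Going back to the indexing job that feeds search: a minimal sketch of that kind of job, pulling recently changed rows from Postgres and bulk-pushing them into Elasticsearch. The table, columns, and index name are assumptions:

```python
# Pull rows changed in the last minute from Postgres and bulk-index
# them into Elasticsearch. Table, columns, and index are assumptions.
import psycopg2
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])
pg = psycopg2.connect("dbname=core user=indexer")

def changed_candidates():
    with pg.cursor() as cur:
        cur.execute(
            "SELECT id, name, skills FROM candidates "
            "WHERE updated_at > now() - interval '1 minute'"
        )
        for cid, name, skills in cur:
            yield {
                "_index": "candidates",
                "_id": str(cid),   # keep DB ids as strings/keywords, not ints
                "_source": {"name": name, "skills": skills},
            }

helpers.bulk(es, changed_candidates())
```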
Filebeat picks up that log and pushes it to Logstash, and from there the typical ELK pipeline works again: once the activity is captured as a log, the whole workflow starts and the log finally lands on the Elasticsearch cluster. How do we model user activity as a log? We have a source, an action, and a target. What do I mean? Say a user comes onto the platform and views a job: my action is view-job, the source is the candidate with ID XYZ, and the target is the job with ID XYZ. That is how we model the documents that reach the Elasticsearch cluster. Now that we have the analytics data, how users behave and interact with the platform, in Elasticsearch, we can model queries to find interesting patterns in user behavior and feed them as input to the recommendation engine, get interesting results out of it, and use those in turn in the search ranking and matching algorithms, which can then make better decisions because they have user behavior as input. And because all this user activity is already in the Elasticsearch cluster, the higher management, the CXOs, who are not really techies, can view all the statistics of user behavior using Kibana dashboards. So although the source of the data is a single store sitting in Elasticsearch, the algorithms can make use of it while it is also being viewed on Kibana dashboards; this is how such a dashboard typically looks, and how senior management views the stats. With that I will switch to handling growth. We have grown over the years from thousands of users to over a million, and the talent search I showed is very crucial for our business, so it has to perform really well, with no downtime at all, because it is the core of our platform. When we touched the one-million-user base, lots of queries were being timed out by Elasticsearch, and the first thing we did was enable the slow query log. This can be switched on and customized per index; we did not switch it on for all indexes, only the important ones we query the most. This is how you enable the slow query log for an index, and as soon as we switched it on, we had lots and lots of queries filling the logs. One of the culprits was an analytics-pipeline document with a target like this: you can see the target is a list of dictionaries, an action some talent is taking on the platform, with a typical nested target document. We flattened that document: the list of dictionaries became a dictionary, and that cleared all the slow logs from the analytics pipeline.
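The slow query log is a dynamic, per-index setting; a sketch of enabling it from the Python client, with example thresholds:

```python
# Turn on the search slow log for one important index.
# Threshold values here are only examples.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

es.indices.put_settings(index="talents", body={
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
})
```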
The second thing was deep pagination. You can see in the request on the left that the from is 50,000; that is deep pagination, and Elasticsearch's search API is bad at it. Once you go beyond a certain point and request pages that far in, deep pagination becomes really costly and you start facing timeouts again. What is the solution? Use the scroll API, shown on the right, so that Elasticsearch saves the search context and keeps the caches warm; the caches are not lost, and there are no timeouts. All the timeouts were solved just by switching from the search API to the scroll API; it should be used whenever you fetch large data sets out of Elasticsearch. That removed more of the timeouts. Then, managing our indexes: because log monitoring also lived in Elasticsearch, all our old, unused indexes piled up, and we were not caring about the old logs; typically a developer is interested in the recent logs, the recent happenings of their life, right? There is a close API: you can close indexes, and once closed they are still there but consume no CPU resources, which saves a lot, and you can open them whenever you want to search in them, just by posting an open API call. Then there is the force-merge API: whenever you see that an index has lots of segments, that lots of segmentation has happened for whatever reason, do a force merge in off-peak hours. It really should be done in off-peak hours, because the force-merge API consumes a lot of CPU; it buys a lot of performance. And last, the rollover API, released in 5.x and not there before: it lets you manage your indexes by recency. Say you are interested only in recent indexes, the indexes with the data you care most about; those can use the best of your servers. You tell Elasticsearch that those documents should sit on the faster machines, and after the time frame you set, the rollover API takes care of moving them to the lower-tier servers by itself; you do not have to manage the indexes. That was also very useful for us. That was all about search performance tuning; here is what we did for index performance tuning. Elasticsearch comes with very many defaults, and if you just use the indexes that get created automatically, without really looking into them, you will be in real trouble. You can disable indexing: when you define your mappings, there will be fields you are never going to search against but want returned in the results; tell Elasticsearch so, and disable indexing on them. Some fields you might search against but never want back in the results; disable storing on those. Whatever you do not need, disable it: disabling everything you do not need at index time saves a lot of speed and trouble and improves the indexing rate. Use the smallest numerical data type possible, or make the field a keyword: there is a keyword data type that should be used instead of integers when, say, you put database IDs into your documents. I have a candidate ID from my database and I put it into Elasticsearch for reference; logically it is an integer, but nobody runs range queries over IDs in Elasticsearch, so make it a keyword. And optimize the number of primary shards: by default an index gets 5 primary shards in Elasticsearch, and you should look at your data, decide how many replicas you need and what server sizes you want, experiment, and then go into production. Once you are in production, the number of primary shards is not easy to change; changing it means re-indexing. So be careful about it and optimize the number of primary shards.
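In the Python client the scroll API is wrapped by the scan helper; a sketch of fetching a large result set that way (index and query are placeholders):

```python
# Fetch a large result set with the scroll API instead of deep
# from/size pagination. Index and query are placeholders.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

# helpers.scan wraps the scroll API: Elasticsearch keeps the search
# context alive between batches, so deep pagination never happens.
for doc in helpers.scan(
    es,
    index="activity",
    query={"query": {"term": {"action": "view_job"}}},
    size=1000,      # per-batch size, not a total limit
    scroll="2m",    # how long to keep each search context alive
):
    print(doc["_id"])   # handle each document here
```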
And then, use bulk requests when posting or deleting: there are bulk APIs in Elasticsearch. There too, you can issue requests in batch sizes of 10, 100, 1,000, whatever fits; look at your document sizes and the servers that handle them, benchmark, and then size the bulk requests. This saved a lot of our indexing time and improved our indexing performance. Let me summarize everything. The Elastic Stack, as it is now called: it used to be just Elasticsearch, when there was no Logstash or Kibana; then people started calling it the ELK stack, sometimes with Beats added; now it is the Elastic Stack, it keeps growing, and they are opening more of it up. There is something called X-Pack in Elasticsearch, if you are aware of it. X-Pack has numerous capabilities: monitoring, roles for your users, who should be authenticating, and so on. Those capabilities come from X-Pack, which is going to be opened up in the next version, coming at the end of May or the start of June, so see if it fits your needs. And as I said, defaults are good only to start with. Whether it is the primary shards and replicas, the defaults for norms, or whether to store fields on disk: go through all the defaults Elastic has put in for you. They are there only to get you off the ground faster, not for deploying to production, so check all the defaults and tune them before you go live. Use different indexes for different data rather than creating different mappings in the same index. For our use case we had two data types, jobs and candidates; we should not put them in the same index as two different mappings, because then you cannot scale them separately. Put different kinds of data in different indexes so you can address their scaling needs independently. And the last thing: model your documents well. Think about how the data fits, and especially avoid nested documents as much as you can, unless you have compelling reasons, parent-child relationships, and so on. That is all I have; thanks for staying late. If you have any questions (we are almost out of time), maybe one over there. I work for Elastic, and thank you for giving this talk for the community. A few corrections on the last slide. X-Pack is not open source; X-Pack is open, under a custom EULA. You can go on GitHub right now, see the X-Pack folder, and probably contribute, and we are happy to talk to you about that. X-Pack is being included in the 6.3 version, which is coming soon. Also, X-Pack has a free tier, as she said: monitoring of the Elasticsearch cluster and various other things. And we have important features like SQL and Canvas coming in X-Pack which are free as well. But do not be offended that we include a commercial version; it is important, and we ship an open-source zip as well. That is it, thank you. Thank you. Devi, please stay back on the stage; we have a Q&A session coming up with Swapnil and Kashif on architecting infrastructure for scale and collaboration. I would like to invite Rishu Mehrotra to moderate the session. Hey everyone, we will just give everybody a couple of minutes to settle down, walk in, or walk out. All right, thanks again for staying back. It has been two fun days at Rootconf, and I am sure you have heard Kashif, Swapnil, and Devi deliver various presentations.
This is the last session for today, and it is more of a chat about how to scale infrastructure. As you will have heard, the folks here have had different experiences scaling their respective infrastructures at their firms. So let me start with Swapnil: when we talk about scaling (and it is a buzzword like all the other things that go around the industry), what does scaling mean to you; how would you define it? So, I will just start by describing what used to happen and what happens now. If you have a Hadoop cluster, we talk about scalability as adding nodes to it, but mind you, that is manual intervention. When exactly your application will ask for more nodes cannot be controlled from outside; your system should have the capability of scaling at runtime, performing the processing, and then just going away. It should de-scale as well: if scalability talks about adding, it should talk about removing too. And not just that. When we talk about horizontal scalability, there are two kinds of resources, compute resources and storage resources. If you want scalability at the storage level, you should not pay for compute resources; a decoupling is required there as well, and that is why serverless infrastructure and the like are coming into the picture these days. By the way, folks, if anybody has questions we would be more than happy to take them; raise a hand and we will make sure the mic reaches you. Hi Swapnil, can you hear me now? If a big data company is trying to move into containers for the first time, and you have the option of Mesos launching a Docker container, but you also have non-big-data applications like Node.js and so on: what are the advantages of launching those non-big-data applications via Docker, Kubernetes, and DC/OS, rather than just going with Mesos and Docker? Why add the additional layer of Kubernetes, and what advantage does it give? Let me check that I understood you correctly; yes. I will start with an analogy I used during the talk, the operating system running on my machine: DC/OS is that OS. On my machine I may have multiple browsers, Chrome and Mozilla both, and it is up to me which to use. Kubernetes and Marathon are more or less that: you can use Kubernetes as the orchestration engine, or Marathon. The days are gone when you built one machine sized for the peak load; what happens during the non-peak load? This situation has arisen because we now produce huge amounts of data, so the adverse effect of underutilized infrastructure is much more dominant now. It depends on the use case: in the one I shared, we were using Kubernetes, and we fully plan to bring back and reuse our work there.
Here's another question. Using all of these toolchains is one story, but look at any company with growing pains: when it starts scaling and the business starts booming, it often turns out that the architecture the company envisioned at the beginning may not scale past a certain point, and you hit a wall. How do you tackle those problems? Can you predict that wall? Can you forecast it? Could you share experiences on that?

I'm not clear what you're asking, specifically.

When you start scaling infrastructure, you always have a certain scale in mind. As the company grows, more users come in, more traffic comes in, or the equivalent events happen, and at some point you're going to run into a wall. Is it possible to predict it? One part of it is the toolchains you use to scale, which is the tech part, but how do you take the non-technical factors into consideration, for example the cost of scaling? And where do you start planning for scale: only when deployment happens, or right when the coding happens? If somebody's company and business is scaling, what are the points they need to keep in mind?

I don't think I have a good answer, actually. What I do is try to use elasticity as the driving principle behind any design, whether it's architectural, technical, or organizational. The idea is that systems should be able to scale on their own, and I try to design for that. The way I do that with technology is distributed systems; the way I'm currently experimenting with doing it in organizational design is a system that Spotify invented, their agile practices. The key point is elasticity: you can't predict when things will happen or what will happen, so you have to be prepared for quick and nimble changes and build strength for quickness of change.

Just to add to what he said: that elasticity and decoupling should be handled from the ground up. The servers, the APIs or microservices, the application that handles the requests, the database you use: everything should be chosen or designed in such a way that you can replicate each one of them and scale each one of them, so that when the big moment comes, you don't lose it.
I want to discuss what scalability actually is, because I think I'm seeing different interpretations of it here. One that was mentioned is the autoscaling capability of the infrastructure. I don't really see scalability as an infrastructure autoscaling issue; I see it mostly as a property of the application. Can it grow linearly, given enough resources? Whether your resources grow automatically or you add them manually is a secondary issue. I see scalability as a behavior of the application: does the application have the capability to grow and take advantage of the resources? I don't know if that's the right interpretation; I'd like your inputs on that.

I agree with the point. With any of the technologies we're talking about today, the processing time is a function of the infrastructure. Earlier we used to hear that the data has grown, so processing takes more time; of course, parallelism and distributed computing help with that, but along with that, the system needs the capability to scale up so it can meet its SLA. Say I have an application that requires 100 threads now and in the future requires 200 threads: we should have the capability of giving the application 200 threads in one go, otherwise the work will sit waiting behind those 100 threads. So, as you rightly said, it includes the application side as well as how the infrastructure scales up.

The way I think about this is to ask: what happens if you don't scale? What property does the universe exhibit? Essentially, if you don't scale, you're not able to service a customer; that's the net problem. There are three levers that drive that. One is the organization: sometimes your organization cannot scale simply because, for example, it doesn't have a feedback loop with the customer and cannot understand them. The second is architecture: you cannot scale because you've architected it wrong, for whatever reason. And the third is indeed application design, which is what you're talking about: the choices you made when you designed the application, for example the concurrency model. Certain concurrency models do well in some cases and others do well in different cases. So yes, I agree, there are all of these aspects to scaling, but in the end the effect is the same: miss any of them and you won't be able to service your growing customer base.

And yes, architecture is not only autoscaling, that's true; architectures have been scalable for a while. We've been skinning this cat in different ways all along. There are other aspects to architecture beyond the ability to add machines. For example, the difference between a monolith that is stateful and a service-oriented architecture with a bunch of stateless services is a function of architecture, not a function of the number of machines you can add.
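The "100 threads today, 200 tomorrow" point is really a claim about the application: given more workers, does throughput actually grow? Here is one quick, hypothetical way to check that property; the sleep-based workload and the numbers are made up, standing in for I/O-bound work:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task(_):
    time.sleep(0.01)   # stand-in for I/O-bound work (network call, disk read)

def throughput(workers: int, tasks: int = 400) -> float:
    """Tasks completed per second with a pool of the given size."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(task, range(tasks)))
    return tasks / (time.perf_counter() - start)

if __name__ == "__main__":
    # If the application scales, doubling the workers should roughly
    # double the throughput, until some other resource becomes the bottleneck.
    for w in (50, 100, 200):
        print(f"{w:4d} workers -> {throughput(w):8.0f} tasks/s")
```

Where the curve flattens tells you which lever (application design, architecture, or raw resources) is actually binding, which is exactly the distinction the panelists are drawing.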
So here's one thing: when we talk about designing for scale, we usually think in technical terms, but what about the non-technical aspects, for example cost? When we scale, how do you factor in the cost of scaling? Elasticity is a good thing, but unlimited elasticity comes at a cost. Where do you start? We said you start right from the ground up, but that's not always the case in real life; the ideal scenario doesn't always exist. So what trade-offs are usually made? You have your own experiences here, where you designed something, or took a system that was not designed very well and tried to scale it. I'm sure all of us have tried to scale systems that weren't designed for scale.

I don't really know how to answer this in terms of cost, because the way I've looked at it, and the thing I advocate, is that if a system you've built is very expensive to run, you rewrite that system; that's what happens in reality. But how do I think about cost? All I really do is try to drive a metric, or an awareness of a metric, which is: what is the cost of the core service rendered to the client? I'm going to be working with a company which is an Uber for buses, and there the unit is a booking, a ride a person takes. So the metric we drive is the cost per ride. Before we start architecting a system, we say: if we want to get to 100,000 rides, our cost per ride must be such-and-such, and then we use that as a barometer to continuously course-correct. I haven't been smart enough to do more than that for cost.

I have a follow-up on that, because we're in the same boat with cost analysis, deciding how much to put in the cloud versus elsewhere. We look at cost two ways. One cost is of course the money, but we also look at the cost of the engineers: how much effort it takes from engineers to make something happen. Obviously we can do autoscaling on-prem as well as on AWS, but implementing something like that internally, on-prem, is much more difficult with a small team. At that point the cost of AWS might offset the cost of the engineer spending the time. Often the developers outnumber the operations folks who actually do this work, so a lot of the time we take the human cost, the engineer cost, into consideration as well.

I completely agree. It's not just the infrastructure cost, the cost you can see up front; it's also how much effort goes in, how many tickets come up, what the DevOps side has to contribute. All of that, cumulatively, should decide it.

Agreed, thanks for that.
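As a worked example of the cost-per-transaction barometer described above (all figures here are hypothetical, including the engineer cost the second panelist argued for counting):

```python
# Hypothetical monthly figures for a ride-booking service.
infra_cost = 12_000.0      # cloud bill, USD
engineer_cost = 30_000.0   # people cost attributable to running the system
rides = 100_000

# The barometer: total cost of rendering the core service, per unit served.
cost_per_ride = (infra_cost + engineer_cost) / rides
target = 0.50              # agreed on before architecting, as a design budget

print(f"cost per ride: ${cost_per_ride:.2f} (target ${target:.2f})")
if cost_per_ride > target:
    print("over budget: course-correct the architecture or the process")
```

The arithmetic is trivial on purpose; the value is in tracking the ratio continuously so the architecture gets course-corrected before the gap becomes a rewrite.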
So we've covered the aspect of cost, and we've covered the technical tooling, and as was said, you start as close to the ground up as you can, wherever you can. Have any of you had experience scaling legacy systems, systems you inherited, running at a given scale, where all of a sudden the demand increases and they now need to go up? Any challenges you'd want to talk about on that front?

In my experience, I've worked on a couple of projects that were replacements for legacy systems. The reason is the data we talk about today: that scale of data simply wasn't present a few years back. A few years ago we could have talked about scaling a legacy system, maybe improving the code, maybe using better hardware, this and that. But with the amount of data we process now, and the amount we're aiming for by, say, 2025, it would either be impossible to scale the legacy system or, if possible, we'd be compromising somewhere else. That's my assumption. The projects I worked on personally had to match the numbers of the legacy systems, because those were the ones in production; if the replacement matched them, it could take over.

Any questions from the audience so far? Please feel free. I also mentioned initially that when you're making a technical choice, you have to be very cognizant of what you're paying for: do you want compute or do you want storage? Any pointers on how to make that choice? Do you prototype? How do you even arrive at it, how do you test for it, how do you figure out whether it's going to be compute-intensive or I/O-intensive?

There are a few things we look for. First, whether the thing has the features we need; that's one check. Second, what kind of documentation is available for the project and how well it is maintained. These things matter, because when you're in a lurch, you'll have nowhere else to look. Third, whether there are companies offering services on that product. That's usually a good sign: it means it's a tenable product, people have been able to sell it, and it will probably scale. And lastly, look at the delta from your current tooling to that tool. Sometimes a tool is so different from what you're using that changing to it forces changes in the adjacent systems. So sometimes you might pick something less advanced, so the change doesn't ripple across, and once that has stabilized you can make a second change to move to the end state. Those are the three or four things to look at.

Can you explain something about stateless versus stateful architecture practices when you grow a big application around them? Can you please repeat that? My question is: from your own experience building large-scale distributed systems, can you talk about stateless versus stateful applications, growing cloud-native applications, and how we should build our architecture around these practices? So, to rephrase: your experiences with stateful versus stateless services when it comes to scaling, because they are very different areas of scaling. Did I get that right?

Ideally you want stateless services, but obviously the whole world can't be stateless; it would become very difficult to be useful. Stateless services are easier to scale simply because they don't have to coordinate or wait on resources, so the cost of coordination doesn't exist. You want to default to that whenever possible: make the service stateless. When you have to make it stateful, my thought process is usually to look at what portion of it actually needs to be stateful; sometimes it's just a small portion, and you try to move that out. As little state as possible is usually good. One way of reducing state is transferring it to the client, making it the client's problem. It's not always a good idea, but sometimes you say: I can't make this stateless, but can I send the state to somebody else, can I make it someone else's problem? If you can, that's usually a good solution. So I try to reach for stateless services, but it's not always possible.
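One common way to "make state the client's problem", as just described, is to sign the state and hand it to the client, so that any stateless server replica can verify it later. A minimal sketch using only the Python standard library; the secret key and the payload fields are placeholders, not anything from the panelists' systems:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"replace-with-a-real-key"  # placeholder; keep out of source control

def issue_token(state: dict) -> str:
    """Serialize the state and sign it, so the server keeps nothing."""
    payload = base64.urlsafe_b64encode(json.dumps(state).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{sig}"

def verify_token(token: str) -> dict:
    """Any stateless replica sharing SECRET can validate client-held state."""
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("tampered state")
    return json.loads(base64.urlsafe_b64decode(payload))

token = issue_token({"user": "alice", "cart": ["sku-1", "sku-2"]})
print(verify_token(token))
```

The trade-off, as the panelist hints, is that this isn't always a good idea: tokens grow with the state, and revoking or mutating state held by the client is harder than updating a row you own.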
Yeah, I completely agree: the aim is to be stateless, but there are situations where you just cannot be. Say your system is ingesting a huge amount of data and you cannot rule out one of your components failing; you need failover, because with growing data and growing load you still want to meet the same SLA, so you need to keep that state. The big-data applications, Spark or Hadoop or, for that matter, dataflow systems, talk about why they're good precisely because they can restart from exactly the place where they failed. But ideally speaking, stateless is fast and easy to handle.

When you talk about different services, one of the key aspects is when a lot of services, a microservices kind of architecture, come together and start working with each other. Do you have any pointers or thoughts on how to scale an entire ecosystem of services that works together to meet increasing demand? What may happen is that you have a single entry point to your system, and then the request fans out to multiple services, and each of those services has its own performance characteristics and its own SLAs to its upstreams and downstreams. Any thoughts on scaling or designing infrastructure for those services, where you might allocate resources for one given service, but if they're not used, they can be taken up by somebody else who needs them?

When you have one application running on one system and you talk about scaling, you're doing vertical scaling, and vertical scaling is going to hit a roadblock. For example, if you're on a cloud platform where the maximum you can assign to a particular VM is 64 vCPUs and your requirement goes above that, it's simply not possible. So vertical scaling cannot be an option for a data-intensive or compute-intensive application; it's going to hit a roadblock. And looking at exactly this, with Amazon you get something like Lambda, or on Google Cloud Platform you get App Engine, where if you use a small amount of resources it charges you for that, and as your load grows it scales toward effectively unlimited resources and charges you for the time used. Behind the scenes they use containerization. We have Kubernetes and the like doing containerization, but frankly not yet at the level of what Google runs behind the scenes: Google has given the world a small subset of the functionality of their big containerization system, Borg, and that subset is Kubernetes, and the whole world is talking about it. They've done it; how they do it will come out someday. But right now, if you have just one system and you intend to scale it vertically, it's going to hit the block once your resources are consumed.
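Since a single machine tops out (the 64-vCPU ceiling mentioned above), the escape hatch is partitioning the work across many workers rather than one ever-bigger box. A toy illustration of that fan-out, with local processes standing in for what would be separate machines in a real cluster; the workload is a made-up aggregation:

```python
from multiprocessing import Pool

def process_partition(chunk: list) -> int:
    """Stand-in for real per-partition work (parsing, aggregation, ...)."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Partition the data set; each chunk could live on a different node.
    chunks = [data[i::8] for i in range(8)]
    with Pool(processes=8) as pool:     # 8 workers instead of one big VM
        partials = pool.map(process_partition, chunks)
    print(sum(partials))                # combine the partial results
```

The design point is that the partition-then-combine shape has no per-machine ceiling: when the data doubles, you add partitions and workers instead of shopping for a larger VM.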
I think the question was how you scale a bunch of services together. The way I look at it: with a monolith you had a single point of entry, so you had to scale the whole monolith, and one bottleneck at the top would affect the quality of service for the rest of the monolith's code paths. When you break it down into services, you can tackle each part separately. What you can do with microservices, when scaling a bunch of them together, is make them truly distributed: don't have one API gateway for your clients, have multiple API gateways, and let them deliver directly to the client. And where possible, use peer-to-peer choreography as opposed to orchestration through some kind of master-slave or semi-centralized architecture. It's better to have peer-to-peer interactions and multiple gateways, not one gateway.

All right, we'll take two final questions from the audience if you have any quick ones, and then we'll close, as we're running over the scheduled time.

Hi, I've got a question for you, but it's more specific than the others we've been tackling so far. During your talk you spoke about how you sync data from Postgres to your Elasticsearch cluster, with a job that ran every minute. What I'd like to ask is: do you sync your entire data set to Elasticsearch every minute, or do you have a way of finding the incremental changes that have taken place and picking up only those to index or re-index into the cluster?

Could you be a little clearer? The audio was a bit harsh, sorry. You spoke about a job that runs every minute to index data from Postgres into the Elasticsearch cluster. Do you sync the entire data set every minute, or do you have a way of finding just the incremental changes and syncing only that much data?

The one minute came from the business, not from the technical side. As I was saying, Postgres is our primary data store, and our clients, the talent, the candidates, don't perceive a lag if it's one minute or less. That's why we said one minute is fine for our requirements; we don't need it to be exactly real time. When a candidate changes her profile, the change goes to Postgres, and it doesn't need to be synced into Elasticsearch and reflected in the search results in real time, so we were good with that. But if your application has real-time needs and you must sync to Elasticsearch quickly, then, as I said in my last slide, you should try the bulk APIs and see how much your cluster can take in for indexing, then double it, and keep doubling, to find the right batch size for what you can ingest. That should decide your time limit: if your Elasticsearch cluster can take n thousand documents a second, look at how much your application is pushing and how quickly you reach that rate, and let that drive the interval. Technically, Elasticsearch can index far more than tens of thousands of records; on bigger clusters it can ingest about a million records, and I have read of use cases like that. Whether you need that, whether you want to maintain clusters that large, whether you have that much data flow: the business needs should define it. Did I answer your question? Yeah.
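The incremental variant the questioner was asking about is typically done by tracking a high-water mark (say, an `updated_at` column) and pushing only the rows changed since the last run through the bulk API. A sketch under those assumptions, using the common `psycopg2` and `elasticsearch` Python clients; the connection strings, table, and index names are made up, not the speaker's actual schema:

```python
import psycopg2
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")
pg = psycopg2.connect("dbname=talent")

def sync_incremental(last_run):
    """Index only the rows that changed since the previous run."""
    with pg.cursor() as cur:
        cur.execute(
            "SELECT id, name, skills FROM candidates "
            "WHERE updated_at > %s",
            (last_run,),
        )
        actions = (
            {
                "_index": "candidates",
                "_id": row[0],
                "_source": {"name": row[1], "skills": row[2]},
            }
            for row in cur
        )
        # Tune the batch size as described above: start modest and keep
        # doubling until the cluster stops keeping up with ingestion.
        bulk(es, actions, chunk_size=500)
```

A real deployment would also persist the high-water mark between runs (and handle deletions, which an `updated_at` filter alone won't catch); the full-reindex approach sidesteps both problems at the cost of resyncing everything.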
Unfortunately, we are overshooting on time, so I'd encourage folks to please sync up with the panelists individually outside the hall; there's a time limit on the venue, sorry, you can blame him. All right, thanks a lot, everybody, thanks again for your patience, and thanks to the panelists. I hope you've had a fun two days at Rootconf. Are we happy? Thank you, all of you.