So, a very good morning, and thanks a lot for joining me in this session on Unified Monitoring of Containers and Microservices. I am Nishant Sahai, I am with Wipro Ltd, working in the open source COE lab at Wipro. I have been working on unified monitoring and computer vision for the last couple of years, and I am part of the data analysis team. I would also like to take the opportunity to thank my colleague Bhavani Anant, who is an architect at Wipro Ltd. She was the co-speaker for this talk but had to take leave due to a personal emergency. She has been instrumental in conceptualizing this entire unified monitoring framework and has deep expertise in log analytics. Coming back to today's talk, the goal of the presentation is to explain how we have utilized the power of Zipkin, along with the intelligence of the Apache Spark ecosystem and the flexibility of the Elastic stack (Elasticsearch, Logstash and Kibana, plus Beats), to create this unified monitoring framework. I have divided the talk into three parts. In the first section I will set up the context as to why unified monitoring is required for microservices and containers. In the second section we will go through the solution overview, and in the final section I have a solution demonstration in which we will walk through a few use cases for unified monitoring, including anomaly detection.

So here we are: this is a typical microservices use case, and you can map it to any microservices use case you might be implementing. For me it maps to an online e-commerce application which historically was a monolithic app. It was not able to scale and had issues with speed of go-to-market, so we adopted a microservices architecture. Since the microservices architecture is built on modularization, it helps us reduce complexity: each high-level module can now be considered a service, and each service's functionality maps to a business capability. With microservices we achieve service isolation: no two services are tightly coupled, they have their own release cycles, and they do not have very tight integration with each other. Microservices also give us high scalability: we are no longer scaling up the entire monolithic app when it needs to scale; instead we can pick and choose the services which are of the utmost importance and have the highest resource requirement, and scale those up. They give us flexibility in the sense that services can now be written in their own programming languages, so you can write an account service in Python while your customer service is written in Java. And finally they give us better resource utilization, since you can scale services up and down as per the requirement you have in hand.
But, like everything, microservices come with a cost, and the cost is mostly network overhead: we now have n services interacting with each other over HTTP, so there is a lot of network overhead. Then again, the services can be written in their own programming languages, so they have their own deployment approaches and their own release cycles, and it becomes an ops nightmare. So finally this awesome trio comes in, where we have microservices deployed in containers and running in a Kubernetes cluster. What we have achieved is service isolation, as we have already seen with microservices; technology-agnostic hosting, with the combination of containers and microservices; and container orchestration, with cluster and workload management taken care of by Kubernetes.

Now we have a lot of moving pieces: a huge number of microservices, a similar number of containers, plus pods, nodes and data centers, and there are applications which have not yet moved into the Kubernetes cluster and are still running on bare metal in their own private data centers. So the amount of data that is generated in the form of logs and metrics is overwhelming; the sheer volume, velocity, veracity and variety is simply mind-boggling. But we cannot ignore the data generated as logs and metrics, because it is of deep importance to both the ops and the business teams, and in the current business landscape a data-driven approach affects both the top line and the bottom line of an enterprise. As per a 2015-16 study by Enterprise Management Associates, almost 65% of enterprises use more than 10 monitoring tools to keep tabs on their applications and their infrastructure, and the same study also states that almost 71% of enterprises spend around five hours to fix an average performance issue. Now that is quite understandable: you have an application that was running fine for the last couple of months or years, and slowly its performance starts degrading. It may be because of a recent release or patch, it may be because the database it connects to has some kind of slow-query issue, or it may be because an external service it interacts with is having performance problems. But with 10 monitoring frameworks or apps in your hand, how do you go about figuring out where that particular problem is? This is where it becomes imperative for enterprises to have a strategic and robust unified monitoring framework in place.

So this is where our need for a monitoring solution came in: as I have mentioned, there are hybrid deployments, there is high data volume, there are multiple monitoring frameworks, and there is a need for integrated insights. One more important thing enterprises are looking for is to figure out issues proactively rather than rectify them reactively, and this is where an analytically intelligent platform comes into the picture. So our unified monitoring framework has a wish list. It has customizable reports and dashboards, so that business or ops teams have at least a single window through which they can go into the monitoring solution and figure out where the problem is. We need to have support for programmable interfaces.
Again, with respect to the flexibility of enterprise integration, we need interfaces which can be extended, so it is more about having open source frameworks that adhere to open standards, so that we can customize the solution as per the application and enterprise need. As I said, we want an operational intelligence platform, because we need a proactive approach of issue identification rather than a reactive approach of issue rectification. And finally, DevOps friendly: this was one of the main items on our wish list, and we achieved a one-click kind of deployment through Ansible, so our entire unified monitoring framework can be deployed with one click, with an Ansible script running behind the scenes. The list of assets we are monitoring as of now spans infrastructure, containers, platforms and applications. From the infrastructure perspective we are monitoring bare metal, virtual machines, IoT devices and network services such as HTTPD; from the container and platform perspective it is Kubernetes, Docker containers, messaging platforms and web servers; and from the application perspective it is application logs, application and framework monitoring logs, and even events and traces.

Now that we are done with setting the context as to why unified monitoring is required, we come to the second section, the solution conceptual view. Again we take the same typical microservices use case shown on one of the previous slides, targeting an online e-commerce application. Specifically, from the demo perspective, there is one particular API, the prod-info API, which we will show at a later stage; that API is orchestrated through multiple services, namely prod-detail, prod-inventory and prod-review. From the conceptual point of view, for all the monitored assets the data, in the form of logs and metrics, is pushed through exporters into the data acquisition layer, the exporters being the Zipkin emitters, the Elastic Beats and the Logstash shippers. Through the data acquisition layer, consisting of the Zipkin collector, Kafka and Logstash, the data is pushed into Elasticsearch, where finally it is utilized by Apache Spark for predictive analysis, by Zipkin and Kibana for visualization, and by Sentinl for alerting and reporting.

The first important layer of our solution is the ingestion layer, which consists mainly of the Zipkin emitters and the Elastic Beats. Microservices are both fast and flexible, but when a performance issue comes in it is very tough to debug and do any kind of profiling. The reason is that a single microservice will never give you the complete picture of your entire application, and in a normal production deployment, where the services have very complex integrations, it is very tough to find out where a particular service is deployed, in which container, in which pod or on which host. This is where distributed tracing helps us: whenever a request passes through each of the services, metadata is injected into that request, and finally that information is aggregated and you get a service transaction graph in almost near real time. For our emitters, in our demo application, which consists of Node.js and Spring Boot apps, we are using the Spring Cloud Sleuth and zipkin-js components (a minimal Python sketch of how such an emitter works is shown after this paragraph).
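To make the emitter idea concrete: the demo used Spring Cloud Sleuth and zipkin-js, but a minimal sketch of the same pattern in Python, using the py_zipkin library, could look roughly like this. The service name, collector endpoint and sample rate are illustrative assumptions, not the demo's actual configuration.

```python
# Hypothetical Python emitter using py_zipkin (the demo used Spring Cloud
# Sleuth and zipkin-js instead); endpoint, names and sample rate are assumed.
import requests
from py_zipkin.zipkin import zipkin_span


def http_transport(encoded_span):
    # Ship the encoded spans to a Zipkin collector over HTTP; in the talk's
    # setup the spans travel via Kafka instead of a direct HTTP post.
    requests.post(
        "http://localhost:9411/api/v1/spans",
        data=encoded_span,
        headers={"Content-Type": "application/x-thrift"},
    )


def get_product_info(product_id):
    # Wrap the unit of work in a span; py_zipkin attaches the trace metadata
    # that downstream services can pick up and continue.
    with zipkin_span(
        service_name="prod-info-service",   # illustrative service name
        span_name="get_product_info",
        transport_handler=http_transport,
        sample_rate=100.0,                  # trace every request, demo only
    ):
        return {"id": product_id, "status": "ok"}  # placeholder work
```

Any downstream call made inside the span can carry the generated trace headers, which is how the per-request metadata mentioned above propagates from service to service.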
There are also Ruby emitters, and many more emitters are provided by the open source communities on GitHub which can be utilized as per your application needs. The emitted spans are pushed into Apache Kafka, and from there the Zipkin collector picks them up and pushes them into Elasticsearch. Zipkin provides the capability both of collating the trace information and of visualizing it. The second important component in our ingestion layer is Elastic Beats. Beats are open source, edge-based shippers written in Go. Quite a few Beats are made available by Elastic by default, like Metricbeat, Filebeat, Packetbeat, Heartbeat and Topbeat, and you can also look into community-driven Beats; two of the best known, which we have utilized, are Springbeat and Kafkabeat. As you can understand from the names, Metricbeat captures infrastructure metric information, Filebeat captures all the log-related information, and Packetbeat sniffs the network traffic and gathers network metrics; all of these are finally pushed into Elasticsearch.

Elasticsearch is an open source text search and analytics engine. It can store a huge volume of data, and you can do fast searches and fast analytical operations on the data stored in it. Internally it uses the BKD tree structure for multi-dimensional data, which helps optimize disk space as well as indexing and search. We have also used a lot of its aggregation features, which help in generating analytical and statistical dashboards, and there is the concept of the ingest node, whose aim is to enrich and process the data before a document gets indexed into Elasticsearch. There is support for cross-cluster search, which helps us scale Elasticsearch, and there are the rollover and shrink APIs, so historical data which is not being accessed can be shrunk and the disk space utilization optimized.

Now that we have all the data shipped from our APIs, from our microservices and from our Kubernetes cluster and containers into Elasticsearch, the next part of our unified monitoring solution is to do some predictive analysis. For the prod-info API, we want to make sure that if there is an issue with latency, or a huge number of requests returning a 404 status, it is handled in a proactive manner, in which the ops team is notified before the actual SLA has gone beyond the required limit.
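As a concrete illustration of pulling such metrics back out of Elasticsearch, whether for a dashboard or for the anomaly checks described next, here is a hedged sketch using the official Python client. The index name and field names (service.name, duration_ms) are assumptions for illustration, not the actual schema used in the talk.

```python
# Hedged sketch with the official elasticsearch Python client; index and
# field names are assumptions, not the talk's real schema.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed local Elasticsearch

body = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"service.name": "prod-info-service"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    "aggs": {
        "per_5m": {
            "date_histogram": {"field": "@timestamp", "interval": "5m"},
            "aggs": {"avg_latency_ms": {"avg": {"field": "duration_ms"}}},
        }
    },
}

# Average prod-info latency in 5-minute buckets over the last hour.
resp = es.search(index="traces-*", body=body)
for bucket in resp["aggregations"]["per_5m"]["buckets"]:
    print(bucket["key_as_string"], bucket["avg_latency_ms"]["value"])
```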
So this is where the predictive analysis approach comes in, in which we are using the Apache Spark ecosystem, which has the Streaming, MLlib and Spark SQL components; we are utilizing PySpark and the Spark Core API for model generation. The data collection is already done by Elasticsearch, the preprocessing is done with Spark RDDs and Spark SQL, and for the ML model generation we are utilizing K-Means for anomaly detection (a minimal PySpark sketch of this is shown a little further below). Once the model is generated, we work with a time window: a scheduler runs for that particular time window, fetches the data from Elasticsearch and sends it across to the model to infer whether there is any kind of outlier present in the data in that window, and then pushes the result back into the DB for the next component, Sentinl, the alerting and reporting component.

Sentinl is a Node.js plugin for Kibana, which again falls into the category of extensible programmable interfaces that we are adhering to in our framework, and it works in three main steps. We have a watcher configured in Sentinl which, at a scheduled interval, sends out a request to Elasticsearch with a script; once the script is executed it gives either a positive or a negative response, and if the response is positive, i.e. an alert needs to be triggered, the action is taken and the alert is fired. You can also configure reports in Sentinl and schedule the interval at which a report needs to be generated and sent out to the ops or business teams. Now you have the outlier with you, it may be in your email, and using that outlier detail Sentinl pushes the information through the email, which can then be followed via a URL to the Zipkin visualization. The main screen in the Zipkin UI gives us the service dependencies, and you can capture the latency of each of the services that are invoked in that particular request. Zipkin also has a direct integration with the Kibana UI, so with the click of a button you can go to the Kibana UI as per the URL you have configured, either to a business view or to an ops view. So this is how we have configured all the components in our stack for unified monitoring.

I will now jump to the solution demonstration. As we have seen, the main API we wanted to monitor was the prod-info API of the microservices use case we picked for the online e-commerce application. We will cater to three different features in the solution demonstration. This is a recorded demonstration, because there are a lot of moving parts and there are some emails that need to be accessed, and thus we have recorded the demo. The first feature is the unified monitoring of containers and microservices in a Kubernetes cluster. The second is distributed tracing, and the third is that we will introduce an anomaly in one of the services and try to figure it out through our unified monitoring framework in a proactive manner. So this is the microservices use case at rest; this is the Kibana UI, and as of now the Kubernetes cluster is down.
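Here is the minimal PySpark sketch referred to above: training a K-Means model on per-window metrics and flagging windows that sit far from every centroid. The column names, sample values and distance threshold are illustrative assumptions; the real pipeline would read its windows from Elasticsearch rather than from an inline list.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
import numpy as np

spark = SparkSession.builder.appName("prod-info-anomaly").getOrCreate()

# Assume per-time-window metrics have already been pulled out of Elasticsearch
# (for example via the es-hadoop connector); columns here are illustrative.
rows = [
    (110.0, 200.0, 9.0), (140.0, 200.0, 11.0), (180.0, 200.0, 14.0),
    (220.0, 200.0, 16.0), (160.0, 200.0, 18.0), (9800.0, 404.0, 16.0),
]
df = spark.createDataFrame(rows, ["latency_ms", "http_status", "hour_of_day"])

assembler = VectorAssembler(
    inputCols=["latency_ms", "http_status", "hour_of_day"], outputCol="features")
features = assembler.transform(df)

# Four clusters, as configured in the talk; the "right" k is a judgement call.
model = KMeans(k=4, seed=42, featuresCol="features").fit(features)
centers = [np.array(c) for c in model.clusterCenters()]

# Flag any window whose distance to the nearest centroid exceeds a threshold;
# in practice the threshold would be derived from the training distribution.
THRESHOLD = 1000.0

def is_outlier(vec):
    return min(float(np.linalg.norm(vec.toArray() - c)) for c in centers) > THRESHOLD

outliers = [r.asDict() for r in features.collect() if is_outlier(r["features"])]
print(outliers)
```

In the talk's setup, the flagged windows are written back to Elasticsearch so that the Sentinl watcher described above can pick them up and fire the alert.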
So we have all the metrics shown as zero: the nodes, the deployments, the containers and the available pods. Then we start up the Kubernetes cluster, and we have provided the size as three. We can see all the pods and all the different deployments present in that cluster, with a few of the pods in the kube-system namespace and most of the application-specific pods in the default namespace, and we can see there are three nodes up and running. Now if we go to the Kubernetes overview dashboard, we can see that there are three nodes running, with 20 pods, 14 deployments and 25 containers. This Kibana dashboard is highly customizable; we have picked those particular metrics which we wanted to showcase for this solution demonstration, but you can get other charts and graphs as per your needs. This panel gives us the top nodes by memory usage, so we have a three-node cluster with the memory usage displayed for each node, and it gives us the network-in and network-out figures for each of the three nodes. Then we go to the pods: the second tab is the pods overview, which lists all the pods running in your cluster, and on the right-hand side it lists the pods running the product review service and the product detail service. Since we are looking more from the perspective of memory and CPU utilization in our use case, we have listed the top memory-intensive pods and the top CPU-intensive pods, and this is the tag cloud of all the containers running in our cluster. Finally, we can also see the top CPU-intensive containers and the top memory usage by container. Again, as I said, all the graphs are customizable, so this is tailored to the use case we picked for the demonstration.

Now we can go to the pods for a specific node, so we can drill down from the node level to the pod level, and within that particular pod we can figure out which container is running. In that node we see the list of memory-intensive pods, the Zipkin deployment and the prod-detail service; similarly, we can see the top CPU-intensive pods. In the pod we have selected, the prod-detail service container is running, and the dashboard gives us the CPU information as well as the memory usage for that particular container and pod.

Now we come to the distributed tracing part. This is the URL which is exposed for the prod-info API, and once we have made quite a few requests to that API we can go to Zipkin, where we can see the trace for the prod-info API and look at which services are called internally for that particular request. We can see that, as of now, the total time for the response was around 721 milliseconds. Then we can go into the detail of this trace and capture the trace ID. All this information is captured by the Zipkin emitters and posted into Elasticsearch through the Zipkin collector, and if we take this trace ID and go into the container tab in our Kibana dashboard, we can see that the entire information for all these services is also captured in Kibana.
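The same trace that is inspected in the Zipkin UI can also be fetched programmatically over Zipkin's HTTP API, which is handy when correlating it with the Kibana data; below is a small hedged sketch, where the base URL and trace ID are placeholders.

```python
# Fetch a trace by its ID from Zipkin's HTTP API; base URL and trace ID
# are placeholders, and durations come back in microseconds.
import requests

ZIPKIN = "http://localhost:9411"
TRACE_ID = "5af7183fb1d4cf5f"  # hypothetical ID copied from the Zipkin UI

spans = requests.get(f"{ZIPKIN}/api/v2/trace/{TRACE_ID}").json()

for span in spans:
    service = span.get("localEndpoint", {}).get("serviceName", "unknown")
    duration_ms = span.get("duration", 0) / 1000.0
    print(f"{service:25s} {span.get('name', ''):30s} {duration_ms:8.1f} ms")
```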
In that Kibana view, on the right-hand side, we can see whether each of these spans is a server or a client span, because internally those services call MongoDB, which is their storage; whenever a service calls MongoDB it is tagged as a client. So this is the prod-info service I was talking about, and this is the normal trend we have captured over the last 20 to 50 days. It is a trend we simulated through JMeter: in the initial part of the day we have rather few requests coming in, then the requests slowly go up, mostly with HTTP status 200, peaking at around 4 to 5 p.m. and then slowly tapering off towards the fag end of the day. We then pass that information to our Apache Spark outlier detection, the predictive analysis framework, and this is where we utilize K-Means clustering; those red spots are the centroids of our clusters. We have configured four clusters; again, this is more about the intuition of the business and technical teams, who decide how many clusters need to be configured for your particular model. Now we have the clusters configured, four different clusters with their centroids, and we configure the alert in Sentinl. As I said, Sentinl has watchers, alarms and reports. We are configuring a prod-info service anomaly watcher which will run every 5 minutes; there is a script which it invokes every 5 minutes, and if there is a positive response it sends out a report and an alert as configured in the system.

Finally, we introduce an anomaly. This is my prod-review service, and I am introducing an anomaly with respect to memory consumption: I have written a small piece of code which keeps allocating memory in small chunks until the entire memory for that particular node is exhausted (a rough Python sketch of this kind of memory hog appears below). Then I build that image and push it to Docker Hub, and this is my Kubernetes deployment script for the prod-review service. As of now we have version 9.0 running; I change the image for that deployment from 9.0 to 10.0, the one we pushed to Docker Hub, and make some requests through the browser and through our JMeter scripts so that we create the right kind of anomaly. At that point, through Sentinl, an alert is sent across as an email to our inbox, and here we have two emails: one is the report and the other is the alert.
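The anomaly itself was injected into the prod-review service; the snippet below is only a rough Python illustration of the kind of memory hog described, not the actual code that was built and deployed in the demo.

```python
# Illustration only: hold on to ever more memory so the garbage collector
# can never reclaim it, until the pod/node runs out. Chunk size and sleep
# interval are arbitrary; this is not the code deployed in the demo.
import time

hog = []
CHUNK = 1024 * 1024  # grab roughly 1 MiB per iteration

while True:
    hog.append(bytearray(CHUNK))  # keep a reference so nothing is freed
    time.sleep(0.01)              # leak slowly so the trend shows up in Metricbeat
```

Packaged as the 10.0 image, something like this steadily drives up the pod's memory footprint, which is exactly the trend the Metricbeat dashboards and the outlier model pick up in the next step.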
The report gives the current status of your system, and since it runs on PhantomJS you can configure which part of your Kibana dashboard you want pushed to your email inbox as a snapshot. The second email is the alert, the actual alert for the particular time window in which the anomaly was detected, and if you click on this alert it takes you directly to the Zipkin UI. Here we can see that for the prod-info service the latency has increased from 0.78 seconds earlier to around 13 seconds now, and if we drill down further we can see that in the prod-review service the get-rating-count call is taking up most of the time. We can then capture the trace ID, and there is a logs link which, if you click on it, takes you directly to the Kibana UI, where we can see which particular service is taking most of the time; it also lists the other services which are part of that particular API invocation. We have thereby figured out which particular pod and which particular container is creating that outlier. Now if we go to the pods overview and look at the status, we can see that the prod-review service is consuming around 2.2 GB of RAM, whereas earlier it was around 800 MB, and if we drill down further into the container we can see that the CPU is peaking and the memory is very high compared to the normal scenario when the prod-review service was working correctly. Then we roll back the image we had deployed, from 10.0 to 9.0, and if we go back to the current status after the deployment has been rolled back, we can see that the prod-review service memory consumption is now within the limit, at 846 MB, and the other CPU and memory metrics are also well within range. So, yes, this is what we wanted to showcase: how, with our unified monitoring framework, we are utilizing all the components, Elastic, Kibana, Zipkin distributed tracing and Apache Spark, to help us in proactive issue detection and in monitoring our clusters. So that is it; I am done with the session, so let us take any questions.

Again, with respect to that: we wanted to monitor the entire ecosystem of an enterprise, and we didn't want to limit ourselves only to, let's say, Kubernetes or that particular ecosystem. In order to monitor the entire enterprise, which can have private data centers and applications running on bare metal, and because we also saw very good performance with Elasticsearch, we took this particular approach. Hello, over here. Over here? Sorry, yeah. Can you give us more specific examples of how you have used Spark machine learning predictive modelling to solve the problem? I heard you mention that you have used K-Means. Exactly. So, in business terms, for example, I have an application which is microservices; can you speak about what the problem is and how it is solved?
Yeah, so what we targeted was, you could say, not a multivariate model; it was a single variable, just for demonstration purposes. But you can have a multivariate model: a combination of latency, the HTTP status code and the time window in which that particular data is captured, and then you have a normal trend and use K-Means clustering on it. Again, when we use K-Means clustering in Apache Spark, it is more about the intuition with which we pick that k value. So for us it was about capturing the latency, the HTTP status code and the time window in which those parameters are captured, and then passing them in to create the specific clusters; this is more about the application status at a particular time with respect to latency and HTTP code. That is what we were targeting. Okay, thank you. And my second question is on Kibana dashboards: do you have any standard configurations, or do you have to create those dashboards explicitly? See, for most of the metrics and logs there are default dashboards provided by Kibana, and if you want to be more intuitive and get more insight into the data, you can easily create those dashboards with all the metric and log information that is captured. It is very easy; I do not have a Kibana background, but when I started working on unified monitoring it was pretty intuitive, more of a drag-and-drop kind of thing.

My name is Andy Santos, I work for eBay, and one of my teams focuses on exactly the same thing, which is observability. Coming back to Kibana: Kibana has a limitation because of the commercial license called X-Pack; they are trying to move from open source to more commercial offerings, so there are certain plugins that are not given as open source. But internally we actually contributed back, and we have open sourced some of it; we created a lot of plugins. For example, if you want a tail-command-like view within Kibana there is no way to do that out of the box, but we created a plugin to do it; it is actually pretty simple, and you can use JavaScript to do a lot of this. Another one we find interesting: you may want to do a heat map within the bigger clusters, to see the density of the deployments and so on; you can do a heat map, we have the code that does that, and we would be more than happy to share it. But the other question I have is: what is the scale that you deal with? I am interested to see how we can potentially collaborate and open source this, because we are dealing with a much larger scale and we have done this architecture before. Our scale is about 10 million metrics per second, and on the log side we are dealing with about 10 petabytes a day, and this is not just the infrastructure but the entire application estate that runs within eBay. So what we were forced to conclude is that if you rely on Elasticsearch by itself, it is not going to scale; we found that out the hard way. One option is to push it into InfluxDB, but we didn't like InfluxDB because they also went from open source into a commercial version, so we decided otherwise. At a high level, and I would be more than happy to go into detail, we actually split this into two different storage tiers.
One is a hot tier and the other is a cold tier. The hot tier holds anything within, let's say, two weeks; you can define that depending on how much money you are willing to spend on it. For the cold storage, metrics are stored in OpenTSDB, which is open source, and logs are stored in Hadoop. For the hot storage of metrics we evaluated a few things. One of them is an Apache project called Ignite, which can basically keep a lot of these metrics in memory, so it is super fast if you want to do something within seconds. There is another one from Facebook called Beringei, which also does delta compression, that you could also evaluate. On the log side we do use Elasticsearch for a short duration of the data, but for the longer duration of data, as I said, we use Hadoop. There is also another startup company we are working with, called Dashbase, that basically claims it can take Elastic to ten times the performance. Maybe we can discuss all of that and try to understand each other's setups. I think my point is that there are a lot of companies that would be interested in linking up with this, and if you can collaborate, and if it is not planned to be made into any kind of product, we would be more than happy to do some sort of open source collaboration. We will have a talk offline. It is called Gorilla, or Beringei; they do time series, they custom designed it, they load everything into memory and do delta-of-delta compression, so the amount of memory you use to store the data can be maximized.

Yeah, sure. No, we are not using any closed product; that is the reason we went with Sentinl. Sentinl, again, is open source, Apache 2.0 licensed. We didn't use X-Pack; that is why we brought Sentinl in. So our entire unified monitoring framework is on open source. And again, we have a separate plugin for that; we are not at all using X-Pack, because that is the boundary within which we need to work, of open standards and open source, so we have our own plugin with which we do access management, RBAC specifically. Yeah. We will end this session; the next speaker is here for the next session. Sorry, they didn't give you all enough time; I see they need to transition you. I am sorry.