Good afternoon. I think we can get started. In this session we are going to talk about monitoring and diagnosing performance issues on the Cloud Foundry framework. As more and more applications are being deployed, how are we going to monitor them? What are the ops considerations? And if we have any kind of performance issue, how are we going to resolve it? Those are the things we're going to talk about in this session. My name is Surya Duggirala. I am an STSM responsible for IBM's Watson and Cloud Platform architecture and performance engineering. With me is Milorad, Senior Director in RBC Digital Banking Channels. Today's session will focus on Cloud Foundry from a performance engineering point of view: from within IBM, what we have identified and what we are doing, and also what RBC is doing for ops and monitoring, because there are around 40 production applications running on Cloud Foundry with Bluemix at RBC. So it's going to be a mix of both the lab point of view and the customer application point of view.

I'll talk a little bit about the Cloud Foundry monitoring framework and what facilities we have in Cloud Foundry itself. Then we will get into enterprise application monitoring, whether it's online banking, commercial banking, or the other applications that RBC is deploying: what dashboards they are developing, what they are using for operations, and so on. Then we will take a specific use case, a typical performance problem that we encountered in RBC's online banking, and walk through how we resolved it using the framework that is offered as well as the third-party tools we have.

The Cloud Foundry monitoring framework, as most of you know, has multiple tools. At one end you have the simple CF CLI, where you can get basic information like CPU and memory usage. At the other end you have the Loggregator Firehose, which you can enable and attach nozzles to, so you can get data from the individual components. That one is for advanced users, because once you attach to the Firehose you get so much data that you have to be really familiar with what to look for. Another thing we do in the lab is use Grafana, Graphite, and InfluxDB: we designed some dashboards to do deeper analysis and see how each component of Cloud Foundry performs as you run different applications. And of course we have the third-party APM tools, such as New Relic APM, or Health Center for profiling. Those are other tools that enhance your ability to diagnose performance issues.
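As a concrete illustration of what attaching to the Firehose looks like, here is a minimal nozzle sketch in Go using the open-source noaa consumer library. The Doppler address, the token, and the subscription ID "perf-demo-nozzle" are placeholders for your own deployment's values, and a real nozzle would filter and forward envelopes rather than print them.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"os"

	"github.com/cloudfoundry/noaa/consumer"
)

func main() {
	// Deployment-specific placeholders: the Doppler websocket endpoint
	// (e.g. wss://doppler.my-cf.example.com:443) and an OAuth token
	// (e.g. the output of `cf oauth-token`).
	dopplerAddr := os.Getenv("DOPPLER_ADDR")
	authToken := os.Getenv("CF_AUTH_TOKEN")

	c := consumer.New(dopplerAddr, &tls.Config{InsecureSkipVerify: true}, nil)
	defer c.Close()

	// The Firehose needs a subscription ID; nozzles sharing an ID have
	// the event stream load-balanced across them.
	envelopes, errs := c.Firehose("perf-demo-nozzle", authToken)

	go func() {
		for err := range errs {
			fmt.Fprintf(os.Stderr, "firehose error: %v\n", err)
		}
	}()

	// Every envelope type flows through here: ValueMetric, CounterEvent,
	// ContainerMetric, HttpStartStop, LogMessage. This is exactly the
	// "so much data" point: filter early, or drown.
	for env := range envelopes {
		fmt.Println(env)
	}
}
```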
When it comes to the enterprise monitoring framework and what RBC has: RBC uses Dynatrace for application monitoring as its APM framework. Dynatrace has some enhanced agents, and because most of the applications are Node and Java, we worked with Dynatrace to put hooks into the runtimes so you can get additional information from inside them. We also have RBC's custom environment dashboards, built mainly on three things: the Cloud Foundry metrics data (I'm going to show those snapshots), the Cloud Foundry application logs, and the admin console APIs, which enable you to set thresholds so that the operations team gets notifications when a specific threshold is hit. We have made a few other plug points as well, and of course we have delivered operations management dashboards through the admin console.

Another thing is those plug points. Say you have an application that is not only running within Cloud Foundry at runtime but also uses external third-party services, maybe a data service, a cognitive service, or IoT. There are plug points where you can correlate the transaction data as the transaction flows through all these different layers, so you can design your own dashboard that brings all of those data points together and displays them on a single unified dashboard.

When you look at resource usage, your application is deployed in a Diego cell or in a DEA, and you can see the usage from a memory, CPU, network, and storage point of view. It gives you a very high-level view on average: across your Diego cells, what is the average CPU, memory, or storage consumption? Of course, in your environment you may have a CPU overcommit or memory overcommit, so we can also normalize the data based on your configuration. If you look at the resource usage pattern here, you can see that 77% of the CPU is being used. That is a high-level indication that you may now have to go and add another Diego cell. This is a very coarse-level dashboard that the operations team looks at, at a very high level.

If you want to drill down further into the internal components, look at the Diego architecture in Cloud Foundry: you can see all the different internal components. You have the BBS database, the Diego cells, the Garden containers, the Brain, and so on. When you are scaling to maybe 30 million transactions per day, or some other very high volume, you really want to understand how each of these components is performing. Which one is going to be the bottleneck? Which pipe do you have to broaden and make bigger so that one single bottleneck is not going to hold the whole transaction system for ransom? What we did for this is take an open-source metrics tool and put its agents into each of the components I mentioned, whether it is the Brain or the Gorouter. You collect the data and push it into an InfluxDB instance we created for storing this data; that feeds the Graphite server running in our Cloud Foundry environment, which in turn serves the Grafana dashboards. This whole infrastructure is something you can set up yourself, and it gives you very fine-grained information: when you have 10 instances running on a specific Diego cell, you can see those 10 Garden containers within the cell, how many processes each Garden container instance has, and how many Garden container instances are in a cell. All of that low-level information you can get.
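To make the agent-to-dashboard pipeline concrete, here is a sketch of the sending side in Go, writing a few datapoints in the Graphite plaintext protocol, which both Graphite and InfluxDB's Graphite listener accept. The endpoint, metric paths, and values are hypothetical stand-ins for whatever an agent on a Diego cell would actually report.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// sendMetric writes one datapoint in the Graphite plaintext protocol:
// "<metric.path> <value> <unix-timestamp>\n".
func sendMetric(conn net.Conn, path string, value float64) error {
	_, err := fmt.Fprintf(conn, "%s %f %d\n", path, value, time.Now().Unix())
	return err
}

func main() {
	// Placeholder endpoint for the Graphite server (or an InfluxDB
	// Graphite listener) in your environment.
	conn, err := net.Dial("tcp", "graphite.internal.example:2003")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Hypothetical per-cell metrics, as an agent on a Diego cell
	// might emit them.
	sendMetric(conn, "cf.diego_cell_0.cpu.user_percent", 77.0)
	sendMetric(conn, "cf.diego_cell_0.memory.used_mb", 24576)
	sendMetric(conn, "cf.diego_cell_0.containers.count", 30)
}
```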
Here you can see a dashboard showing the CPU, memory, network, and disk usage for specific Diego cells. In our Bluemix Local environment there are four Diego cells, and you can see, for each of the four, how much CPU and memory is used. That makes it much easier to understand how your applications are being deployed across the different cells, and, whether an application is a Node application or a Java application, what the memory and CPU usage patterns look like, how much storage is being used, and whether you have any kind of disk bottleneck. All of those things you can understand proactively.

This is another view, of the Garden container count: you can see how many Garden containers you have. This is very important. For example, we had an issue with the virtualization we had used for the VMs running all these different components. The VMs were using para-virtualization (PV), and we could not push more than 30 Garden containers into a cell; once you reached 30 containers, the VM was completely saturated. We could clearly see that at that number of Garden containers in the cell, the CPU was pegged at 100%. That gave us the insight, so we changed from para-virtualized to HVM, hardware-assisted virtualization, and all of a sudden, with just that one virtualization change, we could go from 30 to 200 containers. The density of containers in a Diego cell went up by more than six times. Those are the engineering decisions and design bottlenecks you may face, either in your own Cloud Foundry installation or in one that's managed by somebody else, and those are the very advanced insights you can get from here.

Another important part is the router, because everything is channeled through the Gorouter. As some of you may know, the Gorouter in Cloud Foundry right now doesn't support keep-alives. Without keep-alives, each time a transaction flows through, a connection is opened, and once the transaction is done, that connection is closed. You may be familiar with HTTP keep-alive, where in a typical app server, once you have a persistent channel, you send hundreds of requests on top of it; that's how you get the performance, especially if you're using SSL, and that's where you can clearly see the difference. With the Gorouter, unfortunately, you don't have keep-alive support, so you have additional overhead going through it. How do you circumvent that? You can see from these dashboards how hard the Gorouter is being pushed, whether it is CPU or network, and then you can increase the concurrency: add a few more Gorouter instances to offset the latency impact. Those are the advanced monitoring tools and techniques you can use to make sure your applications deployed in Cloud Foundry are performing and scaling well.
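To see why missing keep-alive hurts, especially over SSL, here is a self-contained Go comparison of the same request sequence with and without connection reuse against a local TLS test server. It is an illustration of the handshake-per-request cost, not a measurement of the Gorouter itself.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// timeRequests issues n sequential GETs with the given client and
// returns the total elapsed time.
func timeRequests(client *http.Client, url string, n int) time.Duration {
	start := time.Now()
	for i := 0; i < n; i++ {
		resp, err := client.Get(url)
		if err != nil {
			panic(err)
		}
		// Drain the body so the connection is eligible for reuse.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return time.Since(start)
}

func main() {
	// Stand-in backend; in the talk's scenario this would be an app
	// route behind the router.
	srv := httptest.NewTLSServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte("ok"))
		}))
	defer srv.Close()

	// Client that reuses connections: HTTP keep-alive, the default.
	keepAlive := srv.Client()

	// Client that performs a full TCP+TLS setup and teardown for every
	// request, mimicking a hop with no keep-alive support.
	tr := srv.Client().Transport.(*http.Transport).Clone()
	tr.DisableKeepAlives = true
	noKeepAlive := &http.Client{Transport: tr}

	const n = 200
	fmt.Println("with keep-alive:   ", timeRequests(keepAlive, srv.URL, n))
	fmt.Println("without keep-alive:", timeRequests(noKeepAlive, srv.URL, n))
}
```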
One more thing: RBC has gone a step further in their monitoring and is using Google Analytics and similar tools, so I would like to have Milorad talk a little bit about how they're doing data analytics and client insights.

Thanks, Surya. Let's see if this is on. Thank you. I'm going to keep my part really short; we have a nice dark room here. Very quickly: in addition to everything that Surya has said, of course we are interested in close monitoring of the applications in production. Beyond that, we are also very much interested in designing the applications in the best possible way to maximize the experience for our clients, and in helping our back-office processes, which in some cases, RBC being a major bank, are still manual; we want to drive the efficiency of those processes as much as possible. So what we are doing in our group, Digital Business Banking, is combining the data points we have from tools like Google Analytics and Dynatrace and consolidating them into a view that gives us a holistic perspective on our interactions with our clients. That also gives us a holistic view of the applications in production and helps us focus effort on the specific areas where we can make a difference for our clients and where, frankly speaking, we can save money for RBC from an operational perspective. We also want to make sure we are the first to know about any potential situations or issues in production, before we start getting client calls. Why is this important? One of the key characteristics of business banking is that we process large transactions, and large volumes of those transactions, for small businesses but also for large corporate clients. We want those transactions to flow as smoothly as possible, and we want to respond very quickly to any potential issues. This is what really makes a difference for us.

Thank you, Milorad. So we saw different ways of monitoring: the tools we have, and the custom tooling you can build from Grafana, Graphite, InfluxDB, and other techniques. That's all about monitoring: making sure your applications are performing and scaling without bottlenecks, and if you have bottlenecks, you will know. But if you want to diagnose a performance problem you already have in the infrastructure or the application, how do you go about it? What tools do we have? I'm going to go over a specific incident that occurred while we were looking at the online banking application deployed on Bluemix Cloud Foundry at RBC. Our expectation was 400 to 500 transactions per second with sub-second response times. What we saw was that we could not get past 12 transactions per second, and with 90-second response times, not sub-second. We tried increasing the number of instances and all that, and we were out of options as to what exactly was happening, because there are a lot of moving parts: you have the fabric, the firewall, the front door, the backend mainframe system, and the application itself, which had moved from on-premise to cloud. We didn't know where to look. Those are the tough problems. We ended up using multiple tools.
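For context on how figures like "12 transactions per second at 90 seconds" show up, here is a minimal closed-loop load-driver sketch in Go that reports throughput and 90th-percentile latency. The URL and worker counts are placeholders; this is not the harness used in the incident, just the shape of the measurement.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"sync"
	"time"
)

func main() {
	const (
		url     = "https://app.example.com/ping" // placeholder route
		workers = 50                             // concurrent simulated users
		perUser = 20                             // requests per user
	)

	var (
		mu        sync.Mutex
		latencies []time.Duration
	)

	start := time.Now()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perUser; i++ {
				t0 := time.Now()
				resp, err := http.Get(url)
				if err != nil {
					continue // a real harness would count failures too
				}
				resp.Body.Close()
				mu.Lock()
				latencies = append(latencies, time.Since(t0))
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	elapsed := time.Since(start)

	if len(latencies) == 0 {
		fmt.Println("no successful requests")
		return
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	fmt.Printf("throughput: %.1f tps, p90 latency: %v\n",
		float64(len(latencies))/elapsed.Seconds(),
		latencies[len(latencies)*90/100])
}
```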
Just to give you more context, this is RBC's online banking topology. When a transaction comes in, it goes through the firewall and into the security layer, with DataPower and the TAI, the Trust Association Interceptor, where it takes the LTPA token. Then it gets into Bluemix Cloud Foundry, into the orchestration application. From there it has to go to the backend mainframe, which is where the data is, get the result set back, go through the stub that reviews the result set, and then return to the front, the AngularJS frontend. As you can see, there are multiple layers here.

What we did was use four different tools and manually correlate all of them to work out where the problem was. One of the tools we used was New Relic, for path-length analysis; we did a profile using New Relic. What we found in the path length was a lot of threads just waiting. That gave us a hint that something was wrong within the runtime itself, which in this case is Java, running in the Liberty buildpack. There were other candidates, but we now focused only on that area. Then we used another tool, the Monitoring and Analytics APM tool, which gives the internal details about the thread pools: how they're being used and what the utilization is. You can see that clearly in this snapshot of the Monitoring and Analytics tool. On the left-hand side you have a backend service latency of 10 milliseconds, and on the right-hand side a much higher backend service latency, in this case around 1,000 milliseconds. So the latency in the backend has increased. Now look at the free pool, the idle threads, and the used threads. On the left-hand side, the blue bar is the idle threads: how many threads within the app server are free and ready to take work. You have a sufficient number of threads there, so when you increase the workload, you have enough threads to take care of it and you can scale. On the right-hand side, by contrast, you can clearly see the yellow, which is the threads that are already used up. You don't have any threads left, so if you want to increase your workload from 10 users to 100 users to 1,000 users, there is nothing there. The traffic was simply accumulating and getting blocked at the Gorouter level, because there was no capacity to handle it in the runtime. As you increased the workload, more backup happened at the Gorouter, so latency increased and response times went up, even though the runtime showed the Diego cells barely used, at 5% or 10%, because there was no way to scale. You have to correlate the Gorouter data with the dashboard I showed before and use New Relic to identify that the runtime is the culprit, and then conclude that it is the backend service latency that is having the impact. It's a very good way to correlate these things, but of course it's laborious, because you have to understand all these pieces together.
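The thread-pool behavior in those charts is Little's law at work: the number of requests in flight equals throughput times response time, so a jump in backend latency from 10 ms to 1,000 ms multiplies the busy-thread requirement by 100 at the same load. A quick sketch, using the talk's 400-transactions-per-second target (the function and the other numbers are illustrative):

```go
package main

import "fmt"

// Little's law: requests in flight N = X * R, where X is throughput
// (req/s) and R is per-request service time (s). The app server's
// thread pool must cover N, or requests queue upstream (here, at the
// Gorouter).
func threadsNeeded(throughputPerSec, serviceTimeMs float64) float64 {
	return throughputPerSec * serviceTimeMs / 1000.0
}

func main() {
	for _, latencyMs := range []float64{10, 100, 1000} {
		fmt.Printf("backend latency %5.0f ms at 400 req/s -> ~%.0f busy threads\n",
			latencyMs, threadsNeeded(400, latencyMs))
	}
}
```

At 10 ms you need roughly 4 busy threads, at 100 ms roughly 40, and at 1,000 ms roughly 400, which is how a pool that looked comfortable on the left-hand chart gets exhausted on the right-hand one.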
So it would be nice to have autocorrelation of all these things together, brought up in a unified console, so that a less skilled operator can easily identify where the bottleneck is. That's why it's very important for us to use these tools, correlate them, and build a unified dashboard, so that problems like this can be solved easily, because this one took some time, and you have to know the runtime, the Cloud Foundry fabric, and all those things. This chart is the same thing, but with 10 milliseconds and 100 milliseconds; it's an in-between case. The previous one was the extreme. Here, going from 10 to 100 milliseconds, you can see on the right-hand side that the number of used threads, the threads in use, has gone up, but you still have some idle threads, so the system is still able to take the load. Once you increase that backend service latency further, though, the yellow bar will go up.

In summary, there are certain tools built into Cloud Foundry itself, like the CF CLI or the Loggregator, that you should use as a basic monitoring toolset. I recommend using open-source metrics agents if you can, to get some data; if you're an advanced user, you can create your own Grafana dashboards, put the data into InfluxDB, and use the Graphite server. Third-party APM tools like New Relic or Monitoring and Analytics will also give you transaction tracking from an end-to-end perspective. And of course, operational dashboards are another way: as Milorad mentioned, you can use Google Analytics and others to understand your end customers' usage patterns, which enables the operations team as well as the cloud administrators to proactively take action. So those are some of the tools, from a Cloud Foundry perspective. That concludes my session; I can take some questions.

Question: What is your experience with the max connections setting in the Gorouter?

With max connections, in the Gorouter? Yes, that comes back to the keep-alive point. Because, as I said, the Gorouter doesn't support keep-alive, at some point you can clearly see the impact of connecting and disconnecting all those connections. You can see the kernel CPU go up, because those operating-system operations chew up kernel CPU. The user CPU is okay, but the kernel CPU will slowly climb, and at some point it simply increases the latency.

Question: What is the max number of clients that each Gorouter instance can support?

For a network-intensive application, what we saw was a maximum of 800 to 900 connections per instance; beyond that, it will just block, so you have to bring in a second instance.

Any more questions? Okay, thank you.