Hello, everyone. My name is Mert Toslali. Today I'll be presenting our recent work, called Automating Instrumentation Choices for Performance Problems in Distributed Applications with VAIF.

Diagnosing performance problems using logs is extremely challenging and time-consuming. With the emergence of distributed applications, the problem can now be anywhere: it could be in one of the many components that constitute the distributed application, or it could be due to inter-component interactions. As a result of these challenges, developers heavily instrument their distributed applications. Instrumentation, such as logs, is the de facto data source engineers use to diagnose performance problems in their applications. However, it is difficult to know a priori where these logs are needed to help diagnose problems that may occur in the future, and exhaustively recording all possible application behaviors using these logs is infeasible due to the resulting overheads. As a result of all of these issues, distributed applications end up containing lots of log statements, but rarely the right ones in the right locations needed to diagnose a specific problem.

Currently, enabling the right instrumentation requires manual iterations of test and check: you gather more data from instrumentation and try to localize your problem; if that is not enough, you instrument more and then gather more instrumentation data. This takes a lot of valuable developer time, it increases downtime for your application, and it also comes with additional monetary cost.

To reduce diagnosis times, researchers have developed automated techniques to choose the logs needed for performance diagnosis. However, the existing literature has major drawbacks. First of all, much of it focuses on correctness problems, not performance. For example, Log20 helps diagnose correctness problems by enabling logs that differentiate unique code paths in your application. However, this is not directly applicable to performance problems, because fast code paths, for example, need not be differentiated for performance problems, and the slow ones, even after they are differentiated from the fast ones, still require additional logs to pinpoint problems further. Another problem is that some other works are not designed for distributed applications and ignore very critical context information, namely request workflows. Request workflows are necessary to know which services, processes, and methods are involved in a user request, and engineers need this context to efficiently debug their applications. For example, Log2 is designed for performance problems, but it is indiscriminate because it ignores request workflows, and thus it will enable logs in areas of the application that are not really performance-sensitive or not really contributing to end-user requests.

To overcome these shortcomings, we present a variance-driven automated instrumentation framework, which we call VAIF. In response to newly observed performance problems, VAIF automatically searches the space of possible instrumentation choices to enable the logs needed to diagnose those problems. To do so, our framework combines distributed tracing, which is an enhanced form of logging, with the insight that response-time variance can be decomposed into the portions of the request traces that contribute to it.
In the rest of this talk, we will present requirements for automated logging frameworks, then discuss our approach for addressing them, then go through our design and present findings from our empirical evaluation.

So let's first review the requirements for automated logging frameworks. Past research has argued that the logs needed to localize the source of one problem may not be useful for others. The lack of one-size-fits-all logs leads to a tussle to identify which log statements are actually most useful and should be enabled by default. For example, Sahar et al. state that the Hadoop, HBase, and ZooKeeper family of applications have been patched almost 30,000 times just to add, remove, or modify static log statements embedded in their source code. This challenge results in the following requirement. So our first requirement is: logging frameworks must allow logs to be enabled selectively in response to performance problems observed at runtime.

Let's now go through our second requirement. Even modestly sized distributed applications will have large search spaces of instrumentation, such as hundreds or thousands of possible log points in the source code. For example, consider a distributed application that allows log points to be inserted or enabled at every function's entry, exit, or exceptional return. To address this scalability challenge, we refine our first requirement to require logging frameworks to automatically enable trace points. We further refine the requirement, stating that automated logging frameworks must also be capable of narrowing down the search space when exploring which additional logs are needed to localize newly observed problems.

Our last requirement is the following. Existing logging infrastructures capture huge amounts of data. For example, in the paper on Facebook's Canopy, we see that this infrastructure captures gigabytes per second of trace data, and each individual trace actually contains thousands of trace points. Problem diagnosis, even when the needed instrumentation is present, becomes trying to find a needle in a haystack, because there are thousands of available instrumentation points and you don't know which one is contributing to the problem most. This challenge is partially addressed by the previous requirements, since we automatically choose which instrumentation to enable. But to avoid the needle-in-a-haystack problem further, for cases where, for example, there are multiple simultaneous problems in your application, we add the following requirement. Requirement 3: automated logging frameworks must be capable of explaining their decisions to developers or operators.

We next discuss the insights and enablers that let us address these requirements. We follow the insight that requests with similar critical paths, which are processed similarly by the application, will have similar response times. If not, this unexpected behavior may be a performance problem, such as slow functions, resource contention, and so on. This is a useful insight, and an existing use of it is as follows: operators keep separate performance counters for different types of requests or API calls; for example, read and write operations in a distributed storage application will have different performance counters. The expectation is that requests of different types will have different critical paths and thus different response times.
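To make this grouping insight concrete, here is a minimal Python sketch (not VAIF's code; the request types, latencies, and threshold are illustrative) that groups response times by request type and flags groups whose coefficient of variation is suspiciously high:

```python
# Minimal sketch: requests of the same type should perform similarly, so we
# group response times by request type and flag groups with a high coefficient
# of variation as candidates for further instrumentation.
from collections import defaultdict
from statistics import mean, pstdev

def flag_unpredictable_groups(requests, cov_threshold=0.5):
    """requests: iterable of (request_type, response_time_seconds) pairs."""
    groups = defaultdict(list)
    for req_type, latency in requests:
        groups[req_type].append(latency)

    flagged = {}
    for req_type, latencies in groups.items():
        avg = mean(latencies)
        cov = pstdev(latencies) / avg if avg > 0 else 0.0
        if cov > cov_threshold:
            flagged[req_type] = cov  # high variance: this group is unpredictable
    return flagged

# Example: READs vary wildly (e.g., cache hits vs. misses), WRITEs are predictable.
sample = [("READ", 0.01), ("READ", 0.90), ("READ", 0.02), ("READ", 1.10),
          ("WRITE", 0.50), ("WRITE", 0.52), ("WRITE", 0.49)]
print(flag_unpredictable_groups(sample))  # {'READ': <high coefficient of variation>}
```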
Furthermore, as a second insight and enabler, we use distributed tracing, which is an enhanced form of logging, because distributed tracing provides the context we need: a series of causally related distributed events that together constitute the end-to-end request workflow, which is essentially key for effective diagnosis. By using this context, we can know which specific request to which API is experiencing what kind of problem, and which component of these traces is contributing to the problem. Third, high response-time variance can be localized to sources of high variance among the critical-path portions of the request workflow traces, which gives us insight into where more trace points must be enabled to explain this unexpected behavior further.

Building on these critical enablers and insights, we present the design of our framework. So let's talk about our approach. VAIF identifies requests whose traces exhibit identical critical paths but which exhibit high response-time variance. VAIF then identifies the edges of these traces that contribute most to the variance. From there, it enables additional trace points in the code regions corresponding to these high-variance areas. This way, VAIF is able to differentiate slow code paths from fast ones, and within the slow code paths it isolates code with unpredictable performance.

Let's now go through an example. The example on the right shows how VAIF differentiates high variance due to a caching operation. In figure (a), the response-time variance of read requests is high, and the figure shows that the trace edge spanning the storage-node accesses is the dominant contributor to response-time variance. From there, figures (b) and (c) show that enabling trace points to differentiate cache hits from cache misses distinguishes the fast paths from the slow ones. By doing so, VAIF addresses requirement 1, which is to enable trace points in response to performance variations in the request workflow, and it also narrows down the search space, so it will focus only on the caching operations within that region. VAIF further maintains a history of trace points enabled on behalf of high variance, along with the statistics that motivated these decisions. By doing so, VAIF can explain why it made the decisions it did in order to localize the problem, thus addressing requirement 3, which is to explain these decisions to users.

Let's next discuss the design of VAIF. The main goal of VAIF is as follows. During nominal operation, VAIF operates identically to distributed tracing and generates traces with a default level of trace points enabled. When new problems occur, developers can use VAIF to automatically enrich traces with the additional trace points needed to localize them. VAIF localizes problems due to code that is slow or has high variance. Similar to dynamic instrumentation, our approach reduces the burden of deciding which log points to enable up front, and it additionally eliminates the manual effort required to search the space of possible trace-point choices.

The figure on the right shows our design. It comprises a distributed tracing infrastructure that allows trace points to be enabled and disabled, and control logic where the trace-point-enabling decisions are made according to the performance-variation insights. It has a control plane on the top and an instrumentation plane on the bottom. Components in the control plane implement the control logic, which localizes problems and further enriches traces.
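As a rough illustration of the edge-variance localization described in the approach above, here is a small sketch (our own simplification, with made-up edge names and a flat per-edge latency layout, not the paper's implementation) that picks the edge whose latency varies most across traces sharing a critical path:

```python
# Minimal sketch: given traces that share the same critical path, find the edge
# whose latency varies the most; that is where additional trace points go next.
from statistics import pvariance

def dominant_variance_edge(traces):
    """traces: list of dicts {edge_name: latency}, all with the same critical path."""
    edges = traces[0].keys()
    per_edge_variance = {edge: pvariance([t[edge] for t in traces]) for edge in edges}
    top_edge = max(per_edge_variance, key=per_edge_variance.get)
    return top_edge, per_edge_variance

# Example inspired by the caching case: the storage-node edge dominates variance.
reads = [
    {"api->cache": 0.01, "cache->storage_node": 0.02, "storage_node->reply": 0.01},
    {"api->cache": 0.01, "cache->storage_node": 0.95, "storage_node->reply": 0.01},
    {"api->cache": 0.02, "cache->storage_node": 0.03, "storage_node->reply": 0.01},
]
edge, variances = dominant_variance_edge(reads)
print(edge)  # 'cache->storage_node' -> enable trace points around the caching code
```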
The instrumentation plane carries out the control plane's trace-point hypotheses, which is to say it enables and disables trace points, and our framework operates in a continuous loop, shown by the arrows, which we will discuss next. At each iteration, VAIF's instrumentation-plane components gather new traces, which is marked (a) in this figure. Control-plane components then examine these traces to identify hypotheses about which trace points should be enabled next, which is marked (b). The hypotheses are then sent to the instrumentation-plane components, shown as (c), which enable the relevant trace points, and the cycle repeats. While doing all these steps, VAIF also maintains a history of its decisions and their effects in a data structure called the hypothesis forest.

So we first start with the expectation that identical request types should perform similarly, as we discussed in our insights. For example, our first, high-level expectation is that list requests should perform similarly. VAIF collects traces and groups them by request type, so the list requests are grouped together. It then iteratively derives hypotheses, enabling some trace points to refine these expectations. For example, a refined expectation would be that list requests with cache hits should perform similarly, and list requests with cache misses should perform similarly. In the next round, VAIF groups traces according to this new expectation: the list requests are still together, and under that they are grouped by cache hit and cache miss. These hypotheses create nodes in the tree, and new groups of traces are populated under these new leaves, which are analyzed in the next round to find the problem or to localize it further. We consider any group of traces that shows either a high coefficient of variation or a high mean latency to be problematic. We have a statistical threshold, so if a group's statistics exceed that threshold, we flag the group as a potential problem, and then we start localizing the problem and making trace-point-enabling decisions.

To know what can actually be enabled on behalf of a group, VAIF has an offline profiling phase where it constructs a search space. The figure on the right shows an example search space for a given workflow. VAIF represents the search space as a set of paths observed in a request workflow; the search space is the collection of unique paths observed during search-space collection. For requests that exhibit concurrency, each concurrent path is stored separately, and the paths themselves are stored as JSON or dot files in our prototype. The nodes of these paths are trace-point names and the edges are happens-before relationships. Additionally, hierarchical relationships between edges are represented by nested start and end annotations in the trace-point names. These paths are learned by running workloads against the application, which is often already done during code-coverage, regression, or integration tests.

So let's now go through our search strategies. At runtime, when VAIF observes a problem, it employs a search strategy on the problematic groups to determine the next set of trace points to enable. The figure on the right is an example of a distributed trace. It consists of spans, each of which is a logical unit of work with a duration. For example, A refers to a client operation from start to end, and B, C, and D are the other methods executed during this call.
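The search-space representation described above could look roughly like the following sketch. The JSON layout and trace-point names here are assumptions made for illustration; the prototype stores paths as JSON or dot files, but its exact schema is not shown in the talk:

```python
# Minimal sketch of a search space: ordered lists of trace-point names, where
# list order encodes happens-before and nested Start/End pairs encode hierarchy.
import json

search_space = {
    "request_type": "READ",   # illustrative; one entry per concurrent path
    "paths": [
        ["API_Start", "Cache_Lookup_Start", "Cache_Lookup_End",
         "StorageNode_Read_Start", "StorageNode_Read_End", "API_End"],
    ],
}

def candidate_children(path, parent_start, parent_end):
    """Trace points nested between a parent span's Start/End annotations."""
    i, j = path.index(parent_start), path.index(parent_end)
    return path[i + 1 : j]

path = search_space["paths"][0]
print(candidate_children(path, "API_Start", "API_End"))  # everything nested inside the request
print(json.dumps(search_space, indent=2))                # how it could be persisted offline
```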
We present two out-of-the-box search strategies and depict them in this figure. The first strategy is called hierarchical search, which explores top down by enabling trace points that denote the child level of the problematic span, as shown in this figure. For example, if the problem is in A, then the hierarchical search strategy can enable the B or D spans. Our second strategy is called flat search, which uses a binary-search-like strategy and enables the middle trace point according to happens-before relationships. For example, if the problem is in A, it will look at the happens-before relationships, see that B, C, and D are executed in this order, and then enable just the C trace point.

Let's now talk about VAIF's outputs and how to use them. VAIF has two outputs. The first output is new traces whose critical paths are enriched with the additional trace points needed to localize observed problems. At the time of the problem, the developer can take a look at the trace, and the trace points corresponding to the problem will already be there. The second output, attached to each trace, is the corresponding hypothesis tree that explains the results of VAIF's hypotheses. For example, it may show that enabling the trace point around the cache differentiated critical paths and generated two new groups, increasing predictability for one group and isolating unpredictability in the other. Using this output, developers can leverage VAIF's API to query the hypothesis history for the corresponding trace. We provide a CLI for interaction and expose an API for accessing and querying the hypothesis forest. By doing so, developers can also query the hypothesis forest directly to identify groups of requests with high response-time variation or which are considered slow. Some capabilities include, for example, identifying all groups or request types that are problematic in terms of high variance or mean. We also surface statistical summaries, for example mean, variance, and coefficient of variation per group, per trace point, and per request type. You can also access a ranked ordering of problematic groups; for example, you can query the top 10 problematic groups in your system or the top 10 problematic trace points in your overall application.

In this section, we present and discuss some of the limitations of our current implementation. First, we cannot identify whether the observed variance or slow response times are a result of code in lower layers such as virtualization or the kernel. This is because the current tracing infrastructures we use are limited to capturing elements of request workflows in the application only. Second, our method of enabling trace points cannot provide value for transient problems that disappear before we collect a sufficient number of samples. We also cannot provide value for very rare problems that occur only once or twice, or that recur so infrequently that VAIF cannot actually catch them. Third, if the search space that we populate in the offline phase is incomplete, then VAIF will not have complete visibility into the trace points it can enable, so we require periodically repopulating this search space during integration tests.
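Returning to the two search strategies described at the start of this section, here is a small sketch contrasting them on the A/B/C/D example from the talk. The span hierarchy and selection logic are simplified stand-ins of our own, not the prototype's Rust control plane:

```python
# Minimal sketch: hierarchical search enables all child spans of the slow span;
# flat search enables only the middle child by happens-before order (binary-search-like).
hierarchy = {"A": ["B", "C", "D"], "B": [], "C": [], "D": []}  # parent span -> children in order

def hierarchical_search(problem_span):
    """Go one level down: enable trace points for the children of the slow span."""
    return hierarchy.get(problem_span, [])

def flat_search(problem_span):
    """Enable only the middle child, halving the region to inspect next round."""
    children = hierarchy.get(problem_span, [])
    return [children[len(children) // 2]] if children else []

print(hierarchical_search("A"))  # ['B', 'C', 'D'] -> enable the child spans
print(flat_search("A"))          # ['C']           -> halve the search space
```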
Even with these limitations, our evaluations show that we can still provide value for OpenStack, HDFS, and the Uber and DeathStarBench social-network trace datasets, and we are able to find interesting performance bugs in these applications.

So let's go through our implementation. We wrote two prototype implementations, for OpenStack and HDFS, by modifying their tracing infrastructures, which are OSProfiler and X-Trace. Our modifications to the existing tracing infrastructures are really minimal: we just add conditional checks before the trace points to check whether they are enabled or not. We also have a control-plane component, which is intended to be modular; for example, for both HDFS and OpenStack we use the same control-plane component, which is implemented in Rust. In our experiments, we created search spaces for OpenStack and HDFS using a workload generator we custom-built. Our generator uniformly issues a mixture of request types, such as create, list, and delete; we randomly select the number of concurrent workers per round and then populate our search space.

In the experimental evaluation, we use different systems and large trace datasets. We let VAIF run against these applications and show that it is able to find different, interesting problems in them. We have two large distributed applications, OpenStack and HDFS, and two trace datasets, from Uber and the social-network application. Our setup has nine nodes running Ubuntu in a CloudLab environment, and each node has an eight-core CPU and 64 gigabytes of memory.

We first investigate the trade-off between the flat and hierarchical search strategies, trying to understand whether VAIF can actually find problems using these strategies and how they differ from each other. To do this, we choose random edges that correspond to random code locations in OpenStack and inject problems there: the chosen region in the source code is delayed following a log-normal distribution, similar to prior literature. The first figure shows the average number of trace points enabled to localize problematic regions for both search strategies. We find that both strategies localize problems within 15 trace points on average, out of the thousands available in the search space. We also observe that the flat search strategy improves performance over hierarchical search. The second figure on the right shows the distribution of trace sizes for VAIF compared to default tracing, which we call vanilla, in OpenStack. We find that VAIF finds problems with 100% precision while reducing trace sizes by up to 90% on average compared to the vanilla case. So in the vanilla case, a developer looking at the traces faces thousands of trace points in a single trace, whereas with VAIF she only sees the critical-path trace points, enriched with the significant trace points that are actually needed to diagnose the performance problem; in this case, the traces are about 90% smaller than the original ones. Overall, VAIF serves as a fundamental step for developers in diagnosing performance variations and localizing them to a specific area of the system. The table on the right provides an overview of the cases we found during our experiments using three different applications. We find that VAIF reduces trace sizes for all these applications by about 90% on average while finding interesting performance bugs.
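The "conditional check before each trace point" change described above might look like the following sketch. The function and variable names (enabled_trace_points, record_span, trace_point) are ours for illustration, not OSProfiler's or X-Trace's actual APIs:

```python
# Minimal sketch: guard each trace point with a conditional check so the control
# plane can turn it on or off at runtime with negligible overhead when disabled.
import time

enabled_trace_points = set()          # updated by the control plane each round

def record_span(name, payload):
    # Stand-in for real trace emission (e.g., writing a span to the tracing backend).
    print(f"{time.time():.6f} span={name} {payload}")

def trace_point(name, payload=""):
    if name in enabled_trace_points:  # the added conditional check
        record_span(name, payload)

# The control plane flips a trace point on when a hypothesis needs it:
enabled_trace_points.add("Cache_Lookup_Start")
trace_point("Cache_Lookup_Start", "key=vm-42")   # emitted
trace_point("StorageNode_Read_Start")            # disabled: only the check runs
```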
For the sake of time, let's just discuss one case study from the table. All instances on OpenStack can be listed using the VM list operation. We set up VAIF with OpenStack and let it run, and after a while we find that VAIF identifies a problem in the VM list request, which shows high variance. When we analyze VAIF's hypothesis forest, we find that it localized the problem source to three different edges. From there, we find the following: two of the edges correspond to the identity service, Keystone, which is used for authentication tokens, and the other edge corresponds to a function, get_all, that spans about 2,000 lines of code and performs numerous database lookups to retrieve every instance, including deleted ones. We also corroborated these findings in bug reports, so these are actually existing performance bugs in the version of OpenStack we are using. In this case, we find that VAIF helps diagnose performance problems by isolating latency first to a specific service and operation, and second to an inefficient function.

In this work, we presented VAIF, a distributed tracing framework with variance-based control logic that automatically adjusts instrumentation in response to observed performance problems. We demonstrated the efficacy of our implementation by using it to diagnose problems in OpenStack, HDFS, and the social-network and Uber trace datasets. Thank you again very much for listening to my work.

All right, so thank you for your talk. That was really, really cool. I'm not seeing many, or really any, questions in chat, but we can give things a minute or two. In the meanwhile, I'll just make the announcement: right now we have a 30-minute break, after which we'll resume the track. Our next talk will be at 3 o'clock Eastern time, by Alexander Bulekov, on bending input space to fuzz virtual devices and beyond. Sounds very complicated, so I'm assuming it's going to be super interesting. I guess we did just get a question, so I'll read it out. Heidi asks if there are any plans for VAIF to be applied at the infrastructure level. Yes. We had this in mind from the beginning, although it has been very challenging to merge application-level traces with kernel-level traces, because even in the application layer there are thousands of trace points, which already makes it a long and hard task to localize performance problems. But starting this summer, we've been exploring ways to stitch these traces together, and there is actually ongoing work to do so. Once we can get them stitched together, you will have requests going through your system, and for each request you will have trace points from the application layer and from the kernel layer, all in a single graph. If we can get this, I am very hopeful that VAIF will be able to find very interesting bugs and performance issues in an application even when the problem resides at the infrastructure level. Let's give people a second for follow-up questions. It'd be nice if it popped up in chat. Oh, there you go. So she says you could also just look at the infrastructure, even without the application being traced. Yeah, that's also correct.
One of the things that we've been doing, actually, which I didn't include in this presentation for the sake of time, is the following. There are metrics collected at the infrastructure level, right? Depending on the application, developers sometimes include these metrics in the traces already, so in that case we have lots of labels: the traces can include tags, or key-value pairs, which correspond to these metrics or indicators, whatever they are. In that case, we can correlate a key-value pair with the end-to-end latency. In our paper, we have one case study where there is a contention issue due to the number of concurrent workers exceeding a limit, and the requests keep getting slower and slower. Once we have this key-value pair, or tag, and correlate it with the end-to-end latency, we are able to find that this is actually what is causing the problem. Similar to that, infrastructure-level tags or metrics can be correlated with the end-to-end latency. If you're talking about not using traces at all, that's also possible, but then you get a little bit disconnected from the application layer. The last comment was that it would be very exciting to try it on the MOC. Yeah, I agree. And thank you for your insightful questions. Well, thank you, everyone. I'll see you all in half an hour. Take care. Bye-bye.
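For readers curious about the tag-to-latency correlation mentioned in this answer, here is a minimal sketch under assumed trace and tag names (the concurrent_workers tag and the data are hypothetical); it is illustrative only, not the paper's case-study analysis:

```python
# Minimal sketch: correlate a numeric tag carried in traces (e.g., a hypothetical
# "concurrent_workers" tag fed from infrastructure metrics) with end-to-end latency
# to check whether it explains an observed slowdown.
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

traces = [
    {"tags": {"concurrent_workers": 2},  "latency": 0.12},
    {"tags": {"concurrent_workers": 8},  "latency": 0.35},
    {"tags": {"concurrent_workers": 16}, "latency": 0.80},
    {"tags": {"concurrent_workers": 32}, "latency": 1.90},
]
workers = [t["tags"]["concurrent_workers"] for t in traces]
latency = [t["latency"] for t in traces]
print(pearson(workers, latency))  # close to 1.0 -> contention likely explains the slowdown
```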