Hello, everyone. My name is Vijay Samuel and I'm the architect of the observability platform at eBay. Hello, I'm Nick Portas and I'm the lead engineer of the observability platform at eBay. We're going to spend the next few minutes talking about how we use continuous profiling in our production environments to monitor our observability platform, which is mostly composed of Kubernetes deployments.

That said, let's spend a few minutes understanding what the pillars of observability are, what continuous profiling specifically is within that, and the architecture we use today to do continuous profiling. We'll go through some examples of how we are benefiting from continuous profiling, with real-world optimizations we have done and the savings we see from them. We'll do a quick demo, and then talk about the potential we see in continuous profiling, given that it's a very new space in observability, and how it can evolve in the long run.

So, that said, we are used to talking about the three pillars of observability. The first is logs. Logs can be structured or unstructured depending on nomenclature; you could even split this pillar into two, between logs and events, but typically they are strings generated by loggers like SLF4J or Zap. They are written into a file or shipped out over TCP, gRPC, or HTTP. Then we have metrics, which are continuous time-series data. For any measure in the system that you want to emit or expose periodically, you typically go for metrics; common examples are CPU usage, the number of requests you are handling, or the latency profile you see for a given request. Tracing gives you a record of request execution through massively distributed systems. So if you have a user interface making a bunch of API calls, and you are using distributed tracing across all the microservices of interest, you can elegantly see how the request flows through them. And the final one, the new kid on the block, is profiles. Profiles go all the way down to code-level consumption numbers, so you can see how much CPU you are spending or how much memory is being used at each and every method, and they give you an idea of how the program behaves over a period of time across the various functions involved.

So what is a profile? Profiles provide performance metrics at the most granular level possible. Historically, they have been viewed as very expensive, because what are called tracing profilers collect very granular information, and the way they collect it is typically very costly. You cannot keep running them all the time, because the application's performance would degrade substantially. There are various types of profiles you can take: CPU, heap, mutex, I/O, GPU, and then there may be kinds of profiles that are specific to a language. Go is a good example, where you also have a goroutine profile, with which you can analyze all the goroutines being spawned by a given program. You get information about why certain server resources are being consumed, with statistics down to the line number of your program. Continuous profiling is the concept of taking that action repeatedly over a period of time.
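To make that concrete for Go specifically: these profile types are exposed by the standard net/http/pprof package. A minimal sketch (the port is arbitrary; the /debug/pprof paths are the standard library defaults):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// With this endpoint up, individual profiles can be fetched on demand:
	//   /debug/pprof/profile?seconds=10  - CPU samples over 10 seconds
	//   /debug/pprof/heap                - heap allocations
	//   /debug/pprof/goroutine           - stacks of all live goroutines
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

A continuous profiler essentially automates hitting an endpoint like this on a schedule, which is the idea picked up next.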
So recently, with the advent of sampling profilers, profiling itself has become a lot cheaper. What a sampling profiler basically does is pick samples of the information you care about at a certain granularity, and that information can then be viewed over a period of time; you can aggregate it, trend it, and do many things with it. There are several products and projects that allow us to do this. Some of them are Pyroscope, Pixie (which is in the CNCF), and Parca, and profiles are catching on to the point that even OpenTelemetry has its own working group bringing profiles as a standard signal into the OpenTelemetry project.

In the context of continuous profiling, profiles can be collected in a few ways. The first is scraping: people familiar with Golang will know pprof, and a pprof HTTP endpoint can be periodically polled to collect all the samples for a given application. The other way is push, where an open source project or a vendor product ships a standard client, you add a few lines of boilerplate code to your application announcing which endpoint the profiles should be published to, and the application periodically pushes samples to the backend. And the final one is the instrumentation-free approach, where you use something like eBPF: a module is loaded into the kernel and is responsible for extracting all the profiles.

So how did we go about doing this? We looked at various open source projects and we landed on Pyroscope. It was relatively easy to set up, it had a very nice user experience for visualizing profiling data, and it also has some nifty features like side-by-side views, diffs between profiles, and even uploading ad hoc profiles to view in the same UI. Initially we started it as an experiment: we deployed Pyroscope in our pre-production namespace for load and performance analysis and let it run against our metrics platform, so that for any new build we were considering releasing, we could gain some insights from a profiling perspective. We saw a good amount of success, to the point where we now deploy it in our production namespaces as well, so that if there is an incident, the profiles are already there for us to view at a later point in time. Right now, given that Pyroscope uses Prometheus service discovery, we add pod annotations to all the Deployments or StatefulSets we want to collect profiles from, and we set a 10-second polling period.
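For comparison, the push approach mentioned earlier really is just a few lines of boilerplate. A minimal sketch using the open source pyroscope-go client (the import path and config fields are as documented for that client, so check the version you run; the application name and server address here are placeholders):

```go
package main

import (
	"log"

	"github.com/grafana/pyroscope-go"
)

func main() {
	// Announce where profiles should be pushed; the client then samples
	// and uploads in the background for the life of the process.
	if _, err := pyroscope.Start(pyroscope.Config{
		ApplicationName: "my.team.service",       // placeholder name
		ServerAddress:   "http://pyroscope:4040", // placeholder endpoint
	}); err != nil {
		log.Fatal(err)
	}
	// ... rest of the application ...
}
```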
So, as Vijay was saying, our current deployment architecture for this is nothing fancy. The primary motivator, as Vijay touched on a bit, was that we were looking at continuous profiling from two perspectives. One is that we want profiling data available in the event we need to do active triaging of something that requires it, because sometimes metrics or logs do not really show us what we need. The other is that when you run systems at high scale, you're always looking for places to save costs and make things run cheaper, and having profiles simply on hand whenever you want them makes it a lot easier for any engineer to go in whenever they have some free time, start poking around, and maybe catch things if they're lucky.

So our setup is, like I say, simple. We have multiple Kubernetes clusters internally, with different namespaces for our different observability offerings, and we just deploy a Pyroscope into each of those namespaces in each cluster to do the scraping. There's no requirement for us at the moment to do any type of querying across them, so it's a very vanilla deployment, configured to scrape whatever is local based on the annotations set on the pod specs, and we're good to go. We just have the profiling data available whenever we need it. Next slide, Vijay.

So, some typical questions just to get out of the way. As Vijay was saying, historically profiling has been considered somewhat expensive, and this is probably debatable depending on the runtime you're on. At least for us, our stack is primarily Golang, and it's incredibly cheap, in the sense that even polling every 10 seconds we don't really notice anything; if the profiling adds any CPU consumption to our services, it's lost in the noise and we can't even tell. I know there are some exceptions depending on the runtime, and things like block and mutex profiling can have an impact you have to be careful about, but generally speaking it's easy enough to just keep it on indefinitely.

In terms of storage for profiles, this is where we've had, I think, the most trouble, in the sense that they're surprisingly large, depending on how many pods are deployed into a namespace with the setup that we have. We've had to configure Pyroscope's retention and aggregation fairly aggressively so that we don't keep 10-second profiles around for very long, because otherwise we just run out of disk space on a single Pyroscope pod. That seems to be the largest challenge at the moment, and beyond that we rely on data sampled over longer intervals.

And then of course the question is: did we find any benefit out of this, as opposed to, for example, just going and doing ad hoc profiling, hitting a pprof endpoint and using the Go toolchain to look at things? The answer is yes, which we'll get into with a couple of examples here.

So the first one. This was an interesting example in that, like I was saying, because the profiles were so conveniently and readily available, this was something I noticed in the morning, just having coffee, looking at email, and poking around profiles because I was curious. This is from one of our ingestion services. In the picture at the bottom, a CPU profile, we can see that the top contender for CPU usage was an internal method of the Go runtime called mapiternext. And based on the call stack, this was coming from a part of the code that effectively resets a map's state, clearing it because we want to reuse the memory.
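For reference, the pattern in question looks roughly like this minimal sketch (function and variable names hypothetical):

```go
// resetState clears a map so its memory can be reused across batches.
// This for-range delete loop is the idiomatic way to empty a map in Go,
// and the compiler recognizes this exact shape and replaces the loop with
// a single call to runtime.mapclear rather than iterating key by key.
func resetState(m map[string][]byte) {
	for k := range m {
		delete(m, k)
	}
}
```

Seeing mapiternext rather than mapclear in the profile means the loop really was iterating, which is the puzzle worked through next.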
And it was quite surprising to see this show up in the profile, because for those who may be unaware of some of the Go internals and the optimizations the compiler does: the code we see in the context, a for-range loop that deletes each key from the map, is the idiomatic way in Go to delete everything out of a map. At first sight it seems wildly inefficient compared to other languages, like on the JVM, where you might have a clear or truncate method or something like that. However, the Go compiler is smart enough to see this pattern, and it actually optimizes it to not iterate over the map, calling something called runtime.mapclear instead. So the question came up: why is mapclear not being called here, which would keep this from showing up in the profile? Why is it calling mapiternext?

That's what we needed to figure out on this next slide. Because of the nature of this problem, there was no easy way, at least that I could think of, to figure it out without dropping into some assembly. The first step was: we can compile the source file containing the map-clearing code and see what the compiler tells us it's doing. And we can see that the compiler says it's going to emit a call to runtime.mapclear. So why is it not showing up like that in the profile? The next step was: okay, maybe something else is going on, so we'll compile a source file that calls that code, instead of just the source file itself, and see if there's a difference. And we can see that in this case, instead of saying mapclear for the same line of code, it says mapiternext. The particular function where we reset the map is small enough that it was probably being inlined by the compiler. So the question was: is inlining, for some reason, preventing this optimization from happening? That was an experiment to try out.

You can see a PR diff in the picture where we said, okay, we're just not going to inline this particular method, and see if the problem goes away. As some initial validation, we checked the original source file and made sure it still calls runtime.mapclear after we added the //go:noinline annotation. Then we did the same check on the other source file that calls it, and we can see that it now makes no reference to the runtime at all; it just references that method, since it's no longer being inlined. That means it should be hitting runtime.mapclear now. And after deploying the change, we can see that runtime.mapclear is in the call stack, instead of mapiternext. The interesting thing about this particular problem is how nuanced it was; given that it was the top CPU contender on this service, fixing it actually saved us 12% CPU for something that was, in a sense, so silly.

Okay, and then another example, which is memory related. In this particular case, it was the result of some premature optimization of my own that we caught after the fact. At the time we were rewriting a lot of our internal APIs, and a large majority of our API calls benefit highly from various types of memory pooling. So for this particular service and this API call, I blindly made the same assumption and just rolled forward with it.
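For context, the kind of buffer pooling being described looks roughly like this sketch with sync.Pool (names and sizes hypothetical, not the actual service code):

```go
package ingest

import "sync"

// bufPool recycles byte slices across RPC calls, so a buffer allocated for
// one response can be reused for the next instead of being reallocated.
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 0, 64*1024) },
}

func handleUpdate(payload []byte) {
	buf := bufPool.Get().([]byte)[:0] // take a pooled buffer, reset length
	buf = append(buf, payload...)
	// ... decode and process buf ...

	// The trap described next: if the first response is huge and every
	// later update is tiny, the pool keeps that huge buffer alive forever.
	bufPool.Put(buf)
}
```

(One Go-specific wrinkle: append may grow buf, so the slice returned to the pool can end up far larger than any later call needs.)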
But what we can see with this diff of an older version of the service and a newer one is that on the left side, the majority of the heap memory is being taken up by one particular method. This method really just manages a memory buffer for receiving updated RPC calls and things like that. In hindsight, for some context, it was a case where, upon startup and getting responses from a particular upstream service, the buffer might be really big initially, and then we would never see that much data come over again; you just get these tiny incremental updates. So it was probably a huge waste to have that whole thing pooled. On the right side we can see: well, if we just remove the pooling, what happens? And by removing that pooling, we were able to reduce the memory by roughly 50%.

Then, for some additional validation, we had three deployments of this going. On the memory side of the graph, we can see clusters 134 and 137 running the old version that does the pooling, and 129 running the new one without the pooling, and even the metrics show that roughly half of the memory is being used. And to check whether there was any CPU impact, on the right side we see that there is no notable CPU impact. So this was a clear case of having a pooled buffer for no good reason, which actually ended up doing us some harm.

Given that we see good benefit, we also wanted to do a quick demo so people are aware of how to use these profilers. So, that being said, let me move to my terminal. What I've done is clone the Pyroscope repository on my local machine; they have some good examples of how you can do profiling against various languages. I picked the easy one, which we are very familiar with: the Go example. They have a Docker Compose file using the Golang HotROD app, which is a very common playground app built to demonstrate various observability functions like metrics, logs, and traces. Pyroscope is deployed along with it, and Jaeger is deployed as well, so you can use the HotROD app, see sample traces, navigate to those traces, and, for this demo specifically, see the usage of Pyroscope. I already have these containers running so that some amount of profile data has already been built up.

In the browser, I go to localhost:8080, where I can see the HotROD application. I'll make some sample API calls to generate some traffic. With this application you could also open the Jaeger UI and look at the distributed traces, but let's go to the Pyroscope UI, which is on port 4040. When you come into the UI, it gives you the ability to apply certain tags and to look at the applications that are there. By default, Pyroscope profiles itself. We can see the HotROD application and all the profiles being collected for it: CPU, goroutines, and then the allocated and in-use objects along with their space. You can pick one of them. There is a time picker where you can select a time range and see the profile for it: how much time was spent in the various methods during that period. You can view it both in a tabular view and as a flame graph, which is what we are typically used to when we talk about profiles.
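As an aside, for the ad hoc upload flow mentioned next: in Go, profile files of this kind can be produced directly with the standard runtime/pprof package. A minimal sketch, with file names chosen arbitrarily:

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	// Capture a 10-second CPU profile to a file that can be opened with
	// `go tool pprof` or uploaded into an ad hoc profile viewer.
	cpu, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer cpu.Close()
	if err := pprof.StartCPUProfile(cpu); err != nil {
		log.Fatal(err)
	}
	time.Sleep(10 * time.Second) // the workload of interest runs here
	pprof.StopCPUProfile()

	// Snapshot the heap as well.
	heap, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer heap.Close()
	if err := pprof.WriteHeapProfile(heap); err != nil {
		log.Fatal(err)
	}
}
```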
There is the option of doing ad hoc profiling, where you can upload pprof files. You can do either a single view or a comparison; we'll take a look at that next. This allows you to do profiling without actually profiling continuously; you can always do it ad hoc as well.

Let's go to the comparison view next. The comparison view allows you to compare either two different instances, or even the same instance across different time periods. A typical use case would be: I have a known-good build running, I have a canary that was deployed as part of a new rollout, and the canary is not behaving well; how do I look at the difference between the two builds? Or there was an issue during a certain period; how do I know what was going on during the issue that was different from steady state? So we do the same thing here. Let's pick an instance, given that I only have one. We can do a timeline diff; let me pick one here, and I think this was the time when I ran a few API calls. Now you have a side-by-side view of the flame graphs, and at the bottom a side-by-side tabular view, and it gives you the opportunity to analyze what was going on at that point in time. The diff view shows you any differences; in this case, you can see more time being spent serving HTTP calls, because obviously we made a few HTTP calls at that point in time. More than the user experience of any given project or product, I think the important part is that we have the ability to collect all these profiles and store them in a place we can come back to at any point in time and view them as necessary. So, that being said, let's go back to the slides.

What is the potential that we see in all of this? First and foremost, one of the harder things to observe when you are deploying at scale: it's very easy to monitor a rollout, identify that KPIs or golden signals are going out of the ordinary, alert, roll back, and things like that. But what about slow bleeds, like memory leaks that progress over a period of time? A profile makes it very easy to analyze these kinds of slow bleeds. You could also put detection mechanisms in place that catch such a scenario and maybe roll back to the last stable build, and given that you can view profiles across various time periods, you also get the opportunity to compare and narrow down where exactly you think the memory leak may be.

So, three-click RCA; an RCA is a root cause analysis. With the concept of exemplars in metrics, it is now possible to start from an alert, view a metric, click on an exemplar attached to that metric to go to a trace, and from the trace go to logs. That's typically what's dubbed a three-click RCA. But with the advent of profiles you could take it a step further: as long as the trace has a reference to a profile, you can go from alert to metric to trace to profile, and then start observing what is happening at the method level. In theory, this should help drive a reduction in time to triage.

And the final one, where we have felt the most value in our limited but powerful use of continuous profiling, is continuous cost savings.
Whenever we write new code, or even just sit with a cup of coffee and look at how things are behaving in our production environment, these profiles give us the opportunity to really get down and optimize the code. That means we reduce the amount of CPU and memory we are utilizing and, in turn, stop wasting resources. So, that being said, we are happy to take questions offline. Thank you very much for your time. Thank you.