Ladies and gentlemen, may I request you all to kindly settle down. Next up, we have a very fond mentor of mine and a passionate PhD student at Northeastern University, Mania Abdi. She is going to talk about end-to-end tracing in Ceph with Jaeger.

Hello, everyone. My name is Mania, and today I'm talking about end-to-end tracing in Ceph using Jaeger. This is joint work between the MOC, Northeastern University, BU, and Red Hat.

Ceph is a large-scale distributed system, and it consists of many nodes. It is highly praised by the community and has been deployed in many data centers around the world. Now let's look at the Ceph architecture and how Ceph works. We start from the clients: client nodes read data from and write data to the back end, and there are many of them. Then there are the OSD nodes, which store the data within Ceph, and the metadata servers, which store the metadata for CephFS. Ceph clients communicate with Ceph storage through the RADOS API. Finally we have the monitors, which maintain the cluster state.

These facilities alone are insufficient for debugging, so let me give you an example of debugging a problem in Ceph. Consider a very simple write request that a client issues to Ceph. Even a write of a tiny amount of data first issues a metadata check, and that check involves many clients and many OSDs. Then the client writes the data to the back end, which again involves several OSDs. Then it updates the metadata and commits the write, which involves yet more OSDs, clients, and nodes. Now assume a problem happens somewhere: the request crashes or something gets slow. With so many OSDs and components involved in this one operation, how can we find out which OSD or which component has the problem? This is one problem with logging: we cannot extract that information from the logs. Even if we can find out which node has the problem, we have to look through logs that can amount to many gigabytes of data, and digging through them, especially stitching together the different parts of the log that make up the problem, is overwhelming. Logs are also unable to show the communication between nodes.

This is where end-to-end tracing comes in. End-to-end tracing is an approach that reconstructs the flow of each request, and as you can see in this figure, it creates a flow for the request that the user issues. It is becoming extremely popular. For example, OpenTracing provides a consistent API across software for implementing end-to-end tracing in a system, and one of its implementations is Jaeger. The Ceph community has also started thinking about end-to-end tracing in their system, and Ceph has Blkin as its end-to-end tracing infrastructure. However, Blkin has some limitations. It's great that they have started using tracing, but we think it is quite limited and that it is better to use an approach that can provide more advanced functionality.
So here, if you look at the table, you can see the different features of tracing that we want to support, comparing Blkin with Jaeger as an OpenTracing tracer.

Let's look at advanced features, such as advanced visualization. Can we have advanced visualization with Blkin? No, because Blkin has its own specific way of presenting requests, and we cannot plug in any visualizer we want. Can we use Blkin in production? Unfortunately not. To use Blkin we need to turn tracing on and off, which does not work in a production system: to debug a production system you should not have to rerun the procedure and reproduce the logs. The OpenTracing ecosystem, in contrast, enables always-on tracing using sampling. The next point is the standard API: Blkin provides its own API and its own way of tracing. However, in a data center with many applications and many components involved in a single request, having a bespoke kind of tracing is not good; it is better to have all components traced with a standard API, and Jaeger, through OpenTracing, provides that. Finally, can we leverage community improvements with Blkin? Blkin is specific to Ceph, whereas Jaeger and OpenTracing are used by a large community of open-source software, so they keep gaining new advanced features and improvements to the tracing infrastructure.

So in this work, we are trying to add support for the OpenTracing infrastructure as quickly as possible by layering OpenTracing underneath the existing API. Let me first give an introduction to how tracing works, and then look at how Jaeger and Blkin each implement this infrastructure.

Let's take a look at the anatomy of tracing, starting with the client. Users issue requests to clients, and in the client code we have trace points. Each trace point is handled by the tracing API, which associates two pieces of metadata, or context, with it: a span ID and a trace ID. A collection of related trace points is called a span. Then we have the tracing agent, which receives trace points from the tracing API, caches them temporarily on its own node, and sends them on to the tracing backend. The tracing backend is responsible for collecting traces from all the components in the system, stitching together the traces that belong to a single request, and storing them for later use.
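To make the span and trace IDs just described a bit more concrete, here is a minimal, purely illustrative sketch of the kind of context a tracing API attaches to each trace point. The struct and field names are assumptions made for illustration; they are not the actual Blkin or Jaeger definitions.

```cpp
#include <cstdint>

// Purely illustrative: the context a tracing API associates with trace points.
// Field names are hypothetical, not taken from the Blkin or Jaeger sources.
struct SpanContext {
    uint64_t trace_id;        // shared by every span belonging to one request
    uint64_t span_id;         // identifies one span (a group of related trace points)
    uint64_t parent_span_id;  // links a span to its parent, giving the request tree
};

// A trace point is a timestamped record tagged with this context; the backend
// groups records by trace_id to reassemble the end-to-end flow of a request.
struct TracePoint {
    SpanContext ctx;
    uint64_t    timestamp_us;
    const char* message;
};
```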
Now let's look at how Blkin implements this infrastructure and how Jaeger implements it. Starting with Blkin: if you look at the figure, we have the OSDs and RADOS as our clients, and we have trace points generated by user instrumentation. The Blkin API takes those trace points and, using Blkin's functionality, associates tracing context such as a span ID and trace ID with them. Blkin then sends the trace points to LTTng, which works as the tracing engine for Blkin and temporarily stores and caches the tracing information on disk and in memory. We then use Babeltrace together with Zipkin to collect the data from all the nodes in the system, stitch it together, and present the request workflow. So here, Babeltrace and Zipkin together act as the tracing backend.

Now look at the Jaeger architecture and how Jaeger works. Again, we have the OSDs and RADOS as our client nodes, and we have trace points. These trace points go through the OpenTracing API, because Jaeger is implemented underneath the OpenTracing API: when a trace point is hit, the OpenTracing API associates the span ID and trace ID with it and hands it to the Jaeger agent. The Jaeger agent stores the trace points temporarily and then sends them over UDP to the Jaeger collector. The collector gathers traces from many different nodes, stitches them together, and provides a unified view of each request in the system. Before the collector writes the data into its distributed storage, or any other storage backend the user provides, it first queues the data in memory and then flushes it. Jaeger also provides sampling for us, and there are different types of sampling available; in one of them, the Jaeger collector pushes the sampling policy out to the agents and asks them to sample requests accordingly.

One of our steps in replacing Blkin with Jaeger was mapping the Blkin API onto the Jaeger API. It was not a one-to-one mapping, and there were some complications. Let's look at how we start a span in Blkin and how we start one in Jaeger. In Blkin, we have to first call the trace function and then the init function to initialize a trace; this pair appears in many different variations across the Ceph code and has to support different calling styles. We replaced it with the StartSpan function from OpenTracing and adapted the surrounding code to start spans that way instead of using Blkin spans. Then Blkin has the event and keyval functions: event annotates the span with a timestamp, and keyval annotates it with an integer or string value. The OpenTracing API, on the other hand, supports Log with arbitrary key-value pairs, so we are not limited to a specific kind of annotation; we can attach anything we want. OpenTracing also has init_tracer, which attaches an application component to the tracing engine present in the system; because Blkin is not always-on and has to be turned on and off, it has no such functionality. For propagating context between two different components, OpenTracing has the inject function, whose closest Blkin counterpart is encode_trace; again this was not a one-to-one mapping, and we needed some modifications to support inject. The counterpart of extract is decode_trace, which is used by the receiver of a request to extract the context sent by the other node and build a new span from it. Again, this was not a one-to-one correspondence between the extract function and decode_trace, and we needed some modifications to make it work.
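As a rough sketch of what the OpenTracing side of this mapping looks like in code, assuming the opentracing-cpp API: StartSpan in place of the Blkin trace/init pair, Log and SetTag in place of event/keyval, and Inject/Extract in place of encode_trace/decode_trace. The operation names, tags, and carrier handling below are made up for illustration.

```cpp
#include <sstream>
#include <string>
#include <opentracing/tracer.h>

// Sender side: roughly what replaces the Blkin trace/init, event, and keyval calls.
void handle_write_request() {
    auto tracer = opentracing::Tracer::Global();

    // StartSpan takes the place of the Blkin trace + init pair.
    auto span = tracer->StartSpan("osd_write");  // operation name is illustrative

    // Log replaces event (timestamped annotation) and keyval (key/value annotation).
    span->Log({{"event", "journal_commit"}, {"bytes", 4096}});
    span->SetTag("pool", "rbd");

    // Inject replaces encode_trace: serialize the span context so it can travel
    // with the message to the next component.
    std::ostringstream carrier;
    tracer->Inject(span->context(), carrier);
    // ... send carrier.str() along with the request ...

    span->Finish();
}

// Receiver side: roughly what replaces decode_trace.
void handle_incoming(const std::string& wire_context) {
    auto tracer = opentracing::Tracer::Global();

    // Extract rebuilds the caller's span context from the wire format.
    std::istringstream carrier(wire_context);
    auto parent = tracer->Extract(carrier);
    if (!parent || !*parent) return;  // no usable context; proceed untraced

    // Start a child span so the backend can stitch the two components together.
    auto span = tracer->StartSpan("replica_write",
                                  {opentracing::ChildOf(parent->get())});
    span->Finish();
}
```

The ChildOf reference in the receiver is what lets the backend stitch the sender's and receiver's spans into one request tree.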
Now let's take a look at our implementation. We wanted to make the least amount of modification to the system. If you look at the figure, on the left is the Blkin architecture and on the right is the Jaeger architecture. We did not modify any Ceph components in this first round of implementation; the Ceph components stay the same. We kept the Blkin API and use it for both the Blkin path and the Jaeger path, but we replaced the Blkin functionality underneath it: instead of inserting trace points into LTTng, it sends them to the Jaeger agent. We enabled a Jaeger agent on each node, connected our Jaeger code to that agent, and the Jaeger agent in turn sends data to the Jaeger collector. In effect, everything after the Blkin API was removed and replaced with Jaeger functionality. Our modification is almost 300 lines of code. However, for adding new trace points, because we want to be compatible with the OpenTracing API and with the open-source community, we use the OpenTracing API directly rather than the Blkin API.

What made the modification hard was that there was no consistency in context propagation, so we could not get an exact one-to-one mapping between the components. Also, to provide context propagation, the Blkin instrumentation modifies the signatures of Ceph functions, which is not what is desired, and we would like to implement a more robust approach.

Here is an example of the traces. In this figure, each row shows a single span, and the length of each bar shows the latency of that span. It's a little small, but you can see that we are also able to annotate the latency of each span within the trace. The other thing it shows is the hierarchy visible through the spans, which captures the causality, the happens-before relationship, between the different trace points and spans in the system.
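To illustrate the wrapping approach described above, keeping the existing Blkin-style call sites while forwarding to an OpenTracing tracer underneath, here is a hypothetical sketch. The class and method names mirror the Blkin functions mentioned earlier (init, event, keyval), but this is an illustrative sketch, not the actual Ceph patch.

```cpp
#include <memory>
#include <string>
#include <opentracing/tracer.h>

// Hypothetical wrapper keeping a Blkin-style surface while forwarding to an
// OpenTracing tracer underneath. Names and structure are illustrative only.
class LegacyTrace {
    std::unique_ptr<opentracing::Span> span_;
public:
    // The Blkin-style "trace + init" pair becomes a single StartSpan underneath.
    void init(const std::string& name, const LegacyTrace* parent = nullptr) {
        auto tracer = opentracing::Tracer::Global();
        if (parent && parent->span_) {
            span_ = tracer->StartSpan(
                name, {opentracing::ChildOf(&parent->span_->context())});
        } else {
            span_ = tracer->StartSpan(name);
        }
    }

    // Blkin-style event(): a timestamped annotation on the span.
    void event(const char* what) {
        if (span_) span_->Log({{"event", what}});
    }

    // Blkin-style keyval(): key/value annotations, string or integer.
    void keyval(const char* key, const char* val) {
        if (span_) span_->SetTag(key, val);
    }
    void keyval(const char* key, int64_t val) {
        if (span_) span_->SetTag(key, val);
    }

    ~LegacyTrace() { if (span_) span_->Finish(); }
};
```

The point of such a shim is that existing call sites keep compiling unchanged while the data flows to the Jaeger agent instead of LTTng.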
Now take a look at our evaluation. We evaluated the disk overhead, the memory overhead, and the CPU overhead, because every tracing infrastructure comes with an overhead, and first of all we wanted to see what overhead we impose on the system.

For our experimental setup we had two types of nodes. The Ceph cluster ran on a physical node with 64 gigabytes of RAM, a 10 gigabit-per-second link, and two HDD drives. For the collector node, where we collect all the traces from the system, we used a virtual node with 32 gigabytes of RAM, 12 virtual cores, and a virtual disk. We ran each experiment for 15 minutes; during those 15 minutes we issued several read and write requests, but there is also background activity in Ceph that generates traces, and we wanted to capture those traces as well. So we defined a time frame of 15 minutes and gathered our statistics over that window.

First, the disk overhead for writing and reading data and for the background activity. The x-axis shows the 15-minute duration and the y-axis shows the amount of data generated due to tracing. We ran the experiments at different sampling rates: 20 percent, 50 percent, and 100 percent. As you can see, the relationship between the different sampling rates is not linear, because sampling is a probability-based model. The other thing visible here is the bursty pattern, sending nothing and then sending a batch of data, which comes from the caching on both the agent side and the collector side.

Next, the memory overhead imposed by our infrastructure. The x-axis again shows the 15 minutes, and the y-axis shows the memory overhead in megabytes. Again we ran the evaluation at 20, 50, and 100 percent sampling. The memory overhead differed only slightly between the sampling rates and was not large: less than about 1.1 percent of the overall system memory.

Then we looked at the CPU overhead. The x-axis again shows the 15 minutes and the y-axis shows the CPU usage in percent, for sampling rates of 20, 50, and 100 percent. Our maximum CPU overhead was almost 8 percent, on the collector node.

For future work, we see several directions. First of all, we want to use the existing log messages: the Ceph code has a rich body of log messages covering almost all components, and we want to use those messages to annotate our traces, because that information is really valuable and reproducing it would take a huge effort, so we prefer to reuse it. The second direction is adding more trace points to other components: Ceph has several components, but not all of them have trace points today, and we are trying to add more trace points on the client side to be able to capture more sophisticated traces. We are also interested in capturing synchronization points, because to present a correct view of the system we need to know when concurrency ends, when concurrent threads merge back together and reach a single point, and we need to capture that synchronization point. Finally, we want to find frequent patterns within the traces, to give users better debugging opportunities: if we can retrieve the frequent patterns, we can detect what went wrong in the other patterns without looking at whole traces; we can look at specific parts of a trace to detect the anomaly across many traces.
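For reference, the init_tracer step mentioned earlier, attaching a component to Jaeger with one of the sampling rates used in this evaluation, might look roughly like the sketch below with jaeger-client-cpp. It follows that library's documented YAML-style configuration, but the service name, agent address, and exact fields are assumptions and may differ between versions.

```cpp
#include <string>
#include <yaml-cpp/yaml.h>
#include <jaegertracing/Tracer.h>

// Sketch: attach a Ceph component to Jaeger with a 20% probabilistic sampler.
// Based on jaeger-client-cpp's YAML-style configuration; values are illustrative.
void init_tracer(const std::string& service_name) {
    constexpr auto config_yaml = R"(
sampler:
  type: probabilistic      # or "remote" to let the collector drive the policy
  param: 0.2               # roughly 20% of requests sampled
reporter:
  localAgentHostPort: 127.0.0.1:6831
)";
    auto config = jaegertracing::Config::parse(YAML::Load(config_yaml));
    auto tracer = jaegertracing::Tracer::make(
        service_name, config, jaegertracing::logging::consoleLogger());

    // Register globally so opentracing::Tracer::Global() returns this tracer.
    opentracing::Tracer::InitGlobal(
        std::static_pointer_cast<opentracing::Tracer>(tracer));
}
```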
As a summary, we enabled advanced end-to-end tracing in Ceph using Jaeger. We replaced Blkin with Jaeger, kept the existing Blkin trace points as they were, and added new trace points using the OpenTracing API. Our CPU and memory overhead was less than 1 percent, and our disk overhead was less than 60 megabytes per second.

Is this open source? Is it possible for me to take what you've done, put it on a Ceph cluster, and run it? We are planning to get it upstream, and yes, it's an open-source project. But aren't all the bits already available somewhere on GitHub? They are available in my GitHub, but not in the upstream Ceph code. I can share the repository with you; yes, it's all available. Thank you.

Did you actually make modifications to Jaeger, or was it all just adapting to its API? We only adapted to the Jaeger API. But for adding synchronization points we would need to modify Jaeger, because currently Jaeger is not able to capture synchronization points, and we want to add that feature to Jaeger as well, because it's very important for us.

You mentioned that context propagation is hard, and it sounds like it slowed you down. If that were resolved, how much faster would this go? The W3C Trace Context spec is open, and I suspect it will get approved probably by the end of October; would that make this easier in the future? What was so hard about context propagation in Ceph was the way Blkin and its instrumentation points were implemented: they modified the Ceph function APIs, and that makes it much harder to get rid of those parts. What we are thinking of using instead, and this is available with Jaeger, is thread-local storage: keep the context metadata in the local thread and retrieve it whenever we want to create a span, instead of modifying every API throughout the code. Currently that API modification is the way it's implemented, and it's not very good.

One more question: in general, distributed tracing is typically not mainstream today because it's very hard to add it to your code. This is the first example I've seen of someone who had tracing and tried to move to a different tracing system, which seems like an additional layer of complexity because you're rewriting something you've already done. Are there any tips or lessons learned from that which may be useful for others? The good thing about OpenTracing is that it is exactly the answer to this question: it provides a standard API for whatever tracing infrastructure you want to have underneath. You instrument your system against that API, and then it doesn't matter which tracing infrastructure you use, Jaeger or another one; since all the modules and software in your system are instrumented with the same API, you don't have the difficulty of moving from one tracing infrastructure to another.

Where does the sampling happen? In one of the pictures it looked like you were collecting all the trace points and then they were just getting stripped out, so it used less storage space? Jaeger actually provides several types of sampling.
One of them is remote sampling, which is driven by the collector: the collector decides on the sampling rate, and that is what we evaluated on our system. But there are other types of sampling provided by Jaeger that let each agent sample by itself, again with different sampling rates.

Well, what I mean is: when you're doing sampling, does that mean that only 20 percent of the trace points actually get turned on, or do they all get turned on and the data they generate gets thrown away? Where is the CPU spent? If we enable sampling on the agent side and define it as, say, 10 percent, we take every tenth request, generate trace points only for it, and propagate those. But if sampling is driven by the collector, which is the way we evaluated our system, the agent collects all the trace points, samples them, and then sends them on.

So the key to making this really efficient is to do the sampling at the agent. How come you couldn't do that? With sampling at the agent, we decide which requests are to be sampled and then send the metadata down; otherwise it is sampled on the Jaeger side. But why didn't you use that? That's the key to getting really low CPU overhead, right? Yes. So how come you didn't use that approach? Actually, I didn't have time to run more experiments; that is the only reason. So is it feasible? Yes, it's feasible; it's just a configuration option that I have to specify, a different configuration that we impose on our system.

Can you fit this into the broader context of why you're doing this? Is this just an alternative debugging tool, or what are the use cases for collecting tracing information? Tracing information is used for many things. One of them is debugging. Another is performance evaluation of the system: for example, if we make an optimization to the system and we have traces from previous executions, so we know how requests were executed before and how they are executed with our optimization, then we have a much better comparison between the two and we can evaluate our optimization better.

That's it. Any other questions? Sorry, I couldn't hear you. Any more questions? Walter? Thank you very much, Mania. That was a great talk.