So today I'm talking about Apache HTrace, which is a new big data project that's been proposed to the Apache Incubator. Just let me know if you can't hear me or if you can't see something.

Let me say a few words about myself. I work for Cloudera on HDFS, which is the Hadoop distributed file system. Previously I worked on the Ceph distributed file system, and I've worked on a few other projects as well.

Today I'm going to talk about some of the motivations for HTrace: why we felt HTrace was important to create. Then I'll give a brief overview of the architecture and some of the thinking behind it. After that, I'll talk about using HTrace from two points of view. One is a more developer-focused point of view: how do you add HTrace instrumentation to your project? The other is a more user-focused point of view. Hopefully we can cover both of those. We'll also talk about the HTrace community, which is a pretty vibrant community, and I have a demo today. Hopefully we'll have some time at the end for questions and answers, but if you have questions along the way, feel free to ask them then.

So, big data in 2016. The last few years have been really huge for big data. The volume of data continues to grow: we used to talk about petabytes, and now we talk about exabytes. We used to talk mostly about MapReduce; now we're talking about projects like Spark, Impala, and other SQL engines. We have things like RecordService and Kudu, lots of new projects on the horizon that really expand the capabilities of the big data stack.

But at the same time, we've seen a lot of challenges. As clusters get larger, managing them gets more difficult. We've seen people with thousands of nodes, multiple thousands; it's not just Yahoo anymore. We've seen lots of density: what happens when you have 20 hard disks per node rather than 10? And those disks are much larger than they ever were before; we will soon see things like 10 terabytes per drive on the market. We've seen latency targets get lower. An overnight MapReduce job used to be pretty acceptable, but now people would like to see latencies in minutes, sometimes even seconds, so that becomes even more of a challenge. Manageability and monitoring continue to be things we need to work on, and as the number of projects and moving parts grows, we have to work on them more. And we're seeing more heterogeneous clusters: flash, shingled magnetic recording drives, lots of different hardware. You can no longer assume that every node looks the same, or that every node just has a pool of disks that all look the same.

And of course, the stack is getting more complex. Just to give you a flavor of what a big data stack in 2016 might look like, here are two example stacks, and these are just examples; many more are possible. You might have something like Impala, which is a SQL engine, running on top of HBase, which is a key-value store. HBase of course uses HDFS as its underlying data store, and all of that sits on top of Linux. Or alternately, you might see something like the Hive SQL engine on top of Spark, which is an execution engine, on top of something like RecordService, which is basically a security enforcement service. That in turn would be on top of HDFS and Linux. And all of this is Hadoop.
And I guess I'm showing this just to give you a flavor of the fact that Hadoop is not just MapReduce and HDFS anymore. It hasn't really been that for many years, and the number of projects in the ecosystem continues to grow.

So diagnosing these distributed systems is pretty tough, and there are a lot of different reasons for that. One of them is that because we have these timeouts and fallbacks in the system, we hide errors as much as we can. If you can bury the error, that's normally a good thing; failure is bad. But because we take so much effort to bury errors, it's actually sometimes hard to diagnose performance problems. If one node is misbehaving, the cluster will try to recover from that, and everything will just run slower. So unless you're really paying attention and monitoring closely, you may not know that your performance is suffering as a result. And of course, performance problems are often not repeatable. They can result from hardware failures that are intermittent, from configuration problems that are intermittent, or from load. All these things combine to make performance problems some of the most challenging problems to debug in these systems. And when we have many different projects and many different nodes, we have to be able to follow what's going on across all of them, because just because you're seeing slowness in Hive or Spark or Impala doesn't mean that's the project at fault. It just means that's where the problem is manifesting itself.

So here's a really small example of all the different components that might be involved. Let's say you have an HBase client which is writing to HDFS. That's all this is. You already have the HBase client involved, you have the HDFS client code involved, you have the name node involved, and three data nodes. So we're already looking at basically five different daemons and potentially three or four different nodes. If we have a problem with one of these data nodes, we're not necessarily going to see it there. We're going to see it at the client, and then we're going to have to find a way to trace it back to where it came from.

So there are a lot of different approaches that we have today and have historically had. Metrics are obviously one. I guess everyone's familiar with tools like top, vmstat, and iostat. They give you an idea of what's going on on a particular node. Is the CPU pegged on this node? Is the disk pegged? What's the distribution of disk requests? And metrics can be pretty sophisticated: they can include max, min, variance, median, all that. We've also developed sophisticated systems to store and aggregate these metrics. Something like Cloudera Manager will store them for a long time and down-sample the older ones, so that you use your bounded storage space for metrics in an intelligent way, and there are other systems too that will manage Hadoop. JMX, of course, is very important for Java processes, which most Hadoop daemons are, though not all. So metrics are really good for getting an overall view of throughput. And metrics can even be application-defined: how many transactions per second am I doing? How many files per second am I creating? It doesn't have to be just disk or CPU.
Metrics are usually not very good at identifying latency problems, though, especially for particular requests. Averages can hide significant outliers. Min and max are one way of coping with this, and so is variance, but even if you have those numbers, it's hard to figure out why you're seeing a particular metric. For example, you might see low disk I/O, but you don't necessarily know why. It could be because you're experiencing errors that are bottlenecking the writes or reads. It could just be because there's not much to do at that moment. It could be because of some higher-level thing, like not balancing writes across the cluster very intelligently. So metrics tell you what, and they give a good idea of aggregates, but they really don't tell you why. You have to figure that out in other ways.

And of course, that's where things like log files often come in. Daemons pretty much all generate log files, and increasingly they generate multiple log files. You have things like the HDFS audit log, which is a record of what operations have been done. You have log4j files, which of course can be tuned: you can turn the logging up or down for particular services or facilities within the daemon. Clients have log files too, so the HDFS client is also logging. The important thing about log files is that they're stored on the nodes that generated them, pretty much always. And this is another case where you have only a bounded amount of space to store them, which translates into: you only see logs back to a certain time. With log files, it's a little harder to down-sample than it is with metrics, so that's often an issue. We often don't have the logs we would like. Log files are really good for getting detailed information about what's going on at a particular point in time, but they're not necessarily good for getting a holistic view of a single request, because for performance reasons, you can't afford to log everything. And even if you do, you have to have some way of correlating what's happening in these different log files across the whole cluster. As I showed you earlier, even for a single write you might already have five servers involved, so you're doing a lot of manual work to correlate these logs. And increasingly, we're seeing logs split into more and more files, because a single log file can be a big performance bottleneck. So we're not only sharding by host and by service, but also by what the log is: is it an audit log, a block log, a log4j log, and so forth.

So the HTrace approach is a little different from either of those, and I think it's complementary to them, filling in the gaps. The HTrace approach is to follow a specific request or set of requests across the entire cluster. The idea is that you can follow the request across network boundaries, but also project boundaries. It doesn't just stop at the border of HBase or the border of HDFS; it can go all the way down. And we would like tracing to be end to end. So what does end to end really mean in this context?
It means multiple cluster nodes, multiple projects, and if necessary, multiple languages. We're seeing C and C++ get more and more popular as big data implementation languages, so we need an HTrace client for those languages as well, and integration for those as well. HTrace also tries to use the available compute and storage stack as much as possible, because we can't make assumptions about what is and is not available. Or rather, we can, but that would really limit our user base. So our goals are supporting multiple backends and not being tied to any one RPC framework or language. Just within Hadoop itself, we already have Hadoop RPC; we have people using Thrift, KRPC, Akka, tons of RPC frameworks. If you start from saying tracing is tightly tied to this one RPC framework, it's already a non-starter for Hadoop and for big data. We want to have a stable and well-supported client API that can easily be integrated with different projects. We want to have zero impact when not in use, so that we can actually use this on production clusters. And it's also very important, I think, to get integration into upstream projects, so that people don't have to do a one-off to trace their workloads. You would be surprised by how many people do write a one-off to do tracing in Hadoop, and that's something we'd like to avoid when it's possible to avoid it.

So spans are basically an interval of time. One way of looking at it is that you annotate your program with information about when trace spans begin and end, and there's a parent-child relationship between trace spans. (Sorry, the PDF export kind of messed this slide up.) The idea is that this is a copyFromLocal operation, and in the course of doing this copyFromLocal operation, we did a create file system operation, which ended; then we did a globber operation, and the globber operation in turn did its own thing. So it's a little like a flame graph, if you're familiar with that, or a little like viewing the stack over time.

So a trace span, as I said, represents a length of time, and trace spans have a number of attributes. They have a description, which tells you what's going on. They have a start time, stored as milliseconds since the epoch, and an end time as well. Trace spans have parents, unless they're the top-level span. To keep trace spans unique across the whole network, we have a 128-bit, randomly generated ID. We also store the process ID and IP address of each trace span, and this is configurable, so if you want to store the hostname instead, for example, you can. Trace spans also let you add arbitrary annotations if you want to add extra information that's specific to your system.

So sampling is an important concept in HTrace, because tracing every single request would generate too much data in most cases. Tracing every request could be useful for testing, for example, but it's not really going to be useful in production, because it's just going to be too much. So normally, we sample less than 1% of requests, and that rate will depend on a few things: how verbose the tracing is in the system you're tracing, and how much bandwidth you have to store and process traces.
But in general, less than 1% is the ballpark here. We also like to trace things at the whole-request level. The intuition is that if I want to see a request, I want to see the whole thing. I don't want to see gaps in the request, holes that I can't look into, which is what would happen if I did sampling at the level of each individual span. So we want to trace an entire request or not trace it at all. We don't want to go halfway.

So we have a pluggable architecture for HTrace. Basically, the HTrace core API is the one that you integrate with, and span receivers are things that plug into the HTrace core API and process the spans. We also have a web interface that can be used to query the span receiver. The idea behind this separation is that if you're a developer, all you have to care about is the core API. You don't have to care about the span receiver or the GUI; you just integrate with the API. It also tends to minimize the set of dependencies we're pulling in, which is pretty important, especially in big projects like Hadoop.

So we have a few different span receivers. The two most important ones that I'll talk about today are the local file span receiver and the htraced span receiver. Those are the ones we've gotten the most mileage out of. The local file span receiver, as the name implies, stores spans in files in the local file system. The idea is that you can post-process the files later with whatever tools you like, and it's also useful just to test things out and see what kind of spans you're getting. We have a JSON format for our spans, which is pretty well defined, and I think this makes it a lot easier to write tools on top of HTrace; having a well-defined JSON format makes that a lot easier. You can see that our span IDs are 128 bits, stored as hex; our begin and end times are stored as numbers; and we also have the process ID in there. There can be even more than this. This is what you would see in the files; the files would not be pretty-printed like this, but this is the basic idea. (There's a sketch of what one span might look like below.)

We also have the htraced span receiver, which is a little easier to use, because it stores the spans in a central daemon rather than putting them in files on each host. The idea here is to have indexing, a web UI, and aggregation all in one place, on the htraced daemon. The htraced daemon is written in Go. As far as RPC goes, it receives its spans serialized as MessagePack for greater efficiency. We also find that MessagePack and JSON are really easy to convert to one another; they're sort of isomorphic in a way. htraced exposes a REST API so that you can have command-line tools query it, and the web apps can query it via the REST API as well. It also handles overload pretty gracefully. As far as storage goes, htraced stores its spans in LevelDB, which is analogous to the HFile in Hadoop: it's an LSM tree, so it's optimized for very high write throughput. We use multiple LevelDB instances so that we can take advantage of multiple disks, which nearly every big data cluster will have. We also index certain fields, such as begin time, end time, duration, and span ID, and having those indexes makes queries a lot faster.
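Coming back to the local file span receiver's JSON format for a second, here's a rough, pretty-printed sketch of what a single span record might look like. The compact field names and the sample values here are illustrative only (the key names have shifted between HTrace versions), so treat this as a flavor of the format rather than a specification:

    {
      "a": "349f5cbd90d4b1e2af8b8d45e2f45f27",
      "b": 1424813988381,
      "e": 1424813988444,
      "d": "ClientNamenodeProtocol#getFileInfo",
      "r": "FsShell/10.20.212.10",
      "p": ["49f5cbd90d4b1e2aaf8b8d45e2f45f27"]
    }

The 128-bit span ID and parent ID are the hex strings, the begin and end times are milliseconds since the epoch, the description says what the span measured, and the tracer ID field carries the process and host information mentioned above.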
And of course, LevelDB persists its data to disk, so we don't have to worry about being limited by the size of memory or anything like that.

So the third component is the htraced graphical interface, and I'm going to give you a demo of this later today. The basic idea is that we have the ability to run queries over all the spans that are stored on htraced. You can add multiple predicates to these queries, so you can select, for example, a range of time by setting both a begin and an end time, or select only spans that have a certain description or that come from a certain node. And once you have those spans, you can follow them, so you can see where the parents and children were.

So let's talk a little bit about using HTrace. I'm going to start with how you add HTrace support to an application. After that, I'll talk about configuring HTrace, which is more of a user-level concern, and finally about using the HTrace web interface.

Adding HTrace support to code is really easy. Basically, you just link against the HTrace core jar, or if you're using C or C++, you can link against the libhtrace.so library. You have to have some way of hooking up your configuration to HTrace, and that's a system-specific thing: however your system is configured, HTrace needs to be able to access some of that configuration information. In the example of Hadoop, we usually configure things via XML files, so you would need to give HTrace a way of looking at that, and it's actually really simple to do. After that, you add HTrace spans to measure the important events. Basically, for requests in your system, you'd like to create a span for each request, and have spans underneath that. I think the most difficult thing is probably making sure you're not tracing too big a chunk at a time. You want a span that's small enough to be reasonable to look at, but big enough to be interesting, and that's the more challenging part. And there are annotations you can add for system-specific information, which can be very helpful.

A lot of applications will also need to pass parent span IDs over the network. Let's say you start an operation on the FS client and you want to be able to trace that operation on the data node. You'll have to pass the span ID that you're currently using over the network, so that the data node can create a new span which has that span as its parent.

So I'll talk a little bit about the core API here, which has a few different concepts. Probably the one you would encounter first is the Tracer. A Tracer creates trace scopes. Tracers also have their own sampling configuration, so you can do sampling on a per-tracer basis if you like. Tracers are thread-safe, so you can have multiple threads using the same Tracer at the same time; it doesn't matter. Trace scopes manage the trace span for a particular thread, and they're created by tracers. So, for example, you would call tracer.newScope to create a calculatePi scope right before you call calculatePi, and then you would have something like a finally block that closes out that scope.
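To make that concrete, here's a minimal sketch against the HTrace 4 Java API. The class, the tracer name, and the inline configuration values are hypothetical, and the configuration plumbing varies by application; the shape to notice is: build one Tracer, open a scope around the work, and close it on every path.

    import org.apache.htrace.core.HTraceConfiguration;
    import org.apache.htrace.core.TraceScope;
    import org.apache.htrace.core.Tracer;

    public class PiService {
      // One thread-safe Tracer for the whole component. The name and the
      // configuration source here are made up for this example.
      private final Tracer tracer = new Tracer.Builder("PiService")
          .conf(HTraceConfiguration.fromKeyValuePairs(
              "sampler.classes", "ProbabilitySampler",
              "sampler.fraction", "0.01"))   // sample roughly 1% of requests
          .build();

      public double tracedCalculatePi(long iterations) {
        // TraceScope is Closeable, so try-with-resources plays the role of
        // the finally block described above.
        try (TraceScope scope = tracer.newScope("calculatePi")) {
          scope.addKVAnnotation("iterations", Long.toString(iterations));
          // Any scope created inside calculatePi on this thread will
          // automatically get "calculatePi" as its parent.
          return calculatePi(iterations);
        }
      }

      private double calculatePi(long iterations) {
        /* ... the actual computation ... */
        return 3.141592653589793;
      }
    }

For the cross-network case mentioned above, the client would additionally ship the current span ID along with the RPC, and the server would open its scope with that ID as the parent.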
And that way you know exactly how much time that calculatePi operation took. And because there is thread-local data involved here, if you create a new scope inside the calculatePi function, it will automatically point back to that trace scope as its parent.

So of course we have the concept of spans that we've been talking about, and there's an object that represents those as well. You typically don't have to deal with spans directly, but you sometimes can. For example, you can add a key-value annotation to a span, you can find the span ID of a span, and so on. htrace-core also has various functions that are basically wrappers. For example, the TraceRunnable object wraps a Runnable, so that when you call the run method, it starts a trace scope, and when the run method exits, it closes the trace scope. It's really just a convenience if you'd like to write a little less code. There's also a TraceCallable, which is very similar, and a TraceExecutorService, which wraps everything it executes in a span.

There are a few other internal classes that you usually don't have to deal with, but just for completeness, I'll describe them here. We have a Sampler object, which determines which spans to sample. SpanId is the 128-bit ID, which is unique for each span across the network. TracerPool manages a group of tracers, and it's basically only really useful if you want to do certain things in resource management. These classes are all related: the tracer pool owns a tracer, the tracer owns a sampler (potentially more than one), tracers create these runnables and scopes, and spans own span IDs. So this is pretty much the complete API, and it's actually a pretty small one.

I'm going to shift gears a little and talk about how you would configure HTrace in an application that supports it. I'm going to talk about Hadoop specifically in a few cases, just because that's the first application we're considering, but a lot of this also applies to any application. The first thing you want to do is determine which span receiver you want to use. Do you want to use the local file span receiver? Do you want to use htraced? There are others as well; there's an Accumulo one, and there's an HBase one. Then you want to run whichever daemons are necessary for that span receiver, and finally, set up the configuration.

The configuration is a bit system-dependent, because we would like HTrace to be integrated very tightly with the systems that it's tracing. If we're integrated with Hadoop, we want to use the Hadoop configuration; if we're integrated with Kudu, we want to use the Kudu configuration, et cetera. In the case of Hadoop, basically what you need to do is set the hadoop.htrace.span.receiver.classes key. And it's important to note that for any configuration key that starts with hadoop.htrace, we basically cut off the hadoop.htrace prefix and pass the rest on to HTrace. So it's like saying everything under hadoop.htrace is an HTrace configuration key that we just pass through. Hadoop also has the ability to configure things on a per-tracer basis, so if I want to trace just the data node, I can do that as well, by setting a specific configuration key prefix. These prefixes determine which service the configuration goes to. (A sketch of what this looks like follows.)
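As a concrete, hedged sketch, here's roughly what that looks like if you set the keys programmatically on a Hadoop Configuration instead of in core-site.xml. The key names below match the Hadoop 2.8-era tracing documentation, but they have moved around between releases, so check the docs for the version you're running:

    import org.apache.hadoop.conf.Configuration;

    public class TracingConfig {
      public static Configuration withTracing() {
        Configuration conf = new Configuration();
        // Everything after the "hadoop.htrace." prefix is stripped off
        // and handed to HTrace itself.
        conf.set("hadoop.htrace.span.receiver.classes",
            "org.apache.htrace.impl.HTracedSpanReceiver");
        conf.set("hadoop.htrace.sampler.classes", "ProbabilitySampler");
        conf.set("hadoop.htrace.sampler.fraction", "0.01");
        // Per-service prefixes work the same way; the prefix just selects
        // which tracer the keys apply to.
        return conf;
      }
    }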
hadoop.htrace is sort of the global configuration prefix that says, in Hadoop, everybody gets this. There's a link, which is cut off on this slide but I'll show later, that provides more information about this. Another thing you need to do is add the span receiver jar to your class path. For example, if you're using htraced, you would need to add the htrace-htraced jar to your class path. There are a few different ways to add things to the class path in Hadoop, which I'm not going to go into here, but the basic idea is that the core API is always on the class path, while the span receiver needs to be put on the class path depending on which one you choose.

So HTrace has a really great community, and it's not just Cloudera. It's also companies like NTT Data, Hortonworks, Facebook, and a bunch of other people from the Apache community. It's a true open source project, so anyone is welcome to contribute, and we'd love to get more contributors. It's also licensed under the Apache license, obviously, so it's very compatible with the rest of what's going on in the ecosystem. We've done a few different releases. The first Apache release was 3.1, and then we had 3.2 after that. In the last few months, we've done a 4.0 release and a 4.0.1 release. The 4.0 release had a bunch of API cleanups that really made things a lot better. Especially when you're tracing a library, 4.0 is much better at that; it makes many fewer assumptions about globals and things like that. We basically got rid of all the globals we possibly could, to make things easy for library authors.

We're really big on sharing ideas with other big data projects like Hadoop and HBase. There are other projects in this ecosystem too, like OpenTracing, and X-Trace, which is an academic project. Twitter has a project called Zipkin, which is similar to HTrace in some ways. We have the ability to send spans to Zipkin, so we have that kind of integration. And we'd really like to get more ideas and more use cases from everywhere. I think that's one of the best things about open source: how easily you can do that.

Some of the recent work we've done: we added more effective error checking in the HTrace client. This was a really big sticking point for some people, because they were forgetting to close spans, and there needed to be a better way of detecting that. We have an optimized RPC format now for htraced, which is nice for reducing the volume of data over the network. We have better integration with HDFS; our HDFS integration is probably the best integration we have in any project, and we're very proud of that. We have a new GUI for visualizing spans. The user interface is actually one of the newest components, and I think it's one of the most important, because being able to see what's going on adds a lot of the value. We added the ability to tag trace spans with IP address or hostname. As I said before, this is configurable, but knowing exactly what IP address something came from is another thing I think is super important. And we extended the span ID to 128 bits. The original span ID in the 3.x series was 64 bits, but the problem is that, due to the birthday paradox, collisions occur a lot more frequently than you intuitively think they should.
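Just to put numbers on the birthday paradox (these are my figures, not from the talk): with n random IDs drawn from a space of 2^d possibilities, the probability of at least one collision is approximately

    P(collision) ≈ n² / 2^(d+1)

So with 64-bit span IDs, a deployment that has accumulated a billion spans is already at about (10^9)² / 2^65 ≈ 2.7% odds of a collision, and the odds grow quadratically from there. With 128-bit IDs, the same billion spans give about 10^18 / 2^129 ≈ 1.5 × 10^-21, which is negligible at any realistic span volume.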
And of course, collisions only really tended to happen once you had an enormous volume of spans, but with a 128-bit ID space, collisions are basically impossible now. So we've really future-proofed ourselves in that regard.

Cloudera has also made HTrace available as a Cloudera Labs project. Basically, Labs is our way of shipping things that are in beta that people can try out; it's analogous to similar programs at other companies. Right now, we support HDFS tracing, and we're hoping to add HBase very soon. We've made RPMs and debs available for htraced, to make it easier for people to actually run this stuff. Because, believe it or not, most people don't actually like to compile things from source and upload them to their cluster. Maybe a few people do; I don't know, maybe some people here do. But it's certainly nice to have the packages. We also have some minimal integration with Cloudera Manager. We'd like to have better integration, but we have the basics done, which is nice.

So we have a bunch of things planned. We'd like to improve the HTrace integration in HBase and bring it up to HTrace 4; right now, we only have some old code using the previous HTrace version. We'd like to add a few more annotations to the Hadoop span data; as we use HTrace more, we find more things that we'd like to know and don't. Supporting more span receivers would be nice; supporting the HBase span receiver would be pretty cool. We'd like better integration with cluster management systems, certainly CM and potentially others too. We'd like to improve the C and C++ support more; the basics are all there, but testing is something we need to do. And it would be really nice to have an aggregate view for spans. This is an open design question: what kind of aggregates do we want? It's certainly easier to do a streaming aggregate, like counting the number of spans in the last hour, doing a moving window of some kind. Doing a completely freeform, configurable aggregate is more difficult without tighter integration with one of these SQL or execution engines. So we're going to take a look at how we might do that and where we'd find the most value.

So, I have a demo here that I'm going to show. Let's see. Okay. This is a demo of tracing HDFS writes, and it's similar to what I showed you in the first few slides: we have a write to HDFS, and we want to trace what's going on. Before I start, I'll give you some really basic background. HDFS is a distributed file system, and by its nature, it will distribute data across the network. The goal is that we don't want data to be all on one node, as in a traditional file system, because if that node went down, we would lose the data. We would also lose the parallelism of being able to process the data on multiple nodes. So the goal behind HDFS is certainly to distribute the data. In this case, we're going to use the FsShell process to write to HDFS. When we do that, there are a few different steps. When FsShell creates a file, it asks the name node, which is the single-node service that stores HDFS's metadata. After it has asked the name node, it talks to the data nodes. Basically, HDFS sets up a write pipeline: the client passes a chunk of data to the first data node.
That data node writes the data locally in parallel with passing it along to the next data node in the pipeline. The analogy we like to use is a bucket brigade: people passing water down to the end of the line, then passing the empty buckets back. It's not a perfect analogy, because of course we're not just passing the data along at each data node; we're also writing it to that local data node. But hopefully it gives a flavor of what's going on. The specific case we're going to test here is one where one of the data nodes is slow, and how we can determine that that's the case.

So, here's me running hadoop fs -ls. It might be a little hard to see, but this is my shell command up here, and here's a bunch of files that are in the directory. Here's htraced, and you can see that htraced is running; you can see things like the git hash here, the server release version, all that. So we're going to do a query in the web UI to see if we can find the trace spans we just created. We'll query by time, and we'll also add a predicate that the description is ls. So here we found a trace span, and we can take a look at what's going on. We can see that the top-level span here is an ls, so the amount of time taken for the whole ls is here. We can see that a large portion of that time is taken by create file system, which is setting up the Hadoop FileSystem object, and that after that, running the globber takes up some time. The globber ultimately ends up talking to the name node here, as you can see. In this particular case, the name node is on the same machine as the client. That normally wouldn't be the case, but it is in this test cluster: they're all on 10.20.212.10. You can see that as I move the cursor, it changes the current time that I see here. And you can double-click on a span to get more details: the precise begin and end times, the precise duration, and also the arguments. The arguments are a system-specific thing we've added. So, what arguments was the ls running with? Not very interesting here, but it would be more interesting in most cases, where we're not just doing an ls of slash.

So, now we're going to do something a little more interesting, which is running a copyFromLocal operation, which will copy some local files into HDFS. This little snippet of shell here is just generating a unique ID, so that I don't have to worry about overwriting a file that already exists. You can see that this one took about 4.4 seconds, or sorry, 5.5 seconds maybe, whereas this one is definitely taking longer; it took at least an extra second. So we'd like to figure out why one of these requests took longer, since, after all, we did the exact same thing both times. In this cluster, we have four different data nodes as well as a name node. We'd like to see whose performance is slow, why it's slow, and so on. The query we're doing here is looking for tracer IDs that contain "DataNode", which should give us all the trace spans from the data nodes. We're again also filtering by time, so that we don't see things that happened a long time ago. In this case, we're looking at a writeBlock span, and you can see that this one had a very high value for the max write-to-disk milliseconds annotation. Basically, what that means is that one of our writes to disk took a very long time.
And 227 milliseconds is a very long time for a write to disk to take. So 10.20.212.32 seems to be having some disk problems here, whereas if we look at some of the other trace spans, we see that they're only seeing around one millisecond of disk write time. Probably 1.5 or something that got rounded down, but certainly nowhere near 200 or 250. So the idea here is that we want to find out why this particular request was slow. And here's an example. Here are all the data nodes we have in the cluster. It's a little hard to see, the font is kind of small, but we have basically .14, .12, .16, and .32. So if we log into 10.20.212.32, we see that things are actually not going very well: we have a really high disk load on this node. So here's an example of using metrics, right? This is a very primitive example; normally you'd use something like CM. But this is a really basic way of getting metrics: just looking at iostat. You can see that there's almost nothing going on on one of the disks, but there's a ton going on on another, and that's really the explanation for why you're seeing such a high maximum write latency: that disk is just really swamped. And I can tell you that I did that deliberately; I had a tool generating tons of I/O load on that one data node. So that's what you're seeing, and that's what's reflected here in the fact that this request is taking eight seconds and this one is taking, well, not quite half that, but much less time.

I'd also like to say that these times are longer than they would normally be. Every time you run a shell command in Hadoop, it gives a misleading idea of Hadoop's performance, because you're spinning up the JVM, and that's a substantial amount of time. But because we're paying that cost in both of these cases, we can compare the two cases to each other without any bias there.

So here's another example of me looking at the data node spans, and we're going to try to find the top-level request that they came from. Here's a copyFromLocal that took just about six seconds, and you can see that 10.20.212.32 was involved in this one; it was the last data node in the pipeline. And again, it has that really high write latency. So this corresponds to the second request we were looking at. Now, if we look at a request that took less time, we see that it only took about 2.5 seconds. If we take a closer look at this request, we see that all the write latencies are really low, for one thing, and we also see that the 10.20.212.32 data node is not involved here. So having that data node involved clearly blows up the time. And this is exactly what we're looking at here, right? The last node in the pipeline was really slow, and it slowed down the whole request with its long tail of latency. We can see this clearly in the web UI, whereas if we were just looking at metrics, it's a little harder to see. And certainly, if we're looking at logs, it can be harder still, because we would need to somehow correlate the logs for that copyFromLocal across multiple nodes. We could do that, and I've done it in the past, but it's easier to see everything on one screen than to look up each log file individually. So these things are complementary, right?
Once you determine that a request is slow, you can start looking at metrics, you can start looking at logs, and you can start examining the rest of the system to dig further into why the request was slow.

So, cool. Any questions? This is our question and answer time.

[Audience question about Consul.] Console? I guess I'm not completely familiar with console. Are you saying console is a particular configuration management system, or what? Okay. Oh, Consul. Okay, I thought you said "the console," and I'm like, well, yeah, I guess you can use it from the console. I think the smartest thing for us to do is to piggyback on the configuration of whatever system we're tracing. So if that system uses Consul, then absolutely. If it doesn't, then I don't think we would want to require Consul just to do that. I mean, it's not something that I would veto or anything, but if I'm an admin, I just want to do the config that I'm already doing. For example, we're working on integrating this into YCSB, which started life as the Yahoo! Cloud Serving Benchmark, something like that, and is a very popular big data benchmark. Its configuration management is different from Hadoop's, because it's not a Hadoop project, so we want to integrate with that project's config, which seems to be mostly based on Java properties. So again, if your project likes using Java properties for config, we'll use that. If you like using XML files, we'll use that. It's just a layer of glue you have to put in to use whatever you want.

[Audience question about C++ adoption.] No, not currently. We have some that are planned, but currently we don't have any project using it in the C++ world. We'd like to, and we wrote the client. We spent a lot of time thinking about the API, by the way. I'm an ex-C++ programmer, so I sort of know my way around library APIs in C++, and hopefully we've done everything we need to do to make that all work. We have some plans for next projects, but I can't say anything about them yet. Correct. Yeah. We would love to have more people kicking the tires on that.

[Audience question about accounting for where time is spent.] That's a good question. Accounting is an interesting way of putting it. HTrace is really about requests at the moment. The idea is that we annotate requests and follow them across the network. Requests are slow for a variety of reasons: maybe they're slow because of the network, maybe because of disk. I think maybe this gets to the heart of your question: if you experience a really long latency, you can annotate the span with that latency. One of the approaches we took in HDFS was to annotate each trace span with the maximum disk latency we saw during that span. We could take the same approach with networking too, right? We could annotate with the maximum network latency we saw. Typically, one way we see network latency is when a span on one node starts a lot later than a span on another node; you start to ask questions about why that happened. If there's normally almost no delay between starting them, but then there is a large delay, you have to wonder why. I think your question also touches on the aggregation question, too: what should we be aggregating?
I have a friend who's working on trying to run MapReduce jobs over some HTrace output, so he can try to basically find patterns, which I think is a really interesting idea. If you have this uniform sampling going on, you can even use it to find the answers to questions like: this job, on average, makes writes of exactly what size? So I wouldn't say that HTrace is just about latency. Latency is probably the most important use case, but there's other stuff we can find out once we have these annotations.

[Audience question about X-Trace.] That's a really good question. X-Trace, I think, is more of an academic project. I haven't heard of X-Trace being deployed, although maybe I just missed it and it was deployed somewhere. It's really hard for me to say never, because everything's deployed somewhere, right? But I haven't seen a lot of activity there. I get the impression X-Trace was a more academic endeavor, which had many good contributions and is older than many of these things.

[Audience question about Zipkin.] Well, Zipkin is an interesting project, which is still going on; it's still somewhat alive. And I actually talk to the people behind the Zipkin project a lot; Adrian Cole is the main person behind it at the moment. Zipkin started from a little bit of a different point of view than HTrace. Zipkin started out very tightly integrated with the Finagle RPC system at Twitter. I think they've been trying to move away from that, but that was where they started. Zipkin also started from the perspective of a very fixed tech stack: you will use this stack, period. Again, that's something they're trying to migrate away from. So one way of looking at it is that we're trying to gain their features, like a web UI, and they're trying to gain our flexibility, and maybe we'll meet somewhere in the middle. One of the more important distinctions between HTrace and Zipkin at the moment is that HTrace supports the concept of multiple parents for each trace span, whereas Zipkin does not. There's some debate about whether we should use different ways of representing these relationships, but currently that's our way of representing them. There are a few other distinctions, such as the data model being a little different. Because Zipkin enforces a tree model, they have a trace ID that covers the whole tree, whereas we don't really have that. I'm trying to remember what the other distinctions are. Zipkin doesn't have Hadoop integration of any kind, obviously, so a lot of the people who are using it with Hadoop are just hacking things in and keeping that patch on top; they don't necessarily contribute that patch back to the community. So integration is something that, well, that's one reason we added the ability for HTrace to send spans to Zipkin: so that people using Hadoop could also use Zipkin if they want to. I think people should have the freedom to choose what they want to do. But yeah, they're similar-ish projects, and they both have similar-ish focuses. We're trying to do a little bit of standardization. Well, not really standardization, but there's another effort, called OpenTracing, where people are trying to create a tracing API that can sit on top of other systems. That's a work in progress, I would say. I think it's a little bit like POSIX in a way. Personally, I'm really happy with a lot of the decisions we made in HTrace, but on the other hand, I also think people should have the ability to use what they want. So, good question.
[Audience question about Impala support.] Well, I would love to see it in Impala, but we'd have to spend some time doing that. One of the things we're thinking about is how we can effectively correlate what's going on in a higher-level system with what's going on in HDFS and HBase. For example, if we have a MapReduce job or an Impala query, we don't want that whole thing to be one trace span, because it would just be enormous. But on the other hand, we don't want it to be untraced, so we have to think about the best way to do that. Probably we'll end up having some way of correlating things by an ID that we can match up. That's probably the best way to go about it.

[Audience question about whether the HDFS support is upstream.] The annotations in HDFS are upstream. They're in the 2.8 release of Apache Hadoop, so it's not just a vendor thing, although I would argue the support is best in our distribution; that's just the way it is. But it's upstream. We strongly believe in doing things upstream at Cloudera. Similar to how Red Hat does things, we actually get the patch into the upstream release before we put it into our release. On the other hand, we also take things from, let's say, an unstable branch at times when we feel they're ready. Just to give one example, which isn't related to HTrace: the native MapReduce task stuff is in trunk right now, but it's also in CDH5, because we feel that it's stable, even though we didn't manage to convince the upstream community that it is. But we do have an upstream-first policy that's very robust.

[Audience question about whether HTrace traces itself.] That's a good question. No, HTrace is not instrumented with HTrace. Circularity is kind of a problem in a lot of tracing systems. It's especially a problem when your tracing system is implemented on top of your normal storage back end. It's not an insurmountable problem. I've heard people argue that the best systems are the ones that use your existing storage and your existing compute infrastructure, and there's a lot to be said for that, but it also does create circularity problems. And I would argue that tracing HTrace with HTrace would certainly create circularity problems like that, because you don't want to create an infinite number of traces based on one input, right?

All right. Other questions? Oh my gosh, we're going to finish... oh, wow, almost on time. All right, thanks a lot, everyone. Thanks for coming by today. And if you have any more questions, just send me an email: I'm cmccabe at apache dot org. We'd love to see more people in the community, more people asking questions and trying it out. It'd be great.