I want to first welcome everyone to the distributed track at RailsConf this year. I believe this is the first year we've had a distributed track, so I'm excited to give you a talk on distributed request tracing. A little bit about myself real quick: my name is Kenny Hawksworth, and I'm a software engineer at Twilio. I'm actually not doing any Rails development anymore, unfortunately, but I've learned a number of things from the scaling architectures behind Twilio's services, and I think a lot of that can come back to the Rails world. That's what I want to bring to you today. Before we get into the investigation that led us down the path of Twilio's architecture, I want to cover what distributed request tracing actually is. You can think of a distributed request tracer as a profiler for your distributed system, applied to each request as it enters that system. We have profilers for plenty of frameworks and programming languages that let you break down which function calls are taking so much time and what the exact stack trace is for each call. A distributed tracer provides the same kind of profiling for the request cycle as it enters a distributed system. In this diagram you can see a profile of a full trace running end to end as it entered and exited a system: the web service call that was the first call made in the distributed system, another service that ended up getting called as part of that, then a DB call, and finally, after everything was done on the web service side, a hand-off to a worker thread to finish up some processing.
Okay, so I work on the messaging team at Twilio. For those of you not familiar with what Twilio does: Twilio provides an API for voice communication, so you can send and receive phone calls via Twilio's API. We also provide a messaging API, so you can send and receive text messages and MMS messages, and for received messages you can register webhooks that call out to your servers to figure out exactly what to do with each message. It allows your servers to respond to the incoming message. The messaging stack at Twilio is set up in a distributed fashion, where every piece of work that can happen for an inbound or outbound message is provided by an individual service running on an AWS instance. At any given time there are nine different components all communicating from the moment a message comes into the service via the API until it's handed out to the carriers. And we ran into some problems. This is just a very basic overview: a message comes in, it's handed off to one server, then to another server, with some DB persistence going on along the way. Here's where we started to have problems. A distributed tracing system for this alone is kind of bland; you don't really need it when you're just handing a request off from one system to another. Profiling is still useful in the sense that it can give you more information about latencies and other issues occurring between your services, but where it becomes really useful is when you start scaling. Each service Twilio runs is scaled to meet whatever the current needs of the platform are, and at any given time we can have 10, 20, or however many nodes running for each individual service.
A lot of times we have customers, or support personnel, coming to us engineers saying, hey, we have a report from a customer who says that every 20 or so SMSes that go out, one sometimes takes two seconds instead of being near-instantaneous. Why is that? Other customers come to us and say, hey, sometimes when an MMS goes out it takes five seconds. Why is that longer than an SMS? And some customers will come and say, we tried sending a message and it completely failed. Tracking that information down can be difficult in a distributed system. You can do log parsing and go through logs, and you can use Logstash to get some information about the actual exception that occurred. But being able to recreate the entire event, from when it entered the system all the way to the end, is incredibly useful from an analytic standpoint and from a debugging standpoint, because you can figure out: I know exactly what code was running on this server, I know exactly what was running here, and I know that at this point this server was having problems because of a specific memory issue while all of the other servers behind that load balancer were unaffected. As an example, when you add distributed tracing, you can follow the exact path a message travels through your system. Distributed tracing obviously gives you information on performance, which we talked about previously; in this example you can see that a DB had 500 milliseconds of latency during a transaction, which is pretty bad. But another thing a tracing framework can give you is information on bad nodes. This is a problem that occurs a lot in a distributed system: a node goes bad. It might have run out of disk space.
There might be some sort of unplanned problem occurring on one of your boxes. If you're running a robust system, a failure should be retried, sent back through the load balancer, and hopefully routed to a good server. A lot of times you don't even know when these problems are occurring. There might be a slight latency introduced into your overall system, but usually there's not going to be an error message that gets back up to the customer, and there's not going to be an error message that you see, aside from some exceptions that might get thrown and show up in a monitoring alert. With a distributed tracing system, you can follow the exact path: see that the request hit one node and failed, hit another node and failed, and finally hit a third node and succeeded. Then you know exactly which nodes you need to go investigate and clean up, and you can plan to scale better in the future. Okay, so that was the Twilio example. A lot of you may be wondering whether what you run really is a distributed system. Essentially, if you're running a service with two or more components that talk to each other, even just a web service and a database, it's a distributed system, and the more services you add, the more distributed it becomes. Laying the groundwork for a good tracing framework is really easy if you start at the beginning, but it can be really difficult to go back and instrument all of your services after everything is already running. So I want to run through a quick little example. It's kind of silly, but it's what I'm going to use for the rest of the talk. I wanted to come up with a service we could use as an example, and I tried to think of something the internet uses and needs more of. It's pretty obvious.
We need more cats. Everyone needs cats. Cats need to be everywhere. The thing is, we didn't want to create another I Can Has Cheezburger or anything like that. We wanted to provide an API so people can go and retrieve whatever cats they want, send them out to customers or clients, and integrate them with their own services. So we have cats as a service, which is fantastic. Everyone loves cats, and I figured I would just play this. Well, I have to watch it again. All right, so that's what our service is going to provide: wonderful cats that attack children. Okay, so we start off with a very basic architecture. We're going to provide both an API and a web interface to our service, and then we're going to send things out through Twilio. I'm from Twilio, so of course I'm going to talk about Twilio. We allow our customers to retrieve cats and send them to their friends using Twilio; this is CatSpammer, and catspammer.com actually exists if you want to play with it. This is pretty simple, and you might think, I don't actually need to add distributed tracing to this; it's a very simple architecture. Well, let's say this is a service you put together as a side project, but here I am giving this talk, and let's say a VC is in the room right now thinking, oh my gosh, this is great, we really need to start sending cats all over the internet, which has already been done, but bear with me, and all we need is a social element. If he comes up to me later and says, I've got $50 million, go build this system for all the massive traffic we're going to bring in, and we want to add a new social element and all that sort of stuff, it suddenly starts to become a big problem.
So we start with our API and our web interface. We add a social interface, and I don't even know what a social interface for cats would be, but it seems to make sense. We add an authorization layer, because it might make sense to protect images of cats that people own; a worker process that goes out and builds cats; a media-fetching service that does de-duping and caching of any images that get retrieved; and a message-queuing infrastructure, which we'll call PurrMQ, that communicates throughout the entire service. I mean, it's web scale. We have to have a message queue. We'll still send out via Twilio, but we'd probably also send out via other social sites: Twitter, Fakeblock, any of those services. At this point you can start to see how a distributed tracing system could be fairly useful. We now have the API, the web interface, and the social interface as our ingress points. We never know which one customers are going to use, so for any request into our system it would be good to see where it's coming from. Once a request gets into the message queue it's even more difficult, because we don't know whether the authorization system is actually needed on each request, and we don't know whether the media service is going to be needed. Maybe the media service goes out, makes a call, retrieves a giant cat image, and ends up blowing the entire system up. And then we have multiple egress points. With that said, once you start scaling it, it becomes even crazier, because you don't even know where a request is going to end up starting. So I want to quickly talk about what makes a good tracing system. The first goal of a good tracing system is low overhead.
It's incredibly important for any tracing system to have as little impact as possible on the server resources of the system it's actually tracing. This seems like an obvious point; however, if you're talking about a production system that is tracing every single request, logging all of those traces somewhere, and sending them into another system, it has to be thought about from the very beginning. The second goal is that it needs to be scalable. This isn't as important when you're first starting to build out a distributed tracing system, but it needs to be scalable from the start so that developers don't have to worry about tipping over the tracing system as they add services. It seems silly, but developers don't want to sit there and think about tracing and all the metrics involved. They want it to be transparent, which brings us to the third goal: transparent instrumentation. This is by far the most important part of building a tracing system. By this I mean: if you have a distributed system with 20 different components, and you don't already have tracing as part of your system, you are not going to want your developers to go out and add tracing to every service, and not only every service, but every outbound call that occurs throughout the process. It would be a nightmare, and you're going to miss some calls; it's going to happen. With that said, I want to go through a couple of tracing systems that already exist. X-Trace is one of them. X-Trace has been around for a few years now, since the early 2000s; actually, all of these started as academic papers in the early 2000s. X-Trace is a system that currently has C++ bindings, and I think Java bindings as well.
There's no Ruby instrumentation for X-Trace, aside from a commercial distributed tracing product, AppNeta's TraceView, which I'm not going to demonstrate. There is one other commercial option, New Relic's cross-application tracing, but I just wanted to throw those out there. X-Trace, Magpie, and Pinpoint all follow the same basic design. They create a single unique identifier for the trace running through your distributed system, and then another unique identifier is created for each of the spans that occur, where a span corresponds to each service hit throughout your distributed system. You can then link them all together, parent spans to child spans, and build up an entire tree structure. Google has an internal system called Dapper, and they put out a paper a few years ago describing it. Dapper is very similar to X-Trace and Magpie, but with two key differences that were built in to make the system much more scalable. First, they implemented a low-latency logging system on the back end: as traces get created, a collector grabs them when system resources are available, but if the systems are getting pounded, the log collector backs off and doesn't do any collection. Second, Dapper introduced a sample rate, so not every single request gets traced. This is huge for Google because of the number of requests they receive. Dapper is a really nice system, but Google has the luxury of having built an entire networking RPC layer underneath all of their services.
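The trace/span model all of these systems share can be sketched in a few lines of Ruby. The names here are illustrative, not any particular framework's API: one trace ID shared by the whole request, a fresh span ID per service hop, and a parent pointer that lets a collector rebuild the call tree.

```ruby
require 'securerandom'

# One span per service hop; spans form a tree via parent_id.
Span = Struct.new(:service, :trace_id, :span_id, :parent_id)

def new_span(service, parent: nil)
  Span.new(service,
           parent ? parent.trace_id : SecureRandom.hex(8), # shared across the trace
           SecureRandom.hex(8),                            # unique to this span
           parent && parent.span_id)                       # link to the caller
end

root = new_span('api')                 # request enters at the API
auth = new_span('auth', parent: root)  # API calls the auth service
db   = new_span('db',   parent: auth)  # auth service hits the DB
```

Because every span carries the same trace ID, the back end can pull all spans for one request out of the logs, then use the parent pointers to reassemble the tree.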
So Google maintains control of all of its communication paths, whether that's HTTP communication or database communication; all of it goes through Google's underlying networking RPC layer. Because of that, they can integrate all of the instrumentation directly into the networking layer, which is critical to putting together a transparent instrumentation platform for developers. In Dapper's setup, all of the services running on individual nodes transparently trace and log out to a log daemon running on each box. There's also a Dapper collector service that goes through and collects all of the log files at specific intervals, and all of the log files end up getting sent to the Dapper UI and the Dapper service on the back end. Twitter took the Dapper paper and decided to build an open-source project off of it, which has been fantastic, because it opened this up for the rest of us. That project is called Zipkin. Zipkin has been around for a little while now, and it seems to bounce back and forth in the open-source community in terms of how much development is actually happening on it. Zipkin is also based on a custom networking RPC layer that Twitter introduced, called Finagle. Finagle is a really nice networking layer that has all the instrumentation built into it. The problem is that most of the clients and services are built in Scala, which doesn't match up very well with the Ruby and Rails world. You could use JRuby and somehow communicate through it, but for a larger enterprise using MRI, it makes sense to do the instrumentation outside of it. The good thing about Zipkin is that it's open source.
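We don't control an RPC layer in Ruby the way Google does, but one way to approximate transparent instrumentation is to prepend a module onto Net::HTTP so every outbound call picks up Zipkin's B3-style trace headers without touching application code. This is only a sketch under that assumption; real instrumentation would also record timing annotations and report spans to a collector.

```ruby
require 'net/http'
require 'securerandom'

# Injects Zipkin's B3 propagation headers into every outbound HTTP request.
module B3Propagation
  def request(req, body = nil, &block)
    # Reuse an inbound trace ID if middleware stashed one, else start a trace.
    req['X-B3-TraceId'] ||= Thread.current[:trace_id] || SecureRandom.hex(8)
    req['X-B3-SpanId']    = SecureRandom.hex(8) # fresh span for this hop
    super
  end
end

Net::HTTP.prepend(B3Propagation)
```

After the prepend, any gem or application code that uses Net::HTTP propagates trace context for free, which is the closest a Ruby stack gets to Dapper-style transparency.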
All of the headers and communication paths are open, and all of the Finagle services are written in Thrift, so it's pretty easy to instrument, or to create servers and clients that speak the language Zipkin expects. Zipkin also has a couple of really nice aspects. One is a pluggable back-end data store. The entirety of Zipkin is written in Scala, but it's actually incredibly easy to start up and get running, so you can go out and easily start doing some instrumentation in your Rails projects today. The pluggable data store uses SQLite by default, which is great for initial development. It's not very good for production traffic, but Zipkin also supports MySQL, Postgres, and Cassandra back ends, and I'm pretty sure Twitter has been using Cassandra pretty heavily as their data store. Zipkin uses Scribe as its log collector; Scribe is an open-source, low-latency log collection service. The other really nice thing about Zipkin's infrastructure is that it doesn't require all of the individual instrumented services to communicate via a Scribe collector. You can talk directly to Zipkin. So if you're doing development and just trying to introduce distributed tracing to your application, you can write directly to Zipkin's Scribe port and it works flawlessly, and that's what we'll do in the example in a second. The last point is that it's highly configurable. There are lots of tuning options; I already said the data store and log collectors are pluggable, and it can also use ZooKeeper on the back end to automate service discovery when you have different log collectors running on individual boxes. Zipkin's interface is pretty simple.
You can see here the profiling view of a request: at the top is the entire request, and each service that ends up getting called has its own entry, where you can go and look at each individual point throughout the trace. We'll look at a live example of that in just a second. So I want to go into an example of instrumenting a Rails application using Zipkin. It's actually incredibly easy; I'm a little embarrassed to be up here talking about the Rails portion of it, but at the same time there is some additional work that needs to be done to instrument things that don't already automatically integrate with Zipkin and Finagle. The initial setup is installing Zipkin, which is pretty straightforward. You do need a JVM running and you do need Scala installed. Everyone has probably installed a JVM at some point, and Scala is incredibly easy to install. Scala does take forever to compile all of Zipkin's services, which is a bit of a pain point, but you can just clone the Zipkin repo, and then each of these commands will start up one of the three individual services Zipkin provides. Zipkin provides the collector engine, the query engine, and the web engine separately, so you can put different services on different hosts depending on the load on your system. The Ruby setup is also pretty easy. We use the scribe gem, which allows us to talk to the Scribe log system, and the finagle-thrift gem, which is part of Twitter's Finagle GitHub repo. That introduces all of the trace IDs, span IDs, communications, and recording machinery for you, so it's all fairly simple and straightforward to do request tracing.
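The Ruby side boils down to a few gems in your Gemfile. Gem names here are as described in the talk; check current names and versions before relying on them.

```ruby
# Gemfile additions for Zipkin tracing in a Rails app.
gem 'scribe'           # client for the Scribe log-collection service
gem 'finagle-thrift'   # trace IDs, span IDs, and span recording
gem 'zipkin-tracer'    # Rack middleware for inbound request tracing
```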
And then finally, zipkin-tracer is a Rack middleware component, part of the Zipkin project, that automates the inbound portion of request tracing in a Rails service. As long as you're dealing with plain Rails services, it's incredibly easy: there's no extra configuration beyond installing these gems and adding the middleware, and we'll show that in just a second. The one difficulty is that the zipkin-tracer middleware hasn't been very actively maintained, and there are a couple of forks that actually work a lot better than the base library. So I'll show you right now. In the Gemfile I use for this, I grab from that specific fork; I can give that information to anyone who's interested after the talk, for sure. But it's pretty simple: you set up your Gemfile, then you create a config initializer to introduce the middleware into your Rails middleware stack. There is a little bit of configuration that needs to happen, but all it amounts to is: the name of the service you're creating, which here would be our CatSpammer API; the service port Zipkin runs on; the sample rate, a number between zero and one specifying the percentage of requests that should get sampled; and the Scribe server. In this case I'm pointing directly at my Scribe server; if you had a Scribe daemon running on each individual box, each component would be configured in Rails to point at the Scribe daemon on its own box. Okay, here's where a little bit of difficulty comes in: ActiveRecord, Redis, RabbitMQ, and any other communication path you want to use inside your service. Unfortunately, the current zipkin-tracer gem does not provide default tracing and instrumentation for those. So it's a bit of a womp-womp moment, but it's something that's actively being worked on.
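A config initializer along these lines wires the middleware in. The option keys are a sketch matching the settings just described; exact names may vary between zipkin-tracer versions, and the port, sample rate, and address values are placeholders.

```ruby
# config/initializers/zipkin.rb
require 'zipkin-tracer'

Rails.application.config.middleware.use ZipkinTracer::RackHandler,
  service_name:  'catspammer-api',   # name that shows up in the Zipkin UI
  service_port:  9410,               # placeholder port for this service
  sample_rate:   0.1,                # trace 10% of incoming requests
  scribe_server: '127.0.0.1:9410'    # pointing straight at Zipkin's Scribe port
```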
I was actually hoping to have it available for you today; I was trying to add it to ActiveRecord, Redis, RabbitMQ, and a whole bunch of other services, but I wasn't able to get all of them working and playing nicely together. It is something I'll hopefully have published in the next couple of weeks. You definitely still can do tracing of all of these services; unfortunately, you have to wrap them in tracing calls like this one. On each of your DB requests, as long as it's a synchronous request, this will add client tracing for the actual DB call. For asynchronous communication it's a little different: you have to be careful that you're actually passing along your trace IDs and span IDs, but that's something we can talk about offline as well. So anyway, I want to do a real quick demo. It's nothing incredibly difficult. All right, I'm starting up the CatSpammer interface that we created. Oh gosh, this is really difficult; this is the one thing I couldn't test beforehand. Oh, I can see it down here, perfect. Give me one second. Actually, I'll just put my own number in there. Everyone please spam me after this. Yeah, exactly. So I'm going to send a cat to my phone right now using CatSpammer. Okay, it's sent. On the back end, I'll pull this up real quick. Everything's cut off pretty badly here, but I can show you: here I have the three individual Scribe and Zipkin services running, and in this terminal I have all of the Rails services running for our applications. Did something fail? Oh, that's no fun.
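The shape of such a wrapper looks roughly like this. This is a hand-rolled, in-memory sketch rather than the finagle-thrift API; it only shows how a synchronous call gets wrapped in a span with a recorded duration, where real code would emit each span to Zipkin instead of keeping it in an array.

```ruby
require 'securerandom'

# Collects spans in memory to illustrate wrapping un-instrumented calls
# (ActiveRecord, Redis, RabbitMQ) in explicit client tracing.
class SimpleTracer
  Span = Struct.new(:name, :trace_id, :span_id, :parent_id, :duration)

  attr_reader :spans

  def initialize
    @spans = []
  end

  def trace(name, trace_id:, parent_id: nil)
    span_id = SecureRandom.hex(8)
    start   = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result  = yield                       # the synchronous call being traced
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    @spans << Span.new(name, trace_id, span_id, parent_id, elapsed)
    result
  end
end

tracer = SimpleTracer.new
cats = tracer.trace('db.cats.find', trace_id: 'abc123') do
  # A real DB call would go here, e.g. Cat.where(cute: true).to_a
  [:mittens, :whiskers]
end
```

The wrapper passes the block's return value straight through, so tracing stays invisible to the calling code; the trace and parent IDs still have to be threaded in by hand, which is exactly the manual burden the talk is describing.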
So there's the API portion, there's an auth component, and there's a social component that talks to Twilio and email, and they're all set up to communicate with one another so I can create an example trace for you. I'll bring up the Zipkin interface here real quick. Okay, so this is the base Zipkin interface. It's a pretty simple interface; there's nothing really difficult about it at all. The one interesting thing is that it lets you select different service names, so I can look individually at my CatSpammer interface and then grab a trace from it. Here's one trace that came through, and I'll pull that up. Oh, unfortunately it didn't actually work very well. That's too bad; more womp-womp. Give me one second, I'm going to run another request through. I might not have internet, that's the problem. Well, unfortunately I don't have internet right now, so my actual sending of the message is going to fail. But I can drop back real quick here, and we can act as if this is the interface I was just showing you. Assuming this was the request, this view shows each individual service as it was called. You can click on each of the spans here and it will give you annotation information, so you can see when the request started and when it ended. You can also add tracing information as part of your request while you're doing processing, so you can attach extra information to each individual span, and you can obviously see the latency issues or anything else happening inside your application. So I'll run through here real quick. Okay, that's the demo, and that actually concludes what I have. I'd love to take any questions from you. The actual Rails instrumentation, as I showed you, is really easy with Zipkin.
AppNeta's TraceView instrumentation is just as simple, and I'm pretty sure New Relic's cross-application tracing is pretty simple too; I haven't used that one, unfortunately, but I have used TraceView. Okay, excellent. Well, I'd be happy to talk with any of you about this offline, or you can go visit catspammer.com and send your friends images. So thank you, and I hope you have a good rest of your time at RailsConf.