Hi everyone, and thanks for the introduction. I'm Milo from Clockwork. Today I'm gonna talk about clocks and observability.

Time is a fundamental concept in distributed systems. Open up any distributed systems textbook and there's a section about clocks: physical clocks, logical clocks, hybrid clocks. Clocks are already everywhere. They are useful for databases, consensus protocols, snapshotting, and observability. However, the general consensus in the industry has been to never rely on clocks when you build your own system, because clocks are very hard to synchronize and they are very fragile. Even if they are synchronized, they can go out of sync very quickly.

Clocks are fundamentally just quartz crystals. All the crystals inherently have different frequencies, and the frequencies also change with time, so the clocks are ticking at different rates. Also, in order to measure the offsets between clocks, you have to send packets between the servers and measure latencies, so clock synchronization is also subject to network congestion. If we had super stable clocks that all had the same frequency, and a very stable network with constant delays, clock sync would not be a problem. However, that's not the reality. So nobody actually uses clocks, or depends on clocks, when they build their systems.

At Clockwork, we believe that a highly accurate, scalable, and stable clock system is gonna change the paradigm. It's gonna remove the clockless assumption when building distributed systems, it's gonna enable new system designs, and it's gonna improve the performance of existing systems.

So, NTP and observability. First of all, timestamps are very important for observability: we have timestamps for spans, we have timestamps for logs, they're just everywhere.
And they're critical for analyzing delays, the ordering of events, and everything. However, NTP is not serving us well. Even the best NTP solutions today, for example in the cloud within the same region, can do tens to hundreds of microseconds. However, one-way delays in the cloud are on the order of 50 microseconds. So if you send a packet from server A to server B, and you take a transmit timestamp and a receive timestamp on the two different servers, you can often measure negative one-way delays. For example, this figure on the right shows that we are measuring negative one-way delays just using NTP in the cloud. So these timestamps are very bad for measuring latencies. And if you actually used these timestamps to determine the order of the two events, you would get the wrong result: you would be saying, okay, I received the packet before I even sent it.

Outside of the cloud, NTP does even worse. You generally get milliseconds, and people sometimes see tens of milliseconds or even seconds. Actually, Michael from Aspecto just had such an instance in his own demo, where all the spans were out of place. Not all, but some of the spans were out of place. So here's what we think: we have already spent so much effort instrumenting the code, auto-generating the instrumentation, and collecting the data. We really deserve accurate timestamps so we can understand how the system is working.

Okay, so next I'm gonna talk about three use cases of clock sync in observability.

The first use case is aligning traces, and this has actually been a long-standing problem. The problem is easy to understand. When the clocks are not well synced, the spans are not well aligned. Thus the timing diagram is wrong, and we can end up attributing delays incorrectly. The community has also tried hard to cope with this problem.
For example, when people know that the child span should be within the parent span, but the timestamps are not saying so, they just center-justify the child span within the parent, just to make the trace make more sense. However, this creates further confusion.

So this is an example where bad clocks can confuse us and create traces that make no sense. In this example, initially, we were synchronizing clocks using NTP. We can see that a span from the currency service actually started before its parent span started, and another span from the shipping service started after its parent span ended, which obviously makes no sense. However, if you simply switch to accurate timestamps, the trace suddenly makes sense, and all the timings are correct. There's also an additional benefit: now we can measure one-way delays in addition to round trip times. We can measure the time from when an RPC request is sent to when the RPC request is received, and similarly the time from when the RPC response is sent to when the RPC response is received. This has never been possible without accurately synchronized clocks.

A second use case of clocks in observability is to put distributed logs on a single timeline. As we all know, the OpenTelemetry community has been working hard on logs, and this is one place in particular where accurately synchronized clocks will help. In tracing, we can propagate the context, and we can kind of infer the ordering of some of the spans from the context. However, this is not available in logs, because logs are by nature unstructured, and we can only rely on the timestamps to determine the ordering of events. In this example, we have two processes logging, process Alex and process Bob.
The logs say Alex added one to the inventory at T1, Bob took five from the inventory at T2, and Alex added 10 to the inventory at T3. If I were a developer looking at these logs, I would be wondering, okay, is there a bug in my code? Because the logs are showing the inventory going negative at T2. Or is it a timestamp problem? Maybe the clocks are not synchronized and that's why the logs are not making sense. I would be wondering, and actually spending time looking into this problem. And I wouldn't have any such doubts if I had accurate timestamps in the first place.

A third use case of accurate time is in instrumenting message-based microservices. Generally there are two classes of microservices, RPC-based and message-based. Nowadays RPC-based systems are more popular, but there are still many systems that are message-based. In RPC-based systems, for every request you always get a response, so you can kind of get away with measuring round trip times. You get a sense of the latency in the system by measuring round trip times. However, in message-based systems, there are no requests and responses. The messages simply flow through the system, and it's actually very hard to pin down where the latencies are. If we had synchronized clocks in the first place, we could simply take timestamps at different stages in the system and get a sense of the delays: are the delays happening in the network, in the service, or in the message bus?

Okay, so these are the three use cases that we thought of for clocks in observability. I believe there are way more than the ones we talked about.

So, a few quick words about Clockwork. At Clockwork, we built a very accurate, scalable, and stable clock sync system. It syncs clocks to nanoseconds with NIC hardware timestamps.
And in the cloud, where we have CPU timestamps, we can synchronize clocks to microseconds. The clock sync system has made it onto the front page of the New York Times. It has been adopted by many reputable companies and is available on all three major clouds.

Clockwork's clock sync system is very different from NTP. We actually published a paper about the system at NSDI '18, so you can find the technical details in that paper. There's also a video of the talk at this link. If I were to talk about just one difference between Clockwork and NTP, I would say NTP synchronizes clocks through a tree, and there are a few problems that can come with that. For example, if a certain node fails, it's possible that the entire subtree fails. By the way, just to be clear, NTP is more like a multi-rooted tree, which is slightly more complicated than this tree.

At Clockwork, however, we go much further: we synchronize clocks through a mesh. In this example, we have three regions across the US: West, Central, and East. We have 10 clocks, or actually 10 virtual machines, inside each region. I just picked a random clock, the pink one. You can see that this pink clock is talking to four other clocks in the same region and five other clocks in different regions. Every clock does this: every clock talks to a number of random neighbors, and this forms a probe mesh. It turns out that with this probe mesh, we can do clock sync in a much more accurate and tighter way. We can actually discover asymmetries in the network. We can discover whether we are doing badly or doing very well, so we have an estimate of how well we are doing. And we can also handle different types of failures thanks to the redundancy in the system.

So this is how we think we can help the OpenTelemetry community.
First of all, we wanna make clock sync available to the developer community. We're gonna have UTC-synchronized time that is tied to GPS clocks, and we're gonna have globally distributed time servers, so that accurate time is accessible to all developers. The clock sync system is gonna do orders of magnitude better than NTP. We also plan to provide an API like Google's TrueTime, where there's a bound associated with every timestamp, so you know how much you should be trusting the timestamps.

The second idea, and this is only an idea and we'd love to get your feedback on it, is that in addition to the timestamps themselves, it's actually critical to know how much we should be trusting them. We think maybe we should be generating, propagating, and storing the accuracy levels for the timestamps along with the timestamps. That way, when you're analyzing the data, you don't make silly mistakes. You don't just trust wrong clocks.

A third idea is that we can make a timestamp translation service. It may be difficult for people to switch out NTP and use a different clock sync system. However, we can do the translation after the fact. You can just run NTP as is and collect whatever timestamps it provides. Then, when you want to analyze the traces, you can translate the timestamps and get the corrected timestamps on demand.

And that concludes the presentation. At Clockwork, we believe accurate clocks and globally consistent timestamps can make OpenTelemetry tools better and debugging much faster and easier. We'd love to hear your use cases, and we want to hear your feedback too. Thank you.
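
Editor's illustration (not part of the talk): the idea of storing an accuracy bound with every timestamp can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions. The `BoundedTimestamp` type, its field names, and both helper functions are hypothetical; they are not Clockwork's, Google TrueTime's, or OpenTelemetry's API. The sketch assumes each timestamp carries a symmetric worst-case clock error in microseconds.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BoundedTimestamp:
    """Hypothetical timestamp with an attached error bound (microseconds)."""
    t_us: float    # clock reading reported by the host
    err_us: float  # assumed worst-case clock error, symmetric around t_us

    @property
    def earliest(self) -> float:
        return self.t_us - self.err_us

    @property
    def latest(self) -> float:
        return self.t_us + self.err_us


def happened_before(a: BoundedTimestamp, b: BoundedTimestamp) -> Optional[bool]:
    """True if a surely precedes b, False if b surely precedes a,
    None if the uncertainty intervals overlap and the order is unknown."""
    if a.latest < b.earliest:
        return True
    if b.latest < a.earliest:
        return False
    return None


def one_way_delay_bounds(sent: BoundedTimestamp,
                         received: BoundedTimestamp) -> Tuple[float, float]:
    """Lower and upper bound on the one-way delay implied by two readings."""
    return (received.earliest - sent.latest, received.latest - sent.earliest)


# Loosely synced clocks (errors of 200 us): the naive delay is negative,
# and the bounds tell us the ordering simply cannot be decided.
sent_loose = BoundedTimestamp(t_us=1000.0, err_us=200.0)
recv_loose = BoundedTimestamp(t_us=980.0, err_us=200.0)
print(recv_loose.t_us - sent_loose.t_us)              # -20.0
print(happened_before(sent_loose, recv_loose))        # None
print(one_way_delay_bounds(sent_loose, recv_loose))   # (-420.0, 380.0)

# Tightly synced clocks (errors of 1 us): ordering and delay are both usable.
sent_tight = BoundedTimestamp(t_us=1000.0, err_us=1.0)
recv_tight = BoundedTimestamp(t_us=1050.0, err_us=1.0)
print(happened_before(sent_tight, recv_tight))        # True
print(one_way_delay_bounds(sent_tight, recv_tight))   # (48.0, 52.0)
```

With the bound stored next to the timestamp, an analysis tool can refuse to order two events whose intervals overlap instead of reporting that a packet was received before it was sent.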