Thanks a lot. Good morning, by the way. I'm a little bit jet-lagged because I flew from California, and this is the exact route I actually took several days ago: from San Francisco to Copenhagen and then to Brussels. When we think about commercial flights, we usually think, hey, you go through some security, get on the plane, fly, eventually get some drinks and snacks, land, and leave the airport, right? But the reality of commercial flights is actually more complicated than that. It's so complicated that, as a frequent flyer, I've actually come to appreciate it, even the cancellations and delays, because there's so much going on and there are zillions of things that need to go right in order for them to fly me from San Francisco to Brussels. And as a customer, I don't really have visibility into what is going on.

So my experience was more like this: I arrived at San Francisco Airport. The airport was functional. Security was still working; I went through it and they validated my identity. I spent some time waiting for my plane. Air traffic control was working, no planes were crashing above us or whatever. The gates were working, the announcement system was working. They announced boarding, I ended up boarding last, they closed the doors, and I got on the plane. Our pilots probably went through all these different checklists, and I assume they were all green. They pushed our plane back, traffic control allowed us to leave slightly earlier, so everything went well, and finally we were just flying.

We used all these different facilities. Just imagine all the stuff you engage with when you're flying, all the facilities you are actually using. None of these facilities are reserved specifically for our flight, but they are really essential for our flight to happen. So: I had a comfortable seat, good food, I actually flew, the engines kept working, time passed quickly. We ended up landing in Copenhagen, I went through the border, made it to my connecting flight, and landed in Brussels.

Anything could have gone wrong along the way, including details I don't even know about. Imagine all the ground staff, the flight staff, the machinery, the electronics, everything involved, as well as border enforcement. It requires a really complex system of cooperation to fly somebody from San Francisco to Brussels, and things are not always good. Things may go wrong; as I said, there are cancellations, delays, whatever. Different components of the system may fail differently, in isolation, and we sometimes only realize the existence of these small subcomponents when they fail, because as a person who is flying, we don't really have visibility.
I don't have a lot of visibility into commercial flights, and you don't really appreciate a lot of the good things, because you just assume them. For example, if you get turned back at the border when you're trying to get into the Schengen area, you suddenly appreciate that all of this stuff until now worked out fine. You were promised to be flown from San Francisco to Brussels, but you eventually had to turn back early.

Anyway, any computer system is actually like flying: our users engage with a really tiny part of the stack, and that's their experience. Your billing infrastructure, for example, may work well most of the time, but if a user's transaction doesn't go through, that overall reliability doesn't matter; it's the end of the game for that particular user.

So let's talk a little bit about our everyday computer systems. I don't think we have a lot of visibility into our stack, and I don't really think anybody scales well in terms of development, maintenance, and the general production experience. If anybody thinks that they do, I'm really impressed. I've worked at so many small and large companies and never thought that we did. At my current company I kind of feel like we do, but there are so many small gaps here and there. Today's talk is about how we are trying to fill those gaps.

By the way, I'm Yana. I work at Google, and I've had the opportunity to work on a bunch of projects, including contributing to some of our infrastructure projects. I have several stories about my time at this company, and this is one of them.

So, the earliest days of a company or a project are really nice, because things are simple.
You usually have a simple server and a few other components; maybe you have a Postgres cluster for your data, but everything still fits into your brain at this point. Your architecture is just a few nodes here and there. If somebody joins the team, you just take them to the whiteboard and explain what is going on in a couple of minutes. If something goes wrong, you can take a look at the logs, and it's easy to debug things from the logs.

Then, the next step: you're growing, you have more engineers. The company culture is also changing, because your monolith is not really helping you scale organizationally. Some teams want to push stuff to production more often; some teams just want to keep things more stable. There are different demands between different teams, and this is the point where you want to do something else, and you start breaking down your big monolith.

So the growth is great, but you start to see the first symptoms of diversifying and fragmenting your tech stack. You start to include more stuff: more storage, more databases, different queues, whatever. Each team comes up with their own ideas, everybody is sort of siloed and not really talking to each other, and you have this huge mess because the monolithic organization is gone.

This reality comes with a lot of new challenges, because this one single problem just became many, and you can't really depend on the earlier ways of doing things, especially in terms of debugging. You can't really self-document all this stuff end-to-end, because it touches a lot of teams and there's no coordination between those teams. Reading the logs is just so hard, because some stuff is here and some stuff is there; everything is failing in isolation or maybe working fine, and you just don't understand the overall state of what is going on. Your engineers just really don't know what to do when things fail. Who's the person we should contact? Everybody keeps escalating stuff to each other, because it's hard to pinpoint the root cause of the problem.

Sometimes it gets larger. For some large companies it becomes really intolerable; at a company like mine it's actually such a big pain. A lot of people even ask me who really wants to work for such a large company. This was my biggest concern before coming to Google. I worked for another large company before, and I almost didn't want to take the job: it would be a nice opportunity, but I would feel so unproductive, because it would take me a lot of time to learn the stack and our systems.

In some cases people think that key people, like Jeff Dean, might be the answer to all of these problems, because these people have been around for a long time. They were there when the initial conversations were happening about critical parts of the infrastructure, so they sort of have a better understanding of things end-to-end. But is this really our strategy? Are we going to keep escalating things to Jeff Dean?
Nobody does that. And Jeff Dean actually went to AI; maybe that's partially why he went to AI, because lots of people are still coming to him and he's trying to replace himself with AI. I don't know.

Anyway, to be more realistic: especially in very fragmented, large systems, everything becomes such a big mess. We have all these different services, different storage systems, different databases. Who actually has visibility? At some point, and this is a typical meme we had at Google, you just want to delete all of this stuff. We have this huge complexity. The code is owned by different teams; nobody knows how it works end-to-end. People who have been around for a long time have burned out, because they became the canonical source of truth and they can't really take vacation: if they just take two weeks off, everything goes down. And documentation is not an answer to any of this, because documentation can't really keep up with the rate of change. Docs always lie; they always come in late, and you really don't want to depend on them.

When I joined this company, in my first hour (literally, it was my first hour), they said: hey, there is this thing called Code Search. It's an internal tool, basically a search engine for our internal monorepo. It's really amazing: the ranking, everything; you can really pinpoint stuff easily. The only restriction is that you need to have an entry point: a symbol name, a file name, a project name, maybe a user name. You can see the history as well. So it's nice if you have an entry point. I use it every day, and everybody at Google uses it every day, especially if they're engaging with the monorepo, and it's part of why we can understand our systems better.

The second thing is that we have a unified build tool called Blaze, which inspired Bazel, the open source version. Since everybody is using the same build tool, it makes things easier: you can click on a build target and see who depends on that target, and so on. But this is really good for static dependencies; it doesn't tell you anything about your dynamic dependencies, like service dependencies. So it's really cool, but it's not the actual solution.

None of these tools are really good at capturing what is important in terms of pointing out the most critical execution paths. This is a typical system that returns some user profile images and a bit of metadata, and there are many services here. We're going through some services, as well as some storage, in order to return a response to the user. The blue line here is what we call a critical path, because the user request comes in to the load balancer and hops through a certain path in order to serve the request, all the way down to the low-level disks, and at some point we actually have a response. At my company we put some emphasis on making these critical execution paths more visible, so engineers can actually see, investigate, and debug them.
If engineers can really see this type of execution path, they can also see which exact components were required to serve that request. They can also run some analytics on which execution paths are important, or which service dependencies matter for a given request. This requires some sort of dynamism: it's actually generated from data coming from dynamic instrumentation in production, and it gives a bird's-eye view of our systems. You can really understand the service dependencies: which paths, which dependencies, and which relationships are more important.

So just like my flights, our users often don't know much about what's under the hood. They don't know what is going on in our systems; they expect this big black box to respond to them. They don't really care about the health or availability of the underlying stuff either; they just expect a response. For example, if one of the replicas of the user service here just dies, just crashes, and the scheduler spawns a new one within the lifetime of a request, and the user cannot really tell because the latency was not that high, the user wouldn't care.

So being able to think about your systems from the perspective of critical paths is really important, because the goal here is not the availability of these different individual components. The goal is the availability of the user's execution path.

Those are the goals of critical path analysis. As I said, the availability of the underlying processes doesn't really matter on its own; it's more about having the user experience up and running, having that critical path up and running. It's just like flying, right? I fly, but I don't really understand what else is going on. I know what I engage with. I don't care if the airport is entirely functional or not; I just want to use one gate, maybe sit and eat. I engage with a small section, and from my perspective, that's what matters. Being able to think about our systems and evaluate them from the user's perspective is a very different approach, but it's really useful.

Some of our engineering practices are really about discovering critical paths automatically, and I briefly mentioned that we do this in production. Once we discover these critical paths, we also use them as debugging data. We want to make them more reliable and fast enough, and if there's an incident in production, this data gives us some debuggability: we can just go in and see what has happened. If it's the middle of the night and you're on call, you really want to be able to see everything end-to-end without knowing too much about the entire system, even if you haven't read the source code before.

So that's cool. How do we get there?
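To make the idea a bit more concrete, here is a minimal sketch of what computing a critical path from trace data could look like. The Span struct and the greedy "follow the child that finished last" rule are simplifications I'm assuming for illustration; real critical-path analysis handles overlapping and asynchronous work much more carefully.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a simplified trace span: one timed operation in a request.
// The field names are illustrative, not any particular tracing API.
type Span struct {
	ID       string
	ParentID string // empty for the root span
	Name     string
	Start    time.Time
	End      time.Time
}

// criticalPath walks from the root span and, at each level, follows the
// child that finished last, a naive stand-in for "the work the parent was
// actually waiting on".
func criticalPath(spans []Span) []Span {
	children := map[string][]Span{}
	var root Span
	for _, s := range spans {
		if s.ParentID == "" {
			root = s
			continue
		}
		children[s.ParentID] = append(children[s.ParentID], s)
	}

	path := []Span{root}
	cur := root
	for {
		kids := children[cur.ID]
		if len(kids) == 0 {
			break
		}
		slowest := kids[0]
		for _, k := range kids[1:] {
			if k.End.After(slowest.End) {
				slowest = k
			}
		}
		path = append(path, slowest)
		cur = slowest
	}
	return path
}

func main() {
	t := time.Now()
	spans := []Span{
		{ID: "1", Name: "loadbalancer", Start: t, End: t.Add(120 * time.Millisecond)},
		{ID: "2", ParentID: "1", Name: "user-service", Start: t.Add(5 * time.Millisecond), End: t.Add(110 * time.Millisecond)},
		{ID: "3", ParentID: "2", Name: "storage-read", Start: t.Add(20 * time.Millisecond), End: t.Add(100 * time.Millisecond)},
		{ID: "4", ParentID: "1", Name: "metadata-cache", Start: t.Add(5 * time.Millisecond), End: t.Add(15 * time.Millisecond)},
	}
	for _, s := range criticalPath(spans) {
		fmt.Println(s.Name, s.End.Sub(s.Start))
	}
}
```

Running this prints loadbalancer, user-service, storage-read: the chain the request actually waited on, while the fast metadata-cache call drops out of the picture.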
There are different ways to do this. I'll talk about the two main emerging tools in the industry nowadays; I don't know if you are familiar with distributed tracing or granular event collection. Logging can also be used, but the tricky part of logging is that it needs more context about the execution path: you need to carry a request ID around, and you need to structure the log messages in a way that lets you reconstruct these execution paths from parts of the data.

A very common way of understanding a problem is to just keep asking why: why, why, why. It's kind of the golden rule of exploring cause-and-effect relationships. Granular event collection, or distributed tracing, has this promise: we want to give you the ability to go deeper in the stack and help you see what actually happened. This is why people sometimes say observability is about having answers to questions you haven't asked yet. The question appears when there's an incident or some unlikely event, and you should have planned to collect enough data, so that when the incident is there, you can take a look at your debugging data and it has answers to your questions.

I will show you some illustrations of traces. I don't know if you are familiar with tools like Zipkin, or the network tab in your browser's developer tools. Typically traces look like this: there are a bunch of things going on, and each row here represents a different component where some action is happening. It captures what happened, how long it took, and where it was initiated. What we see here is a trace for an API server I have, for the /timeline endpoint. You can see all the different server components I had to go and make RPCs to in order to respond to the user. You can take a look at these rows to learn more about the lifetime of a request. If somebody new joins my team, for example, and they're going to work on the API server, they can take a look at the trace to understand: oh yeah, this is all the stuff involved. They don't have to read the code; they can start by looking at the trace and learn what is going on.
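To make those rows concrete, here is a rough sketch of what producing spans can look like in Go. I'm assuming OpenCensus as one open-source option (the package and calls below are from its Go API as I recall them; the operation and function names are made up), and normally you would also register an exporter so the spans end up somewhere like Zipkin.

```go
package main

import (
	"context"
	"time"

	"go.opencensus.io/trace"
)

// fetchProfile produces one span (one "row" in the trace view) for the
// overall call and a child span for the storage read it waits on.
func fetchProfile(ctx context.Context, user string) {
	ctx, span := trace.StartSpan(ctx, "api.fetchProfile")
	defer span.End()
	span.AddAttributes(trace.StringAttribute("user", user))

	readFromStorage(ctx)
}

func readFromStorage(ctx context.Context) {
	_, span := trace.StartSpan(ctx, "storage.read")
	defer span.End()
	time.Sleep(10 * time.Millisecond) // stand-in for real work
}

func main() {
	// In real code you'd register an exporter here so the spans are
	// visible in a tracing UI; without one this just runs silently.
	fetchProfile(context.Background(), "user123")
}
```

Each StartSpan/End pair becomes one bar in the kind of trace view described above, with parent and child relationships carried through the context.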
Also, welcome to 2019: this is the basic stack we have now, an ever-growing number of layers. Maybe you have more understanding of your user space, but everything underneath, what I'd call the cloud stack, Kubernetes and whatever else, is just a big black box, and we have absolutely no understanding of it. Just imagine if your infra provided some granular events or traces from that layer within the lifetime of your request, whenever there's something relevant; it would be so much easier for you to understand. Imagine, for example, that some of the Kubernetes scheduling decisions leak into the networking layer, because the load balancer needs to understand the scheduling decisions. If we could capture some traces from that layer, wouldn't it be nice, when there's an unlikely event, an unlikely scheduling decision, to just go take a look at your user trace and pinpoint: hey, it was the scheduler, or the load balancer, and this happened for this reason? There's no way we can learn this entire stack in our lifetime, but I think this type of debugging tool allows us to understand more of each layer.

So these traces are a cross-stack debugging tool. You can basically use them to assign blame. Going back to that API server example: imagine the engineer comes in, takes a look, and sees we waited this much time for the scheduling decision to happen. That person can escalate the issue to the scheduling team, because, hey, there's this additional latency, and it's actually not my fault. It gives you this "git blame for production issues" type of functionality, which is really nice. And it's really nice when your cloud providers participate in your trace, because then you can see, hey, there's all this additional latency coming from the cloud provider's load balancer; maybe I can escalate this to their SREs, and this is the evidence. Each time you escalate stuff to cloud providers, they're a little bit like, hey, are you sure it's us? So: here's the data, and it's the proof.

The other thing is that you can build tools to map this type of data to the source code. Here, for example, each bar, each span, represents an RPC call. It would be nice to link it to the RPC handler, or to the client source code, so you can take a look and review what has happened. It could be a misconfiguration, or malformed payloads, or some stupid timeout configuration that caused additional retries or latency. It's pretty cool to be able to do that, if you build some additional tooling around this type of data.

The other thing is knowing who to call. There's an issue with this specific component, and I am not the person who can understand it: I read the code, I took a look at all the additional data, and I still don't know what is going on here. Who's responsible for this? If you have a catalog of your different teams, and you can map one of these blocks to an SRE team or a regular team, you can actually give them a call and try to understand what is going on in production.

Another thing is: give me all the other stuff you know about this block. In the lifetime of this block, what else has happened? We don't capture everything in the form of traces, so it would be nice to just see what else is going on. The very typical example of this is the logs, because we already have logs everywhere.
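One hedged way to make logs joinable with traces is to stamp every log line with the IDs of the current span. The sketch below assumes OpenCensus span contexts, but any tracing library exposes the same two identifiers; real systems usually do this inside the logging library rather than by hand.

```go
package main

import (
	"context"
	"log"

	"go.opencensus.io/trace"
)

// logWithTrace stamps a log line with the trace and span IDs from the
// current context, so the line can later be joined back to the trace.
func logWithTrace(ctx context.Context, msg string) {
	if span := trace.FromContext(ctx); span != nil {
		sc := span.SpanContext()
		log.Printf("trace=%s span=%s %s", sc.TraceID.String(), sc.SpanID.String(), msg)
		return
	}
	log.Print(msg)
}

func main() {
	ctx, span := trace.StartSpan(context.Background(), "fetch-profile")
	defer span.End()
	logWithTrace(ctx, "cache miss, falling back to storage")
}
```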
So just give me the logs: what happened? I just want to understand more about what happened around this block.

Nothing comes for free, unfortunately, and I will explain some of the challenges. I think we shouldn't underestimate the level of investment required to roll out some of these technologies in your organization. Especially if you already have an established organization, it's a little bit harder to put these in, and this is where people usually get stuck, especially if they haven't thought about these capabilities and challenges in the early days.

If you need critical path analysis, it's a cross-team problem. The entire organization needs to agree on a bunch of things. You need to do some instrumentation in order to generate that data, and you need to be able to collect it: this is the start, this is the end, whatever. But to reconstruct the trees you really need to propagate some request identifiers around, and there is not a lot of agreement on this. There are no golden standards, even though for distributed tracing at least there's a W3C proposal coming up; apart from that, there is no real standard. And just imagine that even if you internally agree on one particular identifier format, you are hopping through different load balancers or whatever, and they don't understand that format. So we really need an industry format for this, and for distributed tracing we're actually working on a standard, which is really good news.

The other thing is that engineers really don't know where to start. We can capture a lot as part of traces or event collection; we usually say start with your network layer, specifically HTTP and RPC. This is where things get really easy, because if you have a framework, or if you can start things at the load balancer, you get some automatic coverage: for every RPC or every request, you can automatically generate one of these bars. You can simply start by using an instrumentation library that works well with your existing frameworks and get some data from there.

Another thing is that infrastructure is still a black box, and none of the vendor services are really designed with these kinds of capabilities; nobody considers these capabilities in the first place. We still expect people to learn a lot about the underlying stack by reading the code and docs and talking to people, and we don't expose this type of debugging information. It's a huge challenge, especially if you are running stuff in managed environments at your cloud provider, for example serverless, managed function-running environments, where you basically don't have a lot of control. If the environment itself automatically provided this type of debugging information, and you could participate in it, that would be so nice.

The other thing is that instrumentation is really expensive. High-traffic systems end up downsampling, and it's a huge topic: what kind of downsampling strategy makes sense? We sometimes miss collecting data for interesting cases, and we want to improve this over time, but we only have a few rough ideas of what to do.
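As a rough sketch of those two points, starting at the HTTP layer and downsampling, here is what it could look like in Go with OpenCensus, one library option among others: ochttp wraps the server so every request gets a span automatically, and the 1% sampling rate is an arbitrary placeholder, since picking it well is exactly the open problem described above.

```go
package main

import (
	"log"
	"net/http"

	"go.opencensus.io/plugin/ochttp"
	"go.opencensus.io/trace"
)

func main() {
	// Downsample: keep roughly 1% of traces. The rate here is a placeholder.
	trace.ApplyConfig(trace.Config{DefaultSampler: trace.ProbabilitySampler(0.01)})

	mux := http.NewServeMux()
	mux.HandleFunc("/timeline", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	// Wrapping the server (and clients with ochttp.Transport) gives you a
	// span per request for free, plus trace context propagation over the wire.
	log.Fatal(http.ListenAndServe(":8080", &ochttp.Handler{Handler: mux}))
}
```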
Another challenge is that the dynamic capabilities of instrumentation have been underestimated for a long time in the industry. This is really important, because ideally we want to be able to tweak things: we want to be able to collect more, maybe, when there's an issue. But doing that in a safe way is not really easy, so this is still a big challenge, and you need to have your own strategy.

To wrap up: we are still in the dark ages when it comes to understanding and maintaining our systems. When I talk about these concepts I sometimes feel like a snake oil salesman, just traveling around and talking about them, but I have to say that we do have some of these abilities, at my company at least, and it really helps us. It makes your engineers slightly happier, because you can give them an overview of what is going on. At the same time, this is not the end game; it's just an overview. It really changes the way you interact with other teams, but it's an entry-point tool, not the entire solution. So I just want to say that this is a tool that really closes knowledge gaps in our organizations, and we don't talk about it much; I think I came here to talk about it. I would love to answer your questions if you have any. Thanks so much for having me here.

Please be quiet when entering, and remain seated for now.

Hi, Yana. You said that you're not aware of a distributed tracing standard, but I understood OpenTracing to be that standard. Can you explain maybe why that doesn't fit?

Is the question whether OpenTracing is a standard, or how OpenTracing aligns with the lack of standards I mentioned?

Sorry to interrupt again: out of respect for the speakers and their time and their commitment, can you please be quiet? Thank you.

So the question was: I mentioned that there aren't a lot of standards around, and you mentioned OpenTracing. OK, so OpenTracing actually provides some hooks to instrument things, but it's not opinionated about the propagation format, and it's not opinionated about the data you export. Each provider enforces a different data export format and a different header format, and if you're using OpenTracing you need to link the entire system against one specific implementation of OpenTracing in order to have end-to-end traces. That's not practical. For example, if you have nginx as a binary, nginx just doesn't know anything about whatever tracing provider you want to use. So we want to make this less of a build-time constraint and more of a wire format, so everybody can understand the same header and produce a similar type of data, and nobody has to rebuild stuff in order to have traces. I hope that answers your question.
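To make that wire-format point concrete: the W3C trace context proposal defines a traceparent header that, roughly, carries a version, a trace ID, a parent span ID, and flags. Below is a simplified Go sketch; the hardcoded IDs are the spec's example values rather than anything real, and real code would take them from the current span context.

```go
package main

import (
	"fmt"
	"net/http"
)

// addTraceParent injects a W3C-style traceparent header into an outgoing
// request: version-traceid-spanid-flags.
func addTraceParent(req *http.Request, traceID, spanID string, sampled bool) {
	flags := "00"
	if sampled {
		flags = "01"
	}
	req.Header.Set("traceparent", fmt.Sprintf("00-%s-%s-%s", traceID, spanID, flags))
}

func main() {
	req, _ := http.NewRequest("GET", "https://example.com/timeline", nil)
	addTraceParent(req, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true)
	fmt.Println(req.Header.Get("traceparent"))
}
```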
Hi, Yana. My question was: can you go into a bit of detail about what you meant by dynamic constraints?

Yeah, dynamic capabilities. Would you have an example to elaborate on this?

So one of the typical questions is how much data should I collect, or how granular should it be. Sometimes we collect distributed traces, but it's not enough; we need to go and enable some other stuff momentarily. For example, it's a very common thing at Google that we just don't know why there's some additional latency here, and it sometimes turns out to be GC latency. In order to capture that, you actually need to have some runtime events coming from there, and that's not really in the scope of distributed tracing. So you need to go and tweak a few other things, enable some more instrumentation, and be able to correlate it with that distributed trace, so you can still capture: hey, this is what happened in the lifetime of this request. That's what I meant, and again, it's a huge topic. It's about enabling more signals, maybe more granularity, maybe more debugging information. It's all in the scope of dynamic capabilities, and we just don't think much about it when we're developing things. So yeah, it's something that is not very ideal nowadays.

Hi, and thanks for the talk. You were mentioning freeing ourselves from build-time constraints, and things like nginx. Do you think that we can ever automate this sort of dynamic instrumentation, given that it seems to require at least some degree of domain knowledge that at some level has to be input by humans?

So the question is: if we have some sort of standards, is it possible to automate all of distributed tracing by putting this in our load balancers, nginx, or similar? You can actually start a trace there, but you still need to propagate it all across your stack, and that is still a problem. We can automatically generate the RPC-level, ingress request span, but you still need to put the right header on the outgoing requests, and all the other components need to align with that and participate in order to have any meaningful traces. So it's not really easy to do this, especially in some language runtimes. For Go, for example, it's really hard because you need to propagate the right identifiers yourself; in some other language runtimes it's so much easier, especially if there's a single thread and you can just put the context in the thread. So it also depends on the language and its capabilities and limitations. But the actual answer is: unfortunately, no, there is no super automatic magic way.
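A tiny sketch of what that explicit propagation looks like in Go: since there is no implicit thread-local place to stash a trace ID, it has to travel in a context.Context through every call on the request path. The key type and the hardcoded ID here are made up for illustration.

```go
package main

import (
	"context"
	"fmt"
)

type traceIDKey struct{}

// handleRequest stores a trace ID in the context; every downstream call
// has to accept and pass the context along for the ID to survive.
func handleRequest(ctx context.Context) {
	ctx = context.WithValue(ctx, traceIDKey{}, "4bf92f3577b34da6a3ce929d0e0e4736")
	callBackend(ctx)
}

func callBackend(ctx context.Context) {
	if id, ok := ctx.Value(traceIDKey{}).(string); ok {
		fmt.Println("outgoing RPC carries trace", id)
	}
}

func main() {
	handleRequest(context.Background())
}
```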
Hello, thanks for the talk. You mentioned that infra is still a black box. Do you think that things like Open Compute are the way to go to solve that problem, or open source in general?

Can you repeat that again?

Yeah, you mentioned that infra is still a black box; do you think things like Open Compute are the way to go to solve that problem?

I'm not really familiar with Open Compute, but lots of infrastructure builders really, really want to expose this type of data. Once there is a real standard, an export format standard for example, they want to provide a hook, write the data in that particular format, expose it, and really work with the user's trace. It's more the lack of a wire standard at this point that's holding this back. But I want to take a look at Open Compute, because I don't have much context about it; maybe we can chat after this. Thank you. Thanks.

Hi. Are there any projects going on, you mentioned the guy who went to AI, where you try to apply some AI or machine learning technology, so that when we have some problems, they get analyzed and there's some suggestion about where the problem might be?

So the question is: is there any AI work related to consuming this type of data? There's actually a lot, especially in the scope of anomaly detection. They want to be able to automatically say that this pattern, this usage pattern, this path pattern, is not really normal: there's some additional latency, or different components, or an additional number of retries, and so on. I don't understand a lot of what they do at this point, I haven't spent a lot of time reading about it, but they want to do better recognition of unusual patterns by looking at the data.

Thank you. Oh, thanks.