[Introduction of the speaker, Richard Benwell; the audio here is unintelligible.] Thank you, it's great to be with you. Today my talk is on why large-scale observability needs graph. By large scale I mean not necessarily volume of data but the complex, messy kind of systems that we are trying to observe as we scale up our organisations. By graph I mean not the line-chart type of graph but the connectivity type of graph. So, [unintelligible] the three pillars. Maybe there are six pillars, maybe there are a lot of pillars, but I'll use metrics, logs and traces in this talk. Often when we talk about this data we really focus on how we collect and store it, and many of the talks today are about how we do that at massive scale. It's a huge engineering challenge. The other challenge, though, is how we use that data. Collecting that data does not necessarily offer any value to us. Data sitting in a database is only valuable if we can actually use it to answer questions. How do we find and use the right data to answer questions? Where is the problem for this incident? How reliable are my components? Where's the weak spot?
How is my application going to scale? What about its performance? And if I make a change to one component, how is it going to impact other components? How do I plan change? [unintelligible] Is there anyone who hasn't read this Wikipedia page? So, the Wikipedia page on observability. The inspiration for observability came from the control systems domain. And so you Google observability and you'll find this Wikipedia page. And right at the top is the definition of observability that we're all very familiar with. Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. So how do we understand what's happening, given what we are observing from the external outputs? So that's great. That's the bit we've taken from this Wikipedia page and we've put on every single blog about what observability is. Every single observability talk starts with that definition. And if you have been to this page, you've probably done what I did the first time I got here, which was you read that and you think that's pretty cool. You maybe read some of the rest of the paragraph and then you start to scroll down and your eyes glaze over and your scroll finger kicks into action and you just scroll down to the bottom of the page and move on. But I think there's more inspiration to take from control systems observability, and that's further down the page.
So the first instinct we have is to collect the data, but if we actually scroll down the page, we hit some of these equations, and again most of us kind of glaze over and move on. I'm not going to talk in depth about equations, this isn't going to be a maths session, but I did want to talk about what those equations actually mean and what we can learn from them. So if you scroll down the page on control theory and observability, it talks about whether a system is observable. Two equations drew me in. The first one is x dot: the rate of change of the state of the system is a function of the current state of the system plus the inputs. So you've got x dot equals Ax plus Bu, where u is the inputs and x is the current state of the system. So that's the first equation. How does the system change over time with inputs? The second equation is what our outputs look like. Our outputs, our signals, are some function of the current state of the system plus the inputs that are going into it. So that's what control theory talks about. It has the idea of the current state, the inputs and those outputs. And then it talks about this observability matrix. So when in control theory they ask whether a system is observable, you need the outputs, sure, you need the signals, but you need something else. You need those matrices that define how a system works. How does a system change over time? How are the outputs a function of the state of the system? So in control theory they have these matrices that define the model of the system. And that raises the question, where's our system model? We're talking about collecting the data. We have the signals, the outputs, but where's our system model? How do we work out the internal state of the system? So just putting this into context, if we take a control system observability case, this is like a textbook case: what is the internal state of the system? What's the voltage across this capacitor? We know these signals.
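For reference, the two equations and the observability matrix read aloud above can be written out as follows. This is the standard linear time-invariant state-space form; the slide itself is not part of the transcript, so the symbols D (direct feed-through from input to output) and n (the number of state variables) are taken from the usual textbook notation rather than from the talk.

```latex
\begin{align}
  \dot{x}(t) &= A\,x(t) + B\,u(t) && \text{(how the state changes with inputs)} \\
  y(t)       &= C\,x(t) + D\,u(t) && \text{(how the outputs depend on state and inputs)}
\end{align}

\begin{equation}
  \mathcal{O} =
  \begin{bmatrix}
    C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1}
  \end{bmatrix},
  \qquad
  \text{the system is observable if and only if } \operatorname{rank}(\mathcal{O}) = n .
\end{equation}
```

The point for the talk is that the test depends only on A and C, which are the model. The signals y alone cannot tell you whether the state can be recovered.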
We've got a voltage across the inductor there and an input voltage. So what's the voltage across C? Now it's pretty obvious we have absolutely no idea, because we don't know what circuit we're talking about. So in a control systems case you have a circuit diagram, that's then modelled using those matrices, and you can find out whether we can in fact understand what the voltage across that capacitor is if we have these output signals, right? And so you put those into that observability matrix equation and say yes, it is observable. Given those outputs we can figure out what's happening within that system. So very obviously we need a model to interpret those voltages. Those voltages are meaningless without that model. So we hop over to something that's a bit more familiar, our software system observability. Maybe a sort of similar question: where's the problem, where's the faulty component? We have our signals, the metrics, logs and traces. We can't use those unless we have a model of the system, right? If we just had a database full of metrics, logs and traces we wouldn't know what we're looking at. And often that is the case in organisations: the data is collected but not many people actually know how to interpret that data. So of course we need a model as well, right? We have a model maybe of our microservice architecture or a model of our infrastructure and our clusters. So we absolutely need a model to interpret the observability data, the signals that we have. Just to recap there, sorry. In terms of asking the question, is our system observable, the proposition here is that actually we need the signals plus the model, right? We can't just say our system's observable because we're collecting the data. We need to put that data into context with a model. So clearly we do have a model of some sort because we do use this data. It's not absolutely useless to us. So where is the model typically?
Well, the model typically is in someone's head, or in our heads in terms of informal knowledge. Maybe it's written down in a Confluence page that's out of date or an out-of-date diagram, but it's probably just informal knowledge shared within a team and across teams. Now this is absolutely fine if you're a two-pizza DevOps team, right? You build it, you run it. You know what's going on. Everyone knows how the system works. You know how to interpret all this data. But as you scale and become multiple teams and you're not just a single system, you're a system of systems, where's the model of the system of systems? Who in their head understands how it all fits together? And that's when you start to see the pattern of something like an architect whose job it is to work out how everything fits together. Everything comes to them. Or maybe the unlucky SRE who's really trying to figure out how everything fits together, and so all questions come to them. So that scales only so far, right? You get maybe a few people who have this knowledge and understand the model and so know where to look and how to interpret that data. But at some point in a large-scale application, you get to this level of complexity. So this is the infamous AWS Death Star: when they mapped out their microservices they ended up with a diagram that looked something like this. Where's the knowledge of that? How do we interpret all of that data and put it into context when we have a system of that sort of complexity? Now, of course, we're not all necessarily like that. We don't all have massively complex systems like that, but if you work in an organisation that's been around for a few years and has gone through technology platforms, gone through developers, gone through scale, you will have a messy system with a lot of different dependencies that are probably not written down anywhere.
And that's why it's so hard to use the data that we collect: we're not quite sure where to look when things go wrong or how to interpret the data that we have. So the suggestion here is that we need to store a model if we're going to use our data and have true observability. So how do we capture, store and use a model? That's why I want to talk about graph. I think the model is a graph. And like I said in the intro, when I say graph I don't mean the line-chart type of graph, I mean the graph of connected things. These two diagrams are actually just from the Wikipedia pages. So line chart on the left and social graph on the right. If you look up social graph on Wikipedia, that's what you see. And of course we're all quite familiar with that idea of a graph, the social network graph, you know, interconnected things. And very clearly this maps very well onto our domain, with our services that maybe call another service, or maybe a service puts a message onto a Kafka topic. So it publishes to a Kafka topic and maybe another service subscribes to that Kafka topic. So they have those relationships. Maybe that service is hosted on a particular cluster. Maybe those clusters have particular hosts in them. So there's a lot of connected data, and graph is the way to model that data. Before I go on, I wanted to do a couple of quick slides on sort of graph 101, just so we all understand what graph is. It's a technology I think that's trending a lot at the moment, we're seeing it more and more, and I'll talk about how we are starting to see it in our observability platforms today. So the graph concept is this idea of the two dots and a line, right? The boxes and lines. So we have two nodes with a link. In mathematical graph terms they're called vertices and edges, which I never get used to, so I often just call them nodes and links. And so we have something like service A calls service B.
Okay, so we can capture that relationship as a graph of two nodes there. And then, slightly academic, but just to demystify some terms you hear, you then start to hear about something called a property graph and a knowledge graph. A property graph is again something very familiar. It's when we have those nodes with relationships, so service A calls service B, but we start to add properties to both the nodes and potentially the edges as well. So we can add the label of that service. It's called service A. It's owned by a team, team alpha. It's in this particular repo. And we can put properties on the links if we want to as well. So we can mix properties and edges and just easily express what we want to express. And then, as I say, just to demystify a term if you've heard it, there's also this term knowledge graph that's being used more and more. You may have heard about it from Google's Knowledge Graph. So now when you Google something, instead of just getting web pages as results, you get that sidebar, that knowledge panel, that actually knows what you're interested in and starts to give you facts about it. It could be a person, a famous person, or it could be the results of the football match last night. So Google uses something called a knowledge graph, and it's slightly academic, but the idea really is that you blow apart that graph so that everything is expressed as subject, predicate, object. So instead of having properties on a node, you say, well, that node has a repo, and the identity of the repo is a separate node, and it's owned by a team. That team is a separate node, and it also calls another service, and that service is a separate node. It's a little bit like how we normalise a relational database. You blow everything apart and put everything in its own space.
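The difference between the two styles described above can be sketched in a few lines of Python. This is only an illustration: the `Node` and `Edge` classes, the `to_triples` helper, and names such as `service-a`, `team-alpha` and `repo-a` are made up for the example and do not come from any particular graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    label: str
    target: str
    properties: dict = field(default_factory=dict)


# Property graph: facts about a service live on the node itself.
nodes = [
    Node("service-a", {"owned_by": "team-alpha", "has_repo": "repo-a"}),
    Node("service-b"),
]
edges = [Edge("service-a", "calls", "service-b")]


def to_triples(nodes, edges):
    """Blow the property graph apart into (subject, predicate, object) triples.

    Every property value becomes an object in its own right, which is the
    knowledge-graph style. Edge properties are ignored in this small sketch.
    """
    triples = []
    for node in nodes:
        for predicate, obj in node.properties.items():
            triples.append((node.id, predicate, obj))
    for edge in edges:
        triples.append((edge.source, edge.label, edge.target))
    return triples


triples = to_triples(nodes, edges)
for triple in triples:
    print(triple)
```

In the triple form, `team-alpha` and `repo-a` are things you can hang further facts on, whereas in the property graph they are just strings attached to `service-a`.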
And so within this idea of a knowledge graph, everything has a URI, a uniform resource identifier, and those identifiers can then be shared across different domains, and that's how you start to link data together. So when Google built their Knowledge Graph, they ingested all this data from different places but they correlate it all using these URIs. This is very exciting, this idea of, could you express every fact in the world as a graph? That's pretty cool. What could you do with it if you could? Unfortunately, it's typically very academic. It has the feeling of XML web services and SOAP, if you ever worked with those 20 years ago. So it feels a little bit too academic, not very practical, which is why the property graph is generally how graphs are used today. And just another couple of aspects of graph, so you know these terms if you hear about them. Some popular graph databases: first, AWS has Neptune. That's their graph database. Azure has Cosmos DB, which is a multi-model database but can act as a graph database. And the open source world is dominated by Neo4j, one of the earlier graph databases, which is now used very widely. So if you come across those databases and people talk about graph, that's what they do. And you get something like this as an example query. They're very powerful if you've got all this data connected. You can ask questions like, what upstream front ends use this particular microservice? Within larger organisations, I've heard this several times: you're a development team developing a particular component, and you don't necessarily even know how that component is being used by all the applications within your organisation. You don't necessarily know who is calling you and what end-user applications you're part of. So Amazon Neptune and Cosmos DB both use Gremlin. I think Neo4j supports Gremlin as well, but they have their own language called Cypher.
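The query shown on the slide is not part of the transcript, so here is a plain-Python sketch of the same question, "what upstream front ends use this particular microservice?", over a small in-memory graph. In Gremlin this kind of traversal is typically written with the `repeat()` and `until()` steps. The service names, the `CALLS` edge list and the `NODE_TYPE` table below are all hypothetical.

```python
from collections import deque

# Hypothetical "calls" edges, written as (caller, callee).
CALLS = [
    ("web-frontend", "orders-api"),
    ("mobile-frontend", "orders-api"),
    ("orders-api", "service-x"),
    ("batch-job", "service-x"),
    ("admin-frontend", "reporting"),
]

# Hypothetical node types.
NODE_TYPE = {
    "web-frontend": "frontend",
    "mobile-frontend": "frontend",
    "admin-frontend": "frontend",
    "orders-api": "service",
    "service-x": "service",
    "reporting": "service",
    "batch-job": "job",
}


def upstream_frontends(target):
    """Follow incoming 'calls' edges from target until front ends are reached."""
    callers = {}
    for caller, callee in CALLS:
        callers.setdefault(callee, []).append(caller)

    seen = {target}
    queue = deque([target])
    found = []
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, []):
            if caller in seen:
                continue
            seen.add(caller)
            if NODE_TYPE.get(caller) == "frontend":
                found.append(caller)   # stop walking once a front end is hit
            else:
                queue.append(caller)   # keep walking upstream
    return sorted(found)


print(upstream_frontends("service-x"))
```

`admin-frontend` is not returned because nothing on its call path reaches `service-x`. The breadth-first walk with a `seen` set also copes with call cycles, which real service graphs tend to have.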
And you can see there it's a little bit complex, but it's this idea that you can express a query like that in this language. g represents the graph, V is those vertices, and then you just say, hey, I'm looking for a service that has the name service X, and I'm going to iteratively look at the incoming links to those services with a calls relationship until I find a node that has type front end. And then I'm going to map those out. So it's super powerful once all this data is connected. So that's graph. How can graph be used to model the systems that we're trying to observe? The good news is that it already is. Not necessarily in the form of those databases, but in terms of concepts, graph is already being used within many observability tools, and this is one of the reasons why I wanted to talk about this topic today: it is a concept and a technology that is seeing increased adoption, and I think as a community we should look at how we can formalise that and really accelerate its adoption. So one system I wanted to quickly talk about is probably the most familiar, Jaeger. When we perform our tracing, ostensibly that's to allow us to identify performance issues with a particular transaction or type of transaction. But I think one of the greater benefits, when it's first run, is that you understand what's talking to what, and you end up with this graph and go, aha, I didn't realise that this microservice is actually being called by so many other different services. So in Jaeger they have a graph view, and this is the graph you get, but it's just a visualisation at the moment. That's not stored anywhere, but it is one of the most useful outputs of these traces. The other use of graph, which you may be less familiar with, is in a tool called Backstage. Is anyone here familiar with Backstage? Yeah, just maybe a handful. So Backstage actually has BackstageCon just a few doors down the hallway, for the first time.
It's a Spotify project that was open sourced a few years ago and is now in the incubation stage of the CNCF. And it came into life within Spotify through a platform engineering team there saying, well, we really think we need to understand what we have and how it all works together, right? Effectively the model. And they didn't know what to call it at the time, so they called it System Z. It allows developers to effectively catalogue their services and metadata about those services, including what services they depend on, in a service catalogue. Those definitions are normally defined in the code repo. Backstage pulls in the YAML files from different code repos, stores them in a centralised catalogue and enables you to map out your entire environment. It's not specific to any particular infrastructure technology or deployment style or application style. It's just a service catalogue. You might hear Backstage also called a developer portal. I think it's got more uses as a service catalogue than as a developer portal. But again, the guys within Spotify weren't quite sure what to call it, which is why they called it System Z. But it's a very powerful concept, a service catalogue that allows you to dump all this metadata about your services and what they talk to into a single place. And so this is starting to provide people with a model of what's going on. When I saw Backstage within Spotify, they had a demo of how they imported the output of Jaeger traces into Backstage as well. So you could actually spot anomalies or discrepancies between what your developers were declaring as their dependencies and what the dependencies actually were when you look at the code running for real. And it's kind of an interesting use case, right? Do your developers really know what's calling them and what applications they're part of? So graph is already being used to model systems, certainly conceptually.
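The comparison described above, declared dependencies from the catalogue against dependencies actually seen in traces, comes down to a set difference over edges. The sketch below assumes both sources have already been reduced to (caller, callee) pairs; it does not use the Backstage or Jaeger APIs, and the service names and the `dependency_drift` helper are invented for illustration.

```python
# Dependencies declared by developers in the service catalogue (hypothetical).
declared = {
    ("checkout", "payments"),
    ("checkout", "inventory"),
    ("checkout", "legacy-pricing"),
}

# Dependencies observed in traces from the running system (hypothetical).
observed = {
    ("checkout", "payments"),
    ("checkout", "inventory"),
    ("checkout", "fraud-check"),
}


def dependency_drift(declared, observed):
    """Return edges seen at runtime but never declared, and the reverse."""
    return {
        "undeclared": sorted(observed - declared),
        "unobserved": sorted(declared - observed),
    }


drift = dependency_drift(declared, observed)
print(drift["undeclared"])  # called for real, but missing from the catalogue
print(drift["unobserved"])  # declared, but not seen in any trace
```

An "unobserved" edge is not necessarily wrong: with sampled tracing or a rarely used code path, a real dependency may simply not have shown up yet.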
I also wanted to present a couple of examples of where it's being used to combine those signals with the model. I think Backstage is very exciting, and with it just down the hallway there, I think there is great potential to combine what Backstage is doing with what we're seeing here in the open observability community. But here are a couple of examples of how we can then correlate observability data across services using tools like this. One is Kiali on top of the Istio service mesh. This is specific to Istio, but it's a really nice example where they use the Prometheus data to understand what services are connecting to what other services. With Envoy there tracing all incoming and outgoing calls, they can create that map. And then they overlay other Prometheus data on top of that, including alerts. So you can see the data in the context of your service map, and the performance metrics as well. And I think Bartek also mentioned Prometheus maybe having something similar. I haven't seen that yet, but I'm looking forward to seeing it. So starting to combine the model with the observability data I think really produces some powerful results. And at the very far end of the scale of how we could use it, I just wanted to highlight a project that was internal to eBay, something called Groot. There's a white paper, and I think it may have been open sourced. It's an event-graph-based approach for root cause analysis. It's a little small on the screen there, but the idea is saying, okay, if we have on the left our dependency graph, which we've got from either declarative sources or tracing, and then we overlay on top of that the events that are going on within our system, then we can actually understand causality. We can understand that an issue or an event that occurred on some service has impacted some upstream service, which has impacted a customer, by correlating events across different tools.
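The idea of overlaying events on a dependency graph can be illustrated with a very small sketch. To be clear, this is not Groot's algorithm, which is described in eBay's paper; it is a simplified stand-in that only shows the general shape of the technique: start from a symptom, walk the services it depends on, and collect earlier events on those services. The `DEPENDS_ON` table, the events and the 300-second window are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    service: str
    kind: str
    timestamp: int  # seconds since some arbitrary start


# Hypothetical dependency graph: service -> services it depends on.
DEPENDS_ON = {
    "storefront": ["checkout"],
    "checkout": ["payments", "inventory"],
    "payments": ["payments-db"],
    "inventory": [],
    "payments-db": [],
}

# Hypothetical events gathered from different tools.
EVENTS = [
    Event("payments-db", "deploy", 100),
    Event("payments", "latency_alert", 130),
    Event("checkout", "error_rate_alert", 150),
    Event("storefront", "customer_ticket", 200),
    Event("inventory", "deploy", 500),
]


def candidate_causes(symptom, window=300):
    """Events on the symptom's dependencies that happened shortly before it."""
    reachable = set()
    stack = [symptom.service]
    while stack:
        service = stack.pop()
        if service in reachable:
            continue
        reachable.add(service)
        stack.extend(DEPENDS_ON.get(service, []))

    related = [
        event
        for event in EVENTS
        if event != symptom
        and event.service in reachable
        and symptom.timestamp - window <= event.timestamp <= symptom.timestamp
    ]
    return sorted(related, key=lambda event: event.timestamp)


ticket = EVENTS[3]
for event in candidate_causes(ticket):
    print(event.timestamp, event.service, event.kind)
```

With this data the earliest related event is the `payments-db` deploy, which makes it the first place to look. The later `inventory` deploy is excluded because it happened after the ticket was raised.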
So there's definitely scope for using this in some very powerful ways and starting to make useful the observability data we're collecting. I wanted to finish with some ideas around how we can use this concept further. How can we adopt this? And how can we do more with our data? The first one is about extending the scope of observability. As I say, we're talking a lot about metrics, logs and traces. And typically that's because we're focused on the system as we define it, as our software. What is the software doing? What is the code doing right now? Or perhaps the infrastructure that code is running on? But really that software system is just running within a larger system that we really do care about. I've called it here the business system. There are two elements to that. One is that when we're looking at how our software is performing, we shouldn't be looking at that in isolation from what the business is actually doing and cares about, or what our customers are experiencing. If we're just looking at the performance of the code then we're really blinkered to the bigger picture. So in terms of modelling the system, we should be thinking more broadly than just our code. We should be thinking about the business that the code is running within. So for example, is the business hitting certain KPIs like conversions or adoption or revenue? Or are the customer events something that we need to observe as well, like logons, purchases, even service tickets? That's what we should really be thinking about as the system that we're trying to observe, not just the code. And on the other side, very importantly, as we all know, things often break because of changes. So when we're talking about the system that we're trying to observe, we really need to include the development teams in that: pipeline deploys, code changes, the issues that they're tracking and fixing.
And of course our third-party dependencies, our external dependencies: the cloud providers that we're running on, some external API that we're depending on. We need to understand the status of those. Potentially even cost as well. That's another aspect of observability. So when we're thinking about observability and the model, the system we're trying to model, we really should think beyond just the software. We should look at it in the context of the broader system. And again, that's where graph can really help us, because we're trying to find answers by connecting the dots across everything that's going on. So for example, if we have a customer-impacting incident, that might be recorded in our service desk tool. We might know which customer it is, and we have some information on that ticket. And of course, the first question we'll ask ourselves is, where do I look? Where's the problem? But often within a larger organisation it's not so much about where do I look, it's who do I get involved? I don't want to, A, have to go and get some architect who understands how all the pieces fit together and, B, have to try and pull in multiple dev teams and take them off what they should be doing, which is delivering new features and value, to try and resolve this incident. So by correlating that data across multiple different tools and domains, we should more easily be able to relate something that's happening on the customer front with what our dev teams are doing and the changes happening to our system. And then just the final point, in terms of adoption. I've talked about a number of different tools that are either defining graphs, creating graphs, or perhaps consuming data in graph form. But right now we don't have a way of connecting these things. And that's very frustrating, given the whole idea here is that we should be able to extend that graph from one tool and one domain across multiple tools and multiple domains, and that's when we really start to see the big picture.
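The "who do I get involved?" question above can be sketched as a walk across edges contributed by different tools. Everything here is hypothetical: the three edge lists stand in for exports from a service desk, a service catalogue and a deployment pipeline, and the sketch assumes they already share identifiers for the same services, which is the hard part in practice.

```python
# Edges exported by three hypothetical tools, as (source, relation, target).
SERVICE_DESK = [
    ("ticket-123", "reported_by", "customer-acme"),
    ("ticket-123", "affects", "storefront"),
]
CATALOGUE = [
    ("storefront", "depends_on", "checkout"),
    ("storefront", "owned_by", "team-web"),
    ("checkout", "owned_by", "team-payments"),
]
PIPELINE = [
    ("deploy-987", "changed", "checkout"),
]


def merge(*sources):
    """Stitch edge lists from several tools into one adjacency map."""
    graph = {}
    for edges in sources:
        for source, relation, target in edges:
            graph.setdefault(source, []).append((relation, target))
    return graph


def who_to_involve(graph, ticket):
    """Find owning teams and recent changes for everything a ticket touches."""
    services = []
    stack = [t for r, t in graph.get(ticket, []) if r == "affects"]
    while stack:
        service = stack.pop()
        if service in services:
            continue
        services.append(service)
        stack.extend(t for r, t in graph.get(service, []) if r == "depends_on")

    teams = sorted(
        {t for s in services for r, t in graph.get(s, []) if r == "owned_by"}
    )
    changes = sorted(
        source
        for source, relations in graph.items()
        for r, t in relations
        if r == "changed" and t in services
    )
    return teams, changes


graph = merge(SERVICE_DESK, CATALOGUE, PIPELINE)
teams, changes = who_to_involve(graph, "ticket-123")
print(teams, changes)
```

No single tool could answer this on its own: the ticket lives in the service desk, the ownership in the catalogue and the deploy in the pipeline. The answer only appears once the three edge lists are merged.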
So for example, taking our service catalogues in Backstage, with the data we're seeing from our observability platforms, with our dev systems like our CI/CD pipelines and Jira, and then even platform resources. We've got Azure Resource Graph, we have AWS. How can we stitch this all together? So potentially at a future Open Observability Day, hopefully we can be talking about an open standard for something we might be calling the observability graph, which allows us to really start to make sense of all this data that we're collecting. So that's all I had. Just to recap: observability needs signals and a model. Models are graphs, already in use today across a number of tools, but we should be thinking about extending the scope of what we think about as observability. We can use that to connect the dots to find those answers. And do we need an open standard for the observability graph? So thank you very much for listening. If you do have questions, I think we might have a few minutes; otherwise grab me in the hallway. I'd love to talk to you about this if you're interested. Thank you.