Hello, welcome everybody. My name is Mandy Chessell, and this is my colleague, Dan Wolfson. We work for a company called Pragmatic Data Research Limited, and we're both closely associated with a project called Egeria. The topic today is actually about three open source projects. Egeria is obviously one of them. Another is a project called OpenLineage, and the third is Marquez. All three of these projects are part of the LF AI & Data Foundation. We're closely related, and this is really an example of a collaboration that we put together to show a broader value than each one of the projects could produce on its own. We call it "truly open lineage" because it's all about creating traceability across the execution of work in different workloads. That sounds quite simple, but to really be able to use and analyse it is a complex problem. So the plan today is to take you a little bit through what lineage is, some of the challenges, and then what we've managed to achieve together with the three projects. All right, so yes, very simple: what it is, what we did, and what we got from it. So let's start with the definition of lineage, and let's think about what could be happening. Here we have three regions of a company all doing very well; all the graphs are going up. But when the CEO looks at the combination, something strange has happened: the graph is going down, so something is wrong. The question is, where is it wrong, and what is wrong? Are the values incorrect? Where are they incorrect? Something's inconsistent. Is it the way that the data is being combined? Are all the units the same, anything like that? Or is something missing? Has one of the regions not actually reported any numbers? These are the types of questions you get when data doesn't look right, and where lineage can help. There are other questions as well. Is the data coming from the right sources? Are the schemas matching?
Are we tying the wrong types of data together? We also need to understand whether it's right in the question of time. It's not just did it flow from the right place, but did it flow in a timely enough manner that the aggregation worked correctly? And often, once you know these things, there are follow-on questions that you can answer, such as: OK, this data is wrong, who do I talk to? Who owns that data? And if I own data, I need to know who's using it, so that if I need to change things, I can actually let people know that things are changing. So there are lots of questions you can have when you start to share data among different teams. You want to switch over? So a lot of this is about building a healthy data ecosystem. One of the main points is that this is a team sport. We do need to have the different teams communicating with each other in order to understand how the data is flowing, what you can do about it, and how the teams understand each other, because there are often implicit contracts in place between these different teams. So if we move on. The key to all of this is the data lineage. Data lineage really describes how the data is produced, how it's consumed, and how it flows through the system. We can do a lot by understanding that and all the different attributes around it: understanding the sources, understanding the targets, understanding how it's flowing, how long it takes to flow, and the inputs and outputs of each process that's taking place. So lineage shows how the data is flowing from its origins to its different destinations. And there's not just one lineage graph; there are lots and lots of different paths through the system that the data takes.
And so even when we're starting from a single source, that data may be propagated through many different chains of lineage to a lot of different targets, for different purposes, at different times. We want to understand that traceability, so that when we're looking from a consumer's point of view and this data arrived, we can say where it came from and how it got here. Once we understand that, we can look at things such as impact analysis. If somebody changes something upstream from me, how is that going to impact me? If I change this table, what are all the systems behind it that are going to break? So we can get to different kinds of impact analysis, and we also get to data observability in terms of understanding how the data is flowing. Is there a breakage in the chain? Is it taking too long? Is it, in some cases, taking suspiciously little time? So we can start to get to governance by expectation, by observing what's actually taking place in the system as we collect the facts about how the data is moving. Let's go on. There you go. Yes, that's fine. I think my mic has slipped; let's just skip past that one. OK, so let's start diving into what we would expect in lineage. The traditional approach to lineage came really from the ETL engines. These are engines that move data from one place to another, or really copy and transform it. So we had this idea of a graph, basically a flowchart, showing how the data was moving between the systems. Here's an example of this type of job: it's reading from a file, looking things up in a Hive table, and writing out to a Kafka topic. But today we also have a lot of microservices and API calls, so everything is connected via request-responses. So our lineage graph has to think about not just a flow of data.
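As an aside, the impact analysis described here is, at heart, a reachability question over the lineage graph: starting from the changed asset, find everything downstream of it. A minimal sketch in Python follows; the asset names and graph shape are invented for illustration, not taken from any of the three projects.

```python
from collections import deque

def downstream_impact(lineage, changed_asset):
    """Return every asset reachable downstream of changed_asset.

    lineage is a hypothetical adjacency map: asset -> list of assets
    that directly consume it (tables, files, topics, processes).
    """
    impacted, queue = set(), deque([changed_asset])
    while queue:
        node = queue.popleft()
        for consumer in lineage.get(node, []):
            if consumer not in impacted:
                impacted.add(consumer)
                queue.append(consumer)
    return impacted

# Hypothetical lineage: a landing file feeds a Hive table,
# which feeds a report and a Kafka topic.
lineage = {
    "landing/orders.csv": ["hive.orders"],
    "hive.orders": ["report.daily_sales", "kafka.orders_topic"],
    "kafka.orders_topic": ["report.fraud_alerts"],
}
print(sorted(downstream_impact(lineage, "landing/orders.csv")))
# → ['hive.orders', 'kafka.orders_topic', 'report.daily_sales', 'report.fraud_alerts']
```

The same traversal run in the opposite direction (consumers back to producers) answers the "where did this data come from?" question.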
We have to think about control flows, and we also have to think about request-response calls. So let's start sticking things together, because there are a lot of technologies that support lineage, and this is a good thing. But on their own, they only tell us about one box. One of the things that we've been doing as a collection of projects is saying: well, data does come from a long way away, and it does pass through a lot of technology. How do we bring that picture together so that we can really see how this data got to here? That's the key challenge: the heterogeneous nature of the way that we put systems together today. Now, I'll just go back to this. You see here I've got different technologies. One of the things our project Egeria does is provide a set of types that describe all kinds of technology. So if I take that same graph and move it through, what you see here are the metadata elements that we would use to describe the shape of this inside Egeria. So we can capture, and Dan will go into this in a little bit more detail, what we call design lineage, which is the structure of how things have been implemented, inside Egeria. But that's all great: we know what's deployed, but what actually ran? And that is not what we do in Egeria; we're focused very much on the structure and how things flow. So Dan, do you want to go into that? Sure. As Mandy mentioned, we can simplify things by thinking about design lineage and operational lineage. Design lineage really represents your intent: this is how I intend things to operate, this is how I'm going to transform the data from one thing to another, how I'm going to deploy that out. But that design may give you multiple paths to take through the system; there's not just a single way through. It may be that if the data is, let's say, critical data, it goes down this path, and if it's less critical data, it goes down that path.
If it's potential fraud, it goes down maybe a third path. So what you also want to do is understand what really happened, and that's what you can see through the operational lineage: you see what in fact happened. How long did it take? Where did it really flow? What systems did it touch? Where did the data really come from? How many rows got moved on this process run? All those kinds of details are part of the operational lineage that we can see through the system. By looking at that, we can understand how the system is operating well, how it could operate better, and what to do when the system fails. And did it actually fail? Because you can see that too: suddenly I'm not seeing rows move through my operational lineage on this particular path, so something broke. Maybe. And so that's again why we need to look at these things. Thank you. So in terms of Egeria, as I said, from our project's perspective, we were very good at design lineage. We can capture all sorts of information from different technologies and show how things should work. But this operational aspect was a problem for us; we didn't really have capability for it. So as we started to hear about these other two projects, OpenLineage and Marquez, we started to realise that between us we could pull together a good story. Particularly since, as you start to examine this problem and the way systems work together, it's not a simple division between the technology that produces design lineage and the technology that produces operational lineage. So let's start with something very simple: here's a process that takes data from one database and writes it out to another. From a design lineage point of view, this is the sort of metadata that we would capture in something like Egeria. Then, when the process ran over time, we would have lots of process instances being captured, and this is what we would call operational lineage.
So this is saying that the process ran three times, and in each case we transferred a certain number of records between the two servers. So we know how often it's running and how much data is being transferred. That's a very simple separation between what we knew at deployment time and what we are getting from observing the runtime. But now let's think about files. Files are much more problematic, because they appear when things run; they're not necessarily deployed out of a CI/CD pipeline. So what we may know at deployment time is that source files are going to appear, say, in a landing area, that we're going to put them somewhere depending on the logic in the process, and that we're deploying a process that's going to be monitoring the landing area and creating things as files go past. So as the process runs, we are picking up information about source files, and each one is creating a new instance of the process that creates the destination file. That capture of lineage is operational in exactly the same way as above, but it's also creating the design lineage at the same time. You can imagine the operational lineage being harvested to create the equivalent design lineage that we would have had at deployment time in the database case. So the distinction between operational and design lineage is actually quite fuzzy, and you might say, well, maybe you just need this and you don't need the thing above. But actually, each engine uses its own way of identifying things. So the more traditional catalogue, which understands the relationships between things, is then able to stitch together the knowledge that we get from the different engines. The two work very well together, because we have a knowledge base, and then we can see what's running and correlate the two perspectives using a combination of the knowledge from both areas. Right, Dan. OK, and that really sets up the motivation for the OpenLineage project.
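The harvesting idea, deriving design lineage from a stream of operational run records, can be sketched roughly as follows. The record fields and folder layout here are hypothetical, not an Egeria or OpenLineage format; the point is just that many concrete runs collapse into one design-level edge.

```python
import os

def harvest_design_lineage(run_events):
    """Collapse individual process runs (operational lineage) into
    design-level edges: (source folder, process, destination folder),
    each with a count of how many runs exercised that edge.

    Each run event is a hypothetical dict with 'source', 'process'
    and 'destination' file paths.
    """
    edges = {}
    for ev in run_events:
        key = (os.path.dirname(ev["source"]),
               ev["process"],
               os.path.dirname(ev["destination"]))
        edges[key] = edges.get(key, 0) + 1
    return edges

# Three runs of the same monitoring process over a landing area...
events = [
    {"source": "landing/a.csv", "process": "mover", "destination": "archive/a.csv"},
    {"source": "landing/b.csv", "process": "mover", "destination": "archive/b.csv"},
    {"source": "landing/c.csv", "process": "mover", "destination": "archive/c.csv"},
]
# ...become a single design-level edge with a run count of 3.
print(harvest_design_lineage(events))
# → {('landing', 'mover', 'archive'): 3}
```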
It's really about how we get to some common interfaces that allow all the different technologies to publish these events so that they can be consumed and linked together. So OpenLineage is an open standard for the collection of lineage information from pipelines as they run on the different technologies. And if we go on, where we can see that it fits in is that there are both producers of this information and consumers of this information. Some of the producers might be libraries such as Pandas, or runtimes such as Spark, dbt and Airflow, which can produce the OpenLineage events, and these can be consumed by technology such as Amundsen, which is another data catalogue, or Egeria, or Marquez, or anything else that wants to listen to these events as they're taking place. Before we had OpenLineage, if you had a metadata catalogue such as Amundsen, or Egeria for that matter, you would have to write special adapters for each of the different engines to go and capture their kinds of events. So you would end up with a rat's nest of different connectors, and every time one little thing changed, everybody had to rush around and change all their adapters to be able to consume that information. Whereas with OpenLineage, we get a nice smooth transition. We make life easier for everybody, both the producers and the consumers, and in fact we add more value because we make it easier to tie these things together. That's really the main goal behind OpenLineage. The OpenLineage contributors, let's move forward one, include projects such as Pandas, Marquez, dbt, Amundsen, Great Expectations on the quality side, and Microsoft; Iceberg, Parquet and others are looking to join as well and looking at how to participate, and we strongly encourage that community to continue to develop and grow. It makes everybody's life easier.
All right, so I'm going to now talk about the work that we did together as projects. This is actually what the OpenLineage standard is, and my mic is slipping again. This is all it does, and this is its power, actually. It defines the structure of an event, called the run event, and it defines an HTTP endpoint, and that is HTTP, not HTTPS, which will become significant a little bit later. The idea is that an engine like Spark, when it's running, just formats its own logs into this run event and publishes it to an endpoint whose address it is given when it runs. That's the standard. The event is quite straightforward. It has an event type, which says I'm starting, stopping, or doing something else, and you can describe the job that's run, the job's parent, and the inputs and the outputs. Those are roughly the sections. Then we have facets, and these facets are what the different technologies can add. In fact, we could pull an OpenLineage event into Egeria, augment it by adding new facets, and then push it out to somebody else. So it's not just the first party that can put facets in; downstream consumers can also add facets. There are some facets defined in the standard, and you can add your own. So it's extensible: you can experiment with new ideas, and then when you're sure of a particular facet, you can push it back into the standard. That's how it is organised, and here are some examples of the flows. If you think about this process that's running, there's an outer process and then three sub-processes. You see the starting and the stopping of the main process, then the first sub-process that starts and stops, and the whole thing is tied together because the outer process is the parent of the sub-processes. So you can re-establish the hierarchies within the processes that are running.
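To make the run event concrete, here is a sketch that builds one as a plain dictionary. The top-level field names (eventType, eventTime, run, job, inputs, outputs, producer) and the parent facet shape follow the published OpenLineage specification, but the namespace, job names and producer URI below are made up for illustration, and a real client library would normally do this for you.

```python
import json
import uuid
from datetime import datetime, timezone

def make_run_event(event_type, job_name, run_id, parent_run_id=None,
                   inputs=(), outputs=()):
    """Build a minimal OpenLineage-style run event as a plain dict.

    Field names follow the OpenLineage spec; 'example-ns',
    'outer-process' and the producer URI are illustrative only.
    """
    run_facets = {}
    if parent_run_id:
        # The parent facet ties a sub-process run to its parent run,
        # which is how the hierarchy is re-established by consumers.
        run_facets["parent"] = {
            "run": {"runId": parent_run_id},
            "job": {"namespace": "example-ns", "name": "outer-process"},
        }
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL, ABORT
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id, "facets": run_facets},
        "job": {"namespace": "example-ns", "name": job_name},
        "inputs": [{"namespace": "example-ns", "name": n} for n in inputs],
        "outputs": [{"namespace": "example-ns", "name": n} for n in outputs],
        "producer": "https://example.org/my-engine",
    }

# A START event for a sub-process, linked back to its parent run.
parent_id = str(uuid.uuid4())
child = make_run_event("START", "sub-process-1", str(uuid.uuid4()),
                       parent_run_id=parent_id,
                       inputs=["landing/orders.csv"])
print(json.dumps(child, indent=2))
```

In the standard, an engine would POST this JSON body to the configured HTTP endpoint each time a run starts, completes, or fails.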
The reference implementation is Marquez. Marquez is a metadata repository designed to monitor the running of processes; that's its focus area. So it supports the HTTP endpoint, and whenever we're testing, if an event runs through Marquez, then the event is good. So it is considered the reference implementation. One of the things that I worked on is a thing called the proxy backend. Because the endpoint is plain HTTP, which makes it nice and quick and easy to set up, you can't deploy the catcher across an enterprise environment; it needs to be kept very local to the actual runtime engine. So the proxy backend is deployed alongside that runtime engine, into the secure enclave that the process is running in, and then it publishes the event onwards. It acts, as I say, as a proxy. In the first instance we use Kafka, so the events come out very fast from the Spark engine, and we only need to keep the proxy backend up and running at the same time as the engine, so we haven't got to keep it up 24/7. Then we have the ability for an enterprise capability to pick up the events from the Kafka topic. Here, in the yellow boxes, is Egeria listening on the Kafka topic, pulling the events in and processing them: extracting the design lineage we talked about before, correlating things, and augmenting the events. So we're literally going from OpenLineage to OpenLineage. We also put in support for the HTTP endpoint, for the occasions when the organisation wants to go straight to Egeria. I suspect the proxy backend will be used more than this direct route, but we wanted to have a complete implementation, with both the endpoint and the proxy backend. It makes no difference whether events come in through Kafka or through the HTTP endpoint. And then, since we're capturing, we also need to produce. Egeria itself is a governance engine: it runs processes. And so it produces OpenLineage; it's a processing engine.
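The augmentation step, pulling an event in, adding facets, and republishing it, can be sketched as a small function over the event dictionary. The facet name and body below are invented for illustration, not standard OpenLineage facets; the key point is that the original event is left untouched for onward publication in its original form.

```python
def augment_event(event, facet_name, facet_body):
    """Return a copy of an OpenLineage-style run event with an extra
    run facet added, as a downstream consumer is allowed to do before
    republishing. The original event dict is not modified.
    """
    augmented = dict(event)
    run = dict(augmented.get("run", {}))
    facets = dict(run.get("facets", {}))
    facets[facet_name] = facet_body  # custom facets sit alongside standard ones
    run["facets"] = facets
    augmented["run"] = run
    return augmented

# Hypothetical facet recording which governance zone handled the run.
incoming = {"eventType": "COMPLETE",
            "run": {"runId": "run-42", "facets": {}},
            "job": {"namespace": "example-ns", "name": "mover"}}
outgoing = augment_event(incoming, "governanceZone", {"zone": "quarantine"})
print(outgoing["run"]["facets"])
# → {'governanceZone': {'zone': 'quarantine'}}
```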
And so here we have the capture of our normal metadata, the databases and files and things that are going into the metadata environment; we're capturing the lineage and correlating the two together. We're also publishing either the OpenLineage that came from an external engine, so it can be stored and processed later, or our own OpenLineage showing how the governance processes are running, because we think it's as important that the governance is governed as well as the data that it's managing. For the publishing, we have a log store, so we publish the events in their original format to the log store so that any other process can work on them offline, and we can also publish to Marquez and use Marquez's API to do additional analysis on that work as well. As Dan said, one of the things that becomes very important, particularly when you've got a lot of things running, is that you can't rely on people looking at the logs and trying to spot errors; you want systems to do that. That means you need to know what should happen, and the processes that know what should happen are then monitoring the logs to make sure that what should happen is happening. Generally it's easy to spot when something fails, because you get exceptions, but when something doesn't happen, that's much, much harder, because nothing happened. But with this type of monitoring, if the system knows that the process is supposed to run every 10 minutes and nothing's happened for half an hour, it can start raising alerts to say something has fallen over and we're missing something. So this is bringing it all together: we have the OpenLineage coming in, we have Marquez acting as one of the destinations for the lineage, for monitoring, and Egeria is receiving, augmenting and publishing the OpenLineage, plus it can also run the processes that do the validation of the operational environment. That's a bit tiny, but here's just an example of viewing the log store.
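The "nothing happened" detection just described can be sketched as a simple check over the most recent event timestamps per process. The process names, expected intervals and grace factor here are illustrative choices, not part of any of the three projects.

```python
from datetime import datetime, timedelta

def missed_run_alerts(last_seen, expected_interval, now, grace=2.0):
    """Flag processes whose most recent run event is older than
    grace x the expected interval: e.g. a 10-minute job that has
    been silent for 30 minutes. last_seen and expected_interval
    are hypothetical maps keyed by process name.
    """
    alerts = []
    for process, ts in last_seen.items():
        if now - ts > expected_interval[process] * grace:
            alerts.append(process)
    return alerts

now = datetime(2022, 6, 1, 12, 0)
last_seen = {"load-orders": now - timedelta(minutes=30),     # silent too long
             "load-customers": now - timedelta(minutes=5)}   # on schedule
expected = {"load-orders": timedelta(minutes=10),
            "load-customers": timedelta(minutes=10)}
print(missed_run_alerts(last_seen, expected, now))
# → ['load-orders']
```

Catching failures is the easy half (an exception or a FAIL event shows up); this inverted check is how the absence of events becomes visible.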
For certain types of processes, you may need to provide proof that the process ran successfully, and this log store provides a place, you might not keep the events there forever, but a place to gather that information so that it can be preserved for later use. And this is Marquez. This particular picture is Marquez showing the various runs of a governance process running in Egeria. So we're now seeing not just Egeria capturing and publishing, but also Marquez being used to monitor how the processes are operating within Egeria, which is really nice. So that's how we're doing on time; that's fine. What was the value? We've shown you what lineage is and why it's so important, but the thing that has been amazing for the three projects is that each of us has our individual value that we're all proud of, and our own ways of working, but as we came together, not only did we build this extra value between us, but each project became more mature. So, for example, Egeria taught the OpenLineage team about SPDX tagging. Egeria learned about Dropwizard, which gave us a really great way of producing lots of sample applications to drive Egeria in many different ways. The Egeria team gave OpenLineage the proxy backend. So between us, we're not just collecting things together; we're all enriching each other's communities through this collaboration. And I think this has been, to me, the beauty of it: not only has it solved a very difficult problem for a lot of people who are working with data, but our projects have got better as a result of the collaboration. So I think we're on to questions. Has anybody got any questions on what we covered or how the projects work together?
On getting involved in any of these projects: these projects are all LF AI & Data projects, so if you go to the LF AI & Data site you can see the projects listed, and for each project you click on the tile and it will show you how to make contact and get involved. For Egeria, for example, there are weekly community calls, and we cover different topics every week; there are also office hours and other things that we have to help people get involved. Each project still operates independently. OpenLineage meets once a month, and they've got two aims, really: one is the standardisation of new facets for different aspects of processing, and the other is to bring in new integrations with the engines. They're much more focused on getting engines to produce OpenLineage than thinking about the consumers, but I think that's more a sign of the times. So a lot of the calls we have with OpenLineage are about new integrations and possible extensions to the standard. Egeria is a very big project with lots of tracks: there are tracks focused on supporting security governance, tracks focused on supporting data governance, and we're doing a lot of work on sustainability at the moment, looking at how we help organisations think about sustainability, and monitor and understand their operations in a way that means they can start to improve how they operate. For each of those we have a large-scale roadmap that shows how each of the tracks is operating. Generally we have a community call every Wednesday, and that's where the different stages of those pieces are talked about. And on how the roadmap progresses: because it's open source, we generally release every month, so that we can keep the software up to the latest levels of all our dependencies, and whatever's ready goes.
So we can't say this will definitely be ready in October; it's more like, this is the next thing that will come out, and when it's ready, it will be released. Does that help? I mean, there's an ongoing flow of new integrations, new capabilities, new use cases. It's a pretty active project, going in many directions. And we've got other people here from the Egeria team as well who are contributing on different tracks. Yep, go ahead. Yeah, please. You talked about data governance; I myself am involved quite heavily with standards. Yeah, lots, actually. Our byline is "open metadata and governance", and when we started, about three or four years ago, the principle was: let's follow the standards, we want to be open. What we did was pull together lots of standards around catalogues and data descriptions, that is, metadata about particular types of data, into a patchwork, and our open type system comes from those standards. Because they overlap, we needed to reconcile the standards, so you could say the type system doesn't exactly match how any one particular standard works. But if you start looking at those standards, they map very easily, because they were the source of our type system. When it comes to governance, standards like ISO 27001 for security map straight onto the type system. The DAMA stuff maps very, very closely onto the way that we do things. But what's fascinating is that the way data governance is described is very close to the way security governance is described, and then you bring the ITIL stuff in and you say, wait a minute, this is all the same. So Egeria defines things called governance domains, each focused on one of these areas. But they can collaborate, and things that are being done in security can help data governance. So we allow that meshing of actions within the organisation across the different domains. Does that help?
There's another dimension as well, which is around the management of standard values, what we often call valid values or reference data. Again, you'll have de jure and de facto standards, and of course local standards within companies, around particular kinds of data. Whether it's country names or area codes or whatever it's going to be, there are hundreds or thousands of these, and being able to capture them and use them as part of your quality programme, to validate that the data you're seeing conforms to the standard sets of reference values, is incredibly important when you're trying to integrate data together in order to gather meaning. So Egeria supports both the management of reference data and the mapping of reference data, and then ties that into semantics through glossary terms and other things. Any more questions? OK, well, thank you so much for listening and coming to this. I hope it was interesting, and please connect with any of the three projects that we've talked about today; I think I've put them up on the slide. And we continue to meet and collaborate to bring these projects together. So thank you so much.
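The reference-data validation described here boils down to checking observed values against a managed valid-values set. A minimal sketch, with made-up sample data; in practice the reference set would come from the catalogue rather than a literal:

```python
def validate_against_reference(values, valid_values):
    """Split observed values into conforming and non-conforming
    against a reference set of valid values (for example, a set
    of country codes). Preserves the order of the input values.
    """
    valid = set(valid_values)
    ok = [v for v in values if v in valid]
    bad = [v for v in values if v not in valid]
    return ok, bad

# Hypothetical reference data: a small set of country codes.
country_codes = {"GB", "US", "DE", "FR"}
ok, bad = validate_against_reference(["GB", "UK", "US", "XX"], country_codes)
print(bad)
# → ['UK', 'XX']
```

A quality programme would run this kind of check over incoming data and feed the non-conforming values ('UK' here, where the reference set expects 'GB') back to the data owners identified through the lineage.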