Hello, my name is Mandy Chessell and I'm the speaker for the next session. In this session, I'd like to talk about becoming a data-driven organisation using a particular technology called ODPi Egeria. This is a Linux Foundation open source project and we're certainly looking for more people to get involved, so if you find this interesting, please get in touch.

So what's this all about? Many organisations over the last six months have discovered that in times of crisis they need to act quickly to re-establish themselves in the new world they find themselves in. In general, that involves understanding the resources they have, where they're located, what status they're in, how healthy they are, and then being able to redirect them to new opportunities. Many organisations have also found that this is harder than they expected. The reason is that throughout the organisation, professionals are working together using specialised tools to do their jobs, and they're organised in a way that supports today's business. When the business needs to change, that organisation obviously needs to be re-established, but unfortunately all the knowledge of those teams is locked into the tools they were using, which makes the reorganisation much harder. So what we're looking to do is enable an organisation to link their tools together, irrespective of the vendor they were bought from, whether they're open source, or whether they were bought for a specific purpose, and allow that knowledge to flow between the tools as appropriate.

So I thought I would give you an example of what this could look like in an organisation. Many organisations have what we've described here as an encoded vocabulary. This is a glossary of the terminology used by the business that describes the data they use. For example, they might talk about customers, customer names and customer status. All of these terms are often also reflected in the data that the organisation stores, and so the glossary becomes a very useful way of describing the types of data the organisation needs to support. You can then take that glossary to experts in the business and ask: how should a customer's address be handled? Who should be allowed to see it? The answers to those questions can then be attached to the vocabulary.

Now if we think about the data structures in databases, or data passing through APIs, they often have what we call a schema, which describes the data fields involved. If we take that schema and attach terms to it, we're basically saying that the data in this field means this to the business. And because the business has attached the classifications, effectively the rules, associated with managing that type of data, we now know how the data should be managed. Now, we could ask developers to look this up as they're coding, say, a new API or a new database. But it would be so much better if, when they're working with their chosen developer tool, their activities automatically pull in that knowledge so they can do everything from the tool they love. One example we have is where developers were using the Swagger tooling to develop new APIs. There's a search box in that tool that lets them type in, say, "customer", because they want to create an API that passes customer data.
The tool then uses a link to the metadata, to the marked-up schema, to create the payload for that particular part of the API. The developer's very happy because they haven't had to type it in, they know the structure is exactly what the business needs, they haven't forgotten anything, and it's all spelt consistently. That's really good. But the other thing is that the schema is marked up with the knowledge from the vocabulary, because we've created that integration. So the developer finishes their work, and they may have created a database in the same manner as the API, and it goes into the DevOps pipeline. Now, those tags are machine readable, so the DevOps pipeline can use them to determine whether there is sensitive data in the APIs and the database. And if there is, maybe there are extra tests that need to be run, or the deployment needs to go to a specific secure environment. So again, we're extending the value of the work that was done in encoding the original vocabulary, because we're including another tool, basically passing that knowledge on to another tool in the chain I'm showing here. So now we've helped the developer and the DevOps team using the work of the business, and we have greater assurance that what's being deployed is actually what the business specified.

Now let's imagine some time passing. This application has been running in production for a while and is producing a lot of very valuable data, and so the organisation wants to do some analytics, and maybe AI, on it. The first thing that needs to happen is that the data scientists need access to that data so they can experiment and look for patterns within it; that's the knowledge needed to configure the analytics they want to run. Now, obviously there may be very sensitive personal data in the application, and so when that data is passed to the data scientists, it may need to be filtered, or some of it encrypted, or some other transformation applied to hide the personal information without destroying the ability to do the analysis. Again, this is where that original metadata becomes so valuable, because we actually know what the data in the database is: it has been linked to those terms indirectly through the DevOps pipeline process. So the data scientist gets data that is not only safe for them to use, but is also tagged with the real meaning of the data in each field. They're not wasting time guessing, or trying to find somebody who can tell them exactly what the data in each field means. Their work is speeded up because they have this extra knowledge.

So what I hope I've shown you is that as you add in a new tool, and as you share this knowledge between the tools, each person in the chain becomes more effective, and their work adds new knowledge that can then be consumed by others. And the way that people work in an organisation is not a straight flow; it's more of a network. So the value that one team gives through sharing their metadata comes back to them in another form, as additional knowledge attached to what they contributed. The whole organisation becomes more effective, it's much more transparent as to what's going on, and so the company is able to be more collaborative and more agile in times of challenge like the ones we're in today. So what does ODPi Egeria do?
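To make that DevOps pipeline step a little more concrete, here is a minimal sketch of the kind of check a pipeline could run against the machine-readable tags attached to a schema. The types, field names and classification names here are hypothetical illustrations of the idea, not Egeria APIs.

```java
// Hypothetical sketch: a pipeline step inspects the classifications attached to a
// schema's fields and routes the deployment accordingly. Not an Egeria API.
import java.util.List;

public class DeploymentGate {

    record Classification(String name) {}
    record SchemaField(String fieldName, List<Classification> classifications) {}

    /** True when any field carries a sensitivity classification. */
    static boolean containsSensitiveData(List<SchemaField> fields) {
        return fields.stream()
                .flatMap(field -> field.classifications().stream())
                .anyMatch(c -> c.name().equals("Confidential") || c.name().equals("PersonalData"));
    }

    public static void main(String[] args) {
        List<SchemaField> payload = List.of(
                new SchemaField("customerName", List.of()),
                new SchemaField("customerAddress", List.of(new Classification("PersonalData"))));

        if (containsSensitiveData(payload)) {
            System.out.println("Sensitive data present: run privacy tests, deploy to secure environment");
        } else {
            System.out.println("No sensitive data: deploy to standard environment");
        }
    }
}
```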
Egeria's major role is to support those blue arrows: to support that exchange between tools and processes that come from different technologies and support different professions within the organisation. As I say, it's all about building that knowledge base and using it to make decisions throughout the organisation, to create increased agility and collaboration.

Now, the cynical among us, and I was certainly one when we started this, will point out that there have been many attempts to share metadata. We've had standards. We've had approaches where there's a single centralised metadata repository that everybody uses. We've had vendors create a suite of tools that all use the same metadata repository so that they're sharing it, and that works very well, but it has a limited scope; it doesn't cover everything the organisation needs to do. And other tools just copy metadata, a sort of cut and paste between tools, which does help initially, but the copies tend to diverge over time. So there have been many attempts, because this is an extremely valuable capability to have.

So why do we think that Egeria is different and special? The major reason is that we decided not to do it as a single-company effort. We decided to work with the Linux Foundation to create an open environment where vendors, and the different organisations that need to use this technology, can come together, work through the problem, and build a solution that works for all. And that led to a number of key technical requirements. The first is fairness: there is no controlling technology. It's peer to peer, and it's designed around retaining the value of each technology. The technologies support different subsets of metadata and have different levels of sophistication in the way they work with metadata, and that should not matter. The Egeria technology needs to enable each of them to be the best it can be without limiting the capability of the others. So it's extremely important that we have this sense of fairness and value throughout the integrated tools.

The other thing is that not everybody has a huge IT department, so we need to make the technology self-configuring and distributed, and enable both a batch, background exchange of metadata and real-time, federated query access across the ecosystem. Data and IT have spread from tiny devices distributed out in the environment through to very large-scale, highly available, globally accessible services, and so Egeria itself has to be able to scale to support that range. We need to support multiple tenants. This means a single Egeria service can be offered to lots of different organisations without muddling them up or giving one organisation's users access to another organisation's metadata. We call that multi-tenancy: we silo and separate the data depending on which organisation the user is coming from. We also need a huge focus on security, and I'll cover that in a little more detail later, because metadata, as it's brought together and linked, becomes incredibly valuable, and so we need to control the visibility of that metadata as a core part of the ecosystem. And then finally, in order to make it run anywhere, we need to make sure that all of its calls to external resources, such as platform resources, are customisable.
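As a minimal sketch of the multi-tenancy requirement, the idea is that one running service keeps each organisation's metadata in its own silo, keyed by a tenant identifier carried on every request. The class and method names below are my own illustration, not how Egeria actually implements its tenants.

```java
// Hypothetical sketch of multi-tenancy: one service, one isolated store per tenant.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MultiTenantMetadataService {

    /** One isolated key/value metadata store per tenant (organisation). */
    private final Map<String, Map<String, String>> storesByTenant = new ConcurrentHashMap<>();

    private Map<String, String> storeFor(String tenantId) {
        return storesByTenant.computeIfAbsent(tenantId, id -> new ConcurrentHashMap<>());
    }

    public void save(String tenantId, String key, String value) {
        storeFor(tenantId).put(key, value);
    }

    public String lookup(String tenantId, String key) {
        // A request can only ever see metadata belonging to its own tenant.
        return storeFor(tenantId).get(key);
    }
}
```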
And so we have a connector framework that's used for all pluggable code. The other thing is that we work in a very iterative way: we look at a particular scenario, and we hope to attract people to come and work with us on the problem and bring new use cases. Through that process we develop new services that enable each use case. The result is that we have actually created a huge number of innovations in this space, just because we're taking a different view on the way that metadata is managed. I'm going to cover a number of those as we go through, which is why I'll skip that chart quite quickly.

So let's look at these innovations. Let's first think about what it means to be peer to peer in terms of the protocol. Basically, each technology builds a connector, or two connectors depending on whether the exchange is bidirectional or unidirectional. The connector is a translator: it translates between the open APIs of Egeria and the specific API of the tool, both inbound and outbound. If the tool creates events when things happen, like its metadata changing, then we listen to those events and translate them into open Egeria events. And that's all the technology has to do to send and receive the metadata that it understands. We don't send it metadata it doesn't understand, because that would be pointless. There are specialised interfaces for particular types of technology to make that integration as simple as possible for each of them. And as I say, once it's connected, Egeria does the rest, and it doesn't matter how the other technologies it's sharing metadata with connect into the ecosystem; that tool will receive the metadata it needs.

Now, you might think, well, that's easy, everybody knows how to send stuff over the internet. But actually it's not just about transferring bytes, it's about transferring knowledge. That means that if one tool shares just the details of a database column, we know it's a database column and only pass it to tools that are interested in database columns. We need to understand and preserve the meaning of the metadata as we exchange it. The trick is that there are huge impedance mismatches in the capabilities of each technology, the names they use, and the granularity of the metadata they work with. We have to accommodate all of that in Egeria so that each tool, as I say, is able to operate at its maximum capability without being dumbed down to the lowest common denominator.

So what do we have in Egeria? We have a common language, and the associated data structures, to allow that metadata to flow over the network, and there are protocols that explain exactly how and when metadata should flow. We then provide integration points that host the connectors to specific technologies. The aim is that all you have to do is write the translator, and we will manage the hosting of your connector: starting it, stopping it, and managing the restart process. One of the things about these environments is that the technologies are not all available at the same time, and we need to be able to handle a restart when a particular technology has been unavailable for a while, or has been upgraded, or that type of thing.
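A minimal sketch of the "connector as translator" idea follows, using hypothetical, simplified types; the real Egeria connector interfaces are richer than this, so treat it only as an illustration of the shape of the job each technology has to do.

```java
// Hypothetical sketch: a translator connector maps a tool's native metadata and events
// onto open metadata equivalents, and back again. Not the real Egeria interfaces.
import java.util.List;
import java.util.Map;

/** Simplified stand-in for an open metadata element. */
record OpenMetadataElement(String typeName, Map<String, String> properties) {}

/** Simplified stand-in for an event on the open metadata topic. */
record OpenMetadataEvent(String eventType, OpenMetadataElement element) {}

/** What a technology has to supply: translation in and out of its own API and event format. */
interface ToolTranslatorConnector {

    /** Outbound: read metadata from the tool and express it as open metadata elements. */
    List<OpenMetadataElement> exportToolMetadata();

    /** Inbound: apply an open metadata event to the tool, ignoring element types it can't use. */
    void applyOpenMetadataEvent(OpenMetadataEvent event);

    /** Translate a native tool event (for example, "schema changed") into an open metadata event. */
    OpenMetadataEvent translateToolEvent(Object nativeToolEvent);
}
```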
So the whole ecosystem has to be extremely robust in handling versioning and backward compatibility, because things will not all be upgraded together in such a complex environment. If we looked at it from the outside, Egeria would be the fluffy cloud in the middle, with the connectors doing that synchronisation through its different APIs to the different types of tools. Now, of course, life is not simple, and systems are distributed across lots of different platforms and different data centres within the organisation, right out to IoT devices at the edge of the network, and even onto our mobile phones, which are themselves computing platforms, collecting data, making decisions and also distributing data. So we need to be able to work in a wide variety of environments. The way you operate with Egeria is that you put Egeria in all of these locations, and Egeria connects to itself and manages the exchange. We can be connected in a live way to allow a real-time exchange. Or it might be that you have two business partners sharing data who want to share metadata too, and so the open formats can be used to create an archive of metadata associated with the data that's moving between the organisations. Because it's in the open format, the receiving organisation can pick up that metadata and understand its classifications, meaning, terms and conditions, all the things that allow a trusted exchange of data between the partners without connecting their metadata repositories together.

So here's another view of the big picture. The blue boxes represent what we call the OMAG (Open Metadata and Governance) Server Platforms, each deployed into a particular location. The orange circles are servers that have been configured on the platform. The servers are effectively a hosting environment for a particular type of connector that talks to a particular type of technology. Again, this is all about making it as simple as possible to connect an individual tool, and pushing the complexity of the integration into the blue area, the platform itself, so that everybody doesn't have to write the same logic.

I talked about the connector framework: this connector framework is used for every type of plug-in logic, and it handles the loading, starting and stopping of that logic. Then there's a specialist API for each type of connector, so we talk about a repository connector, an integration connector, an audit log connector, and so on. There are lots of different types, and each adds the specialist interface to allow the appropriate flow of metadata across it.

The different servers (I showed a random collection of blue lines, but they are actually configured in a very specific way) have a central core which I've labelled as integrating metadata. This is the peer-to-peer sharing of metadata between metadata repositories, and it's really for tools whose primary purpose is to manage metadata; a data catalog would be an example of that type of tool. They connect through a Kafka topic and share configuration information, which is how they self-configure, and then they are able to exchange metadata peer to peer and issue federated queries to each other in a fair manner.
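Here is a minimal sketch of the hosting job the platform does for every pluggable connector: load the class named in the configuration, initialise it with its connection properties, and manage start, stop and restart. The class names are hypothetical, not the actual Open Connector Framework signatures.

```java
// Hypothetical sketch: the platform, not the connector author, owns the connector lifecycle.
import java.util.Map;

abstract class PluggableConnector {
    protected Map<String, Object> connectionProperties;

    void initialize(Map<String, Object> connectionProperties) {
        this.connectionProperties = connectionProperties;
    }

    abstract void start() throws Exception;   // begin exchanging metadata with the tool
    abstract void stop();                     // release resources on shutdown or restart
}

class ConnectorHost {

    /** Load the implementation named in the configuration and start it. */
    PluggableConnector load(String implementationClassName,
                            Map<String, Object> connectionProperties) throws Exception {
        PluggableConnector connector = (PluggableConnector) Class.forName(implementationClassName)
                .getDeclaredConstructor()
                .newInstance();
        connector.initialize(connectionProperties);
        connector.start();
        return connector;
    }
}
```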
That's all great for collecting and sharing metadata, but if we really want to use it and push it to lots of different types of tools, we then have governance servers that enable that two-way exchange, and that brings in extra types of tools. But we also need to bring in people, and so the view servers provide REST APIs for different types of UIs, allowing integration into user-oriented tools and also allowing Egeria to provide its own user interfaces. This chart just shows the different types of connectors in a type hierarchy. You can see the cohort members, which are in that central group; the view servers, obviously supporting the user interfaces; and then the governance servers, of which there are different types doing different roles. The two main ones are the integration daemon, which is responsible for the exchange of metadata, and the engine host, which actually runs governance engines such as a discovery engine for profiling data and a stewardship engine for handling any errors, or work that requires a steward to make a decision about what should happen in a particular situation.

I talked about the platform and said it was very scalable. This platform can run on something as small as a Raspberry Pi. It can run standalone, a single platform providing support for all the different types of servers that a particular organisation needs. Or a server can be stretched over multiple platforms to give a highly available environment, where updates can be rolled across the different platforms and the servers can stay running even if a platform is out for a period of time.

Now let's talk a little bit about what's going on inside the cohort. The cohort, as I said, is peer-to-peer sharing between metadata repositories. Here we have three different metadata repositories from different parts of the organisation sharing what they know about, say, the data in the organisation. They dynamically register, and once they're connected they can see each other's metadata. It's also possible to join multiple cohorts. An organisation may use cohorts to group together the tools in different divisions, but there are always corporate-level, enterprise-level teams, and they can connect to multiple cohorts to get an overall enterprise view of the metadata.

The process of setting up the cohort, as I said, is dynamic. Here we have a server that has added a registration document to the topic for the cohort, which we call the OMRS topic. And that's it; it just sits and waits. Then another server joins and adds its registration document, which is picked up by the first server, and they have an exchange. What they're exchanging is configuration, because once that's over they can issue queries to each other's repositories. You can see that in this picture: server one, the pink server, is serving up blue metadata, and server two, the blue server, is serving up pink metadata. So they're able to issue queries to each other, and that gives you a very real-time integration. What's also going on in the background is that there's an opportunity to store metadata from other repositories so that it's always available. Or it might be that some metadata is used all the time and rarely changes, so it's actually more efficient to store it locally. That also goes on in the background, creating a more robust integration between the technologies.
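To illustrate the dynamic registration just described, here is a minimal sketch: each member posts a registration record to the shared cohort topic, remembers the peers it sees, and can then fan a federated query out to every known peer. The types and fields are hypothetical simplifications, not the real OMRS registration structures.

```java
// Hypothetical sketch of cohort registration and federated query fan-out.
import java.util.ArrayList;
import java.util.List;

record CohortRegistration(String serverName, String metadataCollectionId, String queryEndpoint) {}

class CohortMember {

    private final CohortRegistration self;
    private final List<CohortRegistration> knownPeers = new ArrayList<>();

    CohortMember(CohortRegistration self) {
        this.self = self;
    }

    /** Called whenever a registration record arrives on the cohort topic. */
    void onRegistrationEvent(CohortRegistration peer) {
        if (!peer.metadataCollectionId().equals(self.metadataCollectionId())) {
            knownPeers.add(peer);   // remember the peer so queries can be routed to it later
        }
    }

    /** A federated query targets the local repository plus every registered peer. */
    List<String> federatedQueryTargets() {
        List<String> targets = new ArrayList<>();
        targets.add(self.queryEndpoint());
        knownPeers.forEach(peer -> targets.add(peer.queryEndpoint()));
        return targets;
    }
}
```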
There are also many technologies that aren't able to use a federated query. The synchronisation in the background gives them an opportunity to receive metadata from another server, so that their user interfaces, which can only pull metadata from their own repository, can still see external metadata.

Now, what's going on inside is that we break the metadata down into small nuggets, or atoms, of knowledge. An entity is knowledge about a thing or a concept. We have relationships that link entities together, so we can say this database column is linked to this glossary term and therefore has this meaning. And we can also add classifications on top: these are extra pieces of information that group together things that are related. So you might say that a particular glossary term represents data that's confidential or personal information, and that would be a classification attached to the entity. All of these things have properties that can be stored with them, and it's not required that they're all stored in the same repository.

So here we have the two servers again. One of them has a database column stored in it and the other has a glossary term. Now we want to create a relationship between them, and we have a number of choices. We can move a read-only copy of the glossary term to server one, and server one creates the relationship. That's all well and good, as long as server one actually supports glossary terms and meaning relationships. But if it doesn't, we can pull read-only copies of both of those pieces into a third server and create the relationship there. And even though the database column, the term and the meaning relationship are in completely different servers, when you issue a query to Egeria, the result is returned as if they were all located together. We can also add extra classifications, so in orange here a confidential classification has been added. The fact that we can do this in a third server means that we can add governance capabilities and knowledge to metadata from tools that don't support governance, and so we're effectively using Egeria to extend and manage the capabilities of the tools that are currently deployed.

We cover a very wide range of metadata, and this set of types, about 500 now, is sourced from the very high-quality metadata standards that exist. These standards are, as I say, very high quality, but they're limited in scope and don't cover the full range of metadata needed by an organisation. So what we've done is go through these standards and stitch them together: we extracted the concepts, compared them, and worked out where the overlaps are, so that we can pull metadata from one standard and, where we have overlaps, share it out to another standard, because we're able to provide that translation. We also fill the gaps, because there are gaps between the different metadata standards, which tend to focus on a particular type of asset or a particular type of governance. So in each place we have very good standards, and what we've done is link them together in our types. Here you can see the kind of linking that's being done: not only are we linking technical tools together, but there are different governance tools covering different types of governance, like privacy, security, IT infrastructure management and software development.
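Here is a minimal sketch of those metadata "atoms", using simplified, hypothetical types rather than Egeria's real repository classes: entities carry properties and classifications, and a relationship refers to its ends by unique identifier, so each end (and the relationship itself) can live in a different repository.

```java
// Hypothetical sketch of entities, relationships and classifications.
import java.util.List;
import java.util.Map;

record Classification(String name, Map<String, String> properties) {}

record Entity(String guid, String typeName,
              Map<String, String> properties, List<Classification> classifications) {}

/** A relationship stores only the GUIDs of its two ends, so each end may be hosted elsewhere. */
record Relationship(String guid, String typeName, String entityOneGuid, String entityTwoGuid) {}

class MetadataExample {
    public static void main(String[] args) {
        Entity column = new Entity("col-123", "RelationalColumn",
                Map.of("displayName", "customer_address"), List.of());

        Entity term = new Entity("term-456", "GlossaryTerm",
                Map.of("displayName", "Customer Address"),
                List.of(new Classification("Confidential", Map.of())));

        // The relationship giving the column its meaning can be stored in a third server;
        // it only needs the GUIDs of its two ends.
        Relationship meaning = new Relationship("rel-789", "SemanticAssignment",
                column.guid(), term.guid());

        System.out.println(meaning);
    }
}
```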
These are all, again, very well-established governance processes with independent tools, but in today's world we need to bring them together, and so Egeria integrates governance as well as integrating metadata about assets and people.

I talked about there being specialist APIs. In the centre, in the cohort, everything is expressed in that language of entities, relationships and classifications, but that's quite complex for somebody who understands, say, ETL engines but not the bigger metadata piece. So we provide access services that give simplified interfaces to different types of tools. This is a summary of the access services we have at the moment, and they cover all parts of the metadata we're talking about, from governance through to the lineage of assets. We then support the governance servers with different styles of integration to match the capabilities of each technology, from technology that just provides open APIs we can connect to and poll to monitor its changing metadata, to technology that actually creates events and gives us a notification every time metadata changes, which we can then capture and share through the open metadata ecosystem.

Here's an example of this in action. The first integration daemon is extracting metadata from different data sources: there's a database server that it's pulling schemas from, and a file system where it's picking up knowledge of the files being added over time. These are being catalogued into the metadata server in the middle. As that metadata server receives new metadata, it creates events through the cohort, which are picked up by another server working with a second integration daemon, one of the governance servers. That daemon is automatically creating database views over the data sources that were captured by the first server. Those views mean that the data virtualization engine can be used as a single access point for the data, where security and auditing can be applied to monitor what type of data is being pulled by different people. And because the virtualization engine is also a data source, those views are then recatalogued back into the metadata server. Our visibility control, called governance zones, which I'll talk about in a moment, can make sure that anybody who searches the catalog sees just the views in their search results. When they click on one, they go through the data virtualization engine to the real data. This means, as I was saying, that the virtualization engine can filter the data that's returned and apply different types of rules so that people see only what they should see. And the beauty of this approach is that there's no manual configuration from one end to the other.

Now, I mentioned these zones. If you think about the fact that we're pulling knowledge together from lots of different parts of the organisation, this knowledge, as we link it, becomes in some cases almost more valuable than the data it represents. So we need to create those silos again, but in a virtual space, and we have these things called governance zones. These are effectively sets of metadata for different purposes, or that have come from different places, or are owned by different people. You can have as many zones as you like, organised along different lines, and use them together, so a particular asset can belong to multiple zones.
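Here is a minimal sketch of the governance zone idea, with hypothetical names; as I explain next, zone membership is just a tag on the asset, and each interface is configured with the zones it supports, so a search only returns assets that sit in at least one supported zone.

```java
// Hypothetical sketch: zone-based visibility filtering of search results.
import java.util.List;

record Asset(String guid, String displayName, List<String> zoneMembership) {}

class ZoneFilteredCatalog {

    private final List<String> supportedZones;

    ZoneFilteredCatalog(List<String> supportedZones) {
        this.supportedZones = supportedZones;
    }

    /** An asset is visible through this interface if it belongs to any supported zone. */
    boolean isVisible(Asset asset) {
        return asset.zoneMembership().stream().anyMatch(supportedZones::contains);
    }

    List<Asset> filterSearchResults(List<Asset> allMatches) {
        return allMatches.stream().filter(this::isVisible).toList();
    }
}
```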
The way we do it is that the zone membership for an asset is just a tag on the asset, so it's very, very cheap. We're not moving the asset around; we just update the tag to change the zones it belongs to. Through that, we can do an awful lot of governance management simply by controlling the visibility that a particular tool, or particular users, have of the metadata. We do this because each API has lists of zones. The default zones are the zones an asset is placed in when it's created. The supported zones control which assets can be seen through that interface. And the publish zones are used when an asset is ready to be made usable by the business: they list the zones the asset's membership should be updated to in order to enable broader access. This gives us a very simple way of limiting what individual teams can see when they access the metadata through a particular interface. That gives you a coarse-grained control, and then we have a very fine-grained view of security that allows us to go right down to a single instance of an asset or a connection, for example, and apply specialist rules to those objects as people request access.

The other thing that's very important is that if anything goes wrong, we need to know where metadata came from. So every piece of metadata has an audit header in it that gives us information about the originator of the metadata, and that is also used to control who has permission to update it across the ecosystem. So transparency, and the ability to trace the origin of metadata, is a key capability of Egeria.

Finally, as I said, this technology needs to be deployed in many different places, and we need to make sure that its operation is fully automatable. The diagnostics for Egeria go through something called the audit log framework. All the messages coming out of the servers are fully externalised, with a description of what happened in the server and what you have to do to resolve it, and it's possible to translate the messages so that they can operate in different languages. All of the inserts from the messages are also available in the log record, so that different types of processes can be used to automatically manage the Egeria environment.

I talked about user interfaces. Obviously different tools are going to make use of the federated queries and the enterprise view of metadata, but we also provide some user interfaces of our own. This is one of them, called the Repository Explorer, which allows you to step through and create reports of metadata. It works using the federated queries, so it works across the different repositories in the cohort, and you get a single view of your metadata as you're browsing. There's another user interface that allows you to create a view of all the different technologies being integrated by Egeria, and another that allows you to explore the types of metadata that a particular server supports.

And so finally, why do vendors work with us? Well, we've been seeing some very common patterns. Sometimes a vendor has created a new platform but still has customers on the old platform, and linking them together and allowing them to share metadata helps organisations migrate onto the new platform.
Or it might be that many versions of a single product are deployed across the enterprise and they want to link them together; or they want to take the cost of integrating with different technologies off their platform and bring it into open source; or they want access to new types of metadata, or to push their metadata to other technologies to control how they're managed. So there are many different use cases coming up across the vendor landscape as we work with different companies.

As for our development status: as I said, we're a very iterative, innovative community, and the Egeria technology is very modular. That big purple area is the developer platform, which is where most of the work has been done, because we want to be a toolkit that helps people integrate tools as effectively as possible. But we also want to provide a lot of function out of the box, and so the integration platform, this blue area, is the pre-built, pluggable material that you can deploy. Education is extremely important, not just in how to use our technology but in how to create a governed environment for a digital enterprise. And then on top of that, we're building specialist solutions that help people manage their enterprise and their digital ecosystem by making use of the integrated enterprise view of metadata that Egeria brings.

Everything is done in plain sight, so you will see a constant stream of new modules that go through development, then technical preview, then released, and eventually on to deprecated; not that we've got anything deprecated yet, but that's the process. Here I've got a screenshot, and you can see there are modules that are completely finished, but there are also modules flagged as in development, and they move from "here's the API" to some use cases working, and then, as they mature, to released status. Our current status is shown here: green means it's at least a technical preview, orange means it's actively being developed, and red means it's a concept at the moment and we're looking at doing some initial design work on it. Our focus is really around integration at the moment, and on enabling lineage and the higher-level governance functions as well.

So that's Egeria. I hope this has been an interesting session. We are a very active community, and we're looking for more people to work with us, whether you have use cases, or you're interested in doing some interesting development work, or you're a writer, or you love doing developer advocacy. We have so much to do, and we would love to hear from you. So thank you very much. And that is the end of the presentation.