Hello, my name is Mandy Chessell and I'm the speaker for today's presentation on a project called ODPi Egeria. We're going to look at how the technology provided by ODPi Egeria helps organizations become more integrated and better governed.

Let's start with an example of the type of problem we're trying to solve. Many organizations, particularly those in regulated industries, find it very helpful to capture the vocabulary they use in the business and the data associated with it. By vocabulary we mean the words used to talk about what's going on in the business, such as customer, customer order and invoice. If we understand exactly what each term means and how the terms relate to one another, then we understand the language of the business, and typically that language has data associated with it. For example, if you are supporting customers, you often keep records about customers and associate that information with the particular orders they're making or invoices they're paying. So the vocabulary gives you a very clear idea of the type of data you're going to be working with.

Now, because this is written in the language of the business, experts from the business can attach extra knowledge about how data of that type should be managed. For example, credit card information comes with a certain set of regulations about what can be stored and what has to be gathered from the cardholder each time they make a purchase. So we can attach the rules, regulations and classifications for particular types of data to the vocabulary. One advantage of this is that it creates a conversation in the business about how data should be managed.
It also helps the IT team capture requirements around the way they have to manage data. But we can do so much more once we have this encoding of a business's vocabulary.

Think of a developer building a new application: they're probably going to build an API, and maybe a database with a particular schema structure. From the vocabulary we can generate schema structures, that is, descriptions of which fields are needed, what type they are and how they're arranged. If that's integrated into the developer's tool, imagine them creating an API called "create customer". If, from their tool, they can say "give me the schema for a customer" and have that embedded into their new API, it gives them a tremendous advantage. Firstly, they haven't had to type in all those fields, and there could be quite a lot of them. Secondly, they are sure they've got it right, because it has come from the vocabulary. So they're happier, they reduce their rework, and their work is much faster.

Now imagine they've built their API and the database that sits behind it. It's all finished and tested, and it's moving into the DevOps pipeline to go through the final set of testing before it's brought into production. Because it's been built with schemas derived from the vocabulary, those schemas can carry markers that can be read by any subsequent processing. So as the new service goes through the DevOps pipeline, the logic in the pipeline can look at these markers and say: wait a minute, sensitive personal data is going through this particular application, we need to do some extra testing.
The whole service now needs to be put onto a particular secured gateway, and the database needs to be encrypted. All of those rules about how the new service has to be deployed can be enacted by the DevOps pipeline, because it can see what type of data is supported by that particular service. So by linking these tools together, we have sped up the process of building the new service: the developer is working faster, and the DevOps pipeline is able to ensure proper deployment of the service into production.

Now time passes and the application is running very successfully. It is generating interesting data, and the business thinks: we can probably serve our customers better if we know a little bit more about them. So a data scientist is commissioned to do some analysis on the customer data, to see if there are any trends or other insights they can get from the application's data. Within that data there are obviously interesting things to look at, but there's also personal data that is really of no interest to this particular type of analysis. Because the application came from a marked-up schema and was cataloged as part of the DevOps pipeline delivery, it's very easy for the data scientist to locate where the data is: the catalog tells them exactly what's in it, so they're not having to ask people or try to guess. And as a copy of the data is brought into the data science tool so they can experiment with it, additional governance rules, such as stripping out things that are inappropriate for that type of analysis, can happen automatically, because we understand the data.

And this story can go on, the more we connect tools together and take advantage of the knowledge that's been encoded by experts and by people focusing on particular aspects of the system.
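As a concrete illustration of the flow described above, here is a minimal sketch of a business vocabulary driving schema generation, with the classifications carried along as markers that a DevOps pipeline step can act on. All names and structures here are invented for illustration; they are not Egeria's actual APIs.

```python
# Hypothetical glossary: each term lists its fields and the governance
# classifications attached by business experts.
GLOSSARY = {
    "customer": [
        {"field": "name",        "type": "string", "classifications": []},
        {"field": "email",       "type": "string", "classifications": ["personal-data"]},
        {"field": "card_number", "type": "string", "classifications": ["personal-data", "pci"]},
    ],
}

def schema_for(term):
    """Generate a schema structure from the vocabulary, keeping the
    classifications as markers for downstream processing."""
    fields = GLOSSARY[term]
    return {
        "title": term,
        "fields": {f["field"]: {"type": f["type"]} for f in fields},
        "markers": sorted({c for f in fields for c in f["classifications"]}),
    }

def pipeline_actions(schema):
    """A DevOps pipeline step inspects the markers and decides which
    extra deployment rules apply to the new service."""
    actions = []
    if "personal-data" in schema["markers"]:
        actions.append("route via secured gateway")
    if "pci" in schema["markers"]:
        actions.append("encrypt database")
    return actions

schema = schema_for("customer")
print(pipeline_actions(schema))
```

The point of the sketch is that neither the developer nor the pipeline author re-enters the governance knowledge: it flows from the vocabulary into the schema, and from the schema into the deployment decision.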
The more of this knowledge flow there is, the higher the value the business gets from the work people are doing. And this is the role of ODPi Egeria, or just Egeria, that we're focused on in this open source project. It's all about enabling that flow of knowledge between people, processes, tools and engines, however you want to describe them. It's particularly important when a business wants to be agile: it needs to change the way it's working, it needs to embrace new types of technology, and it wants its staff to be as enabled as possible, particularly when it comes to data. So it's extremely important to businesses in today's world.

Now, if you've been around this industry a long time, which I have, you're probably thinking: haven't there been a lot of attempts to do this type of work in the past? And you'd be completely correct. This is such an important piece of capability that we have done lots of things over the years to try and make it happen. But now we're at a point where open source is very widely accepted and widely deployed, and there's a lot of recognition that core infrastructure should be open, shared and used across multiple vendors. So we're at a perfect time to take this idea and, through open source and open governance of that open source project, move the whole process forward, so that the ability to flow metadata between different tools becomes an industry standard.

What's different about Egeria is that we focus very much on being open, not just in the code itself and the way the code operates, but in the way that we operate, to make sure that many people have a say in how it's developing. We also think about the fact that there are many places and environments where this technology needs to work, and that there may be multiple vendors involved, of different sizes, as well as open source projects.
We need to make it fair. It's not enough that the technology is open; the way we work must be open too, and the architecture must allow multiple participants to deliver value to the organization running the software.

The other thing is that it's a very big problem. So, as a team, we work in a very iterative way, looking at the problem holistically, working through different use cases, and making sure that everything we do is very visible so that people can provide feedback at all stages of the design and development. The result has been very positive: we have managed to create some quite interesting innovations in this space as we break down the traditional silos between different tools, and I'm going to show you a number of these in the time we have left in this session.

The very first part of this is that we're not planning on creating a new mega database of metadata. Each tool continues to use its own metadata repository, but we provide the ability for them to exchange metadata in a peer-to-peer way. We also recognize that this may be deployed into an environment where there isn't a very large IT team to run it, so it's designed to be self-configuring, and to allow each tool to deliver its maximum value while making up for any deficiencies a particular tool may have in its ability to work with metadata.

Now, on a picture like that it all looks very simple, but actually there's an awful lot of devil in the detail: different tools support different subsets of information, and there are huge mismatches in the granularity, the terminology used, and the ability to maintain and manage the integrity of the metadata as we exchange it. And of course they will be using different technologies. So we need to make sure that Egeria is able to fill the gaps wherever needed.
So what we have built is a common language for metadata, that is, the types of metadata that need to be exchanged, along with the definition of the structures and protocols to do that exchange. We also provide a lot of the core middleware-style implementation that makes it as easy as possible for a particular piece of technology to be integrated into the bigger ecosystem.

If we looked at this from the outside, you would see Egeria as the blue cloud in the middle, providing linkage for each tool. Each tool connects into Egeria through a connector and is able to send and receive metadata. Egeria then takes what's coming in and distributes it to the places that need it. So for each tool, the effort is just one connector to translate in and out of the open metadata types and interfaces, and then everything the tool needs is brought to it by Egeria.

Again, it looks nice on this simple picture, but the environment it's going into is highly diverse. You can imagine technology running on multiple different types of clouds, in on-premises data centers, and right out into the Internet of Things, right out into the environment. All of these software components are exchanging data and performing processing on data in this distributed manner. So for Egeria to be successful, it needs to be where the data and the processing are. Egeria in this picture is the orange, and you can see we have pushed it to all of the key places; where it's located, it communicates with itself.

We also support the fact that nowadays businesses share data, and they might not want to connect their metadata repositories and their tools together.
So there's an import/export format for metadata that allows a business partner to share data and the metadata that goes with it, which could be the classifications, terms and conditions associated with the data, along with all the delivery descriptions and things like that. So not only is there a live exchange of metadata within the organization, but also an import/export mechanism that allows metadata to flow between business partners in a disconnected way.

Here we're starting to show that Egeria is deployed into lots of different environments, represented by the green clouds. Egeria itself consists of a platform for hosting the connectors. Each of the blue boxes in this picture shows the Egeria platform sitting in one of the different environments. The blue arrows show Egeria doing the exchange between its own deployments, and the yellow and orange arrows are the exchanges that the connectors are managing to the specific technologies being used.

I keep using the word connector: the base of Egeria is a connector framework that allows us to integrate our runtime into different platforms, and also to allow connectors to third-party technologies to be integrated into the Egeria servers. That's how we basically plug things into Egeria. But applications can also use our connector framework to connect to different types of data resources or services, with an additional method that allows them to access the metadata that is equivalent to the data or service they're accessing. So applications can use metadata directly through the connector framework, as opposed to having Egeria push metadata into the applications as needed.
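The connector idea can be sketched as follows: one connector per technology, translating to and from open metadata, plus the extra method that lets an application ask for the metadata describing the resource it is using. This is a hypothetical Python sketch; the class and method names are invented and are not Egeria's real connector framework, which is implemented in Java.

```python
class Connector:
    """Base class pairing access to a resource with access to the
    metadata that describes that resource."""
    def __init__(self, endpoint, metadata_store):
        self.endpoint = endpoint            # where the resource lives
        self.metadata_store = metadata_store  # stand-in for the metadata ecosystem

    def read(self):
        raise NotImplementedError

    def get_connected_asset_properties(self):
        # The additional method mentioned above: metadata equivalent to
        # the data or service this connector is attached to.
        return self.metadata_store.get(self.endpoint, {})

class CsvFileConnector(Connector):
    def read(self):
        # Illustrative only; a real connector would open self.endpoint.
        return "col1,col2\n1,2"

# An application uses one connector for both the data and its metadata.
store = {"orders.csv": {"type": "CSV file", "owner": "sales"}}
conn = CsvFileConnector("orders.csv", store)
print(conn.get_connected_asset_properties()["owner"])
```

The design point is that the application never talks to a separate metadata API: the same connector object answers both "give me the data" and "tell me about the data".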
In that picture we talked about these connectors running on the platform, but actually the connectors run in what we call a server. A server is a configured virtual runtime that sits on top of the platform, and a platform can support multiple servers at any one time. There are different types of servers: they support different types of connectors and perform particular roles in the ecosystem.

In the center we have the metadata repositories: tools whose job is to maintain a database of metadata, perhaps with other governance services around it, catalog search APIs, that sort of thing. The first thing we need to do is to take those rich sources of metadata and allow them to be exchanged between the different metadata repositories. This is what we call the integrated metadata part of the solution; it's the core of what's going on.

There are lots of other tools that use metadata, but it's not their main job. Think of a data processing engine, or a database: a database of course has a lot of metadata, it has a schema in it, but its real job is storing data. We need to connect and exchange metadata with all these tools that use metadata, but it's a simpler integration, because these tools treat metadata as a means to an end rather than their main job. This is where the governance servers come in.

So what we're saying is that the central core is the knowledge base for metadata, and then we need to actively exchange metadata with the tools that are using it, so that not only can we gather knowledge about the different resources being created, but we can also push metadata out to configure those technologies so that they operate in a consistent and compliant manner.
Finally, we need to bring people into the story. So we have the view services, which allow all of this integrated technology to be brought together into a solution. These are services designed for user interfaces, very much focused on enabling humans to be part of the bigger ecosystem. Although it's important to remember that most of the user interfaces people will deal with in this ecosystem actually come from the tools that are integrated into it.

This is just another picture showing those different types of servers and how they are grouped and organized in our internal architecture. And here we're talking about different deployment approaches for the servers and the platform. As I said before, Egeria has to run in a wide variety of environments. The core platform that is deployed into a particular environment can run on something as small as a Raspberry Pi, or it can be scaled across a large Kubernetes cluster, allowing rolling updates and high availability through that type of clustering.

The platform itself allows multiple servers to run. You could run all the different types of servers you need for your organization on a single platform. Or it might be that you're a software-as-a-service type vendor and you want to run a different server for each of your customers: they can sit as virtual servers on a single platform or on a highly scalable Kubernetes deployment. So there's a lot of flexibility in the way you can deploy the platform, and as a result, in the way you can configure and set up the integration environment you need for your tools.

So let's start looking at what's going on behind the scenes, because I've talked about the fact that metadata is being exchanged and it's all slightly different in each tool.
That core piece I talked about, the integrated metadata, the exchange of metadata between metadata repositories, happens in what we call a cohort. A cohort is a collection of peers, and that peer-to-peer exchange is at the heart of Egeria's integration. Different tools can connect into the cohort, and they then have visibility, security allowing, of all of the metadata in all of the peers, the members of the same cohort. It's also possible, particularly where you have servers that want to serve multiple groups as a sort of corporate-level service, for a server to join multiple cohorts; it will then see the superset of the metadata from the cohorts it joins.

What's happening under the covers is that the whole cohort is configured automatically. Here we've got a metadata server, server one, shown in pink, and it wants to join the cohort, so it puts a registration document onto a Kafka topic, which we call the OMRS topic. Then a second server joins, server two, the blue server, and it puts its registration document onto the topic. They each receive the other's registration and have a negotiated exchange: knowledge of the types of metadata they each support, where they're located in the network so they can call one another, and checks that they are compatible to exchange metadata.

Once that's complete, they're able to call one another, issuing queries that combine metadata from the other server with their own. Here we see pink metadata appearing in server two and blue metadata appearing in server one. You might leave it at this, but you can also set it up so that, in the background, certain types of metadata are replicated to create copies in different repositories. This is useful to increase availability, or if you have a server that's not able to do federated queries and needs to have all the metadata it's offering to its users in its own database.
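The self-configuring handshake described above can be sketched like this: each server publishes a registration document to a shared topic and builds its view of its peers from what it reads back. The topic is simulated here as a plain list, and the document fields are invented for illustration; the real cohort exchange carries far more detail.

```python
cohort_topic = []  # stand-in for the Kafka-based cohort (OMRS) topic

class MetadataServer:
    def __init__(self, name, url, supported_types):
        self.name, self.url = name, url
        self.supported_types = set(supported_types)
        self.peers = {}  # what this server has learned about the cohort

    def join(self, topic):
        # Publish a registration document announcing who we are, where we
        # are on the network, and which metadata types we support.
        topic.append({"name": self.name, "url": self.url,
                      "types": sorted(self.supported_types)})

    def process_registrations(self, topic):
        # Read every registration except our own and record the peer's
        # location and capabilities so we can call it later.
        for reg in topic:
            if reg["name"] != self.name:
                self.peers[reg["name"]] = reg

server1 = MetadataServer("server1", "https://host1:9443", ["GlossaryTerm"])
server2 = MetadataServer("server2", "https://host2:9443", ["RelationalColumn"])
for s in (server1, server2):
    s.join(cohort_topic)
for s in (server1, server2):
    s.process_registrations(cohort_topic)

print(server1.peers["server2"]["url"])
```

After the exchange, each member knows how to reach every other member and what it can usefully ask for, with no central administrator configuring the connections.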
So Egeria combines the use of federated, or distributed, queries with the ability to do replication in the background.

The way we store metadata is to break it down into small nuggets of information, each owned by a single server. We have the notion of entities, which are information about a thing; relationships between entities, showing how they're tied together; and classifications, which are the way to augment a particular definition. So we can have a definition of a credit card number and add a classification to say, for example, that this credit card information has to be kept for seven years. That's the role of the classification.

What's going on under the covers, so to speak, is that we are shuffling these nuggets around so that we can link them together. For example, here we've got a description of a database column in one server and a description of a glossary term, a vocabulary definition, in another, and we want to link them together. It could be that we shuffle the glossary term into server one and connect the two there. Or it might be that server one doesn't support glossary terms and server two doesn't support database columns, in which case we can bring a third server into the cohort and make the connection there. It actually doesn't matter which we do: when we issue a query, we will get all three pieces back together, as if they were stored in one repository. This means we can augment the capabilities of the tools and engines connecting to the cohort with additional governance function that none of them support, because we are able to store that metadata in an Egeria repository.

Looking at the needs of many companies in terms of supporting regulations and their need for completely digital operation, we've come up with a broad range of metadata that is needed: about 500 different types.
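The entity/relationship/classification model and the federated query over it can be sketched as follows. The structures are heavily simplified stand-ins for illustration: three repositories each own one nugget, and a query assembles them as if they lived in one place.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    guid: str
    type: str
    name: str
    classifications: list = field(default_factory=list)  # augmenting markers

@dataclass
class Relationship:
    guid: str
    type: str
    end1: str  # guid of one linked entity
    end2: str  # guid of the other linked entity

# Three repositories, as in the example above: a database column in one,
# a glossary term (with a retention classification) in another, and the
# relationship linking them held by a third.
repo1 = {"entities": [Entity("e1", "RelationalColumn", "CARD_NO")],
         "relationships": []}
repo2 = {"entities": [Entity("e2", "GlossaryTerm", "Credit Card Number",
                             ["Retention: keep for seven years"])],
         "relationships": []}
repo3 = {"entities": [],
         "relationships": [Relationship("r1", "SemanticAssignment", "e1", "e2")]}

def federated_query(repos, guid):
    """Return the requested entity plus everything linked to it,
    regardless of which repository owns each piece."""
    entities = {e.guid: e for r in repos for e in r["entities"]}
    linked = [rel for r in repos for rel in r["relationships"]
              if guid in (rel.end1, rel.end2)]
    results = [entities[guid]]
    for rel in linked:
        other = rel.end2 if rel.end1 == guid else rel.end1
        results.append(entities[other])
    return results

found = federated_query([repo1, repo2, repo3], "e1")
print([e.name for e in found])
```

Because the query walks all repositories, it makes no difference whether the linking relationship lives in server one, server two, or a third server brought in purely to hold connections the others cannot.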
I'm sure over time this is going to grow, but this is our effective starter set, and the types link together to let you go from regulatory requirements through to specific implementations: the current state of those implementations, who's working with them, and how they're being used in the organization. This also includes lineage, which is very important to many organizations.

That whole set of types can be exchanged across the cohort, but we also provide higher-level interfaces to make it easier for different types of tools to connect into the ecosystem. Here you can see these white boxes represent the different types of APIs that we have, and their names give you an idea of the focus, the type of metadata that flows across those interfaces.

Then we need to integrate with the tools themselves; this is the role of the governance servers, and the tools have different types of capabilities. For some, like a database typically, we can call it, pull the schemas, and monitor the changing schemas within it. Other types of technology create events when things change, and we can listen for those events and use them. So we support a variety of different integration patterns, and they can be built up to allow us not only to automatically extract metadata from the different types of tools, but also to push metadata to other technologies so that they can be configured for different scenarios. Here, for example, we're building views over data as it's captured and brought into a particular data lake, allowing the virtualization engine to act as an access point where additional security can be applied to the calls from users of the data lake.
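The two integration patterns just mentioned, pulling metadata by calling a technology and reacting to events a technology emits, can be sketched side by side. The catalog and sources here are invented stand-ins, not Egeria's real integration services.

```python
catalog = {}  # the metadata gathered from the integrated tools

def poll_schema(source_name, fetch_schema):
    """Pull pattern: call the technology (e.g. a database) and capture
    its current schema; run periodically to monitor changes."""
    catalog[source_name] = fetch_schema()

def on_change_event(event):
    """Push pattern: react to an event the technology emitted when
    something changed, without having to poll it."""
    catalog[event["source"]] = event["new_schema"]

# Pull: a database we can call for its schema.
poll_schema("salesdb", lambda: {"orders": ["id", "customer_id", "total"]})

# Push: a tool that tells us itself when its schema changed.
on_change_event({"source": "crm",
                 "new_schema": {"contacts": ["id", "email"]}})

print(sorted(catalog))
```

Either way, the result lands in the same catalog, which is what lets the rest of the ecosystem treat pull-style and event-style technologies uniformly.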
There's also quite a focus on user interfaces in Egeria at the moment, particularly allowing people to connect to a repository and explore its content, to look at the types that are supported in each server, and to see how the servers are connected together in the ecosystem.

On the vendor side, we're working with a lot of household names. They're looking at it from their own internal perspective: linking together different versions of their products, or multiple deployments of a product, taking advantage of the integration with different technologies that Egeria provides, or expanding the amount of metadata they have access to. So there's a huge range of opportunities that Egeria can provide to a particular vendor, and to open source projects in general.

The way we operate is with a very modular architecture, and we've been working through building up increasingly sophisticated APIs to enable the integration. As I said before, we're very open in how we operate, so if you look at our Git repository you'll see there are some things that are released function and other things that are still in progress, at different levels of development. Each month we create a release, and whatever is ready is incorporated into that release. From here you can see the green areas are the released and tech-preview type function; orange means there's active development work going on; red means it's still in negotiation, so it's pretty much a paper exercise at this point as to what's in that particular function.

This picture is just showing roughly where we're focused: as you can see, a good focus on integration, on user interfaces, and on expanding the ability to capture lineage and those controlled vocabularies.
So what do we get from Egeria? An open source, distributed ability to connect together different tools, allowing an organization to be far more agile and to build and share knowledge far more effectively, even though it uses tools from lots of different vendors and open source projects. Thank you so much for listening, and I will hand over to the next talk.