Hello everybody, and welcome to this session, Becoming a Data-Driven Organization. My name is Mandy Chessell, and I'm the leader of an open-source project called ODPi Egeria. I'd like to talk a little bit about the background of this project, why we created it, and the value that can come from using it within an organization. The story starts back in 2013. I was working with ING, a global bank, and they were interested in building a data lake. Because they're a bank, they had a number of requirements: there are a lot of regulations around how they use data, and they were also very concerned that their customers' data wasn't lost, that they knew exactly what was happening to it, and that it was being used in an ethical way. We came up with an architecture for a data lake that addressed all the different requirements they had for data. It became a very successful architecture within the bank, and it has since been used by many different organizations around the world. I spend a lot of my time working with companies on their data strategy: how they make better use of data and ensure that it's properly governed. The experience we had with that data lake architecture resulted in a number of books, and it was very well received because it helped to guide people through all the decisions they needed to make. However, there was a problem with it: it was an awful lot of effort to build. If you look at a cross-section through the different technologies used within a data lake, at the base you see fairly standard commercial products (and you can use open-source products in those spaces as well), but an awful lot of what made up the data lake was actually hand-written components that the organization needed to build, plus an awful lot of integration between those components. Data lakes became very expensive to build, even when you had a fairly well-defined blueprint to create them.
The other thing we discovered was that as an organization becomes data-driven, everything changes. Projects are typically funded along application lines, but data-driven organizations need to fund the lateral flow of data between the data silos, and that creates a completely new way that resources need to be distributed around the organization. It becomes very important that development projects use real data to test and manage the services they're building. Regulations around data, such as the GDPR, actually start to combine the governance domains of the data privacy people, the data governance people, the IT infrastructure people, and the security people, so we need to start bringing together the policy metadata that they're all using. What's happening is that a huge number of tools used by many different professions need to be integrated, so that the knowledge created by one group of professionals can be shared, used, and linked to the knowledge of other professionals. That body of knowledge is what we call metadata. Having gone through this process with ING and a number of other very large companies, we wrote another book that basically explained why data lakes are so hard to create, and also why it's hard for an organization to become data-driven. One of the things in this book was the idea of a maturity model for an organization. It starts from "where is my data?", which is very basic data awareness. Then comes knowing how data should be governed, which is governance awareness. Then you start to embed that into all the technologies working with the data, which is embedded governance, and then you build up to allowing the business users to own the settings of governance.
So as they make changes to those settings, the data lake and the other processes that manage data change with them. And then finally there's something called data citizenship, a term that comes from Forrester: the idea that all employees are enabled with the data they need, and they work with data as a normal part of their operation. These different levels have different technology integration requirements between them, and when you started to look at any organization, you realized this was a huge requirement: getting all of this metadata and all of these tools integrated. So what we proposed was to use an open-source project to link together all the different metadata standards and provide a platform that enables an organization to create that data-driven layer, linking together all the tools they needed. We started to think: what's an example of metadata working very well? One of the examples we looked at was photographic metadata. Think of a digital camera today: when you take a photograph, an awful lot of information is captured about where you were, the time and date, the settings of the camera, and so on, and it's captured in a standard way and stored with the photo. When you then load it into a photograph management program, all of that metadata is available, and you can add to it and share it. That's really what we want to enable for all types of data, since most data, particularly within a business, has no metadata associated with it. So we came up with what we call the metadata and governance manifesto. It describes the requirements for managing the metadata that a data-driven organization needs: automation is key, along with open standards and open interfaces, making this metadata visible and usable across a wide range of technologies.
All of these pieces require both open-source projects and different vendor technologies to come together and work together to allow an organization to make better use of its data. We then looked at what types of metadata we need, and this is a very high-level patchwork quilt of the different standards we brought together to describe the data assets. That's a broad definition: data stores, obviously, but also feeds of moving data, APIs, and event-style data sources. Then there's the infrastructure that sits underneath it all, and there's lineage, which is the flow of processing that actually created a data set. There are all the different governance requirements: policies, terms and conditions, classifications, and the different procedures that have to happen around data. The glossary describes the meaning of data, and you build ontologies for different domains. Collaboration is the feedback from the people who are actually using the data: likes, comments, and reviews on the different data assets. Then we've got data standards and reference data, which help to reduce cost by making things more consistent. And finally, metadata discovery is where automated processes look into the assets and calculate their profile and various other characteristics. So there's a wide range of metadata. If I show you the next picture (sorry, that was a different picture from the one I was expecting), this one shows an example of that type of metadata. At the bottom layer, you have the structural information; this is what a lot of people think of when they think of metadata. A database schema is an example of structural metadata, and it shows you the fields that are in the data.
Alongside each field there's some sort of symbolic name hinting at what that data means. The next thing we add is links to glossary terms (remember the glossary), and this starts to give you a much clearer definition of exactly what the data means; there'll be a proper explanation of its real definition associated with that label. Then we can get really fancy and start linking terms together to show how values relate to one another. We can start to say, well, these fields are all part of an employee record, which means they could potentially be personal data once we bring them together. And finally, we can be very explicit and tag the metadata with flags to say which pieces are sensitive. The next picture, which was the one I was expecting earlier, shows some of the detail of that metadata structure. For Egeria, we've actually defined nearly 500 types that cover the metadata requirements of a data-driven organization and the linkages between them. So it's a huge undertaking that we're aiming to support through the Egeria project. I mentioned earlier that we're trying to connect together many different tools. Imagine a fluffy cloud labelled "open metadata and governance": that's what the ODPi Egeria project supports, and it has connection points to all the different types of tools that need to exchange metadata. It's not just databases; it's reporting tools, security tools, developer tools, and so on. So there's a huge range of value. But of course, the world is not a place where you can have one single database that everything connects to; in fact, many tools have their own database with metadata in it. So what Egeria does is create a peer-to-peer protocol between these repositories to allow them to exchange the metadata that they need.
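The field-to-glossary-term linkage and sensitivity tagging described above can be sketched in a few lines. This is a deliberately simplified illustration with hypothetical names, not the actual Egeria type system (which defines nearly 500 interlinked types):

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical, simplified stand-ins for open metadata types.

@dataclass
class GlossaryTerm:
    name: str
    definition: str

@dataclass
class SchemaField:
    column_name: str                       # structural metadata from the schema
    term: Optional[GlossaryTerm] = None    # semantic meaning via glossary linkage
    classifications: List[str] = field(default_factory=list)  # governance tags

# Structural metadata alone: just a cryptic column name.
emp_dob = SchemaField("EMP_DOB")

# Link a glossary term to give the field a clear, agreed definition.
emp_dob.term = GlossaryTerm(
    "Employee Date of Birth",
    "The date of birth of a current or former employee.")

# Be explicit about sensitivity so downstream tools can act on it.
emp_dob.classifications.append("PersonalData")

print(emp_dob.term.name, emp_dob.classifications)
```

The point of the layering is that each step (structure, meaning, relationships, sensitivity) adds knowledge that downstream tools, such as a security enforcement point, can act on automatically.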
The repositories don't need modifying; basically, we create connectors that link them together and exchange the metadata that each repository cares about. When you look at pictures like that, it all seems quite straightforward, but the real world is multi-cloud, multi-platform, and highly distributed, and our solution has to work in that environment. Again, a single centralized place that everything connects to really doesn't work. So when I say this is a peer-to-peer distributed protocol: the orange in the picture shows where the metadata is located, and the dotted line shows the integration that Egeria does under the covers to create the illusion of one single virtual view of metadata that each of the repositories is connecting to. They are actually separate, but connected through the Egeria protocols. The other thing that makes this a little harder is that different technologies have different capabilities when it comes to integration. For example, a relational database is typically what I would call a passive technology: it has open APIs, but it doesn't send out any notifications when its schema changes, so we need to poll it periodically and validate that we have captured all of the schema information defined in that repository. Something like Apache Cassandra is an active technology: it creates an event, in its own format, every time a new schema is created, so all we have to do is listen, convert the event to our format, and bring it through. We have specialist servers that sit listening to, or polling, different technologies to continuously automate the capture of metadata. We also have APIs so that UIs, scripting tools, and the like can support more manual capture of metadata as required.
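The "passive technology" pattern above can be sketched as a polling connector that takes periodic snapshots of the schema and diffs them, synthesizing the change events the database never sends. This is illustrative code under assumed names, not an actual Egeria connector:

```python
def fetch_schema():
    """Stand-in for querying a relational database's catalog
    (e.g. its information_schema views); returns {table: [columns]}."""
    return {"EMPLOYEE": ["EMP_ID", "EMP_NAME", "EMP_DOB"]}

def diff_schemas(old, new):
    """Turn two snapshots into the change events a passive technology
    never emits on its own."""
    events = []
    for table in new.keys() - old.keys():
        events.append(("table_created", table))
    for table in old.keys() - new.keys():
        events.append(("table_deleted", table))
    for table in new.keys() & old.keys():
        for col in set(new[table]) - set(old[table]):
            events.append(("column_added", f"{table}.{col}"))
    return events

# One polling cycle: compare the previous snapshot with the current one.
previous = {"EMPLOYEE": ["EMP_ID", "EMP_NAME"]}
current = fetch_schema()
for event in diff_schemas(previous, current):
    print(event)   # ('column_added', 'EMPLOYEE.EMP_DOB')
```

An active technology like Cassandra skips the polling and diffing entirely: the connector just subscribes to the native schema-change events and translates each one into the open metadata format.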
These connect into specialist interfaces for the different types of tools, which are then brought into the ecosystem and shared. When we look at a connected environment, what you see in the center is a thing called cohort A. A cohort is a set of tools that are sharing metadata, and it is built dynamically: it automatically configures itself as new members join or leave. When a member joins, it passes information about itself, and that automatically configures the other members so that they can now issue queries against the new member that has just joined the cohort. Connected off those cohort members, as we call them, are other servers, the governance servers, which connect into different technologies and gather metadata. We also push metadata. In this picture you can see metadata being gathered from different data platforms at the bottom, but at the top left you can see that we're actually pushing metadata to Apache Ranger and a data virtualization tool to dynamically create secured views over the data being added to the ecosystem. So even though the metadata is gathered by one tool, it can be consumed and used by a different tool, because these tools surround the same data. The other thing we need to be mindful of in that distributed environment: some of the technology runs in a centralized cloud environment and needs to be highly scalable and continuously available, while in other places a small team may have a variety of requirements, so we might need lots of different servers sitting on the same platform, giving us a multi-tenant or consolidated environment for small situations. And in the IoT space, we may need to run on something as small as a Raspberry Pi.
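The dynamic cohort registration described above can be sketched as follows: each new member broadcasts a registration record, and every member updates its routing table so that it can immediately query the others. This is a toy, in-memory illustration of the idea (the real Egeria cohort exchanges registration events over a shared topic), with hypothetical names throughout:

```python
class CohortMember:
    def __init__(self, name):
        self.name = name
        self.peers = {}    # routing table: member name -> connection info

    def handle_registration(self, name, endpoint):
        """Called when another member's registration event arrives."""
        self.peers[name] = endpoint

class Cohort:
    """Toy event bus standing in for the cohort's shared registration topic."""
    def __init__(self):
        self.members = []

    def join(self, member, endpoint):
        for existing in self.members:
            # Tell each existing member about the newcomer...
            existing.handle_registration(member.name, endpoint)
            # ...and tell the newcomer about each existing member.
            member.handle_registration(existing.name, f"https://{existing.name}")
        self.members.append(member)

cohort_a = Cohort()
catalog = CohortMember("catalog-server")
cohort_a.join(catalog, "https://catalog-server")
lineage = CohortMember("lineage-server")
cohort_a.join(lineage, "https://lineage-server")

print(catalog.peers)   # {'lineage-server': 'https://lineage-server'}
print(lineage.peers)   # {'catalog-server': 'https://catalog-server'}
```

The key property is that no central registry has to be administered: membership information flows to every member as a side effect of joining.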
So the Egeria technology has a thing called the OMAG Server Platform that hosts all the integration technology, and it can be configured in lots of different ways, allowing us to go from the smallest machine up to the largest cloud environment using the same technology, just by altering its configuration. The aim is that when an organization is using Egeria (in this picture, the light blue is Egeria), there are lots of different technologies connecting and supporting different communities, with a community of different professionals connected at each point. Egeria gathers the metadata that is of value to each community: it uses their interaction, their activity, and its configuration to determine what metadata is gathered at each of those points. The result is that one group of users might create information about a data set, which is then shared. Somebody else may add classifications to identify where sensitive data is. That might then configure a security tool, and may also generate reports for the privacy team so that they can see that private data is being deployed to a new location. You can start to see how the value of the knowledge is shared, augmented, and shared again as we bring this ecosystem together. Now, technology is important, but this open-source project also focuses on creating a secure and conformant environment, so that technologies from different vendors can be connected together safely without corrupting one another. We're also interested in helping people who are trying to transition their organization to become more data-driven, with education, examples, and suggested best practices that link down into the Egeria technology and show how it can be used. Because this is an incredibly complicated topic, and there are an awful lot of people being given data-oriented roles that they've never done before.
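The configuration-driven idea behind the OMAG Server Platform, one platform codebase whose behaviour is chosen entirely by a configuration document, can be sketched like this. The configuration fields below are hypothetical and much simpler than the real platform's JSON configuration documents; the sketch just shows the principle:

```python
import json

# Hypothetical, simplified configuration documents: one for a tiny edge
# deployment (e.g. a Raspberry Pi), one for a large cloud deployment.

raspberry_pi_config = json.loads("""
{"server_name": "edge-sensor-catalog",
 "services": ["repository"],
 "max_threads": 2}
""")

cloud_config = json.loads("""
{"server_name": "enterprise-catalog",
 "services": ["repository", "cohort-exchange", "governance-engine"],
 "max_threads": 64}
""")

def start_server(config):
    """Stand-in for the platform starting a server from its config document."""
    return (f"{config['server_name']}: "
            f"{len(config['services'])} services, "
            f"{config['max_threads']} threads")

# The same start-up code serves both the smallest and the largest deployment;
# only the configuration document changes.
print(start_server(raspberry_pi_config))
print(start_server(cloud_config))
```
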
So the aim of the Egeria project is to help people as much as possible, both through the technology we provide and through the advice and best practices linked with it. We are a completely open project. Everything we've done has been created through GitHub. We have a number of companies working together to create this common technology, and everything that we show and use is available through our GitHub repositories. The technology itself is organized into three layers. At the bottom is what we call the developer toolkit: all the libraries, plug points, and connector interfaces that allow you to build your own integration environment depending on the technologies you use. Above that is what we call the integration platform: pre-built connectors for different technologies (these are the things you don't have to write), plus some UIs that make it easier to monitor what's happening in the integration environment. Finally, the governance solutions sit on top. They exploit the fact that we now have a view of metadata, a view of the organization and its operations, that has never been visible before, and they allow people to make more powerful use of the metadata they've created. So that was a bit of a whistle-stop tour of ODPi Egeria and what it can potentially do for your organization. I have some time for questions, so we can put the questions slide up. I don't see any questions at the moment, but if anybody has any, I'm happy to answer them now. Otherwise, I'll move you over to the links page: we've got a number of press releases and external links here, but really everything you need to know is in the two GitHub repositories that I'm showing you.
The data-governance repository is the one that contains all the best practices, and the egeria repository holds the source code and all the documentation for the Egeria software. So that's it. If there are no questions, I think we have finished.