from Berlin, Germany. It's theCUBE, covering DataWorks Summit Europe 2018. Brought to you by Hortonworks.

Well, hello and welcome to theCUBE. I'm James Kobielus, the lead analyst for big data analytics within the Wikibon team of SiliconANGLE Media. I'm hosting theCUBE this week at DataWorks Summit 2018 in Berlin, Germany. It's been an excellent event. Hortonworks, the host, has completed two days of keynotes. They announced Data Steward Studio as the latest of their offerings and demonstrated it this morning to address GDPR compliance, which of course is hot and heavy. It's coming down on enterprises both in the EU and around the world, including in the U.S., and the May 25th deadline is fast approaching. One of Hortonworks' prime partners is IBM, and today on this CUBE segment we have Mandy Chessell. Mandy is a distinguished engineer at IBM who gave an excellent keynote yesterday all about metadata and metadata management. Mandy, great to have you.

Hi, thank you.

So I wonder if you could just reprise or summarize the main takeaways from your keynote yesterday on metadata and its role in GDPR compliance, and in the broader strategies that enterprise customers have for managing their data in this new multi-cloud world, where Hadoop and open-source platforms are critically important for storing and processing data. So, Mandy, go ahead.

Sure. So metadata's not new. I mean, it's basically information about data. And a lot of companies are trying to build a data catalog, which is not a catalog that actually contains their data; it's a catalog that describes their data.

Is it different from an index or a glossary? How is the catalog different from those?

Yeah, so a catalog actually includes both. It is a list of all the data sets, plus links to glossary definitions of what those data items mean within the data sets, plus information about the lineage of the data.
It includes information about who's using it, what they're using it for, and how it should be governed. It's like a governance repository, to me.

So governance is part of it.

The governance part is really saying: this is how you're allowed to use it, this is how the data is classified, these are the automated actions that are going to happen on the data as it's used within the operational environment. So there's that aspect to it, but there is also the collaboration side: hey, I've been using this data set, it's great; or actually, this data set's full of errors, we can't use it. So you've got feedback to data set owners, as well as collaboration between data scientists working with the data. So it is a central resource for an organization that has a strong data strategy and is interested in becoming a data-driven organization. This becomes their major catalog of their data assets and how they're using them. So when a regulator comes in and says, show me that you're managing personal data, the data catalog will have the information about where personal data is located, what type of infrastructure it's sitting on, and how it's being used by different services. So they can really show that they know what they're doing. And then from that they can show how the processes are using the metadata in order to use the data appropriately day to day.

So, Apache Atlas. If I understand correctly, for IBM and Hortonworks the catalog is basically Apache Atlas, and Apache Atlas is essentially an open-source metadata code base. So explain what Atlas is in this context.

So yes, Atlas is a collection of code, but it supports a server, a graph-based metadata server.

A graph-based metadata server?

Yes.

So explain what you mean by graph-based in this context.

Okay, so it runs using the JanusGraph repository. And this is very good for metadata because, if you think about what it is, it's connecting dots.
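To make the "connecting dots" idea concrete, here is a minimal sketch of a catalog as a graph: nodes for data sets and glossary terms, edges linking them, and classification tags driving the governance questions discussed above (such as "where is personal data?"). This is an illustrative model only, not Atlas itself; all identifiers and names are invented for the example.

```python
# Illustrative in-memory model of a graph-based metadata catalog.
# Nodes: entities keyed by a GUID, each with a type and attributes.
entities = {
    "ds-1": {"type": "DataSet", "name": "customer_profiles"},
    "ds-2": {"type": "DataSet", "name": "clickstream_events"},
    "term-1": {"type": "GlossaryTerm", "name": "Email Address"},
}

# Edges: (from, relationship, to) — e.g. a data set linked to the
# glossary term that defines what one of its fields means.
edges = [
    ("ds-1", "semanticAssignment", "term-1"),
]

# Classifications: governance tags attached to entities.
classifications = {
    "ds-1": ["PersonalData"],   # flagged for GDPR handling
    "ds-2": [],
}

def find_by_classification(tag):
    """Answer the regulator's question: which data sets carry this tag?"""
    return [entities[guid]["name"]
            for guid, tags in classifications.items() if tag in tags]

print(find_by_classification("PersonalData"))  # ['customer_profiles']
```

A real catalog stores the same structure in a graph database (JanusGraph, in Atlas's case), which is what makes traversals like this cheap at scale.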
It's basically saying this data set means this value and needs to be classified in this way.

Like a semantic knowledge graph.

It is, yes, actually. And on top of it we impose a type system that describes the different types of things you need to control and manage in a data catalog. So the Atlas component gives you that graph-based repository underneath. But on top we've built what we call the open metadata and governance libraries. They run inside Atlas, so when you run Atlas you will have all the open metadata interfaces. But you can also take those libraries and load them into another vendor's product. And what they're doing is allowing metadata to be exchanged between repositories of different types. This becomes incredibly important as an organization increases its maturity and its use of data, because you can't just have knowledge about data in a single server. It just doesn't scale. You need to get that knowledge into every runtime environment, into the data tools that people are using across the organization. And so it needs to be distributed.

Mandy, I'm wondering about the whole notion of what you catalog in that repository: does Apache Atlas support adding metadata relevant to data-derivative assets, like machine learning models and so forth?

Absolutely, absolutely. So we have base types in the open metadata layer, but it's also a very flexible, extensible type system. So if you've got a specialist machine learning model that needs additional information stored about it, that can easily be added to the runtime environment, and then it will be managed through the open metadata protocols as if it were part of the native type system.

Because, of course, as an analyst one of my core areas is artificial intelligence. And one of the hot themes in artificial intelligence is a broad umbrella called AI safety.
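As a sketch of the extensible type system Mandy describes, the payload below shows roughly how a custom entity type for a machine learning model might be defined. The JSON shape follows the Apache Atlas v2 typedef API (`POST /api/atlas/v2/types/typedefs`), but the type name, supertype choice, and attributes here are invented for illustration and should be treated as assumptions.

```python
import json

# Hypothetical typedef for a machine-learning-model entity, in the
# shape the Atlas v2 typedefs endpoint accepts. All names are invented.
ml_model_typedef = {
    "entityDefs": [{
        "name": "ml_model",
        "superTypes": ["DataSet"],  # inherit lineage/classification behavior
        "attributeDefs": [
            {"name": "algorithm", "typeName": "string", "isOptional": False},
            {"name": "trainingDataGuid", "typeName": "string", "isOptional": True},
            {"name": "version", "typeName": "string", "isOptional": True},
        ],
    }]
}

# Against a live server this would be registered roughly like:
#   requests.post("http://atlas-host:21000/api/atlas/v2/types/typedefs",
#                 json=ml_model_typedef, auth=("admin", "admin"))
print(json.dumps(ml_model_typedef, indent=2))
```

Once registered, instances of `ml_model` participate in classification, search, and lineage exactly like the built-in types, which is the point Mandy makes about the open metadata protocols.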
And one of the core subsets of that is something called explainable AI: being able to trace the lineage of a given algorithmic decision back to which machine learning models were fed from which data, through what action and when — say, when a self-driving vehicle hits a human being, for legal discovery or whatever. So what I'm working toward is the extent to which the Hortonworks-IBM big data catalog running Atlas can be a foundation for explainable AI, either now or in the future. I as an analyst see lots of enterprises that are exploring this topic, but explainable AI is not at the point where it's in production. Clearly, though, companies like IBM are exploring building an architecture for doing this kind of thing in a standardized way. What are your thoughts there? Is IBM working on bringing, say, Atlas and the overall big data catalog into that kind of use case?

Yes, yeah. So if you think about what's required, you need to understand the data that was used to train the AI, and what data's been fed to it since it was deployed, because that's going to change its behavior. And then also a view of how that data's going to change in the future, so you can start to anticipate issues that might arise from the model's changing behavior. And this is where the data catalog can actually associate and maintain information about the data that's being used with the algorithm. You can also associate the checking mechanism that's constantly monitoring the profile of the data, so you can see where the data is changing over time in ways that will obviously affect the behavior of the machine learning model. So it's really about providing not just information about the model itself, but also about the data that's feeding it and how those characteristics are changing over time, so that you know the model is continuing to work into the future.
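The explainability question above is, at bottom, an upstream-lineage traversal: start at a model and walk back through the processes and data sets that produced it. Here is a minimal sketch of that walk over an invented lineage graph; Atlas exposes equivalent information through its lineage REST API (`GET /api/atlas/v2/lineage/{guid}`), but every name below is hypothetical.

```python
# Directed "feeds" edges: producer -> consumers.
# raw data -> cleansing job -> training set -> training run -> model.
feeds = {
    "raw_events": ["cleansing_job"],
    "cleansing_job": ["training_set"],
    "training_set": ["train_model_run"],
    "train_model_run": ["churn_model_v3"],
}

def upstream(node, graph):
    """Return everything upstream of `node` — what it was derived from."""
    # Invert the edges, then walk backwards with a depth-first search.
    parents = {}
    for src, dsts in graph.items():
        for dst in dsts:
            parents.setdefault(dst, []).append(src)
    seen, stack = set(), list(parents.get(node, []))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(parents.get(cur, []))
    return seen

print(sorted(upstream("churn_model_v3", feeds)))
# ['cleansing_job', 'raw_events', 'train_model_run', 'training_set']
```

For legal discovery, the answer to "what trained this model?" is exactly this upstream set, which is why keeping lineage edges current in the catalog matters as much as cataloging the model itself.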
So tell us about the IBM-Hortonworks partnership on metadata and so forth. How is that evolving? Your partnership is fairly tight: clearly you've got ODPi, you've got the work you're doing related to the big data catalog. What can we expect to see in the near future — initiatives building on all of that for governance of big data in a multi-cloud environment?

Yeah, so Hortonworks started the Apache Atlas project a couple of years ago with a number of their customers, and they built a base repository and a set of APIs that allow it to work in the Hadoop environment. We came along last year and formed our partnership, and that partnership includes this open metadata and governance layer. Since then we've worked with ING as well, and ING brings the user perspective — the organization's use of the data. And so between the three of us we are basically transforming Apache Atlas from a Hadoop-focused metadata repository into an enterprise-focused metadata repository, plus enabling other vendors to connect into the open metadata ecosystem. So we're standardizing types, standardizing the format of metadata, and there's a protocol for exchanging metadata between repositories. And this is all coming from that three-way partnership, where you've got a consuming organization, you've got a company that's used to building enterprise middleware, and you've got Hortonworks with their knowledge of open-source development and the Hadoop environment.

Here's a question out of left field. As you develop this architecture, clearly you're leveraging Hadoop HDFS for storage. Are you looking into, or at least evaluating, using blockchain for more distributed management of the metadata in these heterogeneous, multi-cloud environments, or not?

So Atlas itself does run on HDFS, but it doesn't need to run on HDFS. It's got other storage environments, so we can run it outside of Hadoop.
When it comes to blockchain: blockchain is for sharing data between partners — small amounts of data that basically express agreements. So it's like a ledger. There are some aspects of it that we could use for metadata management, but it's more that we actually need to put metadata management into blockchain. The agreements and contracts that are stored in blockchain are only meaningful if we understand the data that's there: what its quality is, where it came from, what it means. And so actually there's a very interesting distributed metadata question that comes with blockchain technology, and I think that's an important area of research.

Mandy, we're at the end of our time. Thank you very much. We could go on and on. You're a true expert, and it's great to have you on theCUBE.

Thank you for inviting me.

So this is James Kobielus with Mandy Chessell of IBM. We are here this week in Berlin at DataWorks Summit 2018. It's a great event, and we have some more interviews coming up. So thank you very much for tuning in.