from Berlin, Germany. It's theCUBE, covering DataWorks Summit Europe 2018. Brought to you by Hortonworks.

Well, hello and welcome to theCUBE. I'm James Kobielus, the lead analyst within the Wikibon team at SiliconANGLE Media, focused on big data analytics. And big data analytics is what DataWorks Summit is all about. We are at DataWorks Summit 2018 in Berlin, Germany. We're on day two. I have, as my special guest here, Pankaj Sodhi, who is the big data practice lead with Accenture. He's based in London, and he's here to discuss what he's seeing in terms of what his clients are doing with big data. So hello, welcome. Pankaj, how's it going?

Thank you, Jim. Very pleased to be here.

Great, great. So what are you seeing in terms of customers' adoption of Hadoop and other big data platforms, and for what kinds of use cases? GDPR is coming down very quickly. We saw the poll this morning that John Kreisa of Hortonworks did from the stage, and it's a little bit worrisome if you're an enterprise data administrator, really an enterprise, period, because it sounds like a sizable portion of this audience is not entirely ready to comply with GDPR on day one, which is May 25th. What are you seeing in terms of customer readiness for this new regulation?

So Jim, I'll answer the question in two ways: first in terms of the adoption of Hadoop, and then getting to GDPR. In regards to Hadoop adoption, I would place clients in about three different categories. The first are the ones that have been quite successful in terms of adoption of Hadoop. What they've done there is taken a very use-case-driven approach to build out the capabilities to deploy those use cases. They've taken an iterative approach, deployed hybrid architectures, and then taken the time-

Hybrid public-private cloud?

Cloud as well, but often on premise.
Hybrid being, for example, an EDW alongside Hadoop. In that scenario, they've taken the time to work out some of the technical complexities and nuances of deploying these pipelines in production. Consequently, what they're in a good position to do now is leverage the best of cloud computing and open source technologies, whilst getting the investment protection they have from on-premise deployments as well. So they are in a fairly good position.

Another set of customers have done successful pilots, looking at either cost optimization use cases-

Pilots of Hadoop?

Yes, leveraging Hadoop, either from a cost optimization play or for advanced analytics capabilities. They're in the process of going to production and starting to work out, from a footprint perspective, which elements of the future pipelines are going to be on-prem with Hadoop or on cloud with Hadoop.

When you say the pipeline in this context, what are you referring to? When I think of pipeline, in fact in our coverage of pipelines, it refers to an end-to-end life cycle for development, deployment, and management of big data and analytics assets.

Absolutely, so all the way from ingestion to curating and consuming the data through multiple different access points. That's the full pipeline. And what the organizations that have been successful have done is not just looked at the technology aspect, which is Hadoop in this case, but looked at a mix of architecture, delivery approaches, governance, and skills. I'd like to bring this to life with advanced analytics as a use case. Rather than take the approach of "let's ingest all data into a data lake," it's been driven by a use case mapped to a set of valuable data sets that can be ingested. What's interesting then is that the delivery approach has been to bring together diverse skill sets.
For example, data engineers, data scientists, data ops and visualization folks, and then use them to challenge the architecture and delivery approach. I think this is a very key ingredient for success: to me, modern Hadoop-based pipelines need to be iteratively built and deployed rather than linear and monolithic. So the notion is, I have raw data; let me come up with a minimally curated data set and then look at how I can do feature engineering and build an analytical model. If that works and I need additional data attributes, I then enhance the pipeline. This is already starting to challenge organizations' architecture approaches and how you deploy into production. And I think that's been one of the key differences from organizations that have embarked on the journey and ingested data, but not had a path to production. So that's one aspect.

What about the data stewards of the world, are they challenging the architecture? GDPR is coming down fast and furious, and we're seeing, for example, Hortonworks' architecture for Data Steward Studio. Are you seeing the data governance people, the data stewards of the world, coming to sit around the virtual table and challenging this architecture to evolve further, to enable privacy by design and by default and so forth?

I think, again, the organizations that have been successful were already looking at privacy by design before GDPR came along. Now, one of the reasons a lot of the data lake implementations haven't been as successful is that the business hasn't had the ability to query the data sets and work out what the definitions are and what the curation levels are. So with business glossaries and these data architectures, from a GDPR perspective, we see this as an opportunity rather than a threat. To actually make the data usable in the data lakes, we often talk to clients about this concept of a data marketplace.
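The iterative pipeline Pankaj describes, a minimally curated data set, a feature engineering pass, then an enhancement only when the model needs more attributes, could be sketched roughly like this. This is a minimal illustrative sketch; the records, function names, and the high-value feature are my own assumptions, not any specific Accenture or Hadoop tooling.

```python
# Minimal sketch of an iteratively built pipeline: start with a small
# curated slice of the raw data, add attributes only when needed.

raw_records = [
    {"id": 1, "amount": 120.0, "country": "DE", "channel": "web"},
    {"id": 2, "amount": 75.5, "country": "FR", "channel": "store"},
]

def curate(records, attributes):
    """Project only the attributes the current use case needs."""
    return [{k: r[k] for k in attributes} for r in records]

def engineer_features(curated):
    """Toy feature engineering step: flag high-value transactions."""
    return [dict(r, high_value=r["amount"] > 100) for r in curated]

# Iteration 1: minimally curated data set, just id and amount.
v1 = engineer_features(curate(raw_records, ["id", "amount"]))

# Iteration 2: the model needs more signal, so the pipeline is
# enhanced with an extra attribute rather than rebuilt from scratch.
v2 = engineer_features(curate(raw_records, ["id", "amount", "country"]))
```

The point of the sketch is that each iteration reuses the same pipeline stages; only the curated attribute list grows.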
So in the data marketplace, what you need to have is well-curated data sets with proper definitions, searchable through a business glossary or a data catalog, underpinned by the right user access model, and available, for example, through search or APIs. So GDPR actually is an enabler here.

This is not a public marketplace. This is an architectural concept. It could be completely inside the private data center, but it's reusable data exposed through APIs.

Correct.

And standard glossaries and metadata and so forth. Is that correct?

Correct. The data marketplace is reusable internally, for example to unlock access for data scientists who might want to use a data set and then put it into a data lab. It can also be extended, from an API perspective, into a third-party data marketplace for exchanging data with consumers or third parties as organizations look at data monetization as well. And therefore I think the role of data stewards is changing a bit. Rather than looking at it from a compliance perspective, it's about how we can make data usable to the analysts and the data scientists. So it's about focusing on getting the right definitions upfront, and then, as we curate, enrich, and publish the data, working out what the next definition is and having it available before we publish.

That's a fascinating concept. So with the notion of a data steward or a data curator, it sounds like you're blending them, where part of the data curator's job very much involves identifying the relevance of data, the potential reusability and attractiveness of that data for various downstream uses, and possibly being a player in the ongoing identification of the monetizability of data elements, both internally and externally, in the value chains. Am I describing it correctly?

I think you are, yes.
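The marketplace concept described here, curated data sets with business definitions, searchable through a catalog, and gated by a user access model, could be sketched as follows. This is a hypothetical illustration; the catalog fields and role names are assumptions, not the API of any particular catalog product.

```python
# Sketch of a data-marketplace catalog: each entry carries a business
# definition, searchable tags, and the roles allowed to access it.

catalog = [
    {
        "name": "customer_transactions_curated",
        "definition": "Card transactions, deduplicated, amounts in EUR.",
        "tags": ["finance", "transactions"],
        "allowed_roles": {"data_scientist", "analyst"},
    },
    {
        "name": "customer_pii_master",
        "definition": "Customer master data including personal details.",
        "tags": ["customer", "pii"],
        "allowed_roles": {"data_steward"},
    },
]

def search(catalog, keyword, role):
    """Return data sets matching the keyword that the role may access."""
    return [
        entry["name"]
        for entry in catalog
        if keyword in entry["tags"] and role in entry["allowed_roles"]
    ]

# A data scientist finds the curated transactions set but, per the
# access model, never sees the PII master data.
hits = search(catalog, "transactions", role="data_scientist")
```

The same catalog structure could sit behind an internal search UI or be exposed through APIs for a third-party marketplace, as discussed above.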
And I think there's an interesting implication for the CDO function, because rather than the function being looked at as a policy sort of-

The chief data officer.

Yes, the chief data officer function. Rather than the imposition of policies and standards, it's about actually trying to unlock business value. So rather than looking at it from a compliance perspective, which is very important, flip it around and look at it from a business value perspective. For example, if you're able to tag and classify data and then apply the right kind of protection to it, it actually helps the data scientists use that data for their models whilst following GDPR guidelines. So it's a win-win from that perspective.

So in many ways, the core requirement for GDPR compliance, which is to discover and inventory and essentially tag all of your data on a fine-grained level, can be the greatest thing that ever happened to data monetization. In other words, it's the foundation of data reuse and monetization, and of unlocking the true value of the data to your business. So it needn't be an overhead burden. It can be the foundation of a new business model.

Absolutely, because if you talk about organizations becoming data-driven, you have to look at what a data asset actually means. To me, that's a curated data set with the right level of description, underpinned by the right authorization, privacy, and ability to use the data. So I think GDPR is going to be a very good enabler. Again, the small minority of organizations that have been successful have done this; they've had business glossaries and data catalogs. But now GDPR is almost going to force the issue, which I think is a very positive outcome.

Now, Pankaj, do you see any of your customers taking this concept of curation the next step? There are data assets, but then there are data-derived assets, like machine learning models and so forth.
Data scientists build and train and deploy these models and algorithms; that's the core of their job. And model governance is a hot, hot topic we see all over. You've got to have tight controls, not just on the data, but on the models, because they're core business IP. Do you see this architecture evolving among your customers so that they'll also increasingly be required, or want, to catalog the models and identify and curate them for reusability, and possibly for monetization opportunities? Is that something any of your customers are doing or exploring as a model?

There are, I would say, some of our customers looking at that as well, and again, initially for internal purposes. Exactly, it's an extension of the marketplace. So whilst one aspect of the marketplace is data sets you can combine to run the models, the other aspect is models that you can also search for and subscribe to.

Yeah, pre-trained models can be golden. If the core domain for which they're trained doesn't change all that often, they can conceivably have great aftermarket value if you want to resell them.

Absolutely, and I think this is also a key enabler for the way data scientists and data engineers expect to operate. So there's this notion of IDEs, or collaborative notebooks and so forth, and being able to share the outputs of models with the folks in the team, who can then maybe tweak them for a different algorithm, is a huge productivity enabler, I think. And we're seeing quite a few of our technology partners working towards enabling these data scientists to move very quickly from a model they may have initially developed on a laptop to deploying it on a Kerberized cluster, and how you can do that very quickly and reduce the time from an idea and hypothesis to going into production.
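Extending the marketplace from data sets to models, as discussed in this exchange, amounts to a searchable model registry. The sketch below is an illustrative assumption of what such a catalog entry might hold (name, domain, version, metrics); it is not the interface of any specific model-registry product.

```python
# Sketch: a model registry as an extension of the data marketplace,
# so teams can search for pre-trained models and subscribe to them.

model_registry = []

def register_model(name, domain, version, metrics):
    """Catalog a trained model with metadata describing its use."""
    entry = {
        "name": name,
        "domain": domain,
        "version": version,
        "metrics": metrics,
    }
    model_registry.append(entry)
    return entry

def find_models(domain):
    """Search the registry for reusable models in a business domain."""
    return [m["name"] for m in model_registry if m["domain"] == domain]

register_model("churn_gbt", domain="telco", version="1.2",
               metrics={"auc": 0.81})
register_model("fraud_rf", domain="payments", version="0.9",
               metrics={"auc": 0.88})

telco_models = find_models("telco")
```

Versioning and metrics in each entry are what make governance possible: a consumer can see exactly which trained artifact they subscribed to and how it performed.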
I agree, modularization of machine learning and deep learning, I'm seeing a lot of that among data scientists in the business world. Well, thank you, Pankaj. We're out of time right now. This has been a very engaging and fascinating discussion, and we thank you very much for coming on theCUBE.

This has been Pankaj Sodhi of Accenture. We're here at DataWorks Summit 2018 in Berlin, Germany. It's been a great show, and we have more expert guests we'll be interviewing later in the day. Thank you very much, Pankaj.

Thank you very much, Jim. Thank you.