Live from San Jose in the heart of Silicon Valley, it's theCUBE, covering DataWorks Summit 2017, brought to you by Hortonworks. Hey, welcome back to theCUBE, live from the DataWorks Summit day two. We've been here for a day and a half talking with fantastic leaders and innovators, learning a lot about what's happening in the world of big data, the convergence with the Internet of Things, machine learning, artificial intelligence, the list goes on and on. I'm Lisa Martin, my co-host is George Gilbert, and we are joined by a couple of guys. One is a CUBE alum, Itamar Ankorion, the CMO of Attunity, welcome back to theCUBE. Thank you very much, good to be here. Thank you for joining us. And Arvind Rajagopalan, the director of technology services for Verizon, welcome to theCUBE. Thank you. So we were chatting before we went on, and Verizon, you're actually going to be presenting tomorrow at the DataWorks Summit. Tell us about the journey that Verizon has been on building a data lake. Well, Verizon, as you know, over the last 20 years has been a large corporation made up of a lot of different acquisitions and mergers, and that's how it was formed 20 years back. And as we've gone through the journey of the mergers and acquisitions over the years, we had data from different companies come together and form a lot of different data silos. So the reason we started looking at this is our CFO started asking questions around being able to answer one-Verizon questions. It's as simple as a days payable or working capital analysis across all the lines of business. And since we have a three-major-ERP footprint, it is extremely hard to get that data out. And there were a lot of manual data prep activities involved in bringing together those one-Verizon views. So that's really what was the catalyst to get the journey started for us. And it was driven by your CFO, you said. That's right. Very interesting, okay. So what are some of the things that people are going to hear tomorrow from your breakout session? I'm sorry, say that again. Sorry, what are some of the things that the attendees at your breakout session are going to learn about the steps in the journey? So I'm going to primarily be talking about the challenges that we ran into and share some thoughts around that, and also talk about some of the factors such as the catalyst and what drove us to move in that direction, as well as get into some architectural components from a high-level standpoint, talk about certain partners that we have worked with, the choices we made from an architecture perspective and the tools, as well as kind of close the loop on user adoption and what users are seeing in terms of business value as we start centralizing all of the data at Verizon from a back-office finance and supply chain standpoint. So that's kind of what I'm looking at talking about tomorrow. Arvind, it's interesting to hear you talk about collecting data from essentially back-office operational systems in a data lake. I assume that this data is more refined and easily structured than the typical stories we hear about data lakes. Were there challenges in making it available for exploration and visualization? Or were all the early use cases really just production reporting? So standard reporting across the ERP systems is very mature and those capabilities are there. 
But when you look across ERP systems, and we have three major ERP systems for the different lines of business, when you want to combine all of the data, it's very hard. And to add to that, to your point around self-service discovery and visualization across all three data sets, it's even more challenging, because it takes a lot of heavy lifting to normalize all of the data and bring it into one centralized platform. And we started out the journey with Oracle, and then we had SAP HANA, where we were trying to bring all the data together. But when we were looking at non-SAP ERP systems and bringing that data into an SAP kind of footprint, one, the cost was tremendously high. Also, there was a lot of heavy lifting and challenges in terms of manually having to normalize the data and bring it into the same kind of data models. And even after all of that was done, it was not very self-service oriented for our users in finance and supply chain. Let me drill into two of those things. So it sounds like the ETL process of converting it into a consumable format was very complex. And then it sounds like also the discoverability, where a tool perhaps like Alation might help, which is very, very immature right now, or maybe not immature, it's still young. Is that what was missing? Or why was the ETL process so much more heavyweight than with a traditional data warehouse? The ETL process has a lot of heavy lifting involved because of the proprietary data structures of the ERP systems. Especially SAP: the data structures and how the data is stored across cluster and pool tables are very proprietary. And on top of that, you're bringing in the data formats and structures from a PeopleSoft ERP system, which supports different lines of business. There's a lot of customization that's gone into place. The specific things that we use in the ERPs in terms of the modules, and how the processes are modeled in each of the lines of business, complicate things a lot. And when you take these three different ERPs and the nuances they have built up over the years, and try and bring them together, it actually makes it very complex. So tell us then, help us understand how the data lake made that easier. Was it because you didn't have to do all the refinement before it got there? And tell us how Attunity helped make that possible. Oh, absolutely. So I think that's one of the big things, where we picked Hortonworks as one of our key partners in terms of building out the data lake. With schema on read, you're not necessarily worried about doing a whole lot of ETL before you bring the data in. And along with the tools and the technologies from a lot of other partners, there's a lot of maturity now, so it provides self-service discovery capabilities for ad hoc analysis and reporting. So this is helpful to the users, because now they don't have to wait for prolonged IT development cycles to model the data, do the ETL and build reports for them to consume, which sometimes could take weeks and months. Now, in a matter of days, they're able to see the data that they're looking for and they're able to start the analysis. And once they start the analysis and the data is accessible, it's a matter of minutes and seconds, looking at the different tools, how they want to look at it, how they want to model it. So it's actually been a huge value add from the perspective of the users and what they're looking to do. 
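[Editor's note: for readers unfamiliar with the schema-on-read approach Arvind describes, here is a minimal, hypothetical PySpark sketch. Raw ERP extracts are landed as-is in the lake and only interpreted at query time, so an analyst can explore them without waiting for an upfront ETL and modeling cycle. The paths, table layouts, and column names are illustrative assumptions, not Verizon's actual environment.]

```python
# Minimal schema-on-read sketch (illustrative; paths and columns are hypothetical).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("adhoc-payables-exploration")
         .enableHiveSupport()
         .getOrCreate())

# Raw extracts from each ERP are landed untouched; structure is applied on read.
sap_ap = spark.read.parquet("hdfs:///landing/sap/accounts_payable/")
psft_ap = spark.read.parquet("hdfs:///landing/peoplesoft/accounts_payable/")

# Normalize only the handful of columns this particular analysis needs, at query time.
unified = (
    sap_ap.selectExpr("vendor_id", "invoice_amount AS amount", "due_date")
    .unionByName(
        psft_ap.selectExpr("vendor_id", "gross_amt AS amount", "due_dt AS due_date"))
)

# Ad hoc "one Verizon"-style view: payables by vendor across both ERPs.
unified.groupBy("vendor_id").sum("amount").show()
```

The point of the sketch is that the only modeling done up front is the few renames inside the query itself, which is why users can get from landed data to analysis in days rather than waiting out a full ETL development cycle.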
Speaking of value, one of the things that was kind of thematic yesterday was, we see enterprises are now embracing big data, they're embracing Hadoop. They say, you know, it's got to coexist within our ecosystem and it's got to interoperate. But just putting data in a data lake or Hadoop, there's not the value there. It's being able to analyze that data, in motion, at rest, structured, unstructured, and start being able to glean or take actionable insights. From your CFO's perspective, where are you now on answering some of the questions that he or she had, from an insights perspective, with the data lake that you have in place? Yeah, you know, before I address that, I want to quickly touch upon and wrap up George's question, if you don't mind. Because one of the key challenges, and you talked about how it really helped, I was just about to answer that before we moved on, so I just want to close the loop on it a little bit. Absolutely, go ahead. So in terms of bringing the data in, the data acquisition or ingestion is a key aspect of it. And again, looking at the proprietary data structures from the ERP systems, it's very complex and involves a multi-step process to bring the data into a staging environment, be able to put it in the swamp and bring it into the lake. And what Attunity has been able to help us with is, it has the intelligence to look at and understand the proprietary data structures of the ERPs, and it's able to bring all the data from the ERP source systems directly into Hadoop, without any stops or staging databases along the way. So it's been a huge value from that standpoint, and I'll get into more details around that. And to answer your question around how it's helping from a CFO standpoint and for the users in finance, as I said, now all the data is available in one place, so it's very easy for them to consume the data and be able to do ad hoc analysis. So if somebody's looking to, like I said earlier, calculate their days payable as an example, or they want to look at working capital, we're actually moving data using Attunity's CDC Replicate product, getting data in near real time into the data lake. So now they're able to turn things around and do that kind of analysis in a matter of hours, versus overnight or a matter of days, which was the previous environment. And that was kind of one of the themes this morning: it's really about speed, right? It's how fast can you move? And it sounds like together with Attunity, Verizon is really not only making things simpler, as you talked about, in this kind of model that you have with different ERP systems, but you're also really able to get information into the right hands much, much faster. Absolutely, you know, that's the whole beauty of the near-real-time and CDC architecture. We're able to get data in very easily and quickly. And Attunity also provides a lot of visibility as the data is in flight. We're able to see what's happening in the source system, how many records are flowing through. 
And to that point, my developers are so excited to work with the product, because they don't have to worry about the changes happening on the source systems in terms of DDL; those changes are automatically understood by the product and pushed to the destination on Hadoop. So it's been a game changer, because we have not had any downtime, whereas when there are things changing on the source system side, historically we had to take downtime to change those configurations and the scripts and publish them across environments. So that's been huge from that standpoint as well. Absolutely. Itamar, maybe help us understand where Attunity fits. It sounds like there's greatly reduced latency in the pipeline between the operational systems and the analytics system. But it also sounds like you still need to essentially reformat the data so that it's consumable. So it sounds like there's an ETL pipeline that's just much, much faster. But at the same time, with something like Replicate, it sounds like the data goes over without transformation. So help us sort of understand that nuance. Yeah, that's a great question, George. And indeed, in the past few years, customers have been focused predominantly on getting the data to the lake. I actually think that one of the changes in the theme we're hearing at the show and over the last few months is how do we move to start using the data, to create applications on the data. So we're kind of moving to the next step. In the last few years, we've focused a lot on innovating and creating the solutions that facilitate and accelerate the process of getting data to the lake from a large scope of systems, including complex ones like SAP, and also making the process of doing that easier, providing real-time data that can fit both streaming architectures as well as batch ones. So once you've got that covered, to your question, what happens next? And one of the things we found, and I think Verizon is also looking at it now and Arvind can comment on it later, what we're seeing is that when you bring data in and you want to adopt a streaming or continuous incremental type of data ingestion process, you're inherently building an architecture that takes what was originally a database, but you're kind of, in a sense, breaking it apart into partitions as you're loading it over time. So when you land the data, and Arvind was referring to a swamp, or some customers refer to it as a landing zone, you bring the data into your lake environment, but at the first stage that data is not structured, to your point, George, in a manner that's easily consumable. So the next step is how do we facilitate the next step of the process, which today is still very manual-driven, requires custom development and dealing with complex structures. So we actually are very excited. Here at the show we announced a new product by Attunity, Compose for Hive, which extends our data lake solutions. What Compose for Hive is exactly designed to do is address part of the problem that you just described, where when the data comes in and is partitioned, what Compose for Hive does is it reassembles these partitions and then creates analytic-ready data sets back in Hive. 
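[Editor's note: to make the partition-reassembly idea concrete, here is a rough, hypothetical PySpark sketch of the pattern Itamar describes, written by hand rather than generated by Compose for Hive. Change records landed incrementally by CDC are collapsed into a current-state, analytic-ready table by keeping only the latest change per business key. The table names, columns, and change-record conventions are assumptions for illustration, not Attunity's implementation.]

```python
# Hand-rolled sketch of collapsing CDC change partitions into a current-state table.
# (Illustrative only; a tool like Compose for Hive automates and maintains this logic.
#  Table names, columns, and the "op"/"load_ts" conventions are hypothetical.)
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ods-rebuild").enableHiveSupport().getOrCreate()

# Each CDC load lands as another partition of inserts/updates/deletes.
changes = spark.table("landing.sap_ap_changes")  # columns: doc_id, op, load_ts, ...

# Keep only the most recent change per business key.
latest = Window.partitionBy("doc_id").orderBy(F.col("load_ts").desc())
current = (changes
           .withColumn("rn", F.row_number().over(latest))
           .filter("rn = 1")
           .filter(F.col("op") != "D")   # drop rows whose latest change is a delete
           .drop("rn", "op"))

# Publish an analytic-ready, operational-data-store-style table back to Hive.
current.write.mode("overwrite").saveAsTable("ods.sap_accounts_payable")
```

In practice this logic has to be regenerated and re-run every time new change partitions land or the source schema evolves, which is the manual, custom-development burden the product is described as removing.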
So it can create operational data stores, it can create historical data stores, so then the data becomes formatted in a manner that's more easily accessible for users who want to use analytic tools, BI tools, Tableau, Qlik, any type of tool that can easily access a database. Would there be, as the next step, whether led by Verizon's requirements or Attunity's anticipation of broader customer requirements, something where there's a, it's not near real-time, but a very low-latency landing and transformation, so that data that is time-sensitive can join the historical data? Absolutely, absolutely. So what we've done is focused on real-time availability of data. So when we feed the data into the data lake, we feed it in two ways: one is directly into Hive, but we also go through a streaming architecture like Kafka, and in the case of Hortonworks that can feed very well into HDF. So then the next step of the process is producing those analytic data sets or data stores out of it, which we enable. And what we do is design it together with our partners and our customers. So again, when we worked on Replicate, and then when we worked on Compose, we worked very closely with Fortune companies trying to deal with these challenges so we could design the product. In the case of Compose for Hive, for example, we've done a lot of collaboration at the product engineering level with Hortonworks to leverage the latest and greatest in Hive 2 and Hive LLAP, to be able to push down transformations so those can be done faster, including in real time, so those data sets can be updated on a frequent basis. Got it. You talked about kind of customer requirements, either specific or not, and obviously we're talking to a telecommunications company. Are you seeing, Itamar, from Attunity's perspective, more of this need to, all right, the data is in the lake, or first it comes to the swamp, now it's in the lake, to start partitioning it, are you seeing this need driven in specific industries, or is this really pretty horizontal? Yeah, that's a good question. And this is definitely a horizontal need; it's part of the infrastructure needs. So Verizon is a great customer, and similarly to telecommunications, we've been working with other customers in other industries, from manufacturing to retail to healthcare to automotive and others. And in all of those cases, at the foundation level, it's very similar architectural challenges. You need to ingest the data, you want to do it fast, you want to do it incrementally or continuously, even if you're loading directly into Hadoop. Naturally, when you're loading the data through a Kafka or streaming architecture, it's in a continuous fashion. And then you partition the data. So the partitioning of the data is kind of inherent to the architecture, and then you need to provide help dealing with the data for the next step in the process. And we're doing it both with Compose for Hive, but also, for customers using streaming architectures like Kafka, we provide the mechanisms for supporting or facilitating things like schema evolution and schema decoding, to be able to facilitate the downstream process of processing those partitions of data so you can make the data available. That works both for analytics and streaming analytics, as well as for scenarios like microservices, where the way in which you partition the data or deliver the data allows each microservice to pick up only the data it needs from the relevant partition. Well, guys, it's been a really informative conversation. 
Congratulations on the new announcement that you guys made today. It's been great to hear the use case, and Verizon really sounds quite pioneering in what you're doing. We wish you continued success there, and we look forward to hearing what's next for Verizon. We want to thank you for watching theCUBE. We are again live, day two of the DataWorks Summit, hashtag DWS17. For my co-host George Gilbert, I am Lisa Martin. Stick around, we'll be right back.