Live from San Jose, in the heart of Silicon Valley, it's theCUBE, covering DataWorks Summit 2017, brought to you by Hortonworks.

Welcome back to theCUBE, day two of the DataWorks Summit. I'm Lisa Martin with my co-host, George Gilbert. We've had a great day so far, learning a ton in this hyper-growth world where big data meets IoT, machine learning, and data science. George and I are excited to welcome our next guests. We have Josh Klahr, the VP of Product Management from AtScale. Welcome, Josh, welcome back.

Thank you.

And we have Prashanti Patti, the head of Data Engineering for GoDaddy. Welcome to theCUBE.

Thank you.

Great to have you guys here. So, I wanted to talk to you about, one, how you're working together, but two, also some of the trends you're seeing. As we talked about, in the tech industry it's two degrees of Kevin Bacon, right? You two worked together back in the day at Yahoo. Talk to us about what you both visualized and experienced in terms of the Hadoop adoption maturity cycle.

Sure. You want to start, Josh?

Yeah, I'll start and you can chime in and correct me. But yeah, as you mentioned, Prashanti and I worked together at Yahoo, feels like a long time ago, in a central data group. And we had two main jobs. The first job was to collect all of the data from our ad systems and our audience systems and stick that data into a Hadoop cluster. At the time, we were doing it while Hadoop was still being developed. The other thing we did was support a bunch of BI consumers. So we built cubes, we built data marts, we used MicroStrategy and Tableau. And I would say the experience there was a great experience with Hadoop in terms of the ability to have low-cost storage and scale-out data processing of what were really billions and billions, tens of billions of events a day. But when it came to BI, it felt like we were doing stuff the old way. We were moving data off cluster and making it small.
In fact, you did a lot of that.

Well, yeah, at the end of the day, we were using Hadoop as a staging layer. We would process a whole bunch of data there, then scale it back down and move it into, again, relational stores or cubes, because basically we couldn't afford to give any accessibility to BI tools or to our end users directly on Hadoop. So while we really did large-scale data processing in the Hadoop layer, we failed to turn on the insights right there.

Okay. Maybe there's a lesson in there for folks who are on slightly more mature versions of Hadoop now but can learn from some of the experiences you've had. Were there issues in terms of having clean and curated data? Were there issues for BI with performance and the lack of proper file formats like Parquet? Where was it that you hit the wall?

It was both. You have to remember, we were probably one of the first teams to put a data warehouse on Hadoop. So we were dealing with early versions, like 0.5, 0.6. We were putting a lot of demand on the tooling and the infrastructure, and Hadoop was still at a very nascent stage at that time. So that was one. And I think a lot of the focus was on, hey, now we have the ability to do clickstream analytics at scale, right? So we did a lot of the backend stuff, but the presentation layer is where I think we struggled.

So would that mean, then, that you could do full resolution without sampling on the backend, and then you would extract and presumably denormalize, so that you could essentially run data marts for subject-matter interests?

Yeah, that's exactly what we did. We took all of this big data, but making it work for BI meant two things. One was performance: can you get an interactive query response time? The other was the interface: can a Tableau user connect and understand what they're looking at? You had to make the data small again.
And that was actually the genesis of AtScale, which is where I am today. We were frustrated with this big data platform and having to make the data small again in order to support BI.

That's a great transition, Josh. Let's actually talk about AtScale. You guys saw BI on Hadoop as this big white space. How have you succeeded there? And then let's talk about what GoDaddy is doing with AtScale and big data.

Yeah, we definitely took the learnings from our experience at Yahoo, and we really thought about, if we were to start from scratch and solve the problem the way we wanted it to be solved, what would that system look like? And it was a few things. One was an interface that worked for BI. I don't want to date myself, but my experience in the software space started with OLAP, and I can tell you OLAP isn't dead. When you go and talk to a Fortune 1000 enterprise about OLAP, that's how they think. They think in terms of measures and dimensions and hierarchies. So one important thing for us was to project an OLAP interface on top of data that's Hadoop-native: Hive tables, Parquet, ORC, all of the mess that may sit underneath the covers. The other thing was delivering performance. We've invested a lot in using the Hadoop cluster natively to deliver performant queries. We do this by creating aggregate tables and summary tables and being smart about how we route queries. But we've done it in a way that makes a Hadoop admin very happy. You don't have to buy a bunch of AtScale servers in addition to your Hadoop cluster; we scale the way the Hadoop cluster scales, so we don't require separate technology. We fit really nicely into that Hadoop ecosystem.

So, making the Hadoop admin happy is a good thing. How do you make the business user happy?
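The aggregate-table-and-routing idea Josh describes can be sketched in a few lines. This is a hypothetical, simplified illustration, not AtScale's implementation: precompute summary tables grouped by common dimension sets, then route each query to the smallest aggregate that covers the requested dimensions, falling back to a full scan of the fact table otherwise. All table and column names are made up.

```python
from collections import defaultdict

# Raw fact table: one row per event (on a cluster this would be Hive/Parquet).
events = [
    {"day": "2017-06-13", "country": "US", "product": "hosting", "revenue": 120.0},
    {"day": "2017-06-13", "country": "US", "product": "domains", "revenue": 45.0},
    {"day": "2017-06-13", "country": "IN", "product": "domains", "revenue": 30.0},
    {"day": "2017-06-14", "country": "US", "product": "hosting", "revenue": 200.0},
]

def build_aggregate(rows, dims):
    """Precompute a summary table grouped by the given dimensions."""
    agg = defaultdict(float)
    for r in rows:
        agg[tuple(r[d] for d in dims)] += r["revenue"]
    return {"dims": dims, "rows": agg}

# Aggregates defined up front (a semantic layer would derive these from the cube design).
aggregates = [build_aggregate(events, ("day",)),
              build_aggregate(events, ("day", "country"))]

def query_revenue(dims):
    """Route to the narrowest aggregate covering the requested dimensions;
    fall back to scanning the raw fact table otherwise."""
    for agg in sorted(aggregates, key=lambda a: len(a["dims"])):
        if set(dims) <= set(agg["dims"]):
            result = defaultdict(float)
            for key, value in agg["rows"].items():
                row = dict(zip(agg["dims"], key))
                result[tuple(row[d] for d in dims)] += value
            return dict(result)
    return build_aggregate(events, tuple(dims))["rows"]  # full-scan fallback
```

Here `query_revenue(("day",))` is answered from the small per-day aggregate without touching the raw events, which is the source of the interactive response times discussed above.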
Who needs now, as we were hearing yesterday, to merge more with the data science folks, to be able to understand, or even have the chance to articulate, these are the business outcomes we want to see. How do you, maybe under the hood at AtScale, make the business guys and gals happy?

Yeah, I'll share my opinion and then Prashanti can comment on her experience. As I mentioned before, business users want an interface that's simple to use. So that's one thing we do: we give them the ability to just look at measures and dimensions. If I'm a business analyst who grew up using Excel to do my analysis, the thing I like most is a big, fat, wide table. And so that's what we do: we make an underlying Hadoop cluster, what could be tens or hundreds of tables, look like a single big, fat, wide table for a data analyst. Whether you talk to a data scientist or a business analyst, that's the way they want to view the world. So that's one thing we do. And then we give them response times that are fast, we give them interactivity, so that you can really quickly start to get a sense of the shape of the data.

And that's allowing them to get to that time to value?

Yes.

I can imagine. Just to follow up on that, when you have to prepare the aggregates, essentially like the cubes, instead of the old BI tools running on a data mart, what is the additional latency required between data coming fresh into the data lake and transforming it into something that's consumption-ready for the business user?

Yeah, I think I can take that. So again, if you look at the last 10 years, in the initial period, certainly at Yahoo, we just threw engineering resources at that problem. We had teams dedicated to building these aggregates. But the whole premise of Hadoop was the ability to take in new, unstructured data as it arrives, and instead we had a team finding out about the new data coming in and then integrating it into the pipeline by hand.
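The "big, fat, wide table" idea above is just denormalization of a star schema: every dimension table is joined onto the fact table so the analyst sees one flat table. A minimal sketch, with made-up table and column names:

```python
# Fact table plus two dimension lookups, standing in for what could be
# tens or hundreds of tables on the cluster.
orders = [
    {"order_id": 1, "customer_id": 10, "product_id": 100, "amount": 45.0},
    {"order_id": 2, "customer_id": 11, "product_id": 100, "amount": 45.0},
]
customers = {10: {"customer_name": "Acme", "region": "US"},
             11: {"customer_name": "Globex", "region": "EU"}}
products = {100: {"product_name": "domain", "category": "registration"}}

def widen(fact, dims):
    """Join every dimension onto each fact row, yielding one wide row per fact."""
    wide = []
    for row in fact:
        flat = dict(row)
        for key_col, lookup in dims.items():
            flat.update(lookup[row[key_col]])  # denormalize the dimension in
        wide.append(flat)
    return wide

wide_table = widen(orders, {"customer_id": customers, "product_id": products})
```

Each row of `wide_table` now carries the order amount alongside customer name, region, product name, and category, which is the single flat view an Excel-trained analyst expects.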
So we were adding a lot of latency, and we needed to figure out how we could do this in a more seamless, more real-time way, and get the real promise of Hadoop into the hands of our business users. I think that's where AtScale is doing a lot of good work, in terms of dynamically being able to create aggregates based on the design that you put in the cube. So we are starting to work with them on our implementation, and we are looking forward to the results.

Tell us a little bit more about what you're looking to achieve. GoDaddy is a customer of AtScale; what are you looking to build together, and where are you in that journey right now?

Yeah, so the main goal for us is to move beyond predefined models, dashboards, and reports. We want to be more agile with our schema changes, so time to market is one. And performance: the ability to put BI tools directly on top of Hadoop is another. And also to push as much of the semantics as possible down into the Hadoop layer. Those are the things we're looking to do.

So that sounds like classic business intelligence components, but rethought for a big data era.

I love that quote, I'm going to steal it. That's exactly what we are trying to do.

But some of the things you mentioned are non-trivial. Time goes into the pre-processing of data so that it's consumable, but you also want it to be dynamic, which is a trade-off, and that takes time. So is that a set of requirements, a wish list, for AtScale? Or is that something you're building on your own?

I think there's a lot happening in that space. They were one of the first people to come out with a product solving a real problem that we had tried to solve for a long time. And as we start using them more and more, we'll surely be pushing them to bring in more features.
I think the algorithm they have to dynamically generate the aggregates is something we are giving quite a lot of feedback on.

One of our last guests, from Pentaho, mentioned in her keynote today a quote, from a McKinsey report I think, that said something like 40% of machine learning data is either not fully exploited or not used at all. So tell us, where is GoDaddy regarding machine learning? What are you seeing at AtScale, and how are you going to work together to maybe venture into that frontier?

Yeah, I think one of the key requirements we place on our data scientists is that not only do you have to be very good at your data science job, you have to be a very good programmer too, to make use of the big data technologies. And we are seeing some interesting developments, like very workload-specific engines coming into the market now, for search, for graph, for machine learning as well, which are supposed to put the tools right into the hands of data scientists. I personally haven't worked with them enough to comment, but I do think the next realm of big data is these workload-specific engines running on top of Hadoop and realizing more of the insights for end users.

Just curious, can you elaborate a little more on those workload-specific engines? That sounds rather intriguing.

Well, for interacting with Hadoop on a real-time basis, we see search-based engines like Elasticsearch and Solr, and there is also Druid. At Yahoo we were quite a big Druid shop, actually, and we were using it as an interactive query layer directly between our BI applications, the JavaScript-based BI applications, and Hadoop. So I think there are quite a few means to realize insights from Hadoop now, and that's the space where I see workload-specific engines coming in.
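As a concrete flavor of the "interactive query layer" role Druid plays here, a hedged sketch of building a Druid native timeseries query and posting it to a broker. The datasource name, metric, and broker URL are made up for illustration; the JSON shape follows Druid's native query format.

```python
import json

def druid_timeseries_query(datasource, start, end):
    """Build a Druid native timeseries query summing a 'clicks' metric by day."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "day",
        "intervals": [f"{start}/{end}"],
        "aggregations": [
            {"type": "longSum", "name": "clicks", "fieldName": "clicks"}
        ],
    }

query = druid_timeseries_query("clickstream", "2017-06-01", "2017-06-14")
payload = json.dumps(query)
# To execute, POST `payload` to the broker's query endpoint, e.g. with requests:
#   requests.post("http://druid-broker:8082/druid/v2", data=payload,
#                 headers={"Content-Type": "application/json"})
```

The point of this style of engine is that a BI front end can fire such queries interactively and get sub-second answers from pre-ingested event data, rather than launching a batch job on the cluster.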
And you mentioned earlier, before we started, that you were using Mahout, presumably for machine learning, and I thought the center of gravity for that type of analytics had moved to Spark, but you haven't mentioned Spark yet.

We are not using Mahout; I mentioned it as something that's in that space. But yeah, Spark is pretty interesting, Spark SQL in particular. Doing ETL with Spark, as well as using Spark SQL for queries, is something that has looked very, very promising lately.

Quick question for you from a business perspective. You're the head of data engineering at GoDaddy. How do you interact with your business users, the C-suite for example, as they're embracing big data more and more and leveraging Hadoop as an enabler? What's the conversation like? Or maybe even the influence of the GoDaddy business C-suite on engineering: how do you work collaboratively?

Yeah, so we do have a very regular stakeholder meeting, and these are business stakeholders. We have representatives from our marketing, finance, and product teams, and the data science team; we consider data science one of our customers. We take requirements from them, we give them a peek into the work we are doing, and we also let them be part of our agile team, so that when we have something released, they are the first ones looking at it and testing it. So they're very much part of the process. I don't think we can afford to just sit back and work on a monolithic data warehouse and at the end of the day say, hey, here's what we have built, go get the insights from it. It's a very agile process, and they are very much part of it.

One last question for you, sorry, George. You mentioned you're still early in your partnership, unless I misunderstood.
What have you achieved, what has AtScale helped GoDaddy achieve so far, and what are your expectations for, say, the next six months?

We want the world.

Just that, yeah.

But the premise is, I mean, Josh and I were part of the same team at Yahoo, where we faced the problems that AtScale is trying to solve. So the promise of being able to solve those problems, which is right there in their name, basically delivering data at scale, that's what I'm very much looking forward to from them.

Well, excellent. We want to thank you both for joining us on theCUBE. We wish you the best of luck in attaining the world.

There we go, thank you.

Excellent, guys. Josh Klahr, thank you so much, Prashanti.

My pleasure.

Thank you for being on theCUBE for the first time.

No problem.

You've been watching theCUBE, live at day two of the DataWorks Summit. For my co-host George Gilbert, I'm Lisa Martin. Stick around, guys, we'll be right back.