Live from Orlando, Florida, extracting the signal from the noise, it's theCUBE, covering Pentaho World 2015. Now your hosts, Dave Vellante and George Gilbert.

Welcome back to Orlando. This is theCUBE. This is our first year at Pentaho World 2015; this is the second Pentaho World. Pentaho, of course, was just acquired by Hitachi. The company does, well, you know what Hitachi does; Pentaho builds a platform for blending data and operationalizing analytics. Mike Weis is here. He's with NASDAQ, a Pentaho user. Michael, thanks very much for coming to theCUBE.

Thanks for having me.

So you keynoted last year.

Yep.

So we'll talk about that a little bit, kind of what's changed. But before we do, what's your role at NASDAQ?

I am a senior manager overseeing the team that is in charge of the data warehouse and NASDAQ's billing systems. We do mostly batch-analytic stuff, and we're going to be looking to do some streaming stuff in the near future.

Yeah, so back in the day I used to talk to the folks at NASDAQ about the data warehouse, and they would describe the hell that was managing it. Not just NASDAQ, but others too. The industry would chase chips: a new chip would come out and, oh great, grab that. Ingesting data, the data was exploding, trying to shove it all through the thin pipe into a box and manage it. A very challenging environment. How has that changed?

I think the emergence of AWS has really helped solve that challenge, at least for us. It makes it easier to manage, provision, scale up, scale down, and really use what we need, no more, no less. We're big into Redshift. We have cut costs significantly by using Redshift versus storing the data internally, and we're actually looking to do even further cost reductions by moving to our new NASDAQ data platform, which is going to be built on top of S3, EMR, and Presto to maintain a longer data history. The biggest thing we have right now is a lot of SQL Servers running around NASDAQ that have data from 2000, 2001, all these previous years, and we haven't empowered the analysts to easily take that data, look at it, and see historical trends and things like that. That's where we're trying to focus now.

Okay, so what's changed in your conversations? You gave a keynote last year. Maybe you could touch on some of those themes, but what's changed in the past 12 months?

So last year when we keynoted, we discussed how you can do complex analytics with a relatively small team. Even though NASDAQ is a pretty big company, we are maintaining all that data with a team of six developers, and we have four dedicated business analysts constantly looking at that data. Last year it was half those numbers. The theme of last year's keynote was that you can do a lot of good things with big data with relatively few people. The biggest change is that we have more people looking at it. We're now empowering more data-scientist-like people at NASDAQ, our Economic Research Department and those folks, to really start diving into other interesting facets of the data that we haven't yet touched.

So you can do big things with a small team, but you've got to have the right team, right? I mean, this stuff is pretty complicated for a lot of companies, so NASDAQ maybe is a unique case. What are you seeing in the marketplace? Is Hadoop, big data analytics, simplifying enough for the average bear to be able to consume it?
Or are guys like NASDAQ going to maintain some kind of competitive advantage because you've got the brainiacs who can handle this stuff?

I think it is becoming simpler. If you look at what has come into the Hadoop ecosystem over the past 12 months, the Prestos and the Sparks of the world, I think they are going to end up revolutionizing the space, because they're going to make it really easy to leverage the Hadoop ecosystem. They're all relatively new, and I just see that growing and making things a lot easier. I don't think it's going to take very specialized skill sets in the near future; these tools are going to empower a lot of other users. And then you have vendors like Pentaho who are also now building in that space. With their PDI tool, they're enabling your everyday ETL developers to leverage some of that ecosystem as well, without having to be that knowledgeable.

This is like the holy grail of big data, putting insights and analytics in the hands of users, and to the extent that that occurs, it's a game changer. But I want to back up. Talk about your infrastructure and the applications that you're supporting, and then I want to get into where Pentaho fits and what the data pipeline looks like.

Okay. So we currently have an in-house workflow engine that is constantly pulling data all day from our various markets and edge systems, as we refer to them: reference data, membership data, security data, all the other data that enriches our data. We're constantly pulling that data into S3 all day, and then at night we dump it into Redshift. Once we do that dump into Redshift, that's where we start leveraging PDI to clean that data, combine it, and spit it back out in a more usable fashion for our analysts to really get into and look at. That's, at a high level, the ecosystem now. As I mentioned earlier, we're looking to move to more streaming, more real-time stuff in the next 12 months as well: the Kafkas, the Sparks, the Storms of the world. I think that's going to be the next revolutionary step in our process.

And what about Kinesis? Does that fit into your plans at all?

It's a possibility. We're going to have to look at the whole gamut, right? You don't want to make a decision without looking at all the options, and I think that's critical.

So the trade-off there is that it's all integrated into Amazon, versus some of the other momentum you're seeing in the industry. So that's a TBD.

Yep. We are very early in that process, and we haven't ruled anything out in terms of technology.

I'm curious, if we just step up one level: you talked about this batch-oriented billing pipeline. What would you be able to do if you were on a really low-latency pipeline and you were using streaming analytics, or just streaming data to feed the pipeline?

Yeah, so a couple of different things. Surveillance is a big one. Intraday surveillance: right now we're a T-plus-one surveillance shop. If we had a more streaming, lower-latency system, we could do intraday surveillance, notice patterns, detect patterns earlier in the market, and prevent fraud and people doing bad things, right? Economic research and sales could also see the trading behavior of our member firms and make determinations, like maybe they have to reach out to a firm and see what's going on. Maybe there's a pattern of a firm trading one stock versus another that we want to take advantage of and look at.
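To make that concrete, here is a minimal sketch of what a first intraday check might look like: a sliding one-minute window per firm and symbol that flags an outsized order-to-fill ratio. The message fields, the metric, and the threshold are all illustrative assumptions, not NASDAQ's actual surveillance logic:

```python
from collections import defaultdict, deque
from datetime import timedelta

WINDOW = timedelta(minutes=1)   # look-back window per firm/symbol
RATIO_LIMIT = 50.0              # illustrative order-to-fill threshold

windows = defaultdict(deque)    # (firm, symbol) -> messages in the window

def on_message(msg):
    """Feed every order message here, e.g. from a Kafka or Kinesis consumer
    loop. Expects a dict with hypothetical fields: firm, symbol, executed,
    ts (a datetime)."""
    key = (msg["firm"], msg["symbol"])
    win = windows[key]
    win.append(msg)
    # Evict messages older than the window.
    while msg["ts"] - win[0]["ts"] > WINDOW:
        win.popleft()
    orders = len(win)
    fills = sum(1 for m in win if m["executed"])
    if fills and orders / fills > RATIO_LIMIT:
        print(f"ALERT {key}: {orders} orders vs {fills} fills in the last minute")
```

A real deployment would read from Kafka or Kinesis and route alerts somewhere durable, but the sliding-window shape is the core of moving from T-plus-one to intraday checks.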
So I think the lower latency, the intraday stuff, those are the real use cases we're going to have for it.

Just to be clear, when you talk about economic research and sales: in other words, if you see a firm behaving in a way that would not be recommended by your economic, your macro work, you would go to the client and say, hey, based on our models we think you should be doing something different?

Right. This is all going to be based on pricing and tiering, how we charge firms. If we notice a firm is trading in a certain type of liquidity intraday, we can recommend they change their trading behavior to meet certain thresholds and get better pricing, right? So our salespeople can reach out and say, hey, you're adding this type of liquidity to the market, but if you do X amount more today, you can get to the next threshold and get these added bonuses. And if we had that capability intraday, I think it would be a good insight for sales and drive more flow to the market, to be quite honest.
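The threshold arithmetic behind that sales call is simple to sketch. All of the tiers and rebates below are invented for illustration; real exchange pricing schedules are more involved:

```python
# Hypothetical liquidity tiers: (minimum shares of liquidity added per day,
# rebate earned per share). Numbers are made up for illustration only.
TIERS = [
    (50_000_000, 0.0032),
    (10_000_000, 0.0030),
    (0,          0.0025),
]

def rebate_per_share(shares_added: int) -> float:
    """Rebate the firm currently earns, given today's liquidity added."""
    for minimum, rebate in TIERS:
        if shares_added >= minimum:
            return rebate
    return 0.0

def shares_to_next_tier(shares_added: int) -> int:
    """The sales pitch: 'do this many more shares today and you hit the next tier.'"""
    higher = [minimum for minimum, _ in TIERS if minimum > shares_added]
    return min(higher) - shares_added if higher else 0
```

With intraday data, a salesperson could run exactly this check against a firm's running total before the close rather than the day after.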
So, did you see the FINRA keynote this morning?

I did.

Okay. Is it a similar situation, where you're ingesting all these different data types? You bring it in, you blend it with Pentaho, you're using S3 as sort of an infinite data store, and then I guess the difference is you're pushing it to Redshift at night. Talk about the process of operationalizing that and getting it into the hands of users.

Sure. It's been a bit of an experience getting it to the end users. Users don't always know what they want, so you kind of have to lead them a little bit. The way we operationalize, the way we work, is a very iterative approach to how we design data. We'll get some set of initial requirements from our BAs, work with them, design the initial process, and then let them play with the model and go back and forth in that manner. Obviously there's also making sure the whole environment is up and running, so we have an operations team monitoring 24-7, making sure the system is up, the data is always available, and we're meeting all our SLAs. So we've got teams watching that system 24-7 to make sure it doesn't go down and everything is running for the BAs to get in there and do their work.

So what about that secret sauce of embedding the analytics into the operational procedures so that a business user can actually take advantage of them? Because traditionally, in the BI and EDW world, you'd get these insights and two or three people would have them, and they became a bottleneck, or they didn't have the context, or it was too late. That seems to be a huge advantage for you guys. How does that all work?

Well, I think NASDAQ has always had a culture of being on top of data. We're a very data-driven business in general. If you don't know your market, you're not going to succeed in the exchange world. So we've always been data-driven, and I think that's really helped drive where we're at now, because to your point, having those one or two or three very specialized people be the only ones able to access the data is a bottleneck. By making it more general, cleansing the data and exposing it in a manner that's easier to query, where someone doesn't necessarily have to know SQL to go in and really get information out of it, you're now empowering a whole other level of users: more of your executives, your higher-up BAs, the people who may not be down in the gritty details. And I think Pentaho has really helped speed up that process and allowed a new set of users, who have always wanted that data but had to wait for someone else to generate the aggregates and send them over, to get in there and do it for themselves.

How much of this is reporting for systems of record versus operationalizing something that's going to affect the business? I mean, you touched on this with surveillance pattern recognition and the economic research and sales. Where are you in terms of rolling out traditional reporting versus prescriptive reporting?

Traditional reporting we have rolled out all over. Traditional reporting was the first thing we actually accomplished when we did the initial move to Redshift and Pentaho, mainly because that stuff was already in place, so we couldn't just turn it off and not have any reporting. We had to replace it, make it better, and make it more efficient. That's where we focused initially, and now we're going back and saying, what else can we do with this technology to really enrich our users' experience, letting them ask different questions that can't be answered in a canned report? That's where the Pentaho data refinery is going to play a key role: the ability for people to go in there, design their own workflows, materialize data only when they need it, for as long as they need it, and then get rid of it, so we're not taking the hit of storing that large data set for long periods of time.

But users, I mean, serious business kinds of users, really have trouble with dimensions greater than about three, even when they're presented with a cube and they can slice and dice. So what data are you presenting to them? How much refining have you done?

It depends on the level of the user. For the people we targeted, mainly for the billing systems, we've done a decent amount of refining of the data. We've really removed a lot of the white noise around the data: orders that didn't execute, cancellations, bad data points. Sometimes you get weird characters in our data streams, so we clean that up and make it actually presentable data. Because to your point, if you just dumped everything on the plate of your typical business person, they're going to look at it and say, there's way too much here, I don't even know where to start. They get overwhelmed and then they're discouraged. So we try not to do that to the current users we're targeting, by doing a decent amount of cleansing on that data set.

So it's not just cleansing; you're trying to make a business judgment as to what dimensions they really want to look at.

Right. As I brought up earlier, what we'll typically do is try to get a handful of data points the business person wants to look at, and then we expand upon that, so they're looking at it little by little. We're not just dumping a model of 50 attributes on them at once.

Right, not to overwhelm them.

Right. We'll give them 10 to start with, let them play: what works, what doesn't work, what else do you need? And we just build up from there. That process has actually really worked, because now they can be focused on a couple of attributes and make sure those actually do what they're expecting.
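A refinement pass in that spirit, dropping unexecuted orders, cancellations, and stray characters, then exposing a starter set of roughly ten attributes, might look like this in pandas. Every column name here is a hypothetical stand-in, not NASDAQ's actual schema:

```python
import pandas as pd

raw = pd.read_csv("daily_orders.csv")  # hypothetical raw extract

# Remove the white noise: orders that never executed, cancellations,
# and non-printable characters that creep into the symbol field.
clean = raw[raw["executed_qty"] > 0]
clean = clean[clean["msg_type"] != "CANCEL"]
clean["symbol"] = clean["symbol"].str.replace(r"[^\x20-\x7E]", "", regex=True)

# Start users with ~10 attributes, not 50; expand as they ask for more.
starter = ["trade_date", "firm", "symbol", "side", "executed_qty",
           "price", "venue", "liquidity_flag", "fee_tier", "charge"]
clean[starter].to_csv("refined_orders.csv", index=False)
```

The business judgment lives in that `starter` list: which handful of dimensions the analysts see first, before the model is built up little by little.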
So how is this different? How does this whole pipeline look different from when you were doing the data warehouse, when all the data was carefully curated and you had the ETL tool and the visualization tool? What does it look like, and why is it different now?

Prior to where we're at now, we actually didn't do much in terms of ETL. We kind of dumped the raw data into a data warehouse and said, here, go figure it out, right? To the average business person that's not an acceptable answer; they're not developers, they're not DBAs. But that is the approach we had taken at that point, mainly because at the time there weren't many tools that could handle the large quantities of data we had, especially around equity data. With Pentaho and some of the newer toys in AWS, we're now able to handle those larger data sets, do proper ETL, do proper curation, really clean it up and button it up, so that when the end user comes to get it, it's just there. They don't have to worry about relationships; they don't have to worry about going back and figuring out what message relates to what message and how. We removed that from the process.

So in other words, because it's all in one tool, the ETL is linked to the governance, it's linked to the analytics, and that can be embedded in another app if necessary. So this is the case for end-to-end simplification.

Right.

Okay. And all of this happens in the cloud, in AWS, correct? What are you doing on-prem?

On-prem, so in general at NASDAQ, all our trading platforms are on-prem. For us, we talk to all our internal systems on-prem and then ship the data to the cloud. We have essentially a gateway between NASDAQ and the cloud, and that gateway is what we run on-prem: it pulls the data, sends it up to S3, and then tells other systems in the cloud to operate on it.

So it's not a ridiculous amount of data, is that fair, or is it?

We do about 18 to 20 billion records of data per day up to the cloud.

So, I mean, it depends on what your... Capacity-wise, I mean, it's not like a petabyte a day.

Yeah, it's not that bad. It's very manageable to pass it up to the cloud and then process it.

And that process you just described, given the volume of data, do you see it running out of gas, or is it not a huge amount of data, I'm presuming?

No. I mean, look at a company like FINRA, who takes a very similar approach to what we do, and they're at scales of five to ten times what we do, because they're dealing with all exchanges, all ATSs, all broker-dealers and all that. They're getting larger quantities of data and doing the exact same thing, so I don't see it running out of gas. We hit a new high-water mark back in August, I think we got to 20, 25 billion messages in one day, and no one even batted an eye; we didn't even notice. So bandwidth is not the gate.

Right. Compute and storage, which you've got plenty of, effectively infinite.

Pretty much. That's why we're in AWS: we can scale to what we need.

So it's really people and process.
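Earlier Michael described the on-prem gateway pulling data, sending it up to S3, and letting the cloud systems take over. As a rough sketch of that hop, assuming boto3 and psycopg2 (Redshift speaks the PostgreSQL protocol) with entirely hypothetical bucket, table, and role names:

```python
import boto3
import psycopg2  # connects to Redshift's PostgreSQL-compatible endpoint

# 1. The on-prem gateway pushes the day's extract up to S3.
s3 = boto3.client("s3")
s3.upload_file("market_data_20151013.gz",
               "nasdaq-data-platform",
               "daily/market_data_20151013.gz")

# 2. The nightly job then tells Redshift to load it, in parallel,
#    straight from S3.
conn = psycopg2.connect(host="warehouse.example.redshift.amazonaws.com",
                        port=5439, dbname="dw", user="loader",
                        password="...")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY staging.market_data
        FROM 's3://nasdaq-data-platform/daily/market_data_20151013.gz'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader'
        GZIP DELIMITER '|';
    """)
```

The COPY-from-S3 pattern is what makes the nightly Redshift dump cheap at this scale; the PDI cleanup and blending then runs against the loaded staging tables.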
Now, we heard Mike Olson talk about the evolution of Hadoop, the 10th birthday, so HDFS and MapReduce and HBase, and now we see all kinds of tools coming in, but Spark is obviously getting a lot of attention, and it's something you're looking at. How has your tools evolution progressed? I wonder if you could talk about that a little bit, and where Pentaho fits in; I'm interested in where they fit.

Yeah, especially in the past 12 months to two years, if you really look at the Hadoop ecosystem, I think it's now become much more viable for most people to use, and that's why we're now looking to get more into it. If you went there before, you were maybe a little ahead of the curve, and I don't think it was operationally ready, is the way I would phrase it. It was nice as a side project, but for an enterprise it wasn't quite there. As it has evolved over the past 12 months, we've been able to take the next step in our evolution and say, all right, Redshift is great, but how do we cut our costs even further and store even more data? With the tools being offered now, with EMR and Presto and Parquet, we can evolve our architecture to handle those larger data sets.

The key there is agility. I mean, you're saying a year; you couldn't do that in a year in the old days. It would take five years and you'd never get there, everything would change so much. You're able to respond much, much more quickly.

Yep.

But just to elaborate on that comment about the Hadoop ecosystem becoming viable in the last 12 to 18 months: you went on to mention Presto, Parquet, Redshift. Is it that those services complemented things that were too difficult?

Yes.

Okay, so you needed those.

Right, to make that stuff possible. I was in the analyst round table earlier, and the question came up, why is Hadoop so complicated? I think it depends on who you ask. And I think these tools that are coming out, the Sparks and the Prestos and those types of tools, are going to make Hadoop more available for everyone, because I see them rapidly converging on a simple use of Hadoop. And I think Pentaho is now lining themselves up with that as well.

Well, of course, the other thing is Hadoop is not just Hadoop.

Yeah, fair enough.

I mean, pick your tool, right? And so you have to have expertise. We know from our own experience: we have a tool with CrowdChat, and it's complicated to get that up and running. I feel like there are 10 new Apache projects every day, all around the Hadoop ecosystem.

Right. You struggle with HBase, you try DynamoDB and that simplifies things, but then there are all these other tools that come out, so there's this tools creep that's very, very rapid. How do you deal with that complexity?

And adding on to that, the fact that vendors are no longer all supporting the same core set of tools; we're seeing fragmentation. So it's no longer second-source just because it's open source.

Right, yeah. I think it was described earlier as needing a heat shield to guard yourself from all the new technology coming out. Obviously it's not just a problem for us, it's a problem for everyone, right? I think good evaluation of tools, not always being eager to jump on everything new, is how you ultimately protect yourself from that. Too often as developers we want to get our hands on the shiny new toy, right? But you've got to take a step back and really focus on what the business needs are. If you focus on the business needs, you can figure out which tools will fit those needs.

Right, but at the same time, you can't ignore these trends, because they could be game changing for your business.
Yep. You want to keep a careful eye on new technology and make sure you're always looking for the next big thing, but you don't want to jump in head first without proper evaluation.

We're almost out of time, but go ahead, George.

One question on that viability and maturity over the last 12 to 18 months. Do you, and do some of your peers at your company or elsewhere, see the native cloud services at Amazon or Azure or Google as potential competitors to the Hadoop ecosystem approach?

I don't think so; I actually think they're complementary. If you go back five or ten years, when the cloud was first becoming a thing, people were going private cloud. If you look at the public cloud offerings now, they're far more viable than private clouds from a cost perspective, and you're actually seeing the trend go in the opposite direction. Five or ten years ago, people were saying private cloud, private cloud, private cloud. Now more people are taking a step back and saying, wait a second, this is what these guys do, let's really take a look at this. Then they look at the cost savings you get from going public. I think the two can complement each other in the long run.

You mean like running a Hadoop cluster or clusters locally and having something in the cloud. We couldn't agree more; we've been on this bandwagon for a while. Amazon, and throw in Google and Microsoft as well, their cost of provisioning infrastructure is tracking software economics, so at volume their cost of doing that goes toward zero. You can't do that on premises.

And it's funny, every time I think they can't take prices any lower, they seem to. The pace of innovation is phenomenal.

It's interesting times, hard times in hardware, but from a practitioner's standpoint, it's nirvana. So Michael, thanks very much for coming to theCUBE. Really great to have you.

Thank you for having me.

All right, keep it right there, everybody, we'll be back right after this word. This is theCUBE, live from Pentaho World 2015 in Orlando. Right back.