Live from New York, extracting the signal from the noise. It's theCUBE, covering Spark Summit East, brought to you by Spark Summit. Now your hosts, Jeff Frick and George Gilbert. Hey, welcome back everybody. Jeff Frick here with theCUBE. We are live in Midtown Manhattan at the Spark Summit East conference. It's our first time at Spark Summit; we did a little drive-by last year in San Francisco, but we're excited to bring the whole CUBE production this year to Spark Summit East. A lot of good energy here, a lot of good sessions, very geeky, very techy. Still early days for this conference, which is really an exciting time, because people are really about sharing information, sharing best practices, a lot of hallway chatter. So we're excited to be here. We've got George Gilbert from Wikibon, our big data analyst. Welcome, George. In this segment, we have a CUBE alum we'd love to have on all the time: Chris Selland, VP, Big Data, from HP Enterprise, or HPE for short. Welcome, Chris. Thanks so much, Jeff. And George, great to see you as always. So good to see you. HPE, what are you doing here at Spark Summit? Well, like you said, I'm doing a bit of a drive-by myself, or in this case a train-by, since I took the Acela down from Boston yesterday. But we really wanted to get a sense of what's going on here. I mean, there's so much enthusiasm and excitement in this community. And as you said, I was in the keynotes this morning; it's still a very techy, geeky audience. You don't see many keynotes where you actually see code snippets up on screen. So this is sort of where Strata was five years ago, and ironically, it's in the same venue. The same location, right? There's a lot of excitement and enthusiasm building around Spark, although it's still, as you said, also very early days. So we're trying to get a sense of, we know what we're doing, but more importantly we're trying to get a sense of what customers are doing. And so, from your drive-bys and from the natural feedback from the field, what are some of the top priorities customers are saying they would like to be able to do to marry the two technologies, Spark and Vertica, or the HPE software portfolio? Yeah, well, we built a connector from Vertica to Spark, which we actually released in beta back last fall, announced along with our newest release of Vertica. And really what we've been doing to date is letting customers play with it. It lets you do a few things, and we've really found two primary use cases. One is ETL. There were customers experimenting with MapReduce for ETL, and some of them have had performance issues and haven't been happy, so there's some experimentation going on with Spark that seems to be fairly successful. And I shouldn't say formal ETL per se, but ETL-like capabilities for transforming and loading data. And then the other one, of course, has been analytics and modeling. Based on what I drove by or heard this morning in the keynotes, that seems to be where the most enthusiasm is right now from customers. Not surprising, because of some of, you know, the robustness there. At the same time, it was interesting this morning listening to some of the modeling that companies were doing with Spark, because it's still pretty simple stuff, and it's on smaller data sets, and so on and so forth.
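To make the ETL use case concrete, here is a minimal sketch of the extract-transform-load pattern described above. It assumes plain Spark JDBC with the Vertica JDBC driver on the classpath rather than the beta connector itself, whose exact API isn't covered in the conversation; the host, credentials, and table names are hypothetical.

```python
# A sketch of the ETL-like pattern: pull raw rows from Vertica into Spark,
# transform them, and write the result back. Uses generic Spark JDBC with
# the Vertica driver; the HPE beta connector's own API may differ.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("vertica-etl-sketch").getOrCreate()

jdbc_url = "jdbc:vertica://vertica-host:5433/analytics"  # hypothetical host/db
props = {"user": "dbadmin", "password": "...",
         "driver": "com.vertica.jdbc.Driver"}

# Extract: read raw clickstream rows from Vertica into a Spark DataFrame.
raw = spark.read.jdbc(jdbc_url, "raw_clicks", properties=props)

# Transform: cleanup and a daily per-user aggregation, done in Spark.
cleaned = (raw
           .filter(F.col("url").isNotNull())
           .withColumn("day", F.to_date("event_ts"))
           .groupBy("user_id", "day")
           .agg(F.count("*").alias("clicks")))

# Load: write the transformed result back to a Vertica table.
cleaned.write.jdbc(jdbc_url, "daily_clicks", mode="append", properties=props)
```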
And that's really where we see the opportunity, because obviously we've got a platform that really scales and provides very high performance across very large volumes of data. So the ability to take those models, make them persistent, make them shareable, and make them usable across very large data volumes, that's really where we see the opportunity. Let me drill into that a little more. Where would a customer be prototyping? Perhaps, if they're a Vertica customer, prototyping a model in Spark, operationalizing it in Vertica, not in Spark code but in R perhaps, then running it on the full data set, and then evolving the model again in Spark? And what they're doing in Spark may not even be a prototype. It may be a full model, but a full model run on a sample of data, or only run on a smaller data set. Something like clickstream analytics is a great example, right? I might do my modeling on what's going on on the website right now in Spark, in memory, because it's very fast, very high performance, but there's only so much clickstream data I'm going to be able to store in memory, and there is no persistence engine in Spark. We've been spending a lot of time with the Databricks guys, because we have a lot of common customers, talking about clickstream analytics, and maybe a better example would be social gaming. As you probably know, Vertica has been very, very strong in the gaming industry: what's the customer doing in the game right now, and those sorts of analytics. Instrumenting it to find out usability. Right, and the right-now data, that may be something I can model and build in memory, but then I might want to run those analytics across my entire customer base. So if I want to know what George is doing right now in a mobile game, I can maybe model that on a more in-memory configuration, might be a better way to say it, but if I want to know what all customers are doing and how it maps against what George is doing, or what George has done longitudinally over time, you get the analogy for a clickstream customer. That's a little bit higher latency, a little more lead time to get to that sort of calculation. But one of the things that we offer, and why we've been so popular, is we still offer really, really fast performance, but you can extend the model onto much broader sets of data. So that's exactly the idea. And then maybe I have older data, because we talk a lot about hot, warm, and cold data. Maybe it's older, less frequently used, but data I might want to hit sometimes, that I want to keep in HDFS. That's what we've done with our Vertica for SQL on Hadoop product: again, let you use Vertica on that data. You take somewhat of a performance hit, because the access and the I/O are a little bit slower, but at the same time the cost is lower as well. So you get the ability to do analytics across the entirety of your dataset and use the same models, same interfaces, and so on and so forth. So how do you think about, let's say you have three to five years of historical data that you've accumulated, and I guess you might call that cold, or archive. And then you have the most recent 90 days. Right. And I'm assuming that's in Vertica. And maybe the most recent 90 seconds. Yeah, 90 seconds. Yeah, exactly, yeah. So let's say that's in the high-value solution. Do you have Spark doing the rich analytics on top of that, or does Vertica have everything it needs?
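A hedged sketch of the division of labor Chris describes, reusing the hypothetical connection settings from the sketch above: fit a model on the recent, in-memory-sized slice in Spark, then push the fitted coefficients down as SQL so Vertica can score the full history in place. The churn framing, tables, and columns are all hypothetical.

```python
# Fit on the recent sample in Spark; score all history in Vertica via SQL.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# The most recent 90 days, small enough to model in memory.
recent = spark.read.jdbc(jdbc_url, "clicks_recent_90d", properties=props)

features = ["sessions", "clicks", "minutes_played"]
assembled = (VectorAssembler(inputCols=features, outputCol="features")
             .transform(recent))

model = LogisticRegression(labelCol="churned",
                           featuresCol="features").fit(assembled)

# Translate the fitted linear model into a SQL expression Vertica can
# evaluate over years of history, instead of shipping that history to Spark.
coefs = model.coefficients.toArray()
terms = " + ".join(f"({w} * {c})" for w, c in zip(coefs, features))
score_sql = f"""
CREATE TABLE churn_scores AS
SELECT user_id,
       1 / (1 + EXP(-({model.intercept} + {terms}))) AS churn_prob
FROM clicks_all_time;
"""
print(score_sql)  # run against Vertica with your SQL client of choice
```

The design point is that the model, not the data, crosses the wire: scoring the longitudinal history stays inside the scale-out analytic database.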
That's the idea. And then you do Spark on the archive. Customers right now are experimenting, and that's exactly what we're trying to do: get a sense of what customers are doing. And that's to the point about it being very early days. I don't know that there's a single answer; I'm sure there isn't a single answer for that just yet, or whether we're really seeing true production-level work going on. I mean, we're starting to, and that's some of what we heard this morning in the various keynote sessions. But that's the whole idea, right? You've got data. I mean, we're obviously in New York, the center of the financial services industry. Compliance, it's usually seven years I have to have this data, but these days I want to have the data, right? Why would I not want to keep everything? Because storage has gotten so inexpensive, and that's really been a lot of the value of HDFS too: it really provides a platform for very low-cost storage, and you can essentially keep everything. But I'm not touching a lot of that data very often, or hardly at all. But if I do want to get to it, why do I want to have to change the tools and UI and everything else that I'm using, and change my models? I don't want to do that. So I want to use a single set of modeling capabilities and a single UI across everything. That's the whole idea of what we're really trying to do. Is anyone else providing that single virtual query capability across tiers of data? And when I say tiers, I don't mean, okay, tape over here and spinning rust here and flash SSD there, I mean even different classes of databases. Well, I obviously can talk much more to what we're doing as opposed to what others are doing, but that's certainly what we're focused on, because we're doing what our customers are asking us to do, basically. That's been the model. When we came out with our SQL on Hadoop product, we actually thought it was going to be kind of a door opener in new accounts. And we've had some of that, but what we've really found is that our existing accounts have embraced it: oh good, now I can use Vertica on this data as well, because I've been starting to archive a lot of stuff over in HDFS, but I want to get at it. So we've actually found it's been more of an extension for existing accounts; that's where a lot of the big utilization has been, for that very reason. At the same time, where something like Spark comes in is traditionally where you've seen streaming technologies, right? Like what's going on right now, in memory. So that's where we're tending to see some of the experimentation going on. Okay, we keep hearing about that area, where a classic data warehouse is sort of at the end of the analytic tool chain, or the pipeline. Well, correct me if I'm wrong, I'd say it's the middle, if you really look at HDFS as the archive, right? Then I would say, because, yes, we do get used as a data warehouse, but we're an analytic database, or data platform. We see it more as kind of the middle of the curve, if you will. The tail end of the curve would be... And the pieces in between. There's a price performance curve; that's what I always like to talk about when I'm talking to customers. And it's funny how seldom that gets talked about.
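As one hedged illustration of the "same models, same interfaces" point: the identical query text can run against a hot table in Vertica's native store and a cold table that Vertica for SQL on Hadoop exposes over HDFS. This sketch assumes the open-source vertica-python client; the connection details and table names are hypothetical.

```python
# One query, two storage tiers: hot (native Vertica) and cold (over HDFS).
import vertica_python

conn_info = {"host": "vertica-host", "port": 5433, "user": "dbadmin",
             "password": "...", "database": "analytics"}

QUERY = """
SELECT user_id, COUNT(*) AS clicks
FROM {table}
WHERE event_ts >= '2015-01-01'
GROUP BY user_id
"""

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    for table in ("clicks_recent_90d",      # hot tier: Vertica native storage
                  "clicks_archive_hdfs"):   # cold tier: external data on HDFS
        cur.execute(QUERY.format(table=table))
        print(table, cur.fetchmany(5))      # slower but cheaper on the cold tier
```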
I mean, I think I said this to you guys when I was on theCUBE, probably a year or two ago. I was at a Gartner conference and there were two sessions, one on Hadoop and one on in-memory databases. And Spark wasn't even being talked about at all at that point in time. It was sort of the two end points of the curve, right? One was talking purely about price and one was talking purely about performance. I said, you know, there's a whole curve in between there, right? There's data that I'm just not accessing, or using, or analyzing is probably the best way to say it, very often. But I need to keep it, I want to keep it. Every once in a while I'm going to need to use it. Maybe it's more than once in a while, and it might just be for compliance purposes, but it might be for more than that. It gives me that longitudinal profile. That's where something like HDFS comes in, and that's what we built the SQL on Hadoop product to give the ability to do. There are a lot of ways you can obviously access that data; performance isn't necessarily the most important characteristic there, it's price. But the fact that you're already using Vertica really has made it powerful. Those are the extensions that we're talking about. Middle of the curve: Vertica's sweet spot has always been price performance, but as I always say, Vertica is really valuable when both are important. Because really, if it's just performance, then in-memory technology obviously is blazing fast, but the cost of memory is still, you know, it's not falling quite as fast as the price of software, let's put it that way. You can't necessarily afford to put everything in memory, but you've got sort of your pure-performance end, and that's where we see Spark, and that's where you see technologies like HANA and others as well. I actually haven't had a chance to really drill down on what SAP is doing with HANA and Spark. I want to see a little more of what they're doing, because that seems to be the area where there's really the overlap: the immediate, real-time data in memory and how you manage that. But the whole middle of the curve is really where our sweet spot is at Vertica. And then, like I said, we extend it to the data that we want to archive, put in HDFS or some medium where I don't care as much about performance, but I care a lot about price. That's where a lot of the value is. So does that make sense as an explanation of your question? Yes, yes. But let me come at it from a couple of other angles, which is, when we have in-memory technology, you know, 3D XPoint, whatever, the idea is memory-intensive systems with the capacity of storage and close to the latency of DRAM. Databases were, everything about them was oriented around spinning disk, and being really careful not to step on each other when they're doing the same thing. Now we throw all that out and rethink. What does that do for you? Well, that's a good question. You know, you're definitely talking to more of a biz-dev guy, and I don't want to pretend that I'm a database engineer; I don't even want to pretend to play one on TV. But we should talk about it. But it affects your use cases.
It absolutely affects your use cases, because that's one thing, you know, we both work with and sometimes compete with various in-memory database products, but one of the things you see is we never see them in deals above, I would say, even five or 10 terabytes. When I came to HP, to the Vertica BU, four years ago, people would talk about five terabytes, 10 terabytes as if it was big data. But today that's sort of next to nothing. And my understanding, although again this is an area where I really don't want to talk out of turn, is that because of some of what you just described, an in-memory database is really not going to be capable of scaling above and beyond that volume of data, and of course there's the affordability factor. And of course something like Spark, which is what we're really here to talk about, doesn't have a persistent store at all. But you can use Spark for in-memory processing and still get the benefits of a database for the more long-term, longitudinal work, and then you can use HDFS for something like archiving. Yeah, that's the idea. The world seems to be moving in this direction where you do your analytics in a stream processor before you store it, so you get the immediacy, then you store whatever's necessary. As long as you're only trying to analyze the immediate data. Yes. But then you store stuff so that you can get richer context later when you need it. Right. Is that the scenario? Yes, but what I was trying to say is, what if I want to run that model across a much larger volume of data, whether it's longitudinally in time or across my entire customer set? Like I said, I was talking before about what George is doing on the website or what George is doing in the game. I can do that very easily in memory, but if I want to look at all of my customers, or George over all time, the data sets start to grow. Right. And again, in memory is obviously, I mean, the cost of memory has dropped, not as fast as the cost of software, but it's dropped dramatically. That's the industry we're in, right? But there's still a price premium to be paid for doing things in memory, and for the foreseeable future, at least, that will continue to exist. So I'm not necessarily going to put everything in memory, and I probably can't, but that's the whole idea. So again, it's that whole price performance curve. It's still horses for courses, right? It's never technology for technology's sake. What's the value proposition? What's the objective? And what's the appropriate tool to achieve your objective? Well, and that's so much about where the industry's going. Because I love conferences like this, but one of the things about events like this sometimes is you really just don't hear the use case, right? Although I've got to say, the Capital One session I caught this morning did a really good job of talking about not just the needs of a technologist, but the needs of a marketer.
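A sketch of the stream-then-store pattern George raises, assuming Spark Structured Streaming with the Kafka source package on the classpath and reusing the hypothetical Vertica JDBC settings from the first sketch: the immediate analysis happens in memory, and the aggregates are persisted so the longitudinal, all-customers questions can be answered later. The topic name and schema are hypothetical.

```python
# Analyze in a stream processor first, then store what's needed for context.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-then-store-sketch").getOrCreate()

# Hypothetical clickstream arriving on a Kafka topic.
clicks = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clicks")
          .load()
          .selectExpr("CAST(value AS STRING) AS user_id", "timestamp"))

# Immediate, in-memory analytics: per-user counts over one-minute windows.
windowed = (clicks
            .withWatermark("timestamp", "2 minutes")
            .groupBy(F.window("timestamp", "1 minute"), "user_id")
            .count())

def persist(batch_df, batch_id):
    # Flatten the window struct, then persist each finalized micro-batch so
    # richer, longitudinal questions can be asked of the full history later.
    (batch_df
     .select(F.col("window.start").alias("window_start"), "user_id", "count")
     .write.jdbc(jdbc_url, "click_counts", mode="append", properties=props))

query = (windowed.writeStream
         .outputMode("append")
         .foreachBatch(persist)
         .start())
query.awaitTermination()
```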
And that's where you start to say, well, okay, that sort of real-time, streaming-based model that shows what's happening right now with this particular web visitor or customer or what have you, I want to extend that and use it across a longer period of time, or a larger volume of customers, or what have you. I need to have that capability, and you don't want to have to change your data, your products and tools, to do that. But there are different types of use cases. So to be able to do that all together: the CMO just says, okay great, that's a good analysis of what this person did with our product or with our website at that point in time, but show it to me since 2004, show it to me across all of our customers, or something like that. That can require, if you don't architect intelligently, a massive change in how things operate at the back end. And if you just focus on technology for technology's sake, you may not be architected to address that issue at all. And we see a lot of that going on these days. So it's great to experiment with the technology, but at the same time, think business case. Since the separation of HP on November 1st, that's how we're trying to reorganize the company. In our case, there's something called the data-driven enterprise transformation area, which is about how to become a more data-driven enterprise. It's where IT and business meet. Right. So, I want to come back a little bit to the technology side, in that you're running a business, a big business, and yet there's Hadoop; we see you at Hadoop, now we see you at Spark. I wonder if you can reflect on kind of the challenges and opportunities. I pop up at all of them. You're everywhere, I love it. Super Bowl City the other day, and I said, I get to see Chris. But, you know, kind of the opportunities and the challenges. And I didn't go to the game. Actually, I wore my biases out there. You're not interested in it. Yeah, we don't want to go there for now. Chris, you're going to put a damper on the whole interview. I'm sorry. But that's okay, anyway, we digress. The opportunities and challenges, from an established platform player's position, to integrate, incorporate, and leverage these new technologies. I think you made an interesting statement: it really is helping your existing customers, and you help them, versus kind of greenfield opportunities or new things. How are you addressing that challenge, and what is the great opportunity for HP Enterprise? Well, I think it is listening to customers, right? It's to the drive-by comment. It's understanding what customers are saying. That's what made this morning particularly interesting, hearing what customers are doing. And I think Databricks has done a great job of putting customers up on stage, right? They could put their engineers up to talk about the cool new features, and I'm sure there's a breakout session on exactly that, probably while we're talking right now. But at the same time, getting customers up there saying, here's what we're doing, that's really what people want to hear.
And we've seen this with a lot of the data lake, data swamp projects so far as well, where if you just build it, they're not necessarily going to come if it doesn't solve a problem. Now, some customers have been very smart about doing that, but others, and somebody had a slide up this morning that showed that: just throw it all in the lake and it'll magically work. That's already starting to be seen as, yeah, not so much. Not necessarily. Because if we're not smart, as a technology group, about what the business problems are... I mean, it's great to experiment with early new technology, and there's so much dynamism in this industry; that's why I love this industry. But if you're not talking to the business folks about what they're trying to accomplish... And there's also the difference between what you're experimenting with and what you're rolling out on an enterprise basis. We want to have a pulse on where the experimentation is going, but really be there for the enterprise rollout. That's what we're trying to do. So. And we're about out of time, but you've got your own conference. We do. TheCUBE has been going to, it used to be the HP Vertica show, then the HP Big Data show; now it's the HPE Big Data Conference in Boston, August 29th to September 1st, and we'll be there. Let's give a little plug. What can people expect? Give us kind of an update on the conference. Well, we have always focused that conference on, and I'm going to sound like a broken record here, customers. It's funny, because I was very much part of launching this event. Initially it was a user group meeting for Vertica, that was the whole idea, but we had pretty recently been acquired by HP, and we started getting some HP folks involved, and it wound up becoming a bigger event. But it was always about getting the customers up on stage to talk about what they're doing. And the other reason we've always historically held it in Boston, although this is changing as we globalize, is that the Vertica engineering base in particular is in Cambridge, Mass. We have sort of the two Cambridges: a lot of the former Autonomy engineering base is in Cambridge, UK, and most of the Vertica base is in Cambridge, Mass. So we're able to get our engineers and customers together: we put our engineers on stage, we put our customers on stage, and we get them together to talk to each other. Now, the conference, as you mentioned, has broadened, but that's still the core of it. It's really a customer event, because as you know, HPE does events like Discover, and that's where we oftentimes announce a lot of our new offerings. Although we sometimes do that at Big Data as well, the Big Data conference is very, very much about customers, and also about getting the customers and the engineers, the people who are actually building the products, together as well. Well, Chris, thanks for stopping by. Our pleasure. We'll let you jump back on the train. It's a train-by, not a drive-by. And we look forward to bringing theCUBE back to the Big Data Conference. Again, that's August 29th to September 1st, towards the end of the summer, and we'll be there. We've been there, I think this is our third or fourth year. It's a great show. I'm Jeff Frick. Yeah, this is four. This is four. Four. Yep.
And when you get to five, you go to Roman numerals. That's right. That's what the Super Bowl did. So, I'm Jeff Frick. We are live in Midtown Manhattan with George Gilbert from Wikibon, covering Spark Summit East, going wall to wall on day two. We'll be back with our next segment after this short break. Thanks for watching.