Live from Boston, Massachusetts, it's theCUBE, covering HPE Big Data Conference 2016. Now, here are your hosts, Dave Vellante and Paul Gillin.

Welcome back to Boston, everybody. Aiman O'Neill is here with Misha Davidson. They are in product management and engineering at HPE, and we're going to go inside the Vertica architecture. First of all, gentlemen, welcome to theCUBE.

Thank you.

Aiman, let me start with you. The big news today is Vertica 8. It's about as close as we as men can come to childbirth, I would imagine. When you announce new products, we know how hard you've worked, heads down against deadlines, so it's got to feel good.

It feels great. This release is different in that it gives our customers a choice about which cloud to run in. We've been running in the cloud for years, and we've now broadened that so customers can run Vertica well in both Amazon and Azure. We've also expanded the cloud services we interact with, so it's about more than just deploying; it's about helping customers manage Vertica well in the cloud too.

You say you've always run on the cloud. Can you elaborate? Has it been AWS, and now you're expanding that to Azure?

Exactly. We started with AWS and have now added Azure. We have customers using both, and customers have told us they want to use different clouds for different use cases. In the main-stage demo about an hour ago, we showed that we can replicate data between the two and exchange data between the BI tools that run in multiple clouds. That was one highlight of Vertica 8. Another is that we've increased our performance at scale, and our customers really have a new definition of scale. For years they told us it was just about having more and more data.
But now they're telling us they have a lot more concurrent users and a lot more competing, simultaneous workloads running in Vertica, and they asked us to solve performance challenges particularly around that. And finally, we've really expanded the kind of rich analytics people are doing with our parallel cluster. They're still running SQL, but in addition we have machine learning jobs running there now, more people doing geospatial analysis, and a lot of Internet of Things analysis. So we've really broadened. We look for use cases that are ripe for our massive parallelism and bring those inside our cluster.

You talked this morning about bringing machine learning inside the database, and you just referenced geospatial analysis; predictive analytics would be another. Is Vertica morphing into an analytics engine that will encompass all of these different predictive analytics uses?

I think so, because the power of Vertica is the parallelism, the fact that we can use so many commodity nodes in a cooperating cluster, and there are a lot of these use cases where you can chop the job up into pieces and share it among lots of workers. And it is broadening; customers are finding more and more ways to use that architecture.

So take us inside. Let's go back. Paul, you're relatively new to Vertica, obviously. Can you give us a quick Vertica 101 and tell us what's evolving and changing in the architecture?

I see. Well, there are a number of important changes. First of all, as Aiman mentioned, we're doing a lot more to support high degrees of concurrency and parallelism. We no longer serve five privileged data scientists running three well-curated queries; we have customers running thousands of concurrent users doing interesting things on massive amounts of data.
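The "chop the job up into pieces and share it among lots of workers" idea is the heart of the massively parallel architecture being described. As a rough illustration only, in plain Python rather than Vertica's actual engine, a hypothetical `parallel_total` helper splits an aggregate across workers and then combines the partial results, the way a cluster coordinator combines per-node aggregates:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each "node" computes a partial aggregate over its own slice.
    return sum(chunk)

def parallel_total(values, workers=4):
    # Split the rows into one slice per worker, mimicking how a
    # shared-nothing cluster spreads a query across nodes.
    size = max(1, len(values) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Final reduce: combine the partial aggregates into one answer.
    return sum(partials)
```

The same split-then-combine shape works for any aggregate that can be computed piecewise (counts, sums, min/max), which is why so many analytics workloads parallelize well.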
We've made some architectural changes to ensure we can handle that degree of concurrency effectively, and we're continuing to improve the product there. For cloud integrations, we're working on providing a level of abstraction that lets our users run Vertica in the clouds that are relevant to them without being locked into a particular cloud. For in-database machine learning, we used to have a separate, standalone product called Distributed R that ran side by side with Vertica and required data transfer. We determined that was an unnecessary burden on our customers, so we've taken all the IP we developed running those parallel machine learning algorithms and brought it inside Vertica, so you can effectively analyze your data and make predictions without exporting it out of Vertica. Last but not least, we're continuing our investment in open formats, not just open source. We've contributed to open source for a number of years now under the Apache project. We now support optimized reads for ORC and Parquet files, which means that if you have data in a data lake somewhere, we can access it efficiently, process it alongside the data stored in Vertica, and give you the best of both worlds: open data, plus the performance and scale of Vertica.

So what does in-database analytics and machine learning mean for customers? What's the business impact?

Maybe we should get buttons made that say "friends don't let friends duplicate data." That's what we're all about. We're looking for ways our customers are copying data today, and we're trying to help them avoid that.
That's why we did such extensive integration with Spark and with HDFS: we now let them analyze the data in place, where it is, without necessarily making a copy in Vertica, so they don't have to duplicate data into our system. And vice versa: we used to have people with petabytes of data in Vertica who would take it out somewhere else to do machine learning and then bring the results back in. We've done integrations so they don't have to move that data around anymore.

So it's the open formats, and it's analyze it as it is, where it is, without having to transform or move it. It's more efficient, it's cheaper, I'm sure it's faster, but the data is also more current, and there are different uses. Is that right?

That's true. We used to deal mostly with big batch loads every few hours. Now micro-batch streaming is one of the most popular ways of getting data into Vertica. The Kafka integration we did was very popular at this conference last year, and in this release we've generalized it so it will work with other real-time ingest mechanisms.

You talk about the growing interest among your customers in that. We've seen Spark essentially replacing MapReduce as the framework within Hadoop, and we're seeing a tremendous amount of interest in Kafka and other streaming technologies. What are some of the use cases you're seeing customers really put into practice today?

We have several customers doing network monitoring, to make sure performance meets SLAs or to look for fraud. We've got a lot of OEM customers that embed us in appliances that watch over networks; that's a particularly popular Internet of Things case. Another use case we're seeing more and more has to do with the democratization of data. Again, we started with a privileged caste of data scientists who knew exactly what they were doing.
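To make the micro-batch pattern mentioned above concrete: a loader sits between a stream (Kafka or otherwise) and the database, grouping individual messages into small batches so that each bulk load amortizes its overhead. This is a minimal, hypothetical sketch in plain Python; `micro_batches` and `load` are illustrative names, not Vertica's or Kafka's actual API:

```python
def micro_batches(stream, batch_size=3):
    """Group an unbounded stream of records into small batches,
    the way a micro-batch loader groups messages before a bulk load."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

def load(batch, target):
    # Stand-in for a bulk load into the database.
    target.extend(batch)

# Simulate ten messages arriving and being loaded in batches of four.
table = []
for b in micro_batches(iter(range(10)), batch_size=4):
    load(b, table)
```

Tuning `batch_size` (or an equivalent time window) is the usual trade-off: larger batches load more efficiently, smaller ones make data visible sooner.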
Now we have customer after customer coming to us and saying, guys, we took Vertica and hooked it up to a couple of web servers; here's the traffic pattern, can you help us optimize it? So a very common use case now is Vertica serving data to the entire enterprise, not just a thin slice of data scientists. Another thing that comes to mind in terms of use cases, important from the RDBMS perspective, is all the operational capabilities: backup, restore, recovery. As Vertica customers mature and become more successful, in large part because of the technology we give them, they rightly realize that their operational capabilities have to be at the top enterprise level. So they want effective backup, effective recovery, effective restore, and we're working hard at providing those capabilities, improving them over time, and giving our customers the ability to keep their systems up all the time. Should anything go wrong with their hardware, we've improved our recovery performance significantly. We've optimized data replication between clusters, and this is an active area of development and investment for us.

So it's not just a matter of integrating with tooling; it's architectural, is what you're saying?

Yes, very much so. We've rewired some of our storage formats to make sure we can move data between clusters more effectively, for example, and get our customers closer to scenarios like active-active replication.

Okay, so that's sort of replication support?

Yes, we're actively working in that direction.

You're sounding more and more like an operational database, though, like a production database. Are you seeing use cases evolve where perhaps you're replacing the mission-critical relational engine? Or is that a possibility, a direction, for the future?

We seem to be used in conjunction with operational databases more often.
We've got a number of partners in this area, and people use them to front-end Vertica, so I wouldn't say it's a replacement. Speaking of use cases, this afternoon I have a session where I'm going to show one that's becoming increasingly popular: manufacturers using Vertica for proactive maintenance, trying to predict when machines out in the field will fail, so they can optimize when to go and maintain them instead of having to do it after a failure. That's an increasingly popular use case.

That's your Internet of Things application, then?

Absolutely, and we see it both in healthcare machines and in silicon manufacturing machines. Pretty diverse.

It's been talked about for quite some time, and it sounds like it's close, or it's there.

I think it's there now, because these machines are now riddled with sensors broadcasting their temperature and pressure, which they didn't do until about a year ago. The scale at which we operate in the Internet of Things is mind-boggling. We have customers that load 3,000 unique tables an hour, and each of those tables has up to 1,500 columns. So it's not just the volume of data we're ingesting; the variety is incredible, and it's coming at a ferocious speed. It's a great opportunity for us to optimize for this use case.

Plus the topology of the deployment, right? If you talk Internet of Things, you're going to talk about the edge. So are you putting Vertica at the edge, on those parts of the network?

Well, at the recent HPE Discover in June, there was an announcement that Vertica now runs well on the HPE Edgeline servers, so it has been tested with those.

Which is a Moonshot, right?

It's the new generation of Moonshot, yes. The other thing that's challenging about the Internet of Things, which we've had to adapt to, is that the data is a lot messier, less predictable.
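The proactive-maintenance use case described above amounts to extrapolating a sensor trend out to a failure threshold. Here is a deliberately minimal sketch, assuming a simple linear drift in a single sensor; a real deployment would use in-database machine learning over far richer features, and the function names here are purely illustrative:

```python
def fit_trend(times, readings):
    # Ordinary least-squares slope and intercept for a sensor series.
    n = len(times)
    mt = sum(times) / n
    mr = sum(readings) / n
    slope = (sum((t - mt) * (r - mr) for t, r in zip(times, readings))
             / sum((t - mt) ** 2 for t in times))
    return slope, mr - slope * mt

def hours_until(threshold, times, readings):
    # Extrapolate the fitted trend to the time the reading is
    # predicted to cross the failure threshold.
    slope, intercept = fit_trend(times, readings)
    if slope <= 0:
        return None  # not trending toward the threshold
    return (threshold - intercept) / slope

# Example: temperature rising 2 degrees per hour from 60.
hours = [0, 1, 2, 3, 4]
temps = [60, 62, 64, 66, 68]
```

Maintenance can then be scheduled comfortably before the predicted crossing, rather than after a failure.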
Some of the sensors won't broadcast for a few minutes, and then you're missing data. So we have time series functions and event series functions that can smooth those gaps and fill in data. Increasingly, in 8.0, we've got some new functions that help you clean up the data by finding outliers or by normalizing data from different feeds.

Speaking of that, going back to what Misha said earlier about your activity in the open source realm, some of the things you're talking about are also being addressed by open source projects. You play nicely with open source, but you're not a pure open source database. What is your approach to adopting open source technologies into Vertica?

We wholeheartedly embrace open source, but our belief is that open source is not so much a mechanism for making software available to the masses, because all the companies with an open source model also charge you for support licenses; it's not about money. It's a mechanism that enables collaboration. We worked with Hortonworks on the ORC reader. We work with Cloudera on libhdfs++, the latest library that enables us to access storage on HDFS natively, much more efficiently than webhdfs. For us, open source is a medium of collaboration between large companies, and inasmuch as it enables that, we embrace it. We talk to a number of different companies, and we have an ecosystem team that provides custom-crafted integrations, but we find that direct engineering-organization-to-engineering-organization contact, like we've done with the Hadoop vendors, produces much deeper and more efficient integrations that are more impactful to our product.

Say a little more about those. I understand it's not about cost as in free software, but there's still a cost in terms of resources. You've got to decide where to put your resources among the zillion open source projects out there. So talk us through your prioritization scheme.
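As an illustration of the gap-smoothing and outlier-finding described above, here is a small, self-contained sketch in plain Python. Vertica's actual time series and event series functions work in SQL, so `fill_gaps` and `outliers` are purely illustrative names under simple assumptions (linear interpolation for gaps, a z-score cutoff for outliers):

```python
def fill_gaps(series):
    """Linearly interpolate interior None gaps in a sensor series;
    missing values at either end are left as-is."""
    out = list(series)
    for i, v in enumerate(out):
        if v is None:
            lo = i - 1           # previous known (or just-filled) point
            hi = i + 1
            while hi < len(out) and out[hi] is None:
                hi += 1          # next known point
            if lo >= 0 and hi < len(out):
                step = (out[hi] - out[lo]) / (hi - lo)
                out[i] = out[lo] + step * (i - lo)
    return out

def outliers(series, z=3.0):
    # Flag values more than z standard deviations from the mean.
    vals = [v for v in series if v is not None]
    mean = sum(vals) / len(vals)
    sd = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
    return [v for v in vals if sd and abs(v - mean) / sd > z]
```

For example, `fill_gaps([1, None, None, 4])` fills the two missing readings along the straight line between 1 and 4.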
We talk to our PMs... no, we listen to our customers. And we do tend to prioritize use cases that benefit from massive parallelism; we are picky about that. Geospatial was ripe for it, and I'd say that's how we prioritize these cases. We also look at where the data lives. When we got tangible evidence that the notion of data lakes is real, that our major customers were building data lakes, and that a data lake really meant HDFS with Parquet files in it, that told us clearly that we have to have an efficient connection to HDFS where webhdfs is not good enough. So we worked on the connector. We had an ORC reader, and when we saw that a larger proportion of the data out there was in Parquet, we wrote a Parquet reader. That's the driver: Vertica is all about efficient processing of data at scale, be it the volume of the data or the concurrency. We'll go to where the data is.

I want to ask you about efficiency, and Aiman mentioned performance earlier. We hear those terms thrown around a lot, and nobody really trusts benchmarks. When you go into customer engagements, how do you talk about this? How do you prove the performance claims that you make?

Well, customers have SLAs to meet in terms of how much data is loaded per hour so that they can get answers onto executive dashboards. So it's not so much some abstract standard we try to match; it's customer SLAs that we meet. We work with our customers, we do POCs, and we have internal cookbooks that tell us how to size Vertica deployments appropriately for certain loads and use cases. We know how to make the trade-offs between speed of load and concurrency of queries. We do run benchmarks internally, but you're right, benchmarks are highly overrated, so we don't publish them. We keep ourselves honest and keep improving our performance, but fundamentally it's driven by customer use cases.
Benchmarks are useful for you, though, because you understand all the dimensions and the assumptions.

They are, and we quote them. Over the next few days, in the sessions, you'll hear us say many times that, for instance, TPC-DS query number 68 is now five times faster than it was in 7.5. TPC-DS is a well-respected decision support benchmark, and we use it in our testing all the time.

Yeah, I personally love benchmarks when I have somebody who's smart enough to ask the right questions: what's the workload, what's the read-write ratio, the cache hit rates, and all that. Then you can squint through it and say, okay, that makes sense and that doesn't, or this one's rigged and that one's not. Smart people can squint through it and figure it out. Now, this is the fourth big data conference you guys have had, and I've seen some signage suggesting you're having customer meetings and advisory boards. I'm sure they're highly confidential, but maybe you could give us a high-level view, show us a little leg, as to what kinds of things customers are demanding. What are some of the big trends, the riptides and currents, you're seeing out there?

Well, we had our customer advisory council yesterday, and what was new for me is that I noticed the customers becoming more sophisticated about cloud. A couple of years ago they would segment into "we're going to the cloud" or "we're not going to the cloud": if they were in banking or healthcare, they weren't going, and everyone else in retail and digital marketing was. That's changed. Yesterday, everybody said "we're playing in the cloud, but this use case, no way would I put it in the cloud, and this one I would." They're becoming much more savvy about how they decide which data to put in the cloud.
I think the other thing that goes with customer sophistication is that our customers demand transparency. They don't care that much about open source versus not open source, but they want to understand how the system works. If for some reason Vertica becomes slow, they want to be able to go in and pinpoint: this is the offending query, this is the user who opened 100 connections when they shouldn't have. This is an active area of work for us, because we realize that in the end it is people who use these systems, and we need to make not just an awesome race car but one that's usable by non-professionally-trained drivers. As analytics becomes more democratized, as more and more companies put in systems like Vertica, more and more users are not professionally trained in the art, so we need to help them figure out what's going on, isolate problems, and deal with them effectively. That's another dimension of scaling complexity for us: look at our own operational data, do some analysis, make some suggestions, and tell customers what to do.

You alluded, Misha, at the top of the segment, to not just giving insights to a few but opening them up to the masses, not just a few uber data scientists. And that leads us to the summary. It's been a great set of keynotes so far; maybe you guys could give us the wrap on what you've seen and your final thoughts. Since I did the intro, why don't you summarize?

Well, I think it's great to see that the virtuous loop of listening to our customers really works. When we come to this conference every year, we have formal meetings with our advisory board, we have individual meetings with customers, and we just wander the halls, talk to people, and listen to what they have to say. We compare notes when we come back and say, these are the priorities. So when we left the conference last year, we said, you know what? We should do more for the cloud.
We should invest more in scale. We've done that work, and at the customer advisory council we got confirmation that yes, we're on the right track, we've done the right things, and they want more. Now we need to explain more to our customers about what's actually happening in their systems, and we need to do more to integrate with the clouds. That's the work we'll continue doing, while making Vertica more robust, more scalable, and easier to operate. There are a lot of different things that are technically exciting, important, and valuable to our customers. The conference is a great place for us to validate what we do and set the direction for next year.

Misha, thank you. Aiman, great to see you again.

Thank you. Great to be here.

Thank you very much. All right, keep it right there, everybody. We'll be right back at the HPE Big Data Conference, hashtag #SeizeTheData, after this short word.