TheCube at Hadoop Summit 2014 is brought to you by anchor sponsor Hortonworks. We do Hadoop. And headline sponsor WANdisco. We make Hadoop invincible. Hey, welcome back everyone. We're here live in Silicon Valley in San Jose for Hadoop Summit 2014. This is TheCube, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, the founder of SiliconAngle.com. I'm here with Wikibon.org's own Jeff Kelly, big data analyst. And our next guest is Peter Boncz, senior research scientist from VU University in Amsterdam. Welcome to TheCube. Thanks. So we love having computer science guys on, especially when they're also professors and also practitioners. I want to get your take. We've been talking to the Actian folks, who have some nice momentum going here at the show with their hashtag #CutHadoopLoose, which we'll get to in a minute. What are some of the things you've been involved with? Take us through the history of this database technology of yours. Thanks for the question, yes. So I'm a research scientist at VU University Amsterdam, but also an advisor to Actian Corporation. My involvement with them is that my recent spin-off, Vectorwise, was acquired by Actian and is now sold as the Actian Vector product. Actian Vector is an analytical database system that has led the TPC-H charts in the single-server category since its existence, actually. So it's a really, really fast database engine, the fastest on the market. And it pioneered a number of techniques that are now becoming very popular. One of them is vector processing. Vectorwise takes its name from vector processing, which was invented by me and my PhD student, Marcin Żukowski, who worked on that and was the driving force behind it. So take a step back into Vectorwise, because that's some of the most cutting-edge vector technology.
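To make the vector processing idea concrete: instead of a query interpreter handling one tuple per function call, operators work on batches (vectors) of values, so the interpretation overhead is paid once per batch rather than once per tuple. Here is a minimal Python sketch of the contrast; all names are invented for illustration (Vectorwise itself is a C++ engine, and this is not its API):

```python
# Toy contrast between tuple-at-a-time and vectorized (batch-at-a-time)
# query execution. Names are ours for illustration only.

VECTOR_SIZE = 1024  # real engines use batches of roughly a thousand values

def tuple_at_a_time(rows):
    # Classic Volcano-style iteration: per-tuple interpretation overhead.
    total = 0.0
    for price, qty in rows:
        total += price * qty
    return total

def vectorized(prices, qtys):
    # Each operator call processes a whole vector of values, so the
    # interpretation cost is amortized over the batch.
    total = 0.0
    for i in range(0, len(prices), VECTOR_SIZE):
        p = prices[i:i + VECTOR_SIZE]
        q = qtys[i:i + VECTOR_SIZE]
        total += sum(x * y for x, y in zip(p, q))
    return total

prices = [2.0, 3.0, 4.0] * 1000
qtys = [1, 2, 3] * 1000
assert tuple_at_a_time(zip(prices, qtys)) == vectorized(prices, qtys)
```

In a real engine the batch loop also compiles down to tight, cache-friendly code over contiguous arrays, which is where most of the speedup comes from.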
Take us through: why did you guys do that, and how does that relate to why it's relevant today? Right, well, before Vectorwise, we had experience building a new analytical database system called MonetDB. MonetDB is an open source column store. It's one of the earliest column stores, and it kind of started the wave of specialized analytical database engines in the relational domain. And well, MonetDB was great, but after working on it for a couple of years, we definitely had some new ideas. We saw new opportunities, and these got realized in Vectorwise. So that- You like doing spin-offs, don't you? You're a creator. You like to get tinkering and then spin them off to the highest bidder, in this case, Actian, right? Right, well, computer science is an applied science. Certainly my area, database systems. Now, there is a huge demand for database technology. You see it happening again in the Hadoop ecosystem. People are really discovering the need to have database systems there as well. So there is a tremendous opportunity to apply database technology, and I think if you're a good research scientist in this applied area, you should do spin-offs. So I've got to ask you: my co-founder at CrowdChat is Danny Ryan from Georgia Tech, and Karsten Schwan is an advisor and a participant in our venture. He's the large-scale distributed computing guy, but databases aren't necessarily viewed as large-scale, so I want you to specifically connect the dots for us. Now, with big data, you have to go large-scale. Certainly the unstructured stuff with Hadoop makes a lot of sense, storing piles of data, but when you try to intersect large-scale distributed computing with large-scale databases, there are challenges. Can you connect the dots on that? Sure, yeah. Today, Actian is actually cutting Hadoop loose and unleashing a new product called Vortex, which is the Hadoop version of Vectorwise.
So the best relational database engine around in the analytical domain is now coming into Hadoop, and really integrated into Hadoop. The need for databases in big data is becoming more and more evident. People are using and adopting Hadoop, typically starting with unstructured, messy data, running MapReduce jobs, cleaning data, finding things out using data mining algorithms. But the output of those cleaning and analytical steps is less dirty, more clean and more structured data, and in the end, the end products of big data pipelines want or need to be consumed by a SQL engine. Or let me rephrase that: there is tremendous value, at the end of a big data pipeline in Hadoop, in having SQL systems and SQL processing right there, and in the spirit of large-scale, you shouldn't then copy that data out to some existing outside database engine. The best is to process the SQL queries inside Hadoop as well. So that's what's driving this, not only for Actian, but for the whole industry. So who's using the large-scale databases that Actian has? What's the prototypical customer use case? Well, I'm a database research scientist, so I'm not day-to-day involved in customer cases. I mean, Vectorwise has hundreds of customers. Without naming customer names, talk about the environment: what applications would use it and leverage its value? It is very diverse. On our wall, we've got all the customers, and they come from financial institutions, but also social networking sites, energy, medicine. People deal with a ton of data. It is quite broad. The adoption of big data is also quite broad. So you mentioned, Peter, that the Actian SQL-on-Hadoop technology is native to Hadoop. Can you describe a little bit what that means? Because, as you say, you designed the database eight years ago for a kind of single-server environment.
When you say it's native, what does that really mean? Very good question. So I kind of think you can look at it like we got lucky, in the sense that the design of Vectorwise really fits Hadoop and HDFS. HDFS is really difficult to port an existing database system to, because it's an append-only file system, and databases do in-place updates all the time. So typically a straight port won't really work. But compressed columnar storage, like Vectorwise uses, is really not a friend of update-in-place either. In a column store, you're already looking for other ways of updating data. This is in the DNA of Vectorwise anyway, and that made it really easy for us to go native on HDFS. The second pillar of being native to Hadoop is YARN integration. You've got to play along with resource management in Hadoop, both to avoid being trampled by other jobs and to not get in the way of the other users of the Hadoop cluster. So it's interesting: when you developed Vectorwise, Hadoop was also quite young, and it wasn't something you were thinking about at the time. 2004, yeah. Great minds think alike in similar approaches. Is that a way to look at it, in terms of the compatibility? I think it's about latency. The reason HDFS is the way it is, is that for Hadoop, you've got to deal with network latency. That led to Hadoop and HDFS preferring very large block transfers, and appends only, to make the consistency algorithms feasible. I mean, if you look at what David Patterson, a great computer scientist, has said: the biggest challenge in computer hardware evolution is latency. Latency is not going away. That's true for networks, but for the likes of Vectorwise, IO latency is also a big problem, and that's why column stores end up preferring large sequences of IOs and avoiding updates in place.
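The column-store preference for large sequential IO can be pictured with a toy layout comparison: in a columnar layout, each attribute lives in its own contiguous array, so scanning one column is one sequential pass instead of skipping through every row record. The field names here are invented for illustration:

```python
# Toy contrast between row-wise and columnar layout. Scanning a single
# column in a column store reads one contiguous array (large sequential
# I/O), instead of touching every row object to pick out one field.

rows = [{"id": i, "price": float(i % 10), "region": "EU"} for i in range(8)]

# Columnar layout: one contiguous array per attribute.
columns = {
    "id":     [r["id"] for r in rows],
    "price":  [r["price"] for r in rows],
    "region": [r["region"] for r in rows],
}

# Row store: every row must be visited to extract one field.
row_total = sum(r["price"] for r in rows)

# Column store: a single sequential pass over one array.
col_total = sum(columns["price"])

assert row_total == col_total
```

On disk, that contiguous per-column array is what lets the engine issue a few large reads instead of many scattered small ones, which is exactly the access pattern HDFS was built for.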
So in some sense, the underlying force, the danger of latency to the architecture of a software system, is there in both cases. And I think, coming back to, let's say, other alternatives in the marketplace, you see that there are quite a few SQL-on-Hadoop solutions that are based on what I call a wrapped legacy. They take a legacy database system, and by that I mean a database system that wasn't designed for analytical workloads: not a column store, not vectorized. Very often it's something like Postgres or Derby, an open source solution, and people put that as a component in a SQL-on-Hadoop product. The problem with that is that SQL-on-Hadoop workloads are truly analytical. So you're putting a query engine there that's really not suited to fast query execution, and that's why Vectorwise is so much faster than those legacy systems. The other point is that if you put Postgres or Derby on HDFS, you run into the problem that HDFS is append-only. So you have to switch off updates, basically. There is no SQL-on-Hadoop solution right now that supports trickle updates. Even HAWQ: they've all disabled the possibility to delete a tuple or to modify a tuple. And for SQL users and use cases that are considered mature, this is a big drawback, and Vortex is going to change that, because Actian has a very different approach. Vectorwise is a system that doesn't do update-in-place; it has a special patented data structure called positional delta trees, which are differential updates. So you're writing on the side, not in the files. And this allows Vortex to support updates without running into HDFS problems. And I think with that, it will also be unique in the market. So I've got to ask you about the future. A lot of database architects are trying to figure out two things.
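The differential-update idea can be sketched in a few lines: the base column data stays immutable (append-only friendly), while updates, deletes, and appends are recorded on the side, keyed by position, and merged into the stream at scan time. This is a deliberate simplification of the idea, not Actian's patented positional delta tree structure, and all names here are ours:

```python
# Minimal sketch of differential (side-file) updates over an immutable
# base column. Not the real PDT data structure; a PDT additionally keeps
# the deltas in a tree keyed by position so merging stays cheap.

class DifferentialColumn:
    def __init__(self, base):
        self.base = list(base)    # immutable base data (on HDFS: closed files)
        self.updates = {}         # position -> new value
        self.deletes = set()      # positions marked deleted
        self.appends = []         # new values appended on the side

    def update(self, pos, value):
        self.updates[pos] = value  # no in-place write to the base

    def delete(self, pos):
        self.deletes.add(pos)

    def append(self, value):
        self.appends.append(value)

    def scan(self):
        # Merge the side structures into the base during the scan.
        for pos, v in enumerate(self.base):
            if pos in self.deletes:
                continue
            yield self.updates.get(pos, v)
        yield from self.appends

col = DifferentialColumn([10, 20, 30])
col.update(1, 25)
col.delete(2)
col.append(40)
assert list(col.scan()) == [10, 25, 40]
```

Because nothing in the base files is ever rewritten, this pattern coexists with an append-only file system: the deltas accumulate on the side and can periodically be merged into freshly written base files.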
What can I do today to create an architecture, as a database guy or an analytics guy, that's solid for today, but also gets me down, no pun intended, a vector that's in line with the market? In other words, I want to create a viable architecture today, but I don't want to have to go back and re-architect it again. So take us through your advice to those folks. Well, I'm not sure. The thing is that the hardware environment and the workloads are changing, and these factors can change what makes up a great architecture. Another thing I have to add is that to create a mature database solution, a SQL solution that's considered mature, you're actually racing against time, because this takes a lot of time. And this is also something that systems like Impala and Hive are experiencing. A lot of effort is being put in there. These are systems that are being designed as native analytical database systems. And they are on the right track, but the problem is they will need years, I would say a decade, to get to the maturity level that customers really expect from SQL systems. I'm talking about workload management, access control, internationalization support, wide APIs, complex queries, trickle updates, windowing functions. Doing all of that takes a lot of time and a lot of effort. And in the meantime, of course, the landscape is changing. So this is an interesting corner. Where do you see the technology heading? What's around the corner, from a research perspective, from a practicality perspective? What do you see around the corner for future technology? Well, I'm very excited about the democratization of cluster computing, which is now happening around Hadoop. So you see it really widespread. Cluster computing where, around the world? Yeah, in IT in general. It used to be that only big organizations were able to leverage clusters. And now it's becoming easier to do this, because people are also standardizing.
A very important factor is human skills, because administering a cluster, and administering a system based on cluster technology, takes skills. And these skills for, let's say, specialized MPP database solutions used to be very rare. So it would be almost impossible to find people to maintain that. And now people are rallying around Hadoop. Hadoop knowledge is already much more prevalent than, let's say, specialized Teradata knowledge. And this allows the industry as a whole to adopt Hadoop-based cluster solutions. So both standardized software and cloud computing will mean we really see a much broader set of IT projects being executed on clusters. And this actually goes hand in hand with the general trend that we are at the end of silicon scaling. So, this is- That shifts to software, right? So you're thinking- Well, if it isn't so easy to buy your way out of a problem, or to get a competitive advantage by just buying a faster computer, we all have to go to cluster solutions more in the future. So how do you see that, with what Intel's doing with systems on a chip? Obviously there is some limitation on silicon. Some argue there's some new stuff around that. But assuming we're getting down to diminishing returns on silicon, what advances do you see that are going to really take it to the next level? Is it virtualization, data virtualization? Is it system-on-chip, software in particular? Anything that makes it easier to scale on clusters will do great, because if that's the only option, then people will try to go there. But until now, the barrier to doing so has been very high in terms of skills and finding people to do it. Skills are a problem. Let me ask the question the opposite way. What scares you, around stuff that doesn't scale on clusters? What things do you see where you go, oh no, that's not going to scale on a cluster? What should people pay attention to as problem areas?
Well, there are many, because writing parallel programs is notoriously hard. It is really difficult. So we are already in that world of pain right now, because even on a single server, you have to write parallel programs if you care about performance. By the way, the fact that you need to write software that runs on parallel hardware, and that this is almost black magic for the average programmer, is the reason why MapReduce, but also SQL on Hadoop, are becoming more important: these are standard technologies that, out of the box, run an IT problem in parallel. So these kinds of software are the only solutions to this drive for more parallel processing. So you're saying a homegrown parallel processing path is not a good solution? No, that's only for the few elite soldiers who can do that, right? That's a really, really select few. I mean, for your average programmer, this is truly black magic. It's really high-end computer science. So what you're saying is, look for the tools that abstract away the complexity. Exactly. And what would they be? MapReduce, what else? MapReduce, SQL on Hadoop, and you see also, I mean, there is something between MapReduce and SQL on Hadoop. It's not entirely clear what that is. Spark has been making waves with Scala-based programming. It's not clear. It also depends on how that's going to be adopted, but there are many algorithms, data mining algorithms, analytical algorithms, that go beyond SQL. So tell us what you're working on now. Final question: what are you working on now? What's getting you excited? What's the technology drug you're taking? Well, I'm still very much involved with Vortex. Vortex was announced today, and there is a whole roadmap of YARN integration, of elasticity. Elasticity is very important for sharing, to share a Hadoop cluster between many different applications. All this update work. What's the elasticity piece applied to, more specifically? Be specific on that.
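The "abstract away the parallelism" point is easy to see in miniature: in the MapReduce model, the programmer writes only a sequential map function and an associative reduce function, and the framework handles splitting the input and running the pieces in parallel. Here is a toy sketch where Python's multiprocessing pool stands in for the cluster scheduler; this is an illustration of the programming model, not Hadoop's actual API:

```python
# Word count in the MapReduce style: the user code is two small
# sequential functions; the parallelism lives entirely in the framework
# (here, a process pool standing in for the cluster).

from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_phase(chunk):
    # Per-split work: count words in one chunk of input lines.
    return Counter(word for line in chunk for word in line.split())

def reduce_phase(a, b):
    # Merge partial counts; associative, so merge order doesn't matter.
    return a + b

if __name__ == "__main__":
    lines = ["big data", "big clusters", "data data"]
    splits = [lines[0:1], lines[1:2], lines[2:3]]  # the input "splits"
    with Pool(3) as pool:
        partials = pool.map(map_phase, splits)    # parallel map
    totals = reduce(reduce_phase, partials)       # merge step
    assert totals["data"] == 3 and totals["big"] == 2
```

The programmer never touches threads, locks, or message passing; that division of labor is exactly what makes these frameworks usable by people for whom hand-written parallel code is black magic.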
The elasticity piece. The elasticity. So, I mean, once you go to... In which aspect? To MapReduce? Yeah. So, a Hadoop cluster has many different uses, and one of the big arguments for SQL on Hadoop is that a single cluster installation can be used by different kinds of users and different kinds of jobs. So MapReduce jobs, Pig jobs, what-have-you jobs, and also Vortex jobs. And the best situation is where sharing the same hardware cluster resources, and also the sysadmin resources that go with them, allows you to balance out the use of the system. When one type of job is not very busy, that room is given to the other kinds of jobs, because very likely not all of them are peaking at the same moment. So the new version of Hadoop YARN has a lot of new features around resource management that are really awesome. Exactly. What's going on in that area of innovation, and what's the compelling new thing coming out of resource management, around YARN in particular? Well, we would like to have capabilities in YARN where you can dynamically adjust resource usage while a job is running, because right now, when you start up a job, you request a number of CPUs and a quantity of memory, but you cannot change that while the job is running. And you can view a SQL-on-Hadoop server as a very, very, very long-running job. So we are right now working around this limitation of Hadoop. It would actually be great if YARN were able to dynamically change the resource footprint. That would be good. It would also be great if the network were monitored as well, because a network bottleneck is dangerous for HDFS access. It could lead to queries stalling, and currently YARN is not really looking at that. It's always the plumbing that gets everyone pissed off. It's always the network guy, right? Right, right, right. Okay, final question for you.
Thanks for coming on TheCube, really appreciate it. Where can someone get information around Vector for Hadoop resources? How do I get more information? Is there a public open source project? Is there a community? There's obviously Actian. Vortex, so Vector for Hadoop, or Vector Hadoop Edition as it's called, is a commercial product. There will be an evaluation version out, so you can try it for free. That will happen at the end of June. Go to www.actian.com for that. You can also, if you want to read in depth about what Vortex is, go to databasearchitects.blogspot.com, where there is an in-depth post on how Vortex works, so I recommend that. Peter, thanks for coming on TheCube, really appreciate it, great to hear from you. Black magic and a lot of this high-end computer science. And the beautiful thing is, we say the same about data scientists too: there'll be some black magic, some high-end skill sets, but the winners will abstract away the complexity and create tooling and automation for the rest of us. So we really appreciate the work you're doing, and good job at Actian. They scored some good IP and a great team, so congratulations, thanks. Okay, we'll be right back with our next guest after this short break.