Hi everybody, I'm Jeff Kelly, Big Data Analyst here for Wikibon, and we're at Wikibon world headquarters on theCUBE. Earlier this month, IBM released a slew of Big Data products and technologies, among them a new member of the PureSystems family called PureData System for Hadoop. Today we're joined by IBM's Tim Vincent to talk about those releases and, of course, about IBM's overall approach to bringing Big Data to the enterprise. Tim is an IBM Fellow as well as Vice President and CTO of Information Management within IBM's Software Group. Tim joins us today by phone from his office. Thanks so much for joining us today, Tim.

Well, thanks Jeff, I look forward to the conversation today.

Great. So why don't you help us set the table first off: what is PureData System for Hadoop? And can you talk a little bit about your role in developing the new system?

PureData System for Hadoop follows the overall philosophy of the PureData Systems, and that philosophy is to provide end users with an integrated system that lets them spend their time building out the business insights and the technology they want to build on top of Hadoop, the things that drive value for them, versus spending their time engineering the system: acquiring all the parts, bringing them together, designing the architecture, installing the software, and only then getting to the point where they can do the former. We've actually talked to customers who are spending weeks of time on these activities: thinking through what they want the architecture to look like, how they get all the software installed, which switches they're going to use, how they're going to build the systems out. We've got to the point where we had one customer up and load-data ready in under two hours. So the goal here is really to get a system on the floor so that customers can get to building the logic they want to build versus trying to get the system stood up and running.

That's really in line with what you and I discussed last fall at Information On Demand, the idea that a lot of IT projects take a lot longer than originally projected, and a lot of that time is spent, as you mentioned, just setting up the system. So if you can dive into some specifics here: how does PureData System for Hadoop actually go about minimizing that time and those setup challenges, so that you, as an enterprise, can get to the value-add that Big Data provides, namely doing the analytics and really driving data-driven decision making?

Well, I think the answer is pretty simple. Think of a system that shows up on your floor. You power it on, you hook it up to your network, and at that point you set up security policies, et cetera, but at that point you're ready to load data. Now contrast that with what you would have to do if you wanted to build this from scratch. You would have to start thinking about: okay, what is the size of my system? What servers am I going to use? How am I going to lay my storage out? Am I going to use a SAN model?
Am I going to use a storage server model? How am I going to network the systems together? You've got to order all that hardware, you've got to rack it, you've got to stack it, you've got to cable it up, you've got to install Hadoop, you've got to configure it. So there's a lot of work you have to go through, not only from a planning perspective but from a design perspective and a physical installation perspective. All that cost is gone; it's out of the system. But I think that's just the beginning point. Then you've got to think about the ongoing management of the system. The reality is you are going to install software patches, you are going to install firmware patches. How do you do that? What is the process for doing that? PureData System for Hadoop comes with a single console, which gives you a consolidated view of the system. It gives you a process for upgrading the software and a process for upgrading the firmware. But it does more than that as well: it gives you monitoring capabilities and problem determination. You'd otherwise have to think about how you would build those in, and make sure your availability characteristics are designed in. I know Hadoop's got a level of built-in availability, but you have to think through how you're going to set up your availability model, and all of that is done for you.

And the other aspect I think is important here is Hadoop skills. A lot of the customers we talk to, one of the challenges they have is, even though Hadoop is exciting and a lot of people are talking about it, actually finding people with the skills to understand and manage a Hadoop system is hard. So in a lot of cases they're repurposing people they have, people who may have been running, or are still running, their warehouse system. So again, these people are trying to skill up. They don't want to have to spend all that time trying to figure out how to stand the system up. The system comes pre-packaged; all these problems have been addressed for you. So it's really as simple as rolling something onto the floor and plugging it in, versus all these other steps you would have to do.

Interesting. So how does that align with the way we often think about Hadoop, the scale-out model of stringing together cheap commodity boxes, where when you need to expand the cluster you just bring in another box? Are there scaling challenges or other trade-offs when you take the appliance approach, as you've done with PureData System for Hadoop, either in terms of being able to scale it easily or in terms of the cost-benefit analysis of bringing Hadoop into your environment versus a traditional data warehouse, which can get very expensive when you're trying to scale to big data levels? Are there any trade-offs when you're implementing Hadoop in the form factor of the appliance model?

So there are a few questions in there. Let me start with the one of, okay, how do I scale the system? In its current incarnation, the system is a fixed-rack configuration, so it comes at a fixed size. So effectively, the decision of scaling is, in some ways, taken away from you. You're buying a system at a very attractive price point that allows you to grow up to a specific size, and that size still represents a significant amount of data storage in the system.
So you can grow up to a size that's probably going to be sufficient for a lot of enterprises. That scaling problem is taken away from you to a great degree. Now, if you wanted to start with a smaller system and scale, you could still do so, but again, you'd have to go through all those steps. Now, the other thing you talked about is the data warehouse, and one of the things I think we should get into, it's probably a separate part of this discussion, is how and where a Hadoop system fits into your overall enterprise architecture, and what the impacts are to your current warehouse systems. I see that as a second question.

Well, why don't we get into that a little bit? If you're a CIO and you're looking to bring in Hadoop as kind of the core platform for big data inside your enterprise, as you mentioned, you've got enterprise data warehouses potentially strewn throughout your organization, and you've got other data management and database technologies. How should a CIO start thinking about bringing Hadoop into that environment? Obviously they don't want to replace their existing technology; they've invested a lot of money over many years in it, and it's often playing really mission-critical roles. So what should they be thinking about from a technology perspective when bringing in Hadoop, whether it's in the appliance form or any other form, as they look to bring big data into their environment?

I think the way you've worded the question is spot on; you're hitting the key issues. Let me start with one of the things that I've seen people do that I think is the wrong place to start. We saw this originally several years ago with some of the telcos in China, where researchers were starting to play with Hadoop, and they were starting from the perspective of, can we use this technology to replace our warehouse? A lot of customers start with this from the perspective of a cost play. They look at it and say, well, I've got cheap servers, I've got something that's open source, so the acquisition cost is theoretically free, and can I replace my warehouse? I think that's the wrong place to start, and I think you'll find if you start looking at the cost equation, yes, maybe the acquisition cost is initially cheaper, but as you look at the ongoing cost, I'm not sure how well that cost equation works out, and there are also going to be challenges in what will work and what won't.

So what we've been spending time on is how you use the technology to augment what you've already got, how you actually provide more value to the system. The way we've been looking at rolling out big data and these Hadoop systems is really as an augmentation of the systems you have today. The warehouse is one thing you can augment. Another area that's actually interesting from an augmentation perspective is master data management systems. But let's start with the warehouse. So you've got a warehouse, and that warehouse could be, well, warehouse is probably a highly overused term, by the way, but let's say you've got a divisional warehouse and an EDW, and you've got a collection of consolidated data marts. They could be dependent on the warehouse, they could be independent. And you've got business users using those warehouses today.
And you'll find that the performance is either strong, depending on how good the architecture is, or you've got some challenges. But in any case, you've got a system that's performing to a degree. You've got workload management, you've got monitoring infrastructure in place, you've got ETL into that system, you've maybe got some direct streams; we've got customers who are doing things like point-of-sale operations directly on the warehouse. So you've got a lot of infrastructure set up there. The question is how you can use Hadoop to do more.

What we're looking at is Hadoop becoming what we think of as a landing zone or an exploratory zone, something you can think of almost as an adjunct to those systems. What we're trying to use this model for is allowing people to expand what they've got in the warehouse. You could start doing things like putting some of that warehouse data into Hadoop. Let's pick on, say, the transactional detail data and say, okay, I'm going to start putting my transactional detail data in both my warehouse and in Hadoop. But in Hadoop, I'm going to keep additional attributes which I don't store in my warehouse today, because I've not necessarily seen a valid use case for them. And I'm now also going to bring in other data. That data could be SEC filings, it could be LexisNexis reports, it could be social media, it could be data from other sources I've got in my enterprise, it could be email, it could be machine data. So you're bringing all these different forms of data into this landing or exploratory zone, and then you're letting the business users start doing more of an exploratory type of workload on them, or an ad hoc workload. You could be building ad hoc reports that are really looking across this superset of data you've not been able to query before. So what we see the system doing is providing this capability for exploration, things like you may have done with a sandbox in a warehouse before, but you're going to do it in this location instead, in the landing zone. And that sandboxing activity could now be sandboxing over just the transactional detail from your warehouse, so you're offloading some of that work from the warehouse; or it could be sandboxing where you're looking at that data plus additional data, or even just looking at additional data. So we're really trying to get to the point where people are saying this is an opportunity to do new things and really derive new value, versus replacing something, which, as you pointed out, could be a very costly proposition once you really start looking at the process of completely ripping and replacing. I'll stop there, Jeff, because I'm sure you've got some other questions we can drill down on.

Right, well, there are some really great insights in there. One that struck me was the idea of using Hadoop really as an area to do some more exploratory analytics, and you mentioned maybe even a business user doing that. But of course, one of the challenges with Hadoop is that it's sometimes difficult to work with the data. You've either got to know MapReduce or be a data scientist in many cases, at least in the use cases we've been hearing about out in the market, in order to use Hadoop for that kind of exploratory analysis.
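To make that contrast concrete, here is a minimal, purely illustrative sketch (not something described in the interview, and not an IBM-documented workflow) of the kind of exploratory query a SQL-literate analyst might run over a Hadoop landing zone instead of writing MapReduce by hand. It assumes the warehouse transactional detail and some externally sourced data have already been landed into Hadoop tables reachable through a Hive-compatible SQL endpoint (HiveServer2 via the pyhive library); the host, port, database, table names, and columns are all hypothetical, and a Big SQL deployment would typically use its own client connectivity rather than this path.

```python
# Illustrative sketch only: exploratory query over a hypothetical Hadoop
# "landing zone" exposed through a Hive-compatible SQL endpoint.
# Host, database, table, and column names below are invented for the example.
from pyhive import hive  # pip install 'pyhive[hive]'

conn = hive.connect(
    host="hadoop-landing-zone.example.com",  # hypothetical endpoint
    port=10000,
    username="analyst",
    database="landing_zone",
)
cursor = conn.cursor()

# Ad hoc report across a superset of data the warehouse never held:
# txn_detail is a landed copy of warehouse transactional detail,
# social_sentiment is external data that lives only in Hadoop.
cursor.execute("""
    SELECT t.customer_id,
           SUM(t.amount)    AS total_spend,
           AVG(s.sentiment) AS avg_social_sentiment
    FROM   txn_detail t
    JOIN   social_sentiment s
           ON t.customer_id = s.customer_id
    WHERE  t.txn_date >= '2013-01-01'
    GROUP  BY t.customer_id
    ORDER  BY total_spend DESC
    LIMIT  25
""")

# Print the top customers by spend alongside their social sentiment.
for customer_id, total_spend, avg_sentiment in cursor.fetchall():
    print(customer_id, total_spend, avg_sentiment)

cursor.close()
conn.close()
```

The point is simply that once landed data is reachable through familiar SQL and the tools that speak it, the exploratory work described above does not require writing MapReduce jobs directly, which is the theme the next question picks up.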
Now, of course, among the other announcements IBM made earlier this month was something called Big SQL, basically a SQL-type interface on top of data stored in Hadoop, so that business analysts and others who are steeped in SQL can actually start accessing that data. And I understand that BigInsights builds off the core Apache Hadoop distribution but brings in some of the other elements IBM has in its arsenal around analytics and visualization. So can you talk a little bit about IBM's approach to making it possible for business users to do that type of exploratory work inside Hadoop without having to be MapReduce experts or otherwise really sophisticated data scientists?

That's a great question, and there are many questions in there. Let me start with your Big SQL comment, because you hit on, I think, one of the key points. With SQL, you have a large community of users who can write SQL, and more importantly, I think, you've got a large ecosystem around SQL. You've got tools such as Cognos which business users can use to build reports. So it starts bringing a larger ecosystem for business users to use. You are correct that business users are not going to be the ones writing the MapReduce jobs, and they're probably not going to be the ones writing Hive queries, because they don't have that skill set. But some of them do have SQL skills, and more importantly, you have tools the business users are using today which can issue SQL. So that allows you to bring a broader set of skills and a broader set of capability into the system, and I think this is going to be a key thing going forward. SQL is a starting point, but I think the ecosystem is going to be more important. If I put my crystal ball out on the table, I think one of the key things going forward is going to be the tool set that allows business users to actually start building out analytic jobs, building reports, building insights on the system without having to become MapReduce programmers. So the same thing that's happened in the warehouse world I think will happen in this BigInsights world. I'll stop there, Jeff, and see if there's any direction you want to take this. I know I didn't answer everything, but I figured it's best to see which direction you want to go.

Sure. I think maybe a good way to wrap up the conversation would be to follow on that point: if you could get your crystal ball out a little bit and talk about how you see that developing in terms of bringing more analytic capabilities to Hadoop for business users. Obviously IBM has a play there, and you mentioned the ecosystem; there are a lot of startups in this space, and IBM partners like Datameer, for instance, that are doing some interesting visualization on top of Hadoop. How do you see that playing out? Maybe answer from an ecosystem perspective, but also, just generally speaking, how do you see Hadoop evolving as a tool for the business user?

So let me take that in a slightly different way and talk about what I see big data as, and I'll start off by saying I don't really like the term big data.
And the reason I don't is because it means too many different things to too many different people. For example, I've gone into customers and they've said, well, we don't have a big data problem because we don't have petabytes of data, and you look at it and say, well, okay, but I think what's going on in the industry is actually more than that. Other people say, I don't have a big data problem because I'm not really interested in all the social media stuff; I'm a financial company and all I'm doing is crunching numbers. And other people say, I don't necessarily want to be into big data because I don't have the skills, or I don't know how to use Hadoop. Unfortunately, I think they're missing the point, because what we're really trying to get to is a world where you can actually look at all the data you have available. That data is a combination of things you already have in your structured systems, it could be in your unstructured systems such as ECM repositories, it could be data that you're generating such as machine data, it could be email, or it could be external data like, again, LexisNexis reports, SEC filings, et cetera.

So I think what we want to get to, again, is where the business users, the data scientists and all the other workers have access to data, and it becomes more of an information supply chain management problem. It starts with easy tools, like what we have with Data Explorer, that allow you to understand what data you have and what that data looks like. I think that understanding of the data is an area that's going to evolve; an ecosystem around it is going to evolve that's going to be a combination of things like Data Explorer and enhanced metadata repositories, for example. Then, how do I start asking questions of the data, and how do you provide easy, simple ways to ask those questions, which could be natural language processing over the data as it exists? And then maybe your next step is bringing it into a Hadoop system, where you say, okay, I want a subset of that data; so if I've got unstructured data, what I'm going to do is not bring all the unstructured data in there, but I may run some annotators and bring in a subset of the data, and I may augment that with other data that I've got.

Then the next thing is, okay, now that I've got that data into the system, and coming back to the ecosystem, how do I evolve it? Because if you go back to a warehouse workload again and you think of the flow for data scientists, they're going through a level of data preparation as they bring the data into a model, into a format where they can run and generate their models, where they can generate the scoring algorithms. So you start getting into evolving the data, and as you evolve the data you want to make sure you're keeping track of where you've been, what the lineage of that data is. So when you get to the point where you say, okay, now I've got an analytic algorithm, it could be a scoring algorithm for example, and I want to bring that scoring algorithm into my warehouse, it could be a PureData System for Analytics, how do I keep track of what I've done? And again, coming back to the lineage of the data: what's the tooling that can go through that lineage, generate the data model I've just evolved, build that out in your warehouse or your PureData System for Analytics, generate an ETL flow, and actually bring that data into the system? So really, I think we have to
get to a world where we're seeing this as an information supply chain management problem, where you really have an ecosystem that allows you to look at data, bring it into a Hadoop system, with tooling on Hadoop that allows you to build out your analytics, it could be through things like SPSS Modeler, as an example, or other tools, but as you're building those jobs, keep track of where you've gone and where you've been. So as you get to that end game and you decide, okay, this is really important, you could decide to continue to run that job in Hadoop, if the characteristics of the workload, the performance characteristics and the currency of the data, are sufficient. Or you could say, okay, I need a very high degree of currency on my data, I have to put this into a system that can deal with very high demand, because maybe this is a scoring algorithm you're running as part of a customer care system, as an example, and I want to move that into my Netezza system, which is going to give me a better SLA. But I think we've got to get to the point where this entire ecosystem around information supply chain management becomes a reality. I think Hadoop is an important part of that, and we're going to have to get to the point where we have an ecosystem that brings all these things into play. You talked about visualization; I think that's going to be an absolute key point in that view, and the visualization tooling again has to plug into that overall flow of what you're trying to achieve. I know I said a lot there, but hopefully that all makes sense.

Yes, some very good points. Information supply chain management, I think, is a really interesting concept, and one that maybe has been left out of the conversation a little bit as we've focused on some of the more specific areas of big data like Hadoop and analytics, whereas really it's about being flexible enough to get the right data to the right people in the right form with the right tools. Some really interesting things to think about, and we could certainly continue this conversation for a lot longer, but I want to be sensitive to your time. So thanks to Tim Vincent, IBM Fellow, CTO and VP of Information Management in IBM's Software Group. Thank you so much for joining us today, we really appreciate it. Hopefully we can have you back on to continue the conversation; it was really very interesting.

I'd love to come back on and talk to you about BLU, our BLU Acceleration technology.

That would be great. Fantastic, we'll definitely do that. So thanks so much, and thanks everybody for watching.

Okay, thank you.