and it's going to take another 20 minutes, another two days doing it, and then you look and say, it's not quite right, and you do it again a different way. So I think your energy and initiative to keep looking at data until you find something meaningful is a very important characteristic of data science. You have to love data. You love it, you take a bath in it. Exactly, you need to push it a little bit, yeah.

All right. Data science as a concept has been around for a long time, but Hal Varian's comments a couple of years ago really started to make this explode. How important has Hadoop been in this whole resurgence of data science and that whole title?

How important? Hadoop just started it. I think now people are realizing that Hadoop is available, right? It solves many different problems related to distributed computing. Tough problems, and Hadoop does that automatically. So people hear that Hadoop is good for unstructured data, and that is true. But not because Hadoop is exceptionally good for unstructured data; it's because you don't have competition for unstructured data. Not many other tools can do what Hadoop can with unstructured data, as opposed to the many tools that can do that for structured data. So people will realize that Hadoop can do many good things for structured data as well, and that will expedite adoption of Hadoop over time. So what will happen in the near future, not too far from now, is that Hadoop is going to be a major component of the data infrastructure in all companies, because it can handle unstructured and structured data as well.

So in your role at EMC Consulting, you work with enterprise customers, obviously, EMC's biggest customers. Where would you say they're at with regard to Hadoop adoption? You just made a prediction that it's going to be mainstream. Are they actively deploying Hadoop projects today? Are they putting their big toe in the water, asking questions? Where are we?
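The distributed-computing work Hadoop automates is usually introduced with the classic MapReduce word count. Here is a minimal plain-Python sketch of the three phases, map, shuffle, and reduce, that Hadoop runs for you across a cluster; the input lines and function names are made up for illustration, and this is not Hadoop's actual API.

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs. Hadoop runs this on each node, near the data.
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    # Group values by key. Hadoop does this across the network automatically.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the counts for each word.
    return {word: sum(values) for word, values in grouped.items()}

lines = ["Hadoop stores unstructured data", "Hadoop distributes the computation"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle_phase(pairs))
# counts["hadoop"] is 2: both input lines mention Hadoop.
```

The point of the framework is that only the map and reduce functions are yours; partitioning, shuffling, and fault tolerance are handled automatically.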
I mean, 2009, 2010, it was, what is Hadoop? Where are we today?

There is a really broad range. The spectrum is enormous; you have everything in the spectrum. There are companies with thousands of nodes doing extremely sophisticated things with Hadoop, integrating Hadoop with other tools, leveraging every possible feature. There are other companies that have no idea what Hadoop means. They've never heard about Hadoop. As I told you the other day, I was with a customer, about 10 people in the room, everybody related to data warehousing and business intelligence. Nobody had ever heard about Hadoop.

Wow, okay, so let's step back. Back up, that's a wake-up call.

Many people are just trying it. I just met a few guys today. They have relatively small data sets, not even one terabyte, but they are trying Hadoop. It's such a hot topic, right? People are trying it: let me try too, let me see how it works. So I think there will be some disappointment among people with small data sets. And by small, I mean 10 terabytes is a small data set. 50 terabytes is still a small data set. So if you are below that range, tens of terabytes, Hadoop may be something a little down the road, right? Maybe not as necessary for a small data set. But it's good that people are trying. They are learning, and eventually they will apply that knowledge.

So we were talking about use cases when I interviewed you about a week or so ago. That's right. We were talking about Hadoop initially being used for storage. That's right. Hadoop being used with the data warehouse. And in some cases, Hadoop being used separately, just for data analytics, but that's the rarer case. Can you talk about those three? And I believe there were one or two other cases you were talking about as well.

Yeah, there are four cases. One is you have Hadoop right up front, between your source systems and your data warehouse. So Hadoop works as a staging system that captures everything.
You just dump it to Hadoop, to HDFS, because it's easy to accept anything. And then, after the fact, you figure out what's necessary, what you want to extract, and you store that in the data warehouse. So in this case, Hadoop sits up front, between the source systems and the data warehouse.

The second use case is you use Hadoop as your data warehouse, right? So you use Hadoop to load, analyze, and store, and report out of it.

The third case is to use Hadoop after the data warehouse. You load from the source systems only the information that you want, and use the data warehouse as just a cache, with Hadoop for the really long term. So instead of moving data to tape, you just move it to Hadoop. And the big advantage of this is that it's easier, and you have the data right there, available for you to process whenever you want. It's there with computing horsepower associated with the data. You don't have to figure out where to go to bring the data from tape back to the data warehouse. It's all there. That's the third one.

And the fourth use case is to use Hadoop outside your mainstream systems. You have the intention to analyze something, and you bring data to Hadoop, do your processing, figure out what you want, try to find your insights, but it's not part of your production systems. Once that is stable, then you move it back to the production system.

I love this whole concept of experimenting. And I think it's been around for a long time for people who've been using technology. And that seems to be something you can't really just instill in people. They have to discover it themselves. Is there a barrier for people who are traditionally trained in business intelligence, with data warehouses, to adopting Hadoop and these new methodologies for analyzing data?

Oh yeah, absolutely. I think that is the main distinction between BI professionals and data scientists.
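The first use case above, Hadoop as a schema-less staging area in front of the warehouse, can be sketched in miniature. In this toy version a plain list stands in for HDFS and a dict-per-row stands in for a warehouse table; the record format and field names are entirely hypothetical, chosen only to show the "dump everything first, extract later" shape of the pattern.

```python
import json

hdfs_staging = []  # stand-in for HDFS: accepts any record, no schema up front

def land_raw(record_text):
    # Staging step: dump the raw record as-is, because HDFS accepts anything.
    hdfs_staging.append(record_text)

def extract_for_warehouse():
    # After-the-fact step: decide which fields are worth loading into the
    # warehouse, and project only those out of the raw records.
    rows = []
    for raw in hdfs_staging:
        record = json.loads(raw)
        rows.append({"user": record["user"], "amount": record["amount"]})
    return rows

# Raw records carry extra payloads the warehouse never sees.
land_raw('{"user": "ana", "amount": 12.5, "debug_blob": "xyz"}')
land_raw('{"user": "bo", "amount": 3.0, "clickstream": [1, 2, 3]}')
warehouse_rows = extract_for_warehouse()
```

The key property being illustrated is that the staging layer imposes no schema at write time; the schema decision happens later, at extract time.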
The BI professionals, they understand the tools, they understand what they are doing. They bring the data, they know how to transform it, right? But in general they don't have the theoretical background, the solid statistical background, to look at data and say, oh, I'm going to apply a support vector machine here because it's a good situation for it. They don't know that; it takes years to learn that side of statistics, data mining, machine learning. So I think the theoretical part is important. It's very difficult to have people whose backgrounds are not in computer science, not in math, not in statistics, doing data mining, doing data science. But it's very common to find people with those backgrounds being excellent BI professionals, because it doesn't require that depth of theoretical knowledge.

So no matter which of those four use cases you described, it's fair to say that Hadoop generally is the smaller portion of the infrastructure today. Maybe there are exceptions, but on balance, for most companies, that's the case. That's right. How do you see that changing over time? Do you see Hadoop or Hadoop-like infrastructures becoming the dominant approach and really driving the data strategy of companies? And if I could just add an additional question to that: what about that in comparison to tools like we see from Teradata with their acquisition of Aster Data, where you're bringing the data back in so the BI person can actually use it? So how about that, Pedro? What do you see?

I think there are two aspects to this. One is the ecosystem, the Hadoop ecosystem. The ecosystem is growing, right? And as Hive and Mahout get better and better, more sophisticated, all the complexity of Hadoop will be hidden, right? You can drive a car today and you don't have to know a thing about the engine. It's going to be the same thing with Hadoop, right?
So people can use Hadoop without the complexity, and that will expedite the adoption of Hadoop, right? You'll be able to do whatever you want on top of Hadoop without knowing what's happening underneath, because the ecosystem will hide that for you. It's going to be major. The other big advantage of Hadoop is scale-out. It can grow and grow and grow; you can keep adding new servers without changing your architecture. It's very affordable, right? So if you combine these features, it will become a major tool for the future. I think every enterprise will have Hadoop as a major component of its data infrastructure.

And now it's probably a good time to mention wikibon.org, siliconangle.com, siliconangle.tv, ServicesAngle. Pedro, you were just talking about HBase and Sqoop and Flume and Hive. Jeff Kelly wrote a great piece off of last year's Hadoop World. It's called HBase, Sqoop, Flume and More. Sqoop, by the way, is spelled S-Q-O-O-P. For those of you who don't know, you can just Google that: Jeff Kelly, HBase, Sqoop, Flume and More. He describes it all, you know, MapReduce, Hive, Pig, HBase, Flume, Oozie, Whirr, Avro, Mahout, it's all there. A lot of new terms; it's complicated for a lot of people. If you have questions, hopefully that's got answers, so check it out.

But that's a wonderful part of an ecosystem, right? I mean, an ecosystem should be that way. It should be complex. There should be interdependencies. There should be, you know, almost different textures to it, different aspects of it that you'd really have no idea were there. And we're not quite there yet with Hadoop.

That's correct. I think it's at the very beginning; there are many limitations, right? If you use Hive, you cannot run queries with the same sophistication that you can in Postgres or Oracle or other systems, in Greenplum as well. So Hive is still limited. Pig is still limited, right? You cannot do everything in terms of orchestration using Pig.
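The kind of query Hive does handle well, a simple aggregation, is ultimately compiled into MapReduce jobs over plain files in HDFS. Here is a toy stand-in in plain Python that shows the shape of that translation for a query like `SELECT page, COUNT(*) FROM views GROUP BY page`; the tab-separated log and its column layout are hypothetical, invented for this sketch.

```python
from collections import Counter

# A "table" in Hive is often just delimited text in HDFS.
# Hypothetical layout: user <TAB> page, one page view per line.
views_file = "alice\t/home\nbob\t/home\nalice\t/pricing\n"

def group_by_count(raw_table, column):
    # Map step: project the requested column out of each row.
    # Reduce step: count occurrences per key (Counter does both at once here).
    rows = [line.split("\t") for line in raw_table.splitlines() if line]
    return Counter(row[column] for row in rows)

page_counts = group_by_count(views_file, column=1)
# page_counts["/home"] is 2: two views of /home in the log.
```

What makes Hive attractive is exactly that the user writes only the declarative query; the projection, shuffle, and counting above are generated for them.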
But it is growing, right? The Apache projects are growing, and all these deficiencies will disappear. They will be extremely efficient. All the BI tools that we have today will eventually be able to connect directly to Hive, so you can leverage all the sophistication of the BI tool directly on top of Hive, using HDFS, and it's going to be transparent to the user.

So Pedro, you and I have both got a little gray here. We've been around for a little bit, and we've seen a lot of transformations in the business, from the microprocessor-based revolution, which drove whole new modes of computing and productivity, to the web browser, which really changed so many things in terms of information access and obviously application development. How is data changing your job as a consultant?

Oh, completely. I drive offerings, and my job is all about big data, right? Everything that we do today is about big data. The entire department is about this. So we are actually moving away from standard data warehouses, standard ETL projects, to infrastructure based on the recent technologies, providing the right components to the customers based on Hadoop, on NoSQL databases, on data grids, because it changes everything, right? In terms of latency, in terms of turnaround time, response time, volume of data. You can take a completely new approach to solutions. As Bill was saying today in his talk, the questions are still the same; how you answer the question, what the important aspects are in answering it, that's what's changing. So all this new wave of technologies, right, they are helping with this, and Hadoop is part of that.

So if I could just ask one last question. What are you looking for here? What are some of the sessions that look really interesting to you, and what are you looking to learn?

I think visualization is one aspect that is still not well explored. Visualization is behind, I believe, the rest of data science in general.
We have had statistics for tens of years, hundreds; for more than 100 years people have been applying statistics. Data mining has been around for a long time. Machine learning has been an active area of research for a long time. But visualization, I think, is not quite there yet, applied to numerical visualization, not the artistic part, right? I think we have a lot to improve in terms of flexibility: new tools providing more flexible, easier-to-use functionality for the user to create new, powerful visualizations on the fly.

So that's what you're going to be looking for here, those new visualization tools. That's correct.

Okay, great, thanks. Pedro de Souza, thanks very much for coming on theCUBE and sharing your insights, your knowledge. Another great interview, and good luck with the new endeavor. I mean, basically you're telling us that you guys are betting your business on data. That's right. And it's probably not a bad bet. So thanks again. Thank you very much. And great to have you, great to meet you. Appreciate it. Thank you, Pedro. Pleasure.