Okay, we're live back here at Hadoop World 2012. We're here with Stefan from Datameer, who has been in theCUBE many times. Welcome back. We want to squeeze you in because you're a CUBE alumnus, and we always make room for CUBE alumni. And we're expanding; you can see we've got Studio B over here, so after the interview you can go to Studio B and do some additional commentary. We do want to hear from you. Last time you were in theCUBE, we talked about unstructured versus structured data, and that was the rage back then. Now it's pretty much out of the conversation; it's a done deal, and people are admitting that's just the way it is. It's not either-or; both are out there and you have to deal with both. In some cases, structured data sets are great for rolling up reports out of the NoSQL environments. So that's your bread and butter. Give us an update on that, on your business relative to that dynamic, and then we'd like to ask you some questions around scale.

Yeah. Well, let me back up a little bit, right? The big picture is that traditionally we have had a three-tier architecture for data analytics. We have ETL, extract transform load, where we take data from a data source and massage it into a static data warehouse, or pick your favorite database, with a Star or Snowflake schema. We highly optimized the data into that schema for performance reasons, because we had limited storage and compute. And then we put BI on top of that, right? So it's a three-tier architecture. Every time you add a new data source, you change your schema, which means you change your BI. Every time you change your BI, because you have a more sophisticated question, you change your schema, which means you change your ETL. We've been doing this for 40 years now, basically because we had limitations in storage and compute. What Hadoop really brings to the table is that we don't have those storage and compute limitations anymore; Moore's law picked up. So we take structured or unstructured data and pull it into Hadoop without pre-optimization, without Star or Snowflake schemas or anything like that. And then, because Hadoop has so much compute power at such a low cost, we can create as many data models, as many views, as many transformations on the raw data as we want, but we always leave the raw data there. So I think we really break the concept of that slow three-tier setup: three vendors, three groups of people, three pieces of hardware. And the beauty of all of this, as you said, is that with NoSQL we're not just limited to structured data. We can create as many views as we like on all kinds of data. And that's important, right? The problem today is not so much big data. Big data is a big buzzword to make big money, I would say. The problem is...

Or big views, if you're a media company like us.

Yeah. The problem is really data complexity, right? We work with one of the biggest banks in the world. They have to implement Basel II, the regulation that came out of the financial meltdown. They have 250 data sources. How do you want to do that the traditional way? That's really the power of Hadoop and the modern approach, with Moore's law kicking in; we can do a lot of things.
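To make the schema-on-read idea concrete, here is a minimal sketch in plain Java. It is an illustration of the general pattern, not Datameer's implementation, and the file name and record layout are invented. The raw file is never rewritten into a warehouse schema; each "view" is just a function applied at read time, so a new question never forces a schema migration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal schema-on-read sketch: the raw data is never reshaped into a
// Star/Snowflake schema. Each "view" is just a parse function applied at
// read time, so adding a new question never forces a schema migration.
public class SchemaOnRead {

    // One "view": pull (region, amount) out of a raw event line.
    static Map.Entry<String, Double> revenueView(String rawLine) {
        String[] f = rawLine.split(",");
        return Map.entry(f[1].trim(), Double.parseDouble(f[3].trim()));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical raw events, e.g. "2012-10-24, EMEA, clickstream, 19.99"
        List<String> raw = Files.readAllLines(Path.of("events.csv"));

        // Aggregate revenue per region: one of many possible views,
        // computed on demand from the untouched raw data.
        Map<String, Double> revenueByRegion = raw.stream()
            .map(SchemaOnRead::revenueView)
            .collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.summingDouble(Map.Entry::getValue)));

        revenueByRegion.forEach((region, total) ->
            System.out.println(region + " -> " + total));
    }
}
```

A second question just means a second parse function over the same untouched file, which is the contrast with the three-tier pipeline described above.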
I want to ask you about Abhi Mehta's comment; that founder was just on. He said you have to throw away all your assumptions, because he built a successful business from scratch in the financial vertical around using Hadoop. It's a little bit different; you have a different background and a different kind of company. But he said, for the people out there, throw away all assumptions. So now, with the new solid state advances alongside disk, you've got compute power, and you mentioned you can bring new stuff in. What assumptions do customers have to throw away in dealing with your value proposition, and with Hadoop in general? What would you say?

So I think we really need to get away from that three-tier architecture, from that pre-optimization of data, right? Do you guys have iPhones, by any chance?

Yeah. Macs.

So, you know, the reason we have that kind of wonderful technology is that we have more compute power, stronger chips, and I think it's time we bring that to analytics. The opportunities are gigantic. Now, I think traditional business intelligence will stay the way it is; it looks at transactional data, you know, how much money per region did I make? What we really see as the big pull for Hadoop and this NoSQL technology wave is looking at interactions. How much click-through did I get from my ads? And not just how much click-through I got, but how much of it actually converted into deals? Right, that's three different data sources, one of them semi-structured, machine-generated data, and this is where it gets interesting, where we actually try to create this 360-degree view across the organization to understand what's going on. We need to understand the interactions of our customers with our brand in order to optimize. I mean, that's the only thing that really differentiates companies from each other today, really looking at how people interact with us. Did they come back after they called into our call center? Are there cross-sell and upsell opportunities? There are gigantic business opportunities to lower churn and increase conversion rates, and that's really where we see this technology being extremely successful. And, I mean, we're obviously going through a traditional hype cycle here. I think we need to bring the conversation down to where Hadoop and NoSQL are really helpful tools in the toolbox, and stop talking about how they can do the dishes and print money as well, and really look at what they bring to the table. It's obviously not good for real time; Hadoop is about sequential data access, all that stuff.
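Stefan's click-through-versus-conversion point is essentially a join across three sources: ad impressions, clicks, and closed deals. A toy sketch of that funnel arithmetic, with invented ids and record shapes:

```java
import java.util.Set;

// Toy version of the three-data-source conversion question: impressions,
// click-throughs, and deals joined on a shared id, to see not just
// click-through rate but actual conversion. All ids are invented.
public class ConversionFunnel {
    public static void main(String[] args) {
        Set<String> impressions = Set.of("u1", "u2", "u3", "u4", "u5"); // ad impressions
        Set<String> clicks = Set.of("u2", "u3", "u5");                  // click-through events
        Set<String> deals = Set.of("u3");                               // transactional deal records

        long clicked = clicks.stream().filter(impressions::contains).count();
        long converted = deals.stream().filter(clicks::contains).count();

        System.out.printf("CTR: %.0f%%  click-to-deal: %.0f%%%n",
            100.0 * clicked / impressions.size(),
            100.0 * converted / clicked);
    }
}
```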
Well, you know, I've been covering the BI market for a long time, and we've been hearing from the more traditional BI vendors about BI for the masses. They were going to roll out self-service BI, and there was going to be wide adoption in the enterprise, and it never really happened. Adoption is still pretty low just in the traditional BI world. So that problem is still going to be an issue in the big data world. How are you attacking it? How is BI in the big data context going to be different from BI in the more traditional context?

That's a very good question. I would say that my observation, in almost two decades of working in the BI space, is that there is a friction point, and the friction is between IT and the business users. To be very honest, even though we want to democratize data access on the BI side, IT controls which data I have access to. They model the schema, they set up the databases, right? So if I want to get a new insight today, I basically have to go back to IT and argue why I want to change the database schema. Let me give you a real-world example. We're working with a retailer, a pretty big retailer, and they're trying to optimize their logistics. So they look at market-basket analytics, all that data, in a beautiful, gigantic MPP database. And they have a young data analyst who says, look, I found this 10-megabyte weather data set, and I'd like to bring it in, because our highly sophisticated data-mining algorithms are only moving the needle by half a percent. So instead of changing the schema, which was totally impossible in this gigantic MPP database, he used Datameer, pulled 2.5 trillion records into the Datameer environment, and added the 10-megabyte weather data set. And of course they sell more ice cream and water when it's hot, and more packaged food and batteries when the next storm is coming up, but they could never see that before. So again, what is really interesting is the schema-free approach, the approach of bringing data together, because it gives you a more holistic view. And for data scientists this is important; everybody now knows that with a better feature vector you actually get better results. And as the folks at Google said, it's about more data, not better algorithms.
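The retailer's weather join maps naturally onto a classic Hadoop map-side (replicated) join: the 10-megabyte weather table is small enough to load into memory in every mapper, so the trillions of sales records pick up the weather attribute without a schema change or a shuffle. A hedged sketch against the standard Hadoop Mapper API; the field positions and file names are assumptions, not the customer's actual layout.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side (replicated) join: the tiny weather file is loaded into memory
// in every mapper, so the huge sales table never needs a schema change or
// a shuffle to pick up the weather attribute.
public class WeatherJoinMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> weatherByDate = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // ~10 MB lookup table, shipped to each task (e.g. via job.addCacheFile).
        try (BufferedReader in = new BufferedReader(new FileReader("weather.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split(","); // "2012-07-04,hot"
                weatherByDate.put(f[0], f[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(","); // date,store,sku,qty
        String weather = weatherByDate.getOrDefault(f[0], "unknown");
        // Emit (weather condition + sku) so a reducer can sum quantities.
        context.write(new Text(weather + "\t" + f[2]), new Text(f[3]));
    }
}
```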
You know, one of the other things is the recent release of Datameer 2.0, I believe. We talked a little bit beforehand about some of the collaboration capabilities. Share with our audience a bit about that, and talk a little about the importance of collaboration when it comes to working with data.

Absolutely. In every organization there are multiple people working on data, and there are folks that like SQL, but how do you collaborate on SQL? Well, you send it around by email, or you put it into a source control system. So I think what's very important, whether you write software, SQL code, Java code, or collaborate on data analytics, is that everybody understands what you're doing; that's really the first hurdle, and then we can talk about it. The spreadsheet user interface that we provide is very familiar and very visual. And in 2.0 we actually introduced the first version of our data lineage capabilities, which gives you an actual graph representation. This is very important. For example, one of the biggest problems in the banking meltdown was that data was copied so many times through ETL, with a rounding mistake here and a rounding mistake there, that folks just lost track of the data flow in their organization. Now, with our approach, you just bring the data into Hadoop and do as many transformations as you want, and we can visually show you a graph: this data set you see right here was a multi-join of those three tables, then a string manipulation here, and so on. So I think it's very important, first of all, that you have a common context you can talk about, and then, again, that the tools support the collaboration effect.
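The lineage capability Stefan describes boils down to every derived data set keeping a reference to its inputs and the operation that produced it, so provenance can be reconstructed as a graph on demand. A conceptual toy in Java, not Datameer's actual implementation; the data set names are invented:

```java
import java.util.ArrayList;
import java.util.List;

// Toy lineage graph: each derived data set records its inputs and the
// operation that produced it, so the full provenance of any result can
// be printed (or drawn) on demand.
public class LineageSketch {

    record Dataset(String name, String operation, List<Dataset> inputs) {
        static Dataset source(String name) {
            return new Dataset(name, "raw source", List.of());
        }
        Dataset derive(String name, String op, Dataset... others) {
            List<Dataset> in = new ArrayList<>(List.of(this));
            in.addAll(List.of(others));
            return new Dataset(name, op, in);
        }
        void printLineage(String indent) {
            System.out.println(indent + name + "  <- " + operation);
            inputs.forEach(d -> d.printLineage(indent + "  "));
        }
    }

    public static void main(String[] args) {
        Dataset trades = Dataset.source("trades");
        Dataset rates = Dataset.source("fx_rates");
        Dataset books = Dataset.source("books");

        Dataset joined = trades.derive("trades_fx", "multi-join", rates, books);
        Dataset report = joined.derive("risk_report", "aggregate + round");

        // Ask the graph: where did this number come from?
        report.printLineage("");
    }
}
```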
So it's all about the data lineage and seeing where the data came from; you mentioned those little errors, but they add up to the point where the data is suspect. Now, we've had a few guests on talking about bringing analytics into the Hadoop environment as opposed to moving the data out of Hadoop, and we've been touching on that in our conversation, but talk a little more specifically about why it's important to bring the analytics to where the data lives, as opposed to moving the data out to a separate system.

I think it's important that we make things simpler. And again, the trick here is to invest more in hardware, storage, and compute, but minimize the complexity. The really big problem in organizations is complexity. They have a piece of hardware for ETL. They have a piece of hardware for their database. They have a piece of hardware for their BI tool. Then they have three groups working on this, three vendors, three phone numbers, and everything is highly interdependent. Putting everything into Hadoop, the data integration, the data management, storage, compute, and the analytics, and leaving it in Hadoop, is a reduction of complexity. I mean, Hadoop doesn't even cough if you pull out hardware. On the other hand, if one of your ETL servers fails, your whole analytics pipeline basically collapses. So the beauty is putting everything into Hadoop and giving access both to the data scientists who might like to write code, Cascading, Pig, MapReduce, whatever it might be, and, with our tool, to the folks who really understand the data, the business users who have ideas and want to figure things out. That's the right approach. I think we need to reduce complexity in our data centers.

You know, with that being said, we're hearing a lot at this conference about the need to integrate Hadoop seamlessly into your existing environment. We're hearing that the strategy of Cloudera, Hortonworks, and others is to partner with the likes of Teradata. So what is your approach? For the time being, people aren't ripping and replacing at this point. How do you approach integrating with existing IT when your real view of the world is, let's try to load everything into Hadoop?

Well, first of all, I totally agree with you. Integration is a must-have, and there's no way around it. And I don't think our approach is really rip-and-replace. I think we have a new tool in the toolbox that allows you to address other kinds of use cases. That's why I said traditional BI, my RDBMS with ETL and BI on top, is just fine, but that's transactional data. If we want to understand interactions, this is where Hadoop comes in; it's a new use case, a new approach, there's a big business opportunity here, and that's what people are starting to get. And for those new use cases we actually integrate more seamlessly than any other vendor out there. Other vendors' approach is, well, you copy the data out of Hadoop. We don't care where the data lives. We can link to data in your Teradata, your Oracle, your DB2, your MySQL or MSSQL. We have connectors to 25 different kinds of data sources, where all the different databases count as one kind. We talk to social media, Twitter data, Facebook data, JSON, XML. We even have a connector to mainframe data, to pull data out of the mainframe, as you heard from Phil Shelley earlier today, and actually push it back into the system again. This is really, really important, but integration does not stop at pulling data in or pushing data back out. We have a lot of customers that, you know, run a Crystal Report on a database, or what have you, so all our connectors are actually bi-directional. But let's be realistic. Enterprise integration also means security integration, where we do more than just Kerberos; Kerberos is not really the industry standard, the industry standards are LDAP and Active Directory, and we actually integrate with all of them to make sure it's possible. You need monitoring solutions, right? You need alerts and notifications. You need data cleansing integration. You need metadata integration. We actually do all of that; we're not a bubble. We understand that if you really want to be installed at the biggest banks in the world, which we are, there's a whole collection of things you need to do.
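The "link, don't copy" connector idea can be illustrated with plain JDBC: read rows directly out of the existing database and hand them to the Hadoop-side pipeline, leaving the source system in place. The connection string, credentials, and table below are placeholders, and a real connector would add the security (LDAP, Active Directory, Kerberos), batching, and write-back Stefan mentions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// "Link, don't copy": stream rows straight from the source RDBMS instead
// of exporting flat files first. Assumes a JDBC driver on the classpath;
// host, credentials, and table are hypothetical.
public class JdbcLink {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://dbhost:3306/sales";
        try (Connection con = DriverManager.getConnection(url, "reader", "secret");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT order_id, amount FROM orders")) {
            while (rs.next()) {
                // In a real pipeline this row would be appended to a file on
                // HDFS rather than printed; the point is the source stays put.
                System.out.println(rs.getLong("order_id") + "," + rs.getDouble("amount"));
            }
        }
    }
}
```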
So talk about it from a company perspective. Give us an update, and talk a little about... there are a lot of entrepreneurs here, and this ecosystem is really just exploding. There are so many different types of vendors here, so maybe talk a little about your experience building a company from the ground up in this kind of environment.

Yeah, great question. So I'm coming up on 10 years with this technology. I was one of the first three guys who actually contributed to Nutch, right? And if you wonder where the Hadoop logo comes from, now you happen to know where yellow elephants run around. It's been an interesting ride. You know, we didn't start in a garage, but back then at Nutch we all worked from home, and we contributed and brought Nutch from more of a research-focused project into a platform. I worked on the plug-in system for Nutch that actually allowed companies to be built on that platform. And this is where it took off, where we saw storage and compute requirements beyond web crawling and indexing. And it's really, really beautiful. What most people have forgotten in the last few years, with all the hype about big data, is that Hadoop is actually highly optimized for sequential data access, which means it's incredibly fast even on your laptop, right? A lot of people have this instant reaction: oh, you have to have a big cluster. But that's not true. It's actually very efficient on your laptop, at that scale. So building a company around a genuinely groundbreaking, different way of doing things is very exciting. It's interesting to see; if you come to my presentation later, I'll show a graph of the Hadoop ecosystem. There were five companies in 2008; today we have 111 companies that say they have a big data something, right? Big data pretzels, or a big data fridge now. It's really interesting. But what we generally see with the traditional vendors is the usual pattern: first they ignore you, then they fight you, then they integrate with you, and eventually you really disrupt the market. You know, folks say, well, Hadoop is a perfect ETL engine. Well, guess what, we've had distributed ETL for 15 years. Hadoop is a great MPP database, other folks say. Well, we've had MPP databases for 10 years, and systems like Hive are actually not very fast and only support about 30% of SQL. And then you have folks who just repeat "big data" often enough, so an in-memory database is now big data as well, and copying data back and forth is the big story. I think this is an extremely powerful new technology, and just jumping on the hype of the buzzwords is not helpful. I see a lot of rather young companies doing that, and there's a risk: as an entrepreneur you invest a lot of time and a piece of your life, you really put your heart into this, but it won't have a chance if you're not truly doing something new. So that's what we see, and we're very excited about really changing something here. We see folks doing amazing things, but it's really important to understand the low level, you know, and really go for an innovative approach that helps people. Just having the right buzzwords on your marketing material will not cut it.

Well, big data meets big analytics meets big money, right? So that's the big story. My final question to you is a much more practical one. A big challenge, and opportunity, that everyone's having here at the show is getting data out of, say, HBase, for example. There's just too much information in there, and export tools are challenging. How are you guys addressing that? I'm going to pretend for a minute that I'm a potential customer with a big HBase database, fairly good size, and I want to get stuff out. I don't want to export it to Excel; that definitely won't work, right? How do I do that? Can you help me?

Yes, absolutely. We have HBase connectors, or Cassandra connectors for that matter. And what is important is...

How much can I move out? All of it?

So here's the difference between Datameer and the other spreadsheets you hear about, right? Traditionally, a spreadsheet copies data into the spreadsheet and then holds the data hostage. Our spreadsheet is a design tool. It basically allows you to design the data processing pipeline you want to apply, and then we push the processing downstream, in that case to Hadoop, or HBase for that matter. What Hadoop brings to the table is that the computation, the analytics, happens very close to the data, and that's where we get the incredible throughput that we see. Traditionally you move data around to do the analytics, and you need super expensive network equipment. What Hadoop is all about is running the analytics on the CPU that also has the data. That's the big fundamental breakthrough here, and it's the approach we take. So don't move the data into the spreadsheet; move the spreadsheet to the data. And then you can do the things you're looking for. Minimizing the movement of data is the trick.

What are the limitations on how much data can be manipulated?

I mean, we operate at petabyte scale with Datameer, right?

On analytics?

On analytics.

Wow.

We have two-digit-petabyte analytics running daily. We have customers running two million reports a year on our platform. And the beauty is that Datameer natively sits on Hadoop, where other analytics pipelines just copy data out of Hadoop. We push our analytics into Hadoop; we just expose the design tool as a spreadsheet.

So, given that you were involved in Nutch with Doug Cutting and those guys, what do you think about Doug's focus on Avro?

I think it's great. I really like Avro. I mean, there's a set of different serialization and storage tools out there, and Avro obviously does a lot of really smart things to optimize throughput, which is very important. I think standardization on data serialization is the right way to go. It's very early; we have to do more on performance optimization there. So we integrate with Avro, right? We can read and write the Avro file format, but under the hood we use a highly optimized tuple format to get certain performance that we'd like to see. But this is the right way to go. This is the future, because in the long run we just want to have storage and compute as a plug in the wall, and we have to have open architectures there to integrate with all the different data.
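For reference, a minimal round trip through the Avro Java API Stefan is talking about: the writer embeds the schema in the file itself, so any reader can deserialize it later without out-of-band schema agreement, which is the standardization-of-serialization point. The "Click" record schema is invented for illustration.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Write one Avro record with an explicit schema, then read it back.
// The schema travels with the file, so no separate schema agreement is
// needed to deserialize it later.
public class AvroRoundTrip {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Click\",\"fields\":[" +
            "{\"name\":\"user\",\"type\":\"string\"}," +
            "{\"name\":\"ts\",\"type\":\"long\"}]}");

        File file = new File("clicks.avro");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("user", "u42");
        rec.put("ts", 1351084800000L);

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, file);
            writer.append(rec);
        }

        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<>())) {
            for (GenericRecord r : reader) {
                System.out.println(r.get("user") + " @ " + r.get("ts"));
            }
        }
    }
}
```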
Okay, Stefan, thanks for coming on and sharing; we gave it a little bit more time, and it was great to have you, a pioneer in big data. Business is good, obviously doing even better than last time. It is big data, big money, big analytics. Congratulations on all your success. Okay, we'll be right back with our next guest after this short break. Thank you.