Live from the Fairmont Hotel in San Jose, California, it's theCUBE at Big Data SV 2015. Live inside theCUBE, inside Silicon Valley's core Big Data community, and we are theCUBE, broadcasting all the signal, all the data, sharing that with you. I'm John Furrier with SiliconANGLE. I'm with my co-host Jeff Kelly, Big Data Analyst at Wikibon.org. Our next guest is Rick Stelwagen, Data Lake Program Director, Think Big, a Teradata company. Welcome to theCUBE. Thanks, happy to be here. Great to have you. I love the word Data Lake. Dave knows I don't like that word, but I like data ocean. But, you know, data is a changing market and we always like to joke and we always have to debate. You know, big data, fast data, small data, Internet of Things certainly changes the game. So, like, this data market is evolving very, very rapidly. Obviously, customers are deploying Hadoop and other solutions, in-memory analytics. It's exploding. Big Data is the killer app. You're seeing it embedded in applications, you're seeing it on the agenda for everybody out there. All the customers, and certainly you follow the money, companies are going public and startups are rallying around. So, Rick, give us a take, okay? The Data Lake really implies, okay, data warehouse, you gotta have data, you gotta put it into applications in real time. What's your take on the current state of the big data market? Well, I think, you know, things have really evolved over the last few years. I got involved with Big Data about three years ago for Teradata, doing POCs and benchmarks for the Global Benchmarks Center, kind of globally. So, I had a lot of experience as Teradata rolled things out. And then, when Think Big came along, we were really excited to join them because they were able to provide a lot of guidance along that way.
So, what we've seen is customers, they were dipping their toe in the water, they were playing around with it, they were dumping data into the lake or the reservoir or the ocean, whatever you want to call it. It was a landfill. Yeah, and you know, they really weren't doing all the standard stuff that everybody's always done in data management forever, right? I mean, things that I've been doing and Teradata has been doing and all the other big companies have been doing, they weren't really putting the governance in place, they weren't making sure of authentication of not only the users, but the data before it was ingested, right? And they weren't doing all the normal things where you track and trace and understand the lineage of your data. Because now we've added a lot more lineage points, and that's really the big difference here. Explain lineage, because you bring up a good point. People, from a compliance standpoint, would store everything. The storage business has been booming because of it. Okay, store it, I'll get to it later. But lineage brings up some knowledge about the front end of it. Can you explain what that means? Well, you need to know where data came from, you know, who was the source of it, who had the original rights to it, how did it originally get created, right? And that's really important to be able to track it all the way end to end, right? So that's really the big part of lineage. And then as the data transforms and moves into different forms, first to be just accessible, then to be maybe used for analytics and maybe to be used for downstream facilities. You know, data warehouses may actually get involved towards the back end of it as well. So, you know, we'll track anywhere from four to 12 lineage points oftentimes, sometimes even more depending on error handling and those kinds of things. Yeah, I think, you know, to your point, data lineage is important for a couple of reasons.
One, because it helps you assess the validity of that data, the quality of that data to some extent, where did it come from? So when you're doing analysis, you can actually have some confidence that it's gonna deliver good results. But of course also, there's the compliance question and you've got to show, you know, when the feds come knocking, you've got to show who touched that data, when, where it came from, that kind of thing. Talk a little bit about lineage though in a big data context and specifically a Hadoop context. Are there challenges that we don't see in the more traditional data warehouse space and more traditional database space when it comes to tracking that kind of thing, understanding it? Are there some nuances that are specific to big data and Hadoop that you need to tackle? I think it's more of a cultural thing. It's not really a technology thing. It's really kind of, people said, this is the new thing. I'm gonna use it real quick, as fast as I can. And they kind of forgot about their roots and what really needs to be there. There really isn't anything preventing you from having the same kind of security and provenance and authentication. It's all still there; it's not that it is missing. It's just whether it's being used and really, how do you apply that practice to this new kind of thing? So I just want to take a step back actually. You talked about the different terms, John, data ocean, data reservoir, whatever you call it. How, how? I was not gonna be a doctor at any time, so I'll tell you that. You're right, so from a Think Big perspective, I mean, how do you define what is a data lake? Because I think we're seeing a few different concepts out there. From your perspective, what is a data lake? Well, to us a data lake is a jumping-in point, to use another metaphor, but fundamentally, you know, it's the point where you get data into a Hadoop cluster or a set of clusters.
And we want to make sure that we track how, where it came from, all right? And that's kind of the beginning, because what you're gonna see is as the data comes in, it gets schooled and organized and transformed, it then becomes like a data reservoir. As a matter of fact, what we like to say is our data lake program is all about building a data reservoir. Yeah, so talk a little bit about that and how you actually go about building out a data lake. You gotta be agile, but you also have to be secure, compliant, data lineage, all those kind of things. What's Think Big's approach to actually practically accomplishing that? Well, we always start off with kind of an architecture strategy session. We try to understand the objectives and understand the data streams that they want to flow into the lake, all right? So we kind of start off by understanding that, and sometimes we kind of boil the ocean and look at everything with big companies and look at all the sources and just come out with a plan. Sometimes we get real focused. We have some starter things where we kind of say, let's just look at a couple of streams and get one data owner or steward so you can really show success. And most of the time those data owners or stewards already have an application in mind, all right? They may have an application or they may be doing something like offloading a reporting system. They may be doing something like adding a new thing where they're trying to look at their machine data and try to optimize their manufacturing process. So we kind of start at the very beginning and do this planning type of approach. We have a team-based approach, okay? So we're not just like a couple of unicorns, this is the new term, right? Where you come in and have an expertise, but we have several experts in lots of areas and we're able to look at it holistically, right?
So we have a data scientist, we have cluster management guys, we have Hive guys, we've got Spark guys, we have everything, you name it. We kind of cover the gamut and each area is focused and we're able to come in and look at that strategy. So once we've kind of laid out an architecture and design in, you know, a two-to-four-week timeframe, then we start to deploy, right? And part of our architecture and design is to deploy serious governance, right? So right at the beginning, like we talked about, at the source, we gather and capture metadata. Operational, not just schema metadata, but operational, business, security, business index types of metadata. All those kinds of things, it's gonna require some customization in the front to understand the engagement, right? To understand the people and the organization so that you can capture that up front. What was so cool when I joined Think Big is that they've been doing this for four years, where they capture that metadata up front. You know, Hadoop has some issues with ingesting data, either big file problems or small file problems, that we all know about, or anyone who's just played around with it. So Think Big kind of started to say, you know, we gotta understand the data. We need to get it ready for ingest. Let's really understand it, and then we throw it across the wall. Whatever transport, secure transport method you have, we do the ingest, lay in the data and the metadata together, right? Use the metadata to know how to do the next step. Maybe even push the metadata into a tool that allows you a visualization so you can graphically see your lineage. So, can I ask you about the POC? You mentioned global benchmarking. Global is a huge issue. Every time we talk about cloud and big data, the global consumption contract with customers becomes pretty big. I mean, so it's hard for startups to compete at a global level. They have to have these partnerships.
You guys are doing a lot of global benchmarking. Talk about that specific global impact in terms of the customer. Obviously large customers are gonna have that kind of global requirement. And then the question I just posted on the CrowdChat, to add on to that, is that the Wikibon data shows that the buyers are shifting workloads. How is Teradata dealing with that shift specifically? The global and then the shifting of workloads, and how does that impact the data lake and whatnot? Sure, global. Well, this depends on the company and the business they're in, whether or not they're gonna do everything on-premise versus they start to move to cloud, right? So we've got some customers, most of them are staying on-premise because Teradata is used to the Fortune 2000, right? And we have a lot of banks. None of that stuff is ever gonna go to cloud, right? But there's a lot of high-tech manufacturers where they can kind of control things, and really the interpretation of this binary log data is not a secret. You know, only they can understand it, and actually Teradata and Think Big help them to understand that data. So that kind of stuff goes to the cloud a lot. So we see deployments going either way, you know? It's really a combination of secure and public, but frankly what we're seeing a lot more emergence of is the private cloud. And so that's really where you can guarantee end-to-end security and, you know, there's VPNs in place. So that's really where we're seeing a bigger shift right now with our customer base, going towards that private cloud, not so much the public cloud. You can drive the cost down a lot more in public cloud, there's no doubt about it, but you also open yourself up to more vulnerability, right? And then the second question, it was around... Shifting the workloads, how are you guys dealing with that workload shift?
Exactly, so that's exactly what I've been doing for the last three years with the Teradata base. A lot of the Teradata base wanted to move their history off of Teradata, the warehouse, and be able to concentrate and focus on the most recent few years. And so we've seen a big shift of moving historical data, some stuff that's cold, mostly off of Teradata and into the data lake, okay? And actually Teradata has built all kinds of tools to promote that, to tell you the truth. We have federated query, we have this thing called QueryGrid and all those kinds of things that help you do that. So Teradata as a company has seen that shift, some archiving going on. Sometimes the deep history is necessary, it needs to stay inside. Because you still can't touch the analytics on the Teradata side, but you can certainly do deep history and lightweight analysis, aggregation, multi-dimensional analysis over in the lake very readily. So it's interesting. So in the past, is this essentially causing any kind of issue around the revenue stream of Teradata? So you've got customers moving some of the older data off to the data lake. In the past, was that data kept on Teradata or was it just maybe discarded and put on tape somewhere else? Or is this kind of a new approach? Well, no, usually they would keep the history as long as they could, right? And they may offload it and then bring it back online. So what it really allows them to do is be able to bring more history online quicker. So they're really not changing the practice; they really had to consolidate what they did or add more nodes, right? So we see a lot of that. But I'll tell you the truth. Teradata has made its business out of unplugging Oracle and SQL Server instances and consolidating silos, okay? And so we've always done that with the analytic workload. What we're seeing now is more the operational workload moving over to Hadoop.
That's really been the big focus because frankly, the technology in this space lends itself more to that kind of workload. So it's not really so much an erosion; what we've seen is actually an expanding of our revenue base, including more things that we didn't normally do. It's not really in our wheelhouse. Not from a cost-effectiveness standpoint. So when you say operational workloads, what do you mean? What are some examples? You know, BI reporting, multi-dimensional analysis, that kind of stuff, right? Rollups, traditional OLAP kind of stuff, right? Those kind of things, we see a lot of that happening in the data lake. Matter of fact, Think Big has built IP around that. Very reporting. So kind of building on that. Talk a little bit about building and running this professional services business in the big data context. So when we talked to practitioners in the early days, of course, it was, what is Hadoop? How do I bring it in? How do I integrate it? But you're seeing things like yesterday with the Open Data Platform announcement, which Teradata is a part of. The goal there is to really solidify the core of Hadoop so it can kind of spur adoption. As Hadoop itself matures, how does that change what a professional services firm like Think Big does? Are you finding you're doing less work around kind of that initial building out of the foundation and moving more to some of the advanced analytics? How do you see that playing out as the technology itself kind of matures and gets stronger? Well, it's important to note that Think Big is really agnostic to all of these things. And we plan to deploy on Cloudera, MapR and Hortonworks equally, and whatever our customer base sends us to. It's also important to note that we're always hardware agnostic as well. So we're a pure-play Hadoop vendor, professional services. So these things like the consortium that you mentioned before are cool. I understand it and we'll be able to leverage some things more.
Maybe there'll actually be a meaningful metadata standard someday. I've been part of many over the years that never seem to have taken off. So I'm hopeful maybe that will happen and consolidate, but the forces usually drive away from that. So it's challenging. Talk about how hard that is to pull off. The metadata piece of it is really a critical piece but it's hard to just start. Well, every software component has its own metadata and needs to manage it and control it. So the last thing they want to do is to be dependent on somebody else's metadata repository, right? So that's the reason why it's always kind of never happened. But there's consortiums, and there's products out there, really big ones that are very expensive, that allow you to be able to bridge between all those metadata pieces, but most people don't want to spend that million dollars plus on that kind of stuff. It's a customer forcing function too. The customers at some point have to move down and evolve and say, okay, what's state of the art? What's bleeding edge? What do I need as a requirement, kind of table stakes? So with that I've got to ask the question from a customer's perspective, looking at the Big Data SV landscape or the Strata conference and Hadoop World: what is the customer's view, if they're looking down on the stage, if you look from the balcony of the industry, because they're the ones who really will be deploying these solutions? What do they think of all this action going on? You know, the big companies going public like Hortonworks, Cloudera is growing, and MapR, you've got startups, you have the big players like Teradata out there at global scale. What's the view from the customer? Are they moving along nicely? Are they confused? Are they looking at broad consolidation? What's your takeaway, if you could tease that out?
Well, I think they really are looking for two things that are really important: practices with open source, the right practices to be able to use, and also they're really looking towards this consolidation where you see these components that are kind of loosely coupled now become much more tightly coupled, right? Because it's one big system integration nightmare sometimes, to pull all these things together. And so, professional services organizations like Think Big become experts at being able to pull those pieces together. And out of that kind of grew IP to actually accelerate that, right? So that's kind of the natural evolution of what happened, reusable assets, and moving on like that. So the customer is looking for any kind of help, they're looking for a team to come in and guide them in all different aspects of it and kind of give them an overall strategy and try to figure out exactly what path to take based on their particular data sets or their streams or their enterprise requirements, because their system environment could be different every place. So really they're looking for guidance. Consolidation is really gonna be critical. I think there's no doubt about it. So these consortiums, I hope things go well for them. From our perspective, it could just be one less thing we have to integrate with. As an industry observer, obviously, and a participant at Teradata, you're not a VC and you don't work for Morgan Stanley, Goldman Sachs, the big guys, but there's a lot of action going on. People are getting funded, super funding. There's a lot of startups out there that are looking for a home and, as we were saying in our intro, consolidation's happening, we see new waves coming like Internet of Things, and just seeing things like metadata discussions is critical. So we are obviously pro big data. We think that's gonna be part of the force of the cloud and also the infrastructure with virtualization, seeing some cool things there.
So there's always another wave coming. So what do you think about that? I mean, if you look at, you know, your friends who all work for startups, we have friends who work for big companies. What's your personal take on this market? Do you think there is another wave coming? Is it Internet of Things? What's your take on startups? What should they do? Some startups might not make it, some will be acquired, or acquired for hiring purposes. There's growth certainly in this market. So what's your personal take on this? Well, you see the growth in all the real activity in Spark right now, right? So that's kind of where Spark is starting to grow up, right? We've been using it and playing with it. And it's great stuff and it has great promise. I think it's got a lot of growing up to do still. And that's where we're seeing a lot of the money go these days, right? A lot of the startups are focused on that and leveraging that in some way. Because that's the next wave of open source kind of thing that's really hot. So we're seeing that as one of the big trends right now and, you know, it'll be interesting to see, for the people that have focused and invested so much in MapReduce, what's going to happen, right? Because is this another shift for them? Right now it's clearly not yet, but as the technology matures, it'll be real interesting to see the migration. And companies that are able to navigate and move their customers quickly from one open source platform to another are going to be really successful, right? And so where do you see the opportunity for companies to provide value to customers? What's the number one thing you see right now on the table for value creation on behalf of the customer? Migration, new apps, what's your take on that? Well, you know, frankly, I think it's both migration and new apps. What's super exciting and what's always gonna be the highest revenue is the new apps for businesses, right?
So, being able to do more streamlining and quality control, right? So we've seen a lot, especially with machine data, log data, that's really what it's all about, right? It's tuning up the quality, getting rid of the scrap, making sure you fix the process, being able to react to how your product operates in the field and on a real-time kind of basis, yeah. So, Rick, we really appreciate you coming on, sharing your personal opinion as well as what's going on with the company. Final question for you is what's your take on the show this week? What are you expecting out of Strata Conference, Hadoop World and all the activities? Certainly there's a slew of parties, we're having ours tonight at seven o'clock, there's a lot of networking, a lot of cool stuff happening. What do you see evolving this week? What's your big focus and what would you say is gonna be the big outcome from this week? Well, I'm gonna really be paying attention to the startups, okay? And I'm really excited to have joined Think Big, who really likes to get involved with startups, and that's where my background was for a long time, being involved with startups. So really taking a look at the new technologies to see how far they've come. I noticed some metadata startups out there, they just popped up all of a sudden. You know, people notice me on LinkedIn and say, oh, Teradata, you want to know about this, right? So there's a lot of stuff. We need you to come talk to us. Yeah, yeah, exactly, yeah. So there's a lot of interesting stuff there, the machine learning stuff, always been a fantastic thing. And I think we're gonna see more AI happen, right? So that's really what's exciting right now. We love startups too. And I think that's gonna be a great innovation engine. Certainly you have the big whales starting to leverage their expertise, but the startups are in an interesting spot, so I've gotta ask you the kind of final, final question. What's the mindset of a startup, in your opinion?
What would you say to your startup brothers and sisters out there, in this week, what's the mindset? Partnerships, survive, get revenue, all of the above? If you had to say to the startups, focus, what would that be? Customers, they need partnerships. That's the number one thing. They need to partner up, line up with companies like us that have a lot of customers that are ready and eager for this technology, because these companies, they don't want to invest in early stage startups, but companies like Teradata or Think Big, they're able to bring all these startups in to be useful. Yeah, introductions, right? Absolutely. So just to put it bluntly, there's all these comfortable startups out there on the floor at Hadoop World, Hadoop Summit, and some other shows. Is that floor gonna get smaller next year and the year after? Are we gonna see acquisitions? Are a lot of those companies gonna make it? How do you see this market evolving in that sense? I think it's gonna stay pretty close. I think you're gonna have consolidation, but you're still gonna have a whole new branch of stuff with in-memory happening, coming to the forefront, right? So if that can really scale, like we hope it does, right? And memory prices keep going down, so I think that you're gonna see a lot of shift over there, right? And some consolidation will happen, right? All right, Rick, thanks for coming on theCUBE. We appreciate it. Rick Stelwagen, Data Lake Program Director, Think Big, a Teradata company. Obviously, startups are where the innovation happens and we think it's gonna be a great growing market, but people are gonna have to start making their moves and it's certainly exciting for theCUBE. We'll be sharing that with you. We'll be right back after this short break. This is theCUBE. I'm John Furrier with Jeff Kelly.