From the campus of MIT in Cambridge, Massachusetts, it's theCUBE, covering the MIT Chief Data Officer and Information Quality Symposium. Now, here are your hosts, Stu Miniman and Paul Gillin.

We are back with theCUBE, live from the MIT CDOIQ Symposium in Cambridge, Mass. I'm Paul Gillin with my colleague Stu Miniman, and we're now joined by Paul Barth, CEO of Podium Data, a company you may not have heard of, but one that's trying to solve a problem you've heard a lot about: Hadoop complexity, and making Hadoop useful to the enterprise as a true data management platform, not just a playground. Paul, talk a bit about the problem you're trying to solve. Why did you come up with the idea for Podium in the first place?

Thanks, Paul. One of the benefits of experience is that you've seen the same problems over and over. I've been at this conference for many years, since its founding, talking about what it takes to implement governance. One of the problems I found to be consistent was that the complexity of the systems in an enterprise, the morass of data sets and sources spread all over the place and constantly in motion, makes it very difficult to implement governance. You may come up with great policies. You may have good processes and roles and stewards. But if you don't have an asset to govern, you'll struggle, or never be able to catch up and maintain the governance policies you're trying to put in place. So four years ago, I gave a presentation here about next-generation data governance and how I felt that Hadoop is not a threat to data governance; it is actually ideally suited for governing your data. The reason it's ideally suited is its low cost point: it removes barriers to entry, and it removes a lot of the optimization, design, and technical engineering needed to build an efficient solution.
You can dump all your data in one place. You're still left with issues around organizing the data and ensuring quality; there's still a lot of data management work to do. But a lot of the barriers to getting started and to moving quickly are eliminated. I knew this because we had built some of these systems, but what I found was that the Hadoop stack alone was really not a turnkey data lake, and that many enterprises didn't have the Hadoop skills or the integration capabilities to pull together all of this technology, with its new languages, new protocols, and new databases. So I decided there was room for a product here. And this product would allow us to turn data management principles on their head: to enable agile data management and do away with some maxims we've held for many years, like "gather good requirements for the data you need." With a data lake, the concept is no: bring in all the data you can, and use the power of Hadoop and the data lake to improve the data and figure out whether it's useful. The reason Podium is designed the way it is, the problem we focused on solving, is that this environment has a lot of data, it has sensitive data, it has many different users, and it has stakeholders in both the business and technology. You need an enterprise-class platform that manages all of that consistently, end to end. So we built a metadata-driven platform that installs either in the cloud or on premises and talks to any Hadoop distribution. It uses the power of Hadoop to give you that cost performance, flexibility, and responsiveness, but it gives you the management controls you expect in a robust, mature enterprise platform.

How is your approach to solving this complexity problem different from everybody else's?
I mean, there are a lot of other companies trying to make Hadoop, you know, friendlier.

Friendlier and useful. And there are certain pockets of capabilities: data wrangling is a very popular topic, data profiling is interesting, and of course analytics. The first thing I'll say is that we are a data management platform. We don't do any BI, so we partner with all the major BI firms, and the data we prepare is immediately accessible from the likes of Tableau and Qlik and others. The other piece of it, though, is the life cycle. Data professionals know that the real life cycle of data management starts with raw data: data whose quality and provenance you don't know, and which may or may not be useful. You need to manage this process of raw to ready, which we used to do with ETL tools like Talend, Ab Initio, and Informatica. But that was purely a programmer's domain. Now, by using metadata and a business-friendly interface, we're bringing business analysts and data analysts into that process of understanding the raw data and helping with its preparation and validation. We also use a lot of automation during ingestion, to bring in things like legacy mainframe data sources and convert them into Hadoop-ready formats, and to automatically profile every field of every dataset in advance, because we have this powerful platform to do it, even before we know whether we'll need that profile. By pre-processing and automating, we have a ready-to-go environment that's much, much faster. Our customers are up and in production in 30, 60, or 90 days, as opposed to a year of custom building.

So Paul, can you help us connect what you're working on to the mandates we hear from Chief Data Officers?

Sure.
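[Editor's note] The automated, per-field profiling described above, computing statistics for every column of every ingested dataset before anyone asks for them, can be sketched in plain Python. This is an illustrative sketch only, not Podium's implementation; the record layout and the particular statistics chosen here are assumptions.

```python
from collections import Counter

def profile_field(values):
    """Compute basic profile statistics for one column of a dataset."""
    non_null = [v for v in values if v not in (None, "")]
    profile = {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }
    # Record numeric min/max only when every non-null value parses as a number.
    try:
        nums = [float(v) for v in non_null]
        profile["min"], profile["max"] = min(nums), max(nums)
    except (ValueError, TypeError):
        pass
    return profile

def profile_dataset(rows):
    """Profile every field of a dataset given as a list of dicts."""
    fields = rows[0].keys() if rows else []
    return {f: profile_field([r.get(f) for r in rows]) for f in fields}
```

Running something like `profile_dataset` over each table at ingestion time is what yields the "profile every field in advance" behavior: the statistics are already on hand when an analyst later asks whether a dataset is worth using.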
Often we hear that the Chief Data Officer isn't necessarily involved in a lot of the technology discussions. I hear things like Hadoop and all these real-time technologies, and that seems like a different piece. So how do those go together?

I think it's a really important question. What we often see is that our platform provides a kind of common table. We bring everybody to the table: IT, the business, the analytic community, and risk. The reason is that data is such an important and sensitive asset that all of those stakeholders need a hand in it. But the implementation of policies, the implementation and monitoring of the guidelines you're setting up as a CDO, can't be buried in a set of code. It needs to be exposed and visible, and that's why metadata is so critical. Every asset, every user, every process in the data lakes Podium manages is recorded and tracked through metadata. So you have rich reporting on who's using what, who has access to what, what the hot areas of data are, and what data you should clean up because you're seeing some sprawl. That manageability is finally achievable because we have the data in one place, on a consolidated platform, with a management layer that keeps it organized, consumable, and manageable by business users in various functions.

Looking out to the longer term: right now, the data lake is typically used as a staging area for the data warehouse, which is an expensive platform. Over time, do you see the data lake becoming the data warehouse? Will we not need data warehouses as we now think of them?

Well, we are seeing that in some of our early customers. Astellas Pharmaceuticals actually replaced the Netezza database they were using for analytics with Hadoop managed by Podium. They did that because it took much longer and was much more expensive to deliver on those warehouse appliances from the major vendors.
And so there are pockets where people are doing that. However, we are a product built for large enterprises, and many large enterprises have a very significant investment in infrastructure: security infrastructure, usage, data models, that you won't unplug overnight, and you don't want to come in with a value proposition of displacing it. What we're displacing is the morass of data extraction, transformation, and staging technology that feeds those warehouses. And there are many companies right now taking workload off the warehouse and putting it onto Hadoop managed by Podium.

We are seeing a big shift in interest right now toward streaming and near-real-time analytics, led by Apache Spark. George Gilbert of Wikibon has predicted that Spark will drive about two thirds of the growth in the big data market over the next five years. Does that affect what you do, the data lake concept? Will data lakes adapt well to stream processing?

Well, we've designed ours to. There's a concept of a Lambda architecture, which includes both streaming and real-time processing as well as batch-oriented history. And every business and every analyst knows you need both. You need rich, complete histories of your information, your business activities, and outside data to make sense of what rules you want to fire in real time. The problem in the past is that the two have been very disjointed. In our solution, they're pulled together. Today we run on Spark, and we'll have streaming in the near future.

Well, one thing you can say about your market is that it's never boring, Paul Barth. That's right. Thank you for joining us today. We will be right back; we have another guest before we wrap up the day. This is Paul Gillin with theCUBE at the MIT CDOIQ Symposium.
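[Editor's note] The Lambda architecture mentioned in the interview pairs a batch layer over the complete history with a speed layer over recent events, merged at query time. A toy, plain-Python sketch of that merge follows; it is illustrative only (not Spark and not Podium's product), and all names in it are hypothetical.

```python
from collections import defaultdict

class LambdaSketch:
    """Toy Lambda architecture: a batch view over complete history plus a
    speed layer over events since the last batch run, merged at query time."""

    def __init__(self):
        self.history = []                   # master dataset (append-only)
        self.batch_view = {}                # precomputed counts from history
        self.speed_view = defaultdict(int)  # counts for recent events

    def ingest(self, event_key):
        """New events land in both the master dataset and the speed layer."""
        self.history.append(event_key)
        self.speed_view[event_key] += 1

    def run_batch(self):
        """Recompute the batch view from full history; reset the speed layer."""
        view = defaultdict(int)
        for key in self.history:
            view[key] += 1
        self.batch_view = dict(view)
        self.speed_view.clear()

    def query(self, key):
        """Merge batch and real-time views for a complete, current answer."""
        return self.batch_view.get(key, 0) + self.speed_view.get(key, 0)
```

The point Barth makes about needing "both" shows up in `query`: without the batch view you lose history, and without the speed view you lose everything that happened since the last batch run.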