Live from San Jose, it's theCUBE. Presenting Big Data Silicon Valley, brought to you by SiliconANGLE Media and its ecosystem partners.

Welcome back to theCUBE, our continuing coverage on day one of our event, Big Data SV. I'm Lisa Martin with George Gilbert. We are down the street from the Strata Data Conference, and we've got a lot of cool stuff going on; you can see the cool set behind me. We are at the Forager Tasting Room & Eatery, so come down and join us, be in our audience today. We have a cocktail event tonight (who doesn't want to join that?) and a nice presentation tomorrow morning of Wikibon's 2018 Big Data Forecast and Review. Joining us next is Matthew Baird, co-founder of AtScale. Matthew, welcome to theCUBE.

Thanks for having me. Fantastic venue.

Isn't it cool?

This is very cool.

So, talking about big data: Gartner says 85% of big data projects have failed. I often say failure is not a bad f-word, because it can spawn the genesis of a lot of great business opportunities. Data lakes were big a few years ago, then turned into swamps. AtScale has this vision of Data Lake 2.0. What is that?

You're right, there have been a lot of failures, no doubt about it. And you're also right that that is how we evolve. We're a Silicon Valley-based company; we don't give up when faced with these things. A failure is just another way not to do something. What we've seen, and what we've learned through our customers, is that they need a solution that is integrated with all the technologies they've adopted in the enterprise. If you're going to build a data lake, you're going to have data on there that is the crown jewels of your business. How are you going to get it into the hands of your constituents so they can analyze it and use it to make decisions? And how can we do that in a way that supplies governance and auditability on top, so we aren't just sending data out into the ether without knowing where it goes? We have a lot of customers in insurance, health insurance, and financial services where the data absolutely must be managed.

I think one of the biggest changes is around that integration with current technologies. There's a lot of movement into the cloud, and the new data lake is focused more on these large data stores. Where it was HDFS with Hadoop, now it's S3, Google Cloud Storage, and Azure ADLS. Those are the stores backing the new data lake, I believe.

So the data lake store doesn't have to be an open-source HDFS implementation; it could even just be something accessed through an HDFS API.

Yeah, absolutely.

How should we think about the data sources and feeds for this repository? And what do we need to put on top to make the data more consumable?

That's a good point. S3, Google Cloud Storage, and Azure all share the characteristic that they're large stores: you can store as much as you want. On the clouds, and in open-source on-prem software, the tooling for streaming and landing data exists. The important thing is that it's cost-effective. S3 is a cost-effective storage system. HDFS is a mostly cost-effective storage system; you have to manage it, so it has a slightly higher cost. But the advice has been: get the data to the place where you're going to store it, and store it in a unified format. You get a halo effect when you have a unified format, and I think the industry is coalescing; I'd say Parquet is in the lead right now. Once your data is in Parquet, take Amazon for instance, it can be read by Athena, by Redshift Spectrum, by EMR. Now you have this halo effect where your data is always there, always available to be consumed by a tool or technology that can deliver it to your end users.
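To make that halo effect concrete, here is a minimal sketch (not from the interview) of landing data in the unified format Matthew describes: one Parquet write to S3 that Athena, Redshift Spectrum, and EMR can all read. It assumes pyarrow and configured AWS credentials; the bucket, path, and column names are hypothetical.

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")

# Raw landed records (in practice these arrive from a stream or a batch feed).
events = pa.table({
    "event_id":   [1001, 1002, 1003],
    "store_id":   [42, 42, 7],
    "revenue":    [19.99, 5.49, 102.00],
    "event_date": ["2018-03-07", "2018-03-07", "2018-03-07"],
})

# One write, many readers: every engine that speaks Parquet sees the same data.
pq.write_table(
    events,
    "my-data-lake/events/dt=2018-03-07/part-0.parquet",
    filesystem=s3,
)
```

The date-partitioned key layout is a common convention that lets engines such as Athena prune partitions instead of scanning the whole lake.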
So when we talk about Parquet, we're talking about a columnar serialization format, but there's more that needs to be layered on top so that you can, as we were talking about earlier, combine the data warehouse experience, the curated data access that is simple and has guardrails, with the wild west of the data lake where I capture everything. How do you bring those two together?

Specifically for AtScale, we let you integrate multiple data access tools, and then we use the appropriate tool to access the data for the use case. Let me give you an example. In the Amazon case, Redshift is wonderful for interactive access, which is what BI users want, right? They want fast, sub-second queries. But they don't want to pay to have all the raw data stored in Redshift, because that's pretty expensive. So there's Redshift Spectrum: the raw data sits in S3, which is cost-effective. When we read raw data to build summary tables that deliver the data fast, we can read from Spectrum, put it all together, and drop it into Redshift, a much smaller volume of data with faster access characteristics, and deliver it to the user that way. We do the same thing in Hadoop: we access via Hive for building aggregate tables, but Spark or Impala is a much faster interactive engine, so we use those for serving.
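Here is a rough sketch of that Spectrum-to-Redshift pattern: scan the cheap raw Parquet in S3 through Spectrum, aggregate it, and land a small summary table inside Redshift for sub-second BI queries. It assumes psycopg2 and an already-defined external schema (called spectrum here) mapped over the S3 data; the cluster, tables, and columns are hypothetical.

```python
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="analytics", user="etl", password="...")

with conn, conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS revenue_daily_summary;")
    cur.execute("""
        CREATE TABLE revenue_daily_summary AS
        SELECT store_id,
               event_date,
               SUM(revenue) AS total_revenue,
               COUNT(*)     AS order_count
        FROM spectrum.events        -- raw Parquet in S3, read via Spectrum
        GROUP BY store_id, event_date;
    """)
# BI tools now hit the small local table instead of scanning the raw lake.
```

The design point is that the expensive, fast store holds only pre-aggregated rows, while the full-fidelity history stays cheap in S3.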
As I step back and look at this, I think Data Lake 2.0, from a technical perspective, is about abstraction. And abstraction is sort of what separates us from the animals, right? It's a concept where we can pack a lot of sophistication and complexity behind an interface that lets people just do what they want to do. Maybe you know how a car engine works; I don't really, kind of, a little bit. But I do know how to press the gas pedal and steer, and I don't need to know the rest. I think Data Lake 2.0 is the same: I don't need to know how Sentry or Ranger or Atlas or any of these technologies work. I need to know that they're there, that when I access data they'll be applied to that data, and that they'll deliver me the stuff I have access to and can see.

So, a couple of things. I was hearing abstraction, and it sounds like that's a differentiator for AtScale: giving customers the abstraction they need. But I'm also curious about the data value perspective. You talked about the expense of Redshift. Do you also help customers evaluate the value of their data and where they ought to keep it, and then give them access to it? Or is that something they need to bring to the table?

Well, we don't really care about the source of the data, as long as it can be expressed in a way that can be accessed by whatever the engine is. Lift and shift is an example. There's a big move from Teradata or Netezza into cloud-based offerings. People want to lift it and shift it; it's the easiest way, same table definitions. But that's not necessarily optimized for the underlying data store. Take BigQuery, for example. BigQuery is an amazing piece of technology; I think there's nothing like it in the market today. But if you really want BigQuery to be cost-effective, to perform, and to scale up to the concurrency of, say, one of our customers that is going to roll this out to about 8,000 users, you have to do things that are BigQuery-friendly. The data structures, the way you store the data, repeated values: those sorts of things need to be taken into consideration when you build out your schema for consumption. With AtScale, they don't need to think about that. They don't need to worry about it; we do it for them. They drop in the schema the same way it exists on their current technology, and behind the scenes we look at signals. We look at queries, at all the different ways people naturally access the data, and we restructure those summary tables using algorithms and statistics, what people would probably call ML-type approaches, to build something that answers those questions and adapts over time to new questions and new use cases. So imagine you have the best data engineering team in the world, in a box. They never get tired, they never stop, and they're always reacting to what the customers really want, which is: now I want to look at the data this way.

It sounds like you have a whole set of sources and targets, and you, by which I mean your software, understand how they operate. So you can take data from wherever it's coming in and then apply machine learning, or whatever other capabilities, to learn from the access patterns how to optimize that data for that engine.

Exactly. And then the end users have an optimal experience.

It's almost like the data migration service Amazon has: give us your Postgres or Oracle database and we'll migrate it to the cloud. Except it sounds like you add a lot of intelligence to that process for decision-support workloads. It's not Postgres to Postgres; it might be Teradata to Redshift, or to S3 accessed by Athena or Redshift Spectrum, and then let's put that in the right format.

Yeah, I think you've hit on something we've noticed is very powerful. If you can set up the abstraction layer that is AtScale on your on-prem data, literally in, say, hours, you can move into the cloud. Obviously you have to do the detail work of moving the data, but once it's in the cloud you take the same AtScale instance, re-point it at the new data source, and it works. We've done that with multiple customers, and it's fast and effective, and it lets you try things you might not have had the agility to try before, because there are differences in how the SQL dialects work and, potentially, in how the schemas are built.
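To illustrate what 'BigQuery-friendly' structures look like, here is a hedged sketch of the repeated-values idea Matthew mentions: instead of lift-and-shifting a flat orders table plus a separate line-items table to join, you nest the repeated children inside one wide table. It assumes the google-cloud-bigquery client library; the project, dataset, and fields are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("store_id", "STRING"),
    bigquery.SchemaField("order_date", "DATE"),
    # Repeated, nested line items: one row per order with children inline,
    # instead of a second table and a join at query time.
    bigquery.SchemaField(
        "line_items", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
            bigquery.SchemaField("revenue", "FLOAT"),
        ],
    ),
]

table = bigquery.Table("my-project.sales.orders", schema=schema)
client.create_table(table)  # one wide, nested table instead of a star join
```

Because BigQuery bills by bytes scanned and handles nested data natively, this kind of denormalization tends to be cheaper and faster at high concurrency than the same schema copied over verbatim from Teradata or Netezza.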
So, a couple of things I'm interested in here: two A-words. There's the abstraction we've talked about a number of times, but you also mentioned adaptability. When you're talking with customers, what are some of the key business outcomes they need to drive where adaptability and abstraction are concerned, in terms of cost reduction or revenue generation? What are some of the C-suite business objectives that AtScale can help companies achieve?

Take one of our customers, a large retailer on the East Coast. Everybody knows the stores, they're everywhere, they sell hardware. They have a 20-terabyte cube they use for day-to-day revenue analytics, doing period-over-period analysis. When they're looking at stores, they're looking at things like a new marketing approach they just tried. I was talking to somebody there last week about these special stores where they completely redo one area just to see how it works. They have to be able to look at those analytics, and they only run those experiments for a short amount of time. So if your window for getting data, refreshing data, and building cubes is long (in the old world a build could take a week; my co-founder, at Yahoo, had a week-and-a-half build time), that data is two weeks old, maybe three weeks old, and there might be bugs in it.

And the relevance goes down.

The relevance goes down, or you can't react as fast. I've been at companies where speed is so important. These days, the new companies that are grasping data aggressively, putting it somewhere they can make decisions on it on a day-to-day basis, are winning. I was at a company that was spending about $3 million a month on pay-per-click. If you can't get that data every day, you're on the wrong campaigns, everything goes off the rails, and you only learn about it a week later. That's 25% of your spend right there, gone.

So it really sounds like what AtScale can facilitate, for customers in probably any industry, is the ability to truly make data-driven business decisions that directly affect revenue and profit.

Yes, and in an agile format, so you can build it...

That's a third A: agile.

There you go: abstraction, adaptability, and agility, the three A's. We had the three V's; now we have the three A's. And the fact that you're building a curated model matters. In retail, the calendars are complex. I'm sure everybody who uses Tableau is good at analyzing data, but they might not know the rules around your financial calendar, or the hierarchies of your products. There are a lot of cases where you really want an enterprise group of data modelers to build the model, bless it, and roll it out. But then you're a user, and you say: wait, you forgot X, Y, and Z. I don't want to wait a week, or two weeks, or three weeks, or a month, maybe more. I want that data available in the model something like an hour later, because that's what I get with Tableau today. That's where we've taken the two approaches, enterprise analytics and self-service, and tried to create a scenario where you get the best of both worlds.

So an implication of what you're telling us is that insights are perishable, and latency is becoming more and more critical. How do you plan to work with streaming data, where you've got a historical archive but fresh data coming in, and fresh could mean a variety of things? Tell us what some of those scenarios look like.

Absolutely. I think there are two approaches to this problem, and I'm seeing both used in practice; I'm not exactly sure which one's going to win, although I have some theories. In one case, you stream everything into the data lake I talked about, S3, you put it in a format like Parquet, and people access it there. The other way is to access the data where it is. This is a common BI scenario: you have a big data store and a dimensional data store. Oracle has your customers; Hadoop has machine data about those customers accessing things on their mobile devices, or something like that. If there were a way to access those together without having to move the Oracle data into the big data store, that's the federation story we've talked about in the Bay Area, and around the world, for a long time. We're getting closer to understanding how to do that in practice and have it be tenable. You don't move the big data around; you move the small data around.
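Here is a sketch of that 'move the small data' idea: pull the small Oracle dimension table over JDBC into the engine that already sits next to the big machine data, and broadcast-join it there. It is a sketch under assumptions (PySpark with an Oracle JDBC driver on the classpath), and the connection details, tables, and columns are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("federation-sketch").getOrCreate()

# Small side: the customer dimension stays in Oracle; we only read it.
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")
             .option("dbtable", "crm.customers")
             .option("user", "analyst")
             .option("password", "...")
             .load())

# Big side: device events already landed in the lake as Parquet; never moved.
events = spark.read.parquet("s3://my-data-lake/device_events/")

# Ship the small table to the big data, not the other way around.
enriched = events.join(broadcast(customers), on="customer_id")
enriched.groupBy("customer_segment").count().show()
```

The broadcast hint is the whole story in miniature: a few megabytes of dimension rows travel to the nodes holding terabytes of facts, instead of the facts traveling anywhere.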
For data coming in from outside sources, that's probably a little more difficult, but it's a variant of the same story. I would say streaming is gaining a lot of momentum. And with what we do, because of the governance piece we built into the product, we're always mapping where the data came from, where it landed, and how we used it to build summary tables. If we build five summary tables because we're answering different types of questions, we still need to know that they trace back to this piece of data, which has these security constraints and these audit requirements, and we always apply those to the derived data. So when you're accessing these automatically ETL'd summary tables, it just works. I think there are two ways this is going to expand. I'm excited about federation, because I think its time has actually come, and I'm also excited about streaming. They can serve two different use cases, and I don't actually know what the answer will be, because I've seen both among some of the biggest customers we have.

Well, Matthew, thank you so much for stopping by, and for the three A's that AtScale can facilitate: abstraction, adaptability, and agility. Hashtag three A's.

I don't even want credit for that.

Oh wow, I'm going to get five more followers. I know. There you go. We want to thank you for watching theCUBE. I'm Lisa Martin, here with George Gilbert, and we are live in San Jose at our event, Big Data SV. Stick around; we will be right back with our next guest after a short break.