Union Square in the heart of San Francisco. It's theCUBE, covering Spark Summit 2016, brought to you by Databricks and IBM. Now, here are your hosts, John Walls and George Gilbert.

And welcome back to San Francisco, here at Spark Summit 2016 on theCUBE. Our coverage continues from the Hilton as we look at day two of what's happening here in the Spark community. Along with George Gilbert, I'm John Walls, and we're joined now by Sumit Sarkar, who is the Chief Data Evangelist at Progress. And you were saying earlier, Sumit, not Chief Data Evangelist; you called yourself the Data Plumber. I think Evangelist is probably a little higher level than that. But from your vantage point right now, I know Progress works with a lot of companies, and you're trying to help them become more acclimated and more useful in the digital world. Generally, for those who aren't yet involved, the clients that you're pitching, I'm kind of curious: what's the reluctance, or where's the trepidation right now? Does big data just scare the heck out of them?

So Progress as a whole is an application development company; our core DNA is application development. A lot of companies are looking at, it's a big buzzword, digital transformation. They're trying to take all of their data sources and make more sense of them, and big data plays a big part in that. So as we get digital businesses and we see different organizations try to transform their business, you're seeing more websites, you're seeing more mobile, you're seeing new types of application development patterns. Progress as a whole is serving that community. And we're really here at Spark Summit with the Progress DataDirect business line, which is where the Data Plumber comment comes in. I thought that was off camera; I'm sorry, John and George. The Data Plumber part is very accurate, because we're really trying to connect data, right? There are a lot of different data sources. You look at Spark as a data source, you look at Hadoop and Hive, then you start looking at other cloud systems; some of this stuff is running in the cloud. You look at your traditional relational databases, you look at NoSQL databases like MongoDB and Cassandra. There's a whole lot of different interfaces out there, and this data is all different shapes and sizes. So we'll say there's a lot of disruptive technology that comes with digital transformation.

So what has Spark been doing for you, then? And how long have you been working with it?

When we look at Spark, we're looking at the whole big data ecosystem. It started out with a lot of Hadoop, things running as MapReduce and Hive jobs in batch, and we're seeing Spark come on board for real-time streaming, other types of in-memory operations, and batch as well. Our piece of that is connectivity. Our core is standards-based connectivity: SQL standards like ODBC and JDBC, and then REST interfaces like OData, which is the emerging standard for the web; it's kind of like SQL for the web, if you've not heard of it. That's our core. We've tapped all these different standard interfaces, and we connect all these different data sources. So we build all those connectors.
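To make the "SQL for the web" idea concrete, here is a minimal sketch of what an OData request looks like from a client. The service URL and entity set are hypothetical; the $select, $filter, and $top query options are part of the OData standard.

```python
# Minimal OData query sketch -- "SQL for the web." The endpoint and entity
# set here are hypothetical; $select/$filter/$top are standard OData options.
import requests

base_url = "https://example.com/odata/Customers"  # hypothetical service
params = {
    "$select": "Name,Region",       # project columns, like SQL SELECT
    "$filter": "Region eq 'West'",  # predicate, like SQL WHERE
    "$top": "10",                   # row limit, like SQL LIMIT
}

resp = requests.get(base_url, params=params, headers={"Accept": "application/json"})
resp.raise_for_status()
for row in resp.json().get("value", []):  # OData wraps result rows in "value"
    print(row["Name"], row["Region"])
```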
What drove that? Have there been step changes in demand for pulling data from multiple repositories? Like maybe with mobile, where mobile apps really need to access many legacy applications so that there's sort of a single view. Is that a valid example, or are there others where you need to bring data together and something drives demand?

Yeah, so our core business is industry standards and connectivity, so we think about democratization of data: making that data open and available to everybody. A good example is one of our public references, Intuit. They use Salesforce.com in the cloud, and they're able to leverage our connectivity through the OData standard, which is a REST endpoint, to access any data behind their firewall. It instantly turns non-Salesforce data into something that looks like Salesforce data. So if you have data in Spark and Hadoop, or in Oracle, we can make all of it available through standard interfaces for mobile application development in Salesforce: create Visualforce pages, create whatever Salesforce supports, but using external data. And that's all powered through standard connectivity.

Okay, so in this world where we have huge repositories on-prem and we're growing data sources in the cloud, what determines what stays on-prem and what goes to the cloud? And when do they move between them?

Oh, okay, yeah, that's a good question. Usually the IT guys or the engineers find that if the application's in the cloud, the data ends up being on-premises, or the converse, which is bad luck. If you've got data in the cloud, or you've got an application in the cloud and data on-premises, you usually have the opposite environment from the one you need, so you have to connect them: either you're behind a firewall or you need to get out to the cloud. It's always the converse. But in reality, we see different verticals with different requirements. In financial services, they're trying to run workloads in the cloud, maybe some kind of testing, but they're not ready to commit sensitive data to the cloud. Whereas in other industries, like manufacturing, you see more of this IoT sensor data going to the cloud and then being processed behind the firewall. So you have a mix, and the patterns we see really vary by industry.

So is that being driven by security concerns? Privacy concerns, or what?

I'd say it's a mix of security and privacy. When you start talking about geographies, in Europe and places like that, you have different government laws on data residency. And then there's HIPAA; there are all these different government regulations and compliance regimes that govern each industry. So you see different patterns emerging. But one thing's for sure: we do see people moving data into the cloud across all verticals. It's just that the purposes are different for each one.

If we were to try and ferret out some common threads for moving into the cloud, would one be that it's easy to add additional data sources for richer context, so you can do better analytics? Is that one reason for moving to the cloud?

From our perspective, the way we operate is we build standard connectors, and we deploy them and connect things regardless of where you are. So we're really enabling people to keep their systems where it makes the most sense for them. If you have a big data system and you're pulling data from behind your firewall, it may not make sense to move that out to the cloud, and so we enable connectivity in that direction. Or vice versa: if you have data in the cloud and all of the systems that support it are also in the cloud, it makes more sense to run in the cloud, and we can support that environment as well.
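As a rough illustration of the "same connector, any location" point, here is a sketch of reading a table into Spark through a standard JDBC connector. The hosts, credentials, and table names are hypothetical, and it assumes the matching JDBC driver jar is on Spark's classpath; the only thing that changes between an on-premises and a cloud source is the connection URL.

```python
# Sketch: pulling a relational table into Spark over standard JDBC.
# Hostnames, credentials, and table names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("connectivity-sketch").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost.internal:5432/sales")  # on-prem...
    # .option("url", "jdbc:postgresql://db.example-cloud.com:5432/sales")  # ...or cloud
    .option("dbtable", "orders")
    .option("user", "etl_user")
    .option("password", "****")
    .load()  # assumes the PostgreSQL JDBC driver jar is on the classpath
)
orders.show(5)
```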
So it sounds like there's a lot of activity moving back and forth, but the reasons behind the patterns haven't established themselves yet.

Yeah, so from our perspective we have to support every pattern there is. We do surveys of our base to understand trends and to build our portfolio, to make sure we're anticipating the next big things. And one thing we're seeing at Spark Summit is that a lot of these data integration and transformation workloads are happening here, across a lot of different types of data sources. We're starting to see patterns emerge with cloud data sources, things like Salesforce.com and marketing automation systems like Marketo and Eloqua, as part of this digital transformation of marketing and digital business. We're seeing people ingest that data through Spark, and our connectors allow standard access; we provide federation. So if you have your Spark environment with our connectors, you have instant access to any data on demand, whether it's in Salesforce or Oracle or anywhere, and it doesn't matter if it's behind the firewall. That's one of the trends we're seeing: taking data wherever it is and ingesting it into Spark.

So would that be part of a customer 360 initiative? You know, pull my Salesforce, pull my Marketo, and then put it in an on-prem repository?

Yeah, and that repository could be anywhere. You might bring the data in through Spark, do some transformations, and then load it into some other on-premises database. Or you can keep it in Spark and access it there; we have Spark SQL connectors as well. So again, it's maximum flexibility for accessing data. We really want customers and organizations to decide what works best for them.

I've heard a lot about the ecosystem, the big data ecosystem, and how it's grown exponentially. Is it static at this point, with all the inputs and contributors we have now? Or do you see this as a continually expanding universe, where it's difficult to predict how much larger it will get or where the new entrants are going to be?

Definitely not static. The ecosystem changes so much. For us, if you look at our journey as a connectivity company, we started out supporting relational sources: Oracle, SQL Server, Postgres, your traditional relational databases. They didn't change that often, so we were able to build connectors, run something like 85 million tests on them, and send them out the door. In the big data world, you've got Hive versions, you've got multiple distributions, you've got Spark projects. Some distributions let you connect to Spark SQL through the Thrift client; some don't have it. So it's this big matrix of, we'll call it disruption, I guess, because every system's different. So instead of running 85 million tests on each possible version of each interface, we've ended up introducing what we call day-one support: if someone upgrades, say to the next distribution of Hadoop or Spark with the next version of the Spark SQL interface or Hive, and a customer runs into an issue, we treat that like a bug going forward. So really, from our perspective as a company that tests and provides production-ready drivers, we still run our 85 million tests, but the new paradigm is that we'll support you no matter what version of a supported distribution or interface you're using.
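For readers who haven't seen the Thrift route mentioned above: the Spark Thrift Server exposes Spark SQL over the HiveServer2 wire protocol, so any HiveServer2-compatible client can query it. Below is a hedged sketch using the open-source PyHive library; the host, username, and table names are hypothetical, and 10000 is just the conventional default port.

```python
# Sketch: querying Spark SQL through the Thrift server (HiveServer2 protocol)
# with PyHive. Host, username, and table names are hypothetical.
from pyhive import hive

conn = hive.connect(host="spark-thrift.example.com", port=10000, username="analyst")
cursor = conn.cursor()
cursor.execute(
    "SELECT account_id, COUNT(*) AS touches "
    "FROM marketing_events GROUP BY account_id LIMIT 10"
)
for account_id, touches in cursor.fetchall():
    print(account_id, touches)
cursor.close()
conn.close()
```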
And with the new Apache Spark 2.0 release, what does that do for you, in terms of the improvements that are now being made available?

Yeah, so with each release of a Spark SQL interface we support, we're going to certify it within a certain number of days; we usually try to do it within 30 to 60 days. And if a customer runs into an issue, we're going to fix that right away. When we look at the new features, whatever is new in Spark SQL or the SQL engine itself, we look at those with each release and introduce the ones that make the most sense from a data connectivity perspective. If there's a way to improve performance, we're going to adopt it in the next release.

So if you look at the big data landscape, do you see patterns of adoption emerging for why people build their data lakes, and then why they might take the next step and build Spark applications on top of that?

The data lake concept is really interesting for us. In the world of data integration, the traditional pattern is what's called ETL, right? You extract stuff, you transform it, you load it. We're definitely seeing, and it's been this way for a while now, the ELT model: you extract it, you load it, then you do transformations. From our perspective, we're providing that extract layer, and so it's becoming really important to have that breadth of sources, whether it's cloud, big data, or NoSQL. By providing that extract layer, we can really effectively help build a rich data lake across all your data sources. Whether you're doing customer click-through rates, a 360 view, lead scoring, whatever it is, you get that rich source of data and bring it into your data lake, and we provide the breadth. Even if the data is messy, just bring it all in for context. And then there's a rich ecosystem of tooling around the lake that can do the rest, maybe it's preparation, maybe it's transformation, all the different things you'd do on that data.

Is there a kind of sequence to the assembly line of working on the data once it's in the lake? Is it the data scientist who goes first, and then the data engineer? How does that work?

I guess it differs from organization to organization. Some organizations just dump a bunch of data in and figure it out later, maybe the schema-on-read approach. Other organizations actually think about what they want to know and selectively bring the data in. So I'd say it varies. But at the end of the day, once you have all that data in, depending on the nature of it, sometimes you need data scientists, sometimes you need some hardcore engineering, and sometimes you can turn to this improving ecosystem of commercial tools that can really get the job done. But really, what we see is that you have different levels of data, different levels of detail. It's all about the details with big data, right? Summary data is one thing, but to get real big data value, you've got to get down to: what's the click-through rate? How many people opened this email? What did they do next? That's the big-data level of detail we like to see in a lake. So I guess the pattern is you have a mix of these low-level big data details, some summary data, and some business context data, and you bring it all together to answer your question.
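To illustrate the schema-on-read, detail-level pattern described above, here is a small PySpark sketch: raw email events are landed in the lake as-is, structure is inferred at read time, and the event-level detail (opens, clicks) is aggregated on demand. The path, field names, and event schema are all hypothetical.

```python
# Schema-on-read sketch: dump raw JSON events into the lake, impose
# structure at query time. Paths and field names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-sketch").getOrCreate()

# Structure is inferred from the raw JSON when we read -- nothing was
# modeled up front when the events were written to the lake.
events = spark.read.json("s3a://lake/raw/email_events/")

# The "big data level of detail": per-campaign opens, clicks, and CTR.
summary = (
    events.groupBy("campaign_id")
    .agg(
        F.sum(F.when(F.col("action") == "open", 1).otherwise(0)).alias("opens"),
        F.sum(F.when(F.col("action") == "click", 1).otherwise(0)).alias("clicks"),
    )
    .withColumn("ctr", F.col("clicks") / F.col("opens"))
)
summary.show()
```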
But that, in a sense, is just like traditional data warehouse dimensions and facts, except that you're keeping more facts and maybe more dimensions, and then you've cleaned it up.

Yes. I mean, for us, we're really just bringing the source data in; what they end up doing in the lake might be to clean it up later. And then, on top of that, after they get the lake in place, they're going to run analytics, they're going to do their machine learning, whatever it is they do. And Spark is great for all of these operations. Once you get the data available, you have these great, rich batch processing and streaming capabilities. So it's a really cool ecosystem.

Okay. Before we head out, just tell me what you think about what's going on here. The show, the floor, the sessions, what you've seen over the past day and a half or so, and maybe what you expected to see versus what it wound up being.

Oh yeah, it's a very interesting show for us. I was at the Strata + Hadoop World show earlier in the year, and Spark Summit, by comparison, is a different paradigm. I think both of these ecosystems are definitely growing in maturity. A lot of our connectivity is around core business systems, I guess high-value data, and with Hadoop we started to see people bring in some of that high-value data after a certain point. With Spark, we're just now starting to see people bring in that core business data and start merging it with other things. But we see a very similar pattern between Hadoop and Spark: those projects get up to speed, they get started with version 1.0 or what have you, and then as the business gets involved with the data, you start to see them introduce some of this business data, some of this cloud data, some of this stuff from fit-for-purpose databases. And then we get involved. So we're really here to let people know about our connectivity capabilities, for when they're ready.

Well, thank you for the time; we appreciate that. And like you said, we're kind of watching this mature before our eyes, in a way. Very good. Sumit, thanks for being here with us.

Oh, it's my pleasure. Thank you so much.

We appreciate that very much. We'll continue our coverage here on theCUBE from Spark Summit 2016 in just a bit.