Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks.

Welcome to theCUBE at Spark Summit here in San Francisco at Moscone Center West, and we're going to be competing with all the excitement happening behind us. They're going to be going off with raffles and I don't know what all, but we're just going to talk above them, right? Our next guest on the show is Clark Patterson from Confluent. You're the senior director of product marketing, is that correct?

You got it. Pleasure to be here.

Pleasure to have you on the show.

It's my first time here.

First time on theCUBE?

I feel like one of those radio people. First-time caller, here I am. Yep, first time on theCUBE.

Well, a long-time listener too, I hope.

I am.

And have you announced anything new that you want to talk about from Confluent?

Yeah, not particularly at this show per se, but most recently we've done a lot of stuff to enable customers to adopt Confluent in the cloud. So we came out with Confluent Cloud, a managed service of our Confluent platform, a couple of weeks ago at our event around Kafka. We're really excited about that. It really fits the need where cloud-first or operations-starved organizations want to do things with streaming platforms based on Kafka, but they just don't have the means to make it happen. We're now standing this up as a managed service, and it lets them get their hands on this great set of capabilities with us as the backstop.

And you said Kafka's not just a publish-and-subscribe engine, right?

Yeah, I'm glad you asked that, because that's one of the big misconceptions about Kafka. It's made its way into a lot of organizations from the early use case of publish and subscribe for data, but over the last 12 to 18 months in particular there have been a lot of interesting advancements. Two things in particular. One is the Connect API in Kafka, which essentially simplifies how you integrate large numbers of producers and consumers of data as information flows through. So a modernization of ETL, if you will. The second thing is stream processing. There's a Kafka Streams API built in now as well that lets you do lightweight transformations of data as it flows from point A to point B, and you can publish out new topics if you need to manipulate things. It expands the overall capabilities of what Kafka can do.

Okay, I'm going to ask George here to dive in.

And I was just going to ask if I could dive in. So this is interesting. I want to frame this in terms of what people understand from, I don't want to say prehistoric eras, but earlier approaches to similar problems. Let's say in days gone by you had an ETL solution. Now let's put Connect together with stream processing. How does that change the whole architecture of integrating your systems?

Yeah, I think the easiest way to think about this is to look at some of the different market segments that have existed over the last 10 to 20 years. Data integration was all about: how do I get a lot of different systems to integrate a bunch of data, transform it in some manner, and ship it off to some other place in my business? It was really good at building end-to-end workflows and moving big quantities of data, but it was generally batch oriented, and so we've been fixated on how to make that process faster.
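The Streams side of that "Connect plus stream processing" combination looks roughly like this in code. A minimal sketch, not from the interview: it consumes one topic, applies a lightweight in-flight transformation, and publishes the result as a new topic. The topic names and the uppercase step are placeholders, and it assumes the kafka-streams client library is on the classpath.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class OrderTransformer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-transformer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Consume the raw stream, apply a lightweight transformation as records
        // flow through, and publish the result as a new topic for downstream consumers.
        KStream<String, String> orders = builder.stream("orders");
        orders.filter((key, value) -> value != null)
              .mapValues(value -> value.toUpperCase())   // stand-in for a real enrichment
              .to("orders-transformed");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```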
The other segment, to some degree, is application integration, which said: hey, when I want applications to talk to one another, it doesn't have the same scale of information exchange, but it needs to happen a whole lot faster. So these real-time integration systems, ESBs and things like that, came along and were able to serve that particular need. But as we move forward into the world we're in now, where there's just all sorts of information and companies want to become event-centric, they need the best of both of those worlds. And this is really where Kafka is starting to sit, right? It's saying: hey, let's take massive amounts of data producers that need to connect to massive amounts of data consumers, ship a super granular level of information, transform it as you need, and do that in real time so that everything can get served out very, very fast.

That's a wonderful and pithy way to distill it. But now that we have this new way of thinking about app integration and data integration, the best of both worlds, that has second-order consequences for how we build applications and connect them. So what does that look like? What did applications look like in the old world, and what enables them to be refactored now? Or for new apps, how do you build them differently?

Yeah, we see a lot of people going to microservices-oriented architectures. So moving away from one big monolithic app that takes an inordinate amount of effort to change in any capacity and, quite frankly, changes very, very slowly. They look to microservices to split that up into very small functional components they can iterate on a whole lot faster, decouple engineering teams so they're not dependent on one another, and just make things happen a whole lot quicker than before. But obviously, when you do that, you need something that can connect all those pieces, and Kafka is a great thing to sit in there as a way to exchange state across all of them. So that's a massive use case for us, and for Kafka specifically, in terms of what we're seeing people do.

You said something in there at the end that I want to key off of, which is exchanging state. In the old world, we used a massive shared database to share state for a monolithic app, or sometimes between monolithic apps. So what's the state-of-the-art way that's done now with microservices? If there's more than one, how does that work?

Yeah, this is rooted in the way we do stream processing. There's this concept of topics, which effectively can align to individual microservices, and you're able to make sure that the most recent state of any particular one is stored in the central repository of Kafka. But then, given that we take an API approach to stream processing, it's easy to embed those kinds of capabilities in any of the endpoints. So some of the activity can happen at that particular front, and it all gets synchronized down into the centralized hub.

Okay, let me unpack that a little bit. Because you take an API approach, that means if you're manipulating a topic, you're processing a microservice, and that has state in it. Is that the right way to think about it?

Yeah, I think that's the easiest way to think about it.
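One way to picture "the most recent state stored in the central repository of Kafka": a minimal sketch, not from the interview, of a service that uses the Streams API to materialize the latest value per key from a compacted topic into its own local store, so services share current state without a shared database. The topic and store names are placeholders.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class InventoryView {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "inventory-view");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // A table over a compacted topic: Kafka keeps the log, and this service
        // materializes the latest value per key into a local state store.
        KTable<String, String> inventory = builder.table(
                "inventory-state",
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("inventory-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // The service answers "what's the current value for key X?" from its local
        // store, and anything it needs to share goes back out as a new record on
        // the topic, where every other consumer picks it up.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```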
Okay, so where are we? Is this a ten-year migration, or is it that a certain class of apps lends itself well to microservices, legacy apps stay monolithic, and some new greenfield apps will still be database-centric? How should customers think about that mix?

Yeah, that's a great question, and I don't know that I have the answer to it. The best gauge I have is just the amount of interest in the conversations we have on this particular topic. I will say that of the topics we engage on, it's easily one of the most popular. So if that's a data point, there's definitely a lot of interest in people trying to figure out how to do this stuff very fast.

How to do the microservices.

Yeah, yeah. And if you look at some of the more notable tech companies of late, they're architected this way from the start. So everyone's looking at the Netflixes of the world and the Ubers of the world saying, I want to be like those guys, how do I do that? And it's driving them down this path. Competitive pressure, I think, will help force people's hands. The more your competitors get in front of you and deliver a better customer experience through some sort of mobile app or something like that, the more it's going to force people to make these changes quicker. But how long that takes, it'll be interesting to see.

Great, great stuff. Let's switch gears a little bit. Talk about maybe why you're using Databricks and some of the key value you've gotten out of that.

Yeah, I wouldn't say that we're using Databricks per se, but we integrate directly with Spark, right? If you look at a lot of the use cases people use Spark for, they obviously need to get data to it. And some of the principles I mentioned before about Kafka apply here: it's a very flexible, very dynamic mechanism for taking lots of sources of information, funneling all of that down into one centralized place, and then distributing it out to places such as Spark. So we see a lot of people using the technologies together to get data from point A to point B, do whatever transformation they need along the way, and then apply Spark's compute and horsepower to it.

All right, don't worry, I'm processing this. And it's tough because you can go in so many different directions. Especially on the question about Spark, give us some of the scenarios where Spark would fit. Would it be doing microservices that require more advanced analytics, which then feed other topics or feed consumers? And where might you stick with a shared database that a couple of services communicate with, rather than maintaining the state within the microservice?

Yeah, let me see if I can unpack that a little bit.

Yeah, that was packed pretty hard. There's a lot in there.

When you think about an overall business process, something like an order-to-cash process these days, it has a whole bunch of different systems that hang off of it. You've got order processing, you've got inventory management, maybe you have some real-time pricing, you've got shipments, and it all just kind of hangs off the flow of data across there. Now, the system you use to address each of those problems could be vastly different. It could be Spark, it could be a relational database, it could be a whole bunch of different things.
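When Spark is one of the systems hanging off that flow, the handoff looks roughly like this: a minimal Spark Structured Streaming sketch, not from the interview, that subscribes to a Kafka topic and runs a placeholder aggregation. It assumes the spark-sql-kafka integration package is on the classpath; the topic name and the per-key count are made up for illustration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaToSpark {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-to-spark")
                .getOrCreate();

        // Subscribe to a Kafka topic as a streaming source.
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "orders-transformed")
                .load()
                .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value");

        // Whatever heavy analytics Spark is good at goes here; this placeholder
        // just keeps a running count per key and prints it to the console.
        StreamingQuery query = events.groupBy("key").count()
                .writeStream()
                .outputMode("complete")
                .format("console")
                .start();

        query.awaitTermination();
    }
}
```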
Where the centralization of data comes in for us is to make sure all those components can communicate with each other based on the last thing that happened in each of them individually. Their ability to embed data transformations and data processing in themselves, and then publish any change back out into the shared cluster, makes that state available to everybody else so they can react to it if they need to, right? So in a lot of ways we're agnostic to the type of processing that happens at the endpoints. It's more just the free movement of all the data to all those things, and if they have any relevant updates that need to make it back to any of the other components hanging off that process flow, they have the ability to publish that back in.

So one thing that Jay Kreps, your founder and CEO, talks about is that Kafka may ultimately, or in his language will ultimately, grow into something that rivals the relational database. Tell us what that world would look like. It would be controversial.

That's okay. You want me to be the bad guy? So it's interesting, because we did Kafka Summit about a month ago, and there are a lot of people, a lot of companies I should say, that are actually using Kafka and calling it an enterprise data hub, a central hub for data, a data distribution network, and they're literally storing all sorts of different lengths of data in it. One interesting example was the New York Times. They use Kafka and literally store every piece of content that has ever been generated at that publisher since the beginning of time, all the way back to 1851. They've digitized everything, it sits in there, and then they distribute it back out to various parts of the business.

So they replay it, they pull from it. Wow, okay.

So that has some very interesting implications, right? You can replay data: if you ran some analytics on something and didn't get the result you wanted and want to redo it, it's really easy and really fast to do that. If you want to bring on a new system with some new functionality, you can do that really quickly, because you have the full pedigree of everything that sits in there. And then imagine a world where you could actually start to ask questions on it directly. That's where it starts to get very profound, and it'll be interesting to see where that goes.

Two things. First, it sounds like, unlike a database that takes updates, where you don't have a perfect historical record, just a snapshot of current values, in a log like Kafka, a log-structured data structure, you have every event that ever happened.

Correct.

Now, what's the impact on performance if you want to pull...

That much data?

Yeah.

Yeah, it all comes down to managing the environment you run it on, right? Obviously, the more data you're going to store in there and the more types of things you're going to connect to it, the more you have to take that into account.
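The replay described above, re-reading everything Kafka has retained rather than just new records, looks roughly like this with the plain consumer API. A minimal sketch, not from the interview; the topic name is a placeholder nodding to the New York Times example.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayHistory {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replay-" + System.currentTimeMillis());
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign every partition of the topic and rewind to the first offset,
            // so the full retained history is re-read, not just new arrivals.
            List<TopicPartition> partitions = consumer.partitionsFor("articles").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Rebuild whatever downstream view is needed from each event.
                    System.out.printf("%d %s%n", record.offset(), record.key());
                }
            }
        }
    }
}
```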
And you mentioned just a moment ago directly asking questions about the data contained in the hub, in the data hub. How would that work?

Not quite sure today, to be honest with you. And I think that question is a pretty provocative one: what does it mean to have this entire view of all the granular event streams, not in some aggregated form over time? I think the key will be some mechanism that comes on top of an environment like this to make it more consumable for more business types of users. That's probably one of the areas to watch, to see how that gets enabled.

All right, I've only got one unanswered question, but you answered all the other ones really well. We're up against a hard break right now, so we're going to wrap it up here. I want to thank Clark Patterson from Confluent for joining us. Thank you so much for being on the show.

Thank you so much. Appreciate it.

Thanks for watching theCUBE. We'll be back after the raffle in just a few minutes. We have one more guest, so stay with us. Thank you.