Hello, and welcome to theCUBE's presentation of the AWS Startup Showcase: Data as Code. This is season two, episode two of the ongoing series covering exciting startups from the AWS ecosystem, and we're going to talk about the future of enterprise data analytics. I'm your host, John Furrier, and today we're joined by Gian Merlino, CTO and co-founder of Imply.io. Welcome to theCUBE. Thanks for joining me. Building analytics apps with Apache Druid and Imply is the focus of this talk, and your company is being showcased today. So thanks for coming on. You guys have been in streaming data at large scale for many, many years, a pioneer going back. This past decade has been the key focus, and Druid's unique position in that market has been key. You guys have been powering it. Take a minute to explain what you guys are doing over there at Imply.

Yeah, for sure. So to talk about Imply, I've got to talk about Druid first. Imply is an open source based company, and Apache Druid is the open source project that the Imply product is built around. What Druid's all about is it's a database to power analytical applications, and there are a couple of things I want to talk about there. The first is, why do we need that? And the second is, why Druid? I'll give a little flavor of both. So why do we need a database to power analytical apps? For the same reason we need databases to power transactional apps: the requirements of these applications are different. Analytical applications are apps where you have tons of data coming in and lots of different people wanting to interact with that data and see what's happening, both real time and historical. The requirements of that kind of application have given rise to a new kind of database, and Druid is one example. There are others out there, of course, in both the open source and non open source worlds.
And what makes Druid really good at it? People often ask, what is Druid's big secret? How is it so good at these things? Why is it so fast? And I never quite know what to say to that, so I always go to: it's just getting all the little details right. It's a lot of pieces that individually need to be engineered. You build software in layers; you build a database in layers, just like any other piece of software. And to have really high performance and do really well at a specific purpose, you have to get each layer right and have each layer carry as little overhead as possible. So it's a lot of nitty-gritty engineering work.

You know, it's interesting, the trends over the past 10 years in particular, maybe go back 15: the state of the art database was, stream a bunch of data, put it into a pile, index it, interrogate it, get some reports. Pretty basic stuff. And then all of a sudden, with cloud, you have thousands of databases out there, potentially hundreds of databases living in the wild. Now data's coming in with Kafka and Kinesis, these kinds of technologies; streaming data is happening in real time, so you don't have time to put it in a pile or index it. You want real-time analytics. And so apps, whether they're mobile apps, the Instagrams of the world, this is now what people want in the enterprise, and you guys are at the heart of this. Can you talk about that dynamic of getting data quickly at scale?

Yeah, so our thinking is that both things actually matter: real-time data matters, but historical context also matters. The best way to get historical context out of data is to put it in a pile and index it, so to speak. And the best way to get real-time context of what's happening right now is to be able to operate directly on the streams.
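The combination Gian describes, an indexed historical pile plus live stream data for real-time context, can be sketched in miniature. This is a toy Python model of the idea only, not Druid's actual API; every class and method name here is invented for illustration.

```python
from datetime import datetime, timedelta

class BlendedView:
    """Toy model of the blending idea: recent rows live in a real-time
    buffer fed by a stream, older rows in an indexed historical store,
    and a query sees one seamless view. Names are illustrative only."""

    def __init__(self, cutoff):
        self.cutoff = cutoff      # rows at or after this time count as "real-time"
        self.realtime = []        # (timestamp, row) pairs from the stream
        self.historical = []      # (timestamp, row) pairs already indexed

    def ingest(self, ts, row):
        # Route each incoming row to the right subsystem.
        target = self.realtime if ts >= self.cutoff else self.historical
        target.append((ts, row))

    def query(self, since):
        # The caller never says which system to hit; both are merged.
        rows = self.historical + self.realtime
        return sorted(row for ts, row in rows if ts >= since)

now = datetime(2022, 3, 1, 12, 0)
view = BlendedView(cutoff=now - timedelta(hours=1))
view.ingest(now - timedelta(days=1), "old-event")      # lands in historical
view.ingest(now - timedelta(minutes=5), "new-event")   # stays real-time
print(view.query(since=now - timedelta(days=2)))       # ['new-event', 'old-event']
```

The point of the sketch is the last line: the query names a time range, not a subsystem, which is the seamless blending Gian goes on to describe.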
And so one of the things that we did in Druid, I wish I had more time to talk about it, is we integrate this real-time processing and this historical processing. We have a system that we call the historical system that does what you're saying: it takes all this data, puts it in a pile, and indexes it, for all your historical data. And we have a system that we call the real-time system that is pulling data in from things like Kafka and Kinesis, or getting it pushed into it, as the case may be. This system is responsible for all the data that's recent; maybe the last hour or two of data will be handled by this system, and the older stuff is handled by the historical system. Our query layer blends these two together seamlessly, so a user never needs to think about whether they're querying real-time data or historical data. It's presented as a blended view.

It's interesting. And a lot of people just say, hey, I don't really have the expertise, and now they're trying to learn it. So their default was to throw it into a data lake, right? That brings back the historical. So the rise of the data lake: you're seeing Databricks and others out there doing very well with data lakes. How do you guys fit into that? Because that makes a lot of sense too; it looks like historical information.

Yeah. So data lakes are great technology; we love that kind of stuff. There are actually two very popular patterns with Druid. One is what I would call stream-focused, where you connect Druid up to something like Kafka and you load data as a stream. We'll actually take that data, store all the historical data that came from the stream, and then blend those two together. The other pattern that's also very common is the data lake pattern.
So you have a data lake, and then you're mirroring that data from the data lake into Druid. This is really common when you have a data lake that you want to build an application on top of. You want to say: I have this data in the data lake, I have my table, and I want to build an application on it that has hundreds of people using it, that has really fast response times, that is always online. So I'm going to mirror that data into Druid and then build my app on top of that.

Can you take me through the progression of the maturity cycle here? As you look back even a few years, the pioneers in hardcore streaming data, using data analytics at scale the way you guys are doing with Druid, were really a few percent of the population. And then as hyperscale went mainstream, it's now in the enterprise. How stable is it? What's the current state of the art relative to the stability and adoption of the techniques you guys are seeing?

Yeah, I think what we're seeing right now at this stage of the game (and this is something we see on the commercial side at Imply) is that this realization is getting out there to everybody: you actually can get a lot of value out of data by building interactive apps around it and by allowing people to slice and dice it and play with it. There is a lot of value here, and it is actually very feasible to do with current technology. I've been working on this problem in my own career for the past decade, and ten years ago, even the most high-tech of tech companies were like, well, I can sort of see the value, but it seems like it might be difficult. And we got from there to the high-tech companies realizing that it is valuable and it is very doable.
And I think there was a tipping point that I saw a few years ago, when Druid and databases like it really started to blow up. And now we're seeing that extend out beyond tech, beyond the high-tech companies, which is great to see.

Yeah, and a lot of people see the value of the data, and they see the application. Data as code means the application developers really want to have that functionality. Can you share the roadmap for the next 12 months for you guys? What's coming around the corner?

Yeah, for sure. So I mentioned Apache Druid is an open source community project; we're one member of that community, a very prominent one, but one member. So I'll talk a bit about what we're doing for the Druid project as part of our effort to make Druid better and take it to the next level, and then I'll talk about some of the stuff we're doing on the purely commercial side. On the Druid side, the big thing is something we really started writing about a few weeks ago: a new multi-stage query engine that we're working on. If you're interested, the full details are in a blog on our website and also on GitHub, on the Apache Druid GitHub. The short version is we're extending Druid's query engine to support more and varied kinds of queries, with a focus on reporting queries, more complex queries. Druid's core query engine has classically been extremely good at rapid-fire queries: thousands of queries per second, where each query is maybe something that involves a filter and a group-by, a relatively straightforward query, but we're doing thousands of them constantly.
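The "filter plus group-by" shape Gian mentions can be illustrated with a tiny in-memory example. The data and field names here are invented, and a real Druid deployment would express this as SQL over billions of rows; this sketch only shows the query shape itself.

```python
from collections import defaultdict

# Invented sample events; a real deployment would hold billions of rows.
events = [
    {"country": "US", "device": "ios",     "latency_ms": 120},
    {"country": "US", "device": "android", "latency_ms": 95},
    {"country": "DE", "device": "ios",     "latency_ms": 210},
    {"country": "US", "device": "ios",     "latency_ms": 80},
]

def filter_group_by(rows, predicate, key, agg=len):
    """Apply a filter, then group by one dimension and aggregate each group."""
    groups = defaultdict(list)
    for row in rows:
        if predicate(row):
            groups[row[key]].append(row)
    return {k: agg(v) for k, v in groups.items()}

# "How many US events per device type?" is exactly this shape:
# one filter, one group-by, answered again and again at high volume.
print(filter_group_by(events, lambda r: r["country"] == "US", "device"))
# {'ios': 2, 'android': 1}
```

The contrast Gian draws next is between doing this simple shape thousands of times per second and running one very complex, thousand-line reporting query.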
And where historically folks have not reached for technologies like Druid is really complex, thousand-line SQL queries, complex reporting needs, although people really do need to do both interactive stuff and complex stuff on the same data set. That's why we're building out these capabilities in Druid. And then on the Imply commercial side, the big effort for this year is Polaris, which is our cloud-based Druid offering.

Talk about the relationship between Druid and Imply. Share with the folks out there how that works.

Yeah, so Druid, like I mentioned before, is Apache Druid, a community-based project. It's not a project that is owned by Imply. Some open source projects are owned or sponsored by a particular organization; Druid is not. Druid is an independent project. Imply is the biggest contributor to Druid, so the Imply engineering team is contributing tons of stuff constantly, and we're really putting a lot of the work in to improve Druid, although it is a community effort.

You guys are launching a new SaaS service on AWS. Can you tell me what's happening there? What's that all about?

Yeah, so we actually launched that a couple weeks ago. It's called Polaris. It's very cool. Historically, there have been two ways to get started: you can either get started with Apache Druid, which is open source and you install it yourself, or with Imply Enterprise, which is our enterprise offering. One of the issues of getting started with Apache Druid is that it is a very complicated distributed database. It's simple enough to run on a single server, but once you want to scale things out, once you need to get all these things set up, you may want someone to take some of that operational burden off your hands.
And on the Imply Enterprise side, it says right there in the name: it's an enterprise product. It's something that may take a little bit of time to get started with; it's not something you can just roll up with a credit card and sign up for. So Polaris is really about having a cloud product that's designed to be really easy to get started with, really self-service, that kind of stuff. It provides a really nice getting-started experience that takes that maintenance and operational burden away from you, but is also as easy to get started with as something download-based would be.

So more developer-friendly from an onboarding standpoint. Classic.

Exactly, yeah, much more developer-friendly is what we're going for with that product.

So take me through the state of the art of data as code in your mind. Because infrastructure as code, DevOps, has been awesome; that's cloud scale, we've seen that. Data as code, a term we're kind of coining here, means data's in the developer process. How do you see data being integrated into the workflow for developers in the future?

Yeah, great question. I mean, all kinds of ways. I alluded to this earlier: building analytical applications, building applications based on data and based on letting people do analysis, is really valuable. In that context, there are two big ways we see these things getting pushed out. One is developers building apps for other people to use. So think: I want to build something like Google Analytics, something that collects my web traffic and then lets the marketing team slice and dice through it and make decisions about how well the marketing's doing. You can build something like that with databases like Druid and products like what we have at Imply.
And the other way is things that are actually helping developers do their own jobs, kind of like using your own products, using it for yourself. I'll just talk about my favorite use case. I'm really into performance; I've spent the last 10 years of my life working on high-performance databases, so obviously I'm into this kind of stuff. I love it when people use our products to help make their own products faster: this concept of performance monitoring and performance management for applications. One thing that I've seen some of our customers and users do that I really love is take that performance data of your own app as far as it can possibly go, take it to the next level. The basic level of using performance data is: I collect performance data from my application deployed out there in the world, and I just use it for monitoring. I can say, okay, my response times are getting high in this region; maybe there's something wrong with that region. One of the very original use cases for Druid was at Netflix, doing performance analysis. And performance analysis is more exciting than monitoring, because you're not just seeing that performance is good or bad in whatever region; you're getting very fine-grained. You're saying: in this region, on this server rack, for these devices, I'm seeing a degradation or I'm seeing an improvement. You can see things like, Apple just rolled out a new version of iOS, and on that new version of iOS, my app is performing worse than on the older version.
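As a hedged illustration of the drill-down Gian describes (the data, field names, and helper are all made up, not any real APM product's API), the slice-and-dice pattern might look like:

```python
from statistics import mean

# Invented performance samples; dimension names are hypothetical.
samples = [
    {"os": "15.3", "device": "iphone", "latency_ms": 110},
    {"os": "15.3", "device": "iphone", "latency_ms": 130},
    {"os": "15.4", "device": "iphone", "latency_ms": 220},  # new OS, slower
    {"os": "15.4", "device": "ipad",   "latency_ms": 95},
]

def avg_latency(rows, **filters):
    """Average latency over rows matching every dimension filter given."""
    hits = [r["latency_ms"] for r in rows
            if all(r[k] == v for k, v in filters.items())]
    return mean(hits) if hits else None

print(avg_latency(samples, os="15.3"))                   # 120
print(avg_latency(samples, os="15.4"))                   # 157.5
# Is it all devices, or just the iPhone? Slice one level deeper:
print(avg_latency(samples, os="15.4", device="iphone"))  # 220
```

Each extra filter is one more drill-down step, which is exactly the "is it all iOS devices, or just the iPhone?" question that comes next.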
And even though not many devices are on that new version yet, I can see that, because I have the ability to get really deep into the data. And then I can start to slice and dice that even more. I can say, for those new iOS users, is it all iOS devices? Is it just the iPhone? Is it just the iPad? That kind of stuff is just one example, but it's an example that I really like.

It's kind of like the old idea of data about the data; that context was always good to have. It's like data analytics for your data analytics: you see how it's working at scale. So this is interesting, because now you're bringing up the classic finding the needle in the haystack of needles, so to speak, where you have so much data out there. Like edge cases, edge computing, for instance: you have devices sending data off, there's so much data coming in, and the scale is a big issue. This is kind of where you guys seem to be a nice fit: large-scale data ingestion, large-scale data management, large-scale data insights, all rolled into one. Is that kind of...

Yeah, for sure. One of the things that we knew we had to do with Druid was we were building it for the internet age, so we knew it had to scale well. The original use case for Druid, the very first one we ended up building for, the reason we built it in the first place, is because that original use case had massive scale and we struggled to find something that could handle it. We were literally trying to do what we see people doing now: trying to build an app on a massive data set, and struggling to do it. So we knew it had to scale to massive data sets. A little flavor of how that works is, like I was mentioning earlier, this real-time system and historical system. The real-time system is scalable.
It scales out: if you're reading from Kafka, we scale out just like any other Kafka consumer. And the historical system is all based on what we call segments, which are files with a few million rows each. A cluster that's really big might have thousands of servers and millions of segments, but it's a design that does scale to these multi-trillion-row tables.

It's interesting, you go back to when you probably started: you had Twitter, Netflix, Facebook, a handful of companies that were at that scale. Now the trend is, you're on this wave where those hyperscalers, or these unique, huge-scale app companies, are now mainstream enterprise. So as you guys roll out the enterprise version of building analytics applications with Druid and Imply, they've got to kind of get religion on this. And I think it's not hard, because it's distributed computing, which they're used to. So how's that enterprise transition going? Because I can imagine people want it, are kicking the tires or learning, and are then trying to put it into action. How are you seeing the adoption of the enterprise piece of it?

Yeah, so the thing that's driving the interest is, for sure, doing more and more stuff on the internet. Anything that happens on the internet, whether it's apps or web-based, there's more and more happening there, and anything that is serving customers on the internet is going to generate an absolute mountain of data. The question is not whether you're going to have that much data; you do, if you're doing anything on the internet. The only question is, what are you going to do with it? That, I think, is what drives the interest: people want to try to get value out of this.
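The segment model Gian described above can be sketched as back-of-the-envelope arithmetic: historical data is chunked into time-partitioned files of a few million rows each, and a query only touches segments whose time range overlaps its interval. The row target and hourly interval scheme here are illustrative assumptions, not Druid's exact defaults.

```python
ROWS_PER_SEGMENT = 5_000_000  # illustrative target, not an exact Druid constant

def segments_needed(total_rows):
    """How many segment files a table of this size breaks into (ceiling division)."""
    return -(-total_rows // ROWS_PER_SEGMENT)

def prune(segments, query_start, query_end):
    """Keep only segments whose [start, end) interval overlaps the query window."""
    return [s for s in segments if s["start"] < query_end and s["end"] > query_start]

# A multi-trillion-row table still decomposes into a bounded number of segments...
print(segments_needed(2_000_000_000_000))  # 400000

# ...and time pruning means most queries touch only a sliver of them.
segments = [{"start": h, "end": h + 1} for h in range(24)]  # one segment per hour
print(len(prune(segments, query_start=22, query_end=24)))   # 2
```

The pruning step is why a query over the last two hours of a 24-hour table scans 2 segments rather than 24.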
And then what drives the actual adoption is, I don't want to necessarily talk about specific folks, but within every industry there are leaders: organizations that are leaders, teams that are leaders. What drives a lot of adoption is seeing someone in your own industry that has adopted new technology and gotten a lot of value out of it. So a big part of what we do at Imply is identify those leaders, work with them, and then they can talk about how it's helped them in their business. And then also, the classic enterprise thing they're looking for is a sense of stability, supportability, robustness, and that's something that comes with maturity. The super high-tech companies are comfortable using open source software that rolled off the presses a few months ago; the big enterprises are looking for something that has corporate backing, something that's been around for a while. And I think these technologies, Druid and technologies like it, are reaching that level of maturity right now.

It's interesting that supply chain has come up on the software side; that conversation is happening a lot now. You're hearing about open source being great, but at cloud scale you can get the data in there to identify opportunities, and also potentially vulnerabilities, which is a big discussion. Question for you on the cloud native side: how do you see cloud native and cloud scale, with services like serverless, Lambda, and edge, merging? It's easier to get into the cloud at cloud scale. How do you see the enterprise being hardened out with Druid and Imply?

Yeah, I mean, I think the cloud stuff is great. We love using it to build all of our own stuff; our product is, of course, built on other cloud technologies. And I think these technologies build on each other.
Like I mentioned earlier, all software is built in layers, and cloud architecture is the same thing. What we see ourselves doing is building the next layer of that stack: the analytics layer, the analytics database layer. When people first started doing this in the public cloud, the very first two services that came out were just: you can get a virtual machine, and you can store some data and retrieve that data. There were no real analytics on it, just storage and retrieval. And then as time goes on, higher and higher levels get built out, delivering more and more value, and the levels mature as they go up. The bottommost layers are incredibly mature, the topmost layers are cutting edge, and there's a kind of maturity gradient between those two. What we're doing is building out one of those layers.

Awesome. Abstraction layers, faster performance, great stuff. Final question for you, Gian. What's your vision for the future? How do you see Imply and Druid going? What's it look like five years from now?

Yeah, for sure. It seems like there are two big trends happening in the world, and it's going to sound a little bit self-serving for me to say it, but I believe in what we're doing here; I'm here because I believe it. I believe in open source, and I believe in cloud stuff. That's why I'm really excited that what we're doing is building a great cloud product based on a great open source project. I think that's the kind of company I would want to buy from. If I wasn't at this company and I was just building something, I would want to buy a great cloud product that's backed by a great open source project.
So the way I see the industry going, and the way I see us going, where I think would be a great place to end up as an engineering world, as an industry, is a lot of these really great open source projects, like what Kubernetes is doing in containers and what we're doing in analytics, et cetera, and then really first-class, really well-done cloud versions of each one of them. And so you can choose: do you want to get down and dirty with the open source, or do you want the abstraction of the cloud?

That's awesome. Cloud scale, cloud flexibility, community getting down and dirty in open source: the best of both worlds. Great solution. Gian, thanks for coming on and thanks for sharing here in the showcase. Thanks for coming on theCUBE.

Thank you.

Okay, this is theCUBE showcase, season two, episode two. I'm John Furrier, your host. Data as Code is the theme of this episode. Thanks for watching.