 Live from San Jose in the heart of Silicon Valley, it's theCUBE, covering DataWorks Summit 2017, brought to you by Hortonworks. Welcome back to theCUBE. We are live on day two of the DataWorks Summit from the heart of Silicon Valley. I'm Lisa Martin. My co-host is George Gilbert. We're very excited to be joined by our next guests from DataTorrent. We've got Nathan Trueblood, VP of Product. Hey, Nathan. Hi. And the man who gave me my start in high tech 12 years ago, the SVP of Marketing, Jeff Bettencourt. Welcome, Jeff. Hi, Lisa. Good to see you. Great to see you too. So tell us, SVP of Marketing: who is DataTorrent? What do you guys do? What are you doing in the big data space? So DataTorrent is all about real-time streaming. It's really taking a different paradigm to handling information as it comes from the different sources that are out there. Think big IoT, think all of these new things that are creating pieces of information. It could be humans, it could be machines, sensors, whatever it is, and taking that in real time rather than traditionally just putting it in a data lake and then later on coming back and investigating the data that you stored. We started in about 2011, founded by people who were early at Yahoo and pioneers in Hadoop and Hadoop YARN. This is one of those guys here too. And so we're all about building real-time analytics for our customers, making sure that they can get business decisions made in real time as the information is created. And Nathan will talk a little bit about what we're doing on the application side of it as well, building these hardened application pipelines that help our customers get started faster. Excellent. So, all right, let's turn to those real-time applications. 
My familiarity with DataTorrent started probably about five years ago, when I don't think there was much talk about streaming; it was positioned more as real-time data feeds. But now streaming is the center of gravity and a peer to big data. So tell us how someone who's building apps should think about the two solution categories, how they complement each other, and what sorts of applications we can build now that we couldn't build before. So the way I look at it is not so much two different things that complement each other; streaming analytics and real-time data processing is really just a natural progression of where big data has been going. When we were at Yahoo running Hadoop at scale, the first thing on the scene was simply the ability to produce insight out of a massive amount of data, but then there was this constant pressure: okay, now we've produced that insight in a day, can you do it in an hour? Can you do it in half an hour? And particularly at Yahoo, at the time that Amol, our CTO, and I were there, there was this constant pressure: can you produce insight from a huge volume of data more quickly? And so we saw at that time two major trends. One was that we were reaching the limit of where you could go with the Hadoop batch architecture of that era, so a new approach was required. And that was really the foundation of the Apache Apex project and of DataTorrent the company: simply realizing that a new approach was required, because the more that Yahoo or any other business can take information from the world around it and act on that as quickly as possible, the more competitive it's going to be. So I look at streaming as a natural progression to where it's now possible to get insight and take action on data as close to the time of data creation as possible. 
And if you can do that, then you're going to be competitive. We see this cutting across a whole bunch of different verticals. So that's how I look at it: it's not so much complementary as it is the direction big data is going. Now, the kinds of things that weren't possible before are the applications where you can act on insight, whether it's from IoT, sensors, or retail, all the things that are going on. Before, you would land data in a data lake, do a bunch of analysis, produce some insight, maybe change your behavior, but ultimately you weren't being as responsive as you could be to customers. The reason I think the center of mass has moved into real-time streaming is that it's now possible to give a customer an offer the second they walk into a store, based on what you know about them and their history. This was always something the internet properties were trying to move toward, but now we see that same technology being made available across a whole bunch of different verticals and industries. And that's why, when you look at Apex and DataTorrent, we're involved not only in things like ad tech, but in industrial automation and IoT, in retail and customer 360, because in every one of these cases, insurance, finance, security and fraud prevention, it's a huge competitive advantage if you can get insight and make a decision close to the time of data creation. So I think that's really where the shift is coming from. The other thing I would mention is a big thrust of our company and of Apache Apex: we saw that streaming was going to be something everyone was going to need, but the other thing we saw from our experience at Yahoo was that getting something to work at a POC level, showing that something is possible with streaming analytics, is really only a small part of the problem. 
Being able to put something into production at scale and run a business on it is a much bigger part of the problem. And so we put into both the Apache Apex project and our product the ability not only to get insight out of data in motion, but to put that into production at scale. That's why we've had quite a few customers who have put our product in production at scale and have been running that way, in some cases, for years. And so that's another key area where we're forging a path: it's not enough to do a POC and show that something is possible; you have to be able to run a business on it. So talk to us about where DataTorrent sits within a modern data architecture. You guys are integrated in a couple of different areas. Walk us through what that looks like. So in terms of a modern data architecture, part of it is what I just covered: we're moving from a batch to a streaming world. The notion of batch is not going away, but a streaming application is something that's running all the time, 24/7. There's no concept of batch there; batch is really more a matter of how you process data through that streaming application. What we're seeing in the modern data architecture is that typically you have people taking data, extracting it, and eventually loading it into some kind of data lake, right? What we're doing is shifting left of the data lake: analyzing information when it's created, producing insight from it, taking action on it, and then, yes, landing it in the data lake. But once you land it in the data lake, the purposes of what you're doing with that data have shifted. We're producing insight and taking action to the left of the data lake, and then using the data lake to do things like train the machine learning model that we're then going to use to the left of the data lake. 
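[Editor's note: the "shift left of the data lake" pattern described here can be sketched as a minimal illustration. All names in this sketch — the `score` function, the threshold, the in-memory stand-ins for the data lake — are hypothetical; this is not DataTorrent's or Apache Apex's actual API, just the shape of the idea: act on each event at creation time, then land it for later batch analysis and model training.]

```python
# Illustrative "left of the data lake" pipeline: score and act on events
# as they arrive, then land every event in the lake for later training.

def score(event):
    # Stand-in for a model trained offline on data-lake history (hypothetical).
    return 1.0 if event["amount"] > 900 else 0.0

THRESHOLD = 0.5
data_lake = []   # stand-in for durable storage (HDFS, S3, ...)
actions = []     # actions taken in real time, left of the lake

def process(stream):
    for event in stream:
        if score(event) >= THRESHOLD:               # insight at creation time...
            actions.append(("alert", event["id"]))  # ...acted on immediately
        data_lake.append(event)                     # then landed for batch use

process([{"id": 1, "amount": 100}, {"id": 2, "amount": 950}])
print(actions)         # [('alert', 2)]
print(len(data_lake))  # 2
```

The point of the sketch is the ordering: the action happens before the event ever reaches the lake, and the lake's job shifts to training and analysis.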
Use the data lake to do slicing and dicing of your data to better understand what kinds of campaigns you want to run, things like that. But ultimately you're using the real-time portion to take those campaigns and then measure the impact you're having on your customers in real time. So, okay, because that was going to be my follow-up question: there does seem to be a role for the historical repository for richer context. Absolutely. And you're acknowledging that. The low-latency analytics happen first, then you store it up for a richer model later. Correct. So there are a couple of things then that seem to be requirements, next steps. If you're doing the modeling, training the model in the cloud, how do you orchestrate its distribution toward the sources of the real-time data? In other words, if you do training up in the cloud, where you have the biggest and richest data, is DataTorrent or Apex part of the process of orchestrating the distribution and coherence of the models that should be at the edge, or closer to where the data sources are? So there are a couple of different ways we can think about that problem. We have customers today who are feeding into the streaming analytics application the models that have been trained on the data from the data lake. And part of the approach we take in Apex and DataTorrent is that you can reload and change those models all the time. Our architecture is fault tolerant; it stays up all the time, so you can actually change the application and evolve it over time. We have customers that are reloading models on a regular basis, and whether it's machine learning or even just a rules engine, we're able to reload that on a regular basis. The other part of your question, if I understood you, was really about the distribution of data, the distribution of models, and where you train them. 
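[Editor's note: the model reloading Nathan describes — swapping in a freshly trained model while the pipeline keeps running — can be illustrated with a minimal sketch. The `ModelHolder` class and its methods are hypothetical names for illustration, not the Apache Apex operator API; the sketch only shows the hot-swap idea.]

```python
# Illustrative hot model reload: the pipeline never stops consuming events;
# the model reference is swapped atomically under a lock when retraining
# finishes, so in-flight predictions always see a consistent model.
import threading

class ModelHolder:
    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def reload(self, new_model):
        # Called whenever a retrained model arrives from the data lake side.
        with self._lock:
            self._model = new_model

    def predict(self, event):
        with self._lock:
            return self._model(event)

holder = ModelHolder(lambda e: e * 2)    # v1 of the "model"
out = [holder.predict(e) for e in (1, 2)]
holder.reload(lambda e: e * 10)          # hot swap, no restart, no downtime
out += [holder.predict(e) for e in (1, 2)]
print(out)  # [2, 4, 10, 20]
```

The same shape works for a rules engine: the thing being swapped is just a callable, and the streaming side never has to restart to pick it up.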
And again, I think you're going to have data in the cloud, data on-premises, and data at the edge. What we allow customers to do is integrate that data and make decisions on it regardless of where it lives. So we'll see streaming applications get deployed into the cloud, but they may be synchronizing some portion of the data to on-premises, or vice versa. So certainly we can orchestrate all of that as part of an overall streaming application. I wanted to ask Jeff now, give us a cross-section of your customers. You've got customers ranging from small businesses to the Fortune 10. Give us some use cases that really jump out at you and that showcase the potential that DataTorrent gives. So if you think about the heritage of our company, coming out of the early guys who were at Yahoo, ad tech is obviously one that we hit hard, and it's something we know how to do really, really well. Ad tech is one of those things that is constantly changing. And you can take that same model and say, if I applied that to the distribution of products in a manufacturing facility, it's all the same type of activity, right? I'm managing a lot of inventory, I'm trying to get that inventory to the right place at the right time, and I'm trying to fulfill that aspect of it. So that's where we started, but we've got customers in the financial sector, right, that are really looking at instantaneous transactions that are happening, and then how do you apply knowledge and information to that while you're bringing that source data in, so that you can make decisions. Some of those decisions have people involved with them and some of them are just machine-based, right? So you take the people equation out. 
We have this funny thing that our CEO, Guy Churchward, talks about called the "do loop." The do loop is where the people come in, and the question is how we remove people from that do loop and really make it easier for companies to act and prevent. Then if you take that aspect of it, we've got companies in the publishing space, companies in the IoT space doing energy management and things like that. So we go from medium-sized customers all the way up to very, very large enterprises. They're really turning a variety of industries into tech companies, because they have to be these days. Well, and one other thing I would mention there which is important, especially as we look at big data and a lot of customer concern about complexity: I mentioned earlier the challenge of not just coming up with an idea but being able to put it into production. One of the other big areas of focus for DataTorrent as a company is that not only have we developed a platform for streaming analytics and applications, but we're starting to deliver applications that you can download and run on our platform that deliver an outcome to a customer immediately. So increasingly, as we see different verticals and different applications, we turn those into applications we can make available to all of our customers, applications that solve business problems immediately. One of the challenges in IT for a long time has simply been how you eliminate complexity. There's no getting away from the fact that this is big data and these are complex systems, but to drive mass adoption we're focused on how we can deliver outcomes for our customers as quickly as possible. And the way to do that is by making applications available across all these different verticals. Well, you guys, this has been so educational. We wish you continued success. It sounds like you're really being quite disruptive in and of yourselves. So if you haven't heard of them, datatorrent.com, check them out. 
Nathan, Jeff, thanks so much for giving us your time this afternoon. Thank you. Thanks for the opportunity. I look forward to having you back. You've been watching theCUBE, live on day two of the DataWorks Summit from the heart of Silicon Valley, with my co-host George Gilbert. I'm Lisa Martin. Stick around, we'll be right back.