Announcer: Live from the Wigwam in Phoenix, Arizona, it's theCUBE, covering Data Platforms 2017. Brought to you by Qubole.

Jeff: Hey, welcome back, everybody. Jeff Frick here with theCUBE. We are down at the historic Wigwam, 99 years young, just outside of Phoenix, Arizona, at Data Platforms 2017. It's really about a new approach to big data in the cloud, put on by Qubole, about 200 people. Very interesting conversations this morning, and we're really excited to have Karthik Ramasamy. He is the co-founder of Streamlio, which is still in stealth mode according to his LinkedIn profile, so we won't talk about that, but he's a longtime Twitter guy and shared some great lessons this morning about things learned while growing Twitter. So welcome.

Karthik: Thank you. Thanks for having me.

Jeff: Absolutely. So one of the key parts of your talk was this concept of real time. I always joke with people that real time means in time to do something about it. You went through a bunch of examples showing that real time is really a variable depending on the application, but at Twitter real time was super, super important.

Karthik: Yes, it is indeed important, because the nature of Twitter data is streaming data: tweets arrive at high velocity, and Twitter positions itself as a real-time delivery company. That means whatever information we get within Twitter, we have a strict time budget before we deliver it to people, so that when they consume the information, it is still live, still real time.

Jeff: And real time is becoming important not just for Twitter but for a lot of big enterprises, right? A great analogy I've heard is that we used to sample historic data to make decisions; now you want to keep all the data and act on it in real time. It's a very different way to drive your decision-making process.

Karthik: A very different way of thinking, especially as enterprises work out what real time means for them. Some traditional segments, like financial services, already understand the value of real time. So do the newer use cases: IoT, autonomous vehicles that have to make quick decisions, and healthcare, where predictive and preventive maintenance depends on deciding quickly. Traditional enterprises like retail are also starting to value real time, because it lets them blend into user behavior and recommend products in the moment, while people can still react. So it's becoming more and more important.

George: So Hadoop started out as mostly batch infrastructure, and Twitter was a pioneer in the design pattern that accommodates both batch and near real time. How has that big data infrastructure evolved so that, one, you don't have to split batch and real time? And what should we expect going forward to make that platform stronger for near-real-time analytics, potentially so that it can inform decisions in systems of record?

Karthik: Today there are two different infrastructures: one is the Hadoop infrastructure, the other is a real-time infrastructure. Hadoop is a kind of mega-store; the way all rivers reach the sea, it becomes the storage sea where all the data eventually lands. But before the data lands there, a lot of analytics and visibility can happen on those "data rivers," from the point of creation until the data reaches its destination, so the analysis is live rather than after the fact. Hadoop also has its own limitations in how real-time the data can be. For example, you can write data into Hadoop continuously, but until you close the file you cannot see the data at all, so a time budget comes into play. You could work around that by writing many small files, but then the NameNode blows up; if you write a million files within a day, the NameNode cannot sustain it. Those trade-offs are one reason we ended up building a new real-time infrastructure, a distributed log, where the moment data comes in, it is visible within a three-to-five-millisecond time frame.
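[Editor's note: to make the contrast concrete, here is a minimal sketch of the publish-then-immediately-consume pattern Karthik describes, using Apache Kafka (which George raises next) as a stand-in distributed log. The broker address and topic name are assumptions for illustration; the interview does not name the specific system Twitter built.]

```python
# Minimal sketch: data appended to a distributed log is visible to
# consumers within milliseconds, unlike a Hadoop file that must be
# closed before its contents can be read.
# Requires: pip install kafka-python
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"   # assumed local broker, for illustration
TOPIC = "tweets"            # hypothetical topic name

# Producer side: append one event to the log.
producer = KafkaProducer(bootstrap_servers=BROKER)
producer.send(TOPIC, b'{"user": "jeff", "text": "hello real time"}')
producer.flush()

# Consumer side: the event is readable almost immediately; no file
# close or batch boundary is required before the data is visible.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating if no new messages
)
for message in consumer:
    print(message.value)
```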
George: So this distributed log you're talking about, would that be Kafka? And at the output of that, would you train a model or just score a model? And would that model essentially be carved off from the big data platform and integrated into a system of record, where it would inform decisions?

Karthik: There are multiple things you can do. First of all, you can think of the distributed log as a data staging environment: the data lands there, and once it lands, there is a lot of sharing of the same data in real time, because several jobs are often consuming the same popular data source. The log provides high fan-out, in the sense that a hundred jobs can consume the same data, or different parts of it, so it makes a nice sharing environment. Once the data is there, it gets used for different kinds of analytics, and one of them is model enhancement. Typically you build a model in the batch world, because you are looking at a lot of data; once it's built, that model is preloaded into the real-time compute environment, like Heron, and you look up the model to serve from it. For example, in ad serving you look up the model for the most relevant ad for you to click. The next aspect is model enhancement, because user behavior changes over time. Can you capture that and incrementally update the model? Those updates can be done partly on the real-time side, rather than recomputing the batch again and again.

George: Okay, so it's sort of like: what's the delta? Train on the delta and score on the delta.

Karthik: Yes, and once the delta gets updated, when new user behavior comes in, the serving path looks at the continuously enhanced model. The enhancement captures the fact that the user's behavior is changing, and the ads are served accordingly.
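[Editor's note: a minimal sketch of the "train on the delta, score on the delta" idea, where a model built in batch is updated incrementally as new events arrive instead of being recomputed from scratch. It uses scikit-learn's SGDClassifier as a stand-in; the interview does not specify the learning algorithm, and the features and labels below are invented for illustration.]

```python
# Minimal sketch: batch-train once, then incrementally fold each new
# "delta" of events into the model instead of re-running the full
# batch job. SGDClassifier is a stand-in for whatever model the
# serving system actually uses.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Batch phase: initial training pass over a large historical dataset.
X_batch = rng.normal(size=(10_000, 8))     # 8 invented features
y_batch = (X_batch[:, 0] > 0).astype(int)  # invented click label
model = SGDClassifier(loss="log_loss")
model.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))

# Real-time phase: score incoming events with the current model,
# then incrementally update it with the observed outcomes.
for _ in range(100):                        # pretend stream of deltas
    X_delta = rng.normal(size=(32, 8))      # one micro-batch of events
    y_delta = (X_delta[:, 0] > 0).astype(int)
    scores = model.predict(X_delta)         # serve: score on the delta
    model.partial_fit(X_delta, y_delta)     # enhance: train on the delta
```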
George: Okay. So now that customers are getting serious about moving their big data platforms, and the applications on them, to the cloud, have you seen a change in the patterns of apps they're looking to build, or a change in the makeup of the platform they want to use?

Karthik: One disclosure: I have worked with Amazon and AWS, but within the companies I've worked for, everything is on-prem. Having said that, the cloud is nice because it gives you machines on the fly whenever you need them, with a bunch of tools around them for bootstrapping and everything else. That works ideally for small and medium companies. But a big company calculates the cost of the cloud versus doing it in-house, and there is still a huge gap, unless the cloud provider offers a big discount for the large companies to move in. Think about it: if I have 10,000 or 20,000 Hadoop nodes, can I move all of them into AWS? How much am I going to pay, versus the cost of maintaining my own data centers? I don't know the latest pricing, but approximately it comes to a 3x difference, even counting your own on-prem data centers, staffing, and everything else.

George: With on-prem being higher?

Karthik: On-prem being lower.

George: Lower? But that assumes you've got flat utilization.

Karthik: Flat utilization, yes. The cloud, of course, can scale out and give the illusion of unlimited resources. But in our case, when you provision that many machines, at least 50 or 60 percent are used for production, and the rest are used for staging, development, and the various other environments. So even at only 50 percent utilization, the total cost of those machines still ends up saving so much: you operate at roughly one third of what it would cost in the cloud.

Jeff: All right, Karthik, that opens up a whole can of interesting conversations that we just don't have time to jump into, so I'll give you the last word. When can we expect you to come out of stealth, or is that stealthy too?

Karthik: That is stealthy too.

Jeff: Okay, fair enough. I didn't want to put you on the spot, but thanks for stopping by and sharing your story.

Karthik: Thanks, thanks for having me.

Jeff: All right, he's Karthik, he's George, I'm Jeff. You're watching theCUBE. We're at the Wigwam Resort, just outside of Phoenix, at Data Platforms 2017. We'll be back after this short break. Thanks for watching.
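[Editor's note: a back-of-the-envelope illustration of the cost comparison Karthik sketches above. The 3x ratio and the 50-60 percent production-utilization figure come from his remarks; the normalized dollar figure is invented, and cloud capacity is assumed to be provisioned only for production load.]

```python
# Rough sketch of the on-prem vs. cloud math from the interview.
CLOUD_COST_PER_NODE_HOUR = 1.00   # hypothetical normalized cloud cost
ONPREM_COST_PER_NODE_HOUR = CLOUD_COST_PER_NODE_HOUR / 3  # the "3x" claim
PRODUCTION_UTILIZATION = 0.50     # lower bound of the 50-60% figure

# Cost per *productive* node-hour: on-prem also pays for idle capacity
# (staging, dev), so charge all of it against production by dividing
# by utilization; cloud is assumed sized to production only.
onprem_effective = ONPREM_COST_PER_NODE_HOUR / PRODUCTION_UTILIZATION
cloud_effective = CLOUD_COST_PER_NODE_HOUR

print(f"on-prem per productive node-hour: {onprem_effective:.2f}")  # ~0.67
print(f"cloud   per productive node-hour: {cloud_effective:.2f}")   # 1.00
# Even charging all idle capacity against production, on-prem comes
# out around one third cheaper, consistent with "still end up saving".
```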