Hi, everybody. This is Dave Vellante. We're here live at Wikibon headquarters in Marlborough, Massachusetts, post Hurricane Sandy. We survived, not too many power outages, not too long, but, of course, we're right back at it. Last week was big data week for SiliconANGLE Wikibon. We had theCUBE at IBM's IOD conference out in Las Vegas, and then we brought it to Strata plus Hadoop World, the O'Reilly Media and Cloudera sponsored conference. So we had theCUBE at both. It was very interesting to juxtapose IBM's IOD with Strata plus Hadoop World. IBM is really supergluing its analytics business to the big data meme. IBM's taking the concept that Thomas Watson put forth, the mantra of Think, and they're using Think Big, tying into Smarter Planet. IBM has an incredible portfolio of capabilities spanning hardware and software and services and really is going after the big data business, trying to be a leader there, integrating across its portfolio, going hard after different industries. A lot of suits, a lot of blue suits at IBM IOD. Juxtapose that with Strata plus Hadoop World: a lot of young people and, you know, middle-aged and older people, a whole mix of individuals from across the industry, startups, established companies, really trying to put forth visions of a better society, of big data really changing the world, improving education, improving healthcare. A lot of big ideas, as I say, a lot of startups and a lot of innovation. Now, one of the things that we've been tracking for quite some time is the notion that real time meets big data. And we saw that last week both at IBM IOD and at Strata. And one of the companies that is really leading that charge is HADAPT. HADAPT won the Startup Showcase Award at Strata plus Hadoop World, and we're here with Mingsheng Hong, who's the Chief Data Scientist at HADAPT. And we're going to talk about that and other trends. Welcome, Mingsheng. Thank you, Dave. Glad to be here. Good to have you back on theCUBE.
We just saw you last week at Strata. You gave us a demo. It was a shorter demo. Of course, as you know, the planes were backing up on theCUBE. We had a lot of people who wanted to get on, so we had to cut that short. So we've invited Mingsheng back here to do a little bit more of a drill-down. Now, before we go into the demo, I want to talk a little bit about that concept that I put forth up front: real time meets big data. Now, we saw that at Oracle OpenWorld with Larry Ellison, where he said real time, meet big iron, or Hadoop, meet big iron. Now, Larry Ellison's view of the world is you basically use Hadoop for the filtering, and then you ETL the data into an Exadata infrastructure, bring in Exalytics or Exalogic, in the case of Oracle, a million, million and a half dollar infrastructure to really run your deep analytics. So I have to ask you, Mingsheng, is that your philosophy of the world? I would say that this way of using Hadoop is certainly one of the popular or even prevalent ways, the notion of using Hadoop as sort of the big data refinery, because based on the original design of Hadoop, it doesn't support standard SQL or interactive performance, so it doesn't really address all the analytics needs. But actually, as things evolve, these days, if you look at Cloudera, HADAPT, and many other companies that are looking to improve the Hadoop technology, they're really looking to expand the role of Hadoop in the entire big data ecosystem. So it's definitely going much beyond the original ETL, sort of simple data dumping and transformation kind of role. Now, the other thing that we've been tracking is really the unification of SQL and NoSQL, those two worlds coming together. And of course, the advantage of this is that there's a whole lot of people out there who understand Structured Query Language, and there's a big base of programmers that can take advantage of that capability.
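The "refinery" pattern described above, filter the raw feed in Hadoop, then ETL the reduced set into a warehouse for deep analytics, can be sketched in a few lines. This is just a toy illustration with invented data; a plain list stands in for both the Hadoop-side filter and the warehouse, not anyone's actual pipeline.

```python
# The "big data refinery" pattern: Hadoop filters the raw feed down,
# then the refined subset is ETL'd into a relational warehouse.
raw_events = [{"user": u, "clicks": c}
              for u, c in [("a", 0), ("b", 12), ("c", 3), ("d", 40)]]

# Step 1 (Hadoop-side refinement): drop inactive users before loading.
refined = [e for e in raw_events if e["clicks"] > 0]

# Step 2 (ETL into the warehouse, e.g. Exadata in Oracle's picture):
warehouse = list(refined)
print(f"{len(raw_events)} raw events -> {len(warehouse)} warehouse rows")
```

The point of contention in the interview is whether this two-stage split is necessary at all, or whether both stages can live in one cluster.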
And if you marry that with Hadoop, where there's a lot less of that skill set, now you're bringing a whole new set of capabilities and skill sets into the Hadoop ecosystem. And the prediction is that that will really begin to steepen the S-curve. And we saw that with HADAPT. HADAPT made the announcement two weeks ago, actually, and then showcased it last week at Strata. Cloudera announced something called Impala. MapR really brought together some of its capabilities and is now talking about Drill, a new capability to bring SQL and NoSQL together. We saw a company like Platfora talking more about real-time Hadoop. But I really want to talk to you, Mingsheng, about HADAPT and what you've done to really unify SQL and NoSQL. So what's your, what's HADAPT's angle on that? Yeah, so HADAPT from day one, which traces back to, I think, four years ago, originated from the Yale research lab of, you know, our co-founder Daniel Abadi and his team. They focused on unifying the MPP database with Hadoop. As we discussed earlier, people do understand these two pieces of technology are complementary. So the standard practice these days is, okay, I need to set up two clusters, one Hadoop and one MPP database. I need to connect them and move data around. There's a lot of administration overhead, as well as performance overhead, for moving the data around. So the logical next step is, how can you unify these two clusters into one? And that's really the original thesis of HADAPT. Technically non-trivial, with a lot of patents originating from that. And since then we have, you know, evolved to really adding advanced analytics that, you know, help empower the business analysts, along with BI tool support. Because, as you know, business analysts love to use visualization tools and the interactive performance, like what they experience with Tableau and other BI tools. And they don't have the expertise to write MapReduce jobs or SQL queries.
So the way to bring the millions of business analysts onto the bandwagon of Hadoop is by providing BI tool support, interactive performance, and advanced analytics packages that they can invoke just by clicking a couple of mouse buttons. And those are some of the key things that we added in the HADAPT 2.0 that we announced last week. So a couple of points based on what Mingsheng just said. Daniel Abadi is no lightweight. He's been around. His PhD thesis essentially became Vertica. So we saw a very successful application of that core research become a company, a very successful company with a great exit. I guess the second point is the other big trend that we're seeing is visualization, bringing big data to the masses. Historically, the decision support business, the BI business, has been targeted at a few analysts, a very small number of people within organizations who can maybe make some big decisions and have high-impact decisions on the organization's bottom line. But the thrust that we're hearing in the big data world today is really to bring in things like visualization and other tools, simplifying the capabilities of big data and putting it in the hands of business people, not just the super analytic geeks and the risk management statisticians. So that's something that we're going to be talking about. So Mingsheng, why don't you set up the demo and then we'll get into it? Sure. So actually, this is going to be a demo showcasing some of the key features of HADAPT 2.0. As I mentioned earlier, BI tools integration. And we picked one of the popular tools on the market, Tableau. I personally love it. So this is going to be conducting not just standard analytics, but advanced ones like computing sentiment, building predictive models, and doing full-text search. It's all through the visual interface. And on the back end, we have HADAPT running, receiving and processing SQL queries with very low latency.
And finally, I'm going to show you a couple of advanced analytics packages. So to start the demo, I'm going to just kick off the incremental load. In other words, I'm going to do loading and querying at the same time. It's all happening on my Mac laptop. So the key point of this demo is not scalability, but it certainly inherits the properties of HADAPT; we scale beyond tens of terabytes. So in this setting, what we can show you is the Tableau dashboard. And I can actually refresh that to give you a feel for how interactive the dashboard rendering is. So as you see, it's already finished. So now I, as the analyst, can just go through the marketing trend, analyzing my own product compared to my competitors. As I'm stepping through the trend, I can also tell you a little more about the four black panels at the bottom, just to lift the curtain a little bit on the back end of the demo. We have the first window, which is tailing on HADAPT. So you have a lot of activity happening in the HADAPT back end. Then we have the second window, which is basically showing you the activity of loading more data into the HADAPT back end. And the third window here is actually all the queries being generated by Tableau and processed by HADAPT in real time. The last one is actually integrating with Mahout. As you know, Mahout is one of the popular Hadoop-based machine learning engines. And this is something we have integrated with HADAPT so that, again, the analysts can leverage very advanced machine learning functionality just by clicking a couple of mouse buttons. So you're bringing some of these popular tools into a single platform. And you've done that integration. You mentioned Mahout, Tableau, the visualization tools, the native HADAPT capability, which is SQL on Hadoop. What would I have had to do prior to doing this in HADAPT? How would it have worked if I were an IT organization? How would I have had to cobble this together?
So first of all, you would need to hire a team of so-called data scientists, or people who have strong expertise to write big data, MapReduce kind of jobs. And I haven't done detailed research, but my intuition is there are probably around 10,000 people in the world who have that kind of expertise. And many of them are, of course, employed by Cloudera, Hortonworks, and others. HADAPT is about democratizing big data. By that we mean you have the big data scientists develop these advanced packages once, packaged up as SQL functions. So now millions of Tableau users and BI analysts can just conduct the analysis by clicking a button. So back to your point, Dave, before this, we had to find these data analysts, data scientists, who can write the MapReduce jobs, and also invoke them and hand the results back to the data analysts, either in the form of visualization or maybe some spreadsheet. And the problem with that is, first of all, there's a lot of latency in developing the technical packages and adapting them to the business requirements. And when the analyst wants something to be tweaked, that's another very long cycle of collaboration, versus now, when we have the data scientist develop the package and hand it off to the analyst, she can use it, mix and match it in whatever way she wants. It's all interactive, and she can invoke it on different kinds of data sets. So you don't need this kind of very, I would say, involved collaboration cycle. So that's really one thing that we added for facilitating the collaboration between the data scientists and the business analysts. OK. What else can you show us? Yeah, OK. So as I am loading more data, if I refresh the dashboard, you will be able to see that more data is being loaded into the back end and reflected in turn on the Tableau dashboard. So here, as you can see, when I step through the time, there's one bubble corresponding to New York that has turned really red. And that signals attention.
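The write-once, invoke-from-SQL idea described here can be sketched with SQLite standing in for the actual backend. The `SENTIMENT` function and its toy word lists are invented for illustration; a real deployment would package a trained classifier, but the registration pattern is the same: the data scientist exposes the analytic once as a SQL function, and any SQL-speaking tool can then call it by name.

```python
import sqlite3

def sentiment(text):
    # Toy lexicon-based scorer standing in for a trained classifier.
    pos = {"love", "great", "awesome"}
    neg = {"hate", "awful", "terrible"}
    words = text.lower().split()
    return sum(w in pos for w in words) - sum(w in neg for w in words)

conn = sqlite3.connect(":memory:")
conn.create_function("SENTIMENT", 1, sentiment)  # expose it once as a SQL UDF
conn.execute("CREATE TABLE tweets (body TEXT)")
conn.executemany("INSERT INTO tweets VALUES (?)",
                 [("I love this energy drink",), ("awful taste, hate it",)])

# Any SQL client (a BI tool, say) can now invoke the packaged analytic.
for body, score in conn.execute("SELECT body, SENTIMENT(body) FROM tweets"):
    print(score, body)
```

The collaboration win is exactly what the interview describes: the analyst only ever sees `SENTIMENT(body)` in SQL, never the implementation behind it.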
So now, as the analyst, I can drill into that bubble to understand what's going on. And that triggers a bunch of sophisticated SQL queries, which in turn render these advanced dashboards. So let me walk you through that. In the upper left corner, we can see the revenue of my product trending in the recent time window. And we see it hasn't really changed much. So that's not really what's requiring the attention. But when you look at the upper right corner, we're computing the social buzz of our own product compared to our competitors. And over the last 24 time windows, there's somehow a huge spike for our competitors. So now we can drill into that time window; notice how the queries are being processed in an interactive fashion. And now we can pull up some detailed tweets, tweets happening within that time window. So I'm going to step back and call attention to a couple of things. First of all, we started with high-level aggregation kind of analysis. And now we can drill into very detailed individual raw tweets. That's from the raw data. That's all happening within one unified platform. You don't need to move data around or materialize things into cubes or other structures. It's all computed on the fly within the same platform. The second one is, what is social buzz? It's not a standard MapReduce function or a SQL function. It's something that, as I mentioned earlier, the big data scientists can develop once. And this is based on computing the sentiment, for which we actually leveraged Mahout to build a binary classifier. And we can talk more about how we built it later. So this is combining the sentiment of individual tweets as well as the influence level of the Twitter authors. So you can match up different variables to compute a single metric we call social buzz. And now you expose that advanced package on the Tableau interface. So the business analyst doesn't need to know how it's implemented. You can just refresh the dashboard.
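As described, the social buzz metric combines per-tweet sentiment with author influence. The exact weighting isn't spelled out in the interview, so here is a minimal sketch assuming a simple sum of sentiment-times-influence products over a time window; the field names and numbers are invented.

```python
def social_buzz(tweets):
    # One plausible formulation: weight each tweet's sentiment (+1/-1, or a
    # classifier score) by the author's influence, then sum over the window.
    return sum(t["sentiment"] * t["influence"] for t in tweets)

window = [
    {"sentiment": +1, "influence": 5000},  # positive tweet, big audience
    {"sentiment": -1, "influence": 200},   # negative tweet, small audience
    {"sentiment": +1, "influence": 50},
]
print(social_buzz(window))  # 5000 - 200 + 50 = 4850
```

A loud positive tweet from an influential author moves the metric far more than a quiet complaint, which is why a competitor's spike in one time window stands out on the dashboard.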
And as I will show you later, drag and drop columns to invoke this kind of functionality. So actually, let me show you another example to drive the point home. If I could ask you a question before you do that. So when you loaded that data in, you showed us you loaded more data in, and the data source was some Twitter data. And when you load it in, it's not an external ETL, right? It's a sort of simplified internal ETL. Is that a layman's... I know that's a layman's term here, but can you explain that a little bit? I would say the ETL process is definitely much simplified compared to some of the traditional BI kind of ETL workflows. So in this case, we are integrating different kinds of data sources. That's one of the advantages of HADAPT, being able to analyze a large variety of data. So the first dimension is public versus private data, right? We downloaded live Twitter feeds from social media. That's public. But we're also matching that up with our company's private sales records, because we want to understand the dynamics of the different variables and how they have been influencing the adoption of my company's product. In this case, we just picked energy drinks, a topic that everyone can relate to, right? Public or private. Second dimension: structured and unstructured. You have sales records, which are essentially structured, but you have the tweets, human text, something that's unstructured. And again, you can analyze all of them within one unified platform. Excellent. So let me show you another advanced analytics package. Just for example, if I am a business analyst, I may want to attach a dollar number to each tweet, right? I want to understand which tweets are highly influential. So maybe I can go after those Twitter authors, or find authors of similar profile, to have them endorse my product. I'd like to know, like, could I get paid every time I tweet? Absolutely. Why not? That's a great business model.
So the way I do it is I drag and drop an advanced analytics package called DollarWorth. I drag it onto the dashboard. OK, it's looking good, but I don't like summation. So let me compute the average. So it's done. It's all interactive. What's happening under the hood is definitely non-trivial. We invoke this kind of advanced analytics as part of the MPP side of the HADAPT platform, right? And this is actually bypassing MapReduce to achieve this kind of interactive performance. But the analyst doesn't need to know anything about that. You just drag and drop columns, and anyone can extend a Tableau dashboard with this kind of advanced functionality. So that's something that Tableau users are very familiar with. But before HADAPT, they were not able to invoke advanced packages in this intuitive, visualized fashion. Just to show you one more advanced analytics function we have integrated into HADAPT: full-text search. So I vaguely recall there were some tweets mentioning an interesting recipe with, let's say, Red Bull and vodka. So if I search all tweets mentioning vodka, OK, I have a couple of hits. That's good. Now, if I spell it slightly differently, right, V-O-T-C-A, I don't find anything. So maybe because I'm not a native speaker, or for other reasons, I can't find anything. That's annoying, right? But let me do what's called a fuzzy search, right? So notice this is something that standard SQL doesn't support. The exact matching that I gave earlier, right, if you spell it correctly, you have the substring matching; that's something you can leverage standard SQL databases to do. But if I do a fuzzy search, look, I found a couple of tweets I like. And even the tweets themselves have different spellings. You have V-O-D-K-A, V-O-D-C-A, and so on, right? Because people are speaking different languages, or this is just an informal social context. So this is very powerful.
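Fuzzy search of this kind is typically built on edit distance: "votca" is one single-character substitution away from "vodca" and two away from "vodka", so both surface with a tolerance of two edits. Here is a self-contained sketch of the underlying idea (the sample tweets are invented, and the real demo delegates matching to a full-text engine rather than scanning tokens in Python):

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_matches(term, tweets, max_dist=2):
    # Keep tweets containing a token within max_dist edits of the term.
    return [t for t in tweets
            if any(edit_distance(term, tok) <= max_dist
                   for tok in t.lower().split())]

tweets = ["Try red bull with vodka", "new vodca recipe!", "love my latte"]
print(fuzzy_matches("votca", tweets))  # both misspelling-tolerant hits
```

Exact substring matching (SQL `LIKE '%vodka%'`) would have missed both the misspelled query and the misspelled tweet; the edit-distance tolerance catches them.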
I don't want to mislead the audience into thinking only non-native speakers can leverage this, right? This is really about being able to analyze the social body of text in a fuzzy way to find all the relevant information that you are looking for. And previously, only the data scientists and very technical people could set up, let's say, Apache Solr, a very popular open source full-text engine. So it's Solr underneath, OK. Exactly. Set it up and run this kind of search. What we did is we integrated it with HADAPT and exposed the full-text functionality onto the BI interface, such as Tableau. So now you can get this kind of search engine experience interactively. The fuzzy search capability, is that native to Solr? Is that something you developed and integrated? Talk about that. So this functionality is actually native to Solr. Maybe not everyone knows this. But the beauty of this kind of integration with other tools, as opposed to building from scratch, is you can really harness a lot of nice functionality that you can just leverage. And of course, one of the key contributions with HADAPT is that we integrated it into our SQL language. So now you don't need to issue a separate Solr query with its own syntax, and then maybe bring back the data for additional SQL or ML processing. You just issue a unified SQL query that has the keyword searching as part of it, plus additional analytics for sentiment, for building predictive models, and what have you. I wonder if we can geek out a little bit. Not too much, but just at a high level. So can you talk, Mingsheng, about how the team at HADAPT has actually achieved this unification? I mean, what's going on underneath here? What's the innovation, the secret sauce? Sure, so one of the key contributions, as we discussed earlier, Dave, is really unifying the MPP database cluster with the Hadoop cluster.
So for those of you who are interested, you can actually go back to the research paper called HadoopDB, originating from the Yale research lab. In short, what we did is basically we built a relational storage and full-fledged engine. So it's not just storage; it also does all the relational processing, the filtering, join, and aggregation kind of processing, on each Hadoop node. So now on each Hadoop node, you have your familiar HDFS storage and MapReduce engine, but you also have the relational storage and engine. Now, for each incoming HADAPT SQL query, we decide, first of all, whether it can bypass the MapReduce engine, because as you know, whenever you invoke the MapReduce engine, even to extract one single bit, it might take 15 seconds. It's just a constant startup overhead. And that's the core reason why MapReduce, or Hadoop, is not interactive. So we actually built a separate engine sitting side by side with MapReduce, so that for an incoming query, if it can be made interactive, we bypass the MapReduce engine; we go to our own engine directly and answer that query. Now, for more sophisticated queries, we basically translate that SQL query into a MapReduce job, which then runs on top of all the Hadoop nodes. And the data can come from HDFS, our optimized relational store, as well as HBase. And the performance, of course, will vary depending on where you store the data. For example, if the data is in our optimized storage, you get far better performance. That's one of the original benefits in HadoopDB. That's one of the key contributions, because you can push down not just filtering, but aggregation and, in many cases, joins into the individual Hadoop nodes, without needing to retrieve all the data, shuffle it around, and process it in the MapReduce layer. So as you bring together the SQL and NoSQL worlds, obviously you don't have the full maturity of SQL, which has been around for decades.
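The routing decision described above, where short relational queries skip MapReduce's job-startup cost and heavier ones get compiled into MapReduce jobs, might be sketched as follows. The row threshold and the JOIN heuristic here are invented for illustration; a real planner would work from cost estimates and physical data placement.

```python
def route_query(sql, estimated_rows):
    # Hypothetical planner sketch: small, single-table queries take the
    # interactive relational path and avoid MapReduce's ~15 s startup;
    # large or multi-table work is translated into a MapReduce job.
    if estimated_rows < 1_000_000 and "JOIN" not in sql.upper():
        return "relational-engine"
    return "mapreduce"

print(route_query("SELECT count(*) FROM tweets WHERE state = 'NY'", 50_000))
print(route_query("SELECT * FROM tweets t JOIN sales s ON t.day = s.day",
                  10_000_000))
```

The first call returns "relational-engine" (interactive path); the second returns "mapreduce" (batch path), mirroring the side-by-side engines the interview describes.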
But can you do things like user-defined functions? Yeah, exactly. So the pieces we showed you in the demo, full-text search, computing the sentiment, computing the dollar worth of each tweet, these are all user-defined functions that someone can write once, plug into HADAPT, and now millions of BI users can invoke them. And how about column store? That's something that obviously Abadi knows something about. Do you use a column store in this approach? Not at this moment. It's on our product roadmap. You've got to start somewhere. We definitely see column store as yet another major source of performance improvement. When we designed the product roadmap and drove the maturity of the product, we took care in striking a balance between providing very fast performance and some additional benefits, such as invoking advanced analytics, being able to do full-text search, and BI tools integration. So, as of now, certainly the performance of our SQL processing won't be on par with some of the MPP leaders. On the other hand, compared to MapReduce and Hive, it's already an order of magnitude or more faster. And that's what you see here; that's what allows this kind of interactive performance. Well, you're trying to simplify Hadoop and bring Hadoop to a much wider audience at substantially better economics. That's what I see now. So the trade-off might be the full functionality that you'd get out of SQL, but that's really not your objective. So that'll evolve over time, presumably. So you're saying column store, for instance, is on your roadmap. When column store comes out, that will presumably enhance performance. Is that correct? Absolutely. It will further enhance the performance on top of the interactivity that we're already providing. And to your point earlier, Dave, it's actually our goal to build out all the standard SQL-92 and SQL-99 functionality.
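For readers unfamiliar with why a column store would help here: analytic queries typically aggregate one or two columns across many rows, and a columnar layout lets the engine scan only those values instead of whole records. A toy illustration of the two layouts (this is not HADAPT code; the feature was still on their roadmap at the time):

```python
# Row store: each record kept whole; scanning one column drags along
# every other field in the record.
rows = [{"id": i, "state": "NY", "amount": i % 100} for i in range(10_000)]
row_total = sum(r["amount"] for r in rows)

# Column store: each column kept contiguously; an aggregate touches
# only the one array it needs (and it compresses far better, too).
columns = {
    "id": list(range(10_000)),
    "state": ["NY"] * 10_000,
    "amount": [i % 100 for i in range(10_000)],
}
col_total = sum(columns["amount"])

print(row_total == col_total)  # same answer, very different I/O profile
```

This is the design insight behind C-Store/Vertica, which is why Dave raises it with an Abadi-founded company.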
And since the company was founded, we have made a lot of headway building this kind of standard SQL functionality. So it's definitely not just the Hive that people see on top of Hadoop. Well, so you mentioned Hive. I mean, it really is Cloudera's approach with Impala to integrate with HBase and Hive. Can you talk about, because it sounds very similar to what Cloudera's doing, there's a unification going on. So talk about the differences. Why would a company, a customer, buy HADAPT and, for example, not Impala? So, a couple of things. I would say, first of all, if you look at the nature of these two companies, interactive SQL processing and advanced analytics is all we do for our living. That's the core strength and the core positioning of HADAPT. We're a product company. Whereas, compared to us, Cloudera has a very established, successful business in service and training. And we don't need to worry about cannibalizing any existing revenue base or being concerned with making the product too easy to use. We're all about democratizing big data. And as you see from the demo here, Dave, we really empower millions of BI analysts to leverage the power of Hadoop and HADAPT. So that's the first one. Now, I'm a technologist, so I'm actually more comfortable talking about a couple more technical differentiations. So I would say the first one is, as I showed you in the demo, this is a working demo, and the productized version will be released in Q1 next year. And this is already working with Tableau and other BI tools. We are aware that it is in Impala's goals to also support BI integration, and at some point, I think they will be ready on that front. Another item that I want to call out again is the advanced functionality. We call it the HADAPT Development Kit, or HDK.
So, through this HDK, you can integrate advanced features, like integrating with Mahout for sentiment analysis or other predictive modeling, integrating with Solr for full-text search, integrating with HBase for low-latency data loading and updates, and with R for advanced analytics. So these are just examples, right? The possibilities are limitless, and it's all through this kind of customized, pluggable sort of analytic module that we support. So your strategy, then, is to expose the world to your capabilities through this HDK, as you call it, the API, and that's how integration with other tools ideally will happen. Is that correct? Exactly. Excellent. Now, I had another question about visualization. Tableau is by far the company that you hear all the buzz about in big data visualization, but interestingly, Tableau predates Hadoop. So it really wasn't designed with this sort of distributed nature in mind. Are you seeing other visualization tools emerge, and will you integrate with those, or do you expect those to really catch on? Or does Tableau have such a lead? I mean, do you have an opinion on that? That's a really interesting question. I would say my answer is twofold, right? First of all, there are a lot of existing BI tools in the market that people simply cannot afford to ignore, and that's one of the beauties with HADAPT. Although, to your point, Tableau was released, designed, and went to market much earlier, even before HADAPT was founded, right? But through the standard ODBC, JDBC protocols and standard SQL, as long as the backend supports them, and let me stress the point, it's not Hive, right? Hive might work with Tableau through some customization, but other standard BI tools will not. So either you have enough resources to convince each BI tool to change their product, or you have to go and comply with the standard, which is SQL, ODBC, and JDBC, and that's exactly what HADAPT does, right?
So this is what allows us to connect with not just Tableau but all the other existing BI tools. Now, the second part is the additional new paradigms of visualization. I certainly see those up and coming. You have a lot of HTML5-based or specific, let's say, map-based visualization tools. I would say that at this point, HADAPT, as a company, sees its core strength in developing the backend. So we are not looking to develop customized, let's say, HTML5-based visualization yet. Down the road, that might change. But right now, that's our core focus, as opposed to some other vendors that also provide an integrated experience with their own customized BI tool or visualization tool and the data engine and some other components. But you're trying to be a platform for the tools, agnostic to the individual tool. So, interesting conversation about Tableau. I mean, my prediction is that Tableau is a hot company. They've got their user conference coming up next month. They are very mature. I think they're going to IPO. I think it's going to be a hot IPO. They're exploding. And I also think they're going to get a lot of competition, for the reasons that we mentioned: it's a hot space, they've been around for a while, and there are new ways to attack that problem. So I think you're going to see tons of people coming into the marketplace there. Okay, so Mingsheng, thanks very much. I really appreciate you coming on theCUBE and sharing with us your perspectives, the demo, real time meets big data, the unification of the SQL and NoSQL worlds, trends that we're watching here, bringing visualization in, integrating multiple tools. Congratulations on winning the Startup Showcase at Strata plus Hadoop World and on getting the product ready. We'll be looking forward to the progress, and keep in touch. Thank you, Dave. Great to be here. I really appreciate it. All right, everybody. Thank you very much for watching. We'll see you next time.
This is Dave Vellante from wikibon.org, and this is theCUBE. Thanks for watching.