Live from New York, it's theCUBE, covering Big Data NYC 2015. Brought to you by Hortonworks, IBM, EMC, and Pivotal. Now your hosts, Dave Vellante and George Gilbert.

Welcome back to New York City, everybody. This is theCUBE. We're here, this is day three at Big Data New York City, Big Data NYC, as part of Strata and Hadoop World. This is theCUBE's sixth year covering Hadoop World. Yuan Hao is here, he's the co-founder and CTO of Transwarp Technology out of China. Great to see you in the U.S. We were talking off camera about what you guys do. Very interesting company doing massive-scale work in big data within China, now entering the U.S. market. Welcome to theCUBE, thanks for coming on. Thank you, and it's also very good to see you. Yeah, so tell us a little bit more about Transwarp. Many people in our audience have not heard of you. You're doing some really great work in China. Talk more about that. Okay, so Transwarp is a big data company in China. We founded Transwarp three years ago, and Transwarp is growing very fast, from the early days of 10 engineers to a few hundred engineers today. And we are actually the leading Hadoop distro vendor in China. We have quite a lot of customers. They are deploying our clusters and storing a lot of data on our platform.

Just since your name recognition isn't yet as great in the U.S., one of the things that might surprise people is the scale at which your customers are operating. Perhaps you can tell us a little more about that. Okay, yes. Actually the number one sector deploying our product is telecom. One of the telecom companies, China Unicom, has a very large cluster, and they are collecting all the call detail records from their customers. The cluster is storing about 20 petabytes of records. Because, you know, in China there are a lot of people. The population is larger than in the U.S., so the data is also larger.

And what are folks doing? What are your customers doing with the technology? Is it churn analysis? Is it sort of sales and marketing? Is it infrastructure assessment? Can we talk about that? Yeah, there are several typical use cases. The first one is a data warehouse, actually. We have already acquired quite a lot of customers in financial services. They are using our product to run traditional data warehouse workloads. They move their data from their core banking systems to Hadoop and do a lot of risk analysis and some batch processing to clean the data. Because we provide the most complete SQL support, they are able to migrate their existing applications to our platform very easily. That's a very typical use case, because the data volume is increasing very quickly and their existing relational databases cannot handle so much data. So they need a new infrastructure, but the application is still the older one, so they need to migrate their applications to the new architecture. That's the first requirement I think we saw in China. For telecom companies, this is still the major problem, because there is a lot of data, like several petabytes of data. They need to do massive processing, like computing statistics to evaluate KPIs or metrics, and to actually create a tag for every customer and define a data plan for their customers. So this is the batch processing use case.
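The per-subscriber KPI and tagging job described here is a typical SQL-on-Hadoop batch workload. The interview does not show Inceptor's actual syntax, so the sketch below uses open-source Spark SQL as a stand-in, with hypothetical table names, column names, and thresholds, only to illustrate the shape of such a job.

```python
# A rough sketch of the per-subscriber KPI / tagging batch job described above,
# written against open-source Spark SQL as a stand-in for Transwarp Inceptor.
# Table names, column names, and thresholds are all hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cdr_kpi_tagging").getOrCreate()

# Call detail records landed on the Hadoop cluster, e.g. as Parquet on HDFS.
cdr = spark.read.parquet("hdfs:///warehouse/cdr/month=2015-09")

# Simple per-subscriber KPIs: call volume, total talk time, data usage.
kpis = (cdr.groupBy("subscriber_id")
           .agg(F.count("*").alias("calls"),
                F.sum("duration_sec").alias("total_sec"),
                F.sum("data_mb").alias("data_mb")))

# Tag each subscriber so a suitable data plan can be offered.
tagged = kpis.withColumn(
    "tag",
    F.when(F.col("data_mb") > 10_000, "heavy_data")
     .when(F.col("total_sec") > 36_000, "heavy_voice")
     .otherwise("light_user"))

tagged.write.mode("overwrite").saveAsTable("dw.subscriber_tags")
```

In an engine with PL/SQL compatibility such as the one described, the same logic would more likely arrive as SQL and stored-procedure code migrated from the original Oracle warehouse rather than being rewritten as a DataFrame job.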
And the second typical use case is actually arising in the Internet of Things. There are a lot of sensors in China, like in public transit systems. Many cities have deployed a lot of sensors on the road, so they actually collect a massive amount of data, and they need to process this data in real time.

Okay, and as the number one Hadoop distro vendor in China, can you describe your distribution a little bit? Here we're used to Hortonworks, Cloudera, and MapR, and we kind of understand: Hortonworks is pure Apache, Cloudera is kind of in between, MapR builds its own IP. Where do you fit in that spectrum? Yeah, actually we have two product lines. The first product we call Transwarp Data Hub. That is basically a Hadoop distro. We also patch the Hadoop core for customers. And we actually build four sub-products on top of Hadoop. The first is a SQL engine on top of Hadoop, we call it Inceptor. That SQL engine provides complete support for the SQL 2003 standard, and we also provide support for PL/SQL, which is Oracle's extension to SQL. We are about 98% compatible with Oracle's syntax. What's your SQL engine called again? Inceptor. Inceptor, okay. That engine is able to support global transactions on top of Hadoop. That means you can keep the data consistent, even in case of failures. This is a very critical feature for financial services. Yeah, sure.

So this is the first product on top of Hadoop; we have the other three. The second successful product is our Stream product. That is a streaming framework on top of Hadoop. We are able to do complete SQL on streaming data, and we are able to run machine learning algorithms on top of streaming data. That means you can do the processing as the data flows in. So we call it Stream, Transwarp Stream. This product is also very successful. We have deployed it in, I think, already more than 50 cities. They have deployed our Stream product.

Can you tell us? I know there are two more products to go, but just to clarify on the SQL product: I understand that it includes PL/SQL, so it's easy for customers perhaps to migrate some of their Oracle workloads. But did I understand you correctly to say that this is not just for decision support, but transaction processing as well? Not for transaction processing, but even in a data warehouse you need to modify the data, right? You have to insert, update, or modify. Then you have to maintain the transactions, otherwise the data will be inconsistent in case of failures.

Okay, so let me then jump ahead to the streaming product and the machine learning it can perform. Can you tell us a little more about all the sensors in some of these cities, and when you're learning about the data as it's coming in, what could you do with it? So, for example, first you need to transform the data into a new form, like an FFT transformation, from the time domain to the frequency domain. Oh, from time. And then you have to run some statistics over the streaming data; you can do SQL processing over the streaming data so that you can collect some metrics. And these metrics are fed into a machine learning algorithm to detect some risks or maybe malfunctions of the devices. So basically, from statistics to machine learning, all of this can be done on the streaming framework. And the latency is very low, less than about 300 milliseconds. That's impressive; I mean, Spark Streaming really gets sort of stuck when you try to get below 400 milliseconds, and they don't yet have deep integration between SQL and streaming and machine learning.
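The pipeline described here (an FFT to move each sensor window from the time domain to the frequency domain, simple statistics over the window, then a model that flags malfunctions) can be sketched outside any particular streaming engine. Transwarp Stream's API is not shown in the interview, so the snippet below is only a minimal single-window illustration in NumPy; the window size, features, and threshold are hypothetical.

```python
# Minimal sketch of the per-window feature extraction and anomaly check the
# guest describes; a streaming engine would run this on every incoming window.
import numpy as np

def window_features(samples: np.ndarray) -> dict:
    """FFT-based and statistical features for one window of sensor readings."""
    spectrum = np.abs(np.fft.rfft(samples))  # time domain to frequency domain
    return {
        "dominant_freq_bin": int(np.argmax(spectrum[1:]) + 1),
        "spectral_energy": float(np.sum(spectrum ** 2)),
        "mean": float(samples.mean()),
        "std": float(samples.std()),
    }

def is_anomalous(energy: float, baseline_mean: float, baseline_std: float, k: float = 3.0) -> bool:
    """Flag a window whose spectral energy drifts more than k sigmas from the baseline."""
    return abs(energy - baseline_mean) > k * baseline_std

# Baseline learned from windows recorded during normal operation (placeholder data).
normal_energy = [window_features(np.random.randn(256))["spectral_energy"] for _ in range(100)]
baseline_mean, baseline_std = float(np.mean(normal_energy)), float(np.std(normal_energy))

# One incoming 256-sample window from a single vibration sensor.
feats = window_features(np.random.randn(256))
print(feats, is_anomalous(feats["spectral_energy"], baseline_mean, baseline_std))
```

In production, the machine learning model the guest mentions would replace the simple sigma threshold, but the data flow is the same: transform, aggregate, score, alert.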
So it sounds like you have a rather sophisticated product here, one that sounds like it's ahead of some of the other products that are very well known here. Right, actually we wrote the streaming framework ourselves, so it is a different framework from Spark Streaming. Understood. Because we need to support things like SQL processing on top of streams. But then what are some of the capabilities? What are some of the activities that you do once you've learned from this incoming data? Let me give some examples. Yes. The first example actually is from the energy segment. There are a lot of power companies that use wind to generate power, and also some solar power companies. There are already a lot of sensors in their devices; in one wind generator there are almost one thousand sensors in a single generator. They need to detect any malfunction of their generators so that they can alert their staff earlier. So we collect all the data from the sensors. In one example, they actually collect 10 million sensor readings from all their generators every second. Wow. So they have to process this data within every second to detect malfunctions or raise alerts. Before it breaks. Yeah, before it breaks.

Now, I want to ask you about your business model. You mentioned a number of products, and if I understood it correctly, you sell those products on top of basically Hadoop core. Is that correct? Yes. We adopt a similar business model to Cloudera. So we have the open Hadoop core; for some customers, we just provide this open-source Hadoop core. And we sell our proprietary products, like Inceptor and Stream. We also have two other sub-products on top of Hadoop, and these are closed source. For the business model, we also adopt a subscription model for Hadoop, to provide services for Hadoop. And for the components on top of Hadoop, usually the customer will buy our product under a proprietary license. Okay, so you're like Cloudera in that sense, but also similar to Hortonworks in the subscription model for support. Is that correct? Am I understanding that right? Yes, for Hadoop core we provide similar subscription services. But for the products on top of Hadoop, they just buy our product, like they would Oracle or some other database. So you have the best of both worlds.

Okay, and then you started the company, you said three years ago, were you co-founder? Two years ago. Two years ago. Wow, and you're already up to, you said, several hundred engineers. Right. That's impressive. Now, I wonder if we could talk a little bit about the entrepreneurial climate, the startup climate in China. Talk about why you started the company, how it went about, how you got funding, maybe give us some insight into that. Okay. So actually we found the demand in China is slightly different from the US. They actually have more data, and Chinese customers are more practical. They need to solve their problems rather than just adopting a new architecture, new product, new technology. That's why we found we needed to build a SQL layer on top of Hadoop, and this SQL layer must be very complete so that they can just use our product as a database. This is the major reason: we found gaps between Hadoop and the applications, and that's why we felt we needed to start a new company to fill those gaps. And in China, I think a lot of VCs and PE firms are looking for good companies, good startups.
And actually they are very eager to find those companies. And the investment environment has changed a lot, I think. You know, the Premier of China, Li Keqiang, is encouraging young people to start companies, and the local governments, like Shanghai, Beijing, Guangzhou, and Shenzhen, are actually helping these startup companies grow quickly. So I think there is a very good climate for startup companies in China today. Excellent. And so you have outside investors, yes? Is that right? Yes. We already finished our angel, Series A, and A-plus rounds of funding, and we are doing Series B funding now. And are your competitors US-based distribution companies, or do you have local Chinese competitors? I think most of them are Chinese, local competitors. They are using open source to compete against us. I see. Okay, but you were first, is that right? We were the first. And also we found that Cloudera has its office in China too, but they are not the major player in the China market.

And what about aspirations in the US market? You've launched a division in the US, headquarters in the US. What are your aspirations here? Yes, we are actually looking for partners in the US. We hope to apply our experience, our applications, and some of the knowledge gained in China to the US, and also to find some different use cases. Like traditional data warehouse workloads; we are capable of doing this, and I think we are more capable than other Hadoop vendors in the US market. It sounds like, whereas other competitors in China might be able to do core Hadoop distributions from open source, just pulling an Apache product down from GitHub, you've put these value-add layers on because Chinese customers are more pragmatic. Whereas here, customers might buy it because their CIO says, what are we doing in big data? It sounds like in China, they'll buy it if it answers, what problem are we solving? So are you looking to sell those four value-add products on top of Hadoop core? Yes, yes. With these four products on top of Hadoop, I think we are very competitive in comparison to other Hadoop vendors, so we hope to sell these products in the US. They are very capable of filling the gaps between top-layer applications and the Hadoop core.

So what are your thoughts on what you've seen this week at Strata Hadoop World and Big Data New York City? Any impressions that you can share with us? What have you learned? Yes, this year's Strata conference is, I think, much larger than last year's, and there are a lot of startup companies coming up, even more than last year. I think last year we found about 100 companies, and this year many more than 100 companies sponsored this conference. And we also found some new technologies coming out, like Cloudera's Kudu; they are building a new storage engine complementing HDFS and HBase. And we also found Spark is still growing very fast. Yeah, so I wonder if I could ask, because you're a technologist, you understand this stuff pretty well: you mentioned complementing HBase and HDFS. Some people feel as though perhaps they're competing, if you will. What are your thoughts as a technologist? Particularly HDFS; I mean, maybe HBase has a different sort of use case, but what about that? I think there have been some criticisms of HDFS and HBase in the past several years: they cannot do fast processing or OLAP analysis on top of real-time data. So I think the new product, Kudu, is filling that gap. And we actually have a similar product called Holodesk. Initially we designed this product as a cache layer, and then we allowed insertion and modification into the storage in real time. It is also a column store, so people can do OLAP queries on top of real-time data too. Because we found some problems in China, like with sensor data, we have to collect the data in real time and run queries, OLAP queries, on top of the real-time data very quickly. So I think this kind of storage engine has the potential to replace HBase in its own realm. There is also another alternative approach, which is to improve HBase to be able to do these queries very quickly. So these two approaches will probably converge, just like Google Spanner and Google F1. And I think that storage engine is inspired by Google Spanner. Yeah, interesting. Google Spanner came out a couple of years ago, and it was a very interesting globally consistent database capability; there's a lot of good science work going on there.
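The gap being discussed is the ability to keep ingesting data and run OLAP queries on it immediately, rather than waiting for files to land on HDFS. Neither Kudu's nor Holodesk's API appears in the interview, so the snippet below only illustrates that access pattern on a single node, using DuckDB, an embedded columnar engine, purely as a stand-in; the table and values are made up.

```python
# Illustrates the "insert in real time, query immediately" pattern behind
# Kudu- or Holodesk-style storage engines, using DuckDB as a single-node stand-in.
import duckdb

con = duckdb.connect()  # in-memory, columnar
con.execute("""
    CREATE TABLE readings (
        sensor_id INTEGER,
        ts        TIMESTAMP,
        value     DOUBLE
    )
""")

# "Real-time" ingestion: rows are queryable as soon as they are inserted.
con.execute("""
    INSERT INTO readings VALUES
        (1, TIMESTAMP '2015-09-30 10:00:00', 0.91),
        (2, TIMESTAMP '2015-09-30 10:00:01', 7.35)
""")

# OLAP-style aggregation over the data that just arrived.
print(con.execute("""
    SELECT sensor_id, count(*) AS n, avg(value) AS avg_value
    FROM readings
    GROUP BY sensor_id
    ORDER BY sensor_id
""").fetchall())
```

What a distributed engine such as Kudu adds on top of this single-node pattern is partitioning, replication, and background compaction across many machines; the query-side contract is the same.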
What about the cloud in China? There's a lot of discussion here about whether it's on-premises or in the cloud. How prevalent is the cloud for Hadoop workloads in China? It's still in the early stage. There are quite a lot of cloud providers in China. We actually have another product line called the Transwarp Operating System, which is a derivative work of Google's Kubernetes. You know Kubernetes? Yeah, yeah. So we run Transwarp Data Hub on top of Kubernetes, so that we can run any component of Hadoop very smoothly on Kubernetes and Docker. You can install a Hadoop cluster in a few minutes; the whole cluster installation is done in a few minutes using Docker and containers.

Just a quick follow-up on that. When customers in China evaluate running in the cloud or running hybrid, what are some of their reasons? Because I'm just wondering if it's different from the US. Yes, so technically we run our Hadoop product on top of some public clouds, like QingCloud, and we are going to launch services on other clouds later. Some customers, like retail companies, often do their analysis every week or every month. They do not need to build a large cluster to process their data; that is not cost effective, so they prefer to use the public cloud. The data size is usually less than one or two terabytes, so they can upload their data onto the public cloud and do the analysis. And then we found another company called ATA. ATA does examination and testing services for students. Every year they hold like three to four exams, so it makes sense for them to use the public cloud to do the analysis, because they only need to do the analysis three to four times a year. Four times. So they don't need to bother buying a cluster.

Excellent. We have to leave it there. Thanks very much for coming on theCUBE. It was great to meet you, and we appreciate you sharing your experiences and your knowledge about the market in China as well as the technologies. So congratulations, and we look forward to seeing you in the future. Thank you. Okay, keep it right there, everybody. We'll be back with our next guest. This is theCUBE. We're live from NYC. Right back.