Live from New York, it's theCUBE, covering Big Data New York City 2016. Brought to you by headline sponsors Cisco, IBM, NVIDIA, and our ecosystem sponsors. Now, here are your hosts, Jeff Frick and George Gilbert.

Hey, welcome back, everybody. Jeff Frick here with theCUBE. We are in day three of Big Data NYC. We're going wall to wall. We actually came on Monday to cover some of the artificial intelligence and machine learning activities that are now part of Strata and also now Big Data NYC. And we're really excited to be bringing you the coverage like nobody else does, all day long, all the thought leaders. And our next guest we're really excited to have on: Juanjo Sun, the co-founder and CTO of Transwarp Technology. Welcome.

Thank you, Jeff.

So what are your impressions of the last few days here at the show?

Oh, it's just as crowded as last year. So many people are attending the Strata conference.

Excellent. And do you have any new announcements that you're bringing out here at the show?

No, not this year. But we probably plan to sponsor the next conference in New York or San Jose.

Oh, excellent. So, Juanjo, Transwarp is not all that familiar to American audiences, but tell us your pedigree in China, in terms of some of your large customers and how long you've been in operation, so that the American audience can understand just how big a force you are in China.

Okay. So Transwarp was founded in 2013, three years ago. We are a big company in China. Until now, we have about 500 customers, and they are all deploying Hadoop and Spark. So they are all Spark users, actually, in China.

Wow.

So we are providing a Hadoop distribution to customers in China, but we also build several different products on top of Hadoop. The first is a database layer on top of Hadoop. So we provide a SQL engine, and that SQL engine is compatible with Oracle's, DB2's, and Teradata's SQL extensions. The compatibility level is about 90%, so that people can move their data warehouse workloads to our product. And we also have a streaming engine on top of Hadoop. That streaming engine supports full SQL as well as PL/SQL. This product is used widely in IoT. So people use our product to collect the data from sensors, like on wind power generators, so that they can detect malfunctions of the generators in real time.

Okay, so we'll come back to IoT because that's top of mind for everyone. But what we've found that's interesting here in the States, and even in Europe, with Hadoop distributions is that most of the vendors provide an MPP, massively parallel processing, SQL database as a complementary product. I don't mean free, but as an add-on. But these are all sort of new implementations. They're not compatible with customers' existing investments in Teradata, IBM, or Oracle. So usually they're offloading particular workloads like ETL, or they're doing completely new things like schema on read. How much effort was it for you to keep that compatibility? And then what does that allow your customers to do when they migrate from the legacy systems?

Yes. Actually, in the past, or at present, if they are using Apache Hadoop, they have to use a hybrid architecture. They combine an MPP database together with Hadoop, using Hadoop for ETL and using the MPP database for interactive analysis or some batch processing. But with our product, our goal is to create a one-stop platform, so that you do not need additional databases. All data is stored in one single place, and you can run batch processing and do interactive analysis using one single database. So the data is not copied from one database to another.
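To make the one-stop idea concrete, here is a minimal sketch using open-source PySpark SQL, not Transwarp's proprietary engine; the data, table, and column names are hypothetical. The point is that the ETL step and the interactive query both run in the same engine over the same storage, with no copy out to a separate MPP system.

```python
# Minimal sketch of the "one-stop platform" pattern using open-source
# PySpark, not Transwarp's proprietary engine. Data, table, and column
# names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("one-stop-demo").getOrCreate()

# Stand-in for raw records landed on HDFS.
raw = spark.createDataFrame(
    [("o1", "c1", "19.99"), ("o2", "c2", None), ("o3", "c1", "5.00")],
    ["order_id", "customer_id", "amount"],
)
raw.createOrReplaceTempView("raw_sales")

# Batch ETL step: cleanse and type the records, keeping them in the same engine.
spark.sql("""
    SELECT order_id, customer_id, CAST(amount AS DOUBLE) AS amount
    FROM raw_sales
    WHERE amount IS NOT NULL
""").createOrReplaceTempView("sales_clean")

# Interactive analysis against the very same store: no copy to an MPP database.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM sales_clean
    GROUP BY customer_id
    ORDER BY total DESC
""").show()
```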
So when you mentioned that, with other distributions, typically the customer uses one part of Hadoop for the extract, transform, load, taking that off the data warehouse, where it's a very expensive process, and then they typically do the analysis in an MPP SQL database. The ETL offload, is that typically done in Hive, or is it done now more often, perhaps, in Spark?

So if you are using open source projects, then people usually use Hive for batch processing, for ETL, because it is more stable and scalable. But with our product, our engine is built on top of Spark. That means all the batch processing is done using the Spark engine. And again, because we are compatible with traditional databases' SQL syntax, all the ETL logic can be migrated to our platform mostly without modification, or with minor modifications. Because we also support distributed transactions, we allow you to insert and update records on HDFS. That means you can even synchronize new records or new updates from traditional databases to Hadoop in real time. And you can also batch insert or batch update records on HDFS, because we can roll back any failures and maintain transaction atomicity.

So you've opened up a gazillion questions that I can't ask, at least not on the air right now. We'll pick some favorites first.

Okay, we'll have to have you back a few dozen more times to actually unpack all that. But let me just ask you: if you built on Spark, the concurrency that you would expect from a data warehouse, Spark doesn't have that sort of workload management built in. How do you compensate for that, so that many people can be asking queries at the same time?

Actually, there are two types of workload. One is the batch processing kind of workload, the traditional data warehouse workload. So you can submit multiple queries, multiple PL/SQL stored procedures. We actually wrote a scheduler for Spark so that we can run these stored procedures in parallel. And then we also do what we call inter-SQL optimizations. We can optimize SQLs, optimize all those stored procedures, to find common expressions and eliminate duplicated computation. And then we can determine the dependencies between different SQL statements, so that we can schedule these SQL statements. This is basically what a compiler typically does.

So this is the core of an engine, not the query optimizer, but the other part of a database engine that says: tell me in what order I'm supposed to do the work, so that I can get a lot of work done at the same time?

Yes.

And you've gone and re-implemented that?

Yes, we wrote a compiler. It is a distributed compiler for PL/SQL, so we can schedule these SQLs.

And just to be clear, PL/SQL is the Oracle extension to SQL.

Right. We also support DB2's SQL PL. That's IBM's extension to SQL.

Okay.
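As a rough illustration of the statement-scheduling idea, not Transwarp's actual PL/SQL compiler, here is a toy Python sketch. It orders SQL statements by the tables they read and write, so that statements touching disjoint tables can be dispatched in parallel while dependent ones wait; the statement and table names are hypothetical.

```python
# Toy sketch of inter-statement scheduling: statements on disjoint tables
# can run concurrently, dependent ones are ordered. Illustrative only,
# not Transwarp's actual PL/SQL compiler.
from graphlib import TopologicalSorter

# Each statement declares the tables it reads and the table it writes.
statements = {
    "s1": {"reads": {"raw_sales"}, "writes": "sales_clean"},
    "s2": {"reads": {"raw_customers"}, "writes": "cust_clean"},  # independent of s1
    "s3": {"reads": {"sales_clean", "cust_clean"}, "writes": "report"},
}

# Statement B depends on statement A if B reads a table that A writes.
graph = {
    name: {
        other
        for other, o in statements.items()
        if other != name and o["writes"] in stmt["reads"]
    }
    for name, stmt in statements.items()
}

ts = TopologicalSorter(graph)
ts.prepare()
while ts.is_active():
    batch = list(ts.get_ready())   # everything in `batch` may run in parallel
    print("run in parallel:", batch)
    ts.done(*batch)                # mark finished, unblocking dependents
```

Running this prints `['s1', 's2']` as one parallel wave, then `['s3']` once its inputs exist, which is the kind of ordering decision the interview describes.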
There's a ton of interest in streaming analysis now, and you had touched on Internet of Things. Tell us what your product set looks like there, and then some of the use cases that customers are using it for now.

Yeah. So our stream product supports full SQL. That means you can use SQL to create a stream, to connect to Kafka streams, and do any calculations and aggregations. And you can also send alerts using PL/SQL on top of streaming. In this way, we can simplify streaming application development significantly. Today, if you are writing a streaming application in SQL, you only need a few lines of SQL statements. In the past, if you were using Spark Streaming or Storm, you had to write code against the framework's APIs, like hundreds of lines of code. But with SQL, you only need to write a few lines of SQL statements. So this can simplify the development.

And then the second feature of our stream product is that we are event-driven. Actually, our stream engine is built on top of Spark, our own version of the Spark core, but it is written in an event-driven fashion. You know, Spark Streaming is a mini-batch processing framework. That means you accumulate a few records and process them in a batch. But we are event-driven. We modified the communication mechanism inside Spark so that records can be sent to the next stage in real time.

That's another few million things I want to unpack. That's sending George to China next week. That's extremely... I mean, the creator of Spark, Matei Zaharia, who's gone back to academia, that's one of the things he's researching with his students: how to do a sort of lightweight core in Spark so that you don't always have to do these micro-batch elements, so...

We have done that.

Wow.

So the latency can be significantly reduced, to several milliseconds. You know, Spark Streaming's latency is typically 300 milliseconds.

That's phenomenal. Okay, so tell us, who are some of the early customers who are taking advantage of this?

You mean streaming?

Yeah.

So one of our typical customers is a wind power generator farm. You know, those generators are usually installed on mountains or at the seaside. The problem is that it's very difficult to maintain these generators. People do not want to go to the farm because it's far away from the city.

Right.

And the equipment is very expensive. So if there is any malfunction, you have to repair it very quickly. Today in China, there are lots of wind generators installed, so they have to build a maintenance system. We call it an intelligent maintenance system, so that we can collect all the data from the sensors on the generators. The generators will send the sensor data every second. One typical customer has 10 million sensors, and the sensors send back data every second. That means you have 10 million records per second. And we installed a Hadoop cluster, actually our streaming cluster, to receive all these events.

And you're looking for signals that would be predicting failure, a potential failure, on a windmill?

Yes, there are two use cases. The first is to detect any malfunction in real time, so that you can send alerts very quickly. The second is that we actually do some prediction based on the vibration of the generator, so that we can predict which part will...

Fail.

Will fail in the near future.

And then do you shut it down at that point? Either the generator or the windmill?

Yes, there is a remote control system, so you can control the generator remotely. You can shut it down, slow it down, or turn it off to minimize the damage.

Yes.
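For flavor, here is roughly what the few-lines-of-SQL experience looks like in open-source PySpark Structured Streaming. Note that stock Structured Streaming is micro-batch, unlike the event-driven engine described above, and the broker address, topic, and sensor schema here are hypothetical.

```python
# Sketch of SQL over a Kafka stream using open-source PySpark Structured
# Streaming (micro-batch, unlike the event-driven engine described above).
# Requires the spark-sql-kafka connector package; broker, topic, and field
# names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("sensor-alerts").getOrCreate()

schema = (StructType()
          .add("sensor_id", StringType())
          .add("vibration", DoubleType()))

# Connect to the Kafka stream and parse each record's JSON payload.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "turbine-sensors")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

events.createOrReplaceTempView("sensors")

# The "few lines of SQL": flag readings whose vibration exceeds a threshold.
alerts = spark.sql("SELECT sensor_id, vibration FROM sensors WHERE vibration > 0.8")

alerts.writeStream.format("console").outputMode("append").start().awaitTermination()
```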
And how much of the system is operating, kind of, on the mountain with the wind farm, and how much of that is back at a central cloud location? And how do you decide how much data and compute sits at each of those locations, or at points in between?

Yeah, so the distance is usually like 100 kilometers from the wind farm. The maintenance system is located in the city center, actually. The network latency is usually several milliseconds, and the data volume is huge. So we have a cluster of 100 nodes to receive the sensor data.

And that's in the cloud, or is that...?

On premise.

On premise, locally at the wind farm?

No, it's remote, in the city center.

100 kilometers away. Oh, 100 kilometers. That's where you've got a cluster of compute and storage, and then that connects back. And then is that the same one that drives back to the control systems?

Yes.

As well.

Yes.

What was the latency again?

The latency is like tens of milliseconds. But we only need to send alerts every second. Within one second is enough to take care of any problem.

Yes. And of the 10 million records per second, what type of analysis are you doing near the edge? And then what gets forwarded to the cloud, or does it all get forwarded?

So in this case, we didn't process the data at the edge. We just sent all the data back to the center. But other customers actually install a real-time database, called the PI database, on site.

The Raspberry Pi?

No, I think it's the OSIsoft PI database. It's a very traditional database, a real-time database. So they store the sensor data in the PI database, and then we have a receiver to copy the data to the center. So this is deployed in a distributed way: you have a localized database to store the sensor data.

I've got to ask one quick meta question. Most of the major Hadoop distro vendors here, their customers are struggling... well, they've pretty much gotten an ETL project off the ground, and they're just dipping their toes in the business intelligence kind of water. You've got IoT applications with, you know, single-digit-millisecond latency running. What's the secret sauce? How did you guys do this in three years?

So actually, it's driven by customers. You know, for those types of use cases, they pushed us to achieve that. Otherwise they would not buy our product. So we had to achieve it. It's the pressure from customers, actually. For the streaming part, I think it is doable because, you know, there are existing streaming frameworks like Flink, like Apex, and Storm. Storm is event-driven. Flink is also event-driven, but it allows you to develop batch processing logic on top of streaming events. So we actually borrowed ideas from Flink, but we thought it was doable on Spark, because we had built our products on Spark over the past three years, and it made more sense for us to modify Spark to adopt that programming model. And after several prototypes, we found that it worked.
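To illustrate the difference between micro-batch and event-driven hand-off described above, here is a toy Python sketch, not Spark internals: the event-driven stage forwards each record the moment it arrives, while the micro-batch stage holds records until the batch boundary, so per-record latency is dominated by the batch interval.

```python
# Toy contrast between micro-batch and event-driven hand-off between
# pipeline stages. Illustrative only; this is not how Spark is built.
import time
from queue import Queue, Empty
from threading import Thread

def produce(q, n=5):
    for i in range(n):
        q.put((f"reading-{i}", time.time()))
        time.sleep(0.05)
    q.put(None)  # end-of-stream marker

def event_driven(q):
    # Each record moves to the next stage the moment it arrives.
    while (rec := q.get()) is not None:
        print(f"event-driven latency: {time.time() - rec[1]:.3f}s")

def micro_batch(q, interval=0.3):
    # Records wait until the batch boundary before being processed.
    done, batch = False, []
    while not done:
        deadline = time.time() + interval
        while time.time() < deadline:
            try:
                rec = q.get(timeout=max(0.001, deadline - time.time()))
            except Empty:
                break
            if rec is None:
                done = True
                break
            batch.append(rec)
        for rec in batch:
            print(f"micro-batch latency:  {time.time() - rec[1]:.3f}s")
        batch.clear()

for stage in (event_driven, micro_batch):
    q = Queue()
    Thread(target=produce, args=(q,)).start()
    stage(q)
```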
But another part, another product from Transwarp, is the SQL engine. Actually, we spent three years developing the SQL compiler, and it has more than 5 million lines of code today. We had to do a lot of work to make it compatible with traditional databases. And our team members have backgrounds in compilers and operating systems; we have quite a lot of C++ compiler experts in our company. It's actually easier to write a SQL compiler than to develop a distributed C++ compiler, so it was easier for us to develop that SQL compiler.

Wow. Well, Juanjo, it sounds like you guys are doing all the right things in the right space. Obviously, the IoT angle is huge, and renewable energy is growing super, super fast. So I'm sure we're going to hear more and more about Transwarp without having to go all the way to China, but I think we're going to send George over to spend some time with you and the team. So thank you for taking a few minutes out of your busy day.

Thank you.

Thanks, Juanjo.

All right, I'm Jeff Frick, you're watching theCUBE. We are live in Manhattan at Big Data NYC. We'll be back with our next segment after this short break. Thanks for watching.