Live from San Jose, California, it's theCUBE, covering Big Data Silicon Valley 2017.

Okay, welcome back everyone. We're live here in Silicon Valley; San Jose is Big Data Silicon Valley in conjunction with Strata + Hadoop World. This is theCUBE's exclusive coverage over the next two days. We've got wall-to-wall interviews with thought leaders and experts breaking down the future of big data, the future of analytics, the future of the cloud. I'm John Furrier with my co-host George Gilbert of Wikibon. Our next guest is Yuanhao Sun, the co-founder and CTO of Transwarp Technologies. Welcome to theCUBE. You were on theCUBE 166 days ago, I noticed, but now you've got some news. So let's get the news out of the way. What are you guys announcing here this week?

Yes, we are announcing 5.0, the latest version of Transwarp Data Hub. We would call this a revolutionary release. The first thing is that we embedded Kubernetes in our product, so we allow people to isolate different kinds of workloads using Docker containers, and we also provide a scheduler to better support mixed workloads. The second is that we are building a set of tools that allow people to build their data warehouse and migrate from an existing, traditional data warehouse to Hadoop. We are also giving people the capability to build a data mart; it allows you to query the data interactively. So we built a column store, in memory and on SSD, and we completely rewrote the whole SQL engine. It is a very lightweight SQL engine that allows people to query data very quickly; today it is about five to ten times faster than Spark 2.0. We also allow people to build cubes on top of Hadoop, and once the cube is built, the SQL performance — the TPC-H performance, for example — is about 100 times faster than an existing database or Spark 2.0. So it's super fast. Actually, we have a pilot customer that replaced Teradata with our software to build a data mart, and we have already migrated 700 reports from Teradata to our product, so the performance is very good. And the third is that we are providing a tool for people to build machine learning pipelines. We are leveraging TensorFlow, MXNet, and also Spark, so people can visualize the pipeline and build data mining workflows. It's a kind of data science tool, and it's very easy for people to use.

Okay — so that's great, you have the performance there, that's the news out of the way. Take a minute to explain your value proposition and when people engage you as a customer.

Yeah. When people choose our product, the major reason is our compatibility with Oracle, DB2, and Teradata SQL syntax, because they have built a lot of applications on top of those databases. When they migrate to Hadoop, they don't want to rewrite the whole program, so our SQL compatibility is a big advantage to them. That's the first one. We also support full ACID and distributed transactions on top of Hadoop, so a lot of applications can be migrated to our product with few modifications or without any changes. That's the first advantage. The second is that we provide an event-based streaming engine that is actually derived from Spark, and we apply this technology to IoT applications. IoT applications need very low latency, but they also need very complicated models on top of streams. That's why we provide full SQL support and machine learning support on top of streaming events, and we also use event-driven technology to reduce the latency to five to ten milliseconds. That's the second reason people choose our product.
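To make the container-based isolation he describes concrete, here is a minimal sketch using the official Kubernetes Python client: a SQL workload is pinned to explicit CPU and memory requests and limits so it cannot starve neighboring jobs. The image name, pod name, and resource figures are hypothetical; this illustrates the general Kubernetes mechanism, not Transwarp's actual tooling.

```python
# Minimal sketch: isolating one workload with Kubernetes resource limits.
# Requires the official client: pip install kubernetes
from kubernetes import client, config

config.load_kube_config()  # use the local kubeconfig for cluster access

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="sql-engine-batch",           # hypothetical pod name
        labels={"workload": "batch-sql"},  # label lets a scheduler group workloads
    ),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="sql-engine",
                image="example/sql-engine:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "2", "memory": "8Gi"},  # guaranteed share
                    limits={"cpu": "4", "memory": "16Gi"},   # hard ceiling
                ),
            )
        ],
        restart_policy="Never",
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

With requests and limits set per container, a mixed-workload scheduler can pack batch SQL, streaming, and machine learning jobs onto the same cluster without one class of work interfering with another.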
And today we are announcing 5.0, and I think people will find more reasons to choose our product.

So you have the compatibility with SQL, you have the tooling, and now you have the performance — kind of the triple threat there. So what are customers saying? When you go out and talk with your customers, what's their view of the current landscape? What are they solving right now? What are the key challenges and pain points that customers have today?

We have customers in more than 12 vertical segments, and in different verticals they have different pain points. Take financial services as one example: the main pain point there is migrating existing legacy applications to Hadoop. They have accumulated lots of data, and performance is very bad on a legacy database, so they need high-performance Hadoop and Spark to speed up workloads like reporting. In another vertical — logistics, transportation, and IoT — the pain point is finding a very low-latency streaming engine while at the same time having a complicated enough programming model to write their applications. Another example is the public sector, where they need a very complicated, large-scale search engine, and they need to build analytic capability on top of that search engine, so they can search and analyze the results at the same time.

You know, whenever we interview you on theCUBE you toss out these gems — sort of like diamonds, big rocks that over millions of years of incredible pressure have been squeezed down into incredibly valuable minerals with lots of goodness in them. So I need you to unpack that diamond into something more accessible. You've done something that none of the Hadoop distro guys have managed to do, which is to build databases that are not just decision support but can handle OLTP — operational applications. You've done the streaming. You've done what even Databricks can't do, which is getting the streaming down to an event at a time. Let's step back from all these amazing things: tell us, what was the secret sauce that lets you build a platform that's this advanced?

So actually we are driven by our customers, and we do see the trends — people are looking for better solutions. There is a lot of pain in setting up a Hadoop cluster and using Hadoop technology. That's why we found it very meaningful, and also very necessary, to build a SQL database on top of Hadoop. And quite a lot of customers in FSI asked us to provide ACID and distributed transaction capability on top of Hadoop, because they have to guarantee the consistency of their data; otherwise they cannot use the technology.
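The consistency requirement he describes — batch modifications that must either fully apply or not at all — is ordinary ACID behavior. A generic sketch below uses SQLite purely because it ships with Python; the customer table and the update rows are made up, and this is not Transwarp's engine or API.

```python
# Generic sketch of why batch loads into a warehouse want transactions:
# either every row of the nightly update lands, or none do.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Shanghai'), (2, 'Beijing')")
conn.commit()

updates = [(1, "San Jose"), (2, "Palo Alto")]  # hypothetical daily changes

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        for cust_id, city in updates:
            conn.execute(
                "UPDATE dim_customer SET city = ? WHERE id = ?", (city, cust_id)
            )
except sqlite3.Error:
    # The rollback already happened; readers never see a half-applied batch.
    raise

print(conn.execute("SELECT * FROM dim_customer").fetchall())
```

The same guarantee, stretched across a distributed Hadoop cluster rather than a single file, is what his FSI customers were asking for.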
At the risk of interrupting — others have built analytic databases on top of Hadoop to give the familiar SQL access, and obviously there's a desire also to have transactions next to that, so you can inform a transactional decision with the analytics. One of the questions is, how did you combine the two capabilities? I mean, it only took Oracle like 40 years.

Right — so actually our transaction capability today is only for analytics. This capability is not for short-transaction OLTP applications; it's for data warehouse kinds of workloads.

Okay, so when you're ingesting?

Yes — when you're ingesting, when you modify your data in batch, you have to guarantee consistency. That's the transactional capability we ship today. But we are also building another distributed storage layer and a distributed database that will provide true OLTP capability, which means you can run concurrent transactions on that database. We are still developing that software right now. Today, our product provides distributed transaction capability for people to build a data warehouse. Quite a lot of people believe a data warehouse does not need transaction capability, but we found that a lot of people modify the data in their warehouse. They load data continuously, and things like the dimension tables — customer information — can change over time, so every day people need to update or change the data. That's why we have to provide transaction capability in the data warehouse.

Okay. And then tell us about the streaming problem too — we're told that roughly two thirds of Spark deployments use streaming as a workload, and the biggest knock on Spark is that it can't process one event at a time; you've got to do a little batch of them. Tell us some of the use cases that can take advantage of doing one event at a time, and how you solved that problem.

Yeah, so the first use case we encountered is anti-fraud, or fraud detection, in FSI. Whenever you swipe your credit card, the bank needs to tell you whether the transaction is fraudulent or not within a few milliseconds. If you are using Spark Streaming, it will usually take 500 milliseconds, and that latency is too high for this kind of application. That's why we had to provide event-at-a-time — that means event-driven — processing to detect fraud, so we can interrupt the transaction within a few milliseconds. That's one kind of application. The other kind of requirement comes from IoT applications. We already put our streaming framework into a large manufacturing plant. They have to detect equipment malfunctions in a very short time; otherwise the equipment may explode. If you are using Spark Streaming, just submitting your application takes hundreds of milliseconds, and by the time you finish the detection it usually takes a few seconds. That is too long for this kind of application, and that's why we need a low-latency streaming engine.

But it would be okay to use Storm or Flink, right?

The problem we found is that they need a very complicated programming model. These customers are going to solve equations on the streaming events, they need to do FFT transformations, and they are also asking to run linear regression or neural networks on top of the events. That's why we provide a SQL interface, and we are also embedding CEP capability into our streaming engine, so you can use patterns to match the events and send out alerts.

So SQL to get a set of events, and maybe join some, and then complex event processing — CEP — to say, does this fit a pattern I'm looking for?

Yes.

Okay.
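As a toy illustration of the event-at-a-time, pattern-matching style he describes — deliberately not Transwarp's engine or API — the sketch below handles each event the moment it arrives, keeps a sliding window per card, and fires an alert when a simple, hypothetical fraud pattern matches, instead of waiting for a micro-batch to fill up.

```python
# Toy event-at-a-time fraud check: each swipe is processed on arrival,
# so an alert can fire within the same call, not after a batch interval.
import time
from collections import defaultdict, deque

WINDOW_SECS = 10           # hypothetical pattern: too many swipes in a window
MAX_SWIPES_IN_WINDOW = 3

recent = defaultdict(deque)  # card_id -> timestamps of recent swipes

def on_event(card_id: str, ts: float) -> bool:
    """Called once per event; returns True if the fraud pattern matched."""
    window = recent[card_id]
    window.append(ts)
    while window and ts - window[0] > WINDOW_SECS:
        window.popleft()  # expire swipes that fell out of the window
    return len(window) > MAX_SWIPES_IN_WINDOW

# Usage: feed events one at a time, interrupt the transaction on a match.
for _ in range(5):
    if on_event("card-42", time.time()):
        print("possible fraud: interrupt the transaction")
```

A micro-batch engine would only evaluate this pattern at batch boundaries; evaluating it inside `on_event` is what brings the latency down from hundreds of milliseconds to a few.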
And then with the lightweight OLTP — that, or any other new projects you're looking at — tell us perhaps the new use cases they'd be appropriate for.

Yes, that's our future product, actually. We are going to solve large-scale OLTP transaction problems. You know, in China the population is so large that in the public sector, or in banks, they need to build highly scalable transaction systems that can support very high concurrent transaction volumes. That's why we are building this kind of technology. In the past, people just divided transactions across multiple databases — multiple Oracle instances or multiple MySQL instances. The problem is that if the application is simple, you can very easily divide transactions over multiple database instances. But if the application is very complicated — especially when the ISV already wrote the application against Oracle or a traditional database — it already depends on the transaction semantics. That's why we have to build the same kind of transaction system, so we can support their legacy applications while scaling to hundreds of nodes and millions of transactions per second.

On the transactional stuff?

Yes.

Just correct me if I'm wrong — I know we're running out of time — but I thought Oracle only scales out when you're doing decision support work, not when you're doing OLTP; that for OLTP it can maybe stretch to 10 nodes or something like that. Am I mistaken on that?

They can scale to 16 to 32 nodes.

For transaction work, for Oracle?

Yes, but that's the theoretical limit. Systems like Google F1 and Google Spanner can scale to hundreds of nodes, but the latency is higher than Oracle, because you have to use a distributed protocol to coordinate multiple nodes, so the latency is higher.

On Google? The latency is higher on Google? Because it has to go all the way to Europe? Oracle or Google?

Google.

Google, okay.

Because if you are using the two-phase commit protocol, you have to talk to multiple nodes — broadcast your request to multiple nodes and then wait for the responses. That means you have much higher latency, but it's necessary to maintain consistency. So in a distributed OLTP database the latency is usually higher, but the concurrency is also much higher and the scalability is much better. That's the trade-off.

You've stretched beyond what Oracle's done?

Yes, because these customers can tolerate the higher latency, but they need to scale to millions of transactions per second. That's why we have to build a distributed database.
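A compressed sketch of the two-phase commit round trips he's referring to: the coordinator must collect a "yes" vote from every participant before any of them may commit, which is exactly where the extra latency comes from. Participants here are plain in-process objects, purely for illustration.

```python
# Toy two-phase commit: one round trip to collect votes (prepare) and
# another to finish (commit/abort). Over a real network each phase adds
# latency, which is the price Spanner-style systems pay for consistency
# across hundreds of nodes.

class Participant:
    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.state = "idle"

    def prepare(self) -> bool:
        # Vote yes only if the local transaction can durably commit.
        self.state = "prepared" if self.healthy else "aborted"
        return self.healthy

    def finish(self, commit: bool) -> None:
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants) -> bool:
    votes = [p.prepare() for p in participants]  # phase 1: broadcast + wait
    decision = all(votes)                        # unanimous yes required
    for p in participants:                       # phase 2: broadcast outcome
        p.finish(decision)
    return decision

nodes = [Participant("shard-a"), Participant("shard-b"), Participant("shard-c")]
print("committed" if two_phase_commit(nodes) else "aborted")
```

Each phase is a full round trip to every shard, so per-transaction latency grows with coordination cost even as total throughput scales out.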
Okay, for this reason we're going to have to have you back for maybe five or ten consecutive segments, maybe starting tomorrow. We'll have to get you back for sure. Final question for you: what are you excited about in the technology landscape? You look at open source, you work with Spark, you mentioned Kubernetes, you have microservices, all the cloud. What are you most excited about right now in terms of new technology that's going to help simplify and scale with low latency — the databases, the software? Because you've got IoT, you've got autonomous vehicles, you have all this data. What are you excited about?

Yeah, so actually we have already solved a lot of these problems, but I think the most exciting thing is that there are actually two trends. The first trend is that it's very exciting to see more computation frameworks coming out, like the AI frameworks — TensorFlow, MXNet, Torch — tens of machine learning frameworks like these are coming out. They are solving different kinds of problems, like facial recognition from video and images, or human-computer interaction using voice — using audio. That's very exciting, I think. And the second is that we found it very exciting to combine these technologies together. That's why we are using Kubernetes — we didn't use YARN, because it cannot support TensorFlow or other frameworks, but if you are using containers and you have a good scheduler, you can schedule any kind of computation framework. So we found it very interesting to have these new frameworks and to combine them together to solve different kinds of problems.

Yeah, thank you so much for coming on theCUBE. It's an operating-system world we're living in now. It's a great time to be a technologist. Certainly the opportunities are out there, and we're breaking it down here inside theCUBE, live in Silicon Valley, with the best tech executives, thought leaders, and experts. I'm John Furrier with George Gilbert. We'll be right back with more after this short break.